Add more Spark DataFrame examples

2024-03-22 13:30:56 +08:00 · 2016-02-21 06:16:34 -05:00 · 2016-02-21 06:16:34 -05:00 · 34889ce7c8
commit 34889ce7c8
parent e4e1284a15
1 changed files with 168 additions and 3 deletions
--- a/spark/spark.ipynb
+++ b/spark/spark.ipynb
@ -103,9 +103,174 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
+    "From the following [reference](https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html):\n",
+    "\n",
    "A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. "
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create a DataFrame from JSON files on S3:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "users = context.load(\"s3n://path/to/users.json\", \"json\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create a new DataFrame that contains “young users” only:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "young = users.filter(users.age<21)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Alternatively, using Pandas-like syntax:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "young = users[users.age<21]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Increment everybody’s age by 1:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "young.select(young.name, young.age+1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Count the number of young users by gender:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "young.groupBy(\"gender\").count()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Join young users with another DataFrame called logs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "young.join(logs, logs.userId == users.userId, \"left_outer\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Count the number of users in the young DataFrame:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "young.registerTempTable(\"young\")\n",
+    "context.sql(\"SELECT count(*) FROM young\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Convert Spark DataFrame to Pandas:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "pandas_df = young.toPandas()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create a Spark DataFrame from Pandas:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "spark_df = context.createDataFrame(pandas_df)"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@ -129,7 +294,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Create a dataframe based on the content of a file:"
+    "Create a DataFrame based on the content of a file:"
   ]
  },
  {
@ -212,7 +377,7 @@
   },
   "outputs": [],
   "source": [
-    "df.filter(df.column_name > 10)"
+    "df.filter(df.column_name>10)"
   ]
  },
  {
@ -284,7 +449,7 @@
   },
   "outputs": [],
   "source": [
-    "rdd_from_df = sqlContext.sql(\"SELECT * FROM dataframe_name\") #the result is a RDD"
+    "rdd_from_df = sqlContext.sql(\"SELECT * FROM dataframe_name\")"
   ]
  },
  {