Add more Spark DataFrame examples

This commit is contained in:
Donne Martin 2016-02-21 06:16:34 -05:00
parent e4e1284a15
commit 34889ce7c8

View File

@ -103,9 +103,174 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"From the following [reference](https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html):\n",
"\n",
"A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataFrame from JSON files on S3:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"users = context.load(\"s3n://path/to/users.json\", \"json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new DataFrame that contains “young users” only:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"young = users.filter(users.age<21)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, using Pandas-like syntax:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"young = users[users.age<21]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Increment everybodys age by 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"young.select(young.name, young.age+1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of young users by gender:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"young.groupBy(\"gender\").count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Join young users with another DataFrame called logs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"young.join(logs, logs.userId == users.userId, \"left_outer\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of users in the young DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"young.registerTempTable(\"young\")\n",
"context.sql(\"SELECT count(*) FROM young\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert Spark DataFrame to Pandas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pandas_df = young.toPandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a Spark DataFrame from Pandas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"spark_df = context.createDataFrame(pandas_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -129,7 +294,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a dataframe based on the content of a file:"
"Create a DataFrame based on the content of a file:"
]
},
{
@ -212,7 +377,7 @@
},
"outputs": [],
"source": [
"df.filter(df.column_name > 10)"
"df.filter(df.column_name>10)"
]
},
{
@ -284,7 +449,7 @@
},
"outputs": [],
"source": [
"rdd_from_df = sqlContext.sql(\"SELECT * FROM dataframe_name\") #the result is a RDD"
"rdd_from_df = sqlContext.sql(\"SELECT * FROM dataframe_name\")"
]
},
{