Move DataFrames before RDDs

Donne Martin 2016-02-21 06:00:40 -05:00
parent b15edb7585
commit e4e1284a15

@@ -15,6 +15,7 @@
"\n", "\n",
"* IPython Notebook Setup\n", "* IPython Notebook Setup\n",
"* Python Shell\n", "* Python Shell\n",
"* DataFrames\n",
"* RDDs\n", "* RDDs\n",
"* Pair RDDs\n", "* Pair RDDs\n",
"* Running Spark on a Cluster\n", "* Running Spark on a Cluster\n",
@@ -91,6 +92,201 @@
"sc" "sc"
] ]
}, },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood from Spark SQL's query optimizer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given the SparkContext `sc`, create a SQLContext:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pyspark.sql import SQLContext\n",
"sqlContext = SQLContext(sc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataFrame from the contents of a JSON file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = sqlContext.read.json(\"file:/path/file.json\")  # jsonFile() is deprecated since Spark 1.4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the content of the DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the schema:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select a column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.select(\"column_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataFrame with rows matching a given filter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.filter(df.column_name > 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Group the rows by a column and count the rows in each group:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupBy(\"column_name\").count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert an RDD to a DataFrame (by inferring the schema):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = sqlContext.createDataFrame(my_data)  # inferSchema() is deprecated since Spark 1.3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Register the DataFrame as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.registerTempTable(\"dataframe_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run a SQL query against a DataFrame registered as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"result_df = sqlContext.sql(\"SELECT * FROM dataframe_name\")  # the result is a DataFrame (Spark 1.3+)"
]
},
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
@@ -404,201 +600,6 @@
" print user_id, count, user_info" " print user_id, count, user_info"
] ]
}, },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood from Spark SQL's query optimizer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given the SparkContext `sc`, create a SQLContext:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pyspark.sql import SQLContext\n",
"sqlContext = SQLContext(sc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataFrame from the contents of a JSON file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = sqlContext.read.json(\"file:/path/file.json\")  # jsonFile() is deprecated since Spark 1.4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the content of the DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the schema:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select a column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.select(\"column_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataFrame with rows matching a given filter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.filter(df.column_name > 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Group the rows by a column and count the rows in each group:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupBy(\"column_name\").count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert an RDD to a DataFrame (by inferring the schema):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = sqlContext.createDataFrame(my_data)  # inferSchema() is deprecated since Spark 1.3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Register the DataFrame as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.registerTempTable(\"dataframe_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run a SQL query against a DataFrame registered as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"result_df = sqlContext.sql(\"SELECT * FROM dataframe_name\")  # the result is a DataFrame (Spark 1.3+)"
]
},
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
@@ -1497,21 +1498,21 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 2", "display_name": "Python 3",
"language": "python", "language": "python",
"name": "python2" "name": "python3"
}, },
"language_info": { "language_info": {
"codemirror_mode": { "codemirror_mode": {
"name": "ipython", "name": "ipython",
"version": 2 "version": 3
}, },
"file_extension": ".py", "file_extension": ".py",
"mimetype": "text/x-python", "mimetype": "text/x-python",
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython2", "pygments_lexer": "ipython3",
"version": "2.7.10" "version": "3.4.3"
} }
}, },
"nbformat": 4, "nbformat": 4,