Move DataFrames before RDDs

This commit is contained in:
Donne Martin 2016-02-21 06:00:40 -05:00
parent b15edb7585
commit e4e1284a15

View File

@ -15,6 +15,7 @@
"\n",
"* IPython Notebook Setup\n",
"* Python Shell\n",
"* DataFrames\n",
"* RDDs\n",
"* Pair RDDs\n",
"* Running Spark on a Cluster\n",
@ -91,6 +92,201 @@
"sc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given the Spark Context, create a SQLContext:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pyspark.sql import SQLContext\n",
"sqlContext = SQLContext(sc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a dataframe based on the content of a file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = sqlContext.jsonFile(\"file:/path/file.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the content of the DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the schema:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select a column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.select(\"column_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataFrame with rows matching a given filter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.filter(df.column_name > 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aggregate the results and count:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupBy(\"column_name\").count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert a RDD to a DataFrame (by inferring the schema):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = sqlContext.inferSchema(my_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Register the DataFrame as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.registerTempTable(\"dataframe_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run a SQL Query on a DataFrame registered as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"rdd_from_df = sqlContext.sql(\"SELECT * FROM dataframe_name\") #the result is a RDD"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -404,201 +600,6 @@
" print user_id, count, user_info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given the Spark Context, create a SQLContext:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pyspark.sql import SQLContext\n",
"sqlContext = SQLContext(sc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a dataframe based on the content of a file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = sqlContext.jsonFile(\"file:/path/file.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the content of the DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the schema:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select a column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.select(\"column_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataFrame with rows matching a given filter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.filter(df.column_name > 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aggregate the results and count:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupBy(\"column_name\").count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert a RDD to a DataFrame (by inferring the schema):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = sqlContext.inferSchema(my_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Register the DataFrame as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.registerTempTable(\"dataframe_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run a SQL Query on a DataFrame registered as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"rdd_from_df = sqlContext.sql(\"SELECT * FROM dataframe_name\") #the result is a RDD"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1497,21 +1498,21 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,