mirror of https://github.com/donnemartin/data-science-ipython-notebooks.git
synced 2024-03-22 13:30:56 +08:00
Added snippets to cache RDDs in Spark.
parent 404676a1f7
commit 0481497848
@@ -1,7 +1,7 @@
 {
  "metadata": {
  "name": "",
- "signature": "sha256:ba2a57f8daa28f0934b9121adbb394f85c2a7b6b1c77196c1982e814079af90f"
+ "signature": "sha256:b82a9fa9d896b3d5d681d6fee5ca75e340c9cd86d6292aa63c6f6bff1fb5ea20"
  },
 "nbformat": 3,
 "nbformat_minor": 0,
@@ -19,7 +19,8 @@
 "* Pair RDDs\n",
 "* Running Spark on a Cluster\n",
 "* Viewing the Spark Application UI\n",
-"* Working with Partitions"
+"* Working with Partitions\n",
+"* Caching RDDs\n"
 ]
 },
 {
@@ -557,9 +558,57 @@
 " if \".txt\" in line: txt_count += 1\n",
 " yield (txt_count)\n",
 "\n",
-"sc.textFile(\"file:/path/*\") \\\n",
+"my_data = sc.textFile(\"file:/path/*\") \\\n",
 " .mapPartitions(count_txt) \\\n",
-" .collect()"
+" .collect()\n",
+" \n",
+"# Show the partitioning \n",
+"print \"Data partitions: \", my_data.toDebugString()"
+],
+"language": "python",
+"metadata": {},
+"outputs": []
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Caching RDDs"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Caching an RDD saves the data in memory. Caching is a suggestion to Spark as it is memory dependent.\n",
+"\n",
+"By default, every RDD operation executes the entire lineage. Caching can boost performance for datasets that are likely to be used by saving this expensive recomputation and is ideal for iterative algorithms or machine learning.\n",
+"\n",
+"* cache() stores data in memory\n",
+"* persist() stores data in MEMORY_ONLY, MEMORY_AND_DISK (spill to disk), and DISK_ONLY\n",
+"\n",
+"Disk memory is stored on the node, not on HDFS.\n",
+"\n",
+"Replication is possible by using MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. If a cached partition becomes unavailable, Spark recomputes the partition through the lineage.\n",
+"\n",
+"Serialization is possible with MEMORY_ONLY_SER and MEMORY_AND_DISK_SER. This is more space efficient but less time efficient, as it uses Java serialization by default."
+]
+},
+{
+"cell_type": "code",
+"collapsed": false,
+"input": [
+"# Cache RDD to memory\n",
+"my_data.cache()\n",
+"\n",
+"# Persist RDD to both memory and disk (if memory is not enough), with replication of 2\n",
+"my_data.persist(MEMORY_AND_DISK_2)\n",
+"\n",
+"# Unpersist RDD, removing it from memory and disk\n",
+"my_data.unpersist()\n",
+"\n",
+"# Change the persistence level after unpersist\n",
+"my_data.persist(MEMORY_AND_DISK)"
 ],
 "language": "python",
 "metadata": {},
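The `count_txt` snippet touched in the last hunk relies on `mapPartitions`: the function receives an iterator over one partition's elements and yields one count per partition. A minimal Spark-free sketch of that pattern, using illustrative in-memory "partitions" (the `partitions` data and `map_partitions` helper below are stand-ins, not Spark API):

```python
def count_txt(partition):
    """Yield the number of lines in this partition containing '.txt'."""
    txt_count = 0
    for line in partition:
        if ".txt" in line:
            txt_count += 1
    yield txt_count

def map_partitions(partitions, func):
    """Apply func to each partition's iterator, like RDD.mapPartitions."""
    results = []
    for part in partitions:
        results.extend(func(iter(part)))
    return results

partitions = [
    ["a.txt", "b.csv"],          # partition 0: one .txt line
    ["c.txt", "d.txt", "e.md"],  # partition 1: two .txt lines
]
print(map_partitions(partitions, count_txt))  # [1, 2]
```

Note that in the committed snippet `my_data` is the result of `collect()`, a plain Python list; `toDebugString()` is an RDD method, so it would normally be called on the RDD before `collect()`.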
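The caching cells added by this commit hinge on one idea: without a cache, every action re-executes the entire lineage. A minimal Spark-free sketch of that behavior, where the `Lineage` class is a hypothetical stand-in for an RDD, not Spark API (in real PySpark the storage levels are constants on `pyspark.StorageLevel`, e.g. `rdd.persist(StorageLevel.MEMORY_AND_DISK_2)`):

```python
class Lineage:
    """Toy lazy pipeline: recomputes on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute   # the full chain of transformations
        self._cached = None
        self.computations = 0     # how many times the lineage ran

    def cache(self):
        # Materialize once and keep the result in memory
        self._cached = list(self.collect())
        return self

    def unpersist(self):
        # Drop the cached copy; the next action recomputes the lineage
        self._cached = None
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        return list(self._compute())

data = Lineage(lambda: (x * x for x in range(5)))
data.collect(); data.collect()
print(data.computations)  # 2 -- each action re-ran the lineage

data.cache()              # runs the lineage one final time
data.collect(); data.collect()
print(data.computations)  # 3 -- later actions hit the cache
```

As in Spark, dropping the cached copy with `unpersist()` makes the next action pay the recomputation cost again.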