mirror of https://github.com/donnemartin/data-science-ipython-notebooks.git
synced 2024-03-22 13:30:56 +08:00
Added snippets to cache RDDs in Spark.
parent 404676a1f7
commit 0481497848
@@ -1,7 +1,7 @@
 {
  "metadata": {
  "name": "",
- "signature": "sha256:ba2a57f8daa28f0934b9121adbb394f85c2a7b6b1c77196c1982e814079af90f"
+ "signature": "sha256:b82a9fa9d896b3d5d681d6fee5ca75e340c9cd86d6292aa63c6f6bff1fb5ea20"
  },
 "nbformat": 3,
 "nbformat_minor": 0,
@@ -19,7 +19,8 @@
 "* Pair RDDs\n",
 "* Running Spark on a Cluster\n",
 "* Viewing the Spark Application UI\n",
-"* Working with Partitions"
+"* Working with Partitions\n",
+"* Caching RDDs\n"
 ]
 },
 {
@@ -557,9 +558,57 @@
 " if \".txt\" in line: txt_count += 1\n",
 " yield (txt_count)\n",
 "\n",
-"sc.textFile(\"file:/path/*\") \\\n",
+"my_data = sc.textFile(\"file:/path/*\") \\\n",
 " .mapPartitions(count_txt) \\\n",
-" .collect()"
+" .collect()\n",
+" \n",
+"# Show the partitioning \n",
+"print \"Data partitions: \", my_data.toDebugString()"
+],
+"language": "python",
+"metadata": {},
+"outputs": []
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Caching RDDs"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Caching an RDD saves the data in memory. Caching is a suggestion to Spark as it is memory dependent.\n",
+"\n",
+"By default, every RDD operation executes the entire lineage. Caching can boost performance for datasets that are likely to be used by saving this expensive recomputation and is ideal for iterative algorithms or machine learning.\n",
+"\n",
+"* cache() stores data in memory\n",
+"* persist() stores data in MEMORY_ONLY, MEMORY_AND_DISK (spill to disk), and DISK_ONLY\n",
+"\n",
+"Disk memory is stored on the node, not on HDFS.\n",
+"\n",
+"Replication is possible by using MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. If a cached partition becomes unavailable, Spark recomputes the partition through the lineage.\n",
+"\n",
+"Serialization is possible with MEMORY_ONLY_SER and MEMORY_AND_DISK_SER. This is more space efficient but less time efficient, as it uses Java serialization by default."
+]
+},
+{
+"cell_type": "code",
+"collapsed": false,
+"input": [
+"# Cache RDD to memory\n",
+"my_data.cache()\n",
+"\n",
+"# Persist RDD to both memory and disk (if memory is not enough), with replication of 2\n",
+"my_data.persist(MEMORY_AND_DISK_2)\n",
+"\n",
+"# Unpersist RDD, removing it from memory and disk\n",
+"my_data.unpersist()\n",
+"\n",
+"# Change the persistence level after unpersist\n",
+"my_data.persist(MEMORY_AND_DISK)"
 ],
 "language": "python",
 "metadata": {},
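The `count_txt` snippet touched in the last hunk relies on `mapPartitions`: the function receives an iterator over one partition's elements and yields one count per partition. A minimal Spark-free sketch of that pattern, using illustrative in-memory "partitions" (the `partitions` data and `map_partitions` helper below are stand-ins, not Spark API):

```python
def count_txt(partition):
    """Yield the number of lines in this partition containing '.txt'."""
    txt_count = 0
    for line in partition:
        if ".txt" in line:
            txt_count += 1
    yield txt_count

def map_partitions(partitions, func):
    """Apply func to each partition's iterator, like RDD.mapPartitions."""
    results = []
    for part in partitions:
        results.extend(func(iter(part)))
    return results

partitions = [
    ["a.txt", "b.csv"],          # partition 0: one .txt line
    ["c.txt", "d.txt", "e.md"],  # partition 1: two .txt lines
]
print(map_partitions(partitions, count_txt))  # [1, 2]
```

Note that in the committed snippet `my_data` is the result of `collect()`, a plain Python list; `toDebugString()` is an RDD method, so it would normally be called on the RDD before `collect()`.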
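The caching cells added by this commit hinge on one idea: without a cache, every action re-executes the entire lineage. A minimal Spark-free sketch of that behavior, where the `Lineage` class is a hypothetical stand-in for an RDD, not Spark API (in real PySpark the storage levels are constants on `pyspark.StorageLevel`, e.g. `rdd.persist(StorageLevel.MEMORY_AND_DISK_2)`):

```python
class Lineage:
    """Toy lazy pipeline: recomputes on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute   # the full chain of transformations
        self._cached = None
        self.computations = 0     # how many times the lineage ran

    def cache(self):
        # Materialize once and keep the result in memory
        self._cached = list(self.collect())
        return self

    def unpersist(self):
        # Drop the cached copy; the next action recomputes the lineage
        self._cached = None
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        return list(self._compute())

data = Lineage(lambda: (x * x for x in range(5)))
data.collect(); data.collect()
print(data.computations)  # 2 -- each action re-ran the lineage

data.cache()              # runs the lineage one final time
data.collect(); data.collect()
print(data.computations)  # 3 -- later actions hit the cache
```

As in Spark, dropping the cached copy with `unpersist()` makes the next action pay the recomputation cost again.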