data-science-ipython-notebooks/commands/aws.ipynb

{
 "metadata": {
  "name": "",
  "signature": "sha256:44b8b97435ef131ae163887a097426e7d9818e0394168d960e892917df05417c"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# AWS Command Lines"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Connect to EC2"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Connect to an Ubuntu EC2 instance through SSH with the given key:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ssh -i key.pem ubuntu@ipaddress"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Connect to an Amazon Linux EC2 instance through SSH with the given key:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ssh -i key.pem ec2-user@ipaddress"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Setup S3cmd\n",
      "\n",
      "Before I discovered [S3cmd](http://s3tools.org/s3cmd), I had been using the [S3 console](http://aws.amazon.com/console/) to do basic operations and [boto](https://boto.readthedocs.org/en/latest/) to do more of the heavy lifting.  However, sometimes I just want to hack away at a command line to do my work.\n",
      "\n",
      "I've found S3cmd to be a great command line tool for interacting with S3 on AWS.  S3cmd is written in Python, is open source, and is free even for commercial use.  It offers more advanced features than those found in the [AWS CLI](http://aws.amazon.com/cli/)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Install s3cmd:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sudo apt-get install s3cmd"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Running the following command will prompt you to enter your AWS access and AWS secret keys. To follow security best practices, make sure you are using an IAM account as opposed to using the root account.\n",
      "\n",
      "I also suggest enabling GPG encryption which will encrypt your data at rest, and enabling HTTPS to encrypt your data in transit.  Note this might impact performance."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s3cmd --configure"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Frequently used S3cmds:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# List all buckets\n",
      "s3cmd ls\n",
      "\n",
      "# List the contents of the bucket\n",
      "s3cmd ls s3://my-bucket-name\n",
      "\n",
      "# Upload a file into the bucket (private)\n",
      "s3cmd put myfile.txt s3://my-bucket-name/myfile.txt\n",
      "\n",
      "# Upload a file into the bucket (public)\n",
      "s3cmd put --acl-public --guess-mime-type myfile.txt s3://my-bucket-name/myfile.txt\n",
      "\n",
      "# Recursively upload a directory to s3\n",
      "s3cmd put --recursive my-local-folder-path/ s3://my-bucket-name/mydir/\n",
      "\n",
      "# Download a file\n",
      "s3cmd get s3://my-bucket-name/myfile.txt myfile.txt\n",
      "\n",
      "# Recursively download files that start with myfile\n",
      "s3cmd --recursive get s3://my-bucket-name/myfile\n",
      "\n",
      "# Delete a file\n",
      "s3cmd del s3://my-bucket-name/myfile.txt\n",
      "\n",
      "# Delete a bucket\n",
      "s3cmd del --recursive s3://my-bucket-name/\n",
      "\n",
      "# Create a bucket\n",
      "s3cmd mb s3://my-bucket-name\n",
      "\n",
      "# List bucket disk usage (human readable)\n",
      "s3cmd du -H s3://my-bucket-name/\n",
      "\n",
      "# Sync local (source) to s3 bucket (destination)\n",
      "s3cmd sync my-local-folder-path/ s3://my-bucket-name/\n",
      "\n",
      "# Sync s3 bucket (source) to local (destination)\n",
      "s3cmd sync s3://my-bucket-name/ my-local-folder-path/\n",
      "\n",
      "# Do a dry-run (do not perform actual sync, but get information about what would happen)\n",
      "s3cmd --dry-run sync s3://my-bucket-name/ my-local-folder-path/\n",
      "\n",
      "# Apply a standard shell wildcard include to sync s3 bucket (source) to local (destination)\n",
      "s3cmd --include '2014-05-01*' sync s3://my-bucket-name/ my-local-folder-path/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Setup s3-parallel-put\n",
      "\n",
      "[s3-parallel-put](https://github.com/twpayne/s3-parallel-put.git) is a great tool for uploading multiple files to S3 in parallel."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Install package dependencies:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sudo apt-get install boto\n",
      "sudo apt-get install git"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Clone the s3-parallel-put repo:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "git clone https://github.com/twpayne/s3-parallel-put.git"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Setup AWS keys for s3-parallel-put:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "export AWS_ACCESS_KEY_ID=XXX\n",
      "export AWS_SECRET_ACCESS_KEY=XXX"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Sample usage:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s3-parallel-put --bucket=bucket --prefix=PREFIX SOURCE"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Dry run of putting files in the current directory on S3 with the given S3 prefix, do not check first if they exist:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s3-parallel-put --bucket=bucket --host=s3.amazonaws.com --put=stupid --dry-run --prefix=prefix/ ./"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## S3DistCp\n",
      "\n",
      "[S3DistCp](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html) is an extension of DistCp that is optimized to work with Amazon S3.  S3DistCp is useful for combining smaller files and aggregate them together, taking in a pattern and target file to combine smaller input files to larger ones.  S3DistCp can also be used to transfer large volumes of data from S3 to your Hadoop cluster."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To run S3DistCp with the EMR command line, ensure you are using the proper version of Ruby:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rvm --default ruby-1.8.7-p374"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The EMR command line below executes the following:\n",
      "* Create a master node and slave nodes of type m1.small\n",
      "* Runs S3DistCp on the source bucket location and concatenates files that match the date regular expression, resulting in files that are roughly 1024 MB or 1 GB\n",
      "* Places the results in the destination bucket"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "./elastic-mapreduce --create --instance-group master --instance-count 1 \\\n",
      "--instance-type m1.small --instance-group core --instance-count 4 \\\n",
      "--instance-type m1.small --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \\\n",
      "--args \"--src,s3://my-bucket-source/,--groupBy,.*([0-9]{4}-01).*,\\\n",
      "--dest,s3://my-bucket-dest/,--targetSize,1024\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For further optimization, compression can be helpful to save on AWS storage and bandwidth costs, to speed up the S3 to/from EMR transfer, and to reduce disk I/O. Note that compressed files are not easy to split for Hadoop. For example, Hadoop uses a single mapper per GZIP file, as it does not know about file boundaries.\n",
      "\n",
      "What type of compression should you use?\n",
      "\n",
      "* Time sensitive job: Snappy or LZO\n",
      "* Large amounts of data: GZIP\n",
      "* General purpose: GZIP, as it\u2019s supported by most platforms\n",
      "\n",
      "You can specify the compression codec (gzip, lzo, snappy, or none) to use for copied files with S3DistCp with \u2013outputCodec. If no value is specified, files are copied with no compression change. The code below sets the compression to lzo:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "--outputCodec,lzo"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## mrjob"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Run an mrjob on the given input (must be a flat file hierarchy), placing the results in the output (output directory must not exist):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "python mr-script.py -r emr s3://bucket-source/ --output-dir=s3://bucket-dest/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Run an mrjob locally on the specified input file, sending the results to the specified output file:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "python mrjob_script.py input_data.txt > output_data.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}