Moved S3DistCp below the S3-specific commands to group it with upcoming EMR content.

Donne Martin 2015-02-22 07:12:15 -05:00
parent 39281aaaa8
commit 7b15eb949b

@@ -1,7 +1,7 @@
{
"metadata": {
"name": "",
"signature": "sha256:ac66b87a3f3817882223506d6fb0a3afa803b8f51e2e75d58e824993b116a385"
"signature": "sha256:2531311c9289bbab1a6c03f5be4cffdd2eee75ac64274fe4b532ab55316c066d"
},
"nbformat": 3,
"nbformat_minor": 0,
@@ -56,81 +56,6 @@
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## S3DistCp\n",
"\n",
"[S3DistCp](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html) is an extension of DistCp that is optimized to work with Amazon S3. S3DistCp is useful for combining small files into larger ones: it takes a pattern and a target size and concatenates the matching input files. S3DistCp can also be used to transfer large volumes of data from S3 to your Hadoop cluster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To run S3DistCp with the EMR command line, ensure you are using the proper version of Ruby:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rvm --default ruby-1.8.7-p374"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The EMR command line below does the following:\n",
"* Creates one master node and four core nodes of type m1.small\n",
"* Runs S3DistCp on the source bucket location and concatenates files that match the date regular expression, resulting in files that are roughly 1024 MB (1 GB) each\n",
"* Places the results in the destination bucket"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"./elastic-mapreduce --create --instance-group master --instance-count 1 \\\n",
"--instance-type m1.small --instance-group core --instance-count 4 \\\n",
"--instance-type m1.small --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \\\n",
"--args \"--src,s3://my-bucket-source/,--groupBy,.*([0-9]{4}-01).*,\\\n",
"--dest,s3://my-bucket-dest/,--targetSize,1024\""
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For further optimization, compression can help save on AWS storage and bandwidth costs, speed up transfers between S3 and EMR, and reduce disk I/O. Note that compressed files are not easy for Hadoop to split. For example, Hadoop uses a single mapper per GZIP file, because a GZIP file cannot be split at arbitrary boundaries.\n",
"\n",
"What type of compression should you use?\n",
"\n",
"* Time-sensitive job: Snappy or LZO\n",
"* Large amounts of data: GZIP\n",
"* General purpose: GZIP, as it\u2019s supported by most platforms\n",
"\n",
"You can specify the compression codec (gzip, lzo, snappy, or none) that S3DistCp uses for copied files with --outputCodec. If no value is specified, files are copied with no compression change. The code below sets the compression to lzo:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"--outputCodec,lzo"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -333,6 +258,81 @@
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## S3DistCp\n",
"\n",
"[S3DistCp](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html) is an extension of DistCp that is optimized to work with Amazon S3. S3DistCp is useful for combining small files into larger ones: it takes a pattern and a target size and concatenates the matching input files. S3DistCp can also be used to transfer large volumes of data from S3 to your Hadoop cluster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To run S3DistCp with the EMR command line, ensure you are using the proper version of Ruby:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rvm --default ruby-1.8.7-p374"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The EMR command line below does the following:\n",
"* Creates one master node and four core nodes of type m1.small\n",
"* Runs S3DistCp on the source bucket location and concatenates files that match the date regular expression, resulting in files that are roughly 1024 MB (1 GB) each\n",
"* Places the results in the destination bucket"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"./elastic-mapreduce --create --instance-group master --instance-count 1 \\\n",
"--instance-type m1.small --instance-group core --instance-count 4 \\\n",
"--instance-type m1.small --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \\\n",
"--args \"--src,s3://my-bucket-source/,--groupBy,.*([0-9]{4}-01).*,\\\n",
"--dest,s3://my-bucket-dest/,--targetSize,1024\""
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For further optimization, compression can help save on AWS storage and bandwidth costs, speed up transfers between S3 and EMR, and reduce disk I/O. Note that compressed files are not easy for Hadoop to split. For example, Hadoop uses a single mapper per GZIP file, because a GZIP file cannot be split at arbitrary boundaries.\n",
"\n",
"What type of compression should you use?\n",
"\n",
"* Time-sensitive job: Snappy or LZO\n",
"* Large amounts of data: GZIP\n",
"* General purpose: GZIP, as it\u2019s supported by most platforms\n",
"\n",
"You can specify the compression codec (gzip, lzo, snappy, or none) that S3DistCp uses for copied files with --outputCodec. If no value is specified, files are copied with no compression change. The code below sets the compression to lzo:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"--outputCodec,lzo"
],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}