Continually updated IPython Data Science Notebooks geared towards processing big data (AWS, Spark, Hadoop, Linux command line, Python, NumPy, pandas, matplotlib, SciPy, scikit-learn, Kaggle).
* [titanic]( Predicts survival on the Titanic. Demonstrates data cleaning, exploratory data analysis, and machine learning.
* [s3cmd]( Interacts with S3 through the command line.
* [s3-parallel-put]( Uploads multiple files to S3 in parallel.
* [s3distcp]( Combines smaller files into larger ones, given a pattern to match and a target file. S3DistCp can also transfer large volumes of data from S3 to your Hadoop cluster.
* [mrjob]( Supports MapReduce jobs in Python 2.5+ and runs them locally or on Hadoop clusters.
* [redshift]( Acts as a fast data warehouse built on massively parallel processing (MPP) technology.
* [kinesis]( Streams data in real time with the ability to process thousands of data streams per second.
* [lambda]( Runs code in response to events, automatically managing compute resources.
* [spark]( Open-source in-memory cluster computing framework that is up to 100 times faster for certain applications and well suited to machine learning algorithms.
* [hdfs]( Reliably stores very large files across machines in a large cluster.
* [Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython](
* [Building Machine Learning Systems with Python](