
![alt text](http://i2.wp.com/donnemartin.com/wp-content/uploads/2015/02/ipython_notebook_cover2-e1425213196820.png)

# ipython-data-notebooks
Continually updated IPython Data Science Notebooks geared towards processing big data (AWS, Spark, Hadoop, Linux command line, Python, NumPy, pandas, matplotlib, SciPy, scikit-learn, Kaggle).

## kaggle

IPython Notebooks used in [kaggle](https://www.kaggle.com/) competitions.

* [titanic](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/kaggle/titanic.ipynb): Predicts survival on the Titanic.  Demonstrates data cleaning, exploratory data analysis, and machine learning.
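
The cleaning and exploratory steps mentioned above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the rows are hypothetical (they are not records from the Kaggle dataset), the helper names `fill_missing_age` and `survival_rate_by` are made up for this example, and the machine learning step is left out.

```python
from statistics import median

# Hypothetical rows for illustration only; NOT taken from the Kaggle Titanic data.
passengers = [
    {"sex": "female", "age": 29.0, "survived": 1},
    {"sex": "male", "age": None, "survived": 0},
    {"sex": "male", "age": 40.0, "survived": 0},
    {"sex": "female", "age": None, "survived": 1},
    {"sex": "male", "age": 35.0, "survived": 1},
]


def fill_missing_age(rows):
    """Data cleaning: replace missing ages with the median of the known ages."""
    known_ages = [row["age"] for row in rows if row["age"] is not None]
    fallback = median(known_ages)
    # Build new dicts so the input rows are left untouched.
    return [
        dict(row, age=fallback if row["age"] is None else row["age"])
        for row in rows
    ]


def survival_rate_by(rows, key):
    """Exploratory analysis: fraction of survivors for each value of `key`."""
    totals, survivors = {}, {}
    for row in rows:
        group = row[key]
        totals[group] = totals.get(group, 0) + 1
        survivors[group] = survivors.get(group, 0) + row["survived"]
    return {group: survivors[group] / totals[group] for group in totals}


if __name__ == "__main__":
    cleaned = fill_missing_age(passengers)
    print(survival_rate_by(cleaned, "sex"))
```

In practice the notebook works with pandas and scikit-learn; the standard library is used here only to keep the sketch dependency-free.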

## aws

IPython Notebooks demonstrating Amazon Web Services functionality.

* [aws commands index](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/aws/aws.ipynb)

* [s3cmd](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/aws/aws.ipynb#s3cmd): Interacts with S3 through the command line.

* [s3-parallel-put](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/aws/aws.ipynb#s3-parallel-put): Uploads multiple files to S3 in parallel.

* [s3distcp](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/aws/aws.ipynb#s3distcp): Combines smaller files into larger aggregates, given a pattern and a target file. S3DistCp can also transfer large volumes of data from S3 to your Hadoop cluster.

* [mrjob](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/aws/aws.ipynb#mrjob): Supports MapReduce jobs in Python 2.5+ and runs them locally or on Hadoop clusters.

* [redshift](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/aws/aws.ipynb#redshift): Acts as a fast data warehouse built on massively parallel processing (MPP) technology.

* [kinesis](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/aws/aws.ipynb#kinesis): Streams data in real time with the ability to process thousands of data streams per second.

* [lambda](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/aws/aws.ipynb#lambda): Runs code in response to events, automatically managing compute resources.
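
To make the MapReduce model behind the mrjob entry more concrete, here is a minimal in-process sketch of the map, shuffle, and reduce phases for a word count. It deliberately does not use mrjob or Hadoop. The function names (`mapper`, `shuffle`, `reducer`, `word_count`) are assumptions chosen for this example, not mrjob's API.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word, counts):
    """Reduce step: collapse the values for one key into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in a single process on an iterable of lines."""
    pairs = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(pairs)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data pipeline"]))
```

On a real cluster the framework distributes the map and reduce calls across machines and performs the shuffle over the network; the per-key logic stays the same.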

## spark

IPython Notebooks demonstrating Spark and HDFS functionality.

* [spark](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/spark/spark.ipynb): Open-source, in-memory cluster computing framework that can be up to 100 times faster for certain applications and is well suited for machine learning algorithms.

* [hdfs](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/spark/hdfs.ipynb): Reliably stores very large files across machines in a large cluster.
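
HDFS stores large files reliably by splitting them into blocks and keeping several replicas of each block on different machines. The toy model below sketches that idea under simplifying assumptions: the function names are hypothetical, the sizes are tiny, and replicas are placed round-robin, whereas real HDFS uses a rack-aware placement policy.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of `file_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    This is a simplification for illustration; it ignores racks, free
    space, and node health.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
    for block_id, (offset, length) in enumerate(blocks):
        print(block_id, offset, length, placement[block_id])
```

Because every block lives on more than one node, losing a single machine does not lose any part of the file.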

## python-core

IPython Notebooks demonstrating core Python functionality geared towards data analysis.

* [data structures](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/python-core/structs.ipynb)
* [data structure utilities](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/python-core/structs_utils.ipynb)
* [functions](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/python-core/functions.ipynb)
* [datetime](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/python-core/datetime.ipynb)
* [unit tests](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/python-core/unit_tests.ipynb)

## pandas

IPython Notebooks demonstrating pandas functionality.

* [pandas](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/pandas/pandas.ipynb)
* [pandas io](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/pandas/pandas_io.ipynb)
* [pandas cleaning](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/pandas/pandas_clean.ipynb)

## commands

IPython Notebooks demonstrating command-line usage for Linux, Git, and other tools.

* [linux](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/commands/linux.ipynb)
* [anaconda](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/commands/misc.ipynb#anaconda)
* [ipython notebook](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/commands/misc.ipynb#ipython-notebook)
* [git](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/commands/misc.ipynb#git)
* [ruby](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/commands/misc.ipynb#ruby)
* [jekyll](http://nbviewer.ipython.org/github/donnemartin/ipython-data-notebooks/blob/master/commands/misc.ipynb#jekyll)

## matplotlib

[Coming Soon] IPython Notebooks demonstrating matplotlib functionality.

## scikit-learn

[Coming Soon] IPython Notebooks demonstrating scikit-learn functionality.

## scipy

[Coming Soon] IPython Notebooks demonstrating SciPy functionality.

## numpy

[Coming Soon] IPython Notebooks demonstrating NumPy functionality.

## References

* [Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793)
* [Building Machine Learning Systems with Python](http://www.amazon.com/Building-Machine-Learning-Systems-Python/dp/1782161406)
* [Think Bayes](http://www.amazon.com/Think-Bayes-Allen-B-Downey/dp/1449370780)
* [Think Stats](http://www.amazon.com/Think-Stats-Allen-B-Downey/dp/1449307116)

## License

    Copyright 2014 Donne Martin

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.