ipython-data-notebooks
Continually updated IPython Data Science Notebooks geared towards processing big data (AWS, Spark, Hadoop MapReduce, HDFS, Linux command line, Python, NumPy, pandas, matplotlib, SciPy, scikit-learn, Kaggle).
kaggle
IPython Notebooks used in Kaggle competitions.
Notebook | Description |
---|---|
titanic | Predicts survival on the Titanic. Demonstrates data cleaning, exploratory data analysis, and machine learning. |
spark
IPython Notebooks demonstrating Spark and HDFS functionality.
Notebook | Description |
---|---|
spark | In-memory cluster computing framework that is up to 100 times faster for certain applications and well suited to machine learning algorithms. |
hdfs | Reliably stores very large files across machines in a large cluster. |
aws
IPython Notebooks demonstrating Amazon Web Services functionality.
Notebook | Description |
---|---|
s3cmd | Interacts with S3 through the command line. |
s3-parallel-put | Uploads multiple files to S3 in parallel. |
s3distcp | Combines smaller files into larger ones by taking in a pattern and a target file. S3DistCp can also transfer large volumes of data from S3 to your Hadoop cluster. |
mrjob | Supports MapReduce jobs in Python 2.5+ and runs them locally or on Hadoop clusters. |
redshift | Acts as a fast data warehouse built on massively parallel processing (MPP) technology. |
kinesis | Streams data in real time with the ability to process thousands of data streams per second. |
lambda | Runs code in response to events, automatically managing compute resources. |
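The MapReduce model behind the mrjob notebook can be sketched in plain Python. This is only an illustrative stand-in, not mrjob's API: the names `mapper`, `reducer`, and `run_word_count` are hypothetical, and everything runs in a single process rather than on a Hadoop cluster.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce phase: sum all counts collected for one word."""
    return word, sum(counts)


def run_word_count(lines):
    """Run map, shuffle, and reduce in-process (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)  # shuffle: group values by key
    return dict(reducer(word, counts) for word, counts in grouped.items())


word_counts = run_word_count(["big data", "Big clusters", "data data"])
```

On a real cluster the shuffle step is handled by the framework, so a job only has to supply the map and reduce logic.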
python-core
IPython Notebooks demonstrating core Python functionality geared towards data analysis.
Notebook | Description |
---|---|
data structures | Tuples, lists, dicts, sets. |
data structure utilities | Slice, range, xrange, bisect, sort, sorted, reversed, enumerate, zip, list comprehensions. |
functions | Functions as objects, lambda functions, closures, *args, **kwargs, currying, generators, generator expressions, itertools. |
datetime | Basics of datetime, strftime, strptime, timedelta. |
unit tests | Nose unit tests. |
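A few of the topics listed above can be sketched in a handful of lines. This is a minimal illustration written for Python 3, not an excerpt from the notebooks; all variable and function names are made up for the example. Note that `xrange` exists only in Python 2, where it plays the role of Python 3's lazy `range`.

```python
import bisect
from datetime import datetime, timedelta
from itertools import islice

# Data structure utilities: keep a list sorted while inserting.
scores = [10, 20, 40]
bisect.insort(scores, 30)

# enumerate, sorted, and zip, including a dict comprehension.
names = ["ann", "bob", "cy"]
ranked = {name: position
          for position, name in enumerate(sorted(names, reverse=True))}
pairs = list(zip(names, scores))  # zip stops at the shorter input


# Closure: the returned function remembers the value of n.
def make_adder(n):
    def add(x):
        return x + n
    return add


add_five = make_adder(5)


# Generator: yields squares lazily, one value per request.
def squares():
    i = 0
    while True:
        yield i * i
        i += 1


first_squares = list(islice(squares(), 3))  # take only the first three

# datetime: parse a string, shift it by a timedelta, format it back.
start = datetime.strptime("2014-12-31", "%Y-%m-%d")
next_day = (start + timedelta(days=1)).strftime("%Y-%m-%d")
```

Because `squares()` never terminates on its own, `islice` is what keeps the example finite; calling `list(squares())` directly would not return.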
pandas
IPython Notebooks demonstrating pandas functionality.
commands
IPython Notebooks demonstrating various command-line tools for Linux, Git, etc.
matplotlib
[Coming Soon] IPython Notebooks demonstrating matplotlib functionality.
scikit-learn
[Coming Soon] IPython Notebooks demonstrating scikit-learn functionality.
scipy
[Coming Soon] IPython Notebooks demonstrating SciPy functionality.
numpy
[Coming Soon] IPython Notebooks demonstrating NumPy functionality.
References
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
- Building Machine Learning Systems with Python
- Think Bayes
- Think Stats
License
Copyright 2014 Donne Martin
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.