Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

ipython-data-notebooks

Continually updated IPython Data Science Notebooks: Spark, Hadoop MapReduce, HDFS, AWS, Kaggle, scikit-learn, matplotlib, pandas, NumPy, SciPy, Python, and various command lines.


spark

IPython Notebook(s) demonstrating Spark and HDFS functionality.

Notebook Description
spark In-memory cluster computing framework that can be up to 100 times faster than disk-based MapReduce for certain applications and is well suited to machine learning algorithms.
hdfs Reliably stores very large files across machines in a large cluster.

mapreduce-python

IPython Notebook(s) demonstrating Hadoop MapReduce with mrjob functionality.

Notebook Description
mapreduce-python Supports MapReduce jobs in Python with mrjob, running them locally or on Hadoop clusters.
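
To make the map, shuffle, and reduce phases concrete, here is a minimal word-count sketch in plain Python. It does not use mrjob or Hadoop; the function names (mapper, shuffle, reducer, word_count) are hypothetical and only emulate in a single process what a MapReduce framework distributes across a cluster.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in-process over an iterable of lines."""
    pairs = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
```

In a real job the framework handles the shuffle step and runs many mappers and reducers in parallel; only the map and reduce logic is written by hand.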

aws

IPython Notebook(s) demonstrating Amazon Web Services (AWS) and AWS tools functionality.

Notebook Description
s3cmd Interacts with S3 through the command line.
s3distcp Combines smaller files into larger ones by taking in a pattern and a target file. S3DistCp can also be used to transfer large volumes of data from S3 to your Hadoop cluster.
s3-parallel-put Uploads multiple files to S3 in parallel.
redshift Acts as a fast data warehouse built on massively parallel processing (MPP) technology.
kinesis Streams data in real time with the ability to process thousands of data streams per second.
lambda Runs code in response to events, automatically managing compute resources.

kaggle

IPython Notebook(s) used in Kaggle competitions.

Notebook Description
titanic Predicts survival on the Titanic. Demonstrates data cleaning, exploratory data analysis, and machine learning.

scikit-learn

IPython Notebook(s) demonstrating scikit-learn functionality.

Notebook Description
scikit-learn-intro Intro notebook to scikit-learn. Scikit-learn is a machine learning library for Python that provides classification, regression, and clustering algorithms and is built on top of NumPy and SciPy.
scikit-learn-knn K-Nearest Neighbors.
scikit-learn-linear-reg Linear regression.
scikit-learn-svm Support vector machine classifier, with and without kernels.
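
As an illustration of the idea behind the k-nearest neighbors notebook, here is a minimal sketch of the algorithm in plain Python. It is not scikit-learn's API; knn_predict and the sample data are hypothetical, and Euclidean distance with majority voting is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query))), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points and take the most common one.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated clusters, labelled "a" and "b".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction_low = knn_predict(points, labels, (1.2, 1.4))
prediction_high = knn_predict(points, labels, (8.7, 8.9))
```

Each query lands in the cluster whose points dominate its three nearest neighbors. A library implementation would typically add faster neighbor search and configurable distance metrics.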

matplotlib

IPython Notebook(s) demonstrating matplotlib functionality.

Notebook Description
matplotlib Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

pandas

IPython Notebook(s) demonstrating pandas functionality.

Notebook Description
pandas Software library written for data manipulation and analysis in Python. Offers data structures and operations for manipulating numerical tables and time series.
pandas io Input and output operations.
pandas cleaning Data wrangling operations.

numpy

IPython Notebook(s) demonstrating NumPy functionality.

Notebook Description
numpy Adds Python support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.

scipy

[Coming Soon] IPython Notebook(s) demonstrating SciPy functionality.


python-data

IPython Notebook(s) demonstrating Python functionality geared towards data analysis.

Notebook Description
data structures Tuples, lists, dicts, sets.
data structure utilities Slice, range, xrange, bisect, sort, sorted, reversed, enumerate, zip, list comprehensions.
functions Functions as objects, lambda functions, closures, *args, **kwargs, currying, generators, generator expressions, itertools.
datetime Datetime, strftime, strptime, timedelta.
unit tests Nose unit tests.
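
A few of the topics listed above can be sketched in a handful of lines. This is an illustrative sample written for Python 3 (where range plays the role of Python 2's xrange), not code taken from the notebooks; all variable and function names are hypothetical.

```python
import bisect
import itertools
from datetime import datetime, timedelta

# Data structure utilities: keep a list sorted while inserting with bisect.
scores = [10, 20, 40]
bisect.insort(scores, 30)  # scores is now [10, 20, 30, 40]

# enumerate and zip inside a list comprehension; zip stops at the shorter input.
names = ["ann", "bob", "cy"]
ranked = [
    (rank, name, score)
    for rank, (name, score) in enumerate(zip(names, scores), start=1)
]


# Closure: the inner function remembers `factor` after make_multiplier returns.
def make_multiplier(factor):
    def multiply(value):
        return value * factor
    return multiply


double = make_multiplier(2)


# Generator plus itertools: lazily take the first five squares.
def squares():
    n = 0
    while True:
        yield n * n
        n += 1


first_squares = list(itertools.islice(squares(), 5))

# datetime: parse with strptime, shift with timedelta, format with strftime.
start = datetime.strptime("2015-04-21", "%Y-%m-%d")
next_week = (start + timedelta(days=7)).strftime("%Y-%m-%d")
```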

commands

IPython Notebook(s) demonstrating various command lines for Linux, Git, etc.

Notebook Description
linux Unix-like and mostly POSIX-compliant computer operating system. Disk usage, splitting files, grep, sed, curl, viewing running processes, terminal syntax highlighting, and Vim.
anaconda Distribution of the Python programming language for large-scale data processing, predictive analytics, and scientific computing, that aims to simplify package management and deployment.
ipython notebook Web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document.
git Distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.
ruby Used to interact with the AWS command line and for Jekyll, a blog framework that can be hosted on GitHub Pages.
jekyll Simple, blog-aware, static site generator for personal, project, or organization sites. Renders Markdown or Textile and Liquid templates, and produces a complete, static website ready to be served by Apache HTTP Server, Nginx or another web server.

credits

license

Copyright 2014 Donne Martin

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.