Merge pull request #42 from donnemartin/develop

Add notebooks for Keras, NumPy, Matplotlib, and Pandas
pull/43/head
Donne Martin 2017-03-13 04:58:54 -04:00 committed by GitHub
commit d3d5aa834a
43 changed files with 39575 additions and 71 deletions

README.md

@ -12,14 +12,14 @@
## Index
* [kaggle-and-business-analyses](#kaggle-and-business-analyses)
* [scikit-learn](#scikit-learn)
* [deep-learning](#deep-learning)
* [scikit-learn](#scikit-learn)
* [statistical-inference-scipy](#statistical-inference-scipy)
* [pandas](#pandas)
* [matplotlib](#matplotlib)
* [numpy](#numpy)
* [python-data](#python-data)
* [kaggle-and-business-analyses](#kaggle-and-business-analyses)
* [spark](#spark)
* [mapreduce-python](#mapreduce-python)
* [amazon web services](#aws)
@ -31,41 +31,6 @@
* [contact-info](#contact-info)
* [license](#license)
<br/>
<p align="center">
<img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/kaggle.png">
</p>
## kaggle-and-business-analyses
IPython Notebook(s) used in [kaggle](https://www.kaggle.com/) competitions and business analyses.
| Notebook | Description |
|-------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| [titanic](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/kaggle/titanic.ipynb) | Predict survival on the Titanic. Learn data cleaning, exploratory data analysis, and machine learning. |
| [churn-analysis](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/analyses/churn.ipynb) | Predict customer churn. Exercise logistic regression, gradient boosting classifiers, support vector machines, random forests, and k-nearest-neighbors. Includes discussions of confusion matrices, ROC plots, feature importances, prediction probabilities, and calibration/discrimination. |
<br/>
<p align="center">
<img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/scikitlearn.png">
</p>
## scikit-learn
IPython Notebook(s) demonstrating scikit-learn functionality.
| Notebook | Description |
|--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [intro](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-intro.ipynb) | Intro notebook to scikit-learn. Scikit-learn is a Python machine learning library with tools for classification, regression, clustering, and dimensionality reduction. |
| [knn](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-intro.ipynb#K-Nearest-Neighbors-Classifier) | Implement k-nearest neighbors in scikit-learn. |
| [linear-reg](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-linear-reg.ipynb) | Implement linear regression in scikit-learn. |
| [svm](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-svm.ipynb) | Implement support vector machine classifiers with and without kernels in scikit-learn. |
| [random-forest](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-random-forest.ipynb) | Implement random forest classifiers and regressors in scikit-learn. |
| [k-means](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-k-means.ipynb) | Implement k-means clustering in scikit-learn. |
| [pca](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-pca.ipynb) | Implement principal component analysis in scikit-learn. |
| [gmm](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-gmm.ipynb) | Implement Gaussian mixture models in scikit-learn. |
| [validation](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-validation.ipynb) | Implement validation and model selection in scikit-learn. |
<br/>
<p align="center">
<img src="http://i.imgur.com/ZhKXrKZ.png">
@ -129,12 +94,56 @@ Additional TensorFlow tutorials:
| [theano-rnn](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/deep-learning/theano-tutorial/rnn_tutorial/simple_rnn.ipynb) | Implement recurrent neural networks in Theano. |
| [theano-mlp](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/deep-learning/theano-tutorial/theano_mlp/theano_mlp.ipynb) | Implement multilayer perceptrons in Theano. |
<br/>
<p align="center">
<img src="http://i.imgur.com/L45Q8c2.jpg">
</p>
### keras-tutorials
| Notebook | Description |
|--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| keras | Keras is an open source neural network library written in Python. It is capable of running on top of either TensorFlow or Theano. |
| [setup](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/0.%20Preamble.ipynb) | Learn about the tutorial goals and how to set up your Keras environment. |
| [intro-deep-learning-ann](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/1.1%20Introduction%20-%20Deep%20Learning%20and%20ANN.ipynb) | Get an intro to deep learning with Keras and Artificial Neural Networks (ANN). |
| [theano](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/1.2%20Introduction%20-%20Theano.ipynb) | Learn about Theano by working with weights matrices and gradients. |
| [keras-otto](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/1.3%20Introduction%20-%20Keras.ipynb) | Learn about Keras by looking at the Kaggle Otto challenge. |
| [ann-mnist](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/1.4%20(Extra)%20A%20Simple%20Implementation%20of%20ANN%20for%20MNIST.ipynb) | Review a simple implementation of ANN for MNIST using Keras. |
| [conv-nets](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/2.1%20Supervised%20Learning%20-%20ConvNets.ipynb) | Learn about Convolutional Neural Networks (CNNs) with Keras. |
| [conv-net-1](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/2.2.1%20Supervised%20Learning%20-%20ConvNet%20HandsOn%20Part%20I.ipynb) | Recognize handwritten digits from MNIST using Keras - Part 1. |
| [conv-net-2](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/2.2.2%20Supervised%20Learning%20-%20ConvNet%20HandsOn%20Part%20II.ipynb) | Recognize handwritten digits from MNIST using Keras - Part 2. |
| [keras-models](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/2.3%20Supervised%20Learning%20-%20Famous%20Models%20with%20Keras.ipynb) | Use pre-trained models such as VGG16, VGG19, ResNet50, and Inception v3 with Keras. |
| [auto-encoders](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/3.1%20Unsupervised%20Learning%20-%20AutoEncoders%20and%20Embeddings.ipynb) | Learn about Autoencoders with Keras. |
| [rnn-lstm](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/3.2%20RNN%20and%20LSTM.ipynb) | Learn about Recurrent Neural Networks (RNNs) with Keras. |
| [lstm-sentence-gen](http://nbviewer.ipython.org/github/leriomaggio/deep-learning-keras-tensorflow/blob/master/3.3%20(Extra)%20LSTM%20for%20Sentence%20Generation.ipynb) | Learn about RNNs using Long Short Term Memory (LSTM) networks with Keras. |
### deep-learning-misc
| Notebook | Description |
|--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [deep-dream](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/deep-learning/deep-dream/dream.ipynb) | Caffe-based computer vision program which uses a convolutional neural network to find and enhance patterns in images. |
<br/>
<p align="center">
<img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/scikitlearn.png">
</p>
## scikit-learn
IPython Notebook(s) demonstrating scikit-learn functionality.
| Notebook | Description |
|--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [intro](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-intro.ipynb) | Intro notebook to scikit-learn. Scikit-learn is a Python machine learning library with tools for classification, regression, clustering, and dimensionality reduction. |
| [knn](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-intro.ipynb#K-Nearest-Neighbors-Classifier) | Implement k-nearest neighbors in scikit-learn. |
| [linear-reg](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-linear-reg.ipynb) | Implement linear regression in scikit-learn. |
| [svm](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-svm.ipynb) | Implement support vector machine classifiers with and without kernels in scikit-learn. |
| [random-forest](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-random-forest.ipynb) | Implement random forest classifiers and regressors in scikit-learn. |
| [k-means](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-k-means.ipynb) | Implement k-means clustering in scikit-learn. |
| [pca](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-pca.ipynb) | Implement principal component analysis in scikit-learn. |
| [gmm](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-gmm.ipynb) | Implement Gaussian mixture models in scikit-learn. |
| [validation](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-validation.ipynb) | Implement validation and model selection in scikit-learn. |
<br/>
<p align="center">
<img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/scipy.png">
@ -164,6 +173,19 @@ IPython Notebook(s) demonstrating pandas functionality.
|--------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [pandas](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/pandas.ipynb) | Software library written for data manipulation and analysis in Python. Offers data structures and operations for manipulating numerical tables and time series. |
| [github-data-wrangling](https://github.com/donnemartin/viz/blob/master/githubstats/data_wrangling.ipynb) | Learn how to load, clean, merge, and feature engineer by analyzing GitHub data from the [`Viz`](https://github.com/donnemartin/viz) repo. |
| [Introduction-to-Pandas](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.00-Introduction-to-Pandas.ipynb) | Introduction to Pandas. |
| [Introducing-Pandas-Objects](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.01-Introducing-Pandas-Objects.ipynb) | Learn about Pandas objects. |
| [Data Indexing and Selection](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.02-Data-Indexing-and-Selection.ipynb) | Learn about data indexing and selection in Pandas. |
| [Operations-in-Pandas](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.03-Operations-in-Pandas.ipynb) | Learn about operating on data in Pandas. |
| [Missing-Values](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.04-Missing-Values.ipynb) | Learn about handling missing data in Pandas. |
| [Hierarchical-Indexing](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.05-Hierarchical-Indexing.ipynb) | Learn about hierarchical indexing in Pandas. |
| [Concat-And-Append](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.06-Concat-And-Append.ipynb) | Learn about combining datasets: concat and append in Pandas. |
| [Merge-and-Join](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.07-Merge-and-Join.ipynb) | Learn about combining datasets: merge and join in Pandas. |
| [Aggregation-and-Grouping](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.08-Aggregation-and-Grouping.ipynb) | Learn about aggregation and grouping in Pandas. |
| [Pivot-Tables](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.09-Pivot-Tables.ipynb) | Learn about pivot tables in Pandas. |
| [Working-With-Strings](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.10-Working-With-Strings.ipynb) | Learn about vectorized string operations in Pandas. |
| [Working-with-Time-Series](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.11-Working-with-Time-Series.ipynb) | Learn about working with time series in Pandas. |
| [Performance-Eval-and-Query](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.12-Performance-Eval-and-Query.ipynb) | Learn about high-performance Pandas: eval() and query() in Pandas. |
<br/>
<p align="center">
@ -178,6 +200,21 @@ IPython Notebook(s) demonstrating matplotlib functionality.
|-----------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| [matplotlib](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/matplotlib.ipynb) | Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. |
| [matplotlib-applied](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/matplotlib-applied.ipynb) | Apply matplotlib visualizations to Kaggle competitions for exploratory data analysis. Learn how to create bar plots, histograms, subplot2grid, normalized plots, scatter plots, subplots, and kernel density estimation plots. |
| [Introduction-To-Matplotlib](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.00-Introduction-To-Matplotlib.ipynb) | Introduction to Matplotlib. |
| [Simple-Line-Plots](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.01-Simple-Line-Plots.ipynb) | Learn about simple line plots in Matplotlib. |
| [Simple-Scatter-Plots](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.02-Simple-Scatter-Plots.ipynb) | Learn about simple scatter plots in Matplotlib. |
| [Errorbars](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.03-Errorbars.ipynb) | Learn about visualizing errors in Matplotlib. |
| [Density-and-Contour-Plots](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.04-Density-and-Contour-Plots.ipynb) | Learn about density and contour plots in Matplotlib. |
| [Histograms-and-Binnings](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.05-Histograms-and-Binnings.ipynb) | Learn about histograms, binnings, and density in Matplotlib. |
| [Customizing-Legends](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.06-Customizing-Legends.ipynb) | Learn about customizing plot legends in Matplotlib. |
| [Customizing-Colorbars](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.07-Customizing-Colorbars.ipynb) | Learn about customizing colorbars in Matplotlib. |
| [Multiple-Subplots](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.08-Multiple-Subplots.ipynb) | Learn about multiple subplots in Matplotlib. |
| [Text-and-Annotation](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.09-Text-and-Annotation.ipynb) | Learn about text and annotation in Matplotlib. |
| [Customizing-Ticks](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.10-Customizing-Ticks.ipynb) | Learn about customizing ticks in Matplotlib. |
| [Settings-and-Stylesheets](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.11-Settings-and-Stylesheets.ipynb) | Learn about customizing Matplotlib: configurations and stylesheets. |
| [Three-Dimensional-Plotting](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.12-Three-Dimensional-Plotting.ipynb) | Learn about three-dimensional plotting in Matplotlib. |
| [Geographic-Data-With-Basemap](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.13-Geographic-Data-With-Basemap.ipynb) | Learn about geographic data with basemap in Matplotlib. |
| [Visualization-With-Seaborn](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/matplotlib/04.14-Visualization-With-Seaborn.ipynb) | Learn about visualization with Seaborn. |
<br/>
<p align="center">
@ -191,6 +228,16 @@ IPython Notebook(s) demonstrating NumPy functionality.
| Notebook | Description |
|--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [numpy](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/numpy.ipynb) | Adds Python support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. |
| [Introduction-to-NumPy](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.00-Introduction-to-NumPy.ipynb) | Introduction to NumPy. |
| [Understanding-Data-Types](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.01-Understanding-Data-Types.ipynb) | Learn about data types in Python. |
| [The-Basics-Of-NumPy-Arrays](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.02-The-Basics-Of-NumPy-Arrays.ipynb) | Learn about the basics of NumPy arrays. |
| [Computation-on-arrays-ufuncs](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.03-Computation-on-arrays-ufuncs.ipynb) | Learn about computations on NumPy arrays: universal functions. |
| [Computation-on-arrays-aggregates](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.04-Computation-on-arrays-aggregates.ipynb) | Learn about aggregations: min, max, and everything in between in NumPy. |
| [Computation-on-arrays-broadcasting](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.05-Computation-on-arrays-broadcasting.ipynb) | Learn about computation on arrays: broadcasting in NumPy. |
| [Boolean-Arrays-and-Masks](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.06-Boolean-Arrays-and-Masks.ipynb) | Learn about comparisons, masks, and boolean logic in NumPy. |
| [Fancy-Indexing](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.07-Fancy-Indexing.ipynb) | Learn about fancy indexing in NumPy. |
| [Sorting](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.08-Sorting.ipynb) | Learn about sorting arrays in NumPy. |
| [Structured-Data-NumPy](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/numpy/02.09-Structured-Data-NumPy.ipynb) | Learn about structured data: NumPy's structured arrays. |
<br/>
<p align="center">
@ -211,6 +258,20 @@ IPython Notebook(s) demonstrating Python functionality geared towards data analy
| [pdb](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/python-data/pdb.ipynb) | Learn how to debug in Python with the interactive source code debugger. |
| [unit tests](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/python-data/unit_tests.ipynb) | Learn how to test in Python with Nose unit tests. |
<br/>
<p align="center">
<img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/kaggle.png">
</p>
## kaggle-and-business-analyses
IPython Notebook(s) used in [kaggle](https://www.kaggle.com/) competitions and business analyses.
| Notebook | Description |
|-------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| [titanic](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/kaggle/titanic.ipynb) | Predict survival on the Titanic. Learn data cleaning, exploratory data analysis, and machine learning. |
| [churn-analysis](http://nbviewer.ipython.org/github/donnemartin/data-science-ipython-notebooks/blob/master/analyses/churn.ipynb) | Predict customer churn. Exercise logistic regression, gradient boosting classifiers, support vector machines, random forests, and k-nearest-neighbors. Includes discussions of confusion matrices, ROC plots, feature importances, prediction probabilities, and calibration/discrimination. |
<br/>
<p align="center">
<img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/spark.png">
@ -301,12 +362,6 @@ Anaconda is a free distribution of the Python programming language for large-sca
Follow instructions to install [Anaconda](https://docs.continuum.io/anaconda/install) or the more lightweight [miniconda](http://conda.pydata.org/miniconda.html).
### pip-requirements
If you prefer to use a more lightweight installation procedure than Anaconda, first clone the repo then run the following pip command on the provided requirements.txt file:
$ pip install -r requirements.txt
### dev-setup
For detailed instructions, scripts, and tools to set up your development environment for data analysis, check out the [dev-setup](https://github.com/donnemartin/dev-setup) repo.
@ -329,6 +384,7 @@ Notebooks tested with Python 2.7.x.
* [Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793) by Wes McKinney
* [PyCon 2015 Scikit-learn Tutorial](https://github.com/jakevdp/sklearn_pycon2015) by Jake VanderPlas
* [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook) by Jake VanderPlas
* [Parallel Machine Learning with scikit-learn and IPython](https://github.com/ogrisel/parallel_ml_tutorial) by Olivier Grisel
* [Statistical Inference Using Computational Methods in Python](https://github.com/AllenDowney/CompStats) by Allen Downey
* [TensorFlow Examples](https://github.com/aymericdamien/TensorFlow-Examples) by Aymeric Damien
@ -337,6 +393,7 @@ Notebooks tested with Python 2.7.x.
* [TensorFlow Tutorials](https://github.com/alrojo/tensorflow-tutorial) by Alexander R Johansen
* [TensorFlow Book](https://github.com/BinRoot/TensorFlow-Book) by Nishant Shukla
* [Summer School 2015](https://github.com/mila-udem/summerschool2015) by mila-udem
* [Keras tutorials](https://github.com/leriomaggio/deep-learning-keras-tensorflow) by Valerio Maggio
* [Kaggle](https://www.kaggle.com/)
* [Yhat Blog](http://blog.yhat.com/)

BIN
images/keras.jpg Normal file

Binary file not shown.


File diff suppressed because one or more lines are too long


@ -0,0 +1,94 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
"\n",
"*No changes were made to the contents of this notebook from the original.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Visualization with Seaborn](04.14-Visualization-With-Seaborn.ipynb) | [Contents](Index.ipynb) | [Machine Learning](05.00-Machine-Learning.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Further Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Matplotlib Resources\n",
"\n",
"A single chapter in a book can never hope to cover all the available features and plot types available in Matplotlib.\n",
"As with other packages we've seen, liberal use of IPython's tab-completion and help functions (see [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)) can be very helpful when exploring Matplotlib's API.\n",
"In addition, Matplotlib's [online documentation](http://matplotlib.org/) can be a helpful reference.\n",
"See in particular the [Matplotlib gallery](http://matplotlib.org/gallery.html) linked on that page: it shows thumbnails of hundreds of different plot types, each one linked to a page with the Python code snippet used to generate it.\n",
"In this way, you can visually inspect and learn about a wide range of different plotting styles and visualization techniques.\n",
"\n",
"For a book-length treatment of Matplotlib, I would recommend [*Interactive Applications Using Matplotlib*](https://www.packtpub.com/application-development/interactive-applications-using-matplotlib), written by Matplotlib core developer Ben Root."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Other Python Graphics Libraries\n",
"\n",
"Although Matplotlib is the most prominent Python visualization library, there are other more modern tools that are worth exploring as well.\n",
"I'll mention a few of them briefly here:\n",
"\n",
"- [Bokeh](http://bokeh.pydata.org) is a JavaScript visualization library with a Python frontend that creates highly interactive visualizations capable of handling very large and/or streaming datasets. The Python front-end outputs a JSON data structure that can be interpreted by the Bokeh JS engine.\n",
"- [Plotly](http://plot.ly) is the eponymous open source product of the Plotly company, and is similar in spirit to Bokeh. Because Plotly is the main product of a startup, it is receiving a high level of development effort. Use of the library is entirely free.\n",
"- [Vispy](http://vispy.org/) is an actively developed project focused on dynamic visualizations of very large datasets. Because it is built to target OpenGL and make use of efficient graphics processors in your computer, it is able to render some quite large and stunning visualizations.\n",
"- [Vega](https://vega.github.io/) and [Vega-Lite](https://vega.github.io/vega-lite) are declarative graphics representations, and are the product of years of research into the fundamental language of data visualization. The reference rendering implementation is JavaScript, but the API is language agnostic. There is a Python API under development in the [Altair](https://altair-viz.github.io/) package. Though as of summer 2016 it's not yet fully mature, I'm quite excited for the possibilities of this project to provide a common reference point for visualization in Python and other languages.\n",
"\n",
"The visualization space in the Python community is very dynamic, and I fully expect this list to be out of date as soon as it is published.\n",
"Keep an eye out for what's coming in the future!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Visualization with Seaborn](04.14-Visualization-With-Seaborn.ipynb) | [Contents](Index.ipynb) | [Machine Learning](05.00-Machine-Learning.ipynb) >"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -0,0 +1,160 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
"\n",
"*No changes were made to the contents of this notebook from the original.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [More IPython Resources](01.08-More-IPython-Resources.ipynb) | [Contents](Index.ipynb) | [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to NumPy\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This chapter, along with Chapter 3, outlines techniques for effectively loading, storing, and manipulating in-memory data in Python.\n",
"The topic is very broad: datasets can come from a wide range of sources and in a wide range of formats, including collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else.\n",
"Despite this apparent heterogeneity, it will help us to think of all data fundamentally as arrays of numbers.\n",
"\n",
"For example, images (particularly digital images) can be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area.\n",
"Sound clips can be thought of as one-dimensional arrays of intensity versus time.\n",
"Text can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words.\n",
"No matter what the data are, the first step in making them analyzable will be to transform them into arrays of numbers.\n",
"(We will discuss some specific examples of this process later in [Feature Engineering](05.04-Feature-Engineering.ipynb).)\n",
"\n",
"For this reason, efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science.\n",
"We'll now take a look at the specialized tools that Python has for handling such numerical arrays: the NumPy package, and the Pandas package (discussed in Chapter 3).\n",
"\n",
"This chapter will cover NumPy in detail. NumPy (short for *Numerical Python*) provides an efficient interface to store and operate on dense data buffers.\n",
"In some ways, NumPy arrays are like Python's built-in ``list`` type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size.\n",
"NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.\n",
"\n",
"If you followed the advice outlined in the Preface and installed the Anaconda stack, you already have NumPy installed and ready to go.\n",
"If you're more the do-it-yourself type, you can go to http://www.numpy.org/ and follow the installation instructions found there.\n",
"Once you do, you can import NumPy and double-check the version:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'1.11.1'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy\n",
"numpy.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the pieces of the package discussed here, I'd recommend NumPy version 1.8 or later.\n",
"By convention, you'll find that most people in the SciPy/PyData world will import NumPy using ``np`` as an alias:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Throughout this chapter, and indeed the rest of the book, you'll find that this is the way we will import and use NumPy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reminder about Built-In Documentation\n",
"\n",
"As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature), as well as the documentation of various functions (using the ``?`` character); for a refresher, refer back to [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb).\n",
"\n",
"For example, to display all the contents of the numpy namespace, you can type this:\n",
"\n",
"```ipython\n",
"In [3]: np.<TAB>\n",
"```\n",
"\n",
"And to display NumPy's built-in documentation, you can use this:\n",
"\n",
"```ipython\n",
"In [4]: np?\n",
"```\n",
"\n",
"More detailed documentation, along with tutorials and other resources, can be found at http://www.numpy.org."
]
},
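{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outside of IPython, a similar exploration can be approximated with plain Python introspection. The following is a minimal sketch that is not part of the original text: it relies only on the built-in ``dir`` function and the ``__doc__`` attribute, and the variable names are purely illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Public names in the numpy namespace (roughly what np.<TAB> lists)\n",
"public_names = [name for name in dir(np) if not name.startswith('_')]\n",
"\n",
"# The docstring that np.zeros? would display in IPython\n",
"zeros_doc = np.zeros.__doc__\n",
"\n",
"print(len(public_names))\n",
"print(zeros_doc[:200])\n",
"```\n",
"\n",
"The exact number of names printed depends on the installed NumPy version."
]
},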
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [More IPython Resources](01.08-More-IPython-Resources.ipynb) | [Contents](Index.ipynb) | [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) >"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -0,0 +1,827 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
"\n",
"*No changes were made to the contents of this notebook from the original.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Introduction to NumPy](02.00-Introduction-to-NumPy.ipynb) | [Contents](Index.ipynb) | [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Understanding Data Types in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Effective data-driven science and computation require understanding how data is stored and manipulated.\n",
"This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this.\n",
"Understanding this difference is fundamental to understanding much of the material throughout the rest of the book.\n",
"\n",
"Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing.\n",
"While a statically typed language like C or Java requires each variable to be explicitly declared, a dynamically typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:\n",
"\n",
"```C\n",
"/* C code */\n",
"int result = 0;\n",
"for(int i=0; i<100; i++){\n",
"    result += i;\n",
"}\n",
"```\n",
"\n",
"In Python, the equivalent operation could be written this way:\n",
"\n",
"```python\n",
"# Python code\n",
"result = 0\n",
"for i in range(100):\n",
"    result += i\n",
"```\n",
"\n",
"Notice the main difference: in C, the data type of each variable is explicitly declared, while in Python the types are dynamically inferred. This means, for example, that we can assign any kind of data to any variable:\n",
"\n",
"```python\n",
"# Python code\n",
"x = 4\n",
"x = \"four\"\n",
"```\n",
"\n",
"Here we've switched the contents of ``x`` from an integer to a string. The same thing in C would lead (depending on compiler settings) to a compilation error or other unintended consequences:\n",
"\n",
"```C\n",
"/* C code */\n",
"int x = 4;\n",
"x = \"four\"; // FAILS\n",
"```\n",
"\n",
"This sort of flexibility is one piece that makes Python and other dynamically-typed languages convenient and easy to use.\n",
"Understanding *how* this works is an important piece of learning to analyze data efficiently and effectively with Python.\n",
"But what this type-flexibility also points to is the fact that Python variables are more than just their value; they also contain extra information about the type of the value. We'll explore this more in the sections that follow."
]
},
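{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of this point (a sketch added for clarity, with illustrative variable names), the built-in ``type`` function suggests that the type information travels with the object rather than with the variable name:\n",
"\n",
"```python\n",
"x = 4\n",
"type_before = type(x)   # the name x currently refers to an int object\n",
"\n",
"x = 'four'\n",
"type_after = type(x)    # the same name now refers to a str object\n",
"\n",
"print(type_before, type_after)\n",
"```\n",
"\n",
"Rebinding the name ``x`` does not convert the original integer; it simply makes ``x`` point at a different object that carries its own type."
]
},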
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Python Integer Is More Than Just an Integer\n",
"\n",
"The standard Python implementation is written in C.\n",
"This means that every Python object is simply a cleverly-disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as ``x = 10000``, ``x`` is not just a \"raw\" integer. It's actually a pointer to a compound C structure, which contains several values.\n",
"Looking through the Python 3.4 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):\n",
"\n",
"```C\n",
"struct _longobject {\n",
" long ob_refcnt;\n",
" PyTypeObject *ob_type;\n",
" size_t ob_size;\n",
" long ob_digit[1];\n",
"};\n",
"```\n",
"\n",
"A single integer in Python 3.4 actually contains four pieces:\n",
"\n",
"- ``ob_refcnt``, a reference count that helps Python silently handle memory allocation and deallocation\n",
"- ``ob_type``, which encodes the type of the variable\n",
"- ``ob_size``, which specifies the size of the following data members\n",
"- ``ob_digit``, which contains the actual integer value that we expect the Python variable to represent.\n",
"\n",
"This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in the following figure:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Integer Memory Layout](figures/cint_vs_pyint.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here ``PyObject_HEAD`` is the part of the structure containing the reference count, type code, and other pieces mentioned before.\n",
"\n",
"Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value.\n",
"A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value.\n",
"This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically.\n",
"All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects."
]
},
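{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to get a rough feel for this overhead is to compare the size reported by ``sys.getsizeof`` for a Python integer with the item size of a fixed-width NumPy integer. This is a minimal sketch rather than a precise measurement: the exact byte count of the Python object depends on the Python build and platform, so no specific value is assumed for it here.\n",
"\n",
"```python\n",
"import sys\n",
"import numpy as np\n",
"\n",
"# Size in bytes of a full Python integer object (header plus digits)\n",
"py_int_bytes = sys.getsizeof(10000)\n",
"\n",
"# Size in bytes of a single raw 64-bit integer, as stored in a NumPy array\n",
"raw_int_bytes = np.dtype('int64').itemsize\n",
"\n",
"print(py_int_bytes, raw_int_bytes)\n",
"```\n",
"\n",
"On a typical CPython build the first number is several times larger than the second, which reflects the extra fields in the object header."
]
},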
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Python List Is More Than Just a List\n",
"\n",
"Let's consider now what happens when we use a Python data structure that holds many Python objects.\n",
"The standard mutable multi-element container in Python is the list.\n",
"We can create a list of integers as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L = list(range(10))\n",
"L"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"int"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(L[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or, similarly, a list of strings:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L2 = [str(c) for c in L]\n",
"L2"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"str"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(L2[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because of Python's dynamic typing, we can even create heterogeneous lists:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[bool, str, float, int]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L3 = [True, \"2\", 3.0, 4]\n",
"[type(item) for item in L3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information; that is, each item is a complete Python object.\n",
"In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array.\n",
"The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in the following figure:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Array Memory Layout](figures/array_vs_list.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the implementation level, the array essentially contains a single pointer to one contiguous block of data.\n",
"The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier.\n",
"Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type.\n",
"Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data."
]
},
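{
"cell_type": "markdown",
"metadata": {},
"source": [
"The storage difference described above can be estimated roughly in code. The sketch below is an illustration under stated assumptions, not a benchmark: it uses ``sys.getsizeof`` for the list and its elements and the ``nbytes`` attribute for the array, and the variable names are made up for this example.\n",
"\n",
"```python\n",
"import sys\n",
"import numpy as np\n",
"\n",
"n = 1000\n",
"py_list = list(range(n))\n",
"np_arr = np.arange(n, dtype=np.int64)\n",
"\n",
"# The list stores one pointer per element, and each element is a full object.\n",
"# CPython shares some small integer objects, so treat this sum as an estimate.\n",
"list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)\n",
"\n",
"# The array stores n raw 8-byte integers in one contiguous buffer\n",
"array_bytes = np_arr.nbytes\n",
"\n",
"print(list_bytes, array_bytes)\n",
"```\n",
"\n",
"The array figure covers only the data buffer; the array object itself adds a small, fixed amount of metadata on top."
]
},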
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fixed-Type Arrays in Python\n",
"\n",
"Python offers several different options for storing data in efficient, fixed-type data buffers.\n",
"The built-in ``array`` module (part of the Python standard library) can be used to create dense arrays of a uniform type:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import array\n",
"L = list(range(10))\n",
"A = array.array('i', L)\n",
"A"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here ``'i'`` is a type code indicating the contents are integers.\n",
"\n",
"Much more useful, however, is the ``ndarray`` object of the NumPy package.\n",
"While Python's ``array`` object provides efficient storage of array-based data, NumPy adds to this efficient *operations* on that data.\n",
"We will explore these operations in later sections; here we'll demonstrate several ways of creating a NumPy array.\n",
"\n",
"We'll start with the standard NumPy import, under the alias ``np``:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Arrays from Python Lists\n",
"\n",
"First, we can use ``np.array`` to create arrays from Python lists:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 4, 2, 5, 3])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# integer array:\n",
"np.array([1, 4, 2, 5, 3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember that unlike Python lists, NumPy arrays are constrained to contain elements of a single type.\n",
"If types do not match, NumPy will upcast if possible (here, integers are upcast to floating point):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 3.14, 4. , 2. , 3. ])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array([3.14, 4, 2, 3])"
]
},
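{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check which type NumPy settled on after upcasting, the ``dtype`` attribute of the result can be inspected. The following short sketch uses illustrative variable names and adds a complex value as a second case:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"mixed_int_float = np.array([3.14, 4, 2, 3])\n",
"mixed_with_complex = np.array([1, 2.0, 3 + 0j])\n",
"\n",
"print(mixed_int_float.dtype)      # float64\n",
"print(mixed_with_complex.dtype)   # complex128\n",
"```\n",
"\n",
"In both cases every element is stored using the single common type, so the integers no longer exist as integers in the resulting array."
]
},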
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to explicitly set the data type of the resulting array, we can use the ``dtype`` keyword:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1., 2., 3., 4.], dtype=float32)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array([1, 2, 3, 4], dtype='float32')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, unlike Python lists, NumPy arrays can explicitly be multi-dimensional; here's one way of initializing a multidimensional array using a list of lists:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[2, 3, 4],\n",
" [4, 5, 6],\n",
" [6, 7, 8]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# nested lists result in multi-dimensional arrays\n",
"np.array([range(i, i + 3) for i in [2, 4, 6]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inner lists are treated as rows of the resulting two-dimensional array."
]
},
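{
"cell_type": "markdown",
"metadata": {},
"source": [
"The row interpretation can be confirmed by looking at the ``ndim`` and ``shape`` attributes and by indexing the first row. This is a minimal sketch that rebuilds the same array under the illustrative name ``grid``:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"grid = np.array([range(i, i + 3) for i in [2, 4, 6]])\n",
"\n",
"print(grid.ndim)    # 2\n",
"print(grid.shape)   # (3, 3)\n",
"print(grid[0])      # [2 3 4]\n",
"```"
]
},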
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Arrays from Scratch\n",
"\n",
"Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy.\n",
"Here are several examples:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a length-10 integer array filled with zeros\n",
"np.zeros(10, dtype=int)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1., 1., 1., 1., 1.],\n",
" [ 1., 1., 1., 1., 1.],\n",
" [ 1., 1., 1., 1., 1.]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x5 floating-point array filled with ones\n",
"np.ones((3, 5), dtype=float)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 3.14, 3.14, 3.14, 3.14, 3.14],\n",
" [ 3.14, 3.14, 3.14, 3.14, 3.14],\n",
" [ 3.14, 3.14, 3.14, 3.14, 3.14]])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x5 array filled with 3.14\n",
"np.full((3, 5), 3.14)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create an array filled with a linear sequence\n",
"# Starting at 0, stopping before 20, stepping by 2\n",
"# (this is similar to the built-in range() function)\n",
"np.arange(0, 20, 2)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0. , 0.25, 0.5 , 0.75, 1. ])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create an array of five values evenly spaced between 0 and 1\n",
"np.linspace(0, 1, 5)"
]
},
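{
"cell_type": "markdown",
"metadata": {},
"source": [
"One detail worth noting about the two previous routines is how they treat the stop value. The sketch below (with illustrative variable names) compares the last element of each result: ``np.arange`` excludes the stop value, while ``np.linspace`` includes it by default.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"stepped = np.arange(0, 20, 2)    # stop value 20 is excluded\n",
"spaced = np.linspace(0, 1, 5)    # stop value 1 is included by default\n",
"\n",
"print(stepped[-1])   # 18\n",
"print(spaced[-1])    # 1.0\n",
"```"
]
},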
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.99844933, 0.52183819, 0.22421193],\n",
" [ 0.08007488, 0.45429293, 0.20941444],\n",
" [ 0.14360941, 0.96910973, 0.946117 ]])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x3 array of uniformly distributed\n",
"# random values between 0 and 1\n",
"np.random.random((3, 3))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1.51772646, 0.39614948, -0.10634696],\n",
" [ 0.25671348, 0.00732722, 0.37783601],\n",
" [ 0.68446945, 0.15926039, -0.70744073]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x3 array of normally distributed random values\n",
"# with mean 0 and standard deviation 1\n",
"np.random.normal(0, 1, (3, 3))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[2, 3, 4],\n",
" [5, 7, 8],\n",
" [0, 5, 0]])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x3 array of random integers in the interval [0, 10)\n",
"np.random.randint(0, 10, (3, 3))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1., 0., 0.],\n",
" [ 0., 1., 0.],\n",
" [ 0., 0., 1.]])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x3 identity matrix\n",
"np.eye(3)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1., 1., 1.])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create an uninitialized array of three floats (the default dtype)\n",
"# The values will be whatever happens to already exist at that memory location\n",
"np.empty(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NumPy Standard Data Types\n",
"\n",
"NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations.\n",
"Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.\n",
"\n",
"The standard NumPy data types are listed in the following table.\n",
"Note that when constructing an array, they can be specified using a string:\n",
"\n",
"```python\n",
"np.zeros(10, dtype='int16')\n",
"```\n",
"\n",
"Or using the associated NumPy object:\n",
"\n",
"```python\n",
"np.zeros(10, dtype=np.int16)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Data type | Description |\n",
"|---------------|-------------|\n",
"| ``bool_`` | Boolean (True or False) stored as a byte |\n",
"| ``int_`` | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| \n",
"| ``intc`` | Identical to C ``int`` (normally ``int32`` or ``int64``)| \n",
"| ``intp`` | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| \n",
"| ``int8`` | Byte (-128 to 127)| \n",
"| ``int16`` | Integer (-32768 to 32767)|\n",
"| ``int32`` | Integer (-2147483648 to 2147483647)|\n",
"| ``int64`` | Integer (-9223372036854775808 to 9223372036854775807)| \n",
"| ``uint8`` | Unsigned integer (0 to 255)| \n",
"| ``uint16`` | Unsigned integer (0 to 65535)| \n",
"| ``uint32`` | Unsigned integer (0 to 4294967295)| \n",
"| ``uint64`` | Unsigned integer (0 to 18446744073709551615)| \n",
"| ``float_`` | Shorthand for ``float64``.| \n",
"| ``float16`` | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| \n",
"| ``float32`` | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| \n",
"| ``float64`` | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| \n",
"| ``complex_`` | Shorthand for ``complex128``.| \n",
"| ``complex64`` | Complex number, represented by two 32-bit floats| \n",
"| ``complex128``| Complex number, represented by two 64-bit floats| "
]
},
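{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ranges and bit layouts in the table can be checked programmatically. The following sketch uses NumPy's ``np.iinfo`` and ``np.finfo`` helpers, which report the limits of integer and floating-point types; the variable names are illustrative only.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"int8_info = np.iinfo(np.int8)\n",
"uint16_info = np.iinfo(np.uint16)\n",
"float32_info = np.finfo(np.float32)\n",
"\n",
"print(int8_info.min, int8_info.max)   # -128 127\n",
"print(uint16_info.max)                # 65535\n",
"print(float32_info.bits)              # 32\n",
"print(float32_info.nmant)             # 23\n",
"```\n",
"\n",
"Checking limits this way can help when choosing the smallest type that safely holds a given range of values."
]
},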
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the [NumPy documentation](http://numpy.org/).\n",
"NumPy also supports compound data types, which will be covered in [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Introduction to NumPy](02.00-Introduction-to-NumPy.ipynb) | [Contents](Index.ipynb) | [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb) >"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

783
numpy/02.08-Sorting.ipynb Normal file

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,600 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
"\n",
"*No changes were made to the contents of this notebook from the original.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Sorting Arrays](02.08-Sorting.ipynb) | [Contents](Index.ipynb) | [Data Manipulation with Pandas](03.00-Introduction-to-Pandas.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Structured Data: NumPy's Structured Arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy's *structured arrays* and *record arrays*, which provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas ``DataFrame``s, which we'll explore in [Chapter 3](03.00-Introduction-to-Pandas.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we'd like to store these values for use in a Python program.\n",
"It would be possible to store these in three separate arrays:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"name = ['Alice', 'Bob', 'Cathy', 'Doug']\n",
"age = [25, 45, 37, 19]\n",
"weight = [55.0, 85.5, 68.0, 61.5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this is a bit clumsy. There's nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data.\n",
"NumPy can handle this through structured arrays, which are arrays with compound data types.\n",
"\n",
"Recall that previously we created a simple array using an expression like this:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"x = np.zeros(4, dtype=int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can similarly create a structured array using a compound data type specification:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]\n"
]
}
],
"source": [
"# Use a compound data type for structured arrays\n",
"data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),\n",
" 'formats':('U10', 'i4', 'f8')})\n",
"print(data.dtype)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here ``'U10'`` translates to \"Unicode string of maximum length 10,\" ``'i4'`` translates to \"4-byte (i.e., 32 bit) integer,\" and ``'f8'`` translates to \"8-byte (i.e., 64 bit) float.\"\n",
"We'll discuss other options for these type codes in the following section.\n",
"\n",
"Now that we've created an empty container array, we can fill the array with our lists of values:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0)\n",
" ('Doug', 19, 61.5)]\n"
]
}
],
"source": [
"data['name'] = name\n",
"data['age'] = age\n",
"data['weight'] = weight\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we had hoped, the data is now arranged together in one convenient block of memory.\n",
"\n",
"The handy thing with structured arrays is that you can now refer to values either by index or by name:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['Alice', 'Bob', 'Cathy', 'Doug'], \n",
" dtype='<U10')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get all names\n",
"data['name']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"('Alice', 25, 55.0)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get first row of data\n",
"data[0]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Doug'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the name from the last row\n",
"data[-1]['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['Alice', 'Doug'], \n",
" dtype='<U10')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get names where age is under 30\n",
"data[data['age'] < 30]['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that if you'd like to do any operations that are any more complicated than these, you should probably consider the Pandas package, covered in the next chapter.\n",
"As we'll see, Pandas provides a ``Dataframe`` object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we've shown here, as well as much, much more."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Structured Arrays\n",
"\n",
"Structured array data types can be specified in a number of ways.\n",
"Earlier, we saw the dictionary method:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.dtype({'names':('name', 'age', 'weight'),\n",
" 'formats':('U10', 'i4', 'f8')})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For clarity, numerical types can be specified using Python types or NumPy ``dtype``s instead:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.dtype({'names':('name', 'age', 'weight'),\n",
" 'formats':((np.str_, 10), int, np.float32)})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A compound type can also be specified as a list of tuples:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the names of the types do not matter to you, you can specify the types alone in a comma-separated string:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.dtype('S10,i4,f8')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shortened string format codes may seem confusing, but they are built on simple principles.\n",
"The first (optional) character is ``<`` or ``>``, which means \"little endian\" or \"big endian,\" respectively, and specifies the ordering convention for significant bits.\n",
"The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below).\n",
"The last character or characters represents the size of the object in bytes.\n",
"\n",
"| Character | Description | Example |\n",
"| --------- | ----------- | ------- | \n",
"| ``'b'`` | Byte | ``np.dtype('b')`` |\n",
"| ``'i'`` | Signed integer | ``np.dtype('i4') == np.int32`` |\n",
"| ``'u'`` | Unsigned integer | ``np.dtype('u1') == np.uint8`` |\n",
"| ``'f'`` | Floating point | ``np.dtype('f8') == np.int64`` |\n",
"| ``'c'`` | Complex floating point| ``np.dtype('c16') == np.complex128``|\n",
"| ``'S'``, ``'a'`` | String | ``np.dtype('S5')`` |\n",
"| ``'U'`` | Unicode string | ``np.dtype('U') == np.str_`` |\n",
"| ``'V'`` | Raw data (void) | ``np.dtype('V') == np.void`` |"
]
},
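{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the format codes in the table above, here is a minimal sketch that inspects a ``dtype`` built from a code string. It assumes only the standard ``byteorder``, ``kind``, and ``itemsize`` attributes of ``np.dtype``; the variable name ``dt`` is arbitrary:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Build a dtype from a code string: big-endian, signed integer, 4 bytes\n",
"dt = np.dtype('>i4')\n",
"\n",
"# '>' on a little-endian machine; NumPy reports '=' when the\n",
"# requested order matches the machine's native order\n",
"print(dt.byteorder)\n",
"print(dt.kind)        # 'i' (signed integer)\n",
"print(dt.itemsize)    # 4 (bytes)\n",
"\n",
"# Codes without an explicit byte order compare equal to NumPy scalar types\n",
"print(np.dtype('i4') == np.int32)         # True\n",
"print(np.dtype('f8') == np.float64)       # True\n",
"print(np.dtype('c16') == np.complex128)   # True\n",
"```\n",
"\n",
"Reading the three attributes back is a convenient way to confirm that a code string means what you intended before using it in a compound type."
]
},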
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More Advanced Compound Types\n",
"\n",
"It is possible to define even more advanced compound types.\n",
"For example, you can create a type where each element contains an array or matrix of values.\n",
"Here, we'll create a data type with a ``mat`` component consisting of a $3\\times 3$ floating-point matrix:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])\n",
"[[ 0. 0. 0.]\n",
" [ 0. 0. 0.]\n",
" [ 0. 0. 0.]]\n"
]
}
],
"source": [
"tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])\n",
"X = np.zeros(1, dtype=tp)\n",
"print(X[0])\n",
"print(X['mat'][0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now each element in the ``X`` array consists of an ``id`` and a $3\\times 3$ matrix.\n",
"Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary?\n",
"The reason is that this NumPy ``dtype`` directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program.\n",
"If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful!"
]
},
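{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the C-structure correspondence concrete, here is a minimal sketch that reads a record's raw bytes back with Python's standard ``struct`` module. It assumes NumPy's default packed layout (no alignment padding) and native byte order; the format string ``'=q9d'`` (one 64-bit integer followed by nine doubles) is our own choice to match ``tp``, and the values stored in ``X`` are arbitrary:\n",
"\n",
"```python\n",
"import struct\n",
"import numpy as np\n",
"\n",
"tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])\n",
"X = np.zeros(1, dtype=tp)\n",
"X['id'] = 7\n",
"X['mat'][0] = np.arange(9).reshape(3, 3)\n",
"\n",
"# Total record size and the byte offset of the 'mat' field\n",
"print(tp.itemsize)           # 80 = 8 + 9 * 8\n",
"print(tp.fields['mat'][1])   # 8\n",
"\n",
"# Interpret the raw buffer as one int64 followed by nine doubles\n",
"record = struct.unpack('=q9d', X.tobytes())\n",
"print(record[0])     # 7\n",
"print(record[1:4])   # (0.0, 1.0, 2.0)\n",
"```\n",
"\n",
"A C program declaring a structure with a 64-bit integer followed by a 3-by-3 array of doubles could read the same 80 bytes, provided its compiler uses the same layout and byte order."
]
},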
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## RecordArrays: Structured Arrays with a Twist\n",
"\n",
"NumPy also provides the ``np.recarray`` class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys.\n",
"Recall that we previously accessed the ages by writing:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([25, 45, 37, 19], dtype=int32)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['age']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we view our data as a record array instead, we can access this with slightly fewer keystrokes:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([25, 45, 37, 19], dtype=int32)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_rec = data.view(np.recarray)\n",
"data_rec.age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The downside is that for record arrays, there is some extra overhead involved in accessing the fields, even when using the same syntax. We can see this here:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000000 loops, best of 3: 241 ns per loop\n",
"100000 loops, best of 3: 4.61 µs per loop\n",
"100000 loops, best of 3: 7.27 µs per loop\n"
]
}
],
"source": [
"%timeit data['age']\n",
"%timeit data_rec['age']\n",
"%timeit data_rec.age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whether the more convenient notation is worth the additional overhead will depend on your own application."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## On to Pandas\n",
"\n",
"This section on structured and record arrays is purposely at the end of this chapter, because it leads so well into the next package we will cover: Pandas.\n",
"Structured arrays like the ones discussed here are good to know about for certain situations, especially in case you're using NumPy arrays to map onto binary data formats in C, Fortran, or another language.\n",
"For day-to-day use of structured data, the Pandas package is a much better choice, and we'll dive into a full discussion of it in the chapter that follows."
]
},
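{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small preview of that hand-off, and assuming Pandas is installed, a structured array can be passed directly to the ``pd.DataFrame`` constructor, which uses the field names as column labels. The sketch below rebuilds the ``data`` array from earlier in this section so that it runs on its own:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Recreate the structured array used throughout this section\n",
"data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),\n",
"                          'formats': ('U10', 'i4', 'f8')})\n",
"data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']\n",
"data['age'] = [25, 45, 37, 19]\n",
"data['weight'] = [55.0, 85.5, 68.0, 61.5]\n",
"\n",
"# Each field of the structured array becomes a labeled column\n",
"df = pd.DataFrame(data)\n",
"print(list(df.columns))                       # ['name', 'age', 'weight']\n",
"\n",
"# The same age filter as before, written against the DataFrame\n",
"print(df[df['age'] < 30]['name'].tolist())    # ['Alice', 'Doug']\n",
"```\n",
"\n",
"The Boolean-mask syntax carries over unchanged, which is one reason structured arrays make a natural stepping stone to Pandas."
]
},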
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Sorting Arrays](02.08-Sorting.ipynb) | [Contents](Index.ipynb) | [Data Manipulation with Pandas](03.00-Introduction-to-Pandas.ipynb) >"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -0,0 +1,164 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
"\n",
"*No changes were made to the contents of this notebook from the original.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) | [Contents](Index.ipynb) | [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Manipulation with Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous chapter, we dove into detail on NumPy and its ``ndarray`` object, which provides efficient storage and manipulation of dense typed arrays in Python.\n",
"Here we'll build on this knowledge by looking in detail at the data structures provided by the Pandas library.\n",
"Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a ``DataFrame``.\n",
"``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.\n",
"As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.\n",
"\n",
"As we saw, NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.\n",
"While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.\n",
"Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of \"data munging\" tasks that occupy much of a data scientist's time.\n",
"\n",
"In this chapter, we will focus on the mechanics of using ``Series``, ``DataFrame``, and related structures effectively.\n",
"We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing and Using Pandas\n",
"\n",
"Installation of Pandas on your system requires NumPy to be installed, and if building the library from source, requires the appropriate tools to compile the C and Cython sources on which Pandas is built.\n",
"Details on this installation can be found in the [Pandas documentation](http://pandas.pydata.org/).\n",
"If you followed the advice outlined in the [Preface](00.00-Preface.ipynb) and used the Anaconda stack, you already have Pandas installed.\n",
"\n",
"Once Pandas is installed, you can import it and check the version:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'0.18.1'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas\n",
"pandas.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This import convention will be used throughout the remainder of this book."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reminder about Built-In Documentation\n",
"\n",
"As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ``?`` character). (Refer back to [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb) if you need a refresher on this.)\n",
"\n",
"For example, to display all the contents of the pandas namespace, you can type\n",
"\n",
"```ipython\n",
"In [3]: pd.<TAB>\n",
"```\n",
"\n",
"And to display Pandas's built-in documentation, you can use this:\n",
"\n",
"```ipython\n",
"In [4]: pd?\n",
"```\n",
"\n",
"More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/."
]
},
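{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outside of IPython, plain Python offers rough equivalents through the built-in ``dir`` function and docstrings. The sketch below is an approximation of what tab completion and ``?`` display, not a reproduction of IPython's behavior; the name ``public_names`` is our own:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Roughly what pd.<TAB> lists: the public names in the pandas namespace\n",
"public_names = [n for n in dir(pd) if not n.startswith('_')]\n",
"print(len(public_names))\n",
"print('DataFrame' in public_names)   # True\n",
"\n",
"# Roughly what pd.DataFrame? shows: the beginning of the docstring\n",
"print(pd.DataFrame.__doc__[:200])\n",
"```\n",
"\n",
"The built-in ``help(pd.DataFrame)`` prints the full documentation in the same spirit."
]
},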
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) | [Contents](Index.ipynb) | [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) >"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,76 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
"\n",
"*No changes were made to the contents of this notebook from the original.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) | [Contents](Index.ipynb) | [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Further Resources\n",
"\n",
"In this chapter, we've covered many of the basics of using Pandas effectively for data analysis.\n",
"Still, much has been omitted from our discussion.\n",
"To learn more about Pandas, I recommend the following resources:\n",
"\n",
"- [Pandas online documentation](http://pandas.pydata.org/): This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the description of the options is complete and generally very useful for understanding the use of various functions.\n",
"\n",
"- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do) Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. Keep in mind, though, that the book is now several years old, and the Pandas package has quite a few new features that this book does not cover (but be on the lookout for a new edition in 2017).\n",
"\n",
"- [Stack Overflow](http://stackoverflow.com/questions/tagged/pandas): Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you're coming acrossmore than likely you'll find your answer on a Stack Overflow page.\n",
"\n",
"- [Pandas on PyVideo](http://pyvideo.org/search?q=pandas): From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. The PyCon tutorials in particular tend to be given by very well-vetted presenters.\n",
"\n",
"Using these resources, combined with the walk-through given in this chapter, my hope is that you'll be poised to use Pandas to tackle any data analysis problem you come across!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) | [Contents](Index.ipynb) | [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) >"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -1,28 +0,0 @@
backports.ssl-match-hostname==3.4.0.2
certifi==2015.4.28
functools32==3.2.3.post1
gnureadline==6.3.3
ipython==3.2.0
Jinja2==2.7.3
jsonschema==2.5.1
MarkupSafe==0.23
matplotlib==1.4.3
mistune==0.6
mock==1.0.1
nose==1.3.7
numpy==1.9.2
pandas==0.16.2
ptyprocess==0.5
Pygments==2.0.2
pyparsing==2.0.3
python-dateutil==2.4.2
pytz==2015.4
pyzmq==14.7.0
scikit-learn==0.16.1
scipy==0.15.1
seaborn==0.6.0
six==1.9.0
sympy==0.7.6
terminado==0.5
tornado==4.2
wheel==0.24.0