{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was prepared by [Donne Martin](http://donnemartin.com). Source and license info is on [GitHub](https://github.com/donnemartin/data-science-ipython-notebooks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas\n", "\n", "Credits: The following are notes taken while working through [Python for Data Analysis](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793) by Wes McKinney\n", "\n", "* Series\n", "* DataFrame\n", "* Reindexing\n", "* Dropping Entries\n", "* Indexing, Selecting, Filtering\n", "* Arithmetic and Data Alignment\n", "* Function Application and Mapping\n", "* Sorting and Ranking\n", "* Axis Indices with Duplicate Values\n", "* Summarizing and Computing Descriptive Statistics\n", "* Cleaning Data (Under Construction)\n", "* Input and Output (Under Construction)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pandas import Series, DataFrame\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series\n", "\n", "A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels. The data can be any NumPy data type and the labels are the Series' index." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a Series:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 2\n", "3 -3\n", "4 -5\n", "5 8\n", "6 13\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_1 = Series([1, 1, 2, -3, -5, 8, 13])\n", "ser_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the array representation of a Series:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1, 1, 2, -3, -5, 8, 13])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_1.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Index objects are immutable and hold the axis labels plus metadata such as the axis name.\n", "\n", "Get the index of the Series:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_1.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a Series with a custom index:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "c 2\n", "d -3\n", "e -5\n", "dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2 = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])\n", "ser_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a value from a Series by position or by label:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, 
"execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[4] == ser_2['e']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a set of values from a Series by passing in a list:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "c 2\n", "a 1\n", "b 1\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[['c', 'a', 'b']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get values greater than 0:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "c 2\n", "dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[ser_2 > 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scalar multiply:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 2\n", "b 2\n", "c 4\n", "d -6\n", "e -10\n", "dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2 * 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply a NumPy math function:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 2.718282\n", "b 2.718282\n", "c 7.389056\n", "d 0.049787\n", "e 0.006738\n", "dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.exp(ser_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series is like a fixed-length, ordered dict. 
\n", "\n", "Create a Series by passing in a dict:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "bar 200\n", "baz 300\n", "foo 100\n", "dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}\n", "ser_3 = Series(dict_1)\n", "ser_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Re-order a Series by passing in an index (labels not found in the dict become NaN):" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo 100\n", "bar 200\n", "baz 300\n", "qux NaN\n", "dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = ['foo', 'bar', 'baz', 'qux']\n", "ser_4 = Series(dict_1, index=index)\n", "ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for NaN with the pandas function:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo False\n", "bar False\n", "baz False\n", "qux True\n", "dtype: bool" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.isnull(ser_4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for NaN with the Series method:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo False\n", "bar False\n", "baz False\n", "qux True\n", "dtype: bool" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series automatically aligns differently indexed data in arithmetic operations:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { 
"data": { "text/plain": [ "bar 400\n", "baz 600\n", "foo 200\n", "qux NaN\n", "dtype: float64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_3 + ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Name a Series:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ser_4.name = 'foobarbazqux'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Name a Series index:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ser_4.index.name = 'label'" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "label\n", "foo 100\n", "bar 200\n", "baz 300\n", "qux NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rename a Series' index in place:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fo 100\n", "br 200\n", "bz 300\n", "qx NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4.index = ['fo', 'br', 'bz', 'qx']\n", "ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame\n", "\n", "A DataFrame is a tabular data structure containing an ordered collection of columns. Each column can have a different type. A DataFrame has both row and column indices and is analogous to a dict of Series. Row and column operations are treated roughly symmetrically. A column returned when indexing a DataFrame is a view of the underlying data, not a copy. 
To obtain a copy, use the Series' copy method.\n", "\n", "Create a DataFrame:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pop state year\n", "0 5.0 VA 2012\n", "1 5.1 VA 2013\n", "2 5.2 VA 2014\n", "3 4.0 MD 2014\n", "4 4.1 MD 2015" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],\n", " 'year' : [2012, 2013, 2014, 2014, 2015],\n", " 'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}\n", "df_1 = DataFrame(data_1)\n", "df_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a DataFrame specifying a sequence of columns:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop\n", "0 2012 VA 5.0\n", "1 2013 VA 5.1\n", "2 2014 VA 5.2\n", "3 2014 MD 4.0\n", "4 2015 MD 4.1" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2 = DataFrame(data_1, columns=['year', 'state', 'pop'])\n", "df_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like Series, columns that are not present in the data are NaN:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 NaN\n", "1 2013 VA 5.1 NaN\n", "2 2014 VA 5.2 NaN\n", "3 2014 MD 4.0 NaN\n", "4 2015 MD 4.1 NaN" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3 = DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve a column by key, returning a Series:\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 VA\n", "1 VA\n", "2 VA\n", "3 MD\n", "4 MD\n", "Name: state, dtype: object" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3['state']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve a column by attribute, returning a Series:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2012\n", "1 2013\n", "2 2014\n", "3 2014\n", "4 2015\n", "Name: year, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve a row by position:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "year 2012\n", "state VA\n", "pop 5\n", "unempl NaN\n", "Name: 0, dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Update a column by assignment:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 0\n", "1 2013 VA 5.1 1\n", "2 2014 VA 5.2 2\n", "3 2014 MD 4.0 3\n", "4 2015 MD 4.1 4" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3['unempl'] = np.arange(5)\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assign a Series to a column. A Series is aligned on the DataFrame's index and unmatched rows become NaN, whereas a list or array must match the DataFrame's length:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 NaN\n", "1 2013 VA 5.1 NaN\n", "2 2014 VA 5.2 6.0\n", "3 2014 MD 4.0 6.0\n", "4 2015 MD 4.1 6.1" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unempl = Series([6.0, 6.0, 6.1], index=[2, 3, 4])\n", "df_3['unempl'] = unempl\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assigning to a column that doesn't exist creates a new column:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop unempl state_dup\n", "0 2012 VA 5.0 NaN VA\n", "1 2013 VA 5.1 NaN VA\n", "2 2014 VA 5.2 6.0 VA\n", "3 2014 MD 4.0 6.0 MD\n", "4 2015 MD 4.1 6.1 MD" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3['state_dup'] = df_3['state']\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Delete a column:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 NaN\n", "1 2013 VA 5.1 NaN\n", "2 2014 VA 5.2 6.0\n", "3 2014 MD 4.0 6.0\n", "4 2015 MD 4.1 6.1" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del df_3['state_dup']\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " MD VA\n", "2013 NaN 5.1\n", "2014 4.0 5.2\n", "2015 4.1 NaN" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pop = {'VA' : {2013 : 5.1, 2014 : 5.2},\n", " 'MD' : {2014 : 4.0, 2015 : 4.1}}\n", "df_4 = DataFrame(pop)\n", "df_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transpose the DataFrame:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 2013 2014 2015\n", "MD NaN 4.0 4.1\n", "VA 5.1 5.2 NaN" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_4.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a DataFrame from a dict of Series:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " MD VA\n", "2014 NaN 5.2\n", "2015 4.1 NaN" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_2 = {'VA' : df_4['VA'][1:],\n", " 'MD' : df_4['MD'][2:]}\n", "df_5 = DataFrame(data_2)\n", "df_5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the DataFrame index name:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " MD VA\n", "year \n", "2014 NaN 5.2\n", "2015 4.1 NaN" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_5.index.name = 'year'\n", "df_5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the DataFrame columns name:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "state MD VA\n", "year \n", "2014 NaN 5.2\n", "2015 4.1 NaN" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_5.columns.name = 'state'\n", "df_5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return the data contained in a DataFrame as a 2D ndarray:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ nan, 5.2],\n", " [ 4.1, nan]])" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_5.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the columns have different dtypes, the 2D ndarray's dtype is chosen to accommodate all of the columns:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[2012, 'VA', 5.0, nan],\n", " [2013, 'VA', 5.1, nan],\n", " [2014, 'VA', 5.2, 6.0],\n", " [2014, 'MD', 4.0, 6.0],\n", " [2015, 'MD', 4.1, 6.1]], dtype=object)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reindexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a new object with the data conformed to a new index. Any missing values are set to NaN." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 NaN\n", "1 2013 VA 5.1 NaN\n", "2 2014 VA 5.2 6.0\n", "3 2014 MD 4.0 6.0\n", "4 2015 MD 4.1 6.1" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reindexing rows returns a new frame with the specified index:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop unempl\n", "5 NaN NaN NaN NaN\n", "4 2015 MD 4.1 6.1\n", "3 2014 MD 4.0 6.0\n", "2 2014 VA 5.2 6.0\n", "1 2013 VA 5.1 NaN\n", "0 2012 VA 5.0 NaN" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.reindex(list(reversed(range(0, 6))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values can be set to something other than NaN:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year state pop unempl\n", "5 0 0 0.0 0.0\n", "4 2015 MD 4.1 6.1\n", "3 2014 MD 4.0 6.0\n", "2 2014 VA 5.2 6.0\n", "1 2013 VA 5.1 NaN\n", "0 2012 VA 5.0 NaN" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.reindex(range(5, -1, -1), fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fill gaps in ordered data, such as a time series, by forward- or backward-filling while reindexing:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ser_5 = Series(['foo', 'bar', 'baz'], index=[0, 2, 4])" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 foo\n", "1 foo\n", "2 bar\n", "3 bar\n", "4 baz\n", "dtype: object" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_5.reindex(range(5), method='ffill')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 foo\n", "1 bar\n", "2 bar\n", "3 baz\n", "4 baz\n", "dtype: object" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_5.reindex(range(5), method='bfill')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reindex columns:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reindex rows and columns while filling rows:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "5 0 0.0 0.0 0\n", "4 MD 4.1 6.1 2015\n", "3 MD 4.0 6.0 2014\n", "2 VA 5.2 6.0 2014\n", "1 VA 5.1 NaN 2013\n", "0 VA 5.0 NaN 2012" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.reindex(index=list(reversed(range(0, 6))),\n", " fill_value=0,\n", " columns=['state', 'pop', 'unempl', 'year'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reindex using ix (deprecated in pandas 0.20 and removed in 1.0, where reindex is the replacement):" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6 = df_3.ix[range(0, 7), ['state', 'pop', 'unempl', 'year']]\n", "df_6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dropping Entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop rows from a Series or DataFrame:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_7 = df_6.drop([0, 1])\n", "df_7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop columns from a DataFrame:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop year\n", "2 VA 5.2 2014\n", "3 MD 4.0 2014\n", "4 MD 4.1 2015\n", "5 NaN NaN NaN\n", "6 NaN NaN NaN" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_7 = df_7.drop('unempl', axis=1)\n", "df_7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing, Selecting, Filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "c 2\n", "d -3\n", "e -5\n", "dtype: int64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a value from a Series:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[0] == ser_2['a']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice from a Series:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1\n", "c 2\n", "d -3\n", "dtype: int64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[1:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select specific values from a Series:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1\n", "c 2\n", "d -3\n", "dtype: int64" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[['b', 'c', 'd']]" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Select from a Series based on a filter:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "c 2\n", "dtype: int64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[ser_2 > 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice from a Series with labels (note the end point is inclusive):" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "dtype: int64" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2['a':'b']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assign to a Series slice (note the end point is inclusive):" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 0\n", "c 2\n", "d -3\n", "e -5\n", "dtype: int64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2['a':'b'] = 0\n", "ser_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas supports indexing into a DataFrame." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select specified columns from a DataFrame:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pop unempl\n", "0 5.0 NaN\n", "1 5.1 NaN\n", "2 5.2 6.0\n", "3 4.0 6.0\n", "4 4.1 6.1\n", "5 NaN NaN\n", "6 NaN NaN" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6[['pop', 'unempl']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice from a DataFrame:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select from a DataFrame based on a filter:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6 2014" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6[df_6['pop'] > 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform a scalar comparison on a DataFrame:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "0 True False False True\n", "1 True True False True\n", "2 True True True True\n", "3 True False True True\n", "4 True False True True\n", "5 True False False False\n", "6 True False False False" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6 > 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform a scalar comparison on a DataFrame, retain the values that pass the filter:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "0 VA NaN NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD NaN 6.0 2014\n", "4 MD NaN 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6[df_6 > 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice of rows from a DataFrame (note the end point is inclusive):" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "2 VA 5.2 6 2014\n", "3 MD 4.0 6 2014" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.ix[2:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice of rows from a specific column of a DataFrame:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 5.0\n", "1 5.1\n", "2 5.2\n", "Name: pop, dtype: float64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.ix[0:2, 'pop']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select rows based on an arithmetic operation on a specific row:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state pop unempl year\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.ix[df_6.unempl > 5.0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arithmetic and Data Alignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding Series objects results in the union of index pairs if the pairs are not the same, resulting in NaN for indices that do not overlap:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1.764052\n", "b 0.400157\n", "c 0.978738\n", "d 2.240893\n", "e 1.867558\n", "dtype: float64" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(0)\n", "ser_6 = Series(np.random.randn(5),\n", " index=['a', 'b', 'c', 'd', 'e'])\n", "ser_6" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1.624345\n", "c -0.611756\n", "e -0.528172\n", "f -1.072969\n", "g 0.865408\n", "dtype: float64" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(1)\n", "ser_7 = Series(np.random.randn(5),\n", " index=['a', 'c', 'e', 'f', 'g'])\n", "ser_7" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 3.388398\n", "b NaN\n", "c 0.366982\n", "d NaN\n", "e 1.339386\n", "f NaN\n", "g NaN\n", "dtype: float64" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_6 + ser_7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set a fill value instead of NaN for indices that do not overlap:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 
3.388398\n", "b 0.400157\n", "c 0.366982\n", "d 2.240893\n", "e 1.339386\n", "f -1.072969\n", "g 0.865408\n", "dtype: float64" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_6.add(ser_7, fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding DataFrame objects aligns on both the row index and the columns. The result takes the union of each, with NaN wherever a label is missing from either DataFrame:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 0.548814 0.715189 0.602763\n", "1 0.544883 0.423655 0.645894\n", "2 0.437587 0.891773 0.963663" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(0)\n", "df_8 = DataFrame(np.random.rand(9).reshape((3, 3)),\n", " columns=['a', 'b', 'c'])\n", "df_8" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " b c d\n", "0 0.417022 0.720324 0.000114\n", "1 0.302333 0.146756 0.092339\n", "2 0.186260 0.345561 0.396767" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(1)\n", "df_9 = DataFrame(np.random.rand(9).reshape((3, 3)),\n", " columns=['b', 'c', 'd'])\n", "df_9" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "0 NaN 1.132211 1.323088 NaN\n", "1 NaN 0.725987 0.792650 NaN\n", "2 NaN 1.078033 1.309223 NaN" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_8 + df_9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set a fill value instead of NaN for indices that do not overlap:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "0 0.548814 1.132211 1.323088 0.000114\n", "1 0.544883 0.725987 0.792650 0.092339\n", "2 0.437587 1.078033 1.309223 0.396767" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_10 = df_8.add(df_9, fill_value=0)\n", "df_10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like NumPy, pandas supports arithmetic operations between DataFrames and Series.\n", "\n", "Match the index of the Series on the DataFrame's columns, broadcasting down the rows:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "0 0.000000 0.000000 0.000000 0.000000\n", "1 -0.003930 -0.406224 -0.530438 0.092224\n", "2 -0.111226 -0.054178 -0.013864 0.396653" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_8 = df_10.ix[0]\n", "df_11 = df_10 - ser_8\n", "df_11" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Match the index of the Series on the DataFrame's columns, broadcasting down the rows and union the indices that do not match:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "d 1\n", "e 2\n", "dtype: int64" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_9 = Series(range(3), index=['a', 'd', 'e'])\n", "ser_9" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d e\n", "0 0.000000 NaN NaN -1.000000 NaN\n", "1 -0.003930 NaN NaN -0.907776 NaN\n", "2 -0.111226 NaN NaN -0.603347 NaN" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_11 - ser_9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "0 0.548814 1.132211 1.323088 0.000114\n", "1 0.544883 0.725987 0.792650 0.092339\n", "2 0.437587 1.078033 1.309223 0.396767" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_10" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 100\n", "1 200\n", "2 300\n", "dtype: int64" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_10 = Series([100, 200, 300])\n", "ser_10" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "0 -99.451186 -98.867789 -98.676912 -99.999886\n", "1 -199.455117 -199.274013 -199.207350 -199.907661\n", "2 -299.562413 -298.921967 -298.690777 -299.603233" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_10.sub(ser_10, axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Function Application and Mapping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy ufuncs (element-wise array methods) operate on pandas objects:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "0 0.000000 0.000000 0.000000 0.000000\n", "1 0.003930 0.406224 0.530438 0.092224\n", "2 0.111226 0.054178 0.013864 0.396653" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_11 = np.abs(df_11)\n", "df_11" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply a function on 1D arrays to each column:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.111226\n", "b 0.406224\n", "c 0.530438\n", "d 0.396653\n", "dtype: float64" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "func_1 = lambda x: x.max() - x.min()\n", "df_11.apply(func_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply a function on 1D arrays to each row:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.000000\n", "1 0.526508\n", "2 0.382789\n", "dtype: float64" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_11.apply(func_1, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply a function and return a DataFrame:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "min 0.000000 0.000000 0.000000 0.000000\n", "max 0.111226 0.406224 0.530438 0.396653" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "func_2 = lambda x: Series([x.min(), x.max()], index=['min', 'max'])\n", "df_11.apply(func_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply an element-wise Python function to a DataFrame:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "0 0.00 0.00 0.00 0.00\n", "1 0.00 0.41 0.53 0.09\n", "2 0.11 0.05 0.01 0.40" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "func_3 = lambda x: '%.2f' %x\n", "df_11.applymap(func_3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply an element-wise Python function to a Series:" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.00\n", "1 0.00\n", "2 0.11\n", "Name: a, dtype: object" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_11['a'].map(func_3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting and Ranking" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fo 100\n", "br 200\n", "bz 300\n", "qx NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a Series by its index:" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "br 200\n", "bz 300\n", "fo 100\n", "qx NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a Series by its values:" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fo 100\n", "br 200\n", "bz 300\n", "qx NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4.sort_values()" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": 
false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c a b d\n", "three 0 1 2 3\n", "one 4 5 6 7\n", "two 8 9 10 11" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_12 = DataFrame(np.arange(12).reshape((3, 4)),\n", " index=['three', 'one', 'two'],\n", " columns=['c', 'a', 'b', 'd'])\n", "df_12" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a DataFrame by its index:" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c a b d\n", "one 4 5 6 7\n", "three 0 1 2 3\n", "two 8 9 10 11" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_12.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a DataFrame by columns in descending order:" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " d c b a\n", "three 3 0 2 1\n", "one 7 4 6 5\n", "two 11 8 10 9" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_12.sort_index(axis=1, ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a DataFrame's values by column:" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c a b d\n", "three 0 1 2 3\n", "one 4 5 6 7\n", "two 8 9 10 11" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_12.sort_values(by=['d', 'c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ranking is similar to numpy.argsort except that ties are broken by assigning each group the mean rank:" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 -5\n", "5 0\n", "4 2\n", "3 4\n", "6 4\n", "0 7\n", "2 7\n", "7 7\n", "dtype: int64" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_11 = Series([7, -5, 7, 4, 2, 0, 4, 7])\n", "ser_11 = ser_11.sort_values()\n", "ser_11" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 1.0\n", "5 2.0\n", "4 3.0\n", "3 4.5\n", "6 4.5\n", "0 7.0\n", "2 7.0\n", "7 7.0\n", "dtype: float64" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_11.rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank a Series according to when they appear in the data:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 1\n", "5 2\n", "4 3\n", "3 4\n", "6 5\n", "0 6\n", "2 7\n", "7 8\n", "dtype: float64" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_11.rank(method='first')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank a Series in descending order, using the maximum rank for the group:" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 8\n", "5 7\n", "4 6\n", "3 5\n", "6 5\n", "0 3\n", "2 3\n", "7 3\n", "dtype: float64" ] }, "execution_count": 106, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "ser_11.rank(ascending=False, method='max')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DataFrames can rank over rows or columns." ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " bar baz foo\n", "0 -5 -1 7\n", "1 4 2 -5\n", "2 2 3 7\n", "3 0 0 4\n", "4 4 5 2\n", "5 7 9 0\n", "6 7 9 4\n", "7 8 5 7" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_13 = DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],\n", " 'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],\n", " 'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})\n", "df_13" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank a DataFrame over rows:" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " bar baz foo\n", "0 1.0 1.0 7.0\n", "1 4.5 3.0 1.0\n", "2 3.0 4.0 7.0\n", "3 2.0 2.0 4.5\n", "4 4.5 5.5 3.0\n", "5 6.5 7.5 2.0\n", "6 6.5 7.5 4.5\n", "7 8.0 5.5 7.0" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_13.rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank a DataFrame over columns:" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " bar baz foo\n", "0 1.0 2.0 3\n", "1 3.0 2.0 1\n", "2 1.0 2.0 3\n", "3 1.5 1.5 3\n", "4 2.0 3.0 1\n", "5 2.0 3.0 1\n", "6 2.0 3.0 1\n", "7 3.0 1.0 2" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_13.rank(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Axis Indexes with Duplicate Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Labels do not have to be unique in Pandas:" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo 0\n", "foo 1\n", "bar 2\n", "bar 3\n", "baz 4\n", "dtype: int64" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_12 = Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])\n", "ser_12" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_12.index.is_unique" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select Series elements:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo 0\n", "foo 1\n", "dtype: int64" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_12['foo']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select DataFrame elements:" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>foo</th><td>-2.363469</td><td>1.135345</td><td>-1.017014</td><td>0.637362</td></tr>\n",
" <tr><th>foo</th><td>-0.859907</td><td>1.772608</td><td>-1.110363</td><td>0.181214</td></tr>\n",
" <tr><th>bar</th><td>0.564345</td><td>-0.566510</td><td>0.729976</td><td>0.372994</td></tr>\n",
" <tr><th>bar</th><td>0.533811</td><td>-0.091973</td><td>1.913820</td><td>0.330797</td></tr>\n",
" <tr><th>baz</th><td>1.141943</td><td>-1.129595</td><td>-0.850052</td><td>0.960820</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 0 1 2 3\n", "foo -2.363469 1.135345 -1.017014 0.637362\n", "foo -0.859907 1.772608 -1.110363 0.181214\n", "bar 0.564345 -0.566510 0.729976 0.372994\n", "bar 0.533811 -0.091973 1.913820 0.330797\n", "baz 1.141943 -1.129595 -0.850052 0.960820" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_14 = DataFrame(np.random.randn(5, 4),\n", " index=['foo', 'foo', 'bar', 'bar', 'baz'])\n", "df_14" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
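<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>bar</th><td>0.564345</td><td>-0.566510</td><td>0.729976</td><td>0.372994</td></tr>\n",
" <tr><th>bar</th><td>0.533811</td><td>-0.091973</td><td>1.913820</td><td>0.330797</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>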
" ], "text/plain": [ " 0 1 2 3\n", "bar 0.564345 -0.566510 0.729976 0.372994\n", "bar 0.533811 -0.091973 1.913820 0.330797" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_14.ix['bar']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarizing and Computing Descriptive Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA." ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>state</th><th>pop</th><th>unempl</th><th>year</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>VA</td><td>5.0</td><td>NaN</td><td>2012</td></tr>\n",
" <tr><th>1</th><td>VA</td><td>5.1</td><td>NaN</td><td>2013</td></tr>\n",
" <tr><th>2</th><td>VA</td><td>5.2</td><td>6.0</td><td>2014</td></tr>\n",
" <tr><th>3</th><td>MD</td><td>4.0</td><td>6.0</td><td>2014</td></tr>\n",
" <tr><th>4</th><td>MD</td><td>4.1</td><td>6.1</td><td>2015</td></tr>\n",
" <tr><th>5</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>6</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pop 23.4\n", "unempl 18.1\n", "year 10068.0\n", "dtype: float64" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sum over the rows:" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2017.0\n", "1 2018.1\n", "2 2025.2\n", "3 2024.0\n", "4 2025.2\n", "5 0.0\n", "6 0.0\n", "dtype: float64" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.sum(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Account for NaNs:" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 NaN\n", "2 2025.2\n", "3 2024.0\n", "4 2025.2\n", "5 NaN\n", "6 NaN\n", "dtype: float64" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.sum(axis=1, skipna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning Data (Under Construction)\n", "* Replace\n", "* Drop\n", "* Concatenate" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pandas import Series, DataFrame\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setup a DataFrame:" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>population</th><th>state</th><th>year</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>5.0</td><td>VA</td><td>2012</td></tr>\n",
" <tr><th>1</th><td>5.1</td><td>VA</td><td>2013</td></tr>\n",
" <tr><th>2</th><td>5.2</td><td>VA</td><td>2014</td></tr>\n",
" <tr><th>3</th><td>4.0</td><td>MD</td><td>2014</td></tr>\n",
" <tr><th>4</th><td>4.1</td><td>MD</td><td>2015</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " population state year\n", "0 5.0 VA 2012\n", "1 5.1 VA 2013\n", "2 5.2 VA 2014\n", "3 4.0 MD 2014\n", "4 4.1 MD 2015" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],\n", " 'year' : [2012, 2013, 2014, 2014, 2015],\n", " 'population' : [5.0, 5.1, 5.2, 4.0, 4.1]}\n", "df_1 = DataFrame(data_1)\n", "df_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace all occurrences of a string with another string, in place (no copy):" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>population</th><th>state</th><th>year</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>5.0</td><td>VIRGINIA</td><td>2012</td></tr>\n",
" <tr><th>1</th><td>5.1</td><td>VIRGINIA</td><td>2013</td></tr>\n",
" <tr><th>2</th><td>5.2</td><td>VIRGINIA</td><td>2014</td></tr>\n",
" <tr><th>3</th><td>4.0</td><td>MD</td><td>2014</td></tr>\n",
" <tr><th>4</th><td>4.1</td><td>MD</td><td>2015</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " population state year\n", "0 5.0 VIRGINIA 2012\n", "1 5.1 VIRGINIA 2013\n", "2 5.2 VIRGINIA 2014\n", "3 4.0 MD 2014\n", "4 4.1 MD 2015" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.replace('VA', 'VIRGINIA', inplace=True)\n", "df_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a specified column, replace all occurrences of a string with another string, in place (no copy):" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>population</th><th>state</th><th>year</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>5.0</td><td>VIRGINIA</td><td>2012</td></tr>\n",
" <tr><th>1</th><td>5.1</td><td>VIRGINIA</td><td>2013</td></tr>\n",
" <tr><th>2</th><td>5.2</td><td>VIRGINIA</td><td>2014</td></tr>\n",
" <tr><th>3</th><td>4.0</td><td>MARYLAND</td><td>2014</td></tr>\n",
" <tr><th>4</th><td>4.1</td><td>MARYLAND</td><td>2015</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " population state year\n", "0 5.0 VIRGINIA 2012\n", "1 5.1 VIRGINIA 2013\n", "2 5.2 VIRGINIA 2014\n", "3 4.0 MARYLAND 2014\n", "4 4.1 MARYLAND 2015" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.replace({'state' : { 'MD' : 'MARYLAND' }}, inplace=True)\n", "df_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Drop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop the 'population' column and return a copy of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>state</th><th>year</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>VIRGINIA</td><td>2012</td></tr>\n",
" <tr><th>1</th><td>VIRGINIA</td><td>2013</td></tr>\n",
" <tr><th>2</th><td>VIRGINIA</td><td>2014</td></tr>\n",
" <tr><th>3</th><td>MARYLAND</td><td>2014</td></tr>\n",
" <tr><th>4</th><td>MARYLAND</td><td>2015</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " state year\n", "0 VIRGINIA 2012\n", "1 VIRGINIA 2013\n", "2 VIRGINIA 2014\n", "3 MARYLAND 2014\n", "4 MARYLAND 2015" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2 = df_1.drop('population', axis=1)\n", "df_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Concatenate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenate two DataFrames:" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>population</th><th>state</th><th>year</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>6.0</td><td>NY</td><td>2012</td></tr>\n",
" <tr><th>1</th><td>6.1</td><td>NY</td><td>2013</td></tr>\n",
" <tr><th>2</th><td>6.2</td><td>NY</td><td>2014</td></tr>\n",
" <tr><th>3</th><td>3.0</td><td>FL</td><td>2014</td></tr>\n",
" <tr><th>4</th><td>3.1</td><td>FL</td><td>2015</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " population state year\n", "0 6.0 NY 2012\n", "1 6.1 NY 2013\n", "2 6.2 NY 2014\n", "3 3.0 FL 2014\n", "4 3.1 FL 2015" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_2 = {'state' : ['NY', 'NY', 'NY', 'FL', 'FL'],\n", " 'year' : [2012, 2013, 2014, 2014, 2015],\n", " 'population' : [6.0, 6.1, 6.2, 3.0, 3.1]}\n", "df_3 = DataFrame(data_2)\n", "df_3" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>population</th><th>state</th><th>year</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>5.0</td><td>VIRGINIA</td><td>2012</td></tr>\n",
" <tr><th>1</th><td>5.1</td><td>VIRGINIA</td><td>2013</td></tr>\n",
" <tr><th>2</th><td>5.2</td><td>VIRGINIA</td><td>2014</td></tr>\n",
" <tr><th>3</th><td>4.0</td><td>MARYLAND</td><td>2014</td></tr>\n",
" <tr><th>4</th><td>4.1</td><td>MARYLAND</td><td>2015</td></tr>\n",
" <tr><th>0</th><td>6.0</td><td>NY</td><td>2012</td></tr>\n",
" <tr><th>1</th><td>6.1</td><td>NY</td><td>2013</td></tr>\n",
" <tr><th>2</th><td>6.2</td><td>NY</td><td>2014</td></tr>\n",
" <tr><th>3</th><td>3.0</td><td>FL</td><td>2014</td></tr>\n",
" <tr><th>4</th><td>3.1</td><td>FL</td><td>2015</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " population state year\n", "0 5.0 VIRGINIA 2012\n", "1 5.1 VIRGINIA 2013\n", "2 5.2 VIRGINIA 2014\n", "3 4.0 MARYLAND 2014\n", "4 4.1 MARYLAND 2015\n", "0 6.0 NY 2012\n", "1 6.1 NY 2013\n", "2 6.2 NY 2014\n", "3 3.0 FL 2014\n", "4 3.1 FL 2015" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_4 = pd.concat([df_1, df_3])\n", "df_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input and Output (Under Construction)\n", "* Reading\n", "* Writing" ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pandas import Series, DataFrame\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read data from a CSV file into a DataFrame (use sep='\\t' for TSV):" ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_1 = pd.read_csv(\"../data/ozone.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a summary of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Ozone</th><th>Solar.R</th><th>Wind</th><th>Temp</th><th>Month</th><th>Day</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>count</th><td>116.000000</td><td>146.000000</td><td>153.000000</td><td>153.000000</td><td>153.000000</td><td>153.000000</td></tr>\n",
" <tr><th>mean</th><td>42.129310</td><td>185.931507</td><td>9.957516</td><td>77.882353</td><td>6.993464</td><td>15.803922</td></tr>\n",
" <tr><th>std</th><td>32.987885</td><td>90.058422</td><td>3.523001</td><td>9.465270</td><td>1.416522</td><td>8.864520</td></tr>\n",
" <tr><th>min</th><td>1.000000</td><td>7.000000</td><td>1.700000</td><td>56.000000</td><td>5.000000</td><td>1.000000</td></tr>\n",
" <tr><th>25%</th><td>18.000000</td><td>115.750000</td><td>7.400000</td><td>72.000000</td><td>6.000000</td><td>8.000000</td></tr>\n",
" <tr><th>50%</th><td>31.500000</td><td>205.000000</td><td>9.700000</td><td>79.000000</td><td>7.000000</td><td>16.000000</td></tr>\n",
" <tr><th>75%</th><td>63.250000</td><td>258.750000</td><td>11.500000</td><td>85.000000</td><td>8.000000</td><td>23.000000</td></tr>\n",
" <tr><th>max</th><td>168.000000</td><td>334.000000</td><td>20.700000</td><td>97.000000</td><td>9.000000</td><td>31.000000</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Ozone Solar.R Wind Temp Month Day\n", "count 116.000000 146.000000 153.000000 153.000000 153.000000 153.000000\n", "mean 42.129310 185.931507 9.957516 77.882353 6.993464 15.803922\n", "std 32.987885 90.058422 3.523001 9.465270 1.416522 8.864520\n", "min 1.000000 7.000000 1.700000 56.000000 5.000000 1.000000\n", "25% 18.000000 115.750000 7.400000 72.000000 6.000000 8.000000\n", "50% 31.500000 205.000000 9.700000 79.000000 7.000000 16.000000\n", "75% 63.250000 258.750000 11.500000 85.000000 8.000000 23.000000\n", "max 168.000000 334.000000 20.700000 97.000000 9.000000 31.000000" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List the first five rows of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Ozone</th><th>Solar.R</th><th>Wind</th><th>Temp</th><th>Month</th><th>Day</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>41</td><td>190</td><td>7.4</td><td>67</td><td>5</td><td>1</td></tr>\n",
" <tr><th>1</th><td>36</td><td>118</td><td>8.0</td><td>72</td><td>5</td><td>2</td></tr>\n",
" <tr><th>2</th><td>12</td><td>149</td><td>12.6</td><td>74</td><td>5</td><td>3</td></tr>\n",
" <tr><th>3</th><td>18</td><td>313</td><td>11.5</td><td>62</td><td>5</td><td>4</td></tr>\n",
" <tr><th>4</th><td>NaN</td><td>NaN</td><td>14.3</td><td>56</td><td>5</td><td>5</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Ozone Solar.R Wind Temp Month Day\n", "0 41 190 7.4 67 5 1\n", "1 36 118 8.0 72 5 2\n", "2 12 149 12.6 74 5 3\n", "3 18 313 11.5 62 5 4\n", "4 NaN NaN 14.3 56 5 5" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a copy of the CSV file, encoded in UTF-8 and hiding the index and header labels:" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_1.to_csv('../data/ozone_copy.csv', \n", " encoding='utf-8', \n", " index=False, \n", " header=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View the data directory:" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 1016\r\n", "-rw-r--r-- 1 gam staff 437903 Feb 15 07:16 churn.csv\r\n", "-rwxr-xr-x 1 gam staff 72050 Feb 15 07:16 \u001b[31mconfusion_matrix.png\u001b[m\u001b[m\r\n", "-rw-r--r-- 1 gam staff 2902 Feb 15 07:16 ozone.csv\r\n", "-rw-r--r-- 1 gam staff 3324 Mar 4 17:53 ozone_copy.csv\r\n", "drwxr-xr-x 10 gam staff 340 Feb 15 07:16 \u001b[1m\u001b[36mtitanic\u001b[m\u001b[m\r\n" ] } ], "source": [ "!ls -l ../data/" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }