{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"<!--BOOK_INFORMATION-->\n",
|
|||
|
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
|
|||
|
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
|
|||
|
"\n",
|
|||
|
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
|
|||
|
"\n",
|
|||
|
"*No changes were made to the contents of this notebook from the original.*"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"<!--NAVIGATION-->\n",
|
|||
|
"< [Pivot Tables](03.09-Pivot-Tables.ipynb) | [Contents](Index.ipynb) | [Working with Time Series](03.11-Working-with-Time-Series.ipynb) >"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Vectorized String Operations"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"One strength of Python is its relative ease in handling and manipulating string data.\n",
|
|||
|
"Pandas builds on this and provides a comprehensive set of *vectorized string operations* that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data.\n",
|
|||
|
"In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Introducing Pandas String Operations\n",
|
|||
|
"\n",
|
|||
|
"We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 1,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"array([ 4, 6, 10, 14, 22, 26])"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 1,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import numpy as np\n",
|
|||
|
"x = np.array([2, 3, 5, 7, 11, 13])\n",
|
|||
|
"x * 2"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"This *vectorization* of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done.\n",
|
|||
|
"For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 2,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"['Peter', 'Paul', 'Mary', 'Guido']"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 2,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"data = ['peter', 'Paul', 'MARY', 'gUIDO']\n",
|
|||
|
"[s.capitalize() for s in data]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"This is perhaps sufficient to work with some data, but it will break if there are any missing values.\n",
|
|||
|
"For example:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 3,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"ename": "AttributeError",
|
|||
|
"evalue": "'NoneType' object has no attribute 'capitalize'",
|
|||
|
"output_type": "error",
|
|||
|
"traceback": [
|
|||
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|||
|
"\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
|
|||
|
"\u001b[0;32m<ipython-input-3-fc1d891ab539>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'peter'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Paul'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'MARY'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gUIDO'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcapitalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
|
|||
|
"\u001b[0;32m<ipython-input-3-fc1d891ab539>\u001b[0m in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'peter'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Paul'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'MARY'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gUIDO'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcapitalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
|
|||
|
"\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'capitalize'"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"data = ['peter', 'Paul', None, 'MARY', 'gUIDO']\n",
|
|||
|
"[s.capitalize() for s in data]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the ``str`` attribute of Pandas Series and Index objects containing strings.\n",
|
|||
|
"So, for example, suppose we create a Pandas Series with this data:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 peter\n",
|
|||
|
"1 Paul\n",
|
|||
|
"2 None\n",
|
|||
|
"3 MARY\n",
|
|||
|
"4 gUIDO\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"names = pd.Series(data)\n",
|
|||
|
"names"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We can now call a single method that will capitalize all the entries, while skipping over any missing values:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 Peter\n",
|
|||
|
"1 Paul\n",
|
|||
|
"2 None\n",
|
|||
|
"3 Mary\n",
|
|||
|
"4 Guido\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"names.str.capitalize()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas."
|
|||
|
]
|
|||
|
},
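{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough programmatic equivalent (a minimal sketch; the exact set of names depends on the installed version of Pandas), we can inspect the public attributes of the ``str`` accessor directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# list the public vectorized string methods exposed by the str accessor;\n",
"# the available names vary with the installed Pandas version\n",
"[method for method in dir(names.str) if not method.startswith('_')]"
]
},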
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Tables of Pandas String Methods\n",
|
|||
|
"\n",
|
|||
|
"If you have a good understanding of string manipulation in Python, most of Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties.\n",
|
|||
|
"The examples in this section use the following series of names:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {
|
|||
|
"collapsed": true
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',\n",
|
|||
|
" 'Eric Idle', 'Terry Jones', 'Michael Palin'])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Methods similar to Python string methods\n",
|
|||
|
"Nearly all Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:\n",
|
|||
|
"\n",
|
|||
|
"| | | | |\n",
|
|||
|
"|-------------|------------------|------------------|------------------|\n",
|
|||
|
"|``len()`` | ``lower()`` | ``translate()`` | ``islower()`` | \n",
|
|||
|
"|``ljust()`` | ``upper()`` | ``startswith()`` | ``isupper()`` | \n",
|
|||
|
"|``rjust()`` | ``find()`` | ``endswith()`` | ``isnumeric()`` | \n",
|
|||
|
"|``center()`` | ``rfind()`` | ``isalnum()`` | ``isdecimal()`` | \n",
|
|||
|
"|``zfill()`` | ``index()`` | ``isalpha()`` | ``split()`` | \n",
|
|||
|
"|``strip()`` | ``rindex()`` | ``isdigit()`` | ``rsplit()`` | \n",
|
|||
|
"|``rstrip()`` | ``capitalize()`` | ``isspace()`` | ``partition()`` | \n",
|
|||
|
"|``lstrip()`` | ``swapcase()`` | ``istitle()`` | ``rpartition()`` |\n",
|
|||
|
"\n",
|
|||
|
"Notice that these have various return values. Some, like ``lower()``, return a series of strings:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 graham chapman\n",
|
|||
|
"1 john cleese\n",
|
|||
|
"2 terry gilliam\n",
|
|||
|
"3 eric idle\n",
|
|||
|
"4 terry jones\n",
|
|||
|
"5 michael palin\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"monte.str.lower()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"But some others return numbers:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 14\n",
|
|||
|
"1 11\n",
|
|||
|
"2 13\n",
|
|||
|
"3 9\n",
|
|||
|
"4 11\n",
|
|||
|
"5 13\n",
|
|||
|
"dtype: int64"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"monte.str.len()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Or Boolean values:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 False\n",
|
|||
|
"1 False\n",
|
|||
|
"2 True\n",
|
|||
|
"3 False\n",
|
|||
|
"4 True\n",
|
|||
|
"5 False\n",
|
|||
|
"dtype: bool"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"monte.str.startswith('T')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Still others return lists or other compound values for each element:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 [Graham, Chapman]\n",
|
|||
|
"1 [John, Cleese]\n",
|
|||
|
"2 [Terry, Gilliam]\n",
|
|||
|
"3 [Eric, Idle]\n",
|
|||
|
"4 [Terry, Jones]\n",
|
|||
|
"5 [Michael, Palin]\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"monte.str.split()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We'll see further manipulations of this kind of series-of-lists object as we continue our discussion."
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Methods using regular expressions\n",
"\n",
"In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:\n",
"\n",
"| Method | Description |\n",
"|--------|-------------|\n",
"| ``match()`` | Call ``re.match()`` on each element, returning a boolean |\n",
"| ``extract()`` | Call ``re.match()`` on each element, returning matched groups as strings |\n",
"| ``findall()`` | Call ``re.findall()`` on each element |\n",
"| ``replace()`` | Replace occurrences of pattern with some other string |\n",
"| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |\n",
"| ``count()`` | Count occurrences of pattern |\n",
"| ``split()`` | Equivalent to ``str.split()``, but accepts regexps |\n",
"| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |"
]
|
|||
|
},
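{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration of two of these, ``contains()`` and ``count()`` both accept a regular expression pattern and apply it to every entry of the ``monte`` series defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# which entries contain the substring 'am'?\n",
"print(monte.str.contains('am'))\n",
"\n",
"# how many lowercase vowels appear in each name?\n",
"monte.str.count(r'[aeiou]')"
]
},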
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"With these, you can do a wide range of interesting operations.\n",
|
|||
|
"For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 Graham\n",
|
|||
|
"1 John\n",
|
|||
|
"2 Terry\n",
|
|||
|
"3 Eric\n",
|
|||
|
"4 Terry\n",
|
|||
|
"5 Michael\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"monte.str.extract('([A-Za-z]+)', expand=False)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (``^``) and end-of-string (``$``) regular expression characters:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 [Graham Chapman]\n",
|
|||
|
"1 []\n",
|
|||
|
"2 [Terry Gilliam]\n",
|
|||
|
"3 []\n",
|
|||
|
"4 [Terry Jones]\n",
|
|||
|
"5 [Michael Palin]\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"monte.str.findall(r'^[^AEIOU].*[^aeiou]$')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries opens up many possibilities for analysis and cleaning of data."
]
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Miscellaneous methods\n",
"Finally, there are some miscellaneous methods that enable other convenient operations:\n",
"\n",
"| Method | Description |\n",
"|--------|-------------|\n",
"| ``get()`` | Index each element |\n",
"| ``slice()`` | Slice each element |\n",
"| ``slice_replace()`` | Replace slice in each element with passed value |\n",
"| ``cat()`` | Concatenate strings |\n",
"| ``repeat()`` | Repeat values |\n",
"| ``normalize()`` | Return Unicode form of string |\n",
"| ``pad()`` | Add whitespace to left, right, or both sides of strings |\n",
"| ``wrap()`` | Split long strings into lines with length less than a given width |\n",
"| ``join()`` | Join strings in each element of the Series with passed separator |\n",
"| ``get_dummies()`` | Extract dummy variables as a ``DataFrame`` |"
]
|
|||
|
},
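{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance (a small sketch of just two of these), ``cat()`` concatenates all entries into a single string, and ``pad()`` pads each entry with whitespace to a fixed width:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# join all the names into one comma-separated string\n",
"print(monte.str.cat(sep=', '))\n",
"\n",
"# left-pad each name with spaces to a width of 20 characters\n",
"monte.str.pad(20, side='left')"
]
},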
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Vectorized item access and slicing\n",
"\n",
"The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access within each string.\n",
"For example, we can get a slice of the first three characters of each entry using ``str.slice(0, 3)``.\n",
"Note that this behavior is also available through Python's normal indexing syntax–for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]``:"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 13,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 Gra\n",
|
|||
|
"1 Joh\n",
|
|||
|
"2 Ter\n",
|
|||
|
"3 Eri\n",
|
|||
|
"4 Ter\n",
|
|||
|
"5 Mic\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 13,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"monte.str[0:3]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Indexing via ``df.str.get(i)`` and ``df.str[i]`` is likewise similar.\n",
|
|||
|
"\n",
|
|||
|
"These ``get()`` and ``slice()`` methods also let you access elements of arrays returned by ``split()``.\n",
|
|||
|
"For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0 Chapman\n",
|
|||
|
"1 Cleese\n",
|
|||
|
"2 Gilliam\n",
|
|||
|
"3 Idle\n",
|
|||
|
"4 Jones\n",
|
|||
|
"5 Palin\n",
|
|||
|
"dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"monte.str.split().str.get(-1)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### Indicator variables\n",
|
|||
|
"\n",
|
|||
|
"Another method that requires a bit of extra explanation is the ``get_dummies()`` method.\n",
|
|||
|
"This is useful when your data has a column containing some sort of coded indicator.\n",
|
|||
|
"For example, we might have a dataset that contains information in the form of codes, such as A=\"born in America,\" B=\"born in the United Kingdom,\" C=\"likes cheese,\" D=\"likes spam\":"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>info</th>\n",
|
|||
|
" <th>name</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>B|C|D</td>\n",
|
|||
|
" <td>Graham Chapman</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>B|D</td>\n",
|
|||
|
" <td>John Cleese</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>A|C</td>\n",
|
|||
|
" <td>Terry Gilliam</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>B|D</td>\n",
|
|||
|
" <td>Eric Idle</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>B|C</td>\n",
|
|||
|
" <td>Terry Jones</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>B|C|D</td>\n",
|
|||
|
" <td>Michael Palin</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" info name\n",
|
|||
|
"0 B|C|D Graham Chapman\n",
|
|||
|
"1 B|D John Cleese\n",
|
|||
|
"2 A|C Terry Gilliam\n",
|
|||
|
"3 B|D Eric Idle\n",
|
|||
|
"4 B|C Terry Jones\n",
|
|||
|
"5 B|C|D Michael Palin"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"full_monte = pd.DataFrame({'name': monte,\n",
|
|||
|
" 'info': ['B|C|D', 'B|D', 'A|C',\n",
|
|||
|
" 'B|D', 'B|C', 'B|C|D']})\n",
|
|||
|
"full_monte"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``get_dummies()`` routine lets you quickly split out these indicator variables into a ``DataFrame``:"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>A</th>\n",
|
|||
|
" <th>B</th>\n",
|
|||
|
" <th>C</th>\n",
|
|||
|
" <th>D</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" A B C D\n",
|
|||
|
"0 0 1 1 1\n",
|
|||
|
"1 0 1 0 1\n",
|
|||
|
"2 1 0 1 0\n",
|
|||
|
"3 0 1 0 1\n",
|
|||
|
"4 0 1 1 0\n",
|
|||
|
"5 0 1 1 1"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"full_monte['info'].str.get_dummies('|')"
|
|||
|
]
|
|||
|
},
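{
"cell_type": "markdown",
"metadata": {},
"source": [
"As one small sketch of chaining these building blocks together, we can lowercase each name, split it into words, and re-join the pieces with an underscore to produce identifier-style strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# chain several vectorized string methods:\n",
"# lowercase, split on whitespace, then join the pieces with '_'\n",
"monte.str.lower().str.split().str.join('_')"
]
},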
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.\n",
|
|||
|
"\n",
|
|||
|
"We won't dive further into these methods here, but I encourage you to read through [\"Working with Text Data\"](http://pandas.pydata.org/pandas-docs/stable/text.html) in the Pandas online documentation, or to refer to the resources listed in [Further Resources](03.13-Further-Resources.ipynb)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Example: Recipe Database\n",
|
|||
|
"\n",
|
|||
|
"These vectorized string operations become most useful in the process of cleaning up messy, real-world data.\n",
|
|||
|
"Here I'll walk through an example of that, using an open recipe database compiled from various sources on the Web.\n",
|
|||
|
"Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.\n",
|
|||
|
"\n",
|
|||
|
"The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.\n",
|
|||
|
"\n",
|
|||
|
"As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz\n",
|
|||
|
"# !gunzip recipeitems-latest.json.gz"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"The database is in JSON format, so we will try ``pd.read_json`` to read it:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 18,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"ValueError: Trailing data\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"try:\n",
|
|||
|
" recipes = pd.read_json('recipeitems-latest.json')\n",
|
|||
|
"except ValueError as e:\n",
|
|||
|
" print(\"ValueError:\", e)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Oops! We get a ``ValueError`` mentioning that there is \"trailing data.\"\n",
|
|||
|
"Searching for the text of this error on the Internet, it seems that it's due to using a file in which *each line* is itself a valid JSON, but the full file is not.\n",
|
|||
|
"Let's check if this interpretation is true:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(2, 12)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"with open('recipeitems-latest.json') as f:\n",
|
|||
|
" line = f.readline()\n",
|
|||
|
"pd.read_json(line).shape"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Yes, apparently each line is a valid JSON, so we'll need to string them together.\n",
|
|||
|
"One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with ``pd.read_json``:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {
|
|||
|
"collapsed": true
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# read the entire file into a Python array\n",
|
|||
|
"with open('recipeitems-latest.json', 'r') as f:\n",
|
|||
|
" # Extract each line\n",
|
|||
|
" data = (line.strip() for line in f)\n",
|
|||
|
" # Reformat so each line is the element of a list\n",
|
|||
|
" data_json = \"[{0}]\".format(','.join(data))\n",
|
|||
|
"# read the result as a JSON\n",
|
|||
|
"recipes = pd.read_json(data_json)"
|
|||
|
]
|
|||
|
},
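{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside: newer versions of Pandas (0.19 and later; check your installation before relying on this) can parse this line-delimited format directly via the ``lines`` keyword of ``pd.read_json``, avoiding the intermediate string entirely:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# equivalent shortcut on newer Pandas versions (uncomment to try it):\n",
"# recipes = pd.read_json('recipeitems-latest.json', lines=True)"
]
},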
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(173278, 17)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recipes.shape"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see there are just over 173,000 recipes and 17 columns.\n",
"Let's take a look at one row to see what we have:"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 22,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"_id {'$oid': '5160756b96cc62079cc2db15'}\n",
|
|||
|
"cookTime PT30M\n",
|
|||
|
"creator NaN\n",
|
|||
|
"dateModified NaN\n",
|
|||
|
"datePublished 2013-03-11\n",
|
|||
|
"description Late Saturday afternoon, after Marlboro Man ha...\n",
|
|||
|
"image http://static.thepioneerwoman.com/cooking/file...\n",
|
|||
|
"ingredients Biscuits\\n3 cups All-purpose Flour\\n2 Tablespo...\n",
|
|||
|
"name Drop Biscuits and Sausage Gravy\n",
|
|||
|
"prepTime PT10M\n",
|
|||
|
"recipeCategory NaN\n",
|
|||
|
"recipeInstructions NaN\n",
|
|||
|
"recipeYield 12\n",
|
|||
|
"source thepioneerwoman\n",
|
|||
|
"totalTime NaN\n",
|
|||
|
"ts {'$date': 1365276011104}\n",
|
|||
|
"url http://thepioneerwoman.com/cooking/2013/03/dro...\n",
|
|||
|
"Name: 0, dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 22,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recipes.iloc[0]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.\n",
|
|||
|
"In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.\n",
|
|||
|
"Let's start by taking a closer look at the ingredients:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"count 173278.000000\n",
|
|||
|
"mean 244.617926\n",
|
|||
|
"std 146.705285\n",
|
|||
|
"min 0.000000\n",
|
|||
|
"25% 147.000000\n",
|
|||
|
"50% 221.000000\n",
|
|||
|
"75% 314.000000\n",
|
|||
|
"max 9067.000000\n",
|
|||
|
"Name: ingredients, dtype: float64"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recipes.ingredients.str.len().describe()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ingredient lists average about 245 characters in length, with a minimum of 0 and a maximum of nearly 10,000 characters!\n",
"\n",
"Just out of curiosity, let's see which recipe has the longest ingredient list:"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'Carrot Pineapple Spice & Brownie Layer Cake with Whipped Cream & Cream Cheese Frosting and Marzipan Carrots'"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recipes.name[np.argmax(recipes.ingredients.str.len())]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"That certainly looks like an involved recipe.\n",
|
|||
|
"\n",
|
|||
|
"We can do other aggregate explorations; for example, let's see how many of the recipes are for breakfast food:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"3524"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recipes.description.str.contains('[Bb]reakfast').sum()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Or how many of the recipes list cinnamon as an ingredient:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"10526"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recipes.ingredients.str.contains('[Cc]innamon').sum()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We could even look to see whether any recipes misspell the ingredient as \"cinamon\":"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"11"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recipes.ingredients.str.contains('[Cc]inamon').sum()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"This is the type of essential data exploration that is possible with Pandas string tools.\n",
|
|||
|
"It is data munging like this that Python really excels at."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### A simple recipe recommender\n",
|
|||
|
"\n",
|
|||
|
"Let's go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.\n",
|
|||
|
"While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.\n",
|
|||
|
"So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list.\n",
|
|||
|
"For simplicity, let's just stick with herbs and spices for the time being:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 28,
|
|||
|
"metadata": {
|
|||
|
"collapsed": true
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',\n",
|
|||
|
" 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We can then build a Boolean ``DataFrame`` consisting of True and False values, indicating whether this ingredient appears in the list:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 29,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>cumin</th>\n",
|
|||
|
" <th>oregano</th>\n",
|
|||
|
" <th>paprika</th>\n",
|
|||
|
" <th>parsley</th>\n",
|
|||
|
" <th>pepper</th>\n",
|
|||
|
" <th>rosemary</th>\n",
|
|||
|
" <th>sage</th>\n",
|
|||
|
" <th>salt</th>\n",
|
|||
|
" <th>tarragon</th>\n",
|
|||
|
" <th>thyme</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" cumin oregano paprika parsley pepper rosemary sage salt tarragon thyme\n",
|
|||
|
"0 False False False False False False True False False False\n",
|
|||
|
"1 False False False False False False False False False False\n",
|
|||
|
"2 True False False False True False False True False False\n",
|
|||
|
"3 False False False False False False False False False False\n",
|
|||
|
"4 False False False False False False False False False False"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 29,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import re\n",
|
|||
|
"spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))\n",
|
|||
|
" for spice in spice_list))\n",
|
|||
|
"spice_df.head()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and tarragon.\n",
|
|||
|
"We can compute this very quickly using the ``query()`` method of ``DataFrame``s, discussed in [High-Performance Pandas: ``eval()`` and ``query()``](03.12-Performance-Eval-and-Query.ipynb):"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 30,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"10"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 30,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"selection = spice_df.query('parsley & paprika & tarragon')\n",
|
|||
|
"len(selection)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"We find only 10 recipes with this combination; let's use the index returned by this selection to discover the names of the recipes that have this combination:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"2069 All cremat with a Little Gem, dandelion and wa...\n",
|
|||
|
"74964 Lobster with Thermidor butter\n",
|
|||
|
"93768 Burton's Southern Fried Chicken with White Gravy\n",
|
|||
|
"113926 Mijo's Slow Cooker Shredded Beef\n",
|
|||
|
"137686 Asparagus Soup with Poached Eggs\n",
|
|||
|
"140530 Fried Oyster Po’boys\n",
|
|||
|
"158475 Lamb shank tagine with herb tabbouleh\n",
|
|||
|
"158486 Southern fried chicken in buttermilk\n",
|
|||
|
"163175 Fried Chicken Sliders with Pickles + Slaw\n",
|
|||
|
"165243 Bar Tartine Cauliflower Salad\n",
|
|||
|
"Name: name, dtype: object"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recipes.name[selection.index]"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Now that we have narrowed down our recipe selection by a factor of almost 20,000, we are in a position to make a more informed decision about what we'd like to cook for dinner."
|
|||
|
]
|
|||
|
},
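{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this kind of search reusable, here is a minimal sketch of a helper function (the name ``find_recipes`` and its exact behavior are our own illustration, not part of the recipe database tooling) that looks up recipes containing any set of search terms:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def find_recipes(*ingredients):\n",
"    \"\"\"Return names of recipes whose ingredient text mentions\n",
"    every one of the given search terms (case-insensitive).\"\"\"\n",
"    mask = pd.Series(True, index=recipes.index)\n",
"    for ingredient in ingredients:\n",
"        mask &= recipes.ingredients.str.contains(ingredient, case=False, na=False)\n",
"    return recipes.name[mask]\n",
"\n",
"find_recipes('parsley', 'paprika', 'tarragon')"
]
},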
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Going further with recipes\n",
|
|||
|
"\n",
|
|||
|
"Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods.\n",
|
|||
|
"Of course, building a very robust recipe recommendation system would require a *lot* more work!\n",
|
|||
|
"Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process.\n",
|
|||
|
"This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"<!--NAVIGATION-->\n",
|
|||
|
"< [Pivot Tables](03.09-Pivot-Tables.ipynb) | [Contents](Index.ipynb) | [Working with Time Series](03.11-Working-with-Time-Series.ipynb) >"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"anaconda-cloud": {},
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "Python 3",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.4.3"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 0
|
|||
|
}