Mirror of https://github.com/donnemartin/data-science-ipython-notebooks.git (synced 2024-03-22 13:30:56 +08:00, commit 233b197660).
Source: https://github.com/jakevdp/PythonDataScienceHandbook (unmodified).
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
"\n",
"*No changes were made to the contents of this notebook from the original.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Working with Time Series](03.11-Working-with-Time-Series.ipynb) | [Contents](Index.ipynb) | [Further Resources](03.13-Further-Resources.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# High-Performance Pandas: eval() and query()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we've already seen in previous sections, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into C via an intuitive syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas.\n",
"While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.\n",
"\n",
"As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays.\n",
"These are the ``eval()`` and ``query()`` functions, which rely on the [Numexpr](https://github.com/pydata/numexpr) package.\n",
"In this notebook we will walk through their use and give some rules-of-thumb about when you might think about using them."
]
},
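{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since both of these tools lean on the Numexpr package, a quick check that it can be imported may save confusion later; the cell below is a minimal sketch of such a check (exact behavior when Numexpr is missing depends on your Pandas version)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Minimal sketch: confirm the Numexpr dependency used by eval()/query() is available.\n",
"import numexpr\n",
"numexpr.__version__"
]
},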
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Motivating ``query()`` and ``eval()``: Compound Expressions\n",
"\n",
"We've seen previously that NumPy and Pandas support fast vectorized operations; for example, when adding the elements of two arrays:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100 loops, best of 3: 3.39 ms per loop\n"
]
}
],
"source": [
"import numpy as np\n",
"rng = np.random.RandomState(42)\n",
"x = rng.rand(1000000)\n",
"y = rng.rand(1000000)\n",
"%timeit x + y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As discussed in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb), this is much faster than doing the addition via a Python loop or comprehension:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 266 ms per loop\n"
]
}
],
"source": [
"%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this abstraction can become less efficient when computing compound expressions.\n",
"For example, consider the following expression:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mask = (x > 0.5) & (y < 0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because NumPy evaluates each subexpression, this is roughly equivalent to the following:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"tmp1 = (x > 0.5)\n",
"tmp2 = (y < 0.5)\n",
"mask = tmp1 & tmp2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In other words, *every intermediate step is explicitly allocated in memory*. If the ``x`` and ``y`` arrays are very large, this can lead to significant memory and computational overhead.\n",
"The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.\n",
"The [Numexpr documentation](https://github.com/pydata/numexpr) has more details, but for the time being it is sufficient to say that the library accepts a *string* giving the NumPy-style expression you'd like to compute:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numexpr\n",
"mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')\n",
"np.allclose(mask, mask_numexpr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays.\n",
"The Pandas ``eval()`` and ``query()`` tools that we will discuss here are conceptually similar, and depend on the Numexpr package."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ``pandas.eval()`` for Efficient Operations\n",
"\n",
"The ``eval()`` function in Pandas uses string expressions to efficiently compute operations using ``DataFrame``s.\n",
"For example, consider the following ``DataFrame``s:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"nrows, ncols = 100000, 100\n",
"rng = np.random.RandomState(42)\n",
"df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))\n",
"                      for i in range(4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compute the sum of all four ``DataFrame``s using the typical Pandas approach, we can just write the sum:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 loops, best of 3: 87.1 ms per loop\n"
]
}
],
"source": [
"%timeit df1 + df2 + df3 + df4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same result can be computed via ``pd.eval`` by constructing the expression as a string:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 loops, best of 3: 42.2 ms per loop\n"
]
}
],
"source": [
"%timeit pd.eval('df1 + df2 + df3 + df4')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``eval()`` version of this expression is about 50% faster (and uses much less memory), while giving the same result:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.allclose(df1 + df2 + df3 + df4,\n",
"            pd.eval('df1 + df2 + df3 + df4'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Operations supported by ``pd.eval()``\n",
"\n",
"As of Pandas v0.16, ``pd.eval()`` supports a wide range of operations.\n",
"To demonstrate these, we'll use the following integer ``DataFrame``s:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))\n",
"                           for i in range(5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Arithmetic operators\n",
"``pd.eval()`` supports all arithmetic operators. For example:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result1 = -df1 * df2 / (df3 + df4) - df5\n",
"result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Comparison operators\n",
"``pd.eval()`` supports all comparison operators, including chained expressions:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)\n",
"result2 = pd.eval('df1 < df2 <= df3 != df4')\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Bitwise operators\n",
"``pd.eval()`` supports the ``&`` and ``|`` bitwise operators:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)\n",
"result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, it supports the use of the literal ``and`` and ``or`` in Boolean expressions:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')\n",
"np.allclose(result1, result3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Object attributes and indices\n",
"\n",
"``pd.eval()`` supports access to object attributes via the ``obj.attr`` syntax, and indexes via the ``obj[index]`` syntax:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result1 = df2.T[0] + df3.iloc[1]\n",
"result2 = pd.eval('df2.T[0] + df3.iloc[1]')\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Other operations\n",
"Other operations such as function calls, conditional statements, loops, and other more involved constructs are currently *not* implemented in ``pd.eval()``.\n",
"If you'd like to execute these more complicated types of expressions, you can use the Numexpr library itself."
]
},
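{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of that fallback (the use of ``sqrt`` here is purely illustrative), the following evaluates an expression containing a function call with Numexpr directly, operating on the underlying NumPy arrays of the integer ``DataFrame``s defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Rough sketch: pd.eval() does not allow function calls, but Numexpr does.\n",
"# Work on the underlying NumPy arrays of the DataFrames defined above.\n",
"import numexpr\n",
"a, b = df1.values.astype(float), df2.values.astype(float)\n",
"result_ne = numexpr.evaluate('sqrt(a) + b')\n",
"np.allclose(result_ne, np.sqrt(df1) + df2)"
]
},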
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ``DataFrame.eval()`` for Column-Wise Operations\n",
"\n",
"Just as Pandas has a top-level ``pd.eval()`` function, ``DataFrame``s have an ``eval()`` method that works in similar ways.\n",
"The benefit of the ``eval()`` method is that columns can be referred to *by name*.\n",
"We'll use this labeled array as an example:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>A</th>\n",
"      <th>B</th>\n",
"      <th>C</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0.375506</td>\n",
"      <td>0.406939</td>\n",
"      <td>0.069938</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>0.069087</td>\n",
"      <td>0.235615</td>\n",
"      <td>0.154374</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>0.677945</td>\n",
"      <td>0.433839</td>\n",
"      <td>0.652324</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0.264038</td>\n",
"      <td>0.808055</td>\n",
"      <td>0.347197</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0.589161</td>\n",
"      <td>0.252418</td>\n",
"      <td>0.557789</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"          A         B         C\n",
"0  0.375506  0.406939  0.069938\n",
"1  0.069087  0.235615  0.154374\n",
"2  0.677945  0.433839  0.652324\n",
"3  0.264038  0.808055  0.347197\n",
"4  0.589161  0.252418  0.557789"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using ``pd.eval()`` as above, we can compute expressions with the three columns like this:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result1 = (df['A'] + df['B']) / (df['C'] - 1)\n",
"result2 = pd.eval(\"(df.A + df.B) / (df.C - 1)\")\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``DataFrame.eval()`` method allows much more succinct evaluation of expressions with the columns:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result3 = df.eval('(A + B) / (C - 1)')\n",
"np.allclose(result1, result3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice here that we treat *column names as variables* within the evaluated expression, and the result is what we would wish."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assignment in DataFrame.eval()\n",
"\n",
"In addition to the options just discussed, ``DataFrame.eval()`` also allows assignment to any column.\n",
"Let's use the ``DataFrame`` from before, which has columns ``'A'``, ``'B'``, and ``'C'``:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>A</th>\n",
"      <th>B</th>\n",
"      <th>C</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0.375506</td>\n",
"      <td>0.406939</td>\n",
"      <td>0.069938</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>0.069087</td>\n",
"      <td>0.235615</td>\n",
"      <td>0.154374</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>0.677945</td>\n",
"      <td>0.433839</td>\n",
"      <td>0.652324</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0.264038</td>\n",
"      <td>0.808055</td>\n",
"      <td>0.347197</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0.589161</td>\n",
"      <td>0.252418</td>\n",
"      <td>0.557789</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"          A         B         C\n",
"0  0.375506  0.406939  0.069938\n",
"1  0.069087  0.235615  0.154374\n",
"2  0.677945  0.433839  0.652324\n",
"3  0.264038  0.808055  0.347197\n",
"4  0.589161  0.252418  0.557789"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use ``df.eval()`` to create a new column ``'D'`` and assign to it a value computed from the other columns:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>A</th>\n",
"      <th>B</th>\n",
"      <th>C</th>\n",
"      <th>D</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0.375506</td>\n",
"      <td>0.406939</td>\n",
"      <td>0.069938</td>\n",
"      <td>11.187620</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>0.069087</td>\n",
"      <td>0.235615</td>\n",
"      <td>0.154374</td>\n",
"      <td>1.973796</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>0.677945</td>\n",
"      <td>0.433839</td>\n",
"      <td>0.652324</td>\n",
"      <td>1.704344</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0.264038</td>\n",
"      <td>0.808055</td>\n",
"      <td>0.347197</td>\n",
"      <td>3.087857</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0.589161</td>\n",
"      <td>0.252418</td>\n",
"      <td>0.557789</td>\n",
"      <td>1.508776</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"          A         B         C          D\n",
"0  0.375506  0.406939  0.069938  11.187620\n",
"1  0.069087  0.235615  0.154374   1.973796\n",
"2  0.677945  0.433839  0.652324   1.704344\n",
"3  0.264038  0.808055  0.347197   3.087857\n",
"4  0.589161  0.252418  0.557789   1.508776"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.eval('D = (A + B) / C', inplace=True)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the same way, any existing column can be modified:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>A</th>\n",
"      <th>B</th>\n",
"      <th>C</th>\n",
"      <th>D</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0.375506</td>\n",
"      <td>0.406939</td>\n",
"      <td>0.069938</td>\n",
"      <td>-0.449425</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>0.069087</td>\n",
"      <td>0.235615</td>\n",
"      <td>0.154374</td>\n",
"      <td>-1.078728</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>0.677945</td>\n",
"      <td>0.433839</td>\n",
"      <td>0.652324</td>\n",
"      <td>0.374209</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0.264038</td>\n",
"      <td>0.808055</td>\n",
"      <td>0.347197</td>\n",
"      <td>-1.566886</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0.589161</td>\n",
"      <td>0.252418</td>\n",
"      <td>0.557789</td>\n",
"      <td>0.603708</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"          A         B         C         D\n",
"0  0.375506  0.406939  0.069938 -0.449425\n",
"1  0.069087  0.235615  0.154374 -1.078728\n",
"2  0.677945  0.433839  0.652324  0.374209\n",
"3  0.264038  0.808055  0.347197 -1.566886\n",
"4  0.589161  0.252418  0.557789  0.603708"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.eval('D = (A - B) / C', inplace=True)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Local variables in DataFrame.eval()\n",
"\n",
"The ``DataFrame.eval()`` method supports an additional syntax that lets it work with local Python variables.\n",
"Consider the following:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"column_mean = df.mean(1)\n",
"result1 = df['A'] + column_mean\n",
"result2 = df.eval('A + @column_mean')\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``@`` character here marks a *variable name* rather than a *column name*, and lets you efficiently evaluate expressions involving the two \"namespaces\": the namespace of columns, and the namespace of Python objects.\n",
"Notice that this ``@`` character is only supported by the ``DataFrame.eval()`` *method*, not by the ``pandas.eval()`` *function*, because the ``pandas.eval()`` function only has access to the one (Python) namespace."
]
},
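{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration of that last point (reusing ``df``, ``column_mean``, and ``result1`` from the cells above), the same computation can be written for the top-level ``pd.eval()`` by referring to the local variable directly, since there is only the single Python namespace to draw from:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hedged sketch: with the top-level pd.eval() there is no '@' marker;\n",
"# local variables and the DataFrame are both referenced by name.\n",
"result3 = pd.eval('df.A + column_mean')\n",
"np.allclose(result1, result3)"
]
},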
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame.query() Method\n",
"\n",
"The ``DataFrame`` has another method based on evaluated strings, called the ``query()`` method.\n",
"Consider the following:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result1 = df[(df.A < 0.5) & (df.B < 0.5)]\n",
"result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with the example used in our discussion of ``DataFrame.eval()``, this is an expression involving columns of the ``DataFrame``.\n",
"It cannot be expressed using the ``DataFrame.eval()`` syntax, however!\n",
"Instead, for this type of filtering operation, you can use the ``query()`` method:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result2 = df.query('A < 0.5 and B < 0.5')\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand.\n",
"Note that the ``query()`` method also accepts the ``@`` flag to mark local variables:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Cmean = df['C'].mean()\n",
"result1 = df[(df.A < Cmean) & (df.B < Cmean)]\n",
"result2 = df.query('A < @Cmean and B < @Cmean')\n",
"np.allclose(result1, result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Performance: When to Use These Functions\n",
"\n",
"When considering whether to use these functions, there are two considerations: *computation time* and *memory use*.\n",
"Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas ``DataFrame``s will result in implicit creation of temporary arrays:\n",
"For example, this:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x = df[(df.A < 0.5) & (df.B < 0.5)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Is roughly equivalent to this:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tmp1 = df.A < 0.5\n",
"tmp2 = df.B < 0.5\n",
"tmp3 = tmp1 & tmp2\n",
"x = df[tmp3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the size of the temporary ``DataFrame``s is significant compared to your available system memory (typically several gigabytes) then it's a good idea to use an ``eval()`` or ``query()`` expression.\n",
"You can check the approximate size of your array in bytes using this:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"32000"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.values.nbytes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the performance side, ``eval()`` can be faster even when you are not maxing-out your system memory.\n",
"The issue is how your temporary ``DataFrame``s compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then ``eval()`` can avoid some potentially slow movement of values between the different memory caches.\n",
"In practice, I find that the difference in computation time between the traditional methods and the ``eval``/``query`` method is usually not significant; if anything, the traditional method is faster for smaller arrays!\n",
"The benefit of ``eval``/``query`` is mainly in the saved memory, and the sometimes cleaner syntax they offer.\n",
"\n",
"We've covered most of the details of ``eval()`` and ``query()`` here; for more information on these, you can refer to the Pandas documentation.\n",
"In particular, different parsers and engines can be specified for running these queries; for details on this, see the discussion within the [\"Enhancing Performance\" section](http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html)."
]
},
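{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough, machine-dependent sketch of that trade-off (the array size chosen here is arbitrary), one can compare the traditional masking expression with ``query()`` on a larger ``DataFrame``; for small frames the traditional form may well win:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Rough sketch: timings vary by machine; compare masking with query()\n",
"# on a larger DataFrame than the one used above.\n",
"df_big = pd.DataFrame(rng.rand(1000000, 3), columns=['A', 'B', 'C'])\n",
"%timeit df_big[(df_big.A < 0.5) & (df_big.B < 0.5)]\n",
"%timeit df_big.query('A < 0.5 and B < 0.5')"
]
},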
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Working with Time Series](03.11-Working-with-Time-Series.ipynb) | [Contents](Index.ipynb) | [Further Resources](03.13-Further-Resources.ipynb) >"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}