mirror of
https://github.com/donnemartin/data-science-ipython-notebooks.git
synced 2024-03-22 13:30:56 +08:00
233b197660
Source: https://github.com/jakevdp/PythonDataScienceHandbook unmodified
1638 lines
46 KiB
Python
1638 lines
46 KiB
Python
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"<!--BOOK_INFORMATION-->\n",
|
||
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
|
||
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
|
||
"\n",
|
||
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
|
||
"\n",
|
||
"*No changes were made to the contents of this notebook from the original.*"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"<!--NAVIGATION-->\n",
|
||
"< [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) | [Contents](Index.ipynb) | [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) >"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Combining Datasets: Concat and Append"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Some of the most interesting studies of data come from combining different data sources.\n",
|
||
"These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.\n",
|
||
"``Series`` and ``DataFrame``s are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.\n",
|
||
"\n",
|
||
"Here we'll take a look at simple concatenation of ``Series`` and ``DataFrame``s with the ``pd.concat`` function; later we'll dive into more sophisticated in-memory merges and joins implemented in Pandas.\n",
|
||
"\n",
|
||
"We begin with the standard imports:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {
|
||
"collapsed": true
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import pandas as pd\n",
|
||
"import numpy as np"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"For convenience, we'll define this function which creates a ``DataFrame`` of a particular form that will be useful below:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" <td>C0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" <td>C1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" <td>C2</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" A B C\n",
|
||
"0 A0 B0 C0\n",
|
||
"1 A1 B1 C1\n",
|
||
"2 A2 B2 C2"
|
||
]
|
||
},
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"def make_df(cols, ind):\n",
|
||
" \"\"\"Quickly make a DataFrame\"\"\"\n",
|
||
" data = {c: [str(c) + str(i) for i in ind]\n",
|
||
" for c in cols}\n",
|
||
" return pd.DataFrame(data, ind)\n",
|
||
"\n",
|
||
"# example DataFrame\n",
|
||
"make_df('ABC', range(3))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"In addition, we'll create a quick class that allows us to display multiple ``DataFrame``s side by side. The code makes use of the special ``_repr_html_`` method, which IPython uses to implement its rich object display:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"collapsed": true
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"class display(object):\n",
|
||
" \"\"\"Display HTML representation of multiple objects\"\"\"\n",
|
||
" template = \"\"\"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>{0}</p>{1}\n",
|
||
" </div>\"\"\"\n",
|
||
" def __init__(self, *args):\n",
|
||
" self.args = args\n",
|
||
" \n",
|
||
" def _repr_html_(self):\n",
|
||
" return '\\n'.join(self.template.format(a, eval(a)._repr_html_())\n",
|
||
" for a in self.args)\n",
|
||
" \n",
|
||
" def __repr__(self):\n",
|
||
" return '\\n\\n'.join(a + '\\n' + repr(eval(a))\n",
|
||
" for a in self.args)\n",
|
||
" "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The use of this will become clearer as we continue our discussion in the following section."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Recall: Concatenation of NumPy Arrays\n",
|
||
"\n",
|
||
"Concatenation of ``Series`` and ``DataFrame`` objects is very similar to concatenation of Numpy arrays, which can be done via the ``np.concatenate`` function as discussed in [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb).\n",
|
||
"Recall that with it, you can combine the contents of two or more arrays into a single array:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([1, 2, 3, 4, 5, 6, 7, 8, 9])"
|
||
]
|
||
},
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"x = [1, 2, 3]\n",
|
||
"y = [4, 5, 6]\n",
|
||
"z = [7, 8, 9]\n",
|
||
"np.concatenate([x, y, z])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The first argument is a list or tuple of arrays to concatenate.\n",
|
||
"Additionally, it takes an ``axis`` keyword that allows you to specify the axis along which the result will be concatenated:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([[1, 2, 1, 2],\n",
|
||
" [3, 4, 3, 4]])"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"x = [[1, 2],\n",
|
||
" [3, 4]]\n",
|
||
"np.concatenate([x, x], axis=1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Simple Concatenation with ``pd.concat``"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Pandas has a function, ``pd.concat()``, which has a similar syntax to ``np.concatenate`` but contains a number of options that we'll discuss momentarily:\n",
|
||
"\n",
|
||
"```python\n",
|
||
"# Signature in Pandas v0.18\n",
|
||
"pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,\n",
|
||
" keys=None, levels=None, names=None, verify_integrity=False,\n",
|
||
" copy=True)\n",
|
||
"```\n",
|
||
"\n",
|
||
"``pd.concat()`` can be used for a simple concatenation of ``Series`` or ``DataFrame`` objects, just as ``np.concatenate()`` can be used for simple concatenations of arrays:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"1 A\n",
|
||
"2 B\n",
|
||
"3 C\n",
|
||
"4 D\n",
|
||
"5 E\n",
|
||
"6 F\n",
|
||
"dtype: object"
|
||
]
|
||
},
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])\n",
|
||
"ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])\n",
|
||
"pd.concat([ser1, ser2])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"It also works to concatenate higher-dimensional objects, such as ``DataFrame``s:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df1</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df2</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>A4</td>\n",
|
||
" <td>B4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>pd.concat([df1, df2])</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>A4</td>\n",
|
||
" <td>B4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"df1\n",
|
||
" A B\n",
|
||
"1 A1 B1\n",
|
||
"2 A2 B2\n",
|
||
"\n",
|
||
"df2\n",
|
||
" A B\n",
|
||
"3 A3 B3\n",
|
||
"4 A4 B4\n",
|
||
"\n",
|
||
"pd.concat([df1, df2])\n",
|
||
" A B\n",
|
||
"1 A1 B1\n",
|
||
"2 A2 B2\n",
|
||
"3 A3 B3\n",
|
||
"4 A4 B4"
|
||
]
|
||
},
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df1 = make_df('AB', [1, 2])\n",
|
||
"df2 = make_df('AB', [3, 4])\n",
|
||
"display('df1', 'df2', 'pd.concat([df1, df2])')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"By default, the concatenation takes place row-wise within the ``DataFrame`` (i.e., ``axis=0``).\n",
|
||
"Like ``np.concatenate``, ``pd.concat`` allows specification of an axis along which concatenation will take place.\n",
|
||
"Consider the following example:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df3</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df4</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>C</th>\n",
|
||
" <th>D</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>C0</td>\n",
|
||
" <td>D0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>C1</td>\n",
|
||
" <td>D1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>pd.concat([df3, df4], axis='col')</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" <th>D</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" <td>C0</td>\n",
|
||
" <td>D0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" <td>C1</td>\n",
|
||
" <td>D1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"df3\n",
|
||
" A B\n",
|
||
"0 A0 B0\n",
|
||
"1 A1 B1\n",
|
||
"\n",
|
||
"df4\n",
|
||
" C D\n",
|
||
"0 C0 D0\n",
|
||
"1 C1 D1\n",
|
||
"\n",
|
||
"pd.concat([df3, df4], axis='col')\n",
|
||
" A B C D\n",
|
||
"0 A0 B0 C0 D0\n",
|
||
"1 A1 B1 C1 D1"
|
||
]
|
||
},
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df3 = make_df('AB', [0, 1])\n",
|
||
"df4 = make_df('CD', [0, 1])\n",
|
||
"display('df3', 'df4', \"pd.concat([df3, df4], axis='col')\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"We could have equivalently specified ``axis=1``; here we've used the more intuitive ``axis='col'``. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Duplicate indices\n",
|
||
"\n",
|
||
"One important difference between ``np.concatenate`` and ``pd.concat`` is that Pandas concatenation *preserves indices*, even if the result will have duplicate indices!\n",
|
||
"Consider this simple example:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>x</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>y</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>pd.concat([x, y])</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"x\n",
|
||
" A B\n",
|
||
"0 A0 B0\n",
|
||
"1 A1 B1\n",
|
||
"\n",
|
||
"y\n",
|
||
" A B\n",
|
||
"0 A2 B2\n",
|
||
"1 A3 B3\n",
|
||
"\n",
|
||
"pd.concat([x, y])\n",
|
||
" A B\n",
|
||
"0 A0 B0\n",
|
||
"1 A1 B1\n",
|
||
"0 A2 B2\n",
|
||
"1 A3 B3"
|
||
]
|
||
},
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"x = make_df('AB', [0, 1])\n",
|
||
"y = make_df('AB', [2, 3])\n",
|
||
"y.index = x.index # make duplicate indices!\n",
|
||
"display('x', 'y', 'pd.concat([x, y])')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Notice the repeated indices in the result.\n",
|
||
"While this is valid within ``DataFrame``s, the outcome is often undesirable.\n",
|
||
"``pd.concat()`` gives us a few ways to handle it."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Catching the repeats as an error\n",
|
||
"\n",
|
||
"If you'd like to simply verify that the indices in the result of ``pd.concat()`` do not overlap, you can specify the ``verify_integrity`` flag.\n",
|
||
"With this set to True, the concatenation will raise an exception if there are duplicate indices.\n",
|
||
"Here is an example, where for clarity we'll catch and print the error message:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"ValueError: Indexes have overlapping values: [0, 1]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"try:\n",
|
||
" pd.concat([x, y], verify_integrity=True)\n",
|
||
"except ValueError as e:\n",
|
||
" print(\"ValueError:\", e)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Ignoring the index\n",
|
||
"\n",
|
||
"Sometimes the index itself does not matter, and you would prefer it to simply be ignored.\n",
|
||
"This option can be specified using the ``ignore_index`` flag.\n",
|
||
"With this set to true, the concatenation will create a new integer index for the resulting ``Series``:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>x</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>y</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>pd.concat([x, y], ignore_index=True)</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"x\n",
|
||
" A B\n",
|
||
"0 A0 B0\n",
|
||
"1 A1 B1\n",
|
||
"\n",
|
||
"y\n",
|
||
" A B\n",
|
||
"0 A2 B2\n",
|
||
"1 A3 B3\n",
|
||
"\n",
|
||
"pd.concat([x, y], ignore_index=True)\n",
|
||
" A B\n",
|
||
"0 A0 B0\n",
|
||
"1 A1 B1\n",
|
||
"2 A2 B2\n",
|
||
"3 A3 B3"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"display('x', 'y', 'pd.concat([x, y], ignore_index=True)')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Adding MultiIndex keys\n",
|
||
"\n",
|
||
"Another option is to use the ``keys`` option to specify a label for the data sources; the result will be a hierarchically indexed series containing the data:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>x</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>y</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>pd.concat([x, y], keys=['x', 'y'])</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th rowspan=\"2\" valign=\"top\">x</th>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A0</td>\n",
|
||
" <td>B0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th rowspan=\"2\" valign=\"top\">y</th>\n",
|
||
" <th>0</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"x\n",
|
||
" A B\n",
|
||
"0 A0 B0\n",
|
||
"1 A1 B1\n",
|
||
"\n",
|
||
"y\n",
|
||
" A B\n",
|
||
"0 A2 B2\n",
|
||
"1 A3 B3\n",
|
||
"\n",
|
||
"pd.concat([x, y], keys=['x', 'y'])\n",
|
||
" A B\n",
|
||
"x 0 A0 B0\n",
|
||
" 1 A1 B1\n",
|
||
"y 0 A2 B2\n",
|
||
" 1 A3 B3"
|
||
]
|
||
},
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"display('x', 'y', \"pd.concat([x, y], keys=['x', 'y'])\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The result is a multiply indexed ``DataFrame``, and we can use the tools discussed in [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) to transform this data into the representation we're interested in."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Concatenation with joins\n",
|
||
"\n",
|
||
"In the simple examples we just looked at, we were mainly concatenating ``DataFrame``s with shared column names.\n",
|
||
"In practice, data from different sources might have different sets of column names, and ``pd.concat`` offers several options in this case.\n",
|
||
"Consider the concatenation of the following two ``DataFrame``s, which have some (but not all!) columns in common:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df5</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" <td>C1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" <td>C2</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df6</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" <th>D</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>B3</td>\n",
|
||
" <td>C3</td>\n",
|
||
" <td>D3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>B4</td>\n",
|
||
" <td>C4</td>\n",
|
||
" <td>D4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>pd.concat([df5, df6])</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" <th>D</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" <td>C1</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" <td>C2</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>B3</td>\n",
|
||
" <td>C3</td>\n",
|
||
" <td>D3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>B4</td>\n",
|
||
" <td>C4</td>\n",
|
||
" <td>D4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"df5\n",
|
||
" A B C\n",
|
||
"1 A1 B1 C1\n",
|
||
"2 A2 B2 C2\n",
|
||
"\n",
|
||
"df6\n",
|
||
" B C D\n",
|
||
"3 B3 C3 D3\n",
|
||
"4 B4 C4 D4\n",
|
||
"\n",
|
||
"pd.concat([df5, df6])\n",
|
||
" A B C D\n",
|
||
"1 A1 B1 C1 NaN\n",
|
||
"2 A2 B2 C2 NaN\n",
|
||
"3 NaN B3 C3 D3\n",
|
||
"4 NaN B4 C4 D4"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df5 = make_df('ABC', [1, 2])\n",
|
||
"df6 = make_df('BCD', [3, 4])\n",
|
||
"display('df5', 'df6', 'pd.concat([df5, df6])')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"By default, the entries for which no data is available are filled with NA values.\n",
|
||
"To change this, we can specify one of several options for the ``join`` and ``join_axes`` parameters of the concatenate function.\n",
|
||
"By default, the join is a union of the input columns (``join='outer'``), but we can change this to an intersection of the columns using ``join='inner'``:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df5</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" <td>C1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" <td>C2</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df6</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" <th>D</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>B3</td>\n",
|
||
" <td>C3</td>\n",
|
||
" <td>D3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>B4</td>\n",
|
||
" <td>C4</td>\n",
|
||
" <td>D4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>pd.concat([df5, df6], join='inner')</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>B1</td>\n",
|
||
" <td>C1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>B2</td>\n",
|
||
" <td>C2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>B3</td>\n",
|
||
" <td>C3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>B4</td>\n",
|
||
" <td>C4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"df5\n",
|
||
" A B C\n",
|
||
"1 A1 B1 C1\n",
|
||
"2 A2 B2 C2\n",
|
||
"\n",
|
||
"df6\n",
|
||
" B C D\n",
|
||
"3 B3 C3 D3\n",
|
||
"4 B4 C4 D4\n",
|
||
"\n",
|
||
"pd.concat([df5, df6], join='inner')\n",
|
||
" B C\n",
|
||
"1 B1 C1\n",
|
||
"2 B2 C2\n",
|
||
"3 B3 C3\n",
|
||
"4 B4 C4"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"display('df5', 'df6',\n",
|
||
" \"pd.concat([df5, df6], join='inner')\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Another option is to directly specify the index of the remaininig colums using the ``join_axes`` argument, which takes a list of index objects.\n",
|
||
"Here we'll specify that the returned columns should be the same as those of the first input:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df5</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" <td>C1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" <td>C2</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df6</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" <th>D</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>B3</td>\n",
|
||
" <td>C3</td>\n",
|
||
" <td>D3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>B4</td>\n",
|
||
" <td>C4</td>\n",
|
||
" <td>D4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>pd.concat([df5, df6], join_axes=[df5.columns])</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" <th>C</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" <td>C1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" <td>C2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>B3</td>\n",
|
||
" <td>C3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>B4</td>\n",
|
||
" <td>C4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"df5\n",
|
||
" A B C\n",
|
||
"1 A1 B1 C1\n",
|
||
"2 A2 B2 C2\n",
|
||
"\n",
|
||
"df6\n",
|
||
" B C D\n",
|
||
"3 B3 C3 D3\n",
|
||
"4 B4 C4 D4\n",
|
||
"\n",
|
||
"pd.concat([df5, df6], join_axes=[df5.columns])\n",
|
||
" A B C\n",
|
||
"1 A1 B1 C1\n",
|
||
"2 A2 B2 C2\n",
|
||
"3 NaN B3 C3\n",
|
||
"4 NaN B4 C4"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"display('df5', 'df6',\n",
|
||
" \"pd.concat([df5, df6], join_axes=[df5.columns])\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The combination of options of the ``pd.concat`` function allows a wide range of possible behaviors when joining two datasets; keep these in mind as you use these tools for your own data."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### The ``append()`` method\n",
|
||
"\n",
|
||
"Because direct array concatenation is so common, ``Series`` and ``DataFrame`` objects have an ``append`` method that can accomplish the same thing in fewer keystrokes.\n",
|
||
"For example, rather than calling ``pd.concat([df1, df2])``, you can simply call ``df1.append(df2)``:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df1</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df2</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>A4</td>\n",
|
||
" <td>B4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>\n",
|
||
"<div style=\"float: left; padding: 10px;\">\n",
|
||
" <p style='font-family:\"Courier New\", Courier, monospace'>df1.append(df2)</p><div>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>A</th>\n",
|
||
" <th>B</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>A1</td>\n",
|
||
" <td>B1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>A2</td>\n",
|
||
" <td>B2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>A3</td>\n",
|
||
" <td>B3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>A4</td>\n",
|
||
" <td>B4</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" </div>"
|
||
],
|
||
"text/plain": [
|
||
"df1\n",
|
||
" A B\n",
|
||
"1 A1 B1\n",
|
||
"2 A2 B2\n",
|
||
"\n",
|
||
"df2\n",
|
||
" A B\n",
|
||
"3 A3 B3\n",
|
||
"4 A4 B4\n",
|
||
"\n",
|
||
"df1.append(df2)\n",
|
||
" A B\n",
|
||
"1 A1 B1\n",
|
||
"2 A2 B2\n",
|
||
"3 A3 B3\n",
|
||
"4 A4 B4"
|
||
]
|
||
},
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"display('df1', 'df2', 'df1.append(df2)')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Keep in mind that unlike the ``append()`` and ``extend()`` methods of Python lists, the ``append()`` method in Pandas does not modify the original object–instead it creates a new object with the combined data.\n",
|
||
"It also is not a very efficient method, because it involves creation of a new index *and* data buffer.\n",
|
||
"Thus, if you plan to do multiple ``append`` operations, it is generally better to build a list of ``DataFrame``s and pass them all at once to the ``concat()`` function.\n",
|
||
"\n",
|
||
"In the next section, we'll look at another more powerful approach to combining data from multiple sources, the database-style merges/joins implemented in ``pd.merge``.\n",
|
||
"For more information on ``concat()``, ``append()``, and related functionality, see the [\"Merge, Join, and Concatenate\" section](http://pandas.pydata.org/pandas-docs/stable/merging.html) of the Pandas documentation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"<!--NAVIGATION-->\n",
|
||
"< [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) | [Contents](Index.ipynb) | [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) >"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"anaconda-cloud": {},
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.4.3"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 0
|
||
}
|