data-science-ipython-notebooks/scikit-learn/scikit-learn-random-forest.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# scikit-learn: Random Forest"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Credits: Forked from [PyCon 2015 Scikit-learn Tutorial](https://github.com/jakevdp/sklearn_pycon2015) by Jake VanderPlas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn\n",
"from sklearn.linear_model import LinearRegression\n",
"from scipy import stats\n",
"import pylab as pl\n",
"\n",
"seaborn.set()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random Forest Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Random forests are an example of an *ensemble learner* built on decision trees.\n",
"For this reason we'll start by discussing decision trees themselves.\n",
"\n",
"Decision trees are an extremely intuitive way to classify or label objects. You simply ask a series of questions designed to zero in on the classification:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x104e7a790>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import fig_code\n",
"fig_code.plot_example_decision_tree()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The binary splitting makes this extremely efficient.\n",
"As always, though, the trick is to *ask the right questions*.\n",
"This is where the algorithmic process comes in: in training a decision tree classifier, the algorithm looks at the features and decides which questions (or \"splits\") contain the most information.\n",
"\n",
"### Creating a Decision Tree\n",
"\n",
"Here's an example of a decision tree classifier in scikit-learn. We'll start by defining some two-dimensional labeled data:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10cbc3c10>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.datasets import make_blobs\n",
"\n",
"X, y = make_blobs(n_samples=300, centers=4,\n",
" random_state=0, cluster_std=1.0)\n",
"plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');"
]
},
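{
"cell_type": "markdown",
"metadata": {},
"source": [
"Returning to the remark that training looks for the splits carrying the most information, here is a minimal sketch (not the repository's code) that scores a candidate threshold by the weighted Gini impurity of the two groups it produces. Gini impurity is the default `criterion` of scikit-learn's `DecisionTreeClassifier`; the helper names `gini` and `split_impurity` and the toy arrays are assumptions made purely for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def gini(labels):\n",
"    # Gini impurity: 1 - sum of squared class proportions (0 for a pure group)\n",
"    if len(labels) == 0:\n",
"        return 0.0\n",
"    _, counts = np.unique(labels, return_counts=True)\n",
"    p = counts / counts.sum()\n",
"    return 1.0 - np.sum(p ** 2)\n",
"\n",
"def split_impurity(x, labels, threshold):\n",
"    # Weighted impurity of the two children produced by the question x <= threshold\n",
"    left = labels[x <= threshold]\n",
"    right = labels[x > threshold]\n",
"    n = len(labels)\n",
"    return (len(left) * gini(left) + len(right) * gini(right)) / n\n",
"\n",
"# Hypothetical one-dimensional toy data with two well-separated classes\n",
"x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])\n",
"labels = np.array([0, 0, 0, 1, 1, 1])\n",
"\n",
"print(split_impurity(x, labels, 5.0))  # threshold between the two groups\n",
"print(split_impurity(x, labels, 1.5))  # threshold that leaves a mixed child\n",
"```\n",
"\n",
"A lower score means purer children. A tree learner would, roughly speaking, search over features and thresholds for the split with the lowest score and then repeat the search inside each child."
]
},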
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10cd3ccd0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Convenience functions from the repository's fig_code module help with plotting\n",
"from fig_code import visualize_tree, plot_tree_interactive\n",
"\n",
"# Using IPython's ``interact`` (available in IPython 2.0+; requires a live kernel), we can view the decision tree splits:\n",
"plot_tree_interactive(X, y);"
]
},
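{
"cell_type": "markdown",
"metadata": {},
"source": [
"If no live kernel is available for the interactive view, the effect of depth can also be inspected with a short loop. This is a sketch that assumes the blobs are regenerated with the same `make_blobs` call as above; `get_n_leaves` and `score` are standard `DecisionTreeClassifier` methods in recent scikit-learn releases, and the chosen depths are arbitrary.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"X, y = make_blobs(n_samples=300, centers=4,\n",
"                  random_state=0, cluster_std=1.0)\n",
"\n",
"# max_depth=None lets the tree grow until every leaf is pure\n",
"for depth in (1, 2, 4, None):\n",
"    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)\n",
"    # number of leaves and accuracy on the training data itself\n",
"    print(depth, tree.get_n_leaves(), tree.score(X, y))\n",
"```\n",
"\n",
"The number of leaves and the training accuracy both grow with depth, which is worth keeping in mind for the question posed below."
]
},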
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that at each increase in depth, every node is split in two **except** those nodes which contain only a single class.\n",
"The result is a very fast **non-parametric** classification, and can be extremely useful in practice.\n",
"\n",
"**Question: Do you see any problems with this?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Decision Trees and over-fitting\n",
"\n",
"One issue with decision trees is that it is very easy to create trees which **over-fit** the data. That is, they are flexible enough that they can learn the structure of the noise in the data rather than the signal! For example, take a look at two trees built on two subsets of this dataset:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10dcd7550>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10dcd7510>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10dcd7990>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10e44ae50>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"clf = DecisionTreeClassifier()\n",
"\n",
"plt.figure()\n",
"visualize_tree(clf, X[:200], y[:200], boundaries=False)\n",
"plt.figure()\n",
"visualize_tree(clf, X[-200:], y[-200:], boundaries=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The details of the classifications are completely different! That is an indication of **over-fitting**: when you predict the value for a new point, the result is more reflective of the noise in the model rather than the signal."
]
},
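{
"cell_type": "markdown",
"metadata": {},
"source": [
"One hedged way to put a number on over-fitting is to compare accuracy on the data the tree was trained on with accuracy on held-out data. The sketch below is an illustration, not part of the original tutorial: the 75/25 split from `train_test_split` and the fixed `random_state` values are assumptions chosen for reproducibility.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"X, y = make_blobs(n_samples=300, centers=4,\n",
"                  random_state=0, cluster_std=1.0)\n",
"\n",
"# Hold out a quarter of the points (the default test_size) for evaluation\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
"\n",
"# An unconstrained tree keeps splitting until its leaves are pure\n",
"tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)\n",
"\n",
"print('train accuracy:', tree.score(X_train, y_train))\n",
"print('test accuracy: ', tree.score(X_test, y_test))\n",
"```\n",
"\n",
"A fully grown tree typically scores perfectly on its own training points, so a visibly lower held-out score is the gap that over-fitting produces."
]
},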
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ensembles of Estimators: Random Forests\n",
"\n",
"One possible way to address over-fitting is to use an **Ensemble Method**: this is a meta-estimator which essentially averages the results of many individual estimators which over-fit the data. Somewhat surprisingly, the resulting estimates are much more robust and accurate than the individual estimates which make them up!\n",
"\n",
"One of the most common ensemble methods is the **Random Forest**, in which the ensemble is made up of many decision trees which are in some way perturbed.\n",
"\n",
"There are volumes of theory and precedent about how to randomize these trees, but as an example, let's imagine an ensemble of estimators fit on subsets of the data. We can get an idea of what these might look like as follows:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10dce33d0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def fit_randomized_tree(random_state=0):\n",
"    X, y = make_blobs(n_samples=300, centers=4,\n",
"                      random_state=0, cluster_std=2.0)\n",
"    clf = DecisionTreeClassifier(max_depth=15)\n",
"\n",
"    # Fit on a random 250-point subset selected by random_state\n",
"    rng = np.random.RandomState(random_state)\n",
"    i = np.arange(len(y))\n",
"    rng.shuffle(i)\n",
"    visualize_tree(clf, X[i[:250]], y[i[:250]], boundaries=False,\n",
"                   xlim=(X[:, 0].min(), X[:, 0].max()),\n",
"                   ylim=(X[:, 1].min(), X[:, 1].max()))\n",
"\n",
"# `interact` is provided by the ipywidgets package; a (min, max) tuple yields a slider\n",
"from ipywidgets import interact\n",
"interact(fit_randomized_tree, random_state=(0, 100));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See how the details of the model change as a function of the sample, while the larger characteristics remain the same!\n",
"The random forest classifier will do something similar to this, but use a combined version of all these trees to arrive at a final answer:"
]
},
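{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before relying on the library, the idea of combining many perturbed trees can be sketched by hand. The code below is a simplified illustration under stated assumptions: each tree is fit on a bootstrap sample (drawn with replacement) and the trees are combined by a plain majority vote. The number of trees (25) is arbitrary. scikit-learn's `RandomForestClassifier` goes further, for example by also randomizing the features considered at each split and by averaging predicted class probabilities rather than counting hard votes.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"X, y = make_blobs(n_samples=300, centers=4,\n",
"                  random_state=0, cluster_std=2.0)\n",
"rng = np.random.RandomState(0)\n",
"\n",
"trees = []\n",
"for _ in range(25):\n",
"    # Bootstrap sample: draw len(y) indices with replacement\n",
"    idx = rng.randint(0, len(y), len(y))\n",
"    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))\n",
"\n",
"# Per-tree predictions, shape (n_trees, n_samples)\n",
"votes = np.array([t.predict(X) for t in trees])\n",
"\n",
"# Majority vote for each sample across the trees\n",
"majority = np.array([np.bincount(col, minlength=4).argmax() for col in votes.T])\n",
"\n",
"print('agreement with labels:', (majority == y).mean())\n",
"```\n",
"\n",
"Because each tree sees a slightly different sample, their individual errors tend to differ, and the vote smooths many of them out."
]
},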
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x100431790>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)\n",
"visualize_tree(clf, X, y, boundaries=False);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By averaging over 100 randomly perturbed models, we end up with an overall model which is a much better fit to our data!\n",
"\n",
"*(Note: above we randomized the model through sub-sampling. Random forests use more sophisticated means of randomization, which you can read about in, e.g., the [scikit-learn documentation](http://scikit-learn.org/stable/modules/ensemble.html#forest).)*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
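"A random forest classifier may not be the best choice for:\n",
"\n",
"* Heavily imbalanced labels (many 0s, few 1s)\n",
"* Structured data like images, where a neural network might be better\n",
"* Small data sets, where the model might overfit\n",
"* High-dimensional data, where a linear model might work better"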
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random Forest Regressor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above we considered random forests in the context of classification.\n",
"Random forests also work for regression (that is, predicting continuous rather than categorical target variables). The estimator to use for this is ``sklearn.ensemble.RandomForestRegressor``.\n",
"\n",
"Let's quickly demonstrate how this can be used:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x110b05d90>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
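"# 100 sample points drawn uniformly from [0, 10)\n",
"x = 10 * np.random.rand(100)\n",
"\n",
"def model(x, sigma=0.3):\n",
"    # Sum of a fast and a slow sine wave, plus Gaussian noise of scale sigma\n",
"    fast_oscillation = np.sin(5 * x)\n",
"    slow_oscillation = np.sin(0.5 * x)\n",
"    noise = sigma * np.random.randn(len(x))\n",
"\n",
"    return slow_oscillation + fast_oscillation + noise\n",
"\n",
"y = model(x)\n",
"# Error bars of 0.3 match the default noise scale sigma\n",
"plt.errorbar(x, y, 0.3, fmt='o');"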
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x110d71cd0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
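"xfit = np.linspace(0, 10, 1000)\n",
"# scikit-learn expects 2D feature arrays, so add a column axis with [:, None]\n",
"forest = RandomForestRegressor(n_estimators=100)\n",
"yfit = forest.fit(x[:, None], y).predict(xfit[:, None])\n",
"# Noise-free curve for comparison (sigma=0)\n",
"ytrue = model(xfit, 0)\n",
"\n",
"plt.errorbar(x, y, 0.3, fmt='o')\n",
"plt.plot(xfit, yfit, '-r');\n",
"plt.plot(xfit, ytrue, '-k', alpha=0.5);"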
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the non-parametric random forest model is flexible enough to fit the multi-period data, without us even specifying a multi-period model!\n",
"\n",
"There is a tradeoff, though, between the simplicity of this approach and thinking carefully about what your data actually is.\n",
"\n",
"Feature engineering is still important, and it requires knowing your domain: for periodic data like this, for example, a Fourier transform exposes the frequency distribution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random Forest Limitations\n",
"\n",
"The following data scenarios are often poorly suited to random forests:\n",
"\n",
"* Heavily imbalanced targets (`y` has many 0s and few 1s)\n",
"* Structured data like images, where a neural network might be better\n",
"* Small data sets, which might lead to overfitting\n",
"* High-dimensional data, where a linear model might work better"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}