2015-04-28 17:27:02 -04:00
{
" cells " : [
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
2015-05-31 09:59:13 -04:00
" # scikit-learn-random forest "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Credits: Forked from [PyCon 2015 Scikit-learn Tutorial](https://github.com/jakevdp/sklearn_pycon2015) by Jake VanderPlas "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 1 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [ ] ,
" source " : [
" %matplotlib inline \n " ,
" import numpy as np \n " ,
" import matplotlib.pyplot as plt \n " ,
" import seaborn; \n " ,
" from sklearn.linear_model import LinearRegression \n " ,
" from scipy import stats \n " ,
" import pylab as pl \n " ,
" \n " ,
" seaborn.set() "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Random Forest Classifier "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Random forests are an example of an *ensemble learner* built on decision trees. \n " ,
" For this reason we'll start by discussing decision trees themselves. \n " ,
" \n " ,
" Decision trees are an extremely intuitive way to classify or label objects: you simply ask a series of questions designed to zero in on the classification: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 2 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" image/png " : " <base64 PNG data truncated in source> " ,
" text/plain " : [
" <matplotlib.figure.Figure at 0x104e7a790> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
}
] ,
" source " : [
" import fig_code \n " ,
" fig_code.plot_example_decision_tree() "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" The binary splitting makes this extremely efficient. \n " ,
" As always, though, the trick is to *ask the right questions*. \n " ,
" This is where the algorithmic process comes in: in training a decision tree classifier, the algorithm looks at the features and decides which questions (or \"splits\") contain the most information. \n " ,
" \n " ,
" ### Creating a Decision Tree \n " ,
" \n " ,
" Here's an example of a decision tree classifier in scikit-learn. We'll start by defining some two-dimensional labeled data: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 3 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" image/png " : " <base64 PNG data truncated in source> " ,
" text/plain " : [
" <matplotlib.figure.Figure at 0x10cbc3c10> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
}
] ,
" source " : [
" from sklearn.datasets import make_blobs \n " ,
" \n " ,
" X, y = make_blobs(n_samples=300, centers=4, \n " ,
" random_state=0, cluster_std=1.0) \n " ,
" plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow'); "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 4 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" image/png " : " <base64 PNG data truncated in source> " ,
" text/plain " : [
" <matplotlib.figure.Figure at 0x10cd3ccd0> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
}
] ,
" source " : [
" # We have some convenience functions in the repository that help \n " ,
" from fig_code import visualize_tree, plot_tree_interactive \n " ,
" \n " ,
" # Now using IPython's ``interact`` (available in IPython 2.0+, and requires a live kernel) we can view the decision tree splits: \n " ,
" plot_tree_interactive(X, y); "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Notice that at each increase in depth, every node is split in two **except** those nodes which contain only a single class. \n " ,
" The result is a very fast **non-parametric** classifier, which can be extremely useful in practice. \n " ,
" \n " ,
" **Question: Do you see any problems with this?** "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ### Decision Trees and over-fitting \n " ,
" \n " ,
" One issue with decision trees is that it is very easy to create trees which **over-fit** the data. That is, they are flexible enough that they can learn the structure of the noise in the data rather than the signal! For example, take a look at two trees built on two subsets of this dataset: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 5 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" text/plain " : [
" <matplotlib.figure.Figure at 0x10dcd7550> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
} ,
{
" data " : {
" image/png " : " <base64 PNG data truncated in source> " ,
" text/plain " : [
" <matplotlib.figure.Figure at 0x10dcd7510> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
} ,
{
" data " : {
" text/plain " : [
" <matplotlib.figure.Figure at 0x10dcd7990> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
} ,
{
" data " : {
" image/png " : " <base64 PNG data truncated in source> " ,
" text/plain " : [
" <matplotlib.figure.Figure at 0x10e44ae50> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
}
] ,
" source " : [
" from sklearn.tree import DecisionTreeClassifier \n " ,
" clf = DecisionTreeClassifier() \n " ,
" \n " ,
" plt.figure() \n " ,
" visualize_tree(clf, X[:200], y[:200], boundaries=False) \n " ,
" plt.figure() \n " ,
" visualize_tree(clf, X[-200:], y[-200:], boundaries=False) "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" The details of the classifications are completely different! That is an indication of **over-fitting**: when you predict the value for a new point, the result reflects the noise in the training data more than the signal. "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Ensembles of Estimators: Random Forests \n " ,
" \n " ,
" One possible way to address over-fitting is to use an **Ensemble Method**: this is a meta-estimator which essentially averages the results of many individual estimators which over-fit the data. Somewhat surprisingly, the resulting estimates are much more robust and accurate than the individual estimates which make them up! \n " ,
" \n " ,
" One of the most common ensemble methods is the **Random Forest**, in which the ensemble is made up of many decision trees which are in some way perturbed. \n " ,
" \n " ,
" There are volumes of theory and precedent about how to randomize these trees, but as an example, let's imagine an ensemble of estimators fit on subsets of the data. We can get an idea of what these might look like as follows: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 6 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" image/png " : " <base64 PNG data truncated in source> " ,
" text/plain " : [
" <matplotlib.figure.Figure at 0x10dce33d0> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
}
] ,
" source " : [
"def fit_randomized_tree(random_state=0):\n",
"    # The data set is fixed; random_state only changes which points are kept.\n",
"    X, y = make_blobs(n_samples=300, centers=4,\n",
"                      random_state=0, cluster_std=2.0)\n",
"    clf = DecisionTreeClassifier(max_depth=15)\n",
"\n",
"    # Shuffle the indices and fit the tree on a random subset of 250 points.\n",
"    rng = np.random.RandomState(random_state)\n",
"    i = np.arange(len(y))\n",
"    rng.shuffle(i)\n",
"    visualize_tree(clf, X[i[:250]], y[i[:250]], boundaries=False,\n",
"                   xlim=(X[:, 0].min(), X[:, 0].max()),\n",
"                   ylim=(X[:, 1].min(), X[:, 1].max()))\n",
"\n",
"# Newer IPython releases ship the widgets in the separate ipywidgets package.\n",
"try:\n",
"    from ipywidgets import interact\n",
"except ImportError:\n",
"    from IPython.html.widgets import interact\n",
"interact(fit_randomized_tree, random_state=[0, 100]);"
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" See how the details of the model change as a function of the sample, while the larger characteristics remain the same! \n " ,
" The random forest classifier will do something similar to this, but use a combined version of all these trees to arrive at a final answer: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 7 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" image/png " : " iVBORw0KGgoAAAANSUhEUgAAAd0AAAFRCAYAAAAxT3fNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz \n AAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xd8VFX6x/F3KiH03jsoHURECaKg2BXddVVsa+8/d+2s \n jaauvZdVV13dtbu69t4RxC4qGgERBKT33pLfHzeByWQSkpBkhnA/r1deOme+5zzPnbnDufe55zxP \n Um5urpCQkJCQkJCKJzneDoSEhISEhOwohJNuSEhISEhIJRFOuiEhISEhIZVEOOmGhISEhIRUEuGk \n GxISEhISUkmEk25ISEhISEglkVrhFpKSdo1q6YzsYl6Hmu1cM4k+2Yw8khaRommse4yrR/N+vH0M \n NaEm1FSoJt7246vJzf1KEYR3uiHlzmvsf0TUhAvtqVafA+LhU0hISEgiEE66IeVOJrVTinivNvUq \n 1ZmQkJCQBCKcdEPKnTX8vrqI9+Yzu1KdCQkJCUkgwkk3pNw5lQ8f4cfo9rEsqMZz8fApJCQkJBGo \n +IVUITscDdjYmEtv4ILO9KpO2m/MWswDw2NMxiEhISE7CpUx6XaOep21ldehpgpojmY8HptP2hqS \n D2BXLFXwfEj44wg1oSbUlEkTb/vx1hS5erkyJt3o5dax2kJNFdU03vJ6XTzsh5pQE2ripom3/Xhr \n YhI+0w0JCQkJCakkwvByqImXJt72Q02oCTXh77uiNGF4OdQkpCbe9kNNqAk14e+7ojQxCcPLISEh \n ISEhlUQYXg418dLE236oCTWhJvx9V5QmDC+HmoTUxNt+qAk1oSb8fVeUJiZheDkkJCQkJKSSCMPL \n oSZemnjbz3qd5tkMyqTGcmaeyaq6cfQn1ISaKqSJt/14a8LwcqhJSE3c7N/H3j244iLqwlo8wrQW \n /O9wFiaiz6Em1Gxnmnjbj7cmJmF4OWSH4yuqN+WIgXkTLmTgHNpP4tw4uhYSElLFCcPLoSZemrjZ \n f4d9LqJRtCAJ7djNlnM2YXwONaFmO9PE2368NWF4OdQkpCYu9ufTu5gTf1NU34TwOdSEmu1QE2/7 \n 8dbEJAwvhyQkyyrw3NyHN98NKh4V4vew9GBISEgFEoaXQ028NIXaNpJ1O39owq51qLOYRWnMOYEH \n ytP+oXiYydPo3Z50yMVLLN2ZD4Xh5VATarZVE2/78daE4eVQk5CaAm23cfxZHFgneLwK9WbTdgRf \n jeHx8rR/GneNoEk99qtJzdnM3JcJAxlXhuMINaEm1CSe/XhrYlIZk25IyFb5kJo92D1iwgUtSO3I \n IYt4qkHwvLXcGMNHgr98oqMylcKN9Mhl/2U0vZqPMoMb75CQkCpIGF4ONfHSFGj7gZ7DqB9Doxtt \n vmH3ISyuZB8rTLOeATex11oOO45GXUmdz7DH+KUd/zyQ2Ynmc6gJNaXQxNt+vDVheDnUJKRmc1sK \n q2awtmGwZbYAv7OkOd9gTRx8LHfNf2i5gv1a0eEEklLy2huTdA4d7+CYfTkzLYF8DjWhpgyaeNuP \n tyYmYXi5ijGWmm9wTBu6/c6Mdjx3Mr+Xt53/0GwyR1Qnoy5zzyA7bRvGO4eZd/DTruwS2b4JX/PZ \n YVsm3O2euVxyAR3fQkqM94fS4wZ6Xc06uI5dctm7BnVzePriUvzAQ0JCEoswvFyFNG/SYgHnj6Fl \n quDB4Hsc9hCPnp63Src8bN3JkF4ceQK1kvA7OTcx6FzuqhfMk2UKPx3Cd3eQPpSd2pP2A2s/4PeT \n eUkVWVH8PvUH0nc1asXoCC1JS2W3HHKu55SjGdgp7/v7jiE38s5wnqksn0NNqCmDJt72460Jw8s7 \n 
guZHTruIlvkNSRhC3Uc5ZBXX1CgHW+Oo143RgyLmjOYkX8KuYxhwHQ+WwucCbZ1wHndfT58ldGzO \n 95cG1w4J/9mXVPMJO51ARm0FEzxHMpZF3Xj+Do48g0GNI26Ie5JRgwPG8MYIPq8Mn0NNqCmjJt72 \n 462JSZgco4rwAxk70SPWe4fQ8SF6lYed8QzaN8aCp2poR99tHT8NI/j6dp69lJ+2dbxE41Smfcq0 \n JDTAlKj3l5H7Ae8OZUkNejWOEYHuQHo19q0Uh0NCQsqVMLxcRTS1qJYWO4SsGkkN6K4cvosGtE2K \n IYLq1MmzUe7hp0lkvsLeDejQmv8ewNyyjBNvTUss4NNfaT2I1LH4Lu+9eSxcxQcjeBWda9Ekxtig \n Lo3F/qwT9thDzQ6libf9eGvC8HJV17Qh+2V+OCDGCfAGM/bn6fKwlczrU+neMbi5LcB0vo3Qllv4 \n aTRd+nD6pTRJxmd0u5JXRnFbWtH9Eur7iXx9Adn3sQy71KHRMuYn8e25PBLZYSZfb6R99I90Jabx \n iaI/64Q99lCzQ2nibT/empiEq5erEOv51xu0O4hm+W3fsfJHnjyGDeVh42R+voL3ruDgmhHtTzKt \n Lf8uDxuRvE2TwZy3V0QZvj2o1Y6jRzLl77xS3jZjsQGj2COT3j2YN5Ql2zLeuYxXcJItlJhjGG/c \n yU4X0jP/OdBG3MFnFwaLy0JCQrYzwkm3CnEx3zzM/13DMa3YaQGzNvDqaL5UjtmWhjPqGia3Z/d0 \n MpawqCl3HV8BW5O+Y/DFERNuPk1IbcZeKmHSvZeuuPQ8ujYm5UP+cBVvjuT24rZJLSLlZv68M4OS \n 2DiTqR157NgSfk5tWNePv4zglDZ0S6P6z3x6Ev9qGsy/ISEh2xnhpFvFOI0ZuEkwyZY45FEa6pBz \n Y5ALOT8fcmcVMOFCJhlFPUPOpEZF2IxkLqlpXHkmO+e3DaFBH44dzfxri3h2swG3cc0I9s/c0tzr \n SXZ5gguPL6H9gawcyN15LyvsOy2OUQxtysENaL2MudN5byRPbMu+7JCQHZVw0g1JaNby63LUjmrP \n xUx+rWj7t3HQ1RETbj71SW7DIEVMutex+xkMyoxqP4721/Bn/K/8vS1/RjLsFM5vu+UZfuOFdBtN \n nWt5P56+hYRsj4RbhkISmrMYey9f5kS1P8v0LgUrD1UI6TQpKolFjRhh73xS6de2iNXkLYMtyQnP \n MpLbMbRt1KK5hqTsxkHTYqTsDAkJKZ4qvWVoPns/wh8b03oDG2qyehj/Sim+X7yXmu8omhL1q8Ee \n R/HAbcFz452rUXsuk3bmlf2DLUp1KtDHrLYsncHGNjF+KytYWtQ46dTOFVUyKY/cYO9ton8/WV9Q \n vwdtY2gNoNlHHN6etZXlT6jZrjTxth9vzY63Zeg9amcz+FI65k+y88m5iiXXc0Nl+xNqyt6vI9mX \n BAUPKPq5ZoX4eBLZN7DXlWRFhoW+Zfki/oP5scZpxZL32Ts6kcgaTAnKCY6vKJ/LS1Ob3+axHA2j \n hbNYk87HlelPqNnuNPG2H29NTKrsM92POW0UHSPvNBqT/GcOvZHXhvN9vHyrDN6k+Vcc0Zjmy1la \n j+9PjcMinMrkEVp8R1Ynks+0bQUY8knDAfztGi7cmd3qUudXJi/iuauCSSfmqvBjmTeCf+LMfaiX \n hN/Y8CAfXcJjtoMQcz9WX8MXB3FQ9B37O3xzKXPi4lhIyHZMlQ0vd2DXWKG9LmS8yZG27FtNtLDE \n NmuepnMD/nplRH7kbPa7k2Z/5e0E8bncxl7HgNs4YxB9T6HGXHIe4ojWPHIIM7fVVr/g7/lV/G89 \n ex4YTLZsJfvWGL4fy6iHOWEV8zP5YTTfpwQTbsKcL8VpjuLFW2nxR7q0J20OOf9lyu48m6g+h5qE \n 
0MTbfrw1O154eQOrinJoGQui+iZaWGKbNAs4d1hUEZvOpH3PPj/wYPeCz+G2+/DTnRx1AXtXz3vd \n jORz6HQPx63m1MxgsfM226oR/OWUZpyBwd+i0tpKFE1ndODU6+mbzt4r+GIUH+dFERYmos+hJmE0 \n 8bYfb01Mquzq5WlM3BSj/StW1S94t1elmEhG+yJCngfT8uFgm0uVojm9q8doP5puY9i70h2qYuQV \n ofjyb7x23ZYJNyQkpAx
" text/plain " : [
" <matplotlib.figure.Figure at 0x100431790> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
}
] ,
" source " : [
"from sklearn.ensemble import RandomForestClassifier\n",
"clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)\n",
"visualize_tree(clf, X, y, boundaries=False);"
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
"By averaging over 100 randomly perturbed models, we end up with an overall model that is a much better fit to our data!\n",
"\n",
"*(Note: above we randomized the model through sub-sampling. Random forests use more sophisticated means of randomization, which you can read about in, e.g., the [scikit-learn documentation](http://scikit-learn.org/stable/modules/ensemble.html#forest).)*"
]
} ,
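{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
"The idea of averaging perturbed trees can be sketched by hand. The cell below is a minimal illustration, not how `RandomForestClassifier` is implemented: it fits 100 decision trees on bootstrap resamples (sampling with replacement, a swapped-in variant of the sub-sampling used above) of a synthetic `make_blobs` data set and combines them by majority vote. The data set, the `_demo` variable names, the 200/100 train/test split and the number of trees are all assumptions made for this example."
]
} ,
{
" cell_type " : " code " ,
" execution_count " : null ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [ ] ,
" source " : [
"import numpy as np\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Hypothetical example data, similar in spirit to the blobs used above.\n",
"X_demo, y_demo = make_blobs(n_samples=300, centers=4,\n",
"                            random_state=0, cluster_std=2.0)\n",
"X_train, y_train = X_demo[:200], y_demo[:200]\n",
"X_test, y_test = X_demo[200:], y_demo[200:]\n",
"\n",
"rng = np.random.RandomState(0)\n",
"n_trees = 100\n",
"votes = np.zeros((n_trees, len(y_test)), dtype=int)\n",
"for k in range(n_trees):\n",
"    # Bootstrap: draw training points with replacement.\n",
"    idx = rng.randint(0, len(y_train), len(y_train))\n",
"    tree = DecisionTreeClassifier(random_state=k).fit(X_train[idx], y_train[idx])\n",
"    votes[k] = tree.predict(X_test)\n",
"\n",
"# Majority vote across the trees for each test point.\n",
"ensemble_pred = np.array([np.bincount(col, minlength=4).argmax() for col in votes.T])\n",
"single_tree_acc = np.mean(votes[0] == y_test)\n",
"ensemble_acc = np.mean(ensemble_pred == y_test)\n",
"print('single tree: %.2f, ensemble: %.2f' % (single_tree_acc, ensemble_acc))"
]
} ,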
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
"Random forests tend to be a poor fit for:\n",
"\n",
"* Highly imbalanced labels (lots of 0, few 1)\n",
"* Structured data like images, where a neural network might be better\n",
"* Small data sets, where the model might overfit\n",
"* High-dimensional data, where a linear model might work better"
]
} ,
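{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
"The label-imbalance case can be checked before fitting anything. As a minimal sketch, using a synthetic label vector that is purely an assumption of this example, the cell below counts the classes with `np.bincount`. If the classes are strongly imbalanced, one option scikit-learn offers is the `class_weight='balanced'` argument of `RandomForestClassifier`; whether it helps depends on the data, so treat it as something to try and not as a fix."
]
} ,
{
" cell_type " : " code " ,
" execution_count " : null ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [ ] ,
" source " : [
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"rng = np.random.RandomState(0)\n",
"# Hypothetical imbalanced problem: roughly 5% of the labels are 1.\n",
"X_imb = rng.randn(1000, 5)\n",
"y_imb = (rng.rand(1000) < 0.05).astype(int)\n",
"\n",
"# Count how many samples fall in each class.\n",
"counts = np.bincount(y_imb)\n",
"print('class counts: %s' % counts.tolist())\n",
"\n",
"# class_weight='balanced' weights samples inversely to their class frequency.\n",
"clf_imb = RandomForestClassifier(n_estimators=100, class_weight='balanced',\n",
"                                 random_state=0)\n",
"clf_imb.fit(X_imb, y_imb);"
]
} ,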
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Random Forest Regressor "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
"Above we were considering random forests within the context of classification.\n",
"Random forests can also be made to work in the case of regression (that is, continuous rather than categorical variables). The estimator to use for this is ``sklearn.ensemble.RandomForestRegressor``.\n",
"\n",
"Let's quickly demonstrate how this can be used:"
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 8 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" image/png " : " iVBORw0KGgoAAAANSUhEUgAAAeMAAAFVCAYAAADc5IdQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz \n AAALEgAACxIB0t1+/AAAIABJREFUeJzt3X2MHOdh3/Hf7h7JM+kVeZCvZgMFFAohD2RUhK8VoJg0 \n bKUvchKkrWWm6aGBXUVSK4dy6BceldimlcBSWhkkE0QRGRe2XKpJWiKy5KZGUEdFLeeFRFzBoCq1 \n bp9KqSWgqORe6bPudOLxjrvbP/bmuLc7s7c7O7PPM898PwCBveW9PPOy85vndSqtVksAAMCdqusC \n AABQdoQxAACOEcYAADhGGAMA4BhhDACAY4QxAACOTaT5IWNMTdKXJP2YpJakj1pr/1uWBQMAoCzS \n 1ox/RlLTWvteSccl/Xp2RQIAoFxShbG19g8l3bf+5Y2SFrIqEAAAZZOqmVqSrLUNY8xZSXdK+tnM \n SgQAQMlURl0O0xjzTknflnSztfZy3Pe0Wq1WpVIZ6e8AAFAgQ4Ve2gFcH5Z0g7X2X0i6LKm5/i++ \n RJWK5ueX0vwpDGh6us4+zhn7eDzYz/ljH+dvero+1Penbab+qqSzxpg/kbRN0settVdS/i4AAEot \n VRivN0f/o4zLAgBAKbHoBwAAjhHGAAA4RhgDAOAYYQwAgGOEMQAAjhHGAAA4RhgDAOAYYQwAgGOE \n MQAAjhHGAAA4RhgDAOAYYQwAgGOEMQAAjhHGAAA4RhgDAOAYYQwAgGOEMQAAjhHGAAA4RhgDAOAY \n YQwAgGOEMQAAjhHGAAA4RhgDAOAYYQwAgGOEMQAAjhHGAAA4RhgDAOAYYQwAgGOEMQAAjk24LgBQ \n BMfOXFCtVtEj971n03uSdOLwAVfFAhAIasYAADhGGAMA4BhhjMwdO3NhowkXALA1whgAAMcIYwAA \n HCOMAQBwjDBGLhaWVug3BoABEcYAADhGGAMA4BhhDACAY4QxAACOEcYAADhGGANbOHnuoi4truj/ \n LlzWyXMXN713aXFl4z0ASCvVU5uMMdskfUXSPkk7JD1srf16lgUDfHDy3EV995WFja+/+8qC/tmJ \n Z3W10dr03tHT53Xk0H7t21t3UUwABZe2Zvzzkuatte+T9JOSHsuuSIA//ntHEEc6gziysHRFjz71 \n wjiKBCBAaZ9n/KSkr66/rkq6mk1xUHRR860kLS6vOi4NABRDqpqxtXbZWvumMaaudjB/NttioYi6 \n m3TXGk0dPX1er76+tPFe0Z7odPONUz3vTdQqPe9N1XfoyKH94ygSgABVWq3eJrdBGGN+VNLTkk5b \n a89u8e3p/ggK5e/P/aHiTqfrd0/q7IMfkCTd8/AzkqTHj98xzqKN5K7P/7EuvdGu7UfbEvceAHTo \n vWvvI+0ArndKekbSYWvts4P8zPz80tbfhNSmp+vu93HCLVez2dooW2O9v9V5WYfwsTtv0UNPPLfx \n en5+KfY9ZMOLczlw7OP8TU8PN5gz7QCuz0jaLelBY8yz6/8mU/4uBCKuSVeSrjaam5qqi2bf3rqm \n 6pN6x563bYyWjt6bqk8yghrAyFLVjK21H5f08YzLgoKbm53R3Y98s+f9pbfW9OhTL+jU/QcdlAoA \n /MeiHwAAOEYYI5WkUdHvimmqZqQxAPRHGCNTc7MzqnaMIZyq79Cp+w/SrwoAfXgZxkWbi4rN6ju3 \n S5KqFRW2Rsw5CGCcvAxjFNtErapqRYw0BoABEcYAADhGGAMA4BhhjLEJ/RnA9DMDSIswxljEPRe4 \n +yESAFBWaR+hiJxFNawThw84Lkk24p4LHD0DuAgrc504fKBnPd9Qjg0A96gZw6k33rziugjoQnM7 \n 
MH6EMcYi7iES1cq1OckAUGaEMTJ34vABTdU3P8RrbnZGU/UdG19Hr5feWh1r2QDAR4QxxubIof2q \n VnpX5qJZFEDZEcYYWtopSjwDGADiEcYYShmmKIU+HxqAf7wLYy6Efus3RanTicMHCjn1pww3GwD8 \n 41UYcyGEa0k3Gw898ZyD0gAoC6/CeNBal89CH4wUN0Vpqr6jsI9KBAAfeBXGZVWkAI+bonTq/oPB \n DMhiPjQAF7wKY2pdxZA0RSkEcTcbU/VJTdT6f1QY6wBgFF6Fcei1rkH5fmEPfYpS5w3GIDcbjHUo \n hiK1QKF8nIdx9wck5FrXILiwu7dvb33jHBzkZiOEsQ4R328EgVA5D+Nuode6OsXdqYd0Ye/n5LmL \n arakZktaXGZJzHHpVzvkRhBwx7swRvgWl1c3XfTXGk0tLK0U9qIfyliHstwIAj4ijD0TyoU9yYnD \n B3S10ex5v9lSYS/6jHUAMCpvw3hhaSWxOS3kgRhluLC3Et5fXC7us423GutQhHM29BvBOEU4LigH \n b8O4zMo6iK3RVGH7KEMY61CGG0HAV4Sxh0K4sPdT6fN/9FG6VdYbQcA1whhjF9cc6qvOqT6f+2L4 \n zZmh3wgCviKMMXbdzaGdfOij7Jx21Tnq+/mX5gvbjA7Ab4RxhtIsmLC4vFrKRRbiAteHPsruubbd \n aEYHkAfCOCNpFkxYXF7VWsc0nyItsjDq84qjVa4qkld9lHFzbQEgb4RxRtIsmLAWM9+2bDWvSkWF \n 6qP0oRkdQHi8DOMThw9oqj7puhgooX6Dy67fPem8GR3pdHYh3XfiW66LA/RwGsYhLUo/7IIJSdtL \n zcut7sFl9Z3bNprRj999m8OSjS6kz9swuruQ1hrNwnQHoTychXFoi9IPs2BC0iCh+s5t1Lw80DnX \n 9lM/9+6NZvSbbtjjumiphfZ5GwZrbqMInIVxiB+QQRdMYJCQ30Kcaxvi5w0IiZd9xv2a03xuahv1 \n Ij5R8/JwYEBJI8x9PmfLIKkLac/bt3Nc4A1nV/9+H5Ck5rRQmtritr27Nj3q1CH4wZdztowPgYh0 \n dyFVK9JfvX6nvvfatWNQ1GsJwuEsjJP6WF95rffDEDWnhdLUFndxmKpP6rGnX+QJMoHx5Zwt+0Mg \n oi4kSarv3O7NcQEiTttFy7wofee213dud10cJ6ImfYzHoJ+3EFtloi6kaoXuIPjJ6VkZ18farzkt \n pKa2zm3n4hAun87ZEAempeXTcQGkDMLYGHObMebZLAoj9W9O872p7diZC1pYWnFdjMIIsQbWzfdz \n tqw4LvDNSGFsjHlA0pckxT+CJ6V+zWllbtoOVeihzDnrJ44LfDIx4s+/LOlDkn43g7JsiJrToteD \n /l9ZRIO8Qg6wkHDO+onjAp+MVDO21j4t6WpGZQEAoJRGrRkPbHo6/s6zVqvE/n/S+1v9n0u1WkWq \n VFSrVQYqW7Qd3bb62aTt921/pHXPw89Ikh4/foezMnTu47O/+oGN99PuY1/OWV/KsZVRypd0/iy+ \n tapmS7q0uKJHn3pRD330QGH2Rx7KuM0+G1sYz8/HT6ZvNFqx/5/0/lb/51Kj0ZJaLTUarYHKFm1H \n t61+Nm77p6fr3u2PtLI+vmma9R+57z09ZRhlH/tyzvpSjn5GPZfjtvHkuYtau3rtkaXPvzSvj/za \n N9RstjRRq3q9P/IQ0vXCV8Pe7GQ1pyY+VUpoqj45dF9u6AOYANeSFvlYemvVQWmAXiPXjK21r0gi \n SQAASInVJgAEL2mRj7Kufgf/jK3POFTRQh+jLusYPdknej03O5NF8VLpXh+bJvRwlHVRmrnZGR09 \n fV4LS1ckXVvkg7Xg4QvnYVzmC3207UlP9jlyaD/zH0fAhRadg/eOHNqvh554ThKLfMA/NFN7gCfI \n 
9FpYWiFMc1Dmh3OwNjd8RhgDQGCOnbnAzWzBOG+mTlK05uvF5VWtNdrzGIft8735xqlNzdQST5AB \n gDLxNoz78S2omy2p2bi2oMCwfb5Jg0vi+DTQC4Pz7ZxFG8cFvqCZOifD9vkO8gSZpIFer77OSjro \n RVMlUByEsScGGVwyroFeC0srzqbARDX/Zqvd9A8MIzp/Li2u6OS5i66L4w1uzPxXyGbqIqDPd3jd \n Nf+1RnPkKV7dffnf/8F
" text/plain " : [
" <matplotlib.figure.Figure at 0x110b05d90> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
}
] ,
" source " : [
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"x = 10 * np.random.rand(100)\n",
"\n",
"def model(x, sigma=0.3):\n",
"    # Sum of a fast and a slow sine wave plus Gaussian noise of scale sigma.\n",
"    fast_oscillation = np.sin(5 * x)\n",
"    slow_oscillation = np.sin(0.5 * x)\n",
"    noise = sigma * np.random.randn(len(x))\n",
"\n",
"    return slow_oscillation + fast_oscillation + noise\n",
"\n",
"y = model(x)\n",
"plt.errorbar(x, y, 0.3, fmt='o');"
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 9 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" image/png " : " iVBORw0KGgoAAAANSUhEUgAAAeMAAAFVCAYAAADc5IdQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz \n AAALEgAACxIB0t1+/AAAIABJREFUeJzsnXd8ZHW9999nek0ySaYk2exma3aXLazSm2ADeRQLFhSs \n V0RQsQB6H9v1Xr3qlbVcRUSvIBZ0feDiVa9KE5CydJZl2RK2ZTd1SmYyvZ/z/HFm0jbJJpNpWX7v \n 18uX7Mwpv5mcOZ/z7ZKiKAgEAoFAIKgdmlovQCAQCASCVzpCjAUCgUAgqDFCjAUCgUAgqDFCjAUC \n gUAgqDFCjAUCgUAgqDFCjAUCgUAgqDG6Unbq7u7WAv8FrAEU4OM9PT27y7kwgUAgEAheKZRqGb8Z \n kHt6es4Bvgz8e/mWJBAIBALBK4uSxLinp+ePwFWFf3YBoXItSCAQCASCVxoluakBenp68t3d3bcD \n bwfeWbYVCQQCgUDwCkNaaDvM7u5uN/AUsK6npyc53TaKoiiSJC3oPAKBQCAQLCLmJXqlJnC9H1jS \n 09PzLSAJyIX/Tb8iScLvj5ZyKsEccTrt4juuMOI7rg7ie6484juuPE6nfV7bl+qmvgu4vbu7+x+A \n Hvh0T09PusRjCQQCgUDwiqYkMS64o99T5rUIBAKBQPCKRDT9EAgEAoGgxggxFggEAoGgxggxFggE \n AoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGg \n xggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggx \n FggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggE \n AoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGgxggxFggEAoGg \n xggxFggEAoGgxuhqvQCBYDFww83b0Wolvn3VmZNeA7jxmrNqtSyBQHCCICxjgUAgEAhqjBBjgUAg \n EAhqjBBjQdm54ebtYy5cgUAgEBwfIcYCgUAgENQYIcYCgUAgENQYIcYCgUAgENQYIcaCihCKpkTc \n WCAQCOaIEGOBQCAQCGqMEGOBQCAQCGqMEGOBQCAQCGqMEGOBQCAQCGqMEGOBQCAQCGqMEGOB4Dhs \n 3baDkUgKXyjJ1m07Jr02EkmNvSYQCASlUtLUpu7ubj1wG7AMMALf6Onp+XM5FyYQ1ANbt+1gT29o \n 7N97ekN87MaHyOWVSa9d9+PHufbSTSzz2GuxTIFAsMgp1TK+HPD39PScB1wE3FS+JQkE9cPeCUJc \n ZKIQFwlF0/zwv1+sxpIEAsEJSKnzjO8E7ir8twbIlWc5gsVO0X0LEIlnarwagUAgWByUZBn39PTE \n e3p6Yt3d3XZUYf5SeZclWIxMdOlm03Hi8QjX/fhxjgxHx7ZZbBOd1nU5jnlNp5XwjA7hjPjGXnPY \n jVx76aZqLk0gEJxAlGoZ093d3QncDfy4p6dn2/G2dzpFLK3S1Po73nskhJzP4T34NJHAEQD8zR38 \n p5zj1//2ZgC0Wgmo/Vrnyn986jw+9G/3MhJWrf1zRvbxhYduhqNHOehczmfe/31aGk3c/tULa7zS \n E4vFcn0sZsR3XF+UmsDlBu4Drunp6XloLvv4/dHjbyQoGafTXvPvWJEVhvY/QSzYj8nWjCRpiAUH \n OLTrYXy+85AkiXwh3lrrtc6HT759I1//5TMAXHXoPjh6FICV/sNoJPX9xfR56p16uJZPdMR3XHnm \n 
+7BTagLXF4FG4Kvd3d0PFf5nKvFYghMEpzFELNiPpdFN54bX03nSa7E62omMDPG3vz9W6+WVzDKP \n HYfdRGuTGevoCIrFQubscwFosehFBrVAIFgwJVnGPT09nwY+Xea1CBYx+XyeLn0/L2RTrHMtJ5tN \n kzRa8Kw8jd4X/sIPb7ubN7zmjFovc8Fo/D7kVhdKcwsA1qSwLgQCwcIpOWYsEExk764Xyf3kh3wz \n EOCi5/5IwmDmK5f+K0Gbg3hrF+HQYXbuXNzNMSRZRhPwE9m4kX6djiWAPRGu9bIEAsEJgBBjQUkU \n M6JvvOYsAF6841foAgHW25t4rnk5rz6yg+/+7vMAjGr1fOW9H2Tnzh0oygYkSarZuhfCu+79L3py \n OX4fi5MeGsAEdN37MyymA6Te/2HkjiW1XqJAIFikCDEWLJhgcARvz15OAhpuvZ1PP5Hn0qfuojUa \n oCsyxNojL3G6lGfnyAjxnB9bk6vWS543tniY9dvv5jZAae/g9A2b2P/kEwwefI6D33uOVbk88S9/ \n rdbLFAgEi5S6FOOpVpegvtm7dw8av4+NQO6kjVh39fD7M96DRoJvnmmFd76Bs379C3Z++jpG/b2L \n QoynXoNu3xH+AGQ3bubt3/0hSzuX8mh8Cfuf/jPGPY9w5UighqsVCASLHTEoQrAgFEVh797dGP1+ \n VjldKE4nOq0GjQQOuwnX2acA0AVY4zHCgT4U5dh2kvWMpu8ob/rl5wkCW848m6VLl4EkEejaiH3D \n eaSBJ/v7ar1MgUCwiBFiLFgQgUCA0OAgayNhdBs2HruBVkvsK/+GBlhltZFNJ0hER6q+zoWgffQf \n PArogVMu/8Ck95qWrKUBeM7rJZVK1WJ5AoHgBECIsWBBHD58CCk4whog171u2m3ktjYA1mrVqEgk \n sLisyENDg4SBobMuxbJu/aT3ckYLpwHZTJo9e16qyfoEAsHiR4ixYEEcPnwQKZ1iJSC3tk67jdze \n AUBw51FSWYXB/sOLagbwzt7DAHQ4j82WzuiMnAxoczm+9qM/LKq+2wKBoH4QYiwomXwuS39/H26j \n CRug2Bum3S63Zi05rZ4Nzz6E2e4knQjz4suDxwyRqEfS6TSHfF7cgNXWfMz7slaL1WBgpVZDMhYk \n najvzyMQCOoTIcZ1ymKYbhQP+8jn86y0WgFQGhun3U5pbWXH0s20h4dptqiCnYj4FsUM4EOHDpLP \n ZlkHXPuByR3EbrzmLG685iwUs4W1ilo7HV5kLniBQFAf1GVp04mO1+vlgQfuJRDws3TpMt7whgux \n 2RZff+NoaBg8sMJgBEBpmN4yBggXRNhlsNAHJMNeGlqXEY6lq7HUktm/vwfyedYBGAzTbqOYzayV \n 8yBJhANHq7q+SiBKCwWC6iMs4yrj9/v5/e/vYGCgH7PZzP79L3PHHb8ikUjUemnzJh72odFo6Cz8 \n W7ZPbxkDGFxqPNmj0aDVGUiEfWgksFumF7h6QM7nOXToIM16PS4Ao3Ha7RSzGWs6jbXRRTzsJxYT \n rmqBQDA/hBhXkXw+z1//+mdSqRQXX/wWrrzyak4//UzC4TD33fe3Wi9vXuRzWZLRIB5PG4aI2p+5 \n 6Ka+8ZqzcNgnD/F69ZlrAWhIxTE3ONEqSbLpONFEproLnwexsJdMJsMaewMSzGgZY7ag9Q7T2KIm \n ePX29lZriQKB4ARBiHEVeemlF/F6h9mwYRMbNmxEkiTOO+98lizp5OWXe+jrWzwuzkR0hIboCCue \n eQrzr28HZo4ZA8gONfmpIR3F2uDi9HVuklG1a1W9xsdjoWEAVpgLDxYzuantaojhi/fewqv3Po73 \n B1thkTU2EQgEtUWIcZWQZZmnn34SrVbLeee9Zux1SZI4//zXAvD444/WannzYuu2HQwN9nPqrofp \n 
/t1vAEhd8nZkT9uM+8gOBwCvOfQUHU4njTYDqTpv/hENDaHVauksxMRnclPHvvQ18su6OHvgZdb4 \n ehn60x+Qjh6p4koFAsFiR4hxlTh06CChUIgNGzYdk6zV3t7B0qXLOHr0CH6/v0YrnBtbt+1gT2+I \n ZMSHJZNA29TG1z+8lZe+cRPMMo0pt3kLitHIKfu28/Obr8Zy5zbSYV8VVz53tm7bgW8kTHDEz4v9 \n eYy5nPrGDJZx7owzCT7
" text/plain " : [
" <matplotlib.figure.Figure at 0x110d71cd0> "
]
} ,
" metadata " : { } ,
" output_type " : " display_data "
}
] ,
" source " : [
"xfit = np.linspace(0, 10, 1000)\n",
"# Fit a forest of 100 trees; scikit-learn expects a 2D feature array, hence x[:, None].\n",
"yfit = RandomForestRegressor(n_estimators=100).fit(x[:, None], y).predict(xfit[:, None])\n",
"ytrue = model(xfit, 0)  # noise-free curve for comparison\n",
"\n",
"plt.errorbar(x, y, 0.3, fmt='o')\n",
"plt.plot(xfit, yfit, '-r');\n",
"plt.plot(xfit, ytrue, '-k', alpha=0.5);"
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
"As you can see, the non-parametric random forest model is flexible enough to fit the multi-period data, without us even specifying a multi-period model!\n",
"\n",
"There is a tradeoff between the simplicity of a generic model like this and the effort of thinking about what your data actually represents.\n",
"\n",
"Feature engineering is still important, and it requires domain knowledge: for periodic data like this, a Fourier transform would expose the frequency distribution directly."
]
} ,
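{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
"The remark about Fourier transforms can be made concrete. As a minimal sketch, assuming a noise-free signal sampled on a uniform grid (the random `x` above is not uniformly spaced, so `np.fft` could not be applied to it directly), the cell below computes the amplitude spectrum of the same two-sine signal and prints its two strongest frequencies. The grid size and the variable names are choices made for this example. With only ten units of data the frequency resolution is coarse, so the peaks land near, not exactly at, the angular frequencies 5 and 0.5 used in `model`."
]
} ,
{
" cell_type " : " code " ,
" execution_count " : null ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [ ] ,
" source " : [
"import numpy as np\n",
"\n",
"# Uniform grid over the same range as the data above (an assumption of this sketch).\n",
"n = 1000\n",
"x_grid = np.linspace(0, 10, n, endpoint=False)\n",
"dx = x_grid[1] - x_grid[0]\n",
"y_grid = np.sin(5 * x_grid) + np.sin(0.5 * x_grid)\n",
"\n",
"# Amplitude spectrum and the matching angular frequencies.\n",
"amplitude = np.abs(np.fft.rfft(y_grid))\n",
"omega = 2 * np.pi * np.fft.rfftfreq(n, d=dx)\n",
"\n",
"# Skip the zero-frequency bin and report the two strongest components.\n",
"top = np.argsort(amplitude[1:])[::-1][:2] + 1\n",
"print('dominant angular frequencies: %s' % omega[top].tolist())"
]
} ,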
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
"## Random Forest Limitations\n",
"\n",
"The following data scenarios are not well suited for random forests:\n",
"* Highly imbalanced labels `y` (lots of 0, few 1)\n",
"* Structured data like images, where a neural network might be better\n",
"* Small data sets, which might lead to overfitting\n",
"* High-dimensional data, where a linear model might work better"
]
}
] ,
" metadata " : {
" kernelspec " : {
" display_name " : " Python 2 " ,
" language " : " python " ,
" name " : " python2 "
} ,
" language_info " : {
" codemirror_mode " : {
" name " : " ipython " ,
" version " : 2
} ,
" file_extension " : " .py " ,
" mimetype " : " text/x-python " ,
" name " : " python " ,
" nbconvert_exporter " : " python " ,
" pygments_lexer " : " ipython2 " ,
" version " : " 2.7.9 "
}
} ,
" nbformat " : 4 ,
" nbformat_minor " : 0
}