mirror of
https://github.com/donnemartin/data-science-ipython-notebooks.git
synced 2024-03-22 13:30:56 +08:00
556 lines
158 KiB
Python
556 lines
158 KiB
Python
|
{
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Density Estimation: Gaussian Mixture Models\n",
|
||
|
"\n",
|
||
|
"Credits: Forked from [PyCon 2015 Scikit-learn Tutorial](https://github.com/jakevdp/sklearn_pycon2015) by Jake VanderPlas"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Here we'll explore **Gaussian Mixture Models**, which is an unsupervised clustering & density estimation technique.\n",
|
||
|
"\n",
|
||
|
"We'll start with our standard set of initial imports"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 1,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"%matplotlib inline\n",
|
||
|
"import numpy as np\n",
|
||
|
"import matplotlib.pyplot as plt\n",
|
||
|
"from scipy import stats\n",
|
||
|
"\n",
|
||
|
"# use seaborn plotting defaults\n",
|
||
|
"import seaborn as sns; sns.set()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Introducing Gaussian Mixture Models\n",
|
||
|
"\n",
|
||
|
"We previously saw an example of K-Means, which is a clustering algorithm which is most often fit using an expectation-maximization approach.\n",
|
||
|
"\n",
|
||
|
"Here we'll consider an extension to this which is suitable for both **clustering** and **density estimation**.\n",
|
||
|
"\n",
|
||
|
"For example, imagine we have some one-dimensional data in a particular distribution:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 2,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAesAAAFVCAYAAADPM8ekAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHCZJREFUeJzt3X+M3Hed3/GnF7xwzq737LBGkJiIc8NbllCOEleYBAVS\nMFdoHcIJnWRxJ2EucIRKdUUbSOBIJcRdT7UcNSdIenUSjt6hHjXFyASdE1RQc9mg0NLjQtvkbRKQ\nvelFyeLdrL0xycb29o+ZTYb9Md/Z9czOZ3eeD8nSfr+f+Yzf/vi785rvr8933czMDJIkqVx93S5A\nkiQ1Z1hLklQ4w1qSpMIZ1pIkFc6wliSpcIa1JEmFe2WzxojoA+4ArgBeAG7IzCfmvGYD8F3go5mZ\n9T53AW8CzgMfy8zsRPGSJPWCqj3r64H+zLwKuBk40NgYETuAB4A3ArM3bL8XuCgz3wF8AfijtlYs\nSVKPqQrrq4GjAJn5MLBjTns/tUBv3HP+JTAUEeuAIWC6PaVKktSbmh4GBzYCpxqWz0VEX2aeB8jM\nhwAiorHPCPBq4DHgYmB326qVJKkHVYX1KWCwYfmloG7i08BIZn4uIi4FvhcRb87MRfewZ2ZmZtat\nW9daxZIkrX5LCr2qsB6htmd8KCJ2Ao+08J4X8fLe+ASwHnhFsw7r1q1jbOx0C2/d24aHBx2nFjlW\nrXGcWudYtcZxas3w8GD1ixpUhfVhYFdEjNSX90bEHmAgMw8u0mc/8JWI+BtqQX1LZv5ySVVJkqSX\nNA3rzJwBbpyz+tgCr7u24edngQ+2pTpJkuSkKJIklc6wliSpcIa1JEmFM6wlSSqcYS1JUuEMa0mS\nCmdYS5JUOMNakqTCGdaSJBXOsJYkqXCGtSRJhTOsJUkqnGEtSVLhDGtJkgpnWEuSVDjDWpKkwhnW\nkiQVzrCWJKlwhrUkSYUzrCVJKpxhLUlS4QxrSZIKZ1hLklQ4w1qSpMIZ1pIkFc6wliSpcK9s1hgR\nfcAdwBXAC8ANmfnEnNdsAL4LfDQzs77uFmA3sB74UmZ+tQO1S6vW9PQ0o6PHF23fuvUy+vv7V7Ai\nSSVrGtbA9UB/Zl4VEW8DDtTXARARO4D/ALwemKmvexfw9nqfi4BPd6JwaTUbHT3Ovv1H2DC0ZV7b\nmclnuP2m69i27fIuVCapRFVhfTVwFCAzH66Hc6N+auH9Fw3rfgv4SUR8C9gI3NSmWqU1ZcPQFgY2\nXdLtMiStAlXnrDcCpxqWz9UPjQOQmQ9l5pNz+rwGuBL4EPAJ4GvtKFSSpF5VtWd9ChhsWO7LzPMV\nfX4BPJqZZ4FjEfF8RLwmM3/RrNPw8GCzZtU5Tq0reawmJgaatm/ePLBi9Zc8TqVxrFrjOLVfVViP\nULtQ7FBE7AQeaeE9HwT2AbdFxOuBi4CTVZ3Gxk638Na9bXh40HFqUeljNT4+Vdm+EvWXPk4lcaxa\n4zi1ZqlfaKrC+jCwKyJG6st7I2IPMJCZBxfqkJnfiYhrIuKH1A6zfzIzZ5ZUlSRJeknTsK6H7I1z\nVh9b4HXXzln+zIWXJkmSwElRJEkqnmEtSVLhDGtJkgpnWEuSVDjDWpKkwhnWkiQVzrCWJKlwhrUk\nSYUzrCVJKpxhLUlS4QxrSZIKZ1hLklQ4w1qSpMIZ1pIkFc6wliSpcIa1JEmFM6wlSSqcYS1JUuEM\na0mSCmdYS5JUOMNakqTCGdaSJBXOsJYkqXCGtSRJhTOsJUkqnGEtSVLhXtmsMSL6gDuAK4AXgBsy\n84k5r9kAfBf4aGZmw/otwI+Ad2fmsXYXLklSr6jas74e6M/Mq4CbgQONjRGxA3gAeCMw07B+PfBn\nwHNtrVaSpB5UFdZXA0cBMvNhYMec9n5qgZ5z1u8H7gSeakONkiT1tKqw3gicalg+Vz80DkBmPpSZ\nTzZ2iIiPAGOZeX991bp2FCpJUq9qes6aWlAPNiz3Zeb5ij57gZmIeA/wFuCrEfGBzHy6Wafh4cFm\nzapznFpX8lhNTAw0bd+8eWDF6i95nErjWLXGcWq/qrAeAXYDhyJiJ/BI1Rtm5jtnf46I7wN/UBXU\nAGNjp6te0vOGhwcdpxaVPlbj41OV7StRf+njVBLHqjWOU2uW+oWmKqwPA7siYqS+vDci9gADmXlw\nGfVJkqQlahrWmTkD3Dhn9bzbsDLz2kX6L7hekiS1zklRJEkqnGEtSVLhDGtJkgpnWEuSVDjDWpKk\nwhnWkiQVzrCWJKlwhrUkSYWrmsFM0jJNT08zOnp8wbYTJxZeL0kLMaylDhkdPc6+/UfYMLRlXtvJ\nJx/l4ku3d6EqSauRYS110IahLQxsumTe+jOTlc+2kaSXeM5akqTCGdaSJBXOsJYkqXCGtSRJhTOs\nJUkqnGEtSVLhDGtJkgpnWEuSVDjDWpKkwhnWkiQVzrCWJKlwhrUkSYUzrCVJKpxhLUlS4QxrSZIK\n1/R51hHRB9wBXAG8ANyQmU/Mec0G4LvARzMzI2I9cA9wGfAq4IuZ+e1OFC9JUi+o2rO+HujPzKuA\nm4EDjY0RsQN4AHgjMFNf/WFgLDOvAf4J8KW2VixJUo+pCuurgaMAmfkwsGNOez+1QM+GdYeAWxve\n/+yFlylJUu9qehgc2Aicalg+FxF9mXkeIDMfAoiIl16Qmc/V1w1SC+7PtbNgSZJ6TVVYnwIGG5Zf\nCupmImIr8E3gy5n5V60UMjw8WP0iOU5L0O2xmpgYWHbfzZsHVqz+bo/TauJYtcZxar+qsB4BdgOH\nImIn8EjVG0bEa4H7gU9m5vdbLWRs7HSrL+1Zw8ODjlOLFhur6elpRkePL9pv69bL6O/vb0sN4+NT\nF9R3Jf6v3aZa51i1xnFqzVK/0FSF9WFgV0SM1Jf3RsQeYCAzDy7S57PAEHBrRMyeu35fZj6/pMqk\nDhgdPc6+/UfYMLRlXtuZyWe4/abr2Lbt8i5UJkmLaxrWmTkD3Dhn9bEFXndtw8/7gH1tqU7qgA1D\nWxjYdEm3y5CkljkpiiRJhTOsJUkqnGEtSVLhDGtJkgpnWEuSVDjDWpKkwhnWkiQVrmpSFKlIVTOR\nDQ29eQWrkaTOMqy1KlXNRPYX/3aATZte14XKJKn9DGutWs5EJqlXeM5akqTCGdaSJBXOsJYkqXCG\ntSRJhTOsJUkqnGEtSVLhDGtJkgpnWEuSVDjDWpKkwjmDmVSY8+fOcuLE4vOeb916Gf39/StYkaRu\nM6ylwjw/dZIDXx9nw9BT89rOTD7D7Tddx7Ztl3ehMkndYlhLBXLec0mNPGctSVLhDGtJkgrnYXCt\nOefPneXnP/854+NT89qaXbglSaUyrLXmPD91klv/4w/YMLRlXtvJJx/l4ku3d6EqSVo+w1pr0mIX\naJ2ZfLoL1UjShWka1hHRB9wBXAG8ANyQmU/Mec0G4LvARzMzW+kjSZJaV3WB2fVAf2ZeBdwMHGhs\njIgdwAPAG4GZVvpIkqSlqQrrq4GjAJn5MLBjTns/tXDOJfSRJElLUHXOeiNwqmH5XET0ZeZ5gMx8\nCCAiWu6zmOHhwZaL7mWOU83ExEBH3nfz5oG2jfFqqBHcppbCsWqN49R+VWF9Cmgc9crQXWYfxsZO\nV72k5w0PD/bUOE1PTzM6uvCtVp26BWt8fKptY7zQrWPtet921dhr29SFcKxa4zi1ZqlfaKrCegTY\nDRyKiJ3AIy2853L6SPOMjh5n3/4jK3YLlg/QkFSqqrA+DOyKiJH68t6I2AMMZObBVvu0oU71qJW8\nBcsHaEgqVdOwzswZ4MY5q48t8LprK/pIq4IP0JBUIidFkS5AN86rS+o9hrV0AVb6vLqk3mRYSxfI\nqU0ldZqPyJQkqXCGtSRJhTOsJUkqnGEtSVLhDGtJkgpnWEuSVDhv3ZJasNi84U58ImklGNZSCxab\nN9yJTyStBMNaatFCk5848YmkleA5a0mSCueetbrKB2FIUjXDWl3lgzAkqZphra7zQRiS1JznrCVJ\nKpxhLUlS4QxrSZIKZ1h
|
||
|
"text/plain": [
|
||
|
"<matplotlib.figure.Figure at 0x109842cf8>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"np.random.seed(2)\n",
|
||
|
"x = np.concatenate([np.random.normal(0, 2, 2000),\n",
|
||
|
" np.random.normal(5, 5, 2000),\n",
|
||
|
" np.random.normal(3, 0.5, 600)])\n",
|
||
|
"plt.hist(x, 80, normed=True)\n",
|
||
|
"plt.xlim(-10, 20);"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Gaussian mixture models will allow us to approximate this density:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 3,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAesAAAFVCAYAAADPM8ekAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xl8VPW9//HXLJlJZklCFnbZBI4oiwoqgoILuOPSWq21\nttVrXdre2l9Xe1tt771tb3ut3a7XLlZ77eJSUVwAEXEBjIKCC8hyQPadJDPZk5lkZn5/zCQEApkE\nJnMmmffz8eBR5nznTD45jbzzPd/v+X5tsVgMERERyVx2qwsQERGRzimsRUREMpzCWkREJMMprEVE\nRDKcwlpERCTDKaxFREQynLOzRsMw7MDDwEQgBNxumuaWI97jAV4FbjNN00yc82dgLBAFvmyaptkT\nxYuIiGSDZD3rawGXaZrTgHuBB9s3GoYxBVgGjARaH9i+BPCapnke8B/AT1NasYiISJZJFtbTgUUA\npmmuBKYc0e4iHujte86NQIFhGDagAAinplQREZHs1OltcCAfqGn3OmIYht00zSiAaZpvAxiG0f6c\nMiAX2AgUA3NSVq2IiEgWShbWNYC/3eu2oO7Ed4Ey0zR/YBjGUOB1wzDGm6Z5zB52LBaL2Wy2rlUs\nIiLS+3Ur9JKFdRnxnvEzhmFMBdZ04TO9HOqNB4EcwNHZCTabjfLy2i58dHYrLfXrOnWRrlXX6Dp1\nna5V1+g6dU1pqT/5m9pJFtbzgNmGYZQlXt9qGMZNgM80zUeOcc4DwF8Mw1hOPKi/b5pmY7eqEhER\nkTadhrVpmjHg7iMObzrK+y5s9/cq4LqUVCciIiJaFEVERCTTKaxFREQynMJaREQkwymsRUREMpzC\nWkREJMMprEVERDKcwlpERCTDKaxFREQynMJaREQkwymsRUREMpzCWkREJMMprEVERDKcwlpERCTD\nKaxFREQynMJaREQkwymsRUREMpzCWkREJMMprEVERDKcwlpERCTDKaxFrNTSgv/OWym48TpsgUqr\nqxGRDKWwFrGQ+7lnyJ33LK43XsPz8P9YXY6IZCiFtYiF3C893/Z3V7u/i4i0p7AWsUosRs7q94gM\nG0Fo9qU4t23Fvn+f1VWJSAZSWItYxH7wAPaKClrGT6BlytkAONd8aHFVIpKJFNYiFnF+vAaAltPG\n0zLuNAAcGzdYWZKIZCiFtYhFHOs+BqDltAm0GKcA4FRYi8hRODtrNAzDDjwMTARCwO2maW454j0e\n4FXgNtM0zcSx7wNzgBzgIdM0H++B2kV6rUgkQvSD9wGoHDqUZo+HIrud6LYtBAKVFBQU4nA4LK5S\nRDJFsp71tYDLNM1pwL3Ag+0bDcOYAiwDRgKxxLELgHMT51wAjEptySK9X3V1FbXrNhJxOHlhZwsL\nV++ltrCElu27mLtkLdXVVVaXKCIZJFlYTwcWAZimuRKYckS7i3igm+2OXQqsNQzjeeAl4MXUlCrS\ntxQGK6gvGYivoAifv5CG/oPxBSvw5nmsLk1EMkyysM4Hatq9jiRujQNgmubbpmnuPuKcEmAycD1w\nF/CPVBQq0pfYQiG81QHqSge1HasrHYQ9GsEXrLCwMhHJRJ2OWRMPan+713bTNKNJzqkANpim2QJs\nMgyjyTCMEtM0O/0XqLTU31mzJOg6dV0mXyvntmoAGgcNxet1AxAafBIApfUBSkr8FBenp/5Mvk6Z\nRteqa3SdUi9ZWJcRnyj2jGEYU4E1XfjMt4B7gF8ZhjEY8AJJFz0uL6/twkdnt9JSv65TF2X6tWpa\nt4l+QFW//tTXhwAIFpQC4Nqzm4qKWqJRV4/XkenXKZPoWnWNrlPXdPcXmmRhPQ+YbRhGWeL1rYZh\n3AT4TNN85GgnmKa5wDCMGYZhvEv8NvtXTNOMdasqkT4uZ+8egMNug9cm/p5fccCSmkQkc3Ua1omQ\nvfuIw5uO8r4Lj3j9vRMvTaTvcraF9eC2Yw1F/QHwVmn3LRE5nBZFEbGA80C891xfPKDtWGNhMQCe\nmqAlNYlI5lJYi1jAWVEOQEO/krZjTf5CYjabwlpEOlBYi1jAWV5O2J1Lc5637VjM4aDJX6iwFpEO\nFNYiFnCWH6ShoKjD8caCIjy1Wr1MRA6nsBZJt0gER6CS+qOFdWExufW1EA5bUJiIZCqFtUia2QIB\nbJHI0XvW+fFjzmAg3WWJSAZTWIukmf1gYib4UXvW8WOOCi05KiKHKKxF0qyzsG5KHHME9Ky1iByi\nsBZJs0NhXdyhrTFxzFmpnrWIHKKwFkkz+8GDADQU9OvQ1pg45qhUz1pEDlFYi6SZPXGLu9Ff2KEt\n5CsAwFFTndaaRCSzKaxF0szWGta+/A5tocQxR7XCWkQOUViLpJk98VhWU6IX3V5rWNvVsxaRdhTW\nImlmr6wk5nAQarfUaKuwN9GzrtIqZiJyiMJaJM1swQCRgkKwd/zPL+Jy0+xyY6+psaAyEclUCmuR\nNLMHA0T6dZwJ3qrJ68dRrZ61iByisBZJp2gUWzBIpLDjTPBWTd58hbWIHEZhLZJGtuoqbNEokX4d\nVy9rFfL6sdfWQiSSxspEJJMprEXSqPUZ68571n5ssRg2zQgXkQSFtUga2QLxx7Y661k3ef3x92pG\nuIgkKKxF0sjeGtZJetYA9qpgWmoSkcynsBZJI1uwNaw7GbP2JHrWQYW1iMQprEXSyJ7YoCPSr5Oe\ndesqZpoRLiIJCmuRNLJ3oWfdNmatnrWIJCisRdLI1sXZ4KCetYgcorAWSSN7F2aDh9SzFpEjODtr\nNAzDDjwMTARCwO2maW454j0e4FXgNtM0zXbH+wOrgYtN09yU6sJFeiNboJKYzUY0Px84+vrfmg0u\nIkdK1rO+FnCZpjkNuBd4sH2jYRhTgGXASCDW7ngO8EegPqXVivRy9mCAWGEhOBzHfE9TYuctPWct\nIq2ShfV0YBGAaZorgSlHtLuIB7p5xPEHgN8D+1JQo0ifYa+sJFpU3Ol7QnleYjYbNvWsRSQhWVgf\nea8ukrg1DoBpmm+bprm7/QmGYXwJKDdNc3HikC0VhYr0erEYtmCAWCfj1QDY7UR9Pm2TKSJtOh2z\nJh7U/nav7aZpRpOccysQMwxjFnA68LhhGNeYpnmgs5NKS/2dNUuCrlPXZdy1qqqCSIScQQMoKfHj\n8bjwet0d3haNuKCgAGd9bVq+h4y7ThlM16prdJ1SL1lYlwFzgGcMw5gKrEn2gaZpzmz9u2EYbwB3\nJgtqgPLy2mRvyXqlpX5dpy7KxGtl37aDYuJj0hUVtTQ0hLE7Qh3e19AQpsXjxXbgAJU9/D1k4nXK\nVLpWXaPr1DXd/YUmWVjPA2YbhlGWeH2rYRg3AT7TNB85jvpEslbrjlvJxqwBIj4/rk82QywGNo0k\niWS7TsPaNM0YcPcRhzs8hmWa5oXHOP+ox0WyUevqZdGiJGPWQNTvxxaNYquvI+bTLUWRbKdFUUTS\nxJZYFzzWhZ51NBHQtlrdThQRhbVI2rT1rJPNBifeswawaUa4iKCwFkkbW2Kp0VhxV8asffFzaqp7\ntCYR6R0U1iJp0roueJd61m23wdWzFhGFtUjaHJpg1oUx68RtcLvGrEUEhbVI2rRujxnr1y/pezVm\nLSLtKaxF0sQeCBDNLwBnsuUN4s9Zg8JaROIU1iJpYgtUEuvCM9YA0dYJZrWaYCYiyVcwE5HjFIlE\nqK5ObHMZi1EcCBAaN45AoJJgMEgsFjvmuVF/YptMjVmLCAprkR5TXV3F3CVr8fjyyWlswGgOs488\nFq7YQcX+3fgKSkhkcgdtE8x0G1xE0G1wkR7l8eXj8xdSTLwXHelXgs9fSF6SJUQ1Zi0i7SmsRdIg\ntyZ+O7zJX9il90e9XkDPWYtInMJaJA1yExPFmvwFXTshJ4eYx6MxaxEBFNYiaeGuDQLQlJ/8GetW\nUX++lhsVEUBhLZIWrT3rUFd71kAsPx+7boOLCAprkbTo7pg1QMzv1wQzEQEU1iJp4a47nrDOxxYO\nQyjUU2WJSC+hsBZJg7a
|
||
|
"text/plain": [
|
||
|
"<matplotlib.figure.Figure at 0x109a3cc88>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"from sklearn.mixture import GMM\n",
|
||
|
"clf = GMM(4, n_iter=500, random_state=3).fit(x)\n",
|
||
|
"xpdf = np.linspace(-10, 20, 1000)\n",
|
||
|
"density = np.exp(clf.score(xpdf))\n",
|
||
|
"\n",
|
||
|
"plt.hist(x, 80, normed=True, alpha=0.5)\n",
|
||
|
"plt.plot(xpdf, density, '-r')\n",
|
||
|
"plt.xlim(-10, 20);"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Note that this density is fit using a **mixture of Gaussians**, which we can examine by looking at the ``means_``, ``covars_``, and ``weights_`` attributes:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 4,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([[ 4.5601338 ],\n",
|
||
|
" [ 0.0861325 ],\n",
|
||
|
" [ 3.01890623],\n",
|
||
|
" [ 6.87627234]])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 4,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"clf.means_"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 5,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([[ 30.34748627],\n",
|
||
|
" [ 4.30521863],\n",
|
||
|
" [ 0.19750802],\n",
|
||
|
" [ 14.78230876]])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 5,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"clf.covars_"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 6,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([ 0.27613209, 0.48308463, 0.11612442, 0.12465886])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 6,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"clf.weights_"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 7,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAesAAAFVCAYAAADPM8ekAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3XmcXHWd7//XOVVd1VW9JZ10QthXD8iiSNAYIIAKKsjm\nOCjj5aEwuDDOVX86Kg7KHWeGe/UiLiPCddw3RBFRFoGwKZDIqhhFcoAECAESOr1Vd62nzjm/P6q6\nU91VXVtXd1d3vZ+PRz+SOt9zTn37pNOf+nxXw/d9REREpHmZ810BERERKU/BWkREpMkpWIuIiDQ5\nBWsREZEmp2AtIiLS5BSsRUREmlywXKFlWSZwNXAUkAYusm17y5RzosCdwIW2bdv5a74DvArwgA/Y\ntm3PRuVFRERaQaXM+mwgZNv2WuAS4MrCQsuyVgP3AQcA4xO2TwU6bNs+Hvh34PKG1lhERKTFVArW\nxwG3A9i2/RCwekp5iFxAL8yck0CPZVkG0ANkGlNVERGR1lS2GRzoBmIFr13Lskzbtj0A27Y3AliW\nVXjNBqAd2AwsA85oWG1FRERaUKVgHQO6Cl5PBOoyPg1ssG37Usuy9gbusSzrCNu2p82wfd/3DcOo\nrsYiIiILX01Br1Kw3kAuM77esqw1wKYq7tnB7mx8CGgDAuUuMAyD/v7RKm7d2vr6uvScqqRnVR09\np+rpWVVHz6k6fX1dlU8qUClY3wicYlnWhvzrCyzLOg/otG3729NccwXwfcuy7icXqD9r23ayplqJ\niIjIhLLB2rZtH7h4yuGnSpx3csHfh4FzGlI7ERER0aIoIiIizU7BWkREpMkpWIuIiDQ5BWsREZEm\np2AtIiLS5BSsRUREmpyCtYiISJNTsBYREWlyCtYiIiJNTsFaRESkySlYi4iINDkFaxERkSanYC0i\nItLkFKxFRESanIK1iIhIk1OwFhERaXIK1iIiIk1OwVpERKTJKViLiIg0OQVrkfmUzdL1oQvoefc5\nGIMD810bEWlSCtYi8yj8q+tpv/EGQvfeTfTqb8x3dUSkSSlYi8yj8M2/nvh7qODvIiKFFKxF5ovv\n0/bYI7j77k/6lLcSfHYr5o6X57tWItKEFKxF5on5yk7MXbvIHnEk2dWvByC46fF5rpWINCMFa5F5\nEvzrJgCyhx9B9rDDAQhsfnI+qyQiTUrBWmSeBJ74KwDZw48kax0KQFDBWkRKCJYrtCzLBK4GjgLS\nwEW2bW+Zck4UuBO40LZtO3/ss8AZQBtwlW3bP5yFuossWJ7nwaY/AzB64EFkl/bSa5rw3LPE43Ei\nkQimqc/SIpJT6bfB2UDItu21wCXAlYWFlmWtBu4DDgD8/LGTgDfmrzkJOLCxVRZZ+JLJJGn7abxg\nkD+mImzaFiO5bAXethd45IntJJPJ+a6iiDSRSsH6OOB2ANu2HwJWTykPkQvodsGxtwJ/sSzr18DN\nwE2NqarI4tLRv4PUij2JdHQSiURJr9qL9oFXaG9rm++qiUiTqRSsu4FYwWs33zQOgG3bG23b3j7l\nmuXAMcC7gA8DP21ERUUWEyOdon1wF6k99po4llq5F6br0r7rlXmsmYg0o7J91uQCdVfBa9O2ba/C\nNbuAJ23bzgJPWZaVsixruW3bu8pd1NfXVa5Y8vScqtfMzyrx0lYA3H33pacnAoC/334A9I0N0tfX\nRUdHx5zUpZmfU7PRs6qOnlPjVQrWG8gNFLvesqw1wKYq7vkA8DHgK5Zl7Ql0ABUXPe7vH63i1q2t\nr69Lz6lKzf6s3L/YRIFY70pGRnL9011LVrAn4D37LP39oyQSlT4Xz1yzP6dmomdVHT2n6tT6gaZS\nsL4ROMWyrA351xdYlnUe0Gnb9rdLXWDb9q2WZa2zLOthcs3s/2Tbtl9TrUQWueCLud6jwmbwZP7v\nkZ0vzUudRKR5lQ3W+SB78ZTDT5U47+Qprz8z86qJLF7jwTpZEKzTfSsBaB/on5c6iUjz0kROkXkQ\n2LkDgHTfHhPHMr3LAQgPD85LnUSkeSlYi8yDQH9uxHd62YqJY5meXnzDIDykfa1FZDIFa5F5EOh/\nhWx7BLejs+BgAKenl5AyaxGZQsFaZB4EXnmFdL7Zu1Bm6TLCw8qsRWQyBWuRuea6BAZ2kVpaHKzT\nvcsIjcYgk5mHiolIs1KwFpljxuAghuuWzKydpcsACAwquxaR3RSsReaY+cpOgJLBOp3PtgO7NH1L\nRHZTsBaZY+PBulQzeGY8sx4ouzqviLQYBWuROTaRWS8rEazz2ba5S8FaRHZTsBaZY+Yr+TnW+Sy6\nkLNkPLNWn7WI7KZgLTLHzPzgsXRPb1GZ092TO2dkeE7rJCLNTcFaZI4Z+WDtdC8pKhs/pmAtIoUU\nrEXmmDmUW6Esk8+iCzlduWOBkZE5rZOINDcFa5E5Zg4M4AcCOB3F+9lm88HaHB6a62qJSBNTsBaZ\nY8bQIN6SJWAW//fzwmGy4XbMmDJrEdlNwVpkjplDg7hLiweXjXO6ejCH1WctIrspWIvMJc/DGBrC\nW7J02lOcrm4FaxGZRMFaZA4ZI8MYnofbO31mnenqwRyNgevOYc1EpJkpWIvMofE51t6Scs3g3Ri+\nj6F+axHJU7AWmUPGYG7allcms3Y6u3PnqilcRPIUrEXmkJkP1m6ZPuvx+deaviUi4xSsReaQMVRD\nZj2kYC0iOQrWInPIzG/QUS6zHl/FTEuOisg4BWuROWRWkVln8sFambWIjFOwFplD45t4lM+sc83g\nyqxFZJyCtcgcMqsYDa7MWkSmCpYrtCzLBK4GjgLSwEW2bW+Zck4UuBO40LZtu+D4CuAx4M22bT/V\n6IqLLETG4AC+YeB198Bg6cx5IrPWaHARyauUWZ8NhGzbXgtcAlxZWGhZ1mrgPuAAwC843gZ8C4g3\ntLYiC5w5NIi/ZAkEAtOeMz7ATPOsRWRcpWB9HHA7gG3bDwGrp5SHyAV0e8rxK4BrgJcbUEeRRcMc\nGMDrXVb2HKejC98wMJRZi0hepWDdDcQKXrv5pnEAbNveaNv29sILLMt6P9Bv2/b6/CGjERUVWfB8\nH2NoEL/MjlsAmCZ+ZxdmLFb+PBFpGWX7rMkF6q6C16Zt216Fay4AfMuy3gK8FvihZVln2ba9s9xF\nfX1d5YolT8+pek33rIaHwXVpW7WSvr4uevoTRKKRotNCbT4s6SEYH52T76HpnlMT07Oqjp5T41UK\n1huAM4DrLctaA2yqdEPbtk8c/7tlWfcCH6oUqAH6+0crndLy+vq69Jyq1IzPynz2eZYBqY5u+vtH\nGYklyTjFDU/JZJJstJPgjpcZmOXvoRmfU7PSs6qOnlN1av1AUylY3wicYlnWhvzrCyzLOg/otG37\n23XUT6RlTey4VaHPGsDr6sR4Oga+D4Z6kkRaXdlgbdu2D1w85XDRNCzbtk+e5vqSx0VaUTWrl43z\nuroxPA8jPobfqSZFkVanRVFE5oiRXxfcryqzzm/mMarmRBFRsBaZMxOZdaXR4IDXlcumDY0IFxEU\nrEXmjJFfatRfVjmzHm/6NmIjs1onEVkYFKxF5sjEuuC1ZNajyqxFRMFaZM7sHmBWfZ+1qT5rEUHB\nWmTOjG+P6S+dfnvMceqzFpFCCtYic8QcHMztthWstLxBwWhwBWsRQcFaZM4YgwP4VcyxBvA6O3PX\njGqAmYhUXsFMROrkeR7JZDL3wvdZPjRIZo/DicfjJBIJfH/6Zfb9bs2zFpHdFKxFZkkymeSRJ7YT\nbo8QSMTZP5NhONzJpi0DjAwP0h6JEo2WvnZigJmawUUENYOLzKpwe4RIJEp3Jg2A27ucSCRKONxe\n9joNMBORQgrWInMgNJybtuX0VB4JDuB1jPdZK1iLiIK1yJxoiw0D1Qdr2trwo1H1WYsIoGAtMifa\n8pl1ptpgTX7nLS03KiIoWIvMid2Z9ZKqr/G7uzHVDC4iKFiLzInQ8BAATk9186wB/K4uDTATEUDB\nWmROtMXGg3UNmXVXN0Y
|
||
|
"text/plain": [
|
||
|
"<matplotlib.figure.Figure at 0x10aab3390>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"plt.hist(x, 80, normed=True, alpha=0.3)\n",
|
||
|
"plt.plot(xpdf, density, '-r')\n",
|
||
|
"\n",
|
||
|
"for i in range(clf.n_components):\n",
|
||
|
" pdf = clf.weights_[i] * stats.norm(clf.means_[i, 0],\n",
|
||
|
" np.sqrt(clf.covars_[i, 0])).pdf(xpdf)\n",
|
||
|
" plt.fill(xpdf, pdf, facecolor='gray',\n",
|
||
|
" edgecolor='none', alpha=0.3)\n",
|
||
|
"plt.xlim(-10, 20);"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"These individual Gaussian distributions are fit using an expectation-maximization method, much as in K means, except that rather than explicit cluster assignment, the **posterior probability** is used to compute the weighted mean and covariance.\n",
|
||
|
"Somewhat surprisingly, this algorithm **provably** converges to the optimum (though the optimum is not necessarily global)."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## How many Gaussians?\n",
|
||
|
"\n",
|
||
|
"Given a model, we can use one of several means to evaluate how well it fits the data.\n",
|
||
|
"For example, there is the Aikaki Information Criterion (AIC) and the Bayesian Information Criterion (BIC)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 8,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"25696.4735953\n",
|
||
|
"25625.7016679\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"print(clf.bic(x))\n",
|
||
|
"print(clf.aic(x))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Let's take a look at these as a function of the number of gaussians:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 9,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAfEAAAFVCAYAAAD/v9aFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8XXWd//HXuTf3Zr3JTZN0TZuu+bK0tIAKCII6yIDy\nU1FAUQdxZ1HUwdEZRhn1oTM/FxCVAg6L4jIsRVBAB6riT4QZla1lafmWQveWNm1zsy93Ob8/zkma\ntGm23uTk3vt+Ph595Obcc04+36bNO9/v+Z7vcVzXRURERHJPKOgCREREZHwU4iIiIjlKIS4iIpKj\nFOIiIiI5SiEuIiKSoxTiIiIiOapouDeNMRHgdqABKAa+AfwVuAWIAw5wsbV2szHmHOAa/9AnrbVX\nGmNKgZ8DdUAb8GFr7V5jzMnA9UAKWG2t/Xr2myYiIpLfRuqJfxBostaeDpwNrAS+BfzMWnsGXmgv\nNcbEgG8D77DWngLsMMbUAZcBa/3jfwp82T/vzcBF1trTgJOMMSuy3TAREZF8N1KIr+JA7zoEJIFT\ngbnGmN/hhfyjwBuB54HrjDGPAbustU3+vg/7xz8MnOkHftRau8nf/ghwZpbaIyIiUjCGDXFrbYe1\ntt0P3lV4Pen5wH5r7duArcCXgBrgLcAXgXOAzxljlgCVQIt/ujagyt/WOuDL9G0XERGRMRj2mjiA\nMWYucB+w0lp7pzHmOuAB/+0HgW8C/4N3HXyPf8xjwAq8sK70940BCX9bbMCXqPS3H5bruq7jOKNt\nk4iISD4YMfhGmtg2A1gNXG6t/aO/+XHgHXgT1s4AXgCewbs2XoPX8z4Z+E/gCeDtwJN4PfTHrLVt\nxpheY8xCYBNwFvDVYVvhODQ1tY3UlpxVVxdT+3KY2pe78rltoPblurq62Ij7jNQTvxpvqPsaY8w1\ngAtcAtxqjLkMrwf9AWttizHmX/CubwPcba1dZ4zZBNxhjPkz0AN8wH//UuAXQBh4xFr75JhaJiIi\nIjg58hQzN99/21L7cpfal7vyuW2g9uW6urrYiMPpWuxFREQkRynERUREcpRCXEREJEcpxEVERHKU\nQlxERCRHKcRFRERy1IgrtomIiMhgzzzzFNdc8y8sWLAQ13VJJpN84Qv/zGOP/T9qamp597vfS2tr\nKytXXs+OHdtJp1NMnz6TL37xasrLK7JWh0JcRERy2j2PbuTJl/Zk9ZyvP2o6F7518WHfdxyH173u\nDXz1q98E4Mkn/8Itt9zMUUcd3b/PV7/6r5x33nt505ve7NV5z3/x7W//O1/72r9nrU6FuIiIyBi5\nrsvAxdJaW1uprq7u//y113bR3LyvP8ABzj///Zx7bldW61CIi4hITrvwrYuH7TVPlGeeeYrPfOZT\nJJNJNm7cwH/8x3d5/vnnANi7dy+zZs0ZtH8oFKKsrDyrNeTExLaHHn+VVDoTdBkiIiL9Tjjhdfzw\nhz/i5ptv5/bbf8E111xNT08PADNnzqSpafeg/VOpFKtXP5zVGnIixH90//P85cXdI+8oIiISgOrq\naQx8ZHZtbR1VVXEef/xP/dvuuedOnnjiT0MdPm45M5y+bst+TjtuVtBliIiI4DhO/3B6KBSms7OD\nz3zm8+zatbM/zL/yla9z3XXf4s47f04ymaS+fi5f+tKXs1pHToR4PFbM+s3NuK476DcdERGRIBx/\n/Ik8+ODqYfepqorzta/9x4TWkRPD6XWNO2jp7GLn3o6gSxEREZkyciLEt4efIlyzi3VbmoMuRURE\nZMrIiRAHCFXuY/1mhbiIiEifnAjxqpJKIvFm7Lb9pDO61UxERARyJMSPnd6IW9RNt9PK5l1tQZcj\nIiIyJeREiC+d3ghAqHI/6zbvD7gaERGRqSEnbjFbOt0AEK7cx/otzfyfUxcEXJGIiIjnF7+4g3vu\nuZN7732QSCTCbbf9aNKeZJYTIT6joo54cRUtVc1sXJOgJ5mmOBIOuiwREZkC7tv4EM/ueT6r5zx+\n+jLes/jcUe27evV/c+aZf8/vf/8I55xzLo7j9K9pMtFPMsuJ4XTHcTDVi3HDvaSjbby8PRF0SSIi\nIjzzzFPU18/lXe96D/fdd0//dtd1D/sksy996V+z9vVzoicO0Fi9iL++9nT/rWZLF9QEXZKIiEwB\n71l87qh7zdn20EO/5txz38W8eQ1EIlHWrXuh/73JeJJZToU4QFHVftbpfnEREQlYa2srf/nL/5BI\nNHPvvffQ0dHOL395D3Pm1AOHf5LZo4/+nrPOOjsrNeRMiE8rqaa2tIZ9mWa2bmilvStJRWkk6LJE\nRKRArV79W849911cfvmVAPT0dHPBBe+ivLycadNqBj3J7LTTzgC8J5lZuy5rIZ4T18T7NMYX4YaS\nUN7KS1qCVUREAvTQQw9w9tlv7/+8uLiEM854Kw899OtBTzL73e8e5oorPsEnP3kJGzduyOqTzBzX\ndbN2sgnkNjW18dRrz/LjdXeS3NrIabPexMV/b4KuKyvq6mI0NeXvIjZqX27L5/blc9tA7ct1dXWx\nER/bmVM98SXViwEoijezXou+iIhIgcupEK8qjjGzbDqhWDO7Ex3sa+kOuiQREZHA5FSIAzRWL8Z1\nUoTKW1i3Rb1xEREpXDkY4t6tZqHK/Xo0qYiIFLScC/El1QsBiMabWbelmRyZmCciIpJ1ORfiFZFy\n6itmQ/l+Wju72LG3I+iSREREApFzIQ7ekLrrZAhVJDSkLiIiBStnQxz86+Ja9EVERApUTob44vgC\nHBxKpjXz0tZm0plM0CWJiIhMupwM8dKiUuZV1pMpaaY71cOmXfm7Yo+IiMjh5GSIg7+OuuMSijWz\nTqu3iYhIAcrZEDf+EqzhmO4XFxGRwpSzIb4wPp+wE6akJsHGHS309KaDLklERGRS5WyIF4ejzK+c\nS6o4Qdrp5eXtiaBLEhERmVQ5G+LgraMO/nVx3WomIiIFJsdD3LtfvKhqvya3iYhIwcnpEF9QOY9I\nqIiSaQm27W6nrbM36JJEREQmTU6HeCQcYWHVfJKRBG5RLy9t1XVxEREpHDkd4jBgCdbYftZrSF1E\nRApIHoS4d794cXUz63S/uIiIFJCcD/GGWD3RcJRodTN7El3sbekKuiQREZFJkfMhHg6FWRxfQG+4\nFSLdWr1NREQKRs6HOAxYgrVyv+4XFxGRgpEXId4Y9ya3lUxLsH7zflzXDbgiERGRiZcXIV4fm01p\nUSlFVftp7Uyyo6kj6JJEREQmXF6EeMgJsSS+kN5QO060U0PqIiJSEIqGe9MYEwFuBxqAYuAbwF+B\nW4A44AAXW2s3+/uHgN8Av7LW/sgYUwr8HKgD2oAPW2v3GmNOBq4HUsBqa+3Xj7QhjdWLeG7vi4Qq\nvfvFz3r93CM9pYiIyJQ2Uk/8g0CTtfZ04GxgJfAt4GfW2jOAa4ClA/b/Bl64912UvgxY6x//U+DL\n/vabgYustacBJxljVhxpQ/omt5XXtvLStgSpdOZITykiIjKljRTiq/CCum/fJHAqMNcY8zu8kH8U\nwBhzPpAGHh5w/KkDPn8YONMYEwOi1tpN/vZHgDOPsB3MKp9BRaQcJ7aXnt4Um3e1HekpRUREprRh\nh9OttR0AfvCuwutJ3wHst9a+zRjzFeBLxphVwEXA+cC/4Q2zA1QCLf7rNqDK39Y64Mu0AQtHKrSu\nLjZiY5bNPIr/3fY0TnEnm5s6OOX4+hGPmSpG075cpvbltnxuXz63DdS+fDdsiAMYY+YC9wErrbV3\nGmOuAx7w334Q+CZQAszB65XPB3qMMZvxwrrS3zcGJPxtA//WK/3tw2pqGrln3VA2j//lacKV+3hq\n3WucefzsEY+ZCurqYqNqX65S+3JbPrcvn9sGal+uG80vKMMOpxtjZgCrgS9aa3/ib34ceIf/+gzg\nBWvtl6y1J1tr3wL8BLjOWvsI8ATwdn/fc4DHrLVtQK8xZqExxgHOAh4bS8MOp28d9Yrprbyyo4We\n3nQ2TisiIjIljdQTvxp
|
||
|
"text/plain": [
|
||
|
"<matplotlib.figure.Figure at 0x10ac82b38>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"n_estimators = np.arange(1, 10)\n",
|
||
|
"clfs = [GMM(n, n_iter=1000).fit(x) for n in n_estimators]\n",
|
||
|
"bics = [clf.bic(x) for clf in clfs]\n",
|
||
|
"aics = [clf.aic(x) for clf in clfs]\n",
|
||
|
"\n",
|
||
|
"plt.plot(n_estimators, bics, label='BIC')\n",
|
||
|
"plt.plot(n_estimators, aics, label='AIC')\n",
|
||
|
"plt.legend();"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"It appears that for both the AIC and BIC, 4 components is preferred."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Example: GMM For Outlier Detection\n",
|
||
|
"\n",
|
||
|
"GMM is what's known as a **Generative Model**: it's a probabilistic model from which a dataset can be generated.\n",
|
||
|
"One thing that generative models can be useful for is **outlier detection**: we can simply evaluate the likelihood of each point under the generative model; the points with a suitably low likelihood (where \"suitable\" is up to your own bias/variance preference) can be labeld outliers.\n",
|
||
|
"\n",
|
||
|
"Let's take a look at this by defining a new dataset with some outliers:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 10,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"np.random.seed(0)\n",
|
||
|
"\n",
|
||
|
"# Add 20 outliers\n",
|
||
|
"true_outliers = np.sort(np.random.randint(0, len(x), 20))\n",
|
||
|
"y = x.copy()\n",
|
||
|
"y[true_outliers] += 50 * np.random.randn(20)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 11,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAe4AAAFVCAYAAAApGgzgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmcZGV97/HPqbW7qpfZemYCgywCjwRFVCQsOkB0cJ0r\nmpv4IttLFIloEhOzIUazXE3uDRfMzUsxhkW93kTvBUQQlE1QoBEMuDCCPOMgiyMD9PRaXVVd1V1V\n949zqqdoerp6qapzTp3v+/XixdQ5Vaee31T3fOt5znOe49RqNURERCQcYn43QERERJZPwS0iIhIi\nCm4REZEQUXCLiIiEiIJbREQkRBTcIiIiIZJYaqcxJgZcDpwAlIDzrbWPL3hOBrgdeK+11jZs3ww8\nBLzBWru71Q0XERGJomY97nOAlLX2NOAi4NLGncaYk4C7gSOBWsP2JPB5IN/S1oqIiERcs+A+HbgF\nwFr7AHDSgv0p3HC3C7ZfAnwO2NeCNoqIiIinWXAPAFMNjyve8DkA1tr7rLV7G19gjHkPMGKtvc3b\n5LSioSIiItLkHDduaPc3PI5Za6tNXnMeUDPGvBE4EfiSMeYd1trnDvaCWq1Wcxzlu4iIRMaqQ69Z\ncA8DO4FrjDGnAA83O6C19oz6n40xdwF/sFRoAziOw8hIbhnNDbahof7Q19ENNUB31NENNYDqCJJu\nqAG6o46hof7mTzqIZsF9PbDDGDPsPT7PGHMu0GetvWLV7yoiIiKrsmRwW2trwIULNr/o0i5r7VkH\nef2i20VERGR1tACLiIhIiCi4RUREQkTBLSIiEiIKbhERkRBRcIuIiISIgltERCREFNwiIiIhouAW\nEREJEQW3iIhIiCi4RUREQkTBLSIiEiIKbhERkRBRcIuIiISIgltERCREFNwiIiIhouAWEREJEQW3\niIhIiCi4RUREQkTBLSIiEiIKbhERkRBRcIuIiISIgltERCREFNwiIiIhouAWEREJEQW3iIhIiCi4\nRUREQkTBLSIiEiIKbhERkRBRcIuIiISIgltERCREFNwiIiIhouAWEREJEQW3SMDFf7KL5PA9fjdD\nRAJCwS0SYM7EOOve8RbWvfNtCm8RARTcIoGW+s6dxHJTAKSvv87fxohIICi4RQIs+b3h+T8nfviQ\njy0RkaBILLXTGBMDLgdOAErA+dbaxxc8JwPcDrzXWmuNMUngauBwIA180lr7jXY0XqTbxX+2G4C5\n444n8dijMDsLyaTPrRIRPzXrcZ8DpKy1pwEXAZc27jTGnATcDRwJ1LzNvwOMWGu3A28GPtPSFotE\nSHzPz6hsO4zZ41+OMztL7tGfMDY2Ov9fpVLxu4ki0mFL9riB04FbAKy1D3hB3SiFG+5fbth2DXCt\n9+cYMNeCdopEz/Q08Wf3UT7jLPKbt9AL/PC2/+Tp493vyIXpKf7rG1/Bhg0b/W2niHRUs+AeAKYa\nHleMMTFrbRXAWnsfgDFm/gnW2ry3rR83xD/WygaLRMbTTwNQeckRzG47DICh3CRj/ev8bJWI+KxZ\ncE8B/Q2P50N7KcaYw4CvAZ+11n51OQ0ZGupv/qQQ6IY6uqEG6II6fvwMAL0vPZzy8ccCsGHiebLZ\nNADVSopNm/rZuDH4dYb+s/B0Qx3dUAN0Tx2r0Sy4h4GdwDXGmFOAh5sd0BizBbgN+KC19q7lNmRk\nJLfcpwbW0FB/6OvohhqgO+oY2rcPgFzfekZ7BxkEks8/Sz5fAqBQKLN/f45qNeVjK5vrhs8CuqOO\nbqgBuqOOtXzxaBbc1wM7jDH1a1LOM8acC/RZa684yGsuBgaBTxhjPuFte4u1dmbVrRSJIi+4q1u2\nMrdxEwC94/v9bJGIBMCSwW2trQEXLti8e5HnndXw5w8DH25J60SirB7cW7dSy2Ypp3vITIz63CgR\n8ZsWYBEJqmfcc9zVLVsBKAxuoFfBLRJ5Cm6RoNq3j5rjUN00BEB+cAO9k2NQbTo/VES6mIJbJKhG\nR6lt2AAJ94xWYWADsWqFntyEzw0TET8puEWCanSU6rr18w8LA+71272TY361SEQCQMEtEkS1GoyN\nUWsI7pm+AQDS01MHe5WIRICCWySI8nmYnaW6viG4s/XgnvSrVSISAApukQCKTYwDvLDHnXUXbFBw\ni0SbglskgJxxN7hf0OPWULmIoOAWCaQle9w59bhFokzBLRJAzsRiPe5BAHo0VC4SaQpukQCKjbmX\nfNXWb5jfph63iICCWySQFu1xz09O0zlukShTcIsEUGz8xee4q4kk5Z6MZpWLRJyCWySAnEl3WdPa\nunUv2F7qG9RQuUjEKbhFAsjJ5QCo9g++YHs500eqOO1Hk0QkIBTcIgEUy7nnsWv9/S/YPpvtI1nM\nu0uiikgkKbhFAsjJ5SCZhHT6BdvLvVli1SrJmYJPLRMRvym4RQLImc7BwAA4zgu2lzN9ACQLGi4X\niSoFt0gAOTkvuBcoZ9yh85SCWySyFNwiAXSw4J7NZAFIFfOdbpKIBISCWyRoajWc3NRBetzuUHkq\nrx63SFQpuEWCJp/HqdUWD+5e7xy3LgkTiSwFt0jAxKbda7iX7HHrHLdIZCX8boCIvFB1wl01bSaZ\nZGxsFIDx8XFqtdqBc9wKbpHIUnCLBEzh2WcAsONz3Hn/UwDsf3YvfYObdDmYiCi4RYImNu2GsrNu\nI3397lrlee/GIrocTER0jlskYOrBPZvte9G+si4HE4k8BbdIwMwHd2ax4NblYCJRp+AWCZj6rPLF\ngntWl4OJRJ6CWyRgYvl6jzv7on3VZJK5VFrnuEUiTMEtEjDxJYbKwbsnt4JbJLIU3CIBs9Q5boDZ\nnoxu6ykSYQpukYCJ5d0Z47O9Lx4qd7dnSM4UO9kkEQkQBbdIwDgFN7jnejOL7p/ryZCYKUCt1slm\niUhAKLhFAibm9abnenoX3T/bkyFWrZIolzrZLBEJCAW3SMDECkUq8Ti1RHLR/bM9bk88WdJwuUgU\nLbnkqTEmBlwOnACUgPOttY8veE4GuB14r7XWLuc1InJwTrHAbKrnoPtnvSH0lCaoiURSsx73OUDK\nWnsacBFwaeNOY8xJwN3AkUBtOa8RkaXFikXm0osPk0NDj1sT1EQiqVlwnw7cAmCtfQA4acH+FG5Q\n2xW8RkSWECsWmE0v0eP2gjuloXKRSGoW3APAVMPjijcUDoC19j5r7d6VvEZEluYUiksHtzdUnixq\nqFwkiprd1nMK6G94HLPWVtvwGoaG+ps9JRS6oY5uqAFCWketRm2mSMUL52w2DUBvb4p4Ikk2myY2\nMABAH7Ns2tTPxo3BrzOUn8UiuqGObqgBuqeO1WgW3MPATuAaY8wpwMPLOOZqXsPISG45Twu0oaH+\n0NfRDTVAiOsolRiqVCgnUgDk8+4lX8VimXjcIZ8vMR1z99Umc+zfn6NaTfnW3OUI7WexQDfU0Q01\nQHfUsZYvHs2C+3pghzFm2Ht8njHmXKDPWnvFcl+z6taJREx98ZWlhsrrC7PocjCRaFoyuK21NeDC\nBZt3L/K8s5q8RkSWwSm4562XNTlNs8pFIkmTxkQC5EBwL3E5WK8uBxOJMgW3SIA4xRX0uDVULhJJ\nCm6RAJnvcS+1ctr8Aiy6HEwkihTcIkFSvzPYktdxu7f71DlukWhScIsEiJNfzlC5e/5bs8pFoknB\nLRIg85eDLTFUXkn1UI3FdJMRkYhScIsEyHIuB8NxmOvJaFa5SEQpuEUCxCm6YbzU3cHAHS7XULlI\nNCm4RQJkOSungTuzPKWbjIhEkoJbJEAOXA6WXvJ5s71Z9bhFIkrBLRIgB3rczYfKU6UZqDa98Z6I\ndBkFt0iA1HvcS13HDTDnLcJSf76IRIeCWyRIljOrHCj3uIuwxBTcIpGj4BYJkOWsVQ4w1+sOpce8\noXURiQ4Ft0iALGetcjiwXnksr+AWiRoFt0iAOIU81XQaYkv/as4Hty4JE4kcBbdIgDiFAlXvfttL\nqU9O0zlukehRcIsEiFM
|
||
|
"text/plain": [
|
||
|
"<matplotlib.figure.Figure at 0x10ac70828>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"clf = GMM(4, n_iter=500, random_state=0).fit(y)\n",
|
||
|
"xpdf = np.linspace(-10, 20, 1000)\n",
|
||
|
"density_noise = np.exp(clf.score(xpdf))\n",
|
||
|
"\n",
|
||
|
"plt.hist(y, 80, normed=True, alpha=0.5)\n",
|
||
|
"plt.plot(xpdf, density_noise, '-r')\n",
|
||
|
"#plt.xlim(-10, 20);"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Now let's evaluate the log-likelihood of each point under the model, and plot these as a function of ``y``:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 12,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAesAAAFVCAYAAADPM8ekAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAF+JJREFUeJzt3X+sZGddx/H3vfSH7b13u4bselkluo36bZvwI2UttGJb\n4rXKBMKP6B+VHylZ0CIQCgQCxVSDpBoRFCLyoyI1QKqgVURWCjWkQCvdXtAgsj6CuTbRpfQaqXt3\nXSh0xj/OzHZ29v6cOTPnOWfer6TpzNk7Z57vnbnzmec5z3nOTKfTQZIk5Wu26gZIkqTNGdaSJGXO\nsJYkKXOGtSRJmTOsJUnKnGEtSVLmziprRxFxAfBhYAE4B3htSumLZe1fkqRpVWbP+jXAZ1JKVwPX\nAe8ucd+SJE2t0nrWwO8D3+3ePhs4WeK+JUmaWkOFdUQcBG4Y2HxdSulLEbEIfAh49aiNkyRJMFPm\ncqMR8QTgNuB1KaU7NvvZTqfTmZmZKe25pSbp/9twSWCpUYYKvtLCOiIuAW4Hfiml9M/beEhndXWt\nlOeu0p49C9S9jibUAM2pY+/eXWdse/DBYxW0ZHhNeS2aUEcTaoBG1TFUWJc5wexmilng74qIz0bE\nX5W4b2kqrBfUm22XNB1Km2CWUnpuWfuSptFWgbx3767a9bAllcNFUaQMbLfnbA9bmk6GtZQpe9GS\negxrqWKbTShbL7DtXUvTx7CWJClzhrWUmcHTKR0Ol2RYSxUadkjboXBpuhjWUkbm5ubX3W7vWppu\nhrWUkZWVo9v+WXvX0vQwrKWKGLaStsuwlmrCoXBpehnWUiYMY0kbMaylGnMoXZoOhrVUAUNW0k4Y\n1lKNOFQuTSfDWsrAKCFsL11qPsNakqTMGdbShI3aE3YoXJo+hrUkSZkzrKWKldFT9ri11GyGtSRJ\nmTOspRryuLU0XQxraYIcrpY0DMNaaojFxd1VN0HSmBjWUoXKHM5ut9ul7UtSXgxrqaY8bi1ND8Na\nkqTMGdZSgziBTWomw1qakHEE6dzcfOn7lJQfw1qqSBnHnFdWjpbQEkm5M6ylhnEoXGoew1qSpMwZ\n1lLNeQqX1HyGtTQBrdbSRJ/PoXCpWQxraQKWlw9X3QRJNWZYSxUoe+h6vf3Zu5aaw7CWJClzhrXU\nEPaupeYyrKWG27t3l6Et1ZxhLTXIZsfCDW2pvgxracwmHZBbTV4zsKX6MaylBjKwpWYxrKUJm9SK\nYw8+eGzLYXFJ9WBYSw23VWhLyl/pYR0RF0XEQxFxTtn7ljQ8T+2S6qvUsI6IXcDbge+UuV9J5TCw\npXoqLawjYgZ4H/Am4GRZ+5UkadqdNcyDIuIgcMPA5vuBP0spfSUiAGZGbJtUezn2Wh988NgZ7dq7\nd5fHtaWMzXQ6nVJ2FBFfB/6ze/dpwL0ppas3eUg5TyxlbGbm9O+sl19+Offcc09FrXnUYLsAyvos\nkLSpoTqypYV1v4hYASKl9PAmP9ZZXV0r/bknbc+eBepeRxNqgDzrGOzBbqf3Oqk6hmnbduX4Wgyj\nCXU0oQZoVB1DhfW4Tt3yK7pUMzkO2UsqDHXMeisppQvHsV9JkqaRi6JIU8oJZVJ9GNaSTllc3F11\nEyStw7CWxqTVWqq6CVsa7F232+2KWiJpM4a1NCZf/vLyaffn5uYraomkujOspTEZ7KWurBytqCU7\n46xwKT+GtTTlZmf9GJBy51+pNOUeeOChqpsgaQuGtSRJmTOsJZ0x+c3j1lJeDGtpDOpw2la/ukx+\nk6aVYS2NwZEjXzvtvqdtSRqFYS2NwYkTx0+7X8eea91GB6QmM6wlAWeewrW8fLiilkgaZFhLAjyF\nS8qZYS1JUuYMa0mSMmdYS2PQP/u7zjPBPd9ayoNhLY1B/2zwkyf/r8KW7MzgJTMl5cGwlkq2f/++\n0+6fd975FbVEUlMY1lLJBnvSdTzHWlJeDGtpjOp4+cnBNi8u7q6oJZJ66vdJImWuf9i7jkPgg+db\nt9vtiloiqcewlkq2snKU2dlZZmdnHQKXVArDWipZq7VEu92m3W67vrakUhjWks7Qf9y6jsfdpaY5\nq+oGSE00NzfPxRdfwqFDd1bdlKH0H6f2mLVUPb8ySyVqtZZYXj58xiUy62Zw1TWH86VqGdZSifov\nK1nnS0wOToyrcy1SExjWkiRlzrCWStSUC3hIyothLZXo4osvWfe2JI3CsJZKdOTI15idneXAgctq\nOxO858CBy067P3iBEkmTY1hLJdm/fx8nThyn3W5z5MjXqm7OyAa/bNTpUp9S0xjWUkn6w6wpwdZ/\n3L2O65xLTWFYSyWp+wU81tP/paPu545LdWZYS9rQ4OplLo4iVcOwlrShwdPPvvzl5YpaIk03w1oq\nSRNP2xpcyawpw/tS3RjWUonm5uYbcdpWPxd3karnVbekEvQu4NFETZzlLtWNPWupBP3nVTfhHOt+\nl156YN3bkibHnrVUgv7Tmpp2itOhQ3eemgXepOF9qU7sWUvaUi+kPXVLqkZpPeuIeAzwDuApwDnA\nTSmlT5W1fylnBw5cduqY9eCa2k3Qf0y+1Vqyhy1NWJk96xcBZ6WUng48F7i4xH1LqlD/+dWeay1N\nXpnHrK8BvhoRfwvMAK8qcd9S1vpngjdtghkU51c37Vi8VCdDhXVEHARuGNi8CpxMKT0rIq4EPghc\nNWL7pOwNXjqyKQui9FtZOcri4m7a7TbtdtuhcGnCZjqdTik7iojbgI+llG7v3v9mSulxmzyknCeW\nKrawsMDx40Wvc3Z2lkceeaTiFo1Hf53z8/Osra1V3CKplmaGeVCZw+BfAFrA7RHxJOD+rR6wulr/\nP/Y9exZqX0cTaoDq6rjooktODYNfeumBkduQ6+vRX+dFF12yaRtzrWGnmlBHE2qAZtUxjDInmN0C\nzETEPwDvBa4vcd9Stpq8IEq/aalTylFpPeuU0sPAwbL2JykvLjsqVcdFUaQRNfFqW+tx2VGpOi43\nKo1oWpbjnJY6pRwZ1lIJpiW8pqVOKTcOg0sjarWWpmrN7GmrV8qBYS2NoLdm9vLy4akIsGmrV8qF\nYS1p21wjXKqGx6ylEUzbpCvXCJeqYc9aGtGhQ3dORVBDsUb47GzxsdFbI1zS+BnW0gj27993xoU8\nmu68884/dduVzKTJMKylIe3fv48TJ45z4sTxqQtsSZNlWEtD6j926/KbksbJsJaGMHisdpqW35yW\n5VWlnDgbXBrR3Nz81Ewwg+mbAS/lwLCWRjSNvUtDWposh8GlIXhtZ1hc3M3i4u6qmyFNBcNaGtE0\n9qwXF3fTbrdpt9sGtjQBhrW0Q63W0qmZ4NN2vHo97Xa76iZIjWdYSzs0rcPe/R544KHT7ruSmTRe\nhrWkoRw4cFnVTZCmhrPBpR1yAZSCp3BJk2PPWtqBVmvptGO00zi5TNLkGdbSDiwvHz7t/jT3KFut\nJZaXD7O8fNhj1tKYGdbSDvQuDzl4W5LGyWPW0g6cd975p07b6r9U5DTymLU0OYa1pKEZ0tJkOI4n\n7cDKylHm5uaZm5tnZeVo1c2RNCXsWUs7ZEhLmjR71tI2tVpLznregL8babwMa2kbPE1pY/5upPEz\nrCWNxMuFSuNnWEvbNDc3z4EDlzkDekD/Km6u6CaNhxPMpC30hnm1Ps+3lsbPsJY0MkNaGi/DWtqC\nPUdJVTOspW0wpLfmFxppfJxgJmlknr4ljZdhLUlS5hwGlzQyj+tL42VYSyqFIS2Nj8PgkiRlzrCW\nNuDFKSTlwrCW1uHsZkk5MawlScqcYS1JUuZKmw0eEecDtwG7gYeBF6aUvlXW/qVJ8rKPknJSZs/6\nxcCRlNJVwJ8Dry9x39JEedlHSTkp8zzrk8Bju7cvoOhdS7XkIh+jWVhYoNOBlZWjVTdFaoSZTqez\n4wdFxEHghr5NHeCVwPuANvCDwJUppW9sspudP7Gk7C0sLHD8+HEA5ufnWVtbq7hFUlZmhnrQMGG9\nnoh4P3BfSumWiHgC8OG
|
||
|
"text/plain": [
|
||
|
"<matplotlib.figure.Figure at 0x10271c470>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"log_likelihood = clf.score_samples(y)[0]\n",
|
||
|
"plt.plot(y, log_likelihood, '.k');"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 13,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"true outliers:\n",
|
||
|
"[ 99 537 705 1033 1653 1701 1871 2046 2135 2163 2222 2496 2599 2607 2732\n",
|
||
|
" 2893 2897 3264 3468 4373]\n",
|
||
|
"\n",
|
||
|
"detected outliers:\n",
|
||
|
"[ 537 705 1653 2046 2135 2163 2496 2732 2893 2897 3067 3468 4373]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"detected_outliers = np.where(log_likelihood < -9)[0]\n",
|
||
|
"\n",
|
||
|
"print(\"true outliers:\")\n",
|
||
|
"print(true_outliers)\n",
|
||
|
"print(\"\\ndetected outliers:\")\n",
|
||
|
"print(detected_outliers)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"The algorithm misses a few of these points, which is to be expected (some of the \"outliers\" actually land in the middle of the distribution!)\n",
|
||
|
"\n",
|
||
|
"Here are the outliers that were missed:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 14,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"{99, 1033, 1701, 1871, 2222, 2599, 2607, 3264}"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 14,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"set(true_outliers) - set(detected_outliers)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"And here are the non-outliers which were spuriously labeled outliers:"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 15,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"{3067}"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 15,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"set(detected_outliers) - set(true_outliers)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Finally, we should note that although all of the above is done in one dimension, GMM does generalize to multiple dimensions, as we'll see in the breakout session."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Other Density Estimators\n",
|
||
|
"\n",
|
||
|
"The other main density estimator that you might find useful is *Kernel Density Estimation*, which is available via ``sklearn.neighbors.KernelDensity``. In some ways, this can be thought of as a generalization of GMM where there is a gaussian placed at the location of *every* training point!"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 16,
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAesAAAFVCAYAAADPM8ekAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3XmcFPWd//FX3zPT3XMfwAy3UCAKIqgIiifGqESziUlM\nslk1hzG7McZsou4mWc3uJptfYjZrspqsMfehkXjjgXgLAiJyQyE3A3MfPdM9R08fvz9qGIdrDuie\n7pl+Px8PHtL9rar+TDnDe77f+ta3bPF4HBEREUlf9lQXICIiIn1TWIuIiKQ5hbWIiEiaU1iLiIik\nOYW1iIhImlNYi4iIpDlnX42GYdiBB4CZQCfwBdM0dx21TQ7wEnCzaZpm9z6/AqYCMeCLpmmayShe\nREQkE/TXs74OcJumOR+4C7ivd6NhGHOBN4CJwOEbtq8AvKZpXgB8D/jPhFYsIiKSYfoL6wXACwCm\naa4G5h7V7sYK9N4953YgzzAMG5AHhBNTqoiISGbqcxgcyAVaer2OGoZhN00zBmCa5koAwzB677MC\nyAK2A0XA4oRVKyIikoH6C+sWwN/rdU9Q9+FbwArTNP/VMIwK4BXDMM4wTfOEPex4PB632WwDq1hE\nRGT4G1To9RfWK7B6xo8ZhjEP2DiAY3r5oDfeBLgAR1872Gw26upaB3DozFZS4td5GiCdq4HReRo4\nnauB0XkamJISf/8b9dJfWD8BLDIMY0X365sMw7gB8Jmm+dAJ9vkR8BvDMN7ECuq7TdNsH1RVIiIi\n0qPPsDZNMw7cetTbO46z3SW9/t4MfDQh1YmIiIgWRREREUl3CmsREZE0p7AWERFJcwprERGRNKew\nFhERSXMKaxERkTSnsBYRkYx08GAl3/72t7jllpv42tdu5Vvfup09e3bz8MO/5KKLzqO+vr5n26am\nRi666Dyef/5ZqqoOceGF5/DHP/72iOPdeefX+epXb0lKrf0tiiIiIpJU99zj4ZlnEhtHixdHuOee\nzhO2d3R0cPfd3+DOO7/DjBlnALBt2xZ+8pMfMnv2HMaOHccrr7zEJz5xAwAvv7yMUaNGA9aqm+Xl\nFbz++qt89rM3AhAINHPwYCWFhUUJ/ToOU89aREQyzooVbzBnzrk9QQ0wffoMfvazXwJw6aWLePXV\nl3raVq58iwULLux5nZeXT0FBIfv27QXglVeWc8kllxOPx0kG9axFRCSl7rmns89ecDJUVR2ivLy8\n5/Xdd3+DYDBIQ0M9s2bNxjCmk5WVzaFDB4nFYpSWluF2e444xuWXf4jly1/k85+/hbfeeoNbbvkK\n69evS0q96lmLiEjGKS0dRVXVoZ7XP/jBffzsZ7/E788lGo0CH4TxSy+9wBVXfPiYY1x44UWsWPEG\n1dVVFBUVkZWVlbR6FdYiIpJxLrzwItauXcOWLZt73qusPEBdXW3P64svvpQ333ydjRvXM3v2nGOO\nkZ2dzdix43nggftZtOhKkjQCDmgYXEREMlB2djY//OF/8+CDP6OhoZ5oNIrD4eC22+5g9+5d2Gw2\nvF4fZWVllJePxWY78vHTh19fccWH+fGPf8C9936f/fv3HbNdotiSdTF8kOJ6/mn/9JzYgdO5Ghid\np4HTuRoYnaeBKSnxDyrVNQwuIiKS5hTWIiIiaU5hLSIikuYU1iIiImlOYS0iIpLmFNYiKRSJwC23\nZPHJT2bT2JjqakQkXek+a5EUevxxJ0884QLggQfcfPvb4RRXJJIZ1q1by1NPPc69934fgFdfXc5v\nfvMQBQWFNDc3k5trrWSWl5fPbbfdwejRY3j44V+yfPmLFBeX9BznnHPO43Ofuznp9SqsRVLomWdc\nR/xdYS2ZyHvPt/E882RCj9m5+DpC9/zHgLZ96aUXePTRP/M///MLHnzwfj7zmX/g3HPnAbBhw3q+\n+927eOih32Oz2fjUpz7Ltdf+XUJrHQgNg4ukSDwO775rZ9y4GIsWRdizx051dXJWPxKRIx1eaeyF\nF5by17/+hZ/+9AEKCgoAjnhy1qxZZ+F0Ojl4sPKYtqGknrVIitTW2qivt3PVVV3MmhXjpZecbNxo\nZ9SoaKpLExlSoXv+Y8C94ESJx+Ns3Lie+vo6WltbiUQiJ9y2oKCI5uZmAB599E+8/PKynrbPfe5m\nzjnnvKTXq7AWSZHNm62BrRkzYiyueZixbGLXxv+EK1z97CkiiVBUVMxPf/oATz/9BN/73ne47777\nAY5Z37u6uorS0lIADYOLZJotWxwAXOhdy/m//kdu5RdMffb+FFclkjnKyytwuVx87GOfwOVy8rvf\nPQwcOdT9zjuryM7OpqSk9Ji2odRnz9owDDvwADAT6AS+YJrmrqO2yQFeAm42TdPsfu9uYDHgAn5u\nmubvklC7yLAVjUZ5770uwMOsTb/teX/2nidpbLyRvLx8HA5HyuoTGelsNtsRPei77/43br75Mzgc\nDkxzO3/842+x2x14vV7uvfcHPdsdPQw+btx4vvnNf0l6vf0Ng18HuE3TnG8YxnnAfd3vAWAYxlzg\nF8AYIN793sXA+d37eIFvJaNwkeEsEGhmw5Y8HI4Y2atfIeJysy4+m3PbV3Pf029zzUfOp7CwKNVl\nioxYs2fPOeIZ1fn5+Tz++NI+97n55i9x881fSnZpx9XfMPgC4AUA0zRXA3OPandjhbfZ670PAZsM\nw3gSeAZ4OjGliowsgaYcyouaKTq0l7opZ7Ih17pVZExjfYorE5F0019Y5wItvV5Hu4fGATBNc6Vp\nmpVH7VMMzAE+DnwZ+FMiChUZSTo7bbQGPJzvW4c9FqN+0nSqiqcAkLPzYIqrE5F0098weAvg7/Xa\nbppmrJ996oFtpmlGgB2GYXQYhlFsmmaf3YWSEn9fzdJN52ng0vlc7dlj3SYy17MZgODU02ntOA12\nQEHlQYqL/RQVDU396Xye0o3O1cDoPCVef2G9Amui2GOGYcwDNg7gmG8BXwN+YhjGGMALNPS3U11d\n6wAOndlKSvw6TwOU7udqy5ZOACZHrStItcXlhEZZs02zq6upr28lFnMnvY50P0/pROdqYHSeBmaw\nv9D0F9ZPAIsMw1jR/fomwzBuAHymaT50vB1M01xqGMZCwzDWYA2zf8U0zdTMdRdJU4cOWT96E8I7\nAQiMGY8zmEs7WeQ3V9Pf8JWIZJY+w7o7ZG896u0dx9nukqNe33nqpYmMXIfDurx1Nx2+PDr9+RQU\nRdjHeMpCh6hNcX0ikl60KIpICtTUOHAQoahpP4Ex4wHIzY9QSQUFXY3Ywp0prlBE0onCWiQF6usd\nTGAvjliEwJhxAPj8EWooA8Chh1uLSC8Ka5EUqKtzMM1lXVFqKRsLgN0Bze5iABwN/c7JFJEMorAW\nSYG6OgdTs3YDECwd0/N+S7a1apmjUWEtIh9QWIsMsWgUGhsdTHbuBaC1ZHRPW8jXvcRojVYxE5EP\nKKxFhlhjo41o1MYE214Agr3Cui23EICug02pKE1E0pTCWmSI1dZaT/qpiB4gZrcTKiztaQsXFAAQ\nOahhcBH5gMJaZIgdDuvR4QO0FZYSd7p62rqKrbCmTmEtIh9QWIsMsdpaG066KO6oOWIIHCBWmg+A\nU7PBRaQXhbXIEKuttVPOQezx2BGTywDcxW5C5OAOKKxF5AMKa5Eh1thoYzz7AAgWHxnWXl+UOkrI\nCdalojQRSVMKa5EhdkRYlxwd1hEaKSS7s+V4u4pIhlJYiwyxpiZOGNY5vijN5JMdCUIkkoryRCQN\nKaxFhlhDg/2491gDeL1WWAPYWgJDXZqIpCmFtcgQa2qCyY7jX7N2ueO02PMAsAUU1iJiUViLDLGm\nJhvjbfto8+cRyco+pj3kygXArp61iHRTWIsMoVgMmhphTOQArUVlx90m5LHC2tbcPJSliUgaU1iL\nDKFAAIrjdXjinbQUjTruNu1ZfgDizepZi4hFYS0yhHrfttVSfPyedUe2D4DOGoW1iFgU1iJDqHdY\nn2gYPOz1Wv9VWItIN4W
|
||
|
"text/plain": [
|
||
|
"<matplotlib.figure.Figure at 0x109ad97b8>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"from sklearn.neighbors import KernelDensity\n",
|
||
|
"kde = KernelDensity(0.15).fit(x[:, None])\n",
|
||
|
"density_kde = np.exp(kde.score_samples(xpdf[:, None]))\n",
|
||
|
"\n",
|
||
|
"plt.hist(x, 80, normed=True, alpha=0.5)\n",
|
||
|
"plt.plot(xpdf, density, '-b', label='GMM')\n",
|
||
|
"plt.plot(xpdf, density_kde, '-r', label='KDE')\n",
|
||
|
"plt.xlim(-10, 20)\n",
|
||
|
"plt.legend();"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"All of these density estimators can be viewed as **Generative models** of the data: that is, that is, the model tells us how more data can be created which fits the model."
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 2",
|
||
|
"language": "python",
|
||
|
"name": "python2"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 2
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython2",
|
||
|
"version": "2.7.10"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 0
|
||
|
}
|