data-science-ipython-notebooks/deep-learning/tensor-flow-tutorials/5_clustering.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": "true"
   },
   "source": [
    "# Table of Contents\n",
    " <p><div class=\"lev1\"><a href=\"#Clustering-and-k-means\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Clustering and k-means</a></div><div class=\"lev1\"><a href=\"#Generating-Samples\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Generating Samples</a></div><div class=\"lev1\"><a href=\"#Initialisation\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Initialisation</a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clustering and k-means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now venture into our first application, which is clustering with the k-means algorithm. Clustering is a data mining exercise where we take a bunch of data and find groups of points that are similar to each other. K-means is an algorithm that is great for finding clusters in many types of datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generating Samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First up, we are going to need to generate some samples. We could generate the samples randomly, but that is likely to either give us very sparse points, or just one big group - not very exciting for clustering.\n",
    "\n",
    "Instead, we are going to start by generating three centroids, and then randomly choose (with a normal distribution) around that point. First up, here is a method for doing this:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting functions.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile functions.py\n",
    "\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed):\n",
    "    np.random.seed(seed)\n",
    "    slices = []\n",
    "    centroids = []\n",
    "    # Create samples for each cluster\n",
    "    for i in range(n_clusters):\n",
    "        samples = tf.random_normal((n_samples_per_cluster, n_features),\n",
    "                               mean=0.0, stddev=5.0, dtype=tf.float32, seed=seed, name=\"cluster_{}\".format(i))\n",
    "        current_centroid = (np.random.random((1, n_features)) * embiggen_factor) - (embiggen_factor/2)\n",
    "        centroids.append(current_centroid)\n",
    "        samples += current_centroid\n",
    "        slices.append(samples)\n",
    "    # Create a big \"samples\" dataset\n",
    "    samples = tf.concat(0, slices, name='samples')\n",
    "    centroids = tf.concat(0, centroids, name='centroids')\n",
    "    return centroids, samples\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The way this works is to create **n_clusters different** centroids at random (using **np.random.random((1, n_features))**) and using those as the centre points for **tf.random_normal**. The **tf.random_normal** function generates normally distributed random values, which we then add to the current centre point. This creates a blob of points around that center. We then record the centroids (**centroids.append**) and the generated samples (**slices.append(samples)**). Finally, we create “One big list of samples” using **tf.concat**, and convert the centroids to a TensorFlow Variable as well, also using **tf.concat**.\n",
    "\n",
    "Saving this **create_samples** method in a file called **functions.py** allows us to import these methods into our scripts for this (and the next!) lesson. Create a new file called **generate_samples.py**, which has the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting generate_samples.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile generate_samples.py\n",
    "\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "\n",
    "from functions import create_samples\n",
    "\n",
    "n_features = 2\n",
    "n_clusters = 3\n",
    "n_samples_per_cluster = 500\n",
    "seed = 700\n",
    "embiggen_factor = 70\n",
    "\n",
    "np.random.seed(seed)\n",
    "\n",
    "centroids, samples = create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed)\n",
    "\n",
    "model = tf.initialize_all_variables()\n",
    "with tf.Session() as session:\n",
    "    sample_values = session.run(samples)\n",
    "    centroid_values = session.run(centroids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This just sets up the number of clusters and features (I recommend keeping the number of features at 2, allowing us to visualise them later), and the number of samples to generate. Increasing the [embiggen_factor](https://en.wiktionary.org/wiki/embiggen) will increase the “spread” or the size of the clusters. I chose a value here that provides good learning opportunity, as it generates visually identifiable clusters.\n",
    "\n",
    "To visualise the results, lets create a plotting function using matplotlib. Add this code to functions.py:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to functions.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile -a functions.py\n",
    "\n",
    "\n",
    "def plot_clusters(all_samples, centroids, n_samples_per_cluster):\n",
    "    import matplotlib.pyplot as plt\n",
    "    # Plot out the different clusters\n",
    "    # Choose a different colour for each cluster\n",
    "    colour = plt.cm.rainbow(np.linspace(0,1,len(centroids)))\n",
    "    for i, centroid in enumerate(centroids):\n",
    "        # Grab just the samples fpr the given cluster and plot them out with a new colour\n",
    "        samples = all_samples[i*n_samples_per_cluster:(i+1)*n_samples_per_cluster]\n",
    "        plt.scatter(samples[:,0], samples[:,1], c=colour[i])\n",
    "        # Also plot centroid\n",
    "        plt.plot(centroid[0], centroid[1], markersize=35, marker=\"x\", color='k', mew=10)\n",
    "        plt.plot(centroid[0], centroid[1], markersize=30, marker=\"x\", color='m', mew=5)\n",
    "    plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All this code does is plots out the samples from each cluster using a different colour, and creates a big magenta X where the centroid is. The centroid is given as an argument, which will be handy later on.\n",
    "\n",
    "Update the **generate_samples.py** to import this function by adding **from functions import plot_clusters** to the top of the file. Then, add this line of code to the bottom:\n",
    "\n",
    "```python\n",
    "plot_clusters(sample_values, centroid_values, n_samples_per_cluster)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting generate_samples.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile generate_samples.py\n",
    "\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "\n",
    "from functions import create_samples\n",
    "from functions import plot_clusters\n",
    "\n",
    "n_features = 2\n",
    "n_clusters = 3\n",
    "n_samples_per_cluster = 500\n",
    "seed = 700\n",
    "embiggen_factor = 70\n",
    "\n",
    "np.random.seed(seed)\n",
    "\n",
    "centroids, samples = create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed)\n",
    "\n",
    "model = tf.initialize_all_variables()\n",
    "with tf.Session() as session:\n",
    "    sample_values = session.run(samples)\n",
    "    centroid_values = session.run(centroids)\n",
    "    \n",
    "plot_clusters(sample_values, centroid_values, n_samples_per_cluster)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!python generate_samples.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Initialisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  },
  "toc": {
   "toc_cell": true,
   "toc_number_sections": true,
   "toc_threshold": "8",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}