data-science-ipython-notebooks/python-data/functions.ipynb

597 lines
14 KiB
Plaintext
Raw Normal View History

{
2015-05-14 18:55:28 +08:00
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<small><i>This notebook was prepared by [Donne Martin](http://donnemartin.com). Source and license info is on [GitHub](https://github.com/donnemartin/data-science-ipython-notebooks).</i></small>"
]
},
{
2015-05-14 18:55:28 +08:00
"cell_type": "markdown",
"metadata": {},
"source": [
"# Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Functions as Objects\n",
"* Lambda Functions\n",
"* Closures\n",
"* \\*args, \\*\\*kwargs\n",
"* Currying\n",
"* Generators\n",
"* Generator Expressions\n",
"* itertools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Functions as Objects"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python treats functions as objects which can simplify data cleaning. The following contains a transform utility class with two functions to clean strings:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overwriting transform_util.py\n"
2015-01-28 00:14:57 +08:00
]
2015-05-14 18:55:28 +08:00
}
],
"source": [
"%%file transform_util.py\n",
"import re\n",
"\n",
"\n",
"class TransformUtil:\n",
"\n",
" @classmethod\n",
" def remove_punctuation(cls, value):\n",
" \"\"\"Removes !, #, and ?.\n",
" \"\"\" \n",
" return re.sub('[!#?]', '', value) \n",
"\n",
" @classmethod\n",
" def clean_strings(cls, strings, ops): \n",
" \"\"\"General purpose method to clean strings.\n",
"\n",
" Pass in a sequence of strings and the operations to perform.\n",
" \"\"\" \n",
" result = [] \n",
" for value in strings: \n",
" for function in ops: \n",
" value = function(value) \n",
" result.append(value) \n",
" return result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below are nose tests that exercises the utility functions:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overwriting tests/test_transform_util.py\n"
2015-01-28 00:14:57 +08:00
]
2015-05-14 18:55:28 +08:00
}
],
"source": [
"%%file tests/test_transform_util.py\n",
"from nose.tools import assert_equal\n",
"from ..transform_util import TransformUtil\n",
"\n",
"\n",
"class TestTransformUtil():\n",
"\n",
" states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', \\\n",
" 'FlOrIda', 'south carolina##', 'West virginia?']\n",
" \n",
" expected_output = ['Alabama',\n",
" 'Georgia',\n",
" 'Georgia',\n",
" 'Georgia',\n",
" 'Florida',\n",
" 'South Carolina',\n",
" 'West Virginia']\n",
" \n",
" def test_remove_punctuation(self):\n",
" assert_equal(TransformUtil.remove_punctuation('!#?'), '')\n",
" \n",
" def test_map_remove_punctuation(self):\n",
" # Map applies a function to a collection\n",
" output = map(TransformUtil.remove_punctuation, self.states)\n",
" assert_equal('!#?' not in output, True)\n",
"\n",
" def test_clean_strings(self):\n",
" clean_ops = [str.strip, TransformUtil.remove_punctuation, str.title] \n",
" output = TransformUtil.clean_strings(self.states, clean_ops)\n",
" assert_equal(output, self.expected_output)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the nose tests in verbose mode:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"core.tests.test_transform_util.TestTransformUtil.test_clean_strings ... ok\r\n",
"core.tests.test_transform_util.TestTransformUtil.test_map_remove_punctuation ... ok\r\n",
"core.tests.test_transform_util.TestTransformUtil.test_remove_punctuation ... ok\r\n",
"\r\n",
"----------------------------------------------------------------------\r\n",
"Ran 3 tests in 0.001s\r\n",
"\r\n",
"OK\r\n"
2015-01-28 00:14:57 +08:00
]
2015-05-14 18:55:28 +08:00
}
],
"source": [
"!nosetests tests/test_transform_util.py -v"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lambda Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lambda functions are anonymous functions and are convenient for data analysis, as data transformation functions take functions as arguments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort a sequence of strings by the number of letters:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['f', 'b', 'fo', 'ba', 'foo', 'baz', 'bar,']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"strings = ['foo', 'bar,', 'baz', 'f', 'fo', 'b', 'ba']\n",
"strings.sort(key=lambda x: len(list(x)))\n",
"strings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Closures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Closures are dynamically-genearated functions returned by another function. The returned function has access to the variables in the local namespace where it was created. \n",
"\n",
"Closures are often used to implement decorators. Decorators are useful to transparently wrap something with additional functionality:\n",
"\n",
"```python\n",
"def my_decorator(fun):\n",
" def myfun(*params, **kwparams):\n",
" do_something()\n",
" fun(*params, **kwparams)\n",
" return myfun\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each time the following closure() is called, it generates the same output:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Secret value is: 7\n"
2015-01-28 01:57:08 +08:00
]
2015-05-14 18:55:28 +08:00
}
],
"source": [
"def make_closure(x):\n",
" def closure():\n",
" print('Secret value is: %s' % x)\n",
" return closure\n",
"\n",
"closure = make_closure(7)\n",
"closure()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keep track of arguments passed:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[False, True, False, False, False, False, False, True, True, True]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def make_watcher():\n",
" dict_seen = {}\n",
" \n",
" def watcher(x):\n",
" if x in dict_seen:\n",
" return True\n",
" else:\n",
" dict_seen[x] = True\n",
" return False\n",
" \n",
" return watcher\n",
"\n",
"watcher = make_watcher()\n",
"seq = [1, 1, 2, 3, 5, 8, 13, 2, 5, 13]\n",
"[watcher(x) for x in seq]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \\*args, \\*\\*kwargs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\\*args and \\*\\*kwargs are useful when you don't know how many arguments might be passed to your function or when you want to handle named arguments that you have not defined in advance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print arguments and call the input function on *args:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('arg: %s', 'foo')\n",
"('args: %s', (1, 2, 3, 4, 5))\n",
"('kwargs: %s', {})\n",
"('func result: %s', 15)\n"
2015-01-28 01:57:08 +08:00
]
2015-05-14 18:55:28 +08:00
}
],
"source": [
"def foo(func, arg, *args, **kwargs):\n",
" print('arg: %s', arg)\n",
" print('args: %s', args)\n",
" print('kwargs: %s', kwargs)\n",
" \n",
" print('func result: %s', func(args))\n",
"\n",
"foo(sum, \"foo\", 1, 2, 3, 4, 5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Currying"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Currying means to derive new functions from existing ones by partial argument appilcation. Currying is used in pandas to create specialized functions for transforming time series data.\n",
"\n",
"The argument y in add_numbers is curried:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def add_numbers(x, y):\n",
" return x + y\n",
"\n",
"add_seven = lambda y: add_numbers(7, y)\n",
"add_seven(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The built-in functools can simplify currying with partial:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from functools import partial\n",
"add_five = partial(add_numbers, 5)\n",
"add_five(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generators"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A generator is a simple way to construct a new iterable object. Generators return a sequence lazily. When you call the generator, no code is immediately executed until you request elements from the generator.\n",
"\n",
"Find all the unique ways to make change for $1:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\n",
"4\n",
"9\n",
"16\n",
"25\n"
]
2015-05-14 18:55:28 +08:00
}
],
"source": [
"def squares(n=5):\n",
" for x in xrange(1, n + 1):\n",
" yield x ** 2\n",
"\n",
"# No code is executed\n",
"gen = squares()\n",
"\n",
"# Generator returns values lazily\n",
"for x in squares():\n",
" print x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generator Expressions\n",
"\n",
"A generator expression is analogous to a comprehension. A list comprehension is enclosed by [], a generator expression is enclosed by ():"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\n",
"4\n",
"9\n",
"16\n",
"25\n"
2015-01-28 02:15:47 +08:00
]
2015-05-14 18:55:28 +08:00
}
],
"source": [
"gen = (x ** 2 for x in xrange(1, 6))\n",
"for x in gen:\n",
" print x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## itertools\n",
"\n",
"The library itertools has a collection of generators useful for data analysis.\n",
"\n",
"Function groupby takes a sequence and a key function, grouping consecutive elements in the sequence by the input function's return value (the key). groupby returns the function's return value (the key) and a generator."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"f ['foo']\n",
"b ['bar', 'baz']\n"
2015-01-28 02:15:47 +08:00
]
}
],
2015-05-14 18:55:28 +08:00
"source": [
"import itertools\n",
"first_letter = lambda x: x[0]\n",
"strings = ['foo', 'bar', 'baz']\n",
"for letter, gen_names in itertools.groupby(strings, first_letter):\n",
" print letter, list(gen_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"itertools contains many other useful functions:\n",
"\n",
"| Function | Description|\n",
"| ------------- |-------------|\n",
"| imap | Generator version of map |\n",
"| ifilter | Generator version of filter |\n",
"| combinations | Generates a sequence of all possible k-tuples of elements in the iterable, ignoring order |\n",
"| permutations | Generates a sequence of all possible k-tuples of elements in the iterable, respecting order |\n",
"| groupby | Generates (key, sub-iterator) for each unique key |"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
2015-05-14 18:55:28 +08:00
},
"nbformat": 4,
"nbformat_minor": 0
}