2015-01-27 02:56:49 +08:00
{
2015-05-14 18:55:28 +08:00
" cells " : [
2015-06-19 09:07:36 +08:00
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
2015-11-01 19:44:00 +08:00
" This notebook was prepared by [Donne Martin](http://donnemartin.com). Source and license info is on [GitHub](https://github.com/donnemartin/data-science-ipython-notebooks). "
2015-06-19 09:07:36 +08:00
]
} ,
2015-01-27 02:56:49 +08:00
{
2015-05-14 18:55:28 +08:00
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" # Functions "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" * Functions as Objects \n " ,
" * Lambda Functions \n " ,
" * Closures \n " ,
" * \\ *args, \\ * \\ *kwargs \n " ,
" * Currying \n " ,
" * Generators \n " ,
" * Generator Expressions \n " ,
" * itertools "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Functions as Objects "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Python treats functions as objects which can simplify data cleaning. The following contains a transform utility class with two functions to clean strings: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 1 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" name " : " stdout " ,
" output_type " : " stream " ,
" text " : [
" Overwriting transform_util.py \n "
2015-01-28 00:14:57 +08:00
]
2015-05-14 18:55:28 +08:00
}
] ,
" source " : [
" %% file transform_util.py \n " ,
" import re \n " ,
" \n " ,
" \n " ,
" class TransformUtil: \n " ,
" \n " ,
" @classmethod \n " ,
" def remove_punctuation(cls, value): \n " ,
" \" \" \" Removes !, #, and ?. \n " ,
" \" \" \" \n " ,
" return re.sub( ' [!#?] ' , ' ' , value) \n " ,
" \n " ,
" @classmethod \n " ,
" def clean_strings(cls, strings, ops): \n " ,
" \" \" \" General purpose method to clean strings. \n " ,
" \n " ,
" Pass in a sequence of strings and the operations to perform. \n " ,
" \" \" \" \n " ,
" result = [] \n " ,
" for value in strings: \n " ,
" for function in ops: \n " ,
" value = function(value) \n " ,
" result.append(value) \n " ,
" return result "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Below are nose tests that exercises the utility functions: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 2 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" name " : " stdout " ,
" output_type " : " stream " ,
" text " : [
" Overwriting tests/test_transform_util.py \n "
2015-01-28 00:14:57 +08:00
]
2015-05-14 18:55:28 +08:00
}
] ,
" source " : [
" %% file tests/test_transform_util.py \n " ,
" from nose.tools import assert_equal \n " ,
" from ..transform_util import TransformUtil \n " ,
" \n " ,
" \n " ,
" class TestTransformUtil(): \n " ,
" \n " ,
" states = [ ' Alabama ' , ' Georgia! ' , ' Georgia ' , ' georgia ' , \\ \n " ,
" ' FlOrIda ' , ' south carolina## ' , ' West virginia? ' ] \n " ,
" \n " ,
" expected_output = [ ' Alabama ' , \n " ,
" ' Georgia ' , \n " ,
" ' Georgia ' , \n " ,
" ' Georgia ' , \n " ,
" ' Florida ' , \n " ,
" ' South Carolina ' , \n " ,
" ' West Virginia ' ] \n " ,
" \n " ,
" def test_remove_punctuation(self): \n " ,
" assert_equal(TransformUtil.remove_punctuation( ' !#? ' ), ' ' ) \n " ,
" \n " ,
" def test_map_remove_punctuation(self): \n " ,
" # Map applies a function to a collection \n " ,
" output = map(TransformUtil.remove_punctuation, self.states) \n " ,
" assert_equal( ' !#? ' not in output, True) \n " ,
" \n " ,
" def test_clean_strings(self): \n " ,
" clean_ops = [str.strip, TransformUtil.remove_punctuation, str.title] \n " ,
" output = TransformUtil.clean_strings(self.states, clean_ops) \n " ,
" assert_equal(output, self.expected_output) \n "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Execute the nose tests in verbose mode: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 3 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" name " : " stdout " ,
" output_type " : " stream " ,
" text " : [
" core.tests.test_transform_util.TestTransformUtil.test_clean_strings ... ok \r \n " ,
" core.tests.test_transform_util.TestTransformUtil.test_map_remove_punctuation ... ok \r \n " ,
" core.tests.test_transform_util.TestTransformUtil.test_remove_punctuation ... ok \r \n " ,
" \r \n " ,
" ---------------------------------------------------------------------- \r \n " ,
" Ran 3 tests in 0.001s \r \n " ,
" \r \n " ,
" OK \r \n "
2015-01-28 00:14:57 +08:00
]
2015-05-14 18:55:28 +08:00
}
] ,
" source " : [
" !nosetests tests/test_transform_util.py -v "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Lambda Functions "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Lambda functions are anonymous functions and are convenient for data analysis, as data transformation functions take functions as arguments. "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Sort a sequence of strings by the number of letters: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 4 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" text/plain " : [
" [ ' f ' , ' b ' , ' fo ' , ' ba ' , ' foo ' , ' baz ' , ' bar, ' ] "
]
} ,
" execution_count " : 4 ,
" metadata " : { } ,
" output_type " : " execute_result "
}
] ,
" source " : [
" strings = [ ' foo ' , ' bar, ' , ' baz ' , ' f ' , ' fo ' , ' b ' , ' ba ' ] \n " ,
" strings.sort(key=lambda x: len(list(x))) \n " ,
" strings "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Closures "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Closures are dynamically-genearated functions returned by another function. The returned function has access to the variables in the local namespace where it was created. \n " ,
" \n " ,
" Closures are often used to implement decorators. Decorators are useful to transparently wrap something with additional functionality: \n " ,
" \n " ,
" ```python \n " ,
" def my_decorator(fun): \n " ,
" def myfun(*params, **kwparams): \n " ,
" do_something() \n " ,
" fun(*params, **kwparams) \n " ,
" return myfun \n " ,
" ``` "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Each time the following closure() is called, it generates the same output: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 5 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" name " : " stdout " ,
" output_type " : " stream " ,
" text " : [
" Secret value is: 7 \n "
2015-01-28 01:57:08 +08:00
]
2015-05-14 18:55:28 +08:00
}
] ,
" source " : [
" def make_closure(x): \n " ,
" def closure(): \n " ,
" print( ' Secret value is: %s ' % x ) \n " ,
" return closure \n " ,
" \n " ,
" closure = make_closure(7) \n " ,
" closure() "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Keep track of arguments passed: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 6 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" text/plain " : [
" [False, True, False, False, False, False, False, True, True, True] "
]
} ,
" execution_count " : 6 ,
" metadata " : { } ,
" output_type " : " execute_result "
}
] ,
" source " : [
" def make_watcher(): \n " ,
" dict_seen = {} \n " ,
" \n " ,
" def watcher(x): \n " ,
" if x in dict_seen: \n " ,
" return True \n " ,
" else: \n " ,
" dict_seen[x] = True \n " ,
" return False \n " ,
" \n " ,
" return watcher \n " ,
" \n " ,
" watcher = make_watcher() \n " ,
" seq = [1, 1, 2, 3, 5, 8, 13, 2, 5, 13] \n " ,
" [watcher(x) for x in seq] "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## \\ *args, \\ * \\ *kwargs "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" \\ *args and \\ * \\ *kwargs are useful when you don ' t know how many arguments might be passed to your function or when you want to handle named arguments that you have not defined in advance. "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Print arguments and call the input function on *args: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 7 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" name " : " stdout " ,
" output_type " : " stream " ,
" text " : [
" ( ' arg: %s ' , ' foo ' ) \n " ,
" ( ' args: %s ' , (1, 2, 3, 4, 5)) \n " ,
" ( ' kwargs: %s ' , {} ) \n " ,
" ( ' func result: %s ' , 15) \n "
2015-01-28 01:57:08 +08:00
]
2015-05-14 18:55:28 +08:00
}
] ,
" source " : [
" def foo(func, arg, *args, **kwargs): \n " ,
" print( ' arg: %s ' , arg) \n " ,
" print( ' args: %s ' , args) \n " ,
" print( ' kwargs: %s ' , kwargs) \n " ,
" \n " ,
" print( ' func result: %s ' , func(args)) \n " ,
" \n " ,
" foo(sum, \" foo \" , 1, 2, 3, 4, 5) "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Currying "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" Currying means to derive new functions from existing ones by partial argument appilcation. Currying is used in pandas to create specialized functions for transforming time series data. \n " ,
" \n " ,
" The argument y in add_numbers is curried: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 8 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" text/plain " : [
" 10 "
]
} ,
" execution_count " : 8 ,
" metadata " : { } ,
" output_type " : " execute_result "
}
] ,
" source " : [
" def add_numbers(x, y): \n " ,
" return x + y \n " ,
" \n " ,
" add_seven = lambda y: add_numbers(7, y) \n " ,
" add_seven(3) "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" The built-in functools can simplify currying with partial: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 9 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" data " : {
" text/plain " : [
" 7 "
]
} ,
" execution_count " : 9 ,
" metadata " : { } ,
" output_type " : " execute_result "
}
] ,
" source " : [
" from functools import partial \n " ,
" add_five = partial(add_numbers, 5) \n " ,
" add_five(2) "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Generators "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" A generator is a simple way to construct a new iterable object. Generators return a sequence lazily. When you call the generator, no code is immediately executed until you request elements from the generator. \n " ,
" \n " ,
" Find all the unique ways to make change for $1: "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 10 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" name " : " stdout " ,
" output_type " : " stream " ,
" text " : [
" 1 \n " ,
" 4 \n " ,
" 9 \n " ,
" 16 \n " ,
" 25 \n "
2015-01-28 01:58:50 +08:00
]
2015-05-14 18:55:28 +08:00
}
] ,
" source " : [
" def squares(n=5): \n " ,
" for x in xrange(1, n + 1): \n " ,
" yield x ** 2 \n " ,
" \n " ,
" # No code is executed \n " ,
" gen = squares() \n " ,
" \n " ,
" # Generator returns values lazily \n " ,
" for x in squares(): \n " ,
" print x "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## Generator Expressions \n " ,
" \n " ,
" A generator expression is analogous to a comprehension. A list comprehension is enclosed by [], a generator expression is enclosed by (): "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 11 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" name " : " stdout " ,
" output_type " : " stream " ,
" text " : [
" 1 \n " ,
" 4 \n " ,
" 9 \n " ,
" 16 \n " ,
" 25 \n "
2015-01-28 02:15:47 +08:00
]
2015-05-14 18:55:28 +08:00
}
] ,
" source " : [
" gen = (x ** 2 for x in xrange(1, 6)) \n " ,
" for x in gen: \n " ,
" print x "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" ## itertools \n " ,
" \n " ,
" The library itertools has a collection of generators useful for data analysis. \n " ,
" \n " ,
" Function groupby takes a sequence and a key function, grouping consecutive elements in the sequence by the input function ' s return value (the key). groupby returns the function ' s return value (the key) and a generator. "
]
} ,
{
" cell_type " : " code " ,
" execution_count " : 12 ,
" metadata " : {
" collapsed " : false
} ,
" outputs " : [
{
" name " : " stdout " ,
" output_type " : " stream " ,
" text " : [
" f [ ' foo ' ] \n " ,
" b [ ' bar ' , ' baz ' ] \n "
2015-01-28 02:15:47 +08:00
]
2015-01-27 02:56:49 +08:00
}
] ,
2015-05-14 18:55:28 +08:00
" source " : [
" import itertools \n " ,
" first_letter = lambda x: x[0] \n " ,
" strings = [ ' foo ' , ' bar ' , ' baz ' ] \n " ,
" for letter, gen_names in itertools.groupby(strings, first_letter): \n " ,
" print letter, list(gen_names) "
]
} ,
{
" cell_type " : " markdown " ,
" metadata " : { } ,
" source " : [
" itertools contains many other useful functions: \n " ,
" \n " ,
" | Function | Description| \n " ,
" | ------------- |-------------| \n " ,
" | imap | Generator version of map | \n " ,
" | ifilter | Generator version of filter | \n " ,
" | combinations | Generates a sequence of all possible k-tuples of elements in the iterable, ignoring order | \n " ,
" | permutations | Generates a sequence of all possible k-tuples of elements in the iterable, respecting order | \n " ,
" | groupby | Generates (key, sub-iterator) for each unique key | "
]
}
] ,
" metadata " : {
" kernelspec " : {
" display_name " : " Python 2 " ,
" language " : " python " ,
" name " : " python2 "
} ,
" language_info " : {
" codemirror_mode " : {
" name " : " ipython " ,
" version " : 2
} ,
" file_extension " : " .py " ,
" mimetype " : " text/x-python " ,
" name " : " python " ,
" nbconvert_exporter " : " python " ,
" pygments_lexer " : " ipython2 " ,
2015-06-19 09:07:36 +08:00
" version " : " 2.7.10 "
2015-01-27 02:56:49 +08:00
}
2015-05-14 18:55:28 +08:00
} ,
" nbformat " : 4 ,
" nbformat_minor " : 0
}