data-science-ipython-notebooks/core/structs_utils.ipynb

998 lines
20 KiB
Plaintext

{
"metadata": {
"name": "",
"signature": "sha256:fed6792b1275b1a5b92e846b93e797942c4eac23ccf1c44b750b469e3f33ffbe"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Structure Utilities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* slice\n",
"* range and xrange\n",
"* bisect\n",
"* sort\n",
"* sorted\n",
"* reversed\n",
"* enumerate\n",
"* zip\n",
"* list comprehensions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## slice"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Slice selects a section of list types (arrays, tuples, NumPy arrays) with its arguments [start:end]: start is included, end is not. The number of elements in the result is stop - end."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![alt text](http://www.nltk.org/images/string-slicing.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Slice 4 elements starting at index 6 and ending at index 9:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq = 'Monty Python'\n",
"seq[6:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
"'Pyth'"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Omit start to default to start of the sequence:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq[:5]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"'Monty'"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Omit end to default to end of the sequence:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq[6:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"'Python'"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Negative indices slice relative to the end:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq[-12:-7]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"'Monty'"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Slice can also take a step [start:end:step].\n",
"\n",
"Get every other element:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq[::2]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"'MnyPto'"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passing -1 for the step reverses the list or tuple:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq[::-1]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"'nohtyP ytnoM'"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can assign elements to a slice (note the slice range does not have to equal number of elements to assign):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq = [1, 1, 2, 3, 5, 8, 13]\n",
"seq[5:] = ['H', 'a', 'l', 'l']\n",
"seq"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"[1, 1, 2, 3, 5, 'H', 'a', 'l', 'l']"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare the output of assigning into a slice (above) versus the output of assigning into an index (below):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq = [1, 1, 2, 3, 5, 8, 13]\n",
"seq[5] = ['H', 'a', 'l', 'l']\n",
"seq"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"[1, 1, 2, 3, 5, ['H', 'a', 'l', 'l'], 13]"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## range and xrange"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a list of evenly spaced integers with range or xrange. Note: range in Python 3 returns a generator and xrange is not available.\n",
"\n",
"Generate 10 integers:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"range(10)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Range can take start, stop, and step arguments:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"range(0, 20, 3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"[0, 3, 6, 9, 12, 15, 18]"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is very common to iterate through sequences by index with range:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq = [1, 2, 3]\n",
"for i in range(len(seq)):\n",
" val = seq[i]\n",
" print(val)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1\n",
"2\n",
"3\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For longer ranges, xrange is recommended and is available in Python 3 as range. It returns an iterator that generates integers one by one rather than all at once and storing them in a large list."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sum = 0\n",
"for i in xrange(100000):\n",
" if i % 2 == 0:\n",
" sum += 1\n",
"print(sum)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"50000\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## bisect"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bisect module does not check whether the list is sorted, as this check would be expensive O(n). Using bisect on an unsorted list will not result in an error but could lead to incorrect results."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import bisect"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the location where an element should be inserted to keep the list sorted:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq = [1, 2, 2, 3, 5, 13]\n",
"bisect.bisect(seq, 8)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"5"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Insert an element into a location to keep the list sorted:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"bisect.insort(seq, 8)\n",
"seq"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
"[1, 2, 2, 3, 5, 8, 13]"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## sort"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort in-place O(n log n)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq = [1, 5, 3, 9, 7, 6]\n",
"seq.sort()\n",
"seq"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": [
"[1, 3, 5, 6, 7, 9]"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort by the secondary key of str length:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq = ['the', 'quick', 'brown', 'fox', 'jumps', 'over']\n",
"seq.sort(key=len)\n",
"seq"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"['the', 'fox', 'over', 'quick', 'brown', 'jumps']"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## sorted"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return a new sorted list from the elements of a sequence O(n log n):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sorted([2, 5, 1, 8, 7, 9])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"[1, 2, 5, 7, 8, 9]"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sorted('foo bar baz')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"[' ', ' ', 'a', 'a', 'b', 'b', 'f', 'o', 'o', 'r', 'z']"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's common to get a sorted list of unique elements by combining sorted and set:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq = [2, 5, 1, 8, 7, 9, 9, 2, 5, 1, (4, 2), (1, 2), (1, 2)]\n",
"sorted(set(seq))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"[1, 2, 5, 7, 8, 9, (1, 2), (4, 2)]"
]
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## reversed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Iterate over the sequence elements in reverse order:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"list(reversed(seq))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
"[(1, 2), (1, 2), (4, 2), 1, 5, 2, 9, 9, 7, 8, 1, 5, 2]"
]
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## enumerate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the index of a collection and the value:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"strings = ['foo', 'bar', 'baz']\n",
"for i, string in enumerate(strings):\n",
" print(i, string)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(0, 'foo')\n",
"(1, 'bar')\n",
"(2, 'baz')\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## zip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pair up the elements of sequences to create a list of tuples:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq_1 = [1, 2, 3]\n",
"seq_2 = ['foo', 'bar', 'baz']\n",
"zip(seq_1, seq_2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 23,
"text": [
"[(1, 'foo'), (2, 'bar'), (3, 'baz')]"
]
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zip takes an arbitrary number of sequences. The number of elements it produces is determined by the shortest sequence:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"seq_3 = [True, False]\n",
"zip(seq_1, seq_2, seq_3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
"[(1, 'foo', True), (2, 'bar', False)]"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is common to use zip for simultaneously iterating over multiple sequences combined with enumerate:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for i, (a, b) in enumerate(zip(seq_1, seq_2)):\n",
" print('%d: %s, %s' % (i, a, b))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0: 1, foo\n",
"1: 2, bar\n",
"2: 3, baz\n"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zip can unzip a zipped sequence, which you can think of as converting a list of rows into a list of columns:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"numbers = [(1, 'one'), (2, 'two'), (3, 'three')]\n",
"a, b = zip(*numbers)\n",
"a"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
"(1, 2, 3)"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"b"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 27,
"text": [
"('one', 'two', 'three')"
]
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## List Comprehensions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"List comprehensions concisely form a new list by filtering the elements of a sequence and transforming the elements passing the filter. List comprehensions take the form:\n",
"```python\n",
"[expr for val in collection if condition]\n",
"```\n",
"Which is equivalent to the following for loop:\n",
"```python\n",
"result = []\n",
"for val in collection:\n",
" if condition:\n",
" result.append(expr)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert to upper case all strings that start with a 'b':"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"strings = ['foo', 'bar', 'baz', 'f', 'fo', 'b', 'ba']\n",
"[x.upper() for x in strings if x[0] == 'b']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
"['BAR', 'BAZ', 'B', 'BA']"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"List comprehensions can be nested:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"list_of_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]\n",
"[x for tup in list_of_tuples for x in tup]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 29,
"text": [
"[1, 2, 3, 4, 5, 6, 7, 8, 9]"
]
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dict Comprehension"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dict comprehension is similar to a list comprehension but returns a dict.\n",
"\n",
"Create a mapping of strings and their locations in the list for strings that start with a 'b':"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{index : val for index, val in enumerate(strings) if val[0] == 'b'}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 30,
"text": [
"{1: 'bar', 2: 'baz', 5: 'b', 6: 'ba'}"
]
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set Comprehension"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A set comprehension is similar to a list comprehension but returns a set.\n",
"\n",
"Get the unique lengths of strings that start with a 'b':"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{len(x) for x in strings if x[0] == 'b'}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
"{1, 2, 3}"
]
}
],
"prompt_number": 31
}
],
"metadata": {}
}
]
}