Chapter 2: An Array of Sequences
Exploring sequences in Python.
- Overview of Built-in Sequences
- List Comprehensions and Generator Expressions
- Tuples are not just Immutable Lists
- Slicing
- Sort and Sorted
- Managing Ordered Sequences with bisect
- When a List Is Not the Answer
Overview of Built-in Sequences
- Container sequences hold references to other objects, so they can store items of different types. E.g. list, tuple, collections.deque.
- Flat sequences physically store their values in their own memory space and hold items of one type. E.g. str, bytes, memoryview, array.array.
Sequences can also be categorized as mutable and immutable.
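A quick sketch of the distinction (the variable names here are illustrative):

```python
from array import array

# a container sequence holds references to objects of any type
container = [1, 'two', (3, 4)]

# a flat sequence stores packed values of one type in its own memory;
# typecode 'i' means C signed int
flat = array('i', [1, 2, 3])

incompatible = False
try:
    flat.append('two')  # flat sequences reject items of other types
except TypeError:
    incompatible = True
```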
List Comprehensions and Generator Expressions
Listcomps are more readable because they serve only one purpose: building new lists. That specificity also makes them quite powerful.
[i for i in range(10)]
Listcomps can also do everything the map and filter functions do, without resorting to lambda. They also seem to be faster in simple use cases.
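As a small sketch, both versions below compute the same thing; the listcomp avoids the lambda:

```python
symbols = '$¢£¥€¤'

# listcomp: filter and transform in one readable expression
beyond_ascii = [ord(s) for s in symbols if ord(s) > 127]

# equivalent map/filter version, which needs a lambda
beyond_ascii_mf = list(filter(lambda c: c > 127, map(ord, symbols)))
```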
colors = ['black', 'white']
sizes = ['S', 'M', 'L', 'XL']
# note: the order of the for clauses determines the order of the product
tshirts = [(color, size) for color in colors
for size in sizes]
tshirts
Generator Expressions
Listcomps build lists; to fill up other sequence types, a genexp is the way to go. You could build tuples, arrays, and other sequences from listcomps, but a genexp saves memory because it yields items one at a time instead of building a whole list just to pass it to another constructor.
symbols = '$¢£¥€¤'
tuple(ord(symbol) for symbol in symbols)
genexp = (i for i in range(10))
listcomp = [i for i in range(10)]
genexp, listcomp
%%timeit
for i in listcomp:
    pass
%%timeit
# this looks faster than the listcomp, but it is misleading: a genexp can
# be consumed only once, so after the first run it is already exhausted
for i in genexp:
    pass
import array
array.array('I', (ord(symbol) for symbol in symbols))
The main advantage of a genexp is that the full sequence is never built in memory, which is handy when you just want to consume the items and not store them. This makes genexps very scalable. For example, taking the Cartesian product we did with listcomps earlier and doing it for a million colors and sizes just to print the combinations would use a huge amount of memory with a listcomp, but with a genexp only one item needs to exist at each step of the loop.
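A minimal sketch of that point, reusing the t-shirt example: the genexp feeds the loop one tuple at a time, so the full product is never materialized as a list.

```python
colors = ['black', 'white']
sizes = ['S', 'M', 'L', 'XL']

# the genexp yields one combination per iteration; with a million
# colors and sizes, memory use would stay flat
count = 0
for tshirt in ('%s %s' % (c, s) for c in colors for s in sizes):
    count += 1
```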
Tuples are not just Immutable Lists
Tuples do double duty: they work as immutable lists and as records with no field names.
Tuples as Records
Tuples work as records: each item holds the data for one field, and the position of the item gives it meaning. They also work well as records because we can easily unpack them in order, using * to grab excess items.
traveler_ids = [('USA', '31195855'), ('BRA', 'CE342567'),
('ESP', 'XDA205856')]
for country, passport in sorted(traveler_ids):
    print('%s/%s' % (country, passport))
one, two, *rest = range(1, 10)
print(one, two)
print(rest)
Tuple unpacking works with any iterable object; you only have to ensure that each item maps to exactly one variable (except when a variable takes the * prefix, as in the example above).
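A couple of small illustrations of that rule (the values here are arbitrary):

```python
# unpacking works with any iterable, as long as each item maps
# to exactly one variable
lax_coordinates = (33.9425, -118.408056)
latitude, longitude = lax_coordinates

# parallel assignment also gives an elegant swap
a, b = 1, 2
b, a = a, b
```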
Nested tuples are also handled just as you would expect.
metro_areas = [
('Tokyo', 'JP', 36.933, (35.689722, 139.691667)),
('Delhi NCR', 'IN', 21.935, (28.613889, 77.208889)),
('Mexico City', 'MX', 20.142, (19.433333, -99.133333)),
('New York-Newark', 'US', 20.104, (40.808611, -74.020386)),
('Sao Paulo', 'BR', 19.649, (-23.547778, -46.635833)),
]
print('{:15} | {:^9} | {:^9}'.format('', 'lat.' , 'long.'))
fmt = '{:15} | {:9.4f} | {:9.4f}'
for name, cc, population, (lat, long) in metro_areas:
    if long <= 0:
        print(fmt.format(name, lat, long))
Named Tuples
The namedtuple function is a factory that produces a subclass of tuple with field names and a class name attached. Instances use the same amount of memory as a plain tuple, and less than a regular object, because they have no per-instance __dict__. That makes them ideal when you want an object to carry all the data associated with it but needs no methods of its own.
namedtuple(<class name>, <field names: iterable of strings, or one space-delimited string>)
from collections import namedtuple
City = namedtuple('City', 'city state country')
Kochi = City('Kochi', 'Kerala', 'India')
Kochi
from collections import namedtuple
# takes the name of the class and either an iterable of field names or a
# single string with the field names separated by spaces
City = namedtuple('City', 'name country population coordinates')
tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))
tokyo
tokyo.population
A named tuple has a few attributes and methods in addition to those inherited from tuple; the most useful are _fields, _make(iterable), and _asdict().
print(City._fields) # gets the field names
print(tokyo._asdict())
LatLong = namedtuple('LatLong', 'lat long')
delhi_data = ('Delhi NCR', 'IN', 21.935, LatLong(28.613889, 77.208889))
print(City._make(delhi_data)) # allows you to init a named tuple from a
# iterable; City(*delhi_data) would do the same
Slicing
strs = 'bicycle'
strs[::-1], strs[::3], strs[::-2]
Internally, seq[a:b:c] creates a slice object, slice(a, b, c), which is passed to seq.__getitem__(slice(start, stop, step)). This is handy because you can make slicing work on your own objects too.
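Here is a minimal sketch of that idea; MySeq is a hypothetical class, not part of any library:

```python
class MySeq:
    """Toy sequence showing how slicing reaches __getitem__."""
    def __init__(self, data):
        self._data = list(data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # index arrives here as a slice(start, stop, step) object
            return MySeq(self._data[index])
        return self._data[index]

s = MySeq(range(10))
s[3]       # plain integer index
s[2:8:2]   # __getitem__ receives slice(2, 8, 2)
```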
You can also assign to slices.
reverse = slice(None, None, -1)
strs[reverse]
Multidimensional Slicing and Ellipsis
The [] operator can also take multiple indexes or slices separated by commas. This is used, for instance, in the external NumPy package, where items of a two-dimensional numpy.ndarray can be fetched with the syntax a[i, j], and a 2-d slice with a[m:n, k:l].
The ellipsis (three full stops, ...) is recognized as a token by the Python parser. In NumPy it is used as a shortcut when slicing arrays of many dimensions; for example, if x is a 4-d array, x[i, ...] is a shortcut for x[i, :, :, :].
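Plain Python can show what the parser does with ... and with comma-separated indexes; Probe is a hypothetical class that just echoes whatever __getitem__ receives:

```python
class Probe:
    def __getitem__(self, index):
        return index

p = Probe()
p[1, ...]   # the parser packs this into the tuple (1, Ellipsis)

# the literal ... evaluates to the built-in Ellipsis singleton
... is Ellipsis
```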
Assigning to Slices
Mutable sequences can be grafted onto or modified in place using slice notation; you can also del slices.
l = list(range(10))
l
l[2:5] = [20, 30]
l
del l[5:7]
l
l[2:5] = 100  # raises TypeError: can only assign an iterable
l[2:5] = [100]
l
'abc' * 3
'abc' + 'def'
board = [['_'] * 3 for i in range(3)]
board
board[2][2] = 'X'
board
weird_board = [['_'] * 3] * 3
weird_board[2][2] = '2'
weird_board
Here the outer list is made up of three references to the same inner list. The code above acts like this:
row = ['_'] * 3
board = []
for i in range(3):
    board.append(row)
The same row gets appended three times. Using a list comprehension, on the other hand, is equivalent to this code, which builds a fresh row on each iteration:
board = []
for i in range(3):
    row = ['_'] * 3
    board.append(row)
Augmented Assignment with Sequences
Operators like *= and += behave very differently depending on the first operand. The special method behind += is __iadd__ (in-place addition). If it is not implemented, Python falls back to __add__. If __iadd__ is present and the sequence is mutable, the operation happens in place; otherwise Python calls __add__ and rebinds the name to the resulting new object.
l = [1, 2, 3]
print(l, id(l))
l *= 2
print(l, id(l))
t = (1, 2, 3)
print(t, id(t))
t *= 2
print(t, id(t))
Sort and Sorted
list.sort()
sorts the list in place and returns None to signal that. In fact, returning None from functions that change an object in place is a common convention in the Python standard library.
The built-in function sorted(), on the other hand, builds and returns a new list.
Both take two optional keyword arguments:
- reverse: a flag telling the sort to run in descending order
- key: a one-argument function applied to each element to produce its sort key
fruits = ['grape', 'raspberry', 'apple', 'banana']
sorted(fruits)
fruits
fruits.sort()
fruits
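The key argument deserves a quick sketch of its own: key=len sorts by word length, and because Python's sort is stable, 'grape' stays before 'apple' (both length 5).

```python
fruits = ['grape', 'raspberry', 'apple', 'banana']
by_len = sorted(fruits, key=len)
# → ['grape', 'apple', 'banana', 'raspberry']
```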
Managing Ordered Sequences with bisect
bisect finds the insertion point for an item in a sorted list and returns the index; we can then use list.insert(index, item) to place the item at that position. The bisect module is, at heart, an efficient implementation of binary search.
The bisect module offers two main functions: bisect and insort.
import bisect
import sys
HAYSTACK = [1, 4, 5, 6, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30]
NEEDLES = [0, 1, 2, 5, 8, 10, 22, 23, 29, 30, 31]
ROW_FMT = '{0:2d} @ {1:2d} {2}{0:<2d}'
def demo(bisect_fn):
    for needle in reversed(NEEDLES):
        position = bisect_fn(HAYSTACK, needle)
        offset = position * ' |'
        print(ROW_FMT.format(needle, position, offset))

def main(add_to='right'):
    if add_to == 'left':
        bisect_fn = bisect.bisect_left
    else:
        bisect_fn = bisect.bisect
    print('DEMO:', bisect_fn.__name__)
    print('haystack ->', ' '.join('%2d' % n for n in HAYSTACK))
    demo(bisect_fn)
main()
main(add_to='left')
bisect is really an alias for bisect_right, which returns an insertion point to the right of any existing entries equal to the needle. Its sister function bisect_left returns an insertion point to the left of matching items.
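The difference only matters when the needle already occurs in the list; a small sketch:

```python
import bisect

haystack = [1, 2, 2, 2, 3]
right = bisect.bisect(haystack, 2)      # alias for bisect_right: index 4
left = bisect.bisect_left(haystack, 2)  # index 1, before the first 2
```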
bisect has another function, bisect.insort(seq, item), which inserts item into seq while keeping seq sorted; it is more convenient (and slightly faster) than calling bisect and then insert() yourself.
%%time
haystack = HAYSTACK[:]
idx = bisect.bisect(haystack, 19)
haystack.insert(idx, 19)
print(haystack)
%%time
haystack = HAYSTACK[:]
bisect.insort(haystack, 19)
print(haystack)
import bisect
import random
SIZE = 7
my_list = []
for i in range(SIZE):
    new_item = random.randrange(SIZE*2)
    bisect.insort(my_list, new_item)
    print('%2d ->' % new_item, my_list)
When a List Is Not the Answer
When starting out we tend to overuse lists for everything, but there are times when other sequence types make more sense. Let's explore some of them.
- If you need a large list of floating-point numbers, an array is better, since it stores the packed numeric values rather than full float objects, much like an array in C.
- If you need fast FIFO or LIFO operations, consider a deque.
- If you do frequent membership checks, a set is much more efficient (note that sets are not sequences, since they are unordered).
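A small sketch of the membership point above (the sizes are arbitrary):

```python
# a set answers `in` with a hash lookup (O(1) on average),
# while a list must scan its items (O(n))
needles = set(range(0, 1000, 7))
hits = [n for n in range(100) if n in needles]
```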
from array import array
from random import random
# create an array of double-precision floats (typecode 'd')
floats = array('d', (random() for i in range(10**7)))
floats[-1]
fp = open('floats.bin', 'wb')
floats.tofile(fp) # size = 77M
fp.close()
float2 = array('d')
fp = open('floats.bin', 'rb')
float2.fromfile(fp, 10**7)
fp.close()
float2[-1]
float2 == floats
Here we are creating a new array of signed ints (typecode 'i') from the old HAYSTACK
HAYSTACK
a = array('i', reversed(HAYSTACK))
a
The array type has no in-place sort method like list.sort(); you can easily build a new sorted array instead, though.
a = array(a.typecode, sorted(a))
a
bisect.bisect(a, 8), bisect.insort(a, 8), a
Memory Views
"A memoryview is essentially a generalized NumPy array structure in Python itself (without the math). It allows you to share memory between data structures (things like PIL images, SQLite databases, NumPy arrays, etc.) without first copying. This is very important for large data sets." (Travis Oliphant, lead author of NumPy)
A memoryview has a cast method that returns yet another memoryview object sharing the same memory.
import array
numbers = array.array('h', [-2, -1, 0, 1, 2])
memv = memoryview(numbers)
len(memv)
# apparently there is some issue with python3 hence not used
# ref - https://stackoverflow.com/questions/4877866/why-is-it-not-possible-to-get-a-py-buffer-from-an-array-object
print(memv[0])
memv_oct = memv.cast('B')
memv_oct.tolist(), memv_oct[5], numbers
NumPy and SciPy
If you're doing advanced array and matrix operations, numpy and scipy are your go-tos. These libraries are a big reason Python is so popular in the scientific community: they are fast and reliable because they leverage the widely used C and Fortran code bases.
Deque and Other Queues
Lists are great, but appending or popping at the left end is costly, because the whole list must be shifted. A deque, on the other hand, is optimized for operating on both ends (though it is poor for operations in the middle). Queues appear in a wide range of algorithms, hence Python ships an optimized implementation.
collections.deque is a thread-safe double-ended queue designed for fast insertion and deletion from both ends. A deque can also be bounded to a fixed length; when a bounded deque is full, adding new items discards items from the opposite end (oldest first).
from collections import deque
dq = deque(range(10), maxlen=10)
dq
dq.rotate(3)
dq
dq.append(2)
dq
dq.appendleft(-1)
dq
dq.extend([11, 12, 13])
dq
dq.maxlen
dq.extendleft([10, 20, 30, 40])
dq
The append() and popleft() operations are atomic, so deque is safe to use as a FIFO queue in multithreaded applications without the need for locks.
Other implementations of Queues
queue:
Provides the synchronized (i.e., thread-safe) classes Queue, LifoQueue, and PriorityQueue, used for safe communication between threads. All three can be bounded by passing a maxsize argument greater than 0 to the constructor. However, they don't discard items to make room as a bounded deque does; instead, when the queue is full, the insertion of a new item blocks, i.e. it waits until some other thread makes room by taking an item from the queue, which is useful to throttle the number of live threads.
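A minimal sketch of that blocking behavior, using the non-blocking variant so it raises instead of waiting:

```python
import queue

q = queue.Queue(maxsize=2)
q.put('a')
q.put('b')

overflowed = False
try:
    # a full Queue does not discard items like a bounded deque;
    # put_nowait raises queue.Full (a plain put() would block)
    q.put_nowait('c')
except queue.Full:
    overflowed = True
```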
multiprocessing:
Implements its own bounded Queue, very similar to queue.Queue but designed for interprocess communication. A specialized multiprocessing.JoinableQueue is also available for easier task management.
asyncio:
New in Python 3.4, asyncio provides Queue, LifoQueue, PriorityQueue, and JoinableQueue with APIs inspired by the classes in the queue and multiprocessing modules, but adapted for managing tasks in asynchronous programming.
heapq:
In contrast to the previous three modules, heapq does not implement a queue class, but provides functions like heappush and heappop that let you use a mutable sequence as a heap queue or priority queue.
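A minimal sketch of using a plain list as a heap with those functions:

```python
import heapq

heap = []
for n in [5, 1, 4, 2]:
    heapq.heappush(heap, n)    # maintains the heap invariant on the list

smallest = heapq.heappop(heap)  # always pops the smallest item first
```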