Overview of Built-in Sequences

  1. Container Sequences: They hold references to other objects, so they can store items of different types. E.g. list, tuple, collections.deque.

  2. Flat Sequences: They physically store their values in their own memory space and hold items of one simple type. E.g. str, bytes, memoryview, array.array.

Sequences can also be categorized as mutable and immutable.
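A quick sketch of the container/flat distinction (the variable names are just for illustration):

from array import array

mixed = [1, 'one', (1.0,)]            # a list holds references to objects of any type
packed = array('d', [1.0, 2.0, 3.0])  # an array stores the packed double values themselves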

List Comprehensions and Generator Expressions

Listcomps are more readable because they are specific: they do only one thing, build new lists. They are also really powerful.

[i for i in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Listcomps can also do everything the map and filter functions do, without resorting to lambda. They even seem to be faster in simple cases.
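For example, a listcomp can replace a map/filter combination (the variable names here are just for illustration):

symbols = '$¢£¥€¤'
beyond_ascii_mf = list(filter(lambda c: c > 127, map(ord, symbols)))  # map + filter + lambda
beyond_ascii_lc = [ord(s) for s in symbols if ord(s) > 127]           # same result as a listcomp
beyond_ascii_mf == beyond_ascii_lc   # True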

colors = ['black', 'white']
sizes = ['S', 'M', 'L', 'XL']

# note: the order of the for clauses determines the order of the resulting product
tshirts = [(color, size) for color in colors
                         for size in sizes]
tshirts
[('black', 'S'),
 ('black', 'M'),
 ('black', 'L'),
 ('black', 'XL'),
 ('white', 'S'),
 ('white', 'M'),
 ('white', 'L'),
 ('white', 'XL')]

Generator Expressions

Listcomps build lists; to fill up other sequence types, a genexp is the way to go. You could build tuples, arrays, and other sequences from a listcomp, but a genexp saves memory because it yields one element at a time instead of building the whole list in memory just to pass it to another constructor.

symbols = '$¢£¥€¤'
tuple(ord(symbol) for symbol in symbols)
(36, 162, 163, 165, 8364, 164)
genexp = (i for i in range(10))
listcomp = [i for i in range(10)]
genexp, listcomp
(<generator object <genexpr> at 0x7f382dfc30b0>,
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
%%timeit

# note: the genexp timing below looks faster, but the comparison is misleading: the
# generator is exhausted after its first pass, so later timeit loops iterate over nothing
for i in listcomp:
    #print(i)
    pass
224 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit

for i in genexp:
    #print(i)
    pass
41.5 ns ± 0.399 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
import array 
array.array('I', (ord(symbol) for symbol in symbols))
array('I', [36, 162, 163, 165, 8364, 164])

The main advantage of a genexp is that the sequence is never built in memory. So in cases where you just want to consume the resulting items and not keep them around, genexps come in handy, and this makes them really scalable. For example, taking the Cartesian product we built with a listcomp earlier: doing that for a million colors and sizes just to print them out would use a huge amount of memory, but with a genexp only one item is created on each pass of the loop.
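As a sketch, the t-shirt Cartesian product from above can be driven by a genexp so the full list is never built:

colors = ['black', 'white']
sizes = ['S', 'M', 'L', 'XL']

# the genexp yields one string at a time; the complete product never exists in memory
for tshirt in ('%s %s' % (c, s) for c in colors for s in sizes):
    print(tshirt)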

Tuples are not just Immutable Lists

Tuples do double duty: they work as immutable lists and also as records with no field names.

Tuples as Records

Tuples can hold records: each item in the tuple holds the data for one field, and the position of the item gives it meaning. They work well as records because we can easily unpack them in order, using * to grab excess items.

traveler_ids = [('USA', '31195855'), ('BRA', 'CE342567'), 
                ('ESP', 'XDA205856')]


for country, passport in sorted(traveler_ids):
    print('%s/%s' % (country, passport))
BRA/CE342567
ESP/XDA205856
USA/31195855
one, two, *rest = range(1, 10)
print(one, two)
print(rest)
1 2
[3, 4, 5, 6, 7, 8, 9]

Tuple unpacking works with any iterable object; you only have to ensure that each item maps to exactly one variable (except when a variable is prefixed with the * operator, as in the example above).
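A couple of quick illustrations of unpacking:

quotient, remainder = divmod(20, 8)   # parallel assignment from the tuple returned by divmod
a, b = 1, 2
b, a = a, b                           # the classic swap, no temporary variable needed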

Nested tuples are also handled just as you would expect.

metro_areas = [
    ('Tokyo', 'JP', 36.933, (35.689722, 139.691667)), #
    ('Delhi NCR', 'IN', 21.935, (28.613889, 77.208889)),
    ('Mexico City', 'MX', 20.142, (19.433333, -99.133333)),
    ('New York-Newark', 'US', 20.104, (40.808611, -74.020386)),
    ('Sao Paulo', 'BR', 19.649, (-23.547778, -46.635833)),
]

print('{:15} | {:^9} | {:^9}'.format('', 'lat.' , 'long.'))
fmt = '{:15} | {:9.4f} | {:9.4f}'

for name, cc, population, (lat, long) in metro_areas:
    if long <= 0:
        print(fmt.format(name, lat, long))
                |   lat.    |   long.  
Mexico City     |   19.4333 |  -99.1333
New York-Newark |   40.8086 |  -74.0204
Sao Paulo       |  -23.5478 |  -46.6358

Named Tuples

The namedtuple function is a factory that produces a subclass of tuple with field names and a class name attached. Instances use the same amount of memory as a plain tuple, and less than a regular object, because they don't carry a per-instance __dict__. This makes named tuples ideal when you want an object that just bundles the data associated with it but has no behaviour of its own.

namedtuple(<class name>, <field names: an iterable of strings, or a single space-delimited string>)

from collections import namedtuple

City = namedtuple('City', 'city state country')
Kochi = City('Kochi', 'Kerala', 'India')
Kochi
City(city='Kochi', state='Kerala', country='India')
from collections import namedtuple

# takes the name of the class and either an iterable of field names or a single string in
# which the field names are separated by spaces
City = namedtuple('City', 'name county population coordinates')
tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))
tokyo
City(name='Tokyo', county='JP', population=36.933, coordinates=(35.689722, 139.691667))
tokyo.population
36.933

Named tuples have a few attributes in addition to those inherited from tuple. The most useful ones are _fields, _make(iterable), and _asdict().

print(City._fields) # gets the field names
print(tokyo._asdict())
LatLong = namedtuple('LatLong', 'lat long')
delhi_data = ('Delhi NCR', 'IN', 21.935, LatLong(28.613889, 77.208889))
print(City._make(delhi_data)) # lets you build a named tuple from an
                              # iterable; City(*delhi_data) would do the same
('name', 'county', 'population', 'coordinates')
{'name': 'Tokyo', 'county': 'JP', 'population': 36.933, 'coordinates': (35.689722, 139.691667)}
City(name='Delhi NCR', county='IN', population=21.935, coordinates=LatLong(lat=28.613889, long=77.208889))

Slicing

Slicing is pretty straightforward but nonetheless very powerful. Let's look at its more advanced use cases here.

Why do slices and ranges exclude the last item?

This convention plays well with the zero-based indexing used in Python and C: the length of a slice is simply stop - start, and you can split a sequence at any index without overlap.
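Two small examples of why this is convenient:

l = [10, 20, 30, 40, 50, 60]
len(l[:3])     # 3 -> when start is 0, the slice length is just the stop value
l[:2], l[2:]   # ([10, 20], [30, 40, 50, 60]) -> split at an index with no overlap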

Slice Objects

How slicing works:

strs = 'bicycle'
strs[::-1], strs[::3], strs[::-2]
('elcycib', 'bye', 'eccb')

Internally, seq[a:b:c] creates a slice object, slice(a, b, c), which is passed to seq.__getitem__(slice(start, stop, step)). This is handy because you can make slicing work on your own objects, as sketched below.

You can also assign a slice object to a name and reuse it.

reverse = slice(None, None, -1)
strs[reverse]
'elcycib'
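A minimal sketch of handling slices in a user-defined class (the Sentence class here is made up for illustration):

class Sentence:
    def __init__(self, text):
        self.words = text.split()

    def __getitem__(self, index):
        # index is an int for s[i] and a slice object for s[a:b:c]
        return self.words[index]

s = Sentence('the quick brown fox')
s[0], s[1:3]   # ('the', ['quick', 'brown'])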

Multidimensional Slicing and Ellipsis

The [] operator can also take multiple indexes or slices separated by commas. This is used, for instance, in the external NumPy package, where items of a two-dimensional numpy.ndarray can be fetched with a[i, j], and a 2-D slice with a[m:n, k:l].

The ellipsis (three full stops, ...) is recognized as a token by the Python parser. In NumPy it is used as a shortcut when slicing arrays of many dimensions; for example, if x is a 4-D array, x[i, ...] is a shortcut for x[i, :, :, :].
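A tiny sketch of that, assuming NumPy is installed (it is a third-party package):

import numpy as np

x = np.arange(16).reshape(2, 2, 2, 2)     # a 4-D array
(x[0, ...] == x[0, :, :, :]).all()        # True: the ellipsis fills in the missing slices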

Assigning to Slices

Mutable sequences can be grafted onto or modified in place using slice notation on the left of an assignment. You can also del slices.

l = list(range(10))
l
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
l[2:5] = [20, 30]
l
[0, 1, 20, 30, 5, 6, 7, 8, 9]
del l[5:7]
l
[0, 1, 20, 30, 5, 8, 9]
l[2:5] = 100

TypeError                                 Traceback (most recent call last)
<ipython-input-10-da8b10461280> in <module>
----> 1 l[2:5] = 100

TypeError: can only assign an iterable
l[2:5] = [100]
l
[0, 1, 100, 8, 9]

Using + and * with Sequences

Concatenation is a common operation with any sequence, but there are subtle details in how it works. With +, both operands must be sequences of the same type; with *, a sequence is repeated by an integer count. In either case neither operand is modified: a new sequence is created.

'abc' * 3
'abcabcabc'
'abc' + 'def'
'abcdef'

Building Lists of Lists

Now this can bite you when you don't expect it. Sometimes you want to initialize a list with a certain number of nested lists; the best way to do this is with a list comprehension.

board = [['_'] * 3 for i in range(3)]
board
[['_', '_', '_'], ['_', '_', '_'], ['_', '_', '_']]
board[2][2] = 'X'
board
[['_', '_', '_'], ['_', '_', '_'], ['_', '_', 'X']]
weird_board = [['_']*3 ]*3
weird_board[2][2] = '2'
weird_board
[['_', '_', '2'], ['_', '_', '2'], ['_', '_', '2']]

Here the outer list is made up of three references to the same inner list. The issue is that the code above behaves like this:

row = ['_'] * 3
board = []
for i in range(3):
    board.append(row)

The same row gets appended three times. Using a list comprehension, on the other hand, is equivalent to this code, where a fresh row is built on each iteration:

board = []
for i in range(3):
    row = ['_'] * 3
    board.append(row)

Augmented Assignment with Sequences

The behaviour of operators like += and *= depends heavily on the first operand. The special method behind += is __iadd__ (in-place addition). If it is not implemented, Python falls back to __add__. If __iadd__ is present and the sequence is mutable, the operation happens in place; otherwise Python calls __add__ and rebinds the name to the resulting new object.

l = [1, 2, 3]
print(l, id(l))
l *= 2
print(l, id(l))

t = (1, 2, 3)
print(t, id(t))
t *= 2
print(t, id(t))
[1, 2, 3] 140649174697408
[1, 2, 3, 1, 2, 3] 140649174697408
(1, 2, 3) 140649174633216
(1, 2, 3, 1, 2, 3) 140649217792800

Sort and Sorted

list.sort() sorts the list in place and returns None to signal that. In fact, it is a common convention in the Python standard library that functions or methods which change an object in place return None.

The built-in function sorted(), on the other hand, creates a new list and returns it.

Both of them take two optional keyword arguments:

  1. reverse - a flag that asks for the elements in descending order
  2. key - a one-argument function that computes the sort key for each element (see the key/reverse sketch after the basic demo below)
fruits = ['grape', 'raspberry', 'apple', 'banana']
sorted(fruits)
['apple', 'banana', 'grape', 'raspberry']
fruits
['grape', 'raspberry', 'apple', 'banana']
fruits.sort()
fruits
['apple', 'banana', 'grape', 'raspberry']
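The key and reverse arguments in action, a quick sketch:

fruits = ['grape', 'raspberry', 'apple', 'banana']
sorted(fruits, key=len)                # ['grape', 'apple', 'banana', 'raspberry'] (stable for equal lengths)
sorted(fruits, key=len, reverse=True)  # ['raspberry', 'banana', 'grape', 'apple']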

Managing Ordered Sequences with bisect

bisect uses binary search to find the insertion point for an item in a sorted list and returns that index; we can then use list.insert(index, item) to add the item at the corresponding position.

The bisect module offers two main functions: bisect and insort.

import bisect
import sys

HAYSTACK = [1, 4, 5, 6, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30]
NEEDLES = [0, 1, 2, 5, 8, 10, 22, 23, 29, 30, 31]

ROW_FMT = '{0:2d} @ {1:2d}    {2}{0:<2d}'

def demo(bisect_fn):
    for needle in reversed(NEEDLES):
        position = bisect_fn(HAYSTACK, needle)
        offset = position * '  |'
        print(ROW_FMT.format(needle, position, offset))
        
def main(add_to='right'):
    if add_to == 'left':
        bisect_fn = bisect.bisect_left
    else:
        bisect_fn = bisect.bisect
    
    print('DEMO: ', bisect_fn.__name__)
    print('haystack ->', ' '.join('%2d'%n for n in HAYSTACK))
    demo(bisect_fn)
main()
DEMO:  bisect_right
haystack ->  1  4  5  6  8 12 15 20 21 23 23 26 29 30
31 @ 14      |  |  |  |  |  |  |  |  |  |  |  |  |  |31
30 @ 14      |  |  |  |  |  |  |  |  |  |  |  |  |  |30
29 @ 13      |  |  |  |  |  |  |  |  |  |  |  |  |29
23 @ 11      |  |  |  |  |  |  |  |  |  |  |23
22 @  9      |  |  |  |  |  |  |  |  |22
10 @  5      |  |  |  |  |10
 8 @  5      |  |  |  |  |8 
 5 @  3      |  |  |5 
 2 @  1      |2 
 1 @  1      |1 
 0 @  0    0 
main(add_to='left')
DEMO:  bisect_left
haystack ->  1  4  5  6  8 12 15 20 21 23 23 26 29 30
31 @ 14      |  |  |  |  |  |  |  |  |  |  |  |  |  |31
30 @ 13      |  |  |  |  |  |  |  |  |  |  |  |  |30
29 @ 12      |  |  |  |  |  |  |  |  |  |  |  |29
23 @  9      |  |  |  |  |  |  |  |  |23
22 @  9      |  |  |  |  |  |  |  |  |22
10 @  5      |  |  |  |  |10
 8 @  4      |  |  |  |8 
 5 @  2      |  |5 
 2 @  1      |2 
 1 @  0    1 
 0 @  0    0 

bisect is actually an alias for bisect_right, which returns an insertion point after (to the right of) any existing items equal to the needle. Its sister function bisect_left returns an insertion point before (to the left of) the matching items.

bisect has another function, bisect.insort(seq, item), which inserts item into seq while keeping seq sorted; it does the search and the insert in a single call.

%%time
haystack = HAYSTACK[:]
idx = bisect.bisect(haystack, 19)
haystack.insert(idx, 19)
print(haystack)
[1, 4, 5, 6, 8, 12, 15, 19, 20, 21, 23, 23, 26, 29, 30]
CPU times: user 2.41 ms, sys: 0 ns, total: 2.41 ms
Wall time: 1.51 ms
%%time
haystack = HAYSTACK[:]
bisect.insort(haystack, 19)
print(haystack)
[1, 4, 5, 6, 8, 12, 15, 19, 20, 21, 23, 23, 26, 29, 30]
CPU times: user 161 µs, sys: 30 µs, total: 191 µs
Wall time: 198 µs
import bisect
import random

SIZE = 7

my_list = []
for i in range(SIZE):
    new_item = random.randrange(SIZE*2)
    bisect.insort(my_list, new_item)
    print('%2d ->'%new_item, my_list)
 5 -> [5]
 8 -> [5, 8]
 3 -> [3, 5, 8]
 6 -> [3, 5, 6, 8]
13 -> [3, 5, 6, 8, 13]
 0 -> [0, 3, 5, 6, 8, 13]
 9 -> [0, 3, 5, 6, 8, 9, 13]

When a List Is Not the Answer

When we are starting out we tend to use lists for everything, but there are times when other sequence types make more sense. Let's explore some of them.

  • If you need to store a large quantity of floating-point numbers, it's better to use an array, since it stores only the packed machine values rather than full float objects, much like an array in C/C++.

  • If you need fast FIFO or LIFO operations, consider a deque.

  • If you do frequent membership checks, a set is much more efficient, as sketched below (note that sets are not sequences, since they are unordered).
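A quick illustration of the membership-check difference (the numbers are chosen arbitrarily):

data_list = list(range(10**6))
data_set = set(data_list)

999999 in data_list   # True, but found by scanning the list item by item
999999 in data_set    # True, found by a constant-time hash lookup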

Arrays

An array stores numerical data of a single type as compactly and efficiently as a C array. When creating one you provide a typecode (a letter) that determines the underlying C type used.

from array import array
from random import random

# create an array of double-precision floats (typecode 'd')
floats = array('d', (random() for i in range(10**7)))
floats[-1]
0.19034769013593422
fp = open('floats.bin', 'wb')
floats.tofile(fp)    # size = 77M
fp.close()
float2 = array('d')
fp = open('floats.bin', 'rb')
float2.fromfile(fp, 10**7)
fp.close()
float2[-1]
0.19034769013593422
float2 == floats
True

Here we create a new integer array from the HAYSTACK list used earlier.

HAYSTACK
[1, 4, 5, 6, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30]
a = array('i', reversed(HAYSTACK))
a
array('i', [30, 29, 26, 23, 23, 21, 20, 15, 12, 8, 6, 5, 4, 1])

array has no built-in sort method, but you can easily create a new sorted array:

a = array(a.typecode, sorted(a))
a
array('i', [1, 4, 5, 6, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30])
bisect.bisect(a, 8), bisect.insort(a, 8), a
(5, None, array('i', [1, 4, 5, 6, 8, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30]))

Memory Views

A memoryview is essentially a generalized NumPy array structure in Python itself (without the math). It allows you to share memory between data structures (things like PIL images, SQLite databases, NumPy arrays, etc.) without first copying. This is very important for large data sets.

Travis Oliphant, lead author of NumPy

memoryview has a cast method that returns yet another memoryview object sharing the same memory.
import array
numbers = array.array('h', [-2, -1, 0, 1, 2])
memv = memoryview(numbers)
len(memv)

# apparently there is some issue with python3 hence not used
# ref - https://stackoverflow.com/questions/4877866/why-is-it-not-possible-to-get-a-py-buffer-from-an-array-object
5
print(memv[0])

memv_oct = memv.cast('B')
memv_oct.tolist(), memv_oct[5], numbers
-2
([254, 255, 255, 255, 0, 0, 1, 0, 2, 0], 0, array('h', [-2, -1, 0, 1, 2]))

NumPy and SciPy

If you're doing advanced array and matrix operations, NumPy and SciPy are your go-tos. These libraries are a big reason why Python became so popular in the scientific community: they are fast and reliable because they leverage the widely used C and Fortran code base.
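A tiny sketch of basic NumPy array operations (assumes numpy is installed; it is not part of the standard library):

import numpy as np

a = np.arange(12)      # array([ 0,  1, ..., 11])
a.shape = 3, 4         # reshape in place into a 3x4 matrix
a[2]                   # third row: array([ 8,  9, 10, 11])
a[2, 1]                # single element: 9
a[:, 1]                # second column: array([1, 5, 9])
a.transpose()          # rows become columns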

Deque and other Queues

Lists are great, but appending to or popping from the left end is costly; a deque, on the other hand, is optimized for operating on both ends (though it is poor for operations in the middle). Queues show up in a wide range of algorithms, so Python ships an optimized implementation.

collections.deque is a thread-safe double-ended queue designed for fast insertion and removal from both ends. It can also be bounded to a fixed length; once a bounded deque is full, adding a new item silently discards an item from the opposite end (the oldest one).

from collections import deque
dq = deque(range(10), maxlen=10)
dq
deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dq.rotate(3)
dq
deque([7, 8, 9, 0, 1, 2, 3, 4, 5, 6])
dq.append(2)
dq
deque([8, 9, 0, 1, 2, 3, 4, 5, 6, 2])
dq.appendleft(-1)
dq
deque([-1, 8, 9, 0, 1, 2, 3, 4, 5, 6])
dq.extend([11, 12, 13])
dq
deque([0, 1, 2, 3, 4, 5, 6, 11, 12, 13])
dq.maxlen
10
dq.extendleft([10, 20, 30, 40])
dq
deque([40, 30, 20, 10, 0, 1, 2, 3, 4, 5])

The append() and popleft() operations are atomic, so deque is safe to use as a FIFO queue in multithreaded applications without the need for locks.
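A minimal sketch of that pattern; the worker function and the items are made up for illustration:

import threading
from collections import deque

tasks = deque()
for i in range(5):
    tasks.append(i)                  # atomic append on the right

def worker():
    while True:
        try:
            item = tasks.popleft()   # atomic pop from the left; the call itself needs no lock
        except IndexError:
            break                    # queue drained
        print('processed', item)

t = threading.Thread(target=worker)
t.start()
t.join()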

Other implementations of Queues

queue: This provides the synchronized (i.e., thread-safe) classes Queue, LifoQueue, and PriorityQueue. These are used for safe communication between threads. All three classes can be bounded by providing a maxsize argument greater than 0 to the constructor. However, they don't discard items to make room as deque does. Instead, when the queue is full the insertion of a new item blocks, i.e. it waits until some other thread makes room by taking an item from the queue, which is useful to throttle the number of live threads.
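A small sketch of that blocking behaviour with a bounded queue.Queue (the producer/consumer setup is made up for illustration):

import queue
import threading
import time

q = queue.Queue(maxsize=2)       # bounded: put() blocks once two items are waiting

def consumer():
    while True:
        item = q.get()
        if item is None:         # sentinel to stop the consumer
            break
        time.sleep(0.1)          # simulate slow work so the producer has to wait

t = threading.Thread(target=consumer)
t.start()

for i in range(5):
    q.put(i)                     # blocks whenever the queue already holds two items
q.put(None)
t.join()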

multiprocessing: Implements its own bounded Queue, very similar to queue.Queue but designed for interprocess communication. A specialized multiprocessing.JoinableQueue is also available for easier task management.

asyncio: Newly added in Python 3.4, asyncio provides Queue, LifoQueue, PriorityQueue, and JoinableQueue with APIs inspired by the classes in the queue and multiprocessing modules, but adapted for managing tasks in asynchronous programming.

heapq: In contrast to the previous three modules, heapq does not implement a queue class, but provides functions like heappush and heappop that let you use a mutable sequence as a heap queue or priority queue.