Chapter 3: Dictionaries and Sets
Exploring the fundamental datastructure in python.
- Generic Mapping Types - Mapping and MutableMapping
- dict Comprehensions
- Overview of Common Mapping Methods
- Immutable Mappings
- Set Theory
- dict and set under the Hood
- Practical Consequeces of How dict Works
- Practical Concequeces of how sets Work
Dictionaries are not just another dataset in python, they are an itergral part of the language. Hence Dicts are highly optimized, Dicts are implemented using hash tables
. Sets are also build using hash tables and hence covered with this.
Mapping
and MutableMapping
Generic Mapping Types - Both of these form the base classes for dicts and sets. The collection.abs
module provides these. For implementing specialized mappings we often extend dict
or collection.UserDict
instead of ABCs. The main value of ABCs is documenting and formalizing the minimal interfaces for mapping.
from collections.abc import Mapping
my_dict = {}
isinstance(my_dict, Mapping)
The dict
has one requirement and that is that the keys should be hashable. What do you mean by hashable?
An object is hashable if it has a hash value which never changes during its lifetime (it needs a
__hash__()
method), and can be compared to other objects (it needs an__eq__()
method). Hashable objects which compare equal must have the same hash value.
print(hash((1, 2, 3)), hash('1, 2, 3'), hash(1))
print(hash([1, 2, 3]))
a = dict(one=1, two=2, three=3)
b = {'one': 1, 'two': 2, 'three': 3}
c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
d = dict([('two', 2), ('one', 1), ('three', 3)])
e = dict({'three': 3, 'two': 2, 'one':1})
a == b == c == d == e
nums = {num: str(num) for num in range(10)}
nums
Overview of Common Mapping Methods
There are 3 main dictionary types. dict
, defaultdict
, orderedDict
.
Handlying Missing Keys with setdefault
dicts have a get
function which is what is used most of the time when you want to handle missing keys (every pythonista knows this one!). In some cases where we have to handle key error and use a default value when the keys is not present don't use get
to define the default value, instead use setdefault
function. Use the 2 wisely. Most of the time setdefault()
is a wise choice because using get()
can lead to more searches. This is when you are inserting. if you also what to do a lookup...
Mappings with Flexible Key Lookup
There are 2 other methods to do this.
- Use
defaultdict
- this expect a callable to be passed to thedefaultdict
constructor that will be called when there is a key error in the__getitem__
. Note this will only be called for__getitem__
and not for other functions like.get()
. Under the hood, it is using the__missing__
special methods. - subclass dict and add a
__missing__
method - when__getitem__
is called with a key that is not present, it calls the__missing__
function if it is implemented to handle it.
from collections import defaultdict
# takes a callable which is called when there is a key error
d = defaultdict(list)
d[0] # what happend? -> calls list() to create list and inserts that into key
but under the hood defaultdict
uses the __missing__
method. This method is used to handle the missing values in any mapping object.
class PinToFun(dict):
# the __getitem in dict calls this function
def __missing__(self, key):
if isinstance(key, str): # check to avoid recursion
raise KeyError(key)
return self[str(key)]
def get(self, key, default=None):
try:
return self[key]
except KeyError:
return default
def __contains__(self, key):
return key in self.keys() or str(key) in self.keys()
pins = {'0': 'I/O', '1': 'LED'}
pins['0']
pins[0]
pins_improved = PinToFun({'0': 'I/O'})
pins_improved[0]
pins_improved.get(0)
Variations of Dict:
collections.OrderedDict
: Maintains keys in insertion order. Hence iteration is predictable.
collections.ChainMap
: class is provided for quickly linking a number of mappings so they can be treated as a single unit. It is often much faster than creating a new dictionary and running multiple update()
calls
collections.Counter
: Holds the integer count of each key. Can be used to count instances of hashable objects. has addition functions like most_common
to return the odered list of tuples.
from collections import Counter
ct = Counter('abaaachdddrllkk')
ct
collections.UserDict
: Used as base class for creating new mapping classes. The main reason we don't use dict
as base class is that it has some implementation shortcuts that we will have to override in order to make it work. Note that UserDict
does not inherit from dict
instead has data
, which is a dict instance to avoid possible recursion issues.
we will now modify the PinToFun class using this to show its effectiveness.
import collections
class PinToFun(collections.UserDict):
def __missing__(self, key):
if isinstance(key, str):
raise KeyError(key)
return self[str(key)]
def __contains__(self, key):
return str(key) in self.data
def __setitem__(self, key, item):
#here the data is accessed as an attribute
self.data[str(key)] = item
pins = PinToFun({'0': 'I/O', '1': 'LED'})
pins
0 in pins
Immutable Mappings
mapping types are mutable but if you want to constrain the user from making changes, use this.
Introducing MappingProxyType
from types
module. This returns a read-only by dynamic view of the original mapping. Hence updates can be seen but no changes can be performed using the mappingproxy
from types import MappingProxyType
d = {1: 'A'}
d_proxy = MappingProxyType(d)
d_proxy
d_proxy[1]
d_proxy[0] = 'B'
d[0] = 'B'
d_proxy[0]
Set Theory
An underused concept in python. The basic use is removing duplicate elements. Set elements must be hashable.
In addition to that sets also support basic set operations.
-
a | b
- Union -
a & b
- Intersection -
a - b
- Difference
Sets have extremely fast membership functions. If used effectively it can make you code faster and easier to read.
l = ['spam', 'eggs', 'spam', 'spam', 'spam']
l
set(l)
s = {1}
type(s)
s
s.pop()
s
# set() is used to denote empty set there is no literal notation for sets
%time
# this is faster and cleaner
s = {1, 2, 3, 1}
s
%time
s = set([1, 2, 3, 1])
s
Using the literal set syntax is faster and more readable than calling the constructor set([1, 2, 3])
. The latter is slower because when you use the literal, python directly calls the BUILD_SET
bytecode to create the set.
from dis import dis #disassembe bytecode
dis('{1}')
dis('set([1])')
frozenset(range(10))
The advantage of frozenset is that they are hashable while sets are not. So a frozenset can be used inside a set but nested sets are not possible.
f = frozenset(range(10))
s = set(range(10))
f, s
new_set = {f}
new_set = {s}
{i for i in 'abcdeeefghhijkkkll'}
dict
and set
under the Hood
dict and set has really fast membership operations (lookup) due to the fact that they are implemented using hash tables. Comparing this with something like a list, there is a world of difference. All this is due to the hash tables.
Hash Tables
It is implemented as a sparse array. Each cell in the array is called a bucket
. In a dict table each bucket has 2 fields, a ref to the key and a ref to the value item.
Python tries to keep 1/3 of the buckets empty. If its size increases it is copied out to a know location with larger memory. The first set in putting an element inside is to hash the item.
To put an item in hash table, the first step is to calculate the hash value of the item key. This is a unique value for the given data type. The hash()
function is called, which in turn uses the __hash__
for calculating the hash.
hash(1)
If 2 objs compare equal there hash values must be equal.
hash(1), hash(1.0)
Also hashes for objects that are similar to should be as different as possible.
hash(1.001), hash(1.002)
Note: There is a detailed overview of the hash table algorithms in the book. Please google for it to find it. (It is in page 89 in the book). I actually forgot all about hashes, shouldn't have skipped those classes in college...
Practical Consequeces of How dict Works
-
keys must be hashable objects - the object must support
hash()
funtion andeq()
function. also ifa == b
thehash(a) == hash(b)
. By default all user objects are hashable. -
dicts have significant memory overhead - since dicts use sparese arrays they are not efficient. Note that if your using dicts in JSON style with one dict per object, a namedtuple is far more efficent alternative.
-
Key search is fast (very fast) - hash tables are the reason.
-
Key ordering depends on insertion order - Consider the case where a key collition happens.
-
Adding items to dict may change the order of existing keys - this is why modifing the contents of a dict while iterating through it is a bad idea. If you need to scan and add items to a dict, do it in 2 steps.
Practical Concequeces of how sets Work
similar to dict, sets and frozensets also implement hash tables but each bucket only holds a reference. Just like dicts but without a value to go with it. similar concequences as above.
to sum up
Dictionaries are a keystone of Python. Beyond the basic dict , the standard library offers handy, ready-to-use specialized mappings like defaultdict , OrderedDict , ChainMap , and Counter , all defined in the collections module. The same module also provides the easy-to-extend UserDict class.
Two powerful methods available in most mappings are setdefault and update . The setdefault method is used to update items holding mutable values, for example, in a dict of list values, to avoid redundant searches for the same key. The update method allows bulk insertion or overwriting of items from any other mapping, from iterables providing (key, value) pairs and from keyword arguments. Mapping constructors also use update internally, allowing instances to be initialized from mappings, iterables, or keyword arguments.
A clever hook in the mapping API is the missing method, which lets you customize what happens when a key is not found. The collections.abc module provides the Mapping and MutableMapping abstract base classes for reference and type checking. The little-known MappingProxyType from the types module creates immutable mappings. There are also ABCs for Set and Mutable Set .
The hash table implementation underlying dict and set is extremely fast. Understand‐ ing its logic explains why items are apparently unordered and may even be reordered behind our backs. There is a price to pay for all this speed, and the price is in memory.