Dictionaries are not just another dataset in python, they are an itergral part of the language. Hence Dicts are highly optimized, Dicts are implemented using hash tables. Sets are also build using hash tables and hence covered with this.

Generic Mapping Types - Mapping and MutableMapping

Both of these form the base classes for dicts and sets. The collection.abs module provides these. For implementing specialized mappings we often extend dict or collection.UserDict instead of ABCs. The main value of ABCs is documenting and formalizing the minimal interfaces for mapping.

from collections.abc import Mapping
my_dict = {}
isinstance(my_dict, Mapping)
True

The dict has one requirement and that is that the keys should be hashable. What do you mean by hashable?

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.

print(hash((1, 2, 3)), hash('1, 2, 3'), hash(1))
print(hash([1, 2, 3]))
529344067295497451 4576582733818374657 1

TypeErrorTraceback (most recent call last)
<ipython-input-8-146e8d157a5b> in <module>
      1 print(hash((1, 2, 3)), hash('1, 2, 3'), hash(1))
----> 2 print(hash([1, 2, 3]))

TypeError: unhashable type: 'list'
a = dict(one=1, two=2, three=3)
b = {'one': 1, 'two': 2, 'three': 3}
c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
d = dict([('two', 2), ('one', 1), ('three', 3)])
e = dict({'three': 3, 'two': 2, 'one':1})

a == b == c == d == e
True

dict Comprehensions

the syntax of listcomps and genexps can be used

nums = {num: str(num) for num in range(10)}
nums
{0: '0',
 1: '1',
 2: '2',
 3: '3',
 4: '4',
 5: '5',
 6: '6',
 7: '7',
 8: '8',
 9: '9'}

Overview of Common Mapping Methods

There are 3 main dictionary types. dict, defaultdict, orderedDict.

Handlying Missing Keys with setdefault

dicts have a get function which is what is used most of the time when you want to handle missing keys (every pythonista knows this one!). In some cases where we have to handle key error and use a default value when the keys is not present don't use get to define the default value, instead use setdefault function. Use the 2 wisely. Most of the time setdefault() is a wise choice because using get() can lead to more searches. This is when you are inserting. if you also what to do a lookup...

Mappings with Flexible Key Lookup

There are 2 other methods to do this.

  1. Use defaultdict - this expect a callable to be passed to the defaultdict constructor that will be called when there is a key error in the __getitem__. Note this will only be called for __getitem__ and not for other functions like .get(). Under the hood, it is using the __missing__ special methods.
  2. subclass dict and add a __missing__ method - when __getitem__ is called with a key that is not present, it calls the __missing__ function if it is implemented to handle it.
from collections import defaultdict

# takes a callable which is called when there is a key error
d = defaultdict(list)
d[0] # what happend? -> calls list() to create list and inserts that into key
[]

but under the hood defaultdict uses the __missing__ method. This method is used to handle the missing values in any mapping object.

class PinToFun(dict):
    
    # the __getitem in dict calls this function
    def __missing__(self, key):
        if isinstance(key, str): # check to avoid recursion
            raise KeyError(key)
        return self[str(key)]
    
    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default
        
    def __contains__(self, key):
        return key in self.keys() or str(key) in self.keys()
pins = {'0': 'I/O', '1': 'LED'}
pins['0']
'I/O'
pins[0]

KeyErrorTraceback (most recent call last)
<ipython-input-12-bc9cb425f661> in <module>
----> 1 pins[0]

KeyError: 0
pins_improved = PinToFun({'0': 'I/O'})
pins_improved[0]
'I/O'
pins_improved.get(0)
'I/O'

Variations of Dict:

collections.OrderedDict: Maintains keys in insertion order. Hence iteration is predictable.

collections.ChainMap: class is provided for quickly linking a number of mappings so they can be treated as a single unit. It is often much faster than creating a new dictionary and running multiple update()calls

collections.Counter: Holds the integer count of each key. Can be used to count instances of hashable objects. has addition functions like most_common to return the odered list of tuples.

from collections import Counter
ct = Counter('abaaachdddrllkk')
ct
Counter({'a': 4, 'b': 1, 'c': 1, 'h': 1, 'd': 3, 'r': 1, 'l': 2, 'k': 2})

collections.UserDict: Used as base class for creating new mapping classes. The main reason we don't use dict as base class is that it has some implementation shortcuts that we will have to override in order to make it work. Note that UserDict does not inherit from dict instead has data, which is a dict instance to avoid possible recursion issues.

we will now modify the PinToFun class using this to show its effectiveness.

import collections

class PinToFun(collections.UserDict):
    
    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]
    
    def __contains__(self, key):
        return str(key) in self.data
    
    def __setitem__(self, key, item):
        #here the data is accessed as an attribute
        self.data[str(key)] = item
pins = PinToFun({'0': 'I/O', '1': 'LED'})
pins
{'0': 'I/O', '1': 'LED'}
0 in pins
True

Immutable Mappings

mapping types are mutable but if you want to constrain the user from making changes, use this.

Introducing MappingProxyType from types module. This returns a read-only by dynamic view of the original mapping. Hence updates can be seen but no changes can be performed using the mappingproxy

from types import MappingProxyType
d = {1: 'A'}
d_proxy = MappingProxyType(d)
d_proxy
mappingproxy({1: 'A'})
d_proxy[1]
'A'
d_proxy[0] = 'B'

TypeErrorTraceback (most recent call last)
<ipython-input-3-c636e28161a2> in <module>
----> 1 d_proxy[0] = 'B'

TypeError: 'mappingproxy' object does not support item assignment
d[0] = 'B'
d_proxy[0]
'B'

Set Theory

An underused concept in python. The basic use is removing duplicate elements. Set elements must be hashable.

In addition to that sets also support basic set operations.

  • a | b - Union
  • a & b - Intersection
  • a - b - Difference

Sets have extremely fast membership functions. If used effectively it can make you code faster and easier to read.

l = ['spam', 'eggs', 'spam', 'spam', 'spam']
l
['spam', 'eggs', 'spam', 'spam', 'spam']
set(l)
{'eggs', 'spam'}
s = {1}
type(s)
set
s
{1}
s.pop()
1
s
# set() is used to denote empty set there is no literal notation for sets
set()
%time
# this is faster and cleaner
s = {1, 2, 3, 1}
s
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 2.62 µs
{1, 2, 3}
%time
s = set([1, 2, 3, 1])
s
CPU times: user 1 µs, sys: 4 µs, total: 5 µs
Wall time: 8.58 µs
{1, 2, 3}

Using the literal set syntax is faster and more readable than calling the constructor set([1, 2, 3]). The latter is slower because when you use the literal, python directly calls the BUILD_SET bytecode to create the set.

from dis import dis #disassembe bytecode
dis('{1}')
  1           0 LOAD_CONST               0 (1)
              2 BUILD_SET                1
              4 RETURN_VALUE
dis('set([1])')
  1           0 LOAD_NAME                0 (set)
              2 LOAD_CONST               0 (1)
              4 BUILD_LIST               1
              6 CALL_FUNCTION            1
              8 RETURN_VALUE
frozenset(range(10))
frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9})

The advantage of frozenset is that they are hashable while sets are not. So a frozenset can be used inside a set but nested sets are not possible.

f = frozenset(range(10))
s = set(range(10))
f, s
(frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}), {0, 1, 2, 3, 4, 5, 6, 7, 8, 9})
new_set = {f}
new_set = {s}

TypeErrorTraceback (most recent call last)
<ipython-input-33-0e201e86858f> in <module>
      1 new_set = {f}
----> 2 new_set = {s}

TypeError: unhashable type: 'set'
{i for i in 'abcdeeefghhijkkkll'}
{'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'}

dict and set under the Hood

dict and set has really fast membership operations (lookup) due to the fact that they are implemented using hash tables. Comparing this with something like a list, there is a world of difference. All this is due to the hash tables.

Hash Tables

It is implemented as a sparse array. Each cell in the array is called a bucket. In a dict table each bucket has 2 fields, a ref to the key and a ref to the value item.

Python tries to keep 1/3 of the buckets empty. If its size increases it is copied out to a know location with larger memory. The first set in putting an element inside is to hash the item.

To put an item in hash table, the first step is to calculate the hash value of the item key. This is a unique value for the given data type. The hash() function is called, which in turn uses the __hash__ for calculating the hash.

 hash(1)
1

If 2 objs compare equal there hash values must be equal.

hash(1), hash(1.0)
(1, 1)

Also hashes for objects that are similar to should be as different as possible.

hash(1.001), hash(1.002)
(2305843009213441, 4611686018427393)

Note: There is a detailed overview of the hash table algorithms in the book. Please google for it to find it. (It is in page 89 in the book). I actually forgot all about hashes, shouldn't have skipped those classes in college...

Practical Consequeces of How dict Works

  1. keys must be hashable objects - the object must support hash() funtion and eq() function. also if a == b the hash(a) == hash(b). By default all user objects are hashable.

  2. dicts have significant memory overhead - since dicts use sparese arrays they are not efficient. Note that if your using dicts in JSON style with one dict per object, a namedtuple is far more efficent alternative.

  3. Key search is fast (very fast) - hash tables are the reason.

  4. Key ordering depends on insertion order - Consider the case where a key collition happens.

  5. Adding items to dict may change the order of existing keys - this is why modifing the contents of a dict while iterating through it is a bad idea. If you need to scan and add items to a dict, do it in 2 steps.

Practical Concequeces of how sets Work

similar to dict, sets and frozensets also implement hash tables but each bucket only holds a reference. Just like dicts but without a value to go with it. similar concequences as above.

to sum up

Dictionaries are a keystone of Python. Beyond the basic dict , the standard library offers handy, ready-to-use specialized mappings like defaultdict , OrderedDict , ChainMap , and Counter , all defined in the collections module. The same module also provides the easy-to-extend UserDict class.

Two powerful methods available in most mappings are setdefault and update . The setdefault method is used to update items holding mutable values, for example, in a dict of list values, to avoid redundant searches for the same key. The update method allows bulk insertion or overwriting of items from any other mapping, from iterables providing (key, value) pairs and from keyword arguments. Mapping constructors also use update internally, allowing instances to be initialized from mappings, iterables, or keyword arguments.

A clever hook in the mapping API is the missing method, which lets you customize what happens when a key is not found. The collections.abc module provides the Mapping and MutableMapping abstract base classes for reference and type checking. The little-known MappingProxyType from the types module creates immutable mappings. There are also ABCs for Set and Mutable Set .

The hash table implementation underlying dict and set is extremely fast. Understand‐ ing its logic explains why items are apparently unordered and may even be reordered behind our backs. There is a price to pay for all this speed, and the price is in memory.