Python: Learning about defaultdict’s handling of missing keys

While reading the scikit-learn code I came across a bit of code that I didn’t understand for a while but in retrospect is quite neat.

This is the code snippet that intrigued me:

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

Let’s quickly see how it works by adapting an example from scikit-learn:

>>> from collections import defaultdict
>>> vocabulary = defaultdict()
>>> vocabulary.default_factory = vocabulary.__len__
>>> vocabulary["foo"]
>>> vocabulary.items()
dict_items([('foo', 0)])
>>> vocabulary["bar"]
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1)])

What seems to happen is that when we try to find a key that doesn’t exist in the dictionary an entry gets created with a value equal to the number of items in the dictionary.

Let’s check if that assumption is correct by explicitly adding a key and then trying to find one that doesn’t exist:

>>> vocabulary["baz"] = "Mark
>>> vocabulary["baz"]
>>> vocabulary["python"]

Now let’s see what the dictionary contains:

>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1), ('baz', 'Mark'), ('python', 3)])

All makes sense so far. If we look at the source code we can see that this is exactly what’s going on:

__missing__(key) # Called by __getitem__ for missing key; pseudo-code:
  if self.default_factory is None: raise KeyError((key,))
  self[key] = value = self.default_factory()
  return value

scikit-learn uses this code to store a mapping of features to their column position in a matrix, which is a perfect use case.

All in all, very neat!

Published on Web Code Geeks with permission by Mark Needham, partner at our WCG program. See the original article here: Python: Learning about defaultdict’s handling of missing keys

Opinions expressed by Web Code Geeks contributors are their own.

Notify of

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Inline Feedbacks
View all comments
Back to top button