While reading the scikit-learn code I came across a bit of code that I didn’t understand for a while but in retrospect is quite neat.
This is the code snippet that intrigued me:
vocabulary = defaultdict() vocabulary.default_factory = vocabulary.__len__
Let’s quickly see how it works by adapting an example from scikit-learn:
>>> from collections import defaultdict >>> vocabulary = defaultdict() >>> vocabulary.default_factory = vocabulary.__len__ >>> vocabulary["foo"] 0 >>> vocabulary.items() dict_items([('foo', 0)]) >>> vocabulary["bar"] 1 >>> vocabulary.items() dict_items([('foo', 0), ('bar', 1)])
What seems to happen is that when we try to find a key that doesn’t exist in the dictionary an entry gets created with a value equal to the number of items in the dictionary.
Let’s check if that assumption is correct by explicitly adding a key and then trying to find one that doesn’t exist:
>>> vocabulary["baz"] = "Mark >>> vocabulary["baz"] 'Mark' >>> vocabulary["python"] 3
Now let’s see what the dictionary contains:
>>> vocabulary.items() dict_items([('foo', 0), ('bar', 1), ('baz', 'Mark'), ('python', 3)])
All makes sense so far. If we look at the source code we can see that this is exactly what’s going on:
""" __missing__(key) # Called by __getitem__ for missing key; pseudo-code: if self.default_factory is None: raise KeyError((key,)) self[key] = value = self.default_factory() return value """ pass
scikit-learn uses this code to store a mapping of features to their column position in a matrix, which is a perfect use case.
All in all, very neat!
|Published on Web Code Geeks with permission by Mark Needham, partner at our WCG program. See the original article here: Python: Learning about defaultdict’s handling of missing keys|
Opinions expressed by Web Code Geeks contributors are their own.