The Gotcha

A defaultdict factory method that reads a nonlocal variable sees that variable's current value every time it runs. If the variable changes, the default values created for missing keys change with it, which can look like a bug.

Background

Python's defaultdict is a dict subclass that changes what happens when code uses square brackets to look up a key that is not in the dictionary. Instead of raising a KeyError, it calls a factory method supplied when the defaultdict was created, stores the result under the missing key, and returns it.

from collections import defaultdict

# A defaultdict that retrieves a hard-coded string as the default value
my_dict = defaultdict(lambda: "my_default_value")

print(my_dict.items())  # dict_items([])
print(my_dict["key"])   # my_default_value
print(my_dict.items())  # dict_items([('key', 'my_default_value')])

# A defaultdict that retrieves an integer as the default value
my_dict = defaultdict(int)

print(my_dict.items())  # dict_items([])
print(my_dict["key"])   # 0
print(my_dict.items())  # dict_items([('key', 0)])

The above code snippet shows examples of factory methods that use hard-coded default values. However, if the desired default value can't be calculated from inside the factory method, it must instead use variables that are defined outside of the method. Because defaultdict always calls its factory method with no arguments, those values cannot be passed in, so the variables are nonlocal to the method (in the examples here they are module-level globals). That means that code outside the factory method can change a nonlocal variable's value and, as a side effect, change what the factory method returns.

Example

The code snippet below demonstrates the use of a nonlocal variable in a defaultdict's factory method. When the value of nonlocal_var is changed, new keys added to the dictionary take on its new value. This may seem incorrect if the desired behaviour is for new keys to always use the original value of nonlocal_var.

from collections import defaultdict

nonlocal_var = True

my_dict = defaultdict(lambda: nonlocal_var)
print(my_dict.items())  # dict_items([])

print(my_dict[1])       # True
print(my_dict.items())  # dict_items([(1, True)])

nonlocal_var = False

print(my_dict[2])       # False
print(my_dict.items())  # dict_items([(1, True), (2, False)])

Why it Happens

The defaultdict retrieves values for missing keys lazily: the factory method is called each time a missing key is looked up with square brackets, not when the defaultdict is created. Additionally, the example's lambda does not store a copy of nonlocal_var. It refers to the variable by name, and Python resolves that name each time the lambda runs (this is what "late binding" means). The same holds whether the variable is a module-level global, as in the example, or a local variable of an enclosing function that the lambda closes over. Thus, when the program attempts to access a key not already in the defaultdict, the factory method is called and the lambda returns the current value of nonlocal_var.
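
To make both halves of that explanation visible, here is a minimal sketch that logs every call to the factory. The names calls, current_default and counting_factory are made up for this illustration and are not part of the original example.

```python
from collections import defaultdict

calls = []               # hypothetical log of every factory invocation
current_default = "a"    # hypothetical variable that the factory reads


def counting_factory():
    # current_default is looked up when this function runs,
    # not when it is defined.
    calls.append(current_default)
    return current_default


my_dict = defaultdict(counting_factory)

my_dict["x"]             # missing key: the factory runs and "x" is stored
my_dict["x"]             # existing key: the factory does not run again
my_dict.get("y")         # .get() never triggers the factory
"y" in my_dict           # neither does a membership test

current_default = "b"
my_dict["z"]             # missing key: the factory runs and sees the new value

print(calls)             # ['a', 'b']
print(dict(my_dict))     # {'x': 'a', 'z': 'b'}
```

The factory ran exactly twice, once per missing key accessed with square brackets, and each run picked up whatever current_default held at that moment.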

This may seem undesirable, but the combination of lazy evaluation and late binding allows developers to create powerful factory methods that suit their needs. For example, a program can update the default value for a defaultdict in real time based on a stream of input provided by an outside user!
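
As a rough sketch of that idea, the snippet below replaces the live input with a hard-coded list of events. The event names, default_level and levels are all hypothetical and exist only for this illustration.

```python
from collections import defaultdict

default_level = 1  # hypothetical setting that incoming events may change

levels = defaultdict(lambda: default_level)

# Hard-coded stand-in for input that would normally arrive over time.
events = [
    ("join", "ann"),
    ("set_default", 3),
    ("join", "bob"),
    ("join", "ann"),
]

for action, value in events:
    if action == "set_default":
        default_level = value
    elif action == "join":
        levels[value]  # first access stores the default in effect right now

print(dict(levels))  # {'ann': 1, 'bob': 3}
```

Here the changing default is the intended behaviour: ann keeps the level that was current when she first joined, while bob gets the updated one.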

Going back to the original problem, how can we capture the value of nonlocal_var so that changes to nonlocal_var no longer affect our defaultdict?

The Fix

Solution 1: Add a parameter with a default value

Add a parameter to the lambda with a default value set to the desired variable, and reference the parameter when executing code within the lambda, e.g. lambda arg=var: 10 * arg. The original example would then look like this:

from collections import defaultdict

my_var = True

my_dict = defaultdict(lambda x=my_var: x)
print(my_dict.items())  # dict_items([])

print(my_dict[1])       # True
print(my_dict.items())  # dict_items([(1, True)])

my_var = False

print(my_dict[2])       # True
print(my_dict.items())  # dict_items([(1, True), (2, True)])

Because the defaultdict always calls its factory method (here, the lambda) without arguments, the parameter always takes its default value. And because the lambda body refers to the parameter rather than to the outer variable, the name it uses is now local to the lambda instead of nonlocal.

Most importantly, the variable's original value is "captured" at the time that the lambda is defined, so the lambda's default value is not affected by future changes to the variable.
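
One way to check this for yourself is to look at the function's __defaults__ attribute, where Python stores the evaluated default values. The names late and early below are invented for this comparison.

```python
my_var = True

late = lambda: my_var        # looks up my_var on every call
early = lambda x=my_var: x   # evaluates my_var once, on this line

print(early.__defaults__)    # (True,)

my_var = False

print(late())                # False
print(early())               # True
print(early(123))            # 123
```

The last line shows the one thing to keep in mind: a caller that does pass an argument overrides the captured value. A defaultdict never does, so the default is always used there.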

Solution 2: Add a new scope

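Insert a middleman function between the factory method and the scope where the variable lives. The middleman receives the variable's current value as an argument, and the factory method closes over that argument instead of the original variable. This can be done with a wrapper function: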

from collections import defaultdict

my_var = True
closure = (lambda x: lambda: x)(my_var)

my_dict = defaultdict(closure)
print(my_dict.items())  # dict_items([])

print(my_dict[1])       # True
print(my_dict.items())  # dict_items([(1, True)])

my_var = False

print(my_dict[2])       # True
print(my_dict.items())  # dict_items([(1, True), (2, True)])

The wrapper function captures the value of my_var in variable x and makes x available to the inner closure but inaccessible to modification from outside of it. Thus, changes to my_var no longer affect the defaultdict factory method.
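
If the nested lambdas are hard to read, the same idea can be spelled out with named functions. This is only a sketch of an equivalent form; make_factory and factory are names chosen for the illustration.

```python
from collections import defaultdict


def make_factory(value):
    # Each call to make_factory creates a fresh local variable `value`
    # that only the returned function can see.
    def factory():
        return value
    return factory


my_var = True
my_dict = defaultdict(make_factory(my_var))

my_var = False

print(my_dict[1])  # True

# The captured value lives in the inner function's closure cell.
print(my_dict.default_factory.__closure__[0].cell_contents)  # True
```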

Solution 3: Use a partial

This solution is essentially the same as the previous solution, but it uses existing Python tools to accomplish the same task. The partial function simplifies a function's signature by capturing (or "freezing") some of the function's parameters and returning a function with fewer parameters.

Below, lambda x: x is reduced from a one-parameter function to a zero-parameter function by using partial to freeze x with the value of my_var. Thus, the value of my_var is captured, and the factory method is not affected by changes to my_var:

from collections import defaultdict
from functools import partial

my_var = True
my_partial = partial(lambda x: x, my_var)

my_dict = defaultdict(my_partial)
print(my_dict.items())  # dict_items([])

print(my_dict[1])       # True
print(my_dict.items())  # dict_items([(1, True)])

my_var = False

print(my_dict[2])       # True
print(my_dict.items())  # dict_items([(1, True), (2, True)])
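
As a further illustration of the freezing behaviour, partial also works with functions that take more than one parameter, and the frozen arguments can be inspected on the partial object. The make_record function and the other names below are hypothetical.

```python
from collections import defaultdict
from functools import partial


def make_record(status, retries):
    # Hypothetical two-parameter function used only for this sketch.
    return {"status": status, "retries": retries}


status = "new"
factory = partial(make_record, status, retries=0)

print(factory.args)      # ('new',)
print(factory.keywords)  # {'retries': 0}

status = "stale"         # rebinding the name does not touch the frozen argument

records = defaultdict(factory)
print(records["job-1"])  # {'status': 'new', 'retries': 0}
```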

Solution 4: Use a dedicated variable

Copy the desired variable value to a new variable, and use the new variable in the factory method instead. Don't touch the new variable ever again.

Of all the solutions here, I can't believe I didn't come up with this one myself. My best friend was the one to suggest it! After all, if the program requires its defaultdict's factory method to return a static value, why would it deliberately reference a variable that gets modified?

from collections import defaultdict

my_var = True
dedicated_var = my_var

my_dict = defaultdict(lambda: dedicated_var)
print(my_dict.items())  # dict_items([])

print(my_dict[1])       # True
print(my_dict.items())  # dict_items([(1, True)])

my_var = False

print(my_dict[2])       # True
print(my_dict.items())  # dict_items([(1, True), (2, True)])

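This solution comes with an important caveat: assignment does not copy anything, it only gives the same object a second name. If the value is a mutable object such as a list or a dict, modifying the original in place also modifies what the dedicated variable refers to. In that case, make a copy (a deep copy if the object contains other mutable objects). The other solutions share this caveat, because they also capture a reference to the object rather than a copy of it.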
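
Here is a small sketch of that caveat using a list. The names alias, snapshot, aliased and copied are invented for the comparison.

```python
import copy
from collections import defaultdict

my_list = [1, 2]

alias = my_list                    # second name for the same list object
snapshot = copy.deepcopy(my_list)  # independent copy

aliased = defaultdict(lambda: alias)
copied = defaultdict(lambda: snapshot)

my_list.append(3)                  # modify in place rather than rebind

print(aliased["k"])  # [1, 2, 3]
print(copied["k"])   # [1, 2]
```

Note that every missing key in copied still receives the same snapshot object. If each key needs its own independent list, the factory has to make the copy itself, e.g. lambda: copy.deepcopy(snapshot).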

In the end, Python's late-binding closures are a cool feature even though they can wreak havoc on the unsuspecting developer. Keep an eye out for them, and you'll be okay!