Python dataclasses and typing

I'm going to preach the wonders of Python dataclasses, but for reasons of interested to those who have already gone down the delightful rabbit-hole of typed Python. So let me start with a quick plug for mypy if you haven't heard about it.

(Warning: this is going to be a bit long.)

Type Checking

mypy is a static type-checker for Python. In its simplest form, instead of writing:

def hello(name):
    return f"Hello, {name}"

you write:

def hello(name: str) -> str:
    return f"Hello, {name}"

The type annotations are ignored at runtime, but the mypy command makes use of them to do static typing. So, for instance:

$ cat > t.py
def hello(name: str) -> str:
    return f"Hello {name}"
hello(1)
$ mypy t.py
t.py:3: error: Argument 1 to "hello" has incompatible type "int"; expected "str"

If you're not already using this with your Python code, I cannot recommend it highly enough. It's somewhat tedious to add type annotations to existing code, particularly at first when you have to look up how mypy represents some of the more complicated constructs like decorators, but once you do, mypy starts finding bugs in your code like magic. And it's designed to work incrementally and tolerate untyped code, so you can start slow, and the more annotations you add, the more bugs it finds. mypy is much faster than a comprehensive test suite, so even if you would have found the bug in testing, you can iterate faster on changes. It can even be told which variables may be None and then warn you if you use them without checking for None in a context where None isn't allowed.

But mypy can only help with code that's typed, so once you get the religion, the next goal is to push typing ever deeper into complicated corners of your code.

Typed Data Structures

Python code often defaults to throwing any random collection of data into a dict. For a simple example, suppose you have a paginated list of strings (a list, an offset from the start of the list, a limit of the number of strings you want to see, and a total number of strings in the underlying data). In a lot of Python code, you'll see something like:

strings = {
    "data": ["foo", "bar", "baz"],
    "offset": 5,
    "limit": 3,
    "total": 10,
}

mypy is good, but it's not magical. It has no way to keep track of the fact that strings["data"] is a list of strings, but strings["offset"] is a int. Instead, it decides the type of each value is the superclass of the types it sees in the initializer (in this case, object, which provides almost no type checking).

There are two traditional solutions: an object, and a NamedTuple (the typing-enhanced version of collections.namedtuple). An object is tedious:

class Strings:
    def __init__(
        self, data: List[str], offset: int, limit: int, total: int
    ) -> None:
        self.data = data
        self.offset = offset
        self.limit = limit
        self.total = total

This provides perfect typing, but who wants to write all that. A NamedTuple is a little better:

Strings = NamedTuple(
    "Strings",
    [("data", List[str]), ("offset", int), ("limit", int), ("total", int)],
)

but still kind of tedious and has other quirks, such as the fact that your object can now be used as a tuple, which can introduce some surprising bugs.

Enter dataclasses, which are new in Python 3.7 (although inspired by attrs, which have been around for some time). The equivalent is:

@dataclass
class Strings:
    data: List[str]
    offset: int
    limit: int
    total: int

So much nicer, and the same correct typing. And unlike NamedTuple, dataclasses support default values, expansion via inheritance, and are full classes so you can attach short methods and do other neat tricks. You can also optionally mark them as frozen, which provides the NamedTuple behavior of making them immutable after creation.

A Detailed Example

Using dataclasses for those random accumulations of data is already great, but today I found a way to use them for a trickier typing problem.

I work on a medium-sized (about 75 routes) Tornado web UI using Jinja2 and WTForms for templating. Returning a page to the user's browser involves lots of code that looks something like this:

self.render("template.html", service=service, owner=owner)

Under the hood, this loads that template, builds a dictionary of template variables, and tells Jinja2 to render the template with those variables. The problem is the typing: the render method has no idea what sort of data you want to pass to a given template, so it uses the dreaded **kargs: Any, so you can pass anything you want. And mypy can't look inside Jinja2 template code.

Forget to pass in owner? Exception or silent failure during template rendering depending on your Jinja2 options. Pass in the name of a service when the template was expecting a rich object? Exception or silent failure. Typo in the name of the parameter? Exception or silent failure. Better hope your test suite is thorough.

What I did today was wrap each template in a dataclass:

@dataclass
class Template(BaseTemplate):
    service: str
    owner: str
    template: InitVar[str] = "template.html"

Now, the code to render it looks like:

template = Template(service, owner)
self.finish(template.render(self))

and now I have type-checking of all of the template arguments and only need to ensure the dataclass definition matches the needs of the template implementation.

The magic happens in the base template class:

@dataclass
class BaseTemplate:
    def render(self, handler: RequestHandler) -> str:
        template_name = getattr(self, "template")
        template = handler.environment.get_template(template_name)
        namespace = handler.get_template_namespace()
        for field in fields(self):
            namespace[field.name] = getattr(self, field.name)
        return template.render(namespace)

(My actual code is a bit different and more complicated since I move some other template setup to the render() method.)

There's some magic here to work around dataclass limitations that warrants some explanation.

I pass the Tornado handler class into the template render() method so that I have access to the template environment and the (overridden) Tornado get_template_namespace() call to get default variable settings. Passing them into the dataclass constructor would make the code less clean and is harder to implement, mostly due to limitations on attributes with default values, mentioned below.

The name of the template file should be a property of the template definition rather than something the caller needs to know, but that means it has to be given last since dataclasses require that all attributes without default values come before ones with default values. That in turn also means that the template attribute cannot be defined in BaseTemplate, even without a default value, because if a child class sets a default value, @dataclass then objects. Hence the getattr to hide from mypy the fact that I'm breaking type rules and assuming all child classes are well-behaved.

template in the child classes is marked as InitVar so that it won't be included in the fields of the dataclass and thus won't be passed down to Jinja2.

Finally, it would be nice to be able to use dataclasses.asdict() to turn the object into a dictionary for passing into Jinja2, but unfortunately asdict tries to do a deep copy of all template attributes, which causes all sorts of problems. I want to pass functions and WTForms form objects into Jinja2, which resulted in asdict throwing all sorts of obscure exceptions. Hence the two lines that walk through the fields and add a shallow copy of each field to the template namespace.

I've only converted four templates so far (this code base is littered with half-finished transitions to better ways of doing things that I try to make forward progress on when I can), but I'm already so much happier. All sorts of obscure template problems will now be caught at mypy even before needing to run the test suite.

Posted: 2019-11-09 15:02 — Why no comments?

Last spun 2022-02-06 from thread modified 2019-11-09