
Flavio Curella for REVSYS

Originally published at revsys.com

Dataclasses and attrs: when and why

Python 3.7 introduced dataclasses (PEP 557). Dataclasses can be a convenient way to generate classes whose primary goal is to contain values.

The design of dataclasses is based on the pre-existing attrs library. In fact, Hynek Schlawack, the author of attrs, helped with the writing of PEP 557.

Basically dataclasses are a slimmed-down version of attrs. Whether this is an improvement or not really depends on your specific use-case.

I think the addition of dataclasses to the standard library makes attrs even more relevant. The way I see it is that one is a subset of the other, and having both options is a good thing. You should probably use both in your project, according to the level of formality you want in that particular piece of code.

In this article I will show the way I use dataclasses and attrs, why I think you should use both, and why I think attrs is still very relevant.

What do they do

Both the standard library's dataclasses and the attrs library provide a way to define what I'll call "structured data types" (I would put namedtuple, dict and TypedDict in the same family).

PS: There's probably some more correct CS term for them, but I didn't go to CS School, so ¯\_(ツ)_/¯

They are all variations on the same concept: a class representing a data type containing multiple values, each value addressed by some kind of key.

They also do a few more useful things: they provide ordering, serialization, and a nice string representation. But for the most part, the most useful purpose is adding a certain degree of formalization to a group of values that need to be passed around.
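As a rough illustration of those extras, here is a minimal sketch using only the standard library. The `Version` class is a made-up example, not something from the article's codebase.

```python
from dataclasses import asdict, astuple, dataclass


# order=True asks the decorator to generate <, <=, >, >= based on the fields
@dataclass(order=True)
class Version:
    major: int
    minor: int


old = Version(1, 4)
new = Version(2, 0)

print(old)           # generated __repr__ shows the field names and values
print(old < new)     # True: compared field by field, like a tuple
print(asdict(new))   # {'major': 2, 'minor': 0}
print(astuple(new))  # (2, 0)
```

attrs offers the same conveniences through `attr.asdict` and `attr.astuple`.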

An example

I think an example would better illustrate what I use dataclasses and attrs for.
Suppose you want to render a template containing a table. You want to make sure the table has a title, a caption, and rows:

def render_table(title: str, caption: str, data: List[Dict[str, Any]]):
    return template.render({
        "title": title,
        "caption": caption,
        "data": data,
    })

Now, suppose you want to render a document, which consists of a title, description, status ("draft", "in review", "approved"), and a list of tables. How would you pass the tables to render_document?

You may choose to represent each table as a dict:

{
    "title": "My Table",
    "caption": "2019 Earnings",
    "data": [
        {"Period": "QT1", "Europe": 500, "USA": 467},
        {"Period": "QT2", "Europe": 345, "USA": 765},
    ]
}

But how would you express the type annotation for the tables argument so that it's correct, explicit and simple to understand?

def render_document(title: str, description: str, status: str, tables: List[Dict[str, Any]]):
    return template.render({
        "title": title,
        "description": description,
        "status": status,
        "tables": tables,
    })

That only lets us describe the first level of tables. It doesn't tell us that a Table has a title or a caption. Instead, you could use a dataclass:

from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Table:
    title: str
    data: List[Dict[str, Any]]
    caption: str = ""


def render_document(title: str, description: str, tables: List[Table]):
    return template.render({
        "title": title,
        "description": description,
        "tables": tables,
    })

This way we have type hinting, which helps our IDE help us.

But we can go one step further, and also provide type validation at runtime. This is where dataclasses stop, and attrs comes in:
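One caveat is worth seeing in code: the annotations on a dataclass are hints for tools, and nothing checks them when an instance is created. A minimal sketch, reusing the Table definition from above:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Table:
    title: str
    data: List[Dict[str, Any]]
    caption: str = ""


# The annotation says str, but the generated __init__ accepts anything
table = Table(title=b"My Table", data=[])
print(type(table.title).__name__)  # bytes
```

A type checker or IDE would flag that call, but the interpreter runs it without complaint.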

import attr


@attr.s
class Table:
    title: str = attr.ib(validator=attr.validators.instance_of(str))  # don't you pass no bytes!
    data: List[Dict[str, Any]] = attr.ib(validator=...)  # validator elided here
    caption: str = attr.ib(validator=attr.validators.instance_of(str), default="")


def render_document(title: str, description: str, tables: List[Table]):
    return template.render({
        "title": title,
        "description": description,
        "tables": tables,
    })

Now, suppose we also need to render a "Report", which is a collection of "Document"s. You can probably see where this is going:

@dataclass
class Table:
    title: str
    data: List[Dict[str, Any]]
    caption: str = ""


@attr.s
class Document:
    # the attr.ib keyword is validator= (singular)
    status: str = attr.ib(validator=attr.validators.in_(
        ["draft", "in review", "approved"]
    ))
    # factory=list gives each Document its own list instead of one shared default
    tables: List[Table] = attr.ib(factory=list)


def render_report(title: str, documents: List[Document]):
    return template.render({
        "title": title,
        "documents": documents,
    })

Note how I am validating that Document.status is one of the allowed values. This comes in particularly handy when you're building abstractions on top of Django models with a field that uses choices. Dataclasses have no built-in validators: you would have to write the check by hand in __post_init__.
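To show what that validation looks like in practice, here is a small sketch. It assumes attrs is installed and trims Document down to the status field only.

```python
import attr


@attr.s
class Document:
    status: str = attr.ib(
        validator=attr.validators.in_(["draft", "in review", "approved"])
    )


doc = Document(status="draft")  # an allowed value passes through untouched

rejected = False
try:
    Document(status="published")  # not one of the allowed values
except ValueError:
    # attr.validators.in_ raises ValueError for values outside the options
    rejected = True

print(doc.status, rejected)  # draft True
```

The check runs inside the generated `__init__`, so an invalid Document never gets created.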

A couple of patterns I keep finding myself in are the following:

  1. Write a function that accepts some arguments
  2. Group some of the arguments into a tuple
  3. Hm, I want field names -> namedtuple.
  4. Hm, I want types -> dataclass.
  5. Hm, I want validation -> attrs.
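The first steps of that progression can be sketched with the standard library alone. The names `TableNT` and `TableDC` are made up for this illustration, and the attrs step is left out because it looks like the attrs version of Table shown earlier.

```python
from collections import namedtuple
from dataclasses import dataclass

# Step 2: a plain tuple; callers must remember that index 0 is the title
table = ("My Table", "2019 Earnings")
print(table[0])  # My Table

# Step 3: a namedtuple adds field names
TableNT = namedtuple("TableNT", ["title", "caption"])
print(TableNT("My Table", "2019 Earnings").title)  # My Table


# Step 4: a dataclass adds type annotations and defaults
@dataclass
class TableDC:
    title: str
    caption: str = ""


print(TableDC("My Table").caption == "")  # True
```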

Another situation that happens quite often is this:

  1. Write a function that accepts some arguments
  2. Add typing so my IDE can help me out
  3. Oh, by the way, it needs to support a list of those things, not just one at a time!
  4. Refactor to use dataclasses
  5. This argument can only be one of those values, or I ask myself: how do I make sure other developers are passing the right type and/or values?
  6. Switch to attrs

Sometimes I stop at the dataclasses. Lots of times I get to the attrs step.

And sometimes, this happens:

  1. One half of this legacy codebase uses -1 as a special value for False, the other half uses False. Switch to attrs so I can use converter= to normalize.
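A minimal sketch of that normalization, assuming attrs is installed. The `Flag` class and the `to_bool` helper are hypothetical names invented for this example.

```python
import attr


def to_bool(value):
    """Hypothetical helper: map the legacy -1 sentinel to False."""
    return False if value == -1 else bool(value)


@attr.s
class Flag:
    # The converter runs in the generated __init__, before the value is stored
    enabled: bool = attr.ib(converter=to_bool)


print(Flag(-1).enabled)     # False
print(Flag(False).enabled)  # False
print(Flag(True).enabled)   # True
```

Both halves of the codebase can keep passing what they always passed, and everything downstream sees a real bool.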

Comparison

The two libraries do appear very similar. To get a clearer picture of how they compare, I've made a table of the features I use most:

feature          dataclasses                          attrs
frozen           yes                                  yes
defaults         yes                                  yes
totuple          yes                                  yes
todict           yes                                  yes
validators       no                                   yes
converters       no                                   yes
slotted classes  no in 3.7 (slots=True came in 3.10)  yes

As you can see, there's a lot of overlap. But the additional features on attrs provide functionality that I need more often than not.
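The overlapping rows can be sketched side by side. This assumes attrs is installed. `PointDC` and `PointAttrs` are illustrative names, and `auto_attribs=True` is used so the attrs class can be declared with bare annotations.

```python
import dataclasses

import attr


@dataclasses.dataclass(frozen=True)
class PointDC:
    x: int
    y: int = 0  # default value


@attr.s(frozen=True, auto_attribs=True)
class PointAttrs:
    x: int
    y: int = 0  # default value


dc = PointDC(1)
at = PointAttrs(1)

print(dataclasses.asdict(dc), attr.asdict(at))    # {'x': 1, 'y': 0} {'x': 1, 'y': 0}
print(dataclasses.astuple(dc), attr.astuple(at))  # (1, 0) (1, 0)

# Both frozen variants reject assignment; each library's FrozenInstanceError
# is a subclass of AttributeError.
blocked = []
for point in (dc, at):
    try:
        point.x = 99
    except AttributeError:
        blocked.append(type(point).__name__)

print(blocked)  # ['PointDC', 'PointAttrs']
```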

When to use dataclasses

Dataclasses are just about the "shape" of the data.
Choose dataclasses if:

  • You don't care about values in the fields, only their type
  • adding a dependency is not trivial

When to use attrs

attrs is about the shape and the values.
Choose attrs if:

  • you want to validate values. A common case would be the equivalent of a ChoiceField.
  • you want to normalize, or sanitize the input
  • whenever you want more formalization than dataclasses alone can offer
  • you are concerned about memory and performance. attrs can create slotted classes, which use less memory per instance and give faster attribute access in CPython.

I often find myself using dataclasses and later switching to attrs because the requirements changed or I find out I need to guard against some particular value. I think that's a normal aspect of developing software; it's what I call "continuous refactoring".
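For the slotted-class case, a short sketch (assuming attrs is installed; `Point` is an illustrative name):

```python
import attr


@attr.s(slots=True, auto_attribs=True)
class Point:
    x: int
    y: int


p = Point(1, 2)
print(hasattr(p, "__dict__"))  # False: no per-instance dict to pay for

extra_attribute_blocked = False
try:
    p.z = 3  # not a declared field
except AttributeError:
    extra_attribute_blocked = True

print(extra_attribute_blocked)  # True
```

The trade-off is that a slotted instance cannot grow attributes that were not declared on the class.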

Why I like dataclasses

I'm glad dataclasses have been added to the standard library, and I think it's a beneficial addition. It's a very convenient thing to have at your disposal whenever you need.

For one, it will encourage a more structured style of programming from the beginning.

But I think the most compelling case is a practical one. Some high-risk corporate environments (e.g., financial institutions) require every package to be vetted (with good reason: we've already had incidents of malicious code in libraries). That means that adding attrs is not as simple as adding a line to your requirements.txt, and will involve waiting on approval from your corpops team. Those developers can use dataclasses right away and their code will immediately benefit from using more formalized data types.

Why I like attrs

Most people don't work in such strictly-controlled environments.

And sure, sometimes you don't need all the features from attrs, but it doesn't hurt having them.

More often than not, I end up needing them anyway, as I formalize more and more of my code's API. Dataclasses only get me halfway to where I want to go.

Conclusion

I think dataclasses encompass only a subset of what attrs has to offer. Admittedly, it is a big subset. But the features that are not covered are important enough and needed often enough that they make attrs not only still relevant and useful, but also necessary.

In my mind, using both allows developers to progressively refactor their code, moving the contracts amongst functions from loosely-defined arguments all the way up to formally described data structures as the requirements of the app stabilize over time.

One nice effect of having dataclasses is that developers now have more incentive to refactor their code toward more formalization. At some point dataclasses are not going to be enough, and that's when developers will refactor to use attrs. In this way, dataclasses actually act as an introduction to attrs. I wouldn't be surprised if attrs becomes more popular thanks to dataclasses.

Acknowledgments + Thanks

I want to thank the following people for revising drafts and providing input and insights:

  • Hynek Schlawack
  • Jacob Kaplan-Moss
  • Jacob Burch
  • Jeff Triplett
