Furkan Kalkan

Posted on Sep 23, 2018 • Edited on Sep 27, 2018

Fully automated metadata objects with Python 3.7's brand new dataclass library.

#python #dataclass #metadata

Dataclass is a brand new data structure introduced in Python 3.7. @btaskaya recently wrote a great article about it. If you haven't read it yet, you can read it here.

Dataclasses have promising features for creating reusable, self-verifying, and automated metadata objects. Before, I used plain dicts to create metadata objects, but copying and pasting the same object all the time is boring and conflicts with the DRY (Don't Repeat Yourself) rule.

It was like this:

Metadata = {}
Metadata["id"] = id
Metadata["url"] = url

if something:
    Metadata["some_field"] = some_data

Metadata["media"] = {}
Metadata["media"]["id"] = media_id 
...

I could have used NamedTuple or something similar instead of dict, but they have some limitations, and I really didn't have enough time to implement something fancier in the early days of the project. When I refactored the code, I realized that dataclass fit my needs better.

In this article, I will show you how to implement fully automated metadata objects with dataclasses step by step.

Part 1: Implement metadata fields that don't need calculation

This step is straightforward. It's just a standard dataclass definition.

from dataclasses import dataclass


@dataclass
class Metadata:
    title: str
    url: str
    created_at: str = None    # Fields may have default value
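
As a quick check of what the decorator generates, the sketch below instantiates the class and relies on the auto-generated `__repr__` and `__eq__`. The sample values are made up for illustration, and `Optional[str]` is my substitution for the `str = None` annotation above, not something the original code uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    title: str
    url: str
    created_at: Optional[str] = None  # the default makes this argument optional


# Illustrative values only.
first = Metadata(title="Some Article", url="https://example.com/article.html")
second = Metadata(title="Some Article", url="https://example.com/article.html")

print(first)            # generated __repr__ lists every field
print(first == second)  # generated __eq__ compares field by field
```

Two instances with the same field values compare equal, which is handy when metadata objects end up in tests.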

Part 2: Add fields that need calculation and calculate them automatically.

These fields get their values only after a calculation. In our case, post_id should be a random number joined to the url with an underscore.

import random
from dataclasses import dataclass, field


@dataclass
class Metadata:
    # Normal fields
    title: str
    url: str
    created_at: str = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self):
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"

The __post_init__ function calculates our post_id field after initialization.

Let's call it:

>>> Metadata(
...  title="Some Article",
...  url="https://example.com/article.html",
...  created_at="2018-09-23"
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', post_id='696953_https://example.com/article.html')

Gotcha!
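
A side effect of `field(init=False)` worth seeing once: the field is left out of the generated `__init__`, so callers cannot supply it by hand. The following is a minimal sketch of that behaviour; the short `"t"` and `"u"` values and the `rejected` flag are placeholders of mine.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metadata:
    title: str
    url: str
    created_at: Optional[str] = None
    post_id: str = field(init=False)  # excluded from the generated __init__

    def __post_init__(self):
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"


try:
    Metadata(title="t", url="u", post_id="manual")
    rejected = False
except TypeError:
    rejected = True  # __init__ has no post_id parameter

meta = Metadata(title="t", url="u")
print(rejected)
print(meta.post_id.endswith("_u"))
```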

Part 3: Get our hands dirtier; add `__post_init__`-only pseudo fields

We may want to build complex structures automatically. For instance, when certain inputs are supplied, the dataclass can build a whole substructure for us. In our case, we use the additional fields author_names and author_ids to construct the authors field as a list of dicts. If author information is not provided for the article, the value of the authors field should be None.

import random
from dataclasses import InitVar, dataclass, field


@dataclass
class Metadata:
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]

    def __post_init__(self, author_names, author_ids): # You have to pass pseudo fields as the parameter.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        self.authors = []
        for i in range(0, len(author_names)):
            self.authors.append({"id": author_ids[i], "name": author_names[i]})

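Let's call it:

>>> Metadata(
...  title="Some Article",
...  url="https://example.com/article.html",
...  created_at="2018-09-23"
... )

TypeError: non-default argument 'author_names' follows default argument

It didn't work :(

In fact, Python raises this error as soon as the class is defined, before we even get to call it.

Important Note: Fields without default values must come before fields with default values. This also applies to InitVar pseudo fields, because they become parameters of the generated __init__().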
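
To see the ordering rule in isolation, here is a small sketch that wraps a deliberately mis-ordered class definition in `try`/`except`. The class name `Broken` and the `error_message` variable are mine, and the exact wording of the message may differ slightly between Python versions.

```python
from dataclasses import InitVar, dataclass

try:
    @dataclass
    class Broken:
        title: str
        created_at: str = None        # field with a default
        author_names: InitVar[list]   # no default, declared after a default

    error_message = None
except TypeError as exc:
    # Raised by the @dataclass decorator while it builds __init__.
    error_message = str(exc)

print(error_message)
```

No instance is ever created here; the decorator itself refuses to build the class.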

Try again:

import json
import random
from dataclasses import InitVar, asdict, dataclass, field


@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self, author_names, author_ids): # You have to pass pseudo fields as the parameter.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        self.authors = []
        for i in range(0, len(author_names)):
            self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))

Let's call it again:

>>> Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_names=["Furkan Kalkan", "John Doe"],
... author_ids=["1", "2"]
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', authors=[{'id': '1', 'name': 'Furkan Kalkan'}, {'id': '2', 'name': 'John Doe'}], post_id='692728_https://example.com/article.html')

Yeah!

But wait... Where have author_names and author_ids gone?

Note: Pseudo fields declared with InitVar are only passed to __post_init__() as parameters; they are not part of the object.
>>> Metadata.author_names
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'Metadata' has no attribute 'author_names'
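
The same point can be checked from the instance side. The sketch below is a trimmed-down variant of the class (fewer fields, `zip` instead of an index loop, and a `default_factory` for `authors`, all my simplifications). It shows that `fields()` and `asdict()` skip InitVar pseudo fields.

```python
from dataclasses import InitVar, asdict, dataclass, field, fields


@dataclass
class Metadata:
    author_names: InitVar[list]
    author_ids: InitVar[list]
    title: str
    authors: list = field(default_factory=list)

    def __post_init__(self, author_names, author_ids):
        # Pair each id with its name and store the result on the real field.
        for author_id, name in zip(author_ids, author_names):
            self.authors.append({"id": author_id, "name": name})


meta = Metadata(author_names=["John Doe"], author_ids=["1"], title="Some Article")
field_names = [f.name for f in fields(meta)]

print(field_names)                    # only real fields are listed
print(asdict(meta))                   # pseudo fields are not dumped
print(hasattr(meta, "author_names"))  # nothing was stored on the instance
```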

Part 4: We don't need to define `author_names`.

We can make pseudo fields optional, too.

import json
import random
from dataclasses import InitVar, asdict, dataclass, field


@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Nullable Pseudo fields
    author_names: InitVar[list] = field(default=None)
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self, author_ids, author_names): # Pseudo fields are passed positionally, in the order they are declared.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        if author_names:
            self.authors = []
            for i in range(0, len(author_names)):
                self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))

Call it:

>>> Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_ids=["1", "2"]
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', authors=None, post_id='692728_https://example.com/article.html')
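
One detail to watch when reordering pseudo fields: dataclasses hands InitVar values to `__post_init__()` positionally, in the order the pseudo fields are declared, so the parameter list has to follow the same order. The sketch below uses a throwaway `Demo` class with a `received` field (both hypothetical) to make the order visible.

```python
from dataclasses import InitVar, dataclass


@dataclass
class Demo:
    author_ids: InitVar[list]            # declared first
    author_names: InitVar[list] = None   # declared second, optional
    received: tuple = None

    def __post_init__(self, first, second):
        # Parameter names do not matter; values arrive in declaration order.
        self.received = (first, second)


demo = Demo(author_names=["John Doe"], author_ids=["1"])
without_names = Demo(author_ids=["1"])

print(demo.received)
print(without_names.received)
```

Even though `author_names` is passed first as a keyword argument, `author_ids` still arrives as the first parameter of `__post_init__()`.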

Part 5: We need JSON.

Python objects are good, but we need to dump them as JSON to POST them to web services, MQs, etc. The dataclasses library has a built-in function, asdict(), which can dump our object to a dict.

Let's write the wrapper for our object.

import json
import random
from dataclasses import InitVar, asdict, dataclass, field


@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self, author_names, author_ids): # You have to pass pseudo fields as the parameter.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        if author_names:
            self.authors = []
            for i in range(0, len(author_names)):
                self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))

Check it:

>>> m = Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_names=["Furkan Kalkan", "John Doe"],
... author_ids=["1", "2"]
... )
>>> m.to_json()
'{"title": "Some Article", "url": "https://example.com/article.html", "created_at": "2018-09-23", "authors": [{"id": "1", "name": "Furkan Kalkan"}, {"id": "2", "name": "John Doe"}], "post_id": "466969_https://example.com/article.html"}'
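
asdict() also recurses into nested dataclasses, which could cover a sub-structure like the "media" dict from the introduction. The sketch below assumes a hypothetical `Media` dataclass with a single `id` field; it is not part of the original project.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Media:  # hypothetical sub-structure, echoing the "media" dict in the intro
    id: str


@dataclass
class Metadata:
    title: str
    url: str
    media: Optional[Media] = None


meta = Metadata(
    title="Some Article",
    url="https://example.com/article.html",
    media=Media(id="42"),
)
payload = json.dumps(asdict(meta))  # the nested Media becomes a nested dict
print(payload)
```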

Part 6: Remove unnecessary fields from JSON.

We want to remove None-valued fields from the JSON, except for the url field. It's possible with a small change:


def to_json(self):
    metadata = asdict(self)
    for key in list(metadata):
        if key != "url" and metadata[key] is None:
            del metadata[key]
    return json.dumps(metadata)
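
An alternative worth considering is the `dict_factory` parameter of asdict(), which receives the (key, value) pairs of each dataclass instance and decides how to build the dict. The sketch below is one possible take; `ALWAYS_KEEP` and `drop_none` are names I made up, and the class is cut down to three fields with an optional `url` so the exception is visible.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

ALWAYS_KEEP = {"url"}  # hypothetical constant: keys emitted even when None


def drop_none(pairs):
    """dict_factory for asdict(): receives a list of (key, value) tuples."""
    return {k: v for k, v in pairs if v is not None or k in ALWAYS_KEEP}


@dataclass
class Metadata:
    title: str
    url: Optional[str] = None
    created_at: Optional[str] = None

    def to_json(self):
        return json.dumps(asdict(self, dict_factory=drop_none))


result = Metadata(title="Some Article").to_json()
print(result)
```

Because asdict() applies the factory to nested dataclass instances as well, the filtering reaches sub-objects without an extra loop. Plain dicts stored inside fields are copied as they are.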

Top comments (2)

Gregory Wendel • Sep 25 '19

Hello - thanks for the helpful article and code examples. Pylint suggested I use enumerate instead of for ... . Here is the code I changed to follow the advice. I think I am getting the same response, but am curious if you see any issues with it.

Original:
for i in range(0, len(author_names)):
    self.authors.append({"id": author_ids[i], "name": author_names[i]})

My changes:
for count, _ in enumerate(author_names):
    self.authors.append({"id": author_ids[count], "name": author_names[count]})

Furkan Kalkan • Oct 1 '19

enumerate() is ok.

DEV Community
