Information systems can become complex quickly. A typical information system such as a web service is, at the most basic level, just one process in a large, integrated data pipeline. It deals mostly with data processing: fetching data, transforming it, and passing it on to another system. But as other systems pile up on top of it, complexity builds quickly. Managing and mitigating that complexity then becomes a major challenge for development teams.
Traditionally, information systems have been implemented using software programming paradigms like Object-Oriented Programming, based on the concept of “objects”, which can contain data and code. Information systems that follow Object-Oriented Programming with no constraints tend to be complex, in the sense that they are hard to understand and hard to maintain.
As system complexity increases, the velocity of the development team tends to drop, because adding new features takes more time. Hard-to-diagnose issues also occur more frequently in production, causing user frustration when the system doesn’t behave as expected or, even worse, system downtime.
Three aspects of Object-Oriented programming are a source of complexity:
- Data encapsulation in objects
- Non-flexible data layout in classes
- State mutation
Data encapsulation inside objects is beneficial in many cases. However, in the context of modern information systems, data encapsulation tends to create complex class hierarchies where objects are involved in many relations with other objects.
Over the years, this complexity has been alleviated by the invention of advanced design patterns and software frameworks. But information systems built with Object-Oriented programming still tend to be complex.
Representing every piece of data with a class is helpful for tooling (e.g. autocompletion in the editor), and errors like accessing non-existent fields are detected at compile time. However, the rigidity of class layout makes data access inflexible. In the context of information systems, this is painful: each variation of the data is represented by a different class. For instance, in a system that deals with customers, there is one class that represents a customer as seen by the database and a different class that represents a customer as seen by the data manipulation logic. The two hold similar data under different field names, yet the proliferation of classes is unavoidable, because the data is “locked” in classes.
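To make the customer example concrete, here is a minimal sketch in Python. The class names, field names, and the `to_domain` helper are all hypothetical, invented for illustration; they are not taken from any real system.

```python
from dataclasses import dataclass


# Hypothetical classes, invented for illustration only.
@dataclass
class CustomerRow:
    """A customer as seen by the database layer."""
    customer_id: int
    full_name: str
    email_address: str


@dataclass
class Customer:
    """The same customer as seen by the data manipulation logic."""
    id: int
    name: str
    email: str


def to_domain(row: CustomerRow) -> Customer:
    # Each new variation of the data needs another class and another
    # hand-written conversion like this one.
    return Customer(id=row.customer_id, name=row.full_name, email=row.email_address)


row = CustomerRow(customer_id=7, full_name="Ada Lovelace", email_address="ada@example.com")
customer = to_domain(row)
```

The two classes carry the same information, but neither can be used where the other is expected, so the conversion function exists only to move values from one fixed layout to another.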
In multi-threaded information systems, the fact that an object’s state can be mutated is another source of complexity. The lock mechanisms introduced to prevent data from being modified concurrently, and to ensure that the state of our objects remains valid, make the code harder to write and to maintain. Sometimes, before passing data to a method from a third-party library, we make a defensive copy to ensure our data is not modified. Adding lock mechanisms or defensive copies makes our code more complex and less performant.
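The sketch below shows what locks and defensive copies can look like in practice. The `Cart` class and the `third_party_sort` function are hypothetical stand-ins; the latter only imitates a library call that mutates its argument.

```python
import copy
import threading


class Cart:
    """Hypothetical mutable object shared between threads."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def add(self, item):
        # The lock keeps concurrent writers from corrupting the list.
        with self._lock:
            self._items.append(item)

    def items(self):
        # Defensive copy: callers must not be able to mutate internal state.
        with self._lock:
            return copy.deepcopy(self._items)


def third_party_sort(items):
    # Stand-in for a library function that mutates its argument in place.
    items.sort()
    return items


cart = Cart()
for name in ["pear", "apple"]:
    cart.add(name)

# Only the copy is sorted; the cart keeps its insertion order.
sorted_items = third_party_sort(cart.items())
```

Both the lock and the copy are bookkeeping that exists only because the underlying list can change. Neither contributes to what the cart is actually for.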
Data-Oriented Programming (DOP) is a set of best practices that developers follow to reduce the complexity of information systems.
The idea behind DOP is to simplify the design and implementation of information systems by treating data as a “first-class citizen”. Instead of designing information systems around objects that combine data and code, DOP guides us to separate code from data and to represent data with immutable generic data structures. As a consequence, in DOP developers manipulate data with the same flexibility and serenity as they manipulate numbers or strings in any program.
DOP reduces system complexity by following three core principles:
- Separating code from data
- Representing data with generic data structures
- Keeping data immutable
One possible way to adhere to DOP in an Object-Oriented Programming language is to write code in static class methods that receive the data they manipulate as an explicit argument.
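As a rough illustration of that approach, the following Python sketch keeps all code in static methods and passes the data in explicitly. The class names, the author record, and the threshold of 100 books are assumptions made up for this example.

```python
class AuthorData:
    """Hypothetical factory: builds plain data and holds no state itself."""

    @staticmethod
    def create(first_name, last_name, books):
        return {"first_name": first_name, "last_name": last_name, "books": books}


class NameCalculation:
    """Code only: a stateless function that takes data as an explicit argument."""

    @staticmethod
    def full_name(data):
        return f"{data['first_name']} {data['last_name']}"


class AuthorRating:
    """Code only: the threshold of 100 books is an arbitrary illustrative value."""

    @staticmethod
    def is_prolific(data):
        return data["books"] > 100


author = AuthorData.create("Isaac", "Asimov", 500)
full_name = NameCalculation.full_name(author)
prolific = AuthorRating.is_prolific(author)
```

Because `full_name` only needs the two name fields, it works on any mapping that has them, not just on author records.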
The separation of concerns achieved by separating code from data tends to make the class hierarchy less complex: instead of designing a system with a class diagram made of entities involved in many relationships, the system is made of two disjoint simpler subsystems: a code subsystem and a data subsystem.
When we represent data with generic data structures (like hash maps and lists), data access is flexible and it tends to reduce the number of classes in our system.
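For instance, the database view and the logic view of a customer mentioned earlier can both be plain dictionaries, with small generic functions converting between them. This is only a sketch: `rename_keys`, `select_keys`, and the field names are hypothetical.

```python
def rename_keys(data, mapping):
    """Return a new dict with keys renamed according to `mapping`."""
    return {mapping.get(key, key): value for key, value in data.items()}


def select_keys(data, keys):
    """Return a new dict containing only the requested keys that exist."""
    return {key: data[key] for key in keys if key in data}


# A customer as it might come back from the database (hypothetical fields).
db_customer = {
    "customer_id": 7,
    "full_name": "Ada Lovelace",
    "email_address": "ada@example.com",
}

# The same customer as seen by the data manipulation logic: no new class needed.
customer = rename_keys(
    db_customer,
    {"customer_id": "id", "full_name": "name", "email_address": "email"},
)

# Another variation of the data, again without a dedicated class.
contact = select_keys(customer, ["name", "email"])
```

The two helper functions are not specific to customers. The same code reshapes any record, which is where the reduction in the number of classes comes from.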
Keeping data immutable brings serenity to developers who need to write code in a multi-threaded environment. Data validity is ensured without protecting the code with lock mechanisms or defensive copies.
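One way to approximate immutable data in Python is a read-only view from the standard library's `types.MappingProxyType`. The sketch below assumes flat records; `freeze` and `with_value` are hypothetical helpers, and the protection is shallow, so nested values would need the same treatment.

```python
from types import MappingProxyType


def freeze(data):
    # Copy first so that nobody holding the original dict can change the view.
    # Note: this is shallow; nested dicts or lists stay mutable.
    return MappingProxyType(dict(data))


def with_value(data, key, value):
    """Return a new frozen mapping; the original is left untouched."""
    return freeze({**data, key: value})


customer = freeze({"name": "Ada Lovelace", "email": "ada@example.com"})
updated = with_value(customer, "email", "ada@newmail.example")

try:
    customer["email"] = "oops"  # a mappingproxy rejects item assignment
    mutation_blocked = False
except TypeError:
    mutation_blocked = True
```

Since `customer` can never change, it can be shared across threads or handed to third-party code without a lock or a defensive copy. An "update" produces a new value instead.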
DOP principles are applicable both to Object-Oriented and to functional programming languages. However, for Object-Oriented developers, the transition to DOP might require more of a mind shift than for functional programming developers, as DOP guides us to get rid of the habit of encapsulating data in stateful classes.
Yehonathan Sharvit has been working as a software engineer since 2000, programming with C++, Java, Ruby, JavaScript, Clojure and ClojureScript. He currently works as a software architect at CyCognito, building software infrastructures for high scale data pipelines. He shares insights about software at his tech blog. Yehonathan recently published the book Data-Oriented Programming available from Manning.
Top comments (7)
Finally someone who seems to actually know his topic. Thank you for this insight.
I would like to see more concrete examples though. I'm using DTOs (value objects) and the benefit is the strictness and type safety. Json and nested associative arrays (maps) are fine, but to actually know what data they contain, one has to constantly run a debugger or other visualisation or inspection tool.
I'm interested to hear your opinions.
The DOP approach is to decouple schema validation from data typing.
In other words:
The developer is then free to decide what pieces of data should have a schema and what functions should validate their input.
@viebel The only online references I found on "data oriented programming" are yours :-)
The problem I see with this design is that it seems like just another tradeoff. You trade classes for schema validators. You design immutable structures to mitigate errors and sacrifice performance, as every mutation creates a copy of the previous data.
It reminds me of what happens on every REST API request - you map the xml/json to arrays/objects, then validate the schema, then do stuff with the validated data. I usually map it to internal data structures (DTOs) or ORM classes, depending on use-case. I can't really imagine how I write a schema+validator for every part of my application, every service interacting with the data (services, controllers, background jobs, etc.). I can see flexibility, but I can't see the reduction in complexity mentioned.
Could you point me to some sample code? Preferably something heavier than a hello-world app.
Also, I can see you mention the paradigm is language-agnostic, but which language do you think would benefit most? (Seeing the tags in this article, would it be Java or JavaScript?)
I think this describes a technique I use with JSON data. I read the JSON into the library's generic storage, then define types which pull out the data of interest, using the library's object conversion. It is kind of like network masking where the IP address does not change but I'm only going to look at...
Sounds interesting... I will do a little research about this.
Please share the results of your research, once you have them
I did a little research as well, because I've felt some of the frustrations born of what you discussed above. I work in eCommerce and Business Systems specifically and the level of complexity in the business logic often causes junior and mid-level developers to struggle to deliver code because they can't model the logic conceptually in their mind when they are working. The idea of limiting the scope of what a developer has to think about when they are making modifications is an enticing one.
However, upon doing some research, I found the same as you mentioned in your Feb 2021 talk Data Oriented Programming in Practice: there is very little information available on data oriented programming (aside from your blog posts and your book). I suspect some of the issue may be due to a naming conflict. Data oriented design has long been a popular method of approaching OOP in systems where performance is key, especially memory-based performance. The most obvious sector here is video games' Entity Component System architecture pattern, which itself is a data-oriented approach to the composition pattern.
It's unfortunate that there isn't more information about DOP, especially in the form of practical examples and comparisons of similar problems solved with DOP vs OOP. I feel like that would get more developers on board rather than just conceptual essays or talks about how to think about data and functions in a DOP paradigm. If you have any suggestions for other resources, I'd love to hear them!