mifuneson2 for TypeDB

Originally published at typedb.com

Why we built a polymorphic database

Database models have failed to keep up with the rapid evolution of programming languages, and modern applications must use complex layered architectures to manage data as a result. To resolve this, we built TypeDB on a completely new, highly expressive database paradigm.

In this blog post we explore the origin and core ideas that led us to create a polymorphic database.


Why do we need a polymorphic database?

Relational databases lack the expressivity of model polymorphism. Programming languages are becoming increasingly declarative, empowering engineers to quickly write safe and expressive code backed by static type checking and abstract data constructs. Relational databases, by contrast, were designed at a time when procedural programming languages were the norm. They are built on Codd’s relational algebra and implemented using tables and one-dimensional tuples.

Object-oriented programming involves complex modeling constructs such as abstraction, inheritance, and polymorphism, requiring the expression of multidimensional data structures.

// Class inheritance in Java

class User {
    String email;
    String name;

    User(String email, String name) {
        this.name = name;
        this.email = email;
    }
}

class Employee extends User {
    int employeeId;

    Employee(String email, String name, int employeeId) {
        super(email, name);
        this.employeeId = employeeId;
    }
}

class PartTimeEmployee extends Employee {
    int weeklyHour;

    PartTimeEmployee(String email, String name, 
                     int employeeId, int weeklyHour) {
        super(email, name, employeeId);
        this.weeklyHour = weeklyHour;
    }
}

// Class instantiation in Java

PartTimeEmployee john = 
    new PartTimeEmployee("john.doe@vaticle.com",
                         "John Doe", 346523, 35);

However, relational databases are unable to natively model objects due to their lack of expressivity.

-- Table inheritance in SQL

CREATE TABLE Users (
    id SERIAL PRIMARY KEY,
    email TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (email)
);

CREATE TABLE Employees (
    id INTEGER NOT NULL,
    employeeId INTEGER NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (id) REFERENCES Users(id),
    UNIQUE (employeeId)
);

CREATE TABLE PartTimeEmployee (
    id INTEGER NOT NULL,
    weeklyHour INTEGER NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (id) REFERENCES Employees(id)
);

-- Data insertion for extended tables in SQL

DO $$
DECLARE
    inserted_user_id INTEGER;
BEGIN
    INSERT INTO Users (id, email, name)
    VALUES (DEFAULT, 'john.doe@vaticle.com', 'John Doe')
    RETURNING id INTO inserted_user_id;

    INSERT INTO Employees (id, employeeId)
    VALUES (inserted_user_id, 346523);

    INSERT INTO PartTimeEmployee (id, weeklyHour)
    VALUES (inserted_user_id, 35);
END $$;

-- Data retrieval for all Users and extended tables in SQL

SELECT
    Users.id AS id,
    Users.email AS email,
    Users.name AS name,   
    null AS employeeId, 
    null AS weeklyHours,
    'user' AS userType 
FROM Users 
WHERE NOT EXISTS (
    SELECT id FROM Employees WHERE Users.id = Employees.id
) 
UNION ALL 
SELECT
    Users.id AS id, 
    Users.email AS email,
    Users.name AS name,
    employeeId,
    null AS weeklyHours,
    'employee' AS userType
FROM Users 
INNER JOIN Employees ON Users.id = Employees.id
WHERE NOT EXISTS (
    SELECT PartTimeEmployee.id 
    FROM PartTimeEmployee 
    WHERE Employees.id = PartTimeEmployee.id
) 
UNION ALL
SELECT 
    Users.id AS id,
    Users.email AS email,
    Users.name AS name,
    employeeId,
    weeklyHour AS weeklyHours,
    'partTimeEmployee' AS userType
FROM Users
INNER JOIN Employees ON Users.id = Employees.id
INNER JOIN PartTimeEmployee ON Employees.id = PartTimeEmployee.id

This fundamental incompatibility between object and relational models has become one of the biggest challenges in database engineering. Because of object-relational mismatch, relational databases have been unable to evolve alongside programming languages.

NoSQL eliminated the schema at the cost of declarative data retrieval. The limitations of relational databases led to the emergence of NoSQL databases, particularly document and graph databases. These databases eliminated the predefined schema, making data insertion trivial, but at the cost of complicating retrieval. Without a schema, structural metadata must be stored as data, hardcoded into queries, or modeled in a secondary data store. This forces engineers to access their data imperatively, as the database does not have the context to correctly interpret declarative polymorphic queries.

Document databases are optimized to store hierarchical data, but this breaks down when attempting to model highly interconnected data. To do so, engineers are required to choose between using performance-intensive foreign ID references or duplicating data across documents without native consistency control.

// Retrieval of polymorphic resources in MongoDB

db.resource_ownerships.aggregate( [
  {
    $lookup:
    {
        from: "resources",
        localField: "resource",
        foreignField: "_id",
        as: "resource"
    }
  },
  {
    $unwind:
    {
        path: "$resource"
    }
  },
  {
    $lookup:
    {
        from: "users",
        localField: "owner",
        foreignField: "_id",
        as: "owner"
    }
  },
  {
    $unwind:
    {
        path: "$owner"
    }
  },
  {
    $unwind:
    {
        path: "$owner.emails"
    }
  },
  {
    $addFields:
    {
      resource_id: {
        $switch: {
            branches: [
            {
              case: { 
                $eq: ["$resource.resource_type", "file"]
              }, 
              then: "$resource.path" 
            },
            {
              case: { 
                $eq: ["$resource.resource_type", "directory"]
              }, 
              then: "$resource.path" 
            },
            {
              case: { 
                $eq: ["$resource.resource_type", "commit"]
              }, 
              then: "$resource.hash" 
            },
            {
              case: { 
                $eq: ["$resource.resource_type", "repository"]
              }, 
              then: "$resource.name" 
            },
            {
              case: { 
                $eq: ["$resource.resource_type", "table"]
              }, 
              then: "$resource.name" 
            },
            {
              case: { 
                $eq: ["$resource.resource_type", "database"]
              }, 
              then: "$resource.name" 
            }
          ]
        }
      }
    }
  },
  {
      $project: {
          _id: false,
          email: "$owner.emails",
          resource_type: "$resource.resource_type",
          id: "$resource_id"
      }
  }
] )

Meanwhile, graph databases excel at linking highly interconnected data, but are severely limited by the implementation of relations as edges. This means that more complex types like n-ary relations, nested relations, and variadic relations are impossible to express without reifying the data model.

// Retrieval of polymorphic resources in Neo4j's Cypher
MATCH
    (user:User)-[:OWNS]->(rsrc:Resource)
WITH
    rsrc,
    [user.primary_email] + user.alias_emails AS emails,
    labels(rsrc) AS resource_types,
    keys(rsrc) AS properties
UNWIND emails AS email
UNWIND resource_types AS resource_type
WITH
    rsrc, email, resource_type, properties,
    {
        File: "path",
        Directory: "path",
        Commit: "hash",
        Repository: "name",
        Table: "name",
        Database: "name"
    } AS id_type_map
WHERE resource_type IN keys(id_type_map)
AND id_type_map[resource_type] IN properties
RETURN email,
       resource_type,
       rsrc[id_type_map[resource_type]] AS id

ORMs work around the fundamental problem by trading off performance. Engineers have attempted to work around the mismatch problem, building layers of abstraction over databases in an attempt to give them the expressivity they were never designed with. This requires engineers to non-natively manage structural metadata, leading to the widespread use of ORMs. They offer object-oriented APIs for managing and storing data, but because of the imperfect translation to the underlying models, they can never express the full range of queries that can be written natively. This results in limited capabilities and poor query optimization, while introducing an additional layer of complexity and overhead to database architecture.

Databases have been left far behind programming languages in safety and expressivity.

Current paradigms are unable to natively handle the data abstractions that we so easily take for granted in programming languages. TypeDB aims to solve this problem by providing a database that natively understands complex object-oriented data structures and supports abstraction, inheritance, and polymorphism. By integrating the benefits of strong typing and query flexibility, TypeDB simplifies data access and eliminates the need for manual mapping.


What defines a polymorphic database?

There are three fundamental kinds of polymorphism in computer science, all of which are possible in object-oriented programming. To be fully polymorphic, a database must implement each of them:

  1. Inheritance Polymorphism: The ability to define a class hierarchy where children inherit properties from their parents, and to interpret queries that operate on instances of a class and all its children.
  2. Interface Polymorphism: The ability to define properties as interfaces that classes can implement independently of their parents, and to interpret queries that operate on instances of all classes that implement a specified interface.
  3. Parametric Polymorphism: The ability to interpret queries that are abstract and able to operate on instances of any class supplied as a parameter to the query.
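
For readers who think in application code first, the three kinds listed above can be sketched in plain Java. This is only an illustration on the programming-language side, not TypeDB code. The `Owner` interface, the `UserGroup` class, and the `PolymorphismKinds` helper are hypothetical names introduced for the example, and `User` and `Employee` are trimmed versions of the earlier classes.

```java
import java.util.List;

// Hypothetical interface: anything that can own something.
interface Owner {
    String ownerId();
}

class User implements Owner {
    final String email;
    User(String email) { this.email = email; }
    public String ownerId() { return email; }
    String describe() { return "user " + email; }
}

// Inheritance: Employee is a User and overrides describe().
class Employee extends User {
    final int employeeId;
    Employee(String email, int employeeId) {
        super(email);
        this.employeeId = employeeId;
    }
    @Override String describe() { return "employee " + employeeId; }
}

// Unrelated to User by inheritance, but implements the same interface.
class UserGroup implements Owner {
    final String name;
    UserGroup(String name) { this.name = name; }
    public String ownerId() { return name; }
}

class PolymorphismKinds {
    // Inheritance polymorphism: accepts User and every subclass of User.
    static String describe(User user) { return user.describe(); }

    // Interface polymorphism: accepts any class that implements Owner.
    static String idOf(Owner owner) { return owner.ownerId(); }

    // Parametric polymorphism: works for any element type T the caller supplies.
    static <T> T first(List<T> items) { return items.get(0); }

    public static void main(String[] args) {
        System.out.println(describe(new Employee("john.doe@vaticle.com", 346523)));
        System.out.println(idOf(new UserGroup("engineers")));
        System.out.println(first(List.of("file", "directory")));
    }
}
```

A polymorphic database, in the sense used here, would let a query play the role of `describe`, `idOf`, or `first` directly against stored data.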

To be fully polymorphic, a database has to implement three systems:

  1. Variablizable Language: A query language that is expressive enough to variablize classes, either explicitly or implicitly. This is required to describe interface and parametric polymorphism, and so declaratively capture the intent of the query.
  2. Polymorphic Schema: A schema or equivalent metadata store containing the semantic context for interpreting the query’s intent. It must contain the class and interface definitions, the class hierarchy, and the interface implementations.
  3. Inference Engine: An inference engine for reducing the declarative query into an imperative form before being sent to the query planner. This is based on the structure of the query combined with the semantic context provided by the schema.
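
To make the role of the schema and the inference step more concrete, here is a minimal Java sketch of how a declarative request such as "all users" or "everything that owns an employee-id" could be reduced to an explicit list of concrete types before planning. It is a toy model under stated assumptions, not TypeDB's actual engine: `ToySchema` and its methods are hypothetical, and it supports only single inheritance and attribute ownership.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy schema: each type has at most one supertype and a set of owned attributes.
class ToySchema {
    private final Map<String, String> parent = new LinkedHashMap<>();
    private final Map<String, Set<String>> owns = new HashMap<>();

    // Registers a type; pass null as the supertype for a root type.
    ToySchema type(String label, String supertype, String... ownedAttributes) {
        parent.put(label, supertype);
        owns.put(label, new HashSet<>(Arrays.asList(ownedAttributes)));
        return this;
    }

    // True if label equals ancestor or inherits from it.
    boolean isSubtypeOf(String label, String ancestor) {
        for (String t = label; t != null; t = parent.get(t)) {
            if (t.equals(ancestor)) return true;
        }
        return false;
    }

    // Inheritance polymorphism: expand a type into itself plus all its subtypes.
    List<String> resolveSubtypes(String label) {
        List<String> result = new ArrayList<>();
        for (String t : parent.keySet()) {
            if (isSubtypeOf(t, label)) result.add(t);
        }
        return result;
    }

    // Interface polymorphism: every type that owns the attribute,
    // either directly or through one of its supertypes.
    List<String> resolveOwners(String attribute) {
        List<String> result = new ArrayList<>();
        for (String t : parent.keySet()) {
            for (String a = t; a != null; a = parent.get(a)) {
                if (owns.getOrDefault(a, Set.of()).contains(attribute)) {
                    result.add(t);
                    break;
                }
            }
        }
        return result;
    }
}

class ToySchemaDemo {
    public static void main(String[] args) {
        ToySchema schema = new ToySchema()
            .type("user", null, "email")
            .type("employee", "user", "employee-id")
            .type("part-time-employee", "employee", "weekly-hour")
            .type("user-group", null, "name");

        // prints [user, employee, part-time-employee]
        System.out.println(schema.resolveSubtypes("user"));
        // prints [employee, part-time-employee]
        System.out.println(schema.resolveOwners("employee-id"));
    }
}
```

The point of the sketch is the division of labor: the query names only an intent, and the schema supplies the context needed to expand it into concrete types.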

Current databases do not implement any form of polymorphism

Relational schemas define tables and columns as direct analogs of classes and their properties. As columns cannot be independently implemented by multiple tables, relational databases cannot natively implement interface polymorphism. Tables and columns also cannot be variablized in SQL, meaning that SQL cannot express interface and parametric polymorphism.

Document and graph databases can define a limited number of constraints on inserted data, but this does not have the expressivity of a hierarchical schema built with classes and interfaces. As a result, these databases cannot natively implement any of the three kinds of polymorphism.

A fully polymorphic database implements all three of these systems: a variablizable query language, a polymorphic schema, and an inference engine. It is therefore able to express all three fundamental kinds of polymorphism possible in object-oriented programming, which can be combined to yield unique and powerful features that are not possible with any one kind alone. These include:

  • the ability to write polymorphic queries in near-natural language,
  • to construct polymorphic views that extend themselves when the schema is extended, and
  • to perform polymorphic deductions that generate new data using rules.

A fully polymorphic database also displays model polymorphism, that is, the ability to natively model data from relational, document, graph, and other database paradigms.


What makes TypeDB the polymorphic database?

TypeDB is a polymorphic database with a strongly typed schema for defining inheritance and interfaces, a variablizable query language for composing declarative polymorphic queries, and a type inference engine for resolving queries against the schema. TypeDB schemas implement the polymorphic entity-relation-attribute (PERA) model for data, an extension of Chen’s entity-relationship (ER) model. The ER model is the most widely used tool for building conceptual data models prior to translating them into the logical bounds of a database paradigm.

A conceptual data model built on a type-theoretic language

TypeDB implements the PERA model as its database paradigm, which allows schemas to be built directly from the conceptual models that people use to represent domains and their data. Schemas and queries for TypeDB are described using TypeQL, its type-theoretic query language. By combining a strongly typed schema, a fully variablizable query language, and a type inference engine, TypeDB implements all three fundamental kinds of polymorphism, making it a truly polymorphic database.

Types are defined in a hierarchy

The PERA model is described by types, defined in the schema as templates for data instances and analogous to classes in OOP. Each user-defined type extends one of the three root types: entity, relation, and attribute, or a previously user-defined type.

  • Entities represent independent concepts.
  • Relations represent concepts that depend on one or more roles that can be played by entities and other relations.
  • Attributes represent properties that entities and relations can have, and they are defined with a declared value type and instantiated with a particular value.

Entities, relations, and attributes can all be either concrete or abstract, and roles in relations can be overridden by their subtypes.

#Schema definition of the user-employee object model in TypeQL
#TypeQL is TypeDB's query language

define

id sub attribute, abstract, value string;
email sub id;
name sub id;
path sub id;

user sub entity;
admin sub user;
user-group sub entity;
resource sub entity, abstract;
file sub resource;

ownership sub relation, abstract,
    relates owned,
    relates owner;
group-ownership sub ownership,
    relates group as owned;
resource-ownership sub ownership,
    relates resource as owned;

Behaviors are implemented and inherited

Once entity, relation, and attribute type hierarchies have been defined, interfaces are created by declaring the attributes that entities and relations own, and the roles they play in relations. Declared attributes owned and roles played are independent of each other, and multiple entity and relation types can own the same attribute or play the same role. The attributes a type owns and the roles it plays are inherited by its subtypes, and can be overridden to control the specificity of these properties.

#Attribute ownership in an entity 
#and entity role definition in a relation in TypeQL

define

user owns email,
    plays resource-ownership:owner;
admin plays group-ownership:owner;
user-group owns name,
    plays group-ownership:group,
    plays resource-ownership:owner;
resource owns id,
    plays resource-ownership:resource;
file owns path as id;

Data is semantically validated

With the type hierarchies and interfaces defined in the schema, data can be instantiated with an insert query. Data instances and types are referred to by variables of the form $variable-name, which exist in the scope of the query and can be reused to describe complex data patterns. Relations are defined with a tuple of roleplayers and the roles they play. Insert queries undergo semantic validation against the schema, ensuring that the inserted data patterns are valid. Queries that would insert data not allowed by the schema are rejected.
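
The validation idea can be sketched outside the database as a simple check of role players against declared capabilities. The Java below is a toy under stated assumptions, not how TypeDB validates inserts internally: `RoleSchema`, `canPlay`, and `validateRelation` are hypothetical names, and the type and role labels are borrowed from the schema defined earlier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy validator: records which roles each type declares, plus its supertype.
class RoleSchema {
    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, Set<String>> plays = new HashMap<>();

    // Registers a type; pass null as the supertype for a root type.
    RoleSchema type(String label, String supertype, String... roles) {
        parent.put(label, supertype);
        plays.put(label, new HashSet<>(Arrays.asList(roles)));
        return this;
    }

    // A type may play a role if it, or any of its supertypes, declares it.
    boolean canPlay(String type, String role) {
        for (String t = type; t != null; t = parent.get(t)) {
            if (plays.getOrDefault(t, Set.of()).contains(role)) return true;
        }
        return false;
    }

    // Rejects the relation if any role player is not allowed by the schema.
    void validateRelation(Map<String, String> roleToPlayerType) {
        for (Map.Entry<String, String> entry : roleToPlayerType.entrySet()) {
            if (!canPlay(entry.getValue(), entry.getKey())) {
                throw new IllegalArgumentException(
                    entry.getValue() + " cannot play " + entry.getKey());
            }
        }
    }
}

class RoleSchemaDemo {
    public static void main(String[] args) {
        RoleSchema schema = new RoleSchema()
            .type("user", null, "resource-ownership:owner")
            .type("admin", "user", "group-ownership:owner")
            .type("user-group", null,
                  "group-ownership:group", "resource-ownership:owner")
            .type("resource", null, "resource-ownership:resource")
            .type("file", "resource");

        // Accepted: an admin may own a group.
        schema.validateRelation(Map.of(
            "group-ownership:group", "user-group",
            "group-ownership:owner", "admin"));

        // Rejected: a plain user does not play group-ownership:owner.
        try {
            schema.validateRelation(Map.of(
                "group-ownership:group", "user-group",
                "group-ownership:owner", "user"));
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

Because capabilities are inherited, `admin` can also own resources and `file` can be an owned resource without either type declaring those roles itself.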

#Compound insert statement that efficiently
#populates users, group definitions, assets 
#and various ownership relations into 
#a trivial IAM database in TypeDB.

insert

$naomi isa admin, has email "naomi@vaticle.com";
$amos isa user, has email "amos@vaticle.com";
$engineers isa user-group, has name "engineers";
$benchmark isa file, has path "/amos/benchmark-results.xlsx";
$roadmap isa file, has path "/vaticle/feature-roadmap.pdf";

(group: $engineers, owner: $naomi) isa group-ownership;
(resource: $benchmark, owner: $amos) isa resource-ownership;
(resource: $roadmap, owner: $engineers) isa resource-ownership;

Data is queried polymorphically

Data is queried with high-level patterns, in which any element can be variablized. Queries are analyzed by the type inference engine before going to the query planner. It resolves polymorphism by identifying possible types that could fit patterns as defined by the schema, and queries return instances of all those types:

  • Querying a supertype returns instances of subtypes that inherit from it (i.e. inheritance polymorphism).

  • Querying an interface returns instances of types that implement it (i.e. interface polymorphism).

  • Querying a variablized type parametrically returns instances of all types that match the pattern (i.e. parametric polymorphism).

#An efficient 6-line polymorphic query in TypeDB 
#retrieves all instances of ownership relations 
#(i.e. group-ownership and resource-ownership)
#and their related data.

match
(owned: $object, owner: $owner) isa! $ownership-type;
fetch
$ownership-type;
$object: id;
$owner: id;

# Results:
[{
    "ownership-type": { "root": "relation", "label": "group-ownership" },
    "object": {
        "type": { "root": "entity", "label": "user-group" },
        "id": [
            { "value": "engineers", "value_type": "string",
              "type": { "root": "attribute", "label": "name" } }
        ]
    },
    "owner": {
        "type": { "root": "entity", "label": "admin" },
        "id": [
            { "value": "naomi@vaticle.com", "value_type": "string",
              "type": { "root": "attribute", "label": "email" } }
        ]
    }
},
{
    "ownership-type": { "root": "relation", "label": "resource-ownership" },
    "object": {
        "type": { "root": "entity", "label": "file" },
        "id": [
            { "value": "/amos/benchmark-results.xlsx",
              "value_type": "string", 
              "type": { "root": "attribute", "label": "path" }
            }
        ]
    },
    "owner": {
        "type": { "root": "entity", "label": "user" },
        "id": [
            { "value": "amos@vaticle.com", "value_type": "string", 
              "type": { "root": "attribute", "label": "email" } }
        ]
    }
},
{
    "ownership-type": { "root": "relation", "label": "resource-ownership" },
    "object": {
        "type": { "root": "entity", "label": "file" },
        "id": [
            { "value": "/vaticle/feature-roadmap.pdf", 
              "value_type": "string", 
              "type": { "root": "attribute", "label": "path" }
            }
        ]
    },
    "owner": {
        "type": { "root": "entity", "label": "user-group" },
        "id": [
            { "value": "engineers", "value_type": "string", 
              "type": { "root": "attribute", "label": "name" } }
        ]
    }
}]

How does TypeDB impact database engineering?

A unified way of working with data in the database and application. TypeDB provides a database that natively models and implements complex object-oriented data structures like abstraction, inheritance, and polymorphism. With these capabilities, TypeDB enables engineers to work with flexible and adaptable data models, making it easier to manage, query, and reason over complex data structures.

By integrating a conceptual data model, a strong subtyping system, a symbolic reasoning engine, and a type-theoretic language, TypeDB has redefined database architecture and achieved the native expressivity required for modern applications.

// Class inheritance in Java of a user-employee model 

class User {
    String email;
    String name;

    User(String email, String name) {
        this.name = name;
        this.email = email;
    }
}

class Employee extends User {
    int employeeId;

    Employee(String email, String name, int employeeId) {
        super(email, name);
        this.employeeId = employeeId;
    }
}

class PartTimeEmployee extends Employee {
    int weeklyHour;

    PartTimeEmployee(String email, String name, 
                     int employeeId, int weeklyHour) {
        super(email, name, employeeId);
        this.weeklyHour = weeklyHour;
    }
}

// Class instantiation in Java

PartTimeEmployee john = 
    new PartTimeEmployee("john.doe@vaticle.com",
                         "John Doe", 346523, 35);

#The TypeQL schema definition and insert query for the user-employee model

define

email sub attribute, value string;
name sub attribute, value string;
employee-id sub attribute, value long;
weekly-hour sub attribute, value long;

user sub entity,
    owns name,
    owns email;

employee sub user,
    owns employee-id;

part-time-employee sub employee,
    owns weekly-hour;

insert $john isa part-time-employee,
    has email "john.doe@vaticle.com", 
    has name "John Doe",
    has employee-id 346523, 
    has weekly-hour 35;

Moreover, as a truly polymorphic database, TypeDB offers a number of unique capabilities not natively attainable with other database paradigms:

  1. Object Model Parity: TypeDB’s type-theoretic schemas allow for perfect parity with object models. Constructs that can be challenging to natively model in other databases are simple and intuitive in TypeDB. Utilize type hierarchies, abstract types, multivalued attributes, n-ary relations, nested relations, variadic relations, and complex cardinality constraints all without normalization, reification, or any other process that warps the conceptual model.

  2. Continuous Extensibility: Types can be polymorphically queried, so the results of queries automatically extend to include new valid types that are added to the schema after the query is written. This minimizes the need to maintain and update queries when the schema is extended, so long as the semantic intent of the query remains the same. With careful design of the initial schema to ensure extensibility, migrations can be entirely avoided.

  3. Logical Abstractions: TypeDB allows for logical abstractions by defining rules for the native symbolic reasoning engine. They are written using the same patterns as queries and resolved against the schema, combining the flexibility of polymorphism with the power of symbolic reasoning. By employing sequential and recursive triggering of rules, extremely complex logical behaviors can arise from rules that are individually very simple, mirroring the true semantic logic of complex data domains.

  4. Data Consistency: When utilizing symbolic reasoning, new fact generation occurs at query time, ensuring that generated data is never stale. By using rules to govern data dependencies within the database, all inferred data can be made to have a single source of truth, preventing data conflicts from ever occurring. This ensures that the database is always in a consistent and current state, preventing the need for precomputation cycles.

  5. Unified Data Models: The high-level conceptual data model means that schemas in relational, document, and graph databases can be translated with no loss of information, and then enhanced with TypeDB’s unique modeling constructs. This allows for easy transfer of data from other databases, as part of migrations or a multi-database architecture.

  6. Near-Natural Language: Due to the declarative nature of TypeQL and its careful syntax design, schemas and queries read close to natural language. This allows engineers to state the intent of their queries at the highest level, without having to use imperative language to describe low-level data structure. Because of this, engineers and domain experts can understand the intent of queries, even with no experience of writing them.

Though it is possible to implement these features in other databases by building a layer of abstraction over the database, the lack of native support means that engineers have to build, maintain, and debug these solutions themselves. Such solutions are also poorly optimized because they cannot take advantage of direct integration with the database. With TypeDB, all of these features are built-in, robust, and performant.
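
As a rough illustration of the recursive rule triggering mentioned under Logical Abstractions, the Java sketch below applies one very simple rule repeatedly until no new facts appear. It is a naive forward-chaining toy and makes no claim about how TypeDB's reasoner works internally. It also derives facts eagerly, whereas the article describes inference at query time. The `TransitiveRule` class and the "contains" facts are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TransitiveRule {
    // Each fact is a two-element list [a, b], read as "a contains b".
    static Set<List<String>> infer(Set<List<String>> stored) {
        Set<List<String>> facts = new HashSet<>(stored);
        boolean changed = true;
        while (changed) {                  // repeat until a fixpoint is reached
            changed = false;
            List<List<String>> snapshot = new ArrayList<>(facts);
            for (List<String> ab : snapshot) {
                for (List<String> bc : snapshot) {
                    // Rule: contains(a, b) and contains(b, c) implies contains(a, c).
                    if (ab.get(1).equals(bc.get(0))) {
                        changed |= facts.add(List.of(ab.get(0), bc.get(1)));
                    }
                }
            }
        }
        return facts;
    }

    public static void main(String[] args) {
        Set<List<String>> stored = Set.of(
            List.of("/", "/vaticle"),
            List.of("/vaticle", "/vaticle/docs"),
            List.of("/vaticle/docs", "/vaticle/docs/roadmap.pdf"));

        Set<List<String>> all = infer(stored);
        // prints true: derived from the three stored facts by repeated rule application
        System.out.println(all.contains(List.of("/", "/vaticle/docs/roadmap.pdf")));
    }
}
```

The rule itself is a single line, yet applying it to its own output yields containment across any depth, which is the behavior the article attributes to sequential and recursive rules.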


Welcome to the dawn of the polymorphic database

TypeDB enables engineers to work with flexible and adaptable data models, making it easier to manage, query, and reason over complex data structures. By integrating a conceptual data model, a strong subtyping system, a symbolic reasoning engine, and a type-theoretic language, TypeDB has redefined database architecture and achieved the native expressivity required for modern applications.

To learn more about TypeDB and its features, please visit https://typedb.com and https://typedb.com/features

To download TypeDB, please visit https://typedb.com/deploy

To read TypeDB's documentation, please visit https://typedb.com/docs
