A simple concept at the core of a new way of processing data
What's a streaming graph?
When we first released Quine streaming graph last year, we had to answer this question a lot. After all, a "streaming graph" had never existed before.
As interest grew, we got pretty good at answering, usually something like this: Quine is a real-time event processor like Flink or ksqlDB. It consumes data from sources like Kafka and Kinesis, queries for complex patterns in event streams, and pushes results to the next hop in the streaming architecture the instant a match is made. However, unlike those venerable systems, Quine uses a graph data structure.
Hence, streaming graph.
That seemed to work and, engineers being a curious lot, led inevitably to a second question: "How's it different from a graph database?"
That's a fun question to answer, because it means we get to talk about idFrom(). And explaining idFrom() allows us to begin to unpack all the interesting architectural properties that make Quine uniquely well-suited for real-time complex event processing.
"Big things have small beginnings." -- David from the film Prometheus (2012)
Event-driven: what if we stopped querying databases?
Unlike a graph database, which relies on an index to query for the existence of data in the graph, Quine uses idFrom(), a custom Cypher function. idFrom() generates a unique node ID from a set of user-provided arguments -- most commonly taken from the data in the event stream itself -- which is then used in lieu of an index to locate and operate on a node and its properties. (We will get to the why in a bit, but it helps first to look at how you use idFrom().)
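The key property is determinism: the same inputs always produce the same ID. Quine's internal hashing isn't what's shown here, but the behavior can be sketched with Python's uuid5 (the namespace constant below is an arbitrary assumption of the sketch, not Quine's):

```python
import uuid

# Arbitrary namespace for this sketch -- not Quine's actual scheme.
NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")

def id_from(*args):
    """Stand-in for Quine's idFrom(): deterministically derive a
    node ID from a tuple of values. Same values in, same ID out."""
    key = "\x1f".join(str(a) for a in args)  # separator avoids accidental collisions
    return uuid.uuid5(NAMESPACE, key)

# The same inputs always yield the same node ID...
assert id_from("revision", 1869025669) == id_from("revision", 1869025669)
# ...and different inputs yield different IDs.
assert id_from("revision", 1869025669) != id_from("page", 1869025669)
```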
Say you want to analyze an event stream of edits from Wikipedia to keep an eye out for edits made by specific authors to specific articles in specific databases.
The JSON record (a pared-back version of the actual Wikipedia event feed used in the Wikipedia API recipe featured in our docs) might look like this:
```json
{
  "$schema": "/mediawiki/revision/create/1.1.0",
  "database": "wikidatawiki",
  "page_id": 83996749,
  "rev_id": 1869025669,
  "rev_timestamp": "2023-04-05T18:18:23Z",
  "performer": {
    "user_is_bot": true,
    "user_id": 6135162
  },
  "rev_parent_id": 1869025663
}
```
To create the nodes in a continuous stream of records, you would use MATCH to declare the node names, then call the idFrom() function to generate unique node IDs based on the values in the JSON itself.
```cypher
MATCH (revNode), (pageNode), (dbNode), (userNode), (parentNode)
WHERE id(revNode) = idFrom('revision', $that.rev_id)
  AND id(pageNode) = idFrom('page', $that.page_id)
  AND id(dbNode) = idFrom('db', $that.database)
  AND id(userNode) = idFrom('id', $that.performer.user_id)
  AND id(parentNode) = idFrom('revision', $that.rev_parent_id)
```
For now, we can skip adding properties to nodes, but it helps our discussion to complete this simple graph by adding relationships between the nodes:
```cypher
CREATE (revNode)-[:IN]->(dbNode),
       (revNode)-[:TO]->(pageNode),
       (userNode)-[:MADE]->(revNode),
       (parentNode)-[:NEXT]->(revNode)
```
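Taken together, the two queries above amount to a small deterministic mapping from each event to node IDs and edges. Here is a minimal Python sketch of that mapping, using a hypothetical id_from helper (Python's uuid5 standing in for Quine's internal hashing; the namespace and function names are assumptions of the sketch):

```python
import uuid

# Arbitrary namespace for this sketch -- not Quine's actual scheme.
NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")

def id_from(*args):
    # Stand-in for Quine's idFrom(): same values in, same ID out.
    return uuid.uuid5(NAMESPACE, "\x1f".join(str(a) for a in args))

def event_to_graph(event):
    """Map one revision-create event to deterministic node IDs and labeled edges."""
    rev    = id_from("revision", event["rev_id"])
    page   = id_from("page", event["page_id"])
    db     = id_from("db", event["database"])
    user   = id_from("id", event["performer"]["user_id"])
    parent = id_from("revision", event["rev_parent_id"])
    edges = [
        (rev, "IN", db),        # (revNode)-[:IN]->(dbNode)
        (rev, "TO", page),      # (revNode)-[:TO]->(pageNode)
        (user, "MADE", rev),    # (userNode)-[:MADE]->(revNode)
        (parent, "NEXT", rev),  # (parentNode)-[:NEXT]->(revNode)
    ]
    nodes = {"revision": rev, "page": page, "db": db, "user": user, "parent": parent}
    return nodes, edges

# Feeding in the sample event from above:
sample = {
    "database": "wikidatawiki", "page_id": 83996749, "rev_id": 1869025669,
    "performer": {"user_is_bot": True, "user_id": 6135162},
    "rev_parent_id": 1869025663,
}
nodes, edges = event_to_graph(sample)
```

Because the IDs are derived from the event values, replaying the same event maps to exactly the same five nodes.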
Now, as each event streams in, Quine will create and connect nodes, forming the desired subgraph that looks like this:
You can see the same subgraph with node IDs no longer concealed by the node labels:
Note the things you didn't have to do to create this graph:
- Query to find out whether the node already exists
- Consult a schema
Quine eliminates the need to check whether a node exists before completing an operation.
The deterministic nature of node IDs created using idFrom() means a value or combination of values passed to the function will always result in the same ID, so an operation either creates a new node at that ID or, if the node already exists, updates it.
In the latter case, because Quine is an event-sourced system, it doesn't need to look the node up first: Quine appends the update to the existing node, preserving historical versions that can be retrieved using idFrom() together with the at.time query parameter.
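As a conceptual illustration only (not Quine's implementation), an event-sourced node store can be sketched as an append-only journal keyed by the deterministic node ID, with historical reads replaying the journal up to a timestamp:

```python
from collections import defaultdict

class EventSourcedStore:
    """Toy event-sourced node store: updates are appended, never read-then-written."""

    def __init__(self):
        # node ID -> ordered list of (timestamp, property updates)
        self.journal = defaultdict(list)

    def upsert(self, node_id, t, props):
        # No existence check needed: the append either creates
        # the journal for a new node or extends an existing one.
        self.journal[node_id].append((t, props))

    def state_at(self, node_id, t):
        # Replay updates up to time t -- an "at.time"-style historical read.
        state = {}
        for ts, props in self.journal[node_id]:
            if ts <= t:
                state.update(props)
        return state

store = EventSourcedStore()
store.upsert("n1", 1, {"user_is_bot": True})
store.upsert("n1", 2, {"user_id": 6135162})
```

Reading the node at time 1 returns only the first update; reading at time 2 returns the merged state, and neither write ever had to check what was there before.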
idFrom() and CRUD operations: why Quine is so dang fast
Inasmuch as Quine uses a hash of a value to generate a node ID that is then used for CRUD operations, it bears a superficial similarity to NoSQL key-value stores. As long as you know either the ID or the value, it is dead simple to retrieve data from the graph.
However, because of Quine's in-memory graph structure, it is far more efficient than key-value databases when operating on patterns, ranges (e.g., time-ordered data), or otherwise related data.
Using the node ID to anchor the query, you specify the edges to traverse to find connected data.
This might be a query to retrieve a node's properties using its node ID (in this case, for revNode):

```cypher
MATCH (n) WHERE strId(n) = "8b290926-271c-3497-b5d6-e30fcf934a73" RETURN id(n), properties(n)
```
Which delivers these results:
If you don't know a node's ID, you can query for it using the node's properties and the strId() function:

```cypher
MATCH (userNode:user {user_is_bot: true}) RETURN DISTINCT strId(userNode)
```
But what about more complex queries -- for example, a query that must retrieve multiple related objects? Key-value stores are famously inefficient in this scenario, but this is precisely where Quine's architectural choices pay off. Using an in-memory graph structure means you can query for any node in a subgraph, follow its edges, and produce one or more values.
For example, say you want to find all revisions where a bot made an update to the 'wikidatawiki' database:
```cypher
MATCH (userNode:user {user_is_bot: true})-[:MADE]->(revNode:revision)-[:TO]->(pageNode:page)-[:IN]->(dbNode:db {database: "wikidatawiki"})
RETURN DISTINCT id(revNode) AS id, id(userNode) AS id2
```
Either way, it starts with setting the node ID with idFrom(). And idFrom() makes Quine very, very fast.
Standing queries and querying data from the future with idFrom()
Standing queries live inside the graph, monitoring the stream for specific patterns. Register one once and it propagates through the graph, persisting and watching for matches without you ever having to issue the query again.
Once matches are found, standing queries trigger actions using those results (e.g. report results, execute code, transform other data in the graph, publish data to another source).
To do this, every standing query has two parts: the pattern portion (the sub-graph you are matching for in the event stream) and the outputs portion (the action you wish to take).
Adapted from the recipe used in Getting Started, here's a standing query that monitors for non-bot revisions to the 'enwiki' database and outputs these events to the terminal:
```yaml
standingQueries:
  - pattern:
      query: |-
        MATCH (userNode:user {user_is_bot: false})-[:MADE]->(revNode:revision {database: 'enwiki'})
        RETURN DISTINCT id(revNode) as id
      type: Cypher
    outputs:
      print-output:
        type: CypherQuery
        query: |-
          MATCH (n)
          WHERE id(n) = $that.data.id
          RETURN properties(n)
        andThen:
          type: PrintToStandardOut
```
Standing query matches printing to console.
Because standing queries persist in the graph, incrementally updating partial results as new data arrives, you are not just querying the past and present state, you are setting up queries for data yet to arrive.
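To make the incremental idea concrete, here is a toy Python matcher for the non-bot-revision pattern above: it keeps partial matches in memory and fires its output action the instant an arriving event completes the pattern. The event shapes and class names are invented for this sketch; Quine's real mechanism is the actor-based graph itself, not a matcher like this:

```python
class StandingQuery:
    """Toy standing query: match (user, not bot)-[:MADE]->(revision in 'enwiki')."""

    def __init__(self, on_match):
        self.non_bot_users = set()  # partial state: users confirmed non-bot
        self.pending = []           # (user_id, rev_id) edges waiting on user info
        self.on_match = on_match    # action to run for each completed match

    def on_event(self, event):
        if event["type"] == "user" and not event["user_is_bot"]:
            self.non_bot_users.add(event["user_id"])
            # A late-arriving user event can complete earlier partial matches.
            still_pending = []
            for uid, rev in self.pending:
                if uid in self.non_bot_users:
                    self.on_match(rev)
                else:
                    still_pending.append((uid, rev))
            self.pending = still_pending
        elif event["type"] == "revision" and event["database"] == "enwiki":
            if event["user_id"] in self.non_bot_users:
                self.on_match(event["rev_id"])  # pattern already complete
            else:
                self.pending.append((event["user_id"], event["rev_id"]))

matches = []
sq = StandingQuery(matches.append)
# The revision arrives before we know anything about user 7...
sq.on_event({"type": "revision", "database": "enwiki", "user_id": 7, "rev_id": 100})
# ...and the match fires the moment the user event completes the pattern.
sq.on_event({"type": "user", "user_id": 7, "user_is_bot": False})
```

Note the order independence: the match completes correctly even though the revision arrived before the user, which is the essence of querying data "from the future."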
And while idFrom() is a key part of what makes standing queries possible, to really understand what makes Quine function so efficiently as a stream processor, we'll need to dive into the actor-based, graph-shaped compute model. But that's for a different post.
Instead, I'll leave you with a clever use of idFrom() employed by developers at a SaaS company that uses Quine.
Partitioning Key Spaces for a SaaS application using idFrom()
Since you can generate a node ID by passing an arbitrary combination of values to idFrom(), some Quine users with SaaS or internal multi-tenant applications have employed it to partition graphs by customer namespace or a similar property.
Sticking with the Wikipedia example, you could create distinct sub-graphs corresponding to each of the database types by adding $that.database as an additional value determining each node ID:
```cypher
MATCH (revNode), (pageNode), (dbNode), (userNode), (parentNode)
WHERE id(revNode) = idFrom('revision', $that.rev_id, $that.database)
  AND id(pageNode) = idFrom('page', $that.page_id, $that.database)
  AND id(dbNode) = idFrom('db', $that.database)
  AND id(userNode) = idFrom('id', $that.performer.user_id, $that.database)
  AND id(parentNode) = idFrom('revision', $that.rev_parent_id, $that.database)
```
This creates a series of subgraphs partitioned by database and would allow you to be certain that if you query for data related to a specific database, you won't inadvertently return data from others.
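Sketched with the same uuid5 stand-in for idFrom() used earlier (the namespace constant is an assumption of the sketch), the compound key yields disjoint ID spaces per database:

```python
import uuid

# Arbitrary namespace for this sketch -- not Quine's actual scheme.
NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")

def id_from(*args):
    # Stand-in for Quine's idFrom(): deterministic ID from a tuple of values.
    return uuid.uuid5(NAMESPACE, "\x1f".join(str(a) for a in args))

# The same rev_id in two databases produces two distinct nodes...
a = id_from("revision", 1869025669, "wikidatawiki")
b = id_from("revision", 1869025669, "enwiki")
assert a != b
# ...while repeating the compound key always finds the same node.
assert a == id_from("revision", 1869025669, "wikidatawiki")
```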
And while the chance of key collision exists, it is vanishingly small, making this approach suitable for use in multi-tenant SaaS applications.
At any rate, this accomplished what the company wanted: a partitioned graph for data separation, standing and ad hoc queries that work the same across the entire graph, and a single real cost -- the discipline of always using the compound key.
Pretty clever.
If any of this inspires you or piques your interest and you want to try Quine yourself, check out the Getting Started docs.