Christian Miller

Log Streaming - what we got wrong and how we fixed it

Logs: often overlooked and disregarded until you need them! “Wait, what happened? Why did this fail? OMG, why aren’t we logging this error case!” At DevZero, like all our tech brethren, we use logs to figure out what is happening in our applications. But we also show logs to our users so they can figure out what’s going on too, and that’s what this post is all about: what we did before, what went wrong, and how we’ve been improving it.

Why logs

A quick primer on DevZero: we’re building a platform (of course 😉) for engineers to develop, build, and deploy their software. Initially we focused on the “developing” part, meaning a platform for engineers to develop in a remote environment, the benefits of which I’ll just footnote now[1]. But any remote environment brings challenges, so logs are critical for figuring out what the heck is going on.

Where we started

At first, when we provisioned a development environment (let’s call it a “workspace” for now), we would run some steps codified in YAML (again, of course 😉) and supplied by the user. Imagine steps like “install kubectl & helm” or “clone this repo and run npm run build”. We would execute those steps in the remote environment, and then, magically, when you got your workspace, everything was set up just right 🎉... except sometimes something wouldn’t work: a mistyped install URL, a package that no longer exists, missing environment variables, etc. Users needed a way to see what happened so they could fix the issue.

An aside on other logs

For this post we’re focusing on logs intended for end users, not the internal logging we might add for our control plane, you know, messages like log.Debug("TODO: figure out looping...", zap.Int("iter", i))[2]. Instead, we’re thinking about how to make the remote development environment more like a local one, so if you add a step like apt install jq you might expect output like:

Reading package lists... Done  
Building dependency tree... Done  
Reading state information... Done  
The following additional packages will be installed:  
  libjq1 libonig5  
The following NEW packages will be installed:  
  jq libjq1 libonig5  
etc.

First Attempt

So, our users needed to see the logs from when a step was run. Our first pass was quick and simple: a daemon process would execute the steps in the workspace[3] and, once the command was done, POST the entire stdout/stderr output to our control plane (via our API-gateway, since the VMs existed outside the control plane), where we’d stash it with really minimal identifiers in a big ol’ table in our primary SQL database. Then, when we wanted to show the logs, we’d make a GET request and return a big ol’ JSON response. The code was straightforward and looked something like:

// shell out to execute the command and capture all of its output at once
out, err := exec.Command("su", "devzero", "-c", "apt install jq").CombinedOutput()
if err != nil {
  // handle err
}
// send the entire output to the gateway in a single request
if err := api.PostOutput(out); err != nil {
  // handle err
}


This “simple” approach suffers from some drawbacks:

  1. For DevZero: plenty of commands, e.g. apt install jq, execute relatively quickly (a few seconds) and with minimal output (less than 10KB), but other commands, for example npm build, installing some Python packages, or who knows what, can generate a lot of output (on the order of megabytes). Although that isn’t THAT big, our processing was quite inefficient: we would unmarshal the responses in the API-gateway and then again in our backend, which led to CPU and memory spikes in our control plane. Finally, using MySQL we wrote all of the output at once into our table, aptly named log_outputs, and encountered some additional problems:
    1. Large requests could be truncated: we used the BLOB datatype[4], which only accommodates content up to ~65KB. The fix was relatively small, batching requests with a maximum size and chunking the writes (see the sketch after this list), but it was another complication.
    2. Our database wasn’t prepared for the volume of data or the frequency of writes, leading to running out of disk space, slow queries, and too many connections. Simply put, we needed to scale our database.
  2. For our users: we’d collect all the logs for a step once it completed and only then send them to our control plane. For long-running steps that meant customers had to wait a long time to see any output. Our infrastructure problem was also a user experience problem! This was the largest motivator for making changes to our system.
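
The truncation fix from point 1.1 is worth a quick illustration: split the payload before it hits the BLOB ceiling and write each piece as its own row. Here’s a minimal sketch, assuming a seq column for ordering; the writeOutputChunks helper is hypothetical, not our actual code:

package logs

import (
    "context"
    "database/sql"
    "fmt"
)

// maxChunk stays safely under the ~65KB limit of a MySQL BLOB column.
const maxChunk = 60 * 1024

// writeOutputChunks splits a command's output into BLOB-sized pieces and
// writes each piece as its own row, keyed by step ID and sequence number.
func writeOutputChunks(ctx context.Context, db *sql.DB, stepID string, output []byte) error {
    for seq := 0; len(output) > 0; seq++ {
        n := len(output)
        if n > maxChunk {
            n = maxChunk
        }
        if _, err := db.ExecContext(ctx,
            `INSERT INTO log_outputs (step_id, seq, body) VALUES (?, ?, ?)`,
            stepID, seq, output[:n]); err != nil {
            return fmt.Errorf("writing chunk %d: %w", seq, err)
        }
        output = output[n:]
    }
    return nil
}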

First Attempt Architecture

architecture diagram

Second Attempt

For our second pass we knew we wanted to solve the user experience problem by streaming the output from the commands the daemon was running. In addition, we wanted to solve some of the infrastructure problems (database issues, inefficient processing, etc.).

The volume of our table and the CPU and memory usage of the database really weren’t such a problem in this wonderful “cloud infrastructure” world. We spun up a second dedicated SQL database just for logs; this kept our application code almost exactly the same but quickly alleviated the slow queries and the disk space issues. We knew it still wasn’t the “forever” solution, but it worked well for our scale.[5]

The bigger challenge was getting output from our daemon to our control plane, and on to our users, as soon as each log message was produced. This required changes in our daemon, the control plane (backend and API-gateway), and our frontend.

Daemon Changes

The first change was getting the daemon to stream output to the control plane as it was produced. Imagine the command:

for i in $(seq 1 120); do echo $i && sleep 1; done

Although this is a boring command, waiting 2 minutes to see the output is too much! Let’s start seeing 1 2 3 as soon as possible! We switched from CombinedOutput to Start, but also needed to handle stdout and stderr more carefully, and we added some abstraction so we could test how we piped output. We ended up with something a little more complicated, but it worked well and looked something like this:

// write the commands to a tempfile to be executed
c := exec.Command(shell, tempFile.Name())

// setup our writers
var out bytes.Buffer
// ReportingWriter sets up our writer which will send results to api-gateway
streamingWriter := NewReportingWriter()
tee := io.MultiWriter(&out, streamingWriter)

// Execute calls Start/Wait while setting streams with the tee writer and handling some timeouts
if err := streaming.Execute(c, tee); err != nil {
  // handle err
}

Our daemon would still send individual POST requests to the control plane (i.e., we didn’t implement true streaming from daemon to control plane), but now we would send the output much more frequently, and the end result for users was much closer to real time.
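
To make that concrete, here’s roughly the shape of the writer that NewReportingWriter hands back: an io.Writer whose Write forwards each chunk upstream. This is a minimal sketch with a hypothetical endpoint field; a production version would also need batching, retries, and identifiers for the workspace and step:

package daemon

import (
    "bytes"
    "fmt"
    "net/http"
)

// ReportingWriter implements io.Writer by forwarding each chunk of
// command output to the control plane as soon as it is written.
type ReportingWriter struct {
    endpoint string // hypothetical api-gateway URL
    client   *http.Client
}

func NewReportingWriter(endpoint string) *ReportingWriter {
    return &ReportingWriter{endpoint: endpoint, client: http.DefaultClient}
}

// Write posts the chunk and reports it fully written on success.
func (w *ReportingWriter) Write(p []byte) (int, error) {
    resp, err := w.client.Post(w.endpoint, "text/plain", bytes.NewReader(p))
    if err != nil {
        return 0, fmt.Errorf("posting output chunk: %w", err)
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return 0, fmt.Errorf("unexpected status %d", resp.StatusCode)
    }
    return len(p), nil
}

Because it satisfies io.Writer, a writer like this slots straight into the io.MultiWriter call above, so the same bytes land in the local buffer and on the wire.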

Control Plane Changes

API-Gateway

In order to handle these streaming requests and get faster feedback to users, we needed to implement streaming from our backend to our frontend via the API-gateway.

Streaming Options

gRPC

A little more background on how our control plane used to work: we had an HTTP-based API-gateway that talked to our monolithic gRPC backend service.[6] We thought about extending gRPC streaming from our backend all the way to clients... but the API-gateway handled auth and connections for us, doing gRPC in our JavaScript frontend wasn’t something we wanted to take on, and it meant we wouldn’t be able to curl our endpoints easily. To me there’s nothing quite like curl localhost:8000 to begin understanding a service.[7]

Websockets

Websockets would have been fun, but truthfully we didn’t need ‘em: all we wanted was to get output to our users quickly, and we didn’t need bi-directional communication.

Server-Sent Events

Server-Sent Events (SSE) are pretty cool, enabling real-time updates to clients. SSE is well supported by browsers and you can also “just curl it!”. One of the only drawbacks for us was that the OpenAPI support wasn’t great at the time (see this), but our frontend team was able to work around it. In addition, there wasn’t a tightly defined model for the log output, so the benefits of using our spec (strict types) weren’t as big here as in other parts of our API.

We ended up going with SSE, but it wasn’t as simple as “add Content-Type: text/event-stream to our handlers”.
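
For a sense of what the protocol involves before any library gets involved, here’s a bare-bones SSE endpoint using only Go’s standard library. This is a sketch for illustration, not our gateway code (which, as the next snippet shows, leaned on an sse package instead of hand-rolling this):

package gateway

import (
    "fmt"
    "net/http"
    "time"
)

// A minimal SSE endpoint: set the headers, then write "data:" frames and
// flush after each one so the client sees events immediately.
func logsHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }

    for i := 1; i <= 5; i++ {
        // Each event is one or more "data:" lines followed by a blank line.
        fmt.Fprintf(w, "data: line %d\n\n", i)
        flusher.Flush()
        select {
        case <-r.Context().Done(): // client went away
            return
        case <-time.After(time.Second):
        }
    }
}

Point curl at a handler like that (curl -N, to disable buffering) and each data: frame prints as it arrives.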

The API-Gateway translated between client requests and the backend, so we introduced a good chunk of code to go from streaming gRPC to SSE. Our handler code looked something like:

{
    // setup a new SSE server and a stream for this set of logs
    sseServer := sse.New()
    sseServer.CreateStream("someid")

    // Retrieve the log output messages from the backend as a gRPC stream
    backendStream, err := g.backend.GetCommands(r.Context(), req)
    if err != nil {
        // handle err
    }

    receivedEvents := make(chan Event)
    // Listener on the blocking backendStream
    go func(ctx context.Context) {
        for {
            message, err := backendStream.Recv()
            receivedEvents <- Event{
                Message: message,
                Err:     err,
            }
            if err != nil { // including io.EOF when the backend is done
                return
            }
        }
    }(r.Context())

    // Goroutine to loop and handle either an event received on the channel
    // or a ticker iteration (keep-alive); context cancellation/timeout is
    // skipped here for brevity
    go func() {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case e := <-receivedEvents:
                _, done := g.publishSSEMessage(w, sseServer, logger, templateApplicationId, e)
                if done {
                    return
                }
            case <-ticker.C:
                // publish a keep-alive so proxies don't close the connection
            }
        }
    }()

    // okay now serve!
    sseServer.ServeHTTP(w, r)
}

Phew! The nuances around timeouts, context cancellations, and keep-alives are mostly elided here, and it took a few PRs to carefully implement the streaming smoothly from backend to API-gateway and from API-gateway to our client.

Backend

Our API-Gateway could now accommodate streaming, but we also needed our backend and storage layer to support the same. Our implementation for this was a bit longer than the code needed in our gateway, but boiled down to:

  • Polling MySQL for new output based on an auto-incrementing ID
    • specifically BIGINT UNSIGNED NOT NULL AUTO_INCREMENT, which gives us as many IDs as we figured we could ever want[8]
  • Sending the results over a channel to the gRPC stream (sketched below)
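
A minimal sketch of that polling loop, with illustrative table and column names (a fuller version would also need backoff and a way to signal that a step has finished):

package backend

import (
    "context"
    "database/sql"
    "time"
)

// pollLogs repeatedly asks for rows newer than the last ID we saw and
// pushes them onto a channel for the gRPC stream to drain.
func pollLogs(ctx context.Context, db *sql.DB, stepID string, out chan<- string) error {
    var lastID uint64
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            rows, err := db.QueryContext(ctx,
                `SELECT id, body FROM log_outputs WHERE step_id = ? AND id > ? ORDER BY id`,
                stepID, lastID)
            if err != nil {
                return err
            }
            for rows.Next() {
                var body string
                if err := rows.Scan(&lastID, &body); err != nil {
                    rows.Close()
                    return err
                }
                out <- body // the stream handler forwards this to the client
            }
            rows.Close()
            if err := rows.Err(); err != nil {
                return err
            }
        }
    }
}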

Second Attempt Architecture

second architecture diagram

Hey wait! What about the frontend?

I hear you...read on!

Third Time's The Charm

Well, everything was going just fine with logs, but in our core product we found that our VM-centered approach was too restrictive, and our applications needed to adapt to support a new approach based on containers and Kubernetes.

With that change we had more log-generating components: logs coming from builds, logs coming from workspaces as they started up, and logs generated in a couple of different backend services that we wanted surfaced to the user.

So what to do? We made our own. Enter the DevZero logging service: logsrv, pronounced “log serve” (stop: we are engineers, not marketing; this is as creative as naming gets).

Logsrv

Our current architecture for logsrv is quite a bit different from our initial implementations. logsrv endpoints are now directly exposed to end users, no API-gateway, since we’ve moved our auth middleware into the service as well. This let us clean up and remove a LOT of code that handled streams in both our backend service AND the API-gateway (all that code! 😱). Instead, logsrv handles everything related to streaming and storing the logs. The surface area is much smaller, and instead of logs being an afterthought tacked onto our primary service, the problem got its own consideration. Massive kudos to Brad, who’s doing his own thing now but was instrumental in designing and implementing logsrv.
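
In code terms, the shift is that logsrv wires up its own routes and wraps them with the auth middleware itself; nothing fancy, just no gateway in between. A sketch with hypothetical paths and a stand-in auth check, not the actual logsrv routes:

package main

import (
    "log"
    "net/http"
)

type server struct{}

func (s *server) streamLogs(w http.ResponseWriter, r *http.Request) { /* SSE out to clients */ }
func (s *server) ingestLogs(w http.ResponseWriter, r *http.Request) { /* log writes in */ }

// requireAuth stands in for the auth middleware that used to live in the
// API-gateway and now wraps logsrv's own routes.
func requireAuth(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if r.Header.Get("Authorization") == "" {
            http.Error(w, "unauthorized", http.StatusUnauthorized)
            return
        }
        next.ServeHTTP(w, r)
    })
}

func main() {
    s := &server{}
    mux := http.NewServeMux()
    mux.Handle("/v1/logs/stream", requireAuth(http.HandlerFunc(s.streamLogs)))
    mux.Handle("/v1/logs", requireAuth(http.HandlerFunc(s.ingestLogs)))
    log.Fatal(http.ListenAndServe(":8000", mux))
}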

final-architecture-diagram

Another change from our first two passes is that logs are now written to a queue from different places in our control plane rather than sent to a monolith, and logsrv processes those messages, writing them to a persistent SQL database (again, a really big table). This meant a given log message could be in the queue OR in the database, and we needed to handle both correctly. The code for this was pretty fun to write and test, and with our slimmed-down API surface and dedicated time to spend on the service, we were able to hit a bunch of edge cases and test them really well. The service has been humming along in prod for months with little churn and few operational issues since it was deployed.
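
That seam between the two stores is the interesting bit: a reader has to get everything already persisted, then keep going with what’s still arriving on the queue, without dropping or double-sending anything across the boundary. One way to handle it, assuming every message carries a monotonically increasing ID assigned on enqueue (illustrative types, not the actual logsrv code):

package logsrv

// LogMsg is an illustrative message shape.
type LogMsg struct {
    ID   uint64
    Body string
}

// replayThenTail serves persisted messages first, then live queue
// messages, using the IDs to skip anything the database already covered.
func replayThenTail(persisted []LogMsg, queue <-chan LogMsg, out chan<- LogMsg) {
    var lastID uint64
    for _, m := range persisted {
        out <- m
        lastID = m.ID
    }
    for m := range queue {
        if m.ID <= lastID {
            continue // already served from the database
        }
        out <- m
    }
}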

Making The Logs Visible

Now frontend stuff right?

🫣 I'm sorry, it was a trick; there's no way I could fit all the cool frontend stuff into this article, plus I didn't do most of it, so we're going to share that part of our story soon!

Ok, well, I'll show something. Here's how our logs are showing up now!

logs-in-devzero

What we learned

Skipping right to the end would have been great, but we learned with each iteration, and in the end I think our solution fits our needs really well. I’m not sure that if we had dedicated the time at the beginning, we would have ended up with a solution that kept up with the changes in our overall platform. Software is an iterative process!

While software is an iterative process, good design is also important. Before we built logsrv we realized we were going to need another pass at logs, and this time we wrote a proper design document (we call ‘em ERDs or PRDs). We were able to build consensus and try a few different approaches before we implemented logsrv fully.

Finally, our architecture became simpler, with a smaller API surface, and it became easier to run and test locally (or remotely, as we like to do 😉!). Having a single-purpose service has made developing and operating logsrv much simpler than before, when “logging” was spread around our codebase.

What’s next

Is our current solution perfect? Nah, there’s always room to improve, and we’re seeing interest from our customers in more functionality, like search. Also, as we continue to scale we may need to scale logsrv up and/or out; right now it runs as a singleton with pretty low overhead, but we’re seeing bursts of activity that can slow processing.

Well, that’s a bit about our journey with logs for our product. We’d love to show it off! Check out DevZero at devzero.io/dashboard and give it a spin!


  1. Consistent environments for teams, a way to share environment variables and secrets, better compute or more storage than available on local, etc.  

  2. It wasn’t just me! https://go.dev/wiki/LoopvarExperiment  

  3. At first our workspaces were individual VMs hosted on $LARGE_CLOUD_PROVIDER, now we use containers also on $LARGE_CLOUD_PROVIDER and in the future we’ll see what we do, but that’s a story for another time. 

  4. https://dev.mysql.com/doc/refman/9.0/en/storage-requirements.html#data-types-storage-reqs-strings  

  5. We decided not to scale up our primary database since the logging experience was still a bit experimental and later scaling down our primary database seemed riskier and more time consuming than scaling up a secondary database to exclusively handle the logs. 

  6. The monolith has since been replaced with a few different services which use OpenAPI as a spec, we still have the API-gateway and this can be a story for another time too 😀 

  7. gRPCurl and gRPC UI are neat too, but require another application to install and pointing to your proto or using server reflection, which we didn’t have at the time. 

  8. 2^64-1 https://dev.mysql.com/doc/refman/9.0/en/integer-types.html  
