Preface
I am passionate about software development, specifically the puzzle of ergonomically creating software systems that solve the broadest set of problems while making as few compromises as possible. I also like to think of myself as a systems developer, which, by Andrew Kelley's definition, means a developer interested in completely understanding the systems they are working with. In this blog I share my ideas on solving the following problem: building a reliable and performant full-stack enterprise application. Quite a challenge, isn't it? I will focus on the "performant web server" part - that's where I feel I can offer a fresh perspective; the rest is either well-trodden ground or something I have little to add to.
A major caveat - none of this has actually been tested. The code below is a set of untested sketches, and there are no benchmarks. Yep, this is a major flaw, but actually implementing and measuring all of this would take a lot of time, which I do not have, and between publishing a flawed blog and not publishing it at all, I chose the former. You've been warned.
And what pieces would we assemble our application out of?
- A frontend you are comfortable with, but if you want minimal dependencies - there is Zig in WASM form + HTMX.
- A Zig web server, closely integrated with the Linux kernel. This is the performant part, and the one I will focus on in this blog.
- A Python backend, integrated with Zig. This is the complex part.
- Integration with durable execution systems such as Temporal and Flowable. This aids reliability, and will not be discussed in the blog.
With our tools decided on, let's start!
Are coroutines overrated anyway?
Zig has no language-level support for coroutines :( And coroutines are what every performant web server is built on. So, is there no point in trying?
Hold on, let's first put our systems programmer hat on. Coroutines are not a silver bullet - nothing is. What are the actual benefits and drawbacks involved?
It is common knowledge that coroutines (userspace threads) are more lightweight and faster. But in what way exactly? (The answers here are largely speculation - take them with a grain of salt and test it yourself.)
- They start out with far less stack space by default (on the order of 2 KB for a goroutine, versus the megabytes a default OS thread stack reserves). But this can be adjusted manually.
- They cooperate better with the userspace scheduler. The kernel scheduler is preemptive: the tasks performed by threads are allocated time slices, and if the actual tasks do not fit neatly into those slices, some CPU time is wasted. Goroutines, by contrast, pack as many micro-tasks from different goroutines as possible into the same time slice of the underlying OS thread.
The Go runtime, for example, multiplexes goroutines onto OS threads. Threads share the page table, as well as the other resources owned by a process. If we add CPU isolation and affinity to the mix, the threads will keep running on their respective CPU cores, the OS data structures will stay in memory with no need to be swapped out, and the userspace scheduler will allocate CPU time to goroutines with precision, because it uses a cooperative multitasking model. Can anything even compete with that?
The performance wins are achieved by sidelining the OS-level abstraction of a thread, and replacing it with that of a goroutine. But is nothing lost in the translation?
Can we cooperate with the kernel?
I'll argue that the "true" OS-level abstraction for an independent unit of execution is not even a thread - it is the OS process. In fact, the distinction is not as sharp as it seems: to the kernel, what separates a thread from a process is mostly what it shares with its parent. Whether file descriptors, virtual memory, signal handlers, and tracked resources are separate for the child is specified in the arguments to the clone syscall. Thus, I'll be using the term "process" to mean a thread of execution that owns its own system resources - primarily CPU time, memory, and open file descriptors.
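To make this concrete, here is a minimal sketch (untested; in C for brevity - Zig would issue the same call through its C interop) of how a single clone(2) call produces either a "process" or a "thread", depending purely on the flags. The worker function and its message are made up for illustration.

```c
// Hedged sketch: the same clone(2) call yields a process-like or a
// thread-like child depending on which resources the flags tell it to share.
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int worker(void *arg) {
    printf("child running: %s\n", (const char *)arg);
    return 0;
}

int main(void) {
    char *stack = malloc(STACK_SIZE);
    if (!stack) return 1;

    // Process-like child: nothing shared, parent notified via SIGCHLD.
    // Adding CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND | CLONE_THREAD
    // to the flags would instead give us what we usually call a thread.
    int flags = SIGCHLD;
    pid_t pid = clone(worker, stack + STACK_SIZE, flags,
                      (void *)"process-like child");
    if (pid == -1) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}
```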
Now why is this important? Each unit of execution has its own demands for system resources. A complex task can be broken down into units, each of which makes its own, fairly predictable, request for memory and CPU time. The further up the tree of subtasks you go, towards the more general task, the more the distribution of resource usage looks like a bell curve with long tails. It is your responsibility to make sure those tails do not overrun the system's resource limits. But how is that done, and what happens when a limit is in fact overrun?
If we use the model of a single process with many coroutines for independent tasks, then when one coroutine overruns the memory limit, the whole process is killed, because memory usage is tracked at the process level. That's the best case - if you make use of cgroups (which is automatically true for pods in Kubernetes, each of which gets its own cgroup), the OOM killer can take down the whole cgroup. A reliable system has to take this into account. And what about CPU time? If our service gets hit with many compute-intensive requests at the same time, it becomes unresponsive. Then deadlines, cancellations, retries, and restarts follow.
For most mainstream software stacks, the only realistic way to deal with these scenarios is to leave "fat" in the system - some unused resources for the tail of the bell curve - and to limit the number of concurrent requests, which, again, leads to unused resources. And even then, we will get OOM-killed or go unresponsive every once in a while - including for "innocent" requests that happen to share a process with the outlier. This compromise is acceptable to many, and it serves software systems well enough in practice. But can we do better?
A concurrency model
Since resource usage is tracked per process, ideally we would spawn a new process for each small, predictable unit of execution. Then we set the ulimit for CPU time and memory - and we're good to go! ulimits have soft and hard variants, which allows the process to terminate gracefully upon hitting the soft limit and, if it fails to do so (possibly due to a bug), to be terminated forcefully upon hitting the hard limit. Unfortunately, spawning new processes on Linux is slow, a process per request is not supported by many web frameworks, nor by other systems such as Temporal, and process context switches are more expensive - mitigated by CoW and CPU pinning, but still not ideal. Long-running processes are an inevitable reality.
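A minimal sketch of the per-process variant (untested; the limit values are made up): the freshly spawned worker caps its own CPU time and address space before touching the request.

```c
// Hedged sketch: cap a worker process's CPU time and memory right after
// it is spawned. Exceeding the soft CPU limit delivers SIGXCPU (a chance
// to clean up); exceeding the hard limit delivers SIGKILL.
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit cpu = {
        .rlim_cur = 2,  /* soft: 2 seconds of CPU time (made-up budget) */
        .rlim_max = 3,  /* hard: forceful termination at 3 seconds */
    };
    if (setrlimit(RLIMIT_CPU, &cpu) == -1) {
        perror("setrlimit(RLIMIT_CPU)");
        return 1;
    }

    struct rlimit mem = { .rlim_cur = 256u << 20, .rlim_max = 256u << 20 };
    /* RLIMIT_AS caps the address space: allocations beyond it fail with
       ENOMEM rather than triggering the OOM killer. */
    if (setrlimit(RLIMIT_AS, &mem) == -1) {
        perror("setrlimit(RLIMIT_AS)");
        return 1;
    }

    /* ... handle exactly one request, then exit ... */
    return 0;
}
```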
The further we stray from the clean abstraction of short-lived processes, the more OS-level work we need to take care of ourselves. But there are also benefits to be gained - such as using io_uring to batch IO across many threads of execution. In fact, if a large task is made up of sub-tasks, do we really care about their individual resource utilization? Only for profiling. If, for the large task, we can manage (cut off) the tails of the resource bell curve, that is good enough. So we could spawn as many processes as the number of requests we wish to handle simultaneously, keep them long-lived, and simply readjust the ulimit for each new request. When a request overruns its resource constraints, the process gets an OS signal and can terminate the request gracefully, without affecting other requests. Or, if the high resource usage is intentional, we could tell the client to pay for a higher resource quota. Sounds pretty good to me.
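Here is a hedged sketch of that long-lived variant (untested; the budget numbers are invented). Because RLIMIT_CPU counts the process's cumulative CPU time, the worker re-bases the soft limit from getrusage before each request, and a SIGXCPU handler flags the overrun so the request can be abandoned gracefully.

```c
// Hedged sketch: a long-lived worker gives each request a fresh CPU budget.
// RLIMIT_CPU is cumulative for the process, so the soft limit is re-based
// on the CPU time already consumed before each request starts.
#include <signal.h>
#include <sys/resource.h>

static volatile sig_atomic_t budget_exceeded = 0;

static void on_sigxcpu(int sig) {
    (void)sig;
    budget_exceeded = 1;  /* checked at safe points inside the request */
}

static void arm_cpu_budget(rlim_t seconds) {
    struct rusage ru;
    struct rlimit lim;
    getrusage(RUSAGE_SELF, &ru);
    getrlimit(RLIMIT_CPU, &lim);  /* keep the existing hard limit as-is */
    rlim_t used = ru.ru_utime.tv_sec + ru.ru_stime.tv_sec;
    lim.rlim_cur = used + seconds;  /* soft: SIGXCPU once the budget is spent */
    /* Only the soft limit moves: raising the hard limit per request would
       require CAP_SYS_RESOURCE. */
    setrlimit(RLIMIT_CPU, &lim);
}

int main(void) {
    signal(SIGXCPU, on_sigxcpu);
    for (;;) {                 /* request loop of the long-lived worker */
        budget_exceeded = 0;
        arm_cpu_budget(2);     /* made-up per-request budget: 2 seconds */
        /* ... handle one request, periodically checking budget_exceeded ... */
        break;                 /* placeholder so the sketch terminates */
    }
    return 0;
}
```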
But performance will still suffer compared to a coroutine-per-request approach. First, copying the process's page tables on fork is expensive. Because the page tables hold one entry per page, we could make use of huge pages, shrinking the amount of data to be copied. This is only directly controllable from low-level languages such as Zig. Additionally, OS-level multitasking is preemptive rather than cooperative, which will always be less efficient. Or will it?
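For illustration, a hedged sketch of requesting a huge-page-backed arena explicitly (untested; the 64 MB size is arbitrary, and MAP_HUGETLB only succeeds if the administrator has reserved huge pages, e.g. via /proc/sys/vm/nr_hugepages).

```c
// Hedged sketch: back a request arena with 2 MB huge pages via mmap,
// falling back to normal pages if none are reserved on the machine.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64 * 1024 * 1024; /* 64 MB arena - a made-up figure */
    void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (arena == MAP_FAILED) return 1;
    printf("arena at %p\n", arena);
    munmap(arena, len);
    return 0;
}
```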
Cooperative multitasking with Linux
There is the syscall sched_yield, which lets a thread relinquish the CPU when it has completed its portion of work. Seems quite cooperative. Could there be a way to request a time slice of a given size as well? Actually, there is - the scheduling policy SCHED_DEADLINE. This is a realtime policy, which means that for the requested CPU time slice the thread runs uninterrupted. If the slice is overrun, preemption kicks in, and your thread is swapped out and deprioritized. And if the slice is underrun, the thread can call sched_yield to signal an early finish, allowing other threads to run. That looks like the best of both worlds - a cooperative and preemptive model.
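A hedged sketch, modeled on the sched_setattr(2) man page: the worker asks the kernel for 5 ms of CPU out of every 20 ms period under SCHED_DEADLINE, then calls sched_yield whenever a request finishes early. It needs CAP_SYS_NICE (or root), and all of the numbers are invented.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Mirrors the kernel's struct sched_attr described in sched_setattr(2). */
struct deadline_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* nanoseconds of CPU granted per period */
    uint64_t sched_deadline;  /* work must finish within this, in ns */
    uint64_t sched_period;    /* length of the period, in ns */
};

int main(void) {
    struct deadline_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  =  5 * 1000 * 1000,   /*  5 ms */
        .sched_deadline = 20 * 1000 * 1000,   /* 20 ms */
        .sched_period   = 20 * 1000 * 1000,   /* 20 ms */
    };
    /* Older glibc has no wrapper for sched_setattr, so use the raw syscall. */
    if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1) {
        perror("sched_setattr");  /* typically EPERM without CAP_SYS_NICE */
        return 1;
    }
    for (int request = 0; request < 3; request++) {
        /* ... handle one request within the 5 ms runtime budget ... */
        printf("request %d handled, yielding rest of the period\n", request);
        sched_yield();  /* finished early: give the CPU back until next period */
    }
    return 0;
}
```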
One limitation is that a SCHED_DEADLINE thread cannot fork. This leaves us with two models for concurrency: either a process per request, which sets the deadline for itself and runs an event loop for efficient IO, or a process that spawns a thread per micro-task from the start, with each thread setting its own deadline and communicating with the others through queues. The former is more straightforward but requires an event loop in userspace; the latter leans more on the kernel.
Both strategies achieve the same end as the coroutine model - by cooperating with the kernel, it is possible to have application tasks run with minimal interruptions.
Python as an embedded scripting language
This is all for the high-performance, low-latency, low-level side of things, where Zig shines. But when it comes to the actual business of the application, flexibility is much more valuable than latency. If a process involves real people signing off on documents, the latency of a computer is negligible. Also, despite suffering in performance, object-oriented languages give the developer better primitives for modeling the business domain. At the furthest end of this spectrum, systems like Flowable and Camunda allow managerial and operations staff to program the business logic themselves, with more flexibility and a lower barrier to entry. Languages like Zig will not help here - they only stand in your way.
Python, on the other hand, is one of the most dynamic languages there are. Classes, objects - they are all dictionaries under the hood, and can be manipulated at runtime however you like. This carries a performance penalty, but it makes modeling the business with classes, objects, and many clever tricks practical. Zig is the opposite: there are intentionally few clever tricks in Zig, giving you maximum control. Can we combine their powers by having them interoperate?
Indeed we can, since both support the C ABI. We can run the Python interpreter inside the Zig process rather than as a separate process, reducing both runtime overhead and glue code. This also lets us use Zig's custom allocators within Python - for example, an arena per request - reducing if not eliminating garbage collector overhead and enforcing a memory cap. A major limitation would be the CPython runtime spawning its own threads for garbage collection and IO, but I found no evidence that it does. We could even hook Python into a custom event loop in Zig, with per-coroutine memory tracking, by making use of the context that asyncio's AbstractEventLoop machinery associates with each callback and task. The possibilities are limitless.
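Here is a hedged C sketch of the embedding part (untested; in Zig the same calls would go through its C interop, and a real arena would replace the malloc shim below). It installs a custom allocator in all three CPython allocator domains before initializing the interpreter; build flags would come from something like python3-config --cflags --embed.

```c
// Hedged sketch: embed CPython and route its allocations through our own
// allocator via PyMem_SetAllocator. The "arena" here is just malloc plus a
// byte counter - a stand-in for a per-request arena with a memory cap.
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>

static size_t bytes_requested = 0;  /* crude, single-threaded tracking */

static void *arena_malloc(void *ctx, size_t size) {
    bytes_requested += size;
    return malloc(size ? size : 1);  /* CPython wants non-NULL for 0 bytes */
}
static void *arena_calloc(void *ctx, size_t nelem, size_t elsize) {
    bytes_requested += nelem * elsize;
    return calloc(nelem ? nelem : 1, elsize ? elsize : 1);
}
static void *arena_realloc(void *ctx, void *ptr, size_t new_size) {
    return realloc(ptr, new_size ? new_size : 1);
}
static void arena_free(void *ctx, void *ptr) {
    free(ptr);
}

int main(void) {
    PyMemAllocatorEx alloc = {
        .ctx = NULL,
        .malloc = arena_malloc,
        .calloc = arena_calloc,
        .realloc = arena_realloc,
        .free = arena_free,
    };
    /* Must be installed before Py_Initialize(); replacing the MEM/OBJ
       domains with malloc is equivalent to PYTHONMALLOC=malloc. */
    PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);
    PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &alloc);
    PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &alloc);

    Py_Initialize();
    PyRun_SimpleString("print('hello from embedded Python')");
    Py_FinalizeEx();

    printf("bytes requested through our allocator: %zu\n", bytes_requested);
    return 0;
}
```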
Conclusion
We discussed the merits of concurrency, parallelism, and various forms of integration with the OS kernel. The exploration lacks benchmarks and tested code, which I hope it makes up for in the quality of the ideas offered. Have you tried anything similar? What are your thoughts? Feedback welcome :)
Further reading
- https://linux.die.net/man/2/clone
- https://man7.org/linux/man-pages/man7/sched.7.html
- https://man7.org/linux/man-pages/man2/sched_yield.2.html
- https://rigtorp.se/low-latency-guide/
- https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
- https://hadar.gr/2017/lightweight-goroutines