When you are building a product from scratch, as my co-founders and I are doing right now, it is easy to become passionate about doing high quality work. We constantly talk about how we want to make everything "World Class".
Quality is hard to define, though. One of my favorite books growing up was "Zen and the Art of Motorcycle Maintenance", and a good chunk of the book is about the author trying to define quality and what happens to him as a result. Quality is something you know when you see it. High quality products make this impression as a strong, immediate experience. You read a story, the first sentence grabs you, and you can't let go. But there is also an element of context and experience - experience widens the range of quality you can perceive. I enjoy some wines more than others, but I can't really appreciate great wines. On the other hand, after 20 years of code reviews, I can tell a lot about an engineer by looking at their code and can distinguish good code from great code across domains, paradigms, and languages.
When it comes to software products, I believe that the final judge of quality is the customer. We talked to a lot of potential customers about what makes an experience high quality for them.
A lot of companies and cultures confuse "lack of bugs" with "quality". There is some overlap, but the Venn diagram is not a circle. We all know software that was buggy and immature but compelling enough to still provide a high quality experience. We also know software that is nearly bug free, yet the experience doesn't feel high quality.
If lack of bugs isn't it, what makes for a high quality product?
Here is my definition: a high quality product offers a magical experience in a specific dimension that users really care about. A lot of other things can be forgiven if you are truly magical in the specific things that matter to them.
A few examples:
- Early Slackware Linux had tons of bugs, but it was the first time I could run a Unix on my desktop at home and not the computer lab at the university. Changed my life.
- My first car navigation system kept crashing, but it was still way better than stopping to look at maps.
- Early Kafka was easy to get started with and it had amazing uptime. There were major bugs; people reported them and kept using it. Eventually the bugs were resolved.
- Early Twitter and the fail whale.
- Datadog was super simple to get started with and sending metrics "just worked". We had some issues with reporting that they fixed later, but we remained a customer forever.
- Expensify allowed me to take photos of receipts and not carry them around.
The takeaway here is that you need to figure out what your users really care about, especially in their early adoption steps, and make that feel magical.
Very early in my Confluent career, we hired an amazing training developer (she went on to be much more). On her second day, she said, "I want to structure my training around a practical example. What is a fun thing to do with Kafka?" I was the PM of Connect, so I said, "Why not get some data from MySQL to Kafka, do a simple aggregation with Kafka Streams, and write the result to S3?" This was already a fairly popular use-case in my mind. Two days later she said, "Something is wrong, it doesn't work." She was right: none of this "just worked". You had to figure out specific configurations, specific formats, specific steps. It took us weeks to get it to work. And we saw this as a basic use-case! This was completely "green path" - no chaos, no high load, nothing that should have been challenging.
Note that QA teams rarely find these kinds of issues, and the issues she found were not in any one part of the product. They were either usability issues or, more frequently, integration issues - you only see them when you try to use your product like a real customer and implement an entire workflow. We eventually built an automated testing framework specifically around real customer workflows.
And to close on a more quantifiable note, one last tip for quality:
How often bad things happen, and whether they start happening more frequently, is a really important thing to know when thinking about user experience in SaaS. SLOs are a good tool for this, but many years ago I learned about a more flexible tool that is worth knowing about. It is called a control chart. You basically take a metric, say response latency, and plot it over time. You then define a range of "normal values": it can be an overall average, an average per entity like machine or user, or an adaptive average that can handle things like weekend use and rush hour. Now you'll have a set of points outside the "normal range" and you can define rules on them:
- Any point 3 standard deviations above the baseline. This indicates an extreme, sudden increase.
- 5 consecutive measurements more than one standard deviation above the baseline. This indicates a sustained increase.
- 10 consecutive measurements, each higher than the previous one. This indicates a steady upward trend.
This is a super flexible way to detect and communicate a wide range of quality issues in a production system, so you can discuss not just a specific incident but worrying trends.
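To make the three rules concrete, here is a minimal sketch in Python (the names and data are mine, purely for illustration). It assumes you already have a baseline and standard deviation computed from a window of known-normal measurements; a real system would use per-entity or adaptive baselines as described above.

```python
# Control-chart sketch: flag latency samples that break the three rules above.
# Assumes a fixed baseline and sigma computed from known-normal data.
from statistics import mean, stdev


def control_chart_alerts(samples, baseline, sigma):
    alerts = []
    for i, value in enumerate(samples):
        # Rule 1: a single point 3 standard deviations above the baseline
        # indicates an extreme, sudden increase.
        if value > baseline + 3 * sigma:
            alerts.append((i, "sudden spike"))

        # Rule 2: 5 consecutive points more than 1 standard deviation above
        # the baseline indicate a sustained increase.
        if i >= 4 and all(s > baseline + sigma for s in samples[i - 4 : i + 1]):
            alerts.append((i, "sustained increase"))

        # Rule 3: 10 consecutive points, each higher than the previous one,
        # indicate a steady upward trend.
        if i >= 9 and all(samples[j] > samples[j - 1] for j in range(i - 8, i + 1)):
            alerts.append((i, "upward trend"))
    return alerts


# Baseline from a period of known-normal response latencies (ms).
normal = [102, 99, 105, 101, 98, 103, 100, 97, 104, 101]
baseline, sigma = mean(normal), stdev(normal)

# Recent measurements, with a spike at the end.
recent = [100, 103, 99, 250]
print(control_chart_alerts(recent, baseline, sigma))  # [(3, 'sudden spike')]
```

The same few lines of logic work for error rates, queue depths, or any other metric you plot over time; only the baseline calculation changes.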
I also posted this content in a video, if you prefer: