Todd Ekenstam is a Principal Software Engineer at Intuit, and Avni Sharma is a Product Manager at Intuit. This article is based on Todd and Avni's March 19, 2024 presentation at KubeCon + CloudNativeCon Europe 2024. You can watch the entire session video here. You can read the event details or download the slide deck for more information on this session.
Intuit is a global fintech company that builds many financial products like QuickBooks and TurboTax. We’re an AI-driven expert platform. How much so? To give one metric, we make more than 40 million AIOps inferences per day.
We also have a huge Kubernetes platform, with approximately 2,500 production services and even more pre-prod services. We have approximately 315 Kubernetes clusters and more than 16,000 namespaces.
Our developer community is large, too, with around 1,000 teams and 7,000 developers at Intuit working on end-user products. The service developer is among the personas who deal with the platform on a day-to-day basis. Service developers build the app logic for the product, which is then shipped to the end user. They’re focused on code and shipping fast.
Then, we have the platform persona, whom we call platform experts.
Challenges for Developers on the Platform
The overarching goal of platform engineering is to drive developer autonomy. Platform engineers focus on enabling service developers by providing capabilities through several interfaces. For example, if a developer needs a database, they should be able to access one, whether they're a Node.js developer or a database administrator.
Using these capabilities should be frictionless and easy. If a developer wants to deploy or manage an application on Kubernetes, they shouldn’t need to know the nitty-gritty of the platform and the infrastructure. They should be able to do it seamlessly.
But what are the challenges that service developers face today?
Steep learning curve
Our developers often deal with many Kubernetes internals and APIs, as well as infra-related configurations, on a daily basis. When something is misconfigured, they need help troubleshooting it.
Local development with dependencies
The second friction point revolves around the developer experience. Developers need help managing environments and testing locally against dependencies, and they face a lengthy onboarding process.
Tech refreshes need migrations
The third challenge concerns tech refreshes that require migration. For example, when we upgrade Kubernetes, workloads using deprecated APIs must be migrated; when we replace the CloudWatch agent with Fluentd, logging configurations must change. These kinds of migrations require support from the service developer teams.
Consider a sample workflow our service developer goes through on our internal developer portal.
First, the developer creates an asset on our dev portal. An asset is the atomic unit of deployment on our Kubernetes layer. Next, they build and deploy the app.
After that, they configure specific Kubernetes primitives in their deployment repo, like PDBs, HPA configuration, or Argo Rollouts analysis templates. They might onboard to an API gateway to expose their app to the internet, and to a service mesh to configure rate limiting. Next, they run end-to-end tests of their application.
Finally, if they have any performance tests to run, they perform load testing, tuning their HPA's minimum and maximum replica counts. And, of course, all of this is intertwined with perpetual platform migrations, performed quarterly, to stay up to date.
This kind of workflow can undoubtedly drive your service developer crazy. They want to focus on business logic, yet they must also deal with infrastructure concerns.
The Target State
Now that we’ve examined the challenges, let’s look at our target state and where we would like to be. We want to translate all these application needs into platform means. Service developers can focus on developing the code, deploying it seamlessly without knowing the platform’s nitty-gritty details, and then performing end-to-end testing.
The platform should address all other concerns.
Addressing the Challenges
To set some context, let’s describe Intuit's development platform, which we call Modern SaaS AIR. At the top, our developer portal provides the developer experience for all of our engineers and manages an application's complete lifecycle.
From there, our platform is based on these four pillars:
- AI-powered app experiences
- GenAI-assisted development
- App-centric runtime
- Smart operations
Our operational data lake supports all of this. It provides a rich data set for visibility into how all our applications are developed, deployed, and run.
IKS AIR
Let’s focus on the runtime and traffic management component we call IKS AIR. IKS is Intuit’s Kubernetes Service layer. IKS AIR is a simplified deployment and management platform for containerized applications running on Kubernetes. It provides everything an engineer needs to build, run, and scale an application. The main components of IKS AIR are:
- An abstracted application-centric runtime environment
- Unified traffic management
- Developer-friendly debug tools
If you build and run your own platform on Kubernetes, you likely have many of these same concerns.
Application-centric runtime
The application-centric runtime relates to two main concerns: abstracting the Kubernetes details and intelligently recommending scaling solutions.
Abstraction and simplification
The application specification abstracts and simplifies the Kubernetes details (see the left side of the diagram below) and provides an app-centric specification (see the right side of the diagram below). The platform is responsible for generating the actual Kubernetes resources submitted to the clusters.
Our application spec is heavily influenced by the Open Application Model (OAM), which describes an application in terms of components that are customized or extended using traits. Components represent individual pieces of functionality, such as a web service or a worker. Traits modify the behavior of the component they're applied to.
Through this system of components and traits, developers can define what their applications require from the platform—without needing a complete understanding of how it's implemented in Kubernetes.
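To make this concrete, here's a minimal sketch of what an OAM-style application spec can look like. The application name, component type, and trait names below are illustrative, borrowed from the open-source OAM/KubeVela vocabulary rather than our exact schema:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: payments-service           # hypothetical application
spec:
  components:
    - name: web
      type: webservice             # a component: one unit of functionality
      properties:
        image: example.com/payments:1.4.2
        port: 8080
      traits:
        - type: scaler             # traits adjust the component's behavior
          properties:
            replicas: 3
        - type: ingress
          properties:
            domain: payments.example.com
            http:
              "/": 8080
```

A short spec like this is all the developer writes; the platform expands it into the underlying Deployments, Services, and ingress resources.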
Let’s consider an example of this complexity: the progressive delivery solution utilizing Argo Rollouts and Numaflow, created by the platform to enable the automatic rollback of buggy code. When a new version of an application is rolled out, canary pods with that new version are first created, and then some percentage of the traffic is sent to those new pods.
Numaflow pipelines analyze the metrics of those canary pods, generating an anomaly score. If the anomaly score is low, then the rollout will continue. However, if it is high—for example, above seven or eight—then Argo Rollouts will stop the deployment and automatically revert to the prior revision.
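Here's a trimmed-down sketch of the kind of Rollout and analysis pair that can implement this pattern. The service name, analysis template, metric endpoint, and the threshold of seven are illustrative, not our production configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-service              # hypothetical service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: payments-service
  template:
    metadata:
      labels:
        app: payments-service
    spec:
      containers:
        - name: app
          image: example.com/payments:1.4.3
  strategy:
    canary:
      steps:
        - setWeight: 10               # send 10% of traffic to the canary pods
        - analysis:                   # gate the rollout on the anomaly score
            templates:
              - templateName: anomaly-score-check
        - setWeight: 50
        - pause: {duration: 5m}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: anomaly-score-check
spec:
  metrics:
    - name: anomaly-score
      interval: 1m
      failureLimit: 1
      provider:
        web:                          # hypothetical endpoint fronting the Numaflow pipeline's score
          url: http://anomaly-scores.internal/api/v1/score
          jsonPath: "{$.score}"
      successCondition: result < 7    # a high score fails the analysis
```

If the analysis metric fails, Argo Rollouts aborts the update and the workload returns to the stable revision.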
This is an essential aspect of how we help our developers deploy confidently without knowing how to set up this complex solution.
Intelligent autoscaling
IKS AIR also automatically recommends scaling solutions for applications. To do this, it must determine the application's resource sizing, such as memory and CPU, and handle unexpected events like OOMKills or evictions.
The platform must also handle horizontal scaling to ensure the applications operate correctly at scale and with varying load levels. It needs to identify the metrics the application should scale on and the minimum and maximum number of replicas. Naturally, these are all primarily data-driven problems.
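In plain Kubernetes terms, those horizontal knobs are what a HorizontalPodAutoscaler captures. Here's a minimal example of the object a developer would otherwise hand-write, with an illustrative metric and replica bounds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-service          # hypothetical service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-service
  minReplicas: 3                  # floor chosen for availability
  maxReplicas: 20                 # ceiling chosen for cost control
  metrics:
    - type: Resource
      resource:
        name: cpu                 # the metric the app scales on
        target:
          type: Utilization
          averageUtilization: 70
```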
AI will significantly impact capacity planning and autoscaling, helping us to be more efficient with our computing resources. So we're building an intelligent autoscaling recommendation system to:
- Reduce the burden on our developers
- Help us ensure our workloads have the resources they need
- Improve the efficiency of our platform
The basic underlying idea is this: We have components in the cluster handling short-window scaling operations and emitting metrics. Subsequently, these metrics are analyzed by a group of ML models that make long-window capacity and scaling recommendations. The solutions to different scaling problems are then applied back to the clusters.
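We haven't published the schema for these recommendations, but conceptually the output of the long-window models is a record like the following purely hypothetical sketch, which the platform can then translate into updated resource requests and HPA settings:

```yaml
apiVersion: autoscaling.example.com/v1alpha1   # hypothetical API group
kind: ScalingRecommendation                    # hypothetical resource type
metadata:
  name: payments-service
spec:
  window: 14d                      # long-window period the models analyzed
  vertical:                        # recommended resource sizing
    cpu: {request: 500m, limit: "1"}
    memory: {request: 768Mi, limit: 1Gi}
  horizontal:                      # recommended HPA settings
    metric: cpu
    targetUtilization: 70
    minReplicas: 3
    maxReplicas: 20
```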
Traffic management
Another big challenge we've identified for our developers is the configuration and management of network traffic.
While some applications need to use particular capabilities in our networking environment, we found that most only need to change a few common configurations. Our solution simplifies endpoint management, unifying the configuration of our API gateway and service mesh while providing graduated complexity as required. Here is an example of the traffic configuration of a service on our developer platform.
Most applications only need to configure routes and throttling. However, if required, they can toggle on Advanced Configs, which gives them access to CORS and OAuth scopes. They can edit the underlying YAML configuration for even more complex use cases.
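As a sketch of this graduated complexity, a simplified endpoint config might look something like the following. The field names here are invented for illustration; they are not our actual schema:

```yaml
endpoint:
  routes:                          # what most applications configure
    - path: /api/v1/payments
      service: payments-service
      port: 8080
  throttling:
    requestsPerSecond: 100
    burst: 200
  advanced:                        # hidden behind the Advanced Configs toggle
    cors:
      allowedOrigins: ["https://app.example.com"]
    oauthScopes: ["payments.read", "payments.write"]
```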
Debug Tools
With the platform abstracted, our service developers faced another challenge: troubleshooting their services. We know that abstraction and debuggability don’t naturally go hand in hand, so it was critical for us to not only provide a paved path but also serve our developers' debugging needs.
To accomplish this, we provide an extremely developer-friendly debugging experience in the developer portal. Service developers don't need to know about Kubernetes primitives or have any historical knowledge of the platform. Our aim is to democratize debug tooling across teams, as we saw many developers juggling many different tools. This approach helps us reduce mean time to resolution (MTTR) and friction in debugging workflows.
Interactive debugging using ephemeral containers
First, we've provided our developers with an interactive debugging shell experience using ephemeral containers. Ephemeral containers are a GA feature in the Kubernetes 1.25 release. They’re ideal for running a custom-built transient container and troubleshooting the main app container in the service pod. This is great for introspection and debugging. So now you can launch a new debug container into an existing pod.
This ephemeral debug container shares the target container's process (PID) namespace, along with the pod-level IPC and UTS namespaces. Given that containers in a pod already share a network namespace as well, the debug container is perfectly positioned to debug issues in the app container.
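Under the hood, this is roughly what `kubectl debug --target` produces. Here's a sketch of the resulting pod spec fragment; the pod name and images are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-service-7d4b9-xk2lp   # hypothetical pod
spec:
  containers:
    - name: app
      image: example.com/payments:1.4.2
  # Ephemeral containers can't be added by editing the pod directly;
  # they are injected through the pod's ephemeralcontainers subresource,
  # which is what `kubectl debug -it --target=app` does.
  ephemeralContainers:
    - name: debugger
      image: busybox:1.36
      command: ["sh"]
      stdin: true
      tty: true
      targetContainerName: app    # share the app container's process namespace
```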
This is a demo of how this looks on our development platform:
- Click on the shell icon.
- Select a host, which here is a pod, and click Connect.
- From here, an ephemeral container will try to connect to the particular app container.
- Once it attaches and a connection is established, a session is started.
In this way, we've hidden the complexity of using kubectl exec to access a pod, or even of obtaining a kubeconfig. A developer can quickly use this frictionless experience to debug their service.
One-click debugging
Another feature we have provided is one-click debugging. We have used Argo Workflows for the workflow implementation, which is ideal for defining the sequence of steps needed for a debugging workflow. Specific debugging techniques are required based on the language and framework. We also want to preserve as much application context, structures, and code references as possible while debugging a service.
At Intuit, we determined that our top two languages are Java and Golang, and each calls for its own language-specific debugging tools, such as jstack and jmap for Java thread and heap dumps, or pprof for Go profiling.
What does this look like in our developer portal? For a Java service, the user interacts with the UI to take a thread or heap dump on a target pod. When they hit Add to Debug List and add that specific pod or host, a workflow executes in the background, running the thread dump and heap dump steps in sequence. The developer can later download the artifacts and use their preferred analysis tool. These artifacts are available for download for only 24 hours.
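As a rough illustration of the shape of such a workflow, here's a compressed, hypothetical Argo Workflow for the Java case. The template names, images, and tool invocations are examples only; in practice, the dump steps must run against the target pod's JVM (for instance, via an ephemeral container), which this sketch glosses over:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: java-debug-       # hypothetical workflow
spec:
  entrypoint: debug
  templates:
    - name: debug
      steps:
        - - name: thread-dump     # step 1: capture a thread dump
            template: jstack-dump
        - - name: heap-dump       # step 2: capture a heap dump
            template: jmap-dump
    - name: jstack-dump
      container:
        image: eclipse-temurin:17-jdk
        command: [sh, -c]
        args: ["jstack 1 > /tmp/thread.dump"]   # PID 1 stands in for the target JVM
      outputs:
        artifacts:
          - name: thread-dump
            path: /tmp/thread.dump    # stored so the developer can download it later
    - name: jmap-dump
      container:
        image: eclipse-temurin:17-jdk
        command: [sh, -c]
        args: ["jmap -dump:live,format=b,file=/tmp/heap.hprof 1"]
      outputs:
        artifacts:
          - name: heap-dump
            path: /tmp/heap.hprof
```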
Key Takeaways
To conclude this post, let’s go over the key takeaways of building this paved road:
- Increases developer release velocity
- Streamlines platform migrations, with very little friction for service developers and their teams
- Reduces the time to get a service into production, since the platform handles much of the heavy lifting and takes that burden off the developer
- Helps reduce potential incidents caused by infrastructure misconfigurations
- Provides a better developer experience by abstracting infrastructure network connectivity and providing intelligent autoscaling for managing service availability