David Montoya

10 milestones for the Internal Developer Platform. A roadmap for teams getting started

Internal Developer Platform (IDP) teams lay the rails for domain teams to ship apps and features (ultimately, to production) with low friction. Team Topologies (Skelton, Pais, 2019) defines the purpose of a platform team as "to enable stream-aligned teams to deliver work with substantial autonomy". The ability of domain teams to deliver that work depends partly (there are other organizational factors, of course) on the capabilities and developer experience provided by the platform. How, then, can a platform team set out to build those capabilities and lay the foundation for that developer experience? How can they take an incremental approach, so that domain teams benefit from platform features early on? The milestones listed below trace a route for platform teams to lay that foundation and to provide essential features for their platform, like tenancy controls and secret management.

Not all milestones are achieved sequentially; a team, or multiple teams, may be working toward several milestones at once. Each milestone is a journey on which platform teams progress from early to advanced practitioners as they incrementally add features, create abstractions, and gain momentum by building on top of stabilized tooling.

Before you go build an IDP

As Kris Nova and Justin Garrison note in Cloud Native Infrastructure (Garrison, Nova, 2018), “When you’re building a platform to run applications, it’s important to know what you are getting into. Initial development is only a small fraction of what it takes to build and maintain a platform”. Maintaining and evolving the platform along with the organization will consume most of the team's (or teams') efforts. Before you set out to build a developer platform, define the guiding principles and the architectural -ilities that should influence your design and implementation. Keep a watchful eye for unnecessary complexity and redundant infrastructure, as unchecked growth increases cognitive load and gets in the way of delivering new features for devs. Having "self-serviceability" influence the design (e.g., of pipelines, APIs, or Kubernetes CRDs) helps unlock domain teams' autonomy. Modularity and evolvability are other -ilities to consider so the platform can weather the flux of change in organizations and avoid costly migrations every few years.

1. Bootstrapping a developer platform

On day 0, there is only a root service account for some cloud provider. There are no "Admin" or "Team" personas, no definition of "Environments", no pipelines (the "Provisioner" persona) to manage cloud infrastructure, and no clusters for workloads to run. Using the root service account, the platform team seeds the roles and IAM policies required by the "Admin" and "Provisioner" personas to continue scaffolding the platform. If using Terraform, the bootstrap state file should be free of secrets and versioned along with the bootstrapping code. A bucket or other storage backend is required to store the state for subsequent infrastructure. Resources managed by this component may require more permissions than those granted to admins; creating a service account with elevated permissions and allowing admins to cut short-lived keys for it serves as an escape hatch when additional permissions are required. At the end of day 0, members of the platform team should be able to use their “admin” credentials to access the cloud.
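
As a rough illustration, here is a minimal sketch of the "Provisioner" pipeline that could pick up where the bootstrap leaves off, written as a GitHub Actions workflow. The repository layout, secret name, and GCP credentials are assumptions, not prescriptions:

```yaml
# Hypothetical provisioner pipeline; assumes the bootstrap already created the
# state bucket and a provisioner service account whose key is stored as a secret.
name: provision-infra
on:
  push:
    branches: [main]
    paths: ["infra/**"]
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform apply
        working-directory: infra
        env:
          # Key for the (assumed) provisioner service account created during bootstrap
          GOOGLE_CREDENTIALS: ${{ secrets.PROVISIONER_SA_KEY }}
        run: |
          terraform init        # backend points at the state bucket created on day 0
          terraform apply -auto-approve
```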

Code produced in this milestone should have a low frequency of change as time passes. A high frequency of change indicates that a given configuration should be promoted to a tenancy control in a separate pipeline with fewer permissions.

2. Implement access controls for teams

This is day 1. One or two teams are already waiting for access to begin deploying their apps. Team boundaries must be delineated to prevent teams from stepping on each other, to enforce resource quotas, and to grant least-privilege access. A pipeline can be created to bind IAM roles and policies to each domain team. If using Terraform, optimize for configurability by having an internal module represent a single instance of a team and invoking it for each onboarded team with different inputs. Optimize for self-serviceability by surfacing all customizable settings in a format that teams can edit easily, like a YAML file, and by requiring approvals from platform and security engineers. Customizable IAM roles can help enforce the principle of least privilege. Permissions granted via this pipeline may be long-lived, but only until the team unlocks tenancy 2.0 and allows for break-glass workflows to get just-in-time elevated access to environments and cloud resources.
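
As a sketch, the per-team file that domain teams edit via pull request might look like the following; the schema, team name, and group are hypothetical:

```yaml
# teams/checkout.yaml -- hypothetical tenancy file, applied by the team pipeline
# after approval from platform and security engineers.
team: checkout
owners:
  group: checkout-devs@example.com    # identity group bound to IAM roles
environments:
  - name: dev
    quotas:
      cpu: "20"
      memory: 64Gi
  - name: prod
    quotas:
      cpu: "40"
      memory: 128Gi
iam:
  roles:
    - roles/viewer
    - roles/container.developer       # kept to the minimum the team needs
```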

Code produced in this milestone evolves as the organization grows in teams and workloads. Separating language-specific configuration from tenancy metadata allows evolving this component into an API or a Kubernetes CRD.

3. Provide isolated environments for applications

Developers require environments to deploy and iterate on their apps as they progress toward production. Defining environment boundaries at the network level ensures isolation and stability. A blueprint for early platform teams to get started quickly is to create a Kubernetes cluster per environment, each with its dedicated VPC network. As the organization grows, so does the footprint of the apps and APIs running on each cluster, which requires the platform team to evolve the cluster topology to allow for more specialized definitions of environments or advanced cluster management strategies like blue/green deployments. Once an environment is operational, enhance the pipelines and components produced in milestone #2 to manage teams' access to namespaces with Kubernetes RBAC.
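
For instance, the team pipeline could render a RoleBinding like the one below for each team namespace; the namespace, group, and the use of the built-in edit role are assumptions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: checkout-devs-edit
  namespace: checkout                  # the team's namespace in this environment's cluster
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit                           # built-in aggregate role; swap for a custom Role to tighten access
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: checkout-devs@example.com    # the group from the team's tenancy file
```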

Code produced in this milestone has a consistent frequency of change, as new "cluster components" will be added to support various platform features: sidecar injectors, secret and certificate managers, config reloaders, telemetry collectors, API gateways, etc. Separating language-specific configuration from cluster metadata allows evolving this component into a pipeline that manages the lifecycle of multiple clusters.

4. Unlock configuration management

Both admins and users require a development workflow to configure Kubernetes and cloud APIs. Terraform coupled with Atlantis or Argo Workflows allows managing cloud resources à la GitOps. Kubernetes apps are configured with YAML documents, which can be templated with tools like Helm, Jsonnet, or CUE to reduce repetition, perform validations, and modify applications in bulk. For mid-level to advanced teams, choosing a configuration language like Jsonnet allows for creating expressive abstractions with functions or object orientation; those can then be published as libraries (or "libsonnets") for other apps to import. Both templated and raw YAML documents can then be applied to their destination clusters with tools like Flux or ArgoCD.
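
As an example of that last step, an ArgoCD Application can point a cluster at a directory of rendered manifests; the repo URL, path, and namespaces below are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-api
  namespace: argocd
spec:
  project: checkout
  source:
    repoURL: https://github.com/example-org/checkout-api   # hypothetical app repo
    targetRevision: main
    path: deploy/overlays/dev        # templated or raw YAML lives here
  destination:
    server: https://kubernetes.default.svc
    namespace: checkout
  syncPolicy:
    automated:
      prune: true                    # remove resources deleted from Git
      selfHeal: true                 # revert out-of-band changes
```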

As the platform team grows, they may favor Kubernetes-style configuration and unify their cloud configuration workflows with tools like GoogleCloudPlatform/k8s-config-connector, or by developing custom CRDs with Kubebuilder or Crossplane, which I cover in milestone #9.

Libraries, pipelines, and documentation created in this milestone begin focused on the needs of the platform team and, once tested, are extended to domain teams.

5. Allow reusing build jobs and steps

Build pipelines are required to build and test code, and to publish packaged applications like Docker images. Platform teams can create reusable pipeline steps with tools like GitHub Actions or CircleCI Orbs, which can then be combined to implement CI (Continuous Integration) and CD (Continuous Delivery) workflows. You will need an artifact registry and a set of base images that developers can extend (FROM). Put them on a nightly build so you can keep them patched. Encourage teams to practice semantic versioning to tag the artifacts they produce and track them across environments. Platform teams with an advanced Kubernetes practice, or with particular build requirements, may choose to run their own CI/CD system with Tekton or Argo Workflows.
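
For teams on GitHub Actions, a reusable workflow is one way to package a shared build step. The sketch below assumes a hypothetical shared repository, registry, and secret names:

```yaml
# .github/workflows/build-image.yaml in a shared platform repo (hypothetical).
# Application repos call it from their own workflow with:
#   jobs:
#     build:
#       uses: example-org/platform-workflows/.github/workflows/build-image.yaml@v1
#       with:
#         image: registry.example.com/checkout/checkout-api
#       secrets: inherit
on:
  workflow_call:
    inputs:
      image:
        required: true
        type: string
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: registry.example.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ inputs.image }}:${{ github.sha }}   # pair with a semver tag on release
```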

Code produced includes artifact registries, reusable actions or jobs, IAM service accounts, example applications, and documentation. Build pipelines may require over-privileged service accounts; if using HashiCorp Vault, use a scheduled task to rotate their API keys at least once daily.

6. Implement secret management

Secrets are tricky to deal with. Devs can't store them with code, or at least not unencrypted; they vary across environments (development vs. production); some are required for apps to even start; and rotating them in the event of a security incident (or just as good security hygiene) should not require application downtime nor action from more than one team. Cloud-native secret managers like AWS's Parameter Store and GCP's Secret Manager are handy for the platform team to bootstrap infrastructure and share secrets with admins. For Kubernetes platforms, there are tools like External Secrets and the Secrets Store CSI driver, which map secrets from external sources to Kubernetes Secrets and volumes that Deployments or StatefulSets can mount. HashiCorp Vault is also a great companion to any Kubernetes app. When injected as a sidecar, it allows the app to auto-reload on secret changes or to issue short-lived credentials to access databases and cloud APIs, using Vault's secret backends and custom plugins. For mid-level to advanced platform teams, HashiCorp Vault helps prevent secret sprawl, unlocks workflows for devs to access cloud resources securely, and, thanks to the Vault agent's caching capabilities, mitigates the thundering herd problem on the Vault server and the Kubernetes API server (used for authentication).
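
To make that concrete, an ExternalSecret resource (from the External Secrets operator) can sync a cloud-managed secret into a team's namespace; the store, names, and keys below are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: checkout-db
  namespace: checkout
spec:
  refreshInterval: 1h                 # re-sync so rotations propagate without redeploys
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager          # a store managed by the platform team (assumed name)
  target:
    name: checkout-db                 # Kubernetes Secret created and kept in sync
  data:
    - secretKey: password
      remoteRef:
        key: checkout-db-password     # entry in the cloud secret manager
```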

Code produced includes secret stores and RBAC controls for workloads and users to access them, cluster operators and sidecar injectors, and secret self-service documentation.

7. Ensure application monitoring and observability

Teams can't just launch critical services into production with no visibility. Part of making a Kubernetes cluster operational for apps and humans is deploying log collectors and metrics scrapers as part of the lifecycle of every cluster. Allow domain teams to access those logs and metrics so they can create visualization dashboards and monitoring alerts, which they'll need to operate their services in production. If your org is not already using a managed monitoring stack, Prometheus + Alertmanager + Grafana is an industry-proven, chaos-tested OSS stack for Kubernetes. At the very least, your Kubernetes monitoring stack should report on apps' resource utilization and allow visualizing patterns over time. As cluster operators, platform teams implement safety checks and quotas to enforce tenancy and ensure stability for other apps; as reliability consultants, they encourage best-practice configurations for resource allocation and logging, and advise on performance optimizations and alert tuning. If using Grafana, manage your dashboards with code using grafana-operator.
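
If the stack includes the Prometheus Operator, domain teams can ship alerts alongside their apps as PrometheusRule resources. The alert below is only a sketch; the namespace, threshold, and labels are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-alerts
  namespace: checkout
spec:
  groups:
    - name: checkout.rules
      rules:
        - alert: CheckoutPodsRestarting
          # kube-state-metrics counter of container restarts in the team's namespace
          expr: increase(kube_pod_container_status_restarts_total{namespace="checkout"}[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Pods in the checkout namespace are restarting frequently
```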

Deliverables produced in this milestone include cluster features, best-practice libraries and snippets, dashboards, self-service guides, and operator manuals.

8. Templatize best practices and delineate golden paths

As domain teams iterate on their apps and improve their software craft, they'll arrive at practices and conventions that improve the ergonomics (and economics) of building and running their software. As a platform engineer, spot the patterns across teams and codebases, and seek consensus on redundant libraries or divergent themes. Take the practices and patterns most favored by teams and use them to publish template apps that devs can clone and rename to quickly get started with a new app. Keep each template fully functional and deploy them to every environment, production included; this way they serve both as live documentation and as the canary in the coal mine to test new platform features and cluster upgrades. The build and deployment workflows showcased by a template are the golden paths that users follow to take their app to production. This is a key place to showcase best practices like semantic versioning. To create advanced templating workflows, such as letting devs choose between different storage backends or whether their new app is a UI or a gRPC API, create a wizard-like experience with Backstage's Software Templates feature.
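
A trimmed Backstage Software Template illustrating that wizard-like flow might look like the sketch below; the template name, skeleton path, and GitHub org are placeholders:

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: grpc-api
  title: gRPC API service
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Name of the new service
        storage:
          type: string
          enum: [none, postgres, redis]     # example of letting devs choose a backend
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton                     # the template app kept running in every environment
        values:
          name: ${{ parameters.name }}
          storage: ${{ parameters.storage }}
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=example-org&repo=${{ parameters.name }}
```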

This milestone has an intentional focus on the developer experience, so surveying developers and reading their code are key to producing templates that devs will want to use. Deliverables include template apps running live, templating workflows, deployment pipelines and release workflows, how-to wikis, demo videos, and developer surveys.

9. Enter domain abstractions

This is a milestone for teams and organizations gaining momentum, where roadmaps are signaling new features and product development. Take configuration abstractions that have been in use by domain teams and have stabilized (see milestone #4), and consolidate them behind a common interface. Compose them into new abstractions and give them names that communicate to users what they do, e.g., an Application API Key, a Highly Available Application, a Tenant Namespace. Hide away the boilerplate bits, cement the good practices already internalized by the platform team (like resource allocation best practices), and focus on surfacing the config settings most relevant to devs. In Kubernetes platforms, Custom Resource Definitions (CRDs) are the building blocks for creating domain abstractions. They allow configuring apps and infrastructure with YAML, which is easily templated and fits well with GitOps and pull request workflows. In addition to Kubernetes-style declarative configuration, CRDs come with "controllers" that ensure resources are reconciled to a healthy state, allowing "operators" to orchestrate resources based on their health and lifecycle stage. For teams getting started, Kubebuilder is a must-try to understand the lifecycle of CRDs and see how the Reconciler pattern works in action. For teams looking to reconcile resources outside of Kubernetes, Crossplane provides a way to represent third-party resources and manage their lifecycle with pluggable providers.
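
As an illustration, a domain abstraction like a "Highly Available Application" could be exposed to devs as a custom resource. The API group, kind, and fields below are entirely hypothetical and would be backed by a CRD and controller the platform team builds:

```yaml
# Hypothetical custom resource; the controller expands it into Deployments,
# PodDisruptionBudgets, autoscalers, and the rest of the boilerplate.
apiVersion: platform.example.com/v1alpha1
kind: HighlyAvailableApplication
metadata:
  name: checkout-api
  namespace: checkout
spec:
  image: registry.example.com/checkout/checkout-api:1.4.2
  tier: critical                  # translated into replicas, spread constraints, resources
  ingress:
    host: checkout.example.com
  secrets:
    - checkout-db                 # wired to the secret machinery from milestone #6
```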

This milestone increases the configuration surface for platform teams and simplifies it for domain teams. It produces Kubernetes controllers and operators, new configuration templates, declarative config examples, and wikis.

10. Address concerns at the edge with an API Gateway

This is the milestone most pressing for engineering teams that support an internet-facing product, and for teams looking to benefit from the endpoint tenancy model unlocked by the Kubernetes Gateway API. It is also a milestone for mid-level to advanced teams, which makes it a subject for another post!
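
As a teaser of that tenancy model: the Gateway API lets the platform team own a shared Gateway while domain teams attach their own routes to it. The names and hostname below are placeholders:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
  namespace: checkout              # the domain team owns routes in its own namespace
spec:
  parentRefs:
    - name: shared-gateway         # Gateway owned by the platform team
      namespace: gateway-system
  hostnames:
    - checkout.example.com
  rules:
    - backendRefs:
        - name: checkout-api
          port: 8080
```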
