Nail the DevOps part as your company's CTO
The goal of this handbook is to give you clarity on DevOps:
- Understand what DevOps is (in simple words)
- Know what’s possible with DevOps (in simple goals)
- Get simple “when-to-do-what” DevOps guidelines
I added a bonus at the bottom of the article.
It's a production-ready setup example you could take inspiration from.
Who this article is for
You might be a founder who wishes to get started with DevOps the right way.
You might be the CTO of a 1,000-employee company who wants a set of simple principles.
Or, maybe you’re a Software Engineer, and you want to understand if your company’s DevOps approach is good.
If you’re looking for a simple DevOps playbook, this is it.
Understand the desired result
Two things your company needs to be able to do
- Serve its product to customers
- Build and improve the product
Abilities you need to build, improve, and serve software
- Run experiments and test changes
DevOps has a simple meaning
Developers and Operators have shared responsibility for building and improving the system.
In practice:
- Developers are responsible for “Operating” what they build
- DevOps Engineers are responsible for enabling developers to “Operate” AND for doing some of it themselves
Operate = provision, monitor, secure, configure, deploy, scale.
Choose a balance: Enabler, Doer, or Automator
The DevOps role will end up as a balance between:
- Enabler: Provides the tools and knowledge to fulfill the DevOps goals
- Doer: Does the tasks that fulfill the DevOps goals
- Automator: Automates any repeating operation
Know what things you should enable, do, or automate
- Provision infrastructure
- Secure the system
- Deploy workloads
- Monitor the system
- Recover from issues
- Scale up or down
- Track & test changes
- Automate processes
Choose the right tools
- Has state management = Saves time automating state-aware processes (e.g., Terraform)
- Has a big community & good docs = Saves time dealing with common issues (e.g., Kubernetes)
- Has multiple interface types: API, CLI, UI = Saves time integrating with the existing system (e.g., Vault)
You can also read about choosing tools here.
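To make the "state management" point concrete, here is a minimal Python sketch of what a state-aware tool does conceptually: it compares the resources you want with the state it recorded last time, and derives only the actions that are needed. The `plan` function and the resource names are hypothetical and only for illustration. This is not how Terraform is implemented.

```python
def plan(desired: dict[str, dict], state: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired resources with the recorded state and list the needed actions."""
    return {
        "create": sorted(name for name in desired if name not in state),
        "update": sorted(
            name for name in desired
            if name in state and desired[name] != state[name]
        ),
        "delete": sorted(name for name in state if name not in desired),
    }


# Hypothetical resources, for illustration only.
desired = {"vpc": {"cidr": "10.0.0.0/16"}, "db": {"size": "medium"}}
state = {"vpc": {"cidr": "10.0.0.0/16"}, "db": {"size": "small"}, "old-queue": {}}

actions = plan(desired, state)
print(actions)
```

Running the same plan again once the state matches the desired resources yields no actions. That is the time saver: you don't have to write the "what already exists?" logic yourself.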
Set useful goals
Adopting a few well-chosen DevOps goals will point you in the right direction:
- One-Click Environments: makes e2e tests easy and quick
- Atomic Commits: provides confidence that a tested change will work in production
- Separate the Shared & Env-Specific Parts: enables e2e tests as the company scales up
If you want to learn about more useful DevOps goals, feel free to book a free consultation here.
Enablers: Choose the Tools-to-Knowledge Balance
Developers can either have the knowledge or the tools to do something.
- More knowledge-reliance: if you want the developers to contribute to the DevOps efforts
- More tools-reliance: if you want to abstract the operations from the developers
If the balance between the two is not intentional, it’s accidental.
Doers: Have a good reason to do it
- Is it a one-time task?
- Does it teach you how the developers work?
- Are you directly accountable for the results of the task?
If you answered “no” to the above questions, enable or automate it instead.
Doing more = Learning the system's use-cases
Doing too much = Not scalable, too much knowledge-reliance
Automators: Have a good reason to automate it
- Did it happen before?
- Is it likely to happen again?
- Will automating it take less time than doing it?
- Will automating it teach you an important company process?
If you answered “yes” to 2 out of the 4 questions - automate it!
More automations = Less reliance on knowledge to operate the system.
Too many automations = No system awareness.
P.S. - you can also enable developers to automate it.
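The Doer and Automator checklists above can be sketched as two small predicates. This is a minimal Python sketch under two assumptions: the Doer rule is read as "do it only if at least one answer is yes", and "2 out of the 4" is read as "at least 2". The class names, function names, and the example task are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DoerAnswers:
    one_time_task: bool
    teaches_how_developers_work: bool
    directly_accountable: bool


@dataclass
class AutomatorAnswers:
    happened_before: bool
    likely_to_happen_again: bool
    faster_than_doing_it: bool
    teaches_company_process: bool


def has_reason_to_do(a: DoerAnswers) -> bool:
    # Assumption: all "no" answers mean "enable or automate it instead".
    return any([a.one_time_task, a.teaches_how_developers_work, a.directly_accountable])


def should_automate(a: AutomatorAnswers) -> bool:
    # Assumption: "2 out of the 4" is read as "at least 2".
    yes_count = sum([
        a.happened_before,
        a.likely_to_happen_again,
        a.faster_than_doing_it,
        a.teaches_company_process,
    ])
    return yes_count >= 2


# Hypothetical task: a manual cleanup that has already come up several times.
cleanup_doer = DoerAnswers(False, False, False)
cleanup_auto = AutomatorAnswers(True, True, False, False)

print("do it yourself:", has_reason_to_do(cleanup_doer))
print("automate it:", should_automate(cleanup_auto))
```

Writing the answers down like this is mostly useful as a team habit: it forces the "why am I doing this by hand?" conversation before the work starts.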
Create available DevOps Capacity
The DevOps needs of a company have spikes.
One month you need 2 DevOps Engineers, and the next month only half of that.
Switchovers between big efforts and small tasks are common.
This is true, especially for new companies.
Break the assumption: “DevOps tasks must be done by a DevOps Engineer”.
There are 3 types of DevOps capacity
- Non-Flexible: A full-time DevOps Engineer on the team
- Semi-Flexible: Key developers that can contribute to the DevOps goals
- Fully-Flexible: A flexible DevOps Services company or freelancer
You can read more about calculating the DevOps capacity your company needs here.
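As a rough illustration of why spikes matter, here is a minimal Python sketch that computes, per month, how much demand a fixed team cannot cover. The demand numbers and the `flexible_capacity_needed` function are hypothetical, and the unit ("DevOps Engineer equivalents") is an assumption for the example, not a formal method.

```python
def flexible_capacity_needed(monthly_demand: list[float], full_time_engineers: float) -> list[float]:
    """Per month, the demand that the non-flexible (full-time) capacity cannot cover."""
    return [max(0.0, demand - full_time_engineers) for demand in monthly_demand]


# Hypothetical demand, echoing the "2 one month, half of that the next" spike above.
demand = [2.0, 1.0, 1.5, 1.0]
gap = flexible_capacity_needed(demand, full_time_engineers=1.0)
print(gap)
```

The non-zero months are the ones to cover with semi-flexible or fully-flexible capacity instead of another full-time hire.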
When to focus on what: Common Dilemmas
When: You work alone, and the system is simple
Focus: On simplifying the development - Dockerize your apps, Create a post-commit pipeline that runs tests
When: You need to be able to create new environments quickly (for development, or for clients)
Focus: On implementing “One-Click Environments”: Using IaC (e.g., Terraform) + Deployment tool (Depends on the platform).
When: You want to e2e test every code modification, but there are many code modifications
Focus: On splitting the “One-Click Env” into a “base” with shared resources, and “env” with env-specific resources
When: You want to unify & standardize how you deploy, monitor, scale, configure, and secure your workloads
Focus: On implementing an orchestrator such as Kubernetes
When: You have many moving parts and wish to be certain a tested change will work
Focus: On implementing GitOps and considering a Monorepo (the sooner the better)
When: You want the DevOps efforts to be done by the dev team
Focus: On using “actual” IaC tools that use general-purpose languages (e.g., Pulumi with TypeScript/Python), and on writing full “how to operate” documentation (see above)
Never: Invest lots of time in new tech without a strong reason
Always:
- Have your code in Git
- Monitor the basic stuff: CPU, Memory, Disk, Network, App Logs, Cloud Costs
- Architect for high-availability
- Test before you deploy
BONUS: An example setup for a CTO approaching Production
2 AWS Accounts
- One for development and staging
- Another for production
Monorepo in Github
- Docker-Compose for local development
2 Infrastructure-as-Code projects: 'base' & 'apps'
- base = shared resources (e.g., VPC, RDS, ECS Cluster, EKS Cluster)
- apps = env-specific resources (e.g., Lambda Functions, ECS Services, Kubernetes Namespaces)
- config file per environment
Github Actions Workflow: Development workflow
- Checkout branch and locally develop + test changes
- Create a Pull Request: Deploys a Pull-Request ‘apps’ environment on the ‘development’ environment ‘base’
- On merge to main: Deploys from the ‘main’ branch an ‘apps’ environment onto the ‘development’ environment ‘base’
- Manual: Deploy from the ‘main’ branch onto the ‘staging’ / ‘production’ environment ‘base’
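The workflow above boils down to a mapping from a trigger to "which ‘apps’ environment lands on which ‘base’". Here is a minimal Python sketch of that mapping. The event names, the `pr-<number>` naming scheme, and the choice to name the manual ‘apps’ environment after its target are all assumptions for illustration, and this is not a GitHub Actions workflow file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    apps_env: str  # name of the env-specific 'apps' environment
    base_env: str  # the shared 'base' environment it is deployed onto
    git_ref: str   # branch the deployment is built from


def plan_deployment(event: str, branch: str = "main",
                    pr_number: Optional[int] = None,
                    target: Optional[str] = None) -> Deployment:
    if event == "pull_request":
        if pr_number is None:
            raise ValueError("pull_request events need a pr_number")
        # Each Pull Request gets its own 'apps' environment on the development 'base'.
        return Deployment(f"pr-{pr_number}", "development", branch)
    if event == "merge_to_main":
        return Deployment("main", "development", "main")
    if event == "manual":
        if target not in ("staging", "production"):
            raise ValueError("manual deployments target 'staging' or 'production'")
        return Deployment(target, target, "main")
    raise ValueError(f"unknown event: {event}")


print(plan_deployment("pull_request", branch="feature/login", pr_number=42))
print(plan_deployment("manual", target="production"))
```

Keeping this mapping in one place makes it easy to see that only ‘main’ ever reaches staging or production.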
Setup Notes
- Avoid mentioning an environment's name in the code to decide whether a resource gets deployed
- Use each environment’s config file to declare if a resource should be created
- Could be implemented using Terraform, Terragrunt, Pulumi, CDK, and other IaC tools
- Production should have 2 instances of every workload for high-availability
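Here is a minimal Python sketch of the first two notes: the code decides what to create from config values only, never from the environment's name. The config keys (`create_bastion`, `replicas`) and the resource names are hypothetical, and the per-environment config files are inlined as JSON strings to keep the example self-contained. In a real project this logic would live in your IaC tool of choice.

```python
import json

# Hypothetical per-environment config files, inlined as JSON strings for the sketch.
CONFIGS = {
    "development": '{"create_bastion": false, "replicas": 1}',
    "production": '{"create_bastion": true, "replicas": 2}',
}


def resources_for(config: dict) -> list[str]:
    """Decide what to create from config values only, never from the environment's name."""
    resources = [f"api-instance-{i}" for i in range(1, config["replicas"] + 1)]
    if config["create_bastion"]:  # instead of: if env_name == "production"
        resources.append("bastion-host")
    return resources


for env_name, raw_config in CONFIGS.items():
    print(env_name, resources_for(json.loads(raw_config)))
```

With this shape, adding a new environment means adding a config file, not hunting for `if env == ...` branches across the code.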
If you’d like to see this setup in your startup, click here to book a call 👈🏼
P.S. - I'll be updating this page occasionally, so you might want to visit again
Another Bonus: DevOps Dictionary for Human Beings
Term | Definition | Tools |
---|---|---|
Environment | A working instance of the entire system | |
CI (Continuous Integration) | Enable developers to collaborate by agreeing on a single source-of-truth (master/main) | Jenkins, Github Actions, GitlabCI |
CD (Continuous Delivery) | Create an artifact that’s ready for production (tested, tagged) | JFrog Artifactory, Nexus, AWS ECR |
CD (Continuous Deployment) | Every available deliverable (artifact) gets deployed automatically | ArgoCD, Jenkins, AWS CodeDeploy |
Monitoring / Observability | Collect metrics/traces/logs from apps and infrastructure, analyze and display them, and set up alerts | Prometheus, Jaeger, Elasticsearch, Fluentd, OpenTelemetry |
Infrastructure | The resources on which the workloads run, in which the data is stored, and through which the network flows | Servers, Databases, Network Routers & Switches |
Cloud Infrastructure | Same as the above, but specifically in the cloud | AWS EC2, AWS RDS, GCP Compute Engine, Azure Virtual Machines |
Cloud | Computing & Data services served from remote locations for you to build your system | AWS, Azure, GCP |
Containerization & Virtualization | Technologies utilizing Kernel & OS features to create virtual machines, or isolate processes (AKA run containers) | Docker, vSphere, KVM |
Secrets Management | Storing and retrieving sensitive configurations (e.g., tokens, passwords) | Hashicorp Vault, AWS Secrets Manager, SealedSecrets |
Configuration Management | Usually refers to preparing servers for workloads (e.g., creating directories & files, starting processes) | Ansible, Chef, Puppet |
Version Control | Saving the code in a versioned way (Git) | Github, Gitlab |
GitOps | Making sure the running system is the same as what's described in Git | Flux, ArgoCD, Jenkins |
Monorepo | All of the company’s code is in one Git Repository | NX, Turborepo |
Polyrepo | Multiple Git repositories for different components | |
IaC (Infrastructure-as-Code) | Creating Cloud infrastructure with idempotent code and state management | Terraform, Pulumi, CDK, Crossplane |
Deployment | Execute, serve, or install the artifacts | ArgoCD, Jenkins, AWS CodeDeploy, Scripts (Bash, Python, etc.) |
Orchestrator | Dynamically allocating workloads to a pool of nodes | Kubernetes, Nomad, AWS ECS |
Authentication & Authorization | Making sure each person, workload, or resource, has access only to what’s necessary (other workloads and resources) | AWS IAM, OpenID, OpenVPN, Twingate, Istio |
Service Discovery | Exposing available workloads using DNS | Consul, CoreDNS |
Get more practical advice
I post small nuggets of practical advice on the "MeteorOps Newsletter".
You can subscribe here 👈🏼
Top comments (6)
Great article, what hat do you wear professionally? CTO or DevOps?
Thank you, and I'm wearing a DevOps hat!
The background is that I spoke to many CTOs and found myself repeating some advice to most of them.
Figured I'd compile it into a handbook :)
Great article @michaelzion !
Thank you Alex!
Great article Michael!
Thank you for posting!
Thank you for the feedback!