Michael Zion for MeteorOps

Posted on Dec 25, 2023 • Edited on Jan 8, 2024 • Originally published at meteorops.com

The CTO DevOps Handbook: Simple Principles and Examples

#devops #cto #softwareengineering #programming

Nail the DevOps part as your company's CTO

The goal of this handbook is to give you clarity on DevOps:

Understand what DevOps is (in simple words)
Know what’s possible with DevOps (in simple goals)
Get simple “when-to-do-what” DevOps guidelines

I added a bonus at the bottom of the article.
It's a production-ready setup example you could take inspiration from.

Who this article is for

You might be a founder who wishes to get started with DevOps the right way.

You might be the CTO of a 1,000-employee company who wants simple principles.

Or, maybe you’re a Software Engineer, and you want to understand if your company’s DevOps approach is good.

If you’re looking for a simple DevOps playbook, this is it.

Understand the desired result

Two things your company needs to be able to do

Serve its product to customers
Build and improve the product

Abilities you need to build, improve, and serve software

Run experiments and test changes

DevOps has a simple meaning

Developers and Operators have shared responsibility for building and improving the system.

In practice:

Developers are responsible for “Operating”
DevOps Engineers are responsible for enabling developers to “Operate” AND for doing some of it themselves

Operate = provision, monitor, secure, configure, deploy, scale.

Choose a balance: Enabler, Doer, or Automator

The DevOps role will end up as a balance between:

Enabler: Provides the tools and knowledge to fulfill the DevOps goals
Doer: Does the tasks that fulfill the DevOps goals
Automator: Automates any repeating operation

Know what things you should enable, do, or automate

Provision infrastructure
Secure the system
Deploy workloads
Monitor the system
Recover from issues
Scale up or down
Track & test changes
Automate processes

Choose the right tools

Has state management = Saves time automating state-aware processes (e.g., Terraform)
Has a big community & good docs = Saves time dealing with common issues (e.g., Kubernetes)
Has multiple interface types: API, CLI, UI = Saves time integrating with the existing system (e.g., Vault)

You can also read about choosing tools here.

Set useful goals

Some DevOps goals, once adopted, will point you in the right direction:

One-Click Environments: makes e2e tests easy and quick
Atomic Commits: provides confidence that a tested change will work in production
Separate the Shared & Env-Specific Parts: enables e2e tests as the company scales up

If you want to learn about more useful DevOps goals, feel free to book a free consultation here.

Enablers: Choose the Tools-to-Knowledge Balance

Developers can either have the knowledge or the tools to do something.

More knowledge-reliance: if you want the developers to contribute to the DevOps efforts
More tools-reliance: if you want to abstract the operations from the developers

If the balance between the two is not intentional, it’s accidental.

Doers: Have a good reason to do it

Is it a one-time task?
Does it teach you how the developers work?
Are you directly accountable for the results of the task?

If you answered “no” to the above questions, enable or automate it instead.
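As a rough illustration, the three questions above can be written as a small checklist. This is only a sketch: the `Task` fields and the `should_do_it_yourself` name are hypothetical, and it assumes the rule means "if every answer is no, enable or automate it instead".

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A DevOps task described by the three 'Doer' questions (hypothetical model)."""
    name: str
    is_one_time: bool
    teaches_how_developers_work: bool
    directly_accountable: bool


def should_do_it_yourself(task: Task) -> bool:
    """Return True if at least one 'Doer' question is answered 'yes'."""
    return (
        task.is_one_time
        or task.teaches_how_developers_work
        or task.directly_accountable
    )


# Example tasks, invented for illustration only.
migration = Task("one-off database migration", True, False, False)
weekly_restart = Task("weekly cache restart", False, False, False)

print(should_do_it_yourself(migration))       # True: it is a one-time task
print(should_do_it_yourself(weekly_restart))  # False: enable or automate it instead
```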

Doing more = Learning the system's use-cases

Doing too much = Not scalable, too much knowledge-reliance

Automators: Have a good reason to automate it

Did it happen before?
Is it likely to happen again?
Will automating it take less time than doing it?
Will automating it teach you an important company process?

If you answered “yes” to 2 out of the 4 questions - automate it!
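The four questions above can also be turned into a tiny scoring function. This is a sketch, not a formal rule: the function and parameter names are hypothetical, and it reads "2 out of the 4" as "at least 2".

```python
def should_automate(
    happened_before: bool,
    likely_to_happen_again: bool,
    faster_to_automate_than_to_do: bool,
    teaches_company_process: bool,
    threshold: int = 2,
) -> bool:
    """Return True when at least `threshold` of the four answers are 'yes'."""
    answers = [
        happened_before,
        likely_to_happen_again,
        faster_to_automate_than_to_do,
        teaches_company_process,
    ]
    # bool is a subclass of int in Python, so sum() counts the 'yes' answers.
    return sum(answers) >= threshold


# It happened before and will likely happen again: two 'yes' answers.
print(should_automate(True, True, False, False))   # True

# Only one 'yes' answer: do it by hand (or enable someone else to).
print(should_automate(False, True, False, False))  # False
```

Keeping the threshold as a parameter makes the trade-off explicit: raising it pushes work toward "doing", lowering it pushes work toward automation.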

More automations = Less reliance on knowledge to operate the system.

Too many automations = No system awareness.

P.S. - you can also enable developers to automate it.

Create available DevOps Capacity

The DevOps needs of a company have spikes.

One month you need 2 DevOps Engineers, and half of that the next month.

Switchovers between big efforts and small tasks are common.

This is especially true for new companies.

Break the assumption: “DevOps tasks must be done by a DevOps Engineer”.

There are 3 types of DevOps capacity

Non-Flexible: A full-time DevOps Engineer on the team
Semi-Flexible: Key developers that can contribute to the DevOps goals
Fully-Flexible: A flexible DevOps Services company or freelancer

You can read more about calculating the DevOps capacity your company needs here.

When to focus on what: Common Dilemmas

When: You work alone, and the system is simple

Focus: On simplifying the development - Dockerize your apps, Create a post-commit pipeline that runs tests

When: You need to be able to create new environments quickly (for development, or for clients)

Focus: On implementing “One-Click Environments”: Using IaC (e.g., Terraform) + Deployment tool (Depends on the platform).

When: You want to e2e test every code modification, but there are many code modifications

Focus: On splitting the “One-Click Env” into a “base” with shared resources, and “env” with env-specific resources

When: You want to unify & standardize how you deploy, monitor, scale, configure, and secure your workloads

Focus: On implementing an orchestrator such as Kubernetes

When: You have many moving parts and wish to be certain a tested change will work

Focus: On implementing GitOps and considering a Monorepo (the sooner the better)

When: You want the DevOps efforts to be done by the dev team

Focus: On using “actual” IaC tools (e.g., Pulumi with TypeScript/Python) and writing full “how to operate” documentation (see above)

Never: Invest lots of time in new tech without a strong reason

Always:

Have your code in Git
Monitor the basic stuff: CPU, Memory, Disk, Network, App Logs, Cloud Costs
Architect for high-availability
Test before you deploy

BONUS: An example setup for a CTO approaching Production

2 AWS Accounts

One for development and staging
Another for production

Monorepo in GitHub

Docker Compose for local development

2 Infrastructure-as-Code projects: 'base' & 'apps'

base = shared resources (e.g., VPC, RDS, ECS Cluster, EKS Cluster)
apps = env-specific resources (e.g., Lambda Functions, ECS Services, Kubernetes Namespaces)
config file per environment

GitHub Actions Workflow: Development workflow

Check out a branch, then develop and test changes locally
Create a Pull Request: Deploys a Pull-Request ‘apps’ environment on the ‘development’ environment ‘base’
On merge to main: Deploys from the ‘main’ branch an ‘apps’ environment onto the ‘development’ environment ‘base’
Manual: Deploy from the ‘main’ branch onto the ‘staging’ / ‘production’ environment ‘base’
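The routing in the workflow above can be sketched as a plain function that maps a trigger to a deployment target. Everything here is illustrative: the event labels, the `Deployment` fields, and the `pr-<number>` naming are assumptions, not the actual pipeline or GitHub Actions syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    apps_env: str    # name of the 'apps' environment to create or update
    base_env: str    # 'base' environment the apps are deployed onto
    source_ref: str  # Git branch the deployment is built from


def plan_deployment(
    event: str,
    branch: str = "main",
    pr_number: Optional[int] = None,
    target: Optional[str] = None,
) -> Deployment:
    """Map a workflow trigger to a deployment, following the steps above."""
    if event == "pull_request":
        if pr_number is None:
            raise ValueError("pull_request events need a pr_number")
        # Each Pull Request gets its own 'apps' env on the shared development base.
        return Deployment(f"pr-{pr_number}", "development", branch)
    if event == "merge_to_main":
        return Deployment("main", "development", "main")
    if event == "manual":
        if target not in ("staging", "production"):
            raise ValueError("manual deployments target staging or production")
        # Assumption: the 'apps' env is named after its target environment.
        return Deployment(target, target, "main")
    raise ValueError(f"unknown event: {event}")


print(plan_deployment("pull_request", branch="feature/login", pr_number=42))
print(plan_deployment("manual", target="production"))
```

Note that staging and production are only reachable through the manual path, and always from `main`, which mirrors the last step of the workflow.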

Setup Notes

Avoid mentioning an environment's name in the code to conditionally deploy resources
Use each environment’s config file to declare if a resource should be created
Could be implemented using Terraform, Terragrunt, Pulumi, CDK, and other IaC tools
Production should have 2 instances of every workload for high availability
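To make the first two notes concrete, here is a minimal sketch of config-driven resource creation. It is written in plain Python rather than a real IaC tool, and the config keys (`create_read_replica`, `workload_instances`) and resource names are hypothetical. The point is that the code branches on config values and never on the environment's name.

```python
from typing import Any, Dict, List

# Hypothetical per-environment config files, shown inline as dicts.
ENV_CONFIGS: Dict[str, Dict[str, Any]] = {
    "development": {"create_read_replica": False, "workload_instances": 1},
    "staging": {"create_read_replica": False, "workload_instances": 1},
    "production": {"create_read_replica": True, "workload_instances": 2},
}


def plan_resources(config: Dict[str, Any]) -> List[str]:
    """Build the resource list from config flags only.

    There is deliberately no `if env == "production"` check in here.
    """
    resources = ["database"]
    if config["create_read_replica"]:
        resources.append("database-read-replica")
    for i in range(1, config["workload_instances"] + 1):
        resources.append(f"workload-{i}")
    return resources


print(plan_resources(ENV_CONFIGS["development"]))
# ['database', 'workload-1']
print(plan_resources(ENV_CONFIGS["production"]))
# ['database', 'database-read-replica', 'workload-1', 'workload-2']
```

With this shape, adding a new environment means adding a config file, not editing the code.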

If you’d like to see this setup in your startup, click here to book a call 👈🏼

P.S. - I'll be updating this page occasionally, so you might want to visit again

Another Bonus: DevOps Dictionary for Human Beings

Term	Definition	Tools
Environment	A working instance of the entire system
CI (Continuous Integration)	Enable developers to collaborate by agreeing on a single source-of-truth (master/main)	Jenkins, Github Actions, GitlabCI
CD (Continuous Delivery)	Create an artifact that’s ready for production (tested, tagged)	JFrog Artifactory, Nexus, AWS ECR
CD (Continuous Deployment)	Every available deliverable (artifact) gets deployed automatically	ArgoCD, Jenkins, AWS CodeDeploy
Monitoring / Observability	Collect metrics/traces/logs from apps and infrastructure, analyze and display them, and set up alerts	Prometheus, Jaeger, Elasticsearch, Fluentd, OpenTelemetry
Infrastructure	The resources on which the workloads run, in which the data is stored, and through which the network flows	Servers, Databases, Network Routers & Switches
Cloud Infrastructure	Same as the above, but specifically in the cloud	AWS EC2, AWS RDS, GCP Compute Engine, Azure Virtual Machines
Cloud	Computing & Data services served from remote locations for you to build your system	AWS, Azure, GCP
Containerization & Virtualization	Technologies utilizing Kernel & OS features to create virtual machines, or isolate processes (AKA run containers)	Docker, vSphere, KVM
Secrets Management	Storing and retrieving sensitive configurations (e.g., tokens, passwords)	Hashicorp Vault, AWS Secrets Manager, SealedSecrets
Configuration Management	Usually refers to preparing servers for workloads (e.g., creating directories & files, starting processes)	Ansible, Chef, Puppet
Version Control	Saving the code in a versioned way (Git)	Github, Gitlab
GitOps	Making sure the system matches what’s described in Git	Flux, ArgoCD, Jenkins
Monorepo	All of the company’s code is in one Git Repository	NX, Turborepo
Polyrepo	Multiple Git repositories for different components
IaC (Infrastructure-as-Code)	Creating Cloud infrastructure with idempotent code and state management	Terraform, Pulumi, CDK, Crossplane
Deployment	Execute, serve, or install the artifacts	ArgoCD, Jenkins, AWS CodeDeploy, Scripts (Bash, Python, etc.)
Orchestrator	Dynamically allocating workloads to a pool of nodes	Kubernetes, Nomad, AWS ECS
Authentication & Authorization	Making sure each person, workload, or resource, has access only to what’s necessary (other workloads and resources)	AWS IAM, OpenID, OpenVPN, Twingate, Istio
Service Discovery	Exposing available workloads using DNS	Consul, CoreDNS

Get more practical advice

I post small nuggets of practical advice on the "MeteorOps Newsletter".
You can subscribe here 👈🏼

Top comments (6)

sreejinsreenivasan • Jan 4 '24

Great article, what hat you wear professionally? CTO or DevOps?

Michael Zion • Jan 4 '24

Thank you, and I'm wearing a DevOps hat!
The background is that I spoke to many CTOs and found myself repeating some advice to most of them.
Figured I'd compile it into a handbook :)