tl;dr
We've been happy with all the features Kubernetes and Istio offer for creating a service mesh for APIs and web apps. It comes at a cost, though, and requires a team of people willing to sometimes dive deep into the unknown.
Kubernetes is pretty mature and if you are using a cloud provider you can get up and running pretty fast. Keeping it up to date is also fairly straightforward and, most of the time, upgrading components won't even require downtime.
Istio on the other hand requires a lot more time to get right and, although it adds a really wide array of capabilities, it has a considerable maintenance cost.
If you don't have a safe environment to learn and fail (be it due to time constraints or the criticality of projects), I recommend sticking with just Kubernetes; you'll get a lot from using it. Meanwhile, keep an eye on service mesh capabilities, since you'll probably find the need for them early on, and maybe try implementing them one at a time.
If you want more details on why, keep reading :)
Before you begin
Let's just get the fact that Kubernetes is complex out of the way early on in this article. There are many components to learn, and you have to figure out the right configuration across multiple YAML files before you get your application up and running.
Imagine if you were part of a team or company that has to run 100+ microservices. Even if all of them are extremely well designed and developed into a standard tech stack, following all the best practices and building quality into every step of the way, it is still a pretty hard and complex problem to solve.
Running a distributed, service-oriented architecture is really complex. There's no easy way to do it (and if anyone tells you there is, they either have never done it or are straight up lying). In my opinion Kubernetes offers a programmatic and pretty standardized way to deal with all the requirements of running a distributed architecture. The API is well designed and after the initial learning curve you'll find it to be quite powerful.
Kubernetes will require you to really understand your applications. How can you signal that a service is ready to receive traffic? How can you signal that it is unhealthy and needs to be restarted? How many instances should be created? How do you route traffic between different versions of a service?
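To make the first two questions concrete, here is a minimal sketch of how an application could expose readiness and liveness signals over HTTP, using only the Python standard library. The paths /ready and /healthz, the AppState class and the port are assumptions chosen for illustration, not something Kubernetes mandates; whatever you pick has to match the readinessProbe and livenessProbe you configure for the container.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Flags the application flips as it starts up or hits a fatal problem."""

    def __init__(self):
        self.ready = threading.Event()    # set once warm-up has finished
        self.healthy = threading.Event()  # cleared when the app cannot recover
        self.healthy.set()


def probe_status(state, path):
    """Map a probe path (hypothetical paths) to an HTTP status code."""
    if path == "/healthz":
        # Liveness: a failing probe asks Kubernetes to restart the container.
        return 200 if state.healthy.is_set() else 503
    if path == "/ready":
        # Readiness: a failing probe takes the pod out of the Service endpoints.
        return 200 if state.ready.is_set() else 503
    return 404


def make_handler(state):
    """Build a request handler class bound to the given application state."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(state, self.path))
            self.end_headers()

        def log_message(self, *args):
            pass  # keep probe traffic out of the application logs

    return ProbeHandler


def serve(state, port=8080):
    """Serve the probe endpoints until the process is stopped."""
    HTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()
```

In a real service you would start serve(state) in a background thread, call state.ready.set() only after connections and caches are warmed up, and call state.healthy.clear() when the process is stuck in a way only a restart can fix.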
In my experience as a developer these questions were rarely asked, but to really benefit from Kubernetes and Istio you need to have answers sooner rather than later.
I want to highlight two main areas which I think are really important in a distributed environment: compute engine (where applications are running) and service mesh (how all the services integrate with each other).
Compute engine
The first big challenge of adopting Kubernetes and Istio is not the technology but the mindset. As mentioned in the previous section, if you are deploying your application in Kubernetes you need to have answers for a lot of questions that rarely get asked, and looking at the bigger picture can be a daunting experience.
The benefit you gain from running an application in Kubernetes is pretty incredible. Having an elastic and highly available infrastructure frees you from worrying about a lot of common problems. Once you put all the pieces in place it is a pretty smooth ride. The same principle is valid for the infrastructure itself: creating a cluster that scales as needed is fairly straightforward if you use a cloud provider.
From the perspective of developing the infrastructure itself, deploying and running a cluster is not as hard as most people may think - but not as easy as shown in a presentation. Many cloud providers have their own managed version of Kubernetes, which takes care of a whole lot of concerns you'd have running a cluster on your own.
The second biggest challenge you will probably have is keeping an eye on all the moving pieces: Kubernetes, Istio and many, many, many plugins. Luckily the ecosystem around both tools is ever growing and, with a pretty standard API, managing resources in a cluster is not the worst experience in the world.
To give an example, right out of the box Istio will give you a good array of monitoring tools like Prometheus, Kiali and Jaeger, and plugging in other services (like Datadog or Splunk) is also straightforward since most of them provide easy adapters. We then built templates for alerts and dashboards that work based off of the standard metrics that Istio and Kubernetes plugins provide. This way all teams had a good baseline of observability before they even started instrumenting their applications.
Keeping track of all the dependencies and making sure everything is kept up to date is another big challenge. Important security and performance updates are released constantly, and some tools may not even be at 1.0+ yet. You'll need to invest time to make sure you have a pretty good testing environment and are confident in your CI/CD process to release changes.
Service Mesh
Defining a service mesh is pretty hard because there is a lot to unpack, especially with Istio and the many capabilities it brings. Istio extends the Kubernetes API by creating its own resources and allowing you to define configurations ranging from terminating TLS and encrypting requests between services to traffic shaping and tracing across multiple services.
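As an illustration of what "extending the Kubernetes API" looks like for traffic shaping, here is a minimal sketch that builds an Istio VirtualService manifest splitting HTTP traffic between two versions of a service. The service name "orders", the subsets "v1" and "v2" and the 90/10 split are hypothetical, and the apiVersion shown is one of the versions Istio has published for this resource, so check which one your installation serves. The subsets also have to be declared in a matching DestinationRule, which is not shown here.

```python
import json


def weighted_virtual_service(name, host, weights):
    """Build a VirtualService manifest that splits HTTP traffic by subset.

    `weights` maps a subset name to a percentage, e.g. {"v1": 90, "v2": 10}.
    """
    if sum(weights.values()) != 100:
        raise ValueError("route weights must add up to 100")
    routes = [
        {"destination": {"host": host, "subset": subset}, "weight": weight}
        for subset, weight in weights.items()
    ]
    return {
        "apiVersion": "networking.istio.io/v1alpha3",  # assumption: adjust to your Istio version
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {"hosts": [host], "http": [{"route": routes}]},
    }


# Hypothetical service and subsets, for illustration only.
manifest = weighted_virtual_service("orders", "orders", {"v1": 90, "v2": 10})
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied like any other Kubernetes resource, since kubectl accepts JSON manifests as well as YAML.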
This also means that when something doesn't quite work it can be really hard to find out why, which is definitely the biggest challenge with a service mesh. Things go wrong without much explanation and require you to look into the logs of the many running components to find clues about what is missing, misconfigured or simply broken.
To give a more concrete example, let me tell you about when we tried to set up automatic TLS resolution. With Istio you can set up TLS termination outside of the application, meaning the app running in the Docker container uses simple HTTP but the requests are served with HTTPS. This not only simplifies the application code but would also allow us to automate the whole process by having an automated TLS certificate generation process.
All was great: teams could deploy their own TLS keys and certificates and Istio would take care of TLS termination. Then, one day, all ingress traffic to the cluster started to fail; no request would make it to any app in the cluster. After long sessions of debugging the problem (from the external load balancers all the way to the VPC configuration) we finally saw some logs indicating that Istio's ingress application was not able to parse a certificate. Although we thought it had nothing to do with the issue, once we deleted the bad cert all traffic came back to normal.
We then saw on a similar GitHub issue that we could set up a default certificate to be used when the ingress pod failed to load a certificate, saving it from looping indefinitely or crashing. Once we deployed this configuration the problem did not come back, and even when someone deployed a bad cert the service mesh would handle it gracefully.
There are many other cases that required digging really deep into logs, GitHub issues and source code. If your team doesn't have a safe environment to learn, or simply doesn't want to deal with that kind of experience, I definitely recommend not using Istio, or else you'll get burned and start posting memes on Twitter about how complex it is.
Remember to smell the flowers
Seeing everything coming together is a really good feeling and while there are many reasons to worry and lots of things to learn, it feels great to intentionally destroy 90% of your resources and watch as everything comes back automatically and no downtime is noticed :)
It may be overwhelming to get started with Kubernetes and Istio, but they also have great people doing a really good job to solve really complex problems. I am thankful for the many people dedicating their time and efforts to develop those tools.
This post has been almost a brain dump from when I stopped and looked at what we've put together over a year. If you want to know more about some specific area please let me know in the comments or at @MarcosBrizeno. I plan on writing a few other posts to talk about the architecture and components we use - and why we use them.
A bit of context
I am part of a team focusing on the cloud infrastructure where application teams can deploy and run their services. It is not a DevOps team or even an infrastructure team; we try to provide the best developer experience we can so that the other teams feel empowered and are enabled to do the best job they can.
Most of the services running on our platform are greenfield applications - mostly APIs, a few UIs and some background jobs. The tech stack is pretty standard, about 80% of all apps use the same language/web framework.
The teams deploying applications vary in experience with microservices, continuous integration and cloud native development.
See you space cowboy
Top comments (2)
Thanks for your article and for sharing your experience. I've been working with my teammates on our Kubernetes (on AKS) and Istio journey for 2 months, and we have already seen some challenges when things broke and we needed to drill into many different layers (issues on the in/egress gateway, Envoy sidecar, policies, etc.) in order to troubleshoot. Would you mind sharing any tips or monitoring strategy from your experience with a prod environment? In particular, which kinds of logs do you monitor closely? Any recommendations or best practices on the Istio settings? Or any commands or logs you would check first in order to improve efficiency when troubleshooting issues on Istio?
Hi, yeah I think everyone has been in a situation where things just don't work and you can't reproduce the problem.
I think understanding all the pieces in between someone making a request and the application inside the pod responding to it is really important and then you will be able to look at the right place faster. This talk "the life of a packet through istio" (youtube.com/watch?v=cB611FtjHcQ) is really good and it goes into a lot of details.
For logs I usually look at Ingress, Mixer and the application sidecar. Doing a port-forward and setting the istio-proxy log level to debug gives a lot of information and then you can read through everything to try and find what could be wrong.
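As a rough sketch of that last step: once a port-forward to the sidecar is running (for example `kubectl port-forward <pod> 15000:15000`, where the pod name is a placeholder), the log level can be changed by sending a POST to Envoy's admin endpoint. The snippet below assumes the admin interface is reachable on localhost port 15000, which is the usual default for istio-proxy but worth confirming in your setup; the function names are made up for illustration.

```python
import urllib.request

ENVOY_ADMIN_PORT = 15000  # assumption: default Envoy admin port in istio-proxy
LOG_LEVELS = {"trace", "debug", "info", "warning", "error", "critical", "off"}


def logging_url(level, host="localhost", port=ENVOY_ADMIN_PORT):
    """Build the Envoy admin URL that sets every logger to `level`."""
    if level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    return f"http://{host}:{port}/logging?level={level}"


def set_proxy_log_level(level, host="localhost", port=ENVOY_ADMIN_PORT):
    """POST to the admin endpoint and return Envoy's text response."""
    request = urllib.request.Request(
        logging_url(level, host, port), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode()
```

Calling set_proxy_log_level("debug") while the port-forward is active switches the sidecar to debug logging; remember to set it back to a quieter level such as "warning" afterwards, because debug output is very verbose.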
As for best practices, I think this is where Istio in particular is struggling the most, simply because you can do so much with it that it is hard to create any sort of convention or capture best practices. One thing that helped us was running istioctl validate (istio.io/docs/reference/commands/i...) on resources being deployed to avoid potential issues - we made it part of our pipelines and also an admission controller validation.
Good luck on your journey!