Serverless Chats

Episode #107: Serverless Infrastructure as Code with Ben Kehoe

Jun 28 '21

About Ben Kehoe
Ben Kehoe is a Cloud Robotics Research Scientist at iRobot and an AWS Serverless Hero. As a serverless practitioner, Ben focuses on enabling rapid, secure-by-design development of business value by using managed services and ephemeral compute (like FaaS). Ben also seeks to amplify voices from dev, ops, and security to help the community shape the evolution of serverless and event-driven designs.

Twitter: @ben11kehoe
Medium: ben11kehoe
GitHub: benkehoe
LinkedIn: ben11kehoe
iRobot: www.irobot.com

Watch this episode on YouTube: https://youtu.be/B0QChfAGvB0

This episode is sponsored by CBT Nuggets and Lumigo.

Transcript
Jeremy: Hi, everyone. I'm Jeremy Daly.

Rebecca: And I'm Rebecca Marshburn.

Jeremy: And this is Serverless Chats. And this is a momentous occasion on Serverless Chats because we are welcoming in Rebecca Marshburn as an official co-host of Serverless Chats.

Rebecca: I'm pretty excited to be here. Thanks so much, Jeremy.

Jeremy: So for those of you that have been listening for hopefully a long time, and we've done over 100 episodes. And I don't know, Rebecca, do I look tired? I feel tired.

Rebecca: I've never seen you look tired.

Jeremy: Okay. Well, I feel tired because we've done a lot of these episodes and we've published a new episode every single week for the last 107 weeks, I think at this point. And so what we're going to do is with you coming on as a new co-host, we're going to take a break over the summer. We're going to revamp. We're going to do some work. We're going to put together some great content. And then we're going to come back on, I think it's August 30th with a new episode and a whole new show. Again, it's going to be about serverless, but what we're thinking is ... And, Rebecca, I would love to hear your thoughts on this as I come at things from a very technical angle, because I'm an overly technical person, but there's so much more to serverless. There's so many other sides to it that I think that bringing in more perspectives and really being able to interview these guests and have a different perspective I think is going to be really helpful. I don't know what your thoughts are on that.

Rebecca: Yeah. I love the tech side of things. I am not as deep in the technicalities of tech and I come at it I think from a way of loving the stories behind how people got there and perhaps who they worked with to get there, the ideas of collaboration and community because nothing happens in a vacuum and there's so much stuff happening and sharing knowledge and education and uplifting each other. And so I'm super excited to be here and super excited that one of the first episodes I get to work on with you is with Ben Kehoe because he's all about both the technicalities of tech, and also it's actually on his Twitter, a new compassionate tech values around humility, and inclusion, and cooperation, and learning, and being a mentor. So couldn't have a better guest to join you in the Serverless Chats community and being here for this.

Jeremy: I totally agree. And I am looking forward to this. I'm excited. I do want the listeners to know we are testing in production, right? So we haven't run any unit tests, no integration tests. I mean, this is straight test in production.

Rebecca: That's the best practice, right? Total best practice to test in production.

Jeremy: Best practice. Right. Exactly.

Rebecca: Straight to production, always test in production.

Jeremy: Push code to the cloud. Here we go.

Rebecca: Right away.

Jeremy: Right. So if it's a little bit choppy, we'd love your feedback though. The listeners can be our observability tool and give us some feedback and we can ... And hopefully continue to make the show better. So speaking of Ben Kehoe, for those of you who don't know Ben Kehoe, I'm going to let him introduce himself, but I have always been a big fan of his. He was very, very early in the serverless space. I read all his blogs very early on. He was an early AWS Serverless Hero. So joining us today is Ben Kehoe. He is a cloud robotics research scientist at iRobot, as I said, an AWS Serverless Hero. Ben, welcome to the show.

Ben: Thanks for having me. And I'm excited to be a guinea pig for this new exciting format.

Rebecca: So many observability tools watching you be a guinea pig too. There's lots of layers to this.

Jeremy: Amazing. All right. So Ben, why don't you tell the listeners for those that don't know you a little bit about yourself and what you do with serverless?

Ben: Yeah. So I mean, as with all software, software is people, right? It's like Soylent Green. And so I'm really excited for this format being about the greater things that technology really involves in how we create it and set it up. And serverless is about removing the things that don't matter so that you can focus on the things that do matter.

Jeremy: Right.

Ben: So I've been interested in that since I learned about it. And at the time saw that I could build things without running servers, without needing to deal with the scaling of stuff. I've been working on that at iRobot for over five years now. As you said, early on in serverless, at the first Serverlessconf organized by A Cloud Guru, now Pluralsight.

Jeremy: Right.

Ben: And yeah. And it's been really exciting to see it grow into the large-scale community that it is today and all of the ways in which community are built like this podcast.

Jeremy: Right. Yeah. I love everything that you've done. I love the analogies you've used. I mean, you've always gone down this road of how do you explain serverless in a way to show really the adoption of it and how people can take that on. Serverless is a ladder. Some of these other things that you would ... I guess the analogies you use were always great and always helped me. And of course, I don't think we've ever really come to a good definition of serverless, but we're not talking about that today. But ...

Ben: There isn't one.

Jeremy: There isn't one, which is also a really good point. So yeah. So welcome to the show. And again, like I said, testing in production here. So, Rebecca, jump in when you have questions and we'll beat up Ben from both sides on this, but, really ...

Rebecca: We're going to hug Ben from both sides.

Jeremy: There you go. We'll embrace him from both sides. There you go.

Rebecca: Yeah. Yeah.

Jeremy: So one of the things, Ben, that you have also been very outspoken on, which I absolutely love because I'm very closely aligned on this topic, is infrastructure as code. And so let's start just quickly. I mean, I think a lot of people know or I think people working in the cloud know what infrastructure as code is, but I also think there's a lot of people who don't. So let's just take a quick second, explain what infrastructure as code is and what we mean by that.

Ben: Sure. To my mind, infrastructure as code is about having a definition of the state of your infrastructure that you want to see in the cloud. So rather than using operations directly to modify that state, you have a unified definition of some kind. I actually think infrastructure is now the wrong word with serverless. It used to be with servers, you could manage your fleet of servers separate from the software that you were deploying onto the servers. And so infrastructure being the structure below made sense. But now as your code is intimately entwined in the rest of your resources, I tend to think of resource graph definitions rather than infrastructure as code. It's a less convenient term, but I think it's worth understanding the distinction or the difference in perspective.
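
The distinction Ben draws here, a definition of the desired state versus operations applied directly, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the resource names, the dictionary layout, and the `plan_changes` helper are hypothetical and are not part of any AWS API.

```python
# Hypothetical sketch: the developer declares the state they want, and an
# engine derives the operations needed to get there. Not an AWS API.

def plan_changes(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Return the (operation, resource_name) steps that turn `actual` into `desired`."""
    steps = []
    for name in sorted(desired.keys() - actual.keys()):
        steps.append(("create", name))
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            steps.append(("update", name))
    for name in sorted(actual.keys() - desired.keys()):
        steps.append(("delete", name))
    return steps


# Illustrative resources only.
desired = {
    "OrdersTable": {"type": "table", "billing": "on-demand"},
    "OrdersFunction": {"type": "function", "memory_mb": 256},
}
actual = {
    "OrdersFunction": {"type": "function", "memory_mb": 128},
    "OldQueue": {"type": "queue"},
}

for operation, name in plan_changes(desired, actual):
    print(operation, name)
```

The developer only edits `desired`; the list of create, update, and delete steps is derived, which is roughly the role Ben later assigns to the plan generated inside the CloudFormation engine.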

Jeremy: Yeah. No, and I totally get that. I mean, I remember even early days of cloud when we were using the Chefs and the Puppets and things like that, that we were just deploying the actual infrastructure itself. And sometimes you deploy software as part of that, but it was supporting software. It was the stuff that ran in the runtime and some of those and some configurations, but yeah, but the application code that was a whole separate process, and now with serverless, it seems like you're deploying all those things at the same time.

Ben: Yeah. There's no way to pick it apart.

Jeremy: Right. Right.

Rebecca: Ben, there's something that I've always really admired about you and that is how strongly you hold your opinions. You're fervent about them, but it's also because they're based on this thorough nature of investigation and debate and challenging different people and yourself to think about things in different ways. And I know that the rest of this episode is going to be full with a lot of opinions. And so before we even get there, I'm curious if you can share a little bit about how you end up arriving at these, right? And holding them so steady.

Ben: It's a good question. Well, I hope that I'm not inflexible in these strong opinions that I hold. I mean, it's one of those strong opinions loosely held kind of things that new information can change how you think about things. But I do try and do as much thinking as possible so that there's less new information that I have to encounter to change an opinion.

Rebecca: Yeah. Yeah.

Ben: Yeah. I think I tend to try and think about how people ... But again, because it's always people. How people interact with the technology, how people behave, how organizations behave, and then how technology fits into that. Because sometimes we talk about technology in a vacuum and it's really not. Technology that works for one context doesn't work for another. I mean, a lot of my strong opinions are that there is no one right answer kind of a thing, or here's a framework for understanding how to think about this stuff. And then how that fits into a given person is just finding where they are in that more general space. Does that make sense? So it's less about finding out here's the one way to do things and more about finding what are the different options, how do you think about the different options that are out there.

Rebecca: Yeah, totally makes sense. And I do want to compliment you. I do feel like you are very good at inviting new information in if people have it and then you're like, "Aha, I've already thought of that."

Ben: I hope so. Yeah. I was going to say, there's always a balance between trying to think ahead so that when you discover something you're like, "Oh, that fits into what I thought." And the danger of that being that you're twisting the information to fit into your preexisting structures. I hope that I find a good balance there, but I don't have a principled way of determining that balance or knowing where you are in that it's good versus it's dangerous kind of spectrum.

Jeremy: Right. So one of the opinions that you hold that I tend to agree with, I have some thoughts about some of the benefits, but I also really agree with the other piece of it. And this really has to do with the CDK and this idea of using CloudFormation or any sort of DSL, maybe Terraform, things like that, something that is more domain-specific, right? Or I guess declarative, right? As opposed to something that is imperative like the CDK. So just to get everybody on the same page here, what are the top reasons why you believe, or you think, that the DSL approach is better than that iterative approach or interpretive approach, I guess?

Ben: Yeah. So I think we get caught up in the imperative versus declarative part of it. I do think that declarative has benefits that can be there, but the way that I think about it is with the CDK and infrastructure as code in general, I'm like mildly against imperative definitions of resources. And we can get into that part, but that's my smallest objection to the CDK. I'm moderately against not being able to enforce deterministic builds. And the CDK program can do anything. Can use a random number generator and go out to the internet to go ask a question, right? It can do anything in that program and that means that you have no guarantees that what's coming out of it you're going to be able to repeat.

So even if you check the source code in, you may not be able to go back to the same infrastructure that you had before. And you can if you're disciplined about it, but I like tools that help give you guardrails so that you don't have to be as disciplined. So that's my moderately against. My strongly against piece is I'm strongly against developer intent remaining client side. And this is not an inherent flaw in the CDK, it's a choice that the CDK team has made to turn organizational dysfunction in AWS into ownership for their customers. And I don't think that's a good approach to take, but that's also fixable.
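
A small Python sketch of the deterministic-build concern: if the program that generates a template can call a random number generator, two runs of the same checked-in source need not agree. The function names and the template shape below are hypothetical and stand in for any synthesis step; this is not CDK code.

```python
import hashlib
import json
import uuid


def synth_deterministic(stage: str) -> dict:
    """Output depends only on the arguments, so a later run reproduces it."""
    return {"Bucket": {"name": f"reports-{stage}"}}


def synth_nondeterministic(stage: str) -> dict:
    """Output depends on a random value, so the same source gives different results."""
    return {"Bucket": {"name": f"reports-{stage}-{uuid.uuid4().hex}"}}


def fingerprint(template: dict) -> str:
    """Stable hash of a template, used to compare two synthesis runs."""
    canonical = json.dumps(template, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


same = fingerprint(synth_deterministic("prod")) == fingerprint(synth_deterministic("prod"))
differs = fingerprint(synth_nondeterministic("prod")) != fingerprint(synth_nondeterministic("prod"))
print("deterministic runs match:", same)
print("non-deterministic runs differ:", differs)
```

Comparing fingerprints across two runs is one way a team could check its own discipline; a tool that forbids the second kind of function outright is the guardrail Ben is asking for.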

So I think if we want to start with the imperative versus declarative thing, right? When I think about the developers expressing an intent, and I want that intent to flow entirely into the cloud so that developers can understand what's deployed in the cloud in terms of the things that they've written. The CDK takes this approach of flattening it down, flattening the richness of the program the developer has written into ... They think of it as assembly language. I think that is a misinterpretation of what's happening. The assembly language in the process is the imperative plan generated inside the CloudFormation engine that says, "Here's how I'm going to take this definition and turn it into an actual change in the cloud."

Jeremy: Right.

Ben: They're just translating between two definition formats in CDK synth. But it's a flattening process, it's a lossy process. So then when the developer goes to the Console or the API and has to say, "What's deployed here? What's going wrong? What do I need to fix?" none of it is framed in terms of the things that they wrote in their original language.

Jeremy: Right.

Ben: And I think that's the biggest problem, right? So drift detection is an important thing, right? What happened when someone went in through the Console? Went and tweaked some stuff to fix something, and now it's different from the definition that's in your source repository. And in CloudFormation, it can tell you that. But what I would want if I was running CDK is that it should produce another CDK program that represents the current state of the cloud with a meaningful file-level diff with my original program.
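
Drift detection, as described here, amounts to comparing the declared definition with what is actually deployed. The following Python sketch assumes both are available as plain dictionaries; the `detect_drift` helper and the resource and property names are hypothetical, and this is not how CloudFormation implements the feature.

```python
def detect_drift(declared: dict, deployed: dict) -> list[str]:
    """List property-level differences between the declared and deployed state."""
    findings = []
    for resource in sorted(declared.keys() | deployed.keys()):
        if resource not in deployed:
            findings.append(f"{resource}: missing from deployment")
            continue
        if resource not in declared:
            findings.append(f"{resource}: not in source definition")
            continue
        want, have = declared[resource], deployed[resource]
        for prop in sorted(want.keys() | have.keys()):
            if want.get(prop) != have.get(prop):
                findings.append(
                    f"{resource}.{prop}: declared {want.get(prop)!r}, deployed {have.get(prop)!r}"
                )
    return findings


declared = {"ApiFunction": {"timeout_s": 10, "memory_mb": 256}}
# Someone raised the timeout by hand in the Console.
deployed = {"ApiFunction": {"timeout_s": 30, "memory_mb": 256}}

for finding in detect_drift(declared, deployed):
    print(finding)
```

Note that the findings are phrased in terms of the flattened resources. Mapping them back to the program that generated those resources is the part Ben says is missing.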

Jeremy: Right. I'm just thinking this through, if I deploy something to CDK and I've got all these loops and they're generating functions and they're using some naming and all this kind of stuff, whatever, now it produces this output. And again, my naming of my functions might be some function that gets called to generate the names of the function. And so now I've got all of these functions named and I have to go in. There's no one-to-one map like you said, and I can imagine somebody who's not familiar with CloudFormation which is ultimately what CDK synthesizes and produces, if you're not familiar with what that output is and how that maps back to the constructs that you created, I can see that as being really difficult, especially for younger developers or developers who are just getting started in that.
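
The loss of information Jeremy describes can be shown with a toy example in Python. Two different programs, one using a loop with a naming helper and one written out by hand, flatten to the same template, so the template alone cannot say which program produced it. The function names and the template shape are hypothetical; real CDK output is considerably richer.

```python
def synth_with_loop() -> dict:
    """A program that generates three functions from a loop and a naming helper."""
    def function_name(index: int) -> str:
        return f"Worker{index}"

    return {function_name(i): {"Type": "Function", "MemoryMb": 128} for i in range(3)}


def synth_written_by_hand() -> dict:
    """A program that spells out the same three functions one by one."""
    return {
        "Worker0": {"Type": "Function", "MemoryMb": 128},
        "Worker1": {"Type": "Function", "MemoryMb": 128},
        "Worker2": {"Type": "Function", "MemoryMb": 128},
    }


flattened = synth_with_loop()

# Both programs flatten to the same template, so the loop and the naming
# helper cannot be recovered from the output.
print(flattened == synth_written_by_hand())
print(sorted(flattened))
```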

Ben: And the CDK really takes the attitude that it's going to hide those things from those developers rather than help them learn it. And so when they do have to dive into that, the CDK refers to it as an escape hatch.

Jeremy: Yeah.

Ben: And I think of escape hatches on submarines, where you go from being warm and dry and having air to breathe to being hundreds of feet below the sea, right? It's not the sort of thing you want to go through. Whereas some tools like Amplify talk about graduation. In Amplify they aim to help you understand the things that Amplify is doing for you, such that when you grow beyond what Amplify can provide you, you have the tools to do that, to take the thing that you built and then say, "Okay, I know enough now that I understand this and can add onto it in ways that Amplify can't help with."

Jeremy: Right.

Ben: Now, how successful they are in doing that is a separate question I think, but the attitude is there to say, "We're looking to help developers understand these things." Now the CDK could also if the CDK was a managed service, right? Would not need developers to understand those things. If you could take your program directly to the cloud and say, "Here's my program, go make this real." And when it made it real, you could interact with the cloud in an understanding where you could list your deployed constructs, right? That you can understand the program that you wrote when you're looking at the resources that are deployed all together in the cloud everywhere. That would be a thing where you don't need to learn CloudFormation.

Jeremy: Right.

Ben: Right? That's where you then end up in the imperative versus declarative part where, okay, there's some reasons that I think declarative is better. But the major thing is that disconnect that's currently built into the way that CDK works. And the reason that they're doing that is because CloudFormation is not moving fast enough, which is not always on the CloudFormation team. It's often on the service teams that aren't building the resources fast enough. And that's AWS's problem, AWS as an entire company, as an organization. And this one team is saying, "Well, we can fix that by doing all this client side."

What that means is that the customers are then responsible for all the things that are happening on the client side. The reason that they can go fast is because the CDK team doesn't have ownership of it, which just means the ownership is being pushed on customers, right? The CDK deploys Lambda functions into your account that they don't tell you about that you're now responsible for. Right? Both the security and operations of. If there are security updates that the CDK team has to push out, you have to take action to update those things, right? That's ownership that's being pushed onto the customer to fix a lack of ACM certificate management, right?

Jeremy: Right. Right.

Ben: That is ACM not building the thing that's needed. And so AWS says, "Okay, great. We'll just make that the customer's problem."

Jeremy: Right.

Ben: And I don't agree with that approach.

Rebecca: So I'm sure as an AWS Hero you certainly have pretty good, strong, open communication channels with a lot of different team members across teams. And I certainly know that they're listening to you and are at least hearing you, I should say, and watching you and they know how you feel about this. And so I'm curious how some of those conversations have gone. And some teams as compared to others at AWS are really, really good about opening their roadmap or at least saying, "Hey, we hear this, and here's our path to a solution or a success." And I'm curious if there's any light you can shed on whether or not those conversations have been fruitful in terms of actually being able to get somewhere in terms of customer and AWS terms, right? Customer obsession first.

Ben: Yeah. Well, customer obsession can mean two things, right? Customer obsession can mean giving the customer what they want or it can mean giving the customer what they need, and different AWS teams' approaches fall differently on that scale. The reason that many of those things are not available in CloudFormation is that those teams are ... They could be under-resourced. They could have a larger majority of customers that want new features rather than infrastructure as code support. Because as much as we all like infrastructure as code, there are many, many organizations out there that are not there yet. And with the CDK in particular, I'm a relatively lone voice out there saying, "I don't think this ownership that's being pushed onto the customer is a good thing." And there are lots of developers who are eating up CDK saying, "I don't care."

That's not something that's in their worry. And because the CDK has been enormously successful, right? It's fixing these problems that exist. And I don't begrudge them trying to fix those problems. I think it's a question of do those developers who are grabbing onto those things and taking them understand the full total cost of ownership that the CDK is bringing with it. And if they don't understand it, I think AWS has a responsibility to understand it and work with it to help those customers understand it and deal with it, right? Which is where the CDK takes this approach, "Well, if you do GitOps, it's all fine." And that's somewhat true, but also many developers who can use the CDK do not control their CI/CD process. So there's all sorts of ways in which ... Yeah, so I think every team is trying to do the best that they can, right?

They're all working hard and they all have ... Are pulled in many different directions by customers. And most of them are making, I think, the right choices given their incentives, right? Given what their customers are asking for. I think not all of them balance where customers ... meeting customers where they are versus leading them where they should, like where they need to go as well as I would like. But I think ... I had a conclusion to that. Oh, but I think that's always a debate as to where that balance is. And then the other thing when I talk about the CDK, that my ideal audience there is less AWS itself and more AWS customers ...

Rebecca: Sure.

Ben: ... to understand what they're getting into and therefore to demand better of AWS. Which is in general, I think, the approach that I take with AWS, is complaining about AWS in public, because I do have the ability to go to teams and say, "Hey, I want this thing," right? There are plenty of teams where I could just email them and say, "Hey, this feature could be nice", but I put it on Twitter because other people can see that and say, "Oh, that's something that I want or I don't think that's helpful," right? "I don't care about that," or, "I think it's the wrong thing to ask for," right? All of those things are better when it's not just me saying I think this is a good thing for AWS, but it being a conversation among the community differently.

Rebecca: Yeah. I think in the spirit too of trying to publicize types of what might be best next for customers, you said total cost of ownership. Even though it might seem silly to ask this, I think oftentimes we say the words total cost of ownership, but there's actually many dimensions to total cost of ownership or TCO, right? And so I think it would be great if you could enumerate what you think of as total cost of ownership, because there might be dimensions along that matrices, matrix, that people haven't considered when they're actually thinking about total cost of ownership. They're like, "Yeah, yeah, I got it. Some Ops and some security stuff I have to do and some patches," but they might only be thinking of five dimensions when you're like, "Actually the framework is probably 10 to 12 to 14." And so if you could outline that a bit, what you mean when you think of a holistic total cost of ownership, I think that could be super helpful.

Ben: I'm bad at enumeration. So I would miss out on dimensions that are obvious if I was attempting to do that. But I think a way that I can effectively answer that question is to talk about some of the ways in which we misunderstand TCO. So I think it's important when working in an organization to think about the organization as a whole, not just your perspective and your team's perspective in it. And so when you're working for the lowest TCO it's not what's the lowest cost of ownership for my team if that's pushing a larger burden onto another team. Now if it's reducing the burden on your team and only increasing the burden on another team a little bit, that can be a lower total cost of ownership overall. But it's also something that then feeds into things like political capital, right?

Is that increased ownership that you're handing to that team something that they're going to be happy with, something that's not going to cause other problems down the line, right? Those are the sorts of things that fit into that calculus because it's not just about what ... Moving away from that topic for a second. I think about when we talk about how does this increase our velocity, right? There's the piece of, "Okay, well, if I can deploy to production faster, right? My feedback loop is faster and I can move faster." Right? But the other part of that equation is how many different threads can you be operating on and how long are those threads in time? So when you're trying to ship a feature, if you can ship it and then never look at it again, that means you have increased bandwidth in the future to take on other features to develop other new features.

And so even if you think about, "It's going to take me longer to finish this particular feature," but then there's no maintenance for that feature, that can be a lower cost of ownership in time than, "I can ship it 50% faster, but then I'm going to periodically have to revisit it and that's going to disrupt my ability to ship other things," right? So this is where I had conversations recently about increasing use of Step Functions, right? And being able to replace Lambda functions with Step Functions Express Workflows because you never have to go back to those Lambdas and update dependencies in them because Dependabot has told you that you need to or a version of Python is getting deprecated, right? All of those things, just if you have your Amazon States Language however it's been defined, right?

Once it's in there, you never have to touch it again if nothing else changes and that means, okay, great, that piece is now out of your work stream forever unless it needs to change. And that means that you have more bandwidth for future things, which serverless is about in general, right? To say, "Okay, I don't have to deal with these scaling problems here. So those scaling things. Once I have an auto-scaling group, I don't have to go back and tweak it later." And so the same thing happens at the feature level if you build it in ways that allow you to do that. And so I think that's one of the places where when we focus on, okay, how fast is this getting me into production, it's okay, but how often do you have to revisit it ...
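
The trade-off Ben describes, shipping faster versus never revisiting, can be put into a short worked calculation. The numbers below are made up purely for illustration, and `ownership_hours` is a hypothetical helper, not a measure anyone in the conversation proposed.

```python
def ownership_hours(build_hours: float, revisits_per_year: float,
                    hours_per_revisit: float, years: float) -> float:
    """Total hours spent on a feature: the initial build plus every later revisit."""
    return build_hours + revisits_per_year * hours_per_revisit * years


# Made-up numbers, for illustration only.
# Option A takes longer to build but needs no maintenance afterwards.
slow_but_done = ownership_hours(build_hours=60, revisits_per_year=0,
                                hours_per_revisit=0, years=3)
# Option B ships sooner but needs a few hours of upkeep every quarter.
fast_but_recurring = ownership_hours(build_hours=40, revisits_per_year=4,
                                     hours_per_revisit=3, years=3)

print(slow_but_done)       # 60
print(fast_but_recurring)  # 76
```

Under these assumptions the slower build costs less over three years, and that is before counting the disruption each revisit causes to other work, which this simple sum leaves out.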

Jeremy: Right. And so ... So you mentioned a couple of things in there, and not only in that question, but in the previous questions as you were talking about the CDK in general, and I am 100% behind you on this idea of deterministic builds because I want to know exactly what's being deployed. I want to be able to audit that and map that back. And you can audit, I mean, you could run CDK synth and then audit the CloudFormation and test against certain things. But if you are changing stuff, right? Then you have to understand not only the CDK but also the CloudFormation that it actually generates. But in terms of solving problems, some of the things that the CDK does really, really well, and this is something where I've always had this issue with just trying to use raw CloudFormation or Serverless Framework or SAM or any of these things is the fact that there's a lot of boilerplate that you often have to do.

There's ways that companies want to do something specifically. I basically probably always need 1,400 lines of CloudFormation. And for every project I do, it's probably close to the same, and then add a little bit more to actually make it adaptive for my product. And so one thing that I love about the CDK is constructs. And I love this idea of being able to package these best practices for your company or these compliance requirements, excuse me, compliance requirements for your company, whatever it is, be able to package these and just hand them to developers. And so I'm just curious on your thoughts on that because that seems like a really good move in the right direction, but without the deterministic builds, without some of these other problems that you talked about, is there another solution to that that would be more declarative?

Ben: Yeah. In theory, if the CDK was able to produce an artifact that represented all of the non-deterministic dependencies that it had, right? That allowed you to then store that artifact so you could come back and put that into the program and say, "I'm going to get out the same thing," but because the CDK doesn't control upstream of it, the code that the developers are writing, there isn't a way to do that. Right? So on the abstraction front, the constructs are super useful, right? CloudFormation now has modules which allow you to say, "Here's a template and I'm going to represent this as a CloudFormation type itself," right? So instead of saying that I need X different things, I'm going to say, "I packaged that all up. Here it is as a type."

Now, currently, modules can only be plain CloudFormation templates and there's a lot of constraints in what you can express inside a CloudFormation template. And I think the answer for me is ... What I want to see is more richness in the CloudFormation language, right? One of the things that people do in the CDK that's really helpful is say, "I need a copy of this in every AZ."

Jeremy: Right.

Ben: Right? There's so much boilerplate in server-based things. And CloudFormation can't do that, right? But if you imagine that it had a map function that allowed you to say, "For every AZ, stamp me out a copy of this little bit." And then the CDK constructs could translate to that. Instead of it doing all this generation only down to the L1 piece, instead being able to say, "I'm going to translate this into more rich CloudFormation templates so that the CloudFormation template was as advanced as possible."

Right? Then it could do things like say, "Oh, I know we need to do this in every AZ, I'm going to use this map function in the CloudFormation template rather than just stamping it out." Right? And so I think that's possible. Now, modules should also be able to be defined as CDK programs. Right? You should be able to register a construct as a CloudFormation type.
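
To make the "map function" idea concrete, here is a Python sketch of a template that keeps the intent ("one of these per AZ") and an engine-side step that expands it. Everything here is an assumption for illustration: the `ForEach` key, the `expand_for_each` function, and the resource types are invented and are not CloudFormation syntax as discussed in this conversation.

```python
import copy


def expand_for_each(template: dict, values_by_name: dict) -> dict:
    """Expand every hypothetical `ForEach` entry into one resource per value."""
    expanded = {}
    for logical_id, body in template.items():
        if "ForEach" not in body:
            expanded[logical_id] = copy.deepcopy(body)
            continue
        for value in values_by_name[body["ForEach"]]:
            resource = copy.deepcopy(body["Resource"])
            resource["Properties"]["AvailabilityZone"] = value
            # Derive a distinct logical ID per value, e.g. "Subnetuseast1a".
            expanded[logical_id + value.replace("-", "")] = resource
    return expanded


# The stored template still says "one subnet per AZ" in a single entry.
template = {
    "Vpc": {"Type": "Vpc", "Properties": {}},
    "Subnet": {
        "ForEach": "AvailabilityZones",
        "Resource": {"Type": "Subnet", "Properties": {"VpcId": "vpc-example"}},
    },
}
zones = {"AvailabilityZones": ["us-east-1a", "us-east-1b"]}

result = expand_for_each(template, zones)
print(sorted(result))
```

The design point is where the expansion happens. If it runs in the service rather than on the developer's machine, the stored template still carries the single "for each AZ" statement instead of its stamped-out copies.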

Jeremy: It would be pretty cool.

Ben: There's no reason you shouldn't be able to. Yeah. Because I think the declarative versus imperative thing is, again, not the most important piece, it's how do we move ... It's shifting right in this case, right? That how do you shift what's happening with the developer further into the process of deployment so that more of their context is present? And so one of the things that the CDK does that's hard to replicate is have non-local effects. And this is both convenient and, I think, a code smell often.

So you can pass a bucket resource from another stack into a piece of code in your CDK program that's creating a different stack and you say, "Oh great, I've got this Lambda function, it needs permissions to that bucket. So add permissions." And it's possible for the CDK programs to either be adding the permissions onto the IAM role of that function, or non-locally adding to that bucket's resource policy, which is weird, right? That you can be creating a stack and the thing that you do to that stack or resource or whatever is not happening there, it's happening elsewhere. I don't think that's a great approach, but it's certainly convenient to be able to do it in a lot of situations.
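
A toy Python model of the non-local effect described above: a call made while building one stack ends up changing a resource that belongs to another. The `grant_read` helper and the dictionary shapes are hypothetical and only mimic the two outcomes Ben mentions; they are not the CDK's API.

```python
def grant_read(function: dict, bucket: dict, use_bucket_policy: bool) -> None:
    """Give `function` read access to `bucket`, either locally or non-locally."""
    if use_bucket_policy:
        # Non-local: mutates a resource that belongs to another stack.
        bucket["resource_policy"].append({"allow": "read", "principal": function["role"]})
    else:
        # Local: mutates only the function's own role.
        function["role_policy"].append({"allow": "read", "resource": bucket["name"]})


storage_stack = {"bucket": {"name": "reports", "resource_policy": []}}
compute_stack = {"function": {"role": "reports-reader-role", "role_policy": []}}

# This call is written in the code that builds the compute stack.
grant_read(compute_stack["function"], storage_stack["bucket"], use_bucket_policy=True)

# The compute stack is unchanged; the storage stack is what was modified.
print(compute_stack["function"]["role_policy"])
print(storage_stack["bucket"]["resource_policy"])
```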

Now, that's not representable within a module. A module is a contained piece of functionality that can't touch anything else. So things like SAM where you can add events onto a function that can go and create ... You create the API events on different functions and then SAM aggregates them and creates an API Gateway for you. Right? If AWS::Serverless::Function was a module, it couldn't do that because you'd have these in different places and you couldn't aggregate something between all of them and put them in the top-level thing, right?

This is what CloudFormation macros enable, but they don't have a... There's no proper interface to them, right? They don't define, "This is what I'm doing. This is the kind of resources I can create." There's none of that that would help you understand them. So they're infinitely flexible, but then also maybe less principled for that reason. So I think there are ways to evolve, but it's investment in the CloudFormation language that allows us to shift that burden from being a flattening inside client-side code from the developer and shifting it to be able to be represented in the cloud.

Jeremy: Right. Yeah. And I think from that standpoint too if we go back to the solving people's problems standpoint, that everything you explained there, they're loaded with nuances, it's loaded with gotchas, right? Like, "Oh, you can't do this, you can't do that." So that's just why I think the CDK is so popular because it's like you can do so much with it so quickly and it's very, very fast. And I think that trade-off, people are just willing to make it.

Ben: Yes. And that's where they're willing to make it, do they fully understand the consequences of it? Then does AWS communicate those consequences well? Before I get into that question of, okay, you're a developer that's brand new to AWS and you've been tasked with standing up some Kubernetes cluster and you're like, "Great. I can use the CDK to do this." Something is malfunctioning. You're also tasked with the operations and something is malfunctioning. You go in through the Console and maybe figure out all the things that are out there are new to you because they're hidden inside L3 constructs, right?

You're two levels down from where you were defining what you want, and then you find out what's wrong and you have no idea how to turn that into a change in your CDK program. So instead of going back and doing the thing that infrastructure as code is for, which is tweaking your program to go fix the problem, you go and you tweak it in the Console ...

Jeremy: Right. Which you should never do.

Ben: ... and you fix it that way. Right. Well, and that's the thing that I struggle with, with the CDK is how does the CDK help the developer who's in that situation? And I don't think they have a good story around that. Now, I don't know. I haven't talked with enough junior developers who are using the CDK about how often they get into that situation. Right? But I always say client-side code is not a replacement for a managed service because when it's client-side code, you still own the result.

Jeremy: Right.

Ben: If a particular CDK construct was a managed service in AWS, then all of the resources that would be created underneath would be AWS's problem to make work. And the interface that the developer has is the only level of ownership that they have. Fargate is this. Because you could do all the things that Fargate does with a CDK construct, right? Set up EC2, do all the things, and represent it as something that looks like Fargate in your CDK program. But every time your EC2 fleet is unhealthy that's your problem. With Fargate, that's AWS's problem. If we didn't have Fargate, that's essentially what CDK would be trying to do for ECS.

And I think we all recognize that Fargate is very necessary and helpful in that case, right? And I just want that for all the things, right? Whenever I have an abstraction, if it's an abstraction that I understand, then I should have a way of zooming into it while not having to switch languages, right? So that's where you shouldn't dump me out to CloudFormation to understand what you're doing. You should help me understand the low-level things in the same language. And if it's not something that I need to understand, it should be a managed service. It shouldn't be a bunch of stuff that I still own that I haven't looked at.

Jeremy: Makes sense. Got a question, Rebecca? Because I was waiting for you to jump in.

Rebecca: No, but I was going to make a joke, but then the joke passed, and then I was like, "But should I still make it?" I was going to be like, "Yeah, but does the CDK let you test in production?" But that was a 30-second-ago joke and then I was really wrestling with whether or not I should tell it, but I told it anyway. Hopefully, someone gets a laugh.

Ben: Yeah. I mean, there's the thing that Charity Majors says, right? Which is that everybody tests in production. Some people are lucky enough to have a development environment in production. No, sorry. I said that the wrong way. It's everybody has a test environment. Some people are lucky enough that it's not in production.

Rebecca: Yeah. Swap that. Reverse it. Yeah.

Ben: Yeah.

Jeremy: All right. So speaking of talking to developers and getting feedback from them, so I actually put a question out on Twitter a couple of weeks ago and got a lot of really interesting reactions. And essentially I asked, "What do you love or hate about infrastructure as code?" And there were a lot of really interesting things here. I don't know, maybe it might be fun to go through a couple of these and get your thoughts on them. So this is probably not a great one to start with, but I thought it was interesting because this I think represents the frustration that a lot of us feel. And it was basically that they love that automation minimizes future work, right? But they hate that it makes life harder over time. And that pretty much every approach to infrastructure in, sorry, yeah, infrastructure in code at the present is flawed, right? So really there are no good solutions right now.

Ben: Yeah. CloudFormation is still a pain to learn and deal with. If you're operating in certain IDEs, you can get tab completion.

Jeremy: Right.

Ben: If you go to CDK you get tab completion, which is, I think probably most of the value that developers want out of it and then the abstraction, and then all the other fancy things it does like pipelines, which again, should be a managed service. I do think that person is absolutely right to complain about how difficult it is. That there are many ways that it could be better. One of the things that I think about when I'm using tools is it's not inherently bad for a tool to have some friction to use it.

Jeremy: Right.

Ben: And this goes to another infrastructure as code tool that goes even further than the CDK and says, "You can define your Lambda code in line with your infrastructure definition." So this is Pulumi. And there's some other ... I think Punchcard also lets you do some of this. Basically extracts out the bits of your code that you say, "This is a custom thing that glues together two things I'm defining in here and I'll make that a Lambda function for you." And for me, that is too little friction to defining a Lambda function.

Because when I define a Lambda function, just going back to that bringing in ownership, every time I add a Lambda function, that's something that I own, that's something that I have to maintain, that I'm responsible for, that can go wrong. So if I'm thinking about, "Well, I could have API Gateway direct into DynamoDB, but it'd be nice if I could change some of these fields. And so I'm just going to drop in a little sprinkle of code, three lines of code in between here to do some transformation that I want." That is all of a sudden an entire Lambda function you've brought into your infrastructure.

Jeremy: Right. That's a good point.

Ben: And so I want a little bit of friction to do that, to make me think about it, to make me say, "Oh, yeah, downstream of this decision that I am making, there are consequences that I would not otherwise think about if I'm just trying to accomplish the problem," right? Because I think developers, humans, in general, tend to be a bit shortsighted when you have a goal especially, and you're being pressured to complete that goal and you're like, "Okay, well I can complete it." The consequences for later are always a secondary concern.

And so you can change your incentives in that moment to say, "Okay, well, this is going to guide me to say, 'Ah, I don't really need this Lambda function in here.'" Then I'm better off in the long term while accomplishing that goal in the short term. So I do think that there is a place for tools making things difficult. That's not to say that the amount of difficulty that infrastructure as code has today is at all reasonable, but I do think it's worth thinking about, right?

I'd rather take on the pain of creating an ASL definition by hand for an Express Workflow than the easier thing of writing Lambda code. Because I know the long-term consequences of that. Now, if that could be flipped where it was harder to write something that took more ownership, it'd be just easy to do, right? You'd always do the right thing. But I think it's always worth saying, "Can I do the harder thing now to pay off later?"

Jeremy: And I always call those shortcuts "tomorrow-Jeremy's" problem. That's how I like to look at those.

Ben: Yeah. Yes.

Jeremy: And the funny thing about that too is I remember right when EventBridge came out and there was no CloudFormation support for a long time, which was super frustrating. But Serverless Framework, for example, implemented a custom resource in order to do that. And I remember looking at a clean stack and being like, "Why are there two Lambda functions there that I have no idea?" I'm like, "I didn't publish ..." I honestly thought my account was compromised that somebody had published a Lambda function in there because I'm like, "I didn't do that." And then it took me a while to realize, I'm like, "Oh, this is what this is." But if it is that easy to just create little transform functions here and there, I can imagine there being thousands of those in your account without anybody knowing that they even exist.

Ben: Now, don't get me wrong. I would love to have the ability to drop in little transforms that did not involve Lambda functions. So in other words, I mean, the thing that VTL does for API Gateway REST APIs but without it being VTL and being ... Because that's hard and then also restricted in what you can do, right? It's not, "Oh, I can drop in arbitrary code in here." But enough to say, "Oh, I want to flip ... These fields should go from a key-value mapping to a list of key-value pairs," right? In a way that addresses inconsistencies in how tags are defined across services, those kinds of things. Right? And you could drop that in any service, but once you've defined it, there's no maintenance for you, right?

You're writing JavaScript. It's not actually a JavaScript engine underneath or something. It's just getting translated into some big multi-tenant fancy thing. And I have a hypothesis that that should be possible. You should be able to do it where you could even do it in the parsing of JSON, being able to do transforms without ever having to have the whole object in memory. And if we could get that then, "Oh, sure. Now I have sprinkled all over the place all of these little transforms." Now there's a little bit of overhead if the transform is defined correctly or not, right? But once it is, then it just works. And having all those little transforms everywhere is then fine, right? And that incentive to make it harder it doesn't need to be there because it's not bringing ownership with it.
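The tag example is small enough to write out. This is a minimal sketch of the kind of transform being described, assuming the two common tag shapes: a plain key-value mapping and a list of `Key`/`Value` pairs. The function names are made up for illustration.

```python
def tags_map_to_list(tags):
    """Turn {"team": "robots"} into [{"Key": "team", "Value": "robots"}]."""
    # Sorting keeps the output stable regardless of insertion order.
    return [{"Key": key, "Value": tags[key]} for key in sorted(tags)]


def tags_list_to_map(tags):
    """Turn [{"Key": "team", "Value": "robots"}] back into {"team": "robots"}."""
    return {tag["Key"]: tag["Value"] for tag in tags}


as_list = tags_map_to_list({"team": "robots", "env": "dev"})
as_map = tags_list_to_map(as_list)
```

Today, running even this much logic between two services generally means owning a Lambda function, which is the ownership cost the conversation is about.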

Rebecca: Yeah. It's almost like taking the idea of tomorrow-Jeremy's problem and actually switching it to say tomorrow-Jeremy's celebration where tomorrow-Jeremy gets to look back at past-Jeremy and be like, "Nice. Thank you for making that decision, past-Jeremy." Because I think we often do look at it in terms of tomorrow-Jeremy will think of this, we'll solve this problem rather than how do we approach it by saying, how do I make tomorrow-Jeremy thankful for today-Jeremy? And that's a simple language, linguistic switch, but a hard switch to actually make decisions based on.

Ben: Yeah. I don't think tomorrow-Ben is ever thankful for today-Ben. I think it's tomorrow-Ben is thankful for yesterday-Ben setting up the incentives correctly so that today-Ben will do the right thing for tomorrow-Ben. Right? When I think about people, I think it's easier to convince people to accept a change in their incentives than to convince them to fight against their incentives sustainably.

Jeremy: Right. And I think developers, and I'm guilty of this too, I mean, we make decisions based off of expediency. We want to get things done fast. And when you get stuck on that problem you're like, "You know what? I'm not going to figure it out. I'm just going to write a loop or I'm going to do whatever I can do just to make it work. Another if statement here isn't going to hurt anybody." All right. So let's move to ... Sorry, go ahead.

Ben: We shouldn't feel bad about that.

Jeremy: You're right.

Ben: I was going to say, we shouldn't feel bad about that. That's where I don't want tomorrow-Ben to have to be thankful for today-Ben, because the implication there is that today-Ben is fighting against his incentives to do good things for tomorrow-Ben. And if I don't have to get to that point, where the right path is just the easiest path, right? Which means putting friction in the right places, then today-Ben ... It's never a question of whether today-Ben is doing something that's worth being thankful for. It's just doing the job, right?

Jeremy: Right. No, that makes sense. All right. I got another question here that I think falls under the category of service discovery, which I know is another topic that you love. So this person said, "I love IaC, but hate the fuzzy boundaries where certain software awkwardly falls. So like Istio and Prometheus and cert-manager. That they can be considered part of the infrastructure, but then it's awkward to deploy them with something like Terraform due to circular dependencies relating to K8s and things like that."

So, I mean, I know that we don't have to get into the actual details of that, but I think that is an important aspect of infrastructure as code where best practices sometimes are deploy a stack that has your permanent resources and then deploy a stack that maybe has your more ephemeral or the ones that are going to be changing, the more mutable ones, maybe your Lambda functions and some of those sort of things. If you're using Terraform or you're using some of these other services as well, you do have that really awkward mix where you're trying to use outputs from one stack into another stack and trying to do all that. And really, I mean, there are some good tools that help with it, but I mean just overall thoughts on that.

Ben: Well, we certainly need to demand better of AWS services when they design new things that they need to be designed so that infrastructure as code will work. So this is the S3 bucket notification problem. A very long time ago, S3 decided that they were going to put bucket notifications as part of the S3 bucket. Well, CloudFormation at that point decided that they were going to put bucket notifications as part of the bucket resource. And S3 decided that they were going to check permissions when the notification configuration is defined so that you have to have the permissions before you create the configuration.

This creates a circular dependency when you're hooking it up to anything in CloudFormation, because the notification depends on the resource policy on an SNS topic, an SQS queue, or a Lambda function, and that resource policy depends on the bucket name if you're letting CloudFormation name the bucket, which is the best practice. The bucket name has to exist, which means the bucket resource has to have been created. But the notification depends on the thing that it's notifying, which doesn't have the name, and the resource policy doesn't exist, so it all fails. And this is solved in a couple of different ways. One of which is name your bucket explicitly, again, not a good practice. Another is what SAM does, which says, "The Lambda function will say I will allow all S3 buckets to invoke me."
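The cycle is easier to see as a dependency graph. The sketch below assumes a simplified model with three hypothetical logical IDs (`Function`, `InvokePermission`, `Bucket`) and uses the standard library's `graphlib` to look for a valid creation order. It illustrates the shape of the problem and is not how CloudFormation itself resolves dependencies.

```python
from graphlib import CycleError, TopologicalSorter


def creation_order(depends_on):
    """Return a valid creation order, or None if the dependencies are circular."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


# CloudFormation names the bucket: the permission needs the generated bucket
# name, and the bucket (which carries the notification) needs the permission.
generated_name = {
    "Function": [],
    "InvokePermission": ["Function", "Bucket"],
    "Bucket": ["InvokePermission"],
}

# Bucket named explicitly: the permission can use the literal name, so it no
# longer depends on the bucket resource having been created.
explicit_name = {
    "Function": [],
    "InvokePermission": ["Function"],
    "Bucket": ["InvokePermission"],
}

cyclic_result = creation_order(generated_name)
explicit_result = creation_order(explicit_name)
```

With the generated name there is no order that satisfies every dependency, so `creation_order` returns `None`. With the explicit name the edge from the permission back to the bucket disappears and the resources can be created one after another.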

So it has a star permission in its resource policy. So then the notification will work. None of which is good. Or there are custom resources that get created, right? Now, if those resources had been designed with infrastructure as code as part of the process, then it would have been obvious, "Oh, you end up with a circular dependency. We need to split out bucket notifications as a separate resource." And not enough teams are doing this. Often they're constrained by the API that they develop first ...

Jeremy: That's a good point.

Ben: ... they come up with the API, which often makes sense for a Console experience that they desire. So this is where API Gateway has this whole thing where you create all the routes and the resources and the methods and everything, right? And then you say, "Great, deploy." And in the Console you only need one mutable working copy of that at a time, but it means that you can't create two deployments or update two stages in parallel through infrastructure as code and API Gateway because they both talk to this mutable working copy state and would overwrite each other.

And if infrastructure as code had been on their list, it would have been, "Oh, if you have a definition of your API, you should be able to go straight to the deployment," right? And so trying to push that upstream, which to me is more important than infrastructure as code support at launch, but people are often like, "Oh, I want CloudFormation support at launch." But that often means that they get no feedback from customers on the design and therefore make it bad. KMS asymmetric keys should have been a different resource type so that you can easily tell which key types are in your template.

Jeremy: Good point. Yeah.

Ben: Right? So that you can use things like CloudFormation Guard more easily on those. Sure, you can control the properties or whatever, but you should be able to think in terms of, "I have a symmetric key or an asymmetric key in here." And they're treated completely separately because you use them completely differently, right? They don't get used in the same place.

Jeremy: Yeah. And it's funny that you mentioned the lacking support at launch because that was another complaint. That was quite prevalent in this thread here, was people complaining that they don't get that CloudFormation support right away. But I think you made a very good point where they do build the APIs first. And that's another thing. I don't know which question asked me or which one of these mentioned it, but there was a lot of anger over the fact that you go to the API docs or you go to the docs for AWS and it focuses on the Console and it focuses on the CLI and then it gives you the API stuff and very little mention of CloudFormation at all. And usually, you have to go to a whole separate set of docs to find the CloudFormation. And it really doesn't tie all the concepts together, right? So you get just a block of JSON or of YAML and you're like, "Am I supposed to know what everything does here?"

Ben: Yeah. I assume that's data-driven. Right? And we exist in this bubble where everybody loves infrastructure as code.

Jeremy: True.

Ben: And that AWS has many more customers who set things up using the Console, people who learn by doing it first through the Console. I assume that's true, if it's not, then AWS has somehow gotten on the extremely wrong track. But I imagine that's how they find that they get the right engagement. Now maybe the CDK will change some of this, right? Maybe the amount of interest that it's generating will get it to the point where blogs get written with CDK programs being written there. I think that presents different problems about what that CDK program might hide from you when you're learning about a service. But yeah, it's definitely not ... I wrote a blog for AWS and my first draft had it as CloudFormation and then we changed it to the Console. Right? And ...

Jeremy: That must have hurt. Did you die a little inside when that happened?

Ben: I mean, no, because they're definitely our users, right? That's the way in which they interact with data, with us and they should be able to learn from that, their company, right? Because again, developers are often not fully in control of this process.

Jeremy: Right. That's a good point.

Ben: And so they may not be able to say, "I want to update this through CloudFormation," right? Either because their organization says it or just because their team doesn't work that way. And I think AWS gets requests to prevent people from using the Console, but also to force people to use the Console. I know that at least one of them is possible in IAM. I don't remember which, because I've never encountered it, but I think it's possible to make people use the Console. I'm not sure, but I know that there are companies who want both, right? There are companies who say, "We don't want to let people use the API. We want to force them to use the Console." There are companies who say, "We don't want people using the Console at all. We want to force them to use the APIs."

Jeremy: Interesting.

Ben: Yeah. There's a lot of AWS customers, right? And there's every possible variety of organization and AWS should be serving all of them, right? They're all customers. And certainly, I want AWS to be leading the ones that are earlier in their cloud journey and on the serverless ladder to getting further but you can't leave them behind, I think it's important.

Jeremy: So that people argument and those different levels and coming in at a different, I guess, level of comfortability with APIs versus infrastructure as code and so forth. There was another question or another comment on this that said, "I love the idea of committing everything that makes my solution to text and resurrecting an entire solution out of nothing other than an account key. Love the ability to compare versions and unit test every bit of my solution, and not having to remember that one weird setting if you're using the Console. But hate that it makes some people believe that any coder is now an infrastructure wizard."

And I think this is a good point, right? And I don't 100% agree with it, but I think it's a good point that it basically ... Back to your point about creating these little transformations in Pulumi, you could do a lot of damage, I mean, good or bad, right? When you are using these tools. What are your thoughts on that? I mean, is this something where ... And again, the CDK makes it so easy for people to write these constructs pretty quickly and spin up tons of infrastructure without a lot of guard rails to protect them.

Ben: So I think if we tweak the statement slightly, I think there's truth there, which isn't about the self-perception but about what they need to be. Right? That I think this is more about serverless than about infrastructure as code. Infrastructure as code is just saying that you can define it. Right? I think it's more about the resources that are in a particular definition that require that. My former colleague, Aaron Camera says, "Serverless means every developer is an architect" because you're not in that situation where the code you write goes onto something, you write the whole thing. Right?

And so you do need to have those ... You do need to be an infrastructure wizard whether you're given the tools to do that and the education to do that, right? Not always, like if you're lucky. And the self-perception is again an even different thing, right? Especially if coders think that there's nothing to be learned ... If programmers, software developers, think that there's nothing to be learned from the folks who traditionally define the infrastructure, which is Ops, right? They think, "Those people have nothing to teach me because now I can do all the things that they did." Well, you can create the things that they created and it does not mean that you're as good at it ...

Jeremy: Or responsible for monitoring it too. Right.

Ben: ... and have the ... Right. The monitoring, the experience of saying these are the things that will come back to bite you that are obvious, right? This is how much ownership you're getting into. There's very much a long-standing problem there of devaluing Ops as a function and as a career. And for my money when I look at serverless, I think serverless is also making the software development easier because there's so much less software you need to write. You need to write less software that deals with the hard parts of these architectures, the scaling, the distributed computing problems.

You still have this, your big computing problems, but you're considering them functionally rather than coding things that address them, right? And so I see a lot of operations folks who come into serverless and learn a new programming language or just upskill, right? They're writing Python scripts to control stuff and then they learn more about Python to be able to do software development in it. And then they bring all of that Ops experience and expertise into it and look at something and say, "Oh, I'd much rather have Step Functions here than something where I'm running code for it because I know how much my scripts break and those kinds of things when an API changes or ... I have to update it or whatever it is."

And I think that's something that Tom McLaughlin talks about, having come from an ops background into serverless. And so I think there's definitely a challenge there in both directions, right? That Ops needs to learn more about software development to be more engaged in that process. Software development does need to learn much more about infrastructure and is also at this risk of approaching it from, "I know the syntax, but not the semantics," sort of thing. Right? We can create ...

Jeremy: Just because I can doesn't mean I should.

Ben: ... an infrastructure. Yeah.

Rebecca: So Ben, as we're looping around this conversation and coming back to this idea that software is people and that really software should enable you to focus on the things that do matter. I'm wondering if you can perhaps think of, as pristine as possible, an example of when you saw this working, maybe it was while you've been at iRobot or a project that you worked on your own outside of that, but this moment where you saw software really working as it should, and that how it enabled you or your team to focus on the things that matter. If there's a concrete example that you can give when you see it working really well and what that looks like.

Ben: Yeah. I mean, iRobot is a great example of this, having been a company without a need for software that scaled to consumer electronics volumes, right? Roomba volumes. And needing to build an IoT cloud application to run connected Roombas and being able to do that without having to gain that expertise. So without having to build a team that could deal with auto-scaling fleets of servers, all of those things, it was able to build up completely serverlessly. And so skip an entire level of organizational expertise, because that's just not necessary to accomplish those tasks anymore.

Rebecca: It sounds quite nice.

Ben: It's really great.

Jeremy: Well, I have one more question here that I think could probably end up ... We could talk about for another hour. So I will only throw it out there and maybe you can give me a quick answer on this, but I actually had another Twitter thread on this not too long ago that addressed this very, very problem. And this is the idea of the feedback cycle on these infrastructure as code tools where oftentimes to deploy infrastructure changes, I mean, it just takes time. In many cases things can run in parallel, but as you said, there's race conditions and things like that, that sometimes things have to be ... They just have to be synchronous. So is this something where there are ways where you see in the future these mutations to your infrastructure or things like that potentially happening faster to get a better feedback cycle, or do you think that's just something that we're going to have to deal with for a while?

Ben: Yeah, I think it's definitely a very extensive topic. I think there's a few things. One is that the deployment cycle needs to get shortened. And part of that I think is splitting dev deployments from prod deployments. In prod it's okay for it to take 30 seconds, right? Or a minute or however long because that's at the end of a CI/CD pipeline, right? There's other things that are happening as part of that. Now, you don't want that to be hours or whatever it is. Right? But it's okay for that to be proper and to fully manage exactly what's going on in a principled manner.

When you're doing it for development, it would be okay to, for example, change the Lambda code without going through CloudFormation to change the Lambda code, right? And this is what Architect does: there's a notion of a dirty deploy, which just packages up the code. Now, if your resource graph has changed, you do need to deploy again. Right? But if the only thing that's changing is your code, sure, you can go and say, "Update function code," on that Lambda directly and that's faster.

But calling it a dirty deploy is I think important because that is not something that you want to do in prod, right? You don't want there to be drift between what the infrastructure as code service understands and what is actually deployed. But then you go further than that and imagine there's no reason that you actually have to do this whole zip file process. You could be rsyncing the code directly, or you could be operating over SSH on the code remotely, right? There's many different ways in which the loop from I have a change in my Lambda code to that Lambda having that change could be even shorter than that, right?
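As a rough sketch of what such a dirty deploy could look like in a dev account: zip the code in memory and push it straight to the function. The `update_function_code` call with `FunctionName` and `ZipFile` is the Lambda API operation as exposed by boto3; `build_zip`, `dirty_deploy`, and the function name in the usage comment are hypothetical. This is not how Architect implements it.

```python
import io
import zipfile


def build_zip(files):
    """Zip a {filename: source} mapping in memory and return the bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, source in files.items():
            archive.writestr(name, source)
    return buffer.getvalue()


def dirty_deploy(lambda_client, function_name, files):
    """Push code directly to the function, bypassing CloudFormation.

    Dev only: the stack no longer matches what is deployed after this call.
    """
    return lambda_client.update_function_code(
        FunctionName=function_name,
        ZipFile=build_zip(files),
    )


# Usage (assumes boto3 is installed and credentials point at a dev account):
#   import boto3
#   dirty_deploy(boto3.client("lambda"), "my-dev-function", {"app.py": source})
```

Passing the client in as an argument keeps the helper easy to exercise without touching AWS, and makes it harder to point it at production by accident.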

And for me, that's what it's really about. I don't think that local mocking is the answer. You and Brian LeRoux were talking about this recently. I mean, I agree with both of you. So I think about it as I want unit tests of my business logic, but my business logic doesn't deal with AWS services. So I want to unit test something that says, "Okay, I'm performing this change in something and that's entirely within my custom code." Right? It's not touching other services. It doesn't mean that I actually need adapters, right? I could be dealing with the native formats that I'm getting back from a given service, but I'm not actually making calls out of the code. I'm mocking out, "Well, here's what the response would look like."
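A minimal sketch of that style of unit test, assuming a function fed by SQS: the business logic takes the native event shape (`Records` with a `body` string) and never calls AWS, so a hand-written event is all the test needs. The order fields and `total_by_customer` are invented for illustration.

```python
import json


def total_by_customer(sqs_event):
    """Business logic only: sum order amounts per customer from an SQS batch."""
    totals = {}
    for record in sqs_event["Records"]:
        order = json.loads(record["body"])
        totals[order["customer"]] = totals.get(order["customer"], 0) + order["amount"]
    return totals


# A canned event in the native SQS-to-Lambda format; no adapters, no AWS calls.
sample_event = {
    "Records": [
        {"body": json.dumps({"customer": "alice", "amount": 5})},
        {"body": json.dumps({"customer": "bob", "amount": 2})},
        {"body": json.dumps({"customer": "alice", "amount": 7})},
    ]
}

totals = total_by_customer(sample_event)
```

Whether the queue is actually wired to the function is a separate question, and one this test deliberately does not try to answer locally.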

And so I think that's definitely necessary in the unit testing sense of saying, "Is my business logic correct?" I can do that locally. But then, "Is the wiring all correct?" is something that should only happen in the cloud. There's no reason to mock API Gateway into Lambda locally in my mind. You should just be dealing with the Lambda side of it in your local unit tests rather than trying to set up this multiple thing. Another part of the story is, okay, so these deploys have to happen faster, right? And then how do we help set up those end-to-end tests and give you observability into it? Right? X-Ray helps, but until X-Ray can sort through all the services that you might use in a serverless architecture, can deal with how does it work in my Lambda function when it's batching from Kinesis or SQS into my function?

So multiple traces are now being handled by one invocation, right? These are problems that aren't solved yet. Until we get that kind of inspection, it's going to be hard for us to feel as good about cloud development. And again, this is where I feel sometimes there's more friction there, but there's bigger payoff. It's one of those things where, again, you're fighting against your incentives, which is not the place that you want to be.

Jeremy: I'm going to stop you before you disagree with me anymore. No, just kidding! So, Rebecca, you have any final thoughts or questions for Ben?

Rebecca: No. I just want to say to both of you and to everyone listening that I hope your today-self is celebrating your yesterday-self right now.

Jeremy: Perfect. Well, Ben, thank you so much for joining us and being a guinea pig as we said on this new format that we are trying. Excellent guinea pig. Excellent.

Rebecca: An excellent human too but also great guinea pig.

Jeremy: Right. Right. Pretty much so. So if people want to find out more about you, read some of the stuff you're doing and working on, how do they do that?

Ben: I'm on Twitter. That's the primary place. I'm on LinkedIn, I don't post much there. And then I write articles that show up on Medium.

Rebecca: And just so everyone knows your Twitter handle I'll say it out loud too. It's @ben11kehoe, K-E-H-O-E, ben11kehoe.

Jeremy: Right. Perfect. All right. Well, we will put all that in the show notes and hopefully people will like this new format. And again, we'd love your feedback on this, things that you'd like us to do in the future, any ideas you have. And of course, make sure you reach out to Ben. He's an amazing resource for serverless. So again, thank you for everything you do, and thank you for being on the show.

Ben: Yeah. Thanks so much for having me. This was great.

Rebecca: Good to see you. Thank you.
