Deyan Petrov

Posted on Apr 20, 2021 • Edited on Apr 21, 2021

Use Azure Kubernetes Service (AKS) + Traefik instead of Azure Functions hosting + Azure API Management

#azure #aks #azurefunctions #serverless

TL;DR: You are better off using AKS (incl. Traefik or a similar ingress controller) instead of Azure Functions App Service/Premium/Consumption Plans + Azure API Management for hosting your microservices.

Disclaimer: The context of this blog post is systems with a fair number of API/transaction requests per day (e.g. 50k+), with API calls (almost) every second for the majority of the day, and standard non-functional requirements such as response times well below 1 second, top-notch security, etc. It does not address hobby projects with public access to everything and an occasional 1-2 API calls per day and/or a single monthly peak of 1000 API calls/hour!

Introduction

Being aware of the latest serverless trend I started a project with the "fully serverless" approach in mind, which in MS/Azure world means Azure Functions framework for building our .NET Core applications, and using Azure Functions Hosting Plans on Azure. The Azure Functions were "hidden" behind Azure API Management (APIM), so that JWT Tokens (from a web-based backoffice UI) and api keys (from B2B integrations) could be centrally validated.

Fast-forward to today and we are currently using none of these:

Azure Functions framework
Azure Functions hosting in Consumption/Premium/App Service plan
APIM

Instead we are pretty happy with these:

Standard .NET 5 apps (using WebJobs SDK only as a syntactic sugar for some non-HTTP triggers)
AKS
Traefik (running in AKS)

Why did we change our mind? There were numerous reasons for that, and I will try to explain each one below. But first a short overview of Azure Functions and APIM.

Azure Functions Hosting

An Azure Functions App can be hosted in multiple ways:

1. Consumption plan
2. Premium plan
3. Dedicated App Service plan (one of the oldest Azure offerings)
4. App Service Environment (ASE)
5. Azure Kubernetes Service (AKS)

When I read the list I was thinking - wow, I can't go wrong with Azure Functions, even if one hosting option turns out to be sub-optimal there are so many others! Upon a second look though 2 of the options should be removed straight away:

Consumption plan has insurmountable cold start issues, and no VNET integration
a. Yes, we did implement a health check calling every single function app (50+) every 5 minutes, and integrated it into our Azure DevOps pipelines. However, this keeps only 1 existing instance warm, not any new instance created due to scale-out
b. Yes, we did implement WarmupTrigger in our function apps, however we still experienced cold starts
c. Without VNET integration, you cannot effectively hide all function apps from the Internet and connect them with other hidden infrastructure like Azure Key Vaults, Event Hubs etc., unless you start playing with source IP restrictions. This very quickly becomes unmaintainable, at the latest when you start whitelisting whole Azure datacenters/regions ...
App Service Environment (ASE) is a super expensive dinosaur which seems to be intended for enterprise customers. ASEv2 requires a monthly fee of more than EUR 1000 for "flat stamping" (whatever that means). ASEv3 (preview) reduces that to USD 280/month for merely 2 vCPUs and 8 GB of RAM of processing capacity ... Not to mention the complexity of network configurations, etc.
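
The keep-warm health check from point (a) can be sketched roughly like this. It is a minimal Python illustration, not our actual pipeline code: the function names, the URLs and the injectable `fetch` parameter are all hypothetical. Note that each ping only reaches whichever instance happens to answer, which is exactly why this approach does not help with scale-out.

```python
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request (raises on errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def ping_all(urls: Iterable[str],
             fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Ping every health endpoint once; True means the app answered with 200."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url) == 200
        except Exception:
            # Any network or HTTP error counts as "not healthy / not warm".
            results[url] = False
    return results
```

Something like this would be scheduled every 5 minutes (cron, a pipeline, or a timer) with the list of health endpoints of all function apps.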

So you are effectively left with options 2, 3 and 5.

Azure Functions Application Framework

Azure Functions as Application Framework with .NET Core 3.1 takes your "functions" decorated with a FunctionName attribute, and runs them in a magical way ¹ within the Azure Functions WebScript Host (provided by MS). You do not have a Program.cs/fs with main or similar, if you want to wire up some start up code you have to implement a class and use an assembly-level FunctionsStartup attribute so that the Azure Functions runtime can find it (awkwardness to the max ;)

Azure Functions with .NET 5 lets you run your code in a separate process and you have a Program.cs/fs with main, however all calls to your code will first pass through the Azure Functions WebScript host, which will then invoke your code via GRPC ² ...

There is no GRPC support on the horizon for Azure Functions to my knowledge, which means that your inter-service calls cannot use GRPC ... FYI, GRPC is widely used for inter-service calls due to its significant performance benefits, and .NET Core 3.1/.NET 5.0 already have very good GRPC support.

Azure API Management (APIM)

APIM is a pretty sophisticated service which wants to intimately know your APIs. You import your Open API / Swagger yamls for example, and it enumerates all operations, so it understands your APIs. Based on that APIM can also render a Web UI where you can list your APIs, test them, etc. Defining Products and Subscriptions comes on top, as well as various Policies which can be injected at different levels.

APIM comes in several tiers, but note that even though most of them may support your requests/second requirements, only 2 of them have VNET support: Developer (non-production use) and Premium (at staggering EUR 2350/month).

Does that sound like a real enterprise offering? Well, if this is not an enterprise offering, I don't know what is ...

Reasons for migrating away from Azure Functions + APIM

Costs

Azure Functions Premium Plan Costs

There are 3 instance sizes (the pricing tables were images and are not reproduced here), and the billing means more than 100 EUR/month for a single instance of the smallest size with 1 vCPU and 3.5 GB memory!!

Azure Functions App Service Plan Costs

Premium v3 (VNET Integration)

In contrast, AKS node billing is simply the standard Azure Virtual Machine billing, so you can get a standard VM with 2 vCPUs + 8 GB RAM for as low as 70-80 EUR/month, excluding any discounts for 1-3y reservations, spot pricing, etc.! Traefik is an open-source project with a free community edition that is quite sufficient for our uses ...

Performance

We experienced serious problems running only 5 different function apps on the smallest SKUs of Azure Functions Premium (EP1) or App Service Plan (P1V2) ... Not only were the applications slow at processing a very low number of requests, but the memory utilization was also very high (due to only 3.5 GB of RAM, of which much less is actually available).

In contrast, we are easily running almost 50 apps in AKS per node (every node with 2 vCPU + 16GB RAM, costing around 90 EUR/month), with less than 20% max CPU utilization and less than 55% RAM utilization ...
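
To put the numbers from this post side by side, here is a back-of-the-envelope calculation in Python. It only uses figures already quoted above (more than 100 EUR/month for the smallest Premium instance struggling with 5 apps, around 90 EUR/month for an AKS node running almost 50 apps); the function name is made up for illustration, and the real gap is probably larger because 100 EUR is a lower bound.

```python
def cost_per_app(monthly_instance_cost_eur: float, apps_per_instance: int) -> float:
    """Monthly hosting cost per app when apps share one instance/node."""
    return monthly_instance_cost_eur / apps_per_instance


premium_per_app = cost_per_app(100.0, 5)   # 20.0 EUR per app per month
aks_per_app = cost_per_app(90.0, 50)       # 1.8 EUR per app per month

print(f"Premium plan: {premium_per_app:.2f} EUR/app/month")
print(f"AKS node:     {aks_per_app:.2f} EUR/app/month")
print(f"Ratio:        {premium_per_app / aks_per_app:.1f}x")
```

This is a rough comparison only: it ignores AKS system pods, extra nodes for redundancy, and any reservation discounts on either side.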

Security

Imagine you want to secure your fairly simple APIM + Azure Functions + some underlying Azure services "modern" architecture:

Diagram: "Modern" serverless architecture based on Azure Functions

If you are using the "serverless/consumption" versions of Azure Functions and APIM, you will be surprised by quite a few imho insurmountable security challenges:

VNET Integration

I do not get why in 2021 Azure allows you to create resources without a VNET (VPC) ... For hobby projects I understand, but any serious enterprise-grade system cannot survive without vnets ... It is also beyond my understanding why the Consumption Plan has no VNET Integration (besides internal technical limitations of Microsoft's implementation, or marketing/pricing agendas) ... I see talented people wasting their time trying to find Source IP Restriction-based approaches to secure their Consumption Plan hosted Azure Functions and I am wondering why they are doing this to themselves ...
Network security for the underlying Storage Account - some people are still not able in 2021 to secure the storage account required by the Azure Functions runtime ... this has been a major outstanding issue for the past several years, and even though MS have finally fixed that I am not 100% sure if it has been rolled out everywhere ...
With Premium and App Service Plans VNET Integration exists, however with the former 1 of my function apps once lost its VNET Integration (= full downtime). The answer to my MS Support ticket (120051322002711) was to additionally host my function app in another region, and put Azure Front Door or similar on top ... I am not saying that this incident can/will repeat for you, I just have the feeling that the VNET Integration was bolted onto the Azure Functions Premium Plan more as an afterthought than as a mandatory underlying infrastructure principle ...
Function Apps on Consumption Plan do not have reliable outboundIPAddresses/possibleOutboundIPAddresses, all IP ranges of the whole Azure region must be whitelisted (> 127 IP ranges), however the firewall rules in Key Vault allows for max 127 rules (no, this is not a joke)
If function apps and storage accounts are in the same region, then connectivity function apps -> storage accounts goes via private Azure network (e.g. 10.150.*), however whitelisting of public ip ranges is only allowed in storage account firewall (or vnets, no private ip ranges – no, this is not a joke either).
It turns out that Azure API Management on Consumption Plan does not have a dedicated IP, so the IP addresses of the whole Azure West Europe datacenter must be whitelisted, which clashes with the 127 rules max limit for firewalls in Azure. Azure Support additionally suggests using the 40k json file with all Azure IP ranges, however that requires heavy investment in setting up a function app to parse the file on a weekly basis and to update the firewall rules in all ip-restricted services.
Vnet Integration for APIM is only possible in Developer (no SLA, gets restarted with downtime on a monthly basis) or Premium Tier (costs EUR 2500/month). Numerous ideas in the feedback center, no result.
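
To illustrate the whitelisting problem from the points above, here is a minimal Python sketch that reads the prefixes for one region from an Azure IP ranges JSON document, merges adjacent ranges, and checks the result against the 127-rule limit. The field names (`values`, `name`, `properties.addressPrefixes`) and the tag name `AzureCloud.westeurope` are my assumption about the layout of the downloadable file, the function names are made up, and the sample data uses documentation IP ranges, not real Azure ones.

```python
import ipaddress
import json
from typing import List

MAX_FIREWALL_RULES = 127  # limit quoted in this post


def collapsed_ipv4_prefixes(service_tags_json: str, tag_name: str) -> List[str]:
    """Return the merged IPv4 prefixes listed for one service tag."""
    doc = json.loads(service_tags_json)
    for entry in doc["values"]:
        if entry["name"] == tag_name:
            networks = [ipaddress.ip_network(p)
                        for p in entry["properties"]["addressPrefixes"]]
            ipv4 = [n for n in networks if n.version == 4]
            # collapse_addresses merges adjacent/overlapping ranges into fewer rules
            return [str(n) for n in ipaddress.collapse_addresses(ipv4)]
    raise KeyError(f"service tag not found: {tag_name}")


def fits_in_firewall(prefixes: List[str], limit: int = MAX_FIREWALL_RULES) -> bool:
    """True if the prefix list fits within the firewall rule limit."""
    return len(prefixes) <= limit


# Tiny made-up sample in the assumed layout (documentation ranges only).
SAMPLE = json.dumps({"values": [{
    "name": "AzureCloud.westeurope",
    "properties": {"addressPrefixes": [
        "192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24", "2001:db8::/32"]},
}]})

prefixes = collapsed_ipv4_prefixes(SAMPLE, "AzureCloud.westeurope")
print(prefixes, fits_in_firewall(prefixes))
```

Merging ranges reduces the rule count somewhat, but whether it gets a real region under the limit is not something this sketch can promise, and the file changes regularly, so the rules would have to be re-synced on a schedule anyway.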

Slow deployment + 100% CPU

On both App Service Linux and Windows Plans we experienced very high CPU peak every time we deployed (incl. slot swap). We had only a few (max 5) function apps hosted on the smallest (1 vCore) instance, with no application load whatsoever (no requests). Microsoft Support's answer was: "It is normal to have some CPU increase during deployment for a short period of time and it should not impact the overall availability of the functions.". Of course, it did affect the performance of the functions dramatically ...

With the App Service Plan the deployment is handled by the Kudu management container, which uses local storage for the deployment. This is also slower, and on top of that there are some additional intermediary steps in the deployment.

Additionally, we experienced from time to time failed deployments with errors like "Bad Gateway 502", or sometimes 409.

I think there is a general issue with the performance of function apps on the App Service Linux plan, and that may be the root cause for the deployment issues …

In contrast, deploying to AKS is very fast (less than 50 seconds) and has no measurable impact on the CPU utilization of the node (2 vCPU, 16 GB RAM).

Inter-service (Service-to-Service) API calls

In the context of Microservice Architecture sooner or later you will need to have synchronous (REST or GRPC) calls from one app to another. Yes, you should try not to have such, yes, you should try to make everything based on events/messages via message bus ... however, the reality is that you will definitely reach some consistency limitations which will force you to have such inter-service calls.

In case you are using the Azure Functions Consumption Plan you will be charged for every inter-service call ... imagine you have an orchestration service1, which makes a GET to service2, then a POST to service3 and service4 ... then you will be charged for 3 additional calls for every API call to the orchestration service.
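
The example above can be written down as a small Python sketch that counts how many function executions a single incoming request triggers. The call graph mirrors the service1..service4 example; the function name is hypothetical, the graph is assumed to be acyclic, and only the execution count is modelled, not the resource-consumption part of the bill.

```python
from typing import Dict, List


def billed_executions(call_graph: Dict[str, List[str]], entry: str) -> int:
    """Count executions triggered by one request to `entry` (entry included).

    Assumes the call graph has no cycles.
    """
    total = 1  # the entry function itself
    for callee in call_graph.get(entry, []):
        total += billed_executions(call_graph, callee)
    return total


# service1 orchestrates: GET service2, POST service3, POST service4
graph = {"service1": ["service2", "service3", "service4"]}

print(billed_executions(graph, "service1"))  # 4 = 1 original + 3 additional
```

With deeper call chains the multiplier grows accordingly, since every downstream hop is itself a billed execution.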

In case of Premium or App Service Plan, your inter-service calls will traverse the Azure network stack, which is much slower than having service calls on the same node as is the case for AKS. We were even forced to deploy several microservices as 1 deployment unit with in-process calls instead (using project references, which was causing a big deployment mess) ...

Other Azure Functions peculiarities

Strange limitations of length of app name

Update of Azure Function's App Application Settings from the UI takes 2x longer for App Service Linux hosted function app compared to Premium Windows hosted function app.
It seems that if we have a long function app name (e.g. 59 characters) everything is slower, including updating Application Settings in the UI.
MS Support recommends keeping function app names under 40 characters especially when working with Deployment Slots because there is a limit set in for the length of the hostname that is generated from the function name ( app_name.azurewebsites.net ), and as the name of the deployment slot is added to the host name it can get truncated and could cause issues with deployment or slot swapping.
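
A naming check along these lines could be added to a deployment script. This is only a sketch in Python: the 40-character threshold is the MS Support recommendation quoted above, the `<app>-<slot>.azurewebsites.net` pattern for slot host names is my assumption about how the slot name is appended, and the function names and app names are made up.

```python
RECOMMENDED_MAX_APP_NAME = 40  # MS Support recommendation quoted in this post


def slot_hostname(app_name: str, slot: str = "") -> str:
    """Build the default host name; assumes the slot is appended as '-<slot>'."""
    label = f"{app_name}-{slot}" if slot else app_name
    return f"{label}.azurewebsites.net"


def is_name_ok(app_name: str) -> bool:
    """True if the function app name is under the recommended length."""
    return len(app_name) < RECOMMENDED_MAX_APP_NAME


for name in ["orders-api", "x" * 59]:
    print(len(name), is_name_ok(name), slot_hostname(name, "staging"))
```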

In contrast, K8s defines name limits of up to 253 characters ...

Strange tooling (Kudu)

While investigating an issue with a function app I got the request from MS Support to "please login to the kudu site" .. of the function app "... or select the functionapp-> platform features-> AdvancedTools(kudu)". Kudu seems to be a very old and custom web UI for App Service ... with a 100% guarantee that any knowledge gained is not transferable to any other cloud provider or similar ;)

Strange settings

There are a number of "magic" settings to be considered like Always On so that the app runs correctly. On an App Service plan, the functions runtime goes idle after a few minutes of inactivity, so only HTTP triggers will 'wake up' your functions.
Whenever you run from a package, it is recommended to have the WEBSITE_RUN_FROM_PACKAGE setting in the app settings.
For VNET Integration of Premium Plan-hosted function apps for example you should set WEBSITE_VNET_ROUTE_ALL=1 and WEBSITE_DNS_SERVER=168.63.129.16 ...
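
Since these settings are easy to forget, a small pre-deployment check can help. The following Python sketch only verifies the settings named above against a dictionary of app settings; the function name and the way the settings dictionary is obtained are hypothetical, and the expected values are the ones quoted in this post.

```python
from typing import Dict, List

# Values as quoted in this post for Premium Plan VNET Integration.
REQUIRED_VNET_SETTINGS = {
    "WEBSITE_VNET_ROUTE_ALL": "1",
    "WEBSITE_DNS_SERVER": "168.63.129.16",
}


def setting_problems(app_settings: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in a function app's settings."""
    problems = []
    if "WEBSITE_RUN_FROM_PACKAGE" not in app_settings:
        problems.append("WEBSITE_RUN_FROM_PACKAGE is missing")
    for key, expected in REQUIRED_VNET_SETTINGS.items():
        actual = app_settings.get(key)
        if actual != expected:
            problems.append(f"{key} should be {expected!r}, found {actual!r}")
    return problems


print(setting_problems({"WEBSITE_RUN_FROM_PACKAGE": "1",
                        "WEBSITE_VNET_ROUTE_ALL": "1"}))
```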

Other APIM-related hassles

APIM setup takes extremely long time

Extremely long means more than half an hour. The "serverless" tier is the only exception and takes much less, but compare the features before you decide for it (remember - no VNET Integration!)

In contrast Traefik takes seconds to install in an AKS/K8s cluster (we are using simple yaml deployment files, no helm charts).

Same server resources seem to be shared between the Developer Portal and API requests

Can it be that someone browsing your API Web Portal has an impact on your REST API response time SLA?

In contrast AKS/K8s has extensive requests/limits capabilities.

CPU utilization for a prolonged time

We were paying 125 EUR/month for 1 unit which should allow us 1000 rps, but even with only 20-30 requests per 5 minutes we were seeing 50% capacity utilization of our APIM instance.
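
To show how far apart these two numbers are, here is the arithmetic as a short Python sketch. It uses only the figures above (at most 30 requests per 5 minutes, a nominal 1000 rps per unit); the function names are made up for illustration.

```python
def average_rps(requests: int, window_seconds: int) -> float:
    """Average requests per second over a time window."""
    return requests / window_seconds


def share_of_nominal(rps: float, nominal_rps: float) -> float:
    """Fraction of the nominal throughput that is actually used."""
    return rps / nominal_rps


rps = average_rps(30, 5 * 60)          # 0.1 requests per second
share = share_of_nominal(rps, 1000.0)  # about 0.0001, i.e. 0.01 %

print(f"{rps:.2f} rps = {share:.2%} of nominal throughput")
```

So the request load was roughly one ten-thousandth of what 1 unit should handle, while the reported capacity utilization was 50%.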

Diagram: 2 APIM instances with same minimal load, the one on the right showing high utilization for no reason

After some useless roundtrips with the MS Support person (no, we have absolutely no load on the system currently, no complex policies, nothing ...) he indicated that such capacity utilization is to be ignored, and might happen due to updates of a freshly set up APIM instance for a duration of a couple of days ...

Cumbersome APIM deployment scripts

Creating an APIM instance and configuring it requires some heavy scripting, as you need to additionally:
1. Define Products
2. Define Policies
3. Define Subscriptions
4. Import APIs from swagger yamls for example
5. Create Revisions of the APIs
Policies alone can be defined on different levels - Product, all APIs, single API (all operations), single operation.

Until recently the full functionality of APIM could only be configured with Powershell, as Azure CLI did not have the support ...

The creation of the deployment scripts took us tons of time (still fragile), and their execution was also pretty lengthy.

Azure Functions Runtime/Framework issues

The fact that the Azure Functions Team "missed" the .NET 5 launch by 4-5 months (Nov 2020 - March 2021) is pretty well known. I think the MS Team was surprised that not all of their customers are "enterprise developers" lagging 1-2 years behind the latest technologies due to big companies upgrading slowly ... The promise is that this will not repeat with .NET 6+ ...

Now we have 2 parallel runtimes - in-process (.NET Core 3.1) and out-of-process (.NET 5) - and it seems this duality will continue for a couple of years ... Not sure how often you have to switch in-process -> out-of-process -> in-process ...

With the out-of-process model (.NET 5) the Azure Functions Runtime calls your functions using GRPC ... as mentioned above, not sure what is the performance impact of that vs. fully in process ². I am not sure I want to use GRPC inside my application process just because MS wants to have out-of-process now ...
Azure Functions Runtime seems to consume 2 times more memory
GRPC is not supported for inter-service communications, so a microservice based on the Azure Functions runtime cannot expose GRPC interface, which is easily possible with non-Azure Functions Runtime hosted .NET 5 app

Summary and Recommendation

Long story short, my personal recommendation is: do not waste your time with Azure Functions hosting (or even application framework) and APIM for any serious project with standard security/performance requirements, and with the goal of having competitive pricing for hosting ... Or in other words, use Azure Functions only for quick-and-dirty hobby projects, or one-off jobs (maybe some glue infrastructure code), or in case performance, security and costs are absolutely no factors.
Additionally, value your time and rather invest it in learning something standard like Kubernetes, where you can use your knowledge across clouds and on premise, instead of learning all the Azure Functions peculiarities, which IMHO may be a result of Microsoft trying to reuse the existing (but heavy on legacy) Azure App Service/Web Script platform/framework for Azure Functions ...

In detail:

Migrate from Azure Functions Hosting (Consumption/Premium/App Service Plan) to AKS³ for much lower costs, better security, and well-known container management
Migrate from Azure API Management to Traefik or similar K8s Ingress Controller for reducing costs, simplifying management and deployment.
Migrate from Azure Functions .NET Core 3 (in-process) or .NET 5 (out-of-process) runtime to standard .NET 5 application using WebJobs SDK directly for independence, faster upgrades, lower memory consumption, GRPC and many other possibilities.

If there is interest I could create another post showing how to do the above ...

¹ For example the mandatory Microsoft.NET.Sdk.Functions nuget automatically generates the function.json file upon compilation ..
² I had an open GitHub issue on the .NET 5 worker preview project asking about the performance impact of out-of-process/grpc between the host and my code, however the repo got archived and my issue was dropped unanswered ...
³ I think AKS is one of the best Azure products currently, very competitive when compared to EKS or GKE etc.

Top comments (8)

Casper Rubæk • Sep 15 '21 • Edited

This is just the kind of article I have been looking for to provide some scientific reasoning behind choosing serverless versus Kubernetes, and it has certainly made it clear to me that serverless is not a good fit for most use cases and for my use cases in general. The only reason for using Functions is the low cost in most situations, however this is a fallacy when weighed against the downsides, mostly that a simple hello world api call takes at least 300-400 ms and inter-service communication is slow.
Adding to that, it is very cumbersome to develop and debug all the services on a single development machine because there is no orchestration for running the function apps locally, so ideally you would need to use kubelet to run them in, and then you might as well just use Kubernetes.

That being said I have used Functions for some time, however I can also confirm that it is sometimes unstable and slow, at least in consumption mode. The fact that VNET is not mandatory is also concerning, and cold start is a problem.

I have the same experience with the API Management consumption tier, it is very difficult to debug since the servers running the service are masqueraded in a shared environment. I have yet to send enough traffic through it to test the performance properly, but I have implemented a production system with low traffic with this service and Logic Apps, which is working very nicely.

Thanks for posting a follow-up article with your implementation of the .NET 5 app. I would also like to see how you would set up Kubernetes for production purposes and also how the whole infrastructure would look in a diagram, as well as the reasoning behind using F# instead of C#. Also, does Azure provide enough benefit, for example service/feature wise, whilst using Kubernetes that it makes sense to still use Azure, or would a cheaper provider like DigitalOcean suffice?

Deyan Petrov • Sep 17 '21 • Edited

Hi @casperrubaekm ,

Why F# and not C# is a huge topic with a lot of googleable links on it already, I will simply say that after countless years with C# and early experience with a bunch of other low-level and OOP languages I am fully sold on the simplicity and elegance of F# and the functional paradigm ..

I will try to find time for another post on setting up AKS with Traefik etc.

Missed the question about AKS alternatives - I guess there are viable ones for sure, hoping to find time in the future to test GKE for example ... The context of the above article was an already built Azure-Functions-based microservice system, which was easier to migrate to AKS (and stay inside Azure for the other integrated services like AppInsights, KeyVault, Event Hubs, Storage etc) rather than migrate to another cloud ...

Br,
Deyan

Deyan Petrov • Sep 18 '21 • Edited

For hosting pure .NET 5 apps you need:

AKS
Traefik
AAD Pod Identity (so that the pods can contact Key Vault without credentials)

I see there are some excellent articles on setting up AKS with Traefik - e.g. [this one from Kumar Allamraju](kumar-allamraju.medium.com/using-t...

Also Aad Pod Identity is quite well-documented, even MS is building it into AKS ...

So not sure if an additional post of mine is actually needed ...

Maybe what is not covered is the automatic renewal of Letsencrypt certificates, for which we have created a custom .NET app/pod, based on someone's F# code, need to dig it out ...

Casper Rubæk • Sep 26 '21 • Edited

Thanks.

I have just read the article you mentioned for configuring Traefik and it is great, however I would like to know how an AKS cluster might look like in a real world production app.
There is a lot to learn from it such as service mesh, health checks, monitoring, interservice communication, shared or non shared databases, etc..
So if you get the time I would look forward to reading an article on how such an AKS cluster would look like, ideally based on your own real world production microservices app.

Casper Rubæk • Sep 26 '21

Sounds like I should take F# for a spin sometime.

Yes it makes sense to stick with Azure since you are heavily invested in supporting services like AppInsights and KeyVault.

Erythnul • May 12 '21

Interesting thoughts, definitely sounds worth experimenting with. We've also experienced some issues with Azure Functions Hosting.

Interested to see how you implemented applications using the WebJobs SDK. If you do create another post, please let me know.
We currently use Nginx for our Kubernetes Ingress, any good reason we should consider switching away from Nginx to Traefik?

Deyan Petrov • May 12 '21 • Edited

Will do.

I have no real experience with nginx, just read its config is more complicated, and a colleague was happy using Traefik for years, so that's how I ended up with it. Pretty happy so far with it, especially with its forward-auth middleware.

Deyan Petrov • Jun 1 '21

@erythnul , published dev.to/deyanp/f-app-stub-for-aks-h..., working github project will follow soon.
