If you're building a multi-tenant SaaS, securely isolating customer data not only from other customers but also from your own developers is a conversation that you'll have sooner or later. At our startup, our customers' data is highly confidential, and we go to great lengths to protect it both from inadvertent exposure caused by software bugs and from internal access by employees unless absolutely necessary.
For a hybrid pool/silo architecture like Skribe's, my favorite strategy for achieving this is one that AWS promotes, known as the Token Vending Machine, which leverages IAM to isolate customer data.
Essentially, an authorized user (1) makes an API request through the API Gateway (2), which calls a custom authorizer to validate the credentials and generate a dynamic IAM policy (3). The dynamic IAM policy is passed to the handler function (4), where it locks all further processing into a specific set of resources (5). The elegance of this solution is that it removes the burden of handling tenant security from the developers' hands and moves it down to the platform level. The threat of exposing tenant data, whether inadvertently or at the hands of a malicious developer, is almost completely mitigated.
The Problem
Skribe is primarily built on Google Cloud in a hybrid pooled/siloed architecture, and for what felt like an eternity, I'd been researching ways to implement this same strategy on GCP. It had seemed impossible given the limitations of their managed services:
Endpoints and API Gateway don't support custom authorizers.
Dynamically generated IAM policies aren't supported.
The proposed solutions you'll find on StackOverflow, Reddit and even GCP's own whitepapers all basically say the same thing: "Tenant security should be handled at the app level."
Yuck!
But after days of trial and error, we found a solution that gives us the highly secure tenant isolation we needed on Google Cloud!
The Solution
As before, the user in Tenant A (1) makes an authorized request to list the users in their tenant (2). The API Gateway passes that to the UsersEndpoint service (3), which has no inherent permission to access any database, so it passes the user's auth token to the TokenVendingMachine (4). The TokenVendingMachine validates the token and, based on the custom claims, retrieves the tenant's Service Account key file from our secure bucket (5) and returns it to the UsersEndpoint service. Finally, we can call our database using the key file (6) and return the results to the user.
Step 1: Onboarding
When a new tenant is created, a tenant-specific Service Account is asynchronously created and the JSON key file is stored in a highly-secured bucket containing tenant key files.
Step 2: Authentication
We use Identity Platform with multi-tenancy enabled to authenticate users. When a user logs in, they exchange their initial token for a custom token containing custom claims such as the user's tenant and role, and that custom token is sent with every subsequent request.
Those custom claims look something like this:
{
tn: 'tn-xyz987',
rl: 'editor',
rg: 1,
...
}
The claims identify the user's tenant, their role and the region that their data resides in.
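Because every downstream decision hinges on these claims, it helps to narrow them to a strict type before using them. The following is a minimal sketch, not our production code: the `TenantClaims` interface and `parseTenantClaims` function are hypothetical names, and the function assumes the token's signature has already been verified elsewhere. Claims hidden behind the "..." above are not modelled.

```typescript
// Hypothetical shape for the three claims shown above.
interface TenantClaims {
  tn: string; // tenant ID, e.g. 'tn-xyz987'
  rl: string; // role, e.g. 'editor'
  rg: number; // region identifier
}

// Narrow an already-verified token payload to TenantClaims.
// Throws instead of returning partial data so a malformed token
// can never be mistaken for a valid tenant.
function parseTenantClaims(payload: unknown): TenantClaims {
  if (typeof payload !== 'object' || payload === null) {
    throw new Error('Token payload is not an object');
  }
  const { tn, rl, rg } = payload as Record<string, unknown>;
  if (typeof tn !== 'string' || tn.length === 0) {
    throw new Error('Missing tenant claim "tn"');
  }
  if (typeof rl !== 'string' || rl.length === 0) {
    throw new Error('Missing role claim "rl"');
  }
  if (typeof rg !== 'number' || !Number.isInteger(rg)) {
    throw new Error('Missing region claim "rg"');
  }
  return { tn, rl, rg };
}
```

Failing closed here means a token without a tenant claim is rejected before any key file lookup is attempted.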
Step 3: API Requests
When a user's authenticated request hits the API Gateway, it's sent to a Cloud Run service that runs our API. The database and storage buckets are abstracted behind like-named services and require a valid JSON key file in order to access any resource.
So if a user requests a list of users within their tenant, the API's code can be as simple as this pseudocode:
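// Assuming an Express app; TokenVendingMachine, Database and Credentials
// are our own internal modules (their imports are omitted here)
import express, { Request, Response } from 'express';

const app = express();

app.get('/users', (req: Request, res: Response) => {
  // Create a new instance of our TokenVendingMachine class
  const tvm = new TokenVendingMachine();
  // Request the key file using the user's auth token
  tvm.get(req.headers.authorization)
    .then(async (key: Credentials) => {
      // The tenant's database name has been embedded in the key
      const db = new Database(key);
      const rows = await db.query("SELECT ...");
      res.json(rows);
    })
    // Any failure (bad token, missing key, query error) ends the request with a 403
    .catch(() => res.sendStatus(403));
});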
Main Takeaway: The developers can write code as if this is a single-tenant environment!
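The TokenVendingMachine class itself isn't shown in this post. As a rough illustration only, here is a minimal sketch of how such a class could be structured. Everything besides the class name and the `tn` claim is an assumption: the `VerifyToken` and `KeyStore` types are hypothetical, the one-object-per-tenant naming (`<tenant>.json`) is a guess at a layout, and the dependencies are injected through the constructor (unlike the no-argument constructor in the handler above) so the sketch runs without any cloud SDK.

```typescript
// Hypothetical types: a parsed key file, a token verifier and a key store.
type Credentials = Record<string, unknown>;
type VerifyToken = (token: string) => Promise<{ tn?: unknown }>;
interface KeyStore {
  read(objectName: string): Promise<string>;
}

class TokenVendingMachine {
  private readonly verifyToken: VerifyToken;
  private readonly keys: KeyStore;

  constructor(verifyToken: VerifyToken, keys: KeyStore) {
    this.verifyToken = verifyToken;
    this.keys = keys;
  }

  // Exchange an Authorization header for the tenant's key file.
  async get(authorization: string | undefined): Promise<Credentials> {
    if (!authorization || !authorization.startsWith('Bearer ')) {
      throw new Error('Missing bearer token');
    }
    const token = authorization.slice('Bearer '.length);

    // Signature and expiry checks are delegated to the injected verifier.
    const claims = await this.verifyToken(token);
    const tenant = claims.tn;

    // Restrict the tenant ID to safe characters before using it in a path.
    if (typeof tenant !== 'string' || !/^[\w-]+$/.test(tenant)) {
      throw new Error('Token carries no usable tenant claim');
    }

    // Assumed layout: one object per tenant, named after the tenant ID.
    const raw = await this.keys.read(`${tenant}.json`);
    return JSON.parse(raw) as Credentials;
  }
}
```

In a real deployment the verifier would wrap the identity provider's token verification and the key store would read from the secured bucket; in tests both can be replaced with in-memory fakes.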
I know what you're going to say...
Why not issue short lived service account credentials?
Latency. Retrieving an existing key file from a GCS bucket is extremely fast compared to requesting new credentials on each request. Sure, you could cache those short-lived credentials, but that creates a new problem: storing them securely when your goal is total isolation.
Why not use Secret Manager to store the key files?
In a word, cost. At $0.03 per 10,000 operations, the costs will add up fast for an API.
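To put a rough number on that, here is a back-of-the-envelope sketch. It assumes exactly one key lookup per API request, and the traffic levels are hypothetical examples rather than our real numbers; only the $0.03 per 10,000 operations figure comes from the text above.

```typescript
// Price quoted above: $0.03 per 10,000 access operations.
const PRICE_PER_10K_OPS_USD = 0.03;

// Monthly cost, assuming one key lookup per API request.
function monthlyAccessCostUsd(requestsPerMonth: number): number {
  return (requestsPerMonth / 10000) * PRICE_PER_10K_OPS_USD;
}

// Hypothetical traffic levels: 1 million, 100 million and 1 billion requests.
for (const requests of [1000000, 100000000, 1000000000]) {
  console.log(`${requests} requests/month: $${monthlyAccessCostUsd(requests).toFixed(2)}`);
}
```

Under those assumptions the bill scales linearly with traffic: roughly $3, $300 and $3,000 per month for the three levels.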
Isn't a storage bucket full of key files dangerous?
Not if properly secured. The TokenVendingMachine service has read-only access to all objects in that bucket, and a separate service that generates the key file during the onboarding process has write access. There's also a backend service that regularly cycles the keys so that they don't live on in perpetuity.
Conclusion
What's important is that by separating tenant security from the app level, we achieve reliable, secure storage and access of our customers' data while removing the responsibility of tenant security from our developers' hands.