A GitOps Way To Manage Grafana Data Sources At Scale

#monitoring #aws #grafana #gitops

Problem

I work for an enterprise organization and was assigned the task of improving its monitoring system. Because the monitoring system is centralized and serves the whole organization, it has to be easy for every team to use. The system uses Grafana for visualization. I will not cover Grafana's backend in this post; if you're interested, see my post Ultra Monitoring with Victoria Metrics.

In the past, Grafana data sources were added manually through the web UI. We want to avoid that kind of manual operation and automate as much as we can. We also want to follow GitOps practices so that changes are managed, tracked, and auditable.

Solution

Grafana's provisioning feature makes this possible. Data sources can be managed by adding one or more YAML config files to the provisioning/datasources directory. Each config file can contain a list of data sources that are added or updated during startup. If a data source already exists, Grafana updates it to match the configuration file.

Combined with the API for reloading provisioning configurations, this lets us apply data source changes without restarting Grafana.

The idea is to keep the Grafana data source configuration files in a Git repository and use AWS Systems Manager Automation to sync them to the Grafana servers. The repository is organized as follows:
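As a rough illustration of the reload call, here is a minimal Python sketch. It assumes Grafana's admin endpoint POST /api/admin/provisioning/datasources/reload and basic authentication with a Grafana server admin account; the function names and the URL and credentials passed in are placeholders of mine, not part of the actual setup.

```python
import base64
import urllib.request


def build_reload_request(grafana_url, admin_user, admin_password):
    """Build the POST request that asks Grafana to reload data source provisioning."""
    url = grafana_url.rstrip("/") + "/api/admin/provisioning/datasources/reload"
    token = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the endpoint only needs the POST itself
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def reload_datasources(grafana_url, admin_user, admin_password, timeout=10):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(grafana_url, admin_user, admin_password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the request construction separate from the network call makes it easy to check the URL and headers without a running Grafana instance.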

.
├── team-1
│   ├── clickhouse-2.yaml
│   └── cloudwatch-1.yaml
├── team-2
│   ├── clickhouse-1.yaml
│   └── influxdb-1.yaml
├── team-3
│   ├── elasticsearch-1.yaml
│   └── victoria-metrics-1.yaml
└── team-4
    ├── mysql-1.yaml
    └── prometheus-1.yml
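To make the layout concrete, here is a small sketch of how a sync script might enumerate it: one directory per team, each holding .yaml or .yml data source files. The function name is hypothetical and this is not the script used in the runbook.

```python
from pathlib import Path


def collect_datasource_files(repo_root):
    """Map each team directory to its sorted data source files (.yaml or .yml)."""
    teams = {}
    for team_dir in sorted(Path(repo_root).iterdir()):
        # Skip plain files and hidden directories such as .git
        if not team_dir.is_dir() or team_dir.name.startswith("."):
            continue
        files = sorted(
            path.name
            for path in team_dir.iterdir()
            if path.is_file() and path.suffix in (".yaml", ".yml")
        )
        if files:
            teams[team_dir.name] = files
    return teams
```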

The solution combines an AWS Systems Manager Automation runbook with AWS Secrets Manager, so it is secure, fully managed by AWS, and serverless.

The following diagram shows the high-level architecture of the solution:

But wait! Why is Secrets Manager in the architecture diagram?
To answer that, let's look at how a data source is stored in the repository:

name: Prometheus Example 1
type: prometheus
access: proxy
url: http://123.123.1.1:9090
user: "username"
password: "password"
basicAuth: "false"
jsonData:
  httpMethod: POST

Data sources may need credentials, and we cannot leave them in plaintext in the repository because that is a security risk.

Let's go back to the architecture diagram. Here is how the process works:

1. Administrators create a secret to store the credentials of a data source (this can be automated through a portal and/or a chatbot).
2. Administrators review and merge a PR.
3. When the PR is merged, the GitHub/GitLab pipeline triggers a predefined Automation runbook.
4. The runbook executes the steps defined in its SSM document and gets the secrets from Secrets Manager.
5. The runbook generates the data source provisioning files and invokes the Grafana API to reload data sources.
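The pipeline step that triggers the runbook could look roughly like the sketch below. It assumes the boto3 SSM client and its start_automation_execution call; the function name, the runbook name, and the "Environment" parameter are hypothetical and depend on how the runbook is defined.

```python
def start_datasource_sync(ssm_client, document_name, env):
    """Start the Automation runbook and return its execution ID.

    ssm_client is expected to be boto3.client("ssm"); it is passed in so the
    function can be exercised with a stub instead of a real AWS account.
    """
    response = ssm_client.start_automation_execution(
        DocumentName=document_name,
        # Automation parameters map a name to a list of strings.
        # "Environment" is an assumed parameter name, not one from the post.
        Parameters={"Environment": [env]},
    )
    return response["AutomationExecutionId"]
```

In a pipeline job this would be called as start_datasource_sync(boto3.client("ssm"), "<runbook name>", "prod") after the merge.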

The runbook has three main steps:

1. Pull the repository from GitHub/GitLab onto the Grafana server
2. Get data source credentials from Secrets Manager
3. Generate data source provisioning files with credentials

Secrets stored in Secrets Manager are named using the following format:
{env}/grafana/datasource/{team}/{datasource-name}
E.g. prod/grafana/datasource/team-3/elasticsearch-1

Secret values are stored as JSON. E.g.:

{
  "username": "elasticUser",
  "password": "elasticP@ssw0rD"
}

Each secret has two required tags:

env: prod/qa/dev
secret-type: grafana-datasource
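If secret creation is automated, the naming format, JSON value, and tags above can be assembled in one place. The sketch below builds keyword arguments for the boto3 Secrets Manager create_secret call; the helper name is hypothetical and the env check simply mirrors the three values listed above.

```python
import json


def build_create_secret_args(env, team, datasource, credentials):
    """Return keyword arguments for a Secrets Manager create_secret call."""
    if env not in ("prod", "qa", "dev"):
        raise ValueError(f"unknown env: {env!r}")
    return {
        # {env}/grafana/datasource/{team}/{datasource-name}
        "Name": f"{env}/grafana/datasource/{team}/{datasource}",
        # The secret value is stored as a JSON document.
        "SecretString": json.dumps(credentials),
        # The two required tags.
        "Tags": [
            {"Key": "env", "Value": env},
            {"Key": "secret-type", "Value": "grafana-datasource"},
        ],
    }
```

The result can then be passed on with boto3.client("secretsmanager").create_secret(**args).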

The data source file now looks like this:

name: Elasticsearch Example 1
type: elasticsearch
access: proxy
url: http://elasticsearc.example.com:9200
user: "@team-3/elasticsearch-1:username"
password: "@team-3/elasticsearch-1:password"
database: logs-index
basicAuth: true
jsonData:
  esVersion: 7.7.0
  includeFrozen: false
  logLevelField: ""
  logMessageField: ""
  maxConcurrentShardRequests: 5
  timeField: "@timestamp"
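The placeholder syntax in this example appears to be @{team}/{datasource-name}:{key}. Assuming that shape, a check like the following sketch could list which secrets a data source file references, for example to validate a PR before merging. The regular expression and function name are my own assumptions; note that a plain value such as "@timestamp" is not treated as a placeholder because it has no "/" or ":".

```python
import re

# Assumed placeholder shape, inferred from the example: @<team>/<datasource>:<key>
PLACEHOLDER = re.compile(
    r"^@(?P<team>[^/:@\s]+)/(?P<datasource>[^/:@\s]+):(?P<key>[^/:@\s]+)$"
)


def find_placeholders(node):
    """Return a sorted list of (team, datasource, key) found in a parsed data source."""
    found = set()

    def walk(value):
        if isinstance(value, dict):
            for child in value.values():
                walk(child)
        elif isinstance(value, list):
            for child in value:
                walk(child)
        elif isinstance(value, str):
            match = PLACEHOLDER.match(value)
            if match:
                found.add((match["team"], match["datasource"], match["key"]))

    walk(node)
    return sorted(found)
```

The input is the data source file after YAML parsing, that is, a nested structure of dicts, lists, and scalars.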

For step #2 of the runbook, I wrote a Python script that gets the secret values from Secrets Manager and passes them to step #3. The script returns the secrets as JSON with the following structure:

{
  "team-1": {
    "clickhouse-2": {
      "username": "team-1-clickhouse-2-username",
      "password": "team-1-clickhouse-2-password"
    }
  },
  "team-2": {
    "mysql-1": {
      "username": "mysql-1-username",
      "password": "mysql1P@ssword"
    }
  },
  "team-3": {
    "victoria-metrics-1": {
      "authorizationToken": "vict0ri@Metric$Tok3n"
    },
    "elasticsearch-1": {
      "username": "elasticUser",
      "password": "elasticP@ssw0rD"
    }
  }
}
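A script producing this structure might look like the sketch below. It is not the actual script: it assumes the boto3 Secrets Manager client (list_secrets paginator with a name-prefix filter, plus get_secret_value), and the function name is hypothetical. The client is passed in as an argument so the logic can be tried with a stub.

```python
import json


def fetch_datasource_secrets(client, env):
    """Return {team: {datasource: {key: value}}} for one environment.

    client is expected to be boto3.client("secretsmanager").
    """
    prefix = f"{env}/grafana/datasource/"
    secrets = {}
    paginator = client.get_paginator("list_secrets")
    # Ask Secrets Manager for names starting with the prefix.
    for page in paginator.paginate(Filters=[{"Key": "name", "Values": [prefix]}]):
        for entry in page["SecretList"]:
            name = entry["Name"]
            if not name.startswith(prefix):
                continue  # defensive check in case the filter returns extras
            team, _, datasource = name[len(prefix):].partition("/")
            if not datasource:
                continue  # name does not follow the expected format
            value = client.get_secret_value(SecretId=name)["SecretString"]
            secrets.setdefault(team, {})[datasource] = json.loads(value)
    return secrets
```

Filtering on the secret-type tag instead of the name prefix would be an equally reasonable choice.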

For step #3 of the runbook, I wrote another small Python script that combines the data source files in the repository into Grafana data source provisioning files and replaces the secret placeholders with the values from Secrets Manager.
The resulting Grafana data source provisioning configuration looks like this:

[root@grafana datasources]# pwd
/var/lib/grafana/provisioning/datasources

[root@grafana datasources]# ll
total 16
-rw-r--r-- 1 root root 362 May 22 11:00 team-1.yaml
-rw-r--r-- 1 root root 628 May 22 11:00 team-2.yaml
-rw-r--r-- 1 root root 669 May 22 11:00 team-3.yaml
-rw-r--r-- 1 root root 515 May 22 11:00 team-4.yaml

/var/lib/grafana/provisioning/datasources/team-3.yaml

apiVersion: 1
datasources:
- access: proxy
  basicAuth: true
  database: logs-index
  jsonData:
    esVersion: 7.7.0
    includeFrozen: false
    logLevelField: ''
    logMessageField: ''
    maxConcurrentShardRequests: 5
    timeField: '@timestamp'
  name: Elasticsearch Example 1
  password: elasticP@ssw0rD
  type: elasticsearch
  url: http://elasticsearc.example.com:9200
  user: elasticUser
- access: proxy
  isDefault: true
  jsonData:
    httpHeaderName1: Authorization
  name: Victoria Metrics Example 1
  secureJsonData:
    httpHeaderValue1: Bearer vict0ri@Metric$Tok3n
  type: prometheus
  url: http://ultra-metrics.com
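The substitution in step #3 can be sketched as follows. This is an illustration under assumptions, not the script itself: it works on data sources that have already been parsed from YAML into Python dicts, it assumes the @{team}/{datasource-name}:{key} placeholder shape, and the function names are hypothetical. Reading and writing the YAML files is left out.

```python
import re

# Assumed placeholder shape: @<team>/<datasource>:<key>
PLACEHOLDER = re.compile(r"^@([^/:@\s]+)/([^/:@\s]+):([^/:@\s]+)$")


def resolve(node, secrets):
    """Recursively replace placeholder strings with values from the secrets mapping."""
    if isinstance(node, dict):
        return {key: resolve(value, secrets) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, secrets) for item in node]
    if isinstance(node, str):
        match = PLACEHOLDER.match(node)
        if match:
            team, datasource, key = match.groups()
            # A missing secret raises KeyError, so a bad reference fails the run
            # instead of writing a literal placeholder into the provisioning file.
            return secrets[team][datasource][key]
    return node


def build_provisioning(datasources, secrets):
    """Wrap the resolved data sources of one team in the provisioning file layout."""
    return {
        "apiVersion": 1,
        "datasources": [resolve(ds, secrets) for ds in datasources],
    }
```

Serializing the returned dict to YAML, one file per team, gives files shaped like team-3.yaml above.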