Introduction
Prometheus is a powerful open-source monitoring system that can collect metrics from various sources and store them in a time-series database. It is widely used in the industry to monitor and alert the health of applications, servers, and other infrastructure components.
In this article, we will provide a practical introduction to Prometheus monitoring and cover the essential concepts and features that you need to know to get started:
- Architecture
- Setup and Configuration
- PromQL Queries
- Alerting
- Grafana Integration
- Conclusion & Further Reading
The only requirements to follow this guide are:
- Node.Js: We base our examples on metrics from a Node.js server, but if you want to use something else, this guide should still be helpful.
- Docker: We use a dockerized version of Prometheus; if you want to use a different approach, that should be no problem, only minor adjustments might be needed.
You can find the example project we use in this guide here: statuspal/prometheus-nodejs. Just follow the instructions in the README to get the necessary Node.js server and dockerized version of Prometheus running.
Architecture
Prometheus follows a pull-based architecture where the (Prometheus) server periodically fetches metrics from the targets that it is configured to monitor. The targets can be various types of applications, services, or servers, and they expose their metrics using a supported protocol, such as HTTP, SNMP, or JMX. Prometheus supports various metric types, such as counters, gauges, histograms, and summaries, and can perform complex aggregations and calculations on them.
Prometheus server stores the collected metrics in a time-series database where a unique combination of metric name, labels, and a timestamp identifies each time-series. The metric name represents the type of metric, such as HTTP requests, CPU usage, or memory usage. The labels provide additional metadata that can be used to differentiate between instances of the same metric, such as server hostname, service name, or version. The timestamp represents the time when the metric was collected.
Prometheus provides a built-in query language called PromQL, which allows users to retrieve and aggregate metrics based on various criteria, such as metric name, labels, time range, or mathematical expressions. The query results can be visualized using various tools, such as Grafana, or integrated with alerting systems, such as Alertmanager, to trigger alerts based on predefined rules and thresholds.
Setup and Configuration
To get started with Prometheus monitoring, you need to set up a Prometheus server and configure it to scrape the targets that you want to monitor. Prometheus can be installed on various platforms, such as Linux, macOS, or Windows, and run as a standalone binary or a Docker container, which we'll use in this guide.
Once the Prometheus server is up and running, you need to configure it to scrape the targets by defining a set of scrape jobs in the prometheus.yml configuration file. A scrape job specifies the target endpoints, the interval at which the metrics should be scrapped, and any additional options or filters. For example, the following scrape job configures Prometheus to scrape metrics from a Node.js server that exposes metrics on the default port:
scrape_configs:
- job_name: 'nodejs'
scrape_interval: 5s
static_configs:
- targets: ['host.docker.internal:3000']
This configuration tells Prometheus to scrape the metrics every 5 seconds from the target endpoint host.docker.internal:3000 and store them in the time-series database.
PromQL Queries
After configuring Prometheus to scrape your targets, you can start querying the collected metrics using PromQL. PromQL is a flexible and powerful query language that allows you to perform various types of aggregations, filtering, and calculations on the collected metrics.
To get started with PromQL, you can use the Prometheus expression browser, which provides a web-based interface to explore and test PromQL queries. The expression browser allows you to select the metric name, labels, time range, and aggregation functions and displays the query result as a graph or table.
Here are some examples of basic PromQL queries which we'll find in our example project:
-
up
- retrieves the health status of the targets (0 for down, 1 for up). -
http_request_duration_seconds_count
- retrieves the total number of HTTP requests handled by the Node.js server. -
http_request_duration_seconds_sum
- retrieves the response time in seconds of the requests handled by the Node.js server.
In addition to the basic queries, PromQL supports more advanced features such as vector matching, functions, and operators, that allow you to perform more complex queries. For example, the following query retrieves the average count of requests that responded with 304 status of all instances of the nodejs metric:
avg_over_time(http_request_duration_seconds_count{job="nodejs",status_code="304"}[5m])
This query uses the avg_over_time function to calculate the average number of requests over the last 5 minutes and the [] vector selector to filter the “nodejs” jobs with status code 304.
Alerting
Prometheus supports built-in alerting capabilities that allow you to configure and trigger alerts based on predefined rules and thresholds. To create an alert, you need to define a set of rules in the Prometheus alerting rules file, which specifies the metric name, labels, and the conditions that trigger the alert.
For example, the following alerting rule triggers an alert if the Node.js server error rate exceeds 30% for more than 1 minute:
groups:
- name: nodejs
rules:
- alert: NodejsHighErrorRate
expr: http_request_duration_seconds_count{job="nodejs",status_code="500"} / http_request_duration_seconds_count{job="nodejs"} > 0.3
for: 1m
labels:
severity: warning
annotations:
summary: "High error rate on Node.js server"
This rule uses the expr
field to define the condition that triggers the alert, the for field to specify the duration of the condition, and the labels and annotations fields to add additional metadata to the alert.
When the condition is met, Prometheus sends an alert to the Alertmanager, which can then forward the alert to various notification channels, such as email, Slack, or PagerDuty.
Going back to our example project, you should be able to find this alert by clicking on Alerts in your Prometheus dashboard. Initially, it will be green, but if you visit http://localhost:3009/error a couple of times, after one minute, you should see the alert triggered:
Grafana Integration
Prometheus integrates well with Grafana, a popular open-source dashboarding and visualization platform that allows you to create custom dashboards and visualize the collected metrics. To use Grafana with Prometheus, you need to set up a data source that points to your Prometheus server and create a dashboard that displays the desired metrics.
Grafana supports various types of visualizations, such as graphs, tables, gauges, and heatmaps, and allows you to customize the appearance, layout, and behavior of the dashboard. You can also add alerts and annotations to the dashboard to highlight important events and conditions.
We’ll go much deeper into Grafana in the next blog post in this series, so stay tuned!
Conclusion
Prometheus is a powerful and flexible monitoring system that can help you keep track of the health and performance of your applications, servers, and infrastructure components. In this article, we covered the essential concepts and features of Prometheus monitoring, including the architecture, setup and configuration, PromQL queries, alerting, and Grafana integration.
With Prometheus, you can collect, store, and query metrics from various sources, and use the results to gain insights into the behavior and trends of your systems. You can also configure alerts and dashboards that help you detect and respond to issues before they escalate into critical problems.
Whether you are a developer, operator, or site reliability engineer, Prometheus can provide you with the visibility and control that you need to ensure the reliability and availability of your systems. So go ahead and give it a try, and see how it can improve your monitoring and observability workflows.
Further Reading
We're making these a series of blog posts, so we'll continue to elaborate on Prometheus. In the meantime, the following resources are a great place to continue right now.
- https://prometheus.io/docs/prometheus/latest/querying/basics/
- https://prometheus.io/docs/alerting/latest/overview/
- https://prometheus.io/docs/visualization/grafana/
If you're looking for an efficient way to communicate incidents and maintenance to your stakeholders or team members, consider giving StatusPal a try. The first 14 days are free!
Top comments (0)