Monitoring with Prometheus

Overview

  • Create a K8s cluster with EKS
  • Deploy Microservices app
  • Deploy Prometheus Monitoring Stack
  • Monitor Cluster Nodes
  • Monitor K8s components
  • Monitor 3rd party apps, e.g. Redis
    • Deploy Redis exporter
  • Monitor our own apps

Repos

Introduction to monitoring with Prometheus

  • Created to monitor highly dynamic container environments
  • Can also monitor traditional, bare-metal server environments

Why use Prometheus

  • We need a tool that constantly monitors all the services and alerts us when something goes wrong
    • Automated monitoring
    • Alerting

Prometheus Architecture

  • Prometheus Server: Does the actual monitoring work. It has
    • Time Series Database
      • Stores all the metrics data, like CPU usage, exception counts, etc.
    • Data Retrieval Worker
      • Pulls metrics data from apps, services, resources, etc.
      • Stores it in the time series database
    • Web Server / Server API
      • Accepts queries (PromQL) for the stored data
      • The API is used to display the data, e.g. in the Prometheus web UI or Grafana

What Prometheus Monitors

  • Servers
  • Apps
  • Services
  • These are called targets
  • Targets have units of monitoring like CPU status, request counts, exception counts, etc
  • These units you want monitored are called metrics
  • Metrics are saved into the Prometheus database component
  • Prometheus defines human-readable, text-based formats for these metrics
  • Metrics entries have TYPE and HELP attributes
    • HELP: Description of what the metric is
    • TYPE: There are three main types
      • Counter: How many times did x happen?
      • Gauge: What is the current value of x now? (can go up and down)
      • Histogram: How long did x take or how big was it?
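For illustration, this is roughly what a metric looks like in the Prometheus text exposition format (the metric name and values below are made up):

# HELP http_requests_total Total number of HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="get",status="200"} 1027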

How does Prometheus collect those metrics from targets?

  • Pulls metrics from the targets using an HTTP endpoint which by default is host address/metrics
    • Targets must expose /metrics endpoint
    • Data available on /metrics must be in the format that Prometheus understands
  • Some services already expose a Prometheus /metrics endpoint, so no extra work is needed
  • Some services don't have native Prometheus endpoints, so you need a component to help with that
  • This component is called an Exporter
    • Fetches metrics from the target and converts them to a format that Prometheus understands
    • It exposes the data at its own /metrics endpoint where Prometheus can scrape them
    • There are different exporters for different services
    • Exporters are also available as Docker images
    • For your own apps, there are also Prometheus client libraries for different languages like Node, Java, etc

Pull Mechanism

  • Prometheus pulls data from endpoints
  • Most monitoring systems like Amazon CloudWatch or New Relic use a push system
    • Applications and servers push their data to a centralized collection platform of that monitoring tool
    • When many microservices all push their metrics to the monitoring system, this causes
      • High load of network traffic
      • Monitoring can become your bottleneck
      • You have to install a daemon or additional software on each target to push metrics to the monitoring server
  • Pull system
    • Multiple Prometheus instances can pull metrics data
    • Better insight into whether a service is up and running (a failed scrape indicates the target may be down)
  • Push Gateway
    • What happens when a target only runs for a short time? e.g. a batch job
    • Prometheus offers a Pushgateway that such short-lived jobs can push their metrics to; Prometheus then scrapes the Pushgateway
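A minimal sketch of a scrape config for a Pushgateway (the job name and address are assumptions; honor_labels preserves the job/instance labels pushed by the batch job):

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']   # assumed host:port of the Pushgateway (9091 is the default port)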

Configuring Prometheus

  • Uses a prometheus.yaml file
  • Things you can configure
    • Global: How often Prometheus will scrape its targets
      • scrape_interval: How often Prometheus scrapes
      • evaluation_interval: How often rules are evaluated
    • Rules: Rules for aggregating metric values or creating alerts when conditions are met
    • Scrape Configs: What resources Prometheus monitors
  • Prometheus can monitor its own health with its own /metrics endpoint
  • Can define other endpoints to scrape through jobs
    • Define your own jobs
    • Default values for each job
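A minimal prometheus.yaml sketch illustrating these sections (the job names and targets are placeholders):

global:
  scrape_interval: 15s      # how often Prometheus scrapes its targets
  evaluation_interval: 15s  # how often rules are evaluated

rule_files:
  - "rules.yaml"            # aggregation and alerting rules

scrape_configs:
  - job_name: 'prometheus'  # Prometheus scrapes its own /metrics endpoint
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']   # placeholder exporter address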

Alert Manager

  • How are alerts triggered?
  • Prometheus fires alerts, and the Alertmanager sends out the notifications
  • Prometheus reads the alert rules and, if the conditions are met, the alert is fired to the Alertmanager

Data Storage

  • Stores metrics data on disk
  • Optionally integrates with Remote storage systems
  • Data is stored in a custom time series format
  • Cannot write data directly into a relational database
  • Lets you query data on targets using PromQL Query Language

PromQL Query Language

  • Query the targets' data directly in the Prometheus UI
  • With Grafana, you get a UI where you build dashboards that use PromQL behind the scenes to query the data
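Two example PromQL queries against node exporter metrics (assuming the node exporter is one of the scraped targets): the first returns the available memory per node, the second the per-second rate of CPU time spent in user mode over the last 5 minutes.

node_memory_MemAvailable_bytes
rate(node_cpu_seconds_total{mode="user"}[5m])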

Characteristics

  • Reliable
  • Stand-alone and self-contained
  • Works even if other parts of the infrastructure are broken
  • No extensive setup needed
  • Less complex
  • Difficult to scale
  • The scaling limit in turn limits how much you can monitor
  • Workarounds
    • Increase Prometheus server capacity (vertical scaling)
    • Limit the number of metrics
Scale Prometheus using Prometheus Federation

  • Scalable cloud apps need monitoring that scales with them
  • Prometheus Federation
    • Allows a Prometheus server to scrape data from other Prometheus servers
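A sketch of a federation scrape config on the central Prometheus server (the match[] selector and the address of the other Prometheus server are assumptions):

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'               # which series to pull from the other server
    static_configs:
      - targets: ['prometheus-cluster-a:9090']  # assumed address of another Prometheus server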

Prometheus with Docker and K8s

  • Fully compatible
  • Prometheus components are available as Docker images
  • Can easily be deployed in Container Environments like Kubernetes
  • Monitoring of K8s Cluster Node resources comes out of the box

Install Prometheus stack in K8s

How to deploy the different parts in K8s cluster

  1. Create all configuration YAML files yourself and apply them in the right order
  2. Using an operator
    • Operator: Manager of all the individual Prometheus components
    • Manages the combination of all components as one unit
    • Find Prometheus operator
    • Deploy in K8s cluster
  3. Using Helm chart to deploy operator
    • Maintained by Prometheus community
    • Helm: initiate setup
    • Operator: Manage setup

Demo Overview

  • Create EKS cluster
  • Deploy Microservices Application
  • Deploy Prometheus Stack
  • Monitor Kubernetes cluster
  • Monitor Microservices Application

Create EKS Cluster

eksctl create cluster \
  --node-type t2.micro \
  --nodes 2 \
  --name demo-cluster

Delete

eksctl delete cluster --name <cluster-name>

Deploy Microservices App

k apply -f config-microservices.yaml

Deploy Prometheus Stack Using Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Install Prometheus into own namespace

k create namespace monitoring
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring
k get all -n monitoring

Understanding Prometheus Stack Components

  • 2 StatefulSets

    • Prometheus Server
    • Alert manager
  • 3 Deployments

    • Prometheus Operator
      • Created the Prometheus and Alertmanager StatefulSets
    • Grafana
    • Kube State Metrics
      • Has its own Helm chart, which is a dependency of the kube-prometheus-stack chart
      • Scrapes K8s component metrics (states of Deployments, Pods, etc.)
  • 3 ReplicaSets

    • Created by Deployments
  • 1 DaemonSet

    • Runs on every Worker Node
    • Node Exporter DaemonSet
      • Connects directly to the Worker Node it runs on
      • Translates the Worker Node's metrics (CPU, memory, etc.) into Prometheus metrics so they can be scraped
  • Pods

    • From deployments and StatefulSets
  • Services

  • ConfigMaps

    • Configurations for different parts
    • Managed by the Operator
    • Define e.g. how to connect to the different metrics endpoints
  • Secrets

    • For Grafana, Prometheus, Operator, Alertmanager: certificates, usernames and passwords
  • CRDs

    • Custom Resource Definitions
    • Extension of K8s API
  • We've set up our monitoring stack and configuration for the K8s cluster

    • Worker Nodes are monitored
    • K8s components are monitored

Components inside Prometheus, Alertmanager and Operator

k describe statefulset <name> > prom.yaml
  • Mounts: where the configuration data gets mounted into the Prometheus pod
  • Config file: defines what endpoints Prometheus should scrape
  • Rules file: alerting rules, etc.
  • Config-reloader sidecar / helper container
    • Reloads the Prometheus config file whenever it changes
  • Main things to understand
    • How to add / adjust alert rules
    • How to adjust Prometheus configuration

Data Visualization with Prometheus UI

  • Decide what to monitor
    • We want to notice when something unexpected happens
    • Observe any anomalies
      • CPU spikes
      • Insufficient storage
      • High load
      • Unauthorized requests
    • Analyze and react accordingly
  • How do we get the information?
    • Visibility of this monitoring data
    • What data do we have available?

Prometheus UI

k port-forward service/monitoring-kube-prometheus-prometheus 9090:9090 -n monitoring &
  • Targets
    • Status > Targets
  • If a target is not listed, it has to be added (done later via ServiceMonitors)
  • Prom UI
    • Low level
    • For Debugging
    • Not ideal for data visualization
  • Prom config (Status > Configuration)
    • Contains the scrape jobs
    • Instance: an endpoint you can scrape
    • Job: a collection of instances with the same purpose

Introduction to Grafana

k port-forward service/monitoring-grafana 8080:80 -n monitoring
  • Default credentials
    • user: admin
    • pwd: prom-operator
  • Grafana Dashboards
    • Dashboard is a set of one or more panels
    • Can create your own dashboards
    • Organized into one or more rows
    • Row is a logical divider within a dashboard
    • Rows are used to group panels together
  • Have different rows
  • In each row there are panels
  • Panel
    • Basic visualization building block in Grafana
    • Composed of a query and a visualization
    • Each panel has a query editor specific to the data source selected in the panel
    • Can be moved and resized within a dashboard
  • Structure Summary
    • Folders
    • Dashboards
    • Rows
    • Panels
  • When we observe the graphs, we're looking for anomalies
  • You can use PromQL to get data from Prometheus and Grafana will use that data to visualize it

Create Your Own Dashboard

  • Select a metric from the dropdown and click Use query

Resource Consumption of Cluster Nodes

  • Node Exporter / Nodes

Test Anomaly

k run curl-test --image=radial/busyboxplus:curl -i --tty --rm
  • In container
    • Create a shell script
for i in $(seq 1 10000)
do
    curl http://a379ec4ed55544d4e845fbd4f1142cf0-1765023943.us-east-1.elb.amazonaws.com > test.txt
done

Make file executable

chmod +x test.sh
./test.sh

Configure Users & Data Sources

  • Can configure users in Grafana
  • Grafana can visualize data from many sources:
    • Prometheus
    • AWS CloudWatch
    • PostgreSQL
    • Elasticsearch, etc
  • Based on the Datasource you select, the query language will differ

Alert rules in Prometheus

  • Nobody sits in front of a screen all day waiting for anomalies
  • Instead, you get notified and then check the dashboard to investigate and fix the issue
  • So we configure our Monitoring Stack to notify us whenever something unexpected happens
  • Alerting with Prometheus is separated in two parts:
    • Define what we want to be notified about
      • e.g. Send notification when CPU usage is above 50%
    • Send notification (Configure Alertmanager)
  • Check Alert rules on Prometheus UI

Which Alert Rules do we want to configure?

  • We get some alert rules out of the box
  • Firing an alert means the alert is sent to the Alertmanager

Query example

alertmanager_notifications_failed_total{job="monitoring-kube-prometheus-alertmanager",namespace="monitoring"}
  • Labels: allow specifying a set of additional labels to be attached to the alert
  • Can specify the severity of alerts. e.g. critical or warning
  • Can add more labels
  • Can decide to send all critical warnings to slack and warning to emails
  • Can use namespaces too
  • When are the notifications sent when an alert happens?
    • for: causes Prometheus to wait for a certain duration
    • e.g. if for is set to 10 minutes, Prometheus will check that the alert continues to be active for 10 minutes before firing the alert
    • e.g. an app could recover on its own without our intervention
    • The alert is put into a pending state for the for duration; if the condition has not resolved itself by then, the alert is fired
  • When an alert gets sent, it has to have a message
    • annotations: a set of informational labels for longer, additional information
    • runbook_url: points to a URL that explains the issue as well as a possible fix for it

Create Own Alert Rules - 1

  • First rule
    • Alert when CPU usage > 50%
  • Second rule:
    • Alert when Pod cannot start (CrashLoop)

First rule

Percentage of time the CPU is idle, averaged per instance:

(avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)

CPU usage percentage (100 minus the idle percentage):

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)

Find instances with usage over 50%:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50

Rule

  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
    for: 2m
    labels:
      severity: warning
      namespace: monitoring
    annotations:
      description: "CPU load on host is over 50%\n Value = {{ $value }}\n Instance = {{ $labels.instance }}\n"
      summary: "Host CPU load high"

Create your own alert rules - 2

  • Where config is found
    • Status > Configuration
  • The rules file is a ConfigMap in the cluster
  • The rules file has the list of all the alert rules
  • The Prometheus Operator provides CRDs (e.g. PrometheusRule) that let us define alert rules as custom Kubernetes resources; the Operator then tells Prometheus about the new rules that need to be loaded
    • It extends the Kubernetes API and lets us create custom K8s resources
    • The Operator takes our custom K8s resource and tells Prometheus to reload the alert rules

Custom resource

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: main-rules
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: monitoring
spec:
  groups:
    - name: main.rules
      rules:
        - alert: HostHighCpuLoad
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
          for: 2m
          labels:
            severity: warning
            namespace: monitoring
          annotations:
            description: "CPU load on host is over 50%\n Value = {{ $value }}\n Instance = {{ $labels.instance }}\n"
            summary: "Host CPU load high"

Second Rule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: main-rules
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: monitoring
spec:
  groups:
    - name: main.rules
      rules:
        - alert: HostHighCpuLoad
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
          for: 2m
          labels:
            severity: warning
            namespace: monitoring
          annotations:
            description: "CPU load on host is over 50%\n Value = {{ $value }}\n Instance = {{ $labels.instance }}\n"
            summary: "Host CPU load high"
        - alert: KubernetesPodCrashLooping
          expr: kube_pod_container_status_restarts_total > 5
          for: 0m
          labels:
            severity: critical
            namespace: monitoring
          annotations:
            description: "Pod {{ $labels.pod }} is crash looping\n Value = {{ $value }}"
            summary: "Kubernetes pod crash looping"

Apply Alert Rules

k apply -f alert-rules.yaml

Check rule in monitoring namespace

k get PrometheusRule -n monitoring

Check if config is reloaded

Check logs

k logs prometheus-monitoring-kube-prometheus-prometheus-0 -n monitoring -c config-reloader

Create own alert rules - 3

  • We'll simulate a CPU load to test the trigger
k run cpu-test --image=containerstack/cpustress -- --cpu 4 --timeout 30s --metrics-brief

Introduction to AlertManager

Firing State

  • Alert has been sent to the Alert manager
  • Alert manager is the last piece in the pipeline
  • AlertManager dispatches notifications about the alert
  • It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration

Configuring AlertManager

  • 2 separate components

    • Prometheus Server
    • AlertManager
  • Each has its own configuration

  • AlertManager has a simple UI

  • Access AlertManager UI

Use port forwarding

k port-forward svc/monitoring-kube-prometheus-alertmanager -n monitoring 9093:9093
  • Gives us a read-only view of the configuration as well as a way to filter alerts

  • The default config has 3 main parts:

    • global
      • Kind of like global variables that apply throughout the whole configuration
    • routes
    • receivers (main part)
      • These are the notification integrations
  • We'll need to tell AlertManager which alerts will be sent to which receivers

  • So we'll need routing for our alerts

    • We target alerts using the match attribute
    • We have the top-level route and specific routes
    • Top-level
      • Every alert enters the routing tree at the top level
      • Configuration applying to all alerts
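A minimal sketch of such a routing configuration in the Alertmanager config format (receiver names, SMTP settings, and the Slack webhook are placeholders):

global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP settings used by email receivers
  smtp_from: 'alertmanager@example.com'

route:                        # top-level route: every alert enters the tree here
  receiver: 'default-email'   # fallback receiver for all alerts
  routes:
    - match:
        severity: critical    # matching alerts are routed to Slack
      receiver: 'slack'

receivers:
  - name: 'default-email'
    email_configs:
      - to: 'team@example.com'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'   # placeholder webhook URL
        channel: '#alerts'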

Configure Alert Manager with Email Receiver

  • Configuration of AlertManager can be found in this secret: alertmanager-monitoring-kube-prometheus-alertmanager-generated
k get secret alertmanager-monitoring-kube-prometheus-alertmanager-generated -n monitoring -o yaml | less

Decode the base64 secret value

echo <value> | base64 -D | less     (use base64 -d on Linux)
  • We can create a custom resource of kind AlertmanagerConfig that lets us add or adjust the AlertManager configuration

Configure Email Notification

alert-manager-configuration.yaml

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: main-rules-alert-config
  namespace: monitoring
spec:
  route:
    receiver: 'email'
    repeatInterval: 30m
    routes:
      - matchers:
          - name: alertname
            value: HostHighCpuLoad
      - matchers:
          - name: alertname
            value: KubernetesPodCrashLooping
        repeatInterval: 10m
  receivers:
    - name: 'email'
      emailConfigs:
        - to: 'alfredasare101@gmail.com'
          from: 'alfredasare101@gmail.com'
          smarthost: 'smtp.gmail.com:587'
          authUsername: 'alfredasare101@gmail.com'
          authIdentity: 'alfredasare101@gmail.com'
          authPassword:
            name: gmail-auth
            key: password

email-secret.yaml

apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: gmail-auth
  namespace: monitoring
data:
  password: <base64-encoded password>
  • Won't be checked into repository
  • Need to use an application password created in your Google account
  • Can also enable allow less secure apps (Not recommended)

Apply configurations

k apply -f email-secret.yaml
k apply -f alert-manager-configuration.yaml
  • Our AlertManager config will be picked up and reloaded inside the AlertManager application
  • Check the update on the AlertManager UI
  • Can also send an email when an alert is resolved, using the sendResolved option (see the sketch below)
  • Can override email headers
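A sketch of how the email receiver above could be extended (field names follow the AlertmanagerConfig CRD; the subject header shown is illustrative):

  receivers:
    - name: 'email'
      emailConfigs:
        - to: 'alfredasare101@gmail.com'
          sendResolved: true              # also notify when the alert is resolved
          headers:
            - key: 'Subject'
              value: 'Prometheus alert'   # override the default email subject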

Trigger alerts for Email Receiver

  • AlertManager endpoint for firing alerts
    • /api/v2/alerts
  • If email is not sent, check AlertManager logs

Monitor third party applications

  • We are currently monitoring
    • Kubernetes components
    • Resource consumption of nodes
    • Prometheus Stack
  • But not
    • Redis: Third party app
    • Our own application: Online shop
  • We want to know if the Redis app has too much load, too many connections, or if it's down
  • We want to monitor the app on the application level, not the Kubernetes level
  • To monitor 3rd party apps, we use Prometheus Exporters for those services

Exporters

  • An exporter gets metrics data from the service
  • It translates these service-specific metrics into metrics Prometheus understands
  • The exporter exposes these translated metrics under its own /metrics endpoint
  • Prometheus scrapes this endpoint
  • We need to tell Prometheus about this new exporter
  • For that, a ServiceMonitor (a custom K8s resource) needs to be deployed

Deploy redis exporter

  • We'll use a Helm chart that deploys Redis Exporter plus all the needed config
  • Prometheus redis exporter
  • ServiceMonitor is the link between our exporter and our application
    • It describes the set of targets to be monitored by Prometheus
  • Need to add the label release: monitoring so the exporter registers with Prometheus (see the values sketch after this list)
  • If Redis is password protected, we need to authenticate
# - name: REDIS_PASSWORD
#   valueFrom:
#     secretKeyRef:
#       key: redis-password
#       name: redis-config-0.0.2
  • You can configure a PrometheusRule in the same values file
  • Best to keep it in a separate file
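A minimal redis-values.yaml sketch (assuming the chart's serviceMonitor and redisAddress value keys; the Redis service name redis-cart is an example and must match your deployment):

serviceMonitor:
  enabled: true
  labels:
    release: monitoring                 # must match the Helm release label of the Prometheus stack
redisAddress: redis://redis-cart:6379   # example address of the Redis service in the cluster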

Get Repo Info

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Install Chart

helm install redis-exporter prometheus-community/prometheus-redis-exporter -f redis-values.yaml
k get servicemonitor
  • Go to Prometheus UI > Status > Targets

Alert rules: Grafana dashboard for Redis

Create Alert Rules For Redis

redis-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redis-rules
  labels:
    app: kube-prometheus-stack
    release: monitoring
spec:
  groups:
    - name: redis.rules
      rules:
        - alert: RedisDown
          expr: redis_up == 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: Redis down (instance {{ $labels.instance }})
            description: "Redis instance is down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: RedisTooManyConnections
          expr: redis_connected_clients > 100
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Redis too many connections (instance {{ $labels.instance }})
            description: "Redis instance has {{ $value }} connections\n LABELS = {{ $labels }}"

Apply rules

k apply -f redis-rules.yaml

Take redis down

k edit deployment <deployment-name>
  • Change replicas to zero

Create Redis Dashboard In Grafana

Collect & expose metrics with Prometheus Client Library: Monitor Own App - Part 1

  • No exporter available for our own app
  • We have to define the metrics
  • We need to use Prometheus client libraries
  • Client libraries
    • Choose a Prometheus client library that matches the language in which your application is written
    • Abstract interface to expose your metrics
  • Libraries implement the Prometheus metric types

Steps To Monitor Own Application

  • Expose metrics for our Nodejs application using Nodejs client library
  • Deploy Nodejs app in the cluster
  • Configure Prometheus to scrape new target (ServiceMonitor)
  • Visualize scraped metrics in Grafana Dashboard

Expose Metrics - Nodejs Client Library

  • Metrics
    • Number of requests
    • Duration of requests
  • As a devops engineer, you will ask developers to expose the metrics in the application
  • Developers write the code using Prometheus client library
  • Client: prom-client

server.js

const express = require('express');
const path = require('path');
const client = require('prom-client');

const app = express();

const collectDefaultMetrics = client.collectDefaultMetrics;
// Probe every 5th second.
collectDefaultMetrics({ timeout: 5000 });

const httpRequestsTotal = new client.Counter({
  name: 'http_request_operations_total',
  help: 'Total number of Http requests'
})

const httpRequestDurationSeconds = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of Http requests in seconds',
  buckets: [0.1, 0.5, 2, 5, 10]
})

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.end(await client.register.metrics())
})

app.get('/', function (req, res) {
    // Simulate sleep for a random number of milliseconds
    var start = new Date()
    var simulateTime = Math.floor(Math.random() * (10000 - 500 + 1) + 500)

    setTimeout(function(argument) {
      // Simulate execution time
      var end = new Date() - start
      httpRequestDurationSeconds.observe(end / 1000); //convert to seconds
    }, simulateTime)

    httpRequestsTotal.inc();
    res.sendFile(path.join(__dirname, "index.html"));
});

app.listen(3000, function () {
  console.log('App listening on port 3000');
});
  • As a developer
    • You need to define the metric
    • Track the value in your logic

Build Docker Image

docker build -t alfredasare/devops-demo-app:nodeapp .
docker push alfredasare/devops-demo-app:nodeapp

Deploy To K8s Cluster

k8s-config.yaml

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodeapp
  labels:
    app: nodeapp
spec:
  selector:
    matchLabels:
      app: nodeapp
  template:
    metadata:
      labels:
        app: nodeapp
    spec:
      imagePullSecrets:
      - name: my-registry-key
      containers:
      - name: nodeapp
        image: alfredasare/devops-demo-app:nodeapp
        ports:
        - containerPort: 3000
        imagePullPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: nodeapp
  labels:
    app: nodeapp
spec:
  type: ClusterIP
  selector:
    app: nodeapp
  ports:
  - name: service
    protocol: TCP
    port: 3000
    targetPort: 3000
  • We'll need to create a Docker login secret in the cluster
k create secret docker-registry my-registry-key --docker-server https://index.docker.io/v1/ --docker-username=alfredasare --docker-password=pass

Apply Config File

k apply -f k8s-config.yaml

View app

k port-forward svc/nodeapp 3000:3000

Scrape own application metrics: Configure own Grafana Dashboard - Monitor own app - Part 2

  • Create Service Monitor
    • This is how we tell Prometheus we have a new endpoint to scrape
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: monitoring-node-app
  labels:
    release: monitoring
    app: nodeapp
spec:
  endpoints:
  - path: /metrics
    port: service
    targetPort: 3000
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: nodeapp
  • The namespaceSelector tells Prometheus in which namespace to look for the app's Service; here the app runs in the default namespace, not in the monitoring namespace where the Prometheus stack lives
  • port: the name of the port on the app's Service (here "service", as defined in the Service spec above)

Create Grafana Dashboard

  • Number of requests per second, measured over 2-minute intervals
rate(http_request_operations_total[2m])
  • Can use this query for Grafana