Monitoring with Prometheus

Overview

  • Create a K8s cluster with EKS
  • Deploy Microservices app
  • Deploy Prometheus Monitoring Stack
  • Monitor Cluster Nodes
  • Monitor K8s components
  • Monitor 3rd party apps, e.g. Redis
    • Deploy Redis exporter
  • Monitor our own apps

Repos

Introduction to monitoring with Prometheus

  • Created to monitor highly dynamic container environments
  • Can also monitor traditional, bare-metal server environments

Why use Prometheus

  • We need a tool that constantly monitors all the services and alerts us when something goes wrong
    • Automated monitoring
    • Alerting

Prometheus Architecture

  • Prometheus Server: Does the actual monitoring work. It has
    • Time Series Database
      • Stores all the metrics data, like CPU usage, exception counts, etc.
    • Data Retrieval Worker
      • Pulls metrics data from apps, services, resources, etc.
      • Stores it in the time series database
    • Web Server / Server API
      • Accepts queries (PromQL) for the stored data
      • The API is used to display the data, e.g. in the Prometheus web UI or Grafana

What Prometheus Monitors

  • Servers
  • Apps
  • Services
  • These are called targets
  • Targets have units of monitoring like CPU status, request counts, exception counts, etc
  • These units you want monitored are called metrics
  • Metrics are saved into the Prometheus database component
  • Prometheus defines human-readable, text-based formats for these metrics
  • Metrics entries have TYPE and HELP attributes
    • HELP: Description of what the metric is
    • TYPE: There are three main types
      • Counter: How many times did x happen?
      • Gauge: What is the current value of x now? (can go up and down)
      • Histogram: How long did x take or how big was it?
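For illustration, this is roughly what a metric looks like in the Prometheus text exposition format (the metric name and values below are made up):

# HELP http_requests_total Total number of HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="get",status="200"} 1027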

How does Prometheus collect those metrics from targets?

  • Pulls metrics from the targets using an HTTP endpoint which by default is host address/metrics
    • Targets must expose /metrics endpoint
    • Data available on /metrics must be in the format that Prometheus understands
  • Some services already expose a Prometheus /metrics endpoint, so no extra work is needed
  • Some services don't have native Prometheus endpoints, so you need a component to help with that
  • This component is called an Exporter
    • Fetches metrics from the target and converts them to a format that Prometheus understands
    • It exposes the data at its own /metrics endpoint where Prometheus can scrape them
    • There are different exporters for different services
    • Exporters are also available as Docker images
    • For your own apps, there are also Prometheus client libraries for different languages like Node, Java, etc

Pull Mechanism

  • Prometheus pulls data from endpoints
  • Most monitoring systems like Amazon CloudWatch or New Relic use a push system
    • Applications and servers push their data to a centralized collection platform of that monitoring tool
    • When many microservices all push their metrics to the monitoring system, this causes
      • High load of network traffic
      • Monitoring can become your bottleneck
      • You have to install a daemon or additional software on each target to push metrics to the monitoring server
  • Pull system
    • Multiple Prometheus instances can pull metrics data
    • Better insight into whether a service is up and running (a failed scrape indicates the target may be down)
  • Push Gateway
    • What happens when a target only runs for a short time? e.g. a batch job
    • Prometheus offers a Pushgateway that such short-lived jobs can push their metrics to; Prometheus then scrapes the Pushgateway
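A minimal sketch of a scrape config for a Pushgateway (the job name and address are assumptions; honor_labels preserves the job/instance labels pushed by the batch job):

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']   # assumed host:port of the Pushgateway (9091 is the default port)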

Configuring Prometheus

  • Uses a prometheus.yaml file
  • Things you can configure
    • Global: How often Prometheus will scrape its targets
      • scrape_interval: How often Prometheus scrapes
      • evaluation_interval: How often rules are evaluated
    • Rules: Rules for aggregating metric values or creating alerts when conditions are met
    • Scrape Configs: What resources Prometheus monitors
  • Prometheus can monitor its own health with its own /metrics endpoint
  • Can define other endpoints to scrape through jobs
    • Define your own jobs
    • Default values for each job
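A minimal prometheus.yaml sketch illustrating these sections (the job names and targets are placeholders):

global:
  scrape_interval: 15s      # how often Prometheus scrapes its targets
  evaluation_interval: 15s  # how often rules are evaluated

rule_files:
  - "rules.yaml"            # aggregation and alerting rules

scrape_configs:
  - job_name: 'prometheus'  # Prometheus scrapes its own /metrics endpoint
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']   # placeholder exporter address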

Alert Manager

  • How are alerts triggered?
  • Prometheus fires alerts, and the Alertmanager sends out the notifications
  • Prometheus reads the alert rules and, if the conditions are met, the alert is fired to the Alertmanager

Data Storage

  • Stores metrics data on disk
  • Optionally integrates with Remote storage systems
  • Data is stored in a custom time series format
  • Cannot write data directly into a relational database
  • Lets you query data on targets using PromQL Query Language

PromQL Query Language

  • Query the targets' data directly in the Prometheus UI
  • With Grafana, you get a UI where you build dashboards that use PromQL behind the scenes to query the data
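Two example PromQL queries against node exporter metrics (assuming the node exporter is one of the scraped targets): the first returns the available memory per node, the second the per-second rate of CPU time spent in user mode over the last 5 minutes.

node_memory_MemAvailable_bytes
rate(node_cpu_seconds_total{mode="user"}[5m])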

Characteristics

  • Reliable
  • Stand-alone and self-contained
  • Works even if other parts of the infrastructure are broken
  • No extensive setup needed
  • Less complex
  • Difficult to scale
  • The scaling limit in turn limits how much you can monitor
  • Workarounds
    • Increase Prometheus server capacity (vertical scaling)
    • Limit the number of metrics
Scale Prometheus using Prometheus Federation

  • Scalable cloud apps need monitoring that scales with them
  • Prometheus Federation
    • Allows a Prometheus server to scrape data from other Prometheus servers
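A sketch of a federation scrape config on the central Prometheus server (the match[] selector and the address of the other Prometheus server are assumptions):

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'               # which series to pull from the other server
    static_configs:
      - targets: ['prometheus-cluster-a:9090']  # assumed address of another Prometheus server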

Prometheus with Docker and K8s

  • Fully compatible
  • Prometheus components are available as Docker images
  • Can easily be deployed in Container Environments like Kubernetes
  • Monitoring of K8s Cluster Node resources comes out of the box

Install Prometheus stack in K8s

How to deploy the different parts in K8s cluster

  1. Create all configuration YAML files yourself and apply them in the right order
  2. Using an operator
    • Operator: Manager of all the individual Prometheus components
    • Manages the combination of all components as one unit
    • Find Prometheus operator
    • Deploy in K8s cluster
  3. Using Helm chart to deploy operator
    • Maintained by Prometheus community
    • Helm: initiate setup
    • Operator: Manage setup

Demo Overview

  • Create EKS cluster
  • Deploy Microservices Application
  • Deploy Prometheus Stack
  • Monitor Kubernetes cluster
  • Monitor Microservices Application

Create EKS Cluster

eksctl create cluster \
  --node-type t2.micro \
  --nodes 2 \
  --name demo-cluster

Delete

eksctl delete cluster --name <cluster-name>

Deploy Microservices App

k apply -f config-microservices.yaml

Deploy Prometheus Stack Using Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Install Prometheus into own namespace

k create namespace monitoring
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring
k get all -n monitoring

Understanding Prometheus Stack Components

  • 2 StatefulSets

    • Prometheus Server
    • Alert manager
  • 3 Deployments

    • Prometheus Operator
      • Created the Prometheus and Alertmanager StatefulSets
    • Grafana
    • Kube State Metrics
      • Has its own Helm chart, which is a dependency of the kube-prometheus-stack chart
      • Scrapes K8s component metrics (states of Deployments, Pods, etc.)
  • 3 ReplicaSets

    • Created by Deployments
  • 1 DaemonSet

    • Runs on every Worker Node
    • Node Exporter DaemonSet
      • Connects directly to the Worker Node it runs on
      • Translates the Worker Node's metrics (CPU, memory, etc.) into Prometheus metrics so they can be scraped
  • Pods

    • From deployments and StatefulSets
  • Services

  • ConfigMaps

    • Configurations for different parts
    • Managed by the Operator
    • Define e.g. how to connect to the different metrics endpoints
  • Secrets

    • For Grafana, Prometheus, Operator, Alertmanager: certificates, usernames and passwords
  • CRDs

    • Custom Resource Definitions
    • Extension of K8s API
  • We've set up our monitoring stack and configuration for the K8s cluster

    • Worker Nodes are monitored
    • K8s components are monitored

Components inside Prometheus, Alertmanager and Operator

k describe statefulset <name> > prom.yaml
  • Mounts: where the configuration data gets mounted into the Prometheus pod
  • Config file: defines what endpoints Prometheus should scrape
  • Rules file: alerting rules, etc.
  • Config-reloader sidecar / helper container
    • Reloads the Prometheus config file whenever it changes
  • Main things to understand
    • How to add / adjust alert rules
    • How to adjust Prometheus configuration

Data Visualization with Prometheus UI

  • Decide what to monitor
    • We want to notice when something unexpected happens
    • Observe any anomalies
      • CPU spikes
      • Insufficient storage
      • High load
      • Unauthorized requests
    • Analyze and react accordingly
  • How do we get the information?
    • Visibility of this monitoring data
    • What data do we have available?

Prometheus UI

k port-forward service/monitoring-kube-prometheus-prometheus 9090:9090 -n monitoring &
  • Targets
    • Status > Targets
  • If a target is not listed, it has to be added (done later via ServiceMonitors)
  • Prom UI
    • Low level
    • For Debugging
    • Not ideal for data visualization
  • Prom config (Status > Configuration)
    • Contains the scrape jobs
    • Instance: an endpoint you can scrape
    • Job: a collection of instances with the same purpose

Introduction to Grafana

k port-forward service/monitoring-grafana 8080:80 -n monitoring
  • Default credentials
    • user: admin
    • pwd: prom-operator
  • Grafana Dashboards
    • Dashboard is a set of one or more panels
    • Can create your own dashboards
    • Organized into one or more rows
    • Row is a logical divider within a dashboard
    • Rows are used to group panels together
  • Have different rows
  • In each row there are panels
  • Panel
    • Basic visualization building block in Grafana
    • Composed of a query and a visualization
    • Each panel has a query editor specific to the data source selected in the panel
    • Can be moved and resized within a dashboard
  • Structure Summary
    • Folders
    • Dashboards
    • Rows
    • Panels
  • When we observe the graphs, we're looking for anomalies
  • You can use PromQL to get data from Prometheus and Grafana will use that data to visualize it

Create Your Own Dashboard

  • Select a metric from the dropdown and click Use query

Resource Consumption of Cluster Nodes

  • Node Exporter / Nodes

Test Anomaly

k run curl-test --image=radial/busyboxplus:curl -i --tty --rm
  • In container
    • Create a shell script
for i in $(seq 1 10000)
do
    curl http://a379ec4ed55544d4e845fbd4f1142cf0-1765023943.us-east-1.elb.amazonaws.com > test.txt
done

Make file executable

chmod +x test.sh
./test.sh

Configure Users & Data Sources

  • Can configure users in Grafana
  • Grafana can visualize data from many sources:
    • Prometheus
    • AWS CloudWatch
    • PostgreSQL
    • Elasticsearch, etc
  • Based on the Datasource you select, the query language will differ

Alert rules in Prometheus

  • Nobody sits in front of a screen all day waiting for anomalies
  • Instead, you get notified and then check the dashboard to investigate and fix the issue
  • So we configure our Monitoring Stack to notify us whenever something unexpected happens
  • Alerting with Prometheus is separated in two parts:
    • Define what we want to be notified about
      • e.g. Send notification when CPU usage is above 50%
    • Send notification (Configure Alertmanager)
  • Check Alert rules on Prometheus UI

Which Alert Rules do we want to configure?

  • We get some alert rules out of the box
  • Firing an alert means the alert is sent to the Alertmanager

Query example

alertmanager_notifications_failed_total{job="monitoring-kube-prometheus-alertmanager",namespace="monitoring"}
  • Labels: allow specifying a set of additional labels to be attached to the alert
  • Can specify the severity of alerts. e.g. critical or warning
  • Can add more labels
  • Can decide to send all critical warnings to slack and warning to emails
  • Can use namespaces too
  • When are the notifications sent when an alert happens?
    • for: causes Prometheus to wait for a certain duration
    • e.g. if for is set to 10 minutes, Prometheus will check that the alert continues to be active for 10 minutes before firing the alert
    • e.g. an app could recover on its own without our intervention
    • The alert is put into a pending state for the for duration; if the condition has not resolved itself by then, the alert is fired
  • When an alert gets sent, it has to have a message
    • annotations: a set of informational labels for longer, additional information
    • runbook_url: points to a URL that explains the issue as well as a possible fix for it

Create Own Alert Rules - 1

  • First rule
    • Alert when CPU usage > 50%
  • Second rule:
    • Alert when Pod cannot start (CrashLoop)

First rule

Percentage of time the CPU is idle, averaged per instance:

(avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)

CPU usage percentage (100 minus the idle percentage):

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)

Find instances with usage over 50%:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50

Rule

  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
    for: 2m
    labels:
      severity: warning
      namespace: monitoring
    annotations:
      description: "CPU load on host is over 50%\n Value = {{ $value }}\n Instance = {{ $labels.instance }}\n"
      summary: "Host CPU load high"

Create your own alert rules - 2

  • Where config is found
    • Status > Configuration
  • The rules file is a ConfigMap in the cluster
  • The rules file has the list of all the alert rules
  • The Prometheus Operator provides CRDs (e.g. PrometheusRule) that let us define alert rules as custom Kubernetes resources; the Operator then tells Prometheus about the new rules that need to be loaded
    • It extends the Kubernetes API and lets us create custom K8s resources
    • The Operator takes our custom K8s resource and tells Prometheus to reload the alert rules

Custom resource

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: main-rules
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: monitoring
spec:
  groups:
    - name: main.rules
      rules:
        - alert: HostHighCpuLoad
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
          for: 2m
          labels:
            severity: warning
            namespace: monitoring
          annotations:
            description: "CPU load on host is over 50%\n Value = {{ $value }}\n Instance = {{ $labels.instance }}\n"
            summary: "Host CPU load high"

Second Rule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: main-rules
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: monitoring
spec:
  groups:
    - name: main.rules
      rules:
        - alert: HostHighCpuLoad
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
          for: 2m
          labels:
            severity: warning
            namespace: monitoring
          annotations:
            description: "CPU load on host is over 50%\n Value = {{ $value }}\n Instance = {{ $labels.instance }}\n"
            summary: "Host CPU load high"
        - alert: KubernetesPodCrashLooping
          expr: kube_pod_container_status_restarts_total > 5
          for: 0m
          labels:
            severity: critical
            namespace: monitoring
          annotations:
            description: "Pod {{ $labels.pod }} is crash looping\n Value = {{ $value }}"
            summary: "Kubernetes pod crash looping"

Apply Alert Rules

k apply -f alert-rules.yaml

Check rule in monitoring namespace

k get PrometheusRule -n monitoring

Check if config is reloaded

Check logs

k logs prometheus-monitoring-kube-prometheus-prometheus-0 -n monitoring -c config-reloader

Create own alert rules - 3

  • We'll simulate a CPU load to test the trigger
k run cpu-test --image=containerstack/cpustress -- --cpu 4 --timeout 30s --metrics-brief

Introduction to AlertManager

Firing State

  • Alert has been sent to the Alert manager
  • Alert manager is the last piece in the pipeline
  • AlertManager dispatches notifications about the alert
  • It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration

Configuring AlertManager

  • 2 separate components

    • Prometheus Server
    • AlertManager
  • Each has its own configuration

  • AlertManager has a simple UI

  • Access AlertManager UI

Use port forwarding

k port-forward svc/monitoring-kube-prometheus-alertmanager -n monitoring 9093:9093
  • Gives us a read-only view of the configuration as well as a way to filter alerts

  • The default config has 3 main parts:

    • global
      • Kind of like global variables that apply throughout the whole configuration
    • routes
    • receivers (main part)
      • These are the notification integrations
  • We'll need to tell AlertManager which alerts will be sent to which receivers

  • So we'll need routing for our alerts

    • We target alerts using the match attribute
    • We have the top-level route and specific routes
    • Top-level
      • Every alert enters the routing tree at the top level
      • Configuration applying to all alerts
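A minimal sketch of such a routing configuration in the Alertmanager config format (receiver names, SMTP settings, and the Slack webhook are placeholders):

global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP settings used by email receivers
  smtp_from: 'alertmanager@example.com'

route:                        # top-level route: every alert enters the tree here
  receiver: 'default-email'   # fallback receiver for all alerts
  routes:
    - match:
        severity: critical    # matching alerts are routed to Slack
      receiver: 'slack'

receivers:
  - name: 'default-email'
    email_configs:
      - to: 'team@example.com'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'   # placeholder webhook URL
        channel: '#alerts'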

Configure Alert Manager with Email Receiver

  • Configuration of AlertManager can be found in this secret: alertmanager-monitoring-kube-prometheus-alertmanager-generated
k get secret alertmanager-monitoring-kube-prometheus-alertmanager-generated -n monitoring -o yaml | less

Decode the base64 secret value

echo <value> | base64 -D | less     (use base64 -d on Linux)
  • We can create a custom resource of kind AlertmanagerConfig that lets us add or adjust the AlertManager configuration

Configure Email Notification

alert-manager-configuration.yaml

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: main-rules-alert-config
  namespace: monitoring
spec:
  route:
    receiver: 'email'
    repeatInterval: 30m
    routes:
      - matchers:
          - name: alertname
            value: HostHighCpuLoad
      - matchers:
          - name: alertname
            value: KubernetesPodCrashLooping
        repeatInterval: 10m
  receivers:
    - name: 'email'
      emailConfigs:
        - to: 'alfredasare101@gmail.com'
          from: 'alfredasare101@gmail.com'
          smarthost: 'smtp.gmail.com:587'
          authUsername: 'alfredasare101@gmail.com'
          authIdentity: 'alfredasare101@gmail.com'
          authPassword:
            name: gmail-auth
            key: password

email-secret.yaml

apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: gmail-auth
  namespace: monitoring
data:
  password: <base64-encoded password>
  • Won't be checked into repository
  • Need to use an application password created in your Google account
  • Can also enable allow less secure apps (Not recommended)

Apply configurations

k apply -f email-secret.yaml
k apply -f alert-manager-configuration.yaml
  • Our AlertManager config will be picked up and reloaded inside the AlertManager application
  • Check the update on the AlertManager UI
  • Can also send an email when an alert is resolved, using the sendResolved option (see the sketch below)
  • Can override email headers
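A sketch of how the email receiver above could be extended (field names follow the AlertmanagerConfig CRD; the subject header shown is illustrative):

  receivers:
    - name: 'email'
      emailConfigs:
        - to: 'alfredasare101@gmail.com'
          sendResolved: true              # also notify when the alert is resolved
          headers:
            - key: 'Subject'
              value: 'Prometheus alert'   # override the default email subject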

Trigger alerts for Email Receiver

  • AlertManager endpoint for firing alerts
    • /api/v2/alerts
  • If email is not sent, check AlertManager logs

Monitor third party applications

  • We are currently monitoring
    • Kubernetes components
    • Resource consumption of nodes
    • Prometheus Stack
  • But not
    • Redis: Third party app
    • Our own application: Online shop
  • We want to know if the Redis app has too much load, too many connections, or if it's down
  • We want to monitor the app on the application level, not the Kubernetes level
  • To monitor 3rd party apps, we use Prometheus Exporters for those services

Exporters

  • An exporter gets metrics data from the service
  • It translates these service-specific metrics into metrics Prometheus understands
  • The exporter exposes these translated metrics under its own /metrics endpoint
  • Prometheus scrapes this endpoint
  • We need to tell Prometheus about this new exporter
  • For that, a ServiceMonitor (a custom K8s resource) needs to be deployed

Deploy redis exporter

  • We'll use a Helm chart that deploys Redis Exporter plus all the needed config
  • Prometheus redis exporter
  • ServiceMonitor is the link between our exporter and our application
    • It describes the set of targets to be monitored by Prometheus
  • Need to add the label release: monitoring so the exporter registers with Prometheus (see the values sketch after this list)
  • If Redis is password protected, we need to authenticate
# - name: REDIS_PASSWORD
#   valueFrom:
#     secretKeyRef:
#       key: redis-password
#       name: redis-config-0.0.2
  • You can configure a PrometheusRule in the same values file
  • Best to keep it in a separate file
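A minimal redis-values.yaml sketch (assuming the chart's serviceMonitor and redisAddress value keys; the Redis service name redis-cart is an example and must match your deployment):

serviceMonitor:
  enabled: true
  labels:
    release: monitoring                 # must match the Helm release label of the Prometheus stack
redisAddress: redis://redis-cart:6379   # example address of the Redis service in the cluster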

Get Repo Info

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Install Chart

helm install redis-exporter prometheus-community/prometheus-redis-exporter -f redis-values.yaml
k get servicemonitor
  • Go to Prometheus UI > Status > Targets

Alert rules: Grafana dashboard for Redis

Create Alert Rules For Redis

redis-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redis-rules
  labels:
    app: kube-prometheus-stack
    release: monitoring
spec:
  groups:
    - name: redis.rules
      rules:
        - alert: RedisDown
          expr: redis_up == 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: Redis down (instance {{ $labels.instance }})
            description: "Redis instance is down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: RedisTooManyConnections
          expr: redis_connected_clients > 100
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Redis too many connections (instance {{ $labels.instance }})
            description: "Redis instance has {{ $value }} connections\n LABELS = {{ $labels }}"

Apply rules

k apply -f redis-rules.yaml

Take redis down

k edit deployment <deployment-name>
  • Change replicas to zero

Create Redis Dashboard In Grafana

Collect & expose metrics with Prometheus Client Library: Monitor Own App - Part 1

  • No exporter available for our own app
  • We have to define the metrics
  • We need to use Prometheus client libraries
  • Client libraries
    • Choose a Prometheus client library that matches the language in which your application is written
    • Abstract interface to expose your metrics
  • Libraries implement the Prometheus metric types

Steps To Monitor Own Application

  • Expose metrics for our Nodejs application using Nodejs client library
  • Deploy Nodejs app in the cluster
  • Configure Prometheus to scrape new target (ServiceMonitor)
  • Visualize scraped metrics in Grafana Dashboard

Expose Metrics - Nodejs Client Library

  • Metrics
    • Number of requests
    • Duration of requests
  • As a devops engineer, you will ask developers to expose the metrics in the application
  • Developers write the code using Prometheus client library
  • Client: prom-client

server.js

const express = require('express');
const path = require('path');
const client = require('prom-client');

const app = express();

const collectDefaultMetrics = client.collectDefaultMetrics;
// Probe every 5th second.
collectDefaultMetrics({ timeout: 5000 });

const httpRequestsTotal = new client.Counter({
  name: 'http_request_operations_total',
  help: 'Total number of Http requests'
})

const httpRequestDurationSeconds = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of Http requests in seconds',
  buckets: [0.1, 0.5, 2, 5, 10]
})

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.end(await client.register.metrics())
})

app.get('/', function (req, res) {
    // Simulate sleep for a random number of milliseconds
    var start = new Date()
    var simulateTime = Math.floor(Math.random() * (10000 - 500 + 1) + 500)

    setTimeout(function(argument) {
      // Simulate execution time
      var end = new Date() - start
      httpRequestDurationSeconds.observe(end / 1000); //convert to seconds
    }, simulateTime)

    httpRequestsTotal.inc();
    res.sendFile(path.join(__dirname, "index.html"));
});

app.listen(3000, function () {
  console.log('App listening on port 3000');
});
  • As a developer
    • You need to define the metric
    • Track the value in your logic

Build Docker Image

docker build -t alfredasare/devops-demo-app:nodeapp .
docker push alfredasare/devops-demo-app:nodeapp

Deploy To K8s Cluster

k8s-config.yaml

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodeapp
  labels:
    app: nodeapp
spec:
  selector:
    matchLabels:
      app: nodeapp
  template:
    metadata:
      labels:
        app: nodeapp
    spec:
      imagePullSecrets:
      - name: my-registry-key
      containers:
      - name: nodeapp
        image: alfredasare/devops-demo-app:nodeapp
        ports:
        - containerPort: 3000
        imagePullPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: nodeapp
  labels:
    app: nodeapp
spec:
  type: ClusterIP
  selector:
    app: nodeapp
  ports:
  - name: service
    protocol: TCP
    port: 3000
    targetPort: 3000
  • We'll need to create a Docker login secret in the cluster
k create secret docker-registry my-registry-key --docker-server https://index.docker.io/v1/ --docker-username=alfredasare --docker-password=pass

Apply Config File

k apply -f k8s-config.yaml

View app

k port-forward svc/nodeapp 3000:3000

Scrape own application metrics: Configure own Grafana Dashboard - Monitor own app - Part 2

  • Create Service Monitor
    • This is how we tell Prometheus we have a new endpoint to scrape
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: monitoring-node-app
  labels:
    release: monitoring
    app: nodeapp
spec:
  endpoints:
  - path: /metrics
    port: service
    targetPort: 3000
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: nodeapp
  • The namespaceSelector tells Prometheus in which namespace to look for the app's Service; here the app runs in the default namespace, not in the monitoring namespace where the Prometheus stack lives
  • port: the name of the port on the app's Service (here "service", as defined in the Service spec above)

Create Grafana Dashboard

  • Number of requests per second, measured over 2-minute intervals
rate(http_request_operations_total[2m])
  • Can use this query for Grafana