Introduction:

Downtime is inevitable. But as SREs, our goal is to make recovery fast, automated, and low-pain. That’s where self-healing infrastructure comes in.

In this post, I’ll show how to build basic self-healing mechanisms using tools you already have: systemd, cron, and bash scripts. These aren’t theoretical tricks—they’re the kind of quick wins that reduce pager fatigue and improve MTTR instantly.

What You’ll Need

  • systemd (default in most modern Linux distros)
  • cron or systemd-timers
  • Some bash scripting
  • Optional: Slack/Webhook for alerts

Let’s break it down with real-world examples.

Auto-Restart Critical Services with systemd

systemd has built-in self-healing through the Restart= directive. This matters for services such as mysqld_exporter (the Prometheus exporter for MySQL), which can crash unexpectedly due to:

  • Memory issues
  • Port binding errors
  • Unexpected input or dependencies

If nothing restarts a crashed service, monitoring breaks and alerts silently stop working. systemd solves this with a restart policy in the unit file:

sudo nano /etc/systemd/system/mysqld_exporter.service

[Unit]
Description=Prometheus MySQL Exporter
After=network.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/etc/mysql/my.cnf
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Use Restart=always for critical daemons that should come back no matter how they exited.

Use Restart=on-failure when a clean exit (status 0) is legitimate and should not trigger a restart.

Then reload the unit files, enable the service at boot, and start it:

sudo systemctl daemon-reload
sudo systemctl enable mysqld_exporter
sudo systemctl restart mysqld_exporter
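
One way to confirm the restart policy actually works is to kill the service on purpose and check that systemd brings it back. The sketch below assumes it runs as root on a systemd host. The helper name verify_auto_restart and the 7-second default wait (slightly longer than RestartSec=5) are illustrative choices, not part of the setup above.

```shell
#!/bin/bash
# Sketch: simulate a crash and confirm systemd restarts the unit.
# verify_auto_restart is a hypothetical helper name; run it as root.

verify_auto_restart() {
  local unit=$1
  local wait_seconds=${2:-7}   # assumed: a bit longer than RestartSec=5

  # SIGKILL counts as an unclean exit, so Restart=on-failure should trigger.
  systemctl kill --signal=SIGKILL "$unit" || return 1
  sleep "$wait_seconds"

  if systemctl is-active --quiet "$unit"; then
    echo "OK: $unit was restarted"
  else
    echo "FAIL: $unit is not active" >&2
    return 1
  fi
}

# Usage: verify_auto_restart mysqld_exporter
```

Try this on a test host first, since it really does kill the process.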

Clean Up Disk Space Automatically

Backups and logs eat up disk space. If the disk fills up:

  • Services stop
  • Databases fail to write
  • Monitoring might break

We need a job that deletes old files (e.g., backups older than 7 days).

Example Script: cleanup_old_backups.sh

#!/bin/bash
# Delete MySQL backups older than 7 days

find /backups/mysql/ -type f -name "*.gz" -mtime +7 -delete
df -h / | awk 'NR==2 { print "Disk usage: "$5 }'

Explanation:

  • find searches for .gz files older than 7 days and deletes them.
  • df -h / shows root partition usage.
  • awk prints just the used disk %.
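
Because find with -delete is unforgiving, it can help to preview what would be removed before trusting the job to cron. Here is a minimal sketch of the same cleanup wrapped in a function with a dry-run mode. The function name cleanup_old_backups and its third "dry-run" argument are my own assumptions for illustration.

```shell
#!/bin/bash
# Sketch: the same cleanup with a safety check and a dry-run mode.
# cleanup_old_backups and its "dry-run" argument are hypothetical names.

cleanup_old_backups() {
  local dir=$1
  local days=${2:-7}
  local mode=${3:-delete}

  # Refuse to run against a missing directory instead of failing silently.
  if [ ! -d "$dir" ]; then
    echo "cleanup: $dir is not a directory" >&2
    return 1
  fi

  if [ "$mode" = "dry-run" ]; then
    # Only list what would be deleted.
    find "$dir" -type f -name "*.gz" -mtime +"$days" -print
  else
    # Print each file as it is deleted, so the cron log shows what happened.
    find "$dir" -type f -name "*.gz" -mtime +"$days" -print -delete
  fi
}

# Usage:
#   cleanup_old_backups /backups/mysql 7 dry-run   # preview
#   cleanup_old_backups /backups/mysql 7           # delete for real
```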

Schedule via cron:

0 2 * * * /usr/local/bin/cleanup_old_backups.sh >> /var/log/cleanup.log 2>&1

This runs every day at 2:00 AM.

Detect and Restart Hung Services

Some services don’t crash — they just hang silently. For example:

  • mysqld_exporter is running but doesn’t respond on port 9104.
  • Your monitoring system receives no metrics.

So you need a health check script that verifies functionality, not just process status.

Example script: check_exporter_health.sh

#!/bin/bash
# Check if mysqld_exporter is exposing metrics

if ! curl -s --max-time 5 http://localhost:9104/metrics | grep -q '^mysql_up'; then
  systemctl restart mysqld_exporter
  logger "mysqld_exporter restarted due to missing metrics"
fi

Explanation:

  • curl fetches the exporter metrics; --max-time 5 makes it give up after 5 seconds, so a hung exporter cannot hang the check itself.
  • grep -q '^mysql_up' quietly checks for a known metric at the start of a line.
  • If the metric is missing (the pipeline fails), the script restarts the exporter and logs the action.
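
One caveat: mysqld_exporter reports mysql_up 0 when the exporter itself is fine but cannot reach MySQL, and restarting the exporter will not fix that. The sketch below separates the two cases. The helper name classify_exporter_metrics and its three state names are hypothetical.

```shell
#!/bin/bash
# Sketch: tell "exporter is not serving metrics" apart from "MySQL is down".
# classify_exporter_metrics is a hypothetical helper; it reads the /metrics
# payload on stdin and prints one of: healthy, mysql_down, no_metrics.

classify_exporter_metrics() {
  awk '
    $1 == "mysql_up" { seen = 1; up = ($2 == 1) }
    END {
      if (!seen)   print "no_metrics"   # exporter hung or unreachable
      else if (up) print "healthy"
      else         print "mysql_down"   # exporter works, MySQL does not
    }'
}

# Usage:
#   state=$(curl -s --max-time 5 http://localhost:9104/metrics | classify_exporter_metrics)
#   [ "$state" = "no_metrics" ] && systemctl restart mysqld_exporter
```

With this split, only the no_metrics state triggers a restart, while mysql_down is better handled as an alert.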

Schedule it every minute:

* * * * * /usr/local/bin/check_exporter_health.sh

Or, even better, use a systemd timer for more control.
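
As a rough sketch of the timer approach, the script below writes a oneshot service and a matching timer that runs the health check every minute. The unit names and the helper write_health_timer_units are assumptions for illustration. On a real host you would pass /etc/systemd/system as the target directory and run it as root.

```shell
#!/bin/bash
# Sketch: generate a oneshot service plus a timer for the health check.
# write_health_timer_units and the unit file names are hypothetical.

write_health_timer_units() {
  local dir=$1

  cat > "$dir/check_exporter_health.service" <<'EOF'
[Unit]
Description=Health check for mysqld_exporter

[Service]
Type=oneshot
ExecStart=/usr/local/bin/check_exporter_health.sh
EOF

  cat > "$dir/check_exporter_health.timer" <<'EOF'
[Unit]
Description=Run the mysqld_exporter health check every minute

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min
AccuracySec=1s

[Install]
WantedBy=timers.target
EOF
}

# Usage (as root):
#   write_health_timer_units /etc/systemd/system
#   systemctl daemon-reload
#   systemctl enable --now check_exporter_health.timer
```

Compared with cron, the script's output lands in the journal, and systemctl list-timers shows when the check last ran and when it runs next.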

Restart on High Memory Usage / OOM

Sometimes a service doesn’t crash — it just consumes too much memory.

Examples:

  • MongoDB
  • Java apps
  • Backup daemons

This can make the server unusable, or cause other services to crash due to memory starvation.

Example script: memory_watch.sh

#!/bin/bash
# Restart Mongo if memory usage > 90%

usage=$(free | awk '/Mem:/ { printf("%.0f", $3/$2 * 100) }')

if [ "$usage" -gt 90 ]; then
  systemctl restart mongod
  logger "MongoDB restarted due to high memory usage ($usage%)"
fi

Explanation:

  • free reports system-wide memory stats (not MongoDB’s own usage).
  • $3/$2 * 100 is used / total memory as a percentage.
  • If it is over 90%, the script restarts MongoDB.
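
The meaning of the "used" column in free has varied between procps versions, so a variant based on MemAvailable in /proc/meminfo may be more predictable. This is a minimal sketch, assuming a kernel recent enough to expose MemAvailable. The helper name mem_used_pct and its optional file argument are hypothetical.

```shell
#!/bin/bash
# Sketch: memory pressure based on MemAvailable rather than "used".
# mem_used_pct is a hypothetical helper; the optional argument lets you
# point it at a saved copy of /proc/meminfo for testing.

mem_used_pct() {
  local meminfo=${1:-/proc/meminfo}
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%.0f\n", (total - avail) / total * 100
      else exit 1
    }' "$meminfo"
}

# Usage:
#   usage=$(mem_used_pct) && [ "$usage" -gt 90 ] && systemctl restart mongod
```

Like the original script, this still measures the whole machine, so a different process could be the one consuming the memory.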

Schedule with cron every 5 minutes:

*/5 * * * * /usr/local/bin/memory_watch.sh

Automation without visibility is chaos. Every time a service restarts, you should:

  • Notify your team (Slack, webhook)
  • Record the action in logs or a dashboard

Slack Notification via Webhook:

curl -X POST -H 'Content-type: application/json' \
--data '{"text":"🔧 Restarted mysqld_exporter on db01 due to no metrics"}' \
https://hooks.slack.com/services/XXXXX/XXXXX/XXXXX

To use:

  • Go to your Slack → Apps → Incoming Webhooks.
  • Create a webhook and paste its URL in place of the placeholder URL above.

Include this command in any of your scripts, right after the restart, to notify the team.
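
To avoid repeating the curl call in every script, it could be wrapped in a small function. In this sketch the function names and the SLACK_WEBHOOK_URL environment variable are my own assumptions. The payload builder only escapes backslashes and double quotes, so messages containing newlines or control characters would need more care.

```shell
#!/bin/bash
# Sketch: reusable Slack notification helper.
# build_slack_payload, notify_slack and SLACK_WEBHOOK_URL are hypothetical names.

# Build a JSON body, escaping backslashes and double quotes in the message.
build_slack_payload() {
  local escaped
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# Post the message; do nothing if no webhook URL is configured.
notify_slack() {
  [ -n "${SLACK_WEBHOOK_URL:-}" ] || return 0
  curl -sS --max-time 10 -X POST -H 'Content-type: application/json' \
    --data "$(build_slack_payload "$1")" "$SLACK_WEBHOOK_URL" > /dev/null
}

# Usage:
#   export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXXXX/XXXXX/XXXXX
#   notify_slack "Restarted mysqld_exporter on db01 due to no metrics"
```

Keeping the URL in an environment variable also keeps the secret out of the scripts themselves.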

When Not to Use Self-Healing

Self-healing is powerful but should not:

  • Mask recurring failures (alert if a service restarts >3 times/hour).
  • Be used on data-sensitive failures (like DB corruption).
  • Replace observability — always log and alert.
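
The first point can be enforced in the health scripts themselves. Below is a minimal sketch that records each automated restart and stops restarting once the hourly limit is exceeded. The function names and the state-file path are hypothetical, and the threshold follows the "more than 3 per hour" rule of thumb above.

```shell
#!/bin/bash
# Sketch: escalate instead of restarting when restarts become frequent.
# record_restart, recent_restarts and the state-file path are hypothetical.

# Append the current epoch time to the state file.
record_restart() {
  date +%s >> "$1"
}

# Print how many recorded restarts fall within the last N seconds.
recent_restarts() {
  local state_file=$1
  local window=${2:-3600}
  local now=${3:-$(date +%s)}

  [ -f "$state_file" ] || { echo 0; return 0; }
  awk -v now="$now" -v window="$window" \
    '$1 > now - window { n++ } END { print n + 0 }' "$state_file"
}

# Usage inside a health check:
#   state=/var/tmp/mysqld_exporter.restarts
#   if [ "$(recent_restarts "$state")" -gt 3 ]; then
#     logger "mysqld_exporter restarted more than 3 times in the last hour; needs a human"
#   else
#     systemctl restart mysqld_exporter && record_restart "$state"
#   fi
```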

Summary

Self-healing is a key SRE pattern that reduces toil and improves uptime—especially for known failure modes.

  • Use systemd for crash recovery
  • Write lightweight health scripts for silent failure detection
  • Automate cleanup tasks before they become incidents
  • Notify your team or system with every auto-recovery action

You don’t need Kubernetes, serverless, or AI to build resilient infrastructure. Sometimes, a bash script and a cron job are all it takes.
