Introduction:
Downtime is inevitable. But as SREs, our goal is to make it short, automated, and low-pain. That’s where self-healing infrastructure comes in.
In this post, I’ll show how to build basic self-healing mechanisms using tools you already have: systemd, cron, and bash scripts. These aren’t theoretical tricks—they’re the kind of quick wins that reduce pager fatigue and improve MTTR instantly.
What You’ll Need
- systemd (default in most modern Linux distros)
- cron or systemd-timers
- Some bash scripting
- Optional: Slack/Webhook for alerts
Let’s break it down with real-world examples.
Auto-Restart Critical Services with systemd
Systemd has built-in healing via the Restart= directive. For example, here’s how you can auto-restart mysqld_exporter if it crashes. Some services like mysqld_exporter (used for Prometheus monitoring) might crash unexpectedly due to:
- Memory issues
- Port binding errors
- Unexpected input or dependencies
When a service crashes, if nothing restarts it, monitoring breaks and alerts stop working — silently. systemd solves this.
sudo nano /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=Prometheus MySQL Exporter
After=network.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/etc/mysql/my.cnf
Restart=on-failure
RestartSec=5
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Use Restart=always for very critical daemons.
Use Restart=on-failure if the service might exit cleanly in some cases.
Then enable and start:
sudo systemctl daemon-reexec
sudo systemctl daemon-reload
sudo systemctl enable mysqld_exporter
sudo systemctl restart mysqld_exporter
Clean Up Disk Space Automatically
Backups and logs eat up disk space. If disk fills:
- Services stop
- Databases fail to write
- Monitoring might break
We need a job that deletes old files (e.g., backups older than 7 days).
Example Script: cleanup_old_backups.sh
#!/bin/bash
# Delete MySQL backups older than 7 days
find /backups/mysql/ -type f -name "*.gz" -mtime +7 -delete
df -h / | awk 'NR==2 { print "Disk usage: "$5 }'
Explanation:
- find searches for .gz files older than 7 days and deletes them.
- df -h / shows root partition usage.
- awk prints just the used disk %.
Schedule via cron:
0 2 * * * /usr/local/bin/cleanup_old_backups.sh >> /var/log/cleanup.log 2>&1
This runs every day at 2:00 AM.
Detect and Restart Hung Services
Some services don’t crash — they just hang silently. For example:
- mysqld_exporter is running but doesn’t respond on port 9104.
- Your monitoring system receives no metrics.
So you need a health check script that verifies functionality, not just process status.
Example script: check_exporter_health.sh
#!/bin/bash
# Check if mysqld_exporter is exposing metrics
curl -s http://localhost:9104/metrics | grep mysql_up > /dev/null
if [ $? -ne 0 ]; then
systemctl restart mysqld_exporter
logger "mysqld_exporter restarted due to missing metrics"
fi
Explanation:
- curl fetches exporter metrics.
- grep mysql_up checks if it contains a known metric.
- If not found ($? -ne 0), the script restarts the exporter.
Schedule it every minute:
* * * * * /usr/local/bin/check_exporter_health.sh
Or even better, use systemd.timer for more control.
Restart on High Memory Usage / OOM
Sometimes a service doesn’t crash — it just consumes too much memory.
Examples:
- MongoDB
- Java apps
- Backup daemons
This can make the server unusable, or cause other services to crash due to memory starvation.
Example script: memory_watch.sh
#!/bin/bash
# Restart Mongo if memory usage > 90%
usage=$(free | awk '/Mem:/ { printf("%.0f", $3/$2 * 100) }')
if [ "$usage" -gt 90 ]; then
systemctl restart mongod
logger "MongoDB restarted due to high memory usage ($usage%)"
fi
Explanation:
- free gets memory stats.
- $3/$2 = used / total memory in percent.
- If over 90%, it restarts MongoDB.
Schedule with cron every 5 mins:
5 * * * * /usr/local/bin/memory_watch.sh
Automation without visibility is chaos. Every time a service restarts, you should:
- Notify your team (Slack, webhook)
- Record the action in logs or a dashboard
Slack Notification via Webhook:
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"🔧 Restarted mysqld_exporter on db01 due to no metrics"}' \
https://hooks.slack.com/services/XXXXX/XXXXX/XXXXX
To use:
- Go to your Slack → Apps → Incoming Webhooks.
- Create a webhook and paste the URL in place of XXXXX.
Include this line in any of your scripts after restarting the service to notify the team.
When Not to Use Self-Healing
Self-healing is powerful but should not:
- Mask recurring failures (alert if a service restarts >3 times/hour).
- Be used on data-sensitive failures (like DB corruption).
- Replace observability — always log and alert.
Summary
Self-healing is a key SRE pattern that reduces toil and improves uptime—especially for known failure modes.
- Use systemd for crash recovery
- Write lightweight health scripts for silent failure detection
- Automate cleanup tasks before they become incidents
- Notify your team or system with every auto-recovery action
You don’t need Kubernetes, serverless, or AI to build resilient infrastructure. Sometimes, a bash script and a cron job is all it takes.