How to Build Self-Healing Linux Systems with systemd and Bash

Introduction:

Downtime is inevitable. As SREs, our goal is to keep it short and to make recovery automated and low-pain. That’s where self-healing infrastructure comes in.

In this post, I’ll show how to build basic self-healing mechanisms using tools you already have: systemd, cron, and bash scripts. These aren’t theoretical tricks; they’re quick wins that reduce pager fatigue and can shorten MTTR for known failure modes.

What You’ll Need

systemd (default in most modern Linux distros)
cron or systemd timers
Some bash scripting
Optional: Slack/Webhook for alerts

Let’s break it down with real-world examples.

Auto-Restart Critical Services with systemd

systemd has built-in healing via the Restart= directive; the unit file in this section uses it to auto-restart mysqld_exporter if it crashes. Services like mysqld_exporter (used for Prometheus monitoring) can crash unexpectedly due to:

Memory issues
Port binding errors
Unexpected input or dependencies

When a service crashes and nothing restarts it, monitoring breaks and alerts silently stop working. systemd solves this.

sudo nano /etc/systemd/system/mysqld_exporter.service

[Unit]
Description=Prometheus MySQL Exporter
After=network.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/etc/mysql/my.cnf
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

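Use Restart=always for very critical daemons: systemd restarts them however they stop, including after a clean exit (but not after an explicit systemctl stop).

Use Restart=on-failure if the service might legitimately exit cleanly in some cases: systemd restarts it only after a non-zero exit code, a fatal signal, or a timeout.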

Then enable and start:

sudo systemctl daemon-reload
sudo systemctl enable mysqld_exporter
sudo systemctl restart mysqld_exporter

Clean Up Disk Space Automatically

Backups and logs eat up disk space. If the disk fills up:

Services stop
Databases fail to write
Monitoring might break

We need a job that deletes old files (e.g., backups older than 7 days).

Example Script: cleanup_old_backups.sh

#!/bin/bash
# Delete MySQL backups older than 7 days

find /backups/mysql/ -type f -name "*.gz" -mtime +7 -delete
df -h / | awk 'NR==2 { print "Disk usage: "$5 }'

Explanation:

find searches for .gz files older than 7 days and deletes them.
df -h / shows root partition usage.
awk prints just the used disk %.

Schedule via cron:

0 2 * * * /usr/local/bin/cleanup_old_backups.sh >> /var/log/cleanup.log 2>&1

This runs every day at 2:00 AM.
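
Before letting a cron job delete files unattended, it can help to preview what it would remove. The following is a minimal sketch of the same find-based cleanup wrapped in a function with a dry-run mode. The function name prune_backups and its arguments are illustrative assumptions, not part of the original script.

```shell
#!/bin/bash
# Sketch: prune old .gz backups, with an optional dry run.
# prune_backups and its arguments are hypothetical names for illustration.

prune_backups() {
  local dir="$1" days="$2" mode="${3:-delete}"

  # Refuse to run against a missing directory instead of failing silently
  if [ ! -d "$dir" ]; then
    echo "backup directory not found: $dir" >&2
    return 1
  fi

  if [ "$mode" = "dry-run" ]; then
    # Only list the files that would be removed
    find "$dir" -type f -name "*.gz" -mtime +"$days" -print
  else
    # Print each file as it is deleted so the cron log records it
    find "$dir" -type f -name "*.gz" -mtime +"$days" -print -delete
  fi
}
```

Calling prune_backups /backups/mysql 7 dry-run lists the candidates; dropping the last argument deletes them.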

Detect and Restart Hung Services

Some services don’t crash — they just hang silently. For example:

mysqld_exporter is running but doesn’t respond on port 9104.
Your monitoring system receives no metrics.

So you need a health check script that verifies functionality, not just process status.

Example script: check_exporter_health.sh

#!/bin/bash
# Check if mysqld_exporter is exposing metrics

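# --max-time stops the check itself from hanging on an unresponsive exporter;
# -f makes curl fail on HTTP errors instead of printing the error page
if ! curl -sf --max-time 5 http://localhost:9104/metrics | grep -q mysql_up; then
  systemctl restart mysqld_exporter
  logger "mysqld_exporter restarted due to missing metrics"
fi

Explanation:

curl fetches exporter metrics, giving up after 5 seconds so a hung exporter cannot hang the check.
grep -q mysql_up checks if the output contains a known metric.
If the metric is not found (the pipeline exits non-zero), the script restarts the exporter.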
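
A single failed scrape can be a transient blip, so it may be worth retrying the check before restarting anything. The helper below is a minimal sketch of that idea. The name check_with_retries and its argument order are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: run a check command several times before declaring it failed.
# check_with_retries is a hypothetical helper, not a standard tool.

check_with_retries() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Succeed as soon as one attempt passes
    "$@" && return 0
    # Wait between attempts, but not after the last one
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

A health script could wrap its probe in a small function and call, for example, check_with_retries 3 5 probe_exporter, restarting the service only when all three attempts fail (probe_exporter is likewise a hypothetical name).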

Schedule it every minute:

* * * * * /usr/local/bin/check_exporter_health.sh

Or, for more control over scheduling and logging, use a systemd timer instead of cron.
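
As a rough sketch of the timer approach, the function below writes a oneshot service and a matching timer that runs the health check once a minute. The unit names (exporter-health.service and exporter-health.timer) and the optional target-directory argument are assumptions for illustration; adjust them to your own conventions.

```shell
#!/bin/bash
# Sketch: generate a oneshot service plus a timer for the health check.
# Unit names and the directory argument are hypothetical.

write_health_timer_units() {
  local dir="${1:-/etc/systemd/system}"

  cat > "$dir/exporter-health.service" <<'EOF'
[Unit]
Description=mysqld_exporter health check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/check_exporter_health.sh
EOF

  cat > "$dir/exporter-health.timer" <<'EOF'
[Unit]
Description=Run the mysqld_exporter health check every minute

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min
AccuracySec=1s

[Install]
WantedBy=timers.target
EOF
}
```

After running the function as root, sudo systemctl daemon-reload followed by sudo systemctl enable --now exporter-health.timer should activate it. A timer triggers the service with the same base name by default, and each run's output lands in the journal.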

Restart on High Memory Usage / OOM

Sometimes a service doesn’t crash — it just consumes too much memory.

Examples:

MongoDB
Java apps
Backup daemons

This can make the server unusable, or cause other services to crash due to memory starvation.

Example script: memory_watch.sh

#!/bin/bash
# Restart Mongo if memory usage > 90%

usage=$(free | awk '/Mem:/ { printf("%.0f", $3/$2 * 100) }')

if [ "$usage" -gt 90 ]; then
  systemctl restart mongod
  logger "MongoDB restarted due to high memory usage ($usage%)"
fi

Explanation:

free gets system-wide memory stats.
$3/$2 * 100 = used / total memory as a percentage.
If usage is over 90%, the script restarts MongoDB, whichever process is actually consuming the memory.
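
MemAvailable in /proc/meminfo is the kernel's estimate of how much memory new workloads can use without swapping, so it can be a steadier basis for this check than the used column. The following is a minimal sketch of that variant. The function name mem_used_pct and its optional file argument are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: compute memory usage as (MemTotal - MemAvailable) / MemTotal.
# mem_used_pct is a hypothetical helper; the file argument exists for testing.

mem_used_pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      # Fail rather than divide by zero if MemTotal was not found
      if (total > 0) printf "%.0f\n", (total - avail) / total * 100
      else exit 1
    }
  ' "$meminfo"
}
```

In memory_watch.sh, usage=$(mem_used_pct) could stand in for the free pipeline, leaving the rest of the script unchanged.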

Schedule with cron every 5 mins:

*/5 * * * * /usr/local/bin/memory_watch.sh

Automation without visibility is chaos. Every time a service restarts, you should:

Notify your team (Slack, webhook)
Record the action in logs or a dashboard

Slack Notification via Webhook:

curl -X POST -H 'Content-type: application/json' \
--data '{"text":"🔧 Restarted mysqld_exporter on db01 due to no metrics"}' \
https://hooks.slack.com/services/XXXXX/XXXXX/XXXXX

To use:

Go to your Slack → Apps → Incoming Webhooks.
Create a webhook and paste its URL in place of the placeholder URL above.

Include this command in any of your scripts after restarting the service to notify the team.
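
To avoid repeating the curl call in every script, the notification can be wrapped in a small function. This is a sketch under assumptions: the names slack_payload and notify_slack and the SLACK_WEBHOOK_URL environment variable are hypothetical, and the escaping only covers backslashes and double quotes, not newlines or other control characters.

```shell
#!/bin/bash
# Sketch: reusable Slack notifier for the healing scripts.
# slack_payload, notify_slack and SLACK_WEBHOOK_URL are hypothetical names.

slack_payload() {
  local msg="$1"
  msg=${msg//\\/\\\\}   # escape backslashes first
  msg=${msg//\"/\\\"}   # then escape double quotes
  printf '{"text":"%s"}' "$msg"
}

notify_slack() {
  # Skip quietly when no webhook is configured
  [ -n "${SLACK_WEBHOOK_URL:-}" ] || return 0
  curl -s --max-time 10 -X POST -H 'Content-type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL" > /dev/null
}
```

A script would then call notify_slack "Restarted mysqld_exporter on db01 due to no metrics" right after the restart. Keeping the webhook URL in an environment variable also keeps it out of the script itself.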

When Not to Use Self-Healing

Self-healing is powerful but should not:

Mask recurring failures (alert if a service restarts >3 times/hour).
Be used on data-sensitive failures (like DB corruption).
Replace observability — always log and alert.
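
The first rule above (alert if a service restarts more than 3 times per hour) can be sketched as a small restart budget kept in a state file. The function name record_restart, the state-file location, and the default thresholds are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: track restarts in a rolling window and flag a flapping service.
# record_restart and its state file are hypothetical.

record_restart() {
  local state="$1" max="${2:-3}" window="${3:-3600}"
  local now cutoff count
  now=$(date +%s)
  cutoff=$((now - window))

  touch "$state"
  # Keep only timestamps inside the window, then append this restart
  awk -v c="$cutoff" '$1 >= c' "$state" > "$state.tmp"
  echo "$now" >> "$state.tmp"
  mv "$state.tmp" "$state"

  count=$(wc -l < "$state")
  # Succeed while within budget; fail once the budget is exceeded
  [ "$count" -le "$max" ]
}
```

A healing script could call record_restart /var/tmp/mysqld_exporter.restarts before restarting, and when it returns non-zero, skip the restart and page a human instead.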

Summary

Self-healing is a key SRE pattern that reduces toil and improves uptime—especially for known failure modes.

Use systemd for crash recovery
Write lightweight health scripts for silent failure detection
Automate cleanup tasks before they become incidents
Notify your team or system with every auto-recovery action

You don’t need Kubernetes, serverless, or AI to build resilient infrastructure. Sometimes, a bash script and a cron job are all it takes.
