Introduction:
Modern infrastructure is built on containers and Kubernetes, but databases remain the hardest components to operate reliably. As Site Reliability Engineers (SREs), our responsibility is not just to deploy MySQL but to ensure it remains available, consistent, observable, and recoverable under failure.
Running MySQL on Docker and Kubernetes is no longer a theoretical exercise; it is happening in production at startups and enterprises alike. However, MySQL is a stateful system, while Kubernetes was originally designed around stateless workloads. This mismatch introduces operational risks that SREs must understand deeply.
This blog focuses on how MySQL behaves when deployed using Docker and Kubernetes, what architectural patterns work, what fails in production, and what SREs must do to keep MySQL reliable in containerized environments.
Why MySQL Is Different from Typical Container Workloads:
Containers are designed to be:
- Ephemeral
- Easily replaceable
- Horizontally scalable
MySQL, on the other hand, expects:
- Stable disk storage
- Predictable startup and shutdown
- Low-latency I/O
- Strong data consistency
From an SRE perspective, MySQL inside containers must be treated as stateful infrastructure, not just another microservice.
MySQL in Docker: Understanding the Basics
Running MySQL in Docker is often the first step before Kubernetes.
Basic MySQL Docker Run
```bash
docker run -d \
  --name mysql \
  -e MYSQL_ROOT_PASSWORD=secret \
  -v mysql_data:/var/lib/mysql \
  mysql:8.0
```
What an SRE Learns from This Setup
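- Containers can restart, but data must persist
- Volumes are mandatory
- Logs go to STDOUT/STDERR (visible via docker logs) unless MySQL is configured to write log files
- Resource isolation depends entirely on the limits you pass to Docker (for example --memory and --cpus)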
SRE Rule:
Never run MySQL in Docker without persistent volumes.
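The same setup written as a Docker Compose file makes these rules explicit. This is a minimal sketch, assuming a Compose version that supports mem_limit and stop_grace_period; the 4g limit, 60s grace period, and health check timings are hypothetical values, and the root password is for demonstration only.

```yaml
services:
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: secret        # demo only; never hard-code real credentials
    volumes:
      - mysql_data:/var/lib/mysql        # named volume: data survives container removal
    mem_limit: 4g                        # hypothetical hard memory cap
    stop_grace_period: 60s               # time between SIGTERM and SIGKILL on stop
    healthcheck:
      # Exits 0 whenever the server answers over TCP, even with "Access denied"
      test: ["CMD", "mysqladmin", "ping", "-h", "127.0.0.1"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 60s                  # grace period for first-time init and crash recovery

volumes:
  mysql_data:
```

The grace period matters because Docker sends SIGKILL 10 seconds after SIGTERM by default, which can cut a clean InnoDB shutdown short and force crash recovery on the next start.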
Container Resource Limits and MySQL
Docker and Kubernetes enforce CPU and memory limits. MySQL is unaware of these limits unless explicitly tuned.
Common Failure Pattern
- The container hits its memory limit
- The kernel OOM killer terminates mysqld
- Kubernetes restarts the container (the pod's restart count increases)
- InnoDB crash recovery lengthens startup time
SRE Mitigation
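- Set explicit memory limits
- Tune innodb_buffer_pool_size to fit inside the limit
- Avoid overcommitting memory
A common starting point is an innodb_buffer_pool_size of roughly 70% of the container memory limit, leaving the rest for per-connection buffers, temporary tables, and other mysqld overhead. This is a sizing guideline, not a literal setting: the option takes an absolute size such as 7G, not a percentage.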
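One way to wire this rule of thumb into Kubernetes is a ConfigMap holding a MySQL config file, plus matching container limits. This is a sketch, assuming a 10Gi memory limit (so 70% is 7G), a hypothetical ConfigMap named mysql-config, and that the image reads extra config files from /etc/mysql/conf.d, as the official mysql image does. The second document is a pod spec fragment, not a standalone manifest.

```yaml
# Hypothetical ConfigMap; the file name buffer-pool.cnf is arbitrary.
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-config
data:
  buffer-pool.cnf: |
    [mysqld]
    # 70% of the 10Gi container memory limit below
    innodb_buffer_pool_size=7G
---
# Pod spec fragment (spec.template.spec of the StatefulSet)
containers:
  - name: mysql
    image: mysql:8.0
    resources:
      requests:
        memory: 10Gi     # request == limit: the scheduler reserves the full amount
      limits:
        memory: 10Gi     # exceeding this triggers the kernel OOM killer
    volumeMounts:
      - name: mysql-config
        mountPath: /etc/mysql/conf.d
volumes:
  - name: mysql-config
    configMap:
      name: mysql-config
```

Keeping the memory request equal to the limit avoids placing MySQL on a node that only has room for part of its working set.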
Why Kubernetes Makes MySQL More Challenging
Kubernetes provides:
- Auto-restarts
- Self-healing
- Scheduling flexibility
But MySQL requires:
- Stable network identity
- Ordered startup
- Safe shutdown
- Disk affinity
This is why Deployments are unsuitable and StatefulSets are required.
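Each of these requirements maps to a StatefulSet field. The fragment below is illustrative, not a complete manifest; the 120-second grace period is a hypothetical value that should be sized against how long a clean shutdown actually takes on your data set.

```yaml
# StatefulSet fragment: where each MySQL requirement is expressed
spec:
  serviceName: mysql                  # stable network identity (mysql-0.mysql, ...) via a headless Service
  podManagementPolicy: OrderedReady   # ordered startup: mysql-1 starts only after mysql-0 is Ready (the default)
  template:
    spec:
      terminationGracePeriodSeconds: 120  # safe shutdown: time before SIGKILL (Kubernetes default is 30s)
  # Disk affinity: spec.volumeClaimTemplates gives each pod its own PVC,
  # and a rescheduled pod is bound to that same PVC again.
```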
MySQL on Kubernetes Using StatefulSets
Why StatefulSet Is Mandatory
| Feature | Deployment | StatefulSet |
|---|---|---|
| Pod identity | Random | Stable |
| Storage | Shared | Dedicated PVC |
| Startup order | Parallel | Ordered |
| DNS name | Unstable | Predictable |
StatefulSets allow MySQL pods to maintain identity across restarts.
Minimal MySQL StatefulSet Example
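```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql          # must match a headless Service (clusterIP: None)
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          env:
            - name: MYSQL_ROOT_PASSWORD
              value: secret   # demo only; use a Secret (valueFrom.secretKeyRef) in production
          ports:
            - name: mysql
              containerPort: 3306
          volumeMounts:
            - name: mysql-data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:       # one PVC per pod: mysql-data-mysql-0, mysql-data-mysql-1, ...
    - metadata:
        name: mysql-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```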
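serviceName: mysql only yields stable DNS names if a headless Service with that name exists; the StatefulSet manifest does not create one. A minimal sketch of that Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql          # must equal the StatefulSet's serviceName
  labels:
    app: mysql
spec:
  clusterIP: None      # headless: DNS returns pod IPs directly, no load balancing
  selector:
    app: mysql
  ports:
    - name: mysql
      port: 3306
```

With this Service in place, the first pod is reachable as mysql-0.mysql within the namespace (or mysql-0.mysql.<namespace>.svc.cluster.local, assuming the default cluster domain).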
Storage: The Most Critical Component
Most MySQL incidents on Kubernetes are storage-related, not CPU-related.
Storage Options Evaluation
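| Storage Type | SRE Verdict | Why |
|---|---|---|
| NFS | Unsafe | File locking and fsync behavior vary by NFS setup; misconfiguration risks InnoDB corruption |
| HostPath | Node-dependent | Data lives on one node, so the pod cannot safely move elsewhere |
| Cloud Block Storage | Recommended | The volume detaches and reattaches to a replacement node |
| Local SSD | Risky | Fast, but data is stranded or lost if the node fails |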
SRE Rule:
If storage latency spikes, MySQL performance collapses.
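For the recommended cloud block storage option, a StorageClass can encode safe defaults. This sketch assumes AWS with the EBS CSI driver (ebs.csi.aws.com) and gp3 volumes; the class name mysql-block is hypothetical, and other clouds use different provisioners and parameters.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mysql-block                      # hypothetical name
provisioner: ebs.csi.aws.com             # assumption: AWS EBS CSI driver is installed
parameters:
  type: gp3
reclaimPolicy: Retain                    # keep the volume if the PVC is deleted
allowVolumeExpansion: true               # allow PVCs to be resized in place
volumeBindingMode: WaitForFirstConsumer  # provision only once the pod is scheduled
```

Reference it from the StatefulSet's volumeClaimTemplates with storageClassName: mysql-block. WaitForFirstConsumer delays provisioning until the pod has a node, so the volume is created in the same availability zone as the pod.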
Readiness and Liveness Probes
Readiness Probe (Safe)
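```yaml
readinessProbe:
  exec:
    # Run a real query over TCP. "mysqladmin ping" exits 0 even on "Access denied",
    # and over the local socket it can pass while the image's init scripts run a
    # temporary server with networking disabled.
    command: ["mysql", "-h", "127.0.0.1", "-uroot", "-psecret", "-e", "SELECT 1"]
  initialDelaySeconds: 30
  periodSeconds: 10
```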
A failing readiness probe removes the pod from Service endpoints without restarting it, so traffic reaches MySQL only once it actually accepts queries.
Liveness Probe (Dangerous if Misused)
Aggressive liveness probes can:
- Kill MySQL in the middle of crash recovery, forcing recovery to start over (a restart loop)
- Increase the risk of data corruption
- Increase downtime
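A more forgiving configuration is sketched below as a container fragment, assuming Kubernetes 1.20 or later, where startupProbe is generally available. The startupProbe holds off liveness checks while crash recovery runs; all thresholds are hypothetical and should be sized from measured recovery times. Because mysqladmin ping exits 0 whenever the server answers (even with "Access denied"), it needs no credentials here.

```yaml
# Container fragment (spec.template.spec.containers[0])
startupProbe:
  exec:
    command: ["mysqladmin", "ping", "-h", "127.0.0.1"]
  periodSeconds: 10
  failureThreshold: 30     # up to 30 x 10s = 300s for startup and crash recovery
livenessProbe:
  exec:
    command: ["mysqladmin", "ping", "-h", "127.0.0.1"]
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 9      # roughly 3 minutes of consecutive failures before a restart
```

Liveness checks only start after the startup probe succeeds, so a slow recovery is not mistaken for a hung server.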
MySQL Replication in Kubernetes
MySQL replication is not Kubernetes-aware.
Typical setup:
- One primary pod
- Multiple replica pods
- Manual or external failover logic
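One thing replication needs that Kubernetes does not provide is a unique server_id per member. A common pattern, used in the Kubernetes "Run a Replicated Stateful Application" tutorial, derives it from the StatefulSet pod ordinal in an init container. The sketch below only sets server_id; configuring replication itself (source host, credentials, GTIDs) and failover remain separate work. The volume name conf and the offset of 100 are illustrative choices.

```yaml
# Pod spec fragment (spec.template.spec of the StatefulSet)
initContainers:
  - name: init-server-id
    image: mysql:8.0
    command:
      - bash
      - "-c"
      - |
        set -ex
        # Pod names are <statefulset>-<ordinal>, e.g. mysql-0, mysql-1
        [[ $HOSTNAME =~ -([0-9]+)$ ]] || exit 1
        ordinal=${BASH_REMATCH[1]}
        # Offset by 100 so server_id is never 0
        echo "[mysqld]" > /mnt/conf.d/server-id.cnf
        echo "server-id=$((100 + ordinal))" >> /mnt/conf.d/server-id.cnf
    volumeMounts:
      - name: conf
        mountPath: /mnt/conf.d
containers:
  - name: mysql
    image: mysql:8.0
    volumeMounts:
      - name: conf
        mountPath: /etc/mysql/conf.d   # mysqld picks up server-id.cnf from here
volumes:
  - name: conf
    emptyDir: {}
```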
Challenges
| Issue | Impact |
|---|---|
| Pod restart | Replica breakage |
| Primary crash | Manual promotion |
| Split brain | Data inconsistency |
Kubernetes does not manage MySQL leadership.
Operators: Production-Grade MySQL on Kubernetes
Operators automate:
- Replication setup
- Leader election
- Backup and restore
- Scaling
Popular operators:
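- Percona Operator for MySQL (based on Percona XtraDB Cluster)
- MySQL Operator for Kubernetes (Oracle, manages InnoDB Cluster)
- Vitess (large-scale, sharded deployments, via the Vitess Operator)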
SRE Recommendation:
Do not run self-managed MySQL on Kubernetes in production without an Operator.
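To illustrate how little the user declares once an Operator takes over, this is roughly what a three-member cluster looks like with Oracle's MySQL Operator for Kubernetes, which then handles Group Replication, primary election, and MySQL Router. The field names follow that operator's published examples as best recalled here and should be checked against the version you install; mypwds, mycluster, and the password are placeholders.

```yaml
# Credentials the operator uses to create the root account (placeholder values)
apiVersion: v1
kind: Secret
metadata:
  name: mypwds
stringData:
  rootUser: root
  rootHost: "%"
  rootPassword: change-me
---
apiVersion: mysql.oracle.com/v2
kind: InnoDBCluster
metadata:
  name: mycluster
spec:
  secretName: mypwds
  tlsUseSelfSigned: true   # convenient for testing; supply real certificates in production
  instances: 3             # MySQL servers in the InnoDB Cluster
  router:
    instances: 1           # MySQL Router pods that route clients to the current primary
```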
Backups and Disaster Recovery
Backup Strategies
- Logical backups (mysqldump)
- Physical backups
- Volume snapshots
SRE Reality
- Backups without restore tests are useless
- Snapshot restore time matters more than backup speed
- Backups must be stored outside the cluster so they survive a cluster-wide failure
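A scheduled logical backup is the simplest of these strategies to automate. The sketch below is a nightly mysqldump CronJob; the names mysql-logical-backup, mysql-secret (with key root-password), and mysql-backup are hypothetical, and mysql-0.mysql assumes a headless Service named mysql in the same namespace. The PVC is only a staging area: copying the dump out of the cluster, and regularly test-restoring it, is still required.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-logical-backup
spec:
  schedule: "0 2 * * *"        # daily at 02:00
  concurrencyPolicy: Forbid    # skip a run if the previous dump is still going
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: mysqldump
              image: mysql:8.0
              env:
                - name: MYSQL_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mysql-secret
                      key: root-password
              command: ["bash", "-c"]
              args:
                # --single-transaction takes a consistent InnoDB snapshot
                # without locking tables for the duration of the dump.
                - >-
                  mysqldump -h mysql-0.mysql -uroot -p"${MYSQL_ROOT_PASSWORD}"
                  --single-transaction --routines --events --all-databases
                  > "/backup/all-databases-$(date +%F).sql"
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: mysql-backup
```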
Monitoring MySQL in Kubernetes
Key Metrics
- Replication lag
- Disk I/O latency
- Pod restarts
- Query latency
- Buffer pool usage
Common Tools
- mysqld_exporter
- Prometheus
- Grafana
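Several of the key metrics translate directly into alerting rules. This sketch uses the plain Prometheus rules-file format and assumes mysqld_exporter (replica status collector), kube-state-metrics, and node_exporter are being scraped; metric names can vary by exporter version, and every threshold is hypothetical.

```yaml
groups:
  - name: mysql-sre
    rules:
      - alert: MySQLReplicationLagHigh
        # mysqld_exporter: Seconds_Behind_Master from the replica status output
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is more than 30s behind its source"
      - alert: MySQLContainerRestarted
        # kube-state-metrics: any restart of the mysql container in 15 minutes
        expr: increase(kube_pod_container_status_restarts_total{container="mysql"}[15m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "MySQL container in {{ $labels.pod }} restarted"
      - alert: DiskReadLatencyHigh
        # node_exporter: average seconds per completed read over 5 minutes.
        # Add a device="..." matcher for the disk that backs the MySQL PVC.
        expr: rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) > 0.02
        for: 10m
        labels:
          severity: warning
```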
SRE Insight:
Pod restarts usually point to platform issues (OOM kills, probe misconfiguration, node or storage problems) rather than query-level tuning problems.
Failure Scenarios Every SRE Should Test
Scenario 1: Pod Restart
- Crash recovery duration
- Readiness delay behavior
Scenario 2: Node Failure
- PVC reattachment time
- Data integrity verification
Scenario 3: Disk Latency Spike
- Query timeout behavior
- Thread pile-ups
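Scenarios 1 and 3 can be rehearsed with a chaos tool rather than by hand. The sketch below uses Chaos Mesh, which is only one option and is not part of the setup described above; it assumes Chaos Mesh is installed and that these objects are created in the namespace where the app: mysql pods run. The delay, percentage, and duration are hypothetical. Scenario 2 (node failure) is not covered and needs a real node outage, such as stopping the underlying VM.

```yaml
# Scenario 1: kill one MySQL pod, then measure crash recovery and readiness delay
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: mysql-pod-kill
spec:
  action: pod-kill
  mode: one                  # pick one matching pod
  selector:
    labelSelectors:
      app: mysql
---
# Scenario 3: add 100ms latency to half of the file I/O under the data directory
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: mysql-io-latency
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app: mysql
  volumePath: /var/lib/mysql
  path: '/var/lib/mysql/**/*'
  delay: 100ms
  percent: 50
  duration: 5m
```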
When MySQL Should NOT Run on Kubernetes
Avoid Kubernetes MySQL if:
- Ultra-low latency is required
- The dataset is so large that restores, replica rebuilds, or volume moves exceed your recovery time objectives
- Team lacks DB SRE maturity
Sometimes the best solution is:
Managed MySQL > Kubernetes MySQL
Conclusion
Running MySQL on Docker and Kubernetes is not impossible, but it is operationally expensive.
For SREs, success depends on:
- Storage reliability
- Safe restarts
- Tested backups
- Controlled failover
Kubernetes does not magically solve database reliability. It only amplifies mistakes.
Key Rule:
Treat MySQL as critical state, not just another container.