MySQL on Kubernetes: An SRE’s Practical Guide

Introduction:

Modern infrastructure is built on containers and Kubernetes, but databases remain the hardest components to operate reliably. As Site Reliability Engineers (SREs), our responsibility is not just to deploy MySQL, but to ensure it remains available, consistent, observable, and recoverable under failure.

Running MySQL on Docker and Kubernetes is no longer a theoretical exercise; it is happening in production at startups and enterprises alike. However, MySQL is a stateful system, while Kubernetes was originally designed around stateless workloads. This mismatch introduces operational risks that SREs must understand deeply.

This post focuses on how MySQL behaves when deployed using Docker and Kubernetes, which architectural patterns work, what fails in production, and what SREs must do to keep MySQL reliable in containerized environments.

Why MySQL Is Different from Typical Container Workloads:

Containers are designed to be:

Ephemeral
Easily replaceable
Horizontally scalable

MySQL, on the other hand, expects:

Stable disk storage
Predictable startup and shutdown
Low-latency I/O
Strong data consistency

From an SRE perspective, MySQL inside containers must be treated as stateful infrastructure, not just another microservice.

MySQL in Docker: Understanding the Basics

Running MySQL in Docker is often the first step before Kubernetes.

Basic MySQL Docker Run

docker run -d \
  --name mysql \
  -e MYSQL_ROOT_PASSWORD=secret \
  -v mysql_data:/var/lib/mysql \
  mysql:8.0

What an SRE Learns from This Setup

Containers can restart, but data must persist
Volumes are mandatory
The error log goes to STDOUT/STDERR (read it with docker logs); other logs need explicit configuration
Resource isolation is controlled by Docker limits

SRE Rule:

Never run MySQL in Docker without persistent volumes.

Container Resource Limits and MySQL

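Docker and Kubernetes enforce CPU and memory limits. MySQL does not size itself to these limits; it allocates memory according to its own configuration (buffer pool, per-connection buffers), so that configuration must be tuned explicitly to fit inside the limit.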

Common Failure Pattern

Pod hits memory limit
Kernel OOM kills MySQL
Kubernetes restarts the pod
Crash recovery increases startup time

SRE Mitigation

Set explicit memory limits
Tune innodb_buffer_pool_size
Avoid overcommitting memory

innodb_buffer_pool_size: about 70% of the container memory limit (roughly 2.8G for a 4Gi limit; MySQL expects an absolute size, not a percentage)
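The 70% rule can be turned into a concrete setting. The following is a minimal sketch that assumes the 70% ratio from the rule of thumb above and the default 128M `innodb_buffer_pool_chunk_size`. MySQL requires the buffer pool to be a multiple of `innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances` and rounds other values up. The sketch therefore rounds down, so the result never exceeds the intended budget. Pass the instance count your server actually uses.

```python
MIB = 1024 * 1024

def buffer_pool_size_bytes(container_limit_bytes: int,
                           ratio: float = 0.70,
                           chunk_size_bytes: int = 128 * MIB,
                           instances: int = 1) -> int:
    """Largest valid buffer pool size at or below ratio * container limit.

    MySQL rounds a non-multiple of chunk_size * instances UP, which could
    push memory use past the budget, so we round DOWN here instead.
    """
    unit = chunk_size_bytes * instances
    target = int(container_limit_bytes * ratio)
    size = (target // unit) * unit
    if size == 0:
        raise ValueError("memory limit too small for a single buffer pool unit")
    return size

limit = 4 * 1024 * MIB  # a 4Gi container memory limit
print(f"innodb_buffer_pool_size={buffer_pool_size_bytes(limit) // MIB}M")
# innodb_buffer_pool_size=2816M
```

The remaining ~30% is left for per-connection buffers, temporary tables and the server's own overhead. That headroom is what keeps the kernel OOM killer away.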

Why Kubernetes Makes MySQL More Challenging

Kubernetes provides:

Auto-restarts
Self-healing
Scheduling flexibility

But MySQL requires:

Stable network identity
Ordered startup
Safe shutdown
Disk affinity

This is why Deployments are unsuitable and StatefulSets are required.

MySQL on Kubernetes Using StatefulSets

Why StatefulSet Is Mandatory

Feature	Deployment	StatefulSet
Pod identity	Random	Stable
Storage	Shared	Dedicated PVC
Startup order	Parallel	Ordered
DNS name	Unstable	Predictable

StatefulSets allow MySQL pods to maintain identity across restarts.

Minimal MySQL StatefulSet Example

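apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  clusterIP: None          # headless: gives each pod a stable DNS name (mysql-0.mysql)
  selector:
    app: mysql
  ports:
  - name: mysql
    port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql       # must match the headless Service above
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:          # keep the password out of the manifest
              name: mysql-secret   # assumes this Secret already exists
              key: root-password
        ports:
        - name: mysql
          containerPort: 3306
        resources:                 # explicit limit, so buffer pool sizing has a target
          requests:
            memory: 4Gi
          limits:
            memory: 4Gi
        volumeMounts:
        - name: mysql-data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi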

Storage: The Most Critical Component

Most MySQL incidents on Kubernetes are storage-related, not CPU-related.

Storage Options Evaluation

Storage Type	SRE Verdict
NFS	Unsafe
HostPath	Node-dependent
Cloud Block Storage	Recommended
Local SSD	Risky

SRE Rule:

If storage latency spikes, MySQL performance collapses.
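With the default `innodb_flush_log_at_trx_commit=1`, InnoDB flushes the redo log on every commit. Fsync latency therefore feeds directly into commit latency. The following is a rough sketch for sampling fsync latency on a volume. It is not a substitute for a proper benchmarking tool. It assumes you run it from a pod that mounts a scratch PVC from the same StorageClass, not against the data directory of a busy server.

```python
import os
import statistics
import tempfile
import time
from typing import List

def fsync_latencies_ms(directory: str, samples: int = 50, block: int = 4096) -> List[float]:
    """Append a small block and fsync it `samples` times; return each fsync's latency in ms."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        payload = b"\0" * block
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # time only the flush to stable storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies

if __name__ == "__main__":
    # Point this at the mounted volume under test; the temp dir is just a default.
    lat = fsync_latencies_ms(tempfile.gettempdir())
    print(f"median={statistics.median(lat):.2f} ms  max={max(lat):.2f} ms")
```

Watch the spread between median and max rather than the median alone. Occasional multi-hundred-millisecond outliers are what stall commits in production.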

Readiness and Liveness Probes

Readiness Probe (Safe)

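readinessProbe:
  exec:
    command:
    - sh
    - -c
    # A real query, not "mysqladmin ping": ping exits 0 even on "Access denied",
    # so it only proves mysqld is running, not that it can serve queries.
    - mysql -h 127.0.0.1 -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SELECT 1"
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5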

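Readiness ensures traffic is only sent when MySQL can actually answer queries. A failing readiness probe removes the pod from the Service endpoints but never restarts it, which is why it is the safe probe.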

Liveness Probe (Dangerous if Misused)

Aggressive liveness probes can:

Kill MySQL during crash recovery
Trap the pod in a restart loop where crash recovery never completes
Increase downtime
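A common way to keep liveness checks from interrupting crash recovery is a `startupProbe`. Kubernetes disables liveness and readiness checks until the startup probe succeeds, and gives up after roughly `failureThreshold * periodSeconds`. The following sketch builds both probe stanzas as a JSON container fragment, a format kubectl also accepts. The 600-second recovery budget and the liveness thresholds are hypothetical. Derive yours from measured recovery times (see Scenario 1).

```python
import json
import math

def mysqladmin_ping() -> dict:
    # "mysqladmin ping" exits 0 whenever mysqld answers, even with
    # "Access denied", so this check needs no password.
    return {"exec": {"command": ["mysqladmin", "ping", "-h", "127.0.0.1"]}}

def startup_probe(recovery_budget_s: int, period_s: int = 10) -> dict:
    """Tolerate up to ~recovery_budget_s of crash recovery before giving up."""
    probe = mysqladmin_ping()
    probe.update(periodSeconds=period_s,
                 failureThreshold=math.ceil(recovery_budget_s / period_s))
    return probe

def liveness_probe() -> dict:
    """Conservative: restart only after ~2 minutes of consecutive failures."""
    probe = mysqladmin_ping()
    probe.update(periodSeconds=20, timeoutSeconds=5, failureThreshold=6)
    return probe

container_fragment = {
    "startupProbe": startup_probe(recovery_budget_s=600),  # hypothetical 10-minute budget
    "livenessProbe": liveness_probe(),
}
print(json.dumps(container_fragment, indent=2))
```

The design choice is to let liveness detect a hung process, not a slow one. Keeping the liveness window wide avoids killing a server that is merely busy.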

MySQL Replication in Kubernetes

MySQL replication is not Kubernetes-aware.

Typical setup:

One primary pod
Multiple replica pods
Manual or external failover logic

Challenges

Issue	Impact
Pod restart	Replica breakage
Primary crash	Manual promotion
Split brain	Data inconsistency

Kubernetes does not manage MySQL leadership.
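Because nothing in Kubernetes enforces a single primary, a cheap safeguard is to check that exactly one pod is writable. The following is a minimal sketch. It assumes you have already collected `SELECT @@global.read_only` from every pod (that collection step is not shown). The pod names are hypothetical.

```python
from typing import Dict

def check_writable_primaries(read_only_by_pod: Dict[str, bool]) -> str:
    """Exactly one pod should have read_only=OFF; anything else needs attention."""
    writable = sorted(pod for pod, read_only in read_only_by_pod.items() if not read_only)
    if len(writable) == 1:
        return f"ok: primary is {writable[0]}"
    if not writable:
        return "alert: no writable primary (failover needed)"
    return "alert: split brain, writable pods: " + ", ".join(writable)

print(check_writable_primaries({"mysql-0": False, "mysql-1": True, "mysql-2": True}))
```

This check only detects split brain. Preventing it still requires failover logic that demotes the old primary (for example with `super_read_only`) before promoting a new one.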

Operators: Production-Grade MySQL on Kubernetes

Operators automate:

Replication setup
Leader election
Backup and restore
Scaling

Popular operators:

Percona Operator for MySQL (based on Percona XtraDB Cluster)
MySQL Operator for Kubernetes (Oracle)
Vitess Operator (large-scale, sharded deployments)

SRE Recommendation:

Do not run self-managed MySQL on Kubernetes in production without an Operator.

Backups and Disaster Recovery

Backup Strategies

Logical backups (mysqldump)
Physical backups (e.g. Percona XtraBackup)
Volume snapshots
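For the logical option, a consistent dump of InnoDB tables needs `--single-transaction`. The following is a sketch that builds the `mysqldump` command line. The host name, option-file path and output path are hypothetical. Credentials are assumed to live in a client option file, which keeps them out of the process list. `--defaults-extra-file` must be the first option.

```python
import shlex
from typing import List

def mysqldump_argv(host: str, defaults_file: str, out_file: str) -> List[str]:
    """Build a mysqldump command for a consistent logical backup of all databases."""
    return [
        "mysqldump",
        # Must come first; the file holds user/password so they stay off the command line.
        f"--defaults-extra-file={defaults_file}",
        f"--host={host}",
        "--single-transaction",  # consistent snapshot of InnoDB tables without locking them
        "--routines", "--triggers", "--events",
        "--all-databases",
        f"--result-file={out_file}",
    ]

# Hypothetical names: pod DNS name via a headless Service, paths on a backup volume.
argv = mysqldump_argv("mysql-0.mysql", "/etc/mysql-backup/client.cnf",
                      "/backups/all-databases.sql")
print(shlex.join(argv))
```

In a real job, pass `argv` to `subprocess.run(argv, check=True)` and copy the result file outside the cluster.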

SRE Reality

Backups without restore tests are useless
Snapshot restore time matters more than backup speed
Backups must be isolated from cluster failure
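These three rules can be checked mechanically. The following is a minimal sketch over a hypothetical `BackupRecord` inventory. Where the records come from (backup tool metadata, a restore-test log) is an assumption left open, as are the window and RTO values you pass in.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class BackupRecord:  # hypothetical inventory entry
    taken_at: datetime
    offsite: bool                                  # stored outside the cluster?
    restore_duration: Optional[timedelta] = None   # None = never restore-tested

def backup_violations(records: List[BackupRecord], now: datetime,
                      max_untested: timedelta, rto: timedelta) -> List[str]:
    problems = []
    tested = [r for r in records if r.restore_duration is not None]
    # Rule 1: a recent backup must have actually been restored.
    if not any(now - r.taken_at <= max_untested for r in tested):
        problems.append("no restore test within the allowed window")
    # Rule 2: restore time, not backup speed, must fit the recovery objective.
    for r in tested:
        if r.restore_duration > rto:
            problems.append(f"restore of {r.taken_at:%Y-%m-%d} took {r.restore_duration} (RTO {rto})")
    # Rule 3: the newest backup must survive the loss of the cluster.
    if not records:
        problems.append("no backups found")
    elif not max(records, key=lambda r: r.taken_at).offsite:
        problems.append("latest backup is not isolated from the cluster")
    return problems
```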

Monitoring MySQL in Kubernetes

Key Metrics

Replication lag
Disk I/O latency
Pod restarts
Query latency
Buffer pool usage
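Replication lag is only meaningful when both replication threads are running. The following sketch classifies one row of `SHOW REPLICA STATUS`, using the column names from MySQL 8.0.22 onward. Older servers use `SHOW SLAVE STATUS` with `Slave_*`/`*_Master` names. The 30-second lag threshold is a hypothetical default.

```python
from typing import Mapping, Optional

def replica_health(status: Mapping[str, Optional[str]], max_lag_s: int = 30) -> str:
    """Classify one row of SHOW REPLICA STATUS."""
    for thread in ("Replica_IO_Running", "Replica_SQL_Running"):
        state = status.get(thread)
        if state != "Yes":  # e.g. "No" or "Connecting"
            return f"critical: {thread}={state}"
    lag = status.get("Seconds_Behind_Source")
    if lag is None:  # NULL means lag cannot be computed
        return "critical: Seconds_Behind_Source is NULL"
    if int(lag) > max_lag_s:
        return f"warning: replica is {lag}s behind the source"
    return "ok"

row = {"Replica_IO_Running": "Yes", "Replica_SQL_Running": "Yes",
       "Seconds_Behind_Source": "120"}
print(replica_health(row))  # warning: replica is 120s behind the source
```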

Common Tools

mysqld_exporter
Prometheus
Grafana

SRE Insight:

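Pod restarts point to platform-level causes (OOM kills, probe failures, node problems), not query tuning problems; check the container's last termination reason before touching SQL.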
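The termination reason is recorded in the pod status. The following sketch summarizes it from the JSON that `kubectl get pod <name> -o json` prints, reading `status.containerStatuses[].restartCount` and `lastState.terminated`. The hint texts are this sketch's own suggestions. The sample input is made up for illustration.

```python
from typing import Any, Dict, List

HINTS = {
    "OOMKilled": "memory limit reached; compare innodb_buffer_pool_size with the limit",
}
DEFAULT_HINT = "check 'kubectl logs --previous' and pod events (probe failures, evictions)"

def restart_causes(pod: Dict[str, Any]) -> List[str]:
    """Summarize why each container in a pod last restarted."""
    causes = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        restarts = cs.get("restartCount", 0)
        if restarts == 0:
            continue
        last = cs.get("lastState", {}).get("terminated", {})
        reason = last.get("reason", "unknown")
        causes.append(f"{cs['name']}: {restarts} restarts, last={reason} "
                      f"(exit {last.get('exitCode')}): {HINTS.get(reason, DEFAULT_HINT)}")
    return causes

sample = {"status": {"containerStatuses": [{
    "name": "mysql", "restartCount": 3,
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}}
for line in restart_causes(sample):
    print(line)
```

For a live pod, feed it `json.loads(...)` of the kubectl output instead of the sample dict.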

Failure Scenarios Every SRE Should Test

Scenario 1: Pod Restart

Crash recovery duration
Readiness delay behavior
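Crash recovery duration is easiest to measure from the outside. Delete the pod, then time how long it takes until a real query succeeds again. The following is a minimal polling sketch. The `mysql_answers` helper is hypothetical. It assumes kubectl access, a pod named `mysql-0`, and the `MYSQL_ROOT_PASSWORD` variable inside the container.

```python
import subprocess
import time
from typing import Callable

def seconds_until(check: Callable[[], bool], timeout_s: float = 1800,
                  interval_s: float = 5) -> float:
    """Poll `check` until it returns True; return elapsed seconds, or raise on timeout."""
    start = time.monotonic()
    while True:
        if check():
            return time.monotonic() - start
        if time.monotonic() - start > timeout_s:
            raise TimeoutError(f"not ready after {timeout_s}s")
        time.sleep(interval_s)

def mysql_answers(pod: str = "mysql-0") -> bool:
    """True once SELECT 1 succeeds inside the pod (hypothetical helper)."""
    cmd = ["kubectl", "exec", pod, "--", "sh", "-c",
           'mysql -h 127.0.0.1 -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SELECT 1"']
    return subprocess.run(cmd, capture_output=True).returncode == 0
```

Typical use is `kubectl delete pod mysql-0` followed by `seconds_until(mysql_answers)`. Repeat the test under write load, because recovery time grows with the amount of redo to replay.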

Scenario 2: Node Failure

PVC reattachment time
Data integrity verification
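One simple integrity check runs `CHECKSUM TABLE` on key tables before the node-failure test, with writes paused, and again after the PVC reattaches. The following sketch compares the two result sets. Gathering the results is assumed and not shown. `CHECKSUM TABLE` reads every row, so restrict it to tables where the cost is acceptable.

```python
from typing import Dict, List, Optional

def checksum_mismatches(before: Dict[str, Optional[int]],
                        after: Dict[str, Optional[int]]) -> List[str]:
    """Compare CHECKSUM TABLE results captured before and after a failure test."""
    problems = []
    for table in sorted(before.keys() | after.keys()):
        if table not in after:
            problems.append(f"{table}: missing after test")
        elif table not in before:
            problems.append(f"{table}: unexpected new table")
        elif before[table] != after[table]:
            problems.append(f"{table}: checksum {before[table]} -> {after[table]}")
    return problems

print(checksum_mismatches({"shop.orders": 11, "shop.items": 7},
                          {"shop.orders": 11, "shop.items": 7}))
```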

Scenario 3: Disk Latency Spike

Query timeout behavior
Thread pile-ups
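When I/O stalls, queries wait on disk instead of finishing. `Threads_running` (from `SHOW GLOBAL STATUS`) then climbs and stays high. The following sketch flags a pile-up when the value stays above a threshold for several consecutive samples. The threshold and sample count are hypothetical tuning knobs, not MySQL defaults.

```python
from typing import Iterable

def pile_up_detected(threads_running: Iterable[int], threshold: int,
                     consecutive: int = 3) -> bool:
    """True if Threads_running stays above `threshold` for `consecutive` samples in a row."""
    streak = 0
    for value in threads_running:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False

print(pile_up_detected([5, 40, 45, 50], threshold=32))
```

Requiring consecutive samples filters out short bursts, which are normal, and reports only sustained saturation.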

When MySQL Should NOT Run on Kubernetes

Avoid Kubernetes MySQL if:

Ultra-low latency is required
Database size is so large that restores and volume moves take too long
Team lacks DB SRE maturity

Sometimes the best solution is:

A managed MySQL service (such as Amazon RDS or Google Cloud SQL) instead of MySQL on Kubernetes.

Conclusion

Running MySQL on Docker and Kubernetes is not impossible, but it is operationally expensive.

For SREs, success depends on:

Storage reliability
Safe restarts
Tested backups
Controlled failover

Kubernetes does not magically solve database reliability. It only amplifies mistakes.

Key Rule:

Treat MySQL as critical state, not just another container.


Discover more from Genexdbs