Introduction
In today’s always-on world, downtime is expensive. Whether it’s an e-commerce site losing transactions, a banking app failing to process payments, or a SaaS product missing its uptime SLA, availability is directly tied to trust and revenue. For Site Reliability Engineers (SREs), one of the most critical responsibilities is ensuring that databases — the backbone of any application — remain resilient even in the face of failures.
Two of the most widely deployed databases in production environments are MySQL and MongoDB. Both are powerful, both can scale, and both have mechanisms to support high availability (HA). But here’s the catch: HA doesn’t come out of the box. It requires deliberate architecture decisions, careful failover strategies, and proactive monitoring to minimize downtime and data loss.
In this blog, we’ll explore how SREs can design highly available MySQL and MongoDB deployments, with a focus on failover strategies. We’ll break down how each database handles failovers, what challenges you can expect, and how to align them with your SLOs and SLAs.
What High Availability Means for Databases
Before diving into specifics, let’s align on what high availability actually means.
- Replication vs HA: Replication ensures data exists in multiple places. High availability ensures that if one node fails, another seamlessly takes over with minimal disruption.
- Redundancy: Multiple servers running copies of your data, often spread across availability zones or data centers.
- Failover: The process of switching database traffic from a failed primary node to a healthy replica.
- Recovery Objectives:
- RTO (Recovery Time Objective) = how quickly the database should recover from a failure.
- RPO (Recovery Point Objective) = how much data (if any) you can afford to lose during failover.
When you’re running mission-critical workloads, these definitions aren’t just theory they guide the design of your replication topology, the choice of failover tooling, and the operational runbooks your SRE team maintains.
MySQL Failover Strategies

MySQL has long been the go-to relational database, but by default, it doesn’t provide built-in automatic failover. Instead, SREs rely on replication and external tools to achieve high availability.
Replication Basics:
- One primary node accepts writes.
- One or more replicas replicate data asynchronously or semi-synchronously.
- If the primary fails, a replica can be promoted to primary.
Failover Approaches:
Orchestrator:
- A battle-tested tool for topology management and automated failover.
- Monitors replication health and can promote a replica automatically.
- Works well in complex topologies with multiple replicas.
MHA (Master High Availability):
- Popular in older setups.
- Detects primary failure and promotes a replica.
- Reliable but less feature-rich compared to Orchestrator.
Group Replication / InnoDB Cluster:
- Native HA solution from MySQL.
- Provides built-in consensus and automatic failover.
- Great for greenfield projects but requires modern MySQL versions and careful tuning.
Challenges in MySQL HA:
- Application connection strings often point to a specific node, requiring DNS or proxy layers (e.g., ProxySQL or HAProxy).
- Failover can take 10–30 seconds depending on detection and promotion.
- Semi-sync replication reduces data loss but adds latency.
Failover Example
Let’s say the primary MySQL node crashes at 2 AM:
- Orchestrator detects that the primary is unresponsive.
- It evaluates replicas for eligibility (based on replication health, position, and consistency).
- A healthy replica is promoted to the new primary.
- Other replicas are reconfigured to replicate from the new primary.
- Applications reconnect transparently to the new node.
This entire process happens automatically, reducing downtime from hours to seconds.
Benefits
- Minimal downtime with automated failover.
- Clear visibility into replication topology.
- Ability to handle complex setups like multi-tier replication.
MongoDB Failover Strategies
MongoDB, on the other hand, bakes high availability directly into its architecture through Replica Sets.
Replica Set Basics:
- A primary handles all writes.
- Secondaries replicate data and can take over if the primary fails.
- An arbiter can be used for tie-breaking in elections.
Election Process:
- When the primary goes down, secondaries hold a vote.
- A new primary is elected within 5–10 seconds.
- Applications with modern MongoDB drivers automatically retry and reconnect to the new primary.
Considerations for SREs:
- Failover disrupts in-flight writes. With writeConcern: majority, you minimize data loss.
- Long-running queries may fail mid-way during elections.
- Adding hidden or delayed nodes can help with backup and recovery strategies.
Monitoring Failovers:
- MongoDB logs clearly show when an election occurs.
- Exporters (like mongodb_exporter) expose election metrics to Prometheus.
- SREs can alert on frequent elections, which may indicate instability.
Failover Example
If the primary MongoDB instance fails:
- The secondaries detect the failure within seconds.
- An election is triggered.
- A secondary with the most up-to-date data is promoted.
- Clients are automatically redirected to the new primary.
This ensures minimal service disruption without human intervention.
Benefits
- Built-in automation:no external tool required for failover.
- Flexible scaling:secondaries can handle read-heavy workloads.
- Data redundancy:multiple copies reduce risk of corruption.
Comparing MySQL vs MongoDB HA
| Feature | MySQL | MongoDB |
|---|---|---|
| Failover Mechanism | External tooling (Orchestrator, MHA) | Built-in Replica Set elections |
| Typical Failover Time | 10–30s (depending on setup) | 5–10s |
| Data Loss Risk | Depends on sync settings | Mitigated with majority writes |
| Complexity | High (tooling + proxy/DNS layers) | Lower (built into DB) |
| Operational Flexibility | Fine-grained tuning possible | Opinionated but reliable |
MySQL requires more external tooling and SRE expertise to achieve reliable HA. MongoDB’s approach is more self-managing, but still needs careful tuning of election settings and monitoring.
Monitoring and Observability
High availability doesn’t end with replication and failover; monitoring is equally critical.
Tools:
- Percona Monitoring and Management (PMM): Provides deep insights into MySQL and MongoDB performance.
- Prometheus + Grafana: Collects and visualizes metrics like replication lag, CPU, memory usage, and disk I/O.
Key Metrics to Watch:
- Replication Lag: High lag means replicas can’t take over quickly.
- Election Time: In MongoDB, elections should complete in a few seconds.
- Disk Utilization: HA setups fail if disks run out of space.
- Service Uptime: Detect unexpected restarts or crashes.
Alerting Strategies
- Per-host alerts: Triggered when a single node crosses thresholds.
- Cluster-level alerts: Triggered when primary or quorum availability is compromised.
Common Pitfalls in HA
Even with HA configured, missteps can lead to outages:
- Unmonitored replication lag: Failovers to a lagging replica can cause data loss.
- Single data center deployment: All replicas in one DC are vulnerable to site-level outages.
- Ignoring backups: HA ≠ backup; you still need disaster recovery strategies.
- Split-brain in MongoDB: Misconfigured replica sets can result in two primaries.
- Improper tuning: Overloaded primaries can still fail even with replicas available.
Disaster Recovery vs High Availability
While often used interchangeably, HA and DR are distinct:
- High Availability (HA): Focuses on minimizing downtime through redundancy and failover.
- Disaster Recovery (DR): Ensures data and operations can be restored after catastrophic events (natural disasters, data center loss).
Comparison:
- HA = automatic failover, minimal downtime.
- DR = backups, offsite replication, longer RTO/RPO.
An ideal strategy combines HA + DR: use replicas for local failover, and backups with cross-region replication for disaster recovery.
Conclusion
High availability is not a single setting, it’s a design philosophy that requires alignment between technology, processes, and people.
For MySQL, achieving HA means combining replication with tools like Orchestrator or Group Replication, plus careful integration with application layers. For MongoDB, Replica Sets simplify HA, but SREs still need to monitor, test, and fine-tune for production workloads.
As SREs, our mission is not just to keep the databases running, but to make sure they remain resilient under failure. Failovers will happen, the question is whether they will feel like a disaster or just a small blip.