Surviving Failures: Practical HA Models for MySQL and MongoDB

Introduction

In today’s always-on world, downtime is expensive. Whether it’s an e-commerce site losing transactions, a banking app failing to process payments, or a SaaS product missing its uptime SLA, availability is directly tied to trust and revenue. For Site Reliability Engineers (SREs), one of the most critical responsibilities is keeping databases, the backbone of any application, resilient in the face of failures.

Two of the most widely deployed databases in production environments are MySQL and MongoDB. Both are powerful, both can scale, and both have mechanisms to support high availability (HA). But here’s the catch: HA doesn’t come out of the box. It requires deliberate architecture decisions, careful failover strategies, and proactive monitoring to minimize downtime and data loss.

In this post, we’ll explore how SREs can design highly available MySQL and MongoDB deployments, with a focus on failover strategies. We’ll break down how each database handles failover, what challenges to expect, and how to align your design with your SLOs and SLAs.

What High Availability Means for Databases

Before diving into specifics, let’s align on what high availability actually means.

Replication vs HA: Replication ensures data exists in multiple places. High availability ensures that if one node fails, another seamlessly takes over with minimal disruption.
Redundancy: Multiple servers running copies of your data, often spread across availability zones or data centers.
Failover: The process of switching database traffic from a failed primary node to a healthy replica.
Recovery Objectives:
- RTO (Recovery Time Objective) = how quickly the database should recover from a failure.
- RPO (Recovery Point Objective) = how much data (if any) you can afford to lose during failover.

When you’re running mission-critical workloads, these definitions aren’t just theory; they guide the design of your replication topology, the choice of failover tooling, and the operational runbooks your SRE team maintains.
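
One way to make these objectives concrete is to add up the phases of a failover and compare the totals against your targets. The sketch below is only an illustration: the FailoverBudget class, its field names, and the sample numbers are assumptions made for this example, not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    """Hypothetical timing model for one failover (all values in seconds)."""
    detection_s: float        # time to notice that the primary is down
    promotion_s: float        # time to promote a replica
    reroute_s: float          # time for proxy/DNS/clients to reach the new primary
    replication_lag_s: float  # how far behind the promoted replica was

    def estimated_rto(self) -> float:
        # Downtime is roughly the sum of the sequential failover phases.
        return self.detection_s + self.promotion_s + self.reroute_s

    def estimated_rpo(self) -> float:
        # With asynchronous replication, writes not yet applied on the
        # promoted replica are at risk, so lag approximates the loss window.
        return self.replication_lag_s

    def meets(self, rto_target_s: float, rpo_target_s: float) -> bool:
        return (self.estimated_rto() <= rto_target_s
                and self.estimated_rpo() <= rpo_target_s)


# Illustrative numbers only.
budget = FailoverBudget(detection_s=10, promotion_s=5, reroute_s=5,
                        replication_lag_s=2)
print(budget.estimated_rto())                          # expected downtime in seconds
print(budget.meets(rto_target_s=30, rpo_target_s=5))   # does it fit the targets?
```

Feeding real measurements from failover drills into a model like this shows which phase (detection, promotion, or rerouting) dominates your RTO.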

MySQL Failover Strategies

MySQL has long been the go-to relational database, but standard MySQL replication does not fail over automatically. Instead, SREs rely on replication plus external tools, or on the newer native clustering options, to achieve high availability.

Replication Basics:

One primary node accepts writes.
One or more replicas replicate data asynchronously or semi-synchronously.
If the primary fails, a replica can be promoted to primary.

Failover Approaches:

Orchestrator:

A battle-tested tool for topology management and automated failover.
Monitors replication health and can promote a replica automatically.
Works well in complex topologies with multiple replicas.

MHA (Master High Availability):

Popular in older setups.
Detects primary failure and promotes a replica.
Reliable but less feature-rich compared to Orchestrator.

Group Replication / InnoDB Cluster:

Native HA solution from MySQL.
Provides built-in consensus and automatic failover.
Great for greenfield projects but requires modern MySQL versions and careful tuning.

Challenges in MySQL HA:

Application connection strings often point to a specific node, requiring DNS or proxy layers (e.g., ProxySQL or HAProxy).
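Failover typically takes 10–30 seconds, depending on how quickly the failure is detected and a replica is promoted.
Semi-synchronous replication reduces the risk of data loss because the primary waits for at least one replica to acknowledge each transaction, but that wait adds write latency.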
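
Because a failover can last tens of seconds, applications usually need to tolerate a short window of failed connections instead of giving up on the first error. The following is a minimal sketch of a retry loop with exponential backoff. The function name call_with_retry, its parameters, and the flaky_query stand-in are hypothetical; in a real application the operation would be whatever driver call you make through your proxy or DNS name.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T],
                    max_wait_s: float = 30.0,
                    first_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `operation` with exponential backoff until it succeeds or the
    next sleep would push the total wait past `max_wait_s`."""
    delay = first_delay_s
    waited = 0.0
    while True:
        try:
            return operation()
        except ConnectionError:
            if waited + delay > max_wait_s:
                raise  # failover outlasted the budget; surface the error
            sleep(delay)
            waited += delay
            delay = min(delay * 2, 5.0)  # cap the backoff at 5 seconds


attempts = {"n": 0}


def flaky_query() -> str:
    # Stand-in for a real driver call: fails three times, then succeeds.
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise ConnectionError("primary unavailable")
    return "ok"


slept: List[float] = []
result = call_with_retry(flaky_query, sleep=slept.append)  # record sleeps instead of waiting
print(result, slept)
```

Setting max_wait_s slightly above your expected failover time keeps short failovers invisible to users while still failing fast during a real outage.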

Failover Example

Let’s say the primary MySQL node crashes at 2 AM:

Orchestrator detects that the primary is unresponsive.
It evaluates replicas for eligibility (based on replication health, position, and consistency).
A healthy replica is promoted to the new primary.
Other replicas are reconfigured to replicate from the new primary.
Applications reconnect to the new primary through the proxy or DNS layer once it points at the promoted node.

This entire process happens automatically, reducing downtime from hours to seconds.
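
The selection and repointing steps above can be sketched in a few lines. This is a simplified model, not Orchestrator’s actual algorithm: the Replica class, the single integer applied_position (a stand-in for a binlog or GTID position), and the lag threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Replica:
    """Hypothetical view of a replica as seen by a failover tool."""
    name: str
    replication_healthy: bool  # was replication running cleanly before the crash?
    applied_position: int      # simplified stand-in for a binlog/GTID position
    lag_seconds: float


def pick_promotion_candidate(replicas: List[Replica],
                             max_lag_s: float = 10.0) -> Optional[Replica]:
    """Choose the healthy replica that has applied the most data."""
    eligible = [r for r in replicas
                if r.replication_healthy and r.lag_seconds <= max_lag_s]
    if not eligible:
        return None  # nothing safe to promote; a human has to decide
    # Prefer the most advanced position; break ties by the lowest lag.
    return max(eligible, key=lambda r: (r.applied_position, -r.lag_seconds))


def repoint_plan(replicas: List[Replica],
                 new_primary: Replica) -> List[Tuple[str, str]]:
    """List (replica, new source) pairs for every replica except the new primary."""
    return [(r.name, new_primary.name)
            for r in replicas if r.name != new_primary.name]


replicas = [
    Replica("db2", True, 1500, 1.0),
    Replica("db3", True, 1480, 3.0),
    Replica("db4", False, 1600, 0.0),  # most advanced, but replication was broken
]
candidate = pick_promotion_candidate(replicas)
print(candidate.name)                      # db2
print(repoint_plan(replicas, candidate))   # [('db3', 'db2'), ('db4', 'db2')]
```

Note that db4 is skipped even though it reports the highest position: an unhealthy replica is not trusted as a promotion target in this sketch.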

Benefits

Minimal downtime with automated failover.
Clear visibility into replication topology.
Ability to handle complex setups like multi-tier replication.

MongoDB Failover Strategies

MongoDB, on the other hand, bakes high availability directly into its architecture through Replica Sets.

Replica Set Basics:

A primary handles all writes.
Secondaries replicate data and can take over if the primary fails.
An arbiter can be used for tie-breaking in elections.

Election Process:

When the primary goes down, secondaries hold a vote.
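A new primary is typically elected within 5–10 seconds once the failure is detected; detection itself is governed by the election timeout (settings.electionTimeoutMillis, 10 seconds by default).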
Applications with modern MongoDB drivers automatically retry and reconnect to the new primary.
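
An election succeeds only when a candidate collects votes from a strict majority of the voting members, which is why the size of the replica set matters. The short sketch below works through that arithmetic; the function names are illustrative and not part of any MongoDB API.

```python
def votes_needed(voting_members: int) -> int:
    """Votes required to elect a primary: a strict majority of voting members."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many voting members can be lost while a primary can still be elected."""
    return voting_members - votes_needed(voting_members)


for n in (3, 4, 5):
    print(n, votes_needed(n), fault_tolerance(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-member set tolerates no more failures than a three-member set, which is why odd numbers of voting members (or an arbiter as a tie-breaker) are the usual choice.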

Considerations for SREs:

Failover disrupts in-flight writes. Writes acknowledged with writeConcern: majority survive a failover, while writes acknowledged only by the old primary can be rolled back.
Long-running queries may fail mid-way during elections.
Adding hidden or delayed nodes can help with backup and recovery strategies.
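
Write concern and retry behavior can be set once in the connection string so every client picks them up. The sketch below assembles such a URI using the standard replicaSet, w, wtimeoutMS, and retryWrites options. The host names, the replica set name rs0, the database name, and the 5-second timeout are placeholders chosen for this example.

```python
from typing import List
from urllib.parse import urlencode


def build_mongo_uri(hosts: List[str], replica_set: str,
                    database: str = "app") -> str:
    """Build a replica-set connection string with majority write concern."""
    options = {
        "replicaSet": replica_set,  # lets the driver discover and track the primary
        "w": "majority",            # acknowledge writes only after a majority has them
        "wtimeoutMS": 5000,         # illustrative limit on waiting for that majority
        "retryWrites": "true",      # retry a write once after a transient failure
    }
    return f"mongodb://{','.join(hosts)}/{database}?{urlencode(options)}"


# Placeholder hosts; list several members so the driver can find the set
# even if one of them is down.
uri = build_mongo_uri(
    ["db1.example.internal:27017",
     "db2.example.internal:27017",
     "db3.example.internal:27017"],
    replica_set="rs0",
)
print(uri)
```

Recent drivers generally enable retryable writes by default; the option is spelled out here only to make the intent visible.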

Monitoring Failovers:

MongoDB logs clearly show when an election occurs.
Exporters (like mongodb_exporter) expose election metrics to Prometheus.
SREs can alert on frequent elections, which may indicate instability.
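
Alerting on frequent elections comes down to counting events in a sliding time window. Here is a minimal sketch of that logic. The ElectionRateAlert class and its thresholds are hypothetical; in practice the timestamps would come from your logs or exporter, and the rule would more likely live in your alerting system than in application code.

```python
from collections import deque


class ElectionRateAlert:
    """Fire when more than `max_elections` occur within `window_s` seconds."""

    def __init__(self, max_elections: int = 3, window_s: float = 3600.0):
        self.max_elections = max_elections
        self.window_s = window_s
        self._events = deque()  # election timestamps, oldest first

    def record(self, timestamp_s: float) -> bool:
        """Record one election; return True if the alert should fire."""
        self._events.append(timestamp_s)
        # Drop elections that have fallen out of the sliding window.
        while self._events and self._events[0] <= timestamp_s - self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_elections


alert = ElectionRateAlert(max_elections=2, window_s=600)
results = [alert.record(t) for t in (0, 100, 200, 900)]
print(results)  # [False, False, True, False]
```

The third election within ten minutes trips the alert; by the fourth, the earlier ones have aged out of the window and the alert clears.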

Failover Example

If the primary MongoDB instance fails:

The secondaries detect the failure within seconds.
An election is triggered.
A secondary with the most up-to-date data is promoted.
Clients are automatically redirected to the new primary.

This ensures minimal service disruption without human intervention.

Benefits

Built-in automation: no external tool required for failover.
Flexible scaling: secondaries can handle read-heavy workloads.
Data redundancy: multiple copies reduce the risk of losing data when a node fails.

Comparing MySQL vs MongoDB HA

Feature	MySQL	MongoDB
Failover Mechanism	External tooling (Orchestrator, MHA)	Built-in Replica Set elections
Typical Failover Time	10–30s (depending on setup)	5–10s
Data Loss Risk	Depends on sync settings	Mitigated with majority writes
Complexity	High (tooling + proxy/DNS layers)	Lower (built into DB)
Operational Flexibility	Fine-grained tuning possible	Opinionated but reliable

MySQL requires more external tooling and SRE expertise to achieve reliable HA. MongoDB’s approach is more self-managing, but still needs careful tuning of election settings and monitoring.

Monitoring and Observability

High availability doesn’t end with replication and failover; monitoring is equally critical.

Tools:

Percona Monitoring and Management (PMM): Provides deep insights into MySQL and MongoDB performance.
Prometheus + Grafana: Collects and visualizes metrics like replication lag, CPU, memory usage, and disk I/O.

Key Metrics to Watch:

Replication Lag: High lag means replicas can’t take over quickly.
Election Time: In MongoDB, elections should complete in a few seconds.
Disk Utilization: HA setups fail if disks run out of space.
Service Uptime: Detect unexpected restarts or crashes.

Alerting Strategies

Per-host alerts: Triggered when a single node crosses thresholds.
Cluster-level alerts: Triggered when primary or quorum availability is compromised.
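
The difference between the two alert levels is easiest to see side by side. The sketch below evaluates both against the same snapshot of node states. The NodeStatus fields, the thresholds, and the alert strings are assumptions made for illustration; real values would come from your monitoring stack.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one database node."""
    name: str
    reachable: bool
    is_primary: bool
    disk_used_pct: float
    replication_lag_s: float


def per_host_alerts(nodes: List[NodeStatus], max_disk_pct: float = 85.0,
                    max_lag_s: float = 30.0) -> List[str]:
    alerts = []
    for n in nodes:
        if not n.reachable:
            alerts.append(f"{n.name}: unreachable")
            continue  # no trustworthy metrics from an unreachable node
        if n.disk_used_pct > max_disk_pct:
            alerts.append(f"{n.name}: disk {n.disk_used_pct:.0f}% used")
        if not n.is_primary and n.replication_lag_s > max_lag_s:
            alerts.append(f"{n.name}: replication lag {n.replication_lag_s:.0f}s")
    return alerts


def cluster_alerts(nodes: List[NodeStatus]) -> List[str]:
    alerts = []
    reachable = [n for n in nodes if n.reachable]
    if not any(n.is_primary for n in reachable):
        alerts.append("cluster: no reachable primary")
    if len(reachable) < len(nodes) // 2 + 1:
        alerts.append("cluster: quorum lost")
    return alerts


nodes = [
    NodeStatus("db1", False, True, 40.0, 0.0),
    NodeStatus("db2", True, False, 90.0, 5.0),
    NodeStatus("db3", True, False, 50.0, 45.0),
]
print(per_host_alerts(nodes))
print(cluster_alerts(nodes))
```

Per-host alerts can usually wait for business hours, while a cluster-level alert such as a missing primary is the kind that should page someone.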

Common Pitfalls in HA

Even with HA configured, missteps can lead to outages:

Unmonitored replication lag: Failovers to a lagging replica can cause data loss.
Single data center deployment: All replicas in one DC are vulnerable to site-level outages.
Ignoring backups: HA ≠ backup; you still need disaster recovery strategies.
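Split-brain in MongoDB: During a network partition, two nodes can briefly both believe they are primary; writes to the stale one that were not majority-acknowledged are rolled back. Misconfigured replica sets make this more likely and more damaging.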
Improper tuning: Overloaded primaries can still fail even with replicas available.

Disaster Recovery vs High Availability

While often used interchangeably, HA and DR are distinct:

High Availability (HA): Focuses on minimizing downtime through redundancy and failover.
Disaster Recovery (DR): Ensures data and operations can be restored after catastrophic events (natural disasters, data center loss).

Comparison:

HA = automatic failover, minimal downtime.
DR = backups, offsite replication, longer RTO/RPO.

An ideal strategy combines HA + DR: use replicas for local failover, and backups with cross-region replication for disaster recovery.

Conclusion

High availability is not a single setting; it’s a design philosophy that requires alignment between technology, processes, and people.

For MySQL, achieving HA means combining replication with tools like Orchestrator or Group Replication, plus careful integration with application layers. For MongoDB, Replica Sets simplify HA, but SREs still need to monitor, test, and fine-tune for production workloads.

As SREs, our mission is not just to keep the databases running, but to make sure they remain resilient under failure. Failovers will happen; the question is whether they will feel like a disaster or just a small blip.

