Resilience Engineering of MailPlus High Availability

At Synology, we rely on MailPlus to support over 1,500 employees and handle more than 170,000 emails each day. We closely monitor its uptime to ensure it delivers the reliability we demand in a production environment.

Over time, MailPlus has consistently delivered 99.99% service uptime. For comparison, a typical SLA of 99.9% allows for up to 43 minutes of downtime per month, while 99.99% reduces that to about 4 minutes per month, or roughly 1 minute per week.
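The downtime figures above follow from simple arithmetic. The short sketch below reproduces them, assuming a 30-day month; the function and constant names are illustrative only.

```python
def downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Return the downtime allowed in a period for a given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)


MINUTES_PER_WEEK = 7 * 24 * 60     # 10,080 minutes
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes, assuming a 30-day month

for pct in (99.9, 99.99):
    per_month = downtime_minutes(pct, MINUTES_PER_MONTH)
    per_week = downtime_minutes(pct, MINUTES_PER_WEEK)
    print(f"{pct}%: {per_month:.1f} min/month, {per_week:.1f} min/week")
```

Under these assumptions, 99.9% works out to about 43.2 minutes per month and 99.99% to about 4.3 minutes per month.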

This level of availability is the result of deliberate engineering decisions, especially the implementation of MailPlus High Availability (HA). Unlike traditional active-passive setups, MailPlus HA is designed as an active-active cluster. Both servers handle requests simultaneously, allowing administrators to perform updates or maintenance on one node while the other continues serving users. This architecture significantly reduces both planned and unplanned downtime.

In this article, we’ll take a closer look at the design principles and mechanisms behind MailPlus HA, and how they work together to provide uninterrupted email service.

Reliable Synchronization: Ensuring Real-Time Consistency

Keeping two mail servers perfectly in sync isn’t as simple as copying files. Mailboxes are in constant flux as new messages arrive and users read, move, or delete emails throughout the day. Our first engineering challenge was to design a synchronization engine that could keep up with this dynamic environment, one that is efficient and precise enough to track every change in real time.

Targeted Sync Through Directory Isolation

In many systems, all users’ data is stored in a single, centralized database. While convenient, this approach creates a critical risk: if the database becomes corrupted or experiences a failure, it can impact the entire system and disrupt service for all users. Recovery and troubleshooting in such cases are complex and time-consuming.

That’s why MailPlus takes a different approach. Instead of using a monolithic user store, each user’s email data and configuration settings are stored in a dedicated directory. These directories are managed individually, and while metadata may still be indexed or tracked in a central service for efficiency, the core mail and user-specific files are kept separate.

This architectural choice provides two key advantages:

Limited blast radius: If one user’s mailbox becomes corrupted or encounters a synchronization issue, it doesn’t affect other users.
Granular synchronization: When one user’s data changes (e.g., new mail or updated settings), we only need to synchronize that specific data set between servers. This targeted approach eliminates the need to transfer the entire mail store, maintaining performance as user numbers and mail volume scale.
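To make the idea of per-user isolation concrete, here is a minimal sketch of syncing only the directories that changed. It is not MailPlus code: the one-directory-per-user layout on disk, the `sync_changed_users` function, and the plain file copy are all assumptions made for illustration. A real engine would also need to propagate deletions, which `shutil.copytree` does not do.

```python
import shutil
import tempfile
from pathlib import Path


def sync_changed_users(src_root: Path, dst_root: Path, changed_users: set[str]) -> list[str]:
    """Copy only the directories of users whose data changed; return who was synced."""
    synced = []
    for user in sorted(changed_users):
        src = src_root / user
        if not src.is_dir():
            continue  # unknown user: nothing to copy
        # Only this user's directory is touched; other users are left alone.
        shutil.copytree(src, dst_root / user, dirs_exist_ok=True)
        synced.append(user)
    return synced


# Demo with throwaway directories standing in for the two servers.
with tempfile.TemporaryDirectory() as tmp:
    node_a, node_b = Path(tmp, "node_a"), Path(tmp, "node_b")
    for user in ("alice", "bob"):
        (node_a / user).mkdir(parents=True)
        (node_a / user / "inbox.txt").write_text(f"mail for {user}")
    node_b.mkdir()

    synced = sync_changed_users(node_a, node_b, {"alice"})
    copied = sorted(p.name for p in node_b.iterdir())
    print(synced, copied)
```

In the demo only alice is marked as changed, so bob's directory is never read or transferred. The cost of a sync depends on how much changed, not on the size of the whole mail store.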

Task-Oriented Sync Through Ordered Queues

Synchronization performance and accuracy often depend on how changes are tracked and applied. Instead of real-time data replication, MailPlus HA uses a task-oriented approach: each change is captured, organized into ordered task queues, and then synchronized.

Each mailbox operation, such as receiving a new message, moving an email, or deleting a folder, is recorded as a task. These tasks are processed sequentially, with both servers exchanging updates and confirming successful application before moving to the next task.
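The sequential, acknowledge-before-continuing behavior described above can be sketched as follows. The `Task`, `Replica`, and `drain` names are hypothetical, and the in-process method call stands in for whatever network exchange the real system uses.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    seq: int      # position in the mailbox's task queue
    action: str   # e.g. "deliver", "move", "delete"
    target: str   # message or folder the action applies to


class Replica:
    """Hypothetical peer that applies tasks strictly in sequence order."""

    def __init__(self) -> None:
        self.applied: list[Task] = []
        self.next_seq = 1

    def apply(self, task: Task) -> bool:
        """Apply the task and acknowledge it; reject anything out of order."""
        if task.seq != self.next_seq:
            return False
        self.applied.append(task)
        self.next_seq += 1
        return True


def drain(queue: deque, peer: Replica) -> int:
    """Send tasks one at a time, stopping at the first unacknowledged task."""
    sent = 0
    while queue:
        if not peer.apply(queue[0]):
            break  # leave the task queued so it can be retried later
        queue.popleft()
        sent += 1
    return sent


queue = deque([
    Task(1, "deliver", "msg-1"),
    Task(2, "move", "msg-1"),
    Task(3, "delete", "Drafts/old"),
])
peer = Replica()
sent = drain(queue, peer)
print(sent, [t.seq for t in peer.applied])
```

A task leaves the queue only after the peer confirms it, so a failed exchange leaves the remaining tasks in their original order for a later retry.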

This design also helps MailPlus handle synchronization conflicts. For example, if a temporary network issue causes users to interact with both servers simultaneously, conflicting changes may occur, such as the same email being moved to two different mailboxes. MailPlus uses a combination of timestamps and task order to determine which action should take precedence, ensuring accurate results.
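As an illustration of combining timestamps with task order, the sketch below resolves two conflicting moves of the same message. The precedence rule shown here (later timestamp wins, then higher queue position, then node name as a final tiebreaker) is an assumption chosen for the example. The article does not specify the exact rule MailPlus applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    message_id: str
    dest_folder: str
    timestamp: float  # seconds since epoch, as recorded by the node
    seq: int          # position in that node's task queue
    node: str         # which server recorded the task


def resolve(a: Move, b: Move) -> Move:
    """Pick the winning move: later timestamp, then higher seq, then node name."""
    if a.message_id != b.message_id:
        raise ValueError("moves target different messages; nothing to resolve")
    return max(a, b, key=lambda m: (m.timestamp, m.seq, m.node))


# The same message was moved to two different folders during a network issue.
move_a = Move("msg-42", "Archive", timestamp=1000.0, seq=7, node="node-a")
move_b = Move("msg-42", "Projects", timestamp=1005.0, seq=3, node="node-b")

winner = resolve(move_a, move_b)
print(winner.dest_folder)
```

A deterministic key means both servers reach the same answer independently, without another round of negotiation.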

By tracking only real changes and maintaining the correct sequence of actions, MailPlus delivers synchronization that is both efficient and reliable, ensuring your data stays consistent without unnecessary overhead.

Split-Brain Recovery: Self-Healing and Data Reconciliation

Split-brain is the most feared failure in any HA cluster. It occurs when the connection between two servers breaks, but both servers remain online and running. Unable to detect each other, both servers may wrongly assume the active role. This creates two “primary” servers independently accepting new emails and processing user actions. When the connection is restored, the two servers hold conflicting data, which often causes irreversible data loss.

The Debate: Prevention vs. Cure

The conventional strategy for preventing split-brain involves the use of external mechanisms, such as a third-party witness server or quorum-based rules, to determine which server should retain the primary role. While effective in many cases, these methods also add architectural complexity and introduce new potential points of failure. For example, if the witness server goes offline, fault tolerance can be undermined.

MailPlus approaches this challenge by layering conflict resolution on top of preventive measures. We incorporated built-in logic to reconcile changes when inconsistencies arise. This added failsafe ensures continued data consistency when prevention mechanisms fall short.

Data Integrity Through Change-Aware Reconciliation

Our approach uses a bidirectional reconciliation mechanism that safely resolves changes following a split-brain event:

Automatic assignment: During MailPlus HA setup, the system writes specific metadata to both the primary and secondary servers to establish priority. Both servers actively handle mail delivery, but only the primary server can modify system settings. If the servers lose connection, the secondary server continues delivering mail while entering read-only mode for configurations. When the connection is restored, the system automatically restores the primary role to the originally designated primary server.
Difference check: Instead of overwriting one server’s data with the other’s, the system performs a difference check, comparing the changes made on both sides. It evaluates user actions such as new emails, deletions, and message moves, then determines what unique data exists on each server.
Data reconciliation: The system then reconciles the differences with care to preserve all valid user data. For example, if Server A indicates a message was deleted while Server B records a reply to that message, the reply is retained. Likewise, if both servers received new messages independently during the split, all messages are kept. This approach ensures no valuable data is lost.
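The difference check and reconciliation steps above can be sketched with plain sets of message IDs. This is a simplified model, not the MailPlus implementation. It assumes each server can be compared against a shared pre-split snapshot (`base`), that all additions from both sides are kept, and that a deletion made on either side is applied. The function names are hypothetical.

```python
def diff(base: set[str], current: set[str]) -> tuple[set[str], set[str]]:
    """Return (added, deleted) message IDs relative to the pre-split state."""
    return current - base, base - current


def reconcile(base: set[str], server_a: set[str], server_b: set[str]) -> set[str]:
    """Merge both sides: keep every addition, apply deletions from either side."""
    added_a, deleted_a = diff(base, server_a)
    added_b, deleted_b = diff(base, server_b)
    survivors = base - deleted_a - deleted_b
    return survivors | added_a | added_b


base = {"msg-1", "msg-2"}
# During the split, Server A deleted msg-1 and received msg-3.
server_a = {"msg-2", "msg-3"}
# Server B recorded a reply to msg-1 and received msg-4.
server_b = {"msg-1", "msg-2", "reply-to-msg-1", "msg-4"}

merged = reconcile(base, server_a, server_b)
print(sorted(merged))
```

Neither side overwrites the other. The reply recorded on Server B and the new messages received on both servers all survive the merge.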

A Mail System You Can Rely On

When you deploy MailPlus HA, you’re getting more than just a failover server. You’re getting a resilient system engineered with multiple layers of protection. From robust, real-time synchronization that safeguards data consistency to automated recovery mechanisms capable of resolving split-brain scenarios, the system is designed to maintain service continuity and data integrity without manual intervention.

Our objective is to give you the confidence to “set it and forget it,” knowing that behind the scenes, a resilient and intelligent system is always on guard.
