A simple synchronization algorithm
I am writing this article out of my experience with writing vdirsyncer, a simple synchronization program for calendar events. The same ideas can be used for file synchronization. The whole article is targeted at people who need to write some sort of "cloud synchronization" and have no idea where to start, and/or when journal-based approaches are not an option due to API restrictions.
Recap: The problem and OfflineIMAP's approach
For a start, read How OfflineIMAP works by E.Z. Yang. It deals with the synchronization algorithm used by OfflineIMAP, a program that can be used to synchronize emails between IMAP accounts and/or Maildir folders.
OfflineIMAP's algorithmic problem can be summarized as follows:
You are given two "sets"
A
andB
of "items", where the items have immutable content and globally unique IDs. Define the functionsync
. If items are added to one set, thensync(A, B)
should copy those items to the other set. If items are deleted from one set, those deletions should also be performed on the other set.
The presented solution involves maintaining a third set, called the
status
, that keeps track of the item IDs (not content) that
were present after the previous sync process. On first synchronization,
that set is empty.
For each item ID (from A, B, status
), there are several
possibilities as to which sets it is in:
A - B - status
-- If the ID exists on sideA
, but not onB
or thestatus
, it must have been created onA
. Copy the item fromA
toB
and also insert it intostatus
.B - A - status
-- Likewise if the ID exists only inB
, it must have been created there.
A + status - B
-- If the ID exists on sideA
and the status, but not onB
, it has been deleted onB
. Delete it fromA
and thestatus
.B + status - A
-- If the ID exists on sideB
and the status, but not onA
, it has been deleted onA
.
A + B - status
-- If it exists on sideA
and sideB
, but not instatus
, add the ID to the status.status - A - B
-- If it exists only in thestatus
it doesn't exist in reality, delete the ID from the status.
This algorithm assumes that items are immutable. In order to use it with mutable items, one could declare that modified items are treated as entirely new items. If an item is changed in one set, it would be first deleted from, then created again at the other side. However elegant that approach might seem, it has two huge flaws:
- If an item is modified on both sides, both versions will be present on both sides. This is actually desirable in applications that don't know anything about the item content anyway (such as file sync), but in other cases this is nearly useless behavior.
- In practice, deleting and then creating often requires more API calls than a single update.
The problem with mutable items
Let's restate the problem above with mutable items:
You are given two "sets"
A
andB
of "items", where each item has a globally unique ID and mutable content. Define the functionsync
.
- If items are added to one set, then
sync(A, B)
should copy those items to the other set.- If items are deleted from one set, those deletions should also be performed on the other set.
- If items on one side are modified, they should replace the items on the other side.
Let's also assume that there's a cheap way to determine a checksum
(an etag
) for each item, cheap enough to be run for all
items in both sets. In the case of a file sync tool you could use file's
mtimes1, or if you don't have too tight
performance constraints, just hash the content of each file. The
etag
for an item on side A
may differ from the
etag
of the same exact item on side B
.
The algorithm
Before we create the algorithm, we need to redefine our
status
from a set to a mapping. Now it not only has to keep
track of the item IDs, but also has to map each item ID to a tuple of
two etags: One for each side of {A, B}
. Here's an
example:
Item ID (index) | ETag A | ETag B |
1 | 771627ff | caf893af |
2 | dc165135 | 07b92840 |
A few additional cases have to be added if the item exists on
A, B, status
:
- If the etag has changed on
A
but notB
, copy fromA
toB
.- If the etag has changed on
B
but notA
, copy fromB
toA
.
- If the etag has changed on
- If the etag has changed on both
A
andB
, invokeconflict_resolution
.
And the following case has to be modified:
A + B - status
-- If the ID exists on sideA
and sideB
, but not instatus
, we can not just create it in status, since the two items might contain different content each. Invokeconflict_resolution
.
Conflict resolution
conflict_resolution
is its own routine, and what it
actually does depends heavily on actual usecase and the amount of
knowledge your application has of the underlying data. A few options for
its implementation:
- Prompt the user which version is preferrable.
- Create both versions of each file under new IDs on both sides.
- Do some other thing that only works in specific situations.
- Perhaps you have access to the last-modified date and can use that to determine the most up-to-date version?
- Perhaps you managed to save the old version from the previous sync and can perform a three-way merge?
- Perhaps the file format you're working with carries a full change history with it, and you can use that to get a usable merge?
What you always should do as part of conflict resolution is to check
whether the item actually has different content on each side. When
called the first time, the status is empty, so sync
will
call conflict_resolution
for each item ID.
The last-modified date of a file on POSIX. If you consider using that, keep in mind:
- Depending on the OS and filesystem, mtime's precision may be
insufficient. On FAT, it has only 2-second resolution, and POSIX
st_mtime
may vary in precision (there's alsost_mtime_ns
). This means that mtime-based change detection might miss some changes, as mtime after file modification is the same as the old one. - You can randomly touch files (updating their mtimes) without
modifying their content, such that your application sees "bogus changes"
in items. This might lead to more synchronization conflicts and
unnecessary calls to
conflict_resolution
.
Vdirsyncer uses mtimes only as an indicator that a file might have changed, and gets rid of false positives by comparing hashes of item's content.↩︎
- Depending on the OS and filesystem, mtime's precision may be
insufficient. On FAT, it has only 2-second resolution, and POSIX