The fact that they found bugs that rely on sensitive timing doesn't surprise me.
I would love to have all the file sync solutions tested with this suite.
Not everything is a CRUD app website.
I was running my own hacky sync thing to the cloud a decade ago. I would never in my boldest dreams have compared it to Dropbox.
Even if you know the use cases, the edge cases could be 99% of the work. POCs are 100x easier than working production multi-user applications. Don’t confuse getting to a POC in 2 hours with getting a final product in 4 hours.
You could have it mirror an entire subdirectory, including external drives.
If you booted up while that external drive was not mounted, the service registered that as a subdirectory delete (bad). When you then mounted the drive again, the sync agent saw the local copy as out of sync with the newer server-side delete and proceeded to clear the local external drive.
They also implemented versioning so poorly that a deleted directory was not versioned, only the files within it. So you could only recover the raw files, stripped of their directory structure, as a giant bundle of thousands of files. Horrible.
See: https://dynamicsgpland.blogspot.com/2011/11/one-significant-...
Imagine that you are on a plane (and don't have an internet connection). You edit a file.
At the same time, I edit that file.
What should we do? We can't possibly know every file format out there, and implement operational transform for all of them.
Now, imagine that we both edit the same file, at the same instant. One of us is going to submit the change first, and the other will submit it second. It's the same use case, and there's no way to avoid this.
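Since the sync layer can't merge arbitrary file formats, the usual way out is to detect that both edits were based on the same version and keep the loser as a conflict copy. A minimal sketch of that idea (the `apply_upload` function and its version scheme are illustrative, not any product's actual protocol):

```python
# Minimal sketch of offline-edit conflict handling (illustrative): each
# upload carries the version it was based on; if the server has already
# moved past that version, the second writer's file is preserved as a
# conflict copy instead of silently overwriting the first writer's.

def apply_upload(server_files, path, content, base_version):
    """Apply one client's upload; return the path the content landed at."""
    current = server_files.get(path)
    if current is None or current["version"] == base_version:
        new_version = current["version"] + 1 if current else 1
        server_files[path] = {"content": content, "version": new_version}
        return path
    # Concurrent edit: someone else committed first. Keep both copies.
    # (Real systems also have to handle the conflict name colliding.)
    conflict_path = f"{path} (conflicted copy)"
    server_files[conflict_path] = {"content": content, "version": 1}
    return conflict_path

# Two clients edit the same file while offline, both based on version 1:
files = {"report.txt": {"content": "v1", "version": 1}}
first = apply_upload(files, "report.txt", "alice's edit", base_version=1)
second = apply_upload(files, "report.txt", "bob's edit", base_version=1)
# first lands at "report.txt"; second becomes the conflicted copy.
```

Whoever's upload reaches the server first "wins" the original path; the point is only that nobody's work is discarded.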
---
Renaming folders was a lot weirder, because you get situations like:
I rename a folder, but you save a new change to a file in that renamed folder and your computer doesn't know about the renamed folder.
Or, I rename a folder and you have a file open. That application has an open file handle to that file, so we can't just rename the folder. What do we do? (This is how Excel does it.)
Or, I rename a folder and you have a file open, but that application doesn't have an open file handle to that file. What happens when you try to save the file and it's been moved? (This is how most applications do it.)
---
Application bundles (on Mac) were weird because we didn't support the metadata needed to sync them.
---
The general "Merge" use case, which had to do with the fact that Syncplicity could sync folders anywhere on disk. (As opposed to the way that Dropbox, Google Drive, and OneDrive stick everything into a single folder.) We'd have customers disconnect a folder and then re-add it to the same location. The problem was that, if they had been disconnected for a long time, they would "merge" the old version of the folder into the new one:
If you edited a file while disconnected, it hit the same "multiple editors" use case that I mentioned above.
If someone deleted a file but you still had it, we'd recreate it. (We can't read minds, you know!)
If someone renamed a folder, but you still had the old path, we'd re-add it.
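These merge surprises all come from the same gap: without a record of what was synced last, a two-way comparison can't tell a remote delete from a new local file. A minimal three-way reconciliation sketch (function name and states are illustrative):

```python
# Sketch of why re-adding a long-disconnected folder "merges" stale
# state: a three-way comparison against the last-synced snapshot can
# distinguish "deleted remotely" from "created locally"; a two-way
# comparison cannot, so the agent recreates deleted files.

def reconcile(path, local, remote, last_synced):
    """Decide what to do with one path. Each argument is the file's
    content at that location (None if absent); last_synced is the
    state recorded at the last successful sync."""
    if local == remote:
        return "in sync"
    if local == last_synced:        # only the remote side changed
        return "delete locally" if remote is None else "pull remote"
    if remote == last_synced:       # only the local side changed
        return "delete remotely" if local is None else "push local"
    return "conflict"               # both sides changed independently

# With a snapshot, a remote delete is recognised as a delete:
#   reconcile("a.txt", local="old", remote=None, last_synced="old")
# Without one (last_synced=None), the same situation looks like a brand
# new local file, and the agent "helpfully" re-uploads it:
#   reconcile("a.txt", local="old", remote=None, last_synced=None)
```

This is the same base-version bookkeeping that version control systems do; losing the snapshot (by disconnecting the folder) is what turns a re-add into a bad merge.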
I remember overhearing non-programmer product managers trying to talk through these use cases, getting overwhelmed by the complexity, and realizing they were in deep, deep over their heads.
---
A lot of these corner cases were smoothed over when we wrote "SyncDrive", which was a virtual disk drive, because all of the IO came through us. (Instead of scanning a folder to understand what the user did.)
- Receiver tried to create a file before receiving the attributes of the directory containing it. The receiver's author assumed directory attributes would always arrive first (and the directory would be created from them), so it crashed.
- Receiver created a file before receiving the attributes of the directory containing it. The parent directory was created automatically, but with default attributes, so the file was more accessible on the receiver than it should have been.
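Both of these are ordering bugs: the receiver applied records in arrival order instead of dependency order. A minimal sketch of one fix, assuming each record carries its path, kind, and mode (the record format here is made up for illustration):

```python
# Sketch: sort incoming records so every directory is applied before its
# contents. Parents then always exist, with their real (not default)
# attributes, by the time a child file arrives. Sorting by path depth is
# the simplest dependency order for a tree of creates.

def apply_in_dependency_order(records, create_dir, create_file):
    """records: dicts like {"path": ..., "kind": "dir"|"file",
    "mode": ...}. Path depth guarantees parents come first."""
    for rec in sorted(records, key=lambda r: r["path"].count("/")):
        if rec["kind"] == "dir":
            create_dir(rec["path"], rec["mode"])
        else:
            create_file(rec["path"], rec["mode"])

# A stream that (incorrectly) sends the file before its directory:
records = [
    {"path": "secret/key.txt", "kind": "file", "mode": 0o600},
    {"path": "secret", "kind": "dir", "mode": 0o700},
]
log = []
apply_in_dependency_order(
    records,
    create_dir=lambda p, m: log.append(("dir", p, m)),
    create_file=lambda p, m: log.append(("file", p, m)),
)
# The directory is created first, with its intended 0o700 mode.
```

A streaming receiver can't sort the whole batch up front, of course; there the equivalent fix is to create missing parents explicitly and patch their attributes when the real record arrives.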
- Bidirectional sync peers got into a non-terminating protocol loop (livelock) when trying to agree if a directory deep in a tree should be empty or removed (garbage collected) after synchronising removal of contents. It always worked if one side changed and sync settled before the next change, but could fail if both sides had concurrent changes.
- Mesh sync among multiple peers, with some of them acting as publish-subscribe proxies that forwarded changes to others as quickly as possible, merged with their own changes, got into a more complicated non-terminating protocol loop when trying to broadcast and reconcile overlapping changes observed on three or more nodes concurrently. The solution was similar to distributed garbage collection and the spanning tree protocols used in Ethernet switch networks.
- Transmission of commands halted due to head-of-line blocking (deadlock) on a multiplexed sync stream: a data channel was going to a receiver process whose buffer filled while it waited for a command on the command channel, a command the transmitter process had issued but couldn't transmit. The fault was that separate, modular tasks assumed their data flowed independently. The solution was to multiplex correctly with per-channel credits, as HTTP/2 and QUIC do, instead of incorrectly assuming you can just mix formatted messages over TCP.
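A sketch of the per-channel-credit idea, in the spirit of HTTP/2 stream flow control rather than any specific product's wire format (class and channel names are made up):

```python
from collections import deque

# Each channel has its own credit counter, replenished only when the
# receiver drains that channel's buffer. A data channel that exhausts
# its credit simply stops transmitting; frames queued on the command
# channel still go out, so one full buffer can't deadlock the stream.

class Multiplexer:
    def __init__(self, credits):
        self.credits = dict(credits)              # channel -> frames allowed
        self.queues = {ch: deque() for ch in credits}

    def send(self, channel, frame):
        self.queues[channel].append(frame)        # queue, don't transmit yet

    def grant(self, channel, n):
        self.credits[channel] += n                # receiver drained its buffer

    def pump(self):
        """Transmit every queued frame currently allowed by credits."""
        sent = []
        for ch, q in self.queues.items():
            while q and self.credits[ch] > 0:
                sent.append((ch, q.popleft()))
                self.credits[ch] -= 1
        return sent

mux = Multiplexer({"data": 1, "command": 4})
for i in range(3):
    mux.send("data", f"chunk-{i}")
mux.send("command", "ACK")
# First pump: the data channel stalls after one frame (out of credit),
# but the command goes through anyway -- no cross-channel blocking.
# Later, grant("data", n) lets the remaining chunks flow.
```

The buggy version is the degenerate case of this: one shared credit (the TCP window) for all channels, so whichever channel fills it starves the rest.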
- Rendered pages built from mesh-data-synchronised components, similar to Dropbox-style synced files but with a mesh of 1000s of peers, showing flashes of inconsistent data: tables whose columns should always add to 100% showing a different total (e.g. "110% (11050 of 10000) devices online"), displayed addresses showing the wrong country, numbers of devices exceeding the total number shipped, devices showing error flags yet also a "green - all good" indication, the number of comments not matching the shown comments, the number of rows not matching the rows in a table, etc. Usually this lasted only a few seconds, but it could stay on screen for a long time if the 3G network went down, or permanently if captured in a rendered PDF report. Such glitches made the underlying systems look like they had a lot of bugs when they really didn't, and completely undermined trust in the presented data being something you could rely on. All for want of a more careful synchronisation protocol.
With documents in general there are common workflows from the paper era that just haven't aged gracefully.
This case, and a bunch of the others, are variations on failing to correctly implement dependency analysis. I'm not saying it's easy (it is far from easy), but this has been part of large systems design (anything that involves complex operations on trees of dependent objects) for years, especially in the networking space.
Indeed, your fourth bullet gets to some of the very ancient techniques (though STP isn't a great example) to address parts of the problem.
The last bullet is very hard. Honestly, I'd be happy if iCloud and Dropbox just got the basics right in the single-writer case and stopped fucking up my cloud-synced .sparsebundle directory trees. I run mtree on all of these and routinely find sync issues in Dropbox and iCloud Drive, ranging from minor (crazy timestamp changes that make no sense and are impossible, though the data is still complete and intact) to serious (one December, Dropbox decided to revert about a third of the files to the previous October's versions).
The single writer case (no concurrency, large gaps in time between writers) _is_ easy and yet they continue to fuck it up. I check every week with mtree and see at least one significant error a year (and since I mirror these to my NAS and offline external storage, I am confident this is not a user error or measuring error).
The root cause of the problem is that .NET's File.Exists has a bug: if there is a filesystem or network error, instead of raising an exception, the error is swallowed and the call just returns false. I'm not sure if newer versions of .NET fix it or not; I only learned about this when we were implementing a driver / filesystem.
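For what it's worth, Python's `os.path.exists` has the same trap: its documentation notes it can return False when the underlying `os.stat` fails (e.g. on permission errors), not just when the file is absent. A sketch of a stricter check that refuses to guess (`exists_strict` is a made-up name):

```python
import os

# os.path.exists reports any stat failure (permissions, I/O error, dead
# network mount) as "does not exist". Calling os.stat directly keeps the
# error, so "can't check" is never mistaken for "definitely absent".

def exists_strict(path):
    """Return True/False only when the answer is trustworthy; re-raise
    anything that isn't a plain 'no such file'."""
    try:
        os.stat(path)
        return True
    except FileNotFoundError:
        return False
    # Any other OSError (EACCES, EIO, ENETDOWN, ...) propagates to the
    # caller -- a sync engine must not treat it as "safe to delete".
```

For a sync engine, the distinction matters exactly because "file is gone" is an actionable signal (propagate the delete) while "couldn't check" must abort the operation.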