Moving out of Amazon Drive

Posted by Jen Tong on July 31, 2017

tl;dr:

I’m moving my data out of Amazon Drive (formerly known as Amazon Cloud Drive). I have 4TB across 1,000,000 files. I’ve struggled to download my data, but I found some tricks to make it easier. Here they are in a handy listicle.

More detail

Every couple of years, someone announces an unlimited capacity cloud storage product targeted at consumers. Then, inevitably, a few jerks with multiple terabytes of data swoop in and ruin the deal for everyone.

I’m one of those jerks. I’ve migrated 4TB of data from one consumer cloud storage provider to another over the course of several years.

With Amazon Drive's prices increasing significantly, I set myself a challenge: download all of my data using only the official sync client. This blog entry describes the lessons I learned from that process.

Tip 1: Redirect the sync folder with a symlink

When you install Amazon Drive, you can select a folder to use for synchronization. On macOS the default is ~/Amazon Drive, and the setup configurator prevents you from pointing it at a removable disk. This is a bummer, because modern computers tend to have smaller, faster boot disks; none of my computers has a boot disk bigger than 1TB.

Changing the configuration file in ~/Library/Application Support/Amazon Cloud Drive seems to make the sync client angry, but there is another way: symbolic links.

# Dangerously stop the sync client with this shell-fu, or just quit from the menu
# (the [A]mazon pattern keeps grep from matching its own process)
$ ps -ef | grep '[A]mazon Drive' | awk '{print $2}' | xargs kill
# Delete the old target (keep the tilde outside the quotes so it expands)
$ rm -rf ~/'Amazon Drive'
# Swap in your removable storage
$ ln -s /Volumes/4tb ~/'Amazon Drive'

The client is happy to sync down to a removable disk behind a symlink, but I had no idea what would happen if the disk was removed while the client was working (I found out… it purges all client metadata, and you have to start over). So, don't do that.
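
Before relaunching the client, it's worth checking that the link actually resolves. A quick sanity check; the app name in the open command is my guess, so adjust it to match your install:

# Confirm the symlink points at the removable disk
$ readlink ~/'Amazon Drive'
/Volumes/4tb
# Relaunch the client (app name may vary)
$ open -a 'Amazon Drive'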

Tip 2: Use an SSD boot disk

tl;dr:

Your sync host computer must have an SSD boot disk. The sync client has very poor performance when managing metadata on spinning disks.

Details

My primary computer is a laptop, and I often carry it with me. This means it's frequently disconnected from the Internet, which makes it a poor sync host.

I had this brilliant idea of dusting off an old Mac mini from 2011, putting the Amazon Drive sync client on it, and letting it churn away for a few days to recover all of my data.

This did not work. First, Amazon Drive spent two days "Preparing". After that, file synchronization proceeded at about 10 files per minute, regardless of file size. There were a few spikes of CPU and network usage, but nothing that explained the glacial pace. At 10 files per minute, my million-odd files would take roughly 70 days; the sync would not finish until early October.

I did what any engineer would do, and whipped out dtrace. A little probing found the problem. The sync client was doing a staggering number of tiny, scattered I/O operations. This probably has something to do with their heavy use of SQLite. Check this out:

~/Library/Application Support/Amazon Cloud Drive$ ls -l
-rw-r--r--   1 mim  eng  758280192 Jul 31 00:58 amzn1.account.MSSM74Z-cloud.db
-rw-r--r--   1 mim  eng      32768 Jul 31 12:00 amzn1.account.MSSM74Z-cloud.db-shm
-rw-r--r--   1 mim  eng  212966952 Jul 31 14:55 amzn1.account.MSSM74Z-cloud.db-wal
-rw-r--r--   1 mim  eng       4096 May 28 14:24 amzn1.account.MSSM74Z-download.db
-rw-r--r--   1 mim  eng      32768 Jul 31 12:00 amzn1.account.MSSM74Z-download.db-shm
-rw-r--r--   1 mim  eng    2171272 Jul 31 14:00 amzn1.account.MSSM74Z-download.db-wal
-rw-r--r--   1 mim  eng        129 May 28 14:25 amzn1.account.MSSM74Z-settings.json
-rw-r--r--   1 mim  eng   81358848 Jul 31 14:56 amzn1.account.MSSM74Z-sync.db
-rw-r--r--   1 mim  eng      65536 Jul 31 14:31 amzn1.account.MSSM74Z-sync.db-shm
-rw-r--r--   1 mim  eng   44982192 Jul 31 14:56 amzn1.account.MSSM74Z-sync.db-wal
-rw-r--r--   1 mim  eng       4096 May 28 14:24 amzn1.account.MSSM74Z-uploads.db
-rw-r--r--   1 mim  eng      32768 Jul 31 12:00 amzn1.account.MSSM74Z-uploads.db-shm
-rw-r--r--   1 mim  eng    2171272 Jul 31 14:00 amzn1.account.MSSM74Z-uploads.db-wal
-rw-r--r--   1 mim  eng        352 Jul 31 13:01 app-settings.json
-rw-r--r--   1 mim  eng        368 May 28 14:24 refresh-token
-rw-r--r--   1 mim  eng         32 May 28 14:23 serial-number
~/Library/Application Support/Amazon Cloud Drive$ sqlite3 amzn1.account.MSSM74Z-cloud.db 'select count(*) from nodes;'
1077668
~/Library/Application Support/Amazon Cloud Drive$

Yeah, that’s over a gigabyte of SQLite databases! Some tables have more than a million records. Count queries take a few seconds, and toggling an option in the client can sometimes trigger millions of SQLite queries across multiple databases. This had the read head of my spinning disk thrashing back and forth. Fortunately, random access penalties are much lower on SSDs.
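
If you want to watch the I/O pattern yourself, a dtrace one-liner along these lines prints a histogram of read and write sizes. This is a sketch rather than the exact probe I used; the execname match is an assumption (match it to whatever ps shows), and System Integrity Protection limits dtrace on recent macOS:

# Histogram of read/write sizes issued by the sync client
# (execname is a guess; SIP may block dtrace on newer macOS)
$ sudo dtrace -n 'syscall::read:entry, syscall::write:entry
    /execname == "Amazon Drive"/ { @[probefunc] = quantize(arg2); }'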

Tip 3: Take smaller bites

The client is more stable when it syncs fewer files in one batch. Sync at most 100,000 files at a time, let that batch finish, and then sync another.

If you try to sync too many files at once, the client becomes CPU- and memory-hungry, slows down, and grows unstable. If the sync request covers more than 1,000,000 files, the client may start crashing on launch. Once that happens, you must delete the SQLite databases and start over.
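
To size those batches, it helps to know how many files live under each top-level folder. Here's a rough sketch that counts them; it assumes you have a local copy of the same tree to measure (for example, a previous partial download), and it reuses the /Volumes/4tb path from the symlink trick:

# Count files per top-level folder to plan batches under 100,000
$ for d in /Volumes/4tb/*/; do
    printf '%8d  %s\n' "$(find "$d" -type f | wc -l)" "$d"
  done | sort -rn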

Tip 4: Don’t copy files into the sync path

Don’t copy files into the sync client’s target path. That means no trying to help it along by copying in previous partial download attempts. Let the client sync every file down itself.

Copying files into the sync path confused my sync client, and it deleted a bunch of stuff from Amazon Drive. If you suspect this has happened, don’t panic: you have a few days to restore files from the web interface. Sign in, navigate to the trash, and restore the deleted files from there.

Conclusion

At this pace, I’ll be able to get all of my data out before the new rates hit for me. Yay!

In retrospect, I should have written my own sync client against the API, or tried to get the possibly-banned rclone client working. Still, I enjoyed the adventure of exploring how the sync client works.

With this migration wrapping up, I’ve given up on consumer cloud storage products. They’re too painful to use for large volumes of data. It’s time to switch to an enterprise storage product so I can use real APIs to move data around, and benefit from SLAs and deprecation policies.
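
For example, once everything is on the removable disk, pushing it into Google Cloud Storage (one candidate destination, per the update below) is a one-liner with gsutil; the bucket name here is made up:

# Parallel, restartable upload of the whole tree (bucket name is hypothetical)
$ gsutil -m rsync -r /Volumes/4tb gs://my-archive-bucket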

Update

I shared this post around, and got some great feedback on r/DataHoarder, the subreddit for people who laugh at my meager 4TB of accumulated data.

Here are their proposed solutions:

  • The Syncovery client supports Amazon Drive. The interface is a bit complicated, but it actually works! I was able to slurp my data down using the trial, and plan to purchase a real license next time I need to cart my data around.
  • The Amazon Drive client runs on Windows Server. So, if the final home of my data is Google Cloud Storage, I could run the client on a Windows virtual machine in the same cloud.

Thanks for the advice Redditors! :)