Over 7 million files?
As we were pondering the OnBase move, one of the first considerations was how to move over five years’ worth of documents from our local data center to our primary data center in AWS. The challenge wasn’t massive in big data terms: 7 million files, 2 terabytes. But per-file transfer overhead adds up at that scale: it is far more efficient to move one big file than to copy millions of tiny ones.
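To put that overhead in perspective with some purely illustrative numbers: at even 50 ms of per-file setup cost, 7 million files add roughly 7,000,000 × 0.05 s ≈ 97 hours of overhead before a single byte of payload has moved; one large archive pays that cost once.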
Since OnBase is a Windows-centric platform, retaining Windows file permissions (NTFS ACLs) as part of the data transfer was a prime concern. Fundamentally, there were two ways to approach this: trickle the data out using multiple robocopy threads, or do a bulk data transfer.
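If you go the trickle route, the heart of it is just a long-running, multi-threaded robocopy that carries the NTFS security information along. A minimal sketch, wrapped in Python for logging and error handling, with hypothetical share paths:

```python
import subprocess

# Hypothetical UNC paths -- substitute the real source share and the AWS-side CIFS share.
SOURCE = r"\\onprem-fs\onbase\diskgroups"
DEST = r"\\aws-cifs\onbase\diskgroups"

cmd = [
    "robocopy", SOURCE, DEST,
    "/MT:32",    # 32 copy threads
    "/E",        # include subdirectories, even empty ones
    "/COPYALL",  # copy data, attributes, timestamps, NTFS ACLs, owner, auditing info
    "/R:2", "/W:5",                # retry twice, wait 5 seconds between attempts
    "/LOG+:robocopy-trickle.log",  # append to a running log
]

# Robocopy exit codes of 8 or higher indicate failures.
result = subprocess.run(cmd)
if result.returncode >= 8:
    raise RuntimeError(f"robocopy reported failures (exit code {result.returncode})")
```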
A word on CIFS in AWS
A brief aside on providing CIFS storage in AWS: a straightforward way to get CIFS is to run Windows Server 2012 with Microsoft’s Distributed File System (DFS).
You could also use a product from a company such as Avere or Panzura to present CIFS. These products are storage caching devices that use RAM, EBS, and S3 in a tiered fashion, serving as a translation layer between S3 object storage and CIFS. Our current configuration makes use of Panzura, striped EBS volumes, and S3.
So, let’s move this data
The initial goal was to get the data from its current CIFS system to a CIFS environment in AWS with all relevant metadata in place. We evaluated a number of options, including:
- Zipping up the directory structure and using AWS Snowball for the transfer.
- Zipping up the directory structure and using S3 multipart upload to pump the data into S3.
- Robocopying to local storage on a virtual machine, using Windows backup to get to a single file, transmitting that backup file to S3, copying the file to a local EBS volume, and finally restoring.
- Using NetBackup to back up to S3 and then restore to EC2.
- Zipping the file structure, gathering metadata with icacls, transmitting to S3, copying from S3 to EBS, and restoring (see the sketch after this list).
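To illustrate the last option, here is a rough sketch of the prepare-and-package step: capture the NTFS ACLs with icacls, then zip the directory tree into a single archive. The paths and file names are hypothetical.

```python
import shutil
import subprocess

# Hypothetical paths -- substitute the real OnBase disk group location and a staging area.
SOURCE_DIR = r"D:\OnBase\DiskGroups"
ACL_FILE = r"D:\staging\onbase-acls.txt"
ARCHIVE_BASE = r"D:\staging\onbase-diskgroups"  # shutil appends .zip

# Save the NTFS ACLs for the whole tree (/t = recurse, /c = continue on errors).
# On the AWS side they can be replayed with: icacls <target> /restore <aclfile>
subprocess.run(
    ["icacls", SOURCE_DIR, "/save", ACL_FILE, "/t", "/c"],
    check=True,
)

# Package the directory structure into one archive so we ship a single large
# object instead of millions of tiny files.
archive_path = shutil.make_archive(ARCHIVE_BASE, "zip", root_dir=SOURCE_DIR)
print(f"Packaged {SOURCE_DIR} into {archive_path}")
```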
The robocopy approach would take weeks to transfer all of the data, and we immediately started trickling the data out that way. That said, the team was interested in seeing whether the transfer could be compressed to fit within an outage weekend.
Snowball
So we tested Snowball. The first try didn’t go so well. The Snowball was dead on arrival, and we had to get a second one shipped to us. Ultimately, it worked, but it was overkill for the volume of data we needed to move.
Zip, transmit, unzip
We broke down the transfer process into three basic steps:
- Prepare and package the data
- Transmit the data
- Rehydrate the data
The transmit was the easy part. We have a 10 Gbps network connection and were able to use multipart upload to pump data to S3 at 6 Gbps, transmitting a 200 GB test file in under an hour.
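This is essentially what boto3’s managed transfer gives you out of the box. A rough sketch of pushing a packaged archive to S3 with multipart upload and plenty of concurrency, using a hypothetical bucket and key:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Hypothetical bucket, key, and local path -- substitute your own.
BUCKET = "onbase-migration-staging"
KEY = "onbase-diskgroups.zip"
ARCHIVE = r"D:\staging\onbase-diskgroups.zip"

# Multipart upload with 100 MB parts and up to 20 parts in flight at once.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=20,
)

s3 = boto3.client("s3")
s3.upload_file(ARCHIVE, BUCKET, KEY, Config=config)
print(f"Uploaded {ARCHIVE} to s3://{BUCKET}/{KEY}")
```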
Reality Check
While discussing the different packaging/rehydration options, we talked a bit more about how OnBase actually works. It turns out it manages files a bit like a rotating log: it writes to a directory, then switches to a new directory and starts writing to that. Once the files in a directory are written, that directory essentially becomes read-only.
That took the time pressure off the bulk of our data: we could robocopy it out at our leisure. On cutover weekend, we will migrate only the directory (or directories) OnBase is still actively writing to.
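For the cutover itself, only those still-active directories need to move. A sketch of that final sync, assuming (hypothetically) that the active directories can be picked out by modification time:

```python
import subprocess
import time
from pathlib import Path

# Hypothetical UNC paths and cutoff -- substitute real shares and a sensible window.
SOURCE = Path(r"\\onprem-fs\onbase\diskgroups")
DEST = Path(r"\\aws-cifs\onbase\diskgroups")
CUTOFF = time.time() - 30 * 24 * 3600  # directories touched in the last 30 days

for sub in SOURCE.iterdir():
    # Everything older has already been trickled out and is effectively read-only;
    # only the still-active directories need to move during the outage window.
    if sub.is_dir() and sub.stat().st_mtime >= CUTOFF:
        subprocess.run([
            "robocopy", str(sub), str(DEST / sub.name),
            "/MT:32", "/E", "/COPYALL", "/R:2", "/W:5",
        ])
```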
Problem solved.
What did we learn?
- We can move data really quickly if we need to
- CIFS on AWS is feasible, but the two aren’t a match made in heaven
- You really need a comprehensive understanding of how your application works when you plan for any migration
- With a full understanding of what we needed to do, the simple, slow, tortoise approach of trickling data with robocopy met our needs.
Stay tuned for adventures in load balancing!