Monday Morning, Start Again

Happy Monday!

[Image: "Monday morning, start again." Thanks to http://www.usedave.com for the artwork!]

It’s another beautiful Monday morning here in the Midwest.  It’s sunny, hot, and humid.  People are at their desks, working away.

I took a look at OnBase user activity, and it is normal for this time of year.  People are making updates to student files, annotating, making decisions, doing their jobs.  Just another normal Monday morning – nothing has changed.

Except for the fact that our OnBase platform is running in AWS!!!!  Massive kudos to the team for an excellent job, a well-designed plan, and a beautifully transparent migration.  Zero user impact apart from the negotiated outage on Saturday.

On to the next thing – just another normal Monday morning at #NDCloudFirst.

OnBase to AWS: Database Migration

As part of migrating OnBase to AWS, we also had to migrate the database itself.  This proved to be far less complicated than moving all of the document files, something of an anticlimax.

A Bit of History

When we first implemented OnBase, we chose Oracle as the database engine.  In 2014, we switched to Microsoft SQL Server, both to improve operational stability and to align with Hyland's main line of development.  We felt good about having that work done before our move to AWS.

It’s important to understand what purpose the OnBase database serves.  It contains the metadata and user annotations, while the underlying documents reside on a file system.  As such, the database doesn’t take up much space.  A compressed backup of our database is only 16 GB.

Transfer the Data

Transferring a 16 GB file to AWS is a relatively straightforward task.  The most important feature to be aware of is multipart upload, which splits a single large file into multiple parts and uploads those parts to S3 in parallel.

Initially, we looked into using CrossFTP, a third-party GUI tool that lets you specify parameters for the S3 copy, including thread count and chunk size.  CrossFTP simply wraps what you can do natively with the AWS Command Line Interface (CLI) in a graphical front end, which hides the messiness of the command line and is quite approachable for a novice.  That wasn't necessary in our case, though; we just used the AWS CLI to transfer the data to S3.  It couldn't be much more straightforward, as the syntax is:

aws s3 cp <fileName> s3://<bucketName>/<fileName>

In order to get the performance to an acceptable level, we made the following configuration adjustments in the ~/.aws/config file:

[profile xferonbase]
aws_access_key_id=<serviceId>
aws_secret_access_key=<serviceKey>
s3 =
  max_concurrent_requests = 40
  multipart_threshold = 512MB
  multipart_chunksize = 512MB

Appropriately configured, transferring our 16 GB database backup took just under 17 minutes.  Once the file was in S3, we simply transferred it to our SQL Server database instance and performed a restore.  Simple and easy.  Highly advisable if you have the luxury of bringing your system entirely down when migrating.
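
For completeness, the round trip looks roughly like the sketch below.  The bucket, file, and path names are placeholders, not our actual environment, and the restore itself is ordinary SQL Server T-SQL run on the EC2 instance:

# Upload from on premises, using the profile configured above
aws s3 cp OnBase_Full.bak s3://onbase-migration/OnBase_Full.bak --profile xferonbase

# On the SQL Server EC2 instance, pull the file back down
aws s3 cp s3://onbase-migration/OnBase_Full.bak D:\backups\OnBase_Full.bak

# Then restore (via sqlcmd or SSMS):
# RESTORE DATABASE OnBase FROM DISK = 'D:\backups\OnBase_Full.bak' WITH RECOVERY;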

A look to the future

In fact, the simplicity of this backup/transfer/restore approach is influencing how we think about migrating our ERP, Banner, which runs on Oracle.  While the database itself is a bit over 1 TB, the compressed backup is 175 GB.  At roughly the rate we saw with OnBase (about 1 GB per minute), that's conservatively a three-hour window to get the file to S3.  Once the data is in S3, we'll use an S3 VPC endpoint for private connectivity and speed when copying it down to EC2.

Thinking about migration weekend, the backup, validate, copy to S3, copy to EC2, restore sequence is easy to script and test, and it's a great way to build in sleep time.  We've all done enough migrations to know that people don't make good decisions when they are exhausted.
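
As a rough outline only, with placeholder file, bucket, and profile names rather than our real ones, the scripted flow might look something like this:

#!/bin/bash
set -e

BACKUP=banner_full.bkp        # placeholder backup file name
BUCKET=s3://erp-migration     # placeholder bucket

# Validate: checksum the backup before it leaves the building
sha256sum "$BACKUP" > "$BACKUP.sha256"

# Copy to S3 (multipart settings come from ~/.aws/config, as shown earlier)
aws s3 cp "$BACKUP" "$BUCKET/" --profile xferonbase
aws s3 cp "$BACKUP.sha256" "$BUCKET/" --profile xferonbase

# On the EC2 side: copy both files down, run 'sha256sum -c', then restore.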

OnBase to AWS: Adventures in Load Balancing

Good to have Goals

One architectural goal we had in mind as we designed our OnBase implementation in AWS was to break out the single, collapsed web and application tier into distinct web and application tiers.

Here’s a simplification of what the design of our on-premises implementation looked like:

[Diagram: SBN Existing - 2]

Due to licensing considerations, we wanted to retain two, right-sized web servers behind an Elastic Load Balancer (ELB).  We then wanted to have that web tier interact with a load balanced application tier.  Since licensing at the application tier is very favorable, we knew we could horizontally scale as necessary to meet our production workload.

To ELB, or not to ELB, that is the question

Our going-in assumption was that we would use a pair of ELBs, and it was remarkably quick to configure: in under an hour, we had built one ELB to front the web servers and another to front the application servers.  It was so easy to set up.  Too easy, as it turned out.
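
To give a sense of how little effort that takes, a classic ELB can be stood up with a couple of CLI calls.  The names, subnets, security group, and instance IDs below are placeholders, not our actual configuration:

aws elb create-load-balancer --load-balancer-name onbase-web \
    --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
    --subnets subnet-aaaa1111 subnet-bbbb2222 --security-groups sg-cccc3333

aws elb register-instances-with-load-balancer --load-balancer-name onbase-web \
    --instances i-0aaaa1111 i-0bbbb2222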

During functional testing, we observed very strange session collisions.  We worked with our partners, looked at configuration settings, and did quite a bit of good, old-fashioned shovel work.  The big sticking point turned out to be how we have OnBase authentication implemented.  Currently, we use Windows Challenge/Response (NTLM, a protocol dating back to the days of NT LAN Manager) to authenticate users.  The problem is that NTLM authentication does not work across an HTTP proxy, because it requires a point-to-point connection between the user's browser and the server.  See https://commscentral.net/tech/?post=52 for an explanation of NTLM via an ELB.

When an IIS server receives a request, it sends a 401 response code (authentication required) and keeps the HTTP connection alive; an HTTP proxy, however, closes the connection after receiving a 401.  So in order to proxy this traffic successfully, the proxy needs to be configured to proxy TCP, not HTTP, because NTLM authenticates the TCP connection itself.

Enter HAProxy

In order to get past this obstacle, we ended up standing up an EC2 instance running HAProxy and configuring it to balance TCP and stick sessions to back-end servers.  Properly configured, we were cooking with gas.
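
As a rough sketch of that configuration (server names and addresses are placeholders, not our actual hosts), the relevant HAProxy pieces look something like this:

frontend onbase_app_front
    bind *:80
    mode tcp
    default_backend onbase_app_back

backend onbase_app_back
    mode tcp
    balance roundrobin
    # keep each client on the same application server so the
    # NTLM-authenticated TCP connection stays point-to-point
    stick-table type ip size 200k expire 30m
    stick on src
    server app1 10.0.1.10:80 check
    server app2 10.0.2.10:80 check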

A word on scheduling

For our production account, we make extensive use of reserved instances.  On a monthly basis, 85% of our average workload is covered by a reservation.  For OnBase, we use t2.large running Windows for our application servers.

Currently, a one-year reservation costs $782.  If you were to run the same instance at the on demand rate of $0.134 per hour, the cost for a year would be $1174.  For an instance which needs to be on 24×7, reservations are great.  For a t2.large, we end up saving just over 33%.

Due to application load characteristics and the desire to span availability zones, we ended up with four application servers.  However, from a workload perspective, we only need all four during business hours: 7 am to 7 pm.  So: 52 weeks per year, 5 days per week, 12 hours per day, at $0.134 per hour.  That comes out to just over $418, a savings of over 64% compared to running on demand around the clock.  If you can schedule, by all means, do it!
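
The mechanics of scheduling are simple.  As an illustrative sketch rather than our exact tooling, a pair of cron entries driving the AWS CLI (with placeholder instance IDs) is enough to get started:

# 7:00 am weekdays: bring the additional application servers up
0 7 * * 1-5   aws ec2 start-instances --instance-ids i-0aaaa1111 i-0bbbb2222

# 7:00 pm weekdays: shut them back down
0 19 * * 1-5  aws ec2 stop-instances --instance-ids i-0aaaa1111 i-0bbbb2222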

Simplified final design

So, where did we end up?  Consider the following simplified diagram:

[Diagram: SBN Existing - 1]

The web servers are in separate availability zones, as are the application servers.  There are still single points of failure in the design, but we are confident in our ability to recover within a timeframe acceptable to our customers.

So, what did we learn?

We learned quite a bit here, including:

  1. NTLM is not our friend.  Replacing it with SAML/CAS this summer will allow us to jettison the HAProxy instance and replace it with an ELB.
  2. Scheduling is important.  Reserved Instances optimize your spend, but they don’t optimize your usage.  You’re leaving a lot of money on the table if you’re not actively involved in scheduling your instances, which you can only do if you have a deep understanding of your system’s usage profile.
  3. Working in AWS is a lot of fun.  If you can imagine it, it’s really easy to build, prototype, and shift as appropriate.

Coming soon, migrating the OnBase database.

OnBase to AWS: Move the Data

Over 7 million files?

As we were pondering moving OnBase, one of the first considerations was how to move more than five years' worth of documents from our local data center to our primary data center in AWS.  The challenge wasn't massive in big-data terms: 7 million files, 2 terabytes.  Because of per-file transfer overhead, though, it is far more efficient to move one big file than to copy millions of tiny files.

[Image: Over 7 million files! By Kbh3rd (Own work), CC BY 4.0 (http://creativecommons.org/licenses/by/4.0), via Wikimedia Commons]

[Image: BigBoulder: 7 million documents zipped into one large file.]

Since OnBase is a Windows-centric platform, the need to retain Windows file permissions as part of the data transfer was of prime concern.  Fundamentally, there were two ways to approach this: trickling the data out using multiple robocopy threads, or doing a bulk data transfer.

A word on CIFS in AWS

A brief aside: a straightforward way to provide CIFS storage in AWS is to use Windows Server 2012 and Microsoft's Distributed File System (DFS).

For more on the options, see "NFS and CIFS Options for AWS" from AWS re:Invent 2013: http://www.slideshare.net/AmazonWebServices/nfs-and-cifs-options-for-aws-stg401-aws-reinvent-2013/20

You could also use a product from a company such as Avere or Panzura to present CIFS.  These products are storage caching devices that use RAM, EBS, and S3 in a tiered fashion, serving as a translation layer between S3 object storage and CIFS.  Our current configuration makes use of Panzura, striped EBS volumes, and S3.

So, let’s move this data

Robocopy versus bulk transfer

The initial goal was to get the data from its current CIFS system to a CIFS environment in AWS with all relevant metadata in place.  We evaluated a number of options, including:

  1. Zip the directory structure and use AWS Snowball for the transfer.
  2. Zip the directory structure and use S3 multipart upload to pump the data into S3.
  3. Robocopy to local storage on a virtual machine, use Windows Backup to get to a single file, transmit that backup file to S3, copy it to a local EBS volume, and restore.
  4. Use NetBackup to back up to S3 and then restore to EC2.
  5. Zip the file structure, gather permissions metadata with icacls, transmit to S3, copy from S3 to EBS, and restore.

The robocopy approach would take weeks to transfer all of the data, and we immediately started trickling the data out that way.  That said, the team was interested in seeing whether we could compress the data transfer time enough to fit within an outage weekend.
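
For context, the trickle looked roughly like the following.  The paths, thread count, and file names are placeholders, not our actual disk groups:

:: Multi-threaded copy that preserves NTFS security information
robocopy \\onprem-fs\onbase\dg01 \\aws-cifs\onbase\dg01 /E /COPYALL /MT:32 /R:2 /W:5 /LOG+:dg01.log

:: ACLs can also be captured and re-applied separately with icacls
icacls D:\onbase\dg01 /save dg01_acls.txt /t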

Snowball

1 PB of capacity…overkill for 2 TB

So we tested Snowball.  The first try didn’t go so well.  The Snowball was dead on arrival, and we had to get a second one shipped to us.  Ultimately, it worked, but it was overkill for the volume of data we needed to move.

Zip, transmit, unzip

We broke down the transfer process into three basic steps:

  1. Prepare and package the data
  2. Transmit the data
  3. Rehydrate the data

The transmit was the easy part.  We have a 10 Gbps network connection, and we were able to use multipart upload to push data to S3 at 6 Gbps, transmitting a 200 GB test file in under an hour.

10 Gbps network connection + S3 multipart upload == speedy transfer

Reality Check

While discussing the different packaging/rehydration options, we talked a bit more about how OnBase actually works.  It turns out that it manages files a bit like a rotating log: it writes to a directory, then switches to a new directory and starts writing to that one.  Once a directory's files are written, the old directory essentially becomes read-only.

That took the time pressure off the bulk of our data, which we could robocopy out at our leisure.  On cutover weekend, we will migrate only the directory or directories still being actively written.

Problem solved.

What did we learn?

  1. We can move data really quickly if we need to.
  2. CIFS in AWS is feasible, but it's not a match made in heaven.
  3. You really need a comprehensive understanding of how your application works when you plan any migration.
  4. With a full understanding of what we needed to do, the simple, slow, tortoise approach of trickling data with robocopy met our needs.

Stay tuned for adventures in load balancing!

A Microsoft-centric platform in AWS?

At Notre Dame, we use OnBase by Hyland Software as our Enterprise Content Management platform.  Hyland has been a Microsoft Certified Partner for almost two decades, and has optimized its software for Windows, so OnBase is very much a Windows-centric product.

If you’ve been following this blog, you know that Notre Dame has selected Amazon Web Services as our infrastructure provider of the future.  Linux-based systems fit nicely in the automation sweet spot that the AWS tools provide.  Windows systems can be a bit more of a challenge.

Nevertheless, our team was excited about the move.  We knew there would be some challenges along the way, including the migration of 2 TB of data from campus to AWS.  The challenges were offset by the opportunities, including the longstanding architectural goal of a load-balanced web tier communicating with a load-balanced application tier, as opposed to the collapsed web and application tier we were running on campus.

We did a quick proof of concept exercise in the fall of 2015, then started building out our full test environment early in 2016.

Stay tuned for details of our struggles, learnings, and ultimately successful migration.