Another quiet weekend

Except for one little thing.  The amazing team that is the OIT at Notre Dame moved our ERP and associated systems to AWS.  In less than 24 hours.

We are the OIT at ND!

We planned for years.  We prepared.  We were conservative with our time estimates.  Instead of working through the weekend, everyone was home having dinner by 5 pm on Friday.

I can’t say enough about the team.  I would also be remiss if I didn’t mention the fantastic support we’ve been fortunate to receive from our partners at AWS.  It’s been a four-year journey to get here, and we’re not done yet.

A few more services to move, and then we can set our sights on the fun things, like batch workload refactoring, the spot market, and cost optimization!

OnBase to AWS: Database Migration

As part of migrating OnBase to AWS, we also had to migrate the database itself.  This proved to be a far less complicated task than moving all of the document files, something of an anticlimax.

A Bit of History

When we first implemented OnBase, we chose to use Oracle as the database engine. In order to improve operational stability, we switched database platforms to Microsoft SQL Server in 2014.  Our primary reason for the switch was to align us with Hyland’s main line of development.  We felt good about getting that work done prior to our move to AWS.

It’s important to understand what purpose the OnBase database serves.  It contains the metadata and user annotations, while the underlying documents reside on a file system.  As such, the database doesn’t take up much space.  A compressed backup of our database is only 16 GB.

Transfer the Data

Transferring a 16 GB file to AWS is a relatively straightforward task.  The most important feature to be aware of is multipart upload.  This feature splits a single large file into multiple parts, which are then uploaded to S3 in parallel.

Initially, we looked into using CrossFTP.  CrossFTP is a third-party GUI tool that allows you to specify parameters for the S3 copy command, including thread count and chunk size.  That said, CrossFTP simply wraps what you can do natively with the AWS Command Line Interface (CLI) in a graphical front end.  The GUI abstracts away the messiness of the command line and is quite approachable for a novice, but that wasn’t necessary in this case.  We just used the AWS CLI to transfer the data from here to S3.  It couldn’t be much more straightforward, as the syntax is:

aws s3 cp <fileName> s3://<bucketName>/<fileName>

In order to get the performance to an acceptable level, we made the following configuration adjustments in the ~/.aws/config file:

[profile xferonbase]
aws_access_key_id=<serviceId>
aws_secret_access_key=<serviceKey>
s3 =
  max_concurrent_requests = 40
  multipart_threshold = 512MB
  multipart_chunksize = 512MB
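
With that profile in place, the upload itself is a single CLI call.  For example (the file and bucket names below are placeholders, not our actual values):

aws s3 cp OnBaseBackup.bak s3://<bucketName>/OnBaseBackup.bak --profile xferonbase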

Appropriately configured, transferring our 16 GB database backup took just under 17 minutes.  Once the file was in S3, we simply copied it to our SQL Server database instance and performed a restore.  Simple and easy, and highly advisable if you have the luxury of taking your system entirely down during the migration.

A look to the future

In fact, the simplicity of this backup/transfer/restore approach is influencing how we think about migrating our ERP.  We use Banner running on Oracle.  While our database is a bit over 1 TB, the compressed backup is 175 GB.  At the rate we saw for the OnBase backup (16 GB in roughly 17 minutes), that is conservatively a 3-hour window to get the file into S3.  Once the data is in S3, we will use an S3 VPC endpoint for private connectivity and speed.

Thinking about migration weekend, the backup, validate, copy to S3, copy to EC2, restore process is easy to script and test.  It is also a great way to build in sleep time: we’ve all done enough migrations to know that people don’t make good decisions when they are exhausted.
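
As an illustration, a cutover script along these lines can be rehearsed repeatedly before the real weekend.  This is only a sketch under assumed names: the backup file, bucket, and restore step are placeholders and depend on the database engine and environment.

#!/bin/bash
# Hypothetical cutover sketch: backup -> validate -> S3 -> EC2 -> restore.
set -euo pipefail

BACKUP="banner_$(date +%Y%m%d).bak.gz"     # placeholder backup file name
BUCKET="s3://<bucketName>"                 # placeholder bucket

# 1. Create the backup (engine-specific, e.g. RMAN or expdp for Oracle) and checksum it
md5sum "$BACKUP" > "$BACKUP.md5"

# 2. Push both files to S3 using the tuned multipart profile
aws s3 cp "$BACKUP" "$BUCKET/$BACKUP" --profile xferonbase
aws s3 cp "$BACKUP.md5" "$BUCKET/$BACKUP.md5" --profile xferonbase

# 3. On the EC2 database host, pull the backup down over the S3 VPC endpoint and re-verify
aws s3 cp "$BUCKET/$BACKUP" .
aws s3 cp "$BUCKET/$BACKUP.md5" .
md5sum -c "$BACKUP.md5"

# 4. Restore (engine-specific; RESTORE DATABASE for SQL Server, RMAN/impdp for Oracle)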

OnBase to AWS: Adventures in Load Balancing

Good to have Goals

One architectural goal we had in mind as we designed our OnBase implementation in AWS was to break out the single, collapsed web and application tier into distinct web and application tiers.

Here’s a simplification of what the design of our on-premises implementation looked like:

(Diagram: simplified on-premises design, with a single collapsed web and application tier)
Due to licensing considerations, we wanted to retain two right-sized web servers behind an Elastic Load Balancer (ELB).  We then wanted that web tier to interact with a load-balanced application tier.  Since licensing at the application tier is very favorable, we knew we could scale horizontally as necessary to meet our production workload.

To ELB, or not to ELB, that is the question

Our going-in assumption was that we would use a pair of ELBs, and they were remarkably quick to configure.  In under an hour, we had built an ELB to front the web servers and another to front the application servers.  It was so easy to set up.  Too easy, as it turned out.

During functional testing, we observed very strange session collisions.  We worked with our partners, looked at configuration settings, and did quite a bit of good, old-fashioned shovel work.  The big sticking point turned out to be how we have implemented OnBase authentication.  Currently, we use Windows Challenge/Response (NTLM, dating back to the days of NT LAN Manager) to authenticate users.  The problem is that NTLM authentication does not work across an HTTP proxy, because it requires a point-to-point connection between the user’s browser and the server.  See https://commscentral.net/tech/?post=52 for an explanation of NTLM via an ELB.

When an IIS server gets a request, it sends a 401 response code (authentication required) and keeps the HTTP connection alive.  An HTTP proxy closes the connection after receiving a 401.  So in order to proxy this traffic successfully, the proxy must be configured to proxy TCP, not HTTP; NTLM authenticates the TCP connection.

Enter HAProxy

In order to get past this obstacle, we stood up an EC2 instance running HAProxy and configured it to balance TCP and stick sessions to back ends.  Properly configured, we were cooking with gas.
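
For the curious, a minimal haproxy.cfg sketch illustrates the idea; the names, ports, and addresses below are illustrative rather than our actual configuration.

defaults
    mode tcp
    timeout connect 5s
    timeout client  1m
    timeout server  1m

frontend onbase_app_in
    bind *:80
    default_backend onbase_app_servers

backend onbase_app_servers
    balance roundrobin
    # Stick each client IP to the same application server so NTLM's
    # point-to-point authentication of the TCP connection keeps working.
    stick-table type ip size 200k expire 30m
    stick on src
    server app1 10.0.1.10:80 check
    server app2 10.0.2.10:80 check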

A word on scheduling

For our production account, we make extensive use of reserved instances.  On a monthly basis, 85% of our average workload is covered by a reservation.  For OnBase, we use t2.large running Windows for our application servers.

Currently, a one-year reservation costs $782.  Running the same instance at the on-demand rate of $0.134 per hour would cost $1174 for the year.  For an instance that needs to be on 24×7, reservations are great; for a t2.large, we end up saving just over 33%.

Due to application load characteristics and the desire to span availability zones, we ended up with four application servers.  From a workload perspective, however, we only need all four during business hours: 7 am to 7 pm.  So: 52 weeks per year, 5 days per week, 12 hours per day, at $0.134 per hour.  That comes out to just over $418, a savings of over 64% versus running on demand around the clock.  If you can schedule, by all means, do it!
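
As a sketch of what that scheduling can look like in practice (the instance IDs below are placeholders, and more robust options such as Lambda-based schedulers exist), a pair of cron entries on an administrative host is enough:

# Start the extra application servers at 7 am, Monday through Friday...
0 7 * * 1-5  aws ec2 start-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210
# ...and stop them again at 7 pm.
0 19 * * 1-5 aws ec2 stop-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210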

Simplified final design

So, where did we end up?  Consider the following simplified diagram:

(Diagram: simplified final design, with ELB-fronted web servers and an HAProxy-balanced application tier spread across availability zones)

The web servers are in separate availability zones, as are the application servers.  There are still single points of failure in the design, but we are confident in our ability to recover within a timeframe acceptable to our customers.

So, what did we learn?

We learned quite a bit here, including:

  1. NTLM is not our friend.  Replacing it with SAML/CAS this summer will allow us to jettison the HAProxy instance and replace it with an ELB.
  2. Scheduling is important.  Reserved Instances optimize your spend, but they don’t optimize your usage.  You’re leaving a lot of money on the table if you’re not actively involved in scheduling your instances, which you can only do if you have a deep understanding of your system’s usage profile.
  3. Working in AWS is a lot of fun.  If you can imagine it, it’s really easy to build, prototype, and shift as appropriate.

Coming soon, migrating the OnBase database.

OnBase to AWS: Move the Data

Over 7 million files?

As we were pondering moving OnBase, one of the first considerations was how to move over five years’ worth of documents from our local data center to our primary data center in AWS.  The challenge wasn’t massive in big data terms: 7 million files, 2 terabytes.  Due to per-file transfer overhead, it is far more efficient to transfer one big file than to copy millions of tiny files.

(Images: “Over 7 million files!” by Kbh3rd, CC BY 4.0, via Wikimedia Commons; and one big boulder, captioned “7 million documents zipped into one large file.”)

Since OnBase is a Windows-centric platform, the need to retain Windows file permissions as part of the data transfer was of prime concern.  Fundamentally, there were two ways to approach this: trickling the data out using multiple robocopy threads, or doing a bulk data transfer.

A word on CIFS in AWS

A brief aside on providing CIFS storage in AWS: a straightforward way to get CIFS is to use Windows Server 2012 and Microsoft’s Distributed File System (DFS).

http://www.slideshare.net/AmazonWebServices/nfs-and-cifs-options-for-aws-stg401-aws-reinvent-2013/20


You could also use a product from a company such as Avere or Panzura to present CIFS.  These products are storage caching devices that use RAM, EBS, and S3 in a tiered fashion, serving as a translation layer between S3 object storage and CIFS.  Our current configuration makes use of Panzura, striped EBS volumes, and S3.

So, let’s move this data

Robocopy versus bulk transfer

The initial goal was to get the data from its current CIFS system to a CIFS environment in AWS with all relevant metadata in place.  We evaluated a number of options, including:

  1. Zipping up the directory structure and using AWS Snowball for the transfer.
  2. Zipping up the directory structure and using S3 multipart upload to pump the data into S3.
  3. Robocopying to local storage on a virtual machine, using Windows backup to get to a single file, transmitting that backup file to S3, copying it to a local EBS volume, and finally restoring.
  4. Using NetBackup to back up to S3 and then restore to EC2.
  5. Zipping the file structure, gathering permission metadata with icacls, transmitting to S3, copying from S3 to EBS, and restoring (sketched below).
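
To make option 5 concrete, here is a hedged sketch of the permission-preserving steps, run from PowerShell on the file server.  The paths and bucket name are placeholders, not our actual layout.

# Export NTFS permissions for the whole document tree to a text file.
icacls "D:\OnBase\DiskGroups\*" /save onbase_acls.txt /t /c

# Package the tree, then push the archive and the ACL file to S3
# (the CLI uses multipart upload automatically for large files).
Compress-Archive -Path "D:\OnBase\DiskGroups" -DestinationPath onbase.zip
aws s3 cp onbase.zip s3://<bucketName>/onbase.zip
aws s3 cp onbase_acls.txt s3://<bucketName>/onbase_acls.txt

# On the AWS side, unzip to the same relative location and re-apply the saved permissions.
icacls "D:\OnBase\DiskGroups" /restore onbase_acls.txt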

Using the robocopy approach would take weeks to transfer all of the data.  We immediately started trickling the data out using robocopy.  That said, the team was interested in seeing if it was possible to compress that data transfer time to fit within an outage weekend.

Snowball

1 PB of capacity…overkill for 2 TB

So we tested Snowball.  The first try didn’t go so well.  The Snowball was dead on arrival, and we had to get a second one shipped to us.  Ultimately, it worked, but it was overkill for the volume of data we needed to move.

Zip, transmit, unzip

We broke down the transfer process into three basic steps:

  1. Prepare and package the data
  2. Transmit the data
  3. Rehydrate the data

The transmit was the easy part.  We have a 10 Gb network connection and were able to use multipart upload to push data to S3 at 6 Gbps, transmitting a 200 GB test file in under an hour.

10 Gb network connection + S3 multipart upload == speedy transfer

Reality Check

While discussing the different packaging/rehydration options, we talked a bit more about how OnBase actually works.  It turns out it manages files a bit like a rotating log: it writes to a directory, then switches to a new directory and starts writing to that.  Once the files are written, the old directory essentially becomes read-only.

That took the time pressure off the bulk of our data: we could robocopy it out at our leisure.  On cutover weekend, we will migrate only the currently active directory or directories.

Problem solved.

What did we learn?

  1. We can move data really quickly if we need to.
  2. CIFS on AWS is feasible, but it isn’t a match made in heaven.
  3. You need a comprehensive understanding of how your application works when planning any migration.
  4. With that full understanding in hand, the simple, slow, tortoise approach of trickling data out with robocopy met our needs.

Stay tuned for adventures in load balancing!

A Microsoft-centric platform in AWS?

At Notre Dame, we use OnBase by Hyland Software as our Enterprise Content Management platform.  Hyland has been a Microsoft Certified Partner for almost two decades, and has optimized its software for Windows, so OnBase is very much a Windows-centric product.

If you’ve been following this blog, you know that Notre Dame has selected Amazon Web Services as our infrastructure provider of the future.  Linux-based systems fit nicely in the automation sweet spot that the AWS tools provide.  Windows systems can be a bit more of a challenge.

Nevertheless, our team was excited about the move.  We knew there would be some challenges along the way, including the migration of 2 TB of data from campus to AWS.  The challenges were offset by the opportunities, including the longstanding architectural goal of a load-balanced web tier communicating with a load-balanced application tier, as opposed to the collapsed web and application tier we were running on campus.

We did a quick proof of concept exercise in the fall of 2015, then started building out our full test environment early in 2016.

Stay tuned for details of our struggles, learnings, and ultimately successful migration.

Outreach

One thing that is incredibly fulfilling about being at Notre Dame is its core mission.  In words attributed to Father Sorin, Notre Dame’s primary reason for existence is to be a “powerful force for good” in the world.  In keeping with that mission, we at #NDCloudFirst have done a couple of things to help spread the operational cloud knowledge we are acquiring.

Over the past year or so, we have shared our knowledge in person and via conference calls with our higher education colleagues in both public and private forums.  We’ve talked to over 20 schools, both nationally and internationally, about the work we are doing in this space.  That’s one of the things I really enjoy about working in higher ed IT – the overall level of sharing and inter-institutional collaboration.  Public or private, with or without a medical school, centralized or not, we face the universal constraint of being oversubscribed and facing an ever-increasing demand for services.  I relish the opportunity to share what we’ve learned so far with our colleagues, and it’s always fun to learn from them as they follow their own paths.

In the interest of leveling up the regional knowledge of this space, we started the #INAWS Meetup.  It is an inclusive, public forum, and we welcome anyone who has an interest in sharing challenges and solving problems using AWS.  In concert with our friends from the City of South Bend, Trek10, independent consultants, small business owners, representatives from local IT departments, and graduate students, we have had robust conversations and explored ways to solve disparate challenges using AWS.

Our goal is simply an extension of Father Sorin’s – to educate and empower people to avail themselves of the world’s best infrastructure.  While AWS can seem initially intimidating to the uninitiated, it is remarkably approachable technology.  By adopting AWS, we believe that our regional businesses can experience overall IT service levels that used to exist only in fantasies.

Unencumbered by hardware concerns, our community can focus more energy on improving the quality of city services, building businesses, and fabricating products.

I am remarkably bullish on the future.  We at #NDCloudFirst are working diligently to improve IT services at Notre Dame.  I look forward to continuing to learn and share with our colleagues throughout the region, nation, and world.

AWS assume role with Fog

Amazon Web Services (AWS) provides users with a wealth of services and a suite of ways to keep those services secure.  Unfortunately, software libraries are not always able to keep up with all the new features available.  While wiring up our Rails application to deposit files into an AWS S3 bucket, we ran into such a problem.  AWS has developed an extensive role-based security structure, and as part of the Ruby AWS SDK, a person (or role) can assume another role which may have completely different access.
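
Conceptually, this role assumption is the same operation the AWS CLI exposes as sts assume-role.  A quick, hedged illustration (the account ID, role name, and session name are placeholders) returns temporary credentials scoped to the assumed role:

aws sts assume-role \
    --role-arn arn:aws:iam::123456789012:role/s3-deposit-role \
    --role-session-name rails-s3-deposit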

Our trouble came from the fact that the interface provided by the fog-aws gem does not currently have any way to tell it to assume a different role.  It does provide a way to use a role your server may already have been assigned, and a nice mechanism to re-negotiate your credentials when they are about to expire.  So what do we do when something doesn’t quite work the way we want?  Monkey patch!

Below is a link to a Gist showing how we were able to leverage the things fog-aws already did the way we wanted, while overriding the one thing we needed done differently.

https://gist.github.com/peterwells/39a5c31d934fa8eb0f2c

 

RDS Subnet Groups

If you are having trouble finding your VPC when creating an RDS instance, it is more than likely because you are missing an RDS subnet group.  To remedy this, navigate to RDS > Subnet Groups and click Create DB Subnet Group.  Make sure your subnet group spans all availability zones.
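
The same fix can be scripted from the CLI; a hedged example (the group name, description, and subnet IDs are placeholders):

aws rds create-db-subnet-group \
    --db-subnet-group-name my-db-subnet-group \
    --db-subnet-group-description "Subnets for RDS, spanning all AZs" \
    --subnet-ids subnet-0aaa1111 subnet-0bbb2222 subnet-0ccc3333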



AWS Public Sector Symposium

It has been a while since our last update, so now is the perfect time to give a little perspective as to where we stand with regards to our Amazon initiatives.

Amazon graciously invited Bob and me to present the Notre Dame AWS story at its fifth annual Public Sector Symposium.


We described how we got started with AWS in the context of migrating www.nd.edu 18 months ago.  Over the past 18 months, we have experienced zero service issues in this, our most reliable and resilient service offering.  Zero.

As mentioned in our spring 2014 update, when Notre Dame hosted the Common Solutions Group in May we reviewed progress to date in the context of governance, people and processes, and our initiatives to grow our institutional understanding of how to operate in the cloud.

Bob and I had fun presenting, and were able to give a synopsis of:

  • Conductor Migration:  An ND-engineered content management system that was migrated to AWS in March 2014.  You can read all about Conductor’s operational statistics here.  The short story is that the migration was a massive success: improved performance, reduced cost, increased use of AWS tools (RDS, ElastiCache), and a sense of joy on the part of the engineers involved in the move.
  • ND Mobile App:  An instantiation of the Kurogo framework to provide a revamped mobile experience.  Zero performance issues were associated with the rollout.
  • AAA in the Cloud:  Replicating our authentication store in Amazon.  We use Route 53 to heartbeat against local resources so that in a failure condition, login.nd.edu redirects traffic originating from off campus to the authentication store in Amazon (a minimal sketch of this setup follows the list).  This was tested in June and performed as designed.  Since the preponderance of our main academic applications, including Sakai, Box, and Gmail, are hosted off campus, the result of this effort is that we have separated service availability from local border network incidents.  If campus is offline for some reason, off-campus students, faculty, and staff are still able to use the University’s SaaS services.
  • Backups:  We are in the midst of testing Panzura, a cloud storage gateway product.  Preliminary testing looks promising; look for an update on this ongoing testing soon.
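
For those curious about the Route 53 piece, here is a hedged sketch of the moving parts.  The IP address, hosted zone ID, and file name are placeholders, and the PRIMARY/SECONDARY failover record definitions live in a ChangeBatch JSON document that is omitted here.

# 1. Create a health check that heartbeats the on-campus login endpoint.
aws route53 create-health-check \
    --caller-reference login-primary-check \
    --health-check-config IPAddress=192.0.2.10,Port=443,Type=HTTPS,ResourcePath=/,RequestInterval=30,FailureThreshold=3

# 2. Create PRIMARY/SECONDARY failover records for login.nd.edu, attaching the health
#    check to the PRIMARY record so DNS answers shift to the AWS replica on failure.
aws route53 change-resource-record-sets \
    --hosted-zone-id <zoneId> \
    --change-batch file://login-failover-records.json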

We are absolutely thrilled with the progress being made, and look forward to continuing to push the envelope.  More to come!

Onward!


Why We Test

Despite design principles, despite best intentions, things can go wrong.  This is true even when all of the design parameters are quantifiable, as exhibited by the occasional spectacular Formula 1 engine failure.  What happens when some of the design constraints are not fully understood?  Specifically, consider the following excerpt taken from Amazon’s page detailing instance specifications:

Instance Family | Instance Type | Processor Arch | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | EBS-optimized Available | Network Performance
Storage optimized | i2.xlarge | 64-bit | 4 | 14 | 30.5 | 1 x 800 SSD | Yes | Moderate
Storage optimized | i2.2xlarge | 64-bit | 8 | 27 | 61 | 2 x 800 SSD | Yes | High
Storage optimized | i2.4xlarge | 64-bit | 16 | 53 | 122 | 4 x 800 SSD | Yes | High
Storage optimized | i2.8xlarge | 64-bit | 32 | 104 | 244 | 8 x 800 SSD | – | 10 Gigabit*4
Storage optimized | hs1.8xlarge | 64-bit | 16 | 35 | 117 | 24 x 2,048*3 | – | 10 Gigabit*4
Storage optimized | hi1.4xlarge | 64-bit | 16 | 35 | 60.5 | 2 x 1,024 SSD*2 | – | 10 Gigabit*4
Micro instances | t1.micro | 32-bit or 64-bit | 1 | Variable*5 | 0.615 | EBS only | – | Very Low

10 Gigabit is a measure we can understand.  But what does Moderate mean?  How does it compare to High, or Very Low?  It is possible to find out, but only if we test.

According to Amazon’s page on instance types (emphasis added):

Amazon EC2 allows you to provision a variety of instances types, which provide different combinations of CPU, memory, disk, and networking. Launching new instances and running tests in parallel is easy, and we recommend measuring the performance of applications to identify appropriate instance types and validate application architecture. We also recommend rigorous load/scale testing to ensure that your applications can scale as you intend.

Using LoadUIWeb by SmartBear, we were able to take a given workload and run it on different instance types to gain a better understanding of the performance differences.  Running 100 simultaneous virtual users with a 9 millisecond think time between pages, we tested an m1.small and an m1.medium.  Here are some screenshots to illustrate what we discovered:

m1.small performance test output (screenshot)

m1.medium performance test output (screenshot)

Looking at the numbers, we see that the response time of the m1.small is roughly twice that of the m1.medium.  As neither CPU, memory, nor disk activity was a constraint, this serves as a warning: the quality-of-service limitations placed on different instance types must be taken into consideration when doing instance-size analysis.
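
For readers who want a quick, free approximation of this kind of comparison, a simple command-line load test can surface similar differences.  The example below uses Apache Bench rather than LoadUIWeb (purely illustrative; the host names are placeholders):

# 100 concurrent clients, 10,000 requests total, against each instance type in turn.
ab -n 10000 -c 100 http://m1-small-test-host/
ab -n 10000 -c 100 http://m1-medium-test-host/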

When in doubt, test it out!