
Another quiet weekend

Except for one little thing.  The amazing team that is the OIT at Notre Dame moved our ERP and associated systems to AWS.  In less than 24 hours.

We are the OIT at ND!

We planned for years.  We prepared.  We were conservative with our time estimates.  Instead of working through the weekend, everyone was home having dinner by 5 pm on Friday.

I can’t say enough about the team.  I would also be remiss if I didn’t mention the fantastic support we’ve been fortunate to receive from our partners at AWS.  It’s been a four-year journey to get here, and we’re not done yet.

A few more services to move, and then we can set our sights on the fun things, like batch workload refactoring, the spot market, and cost optimization!

Monday Morning, Start Again

Happy Monday!

Monday morning, start again

Thanks to http://www.usedave.com for the artwork!

It’s another beautiful Monday morning here in the Midwest.  It’s sunny, hot, and humid.  People are at their desks, working away.

I took a look at OnBase user activity, and it is normal for this time of year.  People are making updates to student files, annotating, making decisions, doing their jobs.  Just another normal Monday morning – nothing has changed.

Except for the fact that our OnBase platform is running in AWS!!!!  Massive kudos to the team for an excellent job, a well-designed plan, and a beautifully transparent migration.  Zero user impact apart from the negotiated outage on Saturday.

On to the next thing – just another normal Monday morning at #NDCloudFirst.

OnBase to AWS: Database Migration

As part of migrating OnBase to AWS, we also had to migrate the database itself.  This proved to be a much less complicated task than moving all of the document files, somewhat of an anti-climax.

A Bit of History

When we first implemented OnBase, we chose Oracle as the database engine.  In 2014, we switched database platforms to Microsoft SQL Server to improve operational stability, with the primary driver being alignment with Hyland’s main line of development.  We felt good about getting that work done prior to our move to AWS.

It’s important to understand what purpose the OnBase database serves.  It contains the metadata and user annotations, while the underlying documents reside on a file system.  As such, the database doesn’t take up much space.  A compressed backup of our database is only 16 GB.

Transfer the Data

Transferring a 16 GB file to AWS is a relatively straightforward task.  The most important feature to be aware of is multipart upload.  This feature takes a single large file, splits it into multiple parts, and uploads those parts to S3 in parallel.

Initially, we looked into using CrossFTP, a third-party GUI tool which lets you specify parameters for the S3 copy command, including thread count and chunk size.  That said, CrossFTP simply takes what you can do natively with the AWS Command Line Interface (CLI) and puts a graphical front end on it.  The GUI hides the messiness of the command line and is quite approachable for a novice, but that wasn’t necessary in this case.  We just used the AWS CLI to transfer the data from our data center to S3.  It couldn’t be much more straightforward, as the syntax is:

aws s3 cp <fileName> s3://<bucketName>/<fileName>

In order to get the performance to an acceptable level, we made the following configuration adjustments in the ~/.aws/config file:

[profile xferonbase]
aws_access_key_id=<serviceId>
aws_secret_access_key=<serviceKey>
s3 =
  max_concurrent_requests = 40
  multipart_threshold = 512MB
  multipart_chunksize = 512MB

Appropriately configured, transferring our 16 GB database backup took just under 17 minutes.  Once the file was in S3, we simply transferred it to our SQL Server database instance and performed a restore.  Simple and easy.  Highly advisable if you have the luxury of bringing your system entirely down when migrating.
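
Tying it together, the transfer boils down to a couple of CLI calls using the profile above.  As a rough sketch (the bucket and file names here are placeholders, not our actual ones):

aws s3 cp OnBase_Full.bak s3://onbase-migration/OnBase_Full.bak --profile xferonbase

Then, from the SQL Server instance in AWS, pull the file back down before restoring:

aws s3 cp s3://onbase-migration/OnBase_Full.bak D:\Backups\OnBase_Full.bak --profile xferonbase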

A look to the future

In fact, the simplicity of this backup/transfer/restore approach is influencing how we think about migrating our ERP.  We use Banner running on Oracle.  While our database size is a bit over 1 TB, the compressed backup is 175 GB.  Conservatively, that’s a three-hour window to get the file to S3.  For the copy from S3 down to the database instance, we use an S3 VPC endpoint for private connectivity and speed.

Thinking about migration weekend, the backup, validate, copy-to-S3, copy-to-EC2, restore sequence is easy to script and test.  It’s also a great way to build in sleep time!  We’ve all done enough migrations to know that people don’t make good decisions when they are exhausted.
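
Because every step is just a command, the whole sequence is easy to capture in a script.  Here is a rough sketch of the idea; the file, bucket, and host names are hypothetical, and the actual restore command depends on the database engine:

#!/bin/bash
# Hypothetical cutover sketch: validate, push to S3, pull down on the EC2 side
set -euo pipefail

BACKUP=banner_full_backup.gz
BUCKET=s3://erp-migration

# 1. Validate the backup before shipping it anywhere
md5sum -c "${BACKUP}.md5"

# 2. Push to S3; multipart upload kicks in automatically past the configured threshold
aws s3 cp "${BACKUP}" "${BUCKET}/${BACKUP}" --profile xferonbase

# 3. From the EC2 database host, pull the file down over the S3 VPC endpoint
#    (the restore itself is engine-specific, so it is left out here)
ssh erp-db-host "aws s3 cp ${BUCKET}/${BACKUP} /backups/${BACKUP}"

Scripted and rehearsed ahead of time, those steps can run while the team rests.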

OnBase to AWS: Adventures in Load Balancing

Good to have Goals

One architectural goal we had in mind as we designed our OnBase implementation in AWS was to break out the single, collapsed web and application tier into distinct web and application tiers.

Here’s a simplification of what the design of our on-premises implementation looked like:

On-premises OnBase: a collapsed web and application tier
Due to licensing considerations, we wanted to retain two right-sized web servers behind an Elastic Load Balancer (ELB).  We then wanted that web tier to interact with a load-balanced application tier.  Since licensing at the application tier is very favorable, we knew we could horizontally scale as necessary to meet our production workload.

To ELB, or not to ELB, that is the question

Our going-in assumption was that we would use a pair of ELBs.  It was remarkably quick to configure.  In under an hour, we had built out an ELB to front the web servers, and an ELB to front the application servers.  It was so easy to set up.  Too easy, as it turned out.

During functional testing, we observed very strange session collisions.  We worked with our partners, looked at configuration settings, and did quite a bit of good, old-fashioned shovel work.  The big sticking point turned out to be how we have OnBase authentication implemented.  Currently, we are using Windows Challenge/Response (NTLM to be specific, dating back to the days of NT LAN Manager) to authenticate users.  The problem is that NTLM authentication does not work across an HTTP proxy, because it needs a point-to-point connection between the user’s browser and the server.  See https://commscentral.net/tech/?post=52 for an explanation of NTLM via an ELB.

When an IIS server gets a request, it sends a 401 response code (auth required) and keeps the HTTP connection alive.  An HTTP proxy closes the connection after getting a 401.  So in order to proxy traffic successfully, the proxy needs to be configured to proxy TCP, not HTTP.  NTLM authenticates the TCP connection.

Enter HAProxy

In order to get past this obstacle, we ended up standing up an EC2 instance with HAProxy on it, and configuring it to balance TCP and stick sessions to backends.  Properly configured, we were cooking with gas.  
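
For the curious, the relevant portion of an HAProxy configuration for that approach looks something like the sketch below.  This is a minimal illustration rather than our production config, and the names and addresses are made up; the key points are mode tcp and the stick table keyed on source IP, which keeps each client pinned to one back end so NTLM’s point-to-point handshake survives:

frontend onbase_web
  bind *:443
  mode tcp
  default_backend onbase_web_nodes

backend onbase_web_nodes
  mode tcp
  balance roundrobin
  stick-table type ip size 200k expire 30m
  stick on src
  server web1 10.0.1.10:443 check
  server web2 10.0.2.10:443 check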

A word on scheduling

For our production account, we make extensive use of reserved instances.  On a monthly basis, 85% of our average workload is covered by a reservation.  For OnBase, we use t2.large running Windows for our application servers.

Currently, a one-year reservation costs $782.  If you were to run the same instance at the on-demand rate of $0.134 per hour, the cost for a year would be $1174.  For an instance which needs to be on 24×7, reservations are great; for a t2.large, we end up saving just over 33%.

Due to application load characteristics and the desire to span availability zones, we ended up with four application servers.  However, from a workload perspective, we only need all four during business hours: 7 am to 7 pm.  So, 52 weeks per year, 5 days per week, 12 hours per day, at $0.134 per hour.  That comes out to just over $418 for the year, a savings of over 64% compared to running on demand around the clock.  If you can schedule, by all means, do it!
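
The scheduling itself doesn’t have to be fancy.  A pair of cron entries on a small utility instance is enough to park the extra application servers overnight and bring them back before the workday starts; the instance IDs below are placeholders:

# Start the extra app servers at 7 am and stop them at 7 pm, weekdays only
0 7 * * 1-5  aws ec2 start-instances --instance-ids i-0aaa111122223333a i-0bbb444455556666b
0 19 * * 1-5 aws ec2 stop-instances  --instance-ids i-0aaa111122223333a i-0bbb444455556666b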

Simplified final design

So, where did we end up?  Consider the following simplified diagram:

Simplified OnBase design in AWS

The web servers are in separate availability zones, as are the application servers.  There are still single points of failure in the design, but we are confident in our ability to recover within a timeframe acceptable to our customers.

So, what did we learn?

We learned quite a bit here, including:

  1. NTLM is not our friend.  Replacing it with SAML/CAS this summer will allow us to jettison the HAProxy instance and replace it with an ELB.
  2. Scheduling is important.  Reserved Instances optimize your spend, but they don’t optimize your usage.  You’re leaving a lot of money on the table if you’re not actively involved in scheduling your instances, which you can only do if you have a deep understanding of your system’s usage profile.
  3. Working in AWS is a lot of fun.  If you can imagine it, it’s really easy to build, prototype, and shift as appropriate.

Coming soon, migrating the OnBase database.

OnBase to AWS: Move the Data

Over 7 million files?

As we were pondering moving OnBase, one of the first considerations was how to move over five years’ worth of documents from our local data center to our primary data center in AWS.  The challenge wasn’t massive in big data terms: 7 million files, 2 terabytes.  Due to file transfer overhead, it is more efficient to transfer one big file instead of copying millions of tiny files.

By Kbh3rd (Own work) [CC BY 4.0 (http://creativecommons.org/licenses/by/4.0)], via Wikimedia Commons

Over 7 million files!

7 million documents zipped into one large file.

Since OnBase is a Windows-centric platform, the need to retain Windows file permissions as part of the data transfer was of prime concern.  Fundamentally, there were two ways to approach this: trickling the data out using multiple robocopy threads, or doing a bulk data transfer.

A word on CIFS in AWS

A brief aside on providing CIFS storage in AWS: a straightforward way to get CIFS is to use Windows Server 2012 and Microsoft’s Distributed File System (DFS).

http://www.slideshare.net/AmazonWebServices/nfs-and-cifs-options-for-aws-stg401-aws-reinvent-2013/20

CIFS in AWS

You could also use a product from a company such as Avere or Panzura to present CIFS.  These products are storage caching devices that use RAM, EBS, and S3 in a tiered fashion, serving as a translation layer between S3 object storage and CIFS.  Our current configuration makes use of Panzura, striped EBS volumes, and S3.

So, let’s move this data

Robocopy versus bulk transfer

The initial goal was to get the data from its current CIFS system to a CIFS environment in AWS with all relevant metadata in place.  We evaluated a number of options, including:

  1. Zipping up the directory structure and using AWS Snowball for the transfer.
  2. Zipping up the directory structure, using S3 multipart upload to pump the data into S3.
  3. Robocopy to local storage on a virtual machine, use Windows Backup to produce a single file, transmit that backup file to S3, copy the file to a local EBS volume, and finally restore.
  4. Use NetBackup to backup to S3 and then restore to EC2.
  5. Zip the file structure, gather metadata with Icacls, transmit to S3, copy from S3 to EBS, and restore.

The robocopy approach would take weeks to transfer all of the data, so we immediately started trickling the data out that way.  That said, the team was interested in seeing whether we could compress the transfer enough to fit within an outage weekend.
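
Before getting to the faster options, the trickle itself is worth a quick look.  It amounts to a multi-threaded robocopy per disk group, told to carry the Windows security information along.  A representative command (the UNC paths are hypothetical) looks like:

robocopy \\onprem-fs\OnBase\DiskGroup1 \\aws-cifs\OnBase\DiskGroup1 /E /COPYALL /MT:32 /R:2 /W:5 /NP /LOG+:dg1-copy.log

/COPYALL preserves the NTFS permissions we cared about, and /MT provides the multiple copy threads.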

Snowball

1 PB of capacity…overkill for 2 TB

So we tested Snowball.  The first try didn’t go so well.  The Snowball was dead on arrival, and we had to get a second one shipped to us.  Ultimately, it worked, but it was overkill for the volume of data we needed to move.

Zip, transmit, unzip

We broke down the transfer process into three basic steps:

  1. Prepare and package the data
  2. Transmit the data
  3. Rehydrate the data

The transmit was the easy part.  We have a 10 Gbps network connection, and were able to use multipart upload to pump data to S3 at 6 Gbps, transmitting a 200 GB test file in under an hour.

10 Gbps network connection + S3 multipart upload == speedy transfer

Reality Check

While discussing the different packaging/rehydration options, we talked a bit more about how OnBase actually works.  It turns out that it manages files a bit like a rotating log: it writes to a directory, then switches to a new directory and starts writing to that.  Once a directory’s files are written, it essentially becomes read-only.

That took the time pressure off.  We could robocopy the bulk of our data out at our leisure; on cutover weekend, we will migrate just the active directory or directories.

Problem solved.

What did we learn?

  1. We can move data really quickly if we need to.
  2. CIFS in AWS is feasible, but it’s not a match made in heaven.
  3. You really need a comprehensive understanding of how your application works when you plan any migration.
  4. With a full understanding of what we needed to do, the simple, slow, tortoise approach of trickling data with robocopy met our needs.

 

Stay tuned for adventures in load balancing!

A Microsoft-centric platform in AWS?

At Notre Dame, we use OnBase by Hyland Software as our Enterprise Content Management platform.  Hyland has been a Microsoft Certified Partner for almost two decades, and has optimized its software for Windows, so OnBase is very much a Windows-centric product.

If you’ve been following this blog, you know that Notre Dame has selected Amazon Web Services as our infrastructure provider of the future.  Linux-based systems fit nicely in the automation sweet spot that the AWS tools provide.  Windows systems can be a bit more of a challenge.

Nevertheless, our team was excited about the move.  We knew there would be some challenges along the way, including the migration of 2 TB of data from campus to AWS.  The challenges were offset by the opportunities, including the longstanding architectural goal of a load-balanced web tier communicating with a load-balanced application tier, as opposed to the collapsed web and application tier we were running on campus.

We did a quick proof of concept exercise in the fall of 2015, then started building out our full test environment early in 2016.

Stay tuned for details of our struggles, learnings, and ultimately successful migration.

Outreach

One thing that is incredibly fulfilling about being at Notre Dame is its core mission.  Attributed to Father Sorin, Notre Dame’s primary reason for existence is to be a “powerful force for good” in the world.  In keeping with that mission, we at #NDCloudFirst have done a couple of things to help spread the operational cloud knowledge we are acquiring.

Over the past year or so, we have shared our knowledge in person and via conference calls with our higher education colleagues in both public and private forums.  We’ve talked to over 20 schools, both nationally and internationally, about the work we are doing in this space.  That’s one of the things I really enjoy about working in higher ed IT – the overall level of sharing and inter-institutional collaboration.  Public or private, with or without a medical school, centralized or not, we all face the universal constraint of being oversubscribed in the face of ever-increasing demand for services.  I relish the opportunity to share what we’ve learned so far with our colleagues, and it’s always fun to learn from them as they follow their own paths.

In the interest of leveling up the regional knowledge of this space, we started the #INAWS Meetup.  It is an inclusive, public forum, and we welcome anyone who has an interest in sharing challenges and solving problems using AWS.  In concert with our friends from the City of South Bend, Trek10, independent consultants, small business owners, representatives from local IT departments, and graduate students, we have had robust conversations and explored ways to solve disparate challenges using AWS.

Our goal is simply an extension of Father Sorin’s – to educate and empower people to avail themselves of the world’s best infrastructure.  While AWS can initially seem intimidating to the uninitiated, it is remarkably approachable technology.  By adopting AWS, we believe that our regional businesses can experience overall IT service levels that used to be the stuff of fantasy.

Unencumbered by hardware concerns, our community can focus more energy on improving the quality of city services, building businesses, and fabricating products.

I am remarkably bullish on the future.  We at #NDCloudFirst are working diligently to improve IT services at Notre Dame.  I look forward to continuing to learn and share with our colleagues throughout the region, nation, and world.

AWS assume role with Fog

Amazon Web Services (AWS) provides users with a wealth of services and a suite of ways in which to keep those services secure. Unfortunately, software libraries are not always able to keep up with all the new features available. While wiring up our Rails application to deposit files into an AWS S3 bucket, we ran into such a problem. AWS has developed an extensive role-based security structure, and as part of their Ruby AWS SDK they allow a user, or role, to assume another role which may have completely different access.

Our trouble came from the fact that the interface provided by the fog-aws gem does not currently have any way to tell it to assume a different role. It does provide a way to use a role your server may already have been assigned, and a nice mechanism to re-negotiate your credentials when they are about to expire. So what do we do when something doesn’t quite work the way we want? Monkey patch!

Below is a link to a Gist showing how we were able to leverage the things fog-aws already did the way we wanted, and override the one thing we needed done differently.

https://gist.github.com/peterwells/39a5c31d934fa8eb0f2c

 

Automating Configuration Management

So many choices!  Puppet, Chef, Salt, Ansible!  What’s an organization to do?
We initially went down the Puppet path, as one of our distributed IT organizations had invested lots of time in getting Puppet going.  We ended up not getting too far down that path before we started using Ansible.

The biggest reason is that Ansible is agentless.  All the commands go over SSH, and there is nothing to install on the destination servers.  We’ve run into a couple of issues where the documentation doesn’t match the behavior when developing an Ansible playbook, but nothing insurmountable.

We realize many benefits from having a fully self-documenting infrastructure, and find that it, in concert with Git (we use Bitbucket because of its free unlimited private repos for educational institutions), enables the adoption of DevOps principles.

At a high level, we have a playbook we call Ansible-Core which contains a variety of roles, maintained by our Platform team.  These roles correspond to specific configurations, including:
  • Ensuring that our traditional Platform/OS engineers have accounts/sudo
  • Account integration with central authentication
  • Common software installation
    • NGINX, configuration of our wildcard SSL certificate chain, etc
When developing a playbook for an individual service, the developer scripting software installation/configuration may encounter a dependency which is not specific to the service; for example, installation of the AWS CLI (not there by default if you start with a minimal machine config).  That realization leads to a conversation with the Platform team about incorporating that role into Ansible-Core.  That can happen in one of two ways:
  • By the dev, who issues a pull request to the Platform team.  That team reviews the change and merges as appropriate.
  • By a member of the Platform team
In the process of creating Ansible scripts, conversation between traditional operations folks and developers flows naturally, and we end up with truly reusable chunks of infrastructure code.  Everyone wins, and more importantly, everyone learns!
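
To make the agentless point concrete, running Ansible-Core against a group of hosts is just a command from the control machine; nothing gets installed on the targets.  The inventory and playbook names below are illustrative rather than our actual repository layout:

# Dry run first with --check, then apply; everything travels over SSH
ansible-playbook -i inventories/production ansible-core.yml --limit webservers --check
ansible-playbook -i inventories/production ansible-core.yml --limit webservers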

Gluecon 2015


I had the good fortune of attending Gluecon this past week.  It is a short, intense conference held in Broomfield, Colorado, attended by some of the best and brightest folks in technology.  There was a lot of talk about microservices, APIs, DevOps, and containers – all the latest tech, with an eye towards 2018.

While the majority of slides are available here, this is a quick synopsis of what I took away.

Tweet early, tweet often

I’m a sporadic tweeter, but go into full-on microblogging mode when at a conference.  It’s a great way to share information with the public, and a great way to make connections.  By adding a column in TweetDeck for #gluecon, the following image caught my eye:

Per @_nicolemargaret, “Doing some realtime analysis of hashtag co-occurrence among #gluecon tweets. #neo4j #rstats #d3 #nosql”

 

I love graphs, and the powerful way they communicate relationships.  Through that tweet, I had the opportunity to meet Nicole White.  Come to find out that she works for Neo Technology, the company behind neo4j.  While I have other friends at Neo, I had never heard of or met Nicole before (she is a relatively new hire).  I’m happy to have added her as a new node on my graph, as it were.  Very cool.

Tweeting is also a great way to reflect back on gems and tidbits – simply look at your own history to help organize your thoughts.  Like I’m doing now.

It’s always the people

There were quite a few sessions talking about the importance of culture and talent in making for productive, healthy organizations.  Salesforce did a good keynote, illustrating the gap between available technology jobs and qualified candidates.

A challenge for us all

This theme continued through the very last keynote session:

Is this clear?

Almost every session I attended had a subtle undertone of “we need talent.”  Most messages were not subtle.

Talking about microservices, Adrian Cockcroft made a great cultural point: when operating microservices, organizations need to fully embrace the DevOps model, with a clear escalation chain.

Culture Win: the VP of Engineering volunteers to be on call while expecting never to get called.

Tools?  What tools?

Building on the importance of people, let’s talk about tools.  Specifically, let’s talk orchestration tools – Ansible, Puppet, and Chef.  I happen to think agentless Ansible is the way to go, but ultimately, it’s what you and your organization are capable of doing with the tools, not the tools you pick.  Brian Coca illustrated many possible ways in which Ansible can be used…because he deeply understands how to use Ansible!

You can do this with Ansible…should you?

One of my favorite one-liners from the conference sums up the tools discussion: “Rather than teach everyone to code, let’s teach them to think. The coding can come later; it’s easier.”

Right on – pick a tool that can be successfully adopted by you/your organization, and stick with it.  Stay focused, and don’t get distracted.

APIs still rule…and enterprise software still lags behind

APIs have been a thing for years now.  I remember writing the customer profile master data store for a major airline in the late 90s.  As a master data source, many internal systems wanted to access/update said data.  Instead of giving each system direct access to our database, we surrounded it with a cloud of services.  At the time, these were written in C, using Tuxedo.

What has changed in the last 20 years?  The utter ubiquity of APIs in the form of web services.  The concept is the same – encapsulate business logic, publish a defined interface, and let the data flow quickly and easily.  And yes, it is much, much easier to get going with a RESTful API than a Tuxedo service.

Table stakes for software vendors

What else has changed?  Organizations simply expect data to be available via APIs.  If you are an enterprise software vendor and your data/business logic is locked up in a proprietary monolithic application, start opening up or face customer defection.

What a Wonderful World

In his presentation, Runscope CEO and co-founder John Sheehan put this slide up.

Can you imagine life without these tools?

Think for a moment about how remarkably powerful any one of the concepts listed in his slide is.  We live in a world where all of them are simply available for use.  With the remarkably rich tools which are out there, there is simply no excuse for a poorly performing web site or API.  If an organization is running into scale issues, the technology is not likely to be at fault – it’s how that technology is implemented.

Talk with Adrian

If you get the chance, spend some time talking with Adrian Cockcroft.  I was fortunate enough to spend 20 or 30 minutes with him over lunch.  First off, he is a genuinely friendly and kind person.  Second, he likes interesting cars (Tesla Roadster and Lotus Elise, among others).  Finally, he’s flat-out brilliant, with loads of experience.

I was able to glean useful tidbits about containers, tertiary data storage, and autocrossing.

Lunch with the incomparable Adrian Cockcroft

One thing Adrian mentioned that stuck with me concerned the current state of containers.  They are mutating so fast that even companies who are working with them full time have difficulty keeping up.  That said, the speed at which this space moves makes for a high degree of agility.  However, every organization has finite limits on what new/emerging technologies can be pursued.  Containerize if you wish, but you should have a clearly defined objective in mind with palpable benefits.

Serious about automation? Take away SSH access.

One of my favorite tidbits was to remove SSH access from servers.  If no individual has SSH access, you’re forced to automate everything.  At that point, servers truly become disposable.

Get beyond the tech

I was pretty fried by the afternoon of day two of the conference.  I took the opportunity to skip a couple of sessions and spend some time with Kin Lane.  His dedication to understanding and explaining APIs earned him a Presidential Innovation Fellowship in 2013.

Yes, we talked tech…and proceeded to go beyond.  Kin likes motorcycles.  He’s a former VW mechanic.  He’s gone through a material purge and enjoys the mobility his work affords him.  Yes, he strives to be all things API, but that’s only one facet of his very interesting personality.

Opting to hang with Kin Lane instead of attending a session

Repeat attendance?

So, would I go to Gluecon again?  Most definitely.  It was time well spent, providing insight into the leading edge of technology in the context of microservices, DevOps, containers, and APIs.  Not too long, and not too big.

I came away from the conference with a better understanding of trends in technology.  With that knowledge, I am better prepared to work with, ask questions of, and assess potential vendors.