Another quiet weekend

Except for one little thing.  The amazing team that is the OIT at Notre Dame moved our ERP and associated systems to AWS.  In less than 24 hours.

We are the OIT at ND!

We planned for years.  We prepared.  We were conservative with our time estimates.  Instead of working through the weekend, everyone was home having dinner by 5 pm on Friday.

I can’t say enough about the team.  I would also be remiss if I didn’t mention the fantastic support we’ve been fortunate to receive from our partners at AWS.  It’s been a four-year journey to get here, and we’re not done yet.

A few more services to move, and then we can set our sights on the fun things, like batch workload refactoring, the spot market, and cost optimization!

OnBase to AWS: Database Migration

As part of migrating OnBase to AWS, we also had to migrate the database itself.  This proved to be a far less complicated task than moving all of the document files; it was somewhat of an anticlimax.

A Bit of History

When we first implemented OnBase, we chose Oracle as the database engine.  To improve operational stability, we switched database platforms to Microsoft SQL Server in 2014; the primary reason for the switch was to align us with Hyland’s main line of development.  We felt good about getting that work done prior to our move to AWS.

It’s important to understand what purpose the OnBase database serves.  It contains the metadata and user annotations, while the underlying documents reside on a file system.  As such, the database doesn’t take up much space.  A compressed backup of our database is only 16 GB.

Transfer the Data

Transferring a 16 GB file to AWS is a relatively straightforward task.  The most important feature to be aware of is multipart upload, which takes a single large file, splits it into multiple parts, and uploads those parts to S3 in parallel.

Initially, we looked into using CrossFTP, a third-party GUI tool which allows you to specify parameters to the S3 copy command, including thread count and chunk size.  That said, CrossFTP simply takes what you can do natively with the AWS Command Line Interface (CLI) and gives you a graphical front end.  The graphical interface abstracts away the messiness of the command line and is quite approachable for a novice, but that was not necessary in this case.  We just used the AWS CLI to transfer the data from campus to S3.  It couldn’t be much more straightforward, as the syntax is:

aws s3 cp <fileName> s3://<bucketName>/<fileName>

In order to get the performance to an acceptable level, we made the following configuration adjustments in the ~/.aws/config file:

[profile xferonbase]
aws_access_key_id=<serviceId>
aws_secret_access_key=<serviceKey>
s3 =
  max_concurrent_requests = 40
  multipart_threshold = 512MB
  multipart_chunksize = 512MB
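
To use that profile, the copy command shown above just needs to reference it; something like the following (the backup file name here is a placeholder, not our actual one):

aws s3 cp onbase_backup.bak s3://<bucketName>/onbase_backup.bak --profile xferonbase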

With those settings in place, transferring our 16 GB database backup took just under 17 minutes.  Once the file was in S3, we simply transferred it to our SQL Server database instance and performed a restore.  Simple and easy, and highly advisable if you have the luxury of bringing your system entirely down when migrating.

A look to the future

In fact, the simplicity of this backup/transfer/restore approach is influencing how we think about migrating our ERP.  We use Banner running on Oracle.  While our database size is a bit over 1 TB, the compressed backup is 175 GB.  Conservatively, that’s about a three-hour window to get the file to S3.  Once the data is in S3, we can pull it down over an S3 VPC endpoint for private connectivity and speed.

Thinking about migration weekend, the backup, validate, copy to S3, copy to EC2, restore process is something easy to script and test.  It’s also a great way to build in sleep time!  We’ve all done enough migrations to know that people don’t make good decisions when they are exhausted.
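
To give a flavor of what that script might look like, here is a minimal sketch, assuming a generic backup file and the same multipart-tuned profile shown earlier; the actual backup and restore commands are engine-specific and deliberately left as comments:

#!/usr/bin/env bash
set -euo pipefail

BACKUP=banner_backup.bak            # hypothetical backup file name
BUCKET=s3://<bucketName>            # placeholder bucket

# 1. Back up the database (engine-specific; not shown here)
# 2. Validate: record a checksum to verify the copy on the other side
sha256sum "$BACKUP" > "$BACKUP.sha256"

# 3. Copy the backup and checksum to S3 using the multipart-tuned profile
aws s3 cp "$BACKUP"        "$BUCKET/" --profile xferonbase
aws s3 cp "$BACKUP.sha256" "$BUCKET/" --profile xferonbase

# 4. On the EC2 database instance (reached over the S3 VPC endpoint):
#      aws s3 cp "$BUCKET/$BACKUP" . && aws s3 cp "$BUCKET/$BACKUP.sha256" .
#      sha256sum -c "$BACKUP.sha256"
# 5. Restore the database (engine-specific; not shown here)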

OnBase to AWS: Adventures in Load Balancing

Good to have Goals

One architectural goal we had in mind as we designed our OnBase implementation in AWS was to break out the single, collapsed web and application tier into distinct web and application tiers.

Here’s a simplification of what the design of our on-premises implementation looked like:

[Diagram: our on-premises OnBase design, with a collapsed web and application tier]
Due to licensing considerations, we wanted to retain two right-sized web servers behind an Elastic Load Balancer (ELB).  We then wanted that web tier to interact with a load-balanced application tier.  Since licensing at the application tier is very favorable, we knew we could scale horizontally as necessary to meet our production workload.

To ELB, or not to ELB, that is the question

Our going-in assumption was that we would use a pair of ELBs.  Setup was remarkably quick: in under an hour, we had built out an ELB to front the web servers and another to front the application servers.  Too easy, as it turned out.

During functional testing, we observed very strange session collisions.  We worked with our partners, looked at configuration settings, and did quite a bit of good, old-fashioned shovel work.  The big sticking point turned out to be how we have OnBase authentication implemented.  Currently, we are using Windows Challenge/Response (NTLM, which dates back to the days of NT LAN Manager) to authenticate users.  The problem is that NTLM authentication does not work across an HTTP proxy, because it needs a point-to-point connection between the user’s browser and the server.  See https://commscentral.net/tech/?post=52 for an explanation of NTLM via an ELB.

When an IIS server gets a request, it sends a 401 response code (authentication required) and keeps the HTTP connection alive; an HTTP proxy, however, closes the connection after getting a 401.  So in order to proxy this traffic successfully, the proxy needs to be configured to proxy TCP, not HTTP, since NTLM authenticates the TCP connection itself.

Enter HAProxy

In order to get past this obstacle, we ended up standing up an EC2 instance with HAProxy on it, and configuring it to balance TCP and stick sessions to backends.  Properly configured, we were cooking with gas.  
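
For the curious, the relevant bit of an HAProxy configuration for this looks roughly like the following; this is a minimal sketch with hypothetical addresses, port, and sizing, not our production config:

frontend onbase_app_frontend
    bind *:8080
    mode tcp                                   # proxy raw TCP so NTLM's point-to-point handshake survives
    option tcplog
    default_backend onbase_app_servers

backend onbase_app_servers
    mode tcp
    stick-table type ip size 200k expire 30m   # remember which app server each client landed on
    stick on src                               # ...and keep sending that client to the same server
    server app1 10.0.1.10:8080 check
    server app2 10.0.2.10:8080 check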

A word on scheduling

For our production account, we make extensive use of reserved instances.  On a monthly basis, 85% of our average workload is covered by a reservation.  For OnBase, we use t2.large running Windows for our application servers.

Currently, a one-year reservation costs $782.  If you were to run the same instance at the on-demand rate of $0.134 per hour, the cost for a year would be $1174.  For an instance which needs to be on 24×7, reservations are great.  For a t2.large, we end up saving just over 33%.

Due to application load characteristics and the desire to span availability zones, we ended up with four application servers.  However, from a workload perspective, we only need all four during business hours: 7 am to 7 pm.  So, 52 weeks per year, 5 days per week, 12 hours per day, $0.134 per hour.  That comes out to just over $418, which is a savings of over 64%.  If you can schedule, by all means, do it!
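
Scheduling doesn’t have to be fancy, either.  A couple of cron entries on a small utility host calling the AWS CLI are enough to park the extra application servers outside business hours; the instance IDs and profile below are hypothetical:

# Stop the scheduled application servers at 7 pm, Monday through Friday
0 19 * * 1-5  aws ec2 stop-instances  --instance-ids i-0aaa1111 i-0bbb2222 --profile onbase
# Start them back up at 7 am, Monday through Friday
0 7 * * 1-5   aws ec2 start-instances --instance-ids i-0aaa1111 i-0bbb2222 --profile onbase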

Simplified final design

So, where did we end up?  Consider the following simplified diagram:

[Diagram: our simplified OnBase design in AWS]

The web servers are in separate availability zones, as are the application servers.  There are still single points of failure in the design, but we are confident in our ability to recover within a timeframe acceptable to our customers.

So, what did we learn?

We learned quite a bit here, including:

  1. NTLM is not our friend.  Replacing it with SAML/CAS this summer will allow us to jettison the HAProxy instance and replace it with an ELB.
  2. Scheduling is important.  Reserved Instances optimize your spend, but they don’t optimize your usage.  You’re leaving a lot of money on the table if you’re not actively involved in scheduling your instances, which you can only do if you have a deep understanding of your system’s usage profile.
  3. Working in AWS is a lot of fun.  If you can imagine it, it’s really easy to build, prototype, and shift as appropriate.

Coming soon, migrating the OnBase database.

OnBase to AWS: Move the Data

Over 7 million files?

As we were pondering moving OnBase, one of the first considerations was how to move over five years’ worth of documents from our local data center to our primary data center in AWS.  The challenge wasn’t massive in big data terms: 7 million files, 2 terabytes.  Due to per-file transfer overhead, though, it is much more efficient to transfer one big file than to copy millions of tiny files.

[Image by Kbh3rd (own work), CC BY 4.0 (http://creativecommons.org/licenses/by/4.0), via Wikimedia Commons: over 7 million files!]

[Image: 7 million documents zipped into one large file.]

Since OnBase is a Windows-centric platform, the need to retain Windows file permissions as part of the data transfer was of prime concern.  Fundamentally, there were two ways to approach this: trickling the data out using multiple robocopy threads, or doing a bulk data transfer.

A word on CIFS in AWS

A brief aside on providing CIFS storage in AWS.  A straightforward way to get CIFS is to use Windows Server 2012 and Distributed File System from Microsoft.

http://www.slideshare.net/AmazonWebServices/nfs-and-cifs-options-for-aws-stg401-aws-reinvent-2013/20

CIFS in AWS

You could also use a product from a company such as Avere or Panzura to present CIFS.  These products are storage caching devices that use RAM, EBS, and S3 in a tiered fashion, serving as a translation layer between S3 object storage and CIFS.  Our current configuration makes use of Panzura, striped EBS volumes, and S3.

So, let’s move this data

Robocopy versus bulk transfer

The initial goal was to get the data from its current CIFS system to a CIFS environment in AWS with all relevant metadata in place.  We evaluated a number of options, including:

  1. Zipping up the directory structure and using AWS Snowball for the transfer.
  2. Zipping up the directory structure and using S3 multipart upload to pump the data into S3.
  3. Robocopy to local storage on a virtual machine, use Windows Backup to produce a single file, transmit that backup file to S3, copy the file to a local EBS volume, and finally restore.
  4. Use NetBackup to back up to S3 and then restore to EC2.
  5. Zip the file structure, gather metadata with icacls, transmit to S3, copy from S3 to EBS, and restore.

Using the robocopy approach would take weeks to transfer all of the data.  We immediately started trickling the data out using robocopy.  That said, the team was interested in seeing if it was possible to compress that data transfer time to fit within an outage weekend.
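
For reference, the trickle itself is just a multithreaded robocopy that carries the NTFS security metadata along with the files; a sketch of that kind of invocation (the paths, thread count, and log file are illustrative):

robocopy \\onprem-fs\OnBase\DiskGroups \\aws-fs\OnBase\DiskGroups ^
    /MIR /COPYALL /ZB /MT:32 /R:2 /W:5 /LOG+:C:\logs\onbase-trickle.log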

Snowball

1 PB of capacity…overkill for 2 TB

So we tested Snowball.  The first try didn’t go so well.  The Snowball was dead on arrival, and we had to get a second one shipped to us.  Ultimately, it worked, but it was overkill for the volume of data we needed to move.

Zip, transmit, unzip

We broke down the transfer process into three basic steps:

  1. Prepare and package the data
  2. Transmit the data
  3. Rehydrate the data

Transmitting was the easy part.  We have a 10 Gb network connection and were able to use multipart upload to pump data to S3 at 6 Gbps, transmitting a 200 GB test file in under an hour.

10 Gb network connection + S3 multipart upload == speedy transfer

Reality Check

While discussing the different packaging/rehydration options, we talked a bit more about how OnBase actually works.  It turns out that it manages files a bit like a rotating log file.  That is, it writes to a directory, then switches to a new directory and starts writing to that one.  After the files are written, the old directory essentially becomes read-only.

That took the time pressure off.  We could robocopy the bulk of our data out at our leisure; on cutover weekend, we will migrate just the directory or directories still being actively written to.

Problem solved.

What did we learn?

  1. We can move data really quickly if we need to.
  2. CIFS on AWS is feasible, but it isn’t a match made in heaven.
  3. You really need a comprehensive understanding of how your application works when you plan for any migration.
  4. With a full understanding of what we needed to do, the simple, slow, tortoise approach of trickling data out with robocopy met our needs.


Stay tuned for adventures in load balancing!

A Microsoft-centric platform in AWS?

At Notre Dame, we use OnBase by Hyland Software as our Enterprise Content Management platform.  Hyland has been a Microsoft Certified Partner for almost two decades, and has optimized its software for Windows, so OnBase is very much a Windows-centric product.

If you’ve been following this blog, you know that Notre Dame has selected Amazon Web Services as our infrastructure provider of the future.  Linux-based systems fit nicely in the automation sweet spot that the AWS tools provide.  Windows systems can be a bit more of a challenge.

Nevertheless, our team was excited about the move.  We knew there would be some challenges along the way, including the migration of 2 TB of data from campus to AWS.  The challenges were offset by the opportunities, including the longstanding architectural goal of a load-balanced web tier communicating with a load-balanced application tier, as opposed to the collapsed web and application tier we were running on campus.

We did a quick proof of concept exercise in the fall of 2015, then started building out our full test environment early in 2016.

Stay tuned for details of our struggles, learnings, and ultimately successful migration.

Higher Education Needs the Public Cloud

Today is an exciting time to be a part of higher education IT! We are participating in the biggest transformation in our field since the client/server computing model displaced mainframes – the adoption of public cloud computing. The innovation, flexibility, cost effectiveness and security that the public cloud brings to our institutions will permanently change the way that we work. This new technology model will render the construction of on-campus data centers obsolete and transform academic and administrative computing for decades to come.

Why is this transformation happening? We’ve reached a tipping point where network speeds allow the consolidation of computing resources in a manner where large providers can achieve massive economies of scale. For the vast majority of workloads, there’s really no difference if computing power is located down the hall or across the country. Computing infrastructure is now a commodity for us to leverage rather than an art for us to master. We have the opportunity to add more value higher up in the stack by becoming integrators and innovators instead of hardware maintainers.

We Need the Cloud’s Disruptive Innovation

The reality is that cloud providers can simply innovate faster than we can. Our core mission is education and research – not information technology. The core mission of cloud providers is providing rock solid infrastructure services that make IT easier. How can we possibly compete with this laser-like focus? Better question – why would we even want to try? Instead of building data centers that compete with cloud providers, we can leverage the innovations they bring to the table and ensure that our laser-like focus is in the right place – on our students and faculty.

As an example, consider the automatic scaling capabilities of cloud providers. At Notre Dame, we leveraged Amazon Web Services’ autoscaling capability to transform the way we host the University website. We now provision exactly the number of servers required to support our site at any given time and deprovision servers when they are no longer needed. Could we have built this autoscaling capability in our own data center? Sure. The technology has been around for years, but we hadn’t done it because we were focused on other things. AWS’ engineering staff solved that for us by building the capability into their product.

We Need the Cloud’s Unlimited Capacity and Flexibility

The massive scale of public cloud infrastructures makes them appear to have essentially unlimited capacity from our perspective. Other than some extremely limited high performance computing applications, it’s hard to imagine a workload coming out of our institutions that a major cloud provider couldn’t handle on a no-notice basis. We have the ability to quickly provision massive computing resources, use them for as long or short a duration as necessary, and then quickly deprovision them.

The beauty of doing this type of provisioning in the public cloud is that overprovisioning becomes a thing of the past. We no longer need to plan our capacity to handle an uncertain future demand – we can simply add resources on-demand as they are needed.

We Need the Cloud’s Cost Effectiveness

Cloud solutions are cost effective for two reasons. First, they allow us to leverage the massive scale of cloud providers. Gartner estimates that the public cloud market in 2013 reached $131 billion in spend. The on-campus data centers of all higher education institutions combined constitute a tiny fraction of that size. When companies like Amazon, Google and Microsoft build at macro-enterprise scale, they are able to generate a profit while still passing on significant cost savings to customers. The history of IaaS price cuts by AWS, Google and others bears this out.

The second major cost benefit of the public cloud stems from the public cloud’s “pay as you go” model. Computing no longer requires major capital investments – it’s now available at per-hour, per-GB and per-action rates. If you provision a massive amount of computing power to perform serious number crunching for a few hours, you pay for those hours and no more. Overprovisioning is now the responsibility of the IaaS provider and the costs are shared across all customers.

We Need the Cloud’s Security and Resiliency

Security matters. While some may cite security as a reason not to move to the cloud, security is actually a reason to make the move. Cloud providers invest significant time and money in building highly secure environments. They are able to bring security resources to bear that we can only dream about having on our campuses. The Central Intelligence Agency recently recognized this and made a $600M investment in cloud computing. If IaaS security is good enough for the CIA, it should be good enough for us. That’s not to say that moving to the cloud is the silver bullet for security – we’ll still need a solid understanding of information security to implement our services properly in a cloud environment.

The cloud also simplifies the creation of resilient, highly available services. Most providers operate multiple data centers in geographically diverse regions and offer toolkits that help build solutions that leverage that geographic diversity. The Obama for America campaign discovered this when they picked up and moved their entire operation from the AWS eastern region to a west coast region in hours as Superstorm Sandy bore down on the eastern seaboard.

Higher education needs the cloud. The innovation, flexibility, cost effectiveness and security provided by public cloud solutions give us a tremendous head start on building tomorrow’s technology infrastructure for our campuses. Let’s enjoy the journey!

Originally published on LinkedIn November 12, 2014

Why We Test

Despite design principles, despite best intentions, things can go wrong.  This is true even when all of the design parameters are quantifiable, as exhibited by this spectacular engine failure:

[Image: F1 engine failure]

What happens when some of the design constraints are not fully understood?  Specifically, consider the following excerpt taken from Amazon’s page detailing instance specifications:

| Instance Family | Instance Type | Processor Arch | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | EBS-optimized Available | Network Performance |
|---|---|---|---|---|---|---|---|---|
| Storage optimized | i2.xlarge | 64-bit | 4 | 14 | 30.5 | 1 x 800 SSD | Yes | Moderate |
| Storage optimized | i2.2xlarge | 64-bit | 8 | 27 | 61 | 2 x 800 SSD | Yes | High |
| Storage optimized | i2.4xlarge | 64-bit | 16 | 53 | 122 | 4 x 800 SSD | Yes | High |
| Storage optimized | i2.8xlarge | 64-bit | 32 | 104 | 244 | 8 x 800 SSD | | 10 Gigabit*4 |
| Storage optimized | hs1.8xlarge | 64-bit | 16 | 35 | 117 | 24 x 2,048*3 | | 10 Gigabit*4 |
| Storage optimized | hi1.4xlarge | 64-bit | 16 | 35 | 60.5 | 2 x 1,024 SSD*2 | | 10 Gigabit*4 |
| Micro instances | t1.micro | 32-bit or 64-bit | 1 | Variable*5 | 0.615 | EBS only | | Very Low |
10 Gigabit is a measure we can understand.  But what does Moderate mean?  How does it compare to High, or Very Low?  It is possible to find out, but only if we test.

According to Amazon’s page on instance types (emphasis added):

Amazon EC2 allows you to provision a variety of instances types, which provide different combinations of CPU, memory, disk, and networking. Launching new instances and running tests in parallel is easy, and we recommend measuring the performance of applications to identify appropriate instance types and validate application architecture. We also recommend rigorous load/scale testing to ensure that your applications can scale as you intend.

Using LoadUIWeb by SmartBear, we were able to take a given workload and run it on different instance types to gain a better understanding of the performance differences.  With 100 simultaneous virtual users and a 9 millisecond think time between pages, we ran the same workload against an M1.small and an M1.medium.  Here are some pictures to illustrate what we discovered:

M1.small performance test output:

[Chart: M1.small performance test results]

M1.medium performance test output:

[Chart: M1.medium performance test results]

Looking at the numbers, we see that the response time of the small is roughly twice that of the medium.  As neither CPU, memory, nor disk activity was a constraint, this serves as a warning to us: the quality-of-service limitations placed on different instance types must be taken into consideration when doing instance size analysis.

When in doubt, test it out!

Google Apps Scripts, LDAP, and Wookiee Steak

I’ve been a Google Apps user for as long as I could score an invite code to Gmail. Then came Google Calendar, Google Docs and Spreadsheets, and a whole slew of other stuff. But as I moved from coder to whatever-I-am-now, I stopped following along with the new stuff Google and others kept putting out there. I mean, I’m vaguely aware of Google App Engine and something called Google Apps Script, but I hadn’t spent any real time looking at them.

But as any good IT professional should, I’m not afraid to do a little scripting here and there to accomplish a task. It’s fairly common to get a CSV export that needs a little massaging or to find a perfect API to mash up with some other data.

The other day, I found myself working with some data in a Google Spreadsheet and thinking about one of my most common scripting tasks – appending data to existing data. For instance, I start with something like a list of users and want to know whether they are faculty, staff, or student.

| netid | name | Affiliation |
|---|---|---|
| cgrundy1 | Chas Grundy | ??? |
| katie | Katie Rose | ??? |

Aside: Could I get this easily by searching EDS? Sure. Unfortunately, my data set might be thousands of records long and while I could certainly ask our friends in identity management to get me the bulk data I want, they’re busy and I might need it right away. Or at least, I’m impatient and don’t want to wait. Oh, and I need something like this probably once or twice a week so I’d just get on their nerves. Moving on…

So that’s when I began to explore Google Apps Script. It’s a Javascript-based environment to interact with various Google Apps, perform custom operations, or access external APIs. This last part is what got my attention.

First, let’s think about how we might use a feature like this. Imagine the following function:

=getLDAPAttribute("cgrundy1","ndPrimaryAffiliation")

If I passed it the NetID and attribute I wanted, perhaps this function would be so kind as to go fetch it for me. Sounds cool to me.

Now the data I want to append lives in LDAP, but Javascript (and GA Script) can’t talk to LDAP directly. But Google Apps Script can interpret JSON (as well as SOAP or XML), so I needed a little LDAP web service. Let’s begin shaving the yak. Or we could shave a wookiee.

Google Apps Script and LDAP API flowchart

First, I cobbled together a quick PHP microapp and threw it into my Notre Dame web space. All it does is take two parameters, q and attribute, query LDAP, and return the requested attribute in JSON format.

Here’s an example:

index.php?q=cgrundy1&attribute=ndPrimaryAffiliation

And this returns:

{
  "title": "LDAP",
  "attributes": {
    "netid": "cgrundy1",
    "ndprimaryaffiliation": "Staff"
  }
}

View the full PHP script here

For a little extra convenience, the script allows q to be a NetID or email address. Inputs are sanitized to avoid LDAP injections, but there’s no throttling at the script level so for now it’s up to the LDAP server to enforce any protections against abuse.

Next, the Google Apps Script needs to actually make the request. Let’s create the custom function. In Spreadsheets, this is found under Tools > Script Editor.


function getLDAPAttribute(search, attribute) {
  search = search.toLowerCase();
  attribute = attribute.toLowerCase();
  var attr_value = "";
  var url = 'https://www3.nd.edu/~cgrundy1/gapps-ldap-test/?'
      + 'q=' + search
      + '&attribute=' + attribute;
  var response = UrlFetchApp.fetch(url);
  var json = response.getContentText();
  var data = JSON.parse(json);
  attr_value = data.attributes[attribute];
  return attr_value;
}

This accepts our parameters, passes them along to the PHP web service in a GET request, parses the response as JSON, and returns the attribute value.

Nota bene: Google enforces quotas on various operations including URL Fetch. The PHP script and the Google Apps Script function could be optimized, cache results, etc. I didn’t do those things. You are welcome to.

Anyway, let’s put it all together and see how it works:

Screenshot of custom formula in action

Well, there you have it. A quick little PHP microapp, a simple custom Javascript function, and now a bunch of cool things I can imagine doing with Google Apps. And just in case you only clicked through to this article because of the title:

<joke>I tried the Wookiee Steak, but it was a little chewy.</joke>

4 Projects


To focus our energy on understanding the toolset that Amazon Web Services provides, we are concentrating on four specific initiatives in the first quarter of 2014.  These include:

  • AAA High Availability for Cloud Services, which will remove the campus dependency for off-campus users accessing SaaS services, including Box, Google, Innotas, and Concur, among others.
  • Backups, which will allow us to explore the feasibility of using a combination of S3 and Glacier for backups.
  • ModoLabs, which will host the campus mobile application and is not tightly coupled to existing campus services.
  • Conductor, which entails moving from an existing hosting provider to Amazon Web Services.  For perspective, Conductor hosts more than 360 of the University’s websites, including oit.nd.edu.  For monthly statistics about Conductor’s usage, refer to the nd.edu blog.

Each of these four projects will deliver value to Notre Dame and the students, faculty, and staff whom we serve.  These projects will also serve as the vehicle through which our operational cloud model, leveraging Amazon Web Services, gets defined.

In a governance working session this afternoon, we made progress on defining the standards for tagging machine instances.  As we approach the holiday break, we look forward to resting, reflecting, and returning, energized to make tremendous strides in 2014.

Onward!

AWS in Classroom

The ability to control provisioning of an entire development stack in AWS is not just a fantastic opportunity for the enterprise; it is also a great way to let students learn using infrastructure that might otherwise be prohibitively expensive and difficult to procure. Understanding this potential to facilitate learning and train a new generation of cloud-native developers, Amazon offers educational grants for use of AWS services. I started using the program about a year ago for my Database Topics class in the Mendoza College of Business, and Chris Frederick is looking into it for his NoSQL/Big Data class.

It gives students complete hands-on experience with servers and services without the issues of spinning up servers in-house!

A snapshot of the available services is below.  The main services I have used in the RDBMS class are EC2 and RDS.  Each student was able to have their own Oracle DB instance with DBA privileges, so they could run their own database and set up users, privileges, and roles, which would be much more difficult to do in a shared environment.  They could then work on projects without fear of bringing down another student’s database.  One of the other projects I did in a class, with the help of Xiaojing Duan, was to set up a PHP server to show integration with Facebook.  As you can see from the list below, there are a LOT more services available for classroom use!

[Screenshot: AWS Management Console service list, 11-21-2013]
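
To give a sense of how little effort each student environment takes, here is a hedged sketch of provisioning a per-student Oracle database through the AWS CLI; the identifier, instance class, storage, and credentials are made up, and the engine and license options available under an educational grant may differ:

aws rds create-db-instance \
    --db-instance-identifier student01-oracle \
    --engine oracle-se2 \
    --license-model license-included \
    --db-instance-class db.t3.small \
    --allocated-storage 20 \
    --master-username student01 \
    --master-user-password 'ChangeMe-12345'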