Higher Education Needs the Public Cloud

Today is an exciting time to be a part of higher education IT! We are participating in the biggest transformation in our field since the client/server computing model displaced mainframes – the adoption of public cloud computing. The innovation, flexibility, cost effectiveness and security that the public cloud brings to our institutions will permanently change the way that we work. This new technology model will render the construction of on-campus data centers obsolete and transform academic and administrative computing for decades to come.

Why is this transformation happening? We’ve reached a tipping point where network speeds allow computing resources to be consolidated in a way that lets large providers achieve massive economies of scale. For the vast majority of workloads, it really doesn’t matter whether computing power sits down the hall or across the country. Computing infrastructure is now a commodity for us to leverage rather than an art for us to master. We have the opportunity to add more value higher up in the stack by becoming integrators and innovators instead of hardware maintainers.

We Need the Cloud’s Disruptive Innovation

The reality is that cloud providers can simply innovate faster than we can. Our core mission is education and research – not information technology. The core mission of cloud providers is providing rock solid infrastructure services that make IT easier. How can we possibly compete with this laser-like focus? Better question – why would we even want to try? Instead of building data centers that compete with cloud providers, we can leverage the innovations they bring to the table and ensure that our laser-like focus is in the right place – on our students and faculty.

As an example, consider the automatic scaling capabilities of cloud providers. At Notre Dame, we leveraged Amazon Web Services’ autoscaling capability to transform the way we host the University website. We now provision exactly the number of servers required to support our site at any given time and deprovision servers when they are no longer needed. Could we have built this autoscaling capability in our own data center? Sure. The technology has been around for years, but we hadn’t done it because we were focused on other things. AWS’ engineering staff solved that for us by building the capability into their product.
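
For the technically curious, here is a rough sketch of what that looks like in code. It is purely illustrative, written against the boto 2 autoscaling API – the AMI ID, names, and sizes are placeholders, not our actual configuration.

```python
# Illustrative sketch only: a web tier that scales between 2 and 10 servers,
# written against the boto 2 autoscaling API. The AMI ID, group names, and
# sizes are placeholders, not Notre Dame's actual configuration.
import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup, ScalingPolicy

conn = boto.ec2.autoscale.connect_to_region('us-east-1')

# How each web server should be built.
lc = LaunchConfiguration(name='www-launch-config',
                         image_id='ami-12345678',
                         instance_type='m3.medium',
                         security_groups=['www'])
conn.create_launch_configuration(lc)

# Keep between 2 and 10 servers registered with the load balancer.
group = AutoScalingGroup(group_name='www-asg',
                         availability_zones=['us-east-1a', 'us-east-1b'],
                         launch_config=lc,
                         load_balancers=['www-elb'],
                         min_size=2, max_size=10)
conn.create_auto_scaling_group(group)

# Add one server at a time when the scale-up alarm fires.
scale_up = ScalingPolicy(name='www-scale-up', adjustment_type='ChangeInCapacity',
                         as_name='www-asg', scaling_adjustment=1, cooldown=300)
conn.create_scaling_policy(scale_up)
```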

We Need the Cloud’s Unlimited Capacity and Flexibility

The massive scale of public cloud infrastructures makes them appear to have essentially unlimited capacity from our perspective. Aside from a small set of extreme high performance computing applications, it’s hard to imagine a workload coming out of our institutions that a major cloud provider couldn’t handle on a no-notice basis. We have the ability to quickly provision massive computing resources, use them for as long or short a duration as necessary, and then quickly deprovision them.

The beauty of doing this type of provisioning in the public cloud is that overprovisioning becomes a thing of the past. We no longer need to build capacity for an uncertain future demand – we can simply add resources as they are needed.

We Need the Cloud’s Cost Effectiveness

Cloud solutions are cost effective for two reasons. First, they allow us to leverage the massive scale of cloud providers. Gartner estimates that the public cloud market reached $131 billion in spend in 2013. The on-campus data centers of all higher education institutions combined constitute a tiny fraction of that size. When companies like Amazon, Google and Microsoft build at macro-enterprise scale, they are able to generate a profit while still passing on significant cost savings to customers. The history of IaaS price cuts by AWS, Google and others bears this out.

The second major cost benefit of the public cloud stems from the public cloud’s “pay as you go” model. Computing no longer requires major capital investments – it’s now available at per-hour, per-GB and per-action rates. If you provision a massive amount of computing power to perform serious number crunching for a few hours, you pay for those hours and no more. Overprovisioning is now the responsibility of the IaaS provider and the costs are shared across all customers.

We Need the Cloud’s Security and Resiliency

Security matters. While some may cite security as a reason not to move to the cloud, security is actually a reason to make the move. Cloud providers invest significant time and money in building highly secure environments. They are able to bring security resources to bear that we can only dream about having on our campuses. The Central Intelligence Agency recently recognized this and made a $600M investment in cloud computing. If IaaS security is good enough for the CIA, it should be good enough for us. That’s not to say that moving to the cloud is a silver bullet for security – we’ll still need a solid understanding of information security to implement our services properly in a cloud environment.

The cloud also simplifies the creation of resilient, highly available services. Most providers operate multiple data centers in geographically diverse regions and offer toolkits that help build solutions that leverage that geographic diversity. The Obama for America campaign discovered this when they picked up and moved their entire operation from the AWS eastern region to a west coast region in hours as Superstorm Sandy bore down on the eastern seaboard.

Higher education needs the cloud. The innovation, flexibility, cost effectiveness and security provided by public cloud solutions give us a tremendous head start on building tomorrow’s technology infrastructure for our campuses. Let’s enjoy the journey!

Originally published on LinkedIn November 12, 2014

Avoiding Waterfall Governance

A brief follow-up to the post I wrote on governance topics.

As we operationalize the cloud, we will eventually get to a place where we have solid policies and process around everything on our governance to-do list and more. However, I think it’s critical, as Sharif put it, to “think like a startup” while we “operate like an enterprise.”

The first thing any programmer learns about managing a development lifecycle is to forget “waterfall.”  Trying to nail down 100% of the design before coding is not only a waste of time; it’s actively damaging to your project, because changes inevitably come later in the process, when you’re too tied down by earlier design decisions to adapt.  I’m sure no one in this organization doubts the value of agile, iterative coding.  We do our best to establish requirements early on, often broadly, refining as we go.  We get feedback early and often to constantly guide our design toward harmony with the customer’s changing understanding of the software as it grows.

We should apply the same strategy toward our cloud initiative.  Because it’s so easy to build and tear down infrastructure, and because our host and application deployments will be scripted, automated, and version-controlled, we have the luxury of trying things, seeing how they work, and shifting our strategy in a tight feedback loop.  Our governance initiative is very important, but it’s also important that it’s not a “let’s decide, then do” process.  That’s why I didn’t just ask for people to come back with designs on paper for things like IAM roles and infrastructure.  I asked that they try things out and return with demos for the group.  Let’s be iterative; let’s be agile.  Let’s learn as we go and build, build, build, until we get where we want to be.

Governance To-Do List

It’s no secret that we are making a push to the cloud.  As Sharif noted in his last blog post, there are many compelling reasons to do so, financially, technically, and organizationally.  However, no one should be under the impression that we are forging ahead thoughtlessly!  We need to come together to understand the many implications of this move and how it affects nearly every part of our jobs.

To this end, we convened our first meeting of the ND IaaS Design Governance group.  In the room were representatives from system administration, networking, architecture, and information security.  On the agenda: a long list of open questions about the implementation, policy, and process of moving into Amazon.

I’m happy to report that we made excellent progress on this list.  For the first half-dozen, we either made policy decisions on the spot, assigned ownership to certain groups, or assigned volunteers to learn more and present a recommendation at the next meeting.  As we continue to meet biweekly, I’ll spotlight the decisions made on this blog.  For now, here’s a glance at the broad topics.

  1. Designing / managing Infrastructure
  2. Security Groups
  3. Account Management
  4. Key Management
  5. Multitenancy
  6. Managing Instances
  7. Version Control
  8. Development
  9. Provisioning Instances
  10. Deployment Policy / Process
  11. Tagging / Naming Conventions
  12. Scripting
  13. Connecting to ND resources
  14. Licensing
  15. Object Ownership
  16. Budget
  17. Other

For the full document with sub-bullets and decisions made from the first meeting, see this document (ask me for access).  This is certainly a first draft.  I can already think of a number of things we should add, not the least of which is how our future change control / RFC process will work.

But things are already happening!  Before I left Friday, I stood in on an impromptu design session with Bob, John, Jaime, and Milind.

[Photo: is this what a think tank looks like?]

[Photo: this could be your next data center]

So thank you to everyone on the committee for your dedication and excitement as we move forward!

CloudFormation and the Challenge of Governance

aka a maybe possibly better way to script a stack

As I demonstrated in a recent post, you can script just about anything in AWS from the console using the Boto package for Python.  I knew that in order to stand up EC2 instances, I was going to need a few things: a VPC, some subnets, and security groups.  Turns out there are a few other things I needed, like an internet gateway and a routing table, but that comes later.  As I wrote those Boto scripts, I found myself going out of my way to do a few things:

  • provide helper functions to resolve “Name” tags on objects to their corresponding ID (or to just fetch an object by its Name tag, depending on which I needed)
  • check for the existence of an object before creating it
  • separate details about the infrastructure being created into config files

The first one was critical, because it doesn’t take long for your AWS console to become a nice bowl of GUID soup.  The first time you see a table like this, you know you’ve got a problem:

[Screenshot: what is this i don’t even]

I’ve obscured IDs to protect… something.  Resources that probably won’t exist tomorrow.  But believe me, changing those IDs took a long time, which is half my point:  let’s get some name tags up in here.

The challenge is that you have to be vigilant about creating tags on objects with the key “Name,” and then you have to go out of your way to code support for that, because tagging is entirely optional.  You want friendly names?  Implement it yourself.  This is your little cross to bear.

The second task was aimed at making the scripts idempotent.  It’s very useful when building anything like this to be able to run it over and over without breaking something.
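
Here’s a rough sketch of the pattern that covers both of those first two points using boto – not our production code; the region, names, and CIDR block are placeholders. Look a resource up by its Name tag, and only create it if it doesn’t already exist:

```python
# Hypothetical sketch of "fetch by Name tag, create only if missing" with boto 2.
# The region, VPC name, and CIDR block are placeholders.
import boto.vpc

conn = boto.vpc.connect_to_region('us-east-1')

def get_vpc_by_name(name):
    """Return the VPC tagged Name=<name>, or None if it doesn't exist."""
    vpcs = conn.get_all_vpcs(filters={'tag:Name': name})
    return vpcs[0] if vpcs else None

def ensure_vpc(name, cidr_block):
    """Idempotent create: running this twice still leaves exactly one VPC."""
    vpc = get_vpc_by_name(name)
    if vpc is None:
        vpc = conn.create_vpc(cidr_block)
        vpc.add_tag('Name', name)  # friendly names only exist if we add them
    return vpc

vpc = ensure_vpc('nd-sandbox-vpc', '10.0.0.0/16')
```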

The third task was an attempt to decouple the infrastructure from boto (and optimistically, any particular IaaS) and plan for a “configuration as code” future.  How nice would it be to commit a change to the JSON for a routing table and have that trigger an update in AWS?
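
Continuing the sketch above, the general idea looks something like this – the file name and fields of the hypothetical subnets.json are made up. Commit a change to the config, re-run the script, and only the missing pieces get created:

```python
# Continuing the sketch above: subnet definitions live in a JSON config file
# (file name and fields are hypothetical), and the script only creates what
# doesn't already exist.
import json

with open('subnets.json') as f:
    subnets = json.load(f)
# e.g. [{"name": "web-1a", "cidr": "10.0.1.0/24", "az": "us-east-1a"}, ...]

vpc = ensure_vpc('nd-sandbox-vpc', '10.0.0.0/16')
for cfg in subnets:
    if not conn.get_all_subnets(filters={'tag:Name': cfg['name']}):
        subnet = conn.create_subnet(vpc.id, cfg['cidr'],
                                    availability_zone=cfg['az'])
        subnet.add_tag('Name', cfg['name'])
```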

Enter CloudFormation

So all this was working rather well, but before I delved too deep, I knew I needed to check out AWS CloudFormation.  Bob Winding had described how it can stand up a full stack of resources; in fact, it does many of the things I was already trying to do:

  • describe your infrastructure as JSON
  • reference resources within the file by a friendly name
  • stand up a whole stack with one command

In addition, it adds a lot of metadata tags to each object that allow it to easily tear down the whole stack after it’s created.  As an added bonus, it provides a link to the AWS price calculator before you execute, giving you an indication of how much this stack is going to cost you on a month-to-month basis.

Nice! This is exactly what we want, right?  The resource schema already exists and is well documented.  Most of those things correspond to real-life hardware or configuration concepts!  I could give this to any sysadmin or network engineer, regardless of AWS experience, and they should be able to read it right away.
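
To make the “one command” point concrete, here’s a hypothetical boto sketch – the stack name and template file name are made up – that validates a template and launches the whole stack:

```python
# Hypothetical sketch: validate a template and stand up the whole stack with
# one call, using boto 2. The stack name and file name are made up.
import boto.cloudformation

cfn = boto.cloudformation.connect_to_region('us-east-1')

with open('vpc-stack.json') as f:
    template_body = f.read()

cfn.validate_template(template_body=template_body)  # fail fast on malformed JSON
stack_id = cfn.create_stack('nd-sandbox-vpc', template_body=template_body)
print(stack_id)
```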

The Catch

I like where this is going, and indeed, it’s a lovely solution.  I have a few concerns — most of which I believe can be handled with good policy and processes.  Still, they present some challenges, so let’s look at them individually:

1. One stack, one file.

You define a stack in a single JSON file.  The console / command you use to execute it only runs one at a time, which you must name.  The stack I was trying to run starts with a fundamental resource, the VPC.  I can’t just drop and recreate that willy-nilly.  It’s clear that there must exist a line between “permanent” infrastructure like VPCs and subnets and the more ephemeral resources at the EC2 layer.  I need to split these files, and not touch the VPC once it’s up.  However…

2. Reference-by-name only works within a single file

As soon as you start splitting things up, you lose the ability to reference resources by their friendly names.  You’re back to referencing relationships by ID.  This is not just annoying:  it’s incredibly brittle.  AWS resource IDs are generated fresh when the resource is created, so the only way those references stay meaningful is if the resource you depend on is created once and only once.  That’s not always what we want, and it’s extra problematic because…

3. CloudFormation is not idempotent (but maybe that’s good)

Run that stack twice, and you’ll get two of everything (usually).  Now, this is actually a feature of CloudFormation.  Define a stack, and then you can create multiple instances of it.  If you want to update an existing stack, you can declare that explicitly.  However, some resources have a default “update” behavior of “drop and recreate.”  So if it’s a resource with dependencies, things get tricky.  The bottom line here is that we have to be smart about what sorts of resources get bundled into a “stack,” so we can use this behavior as intended — to replicate stacks.  And finally…

4. It’s not particularly fast

It just seems a bit slow.  We’re talking about two minutes to go from nothing to a VPC with a reachable EC2 instance, which isn’t terrible, but my Boto scripts are a good deal faster.

There is a lot to like about CloudFormation.  You can accomplish quite a bit without much trouble, and the resulting configuration file is easy to read and understand.  Nothing here is a showstopper, as long as we understand the implications of our tool and process choices.  We can always return to boto or AWS CLI if we need more control over the build process.
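
For what it’s worth, one way to take the edge off concern #2 – purely a sketch, not something we’ve settled on – is to have the long-lived VPC stack publish its IDs as Outputs and have the ephemeral stacks declare matching Parameters, with a little glue code passing the values across at launch time. The stack names and output keys below are hypothetical:

```python
# Sketch only: the long-lived VPC stack exposes its IDs as Outputs, and the
# ephemeral application stack declares matching Parameters. Names and output
# keys are hypothetical.
import boto.cloudformation

cfn = boto.cloudformation.connect_to_region('us-east-1')

base = cfn.describe_stacks('nd-sandbox-vpc')[0]
outputs = {o.key: o.value for o in base.outputs}  # e.g. {'VpcId': 'vpc-...'}

with open('app-stack.json') as f:
    template_body = f.read()

cfn.create_stack('nd-sandbox-app',
                 template_body=template_body,
                 parameters=[('VpcId', outputs['VpcId']),
                             ('SubnetId', outputs['SubnetId'])])
```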

The Challenge of Governance

I don’t believe any of the difficulties outlined above are unique to CloudFormation.  Keeping track of created resources and their various dependencies, deciding on the relative permanence of stack layers, and implementing a solution where parts of the infrastructure can truly be run as “configuration as code” are all issues that we must tackle as we get serious about DevOps practices.  These are just the sorts of questions I have in mind for the first meeting of the DevOps / IaaS Governance Team this week.

We should feel encouraged that we’re not pioneers here!  Let’s reach out to friends and colleagues in higher ed and in industry to see how these issues have played out before, and what solutions we may be able to adopt.  When we know more, we’ll be that much more confident to proceed, and I can write the next blog post on what we have learned.

OIT staff can view my CloudFormation templates here.


Thoughts on AWS

The more I think about it, the more I realize that we’re witnessing a transformation in IT. Despite the AWS tattoo and kool-aid mustache, I’m not married to AWS. Nothing’s perfect, and AWS is no exception. There are always questions about vendor lock-in and backing out or shifting services as the landscape changes.

Vendor lock-in is a possibility with every IT partnership, and there are ways to mitigate those risks. If Cisco, Red Hat, IBM, or Oracle went out of business, it would have a huge impact; if any of them changed its pricing structure, the same would be true. In a practical sense, you usually have time to react. AWS might get purchased, or the CEO might change, but it’s unlikely that we’ll be informed on Monday that they’re shutting down and we have three days to relocate services. With AWS, there’s no upfront cost:  you pay as you go. A provider can’t succeed with that model without a rock-solid service. Furthermore, there’s no financial lock-in:  no multi-million-dollar contracts or support agreements, which are probably the most common and painful cases.  AWS frees us from those kinds of lock-in.

AWS isn’t outsourcing — it’s a force multiplier. I guarantee that OIT engineers, with minimal additional training, can build more secure systems with better architectures in Amazon than they could on premises, for a fraction of the cost and time. Infrastructure is hard. It takes massive resources. AWS has the economy of scale and has been able to execute. The result is a game-changing, historic transformation of IT. Really, look at it:  it’s that profound. It enables experimentation and simplification. Take sizing systems. Just rough out your scaling, pick something larger than you need, and scale it back after two months of usage data. Calculate the low nominal use and scale it back further, then auto-scale for the peak. Script it and reproduce everything: network, servers, security policies, all the things. That kind of architecture is an engineer’s dream, for pennies per hour, now, this instant, not years from now.

AWS or not, there is no question that this is the future of IT. Except for low-latency, high-performance, or very specialized applications that require a local presence, most companies will, in my opinion, shed their data centers for providers like AWS. My guess:  it’s going to come faster than we think. It’s not that we can’t accomplish these things, or that we don’t understand them. We simply don’t have the resources. Take long-term storage and archiving. Glacier and S3 offer eleven 9s of durability. Elasticity? I can upload 500 TB into Glacier tomorrow, or a petabyte, no problem. Spin up 10 machines, 100 machines, 1000 machines:  done. Chris did a data warehouse in a couple of days. There’s really no turning back.

I suspect we’ll be cautious, as we should be, moving services to AWS. But my forecast is that with each success, it will become apparent that we should do more. It’s going to become so compelling that you can’t look away.

I’m really glad we’ve picked AWS for IaaS. It is a huge opportunity.

The Start of Something Big

The spark of DevOps at ND became a flame early this week as the OIT welcomed Dan Ryan to McKenna Hall for a two-day bootcamp on DevOps and Infrastructure-as-a-Service.  Over 50 OIT professionals from networking, product management, custom development, virtualization, senior leadership, information security, data center operations, identity/access management, and architecture gathered together to learn about DevOps and decide upon an IaaS provider.

Day One kicked off with an introduction by CIO Ron Kraemer, who challenged us to seize the “historic opportunity” represented by cloud computing.

Continuing a discussion started with his appearance at the Sept 2013 ND Mobile Summit, Dan made a compelling case not only for migrating our data center to the cloud, but for doing it using Amazon Web Services.

[Photo: Dan Ryan dropping some #IaaS knowledge]

[Photo: AWS easily surpasses their closest competitor, Rackspace]

The morning concluded with the assembly agreeing that Amazon Web Services is our infrastructure-as-a-service provider of choice, based on organizational capability, price, and position in the market.

That afternoon and all of Day Two saw working groups spring up across OIT departments to tackle the practical architecture, design, and implementation details of a few specific projects, including AAA, Backup, and the Data Governance Neo4j application.  Bobs Winding and Richman led a discussion of how exactly to lay out VPCs, subnets, and security groups.

[Photo: VPC architecture whiteboard. A virtual data center is born.]

Assisted by Milind Saraph, Chris Frederick and I dove into Boto, the Python SDK for AWS scripting, and started building automation scripts for realizing that early design.  Evan Grantham-Brown stood up a Windows EC2 instance and created a public, Amazon-hosted version of the Data Governance dictionary.

[Photo: Just look at all that inter-departmental collaboration]

Around 2pm, we were joined via WebEx by AWS Solutions Architect Leo Zhadanovsky, who talked us through some particular details of automating instance builds with CloudInit and Puppet, as detailed in this YouTube presentation.
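
To give a flavor of the approach (a hypothetical sketch, not Leo’s actual example – the AMI ID, key name, and Puppet master hostname are placeholders), the user-data handed to a new instance can bootstrap Puppet on first boot:

```python
# Hypothetical sketch: launch an instance whose user-data script installs and
# runs Puppet on first boot (cloud-init executes it). The AMI ID, key name,
# and Puppet master hostname are placeholders.
import boto.ec2

user_data = """#!/bin/bash
yum install -y puppet
puppet agent --server puppet.example.edu --onetime --no-daemonize
"""

conn = boto.ec2.connect_to_region('us-east-1')
conn.run_instances('ami-12345678',
                   instance_type='t1.micro',
                   key_name='oit-admin',
                   user_data=user_data)
```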

As the day came to a close, the conversation turned to governance, ownership, and process documentation. This Google Doc outlines next steps for continuing to roll out AWS and DevOps practices in many areas of OIT operations, and contains the roster of a Design Review Board to guide the architecture, implementation, and documentation of our new data center in the cloud.

[Whiteboard photo: The goal: continuous integration. Automated builds, testing, and deployment.]

Aside from the decision to adopt AWS as our IaaS provider, it was really encouraging to see so many people cross departmental lines to try things out and make things happen.  Here’s to making every day look like these.  Many thanks go to Sharif Nijim for conceiving and coordinating this event, to Mike Chapple and OIT leadership for supporting the idea, and especially to Dan for showing us the way forward.  Let’s do it!

[Meme: ALL OF THEM!  (thanks @dowens)]

Artifacts from the DevOpsND workshop are available here.