About Sharif Nijim

I'm a husband, dad, son, driver, and biker. I tweet as @snijim

#NDCloudFirst

Today is the most exciting day of my modern professional life.  It’s the day we are announcing to the world our goal of migrating 80% of our IT service portfolio to the cloud over the next three years.

Yes, that’s right, 80% in 3 years.  What an opportunity!  What a challenge!  What a goal!  What a mission for a focused group of IT professionals!

The following infographic illustrates how we will prioritize solutions as we pursue this goal.


Opportunistically, we will select SaaS products first, then PaaS products, followed by IaaS, with solutions requiring on-premises infrastructure reserved for situations where there is a compelling need for geographic proximity.

The layer at which we, as an IT organization, can add value without disrupting university business processes is IaaS.  After extensive analysis, we have selected Amazon Web Services as our IaaS partner of choice, and are looking forward to a strong partnership as we embark upon this journey.

Already documented on this blog are success stories Notre Dame has enjoyed migrating www.nd.edu, the infrastructure for the Notre Dame mobile app, Conductor (and its ~400 departmental web sites), a copy of our authentication service, and server backups into AWS.  We have positioned ourselves to capitalize on what we have learned from these experiences and proceed with migrating the rest of the applications which are currently hosted on campus.

So incredibly, incredibly fired up about the challenge that is before us.

If you want to learn more, please head over to Cloud Central: http://oit.nd.edu/cloud-first/

Onward!

Just because you can, doesn’t mean you should


One of the applications that is a shoo-in candidate for migration has a small-time usage profile.  We are talking 4 hits/day.  No big deal; it exists to serve a business process.

It needs to interface with enterprise data resident in our Banner database.  No worries there; the data this app needs access to is decoupled via web services.  Now let’s swing our attention to the app’s transactional data storage requirements.

First question – does it need any of the Oracle-specific feature set?  No.  So, let’s not use Oracle – no need to further bind ourselves in that area.  Postgres is a reasonable alternative.

OK, so, RDS?  Yes please – no need to start administering a Postgres stack when all we want to do is use it.

Multiple availability zones?  Great question.  Fantastic question!  Glad you asked.

Consider the usage profile of this app.  4 records per day. 4.  Can the recovery point/time objectives be met with snapshotting?  Absolutely.  Is that more cost-effective than running a multi-AZ configuration?  Yes.

Does it make sense for this application?

Yes.
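The snapshot-versus-Multi-AZ reasoning can be captured in a back-of-envelope sketch.  The hourly rate below is a hypothetical placeholder, not a current AWS price; the point is the structure of the comparison, since Multi-AZ runs a synchronous standby that roughly doubles instance cost while snapshot storage for a tiny database is negligible.

```python
# Back-of-envelope cost comparison: single-AZ RDS with automated
# snapshots vs. a Multi-AZ deployment. Rates are illustrative only.

HOURS_PER_MONTH = 730

def monthly_cost(instance_hourly_rate, multi_az=False):
    """Multi-AZ maintains a synchronous standby in a second AZ,
    roughly doubling the instance-hours billed."""
    multiplier = 2 if multi_az else 1
    return instance_hourly_rate * HOURS_PER_MONTH * multiplier

single_az = monthly_cost(0.05)                 # hypothetical $0.05/hr
multi_az = monthly_cost(0.05, multi_az=True)
print(f"Single-AZ + snapshots: ${single_az:.2f}/month")
print(f"Multi-AZ:              ${multi_az:.2f}/month")
```

For an app writing 4 records a day, a daily snapshot already meets the recovery point objective, so the doubled spend buys nothing this workload needs.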

Thank you Amazon for providing a fantastic set of tools, and thank you to the #NDCloudFirst team for thinking through using those tools appropriately.

The Speed of Light

How fast is it really?  In the course I teach, students have the opportunity to interact with a database, taking their logical models, turning them into physical designs, and finally implementing them.

Up until this semester, I have made use of a database that is local to campus.  The ongoing management and maintenance of that environment is something which is of no particular interest to me – I just want to use the database.  Database-as-a-Service, as it were.  As in, Amazon Relational Database Service.

Lucky for us all, Amazon has a generous grant program for education.  After a very straightforward application process, I was all set to experiment.

To baseline, I executed a straightforward query against a small, local table.  Unsurprisingly, the response time was lightning-fast.


Using RDS, I went ahead and created an Oracle database, just like the one I have typically used on campus.  After setting up a VPC, subnet groups, and creating a database subnet group, I chose to create this instance in Amazon’s N. Virginia Eastern Region.  Firing off the test, we find that, yes, it takes time for light to travel between Notre Dame’s campus and northern Virginia:


Looks like it added about 30 milliseconds.  I can live with that.

Out of curiosity, how fast would it be to the west coast?  Say, Amazon’s Oregon Western Region?  Fortunately, it is a trivial exercise to find out.  I simply snapshotted the database and copied the snapshot from the eastern region to the west.  A simple restore and security group assignment later, and I could re-execute my test:


Looks like the time added was roughly double – 60 milliseconds.

Is that accurate?  According to Google Maps, it looks like yes indeed, Oregon is roughly twice as far away from Notre Dame as Virginia.  The speed of light doesn’t lie.
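The physics sanity-checks out with a few lines of arithmetic.  Light in fiber travels at roughly 200,000 km/s (about two-thirds of c, due to the fiber’s refractive index), and the distances below are rough great-circle estimates from South Bend, not measured fiber routes:

```python
# Theoretical minimum round-trip time over fiber for each region.
# Observed latencies will always exceed this floor because of routing,
# switching, and non-straight-line fiber paths.

FIBER_KM_PER_MS = 200.0  # ~200,000 km/s expressed in km per millisecond

def min_rtt_ms(distance_km):
    """Best-case round trip: out and back at the speed of light in fiber."""
    return 2 * distance_km / FIBER_KM_PER_MS

routes = {
    "South Bend -> N. Virginia (east)": 850,   # approx. km, great circle
    "South Bend -> Oregon (west)":      2900,  # approx. km, great circle
}

for name, km in routes.items():
    print(f"{name}: at least {min_rtt_ms(km):.1f} ms")
```

The observed 30 ms and 60 ms sit comfortably above these floors, and the roughly 2x ratio between the two regions matches the roughly 2x ratio in distance.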

So, what did I learn?  First, imagine for a moment what I just did: instantiated an Oracle database on the east coast and on the west coast.  From nothing!  No servers to order, no routers to buy, no disks to burn in, no gnomes to wire equipment together, no Oracle Universal Installer to walk through.  I still get a thrill every time I use Amazon services and think about what is actually happening.  I can already see myself at 70, regaling people with stories about what it was like to actually see a data center.

OK, deep breath.

Second, is 30 milliseconds acceptable?  For my needs, absolutely.  My students can learn what they need to, and the 30 millisecond hit per interaction is not going to inhibit that process.  It’s certainly a reasonable price to pay, especially considering there is nothing to maintain.

What is the enterprise implication?  Is 30 milliseconds going to be a problem?  An obstacle that inhibits business processes?  We shall see.  For local databases paired with remote web/application servers, perhaps.  Perhaps not.

This is why we test, remembering that despite the remarkably amazing toolset AWS represents, we are still bound by the speed limit of light.

AWS Midwest Region, anyone?

Positive Trajectory

Bob and I gave a brief update on where the OIT stands in terms of cloud strategy and the progress we have made with our IaaS provider of choice, Amazon Web Services.  It was essentially a highly summarized, non-technical version of the presentation we gave in D.C. earlier this year.

The most interesting thing that came out of the update was a follow-on conversation with one of the non-technical users on campus.  This person was not aware of the implications of moving Conductor and its associated sites to AWS.  We explained that her departmental website is a Conductor site, and as such, she is directly benefiting from the improved performance and availability.

It was in the context of this conversation that the beauty of what we are doing struck me.  Functionally, end users should not even realize that the infrastructure running their web presence or line of business applications has transitioned to AWS.  If well-executed, these users should continue to be oblivious to the changes being made, on their behalf and in their best interest.

The fact this person had no realization that anything changed is a testament to our success to date, because as stated previously, the performance essentially doubled for roughly half the cost.  All in the best interest of the University, allowing us to focus our efforts on initiatives which are core to the institutional mission.

Very, very excited about what is to come.

Onward!

AWS Public Sector Symposium

It has been a while since our last update, so now is the perfect time to give a little perspective as to where we stand with regards to our Amazon initiatives.

Amazon graciously invited Bob and me to present the Notre Dame AWS story at its fifth annual Public Sector Symposium.


We described how we got started with AWS in the context of migrating www.nd.edu 18 months ago.  Over the past 18 months, we have experienced zero service issues in this, our most reliable and resilient service offering.  Zero.

As mentioned in our spring 2014 update, when Notre Dame hosted the Common Solutions Group in May, we reviewed progress to date in the context of governance, people, and processes, along with our initiatives to grow our institutional understanding of how to operate in the cloud.

Bob and I had fun presenting, and were able to give a synopsis on:

  • Conductor Migration:  ND-engineered content management system that migrated to AWS in March 2014.  You can read all about Conductor’s operational statistics here.  The short story is that the migration was a massive success: improved performance, reduced cost, increased use of AWS tools (RDS, ElastiCache), and a sense of joy on the part of the engineers involved in the move.
  • ND Mobile App:  Instantiation of the Kurogo framework in providing a revamped mobile experience.  Zero performance issues associated with rollout.
  • AAA in the Cloud:  Replicating our authentication store in Amazon.  Using Route 53 to heartbeat against local resources so that in a failure condition, login.nd.edu redirects to the authentication store in Amazon for traffic originating from off campus.  This was tested in June and performed as designed.  Since the preponderance of our main academic applications, including Sakai, Box, and Gmail, are hosted off campus, the result of this effort is that we have separated service availability from local border network incidents.  If campus is offline for some reason, off-campus students, faculty, and staff are still able to use the University’s SaaS services.
  • Backups:  We are in the midst of testing Panzura, a cloud storage gateway product. Preliminary testing looks promising – look for an update on this ongoing testing soon.
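The AAA failover behavior described above can be modeled in a few lines.  This is an illustrative simulation of the routing decision, not the Route 53 API itself; the endpoint names are hypothetical, and the three-consecutive-failure threshold is Route 53’s default for marking a health check unhealthy.

```python
# Simulation of DNS failover for login.nd.edu: while the campus
# endpoint passes its health checks, queries resolve to campus; once
# enough consecutive checks fail, queries resolve to the AWS replica.

FAILURE_THRESHOLD = 3  # Route 53's default consecutive-failure count

def resolve(check_history, primary="login-campus", secondary="login-aws"):
    """Return the endpoint a failover record set would hand out, given
    the most recent health-check results (True = check passed)."""
    recent = check_history[-FAILURE_THRESHOLD:]
    unhealthy = len(recent) == FAILURE_THRESHOLD and not any(recent)
    return secondary if unhealthy else primary
```

The key property, which the June test confirmed in the real deployment, is that off-campus users keep reaching SaaS services even when the campus border network is down.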

We are absolutely thrilled with the progress being made, and look forward to continuing to push the envelope.  More to come!

Onward!

Posted in AWS

Why We Test

Despite design principles, despite best intentions, things can go wrong.  This is true even when all of the design parameters are quantifiable, as exhibited by a spectacular Formula 1 engine failure.  What happens when some of the design constraints are not fully understood?  Specifically, consider the following excerpt taken from Amazon’s page detailing instance specifications:

| Instance Family | Instance Type | Processor Arch | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | EBS-optimized Available | Network Performance |
|---|---|---|---|---|---|---|---|---|
| Storage optimized | i2.xlarge | 64-bit | 4 | 14 | 30.5 | 1 x 800 SSD | Yes | Moderate |
| Storage optimized | i2.2xlarge | 64-bit | 8 | 27 | 61 | 2 x 800 SSD | Yes | High |
| Storage optimized | i2.4xlarge | 64-bit | 16 | 53 | 122 | 4 x 800 SSD | Yes | High |
| Storage optimized | i2.8xlarge | 64-bit | 32 | 104 | 244 | 8 x 800 SSD | – | 10 Gigabit |
| Storage optimized | hs1.8xlarge | 64-bit | 16 | 35 | 117 | 24 x 2,048 | – | 10 Gigabit |
| Storage optimized | hi1.4xlarge | 64-bit | 16 | 35 | 60.5 | 2 x 1,024 SSD | – | 10 Gigabit |
| Micro instances | t1.micro | 32-bit or 64-bit | 1 | Variable | 0.615 | EBS only | – | Very Low |
10 Gigabit is a measure we can understand.  But what does Moderate mean?  How does it compare to High, or Very Low?  It is possible to find out, but only if we test.

According to Amazon’s page on instance types (emphasis added):

Amazon EC2 allows you to provision a variety of instances types, which provide different combinations of CPU, memory, disk, and networking. Launching new instances and running tests in parallel is easy, and we recommend measuring the performance of applications to identify appropriate instance types and validate application architecture. We also recommend rigorous load/scale testing to ensure that your applications can scale as you intend.

Using LoadUIWeb by SmartBear, we were able to take a given workload and run it on different instance types to gain a better understanding of the performance differences.  Running 100 simultaneous virtual users with a 9 millisecond think time between pages, we ran against an m1.small and an m1.medium.  Here are some screenshots to illustrate what we discovered:

M1.small performance test output:


M1.medium performance test output:


Looking at the numbers, we see that the response time of the small is roughly twice that of the medium.  As neither CPU, memory, nor disk activity was a constraint, this serves as a warning to us: the quality-of-service limitations placed on different instance types must be taken into consideration when doing instance size analysis.
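The shape of such a test is simple enough to sketch.  The harness below is an illustrative stand-in for the LoadUIWeb runs, not the tool we used: it fires a pool of concurrent virtual users at an endpoint and reports the mean response time.  The stub `hit` function simulates a fixed server delay so the sketch is self-contained; swapping in a real HTTP GET against each instance type yields the kind of comparison shown above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def hit(delay=0.01):
    """Stand-in for one page request; replace the sleep with a real
    HTTP GET to test an actual instance."""
    start = time.perf_counter()
    time.sleep(delay)  # simulated server response time
    return time.perf_counter() - start

def load_test(users=20, requests_per_user=5):
    """Run `users` concurrent workers issuing requests and return the
    mean observed response time in seconds."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        times = list(pool.map(lambda _: hit(),
                              range(users * requests_per_user)))
    return mean(times)

print(f"mean response: {load_test() * 1000:.1f} ms")
```

Running the same harness against two instance sizes and comparing the means is exactly the experiment that exposed the Moderate-versus-High network difference.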

When in doubt, test it out!

Our Finest Hour


Last Friday (1/24/2014), our campus experienced two service disruptions.  Both were service-provider issues: one with Box, the other with Google.  In both cases, service was restored in a timely manner, and both the Box and Google status pages were kept up to date.

What was required from OIT to resolve these interruptions?  Communication and vendor management.  We did not have to burn hours and hours of people time digging in to determine the root cause of these issues.  We did not have to bring in pizza and soda, flip charts and markers, bagels and coffee, and more pizza and more soda to fuel ourselves to deal with this issue.  We did not work late and long into the night, identifying and resolving the root cause.

Why?  Because we have vendors we trust, employing more engineers than we have humans in all of OIT, singularly focused on their product.

I was part of the team that worked through our Exchange issues last fall.  I will be the first to tell you that the week we spent getting to operational stability was a phenomenal team effort.  We had an engineer from Microsoft and all of the considerable talent OIT could bring to bear to solve our issues.  And solve them we did.  Yes, we worked late into the night.  Yes, we fueled ourselves with grease and caffeine.  Yes, we used giant post-its and all the typical crisis management tools.

No one on the team got much sleep or spent the typical evenings with their families; everyone displayed remarkable devotion to their profession and to our University.  Everyone on the team was elated when we solved a gnarly set of technical issues and reached operational stability.  And no one on the team had to relive that experience this past Friday.  They were able to focus on their core mission.

We have moved all the way up the stack of abstraction in the space offered by both Box and Google.  It is something to celebrate, as it allows us to relentlessly focus on delivering value to the students, faculty, and staff at Our Lady’s University.

 

A pause for reflection


With the holiday season upon us, we are blessed to work at an institution that shuts down for the better part of two weeks, creating a culture of rejuvenation.  For our part, we will be using this time to passively reflect upon what we have learned this fall, and the amazing things we will accomplish in 2014.

Wishing everyone a happy, healthy, and restful break.

Phenomenally exciting things coming in 2014!

4 Projects


In order to focus our energy on understanding the toolset that Amazon Web Services presents, we are focusing on four specific initiatives in the first quarter of 2014.  These include:

  • AAA High Availability for Cloud Services, which will remove the campus dependency for off-campus users accessing SaaS services, including Box, Google, Innotas, and Concur, among others.
  • Backups, which will allow us to explore the feasibility of using a combination of S3 and Glacier for backups.
  • ModoLabs, which will host the campus mobile application and is not tightly coupled to existing campus services.
  • Conductor, which entails moving from an existing hosting provider to Amazon Web Services.  For perspective, Conductor hosts more than 360 of the University’s websites, including oit.nd.edu.  For monthly statistics about Conductor’s usage, refer to the nd.edu blog.

Each of these four projects will deliver value to Notre Dame and the students, faculty, and staff whom we serve.  These projects will also serve as the vehicle through which our operational cloud model, leveraging Amazon Web Services, gets defined.

In a governance working session this afternoon, we made progress on defining the standards for tagging machine instances.  As we approach the holiday break, we look forward to resting, reflecting, and returning, energized to make tremendous strides in 2014.
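A tagging standard of the sort we worked on is easy to enforce mechanically.  The sketch below is illustrative only; the required tag keys are hypothetical placeholders, not the standard we settled on in the working session.

```python
# Validate that a machine instance carries every required tag before
# launch. Tag keys here are hypothetical examples of a standard.

REQUIRED_TAGS = {"Name", "Owner", "Environment", "CostCenter"}

def validate_tags(tags):
    """Return the set of required tag keys missing from an instance's
    tag dictionary; an empty set means the instance is compliant."""
    return REQUIRED_TAGS - tags.keys()

missing = validate_tags({"Name": "conductor-web-1", "Owner": "oit"})
print(f"missing tags: {sorted(missing)}")
```

A check like this, wired into the provisioning process, turns a governance document into something the tooling can enforce automatically.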

Onward!

Think like a startup, run like an enterprise


We live in an age where the IT landscape is drastically shifting.  Companies like Amazon, Rackspace, Google, and Microsoft are each building more server capacity daily than the University possesses in its entirety.  Think about that for a moment.  In a game where economy of scale wins, these companies are achieving massive economies simply due to the massive scale at which they create their infrastructure.

When more than one company has moved beyond the hypervisor to sourcing components directly and building their own servers, storage units, and networking equipment, it is clear that the future points towards global data center consolidation.  We would be fooling ourselves if we think we can create and manage more resilient, denser, geographically dispersed data centers than companies that employ geologists to ensure tectonic diversity.

In light of this shifting environment, it behooves us now more than ever to challenge our notions of what is possible, what makes sense, and how to operate.  We need to think lean, like a startup company.  Pay-as-you-go services like Amazon, PagerDuty, NewRelic, Cloudability, and 3Scale allow us to take advantage of technological advancements without entering into complex, multi-year contracts.  We can evaluate services as we use them, determining in real-time if they deliver on the technical and functional objectives we expect from them.  Practical experience trumps theory every time.

As we adopt these new services, we must focus internally on how that impacts the way we organize our work.  Processes which make sense in a local virtual/physical environment may not make sense as we move towards software-defined everything.  As we go through this organizational metamorphosis, we need to actively challenge business as usual to arrive at business as optimal.  This will allow us to derive maximum value from our vendors, optimize our organizational agility, improve the way we serve our constituents, provide exciting new opportunities for our staff, and control our costs.

Onward!