Did you know that Amazon had an outage last week? If you were surfing Amazon.com for a copy of The Beatles: Rock Band (good thing my kids don’t read this blog), you wouldn’t have noticed. However, if you were a firm that depends on Amazon Web Services (AWS) to provide on-demand processing capacity or data storage “in the cloud,” you probably noticed and may have had a nervous minute or two. Here is how AWS describes their business:
Amazon Web Services … provide[s] companies of all sizes with an infrastructure web services platform in the cloud. With AWS you can requisition compute power, storage, and other services–gaining access to a suite of elastic IT infrastructure services as your business demands them. With AWS you have the flexibility to choose whichever development platform or programming model makes the most sense for the problems you’re trying to solve. You pay only for what you use, with no up-front expenses or long-term commitments, making AWS the most cost-effective way to deliver your application to your customers and clients.
The appeal here (and for cloud computing in general) is that a client can buy what they need when they need it, reducing capital outlays and allowing the flexibility to have just the right level of capacity. One feature that Amazon emphasizes is its reliability. As their web site says, “The Amazon Web Services cloud is distributed, secure and resilient, giving you reliability and massive scale.” That, of course, makes the outage a little embarrassing. The ultimate source of the problem seems fairly mundane. As reported in Computerworld (Amazon’s data center outage reads like a thriller: The outage shows why performance monitoring services are gaining ground, Dec 11):
At first, a “single component of the redundant power distribution system failed in this zone,” Sysadmin would later write in a postscript for his audience. But while the data center staff worked on that component, there was a twist: “A second component, used to assure redundant power paths, failed as well.”
This is what ultimately caused customers to lose connectivity with whatever services Amazon was supposed to be providing them. The interesting part (to my mind) is how a user of these services is supposed to monitor what is going on. This is an issue with any outsourcing arrangement. Whether you are hiring someone to assemble electronic devices or run a call center for you, you are giving up some control relative to doing it yourself. How well the system works will depend in part on how well the client can monitor what the supplier is doing. For manufacturing or call centers, you at least have some idea of where the work is being done. Visiting the call center might mean flying to Iowa or India, but at least you know where you need to go. Once you get there, you can talk to an agent who has been trained to handle your calls. Many service providers now record every call and allow the client to access their networks to evaluate the calls.
Hiring someone to put, say, your e-commerce site in the cloud is not quite the same. The firm you hire likely has multiple data centers, and if you visit one, what do you learn from looking at a server? Amazon didn’t exactly hide that they were having problems. Someone (referred to as Sysadmin in the article) was posting to AWS’s status board as AWS staff tried to uncover what had happened. That’s all well and good when an “event” happens, but it’s not going to help a client learn about fluctuations in performance. Enter monitoring services:
In November, Apparent Networks launched its Cloud Performance Center, an online service that allows anyone to review — in real-time — the performance of 16 cloud providers, including Amazon and Google. It covers such things as bandwidth capacity, latency and data loss, then scores them overall. Jim Melvin, president of the privately held Apparent Networks, said his firm can continuously monitor network performance over WANs using technology it has extended to the cloud. The monitoring is done with a “very lightweight stream of packets” that continuously travels the network to monitor activity and cloud performance. With the available free version of its PathView Cloud tool, users can detect performance issues with the network or cloud provider, and see whether service level performance agreements are being met, Melvin said.
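To make the “very lightweight stream of packets” idea concrete, here is a minimal sketch of active probing. PathView’s actual probing technology is proprietary, so this is not their method; it simply times repeated TCP handshakes to a cloud endpoint, which any client can do without special privileges, and summarizes latency, jitter, and connection failures. The hostname, sample count, and interval are all illustrative choices.

```python
import socket
import statistics
import time

def probe(host, port=443, samples=20, timeout=2.0, interval=0.25):
    """Measure TCP connect latency and failure rate to an endpoint.

    A rough stand-in for continuous lightweight probing: each sample
    times one TCP handshake; a timeout or refusal counts as a failure.
    """
    latencies_ms, failures = [], 0
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                latencies_ms.append((time.perf_counter() - start) * 1000)
        except OSError:
            failures += 1
        time.sleep(interval)  # space the probes out so the stream stays light
    return {
        "host": host,
        "median_ms": statistics.median(latencies_ms) if latencies_ms else None,
        "jitter_ms": statistics.pstdev(latencies_ms) if len(latencies_ms) > 1 else 0.0,
        "loss_pct": 100.0 * failures / samples,
    }

if __name__ == "__main__":
    # Example target; a real monitor would probe many endpoints, continuously.
    print(probe("s3.amazonaws.com"))
```

A real monitoring service would run something like this from many vantage points around the clock and keep the history, which is exactly what turns one-off status checks into a view of fluctuations over time.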
This is a really interesting idea. I’m hard pressed to think of another setting in which third-party evaluations of the performance of competing firms are readily available in real time. JD Power, say, rates cars and airlines, but those are all ex post evaluations. I as a consumer find out how good a job Toyota did in building cars last year. That is not perfectly informative of how they will do with the current model year. Here, however, we’re getting a much more salient view. Of course, exactly what is happening is somewhat clouded by a “proprietary scoring algorithm,” so it is not immediately clear that Apparent Networks assigns the exact same value to a performance measure that a particular cloud user does. However, I suspect that they can provide a more detailed view for a client willing to spend enough. That suggests that there are opportunities for new ways of defining service level agreements and evaluating supplier performance, at least for large users. Even smaller users should have a much better view of performance when they sign up for cloud computing services.
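To see why the weighting matters, here is a toy illustration (not Apparent Networks’ actual algorithm, which is proprietary). Assume each metric is already normalized so 1.0 is best and 0.0 is worst; the same measurements come out with different overall scores depending on whose priorities the weights reflect.

```python
def score(metrics, weights):
    """Combine normalized performance metrics into one 0-100 score.

    Hypothetical weighting scheme: the weights embody a judgment call
    about what matters, which is exactly what a proprietary score hides.
    """
    total = sum(weights.values())
    return 100.0 * sum(metrics[k] * w for k, w in weights.items()) / total

# The same measured performance, scored under two different priorities.
measured = {"latency": 0.90, "loss": 0.99, "bandwidth": 0.60}

# A latency-sensitive user (say, an e-commerce checkout page)...
print(score(measured, {"latency": 3, "loss": 1, "bandwidth": 1}))  # 85.8
# ...versus a bulk-transfer user who mostly cares about bandwidth.
print(score(measured, {"latency": 1, "loss": 1, "bandwidth": 3}))  # 73.8
```

The gap between 85.8 and 73.8 is the point: a single published score cannot match every user’s priorities, which is why a large client might pay for the detailed view and write its own weights into a service level agreement.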