Previous Post: SaaS for your business in the cloud
Next Post: Blind to the elephant in the cloud
"This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem - we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.

"However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers - servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!'. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.

"We've turned our full attention to helping ensure this kind of event doesn't happen again. Some of the actions are straightforward and are already done - for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle - for example, we have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements ..."

You see what I mean? Learning, learning, learning from every glitch, and as soon as the solution is found it's implemented to the benefit of every one of Gmail's millions of users. As my friend and fellow Enterprise Irregular Anshu Sharma wrote a while back, this is one of the unsung benefits of multi-tenancy.
The disasters may be high-profile, but that only gives the provider more incentive to avoid them in the future. By contrast, a software vendor of on-premise, single-tenant applications has little incentive to fix problems that affect only one customer at a time, even if the aggregate outage time is far more severe once you add up the results of each individual failure. I realize it would be better still if Gmail didn't fail at all, ever. But think of each small outage as one more step along the path to that ultimate nirvana.
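For readers who want to see why the request-router failure in Google's account snowballed, here is a toy sketch of the two behaviors the quote contrasts: routers that refuse traffic when overloaded, and routers that stay in service and simply get slower. Everything in it is an assumption made for illustration: the function names, the capacities and the load figures are invented, and the even split of load across routers is a simplification, not a description of how Gmail actually routes requests.

```python
def simulate_shedding(capacities, total_load):
    """Overloaded routers refuse traffic, so their share moves to the rest.

    Returns (routers still serving, number of failure rounds).
    Hypothetical model: load is split evenly across active routers.
    """
    active = list(capacities)
    rounds = 0
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # every remaining router can carry its share
        active = survivors  # overloaded routers drop out
        rounds += 1
    return len(active), rounds


def simulate_graceful(capacities, total_load):
    """Overloaded routers keep serving, only slower.

    Returns (routers still serving, worst slowdown factor).
    """
    share = total_load / len(capacities)
    slowdowns = [max(1.0, share / c) for c in capacities]
    return len(capacities), max(slowdowns)


# Ten routers with slightly different (invented) capacities.
capacities = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

print(simulate_shedding(capacities, 1020))  # load just above the weakest router
print(simulate_graceful(capacities, 1020))
print(simulate_shedding(capacities, 900))   # same routers with more headroom
```

In this toy model, a load only slightly above what the weakest router can carry takes out every router within three rounds when routers shed traffic, while the graceful version keeps all ten serving with the weakest one running about 2% slower. Lowering the load to 900 leaves every router with headroom and nothing fails, which mirrors the two remedies the Gmail team describes: more capacity, and degrading instead of refusing.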
posted by Phil Wainewright
September 24, 2009 @ 10:57 am