August 1st, 2007

365 Main details SF outage problems

Posted by Larry Dignan @ 6:40 am

Categories: Business Continuity, Datacenter, Disaster Recovery, General, Hardware Infrastructure, IT Management

Tags: Generator, San Francisco, Outage, Larry Dignan

365 Main on Wednesday detailed what went wrong during the San Francisco power outage last week and outlined what it’s doing to keep its facilities running in the future.

The power outage on July 24 knocked various sites, including CNET, Craigslist and others, offline and raised questions about business continuity planning.

Here’s the explanation provided by the company in full:

At 1:47 p.m. on Tuesday, July 24, 365 Main’s San Francisco data center was impacted by a power surge caused when transformer breakers at a local PG&E power station unexpectedly opened. PG&E has still not determined what caused the breakers to open.

Typically when a power outage occurs, the outage triggers 365 Main’s rigorously maintained and tested back-up diesel generators to start-up and take over providing power supply to customers. 365 Main’s San Francisco facility has ten 2.1 megawatt back-up generators to be used in the event of a loss of utility power. Eight primary generators can successfully power the building, with two generators available on stand-by in case there are any failures with the primary eight.

However, following the power outage last week, three of 365 Main’s 10 back-up power generators, manufactured by Hitec, failed to complete their start sequence. A complete investigation of the incident began immediately.

Within hours of the incident, an international team of specialists was deployed to 365 Main’s San Francisco data center facility to join on-site technicians and begin systematically testing the generators in search of a root cause. After days of thorough testing around the clock, the team discovered a weakness in an essential component of the back-up generator system known as a DDEC (Detroit Diesel Electronic Controller).

The team discovered a setting in the DDEC that was not allowing the component to correctly reset its memory. Erroneous data left in the DDEC’s memory subsequently caused misfiring or engine start failures when the generators were called on to start during the power outage on July 24.

The investigation team discovered DDEC issues on each of the failed Hitec units and were able to successfully simulate failure. A fix was introduced by altering the timing of a command to the DDEC component, allowing more time between the engine shut-down command and the DDEC reset command. Once this fix was introduced, the Hitec generators successfully passed more than 50 consecutive start-up sequence tests without incident.

The testing methodology was performed by Hitec specialists along with 365 Main’s chief technician and staff. Specialists from Cupertino Electric were present during all testing, and EYP Mission Critical Facilities will provide independent verification of the findings the week of 8/6/07.

365 Main has implemented the DDEC fix in its San Francisco and El Segundo facilities. Of the five data centers in 365 Main’s portfolio, the San Francisco and El Segundo facilities are the only ones with Hitec generators containing DDECs. All other facilities feature other brands of generators or have different models of Hitecs.

365 Main is sharing the discoveries of its investigation with other Hitec customers. In addition, Hitec has expanded its preventative maintenance procedures as a direct result of discoveries made during the 365 Main investigation.
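
The redundancy numbers in the statement are worth working through. The short sketch below uses only the figures 365 Main gives (ten 2.1 megawatt generators, eight primary, two on stand-by, three failed starts). One assumption is mine, not the company's: that all eight primary units are the minimum needed to carry the full building load. The statement does not say how load was actually distributed on July 24, so read this as illustrative arithmetic rather than a reconstruction of the event.

```python
# Figures taken from 365 Main's statement. Treating eight units as the
# minimum needed for full building load is an assumption of this sketch.
GENERATOR_MW = 2.1
TOTAL_UNITS = 10
UNITS_REQUIRED = 8  # assumed minimum for full load
FAILED_UNITS = 3


def capacity_summary(total, required, failed, unit_mw):
    """Compare the generators that started against the number required."""
    available = total - failed
    return {
        "spare_units": total - required,
        "available_units": available,
        "available_mw": round(available * unit_mw, 1),
        "required_mw": round(required * unit_mw, 1),
        "shortfall_units": max(0, required - available),
        "covers_full_load": available >= required,
    }


summary = capacity_summary(TOTAL_UNITS, UNITS_REQUIRED, FAILED_UNITS, GENERATOR_MW)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value}")
```

Under that assumption, two spares cover up to two failed starts. A third failure leaves seven units where eight are needed, which is why three failures were enough to matter.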

The company also has a full archive of the developments last week.

Larry Dignan is Editor in Chief of ZDNet and Editorial Director of ZDNet sister site TechRepublic.
