Category: Availability and reliability

March 11th, 2009

Will downtime rain on the cloud computing parade?

Posted by Michael Krigsman @ 6:22 am

Categories: Availability and reliability, CIO issues, Enterprise 2.0, Google, IT issues, SAP, SaaS, PaaS, and SOA, Uncategorized

Tags: Salesforce.com Inc., Google Inc., Google Gmail, Downtime, Outage, Manufacturing, E-mail Providers, Cloud Computing, Sales Force Management, Internet

Gmail recorded downtime yesterday, once again raising availability as an on-demand software issue. I’m interested in your opinion on whether this issue is significant.

Here’s a view of this most recent Gmail outage from Google’s new status dashboard:

And here’s detail describing the problem:

Read the rest of this entry »

March 10th, 2009

Google Apps Status Dashboard: enterprise-friendly

Posted by Michael Krigsman @ 6:43 am

Categories: Availability and reliability, Enterprise 2.0, Google, IT issues, SaaS, PaaS, and SOA, Salesforce.com

Tags: Google Inc., Google Apps, Dashboard, Status Dashboard, Cloud Computing, Michael Krigsman

Following a major Gmail outage, Google released a new, publicly available status dashboard showing system health for the company’s Apps products.

This release brings Google into the ranks of companies such as Salesforce.com and Amazon, which already have dashboards that accurately report system status to end-users.

The Google Enterprise Blog announced the news:

The Google Apps Status Dashboard represents an additional layer of transparency that we believe will be particularly useful for our business users, and it’s also relevant to users of our consumer products. The Status Dashboard is the best place to check for information on service availability for Google Apps anywhere in the world.

Here’s a screen capture of the dashboard:

And this screen shows the detail if you click one of the information icons in the dashboard:

Read the rest of this entry »

February 27th, 2009

IT failures roundup: Banks around the world!

Posted by Michael Krigsman @ 5:26 am

Categories: Availability and reliability, End-user impact, News roundup

Tags: Software, Bank, Information Technology, Debit Card, Online Banking, UBS AG, Australian IT, Banking, Financial Services, Personal Finance

Failures seem to come in waves, and today it’s banking. Think your bank’s systems are perfect? Don’t be so sure.

“System error” at UBS Japan causes erroneous $31 billion trade. From Finextra:

Swiss bank UBS has confirmed that a “system error” was responsible for the entry of an erroneous trade for three trillion yen at the Tokyo Stock Exchange on Tuesday.

UBS told the Bloomberg newswire that its Japanese unit mistakenly ordered $30.9 billion of Capcom Co. convertible bonds, 100,000 times more than it intended, because of an internal system error.

Allied Irish Bank incorrectly pulls cash from customer accounts. Again, Finextra has the story:

Read the rest of this entry »

October 9th, 2008

London Stock Exchange website reports incorrect prices

Posted by Michael Krigsman @ 5:25 am

Categories: Availability and reliability, CIO issues, End-user impact, Financial impact, Project failures

Tags: London Stock Exchange Plc., Web Site, Nasdaq Stock Market Inc., MarketWatch, Web Site Development, Web Technology, Internet, Michael Krigsman

The London Stock Exchange (LSE) website displayed incorrect prices for the important FTSE 100 index yesterday morning. The problems were less severe than previous London Stock Exchange failures, which actually interrupted trading.

According to Online Financial News:

The website reported inaccurate points movements for the FTSE 100 from the start of trading until the price ticker was removed from the website later in the morning… The LSE said the exchange’s data feeds to professional traders…worked perfectly all day and technical glitch on the website had been fixed.

When an exchange reports incorrect stock data investors can be hurt, which damages market confidence in that exchange. As one trader said, “This is not the kind of mistake we needed today.”

Read the rest of this entry »

September 29th, 2008

World's worst IT failure report

Posted by Michael Krigsman @ 7:26 am

Categories: Availability and reliability, End-user impact, Failure 2.0, IT issues

Tags: Information Technology, Failure, Productivity, Channel Management, Retail, Internet, Telecom & Utilities, Marketing, Michael Krigsman

Reporting failure requires balancing facts with making interpretations about why the failure occurred. Respectable bloggers and journalists temper their comments because inaccurate accusations and sensationalism unnecessarily harm the innocent and provide little value to readers.

With this in mind, yesterday I read one of the most irresponsible failure reports I’ve ever seen. Talking about B&H Photo, a major New York photography retailer, one user started this discussion on Flickr:

B&H going out of business?

[F]or days their tracking and history systems have been down, now their phone system reports they are having computer errors, so their system is down….

This is how it looks when a company is circling the drain….Seems like it’s high time for the rumors to start flying.

THE PROJECT FAILURES ANALYSIS

This report committed several cardinal sins governing responsible reporting on failures:

1. Not accurate. Several follow-ups from readers pointed out that B&H systems are working properly. A minor web glitch may have occurred, but it’s clearly not an ongoing or consistent problem.

2. Not objective. Personal soap boxing is more important to the writer than uncovering facts.

3. Clear rumor mongering. The writer explicitly wants to start negative rumors, despite the presence of facts contrary to his assertions and conclusions.

I contacted B&H’s Director of Corporate Communications, Henry Posner, who summarized the matter:

The writer took one symptom, extrapolated to the most extreme and far-fetched possible conclusion, and ran with that. Calling it “irresponsible” is an insult to things that actually are irresponsible.

My take: The writer hit a momentary glitch and nothing more. Whether or not the problem was attributable to B&H is immaterial: these things happen all the time and pass quickly. And no, B&H is obviously not going out of business.

[Image via Roslyn High School. Disclosure: As a photographer, I love B&H's huge New York store. Check out my photos on Flickr. ]

September 16th, 2008

J.Crew: Failed upgrade hits financial performance

Posted by Michael Krigsman @ 9:22 am

Categories: Availability and reliability, CIO issues, End-user impact, Financial impact, IT issues, Implementation, Project failures

Tags: Team, Financial, Web Site, Levi Strauss, Team Management, Sales Strategy, Web Site Development, Transportation, Call Centers, Web Technology

Clothing retailer J.Crew reported weaker-than-expected second-quarter earnings due to severe problems following a website and call center upgrade. The fiasco cost the company $3 million, in addition to lost sales and dissatisfied customers.

The company’s 10-Q SEC filing describes what happened:

During the second quarter of fiscal 2008 we implemented certain Direct channel systems upgrades which impacted our ability to capture, process, ship and service customer orders. As a result, our Direct sales growth rate was lower than recent quarterly trends. We expect the impact of the systems upgrades to continue into the second half of fiscal 2008.

In plain English, J.Crew couldn’t process orders or ship clothing to customers after deploying a website upgrade. Ouch, that one hurt.

James Scully, company CFO, provided details during an earnings call (transcription from Seeking Alpha):

In order to properly support our multi-channel, multi-brand strategy, we were required to make some significant investments in the direct business. These includes the following: a new platform for our website to allow for multiple brands, enhanced functionality, and increased growth capacity; a new order management system to improve the overall customer experience in our call center and drive future efficiencies; and a new direct warehouse management system to support our multi-branch strategy. On the weekend of June 28th, after taking our website down for 24 hours, we cut over to the new systems. Over the next several weeks, we experienced issues related to the site performance, order fulfillment, and call center performance….

The merchandise margin deterioration resulted from unplanned customer accommodations related to the direct system initiatives in the form of free and upgraded shipping, increased markdown activity as a result of our inability to transfer store markdowns to the direct channel, and increased freight expenses as a result of transferring inventory between stores combined with an overall increase in freight transportation costs….

In other words, the company screwed up customer orders. The J.Crew Aficionada blog described it this way:

As some of us have experienced first hand, many orders have been short-shipped or canceled due to lack of inventory (despite showing as “in stock” online). I really do hope that J.Crew can sort out the inventory issues this weekend to show in-stock items only on their website.

In a visceral demonstration of customer impact, one poor guy documented his J.Crew experience, which included being charged $9,208.50 for shipping three shirts. Here’s the receipt, along with a photo showing the baby-sized shirt J.Crew sent him by mistake:

THE PROJECT FAILURES ANALYSIS

There’s something special about companies that deploy mission-critical, customer facing systems without sufficient testing. Although we can only speculate why this happened, CFO Scully acknowledged the screw-up:

[We should have been] more conservative in our internal planning in terms of the potential disruption related to the direct business, which would have led us to be more conservative in setting expectations externally with our customers and our other constituents.

In fairness, once management understood the enormity of the situation, the company did take steps to satisfy customers, including offering discounts and other incentives. In the earnings call, J.Crew CEO, Millard Drexel, focused on not alienating customers:

[T]he risk was having a little more inventory but for us, more importantly, we wanted to make up to any customers and we didn’t want to just piss off more people.

Discussing this situation, we can’t ignore the recent ERP failure at jeans retailer Levi Strauss. In that case, Levi’s reported a 98% drop in net income when implementation problems prevented the company from shipping product to retailers.

My take. The post-mortem analysis should focus on a key question: did management or the project team drive this failure?

In one scenario, management forced a premature rollout despite warnings from the project team, which knew the system wasn’t ready. We all know cases where management pressured a software team to ship code before it was ready, despite protests and warnings from developers — such moments are pure Dilbert but nonetheless happen all the time.

I asked Retail Systems Research managing partner and former retail industry CIO, Paula Rosenblum, for her thoughts on the issue of rollouts in the retail industry:

Sometimes a rollout is dictated by the desire to get it done before a busy season, which typically become quiet times for IT. You want to get a system live and de-bugged before any critical season. I think this is what happened here.

Another, equally plausible, scenario is that the implementation team itself failed to recognize that a poorly tested system could not withstand the rigors of going live. Sadly, the annals of project failure are filled with stories about project teams that didn’t pay sufficient attention to testing and data integration before going live.

Paula points out that management accepted full blame in this case:

It sounds more like that which they bought (and I’m assuming they did not build, but bought) was less mature than they thought it was. But if you read carefully, they’re not saying they were too aggressive in implementation times, nor are they saying they should have allowed more time for implementation. Management is saying they should have planned their sales numbers less aggressively to accommodate problems that could (and did) occur. That’s fascinating really. Management is falling on its sword and actually not blaming IT at all.

[Via Steven M. Bellovin in Risks Digest. Image source: J.Crew Customer blog.]

September 5th, 2008

Netflix post-mortem: hardware failure and poor transparency

Posted by Michael Krigsman @ 7:06 am

Categories: Availability and reliability, CIO issues, End-user impact, Enterprise 2.0, Failure 2.0, Financial impact, IT issues, Project failures

Tags: NetFlix Inc., Hardware, Customs, Stacksafe, Michael Krigsman

Massive shipping delays last month at Netflix were caused by hardware failure. Although diagnosing hardware problems can be tough, Netflix gets the dunce award for lack of transparency in the face of disaster.

Here’s the post-mortem analysis from Mike Osier, head of IT Operations at Netflix:

On Monday, 8/11, our monitors flagged a database corruption event in our shipping system. Over the course of the day, we began experiencing similar problems in peripheral databases until our shipping system went down. It was going to be a long night.

We suspected hardware and moved the shipping system to an isolated environment, gradually getting DVD shipments moving again. Eventually the system was repaired and shipping returned to normal conditions. With some great forensic help from our vendors, root cause was identified as a key faulty hardware component. It definitively caused the problem yet reported no detectable errors. We’ve taken steps to fortify our shipping system with the acquisition of additional equipment and worked with our vendors to verify we’re in good shape elsewhere.

THE PROJECT FAILURES ANALYSIS

There are two significant points to consider:

  1. Hardware failure, especially involving network and communications equipment, can be a nightmare to troubleshoot and repair.
  2. Appropriate end-user communications are a critical part of managing IT downtime.

Hardware failure. When hardware fails, symptoms can appear as software, database, or telecom link problems unrelated to the specific equipment that’s flaky. For example, ComputerWorld reported on hardware-related problems at Kennedy Airport (emphasis added):

Initially, American’s parent company, AMR Corp., said the malfunction yesterday was in software that controlled the baggage-sorting conveyor belt in American’s bag room at JFK. However, airline spokesman Tim Wagner said today that the glitch was caused by a hardware issue involving the network between the computer software that controls the sorting function and the baggage conveyor belts. Wagner said the software was working, the conveyors were working, but some of the network hardware was failing.

Along the same lines, I wrote about a Los Angeles Airport failure that sounds similar to Netflix:

Assuming this [failure] to be a wide-area network problem, CBP called Sprint, its carrier, to test the lines. After three fruitless hours of remote testing, Sprint finally sent technicians on-site. Another three hours passed before Sprint finally concluded that transmission lines were not the problem…. The real culprit: a failed router. Update: Turns out it was a bad NIC card. Customs is planning network upgrades so the problem doesn’t happen again.

Lack of transparency. During times of failure, communication with end-users is critical if you value their continued loyalty. In this case, Netflix’s post-mortem was anemic and their status updates were too vague.

Netflix should have disclosed which hardware failed, why repairs took so long, and specifics on what it has done to prevent future problems. Investors should ask why management’s backup and contingency plans handled this mission critical failure so badly.

Given the financial implications for Netflix, as described by Larry Dignan, the company’s crisis management fell short:

These issues are obviously going to cost Netflix some dough. First, the company is losing revenue. That slippage will result in an earnings hit. Meanwhile, Netflix will have to account for reimbursing subscribers (currently credits to one-third of the subscriber base).

It’s also worth noting there are currently 145 responses to the post-mortem blog post, most of them negative; that should tell you something.

Stacksafe’s Jonah Paransky offers an excellent seven-point framework for communicating IT failure:

  1. Have a communication plan in place and ready to go
  2. Direct communication with your customers is the number one concern
  3. Be prepared to communicate over multiple channels.
  4. Over-communicating is better than under-communicating
  5. Expect the failure to become public
  6. Humor probably isn’t the right call
  7. Don’t underestimate the communication necessary after the failure is resolved

Netflix looks bad when measured against this list. Although it offered several superficial blog posts describing status during the outage, Netflix never really disclosed what happened or why.

The company has much to learn about the user relations aspect of IT downtime. Google, Salesforce, and Amazon take system transparency seriously; Netflix should do the same.

[Photo source: iStockphoto.]

August 28th, 2008

Google Apps dashboard: Serious about the enterprise?

Posted by Michael Krigsman @ 8:22 am

Categories: Availability and reliability, CIO issues, Enterprise 2.0, Google, Project portfolio management, SaaS, PaaS, and SOA

Tags: Google Inc., Google Apps, Dashboard, PROJECT FAILURES ANALYSIS Google, Michael Krigsman

Google is developing a system status reporting dashboard for its Apps Premier product line. This decision provides further evidence Google is serious about becoming an enterprise software vendor.

CNET’s Dave Rosenberg posted the announcement email, describing the dashboard’s incident reporting features:

  • A description of the problem, with emphasis on user impact….
  • A continuously updated estimated time-to-resolution….

For most minor incidents, that should provide sufficient information to users regarding problem status and expected recovery times. For more serious problems, Google’s plans go much further:

[A] formal incident report within 48 hours of problem resolution. This incident report will contain the following information:

a. business description of the problem, with emphasis on user impact;
b. technical description of the problem, with emphasis on root cause;
c. actions taken to solve the problem;
d. actions taken or to be taken to prevent recurrence of the problem; and
e. time line of the outage.
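
The five items in that incident report map naturally onto a simple record. As a rough sketch only (the class, field, and function names here are hypothetical and not anything Google has published), the structure could look like this in Python, with the 48-hour commitment expressed as a deadline calculation:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class IncidentReport:
    """Hypothetical record mirroring items a through e of the email."""
    business_description: str      # a. emphasis on user impact
    technical_description: str     # b. emphasis on root cause
    actions_taken: List[str]       # c. what solved the problem
    preventive_actions: List[str]  # d. steps to prevent recurrence
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)  # e. time line


def report_deadline(resolved_at: datetime) -> datetime:
    """The email promises the report within 48 hours of problem resolution."""
    return resolved_at + timedelta(hours=48)
```

A record like this makes it easy to check that no section of a report has been left empty before it goes to customers.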

Finally, if things get really bad, the email promises in-depth consultation with customers on an individual basis:

[W]e’ll support your internal communication process through participation in post-mortem calls with you and your management team.

When an incident first occurs, reporting is limited to status, availability, and predicted resolution times. For more severe situations, that basic status reporting will be supplemented by a business-oriented description of the cause, scope, and impact of the problems. Finally, following the worst downtime issues, Google will present a transparent and detailed analysis.

THE PROJECT FAILURES ANALYSIS

Google joins the ranks of Salesforce.com and Amazon, both of which offer industry-leading incident reporting to end-users. The Salesforce reporting service has been around a long time, while Amazon instituted theirs following a series of serious downtime incidents earlier this year.

I’m rather shocked to see Google’s willingness to participate in detailed post-mortem analysis discussions with customers. For such consultations to offer any value whatsoever, the company’s representative must be knowledgeable regarding both the business and technical implications of downtime events. People with this experience don’t grow on trees, especially if they are also strong communicators, so this represents a significant resource investment.

Although Google may offer this service level to large accounts such as Cap Gemini, I doubt smaller customers will receive any personalized attention whatsoever. After all, Google isn’t known for providing stellar customer service; actually, the company’s customer care record sucks widgets. Only time will tell whether Google can successfully transition from its mass market consumer mentality to becoming a trusted, service oriented enterprise vendor.

Great status reporting systems, while important, don’t turn consumer application companies into enterprise software vendors. However, the business focus and directional strategic intent of this investment are clear.

[Via Charlie Wood. Image source: iStockphoto.]

August 27th, 2008

MediaMax / The Linkup: When the cloud fails

Posted by Michael Krigsman @ 9:55 am

Categories: Availability and reliability, CIO issues, End-user impact, Enterprise 2.0, Failure 2.0, Financial impact, IT issues, Project failures, SaaS, PaaS, and SOA

Tags: Information Technology, Data, Failure, Storage, Hardware, Michael Krigsman

Online storage service MediaMax, also called The Linkup, went out of business following a system administration error that deleted active customer data. The defunct company leaves behind unhappy users and raises questions about the reliability of cloud computing.

According to Nirvanix, which along with MediaMax was spun out of a company called Streamload, a faulty script caused the problem:

Streamload offered unlimited and then 25 GB of free storage for quite some time. This resulted in a tremendous amount of data stored in a few million free, non-active accounts for [nine years]. Streamload was literally paying for former users to store 100’s of terabytes of old, inactive data for free. In preparation for the split of the two companies, and subsequent move of the MediaMax application to SAVVIS, it was determined that the inactive data from former users would be purged on the Streamload/MediaMax storage system, thus shrinking the overall storage needs and costs for the new MediaMax company. During this process, a system administrator ran a script that misidentified active account data and disassociated physical files from their owners.

Although The Linkup lost a ton of customer data, CEO Steve Iverson told Network World he’s unsure how much is gone:

Iverson says at least 55% of the data was safe. How much of the remaining 45% was saved is not clear, he says. “We know there was definitely a lot of customer problems, and when we looked at some individual accounts, some people didn’t have any files, and some people had all their files.”

THE PROJECT FAILURES ANALYSIS

As with most failures, this story is fraught with complications and contradictions. Besides finger pointing and back-biting, which I suppose is to be expected, confusing corporate relationships coupled with a seemingly bizarre level of process and technical carelessness lend a weird flavor to the whole mess.

The human drama is documented in links from this post; more importantly, two significant and highly connected issues were at play:

  1. Business process failures. Apparently, the company allowed a lone system administrator to perform tasks affecting the company’s core business without sufficiently performing dry runs. I suppose this point is self-evident: scenario planning is critical whenever IT handles irreplaceable data. Management is responsible for establishing all operating plans and contingency procedures before IT executes data-threatening procedures.
  2. Technical failures. Given the high stakes and the script’s intended goal, the company should have performed intensive testing ahead of time.
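
To make the dry-run idea concrete, here is a minimal sketch in Python. It is not the script MediaMax ran; the function name, the `active` flag, and the `delete` callback are all assumptions made for illustration. The point is that a purge routine can default to reporting what it would delete, so the selection can be checked against known-active accounts before anything is removed:

```python
from typing import Callable, Dict, List


def purge_inactive(accounts: Dict[str, dict],
                   delete: Callable[[str], None],
                   dry_run: bool = True) -> List[str]:
    """Return the account IDs selected for purging.

    With dry_run=True (the default) nothing is deleted, so the
    selection can be reviewed before the real run.
    """
    # Hypothetical rule: an account is purgeable when its "active" flag is False.
    selected = [acct_id for acct_id, info in accounts.items()
                if not info["active"]]
    if not dry_run:
        for acct_id in selected:
            delete(acct_id)
    return selected


# Example: the dry run lists candidates but deletes nothing.
accounts = {"alice": {"active": True}, "bob": {"active": False}}
deleted: List[str] = []
candidates = purge_inactive(accounts, deleted.append)
```

Making the destructive path opt-in means a forgotten flag results in a harmless report rather than lost data.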

I asked Newsgator’s VP of Software as a Service (SaaS), Jeff Nolan, to comment. Jeff questioned why the company maintained so much historical data:

Beyond meeting legitimate business and regulatory requirements, retaining years of old, inactive data adds unnecessary risk and cost.

There was also a process failure. Newsgator takes active steps to isolate problems and prevent this type of damage. In addition to sandbox testing, which is computer science 101, we require two-key authorization: the sys admin can only run these types of scripts after a second person has given approval. A well-defined system of checks and balances prevents problems.
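
The two-key rule Jeff describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Newsgator’s actual tooling: the function, the exception, and the way people are identified by name are all hypothetical.

```python
from typing import Callable, Optional


class ApprovalError(Exception):
    """Raised when a destructive action lacks a valid second approval."""


def run_with_two_keys(action: Callable[[], None],
                      operator: str,
                      approver: Optional[str]) -> None:
    """Run `action` only if a second, different person has approved it."""
    if not approver:
        raise ApprovalError("a second person must approve this action")
    if approver == operator:
        raise ApprovalError("the operator cannot approve their own action")
    action()
```

In practice the approval would come from an authenticated system rather than a string argument, but the check itself stays the same: no approver, or an approver who is also the operator, means the script does not run.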

While this case is an interesting footnote in the history of IT failures, the larger implications relate to cloud computing. On this subject, Larry Dignan says:

[The cloud's] growing pains, which are more evident each day that we rely more on service-based software efforts, indicate that you can’t really trust the cloud at this juncture. It’s too early and providers are learning as they go.

Despite being a cloud-based failure, the underlying problem is human error and poor judgment. This cloud failure is no different from any other IT problem, where immature process coupled with lax management oversight resulted in catastrophic meltdown.

[Via George Ou. Image from iStockphoto.]

August 11th, 2008

Gmail is down

Posted by Michael Krigsman @ 3:09 pm

Categories: Availability and reliability, Enterprise 2.0, SaaS, PaaS, and SOA

Tags: Google Gmail, Habit, E-mail Providers, Cloud Computing, Internet, Michael Krigsman

Update 8/11/08 10:45pm EDT: Gmail is back up now. According to the Gmail blog:

Many of you had trouble accessing Gmail for a couple of hours this afternoon, and we’re really sorry. The issue was caused by a temporary outage in our contacts system that was preventing Gmail from loading properly. Everything should be back to normal by the time you read this.

This is becoming an unhappy habit :(

[Thanks to Ed Shaz for alerting me to the problem.]

Michael Krigsman is CEO of Asuret, Inc., a software and consulting company dedicated to reducing software implementation failures.
