September 5th, 2008
Netflix post-mortem: hardware failure and poor transparency

Massive shipping delays last month at Netflix were caused by hardware failure. Although diagnosing hardware problems can be tough, Netflix gets the dunce award for lack of transparency in the face of disaster.
Here’s the post-mortem analysis from Mike Osier, head of IT Operations at Netflix:
On Monday, 8/11, our monitors flagged a database corruption event in our shipping system. Over the course of the day, we began experiencing similar problems in peripheral databases until our shipping system went down. It was going to be a long night.
We suspected hardware and moved the shipping system to an isolated environment, gradually getting DVD shipments moving again. Eventually the system was repaired and shipping returned to normal conditions. With some great forensic help from our vendors, root cause was identified as a key faulty hardware component. It definitively caused the problem yet reported no detectable errors. We’ve taken steps to fortify our shipping system with the acquisition of additional equipment and worked with our vendors to verify we’re in good shape elsewhere.
THE PROJECT FAILURES ANALYSIS
There are two significant points to consider:
- Hardware failure, especially involving network and communications equipment, can be a nightmare to troubleshoot and repair.
- Appropriate end-user communications are a critical part of managing IT downtime.
Hardware failure. When hardware fails, symptoms can appear as software, database, or telecom link problems unrelated to the specific equipment that’s flaky. For example, ComputerWorld reported on hardware-related problems at Kennedy Airport (emphasis added):
Initially, American’s parent company, AMR Corp., said the malfunction yesterday was in software that controlled the baggage-sorting conveyor belt in American’s bag room at JFK. However, airline spokesman Tim Wagner said today that the glitch was caused by a hardware issue involving the network between the computer software that controls the sorting function and the baggage conveyor belts. Wagner said the software was working, the conveyors were working, but some of the network hardware was failing.
Along the same lines, I wrote about a Los Angeles Airport failure that sounds similar to Netflix:
Assuming this [failure] to be a wide-area network problem, CBP called Sprint, its carrier, to test the lines. After three fruitless hours of remote testing, Sprint finally sent technicians on-site. Another three hours passed before Sprint finally concluded that transmission lines were not the problem…. The real culprit: a failed router. Update: Turns out it was a bad NIC card. Customs is planning network upgrades so the problem doesn’t happen again.
Lack of transparency. During times of failure, communication with end-users is critical if you value their continued loyalty. In this case, Netflix’s post-mortem was anemic and their status updates were too vague.
Netflix should have disclosed which hardware failed, why repairs took so long, and specifics on what it has done to prevent future problems. Investors should ask why management’s backup and contingency plans handled this mission critical failure so badly.
Given the financial implications for Netflix, as described by Larry Dignan, the company’s crisis management fell well short:
These issues are obviously going to cost Netflix some dough. First, the company is losing revenue. That slippage will result in an earnings hit. Meanwhile, Netflix will have to account for reimbursing subscribers (currently credits to one-third of the subscriber base).
It’s also worth noting there are currently 145 responses to the post-mortem blog post, most of them negative; that should tell you something.
Stacksafe’s Jonah Paransky offers an excellent seven-point framework for communicating IT failure:
- Have a communication plan in place and ready to go
- Direct communication with your customers is the number one concern
- Be prepared to communicate over multiple channels
- Over-communicating is better than under-communicating
- Expect the failure to become public
- Humor probably isn’t the right call
- Don’t underestimate the communication necessary after the failure is resolved
Netflix looks bad when measured against this list. Although it offered several superficial blog posts describing status during the outage, Netflix never really disclosed what happened or why.
The company has much to learn about the user relations aspect of IT downtime. Google, Salesforce, and Amazon take system transparency seriously; Netflix should do the same.
[Photo source: iStockphoto.]
September 3rd, 2008
Has IT public relations arrived?

Common wisdom suggests that recognition naturally comes to a CIO when his department does a great job delivering successful projects. But what happens when the organization just doesn’t acknowledge IT’s greatness? For one CIO, the answer lies in public relations.
An IT department posted a note in CIO Magazine seeking PR assistance:
I am looking for a publicist with a good history of marketing the CIO, Director and IT department’s accomplishments internally and externally.
Reading this, sarcastic comments come quickly to mind. For example, maybe the organization hasn’t recognized the CIO because his department constantly screws up. In that case, the PR initiative represents Mr. CIO’s lame attempt to combat rising tides of well-deserved, anti-IT sentiment. Hey, it’s a big city out there and sometimes people do crazy things.
On the other hand, maybe this IT department is too good: they execute on time, within budget, and just get things done. Many organizations take a CIO for granted when his IT department consistently delivers the goods without fanfare and attention; sadly, this human failing is all too common. In that case, PR might be a great idea, especially if the CIO isn’t a great communicator. Of course, the CIO should improve his communication skills, but that’s another story.
I don’t know whether this particular IT group wants PR to trumpet great, unheralded accomplishments or to protect a lousy CIO. Either way, despite overall weirdness associated with the whole idea, the age of IT public relations seems to have arrived.
[Image via: HRB Public Relations.]
September 2nd, 2008
FAA outage due to ‘fix-on-fail’ policy

Last week’s technology failure at a major FAA facility caused air traffic delays throughout the country and highlighted the agency’s poor computing practices. Unlike major corporations and utilities, the FAA operates its air traffic control system with minimal redundancy using a “fix-on-fail” policy.
Redundancy is the foundational concept behind business continuity planning (BCP), which involves creating logistical and operating plans designed to take effect after a major disaster or critical infrastructure disruption. According to the Associated Press, the FAA maintains less redundancy than water or power utilities:
Redundancy is so critical for power and water utilities that they can be fined hundreds of thousands of dollars a day if they’re found insufficiently prepared — and $1 million per day if they’re found to be willfully negligent.
“If this (FAA outage) happened at a power plant,” [according to security researcher, Jason Larsen,] “I’d be telling them to open up their checkbook and expect to be fined.”
The Associated Press article points out pitfalls of the fix-on-fail policy:
“[I]t’s the whole `don’t fix it if it ain’t broke’ thing,” said Branden Williams, director of a unit of VeriSign Inc. that assesses the security of retailers’ payment systems. “It’s unfortunate because it’s very reactive, and it typically winds up costing you more. If you do fix-on-fail, it usually costs you more.”
The AltiusIT blog discusses this same issue:
To reduce their total cost of ownership, industry-leading organizations know that IT systems need to be properly managed and maintained. The “Fix on Fail” approach to systems management results in employee frustration, missed deadlines, increased costs, and lower levels of customer service.
THE PROJECT FAILURES ANALYSIS
The FAA must manage its resources and infrastructure within strict budget limitations. By implementing a fix-on-fail policy, which the agency must have decided years ago, the FAA made three bets:
- Passenger safety would not be jeopardized
- The system would not likely fail on a regular basis
- Taxpayers would not accept the costs associated with greater redundancy
In other words, sometime in the past, the agency decided the hassles and risks of the current system were acceptable, given the high cost of alternative policies.
The current situation has focused attention on the FAA and its technology policies. The Wall Street Journal reports that the agency is engaged in a massive system upgrade, although the article doesn’t provide much detail:
The Federal Aviation Administration said it is overhauling an error-prone computer system that caused hundreds of delayed flights Tuesday.
The system is part of the aging infrastructure that guides air traffic, which the FAA has been trying to update to reduce chronic delays.
Although the agency must manage a large mandate on a limited budget, one wonders whether sufficiently good judgment and good practice are being applied to FAA technology decisions.
It’s important to note the FAA consistently states that passenger safety is not compromised by its computing practices.
—
As an aside, here’s an interesting FAA-related story.
Some years ago, I happened to drive by the Boston air traffic control center for the northeast, which is located in Nashua, NH. Being an inquisitive and rather geeky fellow, I pulled up to the main gate and asked the guard for a tour. To my absolute amazement, he phoned someone from the air traffic control floor who promptly arrived and took me inside. I spent the next hour observing air traffic controllers at work and listening to their conversations with planes.
The place looked like a movie set and was darn cool. Unfortunately, in a post-9/11 world such impromptu visits will never, ever happen again.
[Via AMR analyst Jonathan Yarmis on Twitter. Image via http://www.subbrit.org.uk.]
August 28th, 2008
Google Apps dashboard: Serious about the enterprise?

Google is developing a system status reporting dashboard for its Apps Premier product line. This decision provides further evidence Google is serious about becoming an enterprise software vendor.
CNET’s Dave Rosenberg posted the announcement email, describing the dashboard’s incident reporting features:
- A description of the problem, with emphasis on user impact….
- A continuously updated estimated time-to-resolution….
For most minor incidents, that should provide sufficient information to users regarding problem status and expected recovery times. For more serious problems, Google’s plans go much further:
[A] formal incident report within 48 hours of problem resolution. This incident report will contain the following information:
a. business description of the problem, with emphasis on user impact;
b. technical description of the problem, with emphasis on root cause;
c. actions taken to solve the problem;
d. actions taken or to be taken to prevent recurrence of the problem; and
e. time line of the outage.
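The five report sections and the 48-hour window quoted above map naturally onto a small data structure. Here is a minimal sketch in Python; the class name, field names, and helper functions are my own illustrative assumptions and say nothing about how Google actually builds its reports.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime, timedelta
from typing import List

# The email promises a formal report within 48 hours of problem resolution.
REPORT_WINDOW = timedelta(hours=48)


@dataclass
class IncidentReport:
    """Hypothetical container mirroring items (a) through (e) in the email."""
    business_description: str = ""                                # (a) user impact
    technical_description: str = ""                               # (b) root cause
    actions_taken: List[str] = field(default_factory=list)        # (c)
    preventive_actions: List[str] = field(default_factory=list)   # (d)
    timeline: List[str] = field(default_factory=list)             # (e)


def report_due(resolved_at: datetime) -> datetime:
    """Latest time the formal report can be published."""
    return resolved_at + REPORT_WINDOW


def missing_sections(report: IncidentReport) -> List[str]:
    """Names of sections that are still empty, in the order (a) to (e)."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]
```

A draft that fills in only the business description would come back from `missing_sections` with the other four section names, which is a cheap way to keep a vague post-mortem from going out the door.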
Finally, if things get really bad, the email promises in-depth consultation with customers on an individual basis:
[W]e’ll support your internal communication process through participation in post-mortem calls with you and your management team.
When an incident first occurs, reporting is limited to status, availability, and predicted resolution times. For more severe situations, that basic status reporting will be supplemented by a business-oriented description of the cause, scope, and impact of the problems. Finally, following the worst downtime issues, Google will present a transparent and detailed analysis.
THE PROJECT FAILURES ANALYSIS
Google joins the ranks of Salesforce.com and Amazon, both of which offer industry-leading incident reporting to end-users. The Salesforce reporting service has been around a long time, while Amazon instituted theirs following a series of serious downtime incidents earlier this year.
I’m rather shocked to see Google’s willingness to participate in detailed post-mortem analysis discussions with customers. For such consultations to offer any value whatsoever, the company’s representative must be knowledgeable regarding both the business and technical implications of downtime events. People with this experience don’t grow on trees, especially if they are also strong communicators, so this represents a significant resource investment.
Although Google may offer this service level to large accounts such as Cap Gemini, I doubt smaller customers will receive any personalized attention whatsoever. After all, Google isn’t known for providing stellar customer service; actually, the company’s customer care record sucks widgets. Only time will tell whether Google can successfully transition from its mass market consumer mentality to becoming a trusted, service oriented enterprise vendor.
Great status reporting systems, while important, don’t turn consumer application companies into enterprise software vendors. However, the business focus and directional strategic intent of this investment are clear.
[Via Charlie Wood. Image source: iStockphoto.]
August 27th, 2008
MediaMax / The Linkup: When the cloud fails

Online storage service MediaMax, also called The Linkup, went out of business following a system administration error that deleted active customer data. The defunct company leaves behind unhappy users and raises questions about the reliability of cloud computing.
According to Nirvanix, which along with MediaMax was spun out of a company called Streamload, a faulty script caused the problem:
Streamload offered unlimited and then 25 GB of free storage for quite some time. This resulted in a tremendous amount of data stored in a few million free, non-active accounts for [nine years]. Streamload was literally paying for former users to store 100’s of terabytes of old, inactive data for free. In preparation for the split of the two companies, and subsequent move of the MediaMax application to SAVVIS, it was determined that the inactive data from former users would be purged on the Streamload/MediaMax storage system, thus shrinking the overall storage needs and costs for the new MediaMax company. During this process, a system administrator ran a script that misidentified active account data and disassociated physical files from their owners.
Although The Linkup lost a ton of customer data, CEO Steve Iverson told Network World he’s unsure how much is gone:
Iverson says at least 55% of the data was safe. How much of the remaining 45% was saved is not clear, he says. “We know there was definitely a lot of customer problems, and when we looked at some individual accounts, some people didn’t have any files, and some people had all their files.”
THE PROJECT FAILURES ANALYSIS
As with most failures, this story is fraught with complications and contradictions. Besides finger pointing and back-biting, which I suppose is to be expected, confusing corporate relationships coupled with a seemingly bizarre level of process and technical carelessness lend a weird flavor to the whole mess.
The human drama is documented in links from this post; more importantly, two significant and highly connected issues were at play:
- Business process failures. Apparently, the company allowed a lone system administrator to perform tasks affecting the company’s core business without performing sufficient dry runs. I suppose this point is self-evident: scenario planning is critical whenever IT handles irreplaceable data. Management is responsible for establishing all operating plans and contingency procedures before IT executes data-threatening procedures.
- Technical failures. Given the high stakes and the script’s intended goal, the company should have performed intensive testing ahead of time.
I asked Newsgator’s VP of Software as a Service (SaaS), Jeff Nolan, to comment. Jeff questioned why the company maintained so much historical data:
Beyond meeting legitimate business and regulatory requirements, retaining years of old, inactive data adds unnecessary risk and cost.
There was also a process failure. Newsgator takes active steps to isolate problems and prevent this type of damage. In addition to sandbox testing, which is computer science 101, we require two-key authorization: the sys admin can only run these types of scripts after a second person has given approval. A well-defined system of checks and balances prevents problems.
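To make the dry-run and two-key ideas concrete, here is a minimal Python sketch of a purge routine that defaults to reporting what it would delete and refuses a destructive run without a second approver. Everything in it is an illustrative assumption: the account structure, the function names, and the idle-days rule are hypothetical, and this is not the script used at MediaMax or at Newsgator.

```python
from datetime import date, timedelta


def select_inactive(accounts, today, max_idle_days):
    """Return ids of accounts whose last login is older than the cutoff."""
    cutoff = today - timedelta(days=max_idle_days)
    return [a["id"] for a in accounts if a["last_login"] < cutoff]


def purge(accounts, today, max_idle_days, requested_by,
          approved_by=None, dry_run=True):
    """Purge inactive accounts, with a dry run as the default mode."""
    targets = select_inactive(accounts, today, max_idle_days)
    if dry_run:
        # Report only; the account list is left untouched.
        return {"deleted": [], "would_delete": targets}
    # Two-key rule: a destructive run needs a second, different person.
    if not approved_by or approved_by == requested_by:
        raise PermissionError("a second person must approve a destructive run")
    accounts[:] = [a for a in accounts if a["id"] not in targets]
    return {"deleted": targets, "would_delete": []}
```

The point of the design is that the safe path is the default: someone has to pass `dry_run=False` and name a second approver before anything is removed, and the dry-run output gives that approver a concrete list to review.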
While this case is an interesting footnote in the history of IT failures, the larger implications relate to cloud computing. On this subject, Larry Dignan says:
[The cloud’s] growing pains, which are more evident each day that we rely more on service-based software efforts, indicate that you can’t really trust the cloud at this juncture. It’s too early and providers are learning as they go.
Despite being a cloud-based failure, the underlying problem is human error and poor judgment. This cloud failure is no different from any other IT problem, where immature process coupled with lax management oversight resulted in catastrophic meltdown.
[Via George Ou. Image from iStockphoto.]
August 26th, 2008
FAA computer failure slows nationwide air traffic
A computer problem at the Federal Aviation Administration center in Atlanta has created air traffic problems across the country.
From WSB Radio in Atlanta:
FAA Computer Problems: Failure in the system that processes flight plans in the eastern U.S. FAA says no danger is involved. System that processes the information for the east coast is based in Atlanta. It has been moved to a facility in Utah which usually processes only west coast flights. Flights across the country are backing up.
Here’s a screen capture from the FAA, showing delayed flights:

Information on the outage is still sketchy and I’m looking forward to learning whether this is a hardware or software problem. Either way, sounds like a truly massive outage.
Update 8/26/08 10:15 pm EDT: According to CNN, things are back to normal:
Airports experienced hours of flight delays Tuesday afternoon after a communications breakdown at a Federal Aviation Administration facility, the administration said.
The facility south of Atlanta had problems processing data, requiring that all flight-plan information be processed through a facility in Salt Lake City, Utah, overloading that facility.
“The situation is pretty much resolved,” FAA spokeswoman Diane Spitaliere said.
Sounds to me like an unexpectedly large breakdown, for which full scenario planning had not been performed. All the same, considering the scope of effort required to transfer flight processing across the country, it could have been much worse.
[Via Twitter.]
August 26th, 2008
Failed government IT: ‘The mother of all databases’
A memo from the House of Representatives Committee on Science and Technology describes an “IT system used to identify terrorist threats that has been crippled by technical flaws.” The failed system is part of a “central US government repository of data on international terrorist identities…described by Vice Admiral (Ret.) John Scott Redd as ‘the mother of all databases.’”
This enormous database, called the Terrorist Identities Datamart Environment (TIDE), is operated by the National Counterterrorism Center (NCTC) to support the “government’s various terrorist screening systems or watchlists.”
My take. I was initially skeptical of the allegations described in the House “Inspector General memo” because it raises highly technical issues in a political context. However, my impression changed substantially after studying the more detailed “Subcommittee memo,” which exhaustively documents the investigative sources forming the basis for the allegations.
Given the careful documentation, I believe the memos accurately portray current project status. While I have no opinion regarding specific descriptions of misappropriation of funds, the project management and contractor oversight flaws certainly ring true. From a technical perspective, the allegations are sufficiently detailed to appear rooted in fact.
The official NCTC response, described at the end of this post, offers little reassurance to those concerned about government waste on IT projects. Apparently, even the nation’s most substantial national security projects are subject to failure and allegations of malfeasance.
This isn’t the first government IT failure and certainly won’t be the last.
INSPECTOR GENERAL MEMO
The House Committee on Science and Technology impact memo, written to the Inspector General of the Office of the Director of National Intelligence (ODNI), frames the issue:
The Subcommittee has learned that the TIDE database is suffering from serious, long-standing technical problems. The Subcommittee has also learned that a critical NCTC initiative, named “Railhead,” which is intended to replace TIDE with enhanced capabilities has suffered from severe technical troubles, poor contractor management and weak government oversight. As a result, potentially hundreds of millions of dollars have been wasted, delivery schedules have slipped, contractor employees have been laid off in order to restrain escalating costs, and the NCTC is now scrambling either to fix the technical troubles or possibly to abandon the program altogether. The end result is a current IT system used to identify terrorist threats that has been crippled by technical flaws and a new system that if actually deployed will leave our country more vulnerable than the existing yet flawed system in operation today.
—
Some Railhead insiders allege that a significant portion of the estimated $500 million dollars spent on Railhead has been inappropriately used to renovate a building of one of the prime contractors, The Boeing Company, into a Sensitive Compartmentalized Information Facility (SCIF) in Herndon, Virginia. These individuals have also questioned the technical solutions endorsed by the government to replace the current TIDE database, the qualifications of some of the Boeing subcontractors and potential conflicts-of-interest between the program director of another key Railhead contractor, SRI International, and the government’s Railhead program manager because of their alleged close personal ties. In short, documents obtained by the Subcommittee suggest that, despite hundreds of millions of dollars invested in Railhead and years of development, the government has little to show for its efforts.
—
Like many of these programs, the flaws and failures on Railhead have been exacerbated by weak government oversight, poor contractor management and lack of contractor accountability for the program’s performance. Turfbattles among contractors, particularly between the design team and development team, have hampered the sharing of critical technical data that has impaired the success of the Railhead program. In addition, one list of Railhead staff from January 2008 identifies a virtual army of 814 private contract employees from dozens of companies involved in Railhead and only 48 government officials keeping tabs on this mammoth and critically important national security program. In fact, an estimated one dozen government slots on Railhead have been vacant for more than one year. A combination of these management problems and technical troubles seems to have doomed the Railhead program to failure.
SUBCOMMITTEE MEMO
The Inspector General memo was based on work performed by the Subcommittee on Investigations and Oversight. The more specific technical memo adds depth and detail to the allegations:
Among the largest and most expensive programs currently being funded by the ODNI is a program at the National Counterterrorism Center to improve and replace its current information technology systems, including the TIDE database, in order to enhance information sharing among federal agencies and improve access to counterterrorism intelligence data collected from more than 30 separate government networks that feed data into NCTC.
—
Documentation obtained by the Subcommittee points to a host of technical problems on Railhead, potential contractor mismanagement, contractor disputes, agency turf battles, poor government oversight and schedule delays that have hindered and hampered legitimate information sharing efforts on the program, have resulted in the potential waste of hundreds of millions of taxpayer dollars and placed the government’s key counterterrorism information sharing initiative in jeopardy of failing.
—
But technical problems on the current TIDE database appear to be hindering those efforts, and its successor –Railhead — is on the verge of collapse.
The original TIDE database, built by Lockheed Martin, replaced the Department of State’s TIPOFF database, designed and built by The Analysis Corporation, in the wake of the 9.11 terrorist attacks to automate the terrorist watch list. The TIDE database was built in Oracle as a relational database management system (RDBMS). This original database, however, suffers from basic design, management and maintenance inefficiencies and problems. For instance, only about 60% of the data, including names and addresses, mentioned in CIA cables provided to NCTC are actually extracted from these messages and placed into the TIDE database.
The TIDE database has evolved overtime as both contractors and government employees have attempted to expand and enhance the database to improve their own use of the system. But none of them appear to have taken into account the overall design or engineering architecture of the entire system. As a result, there are now dozens of tables or categories for identical fields of information making the ability to search or locate key data inefficient, ineffective and more time consuming and difficult than necessary.
In addition, the TIDE database relies on Structured Query Language (SQL), a cumbersome computer code that must utilize complicated sentence structures to query the tables, rows and columns that encompass the TIDE database. Without proper documentation on whether a table contains information on names, addresses, vehicles, license plates or an individual’s nationality, for instance, analysts have no valid mechanism to conduct a search of these “undocumented” tables.
Without a detailed index of the data stored in each table in TIDE, the SQL search engine is blindfolded, unable to locate or identify undocumented data. The current TIDE database is composed of data fields that are presented in 463 separate tables, 295 of which are undocumented, according to one internal Railhead document. As a result, critical terrorist intelligence in the TIDE system may not be searched at all. “Existing TIDE data model is complex, undocumented, and brittle,” the document notes, “which poses significant risk to RLSI [Railhead Lead System Integrator] data migration and modeling.”
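The memo’s figures work out to roughly 64% of tables lacking documentation (295 of 463). The sketch below shows one way a team might detect such gaps by comparing the database catalog against a data dictionary. It uses Python with an in-memory SQLite database purely as a stand-in, since TIDE runs on Oracle; the `data_dictionary` table and the two sample tables are hypothetical and do not reflect the actual TIDE schema.

```python
import sqlite3


def undocumented_tables(conn):
    """List tables in the catalog that have no data_dictionary entry."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name != 'data_dictionary' "
        "AND name NOT IN (SELECT table_name FROM data_dictionary) "
        "ORDER BY name"
    )
    return [row[0] for row in rows]


# Hypothetical schema: one documented table and one undocumented table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_dictionary (table_name TEXT PRIMARY KEY, description TEXT);
    CREATE TABLE person_names (id INTEGER, name TEXT);
    CREATE TABLE addr_v2 (id INTEGER, line1 TEXT);
    INSERT INTO data_dictionary VALUES ('person_names', 'Known names and aliases');
""")

print(undocumented_tables(conn))

# Share of undocumented tables, using the counts quoted in the memo.
undocumented_share = round(295 / 463 * 100, 1)
print(undocumented_share)
```

SQL itself is not the obstacle here. A query can only target tables whose contents someone has recorded, so a routine check like this one turns "undocumented" from a surprise into a tracked backlog.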
GOVERNMENT RESPONSE
The NCTC provided a vague and general response to the allegations, saying the conclusions are:
[I]nconsistent with the facts. The letter implies that there exists a risk to our nation’s security related to the implementation of NCTC’s information technology program, commonly known as Railhead. There has been no degradation in the capability to access, manage and share terrorist information during the life of the Railhead program.
Railhead is a multiple contract venue to support the operations and maintenance of existing IT systems; it replaces and builds new functions for the Center. Fundamentally, it is a series of technology (primarily software) upgrades implemented between now and 2012, rather than all at once to improve mission capabilities for many systems.
[Via an unnamed reader who referred me to the Ars Technica story; I’m always grateful for reader submissions of failed IT projects. Anonymous submissions are welcome. Requests for interview to both the ODNI and the Subcommittee were not returned.]
August 25th, 2008
Office 2.0: ‘Conversations’ prevent IT failure
Cultural issues are among the key drivers causing acute IT problems. Project failure rates remain high in large part because these drivers are difficult to identify and diagnose.
Many organizations accept information silos as a cost of doing business, despite the clear negative impact of these boundaries in communicating project status, problems, and potential points of failure. In extreme cases, projects fail and management claims complete ignorance of any problems whatsoever. Yes indeed, these are Dilbert moments.
The importance of conversation becomes magnified when we recognize the term information silos really means “people don’t talk with one another.” Sal Rasa, an innovative organizational development colleague of mine, elaborates:
Living in a Web 2.0 environment changes our perspectives on knowledge sharing and traditional organizational dynamics frameworks. “Conversations” become understood as critical….
It’s easy to sidestep the human dimension of success and failure, focusing instead on abstract notions of culture and politics. Consultant and blogger, Susan Scrupski, sent me an email making clear that self-serving individuals are responsible for project failures:
It’s not culture, but rather hubris and ego that blows up what could be fantastic product design or customer experiences. When people can’t work out their differences on a human level, brilliant projects are canceled and abandoned.
Still, culture can have a dramatic impact on success and failure across a range of industries and sectors. In a conversation about public sector financial waste, Suffolk University professor of organizational ethics, Lydia Segal, told me:
So you have rules designed to stop waste that now cause it. The waste is built into the rules and reinforced by the myopic organizational culture that those rules fostered.
Changing an organization’s culture to support successful IT involves establishing new attitudes toward organizational communication. Most organizations will continue to experience unacceptably high rates of IT project failure until they explicitly redefine work processes to reduce communication boundaries.
IT success rates will only improve when organizations initiate systematic efforts to institutionalize greater information sharing.
============
The upcoming Office 2.0 conference includes numerous sessions examining processes and technologies forward-thinking organizations have used to overcome information boundaries. If you’re interested in these issues, I recommend attending or sponsoring this conference.
ZDNet blogger, Dennis Howlett, comments on the Enterprise Irregulars’ connection to Office 2.0; another ZDNet colleague, Oliver Marks, is a conference speaker.
August 22nd, 2008
‘Debunking IT Project Failure Myths’ [podcast]
According to various studies, at least 30% of all IT projects fail in some important way. Failure rates seem to have plateaued at this level because most organizations don’t really understand why their IT projects go down. As a result, failures persist, with some organizations even proclaiming, “We’re not at fault because we did everything right.” Such attitudes are misguided.
In his recent report, titled “Debunking IT Project Failure Myths,” Lewis Cardin, a former CIO and currently senior analyst at Forrester Research, states:
Firms commonly use three metrics to decide whether a project effort is successful: Did the project meet its schedule, stay within budget, and deliver on requirements?… Firms use these same measurements to establish project success rates simply because they are so obvious and business execs easily understand them. IT execs often get measured on these outcomes and are held accountable for the results delivered by project managers; when a project is deemed a failure, this accountability can be bad news for IT. The problem is that this IT accountability is frequently misplaced. Worse still, the conclusion of failure is often incorrect.
Translation from the trenches: the real sources of IT failure lie in issues like project management culture, which neither IT nor most business people are comfortable analyzing, let alone fixing. Lewis’ research describes four critical dimensions that project stakeholders frequently misdiagnose, resulting in repeated failure:
Unresponsive governance, which leaves project decisions hanging. IT project governance has the role of project approval, problem resolution, direction-setting, and communication with business stakeholders…. Unless the cause of this delay is visible and shown to be for governance reasons, someone is going to point fingers — months later — to defective project management and not the real source of the delay.
Lack of communication from change management, which leads to false conclusions. Business managers may change requirements during project execution…. At project completion, the business has forgotten about the improved value component while memories are crystal clear about the increased dollar and time investment. A project has a high probability of being tallied under the failure column when in fact it may have been a noteworthy success.
An unrealistic project plan, which dooms the best project. All too often, when these projects go off the rails of the original project plan, PMs must spend more time on damage control with steering committees and project resources rather than on execution, doubling their work when it is least desirable to do so.
First-number syndrome, which makes business execs forget it’s an estimate. When projects are first sized, which is likely to occur before they are approved, estimates of cost, time, and resources are preliminary with a wide confidence interval…. But business execs remember the number and forget how uncertain it is…[and] may see this simply as increasing costs, not as the inevitable result of greater knowledge.
Issues such as executive sponsorship, business case, usability, and vendor integrity play a large role in determining the outcome of IT deployments. Unfortunately, most organizations don’t pay sufficient attention to these key areas because they’re hard to measure. Nonetheless, IT project success is not possible without paying careful attention to the real causes of failure.
ABOUT THE PODCAST
I urge you to spend 12 minutes listening to the attached podcast. Lewis and I explore these issues during a provocative and informative conversation. The discussion of change management and the cultural determinants of IT failure alone is worth your time.
August 19th, 2008
The triple sins that cause IT failure

Why don’t more organizations recognize potential IT project problems before they escalate into full-blown failures? Bruce F. Webster believes many companies reject good solutions to fix bad projects for three reasons: internal politics, budget, and fear/pride.
Bruce’s column in Baseline describes three sins that make failure almost inevitable in many organizations:
Internal politics. Large internal IT systems…usually involve several different groups, each of which may or may not be all that happy about having to work with some of the others, but are forced to do so for various budgetary, departmental, or business alignment reasons.
Budget. This may seem counter-intuitive, but management often finds it easier and safer to have a project drag on year after year, ultimately costing large sums of money, than to spend a relatively small (but still painful) portion of that amount up front and fix the problems now.
Fear/pride. Fear and pride can be closely related, particularly when the issue is admitting you made a mistake. This is particularly true if a key manager, architect, team leader, or developer has championed or defended a given approach that turns out not to have worked.
Organizational inertia, the decision-making gridlock that arises when conflicting personal agendas and viewpoints prevent team consensus, lies at the heart of many failures.
While experienced CIOs may recognize that politics and fear cause failure, simply wishing the problem away accomplishes nothing. Instead, wise leaders must take active steps to change organizational attitudes toward failure itself. In fact, it can be healthy for companies to prune back their project portfolio periodically, encouraging natural selection to leave only strong projects untouched.
Facing the inevitability of failure, what’s a responsible CIO to do? Short of seeking new employment, a CIO’s best weapon in the fight against corporate inertia is transparency. Exposing self-interested agendas to the harsh glare of daylight is the surest way to keep the system honest.
And that, my friends, is precisely what’s needed to improve IT project success.
[Image via http://home.att.net/~s.l.keim/Sermon.htm.]
Michael Krigsman is CEO of Asuret, Inc., a software and consulting company dedicated to reducing software implementation failures.



