Famous failures of complex engineering systems

P.S.: I found this information at its original location.

“To err is human, but to really foul things up you need a computer.”
I am listing some of the most famous failures of complex engineering systems. Although these failures were due partly or primarily to factors beyond engineering or technical considerations, we will concentrate on the technical issues. I haven't included some of the most dramatic failures, such as Chernobyl, Challenger, or Bhopal, because these involve much more complicated interactions of engineering and human judgment, and they have already received extensive coverage.
I guess complexity arises from the need to provide reliable predictability in the presence of uncertainty, and failures occur when uncertainties and interactions are not properly accounted for. In retrospect, for all of these failures, we can always identify a component that failed and do simple "back of the envelope" calculations with very simple models to explain the failure. It is essentially always possible to ignore, if we choose, the system design issues that contributed to the failure. Yet a deeper view always reveals that there were system design flaws and that the apparent component failure was merely a symptom. Of course, the VE (dynamics, interconnection, and uncertainty management) challenge is to create an environment where we are better at uncovering those flaws before the failure occurs.

Titanic

On April 14, 1912, the Titanic, the largest and most complex ship afloat, struck an iceberg and sank. The Titanic had a double-bottomed hull that was divided into 16 watertight compartments. Because at least four of these could be flooded without endangering the liner's buoyancy, it was considered unsinkable. Unfortunately, these compartments were not sealed off at the top, so water could fill a compartment, tilt the ship, and then spill over the top into the next one. Following recent expeditions to examine the Titanic wreckage and a review of survivor accounts, it is now generally agreed that the iceberg scraped along the starboard side of the ship, causing the plates to buckle and burst at the seams and producing several small ruptures in up to six of the forward compartments. This is perhaps one of the all-time great failures to correctly model the interaction of uncertainty in the environment and the way it can couple with the dynamics of a system. A purely static view of the ship, one that ignored the dynamics of the interaction with the iceberg and the water flow between the compartments, would not have predicted the actual disaster.
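To see why the dynamic, spill-over view matters, here is a toy progressive-flooding simulation. The compartment count matches the ship, but every other number is invented; this is only a caricature of the mechanism, not a naval-architecture calculation.

```python
# Toy model of progressive flooding: compartments that are watertight on the
# sides but open at the top pass water aft once bow-down trim lowers the
# effective bulkhead tops. All numbers are invented purely for illustration.

N = 16                  # compartments, bow (index 0) to stern (index 15)
BULKHEAD_TOP = 1.0      # water volume that fills a compartment to its bulkhead top
INFLOW = 0.05           # volume per time step entering each breached compartment
BREACHED = range(6)     # the iceberg opened roughly the first six compartments
TRIM_PER_UNIT = 0.02    # how far the effective tops drop per unit of flood water

water = [0.0] * N
for step in range(2000):
    for i in BREACHED:                      # sea water enters through the hull ruptures
        water[i] += INFLOW
    trim = TRIM_PER_UNIT * sum(water)       # more flood water -> more bow-down trim
    effective_top = max(BULKHEAD_TOP - trim, 0.0)
    for i in range(N - 1):                  # anything above a bulkhead top spills aft
        overflow = water[i] - effective_top
        if overflow > 0.0:
            water[i] -= overflow
            water[i + 1] += overflow
    if water[-1] > 0.0:
        print(f"step {step}: flood water has reached the aft-most compartment")
        break
```

A static analysis that only asks "can the ship float with six compartments flooded?" never sees this cascade; the coupling between trim and the open compartment tops is what dooms the toy ship, just as in the paragraph above.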


Estonia Ferry

It would seem unlikely that a mistake of the type that occurred on the Titanic would be repeated. However, a weak door lock was one of the main causes of the 1994 Estonia ferry disaster, which killed more than 800 people. The ferry's bow visor, a huge top-hinged door at the front of the ferry that swung up to allow vehicles to be driven onto and off the car deck, was secured by several locks. The lower lock, known as the Atlantic lock, was too weak to withstand extremely heavy pounding by rough seas. Stormy seas in the Baltic Sea on Sept. 28 broke the lock between 30 minutes and one hour before the 157-meter (515-foot) ferry sank shortly after midnight. The noise of the loose bow visor slamming against the hull was heard by several survivors. The slamming set off a chain of events, including the failure of the other locks, that ended in the tragedy. Only 137 of the more than 900 people on board survived. The commission that investigated the incident said the shipbuilder did not have proper blueprints for the lock when constructing the ferry in 1980. As a result, the commission concluded, the shipbuilder apparently made its own calculations and underestimated how strong the lock needed to be. This particular failure would seem the one most likely to have been caught with an integrated CAD system.

Tacoma Narrows Bridge
The Tacoma Narrows Bridge was the first suspension bridge across the Narrows of Puget Sound, connecting the Olympic Peninsula with the mainland of Washington, and a landmark failure in engineering history. Four months after its opening, on the morning of Nov. 7, 1940, in a wind of about 42 miles (68 km) per hour, the 2,800-foot (853-meter) main span went into a series of torsional oscillations the amplitude of which steadily increased until the convolutions tore several suspenders loose, and the span broke up. The bridge was designed to have acceptable horizontal displacement under the static pressure of a much larger wind, but was not designed to handle the dynamic instability caused by an interaction of the winds and the high degree of flexibility of the light, narrow, two-lane bridge. Modeling this type of fluid/structure interaction, a particularly simple type of flutter, was within the technical capability of engineers at the time, but was evidently not considered. A modern analysis would likely view the fluid/structure flutter as a bifurcation problem, and analyze the nature of the bifurcation as the wind speed increased. Immediately after the accident, numerous investigators were able to create both simple mathematical and scale physical models that exhibited the same failure as the actual bridge, and very simple models were able to predict the wind speed that would cause the collapse.
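A minimal sketch of the negative-damping view of flutter is below. The single-mode model and every parameter value are illustrative assumptions, not the bridge's actual properties; the point is only that the growth or decay of torsional oscillations flips sign at a critical wind speed, which is the bifurcation mentioned above.

```python
# One-mode caricature of torsional flutter: the wind contributes "negative
# damping" proportional to speed U,
#     I*theta'' + (c - a*U)*theta' + K*theta = 0,
# so small oscillations decay for U < c/a and grow for U > c/a.
# All numbers are illustrative, not Tacoma Narrows parameters.
import numpy as np

I, c, K, a = 1.0, 0.3, 4.0, 0.01   # inertia, structural damping, stiffness, aero coefficient
U_crit = c / a                      # wind speed where the net damping changes sign
print(f"flutter onset at roughly U = {U_crit:.0f} (arbitrary units)")

def growth_rate(U):
    """Real part of the oscillatory eigenvalues of the linear model."""
    roots = np.roots([I, c - a * U, K])
    return max(r.real for r in roots)

for U in (0.5 * U_crit, 1.5 * U_crit):
    rate = growth_rate(U)
    trend = "decay" if rate < 0 else "grow"
    print(f"U = {U:5.1f}: oscillations {trend} (growth rate {rate:+.3f})")
```

This is exactly the kind of very simple model that, after the collapse, was able to reproduce the failure and estimate the wind speed at which it would occur.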

Subsynchronous resonance in power systems
Series capacitors are often used in AC transmission systems to provide impedance compensation, particularly for long lines with high inductance, at the 60 Hz synchronous transmission frequency. Series capacitors are an economical way to increase load-carrying capacity and enhance transient stability, but the capacitors can combine with the line inductance to create oscillators with natural frequencies below 60 Hz. These electrical oscillators can interact with the mechanical torsional vibration modes of the generator turbine shaft, and in some circumstances can cause instabilities that snap the shaft. This happened dramatically at the Mohave Generating Station in southern Nevada in 1971, when the turbine shaft broke twice before the condition was properly diagnosed. This is a classic example of uncertainty management gone awry: the capacitors were introduced to improve stability on the electrical side and reduce vulnerability to electrical disturbances, but they had the unanticipated effect of destabilizing the mechanical side. The phenomenon is now reasonably well understood and is taken very seriously in the design of power systems.
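The back-of-the-envelope check is simple. The sketch below uses the standard series-resonance estimate; the reactances and the list of torsional mode frequencies are invented for illustration.

```python
# Rough subsynchronous resonance check (illustrative numbers).
# For a series-compensated line, the electrical natural frequency is
#     f_er = f0 * sqrt(Xc / Xl),
# with Xc and Xl the capacitive and inductive reactances at the synchronous
# frequency f0. Torsional interaction is a concern when the complementary
# frequency f0 - f_er falls near one of the shaft's torsional modes.
from math import sqrt

f0 = 60.0                      # synchronous frequency, Hz
Xl = 1.0                       # total inductive reactance, per unit
Xc = 0.35                      # series-capacitor reactance (35% compensation), per unit
torsional_modes_hz = [15.7, 20.2, 25.5, 32.3]   # hypothetical shaft modes

f_er = f0 * sqrt(Xc / Xl)      # electrical resonance, below 60 Hz
f_complement = f0 - f_er       # frequency seen by the turbine-generator shaft

print(f"electrical resonance: {f_er:.1f} Hz, complement: {f_complement:.1f} Hz")
for mode in torsional_modes_hz:
    if abs(mode - f_complement) < 2.0:
        print(f"warning: complement is within 2 Hz of the {mode} Hz torsional mode")
```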

Telephone and power system outages
In recent years there has been an increasing rash of large-scale breakdowns of both the telephone and power systems, typically triggered by small events that lead to a cascade of failures that eventually brings down large portions of the network. The high complexity and interconnectedness of these networks is designed to improve their performance and robustness, but can lead to extreme and unexpected sensitivity to small disturbances. In both cases, highly interconnected nationwide networks allow load balancing to be achieved more economically, and the resulting system is, in principle and usually in practice, much more robust to large disturbances or variations in demand. But the high degree of connectivity also makes it possible for small failures to propagate and lead to massive outages. The usual remedy for these sensitivities is to add further complexity in the form of more sophisticated control strategies; without careful design, this trend toward increasing complexity will not improve robustness.
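The cascade mechanism is easy to caricature. The toy model below (invented numbers, not a power-flow study) trips one line and redistributes its load equally across the survivors; with only modest headroom, each round of redistribution overloads a few more lines until the whole system is down.

```python
# Toy cascade model: tripping one line shifts its load onto the survivors,
# pushing the weakest of them over capacity, which sheds still more load,
# and so on. Numbers are invented purely for illustration.
N = 50
load = [1.0] * N                                 # every line starts at unit load
capacity = [1.0 + 0.01 * i for i in range(N)]    # line i has i% headroom
failed = set()

just_failed = {0}                                # a single small initial failure
failed |= just_failed
while just_failed:
    shed = sum(load[i] for i in just_failed)     # load lost this round
    survivors = [i for i in range(N) if i not in failed]
    if not survivors:
        break
    extra = shed / len(survivors)                # crude equal redistribution
    just_failed = set()
    for i in survivors:
        load[i] += extra
        if load[i] > capacity[i]:                # overloaded lines trip next
            just_failed.add(i)
    failed |= just_failed

print(f"one tripped line cascaded into {len(failed)} of {N} line failures")
```

In this caricature the interconnection that spreads load so efficiently is the same interconnection that spreads the failure.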

Denver airport baggage handling system
The automated system was supposed to improve baggage handling by using a computer tracking system to direct baggage contained in unmanned carts that run on a track. Originally scheduled for completion in March 1994, the unfinished $234 million project helped postpone opening of the airport until February 1995. The delay reportedly cost the city roughly $1 million per day in operations costs and interest on bond issues, more than the direct cost of the project. Significant mechanical and software problems plagued the automated baggage handling system. In tests of the system, bags were misloaded, were misrouted, or fell out of telecarts, causing the system to jam. The baggage system continued to unload bags even though they were jammed on the conveyor belt, because the photo eye at this location could not detect the pile of bags on the belt and hence could not signal the system to stop. The baggage system also loaded bags into telecarts that were already full. Hence, some bags fell onto the tracks, again causing the telecarts to jam. This problem occurred because the system had lost track of which telecarts were loaded or unloaded during a previous jam. When the system came back on-line, it failed to show that the telecarts were loaded. The timing between the conveyor belts and the moving telecarts was not properly synchronized, causing bags to fall between the conveyor belt and the telecarts. The bags became wedged under the telecarts, which were bumping into each other near the load point.
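The lost-state failure in particular is easy to illustrate. The sketch below is a minimal caricature with hypothetical class and method names; the point is only that a controller whose record of cart contents is wiped out when a jam is cleared will dispatch bags onto carts that are already physically full.

```python
# Sketch of the lost-state failure described above. Names are hypothetical.

class CartController:
    def __init__(self, num_carts):
        self.loaded = [False] * num_carts       # controller's belief about each cart

    def load_bag(self, cart_id):
        """Dispatch a bag onto a cart the controller believes is empty."""
        if self.loaded[cart_id]:
            print(f"cart {cart_id} believed full -> bag held back")
            return
        self.loaded[cart_id] = True
        print(f"bag loaded onto cart {cart_id}")

    def recover_from_jam(self, persisted_state=None):
        # The failure mode: restarting with a blank slate instead of the real
        # cart contents. Persisting (or re-scanning) the state avoids it.
        self.loaded = persisted_state if persisted_state else [False] * len(self.loaded)


ctrl = CartController(num_carts=3)
ctrl.load_bag(0)                    # cart 0 is now physically full
snapshot = list(ctrl.loaded)        # what a persistent record would have kept

ctrl.recover_from_jam()             # blind restart: belief reset to "all empty"
ctrl.load_bag(0)                    # a second bag is dispatched onto a full cart

ctrl.recover_from_jam(persisted_state=snapshot)
ctrl.load_bag(0)                    # with state restored, the double-load is refused
```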

Ariane 5
The Ariane 5 was not flight tested because there was so much confidence in modeling and simulation. The first flight carried $500M of satellites and was destroyed about 40 seconds after liftoff. The error that ultimately led to the destruction of the Ariane 5 launcher was clearly identified in the report of the investigating committee: a program segment for converting a floating-point number, representing a measurement, to a signed 16-bit integer was executed with an input value outside the range representable by a signed 16-bit integer. This run-time error (out of range, overflow) arose in both the active and the backup computers at about the same time; it was detected, and both computers shut themselves down. This resulted in the total loss of attitude control. The Ariane 5 turned uncontrollably, and aerodynamic forces broke the vehicle apart. The breakup was detected by an on-board monitor, which ignited the explosive charges to destroy the vehicle in the air. The code in question had been reused from an earlier vehicle, on which the measurement would never have become large enough to cause this failure.

It is tempting to simply dismiss this as a software bug that would be eliminated by better software engineering. It is obvious that the programmer should have checked that the measurement was small enough for the conversion to take place and, if it was not, had the control system take some appropriate action rather than simply shut down. In this case the appropriate action would have been to do nothing, because this measurement, ironically, wasn't even needed after liftoff. This may seem to make it a trivial issue, but the same code did work fine on the Ariane 4, although a control engineer would presumably have preferred it be done differently.
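For concreteness, here is the shape of the bug and of the range check described above, transcribed into Python for illustration (the flight code was Ada). The 16-bit limits are real; the function names, the saturating fallback, and the input value are assumptions for the example, since the appropriate action in flight would have been, as noted, to do nothing with the value at all.

```python
# Illustrative only: an unchecked float-to-int16 conversion versus a guarded one.

INT16_MIN, INT16_MAX = -32768, 32767

def to_int16_unchecked(x: float) -> int:
    """Mimics the flight code: convert, and trap (raise) if the value doesn't fit."""
    if not INT16_MIN <= x <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a signed 16-bit integer")
    return int(x)

def to_int16_guarded(x: float) -> int:
    """One possible range check: saturate instead of shutting the computer down."""
    return int(min(max(x, INT16_MIN), INT16_MAX))

measurement = 50_000.0                        # hypothetical value beyond the Ariane 4 range
print(to_int16_guarded(measurement))          # 32767: degraded, but the computer keeps running
try:
    print(to_int16_unchecked(measurement))
except OverflowError as err:
    print("unhandled in flight:", err)        # on Ariane 5, both computers shut down here
```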

While the "software bug" view has some truth, it is misleading, because the failure was due to dynamics of the Ariane 5 that were different from those of the Ariane 4. It is the interaction of the software with the uncertainty in the environment and the dynamics of the vehicle that caused the failure. This is not a software issue, but a design flaw at a much deeper level. The programmers responsible likely had no idea how to determine whether the Ariane 5's dynamics could, under the right environmental conditions, drive the measurement out of range. Presumably they could have consulted appropriate experts in control and aerodynamics and anticipated the problem, but that would not have been a computer science issue at all.

Something interesting: 20 Famous Software Disasters
