When Science Goes Wrong: Twelve Tales From the Dark Side of Discovery - Simon LeVay (2008)

SPACE SCIENCE: Off Target

ON SEPTEMBER 23, 1999, after a journey of 419 million miles, the Mars Climate Orbiter spacecraft made its final approach to the Red Planet. At one minute after two in the morning, a tired but excited group of engineers and scientists at NASA’s Jet Propulsion Laboratory (JPL) near Pasadena, California, broke into smiles and applause: a signal had arrived to indicate that the spacecraft’s main engine had begun firing. This event would reduce the spacecraft’s speed enough for it to be captured by Mars’ gravitational field and go into orbit. Shortly thereafter, as expected, signals from the spacecraft ceased as it passed into the radio shadow behind the planet. The JPL group waited impatiently for the spacecraft to re-emerge on the other side of the planet, an event that was predicted to occur 21 minutes later. But the silence stretched to 22, 23 and 24 minutes, and then to hours and days. In fact, no signal was ever received from the spacecraft again. The Mars Climate Orbiter was lost, and the mission was a total failure.

Losing a Mars mission was not exactly a new experience for NASA: three out of the 11 previous US missions had ended the same way. Just six years before the Mars Climate Orbiter mishap, the $800 million Mars Observer spacecraft was lost in rather similar circumstances: radio signals from the spacecraft mysteriously ceased during final approach.

Still, the US Mars programme overall had a good track record, especially in comparison with the Soviet programme. Of the 18 Russian spacecraft sent to Mars before 1999, 15 had been total losses – often failing to reach space at all – and the remaining three were only partial successes. And of the successful earlier American missions, some had been extraordinarily complex. These included the Viking mission of 1975 – a fleet of two orbiters and two landers, all of which functioned as planned and for far longer than their nominal design lifetimes. By the time of the Mars Climate Orbiter launch, there was a real confidence that NASA and its industrial partners – Lockheed Martin Astronautics, in this case – knew how to get the job done. It may have been this very confidence that sank the mission.

The root cause of the loss was a scientific blunder as old as science itself: the confusion of units. Such errors can be prevented. Yet, in a larger sense, the loss represented a failure in systems engineering – that is, a failure successfully to integrate thousands of individual technical contributions into a single cohesive whole: a product that would fulfil the objectives of the customer, the US government. In that sense, the Mars Climate Orbiter mishap represented what is probably the most common mode of failure in large and complex scientific enterprises, and one that is extremely difficult to eradicate.

In the early 1990s, NASA administrator Dan Goldin spearheaded a new approach to the design and implementation of space missions, an approach that was encapsulated in the slogan ‘Faster, Better, Cheaper’, or FBC. The FBC philosophy was, in part, a response to fiscal belt-tightening imposed by the US government. It also represented a reaction to some of the earlier missions – huge, long-delayed projects that incorporated every imaginable bell and whistle and that went tens or hundreds of millions of dollars over budget. FBC was a leaner approach that aimed to achieve more with less, employing economical strategies such as the re-use of design elements that had proven successful in earlier missions. Although this ‘heritage’ approach promised great savings in time and money, it also injected risk. How could one be sure that a large piece of software, for example, would function successfully in the different environment of a new spacecraft? And the FBC approach also demanded economies in manpower. This was something that might be acceptable so long as things went according to plan, but it might cause problems when unexpected difficulties needed to be surmounted, as happened with the Mars Climate Orbiter.

Goldin’s strategy had some early successes. The 1996 Mars Pathfinder mission, for example, safely delivered the rover Sojourner to the Martian surface using a novel airbag landing system. The rover was able to navigate semi-autonomously around the landing site, and it did some simple geological investigations of nearby rocks. It also caught the imagination of the public, including children, back on Earth: Mattel’s rover action model was America’s best-selling toy of the summer of 1997.

Soon after Pathfinder’s landing, another spacecraft, the Mars Global Surveyor, reached the planet and went into a polar orbit. Over the following 10 years, it took nearly a quarter of a million photographs of the Martian surface, and it also operated as a communications satellite for other missions until it ceased functioning in 2006.

In spite of these successes, there were also hints of problems with the FBC approach. In 1997, for example, an Earth-orbiting satellite called Lewis was launched, but once in orbit it went into a spin that prevented its solar panels from facing the sun; this caused the batteries to lose charge. The problem occurred at night while the controllers were off duty; economic considerations had prevented the appointment of sufficient controllers for round-the-clock staffing. By the time the controllers returned to work the next morning, the spacecraft was completely out of electrical power and thus could not be resuscitated: it burned up in the atmosphere a few weeks later.

The Mars Climate Orbiter (MCO) was one element in a two-spacecraft mission named the Mars Surveyor ’98 Programme. The other element was the Mars Polar Lander. The role of the Climate Orbiter was to study the Martian atmosphere with a variety of instruments and also to serve as a communication link for the Lander and for other, future, missions. The Lander was to set down near the planet’s south pole – using retrorockets rather than an airbag – and dig into the soil with the particular aim of finding water.

Lockheed Martin won the $121-million contract to build both spacecraft (not including the scientific instruments) and the company was expected to complete the job with minimal oversight from NASA, consistent with the Faster, Better, Cheaper philosophy.

The error that doomed the MCO spacecraft occurred during the development of its navigational software. To understand the error, it is necessary to appreciate that once a Mars-bound spacecraft has escaped Earth’s gravitational field, it is essentially coasting in an orbit around the sun – albeit a highly elliptical orbit that is carefully planned to intersect the orbit of Mars at a time when Mars itself has reached that location. Thus, if the spacecraft’s initial direction and speed are well enough known, its future trajectory can be readily calculated from gravitational equations provided by Isaac Newton.

Two non-gravitational factors can affect the trajectory, however. One consists of deliberate changes in trajectory induced by firing of the spacecraft’s small rocket engines, or ‘thrusters’. Usually, four or five of these trajectory correction manoeuvres (or TCMs) are performed in the course of the flight. Each time a TCM is performed, its effect on the trajectory has to be determined.

The other main source of non-gravitational effects comes from the pressure of solar radiation on the spacecraft, and from the craft’s efforts to compensate for that pressure. Radiation pressure is exerted mainly on the craft’s solar panels, on account of their large area. Unlike the Mars Global Surveyor, which had solar panels on either side of the spacecraft, the Mars Climate Orbiter had all of its panels on one side of the craft. Because of this asymmetrical design, the main effect of solar radiation was a tendency to spin the spacecraft around on its axis. Such a spin was undesirable because it reduced the amount of sunlight received by the panels. To counteract this effect, a set of small reaction wheels resembling the metal discs in gyroscope toys were automatically spun up by electric motors. These spinning reaction wheels generated an equal and opposite rotational force, so no actual rotation of the spacecraft occurred and it maintained a constant orientation to the sun.

Of course, the reaction wheels could only be spun up to a certain limiting speed, which was about 3,000 rpm. Because of the spacecraft’s asymmetrical design, it took less than 24 hours of flight for the reaction wheels to reach this limit. Then their accumulated angular momentum had to be ‘dumped’. This was done by slowing the wheels, while at the same time compensating for the effect of the slowing by firing some of the thrusters. In this way the reaction wheels could be brought back to a stop while keeping the spacecraft’s attitude constant. Then the cycle began again. These dumps were formally named ‘Angular Momentum Desaturation’ events, or AMDs, and they occurred about 10 times a week during the trip to Mars.

If these thruster firings had only affected the spacecraft’s rotation, they would not have disturbed its trajectory through space and would, therefore, have been irrelevant from a navigational standpoint. Unfortunately, design considerations led to the positioning of the thrusters in such a way that the AMD would also kick the spacecraft sideways by a small amount, and the frequent repetition of these events over the course of the flight would cause the spacecraft to deviate by a significant degree from its planned trajectory – enough to prevent the craft from entering the Mars orbit correctly.

The Lockheed engineers knew about this issue. They also knew that it would be difficult to measure the effect of the AMDs on the spacecraft’s trajectory at the time the AMDs occurred. This was because only limited information would be available about the spacecraft’s position, speed and direction of travel during the flight. In general, it is possible accurately to measure a spacecraft’s distance from Earth (based on the time for a radio signal to travel from Earth to the spacecraft and back) and its speed along the line of sight from Earth (based on Doppler changes in a fixed-frequency signal emitted by the spacecraft). However, it is not possible directly to measure its position or speed in the two dimensions that are perpendicular to the line of sight. These other variables can eventually be determined by making repeated measurements of the spacecraft’s range and line-of-sight velocity as it follows its elliptical trajectory, but these determinations are slow and subject to various forms of error. Unfortunately, the effects of the AMD events were exerted largely in the difficult-to-observe dimensions perpendicular to the line of sight from Earth.
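The two quantities that can be measured directly lend themselves to a minimal sketch. It assumes only the simplest textbook relations: range is half the round-trip signal time multiplied by the speed of light, and line-of-sight speed follows from the first-order Doppler shift of a one-way signal. The function names and the sample numbers are illustrative assumptions, not mission values.

```python
# Illustrative sketch only: textbook relations with hypothetical inputs.
SPEED_OF_LIGHT_M_S = 299_792_458.0  # exact, by definition of the metre


def range_from_round_trip(round_trip_seconds: float) -> float:
    """Distance to the spacecraft in metres, from a two-way signal time."""
    return SPEED_OF_LIGHT_M_S * round_trip_seconds / 2.0


def line_of_sight_speed(emitted_hz: float, received_hz: float) -> float:
    """Speed along the line of sight in m/s (first-order Doppler, v << c).

    Positive means the spacecraft is receding from Earth.
    """
    return SPEED_OF_LIGHT_M_S * (emitted_hz - received_hz) / emitted_hz


if __name__ == "__main__":
    # Hypothetical example: a 22-minute round trip and a small frequency drop.
    metres = range_from_round_trip(22 * 60)
    print(f"range: {metres / 1e9:.1f} million km")
    speed = line_of_sight_speed(8.4e9, 8.4e9 - 84_000)
    print(f"line-of-sight speed: {speed:.1f} m/s")
```

Neither function says anything about motion perpendicular to the line of sight, which is exactly the gap the navigators faced.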

To solve this problem, the engineers followed a strategy that had been employed in previous missions such as Mars Global Surveyor: they would use a form of ‘dead reckoning’. This depended on knowing exactly how much thrust would be exerted on the spacecraft – and in which direction – when each thruster was fired for a known period of time. With this information, it would be possible to calculate how much the speed and direction of the spacecraft would change during each AMD event, even without measuring those changes directly.
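The dead-reckoning idea can be sketched in a few lines: if the thrust, the firing time and the spacecraft's mass are known, the change in velocity is simply impulse divided by mass. The class and function names, and every number in the example, are hypothetical placeholders chosen for illustration; the sketch also assumes the mass stays constant during a short firing.

```python
from dataclasses import dataclass
from typing import Tuple

Vector = Tuple[float, float, float]


@dataclass(frozen=True)
class ThrusterFiring:
    """One firing, described by quantities known in advance or radioed back."""
    thrust_newtons: float    # force on the spacecraft, from the maker's data
    duration_seconds: float  # how long the thruster fired
    direction: Vector        # unit vector of the push on the spacecraft


def delta_v(firing: ThrusterFiring, spacecraft_mass_kg: float) -> Vector:
    """Velocity change in m/s, assuming constant mass over a short firing."""
    impulse = firing.thrust_newtons * firing.duration_seconds  # N*s
    return tuple(impulse * c / spacecraft_mass_kg for c in firing.direction)


if __name__ == "__main__":
    # Hypothetical values, not mission data.
    firing = ThrusterFiring(thrust_newtons=1.0,
                            duration_seconds=10.0,
                            direction=(1.0, 0.0, 0.0))
    print(delta_v(firing, spacecraft_mass_kg=500.0))
```

The calculation is only as good as the thrust figure fed into it, which is why the units of that figure mattered so much.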

To make this possible, the subcontractor who manufactured the thrusters sent paperwork to Lockheed Martin that documented how much thrust was generated when each thruster was fired. This manufacturer was accustomed to working in English units (pounds, feet and so on). NASA generally requires the use of metric units throughout its operations and those of its contractors, but it makes exceptions in cases where ordering a change of units may be unduly burdensome, or where it increases the risk that the contractor will make some kind of mistake. NASA made an exception of this kind for this subcontractor, so the paperwork received by Lockheed Martin listed the thrusters’ performance in units of pounds of force.

The root cause of the Mars Climate Orbiter mishap was the failure to convert these English units to metric units of force – newtons – in the preparation of a navigational software file called ‘Small Forces’. The purpose of this file was to determine how strongly each AMD event would push the spacecraft out of its intended path. Because the remaining navigational software assumed that the output of the Small Forces file was in newtons, it underestimated the deflection of the spacecraft’s trajectory caused by each AMD event by a factor equal to the ratio between pounds and newtons – which is to say, by a factor of 4.45.
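The size of the error follows directly from the definition of the pound-force, as this short sketch shows. It uses the standard conversion of 1 lbf = 4.4482216152605 N; the impulse value is a made-up number, and the function name is an assumption for illustration rather than anything taken from the Small Forces file.

```python
NEWTONS_PER_POUND_FORCE = 4.4482216152605  # standard definition of the lbf


def lbf_seconds_to_newton_seconds(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton seconds."""
    return impulse_lbf_s * NEWTONS_PER_POUND_FORCE


if __name__ == "__main__":
    written_to_file = 0.5  # hypothetical impulse, really in lbf*s
    true_impulse = lbf_seconds_to_newton_seconds(written_to_file)
    assumed_impulse = written_to_file  # downstream code reads it as N*s
    ratio = true_impulse / assumed_impulse
    print(f"true impulse is {ratio:.2f} times the assumed one")
```

Whatever number is written to the file, the ratio between the true and the assumed impulse is the same conversion factor.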

To understand how this error occurred, I spoke with John Casani, the onetime chief engineer at JPL who led the lab’s internal investigation into the mishap. I also spoke with Steve Jolly of Lockheed Martin, who was the lead systems engineer for the Mars Climate Orbiter. Casani’s and Jolly’s accounts agreed in one respect: they both told me that the failure to convert the units was primarily the fault of a young engineer who had only completed college a couple of months earlier and who was a new hire at Lockheed Martin. (This person has never been identified by name.)

Casani and Jolly gave me somewhat differing accounts of how the young engineer actually came to make the mistake. According to Casani, the engineer was given the documentation from the thruster manufacturer that contained performance data in pounds, as well as a set of instructions from JPL that specified that the output of the Small Forces file should be in newtons. The engineer simply failed to read the JPL document with sufficient care and thus overlooked the requirement for the conversion. According to this account, then, the engineer thought he was doing the right thing by providing the output in English units.

According to Jolly (who presumably would have been more knowledgeable about the matter), the engineer did know that pounds needed to be converted to newtons. The reason he failed to make the conversion, Jolly told me, had to do with the ‘heritage’ issue. The Mars Global Surveyor had used similar navigational software, and the plan was to save money by having the Climate Orbiter ‘inherit’ or reuse it. The new spacecraft used different thrusters, however, so the portion of the software relating to thruster performance was excised, and the engineer’s task was to replace that portion with code incorporating performance data for the new thrusters. Unfortunately, he assumed that the code that made the conversion of units was left in the unexcised Global Surveyor software, whereas in fact it was in the excised portion. The conversion was represented simply by the number 4.45 in an equation, without any comment as to its purpose, so it was easy to miss. Thus, in writing the new code, the engineer left the units in pounds, thinking that the required conversion would be made by the pre-existing software.
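Jolly's description of a bare 4.45 buried in an equation suggests a contrast that can be sketched in code. Both functions below are hypothetical reconstructions written for illustration, not the actual flight or ground software: the first hides the conversion in an unexplained number, while the second names the constant and states its units in the parameter names.

```python
# Hypothetical "heritage" style: nothing says what 4.45 is for, so a
# maintainer could excise or overlook the conversion without noticing.
def impulse_legacy(thrust, seconds):
    return thrust * seconds * 4.45


# The same calculation with the conversion made explicit.
NEWTONS_PER_POUND_FORCE = 4.4482216152605  # 1 lbf expressed in newtons


def impulse_newton_seconds(thrust_lbf: float, burn_seconds: float) -> float:
    """Impulse in N*s from a thrust given in pounds of force."""
    thrust_newtons = thrust_lbf * NEWTONS_PER_POUND_FORCE
    return thrust_newtons * burn_seconds


if __name__ == "__main__":
    # Hypothetical inputs: 1 lbf of thrust for 2 seconds.
    print(impulse_legacy(1.0, 2.0))
    print(impulse_newton_seconds(1.0, 2.0))
```

The arithmetic is nearly identical in the two versions; the difference is that the second makes the conversion hard to remove by accident.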

Although the engineer’s mistake was the root cause of the mishap, such mistakes are inevitable as long as science is done by humans. The more serious error was the failure of anyone to spot the mistake. Part of the problem was that a factor of 4.45 is not a terribly large error in engineering terms: the faulty code produced output that looked quite reasonable. In fact, if the Mars Global Surveyor (with its symmetrical solar panels) had incorporated the same error, that mission would probably not have been affected. It was only the asymmetrical design of the Climate Orbiter, with the resulting need for numerous AMD events, that allowed the small individual errors to accumulate to a mission-endangering level.

Following standard procedures, the faulty software was reviewed, but the error wasn’t spotted. Then it went through formal testing: using fictional AMD events, the output of the software was compared with the output of manual calculations. Unfortunately, the manual calculations somehow incorporated the same error as was present in the faulty software, so the two outputs were in agreement and the software was judged to be good.

The Small Forces software was not actually loaded into the spacecraft’s computer; rather, it was placed in computers that remained on the ground. The idea was that every time an AMD event occurred, the spacecraft would radio back data about the length of firing of the thrusters, the attitude of the spacecraft and so on, and the navigational team would then feed the data into the ground computer to extract measures of the magnitude, duration, and direction of thrust. These measures would then be used to adjust the model of the spacecraft’s trajectory.

By a bitter irony, the spacecraft’s own computer did in fact possess software to make this calculation independently, and these files correctly specified the resulting thrust in newtons. ‘You can imagine how many times I wake up at night thinking about that,’ said Jolly. In fact, the spacecraft was even programmed to radio the output of these calculations to the ground, but the navigators did not know this so no one looked at the incoming data packets or compared them to the output of the erroneous calculations being performed on the ground. If they had done so, the error would have been quickly detected. Even when I spoke with him in 2006, after NASA’s official inquiry had established and published the fact, Jolly said that he didn’t know that the spacecraft had been transmitting the correct data to Earth.

On December 11, 1998, the Mars Climate Orbiter was launched from Cape Canaveral Air Station in Florida, atop a Delta II rocket. The Delta II is a relatively inexpensive, medium-powered launch vehicle. So as not to exceed the Delta II’s lifting capacity, the mission planners had to economise on the weight of fuel carried by the spacecraft – fuel which was required for slowing the spacecraft when it reached Mars. The planners took two steps to save on fuel. First, they sent the spacecraft by a long route that took it more than halfway around the sun: this ensured that it was travelling relatively slowly as it approached Mars, but it lengthened the trip to nine months rather than the six months needed for a more direct route. Second, they planned to accomplish some of the slowing by aerobraking – repeatedly dipping the spacecraft into Mars’s outer atmosphere on successive orbits after the first encounter – rather than relying entirely on the spacecraft’s engines. Even with these measures, the orbital insertion burn would have to slow the spacecraft by nearly 5,000 kilometres per hour, a task that would consume nearly 300 kilograms of fuel – almost half the total weight of the spacecraft.

The launch went flawlessly. The Delta II’s first two stages lifted the spacecraft into low Earth orbit, then the third stage booster rocket fired for 88 seconds, kicking the spacecraft out of Earth’s gravitational clutches. After the booster separated, the spacecraft deployed its solar panels and began its long, unpowered cruise toward Mars.

Teams at JPL and Lockheed Martin, led by JPL flight operations manager Sam Thurman, monitored and controlled the spacecraft during its journey. Part of the team was a group of four JPL navigators, led by Pat Esposito, whose task was to determine the spacecraft’s trajectory and calculate the required corrections during the flight. The team was also responsible for two other spacecraft, however – Mars Global Surveyor (which was orbiting Mars) and Mars Polar Lander (which was launched on January 3). Only one team member, Eric Graat, could give his undivided attention to the Mars Climate Orbiter. This was a low level of staffing compared with previous and subsequent missions: the successful Mars Odyssey mission of 2001, for example, boasted a 15-member navigation team. Although neither Esposito nor Graat agreed to speak with me, Sam Thurman told me that they were very much overworked.

The first trajectory correction manoeuvre (TCM-1) took place ten days after launch. It corrected a deliberate mis-aim in the launch trajectory – a mis-aim whose purpose was to ensure that the third-stage booster did not strike Mars and contaminate the planet with terrestrial germs. The manoeuvre involved an elaborate sequence of operations. First, the solar array was folded and locked against the spacecraft body to protect it from damage, then the entire spacecraft was rotated so that the firing of its aft-pointing thrusters would deflect the craft’s trajectory in the right direction, and then the thrusters were fired for a few minutes to achieve the correct trajectory. Finally, the spacecraft was rotated back into its flight orientation and the solar panels were deployed once more. A second, much smaller trajectory manoeuvre (TCM-2) was performed on January 26, 1999, and it, too, went according to plan.

About every 17 hours during the flight, the spacecraft automatically performed angular momentum desaturation (AMD) procedures, firing its thrusters for a few seconds to allow the reaction wheels to be decelerated. The navigators had not been expecting the AMD events to occur so frequently, because when they came onto the job they were not familiar with the Orbiter and they did not realise that its asymmetrical design would cause an increased tendency to spin under the influence of solar radiation.

During the first four months of the flight, the navigators did not use the Small Forces software to calculate the effects of the AMDs on the spacecraft’s trajectory. This was because the software not only contained the units error (which no one was aware of), but also some other bugs that had come to light. Because the tiny effects of the AMD events would only really be important for the final approach to Mars and orbital insertion, the navigators simply did without the output of the Small Forces software, planning to incorporate the data at a later time.

Finally, in mid-April, the ground software was delivered and put into operation. Now the effects of the AMD events (including those that had already taken place) were incorporated into the navigational calculations. But for each AMD event the software told the navigators that the spacecraft had been deflected by an amount that was nearly five times smaller than what had actually occurred, thanks to the poison pill that was the units error. Still, each individual navigational solution looked good, because the error was in an unobservable dimension perpendicular to the line of sight.

Only over time, as more and more solutions were calculated along the spacecraft’s curving path, did Graat become aware that the individual solutions didn’t quite mesh together to form a coherent trajectory. And calculations of the spacecraft’s current position that were derived from different data sets (for example, those based on range or Doppler measurements, or those that were based on different parts of the spacecraft’s trajectory) gave a fuzzy cluster of solutions instead of a single, unanimous answer.

Graat discussed this navigational problem with the leader of the navigational team, Pat Esposito. According to John Casani, the problem should have been entered as a formal written record known as an Incident, Surprise, Anomaly form – or ISA – which guarantees that a problem is followed up to a satisfactory resolution, but that’s not what happened. ‘The navigator here at JPL sent an email message to someone at Lockheed Martin, saying, “Take a look at this, there’s something funny going on that we don’t understand.” That never got entered into the formal record, which is our normal practice. So someone received this at the Lockheed end and said, “I’m going to work on this,” and then he got some other task that came along that either he thought was a higher priority, or his boss thought was a higher priority, and he got deflected, and this problem that was communicated to him by email just fell off the table, so to speak. If the form had been filled out, that could not have happened.’

Sam Thurman clued me in to the ‘other task’ that got higher priority. A serious incident occurred during the third trajectory correction manoeuvre, which was performed on July 23. Although the procedure for TCM-3 was the same as for TCM-1 and TCM-2, the process of retracting and locking the solar panels in preparation for the burn went awry. This procedure involved rotating the panels around a ball joint, using a gimbal drive. ‘There are devices on this gimbal that read out its angular position,’ he said, ‘and there were some calibration errors on those things, so the solar array scraped up against the side of the spacecraft and nearly got jammed in the stowed position. That put the spacecraft into “safe mode”: when it tried to un-stow after the manoeuvre, the array wouldn’t move, so the software stopped and called up the ground and said, “Hey, I’m trying to move this and it’s not moving; there’s something wrong.” So we spent most of the month of the approach phase scrambling to try to resolve this problem with the gimbal drive – because we knew that when we got to orbit insertion we had to have the solar array in the stowed position when we fired the main engine. The support that held the array wasn’t strong enough to take that force without the array being stowed. So we knew following TCM-3 that we had a problem we must fix or orbit insertion would fail. And that was very scary – that took a hell of a lot of effort from the team, the spacecraft team [at Lockheed Martin] in particular. So I think the navigation team’s problem was they were calling up the spacecraft team, saying, “Gee, we’re seeing this funny stuff, can you help us work the Small Forces modelling and try to understand it?” And they said, “Oh my God, we’ve got this huge other problem that could end the mission if we don’t fix it in the next two weeks.”’

With TCM-3 out of the way, most of the team members turned their attention to solving the problem of the balky gimbal drive while the navigators prepared for TCM-4 – the last scheduled course correction. This was planned for September 15, just eight days before arrival at Mars. TCM-4 was the really crucial manoeuvre: it had to leave the spacecraft aiming for a point of closest approach to Mars (the point known as first periapse) that was about 150 to 200 kilometres above the planet’s surface. Much higher than that, and the ensuing aerobraking procedure would take too long for the Orbiter to be in place by the time the Lander arrived on December 3. Much lower, and the spacecraft might be damaged by frictional heating in the planet’s outer atmosphere; the lowest survivable altitude was thought to be about 80 kilometres. Previous missions had achieved their preset altitudes with extraordinary precision, sometimes missing by as little as four kilometres – not bad marksmanship after a 400 million kilometre voyage. But the fuzzy navigational solutions made such a precise result unlikely with Mars Climate Orbiter. In fact, the solutions obtained by use of Doppler measurements and those obtained by range measurements were predicting fly-by altitudes that differed by tens of kilometres. No one knew which set of calculations was more accurate, so it wasn’t clear exactly how large the TCM-4 correction should be. The team decided to aim for an altitude of 226 kilometres, a height that left considerable leeway in case the spacecraft was coming in lower than the navigators realised.

By September 15, the spacecraft team had the gimbal-drive problem fixed, and TCM-4 went smoothly: the solar panels stowed themselves and redeployed without incident, and the spacecraft did not go into safe mode. Now, with arrival at Mars just a week away, most of the spacecraft team turned their attention to making preparations for the aerobraking phase that would follow insertion into Mars orbit. These preparations were far behind schedule on account of the gimbal problem.

Meanwhile the navigators worked almost non-stop to refine their trajectory calculations. As the days passed, the predicted altitude of the first periapse gradually decreased to about 150 kilometres – a safe altitude, but only if their prediction was correct to within a few tens of kilometres. During the final two or three days before arrival, the navigators got a serious case of cold feet, and they brought up the possibility of making yet another, fifth course correction to raise the spacecraft’s fly-by to a safer altitude.

This was the last thing that Sam Thurman, the flight operations manager, wanted to hear. The possibility of conducting a TCM-5 was written into the mission’s contingency plans. However, the optional TCM-5 was mainly planned for a different circumstance, namely for a situation in which the second periapse (on the first aerobraking orbit) would be at the wrong altitude. Reprogramming a TCM-5 to alter the first periapse altitude would be a major task for a relatively small spacecraft team, especially when it was so far behind on its preparations for the aerobraking phase.

‘I remember that the nav team chief was nervous,’ Thurman told me. ‘He said, “I think we ought to bump up [the altitude],” and other people [were] saying, “If we do that, it’s going to screw up our aerobraking preparations.”’ Thurman himself sided with the latter group. ‘[Doing a TCM-5] would have put at risk our ability to get into the correct science mapping orbit a few weeks later,’ he said. ‘It would have jeopardised our ability to transition people over to preparations for the Lander’s arrival. We needed the same people at Lockheed Martin to do that as well as to get aerobraking started.’

Given the lack of consensus, it was left to Thurman (perhaps in conjunction with other leaders of the mission) to make the decision, and they decided against a course correction. ‘I think that was an error of judgment,’ John Casani told me, ‘but that’s easy to say.’

Up until about noon on the day before orbit insertion, it looked as if Thurman had made the right judgment call, because the navigational solutions began to cluster more closely together, suggesting that the 150-kilometre periapse prediction was correct. Unfortunately, the solutions were clustering tightly around an incorrect value. At about 1am the following morning, which was an hour before the spacecraft’s arrival at Mars, a new set of solutions became available, refined by observation of the ever-increasing pull of Mars’s gravity. These solutions now predicted that the spacecraft would fly by the planet at an altitude of 110 kilometres – a distance that was very much on the low side, though still probably survivable. Nothing could be done now except wait and pray.

Two groups of scientists and engineers had gathered at JPL and Lockheed Martin for the critical orbital insertion event. NASA’s cable channel began broadcasting the event live. Probably only a small group of insomniac space buffs watched the live broadcast, but several other TV channels sent cameramen and reporters to tape the event. Thurman and seven other JPL team members worked at computer terminals in a glass-lined control room, while the media and other curious onlookers peered in through the glass. A similar scene took place at Lockheed Martin’s mission support centre in Denver, though the working group there was much larger and the media representatives fewer. The dress code was shirtsleeves and jeans at the science lab, ties and slacks at the contractor.

At 1:41am in ‘Earth receive time’ – 11 minutes after the corresponding events at Mars – signals arrived indicating that the Mars Climate Orbiter had begun retracting and stowing its solar panels in preparation for the orbital insertion burn. The process took eight minutes. Then the spacecraft began turning 180 degrees so as to convert its main rocket engine into a retrorocket. This took six minutes. At this point, the spacecraft was heading over Mars’s north pole, and NASA-TV showed an animation of the gracefully gyrating spacecraft as it cruised over the white expanses of frozen carbon dioxide that marked the pole. At 1:56, pyros (explosive devices) fired to pressurise the fuel and oxidiser tanks. As planned, the spacecraft stopped transmitting data: the only signal still being received at JPL was the single-frequency ‘carrier’ signal. The tension was visible in the faces of the eight men in the control room.

Then at 2:01am came the voice of Lockheed Martin systems engineer Kelly Irish: ‘Real-time Doppler indicates main engine burn.’ In other words, engineers had seen that the frequency of the carrier signal was beginning to rise as the spacecraft decelerated under the influence of the retrorocket. It was a moment of exuberant relief in the JPL control room.

The group could follow the slowing of the spacecraft for the first four minutes of the planned 16-minute burn, but at 2:04 and 56 seconds the spacecraft’s carrier signal began to break up, and six seconds later it disappeared completely. This was the expected effect of the spacecraft passing behind the planet into its radio shadow. ‘At this time we are in our occultation period,’ announced Irish.

The spacecraft actually went into occultation 52 seconds before the event was expected. Less than a minute early – surely that tiny error could be of no significance after a nine-month voyage? Irish’s voice didn’t betray any surprise, and the NASA-TV commentators continued their chatter about the details of aerobraking. But to Thurman, it was a bad omen. He started staring intensely at a sheet of paper that he was holding in his left hand, then glancing up at the screen in front of him.

‘I remember I had a plot next to me that our mission engineer had made that allowed us to correlate the time that the spacecraft headed behind the planet. He’d come up with a clever scheme where, by looking at the time of loss of signal, we could get a guesstimate of the actual altitude. The lower the altitude, the earlier the loss of signal. I had this very quick way of watching data on the screen, seeing the loss-of-signal time, and then looking at the data sheet for the guesstimate. I remember we had the signal – it was very early, and I remember thinking, Uh-oh, this is not good. That really – yeah, my anxiety level shot up, I tried to hold my composure since there were seven guys with cameras in front of me.’

The NASA-TV commentator eventually seemed to pick up on the early occultation because, without mentioning that it was early, he attempted to explain it away. ‘The signal can be refracted by the atmosphere,’ he said. ‘The accuracy is not as deterministic as we’d like.’ Meanwhile, the eight men in the glass booth sat or stood there, fidgeting, looking at one another’s screens, well aware that they could do nothing but wait 21 minutes for the scheduled re-emergence of the spacecraft from occultation. By that time the spacecraft should have terminated its burn and turned so that its signals could be picked up by the 70-metre Tidbinbilla deep space antenna near Canberra, Australia.

At 2:26am, the predicted time for the spacecraft to re-emerge from occultation, there was dead silence in the control room, and even the television commentator stopped talking. The silence went on for minutes, while the men in the glass booth became increasingly fidgety, staring at their watches or at the computer screens, standing up and sitting down again for no apparent reason, or glancing into one another’s tired, anxious faces. Sam Thurman kept adjusting his wedding ring, as if its precise alignment was crucial for the spacecraft’s destiny. ‘Waiting for acquisition of signal,’ said the disembodied voice of Kelly Irish. And a minute later: ‘Still haven’t seen anything; stand by.’

After a few minutes, Thurman stood up and began to talk with the other members of the JPL team, including the Mars ’98 project manager, Richard Cook. His words weren’t audible on the TV broadcast, so I asked him whether he had been telling them that the spacecraft was probably lost. ‘I don’t say things like that,’ he said. ‘That can be devastating to a team’s morale. You always have to hope for the best and do everything you can to make it come about. And there’s a little bit of lore called the 24-hour rule, which is, when something seemingly bad or scary happens, don’t overreact for a minimum of 24 hours, because more often than not what actually happened and what you need to do about it might be different from what you think at first. So I tried to reach inside and gather the presence to say – getting emotional doesn’t serve any purpose, so you have to say, “Here are the options. The vehicle survived and may be in safe mode; it might be tumbling; it might have ended up in a much lower or higher orbit than expected because of an overburn or an underburn.” Remember, the ground stations need fairly accurate predictions of what the transmitter frequency is going to be in order to tune their receivers properly to hear a spacecraft 120 million miles away, orbiting another planet. So it could be the spacecraft was there and transmitting fine, and the set of predicts we used to drive the antennas were off.’
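
The 11-minute lag between events at Mars and ‘Earth receive time’ follows directly from the distance Thurman quotes. As a rough check, and only a sketch, the short Python calculation below converts 120 million miles into a one-way signal delay. The function name is invented for illustration, and the distance is the rounded figure from the quotation rather than a precise ephemeris value.

```python
# Speed of light in vacuum, metres per second (exact by definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458
# International mile expressed in metres (exact by definition).
METRES_PER_MILE = 1609.344


def one_way_light_time_minutes(distance_miles: float) -> float:
    """Return the one-way radio signal delay, in minutes, over a distance in miles."""
    distance_m = distance_miles * METRES_PER_MILE
    return distance_m / SPEED_OF_LIGHT_M_PER_S / 60.0


# Assumption: the rounded 120-million-mile figure quoted by Thurman.
delay = one_way_light_time_minutes(120e6)
print(f"One-way light time: {delay:.1f} minutes")
```

The result comes out a little under 11 minutes, which is why everything the team saw on their screens had already happened at Mars some 11 minutes earlier.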

About 30 minutes later, Richard Cook came out of the glass booth and spoke with NASA-TV. He outlined the possible factors that might have caused the spacecraft to go into safe mode. ‘At this point, we’re still very confident that we’re in orbit at Mars,’ he said, ‘and we’re going to see the spacecraft signal sometime in the next few hours.’ With that, NASA wrapped up its live television coverage for the night.

Very quickly, however, devastating news came in. During the occultation period, the navigators had been working on a revised prediction of the spacecraft’s periapse altitude, based on data received shortly before orbital insertion. The results weren’t good. ‘They came back in and said, “Oh my God, this thing was 60 or 70 kilometres lower than we thought it was going to be,”’ said Thurman. ‘And that’s when we thought, “Oh boy, if it went that deep, it must have fried.”’

Engineers continued searching for a signal from the Mars Climate Orbiter for 48 hours, but it was largely a formality. Navigational errors, it now seemed clear, had led to the spacecraft dipping too deep into the Martian atmosphere during orbital insertion. Its exact fate was a matter for speculation. It may have broken up or exploded, scattering debris over the Martian surface – in which case there is some concern that terrestrial germs may have survived the heat of atmospheric entry and contaminated the surface. Alternatively, frictional heating may have caused the retrorocket to cut out prematurely – in which case the spacecraft may not have been captured into orbit around the planet at all and would have continued forever on its lonely solar orbit, perhaps to be seen again by some distant generation of Earthlings or Martians.

JPL’s John Casani was quickly appointed to head an internal investigation of the mishap. There was considerable time pressure, because the Orbiter’s companion spacecraft, Mars Polar Lander, was fast approaching the planet and was due to arrive on December 3. There were many similarities between the two spacecraft, and investigators wanted to ensure that whatever doomed the Orbiter would not also affect the Lander.

All the focus was on the navigational problems. Every piece of navigational software was scrutinised line by line, and on September 29 an engineer identified the crucial error: the lack of a conversion factor to change pounds of force to newtons in the Small Forces software. Once that error was identified, all the problems with navigation were readily explained.
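
To see how much damage a missing conversion factor of this kind can do, consider the minimal Python sketch below. It is purely illustrative: the `Force` class and its method names are hypothetical and bear no relation to the actual Small Forces code, and the thruster value is invented. The only hard fact it relies on is that one pound of force equals about 4.448 newtons.

```python
from dataclasses import dataclass

# One pound-force expressed in newtons (standard definition).
LBF_TO_NEWTON = 4.4482216152605


@dataclass(frozen=True)
class Force:
    """A force stored internally in newtons, so the unit is never ambiguous."""

    newtons: float

    @classmethod
    def from_pounds_force(cls, value: float) -> "Force":
        # Apply the conversion factor that the faulty software lacked.
        return cls(value * LBF_TO_NEWTON)

    @classmethod
    def from_newtons(cls, value: float) -> "Force":
        return cls(value)


# Hypothetical thruster force, produced by software working in pounds of force.
reported_value = 1.0

correct = Force.from_pounds_force(reported_value)  # converted properly
misread = Force.from_newtons(reported_value)       # same number, wrongly read as newtons

underestimate_factor = correct.newtons / misread.newtons
print(f"Each force is understated by a factor of {underestimate_factor:.2f}")
```

A number handed over in pounds of force but read as newtons understates every force by a factor of about 4.45. Forcing every value through a constructor that names its unit, as in this sketch, is one common way to make such a mismatch hard to commit silently.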

To my knowledge, Casani’s report only circulated internally and was never published. ‘In my subjective opinion, it focused too much on all the technical details of what didn’t get done, or didn’t get done well enough,’ commented Thurman, ‘and too little on how the lab got itself in that position to begin with.’ That deficiency was quickly made up for by a second investigation, headed by Art Stephenson, Director of NASA’s Marshall Space Flight Center in Alabama. While agreeing with Casani that the units problem was the root cause of the mishap, Stephenson’s report put far more emphasis on the numerous contributing factors – inadequate training, testing and communication; the failure to resolve the anomalous navigational solutions or to report the problem through the proper channels; the failure to execute TCM-5, and so on – that together amounted to a systems-engineering or even programmatic failure. While not exactly criticising Goldin’s ‘Faster, Better, Cheaper’ philosophy, Stephenson urged that it be practised under a set of guidelines that he summarised as ‘Mission Success First.’

Like the Mars Climate Orbiter, the Polar Lander was plagued with problems during its journey to Mars, but navigational errors were not among them; the navigational software did not contain the units error that destroyed the Orbiter. By the time the Lander reached Mars, it looked like all the kinks had been ironed out, but after it began its descent through the atmosphere nothing more was heard from it. Once again, Richard Cook was delegated to give the waiting press an upbeat assessment of the situation. ‘I’m very confident the Lander survived the descent,’ he said, five hours after it should have been broadcasting its first images. The reason for the Lander’s demise was never established with certainty, but the most probable cause was a set of faulty sensors in the spacecraft’s landing legs. These, investigators believe, told the Lander’s computer that the spacecraft had landed while in fact it was still airborne. This caused the Lander to switch off its retrorocket prematurely and fall to the surface at a speed sufficient to destroy the entire spacecraft. Again, then, the root cause of the loss was attributed to a failure at Lockheed Martin, the manufacturer of both spacecraft.

The loss of the Lander, coming so quickly after that of the Orbiter, triggered an investigation that was independent of NASA. This one was headed by Tom Young, former executive vice president of Lockheed Martin – an odd choice, perhaps, given that anyone associated with the company might be expected to lay blame somewhere outside the company’s purview. In fact, Young’s report laid most of the blame for the failures on niggardly federal funding for the missions: they were underfunded by at least 30 per cent, Young wrote.

Casani and Thurman concurred. ‘That was the problem,’ said Casani. ‘The only way that you could get the costs down was with less people.’ ‘Faster, Better, Cheaper was raging like influenza through the agency,’ said Thurman. ‘NASA and JPL should not have attempted to do two missions with that ambition and with that kind of cost and schedule. To me, that’s the fundamental root cause.’

As the boss of NASA and a political appointee, Dan Goldin took a different line, accusing Lockheed Martin of having underbid to win the contract for the missions. ‘I think in this circumstance that the Lockheed Martin team was overly aggressive, because their focus was on the winning,’ he said in a PBS interview. ‘The Lockheed Martin Company did not pay attention, and I know it sounds like a paradox, but it was more important to them to win for today, and they didn’t think of the long-term future or the reputation of their company.’

In any event, Lockheed Martin Astronautics went through a tough period after the Mars ’98 failures. The company suffered financial losses, NASA cancelled a contract for a follow-up mission and many astronautical engineers left the company, at least temporarily. But it rebounded, and in 2001 Lockheed Martin was awarded the contract for a successor to Mars Global Surveyor – a spacecraft named Mars Reconnaissance Orbiter. This was launched in 2005 and successfully went into orbit around Mars the following year. Orbital insertion was aided by a new technology, in which photographs of Mars’s two moons were used to get a precise fix on the spacecraft’s position as it approached the planet. As if to celebrate this success, in September of 2006 Lockheed Martin was awarded a multibillion-dollar contract, this time to build Orion, the successor to the Space Shuttle.

As to the mantras of ‘Faster, Better, Cheaper’ and ‘Mission Success First’, Steve Jolly expressed some scepticism. ‘I think what we do is avoid using any branding like that,’ he said. ‘Not because it’s not fashionable anymore, but because slogans can sometimes hurt you. Now the real approach is, what’s the right design to accomplish the objectives that we’re being asked to do, and what’s the doable cost associated with that? What we’re finding is that we can still leverage all those technologies and approaches that were developed in the nineties to pull off what we call “best-value missions” for the government.’

Although Lockheed Martin lost some employees after the Mars ’98 failure, the hapless young engineer who made the units error was not fired; in fact, he’s still with the company. ‘He has a lead position; he’s in the critical path for all the flying missions that we have,’ said Jolly. ‘You know, that’s the noble way to do it. Engineers do not walk in and say, “I’m going to make a mistake today.”’

The Mars Climate Orbiter is the only US space mission to have failed on account of a confusion of units, but several others have failed or gone seriously wrong on account of similarly ‘dumb errors’ in data handling. In April 1999, for example, a Titan IVB rocket carrying a military satellite failed to reach orbit after its upper stage lost stability and broke up. The mishap was caused by the misplacement of a single decimal point in the control-system software.

Confusion of units has been the cause of many mishaps in other fields of science. Medicine and medical research have been particularly susceptible, most commonly with respect to drug dosages. A tragic example occurred in Ottawa, Canada, in 2002. Researchers at the Children’s Hospital of Eastern Ontario were testing the use of interleukin-2, an immune system booster, in the treatment of a childhood cancer known as neuroblastoma. The first patient, one-year-old Ryan Carroll, became severely ill during the drug treatment, but he survived. Rather than halting the trial, the researchers proceeded to treat another patient, four-year-old Ryan Lucio, with the same drug regimen. After four injections, he suffered multiple organ failure and died. At this point, his doctors went back and examined his treatment plan, and they discovered a terrible mistake. Instead of calculating the dose of interleukin-2 in units of micrograms per square metre of body area as they should have done, they had calculated it in micrograms per kilogram of body weight, which meant that they had given him an approximately 25-fold overdose. They soon realised that Ryan Carroll had been overdosed in the same way.
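
The size of that overdose is easy to reproduce with a little arithmetic. The Python sketch below is an illustration only: the weight and height belong to a hypothetical four-year-old, not to either patient, and the Mosteller formula is simply one widely used estimate of body surface area, chosen here for convenience (the trial's own method is not known to me). Because the same number of micrograms is multiplied either by kilograms or by square metres, the overdose factor is just the ratio of the two.

```python
import math


def mosteller_bsa_m2(height_cm: float, weight_kg: float) -> float:
    """Estimate body surface area in square metres using the Mosteller formula."""
    return math.sqrt(height_cm * weight_kg / 3600.0)


# Assumed measurements for a hypothetical four-year-old (not patient data).
weight_kg = 16.0
height_cm = 100.0

bsa_m2 = mosteller_bsa_m2(height_cm, weight_kg)

# The same per-unit dose applied per kilogram instead of per square metre
# inflates the total by weight / body surface area.
overdose_factor = weight_kg / bsa_m2

print(f"Body surface area: {bsa_m2:.2f} square metres")
print(f"Overdose factor:   {overdose_factor:.0f}-fold")
```

For a child of this assumed size the factor comes out at roughly 24, in line with the approximately 25-fold overdose described above.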

The US Food and Drug Administration, which had approved the trial, went ballistic and posted an excoriating critique of the trial’s principal investigator, Dr Jacqueline Halton, on its website. Health Canada, on the other hand, expressed little, if any, criticism and quickly issued its approval for the trial to continue. It may be that Health Canada’s mild-mannered approach reflected guilt that it had allowed the two Ryans to be experimented on at a time when the proper application and safety assurances had not been provided by the researchers, contrary to Canadian law.