Knocking on Heaven's Door: How Physics and Scientific Thinking Illuminate the Universe and the Modern World - Lisa Randall (2011)
Part III. MACHINERY, MEASUREMENTS, AND PROBABILITY
Chapter 12. MEASUREMENT AND UNCERTAINTY
Familiarity and comfort with statistics and probability help when evaluating scientific measurements, not to mention many of the difficult issues of today’s complex world. I was reminded of the virtue of probabilistic reasoning when, a few years back, a friend was frustrated by my “I don’t know” response to his question about whether or not I planned to attend an event the following evening. Fortunately for me, he was a gambler and mathematically inclined. So instead of exasperatingly insisting on a definite reply, he asked me to tell him the odds. To my surprise, I found that question a lot simpler to deal with. Even though the probability estimate I gave him was only a rough guess, it more closely reflected my competing considerations and uncertainties than a definite yes or no reply would have done. In the end, it felt like a more honest response.
Since then I’ve tried this probabilistic approach out on friends and colleagues when they didn’t think they could reply to a question. I’ve found that most people—scientists or not—have strong but not irrevocable opinions that they frequently feel more comfortable expressing probabilistically. Someone might not know if he wants to go to the baseball game on the Thursday three weeks from now. But if he knows that he likes baseball and doesn’t think he has any work trips coming up, yet hesitates because it’s during the week, he might agree he is 80 percent likely to, even if he can’t give a definite yes. Although just an estimate, this probability—even one he makes up on the spot—more accurately reflects his true expectation.
In our conversation about science and how scientists operate, the screenwriter and director Mark Vicente observed how he was struck by the way that scientists hesitate to make definite unqualified statements of the sort most other people do. Scientists aren’t necessarily always the most articulate, but they aim to state precisely what they do and don’t know or understand, at least when speaking about their field of expertise. So they rarely just say yes or no, since such an answer doesn’t accurately reflect the full range of possibilities. Instead, they speak in terms of probabilities or qualified statements. Ironically, this difference in language frequently leads people to misinterpret or underplay scientists’ claims. Despite the improved precision that scientists aim for, nonexperts don’t necessarily know how to weigh their statements—since anyone other than a scientist with as much evidence in support of their thesis wouldn’t hesitate to say something more definite. But scientists’ lack of 100 percent certainty doesn’t reflect an absence of knowledge. It’s simply a consequence of the uncertainties intrinsic to any measurement—a topic we’ll now explore. Probabilistic thinking helps clarify the meaning of data and facts, and allows for better-informed decisions. In this chapter, we’ll reflect on what measurements tell us and explore why probabilistic statements more accurately reflect the state of knowledge—scientific or otherwise—at any given time.
Harvard recently completed a curricular review to try and determine the essential elements of a liberal education. One of the categories the faculty considered and discussed as part of a science requirement was “empirical reasoning.” The teaching proposal suggested the university’s purpose should be to “teach how to gather and assess empirical data, weigh evidence, understand estimates of probabilities, draw inferences from the data when available [so far, so good], and also to recognize when an issue cannot be settled on the basis of the available evidence.”
The proposed wording of the teaching requirement—later clarified—was well intentioned, but it belied a fundamental misunderstanding of how measurements work. Science generally settles issues with some degree of probability. Of course we can achieve high confidence in any particular idea or observation and use science to make sound judgments. But only infrequently can anyone absolutely settle an issue—scientific or otherwise—on the basis of evidence. We can collect enough data to trust causal relationships and even to make incredibly precise predictions, but we can generally do it only probabilistically. As Chapter 1 discussed, uncertainty—however small—allows for the potential existence of interesting new phenomena that remain to be discovered. Rarely is anything 100 percent certain, and no theory or hypotheses will be guaranteed to apply under conditions where tests have not yet been performed.
Phenomena can only ever be demonstrated with a certain degree of precision in a set domain of validity where they can be tested. Measurements always have some probabilistic component. Many science measurements rely on the assumption that an underlying reality exists that we can uncover with sufficiently precise and accurate measurements. We use measurements to find this underlying reality as well as we can (or as well as necessary for our purposes). This then permits statements such as that an interval centered on a collection of measurements contains the true value with 95 percent probability. In that case, we might colloquially say we know the value with 95 percent confidence. Such probabilities tell us the reliability of any particular measurement and the full range of possibilities and implications. You can’t fully understand a measurement without knowing and evaluating its associated uncertainties.
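To make the 95 percent statement concrete, here is a minimal sketch in Python. It assumes readings that scatter around a true value with Gaussian noise, and it uses the common normal approximation (mean plus or minus 1.96 standard errors). The names `TRUE_VALUE`, `NOISE`, `interval_95`, and `coverage` are invented for this illustration. The sketch repeats an imaginary experiment many times and counts how often the interval contains the true value.

```python
import math
import random
import statistics

TRUE_VALUE = 10.0   # hypothetical "underlying reality"
NOISE = 0.5         # hypothetical scatter of a single reading
Z_95 = 1.96         # normal-approximation multiplier for 95 percent

def interval_95(readings):
    """Return (low, high) of an approximate 95 percent interval for the mean."""
    mean = statistics.fmean(readings)
    std_error = statistics.stdev(readings) / math.sqrt(len(readings))
    return mean - Z_95 * std_error, mean + Z_95 * std_error

def coverage(n_experiments, n_readings, rng):
    """Fraction of repeated experiments whose interval contains TRUE_VALUE."""
    hits = 0
    for _ in range(n_experiments):
        readings = [rng.gauss(TRUE_VALUE, NOISE) for _ in range(n_readings)]
        low, high = interval_95(readings)
        hits += low <= TRUE_VALUE <= high
    return hits / n_experiments

rng = random.Random(0)  # fixed seed so the run is repeatable
fraction = coverage(n_experiments=2000, n_readings=100, rng=rng)
print(f"intervals containing the true value: {fraction:.1%}")
```

Under these assumptions the printed fraction should land close to 95 percent, which is what the confidence statement promises: not certainty about any one interval, but a known success rate for the procedure.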
One source of uncertainty is the absence of infinitely precise measuring instruments. Such a precise measurement would require a device calibrated with an infinite number of decimal places. The measured value would have an infinite number of carefully measured numbers after the decimal place. Experimenters can’t make such measurements—they can only calibrate their tools to make them as accurate as possible with available technology, just as the astronomer Tycho Brahe did so expertly more than four centuries ago. Increasingly advanced technology results in increasingly precise measuring devices. Even so, measurements will never achieve infinite accuracy, despite the many advances that have occurred over time. Some systematic uncertainty,49 characteristic of the measuring device itself, will always remain.
Uncertainty doesn’t mean that scientists treat all options or statements equally (though news reports frequently make this mistake). Only rarely are probabilities 50 percent. But they do mean that scientists (or anyone aiming for complete accuracy) will make statements that tell what has been measured and what it implies in a probabilistic way, even when those probabilities are very high.
When scientists and wordsmiths are extremely careful, they use the words precision and accuracy differently. An apparatus is precise if, when you repeat a measurement of a single quantity, the values you record won’t differ from each other very much. Precision is a measure of the degree of variability. If the result of repeating a measurement doesn’t vary a lot, the measurements are precise. Because more precisely measured values span a smaller range, the average value will more rapidly converge if you make repeated measurements.
Accuracy, on the other hand, tells you how close your average measurement is to the correct result. In other words, it tells whether there is bias in a measuring apparatus. Technically speaking, an intrinsic error in your measuring apparatus doesn’t reduce its precision—you would make the same mistake every time—though it would certainly reduce your accuracy. Systematic uncertainty refers to the unbeatable lack of accuracy that is intrinsic to the measuring devices themselves.
Nonetheless, in many situations, even if you could construct a perfect measuring instrument, you would still need to make many measurements to get a correct result. That is because the other source of uncertainty50 is statistical, which means that measurements usually need to be repeated many times before you can trust the result. Even an accurate apparatus won’t necessarily give the right value for any particular measurement. But the average will converge to the right answer. Systematic uncertainties control the accuracy of a measurement while statistical uncertainty affects its precision. Good scientific studies take both into account, and measurements are done as carefully as possible on as large a sample as is feasible. Ideally, you want your measurements to be both accurate and precise so that the expected absolute error is small and you trust the values you find. This means you want them to be within as narrow a range as possible (precision) and you want them to converge to the correct number (accuracy).
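The distinction between precision and accuracy can be illustrated with a small simulation. The sketch below is only a toy model: it assumes an instrument whose readings equal the true value plus a fixed offset (the bias, standing in for systematic error) plus Gaussian scatter (standing in for statistical error). The function `simulate` and all the numbers are hypothetical.

```python
import random
import statistics

def simulate(true_value, bias, spread, n, rng):
    """n readings from a toy instrument: fixed offset (bias) plus random scatter."""
    return [rng.gauss(true_value + bias, spread) for _ in range(n)]

rng = random.Random(1)  # fixed seed so the run is repeatable
TRUE = 100.0            # hypothetical correct result

# Small scatter but a built-in offset: precise, not accurate.
precise_but_biased = simulate(TRUE, bias=2.0, spread=0.1, n=10_000, rng=rng)
# No offset but large scatter: accurate on average, not precise.
accurate_but_noisy = simulate(TRUE, bias=0.0, spread=5.0, n=10_000, rng=rng)

for name, data in [("precise but biased", precise_but_biased),
                   ("accurate but noisy", accurate_but_noisy)]:
    print(f"{name}: mean = {statistics.fmean(data):.2f}, "
          f"spread = {statistics.stdev(data):.2f}")
```

In this toy model, repeating the measurement shrinks the effect of the scatter on the average, but no amount of repetition removes the offset. The first instrument converges quickly, and to the wrong number.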
One familiar (and important) example where we can consider these notions is tests of drug efficacy. Doctors often won’t say or perhaps they don’t know the relevant statistics. Have you ever been frustrated by being told, “Sometimes this medicine works; sometimes it doesn’t”? Quite a bit of useful information is suppressed in this statement, which gives no idea of how often the drug works or how similar the population they tested it on is to you. This makes it very difficult to decide what to do. A more useful statement would tell us the fraction of times a drug or procedure has worked on a patient with similar age and fitness level. Even in the cases when the doctors themselves don’t understand statistics, they can almost certainly provide some data or information.
In fairness, the heterogeneity of the population, with different individuals responding to drugs in different ways, makes determining how a medicine will work a complicated question. So let’s first consider a simpler case in which we can test on a single individual. Let’s use as an example the procedure for testing whether or not aspirin helps relieve your headache.
The way to figure this out seems pretty easy: take an aspirin and see if it works. But it’s a little more complicated than that. Even if you get better, how do you know it was the aspirin that helped? To ascertain whether or not it really worked—that is, whether your headache was less painful or went away faster than without the drug—you would have to be able to compare how you feel with and without the drug. However, since you either took aspirin or you didn’t, a single measurement isn’t enough to tell you the answer you want.
The way to tell is to do the test many times. Each time you have a headache, flip a coin to decide whether to take an aspirin or not and record the result. After you do this enough, you can average out over all the different types of headaches you had and the varying circumstances in which you had them (maybe they go away faster when you’re not so sleepy) and use your statistics to find the right result. Presumably there is no bias in your measurement since you flipped a coin to decide and the population sample you used was just yourself so your result will correctly converge with enough self-imposed tests.
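The coin-flip procedure can be sketched as a simulation. Everything here is assumed for illustration: headaches that would naturally last between two and six hours, and an aspirin that shortens each one by a fixed `TRUE_EFFECT_HOURS`. The point is only to show how the estimate settles down as the number of recorded headaches grows.

```python
import random
import statistics

TRUE_EFFECT_HOURS = 1.0  # hypothetical: aspirin shortens a headache by one hour

def self_trial(n_headaches, rng):
    """Estimate the aspirin effect from coin-flip assignments on one person."""
    with_aspirin, without_aspirin = [], []
    for _ in range(n_headaches):
        # Headaches differ from one occasion to the next (hypothetical range).
        natural_duration = rng.uniform(2.0, 6.0)
        if rng.random() < 0.5:  # the coin flip decides, not the headache
            with_aspirin.append(natural_duration - TRUE_EFFECT_HOURS)
        else:
            without_aspirin.append(natural_duration)
    # Difference of averages: how much shorter the treated headaches were.
    return statistics.fmean(without_aspirin) - statistics.fmean(with_aspirin)

rng = random.Random(2)  # fixed seed so the run is repeatable
for n in (20, 200, 2000):
    print(f"{n:5d} headaches -> estimated benefit {self_trial(n, rng):.2f} hours")
```

With only a handful of headaches the estimate can be far from the assumed one-hour benefit, since the natural variation among headaches swamps it. With enough trials the variation averages out, as the text describes.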
It would be nice to always be able to learn whether drugs worked with such a simple procedure. However, most drugs are treating more serious illnesses than headaches—perhaps even ones that lead to death. And many drugs have long-term effects, so you couldn’t do repeated short-term trials on a single individual even if you wanted to.
So usually when biologists or doctors test how well a drug works, they don’t simply study a single individual, even though for scientific purposes at least they would prefer to do so. They then have to contend with the fact that people respond differently to the same drug. Any medicine produces a range of results, even when tested on a population with the same degree of severity of a disease. So the best scientists can do in most cases is to design studies for a population as similar as possible to any given individual they are deciding whether or not to give the drug to. In reality, however, most doctors don’t design the studies themselves, so similarity to their patient is hard for them to guarantee.
Doctors might want instead to try to use pre-existing studies where no one did a carefully designed trial but the results were based simply on observations of existing populations, such as the members of an HMO. They would then face the challenge of making the correct interpretation. With such studies, it can be difficult to ensure that the relevant measurement establishes causality and not just association or correlation. For example, someone might mistakenly conclude that yellow fingers cause lung cancer because they noticed many lung cancer patients have yellow fingers.
That’s why scientists prefer studies in which treatments or exposures are randomly assigned. For example, a study in which people take a drug based on a coin toss will be less dependent on the population sample since whether or not any patient receives treatment depends only on the random outcome of a coin flip. Similarly, a randomized study could in principle teach about the relationships among smoking, lung cancer, and yellow fingers. If you were to randomly assign members of a group to either smoke or refrain from smoking, you would determine that smoking was at least one underlying factor responsible for both yellow fingers and lung cancer in the patients you observed, whether or not one was the cause of the other. Of course, this particular study would be unethical.
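The yellow-fingers example can be made concrete with a toy simulation. The sketch assumes, purely for illustration, that smoking raises the chance of both yellow fingers and lung cancer and that the two have no direct influence on each other. All probabilities are invented. It first shows the misleading association in observational data, then what a (hypothetical, and unethical) randomized assignment would reveal.

```python
import random

# Hypothetical probabilities, chosen only for illustration.
P_YELLOW = {True: 0.60, False: 0.05}  # chance of yellow fingers, by smoking status
P_CANCER = {True: 0.15, False: 0.01}  # chance of lung cancer, by smoking status

def outcome(smokes, rng):
    """Yellow fingers and cancer each depend on smoking, not on each other."""
    yellow = rng.random() < P_YELLOW[smokes]
    cancer = rng.random() < P_CANCER[smokes]
    return yellow, cancer

def rate(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)

rng = random.Random(3)  # fixed seed so the run is repeatable
N = 100_000

# Observational data: 30 percent smoke, but we record only fingers and cancer.
observed = [outcome(rng.random() < 0.30, rng) for _ in range(N)]
cancer_if_yellow = rate(c for y, c in observed if y)
cancer_if_not_yellow = rate(c for y, c in observed if not y)
print(f"observed: cancer rate {cancer_if_yellow:.1%} with yellow fingers, "
      f"{cancer_if_not_yellow:.1%} without")

# Randomized study: a coin flip, not personal choice, assigns smoking.
trial = []
for _ in range(N):
    smokes = rng.random() < 0.5
    trial.append((smokes, *outcome(smokes, rng)))
cancer_smokers = rate(c for s, y, c in trial if s)
cancer_nonsmokers = rate(c for s, y, c in trial if not s)
# Among assigned smokers, do yellow fingers still predict cancer?
cancer_smokers_yellow = rate(c for s, y, c in trial if s and y)
cancer_smokers_clean = rate(c for s, y, c in trial if s and not y)
print(f"randomized: cancer rate {cancer_smokers:.1%} for smokers, "
      f"{cancer_nonsmokers:.1%} for nonsmokers")
print(f"among smokers: {cancer_smokers_yellow:.1%} with yellow fingers, "
      f"{cancer_smokers_clean:.1%} without")
```

In the observational data, yellow fingers appear to go with cancer. Once smoking is assigned at random and accounted for, the apparent link between fingers and cancer should largely disappear in this toy model, because it was built with smoking as the only common cause.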
Scientists aim to simplify their systems as much as possible so as to isolate the specific phenomena they want to study. The choice of a well-defined population sample and an appropriate control group is essential to both the precision and accuracy of the result. With something as complicated as the effect of a drug on human biology, many factors enter simultaneously. The relevant question is then how reliable do the results need to be?
THE OBJECTIVE OF MEASUREMENTS
Measurements are never perfect. With scientific research—as with any decision—we have to determine an acceptable level of uncertainty. This allows us to move forward. For example, if you are taking a drug you hope will mitigate your nagging headache, you might be satisfied to try it even if it significantly helps the general population only 75 percent of the time (as long as the side effects are minimal). On the other hand, if a change in diet will reduce your already low likelihood of heart disease by a mere two percent of your existing risk, decreasing it from five percent to 4.9 percent, for example, that might not worry you enough to convince you to forgo your favorite Boston cream pie.
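The diet example turns on the difference between a relative and an absolute change in risk. The arithmetic can be spelled out in a few lines, using only the figures from the paragraph above (the variable names are just labels for this illustration):

```python
baseline_risk = 0.05        # five percent likelihood of heart disease
relative_reduction = 0.02   # a cut of two percent of the existing risk

new_risk = baseline_risk * (1 - relative_reduction)
absolute_reduction = baseline_risk - new_risk

print(f"new risk: {new_risk:.1%}")
print(f"absolute reduction: {absolute_reduction * 100:.1f} percentage points")
```

A two percent relative reduction sounds larger than it is: applied to a five percent baseline, it moves the risk by only a tenth of a percentage point.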
For public policy, decision points can be even less clear. Public opinion usually occupies a gray zone where people don’t necessarily agree on how accurately we should know something before changing laws or implementing restrictions. Many factors complicate the necessary calculations. As the previous chapter discussed, ambiguity in goals and methods makes cost-benefit analyses notoriously difficult, if not impossible, to reliably perform.
As New York Times columnist Nicholas Kristof wrote in arguing for prudence about potentially dangerous chemicals (BPA) in foods or containers, “Studies of BPA have raised alarm bells for decades, and the evidence is still complex and open to debate. That’s life: in the real world, regulatory decisions usually must be made with ambiguous and conflicting data.”51
None of these issues mean that we shouldn’t aim for quantitative evaluations of costs and benefits when assessing policy. But they do mean that we should be clear about what the assessments mean, how much they can vary according to assumptions or goals, and what the calculations have and have not taken into account. Cost-benefit analyses can be useful but they can also give a false sense of concreteness, certainty, and security that can lead to misguided applications in society.
Fortunately for physicists, the questions we ask are usually a lot simpler—at least to formulate—than they are for public policy. When we’re dealing with pure knowledge without an immediate eye to applications, we make different types of inquiries. Measurements with elementary particles are a lot simpler, at least in principle. All electrons are intrinsically the same. You have to worry about statistical and systematic error, but not the heterogeneity of a population. The behavior of one electron is representative of them all. But the same notions of statistical and systematic error apply, and scientists try to minimize these whenever feasible. However, the lengths to which they will go to accomplish this depend on the questions they want to answer.
Nonetheless, even in “simple” physics systems, given that measurements won’t ever be perfect, we need to decide the accuracy to aim for. At a practical level, this question is equivalent to asking how many times an experimenter should repeat a measurement and how precise he needs his measuring device to be. The answer is up to him. The acceptable level of uncertainty depends on the question he asks. Different goals require different degrees of accuracy and precision.
For example, atomic clocks measure time with stability of one in 10 trillion, but few measurements require such a precise knowledge of time. Tests of Einstein’s theory of gravity are an exception—they use as much precision and accuracy as can be attained. Even though all tests so far demonstrate that the theory works, measurements continue to improve. With higher precision, as-yet-unseen deviations representing new physical effects might appear that were impossible to see with previous less precise measurements. If so, these deviations would give us important insights into new physical phenomena. If not, we would trust that Einstein’s theory was even more accurate than had been demonstrated before. We would know we can confidently apply it over a greater regime of energy and distances and with a higher degree of accuracy. If you were sending a man to the Moon, on the other hand, you would want to understand physical laws sufficiently well that you aim your rocket correctly, but you wouldn’t need to include general relativity—and you certainly would not need to account for the even smaller potential effects representing possible deviations.
ACCURACY IN PARTICLE PHYSICS
In particle physics, we search for the underlying rules that govern the smallest and most fundamental components of matter we can detect. An individual experiment is not measuring a mishmash of many collisions happening at once or repeatedly interacting over time. The predictions we make apply to single collisions of known particles colliding at a definite energy. Particles enter the collision point, interact, and fly through detectors, usually depositing energy along the way. Physicists characterize particle collisions by the distinctive properties of the particles flying out—their mass, energy, and charges.
In this sense, despite the technical challenges of our experiments, particle physicists have it lucky. We study systems that are as basic as possible so that we can isolate fundamental components and laws. The idea is to make experimental systems that are as clean as existing resources permit. The challenge for physicists is reaching the required physical parameters rather than disentangling complex systems. Experiments are difficult because science has to push the frontiers of knowledge in order to be interesting. They are therefore often at the outer limit of the energies and distances accessible to technology.
In truth, particle physics experiments aren’t all that simple, even when studying precise fundamental quantities. Experimenters presenting their results face one of two challenges. If they do see something exotic, they have to be able to prove it cannot be the result of mundane Standard Model events that occasionally resemble some new particle or effect. On the other hand, if they don’t see anything new, they have to be certain of their level of accuracy in order to present a more stringent new limit on what can exist beyond known Standard Model effects. They have to understand the sensitivity of the measuring apparatus sufficiently well to know what they can rule out.
To be sure of their result, experimenters have to be able to distinguish those events that can signal new physics from the background events that arise from the known physical particles of the Standard Model. This is one reason we need many collisions to make new discoveries. The presence of lots of collisions ensures enough events representing new physics to distinguish them from “boring” Standard Model processes they might resemble.
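A rough way to see why more collisions help is a standard back-of-the-envelope estimate, which is not taken from this chapter: if an experiment expects s signal events on top of b background events, the random fluctuation in the background is about the square root of b, so the signal stands out by roughly s divided by the square root of b. The per-collision probabilities below are invented for illustration, and real analyses use far more careful statistics.

```python
import math

def expected_significance(n_collisions, p_signal, p_background):
    """Rough significance s / sqrt(b) for expected signal s and background b."""
    s = n_collisions * p_signal
    b = n_collisions * p_background
    return s / math.sqrt(b)

P_SIGNAL = 1e-5      # hypothetical chance a collision yields a new-physics event
P_BACKGROUND = 1e-3  # hypothetical chance it yields a look-alike Standard Model event

for n in (10**6, 10**8, 10**10):
    sigma = expected_significance(n, P_SIGNAL, P_BACKGROUND)
    print(f"{n:.0e} collisions -> signal stands out by about {sigma:.1f} sigma")
```

In this approximation both signal and background grow in proportion to the number of collisions, but the fluctuations in the background grow only as its square root. A hundred times more collisions therefore makes the signal about ten times easier to distinguish.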
Experiments therefore require adequate statistics. Measurements themselves have some intrinsic uncertainties necessitating their repetition. Quantum mechanics tells us that the underlying events do too. Quantum mechanics implies that no matter how cleverly we design our technology, we can compute only the probability that interactions occur. This uncertainty exists, no matter how we make a measurement. That means that the only way to accurately measure the strength of an interaction is to repeat the measurement many times. Sometimes this uncertainty is smaller than measurement uncertainty and too small to matter. But sometimes we need to take it into account.
Quantum mechanical uncertainty tells us, for example, that the mass of a particle that decays is an intrinsically uncertain quantity. The principle tells us that no energy measurement can possibly be exact when a measurement takes a finite time. The time of the measurement will necessarily be shorter than the lifetime of the decaying particle, which sets the amount of variation expected for the measured masses. So if experimenters were to find evidence of a new particle by finding the particles it decayed into, measuring its mass would require that they repeat the measurement many times. Even though no single measurement would be exact, the average of all the measurements would nonetheless converge to the correct value.
In many cases, the quantum mechanical mass uncertainty is less than the systematic uncertainties (intrinsic error) of the measuring devices. When that is true, experimenters can ignore the quantum mechanical uncertainty in mass. Even so, a large number of measurements are required to ensure the precision of a measurement due to the probabilistic nature of the interactions involved. As was the case with drug testing, large statistics help get us to the right answer.
It’s important to recognize that the probabilities associated with quantum mechanics are not completely random. Probabilities can be calculated from well-defined laws. We’ll see this in Chapter 14 in which we discuss the W boson mass. We know the overall shape of the curve describing the likelihood that this particle with a given mass and a given lifetime will emerge from a collision. Each energy measurement centers around the correct value, and the distribution is consistent with the lifetime and the uncertainty principle. Even though no single measurement suffices to determine the mass, many measurements do. A definite procedure tells us how to deduce the mass from the average value of these repeated measurements. Sufficiently many measurements ensure that the experimenters determine the correct mass within a certain level of precision and accuracy.
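For readers who want the relationship in symbols, the standard textbook form (not spelled out in this chapter) is sketched below in units where the speed of light is 1. Here τ is the particle's lifetime, Γ is the resulting spread (the "width") in measured mass, M is the central mass, and P(E) is the likelihood of measuring energy E. The curve shown is the simple nonrelativistic approximation.

```latex
\Delta E \,\Delta t \gtrsim \hbar
\quad\Longrightarrow\quad
\Gamma \approx \frac{\hbar}{\tau},
\qquad
P(E) \;\propto\; \frac{1}{(E - M)^2 + \Gamma^2/4}
```

The shorter the lifetime, the larger Γ and the broader the curve. The curve is nonetheless peaked at M, which is why many measurements can pin down the mass even though no single one does.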
MEASUREMENTS AND THE LHC
Neither the use of probability to present scientific results nor the probabilities intrinsic to quantum mechanics imply that we don’t know anything. In fact, it is often quite the opposite. We know quite a lot. For example, the magnetic moment of the electron is an intrinsic property of an electron that we can calculate extremely accurately using quantum field theory, which combines together quantum mechanics and special relativity and is the tool used to study the physical properties of elementary particles. My Harvard colleague Gerald Gabrielse has measured the magnetic moment of the electron with 13 digits of accuracy and precision, and it agrees with the prediction at nearly this level. Uncertainty enters only at the level of less than one in a trillion and makes the magnetic moment of the electron the constant of nature with the most accurate agreement between theoretical prediction and measurement.
No one outside of physics can make such an accurate prediction about the world. But most people with such a precise number would say they definitely know the theory and the phenomena it predicts. Scientists, while able to make much more accurate statements than most anyone else, nonetheless acknowledge that measurements and observations, no matter how precise, still leave room for as-yet-unseen phenomena and new ideas.
But they can also state a definite limit to the size of those new phenomena. New hypotheses could change predictions, but only at the level of the present measurement uncertainty or less. Sometimes the predicted new effects are so small that we have no hope of ever encountering them in the lifetime of the universe—in which case even scientists might make a definite statement such as “that won’t ever happen.”
Clearly Gabrielse’s measurement shows that quantum field theory is correct to a very high degree of precision. Even so, we can’t confidently state that quantum field theory or particle physics or the Standard Model is all that exists. As explained in Chapter 1, new phenomena whose effects appear only at different energy scales or when we make even more precise measurements can underlie what we see. Because we haven’t yet experimentally studied those regimes of distance and energy, we don’t yet know.
LHC experiments occur at higher energies than we have ever studied before and therefore open up new possibilities in the form of new particles or interactions that the experiments search for directly, rather than through only indirect effects that can be identified only with extremely precise measurements. In all likelihood, LHC measurements won’t reach sufficiently high energy to see deviations from quantum field theory. But they could conceivably reveal other phenomena that would predict deviations from Standard Model predictions for measurements at the level of current precision—even the well-measured magnetic moment of the electron.
For any given model of physics beyond the Standard Model, any predicted small discrepancies—where the inner workings of an as-yet-unseen theory would make a visible difference—would be a big clue as to the underlying nature of reality. The absence of such discrepancies so far tells us the level of precision or how high an energy we need to find something new—even without knowing the precise nature of potential new phenomena.
The real lesson of effective theories, introduced in the opening chapter, is that we only fully understand what we are studying and its limitations at the point where we see them fail. Effective theories that incorporate existing constraints not only categorize our ideas at a given scale, but they also provide systematic methods for determining how big new effects can be at any specific energy.
Measurements concerning the electromagnetic and weak forces agree with Standard Model predictions at the level of 0.1 percent. Particle collision rates, masses, decay rates, and other properties agree with their predicted values at this level of precision and accuracy. The Standard Model therefore leaves room for new discoveries, and new physical theories can yield deviations, but they must be small enough to have eluded detection up to now. The effects of any new phenomena or underlying theory must have been too small to have been seen already—either because the interactions themselves are small or because the effects are associated with particles too heavy to be produced at the energies already probed. Existing measurements tell us how high an energy we require to directly find new particles or new forces, which can’t cause bigger deviations to measurements than current uncertainties allow. They also tell us how rare such new events have to be. By increasing measurement precision sufficiently, or doing an experiment under different physical conditions, experimenters search for deviations from a model that has so far described all experimental particle physics results.
Current experiments are based on the understanding that new ideas build upon a successful effective theory that applies at lower energies. Their goal is to unveil new matter or interactions, keeping in mind that physics builds knowledge scale by scale. By studying phenomena at the LHC’s higher energies, we hope to find and fully understand the theory that underlies what we have seen so far. Even before we measure new phenomena, LHC data will give us valuable and stringent constraints on what phenomena or theories beyond the Standard Model can exist. And—if our theoretical considerations are correct—new phenomena should eventually emerge at the higher energies the LHC now studies. Such discoveries would force us to extend or absorb the Standard Model into a more complete formulation. The more comprehensive model would apply with greater accuracy over a larger range of scales.
We don’t know which theory will be realized in nature. We also don’t know when we will make new discoveries. The answers depend on what is out there, and we don’t yet know that or we wouldn’t have to look. But for any particular speculation about what exists, we know how to calculate how we might discover the experimental consequences and estimate when it might occur. In the next couple of chapters, we’ll look into how LHC experiments work, and in Part IV that follows, we’ll consider what they might see.