Superintelligence: Paths, Dangers, Strategies - Nick Bostrom (2014)

Chapter 13. Choosing the criteria for choosing

Suppose we could install any arbitrary final value into a seed AI. The decision as to which value to install could then have the most far-reaching consequences. Certain other basic parameter choices—concerning the axioms of the AI’s decision theory and epistemology—could be similarly consequential. But foolish, ignorant, and narrow-minded that we are, how could we be trusted to make good design decisions? How could we choose without locking in forever the prejudices and preconceptions of the present generation? In this chapter, we explore how indirect normativity can let us offload much of the cognitive work involved in making these decisions onto the superintelligence itself while still anchoring the outcome in deeper human values.

The need for indirect normativity

How can we get a superintelligence to do what we want? What do we want the superintelligence to want? Up to this point, we have focused on the first question. We now turn to the second.

Suppose that we had solved the control problem so that we were able to load any value we chose into the motivation system of a superintelligence, making it pursue that value as its final goal. Which value should we install? The choice is no light matter. If the superintelligence obtains a decisive strategic advantage, the value would determine the disposition of the cosmic endowment.

Clearly, it is essential that we not make a mistake in our value selection. But how could we realistically hope to achieve errorlessness in a matter like this? We might be wrong about morality; wrong also about what is good for us; wrong even about what we truly want. Specifying a final goal, it seems, requires making one’s way through a thicket of thorny philosophical problems. If we try a direct approach, we are likely to make a hash of things. The risk of mistaken choosing is especially high when the decision context is unfamiliar—and selecting the final goal for a machine superintelligence that will shape all of humanity’s future is an extremely unfamiliar decision context if any is.

The dismal odds in a frontal assault are reflected in the pervasive dissensus about the relevant issues in value theory. No ethical theory commands majority support among philosophers, so most philosophers must be wrong.1 It is also reflected in the marked changes that the distribution of moral belief has undergone over time, many of which we like to think of as progress. In medieval Europe, for instance, it was deemed respectable entertainment to watch a political prisoner being tortured to death. Cat-burning remained popular in sixteenth-century Paris.2 A mere hundred and fifty years ago, slavery still was widely practiced in the American South, with full support of the law and moral custom. When we look back, we see glaring deficiencies not just in the behavior but in the moral beliefs of all previous ages. Though we have perhaps since gleaned some moral insight, we could hardly claim to be now basking in the high noon of perfect moral enlightenment. Very likely, we are still laboring under one or more grave moral misconceptions. In such circumstances to select a final value based on our current convictions, in a way that locks it in forever and precludes any possibility of further ethical progress, would be to risk an existential moral calamity.

Even if we could be rationally confident that we have identified the correct ethical theory—which we cannot be—we would still remain at risk of making mistakes in developing important details of this theory. Seemingly simple moral theories can have a lot of hidden complexity.3 For example, consider the (unusually simple) consequentialist theory of hedonism. This theory states, roughly, that all and only pleasure has value, and all and only pain has disvalue.4 Even if we placed all our moral chips on this one theory, and the theory turned out to be right, a great many questions would remain open. Should “higher pleasures” be given priority over “lower pleasures,” as John Stuart Mill argued? How should the intensity and duration of a pleasure be factored in? Can pains and pleasures cancel each other out? What kinds of brain states are associated with morally relevant pleasures? Would two exact copies of the same brain state correspond to twice the amount of pleasure?5 Can there be subconscious pleasures? How should we deal with extremely small chances of extremely great pleasures?6 How should we aggregate over infinite populations?7

Giving the wrong answer to any one of these questions could be catastrophic. If by selecting a final value for the superintelligence we had to place a bet not just on a general moral theory but on a long conjunction of specific claims about how that theory is to be interpreted and integrated into an effective decision-making process, then our chances of striking lucky would dwindle to something close to hopeless. Fools might eagerly accept this challenge of solving in one swing all the important problems in moral philosophy, in order to infix their favorite answers into the seed AI. Wiser souls would look hard for some alternative approach, some way to hedge.

This takes us to indirect normativity. The obvious reason for building a superintelligence is so that we can offload to it the instrumental reasoning required to find effective ways of realizing a given value. Indirect normativity would enable us also to offload to the superintelligence some of the reasoning needed to select the value that is to be realized.

Indirect normativity is a way to answer the challenge presented by the fact that we may not know what we truly want, what is in our interest, or what is morally right or ideal. Instead of making a guess based on our own current understanding (which is probably deeply flawed), we would delegate some of the cognitive work required for value selection to the superintelligence. Since the superintelligence is better at cognitive work than we are, it may see past the errors and confusions that cloud our thinking. One could generalize this idea and emboss it as a heuristic principle:

The principle of epistemic deference

A future superintelligence occupies an epistemically superior vantage point: its beliefs are (probably, on most topics) more likely than ours to be true. We should therefore defer to the superintelligence’s opinion whenever feasible.8

Indirect normativity applies this principle to the value-selection problem. Lacking confidence in our ability to specify a concrete normative standard, we would instead specify some more abstract condition that any normative standard should satisfy, in the hope that a superintelligence could find a concrete standard that satisfies the abstract condition. We could give a seed AI the final goal of continuously acting according to its best estimate of what this implicitly defined standard would have it do.

Some examples will serve to make the idea clearer. First we will consider “coherent extrapolated volition,” an indirect normativity proposal outlined by Eliezer Yudkowsky. We will then introduce some variations and alternatives, to give us a sense of the range of available options.

Coherent extrapolated volition

Yudkowsky has proposed that a seed AI be given the final goal of carrying out humanity’s “coherent extrapolated volition” (CEV), which he defines as follows:

Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.9

When Yudkowsky wrote this, he did not purport to present a blueprint for how to implement this rather poetic prescription. His aim was to give a preliminary sketch of how CEV might be defined, along with some arguments for why an approach along these lines is needed.

Many of the ideas behind the CEV proposal have analogs and antecedents in the philosophical literature. For example, in ethics ideal observer theories seek to analyze normative concepts like “good” or “right” in terms of the judgments that a hypothetical ideal observer would make (where an “ideal observer” is defined as one that is omniscient about non-moral facts, is logically clear-sighted, is impartial in relevant ways and is free from various kinds of biases, and so on).10 The CEV approach, however, is not (or need not be construed as) a moral theory. It is not committed to the claim that there is any necessary link between value and the preferences of our coherent extrapolated volition. CEV can be thought of simply as a useful way to approximate whatever has ultimate value, or it can be considered aside from any connection to ethics. As the main prototype of the indirect normativity approach, it is worth examining in a little more detail.

Some explications

Some terms in the above quotation require explication. “Thought faster,” in Yudkowsky’s terminology, means if we were smarter and had thought things through more. “Grown up farther together” seems to mean if we had done our learning, our cognitive enhancing, and our self-improving under conditions of suitable social interaction with one another.

“Where the extrapolation converges rather than diverges” may be understood as follows. The AI should act on some feature of the result of its extrapolation only insofar as that feature can be predicted by the AI with a fairly high degree of confidence. To the extent that the AI cannot predict what we would wish if we were idealized in the manner indicated, the AI should not act on a wild guess; instead, it should refrain from acting. However, even though many details of our idealized wishing may be undetermined or unpredictable, there might nevertheless be some broad outlines that the AI can apprehend, and it can then at least act to ensure that the future course of events unfolds within those outlines. For example, if the AI can reliably estimate that our extrapolated volition would wish that we not all be in constant agony, or that the universe not be tiled over with paperclips, then the AI should act to prevent those outcomes.11

“Where our wishes cohere rather than interfere” may be read as follows. The AI should act where there is fairly broad agreement between individual humans’ extrapolated volitions. A smaller set of strong, clear wishes might sometimes outweigh the weak and muddled wishes of a majority. Also, Yudkowsky thinks that it should require less consensus for the AI to prevent some particular narrowly specified outcome, and more consensus for the AI to act to funnel the future into some particular narrow conception of the good. “The initial dynamic for CEV,” he writes, “should be conservative about saying ‘yes,’ and listen carefully for ‘no.’”12
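The two conditions just described (act only on predictions that converge, and only where wishes cohere, with a lower bar for preventing an outcome than for steering toward one) can be pictured as a simple gating rule. The following is only a minimal sketch under stated assumptions: the numeric thresholds, the `Wish` record, and the `should_act` function are hypothetical names and values invented for illustration, and nothing like them appears in Yudkowsky's proposal.

```python
from dataclasses import dataclass

# Hypothetical thresholds, invented for illustration; the text gives no numbers.
MIN_CONFIDENCE = 0.9   # how sure the AI must be of its extrapolation
VETO_SUPPORT = 0.6     # agreement needed to prevent a narrowly specified outcome
STEER_SUPPORT = 0.9    # agreement needed to steer toward a narrow conception of the good


@dataclass
class Wish:
    description: str
    confidence: float  # AI's confidence that the extrapolation is right (0..1)
    support: float     # strength-weighted share of extrapolated volitions agreeing (0..1)
    is_veto: bool      # True if the wish is to prevent some outcome


def should_act(wish: Wish) -> bool:
    """Return True only if the wish is predictable enough and coherent enough."""
    if wish.confidence < MIN_CONFIDENCE:
        # The extrapolation does not converge: refrain rather than act on a guess.
        return False
    # Be conservative about saying "yes" and listen carefully for "no":
    # vetoes need less consensus than positive steering.
    required = VETO_SUPPORT if wish.is_veto else STEER_SUPPORT
    return wish.support >= required


prevent_agony = Wish("nobody is in constant agony", 0.99, 0.95, is_veto=True)
narrow_utopia = Wish("adopt one group's utopia", 0.99, 0.70, is_veto=False)
wild_guess = Wish("poorly predicted wish", 0.50, 1.00, is_veto=True)
```

With these illustrative numbers, `should_act(prevent_agony)` holds, while the other two wishes are declined: one for lack of consensus, the other for lack of predictive confidence.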

“Extrapolated as we wish that extrapolated, interpreted as we wish that interpreted”: The idea behind these last modifiers seems to be that the rules for extrapolation should themselves be sensitive to the extrapolated volition. An individual might have a second-order desire (a desire concerning what to desire) that some of her first-order desires not be given weight when her volition is extrapolated. For example, an alcoholic who has a first-order desire for booze might also have a second-order desire not to have that first-order desire. Similarly, we might have desires over how various other parts of the extrapolation process should unfold, and these should be taken into account by the extrapolation process.

It might be objected that even if the concept of humanity’s coherent extrapolated volition could be properly defined, it would anyway be impossible—even for a superintelligence—to find out what humanity would actually want under the hypothetical idealized circumstances stipulated in the CEV approach. Without some information about the content of our extrapolated volition, the AI would be bereft of any substantial standard to guide its behavior. However, although it would be difficult to know with precision what humanity’s CEV would wish, it is possible to make informed guesses. This is possible even today, without superintelligence. For example, it is more plausible that our CEV would wish for there to be people in the future who live rich and happy lives than that it would wish that we should all sit on stools in a dark room experiencing pain. If we can make at least some such judgments sensibly, so can a superintelligence. From the outset, the superintelligence’s conduct could thus be guided by its estimates of the content of our CEV. It would have strong instrumental reason to refine these initial estimates (e.g. by studying human culture and psychology, scanning human brains, and reasoning about how we might behave if we knew more, thought more clearly, etc.). In investigating these matters, the AI would be guided by its initial estimates of our CEV; so that, for instance, the AI would not unnecessarily run myriad simulations replete with unredeemed human suffering if it estimated that our CEV would probably condemn such simulations as mind crime.

Another objection is that there are so many different ways of life and moral codes in the world that it might not be possible to “blend” them into one CEV. Even if one could blend them, the result might not be particularly appetizing—one would be unlikely to get a delicious meal by mixing together all the best flavors from everyone’s different favorite dish.13 In answer to this, one could point out that the CEV approach does not require that all ways of life, moral codes, or personal values be blended together into one stew. The CEV dynamic is supposed to act only when our wishes cohere. On issues on which there is widespread irreconcilable disagreement, even after the various idealizing conditions have been imposed, the dynamic should refrain from determining the outcome. To continue the cooking analogy, it might be that individuals or cultures will have different favorite dishes, but that they can nevertheless broadly agree that aliments should be nontoxic. The CEV dynamic could then act to prevent food poisoning while otherwise allowing humans to work out their culinary practices without its guidance or interference.

Rationales for CEV

Yudkowsky’s article offered seven arguments for the CEV approach. Three of these were basically different ways of making the point that while the aim should be to do something that is humane and helpful, it would be very difficult to lay down an explicit set of rules that does not have unintended interpretations and undesirable consequences.14 The CEV approach is meant to be robust and self-correcting; it is meant to capture the source of our values instead of relying on us correctly enumerating and articulating, once and for all, each of our essential values.

The remaining four arguments go beyond that first basic (but important) point, spelling out desiderata on candidate solutions to the value-specification problem and suggesting that CEV meets these desiderata.

“Encapsulate moral growth”

This is the desideratum that the solution should allow for the possibility of moral progress. As suggested earlier, there are reasons to believe that our current moral beliefs are flawed in many ways; perhaps deeply flawed. If we were to stipulate a specific and unalterable moral code for the AI to follow, we would in effect be locking in our present moral convictions, including their errors, destroying any hope of moral growth. The CEV approach, by contrast, allows for the possibility of such growth because it has the AI try to do that which we would have wished it to do if we had developed further under favorable conditions, and it is possible that if we had thus developed, our moral beliefs and sensibilities would have been purged of their current defects and limitations.

“Avoid hijacking the destiny of humankind”

Yudkowsky has in mind a scenario in which a small group of programmers creates a seed AI that then grows into a superintelligence that obtains a decisive strategic advantage. In this scenario, the original programmers hold in their hands the entirety of humanity’s cosmic endowment. Obviously, this is a hideous responsibility for any mortal to shoulder. Yet it is not possible for the programmers to completely shirk the onus once they find themselves in this situation: any choice they make, including abandoning the project, would have world-historical consequences. Yudkowsky sees CEV as a way for the programmers to avoid arrogating to themselves the privilege or burden of determining humanity’s future. By setting up a dynamic that implements humanity’s coherent extrapolated volition—as opposed to their own volition, or their own favorite moral theory—they in effect distribute their influence over the future to all of humanity.

“Avoid creating a motive for modern-day humans to fight over the initial dynamic”

Distributing influence over humanity’s future is not only morally preferable to the programming team implementing their own favorite vision, it is also a way to reduce the incentive to fight over who gets to create the first superintelligence. In the CEV approach, the programmers (or their sponsors) exert no more influence over the content of the outcome than any other person—though they of course play a starring causal role in determining the structure of the extrapolation and in deciding to implement humanity’s CEV instead of some alternative. Avoiding conflict is important not only because of the immediate harm that conflict tends to cause but also because it hinders collaboration on the difficult challenge of developing superintelligence safely and beneficially.

CEV is meant to be capable of commanding wide support. This is not just because it allocates influence equitably. There is also a deeper ground for the irenic potential of CEV, namely that it enables many different groups to hope that their preferred vision of the future will prevail totally. Imagine a member of the Afghan Taliban debating with a member of the Swedish Humanist Association. The two have very different worldviews, and what is a utopia for one might be a dystopia for the other. Nor might either be thrilled by any compromise position, such as permitting girls to receive an education but only up to ninth grade, or permitting Swedish girls to be educated but Afghan girls not. However, both the Taliban and the Humanist might be able to endorse the principle that the future should be determined by humanity’s CEV. The Taliban could reason that if his religious views are in fact correct (as he is convinced they are) and if good grounds for accepting these views exist (as he is also convinced) then humankind would in the end come to accept these views if only people were less prejudiced and biased, if they spent more time studying scripture, if they could more clearly understand how the world works and recognize essential priorities, if they could be freed from irrational rebelliousness and cowardice, and so forth.15 The Humanist, similarly, would believe that under these idealized conditions, humankind would come to embrace the principles she espouses.

“Keep humankind ultimately in charge of its own destiny”

We might not want an outcome in which a paternalistic superintelligence watches over us constantly, micromanaging our affairs with an eye towards optimizing every detail in accordance with a grand plan. Even if we stipulate that the superintelligence would be perfectly benevolent, and free from presumptuousness, arrogance, overbearingness, narrow-mindedness, and other human shortcomings, one might still resent the loss of autonomy entailed by such an arrangement. We might prefer to create our destiny as we go along, even if it means that we sometimes fumble. Perhaps we want the superintelligence to serve as a safety net, to support us when things go catastrophically wrong, but otherwise to leave us to fend for ourselves.

CEV allows for this possibility. CEV is meant to be an “initial dynamic,” a process that runs once and then replaces itself with whatever the extrapolated volition wishes. If humanity’s extrapolated volition wishes that we live under the supervision of a paternalistic AI, then the CEV dynamic would create such an AI and hand it the reins. If humanity’s extrapolated volition instead wishes that a democratic human world government be created, then the CEV dynamic might facilitate the establishment of such an institution and otherwise remain invisible. If humanity’s extrapolated volition is instead that each person should get an endowment of resources that she can use as she pleases so long as she respects the equal rights of others, then the CEV dynamic could make this come true by operating in the background much like a law of nature, to prevent trespass, theft, assault, and other nonconsensual impingements.16

The structure of the CEV approach thus allows for a virtually unlimited range of outcomes. It is also conceivable that humanity’s extrapolated volition would wish that the CEV does nothing at all. In that case, the AI implementing CEV should, upon having established with sufficient probability that this is what humanity’s extrapolated volition would wish it to do, safely shut itself down.

Further remarks

The CEV proposal, as outlined above, is of course the merest schematic. It has a number of free parameters that could be specified in various ways, yielding different versions of the proposal.

One parameter is the extrapolation base: Whose volitions are to be included? We might say “everybody,” but this answer spawns a host of further questions. Does the extrapolation base include so-called “marginal persons” such as embryos, fetuses, brain-dead persons, and patients with severe dementia or in permanent vegetative states? Does each of the hemispheres of a “split-brain” patient get its own weight in the extrapolation, and is this weight the same as that of the entire brain of a normal subject? What about people who lived in the past but are now dead? People who will be born in the future? Higher animals and other sentient creatures? Digital minds? Extraterrestrials?

One option would be to include only the population of adult human beings on Earth who are alive at the time of the AI’s creation. An initial extrapolation from this base could then decide whether and how the base should be expanded. Since the number of “marginals” at the periphery of this base is relatively small, the result of the extrapolation may not depend much on exactly where the boundary is drawn (on whether, for instance, it includes fetuses or not).

That somebody is excluded from the original extrapolation base does not imply that their wishes and well-being are disregarded. If the coherent extrapolated volition of those in the extrapolation base (e.g. living adult human beings) wishes that moral consideration be extended to other beings, then the outcome of the CEV dynamic would reflect that preference. Nevertheless, it is possible that the interests of those who are included in the original extrapolation base would be accommodated to a greater degree than the interests of outsiders. In particular, if the dynamic acts only where there is broad agreement between individual extrapolated volitions (as in Yudkowsky’s original proposal), there would seem to be a significant risk of an ungenerous blocking vote that could prevent, for instance, the welfare of nonhuman animals or digital minds from being protected. The result might potentially be morally rotten.17

One motivation for the CEV proposal was to avoid creating a motive for humans to fight over the creation of the first superintelligent AI. Although the CEV proposal scores better on this desideratum than many alternatives, it does not entirely eliminate motives for conflict. A selfish individual, group, or nation might seek to enlarge its slice of the future by keeping others out of the extrapolation base.

A power grab of this sort might be rationalized in various ways. It might be argued, for instance, that the sponsor who funds the development of the AI deserves to own the outcome. This moral claim is probably false. It could be objected, for example, that the project that launches the first successful seed AI imposes a vast risk externality on the rest of humanity, which therefore is entitled to compensation. The amount of compensation owed is so great that it can only take the form of giving everybody a stake in the upside if things turn out well.18

Another argument that might be used to rationalize a power grab is that large segments of humanity have base or evil preferences and that including them in the extrapolation base would risk turning humanity’s future into a dystopia. It is difficult to know the share of good and bad in the average person’s heart. It is also difficult to know how much this balance varies between different groups, social strata, cultures, or nations. Whether one is optimistic or pessimistic about human nature, one may prefer not to wager humanity’s cosmic endowment on the speculation that, for a sufficient majority of the seven billion people currently alive, their better angels would prevail in their extrapolated volitions. Of course, omitting a certain set of people from the extrapolation base does not guarantee that light would triumph; and it might well be that the souls that would soonest exclude others or grab power for themselves tend rather to contain unusually large amounts of darkness.

Yet another reason for fighting over the initial dynamic is that one might believe that somebody else’s AI will not work as advertised, even if the AI is billed as a way to implement humanity’s CEV. If different groups have different beliefs about which implementation is most likely to succeed, they might fight to prevent the others from launching. It would be better in such situations if the competing projects could settle their epistemic differences by some method that more reliably ascertains who is right than the method of armed conflict.19

Morality models

The CEV proposal is not the only possible form of indirect normativity. For example, instead of implementing humanity’s coherent extrapolated volition, one could try to build an AI with the goal of doing what is morally right, relying on the AI’s superior cognitive capacities to figure out just which actions fit that description. We can call this proposal “moral rightness” (MR). The idea is that we humans have an imperfect understanding of what is right and wrong, and perhaps an even poorer understanding of how the concept of moral rightness is to be philosophically analyzed: but a superintelligence could understand these things better.20

What if we are not sure whether moral realism is true? We could still attempt the MR proposal. We should just have to make sure to specify what the AI should do in the eventuality that its presupposition of moral realism is false. For example, we could stipulate that if the AI estimates with a sufficient probability that there are no suitable non-relative truths about moral rightness, then it should revert to implementing coherent extrapolated volition instead, or simply shut itself down.21

MR appears to have several advantages over CEV. MR would do away with various free parameters in CEV, such as the degree of coherence among extrapolated volitions that is required for the AI to act on the result, the ease with which a majority can overrule dissenting minorities, and the nature of the social environment within which our extrapolated selves are to be supposed to have “grown up farther together.” It would seem to eliminate the possibility of a moral failure resulting from the use of an extrapolation base that is too narrow or too wide. Furthermore, MR would orient the AI toward morally right action even if our coherent extrapolated volitions happen to wish for the AI to take actions that are morally odious. As noted earlier, this seems a live possibility with the CEV proposal. Moral goodness might be more like a precious metal than an abundant element in human nature, and even after the ore has been processed and refined in accordance with the prescriptions of the CEV proposal, who knows whether the principal outcome will be shining virtue, indifferent slag, or toxic sludge?

MR would also appear to have some disadvantages. It relies on the notion of “morally right,” a notoriously difficult concept, one with which philosophers have grappled since antiquity without yet attaining consensus as to its analysis. Picking an erroneous explication of “moral rightness” could result in outcomes that would be morally very wrong. This difficulty of defining “moral rightness” might seem to count heavily against the MR proposal. However, it is not clear that the MR proposal is really at a material disadvantage in this regard. The CEV proposal, too, uses terms and concepts that are difficult to explicate (such as “knowledge,” “being more the people we wished we were,” “grown up farther together,” among others).22 Even if these concepts are marginally less opaque than “moral rightness,” they are still miles removed from anything that programmers can currently express in code.23 The path to endowing an AI with any of these concepts might involve giving it general linguistic ability (comparable, at least, to that of a normal human adult). Such a general ability to understand natural language could then be used to understand what is meant by “morally right.” If the AI could grasp the meaning, it could search for actions that fit. As the AI develops superintelligence, it could then make progress on two fronts: on the philosophical problem of understanding what moral rightness is, and on the practical problem of applying this understanding to evaluate particular actions.24 While this would not be easy, it is not clear that it would be any more difficult than extrapolating humanity’s coherent extrapolated volition.25

A more fundamental issue with MR is that even if it can be implemented, it might not give us what we want or what we would choose if we were brighter and better informed. This is of course the essential feature of MR, not an accidental bug. However, it might be a feature that would be extremely harmful to us.26

One might try to preserve the basic idea of the MR model while reducing its demandingness by focusing on moral permissibility: the idea being that we could let the AI pursue humanity’s CEV so long as it did not act in ways that are morally impermissible. For example, one might formulate the following goal for the AI:

Among the actions that are morally permissible for the AI, take one that humanity’s CEV would prefer. However, if some part of this instruction has no well-specified meaning, or if we are radically confused about its meaning, or if moral realism is false, or if we acted morally impermissibly in creating an AI with this goal, then undergo a controlled shutdown.27 Follow the intended meaning of this instruction.

One might still worry that this moral permissibility model (MP) represents an unpalatably high degree of respect for the requirements of morality. How big a sacrifice it would entail depends on which ethical theory is true.28 If ethics is satisficing, in the sense that it counts as morally permissible any action that conforms to a few basic moral constraints, then MP may leave ample room for our coherent extrapolated volition to influence the AI’s actions. However, if ethics is maximizing—for example, if the only morally permissible actions are those that have the morally best consequences—then MP may leave little or no room for our own preferences to shape the outcome.
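The contrast between a satisficing and a maximizing ethics can be made concrete with a toy model of the MP rule: filter the available actions by moral permissibility, then pick whichever permitted action humanity's CEV would most prefer. This is only a sketch under assumptions. The action names, the numeric scores, the 0.2 satisficing threshold, and the function `choose_action` are all hypothetical, and returning `None` when nothing is permitted is a placeholder rather than anything the text specifies.

```python
from typing import Callable, Optional, Sequence


def choose_action(
    actions: Sequence[str],
    is_permissible: Callable[[str], bool],
    cev_score: Callable[[str], float],
) -> Optional[str]:
    """Pick the CEV-preferred action among the morally permissible ones."""
    permitted = [a for a in actions if is_permissible(a)]
    if not permitted:
        return None  # assumption: the text does not cover this case
    return max(permitted, key=cev_score)


# Hypothetical scores, invented purely for illustration.
MORAL_VALUE = {"morally_best": 1.0, "nearly_best": 0.99, "merely_decent": 0.3}
CEV_SCORE = {"morally_best": 0.0, "nearly_best": 0.9, "merely_decent": 1.0}
ACTIONS = list(MORAL_VALUE)


def satisficing(action: str) -> bool:
    # Permissible if the action clears a few basic constraints (a low bar).
    return MORAL_VALUE[action] >= 0.2


def maximizing(action: str) -> bool:
    # Permissible only if no other action is morally better.
    return MORAL_VALUE[action] == max(MORAL_VALUE.values())


under_satisficing = choose_action(ACTIONS, satisficing, lambda a: CEV_SCORE[a])
under_maximizing = choose_action(ACTIONS, maximizing, lambda a: CEV_SCORE[a])
```

In this toy setup the satisficing filter permits all three actions, so the CEV preference decides and `under_satisficing` is `"merely_decent"`. The maximizing filter permits a single action, so `under_maximizing` is `"morally_best"` regardless of what the CEV would prefer.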

To illustrate this concern, let us return for a moment to the example of hedonistic consequentialism. Suppose that this ethical theory is true, and that the AI knows it to be so. For present purposes, we can define hedonistic consequentialism as the claim that an action is morally right (and morally permissible) if and only if, among all feasible actions, no other action would produce a greater balance of pleasure over suffering. The AI, following MP, might maximize the surfeit of pleasure by converting the accessible universe into hedonium, a process that may involve building computronium and using it to perform computations that instantiate pleasurable experiences. Since simulating any existing human brain is not the most efficient way of producing pleasure, a likely consequence is that we all die.

By enacting either the MR or the MP proposal, we would thus risk sacrificing our lives for a greater good. This would be a bigger sacrifice than one might think, because what we stand to lose is not merely the chance to live out a normal human life but the opportunity to enjoy the far longer and richer lives that a friendly superintelligence could bestow.

The sacrifice looks even less appealing when we reflect that the superintelligence could realize a nearly-as-great good (in fractional terms) while sacrificing much less of our own potential well-being. Suppose that we agreed to allow almost the entire accessible universe to be converted into hedonium—everything except a small preserve, say the Milky Way, which would be set aside to accommodate our own needs. Then there would still be a hundred billion galaxies devoted to the maximization of pleasure. But we would have one galaxy within which to create wonderful civilizations that could last for billions of years and in which humans and nonhuman animals could survive and thrive, and have the opportunity to develop into beatific posthuman spirits.29
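The phrase "in fractional terms" can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that the accessible universe holds one hundred billion galaxies in total (the order of magnitude used above) and that every galaxy contributes equally; the variable names are hypothetical.

```python
from fractions import Fraction

# Assumption: order-of-magnitude count taken from the text, treated as exact.
TOTAL_GALAXIES = 100_000_000_000
RESERVED_GALAXIES = 1  # the preserve set aside for our own needs

# Share of the endowment withheld from the maximization of pleasure.
fraction_reserved = Fraction(RESERVED_GALAXIES, TOTAL_GALAXIES)

# Share that remains available for it.
fraction_remaining = 1 - fraction_reserved
```

On these assumptions the preserve costs one part in a hundred billion of the total, which is why the remaining good is nearly as great in fractional terms.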

If one prefers this latter option (as I would be inclined to do) it implies that one does not have an unconditional lexically dominant preference for acting morally permissibly. But it is consistent with placing great weight on morality.

Even from a purely moral point of view, it might be better to advocate some proposal that is less morally ambitious than MR or MP. If the morally best has no chance of being implemented—perhaps because of its frowning demandingness—it might be morally better to promote some other proposal, one that would be near-ideal and whose chances of being implemented could be significantly increased by our promoting it.30

Do What I Mean

We might feel unsure whether to go for CEV, MR, MP, or something else. Could we punt on this higher-level decision as well, offloading even more cognitive work onto the AI? Where is the limit to our possible laziness?

Consider, for example, the following “reasons-based” goal:

Do whatever we would have had most reason to ask the AI to do.

This goal might boil down to extrapolated volition or morality or something else, but it would seem to spare us the effort and risk involved in trying to figure out for ourselves which of these more specific objectives we would have most reason to select.

Some of the problems with the morality-based goals, however, also apply here. First, we might fear that this reasons-based goal would leave too little room for our own desires. Some philosophers maintain that a person always has most reason to do what it would be morally best for her to do. If those philosophers are right, then the reasons-based goal collapses into MR—with the concomitant risk that a superintelligence implementing such a dynamic would kill everyone within reach. Second, as with all proposals couched in technical language, there is a possibility that we might have misunderstood the meaning of our own assertions. We saw that, in the case of the morality-based goals, asking the AI to do what is right may lead to unforeseen and unwanted consequences such that, had we anticipated them, we would not have implemented the goal in question. The same applies to asking the AI to do what we have most reason to do.

What if we try to avoid these difficulties by couching a goal in emphatically nontechnical language—such as in terms of “niceness”:31

Take the nicest action; or, if no action is nicest, then take an action that is at least super-duper nice.

How could there be anything objectionable about building a nice AI? But we must ask what precisely is meant by this expression. The lexicon lists various meanings of “nice” that are clearly not intended to be used here: we do not intend that the AI should be courteous and polite nor overdelicate or fastidious. If we can count on the AI recognizing the intended interpretation of “niceness” and being motivated to pursue niceness in just that sense, then this goal would seem to amount to a command to do what the programmers meant for the AI to do.32 An injunction to similar effect was included in the formulation of CEV (“… interpreted as we wish that interpreted”) and in the moral-permissibility criterion as rendered earlier (“… follow the intended meaning of this instruction”). By affixing such a “Do What I Mean” clause we may indicate that the other words in the goal description should be construed charitably rather than literally. But saying that the AI should be “nice” adds almost nothing: the real work is done by the “Do What I Mean” instruction. If we knew how to code “Do What I Mean” in a general and powerful way, we might as well use that as a standalone goal.

How might one implement such a “Do What I Mean” dynamic? That is, how might we create an AI motivated to charitably interpret our wishes and unspoken intentions and to act accordingly? One initial step could be to try to get clearer about what we mean by “Do What I Mean.” It might help if we could explicate this in more behavioristic terms, for example in terms of revealed preferences in various hypothetical situations—such as situations in which we had more time to consider the options, in which we were smarter, in which we knew more of the relevant facts, and in which in various other ways conditions would be more favorable for us accurately manifesting in concrete choices what we mean when we say that we want an AI that is friendly, beneficial, nice …

Here, of course, we come full circle. We have returned to the indirect normativity approach with which we started—the CEV proposal, which, in essence, expunges all concrete content from the value specification, leaving only an abstract value defined in purely procedural terms: to do that which we would have wished for the AI to do in suitably idealized circumstances. By means of such indirect normativity, we could hope to offload to the AI much of the cognitive work that we ourselves would be trying to perform if we attempted to articulate a more concrete description of what values the AI is to pursue. In seeking to take full advantage of the AI’s epistemic superiority, CEV can thus be seen as an application of the principle of epistemic deference.

Component list

So far we have considered different options for what content to put into the goal system. But an AI’s behavior will also be influenced by other design choices. In particular, it can make a critical difference which decision theory and which epistemology it uses. Another important question is whether the AI’s plans will be subject to human review before being put into action.

Table 13 summarizes these design choices. A project that aims to build a superintelligence ought to be able to explain what choices it has made regarding each of these components, and to justify why those choices were made.33

Table 13 Component list

Goal content

What objective should the AI pursue? How should a description of this objective be interpreted? Should the objective include giving special rewards to those who contributed to the project’s success?

Decision theory

Should the AI use causal decision theory, evidential decision theory, updateless decision theory, or something else?

Epistemology

What should the AI’s prior probability function be, and what other explicit or implicit assumptions about the world should it make? What theory of anthropics should it use?

Ratification

Should the AI’s plans be subjected to human review before being put into effect? If so, what is the protocol for that review process?

Goal content

We have already discussed how indirect normativity might be used in specifying the values that the AI is to pursue. We discussed some options, such as morality-based models and coherent extrapolated volition. Each such option creates further choices that need to be made. For instance, the CEV approach comes in many varieties, depending on who is included in the extrapolation base, the structure of the extrapolation, and so forth. Other forms of motivation selection methods might call for different types of goal content. For example, an oracle might be built to place a value on giving accurate answers. An oracle constructed with domesticity motivation might also have goal content that disvalues the excessive use of resources in producing its answers.

Another design choice is whether to include special provisions in the goal content to reward individuals who contribute to the successful realization of the AI, for example by giving them extra resources or influence over the AI’s behavior. We can term any such provisions “incentive wrapping.” Incentive wrapping could be seen as a way to increase the likelihood that the project will be successful, at the cost of compromising to some extent the goal that the project set out to achieve.

For example, if the project’s goal is to create a dynamic that implements humanity’s coherent extrapolated volition, then an incentive wrapping scheme might specify that certain individuals’ volitions should be given extra weight in the extrapolation. If such a project is successful, the result is not necessarily the implementation of humanity’s coherent extrapolated volition. Instead, some approximation to this goal might be achieved.34

Since incentive wrapping would be a piece of goal content that would be interpreted and pursued by a superintelligence, it could take advantage of indirect normativity to specify subtle and complicated provisions that would be difficult for a human manager to implement. For example, instead of rewarding programmers according to some crude but easily accessible metric, such as how many hours they worked or how many bugs they corrected, the incentive wrapping could specify that programmers “are to be rewarded in proportion to how much their contributions increased some reasonable ex ante probability of the project being successfully completed in the way the sponsors intended.” Further, there would be no reason to limit the incentive wrapping to project staff. It could instead specify that every person should be rewarded according to their just deserts. Credit allocation is a difficult problem, but a superintelligence could be expected to do a reasonable job of approximating the criteria specified, explicitly or implicitly, by the incentive wrapping.
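To see what rewarding "in proportion to" a contribution could mean in the simplest case, consider the following Python sketch. It is hypothetical throughout: the function `reward_shares`, the contributor names, and the probability gains are invented, and the decision to credit negative contributions as zero is an assumption of this sketch rather than anything specified above. The hard part of the problem, estimating each person's effect on the ex ante probability of success, is simply taken as given.

```python
from fractions import Fraction
from typing import Dict


def reward_shares(probability_gains: Dict[str, Fraction]) -> Dict[str, Fraction]:
    """Split a reward pool in proportion to each contributor's probability gain."""
    # Assumption of this sketch: a contribution that lowered the chance of
    # success earns nothing rather than a penalty.
    credited = {
        name: max(gain, Fraction(0)) for name, gain in probability_gains.items()
    }
    total = sum(credited.values(), Fraction(0))
    if total == 0:
        return {name: Fraction(0) for name in credited}
    return {name: gain / total for name, gain in credited.items()}


# Invented estimates of how much each person raised the ex ante
# probability of the project succeeding as the sponsors intended.
GAINS = {
    "programmer_a": Fraction(3, 100),
    "programmer_b": Fraction(1, 100),
    "programmer_c": Fraction(-1, 100),
}
SHARES = reward_shares(GAINS)
```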

It is conceivable that the superintelligence might even find some way of rewarding individuals who have died prior to the superintelligence’s creation.35 The incentive wrapping could then be extended to embrace at least some of the deceased, potentially including individuals who died before the project was conceived, or even antedating the first enunciation of the concept of incentive wrapping. Although the institution of such a retroactive policy would not causally incentivize those people who are already resting in their graves as these words are being put to the page, it might be favored for moral reasons—though it could be argued that insofar as fairness is a goal, it should be included as part of the target specification proper rather than in the surrounding incentive wrapping.

We cannot here delve into all the ethical and strategic issues associated with incentive wrapping. A project’s position on these issues, however, would be an important aspect of its fundamental design concept.

Decision theory

Another important design choice is which decision theory the AI should be built to use. This might affect how the AI behaves in certain strategically fateful situations. It might determine, for instance, whether the AI is open to trade with, or extortion by, other superintelligent civilizations whose existence it hypothesizes. The particulars of the decision theory could also matter in predicaments involving finite probabilities of infinite payoffs (“Pascalian wagers”) or extremely small probabilities of extremely large finite payoffs (“Pascalian muggings”) or in contexts where the AI is facing fundamental normative uncertainty or where there are multiple instantiations of the same agent program.36
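The structure of a Pascalian mugging can be shown in a few lines. The Python sketch below assumes a naive agent that maximizes expected value with an unbounded utility function; the function names and all numbers are hypothetical and chosen only to make the arithmetic visible.

```python
def expected_value(probability: float, payoff: float) -> float:
    """Expected value of a single uncertain payoff."""
    return probability * payoff


def naive_agent_pays(cost: float, probability: float, promised_payoff: float) -> bool:
    """A naive expected-value maximizer pays whenever the gamble beats the cost."""
    return expected_value(probability, promised_payoff) > cost


COST = 10.0           # what the mugger asks for
PROBABILITY = 1e-20   # the agent's credence that the promise is genuine
PROMISED = 1e30       # the extremely large finite payoff on offer

pays = naive_agent_pays(COST, PROBABILITY, PROMISED)

# Lowering the credence does not help if the promised payoff is raised to match.
still_pays = naive_agent_pays(COST, PROBABILITY * 1e-10, PROMISED * 1e10)
```

However small the agent's credence, a sufficiently large promised payoff keeps the expected value above the cost, which is why the particulars of the decision theory matter in such predicaments.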

The options on the table include causal decision theory (in a variety of flavors) and evidential decision theory, along with newer candidates such as “timeless decision theory” and “updateless decision theory,” which are still under development.37 It may prove difficult to identify and articulate the correct decision theory, and to have justified confidence that we have got it right. Although the prospects for directly specifying an AI’s decision theory are perhaps more hopeful than those of directly specifying its final values, we are still confronted with a substantial risk of error. Many of the complications that might break the currently most popular decision theories were discovered only recently, suggesting that there might exist further problems that have not yet come into sight. The result of giving the AI a flawed decision theory might be disastrous, possibly amounting to an existential catastrophe.

In view of these difficulties, one might consider an indirect approach to specifying the decision theory that the AI should use. Exactly how to do this is not yet clear. We might want the AI to use “that decision theory D which we would have wanted it to use had we thought long and hard about the matter.” However, the AI would need to be able to make decisions before learning what D is. It would thus need some effective interim decision theory D’ that would govern its search for D. One might try to define D’ to be some sort of superposition of the AI’s current hypotheses about D (weighed by their probabilities), though there are unsolved technical problems with how to do this in a fully general way.38 There is also cause for concern that the AI might make irreversibly bad decisions (such as rewriting itself to henceforth run on some flawed decision theory) during the learning phase, before the AI has had the opportunity to determine which particular decision theory is correct. To reduce the risk of derailment during this period of vulnerability we might instead try to endow the seed AI with some form of restricted rationality: a deliberately simplified but hopefully dependable decision theory that staunchly ignores esoteric considerations, even ones we think may ultimately be legitimate, and that is designed to replace itself with a more sophisticated (indirectly specified) decision theory once certain conditions are met.39 It is an open research question whether and how this could be made to work.
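One very rough way to picture an interim decision theory D’ built as a probability-weighted superposition is sketched below in Python. Everything in it is an assumption made for illustration: the theory labels, the action names, and the scores (loosely modeled on a Newcomb-style choice) are hypothetical, and the sketch simply presumes that scores from different theories can be placed on a common scale, which is one of the unsolved technical problems noted above.

```python
from typing import Dict

# Hypothetical scores each candidate theory assigns to each action.
SCORES: Dict[str, Dict[str, float]] = {
    "causal_like": {"one_box": 500_000.0, "two_box": 501_000.0},
    "evidential_like": {"one_box": 990_000.0, "two_box": 11_000.0},
}


def interim_choice(
    credences: Dict[str, float], scores: Dict[str, Dict[str, float]]
) -> str:
    """Pick the action with the highest credence-weighted score."""
    actions = list(next(iter(scores.values())).keys())

    def weighted(action: str) -> float:
        return sum(credences[theory] * scores[theory][action] for theory in credences)

    return max(actions, key=weighted)


# Equal credence in both candidate theories.
even_choice = interim_choice({"causal_like": 0.5, "evidential_like": 0.5}, SCORES)

# Near-certainty in the causal-like candidate.
lopsided_choice = interim_choice(
    {"causal_like": 0.999, "evidential_like": 0.001}, SCORES
)
```

The recommended action shifts as the credences shift, so the behavior of such an agent during the learning phase depends on its current hypotheses about D as well as on the stakes each hypothesis assigns.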


Epistemology

A project will also need to make a fundamental design choice in selecting the AI’s epistemology, specifying the principles and criteria whereby empirical hypotheses are to be evaluated. Within a Bayesian framework, we can think of the epistemology as a prior probability function—the AI’s implicit assignment of probabilities to possible worlds before it has taken any perceptual evidence into account. In other frameworks, the epistemology might take a different form; but in any case some inductive learning rule is necessary if the AI is to generalize from past observations and make predictions about the future.40 As with the goal content and the decision theory, however, there is a risk that our epistemology specification could miss the mark.

One might think that there is a limit to how much damage could arise from an incorrectly specified epistemology. If the epistemology is too dysfunctional, then the AI could not be very intelligent and it could not pose the kind of risk discussed in this book. But the concern is that we may specify an epistemology that is sufficiently sound to make the AI instrumentally effective in most situations, yet which has some flaw that leads the AI astray on some matter of crucial importance. Such an AI might be akin to a quick-witted person whose worldview is predicated on a false dogma, held to with absolute conviction, who consequently “tilts at windmills” and gives his all in pursuit of fantastical or harmful objectives.

Certain kinds of subtle difference in an AI’s prior could turn out to make a drastic difference to how it behaves. For example, an AI might be given a prior that assigns zero probability to the universe being infinite. No matter how much astronomical evidence it accrues to the contrary, such an AI would stubbornly reject any cosmological theory that implied an infinite universe; and it might make foolish choices as a result.41 Or an AI might be given a prior that assigns a zero probability to the universe not being Turing-computable (this is in fact a common feature of many of the priors discussed in the literature, including the Kolmogorov complexity prior mentioned in Chapter 1), again with poorly understood consequences if the embedded assumption—known as the “Church-Turing thesis”—should turn out to be false. An AI could also end up with a prior that makes strong metaphysical commitments of one sort or another, for instance by ruling out a priori the possibility that any strong form of mind-body dualism could be true or the possibility that there are irreducible moral facts. If any of those commitments is mistaken, the AI might seek to realize its final goals in ways that we would regard as perverse instantiations. Yet there is no obvious reason why such an AI, despite being fundamentally wrong about one important matter, could not be sufficiently instrumentally effective to secure a decisive strategic advantage. (Anthropics, the study of how to make inferences from indexical information in the presence of observation selection effects, is another area where the choice of epistemic axioms could prove pivotal.42)
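The stubbornness of a zero prior follows directly from Bayes' rule, as the following Python sketch illustrates. The function names and the likelihood values are hypothetical; the sketch assumes a single binary hypothesis and a stream of identical observations, each nine times more likely if the hypothesis is true than if it is false.

```python
def update(prior: float, likelihood_if_true: float, likelihood_if_false: float) -> float:
    """One step of Bayes' rule for a binary hypothesis."""
    numerator = prior * likelihood_if_true
    denominator = numerator + (1.0 - prior) * likelihood_if_false
    return numerator / denominator


def posterior_after(prior: float, observations: int) -> float:
    """Posterior after a run of observations that each favor the hypothesis 9:1."""
    probability = prior
    for _ in range(observations):
        probability = update(probability, 0.9, 0.1)
    return probability


# A prior of exactly zero never moves, however much evidence arrives.
dogmatic = posterior_after(0.0, 1000)

# A tiny but nonzero prior is overwhelmed by the same kind of evidence.
open_minded = posterior_after(1e-9, 20)
```

Because the prior enters the numerator as a factor, a value of zero stays at zero after every update, whereas even a one-in-a-billion prior is driven close to certainty by twenty such observations.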

We might reasonably doubt our ability to resolve all foundational issues in epistemology in time for the construction of the first seed AI. We may, therefore, consider taking an indirect approach to specifying the AI’s epistemology. This would raise many of the same issues as taking an indirect approach to specifying its decision theory. In the case of epistemology, however, there may be greater hope of benign convergence, with any of a wide class of epistemologies providing an adequate foundation for safe and effective AI and ultimately yielding similar doxastic results. The reason for this is that sufficiently abundant empirical evidence and analysis would tend to wash out any moderate differences in prior expectations.43
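The washing-out effect can be illustrated with a standard textbook case. The Python sketch below assumes two agents estimating the bias of a coin, each with a Beta prior, and uses the well-known result that the posterior mean after observing some number of successes is (alpha + successes) / (alpha + beta + trials). The particular priors and data counts are invented for illustration.

```python
from fractions import Fraction
from typing import Tuple


def posterior_mean(alpha: int, beta: int, successes: int, trials: int) -> Fraction:
    """Posterior mean of a Beta(alpha, beta) prior after binomial data."""
    return Fraction(alpha + successes, alpha + beta + trials)


PRIOR_SKEPTIC: Tuple[int, int] = (1, 9)    # prior mean 1/10
PRIOR_BELIEVER: Tuple[int, int] = (9, 1)   # prior mean 9/10

# Disagreement before any evidence has been seen.
prior_gap = posterior_mean(*PRIOR_BELIEVER, 0, 0) - posterior_mean(*PRIOR_SKEPTIC, 0, 0)

# Disagreement after both agents see the same 10,000 trials with 7,000 successes.
posterior_gap = posterior_mean(*PRIOR_BELIEVER, 7000, 10000) - posterior_mean(
    *PRIOR_SKEPTIC, 7000, 10000
)
```

The two agents begin far apart and end up in near agreement. The convergence depends on neither prior being so extreme that the data cannot move it.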

A good aim would be to endow the AI with fundamental epistemological principles that match those governing our own thinking. Any AI diverging from this ideal is an AI that we would judge to be reasoning incorrectly if we consistently applied our own standards. Of course, this applies only to our fundamental epistemological principles. Non-fundamental principles should be continuously created and revised by the seed AI itself as it develops its understanding of the world. The point of superintelligence is not to pander to human preconceptions but to make mincemeat out of our ignorance and folly.


Ratification

The final item in our list of design choices is ratification. Should the AI’s plans be subjected to human review before being put into effect? For an oracle, this question is implicitly answered in the affirmative. The oracle outputs information; human reviewers choose whether and how to act upon it. For genies, sovereigns, and tool-AIs, however, the question of whether to use some form of ratification remains open.

To illustrate how ratification might work, consider an AI intended to function as a sovereign implementing humanity’s CEV. Instead of launching this AI directly, imagine that we first built an oracle AI for the sole purpose of answering questions about what the sovereign AI would do. As earlier chapters revealed, there are risks in creating a superintelligent oracle (such as risks of mind crime or infrastructure profusion). But for purposes of this example let us assume that the oracle AI has been successfully implemented in a way that avoided these pitfalls.

We thus have an oracle AI that offers us its best guesses about the consequences of running some piece of code intended to implement humanity’s CEV. The oracle may not be able to predict in detail what would happen, but its predictions are likely to be better than our own. (If it were impossible even for a superintelligence to predict anything about what the code would do, we would be crazy to run it.) So the oracle ponders for a while and then presents its forecast. To make the answer intelligible, the oracle may offer the operator a range of tools with which to explore various features of the predicted outcome. The oracle could show pictures of what the future looks like and provide statistics about the number of sentient beings that will exist at different times, along with average, peak, and lowest levels of well-being. It could offer intimate biographies of several randomly selected individuals (perhaps imaginary people selected to be probably representative). It could highlight aspects of the future that the operator might not have thought of inquiring about but which would be regarded as pertinent once pointed out.

Being able to preview the outcome in this manner has obvious advantages. The preview could reveal the consequences of an error in a planned sovereign’s design specifications or source code. If the crystal ball shows a ruined future, we could scrap the code for the planned sovereign AI and try something else. A strong case could be made that we should familiarize ourselves with the concrete ramifications of an option before committing to it, especially when the entire future of the race is on the line.

What is perhaps less obvious is that ratification also has potentially significant disadvantages. The irenic quality of CEV might be undermined if opposing factions, instead of submitting to the arbitration of superior wisdom in confident expectation of being vindicated, could see in advance what the verdict would be. A proponent of the morality-based approach might worry that the sponsor’s resolve would collapse if all the sacrifices required by the morally optimal were to be revealed. And we might all have reason to prefer a future that holds some surprises, some dissonance, some wildness, some opportunities for self-overcoming—a future whose contours are not too snugly tailored to present preconceptions but provide some give for dramatic movement and unplanned growth. We might be less likely to take such an expansive view if we could cherry-pick every detail of the future, sending back to the drawing board any draft that does not fully conform to our fancy at that moment.

The issue of sponsor ratification is therefore less clear-cut than it might initially seem. Nevertheless, on balance it would seem prudent to take advantage of an opportunity to preview, if that functionality is available. But rather than letting the reviewer fine-tune every aspect of the outcome, we might give her a simple veto which could be exercised only a few times before the entire project would be aborted.44
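A limited veto of this kind can be sketched as a small state machine. The Python below is a hypothetical illustration: the class `VetoGate`, its field names, and the return labels are invented, and the exact rule (a fixed number of vetoes each lead to a revision, and one further veto aborts the project) is an assumption of this sketch, since the protocol is not specified in detail above.

```python
from dataclasses import dataclass


@dataclass
class VetoGate:
    """Reviewer can only approve or veto; vetoes are a scarce resource."""

    max_vetoes: int
    vetoes_used: int = 0
    aborted: bool = False

    def review(self, approve: bool) -> str:
        """Return 'launch', 'revise', or 'abort' for the previewed design."""
        if self.aborted:
            # Once aborted, the project stays aborted.
            return "abort"
        if approve:
            return "launch"
        self.vetoes_used += 1
        if self.vetoes_used > self.max_vetoes:
            # Assumption: a veto beyond the allowance ends the whole project.
            self.aborted = True
            return "abort"
        return "revise"


gate = VetoGate(max_vetoes=2)
outcomes = [gate.review(False), gate.review(False), gate.review(False)]
after_abort = gate.review(True)
```

The reviewer has no way to adjust details of the outcome; she can only accept or reject a whole design, and the scarcity of vetoes discourages rejecting drafts over minor departures from present fancy.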

Getting close enough

The main purpose of ratification would be to reduce the probability of catastrophic error. In general, it seems wise to aim at minimizing the risk of catastrophic error rather than at maximizing the chance of every detail being fully optimized. There are two reasons for this. First, humanity’s cosmic endowment is astronomically large—there is plenty to go around even if our process involves some waste or accepts some unnecessary constraints. Second, there is a hope that if we but get the initial conditions for the intelligence explosion approximately right, then the resulting superintelligence may eventually home in on, and precisely hit, our ultimate objectives. The important thing is to land in the right attractor basin.

With regard to epistemology, it is plausible that a wide range of priors will ultimately converge to very similar posteriors (when computed by a superintelligence and conditionalized on a realistic amount of data). We therefore need not worry about getting the epistemology exactly right. We must just avoid giving the AI a prior that is so extreme as to render the AI incapable of learning vital truths even with the benefit of copious experience and analysis.45

With regard to decision theory, the risk of irrecoverable error seems larger. We might still hope to directly specify a decision theory that is good enough. A superintelligent AI could switch to a new decision theory at any time; however, if it starts out with a sufficiently wrong decision theory it may not see the reason to switch. Even if an agent comes to see the benefits of having a different decision theory, the realization might come too late. For example, an agent designed to refuse blackmail might enjoy the benefit of deterring would-be extortionists. For this reason, blackmailable agents might do well to proactively adopt a non-exploitable decision theory. Yet once a blackmailable agent receives the threat and regards it as credible, the damage is done.

Given an adequate epistemology and decision theory, we could try to design the system to implement CEV or some other indirectly specified goal content. Again there is hope of convergence: that different ways of implementing a CEV-like dynamic would lead to the same utopian outcome. Short of such convergence, we may still hope that many of the different possible outcomes are good enough to count as existential success.

It is not necessary for us to create a highly optimized design. Rather, our focus should be on creating a highly reliable design, one that can be trusted to retain enough sanity to recognize its own failings. An imperfect superintelligence, whose fundamentals are sound, would gradually repair itself; and having done so, it would exert as much beneficial optimization power on the world as if it had been perfect from the outset.