Acquiring values - Superintelligence: Paths, Dangers, Strategies - Nick Bostrom

Superintelligence: Paths, Dangers, Strategies - Nick Bostrom (2014)

Chapter 12. Acquiring values

Capability control is, at best, a temporary and auxiliary measure. Unless the plan is to keep superintelligence bottled up forever, it will be necessary to master motivation selection. But just how could we get some value into an artificial agent, so as to make it pursue that value as its final goal? While the agent is unintelligent, it might lack the capability to understand or even represent any humanly meaningful value. Yet if we delay the procedure until the agent is superintelligent, it may be able to resist our attempt to meddle with its motivation system—and, as we showed in Chapter 7, it would have convergent instrumental reasons to do so. This value-loading problem is tough, but must be confronted.

The value-loading problem

It is impossible to enumerate all possible situations a superintelligence might find itself in and to specify for each what action it should take. Similarly, it is impossible to create a list of all possible worlds and assign each of them a value. In any realm significantly more complicated than a game of tic-tac-toe, there are far too many possible states (and state-histories) for exhaustive enumeration to be feasible. A motivation system, therefore, cannot be specified as a comprehensive lookup table. It must instead be expressed more abstractly, as a formula or rule that allows the agent to decide what to do in any given situation.

One formal way of specifying such a decision rule is via a utility function. A utility function (as we recall from Chapter 1) assigns value to each outcome that might obtain, or more generally to each “possible world.” Given a utility function, one can define an agent that maximizes expected utility. Such an agent selects at each time the action that has the highest expected utility. (The expected utility is calculated by weighting the utility of each possible world with the subjective probability of that world being the actual world conditional on a particular action being taken.) In reality, the possible outcomes are too numerous for the expected utility of an action to be calculated exactly. Nevertheless, the decision rule and the utility function together determine a normative ideal—an optimality notion—that an agent might be designed to approximate; and the approximation might get closer as the agent gets more intelligent.1 Creating a machine that can compute a good approximation of the expected utility of the actions available to it is an AI-complete problem.2 This chapter addresses another problem, a problem that remains even if the problem of making machines intelligent is solved.

We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if we the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.

Identifying and codifying our own final goals is difficult because human goal representations are complex. Because the complexity is largely transparent to us, however, we often fail to appreciate that it is there. We can compare the case to visual perception. Vision, likewise, might seem like a simple thing, because we do it effortlessly.3 We only need to open our eyes, so it seems, and a rich, meaningful, eidetic, three-dimensional view of the surrounding environment comes flooding into our minds. This intuitive understanding of vision is like a duke’s understanding of his patriarchal household: as far as he is concerned, things simply appear at their appropriate times and places, while the mechanism that produces those manifestations are hidden from view. Yet accomplishing even the simplest visual task—finding the pepper jar in the kitchen—requires a tremendous amount of computational work. From a noisy time series of two-dimensional patterns of nerve firings, originating in the retina and conveyed to the brain via the optic nerve, the visual cortex must work backwards to reconstruct an interpreted three-dimensional representation of external space. A sizeable portion of our precious one square meter of cortical real estate is zoned for processing visual information, and as you are reading this book, billions of neurons are working ceaselessly to accomplish this task (like so many seamstresses, bent over their sewing machines in a sweatshop, sewing and re-sewing a giant quilt many times a second). In like manner, our seemingly simple values and wishes in fact contain immense complexity.4 How could our programmer transfer this complexity into a utility function?

One approach would be to try to directly code a complete representation of whatever goal we have that we want the AI to pursue; in other words, to write out an explicit utility function. This approach might work if we had extraordinarily simple goals, for example if we wanted to calculate the digits of pi—that is, if the only thing we wanted was for the AI to calculate the digits of pi and we were indifferent to any other consequence that would result from the pursuit of this goal—recall our earlier discussion of the failure mode of infrastructure profusion. This explicit coding approach might also have some promise in the use of domesticity motivation selection methods. But if one seeks to promote or protect any plausible human value, and one is building a system intended to become a superintelligent sovereign, then explicitly coding the requisite complete goal representation appears to be hopelessly out of reach.5

If we cannot transfer human values into an AI by typing out full-blown representations in computer code, what else might we try? This chapter discusses several alternative paths. Some of these may look plausible at first sight—but much less so upon closer examination. Future explorations should focus on those paths that remain open.

Solving the value-loading problem is a research challenge worthy of some of the next generation’s best mathematical talent. We cannot postpone confronting this problem until the AI has developed enough reason to easily understand our intentions. As we saw in the section on convergent instrumental reasons, a generic system will resist attempts to alter its final values. If an agent is not already fundamentally friendly by the time it gains the ability to reflect on its own agency, it will not take kindly to a belated attempt at brainwashing or a plot to replace it with a different agent that better loves its neighbor.

Evolutionary selection

Evolution has produced an organism with human values at least once. This fact might encourage the belief that evolutionary methods are the way to solve the value-loading problem. There are, however, severe obstacles to achieving safety along this path. We have already pointed to these obstacles at the end of Chapter 10 when we discussed how powerful search processes can be dangerous.

Evolution can be viewed as a particular class of search algorithms that involve the alternation of two steps, one expanding a population of solution candidates by generating new candidates according to some relatively simple stochastic rule (such as random mutation or sexual recombination), the other contracting the population by pruning candidates that score poorly when tested by an evaluation function. As with many other types of powerful search, there is the risk that the process will find a solution that satisfies the formally specified search criteria but not our implicit expectations. (This would hold whether one seeks to evolve a digital mind that has the same goals and values as a typical human being, or instead a mind that is, for instance, perfectly moral or perfectly obedient.) The risk would be avoided if we could specify a formal search criterion that accurately represented all dimensions of our goals, rather than just one aspect of what we think we desire. But this is precisely the value-loading problem, and it would of course beg the question in this context to assume that problem solved.

There is a further problem:

The total amount of suffering per year in the natural world is beyond all decent contemplation. During the minute that it takes me to compose this sentence, thousands of animals are being eaten alive, others are running for their lives, whimpering with fear, others are being slowly devoured from within by rasping parasites, thousands of all kinds are dying of starvation, thirst and disease.6

Even just within our species, 150,000 persons are destroyed each day while countless more suffer an appalling array of torments and deprivations.7 Nature might be a great experimentalist, but one who would never pass muster with an ethics review board—contravening the Helsinki Declaration and every norm of moral decency, left, right, and center. It is important that we not gratuitously replicate such horrors in silico. Mind crime seems especially difficult to avoid when evolutionary methods are used to produce human-like intelligence, at least if the process is meant to look anything like actual biological evolution.8

Reinforcement learning

Reinforcement learning is an area of machine learning that studies techniques whereby agents can learn to maximize some notion of cumulative reward. By constructing an environment in which desired performance is rewarded, a reinforcement-learning agent can be made to learn to solve a wide class of problems (even in the absence of detailed instruction or feedback from the programmers, aside from the reward signal). Often, the learning algorithm involves the gradual construction of some kind of evaluation function, which assigns values to states, state-action pairs, or policies. (For instance, a program can learn to play backgammon by using reinforcement learning to incrementally improve its evaluation of possible board positions.) The evaluation function, which is continuously updated in light of experience, could be regarded as incorporating a form of learning about value. However, what is being learned is not new final values but increasingly accurate estimates of the instrumental values of reaching particular states (or of taking particular actions in particular states, or of following particular policies). Insofar as a reinforcement-learning agent can be described as having a final goal, that goal remains constant: to maximize future reward. And reward consists of specially designated percepts received from the environment. Therefore, the wireheading syndrome remains a likely outcome in any reinforcement agent that develops a world model sophisticated enough to suggest this alternative way of maximizing reward.9

These remarks do not imply that reinforcement-learning methods could never be used in a safe seed AI, only that they would have to be subordinated to a motivation system that is not itself organized around the principle of reward maximization. That, however, would require that a solution to the value-loading problem had been found by some other means than reinforcement learning.

Associative value accretion

Now one might wonder: if the value-loading problem is so tricky, how do we ourselves manage to acquire our values?

One possible (oversimplified) model might look something like this. We begin life with some relatively simple starting preferences (e.g. an aversion to noxious stimuli) together with a set of dispositions to acquire additional preferences in response to various possible experiences (e.g. we might be disposed to form a preference for objects and behaviors that we find to be valued and rewarded in our culture). Both the simple starting preferences and the dispositions are innate, having been shaped by natural and sexual selection over evolutionary timescales. Yet which preferences we end up with as adults depends on life events. Much of the information content in our final values is thus acquired from our experiences rather than preloaded in our genomes.

For example, many of us love another person and thus place great final value on his or her well-being. What is required to represent such a value? Many elements are involved, but consider just two: a representation of “person” and a representation of “well-being.” These concepts are not directly coded in our DNA. Rather, the DNA contains instructions for building a brain, which, when placed in a typical human environment, will over the course of several years develop a world model that includes concepts of persons and of well-being. Once formed, these concepts can be used to represent certain meaningful values. But some mechanism needs to be innately present that leads to values being formed around these concepts, rather than around other acquired concepts (like that of a flowerpot or a corkscrew).

The details of how this mechanism works are not well understood. In humans, the mechanism is probably complex and multifarious. It is easier to understand the phenomenon if we consider it in a more rudimentary form, such as filial imprinting in nidifugous birds, where the newly hatched chick acquires a desire for physical proximity to an object that presents a suitable moving stimulus within the first day after hatching. Which particular object the chick desires to be near depends on its experience; only the general disposition to imprint in this way is genetically determined. Analogously, Harry might place a final value on Sally’s well-being; but had the twain never met, he might have fallen in love with somebody else instead, and his final values would have been different. The ability of our genes to code for the construction of a goal-acquiring mechanism explains how we come to have final goals of great informational complexity, greater than could be contained in the genome itself.

We may consequently consider whether we might build the motivation system for an artificial intelligence on the same principle. That is, instead of specifying complex values directly, could we specify some mechanism that leads to the acquisition of those values when the AI interacts with a suitable environment?

Mimicking the value-accretion process that takes place in humans seems difficult. The relevant genetic mechanism in humans is the product of eons of work by evolution, work that might be hard to recapitulate. Moreover, the mechanism is presumably closely tailored to the human neurocognitive architecture and therefore not applicable in machine intelligences other than whole brain emulations. And if whole brain emulations of sufficient fidelity were available, it would seem easier to start with an adult brain that comes with full representations of some human values preloaded.10

Seeking to implement a process of value accretion closely mimicking that of human biology therefore seems an unpromising line of attack on the value-loading problem. But perhaps we might design a more unabashedly artificial substitute mechanism that would lead an AI to import high-fidelity representations of relevant complex values into its goal system? For this to succeed, it may not be necessary to give the AI exactly the same evaluative dispositions as a biological human. That may not even be desirable as an aim—human nature, after all, is flawed and all too often reveals a proclivity to evil which would be intolerable in any system poised to attain a decisive strategic advantage. Better, perhaps, to aim for a motivation system that departs from the human norm in systematic ways, such as by having a more robust tendency to acquire final goals that are altruistic, compassionate, or high-minded in ways we would recognize as reflecting exceptionally good character if they were present in a human person. To count as improvements, however, such deviations from the human norm would have to be pointed in very particular directions rather than at random; and they would continue to presuppose the existence of a largely undisturbed anthropocentric frame of reference to provide humanly meaningful evaluative generalizations (so as to avoid the kind of perverse instantiation of superficially plausible goal descriptions that we examined in Chapter 8). It is an open question whether this is feasible.

One further issue with associative value accretion is that the AI might disable the accretion mechanism. As we saw in Chapter 7, goal-system integrity is a convergent instrumental value. When the AI reaches a certain stage of cognitive development it may start to regard the continued operation of the accretion mechanism as a corrupting influence.11 This is not necessarily a bad thing, but care would have to be taken to make the sealing-up of the goal system occur at the right moment, after the appropriate values have been accreted but before they have been overwritten by additional unintended accretions.

Motivational scaffolding

Another approach to the value-loading problem is what we may refer to as motivational scaffolding. It involves giving the seed AI an interim goal system, with relatively simple final goals that we can represent by means of explicit coding or some other feasible method. Once the AI has developed more sophisticated representational faculties, we replace this interim scaffold goal system with one that has different final goals. This successor goal system then governs the AI as it develops into a full-blown superintelligence.

Because the scaffold goals are not just instrumental but final goals for the AI, the AI might be expected to resist having them replaced (goal-content integrity being a convergent instrumental value). This creates a hazard. If the AI succeeds in thwarting the replacement of its scaffold goals, the method fails.

To avoid this failure mode, precautions are necessary. For example, capability control methods could be applied to limit the AI’s powers until the mature motivation system has been installed. In particular, one could try to stunt its cognitive development at a level that is safe but that allows it to represent the values that we want to include in its ultimate goals. To do this, one might try to differentially stunt certain types of intellectual abilities, such as those required for strategizing and Machiavellian scheming, while allowing (apparently) more innocuous abilities to develop to a somewhat higher level.

One could also try to use motivation selection methods to induce a more collaborative relationship between the seed AI and the programmer team. For example, one might include in the scaffold motivation system the goal of welcoming online guidance from the programmers, including allowing them to replace any of the AI’s current goals.12 Other scaffold goals might include being transparent to the programmers about its values and strategies, and developing an architecture that is easy for the programmers to understand and that facilitates the later implementation of a humanly meaningful final goal, as well as domesticity motivations (such as limiting the use of computational resources).

One could even imagine endowing the seed AI with the sole final goal of replacing itself with a different final goal, one which may have been only implicitly or indirectly specified by the programmers. Some of the issues raised by the use of such a “self-replacing” scaffold goal also arise in the context of the value learning approach, which is discussed in the next subsection. Some further issues will be discussed in Chapter 13.

The motivational scaffolding approach is not without downsides. One is that it carries the risk that the AI could become too powerful while it is still running on its interim goal system. It may then thwart the human programmers’ efforts to install the ultimate goal system (either by forceful resistance or by quiet subversion). The old final goals may then remain in charge as the seed AI develops into a full-blown superintelligence. Another downside is that installing the ultimately intended goals in a human-level AI is not necessarily that much easier than doing so in a more primitive AI. A human-level AI is more complex and might have developed an architecture that is opaque and difficult to alter. A seed AI, by contrast, is like a tabula rasa on which the programmers can inscribe whatever structures they deem helpful. This downside could be flipped into an upside if one succeeded in giving the seed AI scaffold goals that made it want to develop an architecture helpful to the programmers in their later efforts to install the ultimate final values. However, it is unclear how easy it would be to give a seed AI scaffold goals with this property, and it is also unclear how even an ideally motivated seed AI would be capable of doing a much better job than the human programming team at developing a good architecture.

Value learning

We come now to an important but subtle approach to the value-loading problem. It involves using the AI’s intelligence to learn the values we want it to pursue. To do this, we must provide a criterion for the AI that at least implicitly picks out some suitable set of values. We could then build the AI to act according to its best estimates of these implicitly defined values. It would continually refine its estimates as it learns more about the world and gradually unpacks the implications of the value-determining criterion.

In contrast to the scaffolding approach, which gives the AI an interim scaffold goal and later replaces it with a different final goal, the value learning approach retains an unchanging final goal throughout the AI’s developmental and operational phases. Learning does not change the goal. It changes only the AI’s beliefs about the goal.

The AI thus must be endowed with a criterion that it can use to determine which percepts constitute evidence in favor of some hypothesis about what the ultimate goal is, and which percepts constitute evidence against. Specifying a suitable criterion could be difficult. Part of the difficulty, however, pertains to the problem of creating artificial general intelligence in the first place, which requires a powerful learning mechanism that can discover the structure of the environment from limited sensory inputs. That problem we can set aside here. But even modulo a solution to how to create superintelligent AI, there remain the difficulties that arise specifically from the value-loading problem. With the value learning approach, these take the form of needing to define a criterion that connects perceptual bitstrings to hypotheses about values.

Before delving into the details of how value learning could be implemented, it might be helpful to illustrate the general idea with an example. Suppose we write down a description of a set of values on a piece of paper. We fold the paper and put it in a sealed envelope. We then create an agent with human-level general intelligence, and give it the following final goal: “Maximize the realization of the values described in the envelope.” What will this agent do?

The agent does not initially know what is written in the envelope. But it can form hypotheses, and it can assign those hypotheses probabilities based on their priors and any available empirical data. For instance, the agent might have encountered other examples of human-authored texts, or it might have observed some general patterns of human behavior. This would enable it to make guesses. One does not need a degree in psychology to predict that the note is more likely to describe a value such as “minimize injustice and unnecessary suffering” or “maximize returns to shareholders” than a value such as “cover all lakes with plastic shopping bags.”

When the agent makes a decision, it seeks to take actions that would be effective at realizing the values it believes are most likely to be described in the letter. Importantly, the agent would see a high instrumental value in learning more about what the letter says. The reason is that for almost any final value that might be described in the letter, that value is more likely to be realized if the agent finds out what it is, since the agent will then pursue that value more effectively. The agent would also discover the convergent instrumental reasons described in Chapter 7—goal system integrity, cognitive enhancement, resource acquisition, and so forth. Yet, assuming that the agent assigns a sufficiently high probability to the values described in the letter involving human welfare, it would not pursue these instrumental values by immediately turning the planet into computronium and thereby exterminating the human species, because doing so would risk permanently destroying its ability to realize its final value.

We can liken this kind of agent to a barge attached to several tugboats that pull in different directions. Each tugboat corresponds to a hypothesis about the agent’s final value. The engine power of each tugboat corresponds to the associated hypothesis’s probability, and thus changes as new evidence comes in, producing adjustments in the barge’s direction of motion. The resultant force should move the barge along a trajectory that facilitates learning about the (implicit) final value while avoiding the shoals of irreversible destruction; and later, when the open sea of more definite knowledge of the final value is reached, the one tugboat that still exerts significant force will pull the barge toward the realization of the discovered value along the straightest or most propitious route.

The envelope and barge metaphors illustrate the principle underlying the value learning approach, but they pass over a number of critical technical issues. They come into clearer focus once we start to develop the approach within a formal framework (see Box 10).

One outstanding issue is how to endow the AI with a goal such as “Maximize the realization of the values described in the envelope.” (In the terminology of Box 10, how to define the value criterion.) To do this, it is necessary to identify the place where the values are described. In our example, this requires making a successful reference to the letter in the envelope. Though this might seem trivial, it is not without pitfalls. To mention just one: it is critical that the reference be not simply to a particular external physical object but to an object at a particular time. Otherwise the AI may determine that the best way to attain its goal is by overwriting the original value description with one that provides an easier target (such as the value that for every integer there be a larger integer). This done, the AI could lean back and crack its knuckles—though more likely a malignant failure would ensue, for reasons we discussed in Chapter 8. So now we face the question of how to define time. We could point to a clock and say, “Time is defined by the movements of this device”—but this could fail if the AI conjectures that it can manipulate time by moving the hands on the clock, a conjecture which would indeed be correct if “time” were given the aforesaid definition. (In a realistic case, matters would be further complicated by the fact that the relevant values are not going to be conveniently described in a letter; more likely, they would have to be inferred from observations of pre-existing structures that implicitly contain the relevant information, such as human brains.)

Box 10 Formalizing value learning

Introducing some formal notation can help us see some things more clearly. However, readers who dislike formalism can skip this part.

Consider a simplified framework in which an agent interacts with its environment in a finite number of discrete cycles.13 In cycle k, the agent performs action yk, and then receives the percept xk. The interaction history of an agent with lifespan m is a string y1x1y2x2ymxm (which we can abbreviate as yx1:m or yxm). In each cycle, the agent selects an action based on the percept sequence it has received to date.

Consider first a reinforcement learner. An optimal reinforcement learner (AI-RL) is one that maximizes expected future rewards. It obeys the equation14


The reward sequence rk, …, rm is implied by the percept sequence xk:m, since the reward that the agent receives in a given cycle is part of the percept that the agent receives in that cycle.

As argued earlier, this kind of reinforcement learning is unsuitable in the present context because a sufficiently intelligent agent will realize that it could secure maximum reward if it were able to directly manipulate its reward signal (wireheading). For weak agents, this need not be a problem, since we can physically prevent them from tampering with their own reward channel. We can also control their environment so that they receive rewards only when they act in ways that are agreeable to us. But a reinforcement learner has a strong incentive to eliminate this artificial dependence of its rewards on our whims and wishes. Our relationship with a reinforcement learner is therefore fundamentally antagonistic. If the agent is strong, this spells danger.

Variations of the wireheading syndrome can also affect systems that do not seek an external sensory reward signal but whose goals are defined as the attainment of some internal state. For example, in so-called “actor-critic” systems, there is an actor module that selects actions in order to minimize the disapproval of a separate critic module that computes how far the agent’s behavior falls short of a given performance measure. The problem with this setup is that the actor module may realize that it can minimize disapproval by modifying the critic or eliminating it altogether—much like a dictator who dissolves the parliament and nationalizes the press. For limited systems, the problem can be avoided simply by not giving the actor module any means of modifying the critic module. A sufficiently intelligent and resourceful actor module, however, could always gain access to the critic module (which, after all, is merely a physical process in some computer).15

Before we get to the value learner, let us consider as an intermediary step what has been called an observation-utility maximizer (AI-OUM). It is obtained by replacing the reward series (rk + … + rm) in the AI-RL with a utility function that is allowed to depend on the entire future interaction history of the AI:


This formulation provides a way around the wireheading problem because a utility function defined over an entire interaction history could be designed to penalize interaction histories that show signs of self-deception (or of a failure on the part of the agent to invest sufficiently in obtaining an accurate view of reality).

The AI-OUM thus makes it possible in principle to circumvent the wireheading problem. Availing ourselves of this possibility, however, would require that we specify a suitable utility function over the class of possible interaction histories—a task that looks forbiddingly difficult.

It may be more natural to specify utility functions directly in terms of possible worlds (or properties of possible worlds, or theories about the world) rather than in terms of an agent’s own interaction histories. If we use this approach, we could reformulate and simplify the AI-OUM optimality notion:


Here, E is the total evidence available to the agent (at the time when it is making its decision), and U is a utility function that assigns utility to some class of possible worlds. The optimal agent chooses the act that maximizes expected utility.

An outstanding problem with these formulations is the difficulty of defining the utility function U. This, finally, returns us to the value-loading problem. To enable the utility function to be learned, we must expand our formalism to allow for uncertainty over utility functions. This can be done as follows (AI-VL):16


Here, ν(.) is a function from utility functions to propositions about utility functions. ν(U) is the proposition that the utility function U satisfies the value criterion expressed by ν.17

To decide which action to perform, one could hence proceed as follows: First, compute the conditional probability of each possible world w (given available evidence and on the supposition that action y is to be performed). Second, for each possible utility function U, compute the conditional probability that U satisfies the value criterion ν (conditional on w being the actual world). Third, for each possible utility function U, compute the utility of possible world w. Fourth, combine these quantities to compute the expected utility of action y. Fifth, repeat this procedure for each possible action, and perform the action found to have the highest expected utility (using some arbitrary method to break ties). As described, this procedure—which involves giving explicit and separate consideration to each possible world—is, of course, wildly computationally intractable. The AI would have to use computational shortcuts that approximate this optimality notion.

The question, then, is how to define this value criterion ν.18 Once the AI has an adequate representation of the value criterion, it could in principle use its general intelligence to gather information about which possible worlds are most likely to be the actual one. It could then apply the criterion, for each such plausible possible world w, to find out which utility function satisfies the criterion in w. One can thus regard the AI-VL formula as a way of identifying and separating out this key challenge in the value learning approach—the challenge of how to represent ν. The formalism also brings to light a number of other issues (such as how to define Image, Image, and Image) which would need to be resolved before the approach could be made to work.19

Another issue in coding the goal “Maximize the realization of the values described in the envelope” is that even if all the correct values were described in a letter, and even if the AI’s motivation system were successfully keyed to this source, the AI might not interpret the descriptions the way we intended. This would create a risk of perverse instantiation, as discussed in Chapter 8.

To clarify, the difficulty here is not so much how to ensure that the AI can understand human intentions. A superintelligence should easily develop such understanding. Rather, the difficulty is ensuring that the AI will be motivated to pursue the described values in the way we intended. This is not guaranteed by the AI’s ability to understand our intentions: an AI could know exactly what we meant and yet be indifferent to that interpretation of our words (being motivated instead by some other interpretation of the words or being indifferent to our words altogether).

The difficulty is compounded by the desideratum that, for reasons of safety, the correct motivation should ideally be installed in the seed AI before it becomes capable of fully representing human concepts or understanding human intentions. This requires that somehow a cognitive framework be created, with a particular location in that framework designated in the AI’s motivation system as the repository of its final value. But the cognitive framework itself must be revisable, so as to allow the AI to expand its representational capacities as it learns more about the world and grows more intelligent. The AI might undergo the equivalent of scientific revolutions, in which its worldview is shaken up and it perhaps suffers ontological crises in which it discovers that its previous ways of thinking about values were based on confusions and illusions. Yet starting at a sub-human level of development and continuing throughout all its subsequent development into a galactic superintelligence, the AI’s conduct is to be guided by an essentially unchanging final value, a final value that becomes better understood by the AI in direct consequence of its general intellectual progress—and likely quite differently understood by the mature AI than it was by its original programmers, though not different in a random or hostile way but in a benignly appropriate way. How to accomplish this remains an open question.20 (See Box 11.)

In summary, it is not yet known how to use the value learning approach to install plausible human values (though see Box 12 for some examples of recent ideas). At present, the approach should be viewed as a research program rather than an available technique. If it could be made to work, it might constitute the most ideal solution to the value-loading problem. Among other benefits, it would seem to offer a natural way to prevent mind crime, since a seed AI that makes reasonable guesses about which values its programmers might have installed would anticipate that mind crime is probably negatively evaluated by those values, and thus best avoided, at least until more definitive information has been obtained.

Last, but not least, there is the question of “what to write in the envelope”—or, less metaphorically, the question of which values we should try to get the AI to learn. But this issue is common to all approaches to the AI value-loading problem. We return to it in Chapter 13.

Box 11 An AI that wants to be friendly

Eliezer Yudkowsky has tried to describe some features of a seed AI architecture intended to enable the kind of behavior described in the text above. In his terminology, the AI would use “external reference semantics.”21 To illustrate the basic idea, let us suppose that we want the system to be “friendly.” The system starts out with the goal of trying to instantiate property F but does not initially know much about what F is. It might just know that F is some abstract property and that when the programmers speak of “friendliness,” they are probably trying to convey information about F. Since the AI’s final goal is to instantiate F, an important instrumental value is to learn more about what F is. As the AI discovers more about F, its behavior is increasingly guided by the actual content of F. Thus, hopefully, the AI becomes increasingly friendly the more it learns and the smarter it gets.

The programmers can help this process along, and reduce the risk of the AI making some catastrophic mistake while its understanding of F is still incomplete, by providing the AI with “programmer affirmations,” hypotheses about the nature and content of F to which an initially high probability is assigned. For instance, the hypothesis “misleading the programmers is unfriendly” can be given a high prior probability. These programmer affirmations, however, are not “true by definition”—they are not unchallengeable axioms about the concept of friendliness. Rather, they are initial hypotheses about friendliness, hypotheses to which a rational AI will assign a high probability at least for as long as it trusts the programmers’ epistemic capacities more than its own.

Yudkowsky’s proposal also involves the use of what he called “causal validity semantics.” The idea here is that the AI should do not exactly what the programmers told it to do but rather (something like) what they were trying to tell it to do. While the programmers are trying to explain to the seed AI what friendliness is, they might make errors in their explanations. Moreover, the programmers themselves may not fully understand the true nature of friendliness. One would therefore want the AI to have the ability to correct errors in the programmers’ thinking, and to infer the true or intended meaning from whatever imperfect explanations the programmers manage to provide. For example, the AI should be able to represent the causal processes whereby the programmers learn and communicate about friendliness. Thus, to pick a trivial example, the AI should understand that there is a possibility that a programmer might make a typo while inputting information about friendliness, and the AI should then seek to correct the error. More generally, the AI should seek to correct for whatever distortive influences may have corrupted the flow of information about friendliness as it passed from its source through the programmers to the AI (where “distortive” is an epistemic category). Ideally, as the AI matures, it should overcome any cognitive biases and other more fundamental misconceptions that may have prevented its programmers from fully understanding what friendliness is.

Box 12 Two recent (half-baked) ideas

What we might call the “Hail Mary” approach is based on the hope that elsewhere in the universe there exist (or will come to exist) civilizations that successfully manage the intelligence explosion, and that they end up with values that significantly overlap with our own. We could then try to build our AI so that it is motivated to do what these other superintelligences want it to do.22 The advantage is that this might be easier than to build our AI to be motivated to do what we want directly.

For this scheme to work it is not necessary that our AI can establish communication with any alien superintelligence. Rather, our AI’s actions would be guided by its estimates of what the alien superintelligences would want it to do. Our AI would model the likely outcomes of intelligence explosions elsewhere, and as it becomes superintelligent itself its estimates should become increasingly accurate. Perfect knowledge is not required. There may be a range of plausible outcomes of intelligence explosions, and our AI would then do its best to accommodate the preferences of the various different kinds of superintelligence that might emerge, weighted by probability.

This version of the Hail Mary approach requires that we construct a final value for our AI that refers to the preferences of other superintelligences. Exactly how to do this is not yet clear. However, superintelligent agents might be structurally distinctive enough that we could write a piece of code that would function as a detector that would look at the world model in our developing AI and designate the representational elements that correspond to the presence of a superintelligence. The detector would then, somehow, extract the preferences of the superintelligence in question (as it is represented within our own AI).23 If we could create such a detector, we could then use it to define our AI’s final values. One challenge is that we may need to create the detector before we know what representational framework our AI will develop. The detector may thus need to query an unknown representational framework and extract the preferences of whatever superintelligence may be represented therein. This looks difficult, but perhaps some clever solution can be found.24

If the basic setup could be made to work, various refinements immediately suggest themselves. For example, rather than aiming to follow (some weighted composition of) the preferences of every alien superintelligence, our AI’s final value could incorporate a filter to select a subset of alien superintelligences for obeisance (with the aim of selecting ones whose values are closer to our own). For instance, we might use criteria pertaining to a superintelligence’s causal origin to determine whether to include it in the obeisance set. Certain properties of its origination (which we might be able to define in structural terms) may correlate with the degree to which the resultant superintelligence could be expected to share our values. Perhaps we wish to place more trust in superintelligences whose causal origins trace back to a whole brain emulation, or to a seed AI that did not make heavy use of evolutionary algorithms or that emerged slowly in a way suggestive of a controlled takeoff. (Taking causal origins into account would also let us avoid over-weighting superintelligences that create multiple copies of themselves—indeed would let us avoid creating an incentive for them to do so.) Many other refinements would also be possible.

The Hail Mary approach requires faith that there are other superintelligences out there that sufficiently share our values.25 This makes the approach non-ideal. However, the technical obstacles facing the Hail Mary approach, though very substantial, might possibly be less formidable than those confronting alternative approaches. Exploring non-ideal but more easily implementable approaches can make sense—not with the intention of using them, but to have something to fall back upon in case an ideal solution should not be ready in time.

Another idea for how to solve the value-loading problem has recently been proposed by Paul Christiano.26 Like the Hail Mary, it is a value learning method that tries to define the value criterion by means of a “trick” rather than through laborious construction. By contrast to the Hail Mary, it does not presuppose the existence of other superintelligent agents that we could point to as role models for our own AI. Christiano’s proposal is somewhat resistant to brief explanation—it involves a series of arcane considerations—but we can try to at least gesture at its main elements.

Suppose we could obtain (a) a mathematically precise specification of a particular human brain and (b) a mathematically well-specified virtual environment that contains an idealized computer with an arbitrarily large amount of memory and CPU power. Given (a) and (b), we could define a utility function U as the output the human brain would produce after interacting with this environment. U would be a mathematically well-defined object, albeit one which (because of computational limitations) we may be unable to describe explicitly. Nevertheless, U could serve as the value criterion for a value learning AI, which could use various heuristics for assigning probabilities to hypotheses about what U implies.

Intuitively, we want U to be the utility function that a suitably prepared human would output if she had the advantage of being able use an arbitrarily large amount of computing power—enough computing power, for example, to run astronomical numbers of copies of herself to assist her with her analysis of specifying a utility function, or to help her devise a better process for going about this analysis. (We are here foreshadowing a theme, “coherent extrapolated volition,” which will be further explored in Chapter 13.)

It would seem relatively easy to specify the idealized environment: we can give a mathematical description of an abstract computer with arbitrarily large capacity; and in other respects we could use a virtual reality program that gives a mathematical description of, say, a single room with a computer terminal in it (instantiating the abstract computer). But how to obtain a mathematically precise description of a particular human brain? The obvious way would be through whole brain emulation, but what if the technology for emulation is not available in time?

This is where Christiano’s proposal offers a key innovation. Christiano observes that in order to obtain a mathematically well-specified value criterion, we do not need a practically useful computational model of a mind, a model we could run. We just need a (possibly implicit and hopelessly complicated) mathematical definition—and this may be much easier to attain. Using functional neuroimaging and other measurements, we can perhaps collect gigabytes of data about the input-output behavior of a selected human. If we collect a sufficient amount of data, then it might be that the simplest mathematical model that accounts for all this data is in fact an emulation of the particular human in question. Although it would be computationally intractable for us to find this simplest model from the data, it could be perfectly possible for us to define the model, by referring to the data and a using a mathematically well-defined simplicity measure (such as some variant of the Kolmogorov complexity, which we encountered in Box 1, Chapter 1).27

Emulation modulation

The value-loading problem looks somewhat different for whole brain emulation than it does for artificial intelligence. Methods that presuppose a fine-grained understanding and control of algorithms and architecture are not applicable to emulations. On the other hand, the augmentation motivation selection method—inapplicable to de novo artificial intelligence—is available to be used with emulations (or enhanced biological brains).28

The augmentation method could be combined with techniques to tweak the inherited goals of the system. For example, one could try to manipulate the motivational state of an emulation by administering the digital equivalent of psychoactive substances (or, in the case of biological systems, the actual chemicals). Even now it is possible to pharmacologically manipulate values and motivations to a limited extent.29 The pharmacopeia of the future may contain drugs with more specific and predictable effects. The digital medium of emulations should greatly facilitate such developments, by making controlled experimentation easier and by rendering all cerebral parts directly addressable.

Just as when biological test subjects are used, research on emulations would get entangled in ethical complications, not all of which could be brushed aside with a consent form. Such entanglements could slow progress along the emulation path (because of regulation or moral restraint), perhaps especially hindering studies on how to manipulate the motivational structure of emulations. The result could be that emulations are augmented to potentially dangerous superintelligent levels of cognitive ability before adequate work has been done to test or adjust their final goals. Another possible effect of the moral entanglements might be to give the lead to less scrupulous teams and nations. Conversely, were we to relax our moral standards for experimenting with digital human minds, we could become responsible for a substantial amount of harm and wrongdoing, which is obviously undesirable. Other things equal, these considerations favor taking some alternative path that does not require the extensive use of digital human research subjects in a strategically high-stakes situation.

The issue, however, is not clear-cut. One could argue that whole brain emulation research is less likely to involve moral violations than artificial intelligence research, on the grounds that we are more likely to recognize when an emulation mind qualifies for moral status than we are to recognize when a completely alien or synthetic mind does so. If certain kinds of AIs, or their subprocesses, have a significant moral status that we fail to recognize, the consequent moral violations could be extensive. Consider, for example, the happy abandon with which contemporary programmers create reinforcement-learning agents and subject them to aversive stimuli. Countless such agents are created daily, not only in computer science laboratories but in many applications, including some computer games containing sophisticated non-player characters. Presumably, these agents are still too primitive to have any moral status. But how confident can we really be that this is so? More importantly, how confident can we be that we will know to stop in time, before our programs become capable of experiencing morally relevant suffering?

(We will return in Chapter 14 to some of the broader strategic questions that arise when we compare the desirability of emulation and artificial intelligence paths.)

Institution design

Some intelligent systems consist of intelligent parts that are themselves capable of agency. Firms and states exemplify this in the human world: whilst largely composed of humans they can, for some purposes, be viewed as autonomous agents in their own right. The motivations of such composite systems depend not only on the motivations of their constituent subagents but also on how those subagents are organized. For instance, a group that is organized under strong dictatorship might behave as if it had a will that was identical to the will of the subagent that occupies the dictator role, whereas a democratic group might sometimes behave more as if it had a will that was a composite or average of the wills of its various constituents. But one can also imagine governance institutions that would make an organization behave in a way that is not a simple function of the wills of its subagents. (Theoretically, at least, there could exist a totalitarian state that everybody hated, because the state had mechanisms to prevent its citizens from coordinating a revolt. Each citizen could be worse off by revolting alone than by playing their part in the state machinery.)

By designing appropriate institutions for a composite system, one could thus try to shape its effective motivation. In Chapter 9, we discussed social integration as a possible capability control method. But there we focused on the incentives faced by an agent as a consequence of its existence in a social world of near-equals. Here we are focusing on what happens inside a given agent: how its will is determined by its internal organization. We are therefore looking at a motivation selection method. Moreover, since this kind of internal institution design does not depend on large-scale social engineering or reform, it is a method that might be available to an individual project developing superintelligence even if the wider socioeconomic or international milieu is less than ideally favorable.

Institution design is perhaps most plausible in contexts where it would be combined with augmentation. If we could start with agents that are already suitably motivated or that have human-like motivations, institutional arrangements could be used as an extra safeguard to increase the chances that the system will stay on course.

For example, suppose that we start with some well-motivated human-like agents—let us say emulations. We want to boost the cognitive capacities of these agents, but we worry that the enhancements might corrupt their motivations. One way to deal with this challenge would be to set up a system in which individual emulations function as subagents. When a new enhancement is introduced, it is first applied to a small subset of the subagents. Its effects are then studied by a review panel composed of subagents who have not yet had the enhancement applied to them. Only when these peers have satisfied themselves that the enhancement is not corrupting is it rolled out to the wider subagent population. If the enhanced subagents are found to be corrupted, they are not given further enhancements and are excluded from key decision-making functions (at least until the system as a whole has advanced to a point where the corrupted subagents can be safely reintegrated).30 Although the corrupted subagents might have gained some advantage from the enhancement, the institutional structure in which they are embedded, and the fact that they constitute a small minority of all subagents, would hopefully prevent them from seizing power or spreading their corruption to the overall system. Thus, the collective intelligence and capability of the system could be gradually enhanced in a sequence of small steps, where the soundness of each step is verified by subagents only slightly less capable than the new subagents that are introduced in that step.

The amount of safety obtainable by this kind of institution design is limited by the accuracy of the tests that are used to evaluate the enhanced subagents. The problem of test construction is complicated by the fact that the test batteries could only be validated within the pre-existing range of cognitive ability yet must be applied to agents outside this range. One could mitigate this problem by taking small enhancement steps and by letting the test run for a long time. Such caution, however, would raise the cost and slow progress (which, if a race dynamic is occurring, could mean a project employing these safety measures would place itself at a disadvantage). Even with small steps and extensive testing, there would remain some risk of subtle flaws being introduced, leading to a gradual corruption of the goal system, a corruption that might become manifest only when the system reaches superintelligence. This risk could be reduced by letting the testing condition extend for many steps of cognitive enhancement, so that even agents that have been initially certified as sound and that have received several subsequent enhancements would continue to be under review by a panel of unmodified and uncorrupted original peers and would remain subject to reversion.

Ideally, the testing should be ongoing rather than confined to an initial evaluation period. Human-level subagents have the ability to strategize and might thus choose to conceal certain goals while their behavior was being monitored. To enable continuous testing, a hierarchy might be created in which subagents at a certain level of capability are tasked with monitoring subagents at a somewhat higher level of capability. At the bottom of the capability hierarchy (but at the top of the power hierarchy) would sit the relatively dumb and slow principal. This human principal would be like a demented king who reigns over an incompetent court that oversees a mediocre administration which governs a capable people. Alternatively, he might be likened to the evolutionarily ancient “reptilian brain” which ensures that the newer and much cleverer neocortex is employed in the service of goals such as feeding and copulation.

One might question whether such an inverse meritocracy could remain stable. If the bulk of the superintelligent subagents were found to be corrupt, and the humans in charge of the whole thing attempted to roll back many of the preceding enhancements, would the institutional structure be strong enough to withstand the opposition that such a move would engender? In the context of historical or contemporary human societies, the answer perhaps is that this would be implausible. (Not that the most capable people consistently rise to power, but we are not literally ruled by chimpanzees either.) The issue is less clear, however, in a context where society is implemented as software.31

Imagine a digital hierarchy that has one highly trained police agent for every two proletarians. It also has one special-police agent for every two police agents. It also has one special-special-police agent for every two special-police agents—and so forth, so that each layer of overseers has at least half the numbers of the layer it oversees. Imagine, further, that this society is architected in such a way that a supervisor has big advantages over his subordinates. For example, the supervisor is able to monitor and record everything his subordinates say and do, whereas the supervisor himself is hidden behind a one-way glass, so that his subordinates can only hear the orders and instructions that the superior chooses to transmit. Imagine that supervisors have a panel of buttons that they can press at any time to cause a subordinate to be punished or rewarded, paused, deleted, or reverted to an earlier state. These capabilities would greatly boost the powers of a supervising agent, beyond what has been possible historically in even the most totalitarian of organizations.

Even this does not exhaust the possible instruments of control. Supervisors might also have at their disposal realistic virtual reality simulations that they can use to probe subordinates’ reactions to hypothetical scenarios (such as simulated offers to join a conspiracy or to slack when they think nobody is watching). Supervisors might, additionally, have the ability to eavesdrop on internal monologues in the subordinates’ minds and to directly manipulate their emotional states.

The upshot is a series of massive advantages for the supervisor.32 Each supervisor would be a subordinate to another, higher-level supervisor, who would monitor his every move and make sure he did not slacken in the execution of his managerial duties. With this kind of arrangement, it might be possible for the structure to be stable even if each supervisory level is intellectually somewhat weaker than the level it supervises.33 And the entire structure, which could include many highly superintelligent agents, may be controlled by a small number of humans with root access.34

The cost of such an internal control structure would be significant but not necessarily prohibitive. For example, in the model where each level of supervision has half the numbers of the layer below, the extra computational overhead is bounded at a mere 100% of what the proletarian part of the system costs—less if the dumber boss layers require fewer computations per subagent. Oversight could be tightened by increasing the ratio, or cost lowered by decreasing it. The extra computational cost would be a major factor in a competitive market scenario (cf. Chapter 11) but looks affordable in the context of a project that is not facing strong immediate competition. There would also be a cost in terms of the development time needed to create and test these supervisory functionalities. A well-resourced project could reduce this time cost by parallelizing the development of the control structure with the development of the machine intelligence; but the extra task load could be prohibitive for smaller projects and for projects caught in a close technology race.

One other type of cost also deserves consideration: the risk of mind crimes being committed in this kind of structure.35 As described, the institution sounds like a rather horrible North Korean labor camp. Yet there are ways of at least mitigating the moral problems with running this kind of institution, even if the subagents contained in the institution are emulations with full human moral status. At a minimum, the system could rely on volunteering emulations. Each subagent could have the option at any time of withdrawing its participation.36 Terminated emulations could be stored to memory, with a commitment to restart them under much more ideal conditions once the dangerous phase of the intelligence explosion is over. Meanwhile, subagents who chose to participate could be housed in very comfortable virtual environments and allowed ample time for sleep and recreation. These measures would impose a cost, one that should be manageable for a well-resourced project under noncompetitive conditions. In a highly competitive situation, the cost may be unaffordable unless an enterprise could be assured that its competitors would incur the same cost.

In the example, we imagined the subagents as emulations. One might wonder, does the institution design approach require that the subagents be anthropomorphic? Or is it equally applicable to systems composed of artificial subagents?

One’s first thought here might be skeptical. One notes that despite our plentiful experience with human-like agents, we still cannot precisely predict the outbreak or outcomes of revolutions; social science can, at most, describe some statistical tendencies.37 Since we cannot reliably predict the stability of social structures for ordinary human beings (about which we have much data), it is tempting to infer that we have little hope of precision-engineering stable social structures for cognitively enhanced human-like agents (about which we have no data), and that we have still less hope of doing so for advanced artificial agents (which are not even similar to agents that we have data about).

Yet the matter is not so cut-and-dried. Humans and human-like beings are complex; but artificial agents could have relatively simple architectures. Artificial agents could also have simple and explicitly characterized motivations. Furthermore, digital agents in general (whether emulations or artificial intelligences) are copyable: an affordance that may revolutionize management, much like interchangeable parts revolutionized manufacturing. These differences, together with the opportunity to work with agents that are initially powerless and to create institutional structures that use the various abovementioned control measures, might combine to make it possible to achieve particular institutional outcomes—such as a system that does not revolt—more reliably than if one were working with human beings under historical conditions.

But then again, artificial agents might lack many of the attributes that help us predict the behavior of human-like agents. Artificial agents need not have any of the social emotions that bind human behavior, emotions such as fear, pride, and remorse. Nor need artificial agents develop attachments to friends and family. Nor need they exhibit the unconscious body language that makes it difficult for us humans to conceal our intentions. These deficits might destabilize institutions of artificial agents. Moreover, artificial agents might be capable of making big leaps in cognitive performance as a result of seemingly small changes in their algorithms or architecture. Ruthlessly optimizing artificial agents might be willing to take extreme gambles from which humans would shrink.38 And superintelligent agents might show a surprising ability to coordinate with little or no communication (e.g. by internally modeling each other’s hypothetical responses to various contingencies). These and other differences could make sudden institutional failure more likely, even in the teeth of what seem like Kevlar-clad methods of social control.

It is unclear, therefore, how promising the institution design approach is, and whether it has a greater chance of working with anthropomorphic than with artificial agents. It might be thought that creating an institution with appropriate checks and balances could only increase safety—or, at any rate, not reduce safety—so that from a risk-mitigation perspective it would always be best if the method were used. But even this cannot be said with certainty. The approach adds parts and complexity, and thus may also introduce new ways for things to go wrong that do not exist in the case of an agent that does not have intelligent subagents as parts. Nevertheless, institution design is worthy of further exploration.39


Goal system engineering is not yet an established discipline. It is not currently known how to transfer human values to a digital computer, even given human-level machine intelligence. Having investigated a number of approaches, we found that some of them appear to be dead ends; but others appear to hold promise and deserve to be explored further. A summary is provided in Table 12.

Table 12 Summary of value-loading techniques

Explicit representation

May hold promise as a way of loading domesticity values. Does not seem promising as a way of loading more complex values.

Evolutionary selection

Less promising. Powerful search may find a design that satisfies the formal search criteria but not our intentions. Furthermore, if designs are evaluated by running them—including designs that do not even meet the formal criteria—a potentially grave additional danger is created. Evolution also makes it difficult to avoid massive mind crime, especially if one is aiming to fashion human-like minds.

Reinforcement learning

A range of different methods can be used to solve “reinforcement-learning problems,” but they typically involve creating a system that seeks to maximize a reward signal. This has an inherent tendency to produce the wireheading failure mode when the system becomes more intelligent. Reinforcement learning therefore looks unpromising.

Value accretion

We humans acquire much of our specific goal content from our reactions to experience. While value accretion could in principle be used to create an agent with human motivations, the human value-accretion dispositions might be complex and difficult to replicate in a seed AI. A bad approximation may yield an AI that generalizes differently than humans do and therefore acquires unintended final goals. More research is needed to determine how difficult it would be to make value accretion work with sufficient precision.

Motivational scaffolding

It is too early to tell how difficult it would be to encourage a system to develop internal high-level representations that are transparent to humans (while keeping the system’s capabilities below the dangerous level) and then to use those representations to design a new goal system. The approach might hold considerable promise. (However, as with any untested approach that would postpone much of the hard work on safety engineering until the development of human-level AI, one should be careful not to allow it to become an excuse for a lackadaisical attitude to the control problem in the interim.)

Value learning

A potentially promising approach, but more research is needed to determine how difficult it would be to formally specify a reference that successfully points to the relevant external information about human value (and how difficult it would be to specify a correctness criterion for a utility function in terms of such a reference). Also worth exploring within the value learning category are proposals of the Hail Mary type or along the lines of Paul Christiano’s construction (or other such shortcuts).

Emulation modulation

If machine intelligence is achieved via the emulation pathway, it would likely be possible to tweak motivations through the digital equivalent of drugs or by other means. Whether this would enable values to be loaded with sufficient precision to ensure safety even as the emulation is boosted to superintelligence is an open question. (Ethical constraints might also complicate developments in this direction.)

Institution design

Various strong methods of social control could be applied in an institution composed of emulations. In principle, social control methods could also be applied in an institution composed of artificial intelligences. Emulations have some properties that would make them easier to control via such methods, but also some properties that might make them harder to control than AIs. Institution design seems worthy of further exploration as a potential value-loading technique.

If we knew how to solve the value-loading problem, we would confront a further problem: the problem of deciding which values to load. What, in other words, would we want a superintelligence to want? This is the more philosophical problem to which we turn next.