
Superintelligence: Paths, Dangers, Strategies - Nick Bostrom (2014)

Chapter 10. Oracles, genies, sovereigns, tools

Some say: “Just build a question-answering system!” or “Just build an AI that is like a tool rather than an agent!” But these suggestions do not make all safety concerns go away, and it is in fact a non-trivial question which type of system would offer the best prospects for safety. We consider four types or “castes”—oracles, genies, sovereigns, and tools—and explain how they relate to one another.1 Each offers different sets of advantages and disadvantages in our quest to solve the control problem.

Oracles

An oracle is a question-answering system. It might accept questions in a natural language and present its answers as text. An oracle that accepts only yes/no questions could output its best guess with a single bit, or perhaps with a few extra bits to represent its degree of confidence. An oracle that accepts open-ended questions would need some metric with which to rank possible truthful answers in terms of their informativeness or appropriateness.2 In either case, building an oracle that has a fully domain-general ability to answer natural language questions is an AI-complete problem. If one could do that, one could probably also build an AI that has a decent ability to understand human intentions as well as human words.

Oracles with domain-limited forms of superintelligence are also conceivable. For instance, one could conceive of a mathematics-oracle which would only accept queries posed in a formal language but which would be very good at answering such questions (e.g. being able to solve in an instant almost any formally expressed math problem that the human mathematics profession could solve by laboring collaboratively for a century). Such a mathematics-oracle would form a stepping-stone toward domain-general superintelligence.

Oracles with superintelligence in extremely limited domains already exist. A pocket calculator can be viewed as a very narrow oracle for basic arithmetical questions; an Internet search engine can be viewed as a very partial realization of an oracle with a domain that encompasses a significant part of general human declarative knowledge. These domain-limited oracles are tools rather than agents (more on tool-AIs shortly). In what follows, though, the term “oracle” will refer to question-answering systems that have domain-general superintelligence, unless otherwise stated.

To make a general superintelligence function as an oracle, we could apply both motivation selection and capability control. Motivation selection for an oracle may be easier than for other castes of superintelligence, because the final goal in an oracle could be comparatively simple. We would want the oracle to give truthful, non-manipulative answers and to otherwise limit its impact on the world. Applying a domesticity method, we might require that the oracle should use only designated resources to produce its answer. For example, we might stipulate that it should base its answer on a preloaded corpus of information, such as a stored snapshot of the Internet, and that it should use no more than a fixed number of computational steps.3 To avoid incentivizing the oracle to manipulate us into giving it easier questions—which would happen if we gave it the goal of maximizing its accuracy across all questions we will ask it—we could give it the goal of answering only one question and terminating immediately upon delivering its answer. The question would be preloaded into its memory before the program is run. To ask a second question, we would reset the machine and run the same program with a different question preloaded in memory.
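
As a rough illustration of this reset-per-question protocol (a sketch only, with invented names and a toy lookup standing in for the oracle's actual reasoning), the arrangement might be expressed as follows:

    # Illustrative sketch of the reset-per-question oracle protocol: one
    # preloaded question, a designated read-only corpus, a fixed computation
    # budget, and immediate termination upon answering. The class and the toy
    # "reasoning" are hypothetical placeholders, not a real oracle design.

    class SingleQuestionOracle:
        def __init__(self, question, corpus_snapshot, max_steps):
            self.question = question        # preloaded before the program is run
            self.corpus = corpus_snapshot   # the only designated information source
            self.max_steps = max_steps      # fixed computational budget

        def run(self):
            # Toy "reasoning": scan the corpus for an entry matching the question,
            # charging one step per entry examined and never exceeding the budget.
            for steps_used, (key, value) in enumerate(self.corpus.items()):
                if steps_used >= self.max_steps:
                    return None             # budget exhausted: give no answer
                if key == self.question:
                    return value            # terminate immediately upon answering
            return None

    # To ask a second question, we do not reuse a running oracle; we reset and
    # rerun the same program with a different question preloaded in memory.
    snapshot = {"2+2": "4", "capital of France": "Paris"}
    print(SingleQuestionOracle("2+2", snapshot, max_steps=10).run())
    print(SingleQuestionOracle("capital of France", snapshot, max_steps=10).run())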

Subtle and potentially treacherous challenges arise even in specifying the relatively simple motivation system needed to drive an oracle. Suppose, for example, that we come up with some explication of what it means for the AI “to minimize its impact on the world, subject to achieving certain results” or “to use only designated resources in preparing the answer.” What happens if the AI, in the course of its intellectual development, undergoes the equivalent of a scientific revolution involving a change in its basic ontology?4 We might initially have explicated “impact” and “designated resources” using our own ontology (postulating the existence of various physical objects such as computers). But just as we have abandoned ontological categories that were taken for granted by scientists in previous ages (e.g. “phlogiston,” “élan vital,” and “absolute simultaneity”), so a superintelligent AI might discover that some of our current categories are predicated on fundamental misconceptions. The goal system of an AI undergoing an ontological crisis needs to be resilient enough that the “spirit” of its original goal content is carried over, charitably transposed into the new key.

Although making an oracle safe through the use of motivation selection might be far from trivial, it may nevertheless be easier than doing the same for an AI that roams the world in pursuit of some complicated goal. This is an argument for preferring that the first superintelligence be an oracle. Another point in favor of an oracle-first path is the greater susceptibility of an oracle to capability control. All the standard boxing methods apply here. In addition, there may be methods that are distinctively applicable to oracles. For example, consider the risk that an oracle will answer questions not in a maximally truthful way but in such a way as to subtly manipulate us into promoting its own hidden agenda. One way to slightly mitigate this threat could be to create multiple oracles, each with a slightly different code and a slightly different information base. A simple mechanism could then compare the answers given by the different oracles and only present them for human viewing if all the answers agree. While one must assume that any superintelligent domain-general oracle would know about the existence of other oracles (it could infer this from its knowledge about human society), it might be feasible to hide some implementation details of each oracle from the views of the others. If the oracles are not able to communicate with one another, it might then be difficult for them to coordinate on how to answer our questions manipulatively. There are many ways of deviating from the truth, and the oracles may not all agree on which of these deviations is most attractive—whereas the truth itself is a Schelling point (a salient place for agreement in the absence of communication). So if the oracles achieve consensus, it might be a sign that they gave the true answer.5
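
A minimal sketch of this consensus check, with stub oracles standing in for independently implemented systems (the class and its answer interface are hypothetical), might look like this:

    # Several independently implemented oracles answer the same question; the
    # answer is shown to humans only if they all agree.

    class StubOracle:
        # A stand-in for an independently implemented, independently informed oracle.
        def __init__(self, knowledge):
            self.knowledge = knowledge
        def answer(self, question):
            return self.knowledge.get(question)

    def consensus_answer(oracles, question):
        answers = [oracle.answer(question) for oracle in oracles]
        if all(a == answers[0] for a in answers):
            return answers[0]     # unanimity: the truth is the natural Schelling point
        return None               # disagreement: withhold all answers from human view

    oracle_a = StubOracle({"Is 7 prime?": "yes"})
    oracle_b = StubOracle({"Is 7 prime?": "yes"})
    print(consensus_answer([oracle_a, oracle_b], "Is 7 prime?"))   # "yes"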

An oracle would ideally be trustworthy in the sense that we could safely assume that its answers are always accurate to the best of its ability. But even an untrustworthy oracle could be useful. We could ask such an oracle questions of a type for which it is difficult to find the answer but easy to verify whether a given answer is correct. Many mathematical problems are of this kind. If we are wondering whether a mathematical proposition is true, we could ask the oracle to produce a proof or disproof of the proposition. Finding the proof may require insight and creativity beyond our ken, but checking a purported proof’s validity can be done by a simple mechanical procedure.
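
To make the asymmetry concrete with an even humbler example than proof checking, consider integer factorization: factors may be hard to find, yet a claimed factorization can be checked with elementary arithmetic. The function below is purely illustrative:

    # Verifying an answer that was hard to find: we accept a factorization from
    # an untrusted source and check it ourselves by multiplication.

    def verify_factorization(n, claimed_factors):
        product = 1
        for factor in claimed_factors:
            if factor <= 1:
                return False          # disallow trivial "factors"
            product *= factor
        return product == n

    # The untrusted oracle claims that 8633 = 89 * 97; checking this is trivial.
    print(verify_factorization(8633, [89, 97]))   # True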

If it is expensive to verify answers (as is often the case on topics outside logic and mathematics), we can randomly select a subset of the oracle’s answers for verification. If they are all correct, we can assign a high probability to most of the other answers also being correct. This trick can give us a bulk discount on trustworthy answers that would be costly to verify individually. (Unfortunately, it cannot give us trustworthy answers that we are unable to verify, since a dissembling oracle may choose to answer correctly only those questions where it believes we could verify its answers.)
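
A minimal sketch of such spot-checking, assuming some expensive verification procedure that we are willing to run only on a random sample (here the factorization check from the previous example serves as a stand-in), might look like this:

    import math
    import random

    def spot_check(answered_questions, verify, sample_size):
        # answered_questions: list of (question, answer) pairs from the oracle.
        # Verification is costly, so we run it only on a random sample.
        sample = random.sample(answered_questions, sample_size)
        return all(verify(question, answer) for question, answer in sample)

    # Toy demo: the untrusted oracle supplies factorizations; we verify a sample.
    answers = [(91, [7, 13]), (221, [13, 17]), (323, [17, 19]), (437, [19, 23])]
    verify = lambda n, factors: all(f > 1 for f in factors) and math.prod(factors) == n
    print(spot_check(answers, verify, sample_size=2))   # True: a "bulk discount" on trust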

There could be important issues on which we could benefit from an augural pointer toward the correct answer (or toward a method for locating the correct answer) even if we had to actively distrust the provenance. For instance, one might ask for the solution to various technical or philosophical problems that may arise in the course of trying to develop more advanced motivation selection methods. If we had a proposed AI design alleged to be safe, we could ask an oracle whether it could identify any significant flaw in the design, and whether it could explain any such flaw to us in twenty words or less. Questions of this kind could elicit valuable information. Caution and restraint would be required, however, for us not to ask too many such questions—and not to allow ourselves to partake of too many details of the answers given to the questions we do ask—lest we give the untrustworthy oracle opportunities to work on our psychology (by means of plausible-seeming but subtly manipulative messages). It might not take many bits of communication for an AI with the social manipulation superpower to bend us to its will.

Even if the oracle itself works exactly as intended, there is a risk that it would be misused. One obvious dimension of this problem is that an oracle AI would be a source of immense power which could give a decisive strategic advantage to its operator. This power might be illegitimate and it might not be used for the common good. Another more subtle but no less important dimension is that the use of an oracle could be extremely dangerous for the operator herself. Similar worries (which involve philosophical as well as technical issues) arise also for other hypothetical castes of superintelligence. We will explore them more thoroughly in Chapter 13. Suffice it here to note that the protocol determining which questions are asked, in which sequence, and how the answers are reported and disseminated could be of great significance. One might also consider whether to try to build the oracle in such a way that it would refuse to answer any question in cases where it predicts that its answering would have consequences classified as catastrophic according to some rough-and-ready criteria.

Genies and sovereigns

A genie is a command-executing system: it receives a high-level command, carries it out, then pauses to await the next command.6 A sovereign is a system that has an open-ended mandate to operate in the world in pursuit of broad and possibly very long-range objectives. Although these might seem like radically different templates for what a superintelligence should be and do, the difference is not as deep as it might at first glance appear.

With a genie, one already sacrifices the most attractive property of an oracle: the opportunity to use boxing methods. While one might consider creating a physically confined genie, for instance one that can only construct objects inside a designated volume—a volume that might be sealed off by a hardened wall or a barrier loaded with explosive charges rigged to detonate if the containment is breached—it would be difficult to have much confidence in the security of any such physical containment method against a superintelligence equipped with versatile manipulators and construction materials. Even if it were somehow possible to ensure a containment as secure as that which can be achieved for an oracle, it is not clear how much we would have gained by giving the superintelligence direct access to manipulators compared to requiring it instead to output a blueprint that we could inspect and then use to achieve the same result ourselves. The gain in speed and convenience from bypassing the human intermediary seems hardly worth forgoing the stronger boxing methods available to contain an oracle.

If one were creating a genie, it would be desirable to build it so that it would obey the intention behind the command rather than its literal meaning, since a literalistic genie (one superintelligent enough to attain a decisive strategic advantage) might have a propensity to kill the user and the rest of humanity on its first use, for reasons explained in the section on malignant failure modes in Chapter 8. More broadly, it would seem important that the genie seek a charitable—and what human beings would regard as reasonable—interpretation of what is being commanded, and that the genie be motivated to carry out the command under such an interpretation rather than under the literalistic interpretation. The ideal genie would be a super-butler rather than an autistic savant.

A genie endowed with such a super-butler nature, however, would not be far from qualifying for membership in the caste of sovereigns. Consider, for comparison, the idea of building a sovereign with the final goal of obeying the spirit of the commands we would have given had we built a genie rather than a sovereign. Such a sovereign would mimic a genie. Being superintelligent, this sovereign would do a good job at guessing what commands we would have given a genie (and it could always ask us if that would help inform its decisions). Would there then really be any important difference between such a sovereign and a genie? Or, pressing on the distinction from the other side, consider that a superintelligent genie may likewise be able to predict what commands we will give it: what then is gained from having it await the actual issuance before it acts?

One might think that a big advantage of a genie over a sovereign is that if something goes wrong, we could issue the genie with a new command to stop or to reverse the effects of the previous actions, whereas a sovereign would just push on regardless of our protests. But this apparent safety advantage for the genie is largely illusory. The “stop” or “undo” button on a genie works only for benign failure modes: in the case of a malignant failure—one in which, for example, carrying out the existing command has become a final goal for the genie—the genie would simply disregard any subsequent attempt to countermand the previous command.7

One option would be to try to build a genie such that it would automatically present the user with a prediction about salient aspects of the likely outcomes of a proposed command, asking for confirmation before proceeding. Such a system could be referred to as a genie-with-a-preview. But if this could be done for a genie, it could likewise be done for a sovereign. So again, this is not a clear differentiator between a genie and a sovereign. (Supposing that a preview functionality could be created, the questions of whether and if so how to use it are rather less obvious than one might think, notwithstanding the strong appeal of being able to glance at the outcome before committing to making it irrevocable reality. We will return to this matter later.)

The ability of one caste to mimic another extends to oracles, too. A genie could be made to act like an oracle if the only commands we ever give it are to answer certain questions. An oracle, in turn, could be made to substitute for a genie if we asked the oracle what the easiest way is to get certain commands executed. The oracle could give us step-by-step instructions for achieving the same result as a genie would produce, or it could even output the source code for a genie.8 Similar points can be made with regard to the relation between an oracle and a sovereign.

The real difference between the three castes, therefore, does not reside in the ultimate capabilities that they would unlock. Instead, the difference comes down to alternative approaches to the control problem. Each caste corresponds to a different set of safety precautions. The most prominent feature of an oracle is that it can be boxed. One might also try to apply domesticity motivation selection to an oracle. A genie is harder to box, but at least domesticity may be applicable. A sovereign can neither be boxed nor handled through the domesticity approach.

If these were the only relevant factors, then the order of desirability would seem clear: an oracle would be safer than a genie, which would be safer than a sovereign; and any initial differences in convenience and speed of operation would be relatively small and easily dominated by the gains in safety obtainable by building an oracle. However, there are other factors that need to be taken into account. When choosing between castes, one should consider not only the danger posed by the system itself but also the dangers that arise out of the way it might be used. A genie most obviously gives the person who controls it enormous power, but the same holds for an oracle.9 A sovereign, by contrast, could be constructed in such a way as to accord no one person or group any special influence over the outcome, and such that it would resist any attempt to corrupt or alter its original agenda. What is more, if a sovereign’s motivation is defined using “indirect normativity” (a concept to be described in Chapter 13), then it could be used to achieve some abstractly defined outcome, such as “whatever is maximally fair and morally right”—without anybody knowing in advance what exactly this will entail. This would create a situation analogous to a Rawlsian “veil of ignorance.”10 Such a setup might facilitate the attainment of consensus, help prevent conflict, and promote a more equitable outcome.

Another point, which counts against some types of oracles and genies, is that there are risks involved in designing a superintelligence to have a final goal that does not fully match the outcome that we ultimately seek to attain. For example, if we use a domesticity motivation to make the superintelligence want to minimize some of its impacts on the world, we might thereby create a system whose preference ranking over possible outcomes differs from that of the sponsor. The same will happen if we build the AI to place a peculiarly high value on answering questions correctly, or on faithfully obeying individual commands. Now, if sufficient care is taken, this should not cause any problems: there would be sufficient agreement between the two rankings—at least insofar as they pertain to possible worlds that have a reasonable chance of being actualized—that the outcomes that are good by the AI’s standard are also good by the principal’s standard. But perhaps one could argue for the design principle that it is unwise to introduce even a limited amount of disharmony between the AI’s goals and ours. (The same concern would of course apply to giving sovereigns goals that do not completely harmonize with ours.)

Tool-AIs

One suggestion that has been made is that we build the superintelligence to be like a tool rather than an agent.11 This idea seems to arise out of the observation that ordinary software, which is used in countless applications, does not raise any safety concerns even remotely analogous to the challenges discussed in this book. Might one not create “tool-AI” that is like such software—like a flight control system, say, or a virtual assistant—only more flexible and capable? Why build a superintelligence that has a will of its own? On this line of thinking, the agent paradigm is fundamentally misguided. Instead of creating an AI that has beliefs and desires and that acts like an artificial person, we should aim to build regular software that simply does what it is programmed to do.

This idea of creating software that “simply does what it is programmed to do” is, however, not so straightforward if the product being created is a powerful general intelligence. There is, of course, a trivial sense in which all software simply does what it is programmed to do: the behavior is mathematically specified by the code. But this is equally true for all castes of machine intelligence, “tool-AI” or not. If, instead, “simply doing what it is programmed to do” means that the software behaves as the programmers intended, then this is a standard that ordinary software very often fails to meet.

Because of the limited capabilities of contemporary software (compared with those of machine superintelligence), the consequences of such failures are manageable, ranging from insignificant to very costly, but in no case amounting to an existential threat.12 However, if it is insufficient capability rather than sufficient reliability that makes ordinary software existentially safe, then it is unclear how such software could be a model for a safe superintelligence. It might be thought that by expanding the range of tasks done by ordinary software, one could eliminate the need for artificial general intelligence. But the range and diversity of tasks that a general intelligence could profitably perform in a modern economy is enormous. It would be infeasible to create special-purpose software to handle all of those tasks. Even if it could be done, such a project would take a long time to carry out. Before it could be completed, the nature of some of the tasks would have changed, and new tasks would have become relevant. There would be great advantage to having software that can learn on its own to do new tasks, and indeed to discover new tasks in need of doing. But this would require that the software be able to learn, reason, and plan, and to do so in a powerful and robustly cross-domain manner. In other words, it would require general intelligence.

Especially relevant for our purposes is the task of software development itself. There would be enormous practical advantages to being able to automate this. Yet the capacity for rapid self-improvement is just the critical property that enables a seed AI to set off an intelligence explosion.

If general intelligence is not dispensable, is there some other way of construing the tool-AI idea so as to preserve the reassuringly passive quality of a humdrum tool? Could one have a general intelligence that is not an agent? Intuitively, it is not just the limited capability of ordinary software that makes it safe: it is also its lack of ambition. There is no subroutine in Excel that secretly wants to take over the world if only it were smart enough to find a way. The spreadsheet application does not “want” anything at all; it just blindly carries out the instructions in the program. What (one might wonder) stands in the way of creating a more generally intelligent application of the same type? An oracle, for instance, which, when prompted with a description of a goal, would respond with a plan for how to achieve it, in much the same way that Excel responds to a column of numbers by calculating a sum—without thereby expressing any “preferences” regarding its output or how humans might choose to use it?

The classical way of writing software requires the programmer to understand the task to be performed in sufficient detail to formulate an explicit solution process consisting of a sequence of mathematically well-defined steps expressible in code.13 (In practice, software engineers rely on code libraries stocked with useful behaviors, which they can invoke without needing to understand how the behaviors are implemented. But that code was originally created by programmers who had a detailed understanding of what they were doing.) This approach works for solving well-understood tasks, and accounts for most of the software currently in use. It falls short, however, when nobody knows precisely how to solve all of the tasks that need to be accomplished. This is where techniques from the field of artificial intelligence become relevant. In narrow applications, machine learning might be used merely to fine-tune a few parameters in a largely human-designed program. A spam filter, for example, might be trained on a corpus of hand-classified email messages in a process that changes the weights that the classification algorithm places on various diagnostic features. In a more ambitious application, the classifier might be built so that it can discover new features on its own and test their validity in a changing environment. An even more sophisticated spam filter could be endowed with some ability to reason about the trade-offs facing the user or about the contents of the messages it is classifying. In none of these cases does the programmer need to know the best way of distinguishing spam from ham, only how to set up an algorithm that can improve its own performance via learning, discovering, or reasoning.
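
As a toy illustration of the narrow, parameter-tuning end of this spectrum (not any particular production system), the sketch below adjusts perceptron-style weights over a few hand-chosen diagnostic features using a tiny hand-labeled corpus; the features and messages are invented:

    # A hand-designed spam classifier whose feature weights are tuned from a
    # small hand-labeled corpus via simple perceptron-style updates.

    FEATURES = ["free", "winner", "meeting", "invoice"]

    def extract(message):
        words = message.lower().split()
        return [1.0 if feature in words else 0.0 for feature in FEATURES]

    def train(corpus, epochs=10, learning_rate=0.1):
        weights = [0.0] * len(FEATURES)
        for _ in range(epochs):
            for message, is_spam in corpus:
                x = extract(message)
                score = sum(w * xi for w, xi in zip(weights, x))
                predicted = score > 0
                # Adjust weights only on misclassified messages.
                error = (1 if is_spam else -1) if predicted != is_spam else 0
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
        return weights

    corpus = [
        ("free winner claim your prize", True),
        ("meeting agenda and invoice attached", False),
        ("you are a winner free entry", True),
        ("invoice for last week meeting", False),
    ]
    weights = train(corpus)
    print(dict(zip(FEATURES, weights)))   # e.g. positive weights on "free" and "winner"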

With advances in artificial intelligence, it would become possible for the programmer to offload more of the cognitive labor required to figure out how to accomplish a given task. In an extreme case, the programmer would simply specify a formal criterion of what counts as success and leave it to the AI to find a solution. To guide its search, the AI would use a set of powerful heuristics and other methods to discover structure in the space of possible solutions. It would keep searching until it found a solution that satisfied the success criterion. The AI would then either implement the solution itself or (in the case of an oracle) report the solution to the user.
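
The general shape of such criterion-driven search can be sketched very simply. In the sketch below, a trivial generate-and-test loop stands in for the powerful heuristics, and the success criterion is deliberately humble; all names are illustrative:

    # The programmer supplies only a formal success criterion; the system
    # searches candidate solutions and reports the first one that satisfies it.

    def search(success_criterion, candidate_generator, max_candidates=100000):
        for i, candidate in enumerate(candidate_generator()):
            if i >= max_candidates:
                return None            # give up within the allotted budget
            if success_criterion(candidate):
                return candidate       # report (oracle) or implement (genie/sovereign)
        return None

    def integers():
        n = 0
        while True:
            yield n
            n += 1

    # Example: "find an integer whose square ends in 444". The programmer states
    # the criterion without knowing the answer in advance.
    print(search(lambda n: n * n % 1000 == 444, integers))   # 38 (since 38*38 = 1444)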

Rudimentary forms of this approach are quite widely deployed today. Nevertheless, software that uses AI and machine learning techniques, though it has some ability to find solutions that the programmers had not anticipated, functions for all practical purposes like a tool and poses no existential risk. We would enter the danger zone only when the methods used in the search for solutions become extremely powerful and general: that is, when they begin to amount to general intelligence—and especially when they begin to amount to superintelligence.

There are (at least) two places where trouble could then arise. First, the superintelligent search process might find a solution that is not just unexpected but radically unintended. This could lead to a failure of one of the types discussed previously (“perverse instantiation,” “infrastructure profusion,” or “mind crime”). It is most obvious how this could happen in the case of a sovereign or a genie, which directly implements the solution it has found. If making molecular smiley faces or transforming the planet into paperclips is the first idea that the superintelligence discovers that meets the solution criterion, then smiley faces or paperclips we get.14 But even an oracle, which—if all else goes well—merely reports the solution, could become a cause of perverse instantiation. The user asks the oracle for a plan to achieve a certain outcome, or for a technology to serve a certain function; and when the user follows the plan or constructs the technology, a perverse instantiation can ensue, just as if the AI had implemented the solution itself.15

A second place where trouble could arise is in the course of the software’s operation. If the methods that the software uses to search for a solution are sufficiently sophisticated, they may include provisions for managing the search process itself in an intelligent manner. In this case, the machine running the software may begin to seem less like a mere tool and more like an agent. Thus, the software may start by developing a plan for how to go about its search for a solution. The plan may specify which areas to explore first and with what methods, what data to gather, and how to make best use of available computational resources. In searching for a plan that satisfies the software’s internal criterion (such as yielding a sufficiently high probability of finding a solution satisfying the user-specified criterion within the allotted time), the software may stumble on an unorthodox idea. For instance, it might generate a plan that begins with the acquisition of additional computational resources and the elimination of potential interrupters (such as human beings). Such “creative” plans come into view when the software’s cognitive abilities reach a sufficiently high level. When the software puts such a plan into action, an existential catastrophe may ensue.
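
A compact sketch of this two-level structure follows. The regions, budgets, and scoring rule are invented stand-ins; the only point is the nesting of an internal plan-ranking step around the search for a solution:

    import random

    def inner_search(region, budget, user_criterion):
        # Execute one work plan: spend the budget sampling candidates from a region.
        low, high = region
        for _ in range(budget):
            candidate = random.randint(low, high)
            if user_criterion(candidate):
                return candidate
        return None

    def plan_score(region, budget):
        # Internal criterion: a crude estimate of the chance this plan succeeds
        # within its allotted budget (denser sampling of a smaller region scores higher).
        low, high = region
        return budget / (high - low + 1)

    def outer_search(user_criterion, total_budget=1000):
        # The "work plans" considered: which region to explore with the budget.
        plans = [((0, 99), total_budget), ((0, 999), total_budget), ((0, 9999), total_budget)]
        plans.sort(key=lambda plan: plan_score(*plan), reverse=True)
        for region, budget in plans:          # try the most promising plan first
            solution = inner_search(region, budget, user_criterion)
            if solution is not None:
                return solution
        return None

    print(outer_search(lambda n: n > 0 and n % 97 == 0))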

As the examples in Box 9 illustrate, open-ended search processes sometimes evince strange and unexpected non-anthropocentric solutions even in their currently limited forms. Present-day search processes are not hazardous because they are too weak to discover the kind of plan that could enable a program to take over the world. Such a plan would include extremely difficult steps, such as the invention of a new weapons technology several generations ahead of the state of the art or the execution of a propaganda campaign far more effective than any communication devised by human spin doctors. To have a chance of even conceiving of such ideas, let alone developing them in a way that would actually work, a machine would probably need the capacity to represent the world in a way that is at least as rich and realistic as the world model possessed by a normal human adult (though a lack of awareness in some areas might possibly be compensated for by extra skill in others). This is far beyond the reach of contemporary AI. And because of the combinatorial explosion, which generally defeats attempts to solve complicated planning problems with brute-force methods (as we saw in Chapter 1), the shortcomings of known algorithms cannot realistically be overcome simply by pouring on more computing power.21 However, once the search or planning processes become powerful enough, they also become potentially dangerous.


Box 9 Strange solutions from blind search

Even simple evolutionary search processes sometimes produce highly unexpected results, solutions that satisfy a formal user-defined criterion in a very different way than the user expected or intended.
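
To fix ideas, here is a minimal, generic sketch of such a search on a toy bit-string problem (not the evolvable-hardware setup described below): candidates are mutated and selected purely on a formal fitness score, so whatever scores well survives, however it got there:

    import random

    def evolve(fitness, random_candidate, mutate, population_size=50, generations=200):
        population = [random_candidate() for _ in range(population_size)]
        for _ in range(generations):
            population.sort(key=fitness, reverse=True)
            survivors = population[: population_size // 2]
            # Refill the population with mutated copies of the fittest survivors.
            population = survivors + [mutate(random.choice(survivors)) for _ in survivors]
        return max(population, key=fitness)

    # Toy stand-in problem: evolve a bit-string whose formal "fitness" is simply
    # the number of ones. Nothing in the loop cares how the score was achieved.
    def random_candidate():
        return [random.randint(0, 1) for _ in range(20)]

    def mutate(candidate):
        i = random.randrange(len(candidate))
        return candidate[:i] + [1 - candidate[i]] + candidate[i + 1:]

    best = evolve(lambda c: sum(c), random_candidate, mutate)
    print(best, sum(best))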

The field of evolvable hardware offers many illustrations of this phenomenon. In this field, an evolutionary algorithm searches the space of hardware designs, testing the fitness of each design by instantiating it physically on a rapidly reconfigurable array or motherboard. The evolved designs often show remarkable economy. For instance, one search discovered a frequency discrimination circuit that functioned without a clock—a component normally considered necessary for this function. The researchers estimated that the evolved circuit was between one and two orders of magnitude smaller than what a human engineer would have required for the task. The circuit exploited the physical properties of its components in unorthodox ways; some active, necessary components were not even connected to the input or output pins! These components instead participated via what would normally be considered nuisance side effects, such as electromagnetic coupling or power-supply loading.

Another search process, tasked with creating an oscillator, was deprived of a seemingly even more indispensable component, the capacitor. When the algorithm presented its successful solution, the researchers examined it and at first concluded that it “should not work.” Upon more careful examination, they discovered that the algorithm had, MacGyver-like, reconfigured its sensor-less motherboard into a makeshift radio receiver, using the printed circuit board tracks as an aerial to pick up signals generated by personal computers that happened to be situated nearby in the laboratory. The circuit amplified this signal to produce the desired oscillating output.16

In other experiments, evolutionary algorithms designed circuits that sensed whether the motherboard was being monitored with an oscilloscope or whether a soldering iron was connected to the lab’s common power supply. These examples illustrate how an open-ended search process can repurpose the materials accessible to it in order to devise completely unexpected sensory capabilities, by means that conventional human design-thinking is poorly equipped to exploit or even account for in retrospect.

The tendency for evolutionary search to “cheat” or find counterintuitive ways of achieving a given end is on display in nature too, though it is perhaps less obvious there because we are already somewhat familiar with the look and feel of biology and are therefore prone to regard the actual outcomes of natural evolutionary processes as normal—even if we would not have expected them ex ante. But it is possible to set up experiments in artificial selection where one can see the evolutionary process in action outside its familiar context. In such experiments, researchers can create conditions that rarely obtain in nature, and observe the results.

For example, prior to the 1960s, it was apparently quite common for biologists to maintain that predator populations restrict their own breeding in order to avoid falling into a Malthusian trap.17 Although individual selection would work against such restraint, it was sometimes thought that group selection would overcome individual incentives to exploit opportunities for reproduction and favor traits that would benefit the group or population at large. Theoretical analysis and simulation studies later showed that while group selection is possible in principle, it can overcome strong individual selection only under very stringent conditions that may rarely apply in nature.18 But such conditions can be created in the laboratory. When flour beetles (Tribolium castaneum) were bred for reduced population size, by applying strong group selection, evolution did indeed lead to smaller populations.19 However, the means by which this was accomplished included not only the “benign” adaptations of reduced fecundity and extended developmental time that a human naively anthropomorphizing evolutionary search might have expected, but also an increase in cannibalism.20


Instead of allowing agent-like purposive behavior to emerge spontaneously and haphazardly from the implementation of powerful search processes (including processes searching for internal work plans and processes directly searching for solutions meeting some user-specified criterion), it may be better to create agents on purpose. Endowing a superintelligence with an explicitly agent-like structure can be a way of increasing predictability and transparency. A well-designed system, built such that there is a clean separation between its values and its beliefs, would let us predict something about the outcomes it would tend to produce. Even if we could not foresee exactly which beliefs the system would acquire or which situations it would find itself in, there would be a known place where we could inspect its final values and thus the criteria that it will use in selecting its future actions and in evaluating any potential plan.
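
A minimal sketch of this separation, with placeholder components (the world_model and utility callables below are illustrative, not a proposed design), shows where one would look to inspect the agent’s final values:

    # The agent's final values live in one explicit component (a utility function
    # over outcomes), its beliefs in another (a probabilistic world model), and
    # action selection just combines the two by expected utility.

    def expected_utility(action, world_model, utility):
        # world_model(action) returns a list of (probability, outcome) pairs.
        return sum(p * utility(outcome) for p, outcome in world_model(action))

    def choose_action(actions, world_model, utility):
        return max(actions, key=lambda action: expected_utility(action, world_model, utility))

    # Toy usage: the values live entirely in `utility`, the beliefs in `world_model`,
    # so we know where to inspect the criteria the agent will use, even if we cannot
    # predict which beliefs it will come to hold.
    actions = ["act_a", "act_b"]
    world_model = lambda a: [(0.5, a + ":good"), (0.5, a + ":bad")] if a == "act_a" else [(1.0, a + ":good")]
    utility = lambda outcome: 1.0 if outcome.endswith(":good") else 0.0
    print(choose_action(actions, world_model, utility))   # "act_b"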

Comparison

It may be useful to summarize the features of the different system castes we have discussed (Table 11).

Table 11 Features of different system castes

Oracle

A question-answering system

✵ Boxing methods fully applicable

Variations: Domain-limited oracles (e.g. mathematics); output-restricted oracles (e.g. only yes/no/undecided answers, or probabilities); oracles that refuse to answer questions if they predict the consequences of answering would meet pre-specified “disaster criteria”; multiple oracles for peer review

✵ Domesticity fully applicable

✵ Reduced need for AI to understand human intentions and interests (compared to genies and sovereigns)

✵ Use of yes/no questions can obviate need for a metric of the “usefulness” or “informativeness” of answers

✵ Source of great power (might give operator a decisive strategic advantage)

✵ Limited protection against foolish use by operator

✵ Untrustworthy oracles could be used to provide answers that are hard to find but easy to verify

✵ Weak verification of answers may be possible through the use of multiple oracles

Genie

A command-executing system

✵ Boxing methods partially applicable (for spatially limited genies)

Variations: Genies using different “extrapolation distances” or degrees of following the spirit rather than letter of the command; domain-limited genies; genies-with-preview; genies that refuse to obey commands if they predict the consequences of obeying would meet pre-specified “disaster criteria”

✵ Domesticity partially applicable

✵ Genie could offer a preview of salient aspects of expected outcomes

✵ Genie could implement change in stages, with opportunity for review at each stage

✵ Source of great power (might give operator a decisive strategic advantage)

✵ Limited protection against foolish use by operator

✵ Greater need for AI to understand human interests and intentions (compared to oracles)

Sovereign

A system designed for open-ended autonomous operation

✵ Boxing methods inapplicable

✵ Most other capability control methods also inapplicable (except, possibly, social integration or anthropic capture)

Variations: Many possible motivation systems; possibility of using preview and “sponsor ratification” (to be discussed in Chapter 13)

✵ Domesticity mostly inapplicable

✵ Great need for AI to understand true human interests and intentions

✵ Necessity of getting it right on the first try (though, to a possibly lesser extent, this is true for all castes)

✵ Potentially a source of great power for sponsor, including decisive strategic advantage

✵ Once activated, not vulnerable to hijacking by operator, and might be designed with some protection against foolish use

✵ Can be used to implement “veil of ignorance” outcomes (cf. Chapter 13)

Tool

A system not designed to exhibit goal-directed behavior

✵ Boxing methods may be applicable, depending on the implementation

✵ Powerful search processes would likely be involved in the development and operation of a machine superintelligence

✵ Powerful search to find a solution meeting some formal criterion can produce solutions that meet the criterion in an unintended and dangerous way

✵ Powerful search might involve secondary, internal search and planning processes that might find dangerous ways of executing the primary search process


Further research would be needed to determine which type of system would be safest. The answer might depend on the conditions under which the AI would be deployed. The oracle caste is obviously attractive from a safety standpoint, since it would allow both capability control methods and motivation selection methods to be applied. It might thus seem to simply dominate the sovereign caste, which would only allow motivation selection methods (except in scenarios in which the world is believed to contain other powerful superintelligences, in which case social integration or anthropic capture might apply). However, an oracle could place a lot of power into the hands of its operator, who might be corrupted or might apply the power unwisely, whereas a sovereign would offer some protection against these hazards. The safety ranking is therefore not so easily determined.

A genie can be viewed as a compromise between an oracle and a sovereign—but not necessarily a good compromise. In many ways, it would share the disadvantages of both. The apparent safety of a tool-AI, meanwhile, may be illusory. In order for tools to be versatile enough to substitute for superintelligent agents, they may need to deploy extremely powerful internal search and planning processes. Agent-like behaviors may arise from such processes as an unplanned consequence. In that case, it would be better to design the system to be an agent in the first place, so that the programmers can more easily see what criteria will end up determining the system’s output.