The Bell Curve: Intelligence and Class Structure in American Life - Richard J. Herrnstein, Charles Murray (1996)

Appendix 5. Supplemental Material for Chapter 13

Four issues raised in Chapter 13 are elaborated here: test bias, narrowing of the black-white difference in academic test scores, the broader argument for racial differences advanced by Philippe Rushton, and possible language bias against Latinos in the NLSY data.


In Chapter 13, we reported that the scientific evidence demonstrates overwhelmingly that standardized tests of cognitive ability are not biased against blacks. Here, we elaborate on the reasoning and evidence that lead to that conclusion.

More on External Evidence of Bias: Predictive Validity

Everyday commentary on test bias usually starts with the observation that members of various ethnic (or socioeconomic) groups have different average scores and leaps to the assumption that a group difference is prima facie evidence of bias. But a moment’s thought should convince anyone that this is not necessarily so. A group difference is, in and of itself, evidence of test bias only if we have some reason for assuming that an unbiased test would find no average difference between the groups. What might such a reason be? We cast the answer in terms of whites and blacks, since that is the context for most charges of test bias. Inasmuch as the context also usually involves a criticism of the use of the test in selection of persons for school or job, the most pertinent reason for assuming equality in the absence of test bias would be that we have other data showing that a randomly selected black and white with the same test score have different outcomes. This is what the text refers to as external evidence of bias.

If, for example, blacks do better in school than whites after choosing blacks and whites with equal test scores, we could say that the test was biased against blacks in academic prediction. Similarly, if they do better on the job after choosing blacks and whites with equal test scores, the test could be considered biased against blacks for predicting work performance. This way of demonstrating bias is tantamount to showing that the regression of outcomes on scores differs for the two groups. On a test biased against blacks, the regression intercept would be higher for blacks than whites, as illustrated in the graphic below. Test scores under these conditions would underestimate, or “underpredict,” the performance outcome of blacks. A randomly selected black and white with the same IQ (shown by the vertical broken line) would not have equal outcomes; the black would outperform the white (as shown by the horizontal broken lines). The test is therefore biased against blacks. On an unbiased test, the two regression lines would coincide because they would have the same intercept (the point at which the regression line crosses the vertical axis).

When a test is biased because it systematically underpredicts one group’s performance
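The intercept comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical and noiseless, the names `fit_line`, `outcome_a`, and `outcome_b` are invented for the example, and NumPy is assumed to be available. It is not a reconstruction of any analysis cited in the text.

```python
import numpy as np

def fit_line(scores, outcomes):
    """Ordinary least squares fit of outcome = intercept + slope * score."""
    slope, intercept = np.polyfit(scores, outcomes, deg=1)
    return intercept, slope

# Hypothetical data: both groups share a slope of 0.5, but the line for
# group B sits 2 outcome units above the line for group A.
scores = np.array([80.0, 90.0, 100.0, 110.0, 120.0])
outcome_a = 10.0 + 0.5 * scores
outcome_b = 12.0 + 0.5 * scores

intercept_a, slope_a = fit_line(scores, outcome_a)
intercept_b, slope_b = fit_line(scores, outcome_b)

# Difference in predicted outcome at one shared test score
# (the vertical broken line in the figure).
same_score = 100.0
gap_at_same_score = (intercept_b + slope_b * same_score) - (
    intercept_a + slope_a * same_score
)

print(f"intercept difference: {intercept_b - intercept_a:.2f}")
print(f"outcome gap at score {same_score:.0f}: {gap_at_same_score:.2f}")
```

With equal slopes, the outcome gap at any shared score equals the intercept difference. A single line fitted to both groups would then underpredict the group with the higher intercept, which is the sense of "bias" used in this passage.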


But the graphic above captures only one of the many possible manifestations of predictive bias. Suppose, for example, a test was less valid for blacks than for whites.1 In regression terms, this would translate into a smaller coefficient (slope in these graphics), which could, in turn, be associated either with or without a difference in the intercept. The next figure illustrates a few hypothetical possibilities.

All three black lines have the same low coefficient; they vary only in their intercepts. The gray line, representing whites, has a higher coefficient (therefore, the line is steeper). Begin with the lowest of the three black lines. Only at the very lowest predictor scores do blacks score higher than whites on the outcome measure. As the score on the predictor increases, whites with equivalent predictor scores have higher outcome scores. Here, the test bias is against whites, not blacks. For the intermediate black line, we would pick up evidence for test bias against blacks in the low range of test scores and bias against whites in the high range. The top black line, with the highest of the three intercepts, would accord with bias against blacks throughout the range, but diminishing in magnitude the higher the score.

When a test is biased because it is a less valid predictor of performance for one group than another
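When the slopes differ, the direction of any predictive bias depends on where the two regression lines cross. The sketch below computes that crossover score for three hypothetical shallow lines against one steeper line. All intercepts, slopes, and the 0 to 200 score range are invented for illustration, and `crossover_score` is a hypothetical helper name.

```python
def crossover_score(intercept_1, slope_1, intercept_2, slope_2):
    """Score at which two regression lines predict the same outcome.

    Returns None when the slopes are equal, since parallel lines never cross.
    """
    if slope_1 == slope_2:
        return None
    return (intercept_2 - intercept_1) / (slope_1 - slope_2)

# Hypothetical steeper line (higher validity) and three shallower lines
# that share one slope and differ only in intercept.
steep_intercept, steep_slope = 0.0, 1.0
shallow_slope = 0.5
shallow_intercepts = {"lowest": 10.0, "intermediate": 50.0, "top": 110.0}

crossovers = {
    name: crossover_score(steep_intercept, steep_slope, intercept, shallow_slope)
    for name, intercept in shallow_intercepts.items()
}

for name, score in crossovers.items():
    print(f"{name} line crosses the steep line at score {score:.0f}")
```

Below the crossover the shallow line lies above the steep one, and above the crossover it lies below. On an assumed score range of 0 to 200, the lowest line crosses near the bottom of the range, the intermediate line crosses in the middle, and the top line crosses beyond the range, matching the three cases described above.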


Readers will quickly grasp that test scores can predict outcomes differently for members of different groups and that such differences may justify claims of test bias. So what are the facts? Do we see anything like the first of the two graphics in the data—a clear difference in intercepts, to the disadvantage of blacks taking the test? Or is the picture cloudier—a mixture of intercept and coefficient differences, yielding one sort of bias or another in different ranges of the test scores? When questions about data come up, cloudier and murkier is usually a safe bet. So let us start with the most relevant conclusion, and one about which there is virtual unanimity among students of the subject of predictive bias in testing: No one has found statistically reliable evidence of predictive bias against blacks, of the sort illustrated in the first graphic, in large, representative samples of blacks and whites, where cognitive ability tests are the predictor variable for educational achievement or job performance. In the notes, we list some of the larger aggregations of data and comprehensive analyses substantiating this conclusion.2 We have found no modern, empirically based survey of the literature on test bias arguing that tests are predictively biased against blacks, although we have looked for them.

When we turn to the hundreds of smaller studies that have accumulated in the literature, we find examples of varying regression coefficients and intercepts, and predictive validities. This is a fundamental reason for focusing on syntheses of the literature. Smaller or unrepresentative individual studies may occasionally find test bias because of the statistical distortions that plague them. There are, for example, sampling and measurement errors, errors of recording, transcribing, and computing data, restrictions of range in both the predictor and outcome measurements, and predictor or outcome scales that are less valid than they might have been.3 Given all the distorting sources of variation, lack of agreement across studies is the rule.

But even taken down to so fine a level, the case against predictive bias against blacks remains overwhelming. As late as 1984, Arthur Jensen was able to proclaim that “I have not come across a bona fide example of the opposite finding [of a test that underpredicts black performance].”4 Jensen’s every finding regarding racial differences in IQ is routinely subjected to intense scrutiny by his critics, but no one has contradicted this one. We are not absolutely sure that our literature review has identified every study since 1984, but our search revealed no examples to counter Jensen’s generalization.5

Insofar as the many individual studies show a pattern at all, it points to overprediction for blacks. More simply, this body of evidence suggests that IQ tests are biased in favor of blacks, not against them. The single most massive set of data bearing on this issue is the national sample of more than 645,000 school children conducted by sociologist James Coleman and his associates for their landmark examination of the American educational system in the mid-1960s. Coleman’s survey included a standardized test of verbal and nonverbal IQ, using the kinds of items that characterize the classic IQ test and are commonly thought to be culturally biased against blacks: picture vocabulary, sentence completion, analogies, and the like. The Coleman survey also included educational achievement measures of reading level and math level that are thought to be straightforward measures of what the student has learned. If IQ items are culturally biased against blacks, it could be predicted that a black student would do better on the achievement measures than the putative IQ measure would lead one to expect (this is the rationale behind the current popularity of steps to modify the SAT so that it focuses less on aptitude and more on measures of what has been learned). But the opposite occurred. Overall, black IQ scores overpredicted black academic achievement by .26 standard deviations.6

One inference that might be drawn from this finding is that black children were for some reason not taking as much from school as their ability would permit, or that black children went to worse schools than white children, or any of several other interpretations. But whatever the explanation might be, the results directly contradict the hypothesis that IQ tests give an unfairly low estimate of black academic performance.
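One common way to express over- or underprediction in standard deviation units is to fit a single pooled regression line and average one group's residuals. The sketch below does that on hypothetical data. It is an assumption that this resembles how figures such as the one above are computed; the cited studies may have used a different procedure, and the function name is invented for the example.

```python
import numpy as np

def mean_standardized_residual(scores, outcomes, group_mask):
    """Fit one pooled line, then average the residuals (actual minus
    predicted) for the flagged group, in units of the pooled outcome SD.

    A negative value means the pooled line overpredicts that group.
    """
    slope, intercept = np.polyfit(scores, outcomes, deg=1)
    residuals = outcomes - (intercept + slope * scores)
    return residuals[group_mask].mean() / outcomes.std(ddof=1)

# Hypothetical data: the first three cases belong to the flagged group and
# score 2 outcome units lower than the others at every test score.
scores = np.array([90.0, 100.0, 110.0, 90.0, 100.0, 110.0])
outcomes = np.array([45.0, 50.0, 55.0, 47.0, 52.0, 57.0])
flagged = np.array([True, True, True, False, False, False])

overprediction = mean_standardized_residual(scores, outcomes, flagged)
print(f"mean residual for flagged group: {overprediction:.3f} SD")
```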

A second major source of data suggesting that standardized tests over-predict black performance is the SAT. Colleges commonly compare the performance of freshmen, measured by grade point average, against the expectations of their performance as predicted by SAT scores. A literature review of studies that broke down these data by ethnic group revealed that SAT scores overpredicted freshman grades for blacks in fourteen of fifteen studies, by a median of .20 standard deviation.7 In five additional studies where the ethnic classification was “minority” rather than specifically “black,” the SAT score overpredicted college performance in all five cases, by a median of .40 standard deviation.8

For job performance, the most thorough analysis is provided by the Hartigan Report, assessing the relationship between the General Aptitude Test Battery (GATB) and job performance measures. Out of seventy-two studies that were assembled for review, the white intercept was higher than the black intercept in sixty of them—that is, the GATB overpredicted black performance in sixty out of the seventy-two studies.9 Of the twenty studies in which the intercepts were statistically significantly different (at the .01 level), the white intercept was greater than the black intercept in all twenty cases.10

These findings about overprediction apply to the ordinary outcome measures of academic and job performance. But it should also be noted that “overprediction” can be a misleading concept when it is applied to outcome measures for which the predictor (IQ, in our continuing example) has very low validity. The larger the average black-white difference on an outcome that is not linked to the predictor, the more biased against whites the predictor will appear to be. Consider the next figure, constructed on the assumption that the predictor is nearly invalid and that the two groups differ on average in their outcome levels.

A predictor with low validity may seem to be biased against whites if there is a substantial difference in the outcome measure


This situation is relevant to some of the outcome measures discussed in Chapter 14, such as short-term male unemployment, where the black and white means are quite different, but IQ has little relationship to short-term unemployment for either whites or blacks. This figure was constructed assuming only that there are factors influencing outcomes that are not captured by the predictor, hence its low validity, resulting in the low slope of the parallel regression lines.11 The intercepts differ, expressing the generally higher level of performance by whites compared to blacks that is unexplained by the predictor variable. If we knew what the missing predictive factors are, we could include them in the predictor, and the intercept difference would vanish—and so would the implication that the newly constituted predictor is biased against whites. What such results seem to be telling us is, first, that IQ tests are not predictively biased against blacks but, second, that IQ tests alone do not explain the observed black-white differences in outcomes. It therefore often looks as if the IQ test is biased against whites.
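The claim that the intercept difference would vanish once the missing factors are included can be illustrated with a small simulation. Everything here is hypothetical: the sample size, the coefficients, the variable `other_factor`, and the helper `ols` are assumptions chosen for the sketch, and the outcome is generated without noise so that the result is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # hypothetical sample size per group

# Group indicator (0 or 1); the predictor is distributed identically
# in both groups in this sketch.
group = np.repeat([0.0, 1.0], n)
predictor = rng.normal(100.0, 15.0, size=2 * n)
# An unmeasured factor whose mean is one unit higher in group 1.
other_factor = rng.normal(0.0, 1.0, size=2 * n) + group
# The outcome depends weakly on the predictor, strongly on the other factor.
outcome = 0.01 * predictor + 1.0 * other_factor

def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the columns."""
    design = np.column_stack([np.ones_like(y)] + list(columns))
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Predictor plus group indicator: the indicator absorbs the outcome gap,
# which shows up as two parallel lines with different intercepts.
gap_without_factor = ols([predictor, group], outcome)[2]
# Add the missing factor: the group indicator has nothing left to explain.
gap_with_factor = ols([predictor, group, other_factor], outcome)[2]

print(f"intercept gap, factor omitted:  {gap_without_factor:.3f}")
print(f"intercept gap, factor included: {gap_with_factor:.3f}")
```

In this construction the intercept gap is close to the one-unit difference in the omitted factor and drops to essentially zero when that factor enters the regression.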

More on Internal Evidence of Bias: Item Analysis

Laymen are often skeptical that IQ test items could measure anything as deep as intelligence. Knowing the answers seems to them to depend less on intelligence than on having been exposed to certain kinds of cultural or historical information. It is usually a short step from here to the conclusion that the tests must be biased. Pundits of varying sorts reinforce this intuition about test item bias, claiming that the middle- and upper-class white culture infuses test items even after vigorous efforts to expunge it.

The data confirming Spearman’s hypothesis, which we discussed at some length in Chapter 13, provide the most convincing conceptual refutation of this allegation by providing an alternative explanation that has been borne out by many studies: the items on which blacks and whites differ most widely are not those with the most esoteric cultural content, but the ones that best measure the general intelligence factor, g.12 But many other studies, which we review here, have directly asked whether the cultural content of items is associated with the magnitude of the black-white difference.

One of the earliest of the studies, a 1951 doctoral thesis at Catholic University, proceeded on the assumption that some test items are more dependent on exposure to culture than others.13 Frank McGurk, the study’s author, consequently had large numbers of independent judges rate many test items for their cultural loading. On exploratory tests, he was able to establish each item’s general difficulty, which is defined simply as the proportion of a population that gets the item wrong. He could therefore identify pairs of items, one highly loaded with cultural information and the other not highly loaded but of equal difficulty. Now, finally, the crucial evaluation could be made with a sample of black and white high school students matched for schooling and socioeconomic background. The black-white gap, he discovered, was about twice as large on items rated as low in cultural loading as on items rated as high in cultural loading. Consider, for example, a pair of equally difficult test items. The one that is culturally loaded is probably difficult because it draws on esoteric knowledge; the other item is probably difficult because it calls on complex cognitive processing—g. McGurk’s results undermined the proposition that access to esoteric knowledge was to blame for the black-white difference.

Another approach in the pursuit of test-item bias is based on which items blacks and whites find hard or easy. Conceptually, this is much like McGurk’s approach, except that it does not require us to have items rated by experts, a subjective procedure that some might find suspect. Instead, if the cultural influence matters and if blacks and whites have access to different cultural backgrounds, then items that pick up these cultural differences should split the two groups. Items drawing on cultural knowledge more available to whites than to blacks should be, on average, relatively easier for whites than for blacks. Items lacking this tip for whites or items with a tip for blacks should not be differentially easier for whites and may be easier for blacks.

This idea is tested by ranking the items on a test separately for whites and for blacks, in order of difficulty. That is, the easiest item for whites is the one with the highest proportion of correct answers among whites; the next easiest item for whites is the one with the second highest proportion of correct answers for whites; and so on. Now repeat the procedure using the blacks’ proportions of correct answers. This will result in two sets of rank orders for all the items. The rank-order correlation between them is a measure of the test-item bias hypothesis: The larger the correlation is, the less support the hypothesis finds. Alternatively, the proportions of correct responses within each group are transformed into standard scores and then correlated by some other measure of correlation, such as the Pearson product-moment coefficient.
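The ranking procedure can be sketched as follows. The proportions correct are hypothetical, the function names are invented for the example, and NumPy is assumed. The Pearson coefficient is computed here directly on the proportions; the transformation to standard scores mentioned above is not shown.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values get their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Rank-order correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical proportions of correct answers on five items in two groups.
# The direction of ranking does not matter as long as it is the same for both.
p_correct_group_a = [0.92, 0.81, 0.65, 0.40, 0.22]
p_correct_group_b = [0.85, 0.70, 0.52, 0.33, 0.10]

rank_corr = spearman(p_correct_group_a, p_correct_group_b)
value_corr = pearson(p_correct_group_a, p_correct_group_b)
print(f"rank-order correlation: {rank_corr:.3f}")
print(f"Pearson correlation:    {value_corr:.3f}")
```

Because the two hypothetical groups order the items identically, the rank-order correlation is 1 even though every item is harder for the second group. The statistic is sensitive to differences in the relative order of difficulty, not to the overall level.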

Either way, the result is clear. Relative item difficulties are essentially the same for both races (by sex). That is, blacks and whites of the same sex come close to finding the same item the easiest, the same item next easiest, all the way down to the hardest item.14 When the rank order of difficulty differs across races, the differences tend to be small and unsystematic. Rank order correlations above .95 are not uncommon for the items on the Wechsler and Stanford-Binet tests, which are, in fact, the tests that provide most of the anecdotal material for arguing that test items are biased. Pearson correlations are often somewhat lower but typically still above .8. Moreover, when items do vary in difficulty across races, most of the variation is eliminated by taking mental age into account. Since blacks and whites of the same chronological age differ on average in mental age, allowing a compensating lag in chronological age will neutralize the contribution of mental age. Compare, say, the item difficulties for 10-year-old blacks with that for 9-year-old or 8-year-old whites. When this is done, the correlations in difficulty almost all rise into the .9 range and above.15

Because “item bias” as ordinarily defined has failed to materialize, the concept has been extended to encompass item characteristics that are intertwined with the underlying rationale for thinking that an item measures g. For example, one researcher has found that the black-white gap is diminished for items that call for the subject to identify the one false response, compared to items requiring the subject to identify the one correct response.16 Is this a matter of bias, or a matter of how well the two types of items tap the construct called intelligence? This in turn brings us full circle to Spearman’s hypothesis discussed in Chapter 13, which offers an interpretative framework for explaining such differences.

More on Other Potential Sources of Bias

We turn now to one of the least precisely but most commonly argued reasons for thinking that tests are biased: Tests are a sort of game, and, as in most games, it helps to have played the testing game, it helps to get coaching, and it helps to be playing on the home field. Privileged groups get more practice and coaching than underprivileged groups. They have a home-court advantage; the tests are given in familiar environments, administered by familiar kinds of people. A major part of the racial differences in test scores may be attributed to these differences. In this discussion, we begin with coaching and practice, then turn to some of the other ways in which the testing situation might influence scores.

PRACTICE AND COACHING. For IQ tests, coaching and practice are not a significant issue because coaching and practice effects exist only under conditions that virtually never apply. To get a sizable practice effect for an IQ test, it is necessary to use subjects who have never taken an IQ-like test, administer the identical test twice, and do so quickly (preferably within a few weeks).17 If the subjects fail to meet any of those conditions, the chances of finding a practice effect are small, and the size of any effect, if one is found, will be just a few points. Coaching effects are even harder to obtain. We are unable to identify any IQ data in any study, large or small, in which the results are compromised because the IQ scores of part of the sample have been obtained after this kind of experience. That’s not the way that IQ tests have been administered anywhere to any significant sample at any time during the history of IQ testing—except to the samples used to assess practice and coaching effects, and sometimes to the subjects of intensive remedial programs such as those discussed in Chapter 17.

The story regarding practice and coaching for such tests as the Scholastic Aptitude Test (SAT), the Law School Admissions Test (LSAT), and the Medical College Admissions Test (MCAT) is much more contentious than the story about IQ. Many people do take these tests more than once, many people practice for them, and many people get extensive coaching. Moreover, these tests are supposed to be “coachable,” insofar as they measure the verbal, reasoning, and analytic skills that a good education is supposed to enhance, and prolonged exposure to such coaching should produce better scores. Or to put it another way, two students with the same IQ should be able to get different LSAT and MCAT scores if one student has taken more appropriate courses and studied harder than the other student. That SAT scores declined by almost half a standard deviation from 1964 to 1980 strongly suggests that something coachable—or “negatively coachable” in this example—is being measured. In Chapter 17, we discuss the effects of coaching for the SAT, which are real but also smaller and harder to obtain than the widely advertised claims of the coaching industry.

The belief that coaching might explain part of the black-white gap often rests on a notion that, on the average, blacks receive less of the practice and coaching that might have elevated their scores than does the average white. We have already undermined this notion by showing that the tests are biased against blacks neither predictively nor in terms of particular item difficulties. There is, however, a literature that bears more directly on this idea, by looking for an interaction effect between practice or coaching and race.

If practice and coaching explain any portion of a group difference in scores in the population as a whole, then it necessarily follows that representative samples of those groups who are equally well practiced and well coached will show a smaller difference than is observed in the population at large. It is not enough that practice or coaching raises the mean score of the lower-scoring group; it must raise its mean score more than it raises the score of the higher-scoring group.

Several studies have investigated whether this is found for blacks and whites. In a well-designed study, representative samples of blacks and whites are randomly divided into two groups. The experimental black and white groups receive identical coaching (or practice), and the control groups receive no treatment at all. At the end of the experiment, the investigator has four different sets of results: test scores for coached blacks, uncoached blacks, coached whites, and uncoached whites. These results may be analyzed in three basic ways: One may compare blacks overall with whites overall, which will reveal the main effect of race; or the coached samples overall with the uncoached samples overall, which will reveal the main effect of the coaching; or the way in which the effects of coaching vary according to the race of the persons being coached, known as the interaction effect.
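The three comparisons in such a design reduce to simple arithmetic on the four cell means. The sketch below assumes equal cell sizes, uses hypothetical means, and labels the groups A and B; `two_by_two_effects` is an invented name.

```python
def two_by_two_effects(a_coached, a_uncoached, b_coached, b_uncoached):
    """Main effects and interaction from the four cell means of a 2 x 2 design.

    Assumes equal numbers of subjects in each cell.
    """
    group_effect = (b_coached + b_uncoached) / 2 - (a_coached + a_uncoached) / 2
    coaching_effect = (a_coached + b_coached) / 2 - (a_uncoached + b_uncoached) / 2
    # Interaction: how much more group A gains from coaching than group B.
    interaction = (a_coached - a_uncoached) - (b_coached - b_uncoached)
    return {
        "group": group_effect,
        "coaching": coaching_effect,
        "interaction": interaction,
    }

# Hypothetical cell means: coaching adds 3 points in both groups.
effects = two_by_two_effects(
    a_coached=88.0, a_uncoached=85.0, b_coached=103.0, b_uncoached=100.0
)
print(effects)
```

In this hypothetical case coaching raises both groups by the same amount, so the interaction is zero and the gap between the groups is unchanged. Coaching can account for part of a group difference only if the interaction term is positive for the lower-scoring group.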

One study found a statistically significant differential response to practice, but not to direct instruction, on a reasoning test, between black and white college students.18 The differential advantage of practice for blacks compared to whites was about an eighth of the overall black-white gap on this test. Other studies have failed to find even this much of a differential response, or they have found differential responses in the opposite direction, tending to increase the black-white gap after practice.19 Taking the evidence as a whole, any differential coaching and practice effect by race (or socioeconomic status) is at most sporadic and small. If such a differential effect exists, it is too small to be replicated reliably. The scattered evidence of a differential effect is about as supportive of a white advantage from coaching as of a black advantage.

EXAMINER EFFECTS AND OTHER SITUATIONAL VARIABLES. Is it possible that disadvantaged groups come to the test with greater anxiety than confident middle-class students, and this mental state depresses their scores? That, when a black student takes a bus across town to an unfamiliar neighborhood and goes into a testing room filled with white students and overseen by a white test supervisor, this situation has an intimidating effect on performance? What about the time limits on tests? Might these have more pronounced effects on disadvantaged students than on test-wise middle-class students? All are plausible questions, but the answer to each is the same: Investigations to date give no reason to believe that such considerations explain a nontrivial portion of the group differences in scores.

The race of the examiner has been the subject of numerous studies. Of those with adequate experimental designs, most have shown nonsignificant effects; of the rest, the evidence is as strong that the presence of a white examiner reduces the overall black-white difference as that a white examiner exacerbates the difference.20 Examinations of the results of time pressures fail to demonstrate either that blacks do better in untimed than in timed tests or that the test-taking “personal tempo” of blacks is different from that of whites.21 Test anxiety has been investigated extensively but, as in so many other aspects of this discussion, the relationship tends to be the opposite of the expected one: To the extent that test anxiety affects performance at all, it seems to help slightly. Only a few studies have specifically addressed black-white differences in test anxiety; they have shown either nonsignificant results, or that the white subjects were slightly more anxious than the black subjects.22

“BLACK ENGLISH.” Language looms larger. It is well established that students from many different cultural backgrounds for whom English is a second language tend to score better on the nonverbal part of the test than on a verbal component given in English.23 Whereas this imbalance may be independent of language for East Asians (Japanese in Japan have superior nonverbal scores even taking verbal test batteries designed in Japanese), it is also manifest among Latinos, who do not otherwise exhibit the characteristic East Asian verbal-nonverbal pattern. This suggests that students who are taking the test in a second language suffer some decrement of their scores.

It has been a small step from this to hypothesize that, for practical purposes, many blacks are taking the test in a “second language,” with their first language being the dialect known as “black English,” ubiquitous in the black inner city and used to some extent by blacks of broader socioeconomic backgrounds. Researchers have approached the issue in several ways. First, the evidence indicates that black children who use black English understand standard English at least as well.24 A more direct test came in the 1970s, when L. C. Quay had the Stanford-Binet translated into black dialect and tested several samples with both the original and the revised version. The studies produced no evidence that black students in any of the various test groups benefited (the differences in scores from the two tests generally amounted to less than one IQ point).25 But the most powerful data suggesting that language does not explain the black-white difference is provided by the evidence for Spearman’s hypothesis presented in Chapter 13: If language were the problem, then blacks would be at the greatest disadvantage on test items that rely on a knowledge of standard English and be at the least disadvantage on test items that use no language at all. As we discuss with regard to Spearman’s hypothesis in Chapter 13, this expectation is contradicted by a large and consistent body of work. Black populations generally do relatively better on test items that are less saturated with g and relatively worse on items more saturated with g, whether the items are verbal or nonverbal.

The Continuing Debate

Allegations that standardized tests are culturally biased still appear, and presumably this account will fuel additional ones. What about all the articles appearing in many quarters making these claims? They make up a varied lot, but typically consist of allegations that ignore the data. A particularly striking example was a long article entitled “IQ and Standard English,” which appeared in a technical journal and attributed the black-white IQ test differences to language difficulties. The article was followed by four responses, plus a counterstatement by the author. Neither the original article nor any of the responses cited any of the data discussed above.26 The debate was carried on entirely on the basis of argumentation about the extent to which black culture is more orally based than white culture. This readiness to theorize about what might be true about black-white differences in test scores while ignoring the pertinent data is common.

Other articles, cited in the note, have discussed a variety of ways in which culture interacts with human functioning, intellectual and otherwise.27 The movement surrounding Howard Gardner’s concept of multiple intelligences (see the Introduction) is only the best known of these new ways of talking about intelligence. But these discussions do not try to argue with the two core statements that we have made: In the major standardized tests, test items function in the same way for both blacks and whites, and the test results are similarly predictive for blacks and whites, tending to overpredict black performance rather than underpredict it.

In the popular media, the persistence of belief in cultural bias, we think, is based on a misapprehension. To many people, proof that tests are unbiased seems tantamount to proof that the black-white gap reflects genetic differences in intelligence. Since they reject the possibility that genetic differences could be involved, the tests must be biased. One of the major purposes of Chapter 13 is to discredit both the notion that real differences in intelligence must be genetically founded and the assumption that a role for genes must have horrific consequences.


The text discusses the evidence for converging black and white test scores on the NAEP (National Assessment of Educational Progress) and the SAT. Here, we summarize other sources of data about the two ethnic populations.

National High School Studies, 1972 and 1980

In 1972 and 1980, the federal government sponsored large-sample studies intended to provide reliable national estimates of the high school population. As part of both studies, tests measuring vocabulary, reading, and mathematics were administered to all participants. Although not technically IQ tests, all three had high g loadings. Furthermore, the tests were virtually identical for the two test administrations,28 and the study procedures in 1980 were deliberately constructed to maximize the comparability of the two samples. In 1982, the sophomores from the 1980 sample were tested as seniors. The table below summarizes the results for the three test years by ethnic group. The black-white difference diminished on two of the three tests, but all of the shrinkage came about because white scores fell, not because black scores rose. Indeed, black scores also fell on all three tests but (except in the case of vocabulary) by less than the reduction in white scores.

Black-White Difference for High School Seniors in 1972, 1980, and 1982

White-Black Difference, in SDs

Source: Rock et al. 1985, Appendixes B, C, E.

College Board Achievement Tests

THE SAT. In Chapter 13, we noted that the overall black-white gap in SAT scores had narrowed between 1976 and 1993, from 1.16 to .88 standard deviation in the verbal portion of the test and from 1.27 to .92 standard deviation in the mathematics portion of the test.29 More detailed breakdowns are available for the period 1980 to 1991, as shown in the table below. The trend is consistently positive, with narrowing black-white differences of at least .1 standard deviation units on the tests for Literature, European History, Math II, Physics, French, German, Latin, and Spanish. The average shrinkage of the gap is .05 standard deviation unit. From further analyses, we conclude that the narrowing is not entirely explained away by changes in the representativeness of the black and white samples of test takers or by declining white scores.

Reductions in the Black-White Difference on the Scholastic Aptitude and Achievement Tests, 1980-1991

White-Black Difference, in SDs

1980 1991 Change

Reading subscore
Vocabulary subscore
Test of standard written English

Achievement tests
Overall average
English Composition
American History
European History
Math I
Math II

.69 .74

Source: The College Board’s annual summaries of test scores by ethnicity.


To interpret the changes in scores on achievement tests, which are taken by small proportions of the SAT test takers, we used the mean that the College Board provides on the SAT Verbal and Math scores for each achievement test population in each year. The question we asked was: For a given achievement test, how did the place of the average test taker on his race’s cognitive ability distribution change from 1980 to 1991? For example, the average white taking the Literature achievement test in 1980 had an SAT Verbal score that put him at the 80th percentile of white testees; in 1991, he was at the 85th percentile. Meanwhile, the average black taking the Literature achievement test in 1980 had an SAT Verbal score that put him at the 88th percentile of all black SAT testees; in 1991, he was still at the 88th percentile of the black distribution. The difference between blacks and whites on the Literature achievement test narrowed during that period, but, given where the blacks and whites were relative to the white and black SAT distributions, it seems unlikely that the narrowing was caused by changes in the self-selection that artificially raised black scores relative to whites. Ten of the thirteen achievement tests fit this pattern. In only three cases (European History, Physics, and German) did changes in the SAT Math or Verbal scores indicate that the black pool had become differentially more selective. Only in the case of German was this difference large enough to account plausibly for much of the black improvement relative to whites.
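
The percentile comparison described above can be sketched in a few lines. The example assumes that SAT scores within each group are roughly normally distributed, which is a simplification, and the means and standard deviations are hypothetical placeholders, not College Board figures. A rise in the percentile between the two years would suggest that the pool taking a given achievement test had become more selective relative to its own group.

```python
from statistics import NormalDist


def percentile_within_group(score, group_mean, group_sd):
    """Percentile rank of `score` within a group assumed to be normally distributed."""
    return 100.0 * NormalDist(mu=group_mean, sigma=group_sd).cdf(score)


def selectivity_shift(early, late):
    """Change in percentile rank between two years.

    Each argument is a (score, group_mean, group_sd) tuple, where `score` is the
    mean SAT score of the people taking one achievement test and the other two
    values describe all SAT takers in the same group that year.
    """
    return percentile_within_group(*late) - percentile_within_group(*early)


# Hypothetical pool whose position within its group did not change.
stable_shift = selectivity_shift((500.0, 400.0, 100.0), (510.0, 410.0, 100.0))

# Hypothetical pool that moved further up its group's distribution.
selective_shift = selectivity_shift((500.0, 400.0, 100.0), (530.0, 410.0, 100.0))

print(f"stable pool shift: {stable_shift:+.1f} percentile points")
print(f"more selective pool shift: {selective_shift:+.1f} percentile points")
```

Only a positive shift for one group that is not matched by the other would support the idea that self-selection, rather than a real change, produced a narrowing gap.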

THE ACT. The College Board’s major competitor in the college entrance examination business is the American College Testing program, which has also shown decreasing differences between black and white students who take the test, as summarized in the table below. Reductions in the gap occurred in all the subtests between 1970 and 1991, with by far the largest reduction on the English subtest. The magnitude of the overall change in the composite is about half the size of the reduction observed in the black-white difference on the SAT. Like the SAT population, the ACT’s population of black test takers has been increasing, suggesting that the increases in scores are not the result of a more selective test-taking population.

Black-White Difference in the ACT, 1970-1991

White-Black Difference, in SDs

1970 1991 Change

Source: ACT 1991, Tables 1, 4; Congressional Budget Office 1986, Fig. E-2.

THE GRADUATE RECORD EXAMINATION (GRE). The GRE is the equivalent of the SAT for admission to graduate school in the arts and sciences. Not many people in any cohort take the GRE, so the sample is obviously highly self-selected and atypical of the population. In 1988, for example, the number of white GRE test takers represented only 5.6 percent of the 22-year-old white population; black test takers represented 2.3 percent of the 22-year-old black population. On the other hand, the proportions in 1988 were about the same as they were in 1979. The self-selection process has remained fairly steady over the years, so it is worth at least mentioning the results, as shown in the table below. The GRE gap narrowed only slightly less than that for the SAT. Another positive note is that the narrowing was achieved because black scores rose more than white scores, not because white scores were falling.

Black-White Difference in the GRE, 1979-1988

White-Black Difference, in SDs

1979 1988 Change

Source: Graduate Record Examination Board.

These results from national tests are echoed in state-level data from Texas and North Carolina, as reported in the Congressional Budget Office’s survey of trends in educational achievement.30 Overall, the evidence seems clear beyond a reasonable doubt: On college entrance tests and national tests of educational proficiency, the gap between whites and blacks remained large into the early 1990s, but it had been narrowing in the preceding decade or two. The optimist may argue that the trend will continue indefinitely if improvements in the environment and education for American blacks can be continued. The pessimist may note that there seems to have been little narrowing since the mid-1980s, as we observed in the text for Chapter 13, and that the black-white IQ gap in the NLSY seems to be widening rather than narrowing in the next generation, as we discussed in Chapter 15.


Controversy unprecedented even for the contentious subject of racial differences has erupted around the work of J. Philippe Rushton, a developmental psychologist at the University of Western Ontario. Rushton argues that the differences in the average intelligence test scores among East Asians, blacks, and whites are not only primarily genetic but part of a complex of racial differences that includes such variables as brain size,31 genital size, rate of sexual maturation, length of the menstrual cycle, frequency of sexual intercourse, gamete production, sexual hormone levels, the tendency to produce dizygotic twins, marital stability, infant mortality, altruism, law abidingness, and mental health. For each variable, Rushton has concluded, the three races—Mongoloids, Caucasoids, and Negroids—fall in a certain order, with the average Caucasoid in the middle and the other two races on one side or the other. The ordering of the races, he further argues, has an evolutionary basis; hence these ordered racial differences must involve genes.

To reach his conclusion, Rushton starts with the well-established observation in biology that species vary in their reproductive strategies. Some species produce many offspring (per parent) of which only a small fraction survive; others produce small numbers of offspring with relatively high survival rates. The involvement of parents in their offspring’s health and development (which biologists call “parental investment”) tends to be high for species having few offspring and high survival rates and low for those employing the other strategy (many offspring and low survival rates). Many other species differences are concomitant with this fundamental one, according to standard biological doctrine.

Rushton’s thesis is that this standard biological principle may be applied within our own species. Rushton acknowledges that human beings are as a species far out along the continuum of low reproduction, high offspring survival, and high parental investment, but he argues that the ordering of the races on the many variables he has identified can be explained as the result of evolutionary differences in how far out the races are. According to Rushton, the average Mongoloid is toward one end of the continuum of reproductive strategies—the few offspring, high survival, and high parental investment end—the average Negroid is shifted toward the other end, and the average Caucasoid is in the middle.

Rushton paints with a broad brush, focusing on the major racial categories rather than the dozens of more finely drawn reproductively isolated human populations that might test his theory more conclusively. But beyond that, his thesis raises numerous questions—moral, pragmatic, and scientific. Many critics attack the theory on scientific, not just moral, grounds. They question whether Rushton has really shown that the races are consistently ordered in the way he says they are, or whether a biological theory that was meant to explain species differences can be properly applied to groups within a single species, or whether the evidence for genetic influences on his variables stands up. Rushton has responded to his critics with increasingly detailed and convincing empirical reports of the race differences in some of the traits on his list, and he cites preeminent biological authority for his use of the concept of reproductive strategies. He has strengthened the case for consistently ordered race differences, at least for some of the variables he discusses, since his first formulation of the theory in 1985. Nevertheless, the theory remains a long way from confirmation.

We cannot at present say who is more nearly right as a matter of science, Rushton or his critics.32 However, Rushton’s work is not that of a crackpot or a bigot, as many of his critics are given to charging. Nor are we sympathetic with Rushton’s academic colleagues or the politicians in Ontario who have called for his peremptory dismissal from a tenured professorship. Setting aside whether his work is timely or worthwhile—a judgment we are loath to make under any circumstances—it is plainly science. He is not alone in seeking an evolutionary explanation of the observed differences among the races.33 As science, there is nothing wrong with Rushton’s work in principle; we expect that time will tell whether it is right or wrong in fact.


AFQT scores reported for Latinos in the NLSY were supposed to be limited to persons who were fluent in English. However, a lingering question remains: To what extent might language difficulties have skewed the Latino results?

To investigate this issue, we first examined the scores on the subtests of the AFQT, looking for evidence that the pattern of Latino scores across subtests was different from the patterns for whites and blacks. For the Latino sample as a whole, no such evidence was found. The correlations between the verbal subtests and the arithmetic subtests were as high for Latinos as for whites and blacks. The size of the Latino-white differences in scores across subtests followed the same pattern as the black-white differences. The correlation between the overall AFQT score and educational attainment was about the same for Latinos as for blacks and whites.

We next broke the Latino sample into those who had been born abroad (26 percent of all Latinos in the NLSY) and those who had been born in the United States, hypothesizing that those who had been born abroad had usually learned English as a second language, and we looked for evidence that this constituted a special language disability.

The results generally paralleled those for the Latino sample as a whole. The correlations between the verbal and arithmetic subtests were substantially higher for Latinos born abroad than for whites, blacks, or Latinos born in the United States, the opposite of what would be expected if English fluency were a problem for the foreign-born. This was true even of the “numerical operations” subtest, which has no verbal content at all. The correlation between the AFQT and educational attainment was higher for Latinos born abroad than for any other group. Furthermore, the rank order of the subtest differences between whites and foreign-born Latinos was the same as the rank order for other groups.
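
The logic of this check can be sketched as follows. If a language problem depressed verbal scores for reasons unrelated to ability, the verbal subtests should track the arithmetic subtests less closely in the affected group, so its verbal-arithmetic correlation should be lower. The example below is a minimal illustration with made-up scores for two hypothetical groups; it does not use NLSY data, and the choice of Pearson's r is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical subtest scores. In group_a the two subtests move together;
# in group_b the verbal scores carry noise unrelated to the arithmetic scores.
scores = {
    "group_a": {"verbal": [11, 13, 14, 17, 19], "arithmetic": [10, 12, 14, 16, 18]},
    "group_b": {"verbal": [15, 11, 17, 12, 16], "arithmetic": [10, 12, 14, 16, 18]},
}

correlations = {
    group: pearson_r(subtests["verbal"], subtests["arithmetic"])
    for group, subtests in scores.items()
}

for group, r in correlations.items():
    print(f"{group}: verbal-arithmetic r = {r:.2f}")
```

A markedly lower correlation in one group would be the signature of a group-specific problem with the verbal material; an equal or higher correlation, as reported above, points the other way.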

The magnitude of the difference between whites and foreign-born Latinos on the verbal test was greater than the difference separating whites and U.S.-born Latinos. An argument could be made (equivocally) that the differences were disproportionately larger for the verbal subtests than for the arithmetic subtests. On the other hand, it is also true that the socioeconomic background of the foreign-born Latinos was substantially lower than that of U.S.-born Latinos, and socioeconomic background itself is associated with lower IQ scores, independently of any language problems.

Our overall conclusion is that it is difficult to make a case that language difficulties contribute significantly to the Latino-white difference in the NLSY. The parsimonious explanation is that the administrators of the NLSY did a good job of screening out subjects with language difficulties.