The Bell Curve: Intelligence and Class Structure in American Life - Richard J. Herrnstein, Charles Murray (1996)

Appendix 3. Technical Issues Regarding the Armed Forces Qualification Test as a Measure of IQ

Throughout The Bell Curve, we use the Armed Forces Qualification Test (AFQT) as a measure of IQ. This appendix discusses a variety of related issues that may help readers interpret the meaning of the analyses presented in the full text.

DOES THE AFQT MEASURE THE SAME THING THAT IQ TESTS MEASURE?

The AFQT is a paper-and-pencil test designed for youths who have reached their late teens. In effect, it assumes exposure to an ordinary high school education (or the opportunity to get one). This kind of restriction is shared by any IQ test, all of which are designed for certain groups.

The AFQT as scored by the armed forces is not age referenced. The armed forces have no need to do so, because the overwhelming majority of recruits taking the test are 18 and 19 years old. In contrast, members of the NLSY sample were from 14 to 23 years old when they took the test. Therefore, as discussed in Appendix 2, all analyses in the book take age into account through one of two methods: entering age as an independent variable in the multivariate analyses, or, for all descriptive statistics, age referencing the AFQT score by expressing it in terms of the mean and standard deviation for each year’s birth cohort. In this appendix, we uniformly use the age-referenced version for analyses based on the NLSY.
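The age-referencing procedure can be sketched in a few lines of Python. The scores, cohort sizes, and age trend below are invented for illustration; the point is only the mechanics of expressing each score relative to its own birth cohort's mean and standard deviation.

```python
# Illustrative sketch of age referencing. All numbers are invented;
# the NLSY birth years (1957-1964) are used only for flavor.
import numpy as np

rng = np.random.default_rng(0)
cohorts = rng.integers(1957, 1965, size=1000)            # birth year of each subject
raw = 50 + 0.5 * (cohorts - 1957) + rng.normal(0, 10, size=1000)  # raw scores drift with age

# Express each score as a z-score within its own birth cohort.
age_referenced = np.empty_like(raw)
for year in np.unique(cohorts):
    mask = cohorts == year
    age_referenced[mask] = (raw[mask] - raw[mask].mean()) / raw[mask].std()

# Within every cohort the referenced scores have mean 0 and SD 1,
# so older and younger test takers sit on a common scale.
for year in np.unique(cohorts):
    z = age_referenced[cohorts == year]
    assert abs(z.mean()) < 1e-9 and abs(z.std() - 1) < 1e-9
```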

Is a set of age-referenced AFQT scores appropriately treated as IQ scores? We approach this issue from two perspectives. First, we examine the internal psychometric properties of the AFQT and show that the AFQT is one of the most highly g-loaded mental tests in current use. It seems to do what a good IQ test is supposed to do—tap into a general factor rather than specific bits of learning or skill—as well as or better than its competitors. Second, we examine the correlation between the AFQT and other IQ tests, and show that the AFQT is more highly correlated with a wide range of other mental tests than those other mental tests are with each other. On both counts, the AFQT qualifies not just as an IQ test, but one of the better ones psychometrically.

Psychometric Characteristics of the ASVAB

Let us begin by considering the larger test from which the AFQT is computed, the ASVAB (Armed Services Vocational Aptitude Battery), taken every year by between a half million and a million young adults who are applying for entry into one of the armed services. The ASVAB has ten subtests, spanning a range from test items that could appear equally well on standard tests of intelligence to items testing knowledge of automobile repair and electronics.1 Scores on the subtests determine whether the applicant will be accepted by his chosen branch of service; for those accepted, the scores are later used for the placement of enlisted personnel into military occupations. How well or poorly a person performs in military occupational training schools, and also how well he does on the job, can therefore be evaluated against the scores earned on a battery of standardized tests.

The ten subtests of the ASVAB can be paired off into forty-five correlations. Of the forty-five, the three highest correlations in a large study of enlisted personnel were between Word Knowledge and General Science, Word Knowledge and Paragraph Comprehension, and, highest of all, between Mathematics Knowledge and Arithmetic Reasoning.2 Correlations above .8, as these were, are in the range observed between different IQ tests, which are frankly constructed to measure the same attribute. To see them arising between tests of such different subject matter should alert us to some deeper level of mental functioning. The three lowest correlations, none lower than .22, were between Coding Speed and Mechanical Comprehension, Numerical Operations and Auto/Shop Information, and, lowest of all, between Coding Speed and Auto/Shop Information. Between those extremes, there were rather large correlations between Paragraph Comprehension and General Science and between Word Knowledge and Electronics Information but only moderate correlations between Electronics Information and Coding Speed and between Mathematics Knowledge and Auto/Shop Information. Thirty-six of the forty-five correlations were above .5.

Psychometrics approaches a table of correlations with one or another of its methods of factor analysis. Factor analysis (or other mathematical procedures that go under other names) extracts the factors3 that account for the observed pattern of subtest scores. The basic idea is that scores on any pair of tests are correlated to the extent that the tests measure something in common: If they test traits in common, they are correlated, and if not, not. Factor analysis tells how many different underlying factors are necessary to account for the observed correlations between them. If, for example, the subtest scores were totally uncorrelated, it would take ten independent and equally significant factors, one for each subtest by itself. With each test drawing on its own unique factor, the forty-five correlations would all be zeros. At the other extreme, if the subtests measured precisely the same thing down to the very smallest detail, then all the correlations among scores on the subtests could be explained by a single factor—that thing which all the subtests precisely measured—and the correlations would all be ones. Neither extreme describes the actuality, but for measures of intellectual performance, one large factor comes closer than many small ones. This is not the place to dwell on mathematical details except to note that, contrary to claims in nontechnical works,4 the conclusions we draw about general intelligence do not depend on the particular method of analysis used.5
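A minimal sketch, with invented loadings rather than the ASVAB data, of how a single factor can account for a large share of the variance in a battery of correlated subtests. We use an eigendecomposition of the correlation matrix, a simple relative of the factor-analytic methods just described.

```python
# Illustrative sketch (not the ASVAB data): build a correlation matrix for
# six hypothetical subtests that all draw on one common factor, then ask
# how much variance a single factor recovers via an eigendecomposition.
import numpy as np

loadings = np.array([0.9, 0.85, 0.8, 0.75, 0.7, 0.65])  # assumed factor loadings
R = np.outer(loadings, loadings)     # off-diagonal correlations implied by one factor
np.fill_diagonal(R, 1.0)             # unit self-correlations

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
share = eigvals[0] / R.shape[0]      # proportion of total variance in the first factor
print(f"variance accounted for by the first factor: {share:.0%}")
# In this invented example, the single factor carries about two-thirds
# of the variance, dwarfing every remaining factor.
```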

For the ASVAB, 64 percent of the variance among the ten subtest scores is accounted for by a single factor, g. A second factor accounts for another 13 percent. With three inferred factors, 82 percent of the variance is accounted for.6 The intercorrelations indicate that people do vary importantly in some single, underlying trait and that those variations affect how they do on every test. Nor is the predominance of g a fortuitous result of the particular subtests in the ASVAB. The air force’s aptitude test for prospective officers, the AFOQT (Air Force Officer Qualifying Test), similarly has g as its major source of individual variation.7 Indeed, all broad-gauged test batteries of cognitive ability have g as their major source of variation among the scores people get.8

The naive theory assumes that when scores on two subtests are correlated, it is because of overlapping content. But it is impossible to make sense of the varying correlations between the subtests in terms of overlapping content. Consider again the correlation between Arithmetic Reasoning and Mathematics Knowledge, which is the highest of all. It may seem to rest simply on a knowledge of mathematics and arithmetic. However, the score on Numerical Operations is less correlated with either of those two tests than the two are with each other. Content provides no clue as to why. Arithmetic Reasoning has only word problems on it; Mathematics Knowledge applies the basic methods of algebra and geometry; and Numerical Operations is an arithmetic test. Why are scores on algebra and geometry more similar to those on word problems than to those on arithmetic? Such variations in the correlations between the subtests arise, in fact, less from common content than from how much they draw on the underlying ability we call g. The varying correlations between the subtests preclude explaining g away as, for example, simply a matter of test-taking ability or test-taking experience, which should affect all tests more or less equally. We try to make some of these ideas visible in the figure below.

The relation of the ASVAB subtests to each other and to g

[Figure: scatterplot of the ten ASVAB subtests, with each subtest’s average correlation with the other nine subtests on the horizontal axis and its correlation with g on the vertical axis]

For each subtest on the ASVAB, we averaged the nine correlations with each of the other subtests, and that average correlation defines the horizontal axis. The vertical axis is a measure, for each subtest, of the correlation between the score and g.9 The two-letter codes identify the subtests. At the top is General Science (GS), closely followed by Word Knowledge (WK) and Arithmetic Reasoning (AR), for which the scores are highly correlated with g and have the highest average correlations with all the subtests. Another three subtests—Mathematics Knowledge (MK), Paragraph Comprehension (PC), and Electronics Information (EI)—are just slightly below the top cluster in both respects. At the bottom are Coding Speed (CS), Auto/Shop Information (AS), Numerical Operations (NO), and Mechanical Comprehension (MC), subtests that correlate, on the average, the least with the other subtests and are also the least correlated with g (although still substantially correlated in their own right). The bottom group includes the two speeded subtests, CS and NO, thereby refuting another common misunderstanding about g, which is that it refers to mental speed and little more. Virtually without exception, the more dependent a subtest score is on g, the higher is its average correlation with the other subtests. This is the pattern that betrays what g means—a broad mental capacity that permeates performance on anything that challenges people cognitively. A rough rule of thumb is that items or tests that require mental complexity draw more on g than items that do not—the difference, for example, between simply repeating a string of numbers after hearing them once, which does not much test g, and repeating them in reverse order, which does.10
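The pattern in the figure, higher g loading going with higher average correlation, falls directly out of a one-factor model. The following Python sketch, with five hypothetical subtests and invented loadings, illustrates it.

```python
# Sketch of the figure's pattern under a one-factor model: a subtest's
# average correlation with the other subtests rises with its g loading.
# Loadings here are assumed for illustration, not the ASVAB values.
import numpy as np

rng = np.random.default_rng(1)
loadings = np.array([0.85, 0.75, 0.65, 0.55, 0.45])  # invented g loadings
n = 20_000
g = rng.normal(size=n)                               # each examinee's general factor
# Each subtest = g weighted by its loading, plus independent specific variance.
scores = g[:, None] * loadings + rng.normal(size=(n, 5)) * np.sqrt(1 - loadings**2)

R = np.corrcoef(scores, rowvar=False)
avg_r = (R.sum(axis=0) - 1) / (len(loadings) - 1)    # mean correlation with the others

# Average correlations fall in the same rank order as the loadings.
assert list(np.argsort(avg_r)) == list(np.argsort(loadings))
```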

The four subtests used in the 1989 scoring version of the AFQT (the one used throughout the text) and their g loadings are Word Knowledge (.87), Paragraph Comprehension (.81), Arithmetic Reasoning (.87), and Mathematics Knowledge (.82).11 The AFQT is thus one of the most highly g-loaded tests in use. By way of comparison, the factor loadings for the eleven subtests of the Wechsler Adult Intelligence Scale (WAIS) range from .63 to .83, with a median of .69.12 Whereas the first factor, g, accounts for over 70 percent of the variance in the AFQT, it accounts for only 53 percent in the WAIS.

Correlations of the AFQT with Other IQ Tests

Our second approach to the question, Is the AFQT an IQ test? is to ask how the AFQT correlates with other well-known standardized mental tests (see the table below). We can do so by making use of the high school transcript survey conducted by the NLSY in 1979. In addition to gathering information about grades, the survey picked up any other IQ test that the student had taken within the school system. The data usually included both the test score and the percentile rank, based on national norms. In accordance with the recommendation of the NLSY User’s Manual, we use percentiles throughout.13

Correlations of the AFQT with Other IQ Tests in the NLSY

Test                                      Sample    Correlation with the AFQT
California Test of Mental Maturity          356              .81
Coop School and College Ability Test        121              .90
Differential Aptitude Test                  443              .81
Henmon-Nelson Test of Mental Maturity       152              .71
Kuhlmann-Anderson Intelligence Test          36              .80
Lorge-Thorndike Intelligence Test           170              .72
Otis-Lennon Mental Ability Test             530              .81

The magnitudes of the correlations between the AFQT (using the age-referenced percentile scores) and classic IQ tests are as high as or higher than the observed correlations of the classic IQ tests with each other. For example, the best-known adult test, the WAIS, is known to correlate (using the median correlation across various studies, and not correcting for restriction of range in the samples) with the Stanford-Binet at .77, with Raven’s Standard Progressive Matrices at .72, the SRA Non-verbal test at .81, the Peabody Picture Vocabulary Test at .83, and the Otis at .78.14 The table below summarizes the intercorrelations of IQ tests, based on the comparisons assembled by Arthur Jensen as of 1980, adding a line for the AFQT comparisons from the NLSY. The AFQT compares favorably with the other major IQ tests by this measure, which in turn is consistent with the high g loading of the AFQT.

Correlations of the Major IQ Tests with Other Standardized Mental Tests

Test                                          Median Correlation with Other Mental Tests
AFQT (age-referenced, 1989 scoring)                            .81
Wechsler-Bellevue I                                            .73
Wechsler Adult Intelligence Scale (WAIS)                       .77
Wechsler Intelligence Scale for Children                       .64
Stanford-Binet                                                 .71

Source: Jensen 1980, Table 8.5, and authors’ analysis of the NLSY.

HOW SENSITIVE ARE THE RESULTS TO THE ASSUMPTION THAT IQ IS NORMALLY DISTRIBUTED?

Any good test designed to measure a complex ability (whether a test of cognitive ability or carpentry ability) will have several characteristics that common sense says are desirable: a large number of items, a wide range of difficulty among the items, no marked gaps in the difficulty of the items, a variety of types of items, and items that have some relationship to each other (i.e., are to some degree measuring the same thing).15 Empirically, tests with these characteristics, administered to a representative sample of those for whom the test is intended, will yield scores that are spread out in a fashion resembling a normal distribution, or a bell curve. In this sense, tests of mental ability are not designed to produce normally distributed scores; that’s just what happens, the same way that height is normally distributed without anyone planning it.
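The claim that bell-shaped score distributions emerge rather than being designed can be illustrated by simulation. In this hypothetical sketch, examinees of normally distributed ability answer 100 items of varying difficulty; the logistic item-response curve and all parameters are assumptions for illustration only.

```python
# Sketch of why total scores resemble a bell curve: summing many items of
# varying difficulty yields an approximately normal distribution of totals,
# without anyone designing it that way. All parameters are invented.
import numpy as np

rng = np.random.default_rng(2)
n_people, n_items = 20_000, 100
ability = rng.normal(size=n_people)
difficulty = rng.uniform(-2, 2, size=n_items)

# Probability of answering each item correctly rises with ability
# (a logistic item-response curve, assumed purely for illustration).
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
totals = (rng.random((n_people, n_items)) < p).sum(axis=1)

# The distribution of totals is nearly symmetric: its skewness is close
# to zero, as for a normal distribution.
z = (totals - totals.mean()) / totals.std()
print("skewness:", round(float((z**3).mean()), 2))
```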

It is also true, however, that tests are usually scored and standardized under the assumption that intelligence is normally distributed, and this has led to allegations that psychometricians have bamboozled people into accepting that intelligence is normally distributed, when in fact it may just be an artifact of the way they choose to measure intelligence. For a response to such allegations, Chapter 4 of Arthur Jensen’s Bias in Mental Testing (New York: Free Press, 1980) remains the best discussion we have seen.

For purposes of assessing the analyses in this book, it may help readers to know the extent to which any assumptions about the distribution of AFQT scores might have affected the results, especially since we rescored the AFQT to correct for skew (see Appendix 2). The descriptive statistics showing the breakdown of each variable by cognitive class, presented in each chapter of Part II, address that issue. Assignment to cognitive classes was based on the subject’s rank within the distribution, and these ranks are invariant no matter what the normality of the distribution might be. Ranks were also unaffected by the correction for skew.
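That ranks, and hence cognitive-class assignments, are invariant under any monotone rescoring (such as a correction for skew) is easy to verify directly. The scores and the particular transformation below are invented.

```python
# Sketch of the rank-invariance point: a monotone rescoring leaves every
# subject's rank, and therefore every centile-based class assignment,
# unchanged. Scores and the transformation are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
scores = rng.normal(size=1000)
rescored = np.exp(scores / 2)            # a stand-in monotone "skew correction"

# Ranks are identical under the rescoring.
ranks = scores.argsort().argsort()
ranks_rescored = rescored.argsort().argsort()
assert (ranks == ranks_rescored).all()

# Centile-based class assignments (5/25/75/95 cutoffs, as for the
# cognitive classes) are therefore identical as well.
cuts = [5, 25, 75, 95]
classes = np.digitize(scores, np.percentile(scores, cuts))
classes_rescored = np.digitize(rescored, np.percentile(rescored, cuts))
assert (classes == classes_rescored).all()
```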

The descriptive statistics in the text were bivariate. To examine this issue in a multivariate framework, we replicated the analyses of Part II, substituting a set of nominal variables denoting the cognitive classes for the continuous AFQT measure. That is, the regression treated “membership in Class I” as a nominal variable, just as it would treat “married” or “Latino” as a nominal characteristic—and similarly for the other four cognitive classes, also entered as nominal variables (see Appendix 4 for a discussion of how to interpret the coefficients for nominal variables as created by the software used in these analyses, JMP 3.0). Below, we show the results for the opening analysis of Part II (Chapter 5), the probability of being in poverty.

Comparison of results when AFQT is treated as a continuous, normally distributed variable and when it is treated as a set of nominal categories based on groupings by centile

[Figure: plot of the estimated probability of being in poverty against AFQT score under the two specifications, continuous and categorical]

Note: For computing the plot, age and SES were set at their mean values.

Whole-Model Test

Source      DF     -LogLikelihood    ChiSquare    Prob>ChiSq
Model          3       477222.0      954443.9      0.000000
Error       4488      4587166.7
C Total     4491      5064388.7

RSquare (U): 0.0942
Observations: 4,492

Parameter Estimates

Term         Estimate       Std Error     ChiSquare    Prob>ChiSq
Intercept    -2.6579692     0.0009826         .          0.0000
zAFQT89      -0.8177031     0.0012228     447179         0.0000
zSES         -0.2744971     0.0011661      55416         0.0000
zAge         -0.0482156     0.0009187      2754.1        0.0000

These are the results using the categorization into cognitive classes by centile:

Whole-Model Test

Source      DF     -LogLikelihood    ChiSquare    Prob>ChiSq
Model          6       383494.7      766989.4      0.000000
Error       4485      4680894.0
C Total     4491      5064388.7

RSquare (U): 0.0757
Observations: 4,492

Parameter Estimates

Term             Estimate       Std Error     ChiSquare    Prob>ChiSq
Intercept        -2.5097718     0.0015823         .          0.0000
CogClas.[1-5]    -1.0067168     0.0050693      39439         0.0000
CogClas.[2-5]    -0.6803606     0.0025486      71265         0.0000
CogClas.[3-5]    -0.1905042     0.0018498      10606         0.0000
CogClas.[4-5]     0.64764109    0.0021336      92138         0.0000
zSES             -0.3902981     0.0011276     119800         0.0000
zAge             -0.1605992     0.000907       31350         0.0000

We repeated these comparisons for a broad sampling of the outcome variables discussed in Part II. The results for poverty were typical. When the results for the two expressions of IQ do not correspond (e.g., the relationship of mother’s IQ to low birth weight, as discussed in Chapter 10), the lack of correspondence also showed up in the bivariate table showing the breakdown by cognitive class. To put it another way, the results presented in the text using IQ as a continuous, normally distributed variable are reproduced as well when IQ is treated as a set of categories. Any exceptions may be identified through the bivariate tables based on cognitive class.

RELATIONSHIP OF THE AFQT SCORE TO EDUCATION AND PARENTAL SES

The relationship of an IQ test score to education and socioeconomic background is a constant and to some extent unresolvable source of controversy. It is known that the environment (including exposure to education) affects realized cognitive ability. To that extent, it is conceptually appropriate that parental SES and years of education show an independent causal effect on IQ. On the other hand, an IQ test score is supposed to represent cognitive ability and to have an independent reality of its own; in other words, it should not simply be a proxy measure of either parental SES or years of education. The following discussion elaborates on the statistical relationship of both parental SES and years of education to the AFQT score.

The Socioeconomic Status Index and the AFQT Score

The SES index consists of four indicators as described in Appendix 2: mother’s and father’s years of education, the occupational status of the parent with the higher-status job, and the parents’ total family income in 1979-1980. The correlations of the index and its four constituent variables with the AFQT are in the table below.

Intercorrelations of the AFQT and the Indicators in the Socioeconomic Status Index

Indicator              Correlation with AFQT
Mother’s education            .43
Father’s education            .46
Occupational status           .43
Family income                 .38
SES index                     .55

The correlation of AFQT with the SES index itself is .55, consistent with other investigations of this topic.16

There are three broad interpretations of these correlations:

1. Test bias. IQ test scores are artificially high for persons from high-status backgrounds because the tests are biased in favor of people from high-status homes.

2. Environmental advantage. IQ tends to be genuinely higher for children from high-status homes, because they enjoy a more favorable environment for realizing their cognitive ability than do children from low-status homes.

3. Genetic advantage. IQ tends to be genuinely higher for children from high-status homes because they enjoy a more favorable genetic background (parental SES is a proxy measure for parental IQ).

The first explanation is discussed in Appendix 5. The other two explanations have been discussed at various points in the text (principally Chapter 4’s discussion of heritability, Chapter 10’s discussion of parenting styles, and Chapter 17’s discussion of adoption). To summarize those discussions, being brought up in a conspicuously high-status or low-status family from birth probably has a significant effect on IQ, independent of the genetic endowment of the parents. The magnitude of this effect is uncertain. Studies of adoption suggest that the average is in the region of six IQ points, given the difference in the environments provided by adopting and natural parents. Outside interventions to augment the environment have had only an inconsistent and uncertain effect, although it remains possible that larger effects might be possible for children from extremely deprived environments. In terms of the topic of this appendix, the flexibility of the AFQT score, the AFQT was given at ages 15-23, when the effect of socioeconomic background on IQ had already played whatever independent role it might have.

Years of Education and the AFQT Score

For the AFQT as for other IQ tests, scores vary directly with educational attainment, leaving aside for the moment the magnitude of reciprocal cause and effect. But to what extent could we expect that, had we managed to keep low-scoring students in school for another year or two, their AFQT scores would have risen appreciably?

Chapter 17 laid out the general answer from a large body of research: Systematic attempts to raise IQ through education (exemplified by the Venezuelan experiment and the analyses of SAT coaching) can indeed have an effect on the order of .2 standard deviation, or three IQ points. As far as anyone can tell, there are diminishing marginal benefits of this kind of coaching (taking three intensive SAT coaching programs in succession will raise a score by less than three times the original increment).

We may explore the issue more directly by making use of the other IQ scores obtained for members of the NLSY. Given scores that were obtained several years earlier than the AFQT score, to what extent do the intervening years of education appear to have elevated the AFQT?

Underlying the discussion is a simple model:

[Diagram: path model in which the earlier IQ score points both to years of education and to the AFQT score, and years of education in turn points to the AFQT score]

The earlier IQ score affects years of education and is also a measure of the same thing that the AFQT measures. Meanwhile, the years of education add something (we hypothesize) to the AFQT score that would not otherwise have been added.

Actually testing the model means bringing in several complications, however. The elapsed time between the earlier IQ test and the AFQT test presumably affects the relationships. So does the age of the subject (a subject who took the test at age 22 had a much different “chance” to add years of education than did a subject who took the test at age 18, for example). The age at which the earlier IQ test was taken is also relevant, since IQ test scores are known to become more stable at around the age of 6. But the main point of the exercise may be illustrated straightforwardly. We will leave the elaboration to our colleagues.

The database consists of all NLSY students who had an earlier IQ test score, as reported in the table on page 596, plus students with valid Stanford-Binet and WISC scores (too few to report separately). We report the results for two models in the table below, with the AFQT score as the dependent variable in both cases. In the first model, the explanatory variables are the earlier IQ score, elapsed years between the two tests, and type of test (entered as a vector of dummy variables). In the second model, we add years of education as an independent variable. An additional year of education is associated with a gain of 3.2 centiles, in line with other analyses of the effects of education on IQ.17 What happens if the dependent variable is expressed in standardized scores rather than percentiles? In that case (using the same independent variables), the independent effect of education is to increase the AFQT score by .11 standard deviation—also in line with other analyses.
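The logic of the two models can be illustrated by simulation. All parameter values below are invented; the point is that adding years of education as a regressor isolates an education effect that the first model folds into the earlier-IQ coefficient.

```python
# Sketch of the two-model comparison under assumed data: an earlier IQ
# percentile drives both later education and the AFQT, and education has
# its own independent effect. All parameter values are invented.
import numpy as np

rng = np.random.default_rng(5)
n = 1404
early = rng.uniform(0, 100, size=n)                 # earlier IQ percentile
educ = 8 + early / 20 + rng.normal(0, 1.5, size=n)  # years of education
afqt = 0.75 * early + 3.0 * educ + rng.normal(0, 10, size=n)

# Model 1: AFQT on earlier IQ only. Model 2: add years of education.
X1 = np.column_stack([np.ones(n), early])
X2 = np.column_stack([np.ones(n), early, educ])
b1, *_ = np.linalg.lstsq(X1, afqt, rcond=None)
b2, *_ = np.linalg.lstsq(X2, afqt, rcond=None)

# Model 2 recovers a separate education effect of roughly 3 centiles per
# year; in Model 1 that effect is absorbed into the earlier-IQ slope.
print("education coefficient:", round(float(b2[2]), 1))
assert b1[1] > b2[1]   # omitting education inflates the IQ coefficient
```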

The Independent Effect of Education on AFQT Scores as Inferred from Earlier IQ Tests

Dependent variable: AFQT percentile score

Independent Variables              Model 1                    Model 2
                              Coefficient  Std. Error    Coefficient  Std. Error
Intercept                        12.312       1.655        -14.331       2.780
Earlier IQ percentile score        .787        .016           .736        .016
Elapsed years between tests       -.316        .166         -1.288        .179
Years of education                                           3.185        .273
Type of test (entered as a vector of nominal variables; coefficients not shown)
No. of observations               1,404                      1,404
R2 (Adjusted)                      .656                       .686

We caution against interpreting these coefficients literally across the entire educational range. Whereas it may be reasonable to think about IQ gains for six additional years of education when comparing subjects who had no schooling versus those who reached sixth grade, or even comparing those who dropped out in sixth grade and those who remained through high school, interpreting these coefficients becomes problematic when moving into post-high school education.

The negative coefficient for “elapsed years between tests” in the table above is worth mentioning. Suppose that the true independent relationship between years of education and AFQT is negatively accelerated—that is, the causal importance of the elementary grades in developing a person’s IQ is greater than the causal role of, say, graduate school. If so, then the more years of separation between tests, the lower would be the true value of the dependent variable, AFQT, compared to the predicted value in a linear regression, because people with many years of separation between tests in the sample are, on average, getting less incremental benefit of years of education than the sample with just a few years of separation. The observed results are consistent with this hypothesis.
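The sense in which a negatively accelerated education effect implies diminishing per-year gains can be shown numerically with an assumed log-shaped effect (the functional form and scale are invented for illustration):

```python
# Minimal numeric sketch of "negatively accelerated": under an assumed
# concave (log-shaped) education effect, each additional year of schooling
# raises the predicted score by less than the year before, so subjects
# with many years between tests gain less per year than a linear fit
# would predict. The 25*log form is an invented stand-in.
import numpy as np

def effect(years):
    return 25 * np.log(years)          # assumed concave education effect

gains = np.diff(effect(np.arange(6, 17)))   # per-year gains, grades 6 through 16
assert (np.diff(gains) < 0).all()           # every increment smaller than the last
print([round(float(g), 1) for g in gains[:3]])   # → [3.9, 3.3, 2.9]
```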