The Bell Curve: Intelligence and Class Structure in American Life - Richard J. Herrnstein, Charles Murray (1996)

Appendix 2. Technical Issues Regarding the National Longitudinal Survey of Youth

This appendix provides details about the variables used in the text and about other technical issues associated with the NLSY.1 Colleagues who wish to recreate analyses will need additional information, which may be obtained from the authors.2


Our use of the NLSY extends through the 1990 survey year.3

All dollar figures are expressed in 1990 dollars, using the consumer price index inflators as reported in the 1992 edition of Statistical Abstract of the United States, Table 737.

Sample weights were employed in all analyses in the main text. To simplify the description, we do not note this in each instance. In computing scores that were based on the 11,878 subjects who had valid scores on the Armed Forces Qualification Test (AFQT), we used the sampling weights specifically assigned for the AFQT population. For analyses based on the NLSY subjects’ status as of a given year (usually 1990), we used the sampling weights for that survey year. For analyses in which the children of NLSY women were the unit of analysis, the child’s sampling weights were used rather than the mother’s.

To make interpretation of the statistical significance easier, we replicated all the analyses in Part II using just the unweighted cross-sectional sample of whites, as reported in Appendix 4.


The AFQT is a combination of highly g-loaded subtests from the Armed Services Vocational Aptitude Battery (ASVAB) that serves as the armed services’ measure of cognitive ability, described in detail in Appendix 3. Until 1989, the AFQT consisted of the summed raw scores of the ASVAB’s arithmetic reasoning, word knowledge, and paragraph comprehension subtests, plus half of the score on the numerical operations subtest. In 1989, the armed forces decided to rescore the AFQT so that it consisted of the word knowledge, paragraph comprehension, arithmetic reasoning, and mathematics knowledge subtests. The reason for the change was to avoid the numerical operations subtest, which was both less highly g-loaded than the mathematics knowledge subtest and sensitive to small discrepancies in the time given to subjects when administering the test (numerical operations is a speeded test in which the subject completes as many arithmetic problems as possible within a time limit).
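The two composites can be sketched in a few lines. The pre-1989 formula follows the description above directly. The text names the four subtests in the 1989 version but does not say how they are weighted, so the plain sum below is only an illustrative assumption, and the function and argument names are hypothetical.

```python
def afqt_1980(arithmetic_reasoning, word_knowledge, paragraph_comprehension,
              numerical_operations):
    """Pre-1989 composite: three raw subtest scores plus half of numerical operations."""
    return (arithmetic_reasoning + word_knowledge + paragraph_comprehension
            + 0.5 * numerical_operations)


def afqt_1989_unweighted(word_knowledge, paragraph_comprehension,
                         arithmetic_reasoning, mathematics_knowledge):
    """1989 subtest set combined as a plain sum.

    Assumption: equal weights. The text lists the subtests but not their weights.
    """
    return (word_knowledge + paragraph_comprehension
            + arithmetic_reasoning + mathematics_knowledge)
```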

A draft of The Bell Curve was well underway when we became aware of the 1989 scoring scheme. We completed a full draft using the 1980 scoring system but decided that the revised scoring system was psychometrically superior to the old one and therefore replicated all of the analyses using the 1989 version.

Scholars who wish to replicate our analyses should note that the 1989 AFQT score as reported in the NLSY database is not the one used in the text. The NLSY’s variable is rounded to the nearest whole centile and based on the 18- to 23-year-old subset of the NLSY sample. We recomputed the AFQT from scratch using the raw subtest scores, and the population mean and standard deviation used in producing the across-ages AFQT score were based on all 11,878 subjects, not just those ages 18 to 23.4 This measure is useful for multivariate analyses in which age is also entered as an independent variable but should not be used (and is never used in the text) as a representation of an individual subject’s cognitive ability because of age-related differences in test scores (see discussion below).


AFQT scores in the NLSY sample rose by an average of .07 standard deviations per year. The simplest explanation for this is that the AFQT was designed by the military for a population of recruits who would be taking the test in their late teens, and younger subjects in the NLSY sample got lower scores for the same reason that high school freshmen get lower SAT scores than high school seniors. However, a cohort effect could also be at work, whereby (because of educational or broad environmental reasons) youths born in the first half of the 1960s had lower realized cognitive ability than youths born in the last half of the 1950s. There is no empirical way of telling which reason really explains the age-related differences in the AFQT or what the mix of reasons might be. The age-related increase is not perfectly linear (it levels off in the top two years) but close enough that the age problem is best handled in the multivariate analyses by entering the subject’s birthdate as an independent variable (all the NLSY sample took the AFQT within a few months of each other in late 1980).

For all analyses except the multivariate regression analyses, we use age-equated scores. These were produced by using the sample weight as a frequency, then preparing separate distributions by birth year, expressed in centiles.5 Each subject’s rank in that population (mathematically, the “population” is the sum of the sample weights for that birth year) was divided by the population to obtain the centile where that subject fell within his birth year cohort.6
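A minimal sketch of the age-equating step, treating each sample weight as a frequency and ranking within birth year. The function name and the tuple layout are hypothetical, and the handling of tied scores (each tied subject takes the midpoint of the tied rank range) is an assumption, since the text does not describe it here.

```python
from collections import defaultdict


def age_equated_centiles(subjects):
    """Return a weighted centile (0-100) for each subject within its birth year.

    `subjects` is a list of (birth_year, score, sample_weight) tuples.
    """
    by_year = defaultdict(list)
    for index, (birth_year, score, weight) in enumerate(subjects):
        by_year[birth_year].append((score, weight, index))

    centiles = [0.0] * len(subjects)
    for members in by_year.values():
        # The "population" is the sum of the sample weights for the birth year.
        population = sum(weight for _, weight, _ in members)

        # Total weight carried by each distinct score.
        weight_at = defaultdict(float)
        for score, weight, _ in members:
            weight_at[score] += weight

        # Cumulative weight strictly below each distinct score.
        below = {}
        running = 0.0
        for score in sorted(weight_at):
            below[score] = running
            running += weight_at[score]

        for score, _, index in members:
            # Assumption: ties share the midpoint of their weighted rank range.
            rank = below[score] + 0.5 * weight_at[score]
            centiles[index] = 100.0 * rank / population
    return centiles
```

Because each birth year is ranked separately, a score of 50 can fall at different centiles in different cohorts.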

That AFQT scores vary according to education raises an additional issue: To what extent is the AFQT a measure of cognitive ability, and not just length and quality of education? We explore this issue at length in Appendix 3.


The distribution of the AFQT in either of its versions is skewed so that the high scores tend to be more closely bunched than the low scores. To put it roughly, the most intelligent people who take the test have less of an opportunity to get a high score than the least intelligent people have to get a low score. One effect is to limit artificially the maximum size of a standardized score. It is artificial because the AFQT does in fact discriminate reasonably well at the high end of the scale. For example, only 22 youths out of 11,878 in the NLSY with valid AFQT scores earned perfect scores on the subtests, representing 0.253 percent of the national population of their age (using sampling weights). In a test with a normal distribution, those youths would have had a standardized score of 2.80. But given the skew in the NLSY, it is impossible for anyone to have a standardized score higher than 1.66. The standard deviation for a high-scoring group is similarly squeezed.

A certain amount of skew is not a concern for many kinds of analysis. For the analyses in The Bell Curve, however, the difference between two groups is often expressed in terms of standard deviations, and the size of that difference was likely to be affected by skew.

We therefore computed standardized scores corrected for skew, first by computing the centile scores for the NLSY population, using sample weights as always, then assigning to each subject the standardized score corresponding to that centile in a normal distribution. We did this for both the old and new versions of the AFQT. Following armed forces’ convention, all scores more than 3 standard deviations above or below the mean were set at 3 standard deviations from the mean (this affected only a small number of scores at the low end of the distribution).
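The mapping from centile to normalized score can be sketched with the inverse normal CDF from Python's standard library. The function name is hypothetical, and the sketch assumes the weighted centile has already been computed and lies strictly between 0 and 100.

```python
from statistics import NormalDist


def normalized_score(centile, limit=3.0):
    """Map a centile (strictly between 0 and 100) to a standard-normal z-score.

    Scores beyond +/-limit standard deviations are set to the limit.
    """
    z = NormalDist().inv_cdf(centile / 100.0)
    return max(-limit, min(limit, z))
```

As a check against the example given earlier, a group making up the top 0.253 percent of a normal distribution begins at about `normalized_score(100 - 0.253)`, which is roughly 2.80.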

The effects of correcting for skew were noticeable when expressing differences between groups. For example, for the most sensitive group comparison, between ethnic groups, the results are shown in the following table. As always when full information about means, standard deviations, and sample sizes is available, the group differences are computed using the weighted average of the groups’ standard deviations. The equation is given in note 25 for Chapter 13. The primary effect of the skew was to squeeze the standard deviation of the higher-scoring group (whites) and, in comparison, elongate the standard deviation of the lower scoring groups. Correcting for skew thus shrank both the black-white and Latino-white differences. The same phenomenon affected all comparisons involving subgroups with markedly different AFQT means. All standardized AFQT scores, for both the regression analyses and the age-equated scores, are therefore corrected for skew. In other words, each represents the standardized score in a normal distribution that corresponds to the (unrounded) centile score of the subject in the observed distribution.
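The equation itself appears in the note cited above, not in this appendix. As a hedged illustration only, one common way to express a group difference in standard deviation units, using a sample-size-weighted average of the two groups' variances, is sketched below. The exact form is an assumption and the function name is hypothetical.

```python
from math import sqrt


def standardized_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference between two group means in pooled standard deviation units.

    Assumption: the pooled SD is the square root of the size-weighted
    average of the two group variances.
    """
    pooled_sd = sqrt((n_a * sd_a ** 2 + n_b * sd_b ** 2) / (n_a + n_b))
    return (mean_a - mean_b) / pooled_sd
```

Under this form, squeezing the standard deviation of one group while elongating the other changes the pooled denominator, which is why the skew correction alters the size of the reported differences.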

Comparison of Two Versions of the AFQT, Uncorrected and Corrected for Skew

[Table. Columns: Version of the AFQT; Corrected for Skew?; Black/White Difference; Latino/White Difference. The only surviving row label is “1989 revision”; the cell values were not recoverable.]

The effects of the different scoring methods on ethnic differences raise a larger question that we should answer directly: How would the results presented in this book be different if we had used the 1980 version of the AFQT instead of the 1989 version, or if we had not corrected for skew? For most analyses, the answer is that the results are unaffected. But it may also be said that whenever differences were found, the scoring procedure we used tended to produce smaller relationships between IQ and the indicators, and smaller ethnic differences, than the alternatives. We did not compute every analysis by each of the four scoring permutations, but we did replicate all of the analyses using the two extremes (the 1980 version uncorrected for skew and the 1989 version corrected for skew). In no instance did the 1989 version corrected for skew (the version reported in the text) yield significant findings that were not also found when using the 1980 uncorrected version. In terms of the relationships explored in this book, the 1989 version corrected for skew is the most conservative of the alternatives.

Why Not Just Use Centiles?

One way of avoiding the skew problem is to leave the AFQT scores in centiles. This was unsatisfactory, however, for we knew from collateral data that much of the important role of IQ occurs at the tails of the distribution. Using centiles throws away information about the tails. (See Appendix 1 on the normal distribution.)


The SES index was created with the variables that are commonly used in developing measures of socioeconomic status: education, income, and occupation. Since the purpose of the index was to measure the socioeconomic environment in which the NLSY youth was raised, the specific variables employed referred to the parents’ status: total net family income, mother’s education, father’s education, and an index of occupational status of the adults living with the subject at the age of 14. The population for the computation was limited to the 11,878 NLSY subjects with valid AFQT scores. In more detail:

Mother’s education and father’s education were based on years of education, converted to standardized scores.

Family income was based on the averaged total net family income for 1978 and 1979, in constant dollars, when figures for both years were available. If income for only one of the two years was reported, that year was used. Family income was excluded if the subject was a Schedule C interviewee (the reported income for the year in question referred to his or her own income, not to the parental household’s income). The dollar figure was expressed as a logarithm before being standardized. This procedure, customary when working with income data, has the effect of discounting extremely high values of income and permitting greater discrimination among lower incomes. A minimum standardized value of −4 was set for incomes of less than $1,000 (all figures are in 1990 dollars).

Parental occupation was coded with a modified version of the Duncan socioeconomic index, grouping the Duncan values (which go from 1 to 100) into deciles. A value of −1 was assigned to persons out of the labor force altogether. It was assumed that the family’s socioeconomic status is predominantly determined by the higher of the two occupations held by two parents. Thus the occupational variable was based on the higher of the two ratings of the two parents. The increment in socioeconomic status represented by both parents holding high-status occupations is indirectly reflected in the higher income and in the two educational variables. The eleven values in the modified Duncan scale were standardized.
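The income and occupation components described above can be sketched as follows. All function and parameter names are hypothetical. The numbering of the deciles (1 through 10) is an assumption, as is the use of `None` to mark a missing income year or a parent out of the labor force. The mean and standard deviation of log income must be supplied by the caller; the base of the logarithm does not affect the standardized value.

```python
from math import log


def parental_income(income_1978, income_1979, own_income_reported=False):
    """Average the available years of family income; None when unusable."""
    if own_income_reported:  # reported income is the subject's own, not the parents'
        return None
    values = [v for v in (income_1978, income_1979) if v is not None]
    if not values:
        return None
    return sum(values) / len(values)


def standardized_log_income(income, mean_log, sd_log, floor=-4.0, floor_below=1000.0):
    """Standardize log income; incomes under $1,000 are set to the floor of -4."""
    if income < floor_below:
        return floor
    return (log(income) - mean_log) / sd_log


def duncan_decile(duncan_value):
    """Group an integer Duncan value (1-100) into a decile; None = out of the labor force."""
    if duncan_value is None:
        return -1
    # Assumption: values 1-10 map to decile 1, 11-20 to decile 2, ..., 91-100 to decile 10.
    return (duncan_value - 1) // 10 + 1


def parental_occupation(mother_duncan, father_duncan):
    """Use the higher of the two parents' occupational ratings."""
    return max(duncan_decile(mother_duncan), duncan_decile(father_duncan))
```

With ten deciles plus the value of -1, the occupational variable takes eleven possible values, matching the count given in the text.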

The reliability of the four-indicator index (Cronbach’s α) is .76. The correlations among the components of the index are shown in the table. The four variables were summed and averaged. If only a subset of variables had valid scores, that subset was summed and averaged. By far the most common missing variable was family income, since many of the NLSY youths were already living in independent households as of the beginning of the survey, and hence were reporting their own income, not parental income. Overall, data were available on all four indicators for 7,447 subjects, on three for an additional 3,612, on two for 679, and on one for 138. Two subjects with valid scores on the AFQT had no information available on any of the four indicators. For use in the regression analyses, the SES index scores were set to a mean of 0 and a standard deviation of 1.
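The averaging rule for subjects with missing indicators can be sketched as below. The function names are hypothetical, and the final rescaling is shown unweighted for brevity, which is a simplification: the analyses in the text use sample weights.

```python
from statistics import mean, pstdev


def ses_raw(components):
    """Average whichever standardized indicators are present (None = missing)."""
    present = [c for c in components if c is not None]
    return mean(present) if present else None


def standardize(values):
    """Rescale the non-missing values to mean 0 and standard deviation 1.

    Simplification: unweighted. Missing entries (None) are passed through.
    """
    present = [v for v in values if v is not None]
    m, s = mean(present), pstdev(present)
    return [None if v is None else (v - m) / s for v in values]
```

A subject with only mother's education and parental occupation available, for example, would receive the mean of those two standardized values as the raw index score.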

Correlations of Indicators in the Socioeconomic Status Index

[Table. Columns: Mother’s Education; Father’s Education; Parental Occupation. Rows: Father’s education; Parental occupation; Family income. The correlation values were not recoverable.]


Highest Grade Completed

The NLSY creates a variable each year for “highest grade completed,” incorporating information from several questions.7 For analyses based on the occurrence of an event (e.g., the birth of a child), the value of “highest grade completed” for the contemporaneous survey year is used. For all other analyses, the 1990 value for “highest grade completed” is used. Values run from 0 through 20.

Highest Degree Ever Received

In the 1988-1990 surveys, the NLSY asked respondents to report the highest degree they had ever received. The possible responses were: high school diploma, associate degree, bachelor of arts, bachelor of science, master’s, Ph.D., professional degree (law, medicine, dentistry), and “other.” These self-reported degrees were sometimes questionable, especially when the degree did not correspond to the number of years of education (e.g., a bachelor’s degree for someone who also reported only fourteen years of education). To eliminate the most egregiously suspicious cases, we made adjustments. For those who reported their highest degree as being a high school diploma, we required at least eleven reported years of completed education. For degrees beyond the high school diploma, we required that the report of the highest grade completed be within one year of the normal number of years required to obtain that degree. Specifically, the minimum number of completed years of education required to use a reported degree was thirteen for an associate’s degree, fifteen for a bachelor’s degree, sixteen for a master’s degree, and eighteen for a Ph.D., law degree, or medical degree.
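The minimum-years rule can be written as a simple lookup. The degree labels and function name are hypothetical. Treating every professional degree (including dentistry) under the eighteen-year minimum is an assumption, and so is returning `None` for a rejected report, since the text does not say how rejected cases were recoded.

```python
# Minimum completed years of education required to accept a reported degree.
MIN_YEARS_FOR_DEGREE = {
    "high school diploma": 11,
    "associate": 13,
    "bachelor": 15,
    "master": 16,
    "phd": 18,
    "professional": 18,  # assumption: applies to all professional degrees
}


def accepted_degree(reported_degree, highest_grade_completed):
    """Return the reported degree if it passes the minimum-years check, else None.

    Labels not in the lookup (for example "other") also return None.
    """
    minimum = MIN_YEARS_FOR_DEGREE.get(reported_degree)
    if minimum is None or highest_grade_completed < minimum:
        return None
    return reported_degree
```

The example in the text, a bachelor’s degree reported alongside fourteen years of education, fails the check because fourteen is below the minimum of fifteen.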

We also employed the NLSY’s variables to discriminate between those whose terminal degree was a high school diploma versus a GED. We excluded the 190 persons whose degree was listed as “other,” after trying fruitlessly to come up with a satisfactory means of estimating what the “other” meant from collateral educational data.

The “high school” and “college graduate” samples used throughout Part II are designed to isolate populations with homogeneous educational experiences as of the 1990 survey year. The high school sample is defined as those who reported twelve years of completed education and a high school diploma received through the normal process (i.e., excluding GEDs) as the highest attained degree. The college graduate sample is defined as all those who reported sixteen years of completed education and a B.A. or B.S. as the highest attained degree.

Transition to College

In Chapter 1, we used the NLSY to determine the percentage of students in various IQ groupings who went directly to college. We limited the analysis to students who obtained a high school diploma between January 1980 and July 1982, meaning that all subjects had taken the AFQT prior to attending college. The analysis thus also reflects the experience of those who obtain their high school diploma via the normal route (comparable to the analyses from the 1960s and 1920s, which are also reported in the same figure). A subject is classified as attending college in the year following graduation if he reported having enrolled in college at any point in the calendar year following the date of graduation.


All variables relating to marital history and childbearing employed the NLSY’s synthesis as contained in the 1990 Fertility File of the NLSY.


The most commonly reported measure of a problematic birth weight is “low birth weight,” defined as no more than 5.5 pounds. In its raw form, however, low birth weight is limited as a measure because it is confounded with prematurity. A baby born five weeks prematurely will probably weigh less than 5.5 pounds and yet be a fully developed, healthy child for gestational age, with excellent prospects. Conversely, a child carried to term but weighing slightly more than the cutoff of 5.5 pounds is (given parents of average stature) small for its gestational age. We therefore created a variable expressing the baby’s birth weight as a ratio of the weight for fetuses at the 50th centile for that gestational age, using the Colorado Intrauterine Growth Charts as the basis for the computation. If a baby weighed less than 5.5 pounds but the ratio was equal to or greater than 1, that case was excluded from the analysis. All uses of this variable in Chapters 10 and 13 are based on a sample that is exclusively white (Latino or non-Latino) or black, thereby sidestepping the complications that would be introduced by the populations of smaller stature, such as East Asians. We further excluded cases reporting gestational ages of less than twenty-six weeks, reports of pregnancies that lasted more than forty-four weeks or birth weights in excess of thirteen pounds, and one remarkable case in which a mother reported gestation of twenty-six weeks and a birth weight of more than twelve pounds.
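The ratio and the rule-based exclusions above can be sketched as follows. The function and parameter names are hypothetical, and the 50th-centile weight for a given gestational age must be looked up from the growth charts by the caller; no chart values are reproduced here. The single anomalous case mentioned at the end of the paragraph was a one-off exclusion and is not encoded.

```python
def birth_weight_ratio(weight_lb, median_weight_lb):
    """Birth weight as a ratio of the 50th-centile weight for the gestational age."""
    return weight_lb / median_weight_lb


def keep_case(weight_lb, gestation_weeks, median_weight_lb):
    """Apply the exclusions described in the text; True means the case is retained."""
    # Implausible or extreme gestational ages.
    if gestation_weeks < 26 or gestation_weeks > 44:
        return False
    # Birth weights in excess of thirteen pounds.
    if weight_lb > 13:
        return False
    # Under 5.5 pounds but at or above the median for gestational age:
    # the low weight reflects prematurity, so the case is excluded.
    if weight_lb < 5.5 and birth_weight_ratio(weight_lb, median_weight_lb) >= 1:
        return False
    return True
```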