Now You See It: How the Brain Science of Attention Will Transform the Way We Live, Work, and Learn - Cathy N. Davidson (2011)

Part II. The Kids Are All Right

Chapter 4. How We Measure



“A lie!”

“Cynical and meaningless!”

“A wacko holding forth on a soapbox. If Prof Davidson just wants to yammer and lead discussions, she should resign her position and head for a park or subway platform, and pass a hat for donations.”

Some days, it’s not easy being Prof Davidson.

What caused the ruckus in the blogosphere this time was a blog I posted on the HASTAC site called “How to Crowdsource Grading,” in which I proposed a new form of assessment that I planned to use the next time I taught This Is Your Brain on the Internet.

“The McDonaldization of American education.”

“Let’s crowdsource Prof Davidson’s salary too.”

It was my students’ fault, really. By the end of This Is Your Brain on the Internet, I felt confident I’d taught a pretty impressive course. I settled in with my students’ course evaluations, waiting for the accolades to flow over me, a pedagogical shower of student appreciation. And mostly that’s what I read, thankfully. But there was one group of students who had some candid feedback to offer me for the next time I taught This Is Your Brain on the Internet, and it took me by surprise. They said everything about the course had been bold, new, and exciting.

Everything, that is, except grading.

They pointed out that I had used entirely conventional methods for testing and evaluating their work. We had talked as a class about the new modes of assessment on the Internet—everything from public commenting on products and services to leader boards—where the consumer of content could also evaluate that content. These students said they loved the class but were perplexed that my assessment method had been so twentieth-century. Midterm. Final. Research paper. Graded A, B, C, D. The students were right. You couldn’t get more twentieth-century than that.

It’s hard for students to critique a teacher, especially one they like, but they not only did so, they signed their names to the course evaluations. It turned out these were A+ students, not B students. That stopped me in my tracks. If you’re a teacher worth your salt, you really pay attention when the A+ students say something is wrong.

I was embarrassed that I had overlooked such a crucial part of our brain on the Internet. Assessment is a bit like the famous Heisenberg principle in quantum mechanics: the more precisely you measure for one property, the less precisely you can measure for another. If you’re looking for conventional achievement using conventional measures, then by definition you cannot at the same time be measuring by other criteria or measuring other qualities. In grading them strictly on content based on papers and exams, I had failed to evaluate them on all the really amazing ways that they had contributed to a semester of collaborative research, thinking, and interdisciplinary teamwork—exactly what the course was supposed to be about.

I contacted my students and said they’d made me rethink some very old habits. Unlearning. I promised I would rectify this the next time I taught the course. I thought about my promise for a while, came up with what seemed like a good system, then wrote about it in that “How to Crowdsource Grading” blog.1

“Idiocracy, here we come.”


“David Lodge of our generation—Wherefore Art Thou?” someone offered, apparently thinking that Prof Davidson was ripe for a witty British parody.


REALLY? FOR A COLUMN ON grading? I was taken by surprise again. I had underestimated just how invested nonacademics were in the topic of evaluation. My usual blog posts on the HASTAC site receive several hundred readers in the first week or so, with an audience largely of educators interested in cross-disciplinary thinking for a digital age. Ten thousand readers later, “How to Crowdsource Grading” had crossed over to a different, wider audience. It also garnered writeups in academic news outlets like The Chronicle of Higher Education and Inside Higher Ed, where, by the end of the month, it ranked as the “most read” and “most commented on” article. The Associated Press produced an article about my blog, and that piece turned up in local newspapers from Tulsa to Tallahassee. For a moment or two, “crowdsourcing grading” even reached the pinnacle of celebrity. It became a trending topic on Twitter.

Not all the comments were negative. Many were exceptionally thoughtful and helpful. But a lot of people were unhappy with me. It threatened to be the iPod experiment all over again.



“Prof Davidson is an embarrassment.”

“I give the professor F for failure to account for real-world appropriateness and efficacy.”

This is what happens when you listen to the A+ students in This Is Your Brain on the Internet.

LEARNING TO GIVE AND TAKE feedback responsibly should be a key component of our training as networked citizens of the world. When I built the class that aimed to leave behind a dusty set of old pedagogical tools, I’d given up many of my props as a prof, but not the most basic one: my gradebook. Like Ichabod Crane’s switch or Obi-Wan Kenobi’s lightsaber, the gradebook is the symbol of pedagogical power. Testing and measuring is always the most conservative and traditional feature of any educational enterprise—and the most contested.

It is artificial, too. In the world of work beyond graduation, there’s no such thing as a “gentleman’s C”—or even a hard-earned B+. At work, you succeed or fail, and sometimes you’re just fired or laid off, no matter how good you are. In life, there is no grading on the curve or otherwise. We know that success in business does not correlate with one’s GPA in college. Many other factors are just as important as a perfect academic record in my own field of academe, where one would think grades would be the prime predictors of success. They are not. We just tend to treat them as though they are, as if tests and measurements and grades were an end in themselves and not a useful metric that helps students and teachers all stay on track.

My new grading method that set off such waves of vitriol strove to be fair and useful, and in method it combined old-fashioned contract grading with peer review. Contract grading goes back at least to the 1960s. In it, the requirements of a course are laid out in advance and students contract to do all of the assignments or only some of them. A student with a heavy course or work load who doesn’t need an A, for example, might contract to do everything but the final project and then, according to the contract, she might earn a B. It’s all very adult. No squirming, no smarmy grade-grinding to coax the C you earned up to a B.

But I also wanted some quality control. So I added the crowdsourcing component based on the way I had already structured the course. I thought that since pairs of students were leading each class session and also responding to their peers’ required weekly reading blogs, why not have those student leaders determine whether the blogs were good enough to count as fulfilling the terms of the contract? If a blog didn’t pass muster, it would be the task of the student leaders that week to tell the blogger that and offer feedback on what would be required for it to count. Student leaders for a class period would have to do this carefully, for, of course, next week they would be back in the role of the student, blogging too, and a classmate would be evaluating their work.

Many of those in the blogosphere who were critical of my crowdsourcing grading idea were sure that my new grading philosophy would be an invitation for every slacker who wanted an easy A. These critics obviously hadn’t seen my long syllabus, with its extensive readings in a range of fields, from the computational sciences to literature and the multimedia arts, and with its equally long writing and research requirements. I have not yet had a single student read over my syllabus and think he was in for a cakewalk.

I also liked the idea of students each having a turn at being the one giving the grades. That’s not a role most students experience, even though every study of learning shows that you learn best by teaching someone else. Besides, if constant public self-presentation and constant public feedback are characteristics of a digital age, why aren’t we rethinking how we evaluate, measure, test, assess, and create standards? Isn’t that another aspect of our brain on the Internet? Collective responsibility—crowdsourcing—seemed like a fine way to approach assessment.

There are many ways of crowdsourcing, and mine was simply to extend the concept of peer leadership to grading. Crowdsourcing is the single most common principle behind the creation of the open-source Web. It works. This is how the brilliant Linux code—free, open, and used to power everything from servers to supercomputers—was originally written. As we’ll see in a later chapter, businesses, even traditional ones like IBM, use crowdsourcing more and more as a way to get a read on their employees’ opinions, to frame a challenge, or to solve a problem. Sometimes they create a competition and award the winner a prize while the company receives the rights to the winning idea. Crowdsourced grading was an extension of the peer leadership we’d pioneered throughout the course.

From the fury of the remarks, it was clear I had struck a nerve. My critics were acting as if testing and then assigning a grade to the test scores had real-world value. It doesn’t. Grading measures some things and fails to measure other things, but in the end, all assessment is circular: It measures what you want it to measure by a standard of excellence that you determine in advance. We too often substitute test results for some basic truth about a whole person, partly because we live in a test-obsessed society, giving more tests (and at a younger age) than any other country. We love polls and metrics and charts—numbers for everything. Studies have shown that if you include numbers in an article, Americans are more inclined to believe the results. If you include a chart or table, confidence goes even higher. And if you include numbers, a chart, and an illustration of a brain—even if the article isn’t really about the parts of the brain—the credibility is through the roof.2 That’s the association we make: Numbers equal brain.

Given how often we were tested as kids, it is no surprise that we put a lot of faith in test results of one kind or another. But in fact, grading doesn’t have “real-world appropriateness.” You don’t get an A or an F on life’s existential report card. Yet the blogosphere was convinced that either I or my students would be pulling a fast one if the grading were crowdsourced and students had a role in it. That says to me that we don’t believe people can learn unless they are forced to, unless they know it will “count on the test.” As an educator, I find that very depressing. As a student of the Internet, I also find it implausible. If you give people the means to self-publish—whether it’s a photo from their iPhone or a blog—they do. They seem to love learning and sharing what they know with others. But much of our emphasis on grading is based on the assumption that learning is like cod-liver oil: It is good for you, even though it tastes horrible going down. And much of our educational emphasis is on getting one answer right on one test—as if that says something about the quality of what you have learned or the likelihood that you will remember it after the test is over. At that failing school I visited, everyone knew the kids who didn’t speak English well were ruining the curve. Ironically, Spanish was a required subject but not one on which kids were tested—and so it was done as an afterthought, and very few of the Anglo kids could speak or read Spanish. Blinded by the need to succeed at a multiple-choice test, no one had the energy to create peer-learning teams in which those immigrant kids could teach their classmates how to speak some Spanish and everyone would learn more about respect and regard for one another’s abilities in the process, including how to communicate across difference and in ways not measured by a bubble test.

Grading, in a curious way, exemplifies our deepest convictions about excellence and authority, and specifically about the right of those with authority to define what constitutes excellence. If we crowdsource grading, we are suggesting that young people without credentials are fit to judge quality and value. Welcome to the Internet, where everyone’s a critic and anyone can express a view about the new iPhone, restaurant, or quarterback. That democratizing of who can pass judgment is digital thinking. As I found out, it is quite unsettling to people stuck in top-down models of formal education and authority.

Explicitly or implicitly, any classroom is about both content and method, information that you pass on and ways of knowing the world that you put into action. If a teacher wants a hierarchical class, in which she quizzes students on the right answers, she reinforces a process of learning in which answers are to be delivered and unquestioned. What a teacher—like a parent—has to decide is not only what content is worth knowing, but which ways of knowing should be modeled and encouraged.

Learn. Unlearn. Relearn. How do you grade successful unlearning? It is a contradiction to try to model serious intellectual risk-taking and then use an evaluation system based on conventional grades. My A+ students were right. How you measure changes how you teach and how you learn.

In addition to the content of our course—which ranged across cognitive psychology, neuroscience, management theory, literature and the arts, and the various fields that make up science and technology studies—This Is Your Brain on the Internet was intended to model a different way of knowing the world, one that encompassed new and different forms of collaboration and attention. And more than anything, it courted failure. Unlearning.

“I smell a reality TV show,” one contributor, the self-styled BoredDigger, sniffed on the Internet crowdsourcing site Digg.

That’s not such a bad idea, actually. Maybe I’ll try that next time I teach This Is Your Brain on the Internet. They can air it right after Project Classroom Makeover.


If we want our schools to work differently and to focus on different priorities, we still have to come up with some kind of metric for assessing students. And it should come as no surprise that the tests we use now are just as outmoded as most of the ways we structure our classrooms. They were designed for an era that valued different things, and in seeking to test for those things, they limited the types of aptitude they measured. While twentieth-century tests and the accompanying A-B-C-D grades we use to mark them are quite good at producing the kind of standardized, hierarchical results we might expect for a society that valued those qualities, they often fail to tell us anything meaningful about our kids’ true potential.

If grading as we know it doesn’t thrill us now, it might be comforting to know that it was equally unpopular when it first appeared. Historians aren’t exactly sure who invented testing, but it seems as if the concept of assigning quantitative grades to students might have begun at Cambridge University, with numerical or letter grades supplementing written comments on compositions in a few cases by a few dons sometime at the end of the eighteenth century. It was not really approved of, though. In fact, quantifying grades was considered fine for evaluating lower-order thinking, but implicitly and explicitly was considered a degraded (so to speak) form of evaluation. It was not recommended for complex, difficult, higher-order thinking. You needed to hear oratory, watch demonstrations, and read essays to really understand if a student had mastered the complexity of a subject matter, whether classics or calculus.3 At Oxford and Cambridge, to this day, there is still heavy reliance on the essay for demonstrating intelligence and achievement.

In the United States, Yale seems to have been the first university to adopt grading as a way of differentiating the top twenty or thirty “Optimi” students from the middling “Inferiores” and the lowly “Pejores.”4 The practice, with variations, spread to other universities, including Harvard, William and Mary, and the University of Michigan. By the middle of the nineteenth century, public school teachers were affixing grades to students’ essays (this was before the advent of multiple-choice “objective” exams), largely as a convenience and expedient for the teachers, not because such a grade was considered better for the students.

Assigning a number or letter grade to the attainment of knowledge in a complex subject (let’s say, history) as measured by a test is not a logical or self-evident method of determining how much someone understands about history. There are a lot of assumptions about learning that get skipped over in the passage from a year’s lectures on the controversial, contradictory, value-laden causes and politics of, for example, World War I, to the B+ that appears on a student’s transcript for History 120, World History 1900–1920. That letter grade reduces the age-old practice of evaluating students’ thoughts in essays—what was once a qualitative, evaluative, and narrative practice—to a grade.

The first school to adopt a system of assigning letter grades was Mount Holyoke in 1897, and from there the practice was adopted in other colleges and universities as well as in secondary schools.5 A few years later, the American Meat Packers Association found the system so convenient that it adopted it for rating the quality, or grades as the association called them, of meats.6

Virtually from the beginning of grading in the United States, there was concern about the variability of the humans assigning grades. What we now know as grade inflation was a worry early on, with sterner commentators wagging censorious fingers at the softies who were far too lenient, they thought, in giving high grades to the essays they were marking. Others were less concerned about the “objectivity” of the tester and more concerned with efficiency. If you were testing not just a few privileged students at Yale or Harvard, but thousands and thousands of students, could you really be reading and carefully marking their long, handwritten, Oxbridge-style ruminations on life or geography or mathematical principles?

Because uniformity, regularity, standardization, and therefore objectivity (in the sense of objective measurement of grades) were the buzzwords of the first decade of the twentieth century, the search was on for a form of testing that fit the needs of the day. How could we come up with a uniform way of marking brain power that would help sort out the abilities of the industrial labor force, from the pig-iron guy to the CEO?

Thus was born the multiple-choice test, what one commentator has called the symbol of American education, “as American as the assembly line.”7 It is estimated that Americans today take over 600 million standardized tests annually—the equivalent of about two tests per year for every man, woman, and child in America. There are apparently more standardized tests given in schools in the United States than in any other country. Americans are test-happy. We use standardized tests for every level of education, including, in some cases, preschool. We use them for driver’s licenses and for entry-level positions in business and industry, for government and military jobs.8 Americans test earlier and more often than anyone else. We even have standardized tests for measuring the validity of our standardized tests.

So where did standardized testing come from anyway? That’s not just a rhetorical question. There is a “father” of the multiple-choice test, someone who actually sat down and wrote the first one. His name was Frederick J. Kelly, and he devised it in 1914. It’s pretty shocking that if someone gave it to you today, the first multiple-choice test would seem quite familiar, at least in form. It has changed so little in the last eight or nine decades that you might not even notice the test was an antique until you realized that, in content, it addressed virtually nothing about the world since the invention of the radio.

Born in 1880 in the small farming town of Wymore, Nebraska, Kelly lived until 1959. A lifelong educator, he had seen, by the time of his death, the multiple-choice test adapted to every imaginable use, although it was not yet elevated into a national educational policy, the sole metric for assessing what kids were learning in school, how well teachers were teaching them, and whether schools were or were not failing.

Kelly began his career at Emporia State University (formerly Kansas State Teachers’ College). In 1914, he finished his doctoral dissertation, entitled Teachers’ Marks, Their Variability and Standardization, at Teachers College, Columbia University. His thesis argued two main points. First, he was concerned about the significant degree of subjective judgment in how teachers mark papers. Second, he thought marking takes too much of a teacher’s time. He advocated solving the first problem—“variability”—through standardization, which would also solve the second problem by allowing for a fast, efficient method of marking.

Inspired by the “mental testing movement,” or early IQ testing, Kelly developed what he called the Kansas Silent Reading Test. By that time, he had progressed to become director of the Training School at the State Normal School at Emporia, Kansas, and from there, he went on to become the dean of education at the University of Kansas. “There has always been a demand on the part of teachers to know how effectively they are developing in their children the ability to get meaning from the printed page,” Kelly wrote. “Nothing is more fundamentally important in our school work than the development of this ability.”9 For Kelly, “effective teaching” meant uniform results. In this, he was a creature of his age, prizing a dependable, uniform, easily replicated product—the assembly-line model of dependability and standardization—over ingenuity, creativity, individuality, idiosyncrasy, judgment, and variability.

Thus was born the timed reading test. The modern world of 1914 needed people who could come up with the exact right answer in the exact right amount of time, in a test that could be graded quickly and accurately by anyone. The Kansas Silent Reading Test was as close to the Model T form of automobile production as an educator could get in this world. It was the perfect test for the machine age, the Fordist ideal of “any color you want so long as it’s black.”

To make the tests both objective as measures and efficient administratively, Kelly insisted that questions had to be devised that admitted no ambiguity whatsoever. There had to be wholly right or wholly wrong answers, with no variable interpretations. The format will be familiar to any reader: “Below are given the names of four animals. Draw a line around the name of each animal that is useful on the farm: cow tiger rat wolf.”

The instructions continue: “The exercise tells us to draw a line around the word cow. No other answer is right. Even if a line is drawn under the word cow, the exercise is wrong, and nothing counts.... Stop at once when time is called. Do not open the papers until told, so that all may begin at the same time.”10

Here are the roots of today’s standards-based education reform, solidly preparing youth for the machine age. No one could deny the test’s efficiency, and efficiency was important in the first decades of the twentieth century, when public schools exploded demographically, increasing from about five hundred in 1880 to ten thousand by 1910, and when the number of students in secondary education increased more than tenfold.11 Yet even still, many educators objected that Kelly’s test was so focused on lower-order thinking that it missed all other forms of complex, rational, logical thinking entirely. They protested that essays, by then a long-established form of examination, were an exalted form of knowledge, while multiple-choice tests were a debased one. While essay tests focused on relationships, connections, structures, organization, and logic, multiple-choice exams rewarded memorization rather than logic, facts without context, and details disconnected from analysis. While essays allowed for creativity, rhetorical flourishes, and other examples of individual style, the Silent Reading Test insisted on timed uniformity, giving the most correct answers within a specific time. While essays stressed coherent thinking, the Silent Reading Test demanded right answers and divided knowledge into discrete bits of information. While essays prized individuality and even idiosyncrasy, the bywords of the Silent Reading exam were uniformity and impersonality.12

What the multiple-choice test did avoid, though, was judgment. It was called objective, not because it was an accurate measure of what a child knew but because there was no subjective element in the grading. There was a grade key that told each teacher what was right and what was wrong. The teacher or teacher’s aide merely recorded the scores. Her judgment was no longer a factor in determining how much a child did or did not know. And it was “her” judgment. By the 1920s, teaching was predominantly a woman’s profession. However, school administration was increasingly a man’s world. It was almost exclusively men who went off to earn advanced degrees in schools of education, not so they could teach better but so they could run schools (our word again) efficiently.13

The values that counted and prevailed for the Silent Reading Test were efficiency, quantification, objectivity, factuality, and, most of all, the belief that the test was “scientific.” By 1926, a form of Kelly’s test was adopted by the College Entrance Examination Board as the basis for the Scholastic Aptitude Test (SAT).14 Because masses of students could all take the same test, all be graded in the same way, and all turned into numbers crunched to yield comparative results, they were ripe to become yet another product of the machine age: statistics, with different statisticians coming up with different psychometric theories about the best number of items, the best number of questions, and so forth.15 The Kansas Silent Reading Test asked different kinds of questions, depending on whether it was aimed at third graders or tenth graders. The reason for this was that Kelly designed the test to be analyzable with regard not just to individual achievement, as an assessment tool that would help a teacher and a parent determine how the child was doing, but also as a tool that would allow results to be compared from one grade to another within a school, across a range of grades within a school, and then outward, across schools, across districts, across cities, and divisible in any way one wanted within those geographical categories. If this sounds familiar, it is because it’s almost identical to our current educational policy.

This story has a bittersweet coda: It is clear from Kelly’s own later writings that he changed his mind about the wisdom of these tests. He didn’t write much, but what he did write doesn’t mention the Kansas Silent Reading Test. Far from enshrining this accomplishment for the historical record, his later writing passes over it in silence. It seems as if his educational philosophy had taken a decidedly different turn. By 1928, when he ascended to the presidency of the University of Idaho, he had already changed direction in his own thinking about the course of American education. In his inaugural presidential address, “The University in Prospect,” Kelly argued against what he saw as the predominant tendency of post–World War I education in America, toward a more specialized, standardized ideal. His most important reform at the University of Idaho during his presidency was to go stridently against the current of the modern educational movement and to create a unified liberal arts curriculum for the first and second years of study. His method emphasized general, critical thinking. “College practices have shifted the responsibility from the student to the teacher, the emphasis from learning to teaching, with the result that the development of fundamental strengths of purpose or of lasting habits of study is rare,” President Kelly said, announcing his own blueprint for educational reform. He railed against specialization at too early a stage in a student’s educational development and advocated “more fundamental phases of college work.” He insisted, “College is a place to learn how to educate oneself rather than a place in which to be educated.”16

Unfortunately, his message ran counter to the modernization, specialization, and standardization of education he himself had helped start. Faculty in the professional schools at the University of Idaho protested President Kelly’s reforms, and in 1930, he was asked to step down from his position.17 Yet his test soldiered on and, as we have seen, persists to the present day in the end-of-grade exams that measure the success or failure of every child in public school in America, of every teacher in the public school system, and of every public school in America.

Once more, the roots of our twenty-first-century educational philosophy go back to the machine age and its model of linear, specialized, assembly-line efficiency, everyone on the same page, everyone striving for the same answer to a question that both offers uniformity and suffers from it. If the multiple-choice test is the Model T of knowledge assessment, we need to ask, What is the purpose of a Model T in an Internet age?

IF THE REACH OF STANDARDIZED testing has been long, so has been the social impact of this form of testing. Not far from the uniform and efficient heart of the standardized achievement test was the eugenic soul of the IQ test. The dubious origins of IQ testing have been entwined with the standardized, multiple-choice achievement tests for so long that it is easy to confuse the aims and ambitions of one with the other. In fact, the very origins are mixed up and merged.18 Although more has been written about IQ tests, it is not well known that the first ones were intended to be not objective measures of innate mental abilities (IQ) but tests for one specific kind of aptitude: an ability to do well in academic subjects. While Kelly was trying to figure out a uniform way of testing school achievement, educators in France were working on testing that could help predict school success or failure, as a way of helping kids who might be having difficulty in the French public school system.

In 1904, two French psychologists, Alfred Binet and Théodore Simon, were commissioned by the Ministry of Public Education to develop tests to identify and diagnose children who were having difficulties mastering the French academic curriculum.19 They were using the word intelligence in the older sense of “understanding” and were interested in charting a child’s progress over time, rather than positing biological, inherited, or natural mental characteristics.

Like Kelly in Kansas, who began his testing research around the same year, the French psychologists were of their historical moment in seeking efficient and standardized forms of assessment: “How will it be possible to keep a record of the intelligence in the pupils who are treated and instructed in a school . . . if the terms applied to them—feeble minded, retarded, imbecile, idiot—vary in meaning according to the doctor who examined them?” Their further rationale for standardized testing was that neurologists alone had proved to be incapable of telling which “sixteen out of twenty” students were the top or the bottom students without receiving feedback from the children’s teachers.

Note the odd assumption here that you need a test to confirm what the teacher already knows. The tests were important because the neurologists kept getting it wrong, merely by plying their trade and using their scientific methods. Historian Mark Garrison asks, “If one knows who the top and bottom students are, who cares if the neurologists can tell?” He wonders if the importance of the tests was to confirm the intelligence of the students or to assert the validity of institutional judgment, educational assessment, and scientific practice of the day.20

Binet himself worried about the potential misuse of the tests he designed. He insisted they were not a measurement, properly speaking. He argued that intelligence comes in many different forms, only some of them testable by his or by any test. His understanding of different skills, aptitudes, or forms of intelligence was probably closer to educator Howard Gardner’s concept of “multiple intelligences” than to anything like a rigid, measurable standard reducible to a single numerical score.21

His words of caution fell on deaf ears. Less than a year after Binet’s death in 1911, the German psychologist William Stern argued that one could take the scores on Binet’s standardized tests, calculate them against the age of the child tested, and come up with one number that defined a person’s “intelligence quotient” (IQ).22 Adapted in 1916 by Lewis Terman of Stanford University and renamed the Stanford-Binet Intelligence Scale, this method, along with Binet’s test, became the gold standard for measuring not aptitude or progress but innate mental capacity, IQ. This was what Binet had feared. Yet his test and that metric continue to be used today, not descriptively as a relative gauge of academic potential, but as a purportedly scientific grading of innate intelligence.
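To see how little arithmetic stands behind that single number, here is a minimal sketch of the ratio Stern described: a "mental age" drawn from the test, divided by the child's actual age. The function name and the sample ages are hypothetical, chosen only for illustration, and the scaling by 100 is the convention that came with the later adaptations rather than anything in Binet's own work.

```python
def ratio_iq(mental_age_years: float, chronological_age_years: float) -> float:
    """Return a Stern-style ratio IQ: mental age over chronological age, times 100."""
    if chronological_age_years <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_years / chronological_age_years


# Hypothetical example: a ten-year-old whose test results match
# those of a typical twelve-year-old.
print(ratio_iq(12, 10))

# A child testing exactly at age level lands on 100 by construction.
print(ratio_iq(8, 8))
```

Everything Binet said about the many forms of intelligence disappears in that one division, which is exactly what he feared.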

Before his death, Binet protested against the idea that his tests measured hereditary brain power or innate intelligence. In midlife, he had become more introspective and cautious about science than he had been as a young man, having done some soul-searching when intellectual areas he had championed (spiritualism, hypnotism, and phrenology—the study of head bumps) had proved suspect. Largely self-taught, Binet had become cautious about overstating the importance of his diagnostic testing.

He would have been appalled and disgusted by the misuse of his test that began in 1917. In a story that has been told many times, the president of the American Psychological Association, Robert Yerkes, convinced the military to give the new Stanford-Binet IQ tests to more than a million recruits to determine who had sufficient brain power to serve as officers, who was fit to serve overseas, and who simply was not intelligent enough to fight in World War I at all.23 This massive sampling was then later used to “prove” the mental inferiority of Jews, Italians, eastern Europeans, the Irish, and just about any newly arrived immigrant group, as well as African Americans. Native-born American, English-speaking Anglo-Saxons turned out to have the highest IQ scores. They were innately the most intelligent people. Familiarity with English or with the content was irrelevant. There were Alpha tests for those who could read and Army Beta tests for those who couldn’t, and they were confusing enough that no one did well on them, raising the eugenic alarm that the “decline” in American intelligence was due to racial and immigrant mixing. Almost immediately, in other words, the tests were used as the scientific basis for determining not just individual but group intelligence. The tests were also adopted by U.S. immigration officials, with profound impact on the Immigration Restriction Act of 1924 and other exclusionary policies.24 They were also used against those of African descent, including native-born African Americans, to justify decades of legal segregation, and against Asians and Native Americans as the basis for inequitable citizenship laws.25 Of course, the argument had to run in the opposite direction when applied to women who, from the beginning, showed no statistical difference from men in IQ testing. Magically, the same tests that “proved” that different races had unequal innate intellectual capacities nevertheless were not held to “prove” that different genders were equal.26

ONE FINAL FACTOR HAS TO be added into this discussion of how we measure: the very statistical methods by which we understand what test scores mean. There is nothing intrinsic about measurement. Yet after a century of comparative testing, finding the implicit “mean” and “median” of everything we measure, such forms of testing are so woven into the fabric of our culture that it is almost impossible to imagine assessment apart from distribution curves. It is as if someone cannot do well if someone else does not do badly. We hold to the truth of grades and numbers so firmly, as if they tell the objective truth about how we are doing, that you would think that, from time immemorial, all human mental activity came with a grade as well as a ranking on humanity’s great bell curve.

Yet the statistical assumptions about how we measure were also invented quite recently, and also as part of the industrial age. Many of the most familiar features of modern statistics were devised by Sir Francis Galton (1822–1911), a cousin of Darwin’s and proponent of social Darwinism, a misapplication of evolutionary theory that argued that those at the top of the social heap—aristocratic members of the upper classes—had gotten there through their excellent inherited qualities. Galton believed that to ensure “survival of the fittest,” rich aristocrats should receive government subsidies to encourage them to have more children while the British poor should be sterilized. If that seems a reversal of Darwin’s ideas of how the strongest in a species survive, well, it is.

Galton was a fervent believer in what he called eugenics, selective breeding as applied to humans. Sir Francis also coined the phrase “nature versus nurture” and believed that inherited, natural characteristics were what mattered. He believed that the human inheritance pool was being “weakened,” and he was positive he had the statistical measures to prove his eugenic contentions. He was the first person to develop the form of the questionnaire in order to gather what he saw as empirical evidence, and he developed the influential statistical ideas of standard deviation, statistical correlation, and regression toward the mean.27

These ideas about statistics aren’t inherently problematic, but given what we know about how attention works, it shouldn’t be surprising to see how they can be used to support interpretations that reinforce dominant values. Because we see not the world but the particular view of it conditioned by our past experiences and past learning, it is predictable that we view deviation as bad, a problem rather than simply a change. If we establish a mean, deviation from the mean is almost inevitably a decline. When we measure the decline against that mean, it looks like scientific proof of failure, weakness, a downward trajectory: crisis. But what is interesting is how in the very first statistical methods and the very first tests—multiple-choice or IQ—there was already a problem of decline (in standards or intelligence) that had to be addressed. From the beginning, the scientific assessments were designed to solve a problem presumed (not proved) to exist.

Not everyone who uses statistics or objective forms of testing does so to prove that nature favors the ruling class. For example, Horace Mann championed both common schools and testing as more fair and unbiased than the system of privilege operating in the private schools and elite universities of the day. But it is depressing to see how often methods purported to be quantitative, objective assessments have been used for ideological purposes justified by those metrics.

THREE FEATURES OF HOW WE measure came together during the machine age and embody the machine-age ideals of uniformity and efficiency: statistical methods of analysis, standardized or multiple-choice testing, and IQ testing of “natural” capabilities. Together those three have defined virtually all conversation about learning and education for a century.

This is not to say that educational philosophers and reformers haven’t protested these methods, metrics, and conclusions. Hundreds have. In The Mismeasure of Man, the late Harvard paleontologist Stephen Jay Gould showed how quantitative data can be misused not to measure what they purport to test but to support the preconceived ideas upon which the tests themselves are constructed.28 He suggested there is an inherited component to IQ—but it is inherited cultural privilege, supported by affluence, passed on from parents to children, that is the greatest educational legacy, not genetics.29 Recently a group of educators trained in new computational methods have been confirming Gould’s assertion by microprocessing data from scores earned on end-of-grade exams in tandem with GIS (geographic information systems) data. They are finding clear correlations between test scores and the income of school districts, schools, neighborhoods, and even individual households. As we are collecting increasing amounts of data on individuals, we are also accumulating empirical evidence that the most statistically meaningful “standard” measured by end-of-grade tests is standard of living, as enjoyed by the family of the child taking the exam.30

University of Virginia psychologist Timothy Salthouse has been conducting research in a different vein that also should sweep aside any lingering perceptions that the current, formal mode of testing captures the whole truth of our learning potential. Most psychologists study intelligence or aptitude by administering one test to a sampling of subjects. In 2007, Salthouse used a different method. He gave a battery of sixteen different intelligence and aptitude tests to subjects ranging from eighteen to ninety-seven years of age. He included the most important cognitive and neuropsychological tests designed to measure everything from IQ to attention span. By using many tests, Salthouse put into his own complex experiment all the variables—the categories, biases, assumptions, and hierarchies—implicit in each test but inconsistent across them. Typical studies focus on one activity or characteristic and use that measure as if it is an absolute against which other qualities are gauged. By using multiple tests, Salthouse gave us the tools to see in a different way.

He did something else ingenious too. Most studies of intelligence or intellectual aptitude give a subject a test one time. They operate from a presumption that one person tests the same on all occasions. We know that isn’t true. We know our scores vary. In the case of a major, life-changing test such as the SAT, you might do badly one year, and so take it again later. Or you might even shell out some money to take the Kaplan tests that prep you on how to take that test better, and then you might take it again. Salthouse tested the tests by having each of his subjects take the tests more than once. He repeated the same tests on different days, with no new prepping. He also had the same subject retake various kinds of tests under different conditions. He then compiled all the statistical results from all of these multiple tests by multiple people, aggregating the different scores and then crunching the numbers in various ways, seeking out different variables, including such “extrinsic” factors as time of day a test was taken, whether it was the first or a subsequent time that the same kind of test was taken, and so forth. With new computational tools, it is certainly possible to take data with various different structures and crunch it every which way to see if anything interesting emerges.

After analyzing all the data from all the tests in many different ways, Salthouse found wide variability within the same individual, even when the individual was being tested for the same characteristics. The variety depended on such factors as what day and what time of day a test happened to be taken, as well as what type of test the person was taking. An individual scored differently on the same test taken on two different days and scored differently, again, depending on the kind of test he or she took. Salthouse found that “the existence of within-person variability complicates the assessment of cognitive and neuropsychological functioning and raises the possibility that single measurements may not be sufficient for precise evaluations of individuals, or for sensitive detection of change.”31 Not sufficient? With all the data crunched, he was able to demonstrate that the within-person deviation in test scores averaged about 50 percent of the between-person deviation for a variety of cognitive tasks.32 Or as one popular science reporter put it, “Everyone has a range of typical performances, a one-person bell curve.”33 That is a decisive statistic, one that should have everyone rethinking how we measure.
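A small sketch may make the comparison between within-person and between-person deviation concrete. The scores below are invented for illustration and are not Salthouse's data, and the two helper functions are my own simplification, not his statistical method. Each hypothetical subject takes the same test on three days. The first function measures how much one person's scores wobble from day to day, and the second measures how far apart the people are from one another on average.

```python
from statistics import mean, stdev

# Invented scores: three hypothetical subjects, same test on three different days.
scores_by_person = {
    "subject_a": [45, 55, 50],
    "subject_b": [55, 65, 60],
    "subject_c": [65, 75, 70],
}


def within_person_sd(scores):
    """Average of each person's own standard deviation across sessions."""
    return mean(stdev(sessions) for sessions in scores.values())


def between_person_sd(scores):
    """Standard deviation of the per-person average scores."""
    return stdev(mean(sessions) for sessions in scores.values())


within = within_person_sd(scores_by_person)
between = between_person_sd(scores_by_person)
ratio = within / between

print(f"within-person SD:  {within:.1f}")
print(f"between-person SD: {between:.1f}")
print(f"ratio:             {ratio:.0%}")
```

I chose these numbers so that the day-to-day wobble comes out at roughly half the spread between people, the proportion Salthouse reported. At that size, a single sitting can easily misplace a person relative to the neighbor at the next desk.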

Note to self: Remember not to take important tests on Mondays or ever in the late afternoons. Note to those who have substituted standardization for high standards: Even what we presume is standardized isn’t particularly consistent for measuring one individual. Yet we put so much weight on test scores that whole lives can be changed by having a bad day and doing badly on one. Your child’s excellent magnet preschool is suddenly off limits because of a mediocre test score, and a whole domino effect of subsequent achievement based on past achievement can be set in motion.

For the most test-happy nation on earth, Salthouse’s experiment poses a conundrum. Critics of standardized tests often say that they reduce learning to the practice of “teaching to the test”—that is, teaching not so students understand but so they do well on the test. The Salthouse study, and the critique of testing that goes all the way back to its origins with Kelly and Binet, suggests it may be worse than that. We may be teaching to contradictory, inconsistent, and inconclusive tests of lower-order thinking.


It’s only recently that our policy makers seem to be getting the message, but for now we’re still stuck, trapped in a system that isn’t working for kids or their teachers. Everyone knows this. We don’t know yet how to get out, though more and more people have finally started thinking of new ways to solve the problem. Even Diane Ravitch, the influential educator and policy maker who helped shape national educational policy under presidents George H. W. Bush and Bill Clinton, has recently renounced her earlier faith in standardized testing.34 In the end, she says, the only people to have profited by our national standards-based reform are commercial providers of tests and testing services, all of whom are now concerted in their lobbying efforts to keep end-of-grade tests mandatory across the nation.

Assessment is a particularly volatile issue in America now because, since January 8, 2002, we have had a national policy that is, for all intents and purposes, wholly shaped by standardized testing. The No Child Left Behind Act (Public Law 107-110), typically abbreviated as NCLB, was proposed by President George W. Bush soon after he came into office and was led through Congress by Senator Ted Kennedy. It received overwhelming bipartisan support in Congress and justified a substantial influx of funding into public education. It is based on the theory of “standards-based education reform,” which relies on standardized testing to measure individual and school performance results. It equates those test results with better outcomes in education. No Child Left Behind adjusts federal funding to the test scores achieved by students on state-prescribed multiple-choice exams. School districts in some states can be penalized with lower funds if their kids don’t achieve a certain standard, as set by individual states, on the tests.

What are the values that No Child Left Behind reinforces? Memorization, mastering a definitive body of knowledge that an institution insists is worth knowing, understanding that some subjects “count on the test” and other areas aren’t worth testing at all. Everything about No Child Left Behind reinforces the idea of knowledge as a noun, not knowing as a verb. Exactly at the moment that kids have the tremendous opportunity to explore online on their own, when they need to be taught higher-order critical thinking and evaluation skills to understand what is credible, how you think, how you know what is or isn’t good information, and how you use information to drive sound conclusions, we subject them to machine-readable, multiple-choice, end-of-grade bubble tests. It’s not that there is anything terrible about them, but they are redundant for the most academically gifted students and are an irrelevant distraction from real and necessary learning for the kids who don’t do well in school. Except for preparing one for trivia-based games such as Who Wants to Be a Millionaire, it’s hard to imagine what might be the real-world utility of testing based on responding to any individual question by selecting from a sampling of possible answers.

Online, kids have to make choices among seemingly infinite possibilities. There’s a mismatch between our national standards of testing and the way students are tested every time they sit by themselves in front of a computer screen. If the school bell was the symbol of American public education in the nineteenth century, summoning America’s farm kids into the highly scheduled and efficient new world of industrial labor, should the multiple-choice test really be the symbol of the digital era? If so, our symbol underscores the fact that American public education, as currently constituted, is an anachronism, with very little relevance to our children’s lives now and even less to their future.

In 2006, two distinguished historians, Daniel J. Cohen and the late Roy Rosenzweig, worked with a bright high school student to create H-Bot, an online robot equipped with search algorithms capable of reading test questions and then browsing the Internet for answers. When H-Bot took a National Assessment of Educational Progress (NAEP) test designed for fourth graders, it scored 82 percent, well above the national average. Given advances in this technology, a 2010 equivalent of H-Bot would most likely receive a perfect score, and not just on the fourth-grade NAEP; it would also ace the SATs and maybe even the GREs, LSATs, and MCATs too.

Cohen and Rosenzweig created H-Bot to make a point. If calculators have made many old arithmetic skills superfluous, search functions on Google have rendered the multiple-choice form of testing obsolete. We need to be teaching students higher-order cognitive and creative skills that computers cannot replicate. Cohen and Rosenzweig want far higher standards for creative, individual thinking. “We will not be among the mourners at the funeral of the multiple-choice test,” they insist. “Such exams have fostered a school-based culture of rote memorization that has little to do with true learning.”35


“So, Prof Davidson, if you’re so smart, how would you change assessment today?” the blogosphere well might ask. “Well,” I might answer, “having spent time with so many extraordinary teachers, I have a few ideas.”

First, I would stop, immediately, the compulsory end-of-grade exams for every child in an American public school. The tests are not relevant enough to the actual learning kids need, they offer little in the way of helpful feedback to either students or their teachers, and there are too many dire institutional consequences from failure resting on our children’s shoulders. The end-of-grade exam has become a ritual obligation, like paying taxes, and no one wants our kids to grow up thinking life’s inevitabilities are death, taxes, and school exams. That’s a disincentive to learning if ever there was one. Wanting to improve learning on a national scale, across the board, in every school district, north and south, east and west, urban and rural, rich and poor, is a fine ideal, but the current end-of-grade testing has not had that result, nor does it even do a good job of measuring the kinds of skills and kinds of thinking kids need today. This national policy has been around less than a decade. Let’s admit the experiment has failed, end it, and move on.

In a digital age, we need to be testing for more complex, connected, and interactive skills, what are sometimes called twenty-first-century skills or twenty-first-century literacies (see the appendix). Of course this includes the basics, the three Rs, but for those, any kind of testing a teacher devises to help her measure how well her class is doing is fine. Parents and students all know when a teacher isn’t doing her job. We don’t need multiple-choice tests to prove she’s not, with consequences for the whole school. Use the funds from all that grading to get a teacher who isn’t succeeding in the classroom some additional mentoring, maybe even a teacher’s aide. If she still isn’t working out, maybe then, as a last resort, suggest this may not be the job for her. Our crisis in teachers leaving the profession is greater than that of students leaving, and no wonder, given our current punitive attitudes toward them.

I’m not against testing. Not at all. If anything, research suggests there should be more challenges offered to students, with more variety, and they should be more casual, with less weight, and should offer more feedback to kids, helping them to see for themselves how well they are learning the material as they go along. This is called adaptive or progressive testing. Fortunately, we are close to having machine-generated, -readable, and -gradable forms of such tests, so if we want to do large-scale testing across school districts or states or even on a national level, we should soon have the means to do so, in human-assisted, machine-readable testing-learning programs with real-time assessment mechanisms that can adjust to the individual learning styles of the individual student. Just as it has taken almost a hundred years to move from Frederick Kelly’s Kansas Silent Reading Tests to the present tests created by the Educational Testing Service (ETS) and other companies, so too are we at the beginning of tests for a digital age that actually measure the skills our age demands. It is possible that within a decade, each learner—whether a schoolchild or a lifelong learner—could be building up his own private ePortfolio, with badges and credentialing built in, of all learning, all challenges, all there to measure accomplishments at the end of a school grade and still there several years later to help refresh a memory or to be shown to a future employer, all indexed and sortable at will.36
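For readers who want to picture what "adjusting to the individual student" might mean at its very simplest, here is a toy sketch. It is an assumption-laden illustration, not a description of any real testing product: the function name, the five difficulty levels, and the stand-in learner are all hypothetical. The rule is a plain staircase. A right answer earns a harder challenge, a wrong answer an easier one, and the record of the whole run is kept as feedback.

```python
def run_adaptive_quiz(answers_correctly, start_level=3, min_level=1,
                      max_level=5, num_items=6):
    """Offer num_items challenges, stepping difficulty up or down after each.

    answers_correctly is a stand-in for the learner: a function that takes
    a difficulty level and returns True if the learner gets that item right.
    Returns a list of (level, correct) pairs, one per item.
    """
    level = start_level
    history = []
    for _ in range(num_items):
        correct = answers_correctly(level)
        history.append((level, correct))
        if correct:
            level = min(max_level, level + 1)  # step up, capped at the hardest level
        else:
            level = max(min_level, level - 1)  # step down, floored at the easiest level
    return history


# Hypothetical learner who can handle anything up to difficulty 4.
history = run_adaptive_quiz(lambda level: level <= 4)
for level, correct in history:
    print(f"difficulty {level}: {'right' if correct else 'not yet'}")
```

Even this crude version settles quickly around the edge of what the learner can do, which is where feedback is most useful. Real adaptive systems would need far more sophisticated models, but the principle of low-stakes, frequent, responsive challenges is the same.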

But if we are on the verge of new forms of testing for a digital age, we should think about what we want to test, as well as how we have to change our institutions of education, testing, and government policy that are currently reinforcing methods of assessment and even the kind of “itemized” learning that is firmly rooted in the last century.37 Because we are on the brink of new computerized, individualized, adaptive, self-assembling testing that will be useful to learners and easier on teachers, what exactly could we be testing for that eludes us in the item-response method?

We’ve made headway on that score too, with more and more educators, policy makers, parents, and students themselves arguing that we urgently need to be thinking of interconnected, not discrete, twenty-first-century skills. Instead of testing for the best answer to discrete questions, we need to measure the ability to make connections, to synthesize, collaborate, network, manage projects, solve problems, and respond to constantly changing technologies, interfaces, and eventually, in the workplace, new arrangements of labor and new economies. For schools this means that in addition to the three Rs of reading, writing, and arithmetic, kids should be learning critical thinking, innovation, creativity, and problem solving, all of the skills one can build upon and mesh with the skills of others. We need to test students on how critically they think about the issues of the digital age—privacy, security, or credibility. We could write algorithms to test how well kids sort the information that comes at them, how wisely they decide what is or is not reliable information. It would also be easy to assess the effectiveness of their use of new technologies and all the multimedia tools not only at their disposal but more and more necessary for their future employment. If you can’t get on Twitter, you haven’t passed that test. Just as you can build copyright restrictions into online source materials, you could as easily test the ways kids remix or download, testing for creativity and also for their sensitivity to the ethical or legal use of the intellectual property of others.

If change is the byword of a great era of technological innovation such as our own, we need to have methods of assessment that show not how well kids learn how to psych out which answer among five has the highest probability of being right but how well they can apply their knowledge to novel situations. How flexible and adaptable are they; how capable of absorbing and responding to feedback? I am sure that one reason we’ve gone from 4 to 320 reality TV shows in a decade is that American Idol, So You Think You Can Dance, Project Runway, or Top Chef do a better job of teaching sound judgment and how to react well (or badly) to feedback than most of our schools.

We need to measure practical, real-world skills, such as how to focus attention through project and time management. There is no punch clock in do-it-yourself culture, so where do kids learn how to manage themselves? Similarly, where do they learn, online or face-to-face, at school and in the future workplace, how to work together? Every employer says working well with others is a key to success, but in school every child’s achievement is always weighed against that of everyone else. Are you in the top 1 percent or the bottom? How does that teach collaboration? Once we figure out how to teach collaboration, how do we measure it? Kids shouldn’t have to end up at their first job with a perfect report card and stellar test scores but no experience in working with others. When you fail at that in your first job, you don’t get a C. You get a pink slip and a fast walk to the exit door.

We also need to be able to measure how well young people communicate, including with people who may not share their assumptions and background. In social groups, online and face-to-face, we often live in a relatively homogeneous world.38 But in the workplace, we increasingly face a globalized environment and a new world of digital communication, which has the potential for interaction with others who may not even share the same language. Through computer translating software, we can exchange words but not necessarily deep and differing cultural values, which is why business schools today insist that “culture” and “context” are the two most important features of global executive education. Executives need to know how to reason effectively, how to use systems and network thinking to understand the relationship between one problem and its solutions and other problems that may arise as a result. Bubble tests cannot begin to teach students how to analyze the way parts of a system interact with other parts of a complex system. We need to teach them how to make sound judgments and determinations about what is or is not credible, especially in a digital age when information comes unsorted.39

If all that sounds impossible, it is mostly because we have spent the last one hundred years believing that a multiple-choice test can tell us more about how kids are learning than can Mrs. Davidson, Katie Salen, or thousands of other inspiring teachers, past and present, who know a great student or one who is having difficulty when they see one, without benefit of any kind of test, machine-readable or otherwise.

We have to unlearn a lot of what we’ve come to believe about how we should measure. We need to think about how unnatural those tests are for assessing “unschooled” ways of learning, whether in everyday life or in the workplace. We don’t give Baby Andrew an item-response test at the end of his first year of life to see if he can crawl or walk. We observe his progress toward walking. We don’t subject a new employee to a standardized test at the end of her first year to see if she has the skills the job requires. Why in the world have we come to believe that is the right way to test our children?

If we want national standards, let’s take the current funds we put into end-of-grade testing and develop those badges and ePortfolios and adaptable challenge tests that will assist teachers in grading and assist students, too, in how they learn. Lots of online for-profit schools are developing these systems, and many of them work well. Surely they are more relevant to our children’s future than the bubble tests.

THERE IS ONE FEATURE OF the present end-of-grade exams that I would like to keep: the pacing of the school year. As I learned over and over, most schools teach real subjects and real learning from September to March, then stop what they are doing in order to help students prepare for the lower-order thinking required for the end-of-grades. Let’s build on that structure and reverse it. Instead of dumbing down to psych out the test, insert a boss-level challenge at the end of the year. Let every student know that when March comes, everything they will learn in their classes will be put into an important, practical application in the world. Let them think about that every day. How could what I am learning today about the American Revolution be turned into something that could help someone else? How could my geography lessons be of interest in the world? Spend some part of a day, maybe once or twice a week, talking to students about how what they are learning matters in the world, and let them be thinking of how that would translate into a collaborative end-of-grade challenge. When March comes, instead of putting away the exciting learning to study for the tests, instead of nightmares over whether they will fail or pass this year, whether their schools will be deprived of funding because of their poor performance, let them lie awake at night dreaming of projects.

Here’s one. Let’s say a fifth-grade social science class spent a lot of the year focusing on the heritage of the local community. What if they decided they wanted to make a webinar on what they learned for the fourth graders who would be coming through the next year? That girl with the green hair would be enlisted to do the artwork. Rodney would be doing any math calculations. Someone might write and sing a theme song. Someone else would write the script. Budding performers would be tapped to narrate or even act out the parts. The computer kids would identify the software they needed and start putting this together. The teacher would help, would bring in businesspeople from the community, teachers from community colleges or local universities, or maybe college students who could aid here. And the kids would then go out and interview their parents and grandparents, local leaders, the police, the community organizers, maybe even the mayor, and all of this would be part of their real, exciting end-of-grade “test.” It would put into practice everything they have learned, would use all of their skills, and unlike the waste of time of the current cramming for lower-order thinking on the end-of-grades, it would mean each year ended with a synthesis of everything, showing each child what he or she could do in the world. Parents would have to be involved. Instead of the usual bake sale or ritual recital, an end-of-year “idea sale” would be held. Parents and anyone from town, including all those interviewed (get out the mayor!) would tour the school auditorium viewing what the students created. The kids could write press releases for the local newspapers and news programs. On the school’s interactive Web site, they would collect feedback from local Web developers or public historians or just a neighbor or two, so next year they could make an even better project.

This may sound implausible and impractical, but it isn’t. It’s happening in informal after-school programs all over America and could certainly happen in a school near you. We know students learn best when their schooling relates to their lives in their homes and communities. Make that connection work for schools as well as for students by connecting schools to community colleges and universities in the town, to retirement communities, to local businesses and nonprofits, to libraries and civic centers. The possibilities are endless.

I know of a great project like this. It’s called Hypercities, and it works out of UCLA and the local public schools in Filipinotown in Los Angeles. It was one of the winners of the 2007 HASTAC/MacArthur Foundation Digital Media and Learning Competition. It is “a collaborative research and educational platform for traveling back in time to explore the historical layers of city spaces in an interactive, hypermedia environment.” Hypercities began working with students and community members in L.A.’s Historic Filipinotown (HiFi), with students interviewing their relatives, doing research in local archives, and restoring a map of an area of the city that no longer exists as it did fifty years ago, at the heart of America’s Filipino immigrant community. The students found photographs, documents, and even forgotten film footage; they also mastered the methodologies of history and software code writing, as well as multimedia editing, sound and video production, and other skills.40

Would a project like Hypercities reinforce all the standardized knowledge that students would cram for on a current end-of-grade exam? No. Thank goodness. First, we know that students tend not to retain that testable knowledge. Second, we know they don’t really know how to apply it beyond the test. And third, Kelly was right: It is lower-order thinking. What I am suggesting is reserving the last two months of school not for dumbing down but for inspiring the most unstructured, collaborative learning possible. It teaches students project management, achievement, and how to apply learning. It means that all the “cramming” that comes before it is not in order to pass a test but in order to teach kids how to think. What questions do I need to know? Taken over from spy talk, “need to know” is gamer language for that which is necessary in order to succeed in the next challenge. Games work so well and are so infinitely appealing because they reinforce the idea that the more we know, the better the game is. And that’s a life lesson every teacher wants to instill.

There is nothing better that students can take home over summer vacation than a sense that what they have learned the previous year has meant they were able, with the help of lots of other people, including that alienated girl with the green hair and that kid who counts on his fingers, and lots and lots of people beyond the walls of the school, to make something important happen, to meet a challenge. All of the great teachers I have met have been implicitly or explicitly preparing their students for the challenges ahead. That, not the end-of-grade exam, is where we want to focus their attention. Eyes on the prize, and that prize is an ability, for the rest of their lives, to know they can count on their knowledge, their skills, their intelligence, and their peers to help them meet the challenges—whatever those may be—that lie ahead.