-- a Gerald Bracey Report on the Condition of Education
SAT AND NAEP TRENDS:
Gerald W. Bracey
Gerald W. Bracey is an Associate of the High/Scope Educational Research Foundation and an Associate Professor at George Mason University. His most recent books are The War on America's Public Schools (Allyn & Bacon, 2002) and Put to the Test: An Educator's and Consumer's Guide to Standardized Testing (Revised edition, Phi Delta Kappa International, 2002). The opinions are his own.
The average SAT Verbal score in 2002 was precisely the same as it was in 1981: 504. Yet each of the six major ethnic categories used by the College Board shows an increase over that period: whites, 8 points; blacks, 19; Asians, 27; Mexicans, 8; Puerto Ricans, 18; and American Indians, 8. How can it be, then, that all groups that make up the national average have gained but the national average score has not budged in 21 years? This is not a trivial question: critics of schools have used these national averages as indicators of no progress in education reform.
The phenomenon by which the whole group shows one trend but the various subgroups show another occurs often in social science and medical research. It is known as Simpson's Paradox. A Google search on "Simpson's Paradox" produces 2800 hits.
To understand it, let's look first at the trends for the SAT, both Verbal and Mathematics, for the various ethnic groups and for all groups lumped together.
                       Verbal               Mathematics
                  1981*  2002  Gain      1981   2002  Gain
White              519    527    +8       509    533   +24
Black              412    431   +19       391    427   +36
Asian              474    501   +27       512    569   +57
Mexican            438    446    +8       447    457   +10
Puerto Rican       437    455   +18       428    451   +23
American Indian    471    479    +8       463    483   +20
All Testtakers     504    504     0       494    516   +22
What on earth is going on here? The increase in Math scores for most ethnic groups exceeds, and sometimes far exceeds, the gain for all students. The Verbal scores show an even more paradoxical outcome: all groups show an increase, but the gain for the whole group is exactly zero. Nil.
To understand how Simpson's paradox affects SAT averages over time, we must look at changes in the ethnic composition of the SAT testtaking group over time. Table 1 below shows these changes.
                        1981               2002
                       #      %           #      %
White               19,383   85       698,659   65
Black               75,434    9       122,684   11
Asian               29,753    3       103,242   10
Mexican             14,405    2        48,255    4
Puerto Rican         7,038    1        14,273    1
American Indian      4,655    0         7,506    1
Total                       100                 92
(2002 percentages do not sum to 100% because of 8 percent responding "Latin American" or "Other," response categories not used in 1981).
The changing composition of the SAT testtakers causes the paradox. Minorities now comprise a much larger proportion of the total than they did 20 years ago. And, except for the Mathematics scores of Asians, all minority scores, while rising, remain below the overall average. Adding more and more of these improving, but still low, scores attenuates the rise of the overall average. In the case of the verbal score, it attenuates it to zero.
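The attenuation described above can be checked with a small weighted-average calculation. The sketch below is illustrative only: it uses the Verbal means and the rounded percentage shares from the tables, plus the "Latin American" and "Other" figures from the footnote, and the function name `weighted_mean` is hypothetical. Because the shares are rounded, the computed averages come out near 505.6 for both years rather than exactly at the published 504. Holding the 1981 mix fixed while applying the 2002 group scores gives roughly 515, which suggests how much of the rise the changing composition absorbs.

```python
# Group mean SAT Verbal scores, taken from the tables and footnote in the text.
verbal_1981 = {"White": 519, "Black": 412, "Asian": 474,
               "Mexican": 438, "Puerto Rican": 437, "American Indian": 471}
verbal_2002 = {"White": 527, "Black": 431, "Asian": 501,
               "Mexican": 446, "Puerto Rican": 455, "American Indian": 479,
               "Latin American": 458, "Other": 502}

# Rounded percentage shares of testtakers (an approximation: the text
# reports whole percentages only).
share_1981 = {"White": 85, "Black": 9, "Asian": 3,
              "Mexican": 2, "Puerto Rican": 1, "American Indian": 0}
share_2002 = {"White": 65, "Black": 11, "Asian": 10,
              "Mexican": 4, "Puerto Rican": 1, "American Indian": 1,
              "Latin American": 4, "Other": 4}


def weighted_mean(scores, shares):
    """Average the group scores, weighting each by its share.

    Groups absent from `shares` get weight zero.
    """
    total_weight = sum(shares.get(group, 0) for group in scores)
    weighted_sum = sum(scores[group] * shares.get(group, 0) for group in scores)
    return weighted_sum / total_weight


mean_1981 = weighted_mean(verbal_1981, share_1981)
mean_2002 = weighted_mean(verbal_2002, share_2002)
# Counterfactual: 2002 group scores combined with the 1981 ethnic mix.
mean_2002_fixed_mix = weighted_mean(verbal_2002, share_1981)

print(f"1981 weighted mean:           {mean_1981:.1f}")
print(f"2002 weighted mean:           {mean_2002:.1f}")
print(f"2002 scores with 1981 shares: {mean_2002_fixed_mix:.1f}")
```

Every group's score is higher in 2002, yet the two weighted means are nearly identical; only when the shares are frozen does the gain show up in the aggregate.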
Simpson's Paradox is stated in many ways. They all convey the idea that when subgroups' scores on a variable are aggregated to form a single total group, the total might show a relationship that is the reverse of the relationship seen in the subgroups. Hence, the paradox.
In the above example, Simpson's Paradox strikes because the composition of the whole group changes over time: many more minorities in 2002 than in 1981. Simpson's Paradox also affects one-time measurements where the subgroups differ in some important way from the whole group. The following medical example shows how this happens. If we compare survival rates for patients in two hospitals, overall the results look like this:
              Survived   Died   Total   Survival
Hospital A       800      200    1000      80%
Hospital B       900      100    1000      90%
Hospitals are dangerous places generally, but it looks like if you must check into one, Hospital B is your medical facility of choice. But what if we divide the patients into those who were in good condition prior to treatment and those who were in poor condition?
Good Condition Patients
              Survived   Died   Total   Survival
Hospital A       590       10     600      98%
Hospital B       870       30     900      97%

Poor Condition Patients
              Survived   Died   Total   Survival
Hospital A       210      190     400      53%
Hospital B        30       70     100      30%
Thus while Hospital B had a higher survival rate for all patients than did Hospital A, Hospital A treated a higher proportion of those who were in bad shape to start with. It also managed to keep a higher proportion alive. Hospital A is the place for you whether you are in good or poor condition on your arrival.
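The hospital comparison can be reproduced in a few lines. This is a minimal sketch using the counts from the tables above; the names `counts`, `survival_rate`, and `overall_rate` are illustrative, not part of any library.

```python
# (survived, died) counts for each hospital, split by patient condition,
# copied from the tables in the text.
counts = {
    "Hospital A": {"good": (590, 10), "poor": (210, 190)},
    "Hospital B": {"good": (870, 30), "poor": (30, 70)},
}


def survival_rate(survived, died):
    """Fraction of patients who survived."""
    return survived / (survived + died)


def overall_rate(strata):
    """Survival rate after pooling all condition groups for one hospital."""
    survived = sum(s for s, _ in strata.values())
    died = sum(d for _, d in strata.values())
    return survival_rate(survived, died)


for hospital, strata in counts.items():
    by_condition = ", ".join(
        f"{condition}: {survival_rate(*pair):.1%}"
        for condition, pair in strata.items()
    )
    print(f"{hospital}: overall {overall_rate(strata):.1%} ({by_condition})")
```

Hospital B wins on the pooled figure while Hospital A wins within each condition group, because Hospital A's caseload is weighted much more heavily toward poor-condition patients.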
Back in education, we see Simpson's paradox at work in NAEP trends as well as in SAT trends.
Reading    1971   1999
Age 17      285    288
Age 13      255    259
Age 9       208    212
Over a period of 28 years, scores change little. "NAEP reading scores are essentially unchanged," said right-wing pundit George F. Will in his March 2, 2003 column. "This refutes the durable delusion that schools' cognitive outputs vary directly with financial inputs." This is a common comment from the Right. Spending has increased ("soared," "skyrocketed," and "mounted" are words commonly used by the critics), but test scores are "flat" ("stagnant," "sluggish," "static"; choose your term). As with the SAT, though, looking at trends by ethnic group reveals something different from what the aggregates for all groups show:
Reading        White           Black          Hispanic
            1971   1999     1971   1999     1975!  1999
Age 17       291    295      238    264      252    271
Age 13       261    267      222    238      232    244
Age 9        212    221      170    186      183    193
! Hispanics constituted too small a sample to generate a reliable estimate in the 1971 assessment. Asians were still too small a group in 1999.
The changes for white students pretty much mirror the changes for the whole sample.
The gains for black and Hispanic students, though, are much larger than for the entire group. However, their scores remain lower than those of whites and, by Simpson's Paradox, because they are now a larger proportion of the total group, they attenuate the gains seen when all groups are combined.
The proportion of whites in the sample falls from roughly 80 percent to roughly 70 percent (it varies slightly for different ages). The proportion of the entire group made up of blacks changes over time from about 14 percent to about 16 percent, while the proportion of Hispanics doubles from about 5 percent to about 10 percent. Asians were not represented as a separate group until the science assessment of 1996, and even in that year there was concern about the accuracy of the estimated scores.
NAEP assessments in mathematics and science also show larger gains for ethnic groups than for everyone taken as a whole.
Lest anyone still be mystified by what's going on, let me present a hypothetical but concrete example. Consider the scores below:
        Time 1   Time 2
 1.       500      510
 2.       500      510
 3.       500      510
 4.       500      510
 5.       500      510
 6.       500      510
 7.       500      510
 8.       500      430
 9.       500      430
10.       400      430
Avg.      490      486
Let's assume that all of those 500's at Time 1 represent the SAT scores of white kids and that the 400 represents the SAT score of a minority student. At Time 2, assume that the 510's are the SAT scores of white students, and the 430's the SAT scores of minorities. So all groups have gained. Whites have gained 10 points, minorities 30 points. The difference is that at Time 1, minorities made up only 10 percent of the total group, but at Time 2, they constitute 30 percent of the total. When we calculate the average for Time 1 and Time 2, we find that, despite the fact that all groups are scoring higher at Time 2, the average at Time 2 is lower than at Time 1: 486 at Time 2 vs. 490 at Time 1. This is Simpson's Paradox.
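For readers who want to verify the arithmetic, here is a short sketch of the same hypothetical. The group labels and score lists mirror the ten-row table above; the helper names `mean` and `overall_mean` are illustrative.

```python
def mean(scores):
    """Arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


# Hypothetical scores from the table: nine whites and one minority student
# at Time 1, seven whites and three minority students at Time 2.
time1 = {"white": [500] * 9, "minority": [400] * 1}
time2 = {"white": [510] * 7, "minority": [430] * 3}


def overall_mean(groups):
    """Mean over every individual score, ignoring group labels."""
    return mean([score for scores in groups.values() for score in scores])


for group in time1:
    gain = mean(time2[group]) - mean(time1[group])
    print(f"{group}: {mean(time1[group]):.0f} -> {mean(time2[group]):.0f} ({gain:+.0f})")

print(f"overall: {overall_mean(time1):.0f} -> {overall_mean(time2):.0f}")
```

Both group means rise, yet the overall mean falls from 490 to 486, because the lower-scoring group grows from one-tenth to three-tenths of the total.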
Thus, it sometimes appears as if test scores are not rising or are even falling when, in fact, test scores for all groups are rising at the same time as lower scoring groups are making up a larger proportion of the total. This, it should be obvious, does not mean the same thing as falling test scores due to declining achievement. It should be obvious, but it is often conveniently overlooked by school critics.
Indeed, since these critics are statistically sophisticated, one must conclude that they have ignored Simpson's Paradox not only conveniently, but also deliberately. And unethically.
* 1981 is used as a starting point because it was the first year the Board published a document showing SAT data by gender and ethnicity. Coincidentally, 1981 also marked the lowest point of the decline of average SAT scores that had begun in 1963. The Board category "Latin American," which covers Central and South American students, was not in use in 1981 and currently accounts for four percent of all SAT testtakers. They scored 458 on the Verbal in 2002 and 464 on the Math. Another four percent now check "Other," a category also not used in 1981. They scored 502 on the Verbal and 514 on the Math.
© 2003 Gerald Bracey