EDDRA


Education Disinformation Detection and Reporting Agency

-- a Gerald Bracey Report on the Condition of Education



THOSE MISLEADING SAT AND NAEP TRENDS:
SIMPSON'S PARADOX AT WORK

Gerald W. Bracey

Gerald W. Bracey is an Associate of the High/Scope Educational Research Foundation and an Associate Professor at George Mason University.  His most recent books are The War on America's Public Schools (Allyn & Bacon, 2002) and Put to the Test: An Educator's and Consumer's Guide to Standardized Testing (Revised edition, Phi Delta Kappa International, 2002).  The opinions are his own.

The average SAT Verbal score in 2002 was precisely the same as it was in 1981: 504.  Yet each of the six major ethnic categories used by the College Board shows an increase over that period: whites, 8 points; blacks, 19; Asians, 27; Mexican Americans, 8; Puerto Ricans, 18; and American Indians, 8.  How can it be, then, that all groups that make up the national average have gained but the national average score has not budged in 21 years?  This is not a trivial question: critics of schools have used these national averages as indicators of no progress in education reform.

The phenomenon by which the whole group shows one trend but the various subgroups show another occurs often in social science and medical research.  It is known as Simpson's Paradox.  A Google search on "Simpson's Paradox" produces 2800 hits.

To understand it, let's look first at the trends for the SAT, both Verbal and Mathematics, for the various ethnic groups and for all groups lumped together.

                          Verbal                   Mathematics
                    1981*    2002    Gain      1981    2002    Gain

White                519      527      +8       509     533     +24
Black                412      431     +19       391     427     +36
Asian                474      501     +27       512     569     +57
Mexican              438      446      +8       447     457     +10
Puerto Rican         437      455     +18       428     451     +23
American Indian      471      479      +8       463     483     +20
All Testtakers       504      504       0       494     516     +22

What on earth is going on here?  The increase in Math scores for most ethnic groups exceeds, and sometimes far exceeds, the gain for all students.  The Verbal scores show an even more paradoxical outcome: all groups show an increase, but the gain for the whole group is exactly zero.  Nil.

To understand how Simpson's paradox affects SAT averages over time, we must look at changes in the ethnic composition of the SAT testtaking group over time.  Table 1 below shows these changes.

               

Table 1

                          1981                  2002
                         #       %             #       %

White               19,383      85       698,659      65
Black               75,434       9       122,684      11
Asian               29,753       3       103,242      10
Mexican             14,405       2        48,255       4
Puerto Rican         7,038       1        14,273       1
American Indian      4,655       0         7,506       1
Total                          100                    92

(2002 percentages do not sum to 100% because of 8 percent responding "Latin American" or "Other," response categories not used in 1981). 

The changing composition of the SAT testtakers causes the paradox.  Minorities now make up a much larger proportion of the total than they did 20 years ago.  And, except for the Mathematics scores of Asians, all minority scores, while rising, remain below the overall average.  Adding more and more of these improving, but still low, scores attenuates the rise of the overall average.  In the case of the Verbal score, it attenuates it to zero.
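The arithmetic can be checked with a short Python sketch.  It is an approximation, not the College Board's calculation: it weights each group's Verbal score by the rounded percentages in Table 1, and it takes the 2002 "Latin American" and "Other" scores and their 4 percent shares from the footnote at the end of this article.  Because the percentages are rounded, the reconstructed averages come out near, not exactly at, the published 504.  The function and variable names are illustrative only.

```python
# SAT Verbal means by group, from the table above.
verbal_1981 = {"White": 519, "Black": 412, "Asian": 474,
               "Mexican": 438, "Puerto Rican": 437, "American Indian": 471}
# 2002 means, plus the two categories described in the footnote.
verbal_2002 = {"White": 527, "Black": 431, "Asian": 501,
               "Mexican": 446, "Puerto Rican": 455, "American Indian": 479,
               "Latin American": 458, "Other": 502}

# Rounded shares of all testtakers, from Table 1 (assumption: the rounding
# error is small enough not to change the picture).
share_1981 = {"White": 0.85, "Black": 0.09, "Asian": 0.03,
              "Mexican": 0.02, "Puerto Rican": 0.01, "American Indian": 0.00}
share_2002 = {"White": 0.65, "Black": 0.11, "Asian": 0.10,
              "Mexican": 0.04, "Puerto Rican": 0.01, "American Indian": 0.01,
              "Latin American": 0.04, "Other": 0.04}


def weighted_mean(scores, shares):
    """Average of group scores, weighted by each group's share of the total."""
    total_share = sum(shares.values())
    return sum(scores[group] * shares[group] for group in shares) / total_share


mean_1981 = weighted_mean(verbal_1981, share_1981)
mean_2002 = weighted_mean(verbal_2002, share_2002)
# Counterfactual: 2002 scores, but with the 1981 ethnic mix held fixed.
fixed_mix_2002 = weighted_mean(verbal_2002, share_1981)

print(f"1981 reconstructed average:     {mean_1981:.1f}")
print(f"2002 reconstructed average:     {mean_2002:.1f}")
print(f"2002 scores with the 1981 mix:  {fixed_mix_2002:.1f}")
```

Under these assumptions the two reconstructed averages differ by less than a point, while the same 2002 scores weighted by the 1981 mix come out roughly nine to ten points higher.  The gains are real; the change in who takes the test hides them.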

Simpson's Paradox is stated in many ways.  They all convey the idea that when subgroups' scores on a variable are aggregated to form a single total group, the total  might show a relationship that is the reverse of the relationship seen in the subgroups.  Hence, the paradox. 

In the above example, Simpson's Paradox strikes because the composition of the whole group changes over time: many more minorities in 2002 than in 1981.  Simpson's Paradox also affects one-time measurements where the subgroups differ in some important way from the whole group.  The following medical example shows how this happens.  If we compare survival rates for patients in two hospitals, overall the results look like this:

All Patients

               Survived      Died      Total      Survival Rate

Hospital A        800         200       1000           80%
Hospital B        900         100       1000           90%

Hospitals are dangerous places generally, but it looks like if you must check into one, Hospital B is your medical facility of choice.  But what if we divide the patients into those who were in good condition prior to treatment and those who were in poor condition?


Good Condition Patients

               Survived      Died      Total      Survival Rate

Hospital A        590          10        600           98%
Hospital B        870          30        900           97%

Poor Condition Patients

               Survived      Died      Total      Survival Rate

Hospital A        210         190        400           53%
Hospital B         30          70        100           30%

Thus while Hospital B had a higher survival rate for all patients than did Hospital A, Hospital A treated a higher proportion of those who were in bad shape to start with.  It also managed to keep a higher proportion alive.  Hospital A is the place for you whether you are in good or poor condition on your arrival. 
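For readers who want to verify the reversal, here is a minimal Python sketch that recomputes the survival rates from the counts in the tables above.  The data structure and function names are illustrative choices, not part of any standard method.

```python
# (survived, died) counts for each hospital and patient condition,
# copied from the tables above.
counts = {
    ("A", "good"): (590, 10),
    ("A", "poor"): (210, 190),
    ("B", "good"): (870, 30),
    ("B", "poor"): (30, 70),
}


def survival_rate(survived, died):
    """Fraction of patients who survived."""
    return survived / (survived + died)


# Survival rate within each condition group.
by_condition = {key: survival_rate(*pair) for key, pair in counts.items()}

# Overall survival rate per hospital: pool both condition groups first.
overall = {}
for hospital in ("A", "B"):
    survived = sum(s for (h, _), (s, _) in counts.items() if h == hospital)
    died = sum(d for (h, _), (_, d) in counts.items() if h == hospital)
    overall[hospital] = survival_rate(survived, died)

for hospital in ("A", "B"):
    print(f"Hospital {hospital}: overall {overall[hospital]:.1%}, "
          f"good {by_condition[(hospital, 'good')]:.1%}, "
          f"poor {by_condition[(hospital, 'poor')]:.1%}")
```

Hospital B wins on the pooled figure only because 900 of its 1000 patients arrived in good condition, compared with 600 of 1000 at Hospital A.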

Back in education, we see Simpson's paradox at work in NAEP trends as well as in SAT trends.

NAEP

Reading      1971      1999

Age 17        285       288
Age 13        255       259
Age 9         208       212

Over a period of 28 years, the scores changed little.  "NAEP reading scores are essentially unchanged," said right-wing pundit George F. Will in his March 2, 2003 column.  "This refutes the durable delusion that schools' cognitive outputs vary directly with financial inputs."  This is a common comment from the Right.  Spending has increased ("soared," "skyrocketed," and "mounted" are words commonly used by the critics), but test scores are "flat" ("stagnant," "sluggish," "static," choose your term).  As with the SAT, though, looking at trends by ethnic group reveals something different from what the aggregates for all groups show:

Reading           White               Black              Hispanic
              1971    1999        1971    1999        1975!   1999

Age 17         291     295         238     264         252     271
Age 13         261     267         222     238         232     244
Age 9          212     221         170     186         183     193

! Hispanics constituted too small a sample to generate a reliable estimate in the 1971 assessment.  Asians were still too small a group in 1999.

The changes for white students pretty much mirror the changes for the whole sample. 

The gains for black and Hispanic students, though, are much larger than for the entire group.  However, their scores remain lower than those of whites and, by Simpson's Paradox, because they are now a larger proportion of the total group, they attenuate the gains seen when all groups are combined.

The proportion of whites in the sample falls from roughly 80 percent to roughly 70 percent (it varies slightly for different ages).  The proportion of the entire group made up of blacks changes over time from about 14 percent to about 16 percent, while the proportion of Hispanics doubles from about 5 percent to about 10 percent.  Asians were not represented as a separate group until the science assessment of 1996, and even in that year there was concern about the accuracy of the estimated scores.

NAEP assessments in mathematics and science also show larger gains for ethnic groups than for everyone taken as a whole.

Lest anyone still be mystified by what's going on, let me present a hypothetical but concrete example.  Consider the scores below:

           Time 1      Time 2

 1.          500         510
 2.          500         510
 3.          500         510
 4.          500         510
 5.          500         510
 6.          500         510
 7.          500         510
 8.          500         430
 9.          500         430
10.          400         430
            ____        ____
Avg.         490         486

Let's assume that all of those 500's at Time 1 represent the SAT scores of white kids and that the 400 represents the SAT score of a minority student.  At Time 2, assume that the 510's are the SAT scores of white students and the 430's the SAT scores of minority students.  So all groups have gained.  Whites have gained 10 points, minorities 30 points.  The difference is that at Time 1, minorities made up only 10 percent of the total group, but at Time 2, they constitute 30 percent of the total.  When we calculate the average for Time 1 and Time 2, we find that, despite the fact that all groups are scoring higher at Time 2, the average at Time 2 is lower than at Time 1: 486 at Time 2 vs. 490 at Time 1.  This is Simpson's Paradox.
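The same hypothetical can be run as a few lines of Python.  This is only a sketch of the ten-score example above; the group labels and helper name are illustrative.

```python
from statistics import mean

# The ten hypothetical scores at each time, tagged by group.
time1 = [("white", 500)] * 9 + [("minority", 400)]
time2 = [("white", 510)] * 7 + [("minority", 430)] * 3


def group_mean(rows, group):
    """Mean score of one group within a list of (group, score) pairs."""
    return mean(score for g, score in rows if g == group)


overall1 = mean(score for _, score in time1)
overall2 = mean(score for _, score in time2)

for group in ("white", "minority"):
    gain = group_mean(time2, group) - group_mean(time1, group)
    print(f"{group}: gain of {gain} points")
print(f"overall average: {overall1} at Time 1, {overall2} at Time 2")
```

Every group's mean rises, yet the overall mean drops by four points, purely because the lower-scoring group grew from one score in ten to three in ten.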

Thus, it sometimes appears as if test scores are not rising or are even falling when, in fact, test scores for all groups are rising at the same time as lower scoring groups are making up a larger proportion of the total.  This, it should be obvious, does not mean the same thing as falling test scores due to declining achievement.  It should be obvious, but it is often conveniently overlooked by school critics. 

Indeed, since these critics are statistically sophisticated, one must conclude that they have ignored Simpson's Paradox not only conveniently, but also deliberately.  And unethically.

 -----

*(1981 is used as a starting point because it was the first year the Board published a document showing SAT data by gender and ethnicity.  Coincidentally, 1981 also marked the lowest point of the decline of average SAT scores that had begun in 1963.  The Board category "Latin American," which covers Central and South American students, was not in use in 1981 and currently accounts for four percent of all SAT testtakers.  They scored 458 on the Verbal in 2002 and 464 on the Math.  Another four percent of the total now check "Other," a category also not used in 1981.  They scored 502 on the Verbal and 514 on the Math.)

 

© 2003 Gerald Bracey
Posted January 8, 2003
