Earlier this month, the College Board released SAT scores for the high school graduating class of 2015. Both math and reading scores declined from 2014, continuing a steady downward trend that has been in place for the past decade. Pundits of contrasting stripes seized on the scores to bolster their political agendas. Michael Petrilli of the Fordham Foundation argued that falling SAT scores show that high schools need more reform, presumably the reforms his organization supports, in particular, charter schools and accountability.* For Carol Burris of the Network for Public Education, the declining scores were evidence of the failure of policies her organization opposes, namely, Common Core, No Child Left Behind, and accountability.
Petrilli and Burris are both misusing SAT scores. The SAT is not designed to measure national achievement; the score losses from 2014 were minuscule; and most of the declines are probably the result of demographic changes in the SAT population. Let’s examine each of these points in greater detail.
It never was. The SAT was originally meant to measure a student’s aptitude for college independent of that student’s exposure to a particular curriculum. The test’s founders believed that gauging aptitude, rather than achievement, would serve the cause of fairness. A bright student from a high school in rural Nebraska or the mountains of West Virginia, they held, should have the same shot at attending elite universities as a student from an Eastern prep school, despite not having been exposed to the great literature and higher mathematics taught at prep schools. The SAT would measure reasoning and analytical skills, not the mastery of any particular body of knowledge. Its scores would level the playing field in terms of curricular exposure while providing a reasonable estimate of an individual’s probability of success in college.
Note that even in this capacity, the scores never suffice alone; they are only used to make admissions decisions by colleges and universities, including such luminaries as Harvard and Stanford, in combination with a lot of other information—grade point averages, curricular resumes, essays, reference letters, extracurricular activities—all of which constitute a student’s complete application.
Today’s SAT has moved towards being a content-oriented test, but not entirely. Next year, the College Board will introduce a revised SAT to more closely reflect high school curricula. Even then, SAT scores should not be used to make judgments about U.S. high school performance, whether it’s a single high school, a state’s high schools, or all of the high schools in the country. The SAT sample is self-selected. In 2015, it only included about one-half of the nation’s high school graduates: 1.7 million out of approximately 3.3 million total. And that’s about one-ninth of the approximately 16 million high school students in the country. Generalizing SAT scores to these larger populations violates a basic rule of social science. The College Board issues a warning when it releases SAT scores: “Since the population of test takers is self-selected, using aggregate SAT scores to compare or evaluate teachers, schools, districts, states, or other educational units is not valid, and the College Board strongly discourages such uses.”
TIME’s coverage of the SAT release included a statement by Andrew Ho of Harvard University, who succinctly makes the point: “I think SAT and ACT are tests with important purposes, but measuring overall national educational progress is not one of them.”
SAT scores changed very little from 2014 to 2015. Reading scores dropped from 497 to 495. Math scores also fell two points, from 513 to 511. Both declines are equal to about 0.017 standard deviations (SD).^{[i]} To illustrate how small these changes truly are, let’s examine a metric I have used previously in discussing test scores. The average American male is 5’10” tall, with an SD of about 3 inches. A 0.017 SD change in height is equal to about 1/20 of an inch (0.051 inches). Do you really think you’d notice a difference in the height of two men standing next to each other if they differed by only 1/20^{th} of an inch? You wouldn’t. Similarly, the change in SAT scores from 2014 to 2015 is trivial.^{[ii]}
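The effect-size arithmetic here is easy to verify. A minimal Python sketch, using only the figures cited in the text (the SD of 115 comes from footnote [i]):

```python
# Back-of-the-envelope check of the effect sizes discussed above.
# Figures are those cited in the text: 2014 SD of 115 for both SAT sections.

sat_sd = 115                        # 2014 SD for SAT reading and math
score_drop = 2                      # points lost from 2014 to 2015
effect_size = score_drop / sat_sd   # decline expressed in SD units
print(f"SAT decline: {effect_size:.3f} SD")        # prints 0.017 SD

height_sd_inches = 3                # approximate SD of adult male height
equivalent_height = effect_size * height_sd_inches
print(f"Height analogy: {equivalent_height:.3f} inches")
# about 0.05 inches, the "1/20 of an inch" in the text
```

The same computation with the decade-long changes (13 points against a 2005 reading SD of 113, nine points against a math SD of 115) yields the 0.12 and 0.08 SD figures in the next paragraph.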
A more serious concern is the SAT trend over the past decade. Since 2005, reading scores are down 13 points, from 508 to 495, and math scores are down nine points, from 520 to 511. These are equivalent to declines of 0.12 SD for reading and 0.08 SD for math.^{[iii]} Even though they represent changes accumulated over a decade, these losses are still quite small. In the Washington Post, Michael Petrilli asked “why is education reform hitting a brick wall in high school?” He also stated that “you see this in all kinds of evidence.”
You do not see a decline in the best evidence, the National Assessment of Educational Progress (NAEP). Unlike the SAT, NAEP is designed to monitor national achievement. Its test scores are based on a random sampling design, meaning that the scores can be construed as representative of U.S. students. NAEP administers two different tests to high-school-age students: the long-term trend assessment (LTT NAEP), given to 17-year-olds, and the main NAEP, given to twelfth graders.
Table 1 compares the past ten years’ change in SAT scores with changes in NAEP.^{[iv]} The long-term trend NAEP was not administered in 2005 or 2015, so the closest years in which it was given are shown. The NAEP tests show high school students making small gains over the past decade. They do not confirm the losses on the SAT.
Table 1. Comparison of changes in SAT, Main NAEP (12^{th} grade), and LTT NAEP (17-year-olds) scores. Changes expressed as SD units of base year.

|         | SAT 2005–2015 | Main NAEP 2005–2015 | LTT NAEP 2004–2012 |
|---------|---------------|---------------------|--------------------|
| Reading | −0.12*        | +0.05*              | +0.09*             |
| Math    | −0.08*        | +0.09*              | +0.03              |

*p < .05
Petrilli raised another concern related to NAEP scores by examining cohort trends in NAEP scores. The trend for the 17-year-old cohort of 2012, for example, can be constructed by using the scores of 13-year-olds in 2008 and 9-year-olds in 2004. By tracking NAEP changes over time in this manner, one can get a rough idea of a particular cohort’s achievement as students grow older and proceed through the school system. Examining three cohorts, Fordham’s analysis shows that the gains between ages 13 and 17 are about half as large as those registered between ages nine and 13. Kids gain more on NAEP when they are younger than when they are older.
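The cohort construction described above amounts to following one birth cohort across three LTT NAEP administrations. A sketch of the method, with hypothetical placeholder scores rather than actual NAEP results:

```python
# Sketch of cohort tracking on the LTT NAEP.
# The scale scores below are hypothetical placeholders used only
# to illustrate the method; they are NOT actual NAEP results.

# One birth cohort: students who were 9 in 2004, 13 in 2008, 17 in 2012,
# keyed by (assessment year, age tested).
cohort = {(2004, 9): 219, (2008, 13): 260, (2012, 17): 287}

gain_9_to_13 = cohort[(2008, 13)] - cohort[(2004, 9)]    # 41 points
gain_13_to_17 = cohort[(2012, 17)] - cohort[(2008, 13)]  # 27 points

# The pattern Fordham reports: the later gain is the smaller one.
print(f"Gain ages 9-13: {gain_9_to_13}, ages 13-17: {gain_13_to_17}")
```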
There is nothing new here. NAEP scholars have been aware of this phenomenon for a long time. Fordham points to particular elements of education reform that it favors—charter schools, vouchers, and accountability—as the probable cause. It is true that those reforms are more likely to target elementary and middle schools than high schools. But the research literature on age discrepancies in NAEP gains (which the Fordham analysis does not cite) renders doubtful the thesis that education policies are responsible for the phenomenon.^{[v]}
Whether high-school-age students try as hard as they can on NAEP has been pointed to as one explanation. A 1996 analysis of NAEP answer sheets found that 25 to 30 percent of twelfth graders displayed off-task test behaviors—doodling, leaving items blank—compared to 13 percent of eighth graders and six percent of fourth graders. A 2004 national commission on the twelfth-grade NAEP recommended incentives (scholarships, certificates, letters of recognition from the President) to boost high school students’ motivation to do well on NAEP. Why would high school juniors or seniors take NAEP seriously when this low-stakes test arrives in the midst of SAT or ACT tests for college admission, end-of-course exams that affect high school GPA, AP tests that can affect placement in college courses, state accountability tests that can lead to their schools being deemed a success or failure, and high school exit exams that must be passed to graduate?^{[vi]}
Other possible explanations for the phenomenon are: 1) differences in the scales between the ages tested on LTT NAEP (in other words, a one-point gain on the scale between ages nine and 13 may not represent the same amount of learning as a one-point gain between ages 13 and 17); 2) different rates of participation in NAEP among elementary, middle, and high schools;^{[vii]} and 3) social trends that affect all high school students, not just those in public schools. The third possibility can be explored by analyzing trends for students attending private schools. If Fordham had disaggregated the NAEP data by public and private schools (the scores of Catholic school students are available), it would have found that the pattern among private school students is similar—younger students gain more than older students on NAEP. That similarity casts doubt on the notion that policies governing public schools are responsible for the smaller gains among older students.^{[viii]}
Writing in the Washington Post, Carol Burris addresses the question of whether demographic changes have influenced the decline in SAT scores. She concludes that they have not, and in particular, she concludes that the growing proportion of students receiving exam fee waivers has probably not affected scores. She bases that conclusion on an analysis of SAT participation disaggregated by level of family income. Burris notes that the percentage of SAT takers has been stable across income groups in recent years. That criterion is not trustworthy. About 39 percent of students in 2015 declined to provide information on family income. The 61 percent that answered the family income question are probably skewed against low-income students who are on fee waivers (the assumption being that they may feel uncomfortable answering a question about family income).^{[ix]} Don’t forget that the SAT population as a whole is a self-selected sample. A self-selected subsample from a self-selected sample tells us even less than the original sample, which told us almost nothing.
The fee waiver share of SAT takers increased from 21 percent in 2011 to 25 percent in 2015. The simple fact that fee waivers serve low-income families, whose children tend to be lower-scoring SAT takers, is important, but not the whole story here. Students from disadvantaged families have always taken the SAT. But they paid for it themselves. If additional disadvantaged students take the SAT because they don’t have to pay for it, it is important to consider whether these new entrants to the pool of SAT test takers possess unmeasured characteristics that correlate with achievement—beyond the effect already attributed to socioeconomic status.
Robert Kelchen, an assistant professor of higher education at Seton Hall University, calculated the effect on national SAT scores of just three jurisdictions (Washington, DC, Delaware, and Idaho) adopting policies of mandatory SAT testing paid for by the state. He estimated that these policies alone explain about 21 percent of the nationwide decline in test scores between 2011 and 2015. He also notes that a more thorough analysis, incorporating fee-waiver policies in other states and districts, would surely boost that figure. Fee waivers in two dozen Texas school districts, for example, are granted to all high school juniors and seniors. And all students in those districts (including Dallas and Fort Worth) are required to take the SAT beginning in the junior year. Such universal testing policies can increase access and serve the cause of equity, but they will also, at least for a while, lead to a decline in SAT scores.
Here, I offer my own back-of-the-envelope calculation of the relationship between demographic changes and SAT scores. The College Board reports test scores and participation rates for nine racial and ethnic groups.^{[x]} These data are preferable to family income because a) almost all students answer the race/ethnicity question (only four percent are non-responses, versus 39 percent for family income), and b) it seems a safe assumption that students are more likely to know their race or ethnicity than their family’s income.
The question tackled in Table 2 is this: how much would the national SAT scores have changed from 2005 to 2015 if the scores of each racial/ethnic group stayed exactly the same as in 2005, but each group’s proportion of the total population were allowed to vary? In other words, the scores are fixed at the 2005 level for each group—no change. The SAT national scores are then recalculated using the 2015 proportions that each group represented in the national population.
Table 2. SAT Scores and Demographic Changes in the SAT Population (2005–2015)

|         | Projected Change Based on Change in Proportions | Actual Change | Projected Change as Percentage of Actual Change |
|---------|------------------------------------------------|---------------|------------------------------------------------|
| Reading | −9                                             | −13           | 69%                                            |
| Math    | −7                                             | −9            | 78%                                            |
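The recalculation behind Table 2 is a share-weighted average of group means. The sketch below shows the method with hypothetical group scores and proportions; the actual College Board figures span nine groups and are not reproduced here:

```python
# Sketch of the counterfactual reweighting behind Table 2.
# Group scores and shares are illustrative placeholders,
# NOT the actual College Board data.

# Mean reading score per group, fixed at their (hypothetical) 2005 values.
scores_2005 = {"group_a": 520, "group_b": 480, "group_c": 450}

# Each group's (hypothetical) share of test takers, 2005 vs. 2015.
shares_2005 = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}
shares_2015 = {"group_a": 0.50, "group_b": 0.28, "group_c": 0.22}

def weighted_mean(scores, shares):
    """National mean as a share-weighted average of group means."""
    return sum(scores[g] * shares[g] for g in scores)

# Hold scores at 2005 levels; let only the population mix change.
baseline = weighted_mean(scores_2005, shares_2005)
reweighted = weighted_mean(scores_2005, shares_2015)
print(f"Projected change from demographics alone: {reweighted - baseline:.1f}")
# prints -6.1 with these placeholder numbers
```

Any decline produced this way is attributable purely to the shifting composition of the test-taking pool, since every group’s score is held constant.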
The data suggest that two-thirds to three-quarters of the SAT score decline from 2005 to 2015 is associated with demographic changes in the test-taking population. The analysis is admittedly crude. The relationships are correlational, not causal. The race/ethnicity categories are surely serving as proxies for a bundle of other characteristics affecting SAT scores, some unobserved and others (e.g., family income, parental education, language status, class rank) that are included in the SAT questionnaire but produce data difficult to interpret.
Using an annual decline in SAT scores to indict high schools is bogus. The SAT should not be used to measure national achievement. The changes in SAT scores from 2014 to 2015 are tiny. The downward trend over the past decade represents a larger decline, but one that is still small in magnitude and correlated with changes in the SAT test-taking population.
In contrast to SAT scores, NAEP scores, which are designed to monitor national achievement, report slight gains for 17yearolds over the past ten years. It is true that LTT NAEP gains are larger among students from ages nine to 13 than from ages 13 to 17, but research has uncovered several plausible explanations for why that occurs. The public should exercise great caution in accepting the findings of test score analyses. Test scores are often misinterpreted to promote political agendas, and much of the alarmist rhetoric provoked by small declines in scores is unjustified.
[i] The 2014 SD for both SAT reading and math was 115.
[ii] A substantively trivial change may nevertheless reach statistical significance with large samples.
[iii] The 2005 SDs were 113 for reading and 115 for math.
[iv] Throughout this post, SAT’s Critical Reading (formerly, the SAT-Verbal section) is referred to as “reading.” I only examine SAT reading and math scores to allow for comparisons to NAEP. Moreover, SAT’s writing section will be dropped in 2016.
[v] The pattern of larger gains by younger vs. older students on NAEP is explored in greater detail in the 2006 Brown Center Report, pp. 10–11.
[vi] If these influences have remained stable over time, they would not affect trends in NAEP. It is hard to believe, however, that high-stakes tests carry the same importance for high school students today as they did in the past.
[vii] The 2004 blue-ribbon commission report on the twelfth-grade NAEP reported that by 2002 participation rates had fallen to 55 percent. That compares to 76 percent at eighth grade and 80 percent at fourth grade. Participation rates refer to the originally drawn sample, before replacements are made. NAEP is conducted with two-stage sampling—schools first, then students within schools—meaning that the low participation rate is a product of both depressed school (82 percent) and student (77 percent) participation. See page 8 of: http://www.nagb.org/content/nagb/assets/documents/publications/12_gr_commission_rpt.pdf
[viii] Private school data are spotty on the LTT NAEP because of problems meeting reporting standards, but analyses identical to Fordham’s can be conducted on Catholic school students for the 2008 and 2012 cohorts of 17yearolds.
[ix] The nonresponse rate in 2005 was 33 percent.
[x] The nine response categories are: American Indian or Alaska Native; Asian, Asian American, or Pacific Islander; Black or African American; Mexican or Mexican American; Puerto Rican; Other Hispanic, Latino, or Latin American; White; Other; and No Response.
Earlier this month, the College Board released SAT scores for the high school graduating class of 2015. Both math and reading scores declined from 2014, continuing a steady downward trend that has been in place for the past decade. Pundits of contrasting political stripes seized on the scores to bolster their political agendas. Michael Petrilli of the Fordham Foundation argued that falling SAT scores show that high schools need more reform, presumably those his organization supports, in particular, charter schools and accountability.* For Carol Burris of the Network for Public Education, the declining scores were evidence of the failure of polices her organization opposes, namely, Common Core, No Child Left Behind, and accountability.
Petrilli and Burris are both misusing SAT scores. The SAT is not designed to measure national achievement; the score losses from 2014 were miniscule; and most of the declines are probably the result of demographic changes in the SAT population. Let’s examine each of these points in greater detail.
It never was. The SAT was originally meant to measure a student’s aptitude for college independent of that student’s exposure to a particular curriculum. The test’s founders believed that gauging aptitude, rather than achievement, would serve the cause of fairness. A bright student from a high school in rural Nebraska or the mountains of West Virginia, they held, should have the same shot at attending elite universities as a student from an Eastern prep school, despite not having been exposed to the great literature and higher mathematics taught at prep schools. The SAT would measure reasoning and analytical skills, not the mastery of any particular body of knowledge. Its scores would level the playing field in terms of curricular exposure while providing a reasonable estimate of an individual’s probability of success in college.
Note that even in this capacity, the scores never suffice alone; they are only used to make admissions decisions by colleges and universities, including such luminaries as Harvard and Stanford, in combination with a lot of other information—grade point averages, curricular resumes, essays, reference letters, extracurricular activities—all of which constitute a student’s complete application.
Today’s SAT has moved towards being a contentoriented test, but not entirely. Next year, the College Board will introduce a revised SAT to more closely reflect high school curricula. Even then, SAT scores should not be used to make judgements about U.S. high school performance, whether it’s a single high school, a state’s high schools, or all of the high schools in the country. The SAT sample is selfselected. In 2015, it only included about onehalf of the nation’s high school graduates: 1.7 million out of approximately 3.3 million total. And that’s about oneninth of approximately 16 million high school students. Generalizing SAT scores to these larger populations violates a basic rule of social science. The College Board issues a warning when it releases SAT scores: “Since the population of test takers is selfselected, using aggregate SAT scores to compare or evaluate teachers, schools, districts, states, or other educational units is not valid, and the College Board strongly discourages such uses.”
TIME’s coverage of the SAT release included a statement by Andrew Ho of Harvard University, who succinctly makes the point: “I think SAT and ACT are tests with important purposes, but measuring overall national educational progress is not one of them.”
SAT scores changed very little from 2014 to 2015. Reading scores dropped from 497 to 495. Math scores also fell two points, from 513 to 511. Both declines are equal to about 0.017 standard deviations (SD).^{[i]} To illustrate how small these changes truly are, let’s examine a metric I have used previously in discussing test scores. The average American male is 5’10” in height with a SD of about 3 inches. A 0.017 SD change in height is equal to about 1/20 of an inch (0.051). Do you really think you’d notice a difference in the height of two men standing next to each other if they only differed by 1/20^{th} of an inch? You wouldn’t. Similarly, the change in SAT scores from 2014 to 2015 is trivial.^{[ii]}
A more serious concern is the SAT trend over the past decade. Since 2005, reading scores are down 13 points, from 508 to 495, and math scores are down nine points, from 520 to 511. These are equivalent to declines of 0.12 SD for reading and 0.08 SD for math.^{[iii]} Representing changes that have accumulated over a decade, these losses are still quite small. In the Washington Post, Michael Petrilli asked “why is education reform hitting a brick wall in high school?” He also stated that “you see this in all kinds of evidence.”
You do not see a decline in the best evidence, the National Assessment of Educational Progress (NAEP). Contrary to the SAT, NAEP is designed to monitor national achievement. Its test scores are based on a random sampling design, meaning that the scores can be construed as representative of U.S. students. NAEP administers two different tests to high school age students, the long term trend (LTT NAEP), given to 17yearolds, and the main NAEP, given to twelfth graders.
Table 1 compares the past ten years’ change in test scores of the SAT with changes in NAEP.^{[iv]} The long term trend NAEP was not administered in 2005 or 2015, so the closest years it was given are shown. The NAEP tests show high school students making small gains over the past decade. They do not confirm the losses on the SAT.
Table 1. Comparison of changes in SAT, Main NAEP (12^{th} grade), and LTT NAEP (17yearolds) scores. Changes expressed as SD units of base year.
SAT 20052015 
Main NAEP 20052015 
LTT NAEP 20042012 

Reading 
0.12* 
+.05* 
+.09* 
Math 
0.08* 
+.09* 
+.03 
*p<.05
Petrilli raised another concern related to NAEP scores by examining cohort trends in NAEP scores. The trend for the 17yearold cohort of 2012, for example, can be constructed by using the scores of 13yearolds in 2008 and 9yearolds in 2004. By tracking NAEP changes over time in this manner, one can get a rough idea of a particular cohort’s achievement as students grow older and proceed through the school system. Examining three cohorts, Fordham’s analysis shows that the gains between ages 13 and 17 are about half as large as those registered between ages nine and 13. Kids gain more on NAEP when they are younger than when they are older.
There is nothing new here. NAEP scholars have been aware of this phenomenon for a long time. Fordham points to particular elements of education reform that it favors—charter schools, vouchers, and accountability—as the probable cause. It is true that those reforms more likely target elementary and middle schools than high schools. But the research literature on age discrepancies in NAEP gains (which is not cited in the Fordham analysis) renders doubtful the thesis that education policies are responsible for the phenomenon.^{[v]}
Whether high school age students try as hard as they could on NAEP has been pointed to as one explanation. A 1996 analysis of NAEP answer sheets found that 25to30 percent of twelfth graders displayed offtask test behaviors—doodling, leaving items blank—compared to 13 percent of eighth graders and six percent of fourth graders. A 2004 national commission on the twelfth grade NAEP recommended incentives (scholarships, certificates, letters of recognition from the President) to boost high school students’ motivation to do well on NAEP. Why would high school seniors or juniors take NAEP seriously when this low stakes test is taken in the midst of taking SAT or ACT tests for college admission, end of course exams that affect high school GPA, AP tests that can affect placement in college courses, state accountability tests that can lead to their schools being deemed a success or failure, and high school exit exams that must be passed to graduate?^{[vi]}
Other possible explanations for the phenomenon are: 1) differences in the scales between the ages tested on LTT NAEP (in other words, a onepoint gain on the scale between ages nine and 13 may not represent the same amount of learning as a onepoint gain between ages 13 and 17); 2) different rates of participation in NAEP among elementary, middle, and high schools;^{[vii]} and 3) social trends that affect all high school students, not just those in public schools. The third possibility can be explored by analyzing trends for students attending private schools. If Fordham had disaggregated the NAEP data by public and private schools (the scores of Catholic school students are available), it would have found that the pattern among private school students is similar—younger students gain more than older students on NAEP. That similarity casts doubt on the notion that policies governing public schools are responsible for the smaller gains among older students.^{[viii]}
Writing in the Washington Post, Carol Burris addresses the question of whether demographic changes have influenced the decline in SAT scores. She concludes that they have not, and in particular, she concludes that the growing proportion of students receiving exam fee waivers has probably not affected scores. She bases that conclusion on an analysis of SAT participation disaggregated by level of family income. Burris notes that the percentage of SAT takers has been stable across income groups in recent years. That criterion is not trustworthy. About 39 percent of students in 2015 declined to provide information on family income. The 61 percent that answered the family income question are probably skewed against lowincome students who are on fee waivers (the assumption being that they may feel uncomfortable answering a question about family income).^{[ix]} Don’t forget that the SAT population as a whole is a selfselected sample. A selfselected subsample from a selfselected sample tells us even less than the original sample, which told us almost nothing.
The fee waiver share of SAT takers increased from 21 percent in 2011 to 25 percent in 2015. The simple fact that fee waivers serve lowincome families, whose children tend to be lowerscoring SAT takers, is important, but not the whole story here. Students from disadvantaged families have always taken the SAT. But they paid for it themselves. If an additional increment of disadvantaged families take the SAT because they don’t have to pay for it, it is important to consider whether the new entrants to the pool of SAT test takers possess unmeasured characteristics that correlate with achievement—beyond the effect already attributed to socioeconomic status.
Robert Kelchen, an assistant professor of higher education at Seton Hall University, calculated the effect on national SAT scores of just three jurisdictions (Washington, DC, Delaware, and Idaho) adopting policies of mandatory SAT testing paid for by the state. He estimated that these policies explain about 21 percent of the nationwide decline in test scores between 2011 and 2015. He also notes that a more thorough analysis, incorporating fee waivers of other states and districts, would surely boost that figure. Fee waivers in two dozen Texas school districts, for example, are granted to all juniors and seniors in high school. And all students in those districts (including Dallas and Fort Worth) are required to take the SAT beginning in the junior year. Such universal testing policies can increase access and serve the cause of equity, but they will also, at least for a while, lead to a decline in SAT scores.
Here, I offer my own back of the envelope calculation of the relationship of demographic changes with SAT scores. The College Board reports test scores and participation rates for nine racial and ethnic groups.^{[x]} These data are preferable to family income because a) almost all students answer the race/ethnicity question (only four percent are nonresponses versus 39 percent for family income), and b) it seems a safe assumption that students are more likely to know their race or ethnicity compared to their family’s income.
The question tackled in Table 2 is this: how much would the national SAT scores have changed from 2005 to 2015 if the scores of each racial/ethnic group stayed exactly the same as in 2005, but each group’s proportion of the total population were allowed to vary? In other words, the scores are fixed at the 2005 level for each group—no change. The SAT national scores are then recalculated using the 2015 proportions that each group represented in the national population.
Table 2. SAT Scores and Demographic Changes in the SAT Population (20052015)
Projected Change Based on Change in Proportions 
Actual Change 
Projected Change as Percentage of Actual Change 

Reading 
9 
13 
69% 
Math 
7 
9 
78% 
The data suggest that twothirds to threequarters of the SAT score decline from 2005 to 2015 is associated with demographic changes in the testtaking population. The analysis is admittedly crude. The relationships are correlational, not causal. The race/ethnicity categories are surely serving as proxies for a bundle of other characteristics affecting SAT scores, some unobserved and others (e.g., family income, parental education, language status, class rank) that are included in the SAT questionnaire but produce data difficult to interpret.
Using an annual decline in SAT scores to indict high schools is bogus. The SAT should not be used to measure national achievement. SAT changes from 20142015 are tiny. The downward trend over the past decade represents a larger decline in SAT scores, but one that is still small in magnitude and correlated with changes in the SAT testtaking population.
In contrast to SAT scores, NAEP scores, which are designed to monitor national achievement, report slight gains for 17yearolds over the past ten years. It is true that LTT NAEP gains are larger among students from ages nine to 13 than from ages 13 to 17, but research has uncovered several plausible explanations for why that occurs. The public should exercise great caution in accepting the findings of test score analyses. Test scores are often misinterpreted to promote political agendas, and much of the alarmist rhetoric provoked by small declines in scores is unjustified.
[i] The 2014 SD for both SAT reading and math was 115.
[ii] A substantively trivial change may nevertheless reach statistical significance with large samples.
[iii] The 2005 SDs were 113 for reading and 115 for math.
[iv] Throughout this post, SAT’s Critical Reading (formerly, the SATVerbal section) is referred to as “reading.” I only examine SAT reading and math scores to allow for comparisons to NAEP. Moreover, SAT’s writing section will be dropped in 2016.
[v] The larger gains by younger vs. older students on NAEP is explored in greater detail in the 2006 Brown Center Report, pp. 1011.
[vi] If these influences have remained stable over time, they would not affect trends in NAEP. It is hard to believe, however, that high stakes tests carry the same importance today to high school students as they did in the past.
[vii] The 2004 blue ribbon commission report on the twelfth grade NAEP reported that by 2002 participation rates had fallen to 55 percent. That compares to 76 percent at eighth grade and 80 percent at fourth grade. Participation rates refer to the originally drawn sample, before replacements are made. NAEP is conducted with two stage sampling—schools first, then students within schools—meaning that the low participation rate is a product of both depressed school (82 percent) and student (77 percent) participation. See page 8 of: http://www.nagb.org/content/nagb/assets/documents/publications/12_gr_commission_rpt.pdf
[viii] Private school data are spotty on the LTT NAEP because of problems meeting reporting standards, but analyses identical to Fordham’s can be conducted on Catholic school students for the 2008 and 2012 cohorts of 17-year-olds.
[ix] The nonresponse rate in 2005 was 33 percent.
[x] The nine response categories are: American Indian or Alaska Native; Asian, Asian American, or Pacific Islander; Black or African American; Mexican or Mexican American; Puerto Rican; Other Hispanic, Latino, or Latin American; White; Other; and No Response.
Last week, CNN ran a back-to-school story on homework with the headline, “Kids Have Three Times Too Much Homework, Study Finds; What’s the Cost?” Homework is an important topic, especially for parents, but unfortunately, CNN’s story misleads rather than informs. The headline suggests American parents should be alarmed because their kids have too much homework. Should they? No, CNN has ignored the best evidence on that question, which suggests the opposite. The story relies on the results of one recent study of homework—a study that is limited in what it can tell us, mostly because of its research design. But CNN even gets its main findings wrong. The study suggests most students have too little homework, not too much.
The study that piqued CNN’s interest was conducted during four months (two in the spring and two in the fall) in Providence, Rhode Island. About 1,200 parents completed a survey about their children’s homework while waiting in 27 pediatricians’ offices. Is the sample representative of all parents in the U.S.? Probably not. Certainly CNN should have been a bit leery of portraying the results of a survey conducted in a single American city—any city—as evidence applying to a broader audience. More importantly, viewers are never told of the study’s significant limitations: that the data come from a survey conducted in only one city—in pediatricians’ offices by a self-selected sample of respondents.
The survey’s sampling design is a huge problem. Because the sample is nonrandom there is no way of knowing if the results can be extrapolated to a larger population—even to families in Providence itself. Close to a third of respondents chose to complete the survey in Spanish. Enrollment in English Language programs in the Providence district comprises about 22 percent of students. About one-fourth (26 percent) of survey respondents reported having one child in the family. According to the 2010 Census, the proportion of families nationwide with one child is much higher, at 43 percent.^{[i]} The survey is skewed towards large, Spanish-speaking families. Their experience with homework could be unique, especially if young children in these families are learning English for the first time at school.
The survey was completed by parents who probably had a sick child as they were waiting to see a pediatrician. That’s a stressful setting. The response rate to the survey is not reported, so we don’t know how many parents visiting those offices chose not to fill out the survey. If the typical pediatrician sees 100 unique patients per month, in a four-month span the survey may have been offered to more than ten thousand parents in the 27 offices. The survey respondents, then, would be a tiny slice, 10 to 15 percent, of those eligible to respond. We also don’t know the public-private school breakout of the respondents, or how many were sending their children to charter schools. It would be interesting to see how many parents willingly send their children to schools with a heavy homework load.
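The back-of-the-envelope arithmetic here is easy to make explicit. This is a minimal sketch using the paragraph's own assumption of roughly 100 unique patients per pediatrician per month—an illustrative figure, not one reported by the study:

```python
# Implied response rate for the Providence survey, under assumed patient volume.
offices = 27
patients_per_office_per_month = 100  # assumption, not reported by the study
months = 4
respondents = 1200                   # approximate number of completed surveys

# Parents who could have been offered the survey over the four-month window
eligible = offices * patients_per_office_per_month * months

response_rate = respondents / eligible
print(f"Eligible parents (approx.): {eligible}")
print(f"Implied response rate: {response_rate:.1%}")
```

Under these assumptions, roughly 10,800 parents were eligible and about 11 percent responded—consistent with the "tiny slice" described above.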
I wish the CNN team responsible for this story had run the data by some of CNN’s political pollsters. Alarm bells surely would have gone off. The hazards of accepting a self-selected, demographically skewed survey sample as representative of the general population are well known. Modern political polling—and its reliance on random samples—grew from an infamous mishap in 1936. A popular national magazine, the Literary Digest, distributed 10 million postcards for its readers to return as “ballots” indicating who they would vote for in the 1936 race for president. More than two million postcards were returned! A week before the election, the magazine confidently predicted that Alf Landon, the Republican challenger from Kansas, would defeat Franklin Roosevelt, the Democratic incumbent, by a huge margin: 57 percent to 43 percent. In fact, when the real election was held, the opposite occurred: Roosevelt won more than 60% of the popular vote and defeated Landon in a landslide. Pollsters learned that self-selected samples should be viewed warily. The magazine’s readership was disproportionately Republican to begin with, and sometimes disgruntled subjects are more likely to respond to a survey, no matter the topic, than the satisfied.
Here’s a very simple question: In its next poll on the 2016 presidential race, would CNN report the results of a survey of self-selected respondents in 27 pediatricians’ offices in Providence, Rhode Island as representative of national sentiment? Of course not. Then, please, CNN, don’t do so with education topics.
Let’s set aside methodological concerns and turn to CNN’s characterization of the survey’s findings. Did the study really show that most kids have too much homework? No, the headline that “Kids Have Three Times Too Much Homework” is not even an accurate description of the study’s findings. CNN’s on-air coverage extended the misinformation. The online video of the coverage is tagged “Study: Your Kids Are Doing Too Much Homework.” The first caption that viewers see is “Study Says Kids Getting Way Too Much Homework.” All of these statements are misleading.
In the published version of the Providence study, the researchers plotted the average amount of time spent on homework by students’ grade.^{[ii]} They then compared those averages to a “10 minutes per grade” guideline that serves as an indicator of the “right” amount of homework. I have attempted to replicate the data here in table form (they were originally reported in a line graph) to make that comparison easier.^{[iii]}
Contrary to CNN’s reporting, the data suggest—based on the ten-minutes-per-grade rule—that most kids in this study have too little homework, not too much. Beginning in fourth grade, the average time spent on homework falls short of the recommended amount—a gap of only four minutes in fourth grade that steadily widens in later grades.
A more accurate headline would have been, “Study Shows Kids in Nine out of 13 Grades Have Too Little Homework.” It appears high school students (grades 9-12) spend only about half the recommended time on homework. Two hours of nightly homework is recommended for 12^{th} graders. They are, after all, only a year away from college. But according to the Providence survey, their homework load is less than an hour.
So how in the world did CNN come up with the headline “Kids Have Three Times Too Much Homework?” By focusing on grades K-3 and ignoring all other grades. Here’s the reporting:
The study, published Wednesday in The American Journal of Family Therapy, found students in the early elementary school years are getting significantly more homework than is recommended by education leaders, in some cases nearly three times as much homework as is recommended.
The standard, endorsed by the National Education Association and the National Parent-Teacher Association, is the so-called "10-minute rule"—10 minutes per grade level, per night. That translates into 10 minutes of homework in the first grade, 20 minutes in the second grade, all the way up to 120 minutes for senior year of high school. The NEA and the National PTA do not endorse homework for kindergarten.
In the study involving questionnaires filled out by more than 1,100 English- and Spanish-speaking parents of children in kindergarten through grade 12, researchers found children in the first grade had up to three times the homework load recommended by the NEA and the National PTA.
Parents reported first-graders were spending 28 minutes on homework each night versus the recommended 10 minutes. For second-graders, the homework time was nearly 29 minutes, as opposed to the 20 minutes recommended.
And kindergartners, their parents said, spent 25 minutes a night on after-school assignments, according to the study…
CNN focused on the four grades, K-3, in which homework exceeds the ten-minute rule. They ignored more than two-thirds of the grades. Even with this focus, a more accurate headline would have been, “Study Suggests First Graders in Providence, RI Have Three Times Too Much Homework.”
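The comparison underlying CNN's "three times" claim is simple arithmetic. A minimal sketch, using the parent-reported minutes for the early grades quoted above and the NEA/PTA guideline (no homework recommended for kindergarten):

```python
# Reported nightly homework minutes from the Providence survey, as quoted by CNN
reported = {"K": 25, 1: 28, 2: 29}

def recommended(grade):
    """NEA/PTA guideline: 10 minutes per grade level; none for kindergarten."""
    return 0 if grade == "K" else 10 * grade

for grade, minutes in reported.items():
    rec = recommended(grade)
    note = f"{minutes / rec:.1f}x the guideline" if rec else "no homework recommended"
    print(f"Grade {grade}: reported {minutes} min vs. {rec} min ({note})")
```

First grade comes out to 2.8 times the guideline—"nearly three times," as the story put it—while the second-grade figure is already under 1.5 times, and by fourth grade the reported averages fall below the guideline.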
Homework is a controversial topic. People hold differing points of view as to whether there is too much, too little, or just the right amount of homework. That makes it vitally important that the media give accurate information on the empirical dimensions of the debate. The amount of homework kids should have is subject to debate. But the amount of homework kids actually have is an empirical question. We can debate whether it’s too hot outside, but the actual temperature should be a matter of measurement, not debate. No rational debate on homework can proceed without knowing the empirical status quo regarding time. Imagine someone beginning a debate by saying, “I am arguing that kids have too much [substitute “too little” here for the pro-homework side] homework, but I must admit that I have no idea how much they currently have.”
Data from the National Assessment of Educational Progress (NAEP) provide the best evidence we have on the amount of homework that kids have. NAEP’s sampling design allows us to make inferences about national trends, and the Long-Term Trend (LTT) NAEP offers data on homework since 1984. The latest LTT NAEP results (2012) indicate that the vast majority of nine-year-olds (83 percent) have less than an hour of homework each night. There has been an apparent uptick in the homework load, however, as 35 percent reported no homework in 1984, and only 22 percent reported no homework in 2012. MET Life also periodically surveys a representative sample of students, parents, and teachers on the homework issue. In the 2007 results, a majority of parents (52 percent) of elementary grade students (grades 3-6 in the MET survey) estimated their children had 30 minutes or less of homework.
The MET Life survey found that parents have an overwhelmingly positive view of the amount of homework their children are assigned. Nine out of ten parents responded that homework offers the opportunity to talk and spend time with their children, and most do not see homework as interfering with family time or as a major source of familial stress. Minority parents, in particular, reported believing homework is beneficial for students’ success at school and in the future.^{[iv]}
That said, just as there were indeed Alf Landon voters in 1936, there are children for whom homework is a struggle. Some bring home more than they can finish in a reasonable amount of time. A complication for researchers of elementary-age children is that the same students who have difficulty completing homework may have other challenges—difficulties with reading, low achievement, and poor grades in school.^{[v]} Parents who question the value of homework often have a host of complaints about their child’s school. It is difficult for researchers to untangle all of these factors and determine, in the instances where there are tensions, whether homework is the real cause. To their credit, the researchers who conducted the Providence study are aware of these constraints and present a number of hypotheses warranting further study with a research design supporting causal inference. That’s the value of this research, not CNN’s misleading reporting of the findings.
[i] Calculated from data in Table 64, U.S. Census Bureau, Statistical Abstract of the United States: 2012, page 56. http://www.census.gov/compendia/statab/2012/tables/12s0064.pdf.
[ii] The mean sample size for each grade is reported as 7.7 percent (or 90 students). Confidence intervals for each grade estimate are not reported.
[iii] The data in Table I are estimates (by sight) from a line graph incremented in five percentage point intervals.
[iv] Met Life, Met Life Survey of the American Teacher: The Homework Experience, November 13, 2007, pp. 15.
[v] Among high school students, the bias probably leans in the opposite direction: high achievers load up on AP, IB, and other courses that assign more homework.
This is part two of my analysis of instruction and Common Core’s implementation. I dubbed the three-part examination of instruction “The Good, The Bad, and the Ugly.” Having discussed the “good” in part one, I now turn to “the bad.” One particular aspect of the Common Core math standards—the treatment of standard algorithms in whole number arithmetic—will lead some teachers to waste instructional time.
In 1963, psychologist John B. Carroll published a short essay, “A Model of School Learning” in Teachers College Record. Carroll proposed a parsimonious model of learning that expressed the degree of learning (or what today is commonly called achievement) as a function of the ratio of time spent on learning to the time needed to learn.
The numerator, time spent learning, has also been given the term opportunity to learn. The denominator, time needed to learn, is synonymous with student aptitude. By expressing aptitude as time needed to learn, Carroll refreshingly broke through his era’s debate about the origins of intelligence (nature vs. nurture) and the vocabulary that labels students as having more or less intelligence. He also spoke directly to a primary challenge of teaching: how to effectively produce learning in classrooms populated by students needing vastly different amounts of time to learn the exact same content.^{[i]}
The source of that variation is largely irrelevant to the constraints placed on instructional decisions. Teachers obviously have limited control over the denominator of the ratio (they must take kids as they are) and less than one might think over the numerator. Teachers allot time to instruction only after educational authorities have decided the number of hours in the school day, the number of days in the school year, the number of minutes in class periods in middle and high schools, and the amount of time set aside for lunch, recess, passing periods, various pullout programs, pep rallies, and the like. There are also announcements over the PA system, stray dogs that may wander into the classroom, and other unscheduled encroachments on instructional time.
The model has had a profound influence on educational thought. As of July 5, 2015, Google Scholar reported 2,931 citations of Carroll’s article. Benjamin Bloom’s “mastery learning” was deeply influenced by Carroll. It is predicated on the idea that optimal learning occurs when time spent on learning—rather than content—is allowed to vary, providing to each student the individual amount of time he or she needs to learn a common curriculum. This is often referred to as “students working at their own pace,” and progress is measured by mastery of content rather than seat time. David C. Berliner’s 1990 discussion of time includes an analysis of mediating variables in the numerator of Carroll’s model, including the amount of time students are willing to spend on learning. Carroll called this persistence, and Berliner links the construct to student engagement and time on task—topics of keen interest to researchers today. Berliner notes that although both are typically described in terms of motivation, they can be measured empirically in increments of time.
Most applications of Carroll’s model have been interested in what happens when insufficient time is provided for learning—in other words, when the numerator of the ratio is significantly less than the denominator. When that happens, students don’t have an adequate opportunity to learn. They need more time.
As applied to Common Core and instruction, one should also be aware of problems that arise from the inefficient distribution of time. Time is a limited resource that teachers deploy in the production of learning. Below I discuss instances when the CCSSM may lead to the numerator in Carroll’s model being significantly larger than the denominator—when teachers spend more time teaching a concept or skill than is necessary. Because time is limited and fixed, wasted time on one topic will shorten the amount of time available to teach other topics. Excessive instructional time may also negatively affect student engagement. Students who have fully learned content that continues to be taught may become bored; they must endure instruction that they do not need.
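Both situations—too little time and wasted time—fall directly out of Carroll's ratio. A minimal sketch (the cap at 1.0 is my simplification for illustration, not part of Carroll's full formulation, which includes mediating variables like persistence):

```python
def degree_of_learning(time_spent, time_needed):
    """Carroll's model: the degree of learning as a function of the ratio of
    time spent on learning (opportunity to learn) to time needed (aptitude).
    Capped at 1.0: time spent beyond what is needed adds no further learning."""
    return min(1.0, time_spent / time_needed)

# Two students given the same 30 minutes of instruction on the same content:
slow = degree_of_learning(30, time_needed=60)  # learns half the content
fast = degree_of_learning(30, time_needed=20)  # fully learned after 20 min;
                                               # the last 10 min are wasted
print(slow, fast)
```

The second case is the one at issue here: for the student who needs only 20 minutes, the extra 10 minutes produce no additional learning and come at the cost of other topics.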
Jason Zimba, one of the lead authors of the Common Core Math standards, and Barry Garelick, a critic of the standards, had a recent, interesting exchange about when standard algorithms are called for in the CCSSM. A standard algorithm is a series of steps designed to compute accurately and quickly. In the U.S., students are typically taught the standard algorithms of addition, subtraction, multiplication, and division with whole numbers. Most readers of this post will recognize the standard algorithm for addition. It involves lining up two or more multi-digit numbers according to place value, with one number written over the other, and adding the columns from right to left with “carrying” (or regrouping) as needed.
The standard algorithm is the only algorithm required for students to learn, although others are mentioned beginning with the first grade standards. Curiously, though, CCSSM doesn’t require students to know the standard algorithms for addition and subtraction until fourth grade. This opens the door for a lot of wasted time. Garelick questioned the wisdom of teaching several alternative strategies for addition. He asked whether, under the Common Core, only the standard algorithm could be taught—or at least, could it be taught first. As he explains:
Delaying teaching of the standard algorithm until fourth grade and relying on place value “strategies” and drawings to add numbers is thought to provide students with the conceptual understanding of adding and subtracting multi-digit numbers. What happens, instead, is that the means to help learn, explain or memorize the procedure become a procedure unto itself and students are required to use inefficient cumbersome methods for two years. This is done in the belief that the alternative approaches confer understanding, so are superior to the standard algorithm. To teach the standard algorithm first would in reformers’ minds be rote learning. Reformers believe that by having students using strategies in lieu of the standard algorithm, students are still learning “skills” (albeit inefficient and confusing ones), and these skills support understanding of the standard algorithm. Students are left with a panoply of methods (praised as a good thing because students should have more than one way to solve problems), that confuse more than enlighten.
Zimba responded that the standard algorithm could, indeed, be the only method taught because it meets a crucial test: reinforcing knowledge of place value and the properties of operations. He goes on to say that other algorithms also may be taught that are consistent with the standards, but that the decision to do so is left in the hands of local educators and curriculum designers:
In short, the Common Core requires the standard algorithm; additional algorithms aren’t named, and they aren’t required…Standards can’t settle every disagreement—nor should they. As this discussion of just a single slice of the math curriculum illustrates, teachers and curriculum authors following the standards still may, and still must, make an enormous range of decisions.
Zimba defends delaying mastery of the standard algorithm until fourth grade, referring to it as a “culminating” standard that he would, if he were teaching, introduce in earlier grades. Zimba illustrates the curricular progression he would employ in a table, showing that he would introduce the standard algorithm for addition late in first grade (with two-digit addends) and then extend the complexity of its use and provide practice towards fluency until reaching the culminating standard in fourth grade. Zimba would introduce the subtraction algorithm in second grade and similarly ramp up its complexity until fourth grade.
It is important to note that in CCSSM the word “algorithm” appears for the first time (in plural form) in the third grade standards:
3.NBT.2 Fluently add and subtract within 1000 using strategies and algorithms based on place value, properties of operations, and/or the relationship between addition and subtraction.
The term “strategies and algorithms” is curious. Zimba explains, “It is true that the word ‘algorithms’ here is plural, but that could be read as simply leaving more choice in the hands of the teacher about which algorithm(s) to teach—not as a requirement for each student to learn two or more general algorithms for each operation!”
I have described before the “dog whistles” embedded in the Common Core, signals to educational progressives—in this case, math reformers—that despite these being standards, the CCSSM will allow them great latitude. Using the plural “algorithms” in this third grade standard and not specifying the standard algorithm until fourth grade is a perfect example of such a dog whistle.
It appears that the Common Core authors wanted to reach a political compromise on standard algorithms.
Standard algorithms were a key point of contention in the “Math Wars” of the 1990s. The 1997 California Framework for Mathematics required that students know the standard algorithms for all four operations—addition, subtraction, multiplication, and division—by the end of fourth grade.^{[ii]} The 2000 Massachusetts Mathematics Curriculum Framework called for learning the standard algorithms for addition and subtraction by the end of second grade and for multiplication and division by the end of fourth grade. These two frameworks were heavily influenced by mathematicians (from Stanford in California and Harvard in Massachusetts) and quickly became favorites of math traditionalists. In both states’ frameworks, the standard algorithm requirements were in direct opposition to the reformoriented frameworks that preceded them—in which standard algorithms were barely mentioned and alternative algorithms or “strategies” were encouraged.
Now that the CCSSM has replaced these two frameworks, the requirement for knowing the standard algorithms in California and Massachusetts slips from third or fourth grade all the way to sixth grade. That’s what reformers get in the compromise. They are given a green light to continue teaching alternative algorithms, as long as the algorithms are consistent with teaching place value and properties of arithmetic. But the standard algorithm is the only one students are required to learn. And that exclusivity is intended to please the traditionalists.
I agree with Garelick that the compromise leads to problems. In a 2013 Chalkboard post, I described a first grade math program in which parents were explicitly requested not to teach the standard algorithm for addition when helping their children at home. The students were being taught how to represent addition with drawings that clustered objects into groups of ten. The exercises were both time consuming and tedious. When the parents met with the school principal to discuss the matter, the principal told them that the math program was following the Common Core by promoting deeper learning. The parents withdrew their child from the school and enrolled him in private school.
The value of standard algorithms is that they are efficient and packed with mathematics. Once students have mastered singledigit operations and the meaning of place value, the standard algorithms reveal to students that they can take procedures that they already know work well with one and twodigit numbers, and by applying them over and over again, solve problems with large numbers. Traditionalists and reformers have different goals. Reformers believe exposure to several algorithms encourages flexible thinking and the ability to draw on multiple strategies for solving problems. Traditionalists believe that a bigger problem than students learning too few algorithms is that too few students learn even one algorithm.
I have been a critic of the math reform movement since I taught in the 1980s. But some of their complaints have merit. All too often, instruction on standard algorithms has left out meaning. As Karen C. Fuson and Sybilla Beckmann point out, “an unfortunate dichotomy” emerged in math instruction: teachers taught “strategies” that implied understanding and “algorithms” that implied procedural steps that were to be memorized. Michael Battista’s research has provided many instances of students clinging to algorithms without understanding. He gives an example of a student who has not quite mastered the standard algorithm for addition and makes numerous errors on a worksheet. On one item, for example, the student forgets to carry and calculates that 19 + 6 = 15. In a postworksheet interview, the student counts 6 units from 19 and arrives at 25. Despite the obvious discrepancy—(25 is not 15, the student agrees)—he declares that his answers on the worksheet must be correct because the algorithm he used “always works.”^{[iii]}^{ }
Math reformers rightfully argue that blind faith in procedure has no place in a thinking mathematical classroom. Who can disagree with that? Students should be able to evaluate the validity of answers, regardless of the procedures used, and propose alternative solutions. Standard algorithms are tools to help them do that, but students must be able to apply them, not in a robotic way, but with understanding.
Let’s return to Carroll’s model of time and learning. I conclude by making two points—one about curriculum and instruction, the other about implementation.
In the study of numbers, a coherent K12 math curriculum, similar to that of the previous California and Massachusetts frameworks, can be sketched in a few short sentences. Addition with whole numbers (including the standard algorithm) is taught in first grade, subtraction in second grade, multiplication in third grade, and division in fourth grade. Thus, the study of whole number arithmetic is completed by the end of fourth grade. Grades five through seven focus on rational numbers (fractions, decimals, percentages), and grades eight through twelve study advanced mathematics. Proficiency is sought along three dimensions: 1) fluency with calculations, 2) conceptual understanding, 3) ability to solve problems.
Placing the CCSSM standard for knowing the standard algorithms of addition and subtraction in fourth grade delays this progression by two years. Placing the standard for the division algorithm in sixth grade continues the twoyear delay. For many fourth graders, time spent working on addition and subtraction will be wasted time. They already have a firm understanding of addition and subtraction. The same thing for many sixth graders—time devoted to the division algorithm will be wasted time that should be devoted to the study of rational numbers. The numerator in Carroll’s instructional time model will be greater than the denominator, indicating the inefficient allocation of time to instruction.
As Jason Zimba points out, not everyone agrees on when the standard algorithms should be taught, the alternative algorithms that should be taught, the manner in which any algorithm should be taught, or the amount of instructional time that should be spent on computational procedures. Such decisions are made by local educators. Variation in these decisions will introduce variation in the implementation of the math standards. It is true that standards, any standards, cannot control implementation, especially the twists and turns in how they are interpreted by educators and brought to life in classroom instruction. But in this case, the standards themselves are responsible for the myriad approaches, many unproductive, that we are sure to see as schools teach various algorithms under the Common Core.
[i] Tracking, ability grouping, differentiated learning, programmed learning, individualized instruction, and personalized learning (including today’s flipped classrooms) are all attempts to solve the challenge of student heterogeneity.
[ii] An earlier version of this post incorrectly stated that the California framework required that students know the standard algorithms for all four operations by the end of third grade. I regret the error.
[iii] Michael T. Battista (2001). “Research and Reform in Mathematics Education,” pp. 3284 in The Great Curriculum Debate: How Should We Teach Reading and Math? (T. Loveless, ed., Brookings Instiution Press).
This is part two of my analysis of instruction and Common Core’s implementation. I dubbed the three-part examination of instruction “The Good, The Bad, and the Ugly.” Having discussed the “good” in part one, I now turn to the “bad.” One particular aspect of the Common Core math standards—the treatment of standard algorithms in whole number arithmetic—will lead some teachers to waste instructional time.
In 1963, psychologist John B. Carroll published a short essay, “A Model of School Learning,” in Teachers College Record. Carroll proposed a parsimonious model of learning that expressed the degree of learning (or what today is commonly called achievement) as a function of the ratio of time spent on learning to the time needed to learn.
The numerator, time spent learning, has also been given the term opportunity to learn. The denominator, time needed to learn, is synonymous with student aptitude. By expressing aptitude as time needed to learn, Carroll refreshingly broke through his era’s debate about the origins of intelligence (nature vs. nurture) and the vocabulary that labels students as having more or less intelligence. He also spoke directly to a primary challenge of teaching: how to effectively produce learning in classrooms populated by students needing vastly different amounts of time to learn the exact same content.^{[i]}
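Carroll’s ratio can be sketched in a few lines of code. This is my own minimal operationalization, not Carroll’s formal model: the function name and the cap at 1.0 (treating content as fully learned once time spent reaches time needed) are assumptions for illustration only.

```python
def degree_of_learning(time_spent: float, time_needed: float) -> float:
    """Carroll's ratio: fraction of content learned, between 0.0 and 1.0.

    The cap at 1.0 is a simplifying assumption: once a student has had
    as much time as he or she needs, additional time adds no learning.
    """
    if time_needed <= 0:
        raise ValueError("time_needed must be positive")
    return min(1.0, time_spent / time_needed)

# A student who needs 10 hours but receives only 5 learns half the content.
print(degree_of_learning(5, 10))   # 0.5
# Fifteen hours for a 10-hour learner: the extra 5 hours are wasted time.
print(degree_of_learning(15, 10))  # 1.0
```

The second call illustrates the inefficiency discussed below: when the numerator exceeds the denominator, the surplus instructional time produces nothing.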
The source of that variation is largely irrelevant to the constraints placed on instructional decisions. Teachers obviously have limited control over the denominator of the ratio (they must take kids as they are) and less than one might think over the numerator. Teachers allot time to instruction only after educational authorities have decided the number of hours in the school day, the number of days in the school year, the number of minutes in class periods in middle and high schools, and the amount of time set aside for lunch, recess, passing periods, various pull-out programs, pep rallies, and the like. There are also announcements over the PA system, stray dogs that may wander into the classroom, and other unscheduled encroachments on instructional time.
The model has had a profound influence on educational thought. As of July 5, 2015, Google Scholar reported 2,931 citations of Carroll’s article. Benjamin Bloom’s “mastery learning” was deeply influenced by Carroll. It is predicated on the idea that optimal learning occurs when time spent on learning—rather than content—is allowed to vary, providing to each student the individual amount of time he or she needs to learn a common curriculum. This is often referred to as “students working at their own pace,” and progress is measured by mastery of content rather than seat time. David C. Berliner’s 1990 discussion of time includes an analysis of mediating variables in the numerator of Carroll’s model, including the amount of time students are willing to spend on learning. Carroll called this persistence, and Berliner links the construct to student engagement and time on task—topics of keen interest to researchers today. Berliner notes that although both are typically described in terms of motivation, they can be measured empirically in increments of time.
Most applications of Carroll’s model have been interested in what happens when insufficient time is provided for learning—in other words, when the numerator of the ratio is significantly less than the denominator. When that happens, students don’t have an adequate opportunity to learn. They need more time.
As applied to Common Core and instruction, one should also be aware of problems that arise from the inefficient distribution of time. Time is a limited resource that teachers deploy in the production of learning. Below I discuss instances when the CCSSM may lead to the numerator in Carroll’s model being significantly larger than the denominator—when teachers spend more time teaching a concept or skill than is necessary. Because time is limited and fixed, wasted time on one topic will shorten the amount of time available to teach other topics. Excessive instructional time may also negatively affect student engagement. Students who have fully learned content that continues to be taught may become bored; they must endure instruction that they do not need.
Jason Zimba, one of the lead authors of the Common Core math standards, and Barry Garelick, a critic of the standards, had a recent, interesting exchange about when standard algorithms are called for in the CCSSM. A standard algorithm is a series of steps designed to compute accurately and quickly. In the U.S., students are typically taught the standard algorithms of addition, subtraction, multiplication, and division with whole numbers. Most readers of this post will recognize the standard algorithm for addition. It involves lining up two or more multi-digit numbers according to place value, with one number written over the other, and adding the columns from right to left with “carrying” (or regrouping) as needed.
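For readers who want the right-to-left column procedure spelled out step by step, here is a minimal sketch in Python. The function name is mine, and the code is an illustration of the familiar paper-and-pencil procedure, not code from any curriculum:

```python
def add_standard(a: int, b: int) -> int:
    """Add two nonnegative integers column by column, right to left,
    carrying (regrouping) into the next place value as needed."""
    # Split each number into digits, least significant first.
    da = [int(d) for d in str(a)][::-1]
    db = [int(d) for d in str(b)][::-1]
    result, carry = [], 0
    for i in range(max(len(da), len(db))):
        column = (da[i] if i < len(da) else 0) + (db[i] if i < len(db) else 0) + carry
        carry, digit = divmod(column, 10)  # carry goes to the next column
        result.append(digit)
    if carry:
        result.append(carry)
    return int("".join(str(d) for d in reversed(result)))

print(add_standard(19, 6))     # 25
print(add_standard(487, 356))  # 843
```

The same loop works unchanged for numbers of any size, which is the point made below about the algorithm being “packed with mathematics”: a procedure mastered on small numbers scales to large ones.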
The standard algorithm is the only algorithm that students are required to learn, although others are mentioned beginning with the first grade standards. Curiously, though, the CCSSM doesn’t require students to know the standard algorithms for addition and subtraction until fourth grade. This opens the door for a lot of wasted time. Garelick questioned the wisdom of teaching several alternative strategies for addition. He asked whether, under the Common Core, only the standard algorithm could be taught—or, at least, whether it could be taught first. As he explains:
Delaying teaching of the standard algorithm until fourth grade and relying on place value “strategies” and drawings to add numbers is thought to provide students with the conceptual understanding of adding and subtracting multi-digit numbers. What happens, instead, is that the means to help learn, explain or memorize the procedure become a procedure unto itself and students are required to use inefficient cumbersome methods for two years. This is done in the belief that the alternative approaches confer understanding, so are superior to the standard algorithm. To teach the standard algorithm first would in reformers’ minds be rote learning. Reformers believe that by having students using strategies in lieu of the standard algorithm, students are still learning “skills” (albeit inefficient and confusing ones), and these skills support understanding of the standard algorithm. Students are left with a panoply of methods (praised as a good thing because students should have more than one way to solve problems), that confuse more than enlighten.
Zimba responded that the standard algorithm could, indeed, be the only method taught because it meets a crucial test: reinforcing knowledge of place value and the properties of operations. He goes on to say that other algorithms also may be taught that are consistent with the standards, but that the decision to do so is left in the hands of local educators and curriculum designers:
In short, the Common Core requires the standard algorithm; additional algorithms aren’t named, and they aren’t required…Standards can’t settle every disagreement—nor should they. As this discussion of just a single slice of the math curriculum illustrates, teachers and curriculum authors following the standards still may, and still must, make an enormous range of decisions.
Zimba defends delaying mastery of the standard algorithm until fourth grade, referring to it as a “culminating” standard that he would, if he were teaching, introduce in earlier grades. Zimba illustrates the curricular progression he would employ in a table, showing that he would introduce the standard algorithm for addition late in first grade (with two-digit addends) and then extend the complexity of its use and provide practice towards fluency until reaching the culminating standard in fourth grade. Zimba would introduce the subtraction algorithm in second grade and similarly ramp up its complexity until fourth grade.
It is important to note that in CCSSM the word “algorithm” appears for the first time (in plural form) in the third grade standards:
3.NBT.2 Fluently add and subtract within 1000 using strategies and algorithms based on place value, properties of operations, and/or the relationship between addition and subtraction.
The term “strategies and algorithms” is curious. Zimba explains, “It is true that the word ‘algorithms’ here is plural, but that could be read as simply leaving more choice in the hands of the teacher about which algorithm(s) to teach—not as a requirement for each student to learn two or more general algorithms for each operation!”
I have described before the “dog whistles” embedded in the Common Core, signals to educational progressives—in this case, math reformers—that despite these being standards, the CCSSM will allow them great latitude. Using the plural “algorithms” in this third grade standard and not specifying the standard algorithm until fourth grade is a perfect example of such a dog whistle.
It appears that the Common Core authors wanted to reach a political compromise on standard algorithms.
Standard algorithms were a key point of contention in the “Math Wars” of the 1990s. The 1997 California Framework for Mathematics required that students know the standard algorithms for all four operations—addition, subtraction, multiplication, and division—by the end of fourth grade.^{[ii]} The 2000 Massachusetts Mathematics Curriculum Framework called for learning the standard algorithms for addition and subtraction by the end of second grade and for multiplication and division by the end of fourth grade. These two frameworks were heavily influenced by mathematicians (from Stanford in California and Harvard in Massachusetts) and quickly became favorites of math traditionalists. In both states’ frameworks, the standard algorithm requirements were in direct opposition to the reform-oriented frameworks that preceded them—in which standard algorithms were barely mentioned and alternative algorithms or “strategies” were encouraged.
Now that the CCSSM has replaced these two frameworks, the requirement for knowing the standard algorithms in California and Massachusetts slips from third or fourth grade all the way to sixth grade. That’s what reformers get in the compromise. They are given a green light to continue teaching alternative algorithms, as long as the algorithms are consistent with teaching place value and properties of arithmetic. But the standard algorithm is the only one students are required to learn. And that exclusivity is intended to please the traditionalists.
I agree with Garelick that the compromise leads to problems. In a 2013 Chalkboard post, I described a first grade math program in which parents were explicitly requested not to teach the standard algorithm for addition when helping their children at home. The students were being taught how to represent addition with drawings that clustered objects into groups of ten. The exercises were both time-consuming and tedious. When the parents met with the school principal to discuss the matter, the principal told them that the math program was following the Common Core by promoting deeper learning. The parents withdrew their child from the school and enrolled him in private school.
The value of standard algorithms is that they are efficient and packed with mathematics. Once students have mastered single-digit operations and the meaning of place value, the standard algorithms reveal to students that they can take procedures that they already know work well with one- and two-digit numbers, and by applying them over and over again, solve problems with large numbers. Traditionalists and reformers have different goals. Reformers believe exposure to several algorithms encourages flexible thinking and the ability to draw on multiple strategies for solving problems. Traditionalists believe that a bigger problem than students learning too few algorithms is that too few students learn even one algorithm.
I have been a critic of the math reform movement since I taught in the 1980s. But some of its complaints have merit. All too often, instruction on standard algorithms has left out meaning. As Karen C. Fuson and Sybilla Beckmann point out, “an unfortunate dichotomy” emerged in math instruction: teachers taught “strategies” that implied understanding and “algorithms” that implied procedural steps that were to be memorized. Michael Battista’s research has provided many instances of students clinging to algorithms without understanding. He gives an example of a student who has not quite mastered the standard algorithm for addition and makes numerous errors on a worksheet. On one item, for example, the student forgets to carry and calculates that 19 + 6 = 15. In a post-worksheet interview, the student counts 6 units from 19 and arrives at 25. Despite the obvious discrepancy (25 is not 15, the student agrees), he declares that his answers on the worksheet must be correct because the algorithm he used “always works.”^{[iii]}
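The student’s error can be reproduced by deliberately breaking the column procedure. This sketch is my own illustration (the function name and setup are hypothetical, not drawn from Battista): dropping the carry is exactly what turns 19 + 6 into 15.

```python
def add_forgetting_carry(a: int, b: int) -> int:
    """Column addition in which the carry is dropped: each column keeps
    only the ones digit of its sum, reproducing the student's bug."""
    da = [int(d) for d in str(a)][::-1]  # digits, least significant first
    db = [int(d) for d in str(b)][::-1]
    digits = []
    for i in range(max(len(da), len(db))):
        column = (da[i] if i < len(da) else 0) + (db[i] if i < len(db) else 0)
        digits.append(column % 10)  # the carry (column // 10) is simply lost
    return int("".join(str(d) for d in reversed(digits)))

print(add_forgetting_carry(19, 6))  # 15 -- the worksheet answer
print(19 + 6)                       # 25 -- the counting-on answer
```

A student who understood why the procedure works would notice that the ones column sums to 15 and that the lost ten has to go somewhere; a student executing steps by rote does not.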
Math reformers rightfully argue that blind faith in procedure has no place in a thinking mathematical classroom. Who can disagree with that? Students should be able to evaluate the validity of answers, regardless of the procedures used, and propose alternative solutions. Standard algorithms are tools to help them do that, but students must be able to apply them, not in a robotic way, but with understanding.
Let’s return to Carroll’s model of time and learning. I conclude by making two points—one about curriculum and instruction, the other about implementation.
In the study of numbers, a coherent K-12 math curriculum, similar to that of the previous California and Massachusetts frameworks, can be sketched in a few short sentences. Addition with whole numbers (including the standard algorithm) is taught in first grade, subtraction in second grade, multiplication in third grade, and division in fourth grade. Thus, the study of whole number arithmetic is completed by the end of fourth grade. Grades five through seven focus on rational numbers (fractions, decimals, percentages), and grades eight through twelve study advanced mathematics. Proficiency is sought along three dimensions: 1) fluency with calculations, 2) conceptual understanding, and 3) the ability to solve problems.
Placing the CCSSM standard for knowing the standard algorithms of addition and subtraction in fourth grade delays this progression by two years. Placing the standard for the division algorithm in sixth grade continues the two-year delay. For many fourth graders, time spent working on addition and subtraction will be wasted time. They already have a firm understanding of addition and subtraction. The same holds for many sixth graders—time devoted to the division algorithm will be wasted time that should be devoted to the study of rational numbers. The numerator in Carroll’s instructional time model will be greater than the denominator, indicating the inefficient allocation of time to instruction.
As Jason Zimba points out, not everyone agrees on when the standard algorithms should be taught, the alternative algorithms that should be taught, the manner in which any algorithm should be taught, or the amount of instructional time that should be spent on computational procedures. Such decisions are made by local educators. Variation in these decisions will introduce variation in the implementation of the math standards. It is true that standards, any standards, cannot control implementation, especially the twists and turns in how they are interpreted by educators and brought to life in classroom instruction. But in this case, the standards themselves are responsible for the myriad approaches, many unproductive, that we are sure to see as schools teach various algorithms under the Common Core.
[i] Tracking, ability grouping, differentiated learning, programmed learning, individualized instruction, and personalized learning (including today’s flipped classrooms) are all attempts to solve the challenge of student heterogeneity.
[ii] An earlier version of this post incorrectly stated that the California framework required that students know the standard algorithms for all four operations by the end of third grade. I regret the error.
[iii] Michael T. Battista (2001). “Research and Reform in Mathematics Education,” pp. 32-84 in The Great Curriculum Debate: How Should We Teach Reading and Math? (T. Loveless, ed., Brookings Institution Press).
This post continues a series begun in 2014 on implementing the Common Core State Standards (CCSS). The first installment introduced an analytical scheme investigating CCSS implementation along four dimensions: curriculum, instruction, assessment, and accountability. Three posts focused on curriculum. This post turns to instruction. Although the impact of CCSS on how teachers teach is discussed, the post is also concerned with the inverse relationship, how decisions that teachers make about instruction shape the implementation of CCSS.
A couple of points before we get started. The previous posts on curriculum led readers from the upper levels of the educational system—federal and state policies—down to curricular decisions made “in the trenches”—in districts, schools, and classrooms. Standards emanate from the top of the system and are produced by politicians, policymakers, and experts. Curricular decisions are shared across education’s systemic levels. Instruction, on the other hand, is dominated by practitioners. The daily decisions that teachers make about how to teach under CCSS—and not the idealizations of instruction embraced by upperlevel authorities—will ultimately determine what “CCSS instruction” really means.
I ended the last post on CCSS by describing how curriculum and instruction can be so closely intertwined that the boundary between them is blurred. Sometimes stating a precise curricular objective dictates, or at least constrains, the range of instructional strategies that teachers may consider. That post focused on English-Language Arts. The current post focuses on mathematics in the elementary grades and describes examples of how CCSS will shape math instruction. As a former elementary school teacher, I offer my own personal opinion on these effects.
Certain aspects of the Common Core, when implemented, are likely to have a positive impact on the instruction of mathematics. For example, Common Core stresses that students recognize fractions as numbers on a number line. The emphasis begins in third grade:
CCSS.MATH.CONTENT.3.NF.A.2
Understand a fraction as a number on the number line; represent fractions on a number line diagram.
CCSS.MATH.CONTENT.3.NF.A.2.A
Represent a fraction 1/b on a number line diagram by defining the interval from 0 to 1 as the whole and partitioning it into b equal parts. Recognize that each part has size 1/b and that the endpoint of the part based at 0 locates the number 1/b on the number line.
CCSS.MATH.CONTENT.3.NF.A.2.B
Represent a fraction a/b on a number line diagram by marking off a lengths 1/b from 0. Recognize that the resulting interval has size a/b and that its endpoint locates the number a/b on the number line.
When I first read this section of the Common Core standards, I stood up and cheered. Berkeley mathematician Hung-Hsi Wu has been working with teachers for years to get them to understand the importance of using number lines in teaching fractions.^{[1]} American textbooks rely heavily on part-whole representations to introduce fractions. Typically, students see pizzas and apples and other objects—usually other foods or money—that are divided up into equal parts. Such models are limited. They work okay with simple addition and subtraction. Common denominators present a bit of a challenge, but ½ pizza can be shown to be also 2/4, a half dollar equal to two quarters, and so on.
With multiplication and division, all the little tricks students learned with whole number arithmetic suddenly go haywire. Students are accustomed to the fact that multiplying two whole numbers yields a product that is larger than either number being multiplied: 4 X 5 = 20 and 20 is larger than both 4 and 5.^{[2]} How in the world can ¼ X 1/5 = 1/20, a number much smaller than either 1/4 or 1/5? The part-whole representation has convinced many students that fractions are not numbers. Instead, they are seen as strange expressions comprising two numbers with a small horizontal bar separating them.
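Exact rational arithmetic makes the surprise concrete; a quick check with Python’s fractions module:

```python
from fractions import Fraction

# Whole numbers: the product is larger than either factor.
print(4 * 5)  # 20

# Proper fractions: the product is smaller than either factor.
product = Fraction(1, 4) * Fraction(1, 5)
print(product)                   # 1/20
print(product < Fraction(1, 4))  # True
print(product < Fraction(1, 5))  # True
```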
I taught sixth grade but occasionally visited my colleagues’ classes in the lower grades. I recall one exchange with second or third graders that went something like this:
“Give me a number between seven and nine.” Giggles.
“Eight!” they shouted.
“Give me a number between two and three.” Giggles.
“There isn’t one!” they shouted.
“Really?” I’d ask and draw a number line. After spending some time placing whole numbers on the number line, I’d observe, “There’s a lot of space between two and three. Is it just empty?”
Silence. Puzzled little faces. Then a quiet voice. “Two and a half?”
You have no idea how many children do not make the transition to understanding fractions as numbers and, because of stumbling at this crucial stage, spend the rest of their careers as students of mathematics convinced that fractions are an impenetrable mystery. And that’s not true of just students. California adopted a test for teachers in the 1980s, the California Basic Educational Skills Test (CBEST). Beginning in 1982, even teachers already in the classroom had to pass it. I made a nice after-school and summer income tutoring colleagues who didn’t know fractions from Fermat’s Last Theorem. To be fair, primary teachers, teaching kindergarten or grades 1-2, would not teach fractions as part of their math curriculum and probably hadn’t worked with a fraction in decades. So they are no different than nonliterary types who think Hamlet is just a play about a young guy who can’t make up his mind, has a weird relationship with his mother, and winds up dying at the end.
Division is the most difficult operation to grasp for those arrested at the part-whole stage of understanding fractions. A problem that Liping Ma posed to teachers is now legendary.^{[3]}
She asked small groups of American and Chinese elementary teachers to divide 1 ¾ by ½ and to create a word problem that illustrates the calculation. All 72 Chinese teachers gave the correct answer and 65 developed an appropriate word problem. Only nine of the 23 American teachers solved the problem correctly. A single American teacher was able to devise an appropriate word problem. Granted, the American sample was not selected to be representative of American teachers as a whole, but the stark findings of the exercise did not shock anyone who has worked closely with elementary teachers in the U.S. They are often weak at math. Many of the teachers in Ma’s study had vague ideas of an “invert and multiply” rule but lacked a conceptual understanding of why it worked.
A linguistic convention exacerbates the difficulty. Students may cling to the mistaken notion that “dividing in half” means “dividing by one-half.” It does not. Dividing in half means dividing by two. The number line can help clear up such confusion. Consider a basic, whole-number division problem for which third graders will already know the answer: 8 divided by 2 equals 4. It is evident that a segment 8 units in length (measured from 0 to 8) is divided by a segment 2 units in length (measured from 0 to 2) exactly 4 times. Modeling 12 divided by 2 and other basic facts with 2 as a divisor will convince students that whole number division works quite well on a number line.
Now consider the number ½ as a divisor. It will become clear to students that 8 divided by ½ equals 16, and they can illustrate that fact on a number line by showing how a segment ½ units in length divides a segment 8 units in length exactly 16 times; it divides a segment 12 units in length 24 times; and so on. Students will be relieved to discover that on a number line division with fractions works the same as division with whole numbers.
Now, let’s return to Liping Ma’s problem: 1 ¾ divided by ½. This problem would not be presented in third grade, but it might be in fifth or sixth grades. Students who have been working with fractions on a number line for two or three years will have little trouble solving it. They will see that the problem simply asks them to divide a line segment of 1 3/4 units by a segment of ½ units. The answer is 3 ½ . Some students might estimate that the solution is between 3 and 4 because 1 ¾ lies between 1 ½ and 2, which on the number line are the points at which the ½ unit segment, laid end on end, falls exactly three and four times. Other students will have learned about reciprocals and that multiplication and division are inverse operations. They will immediately grasp that dividing by ½ is the same as multiplying by 2—and since 1 ¾ x 2 = 3 ½, that is the answer. Creating a word problem involving string or rope or some other linearly measured object is also surely within their grasp.
I applaud the CCSS for introducing number lines and fractions in third grade. I believe it will instill in children an important idea: fractions are numbers. That foundational understanding will aid them as they work with more abstract representations of fractions in later grades. Fractions are a monumental barrier for kids who struggle with math, so the significance of this contribution should not be underestimated.
I mentioned above that instruction and curriculum are often intertwined. I began this series of posts by defining curriculum as the “stuff” of learning—the content of what is taught in school, especially as embodied in the materials used in instruction. Instruction refers to the “how” of teaching—how teachers organize, present, and explain those materials. It’s each teacher’s repertoire of instructional strategies and techniques that differentiates one teacher from another even as they teach the same content. Choosing to use a number line to teach fractions is obviously an instructional decision, but it also involves curriculum. The number line is mathematical content, not just a teaching tool.
Guiding third grade teachers towards using a number line does not guarantee effective instruction. In fact, it is reasonable to expect variation in how teachers will implement the CCSS standards listed above. A small body of research exists to guide practice. One of the best resources for teachers to consult is a practice guide published by the What Works Clearinghouse: Developing Effective Fractions Instruction for Kindergarten Through Eighth Grade (see full disclosure below).^{[4] } The guide recommends the use of number lines as its second recommendation, but it also states that the evidence supporting the effectiveness of number lines in teaching fractions is inferred from studies involving whole numbers and decimals. We need much more research on how and when number lines should be used in teaching fractions.
Professor Wu states the following, “The shift of emphasis from models of a fraction in the initial stage to an almost exclusive model of a fraction as a point on the number line can be done gradually and gracefully beginning somewhere in grade four. This shift is implicit in the Common Core Standards.”^{[5]} I agree, but the shift is also subtle. CCSS standards include the use of other representations—fraction strips, fraction bars, rectangles (which are excellent for showing multiplication of two fractions) and other graphical means of modeling fractions. Some teachers will manage the shift to number lines adroitly—and others will not. As a consequence, the quality of implementation will vary from classroom to classroom based on the instructional decisions that teachers make.
The current post has focused on what I believe to be a positive aspect of CCSS based on the implementation of the standards through instruction. Future posts in the series—covering the “bad” and the “ugly”—will describe aspects of instruction on which I am less optimistic.
[1] See H. Wu (2014). “Teaching Fractions According to the Common Core Standards,” https://math.berkeley.edu/~wu/CCSSFractions_1.pdf. Also see "What's Sophisticated about Elementary Mathematics?" http://www.aft.org/sites/default/files/periodicals/wu_0.pdf
[2] Students learn that 0 and 1 are exceptions and have their own special rules in multiplication.
[3] Liping Ma, Knowing and Teaching Elementary Mathematics.
[4] The practice guide can be found at: http://ies.ed.gov/ncee/wwc/pdf/practice_guides/fractions_pg_093010.pdf I serve as a content expert in elementary mathematics for the What Works Clearinghouse. I had nothing to do, however, with the publication cited.
[5] Wu, page 3.
This post continues a series begun in 2014 on implementing the Common Core State Standards (CCSS). The first installment introduced an analytical scheme investigating CCSS implementation along four dimensions: curriculum, instruction, assessment, and accountability. Three posts focused on curriculum. This post turns to instruction. Although the impact of CCSS on how teachers teach is discussed, the post is also concerned with the inverse relationship: how the decisions that teachers make about instruction shape the implementation of CCSS.
A couple of points before we get started. The previous posts on curriculum led readers from the upper levels of the educational system—federal and state policies—down to curricular decisions made “in the trenches”—in districts, schools, and classrooms. Standards emanate from the top of the system and are produced by politicians, policymakers, and experts. Curricular decisions are shared across education’s systemic levels. Instruction, on the other hand, is dominated by practitioners. The daily decisions that teachers make about how to teach under CCSS—and not the idealizations of instruction embraced by upper-level authorities—will ultimately determine what “CCSS instruction” really means.
I ended the last post on CCSS by describing how curriculum and instruction can be so closely intertwined that the boundary between them is blurred. Sometimes stating a precise curricular objective dictates, or at least constrains, the range of instructional strategies that teachers may consider. That post focused on English-Language Arts. The current post focuses on mathematics in the elementary grades and describes examples of how CCSS will shape math instruction. As a former elementary school teacher, I offer my own opinion on these effects.
Certain aspects of the Common Core, when implemented, are likely to have a positive impact on the instruction of mathematics. For example, Common Core stresses that students recognize fractions as numbers on a number line. The emphasis begins in third grade:
CCSS.MATH.CONTENT.3.NF.A.2
Understand a fraction as a number on the number line; represent fractions on a number line diagram.
CCSS.MATH.CONTENT.3.NF.A.2.A
Represent a fraction 1/b on a number line diagram by defining the interval from 0 to 1 as the whole and partitioning it into b equal parts. Recognize that each part has size 1/b and that the endpoint of the part based at 0 locates the number 1/b on the number line.
CCSS.MATH.CONTENT.3.NF.A.2.B
Represent a fraction a/b on a number line diagram by marking off a lengths 1/b from 0. Recognize that the resulting interval has size a/b and that its endpoint locates the number a/b on the number line.
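The construction these standards describe—partitioning the interval from 0 to 1 into b equal parts and marking off a of them—can be sketched in a few lines of Python. This is only an illustration; the function name is mine, not anything from the standards:

```python
from fractions import Fraction

def locate_on_number_line(a, b):
    """Locate a/b by marking off a lengths of 1/b from 0,
    in the spirit of CCSS.MATH.CONTENT.3.NF.A.2.B."""
    unit = Fraction(1, b)                     # partition [0, 1] into b equal parts
    ticks = [k * unit for k in range(a + 1)]  # endpoints: 0, 1/b, 2/b, ..., a/b
    return ticks[-1]                          # the final endpoint locates a/b

print(locate_on_number_line(3, 4))  # 3/4
```

Note that nothing in the construction requires a to be smaller than b; marking off five lengths of 1/4 locates 5/4, a point past 1, which is exactly the understanding the number line supports.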
When I first read this section of the Common Core standards, I stood up and cheered. Berkeley mathematician Hung-Hsi Wu has been working with teachers for years to get them to understand the importance of using number lines in teaching fractions.^{[1]} American textbooks rely heavily on part-whole representations to introduce fractions. Students see pizzas, apples, and other objects—typically foods or money—divided up into equal parts. Such models are limited. They work okay with simple addition and subtraction. Common denominators present a bit of a challenge, but ½ pizza can be shown to also be 2/4, a half dollar equal to two quarters, and so on.
With multiplication and division, all the little tricks students learned with whole number arithmetic suddenly go haywire. Students are accustomed to the fact that multiplying two whole numbers yields a product that is larger than either number being multiplied: 4 x 5 = 20, and 20 is larger than both 4 and 5.^{[2]} How in the world can ¼ x 1/5 = 1/20, a number much smaller than either ¼ or 1/5? The part-whole representation has convinced many students that fractions are not numbers. Instead, they are seen as strange expressions comprising two numbers with a small horizontal bar separating them.
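That the product of two proper fractions is smaller than either factor is easy to verify with exact arithmetic. A quick check using Python's `fractions` module:

```python
from fractions import Fraction

a, b = Fraction(1, 4), Fraction(1, 5)
product = a * b

print(product)                   # 1/20
print(product < a, product < b)  # the product is smaller than both factors
```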
I taught sixth grade but occasionally visited my colleagues’ classes in the lower grades. I recall one exchange with second or third graders that went something like this:
“Give me a number between seven and nine.” Giggles.
“Eight!” they shouted.
“Give me a number between two and three.” Giggles.
“There isn’t one!” they shouted.
“Really?” I’d ask and draw a number line. After spending some time placing whole numbers on the number line, I’d observe, “There’s a lot of space between two and three. Is it just empty?”
Silence. Puzzled little faces. Then a quiet voice. “Two and a half?”
You have no idea how many children never make the transition to understanding fractions as numbers and, because of stumbling at this crucial stage, spend the rest of their careers as students of mathematics convinced that fractions are an impenetrable mystery. And that’s true not just of students. California adopted a test for teachers in the 1980s, the California Basic Educational Skills Test (CBEST). Beginning in 1982, even teachers already in the classroom had to pass it. I made a nice after-school and summer income tutoring colleagues who didn’t know fractions from Fermat’s Last Theorem. To be fair, primary teachers, teaching kindergarten or grades 1-2, would not teach fractions as part of their math curriculum and probably hadn’t worked with a fraction in decades. So they are no different from non-literary types who think Hamlet is just a play about a young guy who can’t make up his mind, has a weird relationship with his mother, and winds up dying at the end.
Division is the most difficult operation to grasp for those arrested at the part-whole stage of understanding fractions. A problem that Liping Ma posed to teachers is now legendary.^{[3]}
She asked small groups of American and Chinese elementary teachers to divide 1 ¾ by ½ and to create a word problem that illustrates the calculation. All 72 Chinese teachers gave the correct answer and 65 developed an appropriate word problem. Only nine of the 23 American teachers solved the problem correctly. A single American teacher was able to devise an appropriate word problem. Granted, the American sample was not selected to be representative of American teachers as a whole, but the stark findings of the exercise did not shock anyone who has worked closely with elementary teachers in the U.S. They are often weak at math. Many of the teachers in Ma’s study had vague ideas of an “invert and multiply” rule but lacked a conceptual understanding of why it worked.
A linguistic convention exacerbates the difficulty. Students may cling to the mistaken notion that “dividing in half” means “dividing by one-half.” It does not. Dividing in half means dividing by two. The number line can help clear up such confusion. Consider a basic, whole-number division problem for which third graders will already know the answer: 8 divided by 2 equals 4. It is evident that a segment 8 units in length (measured from 0 to 8) is divided by a segment 2 units in length (measured from 0 to 2) exactly 4 times. Modeling 12 divided by 2 and other basic facts with 2 as a divisor will convince students that whole number division works quite well on a number line.
Now consider the number ½ as a divisor. It will become clear to students that 8 divided by ½ equals 16, and they can illustrate that fact on a number line by showing how a segment ½ unit in length divides a segment 8 units in length exactly 16 times; it divides a segment 12 units in length 24 times; and so on. Students will be relieved to discover that, on a number line, division with fractions works the same as division with whole numbers.
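The counting argument above—asking how many times the divisor segment fits along the dividend segment—can be sketched directly. A minimal illustration in Python (the helper name is mine), valid when the divisor measures the length evenly:

```python
from fractions import Fraction

def times_divisor_fits(length, divisor):
    """Count how many times the divisor segment, laid end on end from 0,
    fits along a segment of the given length."""
    count = 0
    position = Fraction(0)
    while position < length:
        position += divisor
        count += 1
    return count  # exact when the divisor measures the length evenly

print(times_divisor_fits(Fraction(8), Fraction(2)))     # 8 divided by 2 is 4
print(times_divisor_fits(Fraction(8), Fraction(1, 2)))  # 8 divided by 1/2 is 16
```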
Now, let’s return to Liping Ma’s problem: 1 ¾ divided by ½. This problem would not be presented in third grade, but it might be in fifth or sixth grade. Students who have been working with fractions on a number line for two or three years will have little trouble solving it. They will see that the problem simply asks them to divide a line segment of 1 ¾ units by a segment of ½ unit. The answer is 3 ½. Some students might estimate that the solution is between 3 and 4 because 1 ¾ lies between 1 ½ and 2, which on the number line are the points at which the ½ unit segment, laid end on end, falls exactly three and four times. Other students will have learned about reciprocals and that multiplication and division are inverse operations. They will immediately grasp that dividing by ½ is the same as multiplying by 2—and since 1 ¾ x 2 = 3 ½, that is the answer. Creating a word problem involving string or rope or some other linearly measured object is also surely within their grasp.
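The reciprocal reasoning can likewise be checked with exact fraction arithmetic; a quick sketch:

```python
from fractions import Fraction

dividend = Fraction(7, 4)  # 1 3/4
divisor = Fraction(1, 2)

quotient = dividend / divisor  # direct division
by_reciprocal = dividend * 2   # dividing by 1/2 = multiplying by 2

print(quotient)                   # 7/2, that is, 3 1/2
print(quotient == by_reciprocal)  # both routes agree
```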
I applaud the CCSS for introducing number lines and fractions in third grade. I believe it will instill in children an important idea: fractions are numbers. That foundational understanding will aid them as they work with more abstract representations of fractions in later grades. Fractions are a monumental barrier for kids who struggle with math, so the significance of this contribution should not be underestimated.
I mentioned above that instruction and curriculum are often intertwined. I began this series of posts by defining curriculum as the “stuff” of learning—the content of what is taught in school, especially as embodied in the materials used in instruction. Instruction refers to the “how” of teaching—how teachers organize, present, and explain those materials. It’s each teacher’s repertoire of instructional strategies and techniques that differentiates one teacher from another even as they teach the same content. Choosing to use a number line to teach fractions is obviously an instructional decision, but it also involves curriculum. The number line is mathematical content, not just a teaching tool.
Guiding third-grade teachers towards using a number line does not guarantee effective instruction. In fact, it is reasonable to expect variation in how teachers will implement the CCSS standards listed above. A small body of research exists to guide practice. One of the best resources for teachers to consult is a practice guide published by the What Works Clearinghouse: Developing Effective Fractions Instruction for Kindergarten Through Eighth Grade (see full disclosure below).^{[4]} The use of number lines is the guide’s second recommendation, but the guide also states that the evidence supporting the effectiveness of number lines in teaching fractions is inferred from studies involving whole numbers and decimals. We need much more research on how and when number lines should be used in teaching fractions.
Professor Wu states: “The shift of emphasis from models of a fraction in the initial stage to an almost exclusive model of a fraction as a point on the number line can be done gradually and gracefully beginning somewhere in grade four. This shift is implicit in the Common Core Standards.”^{[5]} I agree, but the shift is also subtle. The CCSS include the use of other representations—fraction strips, fraction bars, rectangles (which are excellent for showing multiplication of two fractions), and other graphical means of modeling fractions. Some teachers will manage the shift to number lines adroitly—and others will not. As a consequence, the quality of implementation will vary from classroom to classroom based on the instructional decisions that teachers make.
The current post has focused on what I believe to be a positive aspect of CCSS based on the implementation of the standards through instruction. Future posts in the series—covering the “bad” and the “ugly”—will describe aspects of instruction on which I am less optimistic.
[1] See H. Wu (2014). “Teaching Fractions According to the Common Core Standards,” https://math.berkeley.edu/~wu/CCSSFractions_1.pdf. Also see "What's Sophisticated about Elementary Mathematics?" http://www.aft.org/sites/default/files/periodicals/wu_0.pdf
[2] Students learn that 0 and 1 are exceptions and have their own special rules in multiplication.
[3] Liping Ma, Knowing and Teaching Elementary Mathematics.
[4] The practice guide can be found at: http://ies.ed.gov/ncee/wwc/pdf/practice_guides/fractions_pg_093010.pdf I serve as a content expert in elementary mathematics for the What Works Clearinghouse. I had nothing to do, however, with the publication cited.
[5] Wu, page 3.
March 26, 2015
2:00 PM - 2:30 PM EDT
Online Only
Live Webcast
And more from the Brown Center Report on American Education
Girls outscore boys on practically every reading test given to a large population. And they have for a long time. A 1942 Iowa study found girls performing better than boys on tests of reading comprehension, vocabulary, and basic language skills, and girls have outscored boys on every reading test ever given by the National Assessment of Educational Progress (NAEP). This gap is not confined to the U.S. Reading tests administered as part of the Progress in International Reading Literacy Study (PIRLS) and the Program for International Student Assessment (PISA) reveal that the gender gap is a worldwide phenomenon.
On March 26, join Brown Center experts Tom Loveless and Matthew Chingos as they discuss the latest Brown Center Report on American Education, which examines this phenomenon. Hear what Loveless's analysis revealed about where the gender gap stands today and how it has trended over the past several decades, in the U.S. and around the world.
Tune in below or via Spreecast where you can submit questions.
This week marks the release of the 2015 Brown Center Report on American Education, the fourteenth issue of the series. One of the three studies in the report, “Girls, Boys, and Reading,” examines the gender gap in reading. Girls consistently outscore boys on reading assessments. They have for a long time. A 1942 study in Iowa discovered that girls were superior to boys on tests of reading comprehension, vocabulary, and basic language skills.^{[i]} Girls have outscored boys on the National Assessment of Educational Progress (NAEP) reading assessments since the first NAEP was administered in 1971.
I hope you’ll read the full study—and the other studies in the report—but allow me to summarize the main findings of the gender gap study here.
Eight assessments generate valid estimates of U.S. national reading performance: the Main NAEP, given at three grades (fourth, eighth, and 12^{th}); the NAEP Long Term Trend (NAEP-LTT), given at three ages (nine, 13, and 17); the Progress in International Reading Literacy Study (PIRLS), an international assessment given at fourth grade; and the Program for International Student Assessment (PISA), an international assessment given to 15-year-olds. Females outscore males on the most recent administration of all eight tests. And the gaps are statistically significant. Expressed in standard deviation units, they range from 0.13 on the NAEP-LTT at age nine to 0.34 on the PISA at age 15.
The gaps are shrinking. At age nine, the gap on the NAEP-LTT declined from 13 scale score points in 1971 to five points in 2012. During the same time period, the gap at age 13 shrank from 11 points to eight points, and at age 17, from 12 points to eight points. Only the decline at age nine is statistically significant, but at ages 13 and 17, the declines since the gaps peaked in the 1990s are also statistically significant. At all three ages, the gaps are shrinking because males are making larger gains on NAEP than females. In 2012, seventeen-year-old females scored the same on the NAEP reading test as they did in 1971. Otherwise, males and females of all ages registered gains on the NAEP reading test from 1971 to 2012, with males’ gains outpacing those of females.
The gap is worldwide. On the 2012 PISA, 15-year-old females outperformed males in all sixty-five participating countries. Surprisingly, Finland, a nation known for both equity and excellence because of its performance on PISA, evidenced the widest gap. Girls scored 556 and boys scored 494, producing an astonishing gap of 62 points (about 0.66 standard deviations—or more than one and a half years of schooling). Finland also had one of the world’s largest gender gaps on the 2000 PISA, and since then it has widened. Both girls’ and boys’ reading scores declined, but boys’ declined more (26 points vs. 16 points). To put the 2012 scores in perspective, consider that the OECD average on the reading test is 496. Finland’s strong showing on PISA is completely dependent on the superior performance of its young women.
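The conversions behind the parenthetical figures are simple arithmetic, though they rest on two rules of thumb that are my assumptions, not stated in the report: a PISA reading standard deviation of roughly 94 points, and roughly 40 PISA points per year of schooling.

```python
# Finland's 2012 PISA reading scores, as given in the text.
girls, boys = 556, 494
gap = girls - boys  # 62 points

# Both constants below are my assumptions, chosen to reproduce the
# rough magnitudes in the text; they are not from the report itself.
PISA_SD = 94          # assumed standard deviation of the PISA reading scale
POINTS_PER_YEAR = 40  # assumed PISA points per year of schooling

effect_size = gap / PISA_SD
years_of_schooling = gap / POINTS_PER_YEAR

print(round(effect_size, 2), round(years_of_schooling, 2))  # 0.66 1.55
```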
The gap seems to disappear by adulthood. Tests of adult reading ability show no U.S. gender gap in reading by 25 years of age. Scores even tilt toward men in later years.
The words “seems to disappear” are used on purpose. One must be careful with cross-sectional data not to assume that differences across age groups indicate an age-based trend. A recent Gallup poll, for example, asked several different age groups how optimistic they were about finding jobs as adults. Optimism fell from 68% in grade five to 48% in grade 12. The authors concluded that “optimism about future job pursuits declines over time.” The data do not support that conclusion. The data were collected at a single point in time and cannot speak to what optimism may have been before or after that point. Perhaps today’s 12^{th} graders were even more pessimistic several years ago when they were in fifth grade. Perhaps the 12^{th} graders are old enough to remember when unemployment spiked during the Great Recession and the fifth-graders are not. Perhaps 12^{th} graders are simply savvier about job prospects and the pitfalls of seeking employment, topics on which fifth-graders are basically clueless.
At least with the data cited above we can track measures of the same cohorts’ gender gap in reading over time. By analyzing multiple cross-sections—data collected at several different points in time—we can look at real change. Those cohorts of nine-year-olds in the 1970s, 1980s, and 1990s are—respectively—today in their 50s, 40s, and 30s. Girls were better readers than boys when these cohorts were children, but as grown-ups, women are not appreciably better readers than men.
Care must be taken nevertheless in drawing firm conclusions. There exist what are known as cohort effects that can bias measurements. I mentioned the Great Recession. Experiencing great historical cataclysms, especially war or economic chaos, may bias a particular cohort’s responses to survey questions or even its performance on tests. American generations who experienced the Great Depression, World War II, and the Vietnam War—and more recently, the digital revolution, the Great Recession, and the Iraq War—lived through events that uniquely shape their outlook on many aspects of life.
The gender gap is large, worldwide, and persistent through the K-12 years. What should be done about it? Maybe nothing. As just noted, the gap seems to dissipate by adulthood. Moreover, crafting an effective remedy for the gender gap is made more difficult because we don’t know its cause with any certainty. Enjoyment of reading is a good example. Many commentators argue that schools should make a concerted effort to get boys to enjoy reading more. Enjoyment of reading is statistically correlated with reading performance, and the hope is that making reading more enjoyable would get boys to read more, thereby raising reading skills.
It makes sense, but I’m skeptical. The fact that better readers enjoy reading more than poor readers—and that the relationship stands up even after boatloads of covariates are poured into a regression equation—is unpersuasive evidence of causality. As I stated earlier, PISA produces data collected at a single point in time. It isn’t designed to test causal theories. Reverse causality is a profound problem. Getting kids to enjoy reading more may in fact boost reading ability. But the causal relationship might be flowing in the opposite direction, with enhanced skill leading to enjoyment. The correlation could simply be indicating that people enjoy activities that they’re good at—a relationship that probably exists in sports, music, and many human endeavors, including reading.
A key question for policymakers is whether boosting boys’ enjoyment of reading would help make boys better readers. I investigate by analyzing national changes in PISA reading scores from 2000, when the test was first given, to 2012. PISA creates an Index of Reading Enjoyment based on several responses to a student questionnaire. Enjoyment of reading has increased among males in some countries and decreased in others. Is there any relationship between changes in boys’ enjoyment and changes in PISA reading scores?
There is not. The correlation coefficient for the two phenomena is 0.01. Nations such as Germany raised boys’ enjoyment of reading and increased their reading scores by about 10 points on the PISA scale. France, on the other hand, also raised boys’ enjoyment of reading, but French males’ reading scores declined by 15 points. Ireland increased how much boys enjoy reading by a little bit but the boys’ scores fell a whopping 37 points. Poland’s males actually enjoyed reading less in 2012 than in 2000, but their scores went up more than 14 points. No relationship.
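A correlation coefficient like the one reported here summarizes paired national changes. The sketch below shows the calculation with hypothetical stand-in values (not the actual PISA figures):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical changes in boys' enjoyment-of-reading index (x) and in
# boys' PISA reading scores (y); illustrative values only, not PISA data.
enjoyment_change = [0.2, 0.2, 0.1, -0.1]
score_change = [10, -15, -37, 14]

r = pearson_r(enjoyment_change, score_change)
print(round(r, 2))  # about -0.33 for these illustrative values
```

A coefficient near zero, as in the actual PISA comparison, would indicate that rising enjoyment tells us essentially nothing about score changes.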
How should policymakers proceed? Large, crosssectional assessments are good for measuring academic performance at one point in time. They are useful for generating hypotheses based on observed relationships, but they are not designed to confirm or reject causality. To do that, randomized control trials should be conducted of programs purporting to boost reading enjoyment. Also, consider that it ultimately may not matter whether enjoying reading leads to more proficient readers. Enjoyment of reading may be an end worthy of attainment irrespective of its relationship to achievement. In that case, RCTs should carefully evaluate the impact of interventions on both enjoyment of reading and reading achievement, whether the two are related or not.
[i] J.B. Stroud and E.F. Lindquist, “Sex differences in achievement in the elementary and secondary schools,” Journal of Educational Psychology, vol. 33(9) (Washington, D.C.: American Psychological Association, 1942), 657–667.
This week marks the release of the 2015 Brown Center Report on American Education, the fourteenth issue of the series. One of the three studies in the report, “Girls, Boys, and Reading,” examines the gender gap in reading. Girls consistently outscore boys on reading assessments. They have for a long time. A 1942 study in Iowa discovered that girls were superior to boys on tests of reading comprehension, vocabulary, and basic language skills.^{[i]} Girls have outscored boys on the National Assessment of Educational Progress (NAEP) reading assessments since the first NAEP was administered in 1971.
I hope you’ll read the full study—and the other studies in the report—but allow me to summarize the main findings of the gender gap study here.
Eight assessments generate valid estimates of U.S. national reading performance: the Main NAEP, given at three grades (fourth, eighth, and 12^{th} grades); the NAEP Long Term Trend (NAEPLTT), given at three ages (ages nine, 13, and 17); the Progress in International Reading Literacy Study (PIRLS), an international assessment given at fourth grade; and the Program for International Student Assessment (PISA), an international assessment given to 15yearolds. Females outscore males on the most recent administration of all eight tests. And the gaps are statistically significant. Expressed in standard deviation units, they range from 0.13 on the NAEPLTT at age nine to 0.34 on the PISA at age 15.
The gaps are shrinking. At age nine, the gap on the NAEPLTT declined from 13 scale score points in 1971 to five points in 2012. During the same time period, the gap at age 13 shrank from 11 points to eight points, and at age 17, from 12 points to eight points. Only the decline at age nine is statistically significant, but at ages 13 and 17, declines since the gaps peaked in the 1990s are also statistically significant. At all three ages, gaps are shrinking because of males making larger gains on NAEP than females. In 2012, seventeenyearold females scored the same on the NAEP reading test as they did in 1971. Otherwise, males and females of all ages registered gains on the NAEP reading test from 19712012, with males’ gains outpacing those of females.
The gap is worldwide. On the 2012 PISA, 15yearold females outperformed males in all sixtyfive participating countries. Surprisingly, Finland, a nation known for both equity and excellence because of its performance on PISA, evidenced the widest gap. Girls scored 556 and boys scored 494, producing an astonishing gap of 62 points (about 0.66 standard deviations—or more than one and a half years of schooling). Finland also had one of the world’s largest gender gaps on the 2000 PISA, and since then it has widened. Both girls’ and boys’ reading scores declined, but boys’ declined more (26 points vs. 16 points). To put the 2012 scores in perspective, consider that the OECD average on the reading test is 496. Finland’s strong showing on PISA is completely dependent on the superior performance of its young women.
The gap seems to disappear by adulthood. Tests of adult reading ability show no U.S. gender gap in reading by 25 years of age. Scores even tilt toward men in later years.
The words “seems to disappear” are used on purpose. One must be careful with crosssectional data not to assume that differences across age groups indicate an agebased trend. A recent Gallup poll, for example, asked several different age groups how optimistic they were about finding jobs as adults. Optimism fell from 68% in grade five to 48% in grade 12. The authors concluded that “optimism about future job pursuits declines over time.” The data do not support that conclusion. The data were collected at a single point in time and cannot speak to what optimism may have been before or after that point. Perhaps today’s 12^{th} graders were even more pessimistic several years ago when they were in fifth grade. Perhaps the 12^{th}graders are old enough to remember when unemployment spiked during the Great Recession and the fifthgraders are not. Perhaps 12^{th}graders are simply savvier about job prospects and the pitfalls of seeking employment, topics on which fifthgraders are basically clueless.
At least with the data cited above we can track measures of the same cohorts’ gender gap in reading over time. By analyzing multiple cross-sections—data collected at several different points in time—we can look at real change. Those cohorts of nine-year-olds in the 1970s, 1980s, and 1990s are—respectively—today in their 50s, 40s, and 30s. Girls were better readers than boys when these cohorts were children, but as adults, women are not appreciably better readers than men.
Care must be taken nevertheless in drawing firm conclusions. There exist what are known as cohort effects, which can bias measurements. I mentioned the Great Recession. Experiencing great historical cataclysms, especially war or economic chaos, may bias a particular cohort’s responses to survey questions or even its performance on tests. American generations who experienced the Great Depression, World War II, and the Vietnam War—and more recently, the digital revolution, the Great Recession, and the Iraq War—lived through events that uniquely shape their outlook on many aspects of life.
The gender gap is large, worldwide, and persistent through the K-12 years. What should be done about it? Maybe nothing. As just noted, the gap seems to dissipate by adulthood. Moreover, crafting an effective remedy for the gender gap is made more difficult because we don’t definitively know its cause. Enjoyment of reading is a good example. Many commentators argue that schools should make a concerted effort to get boys to enjoy reading more. Enjoyment of reading is statistically correlated with reading performance, and the hope is that making reading more enjoyable would get boys to read more, thereby raising reading skills.
It makes sense, but I’m skeptical. The fact that better readers enjoy reading more than poor readers—and that the relationship stands up even after boatloads of covariates are poured into a regression equation—is unpersuasive evidence of causality. As I stated earlier, PISA produces data collected at a single point in time. It isn’t designed to test causal theories. Reverse causality is a profound problem. Getting kids to enjoy reading more may in fact boost reading ability. But the causal relationship might be flowing in the opposite direction, with enhanced skill leading to enjoyment. The correlation could simply be indicating that people enjoy activities that they’re good at—a relationship that probably exists in sports, music, and many human endeavors, including reading.
A key question for policymakers is whether boosting boys’ enjoyment of reading would help make boys better readers. I investigate by analyzing national changes in PISA reading scores from 2000, when the test was first given, to 2012. PISA creates an Index of Reading Enjoyment based on several responses to a student questionnaire. Enjoyment of reading has increased among males in some countries and decreased in others. Is there any relationship between changes in boys’ enjoyment and changes in PISA reading scores?
There is not. The correlation coefficient for the two phenomena is 0.01. Nations such as Germany raised boys’ enjoyment of reading and increased their reading scores by about 10 points on the PISA scale. France, on the other hand, also raised boys’ enjoyment of reading, but French males’ reading scores declined by 15 points. Ireland increased how much boys enjoy reading by a little bit, but the boys’ scores fell a whopping 37 points. Poland’s males actually enjoyed reading less in 2012 than in 2000, but their scores went up more than 14 points. No relationship.
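The no-relationship finding comes from a simple computation: pair each country’s change in boys’ reading enjoyment with its change in boys’ PISA reading score, then take the Pearson correlation across countries. A minimal sketch follows; the enjoyment-change values are hypothetical placeholders (the text does not give the index changes), and only the four score changes for Germany, France, Ireland, and Poland come from the paragraph above.

```python
# Pearson correlation between change in boys' reading enjoyment and
# change in boys' PISA reading score, computed across countries.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# (enjoyment change [hypothetical], score change [from the text]):
# Germany, France, Ireland, Poland
changes = [(0.20, 10), (0.15, -15), (0.05, -37), (-0.10, 14)]
r = pearson([c[0] for c in changes], [c[1] for c in changes])
print(round(r, 2))
```

With the full set of PISA countries, the author reports this coefficient comes out at essentially zero (0.01).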
How should policymakers proceed? Large, cross-sectional assessments are good for measuring academic performance at one point in time. They are useful for generating hypotheses based on observed relationships, but they are not designed to confirm or reject causality. To do that, randomized controlled trials should be conducted of programs purporting to boost reading enjoyment. Also, consider that it ultimately may not matter whether enjoying reading leads to more proficient readers. Enjoyment of reading may be an end worthy of attainment irrespective of its relationship to achievement. In that case, RCTs should carefully evaluate the impact of interventions on both enjoyment of reading and reading achievement, whether the two are related or not.
[i] J.B. Stroud and E.F. Lindquist, “Sex differences in achievement in the elementary and secondary schools,” Journal of Educational Psychology, vol. 33(9) (Washington, D.C.: American Psychological Association, 1942), 657–667.
Part II of the 2015 Brown Center Report on American Education
Over the next several years, policy analysts will evaluate the impact of the Common Core State Standards (CCSS) on U.S. education. The task promises to be challenging. The question most analysts will focus on is whether the CCSS is good or bad policy. This section of the Brown Center Report (BCR) tackles a set of seemingly innocuous questions compared to the hot-button question of whether Common Core is wise or foolish. The questions all have to do with when Common Core actually started, or more precisely, when the Common Core started having an effect on student learning. And if it hasn’t yet had an effect, how will we know that CCSS has started to influence student achievement?
The analysis below probes this issue empirically, hopefully persuading readers that deciding when a policy begins is elemental to evaluating its effects. The question of a policy’s starting point is not always easy to answer. Yet the answer has consequences. You can’t figure out whether a policy worked or not unless you know when it began.^{[i]}
The analysis uses surveys of state implementation to model different CCSS starting points for states and produces a second early report card on how CCSS is doing. The first report card, focusing on math, was presented in last year’s BCR. The current study updates state implementation ratings that were presented in that report and extends the analysis to achievement in reading. The goal is not only to estimate CCSS’s early impact, but also to lay out a fair approach for establishing when the Common Core’s impact began—and to do it now before data are generated that either critics or supporters can use to bolster their arguments. The experience of No Child Left Behind (NCLB) illustrates this necessity.
After the 2008 National Assessment of Educational Progress (NAEP) scores were released, former Secretary of Education Margaret Spellings claimed that the new scores showed “we are on the right track.”^{[ii]} She pointed out that NAEP gains in the previous decade, 1999–2009, were much larger than in prior decades. Mark Schneider of the American Institutes for Research (and a former Commissioner of the National Center for Education Statistics [NCES]) reached a different conclusion. He compared NAEP gains from 1996–2003 to 2003–2009 and declared NCLB’s impact disappointing. “The pre-NCLB gains were greater than the post-NCLB gains.”^{[iii]} It is important to highlight that Schneider used the 2003 NAEP scores as the starting point for assessing NCLB. A report from FairTest on the tenth anniversary of NCLB used the same demarcation for pre- and post-NCLB time frames.^{[iv]} FairTest is an advocacy group critical of high-stakes testing—and harshly critical of NCLB—but if the 2003 starting point for NAEP is accepted, its conclusion is indisputable: “NAEP score improvement slowed or stopped in both reading and math after NCLB was implemented.”
Choosing 2003 as NCLB’s starting date is intuitively appealing. The law was introduced, debated, and passed by Congress in 2001. President Bush signed NCLB into law on January 8, 2002. It takes time to implement any law. The 2003 NAEP is arguably the first chance that the assessment had to register NCLB’s effects.
Selecting 2003 is consequential, however. Some of the largest gains in NAEP’s history were registered between 2000 and 2003. Once 2003 is established as a starting point (or baseline), pre-2003 gains become “pre-NCLB.” But what if the 2003 NAEP scores were influenced by NCLB? Experiments evaluating the effects of new drugs collect baseline data from subjects before treatment, not after the treatment has begun. Similarly, evaluating the effects of public policies requires that baseline data not be influenced by the policies under evaluation.
Avoiding such problems is particularly difficult when state or local policies are adopted nationally. The federal effort to establish a speed limit of 55 miles per hour in the 1970s is a good example. Several states already had speed limits of 55 mph or lower prior to the federal law’s enactment. Moreover, a few states lowered speed limits in anticipation of the federal limit while the bill was debated in Congress. On the day President Nixon signed the bill into law—January 2, 1974—the Associated Press reported that only 29 states would be required to lower speed limits. Evaluating the effects of the 1974 law with national data but neglecting to adjust for what states were already doing would obviously yield tainted baseline data.
There are comparable reasons for questioning 2003 as a good baseline for evaluating NCLB’s effects. The key components of NCLB’s accountability provisions—testing students, publicizing the results, and holding schools accountable for results—were already in place in nearly half the states. In some states they had been in place for several years. The 1999 iteration of Quality Counts, Education Week’s annual report on state-level efforts to improve public education, entitled Rewarding Results, Punishing Failure, was devoted to state accountability systems and the assessments underpinning them. Testing and accountability are especially important because they have drawn fire from critics of NCLB, a law that wasn’t passed until years later.
The Congressional debate of NCLB legislation took all of 2001, allowing states to pass anticipatory policies. Derek Neal and Diane Whitmore Schanzenbach reported that “with the passage of NCLB lurking on the horizon,” Illinois placed hundreds of schools on a watch list and declared that future state testing would be high stakes.^{[v]} In the summer and fall of 2002, with NCLB now the law of the land, state after state released lists of schools falling short of NCLB’s requirements. Then the 2002–2003 school year began, during which the 2003 NAEP was administered. Using 2003 as a NAEP baseline assumes that none of these activities—previous accountability systems, public lists of schools in need of improvement, anticipatory policy shifts—influenced achievement. That is unlikely.^{[vi]}
Unlike NCLB, there was no “pre-CCSS” state version of Common Core. States vary in how quickly and aggressively they have implemented CCSS. For the BCR analyses, two indexes were constructed to model CCSS implementation. They are based on surveys of state education agencies and named for the two years that the surveys were conducted. The 2011 survey reported the number of programs (e.g., professional development, new materials) on which states reported spending federal funds to implement CCSS. Strong implementers spent money on more activities. The 2011 index was used to investigate eighth grade math achievement in the 2014 BCR. A new implementation index was created for this year’s study of reading achievement. The 2013 index is based on a survey asking states when they planned to complete full implementation of CCSS in classrooms. Strong states aimed for full implementation by 2012–2013 or earlier.
Fourth grade NAEP reading scores serve as the achievement measure. Why fourth grade and not eighth? Reading instruction is a key activity of elementary classrooms but by eighth grade has all but disappeared. What remains of “reading” as an independent subject, which has typically morphed into the study of literature, is subsumed under the English-Language Arts curriculum, a catch-all term that also includes writing, vocabulary, listening, and public speaking. Most students in fourth grade are in self-contained classes; they receive instruction in all subjects from one teacher. The impact of CCSS on reading instruction—the recommendation that nonfiction take a larger role in reading materials is a good example—will be concentrated in the activities of a single teacher in elementary schools. The burden for meeting CCSS’s press for nonfiction, on the other hand, is expected to be shared by all middle and high school teachers.^{[vii]}
Table 2-1 displays NAEP gains using the 2011 implementation index. The four-year period between 2009 and 2013 is broken down into two parts: 2009–2011 and 2011–2013. Nineteen states are categorized as “strong” implementers of CCSS on the 2011 index, and from 2009 to 2013, they outscored the four states that did not adopt CCSS by a little more than one scale score point (0.87 vs. 0.24 for a 1.11 difference). The nonadopters are the logical control group for CCSS, but with only four states in that category—Alaska, Nebraska, Texas, and Virginia—it is sensitive to big changes in one or two states. Alaska and Texas both experienced a decline in fourth grade reading scores from 2009 to 2013.
The 1.11 point advantage in reading gains for strong CCSS implementers is similar to the 1.27 point advantage reported last year for eighth grade math. Both are small. The reading difference in favor of CCSS is equal to approximately 0.03 standard deviations of the 2009 baseline reading score. Also note that the differences were greater in 2009–2011 than in 2011–2013 and that the “medium” implementers performed as well as or better than the strong implementers over the entire four-year period (gain of 0.99).
Table 2-2 displays calculations using the 2013 implementation index. Twelve states are rated as strong CCSS implementers, seven fewer than on the 2011 index.^{[viii]} Data for the nonadopters are the same as in the previous table. From 2009 to 2013, the strong implementers gained 1.27 NAEP points compared to 0.24 among the nonadopters, a difference of 1.51 points. The thirty-four states rated as medium implementers gained 0.82. The strong implementers on this index are states that reported full implementation of CCSS-ELA by 2013. Their larger gain in 2011–2013 (1.08 points) distinguishes them from the strong implementers in the previous table. The overall advantage of 1.51 points over nonadopters represents about 0.04 standard deviations of the 2009 NAEP reading score, not a difference with real-world significance. Taken together, the 2011 and 2013 indexes estimate that NAEP reading gains from 2009 to 2013 were one to one and one-half scale score points larger in the strong CCSS implementation states compared to the states that did not adopt CCSS.
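The effect-size conversions used in these comparisons are simple divisions: the gain difference in NAEP scale points divided by the standard deviation of the 2009 baseline score. A sketch is below; the roughly 37-point SD is an assumption inferred from the reported figures (it is consistent with 1.11 points mapping to about 0.03 SD and 1.51 points to about 0.04 SD) and is not quoted in the text.

```python
# Convert NAEP scale-point gain differences into standard-deviation
# units. BASELINE_SD (~37 points) is an assumed value for the 2009
# fourth grade reading baseline, inferred from the reported ratios.
BASELINE_SD = 37.0

for diff in (1.11, 1.51):
    print(diff, "points ->", round(diff / BASELINE_SD, 2), "SD")
```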
As noted above, the 2013 implementation index is based on when states scheduled full implementation of CCSS in classrooms. Other than reading achievement, does the index seem to reflect changes in any other classroom variable believed to be related to CCSS implementation? If the answer is “yes,” that would bolster confidence that the index is measuring changes related to CCSS implementation.
Let’s examine the types of literature that students encounter during instruction. Perhaps the most controversial recommendation in the CCSS-ELA standards is the call for teachers to shift the content of reading materials away from stories and other fictional forms of literature in favor of more nonfiction. NAEP asks fourth grade teachers the extent to which they teach fiction and nonfiction over the course of the school year (see Figure 2-1).
Historically, fiction dominates fourth grade reading instruction. It still does. The percentage of teachers reporting that they teach fiction to a “large extent” exceeded the percentage answering “large extent” for nonfiction by 23 points in 2009 and 25 points in 2011. In 2013, the difference narrowed to only 15 percentage points, primarily because of nonfiction’s increased use. Fiction still dominated in 2013, but not by as much as in 2009.
The differences reported in Figure 2-1 are national indicators of fiction’s declining prominence in fourth grade reading instruction. What about the states? We know that they were involved to varying degrees with the implementation of Common Core from 2009 to 2013. Is there evidence that fiction’s prominence was more likely to weaken in states most aggressively pursuing CCSS implementation?
Table 2-3 displays the data tackling that question. Fourth grade teachers in strong implementation states decisively favored the use of fiction over nonfiction in 2009 and 2011. But the prominence of fiction in those states experienced a large decline in 2013 (12.4 percentage points). The decline for the entire four-year period, 2009–2013, was larger in the strong implementation states (10.8) than in the medium implementation (7.5) or nonadoption states (9.8).
This section of the Brown Center Report analyzed NAEP data and two indexes of CCSS implementation, one based on data collected in 2011, the second from data collected in 2013. NAEP scores for 2009–2013 were examined. Fourth grade reading scores improved by 1.11 scale score points in states with strong implementation of CCSS compared to states that did not adopt CCSS. A similar comparison in last year’s BCR found a 1.27 point difference on NAEP’s eighth grade math test, also in favor of states with strong implementation of CCSS. These differences, although certainly encouraging to CCSS supporters, are quite small, amounting to (at most) 0.04 standard deviations (SD) on the NAEP scale. A threshold of 0.20 SD—five times larger—is often invoked as the minimum size for a test score change to be regarded as noticeable. The current study’s findings are also merely statistical associations and cannot be used to make causal claims. Perhaps other factors are driving test score changes, unmeasured by NAEP or the other sources of data analyzed here.
The analysis also found that fourth grade teachers in strong implementation states are more likely to be shifting reading instruction from fiction to nonfiction texts. That trend should be monitored closely to see if it continues. Other events to keep an eye on as the Common Core unfolds include the following:
1. The 2015 NAEP scores, typically released in the late fall, will be important for the Common Core. In most states, the first CCSS-aligned state tests will be given in the spring of 2015. Based on the earlier experiences of Kentucky and New York, results are expected to be disappointing. Common Core supporters can respond by explaining that assessments given for the first time often produce disappointing results. They will also claim that the tests are more rigorous than previous state assessments. But it will be difficult to explain stagnant or falling NAEP scores in an era when implementing CCSS commands so much attention.
2. Assessment will become an important implementation variable in 2015 and subsequent years. For analysts, the strategy employed here, modeling different indicators based on information collected at different stages of implementation, should become even more useful. Some states are planning to use Smarter Balanced Assessments, others are using the Partnership for Assessment of Readiness for College and Careers (PARCC), and still others are using their own homegrown tests. To capture variation among the states on this important dimension of implementation, analysts will need to use indicators that are up-to-date.
3. The politics of Common Core injects a dynamic element into implementation. The status of implementation is constantly changing. States may choose to suspend, to delay, or to abandon CCSS. That will require analysts to regularly reconfigure which states are considered “in” Common Core and which states are “out.” To further complicate matters, states may be “in” some years and “out” in others.
A final word. When the 2014 BCR was released, many CCSS supporters commented that it is too early to tell the effects of Common Core. The point that states may need more time operating under CCSS to realize its full effects certainly has merit. But that does not discount everything states have done so far—including professional development, purchasing new textbooks and other instructional materials, designing new assessments, buying and installing computer systems, and conducting hearings and public outreach—as part of implementing the standards. Some states are in their fifth year of implementation. It could be that states need more time, but innovations can also produce their biggest “pop” earlier in implementation rather than later. Kentucky was one of the earliest states to adopt and implement CCSS. That state’s NAEP fourth grade reading score declined in both 2009–2011 and 2011–2013. The optimism of CCSS supporters is understandable, but a one and a half point NAEP gain might be as good as it gets for CCSS.
[i] These ideas were first introduced in a 2013 Brown Center Chalkboard post I authored, entitled, “When Does a Policy Start?”
[ii] Maria Glod, “Since NCLB, Math and Reading Scores Rise for Ages 9 and 13,” Washington Post, April 29, 2009.
[iii] Mark Schneider, “NAEP Math Results Hold Bad News for NCLB,” AEIdeas (Washington, D.C.: American Enterprise Institute, 2009).
[iv] Lisa Guisbond with Monty Neill and Bob Schaeffer, NCLB’s Lost Decade for Educational Progress: What Can We Learn from this Policy Failure? (Jamaica Plain, MA: FairTest, 2012).
[v] Derek Neal and Diane Schanzenbach, “Left Behind by Design: Proficiency Counts and TestBased Accountability,” NBER Working Paper No. W13293 (Cambridge: National Bureau of Economic Research, 2007), 13.
[vi] Careful analysts of NCLB have allowed different states to have different starting dates: see Thomas Dee and Brian A. Jacob, “Evaluating NCLB,” Education Next 10, no. 3 (Summer 2010); Manyee Wong, Thomas D. Cook, and Peter M. Steiner, “No Child Left Behind: An Interim Evaluation of Its Effects on Learning Using Two Interrupted Time Series Each with Its Own NonEquivalent Comparison Series,” Working Paper 0911 (Evanston, IL: Northwestern University Institute for Policy Research, 2009).
[vii] Common Core State Standards Initiative. “English Language Arts Standards, Key Design Consideration.” Retrieved from: http://www.corestandards.org/ELA-Literacy/introduction/key-design-consideration/
[viii] Twelve states shifted downward from strong to medium and five states shifted upward from medium to strong, netting out to a seven-state swing.
Part II of the 2015 Brown Center Report on American Education
Over the next several years, policy analysts will evaluate the impact of the Common Core State Standards (CCSS) on U.S. education. The task promises to be challenging. The question most analysts will focus on is whether the CCSS is good or bad policy. This section of the Brown Center Report (BCR) tackles a set of seemingly innocuous questions compared to the hotbutton question of whether Common Core is wise or foolish. The questions all have to do with when Common Core actually started, or more precisely, when the Common Core started having an effect on student learning. And if it hasn’t yet had an effect, how will we know that CCSS has started to influence student achievement?
The analysis below probes this issue empirically, hopefully persuading readers that deciding when a policy begins is elemental to evaluating its effects. The question of a policy’s starting point is not always easy to answer. Yet the answer has consequences. You can’t figure out whether a policy worked or not unless you know when it began.^{[i]}
The analysis uses surveys of state implementation to model different CCSS starting points for states and produces a second early report card on how CCSS is doing. The first report card, focusing on math, was presented in last year’s BCR. The current study updates state implementation ratings that were presented in that report and extends the analysis to achievement in reading. The goal is not only to estimate CCSS’s early impact, but also to lay out a fair approach for establishing when the Common Core’s impact began—and to do it now before data are generated that either critics or supporters can use to bolster their arguments. The experience of No Child Left Behind (NCLB) illustrates this necessity.
After the 2008 National Assessment of Educational Progress (NAEP) scores were released, former Secretary of Education Margaret Spellings claimed that the new scores showed “we are on the right track.”^{[ii]} She pointed out that NAEP gains in the previous decade, 19992009, were much larger than in prior decades. Mark Schneider of the American Institutes of Research (and a former Commissioner of the National Center for Education Statistics [NCES]) reached a different conclusion. He compared NAEP gains from 19962003 to 20032009 and declared NCLB’s impact disappointing. “The preNCLB gains were greater than the postNCLB gains.”^{[iii]} It is important to highlight that Schneider used the 2003 NAEP scores as the starting point for assessing NCLB. A report from FairTest on the tenth anniversary of NCLB used the same demarcation for pre and postNCLB time frames.^{[iv]} FairTest is an advocacy group critical of high stakes testing—and harshly critical of NCLB—but if the 2003 starting point for NAEP is accepted, its conclusion is indisputable, “NAEP score improvement slowed or stopped in both reading and math after NCLB was implemented.”
Choosing 2003 as NCLB’s starting date is intuitively appealing. The law was introduced, debated, and passed by Congress in 2001. President Bush signed NCLB into law on January 8, 2002. It takes time to implement any law. The 2003 NAEP is arguably the first chance that the assessment had to register NCLB’s effects.
Selecting 2003 is consequential, however. Some of the largest gains in NAEP’s history were registered between 2000 and 2003. Once 2003 is established as a starting point (or baseline), pre2003 gains become “preNCLB.” But what if the 2003 NAEP scores were influenced by NCLB? Experiments evaluating the effects of new drugs collect baseline data from subjects before treatment, not after the treatment has begun. Similarly, evaluating the effects of public policies require that baseline data are not influenced by the policies under evaluation.
Avoiding such problems is particularly difficult when state or local policies are adopted nationally. The federal effort to establish a speed limit of 55 miles per hour in the 1970s is a good example. Several states already had speed limits of 55 mph or lower prior to the federal law’s enactment. Moreover, a few states lowered speed limits in anticipation of the federal limit while the bill was debated in Congress. On the day President Nixon signed the bill into law—January 2, 1974—the Associated Press reported that only 29 states would be required to lower speed limits. Evaluating the effects of the 1974 law with national data but neglecting to adjust for what states were already doing would obviously yield tainted baseline data.
There are comparable reasons for questioning 2003 as a good baseline for evaluating NCLB’s effects. The key components of NCLB’s accountability provisions—testing students, publicizing the results, and holding schools accountable for results—were already in place in nearly half the states. In some states they had been in place for several years. The 1999 iteration of Quality Counts, Education Week’s annual report on statelevel efforts to improve public education, entitled Rewarding Results, Punishing Failure, was devoted to state accountability systems and the assessments underpinning them. Testing and accountability are especially important because they have drawn fire from critics of NCLB, a law that wasn’t passed until years later.
The Congressional debate of NCLB legislation took all of 2001, allowing states to pass anticipatory policies. Derek Neal and Diane Whitmore Schanzenbach reported that “with the passage of NCLB lurking on the horizon,” Illinois placed hundreds of schools on a watch list and declared that future state testing would be high stakes.^{[v]} In the summer and fall of 2002, with NCLB now the law of the land, state after state released lists of schools falling short of NCLB’s requirements. Then the 20022003 school year began, during which the 2003 NAEP was administered. Using 2003 as a NAEP baseline assumes that none of these activities—previous accountability systems, public lists of schools in need of improvement, anticipatory policy shifts—influenced achievement. That is unlikely.^{[vi]}
Unlike NCLB, there was no “preCCSS” state version of Common Core. States vary in how quickly and aggressively they have implemented CCSS. For the BCR analyses, two indexes were constructed to model CCSS implementation. They are based on surveys of state education agencies and named for the two years that the surveys were conducted. The 2011 survey reported the number of programs (e.g., professional development, new materials) on which states reported spending federal funds to implement CCSS. Strong implementers spent money on more activities. The 2011 index was used to investigate eighth grade math achievement in the 2014 BCR. A new implementation index was created for this year’s study of reading achievement. The 2013 index is based on a survey asking states when they planned to complete full implementation of CCSS in classrooms. Strong states aimed for full implementation by 20122013 or earlier.
Fourth grade NAEP reading scores serve as the achievement measure. Why fourth grade and not eighth? Reading instruction is a key activity of elementary classrooms but by eighth grade has all but disappeared. What remains of “reading” as an independent subject, which has typically morphed into the study of literature, is subsumed under the EnglishLanguage Arts curriculum, a catchall term that also includes writing, vocabulary, listening, and public speaking. Most students in fourth grade are in selfcontained classes; they receive instruction in all subjects from one teacher. The impact of CCSS on reading instruction—the recommendation that nonfiction take a larger role in reading materials is a good example—will be concentrated in the activities of a single teacher in elementary schools. The burden for meeting CCSS’s press for nonfiction, on the other hand, is expected to be shared by all middle and high school teachers.^{[vii] }
Table 21 displays NAEP gains using the 2011 implementation index. The four year period between 2009 and 2013 is broken down into two parts: 20092011 and 20112013. Nineteen states are categorized as “strong” implementers of CCSS on the 2011 index, and from 20092013, they outscored the four states that did not adopt CCSS by a little more than one scale score point (0.87 vs. 0.24 for a 1.11 difference). The nonadopters are the logical control group for CCSS, but with only four states in that category—Alaska, Nebraska, Texas, and Virginia—it is sensitive to big changes in one or two states. Alaska and Texas both experienced a decline in fourth grade reading scores from 20092013.
The 1.11 point advantage in reading gains for strong CCSS implementers is similar to the 1.27 point advantage reported last year for eighth grade math. Both are small. The reading difference in favor of CCSS is equal to approximately 0.03 standard deviations of the 2009 baseline reading score. Also note that the differences were greater in 20092011 than in 20112013 and that the “medium” implementers performed as well as or better than the strong implementers over the entire four year period (gain of 0.99).
Table 22 displays calculations using the 2013 implementation index. Twelve states are rated as strong CCSS implementers, seven fewer than on the 2011 index.^{[viii]} Data for the nonadopters are the same as in the previous table. In 20092013, the strong implementers gained 1.27 NAEP points compared to 0.24 among the nonadopters, a difference of 1.51 points. The thirtyfour states rated as medium implementers gained 0.82. The strong implementers on this index are states that reported full implementation of CCSSELA by 2013. Their larger gain in 20112013 (1.08 points) distinguishes them from the strong implementers in the previous table. The overall advantage of 1.51 points over nonadopters represents about 0.04 standard deviations of the 2009 NAEP reading score, not a difference with real world significance. Taken together, the 2011 and 2013 indexes estimate that NAEP reading gains from 20092013 were one to one and onehalf scale score points larger in the strong CCSS implementation states compared to the states that did not adopt CCSS.
As noted above, the 2013 implementation index is based on when states scheduled full implementation of CCSS in classrooms. Other than reading achievement, does the index seem to reflect changes in any other classroom variable believed to be related to CCSS implementation? If the answer is “yes,” that would bolster confidence that the index is measuring changes related to CCSS implementation.
Let’s examine the types of literature that students encounter during instruction. Perhaps the most controversial recommendation in the CCSS-ELA standards is the call for teachers to shift the content of reading materials away from stories and other fictional forms of literature in favor of more nonfiction. NAEP asks fourth grade teachers the extent to which they teach fiction and nonfiction over the course of the school year (see Figure 2-1).
Historically, fiction dominates fourth grade reading instruction. It still does. The percentage of teachers reporting that they teach fiction to a “large extent” exceeded the percentage answering “large extent” for nonfiction by 23 points in 2009 and 25 points in 2011. In 2013, the difference narrowed to only 15 percentage points, primarily because of nonfiction’s increased use. Fiction still dominated in 2013, but not by as much as in 2009.
The differences reported in Figure 2-1 are national indicators of fiction’s declining prominence in fourth grade reading instruction. What about the states? We know that they were involved to varying degrees with the implementation of Common Core from 2009–2013. Is there evidence that fiction’s prominence was more likely to weaken in the states most aggressively pursuing CCSS implementation?
Table 2-3 displays the data tackling that question. Fourth grade teachers in strong implementation states decisively favored the use of fiction over nonfiction in 2009 and 2011. But the prominence of fiction in those states experienced a large decline in 2013 (12.4 percentage points). The decline for the entire four-year period, 2009–2013, was larger in the strong implementation states (10.8) than in the medium implementation (7.5) or non-adoption states (9.8).
This section of the Brown Center Report analyzed NAEP data and two indexes of CCSS implementation, one based on data collected in 2011, the second on data collected in 2013. NAEP scores for 2009–2013 were examined. Fourth grade reading scores improved by 1.11 scale score points more in states with strong implementation of CCSS than in states that did not adopt CCSS. A similar comparison in last year’s BCR found a 1.27 point difference on NAEP’s eighth grade math test, also in favor of states with strong implementation of CCSS. These differences, although certainly encouraging to CCSS supporters, are quite small, amounting to (at most) 0.04 standard deviations (SD) on the NAEP scale. A threshold of 0.20 SD—five times larger—is often invoked as the minimum size for a test score change to be regarded as noticeable. The current study’s findings are also merely statistical associations and cannot be used to make causal claims. Perhaps other factors, unmeasured by NAEP or the other data sources analyzed here, are driving the test score changes.
The analysis also found that fourth grade teachers in strong implementation states are more likely to be shifting reading instruction from fiction to nonfiction texts. That trend should be monitored closely to see if it continues. Other events to keep an eye on as the Common Core unfolds include the following:
1. The 2015 NAEP scores, typically released in the late fall, will be important for the Common Core. In most states, the first CCSS-aligned state tests will be given in the spring of 2015. Based on the earlier experiences of Kentucky and New York, results are expected to be disappointing. Common Core supporters can respond by explaining that assessments given for the first time often produce disappointing results. They will also claim that the tests are more rigorous than previous state assessments. But it will be difficult to explain stagnant or falling NAEP scores in an era when implementing CCSS commands so much attention.
2. Assessment will become an important implementation variable in 2015 and subsequent years. For analysts, the strategy employed here, modeling different indicators based on information collected at different stages of implementation, should become even more useful. Some states are planning to use Smarter Balanced Assessments, others are using the Partnership for Assessment of Readiness for College and Careers (PARCC), and still others are using their own homegrown tests. To capture variation among the states on this important dimension of implementation, analysts will need to use indicators that are up to date.
3. The politics of Common Core injects a dynamic element into implementation. The status of implementation is constantly changing. States may choose to suspend, to delay, or to abandon CCSS. That will require analysts to regularly reconfigure which states are considered “in” Common Core and which states are “out.” To further complicate matters, states may be “in” some years and “out” in others.
A final word. When the 2014 BCR was released, many CCSS supporters commented that it is too early to tell the effects of Common Core. The point that states may need more time operating under CCSS to realize its full effects certainly has merit. But that does not discount everything states have done so far—including professional development, purchasing new textbooks and other instructional materials, designing new assessments, buying and installing computer systems, and conducting hearings and public outreach—as part of implementing the standards. Some states are in their fifth year of implementation. It could be that states need more time, but innovations can also produce their biggest “pop” earlier in implementation rather than later. Kentucky was one of the earliest states to adopt and implement CCSS. That state’s NAEP fourth grade reading score declined in both 2009–2011 and 2011–2013. The optimism of CCSS supporters is understandable, but a one and a half point NAEP gain might be as good as it gets for CCSS.
[i] These ideas were first introduced in a 2013 Brown Center Chalkboard post I authored, entitled, “When Does a Policy Start?”
[ii] Maria Glod, “Since NCLB, Math and Reading Scores Rise for Ages 9 and 13,” Washington Post, April 29, 2009.
[iii] Mark Schneider, “NAEP Math Results Hold Bad News for NCLB,” AEIdeas (Washington, D.C.: American Enterprise Institute, 2009).
[iv] Lisa Guisbond with Monty Neill and Bob Schaeffer, NCLB’s Lost Decade for Educational Progress: What Can We Learn from this Policy Failure? (Jamaica Plain, MA: FairTest, 2012).
[v] Derek Neal and Diane Schanzenbach, “Left Behind by Design: Proficiency Counts and Test-Based Accountability,” NBER Working Paper No. W13293 (Cambridge: National Bureau of Economic Research, 2007), 13.
[vi] Careful analysts of NCLB have allowed different states to have different starting dates: see Thomas Dee and Brian A. Jacob, “Evaluating NCLB,” Education Next 10, no. 3 (Summer 2010); Manyee Wong, Thomas D. Cook, and Peter M. Steiner, “No Child Left Behind: An Interim Evaluation of Its Effects on Learning Using Two Interrupted Time Series Each with Its Own Non-Equivalent Comparison Series,” Working Paper 09-11 (Evanston, IL: Northwestern University Institute for Policy Research, 2009).
[vii] Common Core State Standards Initiative. “English Language Arts Standards, Key Design Consideration.” Retrieved from: http://www.corestandards.org/ELA-Literacy/introduction/key-design-consideration/
[viii] Twelve states shifted downward from strong to medium and five states shifted upward from medium to strong, netting out to a seven-state swing.
Part III of the 2015 Brown Center Report on American Education
Student engagement refers to the intensity with which students apply themselves to learning in school. Traits such as motivation, enjoyment, and curiosity—characteristics that have interested researchers for a long time—have been joined recently by new terms such as “grit,” which now approaches cliché status. International assessments collect data from students on characteristics related to engagement. This study looks at data from the Program for International Student Assessment (PISA), an international test given to fifteen-year-olds. In the U.S., most PISA students are in the fall of their sophomore year. The high school years are a time when many observers worry that students lose interest in school.
Compared to their peers around the world, how do U.S. students appear on measures of engagement? Are national indicators of engagement related to achievement? This analysis concludes that American students are about average in terms of engagement. Data reveal that several countries noted for their superior ranking on PISA—e.g., Korea, Japan, Finland, Poland, and the Netherlands—score below the U.S. on measures of student engagement. Thus, the relationship of achievement to student engagement is not clear-cut, with some evidence pointing toward a weak positive relationship and other evidence indicating a modest negative relationship.
Education studies differ in units of analysis. Some studies report data on individuals, with each student serving as an observation. Studies of new reading or math programs, for example, usually report an average gain score or effect size representing the impact of the program on the average student. Other studies report aggregated data, in which test scores or other measurements are averaged to yield a group score. Test scores of schools, districts, states, or countries are constructed like that. These scores represent the performance of groups, with each group serving as a single observation, but they are really just data from individuals that have been aggregated to the group level.
Aggregated units are particularly useful for policy analysts. Analysts are interested in how Fairfax County or the state of Virginia or the United States is doing. Governmental bodies govern those jurisdictions and policymakers craft policy for all of the citizens within the political jurisdiction—not for an individual.
The analytical unit is especially important when investigating topics like student engagement and its relationship with achievement. Those relationships are inherently individual, focusing on the interaction of psychological characteristics. They are also prone to reverse causality, meaning that the direction of cause and effect cannot readily be determined. Consider self-esteem and academic achievement. Determining which one is cause and which is effect has been debated for decades. Students who are good readers enjoy books, feel pretty good about their reading abilities, and spend more time reading than other kids. Does the enjoyment and confidence produce the skilled reading, or does skilled reading produce the enjoyment and confidence? The possibility of reverse causality is one reason that beginning statistics students learn an important rule: correlation is not causation.
Starting with the first international assessments in the 1960s, a curious pattern has emerged. Data on students’ attitudes toward studying school subjects, when examined on a national level, often exhibit the opposite of the relationship with achievement that one would expect. The 2006 Brown Center Report (BCR) investigated the phenomenon in a study of “the happiness factor” in learning.^{[i]} Test scores of fourth graders in 25 countries and eighth graders in 46 countries were analyzed. Students in countries with low math scores were more likely to report that they enjoyed math than students in high-scoring countries. Correlation coefficients for the association of enjoyment and achievement were −0.67 at fourth grade and −0.75 at eighth grade.
Confidence in math performance was also inversely related to achievement. Correlation coefficients for national achievement and the percentage of students responding affirmatively to the statement, “I usually do well in mathematics,” were −0.58 among fourth graders and −0.64 among eighth graders. Nations with the most confident math students tend to perform poorly on math tests; nations with the least confident students do quite well.
That is odd. What’s going on? A comparison of Singapore and the U.S. helps unravel the puzzle. The data in Figure 3-1 are for eighth graders on the 2003 Trends in International Mathematics and Science Study (TIMSS). U.S. students were very confident—84% either agreed a lot or a little (39% + 45%) with the statement that they usually do well in mathematics. In Singapore, the figure was 64% (46% + 18%). With a score of 605, however, Singaporean students registered about one full standard deviation (80 points) higher on the TIMSS math test compared to the U.S. score of 504.
When within-country data are examined, the relationship exists in the expected direction. In Singapore, highly confident students score 642, approximately 100 points above the least-confident students (551). In the U.S., the gap between the most- and least-confident students was also about 100 points—but at a much lower level on the TIMSS scale, at 541 and 448. Note that the least-confident Singaporean eighth grader still outscores the most-confident American, 551 to 541.
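The Singapore–U.S. reversal can be made concrete with a small sketch. The four group means below are the TIMSS figures quoted above; the point is that confidence and achievement move together within each country while the between-country comparison runs the other way.

```python
# Within- vs. between-country patterns in the 2003 TIMSS figures quoted
# in the text. Each tuple: (% confident, national mean,
#                           mean of most-confident, mean of least-confident).

timss = {
    "Singapore":     (64, 605, 642, 551),
    "United States": (84, 504, 541, 448),
}

# Within each country, confidence is positively related to achievement:
for country, (_, _, hi, lo) in timss.items():
    print(f"{country}: most-confident outscore least-confident by {hi - lo}")

# Between countries, the sign flips: the more confident nation scores lower.
more_confident = max(timss, key=lambda c: timss[c][0])   # United States
higher_scoring = max(timss, key=lambda c: timss[c][1])   # Singapore
print(more_confident != higher_scoring)  # True
```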
The lesson is that the unit of analysis must be considered when examining data on students’ psychological characteristics and their relationship to achievement. If presented with country-level associations, one should wonder what the within-country associations are. And vice versa. Let’s keep that caution in mind as we now turn to data on fifteen-year-olds’ intrinsic motivation and how nations scored on the 2012 PISA.
PISA’s index of intrinsic motivation to learn mathematics comprises responses to four items on the student questionnaire: 1) I enjoy reading about mathematics; 2) I look forward to my mathematics lessons; 3) I do mathematics because I enjoy it; and 4) I am interested in the things I learn in mathematics. Figure 3-2 shows the percentage of students in OECD countries—thirty of the most economically developed nations in the world—responding that they agree or strongly agree with the statements. A little less than one-third (30.6%) of students responded favorably to reading about math, 35.5% responded favorably to looking forward to math lessons, 38.2% reported doing math because they enjoy it, and 52.9% said they were interested in the things they learn in math. A ballpark estimate, then, is that one-third to one-half of students respond affirmatively to the individual components of PISA’s intrinsic motivation index.
Table 3-1 presents national scores on the 2012 index of intrinsic motivation to learn mathematics. The index is scaled with an average of 0.00 and a standard deviation of 1.00. Student index scores are averaged to produce a national score. The scores of 39 nations are reported—29 OECD countries and 10 partner countries.^{[ii]} Indonesia appears to have the most intrinsically motivated students in the world (0.80), followed by Thailand (0.77), Mexico (0.67), and Tunisia (0.59). It is striking that developing countries top the list. Universal education at the elementary level is only a recent reality in these countries, and they are still struggling to deliver universally accessible high schools, especially in rural areas and especially to girls. The students who sat for PISA may be an unusually motivated group. They also may be deeply appreciative of having an opportunity that their parents never had.
The U.S. scores about average (0.08) on the index, statistically about the same as New Zealand, Australia, Ireland, and Canada. The bottom of the table is extremely interesting. Among the countries with the least intrinsically motivated kids are some PISA high flyers. Austria has the least motivated students (−0.35), but that is not statistically significantly different from the score for the Netherlands (−0.33). What’s surprising is that Korea (−0.20), Finland (−0.22), Japan (−0.23), and Belgium (−0.24) score at the bottom of the intrinsic motivation index even though they historically do quite well on the PISA math test.
Let’s now dig a little deeper into the intrinsic motivation index. Two components of the index are how students respond to “I do mathematics because I enjoy it” and “I look forward to my mathematics lessons.” These sentiments are directly related to schooling. Whether students enjoy math or look forward to math lessons is surely influenced by factors such as teachers and curriculum. Table 3-2 rank orders PISA countries by the percentage of students who “agree” or “strongly agree” with the questionnaire prompts. The nations’ 2012 PISA math scores are also tabled. Indonesia scores at the top of both rankings, with 78.3% enjoying math and 72.3% looking forward to studying the subject. However, Indonesia’s PISA math score of 375 is more than one full standard deviation below the international mean of 494 (standard deviation of 92). The tops of the tables are dominated primarily by low-performing countries, but not exclusively so. Denmark is an average-performing nation that has high rankings on both sentiments. Liechtenstein, Hong Kong-China, and Switzerland do well on the PISA math test and appear to have contented, positively oriented students.
Several nations of interest are shaded. The bar across the middle of the tables, encompassing Australia and Germany, demarcates the median of the two lists, with 19 countries above and 19 below that position. The United States registers above the median on looking forward to math lessons (45.4%) and a bit below the median on enjoyment (36.6%). A similar proportion of students in Poland—a country recently celebrated in popular media and in Amanda Ripley’s book, The Smartest Kids in the World,^{[iii]} for making great strides on PISA tests—enjoy math (36.1%), but only 21.3% of Polish kids look forward to their math lessons, very near the bottom of the list, anchored by the Netherlands at 19.8%.
Korea also appears in Ripley’s book. It scores poorly on both items. Only 30.7% of Korean students enjoy math, and fewer still, 21.8%, look forward to studying the subject. Korean education is depicted unflatteringly in Ripley’s book—as an academic pressure cooker lacking joy or purpose—so its standing here is not surprising. But Finland is another matter. It is portrayed as laid-back and student-centered, concerned with making students feel relaxed and engaged. Yet only 28.8% of Finnish students say that they study mathematics because they enjoy it (among the bottom four countries), and only 24.8% report that they look forward to math lessons (among the bottom seven countries). Korea, the pressure cooker, and Finland, the laid-back paradise, look about the same on these dimensions.
Another country that is admired for its educational system, Japan, does not fare well on these measures. Only 30.8% of students in Japan enjoy mathematics, despite the boisterous, enthusiastic classrooms that appear in Elizabeth Green’s recent book, Building a Better Teacher.^{[iv]} Japan does better on the percentage of students looking forward to their math lessons (33.7%), but still places far below the U.S. Green’s book describes classrooms with younger students, but even so, surveys of Japanese fourth and eighth graders’ attitudes toward studying mathematics report results similar to those presented here. American students say that they enjoy their math classes and studying math more than students in Finland, Japan, and Korea.
It is clear from Table 3-2 that, at the national level, enjoying math is not positively related to math achievement. Nor is looking forward to one’s math lessons. The correlation coefficients reported in the last row of the table quantify the magnitude of the inverse relationships. The −0.58 and −0.57 coefficients indicate a moderately negative association, meaning, in plain English, that countries with students who enjoy math or look forward to math lessons tend to score below average on the PISA math test. And high-scoring nations tend to register below average on these measures of student engagement. Country-level associations, however, should be augmented with student-level associations that are calculated within each country.
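The coefficients in that last row are ordinary Pearson correlations computed over national aggregates, one observation per country. A sketch of the computation follows; the country values here are invented stand-ins for illustration, not the actual PISA table.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of deviations' norms."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical country-level data (% enjoying math, national math score),
# shaped like the real table but NOT taken from it.
pct_enjoy  = [78.3, 62.0, 45.0, 36.6, 30.7, 28.8]
math_score = [375, 430, 470, 481, 554, 519]

print(round(pearson_r(pct_enjoy, math_score), 2))  # negative, like the coefficients above
```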
The 2012 PISA volume on student engagement does not present within-country correlation coefficients on intrinsic motivation or its components. But it does offer within-country correlations of math achievement with three other characteristics relevant to student engagement. Table 3-3 displays statistics for students’ responses to: 1) if they feel like they belong at school; 2) their attitudes toward school, an index composed of four factors;^{[v]} and 3) whether they had arrived late for school in the two weeks prior to the PISA test. These measures reflect an excellent mix of behaviors and dispositions.
The within-country correlations trend in the expected direction, but they are small in magnitude. Correlation coefficients for math performance and a sense of belonging at school range from 0.02 to 0.18, meaning that even the country exhibiting the strongest relationship between achievement and a sense of belonging—Thailand, with a 0.18 correlation coefficient—isn’t registering a strong relationship at all. The OECD average is 0.08, which is trivial. The U.S. correlation coefficient, 0.07, is also trivial. The relationship of achievement with attitudes toward school is slightly stronger (OECD average of 0.11), but still weak.
Of the three characteristics, arriving late for school shows the strongest correlation, an unsurprising inverse relationship of −0.14 in OECD countries and −0.20 in the U.S. Students who tend to be tardy also tend to score lower on math tests. But, again, the magnitude is surprisingly small. The coefficients are statistically significant because of large sample sizes, but in a real-world, “would I notice this if it were in my face?” sense, no, the correlation coefficients are suggesting not much of a relationship at all.
The PISA report presents within-country effect sizes for the intrinsic motivation index, calculating the achievement gains associated with a one-unit change in the index. One of several interesting findings is that intrinsic motivation is more strongly associated with gains at the top of the achievement distribution, among students at the 90^{th} percentile in math scores, than at the bottom of the distribution, among students at the 10^{th} percentile.
The report summarizes the within-country effect sizes with this statement: “On average across OECD countries, a change of one unit in the index of intrinsic motivation to learn mathematics translates into a 19 score-point difference in mathematics performance.”^{[vi]} This sentence can be easily misinterpreted. It means that within each of the participating countries, students who differ by one unit on PISA’s 2012 intrinsic motivation index score about 19 points apart on the 2012 math test. It does not mean that a country that gains one unit on the intrinsic motivation index can expect a 19 point score increase.^{[vii]}
Let’s now see what that association looks like at the national level.
PISA first reported national scores on the index of intrinsic motivation to learn mathematics in 2003. Are gains that countries made on the index associated with gains on PISA’s math test? Table 3-4 presents a scorecard on the question, reporting the changes that occurred in thirty-nine nations—in both the index and math scores—from 2003 to 2012. Seventeen nations made statistically significant gains on the index; fourteen nations had gains that were, in a statistical sense, indistinguishable from zero—labeled “no change” in the table; and eight nations experienced statistically significant declines in index scores.
The U.S. scored 0.00 in 2003 and 0.08 in 2012, notching a gain of 0.08 on the index (statistically significant). Its PISA math score declined from 483 to 481, a decline of 2 scale score points (not statistically significant).
Table 3-4 makes it clear that national changes on PISA’s intrinsic motivation index are not associated with changes in math achievement. The countries registering gains on the index averaged a decline of 3.7 points on PISA’s math assessment. The countries that remained about the same on the index had math scores that also remained essentially unchanged (0.09). And the most striking finding: countries that declined on the index (average change of −0.15) actually gained an average of 10.3 points on the PISA math scale. Intrinsic motivation went down; math scores went up. The correlation coefficient for the overall relationship, not shown in the table, is −0.30.
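The scorecard logic in this paragraph can be sketched as follows: each country contributes a pair of changes from 2003 to 2012, countries are grouped by the direction of their index change, and the math-score changes are averaged within each group. The country data below are invented placeholders, not the report's country list, and the fixed threshold stands in for the report's statistical-significance test.

```python
# Hypothetical (Δ motivation index, Δ math score) pairs, 2003-2012.
changes = {
    "P": ( 0.25, -6), "Q": ( 0.12, -3), "R": ( 0.08, -2),   # index gained
    "S": ( 0.01,  1), "T": (-0.02,  0),                     # no real change
    "U": (-0.20, 12), "V": (-0.15,  9),                     # index declined
}

def classify(d_index, threshold=0.05):
    # Simple stand-in for the report's test of statistical significance.
    if d_index > threshold:
        return "gain"
    if d_index < -threshold:
        return "decline"
    return "no change"

groups = {}
for d_index, d_math in changes.values():
    groups.setdefault(classify(d_index), []).append(d_math)

for label, deltas in groups.items():
    print(label, round(sum(deltas) / len(deltas), 1))
```

With these invented numbers the "gain" group averages a math decline while the "decline" group averages a math gain, mirroring the inverse pattern the paragraph describes.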
The analysis above investigated student engagement. International data from the 2012 PISA were examined on several dimensions of student engagement, focusing on a measure that PISA has employed since 2003, the index of intrinsic motivation to learn mathematics. The U.S. scored near the middle of the distribution on the 2012 index. PISA analysts calculated that, on average, a one-unit change in the index was associated with a 19 point gain on the PISA math test. That is the average of within-country calculations, using student-level data that measure the association of intrinsic motivation with PISA score. It represents an effect size of about 0.20—a positive effect, but one that is generally considered small in magnitude.^{[viii]}
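The 0.20 effect size quoted here is just the 19-point association expressed in standard-deviation units, using the average within-country SD of 92 points given in the endnote:

```python
# Express the 19-point-per-index-unit association as an effect size,
# using the average within-country SD of 92 PISA points (endnote viii).
POINTS_PER_UNIT = 19
WITHIN_COUNTRY_SD = 92

effect = POINTS_PER_UNIT / WITHIN_COUNTRY_SD
print(round(effect, 2))  # 0.21, "about 0.20" in the text
```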
The unit of analysis matters. Between-country associations often differ from within-country associations. The current study used a difference-in-difference approach that calculated the correlation coefficient for two variables at the national level: the change in the intrinsic motivation index from 2003–2012 and the change in PISA score over the same period. That analysis produced a correlation coefficient of −0.30, a negative relationship that is also generally considered small in magnitude.
Neither approach can justify causal claims nor address the possibility of reverse causality occurring—the possibility that high math achievement boosts intrinsic motivation to learn math, rather than, or even in addition to, high levels of motivation leading to greater learning. Poor math achievement may cause intrinsic motivation to fall. Taken together, the analyses lead to the conclusion that PISA provides, at best, weak evidence that raising student motivation is associated with achievement gains. Boosting motivation may even produce declines in achievement.
Here’s the bottom line on what the PISA data recommend to policymakers: programs designed to boost student engagement—perhaps a worthy pursuit even if unrelated to achievement—should be evaluated for their effects in small-scale experiments before being adopted broadly. The international evidence does not justify wide-scale concern over current levels of student engagement in the U.S., nor does it support the hypothesis that boosting student engagement would raise student performance nationally.
Let’s conclude by considering the advantages that national-level, difference-in-difference analyses provide that student-level analyses may overlook.
1. They depict policy interventions more accurately. Policies are actions of a political unit affecting all of its members. They do not simply affect the relationship of two characteristics within an individual’s psychology. Policymakers who ask the question, “What happens when a country boosts student engagement?” are asking about a country-level phenomenon.
2. Direction of causality can run differently at the individual and group levels. For example, we know that enjoying a school subject and achievement on tests of that subject are positively correlated at the individual level. But they are not always correlated—and can in fact be negatively correlated—at the group level.
3. By using multiple years of panel data and calculating change over time, a difference-in-difference analysis controls for unobserved-variable bias by “baking into the cake” those unobserved variables at the baseline. The unobserved variables are assumed to remain stable over the time period of the analysis. For the cultural factors that many analysts suspect influence between-nation test score differences, stability may be a safe assumption. Difference-in-difference, then, would be superior to cross-sectional analyses in controlling for cultural influences that are omitted from other models.
4. Testing artifacts from a cultural source can also be dampened. Characteristics such as enjoyment are culturally defined, and the language employed to describe them is also culturally bounded. Consider two of the questionnaire items examined above: whether kids “enjoy” math and how much they “look forward” to math lessons. Cultural differences in responding to these prompts will be reflected in between-country averages at the baseline, and any subsequent changes will reflect fluctuations net of those initial differences.
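Point 3 above can be illustrated in miniature: if an unobserved cultural factor adds a constant offset to a country's measured index in every year, differencing the two years removes it. The numbers below are invented for the illustration.

```python
# A stable unobserved offset cancels out of a change score.
observed_2003 = {"A": 0.50, "B": -0.40}   # measured index, baseline year
observed_2012 = {"A": 0.55, "B": -0.30}   # measured index, later year
offset        = {"A": 0.30, "B": -0.25}   # unobserved, constant across years

for c in observed_2003:
    raw_change = observed_2012[c] - observed_2003[c]
    # Subtracting the offset from both years leaves the change untouched:
    adjusted = (observed_2012[c] - offset[c]) - (observed_2003[c] - offset[c])
    assert abs(raw_change - adjusted) < 1e-12
    print(c, round(raw_change, 2))
```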
[i] Tom Loveless, “The Happiness Factor in Student Learning,” The 2006 Brown Center Report on American Education: How Well are American Students Learning? (Washington, D.C.: The Brookings Institution, 2006).
[ii] All countries with 2003 and 2012 data are included.
[iii] Amanda Ripley, The Smartest Kids in the World: And How They Got That Way (New York, NY: Simon & Schuster, 2013).
[iv] Elizabeth Green, Building a Better Teacher: How Teaching Works (and How to Teach It to Everyone) (New York, NY: W.W. Norton & Company, 2014).
[v] The attitude toward school index is based on responses to: 1) Trying hard at school will help me get a good job, 2) Trying hard at school will help me get into a good college, 3) I enjoy receiving good grades, 4) Trying hard at school is important. See: OECD, PISA 2012 Database, Table III.2.5a.
[vi] OECD, PISA 2012 Results: Ready to Learn: Students’ Engagement, Drive and Self-Beliefs (Volume III) (Paris: PISA, OECD Publishing, 2013), 77.
[vii] PISA originally called the index of intrinsic motivation the index of interest and enjoyment in mathematics, first constructed in 2003. The four questions comprising the index remain identical from 2003 to 2012, allowing for comparability. Index values for 2003 scores were rescaled based on 2012 scaling (mean of 0.00 and SD of 1.00), meaning that index values published in PISA reports prior to 2012 will not agree with those published after 2012 (including those analyzed here). See: OECD, PISA 2012 Results: Ready to Learn: Students’ Engagement, Drive and Self-Beliefs (Volume III) (Paris: PISA, OECD Publishing, 2013), 54.
[viii] PISA math scores are scaled with a standard deviation of 100, but the average withincountry standard deviation for OECD nations was 92 on the 2012 math test.
« Part II: Measuring Effects of the Common Core 
Part III of the 2015 Brown Center Report on American Education
Student engagement refers to the intensity with which students apply themselves to learning in school. Traits such as motivation, enjoyment, and curiosity—characteristics that have interested researchers for a long time—have been joined recently by new terms such as, “grit,” which now approaches cliché status. International assessments collect data from students on characteristics related to engagement. This study looks at data from the Program for International Student Assessment (PISA), an international test given to fifteenyearolds. In the U.S., most PISA students are in the fall of their sophomore year. The high school years are a time when many observers worry that students lose interest in school.
Compared to their peers around the world, how do U.S. students appear on measures of engagement? Are national indicators of engagement related to achievement? This analysis concludes that American students are about average in terms of engagement. Data reveal that several countries noted for their superior ranking on PISA—e.g., Korea, Japan, Finland, Poland, and the Netherlands—score below the U.S. on measures of student engagement. Thus, the relationship of achievement to student engagement is not clear cut, with some evidence pointing toward a weak positive relationship and other evidence indicating a modest negative relationship.
Education studies differ in units of analysis. Some studies report data on individuals, with each student serving as an observation. Studies of new reading or math programs, for example, usually report an average gain score or effect size representing the impact of the program on the average student. Other studies report aggregated data, in which test scores or other measurements are averaged to yield a group score. Test scores of schools, districts, states, or countries are constructed like that. These scores represent the performance of groups, with each group serving as a single observation, but they are really just data from individuals that have been aggregated to the group level.
Aggregated units are particularly useful for policy analysts. Analysts are interested in how Fairfax County or the state of Virginia or the United States is doing. Governmental bodies govern those jurisdictions and policymakers craft policy for all of the citizens within the political jurisdiction—not for an individual.
The analytical unit is especially important when investigating topics like student engagement and their relationships with achievement. Those relationships are inherently individual, focusing on the interaction of psychological characteristics. They are also prone to reverse causality, meaning that the direction of cause and effect cannot readily be determined. Consider self-esteem and academic achievement. Determining which one is cause and which is effect has been debated for decades. Students who are good readers enjoy books, feel pretty good about their reading abilities, and spend more time reading than other kids. Better reading plausibly builds self-esteem, and higher self-esteem plausibly encourages more reading; the causal arrow can run in both directions. The possibility of reverse causality is one reason that beginning statistics students learn an important rule: correlation is not causation.
Starting with the first international assessments in the 1960s, a curious pattern has emerged. Data on students’ attitudes toward studying school subjects, when examined on a national level, often exhibit the opposite relationship with achievement than one would expect. The 2006 Brown Center Report (BCR) investigated the phenomenon in a study of “the happiness factor” in learning.^{[i]} Test scores of fourth graders in 25 countries and eighth graders in 46 countries were analyzed. Students in countries with low math scores were more likely to report that they enjoyed math than students in high-scoring countries. Correlation coefficients for the association of enjoyment and achievement were −0.67 at fourth grade and −0.75 at eighth grade.
Confidence in math performance was also inversely related to achievement. Correlation coefficients for national achievement and the percentage of students responding affirmatively to the statement, “I usually do well in mathematics,” were −0.58 among fourth graders and −0.64 among eighth graders. Nations with the most confident math students tend to perform poorly on math tests; nations with the least confident students do quite well.
That is odd. What’s going on? A comparison of Singapore and the U.S. helps unravel the puzzle. The data in Figure 3-1 are for eighth graders on the 2003 Trends in Mathematics and Science Study (TIMSS). U.S. students were very confident—84% either agreed a lot or a little (39% + 45%) with the statement that they usually do well in mathematics. In Singapore, the figure was 64% (46% + 18%). With a score of 605, however, Singaporean students scored 101 points above the U.S. score of 504, more than one full standard deviation (80 points) on the TIMSS math test.
When within-country data are examined, the relationship exists in the expected direction. In Singapore, highly confident students score 642, approximately 100 points above the least-confident students (551). In the U.S., the gap between the most- and least-confident students was also about 100 points—but at a much lower level on the TIMSS scale, at 541 and 448. Note that the least-confident Singaporean eighth grader still outscores the most-confident American, 551 to 541.
The lesson is that the unit of analysis must be considered when examining data on students’ psychological characteristics and their relationship to achievement. If presented with country-level associations, one should wonder what the within-country associations are. And vice versa. Let’s keep that caution in mind as we now turn to data on fifteen-year-olds’ intrinsic motivation and how nations scored on the 2012 PISA.
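The Singapore-U.S. comparison can be reduced to a few lines of arithmetic. The sketch below simply restates the TIMSS 2003 figures quoted in the text; the point is that confidence predicts higher scores within each country, while the cross-country comparison runs the other way.

```python
# Unit-of-analysis reversal, using the TIMSS 2003 figures quoted above.
# Within each country, confident students outscore less-confident ones;
# across countries, the more confident nation (the U.S.) scores lower.

countries = {
    # country: (% confident, score of most confident, score of least confident)
    "Singapore":     (64, 642, 551),
    "United States": (84, 541, 448),
}

for name, (pct, hi, lo) in countries.items():
    print(f"{name}: {pct}% confident, within-country gap = {hi - lo} points")

# The least-confident Singaporean group (551) still outscores the
# most-confident American group (541).
```

This is the same data viewed at two levels of aggregation; nothing about the individual-level relationship changes when the country-level comparison flips.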
PISA’s index of intrinsic motivation to learn mathematics comprises responses to four items on the student questionnaire: 1) I enjoy reading about mathematics; 2) I look forward to my mathematics lessons; 3) I do mathematics because I enjoy it; and 4) I am interested in the things I learn in mathematics. Figure 3-2 shows the percentage of students in OECD countries—thirty of the most economically developed nations in the world—responding that they agree or strongly agree with the statements. A little less than one-third (30.6%) of students responded favorably to reading about math, 35.5% responded favorably to looking forward to math lessons, 38.2% reported doing math because they enjoy it, and 52.9% said they were interested in the things they learn in math. A ballpark estimate, then, is that one-third to one-half of students respond affirmatively to the individual components of PISA’s intrinsic motivation index.
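As a rough illustration of how such an index is built, one can average each student’s answers to the four items and standardize the result to a mean of 0.00 and a standard deviation of 1.00. This is only a sketch with made-up responses; PISA’s actual procedure uses item response theory scaling rather than a simple average.

```python
import statistics

# Toy index construction (illustrative only; PISA uses IRT scaling).
# Each hypothetical student answers the four motivation items on a
# 0 (disagree) to 3 (strongly agree) scale.
students = [
    [0, 1, 1, 2],
    [3, 2, 3, 3],
    [1, 1, 0, 1],
    [2, 3, 2, 2],
]

raw = [sum(answers) / len(answers) for answers in students]  # per-student average
mu, sd = statistics.mean(raw), statistics.pstdev(raw)
index = [(r - mu) / sd for r in raw]                         # mean 0.00, SD 1.00

# A national score is then just the mean of its students' index values.
national_score = statistics.mean(index)
```

The standardization is what makes a value like Indonesia’s 0.80 interpretable: it is expressed in student-level standard deviation units above the international mean.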
Table 3-1 presents national scores on the 2012 index of intrinsic motivation to learn mathematics. The index is scaled with an average of 0.00 and a standard deviation of 1.00. Student index scores are averaged to produce a national score. The scores of 39 nations are reported—29 OECD countries and 10 partner countries.^{[ii]} Indonesia appears to have the most intrinsically motivated students in the world (0.80), followed by Thailand (0.77), Mexico (0.67), and Tunisia (0.59). It is striking that developing countries top the list. Universal education at the elementary level is only a recent reality in these countries, and they are still struggling to deliver universally accessible high schools, especially in rural areas and especially to girls. The students who sat for PISA may be an unusually motivated group. They also may be deeply appreciative of having an opportunity that their parents never had.
The U.S. scores about average (0.08) on the index, statistically about the same as New Zealand, Australia, Ireland, and Canada. The bottom of the table is extremely interesting. Among the countries with the least intrinsically motivated kids are some PISA high flyers. Austria has the least motivated students (−0.35), but that is not statistically significantly different from the score for the Netherlands (−0.33). What’s surprising is that Korea (−0.20), Finland (−0.22), Japan (−0.23), and Belgium (−0.24) score at the bottom of the intrinsic motivation index even though they historically do quite well on the PISA math test.
Let’s now dig a little deeper into the intrinsic motivation index. Two components of the index are how students respond to “I do mathematics because I enjoy it” and “I look forward to my mathematics lessons.” These sentiments are directly related to schooling. Whether students enjoy math or look forward to math lessons is surely influenced by factors such as teachers and curriculum. Table 3-2 rank orders PISA countries by the percentage of students who “agree” or “strongly agree” with the questionnaire prompts. The nations’ 2012 PISA math scores are also tabled. Indonesia scores at the top of both rankings, with 78.3% enjoying math and 72.3% looking forward to studying the subject. However, Indonesia’s PISA math score of 375 is more than one full standard deviation below the international mean of 494 (standard deviation of 92). The tops of the tables are dominated primarily by low-performing countries, but not exclusively so. Denmark is an average-performing nation that has high rankings on both sentiments. Liechtenstein, Hong Kong-China, and Switzerland do well on the PISA math test and appear to have contented, positively oriented students.
Several nations of interest are shaded. The bar across the middle of the tables, encompassing Australia and Germany, demarcates the median of the two lists, with 19 countries above and 19 below that position. The United States registers above the median on looking forward to math lessons (45.4%) and a bit below the median on enjoyment (36.6%). A similar proportion of students in Poland—a country recently celebrated in popular media and in Amanda Ripley’s book, The Smartest Kids in the World,^{[iii]} for making great strides on PISA tests—enjoy math (36.1%), but only 21.3% of Polish kids look forward to their math lessons, very near the bottom of the list, anchored by the Netherlands at 19.8%.
Korea also appears in Ripley’s book. It scores poorly on both items. Only 30.7% of Korean students enjoy math, and even fewer, 21.8%, look forward to studying the subject. Korean education is depicted unflatteringly in Ripley’s book—as an academic pressure cooker lacking joy or purpose—so its standing here is not surprising. But Finland is another matter. It is portrayed as laid-back and student-centered, concerned with making students feel relaxed and engaged. Yet only 28.8% of Finnish students say that they study mathematics because they enjoy it (among the bottom four countries) and only 24.8% report that they look forward to math lessons (among the bottom seven countries). Korea, the pressure cooker, and Finland, the laid-back paradise, look about the same on these dimensions.
Another country that is admired for its educational system, Japan, does not fare well on these measures. Only 30.8% of students in Japan enjoy mathematics, despite the boisterous, enthusiastic classrooms that appear in Elizabeth Green’s recent book, Building a Better Teacher.^{[iv]} Japan does better on the percentage of students looking forward to their math lessons (33.7%), but still places far below the U.S. Green’s book describes classrooms with younger students, but even so, surveys of Japanese fourth and eighth graders’ attitudes toward studying mathematics report results similar to those presented here. American students say that they enjoy their math classes and studying math more than students in Finland, Japan, and Korea.
It is clear from Table 3-2 that at the national level, enjoying math is not positively related to math achievement. Nor is looking forward to one’s math lessons. The correlation coefficients reported in the last row of the table quantify the magnitude of the inverse relationships. The −0.58 and −0.57 coefficients indicate a moderately negative association, meaning, in plain English, that countries with students who enjoy math or look forward to math lessons tend to score below average on the PISA math test. And high-scoring nations tend to register below average on these measures of student engagement. Country-level associations, however, should be augmented with student-level associations that are calculated within each country.
The 2012 PISA volume on student engagement does not present within-country correlation coefficients on intrinsic motivation or its components. But it does offer within-country correlations of math achievement with three other characteristics relevant to student engagement. Table 3-3 displays statistics for students’ responses to: 1) whether they feel like they belong at school; 2) their attitudes toward school, an index composed of four factors;^{[v]} and 3) whether they had arrived late for school in the two weeks prior to the PISA test. These measures reflect an excellent mix of behaviors and dispositions.
The within-country correlations trend in the expected direction, but they are small in magnitude. Correlation coefficients for math performance and a sense of belonging at school range from 0.02 to 0.18, meaning that the country exhibiting the strongest relationship between achievement and a sense of belonging—Thailand, with a 0.18 correlation coefficient—isn’t registering a strong relationship at all. The OECD average is 0.08, which is trivial. The U.S. correlation coefficient, 0.07, is also trivial. The relationship of achievement with attitudes toward school is slightly stronger (OECD average of 0.11), but is still weak.
Of the three characteristics, arriving late for school shows the strongest correlation, an unsurprising inverse relationship of −0.14 in OECD countries and −0.20 in the U.S. Students who tend to be tardy also tend to score lower on math tests. But, again, the magnitude is surprisingly small. The coefficients are statistically significant because of large sample sizes, but in a real world “would I notice this if it were in my face?” sense, no, the correlation coefficients are suggesting not much of a relationship at all.
The PISA report presents within-country effect sizes for the intrinsic motivation index, calculating the achievement gains associated with a one-unit change in the index. One of several interesting findings is that intrinsic motivation is more strongly associated with gains at the top of the achievement distribution, among students at the 90^{th} percentile in math scores, than at the bottom of the distribution, among students at the 10^{th} percentile.
The report summarizes the within-country effect sizes with this statement: “On average across OECD countries, a change of one unit in the index of intrinsic motivation to learn mathematics translates into a 19 score-point difference in mathematics performance.”^{[vi]} This sentence can be easily misinterpreted. It means that within each of the participating countries, students who differ by one unit on PISA’s 2012 intrinsic motivation index score about 19 points apart on the 2012 math test. It does not mean that a country that gains one unit on the intrinsic motivation index can expect a 19-point score increase.^{[vii]}
Let’s now see what that association looks like at the national level.
PISA first reported national scores on the index of intrinsic motivation to learn mathematics in 2003. Are gains that countries made on the index associated with gains on PISA’s math test? Table 3-4 presents a score card on the question, reporting the changes that occurred in thirty-nine nations—in both the index and math scores—from 2003 to 2012. Seventeen nations made statistically significant gains on the index; fourteen nations had gains that were, in a statistical sense, indistinguishable from zero—labeled “no change” in the table; and eight nations experienced statistically significant declines in index scores.
The U.S. scored 0.00 in 2003 and 0.08 in 2012, notching a gain of 0.08 on the index (statistically significant). Its PISA math score declined from 483 to 481, a decline of 2 scale score points (not statistically significant).
Table 3-4 makes it clear that national changes on PISA’s intrinsic motivation index are not associated with changes in math achievement. The countries registering gains on the index averaged a decline of 3.7 points on PISA’s math assessment. The countries that remained about the same on the index had math scores that also remained essentially unchanged (0.09). And the most striking finding: countries that declined on the index (average of −0.15) actually gained an average of 10.3 points on the PISA math scale. Intrinsic motivation went down; math scores went up. The correlation coefficient for the overall relationship, not shown in the table, is −0.30.
The analysis above investigated student engagement. International data from the 2012 PISA were examined on several dimensions of student engagement, focusing on a measure that PISA has employed since 2003, the index of intrinsic motivation to learn mathematics. The U.S. scored near the middle of the distribution on the 2012 index. PISA analysts calculated that, on average, a one-unit change in the index was associated with a 19-point difference on the PISA math test. That is the average of within-country calculations, using student-level data that measure the association of intrinsic motivation with PISA score. It represents an effect size of about 0.20—a positive effect, but one that is generally considered small in magnitude.^{[viii]}
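The effect-size figure follows from simple arithmetic: divide the 19-point association by the average within-country standard deviation of 92 points noted in endnote viii (not the nominal scale SD of 100).

```python
# Effect size = score-point association / within-country SD (endnote viii).
score_points = 19        # association per one-unit change in the index
within_country_sd = 92   # average within-country SD, not the nominal 100

effect_size = score_points / within_country_sd
print(round(effect_size, 2))  # 0.21, i.e., "about 0.20"
```

Had the nominal scale SD of 100 been used instead, the effect size would come out at 0.19; either way it lands in the range conventionally labeled small.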
The unit of analysis matters. Between-country associations often differ from within-country associations. The current study used a difference-in-difference approach that calculated the correlation coefficient for two variables at the national level: the change in the intrinsic motivation index from 2003 to 2012 and the change in PISA score over the same period. That analysis produced a correlation coefficient of −0.30, a negative relationship that is also generally considered small in magnitude.
Neither approach can justify causal claims nor address the possibility of reverse causality occurring—the possibility that high math achievement boosts intrinsic motivation to learn math, rather than, or even in addition to, high levels of motivation leading to greater learning. Poor math achievement may cause intrinsic motivation to fall. Taken together, the analyses lead to the conclusion that PISA provides, at best, weak evidence that raising student motivation is associated with achievement gains. Boosting motivation may even produce declines in achievement.
Here’s the bottom line for what PISA data recommend to policymakers: Programs designed to boost student engagement—perhaps a worthy pursuit even if unrelated to achievement—should be evaluated for their effects in small-scale experiments before being adopted broadly. The international evidence does not justify wide-scale concern over current levels of student engagement in the U.S. or support the hypothesis that boosting student engagement would raise student performance nationally.
Let’s conclude by considering the advantages that national-level, difference-in-difference analyses provide that student-level analyses may overlook.
1. They depict policy interventions more accurately. Policies are actions of a political unit affecting all of its members. They do not simply affect the relationship of two characteristics within an individual’s psychology. Policymakers who ask the question, “What happens when a country boosts student engagement?” are asking about a countrylevel phenomenon.
2. Direction of causality can run differently at the individual and group levels. For example, we know that enjoying a school subject and achievement on tests of that subject are positively correlated at the individual level. But they are not always correlated—and can in fact be negatively correlated—at the group level.
3. By using multiple years of panel data and calculating change over time, a difference-in-difference analysis controls for unobserved-variable bias by “baking into the cake” those unobserved variables at the baseline. The unobserved variables are assumed to remain stable over the time period of the analysis. For the cultural factors that many analysts suspect influence between-nation test score differences, stability may be a safe assumption. Difference-in-difference, then, would be superior to cross-sectional analyses in controlling for cultural influences that are omitted from other models.
4. Testing artifacts from a cultural source can also be dampened. Characteristics such as enjoyment are culturally defined, and the language employed to describe them is also culturally bounded. Consider two of the questionnaire items examined above: whether kids “enjoy” math and how much they “look forward” to math lessons. Cultural differences in responding to these prompts will be reflected in betweencountry averages at the baseline, and any subsequent changes will reflect fluctuations net of those initial differences.
[i] Tom Loveless, “The Happiness Factor in Student Learning,” The 2006 Brown Center Report on American Education: How Well are American Students Learning? (Washington, D.C.: The Brookings Institution, 2006).
[ii] All countries with 2003 and 2012 data are included.
[iii] Amanda Ripley, The Smartest Kids in the World: And How They Got That Way (New York, NY: Simon & Schuster, 2013).
[iv] Elizabeth Green, Building a Better Teacher: How Teaching Works (and How to Teach It to Everyone) (New York, NY: W.W. Norton & Company, 2014).
[v] The attitude toward school index is based on responses to: 1) Trying hard at school will help me get a good job, 2) Trying hard at school will help me get into a good college, 3) I enjoy receiving good grades, 4) Trying hard at school is important. See: OECD, PISA 2012 Database, Table III.2.5a.
[vi] OECD, PISA 2012 Results: Ready to Learn: Students’ Engagement, Drive and Self-Beliefs (Volume III) (Paris: PISA, OECD Publishing, 2013), 77.
[vii] PISA originally called the index of intrinsic motivation the index of interest and enjoyment in mathematics, first constructed in 2003. The four questions comprising the index remain identical from 2003 to 2012, allowing for comparability. Index values for 2003 scores were rescaled based on 2012 scaling (mean of 0.00 and SD of 1.00), meaning that index values published in PISA reports prior to 2012 will not agree with those published after 2012 (including those analyzed here). See: OECD, PISA 2012 Results: Ready to Learn: Students’ Engagement, Drive and Self-Beliefs (Volume III) (Paris: PISA, OECD Publishing, 2013), 54.
[viii] PISA math scores are scaled with a standard deviation of 100, but the average within-country standard deviation for OECD nations was 92 on the 2012 math test.
Part I of the 2015 Brown Center Report on American Education.
Girls score higher than boys on tests of reading ability. They have for a long time. This section of the Brown Center Report assesses where the gender gap stands today and examines trends over the past several decades. The analysis also extends beyond the U.S. and shows that boys’ reading achievement lags that of girls in every country in the world on international assessments. The international dimension—recognizing that the U.S. is not alone in this phenomenon—serves as a catalyst to discuss why the gender gap exists and whether it extends into adulthood.
One of the earliest large-scale studies on gender differences in reading, conducted in Iowa in 1942, found that girls in both elementary and high schools were better than boys at reading comprehension.^{[i]} The most recent results from reading tests of the National Assessment of Educational Progress (NAEP) show girls outscoring boys at every grade level and age examined. Gender differences in reading are not confined to the United States. Among younger children—age nine to ten, or about fourth grade—girls consistently outscore boys on international assessments, from a pioneering study of reading comprehension conducted in fifteen countries in the 1970s, to the results of the Progress in International Reading Literacy Study (PIRLS) conducted in forty-nine nations and nine benchmarking entities in 2011. The same is true for students in high school. On the 2012 reading literacy test of the Program for International Student Assessment (PISA), worldwide gender gaps are evident between fifteen-year-old males and females.
As the 21^{st} century dawned, the gender gap came under the scrutiny of reporters and pundits. Author Christina Hoff Sommers added a political dimension to the gender gap, and some say swept the topic into the culture wars raging at the time, with her 2000 book The War Against Boys: How Misguided Feminism is Harming Our Young Men.^{[ii]} Sommers argued that boys’ academic inferiority, and in particular their struggles with reading, stemmed from the feminist movement’s impact on schools and society. In the second edition, published in 2013, she changed the subtitle to How Misguided Policies Are Harming Our Young Men. Some of the sting is removed from the indictment of “misguided feminism.” But not all of it. Sommers singles out for criticism a 2008 report from the American Association of University Women.^{[iii]} That report sought to debunk the notion that boys fared poorly in school compared to girls. It left out a serious discussion of boys’ inferior performance on reading tests, as well as their lower grade point averages, greater rate of school suspension and expulsion, and lower rate of acceptance into college.
Journalist Richard Whitmire picked up the argument about the gender gap in 2010 with Why Boys Fail: Saving Our Sons from an Educational System That’s Leaving Them Behind.^{[iv]} Whitmire sought to separate boys’ academic problems from the culture wars, noting that the gender gap in literacy is a worldwide phenomenon and appears even in countries where feminist movements are weak to nonexistent. Whitmire offers several reasons for boys’ low reading scores, including poor reading instruction (particularly a lack of focus on phonics), and too few books appealing to boys’ interests. He also dismisses several explanations that are in circulation, among them, video games, hiphop culture, too much testing, and feminized classrooms. As with Sommers’s book, Whitmire’s culprit can be found in the subtitle: the educational system. Even if the educational system is not the original source of the problem, Whitmire argues, schools could be doing more to address it.
In a 2006 monograph, education policy researcher Sara Mead took on the idea that American boys were being shortchanged by schools. After reviewing achievement data from NAEP and other tests, Mead concluded that the real story of the gender gap wasn’t one of failure at all. Boys and girls were both making solid academic progress, but in some cases, girls were making larger gains, misleading some commentators into concluding that boys were being left behind. Mead concluded, “The current boy crisis hype and the debate around it are based more on hopes and fears than on evidence.”^{[v]}
The analysis below focuses on where the gender gap in reading stands today, not its causes. Nevertheless, readers should keep in mind the three most prominent explanations for the gap. They will be used to frame the concluding discussion.
Biological/Developmental: Even before attending school, young boys evidence more problems in learning how to read than girls do. This explanation holds that the sexes are hard-wired differently for literacy.
School Practices: Boys trail girls on several school measures—behavioral, social, and academic—and those discrepancies extend all the way through college. This explanation holds that even if schools do not create the gap, they certainly do not do all they could to ameliorate it.
Cultural Influences: Cultural influences steer boys toward nonliterary activities (sports, music) and define literacy as a feminine characteristic. This explanation holds that cultural cues and strong role models could help close the gap by portraying reading as a masculine activity.
Table 1-1 displays the most recent data from eight national tests of U.S. achievement. The first group shows results from the National Assessment of Educational Progress Long Term Trend (NAEP-LTT), given to students nine, 13, and 17 years of age. The NAEP-LTT in reading was first administered in 1971. The second group of results is from the NAEP Main Assessment, which began testing reading achievement in 1992. It assesses at three different grade levels: fourth, eighth, and twelfth. The last two tests are international assessments in which the U.S. participates, the Progress in International Reading Literacy Study (PIRLS), which began in 2001, and the Program for International Student Assessment (PISA), first given in 2000. PIRLS tests fourth graders, and PISA tests 15-year-olds. In the U.S., 71 percent of students who took PISA in the fall of 2012 were in tenth grade.
Two findings leap out. First, the test score gaps between males and females are statistically significant on all eight assessments. Because the sample sizes of the assessments are quite large, statistical significance does not necessarily mean that the gaps are of practical significance—or even noticeable if one observed several students reading together. The tests also employ different scales. The final column in the table expresses the gaps in standard deviation units, a measure that allows for comparing the different scores and estimating their practical meaningfulness.
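The conversion in that final column is a simple ratio: divide the raw score gap by the test’s standard deviation. As a worked example, the PISA figures quoted later in this section (a 31-point U.S. gap on a scale with a standard deviation of 94) yield:

```python
# Standardized gap = raw score gap / test SD, so gaps from tests with
# different scales can be compared. These PISA values are the ones quoted
# later in this section; treat this as a worked example, not the table.
us_pisa_gap = 31   # girls minus boys, 2012 PISA reading, U.S.
pisa_sd = 94       # PISA reading scale standard deviation

print(f"{us_pisa_gap / pisa_sd:.2f} SDs")  # 0.33 SDs
```

The same division applied to any NAEP gap and the NAEP scale SD produces the comparable entries in the table’s last column.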
The second finding is based on the standardized gaps (expressed in SDs). On both NAEP tests, the gaps are narrower among elementary students and wider among middle and high school students. That pattern also appears on international assessments. The gap is twice as large on PISA as on PIRLS.^{[vi]} A popular explanation for the gender gap involves the different maturation rates of boys and girls. That theory will be discussed in greater detail below, but at this point in the analysis, let’s simply note that the gender gap appears to grow until early adolescence—age 13 on the LTT-NAEP and grade eight on the NAEP Main.
Should these gaps be considered small or large? Many analysts consider 10 scale score points on NAEP equal to about a year of learning. In that light, gaps of five to 10 points appear substantial. But compared to other test score gaps on NAEP, the gender gap is modest in size. On the 2012 LTT-NAEP for nine-year-olds, the five-point gap between boys and girls is about one-half of the 10-point gap between students living in cities and those living in suburbs.^{[vii]} The gap between students who are eligible for free and reduced lunch and those who are not is 28 points; between black and white students, it is 23 points; and between English language learners (ELL) and non-ELL students, it is 34 points.
Table 1-1 only shows the size of the gender gap as gauged by assessments at single points in time. For determining trends, let’s take a closer look at the LTT-NAEP, since it provides the longest running record of the gender gap. In Table 1-2, scores are displayed from tests administered since 1971 and given nearest to the starts and ends of decades. Results from 2008 and 2012 are both shown to give readers an idea of recent fluctuations. At all three ages, gender gaps were larger in 1971 than they are today. The change at age nine is statistically significant, but not at age 13 (p=0.10) or age 17 (p=0.07), although they are close. Slight shrinkage occurred in the 1980s, but the gaps expanded again in the 1990s. The gap at age 13 actually peaked at 15 scale score points in 1994 (not shown in the table), and the decline since then is statistically significant. Similarly, the gap at age 17 peaked in 1996 at 15 scale score points, and the decline since then is also statistically significant. More recently, the gap at age nine began to shrink again in 1999, the gap at age 13 began shrinking in the 2000s, and the gap at age 17 in 2012.
Table 1-3 decomposes the change figures by male and female performance. Sara Mead’s point, that the NAEP story is one of both sexes gaining rather than boys falling behind, is even truer today than when she made it in 2006. When Mead’s analysis was published, the most recent LTT-NAEP data were from 2004. Up until then, girls had made greater reading gains than boys. But that situation has reversed. Boys have now made larger gains over the history of the LTT-NAEP, fueled by the gains that they registered from 2004 to 2012. The score for 17-year-old females in 2012 (291) was identical to their score in 1971.
The United States is not alone in reading’s gender gap. Its gap of 31 points is not even the largest (see Figure 1-1). On the 2012 PISA, all OECD countries exhibited a gender gap, with females outscoring males by 23 to 62 points on the PISA scale (standard deviation of 94). On average in the OECD, girls outscored boys by 38 points (rounded to 515 for girls and 478 for boys). The U.S. gap of 31 points is less than the OECD average.
Finland had the largest gender gap on the 2012 PISA, twice that of the U.S., with females outscoring males by an astonishing 62 points (0.66 SDs). Finnish girls scored 556, and boys scored 494. To put this gap in perspective, consider that Finland’s renowned superiority on PISA tests is completely dependent on Finnish girls. Finland’s boys’ score of 494 is about the same as the international average of 496, and not much above the OECD average for males (478). The reading performance of Finnish boys is not statistically significantly different from boys in the U.S. (482) or from the average U.S. student, both boys and girls (498). Finnish superiority in reading only exists among females.
There is a hint of a geographical pattern. Northern European countries tend to have larger gender gaps in reading. Finland, Sweden, Iceland, and Norway have four of the six largest gaps. Denmark is the exception with a 31 point gap, below the OECD average. And two Asian OECD members have small gender gaps. Japan’s gap of 24 points and South Korea’s gap of 23 are ranked among the bottom four countries. The Nordic tendency toward large gender gaps in reading was noted in a 2002 analysis of the 2000 PISA results.^{[viii] } At that time, too, Denmark was the exception. Because of the larger sample and persistence over time, the Nordic pattern warrants more confidence than the one in the two Asian countries.
Back to Finland. That’s the headline story here, and it contains a lesson for cautiously interpreting international test scores. Consider that the 62 point gender gap in Finland is only 14 points smaller than the U.S. blackwhite gap (76 points) and 21 points larger than the whiteHispanic gap (41 points) on the same test. Finland’s gender gap illustrates the superficiality of much of the commentary on that country’s PISA performance. A common procedure in policy analysis is to consider how policies differentially affect diverse social groups. Think of all the commentators who cite Finland to promote particular policies, whether the policies address teacher recruitment, amount of homework, curriculum standards, the role of play in children’s learning, school accountability, or high stakes assessments.^{[ix] } Advocates pound the table while arguing that these policies are obviously beneficial. “Just look at Finland,” they say. Have you ever read a warning that even if those policies contribute to Finland’s high PISA scores—which the advocates assume but serious policy scholars know to be unproven—the policies also may be having a negative effect on the 50 percent of Finland’s school population that happens to be male?
One of the solutions put forth for improving boys’ reading scores is to make an effort to boost their enjoyment of reading. That certainly makes sense, but past scores of national reading and math performance have consistently, and counterintuitively, shown no relationship (or even an inverse one) with enjoyment of the two subjects. PISA asks students how much they enjoy reading, so let’s now investigate whether fluctuations in PISA scores are at all correlated with how much 15yearolds say they like to read.
The analysis below employs what is known as a “differencesindifferences” analytical strategy. In both 2000 and 2009, PISA measured students’ reading ability and asked them several questions about how much they like to read. An enjoyment index was created from the latter set of questions.^{[x] } Females score much higher on this index than boys. Many commentators believe that girls’ greater enjoyment of reading may be at the root of the gender gap in literacy.
When new international test scores are released, analysts are tempted to just look at variables exhibiting strong correlations with achievement (such as amount of time spent on homework), and embrace them as potential causes of high achievement. But crosssectional correlations can be deceptive. The direction of causality cannot be determined, whether it’s doing a lot of homework that leads to high achievement, or simply that good students tend to take classes that assign more homework. Correlations in crosssectional data are also vulnerable to unobserved factors that may influence achievement. For example, if cultural predilections drive a country’s exemplary performance, their influence will be masked or spuriously assigned to other variables unless they are specifically modeled.^{[xi]} Class size, betweenschool tracking, and time spent on learning are all topics on which differencesindifferences has been fruitfully employed to analyze multiple crosssections of international data.
Another benefit of differencesindifferences is that it measures statistical relationships longitudinally. Table 14 investigates the question: Is the rise and fall of reading enjoyment correlated with changes in reading achievement? Many believe that if boys liked reading more, their literacy test scores would surely increase. Table 14 does not support that belief. Data are available for 27 OECD countries, and they are ranked by how much they boosted males’ enjoyment of reading. The index is set at the studentlevel with a mean of 0.00 and standard deviation of 1.00. For the twentyseven nations in Table 14, the mean national change in enjoyment is .02 with a standard deviation of .09.
Germany did the best job of raising boys’ enjoyment of reading, with a gain of 0.12 on the index. German males’ PISA scores also went up—a little more than 10 points (10.33). France, on the other hand, raised males’ enjoyment of reading nearly as much as Germany (0.11), but French males’ PISA scores declined by 15.26 points. A bit further down the column, Ireland managed to get boys to enjoy reading a little more (a gain of 0.05) but their reading performance fell a whopping 36.54 points. Toward the bottom end of the list, Poland’s boys enjoyed reading less in 2009 than in 2000, a decline of 0.14 on the index, but over the same time span, their reading literacy scores increased by more than 14 points (14.29). Among the countries in which the relationship goes in the expected direction is Finland. Finnish males’ enjoyment of reading declined (0.14) as did their PISA scores in reading literacy (11.73). Overall, the correlation coefficient for change in enjoyment and change in reading score is 0.01, indicating no relationship between the two.
Christina Hoff Sommers and Richard Whitmire have praised specific countries for first recognizing and then addressing the gender gap in reading. Recently, Sommers urged the U.S. to “follow the example of the British, Canadians, and Australians.”^{[xii]} Whitmire described Australia as “years ahead of the U.S. in pioneering solutions” to the gender gap. Let’s see how those countries appear in Table 14. England does not have PISA data for the 2000 baseline year, but both Canada and Australia are included. Canada raised boys’ enjoyment of reading a little bit (0.02) but Canadian males’ scores fell by about 12 points (11.74). Australia suffered a decline in boys’ enjoyment of reading (0.04) and achievement (16.50). As promising as these countries’ efforts may have appeared a few years ago, so far at least, they have not borne fruit in raising boys’ reading performance on PISA.
Achievement gaps are tricky because it is possible for the test scores of the two groups being compared to both decline while the gap increases or, conversely, for scores of both to increase while the gap declines. Table 14 only looks at males’ enjoyment of reading and its relationship to achievement. A separate differencesindifferences analysis was conducted (but not displayed here) to see whether changes in the enjoyment gap—the difference between boys’ and girls’ enjoyment of reading—are related to changes in reading achievement. They are not (correlation coefficient of 0.08). National PISA data simply do not support the hypothesis that the superior reading performance of girls is related to the fact that girls enjoy reading more than boys.
Let’s summarize the main findings of the analysis above. Reading scores for girls exceed those for boys on eight recent assessments of U.S. reading achievement. The gender gap is larger for middle and high school students than for students in elementary school. The gap was apparent on the earliest NAEP tests in the 1970s and has shown some signs of narrowing in the past decade. International tests reveal that the gender gap is worldwide. Among OECD countries, it even appears among countries known for superior performance on PISA’s reading test. Finland not only exhibited the largest gender gap in reading on the 2012 PISA, the gap had widened since 2000. A popular recommendation for boosting boys’ reading performance is finding ways for them to enjoy reading more. That theory is not supported by PISA data. Countries that succeeded in raising boys’ enjoyment of reading from 2000 to 2009 were no more likely to improve boys’ reading performance than countries where boys’ enjoyment of reading declined.
The origins of the gender gap are hotly debated. The universality of the gap certainly supports the argument that it originates in biological or developmental differences between the two sexes. It is evident among students of different ages in data collected at different points in time. It exists across the globe, in countries with different educational systems, different popular cultures, different child rearing practices, and different conceptions of gender roles. Moreover, the greater prevalence of reading impairment among young boys—a ratio of two or three to one—suggests an endemic difficulty that exists before the influence of schools or culture can take hold.^{[xiii] }
But some of the data examined above also argue against the developmental explanation. The gap has been shrinking on NAEP. At age nine, it is less than half of what it was forty years ago. Biology doesn’t change that fast. Gender gaps in math and science, which were apparent in achievement data for a long time, have all but disappeared, especially once course taking is controlled. The reading gap also seems to evaporate by adulthood. On an international assessment of adults conducted in 2012, reading scores for men and women were statistically indistinguishable up to age 35—even in Finland and the United States. After age 35, men had statistically significantly higher scores in reading, all the way to the oldest group, age 55 and older. If the gender gap in literacy is indeed shaped by developmental factors, it may be important for our understanding of the phenomenon to scrutinize periods of the life cycle beyond the age of schooling.
Another astonishing pattern emerged from the study of adult reading. Participants were asked how often they read a book. Of avid book readers (those who said they read a book once a week) in the youngest group (age 24 and younger), 59 percent were women and 41 percent were men. By age 55, avid book readers were even more likely to be women, by a margin of 63 percent to 37 percent. Twothirds of respondents who said they never read books were men. Women remained the more enthusiastic readers even as the test scores of men caught up with those of women and surpassed them.
A few years ago, Ian McEwan, the celebrated English novelist, decided to reduce the size of the library in his London townhouse. He and his younger son selected thirty novels and took them to a local park. They offered the books to passersby. Women were eager and grateful to take the books, McEwan reports. Not a single man accepted. The author’s conclusion? “When women stop reading, the novel will be dead.”[xiv]
McEwan might be right, regardless of the origins of the gender gap in reading and the efforts to end it.
[i] J.B. Stroud and E.F. Lindquist, “Sex differences in achievement in the elementary and secondary schools,” Journal of Educational Psychology, vol. 33(9) (Washington, D.C.: American Psychological Association, 1942), 657–667.
[ii] Christina Hoff Sommers, The War Against Boys: How Misguided Feminism Is Harming Our Young Men (New York, NY: Simon & Schuster, 2000).
[iii] Christianne Corbett, Catherine Hill, and Andresse St. Rose, Where the Girls Are: The Facts About Gender Equity in Education (Washington, D.C.: American Association of University Women, 2008).
[iv] Richard Whitmire, Why Boys Fail: Saving Our Sons from an Educational System That’s Leaving Them Behind (New York, NY: AMACOM, 2010).
[v] Sara Mead, The Evidence Suggests Otherwise: The Truth About Boys and Girls (Washington, D.C.: Education Sector, 2006).
[vi] PIRLS and PISA assess different reading skills. Performance on the two tests may not be comparable.
[vii] NAEP categories were aggregated to calculate the city/suburb difference.
[viii] OECD, Reading for Change: Performance and Engagement Across Countries (Paris: OECD, 2002), 125.
[ix] The best example of promoting Finnish education policies is Pasi Sahlberg’s Finnish Lessons: What Can the World Learn from Educational Change in Finland? (New York: Teachers College Press, 2011).
[x] The 2009 endpoint was selected because 2012 data for the enjoyment index were not available on the NCES PISA data tool.
[xi] A formal name for the problem of reverse causality is endogeneity and for the problem of unobserved variables, omitted variable bias.
[xii] Christina Hoff Sommers, “The Boys at the Back,” New York Times, February 2, 2013; Richard Whitmire, Why Boys Fail (New York: AMACOM, 2010), 153.
[xiii] J.L. Hawke, R.K. Olson, E.G. Willcutt, S.J. Wadsworth, and J.C. DeFries, “Gender ratios for reading difficulties,” Dyslexia, vol. 15(3) (Chichester, England: Wiley, 2009), 239–242.
[xiv] Daniel Zalewski, “The Background Hum: Ian McEwan’s art of unease,” The New Yorker, February 23, 2009.
Part I of the 2015 Brown Center Report on American Education.
Girls score higher than boys on tests of reading ability. They have for a long time. This section of the Brown Center Report assesses where the gender gap stands today and examines trends over the past several decades. The analysis also extends beyond the U.S. and shows that boys’ reading achievement lags that of girls in every country in the world on international assessments. The international dimension—recognizing that the U.S. is not alone in this phenomenon—serves as a catalyst to discuss why the gender gap exists and whether it extends into adulthood.
One of the earliest large-scale studies on gender differences in reading, conducted in Iowa in 1942, found that girls in both elementary and high schools were better than boys at reading comprehension.[i] The most recent results from reading tests of the National Assessment of Educational Progress (NAEP) show girls outscoring boys at every grade level and age examined. Gender differences in reading are not confined to the United States. Among younger children—age nine to ten, or about fourth grade—girls consistently outscore boys on international assessments, from a pioneering study of reading comprehension conducted in fifteen countries in the 1970s, to the results of the Progress in International Reading Literacy Study (PIRLS) conducted in forty-nine nations and nine benchmarking entities in 2011. The same is true for students in high school. On the 2012 reading literacy test of the Program for International Student Assessment (PISA), worldwide gender gaps are evident between fifteen-year-old males and females.
As the 21st century dawned, the gender gap came under the scrutiny of reporters and pundits. Author Christina Hoff Sommers added a political dimension to the gender gap, and some say swept the topic into the culture wars raging at the time, with her 2000 book The War Against Boys: How Misguided Feminism Is Harming Our Young Men.[ii] Sommers argued that boys’ academic inferiority, and in particular their struggles with reading, stemmed from the feminist movement’s impact on schools and society. In the second edition, published in 2013, she changed the subtitle to How Misguided Policies Are Harming Our Young Men. Some of the sting is removed from the indictment of “misguided feminism.” But not all of it. Sommers singles out for criticism a 2008 report from the American Association of University Women.[iii] That report sought to debunk the notion that boys fared poorly in school compared to girls. It left out a serious discussion of boys’ inferior performance on reading tests, as well as their lower grade point averages, greater rates of school suspension and expulsion, and lower rate of acceptance into college.
Journalist Richard Whitmire picked up the argument about the gender gap in 2010 with Why Boys Fail: Saving Our Sons from an Educational System That’s Leaving Them Behind.[iv] Whitmire sought to separate boys’ academic problems from the culture wars, noting that the gender gap in literacy is a worldwide phenomenon and appears even in countries where feminist movements are weak to nonexistent. Whitmire offers several reasons for boys’ low reading scores, including poor reading instruction (particularly a lack of focus on phonics) and too few books appealing to boys’ interests. He also dismisses several explanations that are in circulation, among them video games, hip-hop culture, too much testing, and feminized classrooms. As with Sommers’s book, Whitmire’s culprit can be found in his subtitle: the educational system. Even if the educational system is not the original source of the problem, Whitmire argues, schools could be doing more to address it.
In a 2006 monograph, education policy researcher Sara Mead took on the idea that American boys were being shortchanged by schools. After reviewing achievement data from NAEP and other tests, Mead concluded that the real story of the gender gap wasn’t one of failure at all. Boys and girls were both making solid academic progress, but in some cases girls were making larger gains, misleading some commentators into concluding that boys were being left behind. Mead concluded, “The current boy crisis hype and the debate around it are based more on hopes and fears than on evidence.”[v]
The analysis below focuses on where the gender gap in reading stands today, not its causes. Nevertheless, readers should keep in mind the three most prominent explanations for the gap. They will be used to frame the concluding discussion.
Biological/Developmental: Even before attending school, young boys show more problems in learning how to read than girls do. This explanation holds that the sexes are hard-wired differently for literacy.
School Practices: Boys trail girls on several school measures—behavioral, social, and academic—and those discrepancies extend all the way through college. This explanation holds that even if schools do not create the gap, they do not do all they could to ameliorate it.
Cultural Influences: Cultural influences steer boys toward non-literary activities (sports, music) and define literacy as a feminine characteristic. This explanation holds that cultural cues and strong role models could help close the gap by portraying reading as a masculine activity.
Table 1-1 displays the most recent data from eight national tests of U.S. achievement. The first group shows results from the National Assessment of Educational Progress Long Term Trend (NAEP-LTT), given to students nine, 13, and 17 years of age. The NAEP-LTT in reading was first administered in 1971. The second group of results is from the NAEP Main Assessment, which began testing reading achievement in 1992. It assesses at three different grade levels: fourth, eighth, and twelfth. The last two tests are international assessments in which the U.S. participates: the Progress in International Reading Literacy Study (PIRLS), which began in 2001, and the Program for International Student Assessment (PISA), first given in 2000. PIRLS tests fourth graders, and PISA tests 15-year-olds. In the U.S., 71 percent of students who took PISA in the fall of 2012 were in tenth grade.
Two findings leap out. First, the test score gaps between males and females are statistically significant on all eight assessments. Because the sample sizes of the assessments are quite large, statistical significance does not necessarily mean that the gaps are of practical significance—or even noticeable if one observed several students reading together. The tests also employ different scales. The final column in the table expresses the gaps in standard deviation units, a measure that allows for comparing the different scores and estimating their practical meaningfulness.
The second finding is based on the standardized gaps (expressed in SDs). On both NAEP tests, the gaps are narrower among elementary students and wider among middle and high school students. That pattern also appears on international assessments: the gap is twice as large on PISA as on PIRLS.[vi] A popular explanation for the gender gap involves the different maturation rates of boys and girls. That theory will be discussed in greater detail below, but at this point in the analysis, let’s simply note that the gender gap appears to grow until early adolescence—age 13 on the LTT-NAEP and grade eight on the NAEP Main.
Should these gaps be considered small or large? Many analysts consider 10 scale score points on NAEP equal to about a year of learning. In that light, gaps of five to 10 points appear substantial. But compared to other test score gaps on NAEP, the gender gap is modest in size. On the 2012 LTT-NAEP for nine-year-olds, the five-point gap between boys and girls is about one-half of the 10-point gap between students living in cities and those living in suburbs.[vii] The gap between students who are eligible for free and reduced lunch and those who are not is 28 points; between black and white students, it is 23 points; and between English language learners (ELL) and non-ELL students, it is 34 points.
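The standardized-gap conversion used throughout this section is simple enough to sketch. The snippet below, a minimal illustration in Python, uses the 2012 PISA reading figures cited later in the section: an OECD scale standard deviation of 94, Finnish girls at 556 and boys at 494, and U.S. boys at 482. The 513 used for U.S. girls is inferred from the reported 31-point U.S. gap, not quoted directly in the text.

```python
# Convert a raw score gap into standard deviation (SD) units so that
# gaps from tests with different scales can be compared directly.
def gap_in_sds(female_mean, male_mean, scale_sd):
    return (female_mean - male_mean) / scale_sd

# 2012 PISA reading, scale SD of 94 (figures from the text; the U.S.
# female mean of 513 is inferred from the reported 31-point gap).
finland_gap = gap_in_sds(556, 494, 94)   # 62 points, about 0.66 SDs
us_gap      = gap_in_sds(513, 482, 94)   # 31 points, about 0.33 SDs
```

The same arithmetic underlies the standardized gaps in the table: divide the raw gap by the test’s standard deviation.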
Table 1-1 only shows the size of the gender gap as gauged by assessments at single points in time. For determining trends, let’s take a closer look at the LTT-NAEP, since it provides the longest running record of the gender gap. In Table 1-2, scores are displayed from tests administered since 1971 and given nearest to the starts and ends of decades. Results from 2008 and 2012 are both shown to give readers an idea of recent fluctuations. At all three ages, gender gaps were larger in 1971 than they are today. The change at age nine is statistically significant, but not at age 13 (p=0.10) or age 17 (p=0.07), although both are close. Slight shrinkage occurred in the 1980s, but the gaps expanded again in the 1990s. The gap at age 13 actually peaked at 15 scale score points in 1994 (not shown in the table), and the decline since then is statistically significant. Similarly, the gap at age 17 peaked in 1996 at 15 scale score points, and the decline since then is also statistically significant. More recently, the gap at age nine began to shrink again in 1999, the gap at age 13 began shrinking in the 2000s, and the gap at age 17 in 2012.
Table 1-3 decomposes the change figures by male and female performance. Sara Mead’s point, that the NAEP story is one of both sexes gaining rather than boys falling behind, is even truer today than when she made it in 2006. When Mead’s analysis was published, the most recent LTT-NAEP data were from 2004. Up until then, girls had made greater reading gains than boys. But that situation has reversed. Boys have now made larger gains over the history of the LTT-NAEP, fueled by the gains that they registered from 2004 to 2012. The score for 17-year-old females in 2012 (291) was identical to their score in 1971.
The United States is not alone in having a gender gap in reading, and its gap of 31 points is not even the largest (see Figure 1-1). On the 2012 PISA, all OECD countries exhibited a gender gap, with females outscoring males by 23 to 62 points on the PISA scale (standard deviation of 94). On average in the OECD, girls outscored boys by 38 points (rounded to 515 for girls and 478 for boys). The U.S. gap of 31 points is less than the OECD average.
Finland had the largest gender gap on the 2012 PISA, twice that of the U.S., with females outscoring males by an astonishing 62 points (0.66 SDs). Finnish girls scored 556, and boys scored 494. To put this gap in perspective, consider that Finland’s renowned superiority on PISA tests is completely dependent on Finnish girls. Finnish boys’ score of 494 is about the same as the international average of 496, and not much above the OECD average for males (478). The reading performance of Finnish boys is not statistically significantly different from that of boys in the U.S. (482) or from that of the average U.S. student, boys and girls combined (498). Finnish superiority in reading exists only among females.
There is a hint of a geographical pattern. Northern European countries tend to have larger gender gaps in reading. Finland, Sweden, Iceland, and Norway have four of the six largest gaps. Denmark is the exception, with a 31-point gap that falls below the OECD average. Two Asian OECD members also have small gender gaps: Japan’s gap of 24 points and South Korea’s gap of 23 points rank among the bottom four countries. The Nordic tendency toward large gender gaps in reading was noted in a 2002 analysis of the 2000 PISA results.[viii] At that time, too, Denmark was the exception. Because of the larger sample of countries and its persistence over time, the Nordic pattern warrants more confidence than the pattern in the two Asian countries.
Back to Finland. That’s the headline story here, and it contains a lesson for cautiously interpreting international test scores. Consider that the 62-point gender gap in Finland is only 14 points smaller than the U.S. black-white gap (76 points) and 21 points larger than the white-Hispanic gap (41 points) on the same test. Finland’s gender gap illustrates the superficiality of much of the commentary on that country’s PISA performance. A common procedure in policy analysis is to consider how policies differentially affect diverse social groups. Think of all the commentators who cite Finland to promote particular policies, whether the policies address teacher recruitment, amount of homework, curriculum standards, the role of play in children’s learning, school accountability, or high-stakes assessments.[ix] Advocates pound the table while arguing that these policies are obviously beneficial. “Just look at Finland,” they say. Have you ever read a warning that even if those policies contribute to Finland’s high PISA scores—which the advocates assume but serious policy scholars know to be unproven—the policies also may be having a negative effect on the 50 percent of Finland’s school population that happens to be male?
One of the solutions put forth for improving boys’ reading scores is to make an effort to boost their enjoyment of reading. That certainly makes sense, but past analyses of national reading and math performance have consistently, and counterintuitively, found no relationship (or even an inverse one) between achievement and enjoyment of the two subjects. PISA asks students how much they enjoy reading, so let’s now investigate whether fluctuations in PISA scores are at all correlated with how much 15-year-olds say they like to read.
The analysis below employs what is known as a “differences-in-differences” analytical strategy. In both 2000 and 2009, PISA measured students’ reading ability and asked them several questions about how much they like to read. An enjoyment index was created from the latter set of questions.[x] Girls score much higher on this index than boys. Many commentators believe that girls’ greater enjoyment of reading may be at the root of the gender gap in literacy.
When new international test scores are released, analysts are tempted to look for variables exhibiting strong correlations with achievement (such as amount of time spent on homework) and embrace them as potential causes of high achievement. But cross-sectional correlations can be deceptive. The direction of causality cannot be determined: does doing a lot of homework lead to high achievement, or do good students simply tend to take classes that assign more homework? Correlations in cross-sectional data are also vulnerable to unobserved factors that may influence achievement. For example, if cultural predilections drive a country’s exemplary performance, their influence will be masked or spuriously assigned to other variables unless they are specifically modeled.[xi] Class size, between-school tracking, and time spent on learning are all topics on which differences-in-differences has been fruitfully employed to analyze multiple cross-sections of international data.
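The omitted-variable problem can be illustrated with a small simulation, a sketch with entirely invented data (none of it from PISA; the variable names and effect sizes are assumptions). A stable, unobserved cultural factor in each hypothetical country raises both homework and achievement, so a single cross-section shows a strong homework-achievement correlation even though homework has no causal effect by construction. Correlating changes between two waves differences the stable factor away:

```python
import random
import statistics as stats

random.seed(42)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = stats.fmean(xs), stats.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

n = 200  # hypothetical countries
culture = [random.gauss(0, 1) for _ in range(n)]  # unobserved, fixed over time

def wave():
    # Homework tracks the cultural factor; achievement is driven by the
    # cultural factor alone -- homework has no causal effect here.
    homework = [c + random.gauss(0, 0.5) for c in culture]
    achievement = [500 + 10 * c + random.gauss(0, 3) for c in culture]
    return homework, achievement

hw_2000, ach_2000 = wave()
hw_2009, ach_2009 = wave()

# Cross-section: a large, spurious correlation driven by the omitted factor.
r_cross = pearson(hw_2000, ach_2000)

# Differences-in-differences: the fixed factor cancels out of the changes.
r_diff = pearson([b - a for a, b in zip(hw_2000, hw_2009)],
                 [b - a for a, b in zip(ach_2000, ach_2009)])
```

In this simulation r_cross comes out large while r_diff hovers near zero, which is why analysts difference repeated cross-sections before interpreting such correlations.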
Another benefit of differences-in-differences is that it measures statistical relationships longitudinally. Table 1-4 investigates the question: Is the rise and fall of reading enjoyment correlated with changes in reading achievement? Many believe that if boys liked reading more, their literacy test scores would surely increase. Table 1-4 does not support that belief. Data are available for 27 OECD countries, and they are ranked by how much they boosted males’ enjoyment of reading. The index is set at the student level with a mean of 0.00 and a standard deviation of 1.00. For the twenty-seven nations in Table 1-4, the mean national change in enjoyment is 0.02 with a standard deviation of 0.09.
Germany did the best job of raising boys’ enjoyment of reading, with a gain of 0.12 on the index. German males’ PISA scores also went up—a little more than 10 points (10.33). France, on the other hand, raised males’ enjoyment of reading nearly as much as Germany (0.11), but French males’ PISA scores declined by 15.26 points. A bit further down the column, Ireland managed to get boys to enjoy reading a little more (a gain of 0.05), but their reading performance fell a whopping 36.54 points. Toward the bottom end of the list, Poland’s boys enjoyed reading less in 2009 than in 2000, a decline of 0.14 on the index, but over the same time span their reading literacy scores increased by more than 14 points (14.29). Among the countries in which the relationship goes in the expected direction is Finland. Finnish males’ enjoyment of reading declined (−0.14), as did their PISA scores in reading literacy (−11.73). Overall, the correlation coefficient for change in enjoyment and change in reading score is 0.01, indicating no relationship between the two.
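As a rough check on the arithmetic, the change scores quoted in this section can be correlated directly. This sketch covers only the handful of countries named in the surrounding text (the Canada and Australia figures appear in the next paragraph), not the full 27-country sample, so its coefficient will not match the 0.01 reported for the complete data:

```python
import statistics as stats

# Change from 2000 to 2009 in (boys' enjoyment index, boys' PISA reading
# score) for the countries quoted in the text; the published analysis
# covers 27 OECD countries, so this subset is illustrative only.
changes = {
    "Germany":   (0.12,  10.33),
    "France":    (0.11, -15.26),
    "Ireland":   (0.05, -36.54),
    "Poland":    (-0.14, 14.29),
    "Finland":   (-0.14, -11.73),
    "Canada":    (0.02, -11.74),
    "Australia": (-0.04, -16.50),
}

d_enjoy, d_score = zip(*changes.values())

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = stats.fmean(xs), stats.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Correlation of change-in-enjoyment with change-in-score for this subset.
r = pearson(d_enjoy, d_score)
```

Even in this subset the relationship is weak and, if anything, slightly negative: boosting enjoyment shows no systematic payoff in scores.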
Christina Hoff Sommers and Richard Whitmire have praised specific countries for first recognizing and then addressing the gender gap in reading. Recently, Sommers urged the U.S. to “follow the example of the British, Canadians, and Australians.”[xii] Whitmire described Australia as “years ahead of the U.S. in pioneering solutions” to the gender gap. Let’s see how those countries appear in Table 1-4. England does not have PISA data for the 2000 baseline year, but both Canada and Australia are included. Canada raised boys’ enjoyment of reading a little bit (0.02), but Canadian males’ scores fell by about 12 points (−11.74). Australia suffered declines in both boys’ enjoyment of reading (−0.04) and achievement (−16.50). As promising as these countries’ efforts may have appeared a few years ago, so far at least, they have not borne fruit in raising boys’ reading performance on PISA.
Achievement gaps are tricky because it is possible for the test scores of the two groups being compared to both decline while the gap increases or, conversely, for scores of both to increase while the gap declines. Table 1-4 only looks at males’ enjoyment of reading and its relationship to achievement. A separate differences-in-differences analysis was conducted (but not displayed here) to see whether changes in the enjoyment gap—the difference between boys’ and girls’ enjoyment of reading—are related to changes in reading achievement. They are not (correlation coefficient of 0.08). National PISA data simply do not support the hypothesis that the superior reading performance of girls is related to the fact that girls enjoy reading more than boys.
Let’s summarize the main findings of the analysis above. Reading scores for girls exceed those for boys on eight recent assessments of U.S. reading achievement. The gender gap is larger for middle and high school students than for students in elementary school. The gap was apparent on the earliest NAEP tests in the 1970s and has shown some signs of narrowing in the past decade. International tests reveal that the gender gap is worldwide. Among OECD countries, it appears even in countries known for superior performance on PISA’s reading test. Finland not only exhibited the largest gender gap in reading on the 2012 PISA, but the gap had also widened since 2000. A popular recommendation for boosting boys’ reading performance is finding ways for them to enjoy reading more. That theory is not supported by PISA data. Countries that succeeded in raising boys’ enjoyment of reading from 2000 to 2009 were no more likely to improve boys’ reading performance than countries where boys’ enjoyment of reading declined.
The origins of the gender gap are hotly debated. The universality of the gap certainly supports the argument that it originates in biological or developmental differences between the two sexes. It is evident among students of different ages in data collected at different points in time. It exists across the globe, in countries with different educational systems, different popular cultures, different child rearing practices, and different conceptions of gender roles. Moreover, the greater prevalence of reading impairment among young boys—a ratio of two or three to one—suggests an endemic difficulty that exists before the influence of schools or culture can take hold.^{[xiii] }
But some of the data examined above also argue against the developmental explanation. The gap has been shrinking on NAEP. At age nine, it is less than half of what it was forty years ago. Biology doesn’t change that fast. Gender gaps in math and science, which were apparent in achievement data for a long time, have all but disappeared, especially once course taking is controlled. The reading gap also seems to evaporate by adulthood. On an international assessment of adults conducted in 2012, reading scores for men and women were statistically indistinguishable up to age 35—even in Finland and the United States. After age 35, men had statistically significantly higher scores in reading, all the way to the oldest group, age 55 and older. If the gender gap in literacy is indeed shaped by developmental factors, it may be important for our understanding of the phenomenon to scrutinize periods of the life cycle beyond the age of schooling.
Another astonishing pattern emerged from the study of adult reading. Participants were asked how often they read a book. Of avid book readers (those who said they read a book once a week) in the youngest group (age 24 and younger), 59 percent were women and 41 percent were men. By age 55, avid book readers were even more likely to be women, by a margin of 63 percent to 37 percent. Two-thirds of respondents who said they never read books were men. Women remained the more enthusiastic readers even as the test scores of men caught up with those of women and surpassed them.
A few years ago, Ian McEwan, the celebrated English novelist, decided to reduce the size of the library in his London townhouse. He and his younger son selected thirty novels and took them to a local park. They offered the books to passersby. Women were eager and grateful to take the books, McEwan reports. Not a single man accepted. The author’s conclusion? “When women stop reading, the novel will be dead.”^{[xiv] }
McEwan might be right, regardless of the origins of the gender gap in reading and the efforts to end it.
[i] J.B. Stroud and E.F. Lindquist, “Sex differences in achievement in the elementary and secondary schools,” Journal of Educational Psychology, vol. 33(9) (Washington, D.C.: American Psychological Association, 1942), 657-667.
[ii] Christina Hoff Sommers, The War Against Boys: How Misguided Feminism Is Harming Our Young Men (New York, NY: Simon & Schuster, 2000).
[iii] Christianne Corbett, Catherine Hill, and Andresse St. Rose, Where the Girls Are: The Facts About Gender Equity in Education (Washington, D.C.: American Association of University Women, 2008).
[iv] Richard Whitmire, Why Boys Fail: Saving Our Sons from an Educational System That’s Leaving Them Behind (New York, NY: AMACOM, 2010).
[v] Sara Mead, The Evidence Suggests Otherwise: The Truth About Boys and Girls (Washington, D.C.: Education Sector, 2006).
[vi] PIRLS and PISA assess different reading skills. Performance on the two tests may not be comparable.
[vii] NAEP categories were aggregated to calculate the city/suburb difference.
[viii] OECD, Reading for Change: Performance and Engagement Across Countries (Paris: OECD, 2002), 125.
[ix] The best example of promoting Finnish education policies is Pasi Sahlberg’s Finnish Lessons: What Can the World Learn from Educational Change in Finland? (New York: Teachers College Press, 2011).
[x] The 2009 endpoint was selected because 2012 data for the enjoyment index were not available on the NCES PISA data tool.
[xi] A formal name for the problem of reverse causality is endogeneity and for the problem of unobserved variables, omitted variable bias.
[xii] Christina Hoff Sommers, “The Boys at the Back,” New York Times, February 2, 2013; Richard Whitmire, Why Boys Fail (New York: AMACOM, 2010), 153.
[xiii] J.L. Hawke, R.K. Olson, E.G. Willcutt, S.J. Wadsworth, & J.C. DeFries, “Gender ratios for reading difficulties,” Dyslexia 15(3), (Chichester, England: Wiley, 2009), 239–242.
[xiv] Daniel Zalewski, “The Background Hum: Ian McEwan’s art of unease,” The New Yorker, February 23, 2009.
Editor's Note: The introduction to the 2015 Brown Center Report on American Education appears below. Use the Table of Contents to navigate through the report online, or download a PDF of the full report.
TABLE OF CONTENTS
Part I: Girls, Boys, and Reading
Part II: Measuring Effects of the Common Core
The 2015 Brown Center Report (BCR) represents the 14^{th} edition of the series since the first issue was published in 2000. It includes three studies. Like all previous BCRs, the studies explore independent topics but share two characteristics: they are empirical and based on the best evidence available. The studies in this edition are on the gender gap in reading, the impact of the Common Core State Standards for English Language Arts on reading achievement, and student engagement.
Part one examines the gender gap in reading. Girls outscore boys on practically every reading test given to a large population. And they have for a long time. A 1942 Iowa study found girls performing better than boys on tests of reading comprehension, vocabulary, and basic language skills. Girls have outscored boys on every reading test ever given by the National Assessment of Educational Progress (NAEP)—the first long-term trend test was administered in 1971—at ages nine, 13, and 17. The gap is not confined to the U.S. Reading tests administered as part of the Progress in International Reading Literacy Study (PIRLS) and the Program for International Student Assessment (PISA) reveal that the gender gap is a worldwide phenomenon. In more than sixty countries participating in the two assessments, girls are better readers than boys.
Perhaps the most surprising finding is that Finland, celebrated for its extraordinary performance on PISA for over a decade, can take pride in its high standing on the PISA reading test solely because of the performance of that nation’s young women. With its 62-point gap, Finland has the largest gender gap of any PISA participant, with girls scoring 556 and boys scoring 494 points (the OECD average is 496, with a standard deviation of 94). If Finland were only a nation of young men, its PISA ranking would be mediocre.
Part two is about reading achievement, too. More specifically, it’s about reading and the English Language Arts standards of the Common Core (CCSS-ELA). It’s also about an important decision that policy analysts must make when evaluating public policies—the determination of when a policy begins. How can CCSS be properly evaluated?
Two different indexes of CCSS-ELA implementation are presented, one based on 2011 data and the other on data collected in 2013. In both years, state education officials were surveyed about their Common Core implementation efforts. Because forty-six states originally signed on to the CCSS-ELA—and with at least forty still on track for full implementation by 2016—little variability exists among the states in terms of standards policy. Of course, the four states that never adopted CCSS-ELA can serve as a small control group. But variation is also found in how the states are implementing CCSS. Some states are pursuing an array of activities and aiming for full implementation earlier rather than later. Others have a narrow, targeted implementation strategy and are proceeding more slowly.
The analysis investigates whether CCSS-ELA implementation is related to 2009-2013 gains on the fourth grade NAEP reading test. The analysis cannot verify causal relationships between the two variables, only correlations. States that have aggressively implemented CCSS-ELA (referred to as “strong” implementers in the study) evidence a one to one and one-half point larger gain on the NAEP scale compared to non-adopters of the standards. This association is similar in magnitude to an advantage found in a study of eighth grade math achievement in last year’s BCR. Although positive, these effects are quite small. When the 2015 NAEP results are released this winter, it will be important for the fate of the Common Core project to see if strong implementers of the CCSS-ELA can maintain their momentum.
Part three is on student engagement. PISA tests fifteen-year-olds on three subjects—reading, math, and science—every three years. It also collects a wealth of background information from students, including their attitudes toward school and learning. When the 2012 PISA results were released, PISA analysts published an accompanying volume, Ready to Learn: Students’ Engagement, Drive, and Self-Beliefs, exploring topics related to student engagement.
Part three provides secondary analysis of several dimensions of engagement found in the PISA report. Intrinsic motivation, the internal rewards that encourage students to learn, is an important component of student engagement. National scores on PISA’s index of intrinsic motivation to learn mathematics are compared to national PISA math scores. Surprisingly, the relationship is negative. Countries with highly motivated kids tend to score lower on the math test; conversely, higher-scoring nations tend to have less-motivated kids.
The same is true for responses to the statements, “I do mathematics because I enjoy it,” and “I look forward to my mathematics lessons.” Countries with students who say that they enjoy math or look forward to their math lessons tend to score lower on the PISA math test compared to countries where students respond negatively to the statements. These counterintuitive findings may be influenced by how terms such as “enjoy” and “looking forward” are interpreted in different cultures. Within-country analyses address that problem. The correlation coefficients for within-country, student-level associations of achievement and other components of engagement run in the anticipated direction—they are positive. But they are also modest in size, with correlation coefficients of 0.20 or less.
Policymakers are interested in questions requiring analysis of aggregated data—at the national level, that means between-country data. When countries increase their students’ intrinsic motivation to learn math, is there a concomitant increase in PISA math scores? Data from 2003 to 2012 are examined. Seventeen countries managed to increase student motivation, but their PISA math scores fell an average of 3.7 scale score points. Fourteen countries showed no change on the index of intrinsic motivation—and their PISA scores also evidenced little change. Eight countries witnessed a decline in intrinsic motivation. Inexplicably, their PISA math scores increased by an average of 10.3 scale score points. Motivation down, achievement up.
Correlation is not causation. Moreover, the absence of a positive correlation—or in this case, the presence of a negative correlation—is not refutation of a possible positive relationship. The lesson here is not that policymakers should adopt the most effective way of stamping out student motivation. The lesson is that the level of analysis matters when analyzing achievement data. Policy reports must be read warily—especially those freely offering policy recommendations. Beware of analyses that exclusively rely on within- or between-country test data without making any attempt to reconcile discrepancies at other levels of analysis. Those analysts could be cherry-picking the data. Also, consumers of education research should grant more credence to approaches modeling change over time (as in difference-in-differences models) than to cross-sectional analyses that only explore statistical relationships at a single point in time.
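The level-of-analysis point can be made concrete with a toy calculation. The motivation and score values below are invented purely to illustrate the arithmetic, not drawn from PISA: within every country, more motivated students score higher, yet the country averages run the other way.

```python
# Sketch of how the level of analysis can flip a correlation's sign.
# All numbers are invented for illustration; they are not PISA data.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# (motivation, score) pairs for students in three hypothetical countries.
countries = {
    "A": [(1, 600), (2, 610), (3, 620)],  # high-scoring, low motivation
    "B": [(3, 500), (4, 510), (5, 520)],
    "C": [(5, 400), (6, 410), (7, 420)],  # low-scoring, high motivation
}

# Within each country, the student-level correlation is positive...
for name, students in countries.items():
    m, s = zip(*students)
    print(name, round(pearson(m, s), 2))  # prints 1.0 for each country

# ...but the country means run the opposite way.
means_m = [sum(m for m, _ in st) / len(st) for st in countries.values()]
means_s = [sum(s for _, s in st) / len(st) for st in countries.values()]
print(round(pearson(means_m, means_s), 2))  # prints -1.0 between countries
```

An analyst looking only at the bottom line would conclude motivation hurts achievement; one looking only inside countries would conclude the opposite. Reconciling the two levels is exactly the caution the paragraph above urges.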
A curriculum controversy is roiling schools in the San Francisco Bay Area. In the past few months, parents in the San Mateo-Foster City School District, located just south of San Francisco International Airport, voiced concerns over changes to the middle school math program. The changes were brought about by the Common Core State Standards (CCSS). Under previous policies, most eighth graders in the district took algebra I. Some very sharp math students, who had already completed algebra I in seventh grade, took geometry in eighth grade. The new CCSS-aligned math program will reduce eighth grade enrollments in algebra I and eliminate geometry altogether as a middle school course.
A little background information will clarify the controversy. Eighth grade mathematics may be the single grade-subject combination most profoundly affected by the CCSS. In California, the push for most students to complete algebra I by the end of eighth grade has been a centerpiece of state policy, as it has been in several states influenced by the “Algebra for All” movement that began in the 1990s. Nationwide, in 1990, about 16 percent of all eighth graders reported that they were taking an algebra or geometry course. In 2013, the number was three times larger, and nearly half of all eighth graders (48 percent) were taking algebra or geometry.^{[i]} When that percentage goes down, as it is sure to under the CCSS, what happens to high-achieving math students?
The parents who are expressing the most concern have kids who excel at math. One parent in San Mateo-Foster City told The San Mateo Daily Journal, “This is really holding the advanced kids back.”^{[ii]} The CCSS math standards recommend a single math course for seventh grade, integrating several math topics, followed by a similarly integrated math course in eighth grade. Algebra I won’t be offered until ninth grade. The San Mateo-Foster City School District decided to adopt a “three years into two” accelerated option. This strategy is suggested on the Common Core website as an option that districts may consider for advanced students. It combines the curriculum from grades seven through nine (including algebra I) into a two-year offering that students can take in seventh and eighth grades.^{[iii]} The district will also provide—at one school site—a sequence beginning in sixth grade that compacts four years of math into three. Both accelerated options culminate in the completion of algebra I in eighth grade.
The San Mateo-Foster City School District is home to many well-educated, high-powered professionals who work in Silicon Valley. They are unrelentingly liberal in their politics. Equity is a value they hold dear.^{[iv]} They also know that completing at least one high school math course in middle school is essential for students who wish to take AP Calculus in their senior year of high school. As CCSS is implemented across the nation, administrators in districts with demographic profiles similar to San Mateo-Foster City will face parents of mathematically precocious kids asking whether the “common” in Common Core mandates that all students take the same math course. Many of those districts will respond to their constituents and provide accelerated pathways (“pathway” is CCSS jargon for course sequence).
But other districts will not. Data show that urban schools, schools with large numbers of black and Hispanic students, and schools located in impoverished neighborhoods are reluctant to differentiate curriculum. It is unlikely that gifted math students in those districts will be offered an accelerated option under CCSS. The reason why can be summed up in one word: tracking.
Tracking in eighth grade math means providing different courses to students based on their prior math achievement. The term “tracking” has been stigmatized, coming under fire for being inequitable. Historically, where tracking existed, black, Hispanic, and disadvantaged students were often underrepresented in high-level math classes; white, Asian, and middle-class students were often overrepresented. An anti-tracking movement gained a full head of steam in the 1980s. Tracking reformers knew that persuading high schools to detrack was hopeless. Consequently, tracking’s critics focused reform efforts on middle schools, urging that they group students heterogeneously with all students studying a common curriculum. That approach took hold in urban districts, but not in the suburbs.
Now the Common Core and detracking are linked. Providing an accelerated math track for high achievers has become a flashpoint throughout the San Francisco Bay Area. An October 2014 article in The San Jose Mercury News named Palo Alto, Saratoga, Cupertino, Pleasanton, and Los Gatos as districts that have announced, in response to parent pressure, that they are maintaining an accelerated math track in middle schools. These are high-achieving, suburban districts. Los Gatos parents took to the internet with a petition drive when a rumor spread that advanced courses would end. Ed Source reports that 900 parents signed a petition opposing the move and board meetings on the issue were packed with opponents. The accelerated track was kept. Piedmont established a single track for everyone, but allowed parents to apply for an accelerated option. About twenty-five percent did so. The Mercury News story underscores the demographic pattern that is unfolding and asks whether CCSS “could cement a two-tier system, with accelerated math being the norm in wealthy areas and the exception elsewhere.”
What is CCSS’s real role here? Does the Common Core take an explicit stand on tracking? Not really. But detracking advocates can interpret the “common” in Common Core as license to eliminate accelerated tracks for high achievers. As a noted CCSS supporter (and tracking critic), William H. Schmidt, has stated, “By insisting on common content for all students at each grade level and in every community, the Common Core mathematics standards are in direct conflict with the concept of tracking.”^{[v]} Thus, tracking joins other controversial curricular ideas—e.g., integrated math courses instead of courses organized by content domains such as algebra and geometry; an emphasis on “deep,” conceptual mathematics over learning procedures and basic skills—as “dog whistles” embedded in the Common Core. Controversial positions aren’t explicitly stated, but they can be heard by those who want to hear them.
CCSS doesn’t have to take an outright stand on these debates in order to have an effect on policy. For the practical questions that local grouping policies resolve—who takes what courses and when do they take them—CCSS wipes the slate clean. There are plenty of people ready to write on that blank slate, particularly administrators frustrated by unsuccessful efforts to detrack in the past.
Suburban parents are mobilized in defense of accelerated options for advantaged students. What about kids who are outstanding math students but also happen to be poor, black, or Hispanic? What happens to them, especially if they attend schools in which the top institutional concern is meeting the needs of kids functioning several years below grade level? I presented a paper on this question at a December 2014 conference held by the Fordham Institute in Washington, DC. I proposed a pilot program of “tracking for equity.” By that term, I mean offering black, Hispanic, and poor high achievers the same opportunity that the suburban districts in the Bay Area are offering. High-achieving middle school students in poor neighborhoods would be able to take three years of math in two years and proceed on a path toward AP Calculus as high school seniors.
It is true that tracking must be done carefully. Tracking can be conducted unfairly and has been used unjustly in the past. One of the worst consequences of earlier forms of tracking was that low-skilled students were tracked into dead-end courses that did nothing to help them academically. These low-skilled students were disproportionately from disadvantaged communities or communities of color. That’s not a danger in the proposal I am making. The default curriculum, the one every student would take if not taking the advanced track, would be the Common Core. If that’s a dead end for low achievers, Common Core supporters need to start being more honest in how they are selling the CCSS. Moreover, to ensure that the policy gets to the students for whom it is intended, I have proposed running the pilot program in schools predominantly populated by poor, black, or Hispanic students. The pilot won’t promote segregation within schools because the sad reality is that participating schools are already segregated.
Since I presented the paper, I have privately received negative feedback from both Algebra for All advocates and Common Core supporters. That’s disappointing. Because of their animus toward tracking, some critics seem to support a severe policy swing from Algebra for All, which was pursued for equity, to Algebra for None, which will be pursued for equity. It’s as if either everyone or no one should be allowed to take algebra in eighth grade. The argument is that allowing only some eighth graders to enroll in algebra is elitist, even if the students in question are poor students of color who are prepared for the course and likely to benefit from taking it.
The controversy raises crucial questions about the Common Core. What’s common in the common core? Is it the curriculum? And does that mean the same curriculum for all? Will CCSS serve as a curricular floor, ensuring all students are exposed to a common body of knowledge and skills? Or will it serve as a ceiling, limiting the progress of bright students so that their achievement looks more like that of their peers? These questions will be answered differently in different communities, and as they are, the inequities that Common Core supporters think they’re addressing may surface again in a profound form.
[i] Loveless, T. (2008). The 2008 Brown Center Report on American Education. Retrieved from http://www.brookings.edu/research/reports/2009/02/25educationloveless. For San Mateo-Foster City’s sequence of math courses, see: page 10 of http://smfcca.schoolloop.com/file/1383373423032/1229222942231/1242346905166154769.pdf
[ii] Swartz, A. (2014, November 22). “Parents worry over losing advanced math classes: San Mateo-Foster City Elementary School District revamps offerings because of Common Core.” San Mateo Daily Journal. Retrieved from http://www.smdailyjournal.com/articles/lnews/20141122/parentsworryoverlosingadvancedmathclassessanmateofostercityelementaryschooldistrictrevampsofferingsbecauseofcommoncore/1776425133822.html
[iii] Swartz, A. (2014, December 26). “Changing Classes Concern for parents, teachers: Administrators say Common Core Standards Reason for Modifications.” San Mateo Daily Journal. Retrieved from http://www.smdailyjournal.com/articles/lnews/20141226/changingclassesconcernforparentsteachersadministratorssaycommoncorestandardsreasonformodifications/1776425135624.html
[iv] In the 2014 election, Jerry Brown (D) took 75% of Foster City’s votes for governor. In the 2012 presidential election, Barack Obama received 71% of the vote. http://www.citydata.com/city/FosterCityCalifornia.html
[v] Schmidt, W.H. and Burroughs, N.A. (2012) “How the Common Core Boosts Quality and Equality.” Educational Leadership, December 2012/January 2013. Vol. 70, No. 4, pp. 54-58.
A curriculum controversy is roiling schools in the San Francisco Bay Area. In the past few months, parents in the San MateoFoster City School District, located just south of San Francisco International Airport, voiced concerns over changes to the middle school math program. The changes were brought about by the Common Core State Standards (CCSS). Under previous policies, most eighth graders in the district took algebra I. Some very sharp math students, who had already completed algebra I in seventh grade, took geometry in eighth grade. The new CCSSaligned math program will reduce eighth grade enrollments in algebra I and eliminate geometry altogether as a middle school course.
A little background information will clarify the controversy. Eighth grade mathematics may be the single grade-subject combination most profoundly affected by the CCSS. In California, the push for most students to complete algebra I by the end of eighth grade has been a centerpiece of state policy, as it has been in several states influenced by the “Algebra for All” movement that began in the 1990s. Nationwide, in 1990, about 16 percent of all eighth graders reported that they were taking an algebra or geometry course. In 2013, the number was three times larger, and nearly half of all eighth graders (48 percent) were taking algebra or geometry.^{[i]} When that percentage goes down, as it is sure to under the CCSS, what happens to high-achieving math students?
The parents who are expressing the most concern have kids who excel at math. One parent in San Mateo-Foster City told The San Mateo Daily Journal, “This is really holding the advanced kids back.”^{[ii]} The CCSS math standards recommend a single math course for seventh grade, integrating several math topics, followed by a similarly integrated math course in eighth grade. Algebra I won’t be offered until ninth grade. The San Mateo-Foster City School District decided to adopt a “three years into two” accelerated option. This strategy is suggested on the Common Core website as an option that districts may consider for advanced students. It combines the curriculum from grades seven through nine (including algebra I) into a two-year offering that students can take in seventh and eighth grades.^{[iii]} The district will also provide—at one school site—a sequence beginning in sixth grade that compacts four years of math into three. Both accelerated options culminate in the completion of algebra I in eighth grade.
The San Mateo-Foster City School District is home to many well-educated, high-powered professionals who work in Silicon Valley. They are unrelentingly liberal in their politics. Equity is a value they hold dear.^{[iv]} They also know that completing at least one high school math course in middle school is essential for students who wish to take AP Calculus in their senior year of high school. As CCSS is implemented across the nation, administrators in districts with demographic profiles similar to San Mateo-Foster City will face parents of mathematically precocious kids asking whether the “common” in Common Core mandates that all students take the same math course. Many of those districts will respond to their constituents and provide accelerated pathways (“pathway” is CCSS jargon for course sequence).
But other districts will not. Data show that urban schools, schools with large numbers of black and Hispanic students, and schools located in impoverished neighborhoods are reluctant to differentiate curriculum. It is unlikely that gifted math students in those districts will be offered an accelerated option under CCSS. The reason why can be summed up in one word: tracking.
Tracking in eighth grade math means providing different courses to students based on their prior math achievement. The term “tracking” has been stigmatized, coming under fire for being inequitable. Historically, where tracking existed, black, Hispanic, and disadvantaged students were often underrepresented in high-level math classes; white, Asian, and middle-class students were often overrepresented. An anti-tracking movement gained a full head of steam in the 1980s. Tracking reformers knew that persuading high schools to detrack was hopeless. Consequently, tracking’s critics focused reform efforts on middle schools, urging that they group students heterogeneously with all students studying a common curriculum. That approach took hold in urban districts, but not in the suburbs.
Now the Common Core and detracking are linked. Providing an accelerated math track for high achievers has become a flashpoint throughout the San Francisco Bay Area. An October 2014 article in The San Jose Mercury News named Palo Alto, Saratoga, Cupertino, Pleasanton, and Los Gatos as districts that have announced, in response to parent pressure, that they are maintaining an accelerated math track in middle schools. These are high-achieving, suburban districts. Los Gatos parents took to the internet with a petition drive when a rumor spread that advanced courses would end. Ed Source reports that 900 parents signed a petition opposing the move and board meetings on the issue were packed with opponents. The accelerated track was kept. Piedmont established a single track for everyone, but allowed parents to apply for an accelerated option. About twenty-five percent did so. The Mercury News story underscores the demographic pattern that is unfolding and asks whether CCSS “could cement a two-tier system, with accelerated math being the norm in wealthy areas and the exception elsewhere.”
What is CCSS’s real role here? Does the Common Core take an explicit stand on tracking? Not really. But detracking advocates can interpret the “common” in Common Core as license to eliminate accelerated tracks for high achievers. As a noted CCSS supporter (and tracking critic), William H. Schmidt, has stated, “By insisting on common content for all students at each grade level and in every community, the Common Core mathematics standards are in direct conflict with the concept of tracking.”^{[v]} Thus, tracking joins other controversial curricular ideas—e.g., integrated math courses instead of courses organized by content domains such as algebra and geometry; an emphasis on “deep,” conceptual mathematics over learning procedures and basic skills—as “dog whistles” embedded in the Common Core. Controversial positions aren’t explicitly stated, but they can be heard by those who want to hear them.
CCSS doesn’t have to take an outright stand on these debates in order to have an effect on policy. For the practical questions that local grouping policies resolve—who takes which courses and when they take them—CCSS wipes the slate clean. There are plenty of people ready to write on that blank slate, particularly administrators frustrated by unsuccessful efforts to detrack in the past.
Suburban parents are mobilized in defense of accelerated options for advantaged students. What about kids who are outstanding math students but also happen to be poor, black, or Hispanic? What happens to them, especially if they attend schools in which the top institutional concern is meeting the needs of kids functioning several years below grade level? I presented a paper on this question at a December 2014 conference held by the Fordham Institute in Washington, DC. I proposed a pilot program of “tracking for equity.” By that term, I mean offering black, Hispanic, and poor high achievers the same opportunity that the suburban districts in the Bay Area are offering. High-achieving middle school students in poor neighborhoods would be able to take three years of math in two years and proceed on a path toward AP Calculus as high school seniors.
It is true that tracking must be done carefully. Tracking can be conducted unfairly and has been used unjustly in the past. One of the worst consequences of earlier forms of tracking was that low-skilled students were tracked into dead-end courses that did nothing to help them academically. These low-skilled students were disproportionately from disadvantaged communities or communities of color. That’s not a danger in the proposal I am making. The default curriculum, the one every student would take if not taking the advanced track, would be the Common Core. If that’s a dead end for low achievers, Common Core supporters need to start being more honest in how they are selling the CCSS. Moreover, to ensure that the policy gets to the students for whom it is intended, I have proposed running the pilot program in schools predominantly populated by poor, black, or Hispanic students. The pilot won’t promote segregation within schools because the sad reality is that participating schools are already segregated.
Since I presented the paper, I have privately received negative feedback from both Algebra for All advocates and Common Core supporters. That’s disappointing. Because of their animus toward tracking, some critics seem to support a severe policy swing from Algebra for All, which was pursued for equity, to Algebra for None, which will be pursued for equity. It’s as if either everyone or no one should be allowed to take algebra in eighth grade. The argument is that allowing only some eighth graders to enroll in algebra is elitist, even if the students in question are poor students of color who are prepared for the course and likely to benefit from taking it.
The controversy raises crucial questions about the Common Core. What’s common in the common core? Is it the curriculum? And does that mean the same curriculum for all? Will CCSS serve as a curricular floor, ensuring all students are exposed to a common body of knowledge and skills? Or will it serve as a ceiling, limiting the progress of bright students so that their achievement looks more like that of their peers? These questions will be answered differently in different communities, and as they are, the inequities that Common Core supporters think they’re addressing may surface again in a profound form.
[i] Loveless, T. (2008). The 2008 Brown Center Report on American Education. Retrieved from http://www.brookings.edu/research/reports/2009/02/25educationloveless. For San Mateo-Foster City’s sequence of math courses, see: page 10 of http://smfcca.schoolloop.com/file/1383373423032/1229222942231/1242346905166154769.pdf
[ii] Swartz, A. (2014, November 22). “Parents worry over losing advanced math classes: San Mateo-Foster City Elementary School District revamps offerings because of Common Core.” San Mateo Daily Journal. Retrieved from http://www.smdailyjournal.com/articles/lnews/20141122/parentsworryoverlosingadvancedmathclassessanmateofostercityelementaryschooldistrictrevampsofferingsbecauseofcommoncore/1776425133822.html
[iii] Swartz, A. (2014, December 26). “Changing Classes Concern for parents, teachers: Administrators say Common Core Standards Reason for Modifications.” San Mateo Daily Journal. Retrieved from http://www.smdailyjournal.com/articles/lnews/20141226/changingclassesconcernforparentsteachersadministratorssaycommoncorestandardsreasonformodifications/1776425135624.html
[iv] In the 2014 election, Jerry Brown (D) took 75% of Foster City’s votes for governor. In the 2012 presidential election, Barack Obama received 71% of the vote. http://www.citydata.com/city/FosterCityCalifornia.html
[v] Schmidt, W.H. and Burroughs, N.A. (2012) “How the Common Core Boosts Quality and Equality.” Educational Leadership, December 2012/January 2013. Vol. 70, No. 4, pp. 54-58.
Edutourism is not new. For American education professors in the 1920s, nothing certified one’s progressive credentials like a trip to the Soviet Union. Diane Ravitch presents a vivid account in Left Back: A Century of Failed School Reforms. She describes how John Dewey, the most famous progressive educator of the era, visited Soviet schools in 1928 and returned full of admiration. He appreciated the emphasis on collectivism over individualism and the ease with which schools integrated curricula with the goals of society. One activity that he singled out for praise was sending students into the community to educate and help “ignorant adults to understand the policies of local soviets.” William Heard Kilpatrick, father of the project method, toured Russian schools in 1929. He applauded the ubiquitous use of project-based learning in Soviet classrooms, noting that “down to the smallest detail in the school curriculum, every item is planned to further the Soviet plan of society.” Educator and political activist George Counts shipped a Ford sedan to Leningrad and set out on a three-month tour, extolling the role Soviet schools played in “the greatest social experiment in history.”^{[i]}
In hindsight these scholars seem incredibly naïve. Soviet schools were indeed an extension of the state, but as such, they served as indoctrination centers for one of history’s most monstrous regimes. Stalin’s plan for society was enforced by a huge secret police force and included the mass execution of political opponents, the forced starvation of millions of peasants, and a vast network of prison camps (gulags) erected to house slave labor.
To their credit, Dewey and Kilpatrick turned on Stalinism. Counts held on longer, even praising Stalin’s Five-Year Plan as a “brilliant and heroic success.” In 1932-1933, as the first Five-Year Plan transitioned into the second, an estimated 25,000 Ukrainians died daily of starvation from the forced famine that Stalin imposed on the region. Later, Counts would recognize Stalin’s schools as tools of totalitarianism, and he became, in one biographer’s words, “a determined opponent of Soviet ideology.”^{[ii]}
Today we have a new outbreak of edutourism. American adventurers have fanned out across the globe to bring back to the United States the lessons of other school systems. Thomas L. Friedman of the New York Times visited Shanghai schools on a junket organized by Teach for All, an offshoot of Teach for America, and declared “I think I found The Secret”—The Secret being how Shanghai scored at the top on the 2009 PISA tests. After declaring, “there is no secret,” Friedman fell back on some stock explanations for high achievement, focusing in particular on changing how teachers are trained and reorganizing their work day to allow for less instruction, more professional development, and ample time for peer interaction. Elizabeth Green, author and editor-in-chief of Chalkbeat, toured schools in Japan, and she, too, embraced the idea that the key to better teaching could be informed by observing classrooms abroad. For Green, lesson study and resurrecting controversial pedagogical reforms from the 1980s and 1990s would surely boost mathematics learning. Finland has been swamped with edutourists, spurred primarily by that nation’s illustrious PISA scores. The Education Ministry of Finland hosted at least 100 delegations from 40 to 45 countries per year from 2005 to 2011.
International tests identify the highest-scoring nations in the world. What’s wrong with visitors going to top-performing nations and seeing with their own eyes what schools are doing? Contemporary edutourists aren’t blinded by political ideology in the same way as Dewey and his colleagues were. So what’s the problem? The short answer: Edutourism might produce good journalism, but it also tends to produce very bad social science.
The people named in the paragraphs above are incredibly smart. But they succumbed to the worst folly of edutourism. Three perils, explained below, mislead edutourists into believing that what they observe in a particular nation’s classrooms is causally related to that country’s impressive academic achievement.
Singling out a top-achieving country—or state or district or school or teacher or some other “subject”—and then generalizing from what this top performer does is known as selecting on the dependent variable. The dependent variable, in this case, is achievement. To look for patterns of behavior that may explain achievement, a careful analyst examines subjects across the distribution—middling and poor performers as well as those at the top. That way, if a particular activity—let’s call one “Teaching Strategy X”—is found a lot at the top, not as much in the middle, and rarely or not at all at the bottom, the analyst can say that Teaching Strategy X is positively correlated with performance. That doesn’t mean it causes high achievement—even high school statistics students are taught “correlation is not causation”—only that the two variables go up and down together.
Edutourists routinely select on the dependent variable. They go to countries at the top of international assessments, such as Finland and Japan. They never go to countries in the middle or at the bottom of the distribution. If they did and found Teaching Strategy X used frequently among low performers, the positive correlation would evaporate—and they would have to seriously question whether Teaching Strategy X has any relationship with achievement.
Jay P. Greene concisely describes the problem in a review of Marc Tucker’s book, Surpassing Shanghai: An Agenda for American Education Built on the World’s Leading Systems. Tucker uses a “best practices” approach in the book, an analytical strategy that marks his entire career. Tucker describes what top-performing nations do in education and builds an agenda based on the practices that he thinks made the nations successful.
Here’s Greene’s critique:
The fundamental flaw of a “best practices” approach, as any student in a half-decent research-design course would know, is that it suffers from what is called “selection on the dependent variable.” If you only look at successful organizations, then you have no variation in the dependent variable: they all have good outcomes. When you look at the things that successful organizations are doing, you have no idea whether each one of those things caused the good outcomes, had no effect on success, or was actually an impediment that held organizations back from being even more successful. An appropriate research design would have variation in the dependent variable; some have good outcomes and some have bad ones. To identify factors that contribute to good outcomes, you would, at a minimum, want to see those factors more likely to be present where there was success and less so where there was not.^{[iii]}
Jay Greene is right. “Best practices” is the worst practice.
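The trap Greene describes can be made concrete with a small simulation (hypothetical data throughout; “Strategy X” is just the placeholder name used above). By construction, usage of Strategy X here has no effect on achievement at all, yet an observer who looks only at the top of the distribution still finds it in heavy use:

```python
import random
import statistics

random.seed(0)

# Hypothetical data: 200 "countries" where usage of Strategy X is,
# by construction, completely unrelated to achievement.
n = 200
strategy_x = [random.random() for _ in range(n)]          # usage intensity, 0 to 1
achievement = [random.gauss(500, 50) for _ in range(n)]   # test scores, independent of X

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Examining the whole distribution reveals the truth: no correlation.
r_full = pearson(strategy_x, achievement)

# Selecting on the dependent variable: visit only the top 10 performers.
# They use Strategy X too (about as much as everyone else), so a visitor
# who never sees the middle or the bottom can mistake it for a cause.
top10 = sorted(zip(achievement, strategy_x), reverse=True)[:10]
top_usage = statistics.fmean(x for _, x in top10)

print(f"full-distribution correlation: {r_full:+.3f}")   # near zero
print(f"Strategy X usage among top 10: {top_usage:.2f}")  # near the overall mean
```

The sketch illustrates Greene’s point exactly: with no variation in the dependent variable, observing a practice among top performers tells you nothing about whether it causes success, hinders it, or is irrelevant.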
Typically, visitors to schools see what their hosts arrange for them to see. If the host is a governmental agency responsible for school quality—and those range from national ministries to local school administrations—it has a considerable amount of time, effort, public funds, and political prestige wrapped up in a particular set of policies. Policy makers are not indifferent to the impressions that visitors take away from school visits, any more than the military is indifferent to the impressions reporters take away from visits to bases or the battlefield. Outside observers should be skeptical of the representativeness of the schools or classrooms they visit—and of the idea that a handful of schools can ever serve as a proxy for an entire nation. Some might try to visit randomly selected schools, but more than likely they will be steered to a preselected set. The desire to present a rosy picture to outsiders need not be the only motive. There is also the practical matter that schools must be prepared to receive a group of observers. Schools have work to do and visitors can be a distraction.
One way to check for representativeness is to compare edutourists’ observations to data collected from larger samples that have been drawn scientifically so as to be representative. In a previous chalkboard post, I critiqued Elizabeth Green’s reporting on math instruction in Japanese and American schools. The kids she saw in Japanese classrooms were happily engaged in mathematics—boisterous, energetic, with arguments abounding about solutions to problems—whereas in the United States, she saw dull classrooms where children unhappily practiced procedures. The stark contrast Green painted is refuted by decades of survey data from the two countries. Instructional differences do exist, but they don’t appear to be related to achievement. As for joy of learning, there is a mountain of evidence that American kids enjoy learning math more than Japanese kids, evidence collected from large, random samples of students of different ages and grades. That evidence should be trusted over observations conducted in a small number of non-randomly selected settings.
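The gap between a curated visit and a scientific sample can also be shown with a toy simulation (hypothetical numbers throughout): a visitor steered to a handful of showcase classrooms comes away with an impression that no representative sample would support.

```python
import random
import statistics

random.seed(1)

# Hypothetical system of 10,000 classrooms, each with an "engagement" score.
population = [random.gauss(55, 15) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Hosts steer visitors toward a preselected set of showcase classrooms
# drawn from the top of the distribution; the visitor sees five of them.
showcase = sorted(population, reverse=True)[:50]
visitor_sample = random.sample(showcase, 5)

# A representative estimate: a simple random sample of 500 classrooms.
random_sample = random.sample(population, 500)

print(f"true mean engagement:   {true_mean:.1f}")
print(f"visitor's impression:   {statistics.fmean(visitor_sample):.1f}")  # badly inflated
print(f"random-sample estimate: {statistics.fmean(random_sample):.1f}")   # close to the truth
```

The visitor’s five classrooms are real classrooms, honestly observed—and still wildly unrepresentative, which is why survey data from scientifically drawn samples should trump travelogue impressions.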
Diane Ravitch explains how John Dewey was misled in the 1920s: “Like so many other travelers to the Soviet Union both then and later, Dewey saw what he wanted to see, particularly the things that confirmed his vision for his own society.” This is called confirmation bias, the tendency to see what confirms an observer’s prior expectations.
Thomas Friedman is not an education expert, so he relied on experts to guide his thinking—and his visit—when he observed schools in Shanghai. Experts such as Andreas Schleicher of the OECD and Wendy Kopp of Teach for America are quoted in his column as pointing him towards Shanghai’s teaching reforms to explain the municipality’s sky-high PISA scores. Those reforms have never been evaluated using rigorous methods of program evaluation. It’s a shame that Friedman overlooked the role that China’s social policies have in boosting Shanghai’s PISA scores. In particular, he overlooked the Chinese hukou, an internal passport system that culls migrant children from Shanghai’s student population as they approach the age for PISA sampling, fifteen years old.
Poor migrants from rural villages flock to China’s big cities for jobs. The hukou system rations public services in China, including education. Hukous were originally issued to families based on their place of residence in 1958. Hukou privileges are inherited, creating a huge urban-rural divide. It doesn’t matter if the child of migrants is born in Shanghai, or even if her parents were; she will still hold a rural hukou. Recent reforms have allowed migrants greater access to primary and lower secondary schools, but high schools are still largely out of reach. Kam Wing Chan, a Chinese demography expert at the University of Washington, has shown that as Shanghai children from migrant families approach high school age, their numbers in the school system drop precipitously.^{[iv]} Migrants who do not hold a Shanghai hukou send their children back to ancestral regions, even if the kids have never set foot in a rural village. There they join an estimated 61 million children, known as “left behinds,” who never made the journey to cities in the first place.^{[v]}
The hukou creates an apartheid system of education. Human Rights Watch and Amnesty International have condemned the hukou system for its discriminatory treatment of migrant children. It is shocking that OECD documents from its PISA experts hold up Shanghai as a model of equity that the rest of the world should emulate, praising policies for treating migrants as “our children.” Reports from OECD economists have taken the opposite position, sharply criticizing Shanghai:
The Shanghai Education Committee justifies local high schools’ refusal to admit the children of migrant workers on the grounds that “if we open the door to them, it would be difficult to shut in the future; local education resources should not be freely allocated to immigrant children.” As a result, few migrant children attend general high schools and those who do return to their registration locality find it hard to adapt and often fail to complete the course.^{[vi]}
If Tom Friedman had talked to different experts, even different experts in the OECD, he would have left Shanghai with a very different impression of its school system.
The perils of edutourism discussed above—selecting on the dependent variable, relying on impressions taken from small non-random samples, and confirmation bias—corrupt key features of sound policy analysis. Impressions from a few observations do not have the same evidentiary standing as carefully collected data from a scientifically selected sample. And the statement, “I’ve been to high achieving countries and have seen with my own eyes what they are doing right” cannot substitute for research designs that rigorously test causal hypotheses.
Let me end on a personal note. The critique above is not meant to discourage edutourism, but to identify its vulnerability to misuse. I have had the good fortune to visit many schools abroad during my career. One of the first opportunities was in 1985 when, as a classroom teacher, I chaperoned a group of California high school students on a tour of several Asian countries, including China, Korea, and Japan. American tourism in China was a rare event in those days. The classrooms I observed were profoundly impressive. I visited schools in Helsinki, Finland in 2005, before the hype about Finland reached today’s ridiculous levels, and witnessed wonderful teaching and learning.
I have conducted several studies and written extensively about international education but have not mentioned my personal visits to schools abroad until the sentences that you just read. It’s not that they weren’t important to me. I treasure the memories, and as a former classroom teacher, hope to add to them with future visits.
But policy analysis must be built on a sturdier foundation than personal impressions. Education policies affect the lives of hundreds, thousands, sometimes even millions of students. Those of us in the business of informing the policy process—whether it’s talking to policy makers or the public about its schools—must gather strong evidence that can be generalized to large numbers of students. First-person accounts of visiting schools abroad are entertaining to read, but a careful reader will exercise skepticism when edutourists start giving advice on how to improve education.
[i] Diane Ravitch, Left Back: A Century of Failed School Reforms (Simon & Schuster, 2000): pp. 202-218. Also see The Later Works of John Dewey, Volume 3, 1925-1953: 1927-1928 / Essays, Reviews, Miscellany, and Impressions of Soviet Russia (Southern Illinois University Press, November 1988).
[ii] Gerald Lee Gutek, George S. Counts and American Civilization: The Educator as Social Theorist (Mercer University Press, June 1984).
[iii] Jay P. Greene, “Best Practices Are the Worst,” Education Next, vol. 12, no. 3 (Summer 2012). http://educationnext.org/bestpracticesaretheworst/
[iv] Kam Wing Chan, Ming Pao Daily News, January 3, 2014, http://faculty.washington.edu/kwchan/ShanghaiPISA.jpg.
Edutourism is not new. For American education professors in the 1920s, nothing certified one’s progressive credentials like a trip to the Soviet Union. Diane Ravitch presents a vivid account in Left Back: A Century of Failed School Reforms. She describes how John Dewey, the most famous progressive educator of the era, visited Soviet schools in 1928 and returned full of admiration. He appreciated the emphasis on collectivism over individualism and the ease with which schools integrated curricula with the goals of society. One activity that he singled out for praise was sending students into the community to educate and help “ignorant adults to understand the policies of local soviets.” William Heard Kilpatrick, father of the project method, toured Russian schools in 1929. He applauded the ubiquitous use of projectbased learning in Soviet classrooms, noting that “down to the smallest detail in the school curriculum, every item is planned to further the Soviet plan of society.” Educator and political activist George Counts shipped a Ford sedan to Leningrad and set out on a threemonth tour, extolling the role Soviet schools played in “the greatest social experiment in history.”^{[i]}
In hindsight these scholars seem incredibly naïve. Soviet schools were indeed an extension of the state, but as such, they served as indoctrination centers for one of history’s most monstrous regimes. Stalin’s plan for society was enforced by a huge secret police force and included the mass execution of political opponents, the forced starvation of millions of peasants, and a vast network of prison camps (gulags) erected to house slave labor.
To their credit, Dewey and Kilpatrick turned on Stalinism. Counts held on longer, even praising Stalin’s Five Year Plan as a “brilliant and heroic success.” In 19321933, as the first Five Year Plan transitioned into the second, an estimated 25,000 Ukrainians died daily of starvation from the forced famine that Stalin imposed on the region. Later, Counts would recognize Stalin’s schools as tools of totalitarianism, and he became, in one biographer’s words, “a determined opponent of Soviet ideology.”^{[ii]}
Today we have a new outbreak of edutourism. American adventurers have fanned out across the globe to bring back to the United States the lessons of other school systems. Thomas L. Friedman of the New York Times visited Shanghai schools on a junket organized by Teach for All, an offshoot of Teach for America, and declared “I think I found The Secret”—The Secret being how Shanghai scored at the top on the 2009 PISA tests. After declaring, “there is no secret,” Friedman fell back on some stock explanations for high achievement, focusing in particular on changing how teachers are trained and reorganizing their work day to allow for less instruction, more professional development, and ample time for peer interaction. Elizabeth Green, author and editorinchief of Chalkbeat, toured schools in Japan, and she, too, embraced the idea that the key to better teaching could be informed by observing classrooms abroad. For Green, lesson study and resurrecting controversial pedagogical reforms from the 1980s and 1990s would surely boost mathematics learning. Finland has been swamped with edutourists, spurred primarily by that nation’s illustrious PISA scores. The Education Ministry of Finland hosted at least 100 delegations from 40 to 45 countries per year from 2005 to 2011.
International tests identify the highest scoring nations of the world. What’s wrong with visitors going to top preforming nations and seeing with their own eyes what schools are doing? Contemporary edutourists aren’t blinded by political ideology in the same way as Dewey and his colleagues were. So what’s the problem? The short answer: Edutourism might produce good journalism, but it also tends to produce very bad social science.
The people named in the paragraphs above are incredibly smart. But they succumbed to the worst folly of edutourism. Three perils, explained below, mislead edutourists into believing that what they observe in a particular nation’s classrooms is causally related to that country’s impressive academic achievement.
Singling out a top achieving country—or state or district or school or teacher or some other “subject”—and then generalizing from what this top performer does is known as selecting on the dependent variable. The dependent variable, in this case, is achievement. To look for patterns of behavior that may explain achievement, a careful analyst examines subjects across the distribution—middling and poor performers as well as those at the top. That way, if a particular activity—let’s call one “Teaching Strategy X”—is found a lot at the top, not as much in the middle, and rarely or not at all at the bottom, the analyst can say that Teaching Strategy X is positively correlated with performance. That doesn’t mean it causes high achievement—even high school statistics students are taught “correlation is not causation”—only that the two variables go up and down together.
Edutourists routinely select on the dependent variable. They go to countries at the top of international assessments, such as Finland and Japan. They never go to countries in the middle or at the bottom of the distribution. If they did and found Teaching Strategy X used frequently among low performers, the positive correlation would evaporate—and they would have to seriously question whether Teaching Strategy X has any relationship with achievement.
Jay P. Greene concisely describes the problem in a review of Marc Tucker’s book, Surpassing Shanghai: An Agenda for American Education Built on the World’s Leading Systems. Tucker uses a “best practices” approach in the book, an analytical strategy that marks his entire career. Tucker describes what top performing nations do in education and builds an agenda based on the practices that he thinks made the nations successful.
Here’s Greene’s critique:
The fundamental flaw of a “best practices” approach, as any student in a halfdecent researchdesign course would know, is that it suffers from what is called “selection on the dependent variable.” If you only look at successful organizations, then you have no variation in the dependent variable: they all have good outcomes. When you look at the things that successful organizations are doing, you have no idea whether each one of those things caused the good outcomes, had no effect on success, or was actually an impediment that held organizations back from being even more successful. An appropriate research design would have variation in the dependent variable; some have good outcomes and some have bad ones. To identify factors that contribute to good outcomes, you would, at a minimum, want to see those factors more likely to be present where there was success and less so where there was not.^{[iii]}
Jay Greene is right. “Best practices” is the worst practice.
Typically, visitors to schools see what their hosts arrange for them to see. If the host is a governmental agency responsible for school quality—and those range from national ministries to local school administrations—it has a considerable amount of time, effort, public funds, and political prestige wrapped up in a particular set of policies. Policy makers are not indifferent to the impressions that visitors take away from school visits, any more than the military is indifferent to the impressions reporters take away from visits to bases or the battlefield. Outside observers should be skeptical about the representativeness of the schools or classrooms they visit—and about whether a handful of schools can ever serve as a proxy for an entire nation. Some might try to visit randomly selected schools, but more than likely they will be steered to a preselected set. The desire to present a rosy picture to outsiders need not be the only motive. There is also the practical matter that schools must be prepared to receive a group of observers. Schools have work to do, and visitors can be a distraction.
One way to check for representativeness is to compare edutourists’ observations to data collected from larger samples that have been drawn scientifically so as to be representative. In a previous chalkboard post, I critiqued Elizabeth Green’s reporting on math instruction in Japanese and American schools. The kids she saw in Japanese classrooms were happily engaged in mathematics—boisterous, energetic, with arguments abounding about solutions to problems—whereas in the United States, she saw dull classrooms where children unhappily practiced procedures. The stark contrast Green painted is refuted by decades of survey data from the two countries. Instructional differences do exist, but they don’t appear to be related to achievement. As for joy of learning, there is a mountain of evidence that American kids enjoy learning math more than Japanese kids, evidence collected from large, random samples of students of different ages and grades. That evidence should be trusted over observations conducted in a small number of nonrandomly selected settings.
Diane Ravitch explains how John Dewey was misled in the 1920s: “Like so many other travelers to the Soviet Union both then and later, Dewey saw what he wanted to see, particularly the things that confirmed his vision for his own society.” This is called confirmation bias, the tendency to see what confirms an observer’s prior expectations.
Thomas Friedman is not an education expert, so he relied on experts to guide his thinking—and his visit—when he observed schools in Shanghai. Experts such as Andreas Schleicher of the OECD and Wendy Kopp of Teach for America are quoted in his column as pointing him towards Shanghai’s teaching reforms to explain the municipality’s sky-high PISA scores. Those reforms have never been evaluated using rigorous methods of program evaluation. It’s a shame that Friedman overlooked the role that China’s social policies play in boosting Shanghai’s PISA scores. In particular, he overlooked the Chinese hukou, an internal passport system that culls migrant children from Shanghai’s student population as they approach the age for PISA sampling, fifteen years old.
Poor migrants from rural villages flock to China’s big cities for jobs. The hukou system rations public services in China, including education. Hukous were originally issued to families based on their place of residence in 1958. Hukou privileges are inherited, creating a huge urban-rural divide. It doesn’t matter whether the child of migrants is born in Shanghai, or even whether her parents were; she will still hold a rural hukou. Recent reforms have allowed migrants greater access to primary and lower secondary schools, but high schools are still largely out of reach. Kam Wing Chan, a Chinese demography expert at the University of Washington, has shown that as Shanghai children from migrant families approach high school age, their numbers in the school system drop precipitously.^{[iv]} Migrants who do not hold a Shanghai hukou send their children back to ancestral regions, even if the kids have never set foot in a rural village. There they join an estimated 61 million children, known as “left-behinds,” who never made the journey to cities in the first place.^{[v]}
The hukou creates an apartheid system of education. Human Rights Watch and Amnesty International have condemned the hukou system for its discriminatory treatment of migrant children. It is shocking that OECD documents from its PISA experts hold up Shanghai as a model of equity that the rest of the world should emulate, praising policies for treating migrants as “our children.” Reports from OECD economists have taken the opposite position, sharply criticizing Shanghai:
The Shanghai Education Committee justifies local high schools’ refusal to admit the children of migrant workers on the grounds that “if we open the door to them, it would be difficult to shut in the future; local education resources should not be freely allocated to immigrant children.” As a result, few migrant children attend general high schools and those who do return to their registration locality find it hard to adapt and often fail to complete the course.^{[vi]}
If Tom Friedman had talked to different experts, even different experts in the OECD, he would have left Shanghai with a very different impression of its school system.
The perils of edutourism discussed above—selecting on the dependent variable, relying on impressions taken from small nonrandom samples, and confirmation bias—corrupt key features of sound policy analysis. Impressions from a few observations do not have the same evidentiary standing as carefully collected data from a scientifically selected sample. And the statement, “I’ve been to high achieving countries and have seen with my own eyes what they are doing right” cannot substitute for research designs that rigorously test causal hypotheses.
Let me end on a personal note. The critique above is not meant to discourage edutourism, but to identify its vulnerability to misuse. I have had the good fortune to visit many schools abroad during my career. One of the first opportunities was in 1985 when, as a classroom teacher, I chaperoned a group of California high school students on a tour of several Asian countries, including China, Korea, and Japan. American tourism in China was a rare event in those days. The classrooms I observed were profoundly impressive. I visited schools in Helsinki, Finland in 2005, before the hype about Finland reached today’s ridiculous levels, and witnessed wonderful teaching and learning.
I have conducted several studies and written extensively about international education but have not mentioned my personal visits to schools abroad until the sentences that you just read. It’s not that they weren’t important to me. I treasure the memories, and as a former classroom teacher, hope to add to them with future visits.
But policy analysis must be built on a sturdier foundation than personal impressions. Education policies affect the lives of hundreds, thousands, sometimes even millions of students. Those of us in the business of informing the policy process—whether it’s talking to policy makers or the public about its schools—must gather strong evidence that can be generalized to large numbers of students. First person accounts of visiting schools abroad are entertaining to read, but a careful reader will exercise skepticism when edutourists start giving advice on how to improve education.
[i] Diane Ravitch, Left Back: A Century of Failed School Reforms (Simon & Schuster, 2000), pp. 202-218. Also see The Later Works of John Dewey, Volume 3, 1925-1953: 1927-1928 / Essays, Reviews, Miscellany, and Impressions of Soviet Russia (Southern Illinois University Press, November 1988).
[ii] Gerald Lee Gutek, George S. Counts and American Civilization: The Educator as Social Theorist (Mercer University Press, June 1984).
[iii] Jay P. Greene, “Best Practices Are the Worst,” Education Next, vol. 12, no. 3 (Summer 2012), http://educationnext.org/best-practices-are-the-worst/
[iv] Kam Wing Chan, Ming Pao Daily News, January 3, 2014, http://faculty.washington.edu/kwchan/ShanghaiPISA.jpg.
This post is the third and final segment on implementing curriculum aligned with the Common Core State Standards (CCSS). It focuses on how curriculum is shaped in schools and classrooms. Previous posts described curriculum implementation at the national, state, and district levels. Future posts in this series will examine the role of instruction, assessment, and accountability in implementing the Common Core.
I have previously defined curriculum as the “stuff” of learning, the content of what is taught in school—especially as embodied in the materials used in instruction. You will notice below that when classroom activities are examined, the distinction between curriculum and instruction can become blurred. The “what” of teaching and the “how” of teaching are often intertwined.
Three school and classroom forces are especially important in molding the curriculum. We can expect them to influence curriculum adopted to implement CCSS.
Common Core’s champions argue that the standards embrace more rigorous learning objectives than current state standards. If that is true—and barring any huge leap forward in productivity—additional time may be needed for both reading and mathematics instruction.
In secondary education, the allocation of time to disciplinary subjects is fixed by the class schedule. Middle and high schools divide the day into periods of instruction, with students moving among the classes during “passing periods” that separate the end of one class from the beginning of the next. A typical day is divided into six or seven periods of 45 to 55 minutes, with five to seven minutes per passing period. Some schools employ “block scheduling,” in which two or three periods are combined into a single period so that instruction may be delivered in larger chunks. The practice was a popular (some would say faddish) innovation in the 1990s, and the 1998 NAEP reported 40% of high schools scheduled at least some courses into blocks.^{[1]}
Elementary classrooms are usually “self-contained,” meaning that the same teacher teaches all subjects. That gives elementary teachers more control than their secondary counterparts over how much time is spent on each subject. Several studies have documented significant between-class variation in elementary grades’ instructional time. On the 2013 NAEP, for example, about half of fourth grade teachers (48%) reported spending 10 hours or more per week on language arts, which includes reading, writing, literature, and related topics. But more than one in five (22%) said they devoted less than seven hours per week to the same subjects, a huge difference when compounded over a full school year. A 2002 study of time diaries kept by elementary school teachers discovered time differences of about two and a half hours per week—or 87 hours per year—on core subjects.
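The compounding is simple arithmetic, sketched below under an assumed 36-week school year (an assumption for illustration; the time-diary study evidently used a slightly shorter calendar, since 87 hours corresponds to roughly 35 weeks at 2.5 hours per week):

```python
# Back-of-the-envelope arithmetic for the instructional-time gaps above.
# ASSUMPTION: a 36-week school year; the cited studies' calendars may differ.
WEEKS_PER_YEAR = 36

# NAEP self-reports: 10+ hours/week vs. under 7 hours/week on language arts.
naep_gap_weekly = 10 - 7                       # at least 3 hours/week
naep_gap_yearly = naep_gap_weekly * WEEKS_PER_YEAR

# Time-diary study: roughly 2.5 hours/week difference on core subjects.
diary_gap_yearly = 2.5 * WEEKS_PER_YEAR

print(naep_gap_yearly)   # at least 108 hours/year between the NAEP groups
print(diary_gap_yearly)  # about 90 hours/year at 36 weeks; study reports 87
```

Even the conservative gap amounts to weeks of additional instruction per year for some students.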
Some states mandate a minimum amount of daily instruction devoted to reading. Districts also offer guidelines. Chicago Public Schools extended the school day in 2012-2013 and simultaneously issued minimum guidelines for all academic subjects, including at least two hours daily on reading and writing. The district claimed that this was the first time such guidelines had been issued, justifying the policy by explaining “In the past, there has been little consistency from school to school in time spent on core subjects.”^{[2]}
No real way exists for states or districts to enforce time guidelines. What happens if insufficient time is allotted? Interestingly, Chicago was the site of an early 1980s study of elementary school reading instruction that bears on the question. In a citywide study, Robert Dreeben and Adam Gamoran found that first graders learn how to read in classrooms that optimally balance two crucial inputs: curriculum coverage and instructional time.^{[3]} Adopting more rigorous curricula is not enough. They singled out a particular Chicago school that “used relatively difficult materials, but did not succeed in covering very much of them, even with the group that had the highest mean aptitude in the whole sample. This was caused in part by an insufficiency of time set aside by the teachers and by the school administration for basal reading instruction.” A more recent study decomposed the influence of districts, schools, and teachers and estimated that 80% to 95% of variation in content coverage results from different time allocations by individual teachers.^{[4]}
Much has been made that the CCSS mathematics standards are focused and coherent, qualities that previous state standards lacked. Focus and coherence are related to the ordering of math topics so that knowledge and skills build in a logical manner. The CCSS organize topics by grade. Any substantive benefit they may provide in terms of focus and coherence will be due to better organization of topics among grades—and especially between adjacent grades. But within grades, the CCSS are silent on ordering topics. That discretion is left up to local educators and will undoubtedly be influenced by the textbook program they choose.
Emphasis
Two controversies have erupted regarding what teachers will emphasize during English Language Arts instruction. The first popped up in 2012 and involved the amount of nonfiction and fiction taught in classrooms.^{[5]} A 50-50 balance is recommended by CCSS for grades K-5. A shift towards more reading of nonfiction is urged for later grades. The Common Core State Standards Initiative for English Language Arts cites NAEP as justification for the recommended distribution, stating that the standards “follow NAEP’s lead in balancing the reading of literature with the reading of informational texts.” A breakdown of the distribution of literary and informational passages by grade level on NAEP is provided: 50% literary and 50% informational in 4^{th} grade, 45%-55% in 8^{th} grade, and 30%-70% in 12^{th} grade. The CCSS Initiative states, “The Standards aim to align instruction with this framework so that many more students than at present can meet the requirements of college and career readiness.”
Fiction has a hallowed place in the ELA curriculum, whether it’s the simple stories found in elementary school basal texts or the classic novels taught in high school. Veteran teachers are unlikely to willingly abandon instructional units, some of which they may have spent years refining based on classroom experiences. That is especially true with great fiction that they also happen to enjoy teaching. Defenders of the CCSS reassure teachers that the nonfiction/fiction distributional guidelines are for the amount of reading expected of students across all school subjects, not only in ELA classes, allowing the ELA curriculum to continue with a heavy dose of fiction.
That leaves the issue in a bit of a muddle. Does the balance of fiction and nonfiction matter? As Sandra Stotsky and Mark Bauerlein have pointed out, there is no evidence that reading nonfiction is more effective in teaching critical thinking than reading fiction, or that reading informational texts instead of fiction enhances college and career readiness.^{[6]} What reading informational texts can offer, however, is background knowledge that makes all kinds of reading more accessible to the reader. That raises the second controversy.
The Common Core State Standards advocate “close reading” of texts. Because the meaning of any text originates in the words themselves, understanding what an author means requires students to read and reread text for its essence. In 2013, a teaching guide was published on how the Gettysburg Address could be taught consistent with the CCSS’s notion of close reading. The guide was created by Student Achievement Partners, an organization founded by David Coleman, Susan Pimentel, and Jason Zimba, lead authors of the CCSS. It was also posted on EngageNY, a website established and maintained by the New York State Education Department to provide guidance on implementing CCSS-aligned materials in the classroom.
The guide consists of several lessons on the Gettysburg Address that take three to six days to complete. The first lesson has the students reading the text of the speech cold, with no background material on the Civil War, the significance of the Battle of Gettysburg, or the occasion that brought Lincoln to give the speech. Nothing at all. The stated reason for this strategy is to get students accustomed to the idea of analyzing text on its own. Providing historical context or other preparatory information (often called prereading activities), the guide argues, leads students away from the text. Moreover, the guide asserts that the recommendation promotes equity: “This close reading approach forces students to rely exclusively on the text instead of privileging background knowledge and levels the playing field for all students as they seek to comprehend Lincoln’s address.”
Such an extreme application of close reading has been met with sharp criticism. Dan Willingham has described as “a bit crazy” the notion that “we will read the text as though we know nothing about the subject at hand; the author’s words will be not only necessary for our interpretation, we’ll consider them sufficient.” Many adherents of E.D. Hirsch’s Core Knowledge believe that the Common Core bolsters the chance of students reading broadly across disciplines. It’s hard to see that happening if teachers aren’t even supposed to help students understand the historical context of an important speech before they read it. Moreover, it is difficult to resolve the glaring contradiction of urging more informational reading but not wanting students to put that knowledge to use when encountering a particular text for the first time, even debasing it in order to “level the playing field.”
As I mentioned above, the lines separating curriculum and instruction often become blurred, and this is precisely such a case. Most of the debate about the Gettysburg lessons has been about pedagogy, whether this is a good or bad way of teaching. But I present the controversy here to make a point about curriculum. Teachers currently vary in how they prepare students for reading particular texts, in particular, how much historical background information they provide when tackling historical texts. They will continue to do so. Those choices, made countless times every day in classrooms across the country, mean that students encounter very different curricula—even when studying the same topic.
Imagine a student whose teacher, as complementary material for studying the Gettysburg Address, assigns excerpts from Gary Wills’s Lincoln at Gettysburg, shows the Gettysburg episode from the Ken Burns Civil War series, and gives a brief lecture on famous eulogies and memorials (perhaps starting with Pericles). Compare that content to a student who receives the close reading lessons described above. The two students will both have studied the Gettysburg Address, but they will take away completely different knowledge from the lessons. They will learn different content because they studied different curricula. It is also almost impossible to think of an assessment that could measure these two students’ new learning in a fair and accurate way.
The debate over prereading, like the debate over nonfiction and fiction, has done very little to clarify the CCSS in regard to curriculum. How a CCSS curriculum emerges will depend on what Richard Elmore calls “the power of the bottom over the top” when education policy is implemented.^{[7]} The decisions local educators make in allocating time, ordering curricular topics, and emphasizing some CCSS skills and knowledge over others will determine how the CCSS is realized in schools and classrooms.
[1] All statistics attributed to NAEP are from data retrieved from NAEP Data Explorer (http://nces.ed.gov/nationsreportcard/naepdata/). High school scores are from 12^{th} grade.
[3] Robert Dreeben and Adam Gamoran (1986). “Race, Instruction, and Learning,” American Sociological Review, vol. 51 (October), pp. 660-669.
[4] William H. Schmidt and Curtis C. McKnight (2012). Inequality for All: The Challenge of Unequal Opportunity in American Schools.
[5] See Jay Mathews, “Fiction vs. nonfiction smackdown,” Washington Post, 10/17/2012. http://www.washingtonpost.com/local/education/fiction-vs-nonfiction-smackdown/2012/10/17/cbb333d0-16f0-11e2-a55c-39408fbe6a4b_story.html
This post is the third and final segment on implementing curriculum aligned with the Common Core State Standards (CCSS). It focuses on how curriculum is shaped in schools and classrooms. Previous posts described curriculum implementation at the national, state, and district levels. Future posts in this series will examine the role of instruction, assessment, and accountability in implementing the Common Core.
I have previously defined curriculum as the “stuff” of learning, the content of what is taught in school—especially as embodied in the materials used in instruction. You will notice below that when classroom activities are examined, the distinction between curriculum and instruction can become blurred. The “what” of teaching and the “how” of teaching are often intertwined.
Three school and classroom forces are especially important in molding the curriculum. We can expect them to influence curriculum adopted to implement CCSS.
Common Core’s champions argue that the standards embrace more rigorous learning objectives than current state standards. If that is true—and barring any huge leap forward in productivity—additional time may be needed for both reading and mathematics instruction.
In secondary education, the allocation of time to disciplinary subject is fixed by the class schedule. Middle and high schools divide the day into periods of instruction, with students moving among the classes during “passing periods” that separate the end of one class from the beginning of the next. A typical day is divided into six or seven periods of 45 to 55 minutes, with five to seven minutes per passing period. Some schools employ “block scheduling,” in which two or three periods are combined into a single period so that instruction may be delivered in larger chunks. The practice was a popular (some would say faddish) innovation in the 1990s, and the 1998 NAEP reported 40% of high schools scheduled at least some courses into blocks.^{[1]}
Elementary classrooms are usually “selfcontained,” meaning that the same teacher teaches all subjects. That gives elementary teachers more control than their secondary counterparts over how much time is spent on each subject. Several studies have documented significant betweenclass variation in elementary grades’ instructional time. On the 2013 NAEP, for example, about half of fourth grade teachers (48%) reported spending 10 hours or more per week on language arts, which includes reading, writing, literature, and related topics. But more than one in five (22%) said they devoted less than seven hours per week on the same subjects, a huge difference when compounded over a full school year. A 2002 study of time diaries kept by elementary school teachers discovered time differences of about two and a half hours per week—or 87 hours per year—on core subjects.
Some states mandate a minimum amount of daily instruction devoted to reading. Districts also offer guidelines. Chicago Public Schools extended the school day in 20122013 and simultaneously issued minimum guidelines for all academic subjects, including at least two hours daily on reading and writing. The district claimed that this was the first time such guidelines had been issued, justifying the policy by explaining “In the past, there has been little consistency from school to school in time spent on core subjects.”^{[2]}
No real way exists for states or districts to enforce time guidelines. What happens if insufficient time is allotted? Interestingly, Chicago was the site of an early 1980s study of elementary school reading instruction that bears on the question. In a citywide study, Robert Dreeben and Adam Gamoran found that first graders learn how to read in classrooms that optimally balance two crucial inputs: curriculum coverage and instructional time.^{[3]} Adopting more rigorous curricula is not enough. They singled out a particular Chicago school that “used relatively difficult materials, but did not succeed in covering very much of them, even with the group that had the highest mean aptitude in the whole sample. This was caused in part by an insufficiency of time set aside by the teachers and by the school administration for basal reading instruction.” A more recent study decomposed the influence of districts, schools, and teachers and estimated that 80% to 95% of variation in content coverage results from different time allocations by individual teachers.^{[4]}
Much has been made that the CCSS mathematics standards are focused and coherent, qualities that previous state standards lacked. Focus and coherence are related to the ordering of math topics so that knowledge and skills build in a logical manner. The CCSS organize topics by grade. Any substantive benefit they may provide in terms of focus and coherence will be due to better organization of topics among grades—and especially between adjacent grades. But within grades, the CCSS are silent on ordering topics. That discretion is left up to local educators and will undoubtedly be influenced by the textbook program they choose.
Emphasis
Two controversies have erupted regarding what teachers will emphasize during English Language Arts instruction. The first popped up in 2012 and involved the amount of nonfiction and fiction taught in classrooms.^{[5]} A 5050 balance is recommended by CCSS for grades K5. A shift towards more reading of nonfiction is urged for later grades. The Common Core State Standards Initiative for English Language Arts cites NAEP as licensing the recommended distribution, stating that the standards “follow NAEP’s lead in balancing the reading of literature with the reading of informational texts.” A breakdown of the distribution of literary and informational passages by grade level on NAEP is provided: 50% literary and 50% informational in 4^{th} grade, 45%55% in 8^{th} grade, and 30%70% in 12^{th} grade. The CCSS Initiative states, “The Standards aim to align instruction with this framework so that many more students at present can meet the requirements of college and career readiness.”
Fiction has a hallowed place in the ELA curriculum, whether it’s the simple stories found in elementary school basal texts or the classic novels taught in high school. Veteran teachers are unlikely to willingly abandon instructional units, some of which they may have spent years refining based on classroom experiences. That is especially true with great fiction that they also happen to enjoy teaching. Defenders of the CCSS reassure teachers that the nonfiction/fiction distributional guidelines are for the amount of reading expected of students across all school subjects, not only in ELA classes, allowing the ELA curriculum to continue with a heavy dose of fiction.
That leaves the issue in a bit of a muddle. Does the balance of fiction and nonfiction matter? As Sandra Stotsky and Mark Bauerlein have pointed out, there is no evidence that reading nonfiction is more effective in teaching critical thinking than reading fiction, or that reading informational texts instead of fiction enhances college and career readiness.^{[6]} What reading informational texts can offer, however, is background knowledge that makes all kinds of reading more accessible to the reader. That raises the second controversy.
The Common Core State Standards advocate “close reading” of texts. Because the meaning of any text originates in the words themselves, understanding what an author means requires students to read and reread text for its essence. In 2013, a teaching guide was published on how the Gettysburg Address could be taught consistent with the CCSS’s notion of close reading. The guide was created by Student Achievement Partners, an organization founded by David Coleman, Susan Pimentel, and Jason Zimba, lead authors of the CCSS. It was also posted on EngageNY, a website established and maintained by the New York State Department of Education to provide guidance on implementing CCSSaligned materials in the classroom.
The guide consists of several lessons on the Gettysburg Address that take three to six days to complete. The first lesson has the students reading the text of the speech cold, with no background material on the Civil War or the significance of the Battle at Gettysburg or of the occasion that brought Lincoln to give the speech. Nothing at all. The stated reason for this strategy is to get students accustomed to the idea of analyzing text on its own. Providing historical context or other preparatory information (often called prereading activities) leads students away from the text. Moreover, the guide asserts that the recommendation promotes equity: “This close reading approach forces students to rely exclusively on the text instead of privileging background knowledge and levels the playing field for all students as they seek to comprehend Lincoln’s address.”
Such an extreme application of close reading has been met with sharp criticism. Dan Willingham has described as “a bit crazy” the notion that: “we will read the text as though we know nothing about the subject at hand; the author’s words will be not only necessary for our interpretation, we’ll consider them sufficient.” Many adherents of E.D. Hirsch’s Core Knowledge believe that the Common Core bolsters the chance of students reading broadly across disciplines. It’s hard to see that happening if teachers aren’t even supposed to help students understand the historical context of an important speech before they read it. Moreover, it is difficult to resolve the glaring contradiction of urging more informational reading but not wanting students to put that knowledge to use when encountering a particular text for the first time, even debasing it in order to “level the playing field.”
As I mentioned above, the lines separating curriculum and instruction often become blurred, and this is precisely such a case. Most of the debate about the Gettysburg lessons has been about pedagogy, whether this is a good or bad way of teaching. But I present the controversy here to make a point about curriculum. Teachers currently vary in how they prepare students for reading particular texts, in particular, how much historical background information they provide when tackling historical texts. They will continue to do so. Those choices, made countless times every day in classrooms across the country, mean that students encounter very different curricula—even when studying the same topic.
Imagine a student whose teacher, as complimentary materials to studying the Gettysburg Address, assigns excerpts from Gary Wills’s Lincoln at Gettysburg, shows the Gettysburg episode from the Ken Burns Civil War series, and gives a brief lecture on famous eulogies and memorials (perhaps starting with Pericles). Compare that content to a student who receives the close reading lessons described above. The two students will both have studied the Gettysburg Address, but they will take away completely different knowledge from the lessons. They will learn different content because they studied different curricula. It is also almost impossible to think of an assessment that could measure these two students’ new learning in a fair and accurate way.
The debate over prereading, like the debate over nonfiction and fiction, has done very little to clarify the CCSS in regards to curriculum. How a CCSS curriculum emerges will depend on what Richard Elmore calls “the power of the bottom over the top” when education policy is implemented.^{[7]} The decisions local educators make in allocating time, ordering curricular topics, and emphasizing some CCSS skills and knowledge over others will determine how the CCSS is realized in schools and classrooms.
[1] All statistics attributed to NAEP are from data retrieved from NAEP Data Explorer (http://nces.ed.gov/nationsreportcard/naepdata/). High school scores are from 12^{th} grade.
[3] Robert Dreeben and Adam Gamoran (1986). "Race, Instruction, and Learning," American Sociological Review, vol. 51 (October): 660-669.
[4] William H. Schmidt and Curtis C. McKnight (2012). Inequality for All: The Challenge of Unequal Opportunity in American Schools.
[5] See Jay Mathews, “Fiction vs. nonfiction smackdown,” Washington Post, 10/17/2012. http://www.washingtonpost.com/local/education/fictionvsnonfictionsmackdown/2012/10/17/cbb333d016f011e2a55c39408fbe6a4b_story.html
The July 27, 2014 edition of the New York Times Sunday Magazine featured an article by Elizabeth Green entitled "Why Do Americans Stink at Math?" In this blog post, I identify six myths promulgated in that article. Let me be clear at the outset. I am an admirer of Elizabeth Green's journalism and am sympathetic to the idea that improving teaching would raise American math achievement. But this article is completely off base. Its most glaring mistake is giving the impression that a particular approach to mathematics instruction—referred to over the past half-century as "progressive," "constructivist," "discovery," or "inquiry-based"—is the answer to improving mathematics learning in the U.S. That belief is not supported by evidence.
Green asserts that American math reformers frequently come up with great ideas—the examples cited are the New Math of the 1960s, California's 1985 Mathematics Framework, the National Council of Teachers of Mathematics (NCTM) math reforms of the 1980s and 1990s, and today's Common Core—but the reforms are shot down because of poor implementation. Green deserves credit for avoiding the common habit of attributing the failure of reform to "implementation" without defining the term. In Green's way of thinking, implementation of math reform hinges on changing the way teachers teach.^{[1]} American teachers, Green argues, are wedded to the idea that learning mathematics is synonymous with memorization and practicing procedures. They aren't provided the training to teach in different ways. Left on their own, teachers teach the way they themselves were taught, emphasizing, in her words, "mind-numbing" routines—and perpetuating a cycle of failure.
Green believes that the 1980s math reforms failed in the U.S. but took root and flourished in Japan. Over a 12-year span, she writes, "the Japanese educational system embraced this more vibrant approach to math." The two countries' math classrooms are dramatically different, and readers are presented with a series of contrasts. American classrooms are dull and oppressively quiet; Japanese classrooms are bursting with children "talking, arguing, shrieking about the best way to solve problems." Japanese students "uncover math's procedures, properties and proofs for themselves;" American students regurgitate rules spoon-fed to them by teachers. When the innovations of the 1980s and 1990s were proposed, Japan "was able to shift a country full of teachers to a new approach." American teachers dug in and clung to traditional teaching. The upshot of all of this? Japan scores at the top of international math tests; the U.S. scores near the middle of the pack or worse.
The story is wrong. It goes wrong by embracing six myths.
Green provides no evidence that instructional differences are at the heart of American and Japanese achievement differences. Indeed, she provides no evidence, other than the assertions of advocates of the teaching practices extolled in the article, that Japanese teachers changed their instructional practices in the 1990s, or that American teachers did not change theirs, or that any change that occurred in math instruction in either country had an impact on student achievement.
Green relies on the Trends in International Mathematics and Science Study (TIMSS) 1995 Video Study to document differences in Japanese and American teaching styles.^{[2]} She fails to tell readers of a crucial limitation of that study. The TIMSS video study did not collect data on how much kids learned during lessons. Interesting differences were indeed apparent between Japanese and American instruction (German math teachers were also part of the study), but the study could not assess whether Japanese students learned more math as a result. Within-country comparisons might have been especially revealing. If Japanese kids who were exposed to "reform" teaching strategies learned more than Japanese kids exposed to traditional instruction, and if great care were taken to make sure the two groups were equal on characteristics related to learning, then that would suggest the choice of instructional regime might be driving the learning differences. Given the study's limitations, that analysis could not be conducted.
The 1995 TIMSS collected survey data separate from the video study that can shed light on the question. Eighth grade math teachers were queried: how often do you ask students to do reasoning tasks? Table 1 shows the frequency of teachers’ answers, along with the average TIMSS score of students for each response category (in parentheses).
Table 1. Teachers' Reports on How Often They Ask Students to Do Reasoning Tasks

          Never or Almost Never   Some Lessons   Most Lessons   Every Lesson
Japan     0%                      7% (594)       55% (604)      37% (608)
U.S.      0%                      24% (495)      50% (498)      26% (514)

Source: Table 5.11, IEA Third International Mathematics and Science Study (TIMSS), 1994-1995, page 160.
Note that the data support the view that eighth grade Japanese teachers emphasize reasoning more often than U.S. teachers. But the data also suggest that the difference can only explain a trivial amount of the Japan-U.S. test score gap. The gap hovers around 100 points across response categories, comparable to the overall gap reported in 1995 (Japan scored 605, the U.S., 500). The within-country difference between teachers who include reasoning tasks in every lesson versus teachers who only present them in some lessons is only 14 scale score points in Japan and 19 points in the U.S. Indeed, even if 100% of U.S. teachers had said they emphasize reasoning in every lesson—and the 514 TIMSS score for the category held firm—the achievement gap between the two countries would only contract negligibly. This suggests the overall test score difference between the two countries is driven by other factors.
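The counterfactual in the previous paragraph is simple weighted-average arithmetic, sketched below. The category shares and average scores are copied from Table 1; the code itself and its variable names are illustrative, not from any source.

```python
# Category shares and average TIMSS scores from Table 1 (1995, eighth grade).
# Order of categories: Some Lessons / Most Lessons / Every Lesson
# (the "Never" category is 0% in both countries and can be ignored).
shares_japan = [0.07, 0.55, 0.37]
scores_japan = [594, 604, 608]
shares_us    = [0.24, 0.50, 0.26]
scores_us    = [495, 498, 514]

def weighted_avg(shares, scores):
    """Share-weighted mean score, normalizing in case shares don't sum to 1."""
    return sum(w * s for w, s in zip(shares, scores)) / sum(shares)

japan_avg = weighted_avg(shares_japan, scores_japan)  # ~605
us_avg    = weighted_avg(shares_us, scores_us)        # ~501
actual_gap = japan_avg - us_avg                       # ~103 points

# Counterfactual: 100% of U.S. teachers land in the "Every Lesson" category,
# with that category's 514 average holding firm.
counterfactual_gap = japan_avg - 514                  # ~91 points
```

The gap shrinks by only about a dozen points, consistent with the claim that the instructional emphasis on reasoning cannot account for the bulk of the roughly 100-point difference.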
What are those other factors? Green dismisses cultural differences or the contribution of instruction outside school to Japanese math achievement. This is puzzling. There is no discussion of Japanese parents drilling children in math at home or of the popularity of Kumon centers that focus on basic skills.^{[3]} And juku gets not a single mention in Green's article. Juku, commonly known as "cram school," is the private, after-school instruction that most Japanese students receive, especially during middle school as they prepare for high school entrance exams. Jukus are famous for focusing on basic skills, drill and practice, and memorization.^{[4]} Japanese public schools have the luxury of offloading these instructional burdens to jukus.
An alternative hypothesis to Green’s story is this: perhaps because of jukus Japanese teachers can take their students’ fluency with mathematical procedures for granted and focus lessons on problem solving and conceptual understanding. American teachers, on the other hand, must teach procedural fluency or it is not taught at all.
Green's article depicts American math classrooms as boring, unhappy places and Japanese classrooms as vibrant and filled with joy. She cites no data other than her own impressions from classroom observations and the assertions of advocates of reform-oriented instruction. It is odd that she didn't examine the Programme for International Student Assessment (PISA) or TIMSS data on enjoyment because both assessments routinely survey students from randomly sampled classrooms and ask whether they enjoy learning mathematics.^{[5]}
American students consistently report enjoying math more than Japanese students. In response to the statement, "I look forward to my mathematics lessons," posed on PISA, the percentage of U.S. 15-year-olds agreeing in 2012 was 45.4%, compared to 33.7% in Japan. To the prompt, "I do mathematics because I enjoy it," the percentage agreeing was 36.6% in the U.S. and 30.8% in Japan.^{[6]} The differences between countries are statistically significant.
TIMSS asks younger students whether they like learning math.^{[7]} Among 8^{th} graders, the American results are pretty grim. Only 19% say they enjoy learning math, while 40% do not like it, more than a 2-to-1 ratio disliking the subject. But American students are downright giddy compared to students in Japan. Only 9% of Japanese 8^{th} graders say they like learning math and 53% do not like it, almost a whopping 6-to-1 ratio of disliking to liking. Fourth graders in both countries like the subject more than eighth graders, but in the U.S. the like-to-dislike ratio is about 2-to-1 (45% to 22%) and in Japan it's close to even (29% to 23%).^{[8]}
Green’s impressions are based on observations in nonrandomly selected classrooms. They suggest that American students dislike math and Japanese students love it. But empirical evidence collected by more scientific methods finds exactly the opposite.
Japanese and American math scores are headed in opposite directions, but the trend is not what you'd guess after reading the New York Times article. Japan's scores are going down, and U.S. scores are going up. The first international assessment of math, a precursor to today's TIMSS test, took place in 1964. Twelve nations took part. Japan ranked second, the U.S. eleventh (outscoring only Sweden).^{[9]} If the scores are converted to standard deviation units (SD), Japan scored 0.9 SD higher than the U.S. (all scores in this section refer to eighth grade).
Jump ahead about five decades. On the 2011 TIMSS, Japan still outscored the U.S., but by a smaller amount: 0.61 SD. Most of the narrowing occurred after 1995. From 1995 to 2011, the average scale score for Japan’s eighth graders fell 11 points (from 581 to 570) while the U.S. eighth graders gained 17 points (from 492 to 509). Japan’s decline and the U.S.’s increase are both statistically significant.
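That conversion to standard deviation units can be checked directly. The sketch below assumes the TIMSS scale's nominal standard deviation of 100 points, an assumption consistent with the 0.61 figure but not stated explicitly in the text.

```python
# TIMSS scale scores are reported on a metric with a nominal mean of 500 and
# a nominal standard deviation of 100 (assumed here).
TIMSS_SD = 100

japan_2011, us_2011 = 570, 509
gap_in_sd = (japan_2011 - us_2011) / TIMSS_SD  # 61 points -> 0.61 SD

# The 1995-2011 trends, in scale points:
japan_change = 570 - 581   # -11 (Japan's decline)
us_change    = 509 - 492   # +17 (the U.S. gain)
```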
This pokes a huge hole in Green's story. She attributes Japan's high math proficiency to teaching reforms adopted in the 1980s and 1990s, but does not acknowledge that Japan was doing quite well—and even better than today relative to the U.S.—on international math tests in the 1960s. If Japan now outscores the U.S. because of superior teaching, how could it possibly have performed better on math tests in the 1960s? According to Green, the 1960s were the bad old days of Japanese math instruction focused on rote learning. And what about the decline in Japan's math achievement since 1995? Is this really the nation we should look to for guidance on improving math instruction?
Green blames the demise of American math reform in the 1990s on the failure to adequately prepare teachers for change. She does not even mention “the math wars,” the intense political battles that were fought in communities across the country when new math programs were introduced. California deserves attention since Green holds up the 1985 California math framework as an example of that era’s push towards “teaching for understanding.”
The 1985 and 1992 California math frameworks were indeed crowning achievements of progressive math reformers, but as new textbooks and programs began trickling into schools, a coalition of parents and mathematicians arose in vehement opposition. The charge was that the frameworks—and their close cousin, the 1989 NCTM standards—contained weak mathematical content. One reform program, Mathland, attained notoriety for replacing textbooks with kits of manipulatives and for delaying or omitting the teaching of standard algorithms.^{[10]}
In 1999, new state standards were written by four mathematicians from Stanford University. The standards repudiated the previous state framework and the NCTM standards. Although math reformers in California opposed the new standards, they could not claim that the authors lacked a conceptual understanding of mathematics or viewed math as the robotic execution of procedures. The standards focused on clearly stated content objectives for each grade level, and avoided recommending instructional strategies. They encouraged the development of computation skills, conceptual understanding, and problem solving.
The notion that classroom teachers’ blind devotion to procedures or memorization led to the failure of 1990s math reform in the U.S. is ahistorical. Indeed, Green cites no historical accounts of that period to support the claim. Moreover, the suggestion that teachers were left on their own to figure out how to change their teaching is inaccurate. Throughout the 1990s, the NCTM standards were used as a template for the development of standards and assessments in states across the land. Education school professors in the late 1990s overwhelmingly supported math reform.^{[11]}
The federal government deployed powerful resources to promote math reform, and the National Science Foundation spent hundreds of millions of dollars training teachers in three different systemic reform initiatives. The National Assessment of Educational Progress (NAEP) rewrote its math framework and redesigned its math test to reflect the NCTM standards. In 1999, the U.S. Department of Education endorsed several reform-oriented math programs. But a petition signed by over 200 mathematicians, educators, and scientists appeared in the Washington Post on November 18, 1999, renouncing the list of recommended programs.
Math reform in the U.S. is typically the offspring of government power wedded to education school romanticism. David Klein has written a succinct account of twentieth century American math reforms. E.D. Hirsch’s intellectual history of curricular reform attributes the periodic rise of progressive movements to the ideological “thought world” that dominates education schools. Contrary to Elizabeth Green’s account, these histories conclude that math reform movements have repeatedly failed not because of stubborn teachers who cling to tired, old practices but because the reforms have been—there are no other words for it—just bad ideas.
Algorithms are procedures. When the Common Core states that elementary students will learn standard algorithms—the conventional methods for adding, subtracting, multiplying, and dividing numbers—it is saying students will learn procedures. Fluency with basic facts (e.g., 6 + 7 = 13, 18 - 9 = 9) is attained through memorization. Nothing in the Common Core discourages memorization. The primary authors of the Common Core math standards, William McCallum and Jason Zimba, have been clear that the Common Core is neutral on pedagogy, with teachers free to choose the instructional strategies—traditional or progressive or whatever—that they deem best.^{[12]} The Common Core is about content, not pedagogy. As the Common Core State Standards (CCSS) website adamantly proclaims, "Teachers know best about what works in the classroom. That is why these standards establish what students need to learn, but do not dictate how teachers should teach. Instead, schools and teachers decide how best to help students reach the standards."^{[13]}
That does not mean the Common Core won't be used to promote constructivist pedagogy or to suppress traditional instruction. The protests of CCSS authors that the standards are being misinterpreted may not be enough. The danger emanates from what I've previously described as "dog whistles" embedded in the Common Core.^{[14]} The CCSS math documents were crafted to combine ideas (CCSS advocates would say the best ideas) from both traditional and progressive perspectives in the "math wars." That is not only politically astute, but it also reflects the current state of research on effective mathematical instruction. Scholarly reviews of the literature have raised serious objections to constructivism. The title of an influential 2006 review published in Educational Psychologist says it all: "Why Minimal Guidance During Instruction Does Not Work: An Analysis of the Failure of Constructivist, Discovery, Problem-Based, Experiential, and Inquiry-Based Teaching."^{[15]} Unfortunately, the Common Core—and in particular the Standards for Mathematical Practice—contains enough shorthand terms related to constructivist pedagogy that true believers of inquiry-based math reform can take them as license for imposing their ideology on teachers.
In its one-sided support for a particular style of math instruction, Elizabeth Green's article acts as a megaphone for these dog whistles, the misguided notions that, although seemingly innocuous to most people, are packed with meaning for partisans of inquiry-based learning. Green's article is based on bad science, bad history, and unfortunate myths that will lead us away from, rather than closer to, the improvement of math instruction in the United States.
[1] Green’s choice of math reforms to list—all of which, except for the Common Core, tried to change how math is taught—is bound to mislead one into thinking that math reform’s implementation problems are primarily related to instruction.
[3] A 1994 Chicago Tribune article describes a local student who happily gets after-school Kumon lessons. Note the reference to schools of that time "forging a path toward a greater understanding of math concepts."
[4] Ironically, an op-ed published in August 2014 in the New York Times on hagwons, the Korean version of jukus, attributes Korea's high PISA scores to hagwon instruction. It is inexplicable that hagwon instruction could mean so much to Korea's test score success in this article but juku instruction does not even warrant mention in an article on Japan's high scores.
[5] I devoted a section of the 2006 Brown Center Report to “the happiness factor” in education.
[6] OECD. PISA 2012 Results: What Students Know and Can Do. Student Performance in Mathematics, Reading and Sciences. Table III.3.4f.
[7] Ina V.S. Mullis, Michael O. Martin, Pierre Foy, and Alka Arora. TIMSS 2011 International Results in Mathematics. Chapter 8, Exhibits 8.1 (page 330) and 8.2 (p. 332).
[9] International Study of Achievement in Mathematics: A Comparison of Twelve Countries (Vols. 1–2), edited by T. Husén (New York: John Wiley & Sons, 1967).
[10] At its peak, Mathland’s publisher claimed that the program was the most popular in California. Today, it is not published.
[11] A snippet from a 1997 survey of education professors conducted by Public Agenda: The process of learning is more important to education professors than whether or not students absorb specific knowledge. Nearly 9 in 10 (86%) say when K-12 teachers assign math or history questions, it is more important for kids to struggle with the process of finding the right answers than knowing the right answer. "We have for so many years said to kids 'What's 7+5?' as if that was the important thing. The question we should be asking is 'Give me as many questions whose answer is 12...,'" said a Chicago professor who was interviewed for this study.
[12] McCallum on CCSS: “They just say what we want students to learn.” And Jason Zimba on misinterpreting the Practice Standards to diminish traditional content: “I sometimes worry that talking about the practice standards can be a way to avoid talking about focus and specific math content. Until we see fewer topics and a strong focus on arithmetic in elementary grades, we really aren't seeing the standards being implemented.”
The July 27, 2014 edition of the New York Times Sunday Magazine featured an article by Elizabeth Green entitled “Why Do Americans Stink at Math?” In this blog post, I identify six myths promulgated in that article. Let me be clear at the outset. I am an admirer of Elizabeth Green’s journalism and am sympathetic to the idea that improving teaching would raise American math achievement. But this article is completely off base. Its most glaring mistake is giving the impression that a particular approach to mathematics instruction—referred to over the past halfcentury as “progressive,” “constructivist,” “discovery,” or “inquirybased”—is the answer to improving mathematics learning in the U.S. That belief is not supported by evidence.
Green asserts that American math reformers frequently come up with great ideas—the examples cited are the New Math of the 1960s, California’s 1985 Mathematics Framework, the National Council of Teachers of Mathematics (NCTM) math reforms of the 1980s and 1990s, and today’s Common Core—but the reforms are shot down because of poor implementation. Green deserves credit for avoiding the common habit of attributing the failure of reform to “implementation” without defining the term. In Green’s way of thinking, implementation of math reform hinges on changing the way teachers teach.^{[1]} American teachers, Green argues, are wedded to the idea that learning mathematics is synonymous with memorization and practicing procedures. They aren’t provided the training to teach in different ways. Left on their own, teachers teach the way they themselves were taught, emphasizing, in her words, “mind numbing” routines—and perpetuating a cycle of failure.
Green believes that the 1980s math reforms failed in the U.S. but took root and flourished in Japan. Over a 12 year span, she writes, “the Japanese educational system embraced this more vibrant approach to math.” The two countries’ math classrooms are dramatically different, and readers are presented with a series of contrasts. American classrooms are dull and oppressively quiet; Japanese classrooms are bursting with children “talking, arguing, shrieking about the best way to solve problems.” Japanese students “uncover math’s procedures, properties and proofs for themselves;” American students regurgitate rules spoon fed to them by teachers. When the innovations of the 1980s and 1990s were proposed, Japan “was able to shift a country full of teachers to a new approach.” American teachers dug in and clung to traditional teaching. The upshot of all of this? Japan scores at the top of international math tests; the U.S. scores near the middle of the pack or worse.
The story is wrong. It goes wrong by embracing six myths.
Green provides no evidence that instructional differences are at the heart of American and Japanese achievement differences. Indeed, she provides no evidence, other than the assertions of advocates of the teaching practices extolled in the article, that Japanese teachers changed their instructional practices in the 1990s, or that American teachers did not change theirs, or that any change that occurred in math instruction in either country had an impact on student achievement.
Green relies on the Trends in International Mathematics and Science Study (TIMSS) 1995 Video Study to document differences in Japanese and American teaching styles.^{[2]} She fails to tell readers of a crucial limitation of that study. The TIMSS video study did not collect data on how much kids learned during lessons. Interesting differences were indeed apparent between Japanese and American instruction (German math teachers were also part of the study), but the study could not assess whether Japanese students learned more math as a result. Withincountry comparisons might have been especially revealing. If Japanese kids who were exposed to “reform” teaching strategies learned more than Japanese kids exposed to traditional instruction, and if great care were taken to make sure the two groups were equal on characteristics related to learning, then that would suggest the choice of instructional regime might be driving the learning differences. Given the study’s limitations, that analysis could not be conducted.
The 1995 TIMSS collected survey data separate from the video study that can shed light on the question. Eighth grade math teachers were queried: how often do you ask students to do reasoning tasks? Table 1 shows the frequency of teachers’ answers, along with the average TIMSS score of students for each response category (in parentheses).
Teachers’ Reports on How Often They Ask Students to Do Reasoning Tasks
Never or Almost Never 
Some Lessons 
Most Lessons 
Every Lesson 

Japan 
0% 
7% (594) 
55% (604) 
37% (608) 
U.S. 
0% 
24% (495) 
50% (498) 
26% (514) 
Source: Table 5.11, IEA Third International Mathematics and Science Study (TIMSS), 19941995, page 160.
Note that the data support the view that eighth grade Japanese teachers emphasize reasoning more often than U.S. teachers. But the data also suggest that the difference can only explain a trivial amount of the JapanU.S. test score gap. The gap hovers around 100 points across response categories, comparable to the overall gap reported in 1995 (Japan scored 605, the U.S., 500). The withincountry difference between teachers who include reasoning tasks in every lesson versus teachers who only present them in some lessons is only 14 scale score points in Japan and 19 points in the U.S. Indeed, even if 100% of U.S. teachers had said they emphasize reasoning in every lesson—and the 514 TIMSS score for the category held firm—the achievement gap between the two countries would only contract negligibly. This suggests the overall test score difference between the two countries is driven by other factors.
What are those other factors? Green dismisses cultural differences or the contribution of instruction outside school to Japanese math achievement. This is puzzling. There is no discussion of Japanese parents drilling children in math at home or of the popularity of Kumon centers that focus on basic skills.^{[3] }And juku gets not a single mention in Green’s article. Juku, commonly known as “cram school,” is the private, afterschool instruction that most Japanese students receive, especially during middle school as they prepare for high school entrance exams. Jukus are famous for focusing on basic skills, drill and practice, and memorization.^{[4]} Japanese public schools have the luxury of offloading these instructional burdens to jukus.
An alternative hypothesis to Green’s story is this: perhaps because of jukus Japanese teachers can take their students’ fluency with mathematical procedures for granted and focus lessons on problem solving and conceptual understanding. American teachers, on the other hand, must teach procedural fluency or it is not taught at all.
Green’s article depicts American math classrooms as boring, unhappy places and Japanese classrooms as vibrant and filled with joy. She cites no data other than her own impressions from classroom observations and the assertions of advocates of reformoriented instruction. It is odd that she didn’t examine the Program for International Assessment (PISA) or TIMSS data on enjoyment because both assessments routinely survey students from randomlysampled classrooms and ask whether they enjoy learning mathematics.^{[5]}
American students consistently report enjoying math more than Japanese students. In response to the statement, “I look forward to my mathematics lessons,” posed on PISA, the percentage of U.S. 15yearolds agreeing in 2012 was 45.4%, compared to 33.7% in Japan. To the prompt, “I do mathematics because I enjoy it,” the percentage agreeing was 36.6% in the U.S. and 30.8% in Japan.^{[6]} The differences between countries are statistically significant.
TIMSS asks younger students whether they like learning math.^{[7]} Among 8^{th} graders, the American results are pretty grim. Only 19% say they enjoy learning math, while 40% do not like it, more than a 2to1 ratio disliking the subject. But American students are downright giddy compared to students in Japan. Only 9% of Japanese 8^{th} graders say they like learning math and 53% do not like it, almost a whopping 6to1 ratio of disliking to liking. Fourth graders in both countries like the subject more than eighth graders, but in the U.S. the like to dislike ratio is about 2to1 (45% to 22%) and in Japan it’s close to even (29% to 23%).^{[8]}
Green’s impressions are based on observations in nonrandomly selected classrooms. They suggest that American students dislike math and Japanese students love it. But empirical evidence collected by more scientific methods finds exactly the opposite.
Japanese and American math scores are headed in opposite directions, but the trend is not what you’d guess after reading the New York Times article. Japan’s scores are going down, and U.S. scores are going up. The first international assessment of math, a precursor to today’s TIMSS test, took place in 1964. Twelve nations took part. Japan ranked second, the U.S. eleventh (outscoring only Sweden).^{[9]} If the scores are converted to standard deviation units (SD), Japan scored 0.9 SD higher than the U.S (all scores in this section refer to eighth grade).
Jump ahead about five decades. On the 2011 TIMSS, Japan still outscored the U.S., but by a smaller amount: 0.61 SD. Most of the narrowing occurred after 1995. From 1995 to 2011, the average scale score for Japan’s eighth graders fell 11 points (from 581 to 570) while the U.S. eighth graders gained 17 points (from 492 to 509). Japan’s decline and the U.S.’s increase are both statistically significant.
This pokes a huge hole in Green’s story. She attributes Japan’s high math proficiency to teaching reforms adopted in the 1980s and 1990s, but does not acknowledge that Japan was doing quite well—and even better than today relative to the U.S. —on international math tests in the 1960s. If Japan now outscores the U.S. because of superior teaching, how could it possibly have performed better on math tests in the 1960s? According to Green, the 1960s were the bad old days of Japanese math instruction focused on rote learning. And what about the decline in Japan’s math achievement since 1995? Is this really the nation we should look to for guidance on improving math instruction?
Green blames the demise of American math reform in the 1990s on the failure to adequately prepare teachers for change. She does not even mention “the math wars,” the intense political battles that were fought in communities across the country when new math programs were introduced. California deserves attention since Green holds up the 1985 California math framework as an example of that era’s push towards “teaching for understanding.”
The 1985 and 1992 California math frameworks were indeed crowning achievements of progressive math reformers, but as new textbooks and programs began trickling into schools, a coalition of parents and mathematicians arose in vehement opposition. The charge was that the frameworks—and their close cousin, the 1989 NCTM standards—contained weak mathematical content. One reform program, Mathland, attained notoriety for replacing textbooks with kits of manipulatives and for delaying or omitting the teaching of standard algorithms.^{[10]}
In 1999, new state standards were written by four mathematicians from Stanford University. The standards repudiated the previous state framework and the NCTM standards. Although math reformers in California opposed the new standards, they could not claim that the authors lacked a conceptual understanding of mathematics or viewed math as the robotic execution of procedures. The standards focused on clearly stated content objectives for each grade level, and avoided recommending instructional strategies. They encouraged the development of computation skills, conceptual understanding, and problem solving.
The notion that classroom teachers’ blind devotion to procedures or memorization led to the failure of 1990s math reform in the U.S. is ahistorical. Indeed, Green cites no historical accounts of that period to support the claim. Moreover, the suggestion that teachers were left on their own to figure out how to change their teaching is inaccurate. Throughout the 1990s, the NCTM standards were used as a template for the development of standards and assessments in states across the land. Education school professors in the late 1990s overwhelmingly supported math reform.^{[11]}
The federal government deployed powerful resources to promote math reform, and the National Science Foundation spent hundreds of millions of dollars training teachers in three different systemic reform initiatives. The National Assessment of Educational Progress (NAEP) rewrote its math framework and redesigned its math test to reflect the NCTM standards. In 1999, the U.S. Department of Education endorsed several reform-oriented math programs. But a petition signed by over 200 mathematicians, educators, and scientists appeared in the Washington Post on November 18, 1999, renouncing the list of recommended programs.
Math reform in the U.S. is typically the offspring of government power wedded to education school romanticism. David Klein has written a succinct account of twentieth-century American math reforms. E.D. Hirsch’s intellectual history of curricular reform attributes the periodic rise of progressive movements to the ideological “thought world” that dominates education schools. Contrary to Elizabeth Green’s account, these histories conclude that math reform movements have repeatedly failed not because of stubborn teachers who cling to tired, old practices but because the reforms have been—there are no other words for it—just bad ideas.
Algorithms are procedures. When the Common Core states that elementary students will learn standard algorithms—the conventional methods for adding, subtracting, multiplying, and dividing numbers—it is saying students will learn procedures. Fluency with basic facts (e.g., 6 + 7 = 13, 18 − 9 = 9) is attained through memorization. Nothing in the Common Core discourages memorization. The primary authors of the Common Core math standards, William McCallum and Jason Zimba, have been clear that the Common Core is neutral on pedagogy, with teachers free to choose the instructional strategies—traditional or progressive or whatever—that they deem best.^{[12]} The Common Core is about content, not pedagogy. As the Common Core State Standards (CCSS) website adamantly proclaims, “Teachers know best about what works in the classroom. That is why these standards establish what students need to learn, but do not dictate how teachers should teach. Instead, schools and teachers decide how best to help students reach the standards.”^{[13]}
That does not mean the Common Core won’t be used to promote constructivist pedagogy or to suppress traditional instruction. The protests of CCSS authors that the standards are being misinterpreted may not be enough. The danger emanates from what I’ve previously described as “dog whistles” embedded in the Common Core.^{[14]} The CCSS math documents were crafted to incorporate ideas (CCSS advocates would say the best ideas) from both traditional and progressive perspectives in the “math wars.” That is not only politically astute, but it also reflects the current state of research on effective mathematical instruction. Scholarly reviews of the literature have raised serious objections to constructivism. The title of an influential 2006 review published in Educational Psychologist says it all: “Why Minimal Guidance During Instruction Does Not Work: An Analysis of the Failure of Constructivist, Discovery, Problem-Based, Experiential, and Inquiry-Based Teaching.”^{[15]} Unfortunately, the Common Core—and in particular the Standards for Mathematical Practice—contain enough shorthand terms related to constructivist pedagogy that true believers in inquiry-based math reform can take them as license for imposing their ideology on teachers.
In its one-sided support for a particular style of math instruction, Elizabeth Green’s article acts as a megaphone for these dog whistles, the misguided notions that, although seemingly innocuous to most people, are packed with meaning for partisans of inquiry-based learning. Green’s article is based on bad science, bad history, and unfortunate myths that will lead us away from, rather than closer to, the improvement of math instruction in the United States.
[1] Green’s choice of math reforms to list—all of which, except for the Common Core, tried to change how math is taught—is bound to mislead one into thinking that math reform’s implementation problems are primarily related to instruction.
[3] A 1994 Chicago Tribune article describes a local student who happily gets after-school Kumon lessons. Note the reference to schools of that time “forging a path toward a greater understanding of math concepts.”
[4] Ironically, an op-ed published in August 2014 in the New York Times on hagwons, the Korean version of jukus, attributes Korea’s high PISA scores to hagwon instruction. It is inexplicable that hagwon instruction could mean so much to Korea’s test score success in that article while juku instruction does not even warrant mention in an article on Japan’s high scores.
[5] I devoted a section of the 2006 Brown Center Report to “the happiness factor” in education.
[6] OECD. PISA 2012 Results: What Students Know and Can Do. Student Performance in Mathematics, Reading and Science. Table III.3.4f.
[7] Ina V.S. Mullis, Michael O. Martin, Pierre Foy, and Alka Arora. TIMSS 2011 International Results in Mathematics. Chapter 8, Exhibits 8.1 (p. 330) and 8.2 (p. 332).
[9] International Study of Achievement in Mathematics: A Comparison of Twelve Countries (Vols. 1–2), edited by T. Husén (New York: John Wiley & Sons, 1967).
[10] At its peak, Mathland’s publisher claimed that the program was the most popular in California. Today, it is no longer published.
[11] A snippet from a 1997 survey of education professors conducted by Public Agenda: The process of learning is more important to education professors than whether or not students absorb specific knowledge. Nearly 9 in 10 (86%) say when K-12 teachers assign math or history questions, it is more important for kids to struggle with the process of finding the right answers than knowing the right answer. “We have for so many years said to kids ‘What's 7+5?’ as if that was the important thing. The question we should be asking is ‘Give me as many questions whose answer is 12...,’” said a Chicago professor who was interviewed for this study.
[12] McCallum on CCSS: “They just say what we want students to learn.” And Jason Zimba on misinterpreting the Practice Standards to diminish traditional content: “I sometimes worry that talking about the practice standards can be a way to avoid talking about focus and specific math content. Until we see fewer topics and a strong focus on arithmetic in elementary grades, we really aren't seeing the standards being implemented.”
In my May Chalkboard post, I presented Pressman and Wildavsky’s classic implementation model as a guide to analyzing the implementation of the Common Core State Standards (CCSS). With policies that span multiple layers of governance, decision points at every level influence the fate of implementation. Negotiating each level of governance also leaves policies vulnerable to attack by political opponents. When 45 states and the District of Columbia initially adopted the Common Core and the federal government supported the effort through Race to the Top funding, the initiative’s opponents did not simply roll up their tents and disappear. They reorganized in several states to launch new battles, and are now preparing to fight in many districts as well.
In this essay, I continue the analysis of curriculum’s role in implementing CCSS. I discuss key curricular decisions that will be encountered as CCSS makes its way through the school system, and the potential political controversies that this process may provoke. There are two pathways that analysts must pay attention to while tracking CCSS’s implementation, one systemic and the other political. The systemic path mostly comprises the activities of education’s “insiders”—educators, officials, publishers—whose daily work routinely shapes curriculum.^{[1]} The political path focuses primarily on “outsiders,” in particular, the forums where CCSS’s opponents can try to influence, and perhaps even block, implementation.
Throughout the history of American education, when reformers’ pet ideas have failed, the failure has been laid at the feet of “poor implementation.” Rarely is implementation defined, or the vast policy literature on implementation cited, so that one can determine what exactly led to the policy’s demise.^{[2]}
In my May Chalkboard post, I defined implementation of the CCSS as: the decisions that educators make—at national, state, district, school, and classroom levels—to realize the curriculum, instruction, assessment, and accountability systems of the Common Core. Let’s apply the definition more specifically to curriculum. Curriculum is the “stuff” of learning, the content of what is taught in school—especially as embodied in the materials used in instruction. Table 1 exhibits an overview of activities that will be key to implementing curriculum under the Common Core. The activities encompass how educators (or education officials) produce, select, and organize instructional materials in accordance with the Common Core.
The table’s entries stop at the district level. The next Chalkboard post in this series will discuss the implementation of curriculum in schools and classrooms.
Table 1: Activities Key to the Implementation of CCSS
National efforts to implement CCSS curricula have been underway since the standards were first written and states adopted them. Publishers have produced materials that they claim are aligned with CCSS, although two recent reviews have questioned that claim.^{[4]} The federal government provided funds supporting new materials through Race to the Top, the Investing in Innovation Fund (i3), and Title I programs. Professional groups of educators (e.g., NCTM, NCTE), as well as non-educators (e.g., Achieve), have designed websites and issued publications offering guidance on curriculum.
The U.S. Department of Education supports the CCSS while Congress has stayed mostly silent. The key political forum to watch now at the national level is the media, including social media. Public opinion on CCSS is malleable; polls indicate about half of the American people have never heard of Common Core.^{[5]} In the past few months, math problems attributed to the CCSS were ridiculed on television by comedians Louis C.K. and Stephen Colbert. Fairly or unfairly, becoming the butt of jokes cannot enhance the public’s perceptions of Common Core curriculum. How the public comes to view the CCSS will play a role in dictating its fate.
In twenty states, textbooks are adopted for all districts at the state level.^{[6]} The word “textbook” here refers to curricular programs comprising both digital and hard copy materials, including supplementary workbooks, worksheets, and games, as well as formative and summative assessments. State adoption states, as they are called, typically identify the programs that local districts may purchase using state funds. The remaining states give districts greater latitude in selecting materials, although even these states frequently issue guidance on selection (note that in the table’s mention of states’ advisory websites, Oregon is a state textbook adoption state, but New York and Illinois are not). Most states have already been working on implementing CCSS curriculum for a few years. In a 2011 survey, the U.S. Department of Education found that twenty-nine states had “provided instructional materials or curriculum assistance for the CCSS.”^{[7]} In a 2013 survey conducted by the Center on Education Policy at George Washington University, thirty states reported that CCSS-aligned curricula were currently in use in classrooms, with nine more states slated to begin in 2013-2014.^{[8]}
As mentioned above, recent academic reviews of textbooks have questioned publishers’ claim that materials are aligned with the Common Core. William Schmidt of Michigan State University called the assertion a “sham” and according to Education Week, “dismissed most purveyors of such claims as ‘snake oil salesmen’ who have done little more than slap shiny new stickers on the same books they’ve been selling for years.”^{[9]} Schmidt’s team of researchers reviewed 35 textbook series.
Local educators seem to view alignment differently. Seattle Public Schools managed to find four finalists for the 2014 K-5 math adoption, even though alignment with CCSS was an ironclad requirement.^{[10]} California’s 2014 K-8 math adoption includes 20 approved texts. The selection process was legislatively mandated (AB 1246) to choose only curriculum materials aligned with the CCSS.^{[11]} Districts will now pick from among these state-approved programs. Obviously, well-informed people can disagree on what constitutes alignment. It’s also important to note that one of California’s state-approved programs, Connected Mathematics, encountered strenuous opposition in several of the state’s districts when it was introduced in the 1990s. That legacy, and memories of the “math wars” that still linger in many communities, surely could influence decisions made today about math curricula.
In addition to selecting curriculum and offering professional development on its use to teachers and principals, districts decide how to organize curriculum and students into instructional units. The match of students and curriculum—deciding who takes what and when they take it—has a long history of controversy in middle and high schools.
Table 1 lists three decisions concerning how to organize curriculum that local educators face in implementing the Common Core: reorganizing history and social science, deciding between integrated or traditional high school math, and tracking middle schoolers who take advanced math. In Boston, district leaders received sharp criticism when it was thought that they had decided to abolish history-social science departments and combine them with English-language arts. Faced with backlash and petitions^{[12]} against the policy, the Superintendent issued a statement clarifying the district plan for “improving and coordinating the use of instructional materials throughout all subject areas” by bringing separate history and social studies, English language arts, and world languages departments together under the “humanities umbrella.”^{[13]} This was in response to the Common Core calling for students to read more nonfiction texts and schools to incorporate historical documents into literacy instruction. The decision also comports with a longstanding strain of reform seeking to integrate traditional subject areas into multidisciplinary courses; combining history and ELA, for example, is often an initial target of integration.
A second decision also involves curriculum integration but applies exclusively to high school mathematics. Most countries other than the U.S. do not teach high school math in courses organized by topic (Algebra, Geometry, etc.) but instead integrate topics into a sequence of single-year courses (Math 1, Math 2, etc.). Integrated math remains an exotic course option in the U.S. Only two percent of 12^{th} graders in the 2013 National Assessment of Educational Progress (NAEP) said they were taking an integrated Math 4 course. Only three percent of these same twelfth graders reported that they took an integrated Math 3 course in 11^{th} grade, and four percent reported taking an integrated Math 2 course in 10^{th} grade. American math reformers have long dreamed of joining European and Asian high schools in offering a sequence of integrated high school math courses. The idea has never taken hold. Common Core furnishes an opportunity to advance this idea.
The authors of the Common Core math standards wrote two sets of standards for high school, one for the traditional sequence and one for integrated math. By treating both approaches as if they have equal standing—regardless of the overwhelming relative popularity of the traditional sequence—the CCSS cannot help but be regarded as prying open a window for integrated math courses. The nation’s schools have evidenced a lopsided preference for the traditional sequence, but the two approaches are granted parity in the Common Core. Neutrality, in this case, is a tacit endorsement. And it carries significant consequences for implementation. State, district, and school administrators who have long wished and waited for an integrated sequence of math courses are licensed to push this approach as a “reform” sanctioned by the CCSS.
Integrated math isn’t new. How have previous attempts fared in implementation? Not well. The most recent case comes from the state of Georgia. The state began easing towards integrated math as the preferred high school sequence in 2004 and pushed districts to implement it by 2009. Many resisted, and a statewide debate ensued. By 2011, Georgia school districts started going back to the traditional sequence, with some districts, ironically, citing the Common Core as their rationale.^{[14]} At the same time, districts in other parts of the country were citing Common Core as a reason for converting to integrated math. The Santa Barbara Unified School District, for example, adopted the integrated sequence in 2014.^{[15]}
The third decision that districts face involves tracking. How will differentiation occur within a core curriculum that is ostensibly “common” to all students? Taking Algebra I in eighth grade or earlier, a practice that has become increasingly popular since the 1990s, is a sticking point with CCSS. The CCSS math standards include many algebra topics at eighth grade, but do not offer a formal course in Algebra I until ninth grade. Williamson Evers, Sandra Stotsky, and Ze’ev Wurman are among those opposing Common Core for, among other objections, retreating from the eighth grade algebra course.^{[16]} On the 2013 NAEP, 43 percent of eighth graders reported that they were taking an algebra course and five percent were enrolled in geometry, presumably having taken Algebra I in seventh grade. Thus, about half of all students are affected, and they include the nation’s highest achievers in mathematics.^{[17]} How will districts modify the curriculum serving middle school high achievers and still adhere to the CCSS?
The CCSS describes “compacted” and “accelerated pathways” (see Appendix A in CCSS) that would allow students to complete Algebra I by the end of eighth grade, but it’s unclear, especially if students then take integrated courses in high school, how this will all mesh to form a coherent course of study. The CCSS makes no mention at all of accommodating students who complete Geometry before ninth grade. The challenge of providing courses for accelerated students is not new, of course, so it is not that the CCSS created this problem. But it does force districts to reconsider the manner in which they have ordered existing curriculum into a sequence of courses, and by doing so, may rekindle political battles that were fought years before—and settled by districts’ current course offerings.
In this post, I have traced the implementation of Common Core curriculum from the national to the district level. A future post will focus on implementing CCSS curriculum in schools and classrooms, and additional posts will analyze the implementation of CCSS’s other key policy dimensions—instruction, assessment, and accountability—following the same analytical strategy of examining how they are implemented down through the educational system.
In terms of curriculum’s role in implementing the Common Core, three overarching conclusions can be drawn from the foregoing analysis.
[1] An alliance of inside-outside political actors with shared interest in a particular policy is often called a “policy network.” For a discussion of policy networks promoting 1960s’ and 1990s’ math reforms, see The Great Curriculum Debate (Loveless, 2001).
[3] States in which textbooks are adopted for all districts at the state level by the department of education or board of education are called “state adoption states.” I refer to states in which this is not the case as “non-adoption states.”
[6] http://www.afb.org/info/programsandservices/professionaldevelopment/solutionsforum/stateadoptionoftextbooks/1235 An analysis of the politics of textbook adoption is offered in Wong and Loveless (1991).
[7] Ann Weber, et al., State Implementation of Reforms Promoted Under the Recovery Act (USDOE, 2014). http://ies.ed.gov/ncee/pubs/20144011/
[8] http://www.huntintersection.com/2013/10/03/tenbigtakeawaysfromcepsresearchonstateimplementationofthecommoncore/
[10] The district selected “Math in Focus” (Singapore math) but is now embroiled in controversy over a school waiver process that may allow schools to adopt “enVision Math,” the second-place finisher.
[12] http://www.change.org/petitions/mayormartinjwalshreinstatethedepartmentofhistorysocialstudiesasacoreacademiccontentdisciplineinthebostonpublicschools?recruiter=177161&utm_campaign=twitter_link_action_box&utm_medium=twitter&utm_source=share_petition%20via%20@Change
[13] http://www.bostonpublicschools.org/site/default.aspx?PageType=3&DomainID=4&ModuleInstanceID=14&ViewID=047E6BE36D8741308424D8E4E9ED6C2A&RenderLoc=0&FlexDataID=4091&PageID=1
[14] The Atlanta Journal-Constitution covered the debate. Be sure to read Maureen Downey’s columns on the controversy. http://www.ajc.com/news/news/local/integratedmathcouldbeoutinhighschools/nQM6Y/
[16] http://www.usnews.com/news/specialreports/articles/2014/02/25/thecommoncoremathstandardscontentandcontroversy
[17] NAEP Data Explorer. http://nces.ed.gov/nationsreportcard/about/naeptools.asp
In my May Chalkboard post, I presented Pressman and Wildavsky’s classic implementation model as a guide to analyzing the implementation of the Common Core State Standards (CCSS). With policies that span multiple layers of governance, decision points at every level influence the fate of implementation. Negotiating each level of governance also leaves policies vulnerable to attack by political opponents. When 45 states and the District of Columbia initially adopted the Common Core and the federal government supported the effort through Race to the Top funding, the initiative’s opponents did not simply roll up their tents and disappear. They reorganized in several states to launch new battles, and are now preparing to fight in many districts as well.
In this essay, I continue the analysis of curriculum’s role in implementing CCSS. I discuss key curricular decisions that will be encountered as CCSS makes its way through the school system, and the potential political controversies that this process may provoke. There are two pathways that analysts must pay attention to while tracking CCSS’s implementation, one systemic and the other political. The systemic path mostly comprises the activities of education’s “insiders”—educators, officials, publishers—whose daily work routinely shapes curriculum.^{[1]} The political path focuses primarily on “outsiders,” in particular, the forums where CCSS’s opponents can try to influence, and perhaps even block, implementation.
Throughout the history of American education, when reformers’ pet ideas have failed, the failure has been laid at the feet of “poor implementation.” Rarely is implementation defined, or the vast policy literature on implementation cited, so that one can determine what exactly led to the policy’s demise.^{[2]}
In my May Chalkboard post, I defined implementation of the CCSS as: the decisions that educators make—at national, state, district, school, and classroom levels—to realize the curriculum, instruction, assessment, and accountability systems of the Common Core. Let’s apply the definition more specifically to curriculum. Curriculum is the “stuff” of learning, the content of what is taught in school—especially as embodied in the materials used in instruction. Table 1 exhibits an overview of activities that will be key to implementing curriculum under the Common Core. The activities encompass how educators (or education officials) produce, select, and organize instructional materials in accordance with the Common Core.
The table’s entries stop at the district level. The next Chalkboard post in this series will discuss the implementation of curriculum in schools and classrooms.
Table 1: Activities Key to the Implementation of CCSS
National efforts to implement CCSS curricula have been underway since the standards were first written and states adopted them. Publishers have produced materials that they claim are aligned with CCSS, although two recent reviews have questioned that claim.^{[4]} The federal government provided funds supporting new materials through Race to the Top, the Investing in Innovation Fund (I3), and Title I programs. Professional groups of educators (e.g., NCTM, NCTE), as well as noneducators (e.g., Achieve), have designed websites and issued publications offering guidance on curriculum.
The U.S. Department of Education supports the CCSS while Congress has stayed mostly silent. The key political forum to watch now at the national level is the media, including social media. Public opinion on CCSS is malleable; polls indicate about half of the American people have never heard of Common Core.^{[5]} In the past few months, math problems attributed to the CCSS were ridiculed on television by comedians Louis CK and Stephen Colbert. Fairly or unfairly, becoming the butt of jokes cannot enhance the public’s perceptions of Common Core curriculum. How the public comes to view the CCSS will play a role in dictating its fate.
In twenty states textbooks are adopted for all districts at the state level.^{[6]} The word “textbook” here refers to curricular programs comprising both digital and hard copy materials, including supplementary workbooks, worksheets, and games, as well as formative and summative assessments. State adoption states, as they are called, typically identify the programs that local districts may purchase using state funds. The remaining states give districts greater latitude in selecting materials, although even these states frequently issue guidance on selection (note that in the table’s mention of states’ advisory websites, Oregon is a state textbook adoption state, but New York and Illinois are not). Most states have already been working on implementing CCSS curriculum for a few years. In a 2011 survey, the U.S. Department of Education found that twentynine states had “provided instructional materials or curriculum assistance for the CCSS.”^{[7]} In a 2013 survey conducted by the Center for Education Policy at George Washington University, thirty states reported that CCSSaligned curricula were currently in use in classrooms, with nine more states slated to begin in 20132014.^{[8]}
As mentioned above, recent academic reviews of textbooks have questioned publishers’ claim that materials are aligned with the Common Core. William Schmidt of Michigan State University called the assertion a “sham” and according to Education Week, “dismissed most purveyors of such claims as ‘snake oil salesmen’ who have done little more than slap shiny new stickers on the same books they’ve been selling for years.”^{[9]} Schmidt’s team of researchers reviewed 35 textbook series.
Local educators seem to view alignment differently. Seattle Public Schools managed to find four finalists for the 2014 K5 math adoption, even though alignment with CCSS was an ironclad requirement.^{[10]} California’s 2014 K8 math adoption includes 20 approved texts. The selection process was legislatively mandated (AB 1246) to choose only curriculum materials aligned with the CCSS.^{[11]} Districts will now pick from among these state approved programs. Obviously, wellinformed people can disagree on what constitutes alignment. It’s also important to note that one of California’s stateapproved programs, Connected Mathematics, encountered strenuous opposition in several of the state’s districts when it was introduced in the 1990s. That legacy, and memories of the “math wars” that still linger in many communities, surely could influence decisions made today about math curricula.
In addition to selecting curriculum and offering professional development on its use to teachers and principals, districts decide how to organize curriculum and students into instructional units. The match of students and curriculum—deciding who takes what and when they take it—has a long history of controversy in middle and high schools.
Table 1 lists three decisions concerning how to organize curriculum that local educators face in implementing the Common Core: reorganizing history and social science, deciding between integrated or traditional high school math, and tracking middle schoolers who take advanced math. In Boston, district leaders received sharp criticism when it was thought that they had decided to abolish historysocial science departments and combine them with Englishlanguage arts. Faced with backlash and petitions^{[12]} against the policy, the Superintendent issued a statement clarifying the district plan for “improving and coordinating the use of instructional materials throughout all subject areas” by bringing separate history and social studies, English language arts, and world languages departments together under the “humanities umbrella.”^{[13]} This was in response to the Common Core calling for students to read more nonfiction texts and schools to incorporate historical documents into literacy instruction. The decision also comports with a longstanding strain of reform seeking to integrate traditional subject areas into multidisciplinary courses; combining history and ELA, for example, is often an initial target of integration.
A second decision also involves curriculum integration but applies exclusively to high school mathematics. Most countries other than the U.S. do not teach high school math in courses organized by topic (Algebra, Geometry, etc.) but instead integrate topics into a sequence of single year courses (Math 1, Math 2, etc.). Integrated math remains an exotic course option in the U.S. Only two percent of 12^{th} graders in the 2013 National Assessment of Educational Progress (NAEP) said they were taking an integrated Math 4 course in 12^{th} grade. Only three percent of these same twelfth graders reported that they took an integrated Math 3 course in 11^{th} grade, and four percent reported taking an integrated Math 2 course in 10^{th} grade. American math reformers have long dreamed of joining European and Asian high schools in offering a sequence of integrated high school math courses. The idea has never taken hold. Common Core furnishes an opportunity to advance this idea.
The authors of the Common Core math standards wrote two sets of standards for high school, one for the traditional sequence and one for integrated math. By treating both approaches as if they have equal standing—regardless of the overwhelming relative popularity of the traditional sequence—the CCSS cannot help but be regarded as prying open a window for integrated math courses. The nation’s schools have evidenced a lopsided preference for the traditional sequence, but the two approaches are granted parity in the Common Core. Neutrality, in this case, is a tacit endorsement. And it carries significant consequences for implementation. State, district, and school administrators who have long wished and waited for an integrated sequence of math courses are licensed to push this approach as a “reform” sanctioned by the CCSS.
Integrated math isn’t new. How have previous attempts fared in implementation? Not well. The most recent case comes from the state of Georgia. The state began easing towards integrated math as the preferred high school sequence in 2004 and pushed districts to implement it by 2009. Many resisted, and a statewide debate (and irony) ensued. By 2011, Georgia school districts started going back to the traditional sequence, with some districts citing the Common Core as their rationale.^{[14]} At the same time, districts in other parts of the country were citing Common Core as a reason for converting to integrated math. The Santa Barbara Unified School District, for example, adopted the integrated sequence in 2014.^{[15]}
The third decision that districts face involves tracking. How will differentiation occur within a core curriculum that is ostensibly “common” to all students? Taking Algebra I in eighth grade or earlier, a practice that has become increasingly popular since the 1990s, is a sticking point with CCSS. The CCSS math standards include many algebra topics at eighth grade but do not offer a formal course in Algebra I until ninth grade. Williamson Evers, Sandra Stotsky, and Ze’ev Wurman are among those opposing Common Core for, among other objections, retreating from the eighth grade algebra course.[16] On the 2013 NAEP, 43 percent of eighth graders reported that they were taking an algebra course, and five percent were enrolled in geometry, presumably having taken Algebra I in seventh grade. Thus, about half of all students are affected, and they include the nation’s highest achievers in mathematics.[17] How will districts modify the curriculum serving middle school high achievers and still adhere to the CCSS?
The CCSS describes “compacted” and “accelerated pathways” (see Appendix A in CCSS) that would allow students to complete Algebra I by the end of eighth grade, but it’s unclear, especially if students then take integrated courses in high school, how this will all mesh to form a coherent course of study. The CCSS makes no mention at all of accommodating students who complete Geometry before ninth grade. The challenge of providing courses for accelerated students is not new, of course, so it is not that the CCSS created this problem. But it does force districts to reconsider the manner in which they have ordered existing curriculum into a sequence of courses, and by doing so, may rekindle political battles that were fought years before—and settled by districts’ current course offerings.
In this post, I have traced the implementation of Common Core curriculum from the national to the district level. A future post will focus on implementing CCSS curriculum in schools and classrooms, and additional posts will analyze the implementation of CCSS’s other key policy dimensions—instruction, assessment, and accountability—following the same analytical strategy of examining how they are implemented down through the educational system.
In terms of curriculum’s role in implementing the Common Core, three overarching conclusions can be drawn from the foregoing analysis.
[1] An alliance of inside-outside political actors with a shared interest in a particular policy is often called a “policy network.” For a discussion of the policy networks promoting the math reforms of the 1960s and 1990s, see The Great Curriculum Debate (Loveless, 2001).
[3] States in which textbooks are adopted for all districts at the state level by the department of education or board of education are called “state adoption states.” I refer to states in which this is not the case as “non-adoption states.”
[6] http://www.afb.org/info/programsandservices/professionaldevelopment/solutionsforum/stateadoptionoftextbooks/1235 An analysis of the politics of textbook adoption is offered in Wong and Loveless (1991).
[7] Ann Weber, et al., State Implementation of Reforms Promoted Under the Recovery Act (USDOE, 2014). http://ies.ed.gov/ncee/pubs/20144011/
[8] http://www.huntintersection.com/2013/10/03/tenbigtakeawaysfromcepsresearchonstateimplementationofthecommoncore/
[10] The district selected “Math in Focus” (Singapore math) but is now embroiled in controversy over a school waiver process that may allow schools to adopt “enVision Math,” the second-place finisher.
[12] http://www.change.org/petitions/mayormartinjwalshreinstatethedepartmentofhistorysocialstudiesasacoreacademiccontentdisciplineinthebostonpublicschools?recruiter=177161&utm_campaign=twitter_link_action_box&utm_medium=twitter&utm_source=share_petition%20via%20@Change
[13] http://www.bostonpublicschools.org/site/default.aspx?PageType=3&DomainID=4&ModuleInstanceID=14&ViewID=047E6BE36D8741308424D8E4E9ED6C2A&RenderLoc=0&FlexDataID=4091&PageID=1
[14] The Atlanta Journal-Constitution covered the debate. Be sure to read Maureen Downey’s columns on the controversy. http://www.ajc.com/news/news/local/integratedmathcouldbeoutinhighschools/nQM6Y/
[16] http://www.usnews.com/news/specialreports/articles/2014/02/25/thecommoncoremathstandardscontentandcontroversy
[17] NAEP Data Explorer. http://nces.ed.gov/nationsreportcard/about/naeptools.asp
Most analysts agree that the success or failure of the Common Core State Standards (CCSS) hinges on implementation. But the term has been ambiguous. Advocates of CCSS talk about aligned curriculum, instructional shifts, challenging assessments that test critical thinking, and rigorous accountability systems that produce an accurate appraisal of whether students are on track to be college- or career-ready by the time they graduate from high school. These descriptions are unsatisfying. Heavy with flattering adjectives, they echo the confidence proponents have that CCSS will improve several important aspects of schooling. But such confidence may be misplaced; for example, decades—if not centuries—of effort have been devoted to the perfection of instruction. Moreover, when CCSS’s advocates talk about implementation, it seems to mean every important activity in education outside of adopting standards. By meaning almost everything, it means nothing.
This Chalkboard post begins a series on implementation of the CCSS, with an examination of curriculum as an aspect of implementation. Future posts will look at instruction, assessment, and accountability. I start with a framework for thinking about implementation. This conceptual framework will guide the current analysis as well as future posts. I will mostly discuss CCSS’s mathematics standards, primarily because I know more about them than the ELA standards, but also because the skills and knowledge expressed in math standards have a clarity that ELA standards lack. That said, I will bring ELA standards—and standards in other subjects that CCSS does not yet encompass—into the discussion when appropriate. I will also draw on the public policy literature on implementation. The goal is to discuss the implementation of CCSS analytically.
In the field of policy analysis, the classic text on implementation is Jeffrey Pressman and Aaron Wildavsky’s Implementation, published in 1973. The book’s 45-word subtitle—surely one of the longest for such an influential text—begins with the clause, “How Great Expectations in Washington Are Dashed in Oakland.” The book describes the saga of a federal redevelopment program in Oakland, California. The program’s designers started out with ample resources, broad political support, and the cooperation of all major federal, state, and local stakeholders, including powerful people in both government and the private sector. The path to successful implementation looked like a slam dunk. And yet the program failed.
What happened? The details of the program’s failure are not important here. But two big ideas that Pressman and Wildavsky highlight are generalizable to a lot of other policies, including the Common Core. Implementation involves step-by-step encounters with what Pressman and Wildavsky call “decision points,” a sequence of hurdles for the policy or program to clear. In the case of a program involving several layers of government, these decision points not only mean that the support of state and local officials must be held over time, but also that officials must make good decisions when exercising discretionary authority on the program’s behalf. Think of a child lining up several dozen dominoes, with the goal of pushing over the first domino in order to topple them all. If a single domino doesn’t do its job, the last domino will not fall. Every decision point in the implementation process exposes nascent programs to possible failure.
Policy makers are wildly optimistic about implementing new programs. Pressman and Wildavsky offer a mathematical insight into why this is so. Consider an implementation path in which the probability of negotiating any single decision point is quite high—say, 95 percent. A casual preview of implementation may lead one to conclude that since clearing points A, B, and C is easy, implementation will be easy. Such reasoning overlooks the reality that the probability of success shrinks as the number of decision points increases. With three decision points, the odds fall to about 86 percent (.95 x .95 x .95). It takes 14 decision points for the odds to drop below 50 percent. Then failure is more likely than success.
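Pressman and Wildavsky’s arithmetic is easy to verify. A minimal sketch, assuming each decision point is cleared independently with the same probability:

```python
# Pressman and Wildavsky's insight: if each decision point is cleared
# independently with probability p, the chance of clearing all n points
# is p ** n, which shrinks quickly as n grows.
def success_probability(p: float, n_points: int) -> float:
    return p ** n_points

p = 0.95
print(round(success_probability(p, 3), 3))  # 0.857 -- the "about 86 percent"

# Find the first number of decision points at which failure becomes more
# likely than success (probability of clearing them all drops below 50%).
n = 1
while success_probability(p, n) >= 0.5:
    n += 1
print(n)  # 14
```

The independence assumption is of course a simplification, but it shows why optimism compounds into fragility: even with 95 percent odds at every step, fourteen steps make failure the more likely outcome.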
A key assumption of Pressman and Wildavsky’s conceptual scheme is that implementation decision points are organized vertically, down through levels of government. There is also a certain amount of sequential dependence, as the domino analogy above implies. That may be true for a redevelopment program, but it’s not always true in education. I doubt that it’s true for Common Core. Education consists of loosely coupled organizational units (states, districts, schools, classes). Failure at one level may not be fatal to another. There can be good classes in bad schools, for example, good schools in bad districts, and so on. States or districts might bungle the CCSS, but savvy districts and schools could still rescue the standards and use them effectively.
Nevertheless, the vertical structure is useful for modeling how CCSS implementation will unfold. It is also useful for anticipating political opposition that the CCSS may encounter. Terry Moe has written extensively on the politics of “blocking.” When advocates of a particular education policy are victorious in the legislative arena, they have only won a battle, not a war. Opponents will show up again and again during implementation—in schools, or before school boards, or in other local forums—to continue the battle.
So let’s map the major points of vulnerability for the Common Core’s implementation. The project functions at the national, state, district, school, and classroom levels. At each of the five levels, decisions have been made or will be made regarding Common Core. The four crucial components of CCSS’s implementation—curriculum, instruction, assessment, and accountability—combine with the levels of decision making to create a minimum of twenty decision points. Imagine a 4 X 5 table with empty cells for the decision points. Future historians, by filling in the blank cells of the table, will tell the story of CCSS’s implementation.
Cells may comprise multiple decision points. In terms of curriculum, for example, twenty states have state textbook adoption, in which state boards and departments of education select the curricular materials that public schools may purchase. The other thirty states leave that decision up to districts, but typically provide funding for purchasing materials. Currently, states and districts are selecting math programs to reflect the CCSS, offering programs to train educators on how to use the new curricula, and purchasing new materials that are beginning to appear in schools and classrooms.
Note that the whole implementation process is bottom-heavy, leading ultimately to activities in the nation’s 98,817 public schools and in the classrooms within them. Historically, curriculum controversies reach their greatest intensity when curricular materials are introduced in classrooms. That is happening now with the Common Core. Common Core won the support of elites and cleared most upper-level decision points—all but a few states are on board with CCSS. Those high-level decisions are no longer the main events in CCSS’s implementation.
The emergence of social media as a tool for mobilizing political action has undoubtedly enhanced the power of actors at the lower-level decision points to sway implementation. Forty or fifty years ago, difficulties implementing a math program in a small rural district probably would not receive much notice. In the 1960s and 1970s, the failure of “new math” wasn’t apparent for several years, until surveys revealed teachers were not using the new curricula. During the last curriculum controversy in mathematics—the math wars of the 1990s—the internet was just beginning to be used for organizing people politically. Curriculum aligned with the 1989 standards of the National Council of Teachers of Mathematics was the source of the conflict. The website “Mathematically Correct” fostered a national network of opposition by tabulating local efforts to drive NCTM-oriented math programs out of the schools.
Today, a number of grassroots organizations have sprung up to fight CCSS. Poorly designed math problems are widely circulated on Twitter and criticized by bloggers. I will discuss this phenomenon in greater depth in my June Chalkboard post, but suffice it for now to say that these attacks on Common Core, whether justified or not, illustrate the vulnerability of CCSS curriculum as implementation unfolds and decision points multiply.
Shouldn’t we expect local educators to make good decisions when choosing curriculum that is compatible with the Common Core? As my colleagues Matt Chingos and Russ Whitehurst have documented, educators have very little evidence to go on when selecting curriculum. Evidence of effectiveness is in short supply. One of the rare randomized controlled trials of elementary math curricula was conducted by Mathematica. The study followed students through grades 1 and 2. Four math programs were evaluated, and although limiting the study to first and second grade curricula ensured that many common topics were covered, one of the programs produced very different results. Students in three of the programs (Math Expressions, Saxon, and Scott Foresman/Addison Wesley/enVision) scored about the same, but all three outscored the fourth program (Investigations) by a statistically significant amount (an effect size of about 0.22). A student at the 50th percentile who received instruction in Investigations in first and second grade would have scored at the 59th percentile if taught from one of the other programs.
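The percentile translation follows from the standard normal distribution: shifting a median student up by 0.22 standard deviations lands at roughly the 59th percentile. A quick check using only the standard library:

```python
import math

def normal_cdf(z: float) -> float:
    # Standard normal cumulative distribution, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

effect_size = 0.22  # the gap reported in the Mathematica study
# A student at the 50th percentile sits at z = 0; adding the effect size
# gives the percentile that student would reach in the stronger programs.
new_percentile = 100 * normal_cdf(0.0 + effect_size)
print(round(new_percentile))  # 59
```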
What do educators go by if they can’t select on effectiveness? One popular approach is alignment—how well math programs match up with the topics in the CCSS. This is a poor substitute for evidence of effectiveness. A well-aligned program covers the topics and objectives that CCSS lists for a particular grade level—it does not follow that the program covers them well. One program may cover Topic A well, and students will learn because of that. Another may cover Topic A poorly, and students will not learn. Both programs are aligned with Topic A.
Let’s conclude by returning to the question of defining implementation. What does the implementation of CCSS mean? I have drawn on and modified Pressman and Wildavsky’s implementation model to suggest a definition: the decisions that educators make—at national, state, district, school, and classroom levels—to realize the curriculum, instruction, assessment, and accountability systems of the Common Core. The CCSS implementation process will involve several decision points, with each one leaving the CCSS vulnerable to bad decisions by officials, who have scant evidence on which to act, and to the efforts of political opponents.
Forty-four states and the District of Columbia have adopted the Common Core State Standards in English language arts and mathematics. Despite initial enthusiasm, criticism of and outright opposition to the standards are beginning to arise. Tom Loveless, a senior fellow in the Brown Center on Education Policy at Brookings, explains how the Common Core came about, why some are opposed to it now, and what his research shows about its impact on student achievement.
Show notes:
• A Progress Report on the Common Core (Loveless)
• Predicting the Effect of the Common Core State Standards on Student Achievement (Loveless, in 2012 Brown Center Report)
• In Defense of the Common Core Standards (West and Bleiberg)
• Common Core Aligned Assessments: You Get What You Pay For? (Brown Center event)
• 3 Technical Choke Points that Could Sink the Common Core Tests (West and Bleiberg)
• Standardized Testing and the Common Core (Brown Center event)
• NAEP and the Common Core Standards (Loveless)
• The Common Core State Standards Initiative
The 2014 Brown Center Report on American Education (2014 BCR), released last week, included a study of homework. The study revisits a question investigated in the 2003 BCR: how much homework do American students have? Recent stories in the popular press have featured children burdened with an enormous amount of homework, three hours or more per night. Are these students' experiences typical or rare?
They are rare. According to 2012 NAEP data, only five percent of nine-year-olds, seven percent of 13-year-olds, and 13 percent of 17-year-olds had more than two hours of homework the day before filling out the student questionnaire.[i] MetLife’s 2007 survey of parents and children reports similar figures. Three percent of parents with children attending elementary schools estimated three hours or more of homework on a typical school day. For parents of secondary school children, the share was five percent. In the student surveys, only two percent of kids in grades 3-6 said they had three hours or more of homework, and only eight percent of kids in grades 7-12 said they had that much.
So only a small sliver of the overall population has as much as three hours of nightly homework. And yet on March 21, 2014, CNN reported dramatically different findings from a survey of 4,000 students conducted by researchers at Challenge Success (Stanford Graduate School of Education). The students in that survey had an average homework load of three hours, with some doing as much as five hours per night. The researchers also found that excessive homework was correlated with high levels of stress and health problems. Not only is the homework load onerous, the study concluded, it is also unhealthy for kids.
As a gauge of the national homework load, the study is profoundly flawed. It’s impossible to say whether the findings are even generalizable to the 10 high schools that the students attended. The schools (comprising the study’s sampling frame) are not representative. They consist of four public and six private schools, all in well-to-do suburban neighborhoods in California (median household income of $90,000). The schools are extraordinarily high-achieving, with 93 percent of graduates going to college. Fifty-four percent of the students who answered the survey are female. Only six percent are black or Hispanic. These statistics diverge sharply from national figures.
The sampling strategy was also vulnerable to skewing. Students were invited to fill out the questionnaire; they were not selected randomly. Self-selection can bias a sample by making a group of highly motivated subjects appear larger than it really is. Students who were unhappy with homework were probably more motivated to fill out the survey; those who were content or indifferent were probably less so. The authors of the study do not report how many students were initially invited to respond, making the response rate incalculable. The average public high school in California has more than 1,200 students—many suburban neighborhoods have larger schools—so the non-respondents probably outnumber the respondents. Four of the schools offered the questionnaire online, adding another opportunity for self-selection.
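The distorting effect of self-selection is easy to demonstrate with a toy simulation. The sketch below is purely illustrative: the distribution of homework hours and the response model are invented assumptions, not estimates from the Stanford data.

```python
import random

random.seed(1)

# Invented population: 100,000 students whose nightly homework hours are
# exponentially distributed with a mean of 1.0 hour.
population = [random.expovariate(1.0) for _ in range(100_000)]

def responds(hours: float) -> bool:
    # Assumed response model: heavier workloads make a student more
    # likely to answer a voluntary survey (10% baseline, capped at 90%).
    return random.random() < min(0.1 + 0.25 * hours, 0.9)

respondents = [h for h in population if responds(h)]

true_mean = sum(population) / len(population)
survey_mean = sum(respondents) / len(respondents)
# The self-selected sample overstates the typical homework load.
print(round(true_mean, 2), round(survey_mean, 2))
```

Under these made-up parameters the respondent average comes out well above the population average, even though nothing about the individual answers is dishonest; the bias comes entirely from who chooses to answer.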
A non-representative sampling frame followed by self-selection of respondents can produce misleading survey results. A famous example of this is the Literary Digest’s 1936 public opinion poll on the presidential election. The magazine mailed out 10 million postcards to its subscribers, automobile registrants, and telephone users. More than 2 million responses led to a clear prediction: the Republican candidate, Alf Landon, would receive 57 percent of the vote and easily defeat incumbent President Franklin Roosevelt, the Democratic candidate. But that didn’t happen. Roosevelt won in a landslide, getting 61 percent of the vote.
How could the survey be so wrong? The ten million voters who were surveyed leaned strongly Republican. The Digests’ subscribers tended to be from wealthier households, as did car owners and telephone users in 1936. Combine that with the propensity of disgruntled voters to return such a survey, and the ingredients for misleading results are in place.
Look, it’s a fact that some students have too much homework. And it’s plausible that in ten highpowered high schools one can find hundreds of students who are stressed out to the point of missing sleep, which is one of the study’s indicators of poor health. But that is not the norm. Stress from academic expectations is not the experience of the average American teen.
Laurence Steinberg, a psychologist at Temple University, has devoted his career to studying adolescents, including their cognitive and emotional development and health. His research has drawn on databases designed to be nationally representative (e.g., National Longitudinal Study of Adolescent Health). The tenth edition of Steinberg’s book, Adolescence, was published in 2013. Writing recently in Slate (February, 2014), Steinberg calls our high schools a “disaster,” primarily because they ask so little of students. Steinberg recommends “classes that really challenge students to work hard.”
Parents and teachers should monitor students for taking on an academic workload that is too stressful. But to conclude from anecdotal reports and case studies that American high school students are overworked would be wrong. There also is an element of the Stanford study that makes those students’ situation puzzling. Recall that six of the schools in the Stanford study are private schools. Highpowered, academicallyfocused high schools are not for everyone. School officials in such schools typically make the homework load clear to prospective students and their parents. AP classes aren’t for everyone either. It’s like going to a steak house for dinner and then getting sick from the menu because you’re a vegetarian. You made a bad choice. My advice is to eat somewhere else.
[i] Asking students for the amount of homework on the day before taking the NAEP test presents both benefits and risks The benefits are a more precise timeframe for the homework estimate (as opposed to “usually”) and reliance on short term memory. A risk is that teachers, on the day before the test, may assign less or no homework to students selected for NAEP. The Brown Center Report compared responses to this item with responses to a separate NAEP question, posing the question as “usually,” that is now discontinued but asked until 2004. There is evidence of underreporting at the low end of the homework load (e.g., the no homework group), but the percentage of students with a heavy homework load appears unaffected. See the discussion on pages 2021.
The 2014 Brown Center Report on American Education (2014 BCR), released last week, included a study of homework. The study revisits a question investigated in the 2003 BCR: how much homework do American students have? Recent stories in the popular press have featured children burdened with an enormous amount of homework, three hours or more per night. Are these students' experiences typical or rare?
They are rare. According to 2012 NAEP data, only five percent of nine-year-olds, seven percent of 13-year-olds, and 13 percent of 17-year-olds had more than two hours of homework the day before filling out the student questionnaire.^{[i]} MetLife’s 2007 survey of parents and children reports similar figures. Three percent of parents with children attending elementary schools estimated three hours or more of homework on a typical school day. For parents of secondary school children, the share was five percent. In the student surveys, only two percent of kids in grades 3-6 said they had three hours or more of homework, and only eight percent of kids in grades 7-12 said they had that much.
So only a small sliver of the overall population has as much as three hours of nightly homework. And yet on March 21, 2014, CNN reported dramatically different findings from a survey of 4,000 students conducted by researchers at Challenge Success (Stanford Graduate School of Education). The students in that survey had an average homework load of three hours, with some doing as much as five hours per night. The researchers also found that excessive homework was correlated with high levels of stress and health problems. Not only is the homework load onerous, the study concluded, it is also unhealthy for kids.
As a gauge of the national homework load, the study is profoundly flawed. It’s impossible to say whether the findings are even generalizable to the 10 high schools that the students attended. The schools (comprising the study’s sampling frame) are not representative. They consist of four public and six private schools, all in well-to-do suburban neighborhoods in California (median household income of $90,000). The schools are extraordinarily high-achieving, with 93 percent of graduates going to college. Fifty-four percent of the students who answered the survey are female. Only six percent are black or Hispanic. These statistics diverge significantly from statistics for the U.S. as a whole.
The sampling strategy was also vulnerable to skewing. Students were invited to fill out the questionnaire. They were not selected randomly. Self-selection can bias a sample by making a group of highly motivated subjects appear larger than it really is. Students who were unhappy with homework were probably more motivated to fill out the survey. Those who were content or indifferent were probably less motivated. The authors of the study do not report how many students were initially invited to respond, making the response rate incalculable. The average public high school in California has more than 1,200 students—many suburban neighborhoods have larger schools—so the nonrespondents probably outnumber the respondents. Four of the schools offered the questionnaire online, adding another opportunity for self-selection.
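The distortion that self-selection introduces can be demonstrated with a toy simulation (all numbers below are invented for illustration; nothing here comes from the Stanford study). If students with heavier homework loads are more likely to answer a survey, the respondents' average load will overstate the school-wide average:

```python
import random

random.seed(1)

# Hypothetical school of 1,200 students; nightly homework hours drawn
# from a truncated normal distribution (illustrative values only).
students = [max(0.0, random.gauss(1.5, 0.75)) for _ in range(1200)]
true_mean = sum(students) / len(students)

# Suppose the chance of answering the survey rises with homework load:
# overloaded students are more motivated to respond.
respondents = [h for h in students if random.random() < min(1.0, 0.1 + 0.25 * h)]
sample_mean = sum(respondents) / len(respondents)

# The self-selected sample overstates the average homework load.
print(round(true_mean, 2), round(sample_mean, 2))
```

The direction of the bias, not the particular numbers, is the point: because the probability of responding increases with homework hours, the respondents' mean lands above the true mean.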
A nonrepresentative sampling frame followed by selfselection of respondents can produce misleading survey results. A famous example of this is the Literary Digest’s 1936 public opinion poll on the presidential election. The magazine mailed out 10 million postcards to its subscribers, automobile registrants, and telephone users. More than 2 million responses led to a clear prediction: The Republican candidate, Alf Landon, would receive 57 percent of the vote and easily defeat incumbent President Franklin Roosevelt, the Democratic candidate. But that didn’t happen. Roosevelt won in a landslide, getting 61 percent of the vote.
How could the survey be so wrong? The ten million voters who were surveyed leaned strongly Republican. The Digest’s subscribers tended to be from wealthier households, as did car owners and telephone users in 1936. Combine that with the propensity of disgruntled voters to return such a survey, and the ingredients for misleading results are in place.
Look, it’s a fact that some students have too much homework. And it’s plausible that in ten high-powered high schools one can find hundreds of students who are stressed out to the point of missing sleep, which is one of the study’s indicators of poor health. But that is not the norm. Stress from academic expectations is not the experience of the average American teen.
Laurence Steinberg, a psychologist at Temple University, has devoted his career to studying adolescents, including their cognitive and emotional development and health. His research has drawn on databases designed to be nationally representative (e.g., the National Longitudinal Study of Adolescent Health). The tenth edition of Steinberg’s book, Adolescence, was published in 2013. Writing recently in Slate (February 2014), Steinberg calls our high schools a “disaster,” primarily because they ask so little of students. Steinberg recommends “classes that really challenge students to work hard.”
Parents and teachers should monitor students for taking on an academic workload that is too stressful. But to conclude from anecdotal reports and case studies that American high school students are overworked would be wrong. There also is an element of the Stanford study that makes those students’ situation puzzling. Recall that six of the schools in the Stanford study are private schools. High-powered, academically focused high schools are not for everyone. School officials in such schools typically make the homework load clear to prospective students and their parents. AP classes aren’t for everyone either. It’s like going to a steak house for dinner and then getting sick from the menu because you’re a vegetarian. You made a bad choice. My advice is to eat somewhere else.
[i] Asking students for the amount of homework on the day before taking the NAEP test presents both benefits and risks. The benefits are a more precise timeframe for the homework estimate (as opposed to “usually”) and reliance on short-term memory. A risk is that teachers, on the day before the test, may assign less or no homework to students selected for NAEP. The Brown Center Report compared responses to this item with responses to a separate NAEP question that posed the question in terms of “usually”; that item was asked until 2004 and is now discontinued. There is evidence of underreporting at the low end of the homework load (e.g., the no homework group), but the percentage of students with a heavy homework load appears unaffected. See the discussion on pages 20-21.
March 21, 2014
2:00 PM - 2:30 PM EDT
Online Only
The Brookings Institution
Washington, DC
How well are American students learning? For 13 years, the Brown Center Report on American Education has sought to answer that question and to identify policies that help more American students learn better.
Each year, the Brown Center Report offers in-depth, data-driven analysis of pressing topics and trends in American education. The recently released 2014 report investigates three critical issues that have universal ramifications for students and parents in American schools: the homework burden, the Common Core State Standards, and the PISA-Shanghai Controversy.
On Friday, March 21 at 2:00 p.m. EDT, report author and Brookings Senior Fellow Tom Loveless will share some of his most controversial findings in an online chat hosted by Spreecast and moderated by education expert Ben Wildavsky of the Rockefeller Institute of Government, State University of New York. Tune in to listen and participate below!
William H. Schmidt of Michigan State University presented research on the Common Core State Standards (CCSS) for Mathematics at the National Press Club on May 3, 2012.^{[1]} A paper based on the same research, coauthored with Richard T. Houang, was published in Educational Researcher in October 2012.^{[2]} Schmidt and Houang’s study (also referred to as the “MSU study” below) was important for endorsing CCSS’s prospective effectiveness at a time when debate on the CCSS was beginning to heat up. Opponents of the Common Core had criticized the CCSS for lacking empirical support. The MSU study showed that states with math standards similar to the Common Core, after controlling for other potential influences, registered higher NAEP scores in 2009 than states with standards divergent from the CCSS. The implication was that the math standards of CCSS would boost state math performance on NAEP.
Is there reason to believe that projection will become reality? In this section of the Brown Center Report, a two-part investigation attempts to answer that question. First, the ratings of state standards provided by Schmidt and Houang’s study are examined using NAEP data that have been collected since their study was completed. The central question is whether the MSU ratings predict progress on NAEP from 2009-2013. Second, a new analysis is presented, independent from the MSU ratings, comparing the NAEP gains of states with varying degrees of CCSS implementation. The two analyses offer exploratory readings of how the Common Core is affecting achievement so far.
Schmidt and Houang used state NAEP scores on the 2009 eighth grade math assessment to model the potential effectiveness of the CCSS. They first developed a scale to rate the degree of congruence of each state’s standards with the CCSS. The ratings were based on earlier work also conducted by Schmidt and his colleagues at MSU. That work made a lasting and important contribution to curriculum studies by attempting to represent the quality of curriculum standards—both international and domestic—in a quantitative form.^{[3]} The key dimensions measured in the MSU ratings are focus and coherence. Focus refers to limiting topics in the math curriculum to the most important topics and teaching them in depth. Coherence refers to organizing topics in a manner that reflects the underlying structure of mathematics, allowing knowledge and skills to build sequentially.
In the National Press Club talk, Schmidt presented a chart showing how the states fell on the congruence measure (see Table 3-1). Alabama, Michigan, California, and the others at the top of the scale had standards most like the CCSS math standards. Arizona, Nevada, Iowa, and those at the bottom of the scale had standards that diverged from the CCSS.
Table 3-1 includes a categorical variable (1-5) for the five congruency ratings. The MSU authors used the continuous form of the congruence ratings along with demographic covariates in a regression equation that predicted state NAEP scores. The congruence rating was statistically insignificant. No relationship to achievement was uncovered. An analysis of residuals, however, revealed two distinct sets of states (referred to as “Group A” and “Group B”). (Key differences between the two groups are discussed below.) Regression equations incorporating membership in these two groups did produce statistically significant coefficients for the congruence rating.
Figure 3-1, reproduced from the Educational Researcher article, clearly shows two upward sloping regression lines. The MSU authors concluded that it was time to end the debate over the wisdom of the Common Core and that the CCSS in math “deserve to be seriously implemented.”^{[4]}
Examining NAEP Gains with the MSU Ratings
NAEP scores for 2011 and 2013 have been released since the Schmidt and Houang study. These scores offer the opportunity to update Schmidt and Houang’s findings. They also allow a check of the study’s most important policy lesson, that states adopting the CCSS in math could expect an increase in their eighth grade NAEP math scores. Examining gain scores—specifically, the change in state scores since 2009—provides a way to evaluate the predictive capacity, at least in the short run, of the 2009 MSU analysis. By relying on cross-sectional data, NAEP scores from a single point in time, the Schmidt and Houang analysis helps to explain the performance of states in 2009. But states adopt policies with an eye toward future results. Did the states with standards most like the CCSS in 2009 continue to make the greatest gains in later years? Gain score analysis also possesses a technical advantage. It is generally superior to cross-sectional analysis in controlling for omitted variables that may influence achievement by “baking them into the cake” at baseline.^{[5]}
Tables 3-2, 3-3, and 3-4 report the average gains of states expressed as changes in scale score points on the eighth grade NAEP math assessment. The states are grouped by their MSU rating. Bear in mind that the 2009 MSU ratings were assigned based on the math standards then in place. States with a “5” had math standards most similar to the CCSS. States with a “1” had math standards most divergent from the CCSS.
Table 3-2 reveals no systematic relationship between the states’ MSU ratings and changes in NAEP from 2009-2013. Indeed, states with standards most different from the CCSS (rated 1) gained the most on NAEP (2.25). States with standards most like the CCSS scored the next largest gains (1.94); and states with a 4 rating (the second most similar group to the CCSS) lost ground, declining 0.81. The data are lumpy, so whether a positive relationship is expected (i.e., states scoring 5 should make the greatest gains, 4 the next greatest gains, and so forth) or a negative relationship (states scoring 1 should make the greatest gains because they have the most to gain from adopting CCSS, states scoring 2 have the next most to gain, etc.), no statistical relationship is evident. No linear pattern emerges across the categories.
What about the two time intervals, 2009-2011 and 2011-2013? NAEP scores are more stable over longer periods of time, so the four-year interval is probably a preferable indicator. In addition, a clear point of demarcation does not exist for when an old set of standards ends and a new set begins. Nevertheless, let’s consider how the CCSS unfolded to guide the consideration of the data by different time periods.
The 2009-2011 interval should probably receive the closest scrutiny in probing for a correlation of state achievement with 2009 standards. Those standards were still operational from 2009-2011. The states rated “1” notched particularly strong gains (1.91) during this period. States rated “4” actually declined (-0.91). That is not what one would expect if the MSU ratings accurately reflected the quality of 2009 standards.
The 2011-2013 interval should represent the strongest early indicator of gains after adopting the CCSS. Forty-five states had adopted the CCSS math standards by 2011. In a survey of state officials in 2011, most declared that they had begun the implementation process (progress in implementation receives explicit attention below).^{[6]} The gains for this interval might be expected to be inversely related to the MSU ratings, with larger gains coming from the states rated “1.” They were making the most dramatic curricular changes and should experience the most growth that accrues from adopting the CCSS. That expectation isn’t met either. States with a “5” made the largest gains (0.94); however, the second largest gains were recorded by the states with a “1” rating (0.34).
Recall that Schmidt and Houang did not find a significant relationship until they divided the states into two groups, Group A and Group B. Group A consists of 37 states and Group B has 13 states. The groups are quite different demographically. More than half of the Group B states are Southern. They have lower per capita wealth and serve a greater proportion of black and Hispanic students. They receive more federal funding than Group A states. They also scored about 14.67 scale score points lower than Group A states on the 2009 NAEP. Schmidt and Houang speculate that the states in Group B, despite many having high quality math standards, faced a more difficult implementation environment because of demographic challenges and resource constraints.
Tables 3-3 and 3-4 disaggregate the gains by these two groups. Table 3-3 examines the A group. The NAEP changes generally contradict the MSU ratings. From 2009-2013, the states with the weakest congruence with CCSS made the greatest gains (2.25). The changes from 2009-2011 are the most glaring. States with the strongest ratings (the 5’s) lost ground (-0.64), and the states rated “1” scored gains (1.91). Note, though, that some of the ratings groups have become quite small (only three states in Group A have a “5” rating), so these figures must be taken with several grains of salt. Also observe that all of the states with ratings of “1” or “2” belong to Group A. Consequently, the results for these states in Table 3-3 are the same as in Table 3-2.
Table 3-4 examines the states in Group B. Note that the ratings divide what is already a small group, 13 states, into even smaller groups. The states rated “5” registered the smallest gain (2.44) of the ratings groups for 2009-2013. As a whole, from 2009-2013 the Group B states made larger gains than the Group A states (2.78 vs. the 0.77 reported in Table 3-3), narrowing the gap with the A states by about two NAEP scale score points.
This may indicate regression to the mean. There could also be another, unknown factor driving the relatively larger gains by Group B states. Whatever the cause, the gains in Group B states cast doubt on Schmidt and Houang’s hypothesis that implementation difficulties lie at the heart of Group B’s underperformance on the 2009 NAEP. If these states had trouble implementing their own standards prior to 2009, it is difficult to imagine them suddenly discovering the secret to implementation in the first few years with the Common Core. And if resource constraints were a primary factor hobbling past efforts at implementation, surely finding adequate resources during the Great Recession limited what the Group B states could accomplish.
In sum, the Schmidt and Houang ratings of state math standards in 2009 do not predict gains on NAEP very well in subsequent years. The notion that disaggregating the states into two groups would clarify matters because 13 states (Group B) faced implementation challenges also does not receive support. Whether in Group A or Group B, states with 2009 math standards most dissimilar to the Common Core made the largest NAEP gains from 2009-2013.
As Schmidt and Houang point out—and any informed observer would surely agree—the progress states make in implementing the CCSS is crucial to the standards’ impact on achievement. The MSU congruence ratings were designed to serve as substitutes for CCSS implementation measures, there being no implementation to measure in 2009. Now, with the passage of time, it is possible to get an early reading on implementation from a direct measure of state efforts. A 2011 survey of state educational agencies was mentioned above. The survey was conducted as part of a U.S. Department of Education study of reforms promoted by the Recovery Act. The Common Core was one such reform. The survey asked states if they had: 1) adopted the CCSS; 2) provided, guided, or funded professional development on the CCSS; 3) provided curriculum/instructional materials for the CCSS; and 4) worked with a consortium to develop assessments aligned with the CCSS.
For the current study, the states’ responses were utilized to create an implementation rating. Modifications to the survey answers were made if a press report was located updating a state’s status after the 2011 survey was conducted. Montana, Washington, and Wyoming, for example, had not yet adopted the CCSS when the survey was conducted, but they did soon thereafter. Georgia, Kansas, Oklahoma, Pennsylvania, and Utah have either withdrawn from their respective CCSS assessment consortium or announced a freeze on CCSS testing.
The category “nonadopter” was assigned to states that answered “no” to all four questions. That group consists of Alaska, Minnesota, Nebraska, Texas, and Virginia. Those states are going their own way on math standards and can serve as a control group for CCSS.^{[7] } At the other end of the implementation continuum, the category “strong” was assigned to states answering “yes” to all four questions. A total of 19 states have adopted the CCSS, taken steps to provide both professional development and curriculum/instructional materials aligned with CCSS, and are members of a consortium designing CCSS assessments. They are the strong implementers of CCSS. The remaining 26 states are medium implementers. They adopted CCSS but have not taken all of the other steps available to them to implement the standards.
Table 3-5 shows the average NAEP gains of the states based on implementation of CCSS. The 2009-2013 gains are what CCSS advocates hope for, at least in terms of a consistent pattern. The strong implementers made the largest gains (1.88), followed by the medium implementers (1.00), and then the nonadopters (0.61). The 2011-2013 pattern is also favorable towards the implementers of the CCSS. The medium implementers made the most progress (0.61) and the strong implementers made gains (0.21), although less than in 2009-2011. Caution must be exercised with the nonadopters since they only include five states, and Alaska’s decline of 1.49 scale score points from 2009-2013 diminishes what was an average gain of more than one point by the other four states.
The Schmidt and Houang state standards ratings of 2009 proved to be a poor predictor of progress on NAEP in subsequent years. A rating based on states’ implementation activities did reveal a pattern. States that more aggressively implemented the CCSS registered larger gains from 2009-2013. That’s an optimistic finding for CCSS.
Let’s evaluate the magnitude of potential gains from CCSS using that optimistic finding. Start by recognizing that from 1990-2013—the entire history of the main NAEP assessment—scores on the eighth grade math test rose from 263 to 285, a gain of 22 points. That averages to about one scale score point per year. The gains from 2009-2013 have significantly lagged that pace. As reported in Table 3-5, the average gain for the entire period was 1.30, which comes out to 0.33 per year. Critics of CCSS might suspect that the transition to CCSS is responsible for the slowing, but the data presented here do not support the charge. The five states that rejected the CCSS have performed worse than the states that adopted CCSS.
But how much worse? What is the difference? Not much. The 1.27 gap between strong implementers and nonadopters is about .035 of the 2009 NAEP’s standard deviation (36). A rule of thumb is that differences of less than .20 SD are not even noticeable, let alone significant. If it takes four years for the CCSS to generate a .035 SD improvement, it will take 24 years for a noticeable improvement to unfold. And that improvement would add up to 7.62 NAEP scale score points, a gain in 24 years that falls far short of the 22 point gain that NAEP registered in its first 23 years.
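The back-of-the-envelope arithmetic above is easy to replicate. The sketch below simply recomputes the figures in the paragraph from the reported gap (1.27 points over four years) and the student-level standard deviation (36 points):

```python
import math

# Figures reported above (2009 NAEP eighth-grade math).
student_sd = 36.0  # student-level standard deviation on the 2009 test
gap = 1.27         # strong implementers vs. nonadopters, gain over 2009-2013

effect_per_period = gap / student_sd                  # SD units gained per four-year period
periods_needed = math.ceil(0.20 / effect_per_period)  # periods to reach a noticeable 0.20 SD
years_needed = 4 * periods_needed                     # calendar years at that pace
points_gained = gap * periods_needed                  # NAEP scale score points accumulated

print(round(effect_per_period, 3), years_needed, round(points_gained, 2))
# → 0.035 24 7.62
```

Rounding the required number of four-year periods up to six reproduces the 24 years and 7.62 points cited in the text.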
Recent NAEP gains might be disappointing because the economic turmoil of the past few years presented an inopportune time for implementing new standards. That’s possible, but the historical record is mixed. The early 1990s recession was accompanied by weak NAEP gains, but the early 2000s recession took place while NAEP scores were soaring. Perhaps the positive effects of the CCSS will not fully emerge until assessments aligned with the standards are administered and accountability systems tied to the results are launched. There is evidence that the test-based accountability systems of the late 1990s and the NCLB-inspired systems of the early 2000s had a positive impact on achievement; however, in many jurisdictions, accountability systems were then being implemented for the first time.^{[8]} The new CCSS accountability systems will be replacing systems that are already in place. The quality that they add to or subtract from existing systems is unknown. Moreover, as the consequences of NCLB’s accountability systems began to be felt, significant political opposition arose in many states. Whether the CCSS systems experience the same backlash remains to be seen.
Can the small, insignificant effect of implementation be reconciled with the MSU study? Schmidt and Houang reported the tests of statistical significance for their congruence rating but they did not report an estimate of CCSS effects on NAEP scores. It is always possible for a statistically significant regression coefficient to denote an effect that is insignificant in the real world. Statistical significance tells us that we can be confident that an effect is different from zero, not that the difference is important. This is an especially relevant distinction when an analysis of NAEP data is conducted with states as the unit of analysis. As pointed out in a 2012 Brown Center Report study of the CCSS, most variation on NAEP lies within states—between students, not between states.^{[9]} The standard deviation of state NAEP scores on the 2009 math test is 7.6 points. The standard deviation of the 2009 NAEP eighth grade math score, a statistic based on variation in student performance, is 36 points—four to five times larger.
An illustration of what these two SDs mean for interpreting the magnitude of CCSS effects is revealing. Schmidt and Houang’s congruence rating has a range of 662-826, a mean of 762, and an SD of 33.5. The regression coefficient for the congruence rating was 0.08.^{[10]} A statistical convention is to calculate the impact that a one SD change in an independent variable (in this case, the congruence rating) has on the dependent variable (in this case, the 2009 eighth grade NAEP score). In plain English, how much of a boost in NAEP scores can we expect from a pretty big increase in the congruence rating? A little arithmetic produces the following: a one SD gain in the congruence rating (33.5 points) is predicted to yield a NAEP gain of 2.68 points. Consider that gain in terms of the two SDs. It is about 0.35 of the state-level SD—a moderate but noticeable effect that is consistent with MSU’s finding of statistical significance. But as a proportion of the student-level SD, the effect is only 0.07 SD, which is quite small, even undetectable. Moreover, the MSU analysis could not assign a firm estimate of how much time it took for states with standards similar to CCSS to generate this tiny effect, although six to eight years is a good guess.^{[11]}
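That "little arithmetic" can be reproduced directly from the reported figures (a sketch; the coefficient, SDs, and rating scale values are those quoted above):

```python
# Figures from the MSU study as reported above.
coef = 0.08        # regression coefficient for the congruence rating
rating_sd = 33.5   # SD of the congruence rating (range 662-826, mean 762)
state_sd = 7.6     # SD of state-level 2009 NAEP eighth-grade math scores
student_sd = 36.0  # SD of student-level 2009 NAEP eighth-grade math scores

boost = coef * rating_sd  # predicted NAEP gain from a one-SD rise in the rating

print(round(boost, 2))                # → 2.68 scale score points
print(round(boost / state_sd, 2))     # → 0.35 of the state-level SD
print(round(boost / student_sd, 2))   # → 0.07 of the student-level SD
```

The same 2.68-point gain looks moderate against the narrow state-level distribution but nearly vanishes against the much wider student-level distribution, which is the crux of the argument.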
The point here is not that Schmidt and Houang did anything wrong. State-level policies certainly can be evaluated with state-level data. The problem is that a statistically significant finding from an analysis of state-level NAEP scores, the variation among states being relatively small, often fades to insignificance when considered in the more practical, real world terms of how much math students are learning. It is doubtful that even the most ardent Common Core supporter will be satisfied if the best CCSS can offer—after all of the debate, the costs in tax revenue, and blood, sweat, and tears going into implementation—is a three point NAEP gain.
The 2012 Brown Center Report predicted, based on empirical analysis of the effects of state standards, that the CCSS will have little to no impact on student achievement. Supporters of the Common Core argue that strong, effective implementation of the standards will sweep away such skepticism by producing lasting, significant gains in student learning. So far, at least—and it is admittedly the early innings of a long ballgame—there are no signs of such an impressive accomplishment.
Part III Notes:
[1] William H. Schmidt, New Research Links Common Core Math Standards to Higher Achievement. Presented at the National Press Club, May 3, 2012. PowerPoint available at http://www.achieve.org/files/BILL_CCSSM.ppt.
[2] William H. Schmidt and Richard T. Houang, “Curricular Coherence and the Common Core State Standards for Mathematics,” Educational Researcher 41, no. 8 (2012): 294-308.
[3] William H. Schmidt, Curtis C. McKnight, Gilbert A. Valverde, Richard T. Houang, and David E. Wiley, Many visions, many aims, Volume I: A crossnational investigation of curricular intentions in school mathematics (Dordrecht: Kluwer Academic Publishers, 1997); William H. Schmidt, Curtis C. McKnight, Senta A. Raizen, Pamela M. Jakwerth, Gilbert A. Valverde, Richard G. Wolfe, Edward D. Britton, Leonard J. Bianchi, and Richard T. Houang, A splintered vision: An investigation of US science and mathematics education. Vol. 3. (Boston: Kluwer Academic Publishers, 1997); William H. Schmidt and Richard T. Houang, “Lack of Focus in the Mathematics Curriculum: Symptom or Cause?” in Lessons Learned: What International Assessments Tell Us about Math Achievement, ed. Tom Loveless (Washington: Brookings Institution Press, 2007).
[4] William H. Schmidt and Richard T. Houang, “Curricular Coherence and the Common Core State Standards for Mathematics,” Educational Researcher 41, no. 8 (2012): pp. 307.
[5] JanEric Gustafsson, “Understanding Causal Influences on Educational Achievement through Analysis of Differences over Time within Countries,” in Lessons Learned: What International Assessments Tell Us about Math Achievement, ed. Tom Loveless (Washington: Brookings Institution Press, 2007).
[6] A 2011 survey of state education agencies is reported by Webber et. al, “State Implementation of Reforms Promoted Under the Recovery Act,” U.S. Department of Education (January 2014).
[7] Minnesota adopted the CCSS in EnglishLanguage Arts.
[8] For a description of standards and accountability systems in the late 1990s, see Education Week, Quality Counts: Rewarding Results, Punishing Failure (January 11, 1999). For evidence of positive effects of test based accountability, see Thomas Dee and Brian A. Jacob, “Evaluating NCLB,” Education Next 10, no. 3 (Summer 2010); Manyee Wong, Thomas D. Cook, and Peter M. Steiner, “No Child Left Behind: An Interim Evaluation of Its Effects on Learning Using Two Interrupted Time Series Each with Its Own NonEquivalent Comparison Series,” Working Paper 0911 (Evanston, IL: Northwestern University Institute for Policy Research, 2009); Eric A. Hanushek and Margaret E. Raymond, "Does school accountability lead to improved student performance?" Journal of Policy Analysis and Management 24, no.2 (Spring 2005): 297327; Martin Carnoy and Susanna Loeb, "Does external accountability affect student outcomes? A crossstate analysis," Educational Evaluation and Policy Analysis 24, no. 4 (2002): 305331.
[9] Tom Loveless, The 2012 Brown Center Report on American Education (Washington: The Brookings Institution, 2012).
[10] Summary statistics for rating of congruence is on page 300 and regression output is on page 304 of William H. Schmidt and Richard T. Houang, “Curricular Coherence and the Common Core State Standards for Mathematics,” Educational Researcher 41, no. 8 (2012): 294308.
[11] The No Child Left Behind (NCLB) Act of 2001required standards in all states. About half of the states already had standards in place before NCLB. Schmidt and Houang’s statement is that eighth graders in 2009 probably spent all or most of their school years learning math under the state standards then in place.
« Part II: Homework in America 
William H. Schmidt of Michigan State University presented research on the Common Core State Standards (CCSS) for Mathematics at the National Press Club on May 3, 2012.^{[1]} A paper based on the same research, co-authored with Richard T. Houang, was published in Educational Researcher in October 2012.^{[2]} Schmidt and Houang’s study (also referred to as the “MSU study” below) was important for endorsing CCSS’s prospective effectiveness at a time when debate on the CCSS was beginning to heat up. Opponents of the Common Core had criticized the CCSS for lacking empirical support. The MSU study showed that states with math standards similar to the Common Core, after controlling for other potential influences, registered higher NAEP scores in 2009 than states with standards divergent from the CCSS. The implication was that the math standards of CCSS would boost state math performance on NAEP.
Is there reason to believe that projection will become reality? In this section of the Brown Center Report, a two-part investigation attempts to answer that question. First, the ratings of state standards provided by Schmidt and Houang’s study are examined using NAEP data that have been collected since their study was completed. The central question is whether the MSU ratings predict progress on NAEP from 2009-2013. Second, a new analysis is presented, independent from the MSU ratings, comparing the NAEP gains of states with varying degrees of CCSS implementation. The two analyses offer exploratory readings of how the Common Core is affecting achievement so far.
Schmidt and Houang used state NAEP scores on the 2009 eighth grade math assessment to model the potential effectiveness of the CCSS. They first developed a scale to rate the degree of congruence of each state’s standards with the CCSS. The ratings were based on earlier work also conducted by Schmidt and his colleagues at MSU. That work made a lasting and important contribution to curriculum studies by attempting to represent the quality of curriculum standards—both international and domestic—in a quantitative form.^{[3]} The key dimensions measured in the MSU ratings are focus and coherence. Focus refers to limiting the math curriculum to the most important topics and teaching them in depth. Coherence refers to organizing topics in a manner that reflects the underlying structure of mathematics, allowing knowledge and skills to build sequentially.
In the National Press Club talk, Schmidt presented a chart showing how the states fell on the congruence measure (see Table 3-1). Alabama, Michigan, California, and the others at the top of the scale had standards most like the CCSS math standards. Arizona, Nevada, Iowa and those at the bottom of the scale had standards that diverged from the CCSS.
Table 3-1 includes a categorical variable (1-5) for the five congruency ratings. The MSU authors used the continuous form of the congruence ratings along with demographic covariates in a regression equation that predicted state NAEP scores. The congruence rating was statistically insignificant. No relationship to achievement was uncovered. An analysis of residuals, however, revealed two distinct sets of states (referred to as “Group A” and “Group B”). (Key differences between the two groups are discussed below.) Regression equations incorporating membership in these two groups did produce statistically significant coefficients for the congruence rating.
Figure 3-1, reproduced from the Educational Researcher article, clearly shows two upward-sloping regression lines. The MSU authors concluded that it was time to end the debate over the wisdom of the Common Core and that the CCSS in math “deserve to be seriously implemented.”^{[4]}
Examining NAEP Gains with the MSU Ratings
NAEP scores for 2011 and 2013 have been released since the Schmidt and Houang study. These scores offer the opportunity to update Schmidt and Houang’s findings. They also allow a check of the study’s most important policy lesson, that states adopting the CCSS in math could expect an increase in their eighth grade NAEP math scores. Examining gain scores—specifically, the change in state scores since 2009—provides a way to evaluate the predictive capacity, at least in the short run, of the 2009 MSU analysis. By relying on cross-sectional data, NAEP scores from a single point in time, the Schmidt and Houang analysis helps to explain the performance of states in 2009. But states adopt policies with an eye toward future results. Did the states with standards most like the CCSS in 2009 continue to make the greatest gains in later years? Gain score analysis also possesses a technical advantage. It is generally superior to cross-sectional analysis in controlling for omitted variables that may influence achievement by “baking them into the cake” at baseline.^{[5]}
Tables 3-2, 3-3, and 3-4 report the average gains of states expressed as changes in scale score points on the eighth grade NAEP math assessment. The states are grouped by their MSU rating. Bear in mind that the 2009 MSU ratings were assigned based on the math standards then in place. States with a “5” had math standards most similar to the CCSS. States with a “1” had math standards most divergent from the CCSS.
Table 3-2 reveals no systematic relationship between the states’ MSU ratings and changes in NAEP from 2009-2013. Indeed, states with standards most different from the CCSS (rated 1) gained the most on NAEP (2.25). States with standards most like the CCSS scored the next largest gains (1.94); and states with a 4 rating (second most similar group to the CCSS) lost ground, declining 0.81. The data are lumpy, so whether a positive relationship is expected (i.e., states scoring 5 should make the greatest gains, 4 the next greatest gains, and so forth) or a negative relationship (states scoring 1 should make the greatest gains because they have the most to gain from adopting CCSS, states scoring 2 have the next most to gain, etc.), no statistical relationship is evident. No linear pattern emerges across the categories.
What about the two time intervals, 2009-2011 and 2011-2013? NAEP scores are more stable over longer periods of time, so the four-year interval is probably a preferable indicator. In addition, a clear point of demarcation does not exist for when an old set of standards ends and a new set begins. Nevertheless, let’s consider how the CCSS unfolded to guide the consideration of the data by different time periods.
The 2009-2011 interval should probably receive the closest scrutiny in probing for a correlation of state achievement with 2009 standards. Those standards were still operational from 2009-2011. The states rated “1” notched particularly strong gains (1.91) during this period. States rated “4” actually declined (-0.91). That is not what one would expect if the MSU ratings accurately reflected the quality of 2009 standards.
The 2011-2013 interval should represent the strongest early indicator of gains after adopting the CCSS. Forty-five states had adopted the CCSS math standards by 2011. In a survey of state officials in 2011, most declared that they had begun the implementation process (progress in implementation receives explicit attention below).^{[6]} The gains for this interval might be expected to be inversely related to the MSU ratings, with larger gains coming from the states rated “1.” They were making the most dramatic curricular changes and should experience the most growth that accrues from adopting the CCSS. That expectation isn’t met either. States with a “5” made the largest gains (0.94); however, the second largest gains were recorded by the states with a “1” rating (0.34).
Recall that Schmidt and Houang did not find a significant relationship until they divided the states into two groups, Group A and Group B. Group A consists of 37 states and Group B has 13 states. The groups are quite different demographically. More than half of the Group B states are Southern. They have lower per capita wealth and serve a greater proportion of black and Hispanic students. They receive more federal funding than Group A states. They also scored about 14.67 scale score points lower than Group A states on the 2009 NAEP. Schmidt and Houang speculate that the states in Group B, despite many having high-quality math standards, faced a more difficult implementation environment because of demographic challenges and resource constraints.
Tables 3-3 and 3-4 disaggregate the gains by these two groups. Table 3-3 examines the A group. The NAEP changes generally contradict the MSU ratings. From 2009-2013, the states with the weakest congruence with CCSS made the greatest gains (2.25). The changes from 2009-2011 are the most glaring. States with the strongest ratings (the 5’s) lost ground (-0.64), and the states rated “1” scored gains (1.91). Note, though, that some of the ratings groups have become quite small (only three states in Group A have a “5” rating), so these figures must be taken with several grains of salt. Also observe that all of the states with ratings of “1” or “2” belong to Group A. Consequently, the results for these states in Table 3-3 are the same as in Table 3-2.
Table 3-4 examines the states in Group B. Note that the ratings divide what is already a small group, 13 states, into even smaller groups. The states rated “5” registered the smallest gain (2.44) of the ratings groups for 2009-2013. As a whole, from 2009-2013 the Group B states made larger gains than the Group A states (2.78 vs. the 0.77 reported in Table 3-3), narrowing the gap with the A states by about two NAEP scale score points.
This may indicate regression to the mean. There could also be another, unknown factor driving the relatively larger gains by Group B states. Whatever the cause, the gains in Group B states cast doubt on Schmidt and Houang’s hypothesis that implementation difficulties lie at the heart of Group B’s underperformance on the 2009 NAEP. If these states had trouble implementing their own standards prior to 2009, it is difficult to imagine them suddenly discovering the secret to implementation in the first few years with the Common Core. And if resource constraints were a primary factor hobbling past efforts at implementation, the difficulty of finding adequate resources during the Great Recession surely limited what the Group B states could accomplish.
In sum, the Schmidt and Houang ratings of state math standards in 2009 do not predict gains on NAEP very well in subsequent years. The notion that disaggregating the states into two groups would clarify matters because 13 states (Group B) faced implementation challenges also does not receive support. Whether in Group A or Group B, states with 2009 math standards most dissimilar to the Common Core made the largest NAEP gains from 2009-2013.
As Schmidt and Houang point out—and any informed observer would surely agree—the progress states make in implementing the CCSS is crucial to the standards’ impact on achievement. The MSU congruence ratings were designed to serve as substitutes for CCSS implementation measures, there being no implementation to measure in 2009. Now, with the passage of time, it is possible to get an early reading on implementation from a direct measure of state efforts. A 2011 survey of state educational agencies was mentioned above. The survey was conducted as part of a U.S. Department of Education study of reforms promoted by the Recovery Act. The Common Core was one such reform. The survey asked states if they had: 1) adopted the CCSS; 2) provided, guided, or funded professional development on the CCSS; 3) provided curriculum/instructional materials for the CCSS; and 4) worked with a consortium to develop assessments aligned with the CCSS.
For the current study, the states’ responses were utilized to create an implementation rating. Modifications to the survey answers were made if a press report was located updating a state’s status after the 2011 survey was conducted. Montana, Washington, and Wyoming, for example, had not yet adopted the CCSS when the survey was conducted, but they did soon thereafter. Georgia, Kansas, Oklahoma, Pennsylvania, and Utah have either withdrawn from their respective CCSS assessment consortium or announced a freeze on CCSS testing.
The category “non-adopter” was assigned to states that answered “no” to all four questions. That group consists of Alaska, Minnesota, Nebraska, Texas, and Virginia. Those states are going their own way on math standards and can serve as a control group for CCSS.^{[7]} At the other end of the implementation continuum, the category “strong” was assigned to states answering “yes” to all four questions. A total of 19 states have adopted the CCSS, taken steps to provide both professional development and curriculum/instructional materials aligned with CCSS, and are members of a consortium designing CCSS assessments. They are the strong implementers of CCSS. The remaining 26 states are medium implementers. They adopted CCSS but have not taken all of the other steps available to them to implement the standards.
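The categorization rule just described can be sketched as a simple function. This is a hypothetical illustration of the logic, not code from the Department of Education study, and the example answers are invented:

```python
# Hypothetical sketch of the implementation-rating rule described above;
# the survey answers shown are invented for illustration.

def implementation_rating(adopted, funded_pd, provided_materials, in_consortium):
    """Categorize a state from its four yes/no survey answers."""
    answers = (adopted, funded_pd, provided_materials, in_consortium)
    if not any(answers):
        return "non-adopter"   # "no" to all four questions
    if all(answers):
        return "strong"        # "yes" to all four questions
    return "medium"            # adopted CCSS but skipped some steps

print(implementation_rating(False, False, False, False))  # non-adopter
print(implementation_rating(True, True, True, True))      # strong
print(implementation_rating(True, True, False, True))     # medium
```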
Table 3-5 shows the average NAEP gains of the states based on implementation of CCSS. The 2009-2013 gains are what CCSS advocates hope for, at least in terms of a consistent pattern. The strong implementers made the largest gains (1.88), followed by the medium implementers (1.00), and then the non-adopters (0.61). The 2011-2013 pattern is also favorable towards the implementers of the CCSS. The medium implementers made the most progress (0.61) and the strong implementers made gains (0.21), although less than in 2009-2011. Caution must be exercised with the non-adopters since they only include five states, and Alaska’s decline of 1.49 scale score points from 2009-2013 diminishes what was an average gain of more than one point by the other four states.
The Schmidt and Houang state standards ratings of 2009 proved to be a poor predictor of progress on NAEP in subsequent years. A rating based on states’ implementation activities did reveal a pattern. States that more aggressively implemented the CCSS registered larger gains from 2009-2013. That’s an optimistic finding for CCSS.
Let’s evaluate the magnitude of potential gains from CCSS using that optimistic finding. Start by recognizing that from 1990-2013—the entire history of the main NAEP assessment—scores on the eighth grade math test rose from 263 to 285, a gain of 22 points. That averages to about one scale score point per year. The gains from 2009-2013 have significantly lagged that pace. As reported in Table 3-5, the average gain for the entire period was 1.30, which comes out to 0.33 per year. Critics of CCSS might suspect that the transition to CCSS is responsible for the slowing, but the data presented here do not support the charge. The five states that rejected the CCSS have performed worse than the states that adopted CCSS.
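The pace comparison above can be retraced as a back-of-the-envelope check, using only the figures already cited:

```python
# Back-of-the-envelope pace check using figures cited above.
historical_pace = (285 - 263) / (2013 - 1990)  # 22 points over 23 years
recent_pace = 1.30 / 4                          # average 2009-2013 gain per year

print(round(historical_pace, 2))  # about one point per year
print(round(recent_pace, 2))      # about a third of a point per year
```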
But how much worse? What is the difference? Not much. The 1.27 gap between strong implementers and non-adopters is about 0.035 of the 2009 NAEP’s student-level standard deviation (36). A rule of thumb is that differences of less than 0.20 SD are not even noticeable, let alone significant. If it takes four years for the CCSS to generate a 0.035 SD improvement, it will take 24 years for a noticeable improvement to unfold. And that improvement would add up to 7.62 NAEP scale score points, a gain in 24 years that falls far short of the 22-point gain that NAEP registered in its first 23 years.
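This effect-size arithmetic can be retraced directly. The figures come from the report; treating the 24-year estimate as six full four-year periods is my reading of how that number was reached:

```python
import math

# Retracing the effect-size arithmetic above, using figures from the report.
strong_gain = 1.88       # strong implementers' NAEP gain, 2009-2013
nonadopter_gain = 0.61   # non-adopters' gain over the same period
student_sd = 36          # student-level SD, 2009 NAEP eighth-grade math

gap = strong_gain - nonadopter_gain   # 1.27 scale score points
effect = gap / student_sd             # about 0.035 SD per four-year period

# Full four-year periods needed to accumulate a noticeable 0.20 SD effect:
periods = math.ceil(0.20 / effect)    # 6 periods
years = 4 * periods                   # 24 years
total_gain = gap * periods            # 7.62 scale score points

print(round(gap, 2), round(effect, 3), years, round(total_gain, 2))
```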
Recent NAEP gains might be disappointing because the economic turmoil of the past few years presented an inopportune time for implementing new standards. That’s possible, but the historical record is mixed. The early 1990s recession was accompanied by weak NAEP gains, but the early 2000s recession took place while NAEP scores were soaring. Perhaps the positive effects of the CCSS will not fully emerge until assessments aligned with the standards are administered and accountability systems tied to the results are launched. There is evidence that the test-based accountability systems of the late 1990s and the NCLB-inspired systems of the early 2000s had a positive impact on achievement; however, in many jurisdictions, accountability systems were then being implemented for the first time.^{[8]} The new CCSS accountability systems will be replacing systems that are already in place. The quality that they add to or subtract from existing systems is unknown. Moreover, as the consequences of NCLB’s accountability systems began to be felt, significant political opposition arose in many states. Whether the CCSS systems experience the same backlash remains to be seen.
Can the small, insignificant effect of implementation be reconciled with the MSU study? Schmidt and Houang reported the tests of statistical significance for their congruence rating but they did not report an estimate of CCSS effects on NAEP scores. It is always possible for a statistically significant regression coefficient to denote an effect that is insignificant in the real world. Statistical significance tells us that we can be confident that an effect is different from zero, not that the difference is important. This is an especially relevant distinction when an analysis of NAEP data is conducted with states as the unit of analysis. As pointed out in a 2012 Brown Center Report study of the CCSS, most variation on NAEP lies within states—between students, not between states.^{[9]} The standard deviation of state NAEP scores on the 2009 math test is 7.6 points. The standard deviation of the 2009 NAEP eighth grade math score, a statistic based on variation in student performance, is 36 points—four to five times larger.
An illustration of what these two SDs mean for interpreting the magnitude of CCSS effects is revealing. Schmidt and Houang’s congruence rating has a range of 662-826, mean of 762, and SD of 33.5. The regression coefficient for the congruence rating was 0.08.^{[10]} A statistical convention is to calculate the impact that a one SD change in an independent variable (in this case, the congruence rating) has on the dependent variable (in this case, the 2009 eighth grade NAEP score). In plain English, how much of a boost in NAEP scores can we expect from a pretty big increase in the congruence rating? A little arithmetic produces the following: a one SD gain in the congruence rating (33.5 points) is predicted to yield a NAEP gain of 2.68 points. Consider that gain in terms of the two SDs. It is about 0.35 of the state-level SD—a moderate but noticeable effect that is consistent with MSU’s finding of statistical significance. But as a proportion of the student-level SD, the effect is only 0.07 SD, which is quite small, even undetectable. Moreover, the MSU analysis could not assign a firm estimate of how much time it took for states with standards similar to CCSS to generate this tiny effect, although six to eight years is a good guess.^{[11]}
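That little arithmetic can be expressed as a quick computation, with the figures as reported in the MSU study and in this report:

```python
# Retracing the congruence-rating arithmetic above, with figures as reported.
coef = 0.08        # regression coefficient on the congruence rating
rating_sd = 33.5   # SD of the congruence rating
state_sd = 7.6     # SD of state-level 2009 NAEP math scores
student_sd = 36    # SD of student-level 2009 NAEP math scores

naep_gain = coef * rating_sd   # predicted gain for a one-SD rating increase
print(round(naep_gain, 2))               # 2.68 points
print(round(naep_gain / state_sd, 2))    # 0.35 of the state-level SD
print(round(naep_gain / student_sd, 2))  # 0.07 of the student-level SD
```

The same 2.68-point gain thus looks moderate against state-to-state variation but negligible against student-to-student variation, which is the crux of the argument.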
The point here is not that Schmidt and Houang did anything wrong. State-level policies certainly can be evaluated with state-level data. The problem is that a statistically significant finding from an analysis of state-level NAEP scores, the variation among states being relatively small, often fades to insignificance when considered in the more practical, real-world terms of how much math students are learning. It is doubtful that even the most ardent Common Core supporter will be satisfied if the best CCSS can offer—after all of the debate, the costs in tax revenue, and blood, sweat, and tears going into implementation—is a three-point NAEP gain.
The 2012 Brown Center Report predicted, based on empirical analysis of the effects of state standards, that the CCSS will have little to no impact on student achievement. Supporters of the Common Core argue that strong, effective implementation of the standards will sweep away such skepticism by producing lasting, significant gains in student learning. So far, at least—and it is admittedly the early innings of a long ballgame—there are no signs of such an impressive accomplishment.
Part III Notes:
[1] William H. Schmidt, New Research Links Common Core Math Standards to Higher Achievement. Presented at the National Press Club, May 3, 2012. PowerPoint available at http://www.achieve.org/files/BILL_CCSSM.ppt.
[2] William H. Schmidt and Richard T. Houang, “Curricular Coherence and the Common Core State Standards for Mathematics,” Educational Researcher 41, no. 8 (2012): 294-308.
[3] William H. Schmidt, Curtis C. McKnight, Gilbert A. Valverde, Richard T. Houang, and David E. Wiley, Many Visions, Many Aims, Volume I: A Cross-National Investigation of Curricular Intentions in School Mathematics (Dordrecht: Kluwer Academic Publishers, 1997); William H. Schmidt, Curtis C. McKnight, Senta A. Raizen, Pamela M. Jakwerth, Gilbert A. Valverde, Richard G. Wolfe, Edward D. Britton, Leonard J. Bianchi, and Richard T. Houang, A Splintered Vision: An Investigation of U.S. Science and Mathematics Education, Vol. 3 (Boston: Kluwer Academic Publishers, 1997); William H. Schmidt and Richard T. Houang, “Lack of Focus in the Mathematics Curriculum: Symptom or Cause?” in Lessons Learned: What International Assessments Tell Us about Math Achievement, ed. Tom Loveless (Washington: Brookings Institution Press, 2007).
[4] William H. Schmidt and Richard T. Houang, “Curricular Coherence and the Common Core State Standards for Mathematics,” Educational Researcher 41, no. 8 (2012): 307.
[5] Jan-Eric Gustafsson, “Understanding Causal Influences on Educational Achievement through Analysis of Differences over Time within Countries,” in Lessons Learned: What International Assessments Tell Us about Math Achievement, ed. Tom Loveless (Washington: Brookings Institution Press, 2007).
[6] A 2011 survey of state education agencies is reported by Webber et al., “State Implementation of Reforms Promoted Under the Recovery Act,” U.S. Department of Education (January 2014).
[7] Minnesota adopted the CCSS in English-Language Arts.
[8] For a description of standards and accountability systems in the late 1990s, see Education Week, Quality Counts: Rewarding Results, Punishing Failure (January 11, 1999). For evidence of positive effects of test-based accountability, see Thomas Dee and Brian A. Jacob, “Evaluating NCLB,” Education Next 10, no. 3 (Summer 2010); Manyee Wong, Thomas D. Cook, and Peter M. Steiner, “No Child Left Behind: An Interim Evaluation of Its Effects on Learning Using Two Interrupted Time Series Each with Its Own Non-Equivalent Comparison Series,” Working Paper 09-11 (Evanston, IL: Northwestern University Institute for Policy Research, 2009); Eric A. Hanushek and Margaret E. Raymond, “Does school accountability lead to improved student performance?” Journal of Policy Analysis and Management 24, no. 2 (Spring 2005): 297-327; Martin Carnoy and Susanna Loeb, “Does external accountability affect student outcomes? A cross-state analysis,” Educational Evaluation and Policy Analysis 24, no. 4 (2002): 305-331.
[9] Tom Loveless, The 2012 Brown Center Report on American Education (Washington: The Brookings Institution, 2012).
[10] Summary statistics for the rating of congruence are on page 300, and regression output is on page 304, of William H. Schmidt and Richard T. Houang, “Curricular Coherence and the Common Core State Standards for Mathematics,” Educational Researcher 41, no. 8 (2012): 294-308.
[11] The No Child Left Behind (NCLB) Act of 2001 required standards in all states. About half of the states already had standards in place before NCLB. Schmidt and Houang’s statement is that eighth graders in 2009 probably spent all or most of their school years learning math under the state standards then in place.