In a pessimistic view of the state of college education, Arum and Roksa claimed that 45% of students show no statistically significant gains in critical thinking. We demonstrate that their significance tests were conducted incorrectly and that far fewer students are likely to show significant gains. However, this reflects less on student learning than on how hard it is to find a significant result with a sample size of one.
One of the most highly-cited findings from the book Academically Adrift: Limited Learning on College Campuses by Arum and Roksa (2011a) reflects the main theme of the title of the book, namely that “...we observe no statistically significant gains in critical thinking, complex reasoning, and writing skills for at least 45 percent of the [college] students in our study” (p. 36). In a related article on the topic, Arum and Roksa (2011b) repeat that “Forty-five percent of students did not demonstrate any significant improvement in learning, as measured by CLA performance, during their first 2 years of college” (p. 204).
This gain (or lack thereof) is based on a measure called the College Learning Assessment (CLA) that the authors administered to college students at the beginning of their freshman year and again at the end of their sophomore year, with the difference between CLA scores across administrations as the indicator of change. Astin (2011) predicted that “this 45-percent conclusion is well on its way to becoming part of the folklore about American higher education,” and time has borne this prediction out. A Google search for the key words:
"Academically Adrift" Arum "45 percent of the students"
conducted on April 29th, 2013, revealed over 27,000 results. As a basis for comparison, changing "45 percent" to "48 percent" revealed no results.
Astin (2011) astutely critiqued Arum and Roksa’s use of significance testing to assess improvement in student learning. His two major arguments were (1) learning should be measured in terms of practically significant gains rather than statistically significant gains, and (2) the reliability of measured student gains was low, making it very difficult to find statistically significant differences, even if important gains in learning actually existed. Initially, we felt that Astin dealt with these issues sufficiently, and that no further comments were necessary. Upon closer examination, however, we discovered that the significance tests conducted by Arum and Roksa were done incorrectly. This error in computing significance tests has critical implications, and therefore merits a comment in order to correct the scientific record.
These authors and subsequently the media were shocked by the finding that 45% of students failed to show statistically significant gains. This 45% finding is indeed shocking — but for a reason completely unrelated to student learning: Considering that each significance test was based on a sample size of 1 (i.e., each student’s change in the CLA measure)1, it is hard to imagine that as many as 55% of students (i.e., 100% - 45%) would show statistically significant gains. Indeed, based on the mean difference between the pre- and post-tests, one would expect to find an order of magnitude fewer significant improvements than reported in the study.
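To see why an order of magnitude fewer significant gains would be expected under a correctly computed per-student test, consider what fraction of students would clear the 1.96-standard-deviation threshold. The sketch below uses an invented average gain of 0.18 standard deviations of the change score; this number is our assumption for illustration, not a figure taken from the CLA data:

```python
# Fraction of students whose individual gain would clear +/-1.96 SDs
# under a correct per-student test. All numbers are invented.
from statistics import NormalDist

mean_gain_sd = 0.18  # assumed average gain, in SD units of the change score
nd = NormalDist(mu=mean_gain_sd, sigma=1.0)

# Two-tailed: proportion of change scores beyond +/-1.96 SDs
pct = 100 * (1 - nd.cdf(1.96) + nd.cdf(-1.96))
print(round(pct, 1))  # a few percent, roughly an order of magnitude below 55%
```

Under these assumed numbers, only a small fraction of students would show "significant" individual gains, consistent with the order-of-magnitude argument above.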
The reason Arum and Roksa found that so many (not so few) students improved significantly is that they computed the significance test incorrectly. These authors computed a standard error appropriate to test the mean difference across students, but not appropriate for testing the improvement of individual students. More specifically, their error was to divide the standard deviation of the change scores by the square root of the sample size when computing the standard error (Arum, 2012). This led to a standard error much lower than the standard error appropriate for testing an individual student’s change.
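The contrast between the two standard errors can be sketched with invented numbers; the SD of change scores, the sample size, and the individual student's gain below are all assumptions for illustration, not values from the study:

```python
# Contrast of the flawed and the appropriate standard error
# for testing ONE student's pre-to-post gain. Numbers are invented.
import math

sd_change = 100.0  # assumed SD of CLA change scores across students
n = 2300           # assumed number of students in the sample
gain = 30.0        # one hypothetical student's gain

# Flawed: the SE of the MEAN change across students, applied to one student
se_mean = sd_change / math.sqrt(n)
z_wrong = gain / se_mean      # roughly 14: wrongly declared "significant"

# Appropriate scale for a single student's change: the SD itself
z_right = gain / sd_change    # 0.3: nowhere near the 1.96 cutoff

print(z_wrong, z_right)
```

The flawed statistic is inflated by a factor of the square root of n, so even a modest individual gain looks wildly significant.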
Clearly, the number of students in the study should not affect the significance of tests for the improvement scores of individual students. However, in Arum and Roksa’s analysis, it does. For example, if 100,000 students had been tested on the CLA, virtually every gain (or loss) would have been significant because the standard error for a single student's change would be computed by dividing the standard deviation of difference scores across students by the square root of N (100,000 in this example). With such a small standard error, even extremely small differences would be deemed significant. The authors would thus have concluded that a much higher percentage of students showed learning gains even though nothing would have changed except the number of students tested. With an appropriate method, an increase in sample size would not materially affect the significance test for an individual student’s change in his/her CLA score across two time points.
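The dependence of the flawed test on sample size can be checked with a small simulation. Everything below is synthetic (the mean gain and SD of change scores are invented); the point is only that the flawed per-student test flags nearly every student as N grows:

```python
# Simulation: percent of students flagged "significant" by the flawed
# per-student test, as a function of sample size. Synthetic data only.
import math
import random
import statistics

random.seed(0)
TRUE_MEAN_GAIN = 18.0  # assumed average CLA gain (invented)
SD_CHANGE = 100.0      # assumed SD of change scores (invented)

def pct_significant(n):
    gains = [random.gauss(TRUE_MEAN_GAIN, SD_CHANGE) for _ in range(n)]
    # The flawed standard error: SD of change scores divided by sqrt(n)
    se_mean = statistics.stdev(gains) / math.sqrt(n)
    return 100 * sum(abs(g) > 1.96 * se_mean for g in gains) / n

for n in (100, 2300, 100000):
    print(n, round(pct_significant(n), 1))
```

As n increases, the flawed standard error shrinks toward zero, so the percentage of "significant" individual gains climbs toward 100% even though nothing about the students has changed.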
Thus, we agree with Astin that the proportion of students improving significantly is hardly one of the better ways to measure the size of an effect. However, if this measure is to be used, it should be calculated correctly.
Certainly, it could be argued that the gist of Arum and Roksa’s conclusions does not depend on the 45% figure, but that is a complex question requiring appropriate statistical testing and further investigation of the CLA. Our point is that this 45% figure has garnered the most attention and is the most inflammatory. We feel it is important to set the record straight that this 45% figure is in error.
1. The fact that the “error term” was based on the whole sample does not materially affect this argument.
Arum, R. (2012). Personal communication, May 6, 2012.
Arum, R., & Roksa, J. (2011a). Academically adrift: Limited learning on college campuses. Chicago: University of Chicago Press.
Arum, R., & Roksa, J. (2011b). Limited learning on college campuses. Society, 48, 203-207.
Astin, A. W. (2011, February). In ‘Academically Adrift,’ data don’t back up sweeping claim. The Chronicle of Higher Education. Retrieved from http://chronicle.com/article/Academically-Adrift-a/126371/