Insights on Standardized Test Criticisms and Testing Effects

29 May 2023
Richard P. Phelps
Over- and Under-used Criticisms of Standardized Tests
1
University of Bucharest
2
“If a thing exists, it exists in some amount. If it exists in some amount, then it is capable of being measured.”
- René Descartes, Principles of Philosophy, 1664
3
Some Overused Criticisms
Time Lost from Learning
Teaching to the Test
Narrowing the Curriculum
Distorts Instruction
4
Some Overused Criticisms
Time Lost from Learning
Teaching to the Test
Narrowing the Curriculum
Distorts Instruction
5
The “Testing Effect”
Generally, one learns more by testing than by restudy.
Aristotle ~350 BCE
Edward Thorndike ~1900
“The active recall of a fact from within us is, as a rule, better than its impressions without”
C. C. Ross 1941
“The act of testing alone, irrespective of other factors, tends to improve achievement.”
Experiments:   1909, 1917, 1923, 1924, 1939
6
The effect of testing on student learning
> 3,000 documents
700 separate studies, > 1,600 separate effects
2,000 other studies were reviewed and found incomplete or inappropriate
A thousand other studies remain to be reviewed
7
The effect of testing on student learning
245 Qualitative studies
813 Survey or Poll questions
640 Quantitative Effects:
 Experiments: School- and classroom-level
 Multivariate studies: Large-scale testing programs
8
Meta-analysis
A method for summarizing a large research literature, with a single, comparable measure.
(0.5 effect size ≈ 1 grade level of learning)
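The slide does not say which effect-size metric is meant. As a minimal sketch, assuming the common standardized mean difference (Cohen's d with a pooled standard deviation), the calculation and the slide's rule of thumb (0.5 effect size ≈ 1 grade level) could look like this. The function names and the scores are hypothetical, invented only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treated, control):
    """Standardized mean difference using the pooled sample variance (assumed metric)."""
    n_t, n_c = len(treated), len(control)
    pooled_var = ((n_t - 1) * variance(treated) + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


def grade_levels(effect_size, per_grade=0.5):
    """Convert an effect size to grade levels using the slide's rule of thumb."""
    return effect_size / per_grade


# Hypothetical scores, invented for illustration only (not data from any study).
tested = [72, 75, 78, 81, 84]      # group that was tested
restudied = [70, 73, 76, 79, 82]   # group that restudied instead

d = cohens_d(tested, restudied)
print(f"effect size d = {d:.2f}, roughly {grade_levels(d):.2f} grade levels")
```

Because every study is reduced to the same unitless quantity, results measured on different tests and scales can be averaged or compared.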
9
Findings from Phelps (2012):
Survey study effect sizes average > 1.0
Over 90% of qualitative studies positive
For quantitative studies, effect sizes vary between 0.55 and 0.88:
+ testing or testing more
+ testing with stakes
+ testing with feedback
10
Findings from Phelps Meta-Regression (2019)
To raise achievement:
Add a test
Add feedback
Add consequences
Consequences + feedback is the strongest treatment
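The slide does not give the specification of the 2019 meta-regression. As a rough sketch of the general technique only, a meta-regression can be run as weighted least squares of study effect sizes on moderator indicators (here feedback and consequences), weighting each study by the inverse of its variance. Everything below is an assumption for illustration: the function names, the two moderators as 0/1 dummies, and the invented, noise-free study data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def meta_regression(effects, variances, moderators):
    """Weighted least squares of effect sizes on moderators, with weights 1 / variance."""
    rows = [[1.0] + [float(x) for x in mods] for mods in moderators]  # prepend intercept
    weights = [1.0 / v for v in variances]
    k = len(rows[0])
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, effects)) for i in range(k)]
    return solve(xtwx, xtwy)


# Hypothetical studies, invented for illustration: (has_feedback, has_consequences).
moderators = [(0, 0), (1, 0), (0, 1), (1, 1), (1, 1), (0, 0)]
variances = [0.04, 0.02, 0.05, 0.03, 0.01, 0.02]
# Effects constructed without noise as 0.2 + 0.3*feedback + 0.25*consequences.
effects = [0.2 + 0.3 * f + 0.25 * c for f, c in moderators]

intercept, b_feedback, b_consequences = meta_regression(effects, variances, moderators)
print(intercept, b_feedback, b_consequences)
```

In this toy setup the coefficients are read as the additional effect associated with each feature, so a study with both feedback and consequences has the largest predicted effect by construction.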
11
Some Overused Criticisms
Time Lost from Learning
Teaching to the Test
Narrowing the Curriculum
Distorts Instruction
John J. Cannell, M.D.
Residency in rural, poor Appalachia, 1980s
Surprised by claims that state and school district scored above average on national tests
Investigated; found that all US states claimed to be above average
12
Welcome to Lake Wobegon, where all the women are strong, all the men are good-looking, and all the children are above average.
- Garrison Keillor, A Prairie Home Companion
13
Dr. Cannell’s suspects
Lax security
Outdated or invalid norms
Deliberate educator manipulation (i.e., cheating)
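To illustrate only the "outdated norms" suspect from the list above: if "average" means the mean of an old norming sample, and current scores have since drifted upward for any reason, then more than half of today's test takers can score "above average". The sketch below assumes normally distributed scores; the function name and all numbers are hypothetical and do not come from Cannell's reports.

```python
from statistics import NormalDist


def share_above_old_average(old_mean, current_mean, sd):
    """Fraction of current test takers scoring above the mean of an older norming sample."""
    current = NormalDist(mu=current_mean, sigma=sd)
    return 1.0 - current.cdf(old_mean)


# Hypothetical scale: mean 50 and SD 10 when normed; current mean has drifted to 53.
share = share_above_old_average(old_mean=50, current_mean=53, sd=10)
print(f"{share:.1%} of current test takers score above the old 'national average'")
```

The sketch says nothing about why scores drift; lax security and deliberate manipulation would raise reported scores against a fixed norm in the same way.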
14
CRESST’s Lake Wobegon suspects
Outdated or invalid norms
High stakes that induce teaching to the test (i.e., test coaching) under pressure
Problem: only one of Cannell’s many tests had any stakes
15
Harms of misinformation
1. Unfairly discredits useful evaluation tool
2. Test security (in U.S.) remains poor
3. Teachers given mixed messages
4. Now spreading worldwide via the OECD
16
17
Some Overused Criticisms
Time Lost from Learning
Teaching to the Test
Narrowing the Curriculum
Distorts Instruction
18
Some Overused Criticisms
Time Lost from Learning
Teaching to the Test
Narrowing the Curriculum
Distorts Instruction
19
Studies of the reliability of teacher grading, 1890s to 1920s
e.g., Starch & Elliott, 1912
Two actual English examination papers
Sent to 142 teachers to grade
Grades ranged from 50 to 98%
One paper: 14 grades < 80% & 14 > 94%
20
Studies of the reliability of teacher grading, 1890s to 1920s
Starch & Elliott, 1912
Two actual Geometry examination papers
Sent to 116 teachers to grade
Grades ranged from 28 to 92%
One paper: 20 grades < 60% & 9 > 85%
21
Why standardized tests?
In some places, the only objective measure available to the public (i.e., not under the control of insiders).
22
How can those outside a school or classroom judge the quality of a school, its instruction, or its students?
Schools vary in quality
Courses vary in quality
Grade comparisons are not reliable
Standardized tests’ most important feature is standardization.
23
Some Underused Criticisms
Choosing the Wrong Test Type
Lax Security
Threats to Privacy
Not Testing When Benefits are Clear
24
Some Underused Criticisms
Choosing the Wrong Test Type
Lax Security
Threats to Privacy
Not Testing When Benefits are Clear
1. Three types of large-scale tests
Achievement
Aptitude
Non-cognitive
25
PSU: a test at war with itself
It is expected to do too many things…
…it does none of them well,
…and makes some important ones worse
(an exit exam for scientific-humanistic education, presented as a vehicle for evaluating curricular coverage, that today is used as an admission test for all students, including those in technical-professional (TP) secondary education)
Multiple Purposes of the PSU:
1. Measure the implementation of a new curriculum;
2. Measure well the mastery of two very different curricula;
3. Give secondary schools an incentive to implement the new curriculum;
4. Give students an incentive to study more;
5. Predict success at university;
6. Predict success in very different university programs;
7. Provide cut scores for university admission, scholarships, and financial aid.
 
Comparing Achievement & Aptitude tests
28
Non-cognitive tests
More recently developed; measure values, attitudes, preferences
Types:
 integrity tests
 career exploration
 matchmaking
 employment “fit”
29
Comparing Achievement, Aptitude, &
Non-Cognitive Tests
30
31
Some Underused Criticisms
Choosing the Wrong Test Type
Lax Security
Threats to Privacy
Not Testing When Benefits are Clear
Large-scale test, tight security
32
Large-scale test, lax security
33
34
Some Underused Criticisms
Choosing the Wrong Test Type
Lax Security
Threats to Privacy
Not Testing When Benefits are Clear
35
Threats to Privacy Increase in the Internet Age
Lavishly-funded government-sponsored hacking
Ransomware
Commercial incentives to collect personal data
Schools often store psychological and medical data on students
Socio-Emotional Learning (SEL) programs will increase the amount of such data
36
Some Underused Criticisms
Choosing the Wrong Test Type
Lax Security
Threats to Privacy
Not Testing When Benefits are Clear
37
Cognitive Scientists’ 6 Strategies for Effective Learning
Interleaving
Concrete Examples
Elaboration
Retrieval Practice
Spaced Practice
Dual Coding
10 benefits of testing and their applications to education
Roediger, Putnam and Smith
Benefit 1: The Testing Effect: Retrieval Aids Later Retention
Benefit 2: Testing Identifies Gaps in Knowledge
Benefit 3: Testing Causes Students to Learn More from the Next Study Episode
Benefit 4: Testing Produces Better Organization of Knowledge
Benefit 5: Testing Improves Transfer of Knowledge to New Contexts
Benefit 6: Testing can Facilitate Retrieval of Material That was not Tested
Benefit 7: Testing Improves Metacognitive Monitoring
Benefit 8: Testing Prevents Interference from Prior Material when Learning New Material
Benefit 9: Testing Provides Feedback to Instructors
Benefit 10: Frequent Testing Encourages Students to Study
SOURCE: Roediger, Putnam, & Smith, Ten benefits of testing and their applications to educational practice, Psychology of Learning and Motivation, 55, 2011.
38
Why consequential tests?
Most people respond to both intrinsic and extrinsic motivators, and the proportion varies from individual to individual.
Consequential tests provide both forms of inducement.
Consequential tests tend to be taken more seriously and administered with tighter security.
40
Large-scale tests are needed for other purposes, such as
…monitoring and system diagnosis
…selection to programs
…workforce planning
…accountability
…credentialing
41
Some large-scale test advantages
On per-student basis, inexpensive
Cognitive laboratory pre-testing possible
Standardization offers comparisons across schools and regions.
May produce high-quality test items that schools and teachers can use.
SOURCE: Phelps, Benchmarking to the best in mathematics, Evaluation Review, 2001
42
SOURCE: Phelps, Benchmarking to the best in mathematics, Evaluation Review, 2001
43
44
US State of California University Admission Test Decision
Faculty conducts large study showing clear information benefits of admission testing. Votes in favor of continuing to use them.
Board of Directors overrules them, will ban use of admission tests. Cites the usual, many-times-disproven equity arguments.
45
46
47
Over- and Under-used Criticisms of Standardized Tests
https://nonpartisaneducation.org
48
richard {at} nonpartisaneducation {dot} org
Slide Note

I can make these slides available, so there should be no reason to take notes.


This presentation discusses both overused and underused criticisms of standardized tests, highlighting issues such as time lost from learning, narrowing of the curriculum, and the testing effect on student learning. It delves into the benefits of testing and presents findings on the impact of testing with stakes and feedback. Meta-analysis methods are explored for summarizing research literature.

  • Standardized tests
  • Criticisms
  • Testing effects
  • Education
  • Meta-analysis

Uploaded on Apr 17, 2024 | 1 Views



Presentation Transcript


  1. Over- and Under-used Criticisms of Standardized Tests Richard P. Phelps University of Bucharest 29 May 2023 1

  2. “If a thing exists, it exists in some amount. If it exists in some amount, then it is capable of being measured.” René Descartes, Principles of Philosophy, 1664 2

  3. Some Overused Criticisms Time Lost from Learning Teaching to the Test Narrowing the Curriculum Distorts Instruction 3

  4. Some Overused Criticisms Time Lost from Learning Teaching to the Test Narrowing the Curriculum Distorts Instruction 4

  5. The “Testing Effect” Generally, one learns more by testing than by restudy. Aristotle ~350 BCE Edward Thorndike ~1900 “The active recall of a fact from within us is, as a rule, better than its impressions without” C. C. Ross 1941 “The act of testing alone, irrespective of other factors, tends to improve achievement.” Experiments: 1909, 1917, 1923, 1924, 1939 5

  6. The effect of testing on student learning > 3,000 documents 700 separate studies, > 1,600 separate effects 2,000 other studies were reviewed and found incomplete or inappropriate A thousand other studies remain to be reviewed 6

  7. The effect of testing on student learning 245 Qualitative studies 813 Survey or Poll questions 640 Quantitative Effects: Experiments: School- and classroom-level Multivariate studies: Large-scale testing programs 7

  8. Meta-analysis A method for summarizing a large research literature, with a single, comparable measure. (0.5 effect size ≈ 1 grade level of learning) 8

  9. Findings from Phelps (2012): Survey study effect sizes average > 1.0 Over 90% of qualitative studies positive For quantitative studies, effect sizes vary between 0.55 and 0.88: + testing or testing more + testing with stakes + testing with feedback 9

  10. Findings from Phelps Meta-Regression (2019) To raise achievement: Add a test Add feedback Add consequences consequences + feedback is the strongest treatment 10

  11. Some Overused Criticisms Time Lost from Learning Teaching to the Test Narrowing the Curriculum Distorts Instruction 11

  12. John J. Cannell, M.D. Residency in rural, poor Appalachia, 1980s Surprised by claims that state and school district scored above average on national tests Investigated, all US states claimed to be above average 12

  13. Welcome to Lake Wobegon, where all the women are strong, all the men are good-looking, and all the children are above average. - Garrison Keillor, A Prairie Home Companion 13

  14. Dr. Cannell’s suspects Lax security Outdated or invalid norms Deliberate educator manipulation (i.e., cheating) 14

  15. CRESST’s Lake Wobegon suspects Outdated or invalid norms High stakes that induce teaching to the test (i.e., test coaching) under pressure Problem: only one of Cannell’s many tests had any stakes 15

  16. Harms of misinformation 1. Unfairly discredits useful evaluation tool 2. Test security (in U.S.) remains poor 3. Teachers given mixed messages 4. Now spreading worldwide via the OECD 16

  17. Some Overused Criticisms Time Lost from Learning Teaching to the Test Narrowing the Curriculum Distorts Instruction 17

  18. Some Overused Criticisms Time Lost from Learning Teaching to the Test Narrowing the Curriculum Distorts Instruction 18

  19. Studies of the reliability of teacher grading, 1890s to 1920s e.g., Starch & Elliott, 1912 Two actual English examination papers Sent to 142 teachers to grade Grades ranged from 50 to 98% One paper: 14 grades < 80% & 14 > 94% 19

  20. Studies of the reliability of teacher grading, 1890s to 1920s Starch & Elliott, 1912 Two actual Geometry examination papers Sent to 116 teachers to grade Grades ranged from 28 to 92% One paper: 20 grades < 60% & 9 > 85% 20

  21. Why Standardized tests? In some places, the only objective measure available to the public (i.e., not under the control of insiders). 21

  22. How can those outside a school or classroom judge the quality of a school, its instruction, or its students? Schools vary in quality Courses vary in quality Grade comparisons are not reliable Standardized tests’ most important feature is standardization. 22

  23. Some Underused Criticisms Choosing the Wrong Test Type Lax Security Threats to Privacy Not Testing When Benefits are Clear 23

  24. Some Underused Criticisms Choosing the Wrong Test Type Lax Security Threats to Privacy Not Testing When Benefits are Clear 24

  25. 1. Three types of large-scale tests Achievement Aptitude Non-cognitive 25

  26. PSU: a test at war with itself (an exit exam for scientific-humanistic education, presented as a vehicle for evaluating curricular coverage, that today is used as an admission test for all students, including those in technical-professional (TP) secondary education) It is expected to do too many things… …it does none of them well, …and makes some important ones worse

  27. Multiple Purposes of the PSU: 1. Measure the implementation of a new curriculum; 2. Measure well the mastery of two very different curricula; 3. Give secondary schools an incentive to implement the new curriculum; 4. Give students an incentive to study more; 5. Predict success at university; 6. Predict success in very different university programs; 7. Provide cut scores for university admission, scholarships, and financial aid.

  28. Comparing Achievement & Aptitude tests

                    Achievement         Aptitude
      Measure       past learning       potential
      Development   content analysis    job/skills analysis
      Validation    retrospective       predictive
      Content       dependent           independent
      Coachable?    very much           not much

  29. Non-cognitive tests More recently developed; measure values, attitudes, preferences Types: integrity tests, career exploration, matchmaking, employment “fit” 29

  30. Comparing Achievement, Aptitude, & Non-Cognitive Tests

                    Achievement         Aptitude              Non-Cognitive
      Measure       past learning       potential             attitudes, values, preferences
      Development   content analysis    job/skills analysis   surveys
      Validation    retrospective       predictive            predictive
      Content       dependent           independent           independent
      Coachable?    very much           very little           can be faked

  31. Some Underused Criticisms Choosing the Wrong Test Type Lax Security Threats to Privacy Not Testing When Benefits are Clear 31

  32. Large-scale test, tight security 32

  33. Large-scale test, lax security 33

  34. Some Underused Criticisms Choosing the Wrong Test Type Lax Security Threats to Privacy Not Testing When Benefits are Clear 34

  35. Threats to Privacy Increase in the Internet Age Lavishly-funded government-sponsored hacking Ransomware Commercial incentives to collect personal data Schools often store psychological and medical data on students Socio-Emotional Learning (SEL) programs will increase amount 35

  36. Some Underused Criticisms Choosing the Wrong Test Type Lax Security Threats to Privacy Not Testing When Benefits are Clear 36

  37. Cognitive Scientists’ 6 Strategies for Effective Learning Retrieval Practice Interleaving Spaced Practice Concrete Examples Dual Coding Elaboration 37

  38. 10 benefits of testing and their applications to education Roediger, Putnam and Smith Benefit 1: The Testing Effect: Retrieval Aids Later Retention Benefit 2: Testing Identifies Gaps in Knowledge Benefit 3: Testing Causes Students to Learn More from the Next Study Episode Benefit 4: Testing Produces Better Organization of Knowledge Benefit 5: Testing Improves Transfer of Knowledge to New Contexts Benefit 6: Testing can Facilitate Retrieval of Material That was not Tested Benefit 7: Testing Improves Metacognitive Monitoring Benefit 8: Testing Prevents Interference from Prior Material when Learning New Material Benefit 9: Testing Provides Feedback to Instructors Benefit 10: Frequent Testing Encourages Students to Study SOURCE: Roediger, Putnam, & Smith, Ten benefits of testing and their applications to educational practice, Psychology of Learning and Motivation, 55, 2011. 38

  39. Why consequential tests? Most people respond to both intrinsic and extrinsic motivators, and the proportion varies from individual to individual. Consequential tests provide both forms of inducement. Consequential tests tend to be taken more seriously and administered with tighter security.

  40. Large-scale tests are needed for other purposes, such as monitoring and system diagnosis selection to programs workforce planning accountability credentialing 40

  41. Some large-scale test advantages On per-student basis, inexpensive Cognitive laboratory pre-testing possible Standardization offers comparisons across schools and regions. May produce high-quality test items that schools and teachers can use. 41

  42. Figure 1: Average TIMSS Score and Number of Quality Control Measures Used, by Country [y-axis: Average Percent Correct (grades 7 & 8); x-axis: Number of Quality Control Measures Used; series: Top-Performing Countries, Bottom-Performing Countries] SOURCE: Phelps, Benchmarking to the best in mathematics, Evaluation Review, 2001 42

  43. Figure 2: Average TIMSS Score and Number of Quality Control Measures Used (each adjusted for GDP/capita), by Country [y-axis: Average Percent Correct (grades 7 & 8) (per GDP/capita); x-axis: Number of Quality Control Measures Used (per GDP/capita)] SOURCE: Phelps, Benchmarking to the best in mathematics, Evaluation Review, 2001 43

  44. US State of California University Admission Test Decision Faculty conducts large study showing clear information benefits of admission testing. Votes in favor of continuing to use them. Board of Directors overrules them, will ban use of admission tests. Cite usual, many-times disproven equity arguments. 44


  48. Over- and Under-used Criticisms of Standardized Tests https://nonpartisaneducation.org richard {at} nonpartisaneducation {dot} org 48
