Enhancing E-Assessment with Natural Language Processing

 
Using NLP to Support Scalable Assessment of Short Free Text Responses
 
Alistair Willis
 
Department of Computing and Communications,
The Open University, UK
 
e-assessment at the Open University
 
Short answer questions
assess detail and specific knowledge
ability to verbalise knowledge
cognitively different from Multiple Choice Questions:
student is unprompted
 
Valuable assessment method for distance learning
eg. Open University
MOOCs
 
Open University has large answer sets for training data
high population courses
 
 
e-assessment at the Open University
 
 
Online, automatically marked questions in use for Level 1 OU Science
successful in formative assessment
positive student feedback
 
moving towards summative assessment
 
 
BUT
Existing system uses (simplified) regular expressions
need to be written by computing expert
 
Difficult to write effective marking rules
(near) impossible for non-computing expert
 
 
Automatic assessment at the OU

e-assessment architecture

[Figure: student input passes from the Virtual Learning Environment to the Answer Server, which applies the question mark scheme and returns a mark/feedback.]
 
 
 
 
 
 
 
 
 
 
Mark scheme maps student input onto mark
 
Can we simplify mark scheme authoring?
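
The mapping itself is small. A minimal sketch of the marking step, assuming a mark scheme is an ordered list of (rule, mark, feedback) entries; the names and the fall-through behaviour are illustrative assumptions, not the OU system's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MarkResult:
    mark: int
    feedback: str

# Hypothetical marking step: try each rule in order; the first rule that
# matches the student's response determines the mark and the feedback.
def mark_response(mark_scheme: list[tuple[Callable[[str], bool], int, str]],
                  response: str) -> MarkResult:
    for rule, mark, feedback in mark_scheme:
        if rule(response):
            return MarkResult(mark, feedback)
    return MarkResult(0, "No marking rule matched")  # assumed fall-through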
 
 
Requirements for automatic assessment system
 
Award of marks should be:
 
consistent
 
explanatory
why was this mark (not) awarded?
 
maintainable
by a domain expert
not necessarily a computer scientist
 
 
“Traceable” (eg. QAA requirement)

NLP for e-assessment
 
How much linguistic analysis is actually needed?
 
bag of words/keyword analysis not good enough
 
  
high temperature and pressure ( ✓ )
high pressure and temperature ( ✓ )
temperature and high pressure ( X )
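
A one-line check makes the point: as bags of words, the correct first response and the incorrect third one are indistinguishable.

```python
# The correct and the incorrect response contain exactly the same words,
# so any pure keyword/bag-of-words matcher must score them identically.
correct = "high temperature and pressure"
incorrect = "temperature and high pressure"
print(set(correct.split()) == set(incorrect.split()))  # True
```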
 
But deep parsing can overcommit (eg. IAT by Mitchell et al, 2002)

( high pressure ) and temperature   v.   high ( pressure and temperature )
 
 
 
Marking as Information Extraction
 
 
Treat marking as rule-based Information Extraction task
 
matching rules express the mark scheme (Junker et al, 1999)
keyword recognition + linear word order
simple spelling correction ( edit distance ≤ 1 )
 
term(R, Term, I)
template(R, Template, I)
precedes(I1, I2)
closely_precedes(I1, I2)
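
A minimal Python sketch of these primitives, assuming a response is tokenised into lower-case words. The function shapes (and the closely_precedes window) are assumptions for illustration; Amati's actual matcher is not shown on the slides.

```python
def within_edit_distance_1(a: str, b: str) -> bool:
    """Simple spelling correction: true if a and b differ by at most one
    insertion, deletion or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            return (a[i+1:] == b[i+1:]    # substitution
                    or a[i:] == b[i+1:]   # extra character in b
                    or a[i+1:] == b[i:])  # extra character in a
    return True                           # differ only in a final character

def term(tokens: list[str], word: str) -> list[int]:
    """term(R, Term, I): indices I where Term occurs in response R,
    allowing edit distance <= 1 for misspellings."""
    return [i for i, t in enumerate(tokens) if within_edit_distance_1(t, word)]

def template(tokens: list[str], stem: str) -> list[int]:
    """template(R, Template, I): indices of tokens starting with the stem,
    e.g. 'dens' matches both 'dense' and 'density'."""
    return [i for i, t in enumerate(tokens) if t.startswith(stem)]

def precedes(i: int, j: int) -> bool:
    """precedes(I1, I2): linear word order constraint."""
    return i < j

def closely_precedes(i: int, j: int, window: int = 3) -> bool:
    """closely_precedes(I1, I2): as precedes, but within a small window
    (the window size is an assumption; the slides do not give one)."""
    return 0 < j - i <= window
```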
 
 
 
 
 
Example...
 
 
 
term(R, oil, I) ∧ term(R, less, J) ∧ template(R, dens, K) ∧ precedes(I, J) → correct(R)
 
  
oil is less dense than water ( ✓ )
oil has less density than water ( ✓ )
water is less dense than oil ( X )
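
Rendered against the sketch primitives above, the rule becomes a small function (a hypothetical translation, not Amati's own code):

```python
# term(R, oil, I) ∧ term(R, less, J) ∧ template(R, dens, K) ∧ precedes(I, J)
# → correct(R), using the sketch primitives defined earlier.
def correct(response: str) -> bool:
    tokens = response.lower().split()
    return bool(template(tokens, "dens")) and any(
        precedes(i, j) for i in term(tokens, "oil") for j in term(tokens, "less")
    )

assert correct("oil is less dense than water")
assert correct("oil has less density than water")
assert not correct("water is less dense than oil")  # 'oil' follows 'less'
```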
 
 
 
Amati
 
Amati system
make mark scheme authoring easy for tutors
domain specialist
not (necessarily) computing specialist
 
bootstrapping model
load an initial set of responses
mark responses (by hand)
repeat:
    write/edit rules to reflect marks
    extend set of responses
    apply rules to extended set
    confirm/change assigned marks
until no further improvement
apply rules to remaining responses
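
Schematically, the bootstrapping loop might look like this, with the interactive steps (hand marking, rule editing) and the rule matcher passed in as hypothetical callbacks; a sketch of the workflow, not the Amati interface:

```python
# hand_mark(response, suggested) asks a human for a mark; edit_rules(rules,
# marks) lets the author revise the rule set; apply_rules(rules, response)
# runs the matcher. All three are stand-ins for interactive steps.
def build_mark_scheme(responses, hand_mark, edit_rules, apply_rules, batch=50):
    seen = responses[:batch]
    marks = {r: hand_mark(r, None) for r in seen}      # mark initial set by hand
    rules = edit_rules([], marks)                      # write the first rules
    while True:
        seen = responses[:len(seen) + batch]           # extend the response set
        marks = {r: hand_mark(r, apply_rules(rules, r)) for r in seen}
        new_rules = edit_rules(rules, marks)           # edit rules to fit marks
        if new_rules == rules:                         # no further improvement
            break
        rules = new_rules
    return rules  # finally, apply the rules to the remaining responses
```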
 
 
 
 
Evaluation
 
Is the representation language expressive enough?
 
How good are the Amati mark schemes?
How do they compare to a gold standard?
How do they compare to a human marker?
 
Tested on 6 questions
Student responses from two presentations (2008/2009)
Mark scheme built on 2008 responses
Tested on 2009 responses (unseen)
 
 
 
 
Test Set Construction
 
How to get a gold standard/ground truth?
 
Human marks can be unreliable
Low inter-marker agreement
 
Multiple-pass annotation
Initially marked by two subject-specialist tutors
allowed to confer during marking
Module chair as final authority
third marker
also adjudicates on disagreements
 
 
Results
 
High accuracy
although task is highly constrained

Question     Responses   Accuracy (2008, seen) / %   Accuracy (2009, unseen) / %
Sandstone    1711        98.4                        97.5
Snowflake    2057        91.0                        81.7
Charge       1127        98.9                        97.6
Rocks        1429        99.0                        89.6
Sentence     1173        98.2                        97.5
Oil           817        96.1                        91.5

α higher than human inter-marker agreement on the same questions
71.2% ≤ α ≤ 88.2%
 
Rule Induction
 
To induce mark schemes, try Inductive Logic Programming
Aleph (Srinivasan 2004)

Amati proposes rules from marked responses
Still requires manual post-editing
eg. predicted rule:

term(R, high, A) ∧ term(R, pressure, B) ∧ term(R, and, C) → correct(R)

matches
high pressure and temperature
high pressure and heat

Question author can make the rule more intuitively plausible
repeated runs improve the overall coverage
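
Using the term() sketch from earlier, the rule's over-generality is easy to demonstrate: it asks only that the three keywords occur, with no ordering or adjacency constraint (hypothetical rendering):

```python
# Hypothetical rendering of the induced rule: all three terms must occur,
# anywhere in the response and in any order.
def induced_correct(response: str) -> bool:
    tokens = response.lower().split()
    return all(term(tokens, w) for w in ("high", "pressure", "and"))

print(induced_correct("high pressure and temperature"))  # True
print(induced_correct("high pressure and heat"))         # True
```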
Some difficult responses
 
Lack of linguistic knowledge not necessarily a drawback
 
  
The oil floats on the water because it is lighter
 
 
  
The oil floats on the water because it is heavier
 
 
Applying BOD (benefit of the doubt), mark both correct
better than attempting to resolve pronoun
easier with linear word order than syntactic analysis
 
Some difficult responses
 
Hard to mark where students can give particular examples
 
A snowflake falls vertically with a constant speed. What can you say about
the forces acting on the snowflake?
 
abstract answers easy to mark
They are equal and opposite
no net forces
all forces cancel out
etc.
 
Much harder to predict all specific examples
the gravity and air resistance are the same
gravity is counteracted by drag
etc.
 
Discussion
 
Ongoing work
 
User studies
Formal comparison of responses marked “by hand” and with Amati
Compare time to mark and inter-marker agreement
 
Linguistic complexity
 
Multiple-mark responses
Better synonym handling
Feedback generation
 
Thank you!