Enhancing Certification Exam Item Prediction with Machine Learning
In this study, Alan Mead and Chenxuan Zhou explore the use of machine learning to predict Bloom's Taxonomy levels for certification exam items. The research investigates how well a Naïve Bayesian classifier predicts and distinguishes levels of cognitive complexity. Through its research questions, methodology, and findings, the study aims to improve item classification accuracy and cross-domain validation.
Presentation Transcript
Using Machine Learning to Predict Bloom's Taxonomy Level for Certification Exam Items Alan Mead and Chenxuan Zhou Certiverse
Agenda
- Introduction to the problem
- Methodology
- Results
- Discussion and future directions
Bloom's taxonomy
- Bloom's (1956, 2001) taxonomy of cognitive complexity
- Six levels, from simple recall to creation of novel work products
- Widely used to classify assessment materials
- Does your program use Bloom's taxonomy? If so, how?
  - Collapsing levels is very common
- What is the psychometric value of Bloom's taxonomy for exams?
Source: Vanderbilt University Center for Teaching, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
Should psychometricians care?
- The evidence for an association between Bloom's level and item difficulty is mixed:
  - Some studies found Bloom's taxonomy only weakly related to item difficulty (Tan & Othman, 2013; Mesic & Muratovic, 2011)
  - Two studies of TIMSS data found a strong relationship (Rosca, 2004; Sinharay, 2016)
- I'm not aware of any evidence of better validity for more complex items
- But if cognitive complexity is specified in the blueprint, then it is related to content validity
- We should, at any rate, appreciate new methodologies
Research questions
- RQ1: How well does the ML model (a Naïve Bayesian classifier) predict the Bloom's taxonomy levels of test items?
- RQ2: Does the ML model have similar predictive accuracy at all Bloom's taxonomy levels?
- RQ3: How well does the ML model distinguish items at Bloom's taxonomy level 1 from items at higher levels?
- RQ4: How well does a model fit in one domain cross-validate in another domain?
- RQ5: What aspects of items influence the predictions of the model?
Samples
- MCQ sample:
  - Purpose: model training and cross-validation (used for RQs 1, 2, 3 & 5)
  - Source: Mohammad and Omar (2020) study
  - Number of items: 141
  - Classification procedure: class labels retrieved from Mohammad & Omar (2020)
- Cross-domain sample:
  - Purpose: model validation in a cross-domain item pool (used for RQ 4)
  - Source: extracted from online practice exams on an IT-related topic
  - Number of items: 141
  - Classification procedure: classified by the second author
- Cognitive classes (both samples): Level 1 = Remember; Level 2 = Understand; Level 3 = Apply; Level 4 = Analyze + Evaluate + Create
Feature extraction
- A step in natural language processing to reduce noise and boost signal in the predictor features
- Conducted using the R packages tm and SnowballC
- Words of each item normalized by:
  - Converting to lower case
  - Removing punctuation, numbers, and stopwords
  - Stemming
  - Tokenization
- For example, item 63 in the MCQ sample:
  - Before feature extraction: "Which is required when creating a new Plan?"
  - After feature extraction: "requir", "creat", "new", "plan"
- Document-term matrix (DTM): each document (i.e., item) in a row, with columns representing the terms appearing across all items (see the sketch below)
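A minimal sketch of this pipeline in R, assuming the tm and SnowballC packages named above; the `items` vector and the use of item 63 as input are illustrative, not the authors' code:

```r
library(tm)          # text-mining framework: corpus handling, DTM construction
library(SnowballC)   # Snowball stemmer used by tm's stemDocument

# Illustrative input: item 63's stem from the slide
items <- c("Which is required when creating a new Plan?")

corpus <- VCorpus(VectorSource(items))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                 # drop numbers
corpus <- tm_map(corpus, removeWords, stopwords("en"))  # drop stopwords
corpus <- tm_map(corpus, stemDocument)                  # stem: requir, creat, new, plan

# Document-term matrix: one row per item, one column per surviving term
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```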
Naïve Bayesian model
- $P(c \mid d)$ is the probability of a document $d$ belonging to class $c$
- $t_1, t_2, \ldots, t_n$ are the unique tokens of document $d$
- $P(t_1, t_2, \ldots, t_n \mid c)$ is the conditional probability of the tokens $t_1, t_2, \ldots, t_n$ occurring in a document we know to be from class $c$
- $P(c)$ is the prior probability of a document belonging to class $c$
- $P(t_1, t_2, \ldots, t_n)$ is the probability of the tokens in the data
- "Naïve" signals the assumption that the terms in a document $d$ are independent. So,

$$P(c \mid d) = \frac{P(t_1, t_2, \ldots, t_n \mid c)\, P(c)}{P(t_1, t_2, \ldots, t_n)} \quad \text{and} \quad P(t_1, t_2, \ldots, t_n \mid c) = \prod_{i=1}^{n} P(t_i \mid c)$$

- The Naïve Bayes probability model combined with the maximum a posteriori (MAP) decision rule:

$$\hat{c} = \underset{c}{\arg\max}\; P(c) \prod_{i=1}^{n} P(t_i \mid c)$$
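A minimal sketch in R of how such a classifier can be fit and applied, not the authors' implementation; likelihoods here are unsmoothed per-class document frequencies, which matches the numbers in the worked example on the next slide (e.g., 0.0444 ≈ 2/45):

```r
# Fit: X is a 0/1 document-term matrix (rows = items), y holds the class labels
fit_nb <- function(X, y) {
  y <- as.character(y)                          # class labels, e.g. "c1".."c4"
  classes <- sort(unique(y))
  prior <- table(y)[classes] / length(y)        # P(c): class base rates
  lik <- t(sapply(classes, function(cl)         # P(t|c): share of class-cl items
    colMeans(X[y == cl, , drop = FALSE] > 0)))  #   containing each term (unsmoothed)
  list(prior = prior, lik = lik)                # lik: rows = classes, cols = terms
}

# Predict: MAP rule over P(c) * product of P(t|c) for the terms present in x
predict_nb <- function(model, x) {              # x: 0/1 term vector for one item
  post <- sapply(rownames(model$lik), function(cl)
    model$prior[[cl]] * prod(model$lik[cl, x > 0]))
  names(which.max(post))                        # class with the largest posterior
}
```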
Example: Predicting the level for item 63

Prior probabilities: P(c1) = 0.4455, P(c2) = 0.1683, P(c3) = 0.2475, P(c4) = 0.1386

Likelihood of each term t given class c:

Term     P(t|c1)   P(t|c2)   P(t|c3)   P(t|c4)
requir   0.0444    0.0000    0.0000    0.1429
creat    0.0889    0.1176    0.1200    0.1429
new      0.0222    0.0588    0.0000    0.0000
plan     0.0222    0.0000    0.0000    0.0000

Posterior probability (unnormalized): P(c1|d) ∝ 0.4455 × 0.0444 × 0.0889 × 0.0222 × 0.0222 = 8.6664E-07, while the zero likelihoods drive the posteriors for c2, c3, and c4 to 0. So item 63 is classified as c1 (Level 1, Remember).
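The same calculation in R, using the slide's numbers (a verification sketch, not the authors' code):

```r
priors <- c(c1 = 0.4455, c2 = 0.1683, c3 = 0.2475, c4 = 0.1386)

# P(term | class) for item 63's four tokens; columns are c1..c4
lik <- rbind(
  requir = c(0.0444, 0.0000, 0.0000, 0.1429),
  creat  = c(0.0889, 0.1176, 0.1200, 0.1429),
  new    = c(0.0222, 0.0588, 0.0000, 0.0000),
  plan   = c(0.0222, 0.0000, 0.0000, 0.0000)
)

posterior <- priors * apply(lik, 2, prod)  # P(c) * prod_i P(t_i|c), unnormalized
posterior                    # c1: ~8.67e-07; c2, c3, c4: exactly 0
names(which.max(posterior))  # MAP prediction: "c1" (Level 1, Remember)
```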
Methodology
- Model 1 (RQs 1, 2, 4 & 5):
  - Outcome classes: level 1 (remember); level 2 (understand); level 3 (apply); level 4 (analyze + evaluate + create)
  - Trained on 70% of the MCQ sample, using stratified sampling based on Bloom's levels (see the split sketch below)
  - Cross-validated on the remaining 30% of the MCQ sample
  - Validated in the cross-domain sample
- Model 2 (RQ 3):
  - Outcome classes: level 1 (remember); level 2 and above (understand + apply + analyze + evaluate + create)
  - Same training/testing partition of the MCQ sample
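A minimal sketch of the stratified 70/30 split, assuming the caret package and a data frame `mcq` with a `bloom` factor holding the class labels (both names illustrative):

```r
library(caret)

set.seed(123)  # arbitrary seed, for reproducibility only
train_idx <- createDataPartition(mcq$bloom, p = 0.70, list = FALSE)  # stratified by level
train <- mcq[train_idx, ]   # ~70% of items, level proportions preserved
test  <- mcq[-train_idx, ]  # remaining ~30% for cross-validation

prop.table(table(train$bloom))  # base rates should mirror the full MCQ sample
```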
RQ1: How well does the Naïve Bayesian classifier predict?

Descriptives (number of items; base rate):
- Training sample: level 1 = 45 (0.446); level 2 = 17 (0.168); level 3 = 25 (0.248); level 4 = 14 (0.139); overall = 101 (1.000)
- Cross-validation sample: level 1 = 18 (0.450); level 2 = 7 (0.175); level 3 = 10 (0.250); level 4 = 5 (0.125); overall = 40 (1.000)

Confusion matrix, training (rows = predicted, columns = actual):
       1    2    3    4
1     44    0    0    0
2      0   17    0    0
3      0    0   24    0
4      1    0    1   14

Confusion matrix, cross-validation (rows = predicted, columns = actual):
       1    2    3    4
1     16    1    1    0
2      2    2    1    0
3      0    3    8    0
4      0    1    0    5

Performance by class, training:
Level       1       2       3       4      Overall (weighted)
Precision   1.000   1.000   1.000   0.875  0.983
Recall      0.978   1.000   0.960   1.000  0.980
F1          0.989   1.000   0.980   0.933  0.980

Performance by class, cross-validation:
Level       1       2       3       4      Overall (weighted)
Precision   0.889   0.400   0.727   0.833  0.756
Recall      0.889   0.286   0.800   1.000  0.775
F1          0.889   0.333   0.762   0.909  0.762

Overall accuracy: training = 0.980; cross-validation = 0.775
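The per-class statistics follow directly from the confusion matrices; a small R sketch (illustrative, not the authors' code) that reproduces the cross-validation column:

```r
# Per-class precision/recall/F1 from a confusion matrix
# with rows = predicted and columns = actual
metrics_by_class <- function(cm) {
  precision <- diag(cm) / rowSums(cm)  # correct / all predicted as that class
  recall    <- diag(cm) / colSums(cm)  # correct / all actually in that class
  f1        <- 2 * precision * recall / (precision + recall)
  rbind(precision, recall, f1)
}

cm_cv <- matrix(c(16, 1, 1, 0,
                   2, 2, 1, 0,
                   0, 3, 8, 0,
                   0, 1, 0, 5),
                nrow = 4, byrow = TRUE,
                dimnames = list(predicted = paste0("L", 1:4),
                                actual    = paste0("L", 1:4)))

round(metrics_by_class(cm_cv), 3)  # precision 0.889/0.400/0.727/0.833, etc.
sum(diag(cm_cv)) / sum(cm_cv)      # accuracy: 31/40 = 0.775
```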
RQ2: Good prediction for all levels?

(Same results table as RQ1, above.) As the cross-validation columns show, predictive accuracy is not uniform across levels: level 2 is clearly the weakest (F1 = 0.333, recall = 0.286), while levels 1, 3, and 4 hold up well (F1 = 0.889, 0.762, and 0.909).
RQ3: How well does the NBC distinguish L1?

Descriptives (number of items; base rate):
- Training sample: L1 = 44 (0.440); L2+ = 56 (0.560)
- Cross-validation sample: L1 = 19 (0.463); L2+ = 22 (0.537)

Confusion matrix, training (rows = predicted, columns = actual):
       L1   L2+
L1     43    0
L2+     1   56

Confusion matrix, cross-validation (rows = predicted, columns = actual):
       L1   L2+
L1     16    3
L2+     3   19

Performance for L1 (training / cross-validation): precision = 1.000 / 0.842; recall = 0.977 / 0.842; F1 = 0.989 / 0.842
Overall accuracy: training = 0.990; cross-validation = 0.854
RQ4: Does the classifier cross-validate?

Cross-domain validation sample descriptives (number of items; base rate): level 1 = 26 (0.184); level 2 = 23 (0.163); level 3 = 15 (0.106); level 4 = 77 (0.546); overall = 141 (1.000)

Confusion matrix (rows = predicted, columns = actual):
       1    2    3    4
1      6    4    1   25
2      2    3    2   15
3     10    9    7   23
4      8    7    5   14

Performance by class:
Level       1       2       3       4      Overall (weighted)
Precision   0.167   0.136   0.143   0.412  0.293
Recall      0.231   0.130   0.467   0.182  0.213
F1          0.194   0.133   0.219   0.252  0.218

Overall accuracy = 0.213, 95% CI (0.148, 0.290); no-information rate (NIR) = 0.546
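The accuracy interval and the comparison to the NIR are standard exact binomial computations (the same ones caret's confusionMatrix reports); in R, with 30 of 141 items correct:

```r
binom.test(30, 141)$conf.int  # exact 95% CI for accuracy: ~(0.148, 0.290)

# One-sided test: is accuracy better than the no-information rate (0.546)?
binom.test(30, 141, p = 0.546, alternative = "greater")$p.value  # ~1, i.e., no
```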
RQ5: How does the NBC predict level?

Terms with the highest likelihoods P(t|c) in each class:
- c1 (Remember): follow (0.7111), option (0.4000), use (0.3778), user (0.2000), box (0.2000), dialog (0.2000), panel (0.2000), imag (0.1333), layer (0.1333), order (0.1111), clip (0.1111), take (0.1111), action (0.1111), work (0.1111), select (0.1111), tool (0.1111)
- c2 (Understand): follow (0.2353), choos (0.2353), campaign (0.1765), action (0.1176), creat (0.1176), data (0.1176), can (0.1176), xxxx (0.1176), purpos (0.1176), workflow (0.1176), edit (0.1176), form (0.1176), method (0.1176), nondestruct (0.1176), set (0.1176), three (0.1176), two (0.1176)
- c3 (Apply): use (0.2000), analyst (0.2000), report (0.2000), campaign (0.1200), creat (0.1200), data (0.1200), user (0.1200), busi (0.1200), practition (0.1200), best (0.1200), task (0.1200)
- c4 (Analyze + Evaluate + Create): creat (0.1429), workflow (0.1429), segment (0.1429), express (0.1429), requir (0.1429), result (0.1429), shown (0.1429)
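One way to produce such lists, given the likelihood matrix from a fitted model like the `fit_nb` sketch earlier (rows = classes, columns = terms); the names are illustrative:

```r
# Top-n terms for one class, sorted by P(t|c)
top_terms <- function(lik, cl, n = 10) {
  head(sort(lik[cl, ], decreasing = TRUE), n)
}

top_terms(model$lik, "c1")  # e.g., follow 0.711, option 0.400, use 0.378, ...
```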
Discussion
- This method seems unsuitable for exams without extant item pools
- But it could be useful for ongoing exam programs with an existing pool
- If the intended Bloom's level is known, one can easily imagine automated coaching that advises a SME when an item seems inappropriate for that level
- I wonder if wording changes over time could invalidate the model?
- Is it an issue that we cannot (concisely) explain the predictions?
  - What advice would our automated coach give? "Don't use 'following'"? That seems silly!
- Verbs were not particularly/uniquely helpful in classifying items
Future Directions
- Additional features:
  - Are higher-level items longer? I think they should be
  - Distinguishing between the item's set-up/scenario and the item's question
- What would it take to make prediction much more robust?
  - Probably massive datasets
  - Probably a less naïve model
  - I'm not sure what the upper limit is; Chenxuan and Alan only agree on Bloom's classifications for about 70% of items
- Predicting L1 vs. >L1 seems most urgent/important
Any questions? Feel free to contact us: alan@certiverse.com chenxuan@certiverse.com Thanks!