Evaluation of Text Simplification Systems using Machine Translation Techniques
This research paper presents a method to evaluate text simplification systems using machine translation evaluation techniques. It focuses on assessing the quality of simplification output based on properties like grammaticality, meaning preservation, and simplicity. The study aims to develop evaluation standards for the output and includes a problem statement, proposed solutions, and experiment setup with results. Key aspects discussed include motivation, problem classification, grammaticality assessment, language modeling, and more.
- Text Simplification
- Machine Translation
- Evaluation Techniques
- Grammaticality Assessment
- Language Modeling
Presentation Transcript
Using Machine Translation Evaluation Techniques to Evaluate Text Simplification Systems
Sandeep Mathias, Pushpak Bhattacharyya
Department of Computer Science and Engineering, IIT Bombay
{sam, pb}@cse.iitb.ac.in
Outline
- Motivation
- Problem Statement
- Solutions Proposed: Grammaticality, Meaning Preservation, Simplicity
- Experiment Setup and Results
Motivation
Develop evaluation standards for the quality of the output produced by text simplification systems. Address individual properties of the output, such as:
- Grammaticality of the output
- Meaning of the input preserved in the output
- Simplicity of the output
- Overall usability of the system
Problem Statement
Classify the output of a text simplification system as either good, ok, or bad.
Training Set: 505 input-output sentence pairs whose grammaticality, meaning preservation, simplicity, and usability are each classified as good, ok, or bad.
Test Set: 126 input-output sentence pairs which we must classify.
Example
- Tell me vs. Say me
- I told that <something> vs. I said that <something>
- An elephant vs. An equipment
What is Grammaticality?
Grammaticality is a measure of how grammatically correct the output sentence is.
Technique used: language modelling.
Corpus used: Simple English Wikipedia, from the English-Simple English Wikipedia document-aligned parallel corpus [4].
Why Language Modelling?
Grammatical errors are either syntactic errors or usage errors. Syntactic errors are errors that require a rule to detect. Usage errors are caused by the usage of particular words (example: say vs. tell). Language modelling looks at how probable a sentence is based on the occurrence of its different n-grams.
Grammaticality of the Output
Features:
- Language model score
- Number of words in the sentence
- Number of out-of-vocabulary words
- Perplexity of the sentence
- Average perplexity per word of the sentence
The values of these features are computed using SRILM.
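The paper computes these features with SRILM; as a rough illustration only, the sketch below derives comparable numbers with the KenLM Python bindings instead. The model file name simple_wiki.arpa is a placeholder, not from the paper.

```python
# Sketch: grammaticality features, approximated with KenLM instead of SRILM.
# Assumes the kenlm Python bindings are installed and an n-gram model trained
# on Simple English Wikipedia is available (simple_wiki.arpa is a placeholder).
import kenlm

model = kenlm.Model("simple_wiki.arpa")

def grammaticality_features(sentence: str) -> dict:
    words = sentence.split()
    # Total log10 probability of the sentence (with <s> and </s> markers).
    log_prob = model.score(sentence, bos=True, eos=True)
    # full_scores yields (log10 prob, matched n-gram length, is_oov) per token.
    oov_count = sum(1 for _, _, oov in model.full_scores(sentence) if oov)
    perplexity = model.perplexity(sentence)
    return {
        "log_prob": log_prob,
        "num_words": len(words),
        "num_oov": oov_count,
        "perplexity": perplexity,
        # One possible reading of "average perplexity per word"; SRILM's own
        # ppl value is already word-level, so this is an assumption.
        "perplexity_per_word": perplexity / max(len(words), 1),
    }

print(grammaticality_features("Warsaw lies on the Vistula River."))
```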
Example
Input: Warsaw lies on the Vistula River, about 240 miles southeast of the Baltic city of Gdansk.
Good output: Warsaw lies on the Vistula River, about 240 miles southeast of the Baltic city of Danzig.
Bad output: Warsaw is on the Vistula River, about 240 kilometres southeast of the Baltic city of Gdansk.
What is Meaning Preservation?
A measure of how much of the meaning of the input is conveyed in the output.
What do we require?
A metric that detects:
- Exact matches
- Stem matches
- Synonym matches
- Paraphrase matches
The answer: METEOR! [2]
How does METEOR score?
We use the default weights for the matches as described in METEOR version 1.4 [2]:
- Exact: 1.00
- Stem: 0.60
- Synonym: 0.80
- Paraphrase: 0.60
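As a hedged illustration of this step, the sketch below scores an input/output pair with NLTK's METEOR implementation. Unlike the METEOR 1.4 system used in the paper, NLTK's version covers exact, stem, and WordNet-synonym matches but has no paraphrase table, so the scores will differ.

```python
# Sketch: approximating the meaning-preservation score with NLTK's METEOR
# implementation (no paraphrase matching, unlike METEOR 1.4 in the paper).
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed for synonym matching

input_sentence = ("Warsaw lies on the Vistula River, about 240 miles "
                  "southeast of the Baltic city of Gdansk.")
output_sentence = ("Warsaw lies on the Vistula River, about 240 miles "
                   "southeast of the Baltic city of Danzig.")

# The input sentence acts as the "reference" and the simplified output as the
# "hypothesis"; recent NLTK versions expect pre-tokenized word lists.
score = meteor_score([input_sentence.split()], output_sentence.split())
print(f"METEOR (NLTK approximation): {score:.3f}")
```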
Simplicity
Types of Simplicity
Structural complexity:
- Mary Kom, the first Indian woman to win an Olympic medal in boxing, suffered an upset against Azize Nimani of Germany.
  vs. Mary Kom was the first Indian woman ... / Mary Kom suffered an upset ...
Lexical complexity:
- Medical practitioner vs. doctor
- Hypertension vs. high blood pressure
Structural Complexity
Structural complexity is defined as the complexity of the sentence based on the complexity of its parse tree. We define it as the number of sentences, from the main clause, relative clauses, appositives, noun and verb participial phrases, and other subordinate clauses, that we extract using Michael Heilman's factual statement extractor [3].
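Heilman's extractor itself is not reproduced here; as a rough stand-in, the sketch below counts clause-level nodes in a bracketed constituency parse, which captures a similar notion of how many clauses a sentence packs together. The parse strings and the clause-label set are illustrative assumptions, not part of the paper.

```python
# Sketch: a rough proxy for structural complexity. Instead of Heilman's
# factual statement extractor, count clause-level nodes (S, SBAR, ...) in a
# constituency parse, assuming a bracketed parse string is already available.
from nltk import Tree

CLAUSE_LABELS = {"S", "SBAR", "SBARQ", "SINV", "SQ"}

def structural_complexity(parse_str: str) -> int:
    tree = Tree.fromstring(parse_str)
    return sum(1 for subtree in tree.subtrees() if subtree.label() in CLAUSE_LABELS)

# A single-clause sentence vs. one with an embedded relative clause.
simple_parse = "(S (NP (NNP Mary) (NNP Kom)) (VP (VBD suffered) (NP (DT an) (NN upset))))"
complex_parse = ("(S (NP (NP (NNP Mary) (NNP Kom)) (SBAR (WHNP (WP who)) "
                 "(S (VP (VBD won) (NP (DT a) (NN medal)))))) "
                 "(VP (VBD suffered) (NP (DT an) (NN upset))))")
print(structural_complexity(simple_parse))   # 1
print(structural_complexity(complex_parse))  # 3 (S, SBAR, embedded S)
```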
Lexical Complexity of an n-gram
The lexical complexity LC(g) of a given n-gram g is composed of two parts: corpus complexity CC(g) and syllable count SC(g).
Corpus complexity [1] is the ratio of the log likelihood of the n-gram in an English corpus to the log likelihood of the n-gram in a Simple English corpus:
CC(g) = LL(g | English) / LL(g | Simple)
LC(g) = CC(g) x SC(g)
Lexical Complexity of a Sentence
If we have a sequence of unigrams a b c d e f g, higher-order n-grams will contain each unigram multiple times. Hence we introduce a weight w_n = 1/n for n-grams of order n. The lexical complexity of a sentence S, LC(S), is then given by
LC(S) = sum over n of w_n * sum over n-grams g of S of LC(g)
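The sketch below implements the lexical-complexity computation as read from the two slides above. The toy corpus counts, the add-one smoothing, and the vowel-group syllable heuristic are illustrative assumptions, not the authors' implementation, which uses English and Simple English Wikipedia.

```python
# Sketch: lexical complexity of n-grams and sentences, following the slide
# definitions. Corpus counts and the syllable heuristic are toy placeholders.
import math
import re
from collections import Counter

# Toy corpora standing in for English Wikipedia and Simple English Wikipedia.
english_counts = Counter({"hypertension": 50, "high": 400, "blood": 300, "pressure": 350})
simple_counts = Counter({"hypertension": 2, "high": 500, "blood": 320, "pressure": 330})

def log_likelihood(ngram: str, counts: Counter) -> float:
    total = sum(counts.values())
    return math.log((counts[ngram] + 1) / (total + len(counts)))  # add-one smoothing

def syllable_count(ngram: str) -> int:
    # Crude heuristic: count vowel groups in each word.
    return sum(max(1, len(re.findall(r"[aeiouy]+", w))) for w in ngram.split())

def corpus_complexity(ngram: str) -> float:
    # CC(g) = LL(g | English) / LL(g | Simple English)
    return log_likelihood(ngram, english_counts) / log_likelihood(ngram, simple_counts)

def lexical_complexity(ngram: str) -> float:
    # LC(g) = CC(g) * SC(g)
    return corpus_complexity(ngram) * syllable_count(ngram)

def sentence_lexical_complexity(sentence: str, max_n: int = 3) -> float:
    # Sum LC over all n-grams, down-weighting order n by w_n = 1/n so that
    # unigrams repeated inside higher-order n-grams are not over-counted.
    words = sentence.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += (1.0 / n) * sum(lexical_complexity(g) for g in ngrams)
    return total

print(sentence_lexical_complexity("high blood pressure"))
print(sentence_lexical_complexity("hypertension"))
```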
Experimental Setup and Results
Experimental Setup
For each task, we treat the problem as a classification problem and classify the outputs of the system as either good, ok, or bad.
Classifier used: REPTree, with bagging.
Training data: 505 sentence pairs.
Test data: 126 sentence pairs.
Baseline: majority class of the training data (good in all cases).
Metrics Used
- Accuracy (Acc.)
- Mean absolute error (MAE)
- Root mean square error (RMSE)
For MAE and RMSE, the values 100, 50, and 0 are assigned to the classes good, ok, and bad respectively.
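A minimal sketch of this protocol, assuming scikit-learn's bagged decision trees as a stand-in for Weka's REPTree with bagging, and random placeholder features in place of the real task features:

```python
# Sketch: classification and scoring protocol. Bagged decision trees
# (scikit-learn's default base estimator) stand in for Weka's REPTree + bagging;
# the feature matrices and labels below are random placeholders.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

LABEL_VALUE = {"good": 100, "ok": 50, "bad": 0}

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(505, 5)), rng.normal(size=(126, 5))
y_train = rng.choice(["good", "ok", "bad"], size=505, p=[0.6, 0.25, 0.15])
y_test = rng.choice(["good", "ok", "bad"], size=126, p=[0.6, 0.25, 0.15])

clf = BaggingClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Map classes to 100 / 50 / 0 before computing MAE and RMSE, as on the slide.
true_vals = np.array([LABEL_VALUE[y] for y in y_test])
pred_vals = np.array([LABEL_VALUE[y] for y in pred])
print("Acc. (%):", 100 * accuracy_score(y_test, pred))
print("MAE:", mean_absolute_error(true_vals, pred_vals))
print("RMSE:", np.sqrt(mean_squared_error(true_vals, pred_vals)))

# Baseline: always predict the majority class of the training data ("good").
baseline_vals = np.full_like(true_vals, LABEL_VALUE["good"])
print("Baseline Acc. (%):", 100 * np.mean(y_test == "good"))
print("Baseline MAE:", mean_absolute_error(true_vals, baseline_vals))
print("Baseline RMSE:", np.sqrt(mean_squared_error(true_vals, baseline_vals)))
```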
Results: Grammaticality

| Metric | Training Set Baseline | Training Set | Test Set Baseline | Test Set |
|---|---|---|---|---|
| Acc. (%) | 75.64 | 76.04 | 76.19 | 72.22 |
| MAE | 17.23 | 16.63 | 18.25 | 21.43 |
| RMSE | 36.96 | 36.01 | 21.63 | 25.78 |
Results: Meaning Preservation

| Metric | Training Set Baseline | Training Set | Test Set Baseline | Test Set |
|---|---|---|---|---|
| Acc. (%) | 58.21 | 66.34 | 57.94 | 63.49 |
| MAE | 28.61 | 19.50 | 28.97 | 20.63 |
| RMSE | 46.94 | 35.25 | 35.30 | 26.75 |
Results: Simplicity

| Metric | Training Set Baseline | Training Set | Test Set Baseline | Test Set |
|---|---|---|---|---|
| Acc. (%) | 52.67 | 48.31 | 55.56 | 47.62 |
| MAE | 32.18 | 32.87 | 29.37 | 34.13 |
| RMSE | 49.60 | 48.59 | 31.22 | 38.85 |
Overall Quality
In addition to the above three tasks of classifying grammaticality, meaning preservation, and simplicity, we also look at the overall quality of the output, based on the input. We run two experiments for this. In the first, the feature set is the output classes of the other three tasks; in the second, it is the values of the different features used in those tasks.
Results: Overall Quality

| Metric | Training Set Baseline | Training Set (Classes) | Training Set (Values) | Test Set Baseline | Test Set (Classes) | Test Set (Values) |
|---|---|---|---|---|---|---|
| Acc. (%) | 43.76 | 45.74 | 56.23 | 43.65 | 33.33 | 39.68 |
| MAE | 33.17 | 31.39 | 23.56 | 28.17 | 42.46 | 34.92 |
| RMSE | 46.51 | 44.67 | 36.70 | 40.52 | 47.83 | 42.97 |
Conclusions
Metrics such as METEOR, as well as techniques like language modelling, achieve good results, as good as or better than the baselines of those tasks. Evaluating simplicity, though, is a more complex task and sometimes depends on who the target audience is.
References
[1] Biran, O., Brody, S., and Elhadad, N. (2011). Putting it simply: a context-aware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, volume 2, pages 496-501. Association for Computational Linguistics.
[2] Denkowski, M. and Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
[3] Heilman, M. and Smith, N. A. (2010). Extracting simplified statements for factual question generation. In Proceedings of QG2010: The Third Workshop on Question Generation, page 11.
[4] Kauchak, D. (2013). Improving text simplification language modeling using unsimplified text data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
Thank You