Advancements in Logical Natural Language Generation from Open-Domain Tables

Cutting-edge research in logical natural language generation (NLG) is transforming the field by moving beyond traditional surface realization to generate summarized text, infer trends, and apply logical and mathematical operations. By addressing limitations such as the lack of logical inference and summarization, as well as hallucination, NLG systems can provide high-level information to users and create summarized reports from vast quantities of data. This progress is exemplified by datasets like TabFact, which support table-based fact verification and logical NLG tasks.


Uploaded on Dec 05, 2024



Presentation Transcript


  1. Logical Natural Language Generation from Open-Domain Tables Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen and William Yang Wang

  2. Background Natural Language Generation: generating text from structured data/knowledge. Existing NLG datasets: E2ENLG (generate text from dialog acts), WebNLG (generate descriptions from RDF triples), WeatherGov (generate weather reports from an infobox), WikiBio (generate biographies from a one-row table), ROTOWIRE (generate sports reports from NBA tables).

  3. Background Example of a traditional table-to-text dataset. Medal Table from Tournament: Nation: Canada | Gold Medal: 3 | Silver Medal: 1 | Rank: 5. Surface realization: "Canada obtained 3 gold medals and 1 silver medal, and ranked in 5th position."

  4. Background SOTA results on different NLG datasets. [Bar chart: BLEU-4 scores on WebNLG, E2ENLG, WikiBio, WeatherGOV] Extremely high scores, over 60%. Does this mean we have already solved NLG?

  5. Background Limitations of traditional NLG tasks. No logical inference: the generated texts simply restate world facts. No summarization: the generated texts cannot summarize or aggregate the most interesting information. Hallucination: the generated texts sometimes contradict the real world.

  6. Logical NLG Beyond surface realization: generating summarized text, concluding trends or implicit information, involving logical/mathematical operations. Applications: providing higher-level information to users; generating summarized reports from large quantities of data.

  7. Logical NLG Examples beyond surface realization. Medal Table from Tournament: Canada (3 gold, 1 silver, rank 5), U.S. (7 gold, 2 silver, rank 1), Mexico (2 gold, 5 silver, rank 6). Beyond surface realization: "The U.S. has obtained the most gold medals in the tournament and ranked 1st. Mexico had fewer gold medals but more silver medals than Canada."

  8. Logical NLG Dataset Dataset source: collected from TabFact; we take the logically supported statements as our oracle table descriptions. Dataset statistics: Vocab: 122K | Examples: 37K | Tables: 7.3K | Source: Annotated | Domain: Open | Schema: Unlimited. TabFact: A Large-scale Dataset for Table-based Fact Verification. Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou and William Yang Wang. Proceedings of ICLR 2020, Addis Ababa, Ethiopia.

  9. Logical NLG Dataset Frequent logical operations: Superlative ("A is the most ..."), Average ("The average of ... for A is ..."), Count ("There are N people ..."), Numeric/Time Comparison ("A happens one day before B"), Both/Neither ("Both A and B are ..."), Sum/Diff ("A team has obtained a total of B medals."), etc. Goal: generating sentences that are not only fluent but also logically entailed by the given table.
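The logical operations listed above can be made concrete with a small sketch that executes them over the tournament medal table from the earlier slides. The dict-based table representation and helper names here are illustrative, not from the paper.

```python
# Medal table from the earlier slides, as a list of row dicts.
table = [
    {"Nation": "Canada", "Gold Medal": 3, "Silver Medal": 1, "Rank": 5},
    {"Nation": "U.S.",   "Gold Medal": 7, "Silver Medal": 2, "Rank": 1},
    {"Nation": "Mexico", "Gold Medal": 2, "Silver Medal": 5, "Rank": 6},
]

def row(nation):
    # Look up the unique row for a nation.
    return next(r for r in table if r["Nation"] == nation)

# Superlative: "The U.S. has obtained the most gold medals."
most_gold = max(table, key=lambda r: r["Gold Medal"])["Nation"]

# Count: "There are N nations in the table."
n_nations = len(table)

# Sum: "Canada has obtained a total of 4 medals."
total_canada = row("Canada")["Gold Medal"] + row("Canada")["Silver Medal"]

# Diff: "Canada obtained 1 more gold medal than Mexico."
gold_diff = row("Canada")["Gold Medal"] - row("Mexico")["Gold Medal"]

print(most_gold, n_nations, total_canada, gold_diff)
```

Each statement type on the slide corresponds to one small aggregation over the table, which is what makes these sentences verifiable rather than mere surface realization.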

  10. Evaluation Fluency metrics: BLEU-1/2/3, perplexity. Factualness metrics: model-based semantic-parsing (SP) accuracy, model-based inference (NLI) accuracy, adversarial classification accuracy, human evaluation.

  11. Semantic-Parsing Accuracy Apply a semantic parser to the generated sentence to verify its logical factualness. Sentence: "Canada obtained 1 more gold medal than Mexico" → Parse [Link->Search] → Eq(Diff(Hop(Filter(Nation==Canada), Gold Medal), Hop(Filter(Nation==Mexico), Gold Medal)), 1) → Execute → True/False
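A minimal sketch of the SP-Acc idea: the generated sentence is parsed into an executable logical form, which is run against the table, and the sentence counts as factual if execution returns True. The operator names follow the slide (Filter/Hop/Eq); the parser itself is assumed and replaced here by a hand-written logical form, and the Diff operator is an assumption added to express the comparison.

```python
table = [
    {"Nation": "Canada", "Gold Medal": 3},
    {"Nation": "Mexico", "Gold Medal": 2},
]

def Filter(rows, col, val):
    # Keep rows where the column equals the value.
    return [r for r in rows if r[col] == val]

def Hop(rows, col):
    # Project a single-row result onto one column.
    assert len(rows) == 1
    return rows[0][col]

def Diff(a, b):
    return a - b

def Eq(a, b):
    return a == b

# Logical form for "Canada obtained 1 more gold medal than Mexico".
result = Eq(
    Diff(Hop(Filter(table, "Nation", "Canada"), "Gold Medal"),
         Hop(Filter(table, "Nation", "Mexico"), "Gold Medal")),
    1,
)
print(result)  # True -> the sentence is counted as factual
```

The hard part in practice is the parsing step, which is why the slides later note that the evaluation models themselves are error-prone.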

  12. NLI Accuracy Apply a natural language inference model to verify logical factualness. Sentence: "Canada obtained 1 more gold medal than Mexico". Linearize the table ("In the first row, ... In the second row, ...") and score p(y | T) with the NLI model.
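A minimal sketch of the table linearization step the NLI metric relies on: each row is turned into a sentence so a standard NLI model can score entailment of the generated statement against the flattened table. The template wording below is an assumption, not the paper's exact format.

```python
def linearize(table):
    # Turn each row of the table into one English sentence.
    parts = []
    for i, row_dict in enumerate(table, 1):
        cells = ", ".join(f"{col} is {val}" for col, val in row_dict.items())
        parts.append(f"In row {i}, {cells}.")
    return " ".join(parts)

table = [
    {"Nation": "Canada", "Gold Medal": 3},
    {"Nation": "Mexico", "Gold Medal": 2},
]
print(linearize(table))
```

The linearized string would then serve as the NLI premise, with the generated sentence as the hypothesis.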

  13. Adversarial Accuracy Classify the correctness of paired oracle/adversarial text inputs with the trained NLG model. Original: "Canada obtained 1 more gold medal than Mexico". Adv: "Canada obtained 1 less gold medal than Mexico". The pair counts as correct when the trained NLG model assigns p(original | T) > p(adv | T).
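A toy sketch of the Adv-Acc comparison: the trained NLG model scores both the original and the adversarially flipped sentence, and the pair is counted correct when the original gets higher likelihood. A real implementation would query the trained model; here a dummy unigram scorer with assumed log-probabilities stands in purely to show the comparison.

```python
import math

def log_likelihood(sentence, logprobs):
    # Sum per-token log-probs under a stand-in unigram model;
    # unseen tokens get a small floor probability.
    return sum(logprobs.get(tok, math.log(1e-6)) for tok in sentence.split())

# Assumed scores: the model prefers "more" over "less" in this context.
logprobs = {"more": math.log(0.3), "less": math.log(0.05)}

orig = "Canada obtained 1 more gold medal than Mexico"
adv = "Canada obtained 1 less gold medal than Mexico"

correct = log_likelihood(orig, logprobs) > log_likelihood(adv, logprobs)
print(correct)
```

Because both sentences differ in a single word, the comparison isolates whether the model's distribution prefers the logically correct variant.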

  14. Metrics Discussion Model-based accuracy (SP/NLI accuracy). Pros: evaluating the peak of the generation model's distribution, not biased. Cons: evaluation models are prone to errors (60-65%). Adversarial accuracy. Pros: the evaluation is precise and stable. Cons: not necessarily evaluating the peak of the generation model's distribution, can be biased. The first two metrics are still in a preliminary stage, only for diagnosis purposes.

  15. Non-Pre-trained Baselines Field-infusing model (Lebret et al., 2016): feeding the table header into the cell encoding. Field-gating model (Liu et al., 2018): feeding the table header as an additional gate in the LSTM. [1] Neural Text Generation from Structured Data with Application to the Biography Domain. Rémi Lebret, David Grangier and Michael Auli, EMNLP 2016. [2] Table-to-text Generation by Structure-aware Seq2seq Learning. Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang and Zhifang Sui, AAAI 2018.

  16. Pre-trained Baselines [GPT-GEN] Transform the table into a template sentence and use it as the prefix for GPT-2 to generate the description.
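A minimal sketch of the GPT-GEN setup: the table is flattened into a template sentence that serves as the generation prefix. The exact template wording below is an assumption; the paper's prefix format may differ.

```python
def table_to_prefix(title, table):
    # Flatten each row into "col: val" pairs, join rows, and append a
    # generation cue that the decoder continues from.
    rows = ". ".join(
        "; ".join(f"{col}: {val}" for col, val in row.items()) for row in table
    )
    return f"Given the table titled '{title}': {rows}. Describe the table:"

prefix = table_to_prefix(
    "Medal Table from Tournament",
    [{"Nation": "Canada", "Gold Medal": 3}],
)
print(prefix)
```

In the actual system this prefix would be tokenized and fed to GPT-2, which then generates the description as a continuation.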

  17. Pre-trained Baselines [BERT-GEN] Concatenate the table and the half-generated text into a `sentence` and mask the following words; unveil the [MASK] tokens one by one.

  18. Coarse-to-Fine Method First generate a template with placeholders, then realize the placeholders with real entities/numbers. GPT-2: "[ENT] obtained [ENT] more [ENT] than [ENT]." → "Canada obtained 1 more gold medal than Mexico."
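A minimal sketch of the coarse-to-fine control flow: stage one emits a template with [ENT] placeholders, stage two fills each placeholder with an entity or number. Both stages are GPT-2 in the paper; here the "models" are stubbed with fixed outputs so the two-pass structure is visible.

```python
def coarse_stage(table):
    # Stub for the template generator (GPT-2 in the paper): emit a
    # skeleton sentence with entity/number placeholders.
    return "[ENT] obtained [ENT] more [ENT] than [ENT]."

def fine_stage(template, fillers):
    # Stub for the surface realizer: fill placeholders left to right.
    out = template
    for f in fillers:
        out = out.replace("[ENT]", f, 1)
    return out

template = coarse_stage(None)
sentence = fine_stage(template, ["Canada", "1", "gold medal", "Mexico"])
print(sentence)  # Canada obtained 1 more gold medal than Mexico.
```

Splitting generation this way defers the choice of concrete entities and numbers until the logical skeleton of the sentence is fixed, which is the motivation the slide gives for the method.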

  19. Training Strategy Maximum likelihood training: maximize the sequence log-likelihood. REINFORCE algorithm: use a semantic parser to assign a factualness reward. Adversarial regularization: randomly synthesize logically refuted sentences and suppress their likelihood as a regularizer.
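The adversarial-regularization idea can be sketched as a scalar objective: the usual negative log-likelihood of the gold sentence plus a term that pushes a synthesized refuted sentence's likelihood down. The hinge/margin form below is an assumption for illustration, not the paper's exact loss, and the log-probabilities are stand-in numbers rather than model outputs.

```python
def adv_reg_loss(logp_gold, logp_adv, margin=1.0):
    # Negative log-likelihood of the gold sentence ...
    nll = -logp_gold
    # ... plus a hinge term that is zero once the gold sentence's
    # log-likelihood exceeds the refuted sentence's by the margin.
    hinge = max(0.0, margin - (logp_gold - logp_adv))
    return nll + hinge

# Stand-in scores: the model slightly prefers the gold sentence already,
# but not yet by the full margin, so the hinge term is active.
loss = adv_reg_loss(logp_gold=-2.0, logp_adv=-2.5)
print(loss)
```

The hinge vanishes once refuted sentences are sufficiently suppressed, so the regularizer stops fighting the fluency objective, which connects to the later observation that Adv-Reg trades some fluency for Adv-Acc.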

  20. Experiments Fluency-based metrics [Overall]. [Bar chart: BLEU-3 evaluation of different methods; y-axis: BLEU score]

  21. Experiments 1. Pre-training helps improve text fluency. [Bar chart: BLEU-3 evaluation of different methods]

  22. Experiments 2. RL/Adv-Reg training degrades fluency. [Bar chart: BLEU-3 evaluation of different methods]

  23. Experiments 3. Coarse-to-fine achieves the best fluency. [Bar chart: BLEU-3 evaluation of different methods]

  24. Experiments Factualness metrics [Overall]. [Bar chart: factualness evaluation of different models on SP-Acc, NLI-Acc, Adv-Acc; y-axis: percentage]

  25. Experiments 1. Pre-training can help improve factualness. [Bar chart: factualness evaluation on SP-Acc, NLI-Acc, Adv-Acc]

  26. Experiments 2. RL can help SP-Acc but not the others. [Bar chart: factualness evaluation on SP-Acc, NLI-Acc, Adv-Acc]

  27. Experiments 3. Adversarial regularization can help Adv-Acc but not the others. [Bar chart: factualness evaluation on SP-Acc, NLI-Acc, Adv-Acc]

  28. Experiments 4. Coarse-to-fine achieves the overall best score. [Bar chart: factualness evaluation on SP-Acc, NLI-Acc, Adv-Acc]

  29. Human Evaluation We employ human workers to evaluate the factualness of the generated text. [Bar chart: human evaluation results (Non-Sense, Wrong, Partial Correct, Correct) for Transformer, GPT-2, Adv-Reg, RL, Coarse-to-Fine]

  30. Human Evaluation Only 20% of generated sentences from the best model are logically plausible. [Bar chart: human evaluation results (Non-Sense, Wrong, Partial Correct, Correct) for Transformer, GPT-2, Adv-Reg, RL, Coarse-to-Fine]

  31. Forward Logical Dependency When we have the prefix "Colombia has", we need to generate the number (e.g., "3") next in left-to-right order, yet its value depends on the semantics of the words that follow it.

  32. Challenge 1. Monotonic left-to-right generation models cannot handle the forward logical dependency. 2. Existing probability-driven generation models do not encode symbolic execution, so they cannot guarantee correctness.

  33. Takeaway Message Logical NLG is a new dataset that poses a challenge to NLG models' inference capability. Existing models achieve only 20% logical correctness, still a very premature stage. Future research could integrate symbolic execution into the text generation procedure.
