Enhanced Pre-training for Social Media Summarization

Enhanced LM pre-training using salient span representations aims to improve key information prioritization in social media summarization tasks. The method involves aligning user posts with TL;DR summaries, identifying salient spans, and utilizing text infilling for representation learning.



Presentation Transcript


  1. SigBART: Enhanced Pre-training via Salient Content Representation Learning for Social Media Summarization. Sajad Sotudeh and Nazli Goharian. Appearing at the 12th International Workshop on Natural Language Processing for Social Media (SocialNLP), in conjunction with WWW 2024, May 2024.

  2. Social Media Post Summarization. Substantial increase in mental health discussions from 2013 to 2022; the surge in forum engagement highlights the need for effective information processing. Objective: develop a neural text summarization method for faster processing of such content, to assist mental health professionals. *Figure borrowed from a previous study (Sotudeh, Goharian, and Young, 2022).

  3. Social Media Post Summarization. TL;DR (Too Long; Didn't Read): terminology indicating the summary of a post. Post-TL;DR alignment: identifying words/phrases within the post that are part of the TL;DR summary. How can these alignments be used in the LM's pre-training process? Example post: "my classmate wanted to come see something with me, I agreed. she leant on my shoulder the whole time." Example TL;DR: "Don't like her back, but I think she has a crush on me. Concerned about our friendship."

  4. Hypothesis: enhancing the LM's pre-training with salient span representations improves the model's ability to prioritize key information in summarization, as evidenced by evaluation metrics. Approach: identify salient spans (by aligning the post with its TL;DR) and perform representation learning via a text infilling objective.

  5. Step 1: Identification of Salient Spans (Alignment). Aligning user posts with TL;DR summaries through sequence tagging, which tags post terms that are replicated in the TL;DR summary. A word from the post is labeled 1 only if it is part of the longest common subsequence of the post and the TL;DR; all other words are labeled 0. Post: "We visited the new cafe downtown, which serves exquisite artisan coffee and freshly baked pastries." TL;DR: "Loved the artisan coffee at the new downtown cafe."
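
As a rough illustration of this step (not the authors' released code), the tagging can be sketched as a longest-common-subsequence match over lowercased whitespace tokens; the tokenization and matching rules here are assumptions.

```python
# Sketch of the alignment step: tag each post token 1 if it belongs to the
# longest common subsequence (LCS) of the post and the TL;DR, else 0.
# Lowercased whitespace tokenization is an assumption for illustration.
def lcs_tags(post_tokens, tldr_tokens):
    n, m = len(post_tokens), len(tldr_tokens)
    # Standard dynamic-programming LCS table.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if post_tokens[i].lower() == tldr_tokens[j].lower():
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to mark which post tokens participate in the LCS.
    tags = [0] * n
    i, j = n, m
    while i > 0 and j > 0:
        if post_tokens[i - 1].lower() == tldr_tokens[j - 1].lower():
            tags[i - 1] = 1
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return tags

post = "We visited the new cafe downtown which serves exquisite artisan coffee and freshly baked pastries".split()
tldr = "Loved the artisan coffee at the new downtown cafe".split()
print(list(zip(post, lcs_tags(post, tldr))))
```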

  6. Step 2: Salient Span Representation Learning. The identified salient spans are masked in the input post (e.g., "We visited the new cafe downtown, which serves exquisite artisan coffee and freshly baked pastries." with its salient spans replaced by [MASK] tokens), and the encoder-decoder model is trained with a token infilling objective to reconstruct the original, uncorrupted post.
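
A minimal sketch of the corruption step, assuming BART-style text infilling in which each contiguous salient span is collapsed into a single mask token; the mask symbol and span handling below are illustrative, not the authors' exact implementation.

```python
# Replace each contiguous run of salient tokens (tag == 1) with one mask
# token; the corrupted sequence is fed to the encoder, and the decoder is
# trained to reconstruct the original, uncorrupted post.
MASK = "[MASK]"  # placeholder symbol; BART uses its own special mask token

def mask_salient_spans(tokens, tags):
    corrupted, in_span = [], False
    for tok, tag in zip(tokens, tags):
        if tag == 1:
            if not in_span:           # open a new salient span
                corrupted.append(MASK)
                in_span = True
        else:
            corrupted.append(tok)
            in_span = False
    return corrupted

tokens = "We visited the new cafe downtown".split()
tags = [0, 0, 1, 1, 1, 0]
print(mask_salient_spans(tokens, tags))  # ['We', 'visited', '[MASK]', 'downtown']
```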

  7. Experiments. Datasets: Pre-training dataset: Webis-TLDR-17, containing 3.8M post-TL;DR pairs (Völske et al., 2017). Fine-tuning dataset: MentSum, covering over 24K mental health posts (Sotudeh and Goharian, 2022). Comparison: CurrSum (Sotudeh et al., 2022): a SOTA curriculum-guided approach for mental health post summarization. BART + R3F (Aghajanyan et al., 2021): a SOTA summarizer on the social media task that discourages representation change during fine-tuning. MatchSum (Zhong et al., 2020): an extractive summarizer that ranks spans of the source in semantic space as candidate summaries. BART (Lewis et al., 2020), Pegasus (Zhang et al., 2019), and *SocialPegasus.
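
For context, the fine-tuning stage could be set up roughly as below with Hugging Face Transformers, assuming a BART checkpoint that has already been pre-trained with the salient-span infilling objective; the checkpoint path, column names, and hyperparameters are placeholders, not the authors' actual configuration.

```python
# Rough sketch of fine-tuning a (salient-span) pre-trained BART on a
# post -> TL;DR dataset such as MentSum. Paths and hyperparameters are
# illustrative placeholders.
from transformers import (BartForConditionalGeneration, BartTokenizerFast,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = BartForConditionalGeneration.from_pretrained("path/to/sigbart-pretrained")
tokenizer = BartTokenizerFast.from_pretrained("path/to/sigbart-pretrained")

def preprocess(example):
    # Tokenize the post as encoder input and the TL;DR as the decoder target.
    model_inputs = tokenizer(example["post"], max_length=1024, truncation=True)
    labels = tokenizer(example["tldr"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

args = Seq2SeqTrainingArguments(
    output_dir="sigbart-mentsum",
    learning_rate=3e-5,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    predict_with_generate=True,
)
# trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
#                          train_dataset=train_set.map(preprocess))
# trainer.train()
```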

  8. Experimental Results. [Bar charts comparing the systems on Rouge-1, Rouge-2, Rouge-L, and BertScore.] Statistically significant improvements on all metrics (p < 0.05)!
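
The automatic metrics reported above can be computed with the standard rouge-score and bert-score packages; the snippet below is an illustrative sketch on a single prediction-reference pair, not the authors' evaluation script.

```python
# Compute ROUGE-1/2/L F1 and BERTScore F1 for one prediction against one
# reference; in practice these would be averaged over the test set.
from rouge_score import rouge_scorer
from bert_score import score as bertscore

prediction = "Loved the artisan coffee at the new downtown cafe."
reference = "We loved the artisan coffee at the new cafe downtown."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

P, R, F1 = bertscore([prediction], [reference], lang="en")
print(rouge["rouge1"].fmeasure, rouge["rouge2"].fmeasure,
      rouge["rougeL"].fmeasure, F1.mean().item())
```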

  9. Experimental Results. Human evaluation (1-5 scale) on a set of 50 randomly sampled posts, with 2 human annotators. Fluency: is the TL;DR understandable? Informativeness: does the TL;DR provide useful information? Faithfulness: is the TL;DR supported by its post? Moderate inter-annotator agreement. Relative improvements (%): Fluency +1.8, Informativeness +6.4, Faithfulness +3.4.

  10. Error Analysis. Case studies. Bold: salient content (words/phrases to be picked up). Gray: unfaithful content.

  11. Ethics & Privacy. Adherence to ethics: only anonymized data is used, to avoid user identification. Automated summarization caution: potential for misrepresenting complex narratives.

  12. Summary. Learning salient span representations via a token infilling pre-training objective yielded significantly improved summaries (ROUGE and BERTScore) in mental health post summarization. Human evaluations showed the system's superiority in terms of qualitative criteria, with the largest improvement on informativeness. While the system can pick up on relevant information, faithfulness remains an open question for future work in underperforming cases.

  13. Thank You For Your Attention
