Enhancing Query-Focused Summarization with Contrastive Learning
This study incorporates contrastive learning into an abstractive summarization system to sharpen its ability to distinguish salient from non-salient content, yielding summaries of higher relevance to the query. A contrastive learning framework built on segment scores lets the model separate positive from negative instances during training. Two popular query-focused summarization datasets are used to demonstrate the potential for improving summarization performance.
QontSum: On Contrasting Salient Content for Query-focused Summarization
Sajad Sotudeh, Nazli Goharian
Long paper at The First Workshop on Generative Information Retrieval, July 27, 2023
Query-focused summarization (QFS)
Input: a long document (e.g., a meeting transcript) and a query.
Goal: generate a summary that best answers the given query.
In the context of Generative IR (Gen-IR), QFS is viewed as a method associated with Grounded Answer Generation (GAR). An instance of the task is illustrated below.
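For concreteness, a QFS instance can be viewed as a (query, document, reference summary) triple. The Python sketch below is purely illustrative; the class and field names are assumptions, not anything defined by the paper.

```python
from dataclasses import dataclass

@dataclass
class QFSInstance:
    """One query-focused summarization example (illustrative only)."""
    query: str      # the question the summary must answer
    document: str   # long source text, e.g., a meeting transcript
    summary: str    # reference summary grounded in the document
```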
QFS: Existing Results
Query (question): Why did the team choose single-curved design when discussing remote control style?
Human-written summary: Industrial Designer introduced uncurved, single-curved and double-curved design at first. User Interface and Project Manager thought that uncurved design was too dull and Industrial Designer pointed out that double-curved design would make it impossible to use scroll wheels, so the team selected single-curved design eventually.
SOTA (SegEnc-W): Industrial Designer thought that if they stick with the simple straight-forward not curve design, it would be too dull and customers would not like it. So they decided to use a single-curved case with a single curve. Also, the energy source was traditional batteries and solar, so they could integrate it.
While the SOTA can pick up information directly relevant to the query, it may produce irrelevant information, too. How can we improve the performance of the SOTA further such that it generates more relevant summaries?
Hypothesis
Incorporating contrastive learning into the abstractive summarization system enhances the system's ability to discern between salient and non-salient content, resulting in summaries of higher relevance to the query.
Approach
Design a contrastive learning framework so that the model learns from positive and negative instances, trained jointly with the generation objective (a sketch of the objective follows).
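As a rough sketch of joint training, the total loss can be written as the generation (negative log-likelihood) loss plus a weighted contrastive term. The balancing weight lambda below is an illustrative assumption, not necessarily the paper's exact formulation.

```latex
% Joint objective (sketch): generation loss plus a contrastive term;
% \lambda is an assumed balancing hyperparameter.
\mathcal{L} = \mathcal{L}_{\mathrm{NLL}} + \lambda \, \mathcal{L}_{\mathrm{contrastive}}
```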
Contrastive Learning framework (architecture figure)
Document segments and the query are encoded, and a segment scorer (an MLP head on the encoder) assigns each segment a probability (e.g., 0.59, 0.23, 0.55, 0.68 in the figure). The contrastive module selects positive (high-scoring) and negative (low-scoring) segments and computes an InfoNCE contrastive loss, which is trained jointly with the encoder-decoder summarizer. A minimal code sketch follows.
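Below is a minimal PyTorch sketch of the contrastive module, assuming cosine similarity, a temperature of 0.1, and negatives drawn from the lowest-scoring segments. The function names, tensor shapes, and the choice of anchor (e.g., a pooled query representation) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def select_segments(seg_embs, seg_scores, num_neg=2):
    """Pick the highest-scoring segment as the positive and the
    lowest-scoring segments as negatives (an assumed selection rule).
    seg_embs: (batch, K, dim); seg_scores: (batch, K)."""
    order = seg_scores.argsort(dim=-1, descending=True)          # (batch, K)
    idx = order.unsqueeze(-1).expand(-1, -1, seg_embs.size(-1))  # (batch, K, dim)
    ranked = seg_embs.gather(1, idx)                             # best-first
    positive = ranked[:, 0]                                      # (batch, dim)
    negatives = ranked[:, -num_neg:]                             # (batch, num_neg, dim)
    return positive, negatives

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: pull the anchor toward the positive segment and
    away from the negatives, using cosine similarity."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos_sim = (a * p).sum(-1, keepdim=True)        # (batch, 1)
    neg_sim = torch.einsum("bd,bkd->bk", a, n)     # (batch, num_neg)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The positive always sits at index 0 of the logits.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

Placing the positive at index 0 turns InfoNCE into a standard cross-entropy over one positive and several negatives, which keeps the module cheap to compute alongside the generation loss.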
Datasets
Two commonly used query-focused summarization datasets:
QMSum: a meeting summarization dataset with 1,808 query-focused summaries (1257/272/279 train/validation/test split). Average meeting length: 9K tokens; average summary length: 70 tokens.
SQuALITY: an abstractive summarization dataset containing 100 stories, 500 questions (i.e., 5 questions per story), and 2,000 summaries (i.e., 4 reference summaries per question-story pair), with a 39/25/36 train/validation/test split.
Experimental Results: QMSum
QontSum achieves SOTA on RG-1 and RG-L and comparable BS, with reduced computational overhead compared to the previous SOTA.
Experimental Results: SQuALITY
QontSum achieves SOTA on RG-L and comparable BS, with reduced computational overhead compared to the previous SOTA.
Back to the first example
Query (question): Why did the team choose single-curved design when discussing remote control style?
Human-written: Industrial Designer introduced uncurved, single-curved and double-curved design at first. User Interface and Project Manager thought that uncurved design was too dull and Industrial Designer pointed out that double-curved design would make it impossible to use scroll wheels, so the team selected single-curved design eventually.
SOTA (SegEnc-W): Industrial Designer thought that if they stick with the simple straight-forward not curve design, it would be too dull and customers would not like it. So they decided to use a single-curved case with a single curve. Also, the energy source was traditional batteries and solar, so they could integrate it.
Ours (QontSum): When discussing the remote control style, the team found that the simple uncurved design was too dull for the customers, and double-curved design would be too hard on the customers' hands. So they decided to stick with the single curve design, which was more user-friendly and easy to use.
QontSum improves the Query Relevance of the generated summary/answer.
Human Analysis
Analysis set: 50 randomly selected query-meeting instances from QMSum.
Metrics:
Fluency: understandable (5) or completely gibberish (1)?
Relevance: high relevance (5) or low relevance (1)?
Faithfulness: supported (5) or completely unsupported (1) by the meeting materials?
Human Analysis
QontSum shows a significant improvement on the Relevance metric compared to the previous SOTA.
Case Study (1)
Query (question): Summarize the discussion about user requirements.
SegEnc-W: Marketing thought that the user really really wanted a fancy look and feel. They were not so so interested in functional look and feels, like the one you had shown, David, with all the buttons and sounds good technically, but it's not what they want and they still want it to be technologically innovative, so maybe it's sort of related to what you had said with the speech
QontSum: User requirements were mainly about fancy look and feel and easy to use. The group agreed that the remote control should be easy to find, easy-to-use, and not too expensive. They also agreed that it should not be too complicated. They wanted to design a remote control that was easy to be used all the time, every day, every hour, for everyone.
Human: The marketing expert gave three points from the market analysis. Users would like to have a fancy look and feel rather than a functional one. Apart from that, they still wanted the product to be technologically innovative. At the same time, being user-friendly was also important. Then the group discussed how to make these requirements into practice.
Highlighted error: information redundancy (readability).
Case Study (2)
Query (question): What did the professor think about VAD latency?
QontSum: The professor thought that the latency of the VAD was two hundred milliseconds. He thought that this was too large. He mentioned that the team was using on-line normalization, which was used before the delta computation. The professor also mentioned that VAD latency was increasing. He wanted to know how much the team could cut down on the delta.
SegEnc-W: The professor thought that the latency of the VAD was two hundred and twenty milliseconds. The VAD is used for on-line normalization, and it's used before the delta computation, so if you add these components it goes to a hundred and seventy, right?
Human: The professor wanted to know how much latency the VAD was adding to the model. The professor thought that the process could work in parallel. The professor informed the team that the max latency was unknown.
Highlighted errors (by the baseline): factual correctness (faithfulness); transcript copying (readability).
Error categories
Fluency: information redundancy (26%), speaker mix-up (34%).
Relevance: critical details missing (33%), query misinterpretation (too specific/broad queries) (29%).
Faithfulness: misalignment with query (31%), deficiency in capturing the outcome of a topic (37%).
Conclusion
Incorporating contrastive learning into the abstractive summarizer:
Helps the model better distinguish salient from non-salient content.
Significantly improves the query relevance human metric, with reduced computational overhead.
Human evaluation sheds light on the system's common errors:
Fluency: information redundancy (26%), speaker mix-up (34%).
Relevance: critical details missing (33%), query misinterpretation (too specific/broad queries) (29%).
Faithfulness: misalignment with query (31%), deficiency in capturing the outcome of a topic (37%).
Thanks for your attention. Open to questions!