Enhancing Intent Classification with Chain of Thought Prompting


This study explores the use of Chain of Thought Prompting (CoT) for few-shot intent classification using large language models. The approach involves a series of reasoning steps to better understand user intent, leading to improved performance and explainable results compared to traditional prompting methods. The research highlights the benefits of CoT prompting in achieving interpretable classification results and universal system applicability across multiple clients.


Uploaded on Aug 26, 2024



Presentation Transcript


  1. Chain of Thought Prompting for Few-Shot Intent Classification using Large Language Models Dimitrios Koutsianos M.Sc. in Data Science Department of Informatics Supervisor: Ion Androutsopoulos Omilia Supervisors: Themos Stafylakis Panagiotis Tassias 1

  2. Outline 1. Introduction 2. Datasets 3. Prompting Pipeline 4. Initial Results 5. Few-Shot Prompting 6. Ablation Study 7. Error Analysis 8. Conclusions & Future Work 2

  3. Intent Classification A user sends texts to a task-oriented dialog system, which classifies them to recover the user intent. The current LSTM-based classifier needs many training examples per intent and a different system per client. 3

  4. Chain of Thought Prompting A series of intermediate reasoning steps. Better performance than normal prompting. Explainable results. Drawbacks: need for hand-crafted in-shot exemplars; high variance in results; LLMs with fewer than 100B parameters do not exhibit performance gains. 4

  5. 0-shot Chain of Thought Prompting Special phrases concatenated at the end of the prompt. Similar performance to standard CoT, with less variance in results and explainable results. Requires two passes through the LLM: a 1st pass for the CoT and a 2nd for the result. LLMs with fewer than 100B parameters do not exhibit performance gains. 5
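The two-pass procedure described on this slide can be sketched as follows. This is a minimal illustration, not the thesis code: `call_llm` is a hypothetical stand-in for any chat-completion client, and the trigger phrase is one of those used in the slides.

```python
# Sketch of two-pass 0-shot CoT prompting (assumptions: `call_llm` is a
# placeholder for a real LLM client; the trigger phrase is from the slides).

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; echoes part of the prompt so the sketch runs."""
    return f"<model response to: {prompt[:40]}...>"

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot(question: str) -> str:
    # Pass 1: elicit the reasoning chain by appending the trigger phrase.
    reasoning = call_llm(f"{question}\n{COT_TRIGGER}")
    # Pass 2: feed the chain back and ask for the final answer only.
    answer = call_llm(f"{question}\n{COT_TRIGGER}\n{reasoning}\nTherefore, the answer is")
    return answer
```

The double cost of this second pass is exactly what the single-pass "Show your Thoughts" variant on slide 12 avoids.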

  6. Chain of Thought Prompting Intent classification could benefit from CoT prompting: a universal system for multiple clients; interpretable classification results; less cost and easier implementation for Omilia; a better understanding of a user's intent. 6

  7. Outline 1. Introduction 2. Datasets 3. Prompting Pipeline 4. Initial Results 5. Few-Shot Prompting 6. Ablation Study 7. Error Analysis 8. Conclusions & Future Work 7

  8. Datasets CLINC-150: 10 different domains (Banking, Work, Travel etc.), created for out-of-scope detection, 150 intent classes + 1 oos class. BANKING77: 77 fine-grained intent classes, all from the banking domain; its texts more closely resemble real-life data. 8

  9. Datasets Preprocessing The CLINC-150 test set has 30 texts/intent and the BANKING77 test set 40 texts/intent. Keeping 5 texts/intent yields 750 CLINC-150 test texts and 385 BANKING77 test texts. 9
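The subsampling step above (keep 5 texts per intent) might look like the sketch below. The function name and data layout are assumptions; a fixed seed keeps the subsample reproducible.

```python
# Illustrative sketch of per-intent subsampling (assumed data layout:
# a list of (text, intent) pairs; `subsample` is a hypothetical helper).
import random
from collections import defaultdict

def subsample(dataset, per_intent=5, seed=0):
    # Group texts by intent label.
    by_intent = defaultdict(list)
    for text, intent in dataset:
        by_intent[intent].append(text)
    # Draw a fixed-size reproducible sample from each group.
    rng = random.Random(seed)
    kept = []
    for intent, texts in by_intent.items():
        for text in rng.sample(texts, min(per_intent, len(texts))):
            kept.append((text, intent))
    return kept

# e.g. 150 intents x 5 texts = 750 CLINC-150 test texts, matching the slide.
```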

  10. Outline 1. Introduction 2. Datasets 3. Prompting Pipeline 4. Initial Results 5. Few-Shot Prompting 6. Ablation Study 7. Error Analysis 8. Conclusions & Future Work 10

  11. Prompting Pipeline 11

  12. Prompting Techniques CoT-inciting phrases: "Show your Thoughts", "Let's Take a Deep Breath and Work on this Step by Step" (Deep Breath), and "Let's Think Step by Step" (Let's Think). Let's Think originally required going through the LLM twice, once for the CoT and once for the result; we changed it to produce the CoT and the result at the same time, to save time and resources. 12

  13. Prompt Example We have the following set of intents along with their descriptions: schedule_maintenance: The intent "schedule_maintenance" involves seeking help or information regarding the arrangement of upcoming maintenance activities for a car. gas_type: The intent "gas_type" involves seeking information about the specific type or grade of fuel required for a vehicle or a related inquiry about available fuel options. oil_change_when: The intent "oil_change_when" involves seeking information or recommendations regarding the appropriate timing or intervals for performing an oil change in a vehicle, considering factors such as mileage, driving conditions, and the specific requirements of the vehicle manufacturer. oil_change_how: The intent "oil_change_how" pertains to inquiries seeking guidance or instructions on the process of performing an oil change for a vehicle, including steps and recommended tools. shopping_list: The intent "shopping_list" involves requests or actions related to creating, managing, or obtaining information about a list of items to be purchased during a shopping activity, whether it's in-store or online. A user wrote the following text: "put together a list of instructions for me on how to change the oil in my car". The intent of this text is definitely one from the five intents in the previous set. What was the intent of the user when they wrote this text? Show your thoughts, answer in a single sentence, do not speculate and for your answer include the intent as written in the previous set, exactly as it is written there. 13
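A prompt like the one above can be assembled programmatically from the five candidate intents and their descriptions. The helper below is an illustrative sketch, not the pipeline's actual code; `build_prompt` and its signature are assumptions, while the instruction wording is taken from the slide.

```python
# Hypothetical helper that assembles the slide's prompt from a dict mapping
# each candidate intent name to its description.

def build_prompt(intents: dict, user_text: str) -> str:
    lines = ["We have the following set of intents along with their descriptions:"]
    for name, description in intents.items():
        lines.append(f"{name}: {description}")
    lines.append(f'A user wrote the following text: "{user_text}".')
    lines.append(
        "The intent of this text is definitely one from the five intents in the "
        "previous set. What was the intent of the user when they wrote this text? "
        "Show your thoughts, answer in a single sentence, do not speculate and for "
        "your answer include the intent as written in the previous set, exactly as "
        "it is written there."
    )
    return "\n".join(lines)
```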

  14. Outline 1. Introduction 2. Datasets 3. Prompting Pipeline 4. Initial Results 5. Few-Shot Prompting 6. Ablation Study 7. Error Analysis 8. Conclusions & Future Work 14

  15. Initial Results Metric: the intent is extracted by hand from each response, and Accuracy is averaged over three runs: Average Accuracy = (1st run Acc. + 2nd run Acc. + 3rd run Acc.) / 3. Hand extraction tolerates surface variation: "topping_up_by_card" in the dataset vs "top_up_by_card" in the response counts as correct (spelling error), while "how_old_are_you" vs "what_is_your_age" counts as wrong (complete change). 15
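Once the intents have been extracted by hand, the scoring reduces to exact matching averaged over the three runs. The sketch below shows that arithmetic; the function names are illustrative, and the hand-judged leniency (e.g. "top_up_by_card" counted as correct) happens before this step, so plain string equality suffices here.

```python
# Sketch of the accuracy metric: exact match per run, averaged over three runs.
# (Hypothetical helpers; spelling-variant leniency is applied during the manual
# extraction step, before these comparisons.)

def run_accuracy(gold: list, predicted: list) -> float:
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def average_accuracy(gold: list, runs: list) -> float:
    # Average Accuracy = (1st run Acc. + 2nd run Acc. + 3rd run Acc.) / 3
    return sum(run_accuracy(gold, run) for run in runs) / len(runs)
```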

  16. CLINC-150 High performance from all techniques. Show your Thoughts performs the best among the three, Deep Breath comes second, and Let's Think is close behind in third place. 16

  17. BANKING77 Average performance by all 3 techniques. Deep Breath is now the best performer, Show your Thoughts is close behind, and Let's Think performs the worst. 17

  18. Outline 1. Introduction 2. Datasets 3. Prompting Pipeline 4. Initial Results 5. Few-Shot Prompting 6. Ablation Study 7. Error Analysis 8. Conclusions & Future Work 18

  19. Our Approach Every in-shot exemplar features: the top 5 possible intents, their descriptions, a CoT-inciting phrase, and a response with the Chain of Thought before the predicted intent. 19
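The exemplar structure listed above can be sketched as a small data type plus a renderer that concatenates exemplars into a few-shot prompt. All names here (`Exemplar`, `render`, `few_shot_prompt`) are assumptions for illustration; the CoT phrase shown is one of those used in the slides.

```python
# Illustrative sketch of the few-shot prompt layout: each in-shot exemplar
# carries the top-5 intents with descriptions, the CoT-inciting phrase, and a
# CoT response ending in the predicted intent (field names are assumptions).
from dataclasses import dataclass

@dataclass
class Exemplar:
    intents: dict       # top-5 intent name -> description
    user_text: str
    cot_response: str   # chain of thought followed by the predicted intent

COT_PHRASE = "Show your thoughts"

def render(ex: Exemplar) -> str:
    intent_block = "\n".join(f"{k}: {v}" for k, v in ex.intents.items())
    return (f"{intent_block}\nUser text: {ex.user_text}\n"
            f"{COT_PHRASE}.\nResponse: {ex.cot_response}")

def few_shot_prompt(exemplars: list, query: str) -> str:
    shots = "\n\n".join(render(ex) for ex in exemplars)
    # The query gets the same layout, with the response left for the model.
    return f"{shots}\n\nUser text: {query}\n{COT_PHRASE}.\nResponse:"
```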

  20. Our Approach The best 0-shot technique is used in few-shot prompting: Show your Thoughts for CLINC-150 and Deep Breath for BANKING77, in 1-shot, 2-shot & 3-shot variants. The in-shot exemplars are extracted with Llama. 20

  21. CLINC-150 Results 3-shot achieves the best performance 2-shot close behind 1-shot has worse performance than every technique, 0-shot or few-shot 21

  22. CLINC-150 Results text: give me the date that i last got my car checked response: based on the given text, the intent of the user when they wrote "give me the date that i last got my car checked" is to seek information about the last time their car was serviced or maintained, which aligns with the "last_maintenance" intent from the provided set of intents. Accurate & eloquent Chain of Thought; correct & interpretable classification. The Chain of Thought precedes the predicted intent: Llama thought before predicting an intent class. 22

  23. BANKING77 Results Marginal performance boost over 0-shot 3-shot achieves the best performance 2-shot and 1-shot behind 0-shot 23

  24. BANKING77 Results text: why doesn't my balance reflect my transfer response: based on the text "why doesn't my balance reflect my transfer", the user is inquiring about the reason why their account balance does not reflect the recent transfer they made, and therefore the intent of the user is "balance_not_updated_after_bank_transfer". Accurate Chain of Thought, though not as complex as the CLINC-150 one; correct & interpretable classification. The Chain of Thought precedes the predicted intent: Llama thought before predicting an intent class. 24

  25. Outline 1. Introduction 2. Datasets 3. Prompting Pipeline 4. Initial Results 5. Few-Shot Prompting 6. Ablation Study 7. Error Analysis 8. Conclusions & Future Work 25

  26. Ablation Study Check the importance of the Chain of Thought and the intent descriptions & how the results compare with our full system. Gradually remove the Chain of Thought and/or the descriptions. 0-shot methods remain 0-shot; few-shot methods utilize the same prompting phrase as in the original system, with new exemplars. 26
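The two ablation knobs above amount to switching parts of the prompt off. The sketch below shows one way to express that; the builder, its flags, and the prompt wording are all illustrative assumptions, not the thesis prompts.

```python
# Hypothetical prompt builder with the two ablation toggles: the CoT-inciting
# phrase and the intent descriptions can each be switched off independently.

def ablation_prompt(intents: dict, user_text: str,
                    use_cot: bool = True, use_descriptions: bool = True) -> str:
    if use_descriptions:
        intent_block = "\n".join(f"{k}: {v}" for k, v in intents.items())
    else:
        intent_block = "\n".join(intents)  # intent names only
    instruction = "What was the intent of the user?"
    if use_cot:
        instruction = "Show your thoughts. " + instruction
    return f"Intents:\n{intent_block}\nUser text: {user_text}\n{instruction}"
```

Running the same pipeline over the four flag combinations yields the full-system, without-CoT, without-descriptions, and without-both conditions of the study.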

  27. Without CoT (CLINC-150) Comparable results with Chain of Thought, close to the best 0-shot technique. Confirms the literature: small LLMs do not gain performance from Chain of Thought Prompting. 27

  28. Without CoT (BANKING77) Comparable results with Chain of Thought, close to the best 0-shot technique. Confirms the literature: small LLMs do not gain performance from Chain of Thought Prompting. 28

  29. Without Descriptions (CLINC-150) All 0-shot experiments exhibit large losses Few-shot experiments have less significant losses Could be due to the different in-shot exemplars 29

  30. Without Descriptions (BANKING77) All 0-shot experiments exhibit large losses 1-shot also exhibits a large loss, only slightly improving on Deep Breath without descriptions. 2-shot & 3-shot have less significant performance drops. 30

  31. Without both (CLINC-150) Accuracy without Chain of Thought & Descriptions > Accuracy without Descriptions. Confirms the literature that small models do not exhibit performance gains from Chain of Thought Prompting. 31

  32. Without both (BANKING77) Accuracy without Chain of Thought & Descriptions > Accuracy without Descriptions. Surpasses 1-shot, close to 2-shot and 3-shot without descriptions. Confirms the literature that small models do not exhibit performance gains from Chain of Thought Prompting. 32

  33. Outline 1. Introduction 2. Datasets 3. Prompting Pipeline 4. Initial Results 5. Few-Shot Prompting 6. Ablation Study 7. Error Analysis 8. Conclusions & Future Work 33

  34. CLINC-150 3-shot Show your Thoughts errors: 60%: Correct CoT, Wrong Classification; 23.3%: Wrong CoT, Wrong Classification; 16.7%: Similar actual & predicted intents. 34

  35. BANKING77 3-shot Deep Breath errors: 36.7%: Correct CoT, Wrong Classification; 40%: Wrong CoT, Wrong Classification; 16.7%: Wrong initial labels (text: My card is just not working at this time, label: virtual_card_not_working, predicted label: card_not_working); 3.3%: Not entirely accurate but not entirely incorrect initial label; 3.3%: Similar actual & predicted intents. 35
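Percentages like the ones on these two slides follow directly from hand-labelled error categories. The sketch below shows that tally; the category labels and the assumption of 30 sampled errors (which makes 23.3% equal 7/30 and 16.7% equal 5/30) are illustrative.

```python
# Sketch of turning hand-labelled error categories into percentages
# (hypothetical helper; labels and sample size are illustrative).
from collections import Counter

def error_breakdown(labels: list) -> dict:
    counts = Counter(labels)
    total = len(labels)
    # Percentage of errors per category, rounded to one decimal as on the slides.
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}
```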

  36. Outline 1. Introduction 2. Datasets 3. Prompting Pipeline 4. Initial Results 5. Few-Shot Prompting 6. Ablation Study 7. Error Analysis 8. Conclusions & Future Work 36

  37. Conclusions & Future Work Chain of Thought Prompting is utilized on intent classification tasks. Show your Thoughts proves to be a great alternative to other CoT phrases. The top 5 possible intents along with intent descriptions help smaller models' performance. We managed to reverse the finding that models with fewer than 100B parameters do not benefit from Chain of Thought Prompting: our Llama2-13B performed better with CoT than without it. Future Work: utilizing the whole test datasets, more prompting techniques, bigger models. 37
