Chain of Thought Prompting for Few-Shot Intent Classification using Large Language Models

Dimitrios Koutsianos
M.Sc. in Data Science
Department of Informatics

Supervisor: Ion Androutsopoulos
Omilia Supervisors: Themos Stafylakis, Panagiotis Tassias
Outline

1. Introduction
2. Datasets
3. Prompting Pipeline
4. Initial Results
5. Few-Shot Prompting
6. Ablation Study
7. Error Analysis
8. Conclusions & Future Work
Intent Classification

A user sends texts to a task-oriented dialog system. An LSTM-based classifier, trained on many examples per intent, predicts the user’s intent, and a different system is built per client.
Chain of Thought Prompting

“A series of intermediate reasoning steps.”
Better performance than standard prompting
Explainable results
Needs hand-crafted in-shot exemplars
High variance in results
LLMs with fewer than 100B parameters do not exhibit performance gains
0-shot Chain of Thought Prompting

Special phrases concatenated at the end of the prompt
Similar performance to few-shot CoT
Less variance in results
Explainable results
Requires two passes through the LLM: the first for the CoT, the second for the result
LLMs with fewer than 100B parameters do not exhibit performance gains
Chain of Thought Prompting for Intent Classification

Intent classification could benefit from CoT prompting:
A universal system for multiple clients
Interpretable classification results
Lower cost and easier implementation for Omilia
Better understanding of a user’s intent
 
Datasets

CLINC-150
10 different domains (Banking, Work, Travel, etc.)
Created for out-of-scope detection
150 intent classes + 1 oos class

BANKING77
77 fine-grained intent classes
All from the banking domain
Texts more closely resemble real-life data
Datasets: Preprocessing

CLINC-150 test set: 30 texts per intent; BANKING77 test set: 40 texts per intent
Keeping 5 texts per intent leaves 750 CLINC-150 and 385 BANKING77 test texts
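The test-set subsampling (keeping 5 texts per intent) can be sketched as below; a minimal illustration, where the (text, intent) pair layout and function name are assumptions, not the thesis code:

```python
import random
from collections import defaultdict

def subsample_per_intent(examples, k=5, seed=0):
    """Keep at most k test texts per intent, as in the preprocessing step."""
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append(text)
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return [
        (text, intent)
        for intent, texts in by_intent.items()
        for text in rng.sample(texts, min(k, len(texts)))
    ]

# Toy check: 3 intents with 10 texts each -> 3 * 5 = 15 kept texts.
toy = [(f"text {i} for {intent}", intent)
       for intent in ("a", "b", "c") for i in range(10)]
print(len(subsample_per_intent(toy)))  # 15
```

With k=5 this reproduces the counts above: 150 × 5 = 750 texts for CLINC-150 and 77 × 5 = 385 for BANKING77.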
Prompting Pipeline

(Pipeline diagram in the original slides.)
Prompting Techniques

CoT-inciting phrases:
“Show your Thoughts”
“Let’s Take a Deep Breath and Work on this Step by Step”
“Let’s Think Step by Step”

“Let’s Think” originally required going through the LLM twice, once for the CoT and once for the result. We changed it to save time and resources: it now produces the CoT and the result at the same time.
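The single-pass change can be sketched as follows; the three phrases come from the slides, while the function names and the stub model are illustrative assumptions, not a real LLM API:

```python
# Sketch of the "Let's Think" modification: two passes vs. a single pass.
COT_PHRASES = {
    "show_thoughts": "Show your thoughts",
    "deep_breath": "Let's take a deep breath and work on this step by step",
    "lets_think": "Let's think step by step",
}

def two_pass(generate, prompt, phrase):
    """Original 0-shot CoT recipe: the first call elicits the reasoning,
    a second call turns the reasoning into the final answer."""
    cot = generate(f"{prompt}\n{phrase}.")
    answer = generate(f"{prompt}\n{phrase}.\n{cot}\nTherefore, the intent is")
    return cot, answer

def single_pass(generate, prompt, phrase):
    """Modified recipe: one call produces the CoT and the intent together."""
    return generate(f"{prompt}\n{phrase} and include the predicted intent "
                    "in your answer.")

def stub_generate(text):
    """Placeholder model so the sketch runs without an actual LLM."""
    return f"[model output for a {len(text)}-character prompt]"

print(single_pass(stub_generate, "What did the user want?",
                  COT_PHRASES["show_thoughts"]))
```

The single-pass variant halves the number of LLM calls per test text, which is the time and resource saving mentioned above.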
Prompt Example
We have the following set of intents along with their descriptions:

* schedule_maintenance: The intent "schedule_maintenance" involves seeking help or information regarding the arrangement of upcoming maintenance activities for a car.
* gas_type: The intent "gas_type" involves seeking information about the specific type or grade of fuel required for a vehicle or a related inquiry about available fuel options.
* oil_change_when: The intent "oil_change_when" involves seeking information or recommendations regarding the appropriate timing or intervals for performing an oil change in a vehicle, considering factors such as mileage, driving conditions, and the specific requirements of the vehicle manufacturer.
* oil_change_how: The intent "oil_change_how" pertains to inquiries seeking guidance or instructions on the process of performing an oil change for a vehicle, including steps and recommended tools.
* shopping_list: The intent "shopping_list" involves requests or actions related to creating, managing, or obtaining information about a list of items to be purchased during a shopping activity, whether it’s in-store or online.

A user wrote the following text: '''put together a list of instructions for me on how to change the oil in my car'''.
The intent of this text is definitely one from the five intents in the previous set.
What was the intent of the user when they wrote this text? Show your thoughts, answer in a single sentence, do not speculate and for your answer include the intent as written in the previous set, exactly as it is written there.
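A prompt like the one above can be assembled programmatically; in this minimal sketch the template text follows the example, but the function and variable names are illustrative assumptions:

```python
PROMPT_TEMPLATE = (
    "We have the following set of intents along with their descriptions:\n\n"
    "{descriptions}\n\n"
    "A user wrote the following text: '''{text}'''.\n"
    "The intent of this text is definitely one from the five intents in the "
    "previous set.\n"
    "What was the intent of the user when they wrote this text? "
    "{cot_phrase}, answer in a single sentence, do not speculate and for "
    "your answer include the intent as written in the previous set, exactly "
    "as it is written there."
)

def build_prompt(text, top5, cot_phrase="Show your thoughts"):
    """`top5` maps each of the 5 candidate intents to its description."""
    descriptions = "\n".join(
        f"* {intent}: {desc}" for intent, desc in top5.items()
    )
    return PROMPT_TEMPLATE.format(
        descriptions=descriptions, text=text, cot_phrase=cot_phrase
    )

# Shortened descriptions for illustration only.
top5 = {
    "oil_change_how": "guidance on the process of performing an oil change",
    "shopping_list": "requests related to a list of items to be purchased",
}
print(build_prompt("put together a list of instructions for me on how to "
                   "change the oil in my car", top5))
```

Swapping `cot_phrase` switches between the three CoT-inciting techniques without touching the rest of the prompt.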
Initial Results

Metric: accuracy. Intents are extracted by hand from each response; a spelling variant (e.g. topping_up_by_card for top_up_by_card) counts as correct, while a complete change (e.g. what_is_your_age for how_old_are_you) counts as wrong. Each technique is run three times and the three accuracies are averaged.
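The scoring step can be approximated in code; note that the thesis extracts intents from the responses by hand, so this is only an illustrative scorer over made-up run data:

```python
def run_accuracy(pairs):
    """Accuracy for one run over (extracted_intent, gold_intent) pairs.
    Spelling variants are assumed to have been normalized during the manual
    extraction step, so a plain exact match suffices here."""
    return sum(extracted == gold for extracted, gold in pairs) / len(pairs)

def average_accuracy(runs):
    """Average accuracy over the three runs reported per technique."""
    return sum(run_accuracy(run) for run in runs) / len(runs)

# Made-up data: three tiny runs of two predictions each.
runs = [
    [("top_up_by_card", "top_up_by_card"), ("oos", "transfer")],       # 0.5
    [("gas_type", "gas_type"), ("oil_change_how", "oil_change_how")],  # 1.0
    [("oos", "oos"), ("what_is_your_age", "how_old_are_you")],         # 0.5
]
print(average_accuracy(runs))  # 2/3
```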
CLINC-150
 
High performance from all three techniques
“Show your Thoughts” performs the best among the three
“Deep Breath” comes second
“Let’s Think” close behind in third place
BANKING77

Average performance from all three techniques
“Deep Breath” is now the best performer
“Show your Thoughts” close behind
“Let’s Think” performs the worst
Our Approach

Every in-shot exemplar features:
The top 5 possible intents
Their descriptions
A CoT-inciting phrase
A response with the Chain of Thought before the predicted intent

The best 0-shot technique is carried over to few-shot:
CLINC-150: “Show your Thoughts”
BANKING77: “Deep Breath”
In-shot exemplars are extracted with Llama
1-shot, 2-shot & 3-shot variants are tested
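The exemplar layout above can be sketched as follows; the dict keys, function names, and shortened descriptions are illustrative assumptions, and in the actual system the exemplar responses are extracted with Llama:

```python
def format_exemplar(exemplar, cot_phrase):
    """One in-shot exemplar: top-5 intents + descriptions + CoT-inciting
    phrase + a response whose Chain of Thought precedes the intent."""
    descriptions = "\n".join(
        f"* {intent}: {desc}" for intent, desc in exemplar["top5"].items()
    )
    return (
        "We have the following set of intents along with their descriptions:\n"
        f"{descriptions}\n"
        f"A user wrote the following text: '''{exemplar['text']}'''.\n"
        f"What was the intent of the user? {cot_phrase}.\n"
        f"response: {exemplar['response']}"
    )

def build_few_shot_prompt(exemplars, query, cot_phrase):
    """k-shot prompt: k exemplars followed by the unanswered query prompt."""
    shots = "\n\n".join(format_exemplar(e, cot_phrase) for e in exemplars)
    return f"{shots}\n\n{query}"

exemplar = {
    "text": "give me the date that i last got my car checked",
    "top5": {"last_maintenance": "seeking the date of the last car service"},
    "response": 'the user wants to know when their car was last serviced, '
                'which aligns with the "last_maintenance" intent.',
}
prompt = build_few_shot_prompt([exemplar], "A user wrote: ...",
                               "Show your thoughts")
print("last_maintenance" in prompt)  # True
```

Because each exemplar response shows the reasoning before the intent, the model is nudged to produce its own Chain of Thought before committing to a label.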
CLINC-150 Results

3-shot achieves the best performance
2-shot close behind
1-shot performs worse than every other technique, 0-shot or few-shot

Example:
text: give me the date that i last got my car checked
response: based on the given text, the intent of the user when they wrote "give me the date that i last got my car checked" is to seek information about the last time their car was serviced or maintained, which aligns with the "last_maintenance" intent from the provided set of intents.

Accurate & eloquent Chain of Thought
Correct & interpretable classification
The Chain of Thought precedes the predicted intent, so Llama “thought” before predicting an intent class
BANKING77 Results

Marginal performance boost over 0-shot
3-shot achieves the best performance
2-shot and 1-shot fall behind 0-shot

Example:
text: why doesn’t my balance reflect my transfer
response: based on the text "why doesn’t my balance reflect my transfer", the user is inquiring about the reason why their account balance does not reflect the recent transfer they made, and therefore the intent of the user is "balance_not_updated_after_bank_transfer".

Accurate Chain of Thought, though not as complex as the CLINC-150 one
Correct & interpretable classification
The Chain of Thought precedes the predicted intent, so Llama “thought” before predicting an intent class
Ablation Study

We check the importance of these factors and how the results compare with our full system:
Gradually remove the Chain of Thought and/or the descriptions
0-shot methods remain 0-shot
Few-shot methods use the same prompting phrase as in the original system
New exemplars are generated for the few-shot methods
Without CoT (CLINC-150)

Results comparable to those with the Chain of Thought
Close to the best 0-shot technique
Confirms the literature: small LLMs do not gain performance from Chain of Thought Prompting
Without CoT (BANKING77)

Results comparable to those with the Chain of Thought
Close to the best 0-shot technique
Confirms the literature: small LLMs do not gain performance from Chain of Thought Prompting
Without Descriptions (CLINC-150)

All 0-shot experiments exhibit large losses
Few-shot experiments have less significant losses, which could be due to the different in-shot exemplars
Without Descriptions (BANKING77)

All 0-shot experiments exhibit large losses
1-shot also exhibits a large loss, only slightly improving on “Deep Breath” without descriptions
2-shot & 3-shot have less significant performance drops
Without both (CLINC-150)

Accuracy without the Chain of Thought & descriptions > accuracy without descriptions only
Confirms the literature: small models do not exhibit performance gains from Chain of Thought Prompting
Without both (BANKING77)

Accuracy without the Chain of Thought & descriptions > accuracy without descriptions only
Surpasses 1-shot; close to 2-shot and 3-shot without descriptions
Confirms the literature: small models do not exhibit performance gains from Chain of Thought Prompting
Error Analysis

CLINC-150 (3-shot, “Show your Thoughts” errors):
60%: correct CoT, wrong classification
23.3%: wrong CoT, wrong classification
16.7%: similar actual & predicted intents

BANKING77 (3-shot, “Deep Breath” errors):
36.7%: correct CoT, wrong classification
40%: wrong CoT, wrong classification
16.7%: wrong initial labels
3.3%: initial label not entirely accurate but not entirely incorrect
3.3%: similar actual & predicted intents

Example of a wrong initial label:
text: My card is just not working at this time
label: virtual_card_not_working, predicted label: card_not_working
Conclusions & Future Work

Chain of Thought Prompting was applied to intent classification tasks
“Show your Thoughts” proves to be a strong alternative to other CoT phrases
Providing the top 5 possible intents along with their descriptions helps smaller models’ performance
We managed to reverse the finding that models with fewer than 100B parameters do not benefit from Chain of Thought Prompting: our Llama2-13B performed better with CoT than without

Future Work:
Using the whole test datasets
More prompting techniques
Bigger models

This study explores the use of Chain of Thought Prompting (CoT) for few-shot intent classification using large language models. The approach involves a series of reasoning steps to better understand user intent, leading to improved performance and explainable results compared to traditional prompting methods. The research highlights the benefits of CoT prompting in achieving interpretable classification results and universal system applicability across multiple clients.

  • Intent Classification
  • Chain of Thought Prompting
  • Few-Shot Learning
  • Large Language Models
  • User Intent

Uploaded on Aug 26, 2024


