Enhancing Intent Classification with Chain of Thought Prompting

Chain of Thought Prompting for Few-Shot Intent Classification using Large Language Models

Dimitrios Koutsianos

M.Sc. in Data Science

Department of Informatics

Supervisor: Ion Androutsopoulos

Omilia Supervisors: Themos Stafylakis, Panagiotis Tassias

Outline

1. Introduction
2. Datasets
3. Prompting Pipeline
4. Initial Results
5. Few-Shot Prompting
6. Ablation Study
7. Error Analysis
8. Conclusions & Future Work

Intent Classification

Chain of Thought Prompting

“A series of intermediate reasoning steps”

0-shot Chain of Thought Prompting

Special phrases concatenated at the end of the prompt

Chain of Thought Prompting

Intent Classification could benefit from CoT Prompting


Datasets

● CLINC-150
○ 10 different domains (Banking, Work, Travel, etc.)
○ Created for Out-of-Scope detection
○ 150 intent classes + 1 oos class
● BANKING77
○ 77 fine-grained intent classes
○ All from the Banking domain
○ Texts more closely resemble real-life data


Prompting Pipeline

Prompting Techniques

CoT-inciting Phrases

● “Let’s Think” required going through the LLM twice, once for the CoT and once for the result
● Changed it to save time and resources
● Produces the CoT and the result at the same time

Prompt Example

We have the following set of intents along with their descriptions:
∗ schedule_maintenance: The intent "schedule_maintenance" involves seeking help or information regarding the arrangement of upcoming maintenance activities for a car.
∗ gas_type: The intent "gas_type" involves seeking information about the specific type or grade of fuel required for a vehicle or a related inquiry about available fuel options.
∗ oil_change_when: The intent "oil_change_when" involves seeking information or recommendations regarding the appropriate timing or intervals for performing an oil change in a vehicle, considering factors such as mileage, driving conditions, and the specific requirements of the vehicle manufacturer.
∗ oil_change_how: The intent "oil_change_how" pertains to inquiries seeking guidance or instructions on the process of performing an oil change for a vehicle, including steps and recommended tools.
∗ shopping_list: The intent "shopping_list" involves requests or actions related to creating, managing, or obtaining information about a list of items to be purchased during a shopping activity, whether it’s in-store or online
A user wrote the following text: '''put together a list of instructions for me on how to change the oil in my car'''.
The intent of this text is definitely one from the five intents in the previous set.
What was the intent of the user when they wrote this text?
Show your thoughts, answer in a single sentence, do not speculate and for your answer include the intent as written in the previous set, exactly as it is written there.


Initial Results

Metric

CLINC-150

➔ High performance from all techniques
➔ “Show your Thoughts” performs the best among the three
➔ “Deep Breath” comes second
➔ “Let’s Think” close behind in third place

BANKING77

➔ Average performance from all three techniques
➔ “Deep Breath” is now the best performer
➔ “Show your Thoughts” close behind
➔ “Let’s Think” performs the worst


Our Approach

Every in-shot exemplar features:

● Top 5 possible intents
● Their descriptions
● A CoT-inciting phrase
● A response with Chain of Thought before the predicted intent

Our Approach

● Best 0-shot technique is used in few-shot
○ CLINC-150: Show your Thoughts
○ BANKING77: Deep Breath
● In-shot exemplars are extracted with Llama
● 1-shot, 2-shot & 3-shot techniques

CLINC-150 Results

➔ 3-shot achieves the best performance
➔ 2-shot close behind
➔ 1-shot performs worse than every other technique, 0-shot or few-shot

CLINC-150 Results

text: give me the date that i last got my car checked
response: based on the given text, the intent of the user when they wrote "give me the date that i last got my car checked" is to seek information about the last time their car was serviced or maintained, which aligns with the "last_maintenance" intent from the provided set of intents.

● Accurate & eloquent Chain of Thought
● Correct & interpretable classification
● Chain of Thought precedes the predicted intent → Llama “thought” before predicting an intent class

BANKING77 Results

➔ Marginal performance boost over 0-shot
➔ 3-shot achieves the best performance
➔ 2-shot and 1-shot behind 0-shot

BANKING77 Results

text: why doesn’t my balance reflect my transfer
response: based on the text "why doesn’t my balance reflect my transfer", the user is inquiring about the reason why their account balance does not reflect the recent transfer they made, and therefore the intent of the user is "balance_not_updated_after_bank_transfer".

● Accurate Chain of Thought, not as complex as the CLINC-150 one
● Correct & interpretable classification
● Chain of Thought precedes the predicted intent → Llama “thought” before predicting an intent class


Ablation Study

● Gradually remove Chain of Thought and/or Descriptions
● 0-shot methods remain 0-shot
● Few-shot methods utilize the same prompting phrase as in the original system
● New exemplars for the few-shot methods

Without CoT (CLINC-150)

➔ Results comparable to those with Chain of Thought
➔ Close to the best 0-shot technique
➔ Confirms the literature → small LLMs do not gain performance from Chain of Thought Prompting

Without CoT (BANKING77)

➔ Results comparable to those with Chain of Thought
➔ Close to the best 0-shot technique
➔ Confirms the literature → small LLMs do not gain performance from Chain of Thought Prompting

Without Descriptions (CLINC-150)

● All 0-shot experiments exhibit large losses
● Few-shot experiments have less significant losses
○ Could be due to the different in-shot exemplars

Without Descriptions (BANKING77)

● All 0-shot experiments exhibit large losses
● 1-shot also exhibits a large loss, only slightly improving on “Deep Breath” without descriptions
● 2-shot & 3-shot have less significant performance drops

Without both (CLINC-150)

➔ Accuracy without Chain of Thought & Descriptions > Accuracy without Descriptions
➔ Confirms the literature: small models do not exhibit performance gains from Chain of Thought Prompting

Without both (BANKING77)

➔ Accuracy without Chain of Thought & Descriptions > Accuracy without Descriptions
➔ Surpasses 1-shot, close to 2-shot and 3-shot without descriptions
➔ Confirms the literature: small models do not exhibit performance gains from Chain of Thought Prompting


CLINC-150: 3-shot “Show your Thoughts” errors

BANKING77: 3-shot “Deep Breath” errors

text: My card is just not working at this time
label: virtual_card_not_working
predicted label: card_not_working


Conclusions & Future Work

● Chain of Thought Prompting is applied to Intent Classification tasks
● “Show your Thoughts” proves to be a great alternative to other CoT phrases
● Top 5 possible intents along with intent descriptions help smaller models’ performance
● Managed to reverse the finding that “Models with fewer than 100B parameters do not benefit from Chain of Thought Prompting”
○ Our Llama2-13B performed better with CoT than without it
● Future Work:
○ Utilizing the whole test datasets
○ More prompting techniques
○ Bigger models


This study explores the use of Chain of Thought Prompting (CoT) for few-shot intent classification using large language models. The approach involves a series of reasoning steps to better understand user intent, leading to improved performance and explainable results compared to traditional prompting methods. The research highlights the benefits of CoT prompting in achieving interpretable classification results and universal system applicability across multiple clients.

kohen Follow

Uploaded on Aug 26, 2024 | 0 Views


Presentation Transcript

Chain of Thought Prompting for Few-Shot Intent Classification using Large Language Models
Dimitrios Koutsianos, M.Sc. in Data Science, Department of Informatics
Supervisor: Ion Androutsopoulos
Omilia Supervisors: Themos Stafylakis, Panagiotis Tassias

Outline
1. Introduction
2. Datasets
3. Prompting Pipeline
4. Initial Results
5. Few-Shot Prompting
6. Ablation Study
7. Error Analysis
8. Conclusions & Future Work

Intent Classification
● User, Task-oriented Dialog System
● Texts: many training examples per intent
● Classification: LSTM-based Classifier
● User intent
● Different system per client

Chain of Thought Prompting
● “A series of intermediate reasoning steps”
● Better performance than normal prompting
● Explainable results
● Need for hand-crafted in-shot exemplars
● High variance in results
● LLMs with fewer than 100B parameters do not exhibit performance gains

0-shot Chain of Thought Prompting
● Special phrases concatenated at the end of the prompt
● Similar performance with CoT 0-shot
● Less variance in results
● Explainable results
● Requires two passes through the LLM: the first for the CoT, the second for the result
● LLMs with fewer than 100B parameters do not exhibit performance gains

Chain of Thought Prompting
Intent Classification could benefit from CoT Prompting
● Universal system for multiple clients
● Interpretable classification results
● Less cost/easier implementation for Omilia
● Better understand a user’s intent


Datasets
● CLINC-150
○ 10 different domains (Banking, Work, Travel, etc.)
○ Created for Out-of-Scope detection
○ 150 intent classes + 1 oos class
● BANKING77
○ 77 fine-grained intent classes
○ All from the Banking domain
○ Texts more closely resemble real-life data

Datasets: Preprocessing
● CLINC-150 test set: 30 texts/intent
● BANKING77 test set: 40 texts/intent
● Keeping 5 texts/intent
○ CLINC-150: 750 test texts
○ BANKING77: 385 test texts
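The subsampling step can be sketched as follows. This is a minimal illustration, assuming the 5 texts per intent are drawn at random with a fixed seed; the slides do not say how they were selected. The function name `subsample_per_intent` and the toy data are hypothetical, not the author's code.

```python
import random
from collections import defaultdict


def subsample_per_intent(examples, k=5, seed=0):
    """Keep k randomly chosen texts for each intent label.

    examples: list of (text, intent) pairs.
    Random selection is an assumption; the slides only state "5 texts/intent".
    """
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append(text)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for intent in sorted(by_intent):
        texts = by_intent[intent]
        chosen = rng.sample(texts, min(k, len(texts)))
        subset.extend((text, intent) for text in chosen)
    return subset


# Toy stand-in for a test set with 77 intents and 40 texts per intent.
toy_test_set = [
    (f"text {i} for intent_{j}", f"intent_{j}")
    for j in range(77)
    for i in range(40)
]
subset = subsample_per_intent(toy_test_set, k=5)
print(len(subset))  # 385
```

With 77 intents and 5 texts each, the subset has 385 texts, which matches the BANKING77 figure on the slide.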


Prompting Pipeline

Prompting Techniques
CoT-inciting Phrases
● “Let’s Think Step by Step” (“Let’s Think”)
● “Let’s Take a Deep Breath and Work on this Step by Step” (“Deep Breath”)
● “Show your Thoughts”
● “Let’s Think” required going through the LLM twice, once for the CoT and once for the result
● Changed it to save time and resources
● Produces the CoT and the result at the same time

Prompt Example
We have the following set of intents along with their descriptions:
* schedule_maintenance: The intent "schedule_maintenance" involves seeking help or information regarding the arrangement of upcoming maintenance activities for a car.
* gas_type: The intent "gas_type" involves seeking information about the specific type or grade of fuel required for a vehicle or a related inquiry about available fuel options.
* oil_change_when: The intent "oil_change_when" involves seeking information or recommendations regarding the appropriate timing or intervals for performing an oil change in a vehicle, considering factors such as mileage, driving conditions, and the specific requirements of the vehicle manufacturer.
* oil_change_how: The intent "oil_change_how" pertains to inquiries seeking guidance or instructions on the process of performing an oil change for a vehicle, including steps and recommended tools.
* shopping_list: The intent "shopping_list" involves requests or actions related to creating, managing, or obtaining information about a list of items to be purchased during a shopping activity, whether it’s in-store or online
A user wrote the following text: '''put together a list of instructions for me on how to change the oil in my car'''.
The intent of this text is definitely one from the five intents in the previous set.
What was the intent of the user when they wrote this text?
Show your thoughts, answer in a single sentence, do not speculate and for your answer include the intent as written in the previous set, exactly as it is written there.
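A prompt of this shape could be assembled programmatically along the following lines. This is a sketch only: `build_prompt` is a hypothetical helper, the triple-quote delimiters around the user text are an assumption, and the descriptions in the usage example are shortened stand-ins. How the candidate intents are chosen is not shown here.

```python
def build_prompt(candidates, user_text, cot_phrase="Show your thoughts"):
    """Assemble a 0-shot prompt from (intent, description) pairs and a user text.

    candidates: list of (intent_name, description) pairs, e.g. the top 5.
    cot_phrase: the CoT-inciting phrase placed at the start of the final instruction.
    """
    count_words = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}
    count = count_words.get(len(candidates), str(len(candidates)))

    lines = ["We have the following set of intents along with their descriptions:"]
    lines += [f"* {name}: {description}" for name, description in candidates]
    lines.append(f"A user wrote the following text: '''{user_text}'''.")
    lines.append(
        f"The intent of this text is definitely one from the {count} intents "
        "in the previous set."
    )
    lines.append("What was the intent of the user when they wrote this text?")
    lines.append(
        f"{cot_phrase}, answer in a single sentence, do not speculate and for "
        "your answer include the intent as written in the previous set, "
        "exactly as it is written there."
    )
    return "\n".join(lines)


# Shortened, illustrative descriptions (not the ones used in the study).
candidates = [
    ("oil_change_how", 'The intent "oil_change_how" covers how to change the oil.'),
    ("oil_change_when", 'The intent "oil_change_when" covers when to change the oil.'),
]
prompt = build_prompt(candidates, "how do i change the oil in my car")
print(prompt)
```

Keeping the CoT phrase as a parameter makes it easy to swap between the phrases compared in the deck without touching the rest of the template.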


Initial Results: Metric
● Intent extraction by hand from each response
● $\text{Average Accuracy} = \frac{1}{3}\,(\text{1st run Acc.} + \text{2nd run Acc.} + \text{3rd run Acc.})$
● In dataset: topping_up_by_card; in response: top_up_by_card → Correct (spelling error)
● In dataset: how_old_are_you; in response: what_is_your_age → Wrong (complete change)
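The metric can be written out as a short calculation. The sketch below assumes the intents have already been extracted from the responses by hand and normalised (so a spelling variant such as top_up_by_card has been mapped to its dataset label); the function names and toy labels are illustrative.

```python
def accuracy(gold_labels, extracted_labels):
    """Fraction of texts whose extracted intent equals the gold intent."""
    if len(gold_labels) != len(extracted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(g == e for g, e in zip(gold_labels, extracted_labels))
    return correct / len(gold_labels)


def average_accuracy(gold_labels, runs):
    """Mean accuracy over several runs (three runs on the slide)."""
    return sum(accuracy(gold_labels, run) for run in runs) / len(runs)


gold = ["card_not_working", "gas_type", "oil_change_how", "shopping_list"]
runs = [
    ["card_not_working", "gas_type", "oil_change_how", "shopping_list"],         # 4/4
    ["card_not_working", "gas_type", "oil_change_when", "shopping_list"],        # 3/4
    ["card_not_working", "oil_change_how", "oil_change_when", "shopping_list"],  # 2/4
]
print(average_accuracy(gold, runs))  # 0.75
```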

CLINC-150
➔ High performance from all techniques
➔ “Show your Thoughts” performs the best among the three
➔ “Deep Breath” comes second
➔ “Let’s Think” close behind in third place

BANKING77
➔ Average performance from all three techniques
➔ “Deep Breath” is now the best performer
➔ “Show your Thoughts” close behind
➔ “Let’s Think” performs the worst


Our Approach
Every in-shot exemplar features:
● Top 5 possible intents
● Their descriptions
● A CoT-inciting phrase
● A response with Chain of Thought before the predicted intent

Our Approach
● Best 0-shot technique is used in few-shot
○ CLINC-150: Show your Thoughts
○ BANKING77: Deep Breath
● In-shot exemplars are extracted with Llama
● 1-shot, 2-shot & 3-shot techniques (1-shot exemplar, 2-shot exemplars, 3-shot exemplars)
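One way to picture how the exemplars are combined with the query is the sketch below. It assumes each exemplar is a complete prompt paired with its CoT response, and that exemplars and query are simply concatenated with a "Response:" marker; that marker, the blank-line separator, and the name `build_few_shot_prompt` are assumptions for illustration, since the slides do not show the exact layout.

```python
def build_few_shot_prompt(exemplars, query_prompt):
    """Prepend k solved exemplars to the prompt for the new text.

    exemplars: list of (prompt, response) pairs, where each response gives the
               Chain of Thought before the predicted intent.
    query_prompt: the 0-shot style prompt for the text to classify.
    """
    blocks = [f"{prompt}\nResponse: {response}" for prompt, response in exemplars]
    blocks.append(f"{query_prompt}\nResponse:")  # left open for the model to complete
    return "\n\n".join(blocks)


# Placeholder strings stand in for full prompts and model responses.
exemplars = [
    ("<prompt for exemplar 1>", "<chain of thought, then intent 1>"),
    ("<prompt for exemplar 2>", "<chain of thought, then intent 2>"),
]
few_shot = build_few_shot_prompt(exemplars, "<prompt for the new text>")
print(few_shot)
```

Passing one, two, or three pairs in `exemplars` gives the 1-shot, 2-shot, and 3-shot variants.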

CLINC-150 Results
➔ 3-shot achieves the best performance
➔ 2-shot close behind
➔ 1-shot performs worse than every other technique, 0-shot or few-shot

CLINC-150 Results
text: give me the date that i last got my car checked
response: based on the given text, the intent of the user when they wrote "give me the date that i last got my car checked" is to seek information about the last time their car was serviced or maintained, which aligns with the "last_maintenance" intent from the provided set of intents.
● Accurate & eloquent Chain of Thought
● Correct & interpretable classification
● Chain of Thought precedes the predicted intent → Llama “thought” before predicting an intent class

BANKING77 Results
➔ Marginal performance boost over 0-shot
➔ 3-shot achieves the best performance
➔ 2-shot and 1-shot behind 0-shot

BANKING77 Results
text: why doesn’t my balance reflect my transfer
response: based on the text "why doesn’t my balance reflect my transfer", the user is inquiring about the reason why their account balance does not reflect the recent transfer they made, and therefore the intent of the user is "balance_not_updated_after_bank_transfer".
● Accurate Chain of Thought, not as complex as the CLINC-150 one
● Correct & interpretable classification
● Chain of Thought precedes the predicted intent → Llama “thought” before predicting an intent class


Ablation Study
Check the importance of these factors & how the results compare with our full system.
● Gradually remove Chain of Thought and/or Descriptions
● 0-shot methods remain 0-shot
● Few-shot methods utilize the same prompting phrase as in the original system
● New exemplars for the few-shot methods
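The four ablation settings (full system, without CoT, without descriptions, without both) can be thought of as two switches on the prompt template. The sketch below is an assumption about how such variants might be generated; the slides do not show the ablated prompts, and `build_ablation_prompt` and the short descriptions are hypothetical.

```python
from itertools import product

TAIL = (
    "answer in a single sentence, do not speculate and for your answer include "
    "the intent as written in the previous set, exactly as it is written there."
)


def build_ablation_prompt(candidates, user_text, use_cot, use_descriptions,
                          cot_phrase="Show your thoughts"):
    """Build a prompt with the CoT phrase and/or the intent descriptions removed."""
    header = "We have the following set of intents"
    lines = [header + (" along with their descriptions:" if use_descriptions else ":")]
    for name, description in candidates:
        lines.append(f"* {name}: {description}" if use_descriptions else f"* {name}")
    lines.append(f"A user wrote the following text: '''{user_text}'''.")
    lines.append("What was the intent of the user when they wrote this text?")
    # Without CoT, the instruction starts directly with "Answer in a single sentence".
    lines.append(f"{cot_phrase}, {TAIL}" if use_cot else TAIL[0].upper() + TAIL[1:])
    return "\n".join(lines)


# Illustrative descriptions, not the ones used in the study.
candidates = [
    ("gas_type", "Asking which fuel a vehicle needs."),
    ("oil_change_how", "Asking how to change a vehicle's oil."),
]
variants = {
    (use_cot, use_descriptions): build_ablation_prompt(
        candidates, "what gas does my car take", use_cot, use_descriptions
    )
    for use_cot, use_descriptions in product([True, False], repeat=2)
}
print(variants[(False, False)])
```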

Without CoT (CLINC-150)
➔ Results comparable to those with Chain of Thought
➔ Close to the best 0-shot technique
➔ Confirms the literature → small LLMs do not gain performance from Chain of Thought Prompting

Without CoT (BANKING77)
➔ Results comparable to those with Chain of Thought
➔ Close to the best 0-shot technique
➔ Confirms the literature → small LLMs do not gain performance from Chain of Thought Prompting

Without Descriptions (CLINC-150)
● All 0-shot experiments exhibit large losses
● Few-shot experiments have less significant losses
○ Could be due to the different in-shot exemplars

Without Descriptions (BANKING77)
● All 0-shot experiments exhibit large losses
● 1-shot also exhibits a large loss, only slightly improving on “Deep Breath” without descriptions
● 2-shot & 3-shot have less significant performance drops

Without both (CLINC-150)
➔ Accuracy without Chain of Thought & Descriptions > Accuracy without Descriptions
➔ Confirms the literature: small models do not exhibit performance gains from Chain of Thought Prompting

Without both (BANKING77)
➔ Accuracy without Chain of Thought & Descriptions > Accuracy without Descriptions
➔ Surpasses 1-shot, close to 2-shot and 3-shot without descriptions
➔ Confirms the literature: small models do not exhibit performance gains from Chain of Thought Prompting


CLINC-150: 3-shot “Show your Thoughts” errors
● 60%: Correct CoT, wrong classification
● 23.3%: Wrong CoT, wrong classification
● 16.7%: Similar actual & predicted intents

BANKING77: 3-shot “Deep Breath” errors
● 36.7%: Correct CoT, wrong classification
● 40%: Wrong CoT, wrong classification
● 3.3%: Not entirely accurate but not entirely incorrect initial label
● 16.7%: Wrong initial labels
○ text: My card is just not working at this time
○ label: virtual_card_not_working, predicted label: card_not_working
● 3.3%: Similar actual & predicted intents


Conclusions & Future Work
● Chain of Thought Prompting is applied to Intent Classification tasks
● “Show your Thoughts” proves to be a great alternative to other CoT phrases
● Top 5 possible intents along with intent descriptions help smaller models’ performance
● Managed to reverse the finding that “Models with fewer than 100B parameters do not benefit from Chain of Thought Prompting”
○ Our Llama2-13B performed better with CoT than without it
● Future Work:
○ Utilizing the whole test datasets
○ More prompting techniques
○ Bigger models
