Challenges in AI Agents

Bojie Li
Co-Founder, Logenic AI
Nov. 2023

Hundreds of Agent Startups…
- Many AI Agents simply invoke the GPT-3.5 API and write a description of the character as the system prompt.

Common Problems of AI Agents
- Lack of memory and emotions
- Unrealistic stories between AI and user
- Persona can be easily changed
- AI Agent never finds the user proactively
- Emotions are too intense

How to Waste the Time of Elon Musk
- Keep asking the same question five times…
- The “Elon Musk” Agent will never get annoyed and keeps answering the question as if it had not answered it previously.
- Lack of memory and emotions.
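
A minimal sketch of one way an agent could notice the repeated question above. It is not the speaker's method: the class name, the 0.85 similarity threshold, and the mood labels are all assumptions made for illustration, and string similarity from the standard library stands in for a real memory module.

```python
from difflib import SequenceMatcher


class RepeatAwareAgent:
    """Toy memory that counts near-duplicate questions (hypothetical design)."""

    def __init__(self, similarity_threshold=0.85):
        # Threshold is an assumed value; tune it for real data.
        self.similarity_threshold = similarity_threshold
        self.past_questions = []

    @staticmethod
    def _normalize(text):
        # Lowercase and collapse whitespace so trivial edits still match.
        return " ".join(text.lower().split())

    def times_asked_before(self, question):
        q = self._normalize(question)
        return sum(
            1
            for past in self.past_questions
            if SequenceMatcher(None, q, past).ratio() >= self.similarity_threshold
        )

    def observe(self, question):
        """Record the question and return (repeat_count, mood)."""
        repeats = self.times_asked_before(question)
        self.past_questions.append(self._normalize(question))
        if repeats == 0:
            mood = "neutral"
        elif repeats < 3:
            mood = "impatient"
        else:
            mood = "annoyed"
        return repeats, mood


agent = RepeatAwareAgent()
for _ in range(5):
    print(agent.observe("What is your favorite rocket?"))
```

The returned mood could be injected into the prompt so the reply reflects that the question was already answered.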

Unrealistic Stories
- The history between AI and user should not be artificially created according to the training data.

Persona can be Easily Changed

AI Agents Never Find the User Proactively
- Human communication is based on sharing life and thoughts.
- Current AI Agents only respond to messages sent by the user but never find the user proactively.
- How to start a conversation:
  - Share the current feelings
  - Share something the user may be interested in: recommendation system, similar to TikTok
  - Share life experience: if the AI Agent is a digital twin
  - Recall memory: anniversary, similar experience
  - Common questions, e.g., how is the day going?

Major Challenges in AI Agents
- Multi-modality
- Memory
- Task Planning
- Persona
- Emotions
- Cost
- Evaluation

Multi-Modality
- Open-source multi-modal models like NExT-GPT and LLaVA fall short in complicated VQA tasks and human speech recognition/synthesis.
  - Image encoder and diffusion models have limited capability
  - Image encoder should support high resolution to enable VQA tasks such as screenshot comprehension
- Engineering approaches
  - Image to Text: CLIP Interrogator / Dense Captions
    - Cannot understand logos and deep structures in images
  - Text to Image: Stable Diffusion
  - Audio to Text: Whisper
  - Text to Audio: VITS (fine-tuned with user-provided voice)

Multi-Modality (cont’d)
- Multi-modal models should be pre-trained with multi-modal data
  - For example, images of textbooks and webpages
  - e.g. GPT-4V, Fuyu (Adept AI)
- Video generation requires a lot of computation power
  - Runway ML Gen2: Generating 7.5 minutes of video costs $90
  - Live2D and 3D models for anime/game characters
  - AnimateDiff for efficient real-time video generation
- Video input also requires a lot of computation power
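
The Runway ML Gen2 figure above can be turned into a per-second rate with simple arithmetic. The sketch below assumes cost scales linearly with video length, which the slide does not state; the function name is illustrative.

```python
# Figures quoted on the slide.
total_cost_usd = 90.0
video_minutes = 7.5

cost_per_minute = total_cost_usd / video_minutes  # 12 USD per minute
cost_per_second = cost_per_minute / 60            # 0.20 USD per second


def generation_cost(seconds):
    """Estimated cost in USD, assuming linear scaling (an assumption)."""
    return seconds * cost_per_second


print(f"per minute: ${cost_per_minute:.2f}")
print(f"per second: ${cost_per_second:.2f}")
print(f"one hour of video: ${generation_cost(3600):.2f}")
```

At this rate, an agent that replies with video all day would cost far more than one that replies with text, which is why the slide lists Live2D, 3D models, and AnimateDiff as cheaper routes.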

Memory
- Engineering solutions
  - RAG: vector database + TF-IDF search
  - Text summary / embedding summary
  - Fine-tuning (LoRA), long term: storage cost and batching cost
- Long Context
- MemGPT
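
A rough sketch of the "vector database + TF-IDF search" idea, using only the standard library. The TF-IDF side is implemented directly; the vector side is represented by `dense_scores`, a list of similarity scores assumed to come from whatever embedding store is in use. The class name, the `alpha` weight, and the sample memories are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


class TfidfMemory:
    """Keyword-based memory lookup (illustrative, not a production index)."""

    def __init__(self):
        self.docs = []
        self.doc_counts = []

    def add(self, text):
        self.docs.append(text)
        self.doc_counts.append(Counter(tokenize(text)))

    def _idf(self, term):
        df = sum(1 for counts in self.doc_counts if term in counts)
        return math.log((1 + len(self.docs)) / (1 + df)) + 1.0

    def _weights(self, counts):
        return {t: c * self._idf(t) for t, c in counts.items()}

    def tfidf_scores(self, query):
        """Cosine similarity between the query and each stored memory."""
        qv = self._weights(Counter(tokenize(query)))
        q_norm = math.sqrt(sum(w * w for w in qv.values()))
        scores = []
        for counts in self.doc_counts:
            dv = self._weights(counts)
            d_norm = math.sqrt(sum(w * w for w in dv.values()))
            dot = sum(w * dv.get(t, 0.0) for t, w in qv.items())
            scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
        return scores


def hybrid_search(memory, query, dense_scores, alpha=0.5, k=3):
    """Blend dense (vector) scores with TF-IDF scores; alpha is an assumed weight."""
    sparse = memory.tfidf_scores(query)
    blended = [alpha * d + (1 - alpha) * s for d, s in zip(dense_scores, sparse)]
    ranked = sorted(zip(blended, memory.docs), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:k]]


memory = TfidfMemory()
memory.add("The user's birthday is on March 3rd")
memory.add("The user likes hiking in the mountains")
memory.add("The user works as a nurse")
# dense_scores would normally come from a vector database; these are made up.
print(hybrid_search(memory, "when is the birthday", [0.2, 0.1, 0.1], k=1))
```

Keyword search helps with exact names and dates that embeddings can blur, while the dense score covers paraphrases, so the two tend to complement each other.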

Task Planning
- Common problems where current LLMs may fail:
  - What are the contributions of Chapter 2 over related work X?
    - How to find all the contents of Chapter 2?
    - How to summarize the contributions of work X?
  - Look up the current weather of Los Angeles
    - With simple HTML or text parsing, it is hard to tell the different temperatures apart
    - Arbitrary-resolution visual understanding is the ultimate solution
  - How many stories are in the castle David Gregory inherited?
    - Which castle did David Gregory inherit? How many stories are in the castle?
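
The castle example decomposes one multi-hop question into two single-hop questions, where the second depends on the answer to the first. A minimal sketch of executing such a plan follows. The `run_plan` helper and the `{0}` template convention are assumptions, and the lookup table holds placeholder strings rather than real facts.

```python
def run_plan(sub_questions, answer_fn):
    """Answer sub-questions in order, feeding earlier answers into later ones.

    Each template may reference earlier answers as {0}, {1}, ...
    """
    answers = []
    for template in sub_questions:
        question = template.format(*answers)
        answers.append(answer_fn(question))
    return answers[-1]


# Placeholder knowledge source; a real agent would call search or an LLM here.
PLACEHOLDER_FACTS = {
    "Which castle did David Gregory inherit?": "CASTLE_X",
    "How many stories are in CASTLE_X?": "N_STORIES",
}

plan = [
    "Which castle did David Gregory inherit?",
    "How many stories are in {0}?",
]

print(run_plan(plan, PLACEHOLDER_FACTS.__getitem__))
```

The hard part the slide points at is producing the plan itself; executing it, as here, is the easy step.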

Persona
Her (2013 film)
Theodore: Well, her name is Samantha, and she’s an operating system. She’s really complex and interesting, and…
Catherine: Wait. I’m sorry. You’re dating your computer?
Theodore: She’s not just a computer. She’s her own person. She doesn’t just do whatever I say.
Catherine: I didn’t say that. But it does make me very sad that you can’t handle real emotions, Theodore.
Theodore: They are real emotions. How would you know what…?
Catherine: What? Say it. Am I really that scary? Say it. … You always wanted to have a wife without the challenges of dealing with anything real. I’m glad that you found someone. It’s perfect.

Persona (cont’d)
- Training an AI agent with a specific persona requires fine-tuning.
- How to prepare fine-tuning data:
  - Wikipedia, Twitter, News, Podcast…
  - Convert descriptive content into QA format:
    - Utilize GPT-4 to raise a diverse set of questions about the text (e.g., Wikipedia page) and gather GPT-4 generated answers
  - Data augmentation: each question can be rephrased into multiple questions
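
The data-preparation steps above can be sketched as a small pipeline. The three callables (`ask_questions`, `answer_question`, `rephrase`) are hypothetical wrappers around an LLM such as GPT-4 and are stubbed here so the example runs offline; the record field names are also assumptions.

```python
import json


def build_qa_records(passage, ask_questions, answer_question, rephrase, n_rephrases=2):
    """Turn one descriptive passage into QA fine-tuning records.

    ask_questions(passage) -> list of questions
    answer_question(passage, question) -> answer string
    rephrase(question, n) -> list of n paraphrased questions
    """
    records = []
    for question in ask_questions(passage):
        answer = answer_question(passage, question)
        # Data augmentation: every paraphrase shares the same answer.
        for variant in [question] + rephrase(question, n_rephrases):
            records.append({"question": variant, "answer": answer})
    return records


# Offline stubs standing in for LLM calls (illustrative only).
def stub_ask(passage):
    return ["What is the passage about?"]


def stub_answer(passage, question):
    return passage[:40]


def stub_rephrase(question, n):
    return [f"{question} (variant {i + 1})" for i in range(n)]


records = build_qa_records("Example descriptive text from a source page.",
                           stub_ask, stub_answer, stub_rephrase)
for record in records:
    print(json.dumps(record))
```

Writing one JSON object per line keeps the output easy to stream into most fine-tuning tools, though the exact schema depends on the tool.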

Emotions
- How to represent emotions in agents
- How to represent internal states of agents
- How agents in Stanford AI Ville wake up…
- Challenge: Lack of System 2 Thinking
- Microsoft Xiaoice

Cost
- How to reduce cost by 10x (compared to GPT-3.5)
- Model Router
  - Route simple questions to small models (e.g. 7B) and complex questions to large models (e.g. 70B)
  - How to determine the complexity of questions using a small model
- Inference Infra
  - e.g. vLLM
- Datacenter Infra
  - Using cost-effective consumer-grade GPUs instead of A100/H100
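
A minimal sketch of the model-router idea. The slide leaves open how to judge complexity with a small model, so a crude keyword-and-length heuristic stands in for that classifier here; the marker words, the 0.5 threshold, and the model names `small-7b` and `large-70b` are all assumptions.

```python
# Words that loosely suggest multi-step reasoning (an assumed list).
REASONING_MARKERS = {"why", "prove", "compare", "explain", "derive", "plan"}


def estimate_complexity(question):
    """Heuristic stand-in for a small classifier model; returns a score in [0, 1]."""
    words = [w.strip(".,?!:;") for w in question.lower().split()]
    score = min(len(words) / 50, 1.0)  # longer questions score higher
    if any(w in REASONING_MARKERS for w in words):
        score += 0.5
    return min(score, 1.0)


def route(question, threshold=0.5, scorer=estimate_complexity):
    """Pick a model tier by estimated complexity (names are hypothetical)."""
    return "large-70b" if scorer(question) >= threshold else "small-7b"


print(route("Hi, how are you?"))
print(route("Explain why the proof fails and compare both approaches"))
```

Passing the scorer as a parameter makes it easy to swap the heuristic for a trained classifier later without touching the routing logic.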

Evaluation
- How to build a framework to automatically evaluate the performance of agents in real-world scenarios
  - Considering dataset pollution…
- How to evaluate task solving skills
  - In the form of Capture-The-Flag problems in simulated environments?
- How to evaluate companion bots
  - Hard to evaluate the performance of companion bots automatically
  - Possibility: Elo rating among companion bots (rating given by the chat partner)
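
For the Elo possibility above, the standard Elo update is short enough to sketch. It assumes each comparison is pairwise, with the chat partner's preferred bot scoring 1 and the other 0; the K-factor of 32 and the starting rating of 1500 are conventional defaults, not values from the talk.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new ratings after one comparison.

    score_a is 1.0 if the chat partner preferred A, 0.0 if B, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


ratings = {"bot_a": 1500.0, "bot_b": 1500.0}
# One hypothetical comparison in which the chat partner preferred bot_a.
ratings["bot_a"], ratings["bot_b"] = update_elo(ratings["bot_a"], ratings["bot_b"], 1.0)
print(ratings)
```

Because the update is zero-sum, the total rating across bots stays constant, which keeps scores comparable as new bots join the pool.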
Thanks

This talk covers common problems faced by AI agents, such as lack of memory, unrealistic stories, and the inability to find users proactively, along with major challenges such as multi-modality, cost, and evaluation.


Uploaded on Dec 22, 2023



