CLIP vs. LLaVA on Zero-Shot Classification, by Misaki Matsuura
This study explores the effectiveness of CLIP (contrastive language-image pre-training) and LLaVA (large language-and-vision assistant) on zero-shot classification. CLIP, with 63 million parameters, is trained on image-text pairs from the internet and retrieves textual labels for an image. LLaVA, with 13 billion parameters, combines a vision encoder with an LLM for improved zero-shot performance. On a CIFAR-100 subset, CLIP achieved 65% accuracy while LLaVA achieved 35%; an example image of a snake camouflaged against a brown background illustrates both models' outputs.
CLIP vs. LLaVA on Zero-Shot Classification
Misaki Matsuura
Background
CLIP (contrastive language-image pre-training):
- Trained with (image, text) pairs from the internet
- Can retrieve the right textual label given an image
- 63 million parameters
LLaVA (large language-and-vision assistant):
- Combination of a vision encoder and an LLM (CLIP + LLaMA)
- Instruction tuning improves zero-shot performance
- 13 billion parameters
Hypothesis: we want to show that LLaVA beats CLIP.
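As a concrete illustration of how CLIP does zero-shot classification, here is a minimal sketch using the Hugging Face transformers CLIP implementation. The checkpoint name, the image path ("snake.jpg"), and the label set are illustrative assumptions, not part of the original slides.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Minimal zero-shot classification sketch with Hugging Face CLIP.
# The checkpoint, image path, and label list are illustrative placeholders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("snake.jpg")  # hypothetical input image
labels = ["snake", "turtle", "lizard", "crocodile", "sweet_pepper"]
prompts = [f"a photo of a {label}" for label in labels]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # similarity of the image to each prompt
probs = logits.softmax(dim=-1)[0]

for label, p in sorted(zip(labels, probs.tolist()), key=lambda x: -x[1]):
    print(f"{label}: {p:.2%}")
```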
CLIP and LLaVA in action (CIFAR-100 image)
CLIP top predictions:
- snake: 65.31%
- turtle: 12.29%
- sweet_pepper: 3.83%
- lizard: 1.88%
- crocodile: 1.75%
LLaVA: "The image features a large, green, patterned snake sitting on a brown background. The snake appears to be camouflaged, blending in with its surroundings. It is positioned in the center of the scene, covering a significant portion of the image. The close-up view of the snake emphasizes its intricate pattern and texture, making it an interesting and visually striking creature."
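For comparison, a description like the one above can be obtained from LLaVA roughly as follows. This is a sketch assuming the llava-hf checkpoint on the Hugging Face Hub and its USER/ASSISTANT prompt format; the exact checkpoint name and template may differ across releases.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Sketch of querying LLaVA for an image description; the checkpoint name
# and prompt template are assumptions based on the llava-hf releases.
model_id = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("snake.jpg")  # hypothetical input image
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```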
CIFAR-100 Subset Results
Accuracy: CLIP 65%, LLaVA 35%
Notes:
- CLIP was given the 100 classes to choose from; LLaVA was not.
- LLaVA accuracy was determined by exact text match.
- LLaVA was therefore evaluated at a disadvantage.
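A sketch of the exact-text-match scoring described in the notes above. Here ask_llava is a hypothetical helper that returns LLaVA's answer string for one image; the slides do not specify how answers were collected.

```python
# Sketch of the exact-text-match evaluation; ask_llava() is a hypothetical
# helper that returns LLaVA's answer string for one image.
def exact_match_accuracy(samples, class_names, ask_llava):
    """samples: iterable of (image, label_index) pairs from a CIFAR-100 subset."""
    correct = 0
    total = 0
    for image, label_idx in samples:
        answer = ask_llava(image).strip().lower()
        # Count as correct only if the answer exactly matches the class name.
        if answer == class_names[label_idx].lower():
            correct += 1
        total += 1
    return correct / total
```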
Current Work
The user prompts LLaVA: "Give me a one-word label in quotation marks for the foreground object in this image." LLaVA answers, for example, "I would say this is a flower." The extracted one-word label is embedded as a vector (e_o) and matched against a text gallery of embeddings (e_l1, e_l2, e_l3, ...) of the possible classes from ImageNet.
Accuracy so far:
          ImageNet   SUN
CLIP      76.2       58.5
LLaVA     ?          ?
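One way to implement the gallery matching sketched above: embed LLaVA's one-word label with CLIP's text encoder and pick the nearest class embedding by cosine similarity. The checkpoint and the tiny gallery below are placeholders; the slides only specify that the gallery comes from ImageNet class names.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Sketch of matching LLaVA's free-form label against a gallery of class
# names via CLIP text embeddings; checkpoint and gallery are placeholders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

gallery = ["flower", "dog", "car"]  # stand-in for the ImageNet class names
answer = "flower"                   # one-word label extracted from LLaVA's reply

inputs = processor(text=gallery + [answer], return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize for cosine similarity

sims = emb[:-1] @ emb[-1]  # similarity of each gallery class to the answer
print(gallery[sims.argmax().item()])
```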
Thank you! Questions?