BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
Retrieval-augmented language models address issues such as hallucination, but encoding many retrieved passages together with the query makes them slow at inference. BTR tackles this with cacheable binary token representations: passage encoding is decomposed from the query and binarized offline, and runtime compression further reduces computation. A calibrated binarization technique prevents large-scale values from collapsing the representations and enables recovery of the representation scales. Evaluation against various baselines on open-domain QA datasets, fact checking, and MMLU reports promising results across task accuracy, inference throughput, and storage efficiency.
Presentation Transcript
BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models. Hanna Hajishirzi, Sewon Min, Yizhong Wang, Qingqing Cao. ICLR 2024 (spotlight)
Retrieval LMs address issues like hallucination
Retrieval-augmented language models are slow
Many input passages lead to long sequences, making reader encoding slow.
[Figure: the retriever encodes the query against a corpus index; the reader runs its encoder over each retrieved passage paired with the query (Passage1 + Query, Passage2 + Query, Passage3 + Query) and a decoder produces the answers.]
Previous solutions precompute passage representations, but require huge storage costs.
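To make the bottleneck concrete, below is a minimal Python sketch of this retrieve-then-read pipeline. The `retriever`, `encoder`, and `decoder` objects and their methods are hypothetical stand-ins for illustration, not the paper's code.

```python
# Hypothetical retriever / encoder / decoder callables; a sketch of the
# standard (Atlas/FiD-style) pipeline on this slide, not BTR's implementation.

def answer(query, retriever, encoder, decoder, corpus_index, top_k=10):
    """Retrieve-then-read: every retrieved passage is re-encoded together
    with the query at inference time."""
    passages = retriever.search(corpus_index, query, top_k=top_k)

    # The bottleneck: top_k full encoder passes over long (passage + query)
    # sequences, repeated from scratch for every incoming query.
    encoded = [encoder(passage + " " + query) for passage in passages]

    # The decoder attends over all encoded passage-query pairs to produce answers.
    return decoder(encoded)
```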
BTR Overview
[Figure: the reader still pairs each passage with the query, but passage encoding is moved offline into cacheable binary token representations via decomposition and calibrated binarization (offline compression), and runtime compression is applied inside the reader before the decoder produces the answers.]
Cacheable binary passage token representations
Decompose query and passage encoding in the lower layers (layers 1 to k), following DeFormer (Q. Cao et al., ACL 2020); the passage-only representations are precomputed offline, binarized, and kept in a key-value store of binary vectors and variance. At runtime, the binary passage tokens are looked up from the store and only the upper transformer blocks (layers k+1 to n) run jointly over the passage and the query.
[Figure: encoder with layers 1..k encoding passages and query separately (offline compression), binarization at layer k producing codes such as 010101110100001, and runtime look-up feeding the transformer blocks of layers k+1..n.]
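A minimal sketch of this offline/runtime split, assuming `lower_layers` and `upper_layers` stand in for the first k and the remaining n-k transformer layers; the storage format and scale recovery here are simplified assumptions, not BTR's exact implementation.

```python
import numpy as np

def precompute_passage(passage_tokens, lower_layers, store, passage_id):
    """Offline: encode the passage alone through the lower layers,
    binarize the token representations, and cache bits plus variance."""
    h = lower_layers(passage_tokens)            # (seq_len, hidden) float array
    variance = h.var(axis=-1, keepdims=True)    # saved so scales can be recovered
    bits = np.packbits(h > 0, axis=-1)          # 1 bit per hidden dim (hidden % 8 == 0 assumed)
    store[passage_id] = (bits, variance)

def encode_at_runtime(query_tokens, passage_id, lower_layers, upper_layers, store):
    """Runtime: look up the cached binary passage tokens, rescale them,
    and run only the upper layers over the passage-query pair."""
    bits, variance = store[passage_id]
    signs = np.unpackbits(bits, axis=-1).astype(np.float32) * 2 - 1  # back to {-1, +1}
    passage_h = signs * np.sqrt(variance)       # approximate scale recovery (assumption)
    query_h = lower_layers(query_tokens)        # query encoded independently in lower layers
    return upper_layers(np.concatenate([passage_h, query_h], axis=0))
```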
Calibrated binarization
Direct binarization uses a hashing function: sign maps each value to {-1, +1}, and the differentiable tanh is used to approximate sign during training. This fails because large-scale values collapse tanh into -1 and +1 too quickly.
Calibrated binarization fixes this in two steps: 1. apply binarization after the LayerNorm inside the transformer encoder block; 2. save the representation variance to recover the representation scales.
[Figure: direct vs. calibrated binarization inside the encoder, with binarization applied at layer k producing binary codes such as 010101110100001.]
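The sketch below illustrates calibrated binarization as described on this slide: the tanh surrogate during training, hard sign codes at caching time, and a saved variance used to recover scales. Function names, the beta parameter, and the exact placement of the variance are assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def calibrated_binarize(h, beta=1.0, training=True):
    """Sketch of calibrated binarization.
    h: pre-LayerNorm token representations, shape (seq_len, hidden)."""
    # Save the representation variance so scales can be recovered later (step 2).
    variance = h.var(dim=-1, keepdim=True)

    # Step 1: binarize *after* LayerNorm, so large-scale values no longer
    # push tanh straight into -1/+1 during training.
    h_norm = F.layer_norm(h, h.shape[-1:])

    if training:
        # Differentiable surrogate: tanh(beta * x) approaches sign(x) as beta grows.
        codes = torch.tanh(beta * h_norm)
    else:
        # Hard {-1, +1} codes, the form that gets cached offline.
        codes = torch.sign(h_norm)

    # Step 2: recover representation scales with the saved variance (simplified assumption).
    return codes * variance.sqrt(), variance
```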
Evaluation setup
Tasks: 3 open-domain QA datasets, fact checking, and MMLU
Metrics: task accuracy, inference throughput (queries per second, QPS), and storage (GB)
Baselines: Atlas, Atlas-Q, DensePhrase, DeFormer, LLaMA2-7b
BTR is more efficient and accurate
NaturalQuestions results: BTR achieves over 3x speedup compared to baselines (Atlas) and a >100x reduction in storage (compared to DeFormer).
For larger models, BTR can achieve a >4x speedup; see more results in the paper.
BTR takeaways and impact
We design BTR, cacheable and calibrated binary token representations, to improve the inference efficiency of retrieval LMs.
BTR achieves a 3-4x inference speedup and reduces storage by over 100x while maintaining task performance.
Binary representations could potentially supercharge on-device LMs, especially for personalized tasks that require storing local knowledge.