Understanding Multi-Head Attention Layers in Transformers
Sitan Chen (Harvard) presents joint work with Yuanzhi Li on the provable learnability of a multi-head attention layer in transformers. The talk reviews the transformer architecture, highlighting the gap between its practical success and our theoretical understanding, before covering preliminaries and prior work.
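For context, below is a minimal sketch of a standard multi-head attention layer, the object whose learnability the talk studies. It is a generic scaled dot-product formulation; the weight names, shapes, and random inputs are illustrative assumptions, not the specific parameterization analyzed in the work.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """One multi-head attention layer applied to a single sequence.

    X:             (seq_len, d_model) input token embeddings
    W_Q, W_K, W_V: (num_heads, d_model, d_head) per-head projections
    W_O:           (num_heads * d_head, d_model) output projection
    """
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv            # (seq_len, d_head) each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])     # scaled dot-product scores
        attn = softmax(scores, axis=-1)             # attention weights per query
        heads.append(attn @ V)                      # per-head output
    return np.concatenate(heads, axis=-1) @ W_O     # concatenate heads, project back

# Tiny usage example with random (hypothetical) weights
rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_head = 5, 16, 4, 4
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V = (rng.standard_normal((num_heads, d_model, d_head)) for _ in range(3))
W_O = rng.standard_normal((num_heads * d_head, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```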