Research · 3 min read
ByteDance GenLIP framework enables Vision Transformers to predict language tokens with 5x data efficiency
Tags: Research · AI
alphaXiv · GitHub (GenLIP code)

ByteDance and Beijing Jiaotong University researchers introduced GenLIP, a generative pre-training framework that enables Vision Transformers to directly predict language tokens from visual inputs. The framework achieves competitive or superior performance on 14 multimodal benchmarks using 8 billion pre-training samples, outperforming baselines trained on up to 40 billion samples, a 5x improvement in data efficiency. The minimalist approach unifies visual and textual representations through generative modeling. Code is available on GitHub at YanFangCS/GenLIP. The paper was featured on alphaXiv as a notable contribution from May 1, 2026.
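To make the core idea concrete, here is a minimal sketch of what "a Vision Transformer predicting language tokens from visual inputs" can look like in PyTorch. Everything below is an illustrative assumption, not GenLIP's actual architecture or code: the class name `VisionToTokenModel`, the dimensions, the placeholder encoder, and the plain cross-entropy training objective are all hypothetical stand-ins for the general technique the article describes.

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not GenLIP's implementation): a ViT-style encoder whose
# patch features are projected directly into a language-token vocabulary and
# trained with a standard cross-entropy objective against text tokens.

class VisionToTokenModel(nn.Module):
    def __init__(self, vit_dim=768, vocab_size=32000):
        super().__init__()
        # Placeholder patch embedding over flattened 16x16 RGB patches;
        # any standard ViT backbone could stand in here.
        self.patch_embed = nn.Linear(3 * 16 * 16, vit_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=vit_dim, nhead=12, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
        # Head mapping visual features directly to language-token logits.
        self.to_vocab = nn.Linear(vit_dim, vocab_size)

    def forward(self, patches):
        # patches: (batch, num_patches, 3*16*16) flattened image patches
        x = self.patch_embed(patches)
        x = self.encoder(x)
        return self.to_vocab(x)  # (batch, num_patches, vocab_size) logits


# Usage: train against caption tokens paired with the image (illustrative).
model = VisionToTokenModel()
patches = torch.randn(2, 196, 3 * 16 * 16)
target_tokens = torch.randint(0, 32000, (2, 196))
logits = model(patches)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1)
)
loss.backward()
```

The point of the sketch is the unification the article highlights: the vision encoder's output space is the language vocabulary itself, so visual and textual representations are trained under one generative objective rather than through a separate contrastive alignment stage.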