Can large language models learn to “speak” image and video file formats? A new study explores an intriguing approach to visual generation using LLMs to model canonical codecs like JPEG and H.264.
In this post, we’ll dive deep into how researchers trained 7B-parameter models to generate images and videos by outputting compressed file bytes, and how this approach surprisingly outperforms specialized vision models on several benchmarks.
If you liked this post, follow me on Twitter or check out AImodels.fyi to stay up to date with all the latest AI research.
Overview
“Figure 2: Generated images by JPEG-LM and baselines with partial images as prompts. We show three random samples from JPEG-LM and one from VQ transformer and ImageGPT (with superresolution). The original images for the prompts are independently sourced outside existing training sets. We observe that JPEG-LM can generate realistic facial expressions, landscape, common objects, texts in image forms, etc.”
A new paper introduces JPEG-LM and AVC-LM, large language models trained to generate images and videos respectively by directly outputting compressed file bytes in JPEG and AVC/H.264 formats. The key contributions are:
Demonstrating that standard LLM architectures can effectively learn to model and generate canonical visual file encodings without any vision-specific modifications.
Showing this approach outperforms pixel-based and vector quantization baselines on image generation tasks across multiple datasets.
Providing a simple, flexible framework that bridges language and visual generation using a unified model architecture.
Revealing that this method is particularly effective at generating long-tail visual elements compared to other approaches.
The potential impact is significant — it suggests a path toward multimodal AI systems that can seamlessly work with text, images, and video using a common underlying architecture and training approach. This could accelerate progress in areas like multimodal reasoning, visual storytelling, and video generation.
Plain English Explanation
The core idea here is to treat image and video generation as a language modeling task, where the “language” is the byte sequence of compressed file formats like JPEG and H.264.
Imagine if you could teach a large language model to “speak JPEG” — to understand the grammar and structure of JPEG files so well that it could generate valid, coherent JPEG byte sequences from scratch. That’s essentially what JPEG-LM does.
The key insight is that common compression formats like JPEG already do a good job of encoding visual information efficiently. By learning to model these formats directly, the LLM can implicitly learn about visual concepts and structure without needing specialized neural network architectures for vision.
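To make that concrete, here is a minimal sketch (mine, not the authors' pipeline) of what "speaking JPEG" means in practice: an image is serialized to JPEG, and each byte of the resulting file simply becomes a token the language model has to predict. The file path and the byte-to-token mapping are illustrative assumptions.

```python
from io import BytesIO
from PIL import Image

def jpeg_byte_tokens(path, quality=25):
    """Serialize an image as JPEG and return its bytes as a token sequence.

    Each byte (0-255) is treated as one token, so the "vocabulary" the model
    needs is essentially just the 256 possible byte values plus any special
    tokens. This is an illustrative sketch, not the paper's exact preprocessing.
    """
    img = Image.open(path).convert("RGB").resize((256, 256))
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)  # low quality keeps the file (and sequence) short
    return list(buf.getvalue())                    # e.g. [255, 216, 255, 224, ...]

tokens = jpeg_byte_tokens("example.jpg")           # "example.jpg" is a hypothetical input
print(len(tokens), tokens[:4])                     # JPEG files always open with the FF D8 marker
```

A sequence like this for a small, heavily compressed image runs to a few thousand tokens, which is what makes ordinary LLM context windows workable.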
Conceptually, it’s a bit like learning a foreign language purely by reading and writing coded messages, without ever seeing what they decode to. With enough practice, you internalize the patterns and structure anyway (Imitation Game, anyone?), until eventually you can compose a valid transmission yourself.
This approach has some nice properties:
It’s simpler than other methods that require special vision-specific neural network components
It allows using the exact same model architecture for text, images, and video
The compression keeps sequence lengths manageable compared to modeling raw pixels (see the rough arithmetic after this list)
It seems to be particularly good at capturing rare or unusual visual elements
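To put the sequence-length point in perspective, here is the rough arithmetic, using the token counts reported later in this post (roughly 5K JPEG tokens per 256x256 image, versus around 1K tokens for typical VQ approaches):

```python
# Back-of-the-envelope sequence lengths for a single 256x256 RGB image.
raw_values = 256 * 256 * 3   # 196,608 raw pixel values to model directly
jpeg_tokens = 5_000          # ~5K byte tokens after JPEG compression (figure from the paper)
vq_tokens = 1_000            # ~1K tokens for a typical VQ tokenizer (for comparison)

print(f"JPEG is ~{raw_values // jpeg_tokens}x shorter than raw pixels")  # ~39x
print(f"...but ~{jpeg_tokens // vq_tokens}x longer than VQ sequences")   # ~5x
```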
The results suggest this surprisingly simple approach can match or exceed more complex methods, opening up exciting possibilities for unified multimodal AI systems.
Technical explanation
The authors train two 7B parameter Llama-2 models: JPEG-LM and AVC-LM. JPEG-LM is trained on 23M 256x256 images encoded as JPEG with quality factor 25 and “4:2:0” subsampling, resulting in 114B JPEG tokens for each epoch, with an average of 5K tokens per image. AVC-LM, on the other hand, is trained on 2M 256x144 videos (15 frames each at 3 fps) encoded with H.264/AVC using a constant quantization parameter of 37, producing 42B AVC tokens with an average of 15K tokens per video.
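The JPEG side looks like the byte-tokenization sketch earlier (quality factor 25 with 4:2:0 subsampling); the video side can be reproduced with off-the-shelf ffmpeg. The flags and file paths below are my best-guess mapping of the settings described above, not the authors' released pipeline:

```python
import subprocess

# Re-encode a source clip with the AVC settings described in the paper:
# 256x144 resolution, 15 frames at 3 fps, constant quantization parameter 37.
# "clip.mp4" is a hypothetical input path.
subprocess.run([
    "ffmpeg", "-i", "clip.mp4",
    "-vf", "scale=256:144",          # downscale to 256x144
    "-r", "3",                       # 3 frames per second
    "-frames:v", "15",               # keep 15 frames (~5 seconds)
    "-c:v", "libx264", "-qp", "37",  # H.264/AVC at constant QP 37
    "clip_avc.mp4",
], check=True)

# The bytes of the resulting file are what AVC-LM models, averaging ~15K tokens per video.
```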
Both models use a standard autoregressive language modeling objective. Notably, the models employ no vision-specific architecture modifications, such as convolutions or 2D positional embeddings.
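In other words, training uses exactly the next-token objective from text modeling. A minimal PyTorch sketch of that loss, assuming any decoder-only model that maps token ids to logits:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard next-token cross-entropy, identical whether the tokens are text or JPEG/AVC bytes.

    logits: (batch, seq_len, vocab_size) output of a decoder-only transformer
    tokens: (batch, seq_len) token ids of the compressed file
    """
    # Predict token t+1 from everything up to and including token t.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```

Nothing in this loss knows the tokens encode an image; whatever spatial structure the model picks up, it picks up from the statistics of the byte stream alone.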
For evaluation, the authors use a zero-shot image completion task on ImageNet-1K (5000 samples) and FFHQ (1000 samples) datasets, using FID score as the primary metric. They compare their models against several baselines, including a VQ-VAE + Transformer trained on 200M images, ImageGPT trained on 14M 32x32 images, and diffusion models like Stable Diffusion and VQ-Diffusion.
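The completion protocol is easy to picture: truncate an image's token sequence at the given ratio, let the model continue it, and decode the result back into pixels before scoring FID. The `model.generate` call below is a hypothetical stand-in for whatever sampling loop the model exposes, not an API from the paper:

```python
from io import BytesIO
from PIL import Image

def complete_image(model, jpeg_tokens, prompt_ratio=0.5):
    """Zero-shot image completion: keep the first part of the JPEG byte
    sequence as a prompt and let the model generate the rest.

    `model.generate` is a hypothetical sampling interface; any autoregressive
    decoder that continues a token sequence would do.
    """
    cut = int(len(jpeg_tokens) * prompt_ratio)
    prompt = jpeg_tokens[:cut]
    completion = model.generate(prompt)              # sample the remaining byte tokens
    data = bytes(prompt + completion)                # stitch into one JPEG file
    return Image.open(BytesIO(data)).convert("RGB")  # decode; FID is computed over these outputs
```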
The results are solid:
On ImageNet-1K, JPEG-LM achieved FID scores of 272.12, 123.09, and 34.21 for partial image ratios of 0.25, 0.5, and 0.75 respectively, outperforming all baselines.
Similarly impressive results were seen on FFHQ, with FID scores of 36.15, 31.22, and 27.15 for partial image ratios of 0.375, 0.4375, and 0.5.
In unconditional generation, JPEG-LM achieved an FID of 121.35 compared to 155.51 for the VQ transformer.
Perhaps most intriguingly, analysis showed a statistically significant correlation between JPEG-LM’s performance advantage and the rarity of image classes, suggesting better handling of long-tail visual elements.
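The paper reports this as a statistically significant correlation. As a sketch of how one could check such a relationship (the inputs here are hypothetical: a per-class quality gap versus a baseline, and some measure of class frequency):

```python
from scipy.stats import spearmanr

def rarity_correlation(per_class_advantage, per_class_frequency):
    """Correlate JPEG-LM's per-class advantage with how common each class is.

    Both arguments are hypothetical lists with one value per image class:
    the quality gap versus a baseline, and a frequency estimate for the class.
    A significant negative rank correlation would indicate the advantage
    grows as classes get rarer, i.e. better long-tail handling.
    """
    rho, p_value = spearmanr(per_class_advantage, per_class_frequency)
    return rho, p_value
```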
The authors hypothesize that the non-neural, fixed JPEG compression preserves important visual details better than learned VQ encodings, especially for less common visual elements.
Critical Analysis
The approach presented in this paper has several notable strengths. Its conceptual simplicity and elegance are immediately apparent — by treating image and video generation as a language modeling task over compressed file formats, the authors have created a unified framework for multimodal AI. I think this is further enhanced by the strong empirical results achieved without any vision-specific architectural modifications, suggesting a potential for easier integration of visual and language modeling in future systems.
However, this paper is not without limitations. While far more compact than raw pixel sequences, the JPEG byte sequences used here (~5K tokens per image) are still significantly longer than those in VQ approaches (typically around 1K tokens). This could limit scalability to larger models or datasets. The paper also focuses primarily on generation tasks, leaving open questions about the approach’s efficacy for visual understanding tasks like image classification or visual question answering.
The fixed JPEG quality setting used in the study may limit the model’s ability to handle diverse image types, and there’s currently no clear method for providing text conditioning or other controls over generation. This lack of controllability could be a significant hurdle for certain applications. The video generation results, while promising, are limited in scope (tested on 15-frame sequences) and require more thorough evaluation.
It’s also worth noting the substantial training data requirements — 23M images for JPEG-LM and 2M videos for AVC-LM. It’s unclear how performance might scale with less data, which could be a concern for domains with limited available data.
Moving forward, there are several exciting directions for future research. Exploring different codec choices or hybrid approaches could potentially yield even better results. Investigating the model’s performance on visual understanding tasks would help clarify its potential as a general-purpose vision model. Developing methods for controlled generation (e.g., text-to-image) would greatly enhance the model’s versatility. Scaling up model size and training data, as well as exploring combinations with other multimodal learning approaches, could push the boundaries of what’s possible with this technique. Finally, adaptive compression techniques could potentially help the model handle a wider range of image types and quality levels.
Conclusion
This work presents an intriguing new approach to visual generation using large language models to directly model compressed file formats. By demonstrating strong performance with a conceptually simple method, I think it opens up exciting possibilities for unified multimodal AI systems.
The ability to handle text, images, and video with a common architecture and training paradigm could accelerate progress in areas like multimodal reasoning, visual storytelling, and open-ended video generation. The model’s particular strength in generating long-tail visual elements is especially promising for creating more diverse and realistic images.
However, many open questions remain about the approach’s scalability, flexibility, and applicability to visual understanding tasks. The longer sequence lengths compared to VQ methods may pose challenges for scaling to larger models or datasets. Additionally, the lack of built-in controllability and the fixed compression settings may limit the approach’s versatility.
Where do we go from here? Potential research directions include adaptive compression techniques, methods for controlled generation, and investigations into the model’s capabilities on visual understanding tasks.
What do you think about this approach? Could modeling file formats be a key to more general visual intelligence? How might this impact the development of multimodal AI systems? I’d love to hear your thoughts in the comments!
If you found this analysis valuable, please consider becoming a paid subscriber to support more in-depth breakdowns of cutting-edge AI research. And don’t forget to share this post with colleagues who might find it interesting!