CMU Researchers Propose GILL: An AI Method To Fuse LLMs With Image Encoder And Decoder Models


With the release of OpenAI’s GPT-4, multimodality has arrived in mainstream Large Language Models. Unlike its predecessor, GPT-3.5, which powers the well-known ChatGPT and accepts only text, GPT-4 accepts both text and images as input. Recently, a team of researchers from Carnegie Mellon University proposed an approach called Generating Images with Large Language Models (GILL), which extends multimodal language models so that they can also generate novel images.

The GILL method processes inputs that mix images and text to produce text, retrieve images, and generate new images. It does this by mapping the output embedding space of a frozen text-only LLM to that of a frozen image generation model, even though the two models use distinct text encoders. Unlike other methods that require interleaved image-text data, GILL learns the mapping by fine-tuning a small number of parameters on image-caption pairs.

According to the team, the method combines a frozen text-only large language model with pretrained image encoder and decoder models. It provides a wide range of multimodal capabilities, such as image retrieval, novel image generation, and multimodal dialogue, by mapping between the embedding spaces of the modalities in order to fuse them. GILL can condition on mixed image and text inputs and produces outputs that are both coherent and readable.
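
Once text and images share an embedding space, image retrieval reduces to a nearest-neighbor search. The sketch below illustrates the idea in plain Python. The cosine similarity measure, the function names, and the toy three-dimensional embeddings are assumptions made for illustration; the article does not say how GILL scores retrieval candidates.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def retrieve_image(query_embedding, image_embeddings):
    """Return the id of the candidate image closest to the query.

    image_embeddings maps a hypothetical image id to its embedding.
    """
    return max(
        image_embeddings,
        key=lambda image_id: cosine_similarity(
            query_embedding, image_embeddings[image_id]
        ),
    )

# Toy data: in a real system these vectors would come from the models.
candidates = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for an LLM output mapped into image space
best_match = retrieve_image(query, candidates)
print(best_match)
```

In this toy setup the query points mostly along the first axis, so the candidate with the largest first component is selected.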

To achieve strong image generation performance, the method introduces an effective mapping network that grounds the LLM to a text-to-image generation model. This mapping network converts the LLM’s hidden text representations into the embedding space of the visual model. In doing so, it uses the LLM’s powerful text representations to produce visually consistent outputs.
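
To make the idea of a mapping network concrete, here is a deliberately simplified sketch. It is not GILL’s architecture: it assumes a single linear layer trained with a mean-squared-error loss so that mapped LLM hidden states match target embeddings from the image model’s text encoder. The layer type, the loss, the hyperparameters, and the toy data are all assumptions chosen for illustration.

```python
import random

def apply_linear(weights, bias, hidden):
    """Map one hidden vector into the target embedding space."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def train_mapper(pairs, in_dim, out_dim, lr=0.2, epochs=1000, seed=0):
    """Fit the linear map with plain stochastic gradient descent.

    Only these weights are updated; the (hypothetical) LLM and image
    model that produced the pairs stay frozen.
    """
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
        for _ in range(out_dim)
    ]
    bias = [0.0] * out_dim
    for _ in range(epochs):
        for hidden, target in pairs:
            pred = apply_linear(weights, bias, hidden)
            for i in range(out_dim):
                # Gradient of the MSE with respect to output i.
                grad = 2.0 * (pred[i] - target[i]) / out_dim
                for j in range(in_dim):
                    weights[i][j] -= lr * grad * hidden[j]
                bias[i] -= lr * grad
    return weights, bias

# Toy pairs: (LLM hidden state, embedding expected by the image model).
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0, 0.0], [0.0, 1.0]),
    ([0.0, 0.0, 1.0], [1.0, 1.0]),
    ([1.0, 1.0, 0.0], [1.0, 1.0]),
]
weights, bias = train_mapper(pairs, in_dim=3, out_dim=2)
final_loss = max(mse(apply_linear(weights, bias, h), t) for h, t in pairs)
print(f"worst-case training loss: {final_loss:.6f}")
```

The point of the sketch is the training setup rather than the layer itself: the loss is computed purely in embedding space, so only the small mapper receives gradient updates.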

With this approach, the model can retrieve images from a specified dataset in addition to generating new ones. At inference time, the model decides whether to generate or retrieve an image, using a learned decision module conditioned on the LLM’s hidden representations. The approach is computationally efficient because it does not need to run the image generation model during training.
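
The generate-or-retrieve decision can be pictured as a small binary classifier over the LLM’s hidden representation. The sketch below assumes a logistic-regression form with hand-set weights; the classifier type, the threshold, and the two-dimensional toy inputs are illustrative assumptions, not details taken from the GILL paper.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decide(hidden, weights, bias, threshold=0.5):
    """Choose between generating and retrieving an image.

    hidden is a stand-in for the LLM hidden representation; weights and
    bias would be learned in practice but are hand-set in this toy.
    Returns the decision and the estimated probability of "generate".
    """
    score = sum(w * h for w, h in zip(weights, hidden)) + bias
    p_generate = sigmoid(score)
    action = "generate" if p_generate >= threshold else "retrieve"
    return action, p_generate

# Hypothetical weights: the first feature favors generation,
# the second favors retrieval.
weights = [1.0, -1.0]
bias = 0.0

action_a, prob_a = decide([2.0, 0.0], weights, bias)
action_b, prob_b = decide([0.0, 2.0], weights, bias)
print(action_a, round(prob_a, 3))
print(action_b, round(prob_b, 3))
```

Because the classifier only reads hidden representations, a decision of this form is cheap compared with running an image generation model.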

The method outperforms baseline generation models, especially on tasks that require longer and more sophisticated language. In particular, GILL outperforms Stable Diffusion at processing longer-form text, including dialogue and discourse. It also does better at dialogue-conditioned image generation than non-LLM-based generation models, benefiting from multimodal context and generating images that better match the given text. Unlike conventional text-to-image models that process only textual input, GILL can also process arbitrarily interleaved image-text inputs.

In conclusion, GILL (Generating Images with Large Language Models) seems promising because it demonstrates a wider range of abilities than previous multimodal language models. Its ability to outperform non-LLM-based generation models on text-to-image tasks that measure context dependence makes it a strong candidate for multimodal tasks.

The post CMU Researchers Propose GILL: An AI Method To Fuse LLMs With Image Encoder And Decoder Models appeared first on MarkTechPost.
