HyperGAI Introduces HPT: A Groundbreaking Family of Leading Multimodal LLMs


HyperGAI researchers have developed Hyper Pretrained Transformers (HPT), a family of multimodal language models that can handle different types of inputs such as text, images, and videos. Traditional LLMs have achieved strong results on text data but have a limited understanding of multimodal data, hindering progress toward Artificial General Intelligence (AGI). HPT aims to deliver strong performance across input formats without significantly increasing computational costs.

Currently, large language models like GPT-4V and Gemini Pro dominate the field but still fall short in robust multimodal understanding: they are primarily optimized for processing text and struggle to integrate visual information seamlessly. HPT offers a new approach, leveraging a multimodal pretraining framework capable of training large models proficient in understanding various modalities. It comes in two versions: HPT Pro, designed for complex multimodal tasks, and HPT Air, an efficient yet capable model for a wide range of tasks. At the core of both is the H-Former, a key innovation that bridges the vision and language modalities by converting visual data into language tokens.

HPT employs a dual-network design in the H-Former to learn both local and global features, enabling the model to understand fine-grained details as well as abstract, high-level information across modalities. The H-Former serves as a bridge between vision and language, allowing HPT to comprehend visual content even though its underlying language model is primarily pre-trained on text.
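To make the dual-network idea concrete, here is a minimal numpy sketch of a bridge that maps visual patch features to a small, fixed set of language-space tokens. It is an illustrative assumption modeled on query-based bridges (a local branch of learned queries cross-attending over patches, plus a global pooled summary); the class name, shapes, and fusion rule are hypothetical and not the actual H-Former implementation, whose details are not public.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class HFormerSketch:
    """Hypothetical vision-to-language bridge with local and global branches."""

    def __init__(self, d_vis, d_lm, n_queries, seed=0):
        r = np.random.default_rng(seed)
        # Learned queries that will attend over visual patches (local branch).
        self.queries = r.normal(size=(n_queries, d_vis)) / np.sqrt(d_vis)
        self.w_local = r.normal(size=(d_vis, d_lm)) / np.sqrt(d_vis)
        self.w_global = r.normal(size=(d_vis, d_lm)) / np.sqrt(d_vis)

    def __call__(self, patches):
        # Local branch: each query cross-attends over patch features,
        # capturing fine-grained detail.
        attn = softmax(self.queries @ patches.T)      # (n_queries, n_patches)
        local = (attn @ patches) @ self.w_local       # (n_queries, d_lm)
        # Global branch: mean-pooled image summary for high-level context.
        glob = patches.mean(axis=0) @ self.w_global   # (d_lm,)
        # Fuse: add the global summary to every local token, yielding a
        # fixed budget of tokens in the language model's embedding space.
        return local + glob                           # (n_queries, d_lm)

rng = np.random.default_rng(1)
bridge = HFormerSketch(d_vis=64, d_lm=32, n_queries=8)
patches = rng.normal(size=(196, 64))  # e.g. a 14x14 grid of ViT patch features
tokens = bridge(patches)
print(tokens.shape)                   # fixed number of language tokens: (8, 32)
```

The design point this sketch illustrates: however many patches the vision encoder produces, the bridge emits a small fixed number of tokens, so the LLM's sequence length (and compute) does not grow with image resolution.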

Significantly, HPT Pro outperforms larger proprietary models like GPT-4V and Gemini Pro on benchmarks such as MMBench and SEED-Image, showcasing its strength in complex multimodal tasks. Meanwhile, HPT Air achieves state-of-the-art results among open-source multimodal LLMs of similar or smaller sizes on challenging benchmarks like MMMU, highlighting its efficiency and effectiveness. The performance of both models underscores the effectiveness of the proposed framework in addressing the multimodal understanding challenge.

In conclusion, the paper presents a significant advancement in the field of multimodal LLMs with the introduction of the HPT framework. By effectively bridging the gap between vision and language modalities, HPT demonstrates superior performance compared to existing models on various benchmarks. The unique design of the H-Former and the scaling of the HPT framework open up exciting new ways to study how to achieve strong multimodal understanding.

Check out the Blog, Model, and GitHub. All credit for this research goes to the researchers of this project.



Introducing HPT - our new family of Open Multimodal LLMs from HyperGAI. HPT (Hyper-Pretrained Transformer) demonstrates strong capabilities on multiple multimodal benchmarks.

Main blog post: https://t.co/SEEZk8Nco3

In this thread, I’ll share some of the highlights.

— Steven Hoi (@stevenhoi) March 19, 2024

The post HyperGAI Introduces HPT: A Groundbreaking Family of Leading Multimodal LLMs appeared first on MarkTechPost.







