
LLaVA-1.5-7B is a multimodal large language model that integrates visual understanding and natural language generation. Built upon the Vicuna-7B language model and the CLIP visual encoder, it can process both image and text inputs to support tasks such as visual question answering, image captioning, and multimodal reasoning. It is widely used in applications like intelligent assistants, education, and image-based analysis. Compared to earlier versions, LLaVA-1.5 offers significantly improved vision-language alignment and response accuracy, making it a key advancement in the field of multimodal AI.
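As a concrete illustration of the image-plus-text workflow described above, the sketch below shows one way to run visual question answering with the model. It assumes the community `llava-hf/llava-1.5-7b-hf` checkpoint and the `transformers` LlavaForConditionalGeneration / AutoProcessor APIs; the model ID, prompt template, and image path are illustrative and may need adjusting for your setup.

```python
# Minimal VQA sketch, assuming the Hugging Face `transformers` port of LLaVA-1.5-7B.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# One image plus a text question, using the LLaVA-1.5 chat prompt format.
image = Image.open("example.jpg")  # path to a local image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```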
Source model repository: LLaVA-1.5-7B
Model type:
LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.
Model date:
LLaVA-v1.5-7B was trained in September 2023.
Paper or resources for more information:
To be released