51OpenLab: One-Stop ICT Innovation Service Platform

How to Deploy Qwen2 Multimodal Models with OpenVINO™

openlab_96bf3613, updated 8 months ago

Authors: 杨亦诚 (Yang Yicheng), Ekaterina Aidova

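The core idea of a multimodal large model is to fuse data from different media (text, images, audio, video and so on) and learn the associations between modalities, enabling more intelligent information processing. Put simply, a multimodal large model can understand input in several modalities and produce a corresponding response, for tasks such as image understanding, speech recognition and visual question answering.

Figure: Task flow of a multimodal large model

Multimodal large models typically build on a text-generation model as their backbone in order to support conversation. Qwen2-Audio and Qwen2-VL, recently released by the Qwen team, are multimodal large models built on Qwen2; they accept audio/text and image/text input respectively. Compared with the previous-generation Qwen-VL and Qwen-Audio, the Qwen2-based models offer stronger visual and audio understanding and add multilingual support. This article shows how to use the OpenVINO™ toolkit to deploy the Qwen2-Audio and Qwen2-VL multimodal models on a thin-and-light laptop.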

Qwen2-Audio example: openvino_notebooks/notebooks/qwen2-audio/qwen2-audio.ipynb at latest · openvinotoolkit/openvino_notebooks · GitHub

Qwen2-VL example: openvino_notebooks/notebooks/qwen2-vl/qwen2-vl.ipynb at latest · openvinotoolkit/openvino_notebooks · GitHub

Qwen2 workshop: GitHub - openvino-dev-samples/qwen2-openvino-workshop

Qwen2-VL
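1. Model conversion and quantization
Inference for Qwen2-VL is not yet fully integrated into Optimum, so the model has to be converted and quantized manually. The pipeline consists of four models: the language model lang_model, the image encoder image_embed, the text token embedding model embed_token, and the image feature projection model image_embed_merger.

To simplify conversion, we have wrapped these steps in advance: developers only need to call the function provided in the Qwen2-VL example linked above to convert all of these models and quantize the language model, which carries the heaviest load. Qwen2-VL-2B-Instruct is used as the example here.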

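from pathlib import Path

import nncf
from ov_qwen2_vl import convert_qwen2vl_model

# Output directory for the converted models (any local path works; this name is only an example)
model_dir = Path("Qwen2-VL-2B-Instruct")

# Compress the language model weights to 4-bit asymmetric integers, in groups of 128
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}

convert_qwen2vl_model("Qwen/Qwen2-VL-2B-Instruct", model_dir, compression_configuration)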
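To give a feel for what an INT4_ASYM configuration with group_size 128 means, here is a minimal pure-Python sketch of group-wise asymmetric 4-bit quantization. It is an illustration only, not NNCF's implementation: the function names are hypothetical, and it stores the group minimum as a float offset where a real implementation would typically keep an integer zero point.

```python
import random


def quantize_group(weights):
    """Map one group of floats onto the 16 integer codes 0..15."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # guard against constant groups
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=128):
    """Quantize a weight row group by group; each group has its own scale and offset."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]


random.seed(0)
row = [random.uniform(-1, 1) for _ in range(256)]
groups = quantize_row(row)
restored = [v for g in groups for v in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_err:.4f}")
```

Smaller groups track the local weight range more closely at the cost of storing more scales. The "ratio" of 1.0 in the configuration above asks NNCF to compress all eligible layers to 4 bits rather than leaving a share of them at a higher backup precision.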

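2. Image understanding
In the same example we also wrapped the inference task, so the code below is enough to deploy an image understanding task with streamed text output. Because Qwen2-VL expects input in a specific format, the image and text must first be packed into the required dictionary structure, which the model's own processor then converts into the prompt.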

# example_image_path is the path to a local image file, defined beforehand
question = "Describe this image."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": f"file://{example_image_path}",
            },
            {"type": "text", "text": question},
        ],
    }
]
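If you build this structure repeatedly, a small helper keeps the format in one place. The sketch below is a convenience wrapper of my own, not part of the notebook; `build_image_message` and the sample path are hypothetical, and it simply reproduces the dictionary layout shown above.

```python
def build_image_message(image_path, question):
    """Return a single-turn message list with one local image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                # Same "file://" prefix convention as the example above
                {"type": "image", "image": f"file://{image_path}"},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_image_message("./examples/demo.jpg", "Describe this image.")
print(messages[0]["content"][0]["image"])
```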

Set device to "GPU" in the inference code below to run on the Intel integrated or discrete GPU in your system.
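One way to choose the device without hard-coding it is to prefer a GPU when the runtime reports one and fall back to the CPU otherwise. The helper below is a hypothetical sketch (`pick_device` is not part of OpenVINO or the notebook); it only inspects a list of device names.

```python
def pick_device(available_devices, preferred="GPU", fallback="CPU"):
    """Return the first available device whose name starts with `preferred`, else `fallback`."""
    for name in available_devices:
        if name.startswith(preferred):
            return name
    return fallback


# With OpenVINO installed, the list can come from the runtime:
#   import openvino as ov
#   device = pick_device(ov.Core().available_devices)
print(pick_device(["CPU", "GPU"]))
```

Matching on the prefix means that systems reporting several GPUs (for example "GPU.0" and "GPU.1") get the first one listed.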

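from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, TextStreamer

from ov_qwen2_vl import OVQwen2VLModel

device = "CPU"  # or "GPU"
model = OVQwen2VLModel(model_dir, device)

# Pixel budget per image; these values are assumed defaults, adjust them to your needs
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

# Render the chat template as text, then collect the image/video inputs from the messages
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Stream the generated text to stdout as tokens arrive
generated_ids = model.generate(**inputs, max_new_tokens=100, streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True))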

Sample output:

Question:

Describe this image.

Answer:

The image depicts a woman sitting on a sandy beach with a large dog. The dog is standing on its hind legs, reaching up to give the woman a high-five. The woman is smiling and appears to be enjoying the moment. The background shows the ocean with gentle waves, and the sky is clear with a soft light, suggesting it might be either sunrise or sunset. The scene is serene and joyful, capturing a heartwarming interaction between the woman and her dog.
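The inference code passes a minimum and maximum pixel budget to the processor, which bounds how many visual tokens an image can produce. The sketch below estimates that count. It rests on several assumptions that are not stated in this article: that one visual token corresponds to a 28x28 pixel block, that the budget is 256*28*28 to 1280*28*28 pixels, and that resizing follows the rounding scheme I understand Qwen's preprocessing utilities to use. Treat it as an approximation; `estimate_visual_tokens` is a hypothetical name.

```python
import math

PATCH = 28  # assumed: one visual token per 28x28 pixel block


def estimate_visual_tokens(height, width, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28):
    """Estimate the visual token count after resizing the image into the pixel budget."""
    # Snap both sides to multiples of the patch size
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink while keeping the aspect ratio
        beta = math.sqrt(height * width / max_pixels)
        h = math.floor(height / beta / PATCH) * PATCH
        w = math.floor(width / beta / PATCH) * PATCH
    elif h * w < min_pixels:
        # Too small: enlarge while keeping the aspect ratio
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / PATCH) * PATCH
        w = math.ceil(width * beta / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


for size in [(200, 300), (448, 448), (1024, 1024)]:
    print(size, estimate_visual_tokens(*size))
```

Lowering the maximum budget reduces the number of visual tokens the language model has to process, which is a simple lever for trading detail against latency and memory on a laptop.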

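3. Video understanding
Because Qwen2-VL accepts multiple images as input, the same capability can be used for video understanding. The approach is simple: sample frames from the video file and save them as images, merge them with the preprocessing script provided by Qwen2-VL, convert the result into the prompt template, and feed it to the model pipeline. Note that when "type" is set to "video", the processor automatically merges every two frames into one for processing, which improves inference performance and reduces memory use for multi-image tasks.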

question = "描述一下这段视频"  # "Describe this video"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file://./examples/keyframe_1.jpg",
                    "file://./examples/keyframe_2.jpg",
                    "file://./examples/keyframe_3.jpg",
                    "file://./examples/keyframe_4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": question},
        ],
    }
]
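The frame-sampling step described above can be sketched as follows. This is a minimal illustration under my own assumptions: it only decides which frame indices to keep and assembles the message, while decoding and saving the frames (for example with a video library) is left out. The function names and the keyframe file naming are hypothetical.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick `num_frames` evenly spaced frame indices from a clip of `total_frames` frames."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    # Take the middle frame of each equally sized segment
    return [int(step * i + step / 2) for i in range(num_frames)]


def build_video_message(frame_paths, question, fps=1.0):
    """Wrap saved frame images into the "video" message layout shown above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": [f"file://{p}" for p in frame_paths], "fps": fps},
                {"type": "text", "text": question},
            ],
        }
    ]


indices = sample_frame_indices(total_frames=100, num_frames=4)
paths = [f"./examples/keyframe_{i + 1}.jpg" for i in range(len(indices))]
messages = build_video_message(paths, "Describe this video.")
print(indices)
```

Since the processor pairs frames up when "type" is "video", sampling an even number of frames is a reasonable default.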

Qwen2-Audio
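1. Model conversion and quantization
For Qwen2-Audio we likewise wrapped the conversion and quantization steps behind a single interface, available in the Qwen2-Audio example linked above. The pipeline consists of the language model lang_model, the audio encoder audio_embed, the text token embedding model embed_token, and the audio feature projection model projection. Usage is as follows: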

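from pathlib import Path

import nncf
from ov_qwen2_audio_helper import convert_qwen2audio_model

# Output directory for the converted models (any local path works; this name is only an example)
model_dir = Path("Qwen2-Audio-7B-Instruct")

# Compress the language model weights to 4-bit asymmetric integers, in groups of 128
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}

convert_qwen2audio_model("Qwen/Qwen2-Audio-7B-Instruct", model_dir, compression_configuration)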
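A quick back-of-the-envelope calculation shows why 4-bit compression matters for a 7B-class model on a laptop. The sketch below is a rough estimate under stated assumptions: a nominal 7 billion compressed parameters (taken from the model name, not measured), one 16-bit scale and one 4-bit zero point per group of 128 weights, and no allowance for layers kept at higher precision. `int4_weight_size_gb` is a hypothetical helper.

```python
def int4_weight_size_gb(num_params, group_size=128, scale_bits=16, zero_point_bits=4):
    """Approximate storage for group-wise INT4 weights, in gigabytes (10^9 bytes)."""
    # 4 bits per weight plus the per-group scale and zero point, amortized over the group
    bits_per_weight = 4 + (scale_bits + zero_point_bits) / group_size
    return num_params * bits_per_weight / 8 / 1e9


def fp16_weight_size_gb(num_params):
    """Storage for the same weights in 16-bit floating point, in gigabytes."""
    return num_params * 2 / 1e9


params = 7e9  # nominal parameter count, an assumption
print(f"FP16: {fp16_weight_size_gb(params):.2f} GB")
print(f"INT4 (group size 128): {int4_weight_size_gb(params):.2f} GB")
```

Under these assumptions the weights shrink from about 14 GB to roughly 3.6 GB, with the per-group metadata adding only about 0.16 bits per weight.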

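2. Voice chat
Qwen2-Audio offers two task modes: voice chat and audio analysis. In voice chat mode the user provides only speech, with no text, and the instruction reaches the model directly through the audio. Below is a voice chat example.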

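import librosa
from transformers import TextStreamer

# ov_model and processor are assumed to be loaded beforehand, as in the Qwen2-Audio notebook;
# audio_chat_url is the address of the clip and audio_chat_file is its local copy
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_chat_url},
        ],
    },
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# Resample the audio to the rate expected by the feature extractor
audios = [librosa.load(audio_chat_file, sr=processor.feature_extractor.sampling_rate)[0]]

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)

generate_ids = ov_model.generate(**inputs, max_new_tokens=50, streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True))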

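As with Qwen2-VL, the data must be prepared in dictionary format before the prompt is built. In voice chat mode, only the URL or path of the audio file is needed. The output of this example is:

Answer:

Yes, I can guess that you are a female in your twenties.

The output shows that Qwen2-Audio not only understands the content of the audio but can also pick up on the speaker's timbre and tone.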
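The voice chat code loads a single audio file by hand. When a conversation carries several clips, it can be convenient to gather the audio entries from the conversation itself before loading them. The helper below is a hypothetical sketch that relies only on the dictionary layout used in this article; the example URL is a placeholder.

```python
def collect_audio_urls(conversation):
    """Return the audio_url of every audio entry in the conversation, in order."""
    urls = []
    for message in conversation:
        content = message["content"]
        # System prompts carry a plain string; only list-style content can hold audio
        if isinstance(content, list):
            urls.extend(part["audio_url"] for part in content if part.get("type") == "audio")
    return urls


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [{"type": "audio", "audio_url": "https://example.com/sample.wav"}]},
]
print(collect_audio_urls(conversation))
```

Each collected address can then be downloaded or opened and passed through librosa.load at the feature extractor's sampling rate, one array per clip, in the same order as the entries appear in the prompt.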

3. Audio analysis
In audio analysis mode, Qwen2-Audio accepts multimodal input: the text and the audio are combined into one prompt and passed to the model for inference.

question = "What does the person say?"

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_url},
            {"type": "text", "text": question},
        ],
    },
]

Sample output:

Answer:

The person says: 'Mister Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
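The two modes differ only in whether a text entry follows the audio entry, so a single builder can cover both. This is a sketch of my own rather than something from the notebook; `build_audio_conversation` and the placeholder URL are hypothetical.

```python
def build_audio_conversation(audio_url, question=None, system_prompt="You are a helpful assistant."):
    """Build a voice chat conversation (no question) or an audio analysis one (with question)."""
    content = [{"type": "audio", "audio_url": audio_url}]
    if question is not None:
        # Audio analysis mode: the text instruction follows the audio entry
        content.append({"type": "text", "text": question})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]


voice_chat = build_audio_conversation("https://example.com/sample.wav")
analysis = build_audio_conversation("https://example.com/sample.wav", "What does the person say?")
print(len(voice_chat[1]["content"]), len(analysis[1]["content"]))
```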

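Summary and Outlook
With the API functions wrapped around OpenVINO™, developers can convert and compress pretrained models with little effort and deploy inference locally. Thanks to the strong audio and image understanding of the Qwen2 multimodal models, a complete language model application can be built on nothing more than a thin-and-light laptop, which protects user data privacy while lowering the hardware barrier. We also plan to integrate the pipelines for the Qwen2 multimodal models into Optimum so that developers can call them more flexibly. Stay tuned.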

References
Qwen2-VL:
https://github.com/QwenLM/Qwen2-VL

Qwen2-Audio:
https://github.com/QwenLM/Qwen2-Audio
