## Use Cases

### Embodied Agents

Any-to-any models can help embodied agents operate in multi-sensory environments, such as video games or physical robots. The model can take in an image or video of a scene, text prompts, and audio, and respond by generating text, predicting actions or next frames, or producing speech commands.
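
A minimal sketch of how such an agent loop might be wired together is shown below. It is illustrative only: `env` stands for whatever simulator or robot interface you use, and `generate_text` is a hypothetical callable wrapping an any-to-any model (the Inference section below shows what a real model call looks like).

```python
# Illustrative agent loop: observe, ask the model for an action, act.
# `env` and `generate_text` are hypothetical placeholders, not a real API.

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]

def agent_step(env, generate_text):
    frame = env.observe()  # current camera frame (image)
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": frame},
                {"type": "text", "text": f"Choose the next action from {ACTIONS} and reply with the action name only."},
            ],
        }
    ]
    reply = generate_text(conversation).strip().lower()
    # Fall back to a safe action if the model's reply is not a known action.
    action = reply if reply in ACTIONS else "stop"
    env.step(action)
    return action
```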

### Real-time Accessibility Systems

Vision-language-based any-to-any models can be used to aid visually impaired people. A real-time, on-device any-to-any model can take a live video stream from wearable glasses and describe the scene through audio (e.g., "A person in a red coat is walking toward you"), or provide real-time closed captions and environmental sound cues.
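
As a rough illustration, the loop below shows how such a system might be structured. All three callables are hypothetical placeholders: `capture_frame` for the wearable camera, `describe_scene` for the model call (see the Inference section for what a real call looks like), and `play_audio` for on-device speech output.

```python
import time

def assistive_loop(capture_frame, describe_scene, play_audio, interval_s=2.0):
    """Continuously describe the wearer's surroundings out loud.

    All arguments are hypothetical callables standing in for the camera,
    the any-to-any model, and the device's audio output.
    """
    last_description = ""
    while True:
        frame = capture_frame()              # latest frame from the glasses camera
        description = describe_scene(frame)  # e.g. "A person in a red coat is walking toward you"
        if description and description != last_description:
            play_audio(description)          # only speak when the scene changes
            last_description = description
        time.sleep(interval_s)               # throttle for latency and battery
```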

### Multimodal Content Creation

Any-to-any models can also be used to generate multimodal content. For example, given a video and an outline, the model can generate narration as speech, an improved version of the video, or a descriptive blog post. Moreover, these models can synchronize narration timing with visual transitions.

## Inference

You can run inference with any-to-any models using Transformers. Below is an example for the Qwen2.5-Omni-7B model; the exact classes and preprocessing differ between models, so check the documentation of the checkpoint you are using.

```python
import soundfile as sf
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helpers for loading the audio/image/video inputs

# Load the model and its processor
model = Qwen2_5OmniModel.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Build a multimodal conversation: a system prompt plus a user turn containing a video
conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
        ],
    },
]

# Whether to also use the video's audio track as an input
USE_AUDIO_IN_VIDEO = True

# Preprocess: render the chat template and load the multimodal inputs
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: generate the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)

# Decode the text and save the generated speech as a WAV file
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
```
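
The same chat format can carry several modalities in one user turn. As a small variation on the example above (reusing the `model`, `processor`, and `USE_AUDIO_IN_VIDEO` already defined), the snippet below adds a text instruction next to the video; the prompt wording itself is just an illustration.

```python
# Ask a question about the video by adding a text entry to the same user turn
conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
            {"type": "text", "text": "Describe what is being drawn in this video, step by step."},
        ],
    },
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# The model again returns both text token ids and an audio waveform
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
print(processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
```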