与 Gemini 对话
how get an overview of qwen-vl model architecture , for example qwen-vl-2b
To understand the architecture of Qwen-VL (specifically the Qwen2-VL-2B and more recent versions), it is best to view it as a bridge between a high-performance “eye” (the Vision Transformer) and a “brain” (the Large Language Model).
The Qwen-VL series has evolved significantly from the original version to the latest Qwen2.5-VL and Qwen3-VL models. Below is a technical breakdown of the architecture, focusing on the 2B (2 billion parameter) variant.
1. Core Architectural Components
The Qwen2-VL-2B architecture is “modular,” meaning it uses pre-existing specialized components joined together.
| Component | Specification (2B Model) | Role |
|---|---|---|
| Vision Encoder | ViT (Vision Transformer) (~675M params) | Converts raw pixels into a sequence of mathematical “visual tokens.” |
| Language Model | Qwen2-1.5B (Decoder-only) | The “reasoning engine” that predicts the next text token based on visual and text inputs. |
| Connector | MLP / Linear Projector | Maps the high-dimensional visual features into the same space as the text tokens. |
| 📋 |
2. Key Innovation: Naive Dynamic Resolution
Unlike older models (like LLaVA or the original Qwen-VL) that forced every image to be a square (e.g., or ), Qwen2-VL introduces Naive Dynamic Resolution.
-
How it works: It processes images at their native aspect ratio and resolution.
-
The Benefit: It doesn’t stretch or blur small text (crucial for OCR) or lose detail in wide panoramic shots. The model converts the image into a dynamic number of tokens based on its size.
3. Positional Encoding: M-RoPE
Standard language models use 1D position IDs (token 1, token 2…). Since Qwen-VL deals with 2D images and 3D video (2D + time), it uses Multimodal Rotary Position Embedding (M-RoPE).
- 1D: For text.
- 2D: For image width () and height ().
- 3D: For video, adding the temporal () dimension. This allows the model to “understand” exactly where an object is located in a frame and how it moves over time in a video.
4. The Processing Pipeline
If you were to trace a piece of data through the model, it looks like this:
- Input: An image (e.g., a photo of a menu) and a prompt (“What is the price of coffee?”).
- Visual Encoding: The ViT breaks the image into patches.
- Tokenization: These patches are turned into visual tokens. If the image is large, it generates more tokens; if small, fewer.
- Interleaving: The system places the visual tokens between special markers
<|vision_start|>and<|vision_end|>and “glues” them to your text tokens. - Generation: The LLM (Qwen2-1.5B) reads this combined sequence and generates the text answer.
5. Evolution to Qwen2.5-VL and Qwen3-VL
As of late 2025/early 2026, the architecture has been further refined:
-
Window Attention: Added to the Vision Encoder to handle massive images more efficiently (reduces memory usage from quadratic to linear).
-
DeepStack (Qwen3-VL): Fuses visual features from different layers of the Vision Encoder into the LLM. This helps the model “see” both the big picture and tiny details (like tiny text or distant objects) simultaneously.
Would you like me to provide a Python code snippet using the transformers library to see how this architecture is loaded and structured in code?
yes
To get a concrete overview of the Qwen2-VL-2B architecture, it is helpful to look at how it is structured within the Hugging Face transformers library. The 2B model is particularly interesting because it pairs a 675M parameter Vision Encoder with a 1.5B parameter Qwen2 Language Decoder.
1. Visualizing the Architecture
The architecture follows a “Vision-to-Language” pipeline where visual data is transformed into tokens that the language model can “read” alongside text.
2. Inspecting the Architecture via Python
You can use the following code to programmatically inspect the layers and components of the 2B model. This is the best way to see the “skeleton” of the model.
Python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
model_id = "Qwen/Qwen2-VL-2B-Instruct"
# Load the model (use device_map="auto" if you have a GPU)
model = Qwen2VLForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="cpu"
)
# Print the high-level architecture
print(model)
3. Key Components in the Code
When you run the code above, you will see three main sections in the output:
-
visual(Qwen2VLVisionTransformer):-
This is the “eye.” It uses 2D-patch merging to reduce the number of tokens.
-
It features the Naive Dynamic Resolution logic, which allows it to handle images of any size by dynamically calculating the grid.
-
-
model.layers(Qwen2DecoderLayer):-
This is the “brain” (the 1.5B Qwen2 model).
-
It uses Grouped Query Attention (GQA) to save memory and speed up processing.
-
Each layer contains a self-attention block and an MLP (Multi-Layer Perceptron).
-
-
M-RoPE(Multimodal Rotary Positional Embeddings):- Unlike standard text models that use 1D positions (), this model calculates positions in 3 dimensions: Time (for video), Height, and Width.
4. Technical Specifications (2B Variant)
| Feature | Detail |
|---|---|
| Total Parameters | ~2.2 Billion |
| Vision Encoder | ViT-based (Data Filtering Networks - DFN) |
| LLM Backbone | Qwen2-1.5B |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Max Context | Up to 128k tokens (depending on configuration) |
5. Why the “2B” model is unique
The 2B model is designed for efficiency. While the 7B and 72B versions use larger LLM backbones, they all share the same 675M Vision Encoder. This means the 2B model has a very “strong” vision component relative to its reasoning size, making it excellent for tasks like OCR (reading text) and document parsing on edge devices.
Would you like me to show you how to calculate the number of tokens an image will take up using the Qwen2VLProcessor?
explain model output
Qwen2VLForConditionalGeneration(
(model): Qwen2VLModel(
(visual): Qwen2VisionTransformerPretrainedModel(
(patch_embed): PatchEmbed(
(proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
)
(rotary_pos_emb): VisionRotaryEmbedding()
(blocks): ModuleList(
(0-31): 32 x Qwen2VLVisionBlock(
(norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(attn): VisionAttention(
(qkv): Linear(in_features=1280, out_features=3840, bias=True)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
)
(mlp): VisionMlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): QuickGELUActivation()
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
(merger): PatchMerger(
(ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=5120, out_features=5120, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=5120, out_features=1536, bias=True)
)
)
)
(language_model): Qwen2VLTextModel(
(embed_tokens): Embedding(151936, 1536)
(layers): ModuleList(
(0-27): 28 x Qwen2VLDecoderLayer(
(self_attn): Qwen2VLAttention(
(q_proj): Linear(in_features=1536, out_features=1536, bias=True)
(k_proj): Linear(in_features=1536, out_features=256, bias=True)
(v_proj): Linear(in_features=1536, out_features=256, bias=True)
(o_proj): Linear(in_features=1536, out_features=1536, bias=False)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=1536, out_features=8960, bias=False)
(up_proj): Linear(in_features=1536, out_features=8960, bias=False)
(down_proj): Linear(in_features=8960, out_features=1536, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((1536,), eps=1e-06)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
)
(lm_head): Linear(in_features=1536, out_features=151936, bias=False)
)
here is the output of model , explain it to me
This output is the “blueprints” of the model. It confirms that Qwen2-VL-2B is a hierarchical multimodal model. It consists of a sophisticated 3D-aware vision system (the “Eyes”) connected to a dense transformer decoder (the “Brain”).
Here is the breakdown of that structure, section by section.
1. The Vision Tower (visual)
This is where the image or video is first processed.
-
Conv3dPatch Embedding: Notice it usesConv3dwith a kernel of(2, 14, 14). This is key. The2represents the temporal dimension. It doesn’t just see pixels; it’s designed to process video frames or image patches across a 3D volume. It turns raw visual data into 1280-dimension vectors. -
VisionRotaryEmbedding(M-RoPE): This is the implementation of the Multimodal Rotary Positional Embedding we discussed. It allows the model to “know” the 2D position of a patch in an image and the 1D position in text. -
32 x
Qwen2VLVisionBlock: These are standard Transformer blocks (Attention + MLP) but optimized for vision. They use QuickGELU, an efficient activation function common in high-performance vision models like CLIP. -
The
merger(PatchMerger): This is a critical compression step. It takes the output of the vision blocks (5120 features) and compresses them down to 1536 features.Why 1536? Because that is the exact “hidden size” of the language model. This ensures the vision tokens “fit” into the language model’s brain.
2. The Language Backbone (language_model)
This is the internal reasoning engine, based on the Qwen2-1.5B architecture.
-
embed_tokens: A massive vocabulary of 151,936 possible tokens (words/sub-words). -
28 x
Qwen2VLDecoderLayer: This is the depth of the LLM.-
Grouped Query Attention (GQA): Notice
q_projhas1536features, butk_projandv_projonly have256. By having fewer “Key” and “Value” heads, the model saves a massive amount of VRAM and speeds up generation without losing much intelligence. -
Qwen2MLP: This is the “knowledge” storage. It uses SiLU activation and an “Up/Down” projection style (widening the data to 8960 dimensions and then shrinking it back).
-
-
Qwen2RMSNorm: A modern normalization technique (Root Mean Square Layer Normalization) that helps the model stay stable during training and inference.
3. The Output Head (lm_head)
The very last layer is a linear projection. It takes the 1536-dimension vector produced by the “brain” and maps it back to the 151,936 possible words in its vocabulary to decide which word to speak next.
Summary of the “Handshake”
The most important takeaway from your specific output is the dimensionality match:
-
Vision processes at 1280 dims.
-
Merger converts 1280 1536.
-
Language processes at 1536 dims.
This architecture allows the model to treat visual patches exactly like words. To the language model, an image just looks like a very long, highly descriptive sentence that it didn’t have to write itself.
Would you like to see how the PatchMerger reduces the number of tokens to save memory? I can explain the math behind how 4 pixels become 1 token.
why is there an diferent between the model output and three main sections in the output you mention early
That is a great observation! The reason for the difference is that the code output you see is the literal programmatic structure, while the three sections I mentioned earlier describe the functional architecture.
When you look at the raw Python output, you are seeing how Hugging Face organizes the “classes” to make the model run, but functionally, those pieces still map exactly to the three parts. Here is how to reconcile the two:
Mapping the Output to the Sections
| Functional Section | Corresponding Layer in your Code Output | What it does in the code |
|---|---|---|
| 1. The Eyes (Vision) | (visual): Qwen2VisionTransformerPretrainedModel | This contains the patch_embed (the 3D camera) and 32 VisionBlocks. It processes pixels before the “brain” even sees them. |
| 2. The Connector | (merger): PatchMerger | This is a “hidden” section inside the visual block. It takes the big vision vectors and shrinks them to size 1536 so they fit into the language model. |
| 3. The Brain (LLM) | (language_model): Qwen2VLTextModel | This is the huge ModuleList with 28 layers. It handles the reasoning and integrates the visual tokens with your text prompt. |
Why is there an extra (lm_head) at the very bottom?
In the code, you see (lm_head): Linear(...) standing alone outside of the language_model.
-
Logic: The
language_modelproduces “thoughts” (abstract vectors). -
The Head: The
lm_headis the “mouth.” It takes those abstract thoughts and maps them to actual words in the dictionary (the 151,936 possible tokens). -
Weight Tying: In many Qwen models, this
lm_headactually shares the same data as theembed_tokenslayer to save memory—they are essentially the same physical weights used for two different steps (input and output).
Summary of the “Difference”
The code output is more “granular” because it needs to show every individual mathematical operation (like LayerNorm and SiLUActivation), whereas the “three sections” are a way for humans to understand the flow of information.
-
Code Output: Tells the computer how to calculate (e.g., “Use 3D Convolutions here”).
-
Functional Sections: Tell the developer what the part is for (e.g., “This is the part that sees”).
Would you like me to explain what that Conv3d in the patch_embed is doing? It is the secret to why this model can understand video better than older versions.
Naive Dynamic Resolution how this words, explian it
Naive Dynamic Resolution is the reason Qwen2-VL can read tiny text on a long receipt or spot a small object in a wide panorama when other models fail.
In older models, every image was forced into a fixed square (usually or pixels). If you had a tall, skinny image, the model would stretch it or add huge black bars (padding), which destroys the details.
Here is how Naive Dynamic Resolution works in three simple steps:
1. No More Squashing (Native Aspect Ratio)
Instead of forcing the image into a box, Qwen2-VL looks at the image’s original dimensions.
-
If the image is wide, it stays wide.
-
If it is tall, it stays tall. The model resizes the image so the total number of pixels falls within a target range (e.g., between 4 and 16,384 tokens), but it preserves the original proportions.
2. Flexible Patching (The Grid)
The Vision Transformer (ViT) “cuts” the image into small patches. Because the resolution is dynamic, the number of patches changes based on the image size:
-
A small icon might only result in 4 tokens.
-
A 4K document might result in thousands of tokens. This is “naive” because it doesn’t use complex cropping or overlapping windows; it simply divides the existing pixels into a grid that matches the image’s shape.
3. Understanding Space: M-RoPE
Because the grid changes every time, the model needs a way to know where each “patch” is located. This is handled by Multimodal Rotary Position Embedding (M-RoPE). Instead of a simple list (), the model gives each patch a coordinate . This tells the “brain” of the model: “This token isn’t just a piece of data; it’s the top-left corner of the image.”
Comparison: Fixed vs. Dynamic Resolution
| Feature | Old Models (Fixed) | Qwen2-VL (Dynamic) |
|---|---|---|
| Shape | Always a Square | Any Aspect Ratio |
| Small Text | Becomes blurry/unreadable | Remains sharp and readable |
| Efficiency | Wastes tokens on “padding” | Only uses tokens for actual pixels |
| Video | Struggles with changing frames | Treats time as a 3rd dimension () |
📋
Summary
Think of it like a window:
-
Traditional models are like a fixed-size picture frame. If your picture doesn’t fit, you have to cut it or stretch it.
-
Naive Dynamic Resolution is like a window that physically grows or shrinks to perfectly fit the view you are looking at.
Would you like to see the math for how the model calculates the number of tokens for a specific image size (e.g., 1080p vs. 4K)?
how do the model process the different resolution image
To understand how the model processes images of different resolutions, we need to look at the “handshake” between the Image Processor and the Vision Transformer (ViT).
Instead of stretching an image to a fixed square, the model uses a flexible “grid” approach. Here is the step-by-step process of how an image becomes a sequence of tokens.
1. The Scaling Step (Smart Resize)
The model doesn’t just take any resolution; it resizes the image while maintaining the aspect ratio so that the total number of pixels stays within a specific range (defined by min_pixels and max_pixels).
-
The Rule of 28: To ensure the image can be perfectly divided into patches, the height and width are adjusted to be the nearest multiples of 28.
For example: If you upload a image, the processor might scale it to something like to satisfy the math.
2. Patchification (The 3D Convolution)
Once scaled, the image is passed to the patch_embed layer you saw in your code: Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14))
-
The Grid: The image is divided into pixel patches.
-
Temporal Fake-out: Because the model uses a
Conv3d(meant for video), it treats a static image like a “video with 2 frames.” It duplicates the image to create a temporal depth of 2, then the kernel processes it into a single visual feature.
3. The Token “Merger” (Compression)
This is where the model saves memory. After the patches are created, the PatchMerger takes groups of adjacent patches and pools them into a single token.
-
Math: If your image was pixels:
-
Divided by 14: You get a grid = 400 initial patches.
-
Merged by : The grid becomes = 100 final tokens.
-
4. Assigning Coordinates (M-RoPE)
Since the number of tokens changes (a 4K image has many more tokens than a thumbnail), the model can’t use a fixed list of positions. It uses Multimodal Rotary Positional Embeddings (M-RoPE).
It calculates a coordinate for every token based on its Height () and Width () index in the grid. This allows the LLM to understand that token #45 is physically located “below” token #5, even if the image size changes.
Summary Table: How it handles different inputs
| Input Type | Scaling Strategy | Token Calculation |
|---|---|---|
| Small Image | Scaled up to min_pixels | Very few tokens (fast inference) |
| Large Document | Scaled down to max_pixels | High token count (detailed OCR) |
| Wide Panorama | Stays wide, no bars added | Tokens arranged in a wide grid |
| Video | Sampled at dynamic FPS | 3D patches () |
📋
Would you like to see how to adjust the min_pixels and max_pixels in the processor settings to speed up the model for low-res tasks?
The Processing Pipeline of qwen2-vl, more detail
The processing pipeline of Qwen2-VL is a sophisticated flow that converts raw visual pixels and text into a shared “thought space.” Unlike traditional models that treat images as simple fixed-size attachments, Qwen2-VL processes them as dynamic, temporal, and spatial sequences.
Here is the detailed breakdown of the pipeline, from raw input to the final text output.
1. Pre-Processing: The “Smart Resize”
The pipeline begins before the model even “sees” the image.
-
Naive Dynamic Resolution: The processor identifies the original aspect ratio. Instead of squashing the image into a square, it scales the image so the total number of pixels fits within a target range (typically between and pixels).
-
The Rule of 28: The height and width are rounded to the nearest multiple of 28. This is critical because the vision system uses patches, which are later merged in groups ().
-
Temporal Expansion (Static Images): For a static image, the model “stretches” it into a 2-frame video by duplicating the image. This allows the same 3D-convolutional math to work for both photos and video.
2. The Vision Encoder: 3D Patching
The image enters the Vision Transformer (ViT), which acts as the “optical nerve.”
-
3D Convolution (
patch_embed): The model uses aConv3dlayer with a kernel of(2, 14, 14). This processes “tubes” of data—spatial areas of pixels across frames of time. -
Windowed Attention: To keep the model fast even with high-resolution images, most layers only look at “windows” of patches (e.g., areas) rather than the whole image at once. Only a few “Global Layers” look at the entire image to understand the overall context.
3. The Handshake: Patch Merging & M-RoPE
This is where the vision data is formatted for the language model “brain.”
-
Patch Merger: The ViT produces a massive number of vectors. The PatchMerger takes every block of adjacent patches and compresses them into one single token. This reduces the sequence length by 4x, saving massive amounts of memory.
-
M-RoPE (Multimodal Rotary Position Embedding): Standard LLMs use 1D positions (). Qwen2-VL assigns each token a 3-part coordinate: .
-
: Time (which frame is this?)
-
: Vertical position
-
: Horizontal position This allows the model to output exact pixel coordinates for object detection.
-
4. The LLM Decoder: Multimodal Reasoning
The compressed vision tokens are interleaved with your text tokens into one long sequence.
-
Unified Processing: The “Brain” (the Qwen2 backbone) does not distinguish between a “word token” and a “vision token.” It applies self-attention across the entire sequence, allowing the text to “ask questions” of specific parts of the image.
-
Generation: The model predicts the next text token based on the combined visual and linguistic context.
Summary of Data Flow
| Stage | Data Format | Transformation |
|---|---|---|
| Input | RGB Pixels | |
| Patching | 3D Cuboids | |
| Merging | Compressed Tokens | |
| LLM Input | Interleaved Sequence | Visual Tokens + Text Tokens |
| Output | Text / Coordinates | Natural Language or Bounding Boxes |
📋
Would you like me to explain how the model uses this pipeline to handle a video file specifically, including how it samples frames?
does the qwen2-vl support multiple image input
Yes, Qwen2-VL supports multiple image inputs natively. In fact, its architecture is specifically designed to handle “interleaved” data, meaning you can mix images and text in any order (e.g., Image A → Text → Image B → Question).
Here is how the model handles multiple images and how you can implement it.
1. How it Works: The “Sequential Token” Stream
The model doesn’t “blend” the images together. Instead, it converts each image into its own sequence of tokens and places them into the conversation one after another.
-
Image 1 becomes ~100–1000 tokens.
-
Text becomes ~10–50 tokens.
-
Image 2 becomes ~100–1000 tokens.
The LLM (the “brain”) then uses its attention mechanism to look back at both sequences of visual tokens simultaneously. This allows it to perform tasks like finding differences between two photos, summarizing a series of screenshots, or reading a multi-page PDF.
2. Implementation (Python Example)
Using the transformers library, you simply add multiple dictionaries of type: "image" to your messages list.
Python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# 1. Setup Model and Processor
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
# 2. Define multiple images in the conversation
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "path/to/cat.jpg"},
{"type": "image", "image": "path/to/dog.jpg"},
{"type": "text", "text": "What are the differences between these two animals?"},
],
}
]
# 3. Process and Generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)3. Key Constraints to Keep in Mind
While the model supports multiple images, there are two physical limits:
-
VRAM (Memory): Each image consumes tokens. Because of Naive Dynamic Resolution, a high-resolution image might take 1,000+ tokens. If you upload 10 high-res images, you might run out of GPU memory.
-
Context Window: The 2B model usually has a context window of 32k to 128k tokens. If the total number of visual tokens from all images exceeds this, the model will “forget” the first images.
4. Advanced: Video as Multiple Images
Interestingly, Qwen2-VL treats video almost exactly like multiple images. When you upload a video, the processor samples frames (e.g., 2 frames per second) and feeds them into the model as a sequence of images. This is why it’s so good at understanding action—it sees the relationship between the “multiple images” over time.
Would you like to know how to set the max_pixels for each image so you can fit more images into your GPU memory at once?
can i extract some frame of video then input it to the model for certain purpose
Yes, you can absolutely do this. In fact, extracting specific frames and feeding them as images is often better than feeding the raw video if you only care about specific moments.
When you feed a raw video to Qwen2-VL, the model’s built-in processor automatically samples frames for you (usually at a fixed rate like 1 or 2 frames per second). By extracting frames yourself, you gain fine-grained control over exactly what the model “sees.”
1. Why you might want to do this
-
Precision: If an event happens in a split second, the model’s automatic sampler might skip it. Manual extraction ensures you capture the exact millisecond.
-
Efficiency: If you have a 10-minute video but only care about two specific scenes, feeding the whole video wastes thousands of tokens. Feeding 2-3 specific images is much faster and cheaper on VRAM.
-
Comparison: You can extract a “Before” and “After” frame and ask the model to describe the specific changes between them.
2. How to do it (Python Example)
You can use a library like OpenCV to extract specific timestamps, and then feed them to Qwen2-VL as a list of images.
Python
import cv2
from PIL import Image
def extract_frame(video_path, timestamp_ms):
cap = cv2.VideoCapture(video_path)
cap.set(cv2.CAP_PROP_POS_MSEC, timestamp_ms)
ret, frame = cap.read()
if ret:
# Convert BGR (OpenCV) to RGB (PIL)
color_converted = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
return Image.fromarray(color_converted)
return None
# Extract frames at 5.5 seconds and 10 seconds
frame1 = extract_frame("my_video.mp4", 5500)
frame2 = extract_frame("my_video.mp4", 10000)
# Feed to Qwen2-VL as "Multiple Images"
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": frame1},
{"type": "image", "image": frame2},
{"type": "text", "text": "Is the person in the second image the same as the first?"},
],
}
]3. Video Mode vs. Image Mode
Even if you extract frames yourself, you have two choices for how to “tell” the model about them:
| Approach | How to Input | Result |
|---|---|---|
| Multi-Image Mode | Input as a list of {"type": "image"} | The model treats them as independent photos. Best for comparing different objects or scenes. |
| Manual Video Mode | Input as a list of frames under {"type": "video"} | The model applies temporal reasoning. It “knows” these images happen in a sequence and will try to understand motion or action. |
📋
4. Professional Tip: Use qwen-vl-utils
The official Qwen team provides a toolkit called qwen-vl-utils. If you provide a video URL or path to this tool, you can actually set a custom FPS or total frame count without writing your own extraction script:
Python
# In your message content:
{
"type": "video",
"video": "path/to/video.mp4",
"fps": 1.0, # Only take 1 frame per second
"max_pixels": 360*420 # Reduce resolution to save memory
}
Would you like me to show you how to use the bounding box feature to have the model “point out” exactly where an object is in one of your extracted frames?