Qwen 2.5 VL

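| Configuration | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| --- | --- | --- | --- |
| Vision Transformer (ViT) | | | |
| Hidden Size | 1280 | 1280 | 1280 |
| # Layers | 32 | 32 | 32 |
| # Num Heads | 16 | 16 | 16 |
| Intermediate Size | 3456 | 3456 | 3456 |
| Patch Size | 14 | 14 | 14 |
| Window Size | 112 | 112 | 112 |
| Full Attention Block Indexes | {7, 15, 23, 31} | {7, 15, 23, 31} | {7, 15, 23, 31} |
| Vision-Language Merger | | | |
| In Channel | 1280 | 1280 | 1280 |
| Out Channel | 2048 | 3584 | 8192 |
| Large Language Model (LLM) | | | |
| Hidden Size | 2048 | 3584 | 8192 |
| # Layers | 36 | 28 | 80 |
| # KV Heads | 2 | 4 | 8 |
| Head Size | 128 | 128 | 128 |
| Intermediate Size | 4864 | 18944 | 29568 |
| Embedding Tying | | | |
| Vocabulary Size | 151646 | 151646 | 151646 |
| # Trained Tokens | 4.1T | 4.1T | 4.1T |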
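
A few quantities implied by the table can be worked out with simple arithmetic. The sketch below rests on two assumptions that the table does not state: that the LLM hidden size equals the number of query heads times the head size, and that one visual token covers a 28x28-pixel area, so the merger groups 2x2 ViT patches of 14 pixels each. The dictionary names are illustrative only.

```python
# Values copied from the configuration table; the names are hypothetical.
VIT = {"hidden": 1280, "heads": 16, "patch": 14, "window": 112}
LLM = {
    "3B": {"hidden": 2048, "kv_heads": 2, "head_size": 128},
    "7B": {"hidden": 3584, "kv_heads": 4, "head_size": 128},
    "72B": {"hidden": 8192, "kv_heads": 8, "head_size": 128},
}

# Per-head width inside the ViT.
vit_head_dim = VIT["hidden"] // VIT["heads"]

# A 112-pixel attention window spans this many 14-pixel patches per side.
patches_per_window_side = VIT["window"] // VIT["patch"]

# Assumption: one token covers 28x28 pixels, i.e. a 2x2 group of patches.
patches_per_token = (28 // VIT["patch"]) ** 2


def query_heads(cfg):
    """Assumption: hidden size = query heads * head size."""
    return cfg["hidden"] // cfg["head_size"]


for name, cfg in LLM.items():
    q = query_heads(cfg)
    print(name, "query heads:", q, "query heads per KV head:", q // cfg["kv_heads"])
```

Under these assumptions each KV head is shared by 7 or 8 query heads, which is consistent with a grouped-query attention layout.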
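  • Each 28x28-pixel block corresponds to one token
  • An image needs at least 4 tokens
    • Minimum pixels: 4 * 28 * 28
    • Smallest square: 2 * 28 -> 56 * 56 pixels
  • An image can use at most 16384 tokens
    • Maximum pixels: 16384 * 28 * 28
    • Largest square: 128 * 28 -> 3584 * 3584 pixels
  • MAX_RATIO
    • Maximum image aspect ratio: 200
  • ⚠️ In practice, OCR on A4 pages scanned at 300 DPI is prone to repeated output; 72 DPI gives better recognition results.
  • Recommended OCR parameters, to control accuracy and consistency: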
    • temperature=0, top_p=1.0, top_k=0
    • temperature=0.001, top_p=0.9, top_k=5
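
To see why both OCR presets behave almost deterministically, the following minimal sketch applies temperature, top-k and top-p filtering to a toy logit vector. It is not any library's sampler: `sampling_distribution` is a hypothetical helper, and it assumes the common conventions that `temperature=0` means greedy decoding and `top_k=0` disables top-k filtering. Check how your inference server interprets these values.

```python
import math


def sampling_distribution(logits, temperature, top_p, top_k):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    if temperature == 0:
        # Assumed convention: temperature 0 is greedy decoding.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature-scaled softmax, shifted by the maximum for numerical stability.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Assumed convention: top_k == 0 keeps every token.
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


logits = [2.0, 1.5, 0.5, -1.0]
print(sampling_distribution(logits, temperature=0, top_p=1.0, top_k=0))
print(sampling_distribution(logits, temperature=0.001, top_p=0.9, top_k=5))
print(sampling_distribution(logits, temperature=1.0, top_p=1.0, top_k=0))
```

With this toy input, both presets from the list leave only the highest-scoring token, while plain sampling at temperature 1.0 keeps all four candidates.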
# See smart_resize for reference:
# https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py
import os

IMAGE_FACTOR = 28
MIN_PIXELS = 4 * 28 * 28
MAX_PIXELS = 16384 * 28 * 28
MAX_RATIO = 200

VIDEO_MIN_PIXELS = 128 * 28 * 28
VIDEO_MAX_PIXELS = 768 * 28 * 28
FRAME_FACTOR = 2
FPS = 2.0
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768

# Set the maximum number of video token inputs.
# Here, 128K represents the maximum number of input tokens for the VLLM model.
# Remember to adjust it according to your own configuration.
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 128000 * 28 * 28 * 0.9)))
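
The image limits above can be turned into a token estimate. The following is a simplified sketch modeled on the idea behind `smart_resize` (snap both sides to multiples of 28, then rescale to fit the pixel budget); it is not a copy of the upstream function, and `resize_to_grid` and `image_tokens` are hypothetical names.

```python
import math

IMAGE_FACTOR = 28
MIN_PIXELS = 4 * 28 * 28
MAX_PIXELS = 16384 * 28 * 28
MAX_RATIO = 200


def resize_to_grid(height, width, factor=IMAGE_FACTOR,
                   min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (height, width) snapped to multiples of `factor` within the pixel budget."""
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(f"aspect ratio must not exceed {MAX_RATIO}")

    # Snap each side to the nearest multiple of the factor (at least one block).
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)

    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt(height * width / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w


def image_tokens(height, width):
    """One token per 28x28 block of the resized image."""
    h, w = resize_to_grid(height, width)
    return (h // IMAGE_FACTOR) * (w // IMAGE_FACTOR)


print(image_tokens(56, 56))      # smallest square: 4 tokens
print(image_tokens(3584, 3584))  # largest square: 16384 tokens
print(image_tokens(4000, 6000))  # oversized input is scaled down to fit the budget
```

Images whose longer side is more than 200 times the shorter side are rejected rather than resized, which matches the MAX_RATIO constant.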
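# 300 DPI -> 72 DPI: 72 / 300 = 24%, rounded to 25%.
# The second -resize enlarges the result to a height of 28 px only if it ends up shorter than that.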
convert a.jpg -resize 25% -resize 'x28<' a.output.jpg
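
The effect of the DPI change on the token count can be estimated with a few lines of arithmetic. This is a rough sketch: it assumes an A4 page of 210 x 297 mm, one token per 28x28-pixel block, and rounds each side to the nearest block; `a4_pixels` and `approx_tokens` are hypothetical helpers. Both results fall inside the 4 to 16384 token range, so no further clamping is modeled.

```python
A4_MM = (210, 297)   # A4 width and height in millimetres
MM_PER_INCH = 25.4
BLOCK = 28           # pixels per token side


def a4_pixels(dpi):
    """Pixel dimensions (width, height) of an A4 page rendered at the given DPI."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in A4_MM)


def approx_tokens(width, height, block=BLOCK):
    """Approximate token count: number of 28x28 blocks after rounding each side."""
    return round(width / block) * round(height / block)


for dpi in (300, 72):
    w, h = a4_pixels(dpi)
    print(dpi, (w, h), approx_tokens(w, h))
# 300 (2480, 3508) 11125
# 72 (595, 842) 630
```

Under these assumptions a 300 DPI page costs roughly 17 times as many visual tokens as the same page at 72 DPI.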

FAQ

macOS Dimension out of range

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # change this setting
    device_map="mps",
)

# Limit each image to between 256 and 1280 visual tokens (28x28 pixels per token).
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels)