Models
Date | Model | Size | Context Window | Creator | tags |
---|---|---|---|---|---|
2025-06-11 | Magistral | small 24B | 39K | Mistral AI | Reasoning, Multilingual |
2025-06-07 | Comma v0.1 | 7B | | EleutherAI | Full OSS, English |
2025-06-05 | Qwen3-Embedding | 0.6b, 4b, 8b | 32K | Alibaba | Embedding, Reranking, Multilingual(100+), Instruction Aware, MRL(1024, 2560, 4096) |
 | Phi-4 | 14b | 128K | Microsoft | mini-reasoning, reasoning, multimodal |
2025-05-28 | DeepSeek R1 0528 | | | DeepSeek AI | Update |
2025-05-26 | QwenLong-L1 | 32b | 120K | Alibaba | text |
2025-05-20 | Gemma3n | 8b-e2b, 8b-e4b | | Google DeepMind | Edge, PLE |
2025-04-29 | Qwen3 | 0.6b, 1.7b, 4b, 8b, 14b, 30b, 32b, 235b, 30b-a3b, 235b-a22b | 40K | Alibaba | MoE, Reasoning |
2025-04-05 | Llama 4 | scout 109b-a17b, maverick 400b-a17b, behemoth 2T | 1M, 10M | Meta | MoE, Vision |
2025-03-26 | Qwen2.5-Omni | 3B, 7B | | Alibaba | text, audio, image, video, speech |
2025-03-12 | Gemma3 | 1b, 4b, 12b, 27b | 128K | Google DeepMind | Vision |
2025-02-26 | Wan 2.1 | 1.3b, 14b | | Alibaba | t2v, 480p, 720p |
2025-02-24 | smollm2 | 135m, 360m, 1.7b | 8K | HuggingFaceTB | |
2025-01-28 | Qwen2.5-VL | 3b, 7b, 32b, 72b | 125K | Alibaba | Vision |
2025-01-28 | Qwen2.5 | 0.5b, 1.5b, 3b, 7b, 14b, 32b, 72b | 32K,1M | Alibaba | |
2025-01-20 | DeepSeek R1 | 1.5b, 7b, 8b, 14b, 32b, 70b, 671b | 128K | DeepSeek AI | Reasoning |
2024-12-07 | Llama 3.3 | 70B | 128K | Meta | |
2024-10-05 | LLaVA | 7b, 13b, 34b | 4K, 32K | | Vision |
2024-09-25 | Llama 3.2 | 1B, 3B, 11B, 90B | 128K | Meta | |
2024-07-23 | Llama 3.1 | 8B, 70.6B, 405B | 128K | Meta | |
2024-06-27 | Gemma 2 | 9b, 27.2b | 8K | Google DeepMind | |
2024-06-07 | Qwen2 | 0.5b, 1.5b, 7b, 57b (A14b), 72b | 32K, 64K, 128K | Alibaba | |
2024-04-23 | Phi-3 | 3.8b , 7b , 14b | 4K, 128K | Microsoft | |
2024-04-18 | Llama 3 | 8b, 70.6b | 8K, 128K | Meta | |
2024-02-21 | Gemma | 2b, 7b | 8K | Google DeepMind | |
2023-12-11 | Mistral | 7b, 46.7b (8x7B MoE) | 33K | Mistral AI | |
2023-07-18 | Llama 2 | 6.7b, 13b, 69b | 4K | Meta | |
2023-02-24 | LLaMA | 6.7B, 13B, 32.5B, 65.2B | 2K | Meta | |
2020-06-11 | GPT-3 | 175b | 2K | OpenAI | |
2019-02-14 | GPT-2 | 1.5b | 1K | OpenAI | |
2018-06-11 | GPT-1 | 117m | 512 | OpenAI |
Proprietary Models
release | model | author | notes |
---|---|---|---|
2025-04-17 | Gemini 2.5 Flash | Google | |
2025-04-14 | GPT-4.1, mini, nano | OpenAI | |
2025-03-25 | Gemini 2.5 Pro | Google | 2M |
2025-02-05 | Gemini 2.0 Flash | Google | audio, video |
2025-02-01 | Gemini 2.0 Flash-Lite | Google | |
2025-01-10 | o3, o3-mini | OpenAI | Reasoning |
2024-12-17 | o1 | OpenAI | |
2024-09-12 | o1-preview | OpenAI | Reasoning |
2024-07-18 | GPT-4o mini | OpenAI | |
2024-05-13 | GPT-4o | OpenAI | text, audio, image |
2024-03-04 | Claude 3 Haiku, Sonnet, Opus | Anthropic | 200K |
2024-02-15 | Gemini 1.5 Pro | Google | breakthrough 1M-token long-context window |
2023-12-06 | Gemini 1.0 Pro | Google | natively multimodal model family |
2023-11-21 | Claude 2.1 | Anthropic | 200K |
2023-11-06 | GPT-4V | OpenAI | 128K, Vision |
2023-11-06 | GPT-4 Turbo | OpenAI | 128K |
2023-07-11 | Claude 2 | Anthropic | 100K |
2023-06-27 | GPT-3.5-16k | OpenAI | 16K |
2023-03-14 | GPT-4 | OpenAI | 8K, 32K, image |
2023-03-01 | GPT-3.5-turbo | OpenAI | 4K |
2022-11-30 | GPT-3.5 | OpenAI | 4K |
abbr. | stands for | meaning |
---|---|---|
MRL | Matryoshka Representation Learning | |
R2V | reference-to-video | |
MV2V | masked video-to-video | |
V2V | video-to-video | |
MoE | Mixture of Experts | mixture-of-experts model |
VACE | Video Animation, Composition, and Editing | |
- MRL - Matryoshka Representation Learning
- Embedding models support customizing the dimensionality of the final embedding
- https://huggingface.co/blog/matryoshka
- VACE: All-in-One Video Creation and Editing
- 2025-03-11
- VACE - Video Animation, Composition, and Editing
- an all-in-one video processing framework
- r2v, mv2v, v2v
- https://arxiv.org/abs/2503.07598
- R2V - reference-to-video
- MV2V - masked video-to-video
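The MRL note above means an embedding trained with Matryoshka Representation Learning can simply be truncated to a smaller dimension and re-normalized, with little quality loss. A minimal NumPy sketch (the random vector is a stand-in for real model output; the 1024/4096 sizes mirror the Qwen3-Embedding MRL dims listed above):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL-trained embedding, then L2-renormalize."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

# Random stand-in for a 4096-dim embedding (not real model output).
full = np.random.default_rng(0).normal(size=4096)
small = truncate_embedding(full, 1024)
assert small.shape == (1024,)
assert abs(np.linalg.norm(small) - 1.0) < 1e-9
```

Truncation works because MRL training packs the most important information into the leading dimensions; cosine similarity on the truncated vectors stays meaningful.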
*-pt
- Pre-Training - pretrained model: initial training on a large-scale dataset to learn language patterns and structure.
- Suitable as a base model for developers to fine-tune further on specific tasks.
*-ft
- Fine-tuned
*-it
- Instruction Tuning - instruction-tuned model: further fine-tuned on top of the pretrained model for specific tasks or instructions.
- Better suited for direct use in real applications, since it is already optimized for specific purposes.
- Estimating memory usage
- parameters × precision
- the ideal precision today is float16 / bfloat16 - each parameter takes 16 bits
- 1B -> 2GB
- quantized parameters - int4 is a common quantization
- 1B -> 0.5GB
- https://huggingface.co/datasets/christopherthompson81/quant_exploration
- Q4_0 - worse accuracy but higher speed
- Q4_1 - more accurate but slower
- Q4_2, Q4_3 - newer generations of Q4_0 and Q4_1, more accurate
- https://github.com/ggerganov/llama.cpp/discussions/406
- 7B - 8GB RAM
- 13B - 16GB RAM
- 70B - 32GB/48GB RAM
- a small context window is fine for RAG
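The parameters × precision rule above can be written out directly (a weights-only estimate; it ignores KV cache and activations, which is why quantized models still need headroom):

```python
def model_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Weights-only estimate: parameters × precision (KV cache, activations excluded)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

assert model_memory_gb(1, 16) == 2.0   # float16/bfloat16: 1B params -> 2 GB
assert model_memory_gb(1, 4) == 0.5    # int4 quantization: 1B params -> 0.5 GB
```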
- Context Window
- LLama-3 8B 8K-1M https://ollama.com/library/llama3-gradient
- 256k context window requires at least 64GB of memory
- 1M+ context window requires significantly more (100GB+)
Grouped by company, models within a family are closely related and continuous: individual capabilities get extended and adjusted, but the evolution of the base model and the techniques behind it stay relatively continuous.
- Leaderboard / Index / Ranking
- https://ollama.com/library
- https://livebench.ai/
- https://huggingface.co/open-llm-leaderboard
- https://lmarena.ai/
- https://www.vellum.ai/llm-leaderboard
- https://openrouter.ai/rankings
- https://aider.chat/docs/leaderboards/
- https://huggingface.co/models
- BFCL Leaderboard https://gorilla.cs.berkeley.edu/leaderboard.html
- Berkeley Function-Calling Leaderboard
- https://models.litellm.ai/
- https://arena.xlang.ai/leaderboard
- Pricing / Cost
- Visual
- microsoft/Florence-2-large
- MIT
- base 0.23B, large 0.77B
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
- Alibaba Cloud / Alibaba
- QwenLM
- Qwen
- QwQ 32B
- Qwen2 VL
- QwenLM/Qwen2.5
- Qwen2.5 Coder
- Qwen2.5 VL
- Qwen2.5 Math
- Collection Qwen/Qwen2.5-VL
- QwenLM/Qwen
- Qwen3
- Recommended parameters
- Thinking - temperature=0.6, top_p=0.95, top_k=20
- Non-thinking - temperature=0.7, top_p=0.8, top_k=20
- min_p=0.0
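The recommended Qwen3 sampling values can be kept as presets and attached to each request. A sketch that only builds the payload dict (the model name and any serving endpoint are placeholders, not from the source):

```python
# Qwen3 recommended sampling presets (from the notes above).
QWEN3_SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

def build_request(prompt: str, mode: str = "thinking") -> dict:
    """Assemble an OpenAI-style chat payload carrying the preset sampling params."""
    return {
        "model": "qwen3-30b-a3b",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        **QWEN3_SAMPLING[mode],
    }

req = build_request("hello", mode="non_thinking")
assert req["temperature"] == 0.7 and req["top_p"] == 0.8
```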
- Wan-Video
- HumanMLLM/R1-Omni
- Alibaba Tongyi Lab
- Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
- deepseek
- deepseek-ai/Janus
- Janus-Series: Unified Multimodal Understanding and Generation Models
- deepseek-ai/DeepSeek-R1
- MoE, GRPO, MLA, RL, MTP, FP8
- deepseek-ai/DeepSeek-V3
- deepseek-ai/DeepSeek-VL2
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
- DeepSeek-V2
- MLA
- Google
- SigLIP2
- SigLIP1
- google-deepmind/gemma
- Apache-2.0, Flax, JAX
- by Google DeepMind
- Ultra, Pro, Flash, Nano
- https://ai.google.dev/gemma/docs/core
- gemma 3
- 1B is text-only; 4B, 12B, 27B are vision + text. Trained on 14T tokens
- 128K context length, further trained from 32K; the 1B model is 32K
- Removed attention soft-capping; replaced with QK-norm
- 5 sliding-window + 1 global attention layers
- 1024-token sliding-window attention
- RL - BOND, WARM, WARP
- Recommended parameters: temperature=1.0, top_k=64, top_p=0.95, min_p=0.0
- ⚠️ Note: does not support returning object-detection coordinates
- https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d
- https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
- https://blog.google/technology/developers/gemma-3/
- https://huggingface.co/blog/gemma3
- https://blog.roboflow.com/gemma-3/
- https://docs.unsloth.ai/basics/gemma-3-how-to-run-and-fine-tune
- Ollama tools support for Gemma 3 https://github.com/ollama/ollama/issues/9680
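The "5 sliding + 1 global" note above describes Gemma 3's layer layout: five 1024-token sliding-window attention layers for every one full-attention layer. A sketch of that repeating pattern (the layer count and string labels are illustrative, not from the Gemma source):

```python
def gemma3_attention_pattern(n_layers: int) -> list[str]:
    """Repeat 5 sliding-window (1024-token) layers followed by 1 global layer."""
    return ["global" if (i + 1) % 6 == 0 else "sliding" for i in range(n_layers)]

pattern = gemma3_attention_pattern(12)
assert pattern[:6] == ["sliding"] * 5 + ["global"]
assert pattern.count("global") == 2
```

The point of the 5:1 ratio is KV-cache savings: sliding-window layers only cache the last 1024 tokens, so most layers stay cheap at long context.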
- ByteDance
- ByteDance-Seed/Seed1.5-VL
- bytedance-seed/BAGEL
- ByteDance-Seed/BAGEL-7B-MoT
- image understanding, generation, and editing
- https://huggingface.co/spaces/ByteDance-Seed/BAGEL
- Requirements
- 1024 × 1024 image generation needs 80GB vRAM
- runs on 4×16GB GPUs
- e.g. 1 minute on 3×RTX3090, 8 minutes on A100
- https://github.com/ByteDance-Seed/Bagel/issues/4
- https://github.com/neverbiasu/ComfyUI-BAGEL
- UI-TARS
- bytedance/MegaTTS3
- TTS Diffusion Transformer
- 0.45B
- Chinese, English
- Tencent
- https://huggingface.co/tencent
- Hunyuan
- Tencent-Hunyuan/HunyuanVideo-Avatar
- hf tencent/Hunyuan3D-2
- 2025-01-21
- hf tencent/HunyuanVideo
- Text-to-Video
- 2024-10-03
- Tencent/HunyuanVideo-I2V
- Image-to-Video
- 720P, 60GB vRAM - 80GB vRAM recommended
- 2025-03-06
- hf tencent/HunyuanVideo-I2V
- for diffusers hunyuanvideo-community/HunyuanVideo-I2V
- Microsoft
- phi
- phi4
- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- phi3
- microsoft/OmniParser
- detects UI interactive elements; screen analysis
- Pure Vision Based GUI Agent
- hf microsoft/OmniParser-v2.0
- demo microsoft/OmniParser-v2
- provider
- https://microsoft.github.io/WindowsAgentArena/
- microsoft/SoM
- Set-of-Mark Prompting for GPT-4V and LMMs
- Segment+Mark
- microsoft/Magma
- 2025-02-18
- A Foundation Model for Multimodal AI Agents
- CLIP-ConvneXt-XXLarge
- usecase/capabilities
- Visual Planning and Action
- Robotics Manipulation
- Environment Interaction
- Visual Navigation
- advanced image-text understanding
- https://huggingface.co/MagmaAI
- https://huggingface.co/microsoft/Magma-8B
- demo microsoft/Magma-UI
- OLLAMA https://github.com/ollama/ollama/issues/9366
- Cohere for AI
- https://huggingface.co/CohereLabs/aya-vision-32b
- multilingual, 23 languages
- OCR
- Qwen VL
- InternVL2
- mindee/doctr
- Apache-2.0, Python, TensorFlow 2, PyTorch
- Document Text Recognition
- ⚠️ French vocab; Chinese not supported
- https://huggingface.co/spaces/mindee/doctr
- Multilingual support mindee/doctr#1699
- PaddlePaddle/PaddleOCR
- RapidOCR
- tesseract
- surya
- breezedeus/Pix2Text
- Yuliang-Liu/MonkeyOCR
- MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
- 2025-06-05
- 3B
- supports Chinese and English
- hf echo840/MonkeyOCR
- Document Parsing LMM
- strong at formula recognition and table recognition
- traditional pipeline: layout analysis, OCR, structure analysis
- SRR (Structure-Recognition-Relation)
- Structure Detection: grasp the document's macro layout at a glance - whether something is a table, a formula, or a plain paragraph.
- Content Recognition: OCR the text inside each structure as it is detected.
- Relationship Prediction: understand how the structures relate, e.g. which paragraph a chart illustrates.
- LLaMA based
- Vicuna
- haotian-liu/LLaVA
- LLaVA (Large Language and Vision Assistant)
- Vicuna + CLIP
- OpenGVLab
- command-a
- llama2
- 7B, 13B, 70B
- uncensored/abliterated/CensorTune
- Sumandora/remove-refusals-with-transformers
- https://huggingface.co/NSFW-API/NSFW_Wan_1.3b
- https://huggingface.co/huihui-ai
- https://huggingface.co/datasets/Guilherme34/uncensor
- https://huggingface.co/models?search=uncensored
- https://erichartford.com/uncensored-models
- https://www.pixiv.net/novel/show.php?id=21039830
- microsoft/BitNet
- MIT, C++, Python
- by Microsoft
- HN
- vicuna
- mistral
- mixtral
- Flan
- Alpaca
- GPT4All
- Chinese LLaMA
- Vigogne (French)
- LLaMA
- Databricks Dolly 2.0
- https://huggingface.co/stabilityai/stable-diffusion-2
- togethercomputer/OpenChatKit
- Alpaca
- based on LLaMA + instruction tuning
- FlagAI-Open/FlagAI
- hpcaitech/ColossalAI
- BlinkDL/ChatRWKV
- ChatGPT-like
- RWKV (100% RNN)
- nebuly-ai/nebullvm
- FMInference/FlexGen
- EssayKillerBrain/WriteGPT
- GPT-2
- ymcui/Chinese-LLaMA-Alpaca
- https://www.promptingguide.ai/zh/models/collection
- Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models
- RedPajama-Data-v2
- hysts/ControlNet-v1-1
- ggml
- ggerganov/ggml
- MIT, C
- .pth - PyTorch
- checklist.chk - MD5
- params.json -
{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1}
- Saving & Loading Models
- https://medium.com/geekculture/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76
- https://erichartford.com/uncensored-models
- https://huggingface.co/spaces/facebook/seamless_m4t
- https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
- Jina AI 8k text embedding
- Models
# AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
grep avx /proc/cpuinfo --color # x86_64
Computer Agent
- Qwen 2.5 VL 72B
- UI-TARS
- web-infra-dev/Midscene
- MIT, TypeScript
- UI-TARS, Qwen-2.5-VL
- https://platform.openai.com/docs/models/computer-use-preview
- https://arena.xlang.ai/leaderboard
Chinese
- Qwen2
- LlamaFamily/Llama-Chinese
- UnicomAI/Unichat-llama3-Chinese
- China Unicom fine-tune of llama3
- https://github.com/datawhalechina/self-llm
Fine-tuning
Audio
- TTS, Dialogue, Audio, Speech, Voice
- TTS
- Text Analysis
- Acoustic
- Vocoder
- https://huggingface.co/spaces/TTS-AGI/TTS-Arena
- https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2
- STT / Speech-to-Text / ASR / Automatic Speech Recognition / Speech Recognition
- Turn / VAD / Voice Activity Detection
- VAD (voice activity detection)
- traditional detection approach; it cannot understand language, so it easily produces false-positive detections
- https://huggingface.co/livekit/turn-detector
- https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector
- https://huggingface.co/spaces/webml-community/conversational-webgpu
- Selection factors
- Voice quality: synthesized speech should be natural, fluent, intelligible, comfortable for long listening, and able to convey emotion.
- Personalization: custom timbre, speaking rate, intonation, etc., for different scenarios and brand needs.
- Language support
- Latency: low latency suits real-time interaction such as voice assistants; non-real-time applications are less demanding.
- Resource requirements
- Licensing and usage: check the license and usage restrictions, including whether attribution is required.
- KokoroTTS
- VITA-MLLM/VITA-Audio
- ASR, TTS, SpokenQA
- https://huggingface.co/spaces/shenyunhang/VITA-Audio
- resemble-ai/chatterbox
- MIT, Python
- the open-source release supports English only
- hf ResembleAI/chatterbox
- FunAudioLLM/CosyVoice
- Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjin, Wuhan, etc.)
- hf FunAudioLLM/CosyVoice2-0.5B
- yl4579/HiFTNet
- THUDM/GLM-4-Voice
- Chinese-English speech dialogue model
- https://huggingface.co/THUDM/glm-4-voice-tokenizer
- https://huggingface.co/THUDM/glm-4-voice-decoder
- a speech decoder retrained from CosyVoice with streaming inference support; converts discrete speech tokens into continuous speech output
- https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K
- shivammehta25/Matcha-TTS
- modelscope/FunASR
- MIT, Python
- ASR, VAD
- nari-labs/dia
- text to dialogue
- English only
- https://huggingface.co/spaces/nari-labs/Dia-1.6B
- Commercial
- https://huggingface.co/hf-audio
case | description | notes | models |
---|---|---|---|
Virtual assistants | natural voice responses for AI assistants, smoother interaction | highly natural, low latency, multiple emotions | XTTS-v2, MeloTTS, F5-TTS, ChatTTS |
Accessibility | spoken content for visually impaired and learning-disabled users | high clarity, easy to understand, stable | MeloTTS, Bark |
Content creation | professional narration for podcasts, audiobooks, etc. | varied voices, rich emotion, natural prosody | XTTSv2, F5-TTS, GPT-SoVITS-v2 |
Automated customer service | IVR systems for efficient automated support | clear and stable, highly customizable | Piper, ParlerTTS, XTTSv2 |
Voice kiosks | interactive voice response for self-service terminals | fast response, clear and intelligible | Piper, MeloTTS |
STT
- STT - Speech to Text
- ASR - Automatic Speech Recognition
- modelscope/FunASR
MLLM
- Multimodal Large Language Model
- Architecture: vision encoder + projector + language model
- Vision Model
- ViT
- Language Model
- Projector / Vision-Language Adapter
- aligns the image features extracted by the vision model with the language model's representation space
- Cross-Attention Module
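At its simplest, the projector is a learned linear map from the vision encoder's feature width to the LM's hidden width, turning patch features into "image tokens". A NumPy sketch (all dimensions are illustrative, and the random weights stand in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_lm, n_patches = 1024, 4096, 256  # illustrative sizes

# Learned projection weights (random here; trained jointly in a real MLLM).
W = rng.normal(scale=0.02, size=(d_vision, d_lm))
b = np.zeros(d_lm)

def project(patch_features: np.ndarray) -> np.ndarray:
    """Map ViT patch features into the language model's embedding space."""
    return patch_features @ W + b

image_tokens = project(rng.normal(size=(n_patches, d_vision)))
assert image_tokens.shape == (n_patches, d_lm)  # ready to prepend to text embeddings
```

Real projectors vary (LLaVA uses a linear layer or small MLP; other models use cross-attention resamplers), but the alignment role is the same.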
Vision
- Document OCR
- Handwriting OCR
- Visual QA / Image QA
- Visual Reasoning
- Image Classification
- Document Understanding
- Video Understanding
- Object Detection
- Object Counting
- Agent - screen understanding and operation
- Object Grounding
- returns bounding-box coordinates
- visual grounding poor performance after fine-tuning 2U1/Qwen2-VL-Finetune#77
- Qwen2 VL
- factor=28
- SmolVLM 256M
- 64 image tokens per 512px image
- https://huggingface.co/spaces/webml-community/smolvlm-realtime-webgpu
- 500M, SmolVLM
- https://www.reddit.com/r/LocalLLaMA/comments/1kmi6vl
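The `factor=28` note for Qwen2 VL above means image sides are resized to multiples of 28 before patching. A rough sketch of that rounding (a simplification of the processor's actual resize logic, which also enforces min/max pixel budgets):

```python
def round_to_factor(height: int, width: int, factor: int = 28) -> tuple[int, int]:
    """Round each side to the nearest multiple of `factor` (at least one factor)."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    return h, w

assert round_to_factor(1080, 1920) == (1092, 1932)
assert round_to_factor(10, 10) == (28, 28)
```

This matters for grounding: returned bounding-box coordinates refer to the resized image, so they must be scaled back to the original resolution.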
- References
Coding
Video
- end-to-end pipeline
- Flow
- harry0703/MoneyPrinterTurbo
- Lightricks/LTX-Video
- 30 FPS, 1216×704
- text-to-image, image-to-video, keyframe-based animation, video extension, video-to-video
- Wan-Video
Generation Media
- Text to Image, Video, Audio
- Image Inpainting
- Image Variation
- Text-guided
- Upscale, Super Resolution
- huggingface/diffusers
- Commercial / Platform / Hub / Router
- FLUX.1
- https://playground.bfl.ai/
- collection FLUX.1
- hf black-forest-labs/FLUX.1-dev
- 12B
- FLUX.1 Kontext
- https://huggingface.co/lodestones/Chroma
- https://genai-showdown.specr.net/
Problem areas
- Prompt adherence
- Generation quality
- Instructiveness
- Consistency of styles, characters, settings, etc.
- Deliberate and exact intentional posing of characters and set pieces
- Compositing different images or layers together
- Relighting
- Posing built into the model. No ControlNet hacks.
- References built into the model. No IPAdapter, no required character/style LoRAs, etc.
- Ability to address objects, characters, mannequins, etc. for deletion / insertion.
- Ability to pull sources from across multiple images with or without "innovation" / change to their pixels.
- Fine-tunable (so we can get higher quality and precision)
Generative Marketing
Media
- Kling (可灵) 2.1:
  - image-to-video with solid results
  - dynamic facial expressions, large camera moves, precise gesture control, singing lip-sync
  - rotating camera moves and lip-synced performances
- Veo 3:
  - generates video from text prompts
  - simulates live-action footage
- Sora:
  - restyles existing video
- Pika:
  - swap or add content within a scene
- Runway:
  - reference people, places, or styles (Gen-3)
- Luma:
  - reframe video to new aspect ratios
- Hedra:
  - make characters talk (lip sync)
- Jimeng (即梦):
  - many videos online are made with it
  - Jimeng Omnihuman: strong at static lip-sync
- Vidu:
  - anime-style performances
- Viggle:
  - add characters into meme videos (character motion transfer)
- Higgsfield
  - Hollywood-grade visual effects
- Jianying Pro (剪映专业版):
  - powerful, with rich assets and effects; a must-have for video editing
- Krea
  - uses open-source models such as Wan or Hunyuan
- Meitu (美图秀秀):
  - draw directly; familiar to most users
Text
- Doubao (豆包):
  - focused on emotional, everyday scenarios
- Kimi:
  - professional long-form writing; handles large amounts of content
- Deepseek:
  - writes code with remarkably few mistakes; impressively strong
- Zhihu (知乎):
  - a must for fans of Zhihu articles
- gamma:
  - top-tier slide generation; builds customized decks directly from your article
- MindShow:
  - turns a text outline into a mind map, with one-click conversion to a presentation
Design
- Gaoding Design (稿定设计):
  - covers graphic design, e-commerce design, etc., with a large library of editable templates
- Eqxiu (易企秀):
  - quickly builds H5 pages; many template types, good for event and product promotion
Search
- https://felo.ai/
  - an excellent Xiaohongshu search tool that too few people know about
Avatar
- digital human
- HunyuanVideo-Avatar
Text-to-Image
Pony
- finetune on SDXL
- trained on 2.5 million furry/anthro/cartoon/anime images
- recognizes many anime characters out of the box, no LoRA needed
Diffusion Models
[subject description] [scene construction] [photography parameters] [mood reinforcement] [extra details]
- scene construction
- clothing details
- dynamic pose
- lighting and mood
Resolution
- 1:1
- 512x512
- 768x768
- 1024x1024
- 4:3
- 16:9
- 1216x704
- Portrait
- 832x1216
- Landscape
- 1216x832
Negative
text
watermark
camera
out of frame, lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature,
(worst quality, low quality:1.4), (ugly:1.2), (stitching:1.2),
bad anatomy, deformed, disfigured, malformed limbs, extra limbs, fused limbs,
poorly drawn face, distorted face, malformed face, asymmetric eyes,
poorly drawn hands, extra fingers, fused fingers, malformed hands,
text, error, signature, watermark, username
Dynamic pose
Clothing details
trendy off-shoulder top
oversized cozy sweater
t-shirt with a cute cat print
Quality
(shot on Sony A7 IV, 50mm f/1.8 lens)
photorealistic
ultra detailed
natural skin texture
soft film grain
8k uhd
Mood reinforcement
- mood
serene, tranquil, intimate, modern elegance
- color
Warm neutrals with pops of soft pastels in window light
- motion
Subtle light reflection on sweater fabric, gentle light diffusion across the room
Photography parameters
- visual style
- lighting treatment
Juggernaut XL
- SDXL 1.0
- Juggernaut Ragnarok
- Focuses on improving photorealism, digital painting, character poses, hands, and feet.
- Built on Jug XII: first trained on a photography dataset, re-captioned with Booru tags, with SDXL as the base. The author then retrained the same dataset on top of Lustify by Coyotte and merged the two at a certain ratio as an output stabilizer. Because the dataset is captioned with Booru tags, both Booru-style prompts and the descriptive style of versions X-XII work well with Ragnarok.
- Suited to high-quality photorealistic image generation, but as an SDXL model it still has limitations such as distant faces and text rendering. Recommended as one stage of a generation pipeline (e.g. FluxDev / Pixelwave / Jug Flux Pro → Juggernaut Ragnarok) for best results. Fully open source: free to merge, fine-tune, and use commercially.
Base Model | SDXL 1.0 |
---|---|
Resolution | 832x1216 for Portrait |
Sampler | DPM++ 2M SDE |
Steps | 30-40 |
CFG | 3-6 (less is a bit more realistic) |
VAE | ✅ |
HiRes | 4xNMKD-Siax_200k with 15 Steps and 0.3 Denoise + 1.5 Upscale |
- https://huggingface.co/RunDiffusion/Juggernaut-XI-v11
- https://civitai.com/models/133005/juggernaut-xl
CyberRealistic Pony
CyberRealistic Pony combines the stylized appeal of Pony Diffusion with the photorealistic quality of CyberRealistic.
- CyberRealistic Pony https://civitai.com/models/443821/cyberrealistic-pony
Base Model | Pony |
---|---|
Resolution | 896x1152 / 832x1216 |
Sampler | DPM++ SDE Karras / DPM++ 2M Karras / Euler a |
Steps | 30+ Steps |
CFG | 5 |
Clip Skip | 2 |
Positive
score_9, score_8_up, score_7_up, (SUBJECT),
Negative
score_6, score_5, score_4, (worst quality:1.2), (low quality:1.2), (normal quality:1.2), lowres, bad anatomy, bad hands, signature, watermarks, ugly, imperfect eyes, skewed eyes, unnatural face, unnatural body, error, extra limb, missing limbs
score_6, score_5, score_4, simplified, abstract, unrealistic, impressionistic, low resolution, lowres, bad anatomy, bad hands, missing fingers, worst quality, low quality, normal quality, cartoon, anime, drawing, sketch, illustration, artificial, poor quality
ADetailer
Adetailer model: face_yolov9c.pt
If you only want the main face refined, set 'Mask only the top k largest' to 1.
Metric
abbr. | stands for | better | meaning | notes |
---|---|---|---|---|
WER | Word Error Rate | ⬇️ L | fraction of words transcribed incorrectly | STT |
RTFx | Real-Time Factor (inverted) | ⬆️ H | audio seconds transcribed per second of compute | STT |
CER | Character Error Rate | ⬇️ L | fraction of characters transcribed incorrectly | STT |
PER | Phoneme Error Rate | ⬇️ L | fraction of phonemes recognized incorrectly | STT |
- WER = (S + D + I) / N = (S + D + I) / (S + D + C)
- S = Substitutions
- D = Deletions
- I = Insertions
- C = Correct
- N = Total number of words
- N=S+D+C
- RTFx = (number of seconds of audio inferred) / (compute time in seconds)
- RTFx = 1/RTF
- huggingface/evaluate
- https://huggingface.co/spaces/evaluate-metric/wer
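The WER formula above, (S + D + I) / N, is computed in practice as a word-level Levenshtein edit distance between reference and hypothesis. A small reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / correct
    return d[len(ref)][len(hyp)] / len(ref)

assert wer("the cat sat", "the cat sat") == 0.0
assert wer("the cat sat", "the cat sat down") == 1 / 3  # one insertion
```

Note WER can exceed 1.0 when the hypothesis has many insertions, which is why leaderboards report it alongside RTFx rather than as a bounded percentage.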