Models
Date | Model | Size | Context Window | Creator | tags |
---|---|---|---|---|---|
2025-06-11 | Magistral | small 24B | 39K | Mistral AI | Reasoning, Multilingual |
2025-06-07 | Comma v0.1 | 7B | | EleutherAI | Full OSS, English |
2025-06-05 | Qwen3-Embedding | 0.6b, 4b, 8b | 32K | Alibaba | Embedding, Reranking, Multilingual(100+), Instruction Aware, MRL(1024, 2560, 4096) |
 | Phi-4 | 14b | 128K | Microsoft | mini-reasoning, reasoning, multimodal |
2025-05-28 | DeepSeek R1 0528 | | | DeepSeek AI | Update |
2025-05-26 | QwenLong-L1 | 32b | 120K | Alibaba | text |
2025-05-20 | Gemma3n | 8b-e2b, 8b-e4b | | Google DeepMind | Edge, PLE |
2025-04-29 | Qwen3 | 0.6b, 1.7b, 4b, 8b, 14b, 30b, 32b, 235b, 30b-a3b, 235b-a22b | 40K | Alibaba | MoE, Reasoning |
2025-04-05 | Llama 4 | scout 109b-a17b, maverick 400b-a17b, behemoth 2T | 1M, 10M | Meta | MoE, Vision |
2025-03-26 | Qwen2.5-Omni | 3B, 7B | | Alibaba | text, audio, image, video, speech |
2025-03-12 | Gemma3 | 1b, 4b, 12b, 27b | 128K | Google DeepMind | Vision |
2025-02-26 | Wan 2.1 | 1.3b, 14b | | Alibaba | t2v, 480p, 720p |
2025-02-24 | smollm2 | 135m, 360m, 1.7b | 8K | HuggingFaceTB | |
2025-01-28 | Qwen2.5-VL | 3b, 7b, 32b, 72b | 125K | Alibaba | Vision |
2025-01-28 | Qwen2.5 | 0.5b, 1.5b, 3b, 7b, 14b, 32b, 72b | 32K,1M | Alibaba | |
2025-01-20 | DeepSeek R1 | 1.5b, 7b, 8b, 14b, 32b, 70b, 671b | 128K | DeepSeek AI | Reasoning |
2024-12-07 | Llama 3.3 | 70B | 128K | Meta | |
2024-10-05 | LLaVA | 7b, 13b, 34b | 4K, 32K | | Vision |
2024-09-25 | Llama 3.2 | 1B, 3B, 11B, 90B | 128K | Meta | |
2024-07-23 | Llama 3.1 | 8B, 70.6B, 405B | 128K | Meta | |
2024-06-27 | Gemma 2 | 9b, 27.2b | 8K | Google DeepMind | |
2024-06-07 | Qwen2 | 0.5b, 1.5b, 7b, 57b (A14b), 72b | 32K, 64K, 128K | Alibaba | |
2024-04-23 | Phi-3 | 3.8b , 7b , 14b | 4K, 128K | Microsoft | |
2024-04-18 | Llama 3 | 8b, 70.6b | 8K, 128K | Meta | |
2024-02-21 | Gemma | 2b, 7b | 8K | Google DeepMind | |
2023-12-11 | Mistral | 7b, 46.7b (8x7B MoE) | 33K | Mistral AI | |
2023-07-18 | Llama 2 | 6.7b, 13b, 69b | 4K | Meta | |
2023-02-24 | LLaMA | 6.7B, 13B, 32.5B, 65.2B | 2K | Meta | |
2020-06-11 | GPT-3 | 175b | 2K | OpenAI | |
2019-02-14 | GPT-2 | 1.5b | 1K | OpenAI | |
2018-06-11 | GPT-1 | 117m | 512 | OpenAI |
Proprietary Models
release | model | author | notes |
---|---|---|---|
2025-04-17 | Gemini 2.5 Flash | Google | |
2025-04-14 | GPT-4.1, mini, nano | OpenAI | |
2025-03-25 | Gemini 2.5 Pro | Google | 2M |
2025-02-05 | Gemini 2.0 Flash | Google | audio, video |
2025-02-01 | Gemini 2.0 Flash-Lite | Google | |
2025-01-10 | o3, o3-mini | OpenAI | Reasoning |
2024-12-17 | o1 | OpenAI | |
2024-09-12 | o1-preview | OpenAI | Reasoning |
2024-07-18 | GPT-4o mini | OpenAI | |
2024-05-13 | GPT-4o | OpenAI | text, audio, image |
2024-03-04 | Claude 3 Haiku, Sonnet, Opus | Anthropic | 200K |
2024-02-15 | Gemini 1.5 Pro | Google | breakthrough 1M-token long-context window |
2023-12-06 | Gemini 1.0 Pro | Google | natively multimodal model family |
2023-11-21 | Claude 2.1 | Anthropic | 200K |
2023-11-06 | GPT-4V | OpenAI | 128K, Vision |
2023-11-06 | GPT-4 Turbo | OpenAI | 128K |
2023-07-11 | Claude 2 | Anthropic | 100K |
2023-06-27 | GPT-3.5-16k | OpenAI | 16K |
2023-03-14 | GPT-4 | OpenAI | 8K, 32K, image |
2023-03-01 | GPT-3.5-turbo | OpenAI | 4K |
2022-11-30 | GPT-3.5 | OpenAI | 4K |
abbr. | stands for | meaning |
---|---|---|
MRL | Matryoshka Representation Learning | |
R2V | reference-to-video | |
MV2V | masked video-to-video | |
V2V | video-to-video | |
MoE | Mixture of Experts | mixture-of-experts model |
VACE | Video Animation, Composition, and Editing | |
- MRL - Matryoshka Representation Learning
- Embedding models support customizing the dimensionality of the final embedding
- https://huggingface.co/blog/matryoshka
- VACE: All-in-One Video Creation and Editing
- 2025-03-11
- VACE - Video Animation, Composition, and Editing
- an all-in-one video processing framework
- r2v, mv2v, v2v
- https://arxiv.org/abs/2503.07598
- R2V - reference-to-video
- MV2V - masked video-to-video
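The MRL note above means an embedding trained with Matryoshka Representation Learning can simply be truncated to a smaller dimension and re-normalized, with little quality loss. A minimal NumPy sketch (the random vector is a stand-in for real model output; the 1024/4096 sizes mirror the Qwen3-Embedding MRL dims listed above):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL-trained embedding, then L2-renormalize."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

# Random stand-in for a 4096-dim embedding (not real model output).
full = np.random.default_rng(0).normal(size=4096)
small = truncate_embedding(full, 1024)
assert small.shape == (1024,)
assert abs(np.linalg.norm(small) - 1.0) < 1e-9
```

Truncation works because MRL training packs the most important information into the leading dimensions; cosine similarity on the truncated vectors stays meaningful.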
*-pt
- Pre-Training - pretrained model: initial training on a large-scale dataset to learn language patterns and structure.
- Suitable as a base model for developers to fine-tune further on specific tasks.
*-ft
- Fine-tuned
*-it
- Instruction Tuning - instruction-tuned model: further fine-tuned on top of the pretrained model for specific tasks or instructions.
- Better suited for direct use in real applications, since it is already optimized for specific purposes.
- Estimating memory usage
- parameters × precision
- the ideal precision today is float16 / bfloat16 - each parameter takes 16 bits
- 1B -> 2GB
- quantized parameters - int4 is a common quantization
- 1B -> 0.5GB
- https://huggingface.co/datasets/christopherthompson81/quant_exploration
- Q4_0 - worse accuracy but higher speed
- Q4_1 - more accurate but slower
- Q4_2, Q4_3 - newer generations of Q4_0 and Q4_1, more accurate
- https://github.com/ggerganov/llama.cpp/discussions/406
- 7B - 8GB RAM
- 13B - 16GB RAM
- 70B - 32GB/48GB RAM
- a small context window is fine for RAG
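The parameters × precision rule above can be written out directly (a weights-only estimate; it ignores KV cache and activations, which is why quantized models still need headroom):

```python
def model_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Weights-only estimate: parameters × precision (KV cache, activations excluded)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

assert model_memory_gb(1, 16) == 2.0   # float16/bfloat16: 1B params -> 2 GB
assert model_memory_gb(1, 4) == 0.5    # int4 quantization: 1B params -> 0.5 GB
```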
- Context Window
- LLama-3 8B 8K-1M https://ollama.com/library/llama3-gradient
- 256k context window requires at least 64GB of memory
- 1M+ context window requires significantly more (100GB+)
Grouped by company, models within a family are closely related and continuous: individual capabilities get extended and adjusted, but the evolution of the base model and the techniques behind it stay relatively continuous.
- Leaderboard / Index / Ranking
- https://ollama.com/library
- https://livebench.ai/
- https://huggingface.co/open-llm-leaderboard
- https://lmarena.ai/
- https://www.vellum.ai/llm-leaderboard
- https://openrouter.ai/rankings
- https://aider.chat/docs/leaderboards/
- https://huggingface.co/models
- BFCL Leaderboard https://gorilla.cs.berkeley.edu/leaderboard.html
- Berkeley Function-Calling Leaderboard
- https://models.litellm.ai/
- https://arena.xlang.ai/leaderboard
- Pricing / Cost
- Visual
- microsoft/Florence-2-large
- MIT
- base 0.23B, large 0.77B
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
- Alibaba Cloud / Alibaba
- QwenLM
- Qwen
- QwQ 32B
- Qwen2 VL
- QwenLM/Qwen2.5
- Qwen2.5 Coder
- Qwen2.5 VL
- Qwen2.5 Math
- Collection Qwen/Qwen2.5-VL
- QwenLM/Qwen
- Qwen3
- Recommended parameters
- Thinking - temperature=0.6, top_p=0.95, top_k=20
- Non-thinking - temperature=0.7, top_p=0.8, top_k=20
- min_p=0.0
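The recommended Qwen3 sampling values can be kept as presets and attached to each request. A sketch that only builds the payload dict (the model name and any serving endpoint are placeholders, not from the source):

```python
# Qwen3 recommended sampling presets (from the notes above).
QWEN3_SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

def build_request(prompt: str, mode: str = "thinking") -> dict:
    """Assemble an OpenAI-style chat payload carrying the preset sampling params."""
    return {
        "model": "qwen3-30b-a3b",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        **QWEN3_SAMPLING[mode],
    }

req = build_request("hello", mode="non_thinking")
assert req["temperature"] == 0.7 and req["top_p"] == 0.8
```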
- Wan-Video
- HumanMLLM/R1-Omni
- Alibaba Tongyi Lab
- Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
- deepseek
- deepseek-ai/Janus
- Janus-Series: Unified Multimodal Understanding and Generation Models
- deepseek-ai/DeepSeek-R1
- MoE, GRPO, MLA, RL, MTP, FP8
- deepseek-ai/DeepSeek-V3
- deepseek-ai/DeepSeek-VL2
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
- DeepSeek-V2
- MLA
- Google
- SigLIP2
- SigLIP1
- google-deepmind/gemma
- Apache-2.0, Flax, JAX
- by Google DeepMind
- Ultra, Pro, Flash, Nano
- https://ai.google.dev/gemma/docs/core
- gemma 3
- 1B is text-only; 4B, 12B, 27B are vision + text. Trained on 14T tokens
- 128K context length, further trained from 32K; the 1B model is 32K
- Removed attention soft-capping; replaced with QK-norm
- 5 sliding-window + 1 global attention layers
- 1024-token sliding-window attention
- RL - BOND, WARM, WARP
- Recommended parameters: temperature=1.0, top_k=64, top_p=0.95, min_p=0.0
- ⚠️ Note: does not support returning object-detection coordinates
- https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d
- https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
- https://blog.google/technology/developers/gemma-3/
- https://huggingface.co/blog/gemma3
- https://blog.roboflow.com/gemma-3/
- https://docs.unsloth.ai/basics/gemma-3-how-to-run-and-fine-tune
- Ollama tools support for Gemma 3 https://github.com/ollama/ollama/issues/9680
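The "5 sliding + 1 global" note above describes Gemma 3's layer layout: five 1024-token sliding-window attention layers for every one full-attention layer. A sketch of that repeating pattern (the layer count and string labels are illustrative, not from the Gemma source):

```python
def gemma3_attention_pattern(n_layers: int) -> list[str]:
    """Repeat 5 sliding-window (1024-token) layers followed by 1 global layer."""
    return ["global" if (i + 1) % 6 == 0 else "sliding" for i in range(n_layers)]

pattern = gemma3_attention_pattern(12)
assert pattern[:6] == ["sliding"] * 5 + ["global"]
assert pattern.count("global") == 2
```

The point of the 5:1 ratio is KV-cache savings: sliding-window layers only cache the last 1024 tokens, so most layers stay cheap at long context.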
- ByteDance
- ByteDance-Seed/Seed1.5-VL
- bytedance-seed/BAGEL
- ByteDance-Seed/BAGEL-7B-MoT
- image understanding, generation, and editing
- https://huggingface.co/spaces/ByteDance-Seed/BAGEL
- Requirements
- 1024 × 1024 image generation needs 80GB vRAM
- runs on 4×16GB GPUs
- e.g. 1 minute on 3×RTX3090, 8 minutes on A100
- https://github.com/ByteDance-Seed/Bagel/issues/4
- https://github.com/neverbiasu/ComfyUI-BAGEL
- UI-TARS
- bytedance/MegaTTS3
- TTS Diffusion Transformer
- 0.45B
- Chinese, English
- Tencent
- https://huggingface.co/tencent
- Hunyuan
- Tencent-Hunyuan/HunyuanVideo-Avatar
- hf tencent/Hunyuan3D-2
- 2025-01-21
- hf tencent/HunyuanVideo
- Text-to-Video
- 2024-10-03
- Tencent/HunyuanVideo-I2V
- Image-to-Video
- 720P, 60GB vRAM - 80GB vRAM recommended
- 2025-03-06
- hf tencent/HunyuanVideo-I2V
- for diffusers hunyuanvideo-community/HunyuanVideo-I2V
- Microsoft
- phi
- phi4
- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- phi3
- microsoft/OmniParser
- detects UI interactive elements; screen analysis
- Pure Vision Based GUI Agent
- hf microsoft/OmniParser-v2.0
- demo microsoft/OmniParser-v2
- provider
- https://microsoft.github.io/WindowsAgentArena/
- microsoft/SoM
- Set-of-Mark Prompting for GPT-4V and LMMs
- Segment+Mark
- microsoft/Magma
- 2025-02-18
- A Foundation Model for Multimodal AI Agents
- CLIP-ConvneXt-XXLarge
- usecase/capabilities
- Visual Planning and Action
- Robotics Manipulation
- Environment Interaction
- Visual Navigation
- advanced image-text understanding
- https://huggingface.co/MagmaAI
- https://huggingface.co/microsoft/Magma-8B
- demo microsoft/Magma-UI
- OLLAMA https://github.com/ollama/ollama/issues/9366
- Cohere for AI
- https://huggingface.co/CohereLabs/aya-vision-32b
- multilingual, 23 languages
- OCR
- Qwen VL
- InternVL2
- mindee/doctr
- Apache-2.0, Python, TensorFlow 2, PyTorch
- Document Text Recognition
- ⚠️ French vocab; Chinese not supported
- https://huggingface.co/spaces/mindee/doctr
- Multilingual support mindee/doctr#1699
- PaddlePaddle/PaddleOCR
- RapidOCR
- tesseract
- surya
- breezedeus/Pix2Text
- Yuliang-Liu/MonkeyOCR
- MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
- 2025-06-05
- 3B
- supports Chinese and English
- hf echo840/MonkeyOCR
- Document Parsing LMM
- strong at formula recognition and table recognition
- traditional pipeline: layout analysis, OCR, structure analysis
- SRR (Structure-Recognition-Relation)
- Structure Detection: grasp the document's macro layout at a glance - whether something is a table, a formula, or a plain paragraph.
- Content Recognition: OCR the text inside each structure as it is detected.
- Relationship Prediction: understand how the structures relate, e.g. which paragraph a chart illustrates.
- LLaMA based
- Vicuna
- haotian-liu/LLaVA
- LLaVA (Large Language and Vision Assistant)
- Vicuna + CLIP
- OpenGVLab
- command-a
- llama2
- 7B, 13B, 70B
- uncensored/abliterated/CensorTune
- Sumandora/remove-refusals-with-transformers
- https://huggingface.co/NSFW-API/NSFW_Wan_1.3b
- https://huggingface.co/huihui-ai
- https://huggingface.co/datasets/Guilherme34/uncensor
- https://huggingface.co/models?search=uncensored
- https://erichartford.com/uncensored-models
- https://www.pixiv.net/novel/show.php?id=21039830
- microsoft/BitNet
- MIT, C++, Python
- by Microsoft
- HN
- vicuna
- mistral
- mixtral
- Flan
- Alpaca
- GPT4All
- Chinese LLaMA
- Vigogne (French)
- LLaMA
- Databricks Dolly 2.0
- https://huggingface.co/stabilityai/stable-diffusion-2
- togethercomputer/OpenChatKit
- Alpaca
- based on LLaMA + instruction tuning
- FlagAI-Open/FlagAI
- hpcaitech/ColossalAI
- BlinkDL/ChatRWKV
- ChatGPT-like
- RWKV (100% RNN)
- nebuly-ai/nebullvm
- FMInference/FlexGen
- EssayKillerBrain/WriteGPT
- GPT-2
- ymcui/Chinese-LLaMA-Alpaca
- https://www.promptingguide.ai/zh/models/collection
- Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models
- RedPajama-Data-v2
- hysts/ControlNet-v1-1
- ggml
- ggerganov/ggml
- MIT, C
- .pth - PyTorch
- checklist.chk - MD5
- params.json -
{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1}
- Saving & Loading Models
- https://medium.com/geekculture/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76
- https://erichartford.com/uncensored-models
- https://huggingface.co/spaces/facebook/seamless_m4t
- https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
- Jina AI 8k text embedding
- Models
# AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
grep avx /proc/cpuinfo --color # x86_64
Computer Agent
- Qwen 2.5 VL 72B
- UI-TARS
- web-infra-dev/Midscene
- MIT, TypeScript
- UI-TARS, Qwen-2.5-VL
- https://platform.openai.com/docs/models/computer-use-preview
- https://arena.xlang.ai/leaderboard
Chinese
- Qwen2
- LlamaFamily/Llama-Chinese
- UnicomAI/Unichat-llama3-Chinese
- China Unicom fine-tune of llama3
- https://github.com/datawhalechina/self-llm
Fine-tuning
Audio
- TTS, Dialogue, Audio, Speech, Voice
- TTS
- Text Analysis
- Acoustic
- Vocoder
- https://huggingface.co/spaces/TTS-AGI/TTS-Arena
- https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2
- STT / Speech-to-Text / ASR / Automatic Speech Recognition / Speech Recognition
- Turn / VAD / Voice Activity Detection
- VAD (voice activity detection)
- traditional detection approach; it cannot understand language, so it easily produces false-positive detections
- https://huggingface.co/livekit/turn-detector
- https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector
- https://huggingface.co/spaces/webml-community/conversational-webgpu
- Selection factors
- Voice quality: synthesized speech should be natural, fluent, intelligible, comfortable for long listening, and able to convey emotion.
- Personalization: custom timbre, speaking rate, intonation, etc., for different scenarios and brand needs.
- Language support
- Latency: low latency suits real-time interaction such as voice assistants; non-real-time applications are less demanding.
- Resource requirements
- Licensing and usage: check the license and usage restrictions, including whether attribution is required.
- KokoroTTS
- VITA-MLLM/VITA-Audio
- ASR, TTS, SpokenQA
- https://huggingface.co/spaces/shenyunhang/VITA-Audio
- resemble-ai/chatterbox
- MIT, Python
- the open-source release supports English only
- hf ResembleAI/chatterbox
- FunAudioLLM/CosyVoice
- Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjin, Wuhan, etc.)
- hf FunAudioLLM/CosyVoice2-0.5B
- yl4579/HiFTNet
- THUDM/GLM-4-Voice
- Chinese-English speech dialogue model
- https://huggingface.co/THUDM/glm-4-voice-tokenizer
- https://huggingface.co/THUDM/glm-4-voice-decoder
- a speech decoder retrained from CosyVoice with streaming inference support; converts discrete speech tokens into continuous speech output
- https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K
- shivammehta25/Matcha-TTS
- modelscope/FunASR
- MIT, Python
- ASR, VAD
- nari-labs/dia
- text to dialogue
- English only
- https://huggingface.co/spaces/nari-labs/Dia-1.6B
- Commercial
- https://huggingface.co/hf-audio
case | description | notes | models |
---|---|---|---|
Virtual assistants | natural voice responses for AI assistants, smoother interaction | highly natural, low latency, multiple emotions | XTTS-v2, MeloTTS, F5-TTS, ChatTTS |
Accessibility | spoken content for visually impaired and learning-disabled users | high clarity, easy to understand, stable | MeloTTS, Bark |
Content creation | professional narration for podcasts, audiobooks, etc. | varied voices, rich emotion, natural prosody | XTTSv2, F5-TTS, GPT-SoVITS-v2 |
Automated customer service | IVR systems for efficient automated support | clear and stable, highly customizable | Piper, ParlerTTS, XTTSv2 |
Voice kiosks | interactive voice response for self-service terminals | fast response, clear and intelligible | Piper, MeloTTS |
STT
- STT - Speech to Text
- ASR - Automatic Speech Recognition
- modelscope/FunASR
MLLM
- Multimodal Large Language Model
- Architecture: vision encoder + projector + language model
- Vision Model
- ViT
- Language Model
- Projector / Vision-Language Adapter
- aligns the image features extracted by the vision model with the language model's representation space
- Cross-Attention Module
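At its simplest, the projector is a learned linear map from the vision encoder's feature width to the LM's hidden width, turning patch features into "image tokens". A NumPy sketch (all dimensions are illustrative, and the random weights stand in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_lm, n_patches = 1024, 4096, 256  # illustrative sizes

# Learned projection weights (random here; trained jointly in a real MLLM).
W = rng.normal(scale=0.02, size=(d_vision, d_lm))
b = np.zeros(d_lm)

def project(patch_features: np.ndarray) -> np.ndarray:
    """Map ViT patch features into the language model's embedding space."""
    return patch_features @ W + b

image_tokens = project(rng.normal(size=(n_patches, d_vision)))
assert image_tokens.shape == (n_patches, d_lm)  # ready to prepend to text embeddings
```

Real projectors vary (LLaVA uses a linear layer or small MLP; other models use cross-attention resamplers), but the alignment role is the same.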
Vision
- Document OCR
- Handwriting OCR
- Visual QA / Image QA
- Visual Reasoning
- Image Classification
- Document Understanding
- Video Understanding
- Object Detection
- Object Counting
- Agent - screen understanding and operation
- Object Grounding
- returns bounding-box coordinates
- visual grounding poor performance after fine-tuning 2U1/Qwen2-VL-Finetune#77
- Qwen2 VL
- factor=28
- SmolVLM 256M
- 64 image tokens per 512px image
- https://huggingface.co/spaces/webml-community/smolvlm-realtime-webgpu
- 500M, SmolVLM
- https://www.reddit.com/r/LocalLLaMA/comments/1kmi6vl
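The `factor=28` note for Qwen2 VL above means image sides are resized to multiples of 28 before patching. A rough sketch of that rounding (a simplification of the processor's actual resize logic, which also enforces min/max pixel budgets):

```python
def round_to_factor(height: int, width: int, factor: int = 28) -> tuple[int, int]:
    """Round each side to the nearest multiple of `factor` (at least one factor)."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    return h, w

assert round_to_factor(1080, 1920) == (1092, 1932)
assert round_to_factor(10, 10) == (28, 28)
```

This matters for grounding: returned bounding-box coordinates refer to the resized image, so they must be scaled back to the original resolution.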
- References
Coding
Video
- end-to-end pipeline
- Flow
- harry0703/MoneyPrinterTurbo
- Lightricks/LTX-Video
- 30 FPS, 1216×704
- text-to-image, image-to-video, keyframe-based animation, video extension, video-to-video
- Wan-Video
Generation Media
- Text to Image, Video, Audio
- Image Inpainting
- Image Variation
- Text-guided
- Upscale, Super Resolution
- huggingface/diffusers
- Commercial / Platform / Hub / Router
- FLUX.1
- https://playground.bfl.ai/
- collection FLUX.1
- hf black-forest-labs/FLUX.1-dev
- 12B
- FLUX.1 Kontext
- https://huggingface.co/lodestones/Chroma
- https://genai-showdown.specr.net/
Problem areas
- Prompt adherence
- Generation quality
- Instructiveness
- Consistency of styles, characters, settings, etc.
- Deliberate and exact intentional posing of characters and set pieces
- Compositing different images or layers together
- Relighting
- Posing built into the model. No ControlNet hacks.
- References built into the model. No IPAdapter, no required character/style LoRAs, etc.
- Ability to address objects, characters, mannequins, etc. for deletion / insertion.
- Ability to pull sources from across multiple images with or without "innovation" / change to their pixels.
- Fine-tunable (so we can get higher quality and precision)
Generative Marketing
Media
- Kling (可灵) 2.1:
  - image-to-video with solid results
  - dynamic facial expressions, large camera moves, precise gesture control, singing lip-sync
  - rotating camera moves and lip-synced performances
- Veo 3:
  - generates video from text prompts
  - simulates live-action footage
- Sora:
  - restyles existing video
- Pika:
  - swap or add content within a scene
- Runway:
  - reference people, places, or styles (Gen-3)
- Luma:
  - reframe video to new aspect ratios
- Hedra:
  - make characters talk (lip sync)
- Jimeng (即梦):
  - many videos online are made with it
  - Jimeng Omnihuman: strong at static lip-sync
- Vidu:
  - anime-style performances
- Viggle:
  - add characters into meme videos (character motion transfer)
- Higgsfield
  - Hollywood-grade visual effects
- Jianying Pro (剪映专业版):
  - powerful, with rich assets and effects; a must-have for video editing
- Krea
  - uses open-source models such as Wan or Hunyuan
- Meitu (美图秀秀):
  - draw directly; familiar to most users
Text
- Doubao (豆包):
  - focused on emotional, everyday scenarios
- Kimi:
  - professional long-form writing; handles large amounts of content
- Deepseek:
  - writes code with remarkably few mistakes; impressively strong
- Zhihu (知乎):
  - a must for fans of Zhihu articles
- gamma:
  - top-tier slide generation; builds customized decks directly from your article
- MindShow:
  - turns a text outline into a mind map, with one-click conversion to a presentation
Design
- Gaoding Design (稿定设计):
  - covers graphic design, e-commerce design, etc., with a large library of editable templates
- Eqxiu (易企秀):
  - quickly builds H5 pages; many template types, good for event and product promotion
Search
- https://felo.ai/
  - an excellent Xiaohongshu search tool that too few people know about
Avatar
- digital human
- HunyuanVideo-Avatar
Text-to-Image
Pony
- finetune on SDXL
- trained on 2.5 million furry/anthro/cartoon/anime images
- recognizes many anime characters out of the box, no LoRA needed
Diffusion Models
[subject description] [scene construction] [photography parameters] [mood reinforcement] [extra details]
- scene construction
- clothing details
- dynamic pose
- lighting and mood
Resolution
- 1:1
- 512x512
- 768x768
- 1024x1024
- 4:3
- 16:9
- 1216x704
- Portrait
- 832x1216
- Landscape
- 1216x832
Negative
text
watermark
camera
out of frame, lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature,
(worst quality, low quality:1.4), (ugly:1.2), (stitching:1.2),
bad anatomy, deformed, disfigured, malformed limbs, extra limbs, fused limbs,
poorly drawn face, distorted face, malformed face, asymmetric eyes,
poorly drawn hands, extra fingers, fused fingers, malformed hands,
text, error, signature, watermark, username
Dynamic pose
Clothing details
trendy off-shoulder top
oversized cozy sweater
t-shirt with a cute cat print
Quality
(shot on Sony A7 IV, 50mm f/1.8 lens)
photorealistic
ultra detailed
natural skin texture
soft film grain
8k uhd
Mood reinforcement
- mood
serene, tranquil, intimate, modern elegance
- color
Warm neutrals with pops of soft pastels in window light
- motion
Subtle light reflection on sweater fabric, gentle light diffusion across the room
Photography parameters
- visual style
- lighting treatment
Juggernaut XL
- SDXL 1.0
- Juggernaut Ragnarok
- Focuses on improving photorealism, digital painting, character poses, hands, and feet.
- Built on Jug XII: first trained on a photography dataset, re-captioned with Booru tags, with SDXL as the base. The author then retrained the same dataset on top of Lustify by Coyotte and merged the two at a certain ratio as an output stabilizer. Because the dataset is captioned with Booru tags, both Booru-style prompts and the descriptive style of versions X-XII work well with Ragnarok.
- Suited to high-quality photorealistic image generation, but as an SDXL model it still has limitations such as distant faces and text rendering. Recommended as one stage of a generation pipeline (e.g. FluxDev / Pixelwave / Jug Flux Pro → Juggernaut Ragnarok) for best results. Fully open source: free to merge, fine-tune, and use commercially.
Base Model | SDXL 1.0 |
---|---|
Resolution | 832x1216 for Portrait |
Sampler | DPM++ 2M SDE |
Steps | 30-40 |
CFG | 3-6 (less is a bit more realistic) |
VAE | ✅ |
HiRes | 4xNMKD-Siax_200k with 15 Steps and 0.3 Denoise + 1.5 Upscale |
- https://huggingface.co/RunDiffusion/Juggernaut-XI-v11
- https://civitai.com/models/133005/juggernaut-xl
CyberRealistic Pony
CyberRealistic Pony combines the stylized appeal of Pony Diffusion with the photorealistic quality of CyberRealistic.
- CyberRealistic Pony https://civitai.com/models/443821/cyberrealistic-pony
Base Model | Pony |
---|---|
Resolution | 896x1152 / 832x1216 |
Sampler | DPM++ SDE Karras / DPM++ 2M Karras / Euler a |
Steps | 30+ Steps |
CFG | 5 |
Clip Skip | 2 |
Positive
score_9, score_8_up, score_7_up, (SUBJECT),
Negative
score_6, score_5, score_4, (worst quality:1.2), (low quality:1.2), (normal quality:1.2), lowres, bad anatomy, bad hands, signature, watermarks, ugly, imperfect eyes, skewed eyes, unnatural face, unnatural body, error, extra limb, missing limbs
score_6, score_5, score_4, simplified, abstract, unrealistic, impressionistic, low resolution, lowres, bad anatomy, bad hands, missing fingers, worst quality, low quality, normal quality, cartoon, anime, drawing, sketch, illustration, artificial, poor quality
ADetailer
Adetailer model: face_yolov9c.pt
If you only want the main face refined, set 'Mask only the top k largest' to 1.
Metric
abbr. | stands for | better | meaning | notes |
---|---|---|---|---|
WER | Word Error Rate | ⬇️ L | fraction of words transcribed incorrectly | STT |
RTFx | Real-Time Factor (inverted) | ⬆️ H | audio seconds transcribed per second of compute | STT |
CER | Character Error Rate | ⬇️ L | fraction of characters transcribed incorrectly | STT |
PER | Phoneme Error Rate | ⬇️ L | fraction of phonemes recognized incorrectly | STT |
- WER = (S + D + I) / N = (S + D + I) / (S + D + C)
- S = Substitutions
- D = Deletions
- I = Insertions
- C = Correct
- N = Total number of words
- N=S+D+C
- RTFx = (number of seconds of audio inferred) / (compute time in seconds)
- RTFx = 1/RTF
- huggingface/evaluate
- https://huggingface.co/spaces/evaluate-metric/wer
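The WER formula above, (S + D + I) / N, is computed in practice as a word-level Levenshtein edit distance between reference and hypothesis. A small reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / correct
    return d[len(ref)][len(hyp)] / len(ref)

assert wer("the cat sat", "the cat sat") == 0.0
assert wer("the cat sat", "the cat sat down") == 1 / 3  # one insertion
```

Note WER can exceed 1.0 when the hypothesis has many insertions, which is why leaderboards report it alongside RTFx rather than as a bounded percentage.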