Awesome Inference
- ollama
- torchrun
- vLLM (Virtual Large Language Model)
- PagedAttention
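PagedAttention, the core idea behind vLLM, manages the KV cache like virtual memory: it is split into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so memory is allocated on demand instead of reserved contiguously. A toy pure-Python sketch of that bookkeeping (names are illustrative, not vLLM's API):

```python
# Toy sketch of PagedAttention-style KV-cache management: fixed-size
# blocks plus a per-sequence block table. Illustrative only.

BLOCK_SIZE = 16  # tokens stored per KV-cache block

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of free physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:       # current block full: allocate a new one
            table.append(self.free.pop())
        return table[-1], position % BLOCK_SIZE  # (physical block, offset within it)

mgr = BlockManager(num_blocks=8)
for pos in range(20):                        # a 20-token sequence
    block, offset = mgr.append_token("seq-0", pos)
print(len(mgr.tables["seq-0"]))  # 2 blocks cover 20 tokens
```

Because blocks are allocated lazily, a sequence only ever wastes at most one partially filled block, instead of a worst-case contiguous reservation.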
- SGLang
- LocalAI
- NVIDIA/TensorRT-LLM
- Apache-2.0, C++, Python
- trt
- huggingface/text-generation-inference
- Apache-2.0, Python, Rust
- HF TGI
- triton-inference-server/server
- BSD-3, Python, C++
- NVIDIA Triton
- bentoml/BentoML
- Apache-2.0, Python
- Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines
- bentoml/BentoDiffusion
- bentoml/OpenLLM
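One of BentoML's selling points above is multi-model pipelines: several model stages composed behind one API. A plain-Python sketch of that composition pattern (stage names and logic are made up; a real BentoML service wraps stages in a service class with API endpoints):

```python
# Toy sketch of the multi-model pipeline pattern that BentoML
# productionizes: independent stages composed into one callable.
# Both stages below are hypothetical stand-ins for model calls.

def detect_language(text: str) -> str:
    # hypothetical stage 1: route the request by language
    return "en" if text.isascii() else "other"

def summarize(text: str, lang: str) -> str:
    # hypothetical stage 2: a summarization model call would go here
    return f"[{lang}] {text[:20]}"

def pipeline(text: str) -> str:
    # the composed service: output of stage 1 feeds stage 2
    return summarize(text, detect_language(text))

print(pipeline("Inference servers batch requests for throughput."))
```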
- Image
- ComfyUI
- AUTOMATIC1111/stable-diffusion-webui
- A1111
- SD
- Audio
- Whisper
- Embeddings
- michaelfeil/infinity
- MIT, Python
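An embedding server like infinity returns dense vectors that are typically compared downstream with cosine similarity. A self-contained sketch of that comparison (the vectors here are made up; a real deployment would fetch them from the server's OpenAI-compatible embeddings endpoint):

```python
import math

# Cosine similarity between two embedding vectors, as you would apply
# it to vectors returned by an embedding server. Vectors are invented
# for illustration.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_vec = [0.2, 0.8, 0.1]      # hypothetical document embedding
query_vec = [0.25, 0.75, 0.0]  # hypothetical query embedding
print(cosine(doc_vec, query_vec))
```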
- exo-explore/exo
- GPLv3, Python
- Tencent/ncnn
- BSD-3, C/C++
- neural network inference framework optimized for the mobile platform
- InternLM/lmdeploy
- mit-han-lab/nunchaku
- Apache-2.0, Python, C++
- Nunchaku is a high-performance inference engine optimized for 4-bit neural networks
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
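The SVDQuant paper title above describes the trick: before 4-bit quantization, a small high-precision low-rank component absorbs the weight outliers, so the residual quantizes with a much smaller scale. A toy numpy sketch of that idea (illustrative only, not nunchaku's kernels):

```python
import numpy as np

# Toy sketch of the SVDQuant idea: W ~= L + dequant(quant4(W - L)),
# where L is a truncated-SVD low-rank part kept in high precision that
# absorbs outliers. Rank and sizes here are arbitrary choices.

def quantize_4bit(x):
    scale = np.abs(x).max() / 7.0            # symmetric int4 grid -7..7
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[0, 0] = 50.0                                # one outlier blows up the scale

# Low-rank component via truncated SVD soaks up the outlier energy.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
rank = 4
L = (U[:, :rank] * S[:rank]) @ Vt[:rank]      # kept in high precision
q, scale = quantize_4bit(W - L)               # residual is outlier-free
W_hat = L + q * scale

plain_q, plain_scale = quantize_4bit(W)       # naive 4-bit baseline
err_svdq = np.abs(W - W_hat).mean()
err_plain = np.abs(W - plain_q * plain_scale).mean()
print(err_svdq < err_plain)                   # low-rank absorption shrinks the error
```

Without the low-rank part, the single outlier forces a quantization step of about 50/7, wasting almost the entire int4 grid on one entry.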
- theroyallab/tabbyAPI
- AGPLv3, Python
- turboderp-org/exllamav2
- MIT, Python
- turboderp-org/exllamav3
- MIT, Python
- Reading