    vllm-project/vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

    Topics: ai, llm, machine-learning, deep-learning, amd, blackwell, cuda, deepseek, deepseek-v3, gpt, gpt-oss, inference, kimi, llama, llm-serving, model-serving, moe, openai, pytorch, qwen, qwen3, tpu, transformer

    Python · Apache-2.0 · 68.8K stars · 13.0K forks · Updated 2/27/2026

    vLLM

    Easy, fast, and cheap LLM serving for everyone

    | Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |


    Latest News 🔥

    • [2025/08] We hosted vLLM Shanghai Meetup focusing on building, developing, and integrating with vLLM! Please find the meetup slides here.
    • [2025/08] We hosted vLLM Beijing Meetup focusing on large-scale LLM deployment! Please find the meetup slides here and the recording here.
    • [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement here.
    • [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post here.
    Previous News

    About

    vLLM is a fast and easy-to-use library for LLM inference and serving.

    Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

    vLLM is fast with:

    • State-of-the-art serving throughput
    • Efficient management of attention key and value memory with PagedAttention
    • Continuous batching of incoming requests
    • Fast model execution with CUDA/HIP graph
    • Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
    • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
    • Speculative decoding
    • Chunked prefill
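
    The core idea behind PagedAttention can be sketched in a few lines: each sequence's KV cache lives in fixed-size blocks tracked by a per-sequence block table, so memory is allocated on demand rather than reserved up front for the maximum sequence length. The following is an illustrative toy, not vLLM's implementation; the `BlockAllocator` and `Sequence` names are hypothetical:

    ```python
    # Conceptual sketch of PagedAttention-style KV-cache paging.
    # Not vLLM's actual code; class names are invented for illustration.

    BLOCK_SIZE = 16  # tokens per KV-cache block (16 is vLLM's default)

    class BlockAllocator:
        """A free list of physical KV-cache blocks shared across sequences."""
        def __init__(self, num_blocks: int):
            self.free = list(range(num_blocks))

        def alloc(self) -> int:
            return self.free.pop()

        def release(self, blocks: list[int]) -> None:
            self.free.extend(blocks)

    class Sequence:
        """Maps a sequence's logical blocks to physical blocks on demand."""
        def __init__(self, allocator: BlockAllocator):
            self.allocator = allocator
            self.block_table: list[int] = []  # logical index -> physical block
            self.num_tokens = 0

        def append_token(self) -> None:
            # A new physical block is allocated only when the current one fills.
            if self.num_tokens % BLOCK_SIZE == 0:
                self.block_table.append(self.allocator.alloc())
            self.num_tokens += 1

    allocator = BlockAllocator(num_blocks=64)
    seq = Sequence(allocator)
    for _ in range(40):  # 40 tokens need ceil(40/16) = 3 blocks
        seq.append_token()
    print(len(seq.block_table))  # 3
    ```

    Because blocks are allocated per 16 tokens rather than per maximum context length, short sequences waste at most one partially filled block, which is what enables the high batch sizes behind continuous batching.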

    vLLM is flexible and easy to use with:

    • Seamless integration with popular Hugging Face models
    • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
    • Tensor, pipeline, data and expert parallelism support for distributed inference
    • Streaming outputs
    • OpenAI-compatible API server
    • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron
    • Prefix caching support
    • Multi-LoRA support
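
    Because the server is OpenAI-compatible, any client that can POST JSON to `/v1/chat/completions` works. A minimal sketch of building such a request body follows; the model name and localhost URL are illustrative assumptions (vLLM's server listens on port 8000 by default):

    ```python
    import json

    # Illustrative defaults, not authoritative: adjust to your deployment.
    BASE_URL = "http://localhost:8000/v1"

    def chat_completion_body(model: str, prompt: str, max_tokens: int = 64) -> str:
        """Build the JSON body for an OpenAI-style chat completion request."""
        body = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }
        return json.dumps(body)

    payload = chat_completion_body("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
    print(payload)
    ```

    The same body can be sent with the official `openai` Python client by pointing its `base_url` at the vLLM server, so existing OpenAI integrations typically need only a URL change.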

    vLLM seamlessly supports most popular open-source models on Hugging Face, including:

    • Transformer-like LLMs (e.g., Llama)
    • Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
    • Embedding Models (e.g., E5-Mistral)
    • Multi-modal LLMs (e.g., LLaVA)

    Find the full list of supported models here.

    Getting Started

    Install vLLM with pip or from source:

    pip install vllm
    

    Visit our documentation to learn more.

    Contributing

    We welcome and value any contributions and collaborations. Please check out Contributing to vLLM for how to get involved.

    Sponsors

    vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!

    Cash Donations:

    • a16z
    • Dropbox
    • Sequoia Capital
    • Skywork AI
    • ZhenFund

    Compute Resources:

    • Alibaba Cloud
    • AMD
    • Anyscale
    • AWS
    • Crusoe Cloud
    • Databricks
    • DeepInfra
    • Google Cloud
    • Intel
    • Lambda Lab
    • Nebius
    • Novita AI
    • NVIDIA
    • Replicate
    • Roblox
    • RunPod
    • Trainy
    • UC Berkeley
    • UC San Diego

    Slack Sponsor: Anyscale

    We also have an official fundraising venue through OpenCollective. We plan to use the fund to support the development, maintenance, and adoption of vLLM.

    Citation

    If you use vLLM for your research, please cite our paper:

    @inproceedings{kwon2023efficient,
      title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
      author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
      booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
      year={2023}
    }
    

    Contact Us

    • For technical questions and feature requests, please use GitHub Issues
    • For discussions with fellow users, please use the vLLM Forum
    • For coordinating contributions and development, please use Slack
    • For security disclosures, please use GitHub's Security Advisories feature
    • For collaborations and partnerships, please contact us at [email protected]

    Media Kit
