lmsys.org AI tool

Pricing plans

No detailed pricing plan is available yet for this tool.

Detailed presentation

LMSYS ORG

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

by: The Vicuna Team, Mar 30, 2023

We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. The cost of training Vicuna-13B is around $300. The code and weights, along with an online demo, are publicly available for non-commercial use.

Vicuna (generated by stable diffusion 2.1)

*According to a fun and non-scientific evaluation with GPT-4. Further rigorous evaluation is needed.

How Good is Vicuna?

After fine-tuning Vicuna with 70K user-shared ChatGPT conversations, we discover that Vicuna becomes capable of generating more detailed and well-structured answers compared to Alpaca (see examples below), with the quality on par with ChatGPT.

However, evaluating chatbots is never a simple task. With recent advancements in GPT-4, we are curious whether its capabilities have reached a human-like level that could enable an automated evaluation framework for benchmark generation and performance assessments. Our initial finding indicates that GPT-4 can produce highly consistent ranks and detailed assessment when comparing chatbots’ answers (see above example of GPT-4 judgment). Preliminary evaluations based on GPT-4, summarized in Figure 1, show that Vicuna achieves 90%* capability of Bard/ChatGPT. While this proposed framework shows a potential to automate chatbot assessment, it is not yet a rigorous approach. Building an evaluation system for chatbots remains an open question requiring further research. More details are provided in the evaluation section.

Figure 1.
Relative Response Quality Assessed by GPT-4*

Online Demo

Try the Vicuna-13B demo here!

Overview

The rapid advancement of large language models (LLMs) has revolutionized chatbot systems, resulting in unprecedented levels of intelligence as seen in OpenAI's ChatGPT. However, despite its impressive performance, the training and architecture details of ChatGPT remain unclear, hindering research and open-source innovation in this field. Inspired by the Meta LLaMA and Stanford Alpaca projects, we introduce Vicuna-13B, an open-source chatbot backed by an enhanced dataset and an easy-to-use, scalable infrastructure. By fine-tuning a LLaMA base model on user-shared conversations collected from ShareGPT.com, Vicuna-13B has demonstrated competitive performance compared to other open-source models like Stanford Alpaca. This blog post provides a preliminary evaluation of Vicuna-13B's performance and describes its training and serving infrastructure. We also invite the community to interact with our online demo to test the capabilities of this chatbot.

Figure 2. Workflow Overview

Figure 2 provides an overview of our work. To begin, we collected around 70K conversations from ShareGPT.com, a website where users can share their ChatGPT conversations. Next, we enhanced the training scripts provided by Alpaca to better handle multi-turn conversations and long sequences. The training was done with PyTorch FSDP on 8 A100 GPUs in one day. For serving the demo, we implemented a lightweight distributed serving system. We conducted a preliminary evaluation of the model quality by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. To compare two different models, we combine the outputs from each model into a single prompt for each question. The prompts are then sent to GPT-4, which assesses which model provides better responses. A detailed comparison of LLaMA, Alpaca, ChatGPT, and Vicuna is shown in Table 1 below.

Table 1.
Comparison between several notable models

| Model Name | LLaMA | Alpaca | Vicuna | Bard/ChatGPT |
|---|---|---|---|---|
| Dataset | Publicly available datasets (1T token) | Self-instruct from davinci-003 API (52K samples) | User-shared conversations (70K samples) | N/A |
| Training code | N/A | Available | Available | N/A |
| Evaluation metrics | Academic benchmark | Author evaluation | GPT-4 assessment | Mixed |
| Training cost (7B) | 82K GPU-hours | $500 (data) + $100 (training) | $140 (training) | N/A |
| Training cost (13B) | 135K GPU-hours | N/A | $300 (training) | N/A |

Training

Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, we convert the HTML back to markdown and filter out some inappropriate or low-quality samples. Additionally, we divide lengthy conversations into smaller segments that fit the model's maximum context length.

Our training recipe builds on top of Stanford’s Alpaca with the following improvements.

Multi-turn conversations: We adjust the training loss to account for multi-turn conversations and compute the fine-tuning loss solely on the chatbot's output.

Memory Optimizations: To enable Vicuna's understanding of long context, we expand the max context length from 512 in Alpaca to 2048, which substantially increases GPU memory requirements. We tackle the memory pressure by utilizing gradient checkpointing and flash attention.

Cost Reduction via Spot Instance: The 40x larger dataset and 4x sequence length for training pose a considerable challenge in training expenses. We employ SkyPilot managed spot to reduce the cost by leveraging the cheaper spot instances with auto-recovery for preemptions and auto zone switch. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.

Serving

We build a serving system that is capable of serving multiple models with distributed workers. It supports flexible plug-in of GPU workers from both on-premise clusters and the cloud.
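As a rough illustration of how GPU workers might plug in to and drop out of a system like this, the sketch below keeps a registry of workers per model and forgets any worker whose heartbeat has gone stale. Everything in it is hypothetical and chosen for this example (the `Controller` class, `register_worker`, `heartbeat`, `pick_worker`, the 90-second timeout, the worker addresses); it is not the actual serving code.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerInfo:
    address: str
    models: list[str]
    last_heartbeat: float


@dataclass
class Controller:
    """Hypothetical registry that routes requests to live workers."""

    heartbeat_timeout: float = 90.0  # assumed value, in seconds
    workers: dict[str, WorkerInfo] = field(default_factory=dict)
    _cursor: int = 0  # round-robin position shared across models

    def register_worker(self, address, models, now):
        """A worker announces itself and the models it can serve."""
        self.workers[address] = WorkerInfo(address, list(models), now)

    def heartbeat(self, address, now):
        """Refresh a worker's liveness; False means it must re-register."""
        if address not in self.workers:
            return False
        self.workers[address].last_heartbeat = now
        return True

    def remove_stale_workers(self, now):
        """Drop workers that have been silent for longer than the timeout."""
        stale = [
            address
            for address, info in self.workers.items()
            if now - info.last_heartbeat > self.heartbeat_timeout
        ]
        for address in stale:
            del self.workers[address]
        return stale

    def pick_worker(self, model, now):
        """Return the address of a live worker serving `model`, or None."""
        self.remove_stale_workers(now)
        candidates = sorted(
            address for address, info in self.workers.items() if model in info.models
        )
        if not candidates:
            return None
        choice = candidates[self._cursor % len(candidates)]
        self._cursor += 1
        return choice
```

Passing the current time in explicitly keeps the sketch deterministic; a preempted spot instance simply stops sending heartbeats and is dropped on the next lookup.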
By utilizing a fault-tolerant controller and managed spot feature in SkyPilot, this serving system can work well with cheaper spot instances from multiple clouds to reduce the serving costs. It is currently a lightweight implementation and we are working on integrating more of our latest research into it.

How To Evaluate a Chatbot?

Evaluating AI chatbots is a challenging task, as it requires examining language understanding, reasoning, and context awareness. With AI chatbots becoming more advanced, current open benchmarks may no longer suffice. For instance, the evaluation dataset used in Stanford’s Alpaca, self-instruct, can be effectively answered by SOTA chatbots, making it difficult for humans to discern differences in performance. More limitations include training/test data contamination and the potentially high cost of creating new benchmarks. To tackle these issues, we propose an evaluation framework based on GPT-4 to automate chatbot performance assessment.

First, we devised eight question categories, such as Fermi problems, roleplay scenarios, and coding/math tasks, to test various aspects of a chatbot's performance. Through careful prompt engineering, GPT-4 is able to generate diverse, challenging questions that baseline models struggle with. We select ten questions per category and collect answers from five chatbots: LLaMA, Alpaca, ChatGPT, Bard, and Vicuna. We then ask GPT-4 to rate the quality of their answers based on helpfulness, relevance, accuracy, and detail. We discover that GPT-4 can produce not only relatively consistent scores but also detailed explanations on why such scores are given (detailed examples link). However, we also notice that GPT-4 is not very good at judging coding/math tasks.

Figure 3. Response Comparison Assessed by GPT-4

Figure 3 displays the comparison results between all baselines and Vicuna.
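The pairwise setup described above (two answers to the same question packed into one prompt, rated on helpfulness, relevance, accuracy, and detail) can be sketched as follows. The prompt wording, the 1-to-10 range, the convention that the judge puts two scores on the first line of its reply, and all function names are assumptions made for this example, not the exact prompts used in the evaluation. The call to GPT-4 itself is left out.

```python
import re

CRITERIA = ("helpfulness", "relevance", "accuracy", "level of detail")


def build_pairwise_prompt(question: str, answer_1: str, answer_2: str) -> str:
    """Pack one question and two candidate answers into a single judge prompt."""
    criteria = ", ".join(CRITERIA)
    return (
        f"[Question]\n{question}\n\n"
        f"[Assistant 1]\n{answer_1}\n\n"
        f"[Assistant 2]\n{answer_2}\n\n"
        f"Rate each assistant's answer for {criteria} on a scale of 1 to 10.\n"
        "Put the two scores on the first line, separated by a space, "
        "then explain your ratings."
    )


def parse_scores(judge_reply: str) -> tuple[float, float]:
    """Read the two scores from the first line of the judge's reply."""
    lines = judge_reply.strip().splitlines()
    first_line = lines[0] if lines else ""
    numbers = re.findall(r"\d+(?:\.\d+)?", first_line)
    if len(numbers) != 2:
        raise ValueError(f"expected two scores on the first line, got: {first_line!r}")
    return float(numbers[0]), float(numbers[1])


def total_scores(judge_replies) -> tuple[float, float]:
    """Sum each assistant's scores over all judged questions."""
    total_1 = total_2 = 0.0
    for reply in judge_replies:
        score_1, score_2 = parse_scores(reply)
        total_1 += score_1
        total_2 += score_2
    return total_1, total_2
```

Raising an error on an unparseable reply, rather than guessing, makes it easier to spot cases where the judge ignored the requested output format.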
GPT-4 prefers Vicuna over state-of-the-art open-source models (LLaMA, Alpaca) in more than 90% of the questions, and it achieves competitive performance against proprietary models (ChatGPT, Bard). In 45% of the questions, GPT-4 rates Vicuna's response as better or equal to ChatGPT's. As GPT-4 assigns a quantitative score to each response on a scale of 10, we calculate the total score for each (baseline, Vicuna) comparison pair by adding up the scores obtained by each model on 80 questions. As shown in Table 2, Vicuna’s total score is 92% of ChatGPT’s (638.0 / 693.0 ≈ 0.92). Despite recent advancements, these chatbots still face limitations, such as struggling with basic math problems or having limited coding ability.

Table 2. Total Scores Assessed by GPT-4.

| Baseline | Baseline Score | Vicuna Score |
|---|---|---|
| LLaMA-13B | 513.0 | 694.0 |
| Alpaca-13B | 583.0 | 704.0 |
| Bard | 664.0 | 655.5 |
| ChatGPT | 693.0 | 638.0 |

While this proposed evaluation framework demonstrates the potential for assessing chatbots, it is not yet a rigorous or mature approach, as large language models are prone to hallucinate. Developing a comprehensive, standardized evaluation system for chatbots remains an open question requiring further research.

Edited: After this blog post, we conducted a deeper study on this GPT4-based evaluation approach. You are welcome to read our new Judging LLM-as-a-judge paper and try the new evaluation tool.

Limitations

We have noticed that, similar to other large language models, Vicuna has certain limitations. For instance, it is not good at tasks involving reasoning or mathematics, and it may have limitations in accurately identifying itself or ensuring the factual accuracy of its outputs. Additionally, it has not been sufficiently optimized to guarantee safety or mitigate potential toxicity or bias. To address the safety concerns, we use the OpenAI moderation API to filter out inappropriate user inputs in our online demo.
Nonetheless, we anticipate that Vicuna can serve as an open starting point for future research to tackle these limitations.

Release

In our first release, we will share the training, serving, and evaluation code on a GitHub repo: https://github.com/lm-sys/FastChat. We also released the Vicuna-13B model weights. There is no plan to release the dataset. Join our Discord server and follow our Twitter to get the latest updates.

License

The online demo is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation. The code is released under the Apache License 2.0.

Acknowledgment

We would like to thank Xinyang Geng, Hao Liu, and Eric Wallace from BAIR; Xuecheng Li and Tianyi Zhang from the Stanford Alpaca team for their insightful discussion and feedback; and Qirong Ho from MBZUAI for providing support on the serving cluster. Please check out a blog post from BAIR about a concurrent effort on their chatbot, Koala.

The Team

This is a joint effort with collaborators from multiple institutions, including UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI.

Students (alphabetical order): Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang (✉), Lianmin Zheng (✉), Siyuan Zhuang, Yonghao Zhuang

Advisors (alphabetical order): Joseph E. Gonzalez, Ion Stoica, Eric P. Xing

✉ Correspondence to: Lianmin Zheng (lianminzheng@gmail.com), Hao Zhang (sjtu.haozhang@gmail.com), or LMSYS (lmsys.org@gmail.com).

Citation

@misc{vicuna2023,
    title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
    url = {https://lmsys.org/blog/2023-03-30-vicuna/},
    author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
    month = {March},
    year = {2023}
}

After this blog post, we extended our idea of GPT-4 based evaluation and wrote a more formal paper that systematically studies this "LLM-as-a-judge" approach. You are welcome to read and cite this paper: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.

---

ABOUT

Large Model Systems (LMSYS Corp.) is a 501(c)(3) non-profit focused on incubating open-source projects and research. Our mission is to make large AI models accessible to everyone by co-developing open models, datasets, systems, and evaluation tools. We conduct cutting-edge machine learning research, develop open-source software, train large language models for broad accessibility, and build distributed systems to optimize their training and inference.

Directors and officers

Ying Sheng, Lianmin Zheng, Wei-Lin Chiang, Zihao Ye, Yusi Chen, Tiancheng Xie.

Members of flagship projects

SGLang major developers: Lianmin Zheng, Ying Sheng, Liangsheng Yin, Yineng Zhang, Ke Bao, Byron Hsu, Chenyang Zhao, Zhiqiang Xie, Jingyi Chen, Xiaoyu Zhang, Baizhou Zhang, Yi Zhang, Jiexin Liang, Chang Su, Simo Lin, Hai Xiao. For more information, see the SGLang GitHub.

FlashInfer: Zihao Ye. For more information, see the FlashInfer GitHub.

Chatbot Arena (graduated): See https://blog.lmarena.ai/about

Vicuna LLM: Wei-Lin Chiang, Joseph E. Gonzalez, Dacheng Li, Zhuohan Li, Zi Lin, Ying Sheng, Ion Stoica, Zhanghao Wu, Eric P. Xing, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang

Project memberships are granted to individuals who have demonstrated outstanding technical contributions and research impact in LMSYS projects, as evaluated by our advisory board of distinguished academics and industry leaders.

Advisors

Joseph E. Gonzalez, Ion Stoica, Eric P. Xing, Hao Zhang, Jun Qian, Mingxing Zhang

Sponsors

LMSYS is supported by donations from the following institutions: Voltage Park, NVIDIA, Nebius, Google Cloud, AtlasCloud, a16z, AMD, InnoMatrix, Laude Institute, Hyperbolic, NovitaAI, Verda Cloud, Sky9, Kaggle, MBZUAI, Together, RunPod, Anyscale, HuggingFace

We also thank the following companies for providing API credits to serve their models on Chatbot Arena (graduated): Alibaba, Anthropic, Cohere, Databricks, Google, Mistral, OpenAI, Reka, 01ai

We welcome diverse forms of donations and sponsorships, including but not limited to cash, computing devices (e.g., GPUs), and cloud credits. Please refer to https://lmsys.org/donations/.

History

LMSYS originated from a multi-university collaboration involving UC Berkeley, Stanford, UCSD, CMU, and MBZUAI in 2023. It was established as a non-profit corporation in September 2024 to incubate early-stage open-source and research projects. It became well-known for its impactful flagship projects, including:

Chatbot Arena (graduated)
SGLang
FastChat
Vicuna LLM

These projects feature open models with millions of downloads, crowdsourced platforms with millions of users, and efficient systems that are orders of magnitude faster.

Contact us

Email us at lmsys.org@gmail.com. Join us on Slack. Follow us on X.
---

PROJECTS

LMSYS Org develops open models, datasets, systems, and evaluation tools for large models.

SYSTEMS

SGLang: A fast serving engine for LLMs and VLMs.
FastChat: An open and scalable platform for training, finetuning, serving, and evaluating LLM-based chatbots.
SpecForge: Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
S-LoRA: A system for serving thousands of concurrent LoRA adapters.
RouteLLM: A framework for serving and evaluating LLM routers.
Lookahead Decoding: An exact, fast, parallel decoding algorithm without the need for draft models or data stores.

EVALUATION

Chatbot Arena: A benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. It comes with a leaderboard based on Elo ratings.
Arena Hard Auto: An automatic pipeline converting live data to high quality benchmarks for evaluating chat assistants. The questions are more difficult than those in MT-Bench.
MT-Bench: A set of challenging, multi-turn, and open-ended questions for evaluating chat assistants. It uses LLM-as-a-judge to evaluate model responses.

DATASETS

LMSYS-Chat-1M: This dataset contains one million real-world conversations with 25 state-of-the-art LLMs.
Chatbot Arena Conversations: This dataset contains 33K cleaned conversations with pairwise human preferences collected on Chatbot Arena.
ToxicChat: This dataset contains 10K high-quality data for content moderation in real-world user-AI interactions based on user queries from the Vicuna online demo.

MODELS

Vicuna (Base: Llama; Size: 7B, 13B, 33B): An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
LongChat (Base: Llama; Size: 7B, 13B): A series of open-source chatbots with long context length (16K - 32K).
FastChat-T5 (Base: Flan-T5; Size: 3B): A commercial-friendly, compact, yet powerful chat assistant.
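For readers unfamiliar with Elo ratings, which the Chatbot Arena entry above mentions as the basis of its leaderboard, here is a minimal sketch of the standard Elo update applied to a sequence of pairwise battles. The K-factor of 32, the starting rating of 1000, and the model names in the usage note are assumptions for illustration. The leaderboard's actual parameters and computation are not described on this page.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


def rate_battles(battles, initial: float = 1000.0, k: float = 32.0):
    """Process (model_a, model_b, score_a) tuples in order and return all ratings."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        rating_a = ratings.get(model_a, initial)
        rating_b = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(rating_a, rating_b, score_a, k)
    return ratings
```

For example, `rate_battles([("model-x", "model-y", 1.0)])` moves two equally rated models 16 points apart in each direction. Because each update adds to one rating exactly what it subtracts from the other, the sum of all ratings stays constant.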
--- LMSYS ORGProjectsBlogAboutDonationsChatbot Arena (graduated)Open MenuClose MenuBLOGLatest updates and releases by LMSYS Org are announced through our blogpost series.Elastic EP in SGLang: Achieving Partial Failure Tolerance for DeepSeek MoE Deploymentsby: The Mooncake Team, Volcano Engine, March 25, 20261. The Problem: The Necessity and Vulnerability of Wide EP To serve massive Mixture-of-Experts (MoE) models efficiently, deploying a "wide" Expert Parallelism (EP) strategy—often spanning 32 GPUs or more per inference instance—is not just an option; it is a necessity. We need wide EP for t...ROCm Support for Miles: Large-Scale RL Post-Training on AMD Instinct™ GPUsby: AMD & Miles Team, March 17, 2026Reinforcement learning (RL) has rapidly become a core stage of modern foundation-model development. While large-scale pretraining remains essential, today's most capable models rely heavily on post-training techniques to improve reasoning, tool use, and multi-turn interaction. These workflows depend...SGLang Adds Day-0 Support for NVIDIA Nemotron 3 Super for building High-Efficiency Multi-Agent Systemsby: NVIDIA Nemotron Team, March 11, 2026We are excited to announce that SGLang supports NVIDIA Nemotron 3 Super on Day 0. Nemotron 3 Super is a leading open model in the Nemotron 3 family, built for running many collaborating agents together. Agentic systems that chain planning, reasoning, and tools produce far more tokens than single-tur...Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72by: NVIDIA and Community SGLang Developers, February 20, 2026The SGLang team has worked closely with NVIDIA across multiple GPU generations to unlock step-function gains in inference performance for large-scale deployments of Mixture of Expert (MoE) reasoning models. 
Building on prior results that delivered 4x speedups on Blackwell B200 vs.Hopper H200 in Semi...Deploying DeepSeek on GB300 NVL72: Big Wins in Long-Context Inferenceby: Nvidia & SGLang Team, February 19, 2026TL;DR As the latest addition to the Blackwell family, the GB300 NVL72 is the most powerful platform for long-context LLM inference. In this blog post, we share our latest progress on optimizing DeepSeek R1-NVFP4 for 128K/8K ISL/OSL (Input Sequence Length/Output Sequence Length) long-context serving ...SGLang-Diffusion: Advanced Optimizations for Production-Ready Video Generationby: The SGLang-Diffusion Team, February 16, 2026Following our two-month progress update, we're excited to share a deeper dive into the advanced optimizations that make SGLang-Diffusion a production-ready framework for video generation. These improvements focus on scalability, efficiency, and stability—essential for deploying diffusion models at s...Unleashing Computational Power: Ultimate Latency Optimization of Qwen3 and Qwen3-VL on AMD MI300X Seriesby: The Qwen C-end Infrastructure Engineering Team & The AMD AI Framework Team, February 11, 20261. Introduction Qwen is a series of large-scale, high-performance Large Language Models (LLMs) developed by the Qwen Team of Alibaba Cloud. From the first generation to the latest third-generation flagship models, all Qwen variants have undergone dedicated training and fine-grained tuning, endowing ...Squeezing 1TB Model Rollout into a Single H200: INT4 QAT RL End-to-End Practiceby: SGLang RL Team, InfiXAI Team, Ant Group Asystem & AQ Infra Team, slime Team, RadixArk Team, January 26, 2026 💡 TL;DR: Inspired by the Kimi K2 team, the SGLang RL team successfully landed an INT4 Quantization-Aware Training (QAT) pipeline. 
By combining fake quantization during training with real quantization at inference (W4A16), we achieved stability and train–infer consistency comparable to BF16 full-pr...Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLangby: Novita AI, January 21, 2026TL;DR A suite of production-tested, high-impact optimizations has been developed by Novita AI for deploying GLM4-MOE models based on SGLANG. We introduce an end-to-end performance optimization strategy that addresses bottlenecks across the entire inference pipeline — from kernel execution efficiency...SGLang-Diffusion: Two Months Inby: The SGLang-Diffusion Team, January 16, 2026Since its release in early Nov. 2025, SGLang-Diffusion has gained significant attention and widespread adoption within the community. We are deeply grateful for the extensive feedback and growing number of contributions from open-source developers. Over the past two months, we've been meticulously o...Pipeline Parallelism in SGLang: Scaling to Million-Token Contexts and Beyondby: Shangming Cai, January 15, 2026TL;DR We are excited to introduce SGLang's highly optimized Pipeline Parallelism (PP) implementation, specifically engineered to tackle the challenges of ultra-long context inference. By integrating Chunked Pipeline Parallelism, Asynchronous P2P Communication, and a simple yet effective Dynamic Chun...EPD Disaggregation: Elastic Encoder Scaling for Vision-Language Models in SGLangby: rednote hilab, Alibaba Cloud Computing, AntGroup SCT, January 12, 2026TL;DR We introduce Encoder-Prefill-Decode (EPD) Disaggregation in SGLang, a novel architecture that separates vision encoding from language processing in Vision-Language Models (VLMs). 
This can enable: Independent scaling of vision encoding capacity: Encoder servers can be scaled horizontally wi...SpecBundle & SpecForge v0.2: Production-Ready Speculative Decoding Models and Frameworkby: SpecForge Team, Ant Group AQ Team, Nex-AGI Team, EigenAI Team, December 23, 2025TL;DR The SpecForge team has collaborated with multiple industry partners - including Ant, Meituan, Nex-AGI, and EigenAI - to release SpecBundle (Phase 1), a collection of production-grade EAGLE-3 model checkpoints trained on large-scale datasets. SpecBundle is designed to improve the availability a...Power Up Diffusion LLMs: Day‑0 Support for LLaDA 2.0by: Ant Group DeepXPU Team, SGLang Team, December 19, 2025TL;DR We are excited to introduce the design and implementation of the Diffusion Large Language Model (dLLM) framework within SGLang. By leveraging the existing Chunked-Prefill mechanism, our system achieves: Seamless integration: Built into the SGLang ecosystem without core architectural changes. ...Mini-SGLang: Efficient Inference Engine in a Nutshellby: Ziyi Xu, December 17, 2025We're excited to introduce Mini-SGLang, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the SGLang project, Mini-SGLang is designed to demystify the complexities of modern serving systems. Despite its compact codebase, it retains the advanced fea...SGLang Day-0 Support for MiMo-V2-Flash Modelby: SGLang Team and Xiaomi LLM Core Team, December 16, 2025Introduction XiaomiMiMo/MiMo-V2-Flash, with 309B total parameters and 15B activated parameters, is a new inference-centric model designed to maximize decoding efficiency. It is based on two key designs: sliding window attention and multi-layer MTP. 
MiMo-V2-Flash is explicitly co-designed for real-wo...SGLang Adds Day-0 Support for the Highly Efficient, Open Nemotron 3 Nano Hybrid MoE Modelby: NVIDIA Nemotron Team, December 15, 2025Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by SGLang out of the box and it uses a new method called Quantization-Aware Distillation (QAD) to maintain accuracy on NVFP4 while delivering 4x throughput on B200 compared to FP8-H100. You ...Let Tensors Fly — Accelerating Large Model Weight Loading with R-Forkby: Ant Group DeepXPU Team, SGLang Team, December 10, 2025TL;DR We introduce Tensor R-Fork (stands for Tensor Remote Fork), a novel weight loading methodology that leverages efficient inter-node device-to-device interconnect to load tensors from a running SGLang instance to a new instance with zero-copy. Our approach provides three key advantages: Signi...Boost SGLang Inference: Native NVIDIA Model Optimizer Integration for Seamless Quantization and Deploymentby: NVIDIA ModelOpt Team, Dec 02, 2025(Updated on Dec 2) We are thrilled to announce a major new feature in SGLang: native support for NVIDIA Model Optimizer quantization! This integration streamlines the entire model optimization and deployment process, allowing you to go from a full-precision model to a high-performance, quantized end...From research to production: Accelerate OSS LLM with EAGLE-3 on Vertexby: Ivan Nardini, Charles Chen, Ying Wang, December 1, 2025TL;DR: Speculative decoding boosts LLM inference, but traditional methods require a separate, inefficient draft model. Vertex AI utilizes EAGLE-3, adding a small draft head (2-5% of the target model) to internal layers, simplifying training and achieving ~2x-3x decoding speedup. 
This post outlines o...Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RLby: InfiXAI Team, Ant Group AQ Team, SGLang RL Team, Miles Team, November 25, 2025 TL;DR: We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effecti...LMSYS Fellowship Programby: LMSYS Board, November 23, 2025We are thrilled to announce the launch of the LMSYS Fellowship Program! This year, the program is dedicated to supporting full-time PhD students in the United States who have made significant contributions to the open-source AI infrastructure community. Fellowship recipients will be awarded up to $5...Introducing Miles — RL Framework To Fire Up Large-Scale MoE Trainingby: RadixArk Team, November 19, 2025 A journey of a thousand miles is made one small step at a time. Today, we are releasing Miles, an enterprise-grade reinforcement learning framework tailored for large-scale MoE training and production workloads. Miles is built on top of slime, the lightweight RL framework that has quietly powered ...🚀 AutoRound Meets SGLang: Enabling Quantized Model Inference with AutoRoundby: By Intel Neural Compressor Team, November 14, 2025Overview We are thrilled to announce an official collaboration between SGLang and AutoRound, enabling low-bit quantization for efficient LLM inference. Through this integration, developers can now quantize large models with AutoRound’s signed-gradient optimization and directly deploy them in SGLang’...SGLang Diffusion: Accelerating Video and Image Generationby: The SGLang Diffusion Team, November 7, 2025We are excited to introduce SGLang Diffusion, which brings SGLang's state-of-the-art performance to accelerate image and video generation for diffusion models. 
SGLang Diffusion supports major open-source video and image generation models (Wan, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux) while provid..."No Free Lunch": Deconstruct Efficient Attention with MiniMax M2by: MiniMax LLM Team together with Xinyuan Tong, Kangyan Zhou, Mingyi Lu, and Chenyang Zhao, November 4, 2025We are excited to announce day-one support for the new flagship model, MiniMax M2, on SGLang. The MiniMax M2 redefines efficiency for agents: it is a compact, fast, and cost-effective Mixture of Experts (MoE) model (230 billion total parameters, 10 billion active) built for elite performance in codi...Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Sparkby: Jerry Zhou, November 3, 2025We’ve got some exciting updates about the NVIDIA DGX Spark! In the week following the official launch, we collaborated closely with NVIDIA and successfully brought GPT-OSS 20B and GPT-OSS 120B support to SGLang on the DGX Spark. The results are impressive: around 70 tokens/s on GPT-OSS 20B and 50 to...SGLang-Jax: An Open-Source Solution for Native TPU Inferenceby: The SGLang-Jax Team, October 29, 2025We're excited to introduce SGLang-Jax, a state-of-the-art open-source inference engine built entirely on Jax and XLA. It leverages SGLang's high-performance server architecture and uses Jax to compile the model's forward pass. By combining SGLang and Jax, this project delivers fast, native TPU infer...Accelerating Hybrid Inference in SGLang with KTransformers CPU Kernelsby: KVCache.AI and Approaching AI, October 22, 2025Background: Hybrid Inference for Sparse MoE Models Modern Mixture-of-Experts (MoE) language models such as DeepSeek-V3 contain hundreds of billions of parameters, but only a small subset of experts are activated per token. 
This sparse activation pattern makes MoE models ideal for CPU/GPU hybrid infe...SGLang and NVIDIA Accelerating SemiAnalysis InferenceMAX and GB200 Togetherby: NVIDIA and community SGLang developers, Oct 14, 2025The SGLang and NVIDIA teams have a strong track record of collaboration, consistently delivering inference optimizations and system-level improvements to ensure exceptional performance of the SGLang framework. Most recently, this collaboration has been centered on the NVIDIA Blackwell architecture, ...NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inferenceby: Jerry Zhou and Richard Chen, October 13, 2025Thanks to NVIDIA’s early access program, we are thrilled to get our hands on the NVIDIA DGX™ Spark. It’s quite an unconventional system, as NVIDIA rarely releases compact, all-in-one machines that bring supercomputing-class performance to a desktop workstation form factor. Over the past year, SGLang...SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attentionby: The SGLang Team, September 29, 2025We are excited to announce that SGLang supports DeepSeek-V3.2 on Day 0! According to the DeepSeek tech report, it equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSee...PD-Multiplexing: Unlocking High-Goodput LLM Serving with GreenContextby: Weihao Cui, Yukang Chen, Xiaoze Fan, Han Zhao, Ziyi Xu, Xusheng Chen, Bingsheng He, Quan Chen, September 28, 2025This post highlights our initial efforts to support a new serving paradigm, PD-Multiplexing, in SGLang. It is designed to deliver higher goodput in LLM serving. 
PD-Multiplexing leverages GreenContext, a new NVIDIA GPU capability that allows lightweight and fine-grained partitioning of GPU resources ...Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96Gby: Tianyu Zhang*, Peng Zhang*, Yusong Gao, Yun Zhang, September 26, 2025Introduction Operationalizing scaled Mixture-of-Experts (MoE) models such as DeepSeek-R1 requires a careful balance of latency, throughput, and cost. The challenge is especially acute on hardware with asymmetric performance profiles—for example, the H20 GPU, which offers high memory bandwidth but co...Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughputby: The SGLang Team, September 25, 2025The GB200 NVL72 is one of the most powerful hardware for deep learning. In this blog post, we share our progress after our previous blog post to optimize the inference performance of DeepSeek V3/R1 with FP8 attention, NVFP4 MoE, large-scale expert parallelism, prefill-decode disaggregation, and vari...Towards Deterministic Inference in SGLang and Reproducible RL Trainingby: The SGLang Team, September 22, 2025 (Updated on September 24)TL;DR: This post shares our efforts to enable deterministic inference in SGLang and our collaboration with slime to work towards reproducible RL training. <br /> Recently, the Thinking Machines Lab published a blog detailing their findings. Since this blog was published, the industry has respo...Optimizing FP4 Mixed-Precision Inference on AMD GPUsby: Haohui Mai, Lei Zhang, September 21, 2025Introduction As frontier large language models (LLMs) continue scaling to unprecedented sizes, they demand increasingly more compute power and memory bandwidth from GPUs. Both GPU manufacturers and model developers are shifting toward low-precision floating-point formats. 
FP4 (4-bit floating point) ...

SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends
by: Zhiqiang Xie, September 10, 2025
From the community: In a coding agent scenario using Qwen3-Coder-480B, the observed dialogues often stretched past 25K tokens, with around 8 turns per session. Without full KV cache retention, nearly every request required costly re-computation. By integrating SGLang HiCache with DeepSeek 3FS KVStore for ...

LongCat-Flash: Deploying Meituan's Agentic Model with SGLang
by: Meituan LongCat Team, September 01, 2025
1. Introduction: Deploying Meituan's Agentic Open-Source MoE Model. LongCat-Flash, Meituan's open-source Agentic Mixture-of-Experts (MoE) model, is now available from Hugging Face: LongCat-Flash-Chat. Released by Meituan LongCat Team, it features: 560B total params; 18.6B–31.3B (27B on average) per toke...

Fine-tune and deploy gpt-oss MXFP4: ModelOpt + SGLang
by: NVIDIA ModelOpt Team, Aug 28, 2025 (Updated on Aug 29)
OpenAI recently released gpt-oss, the first open source model family from OpenAI's lab since GPT-2. These models demonstrate strong math, coding, and general capabilities. Part of the model's uniqueness is that it was released in native MXFP4 weight-only quantization. This allows...

SGLang for gpt-oss: From Day 0 Support to Enhanced Performance
by: Liangsheng Yin, Ke Bao, August 27, 2025
We are excited to announce a major update for SGLang, focusing on deep performance optimizations and new features for the recently released openai/gpt-oss-120b model. While we had support from day zero, we took the last few weeks to enhance our engine to ensure you get the best possible performance....

GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities
by: GLM Team, July 31, 2025
Today, we are excited to introduce our latest flagship models GLM-4.5 and GLM-4.5-Air, along with their FP8 variants. All models are now available with day-one support on SGLang.
GLM-4.5 and GLM-4.5-Air are both powerful models designed to unify reasoning, coding, and agentic capabilities, with 355B...

SpecForge: Accelerating Speculative Decoding Training for SGLang
by: The SGLang Team, July 25, 2025
Speculative decoding is a powerful technique for accelerating Large Language Model (LLM) inference. In this blog post, we are excited to announce the open-sourcing of SpecForge, our new training framework for Eagle3-based speculative decoding. SpecForge is designed for ease of use and is tightly int...

Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs
by: The Mooncake Team, July 20, 2025
1️⃣ Introduction: Deploying the Most Advanced Open-Source MoE Model. Kimi K2 is currently the most advanced open-source Mixture-of-Experts (MoE) model available. Released by Moonshot AI in 2025, it features: 1 trillion total parameters; 32 billion activated parameters per token; 384 experts with dynam...

Accelerating SGLang with Multiple Token Prediction
by: Eigen AI Team, July 17, 2025
TL;DR: SGLang now supports smooth combination of these advanced features: Multiple Token Prediction (MTP), Large-Scale Expert Parallelism (EP), and Prefill-Decode disaggregation. This integration delivers up to 60% higher output throughput through a new decoding paradigm, better parallelism, and more...

How to support new VLMs into SGLang: A Case Study with NVILA
by: The NVILA Team, July 16, 2025
The world of LLMs is evolving at a remarkable pace, with Visual Language Models (VLMs) at the forefront of this revolution. These models power applications that can understand and reason about both images and text. There are tons of new VLM models emerging daily, and we want to integrate them into S...

Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang
by: Intel PyTorch Team, July 14, 2025
The impressive performance of DeepSeek R1 marked a rise of giant Mixture of Experts (MoE) models in Large Language Models (LLM).
However, its massive model size and unique architecture have posed new challenges on deployment. The significant memory requirements will normally require 8x or even 16x h...

slime: An SGLang-Native Post-Training Framework for RL Scaling
by: The slime Team, July 9, 2025
Vision That Drives slime: We believe in RL. We believe RL is the final piece toward AGI. If you feel the same way, you'll share our vision: Every field should be end-to-end RLed and every task should become an agent environment. Every RL run should last longer, and every model should scale larger. R...

OME: Revolutionizing LLM Infrastructure with Model-Driven Architecture
by: The Oracle Team, July 8, 2025
The Tale of Two Teams: Why Model Serving Is Broken. In any large organization deploying LLMs, two distinct teams emerge with conflicting needs: The ML Engineers spend months benchmarking models, experimenting with serving technologies, and crafting optimal deployment strategies. Each model demands di...

Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput
by: The SGLang Team, June 16, 2025
The GB200 NVL72 is the world's most advanced hardware for AI training and inference. In this blog post, we're excited to share early results from running DeepSeek 671B with prefill-decode disaggregation and large-scale expert parallelism on the GB200 NVL72. By leveraging Blackwell-specific features ...

Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs
by: The SGLang Team, May 5, 2025
DeepSeek is a popular open-source large language model (LLM) praised for its strong performance. However, its large size and unique architecture, which uses Multi-head Latent Attention (MLA) and Mixture of Experts (MoE), require an advanced system for efficient serving at scale.
In this blog, we exp...

SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs
by: The SGLang Team, December 4, 2024
We’re excited to release SGLang v0.4, featuring significant performance improvements and new features: Zero-overhead batch scheduler: 1.1x increase in throughput. Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate. Data parallelism attention for DeepSeek mo...

Announcing a New Site for Chatbot Arena
by: LMSys Team, Sep 20, 2024
We’re excited to share that Chatbot Arena now has its own dedicated website: lmarena.ai and blog! You might be wondering why we’re making this change. Over the past year, with the incredible support of our community, Chatbot Arena has evolved into a mature ecosystem and platform. We believe it’s tim...

RedTeam Arena: An Open-Source, Community-driven Jailbreaking Platform
by: Anastasios Angelopoulos*, Luca Vivona*, Wei-Lin Chiang*, Aryan Vichare, Lisa Dunlap, Salvivona, Pliny, Ion Stoica, Sep 13, 2024
We are excited to launch RedTeam Arena, a community-driven redteaming platform, built in collaboration with Pliny and the BASI community! ...

SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision
by: The SGLang Team, September 4, 2024
We're excited to announce the release of SGLang v0.3, which brings significant performance enhancements and expanded support for novel model architectures. Here are the key updates: Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA); Up to 1.5x lower latency with torch.compile...

Does style matter? Disentangling style and substance in Chatbot Arena
by: Tianle Li*, Anastasios Angelopoulos*, Wei-Lin Chiang*, Aug 29, 2024
Why is GPT-4o-mini so good?
Why does Claude rank so low, when anecdotal experience suggests otherwise? We have answers for you. We controlled for the effect of length and markdown, and indeed, the ranking changed. This is just a first step towards our larger goal of disentangling substance and style...

Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)
by: The SGLang Team, Jul 25, 2024
At LMSYS.org, we've been running the Chatbot Arena platform for over a year, serving millions of users. We know firsthand how crucial efficient serving is for AI products and research. Through our operational experiences and in-depth research, we've continuously enhanced the underlying serving syste...

RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing
by: Isaac Ong*, Amjad Almahairi*, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, July 1, 2024
LLMs have demonstrated remarkable capabilities across a range of tasks, but there exists wide variation in their costs and capabilities, as seen from the plot of performance against cost in Figure 1. Very broadly, more capable models tend to be more expensive than less capable models. This leads to ...

The Multimodal Arena is Here!
by: Christopher Chou*, Lisa Dunlap*, Wei-Lin Chiang, Ying Sheng, Lianmin Zheng, Anastasios Angelopoulos, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez, June 27, 2024
Multimodal Chatbot Arena: We added image support to Chatbot Arena! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against each other. In just two weeks, we have collected over 17,0...

Introducing Hard Prompts Category in Chatbot Arena
by: Tianle Li, Wei-Lin Chiang, Lisa Dunlap, May 20, 2024
Background: Introducing Hard Prompts, a new and challenging category in the Chatbot Arena Leaderboard.
Over the past few months, the community has shown a growing interest in more challenging prompts that push the limits of current language models. To meet this demand, we are excited to introduce the...

What’s up with Llama 3? Arena data analysis
by: Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. Gonzalez, Wei-Lin Chiang, May 8, 2024
On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English Chatbot Arena leaderboard with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this bl...

LMSYS Kaggle Competition – Predicting Human Preference with $100,000 in Prizes
by: LMSYS Arena Team, May 2, 2024
Overview: LMSYS and Kaggle are launching a human preference prediction competition! You are challenged to predict which responses users will prefer in head-to-head battles between Large Language Models (LLMs). You'll work with a dataset from the Chatbot Arena, containing conversations and user prefer...

From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline
by: Tianle Li*, Wei-Lin Chiang*, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica, April 19, 2024
Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should 1) robustly separate model capability, 2) reflect human preference in real-world use cases, and 3) frequently update to avoid over-fitting or test set leakage. Traditional ...

LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation
by: LMSYS Arena Team, Mar 1, 2024
Our Mission: Chatbot Arena (lmarena.ai) is an open-source project developed by members from LMSYS and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations.
We maintain the open evaluation platform for any user to rate LLM...

Fast JSON Decoding for Local LLMs with Compressed Finite State Machine
by: Liangsheng Yin, Ying Sheng, Lianmin Zheng, Feb 5, 2024
Constraining an LLM to consistently generate valid JSON or YAML that adheres to a specific schema is a critical feature for many applications. In this blog post, we introduce an optimization that significantly accelerates this type of constrained decoding. Our approach utilizes a compressed finite s...

Fast and Expressive LLM Inference with RadixAttention and SGLang
by: Lianmin Zheng*, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng*, Jan 17, 2024
Large Language Models (LLMs) are increasingly utilized for complex tasks that require multiple chained generation calls, advanced prompting techniques, control flow, and interaction with external environments. However, there is a notable deficiency in efficient systems for programming and executing ...

Chatbot Arena: New models & Elo system update
by: Wei-Lin Chiang, Tim Li, Joseph E. Gonzalez, Ion Stoica, Dec 7, 2023
Welcome to our latest update on the Chatbot Arena, our open evaluation platform to test the most advanced LLMs. We're excited to share that over 130,000 votes have now been collected to rank the 40+ most capable models! In this blog post, we'll cover the results of several new models: Tulu-2-DPO-70B...

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
by: Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, November 21, 2023
TL;DR: We introduce lookahead decoding, a new, exact, and parallel decoding algorithm to accelerate LLM inference.
Lookahead decoding breaks the sequential dependency in autoregressive decoding by concurrently extracting and verifying n-grams directly with the LLM, utilizing the Jacobi iteration me...

Recipe for Serving Thousands of Concurrent LoRA Adapters
by: Ying Sheng*, Shiyi Cao*, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, November 15, 2023
In this blog post, we introduce S-LoRA (code), a system designed for the scalable serving of many LoRA adapters. S-LoRA adopts the idea of Unified Paging for KV cache and adapter weights to reduce memory fragmentation. Heterogeneous Batching of LoRA computation with different ranks leveraging optim...

Catch me if you can! How to beat GPT-4 with a 13B model
by: Shuo Yang*, Wei-Lin Chiang*, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica, Nov 14, 2023
Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSM-8K/HumanEval)! To ensure result validity, we followed OpenAI's decontamination method and found no evidence of data contamination. ...

ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions
by: Zi Lin*, Zihan Wang*, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang, October 30, 2023
In this blogpost, we introduce ToxicChat, a benchmark consisting of 10K high-quality data for content moderation in real-world user-AI interactions. Evaluation results show that fine-tuning on this benchmark notably improves a baseline model’s ability to detect toxic queries in user-AI interactions....

Chatbot Arena Conversation Dataset Release
by: LMSYS Org, July 20, 2023
Since its launch three months ago, Chatbot Arena has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation.
In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. In this blog post, we ...

How Long Can Open-Source LLMs Truly Promise on Context Length?
by: The LongChat Team, June 29, 2023
In this blogpost, we introduce our latest series of chatbot models, LongChat-7B and LongChat-13B, featuring a new level of extended context length up to 16K tokens. Evaluation results show that the long-range retrieval accuracy of LongChat-13B is up to 2x higher than other long-context open models s...

Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B
by: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Hao Zhang, June 22, 2023
In this blog post, we share the latest update on Chatbot Arena leaderboard, which now includes more open models and three metrics: Chatbot Arena Elo, based on 42K anonymous votes from Chatbot Arena using the Elo rating system. MT-Bench score, based on a challenging multi-turn benchmark and GPT-4 gr...

Building a Truly "Open" OpenAI API Server with Open Models Locally
by: Shuo Yang and Siyuan Zhuang, June 9, 2023
Many applications have been built on closed-source OpenAI APIs, but now you can effortlessly port them to use open-source alternatives without modifying the code. FastChat's OpenAI-compatible API server enables this seamless transition. In this blog post, we show how you can do this and use LangChai...

Chatbot Arena Leaderboard Updates (Week 4)
by: LMSYS Org, May 25, 2023
In this update, we are excited to welcome the following models joining the Chatbot Arena: Google PaLM 2, chat-tuned with the code name chat-bison@001 on Google Cloud Vertex AI; Anthropic Claude-instant-v1; MosaicML MPT-7B-chat; Vicuna-7B. A new Elo rating leaderboard based on the 27K anonymous voting ...

Chatbot Arena Leaderboard Updates (Week 2)
by: LMSYS Org, May 10, 2023
We release an updated leaderboard with more models and new data we collected last week, after the announcement of the anonymous Chatbot Arena.
We are actively iterating on the design of the arena and leaderboard scores. In this update, we have added 4 new yet strong players into the Arena, including...

Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings
by: Lianmin Zheng*, Ying Sheng*, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, May 3, 2023
We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in ches...

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
by: The Vicuna Team, March 30, 2023
We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LL...