Detailed pricing plans are not available yet for this tool.
Switch to FriendliAI and get up to 50K inference credit! Apply now.

Friendli Inference
The fastest LLM inference engine on the market

GROUNDBREAKING PERFORMANCE
- 50~90% cost savings
- Up to 6× fewer GPUs required [1]
- 10.7× higher throughput [2]
- 6.2× lower latency [3]

01 What Friendli Inference offers

Speed up the serving of LLMs, thus slashing costs by 50~90%
Friendli Inference is highly optimized to make LLM serving fast and cost-effective. Process LLM inference with Friendli Inference, the fastest engine on the market. Our performance testing shows that Friendli Inference is significantly faster than vLLM and TensorRT-LLM.

Multi-LoRA serving on a single GPU
Friendli Inference simultaneously supports multiple LoRA models on fewer GPUs (even on just a single GPU!), a remarkable leap in making LLM customization more accessible and efficient.

Deploy LLMs and more!
Friendli Inference supports a wide range of generative AI models, including quantized models and MoE.

02 Key Technology

Iteration batching (aka continuous batching)
Iteration batching is a new batching technology we invented to handle concurrent generation requests very efficiently. Iteration batching can achieve up to tens of times higher LLM inference throughput than conventional batching while satisfying the same latency requirement. Our technology is protected by our patents in the US, Korea, and China.

DNN library
Friendli DNN Library is the set of optimized GPU kernels carefully curated and designed specifically for generative AI.
Our novel library allows Friendli Inference to support faster LLM inference across various tensor shapes and data types, as well as quantization, Mixture of Experts, LoRA adapters, and so on.

Friendli TCache
Friendli TCache intelligently identifies and stores frequently used computational results. Friendli Inference leverages the cached results, significantly reducing the workload on the GPUs.

Speculative decoding
Friendli Inference natively supports speculative decoding, an optimization technique that speeds up LLM/LMM inference by making educated guesses on future tokens in parallel while generating the current token. By validating the guessed future tokens, speculative decoding ensures identical model outputs at a fraction of the inference time.

03 Highlights

Running Quantized Mixtral 8x7B on a Single GPU
We quantized the Mixtral-8x7B-Instruct v0.1 model with AWQ and ran it on a single NVIDIA A100 80GB GPU. Both TTFT (time to first token) and TPOT (time per output token) outperform a baseline vLLM system. Friendli Inference achieves at least 4.1x faster response time and 3.8x ~ 23.8x higher token throughput.

Quantized Llama 2 70B on Single GPU
With Friendli Inference, running AWQ-ed models is seamless. For example, one can run AWQ-ed LLMs (e.g., Llama 2 70B 4-bit on a single A100 80GB GPU) natively on Friendli Inference. Running LLMs with AWQ on Friendli Inference enables efficient LLM deployment and remarkable efficiency gains without sacrificing accuracy.

Even faster TTFT with Friendli TCache
Friendli TCache reuses recurring computations, optimizing TTFT (Time to First Token) by leveraging cached results.
We show that our engine delivers 11.3x to 23x faster TTFT compared to vLLM.

HOW TO USE
Three ways to run generative AI models with Friendli Inference:

01 Dedicated Endpoints
Build and run generative AI models on autopilot.

02 Container
Serve LLM and LMM inferences with Friendli Inference in your private environment.

03 Serverless Endpoints
Call our fast and affordable API for open-source generative AI models.

1. Testing conducted by FriendliAI in October 2023 using Llama-2-13B running on Friendli Inference. See the detailed results and methodology here.
2. Performance compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-ed Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150. Evaluation conducted by FriendliAI.
3. Performance of Friendli Container compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-ed Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150, mean requests per second = 0.5. Evaluation conducted by FriendliAI.

Explore FriendliAI today: get started or talk to an engineer.

Pricing built to scale with your growth
Fast, reliable, and affordable inference at any scale.
Get started instantly with self-serve, or contact us for enterprise deployments.

Serverless Endpoints
Run the fastest frontier model inference with a simple API call.

Dedicated Endpoints
Run dedicated inference with unmatched speed and reliability at scale.

Container
Run inference with full control and performance in your environment.

Serverless API Pricing
Get instant access to the fastest frontier model inference with a simple API call.
Text and vision: pay per token or GPU time.

Model                            $ / 1M tokens
Llama-3.1-8B-Instruct            $0.1
Llama-3.3-70B-Instruct           $0.6
K-EXAONE-236B-A23B               $0.2 Input · $0.1 Cached Input · $0.8 Output
Qwen3-235B-A22B-Instruct-2507    $0.2 Input · $0.8 Output
MiniMax-M2.1                     $0.3 Input · $0.15 Cached Input · $1.2 Output
MiniMax-M2.5                     $0.3 Input · $0.06 Cached Input · $1.2 Output
DeepSeek-V3.2                    $0.5 Input · $0.25 Cached Input · $1.5 Output
GLM-4.7                          $0.6 Input · $2.2 Output
Kimi-K2.5                        $0.6 Input · $3 Output
GLM-5                            $1 Input · $0.5 Cached Input · $3.2 Output

Model                            $ / second
Qwen3-30B-A3B                    $0.002
DeepSeek-V3.1                    $0.004

For models where cached input pricing is not listed, prompt caching discounts may be available for enterprise deployments.
Contact us to learn more.

Dedicated Endpoints Pricing
Get instant access to the fastest frontier model inference with a simple API call.

Basic
Get started with:
- Pay-as-you-go
- On-demand GPUs
- Support for custom, fine-tuned, and open-source models
- Automatic traffic-based scaling
- Real-time performance, usage, and log visibility
- Zero-downtime model updates
- Multi-LoRA support
- SOC2 compliance
- Email and in-app chat support

Enterprise
Everything in Basic, plus:
- Reserved GPUs
- Priority access to high-demand GPU types
- Hands-on engineering expertise
- Dedicated Slack support
- VPC and on-prem deployment options
- Enterprise-grade security and compliance
- Custom global region deployment
- 99.99% availability SLAs
- Discounts on monthly reserved GPUs
Contact us for Enterprise.

On-demand deployment
Only pay for the compute you use, down to the second, with no extra charges for start-up times.

GPU Type          $ / hour (billed per second)
A100 80GB GPU     $2.9
H100 80GB GPU     $3.9
H200 141GB GPU    $4.5
B200 180GB GPU    $8.9

For estimates of per-token prices, see this page. Results vary by use case, but we often observe 2-3x higher throughput and faster speed on FriendliAI compared to open source inference engines.

Container Pricing
Run inference with full control and performance in your environment. Contact us.
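The listed prices can be turned into a rough cost estimate. The sketch below is illustrative only: the function names `serverless_cost` and `dedicated_cost` are hypothetical, and it assumes that cached input tokens are counted separately from uncached input tokens, that a model with no listed cached-input price is charged the full input rate, and that no minimum charges or rounding rules apply. Actual billing may differ.

```python
# Prices copied from the pricing tables (USD). Only a few models are shown.
SERVERLESS_PER_1M_TOKENS = {
    "DeepSeek-V3.2": {"input": 0.5, "cached_input": 0.25, "output": 1.5},
    "GLM-4.7": {"input": 0.6, "output": 2.2},  # no cached-input price listed
}

GPU_PER_HOUR = {
    "A100 80GB": 2.9,
    "H100 80GB": 3.9,
    "H200 141GB": 4.5,
    "B200 180GB": 8.9,
}


def serverless_cost(model, input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate the cost in USD of token-billed serverless usage.

    Assumption: cached_input_tokens are NOT included in input_tokens.
    """
    price = SERVERLESS_PER_1M_TOKENS[model]
    # Assumption: without a listed discount, cached input costs the input rate.
    cached_rate = price.get("cached_input", price["input"])
    total = (
        input_tokens * price["input"]
        + cached_input_tokens * cached_rate
        + output_tokens * price["output"]
    )
    return total / 1_000_000


def dedicated_cost(gpu, seconds, num_gpus=1):
    """Estimate the cost in USD of on-demand GPUs billed per second."""
    per_second = GPU_PER_HOUR[gpu] / 3600
    return per_second * seconds * num_gpus


if __name__ == "__main__":
    print(serverless_cost("DeepSeek-V3.2", 2_000_000, 1_000_000, 1_000_000))
    print(dedicated_cost("H100 80GB", seconds=1800))
```

Comparing the two estimates for an expected workload gives a first indication of whether per-token or per-GPU billing is cheaper; the per-GPU figure also depends on the throughput actually achieved, which this sketch does not model.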
Serving Generative AI for All

MISSION
Empowering organizations to harness the full potential of generative AI models with ease and cost-efficiency.

A world where any company can use generative AI.
We believe that the efficient and scalable use of generative AI models should be for everyone.

Efficient, automated generative AI model serving.
By eliminating the complexities of generative AI serving, we aim to empower more companies to achieve innovation with generative AI.

LEADERSHIP
Leading the development of generative AI serving with a brilliant team

Byung-Gon Chun, Founder & CEO
- Professor, Computer Science and Engineering Department, Seoul National University (Sabbatical Leave)
- Visiting Research Scientist, Facebook
- Principal Scientist, Microsoft
- Research Scientist, Yahoo!
- Research Scientist, Intel
- Ph.D., Computer Science, University of California, Berkeley
- M.S., Computer Science, Stanford University

Gyeong-In Yu, CTO
- Ph.D., Computer Science and Engineering, Seoul National University

Ryan Pollock, VP Go-to-Market
- Former Marketing Executive at Together AI, Vultr, DigitalOcean, and Google Cloud
- B.S., Computer Science, Cornell University

Brian Yoo, Advisor
- Former COO of Moloco, pre-IPO AI company with $250M+ revenue
- M.S., Operations Research and Industrial Engineering, Cornell University

Our Partners
We collaborate with global leaders to maximize performance and efficiency in AI inference, driving innovation and advancing the community together.
Become a partner.

Ecosystem Partners
Ecosystem partners extend our technology into new domains, helping customers seamlessly integrate, deploy, and scale AI solutions across industries.