Introducing WebCode - search evals for coding agents

The best search API for AI
Powering agents with fast, high-quality web search

Thousands of companies search the web with Exa

Benchmarks
Highest performance across every use case
Best-in-class accuracy and latency across the most challenging benchmarks.

- Accuracy: Exa leads across FRAMES, Tip-of-Tongue, and Seal0, the most demanding retrieval benchmarks.
- Latency: Exa Instant returns results in under 180ms, faster than any other search provider.
- Search Verticals: Best-in-class across company search, people search, and code, not just general web queries.

[Chart: Accuracy (%) by provider (Exa, Perplexity, Brave) on FRAMES, Tip-of-Tongue, and Seal0. Bar labels in extraction order: 54.4%, 44.5%, 21.6%, 54.2%, 36.7%, 19.3%, 36%, 34.5%, 8.2%]

Data for every industry
State of the art web indexes for every use case

Coding Agents
Cursor solves complex issues in seconds with Exa's low latency search.
Search over millions of GitHub repos, docs, and Stack Overflow with high-accuracy code retrieval.

News
Notion's agents find the latest news with Exa's always fresh web index.
Search and summarize recent news, trends, and anything else across the web.

Finance
Point72 uses Exa's index of 70M+ companies for up-to-date financial data.
Access historical market data, company signals, and news from any time period.

Recruiting
Hubspot monitors for updates across 1B+ people and companies with Exa.
Search our index of 1B+ profiles to find the best candidate information.

Consulting
*Big 3 consulting firm: uses Exa to get token efficient highlights for their research agents.
Run deep research across PDFs, web pages, and expert sources for any industry.

Search API
Powerful features built into your search

Structured Output
    "content": {
      "companies": [
        { "company_name": "General Electric", "ceo_name": "Larry Culp", "founded_year": 1892 },
        { "company_name": "RTX", "ceo_name": "Christopher T. Calio", "founded_year": 2020 },
        { "company_name": "Boeing", "ceo_name": "Kelly Ortberg", "founded_year": 1916 }
      ]
    },

Deep web research with structured outputs
Structured outputs enable complex enrichment workflows requiring web search.
Extracts structured outputs from Exa's database of 70M+ companies.

Industry-leading web index built for agents
Dedicated high quality web indexes for every use case: people, companies, code docs, financial data, and news.
Powers their coding agents with Exa's low-latency web search (<180ms).

Highlights: On (4K chars)
Wikipedia - Boeing
airplanes, rotorcraft, rockets, satellites, and missiles worldwide. The company also provides leasing and product support services. Boeing is among the largest global aerospace manufacturers; it is the fourth-largest defense contractor in the world based on 2022 revenue and is the largest exporter in the United States by dollar value. Boeing was founded in 1916 by William E. Boeing in Seattle, Washington. The present corporation is the result of the merger of Boeing with McDonnell Douglas on August 1, 1997. As of 2023, the Boeing Company's corporate headquarters is located in the Crystal City neighborhood of Arlington County, Virginia.

Token-efficient contents makes agents smarter
Highlights extract the most relevant excerpts from pages for your query.
Uses highlights to reduce their token budgets and LLM costs by over 50%.

Enterprise-grade security and controls

Zero Data Retention
Ensure true privacy and compliance with customized ZDR.
All queries and data can be automatically purged based on your requirements.

SOC 2 Type II Certified
Our security framework maintains the highest level of compliance with industry standards. Safe information processing and access control.

Single-Sign On
A seamless, secure login experience for your entire team. Built-in team authentication and authorization management.

Trusted by world-class teams

OpenRouter has a laser focus on bringing a great developer experience to all language models, and Exa is the best way we've found for grounding AI in the real world in a model-agnostic way.
Alex Atallah, CEO, OpenRouter

Exa's powerful search capabilities have been instrumental in delivering the high-quality, relevant web content our users need while maintaining our commitment to privacy and user control.
Sarah Sachs, AI Engineering Lead, Notion

This is so powerful. Exa is like Perplexity-as-a-service. The infrastructure to ground your AI products on real world data and facts.
Guillermo Rauch, CEO, Vercel

Exa's strong coverage and flexible API have been a key differentiator for us. Scientists trust our product further when the relevant papers they expect to see are available to them in the right workflows. The ability to customize search based on the use case easily, is important.
Naveed Janmohamed, CEO, Anara

When we saw companies show up in Exa that we hadn't found anywhere else, we knew it was our best option. It delivers, for a few dollars, what we'd been spending hundreds of thousands to assemble.
David Boskovic, CEO, Flatfile

We use Exa for essentially all parts of our research - gathering sources, documenting, creating notes, and building briefs.
Christopher Varner, Head of Results Innovation

StackAI and Exa are the dream team… Exa is amazing at web search!
We're proud to use Exa on our platform and to show our customers how web search can power AI agent workflows.
Bernard Aceituno, Co-founder and President, StackAI

Models are only as good as the data they're trained on, and Exa's search allowed us to get high quality data we couldn't find any other way.
Jonathan Frankle, Chief AI scientist, Databricks

---

/pricing

Run up to 1,000 requests for free every month

Search (100-1200ms)
List of results and their contents.
- $7/1k requests (1-10 results)
- +$1 per 1k additional results beyond 10
Best for:
- Web search tool calls for agents
- Different latency profiles: Instant, Fast, Auto
- Built-in text and highlights
- +$1/1k summaries

Agentic Search (4-30s)
Search with Deep mode for structured outputs.
- $12/1k requests
- +$3 per 1k requests with reasoning enabled
Best for:
- Deep research and multi-step agent workflows
- Structured output support
- Higher reasoning capability

Contents (500ms)
Token-efficient webpage contents.
- $1/1k pages per content type
Best for:
- Retrieving full page content for LLM context
- Rich full-page contents
- Receive truncated or with highlights

Answer
Direct answers backed by citations
- $5/1k answers

Research
Autonomous research tasks
- Agent search operations: $5
- Agent page reads*: $5
- Reasoning tokens (/1M): $5
*Costs $10 for exa-research-pro.
A page is defined as 1,000 tokens of content from webpages.

Enterprise
For high volume, custom datasets, enterprise security, and more.

Powerful search
- Up to 1,000 results per search
- Requests with more than 25 results
- Custom rate limits (QPS)
- Tailored moderation

Enterprise-Grade Support
- SLAs and MSAs
- 1:1 onboarding and support
- Zero Data Retention

Custom Pricing
- Volume discounts
- Standard billing

Startup and Education Grants
Build comprehensive web search into your startup or education project, for free.
- $1000 worth of free credits
- Build, launch, and test search

---

Dear curious internet wanderer,

Welcome to Exa. Exa is an applied AI lab building a search engine unlike the world has ever seen.

We don't have ads. We sell our search as an API, optimized exclusively for quality, latency, and customizability. That's why customers like Cursor, Lovable, and thousands of others use us to power agents and workflows.

But that's just today. Exa's ultimate goal is perfect search.

What is perfect search? It's simple – you should be able to know anything about the world. Literally anything, like:

"Find every engineer in NYC who's written a blog post about philosophy"
"Find all the political clips that went viral in Europe in the past month"

Perfect search is highly controllable, unbiased, comprehensive. Because we lack it, we each walk around with an incomplete understanding of nearly everything.

Building perfect search will be extremely hard. But it's also necessary.
It's critical civilizational infrastructure for our new AI reality.

Politics is fragmenting, wars are raging, and technology is accelerating -- we need tools that deeply inform us what's going on. If we don't gain control over the planet's information, we will lose control over our planet.

In short, the world needs perfect search. It needs an organization with pure incentives to build it. And because no one else is doing it, that's why we have to.

Will Bryk, CEO

How we're doing it

Web-scale infrastructure
Building a search engine from scratch requires building massive-scale infrastructure. There are 100s of billions of webpages (roughly an exabyte!) that need storing, processing, indexing, and serving at high throughput.
Building this is fun but quite difficult. That's why search tools, like ChatGPT, rely on 3rd party search engines under the hood.

Novel neural architectures
We train novel architectures for web search using end-to-end neural networks. Unlike keyword methods, neural methods get better with more compute and will win in the long run.
We're lucky to now own hundreds of H200s worth of research compute... also known as our exa-cluster :)

World-class team
We've been assembling some of the smartest engineers, researchers, and operators in the world to build the best search engine in history.
We've raised over $100M from top investors including Benchmark, Lightspeed, Nvidia, and YC, and are advised by top researchers from OpenAI, Google, and Bing.

We're an SF team of builders and researchers

Will Bryk, CEO
Will was one of the first engineers at Cresta where he built real time AI products. He studied CS and physics at Harvard, where he researched human/AI interaction and led the robotics club. Will considers himself an expert in both embedding models and chocolate chip cookies -- the jury is still out on which is more critical for company operations.

Jeff Wang, Co-founder
Jeff spent three years building data and web infra at Plaid.
He studied CS and Philosophy at Harvard, where he ran a GPU cluster in his dorm room and was roommates with Will. The team estimates that 20% of social analysis in San Francisco traces back to one of Jeff's many viral tweets.

Ben Chen, Technical Staff
Ben previously did quant trading at SIG and before that took the hardest math course in the country at Harvard. When we find frisbees, tailor made suits, or scribbled math formulas lying around the office, there's usually a Ben behind it.

Hubert Yuan, Technical Staff
Hubert previously worked on projects like particle simulations and automated wheelchairs. He studied CS in the Yao Class at Tsinghua University. Hubert's appetite for clean microservice architecture is perhaps only matched by his appetite for Haribo sour candy.

Shreyas Sreenivas, Technical Staff
Shreyas previously worked on various projects, from training neural networks in Haskell to building a game streaming engine. He studied CS at the University of Waterloo. You can typically find Shreyas analyzing the price/performance of AWS services or crushing the team in basketball, sometimes at the same time.

Michael Fine, Technical Staff
Michael previously worked on ML and privacy at various companies, including Apple. He studied CS at Harvard University. He also somehow finds time to cook chef-level meals and have PhD-level knowledge on nearly everything -- both of which the team enjoys consuming.

Felix Zeller, Technical Staff
Felix previously worked on open-source projects from next-gen text editors to composable knowledge management systems. He (almost) studied CS and philosophy at UIUC until he realized that he is already a beast. The only job Felix should not do is corrosion engineering, because he deeply wishes to convert the world into rust.

Stacey Tara, People + Workplace Ops
Stacey previously worked on business development at Awesomic. She got a bachelor's and master's from Taras Shevchenko National University of Kyiv, basically the Harvard of Ukraine.
Stacey goes by many names at Exa -- workplace operator, recruiting coordinator, chief happiness officer -- but perhaps her most beloved name is "greatest cookie baker of all time". These cookies are unfairly delicious.

Joshua Ahn, Technical Staff
Joshua previously studied CS at the University of Chicago, where he solved ML problems on 3D reconstruction. These days, he builds virtual worlds and massive lego datasets (and even bigger Exa datasets). Given his ML abilities and all the cities he's lived in across the Midwest/East Coast, some believe Joshua has neurally solved the Travelling Salesman Problem in polynomial time.

Ishan Goswami, Technical Staff
Ishan previously cofounded a text-to-video startup. Before that, he was a software engineer at Rephrase AI, which got acquired by Adobe. He has a Computer Science degree and has been coding since he was 13 years old. Ishan has so much energy and has shipped so many Exa apps that some on the team believe that when our LLM APIs are overprovisioned Ishan personally responds to each API request.

Joao Adriano, Technical staff
Adriano's coding journey started by crafting custom Warcraft 3 maps (yes, he's been at it that long). He skipped college and has been coding professionally for 12 years -- his most recent adventure was hacking equipment protocols on cargo ships. At Exa, Adriano uses AI tools to update so much code across the stack that the team's only proof he isn't actually Claude 5 under the hood is his mastery of swing dancing.

Carlos Marques, Technical staff
Carlos previously did MLOps for several years, building training infra for 1000s of models across cloud and edge. He studied electrical/computer engineering in Portugal, where he researched RL for computation offloading.
Carlos named our GPU cluster 'Hephaestus', presumably after the Greek god of blacksmiths, but more likely because it's hard to spell so that only he can successfully ssh into it and hog all the compute.

Gabriel Cammany, Technical staff
Gabriel previously founded the startup Sofon, a graph-based tool for ranking people’s authority. Before that, he was a founding engineer at Diagonal, an API for crypto payments. He studied CS in Barcelona where he built ANN-based particle trackers at CERN. Gabriel has spent his adult life indulging in huge quantities of two things -- AI knowledge tools and olive oil. The team is still unsure which is less healthy.

Tom An, Technical staff
Tom studied CS at CMU, where he implemented network protocols, optimizing compilers, and a cloud hybrid file system. Before that, he worked at Google on highly optimized, global scale services. Tom can see the beauty in many things, whether it's a well designed vectorDB or a motorcycle ride down to Santa Cruz with his 50 year old camera and a roll of film.

Felicia Tang, Chief of Staff
Felicia previously worked on Growth/Strategy at startups in search and climate tech, and prior to that did private equity at Blackstone. She studied Environmental Economics at UC Berkeley. Felicia is one of the very few people who has both won debate championships and deeply studied Buddhism, meaning she has mastered thinking both fast and slow.

Anca Negoiu, Technical Staff
Anca previously built AI search pipelines and databases at Primer.ai. Before that, she studied Computer Science at Princeton. Anca grew up in Romania, is fond of Egyptian bellydancing, and has mastered American free style rap; however, she insists her preferred world language is still Python.

Liam Hinzman, Technical staff
Liam previously worked as a software engineer at Numerai before realizing that he can also totally master art and so became an artist.
After he created Exa's infamous 30 foot office mural and then helped code up our AI powered Websets product, Liam became universally known as the guy who put the AI in renAIssance man.

Sam Mitchell, Technical staff
Sam went all-in on competitive math in high school, earning a spot at the Math Olympiad Summer Program at CMU. He studied CS at MIT, and then researched efficient GNN architectures at the IBM Watson AI Lab. Sam’s mission at Exa is simple: train the ultimate search engine. The team believes that his breakthroughs will not occur at his desk, but rather on the giant blue bean bag where he ponders new ideas.

Song You, Technical staff
Song previously founded the company Jouncer for sharing projects across art and engineering, and then was a frontend engineer at Instabase. She studied Computer Engineering at the University of Toronto, where she developed her love for America. Song spent two years as an artist in Brooklyn creating scifi art before she realized: why paint it when you can build it?

Shyam Kumar, Product
Shyam previously founded Kapstan, a Kubernetes abstraction layer for better container orchestration. Before that, he was a software engineer at Apple and consultant at BCG. He studied at Berkeley and MIT, though he'd argue his real learning came through teaching, from SAT prep to middle school basketball. The team sometimes wonders if Shyam's loyalty lies with eng or GTM, to which he responds 'porque no los dos?'

Mark Pekala, Technical staff
Mark previously worked on video-conferencing at Ameelio, indexing video at Kino AI, and caching and materialization at Sigma Computing. Before that he studied math and CS at Harvard, while building learned indices in the Data Systems lab.
Mark owns and manages Exa’s army of agents, and he also personally owns and manages a colony of ants, causing some to speculate that Mark sees something deeper here.

Tyler Killian, Technical staff
Tyler studied math and CS at McGill University, where he scored Top 500 on the Putnam and was a Codeforces Candidate Master. When he’s not building state-of-the-art crawling systems, you’ll find him doing 15 meter kiteboarding jumps, thinking about math, or somehow solving a 7x7 Rubik's cube just from "thinking about the colors".

Linh Nguyen, Technical staff
Linh previously worked in fintech on explainable ML models for credit underwriting. She studied Chemical Engineering at MIT before realizing she’d rather ride the AI train with her tech bro friends. Linh tries to spend more time in the wilderness than she does looking at screens, but when she lapses you can find her scrolling Google Maps, Strava, and Exa search data.

Will Roberts, Sales
Will studied Industrial Engineering at both the University of Michigan and UC Berkeley, where he swam competitively. Before Exa, he worked as a research engineer at SRI International, followed by a role as an associate at Kilonova Capital. Sometimes people wonder how Will powers through so many sales each day, until they realize he didn't become NCAA Team National Champion by staying put.

Ben Chan, Technical staff
Ben previously did a PhD at Cornell building the fastest distributed algorithms in the world, now taught at universities and powering famous protocols. Before that he studied CS at MIT. Ben likes to dive really deep into things, from calligraphy to plants to the unfalsifiability of the Physical Church-Turing Thesis to an absolutely massive report on GPT2 experiments that really should've broken GitHub's markdown limits.

Rohit Prakash, Technical staff
Rohit’s coding journey began when he was 11 years old making video games with C++ (this also led to a 5 year programming hiatus).
Most recently, he graduated in CS from NYU where he spent time debugging kernel programs. While he is somewhat famous on X, he is more famous at Exa as the only member who switches code editors more than he does LLMs.

You, Technical staff
You previously worked on some project that demonstrated exceptional skill. You studied CS at somewhere, but far more importantly want to learn by joining a startup working on massive-scale ML/infra. You are excited to tackle a mission as old as ancient Greece -- organize the world's knowledge -- and recognize that to do that You must meet Us and become We.

---

Introducing WebCode - search evals for coding agents

# State of code search

Today, we're open-sourcing a set of coding evaluations, WebCode, that we built to evaluate web search for coding agents. At Exa, we power search for most of the largest coding agent companies, and we have observed a surge in code search queries over the past year, with a particularly large jump at the end of 2025.

Figure 1: Code search queries on Exa

This growth pushed us to focus on code search, where precision is especially important: agents build on retrieved context across many steps, so stale or noisy search results can poison or even derail the reasoning processes of long-running agents. Documentation, changelogs, and issues update constantly, so we built a dedicated ingestion pipeline focused on fresh, clean results over full-page chrome.

To build better code search, effective methods for evaluating improvements are necessary.
We observed that existing public web search benchmarks for coding agents have significant gaps.

# Why public benchmarks fail for retrieval

The canonical problems of public benchmarks (saturation, contamination, and indirect training on eval problem sets) are well-documented. OpenAI recently deprecated its own SWE-bench Verified benchmark for exactly this reason [1]: "Models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests... improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities."

[1] Why we no longer evaluate SWE-bench Verified - OpenAI, 2026.

A model with benchmark answers in its training distribution will produce correct outputs via parametric memory rather than through reasoning, and the eval has no mechanism to distinguish between the two.

Search is relevant precisely where this parametric memory fails: for tasks that are new, niche, or complex. For agents writing code, relevant documents such as changelogs, GitHub issues, and SDK docs are continuously updating, making real-time fluency a necessity.

# Evals Design

Search consists of two main components:
- Contents quality: contents that correctly answer the query
- Retrieval quality: identification of relevant URLs that contain the contents necessary to answer the query

Figure 2: Index Quality and Retrieval Quality

# 1. Contents Quality

# 1.1 Evaluating against a golden extraction

Given a URL, we measure how faithfully each provider extracts its content, graded against a golden reference built from rendered screenshots and the DOM.

Defining "correct content" for a web page is a rabbit hole: should navigation be included? User-profile widgets? Sidebar links? We constrain the problem by asking: what content is maximally useful for an LLM answering a coding question from this page?
That means keeping the substantive body (prose, code blocks, API signatures, tables) and stripping everything else.

To build the golden reference, we render each page in a Browserbase cloud browser, screenshot it end-to-end, and feed the screenshots alongside the DOM to a state-of-the-art multimodal model that produces markdown as faithful as possible to what's rendered. Working from rendered pixels rather than raw HTML means we see the page after JavaScript execution, lazy loading, and dynamic rendering: the same content a human would see in a browser.

Figure 3: Golden reference pipeline: render page in a cloud browser, screenshot end-to-end, then feed screenshots and DOM to a multimodal model to produce faithful markdown

# How do we benchmark?

While the reference above is the ground truth every content extraction is scored against, there are nuances to scoring. An extractor can:
- fetch correctly but corrupt formatting
- preserve structure perfectly while dropping half the page

We score along multiple dimensions to diagnose where providers fail, using two complementary approaches:
- LLM-judged metrics that assess semantic dimensions string-matching struggles with: completeness (is the golden content present without excess?), accuracy (are numbers, code, and names faithful to the source?), and structure (are headings, lists, tables preserved?)
- Deterministic metrics from the NLP tradition for precise, reproducible measurement: signal (what fraction of the extraction is substantive content, derived from the ratio of golden to extracted length?), code and table recall (are code blocks and tables preserved?), and ROUGE-L (word-level longest common subsequence F1)

# Results

We evaluate our contents quality on a dataset of 250 URLs mined from simulated coding agent search distributions (Appendix A).
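For reference, two of the deterministic metrics named above can be sketched in a few lines. This is a minimal illustration, not the WebCode implementation: the function names are hypothetical, tokenization is plain whitespace splitting, and the `signal` formula (golden length over extracted length, capped at 1) is an assumption based on the one-line description in the text.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(golden: str, extracted: str) -> float:
    """Word-level ROUGE-L F1 between a golden reference and an extraction."""
    ref, cand = golden.split(), extracted.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of extracted words that match
    recall = lcs / len(ref)      # share of golden words that were recovered
    return 2 * precision * recall / (precision + recall)


def signal(golden: str, extracted: str) -> float:
    """Assumed definition: golden length over extracted length, capped at 1."""
    if not extracted:
        return 0.0
    return min(1.0, len(golden) / len(extracted))
```

Under these assumptions, an extraction that wraps the golden body in navigation text keeps perfect recall but loses precision and signal, which is the failure the two metrics are meant to separate from dropped content.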
Figure 4: Extraction quality across seven dimensions: completeness, signal, structure, accuracy, code recall, table recall, and ROUGE-L

| Provider | Completeness | Signal | Structure | Accuracy | Code Recall | Table Recall | ROUGE-L |
|----------|--------------|--------|-----------|----------|-------------|--------------|---------|
| Exa      | 82.8         | 94.5   | 81.8      | 89.3     | 96.7        | 91.9         | 83.2    |
| Parallel | 74.2         | 77.6   | 80.8      | 89.2     | 94.1        | 92.2         | 73.7    |
| Claude†  | 59.8         | 55.1   | 75.1      | 81.1     | 82.4        | 82.0         | 66.8    |

*All scores 0-100.
†Claude web_fetch_20260209 (allowed_callers=['direct']) returned empty content for 12.0% of URLs.

# Example: focused extraction vs full text

Extraction lengths vary by over an order of magnitude across providers for the same page. The 1x to 13x difference from the golden reference is driven by an excess consisting entirely of sidebars, navigation and chrome.

Figure 5: Length of content extracted relative to the golden extraction

# 1.2. Highlights: in-document search

The URL is still given, but now so is a query. Given this world-view, we measure whether providers can surface the relevant section (highlight) of the page that correctly answers the query. This can be viewed as the base case of RAG: instead of retrieving over the entire corpus, we retrieve over a single document.

To ensure token-efficient code search, measuring highlight (relevant section of the page) quality is important. However, we observed that the typical evaluation harness for RAG is prone to bidirectional failure modes:
- Highlights were incorrect but the synthesis LLM returned the correct answer
- Highlights were correct but the synthesis LLM returned the incorrect answer

The root cause is that standard RAG evaluation is framed as a generative task: the synthesis LLM produces an answer, and that generation step is noisy enough to mask both good and bad extractions.

We reformulate it as a discriminative task. Instead of "generate the right answer from this context," we ask "does this context contain the right answer?" A frontier model can reliably discriminate whether a highlight contains a given fact, even when synthesis would be noisy.
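The discriminative framing and the two failure modes listed above can be sketched as follows. This is an illustration only: the function names are hypothetical, and a normalized substring match stands in for the frontier-model judge the post describes, which would also accept paraphrases.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting is ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_grounded(highlights: List[str], gold_answer: str) -> bool:
    """Discriminative check: does any highlight contain the gold answer?

    Assumption: substring matching replaces the LLM judge for this sketch.
    """
    gold = normalize(gold_answer)
    return bool(gold) and any(gold in normalize(h) for h in highlights)


def diagnose(grounded: bool, correct: bool) -> str:
    """Label one sample given the highlight check and the answer check."""
    if grounded and correct:
        return "grounded and correct"
    if correct:
        return "correct but ungrounded: synthesis masked a bad extraction"
    if grounded:
        return "grounded but incorrect: synthesis masked a good extraction"
    return "ungrounded and incorrect"
```

Keeping the two booleans separate is the point of the design: a sample that is correct but ungrounded would pass an answer-only harness even though the provider returned nothing useful.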
This gives us two independent axes:
- Correctness (generative): does the synthesized answer match the gold target? The judge never sees the highlights.
- Groundedness (discriminative): do the highlights contain the gold target answer? The judge never sees the synthesized answer.

# Example

Figure 6: Comparison between correctness and groundedness

- Exa: extracted the relevant Apple documentation, including the exact bullet point that answers the question. The synthesis model cited it and produced a correct, grounded answer.
- Claude: web_fetch_20260209 returned a JavaScript-required shell with no documentation content. Yet the synthesis model answered correctly anyway: it had memorized Apple's docs from training data.

A correctness-only metric would score both identically.

# Results - importance of optimizing for groundedness

We evaluate each provider on the same dataset as the contents dataset.

Figure 7: Results across providers for correctness and groundedness

Correctness scores cluster around 86%, while groundedness scores show much higher variance, better isolating capability differences across search providers. This suggests correctness primarily reflects the synthesis model, not the search provider.

# 2. Retrieval Quality

# 2.1. RAG: From a document to the entire web

In the prior section, we constrained our world view to a single document. Now, we expand our world view to the entire web and ask whether we are still able to retrieve the correct answer.

We generate question-answer pairs from long-context documentation pages: pages that run up to millions of characters. For each page, we select a niche segment buried deep in the document, construct a single factually verifiable question, and verify the answer using multiple research agents powered by Exa Deep. To further motivate each question from a retrieval perspective, we require that two frontier models fail to answer the question from parametric memory alone over three completions.
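The parametric-memory filter in the last sentence can be sketched as below. The `Model` interface, the function names, and the substring grading are assumptions made for illustration; the post does not say how completions were graded, and this sketch reads the requirement as "no completion from either model contains the gold answer".

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a model takes a question and returns answer text,
# with no web access.
Model = Callable[[str], str]


def _norm(text: str) -> str:
    return " ".join(text.lower().split())


def answers_from_memory(
    question: str,
    gold_answer: str,
    models: Sequence[Model],
    completions: int = 3,
) -> bool:
    """True if any completion from any model contains the gold answer."""
    gold = _norm(gold_answer)
    for ask in models:
        for _ in range(completions):
            if gold in _norm(ask(question)):
                return True
    return False


def filter_questions(
    pairs: Sequence[Tuple[str, str]], models: Sequence[Model]
) -> List[Tuple[str, str]]:
    """Keep only pairs that no model could answer without search."""
    return [(q, a) for q, a in pairs if not answers_from_memory(q, a, models)]
```

With real models the completions would be sampled at non-zero temperature, so repeating the call three times gives a memorized answer several chances to surface before the question is accepted.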
The final 317 queries (Appendix C) were dispatched to five providers (Exa, Brave, Perplexity, Parallel, and Tavily) and graded on groundedness: did the result set contain the correct answer? We also computed citation precision: not just whether the answer appeared, but what fraction of returned results actually contained it.

Figure 8: RAG groundedness and citation precision across providers

#2.2. End to End Coding Tasks

Finally, we evaluate retrieval quality in a sandboxed coding environment with bash tool calls and unit tests, closely mirroring real-world autonomous coding workflows.

Existing benchmarks such as TerminalBench focus primarily on reasoning and tool use, and generally do not require web search. In fact, we found no public benchmark that explicitly evaluates a coding agent's ability to use web search. Therefore, we built our own benchmark.

Figure 9: Procedure for constructing the end-to-end coding tasks

We applied the following heuristics to our seeding methodology:

Library selection: Identified relevant libraries (>200 GitHub stars) with releases after August 1, 2025. Each release must include at least three breaking changes or new functions; releases that only included minor notes, internal refactors, or bug fixes were dropped.
Knowledge check: Queried a frontier model on the API changes introduced in the release (no web access). In conjunction, deployed multiple research agents powered by Exa's type:deep search to extract concrete behavioral deltas. If the base model already knew what the research agent found, we dropped the candidate.
Quality judge: Scored the research output on several dimensions, and only kept tasks containing concrete changes with working code examples.
Task generation: Generated a task prompt, solution, and test suite using the novel facts.
Sandbox verification: Inside a Modal sandbox, we verified the solution passed the test suite while the empty stub failed. Tasks where the tests did not discriminate between the two were dropped.
Difficulty gate: Two frontier models attempted to solve the task from just the prompt using up to 10,000 tokens of reasoning. We ran each model's solution against the public test suite and hidden discriminators. If any model fully passed, we hardened the task up to three times, and rejected the task if it still passed.

Although tool-enabled models outperform their non-tool-enabled counterparts, we empirically observed that these constraints provided a strong signal for distinguishing search quality across providers.

Figure 10: Groundedness improvement over the no-search baseline by provider. A sub-agent handles web search and feeds results back to the main agent; native search replaces the sub-agent entirely. We ran multiple rollouts per provider across several frontier models. Each line shows how much that provider's pass rate improved over the no-search baseline; diamonds mark the median. Wider distributions indicate less consistent gains.

#Conclusion

WebCode both identifies and fills a critical gap in the current state of web search evals for coding agents, and demonstrates the importance of evaluating both contents and retrieval quality. While there are further improvements to be made, we hope that open-sourcing these evaluations will help uplift the current state of code search across the industry.

#Appendix: Datasets

#A. Contents dataset

We release a set of 250 URLs from our contents dataset based on coding agent search distributions.

#B. Highlights dataset

We use the same dataset for this section as the contents dataset. All providers receive the same synthesis model and identical instructions, isolating extraction quality as the only variable.

#C. RAG dataset

We release a set of 317 {query, expected_answer} pairs sourced from authoritative documentation (GNU, W3C, IETF RFCs, language-official docs for Python, Rust, Go, and others).

#D. E2E dataset

We release a set of 33 tasks across 9 languages (Go, Python, TypeScript, Swift, Java, Kotlin, Ruby, Elixir, C++). Each task contains a {prompt, unit_test, golden_solution, citations, dockerfile} tuple targeting a specific release after August 1, 2025. To prevent indirect cheating, we also block all DNS requests to the sandbox except for the model calls.


