crfm-stanford-edu
Website: https://crfm.stanford.edu/2023/03/13/alpaca.html
Alpaca: A Strong, Replicable Instruction-Following Model

Authors: Rohan Taori*, Ishaan Gulrajani*, Tianyi Zhang*, Yann Dubois*, Xuechen Li*, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto

We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. On our preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<$600). Check out our code release on GitHub.

Update: The public demo is now disabled. The original goal of releasing a demo was to disseminate our research in an accessible way. We feel that we have mostly achieved this goal, and given the hosting costs and the inadequacies of our content filters, we decided to bring down the demo.

Overview

Instruction-following models such as GPT-3.5 (text-davinci-003), ChatGPT, Claude, and Bing Chat have become increasingly powerful. Many users now interact with these models regularly and even use them for work. However, despite their widespread deployment, instruction-following models still have many deficiencies: they can generate false information, propagate social stereotypes, and produce toxic language.

To make maximum progress on addressing these pressing problems, it is important for the academic community to engage. Unfortunately, doing research on instruction-following models in academia has been difficult, as there is no easily accessible model that comes close in capabilities to closed-source models such as OpenAI’s text-davinci-003.

We are releasing our findings about an instruction-following language model, dubbed Alpaca, which is fine-tuned from Meta’s LLaMA 7B model. We train the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003.
On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI’s text-davinci-003, but is also surprisingly small and easy/cheap to reproduce.

We are releasing our training recipe and data, and intend to release the model weights in the future. We are also hosting an interactive demo to enable the research community to better understand the behavior of Alpaca. Interaction can expose unexpected capabilities and failures, which will guide our future evaluation of these models. We also encourage users to report any concerning behaviors in our web demo so that we can better understand and mitigate these behaviors. As any release carries risks, we discuss our thought process for this open release later in this blog post.

We emphasize that Alpaca is intended only for academic research and any commercial use is prohibited. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial license, so we necessarily inherit this decision. Second, the instruction data is based on OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.

Training recipe

There are two important challenges to training a high-quality instruction-following model under an academic budget: obtaining a strong pretrained language model and obtaining high-quality instruction-following data. The first challenge is addressed with the recent release of Meta’s new LLaMA models. For the second challenge, the self-instruct paper suggests using an existing strong language model to automatically generate instruction data. In particular, Alpaca is a language model fine-tuned using supervised learning from a LLaMA 7B model on 52K instruction-following demonstrations generated from OpenAI’s text-davinci-003.

The figure below illustrates how we obtained the Alpaca model.
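To make the supervised fine-tuning step concrete, the sketch below shows one common way an instruction-following demonstration might be turned into a training example. It is an illustration only: the prompt template, the `build_example` helper, the toy whitespace tokenizer, and the choice to mask prompt tokens out of the loss are all assumptions made for this example, not details stated in this post.

```python
from typing import Dict, List

# Label value skipped by PyTorch's cross-entropy loss (its default ignore_index).
IGNORE_INDEX = -100

# Hypothetical prompt template; the post does not state the exact format used.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

# Toy vocabulary that grows as new whitespace-separated tokens appear.
# A real pipeline would use the base model's own tokenizer instead.
_vocab: Dict[str, int] = {}


def toy_encode(text: str) -> List[int]:
    """Map each whitespace-separated token to an integer id."""
    return [_vocab.setdefault(token, len(_vocab)) for token in text.split()]


def build_example(demo: Dict[str, str]) -> Dict[str, List[int]]:
    """Turn one {instruction, output} demonstration into model inputs and labels."""
    prompt_ids = toy_encode(PROMPT_TEMPLATE.format(instruction=demo["instruction"]))
    target_ids = toy_encode(demo["output"])
    return {
        "input_ids": prompt_ids + target_ids,
        # Assumption: loss is computed on the response only, so prompt labels are masked.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + target_ids,
    }


demo = {"instruction": "Name the capital of Tanzania.", "output": "Dodoma"}
example = build_example(demo)
print(len(example["input_ids"]), example["labels"][-1:])
```

Masking the prompt with `-100` follows the PyTorch convention for label positions that cross-entropy loss should ignore, so under this assumption only the response tokens would contribute to the training signal.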
For the data, we generated instruction-following demonstrations by building upon the self-instruct method. We started with the 175 human-written instruction-output pairs from the self-instruct seed set. We then prompted text-davinci-003 to generate more instructions using the seed set as in-context examples. We improved over the self-instruct method by simplifying the generation pipeline (see details on GitHub) and significantly reduced the cost. Our data generation process results in 52K unique instructions and the corresponding outputs, which cost less than $500 using the OpenAI API.

Equipped with this instruction-following dataset, we then fine-tuned the LLaMA models using Hugging Face’s training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training. For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. We note that training efficiency can be improved to further reduce the cost.

Preliminary evaluation

To evaluate Alpaca, we conduct human evaluation (by the 5 student authors) on the inputs from the self-instruct evaluation set. This evaluation set was collected by the self-instruct authors and covers a diverse list of user-oriented instructions including email writing, social media, and productivity tools. We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 versus 89 comparisons against text-davinci-003.

We were quite surprised by this result given the small model size and the modest amount of instruction-following data. Besides leveraging this static evaluation set, we have also been testing the Alpaca model interactively and found that Alpaca often behaves similarly to text-davinci-003 on a diverse set of inputs. We acknowledge that our evaluation may be limited in scale and diversity.
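The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this post; the variable names are illustrative, and treating the 179 pairwise judgments as independent fair coin flips with no ties is an assumption made here purely to gauge how close 90 versus 89 is to an even split.

```python
from fractions import Fraction
from math import comb

# Figures quoted in the text above.
GPU_COUNT = 8            # 80GB A100s
TRAIN_HOURS = 3          # fine-tuning time for the initial run
TRAIN_COST_CAP = 100     # "less than $100"
DATA_COST_CAP = 500      # "less than $500"
ALPACA_WINS = 90
DAVINCI_WINS = 89

# Compute budget: total GPU-hours and the hourly rate implied by the cost cap.
gpu_hours = GPU_COUNT * TRAIN_HOURS
implied_rate_cap = TRAIN_COST_CAP / gpu_hours
total_cost_cap = DATA_COST_CAP + TRAIN_COST_CAP

# Pairwise evaluation: share of comparisons won by Alpaca.
comparisons = ALPACA_WINS + DAVINCI_WINS
win_rate = ALPACA_WINS / comparisons


def upper_tail(n: int, k: int) -> Fraction:
    """Exact P(X >= k) for X ~ Binomial(n, 1/2)."""
    return Fraction(sum(comb(n, i) for i in range(k, n + 1)), 2 ** n)


# Assumption: independent judgments, no ties, both models equally likely to win.
tail = upper_tail(comparisons, ALPACA_WINS)

print(f"GPU-hours: {gpu_hours}, implied rate below ${implied_rate_cap:.2f}/GPU-hour")
print(f"Total cost below ${total_cost_cap}")
print(f"Alpaca win rate: {win_rate:.4f}, P(X >= {ALPACA_WINS}) = {tail}")
```

Under those assumptions, 24 GPU-hours at under $100 implies a rate below roughly $4.17 per GPU-hour, and the two cost caps sum to the under-$600 figure quoted in the introduction. A result at least as favorable to Alpaca as 90 wins would occur with probability exactly 1/2 under a fair coin, which is consistent with reading the two models as performing very similarly on this set.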
Given these limitations of our evaluation, we are releasing an interactive demo of Alpaca, and encourage readers to evaluate Alpaca themselves and give us feedback.

In the rest of this section, we include several interaction examples to showcase the capabilities and limitations of Alpaca. The above examples show that the outputs of Alpaca are generally well-written. We note that Alpaca reflects the general style of the instruction-following dataset. As a result, Alpaca’s answers are typically shorter than ChatGPT’s, reflecting text-davinci-003’s shorter outputs.

Known limitations

Alpaca also exhibits several common deficiencies of language models, including hallucination, toxicity, and stereotypes. Hallucination in particular seems to be a common failure mode for Alpaca, even compared to text-davinci-003. For example, in the following figure, Alpaca wrongly says that the capital of Tanzania is Dar es Salaam, which is the largest city in Tanzania. (It was the capital until 1974, when it was replaced by Dodoma.) Furthermore, Alpaca can be used to generate well-written outputs that spread misinformation, as seen in the following example.

Alpaca likely contains many other limitations associated with both the underlying language model and the instruction tuning data. However, we believe that the artifact will still be useful to the community, as it provides a relatively lightweight model that serves as a basis to study important deficiencies. We encourage users to help us identify new kinds of failures by flagging them in the web demo. Overall, we hope that the release of Alpaca can facilitate further research into instruction-following models and their alignment with human values.

Assets released

We are releasing the following assets today:
Demo: an interactive demo for everyone to try out Alpaca.
Data: 52K demonstrations used to fine-tune Alpaca.
Data generation process: the code for generating the data.
Training code: for fine-tuning the model using the Hugging Face API.
We intend to release the following assets in the near future:
Model weights: We have reached out to Meta to obtain guidance on releasing the Alpaca model weights, both for the 7B Alpaca and for fine-tuned versions of the larger LLaMA models.

Release decision

We believe that releasing the above assets will enable the academic community to perform controlled scientific studies on instruction-following language models, resulting in better science and ultimately new techniques to address the existing deficiencies with these models.

At the same time, any release carries some risk. First, we recognize that releasing our training recipe reveals the feasibility of certain capabilities. On one hand, this enables more people (including bad actors) to create models that could cause harm (either intentionally or not). On the other hand, this awareness might incentivize swift defensive action, especially from the academic community, now empowered by the means to perform deeper safety research on such models. Overall, we believe that the benefits for the research community outweigh the risks of this particular release.

Because we are releasing the training recipe, and the recipe is simple, we believe that releasing the data, model weights, and training code incurs minimal further risk. At the same time, releasing these assets has enormous benefits for reproducible science, so that the academic community can use standard datasets, models, and code to perform controlled comparisons and to explore extensions.

Deploying an interactive demo for Alpaca also poses potential risks, such as more widely disseminating harmful content and lowering the barrier for spam, fraud, or disinformation. We have put into place two risk mitigation strategies. First, we have implemented a content filter using OpenAI’s content moderation API, which filters out harmful content as defined by OpenAI’s usage policies.
Second, we watermark all the model outputs using the method described in Kirchenbauer et al. 2023, so that others can detect (with some probability) whether an output comes from Alpaca 7B. Finally, we have strict terms and conditions for using the demo; it is restricted to non-commercial uses and to uses that follow LLaMA’s license agreement.

We understand that these mitigation measures can be circumvented once we release the model weights or if users train their own instruction-following models. However, by installing these mitigations, we hope to advance the best practices and ultimately develop community norms for the responsible deployment of foundation models.

Future directions

We are excited by the research opportunities that Alpaca unlocks. There are many exciting future directions:
Evaluation: We need to evaluate Alpaca more rigorously. We will start with HELM (Holistic Evaluation of Language Models), which hopefully will evolve to capture more generative, instruction-following scenarios.
Safety: We would like to further study the risks of Alpaca and improve its safety using methods such as automatic red teaming, auditing, and adaptive testing.
Understanding: We hope to better understand how capabilities arise from the training recipe. What properties of a base model do you need? What happens when you scale up? What properties of instruction data are needed? What are alternatives to using self-instruct on text-davinci-003?

Acknowledgments

This work was done at the Center for Research on Foundation Models (CRFM) with support from the Stanford Institute for Human-Centered AI (HAI) and the Stanford Natural Language Processing (NLP) group. We also especially thank Yifan Mai for helpful engineering support for demo deployment. Alpaca depends directly and critically on existing works.
We would like to thank Meta AI Research for training and releasing the LLaMA models, the self-instruct team for giving us a basis for the data generation pipeline, Hugging Face for the training code, and OpenAI for paving the path and showing what can be achieved. We would also like to highlight that there are many other open efforts for instruction-following LLMs and chat models, including OpenChatKit, Open Assistant, and Carper AI. --- Faculty Members Percy Liang Director, Computer Science Akshay Chaudhari Radiology and (by courtesy) Biomedical Data Science Carlos Guestrin Computer Science Chelsea Finn Computer Science Chris Re Computer Science Christopher Manning Linguistics and Computer Science Dan Boneh Computer Science Dan Ho Law and Political Science Dan Jurafsky Linguistics and Computer Science Daniel Yamins Computer Science and Psychology / Wu Tsai Institute Diyi Yang Computer Science Dorsa Sadigh Computer Science Douwe Kiela Symbolic Systems Ehsan Adeli Psychiatry and Behavioral Sciences Erik Brynjolfsson HAI - Digital Economy Lab Fei-Fei Li Computer Science James Zou Biomedical Data Science Jeannette Bohg Computer Science Jiajun Wu Computer Science John Duchi Statistics, Electrical Engineering Julian Nyarko Law Jure Leskovec Computer Science Marco Pavone Aeronautics and Astronautics Mark A Lemley Law Matei Zaharia Computer Science Michael Bernstein Computer Science Monica Lam Computer Science Nigam Shah Medicine Noah Goodman Psychology and Computer Science Rob Reich Political Science and McCoy Family Center for Ethics in Society Russ Altman Bioengineering Sanmi Koyejo Computer Science Stefano Ermon Computer Science Surya Ganguli Applied Physics Tatsu Hashimoto Computer Science Tengyu Ma Computer Science Thomas Icard Philosophy Tobias Gerstenberg Psychology Research Engineering Team Abhinav Garg Computer Science David Hall Computer Science Yifan Mai Computer Science Postdocs, Students, and Researchers Alan Luo PhD Student Computer Science Alberto Tono PhD 
Student Civil and Environmental Engineering Alex Tamkin PhD Student Computer Science Allen Nie PhD Student Computer Science Alycia Lee MS Student Computer Science Amelia Hardy MS Student Computer Science Ananya Kumar PhD Student Computer Science Andreas Haupt Postdoc Computer Science, Economics Andrew Gaut MS Student Computer Science Andy Zhang JD Student Law Annie Chen PhD Student Computer Science Arjun Desai PhD Student Electrical Engineering Armin Thomas Postdoc Data Science and Psychology Ashwin Agrawal PhD Student Civil and Environmental Engineering Avanika Narayan PhD Student Computer Science Benjamin Xie Postdoc Institute for Human-Centered Artificial Intelligence, McCoy Family Center for Ethics in Society Berivan Isik PhD Student Electrical Engineering Betty Xiong PhD Student Biomedical Informatics Bobby Yan PhD Student Computer Science Brando Miranda PhD Student Computer Science Caleb Winston PhD Student Computer Science Camilo Ruiz PhD Student Computer Science Cem Gokmen PhD Student Computer Science Chenlin Meng PhD Student Computer Science Chris Cundy PhD Student Computer Science Chris Harjadi Undergraduate Student Computer Science Connor Toups MS Student Computer Science Cristobal Eyzaguirre PhD Student Computer Science Daniel Shin MS Student Computer Science David Rose PhD Student Department of Psychology Diana Acosta Navas Postdoc McCoy Family Center for Ethics in Society Dilara Soylu MS Student Computer Science Dimitris Tsipras Postdoc Computer Science Div Garg PhD Student Computer Science Drew A. 
Hudson PhD Student Computer Science Ed Chen PhD Student Computer Science Eduardo Pontes Reis Visiting Scholar AIMI Center Eric Mitchell PhD Student Computer Science Eric Zelikman PhD Student Computer Science Esin Durmus Postdoc Computer Science Faisal Ladhak Visiting Researcher Computer Science Federico Bianchi Postdoc Computer Science Frieda Rong PhD Student Computer Science Gabriel Mukobi Undergraduate Student Computer Science Gabriel Poesia Reis e Silva PhD Student Computer Science Gautam Mittal MS Student Computer Science Hancheng Cao PhD Student Computer Science Hongyu Ren PhD Student Computer Science Hongyue Li MS Student Computer Science Irena Gao Undergraduate Student Computer Science Isabelle Levent Undergraduate Student Philosophy Ivan Zhou MS Student Computer Science Jan-Philipp Fränken Postdoc Psychology Jason Fries Research Scientist Biomedical Informatics / School of Medicine Jeff Z. HaoChen PhD Student Computer Science Jeongyeon Kim PhD Student Computer Science Jimmy Wu PhD Student Computer Science Joel Niklaus Visiting Student Researcher Law John Hewitt PhD Student Computer Science John Thickstun Postdoc Computer Science Jonathan Vandenburgh Postdoc McCoy Family Center for Ethics in Society Joon Sung Park PhD Student Computer Science Josselin Somerville MS Student Computer Science Joy He-Yueya PhD Student Computer Science Juan Manuel Zambrano Chaves PhD Student Biomedical Data Science Juan Miguel Navarro Carranza PhD Student Department of Civil and Environmental Engineering Julian Quevedo Undergraduate Student Computer Science Kai Fronsdal Undergraduate Student Mathematics Kanishk Gandhi PhD Student Computer Science Karan Goel PhD Student Computer Science Kaylee Burns PhD Student Computer Science Kelechi Uhegbu MS Student Computer Science Keshav Santhanam PhD Student Computer Science Kevin Feigelis PhD Student Physics Kevin Klyman Policy Research Analyst Freeman Spogli Institute Khaled Saab PhD Student Electrical Engineering Kristina Gligorić 
Postdoc Computer Science Kush Bhatia Postdoc Computer Science Kyle Hsu PhD Student Computer Science Lauren Gillespie PhD Student Computer Science Liana Patel PhD Student Computer Science Lina Lukyantseva PhD Student Economics & Computer Science Lisa Li PhD Student Computer Science Lu Yang PhD Student Bioengineering and Computer Science Lucia Zheng MS Student Computer Science Lyna Kim Undergraduate Student Computer Science Mars Huang PhD Student Biomedical Informatics Matthias Gerstgrasser Postdoc Computer Science Mayee Chen PhD Student Computer Science Mert Yuksekgonul PhD Student Computer Science Michael Moor Postdoc Computer Science Michael Poli PhD Student Computer Science Michael Wornow PhD Student Computer Science Michael Zhang PhD Student Computer Science Michihiro Yasunaga PhD Student Computer Science Miguel Paredes Stanford Alumni Affiliate Stanford Alumni Affiliate Mina Lee PhD Student Computer Science Minkai Xu PhD Student Computer Science Mo Tiwari PhD Student Computer Science Nandita Bhaskhar PhD Student Electrical Engineering Nathan Kim Undergraduate Student Computer Science Neel Guha PhD Student Computer Science Neil Band PhD Student Computer Science Olawale Salaudeen PhD Student Computer Science Omar Khattab PhD Student Computer Science Pawan Wirawarn Undergraduate Student Computer Science Peter Henderson PhD Student Computer Science Pranav Jain MS Student Computer Science Qian Huang PhD Student Computer Science Rishi Bommasani PhD Student Computer Science Rohan Koodli PhD Student Biomedical Informatics Rohan Taori PhD Student Computer Science Rohith Kuditipudi PhD Student Computer Science Ruhana Azam Visiting Student Researcher Computer Science Ruocheng Wang PhD Student Computer Science Ruth Starkman Lecturer I'm in four departments but my main gig is Program in Writing and Rhetoric Ryan A. 
Chi MS Student Computer Science Rylan Schaeffer PhD Student Computer Science Sabri Eyuboglu PhD Student Computer Science Samar Khanna MS Student Computer Science Sandra Luksic Research Assistant McCoy Family Center for Ethics in Society Sang Michael Xie PhD Student Computer Science Sarah Chen Undergraduate Student Computer Science Sarah Wu PhD Student Psychology Shelby Grossman Research Scholar Cyber Policy Center - Stanford Internet Observatory Shiori Sagawa PhD Student Computer Science Shivam Garg PhD Student Computer Science Shyamal Buch PhD Student Computer Science Siddharth Karamcheti PhD Student Computer Science Silas Alberti PhD Student Electrical Engineering Simran Arora PhD Student Computer Science Sina Semnani PhD Student Computer Science Siying Zhang Researcher Psychology Steven Feng PhD Student Computer Science Sudharsan Sundar Undergraduate Student Computer Science Suvir Mirchandani PhD Student Computer Science Tianyi Zhang PhD Student Computer Science Ting-An Lin Postdoc McCoy Family Center for Ethics in Society Trevor Chow Undergraduate Student Mathematics Trevor Gale PhD Student Computer Science Tri Dao PhD Student Computer Science Tristan Thrush PhD Student Computer Science Vaish Shrivastava MS Student Computer Science Valentin Hofmann Visiting Scholar Linguistics Vishnu Sarukkai PhD Student Computer Science Weixin Liang PhD Student Computer Science William Wang Undergraduate Student Biology and Bioengineering William Zhang Undergraduate Student Computer Science Winnie Xu Research Scholar Computer Science Xikun Zhang PhD Student Computer Science Xuechen Li PhD Student Computer Science Yann Dubois PhD Student Computer Science Yian Zhang MS Student Computer Science Yoonho Lee PhD Student Computer Science Yuhui Zhang PhD Student Computer Science Yusuf Roohani PhD Student Computer Science Zepeng Huo Postdoc Biomedical Informatics Zhengxuan Wu PhD Student Computer Science Zhenlin Chen PhD Student Doerr School of Sustainability, Energy Science 
Engineering Zhiyuan Li Postdoc Computer Science Ziwen Chen PhD Student Graduate School of Business Alumni Aditi Raghunathan Assistant Professor, CMU Antoine Bosselut Assistant Professor, EPFL Ben Newman Alumni Chris Donahue Research Scientist, Google Claudia D'Arpino Research Scientist, NVIDIA Dakuo Wang Associate Professor, Northeastern University Deepak Narayanan Research Scientist, Microsoft Fereshte Khani Research Scientist, Microsoft Florian Tramer Assistant Professor, ETH Zurich Jim Fan Research Scientist, NVIDIA Kathleen Creel Assistant Professor, Northeastern Laurel Orr Numbers Station Researcher Pang Wei Koh Assistant Professor, UW Ranjay Krishna Assistant Professor, UW Saahil Jain Engineer, You.com Shibani Santurkar Research Scientist, OpenAI Yuhuai (Tony) Wu Research Scientist, Google --- On the Opportunities and Risks of Foundation Models Download the report. Authors: Rishi Bommasani*, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. 
Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, Percy Liang* Abstract AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. 
Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature. Introduction To learn more, see this section in the report. This report investigates an emerging paradigm for building artificial intelligence (AI) systems based on a general class of models which we term foundation models. A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks; current examples include BERT [Devlin et al. 2019], GPT-3 [Brown et al. 2020], and CLIP [Radford et al. 2021]. From a technological point of view, foundation models are not new — they are based on deep neural networks and self-supervised learning, both of which have existed for decades. However, the sheer scale and scope of foundation models from the last few years have stretched our imagination of what is possible; for example, GPT-3 has 175 billion parameters and can be adapted via natural language prompts to do a passable job on a wide range of tasks despite not being trained explicitly to do many of those tasks [Brown et al. 2020]. At the same time, existing foundation models have the potential to accentuate harms, and their characteristics are in general poorly understood. Given their impending widespread deployment, they have become a topic of intense scrutiny [Bender et al. 2021]. Capabilities To learn more, see this section in the report. Foundation models acquire various capabilities that can power applications. 
We have chosen to discuss five potential capabilities: the ability to process different modalities (e.g., language, vision), to affect the physical world (robotics), to perform reasoning, and to interact with humans (interaction). Finally, we conclude with a philosophical discussion of potential limits on their capabilities. Language To learn more, see this section in the report. Authors: Isabel Papadimitriou, Christopher D. Manning NLP as a field has blazed the trail for foundation models. While these models dominate standard benchmarks, there is a clear gap between the capabilities these models acquire currently and those that characterize language as a complex system for human communication and thought. In response to this, we emphasize the full range of linguistic variation (e.g., different styles, dialects, languages), which poses an opportunity and challenge given some variants are data-limited. Further, child language acquisition is more sample efficient than the training of foundation models; we examine how signals beyond text and grounding may help to bridge this gap. Both of these characteristics of language provide clear directions for future foundation models research. Vision To learn more, see this section in the report. Authors: Shyamal Buch, Drew A. Hudson, Frieda Rong, Alex Tamkin, Xikun Zhang, Bohan Wu, Ehsan Adeli, Stefano Ermon, Ranjay Krishna, Juan Carlos Niebles, Jiajun Wu, Li Fei-Fei Computer vision led the adoption of deep learning in AI, demonstrating that models pretrained on large annotated datasets can transfer to numerous downstream settings. Now, pretraining on web-scale raw data instead of curated datasets, foundation models are on the rise in computer vision. 
These models have shown promising results for standard tasks in the field, like image classification and object detection, and training on multimodal and embodied data beyond images may enable progress on significant challenges (e.g., 3D geometric and physical understanding, commonsense reasoning). We also discuss some of the key challenges in modeling (e.g., the ability to scale effectively to videos) and evaluation (e.g., the measurement of higher-order capabilities) along with the applications (e.g., ambient intelligence for healthcare) and societal considerations (e.g., surveillance) that will determine the impact of foundation models for computer vision going forward. Robotics To learn more, see this section in the report. Authors: Siddharth Karamcheti, Annie Chen, Suvir Mirchandani, Suraj Nair, Krishnan Srinivasan, Kyle Hsu, Jeannette Bohg, Dorsa Sadigh, Chelsea Finn A longstanding goal of robotics research is to develop "generalist" robots capable of performing myriad tasks across physically diverse environments. Unlike language and vision, which have led the way with foundation models both due to the abundance of raw data to train these models on and the availability of virtual applications to apply these models to, robotics faces fundamental challenges due to being anchored to the physical world. The principal challenge in developing new types of foundation models for robotics—different in nature than their language and vision counterparts—is acquiring sufficient data of the right form that is conducive to learning: we explore how plentiful data (e.g., generic videos of humans, amongst others) that is not specific to particular environments and across modalities (e.g., language, vision) may help to bridge this gap. 
These new robotic foundation models could allow for easier task specification and learning, ushering in new applications (e.g., better robotic assistance for household tasks) and heightening the importance of robustness and safety (e.g., formal safety evaluation). Reasoning and search To learn more, see this section in the report. Authors: Yuhuai Wu, Frieda Rong, Hongyu Ren, Sang Michael Xie, Xuechen Li, Andy Shih, Drew A. Hudson, Omar Khattab Reasoning and search problems such as theorem proving and program synthesis have been long-standing challenges in AI. The combinatorial search space renders traditional search-based methods intractable. However, humans are known to operate intuitively even in the most mathematical of domains, and indeed existing work such as AlphaGo has already shown that deep neural networks can be effective in guiding the search space. But humans also transfer knowledge across tasks, facilitating much more efficient adaptation and the ability to reason more abstractly. Foundation models offer the possibility of closing this gap: their multi-purpose nature along with their strong generative and multimodal capabilities offer new leverage for controlling the combinatorial explosion inherent to search. Interaction To learn more, see this section in the report. Authors: Joon Sung Park, Chris Donahue, Mina Lee, Siddharth Karamcheti, Dorsa Sadigh, Michael S. Bernstein Foundation models show clear potential to transform the developer and user experience for AI systems: foundation models lower the difficulty threshold for prototyping and building AI applications due to their sample efficiency in adaptation, and raise the ceiling for novel user interaction due to their multimodal and generative capabilities. This provides a synergy we encourage going forward: developers can provide applications that better fit the user's needs and values, while introducing far more dynamic forms of interaction and opportunities for feedback. 
Philosophy of understanding To learn more, see this section in the report. Authors: Christopher Potts, Thomas Icard, Eva Portelance, Dallas Card, Kaitlyn Zhou, John Etchemendy What could a foundation model come to understand about the data it is trained on? Focusing on the case of natural language, we identify different positions on the nature of understanding and explore their relevance for our central question. Our tentative conclusion is that skepticism about the capacity of future foundation models to understand natural language may be premature, especially where the models are trained on multi-modal data. Applications To learn more, see this section in the report. At present, foundation model research is largely confined to computer science and AI, with the impact of foundation models and the applications they support largely being centered in the tech industry. Moving forward, foundation models present clear potential to transform and extend the reach of AI across many sectors beyond the tech industry, suggesting a more pervasive effect on people's lives. While there is a multitude of applications and domains to consider, we we have chosen three applications — healthcare, law, and education — because they represent foundational pillars of our society. For foundation models to significantly contribute to these application domains, models will require specific capabilities as well as technical innovation to account for the unique considerations in each domain. Further, since these domains are critical to societal function, applying foundation models in these domains requires engaging with deeply sociotechnical matters such as those those pertaining to data, privacy, interpretability, fairness and ethics. Healthcare and biomedicine To learn more, see this section in the report. 
Authors: Michihiro Yasunaga, Jing Huang, Camilo Ruiz, Yuhui Zhang, Giray Ogut, Saahil Jain, William Wang, Yusuf Roohani, Hongyu Ren, Antoine Bosselut, Ehsan Adeli, Jure Leskovec, Russ Altman Healthcare tasks (e.g., patient care via disease treatment) and biomedical research (e.g., scientific discovery of new therapies) require expert knowledge that is limited and expensive. Foundation models present clear opportunities in these domains due to the abundance of data across many modalities (e.g., images, text, molecules) to train foundation models, as well as the value of improved sample efficiency in adaptation due to the cost of expert time and knowledge. Further, foundation models may allow for improved interface design for both healthcare providers and patients to interact with AI systems, and their generative capabilities suggest potential for open-ended research problems like drug discovery. Simultaneously, they come with clear risks (e.g., exacerbating historical biases in medical datasets and trials). To responsibly unlock this potential requires engaging deeply with the sociotechnical matters of data sources and privacy as well as model interpretability and explainability, alongside effective regulation of the use of foundation models for both healthcare and biomedicine. Law To learn more, see this section in the report. Authors: Peter Henderson, Lucia Zheng, Jenny Hong, Neel Guha, Mark Krass, Julian Nyarko, Daniel E. Ho Legal applications require that attorneys read and produce long coherent narratives that incorporate shifting contexts and decipher ambiguous legal standards. Foundation models may provide benefits in this domain: ample data exists in the form of legal documents and their generative capabilities are well-suited to the many generative tasks required in law, but significant improvements are required for foundation models to be able to reliably reason over various sources of information to generate truthful long-form documents. 
As is the case in healthcare, the sample efficiency of adaptation for foundation models is of heightened value given the costs of expert time and knowledge in the legal domain, which may allow for the re-allocation of expertise towards pressing problems of justice and government service. The responsible development of foundation models for law will require specific consideration of privacy, and highlights core limitations of existing foundation models that will require fundamental advances with respect to provenance for their behavior and guarantees for the factuality of their generation. Education To learn more, see this section in the report. Authors: Ali Malik, Dorottya Demszky, Pang Wei Koh, Moussa Doumbouya, Drew A. Hudson, Allen Nie, Hamed Nilforoshan, Alex Tamkin, Emma Brunskill, Noah Goodman, Chris Piech Education is a complex and subtle domain; effective teaching involves reasoning about student cognition and should reflect the learning goals of students. The nature of foundation models presents promise here that has yet to be realized in the sphere of AI for education: while many streams of data in education are individually too limited to train foundation models, the ability to leverage relevant data from outside the domain (e.g., the Internet) and make use of data across multiple modalities (e.g., textbooks, mathematical formulas, diagrams, video-based tutorials) jointly offers hope for foundation models that are broadly applicable to educational tasks. If foundation models lead to a significant improvement in education-relevant capabilities, there is clear potential for new applications that align with the open-ended generative (e.g., problem generation) and interactive (e.g., feedback to teachers) aspects of foundation models; the sample efficient adaptation of foundation models suggests greater ability for adaptive and personalized learning. 
In this event, renewed consideration is required of hallmarks of applying technology to education (e.g., student privacy), along with certain concerns becoming more critical (e.g., inequity in access to technology in education, technology-aided plagiarism). Technology To learn more, see this section in the report. Now we discuss the technology behind building better model architectures, training and adaptation procedures, and of course scaling up the systems. One crucial but often overlooked topic is data—where does it come from and what is its composition? In addition, we want foundation models to be robust to distribution shifts and secure against attackers. Finally, we wish to understand why foundation models work from both a mathematical perspective as well as an empirical perspective. Modeling To learn more, see this section in the report. Authors: Drew A. Hudson, Antoine Bosselut, Alex Tamkin, Omar Khattab, Jared Quincy Davis, Jiaxuan You, Trevor Gale What structural properties give rise to a foundation model? In the modeling section, we explore the underlying architectures behind foundation models and identify 5 key attributes. First, we start by discussing expressivity of the computational model — to capture and assimilate real-world information, and scalability — to adeptly handle large quantities of high-dimensional data. These properties are successfully realized by existing architectures such as the transformer network that underpins most foundation models to date. We then proceed to attributes that may be essential for the next generation of models, including: multimodality — to consume, process and potentially produce content from different sources and domains, memory capacity — to effectively store and retrieve the acquired knowledge, and finally, compositionality, to foster successful generalization to novel settings and environments. 
We believe that realizing the full potential envisioned for foundation models will hinge on modelling advances to fulfill these desiderata. Training To learn more, see this section in the report. Authors: Alex Tamkin Training objectives mathematically specify how models should learn and acquire capabilities from their training data. The current status quo for training foundation models involves modality-specific objectives (e.g., masked language modeling for text and SimCLR for images) that are often chosen heuristically. We envision that future training objectives for foundation models will reflect two changes: principled selection derived from systematic evidence and evaluation, and domain-generality to provide rich, scalable, and unified training signal across data sources and modalities. We also discuss important design trade-offs, including generative vs discriminative training, the choice of input data representation, and the potential of future training objectives that involve explicit representations of goals. Adaptation To learn more, see this section in the report. Authors: Xiang Lisa Li*, Eric Mitchell*, Sang Michael Xie, Xuechen Li, Tatsunori Hashimoto Foundation models are intermediary assets; they are unfinished and generally should not be used directly, instead requiring adaptation for specific downstream tasks. The de facto approach for adaptation has been fine-tuning, with recent work suggesting that lightweight fine-tuning alternatives and prompting-based methods may achieve favorable accuracy-efficiency tradeoffs. 
Moving forward, we envision a more expansive view of adaptation that goes beyond just specializing foundation models to perform the task of interest: adaptation will alleviate deficiencies of stand-alone foundation models (e.g., temporal adaptation to reflect changes over time in the world) or introduce constraints (e.g., GDPR compliance relating to the right to be forgotten); this broader perspective on adaptation coincides with a need for new evaluation protocols that systematically evaluate adaptation methods while controlling for resources (e.g., runtime, memory) and access requirements involved in adaptation. Evaluation To learn more, see this section in the report. Authors: Rishi Bommasani, Kawin Ethayarajh, Omar Khattab Evaluation offers context to foundation models by providing a means to track progress, understand models, and document their capabilities and biases. Foundation models challenge the ability of standard evaluation paradigms in machine learning to achieve these goals since they are one step removed from specific tasks. To envision new paradigms in evaluation that suit foundation models, we discuss (i) evaluating foundation models directly to measure their inherent capabilities and inform how foundation models are trained, (ii) evaluating task-specific models by controlling for adaptation resources and access, and (iii) broader evaluation design to provide richer context beyond measures of accuracy (e.g., robustness, fairness, efficiency, environmental impact). Reform of evaluation practices will allow for evaluation that adequately serves both the diverse goals and stakeholders involved in the foundation model paradigm. Systems To learn more, see this section in the report. 
Authors: Deepak Narayanan, Trevor Gale, Keshav Santhanam, Omar Khattab, Tianyi Zhang, Matei Zaharia While the training data determines the theoretical information available for foundation models, and model architectures and training objectives determine how much of this information can be extracted, computer systems determine what is practically achievable. Systems are a key bottleneck for scaling in terms of data and model size, both of which appear to reliably track with improvements in capabilities. To ensure that we can train the next generation of foundation models efficiently (with respect to time and cost), we will require the co-design of algorithms, models, software, and hardware. This co-design is already starting to happen in various forms, from carefully tuned parallelism strategies to new architectures such as retrieval-based and mixture-of-expert models. Beyond training, we consider what will be required to deploy applications on top of foundation models (e.g., efficient inference). Data To learn more, see this section in the report. Authors: Laurel Orr, Simran Arora, Karan Goel, Avanika Narayan, Michael Zhang, Christopher Ré Data is the lifeblood of foundation models; the training data of these models largely determines what capabilities these models can acquire. The centrality of data is not unique to foundation models; recent calls for data-centric AI indicate the pervasive importance of managing, understanding, and documenting data used to train machine learning models. For foundation models specifically, the current modus operandi is for training data to be selected using unspecified or unclear principles with a general lack of transparency regarding the nature of training data. We believe an alternative approach is needed to re-imagine the data ecosystem surrounding foundation models: we draw upon work on data visualization and management to propose a data hub for foundation models. 
We articulate how this proposal relates to many of the relevant data-centric considerations for foundation models: selection, curation, documentation, access, visualization and inspection, quality assessment, and legal regulation. Security and privacy To learn more, see this section in the report. Authors: Florian Tramèr*, Rohith Kuditipudi*, Xuechen Li* Security and privacy for foundation models is largely uncharted at present. Fundamentally, foundation models are a high-leverage single point of failure, making them a prime target for attack: existing work demonstrates a variety of security vulnerabilities (e.g., adversarial triggers to generate undesirable outputs) or privacy risks (e.g., memorization of training data) for these models. Further, the generality of foundation models compounds these concerns, intensifying the risk for function creep or dual use (i.e., use for unintended purposes). For security, we view foundation models as akin to operating systems in traditional software systems; we discuss steps towards secure foundation models which, if achieved, would provide a strong abstraction layer to build upon for reliable ML applications. For privacy, by leveraging knowledge transfer from public data, foundation models may enable more sample efficient adaptation to sensitive data distributions, i.e., privacy-preserving applications may incur less degradation in accuracy when built using foundation models. Robustness to distribution shifts To learn more, see this section in the report. Authors: Sang Michael Xie, Ananya Kumar, Rohan Taori, Tony Lee, Shiori Sagawa, Pang Wei Koh, Tatsunori Hashimoto A major limitation of standard machine learning is that it produces models that are not robust to distribution shifts, where the training distribution does not match the test distribution (for the downstream task). 
Existing work shows that adapting a foundation model trained on a broad range of unlabeled data improves the robustness of adapted models across a wide variety of shifts. This opens a new set of promising directions for improving training and adaptation of foundation models for robustness. However, we do not believe that foundation models are a panacea for robustness—challenges such as extrapolation across time and spurious correlations are not likely to be fully addressed. AI safety and alignment To learn more, see this section in the report. Authors: Alex Tamkin, Geoff Keeling, Jack Ryan, Sydney von Arx Ensuring foundation models are reliable, robust, and interpretable is increasingly important when considering the potential real-world applications of these models. In addition to critical and immediate considerations, we also consider the relationship between foundation models and larger-scale risks, hazards, and harms that have the potential for increased relevance as model capabilities continue to advance. For example, we consider the importance of aligning foundation models such that they are not deployed with misspecified goals or values. We also discuss the relevance of forecasting the emergent behaviors of foundation models (e.g., the ability to deceive or plan strategically), which may complicate attempts to adapt them to particular tasks, and may require new approaches for interpretability or evaluation. Theory To learn more, see this section in the report. Authors: Aditi Raghunathan, Sang Michael Xie, Ananya Kumar, Niladri Chatterji, Rohan Taori, Tatsunori Hashimoto, Tengyu Ma Learning theory provides a broad foundation for the variety of contexts encountered in applied machine learning; theory offers understanding, principles, and guarantees to complement empirical findings. 
At present, the study of foundation models is largely empirical: the theory of standard supervised learning, while relatively mature, is inadequate to fully explain foundation models. Specifically, the discrepancy between the training phase and the adaptation phase within the foundation model regime pinpoints the insufficiency of existing theory, since these phases correspond to (potentially) completely different tasks and data distributions. Nevertheless, we hope that advances in theory to address this discrepancy, even in simple, limited settings, will provide useful insights. Interpretability To learn more, see this section in the report. Authors: John Hewitt*, Armin W. Thomas*, Pratyusha Kalluri, Rodrigo Castellon, Christopher D. Manning Interpretability provides clarity to foundation models: the opacity of the deep neural networks that underpin foundation models, alongside the expected ubiquity of foundation models, heightens the need to understand these models and their capabilities. Interpretability methods at present generally are designed for interpreting and explaining the behavior of task-specific models; the nature of foundation models (i.e., the wide array of tasks these models are beneficial for and the unexpected emergent properties they acquire) introduces new challenges for interpretability research. To frame the discussion of interpretability for foundation models, we propose the one model-many models paradigm, which aims to determine the extent to which the one model (the foundation model) and its many models (its adapted derivatives) share decision-making building blocks. 
In addition to interpreting the decision-making components involved, we further discuss explainability in the context of foundation models (e.g., the validity of post hoc explanations generated by models) as well as the mechanisms that drive model behavior (which may clarify the extent to which understanding foundation models can extend to understanding their adapted derivatives). Given the critical role we ascribe interpretability in the study of foundation models, we conclude with an assessment of the societal impact of interpretability and non-interpretability. Society To learn more, see this section in the report. We believe the rapid development of foundation models, adapted and deployed to various applications, will have wide-ranging consequences on the health of societies. What makes these models so exciting and also so troubling is their task agnosticity. Societal impact is easier (but still non-trivial) to understand and reason about when we talk about specific systems deployed to users, but how can we take into account the societal impact of all possible systems and use cases when developing foundation models? Inequity and fairness To learn more, see this section in the report. Authors: Rishi Bommasani, Fereshte Khani, Esin Durmus, Faisal Ladhak, Dan Jurafsky In many contexts, machine learning has been shown to contribute to, and potentially amplify, societal inequity. Foundation models may extend this trend, i.e., furthering the unjust treatment of people who have been historically discriminated against. However, understanding the relationship between inequity and foundation models requires reckoning with the abstraction of foundation models; foundation models are intermediary assets that are adapted for applications that impact users. Therefore, we delineate intrinsic biases, i.e., properties in foundation models that portend harm, and extrinsic harms, i.e., harms arising in the context of specific applications built using foundation models. 
We taxonomize various sources (e.g., training data, lack of diversity among foundation model developers, the broader sociotechnical context) that give rise to these biases and harms, emphasizing the importance, and technical difficulty, of source tracing to understand ethical and legal responsibility. We do not view unfairness as inevitable in the foundation model paradigm: to address unfair outcomes that arise from foundation models, we dually consider proactive interventions (e.g., technical methods like counterfactual data augmentation) and reactive recourse (e.g., mechanisms for feedback propagation and attribution of moral/legal responsibility). Misuse To learn more, see this section in the report. Authors: Antoine Bosselut*, Shelby Grossman*, Ben Newman We define foundation model misuse as the use of foundation models as they are technically intended (e.g., to generate language or video), but with the goal of causing societal harm (e.g., to generate disinformation, to develop deepfakes for harassment). We argue that advances in foundation models will result in higher-quality machine-generated content that will be easier to create and personalize for misuse purposes. For example, disinformation actors may use them to quickly generate collections of articles targeted across different demographic groups (e.g., nationality, political party, religion, etc.). While these new capabilities may limit existing human detection methods for harmful content (e.g., tracking similar text across different sources), foundation models may themselves provide promising potential as automated misuse detectors. Environment To learn more, see this section in the report. 
Authors: Peter Henderson, Lauren Gillespie, Dan Jurafsky Foundation models are the byproducts of computationally expensive training regimes, with the existing trajectory favoring even more intensive models; the energy required for this training coincides with the release of more carbon into the atmosphere and the degradation of the environment. Current discussion centers on these enormous single-time training costs and the potential to amortize these costs across repeated use. We seek to clarify these discussions by identifying assumptions that shape the calculus of environmental impact for foundation models. Further, we envision that the ecosystem surrounding foundation models requires a multi-faceted approach: (i) more compute-efficient models, hardware, and energy grids all may mitigate the carbon burden of these models, (ii) environmental cost should be a clear factor that informs how foundation models are evaluated, such that foundation models can be more comprehensively juxtaposed with more environment-friendly baselines, and (iii) the cost-benefit analysis surrounding environmental impact necessitates greater documentation and measurement across the community. Legality To learn more, see this section in the report. Authors: Neel Guha, Peter Henderson, Lucia Zheng, Mark Krass, Daniel E. Ho Foundation models rest on tenuous legal footings at present; how the law bears on both the development and use of these models is largely unclear. Legal and regulatory frameworks for foundation models specifically, alongside those for AI technology more generally, will be needed to influence, constrain, and even foster practices in research, development, and deployment. Centering on the legal landscape of the United States, where existing consideration of algorithmic tools remains broadly uncertain, we highlight the pertinent issues of liability for model predictions and protections from model behavior. 
With respect to both issues, we describe how legal standards will need to be advanced to address these given the intermediary status of foundation models (as opposed to that of user-facing task-specific models). Economics To learn more, see this section in the report. Authors: Zanele Munyikwa, Mina Lee, Erik Brynjolfsson Foundation models are likely to have substantial economic impact due to their novel capabilities and potential applications in a wide variety of industries and occupations. We consider the implications of the development and use of foundation models for the future of the US and global economy with a focus on productivity, wage inequality, and concentration of ownership. Ethics of scale To learn more, see this section in the report. Authors: Kathleen Creel, Dallas Card, Rose E. Wang, Isabelle Levent, Alex Tamkin, Armin W. Thomas, Lauren Gillespie, Rishi Bommasani, Rob Reich In addition to running the risk of increasing inequity, as discussed in the section on fairness, the widespread adoption of foundation models poses other ethical, political and social concerns. We discuss ethical issues related to the scale of application of foundation models, such as homogenization and the concentration of power, as well as the norms and release strategies appropriate to address them. Conclusion To learn more, see this section in the report. In this report, we have endeavored to comprehensively discuss many of the most critical aspects of foundation models, ranging from their technical underpinnings to their societal consequences. In this way, we acknowledge the unusual approach taken: we have attempted to clarify the nature of a paradigm that may only have just begun, rather than waiting for more to unfold or the dust to settle. 
Therefore, much still remains unclear in spite of our efforts and we reiterate that this is just the beginning of a paradigm shift: foundation models have only just begun to transform the way AI systems are built and deployed in the world. Moving forward, we view this document as serving an important role in orienting and framing dialogue on these models and this new paradigm in AI. That said, to ensure the responsible development and deployment of these models on durable foundations, we envision collaboration between different sectors, institutions, and disciplines from the onset to be especially critical. Acknowledgements We would like to thank the following people for their valuable feedback: Mohit Bansal, Boaz Barak, Yoshua Bengio, Stella Biderman, Su Lin Blodgett, Sam Bowman, Collin Burns, Nicholas Carlini, David Chalmers, Jack Clark, Jeff Dean, Jesse Dodge, Jarred Dunnmon, Gabe Dupre, Jason Eisner, Iason Gabriel, Dan Hendrycks, Avery Hill, Yacine Jernite, Gabbrielle Johnson, Sarah Kreps, Jay McClelland, Preetum Nakkiran, Julian Nyarko, Fernando Pereira, Vinodkumar Prabhakaran, Colin Raffel, Marten van Schijndel, Ludwig Schmidt, Yoav Shoham, Madalsa Singh, Megha Srivastava, Jacob Steinhardt, Emma Strubell, Qian Yang, Luke Zettlemoyer, and Ruiqi Zhong. In addition, we would like to especially thank Vanessa Parli for helping to organize this effort. Citation Guidelines To cite the entire report, please use the BibTeX entry provided below. To cite an individual section of the report, please reference the section number. For example, for the ethics section, cite as (Bommasani et al., 2021, §5.6) or (Bommasani et al., 2021, §5.6: Ethics). This can be achieved through the command \citep[][§5.6]{Bommasani2021FoundationModels} in LaTeX. @article{Bommasani2021FoundationModels, title={On the Opportunities and Risks of Foundation Models}, author={Rishi Bommasani and Drew A. Hudson and Ehsan Adeli and Russ Altman and Simran Arora and Sydney von Arx and Michael S. 
Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and Erik Brynjolfsson and S. Buch and Dallas Card and Rodrigo Castellon and Niladri S. Chatterji and Annie S. Chen and Kathleen A. Creel and Jared Davis and Dora Demszky and Chris Donahue and Moussa Doumbouya and Esin Durmus and Stefano Ermon and John Etchemendy and Kawin Ethayarajh and Li Fei-Fei and Chelsea Finn and Trevor Gale and Lauren E. Gillespie and Karan Goel and Noah D. Goodman and Shelby Grossman and Neel Guha and Tatsunori Hashimoto and Peter Henderson and John Hewitt and Daniel E. Ho and Jenny Hong and Kyle Hsu and Jing Huang and Thomas F. Icard and Saahil Jain and Dan Jurafsky and Pratyusha Kalluri and Siddharth Karamcheti and Geoff Keeling and Fereshte Khani and O. Khattab and Pang Wei Koh and Mark S. Krass and Ranjay Krishna and Rohith Kuditipudi and Ananya Kumar and Faisal Ladhak and Mina Lee and Tony Lee and Jure Leskovec and Isabelle Levent and Xiang Lisa Li and Xuechen Li and Tengyu Ma and Ali Malik and Christopher D. Manning and Suvir P. Mirchandani and Eric Mitchell and Zanele Munyikwa and Suraj Nair and Avanika Narayan and Deepak Narayanan and Benjamin Newman and Allen Nie and Juan Carlos Niebles and Hamed Nilforoshan and J. F. Nyarko and Giray Ogut and Laurel Orr and Isabel Papadimitriou and Joon Sung Park and Chris Piech and Eva Portelance and Christopher Potts and Aditi Raghunathan and Robert Reich and Hongyu Ren and Frieda Rong and Yusuf H. Roohani and Camilo Ruiz and Jack Ryan and Christopher R'e and Dorsa Sadigh and Shiori Sagawa and Keshav Santhanam and Andy Shih and Krishna Parasuram Srinivasan and Alex Tamkin and Rohan Taori and Armin W. Thomas and Florian Tram{\`e}r and Rose E. Wang and William Wang and Bohan Wu and Jiajun Wu and Yuhuai Wu and Sang Michael Xie and Michihiro Yasunaga and Jiaxuan You and Matei A. 
Zaharia and Michael Zhang and Tianyi Zhang and Xikun Zhang and Yuhui Zhang and Lucia Zheng and Kaitlyn Zhou and Percy Liang}, journal={ArXiv}, year={2021}, url={https://crfm.stanford.edu/assets/report.pdf} } --- A reproducible and transparent framework for evaluating foundation models. Find leaderboards with many scenarios, metrics, and models with support for multimodality and model-graded evaluation.
HELM Leaderboards
Capabilities: A new leaderboard for evaluating general capabilities of language models
Audio: Holistic Evaluation of Audio-Language Models
HELM Lite: Lightweight, broad evaluation of the capabilities of language models using in-context learning
HELM Classic: Thorough language model evaluations based on the scenarios from the original HELM paper
HEIM: Holistic evaluation of text-to-image models
HELM Instruct: Evaluations of instruction following models with absolute ratings
MMLU: Massive Multitask Language Understanding (MMLU) evaluations using standardized prompts
VHELM: Holistic Evaluation of Vision-Language Models
Image2Struct: Evaluations of Vision-Language Models on extracting structured information from images
AIR-Bench: Safety benchmark based on emerging government regulations and company policies
Safety: Safety benchmark that aggregates popular safety benchmarks across 6 risk vectors
CLEVA: Chinese-language benchmark for holistic evaluation of Chinese language models
ThaiExam: Thai-language evaluations of language models on standardized examinations in Thailand
SEA-HELM: Assessment of large language models across various tasks, emphasizing Southeast Asian languages
MMLU-Winogrande-Afr: Clinical MMLU and Winogrande in 11 low-resource African languages
ToRR: A benchmark for table reasoning and robustness
Finance: Financial-domain benchmark using real financial documents
MedHELM: A benchmark by medical experts for LLMs grounded in real-world healthcare needs
Long Context: A benchmark of LLM long context capabilities
Arabic: Evaluation of LLMs on 7 popular Arabic-language benchmarks