Making large language and multimodal models smarter and safer for enterprises and users
Making our AI future accessible to everyone
Lawful AIs by design
~ We create state-of-the-art AIs that are directed to legal standards.
We develop large-scale models and provide enterprise data tooling with the goal of compliance with legal standards, ensuring that our language and multimodal technologies are trusted by businesses and the broader community.
~ We work well with others.
We work with infrastructure partners, such as Together.ai and Hessian.ai, to deliver highly performant multimodal models.
~ Our tooling complements our deep experience with:
pre-training
continued pre-training
fine-tuning
reinforcement learning
alignment & red-teaming
~ We support open science.
Moreover, our commitment to open science research aims to address unlawful outputs and AI biases, fostering the development of practical applications for multilingual and multimodal AI.
Leveraging our experience, Ontocord is creating a training platform dedicated to data-centric AI and ensuring regulatory compliance.
Lawfulness-directed, human-overseen, data-opinionated, model-agnostic, secure, scalable, compute-efficient large multimodal foundation models.
Selected Press
RedPajama is a collaborative project between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute to create leading, fully open-source large language models (LLMs). Its effort began with yesterday’s release of a 1.2 trillion token dataset that follows the LLaMA recipe. The data enables any organization to pre-train models that can be permissively licensed. The full dataset is available on Hugging Face and users can reproduce results with Apache 2.0 scripts available on Github.
- VentureBeat
“Ontocord [created] a training dataset for the models. Called the Open Instruction Generalist Dataset, the dataset contains more than 40 million examples of questions and answers, follow-up questions and more designed to “teach” a model how to respond to different instructions (e.g. “Write an outline for a history paper on the Civil War”).”
- TechCrunch
Research
At Ontocord, we train large models and build enterprise data tooling with the goal of complying with laws and enabling large language and multimodal models to be trusted by businesses and the wider community. We build datasets so that models can be more trustworthy, and we promote research in the wider open science community to further this goal. Our work is directed at reducing illegal and biased AI output and creating more useful applications, including multimodal and multilingual AIs.
aurora-m-v0.1-biden-harris-redteamed
The First Open Source Biden-Harris Executive Order Red-teamed Multilingual Model
Jan 24, 2024
Model Description
This is version 0.1 of Aurora-m, a Starcoderplus-based 16B model that underwent continued pretraining on approximately 435B additional tokens. This version is an experimental research version meant to be used for multidomain and multilingual red-teaming research. This project was created as part of the MDEL efforts.
Acknowledgement:
Training was conducted on the LUMI supercomputer, using compute resources generously provided by CSC - IT Center for Science, Finland. We thank them and all the participants of the MDEL efforts, whom we will list in an updated version of this model card and our corresponding data card. And of course, thank you to the wonderful BigCode team (of which many members of our team are part) for Starcoderplus.
Vistral-7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese
January 13, 2024
by Chien Van Nguyen, Thuat Nguyen, Quan Nguyen, Huy Nguyen, Björn Plüster, Nam Pham, Huu Nguyen, Patrick Schramowski, Thien Nguyen
Model Description
We introduce Vistral-7B-Chat, a multi-turn conversational large language model for Vietnamese, which as of February 2024 is state-of-the-art in the 7B category. Vistral is extended from the Mistral 7B model using diverse data for continual pre-training and instruction tuning. In particular, our process to develop Vistral involves:
Extend the tokenizer of Mistral 7B to better support Vietnamese.
Perform continual pre-training for Mistral over a diverse dataset of Vietnamese texts that are meticulously cleaned and deduplicated.
Perform supervised fine-tuning for the model using diverse instruction data. We design a set of instructions to align the model with the safety criteria in Vietnam.
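The cleaning and deduplication used in the steps above are not described in detail here. As a rough illustration only, the following is a minimal sketch of exact deduplication by hashing normalized text. It is an assumption for illustration, not the Vistral pipeline: the function names `normalize` and `deduplicate` are hypothetical, and a production pipeline would typically add near-duplicate detection and quality filtering on top.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode form, case, and whitespace before hashing."""
    # NFC matters for Vietnamese: the same accented letter can be encoded
    # as one precomposed code point or as a base letter plus combining marks.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Calling `deduplicate` on a list of strings returns the documents in their original order with later copies dropped; two documents count as copies when they differ only in case, whitespace, or Unicode composition.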
Data
We will make the data available after we release the technical report for this model. However, we have made some of the data available in our CulturaY and CulturaX datasets.
Performance
We evaluated our Vistral model using the VMLU leaderboard, a reliable framework for evaluating large language models in Vietnamese across various tasks. These tasks involve multiple-choice questions in STEM, Humanities, Social Sciences, and more. Our model achieved an average score of 50.07%, surpassing ChatGPT's score of 46.33% by 3.74 percentage points.
Acknowledgement
We thank Hessian AI and the Jülich Supercomputing Centre (JSC) for their support and compute in order to train this model.
CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages
Dataset Summary
From the team that brought you CulturaX, we present CulturaY, another substantial multilingual dataset that applies the same dataset cleaning methodology to the HPLT v1.1 dataset. Please note that HPLT v1.2 has also been released and is an alternative version with different cleaning methodologies. This data was used in part to train our SOTA Vietnamese model: Vistral-7B-Chat.
Our annotations and arrangements are licensed under CC-BY-4.0, and we make the data available for fair use machine learning research.
However, we make no claims as to the underlying copyrights of the works. This data was copied from the HPLT project, which in turn used data from Common Crawl and the Internet Archive.
Acknowledgement
We thank our collaborators at UONLP - The Natural Language Processing Group at the University of Oregon, and the managers of the Karolina Supercomputers for their computing resources. We also thank our friends at TurkuNLP for their support.
Open Instruction Generalist (OIG) Dataset
The Open Instruction Generalist (OIG) dataset is one of the first large-scale open-source instruction datasets, containing ~43M instructions.
OIG is a chatbot dataset created by Ontocord with Together, LAION, and other members of the research community, and is intended to create equal access to chatbot technology. It contains 30 component instruction datasets. As of February 2024, it has been downloaded over 7,000 times and used to train many dozens of models.
BigScience
Huu Nguyen of Ontocord co-led the data governance efforts of BigScience, an initiative started by Hugging Face. BigScience is a collaboration of more than 1,000 researchers from 60 countries and more than 250 institutions to create a very large multilingual neural network language model and a very large multilingual text dataset on the 28-petaflops Jean Zay (IDRIS) supercomputer located near Paris, France. See our paper, Data Governance in the Age of Large-Scale Data-Driven Language Technology:
The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
SafeLMM: Safe Large Multimodal Models By Design - Proposal
Oct 13, 2023
by Huu Nguyen, Robert Kaczmarczyk, Anna Rogers, Bo Li, Ludwig Schmidt, Rio Yokota, Marianna Nezhurina, Liangyu Chen, Marzena Karpinska, Taishi Nakamura, Tommaso Furlanello, Tanmay Laud, Giovanni Puccetti, Xiaozhe Yao, Dung Nguyen, Qi Sun, Aleksandr Drozd, Paulo Villegas, Gabriel Ilharco Magalhaes, Mitchell Wortsman, Weiyang Liu, Christoph Schuhmann, Kenneth Heafield, Jenia Jitsev.
The proposed Synthetic Augmented data, Fair and Extreme-scaled Large Multimodal Model (SafeLMM) project will redefine the AI landscape by pioneering next-generation multimodal models that emphasise ethical and regulatory compliance. In collaboration with Ontocord AI, PIISA.org, LAION e.V., Juelich Supercomputing Center, Horizon Europe project HPLT, and Efficient Translation Limited, among others, the SafeLMM models, ranging from 7B to 34B parameters, will harness vast amounts of detoxified synthetic data and open and permissively licensed real data spanning images and text in 31 languages to address compliance with regulations.
OpenAssistant
As one of the four founders of OpenAssistant, Huu Nguyen of Ontocord co-led a group of 13,500 volunteers who created large-scale human-generated data for the first open-source alternative to ChatGPT. See our paper:
OpenAssistant Conversations -- Democratizing Large Language Model Alignment
Aligning large language models (LLMs) with human preferences has proven to drastically improve usability and has driven rapid adoption as demonstrated by ChatGPT. Alignment techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) greatly reduce the required skill and domain knowledge to effectively harness the capabilities of LLMs, increasing their accessibility and utility across various domains. However, state-of-the-art alignment techniques like RLHF rely on high-quality human feedback data, which is expensive to create and often remains proprietary. In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations, a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 complete and fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers. Models trained on OpenAssistant Conversations show consistent improvements on standard benchmarks over respective base models. We release our code and data under a fully permissive licence.
Red Pajama Dataset v1 and models
RedPajama is an effort to produce a reproducible, fully open, leading language model and a 1.2T token dataset. RedPajama is a collaboration between Ontocord, Together, ETH DS3Lab, Stanford CRFM, and Hazy Research. RedPajama has three key components:
Pre-training data, which needs to be both high quality and have broad coverage
Base models, which are trained at scale on this data
Instruction tuning data and models, which improve the base model to make it usable and safe
As of February 2024, it has been downloaded tens of thousands of times, and used to train over 160 models.
Legal Playbook for Natural Language Processing Researchers
As part of Hugging Face’s BigScience efforts, Huu Nguyen of Ontocord led the NYU law school-based legal hackathon that authored the NLP Legal Playbook, which has since become part of OECD.AI’s Catalogue of Tools & Metrics for Trustworthy AI.
This playbook is a legal research resource for various activities related to data gathering, data governance, and disposition of an AI model available as a public resource. It aims to benefit academic and government researchers, including those in New York State, who wish to understand how best to use AI models to provide natural language processing (“NLP”) as public infrastructure, but who do not have legal resources. The playbook aims to be a general informational resource for public organizations, including cross-national organizations focused on non-commercial open science in NLP and promotion of the human right to equal access to scientific advancement under UDHR Art. 27.
With this playbook, we strive to assist researchers who have fewer resources and to help them guide their communities and their research, including low-income communities that may not have access to legal resources. In particular, this playbook is cross-jurisdictional and will hopefully be relevant to NLP and data researchers in underserved language communities whose data will be processed (e.g., minority dual-language speakers) and to those who wish to participate and have a stake in AI.
Meet the Team
-
Huu Nguyen, Esq.
CEO & PARTNERSHIP ADVOCATE
-
Jenia Jitsev, PhD
OPEN FOUNDATION MODELS ADVOCATE
LAB LEAD AT JUELICH SUPERCOMPUTING CENTER (JSC)
-
Patrick Schramowski, PhD
ETHICAL AI ADVOCATE
GROUP LEAD AT DFKI OF THE TU DARMSTADT UNIVERSITY
-
Quan Nguyen
MULTIMODAL LEAD
-
Thuat Nguyen
LARGE SCALE MULTILINGUAL DATA LEAD
-
Marianna Nezhurina
SCIENTIFIC MODELS & DATA
JUELICH SUPERCOMPUTING CENTER (JSC)
-
Nathan Tyler
SUSTAINABILITY & BRAND LEAD
EX-GOOGLE
-
Bo Li, PhD
SAFETY ADVISOR
PROFESSOR AT UNIVERSITY OF CHICAGO, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
-
David Lansky, Esq.
AI & SOCIETY POLICY
ORGANIZATIONAL ADVISOR
EX-GENERAL COUNSEL OF OPENAI,
CURRENT GENERAL COUNSEL OF CONVERGENT RESEARCH
-
Kenneth Heafield, PhD
SCALABILITY ADVISOR
CEO OF EFFICIENT TRANSLATION, LTD
-
Karan Malhotra
COMMUNITY ADVISOR
CIO OF NOUS RESEARCH
-
Victor May
INDUSTRY ADVISOR
ML MANAGER AT CHEGG