
Our research is directed toward making lawful and useful AGI
At Ontocord, we train large models and build enterprise data tooling with the goal of complying with laws and enabling large language and multimodal models to be trusted by businesses and the wider community. Our research is directed toward reducing illegal and biased AI output and creating more useful applications, including multimodal and multilingual AIs.
Our Research
-
Aurora-m
aurora-m-v0.1-biden-harris-redteamed
The First Open Source Biden-Harris Executive Order Red-teamed Multilingual Model
Jan 24, 2024
Model Description
This is version 0.1 of Aurora-m, a StarCoderPlus-based 16B model that was continually pretrained on approximately 435B additional tokens. This version is an experimental research release intended for multidomain and multilingual red-teaming research. This project was created as part of the MDEL efforts.
Acknowledgement:
Training was conducted on the LUMI supercomputer, using compute resources generously provided by CSC - IT Center for Science, Finland. We thank them and all the participants of the MDEL efforts, whom we will list in an updated version of this model card and our corresponding data card. And of course, thank you to the wonderful BigCode team (of which many members of our team are a part) for StarCoderPlus.
-
SafeLMM
Oct 13, 2023
by Huu Nguyen, Robert Kaczmarczyk, Anna Rogers, Bo Li, Ludwig Schmidt, Rio Yokota, Marianna Nezhurina, Liangyu Chen, Marzena Karpinska, Taishi Nakamura, Tommaso Furlanello, Tanmay Laud, Giovanni Puccetti, Xiaozhe Yao, Dung Nguyen, Qi Sun, Aleksandr Drozd, Paulo Villegas, Gabriel Ilharco Magalhaes, Mitchell Wortsman, Weiyang Liu, Christoph Schuhmann, Kenneth Heafield, Jenia Jitsev.
The proposed Synthetic Augmented data, Fair and Extreme-scaled Large Multimodal Model (SafeLMM) project will redefine the AI landscape by pioneering next-generation multimodal models that emphasise ethical and regulatory compliance. In collaboration with Ontocord AI, PIISA.org, LAION e.V., the Juelich Supercomputing Center, the Horizon Europe project HPLT, and Efficient Translation Limited, among others, the SafeLMM models, ranging from 7B to 34B parameters, will harness vast amounts of detoxified synthetic data and open, permissively licensed real data spanning images and text in 31 languages to address compliance with regulations.
-
Vistral-7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese
January 13, 2024
by Chien Van Nguyen, Thuat Nguyen, Quan Nguyen, Huy Nguyen, Björn Plüster, Nam Pham, Huu Nguyen, Patrick Schramowski, Thien Nguyen
Model Description
We introduce Vistral-7B-chat, a multi-turn conversational large language model for Vietnamese, which as of February 2024 is state-of-the-art in the 7B category. Vistral is extended from the Mistral 7B model using diverse data for continual pre-training and instruction tuning. In particular, our process to develop Vistral involves:
Extending the tokenizer of Mistral 7B to better support Vietnamese.
Performing continual pre-training of Mistral on a diverse dataset of Vietnamese texts that is meticulously cleaned and deduplicated.
Performing supervised fine-tuning of the model using diverse instruction data, including a set of instructions we designed to align the model with safety criteria in Vietnam.
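The tokenizer-extension step above can be illustrated with a toy sketch. Vistral's actual work extends Mistral 7B's SentencePiece vocabulary; the greedy tokenizer and the specific tokens below are hypothetical stand-ins, showing only why adding Vietnamese subwords shortens token sequences (unseen characters otherwise fall back to many byte-level pieces):

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with a byte-level fallback,
    mimicking how subword tokenizers shred uncovered words into many pieces."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Worst case: emit one token per UTF-8 byte of the character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

# Hypothetical vocabularies: the base one has no Vietnamese coverage,
# the extended one adds a few Vietnamese subword tokens.
base_vocab = {" ", "a", "e", "i", "n", "o", "t", "v"}
extended_vocab = base_vocab | {"việt", "nam", "tiếng"}

text = "tiếng việt"
before = tokenize(text, base_vocab)
after = tokenize(text, extended_vocab)
print(len(before), len(after))  # far fewer tokens with the extended vocabulary
```

After extension, the model's embedding matrix must also be resized to match the new vocabulary size before continual pre-training resumes.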
Data
We will make the data available after we release the technical report for this model. However, we have made some of the data available here in our CulturaY and CulturaX datasets.
Performance
We evaluated our Vistral model on the VMLU leaderboard, a reliable framework for evaluating large language models in Vietnamese across various tasks. These tasks involve multiple-choice questions in STEM, the Humanities, the Social Sciences, and more. Our model achieved an average score of 50.07%, significantly surpassing ChatGPT's score of 46.33%.
Acknowledgement
We thank Hessian AI and the Jülich Supercomputing Centre (JSC) for their support and for the compute used to train this model.
