Occiglot: Polyglot Language Models for the Occident
Our friends at Occiglot released a series of well-performing open-source multilingual models covering four EU languages (German, French, Spanish, and Italian) while retaining strong performance in English. For example, the Occiglot-7b-de-en-instruct model achieved the best average score among comparable models at the time of release, 0.566474.
Model Release
As part of their commitment to transparent research, they released an impressive ten intermediate 7B model checkpoints. The release focuses on the five largest European languages: English, German, French, Spanish, and Italian.
They started from Mistral-7B, an existing English-centric pre-trained model, and performed bilingual as well as multilingual continual pre-training on 700B additional multilingual tokens, followed by instruction tuning on 1B tokens for each language. The models are released on Hugging Face under the Apache 2.0 license.
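For readers who want to try one of the checkpoints, here is a minimal sketch (not an official Occiglot snippet) of loading the instruct model with Hugging Face transformers. It assumes the model is available under the repository id occiglot/occiglot-7b-de-en-instruct and ships a chat template.

```python
# Minimal sketch, assuming the repo id "occiglot/occiglot-7b-de-en-instruct"
# and a standard chat template; adjust to the checkpoint you want to test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "occiglot/occiglot-7b-de-en-instruct"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The instruct checkpoints are chat-tuned, so format the prompt with the
# tokenizer's chat template before generating.
messages = [{"role": "user", "content": "Wer war Goethe?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```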
About Occiglot
Occiglot is a non-profit, open-science collective that strongly believes dedicated language modeling solutions are key to maintaining academic and economic competitiveness, AI sovereignty, and digital language equality. Crucially, high-quality fundamental research and IP-driven technological applications require direct access to these models and to the data that went into training them.
Ontocord.AI is honored to participate in the collective and to contribute to Occiglot's efforts to create equal access to AI technology and bring linguistic fairness to AI.