CulturaY
Introduction
From the team that brought you CulturaX, we present CulturaY, another substantial multilingual dataset of 15TB (uncompressed) / 3TB (zstd-compressed) that applies the same dataset-cleaning methodology to the HPLT v1.1 dataset. Please note that HPLT v1.2 has also been released and is an alternative version with different cleaning methodologies. This data was used in part to train our SOTA Vietnamese model: Vistral-7B-Chat.
Background
Data and model architecture are two key components in creating a good language model. The Transformer architecture has become widely popular and is the main backbone of today's top-ranking models, but improving the model architecture usually raises a model's accuracy by only a few percentage points, whereas improving the training data can raise it by tens of percentage points or more. Training data is therefore receiving increasing attention and investment, especially as Large Language Models (LLMs) become more prevalent.
Recognizing the growing demand for LLM training data, the shortage of datasets that are both high-quality and large enough for training, and the fact that most publicly released datasets cover mainly common languages like English, our team at the University of Oregon released CulturaX - a large text dataset covering over 160 languages, meticulously cleaned and ready for training large language models.
Since its release, CulturaX has been used to train numerous LLMs and has contributed to the creation of many strong language models, such as Vistral - the SOTA Vietnamese model - and SambaLingo - SOTA models for 9 languages - among others. This indicates that CulturaX is a high-quality dataset and that the data processing approach used to create it is very effective.
Some time after CulturaX (X) was released, its main author, Thuat Nguyen, joined an Ontocord project to create another dataset for the community, resulting in CulturaY (Y) - a text dataset for over 70 languages built with the same strong pipeline as CulturaX.
Process
In this article, we discuss how we built CulturaY and the differences between versions X and Y. If you have read the CulturaX paper, you will notice many similarities here; however, several details of the original pipeline have been adjusted and upgraded to suit CulturaY.
Firstly, to create CulturaY, we began with the HPLT dataset (version 1.1). This is another notable difference between X and Y: while X was generated by cleaning data from Common Crawl (mC4, OSCAR), Y was generated by cleaning raw data from the Internet Archive (HPLT). Common Crawl is quite popular, whereas data from the Internet Archive is less known and less exploited, even though the data from both sources are similar. HPLT and CulturaY can be considered among the first publicly released datasets originating from the Internet Archive. Using CulturaX and CulturaY together will give your model a more diverse source of data.
In essence, our pipeline is built on BLOOM's data-cleaning pipeline: each document in the dataset is evaluated against criteria such as document length, perplexity, bad-words ratio, etc., and documents that score poorly on any of these criteria are removed. Note that we clean the data separately for each language, so the following explanations apply to each language in the HPLT dataset rather than to HPLT as a whole.
Document Analysis
To determine whether a document is good or bad, we can evaluate it based on criteria such as:
Document length: A document that is too short (e.g., < 100 characters) usually does not contain useful information.
Perplexity of the document: A document with excessively high perplexity typically lacks coherent, useful content for the model.
Bad words ratio: If a document contains too many bad words, it can make the model learn "toxic" behavior or generate impolite words.
We also evaluate documents on various other criteria, such as character repetition ratio, word repetition ratio, special character ratio, stop word ratio, number of lines, etc.
We perform this evaluation for at least 25% of the total number of documents. For common languages like English, evaluating the entire dataset would be costly and unnecessary. Based on our experience, evaluating 25% of randomly selected documents is sufficient to reflect the distribution of the entire dataset.
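To make this analysis step concrete, here is a minimal sketch in Python. It is illustrative only: the bad-word set, the perplexity_scorer callable (in practice a per-language language model), and the sampling fraction are placeholder assumptions, not the exact implementation.

```python
import random

SAMPLE_FRACTION = 0.25                    # analyse ~25% of documents, as described above
BAD_WORDS = {"badword1", "badword2"}      # placeholder; the real pipeline uses per-language lists

def document_stats(text, perplexity_scorer):
    """Compute a few of the per-document statistics used for filtering."""
    words = text.split()
    n_words = max(len(words), 1)
    return {
        "length": len(text),                                             # length in characters
        "bad_word_ratio": sum(w.lower() in BAD_WORDS for w in words) / n_words,
        "perplexity": perplexity_scorer(text),                           # e.g. a per-language LM score
    }

def analyse_sample(documents, perplexity_scorer, fraction=SAMPLE_FRACTION):
    """Score a random subset of documents; its distribution approximates the full dataset's."""
    sample = random.sample(documents, int(len(documents) * fraction))
    return [document_stats(doc, perplexity_scorer) for doc in sample]
```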
Choosing Filter Thresholds
After performing document analysis, for each criterion we obtain the scores of the documents evaluated based on that criterion.
To determine whether a document is good or bad based on a specific criterion, we select a valid threshold or range for each criterion. Documents with scores within the valid range for all criteria are considered good, while those outside the valid range for any criterion are considered bad and removed from the dataset.
To choose valid ranges for each criterion, we treat each criterion as a univariate and identify outliers. For example, if most documents have perplexity scores ranging from 200 to 1100, documents with perplexity scores higher than 1100 would be considered bad. We do not eliminate documents with perplexity scores below 200 since we aim to remove only those with excessively high perplexity.
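Conceptually, the filtering step then reduces to a per-document range check, roughly as sketched below. The criterion names and the example bounds are illustrative only; None on one side means that side of the range is unbounded.

```python
# Valid ranges per criterion: (lower_bound, upper_bound); None means unbounded on that side.
# Example values only -- the actual thresholds are derived per language (see below).
VALID_RANGES = {
    "length": (100, None),           # drop documents that are too short
    "perplexity": (None, 1100.0),    # drop documents with excessively high perplexity
    "bad_word_ratio": (None, 0.05),  # drop documents with too many bad words
}

def is_good(stats, valid_ranges=VALID_RANGES):
    """A document is kept only if every criterion falls inside its valid range."""
    for name, (lo, hi) in valid_ranges.items():
        value = stats[name]
        if lo is not None and value < lo:
            return False
        if hi is not None and value > hi:
            return False
    return True
```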
In BLOOM's original pipeline, the distribution of each variable (criterion) was visualized and native speakers manually selected a threshold for each criterion. This approach has weaknesses, particularly for languages with no volunteer native speakers, for which no thresholds can be chosen at all. In addition, they analyzed only about 15 thousand documents, a small number compared to the total, so the observed distribution of each criterion may not resemble the distribution of the entire dataset.
To address this, instead of relying on humans to choose thresholds for each criterion, we opted for outlier detection algorithms so that all languages can be supported. We experimented with algorithms such as Local Outlier Factor, Random Forest, DBSCAN, etc. However, these algorithms have drawbacks: training times are long, and the parameters must be fine-tuned to obtain reasonable threshold values for each criterion. It is also difficult to determine which parameter set yields the best results when several of them produce seemingly reasonable values.
Finally, we choose to use the Interquartile Range (IQR) method to select thresholds for each criterion. The strength of IQR lies in its simplicity, resulting in faster results compared to other methods. Selecting values with IQR is also much simpler than parameter tuning for other algorithms. Furthermore, using IQR allows us to understand the data distribution, thereby avoiding selecting excessively high thresholds that would remove too many documents unnecessarily.
Based on our experience, selecting thresholds at the 10th percentile (Q_1) and the 90th percentile (Q_3) works well for all criteria while still retaining the majority of the data. Depending on the criterion, either Q_1 or Q_3 is used. For instance, with perplexity, a document whose perplexity score exceeds Q_3 is removed; conversely, for criteria like document length, documents with lengths below Q_1 are eliminated.
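In code, this amounts to taking the 10th or 90th percentile of each criterion's scores over the analysed sample; which side applies is decided per criterion. A minimal sketch, with made-up numbers:

```python
import numpy as np

def select_threshold(scores, side):
    """One-sided threshold from the empirical distribution of a criterion.

    side == "upper": cut at the 90th percentile (e.g. perplexity).
    side == "lower": cut at the 10th percentile (e.g. document length).
    """
    q = 90 if side == "upper" else 10
    return float(np.percentile(scores, q))

# Example with made-up perplexity scores taken from the analysed sample:
perplexity_scores = [250.0, 480.0, 630.0, 900.0, 4100.0]
valid_perplexity_range = (None, select_threshold(perplexity_scores, "upper"))
```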
Additional Cleaning Step
After obtaining thresholds for each criterion, we can use these thresholds to remove bad, outlier documents from the dataset.
The remaining documents are relatively clean; however, some may still contain noise such as JavaScript code, or short lines at the end of the document that are rarely useful (comments, "related articles" links, etc.).
Therefore, in this final step, we clean the content of each document by removing lines containing JavaScript and removing consecutive short lines at the end of each article.
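A minimal sketch of this final pass is shown below. The definition of a "short" line (here, fewer than 30 characters) and the JavaScript markers are simplified assumptions for illustration, not the exact rules used in the pipeline.

```python
MIN_LINE_LENGTH = 30          # assumed cutoff for a "short" line
JS_MARKERS = ("function(", "var ", "document.", "window.", "</script>")

def clean_document(text):
    """Drop JavaScript-looking lines, then strip consecutive short lines at the end."""
    lines = [
        line for line in text.splitlines()
        if not any(marker in line for marker in JS_MARKERS)
    ]
    # Remove trailing short lines (comments, "related articles" links, navigation, ...).
    while lines and len(lines[-1].strip()) < MIN_LINE_LENGTH:
        lines.pop()
    return "\n".join(lines)
```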
And that's how we created CulturaY.
Acknowledgement
We thank our collaborators at UONLP - the Natural Language Processing group at the University of Oregon - and the managers of the Karolina supercomputer for providing computing resources. We also thank our friends at TurkuNLP for their support.