AI2 Dolma Most Frequent Words: a derivative dataset of the most common words
Originally published here. Best viewed as a Streamlit app.
In August 2023, AI2 released Dolma, an open corpus for training large language models. Read their blog post to learn more.
The texts in Dolma come from six data sources. This blog post shows the most frequent words in the three smallest sources: peS2o, Project Gutenberg, and Wikipedia.
Dolma was released under an interesting license: the AI2 ImpACT License for Medium Risk Artifacts. This small dataset of word counts is a Data Derivative of Dolma: it’s “a new dataset that incorporates some or all of our data” (see the license summary).
The ImpACT license includes use-based restrictions. Briefly, you can’t use this word list for a few purposes: military weapons/surveillance, law enforcement (including predictive or biometric identification systems), disseminating machine-generated information without a disclaimer that it is machine generated, and “fully automated decision-making without a human in the loop”.
AI2 call these “Flow Down Use-Based Restrictions”: “The Use-Based Restrictions should be included in an enforceable legal agreement for all downstream use and/or further distribution by your end users. Our intent is for the Use-Based Restrictions to continue running downstream.” I’m not sure what that means for the license of this word count table, or which licenses are compatible with these flow-down restrictions. (If you actually want to use these data for some reason, treat them as covered by the same AI2 ImpACT License for Medium Risk Artifacts and submit your contact info on the HuggingFace dataset release.)
Derivative Impact Report
The ImpACT license also requires me to produce a Derivative Impact Report, which is an interesting concept based on data cards. I submitted this report to AI2 via a web form.
- Intended Use: What is the intended use of the Derivative?
  A high-level overview of the most frequent words in 3 of the 6 Dolma sources.
- Intended Users: Who are the intended users of the Derivative?
  Any person interested in the most frequent words in the peS2o, Project Gutenberg, or Wikipedia sources for Dolma.
- Funding: What is the source of funding for the program, project, or initiative that developed the Derivative?
  No funding source.
- Dataset Sources & Modifications: To develop the Data Derivative, what is the source of any data that was added to the original dataset and/or how was any data removed from the original dataset?
  Only data from the peS2o, Project Gutenberg, and Wikipedia sources was used.
- Dataset Size: How many examples are included in the Data Derivative overall?
  1007 (the union of the top 500 most frequent words from each source)
How did I count words?
I chose the fastest approach I could think of: using bash for case-sensitive string counting after splitting on spaces and newlines. Punctuation is removed (via `tr -d '[[:punct:]]'`). I would not recommend this approach to segmentation or word identification.
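For concreteness, here is a simplified sketch of that kind of pipeline (not the exact commands I ran; the file names are placeholders and the input is assumed to be gzipped plain text):

```bash
# Split on spaces and newlines, strip punctuation, then count
# case-sensitive word frequencies. texts.txt.gz is a placeholder input.
gunzip -c texts.txt.gz \
  | tr ' ' '\n' \
  | tr -d '[[:punct:]]' \
  | grep -v '^$' \
  | sort \
  | uniq -c \
  | sort -rn \
  > word_counts.txt
```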
You can compare the token counts reported in the Dolma data release to my word counts in the table below. As expected, the word counts are lower: they exclude punctuation, and many words are composed of multiple tokens. The table also lists the number of unique words and the share of hapaxes (words that occur exactly once) in each source.
Source | GPT-NeoX Tokens (billions) | This word count (billions) | Unique words (millions) | % Hapaxes
---|---|---|---|---
peS2o | 57 | * | * | 23.2%
Project Gutenberg | 4.8 | 3.53 | 14.3 | 62.2%
Wikipedia | 3.6 | 2.59 | 12.9 | 54.7%
*The word counts for the peS2o data are based on a random sample of 500,000 documents (about 1% of the total, sampled using `shuf`’s reservoir sampling), so we can’t compute full word counts or compare to the reported token counts. (In the sample: 0.97 billion total words with 4.9 million unique words.)
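To give a sense of the sampling step, a minimal sketch is below (not the exact command; the file names are placeholders and it assumes one document per line). GNU `shuf -n` falls back to reservoir sampling when reading from a pipe, so it doesn’t need to hold the full input in memory:

```bash
# Reservoir-sample 500,000 documents (assumed one per line) from a placeholder input file.
gunzip -c pes2o_documents.jsonl.gz | shuf -n 500000 > pes2o_sample.jsonl
```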
Most Frequent Words
The table below shows the top 20 words by mean rank across the three sources. Mean Rank is the average of a word’s rank in peS2o, Gutenberg, and Wikipedia (for example, “in” ranks 4th, 6th, and 4th, for a mean rank of 4.67), and P(w) is the word’s relative frequency within that source. The full table is linked as a CSV at the end of this post.
Word | Mean Rank | peS2o Rank | peS2o P(w) | Gutenberg Rank | Gutenberg P(w) | Wikipedia Rank | Wikipedia P(w)
---|---|---|---|---|---|---|---
the | 1 | 1 | 0.0590858 | 1 | 0.060334 | 1 | 0.0625368
of | 2 | 2 | 0.0380566 | 2 | 0.0357862 | 2 | 0.034112
and | 3 | 3 | 0.0295427 | 3 | 0.0303696 | 3 | 0.0295959
in | 4.66667 | 4 | 0.0213987 | 6 | 0.0171874 | 4 | 0.0255978
to | 4.66667 | 5 | 0.0195903 | 4 | 0.0257457 | 5 | 0.0207556
a | 5.66667 | 6 | 0.0156087 | 5 | 0.0195295 | 6 | 0.0200714
is | 9 | 7 | 0.0122707 | 11 | 0.00810247 | 9 | 0.0095068
was | 9.66667 | 14 | 0.00583241 | 8 | 0.0104745 | 7 | 0.0128642
that | 10.6667 | 10 | 0.00880814 | 7 | 0.0110184 | 15 | 0.00586098
for | 11.3333 | 8 | 0.00955548 | 16 | 0.00705593 | 10 | 0.00829891
with | 12 | 9 | 0.00920586 | 14 | 0.00753056 | 13 | 0.00717199
The | 12.6667 | 11 | 0.00785279 | 19 | 0.0054833 | 8 | 0.0103172
as | 12.6667 | 12 | 0.00635921 | 15 | 0.00714859 | 11 | 0.00788162
on | 16.3333 | 16 | 0.00526325 | 21 | 0.00538805 | 12 | 0.00758886
by | 16.3333 | 13 | 0.00631448 | 22 | 0.00527338 | 14 | 0.00697149
at | 20 | 21 | 0.003628 | 23 | 0.0052012 | 16 | 0.00524444
be | 21 | 18 | 0.00495423 | 18 | 0.00571736 | 27 | 0.00253783
from | 21.3333 | 19 | 0.00433035 | 28 | 0.00424531 | 17 | 0.00520234
were | 23.3333 | 17 | 0.00499865 | 31 | 0.00364787 | 22 | 0.00333038
it | 23.3333 | 32 | 0.00214591 | 13 | 0.00797386 | 25 | 0.00284839
Why did I make this word list?
I made this because I’m a computer science researcher interested in making ML-powered systems useful and accessible. I was primarily motivated by an interest in the ImpACT license, and I wanted to produce a quick Data Derivative. I also think word counts are underrated: they’re a good way to quickly highlight the similarities and differences between the datasets that make up Dolma.
View the code for this blog post and Streamlit app on GitHub, including the full table as a CSV. Shout-outs to `gunzip` for being outrageously fast.
Under the terms of the license, I must include the following attribution notice: Dolma is licensed under the AI2 ImpACT License for Medium Risk Artifacts, © 2023 The Allen Institute for Artificial Intelligence.