Moby Project

The Moby Project is a collection of several types of word lists, mostly in English. The files are intended for use in lexical projects. They are in the public domain, and most of them can be downloaded from Project Gutenberg.

The lists making up the Moby Project are as follows:

Moby Hyphenator II - list of word hyphenations.
Moby Language II - word lists of five languages: French, German, Italian, Japanese, and Spanish.
Moby Part-of-Speech - list of words with their parts of speech.
Moby Pronunciator II - list of words with their pronunciations using Moby's pronunciation mark-up.
Moby Shakespeare - the complete unabridged works of Shakespeare. This file is not available from Project Gutenberg.
Moby Thesaurus II
Words (see below)

The Words distribution consists of the following files:

Acronyms
Common words - words present in two or more published dictionaries. The list starts with nearly 300 suffixes.
Compound words - phrases, proper nouns, and acronyms not included in the common words file. This file starts with over 540 suffixes, which do not obviously fit this category.
Words included in the first edition of the Official Scrabble Players Dictionary
Additions to the Official Scrabble Players Dictionary in the second edition
Most frequently occurring words in the English language
Most frequently occurring words on Usenet in 1992
Most frequently occurring substrings in the King James Version of the Bible
Most common names used in the United States and Great Britain
Common English female names
Common English male names
Most commonly misspelled English words. Such lists usually give each misspelling alongside the correct spelling, but this one contains no actual misspellings.
Place names in the United States
Single words excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings
United States Constitution including all amendments current to 1993

Issues

While some of the files contain many entries, a significant percentage of those entries are not words at all, but numbers, made-up words, misspellings, proper names, or phrases made up of words already in the lists. In many cases, dozens or even hundreds of "words" are simply other words with common prefixes such as "un-" or "non-" added to them, many of which would not be found in dictionaries.
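As a rough illustration of the kind of screening this implies, the following Python sketch labels entries that contain digits, contain spaces, or are just a listed word with "un-" or "non-" attached. The function name classify, the label strings, and the sample vocabulary are all hypothetical and are not taken from the Moby files; a real clean-up would need many more rules.

```python
# Minimal sketch: flag entries that are unlikely to be dictionary words.
# PREFIXES holds only the two prefixes named in the text; extend as needed.
PREFIXES = ("un", "non")


def classify(entry, vocabulary):
    """Return a label for an entry; vocabulary is a set of lowercase words."""
    if any(ch.isdigit() for ch in entry):
        return "number"
    if " " in entry:
        return "phrase"
    lowered = entry.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            # Drop the prefix and an optional hyphen, then look up the stem.
            stem = lowered[len(prefix):].lstrip("-")
            if stem in vocabulary:
                return "prefixed"
    return "keep"


if __name__ == "__main__":
    vocab = {"happy", "smoker", "under"}  # hypothetical sample vocabulary
    for word in ["unhappy", "non-smoker", "under", "1992", "ad hoc"]:
        print(word, "->", classify(word, vocab))
```

Entries labeled "prefixed" are only candidates for removal, since some prefixed forms are legitimate dictionary words in their own right.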

Another common problem is the inclusion of many non-English words and phrases that are rarely seen in English writing and thus do not belong in lists of English words. So while the lists contain some useful data, considerable clean-up work would be needed to make them actually useful.

For example, the Hyphenator is said to contain over 187,000 hyphenated words, but in addition to the problems listed above, nearly 10,000 of the words have only one syllable, such as "through", so no hyphenation is possible for them.
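A count like the one above could be checked by separating entries that carry a hyphenation mark from those that do not. The sketch below assumes one entry per line and takes the mark as a parameter, because the delimiter used in the Hyphenator file is not stated here; the function name and the "•" character in the sample data are placeholders.

```python
# Minimal sketch: partition entries by whether they contain a hyphenation mark.
# The mark is passed in by the caller; "•" below is only a placeholder.
def split_by_hyphenation(lines, mark):
    """Return (hyphenated, unhyphenated) lists of stripped, non-empty entries."""
    hyphenated, unhyphenated = [], []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if mark in entry:
            hyphenated.append(entry)
        else:
            unhyphenated.append(entry)
    return hyphenated, unhyphenated


if __name__ == "__main__":
    sample = ["hy•phen•ate", "through", "", "word•list\n"]  # made-up entries
    with_marks, without_marks = split_by_hyphenation(sample, "•")
    print(len(with_marks), "with marks;", len(without_marks), "without")
```

An entry without a mark is not necessarily a one-syllable word, but it adds no hyphenation information to the list either way.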

The list of Japanese words contains English words such as "abnormal" and non-words such as "abcdefgh" and "m,./". The sorting of these lists is also inconsistent: the French list is in straight alphabetical order, while the German list gives traditionally capitalized words in alphabetical order, followed by traditionally lowercased words in alphabetical order. The list of Italian words, however, contains no capitalized words whatsoever.
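The two orderings described above can be told apart mechanically. The following sketch is one possible check, with a hypothetical function name and made-up sample words; it compares by case-folded code points, so accented letters may not sort the way the actual files order them.

```python
# Minimal sketch: report which of the two described orderings a list follows.
def describe_ordering(words):
    """Return 'alphabetical', 'capitalized-then-lowercase', or 'other'."""
    # Straight alphabetical listing, ignoring case.
    if words == sorted(words, key=str.casefold):
        return "alphabetical"
    # Capitalized words first, then lowercased words, each block sorted.
    caps = [w for w in words if w[:1].isupper()]
    lower = [w for w in words if not w[:1].isupper()]
    if (
        words == caps + lower
        and caps == sorted(caps, key=str.casefold)
        and lower == sorted(lower, key=str.casefold)
    ):
        return "capitalized-then-lowercase"
    return "other"


if __name__ == "__main__":
    print(describe_ordering(["abeille", "Paris", "zone"]))
    print(describe_ordering(["Apfel", "Zug", "aber", "gehen"]))
```

A list with no capitalized words at all, as described for the Italian file, would simply report "alphabetical" if it is sorted.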
