Moby Project

Revision as of 18:21, 2 September 2016

The Moby Project is a collection of several different types of word lists. The files are in the public domain and can be downloaded from Project Gutenberg.

While some of the files have a lot of entries, a significant percentage of those entries are not words at all, but numbers, made-up words, misspellings, proper names, and phrases made up of words already in the lists. In many cases, dozens or even hundreds of "words" are simply other words with common prefixes such as "un-" and "non-" added to them, and many of these would not be found in dictionaries. Another common problem is the inclusion of many non-English words and phrases that are not commonly seen in English writing and thus do not belong in lists of English words. So while there is some useful data in the lists, someone would have to do a lot of clean-up work to make the lists actually useful.
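A first clean-up pass along these lines could be sketched as follows. This is only an illustration under stated assumptions: the function name `triage_word_list`, the prefix tuple, and the rule that a plain word consists solely of lowercase ASCII letters are all hypothetical choices made here, not part of the Moby files or of any tool distributed with them.

```python
import re

# Hypothetical prefix list; the text above names only "un-" and "non-".
PREFIXES = ("un", "non")

# Assumption: a "plain word" is made up of lowercase ASCII letters only.
PLAIN_WORD = re.compile(r"[a-z]+")


def triage_word_list(entries):
    """Split entries into (kept, flagged).

    Entries that are not plain lowercase words (numbers, phrases,
    capitalized names, punctuation) are discarded. Plain words that are
    a listed prefix plus another word already in the list are flagged
    for manual review instead of being kept.
    """
    candidates = [entry.strip() for entry in entries]
    plain = {entry for entry in candidates if PLAIN_WORD.fullmatch(entry)}
    kept, flagged = [], []
    for entry in candidates:
        if entry not in plain:
            continue  # not a plain lowercase word
        is_prefixed = any(
            entry.startswith(prefix) and entry[len(prefix):] in plain
            for prefix in PREFIXES
        )
        (flagged if is_prefixed else kept).append(entry)
    return kept, flagged
```

Flagged entries still need a human check, because a purely mechanical prefix test also catches legitimate words: "unit" would be flagged whenever "it" is in the list.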

For example, the Hyphenator is said to contain over 187,000 hyphenated words, but in addition to the problems listed above, nearly 10,000 of the entries are one-syllable words, such as 'through' and 'avoir', which have no possible hyphenation points.
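Entries of this kind can be found mechanically by looking for lines that contain no hyphenation point. The sketch below assumes one entry per line and takes the marker as a parameter; the default `"-"` is a placeholder, since the character actually used in the Hyphenator file is not stated here, and the function name is hypothetical.

```python
def entries_without_breaks(entries, separator="-"):
    """Return the non-empty entries that contain no hyphenation point.

    `separator` is an assumed marker; pass whatever character the file
    actually uses to mark syllable breaks.
    """
    stripped = (entry.strip() for entry in entries)
    return [entry for entry in stripped if entry and separator not in entry]
```

Counting the returned list gives the number of entries that cannot be hyphenated at all.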

The list of Japanese words contains English words such as abnormal and non-words such as abcdefgh and m,./. The lists are also sorted inconsistently: the French list is in straight alphabetical order, while the German list gives the traditionally capitalized words in alphabetical order, followed by the traditionally lowercased words in alphabetical order. The list of Italian words, however, contains no capitalized words whatsoever.

The lists making up the Moby Project are as follows:

Moby Hyphenator II -- list of word hyphenations.

Moby Language II -- word lists of five languages: French, German, Italian, Japanese, and Spanish.

Moby Part-of-Speech -- list of words with their parts of speech.

Moby Pronunciator II -- list of words with their pronunciations using Moby's pronunciation mark-up.

Moby Shakespeare contains the complete unabridged works of Shakespeare. This file is not available from Project Gutenberg.

Moby Thesaurus II

Words

The Words distribution consists of the following files:

Acronyms

Common words - words present in two or more published dictionaries. The list starts with nearly 300 suffixes.

Compound words - phrases, proper nouns, and acronyms not included in the common words file. This file starts with over 540 suffixes, which do not obviously fit this category.

Words included in the first edition of the Official Scrabble Players Dictionary

Additions to the Official Scrabble Players Dictionary in the second edition

Most frequently occurring words in the English language

Most frequently occurring words on Usenet in 1992

Most frequently occurring substrings in the King James Version of the Bible

Most common names used in the United States and Great Britain

Common English female names

Common English male names

Most commonly misspelled English words. While such lists usually give each misspelling along with the correct spelling, this list doesn't contain any actual misspellings.

Place names in the United States

Single words excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings

United States Constitution including all amendments current to 1993
