Moby Project: Difference between revisions

From Wikipedia, the free encyclopedia
No edit summary
Took out all the hype.
Line 4: Line 4:
{{refimprove|date=January 2016}}
{{refimprove|date=January 2016}}
}}
}}
The '''Moby Project''' is a collection of public-domain lexical resources. It was created by [[Grady Ward]]. The resources were dedicated to the public domain, and are now mirrored at [[Project Gutenberg]]. {{As of|2007}}, it contains the largest free phonetic database, with 177,267 words and corresponding pronunciations.
The '''Moby Project''' is a collection of several types of word lists. The files are in the public domain and can be downloaded from [[Project Gutenberg]].


While some of the files contain many entries, a significant percentage of those entries are not words at all but numbers, made-up words, misspellings, proper names, and phrases made up of words already in the lists. In many cases, dozens or even hundreds of "words" are simply other words with common prefixes such as "un-" or "non-" added, and many of these would not be found in dictionaries. Another common problem is the inclusion of many non-English words and phrases that are not commonly seen in English writing and thus do not belong in lists of English words. So while the lists contain some useful data, a lot of clean-up work would be needed to make them actually useful.
== Hyphenator ==
The '''Moby Hyphenator II''' contains the hyphenations of 187,175 words and phrases (including 9,752 entries where no hyphenations are given, such as ‘through’ and ‘avoir’). The character encoding appears to be MacRoman, and hyphenation is indicated by a bullet (character value 165 decimal, or A5 hexadecimal). Some entries, however, have a combination of actual hyphens and character 165, such as "bar•ber-sur•geon".


For example, the Hyphenator is said to contain over 187,000 hyphenated words, but in addition to the problems listed above, nearly 10,000 of the entries, such as 'through' and 'avoir', are listed without any hyphenation.
There is little to no documentation of the hyphenation choices made; the following examples might give some flavour of the style of hyphenation used: at•mos•phere; at•tend•ant; ca•pac•i•ty; un•col•or•a•ble.
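The encoding described above can be handled with a short Python sketch. It assumes the entries really are MacRoman bytes (the text only says the encoding "appears to be" MacRoman) and that byte 165 marks every hyphenation point; the function names are hypothetical and are not part of the Moby distribution.

```python
# Sketch only: assumes MacRoman input, where byte 0xA5 (165) decodes to U+2022.
BULLET = "\u2022"


def split_hyphenation(raw: bytes) -> list[str]:
    """Decode one Hyphenator entry and split it at the hyphenation bullets."""
    text = raw.decode("mac_roman").strip()
    return text.split(BULLET)


def plain_word(raw: bytes) -> str:
    """Return the entry without hyphenation bullets; real hyphens are kept."""
    return raw.decode("mac_roman").strip().replace(BULLET, "")
```

With this sketch, an entry that has no bullet comes back as a single segment, and an entry such as "bar•ber-sur•geon" keeps its actual hyphen inside the middle segment.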


The list of Japanese words contains English words such as ''abnormal'' and non-words such as ''abcdefgh'' and ''m,./''. The lists are also sorted inconsistently: the French list is in straight alphabetical order, while the German list gives the traditionally capitalized words in alphabetical order, followed by the traditionally lowercased words in alphabetical order. The list of Italian words, by contrast, contains no capitalized words whatsoever.
== Language ==
'''Moby Language II''' contains wordlists of five languages – [[French language|French]], [[German language|German]], [[Italian language|Italian]], [[Japanese language|Japanese]], and [[Spanish language|Spanish]]:


The lists making up the Moby Project are as follows:
{| class="wikitable"
|-
! Language
! Words
! Size (in [[byte]]s)
|-
! French
|align="right"| 138,257
|align="right"| 1,524,757
|-
! German
|align="right"| 159,809
|align="right"| 2,055,986
|-
! Italian
|align="right"| 60,453
|align="right"| 561,981
|-
! Japanese
|align="right"| 115,523
|align="right"| 934,783
|-
! Spanish
|align="right"| 86,059
|align="right"| 850,523
|-
! Total
|align="right"| 560,101
! 5,928,030
|}


'''Moby Hyphenator II''' -- list of word hyphenations.
However, some of the lists are contaminated: for example, the Japanese list contains English words such as ''abnormal'' and non-words such as ''abcdefgh'' and ''m,./''. The lists are also sorted inconsistently: the French list is in straight alphabetical order, while the German list gives the traditionally capitalized words in alphabetical order, followed by the traditionally lower-cased words in alphabetical order. The list of Italian words, by contrast, contains no capitalized words whatsoever.


'''Moby Language II''' -- word lists of five languages – [[French language|French]], [[German language|German]], [[Italian language|Italian]], [[Japanese language|Japanese]], and [[Spanish language|Spanish]]:
The foreign-language lists do not use accented characters, so the French word "être" ("to be") has to be looked up as "e^tre".
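The "e^tre" convention can be reversed with a small Python sketch. It relies on one assumption drawn from that single example: a caret follows the letter it modifies and stands for a circumflex. How the lists encode other accents is not stated here, so the sketch deliberately leaves everything else untouched; the function name is hypothetical.

```python
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"


def restore_circumflex(ascii_word: str) -> str:
    """Turn a caret that follows a letter into a circumflex on that letter.

    Assumption: only the caret convention seen in "e^tre" is handled.
    """
    combined = ascii_word.replace("^", COMBINING_CIRCUMFLEX)
    # NFC composes the letter and the combining mark into one character.
    return unicodedata.normalize("NFC", combined)
```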


== Part-of-Speech ==
'''Moby Part-of-Speech''' -- list of words with their parts of speech.
'''Moby Part-of-Speech''' contains 233,356 words fully described by [[Lexical category|part(s) of speech]], listed in priority order. The format of the file is ''word\parts-of-speech'', with the following parts of speech being identified:


'''Moby Pronunciator II''' -- list of words with their pronunciations using Moby's pronunciation mark-up. Example:
{| class="wikitable"
|-
! Part-of-speech
! Code
|-
| [[Noun]]
| N
|-
| [[Plural]]
| p
|-
| [[Noun phrase]]
| h
|-
| [[Verb]] (usually [[participle]])
| V
|-
| [[Transitive verb]]
| t
|-
| [[Intransitive verb]]
| i
|-
| [[Adjective]]
| A
|-
| [[Adverb]]
| v
|-
| [[Grammatical conjunction|Conjunction]]
| C
|-
| [[Preposition]]
| P
|-
| [[Interjection]]
| !
|-
| [[Pronoun]]
| r
|-
| [[Article (grammar)|Definite article]]
| D
|-
| [[Article (grammar)|Indefinite article]]
| I
|-
| [[Nominative]]
| o
|}
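The ''word\parts-of-speech'' format and the code table above can be read with a few lines of Python. This is a minimal sketch: the backslash delimiter is taken from the format description and is exposed as a parameter in case a given copy of the file uses a different separator, and the function name and the sample entries in the usage note are hypothetical.

```python
# Codes copied from the table above; labels are shortened link texts.
POS_CODES = {
    "N": "Noun",
    "p": "Plural",
    "h": "Noun phrase",
    "V": "Verb (usually participle)",
    "t": "Transitive verb",
    "i": "Intransitive verb",
    "A": "Adjective",
    "v": "Adverb",
    "C": "Conjunction",
    "P": "Preposition",
    "!": "Interjection",
    "r": "Pronoun",
    "D": "Definite article",
    "I": "Indefinite article",
    "o": "Nominative",
}


def parse_pos_line(line: str, delimiter: str = "\\") -> tuple[str, list[str]]:
    """Split one entry into the word and its parts of speech, in file order."""
    word, _, codes = line.rstrip("\r\n").partition(delimiter)
    # Each code is a single character; unknown codes are reported, not dropped.
    return word, [POS_CODES.get(code, f"unknown ({code})") for code in codes]
```

Because the codes are listed in priority order, the first element of the returned list is the highest-priority part of speech. For a made-up entry such as `close\VtA`, the sketch returns the word ''close'' with "Verb (usually participle)", "Transitive verb" and "Adjective".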


'''Moby Shakespeare''' contains the complete unabridged works of [[Shakespeare]]. This file is not available from Project Gutenberg.
== Pronunciator ==
The '''Moby Pronunciator II''' contains 177,267 words with corresponding pronunciations. The Project Gutenberg distribution also contains a copy of the [[cmudict]] v0.3. The file follows the format ''word[/part-of-speech] pronunciation''. The part-of-speech field is used to disambiguate the 770 words whose pronunciation differs depending on their part of speech. For example, for the word spelled ''close'', the verb is pronounced {{IPAc-en|ˈ|k|l|oʊ|z}}, whereas the adjective is pronounced {{IPAc-en|ˈ|k|l|oʊ|s}}. The parts of speech have been assigned the following codes:


'''Moby Thesaurus II'''
{| class="wikitable"
|-
! Part-of-speech
! Code
|-
| [[Noun]]
| n
|-
| [[Verb]]
| v
|-
| [[Adjective]]
| aj
|-
| [[Adverb]]
| av
|-
| [[Interjection]]
| interj
|}
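An entry in the ''word[/part-of-speech] pronunciation'' format can be taken apart as in the following Python sketch. It assumes that a single space separates the head from the pronunciation and that the word itself contains no space or slash; the function name and the sample lines used to exercise it are hypothetical.

```python
from typing import Optional

# Codes copied from the table above.
PRON_POS_CODES = {
    "n": "Noun",
    "v": "Verb",
    "aj": "Adjective",
    "av": "Adverb",
    "interj": "Interjection",
}


def parse_pron_entry(line: str) -> tuple[str, Optional[str], str]:
    """Return (word, part of speech or None, raw pronunciation string)."""
    head, _, pronunciation = line.strip().partition(" ")
    word, _, pos_code = head.partition("/")
    # Entries without a "/code" suffix carry no part-of-speech information.
    pos = PRON_POS_CODES.get(pos_code, pos_code) if pos_code else None
    return word, pos, pronunciation
```

The pronunciation is returned as an opaque string here, so the sketch stays independent of how the phoneme symbols are interpreted.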


'''Words'''
Following this is the pronunciation. Several special symbols are present:


The Words distribution consists of the following files:
{| class="wikitable"
|-
! Symbol
! Meaning
|-
| /
| Used to separate [[phoneme]]s
|-
| _
| Used to separate words
|-
| '
| [[Primary stress]] on the following syllable
|-
| ,
| [[Secondary stress]] on the following syllable
|}


Acronyms
The rest of the symbols are used to represent [[IPA]] characters, according to the following table:


Common words - words present in two or more published dictionaries. The list starts with nearly 300 suffixes.
{| class="IPA wikitable"
|-
! Symbol
! [[Help:IPA for English|IPA]]
|-
| &
| æ
|-
| -
| ə
|-
| @
| ʌ, ə
|-
| @r
| ɜr, ər
|-
| A
| ɑː
|-
| aI
| aɪ
|-
| Ar
| ɑr
|-
| AU
| aʊ
|-
| b
| b
|-
| d
| d
|-
| D
| ð
|-
| dZ
| dʒ
|-
| E
| ɛ
|-
| eI
| eɪ
|-
| f
| f
|-
| g
| ɡ
|-
| h
| h
|-
| hw
| hw
|-
| i
| iː
|-
| I
| ɪ
|-
| j
| j
|-
| k
| k
|-
| l
| l
|-
| m
| m
|-
| n
| n
|-
| N
| ŋ
|-
| O
| ɔː
|-
| Oi
| ɔɪ
|-
| oU
| oʊ
|-
| p
| p
|-
| r
| r
|-
| s
| s
|-
| S
| ʃ
|-
| t
| t
|-
| T
| θ
|-
| tS
| tʃ
|-
| u
| uː
|-
| U
| ʊ
|-
| v
| v
|-
| w
| w
|-
| z
| z
|-
| Z
| ʒ
|}
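The two symbol tables above are enough to sketch a converter from Moby's pronunciation mark-up to IPA in Python. Several choices here are assumptions rather than documented behaviour: where the table lists two IPA values ("@" and "@r"), the sketch simply takes the first; a stress mark is assumed to be attached to the front of the phoneme that follows it; and unknown symbols are passed through unchanged. The function name is hypothetical.

```python
STRESS = {"'": "ˈ", ",": "ˌ"}  # primary and secondary stress marks

# Symbol table copied from above; "@" and "@r" use the first listed value.
IPA = {
    "&": "æ", "-": "ə", "@": "ʌ", "@r": "ɜr", "A": "ɑː", "aI": "aɪ",
    "Ar": "ɑr", "AU": "aʊ", "b": "b", "d": "d", "D": "ð", "dZ": "dʒ",
    "E": "ɛ", "eI": "eɪ", "f": "f", "g": "ɡ", "h": "h", "hw": "hw",
    "i": "iː", "I": "ɪ", "j": "j", "k": "k", "l": "l", "m": "m",
    "n": "n", "N": "ŋ", "O": "ɔː", "Oi": "ɔɪ", "oU": "oʊ", "p": "p",
    "r": "r", "s": "s", "S": "ʃ", "t": "t", "T": "θ", "tS": "tʃ",
    "u": "uː", "U": "ʊ", "v": "v", "w": "w", "z": "z", "Z": "ʒ",
}


def to_ipa(pronunciation: str) -> str:
    """Convert one pronunciation field to an IPA string."""
    words = []
    for word in pronunciation.split("_"):      # "_" separates words
        pieces = []
        for token in word.split("/"):          # "/" separates phonemes
            stress = ""
            if token and token[0] in STRESS:   # leading stress mark, if any
                stress = STRESS[token[0]]
                token = token[1:]
            pieces.append(stress + IPA.get(token, token))
        words.append("".join(pieces))
    return " ".join(words)
```

Applied to a mark-up string built from the tables for the verb ''close'', `'k/l/oU/z`, the sketch yields ˈkloʊz, which matches the IPA given for that word earlier in this section.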


Compound words - phrases, [[proper noun]]s, and [[acronym]]s not included in the common words file. This file starts with over 540 suffixes, and it is hard to see how they fit this category.
== Shakespeare ==
'''Moby Shakespeare''' contains the complete unabridged works of [[Shakespeare]]. This specific resource is not available from Project Gutenberg.


Words included in the first edition of the [[Official Scrabble Players Dictionary]]
== Thesaurus ==
The '''Moby Thesaurus II''' contains 30,260 root words, with 2,520,264 [[synonym]]s and related terms – an average of 83.3 per root word. Each line consists of a list of [[comma-separated values]], with the first term being the root word, and all following words being related terms.


Additions to the Official Scrabble Players Dictionary in the second edition
[[Grady Ward]] placed this thesaurus in the [[public domain]] in 1996. It is also available as a [[Debian]] package.
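Since each thesaurus line is described as a list of comma-separated values with the root word first, a line can be split as in this Python sketch. It assumes that no term contains a comma or quoting; the function name and the sample line in the usage note are hypothetical. The last lines re-derive the quoted average from the two counts given above.

```python
def parse_thesaurus_line(line: str) -> tuple[str, list[str]]:
    """Return the root word and the list of related terms from one line."""
    terms = [term.strip() for term in line.rstrip("\r\n").split(",")]
    root, *related = terms
    return root, related


# Figures quoted in the text above.
ROOT_WORDS = 30_260
RELATED_TERMS = 2_520_264
AVERAGE_PER_ROOT = RELATED_TERMS / ROOT_WORDS  # rounds to 83.3
```

For a made-up line such as `close,near,shut,end`, the sketch returns ''close'' as the root word and the remaining three terms as its related terms.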


Most frequently occurring words in the [[English language]]
== Words ==

The distribution consists of the following 16 files:
Most frequently occurring words on [[Usenet]] in 1992

Most frequently occurring [[substring]]s in the [[King James Version of the Bible]]

Most common [[name]]s used in the United States and [[Great Britain]]

Common English [[female]] names

Common English [[male]] names

Most commonly misspelled English words. While such lists usually pair each misspelling with the correct spelling, this list does not contain any actual misspellings.

Place names in the United States

Single words excluding proper nouns, acronyms, compound words and phrases, but including [[Archaism|archaic]] words and significant [[variant spellings]]

[[United States Constitution]] including all amendments current to 1993


{| class="wikitable"
|-
! Filename
! Words
! Description
|-
| ACRONYMS.TXT
| 6,213
| Common [[acronym]]s and [[abbreviation]]s
|-
| COMMON.TXT
| 74,550
| Common words present in two or more published dictionaries
|-
| COMPOUND.TXT
| 256,772
| Phrases, [[proper noun]]s, and [[acronym]]s not included in the common words file
|-
| CROSSWD.TXT
| 113,809
| Words included in the first edition of the [[Official Scrabble Players Dictionary]]
|-
| CRSWD-D.TXT
| 4,160
| Additions to the Official Scrabble Players Dictionary in the second edition
|-
| FICTION.TXT
| 467
| A list of the most commonly occurring [[substring]]s in the book ''[[The Joy Luck Club (novel)|The Joy Luck Club]]''
|-
| FREQ.TXT
| 1,000
| Most frequently occurring words in the [[English language]], listed in descending order
|-
| FREQ-INT.TXT
| 1,000
| Most frequently occurring words on [[Usenet]] in 1992, listed with corresponding percentage in decreasing order
|-
| KJVFREQ.TXT
| 1,185
| Most frequently occurring [[substring]]s in the [[King James Version of the Bible]], listed in descending order
|-
| NAMES.TXT
| 21,986
| Most common [[name]]s used in the United States and [[Great Britain]]
|-
| NAMES-F.TXT
| 4,946
| Common English [[female]] names
|-
| NAMES-M.TXT
| 3,897
| Common English [[male]] names
|-
| OFTENMIS.TXT
| 366
| Most commonly misspelled English words
|-
| PLACES.TXT
| 10,196
| Place names in the United States
|-
| SINGLE.TXT
| 354,984
| Single words excluding proper nouns, acronyms, compound words and phrases, but including [[Archaism|archaic]] words and significant [[variant spellings]]
|-
| USACONST.TXT
| 7,618
| [[United States Constitution]] including all amendments current to 1993
|-
! Total
! 863,149
!
|}
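The total row can be checked against the per-file counts with a few lines of Python. The numbers are copied from the table above; the variable names are hypothetical.

```python
# Word counts copied from the table above, keyed by filename.
WORD_COUNTS = {
    "ACRONYMS.TXT": 6_213,
    "COMMON.TXT": 74_550,
    "COMPOUND.TXT": 256_772,
    "CROSSWD.TXT": 113_809,
    "CRSWD-D.TXT": 4_160,
    "FICTION.TXT": 467,
    "FREQ.TXT": 1_000,
    "FREQ-INT.TXT": 1_000,
    "KJVFREQ.TXT": 1_185,
    "NAMES.TXT": 21_986,
    "NAMES-F.TXT": 4_946,
    "NAMES-M.TXT": 3_897,
    "OFTENMIS.TXT": 366,
    "PLACES.TXT": 10_196,
    "SINGLE.TXT": 354_984,
    "USACONST.TXT": 7_618,
}

FILE_COUNT = len(WORD_COUNTS)            # the text states 16 files
TOTAL_WORDS = sum(WORD_COUNTS.values())  # the table states 863,149
```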


== References ==
== References ==
Line 373: Line 67:
*[http://icon.shef.ac.uk/Moby/ Moby Project homepage]
*[http://icon.shef.ac.uk/Moby/ Moby Project homepage]
*[http://www.gutenberg.org/catalog/world/results?title=moby+list Project Gutenberg downloads]
*[http://www.gutenberg.org/catalog/world/results?title=moby+list Project Gutenberg downloads]
*''[http://www.foo.be/docs/tpj/issues/vol4_4/tpj0404-0003.html Searching for Rhymes with Perl]''; [http://interglacial.com/~sburke/mpron/ corresponding code]
*''[http://wixml.net/moby.html Conversion to relational database]'' (Dead link)


[[Category:Public domain databases]]
[[Category:Public domain databases]]

Revision as of 18:21, 2 September 2016

The Moby Project is a collection of several types of word lists. The files are in the public domain and can be downloaded from Project Gutenberg.

While some of the files contain many entries, a significant percentage of those entries are not words at all but numbers, made-up words, misspellings, proper names, and phrases made up of words already in the lists. In many cases, dozens or even hundreds of "words" are simply other words with common prefixes such as "un-" or "non-" added, and many of these would not be found in dictionaries. Another common problem is the inclusion of many non-English words and phrases that are not commonly seen in English writing and thus do not belong in lists of English words. So while the lists contain some useful data, a lot of clean-up work would be needed to make them actually useful.

For example, the Hyphenator is said to contain over 187,000 hyphenated words, but in addition to the problems listed above, nearly 10,000 of the entries, such as 'through' and 'avoir', are listed without any hyphenation.

The list of Japanese words contains English words such as abnormal and non-words such as abcdefgh and m,./. The lists are also sorted inconsistently: the French list is in straight alphabetical order, while the German list gives the traditionally capitalized words in alphabetical order, followed by the traditionally lowercased words in alphabetical order. The list of Italian words, by contrast, contains no capitalized words whatsoever.

The lists making up the Moby Project are as follows:

Moby Hyphenator II -- list of word hyphenations.

Moby Language II -- word lists of five languages – French, German, Italian, Japanese, and Spanish:

Moby Part-of-Speech -- list of words with their parts of speech.

Moby Pronunciator II -- list of words with their pronunciations using Moby's pronunciation mark-up. Example:

Moby Shakespeare contains the complete unabridged works of Shakespeare. This file is not available from Project Gutenberg.

Moby Thesaurus II

Words

The Words distribution consists of the following files:

Acronyms

Common words - words present in two or more published dictionaries. The list starts with nearly 300 suffixes.

Compound words - phrases, proper nouns, and acronyms not included in the common words file. This file starts with over 540 suffixes, and it is hard to see how they fit this category.

Words included in the first edition of the Official Scrabble Players Dictionary

Additions to the Official Scrabble Players Dictionary in the second edition

Most frequently occurring words in the English language

Most frequently occurring words on Usenet in 1992

Most frequently occurring substrings in the King James Version of the Bible

Most common names used in the United States and Great Britain

Common English female names

Common English male names

Most commonly misspelled English words. While such lists usually pair each misspelling with the correct spelling, this list does not contain any actual misspellings.

Place names in the United States

Single words excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings

United States Constitution including all amendments current to 1993


References

External links