Moby Project

The Moby Project is a collection of lexical resources created by Grady Ward. The resources were dedicated to the public domain and are now mirrored at Project Gutenberg. As of 2007, it contains the largest free phonetic database, with 177,267 words and corresponding pronunciations.^[1]

Hyphenator

The Moby Hyphenator II contains hyphenations of 187,175 words and phrases (including 9,752 entries where no hyphenations are given, such as through and avoir). The character encoding appears to be MacRoman, and hyphenation is indicated by a bullet (⟨•⟩, character value 165 decimal, or A5 hexadecimal). Some entries, however, have a combination of actual hyphens and character 165, such as "bar•ber-sur•geon".

There is little to no documentation of the hyphenation choices made; the following examples might give some flavour of the style of hyphenation used: at•mos•phere; at•tend•ant; ca•pac•i•ty; un•col•or•a•ble.
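
The encoding described above can be illustrated with a minimal Python sketch. It assumes, as the text suggests, that entries are MacRoman-encoded and that byte 0xA5 marks each hyphenation point; the function name `split_hyphenation` is hypothetical and not part of the Moby distribution.

```python
def split_hyphenation(raw: bytes) -> list[str]:
    """Split one Hyphenator entry into its hyphenation segments.

    Assumes MacRoman encoding, in which byte 0xA5 decodes to the
    bullet character U+2022 used as the hyphenation marker.
    """
    text = raw.decode("mac_roman").rstrip("\r\n")
    # Real hyphens ("-") are left untouched inside the segments.
    return text.split("\u2022")


if __name__ == "__main__":
    print(split_hyphenation(b"at\xa5mos\xa5phere"))
    print(split_hyphenation(b"bar\xa5ber-sur\xa5geon"))
```

Entries with no hyphenation points, such as through, come back as a single-element list.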

Languages

Moby Language II contains wordlists of five languages: French, German, Italian, Japanese, and Spanish. Their statistics are:

Language	Words	Size (in bytes)
French	138,257	1,524,757
German	159,809	2,055,986
Italian	60,453	561,981
Japanese	115,523	934,783
Spanish	86,059	850,523
Total	560,101	5,928,030

However, some of the lists are contaminated: for example, the Japanese list contains English words such as abnormal and non-words such as abcdefgh and m,./. The lists are also sorted inconsistently: the French list is in straight alphabetical order, while the German list gives traditionally capitalized words in alphabetical order, followed by traditionally lower-cased words in alphabetical order. The Italian list, however, contains no capitalized words whatsoever.

The lists do not use accented characters, so "e^tre" is how a user would look up the French word être ("to be").
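
As a rough illustration of this convention, the sketch below restores a circumflex from the ASCII form. It is based only on the single example above and assumes that the caret follows the letter it modifies; how other diacritics are written in the lists is not documented here, so they are not handled. The function name `restore_circumflex` is hypothetical.

```python
import re
import unicodedata


def restore_circumflex(word: str) -> str:
    """Turn 'e^tre' into 'être' (assumption: caret follows its letter)."""
    # Replace "<letter>^" with the letter plus a combining circumflex.
    combined = re.sub(r"([A-Za-z])\^", lambda m: m.group(1) + "\u0302", word)
    # Compose letter + combining mark into a single accented character.
    return unicodedata.normalize("NFC", combined)


if __name__ == "__main__":
    print(restore_circumflex("e^tre"))
```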

Part-of-Speech

Moby Part-of-Speech contains 233,356 words fully described by part(s) of speech, listed in priority order. The format of the file is word\parts-of-speech, with the following parts of speech being identified:

Part-of-speech	Code
Noun	N
Plural	p
Noun phrase	h
Verb (usually participle)	V
Transitive verb	t
Intransitive verb	i
Adjective	A
Adverb	v
Conjunction	C
Preposition	P
Interjection	!
Pronoun	r
Definite article	D
Indefinite article	I
Nominative	o
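
A line in this format could be decoded along the following lines. This is a minimal sketch: the backslash delimiter is taken from the description above and exposed as a parameter in case a given copy of the file uses a different character, and the sample line is invented for illustration rather than copied from the file.

```python
# Code-to-label mapping taken from the table above.
POS_CODES = {
    "N": "Noun", "p": "Plural", "h": "Noun phrase",
    "V": "Verb (usually participle)", "t": "Transitive verb",
    "i": "Intransitive verb", "A": "Adjective", "v": "Adverb",
    "C": "Conjunction", "P": "Preposition", "!": "Interjection",
    "r": "Pronoun", "D": "Definite article", "I": "Indefinite article",
    "o": "Nominative",
}


def parse_pos_line(line: str, delimiter: str = "\\") -> tuple[str, list[str]]:
    """Split 'word<delimiter>codes' and expand each one-letter code.

    Codes stay in file order, which the documentation describes as
    priority order. Unrecognised codes are reported as 'unknown'.
    """
    word, _, codes = line.rstrip("\r\n").partition(delimiter)
    return word, [POS_CODES.get(code, "unknown") for code in codes]


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the actual file.
    print(parse_pos_line("sample\\NtA"))
```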

Pronunciator

The Moby Pronunciator II contains 177,267 entries with corresponding pronunciations. Most of the entries describe a single word, but approximately 79,000^[2] contain hyphenated or multiple-word phrases, names, or lexemes. The Project Gutenberg distribution also contains a copy of the cmudict v0.3. The file contains lines of the format word[/part-of-speech] pronunciation. Each line ends with the ASCII carriage return character (CR, '\r', 0x0D, 13 in decimal).

The word field can include apostrophes (e.g. isn't), hyphens (e.g. able-bodied), and multiple words separated by underscores (e.g. monkey_wrench). Non-English words are generally rendered, as stated in the documentation, without accents or other diacritical marks. However, in 36 entries (e.g. São_Miguel), some non-ASCII accented characters remain, represented using Mac OS Roman encoding.

The part-of-speech field is used to disambiguate 770 of the words which have differing pronunciations depending on their part-of-speech. For example, for the words spelled close, the verb has the pronunciation /ˈkloʊz/, whereas the adjective is /ˈkloʊs/. The parts-of-speech have been assigned the following codes:

Part-of-speech	Code
Noun	n
Verb	v
Adjective	aj
Adverb	av
Interjection	interj

Following this is the pronunciation. Several special symbols are present:

Symbol	Meaning
_	Used to separate words
'	Primary stress on the following syllable
,	Secondary stress on the following syllable

The rest of the symbols are used to represent IPA characters. The pronunciations are generally consistent with a General American dialect of English which exhibits the father-bother merger, the hurry-furry merger and the lot-cloth split, but does not exhibit the cot-caught merger or the wine-whine merger. Each phoneme is represented by a sequence of one or more characters. Some of the sequences are delimited with a slash character "/", as shown in the following table, but note that the sequence for /ɔɪ/ is delimited by two slash characters at either end:

Symbol	IPA
/&/	æ
/-/	ə
/@/	ʌ, ə
/[@]/r	ɜr, ər
/A/	ɑ, ɑː
/aI/	aɪ
/AU/	aʊ
b	b
d	d
/D/	ð
/dZ/	dʒ
/E/	ɛ
/eI/	eɪ
f	f
g	ɡ
h	h
hw	hw
/i/	iː
/I/	ɪ
/j/	j
/ju/	juː
k	k
l	l
m	m
n	n
/N/	ŋ
/O/	ɔ, ɔː
//Oi//	ɔɪ
/oU/	oʊ
p	p
r	r
s	s
/S/	ʃ
t	t
/T/	θ
/tS/	tʃ
/u/	uː
/U/	ʊ
v	v
w	w
z	z
/Z/	ʒ

To this collection are added a number of extra sequences representing phonemes found in several other languages. These are used to encode the non-English words, phrases and names that are included in the database. The following table contains these extra phonemes, but note that the extent to which some of these may exist due to encoding errors is not clear.

Symbol	IPA
A	a
e	e, ɛ
i	i, ɪ
N	Nasalisation of preceding vowel
o	o
O	[intent not clear]
R	ʁ
S	s
u	u
V	v, β, ʋ
W	w
/x/	x
/y/	ø
Y	y
/z/	ts
Z	z

Shakespeare

Moby Shakespeare contains the complete unabridged works of Shakespeare. This specific resource is not available from Project Gutenberg, but it is available in a 1993 version on the web.^[3]

Thesaurus

The Moby Thesaurus II contains 30,260 root words, with 2,520,264 synonyms and related terms – an average of 83.3 per root word. Each line consists of a list of comma-separated values, with the first term being the root word, and all following words being related terms.

Grady Ward placed this thesaurus in the public domain in 1996. It is also available as a Debian package, although the package has been discontinued starting with Bullseye.^[4]
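
The line format of the thesaurus file, and the average quoted above, can be checked with a short sketch. The sample line is invented for illustration, and `parse_thesaurus_line` is a hypothetical name.

```python
def parse_thesaurus_line(line: str) -> tuple[str, list[str]]:
    """Return (root word, related terms) from one comma-separated line."""
    root, *related = line.rstrip("\r\n").split(",")
    return root, related


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the actual file.
    print(parse_thesaurus_line("big,large,huge,great"))

    # Average number of related terms per root word, from the figures above.
    print(round(2_520_264 / 30_260, 1))
```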

Words

Moby Words II is the largest wordlist in the world.^[1] [additional citation(s) needed] The distribution consists of the following 16 files:

Filename	Words	Description
ACRONYMS.TXT	6,213	Common acronyms and abbreviations
COMMON.TXT	74,550	Common words present in two or more published dictionaries
COMPOUND.TXT	256,772	Phrases, proper nouns, and acronyms not included in the common words file
CROSSWD.TXT	113,809	Words included in the first edition of the Official Scrabble Players Dictionary
CRSWD-D.TXT	4,160	Additions to the Official Scrabble Players Dictionary in the second edition
FICTION.TXT	467	A list of the most commonly occurring substrings in the book The Joy Luck Club
FREQ.TXT	1,000	Most frequently occurring words in the English language, listed in descending order
FREQ-INT.TXT	1,000	Most frequently occurring words on Usenet in 1992, listed with corresponding percentage in decreasing order
KJVFREQ.TXT	1,185	Most frequently occurring substrings in the King James Version of the Bible, listed in descending order
NAMES.TXT	21,986	Most common names used in the United States and Great Britain
NAMES-F.TXT	4,946	Common English female names
NAMES-M.TXT	3,897	Common English male names
OFTENMIS.TXT	366	Most common misspelled English words
PLACES.TXT	10,196	Place names in the United States
SINGLE.TXT	354,984	Single words excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings
USACONST.TXT	7,618	United States Constitution including all amendments current to 1993
Total	863,149	Not the total of unique words.
Total Uniq	639,995	Total of single, proper nouns, acronyms, and compound words and phrases (all of the files that contain unique words).

References

1. "ACL SIGLEX Resource Links". Special Interest Group on the Lexicon of the Association for Computational Linguistics. August 13, 2004. Archived from the original on December 15, 2018. Retrieved May 9, 2022. "Moby Words: 610,000+ words and phrases. The largest word list in the world"
2. Obtained by running the UNIX command grep '.*[-_].* .*' mobypron.unc | wc -l after converting the line endings and correcting some encoding errors.
3. mobyshak.txt 1993 version
4. Tosi, Sandro (July 13, 2020). "RM: dict-moby-thesaurus -- RoQA; dead upstream (10+ years); python2-only; no extrenal [sic] deps; extremely low popcon". Debian Bug report logs. Retrieved May 10, 2022.

External links

Moby Project homepage, University of Sheffield (Wayback Machine copy of the page as it was on 30 September 2017; "Last modified: October 24, 2000"); working download site.
Project Gutenberg downloads
Searching for Rhymes with Perl; corresponding code
Wiktionary:Appendix:Moby Thesaurus II

This page was last edited on 24 June 2024, at 23:42

From Wikipedia, the free encyclopedia