Text normalization

From Wikipedia, the free encyclopedia

Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.[1]

Applications

Text normalization is frequently used when converting text to speech. Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context.[2] For example:

  • "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.[3]
  • "vi" could be pronounced as "vie," "vee," or "the sixth" depending on the surrounding words.[4]
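The first case above can be sketched in code. The following is a minimal illustration, not a production text-to-speech front end: the function names `number_to_words` and `expand_dollars` are hypothetical, only English is handled, and only whole-dollar amounts from 0 to 999 are expanded. Context-dependent tokens such as "vi" need information about the surrounding words and are outside the scope of this sketch.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


# "$" followed by one to three digits. The lookaheads reject longer or
# decimal amounts ("$2000", "$2,000", "$1.50") so they are left untouched
# rather than being expanded incorrectly.
_DOLLARS = re.compile(r"\$(\d{1,3})(?!\d)(?![.,]\d)")


def expand_dollars(text: str) -> str:
    """Replace tokens such as "$200" with their spoken English form."""
    def spoken(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    return _DOLLARS.sub(spoken, text)


print(expand_dollars("The ticket costs $200."))
```

A system that supports several languages would need a separate expansion routine for each one, since both the number words and the position of the currency name differ between languages.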

Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single case. To prepare text for searching, it might also be stemmed (e.g. converting "flew" and "flying" both into "fly"), canonicalized (e.g. consistently using American or British English spelling), or have stop words removed.
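As a rough sketch of the search case, the following uses Python's standard `unicodedata` module to remove diacritical marks and `str.casefold` to convert text to a single case. The helper names `strip_diacritics` and `search_key` and the tiny stop-word list are assumptions made for illustration; stemming and spelling canonicalization are left out because they require language-specific resources.

```python
import unicodedata

# Illustrative stop-word list; real systems use much larger, curated lists.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the"})


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. "résumé" becomes "resume"."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def search_key(text: str) -> list[str]:
    """Return the normalized tokens that would be stored in a search index."""
    folded = strip_diacritics(text).casefold()
    return [token for token in folded.split() if token not in STOP_WORDS]


print(search_key("The Résumé of John"))
```

Applying the same `search_key` function to both the stored text and the query is what makes "resume" match "résumé" and "john" match "John".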

Techniques

For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions would suffice. For example, the sed command sed -E "s/\s+/ /g" inputfile would normalize runs of whitespace characters into a single space; the -E option selects extended regular expressions, without which + would be treated as a literal character (\s is not part of POSIX, and [[:space:]] is the portable equivalent). More complex normalization requires correspondingly complicated algorithms, including domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text[5] and as a special case of machine translation.[6][7]
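The regular-expression approach can also be sketched in Python with the standard `re` module. The function names below are hypothetical. Unlike sed, which processes its input line by line, `collapse_whitespace` also folds newlines into spaces and trims the ends of the string.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")
# Anything that is neither a word character nor whitespace. Note that \w
# also matches the underscore, so underscores are kept by this pattern.
_NON_ALNUM = re.compile(r"[^\w\s]")


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


def strip_non_alphanumeric(text: str) -> str:
    """Delete punctuation and symbols, keeping letters, digits and spaces."""
    return _NON_ALNUM.sub("", text)


print(collapse_whitespace("a \t\n b    c"))
print(strip_non_alphanumeric("Hello, world!"))
```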

Textual scholarship

In the field of textual scholarship and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of scribal abbreviations and the transliteration of the archaic glyphs typically found in manuscript and early printed sources. A normalized edition is therefore distinguished from a diplomatic edition (or semi-diplomatic edition), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.[8]
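The editorial choices described above can be pictured as switches on a transcription routine. The sketch below is purely illustrative: the two small tables and the function name `normalize_transcription` are assumptions, not an established editorial standard, and a real edition would rely on much fuller tables and on the editor's judgment. Abbreviations are matched only as whole space-separated words, so forms with attached punctuation are not expanded.

```python
# Hypothetical table of archaic glyphs and their modern transliterations.
GLYPHS = str.maketrans({
    "ſ": "s",    # long s
    "þ": "th",   # thorn
    "Þ": "Th",
    "ƿ": "w",    # wynn
})

# Hypothetical table of scribal abbreviations, matched as whole words.
ABBREVIATIONS = {
    "yᵉ": "the",
    "yᵗ": "that",
    "wᵗʰ": "with",
}


def normalize_transcription(text: str,
                            expand_abbreviations: bool = True,
                            modernize_glyphs: bool = True) -> str:
    """Apply the editor's chosen degree of normalization to a transcription."""
    if expand_abbreviations:
        words = [ABBREVIATIONS.get(word, word) for word in text.split(" ")]
        text = " ".join(words)
    if modernize_glyphs:
        text = text.translate(GLYPHS)
    return text


print(normalize_transcription("yᵉ bleſſed þing"))
```

Calling the function with both options set to `False` returns the text unchanged, which corresponds to the diplomatic end of the spectrum.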

References

  1. ^ Richard Sproat and Steven Bedrick (September 2011). "CS506/606: Txt Nrmlztn". Retrieved October 2, 2012.
  2. ^ Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." Computer Speech and Language 15; 287–333. doi:10.1006/csla.2001.0169.
  3. ^ "Samoan Numbers". MyLanguages.org. Retrieved October 2, 2012.
  4. ^ "Text-to-Speech Engines Text Normalization". MSDN. Retrieved October 2, 2012.
  5. ^ Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics; 688–695. CiteSeerX 10.1.1.72.8138.
  6. ^ Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). "Text Normalization as a Special Case of Machine Translation." Proceedings of the International Multiconference on Computer Science and Information Technology 1; 51–56.
  7. ^ Mosquera, A.; Lloret, E.; Moreda, P. (2012). "Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation." Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA); 9–14.
  8. ^ Harvey, P. D. A. (2001). Editing Historical Records. London: British Library. pp. 40–46. ISBN 0-7123-4684-8.
This page was last edited on 8 December 2023, at 04:25
Basis of this page is in Wikipedia. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with Wikimedia Foundation.