Sentence boundary disambiguation

From Wikipedia, the free encyclopedia

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in The Wall Street Journal corpus denote abbreviations.[1] Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang.

Some languages, including Japanese and Chinese, have unambiguous sentence-ending markers.
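As a rough illustration of the ambiguity described above, the following sketch treats every occurrence of a given mark as a sentence end. The function name `split_on_mark` and both sample texts are made up for this example. On the English sample, the period after "Dr", the decimal point, and the dots in the e-mail address are all wrongly treated as sentence ends, while the Japanese sample splits cleanly on the full stop "。".

```python
def split_on_mark(text: str, mark: str) -> list[str]:
    """Treat every occurrence of `mark` as the end of a sentence."""
    pieces = [piece.strip() for piece in text.split(mark)]
    return [piece + mark for piece in pieces if piece]

# Hypothetical sample texts: each contains exactly two sentences.
english = "Dr. Smith paid $3.50 for coffee. He wrote to j.doe@example.com afterwards."
japanese = "今日は晴れです。明日は雨です。"

english_pieces = split_on_mark(english, ".")
japanese_pieces = split_on_mark(japanese, "。")

print(len(english_pieces), english_pieces)    # 6 pieces for 2 sentences
print(len(japanese_pieces), japanese_pieces)  # 2 pieces for 2 sentences
```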

Strategies

The standard 'vanilla' approach to locating the end of a sentence:

(a) If it is a period, it ends a sentence.
(b) If the preceding token is in the hand-compiled list of abbreviations, then it does not end a sentence.
(c) If the next token is capitalized, then it ends a sentence.

This strategy gets about 95% of sentences correct.[2] The remaining 5% often involve abbreviated names such as "D. H. Lawrence" (with whitespace between the initials), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, such as the entertainment product title ".hack//SIGN"), and non-standard punctuation or non-standard use of punctuation.
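Rules (a) to (c) above can be sketched in a few lines of Python. This is one possible reading, not a reference implementation: it assumes the rules are applied in order, so that a later rule overrides an earlier one, and it assumes whitespace tokenization. The abbreviation list and the sample text are hypothetical.

```python
# Hypothetical hand-compiled abbreviation list (stored without the final period).
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "etc", "e.g", "i.e"}

def is_boundary(token, next_token):
    """Apply rules (a)-(c) in order; a later rule overrides an earlier one."""
    if not token.endswith("."):
        return False
    boundary = True                                  # (a) a period ends a sentence
    if token.rstrip(".") in ABBREVIATIONS:           # (b) unless it follows an abbreviation
        boundary = False
    if next_token is not None and next_token[:1].isupper():
        boundary = True                              # (c) a capitalized next token ends it
    return boundary

def vanilla_split(text):
    """Split whitespace-tokenized text at every token judged to be a boundary."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

sample = ("The meeting covers budgets, hiring, etc. and runs long. "
          "D. H. Lawrence wrote novels.")
vanilla_result = vanilla_split(sample)
print(vanilla_result)
# ['The meeting covers budgets, hiring, etc. and runs long.',
#  'D.', 'H.', 'Lawrence wrote novels.']
```

The abbreviation "etc." is handled correctly because the next token is lowercase, whereas the initials in "D. H. Lawrence" are split off, which is the kind of error the text attributes to the remaining 5%.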

Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model.[3] The SATZ[4] architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
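The idea of learning from pre-marked documents can be illustrated with a very small classifier. The sketch below trains a logistic regression model (the two-class form of a maximum entropy model) by batch gradient ascent. It is not the system of Reynar and Ratnaparkhi, nor SATZ; the three features, the training documents, and all function names are assumptions chosen only to keep the example short.

```python
import math

def features(token, next_token):
    """Hypothetical features for a token that ends in a period."""
    word = token.rstrip(".")
    return [
        1.0,                                                          # bias
        1.0 if next_token is None or next_token[:1].isupper() else 0.0,
        1.0 if len(word) <= 3 else 0.0,                               # short word
        1.0 if word[:1].isupper() else 0.0,                           # capitalized word
    ]

def training_examples(documents):
    """Each document is a list of sentences, i.e. the breaks are pre-marked."""
    examples = []
    for sentences in documents:
        tokens, ends = [], set()
        for sentence in sentences:
            tokens.extend(sentence.split())
            ends.add(len(tokens) - 1)            # index of the sentence-final token
        for i, token in enumerate(tokens):
            if token.endswith("."):
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                examples.append((features(token, nxt), 1.0 if i in ends else 0.0))
    return examples

def score(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def train(examples, epochs=2000, lr=0.1):
    """Maximize the log-likelihood with plain batch gradient ascent."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in examples:
            p = 1.0 / (1.0 + math.exp(-score(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += (y - p) * xi
        weights = [w + lr * g for w, g in zip(weights, gradient)]
    return weights

def segment(text, weights):
    """Break after each period-final token that the model scores above zero."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token.endswith("."):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if score(weights, features(token, nxt)) > 0:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

# Hypothetical pre-marked training documents.
documents = [
    ["Dr. Lee met Mr. Kim at noon.", "They talked for hours."],
    ["Prof. Ada lives on Elm St. near the park.", "She walks daily."],
    ["We sell pens, ink, etc. at low prices.", "Come and visit today."],
]
examples = training_examples(documents)
weights = train(examples)
print(segment("Mr. Jones left early. He was tired.", weights))
```

With so little data the model only memorizes a handful of feature patterns; a realistic system would use far more training text and richer features.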

Software

Examples using Perl-compatible regular expressions (PCRE)
  • ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
  • $sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)
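The first pattern above can be tried out in Python as a rough sketch. Python's `re` module is not PCRE, but it accepts these fixed-width lookbehinds. As an adaptation for this example, the capturing groups are changed to non-capturing ones so that `split` returns only the sentences, and `\r\n` is listed before `\s`. The name `regex_split` and the sample text are made up.

```python
import re

# Adapted from the first pattern above: split at whitespace that follows
# a lowercase letter or digit plus . ? or ! (optionally with a closing quote)
# and that precedes an optional opening quote plus a capital letter.
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))(?:\r\n|\s)(?="?[A-Z])'
)

def regex_split(text):
    """Split text at every position the boundary pattern matches."""
    return BOUNDARY.split(text)

sample_text = ('It rained all day. "We stayed in." Nobody minded. '
               'Dr. Who called at 5.30 p.m. sharp.')
regex_result = regex_split(sample_text)
print(regex_result)
# ['It rained all day.', '"We stayed in."', 'Nobody minded.',
#  'Dr.', 'Who called at 5.30 p.m. sharp.']
```

The pattern copes with the quoted sentence, the decimal "5.30", and "p.m." followed by a lowercase word, but it still breaks after "Dr." because a capitalized word follows.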
Online use, libraries, and APIs
  • sent_detector – Java[5]
  • Lingua-EN-Sentence – Perl[6]
  • Sentence.pm – Perl[7]
  • SATZ – An Adaptive Sentence Segmentation System – by David D. Palmer – C[8]
Toolkits that include sentence detection

References

  1. ^ E. Stamatatos; N. Fakotakis & G. Kokkinakis. "1 Automatic extraction of rules for sentence boundary disambiguation". University of Patras. Retrieved 2009-01-03.
  2. ^ O'Neil, John. "Doing Things with Words, Part Two: Sentence Boundary Detection". Retrieved 2009-01-03.
  3. ^ Reynar, JC; Ratnaparkhi, A. "A Maximum Entropy Approach to Identifying Sentence Boundaries" (PDF). Retrieved 2009-01-03.
  4. ^ "SATZ: An Adaptive Sentence Boundary Detector". Archived from the original on 2007-09-22.
  5. ^ [1]
  6. ^ "Lingua-EN-Sentence-0.25 - Module for splitting text into sentences. - metacpan.org". metacpan.org.
  7. ^ "Text::Sentence - module for splitting text into sentences - metacpan.org". metacpan.org.
  8. ^ http://elib.cs.berkeley.edu/src/satz/
  9. ^ "Apache OpenNLP". opennlp.apache.org.
  10. ^ [2]
  11. ^ "NLTK :: Natural Language Toolkit". www.nltk.org.
  12. ^ "Software - The Stanford Natural Language Processing Group". nlp.stanford.edu.
  13. ^ "Google Code Archive - Long-term storage for Google Code Project Hosting". code.google.com.
  14. ^ "CogCompNLP". January 2, 2024 – via GitHub.

This page was last edited on 2 January 2024, at 19:50
Basis of this page is in Wikipedia. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with Wikimedia Foundation.