List of text corpora

From Wikipedia, the free encyclopedia

Text corpora (singular: text corpus) are large and structured sets of texts, which have been systematically collected. Text corpora are used by corpus linguists and within other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.[1]
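
As an illustration of the kind of statistical analysis mentioned above, the following is a minimal sketch of counting word frequencies over a very small corpus. The three sentences, the simple regular-expression tokenizer, and the function names `tokenize`, `word_frequencies`, and `relative_frequency` are all assumptions made up for this example; real corpora are far larger and are normally processed with dedicated tools.

```python
import re
from collections import Counter

# Hypothetical toy corpus: three short texts invented for illustration only.
TOY_CORPUS = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog met on the mat.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def relative_frequency(counts, word):
    """Return the share of all tokens in the corpus that are `word`."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


freqs = word_frequencies(TOY_CORPUS)
print(freqs.most_common(2))
print(round(relative_frequency(freqs, "the"), 3))
```

Relative rather than raw frequencies are usually reported when corpora of different sizes are compared, which is why the sketch divides by the total token count.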

English language

European languages

Slavic

East Slavic

South Slavic

West Slavic

German

Middle Eastern Languages

  • Corpus Inscriptionum Semiticarum
  • Kanaanäische und Aramäische Inschriften
  • Hamshahri Corpus (Persian)
  • Persian in MULTEXT-EAST corpus (Persian)[15]
  • Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
  • TEP: Tehran English-Persian Parallel Corpus[16]
  • TMC: Tehran Monolingual Corpus, Standard corpus for Persian Language Modeling[16]
  • PTC: Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 2005, 322 pp. ISBN 964-8699-32-1
  • Kurdish-corpus.uok.ac.ir (Kurdish corpus, Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
  • Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012
  • Neo-Assyrian Text Corpus Project
  • Quranic Arabic Corpus (Classical Arabic)
  • Electronic Text Corpus of Sumerian Literature
  • Open Richly Annotated Cuneiform Corpus
  • AsoSoft text corpus[17] (Central Kurdish, Sorani)
  • Thesaurus Linguae Aegyptiae (ancient Egyptian, Afro-Asiatic)

Devanagari

East Asian Languages

South Asian Languages

African languages

Parallel corpora of diverse languages

Comparable Corpora

L2 (English) Corpora

  • Cambridge Learner Corpus[44]
  • Corpus of Academic Written and Spoken English (CAWSE),[45] a collection of Chinese students' English language samples in academic settings, freely downloadable online.
  • English as a Lingua Franca in Academic Settings (ELFA),[46] an academic ELF corpus.[47][48]
  • International Corpus of Learner English (ICLE),[49] a corpus of learner written English.
  • Louvain International Database of Spoken English Interlanguage (LINDSEI),[50] a corpus of learner spoken English.
  • Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.[51][52]
  • University of Pittsburgh English Language Institute Corpus (PELIC)[53]
  • Vienna-Oxford International Corpus of English (VOICE),[54] an ELF corpus.[47]

References

  1. ^ Leech, Geoffrey (2007). "Teaching and language corpora: a convergence". In Wichmann, A.; et al. (eds.). Teaching and Language Corpora. London: Longman. p. 9.
  2. ^ "Corpus Resource Database (CoRD)". Department of English, University of Helsinki.
  3. ^ Wahle, Jan Philip; Ruas, Terry; Mohammad, Saif; Gipp, Bela (2022). "D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research". Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association: 2642–2651. arXiv:2204.13384.
  4. ^ Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
  5. ^ "PhraseFinder". A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  6. ^ Basque corpora.
  7. ^ (in Spanish) "Molinolabs - corpus". molinolabs.com. Retrieved 12 January 2014.
  8. ^ "CorALit – CorALit - Lietuvių mokslo kalbos tekstynas". coralit.lt. Retrieved 12 January 2014.
  9. ^ "Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage". tnc.org.tr. Retrieved 12 January 2014.
  10. ^ Glazkova, A (2020). "Topical Classification of Text Fragments Accounting for Their Nearest Context". Automation and Remote Control. 81 (12): 2262–2276. doi:10.1134/S0005117920120097. S2CID 231929892.
  11. ^ Rubtsova, Yu (2015). "Constructing a corpus for sentiment classification training". Software & Systems. 1: 72–78. doi:10.15827/0236-235X.109.072-078.
  12. ^ "Under Update". search.dcl.bas.bg. Retrieved 12 January 2014.
  13. ^ "Електронски корупус на македонски книжевни текстови".
  14. ^ "Portál | Český národní korpus".
  15. ^ Zdravkova, Katrina; Tufiş, Dan; Simov, Kiril; Radziszewski, Adam; Qasemizadeh, Behrang; Priest-Dorman, Greg; Petkevič, Vladimír; Oravecz, Csaba; Krstev, Cvetana; Kotsyba, Natalia; Kaalep, Heiki-Jaan; Ide, Nancy; Garabík, Radovan; Dimitrova, Ludmila; Derzhanski, Ivan; Barbu, Ana-Maria; Erjavec, Tomaž (2010-05-14). "Available from CLARIN". http://nl.ijs.si/me/v4/.
  16. ^ a b "University of Tehran NLP Lab". ece.ut.ac.ir. Archived from the original on 28 January 2014. Retrieved 12 January 2014.
  17. ^ Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini; Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, fqy074, https://doi.org/10.1093/llc/fqy074
  18. ^ "KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言". kotonoha.gr.jp. Retrieved 12 January 2014.
  19. ^ https://wortschatz.uni-leipzig.de/en/download/Hindi
  20. ^ D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. Implementing a Corpus for Sinhala Language. In Symposium on Language Technology for South Asia.
  21. ^ Glossa (uio.no)
  22. ^ https://aclanthology.org/L14-1376/
  23. ^ https://arxiv.org/pdf/2102.06991.pdf, https://wortschatz.uni-leipzig.de/en/download/Hausa
  24. ^ https://www.sketchengine.eu/igtenten-igbo-corpus/
  25. ^ https://www.sketchengine.eu/corpora-and-languages/oromo-text-corpora/
  26. ^ https://www.researchgate.net/publication/336274457_Digital_Yoruba_Corpus, https://www.sketchengine.eu/corpora-and-languages/yoruba-text-corpora/
  27. ^ https://wortschatz.uni-leipzig.de/en/download/Zulu
  28. ^ Pan, Jun (2019). "The Chinese/English Political Interpreting Corpus (CEPIC). Hong Kong Baptist University Library". Retrieved January 3, 2022.
  29. ^ Pan, Jun (2019-10-30). "The Chinese/English Political Interpreting Corpus (CEPIC): A New Electronic Resource for Translators and Interpreters". Proceedings of the Second Workshop Human-Informed Translation and Interpreting Technology Associated with RANLP 2019. Incoma Ltd., Shoumen, Bulgaria: 82–88. doi:10.26615/issn.2683-0078.2019_010. S2CID 211257773.
  30. ^ "EUR-Lex Corpus". sketchengine.co.uk. 2 June 2016. Retrieved 27 October 2016.
  31. ^ "OPUS - an open source parallel corpus". opus.lingfil.uu.se. Retrieved 12 January 2014.
  32. ^ "Tatoeba - Number of sentences per language". tatoeba.org. Retrieved 23 November 2020.
  33. ^ Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)" (PDF). International Journal of Asian Language Processing. 22 (4): 161–174. Archived from the original (PDF) on 16 January 2014. Retrieved 12 January 2014.
  34. ^ Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  35. ^ H. Sanjurjo-González and M. Izquierdo. 2019. P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications (pp. 215-231). John Benjamins Publishing.
  36. ^ Steinberger, Ralf; Pouliquen, Bruno; Widiger, Anna; Ignat, Camelia; Erjavec, Tomaž; Tufiş, Dan; Varga, Dániel (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24–26 May 2006.
  37. ^ Liling Tan, Marcos Zampieri, Nikola Ljubešic, and Jörg Tiedemann. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.
  38. ^ Kilgarriff, Adam (2012). "Getting to Know Your Corpus". Text, Speech and Dialogue. Lecture Notes in Computer Science. Vol. 7499. pp. 3–15. CiteSeerX 10.1.1.452.8074. doi:10.1007/978-3-642-32790-2_1. ISBN 978-3-642-32789-6.
  39. ^ Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTen-Ten: a new, vast corpus for Arabic. Proceedings of WACL.
  40. ^ Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia - Social and Behavioral Sciences, 95, 12-19.
  41. ^ Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов. In Материалы научной конференции" Интернет и современное общество" (pp. 74-77).
  42. ^ Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.
  43. ^ Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434)
  44. ^ "Cambridge English Corpus", Wikipedia, 2019-09-27, retrieved 2020-01-07
  45. ^ "CAWSE Corpus - The University of Nottingham Ningbo China - 宁波诺丁汉大学". nottingham.edu.cn. Retrieved 2020-01-07.
  46. ^ "English as a Lingua Franca in Academic Settings". University of Helsinki. 2018-03-23. Retrieved 2020-01-07.
  47. ^ a b "English as a lingua franca", Wikipedia, 2019-12-14, retrieved 2020-01-07
  48. ^ Mauranen, A (2010). "English as an academic lingua franca: The ELFA project". English for Specific Purposes. 29 (3): 183–190. doi:10.1016/j.esp.2009.10.001.
  49. ^ "ICLE". UCLouvain. Retrieved 2020-01-07.
  50. ^ "LINDSEI". UCLouvain (in French). Retrieved 2020-01-07.
  51. ^ "Trinity Lancaster Corpus | ESRC Centre for Corpus Approaches to Social Science (CASS)". Retrieved 2020-01-07.
  52. ^ Gablasova, D (2019). "The Trinity Lancaster Corpus: Development, Description and Application". International Journal of Learner Corpus Research. 5 (2): 126–158. doi:10.1075/ijlcr.19001.gab.
  53. ^ Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set]. doi:10.5281/zenodo.3991977
  54. ^ "Project". univie.ac.at. Retrieved 2020-01-07.

See also

This page was last edited on 28 February 2024, at 11:24
This page is based on Wikipedia. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with the Wikimedia Foundation.