What is a Corpus?

Corpus:

“Corpus” is a Latin word meaning “body” [McEnery & Wilson, 1996]; hence a text corpus is any discrete body of text. In computing, the term “corpus” refers to a large collection of machine-readable electronic text. Corpora (the plural form of corpus) can be drawn from any medium, such as written text, speech or microfilm.

Corpora are the knowledge base that corpus linguistics uses to analyze and study language. The linguistic processing of corpora is called annotation, in which tools such as part-of-speech tagging, stemming or lemmatization are applied. Annotation also includes reformulating corpora into new linguistic forms [McEnery & Oakes, 1996].
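To make the idea of annotation concrete, here is a minimal sketch in Python. It is not taken from the cited works. The toy corpus, the hand-written POS_LEXICON table and the naive_stem function are all assumptions made up for illustration; real part-of-speech taggers and stemmers are considerably more sophisticated.

```python
import re

# Toy corpus: a small body of machine-readable text (illustrative only).
CORPUS = [
    "The cats are running in the garden.",
    "A cat runs after the birds.",
]

# Hypothetical mini-lexicon for part-of-speech lookup. Real taggers are
# usually trained on annotated corpora rather than hand-written tables.
POS_LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "cats": "NOUN", "birds": "NOUN", "garden": "NOUN",
    "are": "VERB", "runs": "VERB", "running": "VERB",
    "in": "ADP", "after": "ADP",
}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def annotate(sentence):
    """Return (token, part-of-speech tag, stem) triples for one sentence."""
    return [
        (tok, POS_LEXICON.get(tok, "UNK"), naive_stem(tok))
        for tok in tokenize(sentence)
    ]


if __name__ == "__main__":
    for sentence in CORPUS:
        print(annotate(sentence))
```

Note that a stem need not be a dictionary word: this sketch reduces "running" to "runn", whereas a lemmatizer would aim to return the dictionary form "run".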

Corpora are used in a variety of fields, including computational linguistics, speech recognition, information retrieval and machine translation.

References:

[Abusalah, 2008] Abusalah M. (2008). “Cross Language Information Retrieval Using Ontologies”. PhD Thesis, University of Sunderland.
[McEnery & Oakes, 1996] McEnery T. and Oakes M. (1996). “Sentence and word alignment in the CRATER Project”. In: J. Thomas and M. Short (eds), Using Corpora for Language Research, Longman, London, Pages 211–231.
[McEnery & Wilson, 1996] McEnery T. and Wilson A. (1996). “Corpus Linguistics”. Edinburgh: Edinburgh University Press, ISBN 0-7486-0808-7.

Mustafa Abusalah Blog
