What Is a Corpus?


Corpus:

“Corpus” is a Latin word meaning “body” [McEnery & Wilson, 1996]; hence a text corpus is any discrete body of text. In computing, the term “corpus” refers to a large collection of electronic, machine-readable text. Corpora (the plural of corpus) can be compiled from any medium, such as text, speech or microfilm.
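
As a minimal sketch of what “machine-readable” can mean in practice, the Python snippet below treats a corpus as a list of short documents and counts word frequencies across it. The sample sentences, the `tokenize` helper and its regular expression are illustrative assumptions, not part of any particular corpus or toolkit.

```python
import re
from collections import Counter

# A toy corpus: each string stands in for one document (hypothetical sample data).
corpus = [
    "A corpus is a body of text.",
    "Corpora are bodies of text readable by machines.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often each token occurs across the whole corpus.
frequencies = Counter(token for doc in corpus for token in tokenize(doc))
print(frequencies.most_common(3))
```

Real corpora are far larger and usually stored in files or databases, but the idea is the same: once the text is in electronic form, a program can search and count it.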

Corpora are the knowledge base of corpus linguistics, where they are used to analyze and study language. The linguistic processing of a corpus is called annotation; it applies techniques such as part-of-speech tagging, stemming and lemmatization. Annotation also includes reformulating corpora into new linguistic forms [McEnery & Oakes, 1996].
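
To make the idea of annotation more concrete, here is a rough Python sketch that attaches a part-of-speech tag and a stem to each token of a sentence. Everything in it is an assumption made for illustration: the tiny `TOY_LEXICON`, the `naive_stem` suffix stripper and the `annotate` function are hypothetical, and real annotation tools rely on trained taggers and established stemming algorithms instead.

```python
import re

# Hypothetical hand-made lexicon; real taggers learn tags from annotated data.
TOY_LEXICON = {"the": "DET", "linguists": "NOUN", "tagged": "VERB", "corpora": "NOUN"}

# Suffixes for a deliberately naive stemmer (an assumption, not Porter's algorithm).
SUFFIXES = ("ing", "ed", "s")

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def annotate(sentence):
    """Return (token, tag, stem) triples for each word in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [(t, TOY_LEXICON.get(t, "UNK"), naive_stem(t)) for t in tokens]

for row in annotate("The linguists tagged the corpora"):
    print(row)
```

Note that the naive stemmer reduces “tagged” to “tagg”, which is not a dictionary word. This is the usual difference between stemming and lemmatization: a lemmatizer would map the same token to its dictionary form, “tag”.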

Corpora are used in a variety of fields, including computational linguistics, speech recognition, information retrieval and machine translation.

References:


  • [Abusalah, 2008] Abusalah M. (2008). “Cross Language Information Retrieval Using Ontologies”, PhD Thesis, University of Sunderland.
  • [McEnery & Oakes, 1996] McEnery T. and Oakes M. (1996). “Sentence and word alignment in the CRATER Project”. In: J. Thomas and M. Short (eds), Using Corpora for Language Research, Longman, London, Pages 211–231.
  • [McEnery & Wilson, 1996] McEnery T. and Wilson A. (1997). “Corpus Linguistics”. Edinburgh: Edinburgh University Press, ISBN 0-7486-0808-7.
