Information extraction
In natural language processing, information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents. An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company1,company2,date), from an online news sentence such as: "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp." A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data.
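To make the merger example concrete, here is a minimal sketch of how such a tuple might be pulled out of the sentence with a single hand-written pattern. The names MergerBetween, COMPANY, MERGER_PATTERN and extract_merger are illustrative assumptions, as is the choice to treat the acquirer as company1 and to keep the relative date "Yesterday" as a plain string. Practical IE systems rely on far richer linguistic analysis than one regular expression.

```python
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    """Structured record mirroring MergerBetween(company1, company2, date)."""
    company1: str
    company2: str
    date: str


# Assumption: a company name is a run of capitalized words that ends in a
# corporate suffix such as "Inc." or "Corp.".
COMPANY = r"[A-Z]\w*(?: [A-Z]\w*)*? (?:Inc|Corp|Ltd)\."

# Assumption: one hand-written pattern of the form
# "<date>, ... <company1> announced their acquisition of <company2>".
MERGER_PATTERN = re.compile(
    r"(?P<date>Yesterday|Today),.*?"
    rf"(?P<company1>{COMPANY}) announced (?:its|their) acquisition of "
    rf"(?P<company2>{COMPANY})"
)


def extract_merger(sentence: str) -> Optional[MergerBetween]:
    """Return a MergerBetween record, or None if the pattern does not match."""
    match = MERGER_PATTERN.search(sentence)
    if match is None:
        return None
    return MergerBetween(
        company1=match.group("company1"),
        company2=match.group("company2"),
        date=match.group("date"),  # left unresolved; no calendar date is computed
    )


if __name__ == "__main__":
    text = ("Yesterday, New-York based Foo Inc. announced their "
            "acquisition of Bar Corp.")
    print(extract_merger(text))
```

The pattern is deliberately brittle: a small rewording of the sentence defeats it, which is one reason IE approaches tend to concentrate on restricted domains.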
The significance of IE stems from the growing amount of information available in unstructured form (i.e. without metadata), for instance on the Internet. This information can be made more accessible by transforming it into relational form or by marking it up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with.
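The two target representations mentioned above, relational form and XML markup, can be sketched as follows. The record is written by hand here rather than extracted, and the table name merger_between, the column names, and the tag names are all assumptions chosen for illustration.

```python
import sqlite3
from xml.sax.saxutils import escape

# Hand-written example record; in practice this would come from an extractor.
SENTENCE = ("Yesterday, New-York based Foo Inc. announced their "
            "acquisition of Bar Corp.")
RECORD = {"company1": "Foo Inc.", "company2": "Bar Corp.", "date": "Yesterday"}


def to_relational(record: dict) -> sqlite3.Connection:
    """Store the record as one row of an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE merger_between "
        "(company1 TEXT, company2 TEXT, merger_date TEXT)"
    )
    conn.execute(
        "INSERT INTO merger_between VALUES (:company1, :company2, :date)",
        record,
    )
    return conn


def to_xml(sentence: str, record: dict) -> str:
    """Wrap the first occurrence of each mention in a tag named after its role.

    Naive by design: it assumes mentions do not overlap and do not occur
    inside the tag names themselves.
    """
    marked = escape(sentence)
    for role, mention in record.items():
        safe = escape(mention)
        marked = marked.replace(safe, f"<{role}>{safe}</{role}>", 1)
    return f"<sentence>{marked}</sentence>"


if __name__ == "__main__":
    conn = to_relational(RECORD)
    print(conn.execute("SELECT * FROM merger_between").fetchall())
    print(to_xml(SENTENCE, RECORD))
```

Once the data is in either form, ordinary queries (SQL in the first case, XML tooling in the second) can be run over what was previously free text.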
A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information. Current approaches to IE use natural language processing techniques that focus on very restricted domains. For example, the Message Understanding Conference (MUC) was a competition-based conference series whose successive editions focused on the following domains:
- MUC-1 (1987), MUC-2 (1989): Naval operations messages.
- MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
- MUC-5 (1993): Joint ventures and microelectronics domain.
- MUC-6 (1995): News articles on management changes.
- MUC-7 (1998): Satellite launch reports.
Natural language texts may first need some form of text simplification to produce more easily machine-readable text from which information can be extracted.
Typical subtasks of IE are:
- Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
- Coreference: identification of chains of noun phrases that refer to the same object. For example, anaphora is a type of coreference.
- Terminology extraction: finding the relevant terms for a given corpus.
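As a rough illustration of the first subtask, the sketch below tags organization names, one date format, and one kind of monetary expression using hand-written patterns. The labels, the patterns, and the function name tag_entities are assumptions made for this example; real named entity recognizers use much larger rule sets or trained statistical models.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical, hand-written patterns covering only a few surface forms.
PATTERNS = [
    # Capitalized words ending in a corporate suffix, e.g. "Foo Inc."
    ("ORGANIZATION", re.compile(r"\b[A-Z]\w*(?: [A-Z]\w*)*? (?:Inc|Corp|Ltd)\.")),
    # Day, month name, four-digit year, e.g. "3 March 1998"
    ("DATE", re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")),
    # Dollar amounts in millions or billions, e.g. "$12.5 million"
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)? (?:million|billion)")),
]


def tag_entities(text: str) -> list:
    """Return (label, mention) pairs in order of appearance in the text."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, mention) for _, label, mention in sorted(found)]


if __name__ == "__main__":
    sample = "Foo Inc. bought Bar Corp. for $12.5 million on 3 March 1998."
    for label, mention in tag_entities(sample):
        print(label, mention)
```

Coreference resolution and terminology extraction are not covered by this sketch; both typically need more context than a single pattern match can supply.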
See also
- HAREM, a Portuguese named entity recognition contest
- ECHELON
- General Architecture for Text Engineering (GATE), which is bundled with a free information extraction system
External links
- Extracción informacion (Spanish site)
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ MUC
- http://projects.ldc.upenn.edu/ace/ ACE (LDC)
- http://www.itl.nist.gov/iad/894.01/tests/ace/ ACE (NIST)
- http://lcl2.di.uniroma1.it TermExtractor
- TermFinder, on-line terminology extractor for EN, FR & IT - web application
Commercial
- Document Summary System, a commercial product that performs document summarization

