Document classification
Document classification (also called document categorization) is a problem in information science. The task is to assign an electronic document to one or more categories based on its contents. Document classification tasks fall into two kinds: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification of documents, and unsupervised document classification, where the classification must be done entirely without reference to external information.
Techniques
Document classification techniques include:
- naive Bayes classifier
- tf-idf
- latent semantic indexing
- support vector machines
- artificial neural networks
- k-nearest neighbors (kNN)
- decision trees, such as ID3
- concept mining
and approaches based on natural language processing.
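As an illustration of how two of the listed techniques can be combined, the following is a minimal sketch of a supervised classifier that weights terms with tf-idf and labels a new document by a majority vote among its k nearest training documents under cosine similarity. The function names, the whitespace tokenizer, the smoothed idf formula, and the toy training set are all assumptions made for this example; they are not taken from any particular library or from the sources cited on this page.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for real tokenization.
    return text.lower().split()


def build_idf(tokenized_docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector as a dict; terms unseen in training are ignored.
    term_freq = Counter(tokens)
    return {term: count * idf[term]
            for term, count in term_freq.items() if term in idf}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(text, training, k=3):
    # training is a list of (document_text, label) pairs.
    tokenized = [tokenize(doc) for doc, _ in training]
    idf = build_idf(tokenized)
    vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    query = tfidf_vector(tokenize(text), idf)
    scored = [(cosine(query, vec), label)
              for vec, (_, label) in zip(vectors, training)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy labelled corpus, invented for illustration only.
TRAINING = [
    ("the striker scored a late goal in the match", "sports"),
    ("the team won the league after a tense match", "sports"),
    ("the goalkeeper saved a penalty", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("shares fell as the bank reported losses", "finance"),
    ("investors expect interest rates to rise", "finance"),
]

if __name__ == "__main__":
    print(knn_classify("the bank cut interest rates", TRAINING))
    print(knn_classify("a late goal won the match", TRAINING))
```

A real system would normally add proper tokenization, stop-word handling, and a held-out evaluation set; the sketch only shows the flow from term weighting to a category decision.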
Applications
A notable application of document classification techniques is spam filtering, which tries to distinguish spam e-mail messages from legitimate ones.
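Spam filtering is often approached with a naive Bayes classifier. The sketch below is one possible multinomial naive Bayes filter with add-one (Laplace) smoothing. The class name `NaiveBayesTextClassifier`, the `spam`/`ham` labels, and the six training messages are hypothetical and exist only to make the example runnable; it should not be read as the method used by any specific mail system.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over whitespace tokens (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing strength (assumed 1.0)
        self.class_doc_counts = Counter()       # documents seen per class
        self.word_counts = defaultdict(Counter) # token counts per class
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            tokens = text.lower().split()
            self.class_doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                # Smoothed log likelihood of the token given the class.
                score += math.log((count + self.alpha)
                                  / (total_words + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented training messages, for illustration only.
MESSAGES = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch on monday with the team",
]
LABELS = ["spam", "spam", "spam", "ham", "ham", "ham"]

if __name__ == "__main__":
    model = NaiveBayesTextClassifier().fit(MESSAGES, LABELS)
    print(model.predict("claim your free prize"))
    print(model.predict("monday meeting with the team"))
```

Scores are summed in log space so that long messages do not underflow to zero, and smoothing keeps a single unseen word from ruling out a class entirely.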
See also
- classification
- supervised learning, unsupervised learning
- document retrieval
- information retrieval
- string metrics
- machine learning
- text mining, web mining, concept mining
Further reading
Publications:
- Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
- Introduction to document classification
- Bibliography on Automated Text Categorization
- Bibliography on Query Classification
Data sets:
- TechTC - Technion Repository of Text Categorization Datasets
- David D. Lewis's Datasets

