Unstructured data

Unstructured data (or unstructured information) refers to information, usually computerized, that either has no data structure or has one that is not easily readable by a machine. The term is imprecise, because such material is rarely without structure: software that creates machine-processable structure exploits word morphology, sentence syntax, and other small- and large-scale patterns in source materials to discern the linguistic, auditory, and visual structure inherent in all forms of human communication. Examples of "unstructured data" include audio, video, and unstructured text such as the body of an email or word processor document.

Merrill Lynch estimates that more than 85% of all business information exists as unstructured data.

Data with some form of structure may also be referred to as unstructured data if that structure is not helpful for the desired processing task. For example, an HTML webpage is highly structured, but this structure is often oriented towards formatting rather than towards more complex tasks that work with the content of the page.
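
The HTML case can be illustrated with a minimal Python sketch. It uses only the standard-library html.parser module; the class name TextExtractor, the helper extract_text, and the sample page are hypothetical and chosen for illustration. The markup identifies a heading and a bold span, but once the tags are stripped the content is plain text that says nothing about what the figures mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the formatting-oriented markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page: the tags describe layout (heading, bold),
# not meaning (which number is revenue growth, which is a year).
page = (
    "<html><body><h1>Quarterly report</h1>"
    "<p>Revenue rose <b>12%</b> in 2010.</p></body></html>"
)
text = extract_text(page)
print(text)
```

The result is an ordinary sentence. Any further task, such as recognizing "12%" as a growth figure, still has to treat the content as unstructured text.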

Dealing with unstructured data
Data mining and text analytics techniques are used to find patterns in, or otherwise interpret, this information. Common techniques for structuring text involve manual tagging with metadata or part-of-speech tagging as a basis for further text-mining-based structuring. UIMA (Unstructured Information Management Architecture) provides a common framework for processing this information to extract meaning and create structured data about it.
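
As a rough illustration of creating structured data about a piece of text, the following Python sketch combines manually supplied metadata tags with simple pattern matching. It is not the UIMA API and performs no part-of-speech tagging; the function structure_text, the field names, and the deliberately simplified regular expressions are assumptions made for this example.

```python
import re

# Simplified patterns, assumed for illustration only; real-world email
# addresses and date formats are far more varied than these cover.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def structure_text(body, tags=()):
    """Build a structured record describing an unstructured text body."""
    return {
        "emails": EMAIL_RE.findall(body),   # addresses found by pattern
        "dates": DATE_RE.findall(body),     # ISO-style dates found by pattern
        "word_count": len(body.split()),    # whitespace-separated tokens
        "tags": list(tags),                 # manually assigned metadata
    }


# Hypothetical email body and a manually chosen tag.
body = "Please send the signed contract to legal@example.com by 2024-03-15."
record = structure_text(body, tags=["contract"])
print(record)
```

The resulting record can be stored, filtered, or queried like any other structured data, while the original text remains unchanged alongside it.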