Differences between Structured, Unstructured, and Semi-structured Documents
What is Structured Data?
The term structured data refers to information that sits in fixed fields within a file or record, so its format, number of fields, and layout are predetermined. A well-organized Excel sheet is a prime example. Most questionnaires and application forms are also fixed forms, usually distributed as blank forms with constrained text boxes and “fill-in-the-bubble” responses.
Examples of structured data include names, dates, addresses, credit card numbers, stock information, and geolocation, as typically held in databases and CRM systems. Structured data is highly organized and easily processed by machines. Those working with relational databases can input, search, and manipulate structured data relatively quickly, which is its most attractive feature.
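To make the point about relational databases concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical, chosen only for illustration. Because every record has the same fixed fields, a search reduces to a single query.

```python
import sqlite3

# Hypothetical table and sample rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, signup_date TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Ada", "2023-01-15", "London"),
        ("Grace", "2023-03-02", "New York"),
        ("Linus", "2023-03-20", "Helsinki"),
    ],
)

def customers_since(date):
    """Return the names of customers who signed up on or after `date`."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE signup_date >= ? ORDER BY name",
        (date,),
    )
    return [name for (name,) in rows]

print(customers_since("2023-03-01"))  # ['Grace', 'Linus']
```

The query works without inspecting individual records because the schema guarantees that every row has a signup_date in the same position and format.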
What is Unstructured Data?
Unstructured data is mostly categorized as qualitative data, and it cannot readily be processed and analyzed using conventional tools and methods. It is difficult to deconstruct because it has no predefined data model; the data is simply stored in its native format. Typical examples of unstructured data are rich media, text, social media activity, and surveillance imagery.
The vast majority of data created today is unstructured; it is commonly estimated to make up 80% or more of all enterprise data. Because it lacks a predefined structure, it is very cumbersome or even impossible for computers to make sense of it with conventional methods. This means that companies not taking unstructured data into account are missing out on a lot of valuable business intelligence.
However, with the advent of AI and more sophisticated machine learning methods, considerable progress is being made in teaching machines to understand and extract data from unstructured documents.
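A small sketch can show why unstructured text resists conventional tools. The example below is not a machine learning method; it is a hand-written rule (a regular expression for ISO-style dates) applied to a hypothetical free-text note. The sample sentences and the function name find_dates are assumptions made for illustration.

```python
import re

# Hypothetical free-text note; nothing in it marks which parts are dates.
note = "Spoke with Ms. Rivera on 2024-05-07; she wants delivery moved to 2024-06-01."

# A hand-written rule that only recognizes dates written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def find_dates(text):
    """Return every YYYY-MM-DD date found in the text."""
    return ISO_DATE.findall(text)

print(find_dates(note))  # ['2024-05-07', '2024-06-01']

# The same information phrased differently slips past the rule entirely.
print(find_dates("She wants delivery moved to the first of June."))  # []
```

The rule succeeds only when the text happens to follow the expected pattern, which is the kind of brittleness that learned extraction methods aim to reduce.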
What is Semi-structured Data?
Semi-structured data is a third category that falls somewhere between the other two. It has some degree of organization, typically achieved with tags or other elements with defined properties that introduce a hierarchy and system into a file. However, the order and number of such tags and elements may vary.
A typical example of semi-structured data is a smartphone photo. Every photo taken with a smartphone contains unstructured image content as well as the tagged time, location, and other identifiable (and structured) information. Semi-structured data formats include JSON, CSV, and XML.
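The idea that tags give structure while their presence and order may vary can be sketched with a few JSON records. The photo metadata below is hypothetical (the field names file, taken, gps, and flash are assumptions, not a real camera format). Each record is self-describing, but code that reads it has to tolerate missing fields.

```python
import json

# Hypothetical photo metadata: every record is tagged, but the set of tags varies.
raw = """
[
  {"file": "IMG_001.jpg", "taken": "2024-05-07T10:15:00", "gps": {"lat": 48.2, "lon": 16.4}},
  {"file": "IMG_002.jpg", "taken": "2024-05-08T18:40:00"},
  {"file": "IMG_003.jpg", "gps": {"lat": 52.5, "lon": 13.4}, "flash": true}
]
"""

photos = json.loads(raw)

def summarize(photo):
    """Pull out the fields of interest, using None where a tag is absent."""
    gps = photo.get("gps")
    location = (gps["lat"], gps["lon"]) if gps else None
    return {"file": photo["file"], "taken": photo.get("taken"), "location": location}

summaries = [summarize(p) for p in photos]
for s in summaries:
    print(s)
```

Unlike a database row, no record here is guaranteed to carry every field, so the reader code uses dict.get and an explicit check instead of assuming a fixed schema.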