When I guide teams through time series anomaly detection, a common point of confusion arises: the distinction between structured and unstructured data. To clarify this, let’s explore the prevalent data formats in the industry and the challenges they present.
Data Diversity: Common Data Structures
In practice, however, it’s not always so cut and dried. For example, if I save data in logs, does that mean the data is unstructured? Not necessarily. If the data saved in those logs is composed of well-defined Comma-Separated Values (CSV), where each field represents a well-structured quantity (e.g., temperature or weight), you can define a metadata schema for that log and process it in a structured way. The reverse can also be true for data stored in relational databases.
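To make this concrete, here is a minimal sketch of parsing such a log in a structured way. The log lines, field order, and schema below are illustrative assumptions, not any particular tool’s format:

```python
import csv
import io

# Hypothetical log lines whose payload is CSV with a known field order:
# timestamp, sensor_id, temperature, weight
LOG_LINES = [
    "2024-01-15T10:00:00,sensor-1,21.5,70.2",
    "2024-01-15T10:01:00,sensor-2,22.1,69.8",
]

# The metadata schema we define for this log
SCHEMA = ["timestamp", "sensor_id", "temperature", "weight"]

def parse_log(lines, schema):
    """Map each CSV log line onto the schema, yielding structured records."""
    reader = csv.reader(io.StringIO("\n".join(lines)))
    return [dict(zip(schema, row)) for row in reader]

records = parse_log(LOG_LINES, SCHEMA)
print(records[0]["temperature"])  # fields are now addressable by name
```

Once the schema is applied, each field can be queried, validated, or aggregated like any structured column.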
Conversely, data that fits neatly into a schema isn’t necessarily well structured. In some cases, data is saved into text fields, where each field actually holds a large chunk of plain text. Without an unstructured-data approach, you won’t be able to extract value from those free-text database fields.
Data can be represented in thousands of ways. Below I’ve summarized a table of the most common structures, along with the technologies and companies supporting these formats.
| Data Type | Typically Structured or Unstructured? | Technology Examples | Comment |
| --- | --- | --- | --- |
| Logs | U | Splunk, Elastic, SumoLogic | These tools provide ways to display the unstructured data in a structured way (tables, time series) |
| Events | U | Mixpanel | Some event platforms support both structured and unstructured formats. |
| Events | S | Google Analytics, Adobe Analytics | |
| Time Series | S | Anodot, DataDog, InfluxDB, OpenTSDB | Some of these tools provide ways to convert structured and unstructured data into time series. |
| Relational | S | SQL Databases, Excel | Definitely the most common way to save company data. |
| Columnar | S | Cassandra, BigQuery, Snowflake, Redshift | |
| JSON/XML | S | MongoDB | |
Unstructured Data: Challenges and Opportunities
While unstructured data lacks a predefined schema, it can be valuable with the right tools. Parsing heuristics, NLP, and regular expressions help extract meaning from text-based data.
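As a small illustration of the regular-expression approach, the sketch below pulls named fields out of a free-text log message. The message format and the field names (`level`, `service`, `latency_ms`) are assumptions for the example, not a format any specific tool guarantees:

```python
import re

# Hypothetical free-text log message; the pattern below is an assumption
# about its layout, which is exactly the fragility discussed here.
MESSAGE = "2024-01-15 10:00:00 ERROR payment-service: request took 1520 ms"

PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<service>[\w-]+): request took (?P<latency_ms>\d+) ms"
)

match = PATTERN.match(MESSAGE)
record = match.groupdict() if match else None
if record:
    # Convert the extracted latency into a numeric, analyzable value
    record["latency_ms"] = int(record["latency_ms"])
    print(record["service"], record["latency_ms"])
```

Note that a small change in the message format would break the pattern, which is why parsing accuracy and discovery effort are listed as challenges below.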
Key Challenges:
- Discovery Efforts: Parsing unstructured data often requires significant upfront work to define appropriate schemas.
- Accuracy Limitations: Even with advanced techniques, parsing accuracy may not be perfect.
- Performance Overhead: Parsing unstructured data can be computationally expensive compared to working with structured formats.
Opportunities:
- Flexibility: Unstructured data can capture a wide range of information, including customer feedback and machine-generated events.
- Cost-Effectiveness: In some cases, storing data in unstructured formats can be more efficient.
Balancing Act:
Choosing between structured and unstructured formats depends on specific use cases, performance requirements, and the value of the data. While structured data offers simplicity and efficiency, unstructured data can provide valuable insights when handled effectively.
About Structured Data
As human beings, we are always trying to be more organized in our thoughts. We assign names to people, addresses to houses, and strings of words to our communications. In a similar fashion, we label and arrange our data to give ourselves more powerful ways and tools to process and analyze information.
When data is organized into a well-defined structure, a schema can be defined. This leads to a better data validation process, better data quality, and better insight. For example, if you have a columnar database in which each column represents a well-defined measure, such as revenue or transactions, then it will be easier to build a proxy layer that validates the data as it comes in. And, of course, it will be much easier to manipulate, and to run flexible operations such as sums, range checking, range indexing, aggregations, etc.
With unstructured data, you always need to worry that a small change in the data will upset all your parsing assumptions; with structured data, this is not the case. Structured data does, however, require more thought, design, and preparatory work, and it cannot cover every use case.
Unstructured to Structured Conversion
In many cases, in order to achieve a stronger analysis of unstructured data, you must first convert it to a structured format. Take logs, for example. They can be converted into a time series if they contain a timestamp, or into tables, if you can define a regex for each field (as with CSV). There are even cases when the structured data itself needs a better representation in order to be analyzed effectively.
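The log-to-time-series conversion can be sketched as follows. The log lines are hypothetical; the only assumption is that each line starts with an ISO timestamp:

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines; only the leading ISO timestamp is assumed.
LOG_LINES = [
    "2024-01-15T10:00:12 user login ok",
    "2024-01-15T10:00:47 user login failed",
    "2024-01-15T10:01:05 user login ok",
]

def to_minute_series(lines):
    """Bucket log events per minute, yielding a simple count time series."""
    buckets = Counter()
    for line in lines:
        ts = datetime.fromisoformat(line.split(" ", 1)[0])
        buckets[ts.replace(second=0)] += 1
    return dict(sorted(buckets.items()))

series = to_minute_series(LOG_LINES)
for minute, count in series.items():
    print(minute.isoformat(), count)
```

Note what is lost in the conversion: the series keeps only event counts per minute, while the message text is discarded, which illustrates the data-loss trade-off discussed next.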
In the case of columnar data, for instance, you might want to convert it into a time series to understand trends in the columns that represent measurements. However, each transformation might lead to some data loss and compromise data coverage. Companies usually save a portion of the same data in other formats but still preserve the basic, original data format as a backup.
Conclusion
If you can, structure it! This is the most powerful and cheapest way to work with data. In cases when you can’t cover all your data in a structured way, you might want to have different views of the same data. It might seem inefficient to save both unstructured and structured data, but in the long run it may turn out to be even more cost effective.