When guiding teams in time series anomaly detection, a common point of confusion arises: the distinction between structured and unstructured data. To clarify this, let’s explore the prevalent data formats in the industry and the challenges they present.

Data Diversity: Common Data Structures

 

In practice, however, it's not always so cut and dried. For example, if I save data in logs, does that mean the data is unstructured? Not necessarily. If the data being written to those logs comes from well-defined comma-separated values (CSV), where each field represents a well-structured value (e.g., temperature or weight), you can define a metadata schema for that log and process it in a structured way. The reverse holds for data in relational databases.

And the fact that data fits easily into a schema doesn't necessarily mean it is well structured. In some cases, data is saved into text fields where each field actually holds a large chunk of plain text. Without techniques for handling unstructured data, you won't be able to get value from those free-text database fields.

Data can be represented in thousands of ways. Below, I've summarized the most common structures in a table, along with the technologies and companies supporting these formats.

 

Data Type   | Typically Structured or Unstructured? | Technology Examples                          | Comment
Logs        | U | Splunk, Elastic, SumoLogic                   | These tools provide ways to display the data in a structured way (tables, time series)
Events      | U | Mixpanel                                     | Some event platforms support both structured and unstructured formats.
Events      | S | Google Analytics, Adobe Analytics            |
Time Series | S | Anodot, DataDog, InfluxDB, OpenTSDB          | Some of these tools provide ways to convert structured and unstructured data into time series.
Relational  | S | SQL Databases, Excel                         | Definitely the most common way to save company data.
Columnar    | S | Cassandra, BigQuery, Snowflake, Redshift     |
JSON/XML    | S | MongoDB                                      |

 

Unstructured Data: Challenges and Opportunities

While unstructured data lacks a predefined schema, it can be valuable with the right tools. Parsing heuristics, NLP, and regular expressions help extract meaning from text-based data.
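As a minimal sketch of the regular-expression approach, the snippet below parses a single hypothetical log line (the field names and format are assumptions for illustration, not a real product's log format) into named, typed fields:

```python
import re

# Hypothetical log line -- the format is an assumption for illustration.
line = '2024-05-01T12:30:45Z sensor=42 temperature=21.7 status=OK'

# Named groups turn the free-text line into a dictionary of fields.
pattern = re.compile(
    r'(?P<timestamp>\S+)\s+'
    r'sensor=(?P<sensor>\d+)\s+'
    r'temperature=(?P<temperature>[\d.]+)\s+'
    r'status=(?P<status>\w+)'
)

match = pattern.match(line)
record = match.groupdict()

# Cast the numeric fields so downstream code can aggregate them.
record['sensor'] = int(record['sensor'])
record['temperature'] = float(record['temperature'])
print(record)
```

Once the line is a dictionary with typed values, it can be validated against a schema or loaded into a table, which is exactly the discovery work the first challenge below refers to.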

Key Challenges:

  • Discovery Efforts: Parsing unstructured data often requires significant upfront work to define appropriate schemas.
  • Accuracy Limitations: Even with advanced techniques, parsing accuracy may not be perfect.
  • Performance Overhead: Parsing unstructured data can be computationally expensive compared to working with structured formats.

Opportunities:

  • Flexibility: Unstructured data can capture a wide range of information, including customer feedback and machine-generated events.
  • Cost-Effectiveness: In some cases, storing data in unstructured formats can be more efficient.

Balancing Act:

Choosing between structured and unstructured formats depends on specific use cases, performance requirements, and the value of the data. While structured data offers simplicity and efficiency, unstructured data can provide valuable insights when handled effectively.

 

About Structured Data

As human beings, we are always trying to be more organized in our thoughts. We assign names to people, addresses to houses, and words to our communications. In a similar fashion, we label and arrange our data to give ourselves more powerful ways and tools to process and analyze information.

When data is organized into a well-defined structure, a schema can be defined. This leads to a better data validation process, better data quality, and better insight. For example, if you have a columnar database in which each column represents a well-defined measure, such as revenue or transactions, it will be easier to build a proxy layer that validates incoming data. And, of course, the data will be much easier to manipulate and to run flexible operations on, such as sums, range checking, range indexing, aggregations, etc.
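The validating proxy layer described above might look like the following sketch, where the column names and allowed ranges are illustrative assumptions rather than a fixed standard:

```python
# Minimal sketch of a validation "proxy layer" for a structured feed.
# The columns and ranges below are assumptions for illustration.
SCHEMA = {
    'revenue':      {'type': float, 'min': 0.0},
    'transactions': {'type': int,   'min': 0},
}

def validate(row):
    """Return a list of problems found in one incoming row."""
    errors = []
    for column, rules in SCHEMA.items():
        value = row.get(column)
        if not isinstance(value, rules['type']):
            errors.append(f'{column}: expected {rules["type"].__name__}')
        elif value < rules['min']:
            errors.append(f'{column}: below minimum {rules["min"]}')
    return errors

print(validate({'revenue': 120.5, 'transactions': 3}))  # no errors
print(validate({'revenue': -1.0, 'transactions': 3}))   # revenue below minimum
```

Because every column has a declared type and range, bad rows are rejected at the boundary instead of silently corrupting downstream analysis, which is the data-quality benefit structure buys you.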

With unstructured data, you always need to worry that a small change in the data will upset all your parsing assumptions; with structured data, this is not the case. Structured data does, however, require more thought, design, and preparatory work, and cannot cover every use case.

 

Unstructured to Structured Conversion

In many cases, in order to achieve a stronger analysis of unstructured data, you must first convert it to a structured format. Take logs, for example: they can be converted into a time series if they contain a timestamp, or into tables if you can define a regex for each field (as with CSV). There are even cases when the structured data itself needs a better representation in order to be analyzed effectively.
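The log-to-time-series conversion above can be sketched as follows; the log line format and the choice of per-minute averaging are assumptions for illustration:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log lines, each carrying a timestamp and a numeric value.
logs = [
    '2024-05-01T12:00:10Z latency_ms=120',
    '2024-05-01T12:00:40Z latency_ms=180',
    '2024-05-01T12:01:05Z latency_ms=90',
]

# Bucket values by minute to produce a regular time series.
buckets = defaultdict(list)
for line in logs:
    ts_text, field = line.split()
    ts = datetime.strptime(ts_text, '%Y-%m-%dT%H:%M:%SZ')
    value = float(field.split('=')[1])
    buckets[ts.replace(second=0)].append(value)

# Average each bucket; the result is a timestamp -> value series.
series = {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}
for minute, avg in series.items():
    print(minute.isoformat(), avg)
```

Note the data loss the next paragraph warns about: once the three raw lines are collapsed into per-minute averages, the individual readings can no longer be recovered from the series.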

In the case of columnar data, for instance, you might want to convert it into a time series to understand trends in the columns that represent measurements. However, each transformation may lead to some data loss and compromise data coverage. Companies usually save a portion of the same data in other formats, but still preserve the basic, original format as a backup.

 

Conclusion

If you can, structure it! This is the most powerful and cheapest way to work with data. In cases when you can’t cover all your data in a structured way, you might want to have different views of the same data. It might seem inefficient to save both unstructured and structured data, but in the long run it may turn out to be even more cost effective.

Written by David Drai

David is CEO and co-founder of Anodot, where he is committed to helping data-driven companies illuminate business blind spots with AI analytics. He previously was CTO at Gett, an app-based transportation service used in hundreds of cities worldwide. Prior to Gett, he co-founded Cotendo, a content delivery network and site acceleration services provider that was acquired by Akamai Technologies, where he also served as CTO. He graduated from Technion - Israel Institute of Technology with a BSc in computer science.
