How to Build a Security Data Lake: Key Challenges, Top Tools, and Lessons Learned (Part 1)

Security operations are undergoing a generational transformation. As log volumes explode and attack surfaces multiply, organizations are discovering that legacy SIEMs - once the centerpiece of enterprise detection and response - can no longer keep up. The answer? The security data lake: a modern architecture that combines low-cost object storage, flexible compute, and open schema to deliver scale, speed, and control.
In this post, we’ll explore the key forces driving this shift, the challenges of implementation, and what the future holds for query and detection tools built for the next generation of cybersecurity data operations.
From SIEM Overload to Data Lake Agility
Every day, security teams face a staggering volume of log data from across cloud infrastructure, endpoints, and SaaS platforms - AWS CloudTrail, VPC Flow Logs, DNS logs, EDR logs, Okta audit trails, and more. Many of these logs are extremely noisy, and traditional SIEMs were never designed to handle this volume.
Stateful SIEM clusters quickly hit limits on data retention and query performance: queries time out, ingestion lags badly, and ingestion costs can climb into seven figures. Teams often have to drop high-volume logs like VPC flow data entirely. Even then, queries can take hours to execute, stalling investigations and threat hunts.
Maintaining uptime in these clusters demands continuous engineering work - managing shards, rebalancing nodes, capacity planning, and tuning ingestion pipelines - and none of it scales easily. The operational tax is immense.
The Rise of Object Stores
The pattern emerging now is clear: object storage is becoming the foundation of modern data infrastructure. Object storage (like Amazon S3, Azure Blob Storage, or Google Cloud Storage) offers effectively unlimited scalability at a fraction of the cost. Amazon S3 even promises eleven nines of durability!
Apple’s famous migration from Splunk to an S3-based data lake with Databricks Delta Lake exemplifies this trend - achieving petabytes-per-day ingestion while decoupling compute from storage. They also built their own Apache Lucene plugins for Apache Spark, enabling high-performance regular-expression and token-based search against multi-petabyte indexes.
This decoupling means compute can scale on demand, rather than keeping costly infrastructure running 24/7.
However, once the data is in an object store, security teams need to layer on other tools to transform the logs, search them, run detections, and perform analytics. This can be genuinely hard and can require significant data engineering resources.
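To make the decoupling concrete, here is a minimal sketch (under assumptions, not a production pipeline) of the kind of transform job teams end up building: an ephemeral function reads a raw gzipped newline-delimited JSON export from S3, flattens it into a columnar table, and writes Parquet back to the lake. The bucket names and keys are hypothetical, and the sketch assumes boto3 and pyarrow are available.

```python
import gzip
import io
import json

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

# Hypothetical bucket names, for illustration only.
RAW_BUCKET = "security-lake-raw"
CURATED_BUCKET = "security-lake-curated"


def transform_object(raw_key: str, curated_key: str) -> None:
    """Rewrite one gzipped newline-delimited JSON export as a Parquet file."""
    # Download and decompress the raw export.
    body = s3.get_object(Bucket=RAW_BUCKET, Key=raw_key)["Body"].read()
    lines = gzip.decompress(body).decode("utf-8").splitlines()
    events = [json.loads(line) for line in lines if line.strip()]

    # Convert to a columnar Arrow table and serialize as Parquet in memory.
    table = pa.Table.from_pylist(events)
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="zstd")

    # Write the curated Parquet object back to the lake.
    s3.put_object(Bucket=CURATED_BUCKET, Key=curated_key, Body=buf.getvalue())
```

Because compute is decoupled from storage, a job like this can run in Lambda, Spark, or any batch framework - and only when new data arrives - rather than on an always-on cluster.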
Full-Text Search, the Missing Ingredient
One particular challenge for security teams is conducting full-text search over really messy logs, especially those that don’t fit neatly into a schema. Security engineers are used to performing full-text substring searches in traditional SIEMs, but that capability is missing from their data lakes. So why does full-text search matter?
If you can’t search inside the noisy parts of logs, it’s almost impossible to find the kinds of signals that only appear in messy data - reverse shells, command-and-control traffic, or malicious scripts. Examples where full-text search is essential:
- Messy command-line arguments
 - Suspicious user-agent strings
 - Embedded payloads in HTTP requests
 - Unstructured authentication events or script contents
 
Without full-text search, a lot of critical data remains effectively invisible in the data lake. In standard data lake SQL query engines, most full-text or substring searches require full table scans, which are extremely slow - slow enough to make such searches infeasible in practice.
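To illustrate the problem, here is a sketch of what a substring hunt looks like in a SQL engine such as Amazon Athena; the database, table, and bucket names are hypothetical. Because the predicate is a leading-wildcard LIKE, the engine cannot prune partitions or skip data for that column - it has to scan and decompress essentially every row.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and result bucket names.
# A leading-wildcard LIKE forces a scan over every row in the table.
reverse_shell_hunt = """
SELECT eventtime, useridentity.arn, requestparameters
FROM security_lake.cloudtrail_logs
WHERE requestparameters LIKE '%/dev/tcp/%'
"""

response = athena.start_query_execution(
    QueryString=reverse_shell_hunt,
    QueryExecutionContext={"Database": "security_lake"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/hunts/"},
)
print("Query started:", response["QueryExecutionId"])
```

On multi-terabyte tables, a hunt like this can run for a very long time and bills for every byte scanned, which is why purpose-built full-text indexing matters.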
With full-text search, there are massive wins - and significant investigation power. Later in this blog series, we’ll discuss how we built Scanner to deliver highly performant full-text search.
Why the Security Data Lake Model Works
A modern security data lake redefines how organizations collect, store, and query logs. Here’s what makes it so powerful:
- Infinite Scalability: Object storage like Amazon S3 offers near-unlimited capacity and durability, removing the need for constant cluster management.
 - Cost Efficiency: Store raw logs for pennies per GB - orders of magnitude cheaper than SIEM storage costs.
 - Decoupled Compute: Run multiple analytics tools against the same dataset - Athena, Spark, or next-gen engines like Scanner.
 - Open Standards: Emerging schemas like the Open Cybersecurity Schema Framework (OCSF) and Elastic Common Schema (ECS) make cross-tool analysis and normalization easier (see the normalization sketch after this list).
 - Control and Compliance: Data remains in your cloud account - no vendor lock-in, and full custody of sensitive telemetry.
 
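As a small illustration of what schema normalization buys you, here is a sketch that maps a raw Okta System Log event onto a simplified subset of OCSF Authentication fields. The field selection is intentionally minimal and is not a complete or validated OCSF mapping.

```python
def to_ocsf_authentication(raw: dict) -> dict:
    """Map a raw Okta System Log event onto a simplified subset of OCSF
    Authentication fields. Illustrative only - not a complete OCSF mapping."""
    actor = raw.get("actor") or {}
    client = raw.get("client") or {}
    outcome = raw.get("outcome") or {}
    return {
        "class_uid": 3002,                      # OCSF Authentication event class
        "time": raw.get("published"),           # Okta's ISO 8601 timestamp
        "activity_name": raw.get("eventType"),
        "actor": {"user": {"name": actor.get("alternateId")}},
        "src_endpoint": {"ip": client.get("ipAddress")},
        "status": outcome.get("result"),
        "metadata": {"product": {"vendor_name": "Okta"}},
    }
```

Whether you target OCSF or ECS, the payoff is the same: once every source shares field names, a single detection or query can run across Okta, CloudTrail, EDR, and more.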
This model mirrors what happened in business analytics over the last decade: the move from rigid, monolithic data warehouses to open, flexible object storage-based platforms like Snowflake and Databricks.
Now, that same revolution is coming to security operations.
Challenges with Security Data Lakes
While data lakes are fantastic for infinite scale at a tiny fraction of the cost, security teams quickly hit some other roadblocks when trying to work with security logs for typical incident response and threat-hunting use cases.
The first challenge is ingesting all of your logs into the security data lake. The second is making that data easily queryable. The final challenge is query performance, since queries can take hours. Let’s address each of these in turn.
Log Ingestion: The First but Difficult Step
Getting data into your data lake is the foundation of the entire architecture. Fortunately, cloud providers and modern SaaS platforms have made major progress in enabling native log exports.
- AWS, Azure, and GCP all support direct log delivery into object storage, usually in gzipped JSON format. For example, AWS lets you export CloudTrail, VPC Flow Logs, CloudWatch Logs, and many other sources directly into S3 buckets in their raw native format. Essentially every cloud provider has a way to archive large volumes of logs into object storage.
 - Many SaaS platforms offer APIs for audit logs. Atlassian, Okta, Slack, PagerDuty, Salesforce, Stripe, and Zoom all let teams pull activity data via their APIs. Engineering teams often build their own connectors - for example, AWS Lambda functions that periodically poll these APIs (a minimal sketch follows this list). Others use HashiCorp’s Grove to collect these logs and store them in S3 for security analysis.
 - Teams can use tools like Fluent Bit, Vector, or Cribl to push logs into the lake, optionally transforming or filtering them along the way. Cribl can filter and transform data from many different kinds of logs and then route it to multiple destinations - one of the most popular being S3.
 - New managed services like Amazon AppFabric now aggregate SaaS audit logs automatically. However, its per-monitored-user pricing can add up to significant costs.
 
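For teams rolling their own connectors, the Lambda-style poller mentioned in the list above usually looks something like the sketch below: pull recent events from a SaaS audit API (Okta’s System Log here) and land them in S3 as gzipped newline-delimited JSON. The environment variable names and bucket layout are assumptions, and a real connector would persist its cursor between runs and follow pagination, rather than taking `since` from the invocation event.

```python
import gzip
import json
import os
from datetime import datetime, timezone

import boto3
import requests

# Hypothetical configuration, supplied via environment variables.
OKTA_BASE_URL = os.environ["OKTA_BASE_URL"]      # e.g. https://example.okta.com
OKTA_API_TOKEN = os.environ["OKTA_API_TOKEN"]
DEST_BUCKET = os.environ["DEST_BUCKET"]

s3 = boto3.client("s3")


def handler(event, context):
    """Poll the Okta System Log API and write the results to S3 as gzipped NDJSON."""
    since = event.get("since", "2024-01-01T00:00:00Z")
    resp = requests.get(
        f"{OKTA_BASE_URL}/api/v1/logs",
        headers={"Authorization": f"SSWS {OKTA_API_TOKEN}"},
        params={"since": since, "limit": 1000},
        timeout=30,
    )
    resp.raise_for_status()
    events = resp.json()

    # One gzipped newline-delimited JSON object per poll.
    ndjson = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S")
    key = f"okta/system-log/{now}.ndjson.gz"
    s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=gzip.compress(ndjson))
    return {"events_written": len(events), "key": key}
```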
S3 log exports come in a mix of formats - most commonly gzipped, newline-delimited JSON. Some sources emit one large JSON object containing a single array, where each element of the array is a log event. Others produce CSV or Parquet (like Amazon Security Lake). Raw log exports are often just plaintext. The result is a messy mix of file types sitting in object storage.
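A best-effort reader that copes with this mix might look like the following sketch: it handles gzipped or plain content, newline-delimited JSON, a single JSON array (or a CloudTrail-style `Records` wrapper), and falls back to treating unparseable lines as raw text. Columnar formats like Parquet would need a separate reader; the function is purely illustrative.

```python
import gzip
import json


def parse_log_object(raw: bytes) -> list:
    """Best-effort parser for the mix of formats found in S3 log exports:
    gzip or plain, NDJSON, a single JSON array/object, or plaintext lines."""
    # Gzip files start with the magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8", errors="replace").strip()

    # First, try to treat the whole object as one JSON document.
    if text.startswith(("[", "{")):
        try:
            doc = json.loads(text)
            if isinstance(doc, list):
                return doc
            # CloudTrail-style exports wrap events in a "Records" array.
            records = doc.get("Records")
            return records if isinstance(records, list) else [doc]
        except json.JSONDecodeError:
            pass  # likely NDJSON; fall through to line-by-line parsing

    # Newline-delimited JSON, falling back to plaintext for unparseable lines.
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            events.append({"raw": line})
    return events
```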
Scanner Collect is a new tool designed to receive and manage logs from a wide variety of sources. It supports HTTP push ingestion, API polling, and webhooks, so it can integrate with systems like Okta, 1Password, and Google Workspace using multiple authentication flows - pulling logs from security sources directly and receiving webhook payloads, logs, and alerts from tools that push via HTTP. Once collected, the logs are written to Amazon S3 as gzipped newline-delimited JSON files - a vendor-agnostic format, ready for security analysis.
In the next part of this blog series, we’ll cover how to make these messy logs actually queryable and usable!
