November 30, 2025

How to Build a Security Data Lake: Detection Engineering in Data Lakes (Part 3)

Welcome back to our blog series on building Security Data Lakes. In the first part, we covered the fundamentals of how a security data lake overcomes SIEM limits and provides increased visibility. In the second part, we addressed how to make your data lake logs actually queryable and highly performant. Now, we turn our attention to adding detections on top of your data lake.

When building security detection systems on top of data lakes, you have three fundamentally different approaches to choose from. Each offers different trade-offs between complexity, efficiency, and detection capability. Understanding these three methods—scheduled query detections, streaming event-based detections, and streaming query detections—is critical for building effective detection engineering programs at scale.

The Three Detection Paradigms

Scheduled Query Detections: Power at the Cost of Scale

Scheduled query detections are the traditional approach that many organizations use in their SIEMs and data lakes today. These are saved SQL queries that run on a periodic schedule—perhaps once an hour, once every 15 minutes, or even once per minute. Tools like Panther, Splunk, and Elasticsearch all support this paradigm, as do direct implementations on top of Snowflake, Athena, Presto, or Amazon Security Lake using simple cron jobs.

The Appeal of Scheduled Query Detections

Scheduled query detections offer enormous flexibility for detection engineers. Because they're just arbitrary SQL queries, you can do truly complex analysis:

  • Baseline comparisons: Compute what normal activity looks like over a prior time period (e.g., average login counts, percentile distributions), then compare recent activity against that baseline to detect anomalies that are multiple standard deviations away from the mean.
  • Cross-source correlations: Join data across many different log sources and SQL tables to detect sophisticated attack patterns that span multiple systems.
  • Statistical analysis: Perform percentile calculations, spike detection, and other statistical functions that require looking at aggregate data rather than individual events.
  • Complex transformations: Use window functions, recursive CTEs, and other advanced SQL features to model complex security scenarios.

This flexibility is the key strength of scheduled query detections. If you can write SQL to express a detection idea, you can implement it.

Concrete Example: Brute-force Authentication Attack Detection

Here's a real-world example of a scheduled query detection you might run on your authentication logs:

SELECT
  user_id,
  event_type,
  COUNT(*) as event_count,
  MAX(timestamp) as last_event
FROM auth_events
WHERE event_type IN ('rate_limit_exceeded', 'account_lockout', 'login_failure')
  AND timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY user_id, event_type
HAVING COUNT(*) > 4


This query alerts you to brute-force attempts by detecting when a user triggers rate-limiting, account-lockout, or login-failure events more than 4 times in the past hour. You'd schedule it to run every 15 minutes to catch these attacks quickly.

This is simple, but you can extend it significantly. You might join against your user database to exclude service accounts, correlate across multiple identity providers, apply machine learning models to baseline typical user behavior, or filter out false positives based on geographic patterns. All of this is possible because you have the full power of SQL.
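
For illustration, here's a minimal sketch of how a detection like this might be wired up as a simple scheduled job in Python. The run_query and send_alert functions are hypothetical placeholders for your warehouse client (Snowflake, Athena, etc.) and your alerting pipeline:

import time

BRUTE_FORCE_SQL = """
SELECT
  user_id,
  event_type,
  COUNT(*) as event_count,
  MAX(timestamp) as last_event
FROM auth_events
WHERE event_type IN ('rate_limit_exceeded', 'account_lockout', 'login_failure')
  AND timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY user_id, event_type
HAVING COUNT(*) > 4
"""

def run_query(sql):
    # Placeholder: execute the query against your warehouse and return rows as dicts
    return []

def send_alert(title, context):
    # Placeholder: forward the alert to Slack, PagerDuty, a SOAR platform, etc.
    print(title, context)

def run_detection():
    # Every run rescans the full one-hour window, which is the inefficiency
    # discussed in the next section
    for row in run_query(BRUTE_FORCE_SQL):
        send_alert(
            title=f"Possible brute-force activity for user {row['user_id']}",
            context=row,
        )

while True:
    run_detection()
    time.sleep(15 * 60)  # run every 15 minutes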

The Scaling Problem

The fundamental problem with scheduled query detections is efficiency. As your data volumes grow and you want to run detections more frequently, this approach hits hard scaling limits.

When you run a scheduled query frequently over a large time range, you're rescanning the same data over and over again. If you want to detect anomalies in the last hour by comparing it to the baseline from the previous 30 days, and you run this query every minute, you're scanning that same 30-day window 1,440 times per day. This is wasteful.

At larger scales—when you're ingesting terabytes or tens of terabytes per day into your data lake—scheduled query detections become impractical. Your queries simply won't complete before the next scheduled run. Many organizations find they can only afford to run scheduled queries once per hour, sometimes once every 15 minutes, but rarely more frequently without hitting resource limits or incurring prohibitive costs.

Streaming Event-Based Detections: Efficient but Limited

The second approach is streaming event-based detections, popularized by tools like Panther and formalized by the Sigma detection rule framework. Instead of running periodic queries over historical data, streaming event-based detections process one log event at a time as it enters the system.

How Streaming Event-Based Detections Work

With this approach, as events flow into your data lake during the ETL process, each event is evaluated against a set of detection rules in real-time. Rules can examine:

  • Individual event properties and conditions
  • Sophisticated Python logic (in Panther's case) or pattern matching (in Sigma) to transform and analyze the event
  • A cache lookup to compare the current event against historical state stored in Redis, DynamoDB, or similar key-value stores

For example, here's what a streaming event-based detection looks like (Panther Python syntax):

def rule(event):
    # Only look at login events
    if event.get('event_type') != 'user_login':
        return False

    # Check if this looks like a suspicious command
    if 'admin' in event.get('user_roles', []) and \
       'sudo' in event.get('command', '').lower():
        return True

    return False

def title(event):
    return f"Admin user {event['user_id']} executed privileged command"

def dedup(event):
    # Deduplication string Panther uses to group matching events into one alert
    return f"{event['user_id']}-{event['timestamp']}"


Or a more complex example using cache lookups for impossible travel detection:

def rule(event):
    if event.get('event_type') != 'user_login':
        return False

    # Get previous location from cache. The `cache` object and
    # `calculate_distance` helper are illustrative stand-ins for a key-value
    # store (e.g. Redis or DynamoDB) and a great-circle distance function.
    previous_location = cache.lookup('last_login_location', event['user_id'])

    # Remember this login's location so the next login has a baseline
    cache.store('last_login_location', event['user_id'], {
        'lat': event['latitude'],
        'lon': event['longitude'],
        'timestamp': event['timestamp'],
    })

    if not previous_location:
        # First time seeing this user
        return False

    # Calculate distance and time elapsed (timestamps are epoch seconds)
    distance_km = calculate_distance(
        previous_location['lat'], previous_location['lon'],
        event['latitude'], event['longitude']
    )
    time_diff_hours = (event['timestamp'] - previous_location['timestamp']) / 3600

    # Speed required to make this journey
    required_speed = distance_km / time_diff_hours if time_diff_hours > 0 else float('inf')

    # Flag if the required speed exceeds 900 km/h (faster than commercial air travel)
    return required_speed > 900


These rules process one event at a time. When an event arrives, the rule executes immediately, reads (and updates) the cache as needed, and decides whether to fire an alert. No aggregation, no historical analysis—just event-by-event decisions.

The Efficiency Advantage

Streaming event-based detections are remarkably efficient. You can run hundreds or thousands of detection rules in parallel, each processing individual events. They scale linearly with event volume—more events means proportionally more processing, but you never have the rescanning problem of scheduled queries.

The rules are also flexible. In Panther, you write Python code to pull apart nested JSON, construct custom alert messages, and apply sophisticated logic to individual events. This is powerful for event-by-event analysis.

The Correlation Limitation

However, streaming event-based detections have a fundamental limitation: they can only look at one log event at a time (plus optionally one prior event stored in a cache). This makes complex pattern detection difficult.

Many attacks, especially against cloud infrastructure and identity systems, aren't obvious from a single event. A credential compromise often doesn't involve malware installation or obviously malicious behavior—instead, it manifests as a statistical anomaly. An attacker might have elevated access, causing unusual API call patterns or a spike in sensitive operations. Detecting this requires looking at aggregate data across many events, comparing current activity to historical baselines, and doing statistical analysis.
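
To make that concrete, here's a minimal sketch of the kind of baseline comparison such detection requires, written in plain Python with hypothetical field names and a simple 3-sigma threshold. A rule that sees one event at a time, plus at most one cached value, has no way to express this:

from statistics import mean, pstdev

def hourly_counts(events):
    # Aggregate raw events into per-user, per-hour API call counts
    counts = {}
    for e in events:
        key = (e["user_id"], e["timestamp"] // 3600)
        counts[key] = counts.get(key, 0) + 1
    return counts

def anomalous_users(baseline_events, recent_events, sigmas=3):
    # Build a historical baseline of hourly call counts for each user...
    history = {}
    for (user, _hour), count in hourly_counts(baseline_events).items():
        history.setdefault(user, []).append(count)

    # ...then flag users whose recent hourly activity sits far above their norm
    flagged = []
    for (user, _hour), count in hourly_counts(recent_events).items():
        past = history.get(user, [])
        if len(past) < 2:
            continue  # not enough history to build a baseline
        mu, sigma = mean(past), pstdev(past)
        if sigma > 0 and count > mu + sigmas * sigma:
            flagged.append((user, count, round(mu, 1)))
    return flagged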

There's also the practical problem of cache overhead. As your event volume grows, the sheer number of cache reads and writes required for correlations can become overwhelming. If you're storing the last IP address associated with each user and the last user associated with each IP address, with millions of events per second, you can saturate your cache layer.

Streaming Query Detections: The Best of Both Worlds

A newer paradigm, exemplified by systems like Scanner, offers a third approach: streaming query detections. This combines the analytical power of scheduled query detections with the efficiency and freshness of streaming event-based detections.

The Core Insight: Partial Query Evaluation

The key innovation is partial query evaluation combined with intelligent caching. When a detection query is executed, it's evaluated in three stages:

  1. Streaming stage: As events flow into the system, the query is partially executed up to the first aggregation operation. For example, if your detection is "count login events by IP in the last hour, then alert if count > 100", the streaming stage counts events by IP as they arrive.
  2. Caching stage: These partial results are stored in an intermediate data structure—specifically, an interval tree (also called a roll-up tree)—that makes it efficient to query over arbitrary time ranges later.
  3. Query stage: When you want to run the full detection query, instead of rescanning all the raw data, you look up pre-computed aggregate results from the interval tree, combine them, and evaluate the rest of the query. (A simplified sketch of this flow follows below.)
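
Here's that flow in plain Python, using the "count login events by IP in the last hour, alert if count > 100" example. This is illustrative only, not Scanner's implementation; field names are assumptions:

from collections import Counter, defaultdict

# Streaming stage: as events arrive, maintain partial aggregates
# (counts per source IP) keyed by the 1-minute bucket they fall into
partials = defaultdict(Counter)  # minute bucket -> per-IP counts

def ingest(event):
    if event["event_type"] != "user_login":
        return
    minute_bucket = event["timestamp"] // 60  # epoch seconds -> minute bucket
    partials[minute_bucket][event["source_ip"]] += 1

# Query stage: instead of rescanning raw logs, merge the pre-computed partial
# aggregates covering the last hour and apply the final predicate
def run_detection(now, threshold=100):
    current_minute = now // 60
    merged = Counter()
    for bucket in range(current_minute - 60, current_minute + 1):
        merged.update(partials.get(bucket, {}))
    return [ip for ip, count in merged.items() if count > threshold]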

The Interval Tree Data Structure

The interval tree is elegantly designed. In Scanner's implementation:

  • The root node spans 12 hours of data
  • Its two child nodes each span 6 hours
  • Their children span 3 hours, then 1 hour, continuing down to nodes that span 1 minute

This creates a balanced tree where looking up any arbitrary time range requires consulting only a logarithmic number of nodes—typically 4-5 lookups from S3. Compare this to a naive approach that would scan thousands of 1-minute time buckets.
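
As a rough illustration of why lookups stay logarithmic, here's a toy roll-up tree over 1-minute leaf buckets in plain Python. It stores a simple count per node rather than full query aggregates, and it is a simplification of the structure described above, not Scanner's exact layout:

class RollupTree:
    def __init__(self, num_levels=10):
        # Level 0 holds 1-minute leaves; each level up doubles the span,
        # so a parent node covers the union of its two children
        self.levels = [dict() for _ in range(num_levels)]

    def add(self, minute, count):
        # Update the leaf for this minute and every ancestor that covers it
        index = minute
        for level in range(len(self.levels)):
            self.levels[level][index] = self.levels[level].get(index, 0) + count
            index //= 2

    def query(self, start_minute, end_minute):
        # Greedily consume the largest aligned node that fits inside the
        # remaining range, so the number of node reads stays logarithmic
        total, minute = 0, start_minute
        while minute <= end_minute:
            level = 0
            while (level + 1 < len(self.levels)
                   and minute % (2 ** (level + 1)) == 0
                   and minute + 2 ** (level + 1) - 1 <= end_minute):
                level += 1
            total += self.levels[level].get(minute // (2 ** level), 0)
            minute += 2 ** level
        return total

# Example: two events land in different minutes; an arbitrary range query
# needs only a handful of node reads to sum them
tree = RollupTree()
tree.add(minute=125, count=3)
tree.add(minute=300, count=7)
print(tree.query(120, 360))  # -> 10

Because each node is a small object in S3, answering "how many matching events fall in this window" costs a few reads rather than a scan over every minute of raw data.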

Size and Cost Efficiency

The interval tree adds only about 1% overhead to the raw log data size. Nodes are empty if there are no matches for the query's filters, so the actual storage cost is minimal. For a detection query that filters to specific event types or conditions, the tree might be much smaller than 1%.

Live, Frequent Execution

This architecture enables running arbitrary detection queries as frequently as once per minute over large time ranges. You can have a detection that:

  • Looks at the last 2 hours of data
  • Searches for complex patterns (joins, aggregations, statistical analysis)
  • Runs every 60 seconds
  • Completes in seconds with just a few S3 reads

This is a massive improvement over scheduled query detections, which might only run every 15 minutes or hour over the same data.

Practical Examples: Three Streaming Query Detections

Let's look at three increasingly sophisticated detection queries that exemplify streaming query detections:

Example 1: Simple Suspicious Process Execution (Event-Based)

This detection identifies when potentially malicious utilities are executed on Windows systems. It's simple enough to work as a streaming event-based detection, but runs efficiently as a streaming query detection:

%ingest.source_type="windows:process_creation"
(Image: '\\finger.exe' OR OriginalFileName: 'finger.exe')
| stats
  min(@scnr.datetime) as firstTime,
  max(@scnr.datetime) as lastTime,
  count() as eventCount
  by Image, CommandLine, SubjectUserName
| where eventCount > 0


Finger.exe is a legacy utility that has no legitimate business purpose on modern Windows systems. Any execution is highly suspicious and worth investigating. You can run this across millions of process events with minimal cost, detecting the presence of a single suspicious event type in seconds.

Example 2: Detecting S3 Data Exfiltration (Aggregation-Based)

This detection identifies unusual S3 access patterns that might indicate data exfiltration. It goes beyond simple filtering to aggregate access patterns:

%ingest.source_type="aws:cloudtrail"
eventSource: "s3.amazonaws.com"
eventName: "GetObject"
| stats
  count() as access_count,
  sum(additionalEventData.bytesTransferredOut) as total_bytes,
  countdistinct(requestParameters.bucketName) as buckets_accessed,
  min(@scnr.datetime) as first_access,
  max(@scnr.datetime) as last_access
  by userIdentity.userName, sourceIPAddress
| eval gb = total_bytes / (1024 * 1024 * 1024)
| where access_count > 100 and gb > 0.5 and buckets_accessed > 3
| table userIdentity.userName, sourceIPAddress, access_count, gb, buckets_accessed, first_access, last_access


This enhanced detection finds users downloading large volumes (500MB+) across multiple buckets (3+) in a single 2-hour window. The addition of source IP address and the count of distinct buckets accessed helps distinguish automated exfiltration (accessing many buckets) from legitimate bulk data operations (typically accessing a single bucket). Running this every minute through streaming query detections gives you real-time visibility into exfiltration attempts.

Note: This detection would run with parameters like time_range_s: 7200 (2-hour lookback) and run_frequency_s: 60 (every minute), enabling frequent execution over a large time window that would be impractical for scheduled queries.

Example 3: Cross-Source Correlation - Failed MFA Followed by Sensitive IAM Activity

This example demonstrates the power of streaming query detections for multi-source correlation: detecting when a user experiences multiple failed MFA attempts in Okta, then immediately performs sensitive IAM operations in AWS. This pattern indicates a potential credential compromise with active exploitation.

This type of detection requires three separate detections working together:

Detection 1: Failed MFA Attempts (Okta)

This detection identifies users experiencing repeated MFA failures in Okta, which can indicate either an attacker guessing credentials or a compromised account where an attacker is attempting to bypass MFA:

%ingest.source_type="okta:system"
eventType: "user.authentication.auth_failure"
eventDescription: "MFA"
| stats
  count() as mfa_failures,
  min(@scnr.time_ns) as first_attempt_ns,
  max(@scnr.time_ns) as last_attempt_ns
  by @ecs.user.email
| where mfa_failures >= 3

Detection 2: Sensitive IAM Activity (AWS CloudTrail)

This detection flags users making sensitive IAM changes—creating access keys, attaching policies, or modifying trust relationships. These are high-value operations that attackers prioritize after gaining initial access:

%ingest.source_type="aws:cloudtrail"
eventName: (
  "CreateAccessKey" "AttachUserPolicy" "PutUserPolicy" "CreatePolicyVersion"
)
| stats
  count() as iam_changes,
  countdistinct(eventName) as activity_types,
  max(@scnr.time_ns) as iam_time_ns
  by @ecs.user.email
| where iam_changes >= 1

Detection 3: Correlation Detection (via _detections index)

Both detections write alerts to the _detections index. This meta-detection queries that index to find cases where both the MFA and IAM detections fire for the same user within a short time window—a strong signal of active compromise:

@index="_detections"
(name: "Failed MFA Attempts" OR name: "Sensitive IAM Activity")
| stats
  count() as alert_count,
  countdistinct(name) as alert_types,
  min(@scnr.time_ns) as first_alert_ns,
  max(@scnr.time_ns) as last_alert_ns
  by results_table.rows[0].@ecs.user.email
| eval duration_seconds = (last_alert_ns - first_alert_ns) / 1000000000
| where alert_types >= 2 and duration_seconds <= 3600
| rename results_table.rows[0].@ecs.user.email as user_email, alert_count as total_alerts, duration_seconds as time_between_alerts_seconds
| table user_email, total_alerts, alert_types, time_between_alerts_seconds

Time Range and Frequency

Each of these three detections has two parameters set separately: time_range_s and run_frequency_s. For example:

  • Detections 1 & 2 (individual signals): time_range_s: 7200 (2 hours), run_frequency_s: 300 (every 5 minutes)
  • Detection 3 (correlation): time_range_s: 3600 (1 hour for correlation window), run_frequency_s: 300 (every 5 minutes)

This frequent execution with large time windows would be prohibitively expensive with scheduled queries but is trivial with streaming query detections thanks to the pre-computed interval trees.

The Cost Advantage

The key insight: traditional scheduled query detections would require rescanning hours of log data every time the query runs. Streaming query detections instead leverage pre-computed aggregations (interval trees) stored efficiently in S3. You can afford to run these correlation detections every 5 minutes, all day, across terabytes of data—something that would bankrupt a scheduled query-based system. This is why streaming query detections unlock practical, high-frequency, sophisticated correlation detection at scale.

Scalability at Data Lake Volumes

This matters enormously at data lake scale. If you're ingesting terabytes or tens of terabytes per day, you can run hundreds or even thousands of detection rules via streaming query detections at minimal cost. This simply isn't feasible with scheduled query detections.

Alert Routing

When a streaming query detection triggers an alert, Scanner automatically routes it to configured "event sinks"—alert destinations where your team sees the alert:

  • Chat platforms: Slack channels for real-time notification
  • Incident response: PagerDuty for on-call escalation
  • SOAR integrations: Tines, Torq, or other automation platforms via webhook for automated response workflows
  • Ticketing: Jira for creating and tracking incidents

Every detection alert is also written to a special _detections index—a searchable, queryable record of all alerts your detection system has fired. As demonstrated in Example 3, this index enables meta-detections that correlate multiple detection rules together, creating a powerful feedback loop: raw logs → detection rules → alerts in the _detections index → correlation detections → further refinement of your detection strategy.

Streaming Event-Based vs. Streaming Query Detections

It's worth emphasizing that streaming event-based detections remain valuable—they excel at detecting single-event anomalies that are obvious from examining one log entry. Sigma rules for EDR logs looking for specific command-line executions or file operations are perfect use cases.

However, streaming event-based detections fundamentally cannot do complex aggregations, correlations, and statistical analysis. As cloud infrastructure and identity-based attacks become more prevalent, these capabilities become increasingly important.

Streaming query detections don't replace streaming event-based detections—they complement them. You use both: streaming event-based rules for obvious, single-event indicators of compromise, and streaming query detections for complex, statistically-based threat detection.

Up Next: Detection Engineering in the AI Era

One of the magical things about Security Data Lakes is that they are ideally positioned to serve as the platform for AI-driven threat hunting and incident response. In the next part of this blog series, we will delve into the convergence of AI agents and data lakes, and the cost and performance gains this unlocks compared to traditional SIEMs.

Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, which provides fast search and threat detections for log data in S3. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it's mostly love these days.