December 11, 2025

How to Build a Security Data Lake: Agentic AI in SecOps (Part 4)

We have reached the final chapter of our blog series on How to Build a Security Data Lake. In the first part, we covered the fundamentals of how a security data lake overcomes SIEM limits and provides increased visibility. In the second part, we addressed how to make your data lake logs queryable with high performance. Most recently, in the third part, we illustrated how to build detections and alerts on top of your data lake.

Having built the core capabilities of a Security Data Lake, we now turn our attention to Agentic AI in the SOC. 

The convergence of AI agents, data lakes, and fast query engines is transforming how teams approach detection engineering and incident response. Rather than manually writing and tuning detection rules, or spending hours investigating alerts, AI agents with access to your data lake can accelerate both the development and operational sides of security detection.

AI-Assisted Detection Rule Development

Modern detection engineering workflows use AI agents like Claude Code or Cursor connected to your data lake via MCP servers. Instead of manually writing detection queries, you collaborate with AI to rapidly develop, test, and validate rules.

The Traditional Process (Manual):

  1. Read documentation to learn query syntax
  2. Write a detection query based on what you think should work
  3. Test it against logs (trial and error)
  4. Adjust thresholds and filters based on results
  5. Repeat until the rule works and false positives are acceptable

This takes hours and requires deep technical knowledge of your query language and data model.

With Claude Code and Scanner MCP:

You describe the detection you want in plain English, and Claude Code queries your Scanner data lake to develop, test, and refine the rule:

You: "I want to detect data exfiltration through S3. Specifically, users downloading more than 1GB from sensitive buckets (buckets with 'customer' or 'sensitive' in the name) within a 1-hour window. But exclude known data analytics jobs."

Claude Code with Scanner MCP:

  1. Queries Scanner to explore your S3 access patterns
  2. Generates a detection query:
@index=cloudtrail
eventSource: "s3.amazonaws.com"
eventName: "GetObject"
requestParameters.bucketName: (*customer* OR *sensitive*)
NOT userIdentity.userName: (analytics-service bi-export backup-job)
| stats
  count() as access_count,
  sum(additionalEventData.bytesTransferredOut) as total_bytes
  by userIdentity.userName, requestParameters.bucketName
| eval gb = total_bytes / (1024 * 1024 * 1024)
| where access_count > 100 and gb > 1

  3. Tests it: "Running against last 30 days... Found 47 alerts. Let me analyze the false positives..."

You: "Most of those are from the finance team doing month-end exports. The other ones look suspicious."

Claude Code with Scanner MCP:

  1. Refines the rule to exclude finance team
  2. Retests: "Better. Now 2 alerts. These look like alerts worth investigating."
  3. Generates unit tests to validate the rule
  4. Creates a Scanner YAML file ready to deploy

Time saved: 4 hours of manual work → 15 minutes of conversation.

Claude Code handles the technical details while you focus on the security logic. The key advantage is that Claude can:

  • Query your actual data to understand your data model
  • Test proposed thresholds and see real results
  • Iteratively refine rules based on what the data shows
  • Generate unit tests covering true positives, false negatives, and edge cases

Example: Using Claude Code to Develop a Rule

In Claude Code (or Cursor), you can have a conversation like this:

You: "I need to build a detection rule for IAM privilege escalation. I want to catch users making multiple failed attempts to modify IAM policies (PutRolePolicy, AttachRolePolicy) within an hour—this suggests permission probing. But exclude terraform and ci-pipeline service accounts. Let me start by exploring what this activity looks like in my data."

Claude Code with Scanner MCP:

  1. Queries Scanner to explore IAM activity
  2. Writes an exploratory query
  3. Shows you sample results
  4. Based on results, generates a detection rule
  5. Tests it and shows you what it catches
  6. Iterates based on your feedback

Finally, it generates a Scanner YAML file ready to deploy:

# schema: https://scanner.dev/schema/scanner-detection-rule.v1.json
name: "IAM Privilege Escalation - Policy Modification Probing"
enabled: true
description: |
  Detects potential privilege escalation attempts through
  repeated failed attempts to modify IAM policies.

severity: High
query_text: |
  %ingest.source_type="aws:cloudtrail"
  eventSource="iam.amazonaws.com"
  eventName=(PutRolePolicy AttachRolePolicy)
  errorCode=*
  NOT userIdentity.userName=(terraform ci-pipeline)
  | stats
    count() as failed_attempts,
    min(timestamp) as firstTime,
    max(timestamp) as lastTime
    by userIdentity.userName
  | where failed_attempts >= 3

tags:
  - techniques.ta0004.privilege_escalation

time_range_s: 3600
run_frequency_s: 300

tests:
  - name: Detect privilege escalation probing
    now_timestamp: "2024-08-21T00:10:00.000Z"
    dataset_inline: |
      {"timestamp":"2024-08-21T00:05:00.000Z","%ingest.source_type":"aws:cloudtrail","eventSource":"iam.amazonaws.com","eventName":"PutRolePolicy","userIdentity":{"userName":"attacker"},"errorCode":"AccessDenied"}
      {"timestamp":"2024-08-21T00:05:30.000Z","%ingest.source_type":"aws:cloudtrail","eventSource":"iam.amazonaws.com","eventName":"AttachRolePolicy","userIdentity":{"userName":"attacker"},"errorCode":"UnauthorizedOperation"}
      {"timestamp":"2024-08-21T00:06:00.000Z","%ingest.source_type":"aws:cloudtrail","eventSource":"iam.amazonaws.com","eventName":"PutRolePolicy","userIdentity":{"userName":"attacker"},"errorCode":"AccessDenied"}
    expected_detection_result: true

  - name: No alert for excluded service accounts
    now_timestamp: "2024-08-21T00:10:00.000Z"
    dataset_inline: |
      {"timestamp":"2024-08-21T00:05:00.000Z","%ingest.source_type":"aws:cloudtrail","eventSource":"iam.amazonaws.com","eventName":"PutRolePolicy","userIdentity":{"userName":"terraform"},"errorCode":"AccessDenied"}
      {"timestamp":"2024-08-21T00:05:30.000Z","%ingest.source_type":"aws:cloudtrail","eventSource":"iam.amazonaws.com","eventName":"PutRolePolicy","userIdentity":{"userName":"terraform"},"errorCode":"AccessDenied"}
    expected_detection_result: false

You can also use Claude Code to:

  • Migrate rules from other platforms — Point Claude at a Splunk or Datadog rule and ask it to translate to Scanner syntax, then test for equivalence
  • Reduce false positives — Describe the false positive pattern you're seeing, have Claude analyze the data to find the root cause, and suggest refinements
  • Analyze coverage gaps — Have Claude review your entire detection rule library, map coverage to MITRE ATT&CK, and recommend new rules for high-risk gaps
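
As an illustration of the migration workflow, here is a hypothetical Splunk SPL rule and an approximate Scanner translation. The Scanner syntax mirrors the examples in this post; the exact field names and thresholds are assumptions you would verify by testing both rules against the same data:

# Splunk SPL (hypothetical source rule)
index=cloudtrail eventName=ConsoleLogin errorMessage="Failed authentication"
| stats count by userIdentity.userName
| where count > 5

# Approximate Scanner translation
%ingest.source_type="aws:cloudtrail"
eventName=ConsoleLogin
errorMessage="Failed authentication"
| stats count() as failed_logins by userIdentity.userName
| where failed_logins > 5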

Autonomous Alert Triage and Investigation

Beyond rule development, autonomous agents transform alert response by automatically investigating and triaging security alerts. Instead of analysts manually reviewing every alert to determine if it's a threat or false positive, agents do the heavy lifting instantly.

The Problem: Organizations generate 100+ alerts per day. Manually triaging each one wastes analyst time, since most are false positives (legitimate activity triggering detection rules). By the time analysts reach the real threats, incident response is already delayed.

Autonomous Alert Triage with Claude Agent SDK:

Here's a practical Python example using Claude Agent SDK to automatically triage alerts:

#!/usr/bin/env python3
import asyncio
import os
from datetime import datetime
from dotenv import load_dotenv
from claude_agent_sdk import query, ClaudeAgentOptions

async def alert_triage_agent(alert_id: str, alert_summary: str):
    """
    Automatically triage an incoming security alert.
    Classifies as benign, suspicious, or malicious based on
    data lake investigation.
    """
    load_dotenv()

    # Configure agent with Scanner MCP for data lake access
    options = ClaudeAgentOptions(
        model="claude-opus-4-5-20251101",
        allowed_tools=[
            "mcp__scanner__get_scanner_context",
            "mcp__scanner__execute_query",
            "mcp__scanner__fetch_cached_results",
        ],
        mcp_servers={
            "scanner": {
                "type": "http",
                "url": os.environ.get("SCANNER_MCP_URL"),
                "headers": {
                    "Authorization": f"Bearer {os.environ.get('SCANNER_MCP_API_KEY')}"
                }
            }
        }
    )

    # Define triage investigation
    prompt = f"""
    I'm receiving a security alert that needs immediate triage.

    **Alert ID**: {alert_id}
    **Alert Summary**: {alert_summary}

    Please perform the following investigation:

    1. **Gather Context**: Query Scanner to find:
       - Details about what this alert is detecting
       - Related activity from the same user/IP in the last 24 hours
       - Any pattern of similar activity

    2. **Analyze User Baseline**:
       - What is this user's normal activity pattern?
       - Is this behavior anomalous for them?
       - Have they done similar things before?

    3. **Assess Threat Indicators**:
       - Are there signs this is a real attack?
       - Are there indicators of compromise (multiple failed attempts, lateral movement, data access)?
       - Or does this look like normal operational activity?

    4. **Classify the Alert** as one of:
       - ✅ BENIGN: Legitimate activity, recommend closing
       - ⚠️ SUSPICIOUS: Warrants investigation, flag for human review
       - 🔴 MALICIOUS: High-confidence threat, escalate immediately

    Provide:
    - Your classification with confidence level (high/medium/low)
    - Key evidence supporting your decision
    - Specific recommendations for next steps
    - Any related alerts or activity that should be investigated together
    """

    print(f"[{datetime.now().isoformat()}] Triaging alert {alert_id}...")

    async for message in query(prompt=prompt, options=options):
        print(message)

    print(f"[{datetime.now().isoformat()}] Triage complete.")

# Usage: Can be triggered when alerts arrive
if __name__ == "__main__":
    # Example alert from your detection system
    alert_id = "alert_2024_08_21_001"
    alert_summary = "User john.smith downloaded 2.3GB from sensitive-customer-data S3 bucket in 10 minutes"

    asyncio.run(alert_triage_agent(alert_id, alert_summary))
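
In practice you would trigger this agent from your alerting pipeline rather than hardcoding the alert. A minimal sketch of the glue code (the payload field names `id`, `rule_name`, and `details` are assumptions; adapt them to whatever your detection system actually emits):

```python
import json

def parse_alert_payload(raw: str) -> tuple[str, str]:
    """Normalize a raw alert webhook payload into (alert_id, summary).

    Assumes a JSON body with 'id', 'rule_name', and 'details' fields --
    adjust to match your detection system's actual schema.
    """
    payload = json.loads(raw)
    alert_id = payload.get("id", "unknown")
    summary = f"{payload.get('rule_name', 'Unknown rule')}: {payload.get('details', '')}"
    return alert_id, summary

# Each incoming alert gets normalized, then handed to the triage agent:
#   alert_id, summary = parse_alert_payload(body)
#   asyncio.run(alert_triage_agent(alert_id, summary))
```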

When you run this agent on that S3 exfiltration alert, it automatically:

  1. Queries Scanner to understand what the alert detected
  2. Investigates john.smith's access patterns
  3. Checks if this is normal for them or anomalous
  4. Looks for related suspicious activity (failed logins, lateral movement, etc.)
  5. Provides a classification and evidence-based recommendation

Output example:

Classification: 🔴 MALICIOUS (High Confidence)

Evidence:
- User john.smith normally accesses 5-10 files per week from this bucket
- Today they downloaded 2.3GB in 10 minutes (672x normal daily volume)
- Related activity: 3 failed login attempts from unusual IP (192.0.2.1) 1 hour before
- New AWS API keys created by this user today
- Activity pattern matches known exfiltration behavior (rapid, systematic data collection)

Recommendation:
- Escalate immediately
- Disable john.smith's AWS credentials
- Investigate compromised credentials
- Check S3 bucket for unusual access from other accounts

What would normally take an analyst 30-45 minutes (investigating, creating tickets, notifying teams, documenting) happens in seconds via autonomous agents.
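
If you want to route these verdicts programmatically (for example, auto-close BENIGN, page on MALICIOUS), one approach is to extract the classification from the agent's final message. This sketch assumes the agent's output follows the "Classification:" format shown above:

```python
import re

# Map each classification to a downstream action; the labels match the
# prompt's BENIGN / SUSPICIOUS / MALICIOUS scheme used above.
ROUTING = {
    "BENIGN": "auto-close",
    "SUSPICIOUS": "queue-for-review",
    "MALICIOUS": "page-on-call",
}

def route_verdict(agent_output: str) -> str:
    """Extract the classification from the agent's report and pick an action.

    Falls back to human review if no classification is found, so a
    malformed report never silently auto-closes an alert.
    """
    match = re.search(r"Classification:.*?(BENIGN|SUSPICIOUS|MALICIOUS)", agent_output)
    label = match.group(1) if match else "SUSPICIOUS"
    return ROUTING[label]
```

The fail-safe default matters: an unparseable report should land in the human review queue, never in auto-close.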

Safe Agent Actions: Staging for Human Review

A critical principle when deploying autonomous agents: use read-only tools for investigation, and staging tools for response. Agents should gather context and make recommendations, but consequential actions should require human approval.

Staging tools create artifacts that humans review before anything irreversible happens:

  • Jira: Add a comment with findings and recommended next steps (human decides whether to escalate or close)
  • GitHub: Open a pull request to update a detection rule or blocklist (human reviews before merging)
  • Slack: Post to a triage channel with evidence summary (human decides response)
  • Ticketing: Create a draft incident report (human approves before sending)
  • Change management: Queue a credential rotation or firewall change (human approves execution)

What NOT to give agents (at least initially):

  • Direct credential revocation
  • Firewall rule modifications
  • Account disabling
  • Production config changes

This approach prevents agents from taking irreversible actions on false positives, creates an audit trail of agent reasoning, and keeps humans in the loop for high-stakes decisions. It also builds trust incrementally—start with read-only investigation, graduate to staging actions, and only later (if ever) allow direct automated response.

Configuring safe tool access:

options = ClaudeAgentOptions(
    model="claude-opus-4-5-20251101",
    allowed_tools=[
        # Read-only: investigation
        "mcp__scanner__execute_query",
        "mcp__scanner__fetch_cached_results",
        # Staging: require human review before consequences
        "mcp__jira__addComment",
        "mcp__github__create_pull_request",
        "mcp__slack__chat_postMessage",
        # NOT included: disable_user, revoke_credentials, modify_firewall
    ],
    # ... MCP server config
)

The exact tool names depend on your MCP server configurations—check each server's documentation for available tools.
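
The incremental-trust progression described above (read-only, then staging, then direct response) can be encoded as tiered allowlists, so promoting an agent is a one-line change. The tool names here are illustrative, following the naming used in the examples above:

```python
# Tool tiers ordered by increasing blast radius. Promote an agent by
# moving it up one tier at a time as it earns trust.
READ_ONLY = [
    "mcp__scanner__execute_query",
    "mcp__scanner__fetch_cached_results",
]
STAGING = READ_ONLY + [
    "mcp__jira__addComment",
    "mcp__github__create_pull_request",
    "mcp__slack__chat_postMessage",
]
# Intentionally identical to STAGING until the team explicitly opts in
# to direct-response tools like credential revocation:
DIRECT_RESPONSE = STAGING + []

TRUST_TIERS = {"read_only": READ_ONLY, "staging": STAGING, "direct": DIRECT_RESPONSE}

def allowed_tools_for(tier: str) -> list[str]:
    """Return the tool allowlist for a trust tier, defaulting to read-only."""
    return TRUST_TIERS.get(tier, READ_ONLY)
```

An unknown or misconfigured tier falls back to read-only, which is the safe direction to fail in.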

Example Scenario:

An alert fires: "User downloaded 2GB from sensitive S3 bucket in 10 minutes"

Without automation:

  • 10+ minutes: Analyst sees alert, queries logs to understand what happened
  • 5+ minutes: Analyst checks if this user normally does this
  • 5+ minutes: Analyst creates a ticket, fills in context
  • 5+ minutes: Analyst notifies the security team
  • Total: 25+ minutes before human decision-making

With autonomous investigation:

  • Alert fires → agent automatically investigates → agent creates ticket → agent posts to Slack → analyst receives complete briefing with evidence and recommended actions
  • Time: seconds

The Query Performance Requirement

For AI agents to be effective, they need fast access to your data. When an agent investigates an alert, it runs multiple ad-hoc queries to gather context:

  • "Show me all login activity from this user in the last 24 hours"
  • "What's their typical access pattern for this data?"
  • "Are there any other suspicious activities from this IP address?"

If each query takes 10 minutes or an hour, the investigation becomes impractical. But if each query takes seconds, the agent can gather a complete investigation in seconds.
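
To see why this matters, note that a single triage pass typically chains several queries run one after another, so per-query latency compounds. With assumed latencies (illustrative numbers, not benchmarks):

```python
QUERIES_PER_INVESTIGATION = 8  # context, user baseline, related activity, etc.

def investigation_time_s(per_query_s: float, queries: int = QUERIES_PER_INVESTIGATION) -> float:
    """Total wall-clock time if the agent runs its queries sequentially."""
    return per_query_s * queries

# Assumed latencies: 2 s per query on an indexed engine vs 10 min on a batch engine.
fast = investigation_time_s(2.0)    # 16 seconds: fine for real-time triage
slow = investigation_time_s(600.0)  # 4800 seconds (80 minutes): impractical
```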

This is where optimized query engines become critical. Traditional query engines like Athena work poorly here because they're built for batch analytics over static data. But data lakes with inverted indices optimized for full-text search and substring matching (like Scanner) can execute complex queries over massive datasets in seconds.

Cost and Speed Comparison (for ad-hoc investigation queries):

  • Traditional data lake (Athena): 4 hours to run, $100 cost per complex query
  • Optimized data lake (Scanner): Seconds to run, $0.10 cost per complex query

This orders-of-magnitude improvement in speed and roughly 1000x reduction in cost transforms data lakes from being unsuitable for real-time agent interaction into practical operational tools for 24/7 autonomous security operations.

The Full AI SecOps Workflow

The most effective teams combine all three components:

  1. Interactive Investigations — Analysts use AI with a data lake via MCP interface to explore incidents conversationally. Describe what to investigate in natural language, and AI writes queries, executes them, and summarizes results with evidence and timeline
  2. Detection Engineering — Use AI to develop and maintain detection rules, continuously improving coverage and accuracy
  3. Autonomous Workflows — Let agents monitor 24/7, triage 80% of alerts instantly, investigate new indicators, and respond to routine threats automatically

This allows your team to scale security operations dramatically—you can monitor terabytes of data per day, run hundreds of detection rules, and respond to threats 24/7 with a small team. The agents handle routine work (triage, investigation, response) while your team focuses on complex decisions, rule development, and strategic security improvements.

Looking Forward

The future of detection engineering in data lakes points to:

  1. AI agents for both automated detection engineering and autonomous alert investigation and enrichment
  2. Fast query engines with inverted indices to support real-time AI agent interaction over massive datasets
  3. Humans plus AI in the SOC, which delivers the best results. In our testing, AI alone made critical mistakes, but with proper human supervision the results were far better than either AI alone or humans alone.

This concludes our blog series on How to Build a Security Data Lake. Along the way, we covered how a security data lake overcomes SIEM limits, how to make your logs queryable and performant, how to build and test detections and alerts, and finally how agentic AI can accelerate both detection engineering and incident response.

Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, which provides fast search and threat detections for log data in S3. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it's mostly love these days.