March 17, 2026

We Indexed a Petabyte Over a Weekend

We indexed 1.4 PiB of log data, 689 billion events, in 80 hours. The pipeline ran unattended from start to finish.

Scanner has an unusual indexing and query architecture that produces this kind of efficiency. A typical SIEM charges on the order of $1M/year to ingest 1 TB/day, roughly 350 TiB/year. We indexed four years' worth of that volume in a weekend, at orders of magnitude less cost. Most organizations never attempt anything close to a petabyte. They downsample, set retention limits, or simply accept that older data is gone.

This post covers the architecture and what makes it work. We indexed this dataset partly to stress test the system at scale, and partly to build a realistic environment for threat hunting. We buried a simulated APT campaign among 689 billion legitimate events to see how the investigation workflow holds up.

The Setup

We filled an S3 bucket with 1.4 PiB of AWS CloudTrail logs covering months of synthetic API activity across EC2, IAM, S3, Lambda, and dozens of other services. The configuration was straightforward: point Scanner at the bucket, specify gzipped JSON with CloudTrail defaults, and let it run. Scanner indexes the data continuously in its native semi-structured JSON. The only transformation was adding Elastic Common Schema (ECS) field names alongside the original fields, which is on by default for CloudTrail. All fields are indexed and searchable by default, so there's nothing to configure up front beyond the source bucket and format.

Throughput

Here's the daily throughput, pulled directly from Scanner's _usage index:

Day Data Indexed Log Events
Day 1 319 TiB 155 billion
Day 2 369 TiB 178 billion
Day 3 526 TiB 254 billion
Day 4 211 TiB 102 billion

Total: 1.4 PiB, 689 billion log events.

The whole run took roughly 80 hours of wall-clock time.

The shape of the curve tells a story. Day 1 started slow as the pipeline ramped up, discovering files in the bucket and filling its work queues. By Day 1 afternoon it reached 14–18 TiB/hour. Day 2 held a steady plateau around 14–16 TiB/hour. Day 3 was the peak: the pipeline was fully saturated, sustaining above 20 TiB/hour for the entire day. Day 4 continued at full speed until the work queue drained around 10:00.

Peak throughput was 23.9 TiB/hour, or 6.8 GiB/second sustained over a full hour of parsing and indexing CloudTrail JSON. During peak hours on Day 3, the pipeline was processing roughly 11 billion log events per hour, about 3 million events per second. It held above 19 TiB/hour for 34 consecutive hours, from Day 3 morning through Day 4 morning.

Architecture

Pipeline

ECS Fargate tasks scale up as SQS queues fill with work. Each task reads raw log files from S3, parses and indexes them, and writes index files to a separate Scanner index bucket. A merge step combines small index files into larger ones for efficient querying.

Stage Component
Source S3 (raw log files)
Trigger EventBridge
Queue SQS
Compute ECS Fargate workers
Output S3 (index files)
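The pipeline stages above can be sketched as a single worker's loop. This is a minimal illustration with in-memory stand-ins for SQS, S3, and the indexer; all names are hypothetical, not Scanner's actual API.

```python
# Minimal sketch of one indexing worker's loop. The queue, stores, and
# index_file() below are in-memory stand-ins for SQS, S3, and the indexer.

def index_file(raw_bytes: bytes) -> bytes:
    """Stand-in for parse + index: summarizes the raw file."""
    return b"index-of-%d-bytes" % len(raw_bytes)

def worker_loop(queue, raw_store, index_store):
    """Drain the queue: each message names one raw log file to index."""
    while queue:
        key = queue.pop(0)                            # SQS ReceiveMessage
        raw = raw_store[key]                          # S3 GetObject, log bucket
        index_store[key + ".idx"] = index_file(raw)   # S3 PutObject, index bucket

queue = ["logs/2025/03/15/a.json.gz", "logs/2025/03/15/b.json.gz"]
raw_store = {k: b"x" * 100 for k in queue}
index_store = {}
worker_loop(list(queue), raw_store, index_store)
```

In the real pipeline, thousands of these loops run concurrently, which is why the coordination layer described below matters.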

The bottleneck is S3 write throughput, not CPU. We tune the worker count to keep S3 PutObject at steady state without triggering throttling (S3 returns 503 SlowDown when you push too hard on a single prefix). The index files are written across a distributed prefix space, so the ceiling is high.
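One common way to distribute keys across a prefix space is to lead each key with a stable hash shard, so writes fan out instead of hammering one prefix. A sketch of that idea, with an illustrative 256-shard scheme that is not Scanner's actual key layout:

```python
# Sketch of spreading index files across a distributed S3 prefix space so
# no single prefix becomes a PutObject hotspot. The shard scheme here is
# illustrative, not Scanner's actual key layout.
import hashlib

def sharded_key(base_key: str, shards: int = 256) -> str:
    """Prepend a stable hash shard so writes fan out across prefixes."""
    digest = hashlib.sha256(base_key.encode()).hexdigest()
    shard = int(digest[:8], 16) % shards
    return f"{shard:02x}/{base_key}"

key = sharded_key("indexes/2025/12/01/part-0001.idx")
```

Because S3 scales request limits per prefix, each additional shard raises the aggregate PutObject ceiling.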

Indexing isn't the only thing the pipeline does. Each worker also runs detection rules against the data as it flows in. The detection engine uses a streaming query engine that computes partial results from each batch of log events and caches them in S3. Those cached partial results are reused across subsequent evaluations, so the work is incremental rather than re-scanning from scratch. All active detection rules run in parallel as data is indexed. For this run, we had roughly 70 detection rules evaluating continuously across all 1,200 workers. The detection cache is typically less than 1% the size of the index files.
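The incremental evaluation idea can be shown with a toy detection rule that counts events per source IP: each batch produces a cached partial, and the rule's state is a merge of partials rather than a re-scan. This is a hypothetical sketch, not Scanner's detection engine.

```python
# Sketch of incremental detection evaluation: each batch yields a partial
# result (counts per source IP) that is cached and merged with prior
# partials, so nothing is re-scanned from scratch.
from collections import Counter

cache = {}  # batch_id -> partial Counter, standing in for the S3 cache

def evaluate_batch(batch_id, events):
    if batch_id not in cache:  # only compute partials for unseen batches
        cache[batch_id] = Counter(e["sourceIPAddress"] for e in events)
    return cache[batch_id]

def rule_state(batch_ids):
    """Merge cached partials into the rule's current view."""
    total = Counter()
    for b in batch_ids:
        total.update(cache[b])
    return total

evaluate_batch(1, [{"sourceIPAddress": "185.234.72.111"}] * 3)
evaluate_batch(2, [{"sourceIPAddress": "185.234.72.111"}] * 2)
state = rule_state([1, 2])
```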

Coordination

All coordination happens through three backing stores:

  • SQS queues the work. Each message represents a file to index. SQS gives us at-least-once delivery, which means the same message can be delivered more than once.
  • DynamoDB tracks indexing progress. It records which files have been processed, manages active leases on files, and provides the idempotence guarantees that make at-least-once delivery safe. If two workers pick up the same file, the lease system ensures only one writes the index.
  • MySQL serves as the metadata store. It tells the query engine which index files exist and are available to search. When a worker finishes writing an index file to S3, it registers the file in MySQL with an atomic operation. Queries see a consistent view of available data.
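The lease check that makes at-least-once delivery safe can be sketched as a conditional "put if absent": exactly one worker wins the write for a given file. This mimics the semantics of a DynamoDB PutItem with a ConditionExpression; the names and in-memory table are illustrative.

```python
# Sketch of lease acquisition: a conditional write succeeds for exactly
# one worker per file, so duplicate SQS deliveries are harmless.

leases = {}  # file_key -> worker_id, standing in for the DynamoDB table

def try_acquire_lease(file_key: str, worker_id: str) -> bool:
    """Atomic 'put if absent'. Only the first caller wins."""
    if file_key in leases:   # ~ ConditionExpression: attribute_not_exists
        return False
    leases[file_key] = worker_id
    return True

first = try_acquire_lease("logs/a.json.gz", "worker-1")
second = try_acquire_lease("logs/a.json.gz", "worker-2")  # duplicate delivery
```

The losing worker simply moves on to its next message instead of writing a duplicate index file.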

Index Format

Scanner has an unusual indexing approach compared to traditional telemetry query engines. Elasticsearch and other Lucene-based systems maintain fine-grained inverted indexes: one entry per term per document, with posting lists, term frequencies, and scoring metadata. A Lucene index commonly grows to 2-3x the size of the original data and must reside on fast attached storage or in memory. Building it is CPU-intensive by nature.

Scanner's index maps tokens to pages of roughly 100K events rather than to individual log events. This coarser granularity means far less bookkeeping during indexing and produces a much smaller index. Each data group holds the log events themselves, a string token index, a number index, and column statistics. Scanner doesn't query your source files at read time; it queries its own optimized copy. Despite storing the full log data plus all index structures, the index is stored compressed in S3 and is only 10% of the raw data size. For this run, that meant 139.1 TiB of index files from 1.4 PiB of raw input. An Elasticsearch cluster indexing the same data would need several petabytes of attached storage for its indexes alone.

The page is the atomic unit of both indexing and querying. The token index maps tokens to the page byte ranges that contain them. The number index describes the distribution of numerical values in each page. Index structures are serialized in Rust using bincode and rkyv (a zero-copy deserialization framework), both chosen for minimal overhead when converting between in-memory data structures and their serialized representation on S3.
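The token index's coarse granularity is easy to see in miniature: tokens map to the pages that contain them, not to individual events, so the bookkeeping stays small. A toy sketch with a tiny page size (the real pages hold ~100K events); the structure is illustrative, not Scanner's serialized format.

```python
# Sketch of a page-granular token index: tokens map to pages, not events.
from collections import defaultdict

PAGE_SIZE = 4  # ~100K events in the real system; tiny here for illustration

def build_token_index(events):
    index = defaultdict(set)
    for i, event in enumerate(events):
        page = i // PAGE_SIZE
        for token in event.split():
            index[token].add(page)
    return index

events = ["iam CreateUser", "s3 GetObject", "s3 GetObject",
          "ec2 RunInstances", "s3 GetObject", "iam DeleteUser"]
idx = build_token_index(events)
```

A query for a rare token like "CreateUser" narrows immediately to one page; a common token like "s3" matches most pages, and the scan does the rest.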

The design follows from S3's latency characteristics. Reading the first byte from S3 takes 100-500ms regardless of read size. Since that cost is already paid, Scanner reads in larger chunks: a full 100K-event page per request, then decompresses and scans it with SIMD-accelerated parsing. The result is query latency measured in seconds rather than milliseconds, but across a petabyte of data, a few seconds is excellent performance. Traditional search engines optimize for sub-millisecond queries over gigabytes. Scanner optimizes for seconds-scale queries over petabytes.
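A back-of-envelope calculation shows why large reads win under fixed first-byte latency: per-request overhead dominates small reads and vanishes for large ones. The latency and bandwidth figures below are illustrative assumptions, not measurements of S3 or Scanner.

```python
# Why read whole pages: with a fixed time-to-first-byte, effective
# throughput collapses for small reads. Numbers are assumed, not measured.

FIRST_BYTE_S = 0.2     # assumed S3 time-to-first-byte
STREAM_GIBPS = 0.5     # assumed streaming bandwidth once bytes flow

def effective_throughput_mib_s(read_mib: float) -> float:
    transfer_s = (read_mib / 1024) / STREAM_GIBPS
    return read_mib / (FIRST_BYTE_S + transfer_s)

small = effective_throughput_mib_s(0.064)  # a 64 KiB posting-list read
large = effective_throughput_mib_s(64.0)   # a full compressed page
```

Under these assumptions the page-sized read is hundreds of times more efficient per request, which is exactly the trade the design makes.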

Each incoming raw log file produces a small, self-contained index. These are efficiently merged into larger ones over time, so there is no single global index to build or maintain. Indexing is embarrassingly parallel, and merging is efficient because it combines pre-built structures rather than rebuilding from scratch. With this approach, the limiting factor for indexing throughput is S3 write I/O, not CPU.
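Merging pre-built page-granular indexes reduces to a union with re-based page numbers, which is why it is so much cheaper than re-indexing. A hypothetical sketch of that operation:

```python
# Sketch of merging two small token indexes: union the token->pages maps,
# renumbering the second index's pages after the first's. Illustrative
# structure, not Scanner's serialized format.

def merge_indexes(a, a_pages, b):
    """Union token->pages maps; b's pages are renumbered after a's."""
    merged = {t: set(p) for t, p in a.items()}
    for token, pages in b.items():
        merged.setdefault(token, set()).update(p + a_pages for p in pages)
    return merged

a = {"s3": {0, 1}, "iam": {0}}
b = {"s3": {0}, "ec2": {1}}
merged = merge_indexes(a, a_pages=2, b=b)
```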

More detail on the index format and query engine is in How Scanner Achieves Fast Queries.

Read Path

The read path is completely independent of the write path. At query time, Scanner launches serverless Lambda functions that visit thousands of index files in parallel. Each function narrows down to the pages that contain hits, scans those pages, and produces partial results structured as monoids: algebraic values that can be merged associatively. The coordinator combines these partial results as they stream back, so query performance scales with the number of Lambda invocations rather than total data size.
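The monoid structure is the key to merging results in any arrival order. A count-by-key aggregation is the simplest example: it merges associatively and has an identity element. This sketch is illustrative of the shape, not of Scanner's coordinator.

```python
# Sketch of monoid-shaped partial results: each worker returns a value
# that merges associatively, so the coordinator can fold results in any
# arrival order. Count-by-key is the simplest such monoid.
from collections import Counter

def merge(a: Counter, b: Counter) -> Counter:
    """Associative, with Counter() as the identity element."""
    return a + b

# Three workers return partials for a count-by-bucketName aggregation:
partials = [Counter({"archive": 20_000}),
            Counter({"archive": 30_000}),
            Counter({"assets": 5})]

left = merge(merge(partials[0], partials[1]), partials[2])
right = merge(partials[0], merge(partials[1], partials[2]))
```

Because `left == right` regardless of grouping, the coordinator never has to wait for results in order.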

Queries against the demo-data index worked normally while nearly 24 TiB/hour of new data was streaming in. The pipeline writes immutable index files to S3: no lock contention, no write-ahead log, no compaction pauses. A query that started before a batch of index files landed won't see those events yet, but it won't slow down or fail. The next query picks them up.

What We Improved

Our previous largest run topped out at 83 TiB/day. Here's what got us to 526 TiB/day.

Lease conflict resolution. With hundreds of concurrent workers and at-least-once delivery from SQS, conflicts are inevitable. The previous system handled conflicts correctly but wastefully: a worker that lost a lease would fail its task and the work would be requeued. The new system resolves conflicts gracefully. Workers check leases before committing, back off when they detect contention, and resume with the next item in the queue. Less wasted compute, higher effective throughput.

Auto-scaling. The scaling system got significantly better at matching capacity to demand. It ramps up aggressively as queues fill, then scales down quickly as they empty. Critically, it protects in-flight Fargate tasks from termination. You don't want to kill a worker that's 90% through indexing a file. The scaler tracks task progress and only removes idle capacity.

Parallel data file creation. Workers previously built each component of a data group (log events, token index, number index, column statistics) sequentially. Now they build them in parallel. This doesn't change the I/O profile, but it reduces the wall-clock time per file, which means each worker processes more files per hour.
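The sequential-to-parallel change amounts to fanning the four component builds out to a pool. A minimal sketch with stub builders; the component names come from the index format described above, but the builder API is hypothetical.

```python
# Sketch of building a data group's four components in parallel rather
# than sequentially. Builders here are stubs; the real ones produce
# serialized index structures.
from concurrent.futures import ThreadPoolExecutor

def build_component(name: str) -> tuple[str, str]:
    return name, f"built:{name}"  # stand-in for real serialization work

COMPONENTS = ["log_events", "token_index", "number_index", "column_stats"]

with ThreadPoolExecutor(max_workers=len(COMPONENTS)) as pool:
    data_group = dict(pool.map(build_component, COMPONENTS))
```

Since the components are independent, wall-clock time per file approaches the slowest component's build time instead of the sum of all four.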

The indexing code, JSON parser, index format, and query engine were all unchanged from prior runs.

Compute and Storage

Each indexing worker is a single vCPU with 2 GB of memory. The pipeline scaled from idle to 1,200 concurrent workers at peak, held above 1,000 for over 24 hours, then drained back to idle. At peak: 1,200 vCPUs and ~2.3 TiB of memory, all ephemeral. The only permanent infrastructure is S3 and the metadata store.

The ~577K index objects span 13 months of date-partitioned prefixes. Merge workers continuously consolidated small files into larger ones as data flowed in.

Hunting Through a Petabyte

Indexing fast is pointless if you can't search fast. So we buried a needle in the haystack.

Investigating an APT campaign across 13 months of logs is typically a multi-day or multi-week effort. Each query is expensive, iteration is slow, and analysts have to know what they're looking for before they start. We wanted to see what happens when that constraint disappears.

The dataset includes a simulated APT campaign we call Operation Crimson Badger. An attacker compromises a legitimate user's credentials, creates a backdoor IAM user, maintains C2 beaconing over eight months, and exfiltrates data from S3. About 60K events spread across 13 months, hidden among 689 billion legitimate ones.

One of the 70 detection rules running during indexing caught an anomalous spike in S3 GetObject calls: ~50K requests from a single IP, 185.234.72.111, in December. That was the alert. We pointed Claude Code at it, connected to Scanner through its MCP server, and started pulling the thread.

Step 1: What bucket is being targeted?

Claude Code queried Scanner for all activity from the flagged IP:

@index=demo-data sourceIPAddress:"185.234.72.111"
| stats count() by requestParameters.bucketName
bucketName count
initechdata-archive-ce60160 ~50K

One bucket. Targeted exfiltration.

Step 2: Trace the timeline backwards.

We ran the same IP query month by month across the full 13-month dataset. Each query scanned 230-410 GiB of index data and returned in under two minutes.

Month ListBuckets DescribeInstances ListObjectsV2 GetObject
Mar 2025 - - - -
Apr 2025 2 2 - -
May 2025 2 2 - -
Jun 2025 2 2 - ~10K
Jul 2025 2 2 - -
Aug 2025 1 1 - -
Sep 2025 2 2 - -
Oct 2025 2 2 - -
Nov 2025 1 1 1 10
Dec 2025 - - - ~50K

The pattern is textbook APT. Low-and-slow reconnaissance for months: a handful of ListBuckets and DescribeInstances calls each month, just enough to maintain access and map the environment. A first wave of exfiltration in June (~10K files). Quiet again through the summer. In November, the attacker scopes the target. ListObjectsV2 appears for the first time alongside 10 test downloads. Then in December, mass exfiltration: ~50K files from a single archive bucket.

Step 3: Find the persistence mechanism.

@index=demo-data eventName:"CreateUser" "svc-cloudwatch-sync"
Field Value
eventTime 2025-03-15T02:31:00Z
sourceIPAddress 185.234.72.111
userIdentity.userName david.chen
requestParameters.userName svc-cloudwatch-sync
awsRegion us-east-1

There it is. On March 15th at 2:31am UTC, the attacker used compromised credentials for david.chen to create a backdoor IAM user named svc-cloudwatch-sync. The name is designed to look like a service account. An analyst scanning IAM users would see it and move on. The attacker then used this account for eight months of quiet beaconing and eventual data theft.

The entire investigation, from detection alert to full attack timeline, took less than 10 minutes. Each query against a year of data took between 10 seconds and 2 minutes. We asked Claude Code to pivot, correlate, and dig deeper, and it ran each query through Scanner's MCP server interactively. The petabyte-scale dataset wasn't a barrier to exploration. We followed leads as they emerged, without needing to know what we were looking for before we started.

Why This Matters

This is what indexing and searching S3 directly unlocks: workflows that weren't possible before, like fast threat hunting through years of logs, or continuous AI agents that hunt across your environment automatically. All at a fraction of the ingest volume cost of a traditional SIEM.

With an index optimized for searching directly in S3, each query costs a handful of cents and takes seconds to run. The investigation above ran a dozen queries across 13 months of data interactively, following each lead as it emerged.

The closest alternative for this use case, searching historical CloudTrail data in S3, is Amazon Athena. In practice, most teams query raw JSON with it, because converting a petabyte of CloudTrail to partitioned Parquet is its own data engineering project. Teams routinely wait tens of minutes and spend dozens or hundreds of dollars per Athena query. An investigation like the one above, where each step informs the next, might cost hundreds of dollars and take many hours to complete. You often need to know what you're looking for before you start, because iterating in Athena is too expensive.

If you are looking for a powerful security data lake, where you can point AI agents at years of logs and get answers in seconds, try Scanner.

Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, which provides fast search and threat detections for log data in S3. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it's mostly love these days.