November 28, 2023

Introducing New Statistical Aggregations: Average, Percentile, Variance, and More

We’re excited to announce the release of new statistical aggregation functions in Scanner’s query language, which help you explore your logs in powerful ways.

Introducing stats queries

Scanner now supports stats queries, which compute statistical aggregations over the results of a search, optionally grouped by one or more columns. The general form is:

* | stats <functors> by col1, col2, ...
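
To make the grouping semantics concrete, here is a minimal Python sketch of what a query like `* | stats count() by col1, col2` computes conceptually. The `stats_count_by` helper and the sample events are hypothetical names invented for illustration; this is not how Scanner implements it.

```python
from collections import Counter

def stats_count_by(events, *cols):
    """Count events per unique combination of the given columns."""
    counts = Counter()
    for event in events:
        # The group key is the tuple of values in the "by" columns.
        key = tuple(event.get(col) for col in cols)
        counts[key] += 1
    return dict(counts)

# Hypothetical log events, invented for illustration.
events = [
    {"col1": "a", "col2": "x"},
    {"col1": "a", "col2": "x"},
    {"col1": "a", "col2": "y"},
    {"col1": "b", "col2": "x"},
]

grouped = stats_count_by(events, "col1", "col2")
for key, n in sorted(grouped.items()):
    print(key, n)
```

Each distinct (col1, col2) pair becomes one output row, with the aggregation computed over the events in that group.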

For example, let’s say you would like to know if any employees are querying your company’s S3 buckets at high levels. This could indicate that an employee’s user identity has been compromised and is being used to steal data from S3.

Here is a stats query in Scanner that retrieves all of the S3 requests made by employee IAM user identities and then computes the average, median, and 90th percentile of request counts by user.

userIdentity.type: "IAMUser" and eventSource: "s3.amazonaws.com"
| stats count() as numReqs, userIdentity.arn by userIdentity.arn
| stats avg(numReqs), percentile(50, numReqs),
  percentile(90, numReqs)
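
The two-stage aggregation above can be sketched in plain Python: first count requests per user, then summarize the distribution of those counts. The event data, the account ID, and the `percentile` helper are all made up for illustration, and the interpolated percentile definition used here is an assumption; Scanner may define or approximate percentiles differently.

```python
from collections import Counter
from statistics import mean, quantiles

def percentile(n, values):
    """n-th percentile (1 <= n <= 99), linearly interpolated between data points."""
    return quantiles(values, n=100, method="inclusive")[n - 1]

# Hypothetical per-user request volumes, invented for illustration.
simulated = {"alice": 12, "bob": 7, "carol": 9, "dave": 240, "erin": 15}
events = [
    {"userIdentity.arn": f"arn:aws:iam::123456789012:user/{name}"}
    for name, volume in simulated.items()
    for _ in range(volume)
]

# Stage 1: stats count() as numReqs by userIdentity.arn
num_reqs = Counter(event["userIdentity.arn"] for event in events)

# Stage 2: stats avg(numReqs), percentile(50, numReqs), percentile(90, numReqs)
counts = list(num_reqs.values())
summary = {
    "avg": mean(counts),
    "p50": percentile(50, counts),
    "p90": percentile(90, counts),
}
print(summary)
```

In this toy data set, one heavy user pulls the average and the 90th percentile far above the median, which is the kind of skew the query is meant to surface.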

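Users at or above the 90th percentile might be suspect, so you can then drill down into their activity and check for malicious behavior. For example, if the previous query reports a 90th percentile of 158 requests, you can list the users at or above that threshold: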

userIdentity.type: "IAMUser" and eventSource: "s3.amazonaws.com"
| stats count() as numReqs, userIdentity.arn by userIdentity.arn
| where numReqs >= 158

New visualizations

When you execute a stats query, Scanner lets you visualize the results in a few ways. You can view a simple bar chart that shows the total aggregation breakdown, or you can view time-binned bar charts and line charts that show how the aggregations have changed over time.
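
As a rough illustration of what time binning means, the sketch below rounds each event timestamp down to the start of its hour and counts events per bin. The timestamps and the `bin_start` helper are hypothetical; how Scanner chooses bin sizes and renders charts is not covered here.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bin_start(ts, bin_size):
    """Round a timestamp down to the start of the bin that contains it."""
    bins_elapsed = (ts - EPOCH) // bin_size  # whole bins since the epoch
    return EPOCH + bins_elapsed * bin_size

# Hypothetical event timestamps, invented for illustration.
timestamps = [
    datetime(2023, 11, 28, 10, 5, tzinfo=timezone.utc),
    datetime(2023, 11, 28, 10, 40, tzinfo=timezone.utc),
    datetime(2023, 11, 28, 11, 15, tzinfo=timezone.utc),
    datetime(2023, 11, 28, 13, 59, tzinfo=timezone.utc),
]

# One count per hourly bin; each bin would be one bar or one point on a line.
hourly = Counter(bin_start(ts, timedelta(hours=1)) for ts in timestamps)
for start in sorted(hourly):
    print(start.isoformat(), hourly[start])
```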

Statistical functions available with stats

When you use the stats query feature, there are several statistical functions you can use to explore your data.

  • count() – compute the total count of hits per group
  • countdistinct(col, ...) – compute the number of distinct values in a column
  • avg(col) – compute the average of a numeric column
  • var(col) – compute the sample variance of a numeric column
  • percentile(n, col) – compute the n-th percentile of a column
  • sum(col) – compute the sum of a numeric column
  • min(col) – compute the minimum value of a column
  • max(col) – compute the maximum value of a column
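
For readers who want a reference point, the functions listed above correspond roughly to the following Python standard-library computations over a single group. The rows and the `user` and `bytes` column names are hypothetical, and the interpolated percentile is an assumption; Scanner's exact definitions may differ.

```python
from statistics import mean, quantiles, variance

# Hypothetical rows for one group, invented for illustration.
rows = [
    {"user": "alice", "bytes": 2},
    {"user": "bob", "bytes": 4},
    {"user": "alice", "bytes": 4},
    {"user": "carol", "bytes": 4},
    {"user": "bob", "bytes": 5},
    {"user": "alice", "bytes": 5},
    {"user": "dave", "bytes": 7},
    {"user": "carol", "bytes": 9},
]
sizes = [row["bytes"] for row in rows]

results = {
    "count()": len(rows),
    "countdistinct(user)": len({row["user"] for row in rows}),
    "avg(bytes)": mean(sizes),
    "var(bytes)": variance(sizes),  # sample variance (divides by n - 1)
    "percentile(50, bytes)": quantiles(sizes, n=100, method="inclusive")[49],
    "sum(bytes)": sum(sizes),
    "min(bytes)": min(sizes),
    "max(bytes)": max(sizes),
}
for name, value in results.items():
    print(f"{name} = {value}")
```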

Powerful, fast data exploration

We want queries to be fast so that investigations can be performed as quickly as possible, and this includes stats queries. When result sets get large, Scanner uses probabilistic algorithms and data structures to produce approximate answers with low error. For more details, check out our docs: https://docs.scanner.dev/scanner/using-scanner/aggregations.
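
To give a flavor of how a probabilistic data structure can trade a little accuracy for bounded memory, here is a sketch of a K-minimum-values estimator for distinct counts. This is a generic textbook technique chosen for illustration, not a description of what Scanner actually uses; the function names and the choice of k are assumptions.

```python
import hashlib
import heapq

def hash_to_unit_interval(value):
    """Map a value to a pseudo-random float spread roughly uniformly over [0, 1)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values while keeping only k hashes."""
    heap = []        # max-heap (stored negated) of the k smallest hashes seen
    members = set()  # the same hashes, for fast duplicate checks
    for value in values:
        h = hash_to_unit_interval(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest retained hash with this smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes: the count is exact
    kth_smallest = -heap[0]
    # If n hashes are uniform on [0, 1), the k-th smallest is near k / n.
    return int((k - 1) / kth_smallest)

# Hypothetical column with 5,000 distinct values repeated across 20,000 rows.
values = [f"user-{i % 5000}" for i in range(20000)]
estimate = approx_count_distinct(values)
print(estimate)
```

Memory stays proportional to k regardless of how many rows are scanned, and the relative error shrinks roughly as one over the square root of k.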

Now that these statistical query functions are in place, we are building several cool features on top of them. Stay tuned!

We believe that traditional log architectures are broken for modern log volumes. Scanner enables fast search and detections for log data lakes – directly in your S3 buckets. Reduce the total cost of ownership of logs by 80-90%.
Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, which provides fast search and threat detections for log data in S3. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it's mostly love these days.