Data Engineering Podcast: Build A Data Lake For Your Security Logs With Scanner

Scanner CEO and Co-founder, Cliff Crosland, joins Data Engineering Podcast host Tobias Macey for a conversation about how Scanner's fast log search and threat detection API for your data in S3 makes the discovery and exploration of security threats easier, faster, and much cheaper.
Episode Transcript
Tobias Macey
Hello and welcome to the Data Engineering podcast, the show about modern data management. I'm your host Tobias Macey, and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost effectively. So Cliff, can you start by introducing yourself?
Cliff Crosland
Yeah, absolutely. Thanks Tobias. My name is Cliff Crosland. I'm a software engineer, and I've been focusing on data platforms and data engineering, really high scale systems, a lot of C++ and Rust through the ages. We've had a lot of fun on our team. My Co-founder, Steven Wu, is another software engineer; we were at a startup together beforehand, where we ran into some really big problems with log scale, and so we built a tool to make log search super fast. And we found that security teams in particular have this problem more than almost anybody. So we're really focused on helping people tackle massive data problems in these security data lakes and keeping search and detection super fast.
So that's what we're up to.
Tobias Macey
And do you remember how you first got started working in data?
Cliff Crosland
Yes. So at the startup where my Co-founder Steven and I worked before, we had a massive crawling system. We built this really fun, intelligent executive assistant: we would crawl the web looking for news articles about companies, and we'd help people by sending them intelligent briefings before their meetings for the day, like news alerts about the companies they were about to meet with, and about the people if they showed up in the news. So we were crawling a massive amount of data, and we definitely ran into some interesting bugs with Curl, with libcurl, which is written in C. We went really low level to try to optimize our crawler as much as possible and ended up moving that to Rust.
But yeah, I think the first taste of managing petabytes of data came at that startup, and it was super fun. It was interesting. We definitely went from Ruby all the way down to C very quickly in order to optimize. So we've seen the whole stack, the whole spectrum of different data engineering tools that can be applied to massive datasets.
Tobias Macey
And with that exploration of Curl and optimizing it and moving it to Rust, obviously Curl is written in C. It has a massive amount of history and community knowledge that has been baked into it. I’m wondering, what are some of the things that you gained by moving to Rust and what are some of the things you lost by the fact that you weren’t integrating directly into Curl? Or were you actually still using Curl and just using the unsafe keyword?
Cliff Crosland
Oh yeah. Well, that is totally doable now. It was a really interesting, fun experience. So our crawler system was written in C++ and we used libcurl, which was excellent. But writing this kind of asynchronous, multi-threaded Curl code and interacting with the network in C/C++ is kind of annoying. What we found is that we ran into some memory safety problems in Curl. We were able to reproduce them, and they got fixed eventually upstream; a bunch of other people ran into the same thing. Our server would sometimes crash, and this is a server that's interacting with the public web constantly and is sort of a vulnerable endpoint, where if someone were to take advantage of a bug in Curl, it'd be really scary for us.
Because we were hitting all kinds of servers as we were crawling the web, and some evil servers were trying to send us, you know, gzip bombs and so on. But yeah, we were nervous about the fact that there was this memory unsafety problem where it would segfault. It got fixed upstream, and we were able to add some commits to Curl, which was a lot of fun. I think Curl is one of the coolest and most important C programs ever written. But we felt like, okay, if this is one of the best C programs, one of the best C libraries in existence, and it's still having these memory safety problems, what can we do about this?
So we built a Rust prototype. It was a lot of fun, and we got a lot of memory safety out of that. It was actually back in the day when you couldn't use the async keyword yet in Rust. So one of the things I think we lost was that it was way less ergonomic than using Curl, but eventually it got really nice once the async keyword was added, and then writing that code wasn't so insane in Rust. But we got a lot of confidence that the program was safe, that we wouldn't have the same kinds of memory safety and thread safety problems. So that was really nice.
It kind of got us hooked on Rust, and it's why we're using Rust in this new data lake product that we're building at Scanner.
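For listeners curious what the ergonomics Cliff is describing look like, here is a minimal sketch of concurrent fetching with async Rust. It is not code from the crawler discussed in the episode; it just illustrates how async/await plus common crates (tokio, reqwest, futures) compare to hand-managing libcurl handles in C/C++.

```rust
// A minimal sketch of concurrent fetches with async Rust. Illustrative only;
// not the crawler code discussed in the episode.
//
// Assumed dependencies: tokio (with the "full" feature), reqwest, futures.
use futures::future::join_all;

async fn fetch(url: &str) -> Result<String, reqwest::Error> {
    // reqwest handles connection pooling and TLS; no manual handle management.
    let body = reqwest::get(url).await?.text().await?;
    Ok(body)
}

#[tokio::main]
async fn main() {
    let urls = vec!["https://example.com/", "https://example.org/"];

    // Kick off all requests concurrently and wait for them to finish.
    let results = join_all(urls.iter().map(|u| fetch(u))).await;

    for (url, result) in urls.iter().zip(results) {
        match result {
            Ok(body) => println!("{url}: fetched {} bytes", body.len()),
            Err(err) => eprintln!("{url}: request failed: {err}"),
        }
    }
}
```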
Tobias Macey
And so digging more into Scanner, can you describe a bit about what it is and some of the story behind how it came to be and why you decided that you wanted to spend your time and energy on it?
Cliff Crosland
Awesome. Yeah, so here's the origin story of Scanner. At our prior startup we were using Splunk for our logs. We were generating tens of gigabytes of logs per day, which is fine, and our Splunk bill was like tens of thousands of dollars. Then we very quickly scaled up our crawling system and our user count, and our log generation volume grew from about 10 gigabytes a day to a terabyte per day. Our log bill jumped up to a million dollars a year plus. And so we had to just kind of throw away logs by uploading them to S3.
We still wanted to see if we could use them for debugging, but it was just so expensive to use Splunk and many other tools. Datadog, for example: pushing logs there at one terabyte a day is also super expensive. So we put things in S3, which was super cheap, but then it was impossible to find anything again. You can use Athena, and Athena's pretty good there, but to use Athena you have to transform all of your logs into sort of a common format, and it's not very good at ad hoc searching. So we just felt like, okay, it seems to be the case that logs follow this cycle. There's the on-premise era where everyone has log files stored on hard drives across their data center.
That's really painful for investigation; you have to SSH around. Then everyone moved to SQL-based log search tools built on Oracle, MySQL, and Postgres. A lot of the early security search tools like ArcSight were built on top of Oracle. But that's painful because you have to transform all of your logs to fit a SQL table schema. Then you have tools like Elasticsearch and Splunk come along in the on-premise world, and they say, oh, unstructured data's totally fine, perfect for logs, you don't have to transform your data into a SQL schema beforehand.
This is awesome. And now we're kind of repeating the cycle again in the cloud era, where people upload logs to S3 as flat files, but that's really painful for search and exploration. So people are reaching for SQL tools like Amazon Athena or Presto or Trino or Snowflake in order to interact with these logs in S3. But the thing is, logs aren't really tabular all the time. They're often JSON, they have freeform structure, and the structure's changing all the time; new fields are added or removed in the log data. So SQL is not quite the right fit.
And we just thought: where is the search index that's built on top of S3? Elasticsearch kind of is one, but it was built in the on-premise era; it's still fine in the cloud, but all the logs are stored on servers. With Scanner, we thought: why aren't the logs and the indices, why isn't all of that in S3? Then you can search through that data extremely quickly with an S3-based indexing system. So that's what we built, and it makes it incredibly fast to search through massive amounts of data; the index files help zoom in on the regions of logs that contain the tokens in your query.
So then, instead of kind of scanning the whole world, like you often do with some of these S3 SQL tools, it really zooms in on just the chunks of logs that contain the hits that you need to see. And we just felt like this solved, in a major way, the problem that we had at the prior startup. As we were sharing Scanner with people, we found that some of the teams that suffer from this the most are security teams, who need to store years of data for compliance purposes or because they want to do investigations.
We heard horror stories where on average it takes a couple hundred days to detect that you've had a breach, and then a month or two after that to really terminate the breach and disconnect attackers. But if most of your tools only let you see 30 days of logs going back, it's really painful. Why not just search all of your S3 data going back forever, and why isn't that fast? Why does that take days to run those queries? So yeah, we've been super happy to share this tool with security teams and security engineers, in addition to application debugging teams who are using Scanner just for looking at their application logs.
Tobias Macey
You’ve already mentioned a couple of other products in the space and some of the reasons why you decided that you needed to build this solution, but I’m wondering if you can give a broader view of some of the shortcomings that exist categorically in other tools in this space and some of the ways that you’re thinking about the specific problem that you’re trying to solve for.
Cliff Crosland
Yeah, absolutely. So there are sort of two categories of logging tools that you see used by modern security teams, or by application teams who are using logs for observability and fixing errors. You see either an old-school architecture designed for the on-premise era, or a new architecture that's all SQL based. The on-premise era architectures are tools like Elasticsearch and Splunk and so on. They're tools where you stand up a bunch of servers, the logs are uploaded to those servers, and they distribute those logs around among all of the servers in the cluster.
But that is really hard to scale once you start to generate a terabyte of logs per day. Most of the work being done in those clusters is replicating logs around from one server to another, trying to get sufficient redundancy, healing partitions, and so on. It's surprising how little compute is actually being used for powering querying or doing basic indexing. A lot of it is just maintaining these massive log sets on a cluster where compute and storage are coupled together. So that's the old-school approach, where in your on-premise data center you'd have a bunch of servers running and you would replicate your logs between those servers.
But now, in the world of cloud storage, you have S3, you have blob storage in Azure, these really scalable data storage systems that do all of the replication, redundancy, partition healing, and so on for very low cost. And so you get these really cool tools like Snowflake, which do a really cool thing: they decouple compute and storage from each other. All of your SQL data in Snowflake is stored on S3 and the compute is totally separate. You can spin up a lot of compute for your query, or you can spin up a little bit.
It's really nice. You don't have to run this really high-maintenance cluster all the time, because storage and compute are totally decoupled. But the problem with tools like Snowflake is that they're SQL. That's really great for business analytics and for data with really strong schemas, but it's not great for logs, which are totally freeform, semi-structured JSON data usually, with tons of nested fields. It's really hard to do all of the work to transform your logs to fit into the SQL schema. And we just felt like, okay, we want that really cool decoupling of compute and storage, but why can't my logs just be anything?
I want them to be semi-structured, and I want my indexing tool to be good at indexing and searching through semi-structured data. So that's where Scanner makes a huge improvement. It's much more scalable and faster than those high-maintenance old tools like Elasticsearch and Splunk, which can handle semi-structured data fine but can't handle the scale that you need in this era of using cloud tools and generating terabytes of logs per day. And then it's also a much better experience in terms of onboarding logs and searching logs than the cloud SQL tools today, like Snowflake or Presto or Trino.
Yeah, it's sort of trying to build the search index version of Snowflake: giving you really easy onboarding of logs, no matter what the schema is, and really easy ad hoc freeform search.
Tobias Macey
The closest analog that comes to mind, at least in my experience of working in this space, is the Loki project, which is what the Grafana team has built. But that does have some minimal amount of structuring that it requires as far as the, I forget the terminology that they use, but effectively the labeling that you're using for the different log streams that you're populating. Whereas it sounds like what you're doing is bypassing the write path and just being the index and read capability. So it doesn't matter if the logs were generated through Scanner; you just say, there are some logs here, now I'm going to do something useful with them, versus Loki, which requires that it be in both the read and write path in order to be able to work effectively.
Cliff Crosland
Yes, definitely. I think Loki is a super cool tool. One of the things that's interesting is, as you mentioned, there's this metadata labeling, this tagging, that you have to add to the logs; you push them through a Loki pipeline to add the metadata to the logs. And then when you execute a search with Loki, that also executes a search against the S3 data, and it can use those tags to determine which files to go scan. But one of the interesting issues you run into, especially if you're doing security investigations, is that you may not know what the tag should be.
It might be, I have an IP address here, or maybe a collection of malicious IP addresses that I'm scared about, and it could be in my Okta logs, my GitHub logs, it might be in CloudTrail, might be in VPC Flow Logs. I don't know which log source to go check; I kind of want to check everything together. What Scanner does is actually index the content, as opposed to just using the metadata tags, as Loki does, to determine which files to go scan. Instead, when you execute a query, Scanner's Lambdas will go and ask the index files: here is the set of IP addresses the user asked for; which files, and which chunks of those files, contain those tokens?
And then once it gets that information, it spawns out Lambdas to go and search just that small subset. So basically the content is indexed, and you can use the content to narrow down the search, not just the metadata tags. But I think Loki is really cool and is another log tool in this vein: it's scalable, it focuses on S3 storage. And I think that's an accurate characterization: you don't need to change the logs, you just point Scanner at your logs in S3 and say, index this and let me search through it really fast.
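To make that query flow a little more concrete, here is a hedged Rust sketch of the general idea: ask a coarse index which chunks of which files might contain all the query tokens, and only scan those chunks. The types and index layout below are illustrative assumptions; Scanner's actual index format is not described in the episode.

```rust
use std::collections::{HashMap, HashSet};

/// Identifies one chunk (a region) of one log file in object storage.
/// Purely illustrative; not Scanner's real index representation.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ChunkRef {
    file_id: u32,
    chunk_id: u32,
}

/// A coarse index: token -> the set of chunks where the token appears.
struct CoarseIndex {
    postings: HashMap<String, HashSet<ChunkRef>>,
}

impl CoarseIndex {
    /// Chunks that could contain *all* of the query tokens.
    fn candidate_chunks(&self, tokens: &[&str]) -> HashSet<ChunkRef> {
        let mut iter = tokens.iter();
        let mut candidates = match iter.next() {
            Some(t) => self.postings.get(*t).cloned().unwrap_or_default(),
            None => return HashSet::new(),
        };
        for t in iter {
            let set = self.postings.get(*t).cloned().unwrap_or_default();
            candidates.retain(|c| set.contains(c));
        }
        candidates
    }
}

fn main() {
    // Toy postings; in the architecture described here, these would live in
    // index files in S3, and the per-chunk scans would fan out to Lambdas.
    let mut postings: HashMap<String, HashSet<ChunkRef>> = HashMap::new();
    postings
        .entry("203.0.113.7".to_string())
        .or_default()
        .insert(ChunkRef { file_id: 1, chunk_id: 4 });
    postings
        .entry("203.0.113.7".to_string())
        .or_default()
        .insert(ChunkRef { file_id: 9, chunk_id: 0 });
    postings
        .entry("AssumeRole".to_string())
        .or_default()
        .insert(ChunkRef { file_id: 1, chunk_id: 4 });

    let index = CoarseIndex { postings };
    let hits = index.candidate_chunks(&["203.0.113.7", "AssumeRole"]);

    // Only these chunks need to be downloaded and scanned in full.
    println!("chunks to scan: {hits:?}");
}
```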
Tobias Macey
And that brings us to the data acquisition piece of the question, where having a powerful tool that can do all kinds of automated discovery or very fast querying is completely useless if you don't have anything to run those queries against. And so I'm wondering what are the different data acquisition paths that you are focused on prioritizing while you're early in your product lifecycle, and some of the ways that you're thinking about the future trajectory of supporting the data acquisition path and being able to branch out into a broader variety of systems that you can crawl?
Cliff Crosland
Yeah, absolutely. So, what we’re focusing on right now is tools that naturally upload their logs to S3. And so this would be like AWS CloudTrail or, or CrowdStrike has like Falcon data replicator or GitHub has an option where you can push your GitHub audit logs into an S3 bucket of your choice. CloudFlare, same thing like many, many security related tools actually already have connections to push logs into S3. And so for those like it’s really easy to get started with with Scanner, you just enable those and those different tools to push your logs into S3 buckets and then you point Scanner at those buckets and Scanner provides really fast search on that data.
So any, any AWS data in particular, I think like that’s what most of our users are focused on, like CloudTrail is, is a really killer log source that everyone wants to monitor and run detections on and so on. But in the future, one of the things that we are beginning to build is data connectors that go and pull in logs from tools that are like API based, like Okta is a good example. So some of our users use like really cool tools like Cribl or Vector to go and fetch logs from different places and push them into S3, but there are others who just want us want to do it from the same UI in scanner.
And so that’s, that’s a future direction for this year is building up these like this larger set of data connectors that can go and pull logs down from different sources, especially like API based sources and then pull them into your S3 bucket for you and then they’re there in S3 for you forever. If you want to query them with Athena or something or some other tool that’s great or if you want like super fast search and Scanner can index those for you and then you can execute queries from within Scanner.
Tobias Macey
And given your security focus as the primary impetus for building this system, what are the things that you are explicitly not solving for or at least not yet?
Cliff Crosland
Yes, I think this is great. There are some really amazing security tools out there. As everyone knows, Splunk has everything and the kitchen sink built into it, and we don't want to try to replicate every last thing that Splunk can do, maybe eventually, but one of the things that we are really focused on is integrating really well with other excellent security tools. One example is Tines. We have an integration where you can push the detection events that are triggered by your detection rules to Slack, and you can also push them to Tines.
Tines is a really cool SOAR tool for automating security response. We push to webhooks in Tines, and then you can automate your Tines workflow to do other things. You might hit an API to go and reject a user, or block an IP address, or undo a security group change in AWS that your developers accidentally made, like opening up a network to the world or something. So there are some cool things you can do with Tines, and Scanner integrates with that and can send those messages to Tines, and then you can use Tines or Torq or other tools to do an automated response.
So we're not going to build the automated response ourselves; you want to hand that off to really cool modern tools. And instead of doing all of the security event case management, where you assign a ticket to somebody within Scanner, we don't want to do that either. We push events off to Jira, and probably Linear as well; Linear is a really fun ticketing system. Basically, we're pushing the issue tracking and the automated response off to other tools. We don't want to do all of that. We just want to make extremely fast search.
And then these detections, which are really easy to write, run as queries on your data all the time. We're just really focusing on the search and the query experience.
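As a rough illustration of that hand-off, here is a hedged sketch of forwarding a detection event to a SOAR webhook. The webhook URL and the payload fields are made up for illustration; they are not Scanner's or Tines' documented formats, so consult the receiving tool for its actual expected shape.

```rust
// A minimal sketch of pushing a detection event to a SOAR webhook, such as a
// Tines webhook action. Payload fields and URL below are hypothetical.
//
// Assumed dependencies: tokio, reqwest (with the "json" feature), serde_json.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Hypothetical webhook URL provided by the SOAR tool.
    let webhook_url = "https://example.tines.com/webhook/abc123";

    // Hypothetical detection event emitted by a detection rule.
    let event = json!({
        "rule_name": "aws_root_login",
        "severity": "high",
        "source": "cloudtrail",
        "matched_at": "2024-01-01T12:00:00Z",
        "details": {
            "source_ip": "203.0.113.7",
            "event_name": "ConsoleLogin",
        },
    });

    let client = reqwest::Client::new();
    let response = client.post(webhook_url).json(&event).send().await?;

    // The SOAR workflow takes it from here: blocking an IP, reverting a
    // security group change, opening a ticket, and so on.
    println!("webhook responded with status {}", response.status());
    Ok(())
}
```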
Tobias Macey
Digging into the architecture of the Scanner platform, you mentioned that you're very focused on cloud-first modern systems. I'm wondering if you can talk through some of the ways that you have built the platform and some of the design philosophies that are core to your product identity.
Cliff Crosland
Yes. So we really believe in decoupling storage and compute. We think that is an extremely cool new pattern that you see in SQL tools so far, and we really feel like search indexes need to do the same thing. So the architecture for Scanner is serverless. We use ECS Fargate in AWS for our indexing compute, and that can scale up and down extremely rapidly. What happens is your S3 buckets fill up with a lot of log files, we receive SQS notifications, and the ECS Fargate system spins up tasks to go and handle that compute, then spins back down again once the volume comes back down. It creates these index files, which are in S3.
We also have a MySQL database for everybody, and that MySQL database is there for metadata: here's where all the index files are, here are the time ranges they span and which indices they refer to, and things like which users have access to which indices in the system. Those are stored in MySQL. Then when a query occurs, our API handler will go and query MySQL and figure out, okay, here are all of the index files that are relevant to this query, because the user is looking at this index, or these seven indices, and this time range, executing this query.
So, for all of the index files that we want to go and hit, we spawn up a huge number of Lambda functions, which all individually go and very rapidly traverse those index files and jump around in chunks. The index file guides the Lambda to the small subset of regions that contain the hits. Then we accumulate those results back together. This could be a fun topic for math nerds out there, but we have what we call the monoid data structure server: we compute all of these aggregate values, merge them together in the monoid server, and then report them to the user in the UI.
That's how we do our aggregation queries. And actually, all of our queries are technically monoid values, which is a fun idea from group theory that's really useful for doing this kind of thing. So that's roughly how the system works and how it hangs together.
Tobias Macey
Getting even deeper into the weeds, indexing is a bit of a dark art, and there are multiple different ways that you can construct those indexes for different use cases, whether that is accuracy, retrieval speed, reducing the amount of scanning, where those indexes live, or how you manage the update cycles on them. I'm wondering if you could talk a bit more about how you think about the specifics of indexing this variegated data that is semi-structured or unstructured, and some of the complexities that come about building out consistently useful indexes across this mess of data that people are just throwing at you.
Cliff Crosland
Yes, I think that that’s a super great question. This is something that we’ve really played with because it seems like there’s a, a pretty broad spectrum of what you can do with an index. Like if you look at Elasticsearch’s indices, they’re extremely fine grained where every token basically has a mapping to every single individual document where the token appears, which means that the index is really massive. It, it might be like several times larger than the original dataset, but then on the other hand you have like no indexing at all, which means that ingestion is super fast.
So with maybe Elasticsearch, it’s, it’s slow with Loki ingestion’s super fast because there really isn’t any indexing. There’s just like labeling the logs with the different tags and that helps partition them into different buckets. But then the problem with a tool that doesn’t do any indexing is that when you execute a query, and if you don’t know exactly what partition you needed to look in, if you wanna look kind of everywhere and look for like, this is a, this is a scary IP address, I wanna see all activity for the past year for this particular set of IP addresses or this user or something, no matter what log source it’s in, that can be really hard with a tool that doesn’t do any indexing.
So we, we feel like we’re in the middle, our index is more coarse grained and so we have like a number of different index file types. So it’s not just string tokens, but there’s also numerical indices and indices that keep track of things like the most common fields and the, the range of values that appear in those fields. But we’re trying to, to balance search speed with really, really fast and low cost ingestion. Where I think like kind of Elasticsearch and, and tools like that fall over is ingestion is really, really slow or it’s just takes so much. CPU is really expensive and Loki is like no indexing, so it’s super fast to ingest, but query is slow.
We wanna make querying really fast. So like our, our trade off is we want it to be possible to query for a needle in haystack in a petabyte of logs in a hundred seconds or less and in in various other tools that might actually take you something like hours, like or, or sometimes days. But we also want it to be the case that, or like the compute for each gigabyte index is really cheap for us so that we can handle tons and tons of of traffic. So we have like this coarse grain index system where you map from the tokens and the content that appears in the logs and instead of making a really fine grain mapping to every single individual log event that, that contains those tokens or that content instead you have a much more coarse grain mapping so that the index files guide you to a region that you still need to go scan the entire region of, you know, maybe hundreds of thousands of log events.
But it’s actually really fast to do that, especially if you’re using Rust, especially if your format is really fast to download and parse and decompress, then, then doing that and, and narrowing down the search to kind of these reasonably large chunks. But the, those chunks that are pretty fast and easy to scan, it gives us a nice balance in our indexing system between like a, yeah, super expensive but really fine grained and like super cheap but very unhelpful indexing. So we, we like to be like super fast but also really, really cheap to index.
Tobias Macey
And in terms of the motivating constraints, I'm wondering what the North Star is that you're using to decide: this is something that we want to include in the system; that is something that is nice to have, but we can't execute on right now; and that is completely out of scope and is somebody else's problem?
Cliff Crosland
That’s a great question. So what we, we really want to focus on what we think is missing in the space, which is search and, and making that sufficiently fast on massive data sets but at low cost. So we don’t want to, you know, we, we think it’s silly for someone generating a terabyte of logs per day to pay a million dollars a year. It should be 10 times less than that. And we, we think that, so our North Star is how can we make search 10 times cheaper but also remain extremely fast? And we’re really focused on that. And so the search that we’re focusing on is both ad hoc searching and also these detection queries that run all the time to find threats.
And we’re just laser focused on that. We don’t want to go and build every, you know, feature that, that you see in the security space. We want to integrate really well with those tools, whether they’re sending us data into S3 for us to index or if we’re sending out events to them to go and, and handle different responses. So yeah, we’re just, we just feel like search is not good enough and we’re, we really want that to be excellent and save people a huge amount of time when they’re doing these really big investigations over their data.
Tobias Macey
And then a bit more nuanced on the question of indexing is also the fact that log data is notoriously messy. There is no consistent format that you can be guaranteed anybody is going to have, because most log systems will allow you to customize it to no end. And I'm curious how you are approaching that problem of being able to identify: this is a string that is worth looking at, or this is the structure that's being used, whether it's a standardized structure such as JSON or Logplex or Syslog, or this is just a completely arbitrary string, but there's actually interesting stuff in there and now you have to pull out your regexes.
Cliff Crosland
Yes, I think that’s a super great question. We, we feel like one of the really, one of the most painful things about using logs right now is all of the data engineering work that goes into setting the setting up in first place. And so a lot of the, the data lake style tools that people are reaching for do require a significant amount of effort to do things like transform your syslog or your JSON or your plain text or your CSVs into SQL tables. I guess with CSV it’s actually not that hard because it’s like basically a table, but with, especially with more freeform data types and log types like JSON or even like Protobufs and so on, it, it can be really painful to map that data into really strict schema structures.
And so we, we feel like logs are like by their nature are really, they’re semi-structured, they’re freeform, it should be really easy to query them in an ad hoc way. And so we really focus on imposing no structure at all on users. We do support like various important log formats like JSON, like Parquet, like CSV, plain text, Avro is coming if more people want Avro, but, but basically we want to make it really easy for you to onboard logs. You don’t have to transform them into a different schema. And that’s the way that we’ve generated our indexes is that the, whereas like a lot of other in indices are, are hierarchical where the index is organized where like the, the field name is on top and then once you know what the field name the user’s looking in, then you go and search within that scoped amount of data.
So it’s like very columner for us. The columns don’t matter at all and in some sense the columns are just annotations on the data. So if I am looking for content like a user ID or like a string or, or multiple strings that I, I might be curious about and then, but I have a column name in front of that in my query, then what we’ll do is we’ll find that content and then we’ll see if any of those hits also have this column annotation so that way your columns can change day to day. They don’t have to to remain stable, they don’t have to remain small, you’re not limited to like a hundred fields or something like that.
There could be millions of distinct fields in your system. It doesn’t matter at all. We are really just focused on the content. And so yeah, we, we really want it to be possible for people to just dump their logs somewhere and not then have another project where every new log source requires this massive onboarding transformation work. It’s just like, I’ll just point Scanner at it and scanner can discover the schema, handle it, query that schema without any problem at all. So yeah, we’re really focused on that. I think one thing I might mention there, which is super interesting that we’re playing with that’s in beta right now is auto schema discovery.
So for a lot of log sources, it’s really nice to, to transform them into a common schema if you can. So something like OCSF, which is this Open Cybersecurity Foundation format and, and a bunch of tools are beginning to adopt this, but it’s basically trying to come up with a common schema so you don’t have to remember what like the source IP address name is of that, of that field is in every single log source you have. You can just say, give me the source endpoints IP address and look that up across all my log sources. And so the way that we do that is instead of requiring you to transform them ahead of time, we discover the schema and we will transform your query into the, like if you, if you type in sort of an OCSF or like a common schema style query, we’ll transform your query so that in the log sources that you’re looking at, it will match the, the fields and the structure of, of those log sources.
So just again, we’re just trying to remove as much data engineering effort off of everyone’s plate as we possibly can so they don’t need to do all this transformation ahead of time.
Tobias Macey
And in terms of the overall design and scope of the system and the problem that you’re trying to solve, how has that changed from when you first started working on the problem?
Cliff Crosland
The problem? Yes. So when we first started working on Scanner, we really felt like we wanted to solve problems for people building applications, people who wanted to quickly search through logs, debug them, and have a really cheap place to put those logs. And we discovered something really interesting, which is that most people who are debugging applications only really care about having 7 days of logs or something like that, maybe 14 days maximum. And they don't necessarily need very many logs: error messages are maybe the most important thing to keep, plus a sample of some subset of normal logs.
We discovered that application developers who are using observability products don't need that much log retention, and they don't have that much log scale, or it's a little bit more rare. Whereas as soon as we started to talk to security people, security engineers, DevSecOps, everyone who's really focused on detection and response and log tools for security teams, they all said the same thing: wow, can I use this tomorrow? This is so cool, because I have a few years of logs, but they're invisible to me.
I have some log tools that keep something like 14 or 30 days of logs in scope. But when there's a breach, oftentimes I won't know about it in that window. If a third party that we're using gets breached, they might say, this breach started about six months ago, here are some indicators of compromise you might want to look for, like IP addresses, domains, emails and so on. And it's really hard to go look in the past and find that data. So we thought, well, if it's in S3, why isn't it super fast to search through S3 data? And so, instead of focusing on the problems of application developers, we started to focus on the problems of security teams.
And then we started to see security teams using us every day of the week, because there's so much data they're curious about, there are so many ways that people can attack you, and threats are extremely creative. So it's very helpful to be able to look at not just the past handful of days, but a long way into the past: 90 days, 180 days, multiple years.
Those kinds of queries are really important to security teams. So that's how we've evolved: to really focus on those problems and solve problems for people who have the biggest log sprawl issues. And that's been extremely rewarding, because security teams are extremely overworked and have so much to worry about in addition to logs. They're setting policies for their companies, they're doing trainings, there's just so much on their plate, and everyone is bothering them for more reports and asking them questions about whether they're vulnerable to this or that.
So the more time we can save for them by making search extremely fast and making their historical investigations really fast, the better their lives are. So yeah, we've definitely shifted from focusing on the observability use cases in the early days to focusing on the security use cases.
Tobias Macey
Given the security oriented customer base that you're focused on, I'm wondering if you can talk to some of the ways that the architecture of your product is designed to fit into the types of regulatory and compliance constraints, both organizational and legal, that they need to be able to accommodate. Particularly with ensuring that data access for this sensitive information, which is required for them to be able to do their jobs and determine whether or not they've been compromised, doesn't leave a predefined perimeter that they're able to maintain full control of, and some of the ways that you work to give them comfort and guarantees around the ways in which you are processing their data.
Cliff Crosland
Yes, absolutely. So that’s one of the things I think has, has been really fun to build and to work on is trying to make the, the connection between the compute and the storage as safe, but also as secure and as fast as possible and as low cost as possible. So one of the things that we do is with Scanner, we create a brand new AWS account specifically for your team. It’s not multi-tenant unless the, the free tier if you want to play with Scanner is a multi-tenant environment for you to to play with. But for teams who are using the product and on on really serious problems, we spin up a a completely separate AWS account in the same region alongside where your buckets are.
So instead of like with other SaaS tools, you’re pushing logs out to some third party, you’re pushing them over the internet with our tool, everything stays within the same region and we just use IAM permissions to say, cool, like this, this one single tenant AWS account that’s completely isolated is the compute there is allowed to communicate with your buckets. And when we create the index files, we save the index files into your S3 bucket as well. So you don’t need to worry about like the, the data being stored by somebody else. Everything remains in your buckets and, and there’s not this vendor lock-in where your data is kind of owned by some other person’s cloud.
It’s all just within your own S3 buckets. So it’s sort of like this compute service that you can leverage, but then all of the data remains in your S3 buckets forever. And so that does a couple of fun things. It it means the data transfer cost is lower, that’s kind of fun where instead of pushing logs to like a third party vendor over the internet, everything is just remains within the same region. So data transfer cost, thankfully I’m grateful for AWS for this, but in the same region, data transfer cost between compute and S3 storage is free and, but it also means that there is extremely tight control over where, where this data goes, who has access to this data because everything remains, all the data remains in the customer’s S3 in their AWS account.
And also our compute is in this isolated a AWS account that’s unique to them. We are, we are playing with, there are some users who are curious about whether we, we could deploy into their AWS account as well and they can run everything there. And that’s actually very easy for us to do. That’s, that’s something that we’d love to, we’d love to, to try out, but that it’s architected to be able to deploy the compute to any AWS account that you point at it just, you just need enough permissions to set up the various infrastructure inside of that account. So that’s something I think we’ll probably like be fun to play with this year.
Deploying into someone’s environment directly instead of running the compute in our own AWS account. But anyways, yeah, we, we really try to keep everything isolated and all of the, the compute is yours and all of the storage remains in your AWS account
Tobias Macey
For teams who are adopting Scanner and incorporating it into their day-to-day workflow, I'm wondering if you can talk through what that typically looks like, some of the ways that they are applying Scanner to the problems that they're trying to solve, and some of the ways that you think about the collaboration activities of security engineers on these teams who are doing either ad hoc discovery or maybe some scheduled scan that they're running to detect any persistent issues.
Cliff Crosland
Yes, I I think it’s really interesting to, to learn what kinds of queries and what kinds of investigations that people can do now and that get unlocked by a tool like Scanner. And so what we see that security teams do is they jump in and whether an alert comes from from Scanner or from some other sources, what you, you can jump in and start to investigate the recent past, like over the past couple of days very quickly. But then immediately you start to see really fun activity, like people start to explore over the past six months, over the past year and look for indicators of compromise going back a very long way.
But another really interesting thing that a Scanner unlocks that we were surprised by is that extremely high volume log sources that are, that tend to be sort of low value usually, but can be extremely important like VPC flow logs, which are voluminous and there are so many, like every single connection that’s going on in inside of your network in AWS shows up there basically. But you can start to run cross correlations between VPC flow logs and your other like high, like higher value but lower volume log sources. So you might say like, oh, this IP address is showing up in my, like AWS cloud trail because someone’s trying to log in.
But I also see that that IP address is showing up in the VPC flow logs and I, and I know that the destination address is EC2. And so it’s like, wow, this is like a, like VPC flow logs and other kinds of really high volume log sources, which are, have been, which in the past have been sort of things that no one would dare, you know, like upload to like other tools because it’d be way too expensive. It now becomes like possible to run a single query that touches them both. And like you can find activity across so many different log sources, including these like really high volume log sources because it’s cheap enough to do that in S3 now.
So, and it’s fast enough I guess like that’s the other thing is these high volume log sources, you can run queries in other tools, but it’s, they’re often slow. But in Scanner it’s really fun to see security teams start to do these extensive searches across really, really high volume log sources that they typically don’t index at all because it’s too expensive. So, and yeah, the, the collaboration that we see people do is often copy pasting like these like perma links that we have to different views and to, to different log events and hunting down like the source of different kinds of, of traffic.
What, what a like strange user strange policy changes if, if MFA gets disabled for a user. Why did that happen? Has it happened before? Did it, has it happened? Like how many times has that happened for this user over the past year? Is there something weird going on nine months ago? Like those are all kind of cool things that Scanner allows you to do that other tools don’t because the the retention period’s low or like the the number of log sources you’re allowed to use is, is, is too low.
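A hedged sketch of the kind of cross-correlation described above: take IP addresses seen in a high-value source (CloudTrail-style login events) and intersect them with addresses seen in a high-volume source (VPC-flow-style records). The record shapes are simplified stand-ins, not real AWS log schemas, and in practice both sides would come back from index-guided scans rather than in-memory vectors.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a CloudTrail-style event of interest.
struct LoginEvent {
    source_ip: String,
    user: String,
}

/// Simplified stand-in for a VPC flow log record.
struct FlowRecord {
    src_addr: String,
    dst_addr: String,
}

fn main() {
    let logins = vec![
        LoginEvent { source_ip: "203.0.113.7".into(), user: "alice".into() },
        LoginEvent { source_ip: "198.51.100.9".into(), user: "bob".into() },
    ];
    let flows = vec![
        FlowRecord { src_addr: "203.0.113.7".into(), dst_addr: "10.0.1.15".into() },
        FlowRecord { src_addr: "192.0.2.44".into(), dst_addr: "10.0.1.20".into() },
    ];

    // IPs that attempted console logins (the low-volume, high-value source).
    let login_ips: HashSet<&str> = logins.iter().map(|e| e.source_ip.as_str()).collect();

    // Flow records whose source IP also shows up in the login events:
    // the overlap between the two log sources.
    for flow in &flows {
        if login_ips.contains(flow.src_addr.as_str()) {
            let users: Vec<&str> = logins
                .iter()
                .filter(|e| e.source_ip == flow.src_addr)
                .map(|e| e.user.as_str())
                .collect();
            println!(
                "{} reached {} and also attempted logins as {:?}",
                flow.src_addr, flow.dst_addr, users
            );
        }
    }
}
```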
Tobias Macey
And that correlation aspect is interesting too because there are a number of other security focused products that I’ve seen that are oriented around graph structures and being able to do automated relationship discovery between different events because most of the attacks that are actually going to have an impact are going to be multi-stage. If it’s not just a brute force denial of service or I’m just going to scan your whole website, then it’s going to be something where somebody is putting in the effort because you’re a high value target. And so they’re actually going to be doing it in multiple stages, possibly with a long time delay between them.
And I'm curious, what are some of the ways that you think about being able to identify and surface some of those types of correlations that do require multiple hops before an exploit is actually realized?
Cliff Crosland
Yeah, that's an awesome question. It's really interesting to see how these kinds of attacks evolve. You'll have someone who is able to reset a password on a particular user, and you might see some kind of low-urgency detection event about that. What that user will do after that is start to create more AWS IAM roles, and the names look pretty convincing, but they're all associated with the original user. So yes, it does start to look like a graph structure. But then you start to see those IP addresses show up in other places: in VPC Flow Logs, in Okta logs, or maybe they're starting to hit your API. With Scanner it's really fast to build those relationships as you're running queries, because you can very quickly see a given name, a given IAM user, a given IP address, a given AWS access key across many different log sources at the same time.
But that sort of thing, in Scanner, you do have to build yourself by executing queries, as opposed to us generating it. I do think that's a really cool future direction to go and do some research in, where you could actually do multi-hop queries as a result of a detection. You could actually build this with Scanner today by running a detection event, which then pings a SOAR tool, an automation tool like Tines, which could then execute a query in Scanner and then, given those results, execute maybe another query in Scanner.
So maybe you could actually do some of these things. But we do think the ability to search really quickly, to follow these multiple hops from one bit of scary activity to the next, is really important. And so while Scanner doesn't automatically generate those relationships, it's possible that we will in the future. We definitely make it totally doable with really fast search, by executing searches really quickly and making it so you don't have to wait five minutes between queries; it's more like five seconds, and you can jump quickly from place to place to find what an attacker is doing.
But yeah, I think that's a really cool direction to go in, and it could be something that we work on and make a feature at some point, but we'll see.
Tobias Macey
And from that iterative discovery process too, just from a user experience perspective, I'm curious what that looks like. I've run a query, I get a result; a lot of times when you're working in a SQL editor, for instance, you run your next query and it completely blows away your previous result. I'm wondering what that looks like in Scanner as far as being able to do that iterative process of building up more complex correlations, or being able to say, okay, well, this was the result here, I'm going to store that to the side because I need to find this other piece of information.
And then being able to go back and forth and collectively build up a more complete picture of what you’re trying to discover.
Cliff Crosland
Yeah, absolutely. I think one of the interesting things about Scanner, and something I find really annoying about using SQL in a lot of these tools, is that when you're doing an investigation with a SQL tool, you have to jump around and say, I'm going to look in this table, which might be this log source in this AWS region, and that's what this table contains. Then I have to run the query again on a totally different table. One of the really cool things in Scanner is you can say, cool, I have this query and I want to see the results for this log source.
And now I have some data that I've discovered from that log source that's interesting, and I can just continue to edit that query to render both of those things together in the same result table. I can say, I want this kind of thing from this log source, and please show me results from this other log source as well. You can start to build these stats tables if you want, or you can just render them all as raw search results. But you can very easily combine the search results from many different things all together into the same view.
And I do really like the idea of being able to have multiple queries going side by side or something like that. We do have tabs: as you drill down into something, you can execute the search, open that particular search, and look at the context in a new tab. But another thing you can do is just continue to extend your query; you don't have to do things like select one specific table as the only thing that can render results.
You can really gather results from many, many different log sources. So we often see: hey, please look for this email address or this file hash, wherever it is. I actually have no idea; we have 50 log sources, and I don't know where this file hash appears, or this domain or this IP address. Just show me everything. And then you can start to gather everything together into the same view. I think this search index experience just feels way more flexible than SQL, in our opinion. That's why we really wanted to replicate the experience that you get using a search index.
But on top of S3 now, which makes it a little bit easier and faster to gather all of the data that you need.
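A minimal sketch of that "one result view across many log sources" idea: gather hits from several sources, tag each with its source, and merge them into a single time-ordered table. The types are illustrative assumptions, not Scanner's actual result representation.

```rust
/// One hit in the combined result view, regardless of which source it came from.
/// Illustrative only.
#[derive(Debug)]
struct ResultRow {
    timestamp: u64,
    source: &'static str,
    message: String,
}

fn main() {
    // Pretend these came back from searching different log sources for the
    // same indicator (an email, a file hash, an IP address, ...).
    let mut rows = vec![
        ResultRow { timestamp: 1_700_000_120, source: "okta", message: "login challenge for alice".into() },
        ResultRow { timestamp: 1_700_000_050, source: "aws_cloudtrail", message: "ConsoleLogin from 203.0.113.7".into() },
        ResultRow { timestamp: 1_700_000_200, source: "github_audit", message: "org policy changed by alice".into() },
    ];

    // Single combined view, ordered by time, across every source at once.
    rows.sort_by_key(|r| r.timestamp);

    for row in &rows {
        println!("{:>12}  {:<15}  {}", row.timestamp, row.source, row.message);
    }
}
```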
Tobias Macey
Given your focus on search as the core product experience. When are you gonna jump on the hype train and run everything through vector embeddings and do semantic and similarity search?
Cliff Crosland
We have played with that a little bit, actually. The thing that's kind of interesting is that the scale is brutal when you're generating a terabyte of logs per day and trying to build vector embeddings for all of that. One way to play with this would be in detections. It might be something like: here is the vector space, and here are some regions where evil things are going on, but it's a little bit hard to build concrete rules to describe them. It could be really fun to say, all right, let's take all my traffic, create vector embeddings for this really massive amount of log data, and then see if it falls into one of those regions of the vector space that seem evil.
This seems like a threat. But it's just extremely slow right now; a terabyte of logs per day is a brutal amount of stuff to derive embeddings for. Something that might be a fun future research direction is something that's lower volume but still really important. When a detection rule goes off, it creates a detection event, and you build a bunch of detection events out; but the number of detection events you have, instead of billions or tens of billions or hundreds of billions a day, might be thousands.
And that is totally viable: to do something like generate vector embeddings and then go and see if these detection events, which might all be low urgency, combine together into something that's scary. That would be really fun to play with. It is exciting; vector databases, and basically trying to search semantically as opposed to with very concrete rules, would be awesome. It's also just tough to do at scale in terms of expense and speed; keeping up with the ingestion would be really painful. But there are some cool approaches that we may play with in the future.
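For the embedding idea at detection-event volume, here is a hedged sketch: given vector embeddings for a handful of low-urgency detection events (how the embeddings get generated is left out, and the vectors below are made up), flag pairs that land unusually close together in the vector space. It is plain cosine similarity over small vectors, purely for illustration.

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

fn main() {
    // Hypothetical embeddings for low-urgency detection events; at thousands
    // of events per day this is cheap, unlike embedding the raw
    // terabyte-per-day log stream.
    let detection_events = vec![
        ("password_reset_unusual_ip", vec![0.9_f32, 0.1, 0.0]),
        ("new_iam_role_created", vec![0.85, 0.15, 0.05]),
        ("marketing_site_healthcheck", vec![0.0, 0.2, 0.95]),
    ];

    let threshold = 0.95;

    // Pairs of events that look semantically similar may be worth combining
    // into a single, higher-urgency signal for a human to review.
    for i in 0..detection_events.len() {
        for j in (i + 1)..detection_events.len() {
            let (name_a, vec_a) = &detection_events[i];
            let (name_b, vec_b) = &detection_events[j];
            let sim = cosine_similarity(vec_a, vec_b);
            if sim > threshold {
                println!("{name_a} and {name_b} look related (cosine {sim:.2})");
            }
        }
    }
}
```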
Tobias Macey
And it could also be interesting for outlier detection as well, because in the vector space, a majority of your logs are actually probably going to fit within that similarity search, but it's when something is anomalous and outside of that bound, beyond a certain threshold, that it becomes interesting and requires further investigation.
Cliff Crosland
Yes, absolutely. Anomaly detection could work; you just might have to reduce the resolution to something manageable by a vector database. And that's definitely the direction things are going. I feel so bad for all these security teams and everyone dealing with 15 terabytes of logs per day and things like that: everything breaks and everything is hard at that volume. But that could be a fun problem for us to solve in the future.
Tobias Macey
And as you have been building Scanner, working with your customers, getting deeper into this security space and the detection of these security events, what are some of the most interesting or innovative or unexpected ways that you’ve seen Scanner applied?
Cliff Crosland
Yeah, I think the most interesting thing I've seen people do with Scanner is cross correlate across many different data sources, and use creative log sources to do it. There are so many different things people do that are surprising, but the most innovative is hunting through very different log sources and finding ways to track users, or threats, that span many of them at once.
So it's doing things like looking at GitHub activity, say the policies changing for an organization. Who is that user? Has that user made AWS API calls in the past? What kinds of API calls are they? The ability to cross correlate, very quickly search through many different log sources, and view those results all together starts to paint a really clear picture about a threat, or a non-threat, or ways you could make your policies a little less strict, or more strict, based on what you're seeing in your log data, what tends to trigger your detections and what doesn't.
So yeah, some of the coolest things we've seen people do come down to jumping around from log source to log source and adding really interesting new log sources to the tool. In addition to security logs, they'll jump into application logs to look for activity from users, and then into VPC flow logs. It's a huge diversity of log types that people are searching through.
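As an illustration of the pivot Cliff describes, here is a tiny sketch that joins a GitHub audit-log event to the same identity's recent AWS CloudTrail calls. The records, field names, and matching on a bare username are simplifications invented for this example; a real investigation would run these as searches against each log source.

```python
# Hypothetical sketch: pivot from a suspicious GitHub audit-log event to that
# identity's recent AWS API calls. Records and field names are invented to
# roughly resemble GitHub audit and CloudTrail logs.
from collections import defaultdict

github_events = [
    {"actor": "alice", "action": "protected_branch.destroy", "repo": "org/infra"},
]

cloudtrail_events = [
    {"user": "alice", "eventName": "CreateAccessKey",   "eventTime": "2024-05-01T12:03:00Z"},
    {"user": "alice", "eventName": "PutUserPolicy",     "eventTime": "2024-05-01T12:05:00Z"},
    {"user": "bob",   "eventName": "DescribeInstances", "eventTime": "2024-05-01T12:06:00Z"},
]

# Index AWS activity by identity so each GitHub event can be expanded into the
# actor's recent API-call history.
aws_by_user = defaultdict(list)
for event in cloudtrail_events:
    aws_by_user[event["user"]].append(event)

for gh in github_events:
    related = aws_by_user.get(gh["actor"], [])
    calls = [e["eventName"] for e in related]
    print(f"{gh['actor']} did {gh['action']} on {gh['repo']}; recent AWS calls: {calls}")
```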
Tobias Macey
And in your experience of building the Scanner product and figuring out how to make it work for the teams that you’re selling to, what are some of the most interesting or unexpected or challenging lessons that you’ve learned in the process?
Cliff Crosland
I think one of the most interesting lessons we've learned is that integration is really, really painful. People might feel comfortable spinning up a project and building their own pipeline to start building their own data lake, but when you get down into the details, when someone has fifty or a hundred different log sources, it's extremely nice to take as much of that burden off their shoulders as possible. The search and the indexing and the detections are really cool and are the most technical aspects of what we do.
But one of the things we've discovered is that reducing the friction of integration makes an enormous difference. It's maybe the fastest way you can improve someone's life. Instead of every log source requiring a new parser, a new transformer, a new pipeline, if you can say, cool, you connect them into S3, you don't have to transform them, you don't have to get them to fit a SQL schema, we'll take it from here, that is an extremely powerful thing. So one of the most interesting challenges has been figuring out how to reduce integration friction as much as possible, so people can go back to investigating, querying, and playing with their logs rather than spending time spinning up yet another log source and trying to integrate it.
We basically want to take integration entirely off your shoulders and handle whatever log input you send our way.
Tobias Macey
And for people in the security space who are trying to figure out which events they need to care about, what are the cases where Scanner is the wrong choice?
Cliff Crosland
I would say that Scanner is not what you want to use if you're running sophisticated, large SQL queries with many joins, the kind of query that runs for a few hours and generates a big report. People often do this when they compute things like the sum of all the bytes transferred across the network from this VPC to that VPC, or from their network out to the internet. Those kinds of queries take a long time and really take advantage of how sophisticated SQL can be. The business analytics queries you might run in a tool like Snowflake are really great there.
But Scanner is really meant for extremely fast investigations and fast needle-in-a-haystack search with basic kinds of aggregations. So instead of every query taking an hour, Scanner takes a few seconds. But if you want to do something more sophisticated, like computing input and output balances for different hosts, or if you're looking at financial transactions and trying to see whether anyone's balance has gone negative over a week-long or 24-hour period, you're running a big aggregation-type query.
SQL would definitely be what you'd reach for there, and you wouldn't want to use Scanner for that. Scanner is much better for threat hunting, finding activity, and zooming in on it quickly, and not as good at the big aggregation-type queries you might do in another tool.
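For contrast, this is roughly the shape of the warehouse-style aggregation Cliff is describing, sketched here with pandas over made-up records. The column names only loosely follow VPC flow log fields, and this is not a Scanner query.

```python
# Hypothetical sketch of the "big aggregation" workload better suited to a
# warehouse: total bytes transferred per source/destination pair. The data is
# made up and the column names only loosely follow VPC flow log fields.
import pandas as pd

flows = pd.DataFrame(
    [
        {"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.8",    "bytes": 1_200_000},
        {"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.8",    "bytes": 800_000},
        {"srcaddr": "10.0.2.8", "dstaddr": "203.0.113.7", "bytes": 5_500_000},
    ]
)

# The warehouse-style question: who is moving the most data, and to where?
totals = (
    flows.groupby(["srcaddr", "dstaddr"], as_index=False)["bytes"]
    .sum()
    .sort_values("bytes", ascending=False)
)
print(totals)
```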
Tobias Macey
As you continue to build and iterate on the Scanner product and work with your customers, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
Cliff Crosland
Yes, everyone really wants an API, and we're really excited about that. A lot of people have many different log tools, or there are a lot of features they want out of a tool, but they want those tools to integrate well and play well together. So we're really excited about building an API that can do two things in particular. One is to let you run ad hoc searches very quickly; the other is to take advantage of the aggregation caching system we've built for detections and make it possible to run API queries that can power dashboards really quickly.
Basically, it generates these time series aggregation values for detections, which is really helpful for rendering dashboards. We want to make it very fast for you to use the API, whether you're in Tableau or Grafana, to look at your S3 data, build dashboards incredibly quickly, and run queries in Scanner incredibly quickly, so you don't have to jump into Scanner to build the dashboard. Your dashboards can live in many different places and Scanner can fit in with what you already do, or you can even drive it directly from Slack.
If you get a detection event from Scanner, you might write your own bot, or integrate with another bot, where you can say, cool, now take the next step: run this query in Scanner, please, and render the results for me here in Slack. Those kinds of things. That's the next big frontier for us, building out this API so Scanner can be the core search and aggregation tool over your logs, one that integrates with tons of different tools and lets you build on top of it. That's by far the biggest thing.
Everyone's also asking for dashboards as well. That's something we want to build, but it's possible people will just build all their dashboards on top of our API. So yeah, that's the next big frontier for us. And then also increasing the number of data connectors we have, so you can very easily pull data from many different log sources into your S3 buckets and make onboarding new log sources as easy as possible.
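To show the integration pattern being described, here is a purely hypothetical sketch of a bot or dashboard backend calling a log-search API and rendering the results elsewhere. The endpoint, parameters, and response shape are invented for illustration; the Scanner API discussed here was still being built at the time of this conversation.

```python
# Purely hypothetical sketch of the integration pattern described above: call a
# log-search API from a bot or dashboard backend and render the hits elsewhere.
# The endpoint URL, auth, parameters, and response shape are all invented; this
# is not Scanner's actual API.
import requests

def run_search(query: str, start: str, end: str) -> list[dict]:
    resp = requests.post(
        "https://api.example.com/v1/search",          # placeholder endpoint
        headers={"Authorization": "Bearer <token>"},   # placeholder auth
        json={"query": query, "start_time": start, "end_time": end},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

if __name__ == "__main__":
    hits = run_search(
        'eventName:"CreateAccessKey"',
        "2024-05-01T00:00:00Z",
        "2024-05-02T00:00:00Z",
    )
    # A Slack bot or Grafana panel would format these however it likes.
    for hit in hits[:5]:
        print(hit)
```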
Tobias Macey
Are there any other aspects of the work that you’re doing at Scanner, the overall space of security log analysis and threat discovery that we didn’t discuss yet that you’d like to cover before we close out the show?
Cliff Crosland
I think that’s probably good.
Tobias Macey
For anybody who wants to get in touch with you and follow along with the work that you’re doing, I’ll have you add your preferred contact information to the show notes. And as the final question, I’d like to get your perspective on what you see as being the biggest gap in the tooling or technology that’s available for data management today.
Cliff Crosland
Yeah, I think the biggest pain everyone experiences is integration: trying to get massive amounts of data into the right shape and the right destinations, and also making it visible. A lot of tools and a lot of cool functionality exist to get data into a place like S3, which is probably where everything should be because it scales well, but that data often becomes invisible and very hard to interact with. So getting really great visibility, search, and queryability over massive amounts of data at low cost is a huge problem, and that's a big reason why we're working on Scanner.
Tobias Macey
Alright, well thank you very much for taking the time today to join me and share the work that you and your team are doing at Scanner. It’s definitely a very interesting product. It’s great to see you working on making the discovery and exploration of security threats easier and faster for security teams. So I appreciate all the time and energy that you’re putting into that and I hope you enjoy the rest of your day.
Cliff Crosland
Awesome. Tobias, really appreciate it. Thank you so much.