December 11, 2025

Cloud Security Podcast: SIEM vs. Data Lake: Why We Ditched Traditional Logging?

Scanner CEO and Co-founder Cliff Crosland joins Cloud Security Podcast host Ashish Rajan for an in-depth conversation about the Security Data Lake journey, overcoming typical challenges, and preparing for the future.

Listen to the full episode here.

Ashish Rajan: If you have been wondering what it is like to build an in-house security data lake, well, this is the episode for you. I got to speak to Cliff Crosland from Scanner.dev, who tried doing this, failed, learned a few lessons, and is now sharing it over here: what he found through the journey of using SIEMs, SQL engines, normalization, and the number of log sources you have to care about.

Now, I don't want to deter you from looking at a data lake. I definitely see today, with AI being so prolific in a lot of organizations, that whether it's the engineering team or the security team, they are all looking at building security data lakes or data lakes in general. So if you want to be able to tap into that and see what that could look like for your organization, especially perhaps if you're sick of your expensive SIEM or of not being able to store enough logs because of volume-based pricing... Whatever your excuse may be, I think you'll enjoy this episode.

If you know someone who is planning for what could be the future of security operations without a SIEM and building a security data lake, we also covered some of the challenges. So do share this episode with them so they can get a full understanding of what it takes to build a security data lake, in terms of teams and the challenges they would face as they walk along that path.

And finally, if you have been listening to or watching an episode of Cloud Security Podcast for the second or third time, thank you so much for supporting us. I would really appreciate it if you could take a second to hit that subscribe or follow button. It really helps us grow and helps more people find us, so that we can inform them about making the right calls about these technologies and building an overall good cloud security posture and program. Thank you again for your love and support, and I will see you and talk to you in the next episode. Peace.

Ashish Rajan: Hello and welcome to another episode of Cloud Security Podcast. I've got Cliff with me. Hey, man. Thanks for coming on the show.

Cliff Crosland: Thanks so much. It's a pleasure to be here.

Ashish Rajan: Man, I'm excited for this. First of all, maybe can we start with a bit of your background as well? What have you been up to? Where are you today?

Cliff Crosland: Yeah, I'm the co-founder of a very fun database-focused startup called Scanner that does a lot of fun stuff with data lakes. My background is that I'm obsessed with low-level system performance, and I have a love-hate relationship with the Rust programming language. My co-founder and I have worked together for a long time across a couple of different startups. The last one was acquired into Cisco, and we were always facing interesting security and observability challenges at massive scale. I don't get to code—my team here at the company doesn't let me code as much as I want to anymore. But that is my bread and butter, and my passion involves going crazy with systems-level programming.

Ashish Rajan: Fair enough. I think when we were talking initially about this, I remember you telling me about your experience of building a data lake yourself. And I guess the question that I have for you is, it almost seems like a theme today that a lot of people—and I'm kind of guilty of this as well—when I was a CISO, I was thinking about every time I thought of a SOC, I would think about, hey, I need to have a SIEM. Now, I'm going to use the word "traditional SIEM" considering now we are in this AI world. So let's just say, why are people switching from traditional SIEMs to data lakes now?

Cliff Crosland: It's really interesting. I think basically the world of logging and data volumes has changed dramatically. We have this problem as well where our traditional SIEM—we were using Splunk at the time at the prior startup—the data volumes... It's very easy to get terabytes of logs per day. And traditional SIEMs were wonderful in the era when you had maybe individual gigabytes or maybe tens of gigabytes of logs per day. But now that we're in a very containerized world, everyone's using many different SaaS tools and services, log volumes just get massive.

And then it just becomes impossible to keep all of the logs that you want in your SIEM. And so a lot of people start to move data at scale over to data lakes, because they're just wonderful for managing massive log volumes and scaling basically forever. We really think that's the future of where logs should go. Once you reach certain log volumes, it's only economically feasible to scale and capture all of the data that you want in a data lake; it's not economically feasible in a SIEM. You have to decide which log sources to cut out, which logs to pare down, etc. So yeah, that's a huge reason why we moved to a data lake and why other people are as well.

Ashish Rajan: Economical in the sense of cost and visibility and all that as well?

Cliff Crosland: Yes. It comes down to a lot of the challenges of SIEMs and the limited log volume that they can ingest and retain. They're great. They're complex. They're awesome. They have a lot of really robust features. But once you are ingesting something like multiple terabytes of logs per day, it becomes extremely painful economically to retain everything, and so you start to drop a lot of log data. A data lake gives you a lot more visibility because you can capture all of the logs that you could ever want, store them in the data lake, and keep them around forever. It's just so much cheaper. You can really capture all of the log sources that you want to, so you can get visibility into everything rather than making hard choices about a finite set of log sources to capture in the SIEM. So yeah, visibility is another big reason why people move to data lakes.

Ashish Rajan: So your first prototype version that you had tried building—and we were talking about this last time—what was that journey like? The whole SIEM to... all the big surprises you found along the way as well?

Cliff Crosland: Yes. The prior startup where my co-founder and I worked was a lot of fun. Our log volume exploded rapidly. We were in charge of both observability and security at this startup, and our traditional SIEM basically immediately hit its volume license thresholds as the log volumes grew. Increasing our license would have been more expensive than the entire budget for the engineering team.

So what we ended up doing is redirecting the vast majority, 90% plus, of our log data over to S3 buckets. And we're like, "Cool, we have Athena. Let's go query that." And that worked for small datasets and looking at maybe a day of data. But it was really sad; basically, it became a bit of a black hole where you couldn't really search through very much data. Once the dataset became large, querying that data lake in S3 became more and more painful over time. And as we added more and more log sources, trying to get them to fit into a SQL table schema just became extremely laborious and painful. So it was awesome from a cost perspective, but really painful from a usability perspective, for sure.

Ashish Rajan: I mean, I guess the surprise being the engineering overload that you're putting yourself through with this. Because I imagine it's not for everyone, right? Like, for example, is there usually a breaking point when people realize? Because a lot of people, in my mind, have always gone down the path of either finding a log aggregator or a SIEM, and that becomes the collection point for all the logs that you care about, and you build threat detection on top of it. That's been the standard for a while. Do you almost need to be an engineering team to build a data lake? Or, I guess, what's the breaking point when people suddenly decide, "Okay, I think I can't do this"? To your point, it may have been that, hey, S3 buckets are good for one day of logs, but maybe not for one year or the 90 days of logs that we require for regulatory reasons.

Cliff Crosland: Yeah, that's a great question. I think it really is the case that, with the way data lakes are today, most data lake tools are very hard to use effectively. Pushing data into Splunk or Elastic or a traditional SIEM is quite a nice experience, relatively speaking, compared to data lakes. You don't really have to do a lot of massaging of log schemas and transformation of data. The traditional SIEMs are quite good at just making sense of your data, making it all searchable, and so on.

But yes, if you're like, "My team wants more visibility into more log sources, maybe my retention window is not long enough to really do effective investigations, maybe I want to do threat hunting, and really the only feasible way for me to get all of my logs covered would be to use a data lake. Let's start working on that." It is a lot of data engineering to get that to work properly.

I think basically you have to really understand every log source and do custom manual work per log source to fetch it from your tools and transform it into a schema that fits. There are a couple of more popular schemas for data lake SQL engines; OCSF is one. But basically every log source is quite a bit of work, and you will be on this endless journey of maintaining a data lake forever. We think this is changing, though. There are a lot of really cool new data lake technologies coming out and a lot of innovation there. We're still in the early days.

But yeah, I think for some teams, if you have a lot of engineering resources, and maybe you have other data engineering teams at the company, they might love this project of building out a data lake, creating all these tables, and sharing the data with other teams who might really appreciate visibility into the log sources typically used by security teams. But it is a heavy lift to get a data lake to work well. We think there are some cool examples of teams who have done this well and new technologies that are coming out. We really care about making data lakes easier; that's what we think the future of logs looks like: data lakes that are easier to use, faster, and more powerful. All of that is happening now, so hopefully it won't be as painful as it has been recently. But for now, with the most common data lake tools, it is annoying. It is a big engineering lift to get your data lake running.

Ashish Rajan: I guess you gave the example of Athena as well that you had used, which, to what you said, maybe it's great for storing but maybe unusable? I think to your point you had a few examples, just talking about the speed and everything. So, I guess in my mind, you started with a SIEM, you started pushing logs to an S3 bucket, and then you realized, "Oh, great for one day but not for 90 days or whatever." Now you're like, "Okay, I'm querying it with Athena, so maybe I can do the query better." What's the challenge there as you get to that stage three?

Cliff Crosland: Yeah, that's a great question. So Athena is basically Apache Presto, Apache Trino under the hood probably. Those are some of the most common data lake tools out there that are all SQL-based. And it's good at querying S3 buckets. And so what we did in our original data lake is we just started dumping huge amounts of log data into S3, storing them forever, and we tried to use Athena to go and run queries on them.

One of the challenges, and this happens a lot with data lakes, is that when you execute a query, unless you are querying something that is perfectly suited to the way you've organized your data, partitioned it into different folders, and maybe perfectly indexed it in your data lake, the queries are going to be extremely slow.

So we'd typically do something like, "Let's go search the last 30 days for some activity." And then the query would take three hours to run and might cost a few hundred dollars. So even though in theory we could run investigations on this data lake and go and scan through the S3 buckets where all of our log data was, in practice Athena was almost unusable unless we were querying just a small amount of data, like the past 24 hours. If we wanted to search larger datasets, it just does a bit of naive S3 scanning. If your data isn't perfectly columnar and you're trying to do text search, it's kind of impossible to use.

I think Athena is probably good for business transactional data that's very columnar, very spreadsheet-like. But a lot of security logs can be much messier. They can be deeply nested JSON or lots of text, like PowerShell command line text and so on. That's where the typical SQL-based data lake engines really break down. And so it was basically, "Well, maybe we'll use Athena every now and then," but it's almost unsuitable for day-to-day use; we'll touch it a few times a year. It's very cheap to store the logs in S3 and use Athena to query them if we ever need to, but it's basically unusable for going and getting visibility and searching.
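To make the Athena pattern above concrete, here is a minimal sketch, in Python with boto3, of the kind of query workflow Cliff is describing. The database, table, partition column, and results bucket are all hypothetical; the point is that restricting the query to a date partition is what keeps Athena from naively scanning every object in the bucket.

```python
import time

import boto3

# Hypothetical names: a Glue/Athena table over date-partitioned security logs in S3.
DATABASE = "security_logs"
RESULT_LOCATION = "s3://my-athena-results/queries/"  # placeholder results bucket

athena = boto3.client("athena")

# Filtering on the partition column (dt) lets Athena prune S3 objects.
# A bare text filter over months of unpartitioned JSON is what made the
# queries in the story above take hours and cost hundreds of dollars.
query = """
SELECT dt, source_ip, user_name, raw_message
FROM cloudtrail_events
WHERE dt BETWEEN '2025-11-01' AND '2025-11-30'
  AND raw_message LIKE '%203.0.113.42%'
LIMIT 100
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULT_LOCATION},
)
query_id = resp["QueryExecutionId"]

# Athena is asynchronous, so poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```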

Ashish Rajan: And to put this into context, at a security team—let's just say a security operations team—you're probably looking at a huge amount of data for every incident you're reviewing. And you don't have the time to just say, "I'm going to give this much of a query and hopefully I don't have to wait an hour," because that's a long time to wait for an incident to be reviewed when you've already identified there's an incident. Are there other things that happen as well? Like, do people actually keep all the logs? Because it sounds like if it's too expensive, people shed logs as well.

Cliff Crosland: Yes. I think basically this is the journey that people go on: Okay, cool, I've got all my logs heading to my SIEM. This is great. Oh my god, this is getting to be really expensive. And either my SIEM is crashing a lot or I basically am hitting volume limits. And unless I have a few million more dollars to spend on my SIEM, I'm going to have to start shedding logs.

So a classic example, and I think Cribl's popularity speaks to this, is Cribl as a data pipeline system sitting in front of SIEMs. That's often one way people use it. They'll use Cribl to delete fields from logs, filter them down, or sample them down, trying to keep the log volume down to avoid spending too much on logs and on ingestion volume costs in their traditional SIEM.

And so then people start deleting logs. And really, sampling and deleting logs is kind of okay for SRE teams or observability use cases, because there you're using these logs to get a sense of the health of the system, so getting sampled data is all right. But for security teams, it can be pretty terrifying to say, "Well, I'm only keeping like 20% of my log data, and the threat actor activity and the IOCs that I care about, like malicious IP addresses, are invisible to me." I want to keep everything. I want full fidelity. I want to be able to find any activity. Even one event can be really important in security.
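For illustration, here is a small Python sketch of the kind of reduce-before-the-SIEM step being described: dropping verbose fields and sampling noisy events. This is not Cribl's actual configuration, and the field and event names are made up; it just shows where the fidelity loss that worries security teams comes from.

```python
import json
import random

# Hypothetical "shed logs before the SIEM" step: drop heavyweight fields and
# keep only a sample of noisy, low-value events to cut ingestion volume.
DROP_FIELDS = {"userAgent", "requestParameters", "responseElements"}
NOISY_EVENTS = {"Decrypt", "AssumeRole", "GetObject"}
SAMPLE_RATE = 0.2  # keep roughly 20% of noisy events


def reduce_event(event: dict) -> dict | None:
    """Return a slimmed-down event, or None if it was sampled away."""
    if event.get("eventName") in NOISY_EVENTS and random.random() > SAMPLE_RATE:
        # This is exactly the full-fidelity loss that is risky for security:
        # the one event tying an IOC to a host may be the one that gets dropped.
        return None
    return {k: v for k, v in event.items() if k not in DROP_FIELDS}


if __name__ == "__main__":
    raw_events = [
        {"eventName": "GetObject", "userAgent": "aws-cli/2.15", "sourceIPAddress": "203.0.113.42"},
        {"eventName": "ConsoleLogin", "sourceIPAddress": "198.51.100.7"},
    ]
    kept = [e for e in (reduce_event(ev) for ev in raw_events) if e is not None]
    print(json.dumps(kept, indent=2))
```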

So yeah, shedding those logs can be, and for us it was this way too, a very annoying and painful experience. Like, "Okay, well, what risk am I willing to accept here? Which kinds of logs, which kinds of threats am I willing to allow to become invisible to me?" Making those choices between logs can be very scary and painful. And we think the future looks like keeping them all, but making it much more accessible to go and search them. There are lots of cool technologies coming out to make data lakes better that way. But yeah, it is an annoying problem that security teams have to face there.

Ashish Rajan: So what's the misconception then? Because it sounds like, to your point, it could be easy for people to go down the path of, "Oh, SIEM's too complex, quite expensive." Are there any misconceptions that security leaders have when they approach a data lake as just cheap storage that, to your point, you can query as much as you want?

Cliff Crosland: Yeah, I think that's a great question. One of the things we tend to hear is that a lot of people we talk to are excited about building out a data lake, but they're nervous about how much engineering effort it's going to be. There are some really cool open-source tools that help make this easier. A fun one that we often see is Grove, an unofficial project from HashiCorp that helps you collect logs. There's also Substation from Brex, which is a really great pipeline system. But anyway, one of the misconceptions is that there's going to be an extreme amount of engineering lift to get my logs gathered into the data lake, transformed into a schema that works for me, and then made actually searchable.

It's true that until recently it has been very hard to build a data lake, and that it is a heavy lift. But that's one of the misconceptions that's starting to change: it's becoming easier and easier to just gather messy data into a data lake and have really cool tools that will make sense of it and make it fast to search, detect on, or normalize. There are many options out there. I think in the coming few years we'll see everything moving to data lakes, whether it's security logs or anything else; everything will be moving to object storage and S3 buckets. And then we'll start to see even more cool tools to make data lakes easier to onboard and build.

But yeah, I would say that for security leaders, with the popular data lake tools it will be a big engineering lift. But look for the newer tools out there that may make getting started on the data lake journey much easier.

Ashish Rajan: Sounds like there have been a few generations of this data lake architecture as well. And I guess you've tried a few versions yourself. I'm curious, what are the three or four or however many generations that the data lake architecture has gone through? Like today you're saying it's a lot easier, but what's the transition been? So that people get a sense of which one of those stages they're at in their data lake journey as they're building this.

Cliff Crosland: Yes, I think that's a great question. Just looking at history, it's really interesting to see how things have evolved over time. The original SIEMs, like ArcSight, were based on SQL; they were based on Oracle. And it was a good first step. But one of the problems, again, is that log data and security data can be messy, a hard fit for perfectly structured SQL tables.

And then you saw this generation of SIEMs like Splunk or Elastic, and they were much better at taking messier log data, normalizing it, being a bit more flexible, and searching semi-structured, well-structured, or totally unstructured text. They were good at all of those things. So we saw this transition in SIEMs from the SQL era to full-text search capability.

And we're seeing the same thing I think with data lakes today, where the original technologies that are built for querying data and managing data in data lakes have all been SQL-based at first. So you see tools like Snowflake or Athena and Presto, etc. They're all very SQL-oriented. It's really cool to see more technologies today with data lakes focused on full-text search. That's something that we love at Scanner, where we love the unstructured messy data.

But also there's a really cool example from Apple's security team where they moved from Splunk to a data lake in S3, and they were using Databricks Delta Lake. But then they also built their own custom full-text search using Lucene and Apache Spark together. That's a huge engineering lift; I don't think every organization can do that. But they showed that a SQL-based data lake is a good first step, and then getting full-text search on the messier, semi-structured, unstructured data in your data lake is the next evolution. We're excited to see a bunch of cool technologies there. I think in the future it'll be more turnkey to do that; you won't have to have Apple's engineering resources to pull it off. But yeah, the next generation of data lakes is tools that are just very good at messy, unstructured, semi-structured data.

Ashish Rajan: I mean, I guess turnkey reminds me that there's Amazon Security Lake as well. We've been talking about AWS Athena and S3 buckets. So where does Amazon Security Lake fit into this kind of world?

Cliff Crosland: Yes, I think it's a really good first step. What Amazon Security Lake is good at is that they have a bunch of different log sources that they support out of the box. That's really cool. And what they do is translate them into OCSF, which is this really cool, pretty strict schema that sits on top of all of your security data, so your logs from many different sources can all fit the same schema. They translate it into Parquet files, which are really fast to query because they're nice and columnar.

But there are two problems that we hear people talk about with Amazon Security Lake, and it'll be cool to see if this gets easier over time. One is that for custom log sources, or log sources that aren't in their list of supported log sources, there's again this data engineering lift you have to do to get your messy, weirdly structured logs—every log source is totally different and its schema is always changing, which is annoying—to fit into this very strict schema. And that can be a massive amount of work if you have custom logs you're generating internally that you want to monitor, or if you just have many different kinds of logs that aren't on their list. You're going to have to do that data engineering work. Maybe over time it gets easier to get those logs to fit into that schema. That would be really cool.
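As a rough illustration of the per-source mapping work being described, here is a Python sketch that maps a made-up custom auth log into a simplified, OCSF-flavored record and writes it out as Parquet with pyarrow. Real OCSF classes are much stricter and more deeply nested than this; the field names here are stand-ins.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Simplified, OCSF-flavored target fields (real OCSF Authentication events are
# far stricter and nested); this only illustrates the per-source mapping work.


def map_custom_auth_log(raw: dict) -> dict:
    return {
        "time": raw["ts"],  # epoch milliseconds
        "class_name": "Authentication",
        "activity_name": raw.get("action", "Unknown"),
        "actor_user_name": raw.get("user") or raw.get("username"),
        "src_ip": raw.get("ip") or raw.get("client_addr"),
        "status": "Success" if raw.get("ok") else "Failure",
        "raw_data": str(raw),  # keep the original event for full-fidelity search
    }


raw_events = [
    {"ts": 1733900000000, "action": "login", "user": "alice", "ip": "203.0.113.42", "ok": True},
    {"ts": 1733900005000, "action": "login", "username": "bob", "client_addr": "198.51.100.7", "ok": False},
]

mapped = [map_custom_auth_log(e) for e in raw_events]

# Columnar Parquet is what makes the lake cheap to store and fast to scan later.
table = pa.Table.from_pylist(mapped)
pq.write_table(table, "auth_events.parquet")
```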

So that's the first problem: the data engineering lift if your log source isn't supported. The second problem is full-text search again. If you have things like command line text from EDR logs, like PowerShell commands, and you're trying to dive in, do substring search, and really understand messier log data, that is unfortunately still quite slow in the data lake. It's not really designed for that. Amazon Security Lake is really designed for great columnar, SQL-friendly data. But if you can't get your data into that format, or if your data just by nature isn't very SQL-friendly, it might be annoying. If Amazon does build out easier onboarding and easier full-text search in Security Lake, that would be great. But that is definitely the direction things should go in: more data in object storage, easier to retain, more scale, etc.

Ashish Rajan: Interesting. And I guess you touched on something really interesting there. Most organizations use custom applications, right? Not that many are just using SaaS applications or standard applications. Everyone's using custom logs from custom applications they're building in-house, whatever the reason may be. And I definitely find that custom logs are probably the more common pattern you would find when it comes to logging and security operations, rather than standard logs. And yes, to your point, you can have the OCSF format standardized by Amazon. But if 90% of your logs are supposedly custom, then you're back to the engineering part again.

I'm curious about the whole schema and normalization part, because there are so many different kinds of sources in an organization—custom sources... I mean, I guess cloud logs kind of are covered because your OCSF conversion may happen more easily. But to be honest, we're not just looking at cloud logs. We're looking at application logs, enterprise application logs, which are very custom. How easy or difficult is the whole schema normalization thing in this text-based search world that we are all moving towards with data lakes?

Cliff Crosland: Yeah, that's a great question. I think if you are focused on SQL, you really have no choice but to do a lot of work to transform every log source, including your custom application logs. But in the new world, where you have query engines that can handle full-text search on messier log data, that normalization isn't as important.

I think there are some really fun things happening there. If your engine is really good at understanding deeply nested or unstructured or semi-structured data, you may not even need to do normalization. One of the things that we see—and this is a fun new era for all of us—is that if you're using agents, if you're using LLMs to do things like run an investigation on your data lake, they're actually quite good at doing fuzzier kinds of correlation.

Whereas traditionally with a SIEM, you can't do correlation unless you have the schema perfectly mapped. If you're like, "Cool, let's do a correlation and see what this user has done across many services," then every service's logs need to be normalized to have exactly the same column for "user". But in the future, if your system is good at searching messier data, you don't need the column names to be exactly the same. And LLMs are quite good at saying, "Cool, I ran a few different queries. I followed this user all the way through across these patterns. I get the idea that this field is the same as this field in this log source." And maybe the username is just slightly varied from source to source. So yeah, in the future I think what we'll really see is more strength in tools understanding messier data, not forcing everyone to totally clean their data first and make it perfectly normalized. Not only will tools make it easier to search that messy data, but agents will also be able to understand the messy data much more easily and do those correlations for us.

So I still think it is helpful to normalize as much as you can, but I kind of believe in "best effort" normalization. Like adding a handful of normalized fields, not forcing teams to get their logs to be perfectly mapped into a schema. So yeah, I think there's like a happy medium we can all get to. But it'll become less and less important over time, I think.
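A tiny Python sketch of this "best effort" normalization might look like the following: promote a handful of common fields when they can be found under any of their usual names, never fail an event that doesn't fit, and keep the raw payload alongside. The key lists here are illustrative, not a standard.

```python
# Best-effort normalization: promote a few fields most detections care about,
# but never reject an event that doesn't fit; the raw payload is kept as-is.
COMMON_USER_KEYS = ("user", "username", "userName", "actor", "principalId")
COMMON_IP_KEYS = ("src_ip", "sourceIPAddress", "client_addr", "ip")


def first_present(event: dict, keys: tuple) -> str | None:
    for key in keys:
        if event.get(key):
            return str(event[key])
    return None


def normalize(event: dict) -> dict:
    return {
        "user": first_present(event, COMMON_USER_KEYS),  # may be None, and that's fine
        "src_ip": first_present(event, COMMON_IP_KEYS),
        "raw": event,  # full fidelity preserved for text search or an LLM to read
    }


print(normalize({"userName": "alice", "sourceIPAddress": "203.0.113.42", "eventName": "ConsoleLogin"}))
print(normalize({"actor": "svc-backup", "msg": "custom app log with no IP at all"}))
```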

Ashish Rajan: Interesting. And I guess you've said something really interesting there. Forcing everyone to go down a particular path is great when you're not expecting the application to change at all. And obviously most organizations have 300-plus applications. It only takes 10 of them changing their schema, and suddenly you're like, "Oh, I haven't received logs from this particular application for months now. I have no idea what's happening over there." And by the way, your SIEM is not even telling you that you haven't received any logs, because there is no alert for the fact that you're not receiving any logs. That gets quite tricky quite quickly.

Cliff Crosland: Yes. When we talk to users, oftentimes they're like, "Okay, cool, when we had 10 log sources, we were starting the data lake journey. This is fine." But then when they got to 40 or 50 log sources, basically every week at least one of them was misbehaving because the schema changed a little bit. New fields showed up that were important, or fields got renamed, and then suddenly data stopped getting inserted, or there were errors. So every week the team was constantly fighting to get the data to fit into this very rigid structure in the common SQL-based data lake tooling.

So it's kind of funny, because we as humans can see the schema change and be like, "Eh, I get it. I get what this new field means. I understand." What I really think is that tools in the future need to embrace the fact that logs are going to be messy, and that both humans and AI are going to be able to get what the schema means now, and not force everyone to fight a stupid fire all the time getting their logs to be perfect. But it is the case that one of the scary things about ingesting dozens and dozens of log sources is the ongoing maintenance and the pain that you experience. Every week there's another stupid fire to put out. We'll see how it all shakes out in the future, but my prediction is that messiness will be embraced more.

Ashish Rajan: I love that. But do you reckon... you mentioned AI a couple of times as well. I'm curious. There's a lot of AI SOC that's popular on the internet at the moment, and a lot of it focuses on detection. If you were to build a detection pipeline with data lakes, what impact do you think AI would have on it today that people can use? Especially the companies that are very engineering-forward, who've listened to you and go, "Oh yeah, I definitely have the engineering capability. I can go and build a data lake." What should a detection pipeline look like today, all the way from ingestion to enrichment to detection? I guess, what principles would you follow for that?

Cliff Crosland: Yes, that's a great question. I would say that some of the places where AI is most helpful is in its understanding and knowledge of the world and all of the log sources that exist out there. What that means is that you can use AI from the beginning when it comes to ingestion: you can use it to teach you about the APIs that exist, the log sources that exist, the connectors that exist, the documentation that's out there, and get it to write code for you, because it tends to understand all of these log sources better than we do. Instead of doing all the research for every log source on how to collect it and pull it in, you can really make a lot of progress quickly by using AI in Claude Code or Cursor, etc., if you want to build custom connectors. There are also lots of tools that have lots of connectors built in; use those as well. But basically you start by letting AI use its breadth of knowledge to understand what all of the log sources are that you might want to collect and how to connect to them. Gather them all in.
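The connector code an assistant tends to produce for this step usually has roughly the following shape. Everything here is hypothetical: the SaaS audit-log endpoint, its cursor pagination, and the bucket name are made up; the point is just pulling pages from an API and landing them in S3 as compressed NDJSON.

```python
import gzip
import json
import time

import boto3
import requests

# All hypothetical: a generic SaaS audit-log API with cursor pagination,
# written to S3 as gzipped NDJSON, date-partitioned for the data lake.
API_URL = "https://api.example-saas.com/v1/audit_logs"
API_TOKEN = "..."  # in practice, fetch from a secrets manager
BUCKET = "my-security-data-lake"

s3 = boto3.client("s3")


def fetch_page(cursor: str | None) -> dict:
    params = {"limit": 1000}
    if cursor:
        params["cursor"] = cursor
    resp = requests.get(
        API_URL,
        params=params,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def run_once() -> None:
    cursor = None
    while True:
        page = fetch_page(cursor)
        events = page.get("events", [])
        if events:
            body = gzip.compress("\n".join(json.dumps(e) for e in events).encode())
            key = f"example-saas/dt={time.strftime('%Y-%m-%d')}/{int(time.time())}.ndjson.gz"
            s3.put_object(Bucket=BUCKET, Key=key, Body=body)
        cursor = page.get("next_cursor")
        if not cursor:
            break


if __name__ == "__main__":
    run_once()
```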

And then start to use it to enrich and transform the data: give you ideas for what fields you might want to enrich and what threat intelligence feeds you might want to use to enrich that data. That enrichment could be something you do yourself by analyzing the data in S3 and maybe transforming it and saving it as a new file. That can be helpful. But yeah, definitely use AI in the beginning to connect everything and to know what options are available for enrichment.
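An enrichment pass of the kind mentioned here can be as simple as the sketch below: tag events whose source IP appears in a threat-intelligence set before writing the enriched copy back to the lake. The IP set is a placeholder for whatever feed you actually use.

```python
# Hypothetical enrichment pass: flag events whose source IP appears in a
# threat-intel set; in practice the set would be loaded from a TI feed.
MALICIOUS_IPS = {"203.0.113.42", "198.51.100.99"}


def enrich(event: dict) -> dict:
    enriched = dict(event)
    enriched["threat_intel_match"] = event.get("src_ip") in MALICIOUS_IPS
    return enriched


events = [
    {"src_ip": "203.0.113.42", "user": "alice", "action": "login"},
    {"src_ip": "192.0.2.10", "user": "bob", "action": "login"},
]
for e in map(enrich, events):
    print(e)
```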

But then when it comes to search and detections, I think AI can really help you figure out how to normalize your data if you really, really need to fit it into a SQL schema, and most data lake tools require that of you. So that can be helpful. It will get you like 80% of the way there. It'll hallucinate a little bit, so there's still more work to do, but it can speed up your journey of normalizing your data to your table schema in SQL.

But then once it's in a state that you like, using AI to give you detection suggestions is really cool. I think it's quite good at understanding schemas, whether they're messy schemas or like very clean SQL schemas. And because it kind of knows everything that's out there, it can be a really great brainstorming partner.

My opinion for now is that it's not yet ready to be fully trusted with important investigations and [incident] response. It is good at getting started, but I think humans are still very, very much needed in actually analyzing the alert and making sure that the AI's investigation into it, the queries it ran, and the detections it wrote are actually valid and make sense and are useful. So the principle I would say is: really try to leverage AI to give you a quick understanding of your data, what data you should pull in, brainstorm ideas for detections, and do a first cut at investigations when an alert goes off, but not necessarily just hand over the keys and say, "Cool, you're running my SOC now. You're replacing my team." I really don't think that's the future. I think the future looks more like detection engineering where everyone is leveled up and becomes a really powerful detection engineer. I still think humans are essential. I don't think they'll get replaced, but I think we'll all be doing cool collaborative detection engineering work with the AI.

Ashish Rajan: And I guess, to your point, I kind of agree with this. Today when you find a new detection rule that has to be created, you normally put in a Jira ticket or whatever and wait for, say, Ashish to come back from his holiday or wherever he has gone and finally do this. But to what you just called out, I may not need to be a SIEM or Splunk expert to know the SQL query that I need for this detection. I can just ask the LLM and go, "Hey, I'm trying to do this. What do you think is a semi-dirty version that I can put in today?" It helps Ashish to validate, but at the same time, I already have my prototype going. So I'm not waiting, remembering the context, and typing everything out. I agree with this. But have you seen it work at scale as well? Because we just spoke about the scale problem where if you have 10 log sources, great, but the moment you go to 40, 50, hundreds, every time a schema changes, you're back to trying to massage this in whatever way. Do you reckon AI would be better at that scale? Or have you seen any examples of it, out of curiosity?

Cliff Crosland: Yes, it's a great question. It is really fun to see. We've seen it at scale, where people have dozens of log sources and they're starting to get hundreds or thousands of alerts per day, sometimes tens of thousands. But AI can be really good at saying, "Here's a trend of what we're seeing, which is these new fields showing up in your schema." You can even just connect to MCP servers; a bunch of data lake tools have MCP servers. I love it. It's fun. It's exciting. You can basically ask, "What has changed recently in this source versus another?" And it will give you a reasonably strong analysis of that and really speed you along in understanding what's changing in your schemas, so you can make decisions about what new detections you should write, or what should change about existing detections if the schema has changed. Maybe detections aren't valid anymore and don't catch things because they're referencing an old field. It's quite good at noticing that, as long as you're there to ask.

We've seen people do cool automations where, on a schedule, they will have an agent go and query the many MCP servers they have internally, including their SIEM or their data lake, and ask, "What are the trends in our alerts?" It might say, "This alert count has dropped off a lot, so maybe a schema has changed. And actually I've noticed it because I just went and ran a query to check on it, and I can see that the schema is a little bit different. And I would recommend updating the detection in this particular way."
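The scheduled check described here boils down to comparing which fields a source emitted recently against what it emitted before. Below is a hedged Python sketch of that idea; in practice the two samples would come from queries against the data lake or SIEM rather than inline lists.

```python
from collections import Counter

# Schema-drift check: compare the fields a log source emitted this week with
# last week's, and flag renames or disappearances that would silently break
# detections referencing the old field names.


def field_counts(events: list) -> Counter:
    counts = Counter()
    for event in events:
        counts.update(event.keys())
    return counts


def diff_schemas(last_week: list, this_week: list) -> None:
    old, new = field_counts(last_week), field_counts(this_week)
    for field in sorted(old.keys() - new.keys()):
        print(f"field disappeared: {field!r} (detections referencing it may go silent)")
    for field in sorted(new.keys() - old.keys()):
        print(f"new field appeared: {field!r}")
    if not this_week:
        print("no events at all this week: the source itself may have stopped sending")


diff_schemas(
    last_week=[{"userName": "alice", "ip": "203.0.113.42"}],
    this_week=[{"user_name": "alice", "ip": "203.0.113.42"}],
)
```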

But then also, if your alert volume is so massive that it's not possible for humans to really address all the alerts, it can give you really good ideas on how you can reduce the severity level of different alerts to say like, "Ah, these ones probably don't need to wake someone up. Here's a general trend given these 1,000 alerts that went off yesterday that you might want to be cognizant of. Here are the entities like the hosts and the IP addresses that seem to show up a lot in those alerts."

And then for the really critical alerts, it will basically do its best to do a deep dive investigation. But that's when a human really needs to come in and say, "All right, for this critical alert, let me make sure that its analysis is correct and what it has found... let me give it feedback. Let's dive deeply into different parts of the data to really do a full investigation on this really important incident, this really important alert."

So yeah, we've definitely seen it at scale, and I think AI has been very helpful. One last thing that's fun is to see people close the loop on detection engineering, where an alert goes off in their data lake or their SIEM, goes to Jira or Linear or another issue tracker, and then an agent goes and takes a first cut at the investigation. It will then make a recommendation for what the detection rule should probably be. Like, "Maybe this is too noisy and we should tune it a little bit. Add an exception to the detection." And then it will go open up a pull request in GitHub, and the team can just review it and accept it, and their detection rule is now tuned better. It's still really important, I think, for humans to review that and not just say, "Go nuts, agents! Change everything." But it can be such an accelerator for people. So yeah, I think it does work well at scale to get AI involved.
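To picture this "close the loop" pattern, here is a toy detection-as-code rule and the kind of exception an agent might propose in a pull request after triaging a noisy alert. The rule format is invented for illustration and is not Scanner's or any particular SIEM's.

```python
# Toy detection-as-code rule: the detection logic stays the same, and the
# agent's proposed tuning is just a reviewed exception added via pull request.
RULE = {
    "name": "console-login-from-new-country",
    "severity": "high",
    "exceptions": [],
}


def matches(rule: dict, event: dict) -> bool:
    if event.get("user") in rule["exceptions"]:
        return False
    return event.get("action") == "ConsoleLogin" and event.get("country") not in ("US",)


# The agent's suggested change, which a human reviews before merging.
RULE_TUNED = {**RULE, "exceptions": ["terraform-ci"]}

event = {"user": "terraform-ci", "action": "ConsoleLogin", "country": "DE"}
print("before tuning:", matches(RULE, event))       # True: the noisy alert fires
print("after tuning:", matches(RULE_TUNED, event))  # False: the exception applies
```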

Ashish Rajan: I can already see it. To your point, it's the same example I gave earlier, where I identify a detection, but instead of me manually typing and going through the procedure, I let the AI agent do that job and turn it into a pull request that's reviewed by someone. I've given it the right context, it's created the right detection as per the context, whether it's cloud, container, Kubernetes, whatever, doesn't really matter, and it sends over a pull request. That'll be an amazing future to get to.

For CISOs who are probably still in that build versus buy dilemma that most of us go through, where I have an expensive SIEM that I'm paying money for, but I have this cheap storage option: what are the team skill sets people need to have to even consider this? Especially this time of the year, when a lot of people are considering what their 2026 and beyond would look like from a program perspective. Especially if you work in security operations, AI attacks are top of mind for a lot of people as well, and they're like, "Hey, maybe the SIEM is a better option, maybe a data lake is a better option," still tossing the idea around. For future-proofing, do you still believe a data lake is the right decision for people who have that dilemma, whether it's economical or not? And if it is, what kind of skill set should they have in their team to even make that possible?

Cliff Crosland: That's a great question. I don't think it's quite time to totally replace your SIEM with a data lake. Some teams do it and it's great; I do see that quite a bit. But I think one pattern that tends to work well is to say, "Cool, all my logs used to fit in my SIEM a few years ago. Now it's like 10% of them. Let me keep those 10% going to my SIEM. Maybe it's one terabyte a day. And then nine terabytes a day of logs are still being generated and I just have no visibility."

Instead of dropping them entirely, make the first step just storing them in S3 for compliance purposes. And then the question is, "Okay, now that my nine terabytes a day is flowing to S3, should I build my data lake, or should I buy something?" If you have a really strong data engineering team, whether on the security side or elsewhere in your organization, that is already doing a lot of really cool data engineering with data lakes for other purposes like business analytics or even observability, then yeah, you can share that work with them. That can be a fun project. It is a forever project, though. For every log source, you'll always be updating schemas and so on.
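The split described here is essentially a routing decision per log source. A minimal sketch, with made-up source names and without the actual forwarding calls, might look like this: a short allowlist of high-value sources keeps flowing to the SIEM, and everything else lands in S3 for cheap retention.

```python
# Hypothetical routing: a small allowlist of high-value sources stays in the
# SIEM; everything else goes to S3, where the data lake can search it later.
SIEM_SOURCES = {"cloudtrail", "okta", "edr"}


def route(event: dict) -> str:
    if event.get("source") in SIEM_SOURCES:
        return "siem"  # forward to the existing SIEM ingestion endpoint
    return "s3"        # cheap retention for the other ~90% of volume


for e in [{"source": "okta"}, {"source": "vpc_flow_logs"}, {"source": "internal_app"}]:
    print(e["source"], "->", route(e))
```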

And unfortunately I don't think the open-source tooling exists yet to give people that ability. Unless you have a lot of engineering resources like an Apple, getting full-text search on your data lake is still super early. Unless you're willing to do an insane amount of engineering and build your own inverted index or your own custom Lucene fork like Apple did, that's probably not going to happen. So if you want that kind of full-text capability, that kind of messy-data searchability, you will probably have to buy something.

But over time, I think it's just going to get easier and easier. I would say you should probably buy if your team is really SOC-focused and not necessarily data engineering-focused. There are more and more cool data lake tools that can take over the job of building out the data lake. That's the first thing that's happening: lots of tools exist out there to help you gather the data into your data lake. And now I think it's time for tools that make that data, once it's in your data lake, really easy to search, really easy to run detections on, and so on.

But yeah, if you just love data engineering projects, be prepared for a long tail of data engineering: schema tweaks and maintenance forever. And if you don't have that data engineering talent and those resources, you'll probably need to buy something.

Ashish Rajan: Yeah, like don't overkill it I guess.

Cliff Crosland: Yes. Yes.

Ashish Rajan: Fair enough. I mean, those were the technical questions I have. I've got three fun questions for you as well. First one being: what do you spend most time on when you're not working on data lake and cloud and technology and all of that?

Cliff Crosland: Yes, it's my family. We have two young kids. That's definitely it. But the thing I love to do is skiing. It's one of my favorite things in the world. Lake Tahoe, Utah... those are my favorites.

Ashish Rajan: Oh yeah. Awesome. I'm a snowboarder myself but I'll take you... we can still be friends. That's okay.

Cliff Crosland: Yes.

Ashish Rajan: The second question that I have for you is: what is something that you're proud of but which is not on your social media?

Cliff Crosland: Yes. That's a good question. One thing that I'm proud of that was really fun to work on—this is kind of silly—is that I created a JavaScript plugin that simulates what a black hole looks like, a visualizer for black holes. This is a goofy thing, but I love physics. I think physics is super fun. I think this was after Interstellar. You can move your mouse cursor around on a background and see gravitational lensing and warping of space in an image. So that's a silly thing, but it was fun to work on and fun to learn the math for, and I was proud of it. And it got on the front page of Hacker News for a second.

Ashish Rajan: Well, I mean, that would have been amazing. I'm sure Hacker News would have picked it up as well. Final question: what's your favorite cuisine or restaurant that you can share with us?

Cliff Crosland: Yes. I would say that definitely my favorite thing in the world to eat is pumpkin pie. I don't know if this is like...

Ashish Rajan: Does that count as a cuisine or...?

Cliff Crosland: Yes. With Thanksgiving it's definitely been on my mind. It's kind of funny, though: it's a family recipe that we have with way more sugar than the typical pumpkin pie. So I hate all other pumpkin pies, but yeah, I love me some pumpkin pie, especially if you dump in a very generous heap of sugar.

Ashish Rajan: Fair enough. It is that time of the year as well, autumn and winter, so we're into the Halloween and Thanksgiving season. Rightly so. Thank you for sharing that, and thank you for spending time with us. Where can people connect with you and find out more about Scanner.dev and the other things you're working on from a data lake perspective, even if they just have questions about data lakes they're building?

Cliff Crosland: Yeah, for sure. We're pretty active on LinkedIn. Hit us up at Scanner.dev. I love DMing or chatting with people in comments. People have amazing conversations on LinkedIn about data lakes and SIEM these days. It's a really cool time; with SIEM, everyone thinks there's a big shift afoot. So if you have cool ideas for what that looks like, I would always love to chat there. Or reach me on X/Twitter; CliftonCrosland is my handle. It's always fun to see the very amazing, very cool technologies that people are bringing to this problem. So happy to chat.

Ashish Rajan: I would put the links in the show notes as well. But thank you so much for spending time with us and sharing all that as well. And thank you for everyone tuning in. I'll see you next time.

Cliff Crosland: Thanks so much.

Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, which provides fast search and threat detections for log data in S3. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it's mostly love these days.