CyberWire Podcast: Why Security Data Lakes are Ideal for AI in the SOC

Dave Bittner (Host, CyberWire Daily) and Cliff Crosland (CEO, Scanner) discuss why Security Data Lakes are becoming the foundation for AI agents operating in the SOC. To be truly effective, AI models need as much context as possible. In this podcast, they discuss:
- The need for long-term complete security data
- Accelerating SecOps with AI for investigations, detections, automating alert triage, and more
- Performance challenges with AI agents and overcoming them with high-speed queries
- The power of AI agents and human analysts operating together
Dave Bittner: Cliff Crosland is CEO and co-founder at Scanner.dev, and in today's sponsored Industry Voices conversation, we discuss why security data lakes are ideal for AI in the SOC.
Cliff Crosland: The most common way that people construct data lakes is to store significant amounts of messy data in object storage buckets.
So this might be AWS S3, Azure Blob Storage, or Google Cloud Storage buckets. It's basically storage locations that can scale forever. And they tend to interact with this data with various kinds of SQL-based engines. That's the most common way to query this data. But yes, the most common way that people build their data lakes is to build them on top of object storage.
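A toy sketch (not from the conversation) of the schema-on-read pattern Cliff is describing: raw, messy JSON log lines sit in object storage, and structure is only imposed at query time. The "bucket" here is just an in-memory list; a real data lake would read these lines from S3, Azure Blob, or GCS objects, and an engine like Athena or Presto would play the role of the predicate.

```python
import json

# Simulated object-storage bucket holding raw, differently-shaped log lines.
bucket_objects = [
    '{"eventName": "GetObject", "sourceIPAddress": "10.0.0.5"}',
    '{"eventName": "PutObject", "user": {"name": "alice"}}',
    '{"level": "info", "msg": "heartbeat"}',  # a different shape entirely
]

def query(objects, predicate):
    """Parse each raw line at read time; keep records matching the predicate."""
    results = []
    for raw in objects:
        record = json.loads(raw)  # schema imposed only now, at query time
        if predicate(record):
            results.append(record)
    return results

# Find events where eventName is "GetObject" -- no upfront schema needed.
hits = query(bucket_objects, lambda r: r.get("eventName") == "GetObject")
```

The point of the sketch is that ingestion is trivial (any string can land in the bucket); all the interpretation work happens at query time.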
Dave Bittner: And so when we're talking about AI and its ability to improve the lives of people in the SOC, how do these two things cross paths?
Cliff Crosland: It's super interesting. So we found that more than half of our customers are building agentic AI workflows on top of their data lakes and on top of many other tools, for all of their SecOps responsibilities just to speed things along.
No one is fully removing people from the workflows, but they are finding a tremendous amount of value in speeding up things like investigations when an alert is triggered, by having a data lake that's easy to access with a huge amount of data. It's just way easier to get lots of data into a data lake than into a traditional SIEM.
Then the agents can go and pull in rich context from many different sources. So AI together with data lakes, we think, is really the future of doing SecOps investigations and diving into log data. There's just a lot of power in having access to more and more log sources, and more and more historical data too, going back not just a few weeks, but months or years, to do a deep-dive threat hunt.
It's super cool.
Dave Bittner: Well, can we dig into some of the details here? I mean, when somebody has this sort of thing up and running, how does it work?
Cliff Crosland: Yes. So folks' first cut at building a data lake tends to use a SQL-based tool. Apache Presto, or Amazon Athena, is very common in our world.
We're very AWS focused. But what they will do is use different kinds of SDKs that interact with MCP servers. So Model Context Protocol, it's very cool. And then they will use that to go and interact with their data lake and interact with different security tools that they have.
So a very concrete example might be something like an alert lands from something like Amazon GuardDuty. And then the agent will go and pick up the ticket that got created in Jira, and then it'll go do an investigation in the data lake. It might ping some other tools. It might write up a little summary in Slack and write some comments in the Jira ticket.
And then it might also do something cool like open a pull request in the team's GitHub repository to tweak some code or maybe a detection rule that they have in there. And then humans can kind of review everything that it just did. The code review, the change that it's making against the code, they'll go and review the comments that are being added to the ticket.
So it can be very cool if your data lake is working well for you and you've done the work to make it fast. It can be a very cool source of additional rich context for agents to use when they're doing investigations.
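The alert-to-pull-request loop just described can be sketched as a single handler. This is a hypothetical illustration, not Scanner's implementation: every parameter is a stand-in for a real integration (the GuardDuty alert payload, a data lake query API, Slack, Jira comments, GitHub PRs), injected as plain functions so the control flow is visible.

```python
def handle_alert(alert, query_lake, post_slack, comment_ticket, open_pr):
    """Agent workflow: investigate an alert, report findings, propose a fix."""
    # 1. Investigate in the data lake: pull events related to the alert.
    related = query_lake(alert["source_ip"])
    summary = (
        f"Alert {alert['id']}: {len(related)} related events "
        f"for {alert['source_ip']}"
    )
    # 2. Report back where humans are already looking.
    post_slack(summary)
    comment_ticket(alert["ticket_id"], summary)
    # 3. Propose, never apply: open a PR with a detection tweak for review.
    if related:
        open_pr(f"Tune detection rule for alert {alert['id']}")
    return summary
```

The design choice worth noting is step 3: the agent only ever opens a pull request, so a human review gate sits between the AI's conclusion and any change to code or detections.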
Dave Bittner: Now my understanding is that query speed in particular is really critical for enabling these AI agents. Can you unpack that for us? Why does that matter?
Cliff Crosland: Yes, it is really interesting. So this is a common theme that we run into, and why people come and talk to us: when they are trying to use Amazon Athena or Presto to go and query a data lake, sometimes the query will run for hours, and then the agentic workflow just doesn't work. It's just sitting there constantly pinging over and over again, waiting for a query to return. And so if you want an agent to do a good job at investigating an alert quickly when it comes in, you want your data lake to be really fast to query.
We really are obsessed with what the future of data lakes looks like. We think data lakes are just going to get faster and faster. And this common complaint about data lakes being slow, that's going to go away over time. It's getting easier to do data engineering on traditional data lakes to make them faster to query, with Apache Iceberg and Parquet formats and so on.
But there are also other cool things going on, like being able to support full-text search, even on the messiest of log data, and getting results back rapidly. That's something that we are super excited about. As data lakes get faster, it'll just be easier and easier for agents to rapidly investigate incidents, do detection engineering on your behalf, and speed up everyone's job in the SOC.
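The query-speed problem Cliff describes can be made concrete with a small polling loop, the shape most agent workflows take when a query engine is asynchronous. This is an illustrative sketch: `check_status` is a hypothetical stand-in for a real call like Athena's get-query-execution status check. If the query outlives the agent's deadline, the whole workflow stalls.

```python
import time

def wait_for_query(check_status, timeout_s, poll_interval_s=0.01):
    """Poll a query's status until it finishes or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = check_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_interval_s)  # agent is stuck pinging, doing no work
    return "TIMED_OUT"  # the agent abandons the step; the investigation stalls
```

A sub-second query keeps the agent moving; an hours-long scan hits the timeout branch, which is exactly the failure mode described above.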
Dave Bittner: You know, years ago when AI was just starting to become the hot thing along with machine learning, of course, I remember reading an article and it was about the state of a computer's ability to play chess against humans. And they were talking to a chess grandmaster and they were saying that, you know, humans can play against the machine. The machine can play against another machine. But really the human combined with the AI was the best chess player in the world, and that combination was hard to beat. Is my understanding correct that you all have done some testing on this internally and you're finding similar sorts of results?
Cliff Crosland: Yes, we definitely think that. I mean, we could be wrong if artificial general intelligence or artificial superintelligence lands on the scene in a decade, if we're lucky.
I don't know if that's really gonna happen, but maybe at that point the AI can just take over the job. But it was really interesting. There was some research done at Stanford showing that, for a certain kind of symptom-evaluation testing, AI actually does better at diagnosis on its own than doctor plus AI, which was surprising.
And so in our minds, we thought, wouldn't it be really cool to see if AI can do the job of a SOC analyst better than humans can by themselves, or even humans plus AI? The interesting thing that we found there is that human plus AI together does far better. There are a couple of different interesting findings.
One is that it just seems to be that there is a lot more medical data out there for foundation models to train on, like millions of research papers. And so it makes sense that they're good at diagnosing medical problems. But in cybersecurity, the false negative rate, that's the scary thing.
If a true positive alert, if a real threat is present in your log data, and you're under attack and the agent doesn't find it and thinks everything is peachy, that is scary. And that was very common with AI running entirely by itself. But what we found to be really effective was AI and humans working together, where a human can just use their judgment to nudge the AI along and iterate together on an investigation report, like an artifact that they can continue to develop together. So the AI will do a first stab at the investigation, and then a human can say, you totally missed something over here.
You mentioned it, but you didn't really dive deeply into this weird data exfiltration. What is happening there? Why are there so many downloads from an S3 bucket in the logs? And then the AI will often say, wow, you're right. This is actually really bad. Let me dive more deeply into this.
But the cool thing is, instead of a human taking hours to write queries and dig through logs, you can really just start to use your intuition and your judgment as a person and a security practitioner to come up with great ideas for the AI to go and explore. And then there's this really fast translation between the messy data, the deep, hard-to-understand, obscure data sources, and the insights from it.
So we think together, humans and AI are awesome, and in our own testing the false positive rate got a lot better when humans were involved. But the false negative rate was a lot better too. Humans are better at being maybe a little paranoid and nudging the AI along in the right direction.
Dave Bittner: How do you go about dialing in the degree to which the humans are having oversight over the AI?
Cliff Crosland: Yes, this is really tricky. I think what you want is an AI agent and a bunch of agentic workflows to make your life easier. You don't wanna have to micromanage them. And so it could be a challenge if you have hundreds or like maybe even thousands of alerts being triggered per day.
You don't want to have to go and do a deep-dive review on every single response that your agent is making to these alerts. But what is really effective is to keep humans in the loop, and then to do things like let the AI give you a batch understanding, like a global understanding of the patterns of those hundreds of alerts, and then surface the highest priority things for you to go review.
We don't think it's time to let agents go and make a final call on really important investigations and do things like immediately change your code or your detections. We think instead, if it can open up a pull request for humans to review, if it can add comments for humans to review, and then humans can just click accept or approve, or dive deeply into the details if they want to, it can really speed people along. So, yes, it is a challenge. I think if you can get your detections tuned to reduce your alerts to something that is reviewable, like maybe dozens of alerts a day, that can be wonderful.
And we actually find that AI agents are really helpful in helping you tune your detections to remove the noise, and in giving you ideas for how to reduce the false positive rate. So, yeah, I think you kind of need to get your alerts under control, and then only have a volume at which humans can afford to go and review what the agents are doing and what their investigation conclusions are.
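The "batch understanding" idea above can be sketched as a simple triage function: score every alert, surface only the top few for human review, and let the agent auto-summarize the rest. The severity levels, the `novel` flag, and the weights are all illustrative assumptions, not Scanner's actual scoring model.

```python
def triage(alerts, review_budget):
    """Rank alerts and split them into a human-review queue and the remainder.

    Scoring is a toy heuristic: severity weight, doubled for novel patterns.
    """
    severity_weight = {"low": 1, "medium": 3, "high": 10}  # assumed weights
    scored = sorted(
        alerts,
        key=lambda a: severity_weight[a["severity"]] * (2 if a["novel"] else 1),
        reverse=True,
    )
    # Humans see only the top of the list; agents summarize the long tail.
    return scored[:review_budget], scored[review_budget:]
```

The `review_budget` parameter is the knob Cliff alludes to: it caps the human workload at a reviewable volume (dozens a day) regardless of how many alerts fire.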
Dave Bittner: Yeah, it strikes me that approaching it this way, maybe it's an opportunity for your humans in the loop to stay sharper, because they don't have that grunt work of, as you said, going through so much data manually. They're able to apply their intuition where it really matters.
Cliff Crosland: Yes. There was a fun interview with Allie Mellen who talked about how in the future the SOC analyst, the low level SOC analyst role is going to evolve and it will become more about detection engineering together with an AI. And we definitely see that, it's really fun to watch with our users.
They will build workflows where, if an alert is really noisy, an AI will do an initial attempt at writing code to change the detection rule to make it less noisy and reduce the false positive rate. Humans can review, and then it's just so much more fun than going and manually triaging dozens or hundreds of alerts per day. You can just use your high-level judgment and your creative ideas, instead of getting into the weeds and into the details on every single alert that happens. You can guide and shape, almost like managing agents to write code for you, and get a lot of great work done, clean up your detections, and tune them better.
Yeah, we see a lot of people, instead of just trusting the out-of-the-box detections from their SIEM or their security data lake tool, customize and tune hundreds of detections from their vendor to be more appropriate to their business context. And they can only do that because AI is helping them speed that along; they're not doing it all by hand.
They're just doing code reviews, and maybe giving the agent feedback, and maybe tuning the code a little bit more. But yeah, I completely agree that it helps people be sharper and focus on better, more high-leverage projects. It's exciting.
Dave Bittner: What are some of the things that you and your colleagues have learned along the way? In terms of having your own unique approach to this, what are the things that you believe differentiate you from other folks who are out there doing this?
Cliff Crosland: I think what's going to happen is that all of the data engineering effort a lot of organizations are putting in to get the data in their data lake to conform to a common schema is not going to be important anymore.
In the future, instead of making every single one of your 50 log sources conform to a schema like OCSF, tools are going to be very good at handling messiness. That's something we really care about in our tool at Scanner: logs can be very messy, and the way we approach running queries and analyzing data in data lakes is about embracing the messy, text-based, deeply nested JSON, schemaless nature of logs in security.
It just makes it much easier to gather many different log sources together. You don't need to do as much engineering work to get them to conform to a common schema, and then it's really critical that that data be fast to search through so that the agent can actually make progress.
We're excited about adopting many of the same ideas that you see in tools like Lucene or Elasticsearch - building an inverted index, but making that inverted index extremely friendly to data lakes, to data lake scale, and to object storage. That allows you to execute very, very fast searches over massive data sets without doing a significant amount of data engineering to get your data to conform to a particular schema. Just let the logs come in as they may and be messy. And then what the future looks like is gathering more and more log sources cheaply in data lakes and in object storage, and then letting agents do the really cool fuzzy sorts of correlations and searching across them, to do deep dives and powerful investigations. So we're excited about that. We think that is the direction things will go in with object storage and data lakes in the future - more and more friendliness to unstructured and semi-structured data, and faster and faster search across that data.
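A minimal sketch of the inverted-index idea referenced above, in the spirit of Lucene/Elasticsearch: tokenize raw log lines, map each token to the set of lines containing it, and answer full-text AND queries by intersecting postings. Everything here is a toy in-memory version; a data-lake-scale design would store the postings in object storage. No schema is required, the messy text is indexed as-is.

```python
import re
from collections import defaultdict

def build_index(log_lines):
    """Map each token to the set of line ids (an inverted index)."""
    index = defaultdict(set)
    for doc_id, line in enumerate(log_lines):
        # Tokenize messy text directly; no schema or field mapping needed.
        for token in re.findall(r"[A-Za-z0-9._-]+", line.lower()):
            index[token].add(doc_id)
    return index

def search(index, *terms):
    """Full-text AND query: return line ids containing every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

logs = [
    'ERROR s3 GetObject bucket=payments size=98231',
    '{"msg": "login ok", "user": "alice"}',
    'ERROR exfil suspicious GetObject bucket=payments',
]
idx = build_index(logs)
```

Because lookups touch only the postings for the query terms rather than scanning every object, this is the mechanism that makes full-text search over messy logs fast, which is exactly what an agent needs to iterate quickly during an investigation.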
Dave Bittner: That's Cliff Crosland, CEO, and co-founder at Scanner.dev.
