February 12, 2025

CyberWire Daily Podcast: The Evolution of Security Data Lakes And The “Bring Your Own Data” Model For Security Tools

In this CyberWire Daily Podcast episode with host Dave Bittner, Cliff Crosland, CEO and co-founder of Scanner.dev, discusses the evolution and benefits of security data lakes, explaining how they provide a more flexible and cost-effective way to store large amounts of diverse security data than traditional data warehouses and SIEM tools. He emphasizes the advantages of the "bring your own data" model, where organizations keep their data in their own cloud storage while allowing various vendor tools to analyze it, reducing vendor lock-in and enabling multiple teams to access and analyze the same data without duplication.

Dave Bittner

Cliff Crosland is CEO and co-founder at Scanner.dev. In today's sponsored Industry Voices segment, we discuss the evolution of security data lakes and the "bring your own data" model for security tools.

Cliff Crosland

So a data lake is an evolution beyond the data warehouse. There are all of these funny terms for big data storage areas, but a data lake is a strategy for taking in data of many different formats. The data warehouse was the first step in this direction: the idea was to have tons and tons of data that matched a really strict structure. With data lakes, the idea is just to have a storage repository of lots and lots of messy data in many different formats and many different structures. You pour tons of data into the lake, then make sense of it and analyze it afterward. So it makes it very easy to get lots of data in.

And then the challenge becomes trying to get value back out again, to query it and get a sense of what's going on. Data lakes are becoming more and more popular, especially in security, because there are so many different kinds of data to collect in security, and it's much easier, more scalable, and very cheap compared to other tools to store data in a data lake. To get really specific about data lakes: people tend to store lots and lots of data in cloud storage, whether that's Amazon S3, Azure Blob Storage, or Google Cloud Storage. There are a couple of different places where people store it, but it tends to be really big cloud storage lakes of data.
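To make that concrete, here is a minimal sketch of what "pouring data into the lake" can look like on AWS. The bucket name, key layout, and log shapes are hypothetical examples for illustration, not anything specific to Scanner.dev, and the code assumes AWS credentials are already configured for boto3.

```python
# Minimal sketch: pouring raw, differently shaped logs into an S3 data lake.
# Bucket name and key layout are hypothetical; assumes configured AWS credentials.
import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "example-security-data-lake"  # hypothetical bucket


def put_log_batch(source: str, records: list) -> None:
    """Write a batch of JSON log records under a per-source, per-day prefix."""
    now = datetime.now(timezone.utc)
    # A date-partitioned layout keeps later queries cheap: engines can skip
    # whole prefixes instead of scanning the entire bucket.
    key = f"logs/{source}/dt={now:%Y-%m-%d}/{now:%H%M%S%f}.json.gz"
    body = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)


# The lake doesn't care that these two sources have completely different shapes.
put_log_batch("waf", [{"action": "BLOCK", "clientIp": "203.0.113.7"}])
put_log_batch("vpc-flow", [{"srcaddr": "10.0.0.5", "dstport": 443, "bytes": 912}])
```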

And once your data reaches massive scale, it's very helpful to be able to afford, and to scale up with, all of the data volume that comes in now. So we think, and I can get into more detail there, that because there's so much data now, and because there are so many different data sources if you're operating a cloud service, data lakes are becoming a more and more important part of a security team's detection and response strategy. They really allow you to get coverage on massive data volumes that become too expensive with the traditional way logs and data are stored. So that's a whirlwind tour of data lakes, but that's the rough idea.

Dave Bittner

So I hear folks talking about this "bring your own data" model. Can you describe that for us? What does that entail?

Cliff Crosland

Yeah, I think it's a really powerful way that software is being deployed now, and I really think this is the future of how more and more tools are going to work. Back in the good old days, the operational approach was to send all of your data off to a vendor. If you're using a SIEM, a security information and event management tool, oftentimes what that looks like is shipping lots of logs over to a third party, and it can be expensive to transfer, et cetera.

But now the way things are moving is that you store all of your own data in your own storage buckets, in your own cloud storage, and then you plug many different vendor tools into that data, give them permission to analyze it in different ways, and use each tool for what it's strong at. It's really cool: bring your own storage, even bring your own cloud compute. You can basically say to a vendor, please deploy your software into my environment. There are lots of cool tools doing this, whether security or database related. Many different companies use this approach, and you get the power of SaaS products, which get deployed frequently and updated all the time, with a really good user experience.

But you're letting the vendor run everything in your own cloud environment, which means you keep full data custody. You get perfect visibility into what's going on, how much compute you're using, how much storage you're using. You can drive costs down. It's a pretty powerful new approach for security teams and data analysis in general. And as AI applications and use cases start to explode, you're starting to see that happen there too. So it's pretty exciting. There are a lot of new tools that deploy into your cloud, into your storage, and instead of getting locked into a vendor, you get to maintain full custody of that data. It's a cool new pattern.
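As a concrete illustration of that permission grant, here is a hedged sketch of an S3 bucket policy that lets a hypothetical vendor-owned IAM role read your logs in place, read-only, so custody stays with you. The account ID, role name, and bucket name are invented for the example.

```python
# Sketch of a "bring your own data" grant: a bucket policy giving a
# hypothetical vendor IAM role read-only access to logs in your bucket.
import json

import boto3

BUCKET = "example-security-data-lake"  # hypothetical
VENDOR_ROLE = "arn:aws:iam::111122223333:role/ExampleVendorAnalyzer"  # hypothetical

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Let the vendor list objects under the logs/ prefix.
            "Sid": "VendorList",
            "Effect": "Allow",
            "Principal": {"AWS": VENDOR_ROLE},
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": "logs/*"}},
        },
        {   # Let the vendor read objects, but never write or delete them.
            "Sid": "VendorRead",
            "Effect": "Allow",
            "Principal": {"AWS": VENDOR_ROLE},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/logs/*",
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```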

Dave Bittner

Can we talk about the scalability here? The possibilities for growing beyond your expectations, if need be?

Cliff Crosland

Yes. So I think one of the interesting trends with security log data is that as people operate more and more services in the cloud and use more and more SaaS tools, the traditional log and data management tools get to be extremely expensive. The beauty of data lakes is that cloud storage is very cheap and can scale forever, and as long as you apply tools and smart ways of organizing the data to make it fast to access, it can really drive down costs and make it possible to have a lot of visibility into historical data. With a lot of the tools people used up until a couple of years ago, you could really only retain a couple of weeks or maybe a couple of months of logs, maximum.

And then you would just dump the rest of your logs into cloud storage for compliance purposes, with no way to get value out of them. But because there's this new model of storing your logs in massively scalable cloud storage at low cost, and there are really cool new data lake tools to analyze that data and get you answers quickly, you can really get value out of this historical data. So instead of spending millions of dollars on a SIEM tool, you might spend ten times less by taking a data lake approach. That's the kind of scalability data lakes can achieve, which I think is a big reason why big companies, Snowflake, Databricks, et cetera, and lots of companies, whether in security or in data analysis, are really excited about data lakes and all of their applications.
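To ground the cost-and-retention point, here is a sketch of querying months of historical logs in place with Amazon Athena instead of keeping them hot in a SIEM. The database, table, and result bucket are hypothetical and assume the logs were already cataloged (for example, with an AWS Glue crawler).

```python
# Sketch: ask a question of months of historical logs sitting in S3.
# Names are hypothetical; assumes a cataloged, date-partitioned table.
import time

import boto3

athena = boto3.client("athena")

# Filtering on the dt partition keeps the scan (and the per-TB query cost)
# small even when the bucket holds a year or more of data.
query = """
SELECT clientIp, COUNT(*) AS blocks
FROM security_lake.waf_logs
WHERE dt BETWEEN '2024-11-01' AND '2025-01-31' AND action = 'BLOCK'
GROUP BY clientIp ORDER BY blocks DESC LIMIT 20
"""

qid = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "security_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows[1:]:  # rows[0] is the header row
        print([col.get("VarCharValue") for col in row["Data"]])
```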

Dave Bittner

I would imagine it cuts down on redundancy quite a bit as well, right? Because, as you say, you can have different, I'll just call them plugins, looking at this big lake full of data. You don't have to duplicate that data to be analyzed by Platform A or Program B. It's all there, and you can send things to analyze it as need be.

Cliff Crosland

Yes, absolutely. One thing we've seen, too, is that in the past you'd have different teams at the company using multiple tools, shipping the same data off to those multiple tools, duplicating the data, as you were saying. And if you had different divisions, different departments across the company, they would each also ship data off to many different tools, which was a huge problem. It would just duplicate the same massive data flows from all of these different log sources to tons and tons of different tools. So you'd have the security team looking at one set of the data in one set of tools, and you'd have the application developers, who are trying to debug things and get health metrics on the application, using a totally different set of tools and shipping the same data off there.

But the really cool thing about the data lake is that, yes, there's a centralized place, and you can plug in many different tools to go analyze that data for different use cases. And it's fun to see teams, once they start to build their own data lake, become a resource across the entire company. Everyone starts to pile in and say, this is really cool, it breaks down the silos. Wow, the security team has this kind of data from the web application firewall.

That's actually really helpful for debugging this other problem, and for our infrastructure team. Because there's this centralized place in the data lake, everyone can jump in and analyze it.

And they're not replicating the cost by shipping the data off to ten different vendors. There's just one location, and they can use different vendor tools to analyze that same data. So it's really cool to see. Often one team, like the security team or maybe a business intelligence team, starts to use a data lake, then lots of other teams get excited, and everyone starts to break down silos and develop really cool use cases for the same data sets and share them across the company. That's another really powerful thing about data lakes.

Dave Bittner

Yeah. Interesting. Well, help me understand how data lakes handle different types of data. My understanding is this has evolved over time.

Cliff Crosland

Yes. So it's really interesting. The beauty of data lakes is that you can store lots of different kinds of data in one place; the challenge becomes trying to get value out of really messy data of different kinds, and different tools have arisen to tackle that messiness. You might have web application firewall logs coming in that the security team really cares about, and network flow logs, and they have very different formats. There's been a lot of progress in building data lake tools to make all of that useful, a nice evolution. When data lakes were originally introduced, things were pretty strict and there was a lot of upfront work.

You had to do a lot of work every time you added a new data source to transform the data to fit an appropriate schema; otherwise, the data would be really slow to access. But over time you've started to see cool new innovations. Apache Iceberg is a big one, and Amazon has introduced a new product, S3 Tables, to natively support Apache Iceberg. The cool thing there is that it's much easier to evolve and edit the schemas and data structures over time. So where things are heading, in our opinion, is that the tools will get smarter and better about handling the messiness and the structure for you. It's getting easier, though we still feel it's a little too challenging.
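For a sense of what that schema evolution looks like in practice, here is a small sketch using the open-source PyIceberg library to add a column to an existing Iceberg table; the catalog configuration and table name are assumptions for illustration.

```python
# Sketch: Iceberg schema evolution with PyIceberg. Catalog setup and the
# table name are hypothetical; any Iceberg catalog (Glue, REST, ...) works.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("default", type="glue")  # assumed Glue-backed catalog
table = catalog.load_table("security_lake.waf_logs")

# A log source started emitting a new `country` field. With Iceberg this is
# a metadata-only change: existing data files simply read the column as NULL,
# with no rewrite of the underlying Parquet files.
with table.update_schema() as update:
    update.add_column("country", StringType(), doc="GeoIP country code")
```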

There are cool things we really care about to make this easier and more schemaless, et cetera. But it's really fun to watch new tools appear on the scene that make data lakes easier to use. We see the messiness of data getting handled more and more intelligently, making it easier and easier for people to adopt. It's also really cool to see what people are doing with LLMs. Generative AI can do a really good job of helping you figure out what the schema should be and of taking on the annoying transformation work every time you add a new data source. So yeah, it's really neat to see how easy these data lake tools are getting.

And we think the future is that you'll just point a tool at your messy data and it will make perfect sense of it all for you. That's where things will eventually get to. But yeah, it's heading that way slowly.

Dave Bittner

Yeah, you touched on this idea that the "bring your own data" model reduces vendor lock-in. Can we dig into that a little bit? What's the advantage here?

Cliff Crosland

Yeah, so the beauty here is this: with other logging platforms in the past, in particular with SIEM tools, the idea was to ship your logs off to a tool, maybe one you were running internally, or maybe a third party you were shipping them to, and then that data was locked into that specific tool, in a specific format very tightly coupled to that vendor. That can be nice if there's a strong vendor ecosystem; if basically all of the features you want are handled by that vendor, that's fine. But it also means the vendor can increase their prices and you're stuck there. The beauty of data lakes is that you bring the data, and you can bring the compute as well.

The idea is that the vendor supplies tools you can use to analyze it, and you're not locked into any of those vendors. You could drop one, you could pick up another. There are a lot of really cool open formats that people are using for data lake files, and for the catalogs that track what data is in the data lake. So we really think the direction things should go is that you have a lot of flexibility and can select from many different vendors that can all analyze the same data set, without getting stuck with one forever. You might love your vendor at first, and then over the years they kind of stop innovating, but it's very hard to move off.

You may have built a lot of dashboards and queries and detection rules there. But with the data lake approach, there's just way more flexibility, and this really cool notion of having full data custody gives you the freedom to pick and choose as you want.
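One way to see that flexibility concretely: because data lake files sit in open formats like Parquet, completely unrelated engines can read the very same objects, so switching or mixing vendors doesn't mean moving data. The paths below are hypothetical, and the sketch assumes AWS credentials in the environment plus the duckdb and pyarrow packages.

```python
# Sketch: two unrelated engines querying the same open-format files in S3.
# Paths are hypothetical; assumes the raw logs were compacted to Parquet.
import duckdb
import pyarrow.dataset as ds

PATH = "s3://example-security-data-lake/parquet/waf/"

# Engine #1: DuckDB scans the files directly, no ingestion step.
con = duckdb.connect()
con.execute("INSTALL httpfs")  # one-time: S3 support for DuckDB
con.execute("LOAD httpfs")
print(con.execute(
    f"SELECT action, COUNT(*) FROM read_parquet('{PATH}*.parquet') GROUP BY action"
).fetchall())

# Engine #2: pyarrow reads the very same files into an Arrow table.
# If one tool disappoints, the data never has to move.
dataset = ds.dataset(PATH, format="parquet")
print(dataset.to_table().num_rows)
```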

Dave Bittner

So what are your recommendations then for folks who want to look into this, who want to explore the possibility for their own organization? What’s a good place to start?

Cliff Crosland

Yeah, definitely. So I think there are a couple of tools you should definitely look at as you get started. If you're in AWS, that would be Amazon Athena; on Google Cloud, BigQuery; and for Azure, the Azure Data Lake suite of tools. That's probably the best place to start playing with things. Then, if you want to get really deep into security, there are lots of cool security data lake related tools to help you pull data into your data lake, structure it well, or search it well, a whole suite of things centered around the data lake technologies out there. And I would really recommend that people take a look at Apache Iceberg. It's probably one of the coolest innovations happening right now.

We still think Iceberg is a little too difficult to use, because it's still a little too strict, but it's definitely a step in the right direction of making the data lake really flexible. The place to start would be with the different cloud providers' data lake specific tools. If you want to get security data into those tools, there are plenty of services, startups, and companies that will help you load logs from different locations into your data lake.

But you could start with just a few log sources, maybe ones that are really easy to get into your own cloud provider's data lake, like the cloud audit logs or maybe your identity provider logs. Then just play with the different data lake tools out there to see what suits your use case: who has the best detections, whose structure works best for you, who's easiest to use. There are lots of cool things out there.
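If you follow that advice, a plausible first step, sketched here with hypothetical names, is registering one easy source, say identity-provider logs already landing in S3 as JSON, as an Athena external table, so you can start comparing tools against real data.

```python
# Sketch: put a first log source behind an Athena external table.
# Database, bucket, and columns are hypothetical.
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS security_lake.idp_logs (
  ts     string,
  actor  string,
  event  string,
  src_ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-security-data-lake/logs/idp/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "security_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
# After new dt= prefixes land, run `MSCK REPAIR TABLE security_lake.idp_logs`
# (or ALTER TABLE ... ADD PARTITION) so Athena picks up the new partitions.
```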

Dave Bittner

That's Cliff Crosland, CEO and co-founder at Scanner.dev. If you'd like additional details, we have a link to their blog on security data lakes in our show notes.

We believe that traditional log architectures are broken for modern log volumes. Scanner enables fast search and detections for log data lakes – directly in your S3 buckets. Reduce the total cost of ownership of logs by 80-90%.
Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, which provides fast search and threat detections for log data in S3. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it's mostly love these days.