September 2, 2024

Scanner.dev on Data Lakes with Smashing Security Podcast hosts Graham Cluley & Carole Theriault

Scanner.dev CEO Cliff Crosland joins Smashing Security Podcast hosts, Graham Cluley and Carole Theriault, to discuss how Scanner transforms raw log data into searchable insights, helping organizations handle security events more effectively. Cliff explains the challenges of traditional logging tools, the high cost of log retention, and how Scanner leverages Data Lakes to make log analysis faster, cheaper, and more efficient. The conversation also touched on the future role of AI in organizing and interpreting log data, making security monitoring more accessible and actionable.

Click here to listen to the full episode.

Graham Cluley

Now, Carole, you’ve been chatting to the folks at Scanner.dev this week, haven’t you?

Carole Theriault

Yes, I have. And I learned all about data lakes and how you can make them work for you. What is a data lake, you ask? Listen up. Today we are chatting with Cliff Crosland. He’s CEO and co-founder of Scanner.dev. Scanner.dev have a logging tool called Scanner. Now Scanner turns raw log data into searchable material, whether it’s like a critical security event or, you know, insights hidden deep within the logs. And we listeners are going to find out how this works from the head honcho. So, Cliff Crosland, welcome to Smashing Security.

Cliff Crosland

Thank you so much, Carole. Great, great to be here.

Carole Theriault

Well, thank you for being here. Now first, maybe we can start with you giving a better introduction about you. So we want to know how you ended up not only co-founding, but heading up Scanner.

Cliff Crosland

Yeah, for sure. So I’m a software engineer. I’ve worked on a bunch of different things related to data infrastructure. At a prior startup, my co-founder and I were responsible for the security logs and all of the, like, application debugging logs. And we had a huge amount of pain and suffering when it came to how expensive they got and how challenging it was to derive insights from logs at massive scale. So yeah, we wanted to solve this problem.

We just make it easier for people to understand and solve complex security problems using their log data. We jumped in after the startup was acquired by Cisco, and that was fun to be there for a bit. We wanted to solve this problem for ourselves and for other people. So we jumped in and we built out Scanner to make logs way easier and way less expensive. We’re passionate about something that is probably a little bit boring to people, which is just massive log data, but to us, it’s fascinating. You kind of, like, build a whole view of what’s happening inside of an organization, what’s happening inside of your app, what’s happening, like, across your IT.

Carole Theriault

You’re like the Wizard of Oz.

Cliff Crosland

Yes, yes. Like, yeah, peeking down and trying to figure out what’s really happening behind the curtain. So we love massive amounts of log data. It’s rare to find people like that, but that’s us for sure.

Carole Theriault

I have to be honest, I don’t know a lot about security logging. I should know a lot more than I do and I don’t know anything about how it’s kind of evolved or matured through the years. So this must be weird to you. This must be a word you use about 100 times a day. But maybe you can set the scene for me.

Cliff Crosland

A fun example to maybe start with is a recent story that you and Graham and Dave covered a couple episodes ago about the Path of Exile 2 hack. The game got hacked. People had items stolen, digital goods stolen, and one of the challenges they ran into was they wanted to track down and see what happened, who hacked them, how did this take place, who was affected. But they only had 30 days of log retention, so they could only go back in the past so far.

Carole Theriault

Yeah. Everyone was up in arms, right? The company, the players. Yeah.

Cliff Crosland

Yes. And this happens all the time. Like, Okta had a breach, you know, a really core identity provider and login service provider. One of the things that security teams really find useful to do forensics and track down what’s going on with threat activity is log data. And log data is basically like a surveillance camera recording everything that’s going on in your organization, in your servers, in all of your cloud tools. What’s happening in Slack? What’s happening in, like, Microsoft Teams, in, like, your Google Workspace, and who’s sharing what documents and so on. Just kind of like recording everything that’s happening. And then that information’s super helpful to dive into as a security team and say, okay, there’s like, something weird going on. Like, this one employee is starting to share tons and tons of, like, Google documents outside of the org. Let’s double click into that. And so the challenge is keeping enough log data around. It’s kind of insane how much log data gets generated. Like, people are using more and more tools. They’re using multiple cloud providers.

Each cloud has like a million different services that they provide, and each one of those generates logs. You have all these, like, little log messages with, like, timestamps and information about who’s doing what. That can get extremely expensive. So for teams like the Path of Exile 2 developers, it’s totally understandable why people only keep like 30 days of logs around. Because once you get to a terabyte of logs a day, which happens quickly, that can cost a million dollars a year. That’s what happened with our prior startup. We grew quickly. We generated lots of logs, and then our logging tool, the license got tripped, we exceeded it, and we asked them, okay, well, if we were to expand it to cover ourselves, how much would it cost? And they were like, well, you’re at about a terabyte, so it might be something like $1.2 million a year.

Carole Theriault

Like, what? Excuse me?

Cliff Crosland

Okay, that’s like, more than our entire, like, employee budget. Like, what is happening? We really think at Scanner that the architecture of traditional logging tools is just broken for modern log volumes. And there’s just a very different, really cool new pattern that’s emerging to handle logs. It’s still early, and there’s a lot that needs to be built to make the experience better. But, yeah, there are new approaches that will reduce the costs by like 80 or 90% and make it actually reasonable for teams to keep more than 30 days of logs.

Carole Theriault

I guess in a security incident or a security event situation, obviously there’s huge time constraints. You want to get everything sorted and people want answers quickly. And there’s people knocking at the door, you know, the press, your clients, your partners. How are people dealing with that right now?

Cliff Crosland

Yes, it is really interesting. What often happens is a threat report comes out like, this particular vendor got breached, and here is a list of all of the malicious IP addresses that we detected as part of this breach. And so then they’ll publish that to everyone and say, go and look and see if you’ve been affected by this. And if you find these IP addresses in your logs somewhere, or these domains or these, you know, malware file hashes, meaning someone in the organization has downloaded these files and is running them on their computer, you might be exposed. And so they’ll jump into the traditional log tools and then they’ll be able to run searches over maybe, you know, a couple weeks or 30 days or something like that. And if they can’t find it, they’ll then do this really painstaking process of going into their archives. If they have archives, hopefully they do, but often they don’t.

They have like the 30 days in there, and that’s it. Other teams who have archives, they can try to pull them in and do this process of rehydrating logs, it’s called, but it’s like going back and trying to pull in old data, pull it back into their log tool. That can take days. We’ve talked to folks where it takes weeks. And that’s just answering the question from, you know, like your CTO or your CISO or somebody at the organization who said, oh, this threat is something we’re scared of. Can you tell us, like, have we ever been exposed to this over the past six months? And that question can take like a week to answer, or weeks, or maybe you never answer it.
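To make the threat-report workflow Cliff describes concrete, here is a minimal sketch of checking log lines against a published list of malicious IP addresses. It is an illustration only, not how Scanner or any particular log tool works: the function name `find_ioc_hits`, the log format, and the sample addresses (taken from ranges reserved for documentation) are all assumptions made up for this example.

```python
import re
from typing import Iterable, Iterator

# Loose IPv4 matcher; it does not validate octet ranges (assumption: good enough for a sketch).
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ioc_hits(
    log_lines: Iterable[str], malicious_ips: set[str]
) -> Iterator[tuple[int, str, str]]:
    """Yield (line_number, ip, line) for every log line that mentions a listed IP."""
    for line_number, line in enumerate(log_lines, start=1):
        # De-duplicate so a line that repeats an address is reported once per address.
        for candidate in sorted(set(IPV4_PATTERN.findall(line))):
            if candidate in malicious_ips:
                yield line_number, candidate, line.rstrip("\n")


# Hypothetical log lines and threat-report addresses, invented for illustration.
logs = [
    "login ok user=alice src=192.0.2.10",
    "download user=bob src=203.0.113.7 file=report.pdf",
    "login failed user=dana src=198.51.100.23",
]
threat_report_ips = {"203.0.113.7", "198.51.100.23"}

hits = list(find_ioc_hits(logs, threat_report_ips))
for line_number, ip, line in hits:
    print(f"line {line_number}: {ip} -> {line}")
```

A scan like this reads every line once, which is fine for a small file but is the reason the same question gets slow across months of archived logs that first have to be pulled back into a tool.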

Carole Theriault

Yeah, and a very stressful week in some situations, in some companies, I imagine.

Cliff Crosland

Absolutely.

Carole Theriault

If I can pivot here, I’m guessing this is where Scanner.dev comes in, right? Because as you introed, you wanted to solve this problem, and you filled it in really well. I feel it now. How are you addressing all these pain points?

Cliff Crosland

For anyone who has log data, we think the future is in data lakes. And a data lake is just a funny term that evolved from the term data warehouse. And so I’ll start there.

Carole Theriault

Yeah, okay.

Cliff Crosland

Yeah. A data warehouse is like a giant database where everything is like neatly organized in, you know, rows and shelves and aisles and so on. Data warehouses were designed for business data, like business analytics data that’s very well structured. Like here are the purchases from this customer in this place. And so data warehouses, what you do is you have tons of data, but you do a lot of work to make it super, super structured and organized. A data lake is just this place where you pour in data of many different kinds that’s way less organized.

Carole Theriault

Sounds like my desktop.

Cliff Crosland

Yes, yes, exactly. Yeah, like with a million files or screenshots or whatever on it. Or a desktop in real life with papers everywhere. Yes, exactly. Mine too. I’ve got like a bunch of toddler artwork on my desk at the moment. Anyway, so a data lake, the idea is you can take data from many different sources with many different formats. Some are really structured, some are really messy, some are like in between, semi-structured data, and you just pour it into this storage location basically. And the cool thing about a data lake is it’s a lot easier. You just kind of dump the data in there and you use cloud storage for this, which is so much cheaper.

Carole Theriault

Yeah.

Cliff Crosland

With traditional tools, it will cost like a couple of dollars per gigabyte or something, which ends up being way expensive at scale. And cloud storage costs just a few cents per gigabyte per month. And so it’s just a very different experience to use a data lake and to store all this data in a big cloud storage that can kind of grow forever. You can put in a little bit of data, you can put in a ton of data. Because the data’s so messy, it can be a huge pain to analyze it. Some teams have to do a huge amount of work to, like, organize the data and kind of turn it into a data warehouse. It’s actually called a data lakehouse.

It’s kind of weird. It’s like in between the two, where it’s like kind of messy but more structured so that people can analyze it. It’s so much data that data lakes are often very slow and very hard to use. And at Scanner we really want data lakes to be just trivial to use. You just point Scanner at your messy data. It will index it for fast search. It will organize it. It will transform it to make it better for security use cases, to point out different users or different IP addresses that are involved in logs. Yes. At Scanner, we just want to make it way, way cheaper and way easier to use logging data at scale. And we think data lakes are the future and we think data lakes need to be easier to use. And that’s what Scanner is all about, making data lakes easy and super fast to search.
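As a rough worked example of the cost gap discussed above, the sketch below compares per-gigabyte ingestion pricing with per-gigabyte-per-month cloud storage pricing for one terabyte of logs a day. Every price here is an assumption chosen to match the ballpark figures in the conversation ("a couple of dollars per gigabyte" and "a few cents per gigabyte per month"), not a quote from any vendor, and the storage figure leaves out indexing, compute, and request costs.

```python
GB_PER_TB = 1000  # decimal terabytes, an assumption for simplicity


def traditional_annual_cost(gb_per_day: float, dollars_per_gb_ingested: float) -> float:
    """Annual bill when a logging tool charges for every gigabyte ingested."""
    return gb_per_day * 365 * dollars_per_gb_ingested


def object_storage_monthly_cost(
    gb_per_day: float, retention_days: int, dollars_per_gb_month: float
) -> float:
    """Steady-state monthly storage bill for keeping `retention_days` of logs."""
    stored_gb = gb_per_day * retention_days
    return stored_gb * dollars_per_gb_month


daily_gb = 1 * GB_PER_TB

# Assumed prices: $3.00 per GB ingested, $0.03 per GB-month stored.
ingest_per_year = traditional_annual_cost(daily_gb, 3.00)
storage_per_year = 12 * object_storage_monthly_cost(daily_gb, 365, 0.03)

print(f"per-GB ingestion pricing: ${ingest_per_year:,.0f} per year")
print(f"one year retained in cloud storage: ${storage_per_year:,.0f} per year")
```

Under these assumed prices the ingestion-priced bill lands around a million dollars a year, in line with the figure Cliff mentions, while holding a full year of the same logs in cloud storage costs a fraction of that before search and indexing are added on top.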

Carole Theriault

It sounds so good. What is the thing that you think is just so utterly brilliant, that you’re the proudest of?

Cliff Crosland

Yes, the thing about Scanner that I really am proud of is how fast it is. So data lakes are in their infancy, I would say. And a lot of the time when you do a search for something, it can take hours or days to run a search over all of this data. And in Scanner, we’ll have teams jump in and then they’ll copy-paste in a list of IP addresses which were just divulged in a threat report and they’ll get an answer in 20 seconds, you know, like.

Carole Theriault

Yeah, yeah.

Cliff Crosland

They can not only just answer one question or a few questions a day, suddenly they’re starting to ask dozens of questions and follow many different leads to go and trace through what happened, what a threat did in this case. If there are other threats related to this one, are there other employees that have been impacted? What services did they touch? They can just very rapidly search through data. Yeah, we’re kind of obsessed with speed at Scanner.

Carole Theriault

Like, you know what’s good about that, though, that you may never have thought of before? I’ve had to do searches before, big queries. And it takes forever and forever. And it turns out the thing’s hung.

Cliff Crosland

Yes, right.

Carole Theriault

The thing is hung and I didn’t spot it and it might have hung like five hours earlier and I didn’t even notice it happening. So you solved that problem just by being speedy.

Cliff Crosland

Yes.

Carole Theriault

So I love that. I love it for that as well. Is there anything to add? We’re fast running out of time. I’m just. I’m fascinated by all this. I’m learning tons. Is there anything you’d like to add before we close?

Cliff Crosland

Yes. I think one thing we’re excited about for the future of data lakes is how AI is going to be used. AI is really great at, like, taking the mess and making it organized, and then also taking custom data and then coming up with a common schema, and also just helping you take a bunch of messy logs and just explain them to you and explain alerts and look at high-level patterns. So we’re really excited about what’s going to happen there in the future, to take messy data and just make that easier and easier and more and more trivial to get answers from. With the way that this is working with data lakes, the costs are going to come down, it’s going to get easier and easier to answer questions, and more people beyond security are going to get benefit from all of this data. So it should be pretty cool. It’ll be a little while, the next couple of years with AI. But yeah, it’s going to be really fun to watch this unfold.

Carole Theriault

Do you know what? I’m going to coin it right now. Data Sea. Data Ocean. Exactly.

Cliff Crosland

Exactly.

Carole Theriault

TM Smashing Security. Yes. Way bigger than a lake. Smashing Security listeners, you can learn loads more about Scanner at the website Scanner.dev. That’s Scanner.dev. And Cliff Crosland, CEO and co-founder of Scanner.dev, it’s been a joy speaking with you. Thank you so much for making time in your early morning.

Cliff Crosland

Thank you so much for having me.

Carole Theriault

Thank you. And now I know so much more about security logging and logging in general and I just feel smart.

Graham Cluley

Fascinating stuff. And that just about wraps up the show for this week. Don’t forget you can find Smashing Security on Bluesky, unlike Twitter which wouldn’t let us have a G. And don’t forget as well, to ensure you never miss another episode, follow Smashing Security in your favorite podcast app such as Apple Podcasts, Spotify and Pocket Casts.

We believe that traditional log architectures are broken for modern log volumes. Scanner enables fast search and detections for log data lakes – directly in your S3 buckets. Reduce the total cost of ownership of logs by 80-90%.
Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, which provides fast search and threat detections for log data in S3. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it's mostly love these days.