Cyber Security Tribe Video Interview: The Future of SIEM: A Search Index in Your Data Lake

Our CEO and Co-founder, Cliff Crosland, joined Cyber Security Tribe’s Dorene Rettas for a discussion on the future of SIEM, advocating for a new design suited for the cloud: keeping a search index in cloud storage alongside your data lake.
Interview Transcript
Dorene Rettas
Hi, today I have with me Cliff Crosland. I’m so happy to have you, Cliff, and I know you recently wrote an article for us that got a lot of interest from our community, and we’ll get into that a little bit later. Let’s just start with you giving a little bit of your background.
Cliff Crosland
Yeah, for sure. So my name’s Cliff Crosland and I am the CEO and Co-founder of Scanner.dev. It’s a next-gen SIEM and observability platform. My co-founder and I were engineers at a prior startup that was acquired by Cisco, and we had a massive Splunk bill, and we felt like no one should ever have a massive Splunk bill again. So we built a new SIEM tool that’s based on cloud storage, and it’s been a super fun and interesting experience building this out for teams. Everyone is drowning in massive log volume, so we’re excited to make a difference there.
Dorene Rettas
Fantastic. Well I’m excited for you. So you mentioned cloud and we’re gonna get into that. Let’s start with this. There are a slew of new gen AI projects happening within every organization today. There’s been an influx as you know, and most of them are running in the cloud. So can you share a little bit about why that is?
Cliff Crosland
Yeah, it’s been really interesting talking to different teams who are beginning to move to the cloud when they’ve traditionally been just on-premise at their company for a long time. For anyone who wants to do a generative AI project and dip their toe in or get started quickly, it’s very difficult to spin up GPUs on-prem. It’s expensive, you have to plan ahead, and it’s so much easier just to give things a shot in AWS or GCP or Azure and spin up GPUs there. So the number one reason we see people move to the cloud for these gen AI projects is that they need GPUs, and it’s much easier to get them there and easier to upgrade your GPUs there as new models from NVIDIA come out.
So there are a lot of reasons why the infrastructure is so much easier in the cloud. But another really important reason is that if you wanna do a gen AI project and train a model, you need a lot of data and you need a place to put that data. And so what people do is they’ll use massive data lakes in cloud storage, which is a really awesome place to put heterogeneous kinds of data: images or video, PDFs or Word documents, CSVs and JSON. There are all kinds of different data types that can live in cloud storage very cheaply. And so if you want to use a tool like Databricks or Snowflake or Apache Hive, or any of the many tools that need to analyze massive amounts of data, it’s so much easier to do that on cloud storage cheaply and quickly.
It scales practically forever. So a lot of people are starting to put their data into cloud storage as a result. So they need the cloud for the GPUs, and they need cloud storage for the data lakes to train the models.
Dorene Rettas
Makes sense. Easier, faster, simpler, all of that. So continuing down that cloud discussion, and obviously our community is mainly CISOs and cybersecurity professionals: surprisingly, and I do say surprising, at least to me, there are so many organizations and industries that are slow to shift over to the cloud. I had a CISO round table two weeks ago where one of them said they were still 80% on-prem, 20% in the cloud, and we all went, what? But what he shared with us was that as new business units were using the cloud, he had a whole new slew of security concerns, because what he had put in place and the security posture they had for on-prem would be very different from the controls he would need in the cloud.
So let’s talk about that, the cloud and what the security concerns are that you see.
Cliff Crosland
Yeah, so as we talk to CISOs and security engineers who are moving more and more infrastructure to the cloud and trying to protect that infrastructure, there are a lot of scary problems. So the number one problem is just a much larger attack surface area.
So every cloud service that you interact with has a public API on the web. So you can interact with these storage buckets I was talking about with data lakes. Up until last year, Amazon’s default was to make these buckets public and open to the internet, which is crazy. I think a bunch of people have been breached because a bucket or two was open.
So the attack surface area is big and it’s constantly expanding. There are really cool new services that cloud providers are always launching, and they all have a public API, and you might have engineers internally who are like, wow, this is really cool, let’s give this a shot. So it might be, you know, a new caching layer in front of a database, and it might seem innocuous, but then it might have a completely new set of permissions that you need to set up appropriately. And so it can be very easy for somebody to slip in through this attack surface area that’s pretty big and just growing.
The second thing people are worried about is misconfiguration.
So one of the beautiful things about the cloud is it has all these APIs you can use to configure everything: configure your infrastructure, launch new services and servers and databases. And so you can use infrastructure as code, you know, like Terraform or CloudFormation or Pulumi, to set up your infrastructure: you define it in code, get it code reviewed, and then everything gets set up. You don’t have to click around and do all this manual effort to set up your cloud infrastructure. But one of the things that’s really interesting is that this opens up a big threat of misconfiguration: accidentally writing code that is wrong.
We found this especially a couple of times with some of our users who were using GitHub Copilot or generative AI agents to help them write code. And the thing about these agents is they’ll write code that looks right but is subtly wrong, because maybe it’s based on a tutorial, and then it might do something where the default is public, or it might set up a server that has a public IP address. You didn’t mean to do that, but you’re trusting the copilot instead of reading the documentation, maybe. And so yeah, it can enhance productivity but also open you up to subtle things that get the configuration wrong.
And so then you can get a misconfigured cloud and that can be really scary. People can take advantage of that really rapidly.
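As a rough illustration of the kind of check that catches the misconfigurations described above (a public bucket, a server with an unintended public IP address), here is a minimal Python sketch. The resource schema (`type`, `name`, `public_access`, `public_ip`) is a hypothetical, simplified format invented for this example. It is not the output of Terraform, CloudFormation, Pulumi, or any CSPM product.

```python
from typing import Any

# Hypothetical, simplified resource records (an assumption for this sketch).
# Real infrastructure-as-code tools use their own, much richer formats.
Resource = dict[str, Any]


def find_public_exposure(resources: list[Resource]) -> list[tuple[str, str]]:
    """Return (resource name, reason) pairs for internet-exposed resources."""
    findings: list[tuple[str, str]] = []
    for res in resources:
        kind = res.get("type")
        if kind == "storage_bucket" and res.get("public_access", False):
            findings.append((res["name"], "bucket is publicly accessible"))
        elif kind == "server" and res.get("public_ip") is not None:
            findings.append((res["name"], "server has a public IP address"))
    return findings


if __name__ == "__main__":
    example = [
        {"type": "storage_bucket", "name": "training-data", "public_access": True},
        {"type": "storage_bucket", "name": "audit-logs", "public_access": False},
        {"type": "server", "name": "cache-01", "public_ip": "203.0.113.10"},
        {"type": "server", "name": "db-01", "public_ip": None},
    ]
    for name, reason in find_public_exposure(example):
        print(f"{name}: {reason}")
```

A check like this could run in code review before infrastructure changes are applied, so that a default copied from a tutorial or suggested by a coding assistant is flagged before it reaches production.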
Dorene Rettas
Yeah, that’s subtle but ends up being quite significant, right? Yes. So what are you finding in terms of CISOs leveraging new solutions to assist them with cloud security?
Cliff Crosland
Well, you definitely have CSPMs, so cloud security posture management, like Wiz, big in the news, almost acquired by Google. That does continuous monitoring of misconfigurations and network reachability. So it looks for the things that I just mentioned, like, oh, the API is public, maybe it’s being misused, or maybe we accidentally added some misconfiguration that opens us up more than we thought. So CSPM tools are really critical and are basically a must-have.
And then the next thing that we’re seeing people adopt for cloud security is data lake based logs. So I talked a little bit about how, for training AI models, you really need some kind of data lake strategy to contain all of this data. But one of the interesting challenges with the cloud is that log volume for your security logs and cloud audit logs can be massive. And so in traditional SIEMs it can cost millions of dollars. So we’re seeing people adopt new SIEM tools, you know, they’re using Snowflake, they’re using Amazon Security Lake. They’re trying to reduce the cost of SIEM dramatically by moving their data into cloud storage, looking for threats that way and running investigations there.
Instead of using a traditional SIEM like Splunk or Elasticsearch, where you have to get a cluster running and there’s maintenance and it’s expensive, they’re looking for cheaper ways to scale it out. And so yeah, it’s really those two approaches: you really need cloud security posture management to look for misconfiguration, and then you need this data lake based SIEM to handle the massive volume. Because in case someone does get in during, like, the 15 minutes where something was misconfigured, they can stay there and hide forever and wait a couple of months before they start to attack.
And so it’s really nice if you can have a data lake based, cloud storage based SIEM tool that can look at all of history, store it all for long periods of time, and allow you to do investigations on really old data and new data at much lower cost than before. So those are really the two approaches that we see people absolutely must adopt as they move to the cloud.
Dorene Rettas
Yeah, and we’re hearing and seeing much of what you just shared as well.
Okay. This is gonna be your favorite part because it really goes back to what you are all about and what Scanner.dev is about. So as noted, you wrote an article for us recently and you really dove into where legacy SIEMs are failing and why organizations need to be looking at sort of next-gen SIEM and what that looks like. So I’m just gonna open that up to you to talk about that topic. Why are legacy SIEMs no longer the right choice or where are they falling short I should say? And then what does that next-gen SIEM look like?
Cliff Crosland
Yeah, absolutely. So this goes back a little bit to our origin story. At the startup where my co-founder and I were early engineers, our logs scaled rapidly. We were using Splunk for observability and SIEM, and our bill shot up as our cloud infrastructure scaled. It shot up from like $10,000 a year to a million dollars a year. And we thought this was wild.
So we started to store logs in cloud storage, but we had a lot of trouble searching them, which is why we built this startup. But really, if you look at the legacy SIEMs, they all cost multiple dollars per gigabyte of ingested data. And that was totally fine back in the day when log volume was measured in gigabytes per day. But it’s very easy to get to the point where you’re ingesting a terabyte per day, 2 terabytes, 3 terabytes, 5, even 20.
And once you get into the terabyte range with these legacy SIEM tools, the cost explodes. It starts to become at least a million dollars a year, or multiple millions. For a lot of the CISOs we talk to, log ingestion alone can be one of the top three budget line items for their teams. And it is strange, because if you look at how cheap cloud storage is, the cost is a few cents per gigabyte per month.
And so what we really think is that the next-gen SIEMs coming about really take modern log volume seriously. What a lot of the legacy SIEMs will do is let you drop logs, or keep some of the logs, or filter them, or keep them for 30 days only. We talk to users who were keeping only 30 days in their SIEM, then 15 days, then seven days, so they have a very small log retention window in which to do an investigation, which can be really terrifying if someone comes in and hides for a while before they start to do data exfiltration. It’s hard to connect it back to the original breach if you can only see seven days.
And so they’re desperately looking for something that can scale to really handle that massive log volume. So yeah, we really think that next-gen SIEMs should cost tens of cents per gigabyte, not dollars per gigabyte. It needs to be 10x cheaper to scale. And we also think that next-gen SIEMs shouldn’t limit you to 30 days of investigation. You should get a year, you should get multiple years of investigation capability. So what people often do is have their logs going into their traditional SIEM, and then after 30 days they get archived away to cloud storage.
But then to really look at that data again, they have to go and read it back in, and then pay the ingestion cost again to get it back in. They don’t know exactly what parts of that old data they need. Maybe they need this, maybe they need that. So an investigation over the last year can take weeks of work and be really expensive.
And so yeah, we really think, if you look at what people are doing with next-gen SIEMs, they’re using Snowflake, they’re using Amazon Security Lake, they’re using these tools that can go and run queries on cloud storage, and they can run queries over years of logs.
They are a little bit slow, is the complaint that we hear often. So that’s one thing that we care about at Scanner: making them fast. But basically we do believe that next-gen SIEMs really need to reduce the cost of all of the log data that’s coming in, and they need to give you visibility into massive amounts of history.
We hear from users, and this has happened multiple times, where someone will come in and there’ll be a breach. The attacker will then go and create multiple identities that all look innocuous in your cloud infrastructure, and they’ll do nothing for six months. And then one of those identities, six months later, will start to do something suspicious like exfiltrate some data. You stop that one and then you’re like, oh my god, where did that come from? Let’s go and find all the other identities. Was anything else created? Let’s see what else happened.
But that was six months ago, and it’s not in my SIEM window if I’m using a traditional SIEM. We think, nope, you should be able to go back years and find whenever that happened, piece that story together, and find that threat. So yeah, there are a bunch of cool things I think next-gen SIEMs need to have. But it is definitely not right today that it costs so much for logs, and it’s not right that you only have this tiny little window. It should be cheaper, and you should be able to see back a really long way so you can really protect yourself and look for these advanced persistent threats that are hiding there for long periods of time.
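To put rough numbers on the cost comparison in this answer, here is a small Python sketch of the arithmetic. The interview only gives orders of magnitude ("multiple dollars per gigabyte" for legacy ingestion, "tens of cents per gigabyte" for next-gen SIEMs, "a few cents per gigabyte per month" for cloud storage), so the specific prices below ($2.00, $0.20, and $0.03) are illustrative assumptions, not quotes from any vendor.

```python
DAYS_PER_YEAR = 365
MONTHS_PER_YEAR = 12


def annual_ingest_cost(gb_per_day: float, price_per_gb: float) -> float:
    """Yearly cost when every ingested gigabyte is billed once."""
    return gb_per_day * DAYS_PER_YEAR * price_per_gb


def annual_storage_cost(
    gb_per_day: float, retention_days: int, price_per_gb_month: float
) -> float:
    """Yearly cost of keeping a full retention window at rest (steady state)."""
    stored_gb = gb_per_day * retention_days
    return stored_gb * price_per_gb_month * MONTHS_PER_YEAR


if __name__ == "__main__":
    gb_per_day = 1000.0  # roughly one terabyte of logs per day

    # All three prices are assumptions chosen to match the ranges in the interview.
    legacy = annual_ingest_cost(gb_per_day, price_per_gb=2.00)
    next_gen = annual_ingest_cost(gb_per_day, price_per_gb=0.20)
    storage = annual_storage_cost(
        gb_per_day, retention_days=365, price_per_gb_month=0.03
    )

    print(f"Legacy SIEM ingestion:   ${legacy:,.0f} per year")
    print(f"Next-gen SIEM ingestion: ${next_gen:,.0f} per year")
    print(f"Raw storage, 1 year:     ${storage:,.0f} per year")
```

Under these assumed prices, one terabyte per day works out to $730,000 a year at $2 per gigabyte and $73,000 a year at 20 cents per gigabyte, which is the 10x gap described above. The storage figure covers bytes at rest only and leaves out compute, request, and indexing costs.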
Dorene Rettas
And so to recap, and I know you’ve stated it a couple of times, but I always like to state the obvious, it doesn’t hurt: when we’re talking about next-gen SIEMs, including Scanner.dev obviously, really what I’m hearing from you is there are three main benefits. There’s obviously significant cost reduction. There is the length of investigative capability, which is critical, right, and ultimately will assist the business with threats. And then also there’s the time, because if that investigative window is much smaller, as you talked about, an investigation could take months, weeks, whatever that is, of work.
And so you’ve reduced that as well. And time is money, so that swings back to the money again. So there are three main areas where I see that next-gen SIEMs really will benefit organizations as they’re moving forward.
Cliff Crosland
Yeah, absolutely. I think we’re a little bit wild about speed at Scanner, and so are other next-gen SIEM providers. I’m kind of excited to see everything that’s going on. There are really great competitors out there, but we like to measure things in terms of microseconds, not milliseconds, not seconds. We really try to make it as fast as possible to get through massive amounts of data. And yeah, as data lakes grow, you really need something fast to go and piece the puzzle together. Threats in the cloud get really, really creative and strange.
And if it does take you forever to go and look back forensically to find where threats came from, if it takes you weeks, then they might just go and create some more identities while they’re hidden there, or they might go and attack another cloud infrastructure you have, and you might not know exactly what to block. So we really think time is of the essence. You should be able to look at a year of history rapidly. It should take you seconds, it shouldn’t take you days.
And there’s a lot of cool innovation happening. I think some of it is from the gen AI space and people doing really cool machine learning on data lakes. So there’s really cool innovation in the data formats that are stored in cloud storage, not just for machine learning but also for log analysis, like what we’re doing. There are just a lot of really cool new ways that people are pursuing to store the data such that it can be analyzed as fast as possible, such that you can zoom in on exactly the right pieces of data as quickly as possible.
So yeah, I think a lot of people in the past have shied away from using cloud storage as a SIEM, ’cause cloud storage feels slow. It’s massive and it’s cheaper, but it’s not as fast to access as a really fast server like you might run Splunk on or something. But your Splunk server can only show you a little bit. We all need to work on making the data lake much faster, and that’s what we’re doing, and that’s what a bunch of really cool companies who are building next-gen SIEMs are trying to do as well. So it’ll be interesting to see. And yeah, you’re right, time is money, and getting to the bottom of threats ASAP is absolutely essential.
Otherwise you won’t do it. It’s like, oh, this will take a couple of weeks. I’m not actually gonna finish this investigation because a new thing has come up and I can’t afford to go and spend forever looking into the historical data. That should never be the case. It should be possible to search through things rapidly.
Dorene Rettas
I really appreciate your time, I appreciate talking to you. And what I love here is obviously you’re talking about your own company, but you’re also acknowledging that there are a lot of other organizations out there now, companies coming out and really helping to advance things for CISOs and other individuals as the market changes, which is really what it comes down to. The industry is ever changing, and it’s probably changing faster now than ever before with the influx of generative AI projects and cloud. And so it’s, I’d say, admirable when I find folks who are thinking outside of the box and creating new solutions for it.
I’m gonna leave us with a cliffhanger, because I know that you guys have a release coming out in a couple of weeks. And so when people wanna learn more about Scanner.dev, beyond going to your site, of course, they can check out what we’ll be publishing about your release in a few weeks as well. But Cliff, thank you. I really appreciate you joining.
Cliff Crosland
Wonderful, great to talk to you Dorene, really appreciate it.