How Data Lakes are Advancing Threat Hunting

By September 18, 2019 October 10th, 2019 Cloud Security

Businesses that can harness the capacity of data lakes are the ones that truly begin to experience the benefits of big data. One area where this could make a significant difference is cybersecurity, specifically advanced threat hunting.

In this article, we’ll look at how data lakes differ from other data repositories (e.g. data warehouses), what challenges threat hunters currently face, and how the next-gen SIEM, together with data lakes, is advancing the field.

We’ll also explain why seeking guidance from an experienced third party like Shamrock Consulting Group can give your business a valuable leg up before diving into the depths of the data lake.

What is a Data Lake, Anyway?

One way to imagine a data lake is to think of a huge body of water fed by lots of incoming streams. Bring the cloud into the picture and you can imagine data flowing down from the heavens into that one big data source.

Unlike a data warehouse, which stores processed data in a predefined, structured schema, a data lake is a repository for raw data kept in its native format. Each item is assigned a unique identifier and tagged with an extended set of metadata tags.
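To make the idea of a unique identifier plus metadata tags more concrete, here is a minimal sketch in Python. It is only an in-memory illustration: the `TinyLake` and `LakeObject` names, the tag keys and the sample log lines are all hypothetical and do not correspond to any particular data lake product.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake: untouched payload plus descriptive tags."""
    payload: bytes
    tags: dict
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class TinyLake:
    """Hypothetical in-memory stand-in for a flat data lake store."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes, **tags) -> str:
        """Store the payload as-is and return its generated identifier."""
        obj = LakeObject(payload=payload, tags=tags)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id: str) -> LakeObject:
        return self._objects[object_id]

    def find(self, **wanted):
        """Return objects whose tags match every requested key/value."""
        return [
            o for o in self._objects.values()
            if all(o.tags.get(k) == v for k, v in wanted.items())
        ]


lake = TinyLake()
# Raw data goes in unchanged, whatever its format.
fw_id = lake.put(b"DENY tcp 10.0.0.5 -> 203.0.113.9:445", source="firewall", fmt="text")
lake.put(b'{"user": "alice", "event": "login"}', source="vpn", fmt="json")

firewall_items = lake.find(source="firewall")
```

Nothing is parsed or reshaped on the way in. The structure is only imposed later, when someone decides what question to ask.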

Although the term ‘data lake’ is intertwined with the Hadoop framework, its usage has now been extended to cover any data storage facility where the purpose of storage is not pre-defined.

A modern data lake can take in data from millions of logs and events every second. This includes new sources of data from click-streams, social media, mobile apps and IoT devices. Utilizing this data is the challenge of our age, not just for threat hunters but also for data analysts connected to various areas of the business.

The value of the data lake is supported by research. For example, in 2017, an Aberdeen Group research project for AWS found that businesses that invested in data lake technology outperformed their competitors by 9% in organic revenue growth.

The evolution of the SIEM (which we will go into later) means that, whereas threat hunters previously had to tap into numerous data streams individually, they can now access the data lake as a whole. Pretty sweet, right?

The Challenges of Threat Hunting Today

Cyber-attacks are persistent and come in a variety of forms. Threats to enterprise networks can be broadly broken down into three groups:

Malicious Insiders

These are people who work within the business and use their access privileges to cause damage to or defraud the organization that employs them. Threat hunters use browser forensics and study unusual authentication patterns to identify planned or actual attacks.

Data Exfiltration

This is where a successful attack has previously occurred and sensitive data is being illicitly diverted outside of the company network. Anomalies in the frequency and/or size of data transfers are red flags for this type of attack.

External Attacks

These are ongoing attacks which have yet to breach the network. They include advanced persistent threats (APTs), distributed denial-of-service (DDoS) attacks and zero-day exploits. Threat hunters need to be alert to both known attack signatures and the symptoms of new types of attack.
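As a rough illustration of the data exfiltration red flag mentioned above (anomalies in the size of data transfers), the sketch below compares recent transfers against a historical baseline. The function name, the threshold of three standard deviations and the sample figures are assumptions made for the example, and a real deployment would likely use a more robust statistical model.

```python
from statistics import mean, stdev


def flag_unusual_transfers(baseline_mb, recent_mb, threshold=3.0):
    """Return recent transfer sizes that sit more than `threshold`
    standard deviations above the mean of the baseline."""
    mu = mean(baseline_mb)
    sigma = stdev(baseline_mb)
    if sigma == 0:
        # A perfectly flat baseline: anything larger is unusual.
        return [x for x in recent_mb if x > mu]
    return [x for x in recent_mb if (x - mu) / sigma > threshold]


# Hypothetical outbound transfer sizes in MB for one workstation.
baseline = [12, 15, 11, 14, 13, 12, 16, 14]
recent = [13, 15, 480, 12]

suspicious = flag_unusual_transfers(baseline, recent)
```

Here the 480 MB transfer stands out against a baseline of roughly 11 to 16 MB, while the other recent transfers pass unnoticed.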

As networks become more complex, security teams are faced with the task of making sense of inputs from a huge number of devices and systems, including cloud-based software, apps and services. For starters, there are logs from firewalls, VPNs, DLP tools, proxies, servers and endpoints. Then there are multiple communications from devices using different protocols such as DHCP, IP, TCP, HTTP and FTP.

Human threat hunters would simply be unable to deal with this data flood without help from sophisticated security information and event management (SIEM) systems.

The SIEM: Past and Present

A data lake is worthless to a threat hunter unless he or she can query the data within it. To understand how data lakes are advancing threat hunting, we have to also look at how SIEMs have evolved over the past twenty years.

SIEMs were the natural next step forward from separate security information management (SIM) and security event management (SEM) systems commonly in use at the beginning of the 21st century. These were only able to cope with small-scale data and had poor data visualization capability. Human operators had to manually ingest the data, write correlation rules and analyze the outputs. It was a slow process, but cyber-attacks at this time were relatively predictable and easy to spot.

Although SIEMs had been around for a decade by 2010, it was only then that they started becoming commonplace in businesses. SIEMs could automatically integrate logs from different devices and systems in one place, enabling threat hunters to detect more subtle threat actors (i.e. those that could move laterally within a breached network). However, analysis was still performed manually on a sub-set of filtered data using limited pre-built visualization tools.
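The sketch below gives a feel for what bringing logs together and writing a correlation rule involves. It normalizes two hypothetical log formats (a plain-text SSH-style line and a JSON logon event) into one shape, then applies a simple rule that flags accounts logging on to several distinct hosts in a short window, one possible symptom of lateral movement. The formats, field names and thresholds are all assumptions for illustration, not those of any real SIEM.

```python
import json
from datetime import datetime, timedelta


def from_ssh_log(line):
    """Parse a hypothetical 'timestamp user host' text log line."""
    ts, user, host = line.split()
    return {"time": datetime.fromisoformat(ts), "user": user, "host": host}


def from_windows_event(raw):
    """Parse a hypothetical JSON logon event into the same shape."""
    e = json.loads(raw)
    return {
        "time": datetime.fromisoformat(e["ts"]),
        "user": e["account"],
        "host": e["computer"],
    }


def lateral_movement_suspects(events, min_hosts=3, window=timedelta(minutes=10)):
    """Flag users who log on to `min_hosts` or more distinct hosts within `window`."""
    suspects = set()
    events = sorted(events, key=lambda e: e["time"])
    for i, first in enumerate(events):
        # Distinct hosts this user touched within the window starting at `first`.
        hosts = {
            e["host"] for e in events[i:]
            if e["user"] == first["user"] and e["time"] - first["time"] <= window
        }
        if len(hosts) >= min_hosts:
            suspects.add(first["user"])
    return suspects


events = [
    from_ssh_log("2019-09-18T10:00:00 svc_backup db01"),
    from_windows_event(
        '{"ts": "2019-09-18T10:02:00", "account": "svc_backup", "computer": "hr-pc7"}'
    ),
    from_ssh_log("2019-09-18T10:05:00 svc_backup web03"),
    from_ssh_log("2019-09-18T10:06:00 alice web03"),
]

suspects = lateral_movement_suspects(events)
```

The rule only works because both sources end up in one normalized list. Looked at separately, neither log shows the account touching three machines.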

SIEM 3.0

The next generation SIEM tech, sometimes referred to as SIEM 3.0, brought machine learning (ML) into the equation. This smart system not only ingests data automatically but also makes decisions based on pattern detection that goes well beyond human capabilities. For common threats, SIEM 3.0 can trigger Security Orchestration, Automation and Response (SOAR) tools to take remedial action without human intervention.

For more complex threats, the next-gen SIEM can alert a human threat hunter, who can then attempt to determine the source and nature of the threat. All security information can be accessed via a single pane of glass through which threat hunters receive a high-fidelity, full context view of security events.
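The split described above, automated response for common threats and a human threat hunter for complex ones, can be pictured as a simple routing decision. This is a sketch under stated assumptions: the alert types, playbook names, queue name and confidence cut-off are hypothetical and not taken from any SIEM or SOAR product.

```python
# Hypothetical playbooks for well-understood alert types.
PLAYBOOKS = {
    "known_malware_hash": "quarantine_endpoint",
    "brute_force_login": "lock_account",
}


def route_alert(alert, min_confidence=0.9):
    """Decide whether an alert is handled automatically or sent to a human."""
    playbook = PLAYBOOKS.get(alert["type"])
    if playbook and alert["confidence"] >= min_confidence:
        # Common, high-confidence threat: hand off to an automated playbook.
        return ("automate", playbook)
    # Unknown type or low confidence: a threat hunter takes a look.
    return ("escalate", "threat_hunter_queue")


decisions = [
    route_alert({"type": "brute_force_login", "confidence": 0.97}),
    route_alert({"type": "brute_force_login", "confidence": 0.55}),
    route_alert({"type": "unusual_dns_tunnel", "confidence": 0.99}),
]
```

Only the first alert is automated. The second is a known type but too uncertain, and the third has no playbook at all, so both go to a person.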

Data can be subjected to a range of sophisticated analytic techniques such as user and entity behavior analytics (UEBA), which matches observed behavior with expected behavior and picks up on any anomalies. There is no need for time-consuming IT requests to build an analytics platform because this is already pre-built into the system.
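At its simplest, matching observed behavior against expected behavior means building a per-user baseline from history and checking new events against it. The sketch below does this with login hours and resources accessed. It is a deliberately naive illustration with made-up users and resources. Real UEBA tools rely on far richer statistical and ML models.

```python
from collections import defaultdict


def build_baseline(history):
    """Record the hours and resources each user has been seen with."""
    baseline = defaultdict(lambda: {"hours": set(), "resources": set()})
    for user, hour, resource in history:
        baseline[user]["hours"].add(hour)
        baseline[user]["resources"].add(resource)
    return baseline


def anomalies(baseline, event):
    """List the ways an event departs from the user's expected behavior."""
    user, hour, resource = event
    expected = baseline.get(user)
    if expected is None:
        return ["unknown user"]
    reasons = []
    if hour not in expected["hours"]:
        reasons.append("unusual hour")
    if resource not in expected["resources"]:
        reasons.append("unusual resource")
    return reasons


# Hypothetical history: (user, hour of day, resource accessed).
history = [("alice", 9, "crm"), ("alice", 10, "crm"), ("alice", 14, "wiki")]
baseline = build_baseline(history)

normal = anomalies(baseline, ("alice", 9, "wiki"))
odd = anomalies(baseline, ("alice", 3, "payroll-db"))
```

A 3 a.m. visit to a payroll database by someone who normally uses the CRM during office hours is flagged on both counts, while ordinary activity produces an empty list.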

Unlike with earlier iterations, security professionals can now query as much of the historical and current data as can feasibly be stored. With data lake integration, that amount is practically limitless.

Querying the Data Lake

So now we’ve come full circle to the data lake. The unstructured nature of data in a lake makes it suitable for a wide range of different queries, including SQL queries, full text search, real-time analytics and machine learning. Integrated with advanced SIEM tech, searching for correlations and understanding the full context of any patterns becomes as simple as clicking a search button on a central user interface. This means attacks can be detected and their kill chains broken sooner which, as any threat hunter knows, is a very satisfying outcome – not to mention great news for the business itself.
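To show how the same raw events can serve both a SQL query and a text search, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for a real lake query engine. The event fields, table layout and sample records are assumptions made for the example, and the text search is a plain `LIKE` match rather than a true full-text index.

```python
import json
import sqlite3

# Hypothetical raw JSON events, as they might sit in the lake.
raw_events = [
    '{"ts": "2019-09-18T10:00:00", "src": "10.0.0.5", "action": "login_failed", "msg": "bad password for admin"}',
    '{"ts": "2019-09-18T10:01:00", "src": "10.0.0.5", "action": "login_failed", "msg": "bad password for admin"}',
    '{"ts": "2019-09-18T10:02:00", "src": "10.0.0.8", "action": "login_failed", "msg": "bad password for bob"}',
    '{"ts": "2019-09-18T10:03:00", "src": "10.0.0.5", "action": "process_start", "msg": "powershell -enc SQBFAFgA"}',
]

# Impose a structure only at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, src TEXT, action TEXT, msg TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (:ts, :src, :action, :msg)",
    [json.loads(line) for line in raw_events],
)

# SQL-style question: which sources have the most failed logins?
failed_by_source = conn.execute(
    "SELECT src, COUNT(*) FROM events WHERE action = 'login_failed' "
    "GROUP BY src ORDER BY COUNT(*) DESC"
).fetchall()

# Text-search-style question: where does 'powershell' appear in a message?
powershell_hits = conn.execute(
    "SELECT ts, src FROM events WHERE msg LIKE ?", ("%powershell%",)
).fetchall()
```

Both questions point at the same source address, which is the kind of correlation a threat hunter would want surfaced in one place.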

The virtually unlimited capacity of data lakes means that threat hunters can query vast quantities of historic and present data. This makes it much easier to discover subtle anomalies when investigating previous attacks, and the results can be fed back to the SIEM to help it improve its ability to recognize and derail future attacks.

Before You Take a Dip in the Data Lake…

Data lakes are available through AWS (S3), Azure Data Lake and Apache Hadoop among others and, as with other distributed technologies, your choice of data lake and SIEM is critical to the security function of your business.

Although data lakes have advantages over data warehouses, most businesses will need both since warehouses are optimized for fast queries connected to line of business applications. Shamrock can help ensure that you make the right choices for your business every step of the way.

Our team has over a decade of experience working alongside Fortune 500, Fortune 100 and Fortune 50 companies on their security solutions and cloud connectivity. Shamrock is a vendor-neutral source of specialist support, so you can trust us to guide you in the most profitable direction for your business (and our security analysis is free!).

Ben Ferguson

Ben Ferguson is the Vice President and Senior Network Architect for Shamrock Consulting Group, an industry leader in digital transformation solutions. Since his departure from biochemical research in 2004, Ben has built core competencies around cloud direct connects and cloud cost reduction, enterprise wide area network architecture, high density data center deployments, cybersecurity and Voice over IP telephony. Ben has designed hundreds of complex networks for some of the largest companies in the world and he’s helped Shamrock become a top consulting partner for the three largest public cloud platforms: AWS, Azure and GCP. When he takes the occasional break from designing networks, he enjoys surfing, golf, working out, trying new restaurants and spending time with his wife, Linsey, his son, Weston and his dog, Hamilton.