Is OCSF the Key to Democratizing Security Data Lakes?

Ingesting and analyzing security data takes a lot of time and resources, but the Open Cybersecurity Schema Framework (OCSF) aims to change that. Here's what the Salesforce DnR team found when taking OCSF for a test spin.

Detecting and stopping today’s cyber attacks requires coordination across many different security tools. Unfortunately, the time and resources teams spend unifying this security log data is an accepted cost of performing the analytics necessary to root out potential attacks. The Open Cybersecurity Schema Framework (OCSF) project aims to change that, so that teams can spend less time normalizing data and more time protecting their environments. Let’s take a look at what makes this new approach so important and timely.

In order for security log event data to be useful, each log must have a schema that defines the fields available for searching, aggregating, or detecting on top of the log. For instance, a network-connection log might have a source IP that initiated the connection and a destination IP that received and served a response to the request — the schema would have fields such as src and dest so that security analysts know they can make queries against those fields.
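For illustration, here is a minimal sketch of what such a schematized event might look like. Only the src and dest field names come from the example above; everything else is hypothetical:

```python
# A hypothetical network-connection event after a schema is applied.
# Only "src" and "dest" come from the example above; other fields are invented.
event = {
    "timestamp": "2022-11-04T13:37:00Z",
    "src": "10.0.12.7",      # IP that initiated the connection
    "dest": "203.0.113.50",  # IP that received and served the response
    "bytes_out": 4096,
}

# Because the schema guarantees these fields exist, an analyst can filter
# or aggregate on them without knowing the raw log format.
if event["dest"] == "203.0.113.50":
    print("match:", event["src"], "->", event["dest"])
```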

However, there’s little agreement across logtypes or tools on these schemas. Using the same example, CrowdStrike extracts the source IP in network logs as ip and the destination as RemoteAddressIP4. PAN firewalls call the source IP src and the destination dst. Some logs are completely unstructured, and the source and destination IPs can only be determined by understanding the log format and where the IPs appear within the log event.

Oftentimes, analysts will want to make queries across large subsets of logs. For example, they might want to know, across all network logs, whether any host has ever connected to a particular IP address. For this, not only do we need a schema for each logtype, but we need these schemas to agree. If only there were a better way to ingest and analyze data with less manual effort!
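To make the pain concrete, here is a sketch of what that query looks like when schemas disagree. The field names match the vendors above, but the log records themselves are invented:

```python
# Invented sample records; only the field names match the vendors above.
crowdstrike_log = {"ip": "10.0.12.7", "RemoteAddressIP4": "203.0.113.50"}
pan_log = {"src": "10.0.12.7", "dst": "203.0.113.50"}

TARGET = "203.0.113.50"

# Without an agreed schema, "did any host connect to TARGET?" becomes a
# union of per-vendor conditions -- one clause per logtype, forever.
def connected_to_target(record: dict) -> bool:
    return (
        record.get("RemoteAddressIP4") == TARGET  # CrowdStrike
        or record.get("dst") == TARGET            # PAN firewall
        # ...plus one more clause for every new log source
    )

print(any(connected_to_target(r) for r in [crowdstrike_log, pan_log]))
```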

The Challenges of Log Normalization

Until now, Salesforce has addressed this with a complex internal solution called the “Events Data Dictionary” (DDI), which lists every single field in every single log. We have also defined data field extraction rules (DFEs) that parse the logs and enforce the dictionary as a schema. In our detection and response architecture, this process happens in a system called the “Normalizer”.
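As a rough sketch of what a field extraction rule can look like (the raw line, the regex, and the IPSource field name are invented for illustration; our actual DFEs are internal):

```python
import re

# Hypothetical raw, unstructured firewall line.
raw = "2022-11-04 13:37:00 ALLOW TCP 10.0.12.7:51234 -> 203.0.113.50:443"

# An invented rule in the spirit of a DFE: parse the raw line and rename
# the captures to the dictionary's canonical field names.
RULE = re.compile(
    r"(?P<IPSource>\d+\.\d+\.\d+\.\d+):\d+ -> (?P<IPDestination>\d+\.\d+\.\d+\.\d+):\d+"
)

match = RULE.search(raw)
normalized = match.groupdict() if match else {}
print(normalized)  # {'IPSource': '10.0.12.7', 'IPDestination': '203.0.113.50'}
```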

After normalization, all logs, including previously unstructured ones, have matching schemas and a set of fields we can query. We extract RemoteAddressIP4 in CrowdStrike logs and rename it to IPDestination; similarly, dst from PAN logs becomes IPDestination. To check whether a particular IP was ever connected to, analysts simply query all logs and ask where IPDestination == [...].
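After that rename step, the per-vendor clauses from earlier collapse into a single condition. A minimal sketch, with the rename maps taken from the fields above and everything else invented:

```python
# Field renames described above, keyed by logtype.
RENAMES = {
    "crowdstrike": {"RemoteAddressIP4": "IPDestination"},
    "pan":         {"dst": "IPDestination"},
}

def normalize(logtype: str, record: dict) -> dict:
    """Rename vendor-specific fields to the dictionary's canonical names."""
    renames = RENAMES[logtype]
    return {renames.get(k, k): v for k, v in record.items()}

logs = [
    normalize("crowdstrike", {"ip": "10.0.12.7", "RemoteAddressIP4": "203.0.113.50"}),
    normalize("pan", {"src": "10.0.12.7", "dst": "203.0.113.50"}),
]

# One condition now covers every logtype.
hits = [r for r in logs if r.get("IPDestination") == "203.0.113.50"]
print(len(hits))  # 2
```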

So far, so good: analysts can search. However, log normalization also generates its own challenges:

  1. There are hundreds of fields in the data dictionary, many of which are NULL, wasting a lot of space.
  2. When we get a new log type, its fields may disagree with what we’ve previously defined, generating strange edge cases (e.g. user, username, and useremail mean different things depending on logtype).
  3. We have to manually parse every new logtype or sublogtype as part of onboarding a new log source by writing regex or parsing rules.
  4. Any change to the log structure, such as the addition or substitution of fields, can cause existing rules to silently miss relevant fields (see the sketch after this list). This is a major data quality issue.
  5. Additionally, there isn’t a good way to validate or keep track of the rules at scale. Rules are likely duplicated, and the ruleset is hard to maintain or update because deprecating rules carries substantial risk. This can lead to regressions where new rules overwrite old rule functionality.
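As a concrete illustration of point 4, here is the same invented rule from before failing silently after a small format change by the log producer:

```python
import re

# Same invented extraction rule as in the earlier sketch.
RULE = re.compile(
    r"(?P<IPSource>\d+\.\d+\.\d+\.\d+):\d+ -> (?P<IPDestination>\d+\.\d+\.\d+\.\d+):\d+"
)

# The producer changes the separator from "->" to ">" in a new release...
raw_v2 = "2022-11-04 13:37:00 ALLOW TCP 10.0.12.7:51234 > 203.0.113.50:443"

match = RULE.search(raw_v2)
# ...and the rule extracts nothing: no error, just silently missing fields.
print(match.groupdict() if match else {})  # {}
```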

All of this work has a growing cost associated with it. Instead of focusing primarily on detecting and responding to events, teams spend time manually normalizing this data as a prerequisite.

OCSF Adoption

The Open Cybersecurity Schema Framework (OCSF), publicly announced as an open schema standard at Black Hat in August 2022, is a first-of-its-kind project: a significant effort to generalize data across multiple cybersecurity sources. The framework aims to deliver a simplified, vendor-agnostic taxonomy that helps all security teams realize better, faster data ingestion and analysis without the time-consuming up-front normalization task.

The goal is an open standard that can be adopted in any environment, application, or solution provider, and that fits in with existing security standards and processes. OCSF is an open-source framework similar to our Data Dictionary approach in that it solves the initial problem described above, but it also addresses the pain points of our approach by defining a subset of fields to be schematized based on analyst usage.

Our Chief Trust Officer, Vikram Rao, shared, “Every company is facing the imperative to go digital, fast. But building a security posture to meet internet-scale levels of digital trust can be a major challenge. New standards like OCSF reduce complexity for security teams, empowering them to focus on more impactful work like threat analysis and attack prevention.”

The Test Run

The Salesforce Detection and Response Engineering team evaluated OCSF in two ways: enforcing the schema on Salesforce-generated logs, and adapting our current normalization pipeline to output logs with the OCSF-defined schemas rather than the custom schemas from our Events Data Dictionary.

In order to understand OCSF better, we first performed an exercise to transform a hypothetical Salesforce security detection event to OCSF format and evaluate the result. Below are screenshots of the event in our current DDI format vs. OCSF format.

Salesforce Security Event - DDI Schematized
Salesforce Security Event - OCSF Schematized
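We can’t reproduce the screenshots here, but the shape of the change is roughly the following. Both records are simplified illustrations rather than our actual schemas, and the OCSF attribute names are paraphrased from the public schema:

```python
# Simplified illustration only -- neither record is our real schema.
# DDI-style: one flat record with many dictionary fields (most often NULL).
ddi_event = {
    "EventType": "security_detection",
    "IPSource": "10.0.12.7",
    "IPDestination": "203.0.113.50",
    "Severity": "high",
    "UserName": None,   # unused dictionary fields still occupy the row
    "UserEmail": None,
}

# OCSF-style: a typed event class with nested objects; attribute names are
# paraphrased from the public OCSF schema and may not match it exactly.
ocsf_event = {
    "class_uid": 2001,          # e.g. a Security Finding-style class
    "activity_id": 1,
    "severity_id": 4,
    "time": 1667568000000,
    "metadata": {
        "version": "1.0.0",
        "product": {"vendor_name": "Salesforce"},
    },
    "message": "Suspicious connection from 10.0.12.7 to 203.0.113.50",
}
```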

While working on this exercise and exploring the OCSF schema, we found interesting fields that we could potentially use. For example, the remediation object links to knowledge base articles and carries a description, which helps analysts downstream while triaging incidents and guides their next actions.
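For instance, a finding carrying a remediation object might look roughly like this (attribute names paraphrased from the OCSF schema and values invented):

```python
# Invented values; remediation attribute names paraphrased from OCSF docs.
finding = {
    "class_uid": 2001,
    "message": "Suspicious connection to known-bad IP",
    "remediation": {
        "desc": "Block the destination IP at the perimeter firewall.",
        "kb_articles": ["KB0012345"],  # links analysts can follow while triaging
    },
}

# Downstream, a triage tool can surface the guidance directly:
print(finding["remediation"]["desc"])
```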

We have jump-started adoption by working across internal teams and prototyping the transformation of our internal application logs (sfdc_applog), with the goal of eventually becoming OCSF-compliant log producers. We plan to keep collaborating on the project moving forward. If you want to collaborate and use this schema to standardize your logs, you can get started by forking the OCSF GitHub repository [here].
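The prototyping pattern is roughly a per-logtype mapping from internal field names onto OCSF attributes. The sfdc_applog field names below are hypothetical, since we can’t share the real ones, and the OCSF attribute names are again paraphrased:

```python
# Hypothetical sfdc_applog fields mapped onto paraphrased OCSF attributes.
APPLOG_TO_OCSF = {
    "event_time": "time",
    "severity":   "severity_id",
    "details":    "message",
}

def to_ocsf(applog_record: dict) -> dict:
    """Map an internal applog record onto OCSF-style attribute names."""
    out = {APPLOG_TO_OCSF[k]: v for k, v in applog_record.items() if k in APPLOG_TO_OCSF}
    out["metadata"] = {"product": {"vendor_name": "Salesforce"}}
    return out

print(to_ocsf({"event_time": 1667568000000, "severity": 4, "details": "login anomaly"}))
```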

A Big Shout Out

The Open Cybersecurity Schema Framework project, intended to simplify data classification in a vendor-neutral framework to help security teams spend less time normalizing data and more time on defense, deserves a round of applause. The OCSF includes contributions from Cloudflare, CrowdStrike, DTEX, IBM Security, IronNet, JupiterOne, Okta, Palo Alto Networks, Rapid7, Salesforce, Securonix, Sumo Logic, Tanium, Trend Micro and Zscaler.

We at Salesforce are excited to be early adopters of OCSF. The early exercises we performed to transform our top few logtype events to OCSF were a great first step and helped different internal teams get to know the framework. Given that OCSF is an open standard and a collaborative effort across the security industry, with multiple stakeholders adopting it and helping it evolve, it has the potential to democratize security data lakes in all organizations.

Finally, in order for OCSF to have the most impact, we need cloud providers and security vendors to align and provide logs in the OCSF schema. If you’re in this space, please help break down boundaries and make this a standard in the security community!
