TI in your ETL

TI in your ETL is more common than you think

Mature Security Operations (SecOps) programs have a good handle on ingesting the right security telemetry for their organization and make good use of it in threat detection and incident response processes. As these SecOps teams mature their use of telemetry, a common project is surfacing externally known threats that are present in their environment by matching on Indicators of Compromise (IOCs).

Common types of IOCs are IP addresses, file hash values, domains, and URLs. Matching on IOCs is less durable than detecting adversary TTPs, because swapping out an IP address or domain is far easier for an adversary than learning new techniques. That doesn’t mean IOC detection isn’t useful: adversaries get lazy, and IOC detections can find them just the same.

Indicator enrichment requirements: ETL

ETL front and center of security telemetry

Every data source in a team’s environment (identity, EDR, network, cloud logs, application logs) has its own perspective on what is happening on the network, and each communicates it in its own schema. One of the (many) problems that distinct telemetry schemas introduce is that they limit how widely threat matching can be deployed.

The fix is to move the non-standard fields (IPAddress, actor.src_ip, client.ipAddress, id.orig_h, or whatever the log provider chose to represent IPs) into a standard field such as src_endpoint. Using a single field enables threat feeds to be applied universally across all datasets.
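As a rough illustration of that normalization step, here is a minimal Python sketch. The field list comes from the examples above, but treating every one of them as a source address, and writing the result to `src_endpoint.ip`, are assumptions made for the example; your own mapping will depend on each log source. The function names are hypothetical.

```python
from typing import Any, Optional

# Hypothetical list of vendor-specific IP field paths, taken from the examples
# in the text. Each path is a tuple of keys, so a literal key that contains a
# dot (such as "id.orig_h") stays unambiguous.
SOURCE_IP_PATHS = [
    ("IPAddress",),
    ("actor", "src_ip"),
    ("client", "ipAddress"),
    ("id.orig_h",),
]


def get_path(event: dict, path: tuple) -> Optional[Any]:
    """Walk nested dicts along path; return None if any key is missing."""
    current: Any = event
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def normalize_source_ip(event: dict) -> dict:
    """Return a shallow copy of event with the first IP found stored at src_endpoint.ip."""
    normalized = dict(event)
    for path in SOURCE_IP_PATHS:
        value = get_path(event, path)
        if value:
            # Assumption: the standard location is src_endpoint.ip.
            normalized["src_endpoint"] = {"ip": str(value)}
            break
    return normalized
```

Calling `normalize_source_ip({"actor": {"src_ip": "10.0.0.1"}})` keeps the original keys and adds a `src_endpoint` object, so downstream matching only ever has to look in one place.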

Indicator enrichment process

As new IOCs are discovered and distributed, the two key tasks are (1) check your current exposure with a retroactive search of your environment, and (2) proactively monitor for those threats entering your environment.

0. Indicator ingestion

The goal of ingestion is to create a central set of files that both the SIEM and ETL pipelines can use as they do IOC matching. A common format that SIEM, ETL tools, and analysts can all work with is CSVs. The CSV fields should be limited to the exact fields needed to provide analyst context and then link back to the full events at their providers.

Integrate feeds into the datastore by adding their raw data into a file tree, and on a regular schedule aggregate all data into a normalized set of CSVs organized by indicator type.

my-datastore
└── iocs
    ├── domains.csv
    ├── hashes.csv
    ├── ips.csv
    ├── provider1
    │   ├── feed1
    │   │   └── yyyy
    │   │       └── mm
    │   │           └── dd
    │   │               └── uuid
    │   └── feed2
    │       └── yyyy
    │           └── mm
    │               └── dd
    │                   └── uuid
    └── provider2
        ├── feed1
        │   └── yyyy
        │       └── mm
        │           └── dd
        │               └── uuid
        └── feed2
            └── yyyy
                └── mm
                    └── dd
                        └── uuid
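
The scheduled aggregation over a tree like this one could look something like the sketch below. It assumes each raw file is JSON Lines with `type`, `indicator`, and optional `description` keys; that record shape, the CSV column names, and the `aggregate_iocs` function are all made up for illustration, since real feeds each need their own parser.

```python
import csv
import json
from pathlib import Path

# Hypothetical mapping from the indicator type in a raw record to its CSV.
TYPE_TO_FILE = {"ip": "ips.csv", "domain": "domains.csv", "hash": "hashes.csv"}
CSV_FIELDS = ["indicator", "provider", "feed", "description", "reference"]


def aggregate_iocs(ioc_root: Path) -> dict:
    """Rebuild the per-type CSVs from every raw file under ioc_root."""
    rows = {name: {} for name in TYPE_TO_FILE.values()}
    # Raw files live at <provider>/<feed>/<yyyy>/<mm>/<dd>/<uuid>.
    for raw in sorted(ioc_root.glob("*/*/*/*/*/*")):
        if not raw.is_file():
            continue
        provider, feed = raw.relative_to(ioc_root).parts[:2]
        for line in raw.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            target = TYPE_TO_FILE.get(record.get("type"))
            value = record.get("indicator")
            if target is None or not value:
                continue
            # Key on (indicator, provider, feed) to deduplicate across days;
            # files are visited in path order, so the latest date wins.
            rows[target][(value, provider, feed)] = {
                "indicator": value,
                "provider": provider,
                "feed": feed,
                "description": record.get("description", ""),
                # Link back to the raw file that holds the full record.
                "reference": raw.relative_to(ioc_root).as_posix(),
            }
    counts = {}
    for name, by_key in rows.items():
        with open(ioc_root / name, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=CSV_FIELDS)
            writer.writeheader()
            writer.writerows(by_key.values())
        counts[name] = len(by_key)
    return counts
```

Rebuilding the CSVs from scratch on every run keeps the job idempotent, at the cost of rereading the whole tree each time.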

1. Retroactive searching

The goal of retroactive matching is to discover if your environment has already had exposure to the threats your IOC feed learned about today.

Every few hours, load the new indicators into SIEM lookup tables and query your environment’s telemetry for them across contextually relevant fields. The query time window should be as long as possible, at a minimum covering the period since the indicators were discovered.

Contextually relevant means not blindly matching every IP in the IOC dataset against every IP in your environment. Consider the malicious behavior the IOC was observed in, and only match where it makes sense. For example, match IPs associated with vulnerability scanners against OCSF src_endpoint objects rather than dst_endpoint objects.
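
To make the idea of contextual relevance concrete, here is a small in-memory Python sketch. In practice this logic would live in your SIEM query; the `behavior` labels, the `FIELDS_BY_BEHAVIOR` mapping, and the event shape are all assumptions for the example.

```python
# Hypothetical mapping from the behavior an indicator was observed in to the
# event fields where a match is meaningful.
FIELDS_BY_BEHAVIOR = {
    "scanner": ["src_endpoint"],  # scanners initiate connections to you
    "c2": ["dst_endpoint"],       # your hosts connect out to C2 servers
}


def find_matches(events, indicators):
    """Return (event, field, indicator) triples for contextually relevant IP matches."""
    # Index indicators by IP so each event field is a single dict lookup.
    by_ip = {}
    for indicator in indicators:
        by_ip.setdefault(indicator["ip"], []).append(indicator)

    matches = []
    for event in events:
        for field in ("src_endpoint", "dst_endpoint"):
            ip = (event.get(field) or {}).get("ip")
            for indicator in by_ip.get(ip, []):
                # Only keep the match if this field makes sense for the behavior.
                if field in FIELDS_BY_BEHAVIOR.get(indicator["behavior"], []):
                    matches.append((event, field, indicator))
    return matches
```

With a scanner indicator loaded, an event where that IP is the source matches, while an event where one of your hosts happens to connect to the same IP does not.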

2. Proactive matching

The goal of proactive matching is to constantly assess whether new data from your environment shows attacks that other organizations have already observed. Doing this matching in ETL pipelines creates richer, more contextualized events and mitigates a few problems that full-SIEM solutions have, such as lookup table constraints and query performance.

After events have been normalized to a common schema, do the lookup in your indicator datastore; matches should be stored in the event itself in an enrichment field. Here’s an example of how an enrichment might look if you’re using ECS’s threat tags for a PikaBot C2 using Abuse.ch Feodo data.

{
  "destination": {
    "ip": "34.120.195.249"
  },
  "event": {
    "original": {
      "DestinationIP": "34.120.195.249"
    }
  },
  "threat": {
    "enrichments": [
      {
        "indicator": {
          "confidence": "35",
          "description": "FEODO: Pikabot C2",
          "first_seen": "2024-05-06 18:50:05Z",
          "marking": {
            "tlp": "TLP:CLEAR"
          },
          "provider": "abuse_ch_feodo",
          "reference": "r2://your-ioc-datastore/iocs/abuse-ch/feodo/2024/05/06/40753e84-baa9-42f4-954f-92209e07b7ce.jsonl",
          "type": "ipv4-addr"
        },
        "matched": {
          "atomic": "34.120.195.249",
          "field": "destination.ip",
          "occurred": "2024-05-04T13:12:28.391Z"
        }
      }
    ]
  }
}
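
A pipeline step that produces an enrichment shaped like this one might look like the following Python sketch. It assumes the indicator CSV has `indicator`, `provider`, `description`, and `reference` columns and that events already use ECS-style `source.ip` and `destination.ip`; the function names and column names are hypothetical.

```python
import csv
from datetime import datetime, timezone


def load_ip_indicators(csv_file) -> dict:
    """Index the rows of an indicator CSV by the 'indicator' column."""
    return {row["indicator"]: row for row in csv.DictReader(csv_file)}


def enrich_event(event: dict, indicators: dict, now=None) -> dict:
    """Append a threat.enrichments entry for each IP field that matches.

    The event is modified in place and also returned.
    """
    now = now or datetime.now(timezone.utc)
    for field in ("source.ip", "destination.ip"):
        parent, leaf = field.split(".")
        value = event.get(parent, {}).get(leaf)
        row = indicators.get(value)
        if row is None:
            continue
        enrichment = {
            "indicator": {
                "description": row["description"],
                "provider": row["provider"],
                "reference": row["reference"],
                # Assumption: this sketch only handles IPv4 indicators.
                "type": "ipv4-addr",
            },
            "matched": {
                "atomic": value,
                "field": field,
                "occurred": now.isoformat(),
            },
        }
        event.setdefault("threat", {}).setdefault("enrichments", []).append(enrichment)
    return event
```

Events with no match pass through untouched, so the presence of the `threat` field is itself the signal downstream.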

When the enriched object is ingested into a SIEM, detection analytics look for newly created events where the threat field exists, with optional filtering on indicator confidence levels.
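
Expressed in Python rather than a SIEM query language, that detection logic is roughly the sketch below. It assumes confidence is a numeric value that may arrive as a string, as in the example event, and it treats a missing or non-numeric confidence as 0; both choices, and the `is_threat_match` name, are assumptions.

```python
def is_threat_match(event: dict, min_confidence: int = 0) -> bool:
    """Return True if any enrichment on the event meets the confidence threshold."""
    enrichments = event.get("threat", {}).get("enrichments", [])
    for enrichment in enrichments:
        raw = enrichment.get("indicator", {}).get("confidence")
        try:
            confidence = int(raw)
        except (TypeError, ValueError):
            # Assumption: missing or non-numeric confidence counts as 0.
            confidence = 0
        if confidence >= min_confidence:
            return True
    return False
```

With the default threshold, any enriched event alerts; raising `min_confidence` to 50 would suppress the confidence-35 example above.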

Summary

This isn’t a perfect solution: it doesn’t handle the expiration of indicators or analyst curation. It does get teams with ETL pipelines from zero indicator matching to 99% of the way there. Later down the line you can re-invent how the common datastore is created and keep the matching portions the same.

This framework for pushing threat indicator processing into an ETL pipeline supports as many feeds as you want, at an incredible discount to what many Threat Intelligence Platforms will charge you. Give it a try!