Introducing EvidenceForge: Synthetic security logs that don’t look (as) fake

Security teams need high-quality, labeled datasets to train threat hunters and incident responders, validate detection logic, and develop robust analytic models.
EvidenceForge helps teams overcome the limitations of anonymized or stale public datasets, while avoiding the cost and complexity of setting up real infrastructure and performing manual attack simulations to create their own.
The tool combines burst-aware timing models with role assignments for users and systems to generate malicious activity, background noise, and “red herrings” that behave like real telemetry.
The tool generates correlated logs across 20+ Windows, Linux, and network monitoring formats using a canonical event model that ensures causal and temporal consistency.

Good data is hard to find... and to create

A lot of important work in security depends on having realistic log data to work with, and a lot of that work gets blocked, watered down, or quietly skipped because the data just isn’t available. The use cases come up constantly: teaching threat hunters, incident responders, and detection engineers with datasets that have known ground truth; validating that a detection fires on the right activity without drowning in false positives; and training ML models that need labeled, balanced, multi-source telemetry at scale.

These are different problems with the same root cause. You need realistic, labeled security logs and you can’t get them easily. The options are limited:

Real production telemetry is a compliance problem. Public datasets are often so heavily anonymized they no longer resemble the original log sources. The LANL dataset and OpTC are well-known examples of data scrubbed to the point of being generic event representations rather than actual telemetry. What isn’t anonymized is stale, narrow, and over-recycled.
You can generate data yourself using attack simulation frameworks like Atomic Red Team or MITRE Caldera, but that requires real infrastructure, is time-consuming to operate, and scales poorly when you need variety.
You can hire a red team, which trades complexity for money but still takes weeks and produces only the specific scenario they ran.

Synthetic generators seem like an obvious solution and many existing ones are genuinely useful tools, but they share a common architectural limitation: They generate events independently, one format at a time, with no shared state across log sources. The result is datasets where events don’t tell a coherent story. For example, a process in Sysmon doesn’t connect to the same process in standard Windows logs, or a network logon doesn’t leave a consistent connection trace. More capable tools support attack chains and MITRE ATT&CK mapping, but even then, they generate individual events rather than simulating something that happened, with all the prerequisite and consequent evidence that real activity would produce. Realistic background noise is largely absent.

What analysts detect when they call data synthetic is the absence of a coherent causal story. The logs don’t line up because the generator emits each entry independently of the others instead of modeling a series of connected events.

The answer: A new kind of synthetic data

EvidenceForge is a new open-source project from Cisco Talos that approaches the problem differently. It features a single canonical event model, causal ordering, realistic background noise, and AI-assisted scenario authoring. The result is a synchronized dataset across 20+ log formats (Windows, Linux, network, and endpoint detection and response [EDR] telemetry), complete with ground truth documentation and an analyst briefing.

One honest note: No purely synthetic dataset will fool a seasoned analyst in every case, but that’s okay. The goal is fidelity that’s good enough to be useful, not something that’s indistinguishable from production.

The core idea: One event, many formats

Most synthetic log generators are a collection of independent emitters. Each one knows how to produce its own format but doesn’t share state with the others. You can see the seams the moment you cross-reference across sources.

EvidenceForge inverts that. Every piece of evidence flows from a single canonical SecurityEvent object. That object carries a timestamp and event type, plus over 30 composable context objects populated as needed: ProcessContext (PID, parent PID, image, command line), NetworkContext (src/dst IP and port, Zeek UID, shared across Zeek, EDR, and SNORT®), AuthContext (username, LogonID, logon type, result), DnsContext and HttpContext (protocol-layer detail that fans out into the corresponding Zeek log types), and many more. Emitters read only the fields relevant to their format.

The consequence of shared contexts is that emitters cannot disagree. There is one PID, one LogonID, one timestamp, and one Zeek UID. The engine is also OS-aware: Windows hosts produce Security Events and Sysmon while Linux hosts produce syslog and bash history, according to the OS assigned to that host in the scenario.
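The shared-context idea can be illustrated with a minimal Python sketch. This is not EvidenceForge's actual code: the class layouts, the two emitter functions, and the sample values are assumptions made for illustration, and only a handful of fields are shown. The output keys borrow field names from Windows Security event 4688 and Sysmon event 1.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProcessContext:
    pid: int
    parent_pid: int
    image: str
    command_line: str


@dataclass(frozen=True)
class AuthContext:
    username: str
    logon_id: str
    logon_type: int


@dataclass(frozen=True)
class SecurityEvent:
    """Hypothetical canonical event: one source of truth for every emitter."""
    timestamp: datetime
    event_type: str
    host: str
    process: Optional[ProcessContext] = None
    auth: Optional[AuthContext] = None


def emit_windows_4688(ev: SecurityEvent) -> dict:
    """Render the event as a simplified Windows Security process-creation record."""
    return {
        "EventID": 4688,
        "TimeCreated": ev.timestamp.isoformat(),
        "Computer": ev.host,
        "NewProcessId": hex(ev.process.pid),  # this log source reports the PID in hex
        "NewProcessName": ev.process.image,
        "SubjectLogonId": ev.auth.logon_id,
    }


def emit_sysmon_1(ev: SecurityEvent) -> dict:
    """Render the same event as a simplified Sysmon process-creation record."""
    return {
        "EventID": 1,
        "UtcTime": ev.timestamp.isoformat(),
        "Computer": ev.host,
        "ProcessId": ev.process.pid,
        "Image": ev.process.image,
        "CommandLine": ev.process.command_line,
        "LogonId": ev.auth.logon_id,
    }


# Made-up sample values, for illustration only.
event = SecurityEvent(
    timestamp=datetime(2026, 5, 27, 9, 15, 2, tzinfo=timezone.utc),
    event_type="process_create",
    host="WS-FIN-01",
    process=ProcessContext(
        pid=4312,
        parent_pid=1088,
        image=r"C:\Windows\System32\cmd.exe",
        command_line="cmd.exe /c whoami",
    ),
    auth=AuthContext(username="jdoe", logon_id="0x3e7a1", logon_type=2),
)

security_record = emit_windows_4688(event)
sysmon_record = emit_sysmon_1(event)
```

Because both records are rendered from the same frozen object, an analyst pivoting from the Sysmon record to the Security record on PID or LogonID will always land on a match.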

All of this is driven by a scenario configuration file: a YAML document describing the environment (hosts, users, network topology) and an optional attack storyline. The engine reads that file and produces the correlated dataset.

What the engine produces

From a single scenario, EvidenceForge generates several correlated log formats:

Windows Security Events (30 event IDs covering authentication, process lifecycle, Kerberos, persistence, account management, and more)
Sysmon (10 event IDs)
EDR/XDR telemetry
Linux syslog
bash history
Zeek logs in JSON format
Snort IDS alerts
Firewall logs
Web server access logs
Forward HTTP proxy logs

The exact set of output logs depends on the components in the simulated environment and on which log sources you have chosen to disable.

Every attack scenario also produces two companion documents.

“ENVIRONMENT.md” is an analyst briefing consisting of organizational context, network layout, user roles, naming conventions — everything an analyst would need before diving into the logs, with zero information about the attack itself.
“GROUND_TRUTH.md” documents exactly what happened including a narrative, a timeline, and key IOCs.

Causality, not just sequence

Real logs are both temporally and causally ordered. Before a domain logon, there’s a Kerberos TGT request, then a TGS request for the service ticket. Before a TCP connection to a hostname, there’s a DNS query. This is the physics of how the protocols work.

EvidenceForge ships with a composable rule engine that auto-generates prerequisite events with realistic timing offsets so that each event sits exactly where an analyst would expect to pivot to it:

A logon in the scenario expands to the Kerberos exchange that made it possible.
A connection to a named host gets the DNS resolution inserted beforehand.
A privileged admin command generates downstream audit events.
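The expansions listed above can be sketched as a small rule table. This is a simplified illustration, not the project's rule engine: the `Event` type, the rule names, and the millisecond offsets are all assumptions chosen for the example.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    kind: str
    detail: str


def expand_logon(ev: Event, rng: random.Random) -> list:
    """A domain logon implies a TGT request, then a TGS request, shortly before it."""
    tgs_gap = timedelta(milliseconds=rng.randint(5, 40))
    tgt_gap = tgs_gap + timedelta(milliseconds=rng.randint(5, 40))  # always earlier than the TGS
    return [
        Event(ev.timestamp - tgt_gap, "kerberos_tgt", ev.detail),
        Event(ev.timestamp - tgs_gap, "kerberos_tgs", ev.detail),
    ]


def expand_connection(ev: Event, rng: random.Random) -> list:
    """A connection to a named host implies a DNS query shortly before it."""
    gap = timedelta(milliseconds=rng.randint(2, 60))
    return [Event(ev.timestamp - gap, "dns_query", ev.detail)]


# Hypothetical rule table: event kind -> function producing prerequisite events.
RULES = {"domain_logon": expand_logon, "connection": expand_connection}


def expand(events: list, seed: int = 0) -> list:
    """Insert prerequisites for every event and return one time-ordered timeline."""
    rng = random.Random(seed)
    out = []
    for ev in events:
        rule = RULES.get(ev.kind)
        if rule is not None:
            out.extend(rule(ev, rng))
        out.append(ev)
    return sorted(out, key=lambda e: e.timestamp)


base = datetime(2026, 5, 27, 9, 0, 0)
timeline = expand([
    Event(base, "domain_logon", "jdoe@WS-FIN-01"),
    Event(base + timedelta(minutes=5), "connection", "files.corp.example"),
])
for e in timeline:
    print(e.timestamp.isoformat(timespec="milliseconds"), e.kind, e.detail)
```

The scenario author only writes the logon and the connection. The prerequisites are derived, so they cannot be forgotten or placed out of order.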

Network visibility is a first-class concept

Most synthetic generators see too much: every connection gets a log, regardless of whether a sensor would have seen it. Real networks don’t work that way. Traffic between hosts on the same VLAN may never cross a SPAN port. East-west traffic in a segmented network may be invisible to perimeter sensors. A TAP at the internet edge sees outbound traffic but nothing internal.

EvidenceForge lets you declare sensor placement in the scenario: SPAN or TAP, monitored segments, and direction. The engine determines which connections each sensor could realistically observe and only emits network logs where they’d actually appear. If your environment has a monitoring gap, the generated data has that same gap, which is exactly the kind of thing analysts need to learn to reason about.
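A rough sketch of that visibility decision is shown here. It assumes a deliberately flat model in which a sensor watches the links between named segments; the segment names, addresses, and the `Sensor` type are hypothetical, and direction filtering is left out for brevity.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical network segments for the example.
SEGMENTS = {
    "workstations": ip_network("10.0.1.0/24"),
    "servers": ip_network("10.0.2.0/24"),
}


def segment_of(ip: str) -> str:
    """Map an address to its segment; anything unknown counts as the internet."""
    addr = ip_address(ip)
    for name, net in SEGMENTS.items():
        if addr in net:
            return name
    return "internet"


@dataclass(frozen=True)
class Sensor:
    name: str
    # Each member is a frozenset naming the two segments joined by a monitored link.
    links: frozenset


def visible_to(sensor: Sensor, src_ip: str, dst_ip: str) -> bool:
    """A connection is seen only if it crosses one of the sensor's monitored links."""
    path = frozenset({segment_of(src_ip), segment_of(dst_ip)})
    # Same-segment traffic yields a one-element set, which never matches a link.
    return path in sensor.links


edge_tap = Sensor(
    name="edge-tap",
    links=frozenset({
        frozenset({"workstations", "internet"}),
        frozenset({"servers", "internet"}),
    }),
)

connections = [
    ("10.0.1.5", "93.184.216.34"),  # workstation to internet
    ("10.0.1.5", "10.0.2.10"),      # workstation to server (east-west)
    ("10.0.1.5", "10.0.1.6"),       # same segment
]
logged = [c for c in connections if visible_to(edge_tap, *c)]
```

With only an edge TAP declared, the east-west and same-segment connections still happen in the simulated environment and still leave host logs, but they produce no network sensor records.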

AI co-develops the story; a script generates the evidence

The hard part of realistic synthetic data is scenario design, not generation. Describing a coherent attack lifecycle with the right tactics, techniques, and procedures (TTPs); realistic sequencing; and plausible actor behavior requires research and protocol knowledge most people don’t carry in their heads.

EvidenceForge addresses this with Claude/Codex skills. You bring intent (an attack type, an environment, a training objective), the AI brings research and technical scaffolding (a guided interview, MITRE ATT&CK TTP research), and together you collaboratively develop the attack narrative, resulting in a validated YAML scenario file.

The YAML is version-controllable, shareable, and editable. Once it exists, generation is entirely deterministic: a Python script reads the config and produces all the correlated log evidence.
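One common way to get run-to-run determinism is to route every random choice through a single generator seeded from the scenario. The sketch below illustrates that pattern only. The `seed` key, the scenario layout, and the generated fields are assumptions, not EvidenceForge's schema.

```python
import hashlib
import json
import random


def generate_events(scenario: dict, count: int = 5) -> list:
    """Produce pseudo-random events using only a scenario-seeded generator."""
    rng = random.Random(scenario["seed"])  # no global RNG state, no wall-clock input
    hosts = scenario["hosts"]
    return [
        {
            "host": rng.choice(hosts),
            "pid": rng.randrange(1000, 30000, 4),
            "offset_s": round(rng.uniform(0, 3600), 3),
        }
        for _ in range(count)
    ]


def digest(events: list) -> str:
    """Fingerprint a dataset so two runs can be compared cheaply."""
    payload = json.dumps(events, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


scenario = {"seed": 1337, "hosts": ["WS-FIN-01", "SRV-DB-02"]}
first = generate_events(scenario)
second = generate_events(scenario)
print(digest(first) == digest(second))
```

Comparing digests is a convenient regression check: if a code change alters the output for an unchanged scenario, the fingerprint changes with it.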

This separation plays to what each technology is good at. AI excels at narrative coherence, TTP research, and protocol knowledge. A deterministic script excels at the thousands of cross-referenced field values, causal prerequisite chains, and inter-format consistency checks that make up a realistic dataset. That bookkeeping would overwhelm even a capable LLM at scale, and hallucinated field values or subtle inconsistencies would undermine the whole point.

A typical scenario costs pennies in API calls to co-develop, and the data generates in seconds or minutes rather than the hours or days an LLM-based approach would require. EvidenceForge also produces identical output every run because randomness is seeded. Built-in validation checks the scenario for schema correctness and cross-reference integrity before generation runs, and the AI can automatically fix most errors it finds.
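To give a feel for what a cross-reference integrity check might involve, here is a minimal sketch. The scenario layout (`hosts`, `users`, `storyline`) is a hypothetical stand-in expressed as a Python dict, not the project's real YAML schema or validator.

```python
def validate_scenario(scenario: dict) -> list:
    """Return a list of human-readable errors; an empty list means the checks passed."""
    errors = []
    hosts = {h["name"] for h in scenario.get("hosts", [])}
    users = {u["name"] for u in scenario.get("users", [])}

    # Every storyline step must point at a host and a user that actually exist.
    for i, step in enumerate(scenario.get("storyline", [])):
        if step.get("host") not in hosts:
            errors.append(f"storyline[{i}]: unknown host {step.get('host')!r}")
        if step.get("user") not in users:
            errors.append(f"storyline[{i}]: unknown user {step.get('user')!r}")
    return errors


# Made-up scenario with one deliberate mistake in the second step.
scenario = {
    "hosts": [{"name": "WS-FIN-01", "os": "windows"}, {"name": "SRV-DB-02", "os": "linux"}],
    "users": [{"name": "jdoe", "role": "accountant"}],
    "storyline": [
        {"host": "WS-FIN-01", "user": "jdoe", "action": "phishing_click"},
        {"host": "SRV-DB-03", "user": "jdoe", "action": "lateral_movement"},
    ],
}

problems = validate_scenario(scenario)
for p in problems:
    print(p)
```

Errors phrased this specifically are easy to hand back to the authoring assistant for repair before any logs are generated.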

Making the background convincing

Attack events are only useful if analysts have to work to find them. Noise quality matters as much as signal quality.

EvidenceForge’s baseline engine generates several types of realistic background noise, including:

Legitimate lateral movement patterns (backup agents, monitoring tools, AD replication, application-to-database traffic)
User and application-driven network activity (web browsing, SMB file share access, RDP sessions, scheduled service polling)
Per-user diversified command pools, depending on user role
Red herrings (suspicious-looking events or patterns that are benign)

Timing is just as important as content. Volume-level realism without burst-level texture still looks synthetic. EvidenceForge uses three complementary timing models:

A Hawkes process for user activity, a self-exciting model where each event makes the next more likely for a short window, then decays, matching how people actually work in bursts
A periodic envelope for large-scale structure (Monday login storms, Friday drop-off, and near-zero weekends)
Periodic intervals plus jitter for modeling recurring automated events like scheduled tasks, background updates, and other system and service traffic
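The three models above can be sketched in a few lines of Python. This is an illustration of the general techniques, not EvidenceForge's implementation: the Hawkes simulation uses an exponential kernel with a standard thinning approach, and every parameter value and weekday weight is made up for the example.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, horizon, rng):
    """Simulate event times on [0, horizon) with intensity
    lambda(t) = mu + sum(alpha * exp(-beta * (t - t_i))) over past events t_i."""
    if mu <= 0 or alpha >= beta:
        raise ValueError("need mu > 0 and alpha < beta so bursts die out")
    events = []
    t = 0.0
    while True:
        # Intensity only decays until the next event, so its current value is an upper bound.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)
        if t >= horizon:
            return events
        lam_t = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        # Thinning step: accept the candidate with probability lam_t / lam_bar.
        if rng.random() * lam_bar <= lam_t:
            events.append(t)


# Hypothetical weekly envelope, Monday..Sunday, used to scale the base rate mu.
WEEKDAY_WEIGHT = [1.3, 1.0, 1.0, 1.0, 0.7, 0.05, 0.05]


def envelope(weekday: int) -> float:
    """Multiplier for the base activity rate on a given weekday (0 = Monday)."""
    return WEEKDAY_WEIGHT[weekday]


def periodic_with_jitter(period, jitter, horizon, rng):
    """Nominal schedule every `period` seconds, each run shifted by up to +/- jitter."""
    times = []
    k = 1
    while k * period < horizon:
        times.append(k * period + rng.uniform(-jitter, jitter))
        k += 1
    return times


rng = random.Random(42)
monday_rate = 0.02 * envelope(0)  # busier base rate on Mondays
user_events = simulate_hawkes(mu=monday_rate, alpha=0.6, beta=1.0, horizon=3600.0, rng=rng)
task_events = periodic_with_jitter(period=300.0, jitter=5.0, horizon=3600.0, rng=rng)
```

The ratio `alpha / beta` controls how bursty the user activity is: close to zero it behaves like a plain Poisson process, and close to one each event triggers long cascades of follow-up events.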

Most timing details are exposed in the scenario or engine config files, so you can tweak them to make them as realistic as you like for your simulated environment.

Getting started

EvidenceForge is available on GitHub. Clone the repo and follow the install instructions in the README.

The core experience is a guided conversation. Start the /eforge:scenario command and describe what you want. You can be as specific or as vague as you like. Bring a fully formed scenario and the AI helps translate it into a valid configuration; bring a rough idea and it asks the right questions, fills in the gaps, and makes suggestions until you have something technically coherent and satisfyingly realistic. From there, the skill leads you through validation, generation, and a brief automated data quality evaluation. You come out the other end with a complete, correlated dataset and companion documents. A full CLI is also available for scripted workflows.

What will you build?

EvidenceForge removes the data bottleneck. The question becomes what you do with that. The following are just a few examples:

Build a SOC analyst training program with scenarios tailored to your environment.
Test detections against controlled, labeled datasets before they go near production. See whether they fire on the attack and how they behave against realistic noise.
Generate the labeled training data your ML model needs.
Stress-test a new SIEM or detection pipeline against volume and variety you control.
Create repeatable practice exercises that can be regenerated on demand after tuning.

The scenarios themselves are shareable artifacts. A scenario developed for one team can be shared, adapted, or built on by others. The right mental model is high-fidelity training and testing data — not a production telemetry substitute — but within that framing, the use cases are broad.

from Cisco Talos Blog https://ift.tt/PVBXh1I

Wednesday, May 27, 2026
