The problem with vendor-specific parsing

Every DFIR case reaches the same friction point. The Windows side is tractable — EVTX, UAL, prefetch, registry, all well-defined formats. The Linux side is a bit messier but still bounded — auth.log, secure, wtmp, audit. And then comes the rest: Palo Alto GlobalProtect, Cisco AnyConnect, Fortinet SSL-VPN, Checkpoint, OpenVPN, Squid, Cloudflare Access, ZScaler, every hardware firewall, every cloud VPN. Each one has its own log format, its own conventions, its own quirks, and writing a dedicated parser per vendor is a losing battle.

Masstin’s answer is parse-custom: a new action that parses any text log using YAML rule files. One file per vendor format. A library of pre-built rules that grows over time. Inside each file, a list of sub-parsers handles the different line types the same product emits. The output is the same 14-column CSV timeline masstin uses everywhere else, so a carved EVTX chunk from a desktop, a Linux SSH brute-force, and a GlobalProtect VPN login appear side by side, ready for graph visualisation and temporal path reconstruction.

This post walks through the design, the schema, and the first rule that ships with the library: a fully researched Palo Alto GlobalProtect parser built from the official Palo Alto Networks documentation and validated against real sample log lines.


Design decisions

The schema is intentionally boring: flat YAML, four blocks per parser, string substitution for the output mapping. We explicitly avoided:

  • Embedding a scripting language (Lua, Python). Maximum flexibility, but it breaks the premise of “easy for users”. If you need code, the right move is probably to contribute a native parser to masstin instead.
  • Grok / Logstash patterns. Elegant, but they add a learning curve on top of plain regex. Everyone who’s touched a Sigma rule already understands YAML + substrings + regex.
  • A 1:1 column mapping like column_3 = source_ip. Too limited — real logs have 4-6 different line types per product, each with its own shape. We need multiple sub-parsers per file.

What we kept:

  • One file per vendor+format combination. palo-alto-globalprotect.yaml covers the legacy SYSTEM log format. A separate file will cover the PAN-OS 9.1+ dedicated globalprotect log type when it ships. Mixing two formats in one file is a trap.
  • First match wins. Inside a file, parsers are tried in order. The first one that claims a line produces exactly one record and moves on. Cheap, predictable, easy to reason about.
  • Rejected lines are first-class citizens. Any line that no parser matches goes to a rejected log. --dry-run shows you the first few so you know what your rule is missing. --debug preserves a sample alongside the output CSV for post-mortem.
  • Four extractors cover the real world. CSV for tabular logs (Palo Alto, many cloud exports). Keyvalue for key=value logs (Fortinet, CEF-lite formats). Regex for free-form prose (OpenVPN, legacy syslog). JSON is planned for v2.

The schema in one glance

meta:
  vendor: "Palo Alto Networks"
  product: "GlobalProtect (VPN)"
  reference_url: "https://docs.paloaltonetworks.com/..."

prefilter:           # optional fast path before per-parser matching
  contains_any: ["globalprotectgateway-", "globalprotectportal-"]

parsers:
  - name: "gp-gateway-auth-succ"
    match:
      contains: ["globalprotectgateway-auth-succ"]
    extract:
      type: csv
      delimiter: ","
      quote: '"'
      fields_by_index:
        6: generated_time
        9: gateway_name
        14: description
    sub_extract:
      field: description
      strip_before: ". "
      type: keyvalue
      pair_separator: ","
      kv_separator: ":"
      trim: true
    map:
      time_created:       "${generated_time}"
      computer:           "${gateway_name}"
      event_type:         "SUCCESSFUL_LOGON"
      event_id:           "GP-GW-AUTH-SUCC"
      subject_user_name:  "${User name}"
      workstation_name:   "${Login from}"
      ip_address:         "${Login from}"
      logon_type:         "VPN"
      filename:           "${__source_file}"
      detail:             "GlobalProtect gateway auth OK | user=${User name} from=${Login from} auth=${Auth type}"

Four building blocks per parser:

  • match — which lines this parser claims. Combine contains, contains_any and regex.
  • extract — how to pull fields out of the matched line. Pick one of csv, regex, keyvalue.
  • sub_extract — optional second-pass extraction on a field extracted above. Essential for nested formats like Palo Alto, where the outer shape is CSV but the interesting user/IP data lives inside one of the outer fields as a narrative sentence followed by Key: value, Key: value.
  • map — fill the 14 columns of masstin’s LogData using ${variable} substitution. Anything unknown becomes empty. Any text can be embedded in any field.

That’s it. Everything else (prefilter, strip_before, the special ${__source_file} / ${__line_number} variables) is convenience sugar on top of those four blocks.
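The substitution semantics are simple enough to sketch in a few lines. This is illustrative Python, not masstin’s Rust implementation — the only behaviour assumed is what the text above states: ${Name} resolves from the extracted context, and unknown variables become empty strings. The variable names are taken from the map example above:

```python
import re

def substitute(template: str, context: dict) -> str:
    # ${Name} -> value from the extracted context; unknown names become ""
    return re.sub(r'\$\{([^}]+)\}',
                  lambda m: context.get(m.group(1), ''),
                  template)

ctx = {'User name': 'user3', 'Login from': '216.113.183.230'}
print(substitute('user=${User name} from=${Login from} auth=${Auth type}', ctx))
# auth= ends up empty because "Auth type" is missing from this context
```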

The full schema reference is in docs/custom-parsers.md in the repo.


A walk through the Palo Alto GlobalProtect rule

Palo Alto’s GlobalProtect VPN is a natural first target: it’s widely deployed, the log format is documented, and there are public sample logs I could validate against. There are actually two formats: the legacy SYSTEM log (used by most deployments with classic syslog forwarding) and a new dedicated globalprotect log type introduced in PAN-OS 9.1 with 49+ separate CSV columns. The v1 rule covers the legacy format, because that’s what the vast majority of real deployments still produce. The dedicated log type rule will ship as a separate file when I have confirmed sample lines to test against.

The legacy format

A GlobalProtect login event in the SYSTEM log looks like this (real sample from the Palo Alto Splunk data generator):

1,2016/02/24 21:45:08,007200001165,SYSTEM,globalprotect,0,2016/02/24 21:40:52,,globalprotectgateway-auth-succ,VPN-GW-N,0,0,general,informational,"GlobalProtect gateway user authentication succeeded. Login from: 216.113.183.230, User name: user3, Auth type: profile, Client OS version: Microsoft Windows Server 2008 R2 Enterprise",641953,0x8000000000000000,0,0,0,0,,PA-VM

The outer shape is CSV with field 14 double-quoted. The index-to-field mapping, from the official syslog field description page:

Index  Field
0      FUTURE_USE (usually “1”)
1      Receive Time
2      Serial Number
3      Type (SYSTEM)
4      Subtype (globalprotect)
6      Generated Time (canonical timestamp)
8      Event ID (globalprotectgateway-auth-succ, -auth-fail, -logout-succ, -regist-succ, portal-auth-*)
9      Object name (gateway or portal)
14     Description (quoted, contains user/IP data as inner key-value)
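The index mapping is easy to sanity-check against the sample line above. This is an illustrative Python sketch — masstin’s extractor is Rust — but the stdlib csv module handles the double-quoted description field the same way, so it confirms the indices:

```python
import csv
import io

# The sample SYSTEM-log line from the Palo Alto Splunk data generator (as above)
line = ('1,2016/02/24 21:45:08,007200001165,SYSTEM,globalprotect,0,'
        '2016/02/24 21:40:52,,globalprotectgateway-auth-succ,VPN-GW-N,0,0,'
        'general,informational,"GlobalProtect gateway user authentication '
        'succeeded. Login from: 216.113.183.230, User name: user3, Auth type: '
        'profile, Client OS version: Microsoft Windows Server 2008 R2 '
        'Enterprise",641953,0x8000000000000000,0,0,0,0,,PA-VM')

# csv transparently unwraps the quoted field, commas inside it and all
row = next(csv.reader(io.StringIO(line)))

print(row[6])        # generated time (canonical timestamp)
print(row[8])        # event ID
print(row[9])        # object (gateway) name
print(row[14][:52])  # start of the quoted description
```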

Notice field 14. It’s a CSV field in its own right, but the user, IP, auth type and OS live inside it, as an English sentence followed by Key: value, Key: value:

GlobalProtect gateway user authentication succeeded. Login from: 216.113.183.230, User name: user3, Auth type: profile, Client OS version: Microsoft Windows Server 2008 R2 Enterprise

This is exactly the kind of nested format sub_extract was designed for.

Handling the nested description

First we run a CSV extract that pulls the description out as a single string. Then we run a keyvalue sub-extract on that string — but not before stripping the leading prose. Without the strip, the keyvalue splitter would see the first comma and treat the entire sentence up to that comma as one giant “key”:

KEY: "GlobalProtect gateway user authentication succeeded. Login from"
VAL: "216.113.183.230"

…and ${Login from} substitution would silently return nothing.

The fix is strip_before: ". " — drop everything up to and including the first ". " in the field. After stripping, the keyvalue input is clean:

Login from: 216.113.183.230, User name: user3, Auth type: profile, Client OS version: Microsoft Windows Server 2008 R2 Enterprise

and the keyvalue extractor produces Login from, User name, Auth type, Client OS version as context variables, ready for ${Login from} and friends in the map.
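The whole second pass boils down to one strip, one split on the pair separator, and one split on the first kv separator. A minimal Python sketch of that logic (illustrative only — masstin implements this in Rust, and the real extractor is driven by the strip_before / pair_separator / kv_separator / trim options from the YAML):

```python
# Description field as pulled out of outer CSV index 14 (sample above)
desc = ('GlobalProtect gateway user authentication succeeded. '
        'Login from: 216.113.183.230, User name: user3, Auth type: profile, '
        'Client OS version: Microsoft Windows Server 2008 R2 Enterprise')

# strip_before ". " -- drop everything up to and including the first ". "
_, _, stripped = desc.partition('. ')

# keyvalue: pair_separator "," / kv_separator ":" / trim: true
context = {}
for pair in stripped.split(','):
    key, sep, value = pair.partition(':')
    if sep:  # ignore fragments with no kv separator
        context[key.strip()] = value.strip()

print(context['Login from'])  # 216.113.183.230
print(context['User name'])   # user3
```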

The five sub-parsers

The v1 rule has five parsers covering the events relevant to lateral movement tracking:

  1. gp-gateway-auth-succ — successful gateway authentication → SUCCESSFUL_LOGON
  2. gp-gateway-regist-succ — session fully established → SUCCESSFUL_LOGON (a variant flagged with its own event_id)
  3. gp-auth-fail — gateway or portal authentication failure → FAILED_LOGON
  4. gp-gateway-logout — gateway logout → LOGOFF
  5. gp-portal-auth-succ — portal auth OK (pre-gateway, informational) → SUCCESSFUL_LOGON with event_id=GP-PORTAL-AUTH-SUCC

Events that are NOT logons (configuration push, agent messages, config release) intentionally fall through to the rejected log. Masstin is a lateral movement tracker, not a generic log aggregator.

Validation against real samples

The rule was validated with --dry-run against 7 sample lines taken verbatim from the Palo Alto Splunk data generator — 4 matched (the logon events), 3 were correctly rejected (config push / agent message / config release):

[2/3] Processing 1 log file(s)...
    lines=7 matched=4 rejected=3

Custom parser summary:
  Lines read:    7
  Matched:       4 (57.1%)
  Rejected:      3
  Hits per parser:
         1 gp-gateway-regist-succ
         1 gp-gateway-logout
         1 gp-gateway-auth-succ
         1 gp-auth-fail

First matched records:
  2016/02/24 22:01:41 | LOGOFF            | user=user3         | src=              | dst=VPN-GW-N    | detail=GlobalProtect gateway logout | user=user3 reason=client logout.
  2016/02/24 21:40:52 | SUCCESSFUL_LOGON  | user=user3         | src=216.113.183.230 | dst=VPN-GW-N  | detail=GlobalProtect gateway auth OK | user=user3 from=216.113.183.230 auth=profile
  2016/02/24 21:40:28 | FAILED_LOGON      | user=Administrator | src=60.28.233.48  | dst=GP-Portal-1 | detail=GlobalProtect auth FAIL | user=Administrator from=60.28.233.48 reason=Authentication failed: Invalid username or password
  2016/02/24 22:41:24 | SUCCESSFUL_LOGON  | user=user1         | src=64.147.162.160 | dst=VPN-GW-N  | detail=GlobalProtect gateway register (session up) | user=user1 from=64.147.162.160 os=Microsoft Windows Server 2008 R2 Enterprise Edition Service Pack 1

Source IP, username, authentication type, OS version — all populated correctly for every logon event. The four matched records land in the same 14-column CSV and are ready for load-memgraph or load-neo4j like any other masstin source.


The rule library

The initial rule library ships with 8 complete rules and 31 sub-parsers covering the most common VPN, firewall and proxy products. Every rule was researched against the vendor’s official log format documentation and validated against realistic sample log lines committed alongside each rule in <category>/samples/.

Category  Rule                              Parsers  Format
VPN       vpn/palo-alto-globalprotect.yaml  5        SYSTEM log subtype=globalprotect (legacy CSV syslog)
VPN       vpn/cisco-anyconnect.yaml         4        %ASA-6-113039 / 722022 / 722023 / %ASA-4-113019
VPN       vpn/fortinet-ssl-vpn.yaml         3        type=event subtype=vpn action=tunnel-up/down/ssl-login-fail
VPN       vpn/openvpn.yaml                  4        Free-form syslog (Peer Connection Initiated, AUTH_FAILED, SIGTERM)
Firewall  firewall/palo-alto-traffic.yaml   2        PAN-OS TRAFFIC CSV — authenticated sessions via User-ID
Firewall  firewall/cisco-asa.yaml           6        AAA auth (113004/5), login permit/deny (605004/5), WebVPN (716001/2)
Firewall  firewall/fortinet-fortigate.yaml  4        type=event subtype=system|user admin login, user auth
Proxy     proxy/squid.yaml                  3        access.log native — CONNECT tunnel, HTTP, TCP_DENIED

Running the entire library against all sample files in one shot produces:

Loaded 8 rule file(s), 31 parsers total
Lines read:    46
Matched:       38 (82.6%)
Rejected:      8   ← all intentionally rejected (config-release, TLS handshake packets,
                     system health logs, unauthenticated DNS, anonymous proxy requests)

A few design highlights from the stub-to-rule process:

  • Cisco split into two files — cisco-anyconnect.yaml covers the VPN session lifecycle (parent session start, SVC connect/disconnect, session disconnect with duration). cisco-asa.yaml covers the generic firewall path: AAA authentication, management login permit/deny, WebVPN portal session. Same syslog stream, different purpose.
  • Palo Alto TRAFFIC filters on User-ID — TRAFFIC logs are extremely high-volume, but the lateral movement signal is only in sessions where the firewall could resolve a domain user via User-ID. The rule uses a positional regex ([^,]+ at comma index 12) to require a non-empty srcuser before the parser even touches the line, so raw internet traffic and DNS/NTP sessions are dropped cheaply at the match stage.
  • Squid uses positive-match regexes instead of negative look-ahead — Rust’s regex crate is linear-time and doesn’t support (?!...), so instead of “user is not -”, the rules say “user starts with an alphanumeric character” ([A-Za-z0-9][^\s]*) — functionally equivalent for the real log format.
  • FortiGate admin events don’t have action=login — they have logdesc="Admin login successful". Discovered during validation: the first version of the rule matched zero lines because it assumed a naming convention that only holds for the VPN subtype. The fix highlights the value of the dry-run validation loop.
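The Squid look-ahead point is easy to demonstrate. Python’s re engine does support (?!...), which makes it convenient for showing that the positive pattern agrees with the negative one on the inputs that matter — even though only the positive form compiles under Rust’s linear-time regex crate. Illustrative sketch, not the shipped rule file:

```python
import re

# Negative look-ahead form ("user is not '-'") -- NOT supported by Rust's regex crate
neg = re.compile(r'(?!-)\S+')

# Positive replacement used in the Squid rules: first char is alphanumeric.
# Not identical in general (e.g. a user starting with "_" differs), but
# equivalent for the usernames Squid actually emits.
pos = re.compile(r'[A-Za-z0-9][^\s]*')

for user in ['jdoe', 'user3', 'svc_proxy', '-']:
    print(user, bool(pos.fullmatch(user)))
```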

The contribution model is the same as Sigma rules: collect sample lines, write the YAML, validate with --dry-run, open a PR adding a new file plus a row in the references table. Full guide in rules/README.md.


Using it

# Single rule file
masstin -a parse-custom --rules rules/vpn/palo-alto-globalprotect.yaml -f vpn.log -o timeline.csv

# Whole library — all rules tried against all log files
masstin -a parse-custom --rules rules/ -f vpn.log -f fw.log -o timeline.csv

# Dry-run: show first matches + rejected samples, no CSV
masstin -a parse-custom --rules rules/vpn/palo-alto-globalprotect.yaml -f vpn.log --dry-run

# Debug: preserve a rejected-lines sample alongside the output
masstin -a parse-custom --rules rules/ -f vpn.log -o timeline.csv --debug

Point it at any masstin-compatible output (Neo4j, Memgraph, the CSV merge pipeline) and your VPN events now flow through the same graph as your Windows RDP, Linux SSH and carved EVTX data.


Once you start feeding VPN / firewall / proxy logs into the masstin timeline alongside Windows EVTX and Linux auth logs, the combined output grows fast — and a lot of what grows is noise. Service logons from LOCAL SYSTEM, failed RDP attempts where the source IP was never captured, brute force from noisy jumpboxes, machine account (HOST$) network authentications, and so on.

Masstin v0.12.0 ships with four opt-in filtering flags built on top of real-case analysis of 178k-event CSVs:

  • --ignore-local drops records that carry no usable source information. The rule is based on a truth table: a record is kept whenever either src_ip OR src_computer has a real signal (the IP always wins — MSTSC|<real-IP> stays, MSTSC|- is filtered). Catches loopback IPs, LOCAL literals, Windows logon_type 5/2 with empty source, self-reference without IP, and noise placeholders (MSTSC, default_value).
  • --exclude-users <LIST> drops records whose user field matches any glob in the list. Supports exact match, prefix (svc_*), suffix (*$ for machine accounts), contains (*admin*), inline CSV, and @file.txt imports.
  • --exclude-hosts <LIST> same syntax, matches src_computer / dst_computer. Useful for excluding known jumpboxes and monitoring hosts.
  • --exclude-ips <LIST> accepts individual IPs, CIDR ranges (10.0.0.0/8, fe80::/10), and @file.txt. Critical in multi-site cases with dozens of trusted subnets.
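The matching semantics of the two exclusion list styles can be sketched with the Python stdlib. This is an illustrative model of the behaviour described above — masstin’s actual matcher is Rust, and the function names here (user_excluded, ip_excluded) are hypothetical:

```python
import fnmatch
import ipaddress

def user_excluded(user: str, patterns: list) -> bool:
    # Glob semantics as described: exact match, prefix (svc_*),
    # suffix (*$ for machine accounts), contains (*admin*)
    return any(fnmatch.fnmatchcase(user, p) for p in patterns)

def ip_excluded(ip: str, networks: list) -> bool:
    # Individual IPs and CIDR ranges, IPv4 or IPv6; mixed-version
    # comparisons simply return False
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

patterns = ['svc_*', '*$', '*admin*']
networks = [ipaddress.ip_network('10.0.0.0/8'), ipaddress.ip_network('fe80::/10')]

print(user_excluded('svc_backup', patterns))      # prefix glob hit
print(user_excluded('WORKSTATION01$', patterns))  # machine account hit
print(user_excluded('jdoe', patterns))            # kept
print(ip_excluded('10.1.2.3', networks))          # inside 10.0.0.0/8
```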

Combined with --dry-run, you get a pre-flight stats report showing exactly how many records each filter layer would remove, broken down by rule, without writing the output CSV. That lets you validate the filter choice before committing to a long run.

All four flags apply to every parser action (parse-windows, parse-linux, parse-image, parse-custom, parse-elastic, parse-cortex, parse-cortex-evtx-forensics) and to merge — so you can also re-filter an existing CSV without re-parsing the original evidence.

Real measurements against the DefCon DFIR CTF 2018 combined timeline (178k events from FileServer + HRServer + Desktop):

🧹 Filter summary:
   Total records seen: 178,274
   Total kept:         110,070 (61.7%)
   Total filtered:     68,204 (38.3%)

   --ignore-local:     68,204 (38.3%)
      both_noise              67,703
      self_reference             134
      service_logon              306
      interactive_logon           21
      literal_LOCAL               39
      loopback_ip                  1

Full documentation in the README filtering section.

What’s next

  • v2 extractors. JSON with jq-style selectors. Already planned.
  • Conditional map. when: ${action} == "fail" style predicates so a single parser can handle both success and failure line variants of the same event when the format makes that cleaner than two parsers.
  • More rules. With Cisco ASA/AnyConnect, Fortinet, OpenVPN and Squid already shipping in the initial library, Checkpoint, ZScaler and Cloudflare Access are next in the backlog.
  • PAN-OS 9.1+ dedicated globalprotect log type. A second Palo Alto rule covering the 49+ column dedicated format, once I can validate it against real samples.
  • Per-rule validation command. masstin -a parse-custom --validate rule.yaml to catch schema errors without running against a log file.

If you’d like to contribute a rule or a sample of your vendor’s logs, see the guide at rules/README.md in the masstin repo.


References — vendor official documentation used per rule

Every rule in the library was written from the vendor’s primary log format documentation and validated against real sample log lines. These are the sources used during the research pass:

Palo Alto GlobalProtect (vpn/palo-alto-globalprotect.yaml)

Palo Alto TRAFFIC (firewall/palo-alto-traffic.yaml)

Cisco AnyConnect (vpn/cisco-anyconnect.yaml)

Cisco ASA (firewall/cisco-asa.yaml)

Fortinet SSL VPN (vpn/fortinet-ssl-vpn.yaml)

Fortinet FortiGate (firewall/fortinet-fortigate.yaml)

OpenVPN (vpn/openvpn.yaml)

Squid proxy (proxy/squid.yaml)


Topic                                Link
Masstin main page                    masstin
Custom parser schema                 docs/custom-parsers.md
Rules library                        rules/
Rules library references table       rules/README.md#references
CSV format and event classification  CSV format
Graph visualisation                  Memgraph / Neo4j