
Datajson v4 #12707

Open · regit wants to merge 8 commits into master

Conversation

@regit
Contributor

regit commented Mar 2, 2025

Update of #12319


Changes (if applicable):

Link to ticket: https://redmine.openinfosecfoundation.org/issues/7372

Describe changes:

  • Reapply the patch, trying my best to keep a valid history
  • Implement datajson commands in Rust

SV_BRANCH=OISF/suricata-verify#2330

@suricata-qa

ERROR:
ERROR: SEGMENTATION FAULT in build_asan QA test

ERROR: QA failed on build_asan.

Pipeline 24948

@suricata-qa

ERROR:
ERROR: SEGMENTATION FAULT in build_asan QA test

ERROR: QA failed on build_asan.

Pipeline 24949

regit added 6 commits March 2, 2025 22:45
This patch introduces a new keyword, datajson, that is similar
to dataset with a twist: where dataset allows matching against sets,
datajson does the same but also adds JSON data to the alert
event. This data comes from the set definition itself.
For example, an IPv4 set will look like:

  10.16.1.11,{"test": "success","context":3}

The syntax is the value and the JSON data separated by a comma.

The syntax of the keyword is the following:

  datajson:isset,src_ip,type ip,load src.lst,key src_ip;
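For illustration, a complete signature using it might look like this
(hypothetical msg and sid; assuming the keyword attaches to a sticky
buffer such as ip.src, as dataset does):

  alert ip any any -> $HOME_NET any (msg:"IOC match with context"; ip.src; datajson:isset,src_ip,type ip,load src.lst,key src_ip; sid:1; rev:1;)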

Compared to dataset, it just has a supplementary option, key,
that is used to indicate in which subobject the JSON value
should be added.

The information is added to the event under the alert.extra
subobject:

  "alert": {
    "extra": {
      "src_ip": {
        "test": "success",
        "context": 3
      },

The main interest of the feature is to be able to contextualize
a match. For example, if you have an IOC source, you can do

 value1,{"actor":"APT28","Country":"FR"}
 value2,{"actor":"APT32","Country":"NL"}

This way, a single dataset is able to produce context to the
event where it was not possible before and multiple signatures
had to be used.

The format introduced in datajson is an evolution of the
historical datarep format, which has some limitations. For example,
if a user fetches IOCs from a threat intel server, there is a good
chance that the format will be JSON or XML. Suricata has no support
for the latter but can support the former.

This patch implements this concept. An optional json_key option is
added to the datajson keyword. If present, Suricata assumes
that the data file contains a JSON object that is an array.
For each element elt of this array, the value added to the dataset
is elt['$json_key'] and the JSON data is elt itself.
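For illustration, with json_key ip, a data file such as the following
(hypothetical fields):

  [
    {"ip": "10.16.1.11", "actor": "APT28", "Country": "FR"},
    {"ip": "10.16.1.12", "actor": "APT32", "Country": "NL"}
  ]

adds "10.16.1.11" and "10.16.1.12" to the set, each carrying its whole
element as JSON data.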

Keeping the key value may seem redundant, but it is useful to have it
directly accessible in the extra data to be able to query it
independently of the signature (where it can be one of multiple
metadata fields or even a transformed one).

In some cases, when interacting with data (mostly coming from
threat intel servers), the JSON array containing the data
to use is not at the root of the object and it is necessary
to access a subobject.

This patch implements this with support for keys of the form
level1.level2. This is done via the `array_key` option that contains
the path to the data.
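For illustration, with array_key set to response.data, a file such as
the following (hypothetical layout) would be loaded from the array
nested under response.data:

  {
    "response": {
      "data": [
        {"ip": "10.16.1.11", "actor": "APT28"}
      ]
    }
  }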

Ticket: OISF#7372
With the datajson infrastructure in place, it is now possible to
add data to the extra information section. Following an idea
by Jason Ish, this patch adds the feature for PCRE extraction.

A PCRE such as pcre:"/(?P<alert_ua>[a-zA-Z]+)\//" will add the
content of the captured group to alert.extra.ua.
It can contain any vars, so no additional properties are needed.

codecov bot commented Mar 2, 2025

Codecov Report

Attention: Patch coverage is 54.95557% with 659 lines in your changes missing coverage. Please review.

Project coverage is 80.59%. Comparing base (2e52e95) to head (ca12d2b).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12707      +/-   ##
==========================================
- Coverage   80.72%   80.59%   -0.13%     
==========================================
  Files         936      945       +9     
  Lines      259393   260824    +1431     
==========================================
+ Hits       209398   210222     +824     
- Misses      49995    50602     +607     
Flag Coverage Δ
fuzzcorpus 56.47% <3.07%> (-0.50%) ⬇️
livemode 19.26% <1.91%> (-0.16%) ⬇️
pcap 43.84% <2.05%> (-0.39%) ⬇️
suricata-verify 63.46% <54.58%> (-0.07%) ⬇️
unittests 57.90% <2.32%> (-0.32%) ⬇️

Flags with carried forward coverage won't be shown.

@suricata-qa

ERROR:

ERROR: QA failed on SURI_TLPR1_suri_time.

Pipeline 24950

@victorjulien
Member

I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.

The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.

@jufajardini added the decision-required (Waiting on deliberation from the team) label on Mar 4, 2025
@regit
Contributor Author

regit commented Mar 4, 2025

> I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.

Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.

> The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.

The data needs to be a JSON string and the code uses libjansson to verify the syntax. IMO this is enough, or am I missing something?

@victorjulien
Member

> > I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.
>
> Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.

There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.

> > The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.
>
> The data needs to be a JSON string and the code uses libjansson to verify the syntax. IMO this is enough, or am I missing something?

The line format is supposed to be CSV, so raw JSON breaks it. Encoding it as base64 addresses that, plus it allows future extension of the line format.
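For illustration (not a format defined in this PR), the set line from
the first commit message would then become, with the JSON encoded as
base64:

  10.16.1.11,eyJ0ZXN0IjogInN1Y2Nlc3MiLCJjb250ZXh0IjozfQ==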

@regit
Contributor Author

regit commented Mar 4, 2025

> > > I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.
> >
> > Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.
>
> There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.

So you want to do something like the following on all dataset types:

typedef struct StringType {
    uint32_t len;
    union {
        DataRepType rep;
        DataJsonType json;
    };
    uint8_t *ptr;
} StringType;

I don't really like the cost of it in terms of memory, but it should not be that bad.

> > > The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.
> >
> > The data needs to be a JSON string and the code uses libjansson to verify the syntax. IMO this is enough, or am I missing something?
>
> The line format is supposed to be CSV, so raw JSON breaks it. Encoding it as base64 addresses that, plus it allows future extension of the line format.

@victorjulien
Member

> > > > I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.
> > >
> > > Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.
> >
> > There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.
>
> So you want to do something like the following on all dataset types:
>
> typedef struct StringType {
>     uint32_t len;
>     union {
>         DataRepType rep;
>         DataJsonType json;
>     };
>     uint8_t *ptr;
> } StringType;

Could even use a "flex array", then it wouldn't use space if the feature isn't in use:

typedef struct StringType {
    uint32_t len;
    union {
        DataRepType rep;
        DataJsonType json;
    };
    uint8_t *enrich[];
} StringType;
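A minimal sketch of the allocation side of that idea (hypothetical
names; assuming one pointer slot is appended only when an entry
carries enrichment data):

  /* Hypothetical: entries without enrichment allocate no extra space;
   * enrich_blob stands in for the entry's serialized JSON bytes. */
  size_t extra = has_enrich ? sizeof(uint8_t *) : 0;
  StringType *s = calloc(1, sizeof(StringType) + extra);
  if (s != NULL && has_enrich)
      s->enrich[0] = enrich_blob;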

        break;
    case DATASET_TYPE_STRING:
        set->hash = THashInit(cnf_name, sizeof(StringTypeJson), StringJsonSet, StringJsonFree,
                StringJsonHash, StringJsonCompare, NULL, NULL, load != NULL ? 1 : 0,
Member

It seems like we might be missing a length getter here, which is used to calculate memuse in the THash API. This is needed for the string type as we do not have a definite size for it. wdyt?
See: https://redmine.openinfosecfoundation.org/issues/3910
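A minimal sketch of such a getter (hypothetical name; assuming the
callback receives the entry data and returns its size in bytes, and
that StringTypeJson stores its payload length in len):

  /* Hypothetical memuse getter: report the entry's true size, i.e. the
   * struct itself plus the variable-length string payload it owns. */
  static uint32_t StringJsonGetLength(void *data)
  {
      const StringTypeJson *s = data;
      return (uint32_t)sizeof(StringTypeJson) + s->len;
  }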

@simeonmiteff

simeonmiteff commented Mar 6, 2025

Hi! 👋🏻

@jlucovsky sent me here! 😄

Some comments from the peanut gallery:

To me, the obvious use case is collections of IP-only matching rules that look for the same thing; e.g., ET's Tor stuff could collapse ~2K rules into a handful.

Next up are the sets of IOC rules with common metadata, like this example from ET/Open:

alert dns $HOME_NET any -> any any (msg:"ET MALWARE DNS Query for Operation Cleaver Domain"; dns.query; content:"doosan-job.com"; depth:14; fast_pattern; endswith; nocase; reference:url,www0.cylance.com/assets/Cleaver/Cylance_Operation_Cleaver_Report.pdf; classtype:trojan-activity; sid:2019851; rev:6; metadata:created_at 2014_12_03, updated_at 2020_09_17;)

☝🏻 there are 21 of these dns.query exact-match rules that differ only in content.

...however, I can totally imagine a set of IOCs that have some common metadata (like reference URL), but other fields that are different for each IOC (imagine: "date first seen", or similar).

With these in mind:

  • JSON has an advantage over CSV (or TSV - as is the case with the zeek intel framework) that it doesn't require de-normalization to support the use case with both common/shared and IOC-specific metadata (assuming there is a way to merge the common + specific metadata into the alert).
  • base64 encoding the JSON field is honestly not very user friendly as such dataset files can't be updated by hand (would require an external script to generate them). I would recommend not shoe-horning this into the existing dataset format.

LMK if the redmine ticket is a better place to put this feedback. I'd be happy to review/test this feature and thank you for the great work!

@regit
Contributor Author

regit commented Mar 8, 2025

> Hi! 👋🏻
>
> @jlucovsky sent me here! 😄
>
> Some comments from the peanut gallery:
>
> To me, the obvious use case is collections of IP-only matching rules that look for the same thing; e.g., ET's Tor stuff could collapse ~2K rules into a handful.

You can already use the ip.src keyword with dataset type ip to do that. What this brings is that in a single rule you can match Tor AND other categories, such as all the other potential IPs for policy violations.
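For illustration, a minimal sketch of that existing approach
(hypothetical set name, file, and sid):

  alert ip any any -> $HOME_NET any (msg:"Tor exit node seen"; ip.src; dataset:isset,tor-exits,type ip,load tor-exits.lst; sid:1000001;)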

> Next up are the sets of IOC rules with common metadata, like this example from ET/Open:
>
> alert dns $HOME_NET any -> any any (msg:"ET MALWARE DNS Query for Operation Cleaver Domain"; dns.query; content:"doosan-job.com"; depth:14; fast_pattern; endswith; nocase; reference:url,www0.cylance.com/assets/Cleaver/Cylance_Operation_Cleaver_Report.pdf; classtype:trojan-activity; sid:2019851; rev:6; metadata:created_at 2014_12_03, updated_at 2020_09_17;)
>
> ☝🏻 there are 21 of these dns.query exact-match rules that differ only in content.

Yes, getting the domain transform could help there too ;) I need to revive this one.

> ...however, I can totally imagine a set of IOCs that have some common metadata (like reference URL), but other fields that are different for each IOC (imagine: "date first seen", or similar).
>
> With these in mind:
>
>   • JSON has an advantage over CSV (or TSV - as is the case with the zeek intel framework) that it doesn't require de-normalization to support the use case with both common/shared and IOC-specific metadata (assuming there is a way to merge the common + specific metadata into the alert).
>   • base64 encoding the JSON field is honestly not very user friendly as such dataset files can't be updated by hand (would require an external script to generate them). I would recommend not shoe-horning this into the existing dataset format.

This PR supports the $value,{json} format but also supports ingestion from a full JSON object. See the description of the previous PR: #12319

> LMK if the redmine ticket is a better place to put this feedback. I'd be happy to review/test this feature and thank you for the great work!

@regit
Contributor Author

regit commented Mar 8, 2025

> > > I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.
> >
> > Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.
>
> There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.
>
> > > The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.
> >
> > The data needs to be a JSON string and the code uses libjansson to verify the syntax. IMO this is enough, or am I missing something?
>
> The line format is supposed to be CSV, so raw JSON breaks it. Encoding it as base64 addresses that, plus it allows future extension of the line format.

I'm thinking about dropping the dataset-like format and switching to parsing an array of JSON objects, and potentially one JSON object per line, even if I think this is less common.

@jasonish
Member

jasonish commented Mar 9, 2025

CSV is quite limiting, I wonder if it would make sense to consider a new dataset format that is line-based JSON:

{"key": "column1_value", "rep": 1, "data": {}}

It's pretty easy to mix in with the previous line-based loader by a test on the first char. Extensible in ways that CSV is not.
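A minimal sketch of that first-char dispatch (hypothetical helper
names, not code from this PR):

  /* Hypothetical loader dispatch: a line starting with '{' is parsed
   * as JSON; anything else falls through to the existing CSV path. */
  static int DatasetLoadLine(Dataset *set, const char *line)
  {
      if (line[0] == '{')
          return DatasetLoadJsonLine(set, line); /* hypothetical */
      return DatasetLoadCsvLine(set, line);      /* hypothetical */
  }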

Anyways, just tossing out the idea. I really do see datajson as an evolution of dataset, or an additional feature of dataset, so the closer we can keep it, the easier it will be for all to understand I think.

@regit
Contributor Author

regit commented Mar 9, 2025

> CSV is quite limiting, I wonder if it would make sense to consider a new dataset format that is line-based JSON:
>
> {"key": "column1_value", "rep": 1, "data": {}}
>
> It's pretty easy to mix in with the previous line-based loader by a test on the first char. Extensible in ways that CSV is not.
>
> Anyways, just tossing out the idea. I really do see datajson as an evolution of dataset, or an additional feature of dataset, so the closer we can keep it, the easier it will be for all to understand I think.

I was thinking about this type of format when speaking about "jsonline".

@victorjulien
Member

Something like:
dataset: load set.json <- loads json
dataset: load myfile, format json <- also?

@regit
Contributor Author

regit commented Mar 9, 2025

> Something like:
> dataset: load set.json <- loads json
> dataset: load myfile, format json <- also?

I've started to implement the second proposal this morning. json would treat the file as one "big" JSON object, and in a second step I want to try to implement jsonline with one JSON entry per line.
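For illustration, the two shapes might look like this (field names
follow jasonish's sketch above and are not final):

  [{"key": "10.16.1.11", "data": {"actor": "APT28"}},
   {"key": "10.16.1.12", "data": {"actor": "APT32"}}]

versus jsonline, one object per line:

  {"key": "10.16.1.11", "data": {"actor": "APT28"}}
  {"key": "10.16.1.12", "data": {"actor": "APT32"}}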

@jasonish
Member

> Something like:
> dataset: load set.json <- loads json
> dataset: load myfile, format json <- also?

Yeah, extension test and/or explicit format. It's also easy enough to dynamically test, but sometimes an explicit flag is good as well.
