
Datajson v4 #12707

Open · regit wants to merge 8 commits into master

Conversation

@regit
Contributor

regit commented Mar 2, 2025

Update of #12319


Changes (if applicable):

Link to ticket: https://redmine.openinfosecfoundation.org/issues/7372

Describe changes:

  • Reapply the patch, trying my best to keep a valid history
  • Implement datajson commands in Rust

SV_BRANCH=OISF/suricata-verify#2330

@suricata-qa

ERROR:
ERROR: SEGMENTATION FAULT in build_asan QA test

ERROR: QA failed on build_asan.

Pipeline 24948

@suricata-qa

ERROR:
ERROR: SEGMENTATION FAULT in build_asan QA test

ERROR: QA failed on build_asan.

Pipeline 24949

regit added 6 commits March 2, 2025 22:45
This patch introduces a new keyword, datajson, that is similar
to dataset with a twist: where dataset allows matching against sets,
datajson does the same but also adds JSON data to the alert
event. This data comes from the set definition itself.
For example, an IPv4 set will look like:

  10.16.1.11,{"test": "success","context":3}

The syntax is the value and the JSON data separated by a comma.

The syntax of the keyword is the following:

  datajson:isset,src_ip,type ip,load src.lst,key src_ip;
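For illustration, a complete signature using it might look like this
(hypothetical msg and sid; assuming the keyword attaches to a sticky
buffer such as ip.src, as dataset does):

  alert ip any any -> $HOME_NET any (msg:"IOC match with context"; ip.src; datajson:isset,src_ip,type ip,load src.lst,key src_ip; sid:1; rev:1;)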

Compared to dataset, it just has a supplementary option, key,
that is used to indicate in which subobject the JSON value
should be added.

The information is added to the event under the alert.extra
subobject:

  "alert": {
    "extra": {
      "src_ip": {
        "test": "success",
        "context": 3
      },

The main interest of the feature is to be able to contextualize
a match. For example, if you have an IOC source, you can do

 value1,{"actor":"APT28","Country":"FR"}
 value2,{"actor":"APT32","Country":"NL"}

This way, a single dataset is able to produce context to the
event where it was not possible before and multiple signatures
had to be used.

The format introduced in datajson is an evolution of the
historical datarep format, which has some limitations. For example,
if a user fetches IOCs from a threat intel server, there is a good
chance that the format will be JSON or XML. Suricata has no support
for the latter but can support the former.

This patch implements this concept. An optional json_key option is
added to the datajson keyword. If present, Suricata assumes
that the data file contains a JSON object that is an array.
For each element elt of this array, the value added to the dataset
is elt['$json_key'] and the JSON data is elt itself.
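For illustration, with json_key ip, a data file such as the following
(hypothetical fields):

  [
    {"ip": "10.16.1.11", "actor": "APT28", "Country": "FR"},
    {"ip": "10.16.1.12", "actor": "APT32", "Country": "NL"}
  ]

adds "10.16.1.11" and "10.16.1.12" to the set, each carrying its whole
element as JSON data.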

Keeping the key value may seem redundant, but it is useful to have it
directly accessible in the extra data to be able to query it
independently of the signature (where it can be one of multiple
metadata fields or even a transformed one).

In some cases, when interacting with data (mostly coming from
threat intel servers), the JSON array containing the data
to use is not at the root of the object and it is necessary
to access a subobject.

This patch implements this with support for keys of the form
level1.level2. This is done via the `array_key` option that contains
the path to the data.
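For illustration, with array_key set to response.data, a file such as
the following (hypothetical layout) would be loaded from the array
nested under response.data:

  {
    "response": {
      "data": [
        {"ip": "10.16.1.11", "actor": "APT28"}
      ]
    }
  }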

Ticket: OISF#7372
With the datajson infrastructure in place, it is now possible to
add data to the extra information section. Following an idea
by Jason Ish, this patch adds the feature for PCRE extraction.

A PCRE such as pcre:"/(?P<alert_ua>[a-zA-Z]+)\//" will add the
content of the captured group to alert.extra.ua.
It can contain any vars, so no additional properties are needed.

codecov bot commented Mar 2, 2025

Codecov Report

Attention: Patch coverage is 54.95557% with 659 lines in your changes missing coverage. Please review.

Project coverage is 80.59%. Comparing base (2e52e95) to head (ca12d2b).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12707      +/-   ##
==========================================
- Coverage   80.72%   80.59%   -0.13%     
==========================================
  Files         936      945       +9     
  Lines      259393   260824    +1431     
==========================================
+ Hits       209398   210222     +824     
- Misses      49995    50602     +607     
Flag Coverage Δ
fuzzcorpus 56.47% <3.07%> (-0.50%) ⬇️
livemode 19.26% <1.91%> (-0.16%) ⬇️
pcap 43.84% <2.05%> (-0.39%) ⬇️
suricata-verify 63.46% <54.58%> (-0.07%) ⬇️
unittests 57.90% <2.32%> (-0.32%) ⬇️

Flags with carried forward coverage won't be shown.

@suricata-qa

ERROR:

ERROR: QA failed on SURI_TLPR1_suri_time.

Pipeline 24950

@victorjulien
Member

I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.

The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.

@jufajardini added the decision-required (Waiting on deliberation from the team) label on Mar 4, 2025
@regit
Contributor Author

regit commented Mar 4, 2025

> I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.

Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.

> The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.

The data needs to be a JSON string and the code uses libjansson to verify the syntax. IMO this is enough, or am I missing something?

@victorjulien
Member

> > I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.
>
> Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.

There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.

> > The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.
>
> The data needs to be a JSON string and the code uses libjansson to verify the syntax. IMO this is enough, or am I missing something?

The line format is supposed to be CSV, so raw JSON breaks it. Encoding it as base64 addresses that, plus it allows future extension of the line format.
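For illustration (not a format defined in this PR), the set line from
the first commit message would then become, with the JSON encoded as
base64:

  10.16.1.11,eyJ0ZXN0IjogInN1Y2Nlc3MiLCJjb250ZXh0IjozfQ==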

@regit
Contributor Author

regit commented Mar 4, 2025

> > > I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.
> >
> > Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.
>
> There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.

So you want to do something like the following on all dataset types:

typedef struct StringType {
    uint32_t len;
    union {
        DataRepType rep;
        DataJsonType json;
    };
    uint8_t *ptr;
} StringType;

I don't really like the cost of it in terms of memory, but it should not be that bad.

> > > The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.
> >
> > The data needs to be a JSON string and the code uses libjansson to verify the syntax. IMO this is enough, or am I missing something?
>
> The line format is supposed to be CSV, so raw JSON breaks it. Encoding it as base64 addresses that, plus it allows future extension of the line format.

@victorjulien
Member

> > > > I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.
> > >
> > > Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.
> >
> > There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.
>
> So you want to do something like the following on all dataset types:
>
> typedef struct StringType {
>     uint32_t len;
>     union {
>         DataRepType rep;
>         DataJsonType json;
>     };
>     uint8_t *ptr;
> } StringType;

Could even use a "flex array", then it wouldn't use space if the feature isn't in use:

typedef struct StringType {
    uint32_t len;
    union {
        DataRepType rep;
        DataJsonType json;
    };
    uint8_t *enrich[];
} StringType;
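A minimal sketch of the allocation side of that idea (hypothetical
names; assuming one pointer slot is appended only when an entry
carries enrichment data):

  /* Hypothetical: entries without enrichment allocate no extra space;
   * enrich_blob stands in for the entry's serialized JSON bytes. */
  size_t extra = has_enrich ? sizeof(uint8_t *) : 0;
  StringType *s = calloc(1, sizeof(StringType) + extra);
  if (s != NULL && has_enrich)
      s->enrich[0] = enrich_blob;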

        break;
    case DATASET_TYPE_STRING:
        set->hash = THashInit(cnf_name, sizeof(StringTypeJson), StringJsonSet, StringJsonFree,
                StringJsonHash, StringJsonCompare, NULL, NULL, load != NULL ? 1 : 0,
Member

It seems like we might be missing a length getter here, which is used to calculate memuse in the THash API. This is needed for the string type as we do not have a definite size for it. wdyt?
See: https://redmine.openinfosecfoundation.org/issues/3910
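A minimal sketch of such a getter (hypothetical name; assuming the
callback receives the entry data and returns its size in bytes, and
that StringTypeJson stores its payload length in len):

  /* Hypothetical memuse getter: report the entry's true size, i.e. the
   * struct itself plus the variable-length string payload it owns. */
  static uint32_t StringJsonGetLength(void *data)
  {
      const StringTypeJson *s = data;
      return (uint32_t)sizeof(StringTypeJson) + s->len;
  }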

@simeonmiteff

simeonmiteff commented Mar 6, 2025

Hi! 👋🏻

@jlucovsky sent me here! 😄

Some comments from the peanut gallery:

To me, the obvious use case is collections of IP-only matching rules that look for the same thing; e.g., ET's Tor stuff could collapse ~2K rules into a handful.

Next up are the sets of IOC rules with common metadata, like this example from ET/Open:

alert dns $HOME_NET any -> any any (msg:"ET MALWARE DNS Query for Operation Cleaver Domain"; dns.query; content:"doosan-job.com"; depth:14; fast_pattern; endswith; nocase; reference:url,www0.cylance.com/assets/Cleaver/Cylance_Operation_Cleaver_Report.pdf; classtype:trojan-activity; sid:2019851; rev:6; metadata:created_at 2014_12_03, updated_at 2020_09_17;)

☝🏻 there are 21 of these dns.query exact-match rules that differ only in content.

...however, I can totally imagine a set of IOCs that have some common metadata (like reference URL), but other fields that are different for each IOC (imagine: "date first seen", or similar).

With these in mind:

  • JSON has an advantage over CSV (or TSV - as is the case with the zeek intel framework) that it doesn't require de-normalization to support the use case with both common/shared and IOC-specific metadata (assuming there is a way to merge the common + specific metadata into the alert).
  • base64 encoding the JSON field is honestly not very user friendly as such dataset files can't be updated by hand (would require an external script to generate them). I would recommend not shoe-horning this into the existing dataset format.

LMK if the redmine ticket is a better place to put this feedback. I'd be happy to review/test this feature and thank you for the great work!

@regit
Contributor Author

regit commented Mar 8, 2025

> Hi! 👋🏻
>
> @jlucovsky sent me here! 😄
>
> Some comments from the peanut gallery:
>
> To me, the obvious use case is collections of IP-only matching rules that look for the same thing; e.g., ET's Tor stuff could collapse ~2K rules into a handful.

You can already use the ip.src keyword with dataset type ip to do that. What this brings is that in a single rule you can match Tor AND other categories, such as all the other potential IPs for policy violations.
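For illustration, a minimal sketch of that existing approach
(hypothetical set name, file, and sid):

  alert ip any any -> $HOME_NET any (msg:"Tor exit node seen"; ip.src; dataset:isset,tor-exits,type ip,load tor-exits.lst; sid:1000001;)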

> Next up are the sets of IOC rules with common metadata, like this example from ET/Open:
>
> alert dns $HOME_NET any -> any any (msg:"ET MALWARE DNS Query for Operation Cleaver Domain"; dns.query; content:"doosan-job.com"; depth:14; fast_pattern; endswith; nocase; reference:url,www0.cylance.com/assets/Cleaver/Cylance_Operation_Cleaver_Report.pdf; classtype:trojan-activity; sid:2019851; rev:6; metadata:created_at 2014_12_03, updated_at 2020_09_17;)
>
> ☝🏻 there are 21 of these dns.query exact-match rules that differ only in content.

Yes, getting the domain transform could help there too ;) I need to revive this one.

> ...however, I can totally imagine a set of IOCs that have some common metadata (like reference URL), but other fields that are different for each IOC (imagine: "date first seen", or similar).
>
> With these in mind:
>
>   • JSON has an advantage over CSV (or TSV - as is the case with the zeek intel framework) that it doesn't require de-normalization to support the use case with both common/shared and IOC-specific metadata (assuming there is a way to merge the common + specific metadata into the alert).
>   • base64 encoding the JSON field is honestly not very user friendly as such dataset files can't be updated by hand (would require an external script to generate them). I would recommend not shoe-horning this into the existing dataset format.

This PR supports the $value,{json} format but also supports ingestion from a full JSON object. See the description of the previous PR: #12319

> LMK if the redmine ticket is a better place to put this feedback. I'd be happy to review/test this feature and thank you for the great work!

@regit
Contributor Author

regit commented Mar 8, 2025

> > > I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like enrichment-format json, enrich-alert <some-key>, enrich-alert-label <name>.
> >
> > Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easy to communicate about it.
>
> There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.
>
> > > The format in the CSV file should probably encode the JSON string as base64, just like we do with string type.
> >
> > The data needs to be a JSON string and the code uses libjansson to verify the syntax. IMO this is enough, or am I missing something?
>
> The line format is supposed to be CSV, so raw JSON breaks it. Encoding it as base64 addresses that, plus it allows future extension of the line format.

I'm thinking about dropping the dataset-like format and switching to parsing an array of JSON objects, and potentially one JSON object per line, even if I think this is less common.

@jasonish
Member

jasonish commented Mar 9, 2025

CSV is quite limiting, I wonder if it would make sense to consider a new dataset format that is line-based JSON:

{"key": "column1_value", "rep": 1, "data": {}}

It's pretty easy to mix in with the previous line-based loader by a test on the first char. Extensible in ways that CSV is not.
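A minimal sketch of that first-char dispatch (hypothetical helper
names, not code from this PR):

  /* Hypothetical loader dispatch: a line starting with '{' is parsed
   * as JSON; anything else falls through to the existing CSV path. */
  static int DatasetLoadLine(Dataset *set, const char *line)
  {
      if (line[0] == '{')
          return DatasetLoadJsonLine(set, line); /* hypothetical */
      return DatasetLoadCsvLine(set, line);      /* hypothetical */
  }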

Anyways, just tossing out the idea. I really do see datajson as an evolution of dataset, or an additional feature of dataset, so the closer we can keep it, the easier it will be for all to understand I think.

@regit
Contributor Author

regit commented Mar 9, 2025

> CSV is quite limiting, I wonder if it would make sense to consider a new dataset format that is line-based JSON:
>
> {"key": "column1_value", "rep": 1, "data": {}}
>
> It's pretty easy to mix in with the previous line-based loader by a test on the first char. Extensible in ways that CSV is not.
>
> Anyways, just tossing out the idea. I really do see datajson as an evolution of dataset, or an additional feature of dataset, so the closer we can keep it, the easier it will be for all to understand I think.

I was thinking about this type of format when speaking about "jsonline".

@victorjulien
Member

Something like:
dataset: load set.json <- loads json
dataset: load myfile, format json <- also?

@regit
Contributor Author

regit commented Mar 9, 2025

> Something like:
> dataset: load set.json <- loads json
> dataset: load myfile, format json <- also?

I've started to implement the second proposal this morning. json would treat the file as one "big" JSON object, and in a second step I want to try to implement jsonline with one JSON entry per line.
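For illustration, the two shapes might look like this (field names
follow jasonish's sketch above and are not final):

  [{"key": "10.16.1.11", "data": {"actor": "APT28"}},
   {"key": "10.16.1.12", "data": {"actor": "APT32"}}]

versus jsonline, one object per line:

  {"key": "10.16.1.11", "data": {"actor": "APT28"}}
  {"key": "10.16.1.12", "data": {"actor": "APT32"}}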

@jasonish
Member

> Something like:
> dataset: load set.json <- loads json
> dataset: load myfile, format json <- also?

Yeah, extension test and/or explicit format. It's also easy enough to dynamically test, but sometimes an explicit flag is good as well.
