Datajson v4 #12707
base: master
Conversation
ERROR: QA failed on build_asan. Pipeline 24948
ERROR: QA failed on build_asan. Pipeline 24949
This patch introduces a new keyword `datajson` that is similar to `dataset` with a twist. Where `dataset` allows a match from sets, `datajson` allows the same but also adds JSON data to the alert event. This data comes from the set definition itself. For example, an ipv4 set will look like:

```
10.16.1.11,{"test": "success","context":3}
```

The syntax is the value and the JSON data separated by a comma.

The syntax of the keyword is the following:

```
datajson:isset,src_ip,type ip,load src.lst,key src_ip;
```

Compared to `dataset`, it just has a supplementary option `key` that is used to indicate in which subobject the JSON value should be added. The information is added in the event under the `alert.extra` subobject:

```json
"alert": {
  "extra": {
    "src_ip": {
      "test": "success",
      "context": 3
    },
```

The main interest of the feature is to be able to contextualize a match. For example, if you have an IOC source, you can do:

```
value1,{"actor":"APT28","Country":"FR"}
value2,{"actor":"APT32","Country":"NL"}
```

This way, a single dataset is able to produce context for the event, where it was not possible before and multiple signatures had to be used.

The format introduced in `datajson` is an evolution of the historical `datarep` format. This has some limitations. For example, if a user fetches IOCs from a threat intel server, there is a good chance that the format will be JSON or XML. Suricata has no support for the second, but it can support the first one. This patch implements this concept.

An optional `json_key` option is added to the `datajson` keyword. If present, Suricata assumes that the data file contains a JSON array of objects. For each element `elt` of this array, the value added to the dataset is `elt['$json_key']` and the JSON data is `elt` itself. Keeping the key value may seem redundant, but it is useful to have it directly accessible in the extra data, to be able to query it independently of the signature (where it can be one of multiple metadata or even a transformed metadata).

In some cases, when interacting with data (mostly coming from threat intel servers), the JSON array containing the data to use is not at the root of the object and it is necessary to access a subobject. This patch implements this with support of keys in `level1.level2` form. This is done via the `array_key` option that contains the path to the data.

Ticket: OISF#7372
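As an illustration of the two options together (the file name, path and field names below are hypothetical), a threat-intel export where the IOC list sits below a subobject could look like:

```json
{
  "response": {
    "iocs": [
      { "ip": "10.16.1.11", "actor": "APT28", "Country": "FR" },
      { "ip": "10.16.1.12", "actor": "APT32", "Country": "NL" }
    ]
  }
}
```

and be matched with something like:

```
datajson:isset,src_ip,type ip,load iocs.json,key src_ip,json_key ip,array_key response.iocs;
```

Each matching alert would then carry the full element (including the `ip` key itself) under `alert.extra.src_ip`.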
With the datajson infrastructure in place, it is now possible to add data to the extra information section. Following an idea by Jason Ish, this patch adds the feature for pcre extraction: a PCRE such as `pcre:"/(?P<alert_ua>[a-zA-Z]+)\//"` will add the content of the captured group to `alert.extra.ua`.
The extra section can contain any vars, so the schema entry needs additional properties.
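A sketch of how that could look end to end (the rule is illustrative; buffer choice and sid are arbitrary):

```
alert http any any -> any any (msg:"extract UA product"; http.user_agent; pcre:"/^(?P<alert_ua>[a-zA-Z]+)\//"; sid:1; rev:1;)
```

For a `Mozilla/5.0 ...` user agent, the event would then carry, per the description above:

```json
"alert": {
  "extra": {
    "ua": "Mozilla"
  }
}
```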
Codecov Report: Attention: Patch coverage is …
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #12707      +/-   ##
==========================================
- Coverage   80.72%   80.59%   -0.13%
==========================================
  Files         936      945       +9
  Lines      259393   260824    +1431
==========================================
+ Hits       209398   210222     +824
- Misses      49995    50602     +607
```
ERROR: QA failed on SURI_TLPR1_suri_time. Pipeline 24950
I remain stuck on the need for a new keyword. Can we not extend dataset/datarep with some new optional keys? E.g. something like …
The format in the CSV file should probably encode the JSON string as base64, just like we do with the string type.
Well, a new name really conveys that it offers different capabilities and that it is not just an evolution. It is easier to communicate about it.
The data needs to be a JSON string, and the code is using libjansson to verify the syntax. IMO this is enough, or do I miss something?
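For context, the syntax check with libjansson can be as simple as a `json_loads()` round trip; a minimal sketch (the helper name is illustrative, not the actual code in this PR):

```c
#include <jansson.h>
#include <stdbool.h>

/* Sketch: reject lines whose JSON column does not parse. */
static bool JsonColumnIsValid(const char *str)
{
    json_error_t err;
    json_t *obj = json_loads(str, 0, &err);
    if (obj == NULL)
        return false;   /* err.text / err.line describe the problem */
    json_decref(obj);
    return true;
}
```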
There is simply too much code and logic duplication between dataset and datajson. It also shows, for example, in the poor test coverage for datajson here.
The line format is supposed to be CSV, so raw JSON breaks it. Encoding it as base64 addresses that, plus it allows future extension of the line format.
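For illustration, an ip-type line under that proposal could look like this, with the JSON column `{"test":"success","context":3}` base64-encoded (this exact layout is an assumption about the proposed format, not something implemented in this PR):

```
10.16.1.11,eyJ0ZXN0Ijoic3VjY2VzcyIsImNvbnRleHQiOjN9
```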
So you want to do something like the following on all dataset types:

```c
typedef struct StringType {
    uint32_t len;
    union {
        DataRepType rep;
        DataJsonType json;
    };
    uint8_t *ptr;
} StringType;
```

I don't really like the cost of it in terms of memory, but it should not be that bad.
Could even use a "flex array"; then it wouldn't use space if the feature isn't in use:

```c
typedef struct StringType {
    uint32_t len;
    union {
        DataRepType rep;
        DataJsonType json;
    };
    uint8_t *enrich[];
} StringType;
```
```c
            break;
        case DATASET_TYPE_STRING:
            set->hash = THashInit(cnf_name, sizeof(StringTypeJson), StringJsonSet, StringJsonFree,
                    StringJsonHash, StringJsonCompare, NULL, NULL, load != NULL ? 1 : 0,
```
It seems like we might be missing a length getter here, which is used to calculate memuse in the THash API. This is needed for the string type as we do not have a definite size for it. wdyt?
See: https://redmine.openinfosecfoundation.org/issues/3910
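For reference, a sketch of what such a getter could look like; the callback name, signature, and the `StringTypeJson` fields used here are assumptions, since the exact THash memuse hook is not shown in this hunk:

```c
/* Hypothetical length getter so THash can account for the variable-sized
 * string and attached JSON in its memuse computation (see issue 3910). */
static uint32_t StringJsonGetLength(void *s)
{
    const StringTypeJson *str = s;
    /* assumed fields: stored string length plus JSON blob length */
    return str->len + str->json.len;
}
```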
Hi! 👋🏻 @jlucovsky sent me here! 😄 Some comments from the peanut gallery: to me, the obvious use case is collections of IP-only matching rules that look for the same thing; e.g., ET's Tor stuff could collapse ~2K rules into a handful. Next up are the sets of IOC rules with common metadata, like this example from ET/Open:
☝🏻 there are 21 of these. ...However, I can totally imagine a set of IOCs that have some common metadata (like a reference URL), but other fields that are different for each IOC (imagine "date first seen", or similar). With these in mind:
LMK if the redmine ticket is a better place to put this feedback. I'd be happy to review/test this feature, and thank you for the great work!
You can already use the ip.src keyword with a dataset of type ip to do that. What this brings is that in a single rule you can match Tor AND some other categories, such as all the other potential IPs for policy violations.
Yes, getting the domain transform could help there too ;) I need to revive this one.
This PR supports the `$value,{json}` format but also supports ingestion from a full JSON object. See the description of the previous PR: #12319
I'm thinking about dropping the dataset-like format and switching to parsing an array of JSON objects, and potentially one JSON object per line, even if I think this is not as common.
CSV is quite limiting. I wonder if it would make sense to consider a new dataset format that is line-based JSON: `{"key": "column1_value", "rep": 1, "data": {}}`. It's pretty easy to mix in with the previous line-based loader by a test on the first char. Extensible in ways that CSV is not. Anyways, just tossing out the idea. I really do see …
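A minimal sketch of such a first-char dispatch (the function names are illustrative, not the actual Suricata loader API):

```c
/* Sketch only: route each data-file line to the NDJSON or legacy CSV
 * parser based on its first non-whitespace character. The Dataset*
 * helpers named here are hypothetical. */
static int DatasetLoadLine(Dataset *set, const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '{')
        return DatasetLoadJsonLine(set, line); /* {"key": ..., "data": {...}} */
    return DatasetLoadCsvLine(set, line);      /* value[,rep|,{json}] */
}
```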
I was thinking about this type of format when speaking about "jsonline".
Something like: …
I've started to implement the second proposal this morning.
Yeah, an extension test and/or explicit format. It's also easy enough to dynamically test, but sometimes an explicit flag is good as well.
Update of #12319
Link to ticket: https://redmine.openinfosecfoundation.org/issues/7372
SV_BRANCH=OISF/suricata-verify#2330