Performance Testing of Fluent-bit with several filters shows log processing falling below 5 MB/s #9399

Open
wants to merge 1 commit into base: master

Conversation

ryanohnemus
Contributor

This is more of a question/issue, but I created an example test that can be reused, so I opened it as a PR.

Background

I have been running into a few performance bottlenecks in my fluent-bit setup in Kubernetes, so I created a k8s_perf_test that can be used to (hopefully) demonstrate the issue, and hopefully this will lead to some discussion about performance tuning, fluent-bit defaults, or possibly point out some flaws in my own setup 😄.

Test Setup

  • This runs a single fluent-bit pod with a fluent-bit container and a logwriter container that runs the perf test
  • examples/k8s_perf_test/run-test.sh and examples/k8s_perf_test/values.yaml are set up to use the standard fluent-bit helm chart.
  • I used extraContainers to create a python/ubuntu container (called logwriter) that runs as a sidecar to fluent-bit
  • I used extraFiles to store my container startup script and test_runner.py
  • I set up an extra emptyDir (perftest-volume) for ephemeral shared storage between fluent-bit and logwriter, mounted in both containers at /app/perftest/containers; the /fluent-bit configmap is also mounted in both containers
  • run-log-writer-test.sh passes configuration to test_runner.py; specifically, it builds a logfile name that "impersonates" a log filename that would be created by containerd. test_runner.py creates the logfile in /app/perftest/containers/, which is watched by the fluent-bit tail input.
  • test_runner.py has a small bit of logic, but has been performant enough on a MacBook Pro 2019 and a GCP (GKE) n2-standard-8 with SSD boot disks to write >50 MB/s to a file. It writes in the containerd (CRI) format and also performs file renames to mimic logrotate.
  • test_runner.py writes the containerd log file and then polls the fluent-bit API until all written records are accounted for in the /api/v1/metrics output null.0 proc_records counter (a rough sketch of this loop follows this list)
  • output in fluent-bit is set to null
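
The core of the writer loop looks roughly like the sketch below. This is a minimal sketch, not the actual test_runner.py: the file name, payload, and record count are illustrative, it omits the logrotate-style renames, and it assumes the usual /api/v1/metrics JSON layout.

    # Hedged sketch of the test loop; names and values here are illustrative.
    import json
    import time
    import urllib.request

    # Illustrative path that "impersonates" a containerd log file name.
    LOG_PATH = "/app/perftest/containers/perftest_default_logwriter.log"
    METRICS_URL = "http://127.0.0.1:2020/api/v1/metrics"
    TOTAL_RECORDS = 1_000_000

    def write_cri_lines(path, count):
        """Append `count` records in containerd/CRI format: <time> <stream> <P|F> <log>."""
        with open(path, "a") as f:
            for i in range(count):
                # Placeholder timestamp; the real script would use proper nanosecond precision.
                ts = time.strftime("%Y-%m-%dT%H:%M:%S.000000000Z", time.gmtime())
                f.write(f'{ts} stdout F {{"msg": "record {i}", "level": "info"}}\n')

    def processed_records():
        """Read proc_records for the null output from the fluent-bit HTTP API."""
        with urllib.request.urlopen(METRICS_URL) as resp:
            metrics = json.load(resp)
        return metrics["output"]["null.0"]["proc_records"]

    write_cri_lines(LOG_PATH, TOTAL_RECORDS)
    while processed_records() < TOTAL_RECORDS:
        time.sleep(1)  # poll until the null output has accounted for every written record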

Results

I ran this on both a GCP n2-standard-8 host that used an SSD boot disk and a MacBook. The results were similar in both cases in terms of fluent-bit throughput. The numbers below are from a MacBook Pro 2019 2.3 GHz i7 running a single-node kind (k8s) cluster on Docker.

1. tail input defaults do not seem optimal; larger input buffers are more performant, but can then result in downstream issues

  1. Fluent-bit with a single tail input, no multiline.parser, no filters, and the default buffers ingests slower than when higher buffers are defined. However, defining higher buffers tends to create larger chunks, which can lead to output errors like out_stackdriver: does not batch output records properly if passed a large chunk of records and can drop a majority of records #9374 & Allow output plugins to configure a max chunk size #1938.

Initial input config:

    [INPUT]
        name                     tail
        read_from_head           false
        skip_long_lines          on
        path                     /app/perftest/containers/*.log
        Tag                      kube.*
        Rotate_Wait              60
        Refresh_Interval         1
        mem_buf_limit            50MB
        threaded                 on
  • this uses the default buffer_chunk_size and buffer_max_size and resulted in a write throughput of 41.196 MB/s
  • changing buffer_chunk_size AND buffer_max_size both to 5M resulted in a write throughput of 44.097 MB/s (the two added lines are shown just below)
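
For reference, the 5M variant is the same INPUT stanza with these two lines added (they appear commented out in the config below):

        buffer_chunk_size        5M
        buffer_max_size          5M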

1a. A fluent-bit config that only reads and doesn't parse anything isn't super useful, so I re-tested the above config with multiline.parser cri and the following settings:

    [INPUT]
        name                     tail
        read_from_head           false
        skip_long_lines          on
        path                     /app/perftest/containers/*.log
        multiline.parser         cri
        Tag                      kube.*
        #buffer_chunk_size        5M
        #buffer_max_size          5M
        Rotate_Wait              60
        Refresh_Interval         1
        mem_buf_limit            50MB
        threaded                 on
        #Inotify_Watcher         false

I changed the buffer settings as follows, with varying results:

  • default buffer_chunk_size + buffer_max_size = 17.64 MB/s
  • 256K buffer_chunk_size + buffer_max_size = 19.15 MB/s
  • 5M buffer_chunk_size + buffer_max_size = 22.78 MB/s
  • 10M buffer_chunk_size + buffer_max_size = 21.42 MB/s


This looks like we could add a few MB/s to fluent-bit throughput just by increasing these buffer sizes by default (the default is only 32K). However, this seems to create oversized chunks, which output plugins cannot handle well (#1938). Is there any other suggestion for improving the initial parsing speed?

NOTE: for the setup above I used filters: "" in values.yaml

2. Adding common processing filters quickly slows down fluent-bit to a crawl

  1. I added and tested with both the filters-simple and filters-extended sections in values.yaml. When testing with those you will need to rename the section to just filters for it to be activated.
    For these changes I kept the larger buffers; my input section was:
    [INPUT]
        name                     tail
        read_from_head           false
        skip_long_lines          on
        path                     /app/perftest/containers/*.log
        multiline.parser         cri
        Tag                      kube.*
        buffer_chunk_size        5M
        buffer_max_size          5M
        Rotate_Wait              60
        Refresh_Interval         1
        mem_buf_limit            50MB
        threaded                 on
        #Inotify_Watcher         false

2a. Please review the values.yaml filters-simple section.

I started by adding just the following:

    [FILTER]
        Name        modify
        Match       kube*
        Copy        level severity
        Remove      _p
  • this copies a field and then drops the _p artifact that comes from CRI parsing; this lowered processing to 18.065 MB/s (down from the 22.78 MB/s with higher buffers and no filters)

2b. Adding the kubernetes filter for namespace labels & annotations and pod labels & annotations. This also used Merge_Log to move the log field to message.

    [FILTER]
        Name                kubernetes
        Alias               get_k8s_metadata
        Kube_Tag_Prefix     kube.app.perftest.containers
        Match               kube*
        K8S-Logging.Parser  Off
        K8S-Logging.Exclude Off
        #Use_Kubelet         On
        #Kubelet_Host        ${NODE_IP}
        #Kubelet_Port        10250
        Buffer_Size         2MB
        Labels              On
        Annotations         On
        #Merge_Log On moves 'log' field to message
        Merge_Log           On
        Merge_Log_Trim      On
        kube_meta_cache_ttl 15m
        kube_meta_namespace_cache_ttl 15m
        Namespace_labels    On
        Namespace_annotations    On
        namespace_metadata_only Off
  • this brought processing down to 13.14 MB/s. At this point in my test I would consider this a very minimal fluent-bit + k8s config. However, I'd assume many people using this also go on to do work similar to my filters-extended example, which moves kubernetes and other fields around and potentially removes other fields before sending to an output

2c. Please look at filters-extended in values.yaml; this has everything in filters-simple plus a nest/lift to move the kubernetes metadata fields and a modify filter (a sketch of the lift step follows these bullets).

  • After using the filters-extended config, I ran into several issues with fluent-bit keeping up with log rotation, something I have also seen in my production setups. It potentially misses log rotations and does not realize it (switching Inotify_Watcher to false does not seem to be an improvement), and it's hard to tell because this is also not reflected in fluent-bit metrics (it doesn't know it missed a rotation, so it can't record it). To address it for this test only, you can change Rotate_Wait in the input to an extremely high number like 300. In a standard k8s setup you will miss data, as kubelet generally rotates a container log when it reaches 10 MB (usually checked at 10s intervals). So as fluent-bit backs up and a container writes faster than fluent-bit can process, logs are missed with no metrics available to show they were missed.

  • The input pauses constantly because the engine thread is backed up, since all filters are executed single-threaded in the engine thread (iirc), and fluent-bit processes at a rate of 4.9 MB/s. (In my actual prod setup I have another lua script that runs between the last two filters, and that loses another 1.5 MB/s of throughput, to the point the fluent-bit pipeline can only process 3.5 MB/s.)
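
For context, the nest/lift step in filters-extended is roughly of this shape. This is a hedged sketch of a typical lift of the kubernetes map (the match pattern and prefix are assumptions); the exact filters are in values.yaml in this PR.

    [FILTER]
        Name                     nest
        Match                    kube*
        Operation                lift
        Nested_under             kubernetes
        Add_prefix               kubernetes_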

Questions

  • is there anything glaringly wrong with the fluent-bit pipeline setup (specifically looking at the full filters-extended version)?
  • are there any suggestions on what can be done to improve this? I'd like to avoid having a single fluent-bit that just reads from disk and multiple downstream fluent-bits for k8s enrichment before a final real output
  • is there any possibility of having more than a single engine thread, as that seems to be the bottleneck?

Signed-off-by: ryanohnemus <[email protected]>
@agup006
Member

agup006 commented Sep 19, 2024

Heya, is the file already present prior to the read? I'm wondering if IOPS are a contributing factor to the slowdown.

Multiple filters chained together is a pattern we know is not optimal, which is why we've introduced processors. Have you tried the benchmark with processors?

Checking the commit, it also looks like you are on latest? Thought I would double-check though.
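
For reference, the processors agup006 mentions are attached directly to an input in fluent-bit's YAML configuration format. Below is a minimal sketch mirroring the modify filter from 2a; the keys are assumptions based on the fluent-bit YAML pipeline schema and have not been verified against this chart or benchmark.

    pipeline:
      inputs:
        - name: tail
          path: /app/perftest/containers/*.log
          multiline.parser: cri
          processors:
            logs:
              - name: modify
                copy: level severity
                remove: _p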
