[BUG] FIL: 4-5x slowdown for experimental FIL compared to old implementation on GPU #6214
Comments
Thanks for the issue @Kaiyang-Chen, the model file would be extremely useful. @wphicks and @hcho3 might be the best people to help here, though I think they are not around until next week.
Yes, it will help us tremendously if you are able to share the model with us. Note: I am taking time off for the next two weeks, until Jan 25. I will be able to start troubleshooting the performance issue then.
model_20241106.txt
I'm happy to dig into this in more depth, but I'm almost certain I can give you an answer based on what we have here already. In addition to more fundamental changes, experimental FIL also updates the way we choose default hyperparameters. Original FIL selected those parameters based largely on implementation details, but experimental FIL defaults to hyperparameters that give the best throughput for large batches. At batch size 64, that's definitely going to give a significant performance degradation.

As a quick test, experimental FIL offers the new `optimize` method (used in the benchmark script later in this thread), which re-selects hyperparameters for a given batch size. If you're still seeing a performance degradation, we can dig into this a lot more carefully. Thanks very much for the report! This is exactly the sort of thing we want to catch before promoting experimental FIL to stable.
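A minimal sketch of that quick test (the `optimize` call matches the one used in the benchmark script later in the thread; the `chunk_size` keyword to `predict` and the `layout` argument to `load` are assumptions about the experimental API, not confirmed here):

```python
# Sketch: re-tune experimental FIL for a small production batch size.
import cupy as cp
from cuml.experimental.fil import ForestInference as FILEX

# `layout` is an assumed load-time option ('depth_first' vs 'breadth_first').
filex = FILEX.load('model.txt', precision='float32', layout='depth_first')

# Re-select hyperparameters for the batch size actually used in production
# (64 here) instead of the large-batch defaults.
filex.optimize(batch_size=64)

batch = cp.random.uniform(size=(64, filex.forest.num_features()), dtype='float32')
preds = filex.predict(batch)

# Optionally, sweep chunk_size by hand at the target batch size
# (assumed per-call keyword).
for chunk_size in (1, 2, 4, 8, 16, 32):
    filex.predict(batch, chunk_size=chunk_size)
```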
Yes, I've tried tuning the parameters by hand, using a range of reasonable chunk_size values and both tree layouts (depth / width). It does affect the performance, but seemingly only within a 20-30% range; I cannot produce a result that is even close to the stable version. @wphicks
Okay, in that case, let's dig into it more systematically. Can you post your benchmarking code so I can try for an exact repro?
I was able to reproduce the regression with the code below. Very interesting! This is a domain (shallow trees, small batches, wide inputs) where experimental FIL has seen lower performance at times, but I haven't seen any other model where performance has suffered this much. I'll investigate further. Can you confirm that the code below at least generally matches how you performed your own benchmarks?

```python
import cupy as cp
import logging
import numpy as np
import treelite

from cuml import ForestInference as FIL
from cuml.experimental.fil import ForestInference as FILEX
from pandas import DataFrame
from time import perf_counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load():
    tl_model = treelite.frontend.load_lightgbm_model('model.txt')
    return (
        FIL().load_from_treelite_model(
            tl_model,
            precision='float32'
        ),
        FILEX.load(
            'model.txt',
            precision='float32'
        )
    )


def run(
    fil,
    filex,
    *,
    batch_size=None,
    min_batch_size=1,
    max_batch_size=131072,
    iterations=10,
    warmup_iterations=2,
    format='cupy',
    results=None
):
    if batch_size is None:
        batch_size = min_batch_size
    if results is None:
        results = {
            'batch_size': [],
            'FIL': [],
            'FILEX': []
        }
    results['batch_size'].append(batch_size)
    if format == 'cupy':
        xpy = cp
    elif format == 'numpy':
        xpy = np
    dtype = filex.forest.get_dtype()
    # TODO(wphicks): set range based on model.txt for each feature
    warmup_batches = xpy.random.uniform(
        xpy.finfo(dtype).min / 2,
        xpy.finfo(dtype).max / 2,
        size=(warmup_iterations, batch_size, filex.forest.num_features())
    )
    batches = xpy.random.uniform(
        xpy.finfo(dtype).min / 2,
        xpy.finfo(dtype).max / 2,
        size=(iterations, batch_size, filex.forest.num_features())
    )
    filex.optimize(batch_size=batch_size)
    # Time both implementations at this batch size after a short warmup.
    for name, model in (('FIL', fil), ('FILEX', filex)):
        for i in range(warmup_iterations):
            model.predict(warmup_batches[i])
        start = perf_counter()
        for i in range(iterations):
            model.predict(batches[i])
        elapsed = perf_counter() - start
        results[name].append(elapsed)
        logger.info(
            f'Run at batch size {batch_size} completed in'
            f' {elapsed:.2E}s with {name}'
        )
    # Bisect on batch size: grow the batch while original FIL is faster,
    # shrink it once FILEX pulls ahead.
    if results['FIL'][-1] < results['FILEX'][-1]:
        next_batch_size = batch_size + (
            (max_batch_size - batch_size) // 2
        )
        min_batch_size = batch_size
    else:
        logger.info(f'FILEX outperformed FIL at batch size {batch_size}')
        next_batch_size = batch_size - (
            (batch_size - min_batch_size) // 2
        )
        max_batch_size = batch_size
    if (
        next_batch_size < min_batch_size or
        next_batch_size >= max_batch_size or
        next_batch_size == batch_size
    ):
        return DataFrame.from_dict(results)
    else:
        return run(
            fil,
            filex,
            batch_size=next_batch_size,
            min_batch_size=min_batch_size,
            max_batch_size=max_batch_size,
            iterations=iterations,
            warmup_iterations=warmup_iterations,
            format=format,
            results=results
        )


if __name__ == '__main__':
    fil, filex = load()
    df = run(fil, filex)
    df = df.sort_values(by='batch_size')
    print(df.to_csv(index=False))
```
Yes, the procedure is similar. Two tiny differences: I am only testing up to batch size 500, and I am using the C++ backend directly (which should not cause differences).
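For reference, a minimal sketch of that fixed-size sweep, reusing the `load()` helper from the script above (a sketch of the described procedure, not the reporter's actual benchmark code):

```python
import cupy as cp
from time import perf_counter

fil, filex = load()  # `load()` from the benchmark script above
n_features = filex.forest.num_features()
iterations = 100

# Batch sizes 1..500 in steps of 10, as in the original report.
for batch_size in range(1, 501, 10):
    batch = cp.random.uniform(size=(batch_size, n_features), dtype='float32')
    filex.optimize(batch_size=batch_size)
    for name, model in (('FIL', fil), ('FILEX', filex)):
        model.predict(batch)  # warm up
        start = perf_counter()
        for _ in range(iterations):
            model.predict(batch)
        per_call = (perf_counter() - start) / iterations
        print(f'{name} batch_size={batch_size}: {per_call * 1e6:.0f} us/call')
```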
Another interesting thing: as you mentioned, the trees in this forest are relatively shallow, yet the breadth-first (width) layout gives worse performance than the depth-first layout.
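A sketch of how that layout comparison might be scripted (the `layout` argument name and its accepted values are assumptions about the experimental loader):

```python
import cupy as cp
from time import perf_counter
from cuml.experimental.fil import ForestInference as FILEX

batch = cp.random.uniform(size=(64, 210), dtype='float32')  # 210 features, batch 64

for layout in ('depth_first', 'breadth_first'):  # assumed option names
    model = FILEX.load('model.txt', precision='float32', layout=layout)
    model.optimize(batch_size=64)
    model.predict(batch)  # warm up
    start = perf_counter()
    for _ in range(1000):
        model.predict(batch)
    print(f'{layout}: {(perf_counter() - start) / 1000 * 1e6:.0f} us/call')
```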
I'm not too surprised that the breadth-first layout would perform worse at this depth. In general, we should get a slightly higher L2 cache hit rate starting around depth 4 for the depth-first layout, though that is not always the determinant of performance for a whole model. The overall performance is still a puzzle to me, though. I'm working on generating models with a range of parameters similar to the one you provided to help isolate where the issue is.
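For anyone else trying to reproduce this without the attached model, a forest with similar shape parameters can be generated with LightGBM along these lines (a sketch, assuming standard LightGBM APIs; the objective and data distribution of the original model are unknown):

```python
import lightgbm as lgb
import numpy as np

# Synthetic data with the reported feature width (210 features).
rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 210)).astype(np.float32)
y = rng.integers(0, 2, size=X.shape[0])

# Roughly the reported shape: 800 trees, num_leaves=256.
booster = lgb.train(
    {'objective': 'binary', 'num_leaves': 256, 'verbosity': -1},
    lgb.Dataset(X, label=y),
    num_boost_round=800,
)
booster.save_model('model.txt')  # file name used by the benchmark script above
```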
Describe the bug
For a forest with 800 trees, num_leaves=256, and an input feature dimension of 210, GPU inference across multiple batch sizes (from 1 to 500, in steps of 10) is 4-5 times slower than the old implementation.
Some performance stats:
The non-experimental GPU method took around 110 microseconds per inference batch of < 64 samples.
The experimental FIL took around 450 microseconds for the same batch.
Is this performance degradation expected? It seems to violate the first design goal of the experimental FIL project ('Provide state-of-the-art runtime performance for forest models on GPU, especially for cases where CPU performance will not suffice (e.g. large batches, deep trees, many trees, etc.).')
Any hints on how to improve performance for the experimental version? If needed, I can provide the model file.
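A minimal sketch of the comparison behind those numbers (the loading calls mirror the maintainer's benchmark script elsewhere in this thread; timings are per predict call at batch size 64 on random inputs):

```python
import cupy as cp
import treelite
from time import perf_counter
from cuml import ForestInference as FIL
from cuml.experimental.fil import ForestInference as FILEX

tl_model = treelite.frontend.load_lightgbm_model('model.txt')
fil = FIL().load_from_treelite_model(tl_model, precision='float32')
filex = FILEX.load('model.txt', precision='float32')

batch = cp.random.uniform(size=(64, 210), dtype='float32')  # 210 features

for name, model in (('original FIL', fil), ('experimental FIL', filex)):
    model.predict(batch)  # warm up
    start = perf_counter()
    for _ in range(1000):
        model.predict(batch)
    print(f'{name}: {(perf_counter() - start) / 1000 * 1e6:.0f} us per batch')
```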
Expected behavior
Experimental FIL inference should run at a speed at least not much slower than the original version.
Environment details (please complete the following information):
| Package | Version | Build | Channel |
| --- | --- | --- | --- |
| cuml | 25.02.00a42 | cuda12_py312_250109_g225d0aaa0_42 | rapidsai-nightly |
| libcuml | 25.02.00a42 | cuda12_250109_g225d0aaa0_42 | rapidsai-nightly |
| libraft | 25.02.00a32 | cuda12_250109_g8fc988e1_32 | rapidsai-nightly |
| libraft-headers | 25.02.00a32 | cuda12_250109_g8fc988e1_32 | rapidsai-nightly |
| libraft-headers-only | 25.02.00a32 | cuda12_250109_g8fc988e1_32 | rapidsai-nightly |
| pylibraft | 25.02.00a32 | cuda12_py312_250109_g8fc988e1_32 | rapidsai-nightly |
| raft-dask | 25.02.00a32 | cuda12_py312_250109_g8fc988e1_32 | rapidsai-nightly |
| treelite | 4.3.0 | py312h01abfbf_0 | conda-forge |
| librmm | 25.02.00a37 | cuda12_250109_gc1ccdadb_37 | rapidsai-nightly |
| rmm | 25.02.00a37 | cuda12_py312_250109_gc1ccdadb_37 | rapidsai-nightly |