
[BUG] FIL: 4-5x slowdown for experimental FIL compared to old implementation on GPU #6214

Open
Kaiyang-Chen opened this issue Jan 10, 2025 · 10 comments
Labels
? - Needs Triage (Need team to review and classify) · bug (Something isn't working)

Comments

@Kaiyang-Chen

Describe the bug
For a forest with 800 trees, num_leaves=256, and an input feature dimension of 210, GPU inference across multiple batch sizes (from 1 to 500 in steps of 10) is 4-5 times slower than the old implementation.
Some performance stats:
The non-experimental GPU method took around 110 microseconds for inference on batches of < 64 samples.
The experimental FIL took around 450 microseconds for the same batches.

Is this performance degradation reasonable? I think it violates the first design goal of the experimental FIL project ('Provide state-of-the-art runtime performance for forest models on GPU, especially for cases where CPU performance will not suffice (e.g. large batches, deep trees, many trees, etc.).')

Any hints on how to improve performance for the experimental version? If needed, I can provide the model file.

Expected behavior
Experimental FIL inference at a speed at least not much slower than the original version.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: centos7.9
  • GPU Model/Driver: V100, driver 550.54.15
  • CUDA: 12.4
  • Method of cuDF & cuML install: conda
    cuml 25.02.00a42 cuda12_py312_250109_g225d0aaa0_42 rapidsai-nightly
    libcuml 25.02.00a42 cuda12_250109_g225d0aaa0_42 rapidsai-nightly
    libraft 25.02.00a32 cuda12_250109_g8fc988e1_32 rapidsai-nightly
    libraft-headers 25.02.00a32 cuda12_250109_g8fc988e1_32 rapidsai-nightly
    libraft-headers-only 25.02.00a32 cuda12_250109_g8fc988e1_32 rapidsai-nightly
    pylibraft 25.02.00a32 cuda12_py312_250109_g8fc988e1_32 rapidsai-nightly
    raft-dask 25.02.00a32 cuda12_py312_250109_g8fc988e1_32 rapidsai-nightly
    treelite 4.3.0 py312h01abfbf_0 conda-forge
    librmm 25.02.00a37 cuda12_250109_gc1ccdadb_37 rapidsai-nightly
    rmm 25.02.00a37 cuda12_py312_250109_gc1ccdadb_37 rapidsai-nightly
@Kaiyang-Chen added the ? - Needs Triage and bug labels Jan 10, 2025
@dantegd
Member

dantegd commented Jan 10, 2025

Thanks for the issue @Kaiyang-Chen, the model file would be extremely useful. @wphicks and @hcho3 might be the best people to help here, though I think they are not around until next week.

@hcho3
Contributor

hcho3 commented Jan 11, 2025

Yes, it will help us tremendously if you are able to share the model with us.

Note: I am taking time off for the next two weeks, until Jan 25. I will be able to start troubleshooting the performance issue then.

@Kaiyang-Chen
Author

Sure, I have attached the model file here: model_20241106.txt. Thanks for the help! @dantegd @hcho3

@wphicks
Contributor

wphicks commented Jan 13, 2025

I'm happy to dig into this in more depth, but I'm almost certain I can give you an answer based on what we have here already.

In addition to more fundamental changes, experimental FIL also updates the way we choose default hyperparameters. Original FIL selected those parameters based largely on implementation details, but experimental FIL defaults to hyperparameters that give the best throughput for large batches. At batch size 64, that's definitely going to give a significant performance degradation.

As a quick test, experimental FIL offers the new .optimize method, which will select optimal hyperparameters based on batch size or data characteristics. Try calling that method on your FIL model with argument batch_size=64 before running your benchmark.
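
A minimal sketch of that suggestion, reusing the same load call that appears in the benchmark script later in this thread (the model path and the random input array here are illustrative):

import numpy as np
from cuml.experimental.fil import ForestInference as FILEX

# Load the LightGBM model with experimental FIL (path is illustrative)
filex = FILEX.load('model.txt', precision='float32')

# Re-select hyperparameters for the batch size actually being served
# (64 here, matching the numbers reported above)
filex.optimize(batch_size=64)

# Run inference as usual; 210 features, as in the model described above
X_batch = np.random.rand(64, 210).astype(np.float32)
preds = filex.predict(X_batch)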

If you're still seeing a performance degradation, we can dig into this a lot more carefully. Thanks very much for the report! This is exactly the sort of thing we want to catch before promoting experimental FIL to stable.

@Kaiyang-Chen
Author

Kaiyang-Chen commented Jan 14, 2025

If you're still seeing a performance degradation, we can dig into this a lot more carefully.

Yes, I've tried tuning the parameters by hand, using a range of reasonable chunk_size values and both tree layouts (depth / width). It does affect performance, but only within a 20-30% range; I cannot produce a result that is even close to the stable version. @wphicks
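
For reference, a manual sweep of that sort might look like the sketch below (illustrative only: it assumes the layout load option and the chunk_size predict argument of the experimental FIL API, and uses the attached model file name; the swept values are not a recommendation):

import itertools
import numpy as np
from time import perf_counter
from cuml.experimental.fil import ForestInference as FILEX

X = np.random.rand(64, 210).astype(np.float32)  # batch of 64, 210 features

# Assumed API surface: 'layout' at load time, 'chunk_size' at predict time
for layout, chunk_size in itertools.product(
    ('depth_first', 'breadth_first'), (1, 2, 4, 8, 16, 32)
):
    model = FILEX.load(
        'model_20241106.txt', precision='float32', layout=layout
    )
    model.predict(X, chunk_size=chunk_size)  # warmup
    start = perf_counter()
    for _ in range(10):
        model.predict(X, chunk_size=chunk_size)
    elapsed = perf_counter() - start
    print(f'{layout=} {chunk_size=}: {elapsed:.2E}s')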

@wphicks
Contributor

wphicks commented Jan 15, 2025

Okay, in that case, let's dig into it more systematically. Can you post your benchmarking code so I can try for an exact repro?

@wphicks
Contributor

wphicks commented Jan 15, 2025

I was able to reproduce the regression with the code below. Very interesting! This is a domain (shallow trees, small batches, wide inputs) where experimental FIL has seen lower performance at times, but I haven't seen any other model where performance has suffered this much. I'll investigate further. Can you confirm that the code below at least generally matches how you performed your own benchmarks?

import cupy as cp
import logging
import numpy as np
import treelite
from cuml import ForestInference as FIL
from cuml.experimental.fil import ForestInference as FILEX
from pandas import DataFrame
from time import perf_counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load():
    tl_model = treelite.frontend.load_lightgbm_model('model.txt')
    return (
        FIL().load_from_treelite_model(
            tl_model,
            precision='float32'
        ),
        FILEX.load(
            'model.txt',
            precision='float32'
        )
    )


def run(
    fil,
    filex,
    *,
    batch_size=None,
    min_batch_size=1,
    max_batch_size=131072,
    iterations=10,
    warmup_iterations=2,
    format='cupy',
    results=None
):
    if batch_size is None:
        batch_size = min_batch_size
    if results is None:
        results = {
            'batch_size': [],
            'FIL': [],
            'FILEX': []
        }
    results['batch_size'].append(batch_size)
    if format == 'cupy':
        xpy = cp
    elif format == 'numpy':
        xpy = np
    else:
        raise ValueError(f'Unsupported format: {format}')

    dtype = filex.forest.get_dtype()
    # TODO(wphicks): set range based on model.txt for each feature
    warmup_batches = xpy.random.uniform(
        xpy.finfo(dtype).min / 2,
        xpy.finfo(dtype).max / 2,
        size=(warmup_iterations, batch_size, filex.forest.num_features())
    )
    batches = xpy.random.uniform(
        xpy.finfo(dtype).min / 2,
        xpy.finfo(dtype).max / 2,
        size=(iterations, batch_size, filex.forest.num_features())
    )
    # Re-tune experimental FIL hyperparameters for this batch size
    filex.optimize(batch_size=batch_size)
    for name, model in (('FIL', fil), ('FILEX', filex)):
        for i in range(warmup_iterations):
            model.predict(warmup_batches[i])
        start = perf_counter()
        for i in range(iterations):
            model.predict(batches[i])
        elapsed = perf_counter() - start
        results[name].append(elapsed)
        logger.info(
            f'Run at batch size {batch_size} completed in'
            f' {elapsed:.2E}s with {name}'
        )

    # Binary-search for the crossover batch size at which FILEX catches up to FIL
    if results['FIL'][-1] < results['FILEX'][-1]:
        next_batch_size = batch_size + (
            (max_batch_size - batch_size) // 2
        )
        min_batch_size = batch_size
    else:
        logger.info(f'FILEX outperformed FIL at batch size {batch_size}')
        next_batch_size = batch_size - (
            (batch_size - min_batch_size) // 2
        )
        max_batch_size = batch_size
    if (
        next_batch_size < min_batch_size or
        next_batch_size >= max_batch_size or
        next_batch_size == batch_size
    ):
        return DataFrame.from_dict(results)
    else:
        return run(
            fil,
            filex,
            batch_size=next_batch_size,
            min_batch_size=min_batch_size,
            max_batch_size=max_batch_size,
            iterations=iterations,
            warmup_iterations=warmup_iterations,
            format=format,
            results=results
        )


if __name__ == '__main__':
    fil, filex = load()
    df = run(fil, filex)
    df = df.sort_values(by='batch_size')  # sort_values returns a new DataFrame
    print(df.to_csv(index=False))

@Kaiyang-Chen
Author

Kaiyang-Chen commented Jan 16, 2025

Can you confirm that the code below at least generally matches how you performed your own benchmarks?

Yes, the procedure is similar. Two tiny differences: I am only testing up to batch size 500, and I am using the C++ backend directly (which should not cause any difference).

@Kaiyang-Chen
Author

Kaiyang-Chen commented Jan 16, 2025

Another interesting thing: as you mentioned, the forest has relatively shallow trees, yet the width (breadth-first) layout gives worse performance than the depth-first layout.

@wphicks
Contributor

wphicks commented Jan 16, 2025

I'm not too surprised that breadth-first layout would perform worse for this depth. In general, we should get a slightly higher L2 cache hit rate starting around depth 4 for depth-first layout, though that is not always the determinant of performance for a whole model. The overall performance is still a puzzle to me though. I'm working on generating models with a range of parameters similar to the one you provided to help isolate where the issue is.
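
For anyone following along, a synthetic model matching the reported shape (800 trees, num_leaves=256, 210 features) could be generated with LightGBM roughly as follows (illustrative parameters and file name; not the script actually used for that investigation):

import lightgbm as lgb
import numpy as np

# Synthetic regression data with the reported feature width (210)
rng = np.random.default_rng(0)
X = rng.random((200_000, 210), dtype=np.float32)
y = rng.random(200_000)

# Match the reported forest shape: 800 trees, up to 256 leaves each
booster = lgb.train(
    {'objective': 'regression', 'num_leaves': 256, 'verbosity': -1},
    lgb.Dataset(X, label=y),
    num_boost_round=800,
)
booster.save_model('synthetic_model.txt')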
