[Roadmap] Phasing out the support for old binary format. #7547

trivialfis · 2022-01-09T20:05:19Z

XGBoost has a custom binary model format that has been used since day 1. Later in 1.0, we
introduced the JSON format as an alternative, which has a schema and has better
extensibility. The JSON format has been used as a default format for memory snapshot
serialization (pickle, rds, etc) and has extra features including categorical data support,
extra data feature names, and features types. However, for performance and compatibility
reasons we have continued the support for the old binary format. In 1.6 we plan to add
universal binary JSON as an extension to the current JSON format also as a replacement for the old
binary format.

Motivation

The old binary format is essentially copying internal structures like parameters, tree
nodes into a memory buffer, so it has a fixed memory layout that's difficult to change and
debug. If we look at the Learner class it's full of conditions to work around some
issues in binary format accumulated over the past. These issues root from the situation
that we can not change the binary output in any way, which also has an indirect impact on
how we write code. For instance, we can not change the RegTree structure due to how the
node is stored in the output and it's the very core of XGBoost. To overcome these issues
and clear some room for future development we need to phase out its use.

Roadmap

If the Universal Binary JSON implementation is accepted, I propose the following roadmap
for phasing out the support of the old binary format:

1.6: Add UBJSON and use it as the default format for full serialization (pickle, rds) as
default. Emit warning when users are loading old JSON format. This is necessary since
the default_left is changed from boolean to integer.
1.7: Emit warning when users save or load binary format.
2.0: Continue the warning about the binary format.
2.1: Continue the warning, and save the model as JSON or UBJ as default if the format is not specified. Save model in ubj as the default. #9947
2.2: Remove support for saving old binary format.
2.3: Remove support for loading old binary format. Remove support for the old JSON model.

note

JVM external checkpoint manager uses model suffix.
SHAP Fixes for XGBoost model load and GPU build. shap/shap#3462 .

The text was updated successfully, but these errors were encountered:

trivialfis · 2022-01-09T20:05:51Z

@hcho3

hcho3 · 2022-01-09T20:10:55Z

This is necessary since
the default_left is changed from boolean to integer.

How necessary is this? Was the default_left changed to improve performance?

trivialfis · 2022-01-09T20:43:02Z

Yes. Most of the improvement comes from the typed array where we can omit the construction of Json struct and the guessing work for the next element. But there's no typed array for boolean.

Actually, there is, but it's not quite useful. The representation of boolean is T and F characters, they are both type and value at the same time. So if we were to have a typed boolean array, the whole array would be either true or false.

We can continue the support for the current JSON model for a very long time since the additional code is not much (1 condition to check whether it's bool or int), but I think it's also quite easy to move away from it since users can simply replace True to 1 and False to 0 in the JSON file. I can create a script for doing just that.

mpetricek-corp · 2022-11-15T17:32:25Z

Is there a simple way to silence this warning "Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3." when using the java interface? I.e. ml.dmlc.xgboost4j.java.XGBoost class from ml.dmlc.xgboost-jvm_2.12 maven artifact.

The C code seems to accept some "verbosity" configuration, but so far I have not found way to set this config from the Java code.

pranshupant · 2024-12-24T11:18:35Z

@trivialfis @hcho3 I am working on a project that uses the JSON format to save, load and analyze the serialized model. Will we continue to have support for the JSON serialization format moving forward?

Given that JSON has much broader support across different languages/ libraries compared to UBJSON, it would be great to continue having that as a serialization option. Thanks!

trivialfis · 2025-01-02T07:37:49Z

We will support JSON. It shares the same code path with UBJson, so rest assured, they will live together.

stenvala · 2025-02-20T09:41:26Z

To us this is really, really, bad news.

We are running a lot of xgboost models (tens of thousands) and execute on-demand inference in AWS lambda (it needs to be synchronous and fast) as we have found this the most cost efficient way. Our models are of various size, but even 100-200 MB is normal, and we need to almost always deserialize model from EFS when we do inference.

We have now done some study with EFS for the duration of deserialization and one inference and have following results (this is without categorical features, so equivalent). .bin means load binary file, then there is model size and operation duration. .ubj means the new format.

Case 1
.bin -> 43 MB -> 120 ms
.ubj -> 68 MB -> 980 ms

Case 2
.bin -> 97 MB -> 260 ms
.ubj -> 160 MB -> 2000 ms

Case 3
.bin -> 10 MB -> 30 ms
.ubj -> 16 MB -> 260 ms

I can't see it very good to change the serialization format with such a huge performance degradation.

trivialfis · 2025-02-20T13:06:23Z

@stenvala Are you (re)loading the model for each and every inference call?

trivialfis · 2025-02-20T13:26:14Z

In general, we assume the model is persisted in memory during inference. Data locality still matters, and the inference call generally is less than a few ms. However, in your use case, with tens of thousands of models, it's probably not cost-effective to persist all of them in memory. I don't have a good solution yet.

Side note: We can actually improve the performance of UBJ by removing the old binary format. Currently, we have to use the old "array of structures" to keep compatibility with the binary model, resulting into frequent indexing of memory. If we move away from the old model, we can simply memcpy large arrays like leaf indices and split values.

cc @hcho3 might have better insight for model deployment in general.

hcho3 · 2025-02-20T14:56:29Z

@stenvala We will try to improve the serialization performance with UBJSON.

That said, for your use case, you might be able to use Treelite to speed up serialization, as Treelite was designed to be a fast binary serialization format that supports all modern features like categorical data support. Can you describe what kind of inference you perform? (Do you predict probabilities? SHAP values? or leaf IDs?) I am the author of the project, and I can help you write a new inference workflow.

As for the rationale for switching the serialization format for XGBoost: The old method of serializing XGBoost models was fast but impossible to extend (*); for example, many users have been asking for native support for categorical features, but the old serialization format was not flexible enough to support it. There is trade-off between ability for extension and speed, and for XGBoost we chose to prioritize the former.

(*) With careful design, you can make a binary serialization format with ability for future expansion. Unfortunately, the old XGBoost binary format was not designed with such consideration.

stenvala · 2025-02-20T15:50:11Z

@hcho3 today we started looking at Treelite and first experiments look like with that we can reach adequate performance with it even if not yet really there what it used to be. We have margin though.

Great to hear that there are also efforts to increase performance with UBJSON.

I understand the rationale for the change. Just wanted to bring up the need for deserialize and dump needs where inference speed is not that important but model loading is.

stenvala · 2025-02-20T15:54:14Z

@stenvala Are you (re)loading the model for each and every inference call?

We use some caching in RAM but due to so many models (and the nature of lambda and many parallel request still maximum ~100) it's luck if the model is hot.

trivialfis added the type: roadmap label Jan 10, 2022

trivialfis pinned this issue Jul 29, 2022

trivialfis mentioned this issue Nov 8, 2022

Easiest way to get version info of a XGBoost model through C++ interface #8435

Closed

trivialfis mentioned this issue Apr 11, 2023

XGBoost 2.0 Model Format and Backwards Compatibility #9017

Closed

dotbg added a commit to dotbg/xgboost that referenced this issue Jul 12, 2023

Refs. dmlc#7547. Using as default serialization format.

a4fe177

trivialfis mentioned this issue Jul 14, 2023

Refs. #7547. Using ubj as default serialization format in jvm-packages. #9382

Open

thatlittleboy mentioned this issue Oct 8, 2023

BUG: XGboost binary format is being deprecated shap/shap#3321

Closed

4 tasks

matthewdeng mentioned this issue Nov 28, 2023

[train] update XGBoost model format to UBJ ray-project/ray#41442

Open

8 tasks

trivialfis mentioned this issue Dec 5, 2023

Proposal for new R interface (discussion thread) #9734

Open

trivialfis mentioned this issue Jan 3, 2024

Save model in ubj as the default. #9947

Merged

trivialfis mentioned this issue May 15, 2024

Add more detailed explanations and tutorial for advanced objectives #10283

Merged

trivialfis mentioned this issue Jun 26, 2024

Remove support for deprecated format in Python. #10490

Merged

trivialfis mentioned this issue Feb 13, 2025

R package changelog #11195

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Roadmap] Phasing out the support for old binary format. #7547

[Roadmap] Phasing out the support for old binary format. #7547

trivialfis commented Jan 9, 2022 •

edited

Loading

trivialfis commented Jan 9, 2022

hcho3 commented Jan 9, 2022

trivialfis commented Jan 9, 2022 •

edited

Loading

mpetricek-corp commented Nov 15, 2022

pranshupant commented Dec 24, 2024

trivialfis commented Jan 2, 2025

stenvala commented Feb 20, 2025 •

edited

Loading

trivialfis commented Feb 20, 2025 •

edited

Loading

trivialfis commented Feb 20, 2025 •

edited

Loading

hcho3 commented Feb 20, 2025 •

edited

Loading

stenvala commented Feb 20, 2025

stenvala commented Feb 20, 2025

[Roadmap] Phasing out the support for old binary format. #7547

[Roadmap] Phasing out the support for old binary format. #7547

Comments

trivialfis commented Jan 9, 2022 • edited Loading

Motivation

Roadmap

note

trivialfis commented Jan 9, 2022

hcho3 commented Jan 9, 2022

trivialfis commented Jan 9, 2022 • edited Loading

mpetricek-corp commented Nov 15, 2022

pranshupant commented Dec 24, 2024

trivialfis commented Jan 2, 2025

stenvala commented Feb 20, 2025 • edited Loading

trivialfis commented Feb 20, 2025 • edited Loading

trivialfis commented Feb 20, 2025 • edited Loading

hcho3 commented Feb 20, 2025 • edited Loading

stenvala commented Feb 20, 2025

stenvala commented Feb 20, 2025

trivialfis commented Jan 9, 2022 •

edited

Loading

trivialfis commented Jan 9, 2022 •

edited

Loading

stenvala commented Feb 20, 2025 •

edited

Loading

trivialfis commented Feb 20, 2025 •

edited

Loading

trivialfis commented Feb 20, 2025 •

edited

Loading

hcho3 commented Feb 20, 2025 •

edited

Loading