Proposal for new R interface (discussion thread) #9734

Open
Tracked by #11249
david-cortes opened this issue Oct 30, 2023 · 120 comments
Comments

@david-cortes
Contributor

david-cortes commented Oct 30, 2023

ref #9475
ref #9810 (roadmap)

I'm opening this issue to propose a high-level overview of how an idiomatic R interface for xgboost should ideally look, along with some thoughts on how it might be implemented (note: I'm not very familiar with xgboost internals, so I have several questions). It would be ideal to hear comments from other people on these ideas. I'm not aware of everything that xgboost can do, so I might be missing something important here.

(Perhaps a GitHub issue is not the most convenient format for this, but I don't know where else to put it.)

Looking at the python and C interfaces, I see that there's a low-level interface with objects like Booster and associated functions/methods like xgb.train that more or less reflects how the C interface works, and then there's a higher-level scikit-learn interface that wraps the lower-level functions into a friendlier API that works more or less the same way as other scikit-learn objects, scikit-learn being the most common ML framework for python.

In the current R interface, there's to some degree a division into lower- and higher-level interfaces with xgb.train() and xgboost(), but the separation is weaker: both return objects of the same class, and the high-level interface is not very high-level, as it doesn't take the same kinds of arguments as base R's stats::glm() or other popular R packages - e.g. it accepts neither formulas nor data frames.


In my opinion, a more ideal R interface could consist of a mixture of a low-level interface reflecting the way the underlying C API works (which ideally most reverse dependencies and very latency-sensitive applications should be using), just like the current python interface, plus a high-level interface that would make it ergonomic to use at the cost of some memory and performance overhead, behaving similarly to core and popular R packages for statistical modeling. Especially important in this higher-level interface would be to make use of all the rich metadata that R objects have (e.g. column names, levels of factor/categorical variables, specialized classes like survival::Surv, etc.), for both inputs and outputs.

In the python scikit-learn interface, there are different classes depending on the task (regression / classification / ranking) and the algorithm mode (boosting / random forest). This is rather uncommon in R packages - for example, randomForest() will switch between regression and classification according to the type of the response variable - and I think it could be better to keep everything in the high-level interface under a single function xgboost() and single class returned from that function, at the expense of more arguments and more internal metadata in the returned objects.

In my view, a good separation between the two would involve:
  • Having the low-level interface work only with DMatrix (which could be created from different sources like files, in-memory objects, etc.), and the high-level interface work only with common R objects (e.g. data.frame, matrix, dgCMatrix, dgRMatrix, etc.), but not with DMatrix (IMO this would not only make the codebase and docs simpler, but, given that the two have different metadata, would also make it easier to know what to expect as outputs from other functions).
  • Having separate classes for the outputs returned from both interfaces, with different associated methods and method signatures, so that e.g. the predict function for the object returned by xgb.train() would mimic the prediction function from the C interface and be mindful of details like not automatically transposing row-major outputs, while the predict function from the object returned from xgboost() would mimic base R and popular R packages and be able to e.g. return named factors for classification objectives.
    • The plot functions perhaps could be shared, with some extra arguments being required for the low-level interface but forbidden for the high-level interface.
  • Having separate associated cv functions for the low-level and the high-level interfaces (e.g. xgboost.cv() or cv.xgboost(), like there is cv.glmnet(); and another xgb.cv() that would work with DMatrix).
  • Keeping extra metadata inside the object returned by the high-level interface, such as column names, types, factor levels, the actual string of the function call (the way base R does), etc.
  • Exporting from the package only what's necessary to support these two modes of operation alongside with their methods - e.g. not exporting internal functions like xgboost::normalize or internal classes/methods like Booster.handle.
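To make the proposed split concrete, here is a minimal sketch of the two-class design. All names here (low_level_train, high_level_xgboost, the "sketch.*" classes) are hypothetical placeholders, not the actual xgboost API:

```r
# Hypothetical sketch of the proposed separation: the low-level
# constructor returns a bare booster object, while the high-level one
# wraps it together with R-side metadata. Class and field names are
# placeholders, not the real xgboost API.
low_level_train <- function(params, data, nrounds) {
  structure(list(params = params, nrounds = nrounds),
            class = "sketch.Booster")
}

high_level_xgboost <- function(x, y, objective = NULL, nrounds = 100L) {
  booster <- low_level_train(list(objective = objective), x, nrounds)
  structure(
    list(
      booster       = booster,
      feature_names = colnames(x),
      y_levels      = if (is.factor(y)) levels(y) else NULL,
      call          = match.call()
    ),
    class = "sketch.xgboost"
  )
}

# Each class would then get its own predict() method with different
# conventions, e.g. raw row-major output for the low-level class and
# base-R-style output (factors, row names kept) for the high-level one.
```

With this structure, S3 dispatch keeps the two method families entirely separate, and the high-level object carries the metadata (column names, factor levels, the call) that the bare booster lacks.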
A couple extra details **about the high-level interface** that in my opinion would make it nicer to use. I think many people might disagree with some of the points here though so would like to hear opinions on these ideas:
  • It should have an x/y interface, and perhaps also a formula interface (details on the latter to follow).
  • The x/y interface should accept x in the format of common R classes like data.frame, matrix, dgCMatrix, perhaps arrow (although I wouldn't consider it a priority), and perhaps other classes like float32 (from the float package). I don't think it'd be a good idea to support more obscure potential classes like dist that aren't typically used in data pipelines.
    • If I'm not mistaken, CSR is supported for prediction but not for training, so predict should additionally take dgRMatrix and sparseVector like it does currently.
    • I am not into GPU computing, nor into distributed computing, so perhaps I might be missing important classes here.
    • If supporting Spark in R, I guess the best way would be to have a separate package dedicated to it.
  • (Controversial) It should not only accept but also require different classes for y depending on the objective, and by default, if not passed, should select an objective based on the type of y:
    • factor with 2 levels -> binary:logistic.
    • factor with >2 levels -> multi:softmax.
    • survival::Surv -> survival:aft (with normal link), or maybe survival:cox if only right-censored.
    • others -> reg:squarederror.
    • And consequently, require factor types for classification, Surv for survival, and numeric/integer for regression (taking logical (boolean) as 0/1 numeric).
    • If multi-column response/target variables are to be allowed, they should be required to be 2-dimensional (i.e. no vector recycling), as either data.frame or matrix, and their column names should be kept as metadata if they have any.
  • If x is a data.frame, it should automatically recognize factor columns as categorical types, and (controversial) also take character (string) class as categorical, converting them to factor on the fly, just like R's formula interface for GLMs does. Note that not all popular R packages doing the former would do the latter.
  • (Controversial) It should only support categorical columns in the form of an input x being a data.frame with factor/character types. Note: see question below about sparse types and categorical types.
  • If x is either a data.frame or otherwise has column names, then arguments that reference columns in the data should be able to reference them by name.
    • For example, if x has 4 columns [c1, c2, c3, c4] it should allow passing monotone_constraints as either c(-1, 0, 0, 1), or as c("c1" = -1, "c4" = 1), or as list(c1 = -1, c4 = 1); but not as e.g. c("1" = -1, "4" = -1).
    • OTOH if x is a matrix/dgCMatrix without column names, it should accept them as c(-1, 0, 0, 1), or as c("1" = -1, "4" = -1), erroring out on non-integer names, not allowing negative numbers, and not allowing list with length<ncols as the matching would be ambiguous.
    • If there are duplicated column names, I think the most logical scenario here would be to error out with an informative message.
  • Regardless of whether x has column names, column-vector inputs like qid should be taken from what's passed on the arguments, and not guessed from the column names of x like the python scikit-learn interface does.
  • It should have function arguments (for IDE autocomplete) and in-package documentation for all of the accepted training parameters. Given the amount of possible parameters, perhaps the most reasonable way would be to put them into a separate generator function xgb.train_control() or similar (like C50::C5.0Control) that would return a list, or perhaps it should have them all as top-level arguments. I don't have a particular preference here, but:
    • If parameters are moved to a separate train control function, I think at least the following should remain top-level arguments: objective, nthread, verbosity, seed (if not using R's - see next point below), booster; plus arguments that reference columns names or indices: monotone_constraints, interaction_constraints; and data inputs like base_score or qid. nrounds I guess is most logical to put in the list of additional parameters, but given that it's a required piece of metadata for e.g. determining output dimensions, perhaps it should also be top-level.
    • If using a separate train control function, given that R allows inspecting the function call, it might be better to create a list with only the arguments that were explicitly supplied - e.g. if passing only something like xgb.train_control(eta=0.01), it could return just list(eta = 0.01) instead of a full list of parameters. In this case, it might not be strictly required to know what the default values of every parameter are for the purposes of developing this interface, but it would be nice if they were easily findable from the signature anyway.
  • (Controversial) Not a high-priority thing, but would be ideal if one could pass seed and only rely on R's PRNG if seed is not passed.
  • It should have more informative print/show and summary methods that would display info such as the objective, booster type, rounds, dimensionality of the data to which it was fitted, whether it had categorical columns, etc.
    • I think the summary and the print method could both print the exact same information. Not sure if there's anything additional worth putting into the summary method that wouldn't be suitable for print. Perhaps evaluation metric or objective function per round could be shown (head + tail) for summary, but maybe it's not useful.
  • It should stick to common R idioms and conventions in terms of inputs and outputs - for example, print should not only print but also return the model object as invisible, methods like predict should take the same named arguments as others like predict(object, newdata, type, ...), return an output with one row per input in the data, keep row names in the output if there were any in the input, use name (Intercept) instead of BIAS, etc.
    • A tricky convention in this regard is base-1 indexing - there are many places where base-0 indexing is used, like interaction_constraints, iterationrange for prediction, node indices from predict, etc. Perhaps the high-level interface could do the conversion internally, but I'm not sure whether I'm missing more potential places where indices might be getting used, and more such parameters and outputs will probably be added in the future.
  • Following up on the point above, the predict function should:
    • Accept a type argument with potential values like "response", "class", "raw", "leaf", "contrib", "interaction", "approx.contrib", "approx.interaction" (not sure what to name the last 2 though).
    • Output factor types for class type in classification objectives.
    • Use column names of x for outputs of contrib/interaction, and factor levels of y as column names from e.g. response in multi-class classification and class-specific contrib/interaction (named as level:column, like base R).
  • Xgboost functions should avoid doing in-place modifications to R objects other than internal classes like DMatrix or handles - e.g. calling the plotting function should not leave the input with an additional Importance column after the call is done.
  • The plotting and feature importance functions should not require passing additional metadata, using instead the names that the input x had.
    • The function for plotting feature importances should in turn not require passing the calculated feature importances first, calculating them internally on-the-fly if not passed, given that it will have all the required inputs already.
  • There should be a function to convert objects between the low-level and the high-level R interface, which in the absence of metadata, could be done by autofilling column names with "V1", "V2", etc. and factor levels with "1", "2", "3", etc.
  • (Controversial) The predict function for the x/y interface should have a parameter to determine whether it should subset/re-order columns of newdata according to the column names that were used in training, and in this case, it should also recode factor levels. Note that this subsetting and reordering is always done for base R's predict.glm, but factors are not re-coded there (so e.g. if the levels in factor columns of newdata differ from the training data, predict.glm would instead output something that might not make sense).
  • Custom objectives and custom metrics for the high-level interface should be passed the same kinds of response variables that were used for input - e.g. a factor instead of the binarized numeric vector that xgboost will use internally.
    • Perhaps custom objectives and metrics could be required to be passed as a list with necessary fields instead of as separate arguments, and required to have extra metadata like whether they are regression/classification, a potential inverse link function for predict that would be kept inside the output, etc. Don't have a strong preference here though, and I expect most users of custom objectives would be using the low-level interface anyway.
  • (Very low priority) Perhaps the high-level interface should allow using base R's "families" (as returned by e.g. stats::poisson(link = "log")) and the structure they involve as custom objectives (like glmnet does), so that one could pass a family-compliant object from other packages as objective. I see an issue here though in that these families are meant for Fisher scoring, which means that in the case of non-canonical link functions like binomial(link = "probit"), they wouldn't calculate the true Hessian, but I guess Fisher's method should work just as well with gradient boosting. Not an expert in this topic though.
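As an illustration of the default-objective mapping proposed above, a hedged sketch (infer_objective is a hypothetical helper; the Surv branch assumes the survival package's class and its "type" attribute):

```r
# Hypothetical helper sketching the proposed default-objective
# selection based on the class of `y`. Illustrative only.
infer_objective <- function(y) {
  if (inherits(y, "Surv")) {
    # right-censored only -> Cox; other censoring types -> AFT
    if (identical(attr(y, "type"), "right")) "survival:cox"
    else "survival:aft"
  } else if (is.factor(y)) {
    if (nlevels(y) == 2L) "binary:logistic" else "multi:softmax"
  } else {
    # everything else, including logical taken as 0/1 numeric
    "reg:squarederror"
  }
}
```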

The low-level interface in turn should support everything that the C interface offers - e.g. creating DMatrix from libsvm files, specifying categorical columns from a non-data-frame array, etc.


As for the formula interface, this is quite tricky to implement well for xgboost.

In the case of linear models, it's quite handy to have formulas create transformed features on the fly in order to find good transformations and e.g. be able to call stepwise feature selectors on the result, but:

  • Stepwise selection from base R doesn't work with xgboost.
  • Transformations like log(x1) have no effect in decision trees, transformations like x1:x2 don't have the same practical effect as decision trees implicitly create interactions, and transformations like x^2 the way R does them do not make a difference for decision trees compared to simpler I(x^2)+x, for example.
  • Other outputs from formulas like contrasts are not used either way.
  • Transformations like log(y) aren't any harder to do with an x/y interface than with a formula interface.

Nevertheless, a formula interface can still be handy for calls like y ~ x1 + x2 + x3 when there are many columns, and packages like ranger offer such an interface, so it might be worth having one even if it's not the recommended way to use the package.
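For reference, a minimal formula front-end built on base R's own model tools could look roughly like the sketch below. formula_to_xy is a hypothetical helper; a real implementation would hand the extracted pieces to the proposed x/y entry point:

```r
# Sketch of a formula front-end delegating to base R's model tools.
# Illustrative only; a real version would forward x and y to the
# hypothetical high-level x/y function instead of returning them.
formula_to_xy <- function(formula, data) {
  mf <- stats::model.frame(formula, data)
  y  <- stats::model.response(mf)
  # model.matrix() applies the formula's encoding (dummy/one-hot for
  # factors); drop the auto-added "(Intercept)" column afterwards
  x  <- stats::model.matrix(attr(mf, "terms"), mf)
  x  <- x[, colnames(x) != "(Intercept)", drop = FALSE]
  list(x = x, y = y)
}
```

Note that with the intercept present, factor columns come out dummy-encoded rather than being kept as native categorical columns, which is exactly the kind of nuance discussed next.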

Some nuances in terms of formulas though:

  • Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept, and has no use for a column named (Intercept) that would have the same value for every row.
  • Base R's formulas will either do one-hot or dummy encoding for factor/character columns according to whether there is an intercept or not, while xgboost now has native support for categorical columns.
  • Using a formula for processing the training data also implies using it for predict - so for example, formulas do not recode levels of factor variables when calling predict, which the x/y interface could potentially do, leading to differences in behaviors between both interfaces.
  • Some packages have their own formula parsers which allow additional interpretations for something like y ~ x | z beyond what base R would do (for example, lme4 would interpret z here as something that controls mixed effects for x, while base R would interpret it as a feature "x or z"), and in some cases xgboost() would also need a different interpretation of formulas (e.g. for the parameter qid, which doesn't fit on either side of the formula).
  • Packages like randomForest don't use the base R formula parser, instead taking one (by copying the code) from a different library, e1071, which is GPLv2-licensed and thus incompatible with xgboost's Apache license.
  • Referring to columns that are the result of a formula by index in e.g. monotone_constraints could be tricky - e.g. if we remove the auto-added (Intercept) column, should the numbers re-adjust?

Hence, supporting a formula interface for xgboost() would be tricky:

  • It could be possible to use base R's formula parser, but:
    • In this case it would not be possible to use categorical features (or their interactions) as such, as they will be either dummy- or one-hot encoded.
    • It theoretically could be possible to forcibly add -1 at the end of the formula (which means "don't add an intercept") by converting it to a string and back, in order to get one-hot encoding of factors and avoid adding (Intercept), but I can foresee cases in which this might not work depending on how the formula is input.
    • It would not be possible to specify additional columns like qid in these formulas.
  • It could be possible to develop a custom formula parser that would not one-hot-encode categorical features and e.g. interpret | differently, but what should happen in this case if the user requests something like xnum*xcat or f(xcat) (with f being some arbitrary base function) ?

Unless we can find some other package that would better handle formula parsing and that could be reasonable to use as dependency (I'm not aware of any), I think the best way forward here would be to:

  • Use base R's formula parser.
  • Don't support parameters like monotone_constraints or qid in the formula interface.
  • Don't support native categorical columns, relying instead on the one-hot or dummy-encoding from the formula.
  • Try to remove the intercept by creating a new formula <curr formula> - 1 and error out if this doesn't succeed.
  • Let the formula handle all the predict post-processing, regardless of how the x/y interface does it.
  • Encourage users and reverse dependencies to avoid the formula interface for serious usage by being clear about all these limitations in the docs and having a warning message in the "description" section (which comes before the signatures in the doc). We're not in control of how reverse dependencies use it, but if any of the larger ones is tempted to use the formula interface anyway, this could be followed up in their GitHub repos.
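The intercept-removal step above doesn't actually require string round-tripping: base R's stats::update() can rewrite the formula directly. A small sketch, including the proposed error-out check:

```r
# Sketch of the proposed intercept removal: rewrite the formula with
# stats::update() instead of converting it to a string and back, and
# error out if the intercept is still present afterwards.
drop_intercept <- function(fml) {
  fml2 <- stats::update(fml, . ~ . - 1)
  if (attr(stats::terms(fml2), "intercept") != 0L)
    stop("could not remove the intercept from the formula")
  fml2
}
```

For example, drop_intercept(y ~ a + b) yields y ~ a + b - 1, whose terms() object reports an intercept attribute of 0.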

A couple questions for xgboost developers (@hcho3 @trivialfis ?) I have here:
  • Does xgboost support passing a mixture of dense and sparse data together? pandas supports sparse types, but if I understand it correctly from a brief look at the source code of the python interface, it will cast them to dense before creating the DMatrix object. If it supports mixtures of both, I think it'd again be ideal if the R interface could also have a way to create such a DMatrix for the low-level interface. Not sure if there'd be an idiomatic way to incorporate this in the high-level interface though, unless allowing passing something like a list.
    • Is there a C-level DMatrix equivalent of cbind / np.c_ ? I don't see any but this would make things easier.
  • Can sparse columns be taken as categorical without being densified? If so, I guess specifying categorical columns of a sparse matrix should also be supported in the high-level R interface.
  • In other interfaces, does passing a data frame of different types always involve putting everything into a contiguous array of the same type? I see there's a C function XGDMatrixCreateFromDT, but if I understand correctly from a look at the pandas processing functions in the scikit-learn interface, if the input involves types like int64, these would be cast to a floating-point type. In R, especially when using data.table, it's quite common to have 64-bit integers from the package bit64, which, if cast to float64, would lose integral precision beyond 2^53 (similarly, int32 can lose precision when cast to float32) - I am wondering if there should be a way to support these without loss of precision, and whether it's possible to efficiently create a DMatrix from a structure like an R data.frame, which is a list of arrays that aren't contiguous in memory and might have different dtypes.
  • Should the R interface keep track of the objectives and their types, or should the C interface, for example, have functionality to determine whether an objective is e.g. binary classification? As it is right now, it'd be quite easy to do this in R, but I can foresee a situation in which someone in the future submits a PR adding a new objective like binary:cauchit, adds it to the python XGBClassifier class, but overlooks the R code as it might be unknown to the contributor, and the R interface then won't act properly on receiving this objective.
  • How does the python interface keep the list of parameters up-to-date with the C interface? Is it done manually by maintainers? Is there some way in which the R interface could automatically stay in sync with the C one every time a new parameter is added? Or will it need to manually track every possible allowed argument and its default value?
  • How do other interfaces deal with idiosyncrasies like base-0 vs. base-1 indexing? I see there's also a Julia interface, for example, which, just like R, uses base-1 indexing, but from a brief look at the docs it seems to stick with base-0 indexing for xgboost regardless.
  • Since the R object from the high-level interface in this case would have more metadata than what's kept in the serialized C Booster, this in theory could lead to issues when updating package versions and loading objects from an earlier version - for example, if a new field is added to the object class returned from xgboost(), an earlier object saved with saveRDS will not have such a field, which might lead to issues if it is assumed to exist. It'd be theoretically possible to add some function to auto-fill newly added fields as they are added to the R interface when e.g. restoring the booster handle, but this could translate into a lot of maintenance work, and be quite hard to test and easy to miss when adding new features.
    • How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster? How does it deal with converting between Booster and scikit-learn classes?
    • I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C Booster, but there could be cases in which the R interface changes independently of the C struct.
    • If one were to load a C Booster into the high-level interface and there's no metadata to take, I guess the most logical thing would be to fill with generic names "V1..N", "1..N", etc. like base R does, but this would not lead to a nice inter-op with the python scikit-learn interface. Does the python interface or other interfaces keep extra metadata that the Booster wouldn't? Is it somehow standardized?
  • How does xgboost determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it?
  • Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a DMatrix? If so, I guess support for arrow in R could be delayed until such a route is implemented. I guess the python interface would also benefit from such a functionality being available at the C level.

Some other questions for R users (@mayer79 ?):
  • Are there any popular data.frame-like objects from packages related to GPU computing (like cuDF for python) or distributed computing (other than spark) that I might perhaps be missing in this post?
  • Is there any popular R type for learning-to-rank tasks that would involve the qid and position information that xgboost uses? I am not aware of any but I'm no expert in this area.
  • Is there perhaps some object class from Bioconductor or from some other domain (like, I don't know, geopositioning from e.g. gdal) that should logically map to some specific xgboost objective without the user having to transform it to e.g. factor / numeric / etc.?
  • Is there some package offering a custom formula parser meant for decision tree models?
@hcho3
Collaborator

hcho3 commented Oct 30, 2023

I am the person who added the section a-compatibility-note-for-saveRDS-save to the manual, so let me answer the question regarding serialization.

How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster?

Does the python interface or other interfaces keep extra metadata that the Booster wouldn't?

The sklearn interface of XGBoost extracts the necessary key-value pairs from the object attributes, serializes them as a JSON string, and stores the JSON string in the attributes field of the C Booster object.

Admittedly this approach is not foolproof. This is why we ask users to save the model using XGBoost's native serializer (load_model() / save_model()). It must be noted that the base scikit-learn package provides no guarantee of backward compatibility for pickled model objects. Hence, Python users are accustomed to using pickle only for short-term storage and other formats for long-term storage.

It appears that the expectations are somewhat different in the R community. After seeing lots of bug reports arising from calling saveRDS on old model files (#5794), I had to add the big warning about saveRDS in a separate section a-compatibility-note-for-saveRDS-save.

I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C Booster, but there could be cases in which the R interface changes independently of the C struct.

Checking the version of the C Booster is sufficient, because we always bump the version of the native package and the R package in lockstep. For example, when the native package had its version bumped from 1.3.0 to 1.4.0, the R package had its version bumped from 1.3.0.1 to 1.4.0.1.

How does it deal with converting between Booster and scikit-learn classes?

You can extract the Booster class by calling the get_booster() method of XGBClassifier / XGBRegressor. To convert a Booster to XGBClassifier / XGBRegressor, first persist the Booster to a JSON file and then call load_model().

@trivialfis
Member

trivialfis commented Oct 31, 2023

Let me try to clear some of the questions before making any comments. ;-)

Does xgboost support passing a mixture of dense and sparse data together?

No, and it's unlikely in the near future. XGBoost mostly works with row-major data with
some limited support for column-major (like CSC, cuDF), but partitioning data by column is
a different story. One of the goals of categorical data support is to eliminate the need for dummy encoding.

Is there a C-level DMatrix equivalent of cbind / np.c_ ? I don't see any but this would make things easier.

No. The DMatrix object doesn't provide any means of data manipulation; it's just an
internal storage class for a machine learning library. Any feature that is not directly related to
ML (like feature engineering) is considered out of scope. There's a slice method for the cv function, which we
would really like to remove. If the input data is a numpy array, use numpy to manipulate it.

Can sparse columns be taken as categorical without being densified?

Yes. Internally, XGBoost marks the maximum coded (like ordinal-encoding) categorical value and
uses it as the number of categories in the respective column. The format of the column is
orthogonal. We want to allow some extra information from users like a user-specified number
of categories in the future, but this should not block the current R interface
development. (we will use the feature_type field)

I am wondering if there should be a way to support these without loss of precision

No. Internally, the decision tree is always split using f32. In the long run, we can probably
support f64 after the removal of binary serialization. But so far it hasn't been a
priority since in most cases some normalization for the data can get us very far.

Should the R interface keep book on the objectives and their types, or should for example the C interface have functionalities to determine if an objective is e.g. binary classification?

You can query the objective using XGBoosterSaveConfig, which is exposed in the current R
interface as xgb.config. It involves some JSON parsing though. You can cache the result in
an R object during model loading and after model training if it's performance-sensitive.
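For illustration, the lookup described here could be wrapped as below. This is a sketch that assumes xgb.config() returns the saved configuration as a JSON string whose layout contains learner$objective$name (as in the saved-config schema), and it uses the jsonlite package for parsing:

```r
library(jsonlite)  # assumed available for JSON parsing

# Sketch: recover the objective name from the JSON configuration
# produced by XGBoosterSaveConfig. The learner$objective$name path
# follows the saved-config layout; treat it as an assumption.
objective_from_config <- function(config_json) {
  fromJSON(config_json)$learner$objective$name
}

# hypothetical usage, caching the result on the R side:
# model$objective <- objective_from_config(xgb.config(bst))
```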

How does the python interface keep the list of parameters up-to-date with the C interface?

Mostly by hand, unfortunately. "Mostly" since we have the native interface, which accepts
a dictionary of parameters and there's no need to copy and paste the parameters for the
native interface. The scikit-learn interface requires explicitly listed parameters for
some scikit-learn magic to work (like Python inspect), and as a result, you see a list of
parameters in the skl interface.

Or will it need to manually track every possible allowed argument and its default value?

No need to explicitly set the default values. Default values are set inside libxgboost
(C++); not passing the key into XGBoost means using the default value. The Python interface
uses None as a value to denote a parameter that should not be passed into libxgboost.
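An R analogue of that convention could be sketched as follows, with NULL playing the role of Python's None so that defaults stay inside libxgboost (drop_null_params is a hypothetical helper):

```r
# Sketch: parameters left as NULL are dropped before being handed to
# libxgboost, so its built-in C++ defaults apply. This mirrors the
# Python convention of None described above.
drop_null_params <- function(params) {
  params[!vapply(params, is.null, logical(1))]
}
```

A train-control generator built this way would only ever forward the parameters the user actually set.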

How do other interfaces deal with idiosyncrasies like base-0 vs. base-1 indexing?

I don't know, you will have to reach out for questions. @ExpandingMan :-)

How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster

We insist that one should not use pickle for long-term storage. An update to the CPython
interpreter itself might make the pickle object invalid due to invalid
instructions. However, we do maintain a best-effort policy in C++ to support it; here it is:

  • If the pickle or RDS to be loaded was created by the same version as the current XGBoost, we load it in full.
  • If it's not the same version, we load only the model and discard all the configuration; see the next answer for what these are.

However, what happens in R/Python is beyond our scope. Please take a look at our
document doc/tutorials/saving_model.rst, or just continue with the following answers,
which should provide some more hints.

How does it deal with converting between Booster and scikit-learn classes?

get_booster, as suggested by @hcho3. In addition, the scikit-learn interface doesn't
have much metadata and can be fully functional after loading a Booster object. Unlike
scikit-learn, XGBoost distinguishes between model parameters and meta-parameters. The
former are related to model output and shape, like the objective, number of trees, and
number of features, while the latter are related to training and runtime, like the
learning rate, number of threads, etc. load_model makes sure the former are complete and
discards the latter after training. We don't need the learning rate after training, and
the number of threads should be discarded since we want the model to be portable across
machines.

I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions

Hopefully, the previous two answers clear this up as well.

Does the python interface or other interfaces keep extra metadata that the Booster wouldn't

Yes, the Python scikit-learn interface uses attributes. But it's not important; we can
remove it anytime we want, it's legacy. As mentioned previously, we care only about the
model, not the hyper-parameters. As long as the loaded model can do inference properly
(hence contains the full model), we are somewhat satisfied.

Is it somehow standardized?

No. It took some time before we got to where we are and gradually formalized what should
be saved. Please take a look at our document: doc/tutorials/saving_model.rst. save_model
and load_model are guaranteed to work across bindings; other custom fields are not.

If you think a new field should work across bindings, please share, we can always make
additions. A recent example is feature_types, which is saved into the JSON model for
categorical data support.

How does xgboost determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it

At the moment, it's the row order; we can add extra fields later. It's a new feature in
the recently announced 2.0, and we are still hoping to collect feedback. Position-aware
LTR is still a hotly researched area and nothing is settled.

Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a DMatrix

I removed it very recently. (If something is missing, I'm the one to blame. -.-) I'm
familiar with the internals of arrow for both CPU and GPU memory due to a different
project. It chunks data for each column non-uniformly and has layered buffers. It's quite
difficult for XGBoost to support such a data format without first performing a
concatenation of all the chunks. That's probably a good design for data management
frameworks like spark/dask, but not ideal for a traditional ML algorithm. On the other
hand, if we perform a concatenation, there's no point in using arrow's C interface: we
can just pass the memory buffer's address. We have one good example of how a contiguous
(without chunks) arrow-based table can be passed into XGBoost using cuDF: it dumps out
the memory addresses of each contiguous column as __cuda_array_interface__, and XGBoost
can consume them with zero copy for inplace-predict and QuantileDMatrix.

@trivialfis
Member

trivialfis commented Oct 31, 2023

Comments unrelated to questions. I will update this comment when I can put together some more thoughts.

If I'm not mistaken, CSR is supported for prediction but not for training

CSR is well supported for all purposes.
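A minimal sketch of that path (assuming the Matrix and xgboost packages; mtcars is a built-in dataset):

```r
# Build the DMatrix explicitly from a CSR (dgRMatrix) input, then train.
library(Matrix)
library(xgboost)

x <- as(as.matrix(mtcars[, -1]), "RsparseMatrix")  # CSR layout
dtrain <- xgb.DMatrix(data = x, label = mtcars$mpg)
bst <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = dtrain, nrounds = 2)
```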

Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept

Depending on the use case, one can fix the intercept to a constant (like 0) value or let xgboost find one based on MLE.
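Sketched as parameter lists (base_score is libxgboost's global bias/intercept parameter; the objective shown is just an example):

```r
# Pin the intercept to a constant (here 0):
params_fixed <- list(objective = "reg:squarederror", base_score = 0)

# Omit base_score to let xgboost estimate the intercept from the data:
params_auto <- list(objective = "reg:squarederror")
```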

@david-cortes
Contributor Author

Thanks for the answers.

So it sounds like a good way to approach serialization and transfers between interfaces would be to keep as many attributes as possible inside of the C booster, then have a function that default-initializes them and fills them with info from the booster, if available, at the moment it gets deserialized. This unfortunately wouldn't work for things like R formulas, so in such cases one would strictly need to use saveRDS (as opposed to xgb.save).

As a preliminary step, I think it would make sense to extend the current optional attributes that are kept inside of the C booster to include the categories of both y and the categorical columns of X.

Scikit-learn currently accepts pd.Categorical and uses the categories when passed as y in some estimators. It currently doesn't keep recollection of categories for categorical columns of X, but it has slowly been moving in the direction of leveraging more and more pandas in its estimators and transformers.

I think it could even be handy to keep levels of categorical features as part of the metadata when using the scikit-learn interface and passing X that has pandas categorical types, even if they are not currently used (but perhaps could later on be used as column names in outputs from the scikit-learn interface?).

(I've opened a separate issue for the y categories: #9739)

Note that this might not be as straightforward as it sounds, as pandas allows any arbitrary type to be used as categories, not just strings, whereas R only allows strings, so perhaps it could further have the restriction that the categories would be coerced to strings.
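A quick base-R illustration of that restriction:

```r
# Factor levels are always stored as character strings in R, even when the
# underlying values were numeric.
f <- factor(c(10, 20, 10))
levels(f)          # "10" "20" -- coerced to strings
typeof(levels(f))  # "character"
as.integer(f)      # 1-based integer codes: 1 2 1
```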

@ExpandingMan

I didn't bother to read most of this thread, but I was pinged concerning 1-based indexing, as I am one of the maintainers of the Julia wrapper, Julia being a 1-based indexed language.

Nearly all Julia packages, whether wrappers or not, stick with purely 1-based indexing semantics. In the case of wrappers like XGBoost.jl, this simply means that the wrapper functions need to translate the indexing. There is a relatively low-level direct wrapper of the libxgboost C library that does not do this, but it is considered private to the wrapper package.

but from a brief look at the docs it looks like it sticks with base0 indexing for xgboost regardless.

I'm not sure what you're referring to, but no, that's not the case. The aforementioned low level C wrapper should not show up in the docs, libxgboost is small enough that the intention of XGBoost.jl is that nobody should ever have to use that.

@trivialfis
Member

keep as many attributes as possible inside of the C booster

If the object is R-specific, and R does such a good job of keeping saveRDS portable (in contrast to Python pickle) across versions, we can let saveRDS do its magic instead of saving them into the booster, right? The best code is the code that we don't have to maintain and yet it works. ;-) Fun aside, I'm fine with either approach though; I'll leave it to you to decide.

I'm not sure what you're referring to, but no, that's not the case.

Thank you for clearing that up @ExpandingMan ! That helps a lot.

@david-cortes
Contributor Author

@mayer79 would be nice to hear some comments about the proposed interface here, especially around handling of formulas.

@mayer79
Contributor

mayer79 commented Nov 5, 2023

@david-cortes : I was a couple of days afk and will give my feedback soon. Thank you so much for starting this thread. We might get some inspiration from https://github.com/MathiasAmbuehl/modeltuner (no possibility to select how to encode categoricals - it just uses OHE), but otherwise it looks neat.

@mayer79
Contributor

mayer79 commented Nov 10, 2023

Curiously, I don't know the answers to the questions directed at me. But I think they are less important.

My current mood:

  1. Most of what has been suggested above means: rewrite the xgboost() high-level API and add a corresponding xgboost.cv() function. We could do this in the current package, without the need to maintain a second {xgboost2} package.
  2. Categorical support should only become the default as soon as SHAP values are available for it. The {shap} library has almost as many pypi downloads as {xgboost}. Actually, I think that almost every new functionality should also make SHAP work.
  3. We can use a simple formula parser, see code below. No need for interactions, transformations etc. To control the categorical encoding of factors, we can use an argument encode_factors = c("integer", "target", "ohe"). Maybe this needs to be vectorized, so that users could specify encode_factors = c(Species = "integer", ...) to use different encodings per discrete feature.
  4. I actually like {ranger}'s interface: It accepts either formula + data or x + y. (I don't like its formula parser though).
  5. Monotonicity and interaction constraints could be passed as column names only, like: monotone_inc = c("Sepal.Width", ...), monotone_dec = c(...), interaction_constraints = list(c("Sepal.Width"), ...).
  6. In the high-level API, we would need to subset feature columns so that tibbles and data.tables are fine. This is usually not complicated.

Formula parser (dropping all transformations), just extracting column names:

formula_parser <- function(form, data) {
  form <- as.character(form)
  y <- form[2L]
  all_vars <- all.vars(
    stats::terms.formula(stats::reformulate(form[3L]), data = data[1L, ])
  )
  list(y = y, x = setdiff(all_vars, y))
}

formula_parser(Sepal.Width ~ ., data = iris)
# $y
# [1] "Sepal.Width"
# 
# $x
# [1] "Sepal.Length" "Petal.Length" "Petal.Width"  "Species"

@david-cortes
Contributor Author

Regarding point (1), under the proposal here, some of the implications are that:

  • xgboost() would return a different class from xgb.train() (like in the python package, where the scikit-learn class is different from the Booster class), which in turn means objects returned from it would have a different predict method with different signature - i.e. if a reverse dependency uses xgboost() and calls predict with some argument, it would lead to breakage.
  • Some of the exported package methods would disappear (as they'd become internal), which again would lead to breakage if libraries rely on them - for example, a library might check which xgboost model class an object belongs to and try to convert from one to another (handle -> booster) internally, which would not be necessary anymore under the proposal, but would also lead to breakage in said dependency.
  • Some of the current methods (like plotting) might work differently depending on whether the input came from xgboost() or xgb.train(), which could potentially lead to breakage.

Thus, it might not be necessary to create a new package, but then it'd be necessary to work with reverse dependencies to adapt.


Regarding point (2), I do not think that withholding categorical support for now would be beneficial. At some point, the core xgboost library will have wider support for things that involve categorical variables and will hopefully have it enabled by default. We haven't even started work on the R interface yet and we don't know when it will be ready.

Also, in my view, I would not consider SHAP support as a blocker - there's plenty of use-cases for xgboost that benefit from better handling of categorical features and which do not involve calculating SHAP values; and if needed, an external library could be used for it even if it might not be optimal.


Regarding the proposed parser for formulas: I do not feel that such an interface would bring value to users if it's only used to select variables.

In my opinion, one of the most useful aspects of formulas is that they can act as quick data preparation pipelines, which are helpful when e.g. prototyping models in an interactive session - for example:

  • y ~ I(count_of_A + count_of_B + count_of_C) + I(measure1/measure2) + ...
  • y ~ height + log(weight) + I(mass/height^2)
  • log(y) ~ .
  • I((var1 - var2)/var3) ~ abs(var4) + var5 + ...

In these cases, the extra processing would be lost, and the selection of variables could be achieved just as easily through e.g. dplyr::select or data.table[,.(...)] and passed to an x/y interface instead.
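For illustration, plain base R's stats::model.frame shows the kind of preprocessing such formulas perform, which a names-only parser would drop (iris is a built-in dataset):

```r
# model.frame evaluates the formula's transformations into new columns;
# this is exactly what is lost when only variable names are extracted.
mf <- stats::model.frame(
  log(Sepal.Width) ~ Sepal.Length + I(Petal.Length / Petal.Width),
  data = iris
)
ncol(mf)       # 3: transformed response plus the two terms
names(mf)[1L]  # "log(Sepal.Width)"
```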

What's more, if the formula were to be used to one-hot-encode categorical variables, it may be less efficient than what a user could achieve by e.g. using a sparse encoder instead of base R's when appropriate.

Also, choosing the way of handling categorical variables like that would then raise the question of what happens when parameter max_cat_to_onehot is supplied, and I imagine that more such parameters might appear in the future (like cat_smooth in lightgbm).

As such, I am thinking that if a formula parser is too complex to implement and there isn't any readily available parser with a compatible license, it might be better to not develop a formula interface, or maybe leave it for a future version.

@trivialfis
Member

A side note: we support SHAP values with categorical data for the original SHAP algorithm (including interactions). The issue is that the plotting library needs some additional work to read the model properly. We can work on the SHAP package after clearing up some priorities.

@mayer79
Contributor

mayer79 commented Nov 12, 2023

@trivialfis Nice! So we do need categorical support in R at a higher priority. In R, shapviz already works with categoricals, e.g., for LightGBM. So at least the plotting of SHAP values is ready in R.

@david-cortes
Contributor Author

david-cortes commented Nov 14, 2023

I just realized that the ALTREP system in newer R versions can be used (albeit in a rather hacky way) to manage custom serialization of arbitrary objects, provided that the user doesn't start interacting with and modifying those objects through R functions.

Would be ideal if xgboost handles could be made to only trigger a serialization when calling a method like saveRDS instead of always materializing the serialized bytes on every model creation call; and to automatically re-create the C++ object and handle when calling readRDS without having to call xgb.Booster.complete, like the python package does with pickle's get/set state.

Here I've composed a small example of how to do this with externalptr objects:
https://github.com/david-cortes/altreppedpointer
(so far, haven't found any R package using this sort of mechanism though)

Note that this will require changing the class and the structure of the current handles, so a first step would be to make these classes internal-only as mentioned in the first message.

@trivialfis
Member

trivialfis commented Nov 16, 2023

I just realized that the ALTREP system in newer R versions can be used

Thank you for sharing! That's quite interesting! The flexibility is a bit scary though, probably just due to my very limited understanding. I can learn more about the framework. I think if we were to use it, some documentation for future developers would be nice.

Would be ideal if xgboost handles could be made to only trigger a serialization when calling a method

Indeed, that would be extremely helpful in keeping the states consistent.

As for error handling, I think the only concern is that XGB is not great at forward compatibility: loading a new model with an older version of XGB might result in errors (C API returning -1). OOM is a bit trickier on OSs with lazy allocation like Linux, where the exception might not be caught by XGB. But I guess it's a general issue for all packages with C++.

@david-cortes
Contributor Author

david-cortes commented Nov 26, 2023

As we haven't received any further comments about these proposals, guess it's reasonable to start with a plan and checklist of things to do based on the current proposal.

As I see it, the following would be a high-level overview of what needs to be done:

  • (High priority) Create an easy-to-use script to create an R package or to otherwise re-arrange files in the kind of folder structure that R uses for packages. Currently, there's a python script to do this which is used in the CI jobs, but it seems to be very problematic to use outside of the docker containers from those jobs - e.g. it won't overwrite files if there are any, doesn't seem to be simply callable from a folder above (where it will generate .pyc files which it will complain about overwriting), etc. Having something like the old makefile script to build and R package would certainly make things easier and faster.
  • Enable parameters and features for the R interface which are currently disabled:
    • (High priority) Parameters for categorical features ("enable_categorical" and "feature_types" - not sure if the latter is just a matter of allowing it in the JSON parser, or whether it additionally needs calls to e.g. XGDMatrixSetStrFeatureInfo or similar).
    • Fitting a model from a DMatrix created from a CSR matrix (currently errors out for fitting but not for predicting).
    • Other potential things that might have been disabled, such as passing a random seed (very low priority).
  • Enable multi-output input labels and predictions.
  • (High priority) Add a mechanism to create a DMatrix from an R data.frame. I guess the most efficient route here would be to call C function XGDMatrixCreateFromDT, but I get the impression from the python interface that this is not the method that ends up getting used when constructing from a pandas DataFrame so want to ask if this is the recommended way for a data structure like R's (basically a list of arrays which are not in one single contiguous memory block).
  • Add a mechanism to create a DMatrix object from arrow objects. Not sure how feasible this is though, and I'd say this is very low priority.
  • Add an interface to create QuantileDMatrix objects from R, accepting the same kinds of inputs as DMatrix (data.frame, matrix, dgCMatrix, dgRMatrix, arrow if implemented, maybe float32). I'd also consider this one very low priority though, and it's probably quite a lot more code than the DMatrix-related tasks.
  • Switch the current DMatrix creation function for R matrices towards the C function that uses array_interface.
    • In this case, the cv function could also be switched to use this same interface but with a pointer offset for matrices instead of using DMatrix slice methods. I got from the previous messages that you guys wanted to remove the DMatrix slice methods, so if this R usage form is the only thing keeping them, perhaps they could also be eliminated afterwards. Actually this is wrong, given that the indexes will not always reflect memory-contiguous slices of a single shuffle.
  • Switch the predict function for the current booster to use inplace predict or other more efficient DMatrix creators (like XGDMatrixCreateFromDT mentioned earlier) when appropriate.
  • Remove all the public interface (functions, docs, tests, examples) around the Booster.handle class, as well as the conversion methods from handle to booster and vice-versa, leaving only the booster for now.
  • After the task above is done, switch the handle serialization mechanism to ALTREP and remove xgb.Booster.complete (which wouldn't be needed anymore).
    • Perhaps a similar serialization method could be added for DMatrix objects too, which I guess is just a matter of calling XGDMatrixSaveBinary, but I want to ask if that's all it takes or whether it needs to e.g. keep the original R data objects (like matrix) and restore pointer addresses to that data or similar, as I see that the python class DMatrix doesn't have __getstate__/__setstate__.
    • Note that this requires R >= 4.3, so the CI jobs will need to be modified and older R versions dropped.
  • Remove the current xgboost() function, and remove the calls from all the places it gets used (tests, examples, vignettes, etc.).
  • (High priority) After support for data.frame and categorical features is added, then create a new xgboost() function from scratch that wouldn't share any code base with the current function named like that, ideally working as a higher-level wrapper over DMatrix + xgb.train but implementing the kind of idiomatic R interface (x/y only, no formula) described in this thread, either with a separate function for the parameters or everything being passed in the main function.
    • It should return objects of a different class than xgb.train (perhaps the class could be named xgboost).
    • This class should have its own predict method, again with a different interface than the booster's predict, as described in the first message here.
    • If this class needs to keep additional attributes, perhaps they could be kept as part of the JSON that gets serialized, otherwise should have a note about serialization and transferability with other interfaces.
    • This is probably the largest PR in terms of code (especially tests!!), so might need to be split into different batches. For example, support for custom objectives could be left out from the first PR.
  • After the new xgboost() x/y interface gets implemented, then modify other functions to accept these objects - e.g.:
    • Plotting function.
    • Feature importance function.
    • Serialization functions that are aimed at transferring models between interfaces.
    • All of these should keep in mind small details like base-1 indexing for tree numbers and similar.
  • Create examples and vignettes for the new xgboost() function.
  • Perhaps create a higher-level cv function for the new xgboost() interface. I'd say this is low priority though.

Other topics from the earlier issue which aren't mentioned in the list above:

  • External memory with DataIter.
  • Quantile regression with multiple quantiles.
  • Use CMake instead of autotools when possible, this way we can make it easier to support GPU with R.
  • Distributed training, perhaps integration with RSpark.
    • This one I'm not sure if it'd make sense to put in the same package as xgboost, and I'd also consider it a lower priority than the others.

Would like to hear some comments about it, particularly around DMatrix-related topics, and would be very helpful if xgboost maintainers could work on the first issue in this list.

After that, guess we can create a roadmap issue to keep track of progress and discuss about plans for working on these issues from different people. I think there's a github functionality for it but I've never used it and don't know how to so perhaps maintainers could help here.

@trivialfis
Member

trivialfis commented Nov 26, 2023

Thank you for writing a comprehensive list! I'm really excited for the proposed changes here. Let me try to comment on some of the items considered to be high priority:

Create an easy-to-use script to create an R package or to otherwise re-arrange files

One can run R CMD INSTALL . under the package directory; devtools can load the package
from a path as well. The script is not necessary for installing from source. I think it's
possible to suppress the Python compilation. Lastly, if we were to rewrite it due to the
use of Python, it would be great if we could use R instead of a makefile or shell.

Parameters for categorical features

If memory serves, I think the call to set types is already implemented in the current code base. We need additional support for recognizing factor automatically. I can double check later and share how to set the types.

Fitting a model from a DMatrix created from a CSR matrix

I will debug it, do you have a specific example?

Add a mechanism to create a DMatrix from an R data.frame.

I will work on it, but would like to have some feedback from you as well. For example, is
there any difference in the underlying arrays between a data.table and a data.frame?

After that, guess we can create a roadmap issue to keep track of progress and discuss about plans for working on these issues from different people.

Feel free to pick whichever fits your workflow best. One can use to-do items, multiple issues, or a GitHub project to track progress. Tip: this is a to-do item

  • to do.

I think there's a button in GitHub to split an issue with multiple items into multiple issues. I can help create a classic GitHub project to keep track of issues if needed. Feel free to ping me.

@david-cortes
Contributor Author

One can run R CMD install . under the package directory, devtools can load the package from path as well.

Thanks for the hint. Can confirm that R CMD INSTALL . from R-package/ works as expected under the new build system, as do RStudio's analog which uses devtools and the variants for running tests and building docs.

I will debug it, do you have a specific example?

Actually, this seems to have been already fixed when creating a DMatrix first, but this will still error out:

library(xgboost)
library(Matrix)
data("mtcars")
y <- mtcars$mpg
x <- mtcars[, -1]
x.sp <- as(as.matrix(x), "RsparseMatrix")
model <- xgboost(data=x.sp, label=y)

Nevertheless, since we want to remove the code behind this function, I guess it's not worth fixing at this point.

I will work on it, but would like to have some feedbacks from you as well. Like is there any difference in the underlying arrays between table and frame?

There is no difference in the arrays, both are just a list (an R array object which in C has internal type VECSXP and whose elements are retrieved through VECTOR_ELT) containing other SEXP objects whose pointers are retrievable through e.g. REAL(x), INTEGER(x), etc. Note that there's also a popular package bit64 that provides int64_t arrays.

Objects from data.table do have some differences, though: they can contain list columns and other types (which are anyway not supported by xgboost), and they differ in other small details which shouldn't matter for DMatrix, such as potentially having hashes for some columns or not having row names.

In terms of creating DMatrix from data.frame, I guess the most reasonable route would be to only accept columns of types numeric + integer + factor + bit64::integer64, leaving potentially character for the high-level interface. Note that float::float32 cannot be put inside a data.frame, unlike bit64.
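A small base-R illustration of that structure (no xgboost involved):

```r
# A data.frame is a list of column vectors (VECSXP in C), so each column
# can be handed to C separately via REAL()/INTEGER(); factors are stored
# as integer codes underneath.
df <- data.frame(a = c(1.5, 2.5), b = 1:2, f = factor(c("x", "y")))
is.list(df)        # TRUE
sapply(df, typeof) # a: "double", b: "integer", f: "integer"
```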

I will create a separate roadmap issue with checklists to discuss work on these issues and progress.

@david-cortes
Contributor Author

A couple things I am wondering:

A. xgboost supports the sorted indices splitting algorithm (tree_method="exact").

Under this algorithm, infinite values should simply be treated as greater or smaller than every other value, and since the raw values aren't used in any other statistic, it should be possible to fit a model with them.

But it seems that xgboost will not accept infinite values regardless in the DMatrix construction. Are infinite non-missing values never accepted in input features?


B. I see that there are functions like XGBoosterPredictFromDMatrix for which part of the output is a pointer to an array of values. In the R interface, it doesn't look like the pointer that gets passed and filled inside the call would be freed after.

Does that kind of function allocate arrays inside of the booster pointer? If so, does it mean that predicting on larger inputs makes the model heavier in memory?


C. I see there's some parallelization in the R wrapper for things like converting signed integers to unsigned.

Part of the current operations include retrieving a pointer to the data in every iteration of the loop, but if that part were to be removed by pre-retrieving the pointer, would that parallelization actually result in faster operations?

As far as I understand, for a regular amd64 platform, that sort of operation should be so fast that it's basically limited by the speed at which memory is retrieved from RAM, and having multiple threads retrieving from RAM shouldn't make it faster.

@trivialfis
Member

But it seems that xgboost will not accept infinite values regardless in the DMatrix construction

I disabled the inf input due to lack of tests and other algorithms running into issues, to properly support it we need to have test cases for all potential uses, including things like SHAP, categorical features, etc. At the moment, it's not a very high priority. A use case would help.

it doesn't look like the pointer that gets passed and filled inside the call would be freed after

It's using thread-local static memory as a return buffer for thread-safe inference. The system will free it.

static thread_local int myvar;

As far as I understand, for a regular amd64 platform, that sort of operation should be so fast

I agree. Feel free to remove them.

but if that part were to be removed by pre-retrieving the pointer

I think we can entirely remove the data copies by using the array interface, which is also supported by SetInfo. As a bonus, it saves memory as well, thanks to being able to avoid allocating a new buffer. Memory usage has been a huge headache, since some users are quite used to deep-learning-like batch training and find it frustrating that gradient boosting needs the entire dataset in memory to be efficient. Most of the type-specific C APIs are legacy now.

@david-cortes
Contributor Author

david-cortes commented Nov 30, 2023

Another question: are there any guidelines for using SIMD code in xgboost?

I see, for example, there's a function in the R wrapper that subtracts 1 from elements in an array:

idxvec[i] = INTEGER(idxset)[i] - 1;

I was thinking of adding #pragma omp simd, but was wondering if perhaps xgboost includes some SIMD module with dynamic dispatching per CPU instruction set or similar.

@trivialfis
Member

No, we don't have those. Feel free to add it.

I thought about having an aligned allocator, but dropped the work as my attempt at using SSE did not speed up the histogram kernel. Another attempt was to use it to speed up allreduce on the CPU; the difference was quite insignificant in my benchmark, so it was dropped again.

@trivialfis
Member

trivialfis commented Dec 1, 2023

Hi @david-cortes, are you familiar with the error from the CRAN check caused by calling error/Rf_error:

  xgboost_R.cc:70:3: warning: format string is not a string literal (potentially insecure) [-Wformat-security]

I'm not sure how to fix it without removing the use of the error function.

https://cran.r-project.org/web/checks/check_results_xgboost.html

Update
I will try Rf_error("%s", msg) with rhub.

@jameslamb
Contributor

Rf_error("%s", msg) worked for us over in LightGBM when faced with the same warning:

microsoft/LightGBM#6216

We have a CI job there that replicates the Debian clang r-devel build from CRAN.

@david-cortes
Contributor Author

Yes, that seems to be the usual clang formatting warning, which is solved by using "%s", <str> instead of just <str>. That %s is also the solution that's being proposed at Rcpp: RcppCore/Rcpp#1287

@david-cortes
Contributor Author

@trivialfis In the previous thread, you mentioned this as one of the desired things to implement for the R interface:

quantile regression with multiple quantiles.

Is this supported in the python interface? Looks like it errors out if you try to pass multiple quantiles.

If this is supported in python, could you provide an example of what this means?

@bastistician
Contributor

FWIW, I ran a reverse dependency check and found 48 packages out of 139 (that I could originally install) with new check ERRORs using an "xgboost" built from here. I noticed that nice deprecation warnings for formal arguments are already in place, but maybe the provisional mapping to the new interface could still be improved to minimize package breakage and also guide users: by far the most frequent error messages were

'objective' must be a function
argument "y" is missing, with no default

Two packages (EIX, personalized) could no longer be installed, both because

object 'slice' is not exported by 'namespace:xgboost'

I am attaching the problematic outputs collected via tools::check_packages_in_dir_details().

@david-cortes
Contributor Author

> FWIW, I ran a reverse dependency check and found 48 packages out of 139 (that I could originally install) with new check ERRORs using an "xgboost" built from here. I noticed that nice deprecation warnings for formal arguments are already in place, but maybe the provisional mapping to the new interface could still be improved to minimize package breakage and also guide users: by far the most frequent error messages were
>
> 'objective' must be a function
> argument "y" is missing, with no default
>
> Two packages (EIX, personalized) could no longer be installed, both because
>
> object 'slice' is not exported by 'namespace:xgboost'
>
> I am attaching the problematic outputs collected via tools::check_packages_in_dir_details().

Curious about that second one:

> argument "y" is missing, with no default

This should only be a warning as of the latest commit of the master branch.

@bastistician When you did this experiment, did you build xgboost from source from the master branch here, or did you use one of the artifacts in the releases tags?

@david-cortes
Contributor Author

> This should only be a warning as of the latest commit of the master branch.

Actually, on a second look: no, it's still an issue in the master branch. It's just a matter of one error being thrown before another, more informative error ('x' being a DMatrix, which is not supported for 'xgboost()').

@david-cortes
Contributor Author

david-cortes commented Feb 9, 2025

@trivialfis As a next step, we'd need to find out if out of these 48 broken packages, any of them would unbreak after the change in PR #11204 and compile a list of all the still-broken packages along with the emails of their maintainers and potentially github issue trackers if they have any.

Then send them an email and post in their github if they have, asking to update their code. Here I'm crafting a potential email template that you could use:

Dear CRAN package maintainer,

We are contacting you regarding updates to the XGBoost CRAN package, as you are listed as maintainer of one or more packages at CRAN which have XGBoost as a dependency.

The next version of XGBoost brings many breaking changes which unfortunately will result in failing CRAN checks of the package(s) you maintain once the updated version reaches CRAN.

In preparation for our next release, tentatively scheduled for <date you send the email + 2 months>, we ask you to incorporate fixes into your package as necessary and resubmit an updated version to CRAN before this date, so that it works with the newest XGBoost and can safely remain on CRAN without breakage once we push the update.

Details about the breakage in your package(s) can be found under the following file, which lists package check errors for all reverse dependencies of XGBoost:
(link to earlier file in this thread)

Please see the instructions here for how to install the newest version of XGBoost before its release:
(link to installation instructions from github)

Then, please modify your package as necessary and re-run the CRAN checks (tests, examples, vignettes, demos) with the newest XGBoost to confirm that no further failures occur.

As a general guideline, we'd like to ask maintainers to avoid using the function 'xgboost()' in packages, preferring instead the function 'xgb.train()', with which it is easier to make code work under both older and newer XGBoost versions. For this function, please use the argument 'params' to pass training parameters instead of passing them as function arguments.

For more details about the changes, please see the migration guide at:
(link to migration guide)

Along with the documentation for the newest version of XGBoost:
(link to pkgdown docs)

In case of any questions or comments, feel free to post in this common GitHub discussion:
(link to this issue, or maybe better create a new one for help with migration to the newest version)

@bastistician
Contributor

> we'd need to find out if out of these 48 broken packages, any of them would unbreak after the change in PR #11204 and compile a list of all the still-broken packages

See the attached updated check results. So the change rescues 4 packages; 44 are still broken. This is still a pretty large number ... CRAN aims to maintain a coherent package cohort, so revdeps will either have to make sure they work with both the current and the upcoming xgboost, or they need to schedule a new release just after the new xgboost release and require its new version.

Regarding changes in errors, I see that some cases of 'objective' must be a function now fail later with argument is of length zero or replacement has length zero or EXPR must be a length 1 vector (at least the last comes from object$params$objective, the others may be similar), or with unused argument (as_booster = TRUE).

Note that there are still many errors with message argument "y" is missing, with no default.

> along with the emails of their maintainers and potentially github issue trackers if they have any

Something like

subset(tools::CRAN_package_db(), Package %in% BROKEN, c(Package, Maintainer, BugReports))

(replacing BROKEN appropriately) should provide the relevant info.

@david-cortes
Contributor Author

@bastistician Thanks for re-running the checks.

About those specific common errors:

  • argument "y" is missing, with no default:
    This one doesn't have a solution: it happens when calling function xgboost() with a DMatrix as "x" having the "label" field but no "y" passed to xgboost(). We cannot work around it because now xgboost() takes additional metadata from "y" which the DMatrix will not have.
    We potentially could make packages not break here by switching to xgb.train and giving a warning, but it would be quite messy as then some calls to xgboost() would yield models of one class (with metadata), and others models of a different class.

  • argument is of length zero:
    This one happens from trying to access fields from a booster object which are now attributes - e.g. model$call$params$objective, which is now attributes(model)$params$objective.

    Here I'm a bit undecided. On the one hand, we could unbreak things (and perhaps make usage of booster objects easier) by overriding operators like $ and [[ to retrieve from attributes instead, and it would work for easy cases like model$params$objective.

    BUT it would work for some forms of field access while breaking for others - for example, model$params would work, as would model[["params"]], but not getElement(model, "params"), nor hasName(model, "params"), and perhaps others more that I might be missing.

@mayer79 Would like some second opinion here.

@trivialfis
Member

The online document for the new R package is up: https://xgboost.readthedocs.io/en/latest/

@david-cortes
Contributor Author

> The online document for the new R package is up: https://xgboost.readthedocs.io/en/latest/

@trivialfis Looks like something went wrong with the vignette:
[screenshot of the broken vignette]

@david-cortes
Contributor Author

Also the pkgdown link throws 404:
[screenshot of the 404 page]

@trivialfis
Member

trivialfis commented Feb 12, 2025

@david-cortes It should show up after the xgboost docs CI has passed. You need to wait a few minutes after a PR merge, as the R doc is built by a GitHub action instead of the readthedocs build service.

[screenshot of the docs CI status]

@trivialfis
Member

details.zip

Attached a summary of failures with #11232 and #11233 merged on top of the master branch.

One of the most frequent errors is:

  Error in xgboost::xgboost(data = X, label = y, nrounds = nrounds, verbose = verbose,  :
    argument "x" is missing, with no default

On the other hand, this error should be closely inspected:

Package: fastrmodels 1.0.2
Check: data for non-ASCII characters, Result: NOTE

     *** caught segfault ***
    address 0x9, cause 'memory not mapped'

    Traceback:
     1: `$.xgb.Booster`(object, "ptr")
     2: object$ptr
     3: xgb.get.handle(model)
     4: xgb.get.num.boosted.rounds(x)
     5: length.xgb.Booster(x)
     6: length(x)
     7: check_one(get(ds, envir = dataEnv), ds)
     8: doTryCatch(return(expr), name, parentenv, handler)
     9: tryCatchOne(expr, names, parentenv, handlers[[1L]])
    10: tryCatchList(expr, classes, parentenv, handlers)
    11: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call, nlines = 1L)        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
    12: try(check_one(get(ds, envir = dataEnv), ds), silent = TRUE)
    13: withCallingHandlers(expr, message = function(c) if (inherits(c,     classes)) tryInvokeRestart("muffleMessage"))
    14: suppressMessages(try(check_one(get(ds, envir = dataEnv), ds),     silent = TRUE))
    15: withCallingHandlers(expr, warning = function(w) if (inherits(w,     classes)) tryInvokeRestart("muffleWarning"))
    16: suppressWarnings(suppressMessages(try(check_one(get(ds, envir = dataEnv),     ds), silent = TRUE)))
    17: tools:::.check_package_datasets(".")
    An irrecoverable exception occurred. R is aborting now ...
    Segmentation fault (core dumped)

@david-cortes
Contributor Author

@trivialfis Do you observe that error with the current master branch too?

@trivialfis
Member

Using the master branch emits a different error message, slightly better:

* checking contents of ‘data’ directory ... OK
* checking data for non-ASCII characters ... NOTE
  Error loading dataset 'cp_model':
   Error in xgb.get.handle(model) : 
    'xgb.Booster' object is corrupted or is from an incompatible XGBoost version.
  
  Error loading dataset 'ep_model':
   Error in xgb.get.handle(model) : 
    'xgb.Booster' object is corrupted or is from an incompatible XGBoost version.
  
  Error loading dataset 'wp_model':
   Error in xgb.get.handle(model) : 
    'xgb.Booster' object is corrupted or is from an incompatible XGBoost version.
  
  Error loading dataset 'wp_model_spread':
   Error in xgb.get.handle(model) : 
    'xgb.Booster' object is corrupted or is from an incompatible XGBoost version.
  
  Error loading dataset 'xpass_model':
   Error in xgb.get.handle(model) : 
    'xgb.Booster' object is corrupted or is from an incompatible XGBoost version.
  
  Error loading dataset 'xyac_model':
   Error in xgb.get.handle(model) : 
    'xgb.Booster' object is corrupted or is from an incompatible XGBoost version.
  
* checking LazyData ... OK
* checking data for ASCII and uncompressed saves ... OK

@trivialfis
Member

trivialfis commented Feb 14, 2025

Based on the current status, I suggest that we kick-start the next R release with an R-universe package, xgboost3, that everyone can easily download and install before we move to CRAN:

  • It will be easier for others to adopt needed changes.
  • We still have breaking changes in the pipeline, since we have many warnings for deprecated parameters. And we DON'T want to handle breaking changes on CRAN.
  • We might be able to get CUDA working there (not sure yet). Looks promising though: https://mlverse.r-universe.dev/cuda.ml .

We can host the package under the dmlc repository. Once we get the release out, we can see the real reaction from others and check how difficult it is to archive the CRAN package.

@trivialfis
Member

trivialfis commented Feb 14, 2025

cc @mayer79 @hcho3 for #9734 (comment)

@david-cortes
Contributor Author

I think by this point we can already assume that we won't be getting any pre-release feedback from users or package maintainers.

It would nevertheless be helpful to have R-universe builds from the master branch and to mention those in the installation instructions in the docs.

@david-cortes
Contributor Author

@hcho3 In regards to your comment in the other thread:
#11233 (comment)

> but it sounds like we can keep the package as long as we try to contact rev dep maintainers beforehand?

Yes, that is indeed the case - the first email that you'll get once the package fails the reverse dependency checks will say something like this:

> Please reply-all and explain: Is this expected or do you need to fix anything in your package? If expected, have all maintainers of affected packages been informed well in advance? Are there false positives in our results?

So yes, it is possible to push an update that breaks some reverse dependencies, provided that they are notified in advance - 2 months is a reasonable time frame, and perhaps you could CC the CRAN maintainers email ([email protected]) when sending an email to the reverse dependencies.

That being said, CRAN is a manually curated repository, and judgement about the situation is up to its maintainers - they might have other opinions on what extent of breakage is appropriate.

@hcho3
Collaborator

hcho3 commented Feb 15, 2025

> CRAN is a manually curated repository and judgement about the situation is up to their maintainers - they might have other opinions on what extent of breakage is appropriate.

Then we should email the CRAN maintainers first, to explain the following:

  1. XGBoost 3.0 has a new interface that's more idiomatic R, leading to breaking changes. Is this a reasonable ground for causing breakage for X number of rev deps?
  2. if the amount of breakage is not reasonable, would you be willing to let us publish a new package {xgboost3} to replace {xgboost}? We really want to dispense with the old interface that led to unidiomatic usages of R.

We should get pre-clearance before we start contacting maintainers of rev deps.

@david-cortes
Contributor Author

> CRAN is a manually curated repository and judgement about the situation is up to their maintainers - they might have other opinions on what extent of breakage is appropriate.
>
> Then we should email the CRAN maintainers first, to explain the following:
>
>   1. XGBoost 3.0 has a new interface that's more idiomatic R, leading to breaking changes. Is this a reasonable ground for causing breakage for X number of rev deps?
>   2. if the amount of breakage is not reasonable, would you be willing to let us publish a new package {xgboost3} to replace {xgboost}? We really want to dispense with the old interface that led to unidiomatic usages of R.
>
> We should get pre-clearance before we start contacting maintainers of rev deps.

Yes, you could try, but I highly doubt they'd reply, least of all in a reasonable time frame. They are also very short on maintainer time.

And most likely, they'd redirect you to the mailing list for package developers. The CRAN maintainers sometimes participate in the discussions from that mailing list, so perhaps that could be another resource if you want to get an opinion from a wider audience, even though it won't constitute any kind of official green light.

@hcho3
Collaborator

hcho3 commented Feb 15, 2025

We can try to merge #11233 to reduce the number of broken rev deps, to "increase our chances"

@david-cortes
Contributor Author

> We can try to merge #11233 to reduce the number of broken rev deps, to "increase our chances"

I'd rather not if possible. We don't even know how many packages it would unbreak.

@trivialfis
Member

From my perspective, R-universe can be a really good starting point before we finally move back to CRAN. It can help ease the transition and give us some time to work on integration with other projects and new features.

> I'd rather not if possible. We don't even know how many packages it would unbreak.

I tend to agree; let's assume that anything published to CRAN must be maintained forever. If the code is not idiomatic R, then it defeats the purpose of the rework, and the worst part is that we'd have to maintain it forever.

On the other hand, if we directly aim at CRAN, we might not be able to publish anything at all. We haven't been able to bring any update to it for 2 releases now; I think we have passed the point of believing we might suddenly solve all the problems.

@david-cortes
Contributor Author

> On the other hand, if we directly aim at CRAN, we might not be able to publish anything at all.

We could give it a try. Worst case is they'll reject it and then we'll have to upload under a new name.

@trivialfis
Member

Considering that it breaks so many packages non-trivially and requires careful inspection of the migration guide and the new documentation, I am still inclined to start with R-universe. I will send emails to downstream project maintainers after the R-universe version is available for testing, and I will also ask for feedback before forcing an update on CRAN.

@jameslamb
Contributor

jameslamb commented Feb 17, 2025

Posting here because I was tagged on #11233

> Waiting for comments from ... @jameslamb ... here

I agree that #11233 should probably not be merged, especially without specific evidence about how many reverse dependency issues it fixes. The other issues mentioned in that PR's description (such as the difference between behavior of [ and [[ access) are also good reasons not to merge it.

I hope it's possible to release all of the accumulated changes on master under the {xgboost} name.
