Plans to support export PMML model in R package? #296

kevinzhangguangjin · 2017-02-13T15:53:00Z

It would be great to do this directly in R.

wxchan · 2017-02-13T17:26:00Z

I guess you can use https://github.com/vruusmann/jpmml-export. @vruusmann

vruusmann · 2017-02-13T18:26:57Z

The current working codebase is located at https://github.com/jpmml/jpmml-r

I took a few steps in this direction but got stuck on the problem that LightGBM's R wrapper object is not a "pure" R object (e.g. a list), but an environment-type R object:

model = lgbm(...)
saveRDS(model, "model.rds") # THIS!

Such environment-type R objects are very difficult to (re-)use on other platforms.

On the Java platform, the LightGBM-to-PMML conversion logic is readily available in the form of the JPMML-LightGBM library. If there were an easy way to "parse" this environment-type R object and get hold of the enclosed LightGBM model text file, then the rest would be easy.

Let's keep this issue open. I'll share my thoughts here about how to make the RDS serialization of LightGBM wrapper objects a bit more intuitive.

Laurae2 · 2017-02-13T21:42:56Z

@vruusmann Can you write to a temp file using lgb.save in R to dump the model for jpmml-r? The result is (or should be) the model file, reusable elsewhere.

You are right: it is an environment-type object and not a list, and it is difficult to get at its internals (unlike xgboost). That's mainly due to the OO style.
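To make the list-versus-environment difference concrete, here is a minimal base-R sketch. Both `model_list` and `model_env` are hypothetical stand-ins invented for illustration; neither is a real xgboost or LightGBM object.

```r
# Hypothetical stand-ins; neither is a real xgboost or LightGBM object.

# A list-based model: every field is visible to str() and names().
model_list <- list(niter = 2, raw = as.raw(c(0, 1, 2)))
str(model_list)

# An environment-based model: fields are bindings that have to be listed
# with ls() and fetched with get(), or copied out with as.list().
model_env <- new.env()
assign("best_iter", -1, envir = model_env)
assign("num_class", 1, envir = model_env)

fields <- ls(model_env)          # names of the bindings, sorted
snapshot <- as.list(model_env)   # plain-list copy of the bindings
str(snapshot[fields])
```

Copying the bindings out with `as.list()` only helps for values that live in R. A field that is merely a pointer to memory owned by a C++ library still carries no model description.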

Demo that dumps the model to a temp file and reads it back in R right afterwards:

library(lightgbm)

# Load the bundled agaricus data and build the training/validation datasets
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)

# Train a small model
params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
model <- lgb.train(params, dtrain, 100, valids, min_data = 1, learning_rate = 1, early_stopping_rounds = 10)

# Dump the model to a temp file and read the text back
temp <- tempfile()
lgb.save(model, temp)
readChar(temp, file.info(temp)$size) # could be what you want

I hand-edited the file extension to .txt so it could be uploaded:

file18b4140b1d3f.txt (1.199 KB)

vruusmann · 2017-02-13T22:50:06Z

@Laurae2 Any plans for making lgb.Booster object "stateful"?

Currently, the RDS-serialized form of an lgb.Booster object does not hold the description of the model. It is assumed that this RDS file is shipped together with a LightGBM text file, and that the lgb.Booster object is initialized by calling lgb.load(filename).

Why not embed the LightGBM text file directly into the RDS-serialized form of the lgb.Booster object? That would make it easier to distribute trained models, as there would no longer be any risk of "mismatching" RDS and LightGBM text files. To save space, the enclosed string object could be zipped.
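As a rough illustration of the "zipped string" idea, the sketch below uses only base R. The model text is made up, and the `raw` field name is an assumption borrowed from xgboost's convention; none of this is LightGBM's actual implementation.

```r
# Made-up stand-in for the contents of a LightGBM model text file.
model_text <- paste(
  rep("Tree=0 num_leaves=2 split_feature=3 threshold=0.5", 200),
  collapse = "\n"
)

# Compress the text into a raw vector that could live inside the R object.
raw_model <- memCompress(charToRaw(model_text), type = "gzip")

# Round trip through saveRDS()/readRDS(), as a user would do.
path <- tempfile(fileext = ".rds")
saveRDS(list(raw = raw_model), path)
restored <- readRDS(path)

# Decompress to recover the exact original text.
restored_text <- rawToChar(memDecompress(restored$raw, type = "gzip"))
identical(restored_text, model_text)

# The compressed form is smaller than the plain text for this input.
length(raw_model) < nchar(model_text)
```

Note that saveRDS() already compresses its output by default, so explicit compression mostly affects the in-memory size of the object rather than the size of the file.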

Laurae2 · 2017-02-14T11:31:33Z

@guolinke perhaps a rework of the R6 booster class is needed (e.g. along the lines of xgboost, for better flexibility/extensibility). What is your opinion on this?

This would mean removing the environment-style (object-oriented) design and passing the booster as an argument to every public/private function that needs it (these functions would become booster-independent). It would also make it possible to explore the contents of the booster model directly in the Environment pane in RStudio.

It could also make the class more obviously extensible. Refers also to:

Environment pane in RStudio:

Structure access comparison:

> str(model_lgb)
Classes 'lgb.Booster', 'R6' <lgb.Booster>
  Public:
    add_valid: function (data, name) 
    best_iter: -1
    current_iter: function () 
    dump_model: function (num_iteration = NULL) 
    eval: function (data, name, feval = NULL) 
    eval_train: function (feval = NULL) 
    eval_valid: function (feval = NULL) 
    finalize: function () 
    initialize: function (params = list(), train_set = NULL, modelfile = NULL, 
    predict: function (data, num_iteration = NULL, rawscore = FALSE, predleaf = FALSE, 
    record_evals: list
    reset_parameter: function (params, ...) 
    rollback_one_iter: function () 
    save_model: function (filename, num_iteration = NULL) 
    set_train_data_name: function (name) 
    to_predictor: function () 
    update: function (train_set = NULL, fobj = NULL) 
  Private:
    eval_names: l2
    get_eval_info: function () 
    handle: 1.29778246886006e-315
    higher_better_inner_eval: FALSE
    init_predictor: NULL
    inner_eval: function (data_name, data_idx, feval = NULL) 
    inner_predict: function (idx) 
    is_predicted_cur_iter: list
    name_train_set: training
    name_valid_sets: list
    num_class: 1
    num_dataset: 2
    predict_buffer: list
    train_set: lgb.Dataset, R6
    valid_sets: list
> str(model_xgb)
List of 8
 $ handle        :Class 'xgb.Booster.handle' <externalptr> 
 $ raw           : raw [1:1099] 00 00 00 80 ...
 $ niter         : num 2
 $ evaluation_log:Classes ‘data.table’ and 'data.frame':	2 obs. of  3 variables:
  ..$ iter     : num [1:2] 1 2
  ..$ train_auc: num [1:2] 0.958 0.981
  ..$ eval_auc : num [1:2] 0.96 0.98
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ call          : language xgb.train(params = param, data = dtrain, nrounds = 2, watchlist = watchlist)
 $ params        :List of 7
  ..$ max_depth  : num 2
  ..$ eta        : num 1
  ..$ silent     : num 1
  ..$ nthread    : num 2
  ..$ objective  : chr "binary:logistic"
  ..$ eval_metric: chr "auc"
  ..$ silent     : num 1
 $ callbacks     :List of 2
  ..$ cb.print.evaluation:function (env = parent.frame())  
  .. ..- attr(*, "call")= language cb.print.evaluation(period = print_every_n)
  .. ..- attr(*, "name")= chr "cb.print.evaluation"
  ..$ cb.evaluation.log  :function (env = parent.frame(), finalize = FALSE)  
  .. ..- attr(*, "call")= language cb.evaluation.log()
  .. ..- attr(*, "name")= chr "cb.evaluation.log"
 $ feature_names : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
 - attr(*, "class")= chr "xgb.Booster"

vruusmann · 2017-02-14T11:49:58Z

My request for more "statefulness" was based on my previous experience with converting XGBoost model objects.

The state of an xgb.Booster object:

xgb.Booster$raw - A byte array, which contains the persistent state of an XGBoost model in XGBoost proprietary data format.
xgb.Booster$handle - A pointer to an in-memory Booster object that has been initialized based on the above byte array.

The current state of an lgb.Booster object:

lgb.Booster$handle - A pointer to an in-memory Booster object that has been initialized based on an external LightGBM text file.

In the future, the lgb.Booster class could define a new attribute lgb.Booster$raw (or lgb.Booster$text to indicate its human-friendliness) that corresponds to the xgb.Booster$raw attribute.

vruusmann · 2017-02-14T11:58:43Z

@guolinke @Laurae2 How about "downgrading" from R6 classes to S4 classes? Could preserve the current object-oriented API approach, while becoming more approachable for outside users. IIRC, one can explore the "contents" of S4 classes in RStudio pretty easily.

kevinzhangguangjin · 2017-02-15T06:50:30Z

Thanks @vruusmann @Laurae2 @wxchan @guolinke.
It seems we are stuck on "extracting" the model from the R6 classes, and so far there is no other option to get this done directly in R.
I saw that a Python approach is available, but I suggested doing this in R because R is becoming more and more popular and PMML is easy to deploy.

guolinke · 2017-02-24T09:22:26Z

I think downgrading to S4 is not so easy for LightGBM.
lgb.Booster caches many buffers that store scores/predictions to avoid repeated calculation. That is hard to achieve with S4.
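One plausible reading of this point (my interpretation, not something stated above) is that S4 objects follow R's usual copy-on-modify semantics, whereas R6 objects are environments and are updated in place. A minimal sketch using only base R and the methods package, with hypothetical names (`CachedS4`, `fill_s4`, `fill_env`):

```r
library(methods)

# Hypothetical S4 class with a slot acting as a prediction cache.
setClass("CachedS4", representation(cache = "numeric"))

# Modifying an S4 object inside a function changes only the local copy.
fill_s4 <- function(obj) {
  obj@cache <- c(1, 2, 3)
  invisible(obj)
}

# Modifying an environment inside a function changes the caller's object.
fill_env <- function(env) {
  env$cache <- c(1, 2, 3)
  invisible(env)
}

s4_model <- new("CachedS4", cache = numeric(0))
env_model <- new.env()
env_model$cache <- numeric(0)

fill_s4(s4_model)    # return value discarded, as a caching method would do
fill_env(env_model)

length(s4_model@cache)   # the caller's S4 object is unchanged
length(env_model$cache)  # the environment was updated in place
```

An S4 class can work around this by keeping an environment in one of its slots, but then the cached state is again hidden inside an environment.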

@Laurae2 Can we just add something like lgb.Booster$raw to make it stateful?

Laurae2 · 2017-02-24T17:11:07Z

@guolinke yes, if an S4 class is not a solution, I'm thinking about this:

lgb.save able to save to a (potentially compressed) character variable (and not only to a file)
lgb.Booster$raw for the model file (compressed, I guess)
saveRDS to serialize only lgb.Booster$raw (or we could tell the user in the documentation that only lgb.Booster$raw must be serialized, since serializing complete environments does not work in R)
A new function to reconstruct an lgb.Booster from the model when loading from readRDS (instead of using the model load function); it could also be lgb.load

xgboost takes a similar approach for the last point: it allows resetting the pointers of a loaded model so the model is usable.

Adding a $raw would also help with using saveRDS/readRDS to save/load objects from R (instead of using the provided lgb.save/lgb.load functions; in R we serialize to save objects, but we cannot do that with environments, as @vruusmann described). I think it would need a way to convert from $raw to lgb.Booster though (like a new lgb.init function to create an lgb.Booster from $raw?).

This could be something like this:

@vruusmann @kevinzhangguangjin @guolinke opinions about this scheme?

Conditions:

keep the R6 class
the user must call saveRDS on lgb.Booster$raw and not on the object environment
the user should use lgb.init when loading a model from $raw to reconstruct the R6 booster object

Fixes:

create a $raw that is serializable (compression can be added if needed)
allow saving/loading from a variable (so you can also load a model from a $raw, the way xgboost does)

vruusmann · 2017-02-24T20:36:22Z

Asking user to saveRDS() only the lgb.Booster$raw attribute (and not the entire lgb.Booster object) is probably "too much to learn".

Actually, my first complaint (#296 (comment)) is not about the environment-type R objects per se. R6 classes appear to be fully interoperable with the saveRDS() method.

This complaint should be reinterpreted as: "it would be nice if one could use the saveRDS() method to persist lgb.Booster objects so that the resulting RDS file is complete". Currently, the RDS file is incomplete, because the detailed description of the LightGBM model is stored in a separate text file(?).

So, for starters, introducing the lgb.Booster$raw attribute that holds the detailed description of the LightGBM model would be sufficient. Perhaps it's even possible to "overload" save and read methods like saveRDS.lgb.Booster and readRDS.lgb.Booster, and handle the interaction between $handle and $raw attributes there.
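The interaction between a session-bound handle and a serializable $raw could look roughly like the base-R sketch below. Everything in it is a toy: `new_toy_booster()`, `save_toy_booster()` and `read_toy_booster()` are hypothetical names, the "handle" is just parsed text rather than a real pointer, and none of it is LightGBM's API.

```r
# Toy booster: `handle` mimics state that is only valid in the current
# session, `raw` holds the serializable model text. Hypothetical names.
new_toy_booster <- function(model_text) {
  booster <- new.env(parent = emptyenv())
  booster$handle <- strsplit(model_text, "\n", fixed = TRUE)[[1]]
  booster$raw <- NULL
  booster
}

save_toy_booster <- function(booster, file) {
  # Refresh $raw from the live handle so the RDS file is complete,
  booster$raw <- paste(booster$handle, collapse = "\n")
  # and write only the serializable part, never the handle itself.
  saveRDS(list(raw = booster$raw), file)
}

read_toy_booster <- function(file) {
  saved <- readRDS(file)
  # Rebuild a working handle from the stored text.
  booster <- new_toy_booster(saved$raw)
  booster$raw <- saved$raw
  booster
}

path <- tempfile(fileext = ".rds")
original <- new_toy_booster("tree=0\nleaf_value=0.5")
save_toy_booster(original, path)
reloaded <- read_toy_booster(path)
identical(reloaded$handle, original$handle)
```

One caveat: saveRDS() is an ordinary function rather than an S3 generic, so a function named saveRDS.lgb.Booster would not be picked up automatically by saveRDS(model, file). Users would have to call the wrapper by name.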

guolinke · 2017-03-02T00:01:27Z

@Laurae2 can you help implement this? I can help after you open a PR.

Laurae2 · 2017-03-03T12:36:04Z

@guolinke sorry, I was sick. I will try to work on it now.

Laurae2 mentioned this issue Mar 5, 2017

[WIP/Tentative] Add $raw to LightGBM R Booster class #335

Closed

This was referenced Mar 15, 2017

R-package: lgb.save does not retain full information of the booster object? #346

Closed

[R-package] Add $raw to LightGBM #347

Merged

guolinke closed this as completed in #347 Mar 17, 2017

JesseLimtiaco mentioned this issue Apr 18, 2017

Reload R model object imported from different environment without using a temp file #432

Closed

jameslamb added the r-package label Dec 17, 2018

lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020
