Plans to support export PMML model in R package? #296

Closed
kevinzhangguangjin opened this issue Feb 13, 2017 · 13 comments · Fixed by #347

Comments

@kevinzhangguangjin

It would be great to do this directly in R.

@wxchan
Contributor

wxchan commented Feb 13, 2017

I guess you can use https://github.com/vruusmann/jpmml-export. @vruusmann

@vruusmann

The current working codebase is located at https://github.com/jpmml/jpmml-r

I took a few steps in this direction but got stuck on the problem that LightGBM's R wrapper object is not a "pure" R object (e.g. a list), but an environment-type R object:

model = lgbm(...)
saveRDS(model, "model.rds") # THIS!

Such environment-type R objects are very difficult to (re-)use on other platforms.

On the Java platform, the LightGBM-to-PMML conversion logic is readily available in the form of the JPMML-LightGBM library. If there were an easy way to "parse" this environment-type R object and get hold of the enclosed LightGBM model text file, then the rest would be easy.

Let's keep this issue open - I'll share my thoughts here about how to make the RDS serialization of LightGBM wrapper objects a bit more intuitive.

@Laurae2
Contributor

Laurae2 commented Feb 13, 2017

@vruusmann Can you write to a temp file using lgb.save in R to dump the model for jpmml-r? The result should be the model file, which is reusable elsewhere.

You are right, it is an environment-type object and not a list, and it is difficult to get at its internals (unlike xgboost). That's mainly due to the OO style.

Demo for dumping to a temp file and reading it in R right after in the temp folder:

library(lightgbm)
# Build the training set and a validation set from the bundled mushroom data
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label=train$label)
data(agaricus.test, package='lightgbm')
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label=test$label)
# Train a small regression model with early stopping on the validation set
params <- list(objective="regression", metric="l2")
valids <- list(test=dtest)
model <- lgb.train(params, dtrain, 100, valids, min_data=1, learning_rate=1, early_stopping_rounds=10)
# Dump the model to a temp file, then read the whole file back as one string
temp <- tempfile()
lgb.save(model, temp)
readChar(temp, file.info(temp)$size) # could be what you want

Hand edited file extension to add .txt so it can be uploaded:

file18b4140b1d3f.txt (1.199 KB)

@vruusmann

@Laurae2 Any plans for making lgb.Booster object "stateful"?

Currently, the RDS serialized form of the lgb.Booster object does not hold the description of the model. It is assumed that this RDS file is shipped together with a LightGBM text file, and that the lgb.Booster object is initialized by calling lgb.load(filename).

Why not embed the LightGBM text file directly into the RDS serialized form of lgb.Booster object? Would make it easier to distribute trained models, as there is no risk of "mismatching" RDS and LightGBM text files anymore. To save space, the enclosed string object could be zipped.
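
The idea of embedding a zipped copy of the model text can be sketched in base R. This is only an illustration under stated assumptions: the model text here is a made-up, heavily shortened stand-in (a real one could be read with readChar as in the demo earlier in this thread), and the field name `raw` is hypothetical, not an existing lgb.Booster attribute.

```r
# Made-up stand-in for the contents of a LightGBM model text file.
model_text <- paste(c("tree", "num_class=1", "Tree=0", "num_leaves=2"), collapse = "\n")

# Compress the text into a raw vector to save space.
model_raw <- memCompress(charToRaw(model_text), type = "gzip")

# A plain list is fully captured by saveRDS(); `raw` is a hypothetical field name.
booster_state <- list(raw = model_raw)

rds_file <- tempfile(fileext = ".rds")
saveRDS(booster_state, rds_file)

# Later, possibly in another R session: restore and decompress the text.
restored <- readRDS(rds_file)
restored_text <- memDecompress(restored$raw, type = "gzip", asChar = TRUE)

# Check that the model text survived the round trip unchanged.
identical(restored_text, model_text)
```

With this layout the RDS file alone carries the model description, so there is no separate text file that could get mismatched.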

@Laurae2
Contributor

Laurae2 commented Feb 14, 2017

@guolinke perhaps a rework of the R6 booster class might be needed (ex: xgboost - for better flexibility/extensibility), what is your opinion on this?

Doing so would remove the environment-style (object-oriented) design and pass the booster as an argument to all the public/private functions that need it (these functions would become booster-independent). It would also make it possible to explore the contents of the booster model directly in the environment pane in RStudio.

It could also make the class more obviously extensible. See also:

Environment pane in RStudio:

[image: screenshot of the RStudio environment pane]

Structure access comparison:

> str(model_lgb)
Classes 'lgb.Booster', 'R6' <lgb.Booster>
  Public:
    add_valid: function (data, name) 
    best_iter: -1
    current_iter: function () 
    dump_model: function (num_iteration = NULL) 
    eval: function (data, name, feval = NULL) 
    eval_train: function (feval = NULL) 
    eval_valid: function (feval = NULL) 
    finalize: function () 
    initialize: function (params = list(), train_set = NULL, modelfile = NULL, 
    predict: function (data, num_iteration = NULL, rawscore = FALSE, predleaf = FALSE, 
    record_evals: list
    reset_parameter: function (params, ...) 
    rollback_one_iter: function () 
    save_model: function (filename, num_iteration = NULL) 
    set_train_data_name: function (name) 
    to_predictor: function () 
    update: function (train_set = NULL, fobj = NULL) 
  Private:
    eval_names: l2
    get_eval_info: function () 
    handle: 1.29778246886006e-315
    higher_better_inner_eval: FALSE
    init_predictor: NULL
    inner_eval: function (data_name, data_idx, feval = NULL) 
    inner_predict: function (idx) 
    is_predicted_cur_iter: list
    name_train_set: training
    name_valid_sets: list
    num_class: 1
    num_dataset: 2
    predict_buffer: list
    train_set: lgb.Dataset, R6
    valid_sets: list
> str(model_xgb)
List of 8
 $ handle        :Class 'xgb.Booster.handle' <externalptr> 
 $ raw           : raw [1:1099] 00 00 00 80 ...
 $ niter         : num 2
 $ evaluation_log:Classes 'data.table' and 'data.frame':	2 obs. of  3 variables:
  ..$ iter     : num [1:2] 1 2
  ..$ train_auc: num [1:2] 0.958 0.981
  ..$ eval_auc : num [1:2] 0.96 0.98
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ call          : language xgb.train(params = param, data = dtrain, nrounds = 2, watchlist = watchlist)
 $ params        :List of 7
  ..$ max_depth  : num 2
  ..$ eta        : num 1
  ..$ silent     : num 1
  ..$ nthread    : num 2
  ..$ objective  : chr "binary:logistic"
  ..$ eval_metric: chr "auc"
  ..$ silent     : num 1
 $ callbacks     :List of 2
  ..$ cb.print.evaluation:function (env = parent.frame())  
  .. ..- attr(*, "call")= language cb.print.evaluation(period = print_every_n)
  .. ..- attr(*, "name")= chr "cb.print.evaluation"
  ..$ cb.evaluation.log  :function (env = parent.frame(), finalize = FALSE)  
  .. ..- attr(*, "call")= language cb.evaluation.log()
  .. ..- attr(*, "name")= chr "cb.evaluation.log"
 $ feature_names : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
 - attr(*, "class")= chr "xgb.Booster"

@vruusmann

My request for more "statefulness" was based on my previous experience with converting XGBoost model objects.

The state of an xgb.Booster object:

  • xgb.Booster$raw - A byte array, which contains the persistent state of an XGBoost model in XGBoost proprietary data format.
  • xgb.Booster$handle - A pointer to an in-memory Booster object that has been initialized based on the above byte array.

The current state of an lgb.Booster object:

  • lgb.Booster$handle - A pointer to an in-memory Booster object that has been initialized based on an external LightGBM text file.

In the future, the lgb.Booster class could define a new attribute lgb.Booster$raw (or lgb.Booster$text to indicate its human-friendliness) that corresponds to the xgb.Booster$raw attribute.

@vruusmann

@guolinke @Laurae2 How about "downgrading" from R6 classes to S4 classes? Could preserve the current object-oriented API approach, while becoming more approachable for outside users. IIRC, one can explore the "contents" of S4 classes in RStudio pretty easily.

@kevinzhangguangjin
Author

Thanks @vruusmann @Laurae2 @wxchan @guolinke.
It seems we are stuck on the "extraction" from the R6 classes, and there is no other option so far to get this done directly in R.
I saw that a Python approach is available, but I suggested doing this in R because R is becoming more and more popular and PMML is easy to deploy.

@guolinke
Collaborator

I think downgrading to S4 is not so easy for LightGBM.
lgb.Booster caches many buffers that store scores/predictions to avoid repeated calculation. This is hard to achieve with S4.

@Laurae2 Can we just add something like lgb.Booster$raw to make it stateful?

@Laurae2
Contributor

Laurae2 commented Feb 24, 2017

@guolinke Yes, if an S4 class is not a solution, I'm thinking about this:

  • lgb.save being able to save to a (potentially compressed) character variable (and not only to a file)
  • lgb.Booster$raw for the model file (compressed, I guess)
  • saveRDS serializing only lgb.Booster$raw (or we could tell the user in the documentation that only lgb.Booster$raw must be serialized - serializing complete environments does not work in R)
  • A new function to reconstruct an lgb.Booster from the model when loading from readRDS (instead of using the model load function), or this could also be lgb.load

xgboost has a similar approach for the last point: it allows resetting the pointers of a loaded model so that the model is usable.

Adding a $raw would also help with using saveRDS/readRDS to save/load objects from R (instead of using the provided lgb.save/lgb.load functions - in R we serialize to save objects, but we cannot do that with environments, as @vruusmann described). I think it would need a way to convert from $raw to lgb.Booster though (like a new lgb.init function to create an lgb.Booster from $raw?).

This could be something like this:

[image: diagram of the proposed scheme]

@vruusmann @kevinzhangguangjin @guolinke opinions about this scheme?

Conditions:

  • keep the R6 class
  • the user must call saveRDS on lgb.Booster$raw and not on the object environment
  • the user should use lgb.init when loading a model from $raw to reconstruct the R6 booster object

Fixes:

  • create $raw, which is serializable (compression can be added if needed)
  • allows saving/loading from a variable (so you can also load a model from a $raw, the way xgboost does)
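
The scheme above can be sketched with a toy environment in base R. Everything here is a stand-in: `new_toy_booster` and `toy_init` are hypothetical names (the latter playing the role of the proposed lgb.init), the handle is a placeholder string rather than a native pointer, and no real LightGBM call is made.

```r
# Toy stand-in for an environment-based booster: `handle` mimics a pointer that is
# only valid in the session that created it, `raw` holds the model text.
new_toy_booster <- function(model_text) {
  booster <- new.env()
  booster$raw <- charToRaw(model_text)
  booster$handle <- "live-handle"  # placeholder for a native pointer
  class(booster) <- "toy.Booster"
  booster
}

# Hypothetical counterpart of the proposed lgb.init: rebuild a booster from $raw.
toy_init <- function(raw) {
  new_toy_booster(rawToChar(raw))
}

booster <- new_toy_booster("tree\nnum_class=1\n")

# Step 1: serialize only the $raw field, not the whole environment.
rds_file <- tempfile(fileext = ".rds")
saveRDS(booster$raw, rds_file)

# Step 2: read $raw back and reconstruct a usable booster object from it.
restored <- toy_init(readRDS(rds_file))

# Check that the model text survived the round trip unchanged.
identical(rawToChar(restored$raw), "tree\nnum_class=1\n")
```

The cost of this design is that the user has to remember both steps: save `$raw` rather than the object, and call the reconstruction function after reading.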

@vruusmann

Asking user to saveRDS() only the lgb.Booster$raw attribute (and not the entire lgb.Booster object) is probably "too much to learn".

Actually, my first complaint (#296 (comment)) is not about the environment-type R objects per se. R6 classes appear to be fully interoperable with the saveRDS() method.

This complaint should be reinterpreted as: "it would be nice if one could use the saveRDS() method to persist lgb.Booster objects so that the resulting RDS file is complete". Currently, the RDS file is incomplete, because the detailed description of the LightGBM model is stored in a separate text file(?).

So, for starters, introducing the lgb.Booster$raw attribute that holds the detailed description of the LightGBM model would be sufficient. Perhaps it's even possible to "overload" save and read methods like saveRDS.lgb.Booster and readRDS.lgb.Booster, and handle the interaction between $handle and $raw attributes there.
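
A minimal sketch of such save and read helpers follows, again with a toy environment instead of the real lgb.Booster. In base R, saveRDS() and readRDS() are ordinary functions rather than S3 generics, so the sketch assumes plain wrapper functions; the names `save_booster_rds` and `read_booster_rds` are hypothetical, and the handle is a placeholder string.

```r
# Toy stand-in for an lgb.Booster-like environment (not the real class).
booster <- new.env()
booster$raw <- charToRaw("tree\nnum_class=1\n")  # model description
booster$handle <- "live-handle"                   # placeholder for a native pointer

# Hypothetical wrapper: persist only the portable state, never the handle.
save_booster_rds <- function(booster, file) {
  saveRDS(list(raw = booster$raw), file)
}

# Hypothetical wrapper: read the state back and re-create a handle from $raw.
read_booster_rds <- function(file) {
  state <- readRDS(file)
  restored <- new.env()
  restored$raw <- state$raw
  # A real implementation would re-initialize the native model from $raw here.
  restored$handle <- "rebuilt-handle"
  restored
}

rds_file <- tempfile(fileext = ".rds")
save_booster_rds(booster, rds_file)
restored <- read_booster_rds(rds_file)

# Check that the model description survived the round trip unchanged.
identical(restored$raw, booster$raw)
```

Compared with asking users to serialize `$raw` themselves, the wrappers keep the interaction between `$handle` and `$raw` in one place.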

@guolinke
Collaborator

guolinke commented Mar 2, 2017

@Laurae2 Can you help implement this? I can help after you open a PR.

@Laurae2
Contributor

Laurae2 commented Mar 3, 2017

@guolinke Sorry, I was sick. I will try working on it now.
