[spark] Make xgboost.spark support spark connect ML #9780
Comments
CC @wbo4958
Yes, that is a good suggestion. However, I have a concern: the Spark DataFrame API doesn't support stage-level scheduling yet. In that case, do we need to force only one task to run on the executor?
It is supported, see: and New
Hi @WeichenXu123, I mean stage-level scheduling, not barrier execution. I guess we can support DataFrame stage-level scheduling in Spark for specific APIs like mapInPandas and mapInArrow, the same way barrier execution is supported. That would be cool.
BTW, I'd like to take this task to make xgboost.spark support Spark Connect ML.
Oh, sorry for misreading. Yes, we don't support stage-level scheduling in the Spark Connect API yet; this is a to-do task.
We can add stage-level scheduling params in
I think this way is feasible.
Cool, let me put up a PR supporting stage-level scheduling for the DataFrame API in Spark. @WeichenXu123 @zhengruifeng @Ngone51
Hi, may I ask what the current status of this is?
Since Spark 3.5, a new PySpark module, pyspark.ml.connect, has been added. It supports a few ML algorithms that run in Spark Connect mode. This is the design doc: https://www.google.com/url?q=https://docs.google.com/document/d/1LHzwCjm2SluHkta_08cM3jxFSgfF-niaCZbtIThG-H8/edit&sa=D&source=calendar&ust=1700005806011038&usg=AOvVaw2VEdVyMYg40yDLpElhcRAu
We should make the estimators defined in xgboost.spark support Spark Connect mode. To achieve this goal, we need:
pyspark.ml.connect.Estimator if it runs in Spark Connect mode
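
For illustration, here is a minimal sketch of how the base class could be selected depending on the mode. The helper names is_spark_connect_mode and select_estimator_base are hypothetical and not part of xgboost.spark. The sketch assumes pyspark.sql.utils.is_remote is available (PySpark 3.4+), and the environment-variable fallback is likewise an assumption about how Connect sessions are flagged.

```python
import os


def is_spark_connect_mode() -> bool:
    """Best-effort check for whether PySpark is running in Spark Connect mode."""
    try:
        # Assumption: is_remote() exists in this PySpark version (3.4+).
        from pyspark.sql.utils import is_remote
    except ImportError:
        # Assumption: Connect sessions are flagged by this environment variable.
        return "SPARK_CONNECT_MODE_ENABLED" in os.environ
    return is_remote()


def select_estimator_base(connect_mode: bool) -> str:
    """Return the dotted path of the estimator base class for the given mode.

    Hypothetical helper: returning a path instead of the class itself keeps
    this sketch importable even when PySpark is not installed.
    """
    if connect_mode:
        return "pyspark.ml.connect.Estimator"
    return "pyspark.ml.Estimator"


if __name__ == "__main__":
    print(select_estimator_base(is_spark_connect_mode()))
```

Returning a dotted path rather than the class defers the import, so the choice can be made before deciding which PySpark modules to load. A real implementation would likely resolve the class directly.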