
[spark] Make xgboost.spark support spark connect ML #9780

Open
WeichenXu123 opened this issue Nov 9, 2023 · 10 comments · May be fixed by #11050
@WeichenXu123 (Contributor)

Since Spark 3.5, a new PySpark module has been added: pyspark.ml.connect. It supports a few ML algorithms that run in Spark Connect mode. Design doc:
https://docs.google.com/document/d/1LHzwCjm2SluHkta_08cM3jxFSgfF-niaCZbtIThG-H8/edit

We should make the estimators defined in xgboost.spark support Spark Connect mode. To achieve this, we need to:

  • Make these estimator classes inherit from pyspark.ml.connect.Estimator when running in Spark Connect mode
  • Ensure all implementation code calls only Spark Connect APIs (i.e. the Spark DataFrame API)
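For illustration, one way the first point could be realized is to select the base class at class-creation time. This is a minimal sketch, not the actual xgboost.spark implementation: the two base classes and the `make_estimator_class` helper below are hypothetical stand-ins. In real code the bases would be pyspark.ml.Estimator and pyspark.ml.connect.Estimator, and the mode check could be a call such as pyspark.sql.utils.is_remote().

```python
class ClassicEstimatorBase:
    """Stand-in for pyspark.ml.Estimator (classic Spark mode)."""
    mode = "classic"


class ConnectEstimatorBase:
    """Stand-in for pyspark.ml.connect.Estimator (Spark Connect mode)."""
    mode = "connect"


def make_estimator_class(on_connect: bool):
    """Build an estimator class whose base matches the session mode.

    Hypothetical helper: in real code `on_connect` would come from a
    runtime check of the active Spark session.
    """
    base = ConnectEstimatorBase if on_connect else ClassicEstimatorBase

    class SparkXGBEstimator(base):
        pass

    return SparkXGBEstimator
```

The point of the pattern is that the same estimator class name is exposed to users, while the inheritance chain adapts to the session mode.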
@WeichenXu123 (Contributor, Author)

CC @wbo4958

@wbo4958 (Contributor) commented Nov 10, 2023

Yes, that is a good suggestion. However, I have a concern: the Spark DataFrame API hasn't supported stage-level scheduling yet. In that case, do we need to force only one task to run per executor?

@WeichenXu123 (Contributor, Author)

> Yes, that is a good suggestion. However, I have a concern: the Spark DataFrame API hasn't supported stage-level scheduling yet. In that case, do we need to force only one task to run per executor?

It is supported, see:

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html?highlight=mapinpandas

and

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInArrow.html?highlight=mapinarrow#pyspark.sql.DataFrame.mapInArrow

A new barrier argument has been added to both of them.
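To make the barrier-mode path concrete, here is a hedged sketch of the shape of a function passed to DataFrame.mapInPandas with barrier=True (available since Spark 3.5). A real xgboost.spark worker would start coordinated distributed training at this point; the stand-in below only counts the rows each barrier task receives. The `df` and `spark` names in the trailing comment are assumptions, not defined here.

```python
import pandas as pd


def train_partition(batches):
    """batches: iterator of pandas DataFrames handed to one barrier task.

    In real training, an XGBoost worker would join the tracker here and
    yield the serialized booster; this sketch just yields a row count.
    """
    n_rows = sum(len(batch) for batch in batches)
    yield pd.DataFrame({"rows": [n_rows]})


# With a live SparkSession `spark` and a DataFrame `df` (both assumed):
# result = df.mapInPandas(train_partition, schema="rows long", barrier=True)
```

Because barrier=True launches all tasks of the stage together, every training worker is guaranteed to be running before the distributed training rendezvous begins.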
@wbo4958 (Contributor) commented Nov 13, 2023

Hi @WeichenXu123, I meant stage-level scheduling, not barrier execution. I guess we could support DataFrame stage-level scheduling in Spark for specific APIs like mapInPandas and mapInArrow, in the same way barrier support was added. That would be cool.

@wbo4958 (Contributor) commented Nov 14, 2023

BTW, I'd like to take on this task of making xgboost.spark support Spark Connect ML.

@WeichenXu123 (Contributor, Author)

> Hi @WeichenXu123, I meant stage-level scheduling, not barrier execution. I guess we could support DataFrame stage-level scheduling in Spark for specific APIs like mapInPandas and mapInArrow, in the same way barrier support was added. That would be cool.

Oh, sorry for my misreading. Yes, we haven't supported stage-level scheduling in the Spark Connect API yet; this is a TODO task.

@WeichenXu123 (Contributor, Author)

We could add stage-level scheduling params to the mapInPandas API, similar to the barrier param. CC @zhengruifeng @Ngone51, WDYT?
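For concreteness, a hypothetical shape this proposal might take (names are illustrative; no such parameter existed in Spark at the time of this discussion):

```
# Hypothetical extension of the mapInPandas signature (sketch only):
#
#   df.mapInPandas(func, schema, barrier=True, profile=resource_profile)
#
# where `profile` would carry a stage-level ResourceProfile (e.g. built
# with pyspark.resource.ResourceProfileBuilder) so the training stage can
# request GPUs or limit tasks per executor, analogous to the existing
# RDD-level rdd.withResources(profile) API available since Spark 3.1.
```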

@zhengruifeng (Contributor)

> We could add stage-level scheduling params to the mapInPandas API, similar to the barrier param. CC @zhengruifeng @Ngone51, WDYT?

I think this approach is feasible.

@wbo4958 (Contributor) commented Nov 15, 2023

Cool, let me put up the PR adding stage-level scheduling support to the DataFrame API in Spark. @WeichenXu123 @zhengruifeng @Ngone51

@trivialfis (Member)

Hi, may I ask what's the current status of this?

@WeichenXu123 WeichenXu123 linked a pull request Dec 4, 2024 that will close this issue