
[spark] Make xgboost.spark support spark connect ML #9780

Open
WeichenXu123 opened this issue Nov 9, 2023 · 10 comments · May be fixed by #11050
@WeichenXu123 (Contributor)

Since Spark 3.5, a new PySpark module has been added: pyspark.ml.connect. It supports a few ML algorithms that run in Spark Connect mode. Design doc:
https://docs.google.com/document/d/1LHzwCjm2SluHkta_08cM3jxFSgfF-niaCZbtIThG-H8/edit

We should make the estimators defined in xgboost.spark support Spark Connect mode. To achieve this, we need to:

  • Make these estimator classes inherit from pyspark.ml.connect.Estimator when running in Spark Connect mode
  • Ensure all implementation code calls only Spark Connect APIs (i.e. the Spark DataFrame API)
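For illustration, one way the first point could be realized is to select the base class at class-creation time. This is a minimal sketch, not the actual xgboost.spark implementation: the two base classes and the `make_estimator_class` helper below are hypothetical stand-ins. In real code the bases would be pyspark.ml.Estimator and pyspark.ml.connect.Estimator, and the mode check could be a call such as pyspark.sql.utils.is_remote().

```python
class ClassicEstimatorBase:
    """Stand-in for pyspark.ml.Estimator (classic Spark mode)."""
    mode = "classic"


class ConnectEstimatorBase:
    """Stand-in for pyspark.ml.connect.Estimator (Spark Connect mode)."""
    mode = "connect"


def make_estimator_class(on_connect: bool):
    """Build an estimator class whose base matches the session mode.

    Hypothetical helper: in real code `on_connect` would come from a
    runtime check of the active Spark session.
    """
    base = ConnectEstimatorBase if on_connect else ClassicEstimatorBase

    class SparkXGBEstimator(base):
        pass

    return SparkXGBEstimator
```

The point of the pattern is that the same estimator class name is exposed to users, while the inheritance chain adapts to the session mode.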
@WeichenXu123 (Contributor, Author)

CC @wbo4958

@wbo4958 (Contributor) commented Nov 10, 2023

Yes, that is a good suggestion. However, I have a concern: the Spark DataFrame API hasn't supported stage-level scheduling yet. In that case, do we need to force only one task to run per executor?

@WeichenXu123 (Contributor, Author)

> Yes, that is a good suggestion. However, I have a concern: the Spark DataFrame API hasn't supported stage-level scheduling yet. In that case, do we need to force only one task to run per executor?

It is supported, see:

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html?highlight=mapinpandas

and

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInArrow.html?highlight=mapinarrow#pyspark.sql.DataFrame.mapInArrow

A new barrier argument has been added to both of them.
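To make the barrier-mode path concrete, here is a hedged sketch of the shape of a function passed to DataFrame.mapInPandas with barrier=True (available since Spark 3.5). A real xgboost.spark worker would start coordinated distributed training at this point; the stand-in below only counts the rows each barrier task receives. The `df` and `spark` names in the trailing comment are assumptions, not defined here.

```python
import pandas as pd


def train_partition(batches):
    """batches: iterator of pandas DataFrames handed to one barrier task.

    In real training, an XGBoost worker would join the tracker here and
    yield the serialized booster; this sketch just yields a row count.
    """
    n_rows = sum(len(batch) for batch in batches)
    yield pd.DataFrame({"rows": [n_rows]})


# With a live SparkSession `spark` and a DataFrame `df` (both assumed):
# result = df.mapInPandas(train_partition, schema="rows long", barrier=True)
```

Because barrier=True launches all tasks of the stage together, every training worker is guaranteed to be running before the distributed training rendezvous begins.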
@wbo4958 (Contributor) commented Nov 13, 2023

Hi @WeichenXu123, I meant stage-level scheduling, not barrier execution. I guess we could support DataFrame stage-level scheduling in Spark for specific APIs like mapInPandas and mapInArrow, in the same way barrier support was added. That would be cool.

@wbo4958 (Contributor) commented Nov 14, 2023

BTW, I'd like to take on this task of making xgboost.spark support Spark Connect ML.

@WeichenXu123 (Contributor, Author)

> Hi @WeichenXu123, I meant stage-level scheduling, not barrier execution. I guess we could support DataFrame stage-level scheduling in Spark for specific APIs like mapInPandas and mapInArrow, in the same way barrier support was added. That would be cool.

Oh, sorry for my misreading. Yes, we haven't supported stage-level scheduling in the Spark Connect API yet; this is a TODO task.

@WeichenXu123 (Contributor, Author)

We could add stage-level scheduling params to the mapInPandas API, similar to the barrier param. CC @zhengruifeng @Ngone51, WDYT?
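For concreteness, a hypothetical shape this proposal might take (names are illustrative; no such parameter existed in Spark at the time of this discussion):

```
# Hypothetical extension of the mapInPandas signature (sketch only):
#
#   df.mapInPandas(func, schema, barrier=True, profile=resource_profile)
#
# where `profile` would carry a stage-level ResourceProfile (e.g. built
# with pyspark.resource.ResourceProfileBuilder) so the training stage can
# request GPUs or limit tasks per executor, analogous to the existing
# RDD-level rdd.withResources(profile) API available since Spark 3.1.
```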

@zhengruifeng (Contributor)

> We could add stage-level scheduling params to the mapInPandas API, similar to the barrier param. CC @zhengruifeng @Ngone51, WDYT?

I think this approach is feasible.

@wbo4958 (Contributor) commented Nov 15, 2023

Cool, let me put up the PR adding stage-level scheduling support to the DataFrame API in Spark. @WeichenXu123 @zhengruifeng @Ngone51

@trivialfis (Member)

Hi, may I ask what's the current status of this?

@WeichenXu123 WeichenXu123 linked a pull request Dec 4, 2024 that will close this issue