Add an assertColumnEquality method to allow for tests with less code #255

MrPowers · 2018-07-28T16:09:37Z

I've been using assertColumnEquality for most of my Spark testing needs and have found that it allows for tests that require less code and run faster. I'd like to add this function to spark-testing-base, so more Spark users have a better testing experience!

Here's an example test with assertDataFrameEquals (uses createDF from spark-daria):

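import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

def myAddFunction(colName1: String, colName2: String): Column = {
  col(colName1) + col(colName2)
}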

val actualDF = spark.createDF(
  List(
    (1, 3),
    (5, 3)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true)
  )
).withColumn(
  "the_sum",
  myAddFunction("num1", "num2")
)

val expectedDF = spark.createDF(
  List(
    (1, 3, 4),
    (5, 3, 8)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true),
    ("the_sum", IntegerType, true)
  )
)

assertDataFrameEquals(actualDF, expectedDF)

Here's the same test with assertColumnEquality:

val df = spark.createDF(
  List(
    (1, 3, 4),
    (5, 3, 8)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true),
    ("expected", IntegerType, true)
  )
).withColumn(
  "the_sum",
  myAddFunction("num1", "num2")
)

assertColumnEquality(df, "expected", "the_sum")

assertColumnEquality lets us reduce the test code from 25 lines to 15 lines.

I think assertColumnEquality runs faster for the following reasons:

Creating one DataFrame is faster than creating two DataFrames.
The collect() method runs faster than zipWithIndex().
We're not caching DataFrames with expected.rdd.cache and result.rdd.cache.
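
To make the collect()-based approach concrete, here is a minimal sketch of how such an assertion could be written. It is not necessarily the implementation in this PR: the ColumnAssertions object and the ColumnMismatch exception are hypothetical names used only for illustration, and the comparison relies on plain Scala equality, so it would not compare array-typed columns by value.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object ColumnAssertions {
  // Hypothetical exception type, named here only for illustration.
  case class ColumnMismatch(message: String) extends Exception(message)

  def assertColumnEquality(df: DataFrame, colName1: String, colName2: String): Unit = {
    // Bring only the two columns under test back to the driver.
    val rows = df.select(col(colName1), col(colName2)).collect()
    // Row.get returns Any (null for SQL NULL); Scala's != is null-safe,
    // so a null/null pair counts as equal.
    val mismatches = rows.filter(r => r.get(0) != r.get(1))
    if (mismatches.nonEmpty) {
      throw ColumnMismatch(
        s"Columns $colName1 and $colName2 differ in " +
          s"${mismatches.length} of ${rows.length} rows:\n" +
          mismatches.mkString("\n")
      )
    }
  }
}
```

Because everything is collected to the driver, a sketch like this only suits the small DataFrames typical of unit tests.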

assertDataFrameEquals will still be better for large DataFrame comparisons or multi-column comparisons.

This PR just contains an initial implementation. If you like this idea, we can merge it in and then work on making the error message pretty. It's hard to spot the row differences in the following error message:

+-------+-------------+
|   name|expected_name|
+-------+-------------+
|   phil|         phil|
| rashid|       rashid|
|matthew|        mateo|
|   sami|         sami|
|     li|         feng|
|   null|         null|
+-------+-------------+

We will be able to add a prettier error message that highlights the mismatched rows, so it's easy for users to spot the rows that are causing their tests to fail.
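
As a rough illustration of the idea (not the format this PR will end up with), mismatched rows could be flagged with a marker in the first column. The MismatchReport object, its render method, and the "X" marker are all hypothetical choices made for this sketch; it works on already-collected values rather than on a DataFrame.

```scala
object MismatchReport {
  // Renders two columns side by side and prefixes differing rows with "X ".
  // None stands in for a null value.
  def render(
      header: (String, String),
      rows: Seq[(Option[String], Option[String])]
  ): String = {
    def show(v: Option[String]): String = v.getOrElse("null")
    val lines = rows.map { case (actual, expected) =>
      val marker = if (actual == expected) "  " else "X "
      s"$marker${show(actual)} | ${show(expected)}"
    }
    (s"  ${header._1} | ${header._2}" +: lines).mkString("\n")
  }
}

// Usage with some of the rows from the table above:
val report = MismatchReport.render(
  ("name", "expected_name"),
  Seq(
    (Some("phil"), Some("phil")),
    (Some("matthew"), Some("mateo")),
    (Some("li"), Some("feng")),
    (None, None)
  )
)
println(report)
```

With this input, only the matthew/mateo and li/feng rows carry the "X" marker.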

Thanks!

holdensmagicalunicorn · 2018-07-28T16:09:39Z

@MrPowers, thanks! I am a bot who has found some folks who might be able to help with the review: @holdenk and @mahmoudhanafy

Add an assertColumnEquality method to allow for tests with less code that run fast

b6511af
