Add an assertColumnEquality method to allow for tests with less code #255

@MrPowers MrPowers commented Jul 28, 2018

Hi @holdenk 😄

I've been using assertColumnEquality for most of my Spark testing needs and have found that it lets me write tests that need less code and run faster. I'd like to add this function to spark-testing-base so more Spark users have a better testing experience!

Here's an example test with assertDataFrameEquals (uses createDF from spark-daria):

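import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Returns a Column that adds the two named columns.
def myAddFunction(colName1: String, colName2: String): Column = {
  col(colName1) + col(colName2)
}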

val actualDF = spark.createDF(
  List(
    (1, 3),
    (5, 3)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true)
  )
).withColumn(
  "the_sum",
  myAddFunction("num1", "num2")
)

val expectedDF = spark.createDF(
  List(
    (1, 3, 4),
    (5, 3, 8)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true),
    ("the_sum", IntegerType, true)
  )
)

assertDataFrameEquals(actualDF, expectedDF)

Here's the same test with assertColumnEquality:

val df = spark.createDF(
  List(
    (1, 3, 4),
    (5, 3, 8)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true),
    ("expected", IntegerType, true)
  )
).withColumn(
  "the_sum",
  myAddFunction("num1", "num2")
)

assertColumnEquality(df, "expected", "the_sum")

assertColumnEquality lets us reduce the test code from 25 lines to 15 lines.
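
The description above doesn't show the method body, so here is a minimal sketch of how such a check could work, assuming it collects the two columns to the driver and compares them row by row. The `ColumnComparer` object, the `ColumnMismatch` exception, and the message format are hypothetical names chosen for illustration; the implementation in this PR may differ.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical exception type; the name is an assumption, not taken from the PR.
case class ColumnMismatch(message: String) extends Exception(message)

object ColumnComparer {
  // Selects the two columns, collects them to the driver, and compares
  // the values row by row. Scala's == / != on Any is null-safe, so a
  // row holding (null, null) counts as equal in this sketch.
  def assertColumnEquality(df: DataFrame, colName1: String, colName2: String): Unit = {
    val rows = df.select(colName1, colName2).collect()
    val mismatches = rows.filter(row => row.get(0) != row.get(1))
    if (mismatches.nonEmpty) {
      val details = mismatches
        .map(row => s"${row.get(0)} != ${row.get(1)}")
        .mkString("\n")
      throw ColumnMismatch(
        s"Columns $colName1 and $colName2 differ in ${mismatches.length} row(s):\n$details"
      )
    }
  }
}
```

Because everything is collected to the driver, a check like this only suits the small DataFrames typical of unit tests.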

I think assertColumnEquality runs faster for the following reasons:

  • Creating one DataFrame is faster than creating two DataFrames
  • The collect() method runs faster than zipWithIndex()
  • It doesn't cache DataFrames with expected.rdd.cache and result.rdd.cache

assertDataFrameEquals will still be better for large DataFrame comparisons or multi-column comparisons.

This PR just contains an initial implementation. If you like this idea, we can merge it in and then work on making the error message pretty. It's hard to spot the row differences in the following error message:

+-------+-------------+
|   name|expected_name|
+-------+-------------+
|   phil|         phil|
| rashid|       rashid|
|matthew|        mateo|
|   sami|         sami|
|     li|         feng|
|   null|         null|
+-------+-------------+

We will be able to add a pretty error message like this so it's easy for users to spot the rows that are causing their tests to fail:

[Screenshot: assertcolumnequality_error_message]
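
The screenshot isn't reproduced as text here, so the following is only a sketch of one way to make mismatched rows easy to spot: render each row and append a marker when the two values differ. `MismatchFormatter`, its `format` method, and the `<-- mismatch` marker are hypothetical and not part of spark-testing-base.

```scala
object MismatchFormatter {
  // Renders a header line plus one line per row, right-aligning each cell
  // the way DataFrame.show() does, and flags rows whose values differ.
  // The layout and the "<-- mismatch" marker are illustrative assumptions.
  def format(header: (String, String), rows: Seq[(Any, Any)]): String = {
    def show(v: Any): String = if (v == null) "null" else v.toString

    // Column widths are driven by the longest cell, header included.
    val cells = (header._1, header._2) +: rows.map { case (a, b) => (show(a), show(b)) }
    val w1 = cells.map(_._1.length).max
    val w2 = cells.map(_._2.length).max

    def line(a: String, b: String): String = s"|%${w1}s|%${w2}s|".format(a, b)

    val body = rows.map { case (a, b) =>
      val base = line(show(a), show(b))
      if (a == b) base else base + " <-- mismatch"
    }
    (line(header._1, header._2) +: body).mkString("\n")
  }
}
```

With the rows from the table above, only the matthew/mateo and li/feng lines would carry the marker; the null/null row compares as equal.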

Thanks!

@holdensmagicalunicorn

@MrPowers, thanks! I am a bot who has found some folks who might be able to help with the review: @holdenk and @mahmoudhanafy
