
Support BQ Json arrays and literals #5544

Open · RustedBones wants to merge 5 commits into main
Conversation

RustedBones (Contributor)

Json should simply be mapped to a Java List or Map.

Fix #5542
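
A minimal sketch of that mapping, assuming a plain Jackson ObjectMapper (the actual scio mapper configuration may differ):

import com.fasterxml.jackson.databind.ObjectMapper

object JsonMappingSketch {
  private val mapper = new ObjectMapper()

  // With Object as the target type, Jackson deserializes a JSON object into a
  // java.util.LinkedHashMap and a JSON array into a java.util.ArrayList.
  def toJavaValue(json: String): AnyRef =
    mapper.readValue(json, classOf[Object])
}

Here toJavaValue("""{"a": 1}""") yields a java.util.Map and toJavaValue("[1, 2]") yields a java.util.List.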

codecov bot commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 61.46%. Comparing base (c8ca2f4) to head (5d0e8c1).

Files with missing lines                                Patch %   Missing
.../spotify/scio/bigquery/syntax/TableRowSyntax.scala   80.00%    1 ⚠️
...cala/com/spotify/scio/bigquery/types/package.scala   85.71%    1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5544      +/-   ##
==========================================
+ Coverage   61.45%   61.46%   +0.01%     
==========================================
  Files         314      314              
  Lines       11222    11228       +6     
  Branches      771      772       +1     
==========================================
+ Hits         6896     6901       +5     
- Misses       4326     4327       +1     


@RustedBones changed the title from "Support BQ Json arrays" to "Support BQ Json arrays and literals" on Jan 23, 2025
@RustedBones (Contributor, Author)

Ideally, Beam should read TableRow as parsed JSON too. There is a hack here that handles a string either as a JSON string or, if parsing fails, as a literal.
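
A minimal sketch of that fallback, assuming Jackson and a hypothetical helper name (not the actual scio code):

import com.fasterxml.jackson.databind.ObjectMapper

import scala.util.Try

object JsonOrLiteralSketch {
  private val mapper = new ObjectMapper()

  // Try to parse the string as a JSON document; if parsing fails, fall back
  // to treating it as a plain string literal.
  def parseOrLiteral(s: String): AnyRef =
    Try(mapper.readValue(s, classOf[Object])).getOrElse(s)
}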

@turb (Contributor) commented Jan 23, 2025

Thanks for this!

I executed some loads to BQ based on 574db49, and it seems all entries with non-null JSON columns are just filtered out, which is odd (jsarray and jsobject alike).

Tried to find the origin, but everything seems correct...

@turb (Contributor) commented Jan 24, 2025

tl;dr: I think I found it (see the end of this comment).

This problem only occurs when writing with the Storage Write API.

Test code:

import com.spotify.scio.ContextAndArgs
import com.spotify.scio.bigquery._
import com.spotify.scio.bigquery.Table
import com.spotify.scio.bigquery.types.{BigQueryType, Json, description}
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition

import scala.concurrent.duration.Duration

object TestJob {

  @BigQueryType.toTable
  @description("Test")
  case class Test(id: String, json: Option[Json])

  def main(args: Array[String]): Unit = {
    implicit val (sc, _) = ContextAndArgs(args)

    val tests = sc.parallelize(Seq(
      Test("stored", None),
      Test("ignored", Some(Json("{\"key\":\"value\"}"))),
    ))

    tests.saveAsTypedBigQueryTable(
      Table.Spec("Test.test"),
      method = BigQueryIO.Write.Method.STORAGE_WRITE_API,
      writeDisposition = WriteDisposition.WRITE_APPEND,
    )

    sc.run().waitUntilFinish(Duration.Inf, cancelJob = false)
  }
}

Here the first Test row is inserted into the test table, but the second one is just filtered out.

After some digging, the error can be found in org.apache.beam.sdk.io.gcp.bigquery.WriteResult#getFailedStorageApiInserts, called by BigQueryStorageWriteApiSchemaTransformProvider:

BigQueryStorageApiInsertError{row=GenericData{classInfo=[f], {id=ignored, json={key=value}}}, errorMessage='syntax error while parsing object key - invalid literal; last read: '{k'; expected string literal'}.

It looks like at some point {"key":"value"} becomes {key=value}, so the insert is erroneous and is then ignored.
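
A plausible mechanism, sketched with plain Jackson (an illustration, not the exact scio/Beam code path):

import com.fasterxml.jackson.databind.ObjectMapper

object MangledJsonDemo {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    // Parsing into Object yields a java.util.LinkedHashMap.
    val parsed = mapper.readValue("""{"key":"value"}""", classOf[Object])
    // Stringifying the map with toString instead of re-serializing it as JSON
    // produces {key=value}, the exact invalid literal the error complains about.
    println(parsed) // prints: {key=value}
  }
}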

Suggested solution

It seems to come from com.spotify.scio.bigquery.Json:

def parse(json: Json): AnyRef = mapper.readValue(json.wkt, classOf[Object])

Since it's there only to encapsulate a JSON string, I think it should be:

def parse(json: Json): AnyRef = json.wkt

It solves the problem for my use case, but I wonder whether there are side effects I'm not aware of.

(a separate issue is that Beam silently drops the bad inserts...)

The review comment below is attached to these lines of the diff:

    case _ =>
      new Json(mapper.writeValueAsString(value))
  }
  def parse(json: Json): AnyRef = mapper.readValue(json.wkt, classOf[Object])

Suggested change (review comment by a Contributor):

- def parse(json: Json): AnyRef = mapper.readValue(json.wkt, classOf[Object])
+ def parse(json: Json): AnyRef = json.wkt
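
As a quick sanity check of the suggested behavior (a standalone sketch, not part of the scio test suite):

// Stand-in for com.spotify.scio.bigquery.Json, which wraps a raw JSON string
// in its wkt field.
final case class Json(wkt: String)

object ParseCheck {
  // The suggested implementation: pass the raw JSON string through untouched.
  def parse(json: Json): AnyRef = json.wkt

  def main(args: Array[String]): Unit = {
    val payload = """{"key":"value"}"""
    assert(parse(Json(payload)) == payload) // still a valid JSON string
  }
}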

Development

Successfully merging this pull request may close the following issue:

BigQuery load fails with JSON array (#5542)