SPARKC-577, round two #1250
base: b3.0
Conversation
…nently restored KeyspaceDef as a mechanism for accessing downstream "def" types (TableDef, IndexDef, etc.) from a Schema instance.
…, serializable metadata
…umns effectively duplicates columns and was only used in one spot.
…se toInternal. Hope was this would address some of the ongoing IT issues we're seeing... but it doesn't look like that was successful. Also expanded Schema's IT to include tests for additional functionality (now that it's being provided largely by an impl backed by Java driver metadata)
…ter this change so with any luck we should be good now.
*/
def tableFromCassandra(session: CqlSession,
                       keyspaceName: String,
                       tableName: String): DriverTableDef = {

  fromCassandra(session, Some(keyspaceName), Some(tableName)).tables.headOption match {
I agree that the use of headOption here is a bit cumbersome, but unfortunately we're stuck with fromCassandra() for the foreseeable future. There are a fair number of users who access that functionality through com.datastax.spark.connector.schemaFromCassandra(), which is used in a number of places throughout the code base. It's probably cleaner to re-work this API to return KeyspaceDefs or TableDefs directly rather than using Schema as a common return type (which in turn requires this "Schema with only some keyspaces/tables" nonsense used here). That seemed like a larger and/or discrete effort, however... and this ticket is already big enough. :)
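For what it's worth, a rough sketch of what that rework might look like. These signatures are hypothetical and not part of this PR; KeyspaceDef and TableDef are the connector's trait types, and the bodies are elided:

```scala
import com.datastax.oss.driver.api.core.CqlSession

// Hypothetical shape of the reworked API suggested above: callers get the
// specific def back directly instead of fishing it out of a partial Schema.
object SchemaLookup {
  def keyspaceFromCassandra(session: CqlSession, keyspaceName: String): KeyspaceDef = ???

  def tableFromCassandra(session: CqlSession,
                         keyspaceName: String,
                         tableName: String): TableDef = ???
}
```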
@absurdfarce I'm not sure if this is the way to go. I thought that we want to use Driver types and get rid of SCC types completely. The less code to maintain the better. Do you think it would be possible to refactor SCC so that it relies solely on Driver types?
There's a bit of a complication there @jtgrabowski, although I'm not completely sure it's fatal. The old Table/ColumnDef impl served two use cases in the code base: it represented data retrieved from C* metadata, but it also could be created to define tables/columns which should be created by SCC. My original take on this ticket tried to separate those concerns... but it snowballed quickly. That led me to the approach used here of pulling these impls into traits and creating parallel impls.

I suppose we might be able to get there by leaving the Table/ColumnDef impls alone and just changing the blahFromCassandra() calls to return Java driver types, leaving everything else untouched. I believe we do have at least one case where TableDef.copy() is used to tweak something we get from C*, so in that case we'll have to construct a TableDef based on the underlying Java driver type... but that's probably doable.

There's a second (smaller) problem as well: ColumnDef contains info that is maintained in ColumnMetadata as well as some info from TableMetadata. So callers interested in that functionality may need to be changed to get a column and then get the corresponding table. Unfortunately the Java driver doesn't allow that traversal directly; they'll have to get the table name and look it up as a separate op. That's entirely doable, but it does represent a bit of an increase in complexity.

Anyways, if you're okay with the approach suggested above I can try this again and see how feasible that is to implement.
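To illustrate that extra hop, here's a minimal sketch assuming the Java driver 4.x metadata API (the helper name and object are made up for illustration):

```scala
import java.util.Optional
import com.datastax.oss.driver.api.core.CqlSession
import com.datastax.oss.driver.api.core.metadata.schema.{ColumnMetadata, TableMetadata}

object MetadataTraversal {
  // The driver's ColumnMetadata only exposes its parent table's *name*
  // (getParent), so reaching the TableMetadata is a separate lookup
  // through the session metadata rather than a direct reference.
  def parentTable(session: CqlSession, column: ColumnMetadata): Optional[TableMetadata] =
    session.getMetadata
      .getKeyspace(column.getKeyspace)              // Optional[KeyspaceMetadata]
      .flatMap(ks => ks.getTable(column.getParent)) // look the table up by name
}
```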
I see. It's far more complicated than we all expected :) What would we gain with the current approach versus the described one? I initially thought that we were going for less code to maintain, but now I'm not sure if this whole thing is worth the effort. wdyt?
Yeah, it's definitely more complicated than we expected. :) It seems to me that the following things are true:
Given all of these factors I think there's still a decent argument for trying the following:
Doing so gives us the decoupling between table creation and metadata and moves us further along in the goal of interacting with the Java driver types more directly in the code base. wdyt @jtgrabowski?
Let's give it a try if you think it's feasible. I traced a couple of *FromCassandra invocations and it looks like it would require a bit of work to convert them.
Description
How did the Spark Cassandra Connector Work or Not Work Before this Patch
The connector was using internal serializable representations for keyspace/table metadata because the corresponding Java driver classes weren't serializable. This changed in Java driver v4.6.0.
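For illustration, a minimal sketch of what 4.6.0 makes possible, assuming a driver TableMetadata instance is at hand (helper name is made up):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import com.datastax.oss.driver.api.core.metadata.schema.TableMetadata

object SerializationCheck {
  // Since Java driver 4.6.0 the schema metadata types are
  // java.io.Serializable, so they can cross Spark's serialization
  // boundary; before 4.6.0 this would throw NotSerializableException.
  def javaSerialize(table: TableMetadata): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(table) finally out.close()
    bytes.toByteArray
  }
}
```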
General Design of the patch
Existing connector metadata types were converted to traits, with the existing impls becoming a "default" implementation. Trait impls based on Java driver metadata types were also added. Worth noting that the existing representations served two distinct functions:

1. Representing keyspace/table/column metadata retrieved from C*.
2. Defining tables/columns which should be created by SCC.
Schema loading functions return an impl based on Java driver metadata types, so (1) is the default for most operations; (2) is primarily used in the ColumnMapper code.
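A condensed sketch of the pattern (trait members and class bodies here are illustrative, not this PR's actual definitions; DriverTableDef is the driver-backed impl named in the diff):

```scala
import com.datastax.oss.driver.api.core.metadata.schema.TableMetadata

// Illustrative only: the existing concrete type becomes a trait...
trait TableDef {
  def keyspaceName: String
  def tableName: String
}

// (1) impl backed by Java driver metadata, returned by schema loading
class DriverTableDef(meta: TableMetadata) extends TableDef {
  override def keyspaceName: String = meta.getKeyspace.asInternal
  override def tableName: String = meta.getName.asInternal
}

// (2) "default" impl, still constructible by hand to describe tables
// that SCC should create
case class DefaultTableDef(keyspaceName: String, tableName: String) extends TableDef
```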
Also worth noting: the impl for (1) is largely a direct cut-through to Java driver types and methods, but in some cases this wasn't adequate to implement the trait. As an example: the connector metadata types were storing some information (such as clustering column info) at the column level, while the Java metadata types represent this at the table level.

To work around this problem the new impls (based on Java driver metadata types) act as providers of Java metadata types for the current level all the way up to the keyspace. So DriverColumnDef can provide column metadata for the column it represents as well as table and keyspace metadata for the appropriate structures. Distinct traits were implemented to identify which classes can provide which Java metadata type(s).
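A minimal sketch of that provider idea (the provider trait names below are made up for illustration; DriverColumnDef is this PR's driver-backed column impl):

```scala
import com.datastax.oss.driver.api.core.metadata.schema.{ColumnMetadata, KeyspaceMetadata, TableMetadata}

// Illustrative trait names: each driver-backed def advertises which Java
// driver metadata types it can supply, for its own level and every level
// above it.
trait KeyspaceMetadataProvider { def keyspaceMetadata: KeyspaceMetadata }
trait TableMetadataProvider extends KeyspaceMetadataProvider { def tableMetadata: TableMetadata }
trait ColumnMetadataProvider extends TableMetadataProvider { def columnMetadata: ColumnMetadata }

// A column-level def can hand back metadata all the way up to the keyspace,
// covering cases (like clustering info) where the driver keeps at the table
// level what the connector used to keep at the column level.
class DriverColumnDef(ks: KeyspaceMetadata, table: TableMetadata, col: ColumnMetadata)
  extends ColumnMetadataProvider {
  override def keyspaceMetadata: KeyspaceMetadata = ks
  override def tableMetadata: TableMetadata = table
  override def columnMetadata: ColumnMetadata = col
}
```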
Fixes: SPARKC-577
How Has This Been Tested?
Still a WIP, hasn't been tested meaningfully yet
Checklist: