Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vector: add support for vector type #22488

Conversation

QuerthDP
Copy link
Contributor

@QuerthDP QuerthDP commented Jan 24, 2025

This pull request is an implementation of vector data type similar to one used by Apache Cassandra.

The patch contains:

  • implementation of vector_type_impl class
  • necessary functionalities similar to other data types
  • support for serialization and deserialization of vectors
  • support for Lua and JSON format
  • valid CQL syntax for vector<> type
  • type_parser support for vectors
  • expression adjustments such as:
    • add collection_constructor::style_type::vector
    • rename collection_constructor::style_type::list to collection_constructor::style_type::list_or_vector
  • vector type encoding (for drivers)
  • unit tests
  • cassandra compatibility tests
  • necessary documentation

Co-authored-by: @janpiotrlakomy

Fixes #19455

@scylladb-promoter
Copy link
Contributor

@scylladb/scylla-maint please review and trigger CI for this pull request
CI Build: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/parambuild/?PR_NUMBER=22488

@scylladb-promoter
Copy link
Contributor

Docs Preview 📖

Docs Preview for this pull request is available here

Changed Files:

Note: This preview will be available for 30 days and will be automatically deleted after that period. You can manually trigger a new build by committing changes.

}
return vector_type_impl::build_value(std::move(raw_vector), t.value_length_if_fixed());
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this tested somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionality is tested here.
I believe this test is sufficient, but we might adjust/expand it, if there are still any concerns.

Copy link
Contributor

@nyh nyh Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I don't understand why write any functionality tests in a Boost test. It should be in cqlpy (I'd create a new file test/cqlpy/test_type_vector.py for it), so you'll be able to compare the functionality to that in Cassandra.
Although specifically for JSON output and input, we actually have a lot of these tests in test/cqlpy/test_json.py. For example see test_tojson_timestamp and test_fromjson_timestamp which checked tojson() of the "timestamp" type - which we used to have a bug in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This boost test have been added due to the reasoning that other data types have this kind of test here as well.

We are now aware that this PR lacks the reasonable amount of cqlpy tests, and it is our high priority TODO to deliver them in a close future. Will keep you updated!

return cmp;
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify that this ordering is compatible with Cassandra.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I encourage people to write all tests that can be exercised through CQL in test/cqlpy and not in the test/boost cql_env framework. The latter always leaving you wondering whether the behavior you just wrote a test for actually matches Cassandra's behavior. The former allows you to really be sure that Scylla's behavior is identical to Cassandra and not incompatible with Cassandra for no good read.

I saw you translated Cassandra's unit tests and put it in test/cqlpy/cassandra_tests, which is great, but you can also add new cqlpy tests, just don't put them in cassandra_tests (a new file test/cqlpy/test_type_vector.py would be a good place for new tests for the vector type).

@@ -70,6 +70,7 @@ unresolved_identifier::~unresolved_identifier() = default;
static cql3::raw_value do_evaluate(const bind_variable&, const evaluation_inputs&);
static cql3::raw_value do_evaluate(const tuple_constructor&, const evaluation_inputs&);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular, this separates the implementation from collections,
which could've been quite misleading and inelegant.

But is there any practical difference from collection_constructor? This adds a lot of noise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do believe it is crucial to add the vector_constructor, regardless.

  1. First of all, if we used collection_constructor the way it is implemented now, we would have to use the do_evaluate function with it.
    There emerges a problem, as if we come into evaluate_list (as we would be redirected here, while literals for vectors and the lists are the same) then the wrong serialization function would be executed (serialize_listlike function differs from the one required for vectors according to the protocol [1]).
    We cannot just simply differ the vector from the list here, because there is no access to the column type from this part of code, it would require major refactoring if we wanted to do it this way.
    Moreover, which is a minor issue, the exception messages would not match, as the code would believe we executed the evaluate_list for the list, not vector, and list is hard coded here in the messages.
    • 💡 We could have introduced new collection_constructor::style_type different from the list one and implement crucial functions for vectors in collection_constructor the same way as we split maps and sets from lists, BUT as mentioned before it is not possible due to the ambiguity of list and vector literals...
  2. To resolve the issue, we implemented the distinction in prepare_expression, as there is a simple access to the column type through the receiver.
  3. Furthermore, we needed the split during preparation anyway, because the preparation for vectors and lists slightly differs (mainly due to the fact that lists can be empty), and that we need to produce proper column specification of vector_type_impl.
  4. On the other hand, I think it could be done the other way around, if we truly don't want to introduce a new constructor, but it would require more changes and does not come naturally to my mind how to do it.

On a final note, I do think clean and tidy code is a good thing to introduce, and splitting the vector from collections is,
I believe an example of it, as it delivers more clarity for future development and maintenance.
Of course, it might not be perfect as we do still learn and gain experience, but I hope it's clear enough what should it do.

[1] the protocol itself has some other issues, which will be mentioned here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a new style_type would be better (alternatively, switching on collection_constructor::type).

I understand your motivation to have the vector code separate, but my experience with the expression code is that we won't touch collection_constructor again (or at least until someone adds a new form of collections). On the other hand anyone implementing or maintaining a visitor will have to handle yet another top-level case which just leads to repetition. Better to isolate construction in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh OK, I think I understand your idea now.
From a fresh glimpse at the implementation we might indeed replace the vector_constructor which a new separate collection_constructor::style_type, which would be used only after preparation (as vector_constructor is used now).

I'm going to look into it deeper from now on, but high hopes it can be done the way with additional style type.
Feels like a good solution for me as well, since I noticed it. 🤞

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just rebased the mentioned commits to implement vector style type instead of vector constructor.

Nevertheless I have two new ideas how could we improve the code from now on:

  1. We might introduce the list style type, which would be used exactly the same as vector style type now (i.e. only after preparation). It would leave us with a straight look at what is truly a list, and what could be also a vector.
  2. We could revert the rename of list to list_or_vector and provide more comments about the vector using list style before the preparation (feels a bit misleading, but would be also a good idea, and it requires less changes to the code).

Please take a look at these changes and let me know what do you think!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piodul please take a look at the changes as well, as we discussed how the implementation of the expressions should look like.

@avikivity
Copy link
Member

Looks generally good. I did not like the addition of vector_constructor, it looks like another special case of collection_constructor (even if the name isn't an exact match) and doesn't merit forcing everyone who implements a visitor to special-case it.

I'd like to see tests that check that an INSERT with a bind variable and a vector with incorrect length fails validation.

from ...porting import *


# Some of the lines in these tests are commented out because Scylla doesn't support arithmetic operations in literals.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm happy you translated those tests, but just wanted to confirm something about the process of doing that. Am I correct in understanding you didn't delete or change any tests except the few things that you left commented out (not deleted) with an explanation why in a comment? If my understanding in correct, than it looks very good.

Some context: When I translated the rest of the tests (this is part of an on-going process that I've been doing on and off for a couple of years...) the main guiding principle was not to change the tests. We can't drop a test, or change it, just because it fails on Scylla or we think Cassandra is wrong and we want all our tests to be successful ;-) For example, if a test needs arithmetic operations (#2693) and Scylla doesn't have them, the test should "xfail " with that issue stated as a reason. But when this happens while I'm translating the test, I also comment out the lines which expose 2693 to see nothing else in the test fails because sometimes there is more than one bug exposed by one test. Usually, I do this commenting out temporarily and then when I'm done I restore that line it and leave the "xfail". But I am sort of fine in your case you left some of these lines exposing 2693 as commented out, because we have plenty of other tests exposing 2693 (arithmetic literals). But... on the other hand... There's also something to be said about testing this syntax. I.e., the fact that 1+1 works (and tested elsewhere and xfails until we do 2693) doesn't prove that a vector [1, 1+1] works - it's a separate piece of syntax, and also needs to be tested somewhere.

On the third (?) hand, I agree that having all the vector tests xfail because they inserted 1+1 to each and every one of them is silly and counter-productive, because it makes the tests unusable as regression tests (the tests xfail, so if the vector code breaks nobody can notice, because the test will fail just like before). So another alternative is to comment out all those instances of 1+1 except one, or perhaps leave all of them commented out and just create a tiny new test checking this specific syntax, leaving it xfailing on 2693.

Beyond that detail of #2693, and other things I saw commented out in the test (one thing that isn't a CQL test, so it's fine to comment out, another thing which the Python driver can't test, which is more sad but sometimes unavoidable) I just wanted to confirm there are no other changes in the translated test - nothing you just deleted without leaving commented-out code, and so on. The approach of leaving behind commented-out code with an explanation why you had to comment it out is good - it is possible that for some of them (e.g., that missing Python-driver feature) we can come back to fix it later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your understanding is correct and thanks for the context!

The only other meaningful changes in the translated tests (except for 1 accidental one that I just found), are the error messages in the assertInvalidThrowMessage assertions. That is because some of the error messages differ between Scylla and Cassandra, so we have to test a subset of each message, e.g. check for "bytes" instead of "Unexpected 2 extraneous bytes after vector<text, 2> value".

As for the 1+1 syntax, I think we will add a new cqlpy test that covers it and mark it xfail on 2693.

@nyh
Copy link
Contributor

nyh commented Jan 26, 2025

Hi, we have an issue #19455 about adding Cassandra 5's "VECTOR" type to Scylla.
If this PR implements this feature as this issue requests (namely, Cassandra 5 compatibility) please mention "Fixes #19455" in the cover letter. If it only partially implements this feature, please explicitly mention what still needs to be done or what's different, and mention "Refs #19455" (which connects the patch to the issue, but won't close the issue automatically).

@avikivity avikivity added the backport/none Backport is not required label Jan 26, 2025
The vector is a fixed-length array of non-null
specified type elements.

Implement serialization, deserialization, comparison,
JSON and Lua support, and other functionalities.

Co-authored-by: Dawid Pawlik <[email protected]>
@QuerthDP QuerthDP force-pushed the 2-implement-vector-data-type branch from bd6d4a0 to d02f1cc Compare January 26, 2025 19:09
@scylladb-promoter
Copy link
Contributor

@scylladb/scylla-maint please review and trigger CI for this pull request
CI Build: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/parambuild/?PR_NUMBER=22488

@QuerthDP
Copy link
Contributor Author

Changelog:

  • adjusted documentation for vector support for Lua
  • removed unnecessary support for empty type in vectors
  • added description of vector type in proper commit message

@scylladbbot
Copy link

@QuerthDP new branch branch-2025.1 was added, please add backport label if needed

@QuerthDP QuerthDP force-pushed the 2-implement-vector-data-type branch from d02f1cc to 2488140 Compare January 27, 2025 23:43
@scylladb-promoter
Copy link
Contributor

@scylladb/scylla-maint please review and trigger CI for this pull request
CI Build: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/parambuild/?PR_NUMBER=22488

@QuerthDP
Copy link
Contributor Author

Changelog:

  • removed vector_constructor and implemented vector style type instead

Copy link
Contributor

@nyh nyh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole series looks very good to me, I only left a few minor comments questions and requests.

: concrete_type(kind::vector, make_name(elements, dimension),
elements->value_length_if_fixed() ? std::optional(elements->value_length_if_fixed().value()*dimension):std::nullopt),
_elements_type(elements), _dimension(dimension) {
_contains_set_or_map = _elements_type->contains_set_or_map();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why vectors of set or map are even allowed, or why it matters if it's a "set or map" and not a list of or other strange things. I hope I'll understand this later as I read this series.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to ban vectors of collections too, but it turns our Cassandra allows them.

it is not possible to update only some elements of a vector (without updating the whole vector).

Nevertheless, types stored in a vector are not implicitly frozen, so if you want to store a frozen collection or
frozen UDT in a vector, you need to explicitly wrap them using `frozen` keyword.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last paragraph is interesting. I think the word "neverthless" is a bit confusing (it confused me for a second) because this paragraph doesn't actually contradict to the previous paragraph and in my opinion - not even related to it. The previous paragraph is about frozen<vector<...>, and this one is about vector<frozen<...>. In any case, I wonder if anyone would ever ever use a vector of anything except float or double :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it might be a bit confusing at a first glance.
As you mentioned the core purpose for vectors are for storing floats and doubles, so I think this paragraph is unnecessary and may be removed. Don't you agree?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's not unnecessary if in fact vector<list,3> is supported and not identical to vector<frozen<list,3>. But I think it could be made identical... Since the data held by a vector is immutable (you can't do something like SET a[3]++) , it's pointless to have a non-frozen collection as the vector's item.

Anyway, I think we first need to decide what we want to happen and only then document exactly what we decided. If what this paragraph says is truely what we decided, then I think it should stay (just with a different word instead of "nevertheless").

I guess we can also base our decision on what Cassandra did - even if it's weird.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK Cassandra did exactly that what we implemented, i.e. it allows both frozen and unfrozen collections without implicitly freezing them.

In this case, as you suggested, we might only change the misleading vocabulary and it should be just fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nested collections are implicitly frozen, even if the "frozen" flag isn't set.

Frozen means individual cells cannot be updated independently (UPDATE tab SET my_set = my_set + {3}).

There's no syntax (or support in the storage engine) to update a nested set element, so it's implicitly frozen.

Similarly, a vector is itself implicitly frozen (you cannot update a single element), so any nested collections would be similarly implicitly frozen.

I guess we have to distinguish between "visible frozen" (shows up with frozen<> wrappers) and implicitly frozen (the entire collection is atomic). Yuch.

Copy link
Contributor Author

@QuerthDP QuerthDP Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's all true. I might have used the wrong vocabulary here as well.

What I really meant by 'implicitly freezing' was a recursive call of freeze() function on the element type (like it is done in both frozen collections and tuples).

The lack of the recursive call implies that the types inside the vector are not visibly frozen, while efectively being frozen (cannot be updated) as you stated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@janpiotrlakomy janpiotrlakomy force-pushed the 2-implement-vector-data-type branch from 2488140 to b427fed Compare January 28, 2025 14:21
QuerthDP and others added 9 commits January 28, 2025 21:14
This change is introduced due to lack of support for vector class name,
used by type_parser to create data_type based on given class name
(especially compound class name with inner types or other parameters).

Add function that parses vector type parameters from a class name.
Contains two type_parser tests: one for a valid vector
and another for invalid vector.
These tests check serialization and deserialization (including JSON),
basic inserts and selects, aggregate functions, element validation,
vector usage in user defined types and functions.

test_vector_between_user_types is a translated Apache Cassandra test
to check if it is handled properly internally.
Motivation for this changes is to provide a distinguishable interface
for vector type expressions.

The square bracket literal is ambigious for lists and vectors,
so that we need to perform a distinction not using CQL layer.
At first we should use the collection constructor to manage
both lists and vectors (although a vector is not a collection).
Later during preparation of expressions we should be able to get
to know the exact type using given receiver (column specification).

Knowing the type of expression we may use their respective style type
(in this case the vector style type being introduced),
which would make the implementation more precise and allow us to
evaluate the expressions properly.

This commit introduces vector style type and functions making use of it.

However vector style type is not yet used anywhere,
the next commit should adjust collection constructor and make use
of the new vector style type and it's features.
Like mentioned in the previous commit, this changes introduce usage
of vector style type and adjusts the functions using list style type
to distinguish vectors from lists.

Rename collection constructor style list to list_or_vector.
Add and adjust tests using vector and list_or_vector style types.

Implemented utilities used in expr_test similar to those added
in 8f6309b.
This change has been introduced to enable CQL drivers to recognize
vector type in query results.

The encoding has been imported from Apache Cassandra implementation
to match Cassandra's and latest drivers' behaviour.

Co-authored-by: Dawid Pawlik <[email protected]>
Add cql_vector_test which tests the basic functionalities of
the vector type using CQL.

Add vectors_test which tests if descending ordering of vector
is supported.
Add missing vector type documentation including: definition of vector,
adjustment of term definition, JSON encoding, Lua and cql3 type
mapping, vector dimension limit, and keyword specification.
@QuerthDP QuerthDP force-pushed the 2-implement-vector-data-type branch from b427fed to a68bf6d Compare January 28, 2025 20:16
@scylladb-promoter
Copy link
Contributor

@scylladb/scylla-maint please review and trigger CI for this pull request
CI Build: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/parambuild/?PR_NUMBER=22488

@QuerthDP
Copy link
Contributor Author

Changelog:

  • adjust docs (used @nyh's description of a vector. Thank you!)
  • disallow negative numbers as vector dimension
  • remove unexpected freeze<> in one of JSON tests

@scylladb-promoter
Copy link
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/cql_functions_test
🔹 boost/cql_query_test
🔹 boost/expr_test
🔹 boost/json_cql_query_test
🔹 boost/types_test
🔹 boost/user_function_test
🔹 boost/user_types_test
🔹 boost/view_schema_test
✅ - dtest with tablets
✅ - dtest with topology changes
✅ - dtest
❌ - Unit Tests

Failed Tests (1/43308):

Build Details:

  • Duration: 5 hr 40 min
  • Builder: spider2.cloudius-systems.com

@QuerthDP
Copy link
Contributor Author

I've noticed that the CI didn't pass on a single test.

Failed Tests (1/43308):
cqlpy.cassandra_tests/validation/entities/secondary_index_test.debug.1 🔍

As I don't have access to read the jenkins logs, could I ask anyone responsible to let me know what was the reason for the failure?
Thank you in advance.

@nyh
Copy link
Contributor

nyh commented Jan 30, 2025

As I don't have access to read the jenkins logs, could I ask anyone responsible to let me know what was the reason for the failure? Thank you in advance.

This is cqlpy, cassandra_tests/validation/entities/secondary_index_test.py::testCreateAndDropIndexWithCamelCaseIdentifier the test crashed Scylla once in debug build, on an AddressSanitizer failure. Almost surely unrelated to your patch.

@nyh
Copy link
Contributor

nyh commented Jan 30, 2025

As I don't have access to read the jenkins logs, could I ask anyone responsible to let me know what was the reason for the failure? Thank you in advance.

This is cqlpy, cassandra_tests/validation/entities/secondary_index_test.py::testCreateAndDropIndexWithCamelCaseIdentifier the test crashed Scylla once in debug build, on an AddressSanitizer failure. Almost surely unrelated to your patch.

It appears to be the same as #22040. I'll update that issue with more information.

@Jadw1
Copy link
Contributor

Jadw1 commented Jan 31, 2025

Retriggered CI

@scylladb-promoter
Copy link
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/cql_functions_test
🔹 boost/cql_query_test
🔹 boost/expr_test
🔹 boost/json_cql_query_test
🔹 boost/types_test
🔹 boost/user_function_test
🔹 boost/user_types_test
🔹 boost/view_schema_test
❌ - dtest with topology changes
❌ - dtest with tablets
❌ - dtest
❌ - Unit Tests

Failed Tests (36/43900):

Build Details:

  • Duration: 11 hr
  • Builder: i-05bc3aef4e87c2b58 (m5ad.8xlarge)

@Jadw1
Copy link
Contributor

Jadw1 commented Feb 6, 2025

Failed tests look unrelated, some of them are known (like #20448), retriggered.

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/cql_functions_test
🔹 boost/cql_query_test
🔹 boost/expr_test
🔹 boost/json_cql_query_test
🔹 boost/types_test
🔹 boost/user_function_test
🔹 boost/user_types_test
🔹 boost/view_schema_test
✅ - dtest with topology changes
✅ - dtest with tablets
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 10 hr
  • Builder: i-0d778f093c96f4304 (m5ad.8xlarge)

@avikivity
Copy link
Member

Queued

@scylladb-promoter scylladb-promoter merged commit 861fb58 into scylladb:master Feb 7, 2025
15 checks passed
For example, a ``vector<float, 3>`` holds vectors of 3 floats
(in mathematical terms, these are 3-dimensional float vectors).
None of the elements stored in a vector can be null.
The vector type and it's respective literal are defined by:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's -> its

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/none Backport is not required promoted-to-master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CQL "vector" type
8 participants