-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vector: add support for vector type #22488
vector: add support for vector type #22488
Conversation
@scylladb/scylla-maint please review and trigger CI for this pull request |
Docs Preview 📖Docs Preview for this pull request is available here Changed Files:
Note: This preview will be available for 30 days and will be automatically deleted after that period. You can manually trigger a new build by committing changes. |
} | ||
return vector_type_impl::build_value(std::move(raw_vector), t.value_length_if_fixed()); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this tested somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionality is tested here.
I believe this test is sufficient, but we might adjust/expand it, if there are still any concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I don't understand why write any functionality tests in a Boost test. It should be in cqlpy (I'd create a new file test/cqlpy/test_type_vector.py for it), so you'll be able to compare the functionality to that in Cassandra.
Although specifically for JSON output and input, we actually have a lot of these tests in test/cqlpy/test_json.py. For example see test_tojson_timestamp and test_fromjson_timestamp which checked tojson() of the "timestamp" type - which we used to have a bug in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This boost test have been added due to the reasoning that other data types have this kind of test here as well.
We are now aware that this PR lacks the reasonable amount of cqlpy tests, and it is our high priority TODO to deliver them in a close future. Will keep you updated!
return cmp; | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please verify that this ordering is compatible with Cassandra.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why I encourage people to write all tests that can be exercised through CQL in test/cqlpy and not in the test/boost cql_env framework. The latter always leaving you wondering whether the behavior you just wrote a test for actually matches Cassandra's behavior. The former allows you to really be sure that Scylla's behavior is identical to Cassandra and not incompatible with Cassandra for no good read.
I saw you translated Cassandra's unit tests and put it in test/cqlpy/cassandra_tests, which is great, but you can also add new cqlpy tests, just don't put them in cassandra_tests (a new file test/cqlpy/test_type_vector.py would be a good place for new tests for the vector type).
cql3/expr/expression.cc
Outdated
@@ -70,6 +70,7 @@ unresolved_identifier::~unresolved_identifier() = default; | |||
static cql3::raw_value do_evaluate(const bind_variable&, const evaluation_inputs&); | |||
static cql3::raw_value do_evaluate(const tuple_constructor&, const evaluation_inputs&); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular, this separates the implementation from collections,
which could've been quite misleading and inelegant.
But is there any practical difference from collection_constructor? This adds a lot of noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do believe it is crucial to add the vector_constructor
, regardless.
- First of all, if we used
collection_constructor
the way it is implemented now, we would have to use thedo_evaluate
function with it.
There emerges a problem, as if we come intoevaluate_list
(as we would be redirected here, while literals for vectors and the lists are the same) then the wrong serialization function would be executed (serialize_listlike
function differs from the one required for vectors according to the protocol [1]).
We cannot just simply differ the vector from the list here, because there is no access to the column type from this part of code, it would require major refactoring if we wanted to do it this way.
Moreover, which is a minor issue, the exception messages would not match, as the code would believe we executed theevaluate_list
for the list, not vector, andlist
is hard coded here in the messages.- 💡 We could have introduced new
collection_constructor::style_type
different from the list one and implement crucial functions for vectors incollection_constructor
the same way as we split maps and sets from lists, BUT as mentioned before it is not possible due to the ambiguity of list and vector literals...
- 💡 We could have introduced new
- To resolve the issue, we implemented the distinction in
prepare_expression
, as there is a simple access to the column type through thereceiver
. - Furthermore, we needed the split during preparation anyway, because the preparation for vectors and lists slightly differs (mainly due to the fact that lists can be empty), and that we need to produce proper column specification of
vector_type_impl
. - On the other hand, I think it could be done the other way around, if we truly don't want to introduce a new constructor, but it would require more changes and does not come naturally to my mind how to do it.
On a final note, I do think clean and tidy code is a good thing to introduce, and splitting the vector from collections is,
I believe an example of it, as it delivers more clarity for future development and maintenance.
Of course, it might not be perfect as we do still learn and gain experience, but I hope it's clear enough what should it do.
[1] the protocol itself has some other issues, which will be mentioned here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding a new style_type would be better (alternatively, switching on collection_constructor::type).
I understand your motivation to have the vector code separate, but my experience with the expression code is that we won't touch collection_constructor again (or at least until someone adds a new form of collections). On the other hand anyone implementing or maintaining a visitor will have to handle yet another top-level case which just leads to repetition. Better to isolate construction in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh OK, I think I understand your idea now.
From a fresh glimpse at the implementation we might indeed replace the vector_constructor
which a new separate collection_constructor::style_type
, which would be used only after preparation (as vector_constructor
is used now).
I'm going to look into it deeper from now on, but high hopes it can be done the way with additional style type.
Feels like a good solution for me as well, since I noticed it. 🤞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just rebased the mentioned commits to implement vector style type instead of vector constructor.
Nevertheless I have two new ideas how could we improve the code from now on:
- We might introduce the
list
style type, which would be used exactly the same asvector
style type now (i.e. only after preparation). It would leave us with a straight look at what is truly a list, and what could be also a vector. - We could revert the rename of
list
tolist_or_vector
and provide more comments about the vector usinglist
style before the preparation (feels a bit misleading, but would be also a good idea, and it requires less changes to the code).
Please take a look at these changes and let me know what do you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piodul please take a look at the changes as well, as we discussed how the implementation of the expressions should look like.
Looks generally good. I did not like the addition of vector_constructor, it looks like another special case of collection_constructor (even if the name isn't an exact match) and doesn't merit forcing everyone who implements a visitor to special-case it. I'd like to see tests that check that an INSERT with a bind variable and a vector with incorrect length fails validation. |
from ...porting import * | ||
|
||
|
||
# Some of the lines in these tests are commented out because Scylla doesn't support arithmetic operations in literals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'm happy you translated those tests, but just wanted to confirm something about the process of doing that. Am I correct in understanding you didn't delete or change any tests except the few things that you left commented out (not deleted) with an explanation why in a comment? If my understanding in correct, than it looks very good.
Some context: When I translated the rest of the tests (this is part of an on-going process that I've been doing on and off for a couple of years...) the main guiding principle was not to change the tests. We can't drop a test, or change it, just because it fails on Scylla or we think Cassandra is wrong and we want all our tests to be successful ;-) For example, if a test needs arithmetic operations (#2693) and Scylla doesn't have them, the test should "xfail " with that issue stated as a reason. But when this happens while I'm translating the test, I also comment out the lines which expose 2693 to see nothing else in the test fails because sometimes there is more than one bug exposed by one test. Usually, I do this commenting out temporarily and then when I'm done I restore that line it and leave the "xfail". But I am sort of fine in your case you left some of these lines exposing 2693 as commented out, because we have plenty of other tests exposing 2693 (arithmetic literals). But... on the other hand... There's also something to be said about testing this syntax. I.e., the fact that 1+1
works (and tested elsewhere and xfails until we do 2693) doesn't prove that a vector [1, 1+1]
works - it's a separate piece of syntax, and also needs to be tested somewhere.
On the third (?) hand, I agree that having all the vector tests xfail because they inserted 1+1
to each and every one of them is silly and counter-productive, because it makes the tests unusable as regression tests (the tests xfail, so if the vector code breaks nobody can notice, because the test will fail just like before). So another alternative is to comment out all those instances of 1+1
except one, or perhaps leave all of them commented out and just create a tiny new test checking this specific syntax, leaving it xfailing on 2693.
Beyond that detail of #2693, and other things I saw commented out in the test (one thing that isn't a CQL test, so it's fine to comment out, another thing which the Python driver can't test, which is more sad but sometimes unavoidable) I just wanted to confirm there are no other changes in the translated test - nothing you just deleted without leaving commented-out code, and so on. The approach of leaving behind commented-out code with an explanation why you had to comment it out is good - it is possible that for some of them (e.g., that missing Python-driver feature) we can come back to fix it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, your understanding is correct and thanks for the context!
The only other meaningful changes in the translated tests (except for 1 accidental one that I just found), are the error messages in the assertInvalidThrowMessage
assertions. That is because some of the error messages differ between Scylla and Cassandra, so we have to test a subset of each message, e.g. check for "bytes" instead of "Unexpected 2 extraneous bytes after vector<text, 2> value".
As for the 1+1
syntax, I think we will add a new cqlpy test that covers it and mark it xfail on 2693.
test/cqlpy/cassandra_tests/validation/operations/cql_vector_test.py
Outdated
Show resolved
Hide resolved
test/cqlpy/cassandra_tests/validation/operations/cql_vector_test.py
Outdated
Show resolved
Hide resolved
Hi, we have an issue #19455 about adding Cassandra 5's "VECTOR" type to Scylla. |
The vector is a fixed-length array of non-null specified type elements. Implement serialization, deserialization, comparison, JSON and Lua support, and other functionalities. Co-authored-by: Dawid Pawlik <[email protected]>
bd6d4a0
to
d02f1cc
Compare
@scylladb/scylla-maint please review and trigger CI for this pull request |
Changelog:
|
@QuerthDP new branch |
d02f1cc
to
2488140
Compare
@scylladb/scylla-maint please review and trigger CI for this pull request |
Changelog:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole series looks very good to me, I only left a few minor comments questions and requests.
: concrete_type(kind::vector, make_name(elements, dimension), | ||
elements->value_length_if_fixed() ? std::optional(elements->value_length_if_fixed().value()*dimension):std::nullopt), | ||
_elements_type(elements), _dimension(dimension) { | ||
_contains_set_or_map = _elements_type->contains_set_or_map(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why vectors of set or map are even allowed, or why it matters if it's a "set or map" and not a list of or other strange things. I hope I'll understand this later as I read this series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to ban vectors of collections too, but it turns our Cassandra allows them.
it is not possible to update only some elements of a vector (without updating the whole vector). | ||
|
||
Nevertheless, types stored in a vector are not implicitly frozen, so if you want to store a frozen collection or | ||
frozen UDT in a vector, you need to explicitly wrap them using `frozen` keyword. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last paragraph is interesting. I think the word "neverthless" is a bit confusing (it confused me for a second) because this paragraph doesn't actually contradict to the previous paragraph and in my opinion - not even related to it. The previous paragraph is about frozen<vector<...>
, and this one is about vector<frozen<...>
. In any case, I wonder if anyone would ever ever use a vector of anything except float or double :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it might be a bit confusing at a first glance.
As you mentioned the core purpose for vectors are for storing floats and doubles, so I think this paragraph is unnecessary and may be removed. Don't you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's not unnecessary if in fact vector<list,3> is supported and not identical to vector<frozen<list,3>. But I think it could be made identical... Since the data held by a vector is immutable (you can't do something like SET a[3]++
) , it's pointless to have a non-frozen collection as the vector's item.
Anyway, I think we first need to decide what we want to happen and only then document exactly what we decided. If what this paragraph says is truely what we decided, then I think it should stay (just with a different word instead of "nevertheless").
I guess we can also base our decision on what Cassandra did - even if it's weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK Cassandra did exactly that what we implemented, i.e. it allows both frozen and unfrozen collections without implicitly freezing them.
In this case, as you suggested, we might only change the misleading vocabulary and it should be just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nested collections are implicitly frozen, even if the "frozen" flag isn't set.
Frozen means individual cells cannot be updated independently (UPDATE tab SET my_set = my_set + {3}
).
There's no syntax (or support in the storage engine) to update a nested set element, so it's implicitly frozen.
Similarly, a vector is itself implicitly frozen (you cannot update a single element), so any nested collections would be similarly implicitly frozen.
I guess we have to distinguish between "visible frozen" (shows up with frozen<> wrappers) and implicitly frozen (the entire collection is atomic). Yuch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's all true. I might have used the wrong vocabulary here as well.
What I really meant by 'implicitly freezing' was a recursive call of freeze()
function on the element type (like it is done in both frozen collections and tuples).
The lack of the recursive call implies that the types inside the vector are not visibly frozen, while efectively being frozen (cannot be updated) as you stated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cassandra 5.0 does support a vector with text subtype for example ... - see https://issues.apache.org/jira/browse/CASSANDRA-18960?focusedCommentId=17782187&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17782187
What's the use case? I donno.
2488140
to
b427fed
Compare
This change is introduced due to lack of support for vector class name, used by type_parser to create data_type based on given class name (especially compound class name with inner types or other parameters). Add function that parses vector type parameters from a class name.
Contains two type_parser tests: one for a valid vector and another for invalid vector.
These tests check serialization and deserialization (including JSON), basic inserts and selects, aggregate functions, element validation, vector usage in user defined types and functions. test_vector_between_user_types is a translated Apache Cassandra test to check if it is handled properly internally.
Motivation for this changes is to provide a distinguishable interface for vector type expressions. The square bracket literal is ambigious for lists and vectors, so that we need to perform a distinction not using CQL layer. At first we should use the collection constructor to manage both lists and vectors (although a vector is not a collection). Later during preparation of expressions we should be able to get to know the exact type using given receiver (column specification). Knowing the type of expression we may use their respective style type (in this case the vector style type being introduced), which would make the implementation more precise and allow us to evaluate the expressions properly. This commit introduces vector style type and functions making use of it. However vector style type is not yet used anywhere, the next commit should adjust collection constructor and make use of the new vector style type and it's features.
Like mentioned in the previous commit, this changes introduce usage of vector style type and adjusts the functions using list style type to distinguish vectors from lists. Rename collection constructor style list to list_or_vector.
Add and adjust tests using vector and list_or_vector style types. Implemented utilities used in expr_test similar to those added in 8f6309b.
This change has been introduced to enable CQL drivers to recognize vector type in query results. The encoding has been imported from Apache Cassandra implementation to match Cassandra's and latest drivers' behaviour. Co-authored-by: Dawid Pawlik <[email protected]>
Add cql_vector_test which tests the basic functionalities of the vector type using CQL. Add vectors_test which tests if descending ordering of vector is supported.
Add missing vector type documentation including: definition of vector, adjustment of term definition, JSON encoding, Lua and cql3 type mapping, vector dimension limit, and keyword specification.
b427fed
to
a68bf6d
Compare
@scylladb/scylla-maint please review and trigger CI for this pull request |
Changelog:
|
🔴 CI State: FAILURE✅ - Build Failed Tests (1/43308):Build Details:
|
I've noticed that the CI didn't pass on a single test.
As I don't have access to read the jenkins logs, could I ask anyone responsible to let me know what was the reason for the failure? |
This is cqlpy, |
It appears to be the same as #22040. I'll update that issue with more information. |
Retriggered CI |
🔴 CI State: FAILURE✅ - Build Failed Tests (36/43900):
Build Details:
|
Failed tests look unrelated, some of them are known (like #20448), retriggered. |
🟢 CI State: SUCCESS✅ - Build Build Details:
|
Queued |
For example, a ``vector<float, 3>`` holds vectors of 3 floats | ||
(in mathematical terms, these are 3-dimensional float vectors). | ||
None of the elements stored in a vector can be null. | ||
The vector type and it's respective literal are defined by: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's -> its
This pull request is an implementation of vector data type similar to one used by Apache Cassandra.
The patch contains:
vector<>
typetype_parser
support for vectorscollection_constructor::style_type::vector
collection_constructor::style_type::list
tocollection_constructor::style_type::list_or_vector
Co-authored-by: @janpiotrlakomy
Fixes #19455