-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a groupby operator #1123
Add a groupby operator #1123
Conversation
955cd3c
to
db0d219
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. What would it take to abstract this such that it works in both ray and local modes?
def init_embedding(row): | ||
doc = Document.from_row(row) | ||
return {"vector": doc.embedding, "cluster": -1} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a "assert self.context.exec_mode == ExecMode.RAY" in here?
return context.read.document(doc_list) | ||
|
||
def test_groupby_count(self, fruits_docset): | ||
aggregated = fruits_docset.groupby("text_representation").count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do the Documents in aggregated look like at this point?
def init(embeddings, K, init_mode): | ||
if init_mode == "random": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supernit: init_mode could be an Enum but str is fine too. I guess would be nice to have the list of known init_modes in the exception?
from ray.data._internal.aggregate import Count | ||
from ray.data.aggregate import AggregateFn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from typing import TYPE_CHECKING
if TYPE_CHECKING:
<ray imports>
|
||
return DocSet(self._docset.context, DatasetScan(serialized)) | ||
|
||
def count(self) -> DocSet: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import count here
|
||
@staticmethod | ||
def update(embeddings, centroids, iterations, epsilon): | ||
i = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import AggregateFn here
8a6aadd
to
3653e2f
Compare
This includes generally three steps: 1. materialize a document's embedding 2. initialize centroids randomly 2. iterate the kmeans process until converge, this is based on ray dataset map group and aggregate operators. The result centroids could be used for downstream work.
907625d
to
fdfc705
Compare
fdfc705
to
59c08ad
Compare
Implement groupby based on ray dataset groupby and show how a general entity clustering could be used together with kmeans clustering.
59c08ad
to
b25ee78
Compare
Implement groupby based on ray dataset groupby and show how a general
entity clustering could be used together with kmeans clustering.