Add a groupby operator #1123

Merged
merged 2 commits into from
Feb 3, 2025

bohou-aryn
Collaborator

Implement groupby based on ray dataset groupby and show how a general
entity clustering could be used together with kmeans clustering.

Collaborator

@HenryL27 HenryL27 left a comment

lgtm. What would it take to abstract this such that it works in both ray and local modes?
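One possible direction for the local-mode question, as a rough sketch only: a plain-Python fallback that groups in memory when Ray is not in use. The function name `local_groupby_count`, the dict-based documents, and the output field names `key` and `count` are assumptions made for illustration; they are not the shape this PR produces.

```python
from collections import Counter


def local_groupby_count(docs, field):
    """Count documents per distinct value of `field`, entirely in memory.

    Hypothetical local-mode stand-in; the PR itself delegates to Ray.
    """
    counts = Counter(doc[field] for doc in docs)
    # One output record per group, in first-seen order.
    return [{"key": value, "count": n} for value, n in counts.items()]


fruits = [
    {"text_representation": "apple"},
    {"text_representation": "banana"},
    {"text_representation": "apple"},
]
result = local_groupby_count(fruits, "text_representation")
```

A shared interface would then only need to pick between this path and the Ray path based on the context's execution mode.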

def init_embedding(row):
    doc = Document.from_row(row)
    return {"vector": doc.embedding, "cluster": -1}

can you add an "assert self.context.exec_mode == ExecMode.RAY" in here?
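A minimal sketch of what such a guard could look like. `ExecMode`, `Context`, and `GroupBySketch` below are simplified stand-ins written for this example, not the project's real classes; only the assert line mirrors the request.

```python
from enum import Enum


class ExecMode(Enum):
    """Stand-in for the project's ExecMode; only the names matter here."""

    RAY = "ray"
    LOCAL = "local"


class Context:
    """Hypothetical context object carrying the execution mode."""

    def __init__(self, exec_mode):
        self.exec_mode = exec_mode


class GroupBySketch:
    """Hypothetical holder for a Ray-only operation."""

    def __init__(self, context):
        self.context = context

    def count(self):
        # Fail fast with a readable message instead of a Ray error later on.
        assert self.context.exec_mode == ExecMode.RAY, (
            f"groupby needs ExecMode.RAY, got {self.context.exec_mode}"
        )
        return "delegates to the Ray dataset here"
```

Calling `count()` with a local context raises `AssertionError` immediately, which makes the Ray-only limitation explicit at the call site.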

        return context.read.document(doc_list)

    def test_groupby_count(self, fruits_docset):
        aggregated = fruits_docset.groupby("text_representation").count()

What do the Documents in aggregated look like at this point?

Comment on lines +39 to +41
def init(embeddings, K, init_mode):
    if init_mode == "random":

supernit: init_mode could be an Enum but str is fine too. I guess it would be nice to have the list of known init_modes in the exception?
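For illustration, the Enum variant with the known modes listed in the error might look like the sketch below. `InitMode` and `parse_init_mode` are hypothetical names; `"random"` is the only mode visible in the diff, so it is the only member shown.

```python
from enum import Enum


class InitMode(str, Enum):
    """Hypothetical enum of centroid initialization modes."""

    RANDOM = "random"


def parse_init_mode(init_mode):
    """Accept a str or InitMode; list the known modes when rejecting."""
    try:
        return InitMode(init_mode)
    except ValueError:
        known = ", ".join(mode.value for mode in InitMode)
        raise ValueError(
            f"Unknown init_mode {init_mode!r}; known modes: {known}"
        ) from None
```

Because `InitMode` also subclasses `str`, existing callers that pass `"random"` would keep working unchanged.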

Comment on lines 1 to 2
from ray.data._internal.aggregate import Count
from ray.data.aggregate import AggregateFn

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    <ray imports>
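Spelled out, the suggestion might look roughly like the following sketch. It assumes `Count` is importable from the public `ray.data.aggregate` module rather than the `_internal` path; `make_count` is a hypothetical helper name.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; never executed at runtime, so importing
    # this module does not require Ray to be installed.
    from ray.data.aggregate import Count


def make_count() -> "Count":
    # Deferred import: Ray is loaded only when the function is called.
    from ray.data.aggregate import Count

    return Count()
```

The string annotation keeps the signature typed while the module stays importable in environments that run without Ray.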


        return DocSet(self._docset.context, DatasetScan(serialized))

    def count(self) -> DocSet:

import Count here


    @staticmethod
    def update(embeddings, centroids, iterations, epsilon):
        i = 0

import AggregateFn here

@bohou-aryn bohou-aryn force-pushed the clustering branch 2 times, most recently from 8a6aadd to 3653e2f Compare January 31, 2025 22:32
This generally includes three steps:
1. materialize each document's embedding
2. initialize centroids randomly
3. iterate the kmeans process until convergence; this is based on ray
   dataset map groups and aggregate operators.

The resulting centroids can be used for downstream work.
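The steps listed above can be sketched in plain Python as follows. This is an in-memory illustration, not the PR's Ray-based code: the `init` and `update` signatures follow the snippets quoted in the review, while the `seed` parameter, the `closest` helper, the Euclidean distance, and the handling of empty clusters are assumptions.

```python
import math
import random


def init(embeddings, K, init_mode="random", seed=0):
    """Pick K starting centroids; only the "random" mode is sketched."""
    if init_mode != "random":
        raise ValueError(f"Unknown init_mode: {init_mode!r}")
    rng = random.Random(seed)
    return [list(vector) for vector in rng.sample(embeddings, K)]


def closest(vector, centroids):
    """Index of the centroid nearest to vector (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))


def update(embeddings, centroids, iterations, epsilon):
    """Repeat assign/recompute until no centroid moves epsilon or more."""
    for _ in range(iterations):
        # Group step: bucket every vector under its nearest centroid.
        groups = {}
        for vector in embeddings:
            groups.setdefault(closest(vector, centroids), []).append(vector)
        # Aggregate step: each new centroid is the mean of its group.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = groups.get(c)
            if not members:
                new_centroids.append(old)  # assumption: empty clusters stay put
                continue
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < epsilon:
            break
    return centroids


embeddings = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids = update(embeddings, init(embeddings, K=2), iterations=10, epsilon=1e-6)
```

In the Ray version, the group step would presumably correspond to a groupby on the assigned cluster and the mean to an aggregate over each group.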
@bohou-aryn bohou-aryn force-pushed the clustering branch 2 times, most recently from 907625d to fdfc705 Compare February 3, 2025 21:16
@bohou-aryn bohou-aryn enabled auto-merge (rebase) February 3, 2025 21:40
@bohou-aryn bohou-aryn disabled auto-merge February 3, 2025 21:40
@bohou-aryn bohou-aryn merged commit 3c8831d into main Feb 3, 2025
12 of 15 checks passed

2 participants