DATA-4320 - Implement BinaryDataToDataset in datamanager in RDK #5145

Open
wants to merge 15 commits into
base: main

@tahiyasalam (Member) commented Jul 16, 2025

Overview

This PR implements the new endpoint UploadBinaryDataToDataset as part of the DataManager Service. We also implement both UploadBinaryDataToDataset and UploadImageToDataset as part of the Go SDK, where the latter is just a convenience method that converts an image.Image into bytes before calling the same underlying upload binary data method.

Scope: https://docs.google.com/document/d/1YSJ3lTz5sC5HP5xguvd0VqSAQNoUVPSWrvj3zldW0Uo/edit?tab=t.0#heading=h.tcicyojyqi6c

Testing

  • Added tests in builtin_sync_test.go; looking for feedback on where else to add tests
  • Manual testing within a module ✅

@viambot added the "safe to test" label on Jul 16, 2025
@tahiyasalam changed the title from "WIP - Ability to add image to dataset in datamanager" to "DATA-4320 - Implement BinaryDataToDataset in datamanager in RDK" on Jul 17, 2025
go.mod Outdated
@@ -444,3 +444,5 @@ require (
github.com/ziutek/mymysql v1.5.4 // indirect
golang.org/x/exp v0.0.0-20240904232852-e7e105dedf7e
)

replace go.viam.com/api => github.com/viamrobotics/api v0.1.458-0.20250717192712-d9437d8203b6 // upload-to-dataset
Member

Just a reminder that we’ll have to update the API proper before merging

@@ -259,3 +260,34 @@ func lookupCollectorConfigsByResource(
}
return collectorConfigsByResource, nil
}

func (b *builtIn) UploadBinaryDataToDataset(ctx context.Context,
image []byte,
Member

Can we rename this binary_data?

fileContents = append(fileContents, datasetBytes...)
timeoutCtx, timeoutFn := context.WithTimeout(context.Background(), time.Second*5)
defer timeoutFn()
for {
Member

Sorry — not quite understanding why this is in a retry loop.

Member

Are we expecting UploadBinaryDataToDataset to potentially fail and need to be retried?

Member Author

I'm actually not sure; I was mainly following the structure above for syncing

if !tc.serviceFail {
// Validate first metadata message.
test.That(t, uploadCount.Load(), test.ShouldEqual, 1)
expectedUploadedCount := 1
Member

Can we make this a per-test variable? Meaning, TestA has an expectedUploadedCount of 3, TestB has 2...

Member Author

I think the test relies on conditionals and checks only the expected count within this condition, so I'd prefer to leave it as is, so as not to introduce an even larger diff, if that's ok
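
As a point of reference for the suggestion, a per-case expected count in a table-driven layout might look like the following sketch. The `testCase` fields other than `serviceFail` and `expectedUploadedCount`, and the `runUploads` stand-in, are hypothetical; the real test drives the sync service rather than a stub.

```go
package main

import "fmt"

// testCase carries its own expected upload count instead of relying on a
// value computed inside a conditional in the test body.
type testCase struct {
	name                  string
	serviceFail           bool
	uploads               int
	expectedUploadedCount int
}

// runUploads is a stand-in for the code under test: nothing is uploaded when
// the service fails, otherwise every requested upload goes through.
func runUploads(tc testCase) int {
	if tc.serviceFail {
		return 0
	}
	return tc.uploads
}

// checkCases compares each case's result against its own expectation.
func checkCases(cases []testCase) error {
	for _, tc := range cases {
		if got := runUploads(tc); got != tc.expectedUploadedCount {
			return fmt.Errorf("%s: got %d uploads, want %d", tc.name, got, tc.expectedUploadedCount)
		}
	}
	return nil
}

func main() {
	cases := []testCase{
		{name: "TestA", uploads: 3, expectedUploadedCount: 3},
		{name: "TestB", uploads: 2, expectedUploadedCount: 2},
		{name: "service failure", serviceFail: true, uploads: 2, expectedUploadedCount: 0},
	}
	if err := checkCases(cases); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("all cases passed")
}
```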

@@ -476,6 +480,35 @@ func (s *Sync) syncArbitraryFile(f *os.File, tags []string, fileLastModifiedMill
s.atomicUploadStats.arbitrary.uploadedBytes.Add(bytesUploaded)
}

// UploadBinaryDataToDataset simultaneously uploads binary data and adds it to a dataset.
func (s *Sync) UploadBinaryDataToDataset(ctx context.Context, data []byte, datasetIDs, tags []string, mimeType v1.MimeType) error {
Member

I think the ask was for the writing to file to happen async as well? Like, to wrap all of this into its own goroutine.

Member Author

done!

mimeType datasyncpb.MimeType,
extra map[string]interface{},
) error {
ext, err := protoutils.StructToStructPb(extra)
Member

Nit: can we change this to extra? We also have the word "extension" being thrown around, and that's shortened to ext.

@@ -259,3 +260,34 @@ func lookupCollectorConfigsByResource(
}
return collectorConfigsByResource, nil
}

func (b *builtIn) UploadBinaryDataToDataset(ctx context.Context,
Member

Suggested change
- func (b *builtIn) UploadBinaryDataToDataset(ctx context.Context,
+ func (b *builtIn) UploadBinaryDataToDatasets(ctx context.Context,

Member

Although I guess it looks like the name was already solidified?

Member Author

ended up changing this, thank you!

return err
}
return b.sync.UploadBinaryDataToDataset(ctx, imgBytes, datasetIDs, tags, mimeType)
}
Member

Are errors from these new functions all safe to surface to the user?

Member Author

Yes, I believe these errors should be safe to surface to the user

- if info.IsDir() && info.Name() == FailedDir {
+ // Do not sync the files in the corrupted data directory or in the directory that holds files
+ // that are simultaneously uploaded and added to a dataset.
+ if info.IsDir() && (info.Name() == FailedDir || info.Name() == DatasetDir) {
Member

I'd be more comfortable if the skipped directory had a name that was very unlikely to be used by users. "dataset" doesn't feel out of the realm of possibility 🤔

Member Author

This directory would be .viam/capture/dataset, not just the directory dataset. I can change it to viamUploadToDataset or something? Open to suggestions.


func (c *client) UploadBinaryDataToDataset(
ctx context.Context,
image []byte,
Member

Suggested change
- image []byte,
+ binaryData []byte,

Member Author

updated!

}

// ConvertImageToBytes converts an image.Image to a byte slice based on the specified MIME type.
func ConvertImageToBytes(image image.Image, mimeType datasyncpb.MimeType) ([]byte, error) {
Member

Question: Do we have concerns about loading the entire image into memory?

Member Author

I don't think so, though I can add this as a follow up. This is a paradigm that's repeated throughout RDK

@@ -34,6 +34,21 @@ func (server *serviceServer) Sync(ctx context.Context, req *pb.SyncRequest) (*pb
return &pb.SyncResponse{}, nil
}

func (server *serviceServer) UploadImageToDataset(
Member

Is the naming of this confusing?

Member Author

updated, was a typo

binaryData []byte,
datasetIDs, tags []string,
mimeType v1.MimeType,
extra map[string]interface{},
Member

Doesn't look like extra is used?

Member Author

updated

@katiepeters
Member

Did a quick pass, but I'm going to add @vijayvuyyuru because I'm going to be mostly OOO next week.

@katiepeters katiepeters requested a review from vijayvuyyuru July 18, 2025 20:17
mimeType v1.MimeType,
_ map[string]interface{},
) error {
b.logger.Info("UploadBinaryDataToDatasets START")
Member

I think these should be debug level logs? Seems excessive on every upload call.

mimeType v1.MimeType,
_ map[string]interface{},
) error {
b.logger.Info("UploadImageToDataset START")
Member

Same here

Member Author

u got it

) error {
b.logger.Info("UploadImageToDataset START")
defer b.logger.Info("UploadImageToDataset END")
b.mu.Lock()
Member

I feel like this lock should only be held around the upload itself? It seems safe to leave it unlocked until the image has been converted.

Member Author

yep!

binaryData []byte,
datasetIDs, tags []string,
mimeType v1.MimeType,
_ map[string]interface{},
Member

I think these should be passed through as part of the Extra field. Same for UploadBinaryImage

Member Author

This method calls the sync client's UploadBinaryDataToDatasets and that does not take an extra

go func() {
defer close(errChan)
// Create a new directory CaptureDir/DatasetDir
newDir := filepath.Join(s.config.CaptureDir, DatasetDir)
Member

[q] Feel free to ignore this, but I feel like this logic could be moved to reconfigure? That is, make the directory there, since that information only changes on reconfigure, as opposed to trying to do it on each upload call.

Member Author

I think we would only want to create this directory if we were actively trying to make the upload call. If the directory already exists, this should just be a no-op

return err
}

imgBytes, err := ConvertImageToBytes(image, mimeType)
Member

[nit] Could we change this function to just call c.UploadBinaryDataToDataSets? That way we don't need the marshalling of extra from map -> structpb defined twice.

Member Author

done!
