created map-enc to force main encoding to specific type #27
base: master
Conversation
Created map-enc and used it to fix a problem I had inserting rows into BigQuery, caused by the default main output encoding of the df-map function.
GitHub's inability to notify me about PRs I actually care about continues... sorry for the delay @boymaas, we'll look at this.
OK, some initial thoughts. First, thank you for submitting a patch! I believe you're right that there's a bug here; I don't see how that function could work without something specifying that the output coder is a TableRow coder of some type.

I'm currently a bit torn about whether expanding the number of pipeline functions so that we have ones that take a Coder is the best way forward. Another possible option would be a function that actually takes the PCollection and sets the output coder, which would mean we could reuse that function for df-map, df-mapcat, or even df-apply-dofn and pardo/create-and-apply. This would look like our existing
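As a rough illustration of that option (with-coder is a hypothetical name, not the existing helper being referred to; it assumes the SDK's PCollection.setCoder, which returns the collection, and TableRowJsonCoder is just an example coder):

(defn with-coder
  "Sets the output coder of a PCollection and returns it, so it can be
   chained after df-map / df-mapcat / df-apply-dofn."
  [pcoll coder]
  (.setCoder pcoll coder))

;; e.g. (-> pcoll
;;          (df-map "to-table-row" #'row->table-row)
;;          (with-coder (TableRowJsonCoder/of)))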
Or yet another option would be to create versions of df-map/df-mapcat etc. that take an options map which gets merged with the one passed to create-and-apply, allowing you to override certain things. This is all really a question of where we want to go with the API, so I think we want to wait till next week when @pelletier can weigh in.

Thank you for changing the string concatenation to a composite; that's inarguably better. Finally, I need to figure out what we need to do to accept patches from people who aren't Zendesk employees.
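To make the options-map alternative mentioned above concrete, a call might read something like this (the extra map argument and the :coder key are hypothetical, only to show the shape of the API being discussed):

;; an extra options map that would get merged with the one already
;; passed to create-and-apply
(df-map pcoll "to-table-row" #'row->table-row
        {:coder (TableRowJsonCoder/of)})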
No problem, thank you for open-sourcing this library! It has saved me a lot of work. map-enc was a quick fix to get things working until I had confirmed this was a bug and not me using the library the wrong way. I prefer your solution, as personally I favour composition above all else. The only drawback I see is that, assuming that solution would add another step to the pipeline, it would make the whole processing pipeline a bit less efficient. If that is the case I would go for options.

Ideal, of course, would be to rewrite the composition part of the code to use edn data structures which you can build/compose and compile down to a pipeline, "onyx style". This would allow for further optimisations at a later stage. Most of the hard stuff has already been tackled.
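A rough illustration of what such an edn-based pipeline description could look like (every key, step name, and the build-pipeline! function below are hypothetical, purely to sketch the idea of building the pipeline from data and compiling it down to Beam transforms):

(def pipeline-spec
  [{:name "read-events"  :op :read-text :from "gs://bucket/events/*.json"}
   {:name "to-table-row" :op :map       :fn ::row->table-row
    :coder :table-row-json}
   {:name "write-rows"   :op :bq-write  :table "dataset.events"}])

;; a hypothetical (build-pipeline! pipeline-spec) would walk this vector and
;; apply the corresponding df-map / io transforms in order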
It wouldn't actually add another step to the pipeline; from a Beam point of view it's identical, just different API syntax from our point of view. No performance impact.

The idea has merit. The tricky bit is figuring out how to do it in such a way that changes/extensions to Beam can be integrated into that data structure with minimal or no work. For the time being the Beam API has been changing so rapidly and dramatically that it hasn't really been possible.
@boymaas Sorry for the late reply. We discussed it internally but forgot to reply back to this pull request. I do not want to continue expanding the number of df-* functions.
Alternatively, it might be nice to accept & args in the df-* functions. A need I have is getting the context and window; you could also specify the coders there. Something like: (df-map pcoll "transform" #'transform-fn :with-context :with-window :with-coder (Coder.))
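A sketch of how such trailing arguments might be parsed before being handed to create-and-apply (the helper name and the exact set of recognised keywords are hypothetical, just to make the proposal concrete):

(defn parse-df-args
  "Turns trailing df-* arguments into an options map: bare keywords become
   boolean flags, :with-coder consumes the value that follows it."
  [args]
  (loop [opts {} args args]
    (if (empty? args)
      opts
      (let [[k & more] args]
        (if (= k :with-coder)
          (recur (assoc opts :coder (first more)) (rest more))
          (recur (assoc opts k true) more))))))

;; (parse-df-args [:with-context :with-window :with-coder (TableRowJsonCoder/of)])
;; => {:with-context true, :with-window true, :coder <a TableRowJsonCoder>}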
I needed bigquery-io write-with-schema to work and as such changed the following. Maybe I am missing something; if not, this could be a potential fix for other places that have the same problem.
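For reference, a minimal sketch of the kind of helper this change describes; the actual map-enc in the diff may differ in name and arity, it assumes setCoder returns the collection, and TableRowJsonCoder is only an example of a coder you might force:

(defn map-enc
  "Like df-map, but forces the output coder of the resulting PCollection,
   e.g. to a TableRow coder so write-with-schema can consume it."
  [pcoll name f coder]
  (-> pcoll
      (df-map name f)
      (.setCoder coder)))

;; (map-enc pcoll "to-table-row" #'row->table-row (TableRowJsonCoder/of))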