This repository was archived by the owner on Nov 23, 2024. It is now read-only.

created map-enc to force main encoding to specific type #27

Open
wants to merge 1 commit into
base: master

Conversation

boymaas

@boymaas boymaas commented Jan 7, 2018

Needed bigquery-io write-with-schema to work, and as such changed the following. Maybe I am missing something; if not, this could be a potential fix for other places that have the same problem.

and used this to fix a problem I had inserting rows into bigquery
due to the default main output encoding of the df-map function
@bfabry
Contributor

bfabry commented Jan 8, 2018

GitHub's inability to notify me about PRs I actually care about continues... sorry for the delay @boymaas, we'll look at this.

@bfabry
Contributor

bfabry commented Jan 8, 2018

ok, some initial thoughts:

First, thank you for submitting a patch!

I believe you're right there's a bug here. I don't see how that function could work without something specifying that the output coder is a TableRow coder of some type.

I'm currently a bit torn about whether expanding the number of pipeline functions so that we have variants that take a Coder is the best way forward. Another possible option would be a function that takes the PCollection and sets the output coder, which would mean we could reuse that one function for df-map, df-mapcat, or even df-apply-dofn and pardo/create-and-apply. It would look like our existing df/output-is-kv and would be applied something like:

    (-> pcoll
        (df/df-map "make-table-row" [#'table-row-maker schema])
        (df/df-output-is (TableRowJsonCoder/of))
        (.apply "write-to-bquery" (-> (BigQueryIO/writeTableRows)
                                      (.to output)
                                      (.withSchema bq-schema)
                                      (.withWriteDisposition (write-disposition WriteDispositions))
                                      (.withCreateDisposition (create-disposition CreateDispositions)))))

Yet another option would be to create versions of df-map/df-mapcat etc. that take an options map which gets merged with the one passed to create-and-apply, allowing you to override certain things. This is all really a question of where we want to go with the API, so I think we want to wait till next week when @pelletier can weigh in.
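
Purely for illustration, a rough sketch of what such an options-map variant might look like; the names df-map-with-opts and default-df-map-opts, and the way pardo/create-and-apply is called here, are assumptions rather than existing code:

    ;; Hypothetical sketch only. A df-map variant that takes a trailing options
    ;; map and merges it over the defaults before delegating; the signature of
    ;; pardo/create-and-apply is assumed here, not taken from the real code.
    (def default-df-map-opts
      ;; placeholder defaults; the real option keys are not specified here
      {})

    (defn df-map-with-opts
      ([pcoll step-name fn-and-args]
       (df-map-with-opts pcoll step-name fn-and-args {}))
      ([pcoll step-name fn-and-args opts]
       (pardo/create-and-apply pcoll step-name fn-and-args
                               (merge default-df-map-opts opts))))

    ;; e.g. overriding the output coder for a single step:
    ;; (df-map-with-opts pcoll "make-table-row" [#'table-row-maker schema]
    ;;                   {:output-coder (TableRowJsonCoder/of)})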

Thank you for changing the string concatenation to a composite. That's inarguably better.

Finally, I need to figure out what we need to do to accept patches from people who aren't zendesk employees.

@boymaas
Author

boymaas commented Jan 9, 2018

No problem, and thank you for open-sourcing this library! It has saved me a lot of work. map-enc was a quick fix to get things working until I had confirmed this was a bug, and not me using the library the wrong way.

I prefer your solution, as personally I favour composition above all else. The only drawback I see is that in the solution below:

    (-> pcoll
        (df/df-map "make-table-row" [#'table-row-maker schema])
        (df/df-output-is (TableRowJsonCoder/of))
        (.apply "write-to-bquery" (-> (BigQueryIO/writeTableRows)
                                      (.to output)
                                      (.withSchema bq-schema)
                                      (.withWriteDisposition (write-disposition WriteDispositions))
                                      (.withCreateDisposition (create-disposition CreateDispositions)))))

assuming this would add another step in the pipeline, it would make the whole processing pipeline a bit less efficient. If that is the case I would go for options.

Ideally, of course, the composition part of the code would be rewritten to use edn data structures that you can build/compose and compile down to a pipeline, "onyx style". This would allow for further optimisations at a later stage. Most of the hard stuff has already been tackled.
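
Purely as an illustration of the pipeline-as-data idea, an edn description might look something like the following; every key and step name here is made up for the example, none of it exists in the library:

    ;; Illustrative only: a pipeline described as data, "onyx style". A compiler
    ;; function would walk this vector and build the corresponding Beam
    ;; transforms. None of these keys are part of the current library.
    (def pipeline-spec
      [{:step         :df-map
        :name         "make-table-row"
        :fn           'table-row-maker          ;; var to resolve at compile time
        :args         '[schema]
        :output-coder :table-row-json}
       {:step   :bigquery-write
        :name   "write-to-bquery"
        :table  "my-project:my_dataset.my_table" ;; placeholder destination
        :schema 'bq-schema}])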

@bfabry
Contributor

bfabry commented Jan 9, 2018

assuming this would add another step in the pipeline, it would make the whole processing pipeline a bit less efficient. If that is the case I would go for options.

It wouldn't actually add another step to the pipeline; from a Beam point of view it's identical, just different API syntax on our side. No performance impact.

Ideally, of course, the composition part of the code would be rewritten to use edn data structures that you can build/compose and compile down to a pipeline, "onyx style". This would allow for further optimisations at a later stage. Most of the hard stuff has already been tackled.

The idea has merit; the tricky bit is figuring out how to do it in such a way that changes/extensions to Beam can be integrated into that data structure with minimal or no work. For the time being the Beam API has been changing so rapidly and dramatically that it hasn't really been possible.

@pelletier
Contributor

@boymaas Sorry for the late reply. We discussed it internally but forgot to reply back to this pull request.

I do not want to continue expanding the number of df-* functions we have. I'd prefer to see a set-coder function that can be applied to a pipeline to specify the coder to use. This mimics the Beam API (setCoder()) and offers the most flexibility.
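
For illustration, a minimal sketch of such a set-coder, assuming it is nothing more than a thin wrapper over Beam's PCollection.setCoder (the name and final placement are open):

    ;; Hypothetical set-coder helper. PCollection.setCoder returns the
    ;; PCollection itself, so this threads cleanly with ->.
    (defn set-coder
      [^org.apache.beam.sdk.values.PCollection pcoll
       ^org.apache.beam.sdk.coders.Coder coder]
      (.setCoder pcoll coder))

    ;; usage:
    ;; (-> pcoll
    ;;     (df/df-map "make-table-row" [#'table-row-maker schema])
    ;;     (set-coder (TableRowJsonCoder/of)))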

@dgriff1
Contributor

dgriff1 commented Jan 29, 2019

Alternatively, it might be nice to accept & args in the df-* functions. A need I have is getting the context and window, and you could also specify the coders there. Something like (df-map pcoll "transform" #'transform-fn :with-context :with-window :with-coder (Coder.))
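
As a purely illustrative sketch (not library code), trailing flags like these could be parsed into an options map along these lines:

    ;; Illustrative only: turn trailing flags into an options map. Bare flags
    ;; (:with-context, :with-window) become booleans; :with-coder consumes the
    ;; following value as the Coder to set on the output.
    (defn parse-df-flags
      [flags]
      (loop [flags (seq flags), opts {}]
        (if-let [[flag & more] flags]
          (case flag
            :with-context (recur more (assoc opts :context? true))
            :with-window  (recur more (assoc opts :window? true))
            :with-coder   (recur (next more) (assoc opts :coder (first more)))
            (recur more opts))
          opts)))

    ;; (parse-df-flags [:with-context :with-coder (TableRowJsonCoder/of)])
    ;; => {:context? true, :coder <TableRowJsonCoder instance>}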
