Enhancement: Implement VectorizedScalarExtractor #139

Open
simonmandlik opened this issue Jun 7, 2024 · 0 comments
Labels
enhancement New feature or request

simonmandlik commented Jun 7, 2024

Implement a VectorizedScalarExtractor structure, with which we could group extractors like the following into a single node:

julia> schema(JSON.parse.([
  """ { "a": 1, "b": 2, "c": 3 } """,
  """ { "b": 3, "c": 4 } """
  ])) |> suggestextractor
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ├── b: CategoricalExtractor(n=3)
  ╰── c: CategoricalExtractor(n=3)

This is mainly an optimization.

The corresponding suggestextractor rule is already written, but commented out.

A possible starting point is the old implementation, which we just need to port to the new version. The old version doesn't implement normalization the way ScalarExtractor does, which would also be desirable in the new version.
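To make the normalization point concrete, here is a minimal plain-Julia sketch of what a vectorized node could compute. It is not the JsonGrinder.jl API: the struct name `VectorizedScalarSketch`, its fields, and `extract_row` are all hypothetical, and the shift-then-scale normalization `s * (x - c)` per key is an assumption about what "normalization like ScalarExtractor" would mean here.

```julia
# Hypothetical sketch, independent of JsonGrinder.jl.
struct VectorizedScalarSketch
    keys::Vector{String}   # dictionary keys handled by this single node
    c::Vector{Float32}     # assumed per-key shift
    s::Vector{Float32}     # assumed per-key scale
end

# Extract one observation as a vector with one entry per key.
# Keys absent from the document stay `missing`.
function extract_row(e::VectorizedScalarSketch, d::AbstractDict)
    out = Vector{Union{Missing, Float32}}(missing, length(e.keys))
    for (i, k) in enumerate(e.keys)
        if haskey(d, k)
            out[i] = e.s[i] * (Float32(d[k]) - e.c[i])
        end
    end
    return out
end

# Usage on the second document from the example above (key "a" is absent).
e = VectorizedScalarSketch(["a", "b", "c"], Float32[0, 2, 3], Float32[1, 1, 0.5])
x = extract_row(e, Dict("b" => 3, "c" => 4))
```

Keeping the shift and scale as per-key vectors lets a single node normalize every column at once, which is where the optimization would come from.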

As the example shows, we also need to make sure that StableExtractor wrapping works as expected.

One important design decision is how to build the extractor tree. If we replace several ScalarExtractor nodes with a single VectorizedScalarExtractor, the extractor will have fewer nodes than, e.g., the corresponding Schema, breaking code such as HierarchicalUtils.jl traversals. On the other hand, such an extractor would have the same structure as the resulting model. We should test what exactly the implications are. One possible solution is to add "ghost nodes" as children of VectorizedScalarExtractor that would never be used, but would complete the tree.
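The mismatch and the "ghost node" idea can be illustrated with a toy tree in plain Julia. This is only one possible reading of the proposal: all types below (`SketchNode`, `LeafNode`, `DictNode`, `VectorizedNode`), the `"vec"` key, and the `leafkeys` helper are hypothetical and do not use HierarchicalUtils.jl or JsonGrinder.jl.

```julia
# Hypothetical node types for illustration only.
abstract type SketchNode end

struct LeafNode <: SketchNode
    name::String
end

struct DictNode <: SketchNode
    children::Dict{String, SketchNode}
end

# One node standing in for several scalar leaves. `ghosts` are never used for
# extraction; they exist only so that a traversal still finds one child per key.
struct VectorizedNode <: SketchNode
    keys::Vector{String}
    ghosts::Dict{String, SketchNode}
end

# Collect the sorted names of all leaves reachable from a node.
childkeys(children) =
    sort!(reduce(vcat, [leafkeys(c) for c in values(children)]; init=String[]))
leafkeys(n::LeafNode) = [n.name]
leafkeys(n::DictNode) = childkeys(n.children)
leafkeys(n::VectorizedNode) = childkeys(n.ghosts)

ks = ["a", "b", "c"]
leaves = Dict{String, SketchNode}(k => LeafNode(k) for k in ks)

# Schema-like tree: one leaf per key.
schema_like = DictNode(leaves)
# Extractor tree with a single vectorized node and no ghosts: the leaves vanish.
no_ghosts = DictNode(Dict{String, SketchNode}(
    "vec" => VectorizedNode(ks, Dict{String, SketchNode}())))
# Same tree with ghost children: every key is reachable again.
with_ghosts = DictNode(Dict{String, SketchNode}(
    "vec" => VectorizedNode(ks, leaves)))
```

In this toy version the ghosts restore the set of reachable leaves, but they sit one level deeper than in the schema-like tree, so a traversal that walks both trees in lockstep would likely still need special handling.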

@simonmandlik simonmandlik added the enhancement New feature or request label Jun 7, 2024