Simple includes implementation #2483

Open: 22 commits to merge into base branch main

marc-gr
Contributor

@marc-gr marc-gr commented Mar 18, 2025

This is an initial implementation for the elastic/package-spec#89 proposal.

package-spec branch with the changes in https://github.com/elastic/package-spec/compare/main...marc-gr:package-spec:feat/includes?expand=1

With the current changes, a _dev/shared/includes.yml file can describe files to copy from other packages/data_streams at build time.

Instead of making this an invisible process I initially opted to commit the files explicitly, to ease debugging, and elastic-package check takes care of noticing if they are out of sync, while build copies them.

The initial layout of the file is quite naive and just to prove the point.

Summary:

  • Adds an includes.yml that describes which files to copy from other packages (or from the same one) and where to put them
  • Allows a _dev/shared/files/* path to hold arbitrary files that can be shared between data streams, e.g. field definitions
  • Adds a new step to the check command to notice any out of date files.
  • Adds a new step to the build command to copy the files.
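
As a rough illustration of the last two bullets, the check and build steps could be sketched as follows. This is a minimal Python sketch and not the code in this PR (elastic-package is written in Go). The function names `outdated_includes` and `copy_includes` are hypothetical, the entries are assumed to be the already parsed `package`/`from`/`to` items of includes.yml, and sibling packages are assumed to live under a common `packages_root`.

```python
import difflib
import shutil
from pathlib import Path


def outdated_includes(entries, packages_root, package_dir):
    """Return {destination: unified diff} for copies that no longer match their source."""
    outdated = {}
    for entry in entries:
        src = Path(packages_root) / entry["package"] / entry["from"]
        dst = Path(package_dir) / entry["to"]
        source_lines = src.read_text().splitlines(keepends=True)
        # A missing destination counts as outdated (compared against empty content).
        copied_lines = dst.read_text().splitlines(keepends=True) if dst.exists() else []
        if source_lines != copied_lines:
            diff = difflib.unified_diff(copied_lines, source_lines, "destination", "source")
            outdated[entry["to"]] = "".join(diff)
    return outdated


def copy_includes(entries, packages_root, package_dir):
    """Copy every included file to its destination, as a build step might."""
    for entry in entries:
        src = Path(packages_root) / entry["package"] / entry["from"]
        dst = Path(package_dir) / entry["to"]
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
```

A check step would fail when `outdated_includes` returns a non-empty mapping and print the diffs; running `copy_includes` afterwards brings the destinations back in sync.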

Considerations:

  • Since this will always get files from the latest versions, we need to add tooling/CI steps to trigger tests for any integration that depends on a package

Example usage:

With an includes.yml in windows/_dev/shared like

- package: system
  from: data_stream/security/elasticsearch/ingest_pipeline/default.yml
  to: data_stream/forwarded/elasticsearch/ingest_pipeline/security_default.yml
- package: system
  from: data_stream/security/elasticsearch/ingest_pipeline/standard.yml
  to: data_stream/forwarded/elasticsearch/ingest_pipeline/security_standard.yml

elastic-package check
Lint the package
data_stream/forwarded/elasticsearch/ingest_pipeline/security_default.yml is outdated. Rebuild the package with 'elastic-package build'
--- want
+++ got
@@ -8,3 +8,3 @@
   - pipeline:
-      name: '{{ IngestPipeline "security_standard" }}'
+      name: '{{ IngestPipeline "standard" }}'
       if: 'ctx.winlog?.provider_name != null && ["Microsoft-Windows-Eventlog", "Microsoft-Windows-Security-Auditing"].contains(ctx.winlog.provider_name)'
@@ -52,3 +52,3 @@
       field: ecs.version
-      value: '8.17.0'
+      value: '8.11.0'
   - set:
data_stream/forwarded/elasticsearch/ingest_pipeline/security_standard.yml is outdated. Rebuild the package with 'elastic-package build'
Error: checking package failed: checking included files are up-to-date failed: files do not match

marc@tp:~/integrations/packages/windows$ elastic-package build
Build the package
system/data_stream/security/elasticsearch/ingest_pipeline/default.yml file copied to: data_stream/forwarded/elasticsearch/ingest_pipeline/security_default.yml
system/data_stream/security/elasticsearch/ingest_pipeline/standard.yml file copied to: data_stream/forwarded/elasticsearch/ingest_pipeline/security_standard.yml
README.md file rendered: /home/marc/integrations/packages/windows/docs/README.md
2025/03/18 11:44:09  INFO License text found in "/home/marc/integrations/LICENSE.txt" will be included in package
Package built: /home/marc/integrations/build/packages/windows-2.5.2.zip
Done

marc@tp:~/integrations/packages/windows$ elastic-package check
Lint the package
Done
Build the package
system/data_stream/security/elasticsearch/ingest_pipeline/default.yml file copied to: data_stream/forwarded/elasticsearch/ingest_pipeline/security_default.yml
system/data_stream/security/elasticsearch/ingest_pipeline/standard.yml file copied to: data_stream/forwarded/elasticsearch/ingest_pipeline/security_standard.yml
README.md file rendered: /home/marc/integrations/packages/windows/docs/README.md
2025/03/18 11:44:25  INFO License text found in "/home/marc/integrations/LICENSE.txt" will be included in package
Package built: /home/marc/integrations/build/packages/windows-2.5.2.zip
Done

@marc-gr marc-gr added the enhancement New feature or request label Mar 18, 2025
@marc-gr marc-gr requested a review from jsoriano March 18, 2025 10:44
@marc-gr
Contributor Author

marc-gr commented Mar 18, 2025

I would like to see if the approach makes sense and any other considerations before moving along with a more complete solution

@jsoriano
Member

jsoriano commented Mar 18, 2025

Thanks, in general this approach looks good to me. I like that it doesn't need any change on the installation process as everything happens during build.

Let me discuss some details.

With the current changes, a _dev/shared/includes.yml file can describe files to copy from other packages/data_streams at build time.

I think we could place this file under _dev/build, and the shared files directly under _dev/build/shared.
We could even consider including the information in _dev/build/build.yml, though not a strong opinion about this.

Instead of making this an invisible process I initially opted to commit the files explicitly, to ease debugging, and elastic-package check takes care of noticing if they are out of sync, while build copies them.

In principle I would prefer to make this invisible, during build. We already have some "invisible" steps during builds, such as resolution of ECS fields, or some processing in dashboards.
Having to keep files in sync is always a source of papercuts.

Though if we make this completely invisible we would have to check that everything uses the built packages, I am not sure now if fields validation and pipeline tests work with built packages or with the source files.

Maybe a third option is to keep a list of checksums of imported files, something similar to go.mod, but this would also require some syncing, so I am not sure it is worth considering.

Another problem we may have is that CI in the integrations repository is aware of packages now, and on PRs builds are only executed for modified packages. We need to make CI aware of these includes, so if some packages use a file from the system package, and this file changes, all affected packages are tested too.

- package: system
  from: data_stream/security/elasticsearch/ingest_pipeline/default.yml
  to: data_stream/forwarded/elasticsearch/ingest_pipeline/security_default.yml

Not sure about package. This is a key that assumes a repository structure, and will only be useful in repositories that follow it (ok, this is our main use case with the integrations repository :D but this is circumstantial).

The configuration above could be also expressed like the following, that doesn't need to make any assumption on the structure of the repository, and is not so different:

- from: ../system/data_stream/security/elasticsearch/ingest_pipeline/default.yml
  to: data_stream/forwarded/elasticsearch/ingest_pipeline/security_default.yml

This would also allow including files located in other parts of the repository. For example, the APM package used to live in the repository of the APM server; they could have shared files with this approach.

In any case we have to be very careful not to allow path traversal outside the root of the repository. Maybe we can leverage the new OpenRoot here.

@marc-gr
Contributor Author

marc-gr commented Mar 18, 2025

Thanks for taking a look!

I think we could place this file under _dev/build, and the shared files directly under _dev/build/shared. We could even consider including the information in _dev/build/build.yml, though not a strong opinion about this.

I like this one 👍

In principle I would prefer to make this invisible, during build. We already have some "invisible" steps during builds, such as resolution of ECS fields, or some processing in dashboards. Having to keep files in sync is always a source of papercuts.

I favored making it explicit with the dev experience in mind. We will mostly use this with file definitions and pipelines. I can imagine some frustration figuring out what happened when things do not work as expected. On top of that, copying files on the fly and cleaning them up after packaging felt cumbersome, and prone to polluting the repo if something goes wrong at some point. But it is mostly a personal preference, so I am not against any other option if it is the preferred approach.

Another problem we may have is that CI in the integrations repository is aware of packages now, and on PRs builds are only executed for modified packages. We need to make CI aware of these includes, so if some packages use a file from the system package, and this file changes, all affected packages are tested too.

Correct, I was thinking about adding a new command to elastic-package, like elastic-package included or similar, that lists any "importers" so that we can trigger the appropriate steps.

About removing the package key, sounds good to me 👍

Will do the mentioned changes while we discuss the rest of things.

Contributor

@chrisberkhout chrisberkhout left a comment

Thanks Marc, I really want something like this.

In addition to sharing field definitions between data streams, I would have liked to have this functionality for

  • input configuration shared across data streams
  • field definitions shared between source data stream and transform

I don't have any strong recommendations here, just some thoughts...

Inline vs whole file

Some of the issue discussion talks about inline includes, but the implementation here is on the whole file level. I think that's enough. It keeps things simple while allowing use with files of different types.

Dev experience positives and negatives

In terms of dev experience, I see these positives currently:

  • copying the file into its final destination means you can read it in context or find its content with grep.
  • showing diffs means you save some time figuring out the mismatch

However, if each data stream has a certain file and I want to modify it, I may not realize before my edit that it is supposed to stay in sync with the others, and I need to consult the includes.yml to know which of the files is the source and which are the destinations.

Currently there's no help with conflict resolution. I may overwrite my new changes to a destination file by rebuilding the package.

An alternative to includes.yml

An alternative would be to have the copied files only go into the package, and instead of includes.yml, there could be filename.ext.link files wherever we want filename.ext (with the content of .link files saying where the source is). It would make it clearer when the source content is from elsewhere, but there are drawbacks too.

Secure references

I think there's a potential security issue currently for something like this:

- package: system
  from: ../../../../../../../../etc/passwd
  to: LICENSE.txt

It would be good to lock it down a bit more.
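
One way to picture such a lock-down is a containment check that resolves the requested path and rejects anything outside an allowed root. The sketch below is Python and uses a swapped-in technique (path resolution plus a prefix check); it is not the mechanism used in this PR, and the helper name `resolve_inside` is hypothetical. A handle-based API such as Go's `os.OpenRoot` gives stronger guarantees, because a resolve-then-open check like this one can still race with changes on disk.

```python
from pathlib import Path


def resolve_inside(root, relative):
    """Resolve `relative` against `root`; refuse anything that ends up outside `root`."""
    root = Path(root).resolve()
    # resolve() collapses ".." segments and follows symlinks before the comparison.
    target = (root / relative).resolve()
    if not target.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"{relative!r} escapes {root}")
    return target
```

With this check, a `from` value such as `../../../../../../../../etc/passwd` raises `ValueError` instead of being copied into the package.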

Use outside of the integrations repo

I think the functionality we settle on should work and be useful for packages that are developed outside of the integrations repo. Currently there is an assumption that sibling directories of the package root are other packages.

@marc-gr
Contributor Author

marc-gr commented Mar 18, 2025

In addition to sharing field definitions between data streams, I would have liked to have this functionality for

  • input configuration shared across data streams
  • field definitions shared between source data stream and transform

This should also be enough for those cases, as long as we do not enforce any particular file type.

An alternative to includes.yml

An alternative would be to have the copied files only go into the package, and instead of includes.yml, there could be filename.ext.link files wherever we want filename.ext (with the content of .link files saying where the source is). It would make it clearer when the source content is from elsewhere, but there are drawbacks too.

I think this is a good middle ground, as it makes it clear what file to look at. Will give it a go and see how it looks 👍

Secure references

I think there's a potential security issue currently for something like this:

- package: system
  from: ../../../../../../../../etc/passwd
  to: LICENSE.txt

This is addressed by the use of OpenRoot; in my last commit we already scope the topmost dir at packages/.

Use outside of the integrations repo

I think the functionality we settle on should work and be useful for packages that are developed outside of the integrations repo. Currently there is an assumption that sibling directories of the package root are other packages.

I think this might be something we do as a second iteration, not sure how to solve this generally (maybe by allowing referencing git repositories, for example) since most use cases for this currently fall in this initial case.

@marc-gr
Contributor Author

marc-gr commented Mar 19, 2025

I made changes to the spec and elastic-package so now you can add file.ext.link with a single line containing a reference to a file to include. It will do this transparently during build.
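
To make the idea concrete, a build step that resolves link files transparently might look roughly like this. It is a Python sketch under stated assumptions and not the Go code in this PR: `build_with_links` is a hypothetical name, the first whitespace-separated token of a link file is assumed to be the referenced path, and that path is assumed to be relative to a `links_root` directory such as `packages/`. The build directory is assumed to be outside the package directory.

```python
import shutil
from pathlib import Path

LINK_SUFFIX = ".link"


def build_with_links(package_dir, build_dir, links_root):
    """Copy a package tree, replacing every *.link file with the file it references."""
    package_dir, build_dir, links_root = Path(package_dir), Path(build_dir), Path(links_root)
    for src in sorted(package_dir.rglob("*")):
        if src.is_dir():
            continue
        dst = build_dir / src.relative_to(package_dir)
        if src.name.endswith(LINK_SUFFIX):
            # Assumed format: the first token of the link file is the referenced path.
            reference = src.read_text().split()[0]
            src = links_root / reference
            # file.ext.link becomes file.ext in the built package.
            dst = dst.with_name(dst.name[: -len(LINK_SUFFIX)])
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
```

The sketch leaves out the path confinement discussed earlier in the thread; a real implementation would need to validate `reference` before opening it.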

Considerations

  • We need to review tests as this will require changes in at least pipeline tests
  • Still only works with packages in the same repository
  • Paths in .link files are expected to be relative to ./packages/ to avoid issues escaping the root

@marc-gr marc-gr requested a review from chrisberkhout March 19, 2025 12:39
@marc-gr
Contributor Author

marc-gr commented Mar 20, 2025

Made a last change so that during pipeline benchmarks/tests the included pipelines are expanded before installing them, so they can run properly.

@jsoriano
Member

We need to review tests as this will require changes in at least pipeline tests

We would need to modify at least the is_pr_affected function. It will need to look for .ext.link files, and check if the linked files are modified in the PR.
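
The lookup described here could be sketched as below. This is an illustrative Python sketch and not the actual is_pr_affected function; `packages_affected_by_links` is a hypothetical name, the list of changed files is assumed to come from the PR diff as repository-relative paths, and link targets are assumed to be written relative to the link file itself.

```python
from pathlib import Path


def packages_affected_by_links(repo_root, packages_dir, changed_files):
    """Return the names of packages that link to any file changed in the PR."""
    repo_root = Path(repo_root).resolve()
    changed = {(repo_root / name).resolve() for name in changed_files}
    affected = set()
    for package in (repo_root / packages_dir).iterdir():
        if not package.is_dir():
            continue
        for link in package.rglob("*.link"):
            # Assumption: the first token is a path relative to the link file.
            target = (link.parent / link.read_text().split()[0]).resolve()
            if target in changed:
                affected.add(package.name)
    return sorted(affected)
```

CI could then test the union of directly modified packages and the packages returned by this function.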

Maybe we should also add a check that requires a changelog entry for packages whose linked files have been modified, to avoid overlooking publication of fixes in dependent packages.

Maybe we should store a checksum of the linked file apart from its path, so there is some track of modifications in the dependent packages. What do you think about this option?

Still only works with packages in the same repository

👍

Paths in .link files are expected to be relative to ./packages/ to avoid issues escaping the root

This couples the feature to the structure of the integrations repository, we have packages on their own repository, and packages in other paths in other repositories.

I would only require the path to be under the root of the repository.

@marc-gr
Contributor Author

marc-gr commented Mar 26, 2025

Added the commented link commands:

  • All commands are relative to the current path, so they work for packages and at the repo level
  • elastic-package links check: will fail if any of the links in the tree has an outdated checksum
  • elastic-package links update: will update all links in the tree to their current checksums
  • elastic-package links list: will list any packages that contain links with references to the current package path. If outside a package, it will list any links referencing files outside any package

~/elastic-package/test/packages/other/pipeline_tests$ elastic-package links list
with_includes
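
The check and update behavior can be pictured with a small sketch. It is written in Python under loudly stated assumptions and does not reproduce the real commands: a link file is assumed to hold the referenced path followed by an optional checksum on one line, the checksum is assumed to be SHA-256, paths are assumed to be relative to the link file and free of whitespace, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def _read_link(link):
    """Split a link file into (referenced path, recorded checksum or None)."""
    parts = Path(link).read_text().split()
    return parts[0], (parts[1] if len(parts) > 1 else None)


def _checksum(path):
    """Hex SHA-256 digest of the file content (assumed checksum algorithm)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def outdated_links(tree):
    """List link files whose recorded checksum differs from the current target content."""
    outdated = []
    for link in sorted(Path(tree).rglob("*.link")):
        reference, recorded = _read_link(link)
        if recorded != _checksum(link.parent / reference):
            outdated.append(link)
    return outdated


def update_links(tree):
    """Rewrite every outdated link file with the current checksum of its target."""
    for link in outdated_links(tree):
        reference, _ = _read_link(link)
        link.write_text(f"{reference} {_checksum(link.parent / reference)}\n")
```

In this model, a check command fails when `outdated_links` is non-empty, and an update command calls `update_links`, which makes the change to the dependent package visible in the diff.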

Member

@jsoriano jsoriano left a comment

Overall the feature looks good to me as proposed. Added some questions and comments about the proposal.

One thing I see could be a bit confusing is the mixed scope, maybe this feature should be agnostic of packages, and work always at the global level, taking into account the current directory.

@marc-gr marc-gr requested a review from jsoriano March 27, 2025 16:03
Member

@jsoriano jsoriano left a comment

Looks good, we will have to wait for the change in package spec before merging.

@jsoriano
Member

Well, one thing we have to remember is the use of linked files in tests that don't use built packages, at least pipelines in pipeline tests.

@marc-gr
Contributor Author

marc-gr commented Mar 28, 2025

Well, one thing we have to remember is the use of linked files in tests that don't use built packages, at least pipelines in pipeline tests.

This is already taken into account in https://github.com/elastic/elastic-package/pull/2483/files#diff-6d6115d2523865659d234b34e7eaf8cf1aa35ecf8378cfa32b1bb14d8781aaf9R75, where we copy them locally to install the pipelines and then remove them. Probably not the cleanest approach, but I could not think of anything simpler without implementing something more complete such as #1743, which seemed out of scope for this one.
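
The copy-then-remove pattern described in this comment can be sketched as a context manager that cleans up even when the test step fails. This is a Python illustration only, not the Go code at the linked diff; `materialized_links` is a hypothetical name, and link targets are assumed to be written relative to the link file.

```python
import contextlib
import shutil
from pathlib import Path


@contextlib.contextmanager
def materialized_links(directory):
    """Temporarily place linked files next to their *.link files; remove them on exit."""
    created = []
    try:
        for link in sorted(Path(directory).rglob("*.link")):
            # Assumption: the first token is a path relative to the link file.
            target = link.parent / link.read_text().split()[0]
            copy = link.with_name(link.name[: -len(".link")])
            if copy.exists():
                # Never clobber a real file that happens to share the name.
                raise FileExistsError(f"{copy} would be overwritten")
            shutil.copyfile(target, copy)
            created.append(copy)
        yield created
    finally:
        # Only files created by this call are removed, so the repo is left as found.
        for copy in created:
            copy.unlink(missing_ok=True)
```

A pipeline test would install the pipelines inside the `with` block; the `finally` clause is what keeps the working tree clean if installation raises.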

@marc-gr
Contributor Author

marc-gr commented Mar 28, 2025

Looks good, we will have to wait for the change in package spec before merging.

I wonder, given that this can link files in the scope of a complete repository, does it really make sense to have it specified in the spec? Maybe it would be enough, and more versatile, to ignore *.link files in the spec and let elastic-package deal with them.

Contributor

@chrisberkhout chrisberkhout left a comment

I like the direction.

It would be good to explain it in docs/howto/.

From what I can see, running elastic-package check won't check the links, but I think it should.

There are some things I'd like to confirm:

Target hash in the link file
Is the purpose of this to make the developer confirm the places that an updated source will have effects, and to trigger checks for changes in certain paths in CI? I think those are good enough reasons.

File name scoping
File names are relative to the git repo, right? I think that's good.
But it's a new requirement that a package is inside a git repo, isn't it?
The link target can be outside _dev/shared/ - that's just one possibility. Is that right?

Source vs build package
The link files are added in package-spec, but they won't be present in the built package. Is that right? I'm a bit confused, because I thought that package spec defined the layout and content of the built package, but I see it also defines _dev, which is not copied during build. This current situation also means the package spec definitions about e.g. field defs or ingest pipelines don't apply to the linked versions (at least according to the definition). I think your suggestion that it could be only in elastic-package makes sense. I guess ideally there would be package spec-type definitions for both package source and built package definitions. I may be missing something here about how it currently works.

Validations
What's the current status of validations? Does a link file get the same validations as a non-link file would in that location? This seems important.

It looks like currently, during package build, the copy function skips links, but then they are added in a separate build step. In the pipeline test part, it copies link targets into the ingest pipeline directory and deletes them later. This logic is split over package-spec and elastic-package.
I know it's a bigger change, but it seems like the best solution might be to have all file access go through one function that can resolve the links, so that validations or other processing could be handled exactly the same for linked and plain files. What do you think?

@jsoriano
Member

jsoriano commented Apr 2, 2025

Is the purpose of this to make the developer confirm the places that an updated source will have effects, and to trigger checks for changes in certain paths in CI? I think those are good enough reasons.

Correct, these are the main reasons.

File names are relative to the git repo, right? I think that's good.
But it's a new requirement that a package is inside a git repo, isn't it?

I think the paths could be relative to the link file itself, for clarity, and to avoid adding this requirement at this point.

Source vs build package

Yep, this is an open discussion. I also think we should at some point have different validators for source and built packages. There is an open issue about this: elastic/package-spec#549

In the meantime I think that we should make the spec aware of linked files, so validations apply also to source files, for example to check limits on the number of fields, or the format of some files. The latest changes in Marc's PR in package spec go in this direction.

What's the current status of validations? Does a link file get the same validations as a non-link file would in that location? This seems important.

+1, Marc's latest changes in the spec PR go in this direction.

@marc-gr marc-gr requested a review from jsoriano April 10, 2025 10:11
@elasticmachine
Collaborator

elasticmachine commented Apr 10, 2025

💔 Build Failed

@marc-gr marc-gr requested a review from chrisberkhout April 11, 2025 12:02