use nbdime as a clever git filter #478

Open
qmarcou opened this issue May 22, 2019 · 6 comments

Comments

@qmarcou

qmarcou commented May 22, 2019

Hi,
First of all, thank you so much for the work on nbdime, it really makes integrating Jupyter notebooks into a version control scheme much easier!

Still, I'm struggling to get some kind of "optimal" git tracking of my notebooks, one that prevents metadata and output from changing at every commit. I have checked (hopefully thoroughly) the different issues (e.g. #423 and #410) and the pieces of documentation related to this.

From what I gathered, here is what I got (please correct me if I'm wrong):

  • nbdime provides machinery to produce diffs and merges, but it cannot be used with git add (not even through its --patch option) to selectively stage cell input, output, or metadata.
  • Another workaround is nbstripout, which simply erases all selected output/metadata from the notebook. It integrates easily into a git workflow as a git filter. The problem with this solution is that it removes everything, leaving no choice but to never track metadata and output, or to drop them starting from the next commit.

Basically this only leaves 2 solutions: either track every single change in metadata and output or never have them in the git history.

I think it would be good to have an intermediate option that tracks (chosen) metadata and output, and adds changes in metadata/output to commits only when desired. This would be quite helpful when you have a notebook full of plots, some of them potentially slow to generate, and want to keep a PNG of each inside the notebook (though I agree that if the plots take a long time to generate, one should probably work around it by saving the processed data and/or the figure in a convenient format).

I was thinking along these lines, trying to find a solution, and I think I found a lead:
The idea would be to use nbdime as a smarter filter than nbstripout. Since nbdime computes diffs nicely, one could exploit this to revert selected changes in input/metadata/output back to the state of the last commit (something similar to git checkout myfile.ipynb, but reverting only pieces of the file).
For example, if one only wants to commit changes in input cells, nbdime could compute a diff on everything but input cells (usually we would use nbdime the other way around), and then revert all differences found in that diff to the last commit (which should be doable, since the diff gives a line-by-line mapping, such that line-by-line substitutions/insertions/deletions can be performed). This would be executed as a special git input filter, for instance (people would be able to create git aliases for different git add strategies). I think this approach would be a good compromise for the problem described above.
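
To make the intended behaviour concrete, here is a minimal sketch that uses only the Python standard library and does not rely on nbdime at all. It keeps the edited cell sources but reverts outputs, execution counts and metadata to the committed version. The function name keep_only_source_changes is hypothetical, and cells are matched by position, which is a simplifying assumption: coping with inserted, deleted or moved cells is exactly where a real diff would be needed.

```python
import copy
import json


def keep_only_source_changes(base_nb, working_nb):
    """Return a copy of working_nb whose outputs and metadata come from base_nb.

    Assumption: cells are matched by position. A real filter would need a
    proper diff to handle inserted, deleted, or moved cells.
    """
    result = copy.deepcopy(working_nb)
    # Notebook-level metadata is reverted to the committed version.
    result["metadata"] = copy.deepcopy(base_nb.get("metadata", {}))
    base_cells = base_nb.get("cells", [])
    for index, cell in enumerate(result.get("cells", [])):
        if index >= len(base_cells):
            # New cell with no committed counterpart: keep it as-is.
            continue
        base_cell = base_cells[index]
        if cell.get("cell_type") != base_cell.get("cell_type"):
            # Cell type changed, so there is nothing sensible to revert to.
            continue
        cell["metadata"] = copy.deepcopy(base_cell.get("metadata", {}))
        if cell["cell_type"] == "code":
            cell["outputs"] = copy.deepcopy(base_cell.get("outputs", []))
            cell["execution_count"] = base_cell.get("execution_count")
    return result


# Toy notebooks standing in for the last commit and the working copy.
committed = {
    "nbformat": 4, "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3"}},
    "cells": [{
        "cell_type": "code", "source": "x = 1", "metadata": {},
        "execution_count": 1,
        "outputs": [{"output_type": "stream", "name": "stdout", "text": "1\n"}],
    }],
}
edited = {
    "nbformat": 4, "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "other"}},
    "cells": [
        {
            "cell_type": "code", "source": "x = 2",
            "metadata": {"collapsed": True}, "execution_count": 7,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "2\n"}],
        },
        {"cell_type": "markdown", "source": "A new note", "metadata": {}},
    ],
}

filtered = keep_only_source_changes(committed, edited)
print(json.dumps(filtered, indent=1))
```

The working copy is left untouched; only the returned notebook would be handed to git.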

Maybe I'm missing some details that make this approach intractable, but given how nicely nbdime works I feel it could be implemented. What do you think?

Sorry for the very long message; I've been trying to make myself as clear as possible. Again, thanks for the good work!

Best

@vidartf
Collaborator

vidartf commented May 22, 2019

Hi!

If you wanted to make a git filter based on nbdime, I think the simplest logic would be to:

  • Configure filtering of diff using these functions:
    def reset_notebook_differ():
        """Reset the notebook_differs dictionary to default values."""
        # As it is a defaultdict2, simply clear all set keys to reset:
        for key in tuple(notebook_differs.keys()):
            del notebook_differs[key]

    def set_notebook_diff_ignores(ignore_paths):
        """Set/unset notebook differs to ignore.

        Parameters:
            ignore_paths: dict
                Dictionary with path strings (e.g. /cells/*/outputs) as keys.
                For each entry in the dictionary do the following:
                 - if value is True, set path to `diff_ignore`.
                 - if value is False, reset the differ of path to the default value.
                 - if value is set/tuple/list, assume the container is a collection
                   of subkeys of path to ignore with `diff_ignore_keys`.
        """
        for path, subkeys in ignore_paths.items():
            if subkeys is True:
                notebook_differs[path] = diff_ignore
            elif subkeys is False:
                if path in notebook_differs:
                    del notebook_differs[path]
            elif isinstance(subkeys, (list, tuple, set)):
                notebook_differs[path] = diff_ignore_keys(notebook_differs[path], subkeys)
            else:
                raise ValueError('Invalid ignore config entry: %r: %r' % (path, subkeys))

    def set_notebook_diff_targets(sources=True, outputs=True, attachments=True,
                                  metadata=True, details=True):
        """Configure the notebook differs to include/ignore various changes."""
        config = {
            '/cells/*/source': not sources,
            '/cells/*/outputs': not outputs,
            '/cells/*/attachments': not attachments,
            '/metadata': not metadata,
            '/cells/*/metadata': not metadata,
            '/cells/*/outputs/*/metadata': not metadata,
            '/cells/*': False if details else ('execution_count',),
            '/cells/*/outputs/*': False if details else ('execution_count',),
        }
        set_notebook_diff_ignores(config)
  • Compute filtered diff:
    def diff_notebooks(a, b):
  • Apply this filtered diff on the base file:
    def patch_notebook(nb, diff):
  • Output the resulting notebook (nbformat.write() to a file or sys.stdout).
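
The four steps above could be strung together roughly as follows. This is an untested sketch, not an existing nbdime entry point. The strategy names, the script layout, and the choice of the last commit (`git show HEAD:<path>`) as the base are all assumptions made for illustration. The nbdime and nbformat imports sit inside the functions so that the strategy table can be inspected without those packages installed.

```python
import subprocess
import sys

# Hypothetical strategy names. The keyword arguments are those of
# set_notebook_diff_targets as quoted above.
STRATEGIES = {
    "sources-only": dict(sources=True, outputs=False, attachments=True,
                         metadata=False, details=False),
    "sources-and-outputs": dict(sources=True, outputs=True, attachments=True,
                                metadata=False, details=False),
}


def filter_notebook(base, working, strategy="sources-only"):
    """Apply to `base` only the changes in `working` that the strategy keeps."""
    from nbdime import diff_notebooks, patch_notebook
    from nbdime.diffing.notebooks import set_notebook_diff_targets

    # Step 1: configure which parts of the notebook the diff may contain.
    set_notebook_diff_targets(**STRATEGIES[strategy])
    # Step 2: compute the filtered diff.
    diff = diff_notebooks(base, working)
    # Step 3: apply the filtered diff on the base notebook.
    return patch_notebook(base, diff)


def main(path, strategy="sources-only"):
    """Clean-filter style entry point: notebook on stdin, result on stdout."""
    import nbformat

    working_text = sys.stdin.read()
    # Assumption: the base is the version of `path` in the last commit.
    shown = subprocess.run(["git", "show", "HEAD:" + path],
                           capture_output=True, text=True)
    if shown.returncode != 0:
        # No committed version yet (e.g. a new file): pass through unchanged.
        sys.stdout.write(working_text)
        return
    base = nbformat.reads(shown.stdout, as_version=4)
    working = nbformat.reads(working_text, as_version=4)
    # Step 4: output the resulting notebook.
    nbformat.write(filter_notebook(base, working, strategy), sys.stdout)


# Only runs when a path argument is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(*sys.argv[1:3])
```

Note that set_notebook_diff_targets changes module-level state in nbdime, so a long-running process would probably want to call reset_notebook_differ afterwards. Whole cells that are new in the working copy would still arrive with their outputs, since the diff adds them as a unit.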

This could possibly be added to nbdime as another CLI entry point, but it might be better to play around with the idea as a separate script first (simply importing the methods from nbdime). If you get something working, we can look at helping to get it integrated into nbdime via a PR.

@qmarcou
Author

qmarcou commented May 23, 2019

Hi!
Thanks a lot for the precise pointers!
I'll try playing around with a script to see if this actually works and whether it's a useful feature.
I'll keep you updated.
Thanks!

@kynan
Contributor

kynan commented Nov 10, 2019

@qmarcou did you get anywhere with this?

FYI, nbstripout has some options to control what to strip and what to keep.

@qmarcou
Author

qmarcou commented Nov 14, 2019

Hi @kynan
Nope, sadly I've been quite busy and did not have time to look into this...
Thanks for the pointer, in case I or somebody else finds some time.

@stephanecollot

I'm really interested in this feature.
Especially because, if I understood correctly, this would also keep your local cell output when you git pull/checkout, right?
nbstripout removes all local cell output after pull/checkout operations.

@qmarcou
Author

qmarcou commented Apr 18, 2024

Yes, that's the idea; sadly I never had the time to dig deeper into it...
