* [PATCH 0/3] batch blob diff generation
@ 2024-12-13  4:23 Justin Tobler
  2024-12-13  4:23 ` [PATCH 1/3] builtin: introduce diff-blob command Justin Tobler
                   ` (4 more replies)
  0 siblings, 5 replies; 78+ messages in thread
From: Justin Tobler @ 2024-12-13  4:23 UTC (permalink / raw)
  To: git; +Cc: ps, Justin Tobler

Through git-diff(1) it is possible to generate a diff directly between
two blobs. This is particularly useful when the pre-image and post-image
blobs are known and we only care about the diff between them.
Unfortunately, if a user has a batch of known blob pairs to compute
diffs for, there is currently not a way to do so via a single Git
process.

To enable support for batch diffs of multiple blob pairs, this
series introduces a new diff plumbing command git-diff-blob(1). Similar
to git-diff-tree(1), it provides a "--stdin" option that reads a pair of
blobs on each line of input and generates the diffs. This is intended to
be used for scripting purposes where more fine-grained control for diff
generation is desired. Below is an example for each usage:

    $ git diff-blob HEAD~5000:README.md HEAD:README.md

    $ git diff-blob --stdin <<EOF
    88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b
    HEAD~5000:README.md HEAD:README.md
    EOF

Some alternative approaches that were considered:

Instead of creating a new plumbing command, the existing git-diff(1)
could have been extended with a similar "--batch" option ("--stdin" is
technically already handled through setup_revisions() since it isn't
disabled). This option could read from stdin and generate diffs for any
valid revision pair that gets provided (not just blob diffs). The
primary reason for not going down this route was that git-diff-tree(1)
already has support for batch diff generation for commits/trees through
its "--stdin" option and teaching git-diff(1) a superset of this
functionality would further complicate this porcelain interface for
something that seems like more of a plumbing feature.

Another idea was to extend the existing git-diff-tree(1) to support
generating diffs for blob pairs through its "--stdin" option. This
didn't seem like a good fit either though as it is really outside the
scope of responsibilities for that command.

Ultimately I couldn't find an existing place that seemed like a good
fit, thus the new plumbing command route was chosen. I'm still not sure
though if a standalone "diff-blob" command is the right choice here
either. Its primary function of generating a single blob pair diff is a
direct subset of git-diff(1) and is thus largely redundant. The only
additional value comes from its "--stdin" option which enables batch
processing. To an extent it seems much of the existing diff plumbing
commands' feature set can also be accessed through git-diff(1) so maybe
this isn't a big deal. Feedback and suggestions are much appreciated.

This series is structured as follows:

    - Patch 1 introduces the "diff-blob" plumbing command and its
      surrounding setup.

    - Patch 2 teaches "diff-blob" the "--stdin" option which allows
      multiple blob pair diffs to be specified and processed.

    - Patch 3 teaches "diff-blob" the "-z" option which, when used with
      "--stdin", uses the NUL character to delimit the input blobs and
      output diffs.

The series is built on top of caacdb5d (The fifteenth batch, 2024-12-10)
with ps/build at 904339ed (Introduce support for the Meson build system,
2024-12-06) merged into it. This is done so the new command is
integrated with the meson build system.

-Justin

Justin Tobler (3):
  builtin: introduce diff-blob command
  builtin/diff-blob: add "--stdin" option
  builtin/diff-blob: Add "-z" option

 .gitignore                      |   1 +
 Documentation/git-diff-blob.txt |  39 +++++++
 Documentation/meson.build       |   1 +
 Makefile                        |   1 +
 builtin.h                       |   1 +
 builtin/diff-blob.c             | 200 ++++++++++++++++++++++++++++++++
 command-list.txt                |   1 +
 git.c                           |   1 +
 meson.build                     |   1 +
 t/t4063-diff-blobs.sh           | 108 ++++++++++-------
 10 files changed, 309 insertions(+), 45 deletions(-)
 create mode 100644 Documentation/git-diff-blob.txt
 create mode 100644 builtin/diff-blob.c

-- 
2.47.1

^ permalink raw reply	[flat|nested] 78+ messages in thread
* [PATCH 1/3] builtin: introduce diff-blob command
  2024-12-13  4:23 [PATCH 0/3] batch blob diff generation Justin Tobler
@ 2024-12-13  4:23 ` Justin Tobler
  2024-12-13  4:23 ` [PATCH 2/3] builtin/diff-blob: add "--stdin" option Justin Tobler
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 78+ messages in thread
From: Justin Tobler @ 2024-12-13  4:23 UTC (permalink / raw)
  To: git; +Cc: ps, Justin Tobler

Through git-diff(1), a single diff can be generated from a pair of blob
revisions directly. This is useful when more explicit control over the
set of diffs to compute is desired. Expanding on this, it would be
useful to also support generating multiple diffs for blob pairs provided
on stdin to facilitate batch processing in a single process.

Batch blob diff processing is likely considered more of a plumbing
feature so, instead of further extending the porcelain git-diff(1), a
diff plumbing command should be used. As there is not an existing diff
plumbing command that handles blob diffs, introduce git-diff-blob(1)
which generates a single diff between a specified pair of blobs
following how the same operation is done in git-diff(1). While
git-diff-blob(1) functionality is a direct subset of git-diff(1), a
subsequent patch extends it to provide a new plumbing related feature.

The surrounding setup required for the new builtin is also added. An
existing test for blob diffs through git-diff(1) is also modified to
reuse its test cases for git-diff-blob(1).

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 .gitignore                      |   1 +
 Documentation/git-diff-blob.txt |  29 ++++++++
 Documentation/meson.build       |   1 +
 Makefile                        |   1 +
 builtin.h                       |   1 +
 builtin/diff-blob.c             | 117 ++++++++++++++++++++++++++++++++
 command-list.txt                |   1 +
 git.c                           |   1 +
 meson.build                     |   1 +
 t/t4063-diff-blobs.sh           | 100 ++++++++++++++-------------
 10 files changed, 205 insertions(+), 48 deletions(-)
 create mode 100644 Documentation/git-diff-blob.txt
 create mode 100644 builtin/diff-blob.c

diff --git a/.gitignore b/.gitignore
index e82aa19df0..e7487072bd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -52,6 +52,7 @@
 /git-daemon
 /git-diagnose
 /git-diff
+/git-diff-blob
 /git-diff-files
 /git-diff-index
 /git-diff-tree
diff --git a/Documentation/git-diff-blob.txt b/Documentation/git-diff-blob.txt
new file mode 100644
index 0000000000..732992d1d7
--- /dev/null
+++ b/Documentation/git-diff-blob.txt
@@ -0,0 +1,29 @@
+git-diff-blob(1)
+================
+
+NAME
+----
+git-diff-blob - Compares the content and mode of two specified blobs
+
+
+SYNOPSIS
+--------
+[verse]
+'git diff-blob' <blob> <blob>
+
+DESCRIPTION
+-----------
+Compare the content and mode of two specified blobs.
+
+OPTIONS
+-------
+<blob>::
+	The id of a blob object or path-scoped revision that resolves to a blob.
+
+include::pretty-formats.txt[]
+
+include::diff-format.txt[]
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index f2426ccaa3..1bf0a80419 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -39,6 +39,7 @@ manpages = {
   'git-daemon.txt' : 1,
   'git-describe.txt' : 1,
   'git-diagnose.txt' : 1,
+  'git-diff-blob.txt' : 1,
   'git-diff-files.txt' : 1,
   'git-diff-index.txt' : 1,
   'git-difftool.txt' : 1,
diff --git a/Makefile b/Makefile
index 06f01149ec..de2e43d4f6 100644
--- a/Makefile
+++ b/Makefile
@@ -1235,6 +1235,7 @@ BUILTIN_OBJS += builtin/credential-store.o
 BUILTIN_OBJS += builtin/credential.o
 BUILTIN_OBJS += builtin/describe.o
 BUILTIN_OBJS += builtin/diagnose.o
+BUILTIN_OBJS += builtin/diff-blob.o
 BUILTIN_OBJS += builtin/diff-files.o
 BUILTIN_OBJS += builtin/diff-index.o
 BUILTIN_OBJS += builtin/diff-tree.o
diff --git a/builtin.h b/builtin.h
index f7b166b334..383e78ca99 100644
--- a/builtin.h
+++ b/builtin.h
@@ -152,6 +152,7 @@ int cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit
 int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_diff_blob(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/diff-blob.c b/builtin/diff-blob.c
new file mode 100644
index 0000000000..7cfa4eb436
--- /dev/null
+++ b/builtin/diff-blob.c
@@ -0,0 +1,117 @@
+#include "builtin.h"
+#include "config.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "gettext.h"
+#include "hash.h"
+#include "object.h"
+#include "parse-options.h"
+#include "revision.h"
+
+static void diff_blobs(struct object_array_entry *old_blob,
+		       struct object_array_entry *new_blob,
+		       struct diff_options *opts)
+{
+	const unsigned mode = canon_mode(S_IFREG | 0644);
+	struct object_id old_oid = old_blob->item->oid;
+	struct object_id new_oid = new_blob->item->oid;
+	unsigned old_mode = old_blob->mode;
+	unsigned new_mode = new_blob->mode;
+	char *old_path = old_blob->path;
+	char *new_path = new_blob->path;
+	struct diff_filespec *old, *new;
+
+	if (old_mode == S_IFINVALID)
+		old_mode = mode;
+
+	if (new_mode == S_IFINVALID)
+		new_mode = mode;
+
+	if (!old_path)
+		old_path = old_blob->name;
+
+	if (!new_path)
+		new_path = new_blob->name;
+
+	if (!is_null_oid(&old_oid) && !is_null_oid(&new_oid) &&
+	    oideq(&old_oid, &new_oid) && (old_mode == new_mode))
+		return;
+
+	if (opts->flags.reverse_diff) {
+		SWAP(old_oid, new_oid);
+		SWAP(old_mode, new_mode);
+		SWAP(old_path, new_path);
+	}
+
+	if (opts->prefix &&
+	    (strncmp(old_path, opts->prefix, opts->prefix_length) ||
+	     strncmp(new_path, opts->prefix, opts->prefix_length)))
+		return;
+
+	old = alloc_filespec(old_path);
+	new = alloc_filespec(new_path);
+
+	fill_filespec(old, &old_oid, 1, old_mode);
+	fill_filespec(new, &new_oid, 1, new_mode);
+
+	diff_queue(&diff_queued_diff, old, new);
+	diffcore_std(opts);
+	diff_flush(opts);
+}
+
+int cmd_diff_blob(int argc, const char **argv, const char *prefix,
+		  struct repository *repo)
+{
+	struct object_array_entry *old_blob, *new_blob;
+	struct rev_info revs;
+	int ret;
+
+	const char * const usage[] = {
+		N_("git diff-blob <blob> <blob>"),
+		NULL
+	};
+	struct option options[] = {
+		OPT_END()
+	};
+
+	argc = parse_options(argc, argv, prefix, options, usage,
+			     PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
+
+	repo_config(repo, git_diff_basic_config, NULL);
+	prepare_repo_settings(repo);
+	repo->settings.command_requires_full_index = 0;
+
+	repo_init_revisions(repo, &revs, prefix);
+	revs.abbrev = 0;
+	revs.diff = 1;
+	revs.disable_stdin = 1;
+
+	prefix = precompose_argv_prefix(argc, argv, prefix);
+	argc = setup_revisions(argc, argv, &revs, NULL);
+
+	if (!revs.diffopt.output_format)
+		revs.diffopt.output_format = DIFF_FORMAT_PATCH;
+
+	switch (revs.pending.nr) {
+	case 2:
+		old_blob = &revs.pending.objects[0];
+		new_blob = &revs.pending.objects[1];
+
+		if (old_blob->item->type != OBJ_BLOB)
+			die("object %s is not a blob", old_blob->name);
+
+		if (new_blob->item->type != OBJ_BLOB)
+			die("object %s is not a blob", new_blob->name);
+
+		diff_blobs(old_blob, new_blob, &revs.diffopt);
+
+		break;
+	default:
+		usage_with_options(usage, options);
+	}
+
+	ret = diff_result_code(&revs);
+	release_revisions(&revs);
+
+	return ret;
+}
diff --git a/command-list.txt b/command-list.txt
index e0bb87b3b5..78d8308352 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -93,6 +93,7 @@ git-daemon                              synchingrepositories
 git-describe                            mainporcelain
 git-diagnose                            ancillaryinterrogators
 git-diff                                mainporcelain           info
+git-diff-blob                           plumbinginterrogators
 git-diff-files                          plumbinginterrogators
 git-diff-index                          plumbinginterrogators
 git-diff-tree                           plumbinginterrogators
diff --git a/git.c b/git.c
index 46b3c740c5..17c018ea36 100644
--- a/git.c
+++ b/git.c
@@ -540,6 +540,7 @@ static struct cmd_struct commands[] = {
 	{ "describe", cmd_describe, RUN_SETUP },
 	{ "diagnose", cmd_diagnose, RUN_SETUP_GENTLY },
 	{ "diff", cmd_diff, NO_PARSEOPT },
+	{ "diff-blob", cmd_diff_blob, RUN_SETUP | NO_PARSEOPT },
 	{ "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
 	{ "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT },
 	{ "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT },
diff --git a/meson.build b/meson.build
index 0dccebcdf1..fefb802c27 100644
--- a/meson.build
+++ b/meson.build
@@ -503,6 +503,7 @@ builtin_sources = [
   'builtin/credential.c',
   'builtin/describe.c',
   'builtin/diagnose.c',
+  'builtin/diff-blob.c',
   'builtin/diff-files.c',
   'builtin/diff-index.c',
   'builtin/diff-tree.c',
diff --git a/t/t4063-diff-blobs.sh b/t/t4063-diff-blobs.sh
index 50fdb5ea52..23615565fe 100755
--- a/t/t4063-diff-blobs.sh
+++ b/t/t4063-diff-blobs.sh
@@ -1,12 +1,14 @@
 #!/bin/sh
 
-test_description='test direct comparison of blobs via git-diff'
+test_description='test direct comparison of blobs via git-diff and git-diff-blob'
 
 . ./test-lib.sh
 
+commands="diff diff-blob"
+
 run_diff () {
 	# use full-index to make it easy to match the index line
-	git diff --full-index "$@" >diff
+	git $1 --full-index $2 $3 >diff
 }
 
 check_index () {
@@ -37,61 +39,63 @@ test_expect_success 'create some blobs' '
 	sha1_two=$(git rev-parse HEAD:two)
 '
 
-test_expect_success 'diff by sha1' '
-	run_diff $sha1_one $sha1_two
-'
-test_expect_success 'index of sha1 diff' '
-	check_index $sha1_one $sha1_two
-'
-test_expect_success 'sha1 diff uses arguments as paths' '
-	check_paths $sha1_one $sha1_two
-'
-test_expect_success 'sha1 diff has no mode change' '
-	! grep mode diff
-'
-
-test_expect_success 'diff by tree:path (run)' '
-	run_diff HEAD:one HEAD:two
-'
-test_expect_success 'index of tree:path diff' '
-	check_index $sha1_one $sha1_two
-'
-test_expect_success 'tree:path diff uses filenames as paths' '
-	check_paths one two
-'
-test_expect_success 'tree:path diff shows mode change' '
-	check_mode 100644 100755
-'
-
-test_expect_success 'diff by ranged tree:path' '
-	run_diff HEAD:one..HEAD:two
-'
-test_expect_success 'index of ranged tree:path diff' '
-	check_index $sha1_one $sha1_two
-'
-test_expect_success 'ranged tree:path diff uses filenames as paths' '
-	check_paths one two
-'
-test_expect_success 'ranged tree:path diff shows mode change' '
-	check_mode 100644 100755
-'
-
-test_expect_success 'diff blob against file' '
-	run_diff HEAD:one two
+test_expect_success 'diff blob against file (git-diff)' '
+	run_diff diff HEAD:one two
 '
-test_expect_success 'index of blob-file diff' '
+test_expect_success 'index of blob-file diff (git-diff)' '
 	check_index $sha1_one $sha1_two
 '
-test_expect_success 'blob-file diff uses filename as paths' '
+test_expect_success 'blob-file diff uses filename as paths (git-diff)' '
 	check_paths one two
 '
-test_expect_success FILEMODE 'blob-file diff shows mode change' '
+test_expect_success FILEMODE 'blob-file diff shows mode change (git-diff)' '
 	check_mode 100644 100755
 '
 
-test_expect_success 'blob-file diff prefers filename to sha1' '
-	run_diff $sha1_one two &&
+test_expect_success 'blob-file diff prefers filename to sha1 (git-diff)' '
+	run_diff diff $sha1_one two &&
 	check_paths two two
 '
 
+for cmd in $commands; do
+	test_expect_success "diff by sha1 (git-$cmd)" '
+		run_diff $cmd $sha1_one $sha1_two
+	'
+	test_expect_success "index of sha1 diff (git-$cmd)" '
+		check_index $sha1_one $sha1_two
+	'
+	test_expect_success "sha1 diff uses arguments as paths (git-$cmd)" '
+		check_paths $sha1_one $sha1_two
+	'
+	test_expect_success "sha1 diff has no mode change (git-$cmd)" '
+		! grep mode diff
+	'
+
+	test_expect_success "diff by tree:path (run) (git-$cmd)" '
+		run_diff $cmd HEAD:one HEAD:two
+	'
+	test_expect_success "index of tree:path diff (git-$cmd)" '
+		check_index $sha1_one $sha1_two
+	'
+	test_expect_success "tree:path diff uses filenames as paths (git-$cmd)" '
+		check_paths one two
+	'
+	test_expect_success "tree:path diff shows mode change (git-$cmd)" '
+		check_mode 100644 100755
+	'
+
+	test_expect_success "diff by ranged tree:path (git-$cmd)" '
+		run_diff $cmd HEAD:one..HEAD:two
+	'
+	test_expect_success "index of ranged tree:path diff (git-$cmd)" '
+		check_index $sha1_one $sha1_two
+	'
+	test_expect_success "ranged tree:path diff uses filenames as paths (git-$cmd)" '
+		check_paths one two
+	'
+	test_expect_success "ranged tree:path diff shows mode change (git-$cmd)" '
+		check_mode 100644 100755
+	'
+done
+
 test_done
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread
* [PATCH 2/3] builtin/diff-blob: add "--stdin" option
  2024-12-13  4:23 [PATCH 0/3] batch blob diff generation Justin Tobler
  2024-12-13  4:23 ` [PATCH 1/3] builtin: introduce diff-blob command Justin Tobler
@ 2024-12-13  4:23 ` Justin Tobler
  2024-12-13  4:23 ` [PATCH 3/3] builtin/diff-blob: Add "-z" option Justin Tobler
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 78+ messages in thread
From: Justin Tobler @ 2024-12-13  4:23 UTC (permalink / raw)
  To: git; +Cc: ps, Justin Tobler

There is not a way to generate multiple blob diffs from a single
process. Similar to git-diff-tree(1) with its "--stdin" option, it would
be useful if multiple blob pairs could be provided to git-diff-blob(1)
to compute blob diffs for.

Teach git-diff-blob(1) the "--stdin" option to allow a pair of blobs to
be read from each line of stdin instead of relying on the single blob
pair provided as arguments. When this option is specified, each valid
line of input computes a blob diff thus allowing multiple blob diffs in
a single process. A blob may be specified by its ID or a path-scoped
revision that resolves to a blob. When a path-scoped revision is used,
path and mode information is also extracted and presented in the
resulting diff header.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 Documentation/git-diff-blob.txt |  6 ++++
 builtin/diff-blob.c             | 64 +++++++++++++++++++++++++++++++++
 t/t4063-diff-blobs.sh           | 14 ++++++++
 3 files changed, 84 insertions(+)

diff --git a/Documentation/git-diff-blob.txt b/Documentation/git-diff-blob.txt
index 732992d1d7..f6ecd522fa 100644
--- a/Documentation/git-diff-blob.txt
+++ b/Documentation/git-diff-blob.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git diff-blob' <blob> <blob>
+'git diff-blob' --stdin
 
 DESCRIPTION
 -----------
@@ -20,6 +21,11 @@ OPTIONS
 <blob>::
 	The id of a blob object or path-scoped revision that resolves to a blob.
 
+--stdin::
+	When `--stdin` is specified, the command does not take <blob> arguments
+	from the command line.  Instead, it reads lines containing two <blob>
+	from its standard input.  (Use a single space as separator.)
+
 include::pretty-formats.txt[]
 
 include::diff-format.txt[]
diff --git a/builtin/diff-blob.c b/builtin/diff-blob.c
index 7cfa4eb436..45edfdd979 100644
--- a/builtin/diff-blob.c
+++ b/builtin/diff-blob.c
@@ -4,9 +4,12 @@
 #include "diffcore.h"
 #include "gettext.h"
 #include "hash.h"
+#include "object-name.h"
 #include "object.h"
 #include "parse-options.h"
 #include "revision.h"
+#include "strbuf.h"
+#include "string-list.h"
 
 static void diff_blobs(struct object_array_entry *old_blob,
 		       struct object_array_entry *new_blob,
@@ -59,18 +62,66 @@ static void diff_blobs(struct object_array_entry *old_blob,
 	diff_flush(opts);
 }
 
+static void parse_blob_stdin(struct object_array *blob_pair,
+			     struct repository *repo, const char *name)
+{
+	int flags = GET_OID_BLOB | GET_OID_RECORD_PATH;
+	struct object_context oc;
+	struct object_id oid;
+	struct object *obj;
+
+	if (get_oid_with_context(repo, name, flags, &oid, &oc))
+		die("invalid object %s given", name);
+
+	obj = parse_object_or_die(&oid, name);
+	if (obj->type != OBJ_BLOB)
+		die("object %s is not a blob", name);
+
+	add_object_array_with_path(obj, name, blob_pair, oc.mode, oc.path);
+	object_context_release(&oc);
+}
+
+static void diff_blob_stdin(struct repository *repo, struct diff_options *opts)
+{
+	struct strbuf sb = STRBUF_INIT;
+	struct string_list_item *item;
+
+	while (strbuf_getline(&sb, stdin) != EOF) {
+		struct object_array blob_pair = OBJECT_ARRAY_INIT;
+		struct string_list list = STRING_LIST_INIT_NODUP;
+
+		if (string_list_split_in_place(&list, sb.buf, " ", -1) != 2)
+			die("two blobs not provided");
+
+		for_each_string_list_item(item, &list) {
+			parse_blob_stdin(&blob_pair, repo, item->string);
+		}
+
+		diff_blobs(&blob_pair.objects[0], &blob_pair.objects[1], opts);
+
+		string_list_clear(&list, 1);
+		object_array_clear(&blob_pair);
+	}
+
+	strbuf_release(&sb);
+}
+
 int cmd_diff_blob(int argc, const char **argv, const char *prefix,
 		  struct repository *repo)
 {
 	struct object_array_entry *old_blob, *new_blob;
 	struct rev_info revs;
+	int read_stdin = 0;
 	int ret;
 
 	const char * const usage[] = {
 		N_("git diff-blob <blob> <blob>"),
+		N_("git diff-blob --stdin"),
 		NULL
 	};
 	struct option options[] = {
+		OPT_BOOL(0, "stdin", &read_stdin,
+			 N_("read blob pairs from stdin")),
 		OPT_END()
 	};
 
@@ -93,7 +144,20 @@ int cmd_diff_blob(int argc, const char **argv, const char *prefix,
 		revs.diffopt.output_format = DIFF_FORMAT_PATCH;
 
 	switch (revs.pending.nr) {
+	case 0:
+		if (!read_stdin)
+			usage_with_options(usage, options);
+
+		revs.diffopt.no_free = 1;
+		diff_blob_stdin(repo, &revs.diffopt);
+		revs.diffopt.no_free = 0;
+		diff_free(&revs.diffopt);
+
+		break;
 	case 2:
+		if (read_stdin)
+			usage_with_options(usage, options);
+
 		old_blob = &revs.pending.objects[0];
 		new_blob = &revs.pending.objects[1];
 
diff --git a/t/t4063-diff-blobs.sh b/t/t4063-diff-blobs.sh
index 23615565fe..d7785d4a6e 100755
--- a/t/t4063-diff-blobs.sh
+++ b/t/t4063-diff-blobs.sh
@@ -98,4 +98,18 @@ for cmd in $commands; do
 	'
 done
 
+test_expect_success 'diff-blob --stdin with blob ID' '
+	echo $sha1_one $sha1_two | git diff-blob --full-index --stdin >diff &&
+	check_index $sha1_one $sha1_two &&
+	check_paths $sha1_one $sha1_two &&
+	! grep mode diff
+'
+
+test_expect_success 'diff-blob --stdin with revision' '
+	echo HEAD:one HEAD:two | git diff-blob --full-index --stdin >diff &&
+	check_index $sha1_one $sha1_two &&
+	check_paths one two &&
+	check_mode 100644 100755
+'
+
 test_done
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread
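
[Editorial sketch, not part of the series.] The "--stdin" parser in the
patch above accepts a line only when splitting it on single spaces gives
exactly two fields. The POSIX shell function below imitates that rule so
a script can check a line before sending it. The names split_pair, old
and new are hypothetical and exist only for this illustration; the
sketch assumes that consecutive spaces produce empty fields, as the
split helper used in the patch appears to do.

```shell
#!/bin/sh
# Sketch: accept a line only if it holds exactly two space-separated fields.
# split_pair, old and new are hypothetical names, not part of Git.
split_pair () {
	case $1 in
	*" "*" "*) return 1 ;;	# two or more spaces: more than two fields
	*" "*) ;;		# exactly one space: two fields
	*) return 1 ;;		# no space: a single field
	esac
	old=${1%% *}	# text before the space
	new=${1#* }	# text after the space
}

# The second line holds a path with a space, so it is rejected.
for line in "HEAD:one HEAD:two" "HEAD:my file HEAD:two"; do
	if split_pair "$line"; then
		printf 'ok: old=%s new=%s\n' "$old" "$new"
	else
		printf 'rejected: %s\n' "$line"
	fi
done
```

A path-scoped revision whose path contains a space cannot pass this
check, which is the limitation the "-z" patch that follows is meant to
address.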
* [PATCH 3/3] builtin/diff-blob: Add "-z" option
  2024-12-13  4:23 [PATCH 0/3] batch blob diff generation Justin Tobler
  2024-12-13  4:23 ` [PATCH 1/3] builtin: introduce diff-blob command Justin Tobler
  2024-12-13  4:23 ` [PATCH 2/3] builtin/diff-blob: add "--stdin" option Justin Tobler
@ 2024-12-13  4:23 ` Justin Tobler
  2024-12-13  8:12 ` [PATCH 0/3] batch blob diff generation Jeff King
  2025-02-12  4:18 ` [PATCH v2 " Justin Tobler
  4 siblings, 0 replies; 78+ messages in thread
From: Justin Tobler @ 2024-12-13  4:23 UTC (permalink / raw)
  To: git; +Cc: ps, Justin Tobler

The "--stdin" option for git-diff-blob(1) reads two space separated
blobs for each line of input. A blob may be specified by its ID or a
path-scoped revision that resolves to a blob. It is possible for the
path to contain whitespace or newline characters which must be escaped.

To make input simpler, teach git-diff-blob(1) the "-z" option which
changes the input delimiter for each blob to a NUL character. With this
option, the command waits for two NUL-terminated blobs to be read and
then generates the diff. The diff output is also NUL terminated to help
differentiate between outputted diffs.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 Documentation/git-diff-blob.txt |  6 +++++-
 builtin/diff-blob.c             | 37 +++++++++++++++++++++++++--------
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-diff-blob.txt b/Documentation/git-diff-blob.txt
index f6ecd522fa..36cd686bb1 100644
--- a/Documentation/git-diff-blob.txt
+++ b/Documentation/git-diff-blob.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git diff-blob' <blob> <blob>
-'git diff-blob' --stdin
+'git diff-blob' --stdin [-z]
 
 DESCRIPTION
 -----------
@@ -26,6 +26,10 @@ OPTIONS
 	from the command line.  Instead, it reads lines containing two <blob>
 	from its standard input.  (Use a single space as separator.)
 
+-z::
+	When `--stdin` has been given, use NUL characters to separate blob
+	inputs and diff outputs.
+
 include::pretty-formats.txt[]
 
 include::diff-format.txt[]
diff --git a/builtin/diff-blob.c b/builtin/diff-blob.c
index 45edfdd979..60c92cec9c 100644
--- a/builtin/diff-blob.c
+++ b/builtin/diff-blob.c
@@ -81,23 +81,39 @@ static void parse_blob_stdin(struct object_array *blob_pair,
 	object_context_release(&oc);
 }
 
-static void diff_blob_stdin(struct repository *repo, struct diff_options *opts)
+static void diff_blob_stdin(struct repository *repo, struct diff_options *opts,
+			    int null_term)
 {
 	struct strbuf sb = STRBUF_INIT;
 	struct string_list_item *item;
 
-	while (strbuf_getline(&sb, stdin) != EOF) {
+	while (1) {
 		struct object_array blob_pair = OBJECT_ARRAY_INIT;
 		struct string_list list = STRING_LIST_INIT_NODUP;
 
-		if (string_list_split_in_place(&list, sb.buf, " ", -1) != 2)
-			die("two blobs not provided");
+		if (null_term) {
+			if (strbuf_getline_nul(&sb, stdin) == EOF)
+				break;
+			parse_blob_stdin(&blob_pair, repo, sb.buf);
 
-		for_each_string_list_item(item, &list) {
-			parse_blob_stdin(&blob_pair, repo, item->string);
+			if (strbuf_getline_nul(&sb, stdin) == EOF)
+				break;
+			parse_blob_stdin(&blob_pair, repo, sb.buf);
+		} else {
+			if (strbuf_getline(&sb, stdin) == EOF)
+				break;
+
+			if (string_list_split_in_place(&list, sb.buf, " ", -1) != 2)
+				die("two blobs not provided");
+
+			for_each_string_list_item(item, &list) {
+				parse_blob_stdin(&blob_pair, repo, item->string);
+			}
 		}
 
 		diff_blobs(&blob_pair.objects[0], &blob_pair.objects[1], opts);
+		if (null_term)
+			printf("%c", '\0');
 
 		string_list_clear(&list, 1);
 		object_array_clear(&blob_pair);
@@ -112,16 +128,19 @@ int cmd_diff_blob(int argc, const char **argv, const char *prefix,
 	struct object_array_entry *old_blob, *new_blob;
 	struct rev_info revs;
 	int read_stdin = 0;
+	int null_term = 0;
 	int ret;
 
 	const char * const usage[] = {
 		N_("git diff-blob <blob> <blob>"),
-		N_("git diff-blob --stdin"),
+		N_("git diff-blob --stdin [-z]"),
 		NULL
 	};
 	struct option options[] = {
 		OPT_BOOL(0, "stdin", &read_stdin,
 			 N_("read blob pairs from stdin")),
+		OPT_BOOL('z', NULL, &null_term,
+			 N_("input blobs and output diffs terminated with NUL")),
 		OPT_END()
 	};
 
@@ -149,13 +168,13 @@ int cmd_diff_blob(int argc, const char **argv, const char *prefix,
 			usage_with_options(usage, options);
 
 		revs.diffopt.no_free = 1;
-		diff_blob_stdin(repo, &revs.diffopt);
+		diff_blob_stdin(repo, &revs.diffopt, null_term);
 		revs.diffopt.no_free = 0;
 		diff_free(&revs.diffopt);
 
 		break;
 	case 2:
-		if (read_stdin)
+		if (read_stdin || null_term)
 			usage_with_options(usage, options);
 
 		old_blob = &revs.pending.objects[0];
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread
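
[Editorial sketch, not part of the series.] With "-z", each blob name is
terminated by a NUL byte and every two names form one pair. The POSIX
shell helpers below show how a caller might produce such input and
sanity-check it before piping it to the command proposed in this series.
emit_pair and count_pairs are hypothetical names used only here; nothing
in the sketch invokes Git.

```shell
#!/bin/sh
# emit_pair and count_pairs are hypothetical helpers, not part of Git.

emit_pair () {
	# Write two blob names, each terminated by a NUL byte, so that
	# spaces (or even newlines) inside a path need no escaping.
	printf '%s\0%s\0' "$1" "$2"
}

count_pairs () {
	# Keep only the NUL bytes from stdin and count them;
	# two terminators make one pair.
	n=$(tr -cd '\000' | wc -c | tr -d ' ')
	echo $((n / 2))
}

# Two pairs, the first with paths that contain spaces.
{
	emit_pair "HEAD:my file.txt" "HEAD:other file.txt"
	emit_pair "HEAD:one" "HEAD:two"
} | count_pairs
```

Counting terminators is a cheap way to catch an odd number of names,
which the loop in the patch would otherwise consume as a dangling first
half of a pair.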
* Re: [PATCH 0/3] batch blob diff generation
  2024-12-13  4:23 [PATCH 0/3] batch blob diff generation Justin Tobler
                   ` (2 preceding siblings ...)
  2024-12-13  4:23 ` [PATCH 3/3] builtin/diff-blob: Add "-z" option Justin Tobler
@ 2024-12-13  8:12 ` Jeff King
  2024-12-13 10:17   ` Junio C Hamano
                     ` (2 more replies)
  2025-02-12  4:18 ` [PATCH v2 " Justin Tobler
  4 siblings, 3 replies; 78+ messages in thread
From: Jeff King @ 2024-12-13  8:12 UTC (permalink / raw)
  To: Justin Tobler; +Cc: git, ps

On Thu, Dec 12, 2024 at 10:23:09PM -0600, Justin Tobler wrote:

> To enable support for batch diffs of multiple blob pairs, this
> series introduces a new diff plumbing command git-diff-blob(1). Similar
> to git-diff-tree(1), it provides a "--stdin" option that reads a pair of
> blobs on each line of input and generates the diffs. This is intended to
> be used for scripting purposes where more fine-grained control for diff
> generation is desired. Below is an example for each usage:
>
> $ git diff-blob HEAD~5000:README.md HEAD:README.md
>
> $ git diff-blob --stdin <<EOF
> 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b
> HEAD~5000:README.md HEAD:README.md
> EOF

In the first example, I think just using "git diff" would work (though
it is not a plumbing command). But the stdin example is what's
interesting here anyway, since it can handle arbitrary inputs. So let's
focus on that.

Feeding just blob ids has a big drawback: we don't have any context! So
you get bogus filenames in the patch, no mode data, and so on.

Feeding the paths along with their commits, as you do on the second
line, gives you those things from the lookup context. But it also has
some problems. One, it's needlessly expensive; we have to traverse
HEAD~5000, and then dig into its tree to find the blobs (which
presumably you already did, since how else would you end up with those
oids). And two, there are parsing ambiguities, since arbitrary revision
names can contain spaces. E.g., are we looking for the file "README.md
HEAD:README.md" in HEAD~5000?

So ideally we'd have an input format that encapsulates that extra
context data and provides some mechanism for quoting. And it turns out
we do: the --raw diff format.

If the program takes that format, then you can manually feed it two
arbitrary blob oids if you have them (and put whatever you like for the
mode/path context), like:

  git diff-blob --stdin <<\EOF
  :100644 100644 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b M	README.md
  EOF

Or you can get the real context yourself (though it seems to me that
this is a gap in what "cat-file --batch" should be able to do in a
single process):

  git ls-tree HEAD~5000 README.md >out
  read mode_a blob oid_a path <out
  git ls-tree HEAD README.md >out
  read mode_b blob oid_b path <out
  printf ":$mode_a $mode_b $oid_a $oid_b M\tREADME.md" |
  git diff-blob --stdin

But it also means you can use --raw output directly. So:

  git diff-tree --raw -r HEAD~5000 HEAD -- README.md |
  git diff-blob --stdin

Now that command by itself doesn't look all that useful; you could have
just asked for patches from diff-tree. But by splitting the two, you can
filter the set of paths in between (for example, to omit some entries,
or to batch a large diff into more manageable chunks for pagination,
etc).

The patch might look something like this:

  https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/

:) That is what has been powering the diffs at github.com since 2016 or
so. And continues to do so, as far as I know. I don't have access to
their internal repository anymore, but I've continued to rebase the
topic forward in my personal repo. You can fetch it from:

  https://github.com/peff/git jk/diff-pairs

in case that is helpful.

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread
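
[Editorial sketch, not part of the thread.] The --raw line suggested
above carries modes, object ids and a status letter separated by spaces,
followed by a TAB and the path, so a space inside the path is not
ambiguous. The POSIX shell function below splits one such line into its
fields. parse_raw_line and the variable names are hypothetical; the
sketch covers only the simple one-path form (renames and copies carry
two paths, and unusual path characters are quoted unless -z is used).

```shell
#!/bin/sh
# parse_raw_line is a hypothetical helper, not part of Git.
# It handles the simple non-rename form of a --raw line:
#   :<old mode> <new mode> <old oid> <new oid> <status>TAB<path>
parse_raw_line () {
	tab=$(printf '\t')
	meta=${1%%"$tab"*}	# fields before the first TAB
	path=${1#*"$tab"}	# everything after the first TAB
	meta=${meta#:}		# drop the leading colon
	set -- $meta		# split the remaining fields on spaces
	old_mode=$1 new_mode=$2 old_oid=$3 new_oid=$4 status=$5
}

# Example line built by hand; the path deliberately contains a space.
parse_raw_line "$(printf ':100644 100755 %s %s M\tdir/my file.txt' \
	88f126184c52bfe4859ec189d018872902e02a84 \
	665ce5f5a83647619fba9157fa9b0141ae8b228b)"
printf '%s -> %s (%s) %s\n' "$old_mode" "$new_mode" "$status" "$path"
```

A filter sitting between "git diff-tree --raw" and a consumer could use
a function like this to drop or regroup entries by path before passing
the original lines on unchanged.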
* Re: [PATCH 0/3] batch blob diff generation
  2024-12-13  8:12 ` [PATCH 0/3] batch blob diff generation Jeff King
@ 2024-12-13 10:17   ` Junio C Hamano
  2024-12-13 10:38     ` Jeff King
  2024-12-13 16:41   ` Justin Tobler
  2024-12-13 22:34   ` Junio C Hamano
  2 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2024-12-13 10:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Justin Tobler, git, ps

Jeff King <peff@peff.net> writes:

> Feeding just blob ids has a big drawback: we don't have any context! So
> you get bogus filenames in the patch, no mode data, and so on.

And the lack of filenames and the tree object name at the root level
means you do not get anything out of the attribute subsystem, which
in turn may affect a few more things.

Unfortunately the format used in the output from "diff --raw" does
not capture this.

Does this want to be a building block for the server side diff?
Would it be a bit too low level for each "request" to compare only
two blob objects?  Can we place a lot more assumptions on the
relationship among blob pairs that appear in the --stdin request
(e.g., they all come from the same tree and share the same attr
stack to look up attributes applicable for them)?

^ permalink raw reply	[flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-13 10:17 ` Junio C Hamano @ 2024-12-13 10:38 ` Jeff King 2024-12-15 2:07 ` Junio C Hamano 0 siblings, 1 reply; 78+ messages in thread From: Jeff King @ 2024-12-13 10:38 UTC (permalink / raw) To: Junio C Hamano; +Cc: Justin Tobler, git, ps On Fri, Dec 13, 2024 at 07:17:59PM +0900, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Feeding just blob ids has a big drawback: we don't have any context! So > > you get bogus filenames in the patch, no mode data, and so on. > > And the lack of filenames and the tree object name at the root level > means you do not get anything out of the attribute subsystem, which > in turn may affect a few more things. > > Unfortunately the format used in the output from "diff --raw" does > not capture this. Don't we just use the working tree .gitattributes by default, and ignore what's in the endpoints? In a bare repo we wouldn't have that, but I think the recently added attr.tree and --attr-source options would work there. You can't use attributes from multiple trees in a single request, but I doubt that would be a big drawback. -Peff ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-13 10:38 ` Jeff King @ 2024-12-15 2:07 ` Junio C Hamano 2024-12-15 2:17 ` Junio C Hamano 0 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2024-12-15 2:07 UTC (permalink / raw) To: Jeff King; +Cc: Justin Tobler, git, ps Jeff King <peff@peff.net> writes: >> Unfortunately the format used in the output from "diff --raw" does >> not capture this. > > Don't we just use the working tree .gitattributes by default, and ignore > what's in the endpoints? In a bare repo we wouldn't have that, but I > think the recently added attr.tree and --attr-source options would work > there. Yeah, you're right. I forgot about attr.tree (does not seem to be documented, by the way) and --attr-source & GIT_ATTR_SOURCE. I imagine that this feature is primarily meant to be used on the server side, and in bare repositories, only "diff-tree" makes sense among the diff-* family of commands, which (as server environments lack both "the index" and "the working tree") would already be using these mechanisms, so there are no new issues introduced here. > You can't use attributes from multiple trees in a single request, but I > doubt that would be a big drawback. I think it is also true with the normal diff-tree and friends; I do not think it looks up attributes from each tree independently when you run "git diff-tree -r A B" to compare the blob in tree A that is CRLF with the blob at the same path in tree B. So we should be OK. Thanks. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-15 2:07 ` Junio C Hamano @ 2024-12-15 2:17 ` Junio C Hamano 2024-12-16 11:11 ` Jeff King 0 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2024-12-15 2:17 UTC (permalink / raw) To: Jeff King; +Cc: Justin Tobler, git, ps Junio C Hamano <gitster@pobox.com> writes: > Yeah, you're right. I forgot about attr.tree (does not seem to be > documented, by the way) We do have an entry in Documentation/config/attr.txt that describes the three; I simply assumed it is not documented as I didn't see it mentioned in Documentation/git.txt where --attr-source & GIT_ATTR_SOURCE are described. We may want to add something like this, perhaps? ----- >8 ----- Subject: doc: give attr.tree a bit more visibility In "git help config" output, attr.tree mentions both --attr-source and GIT_ATTR_SOURCE, but the description of --attr-source and GIT_ATTR_SOURCE that appear in "git help git", attr.tree is missing. Add it so that these three are described together in both places. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- Documentation/git.txt | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git c/Documentation/git.txt w/Documentation/git.txt index d15a869762..13f6785408 100644 --- c/Documentation/git.txt +++ w/Documentation/git.txt @@ -228,7 +228,10 @@ If you just want to run git as if it was started in `<path>` then use --attr-source=<tree-ish>:: Read gitattributes from <tree-ish> instead of the worktree. See linkgit:gitattributes[5]. This is equivalent to setting the - `GIT_ATTR_SOURCE` environment variable. + `GIT_ATTR_SOURCE` environment variable. The `attr.tree` + configuration variable is used as a fallback when this option + or the environment variable are not in use. + GIT COMMANDS ------------ ^ permalink raw reply related [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-15 2:17 ` Junio C Hamano @ 2024-12-16 11:11 ` Jeff King 2024-12-16 16:29 ` Junio C Hamano 0 siblings, 1 reply; 78+ messages in thread From: Jeff King @ 2024-12-16 11:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: Justin Tobler, git, ps On Sat, Dec 14, 2024 at 06:17:01PM -0800, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > > Yeah, you're right. I forgot about attr.tree (does not seem to be > > documented, by the way) > > We do have an entry in Documentation/config/attr.txt that describes > the three; I simply assumed it is not documented as I didn't see it > mentioned in Documentation/git.txt where --attr-source & > GIT_ATTR_SOURCE are described. > > We may want to add something like this, perhaps? > > ----- >8 ----- > Subject: doc: give attr.tree a bit more visibility Yeah, that looks good to me. I recall that the performance of attr.tree is not great for _some_ commands (like pack-objects). So it's perhaps reasonable to use for single commands like "git diff" but not to set in your on-disk config. It's possible we'd want to warn people about that before advertising it more widely? I dunno. -Peff ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-16 11:11 ` Jeff King @ 2024-12-16 16:29 ` Junio C Hamano 2024-12-18 11:39 ` Jeff King 0 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2024-12-16 16:29 UTC (permalink / raw) To: Jeff King; +Cc: Justin Tobler, git, ps Jeff King <peff@peff.net> writes: > I recall that the performance of attr.tree is not great for _some_ > commands (like pack-objects). So it's perhaps reasonable to use for > single commands like "git diff" but not to set in your on-disk config. > It's possible we'd want to warn people about that before advertising it > more widely? I dunno. Or we disable the unusably-inefficient feature before doing so. Would attr.tree be much less efficient than GIT_ATTR_SOURCE? ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-16 16:29 ` Junio C Hamano @ 2024-12-18 11:39 ` Jeff King 2024-12-18 14:53 ` Junio C Hamano 0 siblings, 1 reply; 78+ messages in thread From: Jeff King @ 2024-12-18 11:39 UTC (permalink / raw) To: Junio C Hamano; +Cc: Justin Tobler, git, ps On Mon, Dec 16, 2024 at 08:29:41AM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > I recall that the performance of attr.tree is not great for _some_ > > commands (like pack-objects). So it's perhaps reasonable to use for > > single commands like "git diff" but not to set in your on-disk config. > > It's possible we'd want to warn people about that before advertising it > > more widely? I dunno. > > Or we disable the unusably-inefficient feature before doing so. > Would attr.tree be much less efficient than GIT_ATTR_SOURCE? Whether it's unusably inefficient depends on what you throw at it. IIRC, the performance difference for pack-objects on git.git was fairly negligible. The problem in linux.git is that besides being big, it has a deep(er) directory structure. So collecting all of the attributes for a file like drivers/gpu/drm/foo/bar.h needs to open all of those intermediate trees. So I'd be inclined to leave it in place, in case somebody is actually happily using it. GIT_ATTR_SOURCE suffers all of the same problems; it's just that you'd presumably only use it with a few select commands (as far as I know, pack-objects is the worst case because it's looking up one attribute on every single blob in all of history). -Peff ^ permalink raw reply [flat|nested] 78+ messages in thread
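To make the depth argument concrete, here is a small sketch that lists the per-directory .gitattributes files that could apply to a single path. It is an illustration of why deeper paths mean more lookups, not a description of Git's internal attribute stack: attr_files is a hypothetical helper, and it ignores other attribute sources such as $GIT_DIR/info/attributes. With a tree as the attribute source, each listed directory corresponds to an intermediate tree that has to be opened.

```shell
# Sketch only: attr_files is a hypothetical helper, not a Git command.
# It lists the .gitattributes files that could apply to one path, from the
# top-level one down to the one in the path's own directory.
attr_files () {
	path=$1
	dir=
	echo .gitattributes
	# Walk the leading directories of the path, one component at a time.
	while [ "${path#*/}" != "$path" ]; do
		dir=$dir${path%%/*}/
		path=${path#*/}
		echo "$dir.gitattributes"
	done
}

# The example path from the message above: four directory levels deep,
# so five candidate files including the top-level one.
attr_files drivers/gpu/drm/foo/bar.h
```

Multiply that by every blob in history and the pack-objects case described above follows directly.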
* Re: [PATCH 0/3] batch blob diff generation 2024-12-18 11:39 ` Jeff King @ 2024-12-18 14:53 ` Junio C Hamano 2024-12-20 9:09 ` Jeff King 0 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2024-12-18 14:53 UTC (permalink / raw) To: Jeff King; +Cc: Justin Tobler, git, ps Jeff King <peff@peff.net> writes: >> > I recall that the performance of attr.tree is not great for _some_ >> > commands (like pack-objects). So it's perhaps reasonable to use for >> > single commands like "git diff" but not to set in your on-disk config. >> > It's possible we'd want to warn people about that before advertising it >> > more widely? I dunno. >> >> Or we disable the unusably-inefficient feature before doing so. >> Would attr.tree be much less efficient than GIT_ATTR_SOURCE? > > Whether it's unusably inefficient depends on what you throw at it. IIRC, > the performance difference for pack-objects on git.git was fairly > negligible. The problem in linux.git is that besides being big, it has a > deep(er) directory structure. So collecting all of the attributes for a > file like drivers/gpu/drm/foo/bar.h needs to open all of those > intermediate trees. > > So I'd be inclined to leave it in place, in case somebody is actually > happily using it. > > GIT_ATTR_SOURCE suffers all of the same problems; it's just that you'd > presumably only use it with a few select commands (as far as I know, > pack-objects is the worst case because it's looking up one attribute on > every single blob in all of history). Ah, OK. So your "caution" was about the underlying mechanism to allow attributes collected from the specified tree, and not specifically about using "attr.tree" to specify that tree? That was what got me confused. If that is the case, I do not think the documentation patch that started this exchange that adds attr.tree to where GIT_ATTR_SOURCE and --attr-source are already mentioned makes anything worse. Thanks. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-18 14:53 ` Junio C Hamano @ 2024-12-20 9:09 ` Jeff King 2024-12-20 9:10 ` Jeff King 0 siblings, 1 reply; 78+ messages in thread From: Jeff King @ 2024-12-20 9:09 UTC (permalink / raw) To: Junio C Hamano; +Cc: Justin Tobler, git, ps On Wed, Dec 18, 2024 at 06:53:31AM -0800, Junio C Hamano wrote: > > Whether it's unusably inefficient depends on what you throw at it. IIRC, > > the performance difference for pack-objects on git.git was fairly > > negligible. The problem in linux.git is that besides being big, it has a > > deep(er) directory structure. So collecting all of the attributes for a > > file like drivers/gpu/drm/foo/bar.h needs to open all of those > > intermediate trees. > > > > So I'd be inclined to leave it in place, in case somebody is actually > > happily using it. > > > > GIT_ATTR_SOURCE suffers all of the same problems; it's just that you'd > > presumably only use it with a few select commands (as far as I know, > > pack-objects is the worst case because it's looking up one attribute on > > every single blob in all of history). > > Ah, OK. So your "caution" was about the underlying mechanism to > allow attributes corrected from the specified tree, and not > specifically about using "attr.tree" to specify that tree? That was > what got me confused. > > If that is the case, I do not think the documentation patch that > started this exchange that adds attr.tree to where GIT_ATTR_SOURCE > and --attr-source are already mentioned makes anything worse. Yeah, I agree it's somewhat orthogonal. Your patch made me think about it because it is advertising the config variant more widely. Somebody doing: git --attr-source=foo diff ... is probably OK, but: git --attr-source=foo pack-objects ... is less so. Using attr.tree instead means you're going to do the latter whether you intended to or not. -Peff ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-20 9:09 ` Jeff King @ 2024-12-20 9:10 ` Jeff King 0 siblings, 0 replies; 78+ messages in thread From: Jeff King @ 2024-12-20 9:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Justin Tobler, git, ps On Fri, Dec 20, 2024 at 04:09:08AM -0500, Jeff King wrote: > > Ah, OK. So your "caution" was about the underlying mechanism to > > allow attributes corrected from the specified tree, and not > > specifically about using "attr.tree" to specify that tree? That was > > what got me confused. > > > > If that is the case, I do not think the documentation patch that > > started this exchange that adds attr.tree to where GIT_ATTR_SOURCE > > and --attr-source are already mentioned makes anything worse. > > Yeah, I agree it's somewhat orthogonal. Your patch made me think about > it because it is advertising the config variant more widely. Somebody > doing: > > git --attr-source=foo diff ... > > is probably OK, but: > > git --attr-source=foo pack-objects ... > > is less so. Using attr.tree instead means you're going to do the latter > whether you intended to or not. Re-reading my message, I guess I didn't really give any conclusion. ;) I should add: I'm OK with leaving the performance implications undocumented for now. Hopefully in the long run somebody is interested in addressing them. -Peff ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-13 8:12 ` [PATCH 0/3] batch blob diff generation Jeff King 2024-12-13 10:17 ` Junio C Hamano @ 2024-12-13 16:41 ` Justin Tobler 2024-12-16 11:18 ` Jeff King 2024-12-13 22:34 ` Junio C Hamano 2 siblings, 1 reply; 78+ messages in thread From: Justin Tobler @ 2024-12-13 16:41 UTC (permalink / raw) To: Jeff King; +Cc: git, ps On 24/12/13 03:12AM, Jeff King wrote: > In the first example, I think just using "git diff" would work (though > it is not a plumbing command). But the stdin example is what's > interesting here anyway, since it can handle arbitrary inputs. So let's > focus on that. > > Feeding just blob ids has a big drawback: we don't have any context! So > you get bogus filenames in the patch, no mode data, and so on. > > Feeding the paths along with their commits, as you do on the second > line, gives you those things from the lookup context. But it also has > some problems. One, it's needlessly expensive; we have to traverse > HEAD~5000, and then dig into its tree to find the blobs (which > presumably you already did, since how else would you end up with those > oids). And two, there are parsing ambiguities, since arbitrary revision > names can contain spaces. E.g., are we looking for the file "README.md > HEAD:README.md" in HEAD~5000? > > So ideally we'd have an input format that encapsulates that extra > context data and provides some mechanism for quoting. And it turns out > we do: the --raw diff format. I had not considered using the raw diff format as the input source. As you pointed out, using blob IDs alone loses some of the useful context. By using path-scoped revisions, we can still get this context, but at an added cost of having to traverse the tree to get the underlying information. As you also mentioned, this is potentially wasteful if, for example, the blobs diffs you are trying to generate are a subset of git-diff-tree(1) output and thus the context is already known ahead of time. 
Which is exactly what we are hoping to accomplish. > If the program takes that format, then you can manually feed it two > arbitrary blob oids if you have them (and put whatever you like for the > mode/path context), like: > > git diff-blob --stdin <<\EOF > :100644 100644 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b M README.md > EOF > > Or you can get the real context yourself (though it seems to me that > this is a gap in what "cat-file --batch" should be able to do in a > single process): > > git ls-tree HEAD~5000 README.md >out > read mode_a blob oid_a path <out > git ls-tree HEAD README.md >out > read mode_b blob oid_b path <out > printf ":$mode_a $mode_b $oid_a $oid_b M\tREADME.md" | > git diff-blob --stdin > > But it also means you can use --raw output directly. So: > > git diff-tree --raw -r HEAD~5000 HEAD -- README.md | > git diff-blob --stdin > > Now that command by itself doesn't look all that useful; you could have > just asked for patches from diff-tree. But by splitting the two, you can > filter the set of paths in between (for example, to omit some entries, > or to batch a large diff into more manageable chunks for pagination, > etc). Yup, this is exactly what I'm hoping to achieve! As a single commit may contain an unbounded number of changes, being able to control diff generation at the blob level is quite useful. Using the raw diff format as input looks like a rather elegant solution and I think it makes sense to use it here for the "--stdin" option over just reading two blobs. > The patch might look something like this: > > https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/ > > :) That is what has been powering the diffs at github.com since 2016 or > so. And continues to do so, as far as I know. I don't have access to > their internal repository anymore, but I've continued to rebase the > topic forward in my personal repo.
You can fetch it from: > > https://github.com/peff/git jk/diff-pairs > > in case that is helpful. Thanks Peff! From looking at the mentioned thread and branch, it looks like I'm basically trying to accomplish the same thing here. Just a bit late to the conversation. :) While the use-case is rather narrow, I think it would be nice to see this functionality provided upstream. I see this as a means to facilitate more fine-grained control of the blob diffs we actually want to compute at a given time and it seems like it would be reasonable to expose as part of the diff plumbing. I would certainly be interested in adapting this series to instead use raw input from git-diff-tree(1) or trying to revive the previous series if that is preferred. If there is interest in continuing, some lingering questions I have: Being that the primary purpose of git-diff-blob(1) here is to handle generating blob diffs as specified by stdin, is there any reason to have a normal mode that accepts a blob pair as arguments? Or would it be best to limit the input mechanism to stdin entirely? If the user wanted to compute a single blob diff they could just use git-diff(1) already so providing this as a part of git-diff-blob(1) is a bit redundant. Having it as an option for the user does seem a bit more friendly though. -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-13 16:41 ` Justin Tobler @ 2024-12-16 11:18 ` Jeff King 0 siblings, 0 replies; 78+ messages in thread From: Jeff King @ 2024-12-16 11:18 UTC (permalink / raw) To: Justin Tobler; +Cc: git, ps On Fri, Dec 13, 2024 at 10:41:25AM -0600, Justin Tobler wrote: > > The patch might look something like this: > > > > https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/ > > > > :) That is what has been powering the diffs at github.com since 2016 or > > so. And continues to do so, as far as I know. I don't have access to > > their internal repository anymore, but I've continued to rebase the > > topic forward in my personal repo. You can fetch it from: > > > > https://github.com/peff/git jk/diff-pairs > > > > in case that is helpful. > > Thanks Peff! From looking at the mentioned thread and branch, it looks > like I'm basically trying to accomplish the same thing here. Just a bit > late to the conversation. :) > > While the use-case is rather narrow, I think it would be nice to see > this functionality provided upstream. I see this as a means to faciliate > more fine-grained control of the blob diffs we actually want to compute > at a given time and it seems like it would be reasonable to expose as > part of the diff plumbing. I would certainly be interested in adapting > this series to instead use raw input from git-diff-tree(1) or trying to > revive the previous series if that is preferred. Yeah, if you want to take it in that direction, either by adapting the idea, or by starting with diff-pairs and polishing it up, I'm happy either way. GitHub folks may be happy if you keep the name "diff-pairs" and match the interface. ;) > If there is interest in continuing, some lingering questions I have: > > Being that the primary purpose of git-diff-blob(1) here is to handle > generating blob diffs as specified by stdin, is there any reason to have > a normal mode that accepts a blob pair as arguments? 
Or would it be best > to limit the input mechanism to stdin entirely? If the user wanted to > compute a single blob diff they could just use git-diff(1) already so > providing this as a part of git-diff-blob(1) is a bit redundant. Having > it as an option for the user does seem a bit more friendly though. I don't have a strong opinion. I agree it _could_ be more friendly, but it also raises questions about how that mode/path context is filled in. And also about other modes. E.g., "git diff HEAD:foo bar" will diff between a blob and a working tree. Should a new plumbing command support that, too? If those aren't things you immediately care about, I'd probably punt on it for now. I think it could be added later without losing compatibility (command-line arguments as appropriate for a single pair, and since the stdin format starts with ":mode" there's room to extend it). -Peff ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-13 8:12 ` [PATCH 0/3] batch blob diff generation Jeff King 2024-12-13 10:17 ` Junio C Hamano 2024-12-13 16:41 ` Justin Tobler @ 2024-12-13 22:34 ` Junio C Hamano 2024-12-15 23:24 ` Junio C Hamano 2 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2024-12-13 22:34 UTC (permalink / raw) To: Jeff King; +Cc: Justin Tobler, git, ps Jeff King <peff@peff.net> writes: > So ideally we'd have an input format that encapsulates that extra > context data and provides some mechanism for quoting. And it turns out > we do: the --raw diff format. Funny. The raw diff format indeed was designed as an interchange format from various "compare two sets of things" front-ends (like diff-files, diff-cache, and diff-tree) that emits the raw format, to be read by "diff-helper" (initially called "diff-tree-helper") that takes the raw format and - matches removed and added paths with similar contents to detect renames and copies - computes the output in various formats including "patch". So I guess we came a full circle, finally ;-). Looking in the archive for messages exchanged between junkio@ and torvalds@ mentioning diff before 2005-05-30 finds some interesting gems. https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/ ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH 0/3] batch blob diff generation 2024-12-13 22:34 ` Junio C Hamano @ 2024-12-15 23:24 ` Junio C Hamano 2024-12-16 11:30 ` Jeff King 0 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2024-12-15 23:24 UTC (permalink / raw) To: Justin Tobler; +Cc: Jeff King, git, ps Junio C Hamano <gitster@pobox.com> writes: > Jeff King <peff@peff.net> writes: > >> So ideally we'd have an input format that encapsulates that extra >> context data and provides some mechanism for quoting. And it turns out >> we do: the --raw diff format. > > Funny. The raw diff format indeed was designed as an interchange > format from various "compare two sets of things" front-ends (like > diff-files, diff-cache, and diff-tree) that emits the raw format, to > be read by "diff-helper" (initially called "diff-tree-helper") that > takes the raw format and > > - matches removed and added paths with similar contents to detect > renames and copies > > - computes the output in various formats including "patch". > > So I guess we came a full circle, finally ;-). Looking in the archive > for messages exchanged between junkio@ and torvalds@ mentioning diff > before 2005-05-30 finds some interesting gems. > > https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/ So, if we were to do what Justin tried to do honoring the overall design of our diff machinery, I think what we can do is as follows: * Use the "diff --raw" output format as the input, but with a bit of twist. (1) a narrow special case that takes only a single diff_filepair of <old> and <new> blobs, and immediately run diff_queue() on that single diff_filepair, which is Justin's use case. For this mode of operation, "flush after each record of input" may be sufficient. (2) as a general "interchange format" to feed "comparison between two sets of <object, path>" into our diff machinery, we are better off if we can treat the input stream as multiple records that describes comparison between two sets. 
Imagine "git log --oneline --first-parent -2 --raw HEAD", where one set of "diff --raw" records shows the changed blobs with their paths between HEAD~1 and HEAD, and another set does so for HEAD~2 and HEAD~1. We need to be able to tell where the first set ends and the second set starts, so that rename detection and other things, if requested, can be done within each set.

My recommendation is to use a single blank line as a separator, e.g.

    :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt
    :100644 100644 8b3882cff1 4a74f7c7bd M refs.c
    :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh

    :100644 100644 c11213f520 8953d1c6d3 M refs/files-backend.c
    :100644 100644 b2e3ba877d bec5962deb M refs/reftable-backend.c

so an application that wants to compare only one diff_filepair at a time would issue something like

    :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt

    :100644 100644 8b3882cff1 4a74f7c7bd M refs.c

    :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh

so the parsing machinery does not have to worry about case (1) above.

 * Parse and append the input into diff_queue(), until you see a blank line.

   - If at EOF you are done, but if you have something accumulated in diff_queue(), show them (like below) first. In any case, at EOF, you are done.

 * Run diffcore_std() followed by diff_flush() to have the contents of the queue nicely formatted and emptied. Go back to parsing more input lines.

^ permalink raw reply [flat|nested] 78+ messages in thread
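The accumulate-until-blank-line loop described above can be mimicked outside of Git to see how an input stream breaks into sets. The following is a rough sketch under stated assumptions: count_sets, one_pair_per_set and sample are hypothetical helpers, "flushing" is reduced to printing how many records were queued, and the input is the non-"-z" raw format with abbreviated object names as in the examples above.

```shell
# Sketch only: hypothetical helpers, not part of Git.

# Accumulate records until a blank line (or EOF), then "flush" the set by
# printing how many records it held.
count_sets () {
	awk 'NF { n++; next } n { print n; n = 0 } END { if (n) print n }'
}

# Put every record in its own set by emitting a blank line after each one,
# which is what a caller wanting one diff_filepair at a time would send.
one_pair_per_set () {
	awk 'NF { print; print "" }'
}

# Two sets: the first with two records, the second with one.
sample () {
	printf ':100644 100644 ce31f93061 9829984b0a M\tDocumentation/git-refs.txt\n'
	printf ':100644 100644 8b3882cff1 4a74f7c7bd M\trefs.c\n'
	printf '\n'
	printf ':100644 100644 c11213f520 8953d1c6d3 M\trefs/files-backend.c\n'
}

sample | count_sets                      # one line per set: 2, then 1
sample | one_pair_per_set | count_sets   # three sets of one record each
```

Note that the final set needs no trailing blank line; EOF flushes whatever is still queued, matching the EOF rule in the list above.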
* Re: [PATCH 0/3] batch blob diff generation 2024-12-15 23:24 ` Junio C Hamano @ 2024-12-16 11:30 ` Jeff King 0 siblings, 0 replies; 78+ messages in thread From: Jeff King @ 2024-12-16 11:30 UTC (permalink / raw) To: Junio C Hamano; +Cc: Justin Tobler, git, ps On Sun, Dec 15, 2024 at 03:24:11PM -0800, Junio C Hamano wrote: > > Funny. The raw diff format indeed was designed as an interchange > > format from various "compare two sets of things" front-ends (like > > diff-files, diff-cache, and diff-tree) that emits the raw format, to > > be read by "diff-helper" (initially called "diff-tree-helper") that > > takes the raw format and > > > > - matches removed and added paths with similar contents to detect > > renames and copies > > > > - computes the output in various formats including "patch". > > > > So I guess we came a full circle, finally ;-). Looking in the archive > > for messages exchanged between junkio@ and torvalds@ mentioning diff > > before 2005-05-30 finds some interesting gems. > > > > https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/ :) That same thread was linked when I posted the original diff-pairs many years ago. > So, if we were to do what Justin tried to do honoring the overall > design of our diff machinery, I think what we can do is as follows: > > * Use the "diff --raw" output format as the input, but with a bit > of twist. > > (1) a narrow special case that takes only a single diff_filepair > of <old> and <new> blobs, and immediately run diff_queue() on > that single diff_filepair, which is Justin's use case. For > this mode of operation, "flush after each record of input" > may be sufficient. My understanding was that he does not actually care about this case (just feeding two blobs), but is actually processing --raw output in the first place. Or did you just mean that we'd still be feeding raw output, but just getting the flush behavior? 
> (2) as a general "interchange format" to feed "comparison between > two sets of <object, path>" into our diff machinery, we are > better off if we can treat the input stream as multiple > records that describes comparison between two sets. Imagine > "git log --oneline --first-parent -2 --raw HEAD", where one > set of "diff --raw" records show the changed blobs with their > paths between HEAD~1 and HEAD, and another set does so for > HEAD~2 and HEAD~1. We need to be able to tell where the > first set ends and the second set starts, so that rename > detection and other things, if requested, can be done within > each set. Seems reasonable. For the use of diff-pairs at GitHub, I always just did full-tree things like rename detection in the initial diff-tree invocation. Since my goal was splitting/filtering, doing it after would yield wrong answers (since diff-pairs never sees the complete set). But it's possible for somebody to want to filter the intermediate results, then do full-tree stuff on the result (or even just delay the cost of rename detection). And certainly it's possible to want to feed a whole bunch of unrelated diff segments without having to spawn a process for each. So it's not something I wanted, but I agree it's good to plan for. > My recommendation is to use a single blank line as a separator, > e.g. 
> > :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt > :100644 100644 8b3882cff1 4a74f7c7bd M refs.c > :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh > > :100644 100644 c11213f520 8953d1c6d3 M refs/files-backend.c > :100644 100644 b2e3ba877d bec5962deb M refs/reftable-backend.c > > so an application that wants to compare only one diff_filepair > at a time would issue something like > > :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt > > :100644 100644 8b3882cff1 4a74f7c7bd M refs.c > > :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh > > so the parsing machinery does not have to worry about case (1) above. Yeah, that seems good. And it is backwards-compatible with the existing diff-pairs format (which just barfs on a blank line). That's not a big concern for the project, but it is nice that it makes life a bit simpler for folks who would eventually be tasked with switching from it to this new tool. ;) > * Parse and append the input into diff_queue(), until you see an > blank line. > > - If at EOF you are done, but if you have something accumulated > in diff_queue(), show them (like below) first. In any case, at > EOF, you are done. Yep, makes sense. > * Run diffcore_std() followed by diff_flush() to have the contents > of the queue nicely formatted and emptied. Go back to parsing > more input lines. That makes sense. I don't think my diff-pairs runs diffcore_std() at all. The plumbing defaults mean it would always be a noop unless you explicitly passed in rename, etc, options, and I never wanted to do that. We might have to check the interaction of diffcore code on a set of queued diffs that already have values for renames, etc. I.e., that: git diff-tree --raw -M | git diff-pairs -M does not barf, since the input step in diff-pairs is going to set status to 'R', etc, in the pairs. -Peff ^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v2 0/3] batch blob diff generation 2024-12-13 4:23 [PATCH 0/3] batch blob diff generation Justin Tobler ` (3 preceding siblings ...) 2024-12-13 8:12 ` [PATCH 0/3] batch blob diff generation Jeff King @ 2025-02-12 4:18 ` Justin Tobler 2025-02-12 4:18 ` [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers Justin Tobler ` (3 more replies) 4 siblings, 4 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-12 4:18 UTC (permalink / raw) To: git; +Cc: peff, Justin Tobler Through git-diff(1) it is possible to generate a diff directly between two blobs. This is particularly useful when the pre-image and post-image blobs are known and we only care about the diff between them. Unfortunately, if a user has a batch of known blob pairs to compute diffs for, there is currently not a way to do so via a single Git process. To enable support for batch diffs of multiple blob pairs, this series introduces a new diff plumbing command git-diff-pairs(1) based on a previous patch series submitted by Peff[1]. This command uses null delimited raw diffs as its source of input to control exactly which filepairs are diffed. The advantage of using the raw diff format is that it already has diff status type and object context information embedded in each line making it more efficient to generate diffs with as we can avoid having to peel revisions to get some of the same info. For example: git diff-tree -r -z -M $old $new | git diff-pairs -p Here the output of git-diff-tree(1) is fed to git-diff-pairs(1) to generate the same output that would be expected from `git diff-tree -p -M`. While by itself not particularly useful, this means it is possible to split git-diff-tree(1) output across multiple git-diff-pairs(1) processes. Such a feature is useful on the server-side where diffs between a large set of changes may not be feasible all at once due to timeout concerns. 
This series is structured as follows: - Patch 1 adds some new helper functions to get access to the queued `diff_filepair` after `diff_queue()` is invoked. - Patch 2 introduces the new git-diff-pairs(1) plumbing command. - Patch 3 teaches git-diff-pairs(1) a way to perform explicit diff queue flushes instead of waiting until stdin EOF to flush. In 1f010d6bdf (doc: use .adoc extension for AsciiDoc files, 2025-01-20), the extension for documentation was change from .txt to .adoc. This series builds on top of that change as to avoid conflicts in next. Changes since V1: - Changed from git-diff-blob(1) to git-diff-pairs(1) based on a previously submitted series. - Instead of each line containing a pair of blob revisions, the raw diff format is used as input which already has diff status and object context embedded. -Justin [1]: <20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net> Justin Tobler (3): diff: return diff_filepair from diff queue helpers builtin: introduce diff-pairs command builtin/diff-pairs: allow explicit diff queue flush .gitignore | 1 + Documentation/git-diff-pairs.adoc | 66 +++++++++++ Documentation/meson.build | 1 + Makefile | 1 + builtin.h | 1 + builtin/diff-pairs.c | 189 ++++++++++++++++++++++++++++++ command-list.txt | 1 + diff.c | 66 ++++++++--- diff.h | 15 +++ git.c | 1 + meson.build | 1 + t/meson.build | 1 + t/t4070-diff-pairs.sh | 102 ++++++++++++++++ 13 files changed, 427 insertions(+), 19 deletions(-) create mode 100644 Documentation/git-diff-pairs.adoc create mode 100644 builtin/diff-pairs.c create mode 100755 t/t4070-diff-pairs.sh -- 2.48.1 ^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers 2025-02-12 4:18 ` [PATCH v2 " Justin Tobler @ 2025-02-12 4:18 ` Justin Tobler 2025-02-12 9:06 ` Karthik Nayak 2025-02-12 9:23 ` Patrick Steinhardt 2025-02-12 4:18 ` [PATCH v2 2/3] builtin: introduce diff-pairs command Justin Tobler ` (2 subsequent siblings) 3 siblings, 2 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-12 4:18 UTC (permalink / raw) To: git; +Cc: peff, Justin Tobler The `diff_addremove()` and `diff_change()` functions setup and queue diffs, but do not return the `diff_filepair` added to the queue. In a subsequent commit, modifications to `diff_filepair` need to take place in certain cases after being queued. Split out the queuing operations into `diff_filepair_addremove()` and `diff_filepair_change()` which also return a handle to the queued `diff_filepair`. Signed-off-by: Justin Tobler <jltobler@gmail.com> --- diff.c | 66 +++++++++++++++++++++++++++++++++++++++++----------------- diff.h | 15 +++++++++++++ 2 files changed, 62 insertions(+), 19 deletions(-) diff --git a/diff.c b/diff.c index 019fb893a7..afbb892c26 100644 --- a/diff.c +++ b/diff.c @@ -7157,16 +7157,18 @@ void compute_diffstat(struct diff_options *options, options->found_changes = !!diffstat->nr; } -void diff_addremove(struct diff_options *options, - int addremove, unsigned mode, - const struct object_id *oid, - int oid_valid, - const char *concatpath, unsigned dirty_submodule) +struct diff_filepair *diff_filepair_addremove(struct diff_options *options, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, + const char *concatpath, + unsigned dirty_submodule) { struct diff_filespec *one, *two; + struct diff_filepair *pair; if (S_ISGITLINK(mode) && is_submodule_ignored(concatpath, options)) - return; + return NULL; /* This may look odd, but it is a preparation for * feeding "there are unchanged files which should @@ -7186,7 +7188,7 @@ void diff_addremove(struct diff_options 
*options, if (options->prefix && strncmp(concatpath, options->prefix, options->prefix_length)) - return; + return NULL; one = alloc_filespec(concatpath); two = alloc_filespec(concatpath); @@ -7198,25 +7200,28 @@ void diff_addremove(struct diff_options *options, two->dirty_submodule = dirty_submodule; } - diff_queue(&diff_queued_diff, one, two); + pair = diff_queue(&diff_queued_diff, one, two); if (!options->flags.diff_from_contents) options->flags.has_changes = 1; + + return pair; } -void diff_change(struct diff_options *options, - unsigned old_mode, unsigned new_mode, - const struct object_id *old_oid, - const struct object_id *new_oid, - int old_oid_valid, int new_oid_valid, - const char *concatpath, - unsigned old_dirty_submodule, unsigned new_dirty_submodule) +struct diff_filepair *diff_filepair_change(struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *concatpath, + unsigned old_dirty_submodule, + unsigned new_dirty_submodule) { struct diff_filespec *one, *two; struct diff_filepair *p; if (S_ISGITLINK(old_mode) && S_ISGITLINK(new_mode) && is_submodule_ignored(concatpath, options)) - return; + return NULL; if (options->flags.reverse_diff) { SWAP(old_mode, new_mode); @@ -7227,7 +7232,7 @@ void diff_change(struct diff_options *options, if (options->prefix && strncmp(concatpath, options->prefix, options->prefix_length)) - return; + return NULL; one = alloc_filespec(concatpath); two = alloc_filespec(concatpath); @@ -7238,16 +7243,39 @@ void diff_change(struct diff_options *options, p = diff_queue(&diff_queued_diff, one, two); if (options->flags.diff_from_contents) - return; + return p; if (options->flags.quick && options->skip_stat_unmatch && !diff_filespec_check_stat_unmatch(options->repo, p)) { diff_free_filespec_data(p->one); diff_free_filespec_data(p->two); - return; + return p; } options->flags.has_changes = 1; + + 
return p; +} + +void diff_addremove(struct diff_options *options, int addremove, unsigned mode, + const struct object_id *oid, int oid_valid, + const char *concatpath, unsigned dirty_submodule) +{ + diff_filepair_addremove(options, addremove, mode, oid, oid_valid, + concatpath, dirty_submodule); +} + +void diff_change(struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *concatpath, + unsigned old_dirty_submodule, unsigned new_dirty_submodule) +{ + diff_filepair_change(options, old_mode, new_mode, old_oid, new_oid, + old_oid_valid, new_oid_valid, concatpath, + old_dirty_submodule, new_dirty_submodule); } struct diff_filepair *diff_unmerge(struct diff_options *options, const char *path) diff --git a/diff.h b/diff.h index 0a566f5531..6ea63f01e7 100644 --- a/diff.h +++ b/diff.h @@ -508,6 +508,21 @@ void diff_set_default_prefix(struct diff_options *options); int diff_can_quit_early(struct diff_options *); +struct diff_filepair *diff_filepair_addremove(struct diff_options *, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, const char *fullpath, + unsigned dirty_submodule); + +struct diff_filepair *diff_filepair_change(struct diff_options *, + unsigned mode1, unsigned mode2, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *fullpath, + unsigned dirty_submodule1, + unsigned dirty_submodule2); + void diff_addremove(struct diff_options *, int addremove, unsigned mode, -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
* Re: [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers
From: Karthik Nayak @ 2025-02-12 9:06 UTC
To: Justin Tobler, git; +Cc: peff

Justin Tobler <jltobler@gmail.com> writes:

> The `diff_addremove()` and `diff_change()` functions setup and queue
> diffs, but do not return the `diff_filepair` added to the queue. In a
> subsequent commit, modifications to `diff_filepair` need to take place
> in certain cases after being queued.
>
> Split out the queuing operations into `diff_filepair_addremove()` and
> `diff_filepair_change()` which also return a handle to the queued
> `diff_filepair`.
>

This patch keeps `diff_addremove()` and `diff_change()` while
introducing two new functions which return the `diff_filepair`. Just a
thought, why not replace them? The users of `diff_addremove()` and
`diff_change()` could simply call the new functions and ignore the
return value?

This would be messy if there were a lot of users of `diff_addremove()`
and `diff_change()`, but I only see a few callers. Wouldn't it be
cleaner to just replace?

The patch looks good to me otherwise.

[snip]

> diff --git a/diff.h b/diff.h
> index 0a566f5531..6ea63f01e7 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -508,6 +508,21 @@ void diff_set_default_prefix(struct diff_options *options);
>
>  int diff_can_quit_early(struct diff_options *);
>
> +struct diff_filepair *diff_filepair_addremove(struct diff_options *,
> +                                              int addremove, unsigned mode,
> +                                              const struct object_id *oid,
> +                                              int oid_valid, const char *fullpath,
> +                                              unsigned dirty_submodule);
> +
> +struct diff_filepair *diff_filepair_change(struct diff_options *,
> +                                           unsigned mode1, unsigned mode2,
> +                                           const struct object_id *old_oid,
> +                                           const struct object_id *new_oid,
> +                                           int old_oid_valid, int new_oid_valid,
> +                                           const char *fullpath,
> +                                           unsigned dirty_submodule1,
> +                                           unsigned dirty_submodule2);
> +

Nit: would be nice to have some comments to describe what these
functions do.

>  void diff_addremove(struct diff_options *,
>                      int addremove,
>                      unsigned mode,
> --
> 2.48.1
* Re: [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers
From: Justin Tobler @ 2025-02-12 17:35 UTC
To: Karthik Nayak; +Cc: git, peff

On 25/02/12 01:06AM, Karthik Nayak wrote:
> Justin Tobler <jltobler@gmail.com> writes:
>
> > The `diff_addremove()` and `diff_change()` functions setup and queue
> > diffs, but do not return the `diff_filepair` added to the queue. In a
> > subsequent commit, modifications to `diff_filepair` need to take place
> > in certain cases after being queued.
> >
> > Split out the queuing operations into `diff_filepair_addremove()` and
> > `diff_filepair_change()` which also return a handle to the queued
> > `diff_filepair`.
> >
>
> This patch keeps `diff_addremove()` and `diff_change()` while
> introducing two new functions which return the `diff_filepair`. Just a
> thought, why not replace them? The users `diff_addremove()` and
> `diff_change()` could simply call the new functions and ignore the
> return value?

This was mostly to avoid changing the `add_remove_fn_t` and
`change_fn_t` types that store `diff_addremove()` and `diff_change()` in
`diff_options`. The `file_add_remove()` and `file_change()` functions,
which also can be set in `diff_options`, do not ever queue file pairs so
I don't think returning `diff_filepair` makes much sense there.

> This would be messy if there were a lot of users of `diff_addremove()`
> and `diff_change()`, but I only see a few callers. Wouldn't it be
> cleaner to just replace?

Patrick has suggested we avoid using the global `diff_queue_struct`
implicitly. Currently, in the next version I'm planning to keep the
separate functions as `diff_queue_addremove()` and
`diff_queue_change()`, but also accept `diff_queue_struct` as an
argument.

> The patch looks good to me otherwise.
>
> [snip]
>
> > diff --git a/diff.h b/diff.h
> > index 0a566f5531..6ea63f01e7 100644
> > --- a/diff.h
> > +++ b/diff.h
> > @@ -508,6 +508,21 @@ void diff_set_default_prefix(struct diff_options *options);
> >
> >  int diff_can_quit_early(struct diff_options *);
> >
> > +struct diff_filepair *diff_filepair_addremove(struct diff_options *,
> > +                                              int addremove, unsigned mode,
> > +                                              const struct object_id *oid,
> > +                                              int oid_valid, const char *fullpath,
> > +                                              unsigned dirty_submodule);
> > +
> > +struct diff_filepair *diff_filepair_change(struct diff_options *,
> > +                                           unsigned mode1, unsigned mode2,
> > +                                           const struct object_id *old_oid,
> > +                                           const struct object_id *new_oid,
> > +                                           int old_oid_valid, int new_oid_valid,
> > +                                           const char *fullpath,
> > +                                           unsigned dirty_submodule1,
> > +                                           unsigned dirty_submodule2);
> > +
>
> Nit: would be nice to have some comments to describe what these
> functions do.

I'll add in the next version.

Thanks
-Justin
* Re: [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers
From: Patrick Steinhardt @ 2025-02-12 9:23 UTC
To: Justin Tobler; +Cc: git, peff

On Tue, Feb 11, 2025 at 10:18:23PM -0600, Justin Tobler wrote:
> The `diff_addremove()` and `diff_change()` functions setup and queue
> diffs, but do not return the `diff_filepair` added to the queue. In a
> subsequent commit, modifications to `diff_filepair` need to take place
> in certain cases after being queued.
>
> Split out the queuing operations into `diff_filepair_addremove()` and
> `diff_filepair_change()` which also return a handle to the queued
> `diff_filepair`.

One of the things that puzzled me a bit is that we keep the old-style
functions, where the only difference is the return value. Wouldn't it
make more sense to instead adapt these existing functions to reduce the
amount of duplication?

At the same time, while we're already at it, do we maybe also want to
adapt the functions so that they get the `diff_queue` as input instead
of relying on the global queue? That would make them more generally
useful and be a step in the right direction regarding libification. If
so, it would indeed make sense to also rename the function into e.g.
`diff_queue_addremove()`.

Patrick
* Re: [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers
From: Justin Tobler @ 2025-02-12 17:24 UTC
To: Patrick Steinhardt; +Cc: git, peff

On 25/02/12 10:23AM, Patrick Steinhardt wrote:
> On Tue, Feb 11, 2025 at 10:18:23PM -0600, Justin Tobler wrote:
> > The `diff_addremove()` and `diff_change()` functions setup and queue
> > diffs, but do not return the `diff_filepair` added to the queue. In a
> > subsequent commit, modifications to `diff_filepair` need to take place
> > in certain cases after being queued.
> >
> > Split out the queuing operations into `diff_filepair_addremove()` and
> > `diff_filepair_change()` which also return a handle to the queued
> > `diff_filepair`.
>
> One of the things that puzzled me a bit is that we keep the old-style
> functions, where the only difference is the return value. Wouldn't it
> make more sense to instead adapt these existing functions to reduce the
> amount of duplication?

This is what I considered doing initially. I noticed though that both
`diff_addremove()` and `diff_change()` are stored as callbacks in
`diff_options` as types `add_remove_fn_t` and `change_fn_t`. The diff
options configured for pruning use `file_add_remove()` and
`file_change()` instead. Returning `diff_filepair` doesn't seem to make
much sense in the context of `file_add_remove()` and `file_change()` as
no filepairs ever get queued, so I opted to factor out the logic into
separate functions instead of adapting the function signatures for all.

This may not be the best option, so I can also change it if that is
best.

> At the same time, while we're already at it, do we maybe also want to
> adapt the functions so that they get the `diff_queue` as input instead
> of relying on the global queue? That would make them more generally
> useful and be a step into the right direction regarding libification. If
> so, it would indeed make sense to also rename the function into e.g.
> `diff_queue_addremove()`.

Thanks for the suggestion. I'll adapt the next version accordingly.

-Justin
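The trade-off discussed here (keeping `void` callbacks whose type is fixed while adding helpers that return a handle to the queued pair) can be illustrated with a small, self-contained sketch. Everything in it is hypothetical and heavily simplified: `struct pair`, `struct opts`, `queue_change()`, `change()` and `flag_change()` are stand-ins made up for illustration and do not mirror Git's real structures or signatures.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not Git's real structures. */
struct pair {
	int status;
};

struct opts;
typedef void (*change_fn_t)(struct opts *o, const char *path);

struct opts {
	change_fn_t change;	/* callback slot with a fixed void signature */
	struct pair queue[8];	/* toy queue owned by the options */
	size_t nr;
	int has_changes;
};

/* New-style helper: queue a pair and return a handle to it (or NULL). */
static struct pair *queue_change(struct opts *o, const char *path)
{
	struct pair *p;

	if (!path || o->nr == sizeof(o->queue) / sizeof(o->queue[0]))
		return NULL;	/* filtered out or full: nothing was queued */
	p = &o->queue[o->nr++];
	p->status = 0;
	o->has_changes = 1;
	return p;
}

/* Old-style entry point: same behavior, return value dropped. */
static void change(struct opts *o, const char *path)
{
	queue_change(o, path);
}

/* A pruning-style callback only records that something changed. */
static void flag_change(struct opts *o, const char *path)
{
	(void)path;
	o->has_changes = 1;
}
```

Because `flag_change()` never queues anything, it has no pair to return. In this toy model, that is the reason the callback type stays `void` and only the new helper exposes the queued entry to callers that need to adjust it afterwards.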
* Re: [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers
From: Patrick Steinhardt @ 2025-02-13 5:45 UTC
To: Justin Tobler; +Cc: git, peff

On Wed, Feb 12, 2025 at 11:24:55AM -0600, Justin Tobler wrote:
> On 25/02/12 10:23AM, Patrick Steinhardt wrote:
> > On Tue, Feb 11, 2025 at 10:18:23PM -0600, Justin Tobler wrote:
> > > The `diff_addremove()` and `diff_change()` functions setup and queue
> > > diffs, but do not return the `diff_filepair` added to the queue. In a
> > > subsequent commit, modifications to `diff_filepair` need to take place
> > > in certain cases after being queued.
> > >
> > > Split out the queuing operations into `diff_filepair_addremove()` and
> > > `diff_filepair_change()` which also return a handle to the queued
> > > `diff_filepair`.
> >
> > One of the things that puzzled me a bit is that we keep the old-style
> > functions, where the only difference is the return value. Wouldn't it
> > make more sense to instead adapt these existing functions to reduce the
> > amount of duplication?
>
> This is what I considered doing initially. I noticed though that both
> `diff_addremove()` and `diff_change()` are stored as callbacks in
> `diff_options` as types `add_remove_fn_t` and `change_fn_t`. The diff
> options configured for pruning use `file_add_remove()` and
> `file_change()` instead. Returning `diff_filepair` doesn't seem to make
> much sense in the context of `file_add_remove()` and `file_change()` as
> no filepairs ever get queued, so I opted to factor out the logic into
> separate functions instead of adapting the function signatures for all.
>
> This may not be the best option, so I can also change it if that is
> best.

Okay. This is context that should probably be part of the commit
message, as it is quite an important detail to understand the
implementation.

Patrick
* [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-12 4:18 ` [PATCH v2 " Justin Tobler 2025-02-12 4:18 ` [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers Justin Tobler @ 2025-02-12 4:18 ` Justin Tobler 2025-02-12 9:23 ` Patrick Steinhardt ` (4 more replies) 2025-02-12 4:18 ` [PATCH v2 3/3] builtin/diff-pairs: allow explicit diff queue flush Justin Tobler 2025-02-25 23:39 ` [PATCH v3 0/3] batch blob diff generation Justin Tobler 3 siblings, 5 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-12 4:18 UTC (permalink / raw) To: git; +Cc: peff, Justin Tobler Through git-diff(1), a single diff can be generated from a pair of blob revisions directly. Unfortunately, there is not a mechanism to compute batches of specific file pair diffs in a single process. Such a feature is particularly useful on the server-side where diffing between a large set of changes is not feasible all at once due to timeout concerns. To facilitate this, introduce git-diff-pairs(1) which takes the null-terminated raw diff format as input on stdin and produces diffs in other formats. As the raw diff format already contains the necessary metadata, it becomes possible to progressively generate batches of diffs without having to recompute rename detection or retrieve object context. Something like the following: git diff-tree -r -z -M $old $new | git diff-pairs -p should generate the same output as `git diff-tree -p -M`. Furthermore, each line of raw diff formatted input can also be individually fed to a separate git-diff-pairs(1) process and still produce the same output. 
Based-on-patch-by: Jeff King <peff@peff.net> Signed-off-by: Justin Tobler <jltobler@gmail.com> --- .gitignore | 1 + Documentation/git-diff-pairs.adoc | 62 +++++++++++ Documentation/meson.build | 1 + Makefile | 1 + builtin.h | 1 + builtin/diff-pairs.c | 178 ++++++++++++++++++++++++++++++ command-list.txt | 1 + git.c | 1 + meson.build | 1 + t/meson.build | 1 + t/t4070-diff-pairs.sh | 80 ++++++++++++++ 11 files changed, 328 insertions(+) create mode 100644 Documentation/git-diff-pairs.adoc create mode 100644 builtin/diff-pairs.c create mode 100755 t/t4070-diff-pairs.sh diff --git a/.gitignore b/.gitignore index e82aa19df0..03448c076a 100644 --- a/.gitignore +++ b/.gitignore @@ -54,6 +54,7 @@ /git-diff /git-diff-files /git-diff-index +/git-diff-pairs /git-diff-tree /git-difftool /git-difftool--helper diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc new file mode 100644 index 0000000000..e9ef4a6615 --- /dev/null +++ b/Documentation/git-diff-pairs.adoc @@ -0,0 +1,62 @@ +git-diff-pairs(1) +================= + +NAME +---- +git-diff-pairs - Compare blob pairs generated by `diff-tree --raw` + +SYNOPSIS +-------- +[verse] +'git diff-pairs' [diff-options] + +DESCRIPTION +----------- + +Given the output of `diff-tree -z` on its stdin, `diff-pairs` will +reformat that output into whatever format is requested on its command +line. For example: + +----------------------------- +git diff-tree -z -M $a $b | +git diff-pairs -p +----------------------------- + +will compute the tree diff in one step (including renames), and then +`diff-pairs` will compute and format the blob-level diffs for each pair. +This can be used to modify the raw diff in the middle (without having to +parse or re-create more complicated formats like `--patch`), or to +compute diffs progressively over the course of multiple invocations of +`diff-pairs`. + +Each blob pair is fed to the diff machinery individually queued and the output +is flushed on stdin EOF. 
+ +OPTIONS +------- + +include::diff-options.adoc[] + +include::diff-generate-patch.adoc[] + +NOTES +---- + +`diff-pairs` should handle any input generated by `diff-tree --raw -z`. +It may choke or otherwise misbehave on output from `diff-files`, etc. + +Here's an incomplete list of things that `diff-pairs` could do, but +doesn't (mostly in the name of simplicity): + + - Only `-z` input is accepted, not normal `--raw` input. + + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you + want to abbreviate the output, you can pass `--abbrev` to + `diff-pairs`. + + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via + the initial `diff-tree` invocation. + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/meson.build b/Documentation/meson.build index ead8e48213..e5ee177022 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -41,6 +41,7 @@ manpages = { 'git-diagnose.adoc' : 1, 'git-diff-files.adoc' : 1, 'git-diff-index.adoc' : 1, + 'git-diff-pairs.adoc' : 1, 'git-difftool.adoc' : 1, 'git-diff-tree.adoc' : 1, 'git-diff.adoc' : 1, diff --git a/Makefile b/Makefile index 896d02339e..3b8e1ad15e 100644 --- a/Makefile +++ b/Makefile @@ -1232,6 +1232,7 @@ BUILTIN_OBJS += builtin/describe.o BUILTIN_OBJS += builtin/diagnose.o BUILTIN_OBJS += builtin/diff-files.o BUILTIN_OBJS += builtin/diff-index.o +BUILTIN_OBJS += builtin/diff-pairs.o BUILTIN_OBJS += builtin/diff-tree.o BUILTIN_OBJS += builtin/diff.o BUILTIN_OBJS += builtin/difftool.o diff --git a/builtin.h b/builtin.h index f7b166b334..b2d2e9eb07 100644 --- a/builtin.h +++ b/builtin.h @@ -152,6 +152,7 @@ int cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff(int argc, const char **argv, const char *prefix, struct repository 
*repo); +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo); diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c new file mode 100644 index 0000000000..08f3ee81e5 --- /dev/null +++ b/builtin/diff-pairs.c @@ -0,0 +1,178 @@ +#include "builtin.h" +#include "commit.h" +#include "config.h" +#include "diff.h" +#include "diffcore.h" +#include "gettext.h" +#include "hex.h" +#include "object.h" +#include "parse-options.h" +#include "revision.h" +#include "strbuf.h" + +static unsigned parse_mode_or_die(const char *mode, const char **endp) +{ + uint16_t ret; + + *endp = parse_mode(mode, &ret); + if (!*endp) + die("unable to parse mode: %s", mode); + return ret; +} + +static void parse_oid(const char *p, struct object_id *oid, const char **endp, + const struct git_hash_algo *algop) +{ + if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') + die("unable to parse object id: %s", p); +} + +static unsigned short parse_score(const char *score) +{ + unsigned long ret; + char *endp; + + errno = 0; + ret = strtoul(score, &endp, 10); + ret *= MAX_SCORE / 100; + if (errno || endp == score || *endp || (unsigned short)ret != ret) + die("unable to parse rename/copy score: %s", score); + return ret; +} + +static void flush_diff_queue(struct diff_options *options) +{ + /* + * If rename detection is not requested, use rename information from the + * raw diff formatted input. Setting found_follow ensures diffcore_std() + * does not mess with rename information already present in queued + * filepairs. 
+ */ + if (!options->detect_rename) + options->found_follow = 1; + diffcore_std(options); + diff_flush(options); +} + +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, + struct repository *repo) +{ + struct strbuf path_dst = STRBUF_INIT; + struct strbuf path = STRBUF_INIT; + struct strbuf meta = STRBUF_INIT; + struct rev_info revs; + int ret; + + const char * const usage[] = { + N_("git diff-pairs [diff-options]"), + NULL + }; + struct option options[] = { + OPT_END() + }; + + show_usage_with_options_if_asked(argc, argv, usage, options); + + repo_init_revisions(repo, &revs, prefix); + repo_config(repo, git_diff_basic_config, NULL); + revs.disable_stdin = 1; + revs.abbrev = 0; + revs.diff = 1; + + argc = setup_revisions(argc, argv, &revs, NULL); + + /* Don't allow pathspecs at all. */ + if (revs.prune_data.nr) + usage_with_options(usage, options); + + if (!revs.diffopt.output_format) + revs.diffopt.output_format = DIFF_FORMAT_RAW; + + while (1) { + struct object_id oid_a, oid_b; + struct diff_filepair *pair; + unsigned mode_a, mode_b; + const char *p; + char status; + + if (strbuf_getline_nul(&meta, stdin) == EOF) + break; + + p = meta.buf; + if (*p != ':') + die("invalid raw diff input"); + p++; + + mode_a = parse_mode_or_die(p, &p); + mode_b = parse_mode_or_die(p, &p); + + parse_oid(p, &oid_a, &p, repo->hash_algo); + parse_oid(p, &oid_b, &p, repo->hash_algo); + + status = *p++; + + if (strbuf_getline_nul(&path, stdin) == EOF) + die("got EOF while reading path"); + + switch (status) { + case DIFF_STATUS_ADDED: + pair = diff_filepair_addremove(&revs.diffopt, '+', + mode_b, &oid_b, + 1, path.buf, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_DELETED: + pair = diff_filepair_addremove(&revs.diffopt, '-', + mode_a, &oid_a, + 1, path.buf, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_TYPE_CHANGED: + case DIFF_STATUS_MODIFIED: + pair = diff_filepair_change(&revs.diffopt, + mode_a, mode_b, + &oid_a, 
&oid_b, 1, 1, + path.buf, 0, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_RENAMED: + case DIFF_STATUS_COPIED: + { + struct diff_filespec *a, *b; + + if (strbuf_getline_nul(&path_dst, stdin) == EOF) + die("got EOF while reading destination path"); + + a = alloc_filespec(path.buf); + b = alloc_filespec(path_dst.buf); + fill_filespec(a, &oid_a, 1, mode_a); + fill_filespec(b, &oid_b, 1, mode_b); + + pair = diff_queue(&diff_queued_diff, a, b); + pair->status = status; + pair->score = parse_score(p); + pair->renamed_pair = 1; + } + break; + + default: + die("unknown diff status: %c", status); + } + } + + flush_diff_queue(&revs.diffopt); + ret = diff_result_code(&revs); + + strbuf_release(&path_dst); + strbuf_release(&path); + strbuf_release(&meta); + release_revisions(&revs); + + return ret; +} diff --git a/command-list.txt b/command-list.txt index e0bb87b3b5..bb8acd51d8 100644 --- a/command-list.txt +++ b/command-list.txt @@ -95,6 +95,7 @@ git-diagnose ancillaryinterrogators git-diff mainporcelain info git-diff-files plumbinginterrogators git-diff-index plumbinginterrogators +git-diff-pairs plumbinginterrogators git-diff-tree plumbinginterrogators git-difftool ancillaryinterrogators complete git-fast-export ancillarymanipulators diff --git a/git.c b/git.c index b23761480f..12bba872bb 100644 --- a/git.c +++ b/git.c @@ -540,6 +540,7 @@ static struct cmd_struct commands[] = { { "diff", cmd_diff, NO_PARSEOPT }, { "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT }, { "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT }, + { "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT }, { "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT }, { "difftool", cmd_difftool, RUN_SETUP_GENTLY }, { "fast-export", cmd_fast_export, RUN_SETUP }, diff --git a/meson.build b/meson.build index fbb8105d96..66ce3326e8 100644 --- a/meson.build +++ b/meson.build @@ -537,6 +537,7 @@ builtin_sources = [ 'builtin/diagnose.c', 'builtin/diff-files.c', 
'builtin/diff-index.c', + 'builtin/diff-pairs.c', 'builtin/diff-tree.c', 'builtin/diff.c', 'builtin/difftool.c', diff --git a/t/meson.build b/t/meson.build index 4574280590..7ff17c6d29 100644 --- a/t/meson.build +++ b/t/meson.build @@ -500,6 +500,7 @@ integration_tests = [ 't4067-diff-partial-clone.sh', 't4068-diff-symmetric-merge-base.sh', 't4069-remerge-diff.sh', + 't4070-diff-pairs.sh', 't4100-apply-stat.sh', 't4101-apply-nonl.sh', 't4102-apply-rename.sh', diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh new file mode 100755 index 0000000000..e0a8e6f0a0 --- /dev/null +++ b/t/t4070-diff-pairs.sh @@ -0,0 +1,80 @@ +#!/bin/sh + +test_description='basic diff-pairs tests' +. ./test-lib.sh + +# This creates a diff with added, modified, deleted, renamed, copied, and +# typechange entries. That includes one in a subdirectory for non-recursive +# tests, and both exact and inexact similarity scores. +test_expect_success 'create commit with various diffs' ' + echo to-be-gone >deleted && + echo original >modified && + echo now-a-file >symlink && + test_seq 200 >two-hundred && + test_seq 201 500 >five-hundred && + git add . && + test_tick && + git commit -m base && + git tag base && + + echo now-here >added && + echo new >modified && + rm deleted && + mkdir subdir && + echo content >subdir/file && + mv two-hundred renamed && + test_seq 201 500 | sed s/300/modified/ >copied && + rm symlink && + git add -A . 
&& + test_ln_s_add dest symlink && + test_tick && + git commit -m new && + git tag new +' + +test_expect_success 'diff-pairs recreates --raw' ' + git diff-tree -r -M -C -C base new >expect && + git diff-tree -r -M -C -C -z base new | + git diff-pairs >actual && + test_cmp expect actual +' + +test_expect_success 'diff-pairs can create -p output' ' + git diff-tree -p -M -C -C base new >expect && + git diff-tree -r -M -C -C -z base new | + git diff-pairs -p >actual && + test_cmp expect actual +' + +test_expect_success 'non-recursive --raw retains tree entry' ' + git diff-tree base new >expect && + git diff-tree -z base new | + git diff-pairs >actual && + test_cmp expect actual +' + +test_expect_success 'split input across multiple diff-pairs' ' + write_script split-raw-diff "$PERL_PATH" <<-\EOF && + $/ = "\0"; + while (<>) { + my $meta = $_; + my $path = <>; + # renames have an extra path + my $path2 = <> if $meta =~ /[RC]\d+/; + + open(my $fh, ">", sprintf "diff%03d", $.); + print $fh $meta, $path, $path2; + } + EOF + + git diff-tree -p -M -C -C base new >expect && + + git diff-tree -r -z -M -C -C base new | + ./split-raw-diff && + for i in diff*; do + git diff-pairs -p <$i || return 1 + done >actual && + test_cmp expect actual +' + +test_done -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
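To make the input handling in `builtin/diff-pairs.c` easier to follow, here is a minimal, self-contained sketch of parsing the metadata part of one raw diff record (`:<mode> <mode> <oid> <oid> <status>[<score>]`). It is not the patch's code: `struct raw_meta`, `parse_raw_meta()` and the two field helpers are hypothetical names, object IDs are kept as hex strings of any length (the real command rejects abbreviated ones), the score is limited to 0..100 percent, and `SKETCH_MAX_SCORE` is an assumed value meant to mirror Git's `MAX_SCORE`.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_SCORE 60000	/* assumption: mirrors Git's MAX_SCORE */

/* Hypothetical container for one parsed raw-diff metadata field set. */
struct raw_meta {
	unsigned mode_a, mode_b;
	char oid_a[65], oid_b[65];	/* hex object IDs, NUL-terminated */
	char status;			/* 'A', 'D', 'M', 'T', 'R', 'C', ... */
	unsigned score;			/* scaled to 0..SKETCH_MAX_SCORE */
};

/* Parse an octal mode followed by a single space. */
static int parse_field_mode(const char **p, unsigned *mode)
{
	char *end;
	unsigned long v;

	if (!isdigit((unsigned char)**p))
		return -1;
	v = strtoul(*p, &end, 8);
	if (end == *p || *end != ' ')
		return -1;
	*mode = (unsigned)v;
	*p = end + 1;
	return 0;
}

/* Copy a run of hex digits followed by a single space. */
static int parse_field_oid(const char **p, char *out, size_t outsz)
{
	size_t n = 0;

	while (isxdigit((unsigned char)(*p)[n]))
		n++;
	if (!n || n >= outsz || (*p)[n] != ' ')
		return -1;
	memcpy(out, *p, n);
	out[n] = '\0';
	*p += n + 1;
	return 0;
}

/* Returns 0 on success, -1 on malformed input. */
static int parse_raw_meta(const char *line, struct raw_meta *m)
{
	const char *p = line;

	if (*p++ != ':')
		return -1;
	if (parse_field_mode(&p, &m->mode_a) ||
	    parse_field_mode(&p, &m->mode_b) ||
	    parse_field_oid(&p, m->oid_a, sizeof(m->oid_a)) ||
	    parse_field_oid(&p, m->oid_b, sizeof(m->oid_b)))
		return -1;
	if (!*p)
		return -1;	/* status letter is mandatory */
	m->status = *p++;
	m->score = 0;
	if (*p) {
		/* Rename/copy similarity, given as a percentage. */
		char *end;
		unsigned long pct = strtoul(p, &end, 10);

		if (end == p || *end || pct > 100)
			return -1;
		m->score = (unsigned)(pct * (SKETCH_MAX_SCORE / 100));
	}
	return 0;
}
```

With `-z` input the path (and, for renames and copies, the destination path) arrives as separate NUL-terminated fields after this metadata, which is why the sketch stops at the status and score.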
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-12 4:18 ` [PATCH v2 2/3] builtin: introduce diff-pairs command Justin Tobler @ 2025-02-12 9:23 ` Patrick Steinhardt 2025-02-12 9:51 ` Karthik Nayak ` (3 subsequent siblings) 4 siblings, 0 replies; 78+ messages in thread From: Patrick Steinhardt @ 2025-02-12 9:23 UTC (permalink / raw) To: Justin Tobler; +Cc: git, peff On Tue, Feb 11, 2025 at 10:18:24PM -0600, Justin Tobler wrote: > Through git-diff(1), a single diff can be generated from a pair of blob > revisions directly. Unfortunately, there is not a mechanism to compute > batches of specific file pair diffs in a single process. Such a feature > is particularly useful on the server-side where diffing between a large > set of changes is not feasible all at once due to timeout concerns. > > To facilitate this, introduce git-diff-pairs(1) which takes the > null-terminated raw diff format as input on stdin and produces diffs in s/null/NUL/ > other formats. As the raw diff format already contains the necessary > metadata, it becomes possible to progressively generate batches of diffs > without having to recompute rename detection or retrieve object context. > Something like the following: > > git diff-tree -r -z -M $old $new | > git diff-pairs -p > > should generate the same output as `git diff-tree -p -M`. Furthermore, > each line of raw diff formatted input can also be individually fed to a > separate git-diff-pairs(1) process and still produce the same output. > > Based-on-patch-by: Jeff King <peff@peff.net> > Signed-off-by: Justin Tobler <jltobler@gmail.com> I really like this new design, thanks for working well together! 
> diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc > new file mode 100644 > index 0000000000..e9ef4a6615 > --- /dev/null > +++ b/Documentation/git-diff-pairs.adoc > @@ -0,0 +1,62 @@ > +git-diff-pairs(1) > +================= > + > +NAME > +---- > +git-diff-pairs - Compare blob pairs generated by `diff-tree --raw` > + > +SYNOPSIS > +-------- > +[verse] > +'git diff-pairs' [diff-options] This should use `[synopsis]`, which allows you to drop the quoting. > +DESCRIPTION > +----------- > + > +Given the output of `diff-tree -z` on its stdin, `diff-pairs` will > +reformat that output into whatever format is requested on its command Reformatting from my point of view implies that we only rearrange bits a bit. But we're not only reformatting the input, but actually compute the diffs. > +line. For example: > + > +----------------------------- > +git diff-tree -z -M $a $b | > +git diff-pairs -p > +----------------------------- > + > +will compute the tree diff in one step (including renames), and then > +`diff-pairs` will compute and format the blob-level diffs for each pair. > +This can be used to modify the raw diff in the middle (without having to > +parse or re-create more complicated formats like `--patch`), or to > +compute diffs progressively over the course of multiple invocations of > +`diff-pairs`. > + > +Each blob pair is fed to the diff machinery individually queued and the output > +is flushed on stdin EOF. I think the "flushing" part is a bit hard to understand without knowing anything about the command internals. As an unknowing reader I would assume you're talking about fflush(3p), but I think you're rather talking about when we "flush" the internal diff queue and thus compute the diffs. So I'd rephrase this and not talk about flushing, but about the behaviour observed by the user instead. 
> +OPTIONS > +------- > + > +include::diff-options.adoc[] > + > +include::diff-generate-patch.adoc[] > + > +NOTES > +---- > + > +`diff-pairs` should handle any input generated by `diff-tree --raw -z`. > +It may choke or otherwise misbehave on output from `diff-files`, etc. This reads a bit weird. The first thing that trips me is the "should". Does it or doesn't it handle the output of git-diff-tree(1)? The second part is that this, at least to me, implies that other formats of course aren't accepted, so why point that out explicitly? > +Here's an incomplete list of things that `diff-pairs` could do, but > +doesn't (mostly in the name of simplicity): > + > + - Only `-z` input is accepted, not normal `--raw` input. > + > + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you s/sha1s/object IDs/ > + want to abbreviate the output, you can pass `--abbrev` to > + `diff-pairs`. > + > + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via > + the initial `diff-tree` invocation. Makes sense. > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > new file mode 100644 > index 0000000000..08f3ee81e5 > --- /dev/null > +++ b/builtin/diff-pairs.c > @@ -0,0 +1,178 @@ > +#include "builtin.h" > +#include "commit.h" > +#include "config.h" > +#include "diff.h" > +#include "diffcore.h" > +#include "gettext.h" > +#include "hex.h" > +#include "object.h" > +#include "parse-options.h" > +#include "revision.h" > +#include "strbuf.h" > + > +static unsigned parse_mode_or_die(const char *mode, const char **endp) > +{ > + uint16_t ret; > + > + *endp = parse_mode(mode, &ret); > + if (!*endp) > + die("unable to parse mode: %s", mode); Missing translation. > + return ret; > +} > + > +static void parse_oid(const char *p, struct object_id *oid, const char **endp, > + const struct git_hash_algo *algop) > +{ > + if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') > + die("unable to parse object id: %s", p); Here, too. 
Do we also name this `parse_oid_or_die()` to stay consistent with `parse_mode_or_die()`? The same is also true for `parse_score()`. > +} > + > +static unsigned short parse_score(const char *score) > +{ > + unsigned long ret; > + char *endp; > + > + errno = 0; > + ret = strtoul(score, &endp, 10); > + ret *= MAX_SCORE / 100; > + if (errno || endp == score || *endp || (unsigned short)ret != ret) > + die("unable to parse rename/copy score: %s", score); > + return ret; > +} You can use `strtoul_ui()` instead, which does most of the error handling for you. > +static void flush_diff_queue(struct diff_options *options) > +{ > + /* > + * If rename detection is not requested, use rename information from the > + * raw diff formatted input. Setting found_follow ensures diffcore_std() > + * does not mess with rename information already present in queued > + * filepairs. > + */ > + if (!options->detect_rename) > + options->found_follow = 1; > + diffcore_std(options); > + diff_flush(options); > +} > + > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > + struct repository *repo) > +{ > + struct strbuf path_dst = STRBUF_INIT; > + struct strbuf path = STRBUF_INIT; > + struct strbuf meta = STRBUF_INIT; > + struct rev_info revs; > + int ret; > + > + const char * const usage[] = { > + N_("git diff-pairs [diff-options]"), > + NULL > + }; > + struct option options[] = { > + OPT_END() > + }; > + > + show_usage_with_options_if_asked(argc, argv, usage, options); > + > + repo_init_revisions(repo, &revs, prefix); > + repo_config(repo, git_diff_basic_config, NULL); > + revs.disable_stdin = 1; > + revs.abbrev = 0; > + revs.diff = 1; > + > + argc = setup_revisions(argc, argv, &revs, NULL); We need to check whether `argc > 0` here. Otherwise, unknown parameters may simply be ignored, I think. > + > + /* Don't allow pathspecs at all. */ > + if (revs.prune_data.nr) > + usage_with_options(usage, options); Should we give a better error in this case? 
> + if (!revs.diffopt.output_format) > + revs.diffopt.output_format = DIFF_FORMAT_RAW; > + > + while (1) { > + struct object_id oid_a, oid_b; > + struct diff_filepair *pair; > + unsigned mode_a, mode_b; > + const char *p; > + char status; > + > + if (strbuf_getline_nul(&meta, stdin) == EOF) > + break; > + > + p = meta.buf; > + if (*p != ':') > + die("invalid raw diff input"); > + p++; > + > + mode_a = parse_mode_or_die(p, &p); > + mode_b = parse_mode_or_die(p, &p); > + > + parse_oid(p, &oid_a, &p, repo->hash_algo); > + parse_oid(p, &oid_b, &p, repo->hash_algo); > + > + status = *p++; > + > + if (strbuf_getline_nul(&path, stdin) == EOF) > + die("got EOF while reading path"); Missing translation. > + switch (status) { > + case DIFF_STATUS_ADDED: > + pair = diff_filepair_addremove(&revs.diffopt, '+', > + mode_b, &oid_b, > + 1, path.buf, 0); > + if (pair) > + pair->status = status; > + break; > + > + case DIFF_STATUS_DELETED: > + pair = diff_filepair_addremove(&revs.diffopt, '-', > + mode_a, &oid_a, > + 1, path.buf, 0); > + if (pair) > + pair->status = status; > + break; > + > + case DIFF_STATUS_TYPE_CHANGED: > + case DIFF_STATUS_MODIFIED: > + pair = diff_filepair_change(&revs.diffopt, > + mode_a, mode_b, > + &oid_a, &oid_b, 1, 1, > + path.buf, 0, 0); > + if (pair) > + pair->status = status; > + break; > + > + case DIFF_STATUS_RENAMED: > + case DIFF_STATUS_COPIED: > + { > + struct diff_filespec *a, *b; > + > + if (strbuf_getline_nul(&path_dst, stdin) == EOF) > + die("got EOF while reading destination path"); Missing translation. > + a = alloc_filespec(path.buf); > + b = alloc_filespec(path_dst.buf); > + fill_filespec(a, &oid_a, 1, mode_a); > + fill_filespec(b, &oid_b, 1, mode_b); > + > + pair = diff_queue(&diff_queued_diff, a, b); > + pair->status = status; > + pair->score = parse_score(p); > + pair->renamed_pair = 1; > + } > + break; > + > + default: > + die("unknown diff status: %c", status); Missing translation. 
> diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh > new file mode 100755 > index 0000000000..e0a8e6f0a0 > --- /dev/null > +++ b/t/t4070-diff-pairs.sh > @@ -0,0 +1,80 @@ [snip] > +test_expect_success 'split input across multiple diff-pairs' ' > + write_script split-raw-diff "$PERL_PATH" <<-\EOF && > + $/ = "\0"; > + while (<>) { > + my $meta = $_; > + my $path = <>; > + # renames have an extra path > + my $path2 = <> if $meta =~ /[RC]\d+/; > + > + open(my $fh, ">", sprintf "diff%03d", $.); > + print $fh $meta, $path, $path2; > + } > + EOF > + > + git diff-tree -p -M -C -C base new >expect && > + > + git diff-tree -r -z -M -C -C base new | > + ./split-raw-diff && > + for i in diff*; do > + git diff-pairs -p <$i || return 1 Formatting: for i in diff* do git diff-pairs -p <$i || return 1 done >actual > + done >actual && > + test_cmp expect actual > +' Patrick ^ permalink raw reply [flat|nested] 78+ messages in thread
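On the `parse_score()` comment in the review above: a standalone sketch of what stricter score parsing might look like. It does not use Git's helpers. `sketch_parse_score()` and `SKETCH_MAX_SCORE` are hypothetical names, and the value 60000 is an assumption about the scale behind `MAX_SCORE`. Unlike the quoted code, it rejects percentages above 100 outright instead of relying on the `unsigned short` truncation check.

```c
#include <errno.h>
#include <stdlib.h>

/* Assumption: 100% similarity maps to 60000, mirroring git's MAX_SCORE. */
#define SKETCH_MAX_SCORE 60000

/*
 * Parse a similarity percentage such as "087" or "100" and scale it.
 * Returns 0 and stores the scaled value in *out on success,
 * or -1 on malformed or out-of-range input.
 */
static int sketch_parse_score(const char *s, unsigned short *out)
{
	char *end;
	unsigned long pct;

	/* strtoul() would accept a leading sign or whitespace; refuse both. */
	if (*s < '0' || *s > '9')
		return -1;

	errno = 0;
	pct = strtoul(s, &end, 10);
	if (errno || *end || pct > 100)
		return -1;

	*out = (unsigned short)(pct * (SKETCH_MAX_SCORE / 100));
	return 0;
}
```

Returning an error code rather than calling die() keeps the sketch testable in isolation; a caller inside a builtin would presumably wrap it with a translated die() message.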
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-12 4:18 ` [PATCH v2 2/3] builtin: introduce diff-pairs command Justin Tobler 2025-02-12 9:23 ` Patrick Steinhardt @ 2025-02-12 9:51 ` Karthik Nayak 2025-02-25 23:38 ` Justin Tobler 2025-02-12 11:40 ` Jean-Noël Avila ` (2 subsequent siblings) 4 siblings, 1 reply; 78+ messages in thread From: Karthik Nayak @ 2025-02-12 9:51 UTC (permalink / raw) To: Justin Tobler, git; +Cc: peff [-- Attachment #1: Type: text/plain, Size: 15918 bytes --] Justin Tobler <jltobler@gmail.com> writes: > Through git-diff(1), a single diff can be generated from a pair of blob > revisions directly. Unfortunately, there is not a mechanism to compute > batches of specific file pair diffs in a single process. Such a feature > is particularly useful on the server-side where diffing between a large > set of changes is not feasible all at once due to timeout concerns. > > To facilitate this, introduce git-diff-pairs(1) which takes the > null-terminated raw diff format as input on stdin and produces diffs in > other formats. As the raw diff format already contains the necessary > metadata, it becomes possible to progressively generate batches of diffs > without having to recompute rename detection or retrieve object context. > Something like the following: > > git diff-tree -r -z -M $old $new | > git diff-pairs -p > > should generate the same output as `git diff-tree -p -M`. Furthermore, > each line of raw diff formatted input can also be individually fed to a > separate git-diff-pairs(1) process and still produce the same output. 
> > Based-on-patch-by: Jeff King <peff@peff.net> > Signed-off-by: Justin Tobler <jltobler@gmail.com> > --- > .gitignore | 1 + > Documentation/git-diff-pairs.adoc | 62 +++++++++++ > Documentation/meson.build | 1 + > Makefile | 1 + > builtin.h | 1 + > builtin/diff-pairs.c | 178 ++++++++++++++++++++++++++++++ > command-list.txt | 1 + > git.c | 1 + > meson.build | 1 + > t/meson.build | 1 + > t/t4070-diff-pairs.sh | 80 ++++++++++++++ > 11 files changed, 328 insertions(+) > create mode 100644 Documentation/git-diff-pairs.adoc > create mode 100644 builtin/diff-pairs.c > create mode 100755 t/t4070-diff-pairs.sh > > diff --git a/.gitignore b/.gitignore > index e82aa19df0..03448c076a 100644 > --- a/.gitignore > +++ b/.gitignore > @@ -54,6 +54,7 @@ > /git-diff > /git-diff-files > /git-diff-index > +/git-diff-pairs > /git-diff-tree > /git-difftool > /git-difftool--helper > diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc > new file mode 100644 > index 0000000000..e9ef4a6615 > --- /dev/null > +++ b/Documentation/git-diff-pairs.adoc > @@ -0,0 +1,62 @@ > +git-diff-pairs(1) > +================= > + > +NAME > +---- > +git-diff-pairs - Compare blob pairs generated by `diff-tree --raw` > + > +SYNOPSIS > +-------- > +[verse] > +'git diff-pairs' [diff-options] > + > +DESCRIPTION > +----------- > + > +Given the output of `diff-tree -z` on its stdin, `diff-pairs` will > +reformat that output into whatever format is requested on its command > +line. For example: > + > +----------------------------- > +git diff-tree -z -M $a $b | > +git diff-pairs -p > +----------------------------- > + > +will compute the tree diff in one step (including renames), and then > +`diff-pairs` will compute and format the blob-level diffs for each pair. 
> +This can be used to modify the raw diff in the middle (without having to > +parse or re-create more complicated formats like `--patch`), or to > +compute diffs progressively over the course of multiple invocations of > +`diff-pairs`. > + > +Each blob pair is fed to the diff machinery individually queued and the output > +is flushed on stdin EOF. I found this hard to understand. After reading below, perhaps it would be easier to understand something simpler which doesn't mention the internal queuing mechanism and only talks about how the output is only streamed once we read EOF on stdin. > + > +OPTIONS > +------- > + > +include::diff-options.adoc[] > + > +include::diff-generate-patch.adoc[] > + > +NOTES > +---- > + > +`diff-pairs` should handle any input generated by `diff-tree --raw -z`. > +It may choke or otherwise misbehave on output from `diff-files`, etc. > + > +Here's an incomplete list of things that `diff-pairs` could do, but > +doesn't (mostly in the name of simplicity): > + > + - Only `-z` input is accepted, not normal `--raw` input. > + > + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you > + want to abbreviate the output, you can pass `--abbrev` to > + `diff-pairs`. > + > + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via > + the initial `diff-tree` invocation.
> + > +GIT > +--- > +Part of the linkgit:git[1] suite > diff --git a/Documentation/meson.build b/Documentation/meson.build > index ead8e48213..e5ee177022 100644 > --- a/Documentation/meson.build > +++ b/Documentation/meson.build > @@ -41,6 +41,7 @@ manpages = { > 'git-diagnose.adoc' : 1, > 'git-diff-files.adoc' : 1, > 'git-diff-index.adoc' : 1, > + 'git-diff-pairs.adoc' : 1, > 'git-difftool.adoc' : 1, > 'git-diff-tree.adoc' : 1, > 'git-diff.adoc' : 1, > diff --git a/Makefile b/Makefile > index 896d02339e..3b8e1ad15e 100644 > --- a/Makefile > +++ b/Makefile > @@ -1232,6 +1232,7 @@ BUILTIN_OBJS += builtin/describe.o > BUILTIN_OBJS += builtin/diagnose.o > BUILTIN_OBJS += builtin/diff-files.o > BUILTIN_OBJS += builtin/diff-index.o > +BUILTIN_OBJS += builtin/diff-pairs.o > BUILTIN_OBJS += builtin/diff-tree.o > BUILTIN_OBJS += builtin/diff.o > BUILTIN_OBJS += builtin/difftool.o > diff --git a/builtin.h b/builtin.h > index f7b166b334..b2d2e9eb07 100644 > --- a/builtin.h > +++ b/builtin.h > @@ -152,6 +152,7 @@ int cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit > int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo); > int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo); > int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo); > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo); > int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo); > int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo); > int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo); > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > new file mode 100644 > index 0000000000..08f3ee81e5 > --- /dev/null > +++ b/builtin/diff-pairs.c > @@ -0,0 +1,178 @@ > +#include "builtin.h" > +#include "commit.h" > 
+#include "config.h" > +#include "diff.h" > +#include "diffcore.h" > +#include "gettext.h" > +#include "hex.h" > +#include "object.h" > +#include "parse-options.h" > +#include "revision.h" > +#include "strbuf.h" > + > +static unsigned parse_mode_or_die(const char *mode, const char **endp) > +{ > + uint16_t ret; > + > + *endp = parse_mode(mode, &ret); > + if (!*endp) > + die("unable to parse mode: %s", mode); > + return ret; > +} > + > +static void parse_oid(const char *p, struct object_id *oid, const char **endp, > + const struct git_hash_algo *algop) Nit: similar to the function above, should this be called `parse_oid_or_die`? > +{ > + if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') > + die("unable to parse object id: %s", p); > +} > + > +static unsigned short parse_score(const char *score) > +{ > + unsigned long ret; > + char *endp; > + > + errno = 0; > + ret = strtoul(score, &endp, 10); > + ret *= MAX_SCORE / 100; > + if (errno || endp == score || *endp || (unsigned short)ret != ret) > + die("unable to parse rename/copy score: %s", score); > + return ret; > +} > + > +static void flush_diff_queue(struct diff_options *options) > +{ > + /* > + * If rename detection is not requested, use rename information from the > + * raw diff formatted input. Setting found_follow ensures diffcore_std() > + * does not mess with rename information already present in queued > + * filepairs. 
> + */ > + if (!options->detect_rename) > + options->found_follow = 1; > + diffcore_std(options); > + diff_flush(options); > +} > + > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > + struct repository *repo) > +{ > + struct strbuf path_dst = STRBUF_INIT; > + struct strbuf path = STRBUF_INIT; > + struct strbuf meta = STRBUF_INIT; > + struct rev_info revs; > + int ret; > + > + const char * const usage[] = { > + N_("git diff-pairs [diff-options]"), > + NULL > + }; > + struct option options[] = { > + OPT_END() > + }; > + > + show_usage_with_options_if_asked(argc, argv, usage, options); > + > + repo_init_revisions(repo, &revs, prefix); > + repo_config(repo, git_diff_basic_config, NULL); > + revs.disable_stdin = 1; > + revs.abbrev = 0; > + revs.diff = 1; > + > + argc = setup_revisions(argc, argv, &revs, NULL); > + > + /* Don't allow pathspecs at all. */ > + if (revs.prune_data.nr) > + usage_with_options(usage, options); > + > + if (!revs.diffopt.output_format) > + revs.diffopt.output_format = DIFF_FORMAT_RAW; > + > + while (1) { > + struct object_id oid_a, oid_b; > + struct diff_filepair *pair; > + unsigned mode_a, mode_b; > + const char *p; > + char status; > + > + if (strbuf_getline_nul(&meta, stdin) == EOF) > + break; > + > + p = meta.buf; > + if (*p != ':') > + die("invalid raw diff input"); > + p++; > + > + mode_a = parse_mode_or_die(p, &p); > + mode_b = parse_mode_or_die(p, &p); > + > + parse_oid(p, &oid_a, &p, repo->hash_algo); > + parse_oid(p, &oid_b, &p, repo->hash_algo); > + > + status = *p++; > + > + if (strbuf_getline_nul(&path, stdin) == EOF) > + die("got EOF while reading path"); > + > + switch (status) { > + case DIFF_STATUS_ADDED: > + pair = diff_filepair_addremove(&revs.diffopt, '+', > + mode_b, &oid_b, > + 1, path.buf, 0); > + if (pair) > + pair->status = status; > + break; > + > + case DIFF_STATUS_DELETED: > + pair = diff_filepair_addremove(&revs.diffopt, '-', > + mode_a, &oid_a, > + 1, path.buf, 0); > + if (pair) > +
pair->status = status; > + break; > + > + case DIFF_STATUS_TYPE_CHANGED: > + case DIFF_STATUS_MODIFIED: > + pair = diff_filepair_change(&revs.diffopt, > + mode_a, mode_b, > + &oid_a, &oid_b, 1, 1, > + path.buf, 0, 0); > + if (pair) > + pair->status = status; > + break; > + > + case DIFF_STATUS_RENAMED: > + case DIFF_STATUS_COPIED: > + { > + struct diff_filespec *a, *b; > + > + if (strbuf_getline_nul(&path_dst, stdin) == EOF) > + die("got EOF while reading destination path"); > + > + a = alloc_filespec(path.buf); > + b = alloc_filespec(path_dst.buf); > + fill_filespec(a, &oid_a, 1, mode_a); > + fill_filespec(b, &oid_b, 1, mode_b); > + > + pair = diff_queue(&diff_queued_diff, a, b); > + pair->status = status; > + pair->score = parse_score(p); > + pair->renamed_pair = 1; > + } > + break; > + > + default: The only state I think is missing is `DIFF_STATUS_UNMERGED` (from 'diff.h'). Is that a state we need to handle? > + die("unknown diff status: %c", status); > + } > + } > + > + flush_diff_queue(&revs.diffopt); Now I understand what you meant by queuing the diffs. 
> + ret = diff_result_code(&revs); > + > + strbuf_release(&path_dst); > + strbuf_release(&path); > + strbuf_release(&meta); > + release_revisions(&revs); > + > + return ret; > +} > diff --git a/command-list.txt b/command-list.txt > index e0bb87b3b5..bb8acd51d8 100644 > --- a/command-list.txt > +++ b/command-list.txt > @@ -95,6 +95,7 @@ git-diagnose ancillaryinterrogators > git-diff mainporcelain info > git-diff-files plumbinginterrogators > git-diff-index plumbinginterrogators > +git-diff-pairs plumbinginterrogators > git-diff-tree plumbinginterrogators > git-difftool ancillaryinterrogators complete > git-fast-export ancillarymanipulators > diff --git a/git.c b/git.c > index b23761480f..12bba872bb 100644 > --- a/git.c > +++ b/git.c > @@ -540,6 +540,7 @@ static struct cmd_struct commands[] = { > { "diff", cmd_diff, NO_PARSEOPT }, > { "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT }, > { "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT }, > + { "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT }, > { "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT }, > { "difftool", cmd_difftool, RUN_SETUP_GENTLY }, > { "fast-export", cmd_fast_export, RUN_SETUP }, > diff --git a/meson.build b/meson.build > index fbb8105d96..66ce3326e8 100644 > --- a/meson.build > +++ b/meson.build > @@ -537,6 +537,7 @@ builtin_sources = [ > 'builtin/diagnose.c', > 'builtin/diff-files.c', > 'builtin/diff-index.c', > + 'builtin/diff-pairs.c', > 'builtin/diff-tree.c', > 'builtin/diff.c', > 'builtin/difftool.c', > diff --git a/t/meson.build b/t/meson.build > index 4574280590..7ff17c6d29 100644 > --- a/t/meson.build > +++ b/t/meson.build > @@ -500,6 +500,7 @@ integration_tests = [ > 't4067-diff-partial-clone.sh', > 't4068-diff-symmetric-merge-base.sh', > 't4069-remerge-diff.sh', > + 't4070-diff-pairs.sh', > 't4100-apply-stat.sh', > 't4101-apply-nonl.sh', > 't4102-apply-rename.sh', > diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh > new file mode 100755 > 
index 0000000000..e0a8e6f0a0 > --- /dev/null > +++ b/t/t4070-diff-pairs.sh > @@ -0,0 +1,80 @@ > +#!/bin/sh > + > +test_description='basic diff-pairs tests' > +. ./test-lib.sh > + > +# This creates a diff with added, modified, deleted, renamed, copied, and > +# typechange entries. That includes one in a subdirectory for non-recursive > +# tests, and both exact and inexact similarity scores. > +test_expect_success 'create commit with various diffs' ' Generally, tests for setup are named 'setup' so we can do something like: sh ./t0050-filesystem.sh --run=setup,9-11 Can we rename this to 'setup'? > + echo to-be-gone >deleted && > + echo original >modified && > + echo now-a-file >symlink && > + test_seq 200 >two-hundred && > + test_seq 201 500 >five-hundred && > + git add . && > + test_tick && > + git commit -m base && > + git tag base && > + > + echo now-here >added && > + echo new >modified && > + rm deleted && > + mkdir subdir && > + echo content >subdir/file && > + mv two-hundred renamed && > + test_seq 201 500 | sed s/300/modified/ >copied && > + rm symlink && > + git add -A .
&& > + test_ln_s_add dest symlink && > + test_tick && > + git commit -m new && > + git tag new > +' > + > +test_expect_success 'diff-pairs recreates --raw' ' > + git diff-tree -r -M -C -C base new >expect && > + git diff-tree -r -M -C -C -z base new | > + git diff-pairs >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'diff-pairs can create -p output' ' > + git diff-tree -p -M -C -C base new >expect && > + git diff-tree -r -M -C -C -z base new | > + git diff-pairs -p >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'non-recursive --raw retains tree entry' ' > + git diff-tree base new >expect && > + git diff-tree -z base new | > + git diff-pairs >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'split input across multiple diff-pairs' ' > + write_script split-raw-diff "$PERL_PATH" <<-\EOF && > + $/ = "\0"; > + while (<>) { > + my $meta = $_; > + my $path = <>; > + # renames have an extra path > + my $path2 = <> if $meta =~ /[RC]\d+/; > + > + open(my $fh, ">", sprintf "diff%03d", $.); > + print $fh $meta, $path, $path2; > + } > + EOF > + > + git diff-tree -p -M -C -C base new >expect && > + > + git diff-tree -r -z -M -C -C base new | > + ./split-raw-diff && > + for i in diff*; do > + git diff-pairs -p <$i || return 1 > + done >actual && > + test_cmp expect actual > +' > + > +test_done > -- > 2.48.1 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 690 bytes --] ^ permalink raw reply [flat|nested] 78+ messages in thread
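To make the meta-field parsing discussed in this review easier to follow, here is a standalone sketch of how one `:<mode> <mode> <oid> <oid> <status>[<score>]` field could be taken apart. It does not call Git's `parse_mode()` or `parse_oid_hex_algop()`. All `sketch_*` names are hypothetical, and the hex length is passed in by the caller as an assumption about the repository's hash algorithm. Abbreviated object IDs are rejected, matching the limitation listed in the NOTES section.

```c
#include <ctype.h>
#include <stddef.h>

struct sketch_raw_meta {
	unsigned mode_a, mode_b;
	char oid_a[65], oid_b[65];	/* room for 64 hex digits plus NUL */
	char status;
	const char *score;		/* points into the input; "" if absent */
};

/* Parse an octal mode terminated by a single space. */
static const char *sketch_parse_mode(const char *p, unsigned *mode)
{
	const char *start = p;
	unsigned m = 0;

	while (*p >= '0' && *p <= '7')
		m = (m << 3) | (unsigned)(*p++ - '0');
	if (p == start || *p != ' ')
		return NULL;
	*mode = m;
	return p + 1;
}

/* Copy a full-length hex object ID terminated by a single space. */
static const char *sketch_parse_oid(const char *p, char *out, size_t hexlen)
{
	size_t i;

	for (i = 0; i < hexlen; i++) {
		if (!isxdigit((unsigned char)p[i]))
			return NULL;	/* also catches abbreviated IDs */
		out[i] = p[i];
	}
	out[hexlen] = '\0';
	return p[hexlen] == ' ' ? p + hexlen + 1 : NULL;
}

/* Returns 0 on success, -1 on malformed input. */
static int sketch_parse_meta(const char *line, size_t hexlen,
			     struct sketch_raw_meta *m)
{
	const char *p = line;

	if (hexlen > 64 || *p++ != ':')
		return -1;
	if (!(p = sketch_parse_mode(p, &m->mode_a)) ||
	    !(p = sketch_parse_mode(p, &m->mode_b)) ||
	    !(p = sketch_parse_oid(p, m->oid_a, hexlen)) ||
	    !(p = sketch_parse_oid(p, m->oid_b, hexlen)) ||
	    !*p)
		return -1;
	m->status = *p++;
	m->score = p;
	return 0;
}
```

The status letter is left uninterpreted here; deciding which letters are valid (including whether an unmerged entry should be accepted) stays with the caller, as in the switch statement of the patch.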
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-12 9:51 ` Karthik Nayak @ 2025-02-25 23:38 ` Justin Tobler 0 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-25 23:38 UTC (permalink / raw) To: Karthik Nayak; +Cc: git, peff On 25/02/12 01:51AM, Karthik Nayak wrote: > Justin Tobler <jltobler@gmail.com> writes: > > > Through git-diff(1), a single diff can be generated from a pair of blob > > revisions directly. Unfortunately, there is not a mechanism to compute > > batches of specific file pair diffs in a single process. Such a feature > > is particularly useful on the server-side where diffing between a large > > set of changes is not feasible all at once due to timeout concerns. > > > > To facilitate this, introduce git-diff-pairs(1) which takes the > > null-terminated raw diff format as input on stdin and produces diffs in > > other formats. As the raw diff format already contains the necessary > > metadata, it becomes possible to progressively generate batches of diffs > > without having to recompute rename detection or retrieve object context. > > Something like the following: > > > > git diff-tree -r -z -M $old $new | > > git diff-pairs -p > > > > should generate the same output as `git diff-tree -p -M`. Furthermore, > > each line of raw diff formatted input can also be individually fed to a > > separate git-diff-pairs(1) process and still produce the same output. 
> > > > Based-on-patch-by: Jeff King <peff@peff.net> > > Signed-off-by: Justin Tobler <jltobler@gmail.com> > > --- > > .gitignore | 1 + > > Documentation/git-diff-pairs.adoc | 62 +++++++++++ > > Documentation/meson.build | 1 + > > Makefile | 1 + > > builtin.h | 1 + > > builtin/diff-pairs.c | 178 ++++++++++++++++++++++++++++++ > > command-list.txt | 1 + > > git.c | 1 + > > meson.build | 1 + > > t/meson.build | 1 + > > t/t4070-diff-pairs.sh | 80 ++++++++++++++ > > 11 files changed, 328 insertions(+) > > create mode 100644 Documentation/git-diff-pairs.adoc > > create mode 100644 builtin/diff-pairs.c > > create mode 100755 t/t4070-diff-pairs.sh > > > > diff --git a/.gitignore b/.gitignore > > index e82aa19df0..03448c076a 100644 > > --- a/.gitignore > > +++ b/.gitignore > > @@ -54,6 +54,7 @@ > > /git-diff > > /git-diff-files > > /git-diff-index > > +/git-diff-pairs > > /git-diff-tree > > /git-difftool > > /git-difftool--helper > > diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc > > new file mode 100644 > > index 0000000000..e9ef4a6615 > > --- /dev/null > > +++ b/Documentation/git-diff-pairs.adoc > > @@ -0,0 +1,62 @@ > > +git-diff-pairs(1) > > +================= > > + > > +NAME > > +---- > > +git-diff-pairs - Compare blob pairs generated by `diff-tree --raw` > > + > > +SYNOPSIS > > +-------- > > +[verse] > > +'git diff-pairs' [diff-options] > > + > > +DESCRIPTION > > +----------- > > + > > +Given the output of `diff-tree -z` on its stdin, `diff-pairs` will > > +reformat that output into whatever format is requested on its command > > +line. For example: > > + > > +----------------------------- > > +git diff-tree -z -M $a $b | > > +git diff-pairs -p > > +----------------------------- > > + > > +will compute the tree diff in one step (including renames), and then > > +`diff-pairs` will compute and format the blob-level diffs for each pair. 
> > +This can be used to modify the raw diff in the middle (without having to > > +parse or re-create more complicated formats like `--patch`), or to > > +compute diffs progressively over the course of multiple invocations of > > +`diff-pairs`. > > + > > +Each blob pair is fed to the diff machinery individually queued and the output > > +is flushed on stdin EOF. > > I found this hard to understand. > > After reading below, perhaps it would be easier to understand something > simpler which doesn't mention the internal queuing mechanism and only > talks about how the output is only streamed once we read EOF on stdin. I've reworked the documentation in the next version to avoid discussing internal details and stick to discussing user-facing behavior. > > + > > +OPTIONS > > +------- > > + > > +include::diff-options.adoc[] > > + > > +include::diff-generate-patch.adoc[] > > + > > +NOTES > > +---- > > + > > +`diff-pairs` should handle any input generated by `diff-tree --raw -z`. > > +It may choke or otherwise misbehave on output from `diff-files`, etc. > > + > > +Here's an incomplete list of things that `diff-pairs` could do, but > > +doesn't (mostly in the name of simplicity): > > + > > + - Only `-z` input is accepted, not normal `--raw` input. > > + > > + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you > > + want to abbreviate the output, you can pass `--abbrev` to > > + `diff-pairs`. > > + > > + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via > > + the initial `diff-tree` invocation.
> > + > > +GIT > > +--- > > +Part of the linkgit:git[1] suite > > diff --git a/Documentation/meson.build b/Documentation/meson.build > > index ead8e48213..e5ee177022 100644 > > --- a/Documentation/meson.build > > +++ b/Documentation/meson.build > > @@ -41,6 +41,7 @@ manpages = { > > 'git-diagnose.adoc' : 1, > > 'git-diff-files.adoc' : 1, > > 'git-diff-index.adoc' : 1, > > + 'git-diff-pairs.adoc' : 1, > > 'git-difftool.adoc' : 1, > > 'git-diff-tree.adoc' : 1, > > 'git-diff.adoc' : 1, > > diff --git a/Makefile b/Makefile > > index 896d02339e..3b8e1ad15e 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -1232,6 +1232,7 @@ BUILTIN_OBJS += builtin/describe.o > > BUILTIN_OBJS += builtin/diagnose.o > > BUILTIN_OBJS += builtin/diff-files.o > > BUILTIN_OBJS += builtin/diff-index.o > > +BUILTIN_OBJS += builtin/diff-pairs.o > > BUILTIN_OBJS += builtin/diff-tree.o > > BUILTIN_OBJS += builtin/diff.o > > BUILTIN_OBJS += builtin/difftool.o > > diff --git a/builtin.h b/builtin.h > > index f7b166b334..b2d2e9eb07 100644 > > --- a/builtin.h > > +++ b/builtin.h > > @@ -152,6 +152,7 @@ int cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit > > int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo); > > int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo); > > int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo); > > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo); > > int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo); > > int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo); > > int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo); > > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > > new file mode 100644 > > index 0000000000..08f3ee81e5 > > --- /dev/null > > +++ 
b/builtin/diff-pairs.c > > @@ -0,0 +1,178 @@ > > +#include "builtin.h" > > +#include "commit.h" > > +#include "config.h" > > +#include "diff.h" > > +#include "diffcore.h" > > +#include "gettext.h" > > +#include "hex.h" > > +#include "object.h" > > +#include "parse-options.h" > > +#include "revision.h" > > +#include "strbuf.h" > > + > > +static unsigned parse_mode_or_die(const char *mode, const char **endp) > > +{ > > + uint16_t ret; > > + > > + *endp = parse_mode(mode, &ret); > > + if (!*endp) > > + die("unable to parse mode: %s", mode); > > + return ret; > > +} > > + > > +static void parse_oid(const char *p, struct object_id *oid, const char **endp, > > + const struct git_hash_algo *algop) > > Nit: similar to the function above, should this be called > `parse_oid_or_die`? Being consistent here is probably preferable, I've updated per your suggestion in the next version. > > +{ > > + if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') > > + die("unable to parse object id: %s", p); > > +} > > + > > +static unsigned short parse_score(const char *score) > > +{ > > + unsigned long ret; > > + char *endp; > > + > > + errno = 0; > > + ret = strtoul(score, &endp, 10); > > + ret *= MAX_SCORE / 100; > > + if (errno || endp == score || *endp || (unsigned short)ret != ret) > > + die("unable to parse rename/copy score: %s", score); > > + return ret; > > +} > > + > > +static void flush_diff_queue(struct diff_options *options) > > +{ > > + /* > > + * If rename detection is not requested, use rename information from the > > + * raw diff formatted input. Setting found_follow ensures diffcore_std() > > + * does not mess with rename information already present in queued > > + * filepairs. 
> > + */ > > + if (!options->detect_rename) > > + options->found_follow = 1; > > + diffcore_std(options); > > + diff_flush(options); > > +} > > + > > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > > + struct repository *repo) > > +{ > > + struct strbuf path_dst = STRBUF_INIT; > > + struct strbuf path = STRBUF_INIT; > > + struct strbuf meta = STRBUF_INIT; > > + struct rev_info revs; > > + int ret; > > + > > + const char * const usage[] = { > > + N_("git diff-pairs [diff-options]"), > > + NULL > > + }; > > + struct option options[] = { > > + OPT_END() > > + }; > > + > > + show_usage_with_options_if_asked(argc, argv, usage, options); > > + > > + repo_init_revisions(repo, &revs, prefix); > > + repo_config(repo, git_diff_basic_config, NULL); > > + revs.disable_stdin = 1; > > + revs.abbrev = 0; > > + revs.diff = 1; > > + > > + argc = setup_revisions(argc, argv, &revs, NULL); > > + > > + /* Don't allow pathspecs at all. */ > > + if (revs.prune_data.nr) > > + usage_with_options(usage, options); > > + > > + if (!revs.diffopt.output_format) > > + revs.diffopt.output_format = DIFF_FORMAT_RAW; > > + > > + while (1) { > > + struct object_id oid_a, oid_b; > > + struct diff_filepair *pair; > > + unsigned mode_a, mode_b; > > + const char *p; > > + char status; > > + > > + if (strbuf_getline_nul(&meta, stdin) == EOF) > > + break; > > + > > + p = meta.buf; > > + if (*p != ':') > > + die("invalid raw diff input"); > > + p++; > > + > > + mode_a = parse_mode_or_die(p, &p); > > + mode_b = parse_mode_or_die(p, &p); > > + > > + parse_oid(p, &oid_a, &p, repo->hash_algo); > > + parse_oid(p, &oid_b, &p, repo->hash_algo); > > + > > + status = *p++; > > + > > + if (strbuf_getline_nul(&path, stdin) == EOF) > > + die("got EOF while reading path"); > > + > > + switch (status) { > > + case DIFF_STATUS_ADDED: > > + pair = diff_filepair_addremove(&revs.diffopt, '+', > > + mode_b, &oid_b, > > + 1, path.buf, 0); > > + if (pair) > > + pair->status = status; > > + break; > > + > >
+ case DIFF_STATUS_DELETED: > > + pair = diff_filepair_addremove(&revs.diffopt, '-', > > + mode_a, &oid_a, > > + 1, path.buf, 0); > > + if (pair) > > + pair->status = status; > > + break; > > + > > + case DIFF_STATUS_TYPE_CHANGED: > > + case DIFF_STATUS_MODIFIED: > > + pair = diff_filepair_change(&revs.diffopt, > > + mode_a, mode_b, > > + &oid_a, &oid_b, 1, 1, > > + path.buf, 0, 0); > > + if (pair) > > + pair->status = status; > > + break; > > + > > + case DIFF_STATUS_RENAMED: > > + case DIFF_STATUS_COPIED: > > + { > > + struct diff_filespec *a, *b; > > + > > + if (strbuf_getline_nul(&path_dst, stdin) == EOF) > > + die("got EOF while reading destination path"); > > + > > + a = alloc_filespec(path.buf); > > + b = alloc_filespec(path_dst.buf); > > + fill_filespec(a, &oid_a, 1, mode_a); > > + fill_filespec(b, &oid_b, 1, mode_b); > > + > > + pair = diff_queue(&diff_queued_diff, a, b); > > + pair->status = status; > > + pair->score = parse_score(p); > > + pair->renamed_pair = 1; > > + } > > + break; > > + > > + default: > > The only state I think is missing is `DIFF_STATUS_UNMERGED` (from > 'diff.h'). Is that a state we need to handle? The `DIFF_STATUS_UNMERGED` status is present when there are unmerged conflicted files in the working tree that have not been added to the index. I think this is a scenario where git-diff-pairs(1) would not make much sense, especially if there is no working tree present for a repository. It should be fine to leave this status unhandled for now as it just falls back to the default case and dies. > > + die("unknown diff status: %c", status); > > + } > > + } > > + > > + flush_diff_queue(&revs.diffopt); > > Now I understand what you meant by queuing the diffs.
> > > + ret = diff_result_code(&revs); > > + > > + strbuf_release(&path_dst); > > + strbuf_release(&path); > > + strbuf_release(&meta); > > + release_revisions(&revs); > > + > > + return ret; > > +} > > diff --git a/command-list.txt b/command-list.txt > > index e0bb87b3b5..bb8acd51d8 100644 > > --- a/command-list.txt > > +++ b/command-list.txt > > @@ -95,6 +95,7 @@ git-diagnose ancillaryinterrogators > > git-diff mainporcelain info > > git-diff-files plumbinginterrogators > > git-diff-index plumbinginterrogators > > +git-diff-pairs plumbinginterrogators > > git-diff-tree plumbinginterrogators > > git-difftool ancillaryinterrogators complete > > git-fast-export ancillarymanipulators > > diff --git a/git.c b/git.c > > index b23761480f..12bba872bb 100644 > > --- a/git.c > > +++ b/git.c > > @@ -540,6 +540,7 @@ static struct cmd_struct commands[] = { > > { "diff", cmd_diff, NO_PARSEOPT }, > > { "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT }, > > { "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT }, > > + { "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT }, > > { "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT }, > > { "difftool", cmd_difftool, RUN_SETUP_GENTLY }, > > { "fast-export", cmd_fast_export, RUN_SETUP }, > > diff --git a/meson.build b/meson.build > > index fbb8105d96..66ce3326e8 100644 > > --- a/meson.build > > +++ b/meson.build > > @@ -537,6 +537,7 @@ builtin_sources = [ > > 'builtin/diagnose.c', > > 'builtin/diff-files.c', > > 'builtin/diff-index.c', > > + 'builtin/diff-pairs.c', > > 'builtin/diff-tree.c', > > 'builtin/diff.c', > > 'builtin/difftool.c', > > diff --git a/t/meson.build b/t/meson.build > > index 4574280590..7ff17c6d29 100644 > > --- a/t/meson.build > > +++ b/t/meson.build > > @@ -500,6 +500,7 @@ integration_tests = [ > > 't4067-diff-partial-clone.sh', > > 't4068-diff-symmetric-merge-base.sh', > > 't4069-remerge-diff.sh', > > + 't4070-diff-pairs.sh', > > 't4100-apply-stat.sh', > > 't4101-apply-nonl.sh', > 
> 't4102-apply-rename.sh', > > diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh > > new file mode 100755 > > index 0000000000..e0a8e6f0a0 > > --- /dev/null > > +++ b/t/t4070-diff-pairs.sh > > @@ -0,0 +1,80 @@ > > +#!/bin/sh > > + > > +test_description='basic diff-pairs tests' > > +. ./test-lib.sh > > + > > +# This creates a diff with added, modified, deleted, renamed, copied, and > > +# typechange entries. That includes one in a subdirectory for non-recursive > > +# tests, and both exact and inexact similarity scores. > > +test_expect_success 'create commit with various diffs' ' > > Generally, tests for setup are named 'setup' so we can do something > like: > sh ./t0050-filesystem.sh --run=setup,9-11 > > Can we rename this to 'setup'? > Good suggestion, I've updated. > > + echo to-be-gone >deleted && > > + echo original >modified && > > + echo now-a-file >symlink && > > + test_seq 200 >two-hundred && > > + test_seq 201 500 >five-hundred && > > + git add . && > > + test_tick && > > + git commit -m base && > > + git tag base && > > + > > + echo now-here >added && > > + echo new >modified && > > + rm deleted && > > + mkdir subdir && > > + echo content >subdir/file && > > + mv two-hundred renamed && > > + test_seq 201 500 | sed s/300/modified/ >copied && > > + rm symlink && > > + git add -A .
&& > > + test_ln_s_add dest symlink && > > + test_tick && > > + git commit -m new && > > + git tag new > > +' > > + > > +test_expect_success 'diff-pairs recreates --raw' ' > > + git diff-tree -r -M -C -C base new >expect && > > + git diff-tree -r -M -C -C -z base new | > > + git diff-pairs >actual && > > + test_cmp expect actual > > +' > > + > > +test_expect_success 'diff-pairs can create -p output' ' > > + git diff-tree -p -M -C -C base new >expect && > > + git diff-tree -r -M -C -C -z base new | > > + git diff-pairs -p >actual && > > + test_cmp expect actual > > +' > > + > > +test_expect_success 'non-recursive --raw retains tree entry' ' > > + git diff-tree base new >expect && > > + git diff-tree -z base new | > > + git diff-pairs >actual && > > + test_cmp expect actual > > +' > > + > > +test_expect_success 'split input across multiple diff-pairs' ' > > + write_script split-raw-diff "$PERL_PATH" <<-\EOF && > > + $/ = "\0"; > > + while (<>) { > > + my $meta = $_; > > + my $path = <>; > > + # renames have an extra path > > + my $path2 = <> if $meta =~ /[RC]\d+/; > > + > > + open(my $fh, ">", sprintf "diff%03d", $.); > > + print $fh $meta, $path, $path2; > > + } > > + EOF > > + > > + git diff-tree -p -M -C -C base new >expect && > > + > > + git diff-tree -r -z -M -C -C base new | > > + ./split-raw-diff && > > + for i in diff*; do > > + git diff-pairs -p <$i || return 1 > > + done >actual && > > + test_cmp expect actual > > +' > > + > > +test_done > > -- > > 2.48.1 ^ permalink raw reply [flat|nested] 78+ messages in thread
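For readers following the parsing loop in the quoted builtin/diff-pairs.c, the NUL-delimited raw input it consumes can be sketched in Python roughly as below. This is only an illustrative sketch, not the patch's implementation: the names `RawPair` and `parse_raw_z` are hypothetical, and the record layout (`:<mode_a> <mode_b> <oid_a> <oid_b> <status>[score]`, then one path, then a second path for R/C entries) is taken from what the quoted C code and `diff-tree -z` usage suggest.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class RawPair:
    """One record of `diff-tree --raw -z` style input (hypothetical type)."""
    mode_a: str
    mode_b: str
    oid_a: str
    oid_b: str
    status: str               # single status letter, e.g. A, D, M, T, R, C
    score: Optional[int]      # similarity percentage if present, else None
    path: str
    path_dst: Optional[str]   # only set for renames and copies


def parse_raw_z(data: bytes) -> Iterator[RawPair]:
    """Yield records from NUL-delimited raw diff input."""
    fields = data.split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty field after the final NUL
    i = 0
    while i < len(fields):
        meta = fields[i].decode("utf-8", "surrogateescape")
        if not meta.startswith(":"):
            raise ValueError("invalid raw diff input")
        parts = meta[1:].split(" ")
        if len(parts) != 5 or not parts[4]:
            raise ValueError("malformed raw diff record: %r" % meta)
        mode_a, mode_b, oid_a, oid_b, status_field = parts
        status, score_text = status_field[0], status_field[1:]
        score = int(score_text) if score_text else None
        # Renames and copies carry a second (destination) path.
        npaths = 2 if status in ("R", "C") else 1
        if i + npaths >= len(fields):
            raise ValueError("got EOF while reading path")
        paths = [f.decode("utf-8", "surrogateescape")
                 for f in fields[i + 1:i + 1 + npaths]]
        yield RawPair(mode_a, mode_b, oid_a, oid_b, status, score,
                      paths[0], paths[1] if npaths == 2 else None)
        i += 1 + npaths
```

Unlike the C code, this sketch does not validate modes or object IDs and keeps the score as a plain percentage rather than scaling it to the internal score range.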
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-12 4:18 ` [PATCH v2 2/3] builtin: introduce diff-pairs command Justin Tobler 2025-02-12 9:23 ` Patrick Steinhardt 2025-02-12 9:51 ` Karthik Nayak @ 2025-02-12 11:40 ` Jean-Noël Avila 2025-02-12 16:50 ` Junio C Hamano 2025-02-17 14:38 ` Phillip Wood 4 siblings, 0 replies; 78+ messages in thread From: Jean-Noël Avila @ 2025-02-12 11:40 UTC (permalink / raw) To: Justin Tobler, git; +Cc: peff On 12/02/2025 at 05:18, Justin Tobler wrote: > > + > +SYNOPSIS > +-------- > +[verse] > +'git diff-pairs' [diff-options] > + This should read: [synopsis] git-diff-pairs [<diff-options>] > +DESCRIPTION > +----------- > + > +Given the output of `diff-tree -z` on its stdin, `diff-pairs` will Please do not use the future tense when describing the actual behavior. > +reformat that output into whatever format is requested on its command > +line. For example: > + > +----------------------------- > +git diff-tree -z -M $a $b | > +git diff-pairs -p > +----------------------------- > + > +will compute the tree diff in one step (including renames), and then > +`diff-pairs` will compute and format the blob-level diffs for each pair. > +This can be used to modify the raw diff in the middle (without having to > +parse or re-create more complicated formats like `--patch`), or to > +compute diffs progressively over the course of multiple invocations of > +`diff-pairs`. > + JN ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-12 4:18 ` [PATCH v2 2/3] builtin: introduce diff-pairs command Justin Tobler ` (2 preceding siblings ...) 2025-02-12 11:40 ` Jean-Noël Avila @ 2025-02-12 16:50 ` Junio C Hamano 2025-02-19 22:19 ` Justin Tobler 2025-02-17 14:38 ` Phillip Wood 4 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2025-02-12 16:50 UTC (permalink / raw) To: Justin Tobler; +Cc: git, peff Justin Tobler <jltobler@gmail.com> writes: > +NOTES > +---- > + > +`diff-pairs` should handle any input generated by `diff-tree --raw -z`. > +It may choke or otherwise misbehave on output from `diff-files`, etc. > + > +Here's an incomplete list of things that `diff-pairs` could do, but > +doesn't (mostly in the name of simplicity): > + > + - Only `-z` input is accepted, not normal `--raw` input. > + > + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you > + want to abbreviate the output, you can pass `--abbrev` to > + `diff-pairs`. > + > + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via > + the initial `diff-tree` invocation. Which of the above limitations are fundamental, and which are merely due to incomplete implementation that could be improved in the future iterations? Without reading the code deeply, a lot of them look like merely due to this iteration being at a WIP state and not quite ready for the general public. What is especially curious is the reason why it is limited to diff-tree (by the way, don't you require '-r' if you are fed 'diff-tree' output, or are you prepared to expand tree objects in the input yourself?). I can guess that the 0{40} object names in the postimage to signal paths with working tree changes unadded to the index is something this fundamentally cannot work with, but you should be able to grok 'diff-index --cached', which does not have that issue, just fine. 
> diff --git a/Documentation/meson.build b/Documentation/meson.build > index ead8e48213..e5ee177022 100644 > --- a/Documentation/meson.build > +++ b/Documentation/meson.build > @@ -41,6 +41,7 @@ manpages = { > 'git-diagnose.adoc' : 1, > 'git-diff-files.adoc' : 1, > 'git-diff-index.adoc' : 1, > + 'git-diff-pairs.adoc' : 1, > 'git-difftool.adoc' : 1, > 'git-diff-tree.adoc' : 1, > 'git-diff.adoc' : 1, This apparently does not apply to 'master' and the base at least needs to contain 1f010d6b (doc: use .adoc extension for AsciiDoc files, 2025-01-20). Please clearly mark the series as such in the cover letter if the series is not built on top of recent 'master' (or 'maint' if it is a series to fix breakage, but it does not apply to this series). Thanks. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-12 16:50 ` Junio C Hamano @ 2025-02-19 22:19 ` Justin Tobler 2025-02-19 23:19 ` Junio C Hamano 0 siblings, 1 reply; 78+ messages in thread From: Justin Tobler @ 2025-02-19 22:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peff On 25/02/12 08:50AM, Junio C Hamano wrote: > Justin Tobler <jltobler@gmail.com> writes: > > > +NOTES > > +---- > > + > > +`diff-pairs` should handle any input generated by `diff-tree --raw -z`. > > +It may choke or otherwise misbehave on output from `diff-files`, etc. > > + > > +Here's an incomplete list of things that `diff-pairs` could do, but > > +doesn't (mostly in the name of simplicity): > > + > > + - Only `-z` input is accepted, not normal `--raw` input. > > + > > + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you > > + want to abbreviate the output, you can pass `--abbrev` to > > + `diff-pairs`. > > + > > + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via > > + the initial `diff-tree` invocation. > > Which of the above limitations are fundamental, and which are merely > due to incomplete implementation that could be improved in the > future iterations? Thinking about this some more, I'm a bit unsure whether git-diff-pairs(1) should support "normal" `--raw` input. Furthermore, if we do want to support it, maybe it should be the default? From my perspective, ultimately I don't think there is much additional value provided by supporting multiple input options for git-diff-pairs(1) since the end result would be the same and it's just an intermediate format. As I see it, the benefit of the NUL delimited raw diff output format is that it is a bit simpler to parse and likely a bit more efficient as it wouldn't have to deal with unquoting paths with special characters. The benefit of the "normal" raw format is probably that it is the more intuitive default option.
I'm certainly interested in what folks think about this :) For abbreviated object IDs, supporting them would make the input format more flexible, but it would be simpler to just require the full OID be provided thus making the input format more explicit. My current thinking is to leave this unless others think it would be useful to support. Regarding pathspec support, being that git-diff-pairs(1) operates solely on the provided set of file pairs produced via some other Git operation, I don't think further limiting would provide much additional value either. If we do want this though, I think support could be added in the future. > Without reading the code deeply, a lot of them > look like merely due to this iteration being at a WIP state and not > quite ready for the general public. > > What is especially curious is the reason why it is limited to > diff-tree (by the way, don't you require '-r' if you are fed > 'diff-tree' output, or are you prepared to expand tree objects in > the input yourself?). The tree objects in the input are not expanded. With `git diff-pairs --raw` these objects are just printed again. With the `--patch` option, they are just omitted. > I can guess that the 0{40} object names in the postimage to signal > paths with working tree changes unadded to the index is something > this fundamentally cannot work with, but you should be able to grok > 'diff-index --cached', which does not have that issue, just fine. I'll rework the documentation in the next version. git-diff-tree(1) is the command I have in mind as the common use case to use in combination with git-diff-pairs(1), but it is not solely limited to it. As you mentioned, there are other commands that could be used to provide input here.
> > diff --git a/Documentation/meson.build b/Documentation/meson.build > > index ead8e48213..e5ee177022 100644 > > --- a/Documentation/meson.build > > +++ b/Documentation/meson.build > > @@ -41,6 +41,7 @@ manpages = { > > 'git-diagnose.adoc' : 1, > > 'git-diff-files.adoc' : 1, > > 'git-diff-index.adoc' : 1, > > + 'git-diff-pairs.adoc' : 1, > > 'git-difftool.adoc' : 1, > > 'git-diff-tree.adoc' : 1, > > 'git-diff.adoc' : 1, > > This apparently does not apply to 'master' and the base at least > needs to contain 1f010d6b (doc: use .adoc extension for AsciiDoc > files, 2025-01-20). Please clearly mark the series as such in the > cover letter if the series is not built on top of recent 'master' > (or 'maint' if it is a series to fix breakage, but it does not apply > to this series). Will do Thanks -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-19 22:19 ` Justin Tobler @ 2025-02-19 23:19 ` Junio C Hamano 2025-02-19 23:47 ` Junio C Hamano 0 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2025-02-19 23:19 UTC (permalink / raw) To: Justin Tobler; +Cc: git, peff Justin Tobler <jltobler@gmail.com> writes: > Thinking about this some more, I'm a bit unsure whether > git-diff-pairs(1) should support "normal" `--raw` input. Furthermore, if > we do want to support it, maybe it should be the default? > > From my perspective, ultimately I don't think there is much additional > value provided by supporting multiple input options for > git-diff-pairs(1) since the end result would be the same and it's just an > intermediate format. As I see it, the benefit of the NUL delimited raw > diff output format is that it is a bit simpler to parse and likely a bit > more efficient as it wouldn't have to deal with unquoting paths with > special characters. The benefit of the "normal" raw format is probably > that it is the more intuitive default option. > > I'm certainly interested in what folks think about this :) FWIW, in our toolset, "-z" is not the default primarily because the text format was chosen to help debuggability, which used to really matter in the early days. > For abbreviated object IDs, supporting them would make the input format > more flexible, but it would be simpler to just require the full OID be > provided thus making the input format more explicit. My current thinking > is to leave this unless others think it would be useful to support. Abbreviated object names would only be at "might be nice to have" level, I would think. We are talking about tools-to-tools communication after all. > Regarding pathspec support, being that git-diff-pairs(1) operates solely > on the provided set of file pairs produced via some other Git operation, > I don't think further limiting would provide much additional value > either.
If we do want this though, I think support could be added in the > future. Another consideration is which side of the pipeline should take the responsibility to invoke the diffcore machinery. We certainly could make it the job for the upstream/frontend, in which case diff-pairs does not have to call into diffcore-rename, BUT it also means the downstream/backend needs to be able to parse two paths (renamed from and renamed to). Or we could make it the job for the downstream, and forbid the upstream/frontend from feeding renamed pairs (i.e. any input with status letter R or C is invalid), in which case diff-pairs can choose to invoke rename detection or not by paying attention to the -M option and invoking diffcore_rename() itself (which should be at no-cost from coding point of view, as it should be just the matter of calling diffcore_std()). > The tree objects in the input are not expanded. With `git diff-pairs > --raw` these objects are just printed again. With the `--patch` option, > they are just omitted. Instead of getting expanded into its subpaths? >> This apparently does not apply to 'master' and the base at least >> needs to contain 1f010d6b (doc: use .adoc extension for AsciiDoc >> files, 2025-01-20). Please clearly mark the series as such in the >> cover letter if the series is not built on top of recent 'master' >> (or 'maint' if it is a series to fix breakage, but it does not apply >> to this series). > > Will do No longer needed, as the tip of 'master' now lives in the .adoc world. Hurrah! ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-19 23:19 ` Junio C Hamano @ 2025-02-19 23:47 ` Junio C Hamano 2025-02-20 0:32 ` Justin Tobler 0 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2025-02-19 23:47 UTC (permalink / raw) To: Justin Tobler; +Cc: git, peff Junio C Hamano <gitster@pobox.com> writes: >> Regarding pathspec support, being that git-diff-pairs(1) operates solely >> on the provided set of file pairs produced via some other Git operation, >> I don't think further limiting would provide much additional value >> either. If we do want this though, I think support could be added in the >> future. > > Another consideration is which side of the pipeline should take the > responsibility to invoke the diffcore machinery. We certainly could > make it the job for the upstream/frontend, in which case diff-pairs > does not have to call into diffcore-rename, BUT it also means the > downstream/backend needs to be able to parse two paths (renamed from > and renamed to). Or we could make it the job for the downstream, > and forbid the upstream/frontend from feeding renamed pairs (i.e. > any input with status letter R or C are invalid), in which case > diff-pairs can choose to invoke rename detection or not by paying > attention to the -M option and invoking diffcore_rename() itself > (which should be at no-cost from coding point of view, as it should > be just the matter of calling diffcore_std()). Sorry, but I hit <SEND> too early before finishing the most important part. We can move the features between upstream frontends and downstream diff-pairs. Depending on our goals, the best division of labor would be different. 
If we want to make it easy for people to write their custom frontends, for example, it might make sense to allow them to be as stupid and simple as possible and make all the heavy lifting the responsibility of the diff-pairs backend, which is the shared resource these frontends share and rely on (so that they have incentive to help us make sure diff-pairs will stay correct and performant). If on the other hand we want to allow people to do fancy processing in their custom frontends, maybe keeping diff-pairs as stupid and transparent would be a better option to give the people who write upstream/frontends more predictable behaviour. Where to do the pathspec limiting is one of these things. You could make it the responsibility of the frontends if we assume that frontends must do their own limiting. Or you could make it an optional feature of the backends, so that a frontend that does not do its own limiting can ask diff-pairs to limit. Which side to burden more really depends on whose job we are trying to make easier. Thanks. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-19 23:47 ` Junio C Hamano @ 2025-02-20 0:32 ` Justin Tobler 2025-02-20 14:56 ` Justin Tobler 0 siblings, 1 reply; 78+ messages in thread From: Justin Tobler @ 2025-02-20 0:32 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peff On 25/02/19 03:47PM, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > >> Regarding pathspec support, being that git-diff-pairs(1) operates solely > >> on the provided set of file pairs produced via some other Git operation, > >> I don't think further limiting would provide much additional value > >> either. If we do want this though, I think support could be added in the > >> future. > > > > Another consideration is which side of the pipeline should take the > > responsibility to invoke the diffcore machinery. We certainly could > > make it the job for the upstream/frontend, in which case diff-pairs > > does not have to call into diffcore-rename, BUT it also means the > > downstream/backend needs to be able to parse two paths (renamed from > > and renamed to). Or we could make it the job for the downstream, > > and forbid the upstream/frontend from feeding renamed pairs (i.e. > > any input with status letter R or C are invalid), in which case > > diff-pairs can choose to invoke rename detection or not by paying > > attention to the -M option and invoking diffcore_rename() itself > > (which should be at no-cost from coding point of view, as it should > > be just the matter of calling diffcore_std()). In the current implementation, diff-pairs is capable of handling input containing rename/copy filepairs computed upstream. It does so by parsing the input line and manually setting the status, score, and paths for the queued `diff_filepair`. 
I think diff-pairs should support rename and copy input as it would allow for rename/copy detection to be performed upfront in a single pass by the upstream and the resulting output could be split up and fed to separate downstream diff-pairs. This is particularly useful for server-side diffs to break up what would be large diffs. > Sorry, but I hit <SEND> too early before finishing the most > important part. We can move the features between upstream frontends > and downstream diff-pairs. Depending on our goals, the best > division of labor would be different. If we want to make it easy > for people to write their custom frontends, for example, it might > make sense to allow them to be as stupid and simple as possible and > make all the heavy lifting the responsibility of the diff-pairs > backend, which is the shared resource these frontends share and rely > on (so that they have incentive to help us make sure diff-pairs will > stay correct and performant). If on the other hand we want to allow > people to do fancy processing in their custom frontends, maybe keeping > diff-pairs as stupid and transparent would be a better option to give > the people who write upstream/frontends more predictable behaviour. > > Where to do the pathspec limiting is one of these things. You could > make it the responsibility of the frontends if we assume that frontends > must do their own limiting. Or you could make it an optional feature > of the backends, so that a frontend that does not do its own limiting > can ask diff-pairs to limit. Which side to burden more really depends > on whose job we are trying to make easier. For the server-side diff use case, I think that aligns more towards having a front-end that does more of the heavy lifting such as rename/copy detection and pathspec limiting, while diff-pairs really just needs to compute the individual diffs for the already specified file pairs.
I do see value though in keeping the door open for diff-pairs to become more robust and flexible. Maybe it would be fine for now to say pathspec limiting is not supported, but it could be in the future? > >> The tree objects in the input are not expanded. With `git diff-pairs > >> --raw` these objects are just printed again. With the `--patch` option, > >> they are just omitted. > > >Instead of getting expanded into its subpaths? The current implementation of diff-pairs is rather simple. It relies on the upstream to feed it the file pairs with all the info upfront so it can set up the diff queue. This means input with tree objects is also queued as-is without being expanded further. I could maybe see a future though where we want diff-pairs to be a more robust backend that supports expanding these paths via the -r option. Following previous discussion, maybe it's fine to keep the initial implementation of diff-pairs on the simple side for now. We could make diff-pairs die() for now if the -r option is explicitly set. Thanks -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
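The idea discussed above, running rename/copy detection once upstream and then feeding slices of the raw output to separate diff-pairs processes, can be sketched as follows. This is a hedged illustration in the spirit of the `split-raw-diff` Perl helper from the quoted test, not part of the series; `split_raw_z` and `batch_size` are hypothetical names, and the only format knowledge assumed is that R and C records carry two paths while all others carry one.

```python
from typing import List


def split_raw_z(data: bytes, batch_size: int = 1) -> List[bytes]:
    """Split NUL-delimited raw diff input into batches of whole records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fields = data.split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty field after the final NUL
    records = []
    i = 0
    while i < len(fields):
        meta = fields[i]
        # The status letter starts the last space-separated field.
        status = meta.rsplit(b" ", 1)[-1][:1]
        # meta + path, plus a destination path for renames and copies.
        nfields = 3 if status in (b"R", b"C") else 2
        if i + nfields > len(fields):
            raise ValueError("truncated raw diff record")
        records.append(b"".join(f + b"\0" for f in fields[i:i + nfields]))
        i += nfields
    return [b"".join(records[j:j + batch_size])
            for j in range(0, len(records), batch_size)]
```

Each returned chunk is itself valid `-z` raw input, so it could be written to the stdin of its own `git diff-pairs -p` invocation; per the test quoted earlier in the thread, concatenating the outputs in order is expected to match a single run.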
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-20 0:32 ` Justin Tobler @ 2025-02-20 14:56 ` Justin Tobler 2025-02-20 16:14 ` Junio C Hamano 0 siblings, 1 reply; 78+ messages in thread From: Justin Tobler @ 2025-02-20 14:56 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peff On 25/02/19 06:32PM, Justin Tobler wrote: > > >> The tree objects in the input are not expanded. With `git diff-pairs > > >> --raw` these objects are just printed again. With the `--patch` option, > > >> they are just omitted. > > > > >Instead of getting expanded into its subpaths? > > The current implementation of diff-pairs is rather simple. It relies on > the upstream to feed it the file pairs with all the info upfront so it > can set up the diff queue. This means input with tree objects is also > queued as-is without being expanded further. I could maybe see a future > though where we want diff-pairs to be a more robust backend that supports > expanding these paths via the -r option. Following previous discussion, > maybe it's fine to keep the initial implementation of diff-pairs on the > simple side for now. We could make diff-pairs die() for now if the -r > option is explicitly set. Thinking about this some more, adding support to expand trees in diff-pairs would alter patch output behavior. To better enable backwards compatible inclusion of this feature in the future, we may just want to die() for now if any tree object is present in diff-pairs input. -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-20 14:56 ` Justin Tobler @ 2025-02-20 16:14 ` Junio C Hamano 0 siblings, 0 replies; 78+ messages in thread From: Junio C Hamano @ 2025-02-20 16:14 UTC (permalink / raw) To: Justin Tobler; +Cc: git, peff Justin Tobler <jltobler@gmail.com> writes: > Thinking about this some more, adding support to expand trees in > diff-pairs would alter patch output behavior. To better enable backwards > compatible inclusion of this feature in the future, we may just want to > die() for now if any tree object is present in diff-pairs input. Sounds sensible. Thanks. ^ permalink raw reply [flat|nested] 78+ messages in thread
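As a rough illustration of the check agreed on above (rejecting input that contains tree objects), a frontend-side validation could look like the sketch below. It is an assumption-laden sketch rather than what a later version of the series does: `has_tree_entry` is a hypothetical helper, and it relies only on the fact that tree entries appear in raw diff output with mode 040000.

```python
S_IFMT = 0o170000   # mask for the file-type bits of a mode
S_IFDIR = 0o040000  # file-type bits used for tree (directory) entries


def has_tree_entry(meta: str) -> bool:
    """Return True if either side of a raw diff record is a tree.

    `meta` is the metadata part of one record, for example
    ":040000 040000 <oid_a> <oid_b> M".
    """
    if not meta.startswith(":"):
        raise ValueError("invalid raw diff input")
    parts = meta[1:].split(" ")
    if len(parts) != 5:
        raise ValueError("malformed raw diff record: %r" % meta)
    mode_a, mode_b = int(parts[0], 8), int(parts[1], 8)
    return (mode_a & S_IFMT) == S_IFDIR or (mode_b & S_IFMT) == S_IFDIR
```

A caller could refuse to forward any record for which this returns True, which mirrors the proposed die() behaviour without committing to how trees might be expanded in the future.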
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-12 4:18 ` [PATCH v2 2/3] builtin: introduce diff-pairs command Justin Tobler ` (3 preceding siblings ...) 2025-02-12 16:50 ` Junio C Hamano @ 2025-02-17 14:38 ` Phillip Wood 2025-02-19 20:51 ` Justin Tobler 4 siblings, 1 reply; 78+ messages in thread From: Phillip Wood @ 2025-02-17 14:38 UTC (permalink / raw) To: Justin Tobler, git; +Cc: peff, Patrick Steinhardt, Junio C Hamano Hi Justin On 12/02/2025 04:18, Justin Tobler wrote: > Through git-diff(1), a single diff can be generated from a pair of blob > revisions directly. Unfortunately, there is not a mechanism to compute > batches of specific file pair diffs in a single process. Such a feature > is particularly useful on the server-side where diffing between a large > set of changes is not feasible all at once due to timeout concerns. > > To facilitate this, introduce git-diff-pairs(1) which takes the > null-terminated raw diff format as input on stdin and produces diffs in > other formats. As the raw diff format already contains the necessary > metadata, it becomes possible to progressively generate batches of diffs > without having to recompute rename detection or retrieve object context. > Something like the following: > > git diff-tree -r -z -M $old $new | > git diff-pairs -p > > should generate the same output as `git diff-tree -p -M`. Furthermore, > each line of raw diff formatted input can also be individually fed to a > separate git-diff-pairs(1) process and still produce the same output. I like the idea of this, I've left a few comments mainly around the UI. > +Here's an incomplete list of things that `diff-pairs` could do, but > +doesn't (mostly in the name of simplicity): > + > + - Only `-z` input is accepted, not normal `--raw` input. 
I think only accepting NUL-terminated input is fine, but if we want to accept other formats we should have a plan for how to do that in a backwards-compatible way as we cannot use `-z` to distinguish between input formats. > + const char * const usage[] = { > + N_("git diff-pairs [diff-options]"), Normally the option summary printed by "git foo -h" is generated by the option parser. In this case we don't define any options and use setup_revisions() instead so we need to provide the option summary ourselves. Looking at diff-files.c we can add "\n" COMMON_DIFF_OPTIONS_HELP; to do that. > + argc = setup_revisions(argc, argv, &revs, NULL); I think we should check that there are no options left on the command line after setup_revisions() returns. > + /* Don't allow pathspecs at all. */ > + if (revs.prune_data.nr) > + usage_with_options(usage, options); It is not just pathspecs that we want to reject but all revision-related options. Looking at diff-files.c we can do if (rev.pending.nr || rev.min_age != -1 || rev.max_age != -1 || rev.max_count != -1) usage_with_options(usage, options); to catch some of that, but it still accepts things like "--first-parent", "--merges" and "--ancestry-path". We may just have to live with that as I don't think it is worth expending a huge amount of effort to prevent them. > + if (!revs.diffopt.output_format) > + revs.diffopt.output_format = DIFF_FORMAT_RAW; This matches the other diff plumbing commands but I'm not sure it is the most helpful default for a command that is supposed to transform raw diffs into another format. Maybe we should default to DIFF_FORMAT_PATCH? > +test_expect_success 'split input across multiple diff-pairs' ' This needs a PERL prerequisite I think. I'm a bit unsure what this test adds compared to the others.
Best Wishes Phillip > + write_script split-raw-diff "$PERL_PATH" <<-\EOF && > + $/ = "\0"; > + while (<>) { > + my $meta = $_; > + my $path = <>; > + # renames have an extra path > + my $path2 = <> if $meta =~ /[RC]\d+/; > + > + open(my $fh, ">", sprintf "diff%03d", $.); > + print $fh $meta, $path, $path2; > + } > + EOF > + > + git diff-tree -p -M -C -C base new >expect && > + > + git diff-tree -r -z -M -C -C base new | > + ./split-raw-diff && > + for i in diff*; do > + git diff-pairs -p <$i || return 1 > + done >actual && > + test_cmp expect actual > +' > + > +test_done ^ permalink raw reply [flat|nested] 78+ messages in thread
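The record layout that the quoted Perl helper relies on (a metadata field, one path, and a second path for renames and copies, each NUL-terminated) can also be sketched in C. The function names below are hypothetical and this is not how diff-pairs itself reads its input; it only walks a buffer of `git diff-tree -r -z` output and counts complete records. It does not handle the empty "flush" record proposed in patch 3/3.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given the metadata field of a "-z" raw diff
 * record (":<mode> <mode> <oid> <oid> <status>[<score>]"), return how
 * many NUL-terminated paths follow it: 2 for renames and copies, 1
 * otherwise, or 0 if the field does not look like raw diff metadata.
 */
static int raw_record_path_count(const char *meta)
{
	const char *status;

	if (meta[0] != ':')
		return 0;
	status = strrchr(meta, ' ');
	if (!status || !status[1])
		return 0;
	return (status[1] == 'R' || status[1] == 'C') ? 2 : 1;
}

/*
 * Hypothetical helper: count the complete records in a buffer holding
 * `len` bytes of "-z" raw diff output. Returns -1 on malformed or
 * truncated input.
 */
static int raw_count_records(const char *buf, size_t len)
{
	size_t pos = 0;
	int records = 0;

	while (pos < len) {
		/* The metadata field ends at the next NUL byte. */
		const char *end = memchr(buf + pos, '\0', len - pos);
		int paths;

		if (!end)
			return -1;
		paths = raw_record_path_count(buf + pos);
		if (!paths)
			return -1;
		pos = (size_t)(end - buf) + 1;

		/* Skip the one or two paths that belong to this record. */
		while (paths--) {
			end = memchr(buf + pos, '\0', len - pos);
			if (!end)
				return -1;
			pos = (size_t)(end - buf) + 1;
		}
		records++;
	}
	return records;
}
```

Knowing where each record ends is what allows the input to be cut at record boundaries and handed to separate diff-pairs processes, which is the scenario the test exercises.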
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-17 14:38 ` Phillip Wood @ 2025-02-19 20:51 ` Justin Tobler 2025-02-19 21:57 ` Junio C Hamano 2025-02-26 14:47 ` Phillip Wood 0 siblings, 2 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-19 20:51 UTC (permalink / raw) To: phillip.wood; +Cc: git, peff, Patrick Steinhardt, Junio C Hamano On 25/02/17 02:38PM, Phillip Wood wrote: > Hi Justin > > On 12/02/2025 04:18, Justin Tobler wrote: > > Through git-diff(1), a single diff can be generated from a pair of blob > > revisions directly. Unfortunately, there is not a mechanism to compute > > batches of specific file pair diffs in a single process. Such a feature > > is particularly useful on the server-side where diffing between a large > > set of changes is not feasible all at once due to timeout concerns. > > > > To facilitate this, introduce git-diff-pairs(1) which takes the > > null-terminated raw diff format as input on stdin and produces diffs in > > other formats. As the raw diff format already contains the necessary > > metadata, it becomes possible to progressively generate batches of diffs > > without having to recompute rename detection or retrieve object context. > > Something like the following: > > > > git diff-tree -r -z -M $old $new | > > git diff-pairs -p > > > > should generate the same output as `git diff-tree -p -M`. Furthermore, > > each line of raw diff formatted input can also be individually fed to a > > separate git-diff-pairs(1) process and still produce the same output. > > I like the idea of this, I've left a few comments mainly around the UI. > > > +Here's an incomplete list of things that `diff-pairs` could do, but > > +doesn't (mostly in the name of simplicity): > > + > > + - Only `-z` input is accepted, not normal `--raw` input. 
> > I think only accepting NUL terminated input is fine, but if we want to > accept other formats we should have a plan for how to do that in a > backwards compatible way as we cannot use `-z` to distinguish between input > formats. If in the future we want to support the normal format, we could introduce an `--input-format=normal` option or something along those lines. > > + const char * const usage[] = { > > + N_("git diff-pairs [diff-options]"), > > Normally the option summary printed by "git foo -h" is generated by the > option parser. In this case we don't define any options and use > setup_revisions() instead so we need to provide the option summary > ourselves. Looking at diff-files.c we can add > > "\n" > COMMON_DIFF_OPTIONS_HELP; > > to do that. Would this be preferable even if git-diff-pairs doesn't support all of the common diff options? > > + argc = setup_revisions(argc, argv, &revs, NULL); > > I think we should check that there are no options left on the commandline > after setup_revisions() returns Good call, will do in the next version. > > + /* Don't allow pathspecs at all. */ > > + if (revs.prune_data.nr) > > + usage_with_options(usage, options); > > It is not just pathspecs that we want to reject but all revision related > options. Looking at diff-files.c we can do > > if (rev.pending.nr || > rev.min_age != -1 || rev.max_age != -1 || > rev.max_count != -1) > usage_with_option(usage, options); > > To catch some of that but it still accepts things like "--first-parent", > "--merges" and "--ancestry-path". We may just have to live with that as I > don't think it is worth expanding a huge amount of effort to prevent them. Yes, we should also reject revision as well as pathspec arguments. Will update. 
> > + if (!revs.diffopt.output_format) > > + revs.diffopt.output_format = DIFF_FORMAT_RAW; > > This matches the other diff plumbing commands but I'm not sure it is the > most helpful default for a command that is supposed to transform raw diffs > into another format. Maybe we should default to DIFF_FORMAT_PATCH? As you mentioned, defaulting to DIFF_FORMAT_RAW isn't the most useful behavior. I agree that it makes more sense to use DIFF_FORMAT_PATCH as the default. Will update in the next version. > > +test_expect_success 'split input across multiple diff-pairs' ' > > This needs a PERL prerequisite I think. I'm a bit unsure what this test adds > compared to the others. This test demonstrates that the raw diff input can be split across separate git-diff-pairs(1) processes and still produce equivilant output which is one of the main usecases for the command. That being said, this test isn't really exercising different behavior of git-diff-pairs(1) itself, so maybe it would be best to drop it. Thanks for the review :) -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-19 20:51 ` Justin Tobler @ 2025-02-19 21:57 ` Junio C Hamano 2025-02-19 22:38 ` Justin Tobler 2025-02-26 14:47 ` Phillip Wood 1 sibling, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2025-02-19 21:57 UTC (permalink / raw) To: Justin Tobler; +Cc: phillip.wood, git, peff, Patrick Steinhardt Justin Tobler <jltobler@gmail.com> writes: >> I think only accepting NUL terminated input is fine, but if we want to >> accept other formats we should have a plan for how to do that in a >> backwards compatible way as we cannot use `-z` to distinguish between input >> formats. > > If in the future we want to support the normal format, we could introduce > an `--input-format=normal` option or something along those lines. Please don't. Have an explicit '-z' option from the beginning, and if the initial version is incapable of reading from text input, then it is perfectly fine to have if (!nul_termination) die(_("working without -z not supported (yet)")); Otherwise people have to remember that unlike everybody else that uses "-z" to signal NUL termination, this one alone wants to use a "--input-format" option that nobody else uses. >> > + /* Don't allow pathspecs at all. */ >> > + if (revs.prune_data.nr) >> > + usage_with_options(usage, options); Hmph, this is very unfortunate. The "--raw" format was originally designed as an interchange format between the frontend and backend. The frontend programs take two sets of contents stored in various places (like tree vs index, tree vs another tree) and express comparison of corresponding paths in (<from mode+contents> <to mode+contents> <path>) tuples (a rough equivalent to what we internally have on the diff_queued_diff queue in core). The "--raw" format was designed to "dump" what is in the diff_queued_diff list.
And then it would be passed to the single backend, which takes the "--raw" format, passes the entries through the diffcore transform machinery (like matching removal and addition to detect renames), and produces various forms of output (like patch, diffstat, etc.). To me, what you are writing is the output phase of that pipeline, i.e. the backend. We do want to (eventually) be able to filter with pathspec, and do all the other things the current diff machinery does after the existing "all-in-one" "git diff" and "git diff-{files,index,tree}" commands do from their call to diffcore_std() and diffcore_flush(). The revisions option parsing machinery does accept options that cannot sensibly be expected to make any difference to the result of running "diff". Rejecting them is a nice thing to have, e.g. "git diff --no-merges HEAD^ HEAD" does not error out, but some people may want it to barf (I don't care---I am not sick enough to give apparently nonsense options to random commands), but it is perfectly fine to start your implementation with "nonsense options may be ignored". But in a "git diff-* -z | git diff-pairs -z" pipeline, I do not see a particular reason why you would want to forbid the downstream command to further limit the paths it processes with its own pathspec, e.g. git diff-tree -z --raw A B -- t/ | git diff-pairs -z t/helper/ sounds like a perfectly sensible request to grant. My recommendation is to avoid deciding to reject things your initial implementation happens not to support (yet) too early. In the end, we want this backend half to be just as powerful as, if not more than, the real "git diff" machinery that has both front- and backend in the same binary. Thanks. ^ permalink raw reply [flat|nested] 78+ messages in thread
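To make the "interchange format" point concrete, here is a minimal sketch of how a consumer might pull the tuple fields out of the metadata part of one raw diff record. The struct and function names are hypothetical and this is not the parser used by the series; it assumes the documented raw layout ":<old mode> <new mode> <old oid> <new oid> <status>[<score>]", with the path or paths carried separately.

```c
#include <stdio.h>

/* Hypothetical container for the metadata of one raw diff record. */
struct raw_meta {
	unsigned int old_mode, new_mode;
	char old_oid[65], new_oid[65]; /* room for SHA-1 or SHA-256 hex */
	char status;                   /* 'M', 'A', 'D', 'R', 'C', ... */
	int score;                     /* similarity score, 0 if absent */
};

/*
 * Hypothetical helper: parse the metadata field of a raw diff record.
 * Returns 0 on success and -1 if the field is not well formed.
 */
static int parse_raw_meta(const char *meta, struct raw_meta *out)
{
	int n;

	out->score = 0;
	/* Modes are octal; the score after the status letter is optional. */
	n = sscanf(meta, ":%o %o %64s %64s %c%d",
		   &out->old_mode, &out->new_mode,
		   out->old_oid, out->new_oid,
		   &out->status, &out->score);
	return n >= 5 ? 0 : -1;
}
```

Because every field needed to build a file pair is already in the record, a backend fed this way has no need to look at revisions at all, which matches the front-end/back-end split described above.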
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-19 21:57 ` Junio C Hamano @ 2025-02-19 22:38 ` Justin Tobler 0 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-19 22:38 UTC (permalink / raw) To: Junio C Hamano; +Cc: phillip.wood, git, peff, Patrick Steinhardt On 25/02/19 01:57PM, Junio C Hamano wrote: > Justin Tobler <jltobler@gmail.com> writes: > > >> I think only accepting NUL terminated input is fine, but if we want to > >> accept other formats we should have a plan for how to do that in a > >> backwards compatible way as we cannot use `-z` to distinguish between input > >> formats. > > > > If in the future we want to support the normal format, we could introduce > > an `--input-format=normal` option or something along those lines. > > Please don't. Have an explicit '-z' option from the beginning, and > if the initial version is incapable of reading from text input, then > it is perfectly fine to have > > if (!nul_termination) > die(_("working without -z not supported (yet)"); > > Otherwise people have to remember that unlike everybody else that > uses "-z" to signal NUL termination, this one alone wants to use a > "--input-format" option that nobody else uses. Thanks, I think this is a much better approach! :) > > >> > + /* Don't allow pathspecs at all. */ > >> > + if (revs.prune_data.nr) > >> > + usage_with_options(usage, options); > > Hmph, this is very unfortuate. > > The "--raw" format was originally designed as an interchange format > between the frontend and backend. > > The frontend programs take two sets of contents stored in various > places (like tree vs index, tree vs another tree) and express > comparison of corresponding paths in (<from mode+contents> <to > mode+contents> <path>) tuples" (a rough equivalent to what we > internally have on the diff_queued_diff queue in core). > > The "--raw" format was designed to "dump" what is in the > diff_queued_diff list. 
> > And then it would be passed to the single backend, that takes > "--raw" format, pass them through the diffcore transform machinery > (like matching removal and addition to detect renames), and produce > various forms of output (like patch, diffstat, etc.). > > To me, what you are writing is the output phase of that pipeline, > i.e. the backend. We do want to (evantually) be able to filter with > pathspec, and all other things the current diff machinery does after > the existing "all-in-one" "git diff" and "git diff-{files,index,tree}" > commands do from their call to diffcore_std() and diffcore_flush(). > > The revisions option parsing machinery does accept options that > would *not* make sense to expect for them to make any difference to > the result of running "diff". Rejecting them is a nice thing to > have, e.g. "git diff --no-merges HEAD^ HEAD" does not error out, but > some people may want it to barf (I don't care---I am not sick enough > to give apparently nonsense options to random commands), but it is > perfectly fine to start your implementation with "nonsense options > may be ignored". > > But in a "git diff-* -z | git diff-pairs -z" pipeline, I do not see > a particular reason why you would want to forbid the downstream > command to further limit the paths it processes with its own > pathspec, e.g. > > git diff-tree -z --raw A B -- t/ | git diff-pairs -z t/helper/ > > sounds like a perfectly sensible request to grant. > > My recommendation is to avoid deciding to reject things your initial > implementation happens not to support (yet) too early. In the end, > we want this backend half just as powerful as, if not more than, the > real "git diff" machinery that has both front- and backend in the > same binary. Ok, that makes sense. I was originally thinking pathspec limiting could just be handled upstream, but it probably doesn't make much to arbitrarily limit this functionality and remain more flexible. I'll do this in the next version. 
Thanks -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 2/3] builtin: introduce diff-pairs command 2025-02-19 20:51 ` Justin Tobler 2025-02-19 21:57 ` Junio C Hamano @ 2025-02-26 14:47 ` Phillip Wood 1 sibling, 0 replies; 78+ messages in thread From: Phillip Wood @ 2025-02-26 14:47 UTC (permalink / raw) To: Justin Tobler, phillip.wood; +Cc: git, peff, Patrick Steinhardt, Junio C Hamano Hi Justin On 19/02/2025 20:51, Justin Tobler wrote: > On 25/02/17 02:38PM, Phillip Wood wrote: >> Hi Justin >>> + const char * const usage[] = { >>> + N_("git diff-pairs [diff-options]"), >> >> Normally the option summary printed by "git foo -h" is generated by the >> option parser. In this case we don't define any options and use >> setup_revisions() instead so we need to provide the option summary >> ourselves. Looking at diff-files.c we can add >> >> "\n" >> COMMON_DIFF_OPTIONS_HELP; >> >> to do that. > > Would this be preferable even if git-diff-pairs doesn't support all of > the common diff options? Which options are you thinking about here? I might have missed something, but I don't think that help text includes anything that's not in diff-options.adoc, which we include in diff-pairs.adoc. If there are options in the documentation that we don't support, then that is a problem. Best Wishes Phillip >>> + argc = setup_revisions(argc, argv, &revs, NULL); >> >> I think we should check that there are no options left on the commandline >> after setup_revisions() returns > > Good call, will do in the next version. > >>> + /* Don't allow pathspecs at all. */ >>> + if (revs.prune_data.nr) >>> + usage_with_options(usage, options); >> >> It is not just pathspecs that we want to reject but all revision related >> options. Looking at diff-files.c we can do >> >> if (rev.pending.nr || >> rev.min_age != -1 || rev.max_age != -1 || >> rev.max_count != -1) >> usage_with_options(usage, options); >> >> To catch some of that but it still accepts things like "--first-parent", >> "--merges" and "--ancestry-path".
We may just have to live with that as I >> don't think it is worth expanding a huge amount of effort to prevent them. > > Yes, we should also reject revision as well as pathspec arguments. Will > update. > >>> + if (!revs.diffopt.output_format) >>> + revs.diffopt.output_format = DIFF_FORMAT_RAW; >> >> This matches the other diff plumbing commands but I'm not sure it is the >> most helpful default for a command that is supposed to transform raw diffs >> into another format. Maybe we should default to DIFF_FORMAT_PATCH? > > As you mentioned, defaulting to DIFF_FORMAT_RAW isn't the most useful > behavior. I agree that it makes more sense to use DIFF_FORMAT_PATCH as > the default. Will update in the next version. > >>> +test_expect_success 'split input across multiple diff-pairs' ' >> >> This needs a PERL prerequisite I think. I'm a bit unsure what this test adds >> compared to the others. > > This test demonstrates that the raw diff input can be split across > separate git-diff-pairs(1) processes and still produce equivilant > output which is one of the main usecases for the command. That being > said, this test isn't really exercising different behavior of > git-diff-pairs(1) itself, so maybe it would be best to drop it. > > Thanks for the review :) > > -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v2 3/3] builtin/diff-pairs: allow explicit diff queue flush 2025-02-12 4:18 ` [PATCH v2 " Justin Tobler 2025-02-12 4:18 ` [PATCH v2 1/3] diff: return diff_filepair from diff queue helpers Justin Tobler 2025-02-12 4:18 ` [PATCH v2 2/3] builtin: introduce diff-pairs command Justin Tobler @ 2025-02-12 4:18 ` Justin Tobler 2025-02-12 9:23 ` Patrick Steinhardt 2025-02-17 14:38 ` Phillip Wood 2025-02-25 23:39 ` [PATCH v3 0/3] batch blob diff generation Justin Tobler 3 siblings, 2 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-12 4:18 UTC (permalink / raw) To: git; +Cc: peff, Justin Tobler The diffs queued from git-diff-pairs(1) stdin are not flushed EOF is reached. To enable greater flexibility, allow control over when the diff queue is flushed by writing a single nul byte on stdin between input file pairs. Diff output between flushes is separated by a single line terminator. Signed-off-by: Justin Tobler <jltobler@gmail.com> --- Documentation/git-diff-pairs.adoc | 4 ++++ builtin/diff-pairs.c | 11 +++++++++++ t/t4070-diff-pairs.sh | 22 ++++++++++++++++++++++ 3 files changed, 37 insertions(+) diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc index e9ef4a6615..33c0d702f0 100644 --- a/Documentation/git-diff-pairs.adoc +++ b/Documentation/git-diff-pairs.adoc @@ -32,6 +32,10 @@ compute diffs progressively over the course of multiple invocations of Each blob pair is fed to the diff machinery individually queued and the output is flushed on stdin EOF. +To explicitly flush the diff queue, a single nul byte can be written to stdin +between filepairs. Diff output between flushes is separated by a single line +terminator. 
+ OPTIONS ------- diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c index 08f3ee81e5..2436ce3013 100644 --- a/builtin/diff-pairs.c +++ b/builtin/diff-pairs.c @@ -99,6 +99,17 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, break; p = meta.buf; + if (!*p) { + flush_diff_queue(&revs.diffopt); + /* + * When the diff queue is explicitly flushed, append an + * additional terminator to separate batches of diffs. + */ + fprintf(revs.diffopt.file, "%c", + revs.diffopt.line_termination); + continue; + } + if (*p != ':') die("invalid raw diff input"); p++; diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh index e0a8e6f0a0..aca228a8fa 100755 --- a/t/t4070-diff-pairs.sh +++ b/t/t4070-diff-pairs.sh @@ -77,4 +77,26 @@ test_expect_success 'split input across multiple diff-pairs' ' test_cmp expect actual ' +test_expect_success 'diff-pairs explicit queue flush' ' + git diff-tree -r -M -C -C -z base new >input && + printf "\0" >>input && + git diff-tree -r -M -C -C -z base new >>input && + + git diff-tree -r -M -C -C base new >expect && + printf "\n" >>expect && + git diff-tree -r -M -C -C base new >>expect && + + git diff-pairs <input >actual && + test_cmp expect actual +' + +test_expect_success 'diff-pairs explicit queue flush null terminated' ' + git diff-tree -r -M -C -C -z base new >expect && + printf "\0" >>expect && + git diff-tree -r -M -C -C -z base new >>expect && + + git diff-pairs -z <expect >actual && + test_cmp expect actual +' + test_done -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
* Re: [PATCH v2 3/3] builtin/diff-pairs: allow explicit diff queue flush 2025-02-12 4:18 ` [PATCH v2 3/3] builtin/diff-pairs: allow explicit diff queue flush Justin Tobler @ 2025-02-12 9:23 ` Patrick Steinhardt 2025-02-17 14:38 ` Phillip Wood 1 sibling, 0 replies; 78+ messages in thread From: Patrick Steinhardt @ 2025-02-12 9:23 UTC (permalink / raw) To: Justin Tobler; +Cc: git, peff On Tue, Feb 11, 2025 at 10:18:25PM -0600, Justin Tobler wrote: > The diffs queued from git-diff-pairs(1) stdin are not flushed EOF is I think you meant to say "are flushed when stdin is closed" or something like that. > reached. To enable greater flexibility, allow control over when the diff > queue is flushed by writing a single nul byte on stdin between input s/nul/NUL/ > file pairs. Diff output between flushes is separated by a single line > terminator. > > Signed-off-by: Justin Tobler <jltobler@gmail.com> > --- > Documentation/git-diff-pairs.adoc | 4 ++++ > builtin/diff-pairs.c | 11 +++++++++++ > t/t4070-diff-pairs.sh | 22 ++++++++++++++++++++++ > 3 files changed, 37 insertions(+) > > diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc > index e9ef4a6615..33c0d702f0 100644 > --- a/Documentation/git-diff-pairs.adoc > +++ b/Documentation/git-diff-pairs.adoc > @@ -32,6 +32,10 @@ compute diffs progressively over the course of multiple invocations of > Each blob pair is fed to the diff machinery individually queued and the output > is flushed on stdin EOF. > > +To explicitly flush the diff queue, a single nul byte can be written to stdin > +between filepairs. Diff output between flushes is separated by a single line > +terminator. The same comment as for the previous patch applies here: I think we should refrain from using jargon like "flushing", "diff queue" or "filepairs". These are internal implementation details that the user shouldn't need to worry about. Instead, we should be talking about the user-visible effects.
> diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > index 08f3ee81e5..2436ce3013 100644 > --- a/builtin/diff-pairs.c > +++ b/builtin/diff-pairs.c > @@ -99,6 +99,17 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > break; > > p = meta.buf; > + if (!*p) { > + flush_diff_queue(&revs.diffopt); > + /* > + * When the diff queue is explicitly flushed, append an > + * additional terminator to separate batches of diffs. > + */ > + fprintf(revs.diffopt.file, "%c", > + revs.diffopt.line_termination); You can use `fputc(revs.diffopt.line_termination, revs.diffopt.file)` instead. > diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh > index e0a8e6f0a0..aca228a8fa 100755 > --- a/t/t4070-diff-pairs.sh > +++ b/t/t4070-diff-pairs.sh > @@ -77,4 +77,26 @@ test_expect_success 'split input across multiple diff-pairs' ' > test_cmp expect actual > ' > > +test_expect_success 'diff-pairs explicit queue flush' ' > + git diff-tree -r -M -C -C -z base new >input && > + printf "\0" >>input && > + git diff-tree -r -M -C -C -z base new >>input && > + > + git diff-tree -r -M -C -C base new >expect && > + printf "\n" >>expect && > + git diff-tree -r -M -C -C base new >>expect && > + > + git diff-pairs <input >actual && > + test_cmp expect actual > +' > +j > +test_expect_success 'diff-pairs explicit queue flush null terminated' ' s/null/NUL > + git diff-tree -r -M -C -C -z base new >expect && > + printf "\0" >>expect && > + git diff-tree -r -M -C -C -z base new >>expect && > + > + git diff-pairs -z <expect >actual && > + test_cmp expect actual > +' > + Patrick ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v2 3/3] builtin/diff-pairs: allow explicit diff queue flush 2025-02-12 4:18 ` [PATCH v2 3/3] builtin/diff-pairs: allow explicit diff queue flush Justin Tobler 2025-02-12 9:23 ` Patrick Steinhardt @ 2025-02-17 14:38 ` Phillip Wood 2025-02-19 23:09 ` Justin Tobler 1 sibling, 1 reply; 78+ messages in thread From: Phillip Wood @ 2025-02-17 14:38 UTC (permalink / raw) To: Justin Tobler, git; +Cc: peff, Patrick Steinhardt, Junio C Hamano Hi Justin On 12/02/2025 04:18, Justin Tobler wrote: > The diffs queued from git-diff-pairs(1) stdin are not flushed EOF is > reached. To enable greater flexibility, allow control over when the diff > queue is flushed by writing a single nul byte on stdin between input > file pairs. Diff output between flushes is separated by a single line > terminator. I agree with the comments others have made about the documentation. I also have some comments on the implementation below. > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > index 08f3ee81e5..2436ce3013 100644 > --- a/builtin/diff-pairs.c > +++ b/builtin/diff-pairs.c > @@ -99,6 +99,17 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > break; > > p = meta.buf; > + if (!*p) { > + flush_diff_queue(&revs.diffopt); > + /* > + * When the diff queue is explicitly flushed, append an > + * additional terminator to separate batches of diffs. > + */ > + fprintf(revs.diffopt.file, "%c", > + revs.diffopt.line_termination); As the user has requested an explicit flush we should call fflush(stdout) here to avoid deadlocking a caller that is waiting to read the terminator before writing the next batch of input. Ideally the tests would check that the output is flushed but I think that is quite hard to do with our test framework. I think it would be easier for callers to parse the output if we always printed NUL here. Programming languages generally have a function that allows you to read all the input until a specific byte is seen. 
If flushing always used a NUL terminator the caller could use their equivalent of read_until(b'\0') to hoover up the output (using '-z' to do this would change the output of --numstat and embed a NUL between any stat data and the patch). Using a newline as the terminator here means the caller needs to look for "\n\n". That string occurs in the output between the stat data and the patch and can also occur in the patch hunks if diff.suppressBlankEmpty is set. Now that we are calling diff_flush() in a loop we need to set .no_free in our diff options and call diff_free() at the end of the program (see the comment in diff.h) Best Wishes Phillip > + continue; > + } > + > if (*p != ':') > die("invalid raw diff input"); > p++; > diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh > index e0a8e6f0a0..aca228a8fa 100755 > --- a/t/t4070-diff-pairs.sh > +++ b/t/t4070-diff-pairs.sh > @@ -77,4 +77,26 @@ test_expect_success 'split input across multiple diff-pairs' ' > test_cmp expect actual > ' > > +test_expect_success 'diff-pairs explicit queue flush' ' > + git diff-tree -r -M -C -C -z base new >input && > + printf "\0" >>input && > + git diff-tree -r -M -C -C -z base new >>input && > + > + git diff-tree -r -M -C -C base new >expect && > + printf "\n" >>expect && > + git diff-tree -r -M -C -C base new >>expect && > + > + git diff-pairs <input >actual && > + test_cmp expect actual > +' > +j > +test_expect_success 'diff-pairs explicit queue flush null terminated' ' > + git diff-tree -r -M -C -C -z base new >expect && > + printf "\0" >>expect && > + git diff-tree -r -M -C -C -z base new >>expect && > + > + git diff-pairs -z <expect >actual && > + test_cmp expect actual > +' > + > test_done ^ permalink raw reply [flat|nested] 78+ messages in thread
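As a rough illustration of the read_until(b'\0') idea from the caller's side, the sketch below reads one batch of output up to the next NUL byte using POSIX getdelim(). It assumes the behaviour proposed in this message (a NUL byte always terminates each flushed batch), which is not what the v2 patch does; the function name is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for getdelim() and ssize_t */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper: read one batch of output, up to and excluding
 * the next NUL byte, into a heap buffer that the caller must free().
 * Returns the batch length, or -1 at end of input or on error.
 */
static ssize_t read_batch(FILE *in, char **batch)
{
	size_t cap = 0;
	ssize_t len;

	*batch = NULL;
	len = getdelim(batch, &cap, '\0', in);
	if (len < 0) {
		/* getdelim() may have allocated a buffer even on failure. */
		free(*batch);
		*batch = NULL;
		return -1;
	}
	if (len > 0 && (*batch)[len - 1] == '\0')
		len--; /* drop the terminator itself */
	return len;
}
```

A caller would write a batch of raw diff records followed by the flush byte to the command's stdin and then call read_batch() on its stdout. This only avoids a deadlock if the command flushes its output stream after each batch, which is the fflush() concern raised earlier in this message.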
* Re: [PATCH v2 3/3] builtin/diff-pairs: allow explicit diff queue flush 2025-02-17 14:38 ` Phillip Wood @ 2025-02-19 23:09 ` Justin Tobler 0 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-19 23:09 UTC (permalink / raw) To: phillip.wood; +Cc: git, peff, Patrick Steinhardt, Junio C Hamano On 25/02/17 02:38PM, Phillip Wood wrote: > Hi Justin > > On 12/02/2025 04:18, Justin Tobler wrote: > > The diffs queued from git-diff-pairs(1) stdin are not flushed EOF is > > reached. To enable greater flexibility, allow control over when the diff > > queue is flushed by writing a single nul byte on stdin between input > > file pairs. Diff output between flushes is separated by a single line > > terminator. > > I agree with the comments others have made about the documentation. I also > have some comments on the implementation below. > > > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > > index 08f3ee81e5..2436ce3013 100644 > > --- a/builtin/diff-pairs.c > > +++ b/builtin/diff-pairs.c > > @@ -99,6 +99,17 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > > break; > > p = meta.buf; > > + if (!*p) { > > + flush_diff_queue(&revs.diffopt); > > + /* > > + * When the diff queue is explicitly flushed, append an > > + * additional terminator to separate batches of diffs. > > + */ > > + fprintf(revs.diffopt.file, "%c", > > + revs.diffopt.line_termination); > > As the user has requested an explicit flush we should call fflush(stdout) > here to avoid deadlocking a caller that is waiting to read the terminator > before writing the next batch of input. Ideally the tests would check that > the output is flushed but I think that is quite hard to do with our test > framework. Good point, this needs to be explicitly flushed. Will fix. > I think it would be easier for callers to parse the output if we always > printed NUL here. 
Programming languages generally have a function that > allows you to read all the input until a specific byte is seen. If flushing > always used a NUL terminator the caller could use their equivalent of > read_until(b'\0') to hoover up the output (using '-z' to do this would > change the output of --numstat and embed a NUL between any stat data and the > patch). Using a newline as the terminator here means the caller needs to > look for "\n\n". That string occurs in the output between the stat data and > the patch and can also occur in the patch hunks if diff.suppressBlankEmpty > is set. I was originally thinking that, without the -z option, a newline to indicate separation between queued diff batches would be more human-friendly. Always using a NUL byte would be more appropriate for parsing though. I'll switch to using only a NUL byte here in the next version. > Now that we are calling diff_flush() in a loop we need to set .no_free in > our diff options and call diff_free() at the end of the program (see the > comment in diff.h) Indeed, will fix! Thanks -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
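For readers following the terminator discussion above, here is a rough sketch of what the consumer side could look like once every explicitly flushed batch ends in a single NUL byte. It assumes patch-format output that contains no NUL bytes of its own (so no `-z` raw or `--numstat` output mixed in). `next_batch` is a hypothetical helper written for illustration only and is not part of Git.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): locate the next NUL-terminated
 * batch in buf[pos..len). On success, store the batch length (excluding
 * the NUL) in *batch_len and return the offset just past the NUL.
 * Return 0 when no complete batch is available yet; a real offset is
 * always at least 1, so 0 is unambiguous.
 */
static size_t next_batch(const char *buf, size_t len, size_t pos,
			 size_t *batch_len)
{
	const char *nul;

	if (pos >= len)
		return 0;
	nul = memchr(buf + pos, '\0', len - pos);
	if (!nul)
		return 0; /* incomplete batch: the caller should read more */
	*batch_len = (size_t)(nul - (buf + pos));
	return (size_t)(nul - buf) + 1;
}
```

A caller would append whatever it reads from the pipe to `buf` and call `next_batch()` in a loop. A return of 0 means the batch is not complete yet, which is also why the producer has to fflush() after writing the terminator.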
* [PATCH v3 0/3] batch blob diff generation 2025-02-12 4:18 ` [PATCH v2 " Justin Tobler ` (2 preceding siblings ...) 2025-02-12 4:18 ` [PATCH v2 3/3] builtin/diff-pairs: allow explicit diff queue flush Justin Tobler @ 2025-02-25 23:39 ` Justin Tobler 2025-02-25 23:39 ` [PATCH v3 1/3] diff: return diff_filepair from diff queue helpers Justin Tobler ` (4 more replies) 3 siblings, 5 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-25 23:39 UTC (permalink / raw) To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler Through git-diff(1) it is possible to generate a diff directly between two blobs. This is particularly useful when the pre-image and post-image blobs are known and we only care about the diff between them. Unfortunately, if a user has a batch of known blob pairs to compute diffs for, there is currently not a way to do so via a single Git process. To enable support for batch diffs of multiple blob pairs, this series introduces a new diff plumbing command git-diff-pairs(1) based on a previous patch series submitted by Peff[1]. This command uses NUL-delimited raw diffs as its source of input to control exactly which filepairs are diffed. The advantage of using the raw diff format is that it already has diff status type and object context information embedded in each line making it more efficient to generate diffs with as we can avoid having to peel revisions to get some of the same info. For example: git diff-tree -r -z -M $old $new | git diff-pairs -p -z Here the output of git-diff-tree(1) is fed to git-diff-pairs(1) to generate the same output that would be expected from `git diff-tree -p -M`. While by itself not particularly useful, this means it is possible to split git-diff-tree(1) output across multiple git-diff-pairs(1) processes. Such a feature is useful on the server-side where diffs between a large set of changes may not be feasible all at once due to timeout concerns.
This command can be viewed as a backend tool that exposes Git's diff machinery. In its current form, the frontend that generates the raw diff lines used as input is expected to do most of the heavy lifting (i.e. pathspec limiting, tree object expansion). This series is structured as follows: - Patch 1 adds some new helper functions to get access to the queued `diff_filepair` after `diff_queue()` is invoked. - Patch 2 introduces the new git-diff-pairs(1) plumbing command. - Patch 3 allows git-diff-pairs(1) to immediately compute diffs queued on stdin when a NUL-byte is written after a raw input line instead of waiting for stdin to close. Changes since V2: - Pathspecs are not supported and thus rejected when provided as arguments. It should be possible in a future series to add support though. - Tree objects present in `diff-pairs` input are rejected. Support for tree objects could be added in the future, but for now they are rejected to enable future support in a backwards compatible manner. - The -z option is required by git-diff-pairs(1). The NUL-delimited raw diff format is the only accepted form of input. Consequently, NUL-delimited output is the only option in the `--raw` mode. - git-diff-pairs(1) defaults to patch output instead of raw output. This better fits the intended use case of the command. - A NUL-byte is now always used as the delimiter between batches of file pair diffs when queued diffs are explicitly computed by writing a NUL-byte on stdin. - Several other small cleanups and fixes along with documentation changes. Changes since V1: - Changed from git-diff-blob(1) to git-diff-pairs(1) based on a previously submitted series. - Instead of each line containing a pair of blob revisions, the raw diff format is used as input which already has diff status and object context embedded.
-Justin [1]: <20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net> Justin Tobler (3): diff: return diff_filepair from diff queue helpers builtin: introduce diff-pairs command builtin/diff-pairs: allow explicit diff queue flush .gitignore | 1 + Documentation/git-diff-pairs.adoc | 60 +++++++++ Documentation/meson.build | 1 + Makefile | 1 + builtin.h | 1 + builtin/diff-pairs.c | 206 ++++++++++++++++++++++++++++++ command-list.txt | 1 + diff.c | 70 +++++++--- diff.h | 25 ++++ git.c | 1 + meson.build | 1 + t/meson.build | 1 + t/t4070-diff-pairs.sh | 83 ++++++++++++ 13 files changed, 432 insertions(+), 20 deletions(-) create mode 100644 Documentation/git-diff-pairs.adoc create mode 100644 builtin/diff-pairs.c create mode 100755 t/t4070-diff-pairs.sh -- 2.48.1 ^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v3 1/3] diff: return diff_filepair from diff queue helpers 2025-02-25 23:39 ` [PATCH v3 0/3] batch blob diff generation Justin Tobler @ 2025-02-25 23:39 ` Justin Tobler 2025-02-26 18:04 ` Junio C Hamano 2025-02-25 23:39 ` [PATCH v3 2/3] builtin: introduce diff-pairs command Justin Tobler ` (3 subsequent siblings) 4 siblings, 1 reply; 78+ messages in thread From: Justin Tobler @ 2025-02-25 23:39 UTC (permalink / raw) To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler The `diff_addremove()` and `diff_change()` functions set up and queue diffs, but do not return the `diff_filepair` added to the queue. In a subsequent commit, modifications to `diff_filepair` need to occur in certain cases after being queued. Since the existing `diff_addremove()` and `diff_change()` are also used for callbacks in `diff_options` as types `add_remove_fn_t` and `change_fn_t`, modifying the existing function signatures requires further changes. The diff options for pruning use `file_add_remove()` and `file_change()` where file pairs do not even get queued. Thus, separate functions are implemented instead. Split out the queuing operations into `diff_queue_addremove()` and `diff_queue_change()` which also return a handle to the queued `diff_filepair`. 
Signed-off-by: Justin Tobler <jltobler@gmail.com> --- diff.c | 70 +++++++++++++++++++++++++++++++++++++++++----------------- diff.h | 25 +++++++++++++++++++++ 2 files changed, 75 insertions(+), 20 deletions(-) diff --git a/diff.c b/diff.c index 019fb893a7..b5a779f997 100644 --- a/diff.c +++ b/diff.c @@ -7157,16 +7157,19 @@ void compute_diffstat(struct diff_options *options, options->found_changes = !!diffstat->nr; } -void diff_addremove(struct diff_options *options, - int addremove, unsigned mode, - const struct object_id *oid, - int oid_valid, - const char *concatpath, unsigned dirty_submodule) +struct diff_filepair *diff_queue_addremove(struct diff_queue_struct *queue, + struct diff_options *options, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, + const char *concatpath, + unsigned dirty_submodule) { struct diff_filespec *one, *two; + struct diff_filepair *pair; if (S_ISGITLINK(mode) && is_submodule_ignored(concatpath, options)) - return; + return NULL; /* This may look odd, but it is a preparation for * feeding "there are unchanged files which should @@ -7186,7 +7189,7 @@ void diff_addremove(struct diff_options *options, if (options->prefix && strncmp(concatpath, options->prefix, options->prefix_length)) - return; + return NULL; one = alloc_filespec(concatpath); two = alloc_filespec(concatpath); @@ -7198,25 +7201,29 @@ void diff_addremove(struct diff_options *options, two->dirty_submodule = dirty_submodule; } - diff_queue(&diff_queued_diff, one, two); + pair = diff_queue(queue, one, two); if (!options->flags.diff_from_contents) options->flags.has_changes = 1; + + return pair; } -void diff_change(struct diff_options *options, - unsigned old_mode, unsigned new_mode, - const struct object_id *old_oid, - const struct object_id *new_oid, - int old_oid_valid, int new_oid_valid, - const char *concatpath, - unsigned old_dirty_submodule, unsigned new_dirty_submodule) +struct diff_filepair *diff_queue_change(struct diff_queue_struct 
*queue, + struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *concatpath, + unsigned old_dirty_submodule, + unsigned new_dirty_submodule) { struct diff_filespec *one, *two; struct diff_filepair *p; if (S_ISGITLINK(old_mode) && S_ISGITLINK(new_mode) && is_submodule_ignored(concatpath, options)) - return; + return NULL; if (options->flags.reverse_diff) { SWAP(old_mode, new_mode); @@ -7227,7 +7234,7 @@ void diff_change(struct diff_options *options, if (options->prefix && strncmp(concatpath, options->prefix, options->prefix_length)) - return; + return NULL; one = alloc_filespec(concatpath); two = alloc_filespec(concatpath); @@ -7235,19 +7242,42 @@ void diff_change(struct diff_options *options, fill_filespec(two, new_oid, new_oid_valid, new_mode); one->dirty_submodule = old_dirty_submodule; two->dirty_submodule = new_dirty_submodule; - p = diff_queue(&diff_queued_diff, one, two); + p = diff_queue(queue, one, two); if (options->flags.diff_from_contents) - return; + return p; if (options->flags.quick && options->skip_stat_unmatch && !diff_filespec_check_stat_unmatch(options->repo, p)) { diff_free_filespec_data(p->one); diff_free_filespec_data(p->two); - return; + return p; } options->flags.has_changes = 1; + + return p; +} + +void diff_addremove(struct diff_options *options, int addremove, unsigned mode, + const struct object_id *oid, int oid_valid, + const char *concatpath, unsigned dirty_submodule) +{ + diff_queue_addremove(&diff_queued_diff, options, addremove, mode, oid, + oid_valid, concatpath, dirty_submodule); +} + +void diff_change(struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *concatpath, + unsigned old_dirty_submodule, unsigned new_dirty_submodule) +{ + 
diff_queue_change(&diff_queued_diff, options, old_mode, new_mode, + old_oid, new_oid, old_oid_valid, new_oid_valid, + concatpath, old_dirty_submodule, new_dirty_submodule); } struct diff_filepair *diff_unmerge(struct diff_options *options, const char *path) diff --git a/diff.h b/diff.h index 0a566f5531..63afa17e84 100644 --- a/diff.h +++ b/diff.h @@ -508,6 +508,31 @@ void diff_set_default_prefix(struct diff_options *options); int diff_can_quit_early(struct diff_options *); +/* + * Stages changes in the provided diff queue for file additions and deletions. + * If a file pair gets queued, it is returned. + */ +struct diff_filepair *diff_queue_addremove(struct diff_queue_struct *queue, + struct diff_options *, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, const char *fullpath, + unsigned dirty_submodule); + +/* + * Stages changes in the provided diff queue for file modifications. + * If a file pair gets queued, it is returned. + */ +struct diff_filepair *diff_queue_change(struct diff_queue_struct *queue, + struct diff_options *, + unsigned mode1, unsigned mode2, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *fullpath, + unsigned dirty_submodule1, + unsigned dirty_submodule2); + void diff_addremove(struct diff_options *, int addremove, unsigned mode, -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
* Re: [PATCH v3 1/3] diff: return diff_filepair from diff queue helpers 2025-02-25 23:39 ` [PATCH v3 1/3] diff: return diff_filepair from diff queue helpers Justin Tobler @ 2025-02-26 18:04 ` Junio C Hamano 0 siblings, 0 replies; 78+ messages in thread From: Junio C Hamano @ 2025-02-26 18:04 UTC (permalink / raw) To: Justin Tobler; +Cc: git, ps, karthik.188, phillip.wood123 Justin Tobler <jltobler@gmail.com> writes: > separate functions are implemented instead. It would have been more assuring to explicitly say that the original functions that discarded the newly created filepair after adding them to the queue are reimplemented as thin wrappers, which is the right thing to do and which is exactly what happens in this patch. Looking good. ^ permalink raw reply [flat|nested] 78+ messages in thread
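The refactoring pattern discussed for patch 1 (a queueing function that returns a handle, plus a thin wrapper that keeps the old callback-compatible signature and discards the handle) can be illustrated outside of Git with a toy queue. This is only a sketch: `toy_pair`, `toy_queue`, `toy_queue_add` and `toy_add` are invented stand-ins, not Git's `diff_filepair`, `diff_queue_struct`, `diff_queue_addremove()` or `diff_addremove()`.

```c
#include <stddef.h>

/* Toy stand-ins for illustration only; these are not Git's types. */
struct toy_pair {
	int value;
	int status;
};

struct toy_queue {
	struct toy_pair items[16];
	size_t nr;
};

/* Plays the role of a global default queue. */
static struct toy_queue default_queue;

/*
 * Queue a value and return a handle to the new entry so the caller can
 * adjust it afterwards. Returns NULL when nothing was queued (here:
 * negative values are skipped, and a full queue rejects new entries).
 */
static struct toy_pair *toy_queue_add(struct toy_queue *q, int value)
{
	struct toy_pair *p;

	if (value < 0 || q->nr >= 16)
		return NULL;
	p = &q->items[q->nr++];
	p->value = value;
	p->status = 0;
	return p;
}

/*
 * Thin wrapper with the old void signature: same behaviour on the
 * default queue, with the returned handle deliberately discarded.
 */
static void toy_add(int value)
{
	toy_queue_add(&default_queue, value);
}
```

Existing callers keep calling the void wrapper unchanged, while a new caller that needs to touch the entry after queueing (as diff-pairs does with `pair->status`) uses the handle-returning variant and checks for NULL.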
* [PATCH v3 2/3] builtin: introduce diff-pairs command 2025-02-25 23:39 ` [PATCH v3 0/3] batch blob diff generation Justin Tobler 2025-02-25 23:39 ` [PATCH v3 1/3] diff: return diff_filepair from diff queue helpers Justin Tobler @ 2025-02-25 23:39 ` Justin Tobler 2025-02-26 18:24 ` Junio C Hamano ` (2 more replies) 2025-02-25 23:39 ` [PATCH v3 3/3] builtin/diff-pairs: allow explicit diff queue flush Justin Tobler ` (2 subsequent siblings) 4 siblings, 3 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-25 23:39 UTC (permalink / raw) To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler, Jeff King Through git-diff(1), a single diff can be generated from a pair of blob revisions directly. Unfortunately, there is not a mechanism to compute batches of specific file pair diffs in a single process. Such a feature is particularly useful on the server-side where diffing between a large set of changes is not feasible all at once due to timeout concerns. To facilitate this, introduce git-diff-pairs(1) which acts as a backend passing its NUL-terminated raw diff format input from stdin through diff machinery to produce various forms of output such as patch or raw. The raw format was originally designed as an interchange format and represents the contents of the diff_queued_diff list making it possible to break the diff pipeline into separate stages. For example, git-diff-tree(1) can be used as a frontend to compute file pairs to queue and feed its raw output to git-diff-pairs(1) to compute patches. With this, batches of diffs can be progressively generated without having to recompute rename detection or retrieve object context. Something like the following: git diff-tree -r -z -M $old $new | git diff-pairs -p -z should generate the same output as `git diff-tree -p -M`. Furthermore, each line of raw diff formatted input can also be individually fed to a separate git-diff-pairs(1) process and still produce the same output.
Based-on-patch-by: Jeff King <peff@peff.net> Signed-off-by: Justin Tobler <jltobler@gmail.com> --- .gitignore | 1 + Documentation/git-diff-pairs.adoc | 56 +++++++++ Documentation/meson.build | 1 + Makefile | 1 + builtin.h | 1 + builtin/diff-pairs.c | 193 ++++++++++++++++++++++++++++++ command-list.txt | 1 + git.c | 1 + meson.build | 1 + t/meson.build | 1 + t/t4070-diff-pairs.sh | 74 ++++++++++++ 11 files changed, 331 insertions(+) create mode 100644 Documentation/git-diff-pairs.adoc create mode 100644 builtin/diff-pairs.c create mode 100755 t/t4070-diff-pairs.sh diff --git a/.gitignore b/.gitignore index 08a66ca508..04c444404e 100644 --- a/.gitignore +++ b/.gitignore @@ -55,6 +55,7 @@ /git-diff /git-diff-files /git-diff-index +/git-diff-pairs /git-diff-tree /git-difftool /git-difftool--helper diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc new file mode 100644 index 0000000000..e31f2e2fbb --- /dev/null +++ b/Documentation/git-diff-pairs.adoc @@ -0,0 +1,56 @@ +git-diff-pairs(1) +================= + +NAME +---- +git-diff-pairs - Compare the content and mode of provided blob pairs + +SYNOPSIS +-------- +[synopsis] +git diff-pairs -z [<diff-options>] + +DESCRIPTION +----------- +Show changes for file pairs provided on stdin. Input for this command must be +in the NUL-terminated raw output format as generated by commands such as `git +diff-tree -z -r --raw`. By default, the outputted diffs are computed and shown +in the patch format when stdin closes. + +Usage of this command enables the traditional diff pipeline to be broken up +into separate stages where `diff-pairs` acts as the output phase. Other +commands, such as `diff-tree`, may serve as a frontend to compute the raw +diff format used as input. + +Instead of computing diffs via `git diff-tree -p -M` in one step, `diff-tree` +can compute the file pairs and rename information without the blob diffs. 
This +output can be fed to `diff-pairs` to generate the underlying blob diffs as done +in the following example: + +----------------------------- +git diff-tree -z -r -M $a $b | +git diff-pairs -z +----------------------------- + +Computing the tree diff upfront with rename information allows patch output +from `diff-pairs` to be progressively computed over the course of potentially +multiple invocations. + +Pathspecs are not currently supported by `diff-pairs`. Pathspec limiting should +be performed by the upstream command generating the raw diffs used as input. + +Tree objects are not currently supported as input and are rejected. + +Abbreviated object IDs in the `diff-pairs` input are not supported. Outputted +object IDs can be abbreviated using the `--abbrev` option. + +OPTIONS +------- + +include::diff-options.adoc[] + +include::diff-generate-patch.adoc[] + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/meson.build b/Documentation/meson.build index 1129ce4c85..ce990e9fe5 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -42,6 +42,7 @@ manpages = { 'git-diagnose.adoc' : 1, 'git-diff-files.adoc' : 1, 'git-diff-index.adoc' : 1, + 'git-diff-pairs.adoc' : 1, 'git-difftool.adoc' : 1, 'git-diff-tree.adoc' : 1, 'git-diff.adoc' : 1, diff --git a/Makefile b/Makefile index bcf5ed3f85..56df7aed3f 100644 --- a/Makefile +++ b/Makefile @@ -1242,6 +1242,7 @@ BUILTIN_OBJS += builtin/describe.o BUILTIN_OBJS += builtin/diagnose.o BUILTIN_OBJS += builtin/diff-files.o BUILTIN_OBJS += builtin/diff-index.o +BUILTIN_OBJS += builtin/diff-pairs.o BUILTIN_OBJS += builtin/diff-tree.o BUILTIN_OBJS += builtin/diff.o BUILTIN_OBJS += builtin/difftool.o diff --git a/builtin.h b/builtin.h index 89928ccf92..e6aad3a6a1 100644 --- a/builtin.h +++ b/builtin.h @@ -153,6 +153,7 @@ int cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo); 
int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo); +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo); diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c new file mode 100644 index 0000000000..9472b10461 --- /dev/null +++ b/builtin/diff-pairs.c @@ -0,0 +1,193 @@ +#include "builtin.h" +#include "commit.h" +#include "config.h" +#include "diff.h" +#include "diffcore.h" +#include "gettext.h" +#include "hex.h" +#include "object.h" +#include "parse-options.h" +#include "revision.h" +#include "strbuf.h" + +static unsigned parse_mode_or_die(const char *mode, const char **endp) +{ + uint16_t ret; + + *endp = parse_mode(mode, &ret); + if (!*endp) + die(_("unable to parse mode: %s"), mode); + return ret; +} + +static void parse_oid_or_die(const char *p, struct object_id *oid, + const char **endp, const struct git_hash_algo *algop) +{ + if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') + die(_("unable to parse object id: %s"), p); +} + +static void flush_diff_queue(struct diff_options *options) +{ + /* + * If rename detection is not requested, use rename information from the + * raw diff formatted input. Setting found_follow ensures diffcore_std() + * does not mess with rename information already present in queued + * filepairs. 
+ */ + if (!options->detect_rename) + options->found_follow = 1; + diffcore_std(options); + diff_flush(options); +} + +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, + struct repository *repo) +{ + struct strbuf path_dst = STRBUF_INIT; + struct strbuf path = STRBUF_INIT; + struct strbuf meta = STRBUF_INIT; + struct rev_info revs; + int ret; + + const char * const usage[] = { + N_("git diff-pairs -z [<diff-options>]"), + NULL + }; + struct option options[] = { + OPT_END() + }; + struct option *parseopts = add_diff_options(options, &revs.diffopt); + + show_usage_with_options_if_asked(argc, argv, usage, parseopts); + + repo_init_revisions(repo, &revs, prefix); + repo_config(repo, git_diff_basic_config, NULL); + revs.disable_stdin = 1; + revs.abbrev = 0; + revs.diff = 1; + + if (setup_revisions(argc, argv, &revs, NULL) > 1) + usage_with_options(usage, parseopts); + + /* + * With the -z option, both command input and raw output are + * NUL-delimited (this mode does not effect patch output). At present + * only NUL-delimited raw diff formatted input is supported. 
+ */ + if (revs.diffopt.line_termination) { + error(_("working without -z is not supported")); + usage_with_options(usage, parseopts); + } + + if (revs.prune_data.nr) { + error(_("pathspec arguments not supported")); + usage_with_options(usage, parseopts); + } + + if (revs.pending.nr || revs.max_count != -1 || + revs.min_age != (timestamp_t)-1 || + revs.max_age != (timestamp_t)-1) { + error(_("revision arguments not allowed")); + usage_with_options(usage, parseopts); + } + + if (!revs.diffopt.output_format) + revs.diffopt.output_format = DIFF_FORMAT_PATCH; + + while (1) { + struct object_id oid_a, oid_b; + struct diff_filepair *pair; + unsigned mode_a, mode_b; + const char *p; + char status; + + if (strbuf_getline_nul(&meta, stdin) == EOF) + break; + + p = meta.buf; + if (*p != ':') + die(_("invalid raw diff input")); + p++; + + mode_a = parse_mode_or_die(p, &p); + mode_b = parse_mode_or_die(p, &p); + + if (S_ISDIR(mode_a) || S_ISDIR(mode_b)) + die(_("tree objects not supported")); + + parse_oid_or_die(p, &oid_a, &p, repo->hash_algo); + parse_oid_or_die(p, &oid_b, &p, repo->hash_algo); + + status = *p++; + + if (strbuf_getline_nul(&path, stdin) == EOF) + die(_("got EOF while reading path")); + + switch (status) { + case DIFF_STATUS_ADDED: + pair = diff_queue_addremove(&diff_queued_diff, + &revs.diffopt, '+', mode_b, + &oid_b, 1, path.buf, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_DELETED: + pair = diff_queue_addremove(&diff_queued_diff, + &revs.diffopt, '-', mode_a, + &oid_a, 1, path.buf, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_TYPE_CHANGED: + case DIFF_STATUS_MODIFIED: + pair = diff_queue_change(&diff_queued_diff, &revs.diffopt, + mode_a, mode_b, &oid_a, &oid_b, + 1, 1, path.buf, 0, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_RENAMED: + case DIFF_STATUS_COPIED: + { + struct diff_filespec *a, *b; + unsigned int score; + + if (strbuf_getline_nul(&path_dst, stdin) == EOF) + 
die(_("got EOF while reading destination path")); + + a = alloc_filespec(path.buf); + b = alloc_filespec(path_dst.buf); + fill_filespec(a, &oid_a, 1, mode_a); + fill_filespec(b, &oid_b, 1, mode_b); + + pair = diff_queue(&diff_queued_diff, a, b); + + if (strtoul_ui(p, 10, &score)) + die(_("unable to parse rename/copy score: %s"), p); + + pair->score = score * MAX_SCORE / 100; + pair->status = status; + pair->renamed_pair = 1; + } + break; + + default: + die(_("unknown diff status: %c"), status); + } + } + + flush_diff_queue(&revs.diffopt); + ret = diff_result_code(&revs); + + strbuf_release(&path_dst); + strbuf_release(&path); + strbuf_release(&meta); + release_revisions(&revs); + FREE_AND_NULL(parseopts); + + return ret; +} diff --git a/command-list.txt b/command-list.txt index c537114b46..b7ade3ab9f 100644 --- a/command-list.txt +++ b/command-list.txt @@ -96,6 +96,7 @@ git-diagnose ancillaryinterrogators git-diff mainporcelain info git-diff-files plumbinginterrogators git-diff-index plumbinginterrogators +git-diff-pairs plumbinginterrogators git-diff-tree plumbinginterrogators git-difftool ancillaryinterrogators complete git-fast-export ancillarymanipulators diff --git a/git.c b/git.c index 450d6aaa86..77c4359522 100644 --- a/git.c +++ b/git.c @@ -541,6 +541,7 @@ static struct cmd_struct commands[] = { { "diff", cmd_diff, NO_PARSEOPT }, { "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT }, { "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT }, + { "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT }, { "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT }, { "difftool", cmd_difftool, RUN_SETUP_GENTLY }, { "fast-export", cmd_fast_export, RUN_SETUP }, diff --git a/meson.build b/meson.build index bf95576f83..9e8b365d2a 100644 --- a/meson.build +++ b/meson.build @@ -540,6 +540,7 @@ builtin_sources = [ 'builtin/diagnose.c', 'builtin/diff-files.c', 'builtin/diff-index.c', + 'builtin/diff-pairs.c', 'builtin/diff-tree.c', 'builtin/diff.c', 
'builtin/difftool.c', diff --git a/t/meson.build b/t/meson.build index 780939d49f..09c7bc2fad 100644 --- a/t/meson.build +++ b/t/meson.build @@ -500,6 +500,7 @@ integration_tests = [ 't4067-diff-partial-clone.sh', 't4068-diff-symmetric-merge-base.sh', 't4069-remerge-diff.sh', + 't4070-diff-pairs.sh', 't4100-apply-stat.sh', 't4101-apply-nonl.sh', 't4102-apply-rename.sh', diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh new file mode 100755 index 0000000000..2f511cc9c9 --- /dev/null +++ b/t/t4070-diff-pairs.sh @@ -0,0 +1,74 @@ +#!/bin/sh + +test_description='basic diff-pairs tests' +. ./test-lib.sh + +# This creates a diff with added, modified, deleted, renamed, copied, and +# typechange entries. That includes one in a subdirectory for non-recursive +# tests, and both exact and inexact similarity scores. +test_expect_success 'setup' ' + echo to-be-gone >deleted && + echo original >modified && + echo now-a-file >symlink && + test_seq 200 >two-hundred && + test_seq 201 500 >five-hundred && + git add . && + test_tick && + git commit -m base && + git tag base && + + echo now-here >added && + echo new >modified && + rm deleted && + mkdir subdir && + echo content >subdir/file && + mv two-hundred renamed && + test_seq 201 500 | sed s/300/modified/ >copied && + rm symlink && + git add -A . 
&& + test_ln_s_add dest symlink && + test_tick && + git commit -m new && + git tag new +' + +test_expect_success 'diff-pairs recreates --raw' ' + git diff-tree -r -M -C -C -z base new >expect && + git diff-tree -r -M -C -C -z base new | + git diff-pairs --raw -z >actual && + test_cmp expect actual +' + +test_expect_success 'diff-pairs can create -p output' ' + git diff-tree -p -M -C -C base new >expect && + git diff-tree -r -M -C -C -z base new | + git diff-pairs -p -z >actual && + test_cmp expect actual +' + +test_expect_success 'diff-pairs does not support normal raw diff input' ' + git diff-tree -r base new | + test_must_fail git diff-pairs >out 2>err && + + test_must_be_empty out && + grep "error: working without -z is not supported" err +' + +test_expect_success 'diff-pairs does not support tree objects as input' ' + git diff-tree -z base new | + test_must_fail git diff-pairs -z >out 2>err && + + echo "fatal: tree objects not supported" >expect && + test_must_be_empty out && + test_cmp expect err +' + +test_expect_success 'diff-pairs does not support pathspec arguments' ' + git diff-tree -r -z base new | + test_must_fail git diff-pairs -z -- new >out 2>err && + + test_must_be_empty out && + grep "error: pathspec arguments not supported" err +' + +test_done -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
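To make the input format handled by the loop in cmd_diff_pairs() easier to picture, here is a standalone sketch that parses the metadata part of one `-z` raw record, i.e. the `:<mode> <mode> <oid> <oid> <status>[<score>]` field that precedes the NUL-terminated path(s). It is not the patch's code and does not use Git's `parse_mode()` or `parse_oid_hex_algop()`; `raw_header`, `parse_octal_field`, `parse_hex_field` and `parse_raw_header` are hypothetical names, and accepting only 40 or 64 hex digits is an assumption that mirrors full-length SHA-1 and SHA-256 object IDs.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed raw-diff header (not Git's). */
struct raw_header {
	unsigned mode_a, mode_b;
	char oid_a[65], oid_b[65];
	char status;
	unsigned score; /* 0..100, only meaningful for 'R' and 'C' */
};

/* Parse an octal field followed by one space; return the next position. */
static const char *parse_octal_field(const char *p, unsigned *out)
{
	char *end;
	unsigned long v;

	if (*p < '0' || *p > '7')
		return NULL;
	v = strtoul(p, &end, 8);
	if (*end != ' ')
		return NULL;
	*out = (unsigned)v;
	return end + 1;
}

/* Parse a full-length lowercase hex object ID followed by one space. */
static const char *parse_hex_field(const char *p, char *out, size_t outsz)
{
	size_t n = strspn(p, "0123456789abcdef");

	/* Abbreviated IDs are rejected, as the documentation states. */
	if ((n != 40 && n != 64) || p[n] != ' ' || n >= outsz)
		return NULL;
	memcpy(out, p, n);
	out[n] = '\0';
	return p + n + 1;
}

/* Return 0 on success, -1 if the header is malformed. */
static int parse_raw_header(const char *line, struct raw_header *h)
{
	const char *p = line;

	if (*p != ':')
		return -1;
	p++;
	if (!(p = parse_octal_field(p, &h->mode_a)) ||
	    !(p = parse_octal_field(p, &h->mode_b)) ||
	    !(p = parse_hex_field(p, h->oid_a, sizeof(h->oid_a))) ||
	    !(p = parse_hex_field(p, h->oid_b, sizeof(h->oid_b))))
		return -1;
	if (!*p)
		return -1;
	h->status = *p++;
	h->score = 0;
	if (h->status == 'R' || h->status == 'C') {
		char *end;
		unsigned long v;

		if (*p < '0' || *p > '9')
			return -1;
		v = strtoul(p, &end, 10);
		if (*end || v > 100)
			return -1;
		h->score = (unsigned)v;
	} else if (*p) {
		return -1; /* trailing garbage after a plain status */
	}
	return 0;
}
```

In the real command the path, and for renames and copies the destination path, arrive as separate NUL-terminated fields after this header, which is why the patch calls strbuf_getline_nul() again before queueing the pair.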
* Re: [PATCH v3 2/3] builtin: introduce diff-pairs command 2025-02-25 23:39 ` [PATCH v3 2/3] builtin: introduce diff-pairs command Justin Tobler @ 2025-02-26 18:24 ` Junio C Hamano 2025-02-27 22:15 ` Justin Tobler 2025-02-27 9:35 ` Karthik Nayak 2025-02-27 12:56 ` Patrick Steinhardt 2 siblings, 1 reply; 78+ messages in thread From: Junio C Hamano @ 2025-02-26 18:24 UTC (permalink / raw) To: Justin Tobler; +Cc: git, ps, karthik.188, phillip.wood123, Jeff King Justin Tobler <jltobler@gmail.com> writes: > +static void flush_diff_queue(struct diff_options *options) > +{ > + /* > + * If rename detection is not requested, use rename information from the > + * raw diff formatted input. Setting found_follow ensures diffcore_std() > + * does not mess with rename information already present in queued > + * filepairs. > + */ > + if (!options->detect_rename) > + options->found_follow = 1; An ugly design decision that may be suboptimal from maintainability point of view. The parts of diffcore_std() that --follow wants to bypass may happen to be the same as the parts that this new caller wants to bypass, but who guarantees that they will stay that way in the future? 
> + diffcore_std(options); > + diff_flush(options); > +} > + > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > + struct repository *repo) > +{ > + struct strbuf path_dst = STRBUF_INIT; > + struct strbuf path = STRBUF_INIT; > + struct strbuf meta = STRBUF_INIT; > + struct rev_info revs; > + int ret; > + > + const char * const usage[] = { > + N_("git diff-pairs -z [<diff-options>]"), > + NULL > + }; > + struct option options[] = { > + OPT_END() > + }; > + struct option *parseopts = add_diff_options(options, &revs.diffopt); > + > + show_usage_with_options_if_asked(argc, argv, usage, parseopts); > + > + repo_init_revisions(repo, &revs, prefix); > + repo_config(repo, git_diff_basic_config, NULL); > + revs.disable_stdin = 1; > + revs.abbrev = 0; > + revs.diff = 1; > + > + if (setup_revisions(argc, argv, &revs, NULL) > 1) > + usage_with_options(usage, parseopts); > + > + /* > + * With the -z option, both command input and raw output are > + * NUL-delimited (this mode does not effect patch output). At present Probably "effect" -> "affect". > + * only NUL-delimited raw diff formatted input is supported. 
> + */ > + if (revs.diffopt.line_termination) { > + error(_("working without -z is not supported")); > + usage_with_options(usage, parseopts); > + } > + > + if (revs.prune_data.nr) { > + error(_("pathspec arguments not supported")); > + usage_with_options(usage, parseopts); > + } > + > + if (revs.pending.nr || revs.max_count != -1 || > + revs.min_age != (timestamp_t)-1 || > + revs.max_age != (timestamp_t)-1) { > + error(_("revision arguments not allowed")); > + usage_with_options(usage, parseopts); > + } > + > + if (!revs.diffopt.output_format) > + revs.diffopt.output_format = DIFF_FORMAT_PATCH; > + > + while (1) { > + struct object_id oid_a, oid_b; > + struct diff_filepair *pair; > + unsigned mode_a, mode_b; > + const char *p; > + char status; > + > + if (strbuf_getline_nul(&meta, stdin) == EOF) > + break; There should be a variant of this function that takes delimiter parameter. By declaring an int variable that is initialized to '\0' (because you only deal with "-z" input) and passing that delimiter variable to strbuf_getwholeline() would future-proof this code path. How builtin/update-ref.c:update_refs_stdin() works may be inspiring. > + switch (status) { > + case DIFF_STATUS_ADDED: > + pair = diff_queue_addremove(&diff_queued_diff, > + &revs.diffopt, '+', mode_b, > + &oid_b, 1, path.buf, 0); > + if (pair) > + pair->status = status; > + break; > + ... > + default: > + die(_("unknown diff status: %c"), status); > + } > + } Amusing; looking good. > + flush_diff_queue(&revs.diffopt); > + ret = diff_result_code(&revs); > + > + strbuf_release(&path_dst); > + strbuf_release(&path); > + strbuf_release(&meta); > + release_revisions(&revs); > + FREE_AND_NULL(parseopts); > + > + return ret; > +} Nice. It is surprisingly compact and had everything I expected it to have ;-). 
> +test_expect_success 'diff-pairs recreates --raw' ' > + git diff-tree -r -M -C -C -z base new >expect && > + git diff-tree -r -M -C -C -z base new | > + git diff-pairs --raw -z >actual && > + test_cmp expect actual > +' Amusing ;-) But a very obvious and important thing to test. I would have fed <expect to diff-pairs for this test, though. Other than that, nicely done. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 2/3] builtin: introduce diff-pairs command 2025-02-26 18:24 ` Junio C Hamano @ 2025-02-27 22:15 ` Justin Tobler 0 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-27 22:15 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, ps, karthik.188, phillip.wood123, Jeff King On 25/02/26 10:24AM, Junio C Hamano wrote: > Justin Tobler <jltobler@gmail.com> writes: > > > +static void flush_diff_queue(struct diff_options *options) > > +{ > > + /* > > + * If rename detection is not requested, use rename information from the > > + * raw diff formatted input. Setting found_follow ensures diffcore_std() > > + * does not mess with rename information already present in queued > > + * filepairs. > > + */ > > + if (!options->detect_rename) > > + options->found_follow = 1; > > An ugly design decision that may be suboptimal from maintainability > point of view. > > The parts of diffcore_std() that --follow wants to bypass may happen > to be the same as the parts that this new caller wants to bypass, > but who guarantees that they will stay that way in the future? Good point. When invoking diffcore_std(), we really just need to be able to skip diff_resolve_rename_copy() as that is what is updating the diff filepair statuses. In the next version, instead of relying on `found_follow`, I think I'll introduce a new diff_options field, `skip_resolving_statuses` for this specific purpose. > > + while (1) { > > + struct object_id oid_a, oid_b; > > + struct diff_filepair *pair; > > + unsigned mode_a, mode_b; > > + const char *p; > > + char status; > > + > > + if (strbuf_getline_nul(&meta, stdin) == EOF) > > + break; > > There should be a variant of this function that takes delimiter > parameter. By declaring an int variable that is initialized to '\0' > (because you only deal with "-z" input) and passing that delimiter > variable to strbuf_getwholeline() would future-proof this code path. 
> > How builtin/update-ref.c:update_refs_stdin() works may be inspiring. Makes sense, I'll swap to using strbuf_getwholeline() with a defined line terminator variable in the next version. This way it can help make supporting the "normal" raw diff format as input easier in the future. > > +test_expect_success 'diff-pairs recreates --raw' ' > > + git diff-tree -r -M -C -C -z base new >expect && > > + git diff-tree -r -M -C -C -z base new | > > + git diff-pairs --raw -z >actual && > > + test_cmp expect actual > > +' > > Amusing ;-) But a very obvious and important thing to test. > I would have fed <expect to diff-pairs for this test, though. Will adjust in the next version. > Other than that, nicely done. Thanks for the review! -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
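The delimiter-variable idea agreed on above can be illustrated outside of Git's strbuf API. The following is only a minimal standalone sketch in plain C: `read_record()` and `count_records()` are hypothetical names that do not exist in Git, and the actual change is expected to use strbuf_getwholeline() as stated in the reply. The point it shows is that the loop body never mentions NUL, so a newline-delimited mode would only need a different value in one variable.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not Git code: read one record terminated by
 * `delim` into a growable caller-owned buffer and strip the delimiter.
 * Returns the record length (0 for an empty record), or -1 on EOF with
 * no pending data or on allocation failure.
 */
static long read_record(char **buf, size_t *cap, FILE *in, int delim)
{
	size_t len = 0;
	int c;

	while ((c = fgetc(in)) != EOF) {
		/* Keep room for this byte plus a terminating NUL. */
		if (len + 1 >= *cap) {
			size_t ncap = *cap ? *cap * 2 : 64;
			char *nbuf = realloc(*buf, ncap);

			if (!nbuf)
				return -1;
			*buf = nbuf;
			*cap = ncap;
		}
		if (c == delim) {
			(*buf)[len] = '\0';
			return (long)len;
		}
		(*buf)[len++] = (char)c;
	}
	if (!len)
		return -1;
	(*buf)[len] = '\0'; /* final record without a trailing delimiter */
	return (long)len;
}

/* Count records in a stream; only the delimiter value differs per mode. */
static int count_records(FILE *in, int delim)
{
	char *buf = NULL;
	size_t cap = 0;
	int n = 0;

	while (read_record(&buf, &cap, in, delim) >= 0)
		n++;
	free(buf);
	return n;
}
```

A caller would declare something like `int delim = '\0';` once and pass it to every read, which is the shape Junio points to in update_refs_stdin().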
* Re: [PATCH v3 2/3] builtin: introduce diff-pairs command 2025-02-25 23:39 ` [PATCH v3 2/3] builtin: introduce diff-pairs command Justin Tobler 2025-02-26 18:24 ` Junio C Hamano @ 2025-02-27 9:35 ` Karthik Nayak 2025-02-27 22:36 ` Justin Tobler 2025-02-27 12:56 ` Patrick Steinhardt 2 siblings, 1 reply; 78+ messages in thread From: Karthik Nayak @ 2025-02-27 9:35 UTC (permalink / raw) To: Justin Tobler, git; +Cc: ps, phillip.wood123, Jeff King [-- Attachment #1: Type: text/plain, Size: 8071 bytes --] Justin Tobler <jltobler@gmail.com> writes: > Through git-diff(1), a single diff can be generated from a pair of blob > revisions directly. Unfortunately, there is not a mechanism to compute > batches of specific file pair diffs in a single process. Such a feature > is particularly useful on the server-side where diffing between a large > set of changes is not feasible all at once due to timeout concerns. > > To facilitate this, introduce git-diff-pairs(1) which acts as a backend > passing its NUL-terminated raw diff format input from stdin through diff > machinery to produce various forms of output such as patch or raw. > > The raw format was originally designed as an interchange format and > represents the contents of the diff_queue_diff list making it possible > to break the diff pipeline into separate stages. For example, > git-diff-tree(1) can be used as a frontend to compute file pairs to > queue and feed its raw output to git-diff-pairs(1) to compute patches. > With this, batches of diffs can be progessively generated without having s/progessively/progressively > to recompute rename detection or retrieve object context. Something like > the following: > > git diff-tree -r -z -M $old $new | > git diff-pairs -p -z > > should generate the same output as `git diff-tree -p -M`. Furthermore, > each line of raw diff formatted input can also be individually fed to a > separate git-diff-pairs(1) process and still produce the same output. 
> > Based-on-patch-by: Jeff King <peff@peff.net> > Signed-off-by: Justin Tobler <jltobler@gmail.com> [snip] > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > new file mode 100644 > index 0000000000..9472b10461 > --- /dev/null > +++ b/builtin/diff-pairs.c > @@ -0,0 +1,193 @@ > +#include "builtin.h" > +#include "commit.h" > +#include "config.h" > +#include "diff.h" > +#include "diffcore.h" > +#include "gettext.h" > +#include "hex.h" > +#include "object.h" > +#include "parse-options.h" > +#include "revision.h" > +#include "strbuf.h" > + Nit: I could also compile without some of these headers, do we still need them all? diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c index 86e59a7e3a..1aea2ee726 100644 --- a/builtin/diff-pairs.c +++ b/builtin/diff-pairs.c @@ -1,14 +1,9 @@ #include "builtin.h" -#include "commit.h" #include "config.h" -#include "diff.h" #include "diffcore.h" -#include "gettext.h" #include "hex.h" -#include "object.h" #include "parse-options.h" #include "revision.h" -#include "strbuf.h" static unsigned parse_mode_or_die(const char *mode, const char **endp) { > +static unsigned parse_mode_or_die(const char *mode, const char **endp) > +{ > + uint16_t ret; > + > + *endp = parse_mode(mode, &ret); > + if (!*endp) > + die(_("unable to parse mode: %s"), mode); > + return ret; > +} > + > +static void parse_oid_or_die(const char *p, struct object_id *oid, > + const char **endp, const struct git_hash_algo *algop) > Nit: without double checking, I couldn't tell what 'p' was, can we rename the variables here to be consistent with `parse_oid_hex_algop()`? > +{ > + if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') > + die(_("unable to parse object id: %s"), p); > +} > + > +static void flush_diff_queue(struct diff_options *options) > +{ > + /* > + * If rename detection is not requested, use rename information from the > + * raw diff formatted input. 
Setting found_follow ensures diffcore_std() > + * does not mess with rename information already present in queued > + * filepairs. > + */ > + if (!options->detect_rename) > + options->found_follow = 1; > + diffcore_std(options); > + diff_flush(options); > +} > + > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > + struct repository *repo) > +{ > + struct strbuf path_dst = STRBUF_INIT; > + struct strbuf path = STRBUF_INIT; > + struct strbuf meta = STRBUF_INIT; > + struct rev_info revs; > + int ret; > + > + const char * const usage[] = { > + N_("git diff-pairs -z [<diff-options>]"), > + NULL > + }; > + struct option options[] = { > + OPT_END() > + }; > + struct option *parseopts = add_diff_options(options, &revs.diffopt); > + > + show_usage_with_options_if_asked(argc, argv, usage, parseopts); > + > + repo_init_revisions(repo, &revs, prefix); > + repo_config(repo, git_diff_basic_config, NULL); > + revs.disable_stdin = 1; > + revs.abbrev = 0; > + revs.diff = 1; > + > + if (setup_revisions(argc, argv, &revs, NULL) > 1) > + usage_with_options(usage, parseopts); > + > + /* > + * With the -z option, both command input and raw output are > + * NUL-delimited (this mode does not effect patch output). At present > + * only NUL-delimited raw diff formatted input is supported. 
> + */ > + if (revs.diffopt.line_termination) { > + error(_("working without -z is not supported")); > + usage_with_options(usage, parseopts); > + } > + > + if (revs.prune_data.nr) { > + error(_("pathspec arguments not supported")); > + usage_with_options(usage, parseopts); > + } > + > + if (revs.pending.nr || revs.max_count != -1 || > + revs.min_age != (timestamp_t)-1 || > + revs.max_age != (timestamp_t)-1) { > + error(_("revision arguments not allowed")); > + usage_with_options(usage, parseopts); > + } > + > + if (!revs.diffopt.output_format) > + revs.diffopt.output_format = DIFF_FORMAT_PATCH; > + > + while (1) { > + struct object_id oid_a, oid_b; > + struct diff_filepair *pair; > + unsigned mode_a, mode_b; > + const char *p; > + char status; > + > + if (strbuf_getline_nul(&meta, stdin) == EOF) > + break; > + > + p = meta.buf; > + if (*p != ':') > + die(_("invalid raw diff input")); > + p++; > + > + mode_a = parse_mode_or_die(p, &p); > + mode_b = parse_mode_or_die(p, &p); > + > + if (S_ISDIR(mode_a) || S_ISDIR(mode_b)) > + die(_("tree objects not supported")); > + > + parse_oid_or_die(p, &oid_a, &p, repo->hash_algo); > + parse_oid_or_die(p, &oid_b, &p, repo->hash_algo); > + > + status = *p++; > + > + if (strbuf_getline_nul(&path, stdin) == EOF) > + die(_("got EOF while reading path")); > + > + switch (status) { > + case DIFF_STATUS_ADDED: > + pair = diff_queue_addremove(&diff_queued_diff, > + &revs.diffopt, '+', mode_b, > + &oid_b, 1, path.buf, 0); > + if (pair) > + pair->status = status; > + break; > + > + case DIFF_STATUS_DELETED: > + pair = diff_queue_addremove(&diff_queued_diff, > + &revs.diffopt, '-', mode_a, > + &oid_a, 1, path.buf, 0); > + if (pair) > + pair->status = status; > + break; > + > + case DIFF_STATUS_TYPE_CHANGED: > + case DIFF_STATUS_MODIFIED: > + pair = diff_queue_change(&diff_queued_diff, &revs.diffopt, > + mode_a, mode_b, &oid_a, &oid_b, > + 1, 1, path.buf, 0, 0); > + if (pair) > + pair->status = status; > + break; > + > + case 
DIFF_STATUS_RENAMED: > + case DIFF_STATUS_COPIED: > + { style: The general rule followed is to open the braces in the same line as the case statement. So `case DIFF_STATUS_COPIED: {` > + struct diff_filespec *a, *b; > + unsigned int score; > + > + if (strbuf_getline_nul(&path_dst, stdin) == EOF) > + die(_("got EOF while reading destination path")); > + > + a = alloc_filespec(path.buf); > + b = alloc_filespec(path_dst.buf); > + fill_filespec(a, &oid_a, 1, mode_a); > + fill_filespec(b, &oid_b, 1, mode_b); > + > + pair = diff_queue(&diff_queued_diff, a, b); > + > + if (strtoul_ui(p, 10, &score)) > + die(_("unable to parse rename/copy score: %s"), p); > + > + pair->score = score * MAX_SCORE / 100; > + pair->status = status; > + pair->renamed_pair = 1; > + } > + break; > + > + default: > + die(_("unknown diff status: %c"), status); > + } > + } > + > + flush_diff_queue(&revs.diffopt); > + ret = diff_result_code(&revs); > + > + strbuf_release(&path_dst); > + strbuf_release(&path); > + strbuf_release(&meta); > + release_revisions(&revs); > + FREE_AND_NULL(parseopts); > + > + return ret; > +} [snip] [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 690 bytes --] ^ permalink raw reply [flat|nested] 78+ messages in thread
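For readers following the parsing loop quoted above, the metadata part of one NUL-delimited raw record (for example ":100644 100755 <oid> <oid> R86") can be taken apart as in the sketch below. This is a standalone approximation, not the patch's code: `struct raw_record`, `take_mode()`, `take_hex()` and `parse_raw_record()` are hypothetical names, object IDs are kept as hex strings instead of `struct object_id`, and unlike parse_oid_hex_algop() the sketch accepts hex fields of any length up to 64 characters. The score is left as a percentage; the patch additionally scales it by MAX_SCORE / 100.

```c
#include <stdlib.h>
#include <string.h>

#define OID_HEX_MAX 64 /* assumption: long enough for a SHA-256 hex ID */

struct raw_record {
	unsigned mode_a, mode_b;
	char oid_a[OID_HEX_MAX + 1];
	char oid_b[OID_HEX_MAX + 1];
	char status;
	unsigned score; /* percentage; 0 when no score follows the status */
};

/* Parse an octal mode followed by a space; returns the next position. */
static const char *take_mode(const char *p, unsigned *mode)
{
	char *end;
	unsigned long v;

	if (*p < '0' || *p > '7')
		return NULL;
	v = strtoul(p, &end, 8);
	if (*end != ' ')
		return NULL;
	*mode = (unsigned)v;
	return end + 1;
}

/* Copy a lowercase hex field followed by a space; returns the next position. */
static const char *take_hex(const char *p, char *out)
{
	size_t n = strspn(p, "0123456789abcdef");

	if (!n || n > OID_HEX_MAX || p[n] != ' ')
		return NULL;
	memcpy(out, p, n);
	out[n] = '\0';
	return p + n + 1;
}

/* Returns 0 on success, -1 if the record is malformed. */
static int parse_raw_record(const char *line, struct raw_record *r)
{
	const char *p = line;

	if (*p++ != ':')
		return -1;
	if (!(p = take_mode(p, &r->mode_a)) ||
	    !(p = take_mode(p, &r->mode_b)) ||
	    !(p = take_hex(p, r->oid_a)) ||
	    !(p = take_hex(p, r->oid_b)))
		return -1;
	if (!*p)
		return -1;
	r->status = *p++;
	r->score = 0;
	if (*p) {
		/* Optional similarity score, as seen on R and C records. */
		char *end;
		unsigned long v;

		if (*p < '0' || *p > '9')
			return -1;
		v = strtoul(p, &end, 10);
		if (*end || v > 100)
			return -1;
		r->score = (unsigned)v;
	}
	return 0;
}
```

The path, and for renames and copies the destination path, arrive as separate NUL-terminated fields after this record, which is why the patch issues additional reads for them.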
* Re: [PATCH v3 2/3] builtin: introduce diff-pairs command 2025-02-27 9:35 ` Karthik Nayak @ 2025-02-27 22:36 ` Justin Tobler 0 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-27 22:36 UTC (permalink / raw) To: Karthik Nayak; +Cc: git, ps, phillip.wood123, Jeff King On 25/02/27 01:35AM, Karthik Nayak wrote: > > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > > new file mode 100644 > > index 0000000000..9472b10461 > > --- /dev/null > > +++ b/builtin/diff-pairs.c > > @@ -0,0 +1,193 @@ > > +#include "builtin.h" > > +#include "commit.h" > > +#include "config.h" > > +#include "diff.h" > > +#include "diffcore.h" > > +#include "gettext.h" > > +#include "hex.h" > > +#include "object.h" > > +#include "parse-options.h" > > +#include "revision.h" > > +#include "strbuf.h" > > + > > Nit: I could also compile without some of these headers, do we still > need them all? > > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > index 86e59a7e3a..1aea2ee726 100644 > --- a/builtin/diff-pairs.c > +++ b/builtin/diff-pairs.c > @@ -1,14 +1,9 @@ > #include "builtin.h" > -#include "commit.h" Looks like this one is unneeded. Will remove > #include "config.h" > -#include "diff.h" > #include "diffcore.h" > -#include "gettext.h" > #include "hex.h" > -#include "object.h" > #include "parse-options.h" > #include "revision.h" > -#include "strbuf.h" The others are directly referenced. I think it would be preferable to explicitly state them instead of relying on them being included transitively. 
> > static unsigned parse_mode_or_die(const char *mode, const char **endp) > { > > > +static unsigned parse_mode_or_die(const char *mode, const char **endp) > > +{ > > + uint16_t ret; > > + > > + *endp = parse_mode(mode, &ret); > > + if (!*endp) > > + die(_("unable to parse mode: %s"), mode); > > + return ret; > > +} > > + > > +static void parse_oid_or_die(const char *p, struct object_id *oid, > > + const char **endp, const struct git_hash_algo *algop) > > > > Nit: without double checking, I couldn't tell what 'p' was, can we > rename the variables here to be consistent with `parse_oid_hex_algop()`? Will update > > + case DIFF_STATUS_RENAMED: > > + case DIFF_STATUS_COPIED: > > + { > > style: The general rule followed is to open the braces in the same line > as the case statement. So `case DIFF_STATUS_COPIED: {` Will fix in the next version. Thanks -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 2/3] builtin: introduce diff-pairs command 2025-02-25 23:39 ` [PATCH v3 2/3] builtin: introduce diff-pairs command Justin Tobler 2025-02-26 18:24 ` Junio C Hamano 2025-02-27 9:35 ` Karthik Nayak @ 2025-02-27 12:56 ` Patrick Steinhardt 2025-02-27 23:00 ` Justin Tobler 2 siblings, 1 reply; 78+ messages in thread From: Patrick Steinhardt @ 2025-02-27 12:56 UTC (permalink / raw) To: Justin Tobler; +Cc: git, karthik.188, phillip.wood123, Jeff King On Tue, Feb 25, 2025 at 05:39:24PM -0600, Justin Tobler wrote: > Through git-diff(1), a single diff can be generated from a pair of blob > revisions directly. Unfortunately, there is not a mechanism to compute > batches of specific file pair diffs in a single process. Such a feature > is particularly useful on the server-side where diffing between a large > set of changes is not feasible all at once due to timeout concerns. > > To facilitate this, introduce git-diff-pairs(1) which acts as a backend > passing its NUL-terminated raw diff format input from stdin through diff > machinery to produce various forms of output such as patch or raw. > > The raw format was originally designed as an interchange format and > represents the contents of the diff_queue_diff list making it possible > to break the diff pipeline into separate stages. For example, > git-diff-tree(1) can be used as a frontend to compute file pairs to > queue and feed its raw output to git-diff-pairs(1) to compute patches. > With this, batches of diffs can be progessively generated without having > to recompute rename detection or retrieve object context. 
Something like s/rename detection/renames/ > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > new file mode 100644 > index 0000000000..9472b10461 > --- /dev/null > +++ b/builtin/diff-pairs.c > @@ -0,0 +1,193 @@ > +#include "builtin.h" > +#include "commit.h" > +#include "config.h" > +#include "diff.h" > +#include "diffcore.h" > +#include "gettext.h" > +#include "hex.h" > +#include "object.h" > +#include "parse-options.h" > +#include "revision.h" > +#include "strbuf.h" > + > +static unsigned parse_mode_or_die(const char *mode, const char **endp) > +{ > + uint16_t ret; > + > + *endp = parse_mode(mode, &ret); > + if (!*endp) > + die(_("unable to parse mode: %s"), mode); > + return ret; > +} > + > +static void parse_oid_or_die(const char *p, struct object_id *oid, > + const char **endp, const struct git_hash_algo *algop) > +{ > + if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') > + die(_("unable to parse object id: %s"), p); > +} > + > +static void flush_diff_queue(struct diff_options *options) > +{ > + /* > + * If rename detection is not requested, use rename information from the > + * raw diff formatted input. Setting found_follow ensures diffcore_std() > + * does not mess with rename information already present in queued > + * filepairs. > + */ > + if (!options->detect_rename) > + options->found_follow = 1; It's a bit weird that we set this over here. Shouldn't we have set it up in the main function already? 
> + diffcore_std(options); > + diff_flush(options); > +} > + > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > + struct repository *repo) > +{ > + struct strbuf path_dst = STRBUF_INIT; > + struct strbuf path = STRBUF_INIT; > + struct strbuf meta = STRBUF_INIT; > + struct rev_info revs; > + int ret; > + > + const char * const usage[] = { > + N_("git diff-pairs -z [<diff-options>]"), > + NULL > + }; > + struct option options[] = { > + OPT_END() > + }; > + struct option *parseopts = add_diff_options(options, &revs.diffopt); > + > + show_usage_with_options_if_asked(argc, argv, usage, parseopts); Don't we also have to call `parse_options()` even though we don't have our own options yet? Or is this all handled by `setup_revisions()`? > + repo_init_revisions(repo, &revs, prefix); > + repo_config(repo, git_diff_basic_config, NULL); > + revs.disable_stdin = 1; > + revs.abbrev = 0; > + revs.diff = 1; > + > + if (setup_revisions(argc, argv, &revs, NULL) > 1) > + usage_with_options(usage, parseopts); I think it's discouraged nowadays to use `usage_with_options()` as it generates a ton of noise while hiding the actual error message. It is instead recommended to directly call `usage()` with an error message. In this case here we would say e.g. `usage(_("unrecognized argument: %s"), argv[0])`, in the cases below we'd use the error messages you already have. > + > + /* > + * With the -z option, both command input and raw output are > + * NUL-delimited (this mode does not effect patch output). At present > + * only NUL-delimited raw diff formatted input is supported. 
> + */ > + if (revs.diffopt.line_termination) { > + error(_("working without -z is not supported")); > + usage_with_options(usage, parseopts); > + } > + > + if (revs.prune_data.nr) { > + error(_("pathspec arguments not supported")); > + usage_with_options(usage, parseopts); > + } > + > + if (revs.pending.nr || revs.max_count != -1 || > + revs.min_age != (timestamp_t)-1 || > + revs.max_age != (timestamp_t)-1) { > + error(_("revision arguments not allowed")); > + usage_with_options(usage, parseopts); > + } Okay. We restrict a bunch of usages, which makes your job simpler right now, but by dying it keeps the door open to iterate on those in the future. > + if (!revs.diffopt.output_format) > + revs.diffopt.output_format = DIFF_FORMAT_PATCH; Instead of setting this conditionally, can we already set it up as a default before calling `setup_revisions()`? > + while (1) { > + struct object_id oid_a, oid_b; > + struct diff_filepair *pair; > + unsigned mode_a, mode_b; > + const char *p; > + char status; > + > + if (strbuf_getline_nul(&meta, stdin) == EOF) > + break; > + > + p = meta.buf; > + if (*p != ':') > + die(_("invalid raw diff input")); > + p++; > + > + mode_a = parse_mode_or_die(p, &p); > + mode_b = parse_mode_or_die(p, &p); > + > + if (S_ISDIR(mode_a) || S_ISDIR(mode_b)) > + die(_("tree objects not supported")); I assume submodules aren't supported either, are they? If so, do we also have to check for `S_ISGITLINK()`? It would be nice to have a test for them. Patrick ^ permalink raw reply [flat|nested] 78+ messages in thread
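To make the tree versus submodule question above concrete, here is a small sketch of how the mode field of a raw diff record maps to an entry type. The enum and `classify_mode()` are hypothetical names for illustration only; the octal type bits are the ones that appear in tree entries and raw diff output (040000 for trees, 160000 for gitlinks, 120000 for symlinks, 100644 and 100755 for regular files). It makes no claim about which types diff-pairs should accept.

```c
/* Hypothetical classification of a raw diff mode field; not Git code. */
enum entry_kind {
	ENTRY_NONE,    /* 000000: this side does not exist (add or delete) */
	ENTRY_REGULAR, /* 100644 or 100755 */
	ENTRY_SYMLINK, /* 120000 */
	ENTRY_TREE,    /* 040000 */
	ENTRY_GITLINK, /* 160000: submodule commit */
	ENTRY_UNKNOWN
};

#define MODE_TYPE_MASK 0170000u

static enum entry_kind classify_mode(unsigned mode)
{
	if (!mode)
		return ENTRY_NONE;
	switch (mode & MODE_TYPE_MASK) {
	case 0100000:
		return ENTRY_REGULAR;
	case 0120000:
		return ENTRY_SYMLINK;
	case 0040000:
		return ENTRY_TREE;
	case 0160000:
		return ENTRY_GITLINK;
	default:
		return ENTRY_UNKNOWN;
	}
}
```

With a check of this shape, rejecting trees while letting gitlinks through is a matter of which enumerators the caller treats as errors.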
* Re: [PATCH v3 2/3] builtin: introduce diff-pairs command 2025-02-27 12:56 ` Patrick Steinhardt @ 2025-02-27 23:00 ` Justin Tobler 0 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-27 23:00 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, karthik.188, phillip.wood123, Jeff King On 25/02/27 01:56PM, Patrick Steinhardt wrote: > > +static void flush_diff_queue(struct diff_options *options) > > +{ > > + /* > > + * If rename detection is not requested, use rename information from the > > + * raw diff formatted input. Setting found_follow ensures diffcore_std() > > + * does not mess with rename information already present in queued > > + * filepairs. > > + */ > > + if (!options->detect_rename) > > + options->found_follow = 1; > > It's a bit weird that we set this over here. Shouldn't we have set it up > in the main function already? Every time diffcore_std() is invoked, found_follow gets reset. This was included here to ensure the correct value is always set. In the next version I am going to move away from using found_follow in favor of a new diff_options field to avoid some of this awkwardness altogether. > > + diffcore_std(options); > > + diff_flush(options); > > +} > > + > > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > > + struct repository *repo) > > +{ > > + struct strbuf path_dst = STRBUF_INIT; > > + struct strbuf path = STRBUF_INIT; > > + struct strbuf meta = STRBUF_INIT; > > + struct rev_info revs; > > + int ret; > > + > > + const char * const usage[] = { > > + N_("git diff-pairs -z [<diff-options>]"), > > + NULL > > + }; > > + struct option options[] = { > > + OPT_END() > > + }; > > + struct option *parseopts = add_diff_options(options, &revs.diffopt); > > + > > + show_usage_with_options_if_asked(argc, argv, usage, parseopts); > > Don't we also have to call `parse_options()` even though we don't have > our own options yet? Or is this all handled by `setup_revisions()`?
In the current implementation, the diff options that get appended are only really used so that the usage message prints with the diff option info. It still relies on setup_revisions() to actually parse the options. Since there are not any real options that need parsing, parse_options() was not invoked. This is fairly confusing though. I plan to instead parse the diff options upfront with parse_options(). The diff options parsing through setup_revisions() becomes effectively a no-op. I think this makes more sense to read and still lets us print the common diff options in the usage message. > > + repo_init_revisions(repo, &revs, prefix); > > + repo_config(repo, git_diff_basic_config, NULL); > > + revs.disable_stdin = 1; > > + revs.abbrev = 0; > > + revs.diff = 1; > > + > > + if (setup_revisions(argc, argv, &revs, NULL) > 1) > > + usage_with_options(usage, parseopts); > > I think it's discouraged nowadays to use `usage_with_options()` as it > generates a ton of noise while hiding the actual error message. It is > instead recommended to directly call `usage()` with an error message. > > In this case here we would say e.g. `usage(_("unrecognized argument: > %s"), argv[0])`, in the cases below we'd use the error messages you > already have. Good to know. I'll avoid printing the usage options message in all these failure scenarios in favor of what you suggested. > > + if (!revs.diffopt.output_format) > > + revs.diffopt.output_format = DIFF_FORMAT_PATCH; > > Instead of setting this conditionally, can we already set it up as a > default before calling `setup_revisions()`? The diff output format is set via OPT_BITOP() and thus can have multiple values at the same time. For example: $ git diff-tree --raw --patch HEAD will render both patch and raw output. If we unconditionally set DIFF_FORMAT_PATCH, it will always be included in the output which is not what we want. We only want to set DIFF_FORMAT_PATCH if there is still no value after all options parsing has occurred.
> > + while (1) { > > + struct object_id oid_a, oid_b; > > + struct diff_filepair *pair; > > + unsigned mode_a, mode_b; > > + const char *p; > > + char status; > > + > > + if (strbuf_getline_nul(&meta, stdin) == EOF) > > + break; > > + > > + p = meta.buf; > > + if (*p != ':') > > + die(_("invalid raw diff input")); > > + p++; > > + > > + mode_a = parse_mode_or_die(p, &p); > > + mode_b = parse_mode_or_die(p, &p); > > + > > + if (S_ISDIR(mode_a) || S_ISDIR(mode_b)) > > + die(_("tree objects not supported")); > > I assume submodules aren't supported either, are they? If so, do we also > have to check for `S_ISGITLINK()`? It would be nice to have a test for > them. Submodules should actually be supported as I believe all the info present in the raw formatted input should be enough to properly display patch output. I'll add a submodule to the existing test setup to validate. Thanks -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
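The reasoning above about why DIFF_FORMAT_PATCH cannot simply be preset before option parsing can be shown with a toy bitmask. This is a sketch under stated assumptions: `FMT_RAW`, `FMT_PATCH`, `parse_formats()` and `formats_with_late_default()` are hypothetical stand-ins, not Git's DIFF_FORMAT_* constants or its option parser. It only models the property described in the reply, namely that each format option ORs a bit into the current value.

```c
#include <string.h>

/* Hypothetical stand-ins for output format bits. */
enum {
	FMT_RAW = 1 << 0,
	FMT_PATCH = 1 << 1
};

/* Each recognized option ORs its bit into whatever value is already set. */
static unsigned parse_formats(int argc, const char **argv, unsigned initial)
{
	unsigned fmt = initial;
	int i;

	for (i = 0; i < argc; i++) {
		if (!strcmp(argv[i], "--raw"))
			fmt |= FMT_RAW;
		else if (!strcmp(argv[i], "--patch") || !strcmp(argv[i], "-p"))
			fmt |= FMT_PATCH;
	}
	return fmt;
}

/* Apply the default only when parsing selected no format at all. */
static unsigned formats_with_late_default(int argc, const char **argv)
{
	unsigned fmt = parse_formats(argc, argv, 0);

	if (!fmt)
		fmt = FMT_PATCH;
	return fmt;
}
```

In this model, seeding the parser with FMT_PATCH and then passing "--raw" yields both bits, which corresponds to the unwanted combined output described above, while the late default yields raw output only.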
* [PATCH v3 3/3] builtin/diff-pairs: allow explicit diff queue flush 2025-02-25 23:39 ` [PATCH v3 0/3] batch blob diff generation Justin Tobler 2025-02-25 23:39 ` [PATCH v3 1/3] diff: return diff_filepair from diff queue helpers Justin Tobler 2025-02-25 23:39 ` [PATCH v3 2/3] builtin: introduce diff-pairs command Justin Tobler @ 2025-02-25 23:39 ` Justin Tobler 2025-02-26 14:58 ` [PATCH v3 0/3] batch blob diff generation phillip.wood123 2025-02-28 0:26 ` [PATCH v4 0/4] " Justin Tobler 4 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-25 23:39 UTC (permalink / raw) To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler The diffs queued from git-diff-pairs(1) are flushed when stdin is closed. To enable greater flexibility, allow control over when the diff queue is flushed by writing a single NUL byte on stdin between input file pairs. Diff output between flushes is separated by a single NUL byte. Signed-off-by: Justin Tobler <jltobler@gmail.com> --- Documentation/git-diff-pairs.adoc | 4 ++++ builtin/diff-pairs.c | 13 +++++++++++++ t/t4070-diff-pairs.sh | 9 +++++++++ 3 files changed, 26 insertions(+) diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc index e31f2e2fbb..f99fcd1ead 100644 --- a/Documentation/git-diff-pairs.adoc +++ b/Documentation/git-diff-pairs.adoc @@ -17,6 +17,10 @@ in the NUL-terminated raw output format as generated by commands such as `git diff-tree -z -r --raw`. By default, the outputted diffs are computed and shown in the patch format when stdin closes. +A single NUL byte may be written to stdin between raw input lines to compute +file pair diffs up to that point instead of waiting for stdin to close. A NUL +byte is also written to the output to delimit between these batches of diffs. + Usage of this command enables the traditional diff pipeline to be broken up into separate stages where `diff-pairs` acts as the output phase. 
Other commands, such as `diff-tree`, may serve as a frontend to compute the raw diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c index 9472b10461..7130569332 100644 --- a/builtin/diff-pairs.c +++ b/builtin/diff-pairs.c @@ -63,6 +63,7 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, repo_init_revisions(repo, &revs, prefix); repo_config(repo, git_diff_basic_config, NULL); + revs.diffopt.no_free = 1; revs.disable_stdin = 1; revs.abbrev = 0; revs.diff = 1; @@ -106,6 +107,17 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, break; p = meta.buf; + if (!*p) { + flush_diff_queue(&revs.diffopt); + /* + * When the diff queue is explicitly flushed, append a + * NUL byte to separate batches of diffs. + */ + fputc('\0', revs.diffopt.file); + fflush(revs.diffopt.file); + continue; + } + if (*p != ':') die(_("invalid raw diff input")); p++; @@ -180,6 +192,7 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, } } + revs.diffopt.no_free = 0; flush_diff_queue(&revs.diffopt); ret = diff_result_code(&revs); diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh index 2f511cc9c9..3352bfe0b9 100755 --- a/t/t4070-diff-pairs.sh +++ b/t/t4070-diff-pairs.sh @@ -71,4 +71,13 @@ test_expect_success 'diff-pairs does not support pathspec arguments' ' grep "error: pathspec arguments not supported" err ' +test_expect_success 'diff-pairs explicit queue flush' ' + git diff-tree -r -M -C -C -z base new >expect && + printf "\0" >>expect && + git diff-tree -r -M -C -C -z base new >>expect && + + git diff-pairs --raw -z <expect >actual && + test_cmp expect actual +' + test_done -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
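The flush protocol introduced by this patch can be pictured from the producer's side: an empty record (a lone NUL byte between raw records) closes the current batch, and anything still queued when stdin closes forms a final batch. The sketch below walks an in-memory input buffer and counts file pairs per batch. `skip_field()` and `split_batches()` are hypothetical names, and the walker is a simplification that only inspects the status letter to decide whether one or two paths follow; it does not validate modes or object IDs.

```c
#include <stddef.h>
#include <string.h>

/* Return the position just past the next NUL, or NULL if there is none. */
static const char *skip_field(const char *p, const char *end)
{
	const char *nul = memchr(p, '\0', (size_t)(end - p));

	return nul ? nul + 1 : NULL;
}

/*
 * Hypothetical walker, not Git code: store the number of file pairs in
 * each batch into sizes[] and return the number of batches, or -1 on
 * malformed input or when more than `max` batches are present.
 */
static int split_batches(const char *buf, size_t len, int *sizes, int max)
{
	const char *p = buf, *end = buf + len;
	int nbatch = 0, cur = 0;

	while (p < end) {
		const char *next = skip_field(p, end);
		const char *sp;
		int npaths;

		if (!next)
			return -1; /* unterminated record */
		if (!*p) {
			/* Empty record: explicit flush of the current batch. */
			if (nbatch == max)
				return -1;
			sizes[nbatch++] = cur;
			cur = 0;
			p = next;
			continue;
		}
		sp = strrchr(p, ' ');
		if (*p != ':' || !sp)
			return -1;
		/* Renames and copies carry a source and a destination path. */
		npaths = (sp[1] == 'R' || sp[1] == 'C') ? 2 : 1;
		p = next;
		while (npaths--) {
			if (p >= end || !(p = skip_field(p, end)))
				return -1;
		}
		cur++;
	}
	if (cur) {
		/* Pairs still queued at end of input form a final batch. */
		if (nbatch == max)
			return -1;
		sizes[nbatch++] = cur;
	}
	return nbatch;
}
```

On the output side, the patch writes a single NUL byte after each explicitly flushed batch, so a consumer can split the output at the same boundaries.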
* Re: [PATCH v3 0/3] batch blob diff generation 2025-02-25 23:39 ` [PATCH v3 0/3] batch blob diff generation Justin Tobler ` (2 preceding siblings ...) 2025-02-25 23:39 ` [PATCH v3 3/3] builtin/diff-pairs: allow explicit diff queue flush Justin Tobler @ 2025-02-26 14:58 ` phillip.wood123 2025-02-27 22:04 ` Justin Tobler 2025-02-28 0:26 ` [PATCH v4 0/4] " Justin Tobler 4 siblings, 1 reply; 78+ messages in thread From: phillip.wood123 @ 2025-02-26 14:58 UTC (permalink / raw) To: Justin Tobler, git; +Cc: ps, karthik.188 Hi Justin On 25/02/2025 23:39, Justin Tobler wrote: > > Changes since V2: > > - Pathspecs are not supported and thus rejected when provided as > arguments. It should be possible in a future series to add support > though. > > - Tree objects present in `diff-pairs` input are rejected. Support > for tree objects could be added in the future, but for now they > are rejected to enable future support in a backwards compatible > manner. > > - The -z option is required by git-diff-pairs(1). The NUL-delimited > raw diff format is the only accepted form of input. Consequently, > NUL-delimited output is the only option in the `--raw` mode. > > - git-diff-pairs(1) defaults to patch output instead of raw output. > This better fits the intended usecase of the command. > > - A NUL-byte is now always used as the delimiter between batches of > file pair diffs when queued diffs are explicitly computed by > writing a NUL-byte on stdin. > > - Several other small cleanups and fixes along with documentation > changes. This addresses all my comments on the previous version, thank you. I do wonder if tying the input line termination to the output line termination is a good idea for a program that aims to transform one diff format into another. Having said that this series is aimed at machine consumption of the output so it probably isn't a big problem.
I also think we might want to massage the output in the tests so that we're not running test_cmp on files containing NUL bytes. Using git diff-tree -z ... | tr '\0' Q >actual would get rid of the NULs but does not improve the readability of the raw diffs that much as everything is still on a single line. Using '\n' instead of 'Q' would give us multi-line output but we would lose confidence that the original output was actually NUL terminated. Best Wishes Phillip > Changes since V1: > > - Changed from git-diff-blob(1) to git-diff-pairs(1) based on a > previously submitted series. > > - Instead of each line containing a pair of blob revisions, the raw > diff format is used as input which already has diff status and > object context embedded. > > -Justin > > [1]: <20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net> > > Justin Tobler (3): > diff: return diff_filepair from diff queue helpers > builtin: introduce diff-pairs command > builtin/diff-pairs: allow explicit diff queue flush > > .gitignore | 1 + > Documentation/git-diff-pairs.adoc | 60 +++++++++ > Documentation/meson.build | 1 + > Makefile | 1 + > builtin.h | 1 + > builtin/diff-pairs.c | 206 ++++++++++++++++++++++++++++++ > command-list.txt | 1 + > diff.c | 70 +++++++--- > diff.h | 25 ++++ > git.c | 1 + > meson.build | 1 + > t/meson.build | 1 + > t/t4070-diff-pairs.sh | 83 ++++++++++++ > 13 files changed, 432 insertions(+), 20 deletions(-) > create mode 100644 Documentation/git-diff-pairs.adoc > create mode 100644 builtin/diff-pairs.c > create mode 100755 t/t4070-diff-pairs.sh > ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 0/3] batch blob diff generation
  2025-02-26 14:58 ` [PATCH v3 0/3] batch blob diff generation phillip.wood123
@ 2025-02-27 22:04   ` Justin Tobler
  0 siblings, 0 replies; 78+ messages in thread
From: Justin Tobler @ 2025-02-27 22:04 UTC (permalink / raw)
  To: phillip.wood; +Cc: git, ps, karthik.188

On 25/02/26 02:58PM, phillip.wood123@gmail.com wrote:
> I also think we might want to massage the output in the tests so that we're
> not running test_cmp on files containing NUL bytes. Using
>
>     git diff-tree -z ... | tr '\0' Q >actual
>
> would get rid of the NULs but does not improve the readability of the raw
> diffs that much as everything is still on a single line. Using '\n' instead
> of 'Q' would give us multi-line output but we would lose confidence that the
> original output was actually NUL terminated.

Is the underlying motivation here to provide more feedback if a test
fails? I somewhat have a preference for the test to be validating the
output as it is actually expected. As you mentioned, getting rid of the
NUL bytes wouldn't help with readability much and we probably wouldn't
want to replace with `\n`, so maybe a simple "Binary files expect and
actual differ" would be the most straightforward.

If this is the preferred way to handle it, I can adapt in a followup
version though. :)

Thanks for the review!

-Justin

^ permalink raw reply	[flat|nested] 78+ messages in thread
* [PATCH v4 0/4] batch blob diff generation
  2025-02-25 23:39 ` [PATCH v3 0/3] batch blob diff generation Justin Tobler
                     ` (3 preceding siblings ...)
  2025-02-26 14:58   ` [PATCH v3 0/3] batch blob diff generation phillip.wood123
@ 2025-02-28 0:26   ` Justin Tobler
  2025-02-28 0:26     ` [PATCH v4 1/4] diff: return diff_filepair from diff queue helpers Justin Tobler
                       ` (4 more replies)
  4 siblings, 5 replies; 78+ messages in thread
From: Justin Tobler @ 2025-02-28 0:26 UTC (permalink / raw)
  To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler

Through git-diff(1) it is possible to generate a diff directly between
two blobs. This is particularly useful when the pre-image and post-image
blobs are known and we only care about the diff between them.
Unfortunately, if a user has a batch of known blob pairs to compute
diffs for, there is currently not a way to do so via a single Git
process.

To enable support for batch diffs of multiple blob pairs, this series
introduces a new diff plumbing command git-diff-pairs(1) based on a
previous patch series submitted by Peff[1]. This command uses
NUL-delimited raw diffs as its source of input to control exactly which
filepairs are diffed. The advantage of using the raw diff format is that
it already has diff status type and object context information embedded
in each line making it more efficient to generate diffs with as we can
avoid having to peel revisions to get some of the same info.

For example:

    git diff-tree -r -z -M $old $new |
    git diff-pairs -p -z

Here the output of git-diff-tree(1) is fed to git-diff-pairs(1) to
generate the same output that would be expected from `git diff-tree -p
-M`. While by itself not particularly useful, this means it is possible
to split git-diff-tree(1) output across multiple git-diff-pairs(1)
processes. Such a feature is useful on the server-side where diffs
between a large set of changes may not be feasible all at once due to
timeout concerns.
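
For illustration only, and not part of the series: a minimal,
self-contained C sketch of how the metadata portion of one `-z` raw diff
record (`:<mode> <mode> <oid> <oid> <status>[<score>]`) might be split
into fields. The `raw_meta` struct and the `take_hex()` and
`parse_raw_meta()` helpers are hypothetical names invented here; the
command itself relies on Git's own `parse_mode()` and
`parse_oid_hex_algop()` helpers instead, and the path fields that follow
the metadata as separate NUL-terminated strings are not handled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for this sketch; not a Git structure. */
struct raw_meta {
	unsigned mode_a, mode_b;
	char oid_a[65], oid_b[65];	/* hex; room for SHA-1 or SHA-256 */
	char status;			/* 'A', 'M', 'D', 'R', 'C', ... */
	unsigned score;			/* similarity score for R/C, else 0 */
};

/*
 * Copy one lowercase hex object ID that must be followed by a space.
 * Returns a pointer just past that space, or NULL on malformed input.
 */
static const char *take_hex(const char *p, char *out, size_t outsz)
{
	size_t n = strspn(p, "0123456789abcdef");

	if (n == 0 || n >= outsz || p[n] != ' ')
		return NULL;
	memcpy(out, p, n);
	out[n] = '\0';
	return p + n + 1;
}

/*
 * Parse ":<mode> <mode> <oid> <oid> <status>[<score>]".
 * Returns 0 on success and -1 on malformed input. This is deliberately
 * lenient: strtoul() tolerates leading whitespace and a sign, which a
 * strict parser would reject.
 */
static int parse_raw_meta(const char *line, struct raw_meta *m)
{
	const char *p = line;
	char *end;

	if (*p++ != ':')
		return -1;

	m->mode_a = (unsigned)strtoul(p, &end, 8);	/* modes are octal */
	if (end == p || *end != ' ')
		return -1;
	p = end + 1;

	m->mode_b = (unsigned)strtoul(p, &end, 8);
	if (end == p || *end != ' ')
		return -1;
	p = end + 1;

	if (!(p = take_hex(p, m->oid_a, sizeof(m->oid_a))))
		return -1;
	if (!(p = take_hex(p, m->oid_b, sizeof(m->oid_b))))
		return -1;

	if (!*p)
		return -1;
	m->status = *p++;

	/* Renames and copies carry a decimal similarity score, e.g. R086. */
	m->score = 0;
	if (*p) {
		m->score = (unsigned)strtoul(p, &end, 10);
		if (end == p || *end != '\0')
			return -1;
	}
	return 0;
}
```

A caller would read one NUL-terminated field, pass it to
`parse_raw_meta()`, and then read one more field for the path (two for
renames and copies).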

This command can be viewed as a backend tool that exposes Git's diff
machinery. In its current form, the frontend that generates the raw diff
lines used as input is expected to do most of the heavy lifting (i.e.
pathspec limiting, tree object expansion).

This series is structured as follows:

- Patch 1 adds some new helper functions to get access to the queued
  `diff_filepair` after `diff_queue()` is invoked.

- Patch 2 adds a new diff_options field that can be used to disable
  diff filepair status resolution. This prevents rename/copy statuses
  set from stdin from being altered when `diffcore_std()` is invoked.

- Patch 3 introduces the new git-diff-pairs(1) plumbing command.

- Patch 4 allows git-diff-pairs(1) to immediately compute diffs queued
  on stdin when a NUL-byte is written after a raw input line instead
  of waiting for stdin to close.

Changes since V3:

- Instead of relying on found_follow to prevent `diffcore_std()` from
  mutating diff filepair statuses, a new `diff_options` field,
  `skip_resolving_statuses` is introduced to achieve the same result
  in a more specific manner.

- Parsing of diff options is now handled directly instead of going
  through `setup_revisions()`. This is done so the diff options can
  be appended to the usage options and printed in the usage message.

- Swapped to using `strbuf_getwholeline()` during stdin parsing to
  make the line terminator more configurable in the future.

- Stopped printing the usage message on errors to avoid masking the
  underlying error message.

- Added test setup to exercise submodule change diffs.

- Other small minor cleanups.

Changes since V2:

- Pathspecs are not supported and thus rejected when provided as
  arguments. It should be possible in a future series to add support
  though.

- Tree objects present in `diff-pairs` input are rejected. Support
  for tree objects could be added in the future, but for now they
  are rejected to enable future support in a backwards compatible
  manner.

- The -z option is required by git-diff-pairs(1). The NUL-delimited
  raw diff format is the only accepted form of input. Consequently,
  NUL-delimited output is the only option in the `--raw` mode.

- git-diff-pairs(1) defaults to patch output instead of raw output.
  This better fits the intended use case of the command.

- A NUL-byte is now always used as the delimiter between batches of
  file pair diffs when queued diffs are explicitly computed by
  writing a NUL-byte on stdin.

- Several other small cleanups and fixes along with documentation
  changes.

Changes since V1:

- Changed from git-diff-blob(1) to git-diff-pairs(1) based on a
  previously submitted series.

- Instead of each line containing a pair of blob revisions, the raw
  diff format is used as input which already has diff status and
  object context embedded.

-Justin

[1]: <20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net>

Justin Tobler (4):
  diff: return diff_filepair from diff queue helpers
  diff: add option to skip resolving diff statuses
  builtin: introduce diff-pairs command
  builtin/diff-pairs: allow explicit diff queue flush

 .gitignore                        |   1 +
 Documentation/git-diff-pairs.adoc |  60 +++++++++
 Documentation/meson.build         |   1 +
 Makefile                          |   1 +
 builtin.h                         |   1 +
 builtin/diff-pairs.c              | 209 ++++++++++++++++++++++++++++++
 command-list.txt                  |   1 +
 diff.c                            |  72 +++++++---
 diff.h                            |  33 +++++
 git.c                             |   1 +
 meson.build                       |   1 +
 t/meson.build                     |   1 +
 t/t4070-diff-pairs.sh             |  90 +++++++++++++
 13 files changed, 451 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/git-diff-pairs.adoc
 create mode 100644 builtin/diff-pairs.c
 create mode 100755 t/t4070-diff-pairs.sh

Range-diff against v3:
1:  d19b164914 ! 1:  b2e5486442 diff: return diff_filepair from diff queue helpers
    @@ Commit message
         Split out the queuing operations into `diff_queue_addremove()` and
         `diff_queue_change()` which also return a handle to the queued
    -    `diff_filepair`.
    +    `diff_filepair`.
Both `diff_addremove()` and `diff_change()` are + reimplemented as thin wrappers around the new functions. Signed-off-by: Justin Tobler <jltobler@gmail.com> -: ---------- > 2: 31d80d99ae diff: add option to skip resolving diff statuses 2: 991aaea3a9 ! 3: 3722c02112 builtin: introduce diff-pairs command @@ Commit message machinery to produce various forms of output such as patch or raw. The raw format was originally designed as an interchange format and - represents the contents of the diff_queue_diff list making it possible + represents the contents of the diff_queued_diff list making it possible to break the diff pipeline into separate stages. For example, git-diff-tree(1) can be used as a frontend to compute file pairs to queue and feed its raw output to git-diff-pairs(1) to compute patches. - With this, batches of diffs can be progessively generated without having - to recompute rename detection or retrieve object context. Something like + With this, batches of diffs can be progressively generated without + having to recompute renames or retrieve object context. 
Something like the following: git diff-tree -r -z -M $old $new | @@ builtin.h: int cmd_diagnose(int argc, const char **argv, const char *prefix, str ## builtin/diff-pairs.c (new) ## @@ +#include "builtin.h" -+#include "commit.h" +#include "config.h" +#include "diff.h" +#include "diffcore.h" +#include "gettext.h" ++#include "hash.h" +#include "hex.h" +#include "object.h" +#include "parse-options.h" +#include "revision.h" +#include "strbuf.h" + -+static unsigned parse_mode_or_die(const char *mode, const char **endp) ++static unsigned parse_mode_or_die(const char *mode, const char **end) +{ + uint16_t ret; + -+ *endp = parse_mode(mode, &ret); -+ if (!*endp) ++ *end = parse_mode(mode, &ret); ++ if (!*end) + die(_("unable to parse mode: %s"), mode); + return ret; +} + -+static void parse_oid_or_die(const char *p, struct object_id *oid, -+ const char **endp, const struct git_hash_algo *algop) ++static void parse_oid_or_die(const char *hex, struct object_id *oid, ++ const char **end, const struct git_hash_algo *algop) +{ -+ if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') -+ die(_("unable to parse object id: %s"), p); -+} -+ -+static void flush_diff_queue(struct diff_options *options) -+{ -+ /* -+ * If rename detection is not requested, use rename information from the -+ * raw diff formatted input. Setting found_follow ensures diffcore_std() -+ * does not mess with rename information already present in queued -+ * filepairs. 
-+ */ -+ if (!options->detect_rename) -+ options->found_follow = 1; -+ diffcore_std(options); -+ diff_flush(options); ++ if (parse_oid_hex_algop(hex, oid, end, algop) || *(*end)++ != ' ') ++ die(_("unable to parse object id: %s"), hex); +} + +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, @@ builtin/diff-pairs.c (new) + struct strbuf path_dst = STRBUF_INIT; + struct strbuf path = STRBUF_INIT; + struct strbuf meta = STRBUF_INIT; ++ struct option *parseopts; + struct rev_info revs; ++ int line_term = '\0'; + int ret; + -+ const char * const usage[] = { ++ const char * const usagestr[] = { + N_("git diff-pairs -z [<diff-options>]"), + NULL + }; + struct option options[] = { + OPT_END() + }; -+ struct option *parseopts = add_diff_options(options, &revs.diffopt); -+ -+ show_usage_with_options_if_asked(argc, argv, usage, parseopts); + + repo_init_revisions(repo, &revs, prefix); ++ ++ /* ++ * Diff options are usually parsed implicitly as part of ++ * setup_revisions(). Explicitly handle parsing to ensure options are ++ * printed in the usage message. ++ */ ++ parseopts = add_diff_options(options, &revs.diffopt); ++ show_usage_with_options_if_asked(argc, argv, usagestr, parseopts); ++ + repo_config(repo, git_diff_basic_config, NULL); + revs.disable_stdin = 1; + revs.abbrev = 0; + revs.diff = 1; + ++ argc = parse_options(argc, argv, prefix, parseopts, usagestr, ++ PARSE_OPT_KEEP_UNKNOWN_OPT | ++ PARSE_OPT_KEEP_DASHDASH | ++ PARSE_OPT_KEEP_ARGV0); ++ + if (setup_revisions(argc, argv, &revs, NULL) > 1) -+ usage_with_options(usage, parseopts); ++ usagef(_("unrecognized argument: %s"), argv[0]); + + /* + * With the -z option, both command input and raw output are -+ * NUL-delimited (this mode does not effect patch output). At present ++ * NUL-delimited (this mode does not affect patch output). At present + * only NUL-delimited raw diff formatted input is supported. 
+ */ -+ if (revs.diffopt.line_termination) { -+ error(_("working without -z is not supported")); -+ usage_with_options(usage, parseopts); -+ } ++ if (revs.diffopt.line_termination) ++ usage(_("working without -z is not supported")); + -+ if (revs.prune_data.nr) { -+ error(_("pathspec arguments not supported")); -+ usage_with_options(usage, parseopts); -+ } ++ if (revs.prune_data.nr) ++ usage(_("pathspec arguments not supported")); + + if (revs.pending.nr || revs.max_count != -1 || + revs.min_age != (timestamp_t)-1 || -+ revs.max_age != (timestamp_t)-1) { -+ error(_("revision arguments not allowed")); -+ usage_with_options(usage, parseopts); -+ } ++ revs.max_age != (timestamp_t)-1) ++ usage(_("revision arguments not allowed")); + + if (!revs.diffopt.output_format) + revs.diffopt.output_format = DIFF_FORMAT_PATCH; + ++ /* ++ * If rename detection is not requested, use rename information from the ++ * raw diff formatted input. Setting skip_resolving_statuses ensures ++ * diffcore_std() does not mess with rename information already present ++ * in queued filepairs. 
++ */ ++ if (!revs.diffopt.detect_rename) ++ revs.diffopt.skip_resolving_statuses = 1; ++ + while (1) { + struct object_id oid_a, oid_b; + struct diff_filepair *pair; @@ builtin/diff-pairs.c (new) + const char *p; + char status; + -+ if (strbuf_getline_nul(&meta, stdin) == EOF) ++ if (strbuf_getwholeline(&meta, stdin, line_term) == EOF) + break; + + p = meta.buf; @@ builtin/diff-pairs.c (new) + + status = *p++; + -+ if (strbuf_getline_nul(&path, stdin) == EOF) ++ if (strbuf_getwholeline(&path, stdin, line_term) == EOF) + die(_("got EOF while reading path")); + + switch (status) { @@ builtin/diff-pairs.c (new) + break; + + case DIFF_STATUS_RENAMED: -+ case DIFF_STATUS_COPIED: -+ { ++ case DIFF_STATUS_COPIED: { + struct diff_filespec *a, *b; + unsigned int score; + -+ if (strbuf_getline_nul(&path_dst, stdin) == EOF) ++ if (strbuf_getwholeline(&path_dst, stdin, line_term) == EOF) + die(_("got EOF while reading destination path")); + + a = alloc_filespec(path.buf); @@ builtin/diff-pairs.c (new) + } + } + -+ flush_diff_queue(&revs.diffopt); ++ diffcore_std(&revs.diffopt); ++ diff_flush(&revs.diffopt); + ret = diff_result_code(&revs); + + strbuf_release(&path_dst); @@ t/t4070-diff-pairs.sh (new) +. ./test-lib.sh + +# This creates a diff with added, modified, deleted, renamed, copied, and -+# typechange entries. That includes one in a subdirectory for non-recursive -+# tests, and both exact and inexact similarity scores. ++# typechange entries. This includes a submodule to test submodule diff support. 
+test_expect_success 'setup' ' ++ test_config_global protocol.file.allow always && ++ test_create_repo sub && ++ test_commit -C sub initial && ++ ++ test_create_repo main && ++ cd main && + echo to-be-gone >deleted && + echo original >modified && + echo now-a-file >symlink && @@ t/t4070-diff-pairs.sh (new) + git commit -m base && + git tag base && + ++ git submodule add ../sub && + echo now-here >added && + echo new >modified && + rm deleted && @@ t/t4070-diff-pairs.sh (new) + +test_expect_success 'diff-pairs recreates --raw' ' + git diff-tree -r -M -C -C -z base new >expect && -+ git diff-tree -r -M -C -C -z base new | -+ git diff-pairs --raw -z >actual && ++ git diff-pairs --raw -z >actual <expect && + test_cmp expect actual +' + @@ t/t4070-diff-pairs.sh (new) + git diff-tree -r base new | + test_must_fail git diff-pairs >out 2>err && + ++ echo "usage: working without -z is not supported" >expect && + test_must_be_empty out && -+ grep "error: working without -z is not supported" err ++ test_cmp expect err +' + +test_expect_success 'diff-pairs does not support tree objects as input' ' @@ t/t4070-diff-pairs.sh (new) + git diff-tree -r -z base new | + test_must_fail git diff-pairs -z -- new >out 2>err && + ++ echo "usage: pathspec arguments not supported" >expect && + test_must_be_empty out && -+ grep "error: pathspec arguments not supported" err ++ test_cmp expect err +' + +test_done 3: 26c1c80b66 ! 
4: a4809cbd80 builtin/diff-pairs: allow explicit diff queue flush @@ Documentation/git-diff-pairs.adoc: in the NUL-terminated raw output format as ge ## builtin/diff-pairs.c ## @@ builtin/diff-pairs.c: int cmd_diff_pairs(int argc, const char **argv, const char *prefix, + show_usage_with_options_if_asked(argc, argv, usagestr, parseopts); - repo_init_revisions(repo, &revs, prefix); repo_config(repo, git_diff_basic_config, NULL); + revs.diffopt.no_free = 1; revs.disable_stdin = 1; @@ builtin/diff-pairs.c: int cmd_diff_pairs(int argc, const char **argv, const char p = meta.buf; + if (!*p) { -+ flush_diff_queue(&revs.diffopt); ++ diffcore_std(&revs.diffopt); ++ diff_flush(&revs.diffopt); + /* + * When the diff queue is explicitly flushed, append a + * NUL byte to separate batches of diffs. @@ builtin/diff-pairs.c: int cmd_diff_pairs(int argc, const char **argv, const char } + revs.diffopt.no_free = 0; - flush_diff_queue(&revs.diffopt); + diffcore_std(&revs.diffopt); + diff_flush(&revs.diffopt); ret = diff_result_code(&revs); - ## t/t4070-diff-pairs.sh ## @@ t/t4070-diff-pairs.sh: test_expect_success 'diff-pairs does not support pathspec arguments' ' - grep "error: pathspec arguments not supported" err + test_cmp expect err ' +test_expect_success 'diff-pairs explicit queue flush' ' -- 2.48.1 ^ permalink raw reply [flat|nested] 78+ messages in thread
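
For illustration only, and not part of the series: a minimal sketch of
the explicit flush convention described for patch 4, where an empty
NUL-terminated field in the position of a record's metadata marks the
end of a batch. It walks an in-memory buffer of `-z` raw diff input and
counts the file pairs in each batch. The names `next_field()` and
`split_batches()` are hypothetical; the real command reads stdin through
`strbuf_getwholeline()`, queues `diff_filepair` entries, and writes a
NUL byte after each explicitly flushed batch, none of which is modelled
here.

```c
#include <stddef.h>
#include <string.h>

/* Return the next NUL-terminated field in [*p, end), or NULL if none. */
static const char *next_field(const char **p, const char *end)
{
	const char *field = *p;
	const char *nul;

	if (field >= end)
		return NULL;
	nul = memchr(field, '\0', (size_t)(end - field));
	if (!nul)
		return NULL;	/* unterminated trailing data */
	*p = nul + 1;
	return field;
}

/*
 * Count the file pairs in each batch of "-z" raw diff input.
 * counts[i] receives the size of batch i. Returns the number of
 * batches, or -1 on malformed input or when max_batches is exceeded.
 */
static int split_batches(const char *buf, size_t len,
			 size_t *counts, size_t max_batches)
{
	const char *p = buf, *end = buf + len, *meta;
	size_t batch = 0, queued = 0;

	while ((meta = next_field(&p, end))) {
		const char *status;

		if (!*meta) {
			/* Empty record: explicit flush of the queue. */
			if (batch == max_batches)
				return -1;
			counts[batch++] = queued;
			queued = 0;
			continue;
		}

		/* The status letter is the last space-separated token. */
		status = strrchr(meta, ' ');
		if (*meta != ':' || !status)
			return -1;
		status++;

		if (!next_field(&p, end))	/* source path */
			return -1;
		if ((*status == 'R' || *status == 'C') &&
		    !next_field(&p, end))	/* destination path */
			return -1;
		queued++;
	}

	/* End of input acts as an implicit final flush. */
	if (queued) {
		if (batch == max_batches)
			return -1;
		counts[batch++] = queued;
	}
	return (int)batch;
}
```

The point of the explicit flush is that a long-lived process can be
handed one batch at a time and produce output for it immediately,
rather than only when stdin closes.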
* [PATCH v4 1/4] diff: return diff_filepair from diff queue helpers
  2025-02-28 0:26 ` [PATCH v4 0/4] " Justin Tobler
@ 2025-02-28 0:26   ` Justin Tobler
  2025-02-28 0:26   ` [PATCH v4 2/4] diff: add option to skip resolving diff statuses Justin Tobler
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 78+ messages in thread
From: Justin Tobler @ 2025-02-28 0:26 UTC (permalink / raw)
  To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler

The `diff_addremove()` and `diff_change()` functions set up and queue
diffs, but do not return the `diff_filepair` added to the queue. In a
subsequent commit, modifications to `diff_filepair` need to occur in
certain cases after being queued.

Since the existing `diff_addremove()` and `diff_change()` are also used
for callbacks in `diff_options` as types `add_remove_fn_t` and
`change_fn_t`, modifying the existing function signatures requires
further changes. The diff options for pruning use `file_add_remove()`
and `file_change()` where file pairs do not even get queued. Thus,
separate functions are implemented instead.

Split out the queuing operations into `diff_queue_addremove()` and
`diff_queue_change()` which also return a handle to the queued
`diff_filepair`. Both `diff_addremove()` and `diff_change()` are
reimplemented as thin wrappers around the new functions.
Signed-off-by: Justin Tobler <jltobler@gmail.com> --- diff.c | 70 +++++++++++++++++++++++++++++++++++++++++----------------- diff.h | 25 +++++++++++++++++++++ 2 files changed, 75 insertions(+), 20 deletions(-) diff --git a/diff.c b/diff.c index 019fb893a7..b5a779f997 100644 --- a/diff.c +++ b/diff.c @@ -7157,16 +7157,19 @@ void compute_diffstat(struct diff_options *options, options->found_changes = !!diffstat->nr; } -void diff_addremove(struct diff_options *options, - int addremove, unsigned mode, - const struct object_id *oid, - int oid_valid, - const char *concatpath, unsigned dirty_submodule) +struct diff_filepair *diff_queue_addremove(struct diff_queue_struct *queue, + struct diff_options *options, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, + const char *concatpath, + unsigned dirty_submodule) { struct diff_filespec *one, *two; + struct diff_filepair *pair; if (S_ISGITLINK(mode) && is_submodule_ignored(concatpath, options)) - return; + return NULL; /* This may look odd, but it is a preparation for * feeding "there are unchanged files which should @@ -7186,7 +7189,7 @@ void diff_addremove(struct diff_options *options, if (options->prefix && strncmp(concatpath, options->prefix, options->prefix_length)) - return; + return NULL; one = alloc_filespec(concatpath); two = alloc_filespec(concatpath); @@ -7198,25 +7201,29 @@ void diff_addremove(struct diff_options *options, two->dirty_submodule = dirty_submodule; } - diff_queue(&diff_queued_diff, one, two); + pair = diff_queue(queue, one, two); if (!options->flags.diff_from_contents) options->flags.has_changes = 1; + + return pair; } -void diff_change(struct diff_options *options, - unsigned old_mode, unsigned new_mode, - const struct object_id *old_oid, - const struct object_id *new_oid, - int old_oid_valid, int new_oid_valid, - const char *concatpath, - unsigned old_dirty_submodule, unsigned new_dirty_submodule) +struct diff_filepair *diff_queue_change(struct diff_queue_struct 
*queue, + struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *concatpath, + unsigned old_dirty_submodule, + unsigned new_dirty_submodule) { struct diff_filespec *one, *two; struct diff_filepair *p; if (S_ISGITLINK(old_mode) && S_ISGITLINK(new_mode) && is_submodule_ignored(concatpath, options)) - return; + return NULL; if (options->flags.reverse_diff) { SWAP(old_mode, new_mode); @@ -7227,7 +7234,7 @@ void diff_change(struct diff_options *options, if (options->prefix && strncmp(concatpath, options->prefix, options->prefix_length)) - return; + return NULL; one = alloc_filespec(concatpath); two = alloc_filespec(concatpath); @@ -7235,19 +7242,42 @@ void diff_change(struct diff_options *options, fill_filespec(two, new_oid, new_oid_valid, new_mode); one->dirty_submodule = old_dirty_submodule; two->dirty_submodule = new_dirty_submodule; - p = diff_queue(&diff_queued_diff, one, two); + p = diff_queue(queue, one, two); if (options->flags.diff_from_contents) - return; + return p; if (options->flags.quick && options->skip_stat_unmatch && !diff_filespec_check_stat_unmatch(options->repo, p)) { diff_free_filespec_data(p->one); diff_free_filespec_data(p->two); - return; + return p; } options->flags.has_changes = 1; + + return p; +} + +void diff_addremove(struct diff_options *options, int addremove, unsigned mode, + const struct object_id *oid, int oid_valid, + const char *concatpath, unsigned dirty_submodule) +{ + diff_queue_addremove(&diff_queued_diff, options, addremove, mode, oid, + oid_valid, concatpath, dirty_submodule); +} + +void diff_change(struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *concatpath, + unsigned old_dirty_submodule, unsigned new_dirty_submodule) +{ + 
diff_queue_change(&diff_queued_diff, options, old_mode, new_mode, + old_oid, new_oid, old_oid_valid, new_oid_valid, + concatpath, old_dirty_submodule, new_dirty_submodule); } struct diff_filepair *diff_unmerge(struct diff_options *options, const char *path) diff --git a/diff.h b/diff.h index 0a566f5531..63afa17e84 100644 --- a/diff.h +++ b/diff.h @@ -508,6 +508,31 @@ void diff_set_default_prefix(struct diff_options *options); int diff_can_quit_early(struct diff_options *); +/* + * Stages changes in the provided diff queue for file additions and deletions. + * If a file pair gets queued, it is returned. + */ +struct diff_filepair *diff_queue_addremove(struct diff_queue_struct *queue, + struct diff_options *, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, const char *fullpath, + unsigned dirty_submodule); + +/* + * Stages changes in the provided diff queue for file modifications. + * If a file pair gets queued, it is returned. + */ +struct diff_filepair *diff_queue_change(struct diff_queue_struct *queue, + struct diff_options *, + unsigned mode1, unsigned mode2, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *fullpath, + unsigned dirty_submodule1, + unsigned dirty_submodule2); + void diff_addremove(struct diff_options *, int addremove, unsigned mode, -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
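
For illustration only, and not part of the patch: a toy model of the
pattern used above, where the queuing logic moves into a helper that
returns a handle to the queued entry while the original `void` function
survives as a thin wrapper so it still fits an existing callback type.
All names prefixed with `toy_` are hypothetical and greatly simplified
compared to Git's `diff_queue_struct` and `diff_filepair`.

```c
#include <stdlib.h>

struct toy_pair {
	char status;
	const char *path;
};

struct toy_queue {
	struct toy_pair **items;
	size_t nr, alloc;
};

/* Existing callback type that the void-returning function must match. */
typedef void (*toy_change_fn)(struct toy_queue *, const char *);

/* New helper: queue an entry and hand it back to the caller. */
static struct toy_pair *toy_queue_change(struct toy_queue *q, const char *path)
{
	struct toy_pair *p;

	if (q->nr == q->alloc) {
		size_t alloc = q->alloc ? 2 * q->alloc : 4;
		struct toy_pair **items = realloc(q->items,
						  alloc * sizeof(*items));
		if (!items)
			return NULL;
		q->items = items;
		q->alloc = alloc;
	}

	p = malloc(sizeof(*p));
	if (!p)
		return NULL;
	p->status = 'M';
	p->path = path;
	q->items[q->nr++] = p;
	return p;
}

/* Old entry point: same signature as before, now a thin wrapper. */
static void toy_change(struct toy_queue *q, const char *path)
{
	toy_queue_change(q, path);
}
```

A caller that needs to adjust the entry after queuing (as
git-diff-pairs(1) does for statuses read from stdin) uses the helper and
modifies the returned pair; callers that only need the side effect keep
using the wrapper, and its signature remains compatible with the
callback type.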
* [PATCH v4 2/4] diff: add option to skip resolving diff statuses
  2025-02-28 0:26 ` [PATCH v4 0/4] " Justin Tobler
  2025-02-28 0:26   ` [PATCH v4 1/4] diff: return diff_filepair from diff queue helpers Justin Tobler
@ 2025-02-28 0:26   ` Justin Tobler
  2025-02-28 8:29     ` Patrick Steinhardt
  2025-02-28 0:26   ` [PATCH v4 3/4] builtin: introduce diff-pairs command Justin Tobler
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 78+ messages in thread
From: Justin Tobler @ 2025-02-28 0:26 UTC (permalink / raw)
  To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler

By default, `diffcore_std()` resolves the statuses for queued diff file
pairs by calling `diff_resolve_rename_copy()`. If status information is
already manually set, invoking `diffcore_std()` may change the status
value.

Introduce the `skip_resolving_statuses` diff option that prevents
`diffcore_std()` from resolving file pair statuses when enabled.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 diff.c | 2 +-
 diff.h | 8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/diff.c b/diff.c
index b5a779f997..37cc88c75b 100644
--- a/diff.c
+++ b/diff.c
@@ -7081,7 +7081,7 @@ void diffcore_std(struct diff_options *options)
 		diffcore_order(options->orderfile);
 	if (options->rotate_to)
 		diffcore_rotate(options);
-	if (!options->found_follow)
+	if (!options->found_follow && !options->skip_resolving_statuses)
 		/* See try_to_follow_renames() in tree-diff.c */
 		diff_resolve_rename_copy();
 	diffcore_apply_filter(options);
diff --git a/diff.h b/diff.h
index 63afa17e84..fc791ee2cc 100644
--- a/diff.h
+++ b/diff.h
@@ -353,6 +353,14 @@ struct diff_options {
 	/* to support internal diff recursion by --follow hack*/
 	int found_follow;
 
+	/*
+	 * By default, diffcore_std() resolves the statuses for queued diff file
+	 * pairs by calling diff_resolve_rename_copy(). If status information
+	 * has already been manually set, this option prevents diffcore_std()
+	 * from resetting statuses.
+	 */
+	int skip_resolving_statuses;
+
 	/* Callback which allows tweaking the options in diff_setup_done(). */
 	void (*set_default)(struct diff_options *);
 
-- 
2.48.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread
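
For illustration only, and not part of the patch: a toy model of why the
new flag is needed. A status that was set by hand (for example `R` read
from raw input) is overwritten by a resolution pass that recomputes
statuses from other properties of the pair, unless that pass is skipped.
All `toy_` names are hypothetical, and the resolution rule shown is a
stand-in rather than the logic of `diff_resolve_rename_copy()`.

```c
#include <stddef.h>

struct toy_pair {
	char status;	/* possibly set manually by the caller */
	int is_rename;	/* internal marker a rename detector would set */
};

struct toy_options {
	int skip_resolving_statuses;
};

/* Recompute every status from pair properties, discarding the old one. */
static void toy_resolve_statuses(struct toy_pair *pairs, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		pairs[i].status = pairs[i].is_rename ? 'R' : 'M';
}

/* Stand-in for the tail of the standard pipeline. */
static void toy_diffcore_std(const struct toy_options *opts,
			     struct toy_pair *pairs, size_t nr)
{
	if (!opts->skip_resolving_statuses)
		toy_resolve_statuses(pairs, nr);
}
```

With `skip_resolving_statuses` set, a pair queued with status `R` but
without the internal rename marker keeps its `R`; without the flag the
pass would turn it back into `M`.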
* Re: [PATCH v4 2/4] diff: add option to skip resolving diff statuses
  2025-02-28 0:26 ` [PATCH v4 2/4] diff: add option to skip resolving diff statuses Justin Tobler
@ 2025-02-28 8:29   ` Patrick Steinhardt
  2025-02-28 17:10     ` Justin Tobler
  0 siblings, 1 reply; 78+ messages in thread
From: Patrick Steinhardt @ 2025-02-28 8:29 UTC (permalink / raw)
  To: Justin Tobler; +Cc: git, karthik.188, phillip.wood123

On Thu, Feb 27, 2025 at 06:26:02PM -0600, Justin Tobler wrote:
> By default, `diffcore_std()` resolves the statuses for queued diff file
> pairs by calling `diff_resolve_rename_copy()`. If status information is
> already manually set, invoking `diffcore_std()` may change the status
> value.
>
> Introduce the `skip_resolving_statuses` diff option that prevents
> `diffcore_std()` from resolving file pair statuses when enabled.

You mentioned to me that there was another user that basically abused
`found_follow` to skip over this, which seems to be in "tree-diff.c".
Would it make sense to convert that user to use the new mechanism, as
well, so that we don't mix up options and state?

Patrick

^ permalink raw reply	[flat|nested] 78+ messages in thread
* Re: [PATCH v4 2/4] diff: add option to skip resolving diff statuses
  2025-02-28 8:29 ` Patrick Steinhardt
@ 2025-02-28 17:10   ` Justin Tobler
  0 siblings, 0 replies; 78+ messages in thread
From: Justin Tobler @ 2025-02-28 17:10 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, karthik.188, phillip.wood123

On 25/02/28 09:29AM, Patrick Steinhardt wrote:
> On Thu, Feb 27, 2025 at 06:26:02PM -0600, Justin Tobler wrote:
> > By default, `diffcore_std()` resolves the statuses for queued diff file
> > pairs by calling `diff_resolve_rename_copy()`. If status information is
> > already manually set, invoking `diffcore_std()` may change the status
> > value.
> >
> > Introduce the `skip_resolving_statuses` diff option that prevents
> > `diffcore_std()` from resolving file pair statuses when enabled.
>
> You mentioned to me that there was another user that basically abused
> `found_follow` to skip over this, which seems to be in "tree-diff.c".
> Would it make sense to convert that user to use the new mechanism, as
> well, so that we don't mix up options and state?

I was mixed up with something else and mistaken. There is only the one
expected existing user of the `found_follow` option. Apologies for any
confusion.

-Justin

^ permalink raw reply	[flat|nested] 78+ messages in thread
* [PATCH v4 3/4] builtin: introduce diff-pairs command 2025-02-28 0:26 ` [PATCH v4 0/4] " Justin Tobler 2025-02-28 0:26 ` [PATCH v4 1/4] diff: return diff_filepair from diff queue helpers Justin Tobler 2025-02-28 0:26 ` [PATCH v4 2/4] diff: add option to skip resolving diff statuses Justin Tobler @ 2025-02-28 0:26 ` Justin Tobler 2025-02-28 8:29 ` Patrick Steinhardt 2025-02-28 0:26 ` [PATCH v4 4/4] builtin/diff-pairs: allow explicit diff queue flush Justin Tobler 2025-02-28 21:33 ` [PATCH v5 0/4] batch blob diff generation Justin Tobler 4 siblings, 1 reply; 78+ messages in thread From: Justin Tobler @ 2025-02-28 0:26 UTC (permalink / raw) To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler, Jeff King Through git-diff(1), a single diff can be generated from a pair of blob revisions directly. Unfortunately, there is not a mechanism to compute batches of specific file pair diffs in a single process. Such a feature is particularly useful on the server-side where diffing between a large set of changes is not feasible all at once due to timeout concerns. To facilitate this, introduce git-diff-pairs(1) which acts as a backend passing its NUL-terminated raw diff format input from stdin through diff machinery to produce various forms of output such as patch or raw. The raw format was originally designed as an interchange format and represents the contents of the diff_queued_diff list making it possible to break the diff pipeline into separate stages. For example, git-diff-tree(1) can be used as a frontend to compute file pairs to queue and feed its raw output to git-diff-pairs(1) to compute patches. With this, batches of diffs can be progressively generated without having to recompute renames or retrieve object context. Something like the following: git diff-tree -r -z -M $old $new | git diff-pairs -p -z should generate the same output as `git diff-tree -p -M`. 
Furthermore, each line of raw diff formatted input can also be individually fed to a separate git-diff-pairs(1) process and still produce the same output. Based-on-patch-by: Jeff King <peff@peff.net> Signed-off-by: Justin Tobler <jltobler@gmail.com> --- .gitignore | 1 + Documentation/git-diff-pairs.adoc | 56 +++++++++ Documentation/meson.build | 1 + Makefile | 1 + builtin.h | 1 + builtin/diff-pairs.c | 195 ++++++++++++++++++++++++++++++ command-list.txt | 1 + git.c | 1 + meson.build | 1 + t/meson.build | 1 + t/t4070-diff-pairs.sh | 81 +++++++++++++ 11 files changed, 340 insertions(+) create mode 100644 Documentation/git-diff-pairs.adoc create mode 100644 builtin/diff-pairs.c create mode 100755 t/t4070-diff-pairs.sh diff --git a/.gitignore b/.gitignore index 08a66ca508..04c444404e 100644 --- a/.gitignore +++ b/.gitignore @@ -55,6 +55,7 @@ /git-diff /git-diff-files /git-diff-index +/git-diff-pairs /git-diff-tree /git-difftool /git-difftool--helper diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc new file mode 100644 index 0000000000..e31f2e2fbb --- /dev/null +++ b/Documentation/git-diff-pairs.adoc @@ -0,0 +1,56 @@ +git-diff-pairs(1) +================= + +NAME +---- +git-diff-pairs - Compare the content and mode of provided blob pairs + +SYNOPSIS +-------- +[synopsis] +git diff-pairs -z [<diff-options>] + +DESCRIPTION +----------- +Show changes for file pairs provided on stdin. Input for this command must be +in the NUL-terminated raw output format as generated by commands such as `git +diff-tree -z -r --raw`. By default, the outputted diffs are computed and shown +in the patch format when stdin closes. + +Usage of this command enables the traditional diff pipeline to be broken up +into separate stages where `diff-pairs` acts as the output phase. Other +commands, such as `diff-tree`, may serve as a frontend to compute the raw +diff format used as input. 
+ +Instead of computing diffs via `git diff-tree -p -M` in one step, `diff-tree` +can compute the file pairs and rename information without the blob diffs. This +output can be fed to `diff-pairs` to generate the underlying blob diffs as done +in the following example: + +----------------------------- +git diff-tree -z -r -M $a $b | +git diff-pairs -z +----------------------------- + +Computing the tree diff upfront with rename information allows patch output +from `diff-pairs` to be progressively computed over the course of potentially +multiple invocations. + +Pathspecs are not currently supported by `diff-pairs`. Pathspec limiting should +be performed by the upstream command generating the raw diffs used as input. + +Tree objects are not currently supported as input and are rejected. + +Abbreviated object IDs in the `diff-pairs` input are not supported. Outputted +object IDs can be abbreviated using the `--abbrev` option. + +OPTIONS +------- + +include::diff-options.adoc[] + +include::diff-generate-patch.adoc[] + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/meson.build b/Documentation/meson.build index 1129ce4c85..ce990e9fe5 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -42,6 +42,7 @@ manpages = { 'git-diagnose.adoc' : 1, 'git-diff-files.adoc' : 1, 'git-diff-index.adoc' : 1, + 'git-diff-pairs.adoc' : 1, 'git-difftool.adoc' : 1, 'git-diff-tree.adoc' : 1, 'git-diff.adoc' : 1, diff --git a/Makefile b/Makefile index bcf5ed3f85..56df7aed3f 100644 --- a/Makefile +++ b/Makefile @@ -1242,6 +1242,7 @@ BUILTIN_OBJS += builtin/describe.o BUILTIN_OBJS += builtin/diagnose.o BUILTIN_OBJS += builtin/diff-files.o BUILTIN_OBJS += builtin/diff-index.o +BUILTIN_OBJS += builtin/diff-pairs.o BUILTIN_OBJS += builtin/diff-tree.o BUILTIN_OBJS += builtin/diff.o BUILTIN_OBJS += builtin/difftool.o diff --git a/builtin.h b/builtin.h index 89928ccf92..e6aad3a6a1 100644 --- a/builtin.h +++ b/builtin.h @@ -153,6 +153,7 @@ int 
cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo); +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo); diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c new file mode 100644 index 0000000000..5a993b7c9d --- /dev/null +++ b/builtin/diff-pairs.c @@ -0,0 +1,195 @@ +#include "builtin.h" +#include "config.h" +#include "diff.h" +#include "diffcore.h" +#include "gettext.h" +#include "hash.h" +#include "hex.h" +#include "object.h" +#include "parse-options.h" +#include "revision.h" +#include "strbuf.h" + +static unsigned parse_mode_or_die(const char *mode, const char **end) +{ + uint16_t ret; + + *end = parse_mode(mode, &ret); + if (!*end) + die(_("unable to parse mode: %s"), mode); + return ret; +} + +static void parse_oid_or_die(const char *hex, struct object_id *oid, + const char **end, const struct git_hash_algo *algop) +{ + if (parse_oid_hex_algop(hex, oid, end, algop) || *(*end)++ != ' ') + die(_("unable to parse object id: %s"), hex); +} + +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, + struct repository *repo) +{ + struct strbuf path_dst = STRBUF_INIT; + struct strbuf path = STRBUF_INIT; + struct strbuf meta = STRBUF_INIT; + struct option *parseopts; + struct rev_info revs; + int line_term = '\0'; + int ret; + + const char * const usagestr[] = { + N_("git diff-pairs -z [<diff-options>]"), + NULL + }; + struct option 
options[] = { + OPT_END() + }; + + repo_init_revisions(repo, &revs, prefix); + + /* + * Diff options are usually parsed implicitly as part of + * setup_revisions(). Explicitly handle parsing to ensure options are + * printed in the usage message. + */ + parseopts = add_diff_options(options, &revs.diffopt); + show_usage_with_options_if_asked(argc, argv, usagestr, parseopts); + + repo_config(repo, git_diff_basic_config, NULL); + revs.disable_stdin = 1; + revs.abbrev = 0; + revs.diff = 1; + + argc = parse_options(argc, argv, prefix, parseopts, usagestr, + PARSE_OPT_KEEP_UNKNOWN_OPT | + PARSE_OPT_KEEP_DASHDASH | + PARSE_OPT_KEEP_ARGV0); + + if (setup_revisions(argc, argv, &revs, NULL) > 1) + usagef(_("unrecognized argument: %s"), argv[0]); + + /* + * With the -z option, both command input and raw output are + * NUL-delimited (this mode does not affect patch output). At present + * only NUL-delimited raw diff formatted input is supported. + */ + if (revs.diffopt.line_termination) + usage(_("working without -z is not supported")); + + if (revs.prune_data.nr) + usage(_("pathspec arguments not supported")); + + if (revs.pending.nr || revs.max_count != -1 || + revs.min_age != (timestamp_t)-1 || + revs.max_age != (timestamp_t)-1) + usage(_("revision arguments not allowed")); + + if (!revs.diffopt.output_format) + revs.diffopt.output_format = DIFF_FORMAT_PATCH; + + /* + * If rename detection is not requested, use rename information from the + * raw diff formatted input. Setting skip_resolving_statuses ensures + * diffcore_std() does not mess with rename information already present + * in queued filepairs. 
+ */ + if (!revs.diffopt.detect_rename) + revs.diffopt.skip_resolving_statuses = 1; + + while (1) { + struct object_id oid_a, oid_b; + struct diff_filepair *pair; + unsigned mode_a, mode_b; + const char *p; + char status; + + if (strbuf_getwholeline(&meta, stdin, line_term) == EOF) + break; + + p = meta.buf; + if (*p != ':') + die(_("invalid raw diff input")); + p++; + + mode_a = parse_mode_or_die(p, &p); + mode_b = parse_mode_or_die(p, &p); + + if (S_ISDIR(mode_a) || S_ISDIR(mode_b)) + die(_("tree objects not supported")); + + parse_oid_or_die(p, &oid_a, &p, repo->hash_algo); + parse_oid_or_die(p, &oid_b, &p, repo->hash_algo); + + status = *p++; + + if (strbuf_getwholeline(&path, stdin, line_term) == EOF) + die(_("got EOF while reading path")); + + switch (status) { + case DIFF_STATUS_ADDED: + pair = diff_queue_addremove(&diff_queued_diff, + &revs.diffopt, '+', mode_b, + &oid_b, 1, path.buf, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_DELETED: + pair = diff_queue_addremove(&diff_queued_diff, + &revs.diffopt, '-', mode_a, + &oid_a, 1, path.buf, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_TYPE_CHANGED: + case DIFF_STATUS_MODIFIED: + pair = diff_queue_change(&diff_queued_diff, &revs.diffopt, + mode_a, mode_b, &oid_a, &oid_b, + 1, 1, path.buf, 0, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_RENAMED: + case DIFF_STATUS_COPIED: { + struct diff_filespec *a, *b; + unsigned int score; + + if (strbuf_getwholeline(&path_dst, stdin, line_term) == EOF) + die(_("got EOF while reading destination path")); + + a = alloc_filespec(path.buf); + b = alloc_filespec(path_dst.buf); + fill_filespec(a, &oid_a, 1, mode_a); + fill_filespec(b, &oid_b, 1, mode_b); + + pair = diff_queue(&diff_queued_diff, a, b); + + if (strtoul_ui(p, 10, &score)) + die(_("unable to parse rename/copy score: %s"), p); + + pair->score = score * MAX_SCORE / 100; + pair->status = status; + pair->renamed_pair = 1; + } + break; + + 
default: + die(_("unknown diff status: %c"), status); + } + } + + diffcore_std(&revs.diffopt); + diff_flush(&revs.diffopt); + ret = diff_result_code(&revs); + + strbuf_release(&path_dst); + strbuf_release(&path); + strbuf_release(&meta); + release_revisions(&revs); + FREE_AND_NULL(parseopts); + + return ret; +} diff --git a/command-list.txt b/command-list.txt index c537114b46..b7ade3ab9f 100644 --- a/command-list.txt +++ b/command-list.txt @@ -96,6 +96,7 @@ git-diagnose ancillaryinterrogators git-diff mainporcelain info git-diff-files plumbinginterrogators git-diff-index plumbinginterrogators +git-diff-pairs plumbinginterrogators git-diff-tree plumbinginterrogators git-difftool ancillaryinterrogators complete git-fast-export ancillarymanipulators diff --git a/git.c b/git.c index 450d6aaa86..77c4359522 100644 --- a/git.c +++ b/git.c @@ -541,6 +541,7 @@ static struct cmd_struct commands[] = { { "diff", cmd_diff, NO_PARSEOPT }, { "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT }, { "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT }, + { "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT }, { "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT }, { "difftool", cmd_difftool, RUN_SETUP_GENTLY }, { "fast-export", cmd_fast_export, RUN_SETUP }, diff --git a/meson.build b/meson.build index bf95576f83..9e8b365d2a 100644 --- a/meson.build +++ b/meson.build @@ -540,6 +540,7 @@ builtin_sources = [ 'builtin/diagnose.c', 'builtin/diff-files.c', 'builtin/diff-index.c', + 'builtin/diff-pairs.c', 'builtin/diff-tree.c', 'builtin/diff.c', 'builtin/difftool.c', diff --git a/t/meson.build b/t/meson.build index 780939d49f..09c7bc2fad 100644 --- a/t/meson.build +++ b/t/meson.build @@ -500,6 +500,7 @@ integration_tests = [ 't4067-diff-partial-clone.sh', 't4068-diff-symmetric-merge-base.sh', 't4069-remerge-diff.sh', + 't4070-diff-pairs.sh', 't4100-apply-stat.sh', 't4101-apply-nonl.sh', 't4102-apply-rename.sh', diff --git a/t/t4070-diff-pairs.sh 
b/t/t4070-diff-pairs.sh new file mode 100755 index 0000000000..8f17e55c7d --- /dev/null +++ b/t/t4070-diff-pairs.sh @@ -0,0 +1,81 @@ +#!/bin/sh + +test_description='basic diff-pairs tests' +. ./test-lib.sh + +# This creates a diff with added, modified, deleted, renamed, copied, and +# typechange entries. This includes a submodule to test submodule diff support. +test_expect_success 'setup' ' + test_config_global protocol.file.allow always && + test_create_repo sub && + test_commit -C sub initial && + + test_create_repo main && + cd main && + echo to-be-gone >deleted && + echo original >modified && + echo now-a-file >symlink && + test_seq 200 >two-hundred && + test_seq 201 500 >five-hundred && + git add . && + test_tick && + git commit -m base && + git tag base && + + git submodule add ../sub && + echo now-here >added && + echo new >modified && + rm deleted && + mkdir subdir && + echo content >subdir/file && + mv two-hundred renamed && + test_seq 201 500 | sed s/300/modified/ >copied && + rm symlink && + git add -A . 
&& + test_ln_s_add dest symlink && + test_tick && + git commit -m new && + git tag new +' + +test_expect_success 'diff-pairs recreates --raw' ' + git diff-tree -r -M -C -C -z base new >expect && + git diff-pairs --raw -z >actual <expect && + test_cmp expect actual +' + +test_expect_success 'diff-pairs can create -p output' ' + git diff-tree -p -M -C -C base new >expect && + git diff-tree -r -M -C -C -z base new | + git diff-pairs -p -z >actual && + test_cmp expect actual +' + +test_expect_success 'diff-pairs does not support normal raw diff input' ' + git diff-tree -r base new | + test_must_fail git diff-pairs >out 2>err && + + echo "usage: working without -z is not supported" >expect && + test_must_be_empty out && + test_cmp expect err +' + +test_expect_success 'diff-pairs does not support tree objects as input' ' + git diff-tree -z base new | + test_must_fail git diff-pairs -z >out 2>err && + + echo "fatal: tree objects not supported" >expect && + test_must_be_empty out && + test_cmp expect err +' + +test_expect_success 'diff-pairs does not support pathspec arguments' ' + git diff-tree -r -z base new | + test_must_fail git diff-pairs -z -- new >out 2>err && + + echo "usage: pathspec arguments not supported" >expect && + test_must_be_empty out && + test_cmp expect err +' + +test_done -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
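For readers following the stdin parsing in cmd_diff_pairs(), the metadata half of one NUL-terminated raw record can be parsed roughly as in the standalone sketch below. This is not the code from the patch: `struct raw_meta`, `parse_mode_field()`, `parse_hex_field()`, `parse_raw_meta()` and `scale_score()` are hypothetical names, object IDs are kept as hex strings rather than `struct object_id`, and `MAX_SCORE` is assumed to be 60000.0 as in Git's diffcore.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: mirrors MAX_SCORE from Git's diffcore.h. */
#define MAX_SCORE 60000.0

/* Hypothetical holder for ":<mode> <mode> <oid> <oid> <status>[<score>]". */
struct raw_meta {
	unsigned mode_a, mode_b;
	char oid_a[65], oid_b[65];	/* hex; room for SHA-1 (40) or SHA-256 (64) */
	char status;
	unsigned score;			/* percentage 0..100, 0 when absent */
};

/* Parse an octal mode followed by one space; return the position after it. */
static const char *parse_mode_field(const char *p, unsigned *mode)
{
	unsigned m = 0;

	if (*p < '0' || *p > '7')
		return NULL;
	while (*p >= '0' && *p <= '7')
		m = (m << 3) | (unsigned)(*p++ - '0');
	if (*p != ' ')
		return NULL;
	*mode = m;
	return p + 1;
}

/* Copy a full-length hex object ID followed by one space into out. */
static const char *parse_hex_field(const char *p, char *out, size_t outsz)
{
	size_t n = strspn(p, "0123456789abcdef");

	/* Abbreviated object IDs are rejected, as the documentation states. */
	if ((n != 40 && n != 64) || n >= outsz || p[n] != ' ')
		return NULL;
	memcpy(out, p, n);
	out[n] = '\0';
	return p + n + 1;
}

/* Parse the metadata field of one record; returns 0 on success, -1 on error. */
static int parse_raw_meta(const char *line, struct raw_meta *out)
{
	const char *p = line;

	assert(line && out);
	if (*p++ != ':')
		return -1;
	if (!(p = parse_mode_field(p, &out->mode_a)) ||
	    !(p = parse_mode_field(p, &out->mode_b)) ||
	    !(p = parse_hex_field(p, out->oid_a, sizeof(out->oid_a))) ||
	    !(p = parse_hex_field(p, out->oid_b, sizeof(out->oid_b))))
		return -1;

	out->status = *p++;
	if (!out->status)
		return -1;

	/* An optional decimal score may follow the status letter (e.g. R086). */
	out->score = 0;
	if (*p) {
		char *end;
		unsigned long v;

		if (*p < '0' || *p > '9')
			return -1;
		v = strtoul(p, &end, 10);
		if (*end || v > 100)
			return -1;
		out->score = (unsigned)v;
	}
	return 0;
}

/* Same scaling as the patch applies: score * MAX_SCORE / 100. */
static unsigned scale_score(unsigned percent)
{
	return (unsigned)(percent * MAX_SCORE / 100);
}
```

The path (and, for renames and copies, the destination path) follows as separate NUL-terminated fields and is not handled here. Unlike the patch, this sketch accepts a score after any status letter, since the raw format may also attach one to a modified entry.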
* Re: [PATCH v4 3/4] builtin: introduce diff-pairs command 2025-02-28 0:26 ` [PATCH v4 3/4] builtin: introduce diff-pairs command Justin Tobler @ 2025-02-28 8:29 ` Patrick Steinhardt 2025-02-28 17:26 ` Justin Tobler 0 siblings, 1 reply; 78+ messages in thread From: Patrick Steinhardt @ 2025-02-28 8:29 UTC (permalink / raw) To: Justin Tobler; +Cc: git, karthik.188, phillip.wood123, Jeff King On Thu, Feb 27, 2025 at 06:26:03PM -0600, Justin Tobler wrote: > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > new file mode 100644 > index 0000000000..5a993b7c9d > --- /dev/null > +++ b/builtin/diff-pairs.c [snip] > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > + struct repository *repo) > +{ > + struct strbuf path_dst = STRBUF_INIT; > + struct strbuf path = STRBUF_INIT; > + struct strbuf meta = STRBUF_INIT; > + struct option *parseopts; > + struct rev_info revs; > + int line_term = '\0'; > + int ret; > + > + const char * const usagestr[] = { > + N_("git diff-pairs -z [<diff-options>]"), > + NULL > + }; We tend to call these `builtin_*_usage`, so in your case it would be `builtin_diff_pairs_usage`. > + struct option options[] = { > + OPT_END() > + }; > + > + repo_init_revisions(repo, &revs, prefix); > + > + /* > + * Diff options are usually parsed implicitly as part of > + * setup_revisions(). Explicitly handle parsing to ensure options are > + * printed in the usage message. 
> + */ > + parseopts = add_diff_options(options, &revs.diffopt); > + show_usage_with_options_if_asked(argc, argv, usagestr, parseopts); > + > + repo_config(repo, git_diff_basic_config, NULL); > + revs.disable_stdin = 1; > + revs.abbrev = 0; > + revs.diff = 1; > + > + argc = parse_options(argc, argv, prefix, parseopts, usagestr, > + PARSE_OPT_KEEP_UNKNOWN_OPT | > + PARSE_OPT_KEEP_DASHDASH | > + PARSE_OPT_KEEP_ARGV0); > > + if (setup_revisions(argc, argv, &revs, NULL) > 1) > + usagef(_("unrecognized argument: %s"), argv[0]); Okay, we now use `parse_options()` to parse stuff for us, and `setup_revisions()` only really does the setup for us as we know that all relevant diff options should've already been parsed for us. This looks much nicer to me. I wonder though: we keep unknown options when calling `parse_options()` and then end up passing them to `setup_revisions()`. But are there even any options handled by `setup_revisions()` that would make sense in our context? And if not, shouldn't we rather make `parse_options()` die in case it sees unknown options? If there are, we should probably document this because it isn't obvious to me. > diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh > new file mode 100755 > index 0000000000..8f17e55c7d > --- /dev/null > +++ b/t/t4070-diff-pairs.sh > @@ -0,0 +1,81 @@ > +#!/bin/sh > + > +test_description='basic diff-pairs tests' > +. ./test-lib.sh > + > +# This creates a diff with added, modified, deleted, renamed, copied, and > +# typechange entries. This includes a submodule to test submodule diff support. > +test_expect_success 'setup' ' > + test_config_global protocol.file.allow always && > + test_create_repo sub && Use of `test_create_repo ()` is deprecated, as it is merely a wrapper around git-init(1). Patrick ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v4 3/4] builtin: introduce diff-pairs command 2025-02-28 8:29 ` Patrick Steinhardt @ 2025-02-28 17:26 ` Justin Tobler 0 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-28 17:26 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, karthik.188, phillip.wood123, Jeff King On 25/02/28 09:29AM, Patrick Steinhardt wrote: > On Thu, Feb 27, 2025 at 06:26:03PM -0600, Justin Tobler wrote: > > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > > new file mode 100644 > > index 0000000000..5a993b7c9d > > --- /dev/null > > +++ b/builtin/diff-pairs.c > [snip] > > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > > + struct repository *repo) > > +{ > > + struct strbuf path_dst = STRBUF_INIT; > > + struct strbuf path = STRBUF_INIT; > > + struct strbuf meta = STRBUF_INIT; > > + struct option *parseopts; > > + struct rev_info revs; > > + int line_term = '\0'; > > + int ret; > > + > > + const char * const usagestr[] = { > > + N_("git diff-pairs -z [<diff-options>]"), > > + NULL > > + }; > > We tend to call these `builtin_*_usage`, so in your case it would be > `builtin_diff_pairs_usage`. Good to know, will adapt in a followup version. > > > + struct option options[] = { > > + OPT_END() > > + }; > > + > > + repo_init_revisions(repo, &revs, prefix); > > + > > + /* > > + * Diff options are usually parsed implicitly as part of > > + * setup_revisions(). Explicitly handle parsing to ensure options are > > + * printed in the usage message. 
> > + */ > > + parseopts = add_diff_options(options, &revs.diffopt); > > + show_usage_with_options_if_asked(argc, argv, usagestr, parseopts); > > + > > + repo_config(repo, git_diff_basic_config, NULL); > > + revs.disable_stdin = 1; > > + revs.abbrev = 0; > > + revs.diff = 1; > > + > > + argc = parse_options(argc, argv, prefix, parseopts, usagestr, > > + PARSE_OPT_KEEP_UNKNOWN_OPT | > > + PARSE_OPT_KEEP_DASHDASH | > > + PARSE_OPT_KEEP_ARGV0); > > > > + if (setup_revisions(argc, argv, &revs, NULL) > 1) > > + usagef(_("unrecognized argument: %s"), argv[0]); > > Okay, we now use `parse_options()` to parse stuff for us, and > `setup_revisions()` only really does the setup for us as we know that > all relevant diff options should've already been parsed for us. This > looks much nicer to me. > > I wonder though: we keep unknown options when calling `parse_options()` > and then end up passing them to `setup_revisions()`. But are there even > any options handled by `setup_revisions()` that would make sense in our > context? And if not, shouldn't we rather make `parse_options()` die in > case it sees unknown options? Good catch, there should not be any actually needed options left for `setup_revisions()` to parse as they should all be handled by `parse_options()`. I'll remove the `PARSE_OPT_KEEP_UNKNOWN_OPT` flag. > If there are, we should probably document this because it isn't obvious > to me. > > > diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh > > new file mode 100755 > > index 0000000000..8f17e55c7d > > --- /dev/null > > +++ b/t/t4070-diff-pairs.sh > > @@ -0,0 +1,81 @@ > > +#!/bin/sh > > + > > +test_description='basic diff-pairs tests' > > +. ./test-lib.sh > > + > > +# This creates a diff with added, modified, deleted, renamed, copied, and > > +# typechange entries. This includes a submodule to test submodule diff support.
> > +test_expect_success 'setup' ' > > + test_config_global protocol.file.allow always && > > + test_create_repo sub && > > Use of `test_create_repo ()` is deprecated, as it is merely a wrapper > around git-init(1). Good to know! I'll swap to using git-init(1) instead. Thanks -Justin ^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v4 4/4] builtin/diff-pairs: allow explicit diff queue flush 2025-02-28 0:26 ` [PATCH v4 0/4] " Justin Tobler ` (2 preceding siblings ...) 2025-02-28 0:26 ` [PATCH v4 3/4] builtin: introduce diff-pairs command Justin Tobler @ 2025-02-28 0:26 ` Justin Tobler 2025-02-28 21:33 ` [PATCH v5 0/4] batch blob diff generation Justin Tobler 4 siblings, 0 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-28 0:26 UTC (permalink / raw) To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler The diffs queued from git-diff-pairs(1) are flushed when stdin is closed. To enable greater flexibility, allow control over when the diff queue is flushed by writing a single NUL byte on stdin between input file pairs. Diff output between flushes is separated by a single NUL byte. Signed-off-by: Justin Tobler <jltobler@gmail.com> --- Documentation/git-diff-pairs.adoc | 4 ++++ builtin/diff-pairs.c | 14 ++++++++++++++ t/t4070-diff-pairs.sh | 9 +++++++++ 3 files changed, 27 insertions(+) diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc index e31f2e2fbb..f99fcd1ead 100644 --- a/Documentation/git-diff-pairs.adoc +++ b/Documentation/git-diff-pairs.adoc @@ -17,6 +17,10 @@ in the NUL-terminated raw output format as generated by commands such as `git diff-tree -z -r --raw`. By default, the outputted diffs are computed and shown in the patch format when stdin closes. +A single NUL byte may be written to stdin between raw input lines to compute +file pair diffs up to that point instead of waiting for stdin to close. A NUL +byte is also written to the output to delimit between these batches of diffs. + Usage of this command enables the traditional diff pipeline to be broken up into separate stages where `diff-pairs` acts as the output phase. 
Other commands, such as `diff-tree`, may serve as a frontend to compute the raw diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c index 5a993b7c9d..2939d4af1d 100644 --- a/builtin/diff-pairs.c +++ b/builtin/diff-pairs.c @@ -57,6 +57,7 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, show_usage_with_options_if_asked(argc, argv, usagestr, parseopts); repo_config(repo, git_diff_basic_config, NULL); + revs.diffopt.no_free = 1; revs.disable_stdin = 1; revs.abbrev = 0; revs.diff = 1; @@ -108,6 +109,18 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, break; p = meta.buf; + if (!*p) { + diffcore_std(&revs.diffopt); + diff_flush(&revs.diffopt); + /* + * When the diff queue is explicitly flushed, append a + * NUL byte to separate batches of diffs. + */ + fputc('\0', revs.diffopt.file); + fflush(revs.diffopt.file); + continue; + } + if (*p != ':') die(_("invalid raw diff input")); p++; @@ -181,6 +194,7 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix, } } + revs.diffopt.no_free = 0; diffcore_std(&revs.diffopt); diff_flush(&revs.diffopt); ret = diff_result_code(&revs); diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh index 8f17e55c7d..c5e9972b2d 100755 --- a/t/t4070-diff-pairs.sh +++ b/t/t4070-diff-pairs.sh @@ -78,4 +78,13 @@ test_expect_success 'diff-pairs does not support pathspec arguments' ' test_cmp expect err ' +test_expect_success 'diff-pairs explicit queue flush' ' + git diff-tree -r -M -C -C -z base new >expect && + printf "\0" >>expect && + git diff-tree -r -M -C -C -z base new >>expect && + + git diff-pairs --raw -z <expect >actual && + test_cmp expect actual +' + test_done -- 2.48.1 ^ permalink raw reply related [flat|nested] 78+ messages in thread
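To illustrate the explicit flush described in this patch, a frontend could insert the lone NUL byte itself while forwarding raw records. The following is a minimal sketch under stated assumptions, not part of the series: `meta_status()` and `add_flush_markers()` are hypothetical helpers, the input is assumed to be well-formed `diff-tree -r -z` style output held entirely in memory, and a record is taken to have three NUL-terminated fields for `R`/`C` statuses and two otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the status letter of one metadata field, or 0 if it looks malformed. */
static char meta_status(const char *meta)
{
	const char *sp = strrchr(meta, ' ');

	return (meta[0] == ':' && sp && sp[1]) ? sp[1] : 0;
}

/*
 * Copy NUL-delimited raw diff records from `in` to `out`, writing a lone
 * NUL byte between every group of `batch` records. Returns the number of
 * bytes written, or -1 on malformed input or insufficient output space.
 */
static long add_flush_markers(const char *in, size_t in_len,
			      char *out, size_t out_cap, size_t batch)
{
	size_t i = 0, o = 0, nrec = 0;

	assert(in && out && batch > 0);

	while (i < in_len) {
		const char *meta = in + i;
		size_t fields, f;
		char status;

		/* The metadata field must be terminated within the buffer. */
		if (!memchr(meta, '\0', in_len - i))
			return -1;
		status = meta_status(meta);
		if (!status)
			return -1;

		/* A lone NUL between records asks diff-pairs to flush. */
		if (nrec && nrec % batch == 0) {
			if (o >= out_cap)
				return -1;
			out[o++] = '\0';
		}

		/* Renames and copies carry a source and a destination path. */
		fields = (status == 'R' || status == 'C') ? 3 : 2;
		for (f = 0; f < fields; f++) {
			const char *end;
			size_t n;

			if (i >= in_len)
				return -1;
			end = memchr(in + i, '\0', in_len - i);
			if (!end)
				return -1;
			n = (size_t)(end - (in + i)) + 1;
			if (n > out_cap - o)
				return -1;
			memcpy(out + o, in + i, n);
			o += n;
			i += n;
		}
		nrec++;
	}
	return (long)o;
}
```

The resulting buffer could then be written to the stdin of `git diff-pairs -z`. Per the patch, each lone NUL makes the command run diffcore_std() and diff_flush() on what has been queued so far and write a NUL to its output to delimit the batch.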
* [PATCH v5 0/4] batch blob diff generation 2025-02-28 0:26 ` [PATCH v4 0/4] " Justin Tobler ` (3 preceding siblings ...) 2025-02-28 0:26 ` [PATCH v4 4/4] builtin/diff-pairs: allow explicit diff queue flush Justin Tobler @ 2025-02-28 21:33 ` Justin Tobler 2025-02-28 21:33 ` [PATCH v5 1/4] diff: return diff_filepair from diff queue helpers Justin Tobler ` (3 more replies) 4 siblings, 4 replies; 78+ messages in thread From: Justin Tobler @ 2025-02-28 21:33 UTC (permalink / raw) To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler

Through git-diff(1) it is possible to generate a diff directly between
two blobs. This is particularly useful when the pre-image and post-image
blobs are known and we only care about the diff between them.
Unfortunately, if a user has a batch of known blob pairs to compute
diffs for, there is currently not a way to do so via a single Git
process.

To enable support for batch diffs of multiple blob pairs, this series
introduces a new diff plumbing command git-diff-pairs(1) based on a
previous patch series submitted by Peff[1]. This command uses
NUL-delimited raw diffs as its source of input to control exactly which
filepairs are diffed. The advantage of using the raw diff format is that
it already has diff status type and object context information embedded
in each line, making it more efficient to generate diffs with as we can
avoid having to peel revisions to get some of the same info. For
example:

    git diff-tree -r -z -M $old $new |
    git diff-pairs -p -z

Here the output of git-diff-tree(1) is fed to git-diff-pairs(1) to
generate the same output that would be expected from `git diff-tree -p
-M`. While by itself not particularly useful, this means it is possible
to split git-diff-tree(1) output across multiple git-diff-pairs(1)
processes. Such a feature is useful on the server-side where diffs
between a large set of changes may not be feasible all at once due to
timeout concerns.

This command can be viewed as a backend tool that exposes Git's diff
machinery. In its current form, the frontend that generates the raw diff
lines used as input is expected to do most of the heavy lifting (i.e.
pathspec limiting, tree object expansion).

This series is structured as follows:

- Patch 1 adds some new helper functions to get access to the queued
  `diff_filepair` after `diff_queue()` is invoked.

- Patch 2 adds a new diff_options field that can be used to disable
  diff filepair status resolution. This prevents rename/copy statuses
  set from stdin from being altered when `diffcore_std()` is invoked.

- Patch 3 introduces the new git-diff-pairs(1) plumbing command.

- Patch 4 allows git-diff-pairs(1) to immediately compute diffs queued
  on stdin when a NUL-byte is written after a raw input line instead of
  waiting for stdin to close.

Changes since V4:

- Renamed usage and options variables to better follow convention.

- Removed unneeded PARSE_OPT_KEEP_UNKNOWN_OPT from `parse_options()`.

- Instead of using the deprecated `test_create_repo ()` in the tests,
  plain git-init(1) is used.

Changes since V3:

- Instead of relying on found_follow to prevent `diffcore_std()` from
  mutating diff filepair statuses, a new `diff_options` field,
  `skip_resolving_statuses`, is introduced to achieve the same result in
  a more specific manner.

- Parsing of diff options is now handled directly instead of going
  through `setup_revisions()`. This is done so the diff options can be
  appended to the usage options and printed in the usage message.

- Swapped to using `strbuf_getwholeline()` during stdin parsing to make
  the line terminator more configurable in the future.

- Stopped printing the usage message on errors to avoid masking the
  underlying error message.

- Added test setup to exercise submodule change diffs.

- Other minor cleanups.

Changes since V2:

- Pathspecs are not supported and thus rejected when provided as
  arguments.
  It should be possible in a future series to add support though.

- Tree objects present in `diff-pairs` input are rejected. Support for
  tree objects could be added in the future, but for now they are
  rejected to enable future support in a backwards compatible manner.

- The -z option is required by git-diff-pairs(1). The NUL-delimited raw
  diff format is the only accepted form of input. Consequently,
  NUL-delimited output is the only option in the `--raw` mode.

- git-diff-pairs(1) defaults to patch output instead of raw output.
  This better fits the intended use case of the command.

- A NUL-byte is now always used as the delimiter between batches of
  file pair diffs when queued diffs are explicitly computed by writing a
  NUL-byte on stdin.

- Several other small cleanups and fixes along with documentation
  changes.

Changes since V1:

- Changed from git-diff-blob(1) to git-diff-pairs(1) based on a
  previously submitted series.

- Instead of each line containing a pair of blob revisions, the raw
  diff format is used as input which already has diff status and object
  context embedded.

-Justin [1]: <20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net> Justin Tobler (4): diff: return diff_filepair from diff queue helpers diff: add option to skip resolving diff statuses builtin: introduce diff-pairs command builtin/diff-pairs: allow explicit diff queue flush .gitignore | 1 + Documentation/git-diff-pairs.adoc | 60 +++++++++ Documentation/meson.build | 1 + Makefile | 1 + builtin.h | 1 + builtin/diff-pairs.c | 207 ++++++++++++++++++++++++++++++ command-list.txt | 1 + diff.c | 72 ++++++++--- diff.h | 33 +++++ git.c | 1 + meson.build | 1 + t/meson.build | 1 + t/t4070-diff-pairs.sh | 90 +++++++++++++ 13 files changed, 449 insertions(+), 21 deletions(-) create mode 100644 Documentation/git-diff-pairs.adoc create mode 100644 builtin/diff-pairs.c create mode 100755 t/t4070-diff-pairs.sh Range-diff against v4: 1: b2e5486442 = 1: b2e5486442 diff: return diff_filepair from diff queue helpers 2: 31d80d99ae = 2: 31d80d99ae diff: add option to skip resolving diff statuses 3: 3722c02112 ! 3: 1024a4290c builtin: introduce diff-pairs command @@ builtin/diff-pairs.c (new) + int line_term = '\0'; + int ret; + -+ const char * const usagestr[] = { ++ const char * const builtin_diff_pairs_usage[] = { + N_("git diff-pairs -z [<diff-options>]"), + NULL + }; -+ struct option options[] = { ++ struct option builtin_diff_pairs_options[] = { + OPT_END() + }; + @@ builtin/diff-pairs.c (new) + * setup_revisions(). Explicitly handle parsing to ensure options are + * printed in the usage message. 
+ */ -+ parseopts = add_diff_options(options, &revs.diffopt); -+ show_usage_with_options_if_asked(argc, argv, usagestr, parseopts); ++ parseopts = add_diff_options(builtin_diff_pairs_options, &revs.diffopt); ++ show_usage_with_options_if_asked(argc, argv, builtin_diff_pairs_usage, parseopts); + + repo_config(repo, git_diff_basic_config, NULL); + revs.disable_stdin = 1; + revs.abbrev = 0; + revs.diff = 1; + -+ argc = parse_options(argc, argv, prefix, parseopts, usagestr, -+ PARSE_OPT_KEEP_UNKNOWN_OPT | -+ PARSE_OPT_KEEP_DASHDASH | -+ PARSE_OPT_KEEP_ARGV0); ++ argc = parse_options(argc, argv, prefix, parseopts, builtin_diff_pairs_usage, ++ PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_DASHDASH); + + if (setup_revisions(argc, argv, &revs, NULL) > 1) + usagef(_("unrecognized argument: %s"), argv[0]); @@ t/t4070-diff-pairs.sh (new) +# typechange entries. This includes a submodule to test submodule diff support. +test_expect_success 'setup' ' + test_config_global protocol.file.allow always && -+ test_create_repo sub && ++ git init sub && + test_commit -C sub initial && + -+ test_create_repo main && ++ git init main && + cd main && + echo to-be-gone >deleted && + echo original >modified && 4: a4809cbd80 ! 4: 56f21b664e builtin/diff-pairs: allow explicit diff queue flush @@ Documentation/git-diff-pairs.adoc: in the NUL-terminated raw output format as ge ## builtin/diff-pairs.c ## @@ builtin/diff-pairs.c: int cmd_diff_pairs(int argc, const char **argv, const char *prefix, - show_usage_with_options_if_asked(argc, argv, usagestr, parseopts); + show_usage_with_options_if_asked(argc, argv, builtin_diff_pairs_usage, parseopts); repo_config(repo, git_diff_basic_config, NULL); + revs.diffopt.no_free = 1; -- 2.49.0.rc0 ^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v5 1/4] diff: return diff_filepair from diff queue helpers 2025-02-28 21:33 ` [PATCH v5 0/4] batch blob diff generation Justin Tobler @ 2025-02-28 21:33 ` Justin Tobler 2025-03-03 16:17 ` Junio C Hamano 2025-02-28 21:33 ` [PATCH v5 2/4] diff: add option to skip resolving diff statuses Justin Tobler ` (2 subsequent siblings) 3 siblings, 1 reply; 78+ messages in thread From: Justin Tobler @ 2025-02-28 21:33 UTC (permalink / raw) To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler The `diff_addremove()` and `diff_change()` functions set up and queue diffs, but do not return the `diff_filepair` added to the queue. In a subsequent commit, modifications to `diff_filepair` need to occur in certain cases after being queued. Since the existing `diff_addremove()` and `diff_change()` are also used for callbacks in `diff_options` as types `add_remove_fn_t` and `change_fn_t`, modifying the existing function signatures requires further changes. The diff options for pruning use `file_add_remove()` and `file_change()` where file pairs do not even get queued. Thus, separate functions are implemented instead. Split out the queuing operations into `diff_queue_addremove()` and `diff_queue_change()` which also return a handle to the queued `diff_filepair`. Both `diff_addremove()` and `diff_change()` are reimplemented as thin wrappers around the new functions. 
Signed-off-by: Justin Tobler <jltobler@gmail.com> --- diff.c | 70 +++++++++++++++++++++++++++++++++++++++++----------------- diff.h | 25 +++++++++++++++++++++ 2 files changed, 75 insertions(+), 20 deletions(-) diff --git a/diff.c b/diff.c index 019fb893a7..b5a779f997 100644 --- a/diff.c +++ b/diff.c @@ -7157,16 +7157,19 @@ void compute_diffstat(struct diff_options *options, options->found_changes = !!diffstat->nr; } -void diff_addremove(struct diff_options *options, - int addremove, unsigned mode, - const struct object_id *oid, - int oid_valid, - const char *concatpath, unsigned dirty_submodule) +struct diff_filepair *diff_queue_addremove(struct diff_queue_struct *queue, + struct diff_options *options, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, + const char *concatpath, + unsigned dirty_submodule) { struct diff_filespec *one, *two; + struct diff_filepair *pair; if (S_ISGITLINK(mode) && is_submodule_ignored(concatpath, options)) - return; + return NULL; /* This may look odd, but it is a preparation for * feeding "there are unchanged files which should @@ -7186,7 +7189,7 @@ void diff_addremove(struct diff_options *options, if (options->prefix && strncmp(concatpath, options->prefix, options->prefix_length)) - return; + return NULL; one = alloc_filespec(concatpath); two = alloc_filespec(concatpath); @@ -7198,25 +7201,29 @@ void diff_addremove(struct diff_options *options, two->dirty_submodule = dirty_submodule; } - diff_queue(&diff_queued_diff, one, two); + pair = diff_queue(queue, one, two); if (!options->flags.diff_from_contents) options->flags.has_changes = 1; + + return pair; } -void diff_change(struct diff_options *options, - unsigned old_mode, unsigned new_mode, - const struct object_id *old_oid, - const struct object_id *new_oid, - int old_oid_valid, int new_oid_valid, - const char *concatpath, - unsigned old_dirty_submodule, unsigned new_dirty_submodule) +struct diff_filepair *diff_queue_change(struct diff_queue_struct 
*queue, + struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *concatpath, + unsigned old_dirty_submodule, + unsigned new_dirty_submodule) { struct diff_filespec *one, *two; struct diff_filepair *p; if (S_ISGITLINK(old_mode) && S_ISGITLINK(new_mode) && is_submodule_ignored(concatpath, options)) - return; + return NULL; if (options->flags.reverse_diff) { SWAP(old_mode, new_mode); @@ -7227,7 +7234,7 @@ void diff_change(struct diff_options *options, if (options->prefix && strncmp(concatpath, options->prefix, options->prefix_length)) - return; + return NULL; one = alloc_filespec(concatpath); two = alloc_filespec(concatpath); @@ -7235,19 +7242,42 @@ void diff_change(struct diff_options *options, fill_filespec(two, new_oid, new_oid_valid, new_mode); one->dirty_submodule = old_dirty_submodule; two->dirty_submodule = new_dirty_submodule; - p = diff_queue(&diff_queued_diff, one, two); + p = diff_queue(queue, one, two); if (options->flags.diff_from_contents) - return; + return p; if (options->flags.quick && options->skip_stat_unmatch && !diff_filespec_check_stat_unmatch(options->repo, p)) { diff_free_filespec_data(p->one); diff_free_filespec_data(p->two); - return; + return p; } options->flags.has_changes = 1; + + return p; +} + +void diff_addremove(struct diff_options *options, int addremove, unsigned mode, + const struct object_id *oid, int oid_valid, + const char *concatpath, unsigned dirty_submodule) +{ + diff_queue_addremove(&diff_queued_diff, options, addremove, mode, oid, + oid_valid, concatpath, dirty_submodule); +} + +void diff_change(struct diff_options *options, + unsigned old_mode, unsigned new_mode, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *concatpath, + unsigned old_dirty_submodule, unsigned new_dirty_submodule) +{ + 
diff_queue_change(&diff_queued_diff, options, old_mode, new_mode, + old_oid, new_oid, old_oid_valid, new_oid_valid, + concatpath, old_dirty_submodule, new_dirty_submodule); } struct diff_filepair *diff_unmerge(struct diff_options *options, const char *path) diff --git a/diff.h b/diff.h index 0a566f5531..63afa17e84 100644 --- a/diff.h +++ b/diff.h @@ -508,6 +508,31 @@ void diff_set_default_prefix(struct diff_options *options); int diff_can_quit_early(struct diff_options *); +/* + * Stages changes in the provided diff queue for file additions and deletions. + * If a file pair gets queued, it is returned. + */ +struct diff_filepair *diff_queue_addremove(struct diff_queue_struct *queue, + struct diff_options *, + int addremove, unsigned mode, + const struct object_id *oid, + int oid_valid, const char *fullpath, + unsigned dirty_submodule); + +/* + * Stages changes in the provided diff queue for file modifications. + * If a file pair gets queued, it is returned. + */ +struct diff_filepair *diff_queue_change(struct diff_queue_struct *queue, + struct diff_options *, + unsigned mode1, unsigned mode2, + const struct object_id *old_oid, + const struct object_id *new_oid, + int old_oid_valid, int new_oid_valid, + const char *fullpath, + unsigned dirty_submodule1, + unsigned dirty_submodule2); + void diff_addremove(struct diff_options *, int addremove, unsigned mode, -- 2.49.0.rc0 ^ permalink raw reply related [flat|nested] 78+ messages in thread
* Re: [PATCH v5 1/4] diff: return diff_filepair from diff queue helpers
From: Junio C Hamano @ 2025-03-03 16:17 UTC
To: Justin Tobler; +Cc: git, ps, karthik.188, phillip.wood123

Justin Tobler <jltobler@gmail.com> writes:

> The `diff_addremove()` and `diff_change()` functions set up and queue
> diffs, but do not return the `diff_filepair` added to the queue. In a
> subsequent commit, modifications to `diff_filepair` need to occur in
> certain cases after being queued.
>
> Since the existing `diff_addremove()` and `diff_change()` are also used
> for callbacks in `diff_options` as types `add_remove_fn_t` and
> `change_fn_t`, modifying the existing function signatures requires
> further changes.

Sensible. The patch presented below looks a sane and safe no-op for
the existing code paths, which is what we want to see in a
preliminary refactoring step like this one.

> The diff options for pruning use `file_add_remove()`
> and `file_change()` where file pairs do not even get queued. Thus,
> separate functions are implemented instead.

This looked a bit confusing, but it is an explanation for the reason
why we simply do not change the function signature of the callback
members of diff_options structure. These addremove/change callbacks
are designed to be a general way to allow applications to react to
discovered changes to paths, and I agree that it makes sense for
them to be usable to perform something that has nothing to do with
the diff_queue structure.

> Split out the queuing operations into `diff_queue_addremove()` and
> `diff_queue_change()` which also return a handle to the queued
> `diff_filepair`. Both `diff_addremove()` and `diff_change()` are
> reimplemented as thin wrappers around the new functions.

Nice.

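The refactoring discussed for patch 1/4 can be pictured with a small, self-contained C sketch. It is only an illustration and does not use Git's headers: `sketch_pair`, `sketch_queue`, `sketch_queue_change()` and `sketch_change()` are hypothetical stand-ins for `diff_filepair`, `diff_queue_struct`, `diff_queue_change()` and `diff_change()`, and the prefix check is a simplified version of the filtering that makes the real helpers return NULL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for diff_filepair and diff_queue_struct. */
struct sketch_pair {
	char *path;
	char status;
};

struct sketch_queue {
	struct sketch_pair **items;
	size_t nr, alloc;
};

/* Plays the role of the global diff_queued_diff. */
static struct sketch_queue sketch_default_queue;

/*
 * Queue a pair on an explicit queue and return it, or return NULL when
 * the path is filtered out (here: it does not start with `prefix`).
 */
struct sketch_pair *sketch_queue_change(struct sketch_queue *queue,
					const char *prefix, const char *path)
{
	struct sketch_pair *pair;

	assert(queue && path);
	if (prefix && strncmp(path, prefix, strlen(prefix)))
		return NULL;

	if (queue->nr == queue->alloc) {
		size_t alloc = queue->alloc ? 2 * queue->alloc : 4;
		struct sketch_pair **items =
			realloc(queue->items, alloc * sizeof(*items));
		if (!items)
			abort();
		queue->items = items;
		queue->alloc = alloc;
	}

	pair = malloc(sizeof(*pair));
	if (!pair)
		abort();
	pair->path = malloc(strlen(path) + 1);
	if (!pair->path)
		abort();
	strcpy(pair->path, path);
	pair->status = 0;
	queue->items[queue->nr++] = pair;
	return pair;
}

/*
 * Callback-compatible wrapper: keeps the old void signature, always
 * targets the global queue, and drops the returned handle.
 */
void sketch_change(const char *prefix, const char *path)
{
	sketch_queue_change(&sketch_default_queue, prefix, path);
}
```

A caller that needs to adjust the queued entry afterwards (as diff-pairs does with `pair->status`) uses the returning variant and checks for NULL; existing callback users keep calling the void wrapper unchanged.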
* [PATCH v5 2/4] diff: add option to skip resolving diff statuses
From: Justin Tobler @ 2025-02-28 21:33 UTC
To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler

By default, `diffcore_std()` resolves the statuses for queued diff file
pairs by calling `diff_resolve_rename_copy()`. If status information is
already manually set, invoking `diffcore_std()` may change the status
value.

Introduce the `skip_resolving_statuses` diff option that prevents
`diffcore_std()` from resolving file pair statuses when enabled.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 diff.c | 2 +-
 diff.h | 8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/diff.c b/diff.c
index b5a779f997..37cc88c75b 100644
--- a/diff.c
+++ b/diff.c
@@ -7081,7 +7081,7 @@ void diffcore_std(struct diff_options *options)
 		diffcore_order(options->orderfile);
 	if (options->rotate_to)
 		diffcore_rotate(options);
-	if (!options->found_follow)
+	if (!options->found_follow && !options->skip_resolving_statuses)
 		/* See try_to_follow_renames() in tree-diff.c */
 		diff_resolve_rename_copy();
 	diffcore_apply_filter(options);
diff --git a/diff.h b/diff.h
index 63afa17e84..fc791ee2cc 100644
--- a/diff.h
+++ b/diff.h
@@ -353,6 +353,14 @@ struct diff_options {
 	/* to support internal diff recursion by --follow hack*/
 	int found_follow;
 
+	/*
+	 * By default, diffcore_std() resolves the statuses for queued diff file
+	 * pairs by calling diff_resolve_rename_copy(). If status information
+	 * has already been manually set, this option prevents diffcore_std()
+	 * from resetting statuses.
+	 */
+	int skip_resolving_statuses;
+
 	/* Callback which allows tweaking the options in diff_setup_done(). */
 	void (*set_default)(struct diff_options *);
 
-- 
2.49.0.rc0

* Re: [PATCH v5 2/4] diff: add option to skip resolving diff statuses
From: Junio C Hamano @ 2025-03-03 16:19 UTC
To: Justin Tobler; +Cc: git, ps, karthik.188, phillip.wood123

Justin Tobler <jltobler@gmail.com> writes:

> By default, `diffcore_std()` resolves the statuses for queued diff file
> pairs by calling `diff_resolve_rename_copy()`. If status information is
> already manually set, invoking `diffcore_std()` may change the status
> value.
>
> Introduce the `skip_resolving_statuses` diff option that prevents
> `diffcore_std()` from resolving file pair statuses when enabled.

Makes sense.

>
> Signed-off-by: Justin Tobler <jltobler@gmail.com>
> ---
>  diff.c | 2 +-
>  diff.h | 8 ++++++++
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/diff.c b/diff.c
> index b5a779f997..37cc88c75b 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -7081,7 +7081,7 @@ void diffcore_std(struct diff_options *options)
>  		diffcore_order(options->orderfile);
>  	if (options->rotate_to)
>  		diffcore_rotate(options);
> -	if (!options->found_follow)
> +	if (!options->found_follow && !options->skip_resolving_statuses)
>  		/* See try_to_follow_renames() in tree-diff.c */
>  		diff_resolve_rename_copy();
>  	diffcore_apply_filter(options);
> diff --git a/diff.h b/diff.h
> index 63afa17e84..fc791ee2cc 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -353,6 +353,14 @@ struct diff_options {
>  	/* to support internal diff recursion by --follow hack*/
>  	int found_follow;
>
> +	/*
> +	 * By default, diffcore_std() resolves the statuses for queued diff file
> +	 * pairs by calling diff_resolve_rename_copy(). If status information
> +	 * has already been manually set, this option prevents diffcore_std()
> +	 * from resetting statuses.
> +	 */
> +	int skip_resolving_statuses;
> +
>  	/* Callback which allows tweaking the options in diff_setup_done(). */
>  	void (*set_default)(struct diff_options *);

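The effect of the new option can be illustrated with a toy model, written as a self-contained C sketch. Nothing here is Git's actual code: `status_pair`, `status_options`, `resolve_status()` and `resolve_all()` are hypothetical names, and the resolver is a deliberately naive stand-in for `diff_resolve_rename_copy()`. The point is only that a resolving pass can overwrite a status the caller set by hand unless a flag tells it to leave statuses alone.

```c
#include <assert.h>
#include <stddef.h>

struct status_pair {
	int old_valid;	/* pre-image exists */
	int new_valid;	/* post-image exists */
	char status;	/* 'A', 'D', 'M', 'R', ... or 0 when unset */
};

struct status_options {
	int skip_resolving_statuses;
};

/* Toy resolver: derives a status from which sides of the pair exist. */
static char resolve_status(const struct status_pair *pair)
{
	if (!pair->old_valid)
		return 'A';
	if (!pair->new_valid)
		return 'D';
	return 'M';
}

/* Toy stand-in for the status-resolving step of diffcore_std(). */
void resolve_all(const struct status_options *opts,
		 struct status_pair *pairs, size_t nr)
{
	size_t i;

	assert(opts && (pairs || !nr));
	if (opts->skip_resolving_statuses)
		return;	/* trust the statuses the caller already set */
	for (i = 0; i < nr; i++)
		pairs[i].status = resolve_status(&pairs[i]);
}
```

With the flag set, a pair that was queued with status 'R' keeps it; with the flag clear, this toy resolver would rewrite it to 'M' because both sides exist.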
* [PATCH v5 3/4] builtin: introduce diff-pairs command
From: Justin Tobler @ 2025-02-28 21:33 UTC
To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler, Jeff King

Through git-diff(1), a single diff can be generated from a pair of blob
revisions directly. Unfortunately, there is not a mechanism to compute
batches of specific file pair diffs in a single process. Such a feature
is particularly useful on the server-side where diffing between a large
set of changes is not feasible all at once due to timeout concerns.

To facilitate this, introduce git-diff-pairs(1) which acts as a backend
passing its NUL-terminated raw diff format input from stdin through diff
machinery to produce various forms of output such as patch or raw.

The raw format was originally designed as an interchange format and
represents the contents of the diff_queued_diff list making it possible
to break the diff pipeline into separate stages. For example,
git-diff-tree(1) can be used as a frontend to compute file pairs to
queue and feed its raw output to git-diff-pairs(1) to compute patches.
With this, batches of diffs can be progressively generated without
having to recompute renames or retrieve object context. Something like
the following:

	git diff-tree -r -z -M $old $new |
	git diff-pairs -p -z

should generate the same output as `git diff-tree -p -M`.
Furthermore, each line of raw diff formatted input can also be individually fed to a separate git-diff-pairs(1) process and still produce the same output. Based-on-patch-by: Jeff King <peff@peff.net> Signed-off-by: Justin Tobler <jltobler@gmail.com> --- .gitignore | 1 + Documentation/git-diff-pairs.adoc | 56 +++++++++ Documentation/meson.build | 1 + Makefile | 1 + builtin.h | 1 + builtin/diff-pairs.c | 193 ++++++++++++++++++++++++++++++ command-list.txt | 1 + git.c | 1 + meson.build | 1 + t/meson.build | 1 + t/t4070-diff-pairs.sh | 81 +++++++++++++ 11 files changed, 338 insertions(+) create mode 100644 Documentation/git-diff-pairs.adoc create mode 100644 builtin/diff-pairs.c create mode 100755 t/t4070-diff-pairs.sh diff --git a/.gitignore b/.gitignore index 08a66ca508..04c444404e 100644 --- a/.gitignore +++ b/.gitignore @@ -55,6 +55,7 @@ /git-diff /git-diff-files /git-diff-index +/git-diff-pairs /git-diff-tree /git-difftool /git-difftool--helper diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc new file mode 100644 index 0000000000..e31f2e2fbb --- /dev/null +++ b/Documentation/git-diff-pairs.adoc @@ -0,0 +1,56 @@ +git-diff-pairs(1) +================= + +NAME +---- +git-diff-pairs - Compare the content and mode of provided blob pairs + +SYNOPSIS +-------- +[synopsis] +git diff-pairs -z [<diff-options>] + +DESCRIPTION +----------- +Show changes for file pairs provided on stdin. Input for this command must be +in the NUL-terminated raw output format as generated by commands such as `git +diff-tree -z -r --raw`. By default, the outputted diffs are computed and shown +in the patch format when stdin closes. + +Usage of this command enables the traditional diff pipeline to be broken up +into separate stages where `diff-pairs` acts as the output phase. Other +commands, such as `diff-tree`, may serve as a frontend to compute the raw +diff format used as input. 
+ +Instead of computing diffs via `git diff-tree -p -M` in one step, `diff-tree` +can compute the file pairs and rename information without the blob diffs. This +output can be fed to `diff-pairs` to generate the underlying blob diffs as done +in the following example: + +----------------------------- +git diff-tree -z -r -M $a $b | +git diff-pairs -z +----------------------------- + +Computing the tree diff upfront with rename information allows patch output +from `diff-pairs` to be progressively computed over the course of potentially +multiple invocations. + +Pathspecs are not currently supported by `diff-pairs`. Pathspec limiting should +be performed by the upstream command generating the raw diffs used as input. + +Tree objects are not currently supported as input and are rejected. + +Abbreviated object IDs in the `diff-pairs` input are not supported. Outputted +object IDs can be abbreviated using the `--abbrev` option. + +OPTIONS +------- + +include::diff-options.adoc[] + +include::diff-generate-patch.adoc[] + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/meson.build b/Documentation/meson.build index 1129ce4c85..ce990e9fe5 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -42,6 +42,7 @@ manpages = { 'git-diagnose.adoc' : 1, 'git-diff-files.adoc' : 1, 'git-diff-index.adoc' : 1, + 'git-diff-pairs.adoc' : 1, 'git-difftool.adoc' : 1, 'git-diff-tree.adoc' : 1, 'git-diff.adoc' : 1, diff --git a/Makefile b/Makefile index bcf5ed3f85..56df7aed3f 100644 --- a/Makefile +++ b/Makefile @@ -1242,6 +1242,7 @@ BUILTIN_OBJS += builtin/describe.o BUILTIN_OBJS += builtin/diagnose.o BUILTIN_OBJS += builtin/diff-files.o BUILTIN_OBJS += builtin/diff-index.o +BUILTIN_OBJS += builtin/diff-pairs.o BUILTIN_OBJS += builtin/diff-tree.o BUILTIN_OBJS += builtin/diff.o BUILTIN_OBJS += builtin/difftool.o diff --git a/builtin.h b/builtin.h index 89928ccf92..e6aad3a6a1 100644 --- a/builtin.h +++ b/builtin.h @@ -153,6 +153,7 @@ int 
cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo); +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo); diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c new file mode 100644 index 0000000000..6be17c1abd --- /dev/null +++ b/builtin/diff-pairs.c @@ -0,0 +1,193 @@ +#include "builtin.h" +#include "config.h" +#include "diff.h" +#include "diffcore.h" +#include "gettext.h" +#include "hash.h" +#include "hex.h" +#include "object.h" +#include "parse-options.h" +#include "revision.h" +#include "strbuf.h" + +static unsigned parse_mode_or_die(const char *mode, const char **end) +{ + uint16_t ret; + + *end = parse_mode(mode, &ret); + if (!*end) + die(_("unable to parse mode: %s"), mode); + return ret; +} + +static void parse_oid_or_die(const char *hex, struct object_id *oid, + const char **end, const struct git_hash_algo *algop) +{ + if (parse_oid_hex_algop(hex, oid, end, algop) || *(*end)++ != ' ') + die(_("unable to parse object id: %s"), hex); +} + +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, + struct repository *repo) +{ + struct strbuf path_dst = STRBUF_INIT; + struct strbuf path = STRBUF_INIT; + struct strbuf meta = STRBUF_INIT; + struct option *parseopts; + struct rev_info revs; + int line_term = '\0'; + int ret; + + const char * const builtin_diff_pairs_usage[] = { + N_("git diff-pairs -z [<diff-options>]"), + NULL + }; + 
struct option builtin_diff_pairs_options[] = { + OPT_END() + }; + + repo_init_revisions(repo, &revs, prefix); + + /* + * Diff options are usually parsed implicitly as part of + * setup_revisions(). Explicitly handle parsing to ensure options are + * printed in the usage message. + */ + parseopts = add_diff_options(builtin_diff_pairs_options, &revs.diffopt); + show_usage_with_options_if_asked(argc, argv, builtin_diff_pairs_usage, parseopts); + + repo_config(repo, git_diff_basic_config, NULL); + revs.disable_stdin = 1; + revs.abbrev = 0; + revs.diff = 1; + + argc = parse_options(argc, argv, prefix, parseopts, builtin_diff_pairs_usage, + PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_DASHDASH); + + if (setup_revisions(argc, argv, &revs, NULL) > 1) + usagef(_("unrecognized argument: %s"), argv[0]); + + /* + * With the -z option, both command input and raw output are + * NUL-delimited (this mode does not affect patch output). At present + * only NUL-delimited raw diff formatted input is supported. + */ + if (revs.diffopt.line_termination) + usage(_("working without -z is not supported")); + + if (revs.prune_data.nr) + usage(_("pathspec arguments not supported")); + + if (revs.pending.nr || revs.max_count != -1 || + revs.min_age != (timestamp_t)-1 || + revs.max_age != (timestamp_t)-1) + usage(_("revision arguments not allowed")); + + if (!revs.diffopt.output_format) + revs.diffopt.output_format = DIFF_FORMAT_PATCH; + + /* + * If rename detection is not requested, use rename information from the + * raw diff formatted input. Setting skip_resolving_statuses ensures + * diffcore_std() does not mess with rename information already present + * in queued filepairs. 
+ */ + if (!revs.diffopt.detect_rename) + revs.diffopt.skip_resolving_statuses = 1; + + while (1) { + struct object_id oid_a, oid_b; + struct diff_filepair *pair; + unsigned mode_a, mode_b; + const char *p; + char status; + + if (strbuf_getwholeline(&meta, stdin, line_term) == EOF) + break; + + p = meta.buf; + if (*p != ':') + die(_("invalid raw diff input")); + p++; + + mode_a = parse_mode_or_die(p, &p); + mode_b = parse_mode_or_die(p, &p); + + if (S_ISDIR(mode_a) || S_ISDIR(mode_b)) + die(_("tree objects not supported")); + + parse_oid_or_die(p, &oid_a, &p, repo->hash_algo); + parse_oid_or_die(p, &oid_b, &p, repo->hash_algo); + + status = *p++; + + if (strbuf_getwholeline(&path, stdin, line_term) == EOF) + die(_("got EOF while reading path")); + + switch (status) { + case DIFF_STATUS_ADDED: + pair = diff_queue_addremove(&diff_queued_diff, + &revs.diffopt, '+', mode_b, + &oid_b, 1, path.buf, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_DELETED: + pair = diff_queue_addremove(&diff_queued_diff, + &revs.diffopt, '-', mode_a, + &oid_a, 1, path.buf, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_TYPE_CHANGED: + case DIFF_STATUS_MODIFIED: + pair = diff_queue_change(&diff_queued_diff, &revs.diffopt, + mode_a, mode_b, &oid_a, &oid_b, + 1, 1, path.buf, 0, 0); + if (pair) + pair->status = status; + break; + + case DIFF_STATUS_RENAMED: + case DIFF_STATUS_COPIED: { + struct diff_filespec *a, *b; + unsigned int score; + + if (strbuf_getwholeline(&path_dst, stdin, line_term) == EOF) + die(_("got EOF while reading destination path")); + + a = alloc_filespec(path.buf); + b = alloc_filespec(path_dst.buf); + fill_filespec(a, &oid_a, 1, mode_a); + fill_filespec(b, &oid_b, 1, mode_b); + + pair = diff_queue(&diff_queued_diff, a, b); + + if (strtoul_ui(p, 10, &score)) + die(_("unable to parse rename/copy score: %s"), p); + + pair->score = score * MAX_SCORE / 100; + pair->status = status; + pair->renamed_pair = 1; + } + break; + + 
default: + die(_("unknown diff status: %c"), status); + } + } + + diffcore_std(&revs.diffopt); + diff_flush(&revs.diffopt); + ret = diff_result_code(&revs); + + strbuf_release(&path_dst); + strbuf_release(&path); + strbuf_release(&meta); + release_revisions(&revs); + FREE_AND_NULL(parseopts); + + return ret; +} diff --git a/command-list.txt b/command-list.txt index c537114b46..b7ade3ab9f 100644 --- a/command-list.txt +++ b/command-list.txt @@ -96,6 +96,7 @@ git-diagnose ancillaryinterrogators git-diff mainporcelain info git-diff-files plumbinginterrogators git-diff-index plumbinginterrogators +git-diff-pairs plumbinginterrogators git-diff-tree plumbinginterrogators git-difftool ancillaryinterrogators complete git-fast-export ancillarymanipulators diff --git a/git.c b/git.c index 450d6aaa86..77c4359522 100644 --- a/git.c +++ b/git.c @@ -541,6 +541,7 @@ static struct cmd_struct commands[] = { { "diff", cmd_diff, NO_PARSEOPT }, { "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT }, { "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT }, + { "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT }, { "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT }, { "difftool", cmd_difftool, RUN_SETUP_GENTLY }, { "fast-export", cmd_fast_export, RUN_SETUP }, diff --git a/meson.build b/meson.build index bf95576f83..9e8b365d2a 100644 --- a/meson.build +++ b/meson.build @@ -540,6 +540,7 @@ builtin_sources = [ 'builtin/diagnose.c', 'builtin/diff-files.c', 'builtin/diff-index.c', + 'builtin/diff-pairs.c', 'builtin/diff-tree.c', 'builtin/diff.c', 'builtin/difftool.c', diff --git a/t/meson.build b/t/meson.build index 780939d49f..09c7bc2fad 100644 --- a/t/meson.build +++ b/t/meson.build @@ -500,6 +500,7 @@ integration_tests = [ 't4067-diff-partial-clone.sh', 't4068-diff-symmetric-merge-base.sh', 't4069-remerge-diff.sh', + 't4070-diff-pairs.sh', 't4100-apply-stat.sh', 't4101-apply-nonl.sh', 't4102-apply-rename.sh', diff --git a/t/t4070-diff-pairs.sh 
b/t/t4070-diff-pairs.sh new file mode 100755 index 0000000000..0878ad0ad1 --- /dev/null +++ b/t/t4070-diff-pairs.sh @@ -0,0 +1,81 @@ +#!/bin/sh + +test_description='basic diff-pairs tests' +. ./test-lib.sh + +# This creates a diff with added, modified, deleted, renamed, copied, and +# typechange entries. This includes a submodule to test submodule diff support. +test_expect_success 'setup' ' + test_config_global protocol.file.allow always && + git init sub && + test_commit -C sub initial && + + git init main && + cd main && + echo to-be-gone >deleted && + echo original >modified && + echo now-a-file >symlink && + test_seq 200 >two-hundred && + test_seq 201 500 >five-hundred && + git add . && + test_tick && + git commit -m base && + git tag base && + + git submodule add ../sub && + echo now-here >added && + echo new >modified && + rm deleted && + mkdir subdir && + echo content >subdir/file && + mv two-hundred renamed && + test_seq 201 500 | sed s/300/modified/ >copied && + rm symlink && + git add -A . 
&& + test_ln_s_add dest symlink && + test_tick && + git commit -m new && + git tag new +' + +test_expect_success 'diff-pairs recreates --raw' ' + git diff-tree -r -M -C -C -z base new >expect && + git diff-pairs --raw -z >actual <expect && + test_cmp expect actual +' + +test_expect_success 'diff-pairs can create -p output' ' + git diff-tree -p -M -C -C base new >expect && + git diff-tree -r -M -C -C -z base new | + git diff-pairs -p -z >actual && + test_cmp expect actual +' + +test_expect_success 'diff-pairs does not support normal raw diff input' ' + git diff-tree -r base new | + test_must_fail git diff-pairs >out 2>err && + + echo "usage: working without -z is not supported" >expect && + test_must_be_empty out && + test_cmp expect err +' + +test_expect_success 'diff-pairs does not support tree objects as input' ' + git diff-tree -z base new | + test_must_fail git diff-pairs -z >out 2>err && + + echo "fatal: tree objects not supported" >expect && + test_must_be_empty out && + test_cmp expect err +' + +test_expect_success 'diff-pairs does not support pathspec arguments' ' + git diff-tree -r -z base new | + test_must_fail git diff-pairs -z -- new >out 2>err && + + echo "usage: pathspec arguments not supported" >expect && + test_must_be_empty out && + test_cmp expect err +' + +test_done -- 2.49.0.rc0 ^ permalink raw reply related [flat|nested] 78+ messages in thread
* Re: [PATCH v5 3/4] builtin: introduce diff-pairs command
From: Junio C Hamano @ 2025-03-03 16:30 UTC
To: Justin Tobler; +Cc: git, ps, karthik.188, phillip.wood123, Jeff King

Justin Tobler <jltobler@gmail.com> writes:

> +static unsigned parse_mode_or_die(const char *mode, const char **end)

A minor naming issue, but the previous round called this endp, which
is probably a better name; making it explicit that it is a pointer
to receive the discovered end of converted string is in line with
how a similar parameter to strtol() and friends is named as "endptr".

Not worth a reroll to rename this one alone, though.

> +	while (1) {
> +		struct object_id oid_a, oid_b;
> +		struct diff_filepair *pair;
> +		unsigned mode_a, mode_b;
> +		const char *p;
> +		char status;
> +
> +		if (strbuf_getwholeline(&meta, stdin, line_term) == EOF)
> +			break;

Nice.

> diff --git a/git.c b/git.c
> index 450d6aaa86..77c4359522 100644
> --- a/git.c
> +++ b/git.c
> @@ -541,6 +541,7 @@ static struct cmd_struct commands[] = {
>  	{ "diff", cmd_diff, NO_PARSEOPT },
>  	{ "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>  	{ "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT },
> +	{ "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT },

OK. We need a repository to find objects named in our input, but we
do not need working tree. Makes sense.

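The input loop in patch 3/4 reads one NUL-terminated metadata field of the form `:<mode> <mode> <oid> <oid> <status>[<score>]` and then one or two NUL-terminated paths. The following is a minimal standalone C sketch of parsing just that metadata field, written without Git's headers. The names `raw_meta`, `parse_mode_field()`, `parse_oid_field()` and `parse_raw_meta()` are hypothetical, object IDs are kept as hex strings rather than `struct object_id`, and rejecting scores above 100 is a choice of this sketch rather than something the patch does.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define RAW_MAX_HEX 64	/* enough for SHA-256 object IDs */

struct raw_meta {
	unsigned mode_a, mode_b;
	char oid_a[RAW_MAX_HEX + 1];
	char oid_b[RAW_MAX_HEX + 1];
	char status;
	unsigned score;	/* 0-100, only meaningful for 'R' and 'C' */
};

/* Parse an octal mode followed by one space; 0 on success. */
static int parse_mode_field(const char **p, unsigned *mode)
{
	const char *s = *p;
	unsigned value = 0;

	if (*s < '0' || *s > '7')
		return -1;
	while (*s >= '0' && *s <= '7')
		value = (value << 3) | (unsigned)(*s++ - '0');
	if (*s++ != ' ')
		return -1;
	*mode = value;
	*p = s;
	return 0;
}

/* Parse exactly hex_len hex digits followed by one space. */
static int parse_oid_field(const char **p, size_t hex_len, char *out)
{
	const char *s = *p;
	size_t i;

	for (i = 0; i < hex_len; i++)
		if (!isxdigit((unsigned char)s[i]))
			return -1;	/* also stops at an early NUL */
	if (s[hex_len] != ' ')
		return -1;
	memcpy(out, s, hex_len);
	out[hex_len] = '\0';
	*p = s + hex_len + 1;
	return 0;
}

/* Parse one metadata field (without its terminating NUL and paths). */
int parse_raw_meta(const char *field, size_t hex_len, struct raw_meta *m)
{
	const char *p = field;
	char *end;
	unsigned long score;

	assert(field && m);
	if (!hex_len || hex_len > RAW_MAX_HEX || *p++ != ':')
		return -1;
	if (parse_mode_field(&p, &m->mode_a) ||
	    parse_mode_field(&p, &m->mode_b) ||
	    parse_oid_field(&p, hex_len, m->oid_a) ||
	    parse_oid_field(&p, hex_len, m->oid_b))
		return -1;

	m->status = *p++;
	m->score = 0;
	switch (m->status) {
	case 'A': case 'D': case 'M': case 'T':
		return 0;	/* one path follows */
	case 'R': case 'C':
		if (!isdigit((unsigned char)*p))
			return -1;
		score = strtoul(p, &end, 10);
		if (*end || score > 100)
			return -1;
		m->score = (unsigned)score;
		return 0;	/* two paths follow */
	default:
		return -1;
	}
}
```

Passing the hex length explicitly (40 for SHA-1, 64 for SHA-256) mirrors how the patch passes the repository's hash algorithm to its object ID parser, and it is also why abbreviated object IDs in the input fail to parse.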
* [PATCH v5 4/4] builtin/diff-pairs: allow explicit diff queue flush
From: Justin Tobler @ 2025-02-28 21:33 UTC
To: git; +Cc: ps, karthik.188, phillip.wood123, Justin Tobler

The diffs queued from git-diff-pairs(1) are flushed when stdin is
closed. To enable greater flexibility, allow control over when the diff
queue is flushed by writing a single NUL byte on stdin between input
file pairs. Diff output between flushes is separated by a single NUL
byte.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 Documentation/git-diff-pairs.adoc |  4 ++++
 builtin/diff-pairs.c              | 14 ++++++++++++++
 t/t4070-diff-pairs.sh             |  9 +++++++++
 3 files changed, 27 insertions(+)

diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc
index e31f2e2fbb..f99fcd1ead 100644
--- a/Documentation/git-diff-pairs.adoc
+++ b/Documentation/git-diff-pairs.adoc
@@ -17,6 +17,10 @@ in the NUL-terminated raw output format as generated by commands such as `git
 diff-tree -z -r --raw`. By default, the outputted diffs are computed and shown
 in the patch format when stdin closes.
 
+A single NUL byte may be written to stdin between raw input lines to compute
+file pair diffs up to that point instead of waiting for stdin to close. A NUL
+byte is also written to the output to delimit between these batches of diffs.
+
 Usage of this command enables the traditional diff pipeline to be broken up
 into separate stages where `diff-pairs` acts as the output phase. Other
 commands, such as `diff-tree`, may serve as a frontend to compute the raw
diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c
index 6be17c1abd..71c045331a 100644
--- a/builtin/diff-pairs.c
+++ b/builtin/diff-pairs.c
@@ -57,6 +57,7 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix,
 	show_usage_with_options_if_asked(argc, argv, builtin_diff_pairs_usage, parseopts);
 
 	repo_config(repo, git_diff_basic_config, NULL);
+	revs.diffopt.no_free = 1;
 	revs.disable_stdin = 1;
 	revs.abbrev = 0;
 	revs.diff = 1;
@@ -106,6 +107,18 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix,
 			break;
 
 		p = meta.buf;
+		if (!*p) {
+			diffcore_std(&revs.diffopt);
+			diff_flush(&revs.diffopt);
+			/*
+			 * When the diff queue is explicitly flushed, append a
+			 * NUL byte to separate batches of diffs.
+			 */
+			fputc('\0', revs.diffopt.file);
+			fflush(revs.diffopt.file);
+			continue;
+		}
+
 		if (*p != ':')
 			die(_("invalid raw diff input"));
 		p++;
@@ -179,6 +192,7 @@ int cmd_diff_pairs(int argc, const char **argv, const char *prefix,
 		}
 	}
 
+	revs.diffopt.no_free = 0;
 	diffcore_std(&revs.diffopt);
 	diff_flush(&revs.diffopt);
 	ret = diff_result_code(&revs);
diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh
index 0878ad0ad1..70deafb860 100755
--- a/t/t4070-diff-pairs.sh
+++ b/t/t4070-diff-pairs.sh
@@ -78,4 +78,13 @@ test_expect_success 'diff-pairs does not support pathspec arguments' '
 	test_cmp expect err
 '
 
+test_expect_success 'diff-pairs explicit queue flush' '
+	git diff-tree -r -M -C -C -z base new >expect &&
+	printf "\0" >>expect &&
+	git diff-tree -r -M -C -C -z base new >>expect &&
+
+	git diff-pairs --raw -z <expect >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.49.0.rc0

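The flush protocol from patch 4/4 (an empty NUL-terminated field between records closes the current batch, and end of input closes the last one) can be sketched in standalone C as follows. This is an illustration under stated assumptions, not the command's implementation: `split_batches()` is a hypothetical helper that only counts how many file pairs land in each batch, it finds the status letter by looking after the last space of the metadata field, and it does not model the NUL byte that the real command writes to its output after each explicit flush.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Walk a buffer of NUL-terminated fields in `diff-tree -z --raw` layout
 * and store the number of file pairs in each batch in batch_sizes[].
 * Returns the number of batches, or -1 on malformed input or when more
 * than max_batches batches are found.
 */
int split_batches(const char *buf, size_t len,
		  size_t *batch_sizes, size_t max_batches)
{
	size_t pos = 0, pending = 0, nr_batches = 0;

	assert(buf && batch_sizes);
	while (pos < len) {
		const char *meta = buf + pos;
		const char *end = memchr(meta, '\0', len - pos);
		const char *sp;
		size_t paths, i;

		if (!end)
			return -1;	/* unterminated field */
		pos = (size_t)(end - buf) + 1;

		if (end == meta) {
			/* lone NUL: close the current batch */
			if (nr_batches == max_batches)
				return -1;
			batch_sizes[nr_batches++] = pending;
			pending = 0;
			continue;
		}

		if (*meta != ':')
			return -1;
		sp = strrchr(meta, ' ');
		if (!sp || !sp[1])
			return -1;
		/* renames and copies carry a source and a destination path */
		paths = (sp[1] == 'R' || sp[1] == 'C') ? 2 : 1;

		for (i = 0; i < paths; i++) {
			if (pos >= len)
				return -1;	/* EOF while reading path */
			end = memchr(buf + pos, '\0', len - pos);
			if (!end)
				return -1;
			pos = (size_t)(end - buf) + 1;
		}
		pending++;
	}

	/* End of input acts as the final flush. */
	if (pending) {
		if (nr_batches == max_batches)
			return -1;
		batch_sizes[nr_batches++] = pending;
	}
	return (int)nr_batches;
}
```

A producer that wants output early would emit a lone NUL after every N records; a consumer can then pair each NUL-delimited chunk of output with the corresponding batch counted here.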