* [RFC PATCH] diff: add option to report binary files in raw diffs
From: Justin Tobler @ 2025-11-04 2:14 UTC
To: git; +Cc: karthik.188, Justin Tobler
When generating patch diff output, if either side of a filepair is
detected as binary, Git omits the diff content and instead prints a
"Binary files differ" message. From this message it is known that at
least one of the files in the pair is considered binary, but not exactly
which ones.
Add a --report-binary-files diff option that, when enabled, extends the
raw diff output format to explicitly indicate for each file whether it
was considered binary or not.
Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
Greetings,
I have a usecase where I would like to know exactly which files in a
diff pair are considered binary by Git when computing diffs. When
computing patch diff output, Git already omits filepair diffs where at
least one side is considered binary and prints a "binary files differ"
message instead. From this message we cannot discern exactly which files
were considered binary by Git though.
In this patch, the raw diff format is extended with a
`--report-binary-files` option to explicitly specify which files in the
diff pair were considered binary. The output in this form looks
something like this:
$ git diff-tree --abbrev=8 --report-binary-files HEAD~ HEAD
:100644 100644 a1961526 e231acb1 bt M foo
:100644 100644 31eedd5c 402a70d7 bb M bar
With this format, there is a new column before the status that specifies
the binary status for each file. 'b' indicates binary and 't' is used
otherwise.
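For illustration only (this is not part of the patch): a minimal sketch of how a script might read the proposed column. It assumes the layout described above, with the two-letter flag as the fifth space-separated field and a TAB between the metadata and the path, as in regular raw output. The helper name `report_binary_sides` is hypothetical.

```shell
# Hypothetical helper, not part of the patch. Reads raw diff lines on stdin
# and prints "<path> <pre-image> <post-image>" for each entry.
# Assumption: the two-letter flag is the fifth space-separated field and a
# TAB separates the metadata from the path.
report_binary_sides() {
	awk -F '\t' '
		function side(c) { return (c == "b") ? "binary" : "text" }
		/^:/ {
			split($1, meta, " ")
			print $2, side(substr(meta[5], 1, 1)), side(substr(meta[5], 2, 1))
		}
	'
}

printf ':100644 100644 a1961526 e231acb1 bt M\tfoo\n' | report_binary_sides
```

For renames and copies the sketch prints only the source path; a real consumer would also want the third TAB-separated field.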
In an earlier iteration of this patch, I originally extended the patch
output "binary files differ" message to indicate the binary status for
each file in the diff pair, but felt it wasn't the best place to do so
since I also want it to be machine friendly. So I ended up extending the
raw diff format instead.
I'm not entirely sure the current implementation is the ideal format
here so I'm very open to feedback. :)
-Justin
---
Documentation/diff-format.adoc | 12 ++++++++++++
Documentation/diff-options.adoc | 4 ++++
diff.c | 9 +++++++++
diff.h | 6 ++++++
t/t4012-diff-binary.sh | 29 +++++++++++++++++++++++++++++
5 files changed, 60 insertions(+)
diff --git a/Documentation/diff-format.adoc b/Documentation/diff-format.adoc
index 9f7e988241..74c0a064ad 100644
--- a/Documentation/diff-format.adoc
+++ b/Documentation/diff-format.adoc
@@ -83,6 +83,18 @@ quoted as explained for the configuration variable `core.quotePath`
(see linkgit:git-config[1]). Using `-z` the filename is output
verbatim and the line is terminated by a NUL byte.
+With the `--report-binary-files` option, a new column is added prior to the
+status indicating for each file if Git considered it binary or not. If
+considered binary, a file is denoted with `b`. Otherwise, `t` is used. This
+column is followed by a space character. Combined diffs do not report binary
+file info.
+
+Example:
+
+------------------------------------------------
+:100644 100644 5be4a4a cc95eb0 bt M file.c
+------------------------------------------------
+
diff format for merges
----------------------
diff --git a/Documentation/diff-options.adoc b/Documentation/diff-options.adoc
index ae31520f7f..54eb48c067 100644
--- a/Documentation/diff-options.adoc
+++ b/Documentation/diff-options.adoc
@@ -544,6 +544,10 @@ ifndef::git-format-patch[]
Implies `--patch`.
endif::git-format-patch[]
+`--report-binary-files`::
+ Adds a column to raw diff output to report for each file in the pair
+ whether it was considered binary by Git.
+
`--abbrev[=<n>]`::
Instead of showing the full 40-byte hexadecimal object
name in diff-raw format output and diff-tree header
diff --git a/diff.c b/diff.c
index a1961526c0..e231acb1a9 100644
--- a/diff.c
+++ b/diff.c
@@ -5747,6 +5747,8 @@ struct option *add_diff_options(const struct option *opts,
OPT_CALLBACK_F(0, "binary", options, NULL,
N_("output a binary diff that can be applied"),
PARSE_OPT_NONEG | PARSE_OPT_NOARG, diff_opt_binary),
+ OPT_BOOL(0, "report-binary-files", &options->report_binary_files,
+ N_("report if pre- and post-image blobs are binary")),
OPT_BOOL(0, "full-index", &options->flags.full_index,
N_("show full pre- and post-image object names on the \"index\" lines")),
OPT_COLOR_FLAG(0, "color", &options->use_color,
@@ -6111,6 +6113,13 @@ static void diff_flush_raw(struct diff_filepair *p, struct diff_options *opt)
fprintf(opt->file, "%s ",
diff_aligned_abbrev(&p->two->oid, opt->abbrev));
}
+
+ if (opt->report_binary_files) {
+ char one = diff_filespec_is_binary(opt->repo, p->one) ? 'b' : 't';
+ char two = diff_filespec_is_binary(opt->repo, p->two) ? 'b' : 't';
+ fprintf(opt->file, "%c%c ", one, two);
+ }
+
if (p->score) {
fprintf(opt->file, "%c%03d%c", p->status, similarity_index(p),
inter_name_termination);
diff --git a/diff.h b/diff.h
index 31eedd5c0c..402a70d7ad 100644
--- a/diff.h
+++ b/diff.h
@@ -369,6 +369,12 @@ struct diff_options {
*/
int skip_resolving_statuses;
+ /*
+ * When generating raw diff output, report for each file whether it was
+ * considered binary.
+ */
+ int report_binary_files;
+
/* Callback which allows tweaking the options in diff_setup_done(). */
void (*set_default)(struct diff_options *);
diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh
index d1d30ac2a9..e026e1d3a4 100755
--- a/t/t4012-diff-binary.sh
+++ b/t/t4012-diff-binary.sh
@@ -130,4 +130,33 @@ test_expect_success 'diff --stat with binary files and big change count' '
test_cmp expect actual
'
+test_expect_success SHA1 'diff --report-binary-files' '
+ test_when_finished "rm -rf repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ echo foo >foo &&
+ printf "\0bar\0" >bar &&
+ echo baz >baz &&
+ git add foo bar baz &&
+ git commit -m foo &&
+
+ printf "\0foo\0" >foo &&
+ printf "\0bar2\0" >bar &&
+ echo baz2 >baz &&
+ git commit -am "binary foo" &&
+
+ cat >expect <<-\EOF &&
+ :100644 100644 e02d9a3a8aeb904ccc3bb9ed0600f2e963ba1a10 884a24af772a87733e911a3491c0ab576d34c06c bb M bar
+ :100644 100644 76018072e09c5d31c8c6e3113b8aa0fe625195ca 3414c84ca6b7ca9cbbe40dd44f4d0715c1464f6e tt M baz
+ :100644 100644 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 a60073ceafeca287824d7b9ac3eebef233b72fce tb M foo
+ EOF
+
+ git diff-tree --report-binary-files HEAD~ HEAD >out &&
+
+ test_cmp expect out
+ )
+'
+
test_done
base-commit: 7f278e958afbf9b7e0727631b4c26dcfa1c63d6e
--
2.51.0.193.g4975ec3473b
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Junio C Hamano @ 2025-11-04 2:26 UTC
To: Justin Tobler; +Cc: git, karthik.188
Justin Tobler <jltobler@gmail.com> writes:
> I have a usecase where I would like to know exactly which files in a
> diff pair are considered binary by Git when computing diffs. When
> computing patch diff output, Git already omits filepair diffs where at
> least one side is considered binary and prints a "binary files differ"
> message instead. From this message we cannot discern exactly which files
> were considered binary by Git though.
I have a usecase where I would like to know exactly which side of a
diff filepair ends in an incomplete line in a concise format.
Should we add yet another column to the raw output to indicate who
is complete and who is incomplete?
Where does it lead us and when will it stop?
IOW, yuck ;-).
> In this patch, the raw diff format is extended with a
> `--report-binary-files` option to explicitly specify which files in the
> diff pair were considered binary. The output in this form looks
> something like this:
>
> $ git diff-tree --abbrev=8 --report-binary-files HEAD~ HEAD
> :100644 100644 a1961526 e231acb1 bt M foo
> :100644 100644 31eedd5c 402a70d7 bb M bar
>
> With this format, there is a new column before the status that specifies
> the binary status for each file. 'b' indicates binary and 't' is used
> otherwise.
How would this extend beyond 2-way diffs, I wonder.
Should
$ git show -c --report-binary <a merge>
show [bt]{3} instead of [bt]{2} before the change status letter?
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Junio C Hamano @ 2025-11-04 4:44 UTC
To: Justin Tobler; +Cc: git, karthik.188
Junio C Hamano <gitster@pobox.com> writes:
> Justin Tobler <jltobler@gmail.com> writes:
>
>> I have a usecase where I would like to know exactly which files in a
>> diff pair are considered binary by Git when computing diffs. When
>> computing patch diff output, Git already omits filepair diffs where at
>> least one side is considered binary and prints a "binary files differ"
>> message instead. From this message we cannot discern exactly which files
>> were considered binary by Git though.
>
> I have a usecase where I would like to know exactly which side of a
> diff filepair ends in an incomplete line in a concise format.
>
> Should we add yet another column to the raw output to indicate who
> is complete and who is incomplete?
>
> Where does it lead us and when will it stop?
>
> IOW, yuck ;-).
My point being that it will be a huge mistake to do this only by
singling out a trait that is not so special as if it is very special,
only because you have been thinking about it too long (the "ends in
an incomplete line" trait is what has been on my mind for the past
few days, "this side is binary" may be what you've been thinking
about). There are many other things people would want to learn
concisely in machine readable format, like "where did the file stop
using CRLF line endings and switched to LF line endings", that are
equally plausible as the question you are asking, or the question I
would be asking "which commit lost the final newline?"
Perhaps an extensible command line option syntax like
$ git log --raw-extended=binary,incomplete,crlf,...
is in order, and the presence of these options would add "tt,ic,cl"
somewhere in the output to signal that both sides are text, preimage
ends in an incomplete line but not postimage, and preimage uses crlf
but postimage uses lf, or something?
Extending beyond 2-way diff is still something we would need to
think about, I guess, but the only thing we need to do may be to
allow N-letter tuples instead of limiting ourselves to 2-letter
pairs, perhaps?
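As a rough illustration of what consuming such positional tuples would involve, the sketch below pairs the trait names passed on the command line with the tuples in the output. It assumes the tuples appear in the same order as the requested traits, which the proposal does not spell out; `pair_traits` is a hypothetical helper name.

```shell
# Hypothetical helper. $1 is the comma-separated trait list given to the
# (proposed) option, $2 is the comma-separated tuple string from the output.
# Assumption: both lists are in the same order.
pair_traits() {
	awk -v names="$1" -v tuples="$2" 'BEGIN {
		n = split(names, name, ",")
		split(tuples, tuple, ",")
		for (i = 1; i <= n; i++)
			print name[i] "=" tuple[i]
	}'
}

pair_traits binary,incomplete,crlf tt,ic,cl
```

The reader has to carry the requested trait list alongside the output, since the tuples themselves do not say which trait they describe.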
>> In this patch, the raw diff format is extended with a
>> `--report-binary-files` option to explicitly specify which files in the
>> diff pair were considered binary. The output in this form looks
>> something like this:
>>
>> $ git diff-tree --abbrev=8 --report-binary-files HEAD~ HEAD
>> :100644 100644 a1961526 e231acb1 bt M foo
>> :100644 100644 31eedd5c 402a70d7 bb M bar
>>
>> With this format, there is a new column before the status that specifies
>> the binary status for each file. 'b' indicates binary and 't' is used
>> otherwise.
>
> How would this extend beyond 2-way diffs, I wonder.
> Should
>
> $ git show -c --report-binary <a merge>
>
> show [bt]{3} instead of [bt]{2} before the change status letter?
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Justin Tobler @ 2025-11-05 0:17 UTC
To: Junio C Hamano; +Cc: git, karthik.188
On 25/11/03 08:44PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Justin Tobler <jltobler@gmail.com> writes:
> >
> >> I have a usecase where I would like to know exactly which files in a
> >> diff pair are considered binary by Git when computing diffs. When
> >> computing patch diff output, Git already omits filepair diffs where at
> >> least one side is considered binary and prints a "binary files differ"
> >> message instead. From this message we cannot discern exactly which files
> >> were considered binary by Git though.
> >
> > I have a usecase where I would like to know exactly which side of a
> > diff filepair ends in an incomplete line in a concise format.
> >
> > Should we add yet another column to the raw output to indicate who
> > is complete and who is incomplete?
> >
> > Where does it lead us and when will it stop?
> >
> > IOW, yuck ;-).
>
> My point being that it will be a huge mistake to do this only by
> singling out a trait that is not so special as if it is very special,
> only because you have been thinking about it too long (the "ends in
> an incomplete line" trait is what has been on my mind for the past
> few days, "this side is binary" may be what you've been thinking
> about). There are many other things people would want to learn
> concisely in machine readable format, like "where did the file stop
> using CRLF line endings and switched to LF line endings", that are
> equally plausible as the question you are asking, or the question I
> would be asking "which commit lost the final newline?"
Completely fair. Having a bunch of specific options for special info we
want to add to the raw diff format would get messy quickly and is not
very extensible.
> Perhaps an extensible command line option syntax like
>
> $ git log --raw-extended=binary,incomplete,crlf,...
I quite like this and agree it would be better to have a single
extensible option.
> is in order, and the presence of these options would add "tt,ic,cl"
> somewhere in the output to signal that both sides are text, preimage
> ends in an incomplete line but not postimage, and preimage uses crlf
> but postimage uses lf, or something?
Maybe the output should be something like:
binary=tt,incomplete=ic,crlf=cl
or something along those lines. That way we could freely extend in the
future without having to worry about a specific order. If we think all
of the raw diff extension modes would only report with yes/no for each
file we could just do:
binary=yn,incomplete=yy,crlf=nn
but maybe we should be more flexible and leave it up to the mode to
decide what its values can be?
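A small sketch of why the keyed form is order-independent: a script can look a trait up by name without knowing how the option was invoked. The helper name `trait_value` is hypothetical, and the comma-separated `key=value` syntax is only the shape floated above, not an existing Git format.

```shell
# Hypothetical helper. $1 is a "key=value,key=value" string, $2 is the key
# to look up. Prints the value, or nothing if the key is absent.
trait_value() {
	printf '%s\n' "$1" | tr ',' '\n' |
		awk -F '=' -v key="$2" '$1 == key { print $2 }'
}

trait_value 'binary=yn,incomplete=yy,crlf=nn' crlf
```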
Also, maybe this info could be on a newline following each raw diff
entry? Something like:
:100644 100644 a1961526 e231acb1 M foo
binary=yy
:100644 100644 31eedd5c 402a70d7 M bar
binary=nn
> Extending beyond 2-way diff is still something we would need to
> think about, I guess, but the only thing we need to do may be to
> allow N-letter tuples instead of limiting ourselves to 2-letter
> pairs, perhaps?
Ya, for combined diffs I think we could just add another letter for each
source? Something like:
$ git diff-tree -c --raw-extended=binary <merge commit>
::100644 100644 100644 f38991c02a 2defd2d465 54f409c249 MM foo
binary=yyy
I think it would be reasonable to expect that each extension mode
(binary, incomplete, crlf, etc) would want to check the commit and each
of its sources.
Thanks for the feedback :)
-Justin
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Junio C Hamano @ 2025-11-05 8:04 UTC
To: Justin Tobler; +Cc: git, karthik.188
Justin Tobler <jltobler@gmail.com> writes:
> Maybe the output should be something like:
>
> binary=tt,incomplete=ic,crlf=cl
>
> or something along those lines. That way we could freely extend in the
> future without having to worry about a specific order. If we think all
> of the raw diff extension modes would only report with yes/no for each
> file we could just do:
>
> binary=yn,incomplete=yy,crlf=nn
>
> but maybe we should be more flexible and leave it up to the mode to
> decide what its values can be?
>
> Also, maybe this info could be on a newline following each raw diff
> entry? Something like:
>
> :100644 100644 a1961526 e231acb1 M foo
> binary=yy
> :100644 100644 31eedd5c 402a70d7 M bar
> binary=nn
I know these are parse-able, but quite honestly, both sound
somewhat backwards, if you meant to make this easier to parse by
simple scripts. Scripts do not mind their input line wider than 80
columns, but it is cumbersome if they have to take each pair of
lines and combine them to process. And repeated keywords like
binary= etc., do not look like it is less error prone for scripts to
parse them out, either. So, I dunno.
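To make the cost of the two-line layout concrete, here is a sketch of the extra step a script would need before it could treat each entry as one record. It assumes every trait line directly follows the raw entry it describes; `join_entry_pairs` is a hypothetical name.

```shell
# Hypothetical helper. Joins each raw diff entry with the trait line that
# follows it, separated by a TAB, so later stages can work line by line.
# Assumption: a trait line always comes directly after its entry.
join_entry_pairs() {
	awk '
		/^:/ { entry = $0; next }   # remember the raw diff entry
		entry != "" { print entry "\t" $0; entry = "" }
	'
}

printf ':100644 100644 a1961526 e231acb1 M\tfoo\nbinary=yy\n' | join_entry_pairs
```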
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Justin Tobler @ 2025-11-06 21:42 UTC
To: Junio C Hamano; +Cc: git, karthik.188
On 25/11/05 12:04AM, Junio C Hamano wrote:
> Justin Tobler <jltobler@gmail.com> writes:
>
> > Maybe the output should be something like:
> >
> > binary=tt,incomplete=ic,crlf=cl
> >
> > or something along those lines. That way we could freely extend in the
> > future without having to worry about a specific order. If we think all
> > of the raw diff extension modes would only report with yes/no for each
> > file we could just do:
> >
> > binary=yn,incomplete=yy,crlf=nn
> >
> > but maybe we should be more flexible and leave it up to the mode to
> > decide what its values can be?
> >
> > Also, maybe this info could be on a newline following each raw diff
> > entry? Something like:
> >
> > :100644 100644 a1961526 e231acb1 M foo
> > binary=yy
> > :100644 100644 31eedd5c 402a70d7 M bar
> > binary=nn
>
> I know these are parse-able, but quite honestly, both sound
> somewhat backwards, if you meant to make this easier to parse by
> simple scripts. Scripts do not mind their input line wider than 80
> columns, but it is cumbersome if they have to take each pair of
> lines and combine them to process. And repeated keywords like
> binary= etc., do not look like it is less error prone for scripts to
> parse them out, either. So, I dunno.
Personally, while keywords like "binary=" add a bit of complexity to the
output, I do like the idea of having the output be self-documenting. That
way parsers don't need to know which arguments were passed or rely on some
predetermined output order. I do agree though that spreading the output
across multiple lines doesn't really help us much as it probably doesn't
matter whether we split on a comma or a newline. It's probably simpler
just to keep it all on a single line.
Currently the output in the next version will look like:
:100644 100644 a1961526 e231acb1 binary=yy M foo
:100644 100644 31eedd5c 402a70d7 binary=nn M bar
Thanks,
-Justin
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Torsten Bögershausen @ 2025-11-07 8:30 UTC
To: Justin Tobler; +Cc: Junio C Hamano, git, karthik.188
On Thu, Nov 06, 2025 at 03:42:49PM -0600, Justin Tobler wrote:
> On 25/11/05 12:04AM, Junio C Hamano wrote:
> > Justin Tobler <jltobler@gmail.com> writes:
> >
[snip]
> Currently the output in the next version will look like:
>
> :100644 100644 a1961526 e231acb1 binary=yy M foo
> :100644 100644 31eedd5c 402a70d7 binary=nn M bar
>
I think that is a good solution ;-)
When I once developed the
git ls-files --eol option someone (Junio ?) convinced me to
use a TAB as a separator.
In this case just before the filename:
git ls-files --eol | xxd
00000000: 692f 6c66 2020 2020 772f 6c66 2020 2020  i/lf    w/lf
00000010: 6174 7472 2f20 2020 2020 2020 2020 2020  attr/
00000020: 2020 2020 2020 092e 6369 7272 7573 2e79        ..cirrus.y
                         ^^
00000030: 6d6c 0a                                  ml.
This makes the output both human readable and machine parsable:
All info is before the TAB here. (And may be parsed again in a second
round, if needed).
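A minimal sketch of the two-round parse described here: split at the TAB first, then split the part before it on whitespace. The sample line mirrors the dump above; `eol_info` is a hypothetical helper name.

```shell
# Hypothetical helper. Reads `git ls-files --eol` style lines on stdin and
# prints "<path> <index eol> <worktree eol>".
# Round one: split at the TAB. Round two: split the info part on spaces.
eol_info() {
	awk -F '\t' '{
		split($1, info, " ")
		print $2, info[1], info[2]
	}'
}

printf 'i/lf    w/lf    attr/                 \t.cirrus.yml\n' | eol_info
```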
Thoughts ?
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Junio C Hamano @ 2025-11-07 16:07 UTC
To: Torsten Bögershausen; +Cc: Justin Tobler, git, karthik.188
Torsten Bögershausen <tboegi@web.de> writes:
> git ls-files --eol option someone (Junio ?) convinced me to
> use a TAB as a separator.
> In this case just before the filename:
>
> git ls-files --eol | xxd
> 00000000: 692f 6c66 2020 2020 772f 6c66 2020 2020  i/lf    w/lf
> 00000010: 6174 7472 2f20 2020 2020 2020 2020 2020  attr/
> 00000020: 2020 2020 2020 092e 6369 7272 7573 2e79        ..cirrus.y
>                          ^^
> 00000030: 6d6c 0a                                  ml.
>
> This makes the output both human readable and machine parsable:
> All info is before the TAB here. (And may be parsed again in a second
> round, if needed).
> Thoughts ?
This brings up another interesting question: which command should
learn these new classifications.
The original desire "when I diff A and B, I cannot tell which one of
A or B had binary when I see 'binary file differs'" almost suggests
to me that 'diff' is a wrong place and rather they wanted to know "I
have A; now who is binary in there?" Or "when I diff A and B with
pathspec P, I cannot tell..." is probably a wrong question to ask,
and the question may be "I have A; now who is binary in that tree
within pathspec P?"
IOW, "git show" or "git log", when showing a commit C in the
history, would give "that one is a binary" information as if it is
an attribute of the change between commit C and its parents, if you
tuck this new logic into the "diff" machinery. I am not sure if
that is what we really want. If we do so in "ls-tree" and allow
"git log" to show characteristics of each tree it encounters while
traversing the history, on the other hand, "that one is a binary"
would truly become an attribute of an entry in a tree, and it is not
affected by what is in the trees of the commits that are adjacent to
the commit in the history.
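Not something proposed in this thread, but as a point of comparison for the "I have A; now who is binary in there?" question: `--numstat` prints `-` for both counts when a file is treated as binary, so diffing against the empty tree gives a per-tree answer with today's Git. The sketch below only filters such output; `binary_paths` is a hypothetical name, and the invocation in the comment assumes a SHA-1 repository.

```shell
# Hypothetical helper. Reads `--numstat` output on stdin and prints the
# paths whose added/deleted counts are both "-", i.e. binary files.
binary_paths() {
	awk -F '\t' '$1 == "-" && $2 == "-" { print $3 }'
}

# Assumed invocation (SHA-1 repository; the hash is the empty tree):
#   git diff --numstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 <commit> | binary_paths
printf '1\t0\tbaz\n-\t-\tbar\n' | binary_paths
```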
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Justin Tobler @ 2025-11-07 17:16 UTC
To: Torsten Bögershausen; +Cc: Junio C Hamano, git, karthik.188
On 25/11/07 09:30AM, Torsten Bögershausen wrote:
> On Thu, Nov 06, 2025 at 03:42:49PM -0600, Justin Tobler wrote:
> > On 25/11/05 12:04AM, Junio C Hamano wrote:
> > > Justin Tobler <jltobler@gmail.com> writes:
> > Currently the output in the next version will look like:
> >
> > :100644 100644 a1961526 e231acb1 binary=yy M foo
> > :100644 100644 31eedd5c 402a70d7 binary=nn M bar
> >
>
> I think that is a good solution ;-)
> When I once developed the
> git ls-files --eol option someone (Junio ?) convinced me to
> use a TAB as a separator.
> In this case just before the filename:
>
> git ls-files --eol | xxd
> 00000000: 692f 6c66 2020 2020 772f 6c66 2020 2020  i/lf    w/lf
> 00000010: 6174 7472 2f20 2020 2020 2020 2020 2020  attr/
> 00000020: 2020 2020 2020 092e 6369 7272 7573 2e79        ..cirrus.y
>                          ^^
> 00000030: 6d6c 0a                                  ml.
>
> This makes the output both human readable and machine parsable:
> All info is before the TAB here. (And may be parsed again in a second
> round, if needed).
> Thoughts ?
So the raw diff format for a normal diff pair is as follows:
:<src mode>SP<dst mode>SP<src sha>SP<dest sha>SP<status>[score]TAB<src path>[TAB<dest path>]LF
When the `-z` option is used, tab and LF are replaced with a NUL byte.
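For reference, a sketch of how a script typically takes the existing format apart: everything before the first TAB is space-delimited metadata, and the paths follow. It assumes the non-`-z` form shown above; `parse_raw_line` is a hypothetical name.

```shell
# Hypothetical helper. Reads regular (non -z) raw diff lines on stdin and
# prints the status (with score, if any) and the path(s).
parse_raw_line() {
	awk -F '\t' '/^:/ {
		split($1, meta, " ")       # modes, object names, status[score]
		printf "status=%s src=%s", meta[5], $2
		if (NF >= 3)               # renames and copies carry a second path
			printf " dst=%s", $3
		printf "\n"
	}'
}

printf ':100644 100644 a1961526 e231acb1 M\tfoo\n' | parse_raw_line
```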
So we do already use a tab to delimit between the score and the paths. If
we wanted to avoid using a comma to delimit the extended raw diff output,
we could use a space instead and use TAB to indicate the end.
Maybe something like:
:100644 100644 a1961526 e231acb1 binary=yy crlf=nn M foo
:100644 100644 31eedd5c 402a70d7 binary=nn crlf=yy M bar
or we could maybe move the extended info towards the start of the line
and leave the remaining bits the same:
:binary=yy crlf=nn 100644 100644 a1961526 e231acb1 M foo
:binary=nn crlf=yy 100644 100644 31eedd5c 402a70d7 M bar
With either of these formats, the expectation would be parsers continue
reading space delimited "key=value" pairs until they encounter a tab. I
do think this latter format looks a bit nicer and I don't think it would
meaningfully impact the complexity of the parser. Ultimately, I don't
feel super strongly one way or the other though. I may go with this last
format in the next version since it does look a little nicer IMO. I'm
still very much interested in folks' thoughts here though. :)
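A sketch of a reader that copes with either placement: it scans the space-delimited fields before the TAB and keeps those that look like `key=value`. It assumes the trait tokens are the only fields containing `=`; `extract_traits` is a hypothetical name and neither layout is an existing Git format.

```shell
# Hypothetical helper. Prints "<path> <key=value>..." for each entry,
# wherever the key=value tokens sit in the metadata part of the line.
# Assumption: only trait tokens contain "=".
extract_traits() {
	awk -F '\t' '/^:/ {
		traits = ""
		n = split(substr($1, 2), field, " ")   # drop the leading colon
		for (i = 1; i <= n; i++)
			if (index(field[i], "=") > 0)
				traits = traits " " field[i]
		print $2 traits
	}'
}

printf ':binary=yy crlf=nn 100644 100644 a1961526 e231acb1 M\tfoo\n' | extract_traits
```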
Thanks,
-Justin
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Junio C Hamano @ 2025-11-07 17:26 UTC
To: Justin Tobler; +Cc: Torsten Bögershausen, git, karthik.188
Justin Tobler <jltobler@gmail.com> writes:
> or we could maybe move the extended info towards the start of the line
> and leave the remaining bits the same:
>
> :binary=yy crlf=nn 100644 100644 a1961526 e231acb1 M foo
> :binary=nn crlf=yy 100644 100644 31eedd5c 402a70d7 M bar
>
> With either of these formats, the expectation would be parsers continue
> reading space delimited "key=value" pairs until they encounter a tab. I
> do think this latter format looks a bit nicer and I don't think it would
> meaningfully impact the complexity of the parser. Ultimately, I don't
> feel super strongly one way or the other though. I may go with this last
> format in the next version since it does look a little nicer IMO. I'm
> still very much interested in folks' thoughts here though. :)
With this are your parsers/readers still using the output fields
that appear in the --raw output? Do they still want the mode bits,
or object names in preimage and postimage? Do they need to even
look at "M" anymore, as a new file or a removed file would certainly
have only a single sign for these additional traits like binary as
such a filepair has only one side by definition?
IOW, I am not sure if it is wise to shoehorn the new pieces of
information into the --raw format. Existing parsers would not be
able to grok the above at all (they do not even see the fields they
recognise at the beginning of lines which is where they recognise
them as such), so I do not see any good reason to even pretend this
to be some extension to an existing --raw format.
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Ben Knoble @ 2025-11-05 12:14 UTC
To: Justin Tobler; +Cc: Junio C Hamano, git, karthik.188
> On Nov 4, 2025, at 19:17, Justin Tobler <jltobler@gmail.com> wrote:
>
> On 25/11/03 08:44PM, Junio C Hamano wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>>> Justin Tobler <jltobler@gmail.com> writes:
>>>
>>>> I have a usecase where I would like to know exactly which files in a
>>>> diff pair are considered binary by Git when computing diffs. When
>>>> computing patch diff output, Git already omits filepair diffs where at
>>>> least one side is considered binary and prints a "binary files differ"
>>>> message instead. From this message we cannot discern exactly which files
>>>> were considered binary by Git though.
>>>
>>> I have a usecase where I would like to know exactly which side of a
>>> diff filepair ends in an incomplete line in a concise format.
>>>
>>> Should we add yet another column to the raw output to indicate who
>>> is complete and who is incomplete?
>>>
>>> Where does it lead us and when will it stop?
>>>
>>> IOW, yuck ;-).
>>
>> My point being that it will be a huge mistake to do this only by
>> singling out a trait that is not so special as if it is very special,
>> only because you have been thinking about it too long (the "ends in
>> an incomplete line" trait is what has been on my mind for the past
>> few days, "this side is binary" may be what you've been thinking
>> about). There are many other things people would want to learn
>> concisely in machine readable format, like "where did the file stop
>> using CRLF line endings and switched to LF line endings", that are
>> equally plausible as the question you are asking, or the question I
>> would be asking "which commit lost the final newline?"
>
> Completely fair. Having a bunch of specific options for special info we
> want to add to the raw diff format would get messy quickly and is not
> very extensible.
>
>> Perhaps an extensible command line option syntax like
>>
>> $ git log --raw-extended=binary,incomplete,crlf,...
>
> I quite like this and agree it would be better to have a single
> extensible option.
>
>> is in order, and the presence of these options would add "tt,ic,cl"
>> somewhere in the output to signal that both sides are text, preimage
>> ends in an incomplete line but not postimage, and preimage uses crlf
>> but postimage uses lf, or something?
>
> Maybe the output should be something like:
>
> binary=tt,incomplete=ic,crlf=cl
>
> or something along those lines. That way we could freely extend in the
> future without having to worry about a specific order. If we think all
> of the raw diff extension modes would only report with yes/no for each
> file we could just do:
>
> binary=yn,incomplete=yy,crlf=nn
>
> but maybe we should be more flexible and leave it up to the mode to
> decide what its values can be?
>
> Also, maybe this info could be on a newline following each raw diff
> entry? Something like:
>
> :100644 100644 a1961526 e231acb1 M foo
> binary=yy
> :100644 100644 31eedd5c 402a70d7 M bar
> binary=nn
>
Whether combined or separate, self-documenting output is nice. Separate might be easier for line-oriented tools? Having to split on commas and loop looking for keywords seems like more work than just processing a line at a time. Idk.
* Re: [RFC PATCH] diff: add option to report binary files in raw diffs
From: Justin Tobler @ 2025-11-06 21:52 UTC
To: Ben Knoble; +Cc: Junio C Hamano, git, karthik.188
On 25/11/05 07:14AM, Ben Knoble wrote:
>
> > On Nov 4, 2025, at 19:17, Justin Tobler <jltobler@gmail.com> wrote:
> > Also, maybe this info could be on a newline following each raw diff
> > entry? Something like:
> >
> > :100644 100644 a1961526 e231acb1 M foo
> > binary=yy
> > :100644 100644 31eedd5c 402a70d7 M bar
> > binary=nn
> >
>
> Whether combined or separate, self-documenting output is nice. Separate might be easier for line-oriented tools? Having to split on commas and loop looking for keywords seems like more work than just processing a line at a time. Idk.
Ya, I'm also a bit torn on whether a single line or multiple lines would
be best. I'm currently leaning back towards using just a single line as
I do think it is somewhat nice that each diff entry itself is on its own
line. Ultimately, from the perspective of the parser, I don't think it
should matter too much whether it splits raw-extended output on a
comma or a newline.
In the next version, I'm thinking the output will look something like:
$ git diff-tree --raw-extended=binary,crlf ...
:100644 100644 a1961526 e231acb1 binary=yy,crlf=nn M foo
:100644 100644 31eedd5c 402a70d7 binary=nn,crlf=yy M bar
Thanks,
-Justin