[PATCH] diff: Make numstat machine friendly also for renames (and copies)

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] diff: Make numstat machine friendly also for renames (and copies)
@ 2007-12-10 22:32 Jakub Narebski
  2007-12-10 22:55 ` [PATCH (amend)] " Jakub Narebski
  2007-12-10 23:00 ` [PATCH] " Junio C Hamano
  0 siblings, 2 replies; 12+ messages in thread
From: Jakub Narebski @ 2007-12-10 22:32 UTC (permalink / raw)
  To: git

"git diff --numstat" used the same format as "git diff --stat" for
renamed (and copied) files, except that filenames were not shortened
when they didn't fit in the column width.  This format is suitable for
human consumption, but it cannot be unambiguously parsed.

Instead of that always use final file name ("to" name) for numstat.
It is possible to find name before rename when name after is known.

This required to use pprint_rename (pretty print rename) during output
(in the show_stats function) and not during parsing (in diffstat_add
function).

Adding from_name field to struct diffstat_t makes is_renamed bitfield
redundant; nevertheless for the sake of clarity, readability and
making this patch minimal (and because it would not reduce memory
footprint) it was not removed, and its used not replaced by checking
from_name field.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
This would be useful for gitweb, later.

I hope I have made it in time before feature freeze...

 diff.c |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/diff.c b/diff.c
index f780e3e..38b9367 100644
--- a/diff.c
+++ b/diff.c
@@ -735,6 +735,7 @@ struct diffstat_t {
 	int alloc;
 	struct diffstat_file {
 		char *name;
+		char *from_name;
 		unsigned is_unmerged:1;
 		unsigned is_binary:1;
 		unsigned is_renamed:1;
@@ -755,11 +756,14 @@ static struct diffstat_file *diffstat_add(struct diffstat_t *diffstat,
 	}
 	diffstat->files[diffstat->nr++] = x;
 	if (name_b) {
-		x->name = pprint_rename(name_a, name_b);
+		x->from_name = xstrdup(name_a);
+		x->name      = xstrdup(name_b);
 		x->is_renamed = 1;
 	}
-	else
+	else {
+		x->from_name = NULL;
 		x->name = xstrdup(name_a);
+	}
 	return x;
 }

@@ -837,7 +841,7 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 		struct diffstat_file *file = data->files[i];
 		int change = file->added + file->deleted;

-		if (!file->is_renamed) {  /* renames are already quoted by pprint_rename */
+		if (!file->is_renamed) {  /* renames will be quoted by pprint_rename */
 			struct strbuf buf;
 			strbuf_init(&buf, 0);
 			if (quote_c_style(file->name, &buf, NULL, 0)) {
@@ -846,6 +850,11 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 			} else {
 				strbuf_release(&buf);
 			}
+		} else {
+			char *qname = pprint_rename(file->from_name, file->name);
+			free(file->name);
+			free(file->from_name);
+			file->name = qname;
 		}

 		len = strlen(file->name);
-- 
1.5.3.7

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-10 22:32 [PATCH] diff: Make numstat machine friendly also for renames (and copies) Jakub Narebski
@ 2007-12-10 22:55 ` Jakub Narebski
  2007-12-11  1:06   ` Junio C Hamano
  2007-12-10 23:00 ` [PATCH] " Junio C Hamano
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Narebski @ 2007-12-10 22:55 UTC (permalink / raw)
  To: git

"git diff --numstat" used the same format as "git diff --stat" for
renamed (and copied) files, except that filenames were not shortened
when they didn't fit in the column width.  This format is suitable for
human consumption, but it cannot be unambiguously parsed.

Instead of that always use final file name ("to" name) for numstat.
It is possible to find name before rename when name after is known.

This required to use pprint_rename (pretty print rename) during output
(in the show_stats function) and not during parsing (in diffstat_add
function).

Adding from_name field to struct diffstat_t makes is_renamed bitfield
redundant; nevertheless for the sake of clarity, readability and
making this patch minimal (and because it would not reduce memory
footprint) it was not removed, and its used not replaced by checking
from_name field.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Sorry for mistake: I have tested this commit, corrected it... and forgot
to update patch to send.

Previous version of this patch (from 7 May 2007) used instead of current
only "to_name" format similar to git-diff-tree raw format for renames:

  added deleted TAB path for "src" TAB path for "dst" LF

The problem was when -z option was used: how to separate end of record
from end of from_name and start of to_name. For git-diff we have status
to distinguish those; no such thing for numstat output. Previous version
of patch used (or was to use actually, because of error in the code)

  added deleted TAB path for "src" NUL NUL path for "dst" NUL

when -z option was used.

This is left now for future --numstat-extended option...

 diff.c |   22 +++++++++++++---------
 1 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/diff.c b/diff.c
index f780e3e..8039ac7 100644
--- a/diff.c
+++ b/diff.c
@@ -735,6 +735,7 @@ struct diffstat_t {
 	int alloc;
 	struct diffstat_file {
 		char *name;
+		char *from_name;
 		unsigned is_unmerged:1;
 		unsigned is_binary:1;
 		unsigned is_renamed:1;
@@ -755,11 +756,14 @@ static struct diffstat_file *diffstat_add(struct diffstat_t *diffstat,
 	}
 	diffstat->files[diffstat->nr++] = x;
 	if (name_b) {
-		x->name = pprint_rename(name_a, name_b);
+		x->from_name = xstrdup(name_a);
+		x->name      = xstrdup(name_b);
 		x->is_renamed = 1;
 	}
-	else
+	else {
+		x->from_name = NULL;
 		x->name = xstrdup(name_a);
+	}
 	return x;
 }

@@ -837,7 +841,7 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 		struct diffstat_file *file = data->files[i];
 		int change = file->added + file->deleted;

-		if (!file->is_renamed) {  /* renames are already quoted by pprint_rename */
+		if (!file->is_renamed) {  /* renames will be quoted by pprint_rename */
 			struct strbuf buf;
 			strbuf_init(&buf, 0);
 			if (quote_c_style(file->name, &buf, NULL, 0)) {
@@ -846,6 +850,11 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 			} else {
 				strbuf_release(&buf);
 			}
+		} else {
+			char *qname = pprint_rename(file->from_name, file->name);
+			free(file->name);
+			free(file->from_name);
+			file->name = qname;
 		}

 		len = strlen(file->name);
@@ -982,12 +991,7 @@ static void show_numstat(struct diffstat_t* data, struct diff_options *options)
 			printf("-\t-\t");
 		else
 			printf("%d\t%d\t", file->added, file->deleted);
-		if (!file->is_renamed) {
-			write_name_quoted(file->name, stdout, options->line_termination);
-		} else {
-			fputs(file->name, stdout);
-			putchar(options->line_termination);
-		}
+		write_name_quoted(file->name, stdout, options->line_termination);
 	}
 }

-- 
1.5.3.7

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-10 22:32 [PATCH] diff: Make numstat machine friendly also for renames (and copies) Jakub Narebski
  2007-12-10 22:55 ` [PATCH (amend)] " Jakub Narebski
@ 2007-12-10 23:00 ` Junio C Hamano
  2007-12-10 23:14   ` Jakub Narebski
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2007-12-10 23:00 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> writes:

> "git diff --numstat" used the same format as "git diff --stat" for
> renamed (and copied) files, except that filenames were not shortened
> when they didn't fit in the column width.  This format is suitable for
> human consumption, but it cannot be unambiguously parsed.

Agreed about the (un)parsability, and --numstat is all about parsability
so I would not object.  A fix is really needed there.

I do not have time to look at the patch right now, but if the changed
output is in line with what --name-status would show, that would be
great.  I'd call that "the format that should have been from day one".

I.e. no '=>' rename marker, but show two names c-quoted (unless -z is
used) and separated with inter_name_termination).  IIRC, that is how
rename/copy is shown with --name-status.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-10 23:00 ` [PATCH] " Junio C Hamano
@ 2007-12-10 23:14   ` Jakub Narebski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Narebski @ 2007-12-10 23:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > "git diff --numstat" used the same format as "git diff --stat" for
> > renamed (and copied) files, except that filenames were not shortened
> > when they didn't fit in the column width.  This format is suitable for
> > human consumption, but it cannot be unambiguously parsed.
> 
> Agreed about the (un)parsability, and --numstat is all about parsability
> so I would not object.  A fix is really needed there.
> 
> I do not have time to look at the patch right now, but if the changed
> output is in line with what --name-status would show, that would be
> great.  I'd call that "the format that should have been from day one".
> 
> I.e. no '=>' rename marker, but show two names c-quoted (unless -z is
> used) and separated with inter_name_termination).  IIRC, that is how
> rename/copy is shown with --name-status.

Unfortunately this is not possible, at least if we want to retain
the assertion that -z output looks like normal output, only without
quoting.

diff --name-status has _status_ field which can be used to distinguish
if the NUL (for -z output) is the end of source filename, or the end
of record.

The patch send changes --numstat to use only _destination_ name.
What you want I'd left for futore --numstat-extended (basically --numstat,
but with status field.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-10 22:55 ` [PATCH (amend)] " Jakub Narebski
@ 2007-12-11  1:06   ` Junio C Hamano
  2007-12-11  1:26     ` Jakub Narebski
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2007-12-11  1:06 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> writes:

> Previous version of this patch (from 7 May 2007) used instead of current
> only "to_name" format similar to git-diff-tree raw format for renames:
>
>   added deleted TAB path for "src" TAB path for "dst" LF
>     
> The problem was when -z option was used: how to separate end of record
> from end of from_name and start of to_name. For git-diff we have status
> to distinguish those; no such thing for numstat output. Previous version
> of patch used (or was to use actually, because of error in the code)
>
>   added deleted TAB path for "src" NUL NUL path for "dst" NUL
>
> when -z option was used.

I think the cleanest at this point is to have --numstat-enhanced that
shows

	<added> <deleted> <status> <path1>
        <added> <deleted> <status> <path1> <path2>

Anything else would be a regression.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-11  1:06   ` Junio C Hamano
@ 2007-12-11  1:26     ` Jakub Narebski
  2007-12-11  2:50       ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Narebski @ 2007-12-11  1:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > Previous version of this patch (from 7 May 2007) used instead of current
> > only "to_name" format similar to git-diff-tree raw format for renames:
> >
> >   added deleted TAB path for "src" TAB path for "dst" LF
> >     
> > The problem was when -z option was used: how to separate end of record
> > from end of from_name and start of to_name. For git-diff we have status
> > to distinguish those; no such thing for numstat output. Previous version
> > of patch used (or was to use actually, because of error in the code)
> >
> >   added deleted TAB path for "src" NUL NUL path for "dst" NUL
> >
> > when -z option was used.
> 
> I think the cleanest at this point is to have --numstat-enhanced that
> shows
> 
> 	<added> <deleted> <status> <path1>
> 	<added> <deleted> <status> <path1> <path2>
> 
> Anything else would be a regression.

That is the plan[*1*]. Nevertheless always using destination filename for
"ordinary" numstat is a step in good direction. I don't think that would
break _any_ scripts (as previous version was not good to be parsed by
a machine); I think it is even more probable that old version _broke_
scripts if -M / -C was provided.

[*1*] When I (or somebody else) find time for that.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-11  1:26     ` Jakub Narebski
@ 2007-12-11  2:50       ` Junio C Hamano
  2007-12-11 23:09         ` Jakub Narebski
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2007-12-11  2:50 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> writes:

> Junio C Hamano wrote:
>> Jakub Narebski <jnareb@gmail.com> writes:
>> 
>> > Previous version of this patch (from 7 May 2007) used instead of current
>> > only "to_name" format similar to git-diff-tree raw format for renames:
>> >
>> >   added deleted TAB path for "src" TAB path for "dst" LF

That's perfectly fine (that is without -z).

>> > ... Previous version
>> > of patch used (or was to use actually, because of error in the code)
>> >
>> >   added deleted TAB path for "src" NUL NUL path for "dst" NUL
>> >
>> > when -z option was used.

I think your "previous" one:

	<added> <deleted> <src> NUL		  (no rename)
	<added> <deleted> <src> NUL NUL <dst> NUL (with rename)

would be unambiguously parsable, but it would be cleaner and easier for
the parser if the latter were like this instead:

	<added> <deleted> NUL <src> NUL <dst> NUL (with rename)

The reader can expect to find a NUL terminated src, finds the length is
zero and notices it cannot be src and then reads two paths from that
point.

Without having the status letter, we cannot do "optional two paths"
without breaking existing --numstat -z readers that do not care about
renames and copies, and I agree that existing --numstat -z readers that
pass -M or -C are already broken, so I think a format extension along
the above line (your "previous" and the above clean-up proposal have the
same power of expressiveness, I just think the latter is syntactically
cleaner) would be reasonable.  And at that point we probably would not
need --numstat-enhanced at all ;-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-11  2:50       ` Junio C Hamano
@ 2007-12-11 23:09         ` Jakub Narebski
  2007-12-12  7:46           ` Junio C Hamano
  2007-12-12 10:31           ` Jakub Narebski
  0 siblings, 2 replies; 12+ messages in thread
From: Jakub Narebski @ 2007-12-11 23:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, 11 Dec 2007, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>> Junio C Hamano wrote:
>>> Jakub Narebski <jnareb@gmail.com> writes:
>>> 
>>>> Previous version of this patch (from 7 May 2007) used instead of current
>>>> only "to_name" format similar to git-diff-tree raw format for renames:
>>>>
>>>>   added deleted TAB path for "src" TAB path for "dst" LF
> 
> That's perfectly fine (that is without -z).
> 
>>>> ... Previous version
>>>> of patch used (or was to use actually, because of error in the code)
>>>>
>>>>   added deleted TAB path for "src" NUL NUL path for "dst" NUL
>>>>
>>>> when -z option was used.
> 
> I think your "previous" one:
> 
> 	<added> <deleted> <src> NUL		  (no rename)
> 	<added> <deleted> <src> NUL NUL <dst> NUL (with rename)
> 
> would be unambiguously parsable, but it would be cleaner and easier for
> the parser if the latter were like this instead:
> 
> 	<added> <deleted> NUL <src> NUL <dst> NUL (with rename)
> 
> The reader can expect to find a NUL terminated src, finds the length is
> zero and notices it cannot be src and then reads two paths from that
> point.

Very good idea.

> Without having the status letter, we cannot do "optional two paths"
> without breaking existing --numstat -z readers that do not care about
> renames and copies, and I agree that existing --numstat -z readers that
> pass -M or -C are already broken, so I think a format extension along
> the above line (your "previous" and the above clean-up proposal have the
> same power of expressiveness, I just think the latter is syntactically
> cleaner) would be reasonable.  And at that point we probably would not
> need --numstat-enhanced at all ;-)

The previous patch is a fix (usability fix) for numstat output in the
presence of rename detection, making it truly machine friendly. Moreover
it should not break any scripts which parsed numstat output not
expecting to deal with renames (for example if repository has
diff.renames config option set), and might even fix them.

The proposed - and implemented in patch below - change could break some
scripts not expecting to deal with new numstat output. Adding status
letter (with similarity/dissimilarity percentage info) would
unfortunately require greater surgery... We can either allow scripts
(do there exist any?) to break, or make the full rename info shown
only for --numstat-extended; but I'd like to have this status letter
for --numstat-extended / --numstat-enhanced.

So possible solution are:
 (a) Accept both patches, and accept remote possibility that some
     scripts might break
 (b) Add --numstat-extended option whoich would show rename info
     as implemented in this patch (and of course accept previous)
 (c) Implement --numstat-extended with status letter, as suggested,
     accept first patch but not this one.


Note that gitweb would require --numstat with proper rename detection
support if we want to e.g. provide graphical diffstat in 'commit' view
for merge commits.


P.S. The numstat output format for renames should be probably described
in documentation, and not only in commit message, but I was not sure
where to put it (and even how it should be written).

-- >8 --
From: Jakub Narebski <jnareb@gmail.com>
Date: Tue, 11 Dec 2007 23:52:18 +0100
Subject: [RFC/PATCH] diff: Show rename info in numstat output, correctly

Make numstat output of git-diff show both source and destination
filename for renames (and copies) in easily parseable format, similar
to raw diff output format.

The numstat format for rename (or copy) is now

  <added> <deleted> TAB <path for "src"> TAB <path for "dst"> LF

or if -z option is used

  <added> <deleted> TAB NULL <path for "src"> NULL <path for "dst"> NULL

where <added> and <deleted> is number of added/deleted lines
(or '-' for binary file).

When -z option is not used, TAB, LF, and backslash characters in
pathnames are represented as \t, \n, and \\, respectively.


Idea for -z numstat rename format by Junio C Hamano.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
 diff.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/diff.c b/diff.c
index 8039ac7..56cbf28 100644
--- a/diff.c
+++ b/diff.c
@@ -991,7 +991,14 @@ static void show_numstat(struct diffstat_t* data, struct diff_options *options)
 			printf("-\t-\t");
 		else
 			printf("%d\t%d\t", file->added, file->deleted);
-		write_name_quoted(file->name, stdout, options->line_termination);
+		if (file->is_renamed) {
+			char sep = options->line_termination ? '\t' : '\0';
+			if (!options->line_termination)
+				putchar(options->line_termination);
+			write_name_quoted(file->from_name, stdout, sep);
+			write_name_quoted(file->name, stdout, options->line_termination);
+		} else
+			write_name_quoted(file->name, stdout, options->line_termination);
 	}
 }
 
-- 
1.5.3.7

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-11 23:09         ` Jakub Narebski
@ 2007-12-12  7:46           ` Junio C Hamano
  2007-12-12 10:21             ` Jakub Narebski
  2007-12-12 10:31           ` Jakub Narebski
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2007-12-12  7:46 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> writes:

> The previous patch is a fix (usability fix) for numstat output in the
> presence of rename detection, making it truly machine friendly. Moreover
> it should not break any scripts which parsed numstat output not
> expecting to deal with renames (for example if repository has
> diff.renames config option set), and might even fix them.
>
> The proposed - and implemented in patch below - change could break some
> scripts not expecting to deal with new numstat output.

I do not quite follow.  If a script wanted -M and/or -C output in the
older "common/{a => b}/common" format, either change would break it
equally badly, either with or without -z.  And I do not think adding
status letter is necessary for --numstat; if the caller wants that
information, it can ask for it explicitly with both options.

So I do not think there is anything more to solve, than the patch you
just sent.

The patch looks good.  Thanks.

> P.S. The numstat output format for renames should be probably described
> in documentation, and not only in commit message, but I was not sure
> where to put it (and even how it should be written).

I started writing this, and then reviewed the result of squashing your
two patches.

Although everybody complains that most of git is run-once-and-exit, I
wrote the original diff part fairly conservatively to make it leak-free,
because of a single command, "diff-tree --stdin", that can be told to
run millions of diffs inside a single process.

But the diffstat part is horribly leaky, especially after your patch,
and it has ugly workarounds such as refusing to show both diffstat and
shortstat at the same time, not because it does not make much sense
(admittedly it doesn't), but because show_stats() discards necessary
information when it is done, making it impossible for shortstat to run.

So I ended up restructuring the name allocation/free code a bit while at
it.

-- >8 --
diff --numstat -z: make it machine readable

The "-z" format is all about machine parsability, but showing renamed
paths as "common/{a => b}/suffix" makes it impossible.  The scripts would
never have successfully parsed "--numstat -z -M" in the old format.

This fixes the output format in a (hopefully minimally) backward
incompatible way.

 * The output without -z is not changed.  This has given a good way for
   humans to view added and deleted lines separately, and showing the
   path in combined, shorter way would preserve readability.

 * The output with -z is unchanged for paths that do not involve renames.
   Existing scripts that do not pass -M/-C are not affected at all.

 * The output with -z for a renamed path is shown in a format that can
   easily be distinguished from an unrenamed path.

This is based on Jakub Narebski's patch.  Bugs and documentation typos
are mine.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/diff-format.txt |   61 +++++++++++++++++++++++++
 diff.c                        |  100 ++++++++++++++++++++++++++++-------------
 2 files changed, 129 insertions(+), 32 deletions(-)

diff --git a/Documentation/diff-format.txt b/Documentation/diff-format.txt
index 2c3a4c4..400cbb3 100644
--- a/Documentation/diff-format.txt
+++ b/Documentation/diff-format.txt
@@ -84,3 +84,64 @@ all parents.
 
 
 include::diff-generate-patch.txt[]
+
+
+other diff formats
+------------------
+
+The `--summary` option describes newly added, deleted, renamed and
+copied files.  The `--stat` option adds diffstat(1) graph to the
+output.  These options can be combined with other options, such as
+`-p`, and are meant for human consumption.
+
+When showing a change that involves a rename or a copy, `--stat` output
+formats the pathnames compactly by combining common prefix and suffix of
+the pathnames.  For example, a change that moves `arch/i386/Makefile` to
+`arch/x86/Makefile` while modifying 4 lines will be shown like this:
+
+------------------------------------
+arch/{i386 => x86}/Makefile    |   4 +--
+------------------------------------
+
+The `--numstat` option gives the diffstat(1) information but is designed
+for easier machine consumption.  An entry in `--numstat` output looks
+like this:
+
+----------------------------------------
+1	2	README
+3	1	arch/{i386 => x86}/Makefile
+----------------------------------------
+
+That is, from left to right:
+
+. the number of added lines;
+. a tab;
+. the number of deleted lines;
+. a tab;
+. pathname (possibly with rename/copy information);
+. a newline.
+
+When `-z` output option is in effect, the output is formatted this way:
+
+----------------------------------------
+1	2	README NUL
+3	1	NUL arch/i386/Makefile NUL arch/x86/Makefile NUL
+----------------------------------------
+
+That is:
+
+. the number of added lines;
+. a tab;
+. the number of deleted lines;
+. a tab;
+. a NUL (only exists if renamed/copied);
+. pathname in preimage;
+. a NUL (only exists if renamed/copied);
+. pathname in postimage (only exists if renamed/copied);
+. a NUL.
+
+The extra `NUL` before the preimage path in renamed case is to allow
+scripts that read the output to tell if the current record being read is
+a single-path record or a rename/copy record without reading ahead.
+After reading added and deleted lines, reading up to `NUL` would yield
+the pathname, but if that is `NUL`, the record will show two paths.
diff --git a/diff.c b/diff.c
index f780e3e..fc51506 100644
--- a/diff.c
+++ b/diff.c
@@ -734,7 +734,9 @@ struct diffstat_t {
 	int nr;
 	int alloc;
 	struct diffstat_file {
+		char *from_name;
 		char *name;
+		char *print_name;
 		unsigned is_unmerged:1;
 		unsigned is_binary:1;
 		unsigned is_renamed:1;
@@ -755,11 +757,14 @@ static struct diffstat_file *diffstat_add(struct diffstat_t *diffstat,
 	}
 	diffstat->files[diffstat->nr++] = x;
 	if (name_b) {
-		x->name = pprint_rename(name_a, name_b);
+		x->from_name = xstrdup(name_a);
+		x->name = xstrdup(name_b);
 		x->is_renamed = 1;
 	}
-	else
+	else {
+		x->from_name = NULL;
 		x->name = xstrdup(name_a);
+	}
 	return x;
 }
 
@@ -803,6 +808,28 @@ static void show_graph(char ch, int cnt, const char *set, const char *reset)
 	printf("%s", reset);
 }
 
+static void fill_print_name(struct diffstat_file *file)
+{
+	char *pname;
+
+	if (file->print_name)
+		return;
+
+	if (!file->is_renamed) {
+		struct strbuf buf;
+		strbuf_init(&buf, 0);
+		if (quote_c_style(file->name, &buf, NULL, 0)) {
+			pname = strbuf_detach(&buf, NULL);
+		} else {
+			pname = file->name;
+			strbuf_release(&buf);
+		}
+	} else {
+		pname = pprint_rename(file->from_name, file->name);
+	}
+	file->print_name = pname;
+}
+
 static void show_stats(struct diffstat_t* data, struct diff_options *options)
 {
 	int i, len, add, del, total, adds = 0, dels = 0;
@@ -836,19 +863,8 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 	for (i = 0; i < data->nr; i++) {
 		struct diffstat_file *file = data->files[i];
 		int change = file->added + file->deleted;
-
-		if (!file->is_renamed) {  /* renames are already quoted by pprint_rename */
-			struct strbuf buf;
-			strbuf_init(&buf, 0);
-			if (quote_c_style(file->name, &buf, NULL, 0)) {
-				free(file->name);
-				file->name = strbuf_detach(&buf, NULL);
-			} else {
-				strbuf_release(&buf);
-			}
-		}
-
-		len = strlen(file->name);
+		fill_print_name(file);
+		len = strlen(file->print_name);
 		if (max_len < len)
 			max_len = len;
 
@@ -873,7 +889,7 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 
 	for (i = 0; i < data->nr; i++) {
 		const char *prefix = "";
-		char *name = data->files[i]->name;
+		char *name = data->files[i]->print_name;
 		int added = data->files[i]->added;
 		int deleted = data->files[i]->deleted;
 		int name_len;
@@ -901,17 +917,17 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 			printf("%s%d%s", add_c, added, reset);
 			printf(" bytes");
 			printf("\n");
-			goto free_diffstat_file;
+			continue;
 		}
 		else if (data->files[i]->is_unmerged) {
 			show_name(prefix, name, len, reset, set);
 			printf("  Unmerged\n");
-			goto free_diffstat_file;
+			continue;
 		}
 		else if (!data->files[i]->is_renamed &&
 			 (added + deleted == 0)) {
 			total_files--;
-			goto free_diffstat_file;
+			continue;
 		}
 
 		/*
@@ -933,11 +949,7 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 		show_graph('+', add, add_c, reset);
 		show_graph('-', del, del_c, reset);
 		putchar('\n');
-	free_diffstat_file:
-		free(data->files[i]->name);
-		free(data->files[i]);
 	}
-	free(data->files);
 	printf("%s %d files changed, %d insertions(+), %d deletions(-)%s\n",
 	       set, total_files, adds, dels, reset);
 }
@@ -962,11 +974,7 @@ static void show_shortstats(struct diffstat_t* data)
 				dels += deleted;
 			}
 		}
-		free(data->files[i]->name);
-		free(data->files[i]);
 	}
-	free(data->files);
-
 	printf(" %d files changed, %d insertions(+), %d deletions(-)\n",
 	       total_files, adds, dels);
 }
@@ -975,6 +983,9 @@ static void show_numstat(struct diffstat_t* data, struct diff_options *options)
 {
 	int i;
 
+	if (data->nr == 0)
+		return;
+
 	for (i = 0; i < data->nr; i++) {
 		struct diffstat_file *file = data->files[i];
 
@@ -982,15 +993,39 @@ static void show_numstat(struct diffstat_t* data, struct diff_options *options)
 			printf("-\t-\t");
 		else
 			printf("%d\t%d\t", file->added, file->deleted);
-		if (!file->is_renamed) {
-			write_name_quoted(file->name, stdout, options->line_termination);
+		if (options->line_termination) {
+			fill_print_name(file);
+			if (!file->is_renamed)
+				write_name_quoted(file->name, stdout,
+						  options->line_termination);
+			else {
+				fputs(file->print_name, stdout);
+				putchar(options->line_termination);
+			}
 		} else {
-			fputs(file->name, stdout);
-			putchar(options->line_termination);
+			if (file->is_renamed) {
+				putchar('\0');
+				write_name_quoted(file->from_name, stdout, '\0');
+			}
+			write_name_quoted(file->name, stdout, '\0');
 		}
 	}
 }
 
+static void free_diffstat_info(struct diffstat_t *diffstat)
+{
+	int i;
+	for (i = 0; i < diffstat->nr; i++) {
+		struct diffstat_file *f = diffstat->files[i];
+		if (f->name != f->print_name)
+			free(f->print_name);
+		free(f->name);
+		free(f->from_name);
+		free(f);
+	}
+	free(diffstat->files);
+}
+
 struct checkdiff_t {
 	struct xdiff_emit_state xm;
 	const char *filename;
@@ -2943,8 +2978,9 @@ void diff_flush(struct diff_options *options)
 			show_numstat(&diffstat, options);
 		if (output_format & DIFF_FORMAT_DIFFSTAT)
 			show_stats(&diffstat, options);
-		else if (output_format & DIFF_FORMAT_SHORTSTAT)
+		if (output_format & DIFF_FORMAT_SHORTSTAT)
 			show_shortstats(&diffstat);
+		free_diffstat_info(&diffstat);
 		separator++;
 	}
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-12  7:46           ` Junio C Hamano
@ 2007-12-12 10:21             ` Jakub Narebski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Narebski @ 2007-12-12 10:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > The previous patch is a fix (usability fix) for numstat output in the
> > presence of rename detection, making it truly machine friendly. Moreover
> > it should not break any scripts which parsed numstat output not
> > expecting to deal with renames (for example if repository has
> > diff.renames config option set), and might even fix them.
> >
> > The proposed - and implemented in patch below - change could break some
> > scripts not expecting to deal with new numstat output.
> 
> I do not quite follow.  If a script wanted -M and/or -C output in the
> older "common/{a => b}/common" format, either change would break it
> equally badly, either with or without -z.  And I do not think adding
> status letter is necessary for --numstat; if the caller wants that
> information, it can ask for it explicitly with both options.

I meant it there that if an [older] script parsing numstat output
didn't expect to deal with renames (but had to for example because
of diff.renames set), and expected the numstat output format to be
either

  <add> TAB <del> TAB <filename_q> LF

or (if running with -z option)

  <add> TAB <del> TAB <filename> NUL

After first patch, but before second, those expectations were met, that
I meant by "and might even fix them" (the scripts).

BTW. beside the fact that for renames it was not filename (pathname),
it could be half-quoted; see also below.

> So I do not think there is anything more to solve, than the patch you
> just sent.
> 
> The patch looks good.  Thanks.
> 
> > P.S. The numstat output format for renames should be probably described
> > in documentation, and not only in commit message, but I was not sure
> > where to put it (and even how it should be written).
> 
> I started writing this, and then reviewed the result of squashing your
> two patches.

And because of the above I wanted that you'd rather _not_ squash those
patches, but apply them separately...

> Although everybody complains that most of git is run-once-and-exit, I
> wrote the original diff part fairly conservatively to make it leak-free,
> because of a single command, "diff-tree --stdin", that can be told to
> run millions of diffs inside a single process.
> 
> But the diffstat part is horribly leaky, especially after your patch,
> and it has ugly workarounds such as refusing to show both diffstat and
> shortstat at the same time, not because it does not make much sense
> (admittedly it doesn't), but because show_stats() discards necessary
> information when it is done, making it impossible for shortstat to run.
> 
> So I ended up restructuring the name allocation/free code a bit while at
> it.

...but as you did more surgery on diffstat, it is reasonable to do it
in one commit, not two.

> -- >8 --
> diff --numstat -z: make it machine readable
> 
> The "-z" format is all about machine parsability, but showing renamed
> paths as "common/{a => b}/suffix" makes it impossible.  The scripts would
> never have successfully parsed "--numstat -z -M" in the old format.
> 
> This fixes the output format in a (hopefully minimally) backward
> incompatible way.
> 
>  * The output without -z is not changed.  This has given a good way for
>    humans to view added and deleted lines separately, and showing the
>    path in combined, shorter way would preserve readability.
> 
>  * The output with -z is unchanged for paths that do not involve renames.
>    Existing scripts that do not pass -M/-C are not affected at all.
> 
>  * The output with -z for a renamed path is shown in a format that can
>    easily be distinguished from an unrenamed path.

This changes invariant what we had, that output with and without -z differ
only in [filename] quoting and record separator. Now those two have
different semantic: --numstat without -z is for easier reading, --numstat
with -z is for machine consumption.

But as output with and without -z has to differ (in the form) already
for renames, why not...

[...]
>  include::diff-generate-patch.txt[]
> +
> +
> +other diff formats
> +------------------
> +
> +The `--summary` option describes newly added, deleted, renamed and
> +copied files.  The `--stat` option adds diffstat(1) graph to the
> +output.  These options can be combined with other options, such as
> +`-p`, and are meant for human consumption.
> +
> +When showing a change that involves a rename or a copy, `--stat` output
> +formats the pathnames compactly by combining common prefix and suffix of
> +the pathnames.  For example, a change that moves `arch/i386/Makefile` to
> +`arch/x86/Makefile` while modifying 4 lines will be shown like this:

Perhaps it could be added there word about "table-like format", and about
column widths (total width, filename [column] width).

> +
> +------------------------------------
> +arch/{i386 => x86}/Makefile    |   4 +--

I would add there example of shortened filename there, for example

  +.../SubmittingPatches          |   2 +

Do I understand correctly that code first tries to strip directories,
and only if the filename is too long it does shorten filename?

By the way the combining common prefix and suffix (pprint_rename) doesn't
play well with quoting: when only one of the source and destination names
are quoted it wouldn't of course find any common prefix/suffix.  I think
that if one of filenames has to be quoted, both should be.

And shortening of filenames together with either combining common prefix,
and/or with filename quoting produces output with might be a bit strange
for some. But I don't have idea how (and if) it could be corrected.

> +------------------------------------
> +
> +The `--numstat` option gives the diffstat(1) information but is designed
> +for easier machine consumption.  An entry in `--numstat` output looks
> +like this:
> +
> +----------------------------------------
> +1	2	README
> +3	1	arch/{i386 => x86}/Makefile
> +----------------------------------------

Note that filename is not shortened in --numstat (as it can be in --stat).

> +
> +When `-z` output option is in effect, the output is formatted this way:
> +
> +----------------------------------------
> +1	2	README NUL
> +3	1	NUL arch/i386/Makefile NUL arch/x86/Makefile NUL
> +----------------------------------------

I wanted to say that it would make it harder on "line"-based parsers...
but I have realized that it would not, as one can simply read next two
"lines" (records) for pre-image and post-image filename if 0-length
filename is detected.

[...]
> +static void fill_print_name(struct diffstat_file *file)

Nice idea to separate this into function.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-11 23:09         ` Jakub Narebski
  2007-12-12  7:46           ` Junio C Hamano
@ 2007-12-12 10:31           ` Jakub Narebski
  2007-12-12 19:07             ` Junio C Hamano
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Narebski @ 2007-12-12 10:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

BTW. I have noticed something strange with current (after my two
patches) diffstat code:

 3606:[gitweb/web@git]# ./git diff-tree -C -C -r --stat gitweb/test~8 
 0456a2ba58efb0e1d5f7421d5a8a2278e3b15ebc
  .../test/file with spaces\tand\ttabs"              |    4 +++-
  1 files changed, 3 insertions(+), 1 deletions(-)
 3607:[gitweb/web@git]# ./git diff-tree -C -C -r --numstat gitweb/test~8
 0456a2ba58efb0e1d5f7421d5a8a2278e3b15ebc
 3       1       gitweb/test/file with spaces    "gitweb/test/file with spaces\tand\ttabs"

but:

 3603:[gitweb/web@git]# ./git diff-tree -C -C -r --stat gitweb/test~6
 cfb9ef9ec73b37f44e9370c4f5d91e01d46ec6e4
  gitweb/test/Documentation-symlink1                 |    1 +
  gitweb/test/Documentation-symlink2                 |    1 +
  .../{git-shortlog.8.html => git-shortlog.html}     |    0 
  3 files changed, 2 insertions(+), 0 deletions(-)

So rename detection works. Strange.

P.S. gitweb/test branch can be found on
  http://repo.or.cz/w/git/jnareb-git.git

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH (amend)] diff: Make numstat machine friendly also for renames (and copies)
  2007-12-12 10:31           ` Jakub Narebski
@ 2007-12-12 19:07             ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2007-12-12 19:07 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> writes:

> BTW. I have noticed something strange with current (after my two
> patches) diffstat code:
>
>  3606:[gitweb/web@git]# ./git diff-tree -C -C -r --stat gitweb/test~8 
>  0456a2ba58efb0e1d5f7421d5a8a2278e3b15ebc
>   .../test/file with spaces\tand\ttabs"              |    4 +++-
>   1 files changed, 3 insertions(+), 1 deletions(-)
>  3607:[gitweb/web@git]# ./git diff-tree -C -C -r --numstat gitweb/test~8
>  0456a2ba58efb0e1d5f7421d5a8a2278e3b15ebc
>  3       1       gitweb/test/file with spaces    "gitweb/test/file with spaces\tand\ttabs"

If you are wondering about the lack of => in --stat when --numstat is
showing rename, there is nothing strange going on.  "filename scaling"
done in --stat happens way later than pprint_rename() in the current
code structure (essentially, it chomps at the last slash to keep the
long "path" fit within the given space).  --stat (not --numstat) is for
human consumption and showing longer part of the name of postimage is
more important if we do not have enough room, and most tools that enable
rename use --summary with --stat anyway, so the rename information for
such an oddball long path can be found out elsewhere in the output if
needed.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2007-12-12 19:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-10 22:32 [PATCH] diff: Make numstat machine friendly also for renames (and copies) Jakub Narebski
2007-12-10 22:55 ` [PATCH (amend)] " Jakub Narebski
2007-12-11  1:06   ` Junio C Hamano
2007-12-11  1:26     ` Jakub Narebski
2007-12-11  2:50       ` Junio C Hamano
2007-12-11 23:09         ` Jakub Narebski
2007-12-12  7:46           ` Junio C Hamano
2007-12-12 10:21             ` Jakub Narebski
2007-12-12 10:31           ` Jakub Narebski
2007-12-12 19:07             ` Junio C Hamano
2007-12-10 23:00 ` [PATCH] " Junio C Hamano
2007-12-10 23:14   ` Jakub Narebski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).