* RFC: "negative" dirstat
@ 2011-04-18 21:02 Linus Torvalds
2011-04-18 21:33 ` Junio C Hamano
0 siblings, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2011-04-18 21:02 UTC (permalink / raw)
To: Junio C Hamano, Johan Herland, Git Mailing List
[-- Attachment #1: Type: text/plain, Size: 2689 bytes --]
Ok, so this is just an RFC patch, but the concept is pretty simple..
In the kernel, the ARM platform is growing boundlessly, with platform
files being added for every random SoC out there. One of the things
that Russell King (arm maintainer) has worried about is that when they
try to clean stuff up, code *removal* also ends up adding to the
damage, so if ARM ever gets its act together and is able to
consolidate a lot of this, it's still going to look very bad in the
statistics because there will be a lot of damage due to removed files.
In the regular "--diffstat" output, this is all very obvious, because
if you actually remove more lines than you add, it will say so, and
people will be very happy.
But in --dirstat, removed lines are always counted towards damage.
So here's a total hacky RFC patch to add a "--negative" option, which
allows for dirstat to actually take the amount of added/removed code
into account, and make "damage" a signed integer instead.
Example case right now for my pulls today:
[torvalds@i5 linux]$ git diff --dirstat=1 --cumulative @{6am}..
22.6% Documentation/input/
1.4% arch/powerpc/include/asm/
3.7% arch/powerpc/kernel/
6.6% arch/powerpc/
6.3% block/
6.0% drivers/input/
15.7% drivers/md/
22.1% drivers/
38.4% fs/btrfs/
39.1% fs/
2.7% include/linux/
[torvalds@i5 linux]$ git diff --dirstat=1 --cumulative --negative @{6am}..
56.2% Documentation/input/
1.9% arch/powerpc/include/asm/
4.3% arch/powerpc/kernel/
1.3% arch/powerpc/platforms/pseries/
1.4% arch/powerpc/platforms/
8.4% arch/powerpc/
2.4% block/
1.6% drivers/input/misc/
8.4% drivers/input/
-5.6% drivers/md/
2.7% drivers/
28.5% fs/btrfs/
29.0% fs/
ie note how with "--negative", it becomes obvious that drivers/md
actually removed more than it added, while some subdirectories were
all about adding (Documentation/input got a new file), while others
were more about changing existing lines with not a lot of additional
actual new code.
The diffstat for drivers/md looks like this:
7 files changed, 103 insertions(+), 137 deletions(-)
and I think most of the insertions were shorter lines than the deletions too.
NOTE! This is known-buggy in that you may end up in a situation where
the percentages are > 100% (the "total change" may be arbitrarily
small, since they all add up). In fact, you might get a
division-by-zero if the total change ends up being zero. This is a RFC
patch, nothing more. I don't know what the "right" solution for the
percentages should be (except that it obviously should never cause a
divide-by-zero).
Comments?
Linus "yeah, that option name sucks" Torvalds
[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 3234 bytes --]
diff.c | 26 +++++++++++++++++---------
diff.h | 1 +
2 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/diff.c b/diff.c
index 9fa841010cc2..c93fdaaa541f 100644
--- a/diff.c
+++ b/diff.c
@@ -1447,7 +1447,7 @@ static void show_numstat(struct diffstat_t *data, struct diff_options *options)
struct dirstat_file {
const char *name;
- unsigned long changed;
+ long changed;
};
struct dirstat_dir {
@@ -1458,7 +1458,7 @@ struct dirstat_dir {
static long gather_dirstat(struct diff_options *opt, struct dirstat_dir *dir,
unsigned long changed, const char *base, int baselen)
{
- unsigned long this_dir = 0;
+ long this_dir = 0;
unsigned int sources = 0;
const char *line_prefix = "";
struct strbuf *msg = NULL;
@@ -1499,12 +1499,14 @@ static long gather_dirstat(struct diff_options *opt, struct dirstat_dir *dir,
* under this directory (sources == 1).
*/
if (baselen && sources != 1) {
- int permille = this_dir * 1000 / changed;
+ int permille = labs(this_dir) * 1000 / changed;
if (permille) {
- int percent = permille / 10;
+ double percent = permille / 10.0;
if (percent >= dir->percent) {
- fprintf(opt->file, "%s%4d.%01d%% %.*s\n", line_prefix,
- percent, permille % 10, baselen, base);
+ if (this_dir < 0)
+ percent = -percent;
+ fprintf(opt->file, "%s%6.1f%% %.*s\n", line_prefix,
+ percent, baselen, base);
if (!dir->cumulative)
return 0;
}
@@ -1537,7 +1539,8 @@ static void show_dirstat(struct diff_options *options)
for (i = 0; i < q->nr; i++) {
struct diff_filepair *p = q->queue[i];
const char *name;
- unsigned long copied, added, damage;
+ unsigned long copied, added, deleted;
+ long damage;
name = p->one->path ? p->one->path : p->two->path;
@@ -1567,7 +1570,11 @@ static void show_dirstat(struct diff_options *options)
* damaged files, not damaged lines. This is done by
* counting only a single damaged line per file.
*/
- damage = (p->one->size - copied) + added;
+ deleted = p->one->size - copied;
+ if (DIFF_OPT_TST(options, DIFFSTAT_NEGATIVE))
+ damage = added - deleted;
+ else
+ damage = added + deleted;
if (DIFF_OPT_TST(options, DIRSTAT_BY_FILE) && damage > 0)
damage = 1;
@@ -3139,7 +3146,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac)
&options->dirstat_percent)) {
options->output_format |= DIFF_FORMAT_DIRSTAT;
DIFF_OPT_SET(options, DIRSTAT_BY_FILE);
- }
+ } else if (!strcmp(arg, "--negative"))
+ DIFF_OPT_SET(options, DIFFSTAT_NEGATIVE);
else if (!strcmp(arg, "--check"))
options->output_format |= DIFF_FORMAT_CHECKDIFF;
else if (!strcmp(arg, "--summary"))
diff --git a/diff.h b/diff.h
index 007a0554d4b2..95d6e65247ae 100644
--- a/diff.h
+++ b/diff.h
@@ -78,6 +78,7 @@ typedef struct strbuf *(*diff_prefix_fn_t)(struct diff_options *opt, void *data)
#define DIFF_OPT_IGNORE_UNTRACKED_IN_SUBMODULES (1 << 25)
#define DIFF_OPT_IGNORE_DIRTY_SUBMODULES (1 << 26)
#define DIFF_OPT_OVERRIDE_SUBMODULE_CONFIG (1 << 27)
+#define DIFF_OPT_DIFFSTAT_NEGATIVE (1 << 28)
#define DIFF_OPT_TST(opts, flag) ((opts)->flags & DIFF_OPT_##flag)
#define DIFF_OPT_SET(opts, flag) ((opts)->flags |= DIFF_OPT_##flag)
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: RFC: "negative" dirstat
2011-04-18 21:02 RFC: "negative" dirstat Linus Torvalds
@ 2011-04-18 21:33 ` Junio C Hamano
2011-04-18 21:46 ` Linus Torvalds
0 siblings, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2011-04-18 21:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Johan Herland, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Ok, so this is just an RFC patch, but the concept is pretty simple..
>
> In the kernel, the ARM platform is growing boundlessly, with platform
> files being added for every random SoC out there. One of the things
> that Russell King (arm maintainer) has worried about is that when they
> try to clean stuff up, code *removal* also ends up adding to the
> damage, so if ARM ever gets its act together and is able to
> consolidate a lot of this, it's still going to look very bad in the
> statistics because there will be a lot of damage due to removed files.
Are these removal really "remove old cruft, replace with a better version
which does not have much common to removed stuff", or are they more like
"remove N duplicated similar copies of old cruft, refactoring them
properly and the result is used by N callsites"?
The second reason you gave in an earlier discussion why dirstat uses the
damage assessor code was to disregard code movements. It appears to me
that if you spanhash the contents of _all_ files in the preimage and the
postimage of ARM tree and compute literal-added vs src-copied within the
whole tree, I wonder if you can mitigate this "false damage -- because the
refactoring involved code movement across files but within the same
subsystem".
I guess what it boils down to is what you are trying to measure as the
"goodness" value of a change. Adding a lot of Documentation may be good,
adding a lot of "subarchs that do not deserve to be" may be bad, and
moving common logic from one existing subarch to a common file (which
counts towards "literal-added" in that new common file, at the same time
counting towards deletion, i.e. "size - copied", from the original) and
reusing it in a new subarch by simply calling that common infrastructure
is a very good thing. At least, if you count literal-added vs src-copied
across the files within the subarch, instead of doing it per-file, you
would be able to detect the "moving" part more accurately. Of course, you
still cannot tell between good and bad kinds of additions, and you cannot
tell that the new subarch that reuses the result of refactoring by calling
into the refactored code, without understanding the source code, and I
don't think that is within the scope of dirstat.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: RFC: "negative" dirstat
2011-04-18 21:33 ` Junio C Hamano
@ 2011-04-18 21:46 ` Linus Torvalds
2011-04-19 4:51 ` Junio C Hamano
0 siblings, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2011-04-18 21:46 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Johan Herland, Git Mailing List
On Mon, Apr 18, 2011 at 2:33 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> Are these removal really "remove old cruft, replace with a better version
> which does not have much common to removed stuff", or are they more like
> "remove N duplicated similar copies of old cruft, refactoring them
> properly and the result is used by N callsites"?
Right now, there not so much of either.
The _hope_ is that there will be "remove <n> different implementations
of some gpio driver, and replace them with a couple of generic ones
and some setup code".
So no, it wouldn't necessarily be about code movement at all, but
about totally different code.
> The second reason you gave in an earlier discussion why dirstat uses the
> damage assessor code was to disregard code movements.
No, that only works within a file, which sometimes is common when you
have to re-order functions because or some re-organization.
See for example commit 9d412a43c3b2 ("vfs: split off vfsmount-related
parts of vfs_kern_mount()") in the kernel for an example of why line
ordering shouldn't necessarily count against damage.
But what I'm talking about is really different code, not
re-organization of existing one.
> I guess what it boils down to is what you are trying to measure as the
> "goodness" value of a change.
Yes.
> Adding a lot of Documentation may be good,
> adding a lot of "subarchs that do not deserve to be" may be bad, and
> moving common logic from one existing subarch to a common file (which
> counts towards "literal-added" in that new common file, at the same time
> counting towards deletion, i.e. "size - copied", from the original) and
> reusing it in a new subarch by simply calling that common infrastructure
> is a very good thing.
Yes. If I see a lot of lines added in Documentation/, I really don't
mind at all. And if I see a diffstat that says something like
1057 lines added, 2901 lines deleted
I'm extremely happy. It's very different from one that would say
2958 lines added, 0 lines deleted
even though --dirstat might consider the two equivalent right now.
> At least, if you count literal-added vs src-copied
> across the files within the subarch, instead of doing it per-file, you
> would be able to detect the "moving" part more accurately.
Yes. However, "moving" is in many ways still not as interesting as
"actually removed".
Moving, on it's own, is a pretty neutral operation. So I really don't
want to concentrate on the moving part. It's really about "some
operations actually clean up code and remove code in the process".
Linus
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: RFC: "negative" dirstat
2011-04-18 21:46 ` Linus Torvalds
@ 2011-04-19 4:51 ` Junio C Hamano
0 siblings, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2011-04-19 4:51 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Johan Herland, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, Apr 18, 2011 at 2:33 PM, Junio C Hamano <gitster@pobox.com> wrote:
> ...
>> I guess what it boils down to is what you are trying to measure as the
>> "goodness" value of a change.
>
> Yes.
>
> Yes. If I see a lot of lines added in Documentation/, I really don't
> mind at all. And if I see a diffstat that says something like
>
> 1057 lines added, 2901 lines deleted
>
> I'm extremely happy. It's very different from one that would say
>
> 2958 lines added, 0 lines deleted
>
> even though --dirstat might consider the two equivalent right now.
I am afraid that this is ultimately "how to compare apples and oranges"
(or "which is juicier, oranges or squirrels?") question.
If you are willing to say
1000 lines added, 700 lines deleted
means "It changes 700 lines (removing the original 700 and replacing them
with the new 700) and then adds 300 lines of new material", then you could
simply use the max(added, p->one->size - copied) instead of using the sum
of them as the damage count. Of course, depending on the nature of the
change, the above interpretation does not apply in general.
>> At least, if you count literal-added vs src-copied across the files
>> within the subarch, instead of doing it per-file, you would be able to
>> detect the "moving" part more accurately.
>
> Yes. However, "moving" is in many ways still not as interesting as
> "actually removed". Moving, on it's own, is a pretty neutral operation.
I think I was unclear, but I was shooting for the same thing.
I meant "by doing the added/copied computation across all files, the real
additions and deletions would reflect in the result without the noise from
the 'moving' part."
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2011-04-19 4:51 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-18 21:02 RFC: "negative" dirstat Linus Torvalds
2011-04-18 21:33 ` Junio C Hamano
2011-04-18 21:46 ` Linus Torvalds
2011-04-19 4:51 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).