git.vger.kernel.org archive mirror
* [RFC/PATCH] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C
@ 2010-07-28  9:43 Matthieu Moy
  2010-07-29 16:10 ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Matthieu Moy @ 2010-07-28  9:43 UTC (permalink / raw)
  To: git, gitster; +Cc: Matthieu Moy

These options take an optional argument, but this optional argument was
not documented.

Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
---
I'm not really happy with my description of -Bn/m, which I essentially
took from eeaa46031479 (Junio, Jun 3 2005, diff: Update -B
heuristics). Someone with better understanding of how it works can
probably propose something better.

 Documentation/diff-options.txt |   18 ++++++++++++++----
 1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 2371262..d07809c 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -206,10 +206,18 @@ endif::git-format-patch[]
 	the diff-patch output format.  Non default number of
 	digits can be specified with `--abbrev=<n>`.
 
--B::
+-B[<n>]::
+-B<n>/<m>::
 	Break complete rewrite changes into pairs of delete and create.
-
--M::
+	If `n` is specified, it gives the threshold (as a percentage
+	of changed lines) above which a change is considered as
+	complete rewrite.  For example, `-B90%` means git will detect a
+	rewrite if more than 90% of the lines have been modified.  If
+	`m` is specified, then it is the minimum amount of deleted
+	lines a surviving broken pair must have to avoid being merged
+	back together.  See linkgit:gitdiffcore[7] for more details.
+
+-M[<n>]::
 ifndef::git-log[]
 	Detect renames.
 endif::git-log[]
@@ -218,9 +226,11 @@ ifdef::git-log[]
 	For following files across renames while traversing history, see
 	`--follow`.
 endif::git-log[]
+	If `n` is specified, it has the same meaning as for `-B<n>`.
 
--C::
+-C[<n>]::
 	Detect copies as well as renames.  See also `--find-copies-harder`.
+	If `n` is specified, it has the same meaning as for `-B<n>`.
 
 ifndef::git-format-patch[]
 --diff-filter=[ACDMRTUXB*]::
-- 
1.7.2.25.g9ebe3


* Re: [RFC/PATCH] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C
  2010-07-28  9:43 [RFC/PATCH] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C Matthieu Moy
@ 2010-07-29 16:10 ` Junio C Hamano
  2010-07-30 15:23   ` Matthieu Moy
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2010-07-29 16:10 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git

Matthieu Moy <Matthieu.Moy@imag.fr> writes:

> I'm not really happy with my description of -Bn/m, which I essentially
> took from eeaa46031479 (Junio, Jun 3 2005, diff: Update -B
> heuristics). Someone with better understanding of how it works can
> probably propose something better.

Your explanation for '<n>' being the same across -B/-M/-C is reasonable.
Explanation of '<m>' might want to clarify why it counts only the deletion
and to mention that "100-similarity != dissimilarity", but as the end-user
level documentation, these probably are unnecessary.

> +-B[<n>]::
> +-B<n>/<m>::
>  	Break complete rewrite changes into pairs of delete and create.
> +	If `n` is specified, it gives the threshold (as a percentage
> +	of changed lines) above which a change is considered as
> +	complete rewrite.  For example, `-B90%` means git will detect a
> +	rewrite if more than 90% of the lines have been modified. ...

I am fine with the use of the word "lines" if it is clear that we are giving a
simplified explanation (white lie) to the readers, but the (dis)similarity
numbers don't have much to do with "lines".  I would of course be happier
if we can come up with a phrase that tells the readers that these numbers
range between 0-100%, and the larger the <n> is, the larger the extent of
change has to be for the filepair to be considered for -B/-M processing,
without use of the word "lines".

Thanks.


* Re: [RFC/PATCH] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C
  2010-07-29 16:10 ` Junio C Hamano
@ 2010-07-30 15:23   ` Matthieu Moy
  2010-07-30 16:42     ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Matthieu Moy @ 2010-07-30 15:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Matthieu Moy <Matthieu.Moy@imag.fr> writes:
>
>> I'm not really happy with my description of -Bn/m, which I essentially
>> took from eeaa46031479 (Junio, Jun 3 2005, diff: Update -B
>> heuristics). Someone with better understanding of how it works can
>> probably propose something better.
>
> Your explanation for '<n>' being the same across -B/-M/-C is reasonable.
> Explanation of '<m>' might want to clarify why it counts only the deletion
> and to mention that "100-similarity != dissimilarity", but as the end-user
> level documentation, these probably are unnecessary.

The thing is: I don't know the answer myself, so I'm not in a position
to write such documentation :-(.

>> +-B[<n>]::
>> +-B<n>/<m>::
>>  	Break complete rewrite changes into pairs of delete and create.
>> +	If `n` is specified, it gives the threshold (as a percentage
>> +	of changed lines) above which a change is considered as
>> +	complete rewrite.  For example, `-B90%` means git will detect a
>> +	rewrite if more than 90% of the lines have been modified. ...
>
> I am fine with the use of word "lines" if it is clear that we are giving a
> simplified explanation (white lie) to the readers, but the (dis)similarity
> numbers don't have much to do with "lines".

Likewise, I didn't write "lines" as a white lie, but because of my
ignorance ... hence my request for help.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/


* Re: [RFC/PATCH] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C
  2010-07-30 15:23   ` Matthieu Moy
@ 2010-07-30 16:42     ` Junio C Hamano
  2010-08-02 11:12       ` [PATCH v2] " Matthieu Moy
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2010-07-30 16:42 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git

Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>> Matthieu Moy <Matthieu.Moy@imag.fr> writes:
>
>> Explanation of '<m>' might want to clarify why it counts only the deletion
>> and to mention that "100-similarity != dissimilarity", but as the end-user
>> level documentation, these probably are unnecessary.
>
> The thing is: I don't know the answer myself, so I'm not in a position
> to write such documentation :-(.
> ...
> Likewise, I didn't write "lines" as a white lie, but because of my
> ignorance ... hence my request for help.

Sorry, but I actually do not have much more to say than what eeaa460
(diff: Update -B heuristics., 2005-06-03) says.

When breaking for the purpose of showing a patch as "total rewrite", what
matters is how little the original contents remain in the result.  Imagine
that you start from a 100-line document and removed 97 lines from it.  You
then added 27 lines of new material to make a 30-line document or added
997 lines to make a 1000-line document---either way you rewrote the
document and how dissimilar the result is relative to the original
wouldn't be different in either case.  N.B. this is only true as long as
there is enough new material in the result---removing 97% without adding
anything is not a rewrite.  This 97% is "how much did we discard from the
original", and it is the number you would see as the "dissimilarity index"
('m' in '-Bn/m').
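
The arithmetic in the paragraph above can be sketched as follows.  This
is only an illustration of the 100-line example, not git's actual code:
the function name and its parameters are hypothetical, and the sketch
deliberately ignores the N.B. case where nothing new is added.

```python
def dissimilarity(original_size: int, removed: int, added: int) -> float:
    """Fraction of the original content that was discarded.

    Hypothetical helper for illustration only.  'added' is accepted but
    deliberately unused: for the "total rewrite" view, only how much of
    the original was thrown away matters.
    """
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    if not 0 <= removed <= original_size:
        raise ValueError("removed must be between 0 and original_size")
    return removed / original_size


# 100-unit original, 97 units removed, then 27 units added (30-unit result)
small_result = dissimilarity(100, removed=97, added=27)
# Same original and removal, but 997 units added (1000-unit result)
large_result = dissimilarity(100, removed=97, added=997)

print(small_result == large_result)  # both describe the same amount discarded
```

Both calls give the same value because the size of the new material does
not enter the computation at all, which is the point of the example.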

When breaking, tentatively, for the purpose of rename detection, the
amount of the new material starts mattering more.  The reason why we try
to see if we want to break the pair is exactly because we hope that we may
find something similar to the new material in a blob that used to be in
but disappeared from another path in the preimage.  So we count both
deletion and addition to see if the pair has a lot of changes ('n' in
'-Bn/m'), which is similar to the way the "similarity index" used in the
"rename" codepath is computed, to decide if we want to tentatively break
the pair.  Halves of a pair that is tentatively broken, when they do not
have a matching rename, are merged back together if they were not total
rewrite (i.e. the dissimilarity index for the pair is lower than the
threshold 'm').

In either case, the algorithm to compute how much "stuff" was copied from
the original and how much "stuff" was added anew to the result is not
based on "lines", but based on "bytes".
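
The two-step decision described above (tentatively break on 'n', then
merge back on 'm' when no rename is found) might be sketched like this.
It is a simplified model under stated assumptions, not a transcription
of git's implementation: all names are hypothetical, sizes are in
bytes, normalising the edit extent by the larger of the two file sizes
is an assumption, and the defaults 0.50 and 0.60 mirror the 50% and 60%
noted next to DEFAULT_BREAK_SCORE and DEFAULT_MERGE_SCORE in diffcore.h.

```python
def edit_extent(old_size: int, new_size: int, removed: int, added: int) -> float:
    """Deletions plus additions, relative to the file size.

    Assumption: the larger of the old and new sizes is the denominator.
    """
    return (removed + added) / max(old_size, new_size)


def tentatively_break(old_size: int, new_size: int, removed: int, added: int,
                      n: float = 0.50) -> bool:
    """Step 1: break the pair if the overall amount of change reaches 'n'."""
    return edit_extent(old_size, new_size, removed, added) >= n


def stays_broken(old_size: int, removed: int, found_rename: bool,
                 m: float = 0.60) -> bool:
    """Step 2: a broken pair with no matching rename is merged back
    unless the dissimilarity (share of the original discarded) reaches 'm'."""
    return found_rename or (removed / old_size) >= m


# 1000-byte file: 300 bytes removed, 300 bytes added, no rename found.
broken = tentatively_break(1000, 1000, removed=300, added=300)
kept_broken = stays_broken(1000, removed=300, found_rename=False)
print(broken, kept_broken)
```

In this example the pair is broken tentatively (60% edit extent) but
merged back afterwards, because only 30% of the original was discarded.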


* [PATCH v2] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C
  2010-07-30 16:42     ` Junio C Hamano
@ 2010-08-02 11:12       ` Matthieu Moy
  2010-08-02 18:18         ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Matthieu Moy @ 2010-08-02 11:12 UTC (permalink / raw)
  To: git, gitster; +Cc: Matthieu Moy

These options take an optional argument, but this optional argument was
not documented.

While we're there, fix a typo in a comment in diffcore.h.

Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
---
Here's a new version. I've eliminated "line" from the wording. I'm
still not sure I'm technically correct, especially about the
interaction between "n" and "m" (the "A rewrite is considered when
both thresholds are reached" part of the patch).

 Documentation/diff-options.txt |   24 +++++++++++++++++++-----
 diffcore.h                     |    2 +-
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 2371262..f1ab4e7 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -206,10 +206,22 @@ endif::git-format-patch[]
 	the diff-patch output format.  Non default number of
 	digits can be specified with `--abbrev=<n>`.
 
--B::
-	Break complete rewrite changes into pairs of delete and create.
-
--M::
+-B[<n>]::
+-B<n>/<m>::
+	Break complete rewrite changes into pairs of delete and
+	create. When `n` and/or `m` are specified, they give threshold
+	(as a percentage of changed content) above which a change is
+	considered as complete rewrite. `n` is a threshold on the
+	similarity index (i.e. amount of addition/deletions compared
+	to the file's size). For example, `-B90%` means git will
+	detect a rewrite if more than 90% of the file's content has
+	been modified. `m` is a threshold on the dissimilarity index
+	(i.e. amount of deletions from the old version). A rewrite is
+	considered when both thresholds are reached. When either `n`
+	or `m` is not specified, the default applies (`n` = 50% and
+	`m` = 60%). See linkgit:gitdiffcore[7] for more details.
+
+-M[<n>]::
 ifndef::git-log[]
 	Detect renames.
 endif::git-log[]
@@ -218,9 +230,11 @@ ifdef::git-log[]
 	For following files across renames while traversing history, see
 	`--follow`.
 endif::git-log[]
+	If `n` is specified, it has the same meaning as for `-B<n>`.
 
--C::
+-C[<n>]::
 	Detect copies as well as renames.  See also `--find-copies-harder`.
+	If `n` is specified, it has the same meaning as for `-B<n>`.
 
 ifndef::git-format-patch[]
 --diff-filter=[ACDMRTUXB*]::
diff --git a/diffcore.h b/diffcore.h
index 491bea0..fed9b15 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -18,7 +18,7 @@
 #define MAX_SCORE 60000.0
 #define DEFAULT_RENAME_SCORE 30000 /* rename/copy similarity minimum (50%) */
 #define DEFAULT_BREAK_SCORE  30000 /* minimum for break to happen (50%) */
-#define DEFAULT_MERGE_SCORE  36000 /* maximum for break-merge to happen 60%) */
+#define DEFAULT_MERGE_SCORE  36000 /* maximum for break-merge to happen (60%) */
 
 #define MINIMUM_BREAK_SIZE     400 /* do not break a file smaller than this */
 
-- 
1.7.2.1.10.g5cb67a


* Re: [PATCH v2] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C
  2010-08-02 11:12       ` [PATCH v2] " Matthieu Moy
@ 2010-08-02 18:18         ` Junio C Hamano
  2010-08-05 16:09           ` Matthieu Moy
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2010-08-02 18:18 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git

Matthieu Moy <Matthieu.Moy@imag.fr> writes:

> These options take an optional argument, but this optional argument was
> not documented.
>
> While we're there, fix a typo in a comment in diffcore.h.
>
> Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
> ---
> Here's a new version. I've eliminated "line" from the wording. I'm
> still not sure I'm technically correct, especially about the
> interaction between "n" and "m" (the "A rewrite is considered when
> both thresholds are reached" part of the patch).

I think they are technically correct, but I doubt diff-options.txt is a
place to be more technically correct than to be useful to the end users.

What does an end user want to know from the list of options?

> +-B[<n>]::
> +-B<n>/<m>::
> +	Break complete rewrite changes into pairs of delete and
> +	create. When `n` and/or `m` are specified, they give threshold
> +	(as a percentage of changed content) above which a change is
> +	considered as complete rewrite. `n` is a threshold on the
> +	similarity index (i.e. amount of addition/deletions compared
> +	to the file's size). For example, `-B90%` means git will
> +	detect a rewrite if more than 90% of the file's content has
> +	been modified. `m` is a threshold on the dissimilarity index
> +	(i.e. amount of deletions from the old version). A rewrite is
> +	considered when both thresholds are reached. When either `n`
> +	or `m` is not specified, the default applies (`n` = 50% and
> +	`m` = 60%). See linkgit:gitdiffcore[7] for more details.

After reading this, we know that with magic numbers given to -B, we can
"break" changes into "pairs of delete and create".  What does it mean in
the practical terms?  That is a lot more essential information than how
the magic numbers affect the decision to break or not break.  The user
does not get a motivation to help git "break" a pair from the above.

    The -B option serves two purposes.

    It affects the way a change that amounts to a total rewrite of a file
    is shown: not as a series of deletions and insertions mixed together
    with a very few lines that happen to match textually as the context,
    but as a single deletion of everything old followed by a single
    insertion of everything new.  The number <m> controls this aspect of
    the -B option (defaults to 60%).  `-B/70%` specifies that less than
    30% of the original should remain in the result for git to consider
    it a total rewrite (i.e. otherwise the resulting patch will be a
    series of deletions and insertions mixed together with context lines).

    When used with -M, a totally-rewritten file is also considered as the
    source of a rename (usually -M only considers a file that disappeared
    as the source of a rename), and the number <n> controls this aspect of
    the -B option (defaults to 50%).  `-B20%` specifies that a change with
    additions and deletions amounting to 20% or more of the file's size is
    eligible for being picked up as a possible source of a rename to
    another file.
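
For readers who want to relate these percentages to the constants in
diffcore.h, here is a small sketch.  MAX_SCORE and the two default
scores come from diffcore.h as shown in this thread; the helper names
are hypothetical, and this is not git's option parser.

```python
MAX_SCORE = 60000  # scale used by the constants in diffcore.h


def percent_to_score(percent: int) -> int:
    """Map a whole-number percentage onto the 0..MAX_SCORE scale."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return percent * MAX_SCORE // 100


def remaining_limit_percent(m_percent: int) -> int:
    """With -B/<m>%, less than (100 - m)% of the original may remain
    for the change to count as a total rewrite."""
    return 100 - m_percent


print(percent_to_score(50))         # matches DEFAULT_BREAK_SCORE (30000)
print(percent_to_score(60))         # matches DEFAULT_MERGE_SCORE (36000)
print(remaining_limit_percent(70))  # the 30% in the -B/70% example
```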


* Re: [PATCH v2] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C
  2010-08-02 18:18         ` Junio C Hamano
@ 2010-08-05 16:09           ` Matthieu Moy
  2010-08-05 16:14             ` [PATCH] " Matthieu Moy
  0 siblings, 1 reply; 8+ messages in thread
From: Matthieu Moy @ 2010-08-05 16:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> After reading this, we know that with magic numbers given to -B, we can
> "break" changes into "pairs of delete and create".  What does it mean in
> the practical terms?  That is a lot more essential information than how
> the magic numbers affect the decision to break or not break.  The user
> does not get a motivation to help git "break" a pair from the above.
>
>     The -B option serves two purposes.
>     [...]

I like your version much more than mine. I took it with a bit of
asciidoc reformatting.

I described the -M<n> case in a bit more detail, so that users do not
have to read the whole -B thing to understand -M<n>. I'm citing the
expression "similarity index" since it sometimes appears in the output
of diff and can therefore help the user match what's happening with the doc.

New version follows.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/


* [PATCH] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C
  2010-08-05 16:09           ` Matthieu Moy
@ 2010-08-05 16:14             ` Matthieu Moy
  0 siblings, 0 replies; 8+ messages in thread
From: Matthieu Moy @ 2010-08-05 16:14 UTC (permalink / raw)
  To: git, gitster; +Cc: Matthieu Moy

These options take an optional argument, but this optional argument was
not documented.

Original patch by Matthieu Moy, but documentation for -B mostly copied
from the explanations of Junio C Hamano.

While we're there, fix a typo in a comment in diffcore.h.

Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
---
 Documentation/diff-options.txt |   35 ++++++++++++++++++++++++++++++-----
 diffcore.h                     |    2 +-
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 2371262..eecedaa 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -206,10 +206,29 @@ endif::git-format-patch[]
 	the diff-patch output format.  Non default number of
 	digits can be specified with `--abbrev=<n>`.
 
--B::
-	Break complete rewrite changes into pairs of delete and create.
-
--M::
+-B[<n>][/<m>]::
+	Break complete rewrite changes into pairs of delete and
+	create. This serves two purposes:
++
+It affects the way a change that amounts to a total rewrite of a file
+is shown: not as a series of deletions and insertions mixed together
+with a very few lines that happen to match textually as the context,
+but as a single deletion of everything old followed by a single
+insertion of everything new. The number `m` controls this aspect of
+the -B option (defaults to 60%). `-B/70%` specifies that less than 30%
+of the original should remain in the result for git to consider it a
+total rewrite (i.e. otherwise the resulting patch will be a series of
+deletions and insertions mixed together with context lines).
++
+When used with -M, a totally-rewritten file is also considered as the
+source of a rename (usually -M only considers a file that disappeared
+as the source of a rename), and the number `n` controls this aspect of
+the -B option (defaults to 50%). `-B20%` specifies that a change with
+additions and deletions amounting to 20% or more of the file's size is
+eligible for being picked up as a possible source of a rename to
+another file.
+
+-M[<n>]::
 ifndef::git-log[]
 	Detect renames.
 endif::git-log[]
@@ -218,9 +237,15 @@ ifdef::git-log[]
 	For following files across renames while traversing history, see
 	`--follow`.
 endif::git-log[]
+	If `n` is specified, it is a threshold on the similarity
+	index (i.e. amount of addition/deletions compared to the
+	file's size). For example, `-M90%` means git should consider a
+	delete/add pair to be a rename if more than 90% of the file
+	hasn't changed.
 
--C::
+-C[<n>]::
 	Detect copies as well as renames.  See also `--find-copies-harder`.
+	If `n` is specified, it has the same meaning as for `-M<n>`.
 
 ifndef::git-format-patch[]
 --diff-filter=[ACDMRTUXB*]::
diff --git a/diffcore.h b/diffcore.h
index 491bea0..fed9b15 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -18,7 +18,7 @@
 #define MAX_SCORE 60000.0
 #define DEFAULT_RENAME_SCORE 30000 /* rename/copy similarity minimum (50%) */
 #define DEFAULT_BREAK_SCORE  30000 /* minimum for break to happen (50%) */
-#define DEFAULT_MERGE_SCORE  36000 /* maximum for break-merge to happen 60%) */
+#define DEFAULT_MERGE_SCORE  36000 /* maximum for break-merge to happen (60%) */
 
 #define MINIMUM_BREAK_SIZE     400 /* do not break a file smaller than this */
 
-- 
1.7.2.1.30.g18195

