[PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified
@ 2011-03-08 10:59 Namhyung Kim
  2011-03-08 10:59 ` [PATCH 2/2] blame: introduce -u/--unique option Namhyung Kim
  2011-03-09 19:12 ` [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Namhyung Kim @ 2011-03-08 10:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

If find_unique_abbrev() finds a ambiguous SHA1 name, it tries
to find again with increased length. In this case, result hex
strings could have different lengths even though the
core.abbrevguard config option is specified. But if the option
is specified and increased length (delta) is less than its
value, the result could be adjusted to the same length.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 sha1_name.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 709ff2e..6bb8942 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -197,6 +197,7 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
 {
 	int status, exists;
 	static char hex[41];
+	int extra_len = unique_abbrev_extra_length;
 
 	exists = has_sha1_file(sha1);
 	memcpy(hex, sha1_to_hex(sha1), 40);
@@ -208,12 +209,14 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
 		if (exists
 		    ? !status
 		    : status == SHORT_NAME_NOT_FOUND) {
-			int cut_at = len + unique_abbrev_extra_length;
+			int cut_at = len + extra_len;
 			cut_at = (cut_at < 40) ? cut_at : 40;
 			hex[cut_at] = 0;
 			return hex;
 		}
 		len++;
+		if (extra_len > 0)
+			extra_len--;
 	}
 	return hex;
 }
-- 
1.7.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] blame: introduce -u/--unique option
  2011-03-08 10:59 [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified Namhyung Kim
@ 2011-03-08 10:59 ` Namhyung Kim
  2011-03-09 19:45   ` Junio C Hamano
  2011-03-09 19:12 ` [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Namhyung Kim @ 2011-03-08 10:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

-u/--unique option will find and use minimum length of unique
SHA-1 name. If -l option is specified also, it will have higher
priority, IOW git blame will use full 40-length SHA-1 name.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 builtin/blame.c |   18 +++++++++++++++++-
 1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index aa30ec5..9ea41bc 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -41,6 +41,7 @@ static int reverse;
 static int blank_boundary;
 static int incremental;
 static int xdl_opts;
+static int longest_uniq_sha1;
 
 static enum date_mode blame_date_mode = DATE_ISO8601;
 static size_t blame_date_width;
@@ -1618,6 +1619,7 @@ static const char *format_time(unsigned long time, const char *tz_str,
 #define OUTPUT_SHOW_SCORE      0100
 #define OUTPUT_NO_AUTHOR       0200
 #define OUTPUT_SHOW_EMAIL	0400
+#define OUTPUT_UNIQ_OBJECT_NAME	01000
 
 static void emit_porcelain(struct scoreboard *sb, struct blame_entry *ent)
 {
@@ -1664,14 +1666,22 @@ static void emit_other(struct scoreboard *sb, struct blame_entry *ent, int opt)
 	struct commit_info ci;
 	char hex[41];
 	int show_raw_time = !!(opt & OUTPUT_RAW_TIMESTAMP);
+	int sha1_len;
 
 	get_commit_info(suspect->commit, &ci, 1);
 	strcpy(hex, sha1_to_hex(suspect->commit->object.sha1));
 
+	if (opt & OUTPUT_LONG_OBJECT_NAME)
+		sha1_len = 40;
+	else if (opt & OUTPUT_UNIQ_OBJECT_NAME)
+		sha1_len = longest_uniq_sha1;
+	else
+		sha1_len = 8;
+
 	cp = nth_line(sb, ent->lno);
 	for (cnt = 0; cnt < ent->num_lines; cnt++) {
 		char ch;
-		int length = (opt & OUTPUT_LONG_OBJECT_NAME) ? 40 : 8;
+		int length = sha1_len;
 
 		if (suspect->commit->object.flags & UNINTERESTING) {
 			if (blank_boundary)
@@ -1842,6 +1852,7 @@ static void find_alignment(struct scoreboard *sb, int *option)
 	for (e = sb->ent; e; e = e->next) {
 		struct origin *suspect = e->suspect;
 		struct commit_info ci;
+		const char *sha1;
 		int num;
 
 		if (strcmp(suspect->path, sb->path))
@@ -1867,6 +1878,10 @@ static void find_alignment(struct scoreboard *sb, int *option)
 			longest_dst_lines = num;
 		if (largest_score < ent_score(sb, e))
 			largest_score = ent_score(sb, e);
+		sha1 = find_unique_abbrev(suspect->commit->object.sha1,
+					  MINIMUM_ABBREV);
+		if (longest_uniq_sha1 < strlen(sha1))
+			longest_uniq_sha1 = strlen(sha1);
 	}
 	max_orig_digits = lineno_width(longest_src_lines);
 	max_digits = lineno_width(longest_dst_lines);
@@ -2306,6 +2321,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 		OPT_BIT('s', NULL, &output_option, "Suppress author name and timestamp (Default: off)", OUTPUT_NO_AUTHOR),
 		OPT_BIT('e', "show-email", &output_option, "Show author email instead of name (Default: off)", OUTPUT_SHOW_EMAIL),
 		OPT_BIT('w', NULL, &xdl_opts, "Ignore whitespace differences", XDF_IGNORE_WHITESPACE),
+		OPT_BIT('u', "unique", &output_option, "Show minimum unique SHA-1 (Default: off)", OUTPUT_UNIQ_OBJECT_NAME),
 		OPT_STRING('S', NULL, &revs_file, "file", "Use revisions from <file> instead of calling git-rev-list"),
 		OPT_STRING(0, "contents", &contents_from, "file", "Use <file>'s contents as the final image"),
 		{ OPTION_CALLBACK, 'C', NULL, &opt, "score", "Find line copies within and across files", PARSE_OPT_OPTARG, blame_copy_callback },
-- 
1.7.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified
  2011-03-08 10:59 [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified Namhyung Kim
  2011-03-08 10:59 ` [PATCH 2/2] blame: introduce -u/--unique option Namhyung Kim
@ 2011-03-09 19:12 ` Junio C Hamano
  2011-03-09 19:20   ` Junio C Hamano
  2011-03-10  5:27   ` Namhyung Kim
  1 sibling, 2 replies; 13+ messages in thread
From: Junio C Hamano @ 2011-03-09 19:12 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: Junio C Hamano, git

Namhyung Kim <namhyung@gmail.com> writes:

> If find_unique_abbrev() finds a ambiguous SHA1 name, it tries
> to find again with increased length. In this case, result hex
> strings could have different lengths even though the
> core.abbrevguard config option is specified. But if the option
> is specified and increased length (delta) is less than its
> value, the result could be adjusted to the same length.

I am not sure if I can understand what problem you are trying to solve
from the above description.

The function is given "len" from the caller to specify the minimum length
of the output the caller expects (i.e. even if 4 hexdigits is enough to
identify the given commit in a small project, the caller can say it wants
to see at least 7 hexdigits).  The loop without your patch finds the
shortest prefix whose length is at least that given length that uniquely
identifies the given object (or the shortest prefix that doesn't identify
any existing object if the given sha1 does not exist in the repository).
And then ensures the returned value is longer by the guard as an extra
safety measure, so that later when the project grows, the disambiguation
we find today has a better chance to survive.

With this patch, the loop decreases the length of the guard when "len"
given by the caller is insufficient to ensure uniqueness, which does not
sound right.

Suppose the given object has ambiguous other objects and you need 8
hexdigits at least to make it unique in today's history.  The caller gives
you len of 7, and the guard is set to 3.

With the original code, the loop starts with 7, finds that it is not
long enough to disambiguate, increments and retries, finds that 8 is the
shortest prefix, and then adds the guard and returns 11 hexdigits.

With your patch, the loop starts with 7 with extra set to 3, finds that 7
is not long enough and decrements extra to 2, finds that 8 is the shortest
prefix, and then returns only 10 hexdigits.

Which feels like totally going against the reason why we added the guard.

What am I missing?

>  sha1_name.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/sha1_name.c b/sha1_name.c
> index 709ff2e..6bb8942 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -197,6 +197,7 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
>  {
>  	int status, exists;
>  	static char hex[41];
> +	int extra_len = unique_abbrev_extra_length;
>  
>  	exists = has_sha1_file(sha1);
>  	memcpy(hex, sha1_to_hex(sha1), 40);
> @@ -208,12 +209,14 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
>  		if (exists
>  		    ? !status
>  		    : status == SHORT_NAME_NOT_FOUND) {
> -			int cut_at = len + unique_abbrev_extra_length;
> +			int cut_at = len + extra_len;
>  			cut_at = (cut_at < 40) ? cut_at : 40;
>  			hex[cut_at] = 0;
>  			return hex;
>  		}
>  		len++;
> +		if (extra_len > 0)
> +			extra_len--;
>  	}
>  	return hex;
>  }

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified
  2011-03-09 19:12 ` [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified Junio C Hamano
@ 2011-03-09 19:20   ` Junio C Hamano
  2011-03-10  9:19     ` Re* " Junio C Hamano
  2011-03-10  5:27   ` Namhyung Kim
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2011-03-09 19:20 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> The function is given "len" from the caller to specify the minimum length
> of the output the caller expects (i.e. even if 4 hexdigits is enough to
> identify the given commit in a small project, the caller can say it wants
> to see at least 7 hexdigits).  The loop without your patch finds the
> shortest prefix whose length is at least that given length that uniquely
> identifies the given object (or the shortest prefix that doesn't identify
> any existing object if the given sha1 does not exist in the repository).
> And then ensures the returned value is longer by the guard as an extra
> safety measure, so that later when the project grows, the disambiguation
> we find today has a better chance to survive.
> ...
> What am I missing?

I think what may be desirable is to honor the caller-supplied "len" a bit
better.  If an object is uniquely identifiable with only 4-hexdigit today,
and if the caller gives 7 as len and the guard is set to 3, we return 10
hexdigits with the current code.  We should instead return 7 hexdigits in
such a case, as that is in line with the "use 3 extra to give the
disambiguation we find today a better chance to survive".

Perhaps that was what you wanted to do, but I don't think that is what
your code actually does.

>
>>  sha1_name.c |    5 ++++-
>>  1 files changed, 4 insertions(+), 1 deletions(-)
>>
>> diff --git a/sha1_name.c b/sha1_name.c
>> index 709ff2e..6bb8942 100644
>> --- a/sha1_name.c
>> +++ b/sha1_name.c
>> @@ -197,6 +197,7 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
>>  {
>>  	int status, exists;
>>  	static char hex[41];
>> +	int extra_len = unique_abbrev_extra_length;
>>  
>>  	exists = has_sha1_file(sha1);
>>  	memcpy(hex, sha1_to_hex(sha1), 40);
>> @@ -208,12 +209,14 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
>>  		if (exists
>>  		    ? !status
>>  		    : status == SHORT_NAME_NOT_FOUND) {
>> -			int cut_at = len + unique_abbrev_extra_length;
>> +			int cut_at = len + extra_len;
>>  			cut_at = (cut_at < 40) ? cut_at : 40;
>>  			hex[cut_at] = 0;
>>  			return hex;
>>  		}
>>  		len++;
>> +		if (extra_len > 0)
>> +			extra_len--;
>>  	}
>>  	return hex;
>>  }

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] blame: introduce -u/--unique option
  2011-03-08 10:59 ` [PATCH 2/2] blame: introduce -u/--unique option Namhyung Kim
@ 2011-03-09 19:45   ` Junio C Hamano
  2011-03-10  6:13     ` Namhyung Kim
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2011-03-09 19:45 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: git

Namhyung Kim <namhyung@gmail.com> writes:

> -u/--unique option will find and use minimum length of unique
> SHA-1 name. If -l option is specified also, it will have higher
> priority, IOW git blame will use full 40-length SHA-1 name.

> @@ -1867,6 +1878,10 @@ static void find_alignment(struct scoreboard *sb, int *option)
>  			longest_dst_lines = num;
>  		if (largest_score < ent_score(sb, e))
>  			largest_score = ent_score(sb, e);
> +		sha1 = find_unique_abbrev(suspect->commit->object.sha1,
> +					  MINIMUM_ABBREV);
> +		if (longest_uniq_sha1 < strlen(sha1))
> +			longest_uniq_sha1 = strlen(sha1);

The logic to determine and keep track of the longuest-unique looks
correct, but I was hoping that we already have an easy optimization
codepath to do this only once per commit, not for every blame-entry in the
result.  Doesn't the code have a similar optimization to figure out the
necessary number of columns to show author names (I haven't read the code
recently, though)?

Also we might find that the performance impact of doing this may be so
miniscule that it is not worth wasting a short option name.  If we were to
use an option, I was actually hoping that the option would let the users
specify a value different from the hardcoded 8 at the same time.  E.g.

    git blame --abbrev=8 ;# current default with uniquefy applied
    git blame --abbrev=4 ;# equivalent to your "blame -u"

Can we have a benchmark of this feature in a largish and busy file in a
project with a deep history?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified
  2011-03-09 19:12 ` [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified Junio C Hamano
  2011-03-09 19:20   ` Junio C Hamano
@ 2011-03-10  5:27   ` Namhyung Kim
  1 sibling, 0 replies; 13+ messages in thread
From: Namhyung Kim @ 2011-03-10  5:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

2011-03-09 (수), 11:12 -0800, Junio C Hamano: 
> Namhyung Kim <namhyung@gmail.com> writes:
> 
> > If find_unique_abbrev() finds a ambiguous SHA1 name, it tries
> > to find again with increased length. In this case, result hex
> > strings could have different lengths even though the
> > core.abbrevguard config option is specified. But if the option
> > is specified and increased length (delta) is less than its
> > value, the result could be adjusted to the same length.
> 
> I am not sure if I can understand what problem you are trying to solve
> from the above description.
> 

That's probably because of my poor English. :(


> The function is given "len" from the caller to specify the minimum length
> of the output the caller expects (i.e. even if 4 hexdigits is enough to
> identify the given commit in a small project, the caller can say it wants
> to see at least 7 hexdigits).  The loop without your patch finds the
> shortest prefix whose length is at least that given length that uniquely
> identifies the given object (or the shortest prefix that doesn't identify
> any existing object if the given sha1 does not exist in the repository).
> And then ensures the returned value is longer by the guard as an extra
> safety measure, so that later when the project grows, the disambiguation
> we find today has a better chance to survive.
> 
> With this patch, the loop decreases the length of the guard when "len"
> given by the caller is insufficient to ensure uniqueness, which does not
> sound right.
> 
> Suppose the given object has ambiguous other objects and you need 8
> hexdigits at least to make it unique in today's history.  The caller gives
> you len of 7, and the guard is set to 3.
> 
> With the original code, the loop starts with 7, finds that it is not
> long enough to disambiguate, increments and retries, finds that 8 is the
> shortest prefix, and then adds the guard and returns 11 hexdigits.
> 
> With your patch, the loop starts with 7 with extra set to 3, finds that 7
> is not long enough and decrements extra to 2, finds that 8 is the shortest
> prefix, and then returns only 10 hexdigits.
> 
> Which feels like totally going against the reason why we added the guard.
> 
> What am I missing?
> 

What I was thinking is like below (from a user's point of view):

I knew git use 7 hexdigit to represent a commit object and it was not
enough for my project. So I gave it 5 for the guard and expected to see
12 in everywhere - if 12 always guarantees uniqueness for the project
now and probably, the near future. But sometimes git prints 13 even
though 12 is long enough to represent the commit uniquely. Because 7 is
not enough but 8 is, and simply add 5 (the guard) to it.

So I thought it would be natural to print 12 always in this case and
thus, submitted the patch.

Thanks.

-- 
Regards,
Namhyung Kim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] blame: introduce -u/--unique option
  2011-03-09 19:45   ` Junio C Hamano
@ 2011-03-10  6:13     ` Namhyung Kim
  2011-03-10  9:43       ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Namhyung Kim @ 2011-03-10  6:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

2011-03-09 (수), 11:45 -0800, Junio C Hamano:
> Namhyung Kim <namhyung@gmail.com> writes:
> 
> > -u/--unique option will find and use minimum length of unique
> > SHA-1 name. If -l option is specified also, it will have higher
> > priority, IOW git blame will use full 40-length SHA-1 name.
> 
> > @@ -1867,6 +1878,10 @@ static void find_alignment(struct scoreboard *sb, int *option)
> >  			longest_dst_lines = num;
> >  		if (largest_score < ent_score(sb, e))
> >  			largest_score = ent_score(sb, e);
> > +		sha1 = find_unique_abbrev(suspect->commit->object.sha1,
> > +					  MINIMUM_ABBREV);
> > +		if (longest_uniq_sha1 < strlen(sha1))
> > +			longest_uniq_sha1 = strlen(sha1);
> 
> The logic to determine and keep track of the longuest-unique looks
> correct, but I was hoping that we already have an easy optimization
> codepath to do this only once per commit, not for every blame-entry in the
> result.  Doesn't the code have a similar optimization to figure out the
> necessary number of columns to show author names (I haven't read the code
> recently, though)?
> 

Right. METAINFO_SHOWN flag does that. I'll move the code under the block
in next version.


> Also we might find that the performance impact of doing this may be so
> miniscule that it is not worth wasting a short option name.  If we were to
> use an option, I was actually hoping that the option would let the users
> specify a value different from the hardcoded 8 at the same time.  E.g.
> 
>     git blame --abbrev=8 ;# current default with uniquefy applied
>     git blame --abbrev=4 ;# equivalent to your "blame -u"
> 

Hmm... I thought about that too. But you mean that you only want
--abbrev instead of -u, right?

My original intention was if user specified --abbrev explicitly, it
should be honored regardless of the uniqueness. The guard will not be
used in this situation because the user gave the exact length [s]he
wants to see.


> Can we have a benchmark of this feature in a largish and busy file in a
> project with a deep history?
> 

I gave it a try on a file in Linux kernel (with METAINFO_SHOWN
optimization applied):

$ wc -l mm/page_alloc.c
5629 mm/page_alloc.c
$ git log --oneline mm/page_alloc.c | wc -l
566
$ perf record ../git/git blame -u mm/page_alloc.c > /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.229 MB perf.data (~10014 samples) ]
$ perf report
# Events: 5K cycles
#
# Overhead  Command         Shared Object                         Symbol
# ........  .......  ....................  .............................
#
    21.11%      git  git                   [.] cmd_blame
    18.34%      git  git                   [.] get_origin
    18.32%      git  git                   [.] pass_blame
     3.99%      git  libz.so.1.2.3.3       [.] inflate
     3.14%      git  git                   [.] xdl_hash_record
     2.84%      git  libz.so.1.2.3.3       [.] inflate_table
     2.60%      git  libz.so.1.2.3.3       [.] inflate_fast
     2.23%      git  git                   [.] find_pack_entry_one
     1.84%      git  git                   [.] xdl_recmatch
     1.51%      git  libc-2.11.1.so        [.] memcpy
     1.49%      git  git                   [.] xdl_prepare_env
     1.36%      git  git                   [.] xdl_prepare_ctx
     1.13%      git  [kernel.kallsyms]     [k] __lock_acquire
     1.12%      git  git                   [.] xdi_diff
     1.09%      git  git                   [.] blame_chunk
     0.99%      git  libc-2.11.1.so        [.] _int_malloc
     0.79%      git  git                   [.] tree_entry_interesting
     0.77%      git  git                   [.] origin_decref
     0.67%      git  libz.so.1.2.3.3       [.] adler32
     0.42%      git  [kernel.kallsyms]     [k] sched_clock_local
     0.40%      git  git                   [.] decode_tree_entry
     0.38%      git  [kernel.kallsyms]     [k] clear_page_c
     0.35%      git  libc-2.11.1.so        [.] __GI___strncmp_ssse3
     0.32%      git  git                   [.] lookup_object
     0.31%      git  [kernel.kallsyms]     [k] lock_release
     0.29%      git  [kernel.kallsyms]     [k] hlock_class
     0.29%      git  [kernel.kallsyms]     [k] page_fault
     0.29%      git  git                   [.] patch_delta
     0.28%      git  [kernel.kallsyms]     [k] local_clock
     0.27%      git  [kernel.kallsyms]     [k] sched_clock
     0.25%      git  [kernel.kallsyms]     [k] delay_tsc
     0.25%      git  git                   [.] parse_commit_buffer
     0.25%      git  [kernel.kallsyms]     [k] look_up_lock_class
     0.23%      git  git                   [.] xdl_cha_alloc
     0.22%      git  libc-2.11.1.so        [.] _int_free
     0.22%      git  libc-2.11.1.so        [.] malloc_consolidate
     0.22%      git  [kernel.kallsyms]     [k] mark_lock
     0.21%      git  libc-2.11.1.so        [.] __malloc
     0.20%      git  git                   [.] nth_packed_object_offset

Looks like it doesn't have a performance impact IMHO.
Hope this helps.

Thank you.


-- 
Regards,
Namhyung Kim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re* [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified
  2011-03-09 19:20   ` Junio C Hamano
@ 2011-03-10  9:19     ` Junio C Hamano
  2011-03-10 11:52       ` Namhyung Kim
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2011-03-10  9:19 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> I think what may be desirable is to honor the caller-supplied "len" a bit
> better.  If an object is uniquely identifiable with only 4-hexdigit today,
> and if the caller gives 7 as len and the guard is set to 3, we return 10
> hexdigits with the current code.  We should instead return 7 hexdigits in
> such a case, as that is in line with the "use 3 extra to give the
> disambiguation we find today a better chance to survive".

And here is an attempt to do so.  I have to admit that I didn't give it
too much thought, though, so please be careful when reviewing the logic.

-- >8 --
Subject: find_unique_abbrev(): honor caller-supplied "len" better

The caller of this function wants to ensure that the returned string is a
unique abbreviation of the object name, and at least len characters long.
When "len" is sufficiently short and we cannot ensure uniqueness with only
"len" bytes, we returned minimally unique prefix (i.e. if you dropped the
last character, there would be two or more objects that share that same
prefix as their names in the repository).

An earlier change introduced core.abbrevguard configuration with a
realization that a short prefix that is unique today may not stay unique
forever, as new objects are added to the repository. When "len" is shorter
than the length necessary to ensure uniqueness today, instead of returning
a string that is only one character longer than the longest ambiguous
prefix, we wanted to add that many extra characters to ensure uniqueness
for longer time.

However, the code forgot that "len" may be sufficiently long.  If an
object is uniquely identifiable with only 4-hexdigit today, and if the
caller gives 7 as len and the guard is set to 3, we returned 10 hexdigits,
which was 3 characters longer than necessary.  We should instead return 7
hexdigits in such a case, as that is in line with the original intention
of using 3 extra hexdigits to give the disambiguation we find today a
better chance to survive.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 sha1_name.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 709ff2e..0f81581 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -193,6 +193,23 @@ static int get_short_sha1(const char *name, int len, unsigned char *sha1,
 	return status;
 }

+/*
+ * The caller wants a unique abbreviation of the full object name in
+ * "sha1" that is at least "len" characters long.
+ *
+ * (1) If sha1 identifies an existing object, there must be no other
+ *     object that shares the returned string as the prefix of its
+ *     name;
+ *
+ * (2) If no object with the given object name exists, there must be
+ *     no object that has the returned string as the prefix of its
+ *     name.
+ *
+ * Usually we try to return as short a string as possible, but the
+ * core.abbrevguard configuration may tell us to use at least that
+ * many characters more than necessary to make the result unique,
+ * in order to keep it unique a bit longer.
+ */
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
 {
 	int status, exists;
@@ -202,6 +219,13 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
 	memcpy(hex, sha1_to_hex(sha1), 40);
 	if (len == 40 || !len)
 		return hex;
+	len -= unique_abbrev_extra_length;
+	if (len <= 0)
+		len = 1;
+	/*
+	 * Try to see how short a prefix we can feed to get
+	 * the desired unique hit
+	 */
 	while (len < 40) {
 		unsigned char sha1_ret[20];
 		status = get_short_sha1(hex, len, sha1_ret, 1);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] blame: introduce -u/--unique option
  2011-03-10  6:13     ` Namhyung Kim
@ 2011-03-10  9:43       ` Junio C Hamano
  2011-03-10 13:58         ` Namhyung Kim
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2011-03-10  9:43 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: git

Namhyung Kim <namhyung@gmail.com> writes:

> Hmm... I thought about that too. But you mean that you only want
> --abbrev instead of -u, right?

Yeah, if it can be shown that ensuring the uniqueness in addition to
uniform length with negligible cost, I don't think there is any reason to
have this as an option.  So perhaps the minimum change would be to keep
the current "8" as a hardwired constant to feed find-unique-abbrev with to
find the longest unique prefix as your patch did and use that length when
showing the output.  Making "8" configurable would become a separate
topic.

> My original intention was if user specified --abbrev explicitly, it
> should be honored regardless of the uniqueness. The guard will not be
> used in this situation because the user gave the exact length [s]he
> wants to see.

Hmm, like giving a ridiculously short --abbrev and forcing the output to
be ambiguous yet hopefully they can be differenciated by other clues like
author and date?

The reason I originally suggested to make the uniqueness an optional
feature was primarily output performance, but that is linear with the
number of lines in the file and is dwarfed by the history traversal cost,
so it is a non-issue, so I then suggested that it doesn't need to be an
option.  But I think you may have a point--an option to decline uniqueness
would also make sense, especially when you want a very short prefix that
you know won't be unique and when you don't care about uniqueness.

As all the normal commands that use --abbrev would call f-u-a to ensure
uniqueness, the current "cut at fixed length" behaviour is an oddball.  It
should become an optional behaviour.

The current behaviour, when we eventually have both options, would be:

	$ git blame --abbrev=8 --no-uniq $rev -- $path

and should be the default when neither --abbrev=$n nor --no-uniq is given
for backward compatibility.  Porcelains are _not_ supposed to be reading
from blame output without giving it --porcelain option, but we know we
cannot expect sanity from people who script around git.

When you want to give a very short width that wouldn't guarantee
uniqueness, and you want to force that width, you would say:

	$ git blame --abbrev=4 --no-uniq $rev -- $path

Without --no-uniq, 

	$ git blame --abbrev=4 $rev -- $path

the command may use more hexdigits than 4 to ensure uniqueness, but all
commit names are shown with the same width to ensure alignment as well.

I kind of like that.  While I don't think "--no-uniq" is the best name for
that option, I think --abbrev everywhere else implies uniqueness so it is
not a good idea to require "-u" option for unique output and give
ambiguous output without one.  It would hurt consistency in the UI.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re* [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified
  2011-03-10  9:19     ` Re* " Junio C Hamano
@ 2011-03-10 11:52       ` Namhyung Kim
  2011-03-10 11:54         ` Namhyung Kim
  2011-03-10 19:11         ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Namhyung Kim @ 2011-03-10 11:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

2011-03-10 (목), 01:19 -0800, Junio C Hamano:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > I think what may be desirable is to honor the caller-supplied "len" a bit
> > better.  If an object is uniquely identifiable with only 4-hexdigit today,
> > and if the caller gives 7 as len and the guard is set to 3, we return 10
> > hexdigits with the current code.  We should instead return 7 hexdigits in
> > such a case, as that is in line with the "use 3 extra to give the
> > disambiguation we find today a better chance to survive".
> 
> And here is an attempt to do so.  I have to admit that I didn't give it
> too much thought, though, so please be careful when reviewing the logic.
> 

What if the unique length is greater than or equal to the given length?
For instance the unique length is 7 and the caller gives 7 and the guard
is 3. What do you want to return, 7 or 10? How about the unique length
of 8?

I think the meaning of the guard is somewhat vague. When this feature
was considered in LKML at first, Linus just wanted to change the default
length of commit abbreviation to 12 by making it user-configurable. [1]
And this is the same situation what I tried to tell you in the previous
email.

So I think it would be better to choose the output length using only
caller-given length and the guard length if it guarantees the uniqueness
today. It'll be simple, consistent and expected behavior IMHO.

Thanks.


> -- >8 --
> Subject: find_unique_abbrev(): honor caller-supplied "len" better
> 
> The caller of this function wants to ensure that the returned string is a
> unique abbreviation of the object name, and at least len characters long.
> When "len" is sufficiently short and we cannot ensure uniqueness with only
> "len" bytes, we returned minimally unique prefix (i.e. if you dropped the
> last character, there would be two or more objects that share that same
> prefix as their names in the repository).
> 
> An earlier change introduced core.abbrevguard configuration with a
> realization that a short prefix that is unique today may not stay unique
> forever, as new objects are added to the repository. When "len" is shorter
> than the length necessary to ensure uniqueness today, instead of returning
> a string that is only one character longer than the longest ambiguous
> prefix, we wanted to add that many extra characters to ensure uniqueness
> for longer time.
> 
> However, the code forgot that "len" may be sufficiently long.  If an
> object is uniquely identifiable with only 4-hexdigit today, and if the
> caller gives 7 as len and the guard is set to 3, we returned 10 hexdigits,
> which was 3 characters longer than necessary.  We should instead return 7
> hexdigits in such a case, as that is in line with the original intention
> of using 3 extra hexdigits to give the disambiguation we find today a
> better chance to survive.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  sha1_name.c |   24 ++++++++++++++++++++++++
>  1 files changed, 24 insertions(+), 0 deletions(-)
> 
> diff --git a/sha1_name.c b/sha1_name.c
> index 709ff2e..0f81581 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -193,6 +193,23 @@ static int get_short_sha1(const char *name, int len, unsigned char *sha1,
>  	return status;
>  }
>  
> +/*
> + * The caller wants a unique abbreviation of the full object name in
> + * "sha1" that is at least "len" characters long.
> + *
> + * (1) If sha1 identifies an existing object, there must be no other
> + *     object that shares the returned string as the prefix of its
> + *     name;
> + *
> + * (2) If no object with the given object name exists, there must be
> + *     no object that has the returned string as the prefix of its
> + *     name.
> + *
> + * Usually we try to return as short a string as possible, but the
> + * core.abbrevguard configuration may tell us to use at least that
> + * many characters more than necessary to make the result unique,
> + * in order to keep it unique a bit longer.
> + */
>  const char *find_unique_abbrev(const unsigned char *sha1, int len)
>  {
>  	int status, exists;
> @@ -202,6 +219,13 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
>  	memcpy(hex, sha1_to_hex(sha1), 40);
>  	if (len == 40 || !len)
>  		return hex;
> +	len -= unique_abbrev_extra_length;
> +	if (len <= 0)
> +		len = 1;

Anyway, how about

	  if (len < MINIMUM_ABBREV)
		  len = MINIMUM_ABBREV;


> +	/*
> +	 * Try to see how short a prefix we can feed to get
> +	 * the desired unique hit
> +	 */
>  	while (len < 40) {
>  		unsigned char sha1_ret[20];
>  		status = get_short_sha1(hex, len, sha1_ret, 1);


-- 
Regards,
Namhyung Kim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re* [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified
  2011-03-10 11:52       ` Namhyung Kim
@ 2011-03-10 11:54         ` Namhyung Kim
  2011-03-10 19:11         ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Namhyung Kim @ 2011-03-10 11:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

2011-03-10 (목), 20:52 +0900, Namhyung Kim:
> 2011-03-10 (목), 01:19 -0800, Junio C Hamano:
> > And here is an attempt to do so.  I have to admit that I didn't give it
> > too much thought, though, so please be careful when reviewing the logic.
> > 
> 
> What if the unique length is greater than or equal to the given length?
> For instance the unique length is 7 and the caller gives 7 and the guard
> is 3. What do you want to return, 7 or 10? How about the unique length
> of 8?
> 
> I think the meaning of the guard is somewhat vague. When this feature
> was considered in LKML at first, Linus just wanted to change the default
> length of commit abbreviation to 12 by making it user-configurable. [1]
> And this is the same situation what I tried to tell you in the previous
> email.
> 

Oh, I missed the link. :)

[1] https://lkml.org/lkml/2010/10/28/264


-- 
Regards,
Namhyung Kim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] blame: introduce -u/--unique option
  2011-03-10  9:43       ` Junio C Hamano
@ 2011-03-10 13:58         ` Namhyung Kim
  0 siblings, 0 replies; 13+ messages in thread
From: Namhyung Kim @ 2011-03-10 13:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

2011-03-10 (목), 01:43 -0800, Junio C Hamano:
> Namhyung Kim <namhyung@gmail.com> writes:
> 
> > Hmm... I thought about that too. But you mean that you only want
> > --abbrev instead of -u, right?
> 
> Yeah, if it can be shown that ensuring the uniqueness in addition to
> uniform length with negligible cost, I don't think there is any reason to
> have this as an option.  So perhaps the minimum change would be to keep
> the current "8" as a hardwired constant to feed find-unique-abbrev with to
> find the longest unique prefix as your patch did and use that length when
> showing the output.  Making "8" configurable would become a separate
> topic.
> 

OK. But I'm still wonder why the value is "8" instead of
"7" (DEFAULT_ABBREV).


> > My original intention was if user specified --abbrev explicitly, it
> > should be honored regardless of the uniqueness. The guard will not be
> > used in this situation because the user gave the exact length [s]he
> > wants to see.
> 
> Hmm, like giving a ridiculously short --abbrev and forcing the output to
> be ambiguous yet hopefully they can be differenciated by other clues like
> author and date?
> 
> The reason I originally suggested to make the uniqueness an optional
> feature was primarily output performance, but that is linear with the
> number of lines in the file and is dwarfed by the history traversal cost,
> so it is a non-issue, so I then suggested that it doesn't need to be an
> option.  But I think you may have a point--an option to decline uniqueness
> would also make sense, especially when you want a very short prefix that
> you know won't be unique and when you don't care about uniqueness.
> 
> As all the normal commands that use --abbrev would call f-u-a to ensure
> uniqueness, the current "cut at fixed length" behaviour is an oddball.  It
> should become an optional behaviour.
> 
> The current behaviour, when we eventually have both options, would be:
> 
> 	$ git blame --abbrev=8 --no-uniq $rev -- $path
> 
> and should be the default when neither --abbrev=$n nor --no-uniq is given
> for backward compatibility.  Porcelains are _not_ supposed to be reading
> from blame output without giving it --porcelain option, but we know we
> cannot expect sanity from people who script around git.
> 
> When you want to give a very short width that wouldn't guarantee
> uniqueness, and you want to force that width, you would say:
> 
> 	$ git blame --abbrev=4 --no-uniq $rev -- $path
> 
> Without --no-uniq, 
> 
> 	$ git blame --abbrev=4 $rev -- $path
> 
> the command may use more hexdigits than 4 to ensure uniqueness, but all
> commit names are shown with the same width to ensure alignment as well.
> 

Right. But I'm not sure what the end result should be if the guard is
set and/or --no-uniq is given. It might be related to the patch 1/2.

Say the unique length is 4 and the guard is 3. What do you expect from
the following command?

	$ git blame --abbrev=5 -- $path
	$ git blame --abbrev=5 --no-uniq -- $path

I think it should be 8 and 5. But possible alternatives are (7, 5),
(5, 5), (8, 8) and (7, 8).

Let me think of another example. In this case the unique length is 8,
the guard is 4 and the commands are same.

I think it should be (9, 5). And corresponding alternatives are (12, 5),
(8, 5), (12, 9) and (12, 9).


My logic can be described like:

	default output 	= max((c + g), u)
	no-uniq output 	= c

where 'l' is the caller-given length, 'g' is the guard length, and 'u'
is the unique length.

Alternatives are:

(a)	default output = max((u + g), c)
	no-uniq output = c

(b)	default output = max(c, u)
	no-uniq output = c

(c)	default output = (c > u) ? (c + g) : (u + g)
	no-uniq output = c + g

(d)	default output = max((u + g), c)
	no-uniq output = c + g


Maybe we can make more combination if you want. What do you think?



> I kind of like that.  While I don't think "--no-uniq" is the best name for
> that option, I think --abbrev everywhere else implies uniqueness so it is
> not a good idea to require "-u" option for unique output and give
> ambiguous output without one.  It would hurt consistency in the UI.
> 

How about "--fixed" ?

Thanks.


-- 
Regards,
Namhyung Kim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re* [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified
  2011-03-10 11:52       ` Namhyung Kim
  2011-03-10 11:54         ` Namhyung Kim
@ 2011-03-10 19:11         ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2011-03-10 19:11 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: git

Namhyung Kim <namhyung@gmail.com> writes:

> 2011-03-10 (목), 01:19 -0800, Junio C Hamano:
>> Junio C Hamano <gitster@pobox.com> writes:
>> 
>> > I think what may be desirable is to honor the caller-supplied "len" a bit
>> > better.  If an object is uniquely identifiable with only 4-hexdigit today,
>> > and if the caller gives 7 as len and the guard is set to 3, we return 10
>> > hexdigits with the current code.  We should instead return 7 hexdigits in
>> > such a case, as that is in line with the "use 3 extra to give the
>> > disambiguation we find today a better chance to survive".
>> 
>> And here is an attempt to do so.  I have to admit that I didn't give it
>> too much thought, though, so please be careful when reviewing the logic.
>> 
>
> What if the unique length is greater than or equal to the given length?
> For instance the unique length is 7 and the caller gives 7 and the guard
> is 3. What do you want to return, 7 or 10? How about the unique length
> of 8?

Perhaps I didn't state it clearly enough.

Conceptually, f-u-a works in two stages.  First find out what the shortest
prefix that can be used to uniquely identify the given object _now_.
Then make sure that the result it returns is at least "len" characters
long.  So if you code it naively, it would have looked like this:

	for (try = 0; try < 40; try++)
		if ("do first try characters name an object uniquely?")
			break; /* happy */
	if (try < len)
		return the first "len" characters;
	else
        	return the first "try" characters;

The original (before the abbrevguard patch) uses an obvious optimization
to start the trial from "len" because it didn't matter if a prefix shorter
than "len" is already unique---it will return "len" anyway.

The "guard" change was buggy.  At the conceptual level, it should have
changed the first phase of f-u-a, by dropping the "_now_" part. Given that
we will be adding more and more objects, only one extra character to make
things unique is not sufficient and having "guard" extra characters would
keep the result unique hopefully a little longer.  It should have done
this:

	for (try = 0; try < 40; try++)
		if ("do first try characters name an object uniquely?")
			break; /* happy */
       +/* try has only one extra character to make it unique */
       +if (guard)
       +	try += (guard - 1); /* could be try += guard */
	if (try < len)
		return the first "len" characters;
	else
        	return the first "try" characters;

Back to your original question, if the object needs 7 characters (iow, its
6-character prefix is shared with another object) to make it unique, and
if the caller gives you 7 as length, without guard, you would get 7
because you are saying 1 extra character is sufficiently unique by not
setting the guard.  If you set the guard to 3, you are saying that 1 extra
character is not sufficient to guard it against future collisions and you
want 3 extra instead, so you should be getting either 9 (if you define
that "guard" overrides the default "1 more characters to make it minimally
unique") or 10 (if you define that "guard" specifies how many extra
characters to use to make it safer).

If the object needs 8 characters to make it minimally unique and you give
7 as "len", you should never get a 7-character string back.  f-u-a is not
"truncate to 'len'" function but its primary job is to ensure uniqueness.
Without guard, you should get 8 (minimally unique), and with guard set to
3, you should get either 10 or 11 (again, depending on the definition of
"guard").

> I think the meaning of the guard is somewhat vague. When this feature
> was considered in LKML at first, Linus just wanted to change the default
> length of commit abbreviation to 12 by making it user-configurable. [1]

Linus's "Let's make the default 12 for our project by adding a way to
customize it" merely states "right now we know 7 is not enough and 12 is
plenty for some time, and likely to be enough to give the same length".
It does not mean he suddenly start favoring uniform length over
uniqueness.  Uniqueness requirement is a given.

And that discussion is all about what "len" to be given when calling
f-u-a; it doesn't have anything to do with what the guard is about.

Setting "len" to 12 would make sure your output you get today would have
at least 12 characters, but it doesn't have any effect on how long an
abbreviated unique object name will stay unique in the future.

When you give the name of an object you have today to somebody else by
shortening it with f-u-a, you can ensure that the object name has a better
chance of staying unique a bit longer by setting the guard to a non-zero
value even if you use shorter "len".

The "len" and "guard" are different concepts, have different purposes, and
implemented as two different variables, of course.

> So I think it would be better to choose the output length using only
> caller-given length and the guard length if it guarantees the uniqueness
> today. It'll be simple, consistent and expected behavior IMHO.

Sorry, but I see nothing consistent nor expected about what you are
saying. Perhaps it is "expected" if you don't understand what the guard is
about. It is about the future viability of the output you produce today.

If you don't care about uniqueness tomorrow, you won't even need any
guard.  You just give an interface to specify "len" and be done with it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-03-10 19:12 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-08 10:59 [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified Namhyung Kim
2011-03-08 10:59 ` [PATCH 2/2] blame: introduce -u/--unique option Namhyung Kim
2011-03-09 19:45   ` Junio C Hamano
2011-03-10  6:13     ` Namhyung Kim
2011-03-10  9:43       ` Junio C Hamano
2011-03-10 13:58         ` Namhyung Kim
2011-03-09 19:12 ` [PATCH 1/2] sha1_name: try to use same abbrev length when core.abbrevguard is specified Junio C Hamano
2011-03-09 19:20   ` Junio C Hamano
2011-03-10  9:19     ` Re* " Junio C Hamano
2011-03-10 11:52       ` Namhyung Kim
2011-03-10 11:54         ` Namhyung Kim
2011-03-10 19:11         ` Junio C Hamano
2011-03-10  5:27   ` Namhyung Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).