[PATCH] abbrev: allow extending beyond 20 chars to disambiguate

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] abbrev: allow extending beyond 20 chars to disambiguate
@ 2025-08-11 15:26 Junio C Hamano
  2025-08-11 18:53 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Junio C Hamano @ 2025-08-11 15:26 UTC (permalink / raw)
  To: git; +Cc: Jon Forrest, Derrick Stolee

When you have two or more objects with object names that share more
than half the length of the hash algorithm in use (e.g. 10 bytes for
SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
fails to show disambiguation.

To see how many leading letters of a given full object name is
sufficiently unambiguous, the algorithm starts from a initial
length, guessed based on the estimated number of objects in the
repository, and see if another object that shares the prefix, and
keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
which is counted as the number of bytes, since 5b20ace6 (sha1_name:
unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
change, it extended up to GIT_MAX_HEXSZ, which is the correct limit
because the loop is adding one output letter per iteration.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * No tests added, since I do not think I want to find two valid
   objects with their object names sharing the same prefix that is
   more than 20 letters long.  The current abbreviation code happens
   to ignore validity of the object and takes invalid objects into
   account when disambiguating, but I do not want to see a test rely
   on that.

   Git 2.15 (that predates 5b20ace6 which is in Git 2.16) does the
   right thing, even though it is a bit too old codebase to build
   these days (I had to omit curl and pcre, as I didn't need to get
   them working again only to see how its disambiguation code works)
   with the up-to-date libraries.

 object-name.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git c/object-name.c w/object-name.c
index 11aa0e6afc..13e8a4e47d 100644
--- c/object-name.c
+++ w/object-name.c
@@ -704,7 +704,7 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
 		i++;

-	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+	if (i < GIT_MAX_HEXSZ && i >= mad->cur_len)
 		mad->cur_len = i + 1;

 	return 0;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 15:26 [PATCH] abbrev: allow extending beyond 20 chars to disambiguate Junio C Hamano
@ 2025-08-11 18:53 ` Junio C Hamano
  2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano
  2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson
  2 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2025-08-11 18:53 UTC (permalink / raw)
  To: git; +Cc: Jon Forrest, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
> which is counted as the number of bytes, since 5b20ace6 (sha1_name:
> unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
> change, it extended up to GIT_MAX_HEXSZ, which is the correct limit
> because the loop is adding one output letter per iteration.

This is half a truth.  It is correct that the loop used to terminate
at GIT_SHA1_HEXSZ, and replacing it with GIT_MAX_HEXSZ is wrong, as
MAX_HEXSZ can be much larger than the max hexsz for the hash function
in use in the repository.

I'll be sending a reworked version that takes the current hash
function into account.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 15:26 [PATCH] abbrev: allow extending beyond 20 chars to disambiguate Junio C Hamano
  2025-08-11 18:53 ` Junio C Hamano
@ 2025-08-11 19:06 ` Junio C Hamano
  2025-08-11 22:23   ` brian m. carlson
                     ` (3 more replies)
  2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson
  2 siblings, 4 replies; 14+ messages in thread
From: Junio C Hamano @ 2025-08-11 19:06 UTC (permalink / raw)
  To: git; +Cc: Jon Forrest, Derrick Stolee

When you have two or more objects with object names that share more
than half the length of the hash algorithm in use (e.g. 10 bytes for
SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
fails to show disambiguation.

To see how many leading letters of a given full object name is
sufficiently unambiguous, the algorithm starts from a initial
length, guessed based on the estimated number of objects in the
repository, and see if another object that shares the prefix, and
keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
which is counted as the number of bytes, since 5b20ace6 (sha1_name:
unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
change, it extended up to GIT_SHA1_HEXSZ, which was the correct
limit because the loop is adding one output letter per iteration
and back then SHA256 was not in the picture.

Pass the max length of the hash being in use in the current
repository down the code path, and use it to compute the code to
update the abbreviation length required to make it unique.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 object-name.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/object-name.c b/object-name.c
index 11aa0e6afc..8f9af57c0a 100644
--- a/object-name.c
+++ b/object-name.c
@@ -680,6 +680,7 @@ static unsigned msb(unsigned long val)
 struct min_abbrev_data {
 	unsigned int init_len;
 	unsigned int cur_len;
+	unsigned int max_len;
 	char *hex;
 	struct repository *repo;
 	const struct object_id *oid;
@@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid,
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
 	struct min_abbrev_data *mad = cb_data;
-
 	unsigned int i = mad->init_len;
+
 	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
 		i++;
 
-	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+	if (mad->cur_len <= i && i < mad->max_len)
 		mad->cur_len = i + 1;
 
 	return 0;
@@ -864,6 +865,7 @@ int repo_find_unique_abbrev_r(struct repository *r, char *hex,
 	mad.repo = r;
 	mad.init_len = len;
 	mad.cur_len = len;
+	mad.max_len = hexsz;
 	mad.hex = hex;
 	mad.oid = oid;
 

Range-diff:
1:  2e1d2b4ef6 ! 1:  5c67e57f14 abbrev: allow extending beyond 20 chars to disambiguate
    @@ Commit message
         keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
         which is counted as the number of bytes, since 5b20ace6 (sha1_name:
         unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
    -    change, it extended up to GIT_MAX_HEXSZ, which is the correct limit
    -    because the loop is adding one output letter per iteration.
    +    change, it extended up to GIT_SHA1_HEXSZ, which was the correct
    +    limit because the loop is adding one output letter per iteration and
    +    back then SHA256 was not in the picture.
    +
    +    Pass the max length of the hash being in use in the current
    +    repository down the code path, and use it to compute the code to
    +    update the abbreviation length required to make it unique.
     
         Signed-off-by: Junio C Hamano <gitster@pobox.com>
     
      ## object-name.c ##
    -@@ object-name.c: static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
    +@@ object-name.c: static unsigned msb(unsigned long val)
    + struct min_abbrev_data {
    + 	unsigned int init_len;
    + 	unsigned int cur_len;
    ++	unsigned int max_len;
    + 	char *hex;
    + 	struct repository *repo;
    + 	const struct object_id *oid;
    +@@ object-name.c: static inline char get_hex_char_from_oid(const struct object_id *oid,
    + static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
    + {
    + 	struct min_abbrev_data *mad = cb_data;
    +-
    + 	unsigned int i = mad->init_len;
    ++
      	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
      		i++;
      
     -	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
    -+	if (i < GIT_MAX_HEXSZ && i >= mad->cur_len)
    ++	if (mad->cur_len <= i && i < mad->max_len)
      		mad->cur_len = i + 1;
      
      	return 0;
    +@@ object-name.c: int repo_find_unique_abbrev_r(struct repository *r, char *hex,
    + 	mad.repo = r;
    + 	mad.init_len = len;
    + 	mad.cur_len = len;
    ++	mad.max_len = hexsz;
    + 	mad.hex = hex;
    + 	mad.oid = oid;
    + 
-- 
2.51.0-rc1-144-g869f44a1ca


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano
@ 2025-08-11 22:23   ` brian m. carlson
  2025-08-12 13:28   ` Derrick Stolee
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: brian m. carlson @ 2025-08-11 22:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jon Forrest, Derrick Stolee

[-- Attachment #1: Type: text/plain, Size: 1652 bytes --]

On 2025-08-11 at 19:06:39, Junio C Hamano wrote:
> diff --git a/object-name.c b/object-name.c
> index 11aa0e6afc..8f9af57c0a 100644
> --- a/object-name.c
> +++ b/object-name.c
> @@ -680,6 +680,7 @@ static unsigned msb(unsigned long val)
>  struct min_abbrev_data {
>  	unsigned int init_len;
>  	unsigned int cur_len;
> +	unsigned int max_len;
>  	char *hex;
>  	struct repository *repo;
>  	const struct object_id *oid;
> @@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid,
>  static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
>  {
>  	struct min_abbrev_data *mad = cb_data;
> -
>  	unsigned int i = mad->init_len;
> +
>  	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
>  		i++;
>  
> -	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
> +	if (mad->cur_len <= i && i < mad->max_len)
>  		mad->cur_len = i + 1;
>  
>  	return 0;
> @@ -864,6 +865,7 @@ int repo_find_unique_abbrev_r(struct repository *r, char *hex,
>  	mad.repo = r;
>  	mad.init_len = len;
>  	mad.cur_len = len;
> +	mad.max_len = hexsz;
>  	mad.hex = hex;
>  	mad.oid = oid;

This definitely looks more sensible, since we're using the algorithm
specified in the passed in `oid` variable in
`repo_find_unique_abbrev_r` to determine the length.

I will admit that despite having touched this code recently in my
SHA-1/SHA-256 interoperability work, I'm definitely not an expert in
this area, so while I don't see anything that stands out to me as wrong,
you probably will want someone else to verify here.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano
  2025-08-11 22:23   ` brian m. carlson
@ 2025-08-12 13:28   ` Derrick Stolee
  2025-08-12 14:58   ` René Scharfe
  2025-08-14 15:09   ` [PATCH v3] abbrev: allow extending beyond 32 " Junio C Hamano
  3 siblings, 0 replies; 14+ messages in thread
From: Derrick Stolee @ 2025-08-12 13:28 UTC (permalink / raw)
  To: Junio C Hamano, git; +Cc: Jon Forrest

On 8/11/2025 3:06 PM, Junio C Hamano wrote:
> When you have two or more objects with object names that share more
> than half the length of the hash algorithm in use (e.g. 10 bytes for
> SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
> fails to show disambiguation.
> 
> To see how many leading letters of a given full object name is
> sufficiently unambiguous, the algorithm starts from a initial
> length, guessed based on the estimated number of objects in the
> repository, and see if another object that shares the prefix, and
> keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
> which is counted as the number of bytes, since 5b20ace6 (sha1_name:
> unroll len loop in find_unique_abbrev_r(), 2017-10-08);

Wow. My first Git contribution.

Some style issues that you're opportunistically cleaning up are
due to my newness, for sure.

> before that
> change, it extended up to GIT_SHA1_HEXSZ, which was the correct
> limit because the loop is adding one output letter per iteration
> and back then SHA256 was not in the picture.
> 
> Pass the max length of the hash being in use in the current
> repository down the code path, and use it to compute the code to
> update the abbreviation length required to make it unique.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  object-name.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/object-name.c b/object-name.c
> index 11aa0e6afc..8f9af57c0a 100644
> --- a/object-name.c
> +++ b/object-name.c
> @@ -680,6 +680,7 @@ static unsigned msb(unsigned long val)
>  struct min_abbrev_data {
>  	unsigned int init_len;
>  	unsigned int cur_len;
> +	unsigned int max_len;
>  	char *hex;
>  	struct repository *repo;
>  	const struct object_id *oid;
> @@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid,
>  static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
>  {
>  	struct min_abbrev_data *mad = cb_data;
> -
>  	unsigned int i = mad->init_len;
> +
>  	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
>  		i++;
>  
> -	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
> +	if (mad->cur_len <= i && i < mad->max_len)
>  		mad->cur_len = i + 1;

This logic is all about not extending the abbreviation
length to beyond the length of the hex array, so your
limits make sense. Moving the comparisons here helps
with readability.  
>  	return 0;
> @@ -864,6 +865,7 @@ int repo_find_unique_abbrev_r(struct repository *r, char *hex,
>  	mad.repo = r;
>  	mad.init_len = len;
>  	mad.cur_len = len;
> +	mad.max_len = hexsz;

This new parameter required for allowing SHA256 is valuable.

I agree that we shouldn't need a test case to guarantee this
in the future. Good to clean up unnecessary uses of the
macro limits.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano
  2025-08-11 22:23   ` brian m. carlson
  2025-08-12 13:28   ` Derrick Stolee
@ 2025-08-12 14:58   ` René Scharfe
  2025-08-12 15:17     ` Junio C Hamano
  2025-08-14 15:09   ` [PATCH v3] abbrev: allow extending beyond 32 " Junio C Hamano
  3 siblings, 1 reply; 14+ messages in thread
From: René Scharfe @ 2025-08-12 14:58 UTC (permalink / raw)
  To: Junio C Hamano, git; +Cc: Jon Forrest, Derrick Stolee

On 8/11/25 9:06 PM, Junio C Hamano wrote:
> When you have two or more objects with object names that share more
> than half the length of the hash algorithm in use (e.g. 10 bytes for
> SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
> fails to show disambiguation.
> 
> To see how many leading letters of a given full object name is
> sufficiently unambiguous, the algorithm starts from a initial
> length, guessed based on the estimated number of objects in the
> repository, and see if another object that shares the prefix, and
> keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
> which is counted as the number of bytes, since 5b20ace6 (sha1_name:
> unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
> change, it extended up to GIT_SHA1_HEXSZ, which was the correct
> limit because the loop is adding one output letter per iteration
> and back then SHA256 was not in the picture.
> 
> Pass the max length of the hash being in use in the current
> repository down the code path, and use it to compute the code to
> update the abbreviation length required to make it unique.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  object-name.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/object-name.c b/object-name.c
> index 11aa0e6afc..8f9af57c0a 100644
> --- a/object-name.c
> +++ b/object-name.c
> @@ -680,6 +680,7 @@ static unsigned msb(unsigned long val)
>  struct min_abbrev_data {
>  	unsigned int init_len;
>  	unsigned int cur_len;
> +	unsigned int max_len;
>  	char *hex;
>  	struct repository *repo;
>  	const struct object_id *oid;
> @@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid,
>  static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
>  {
>  	struct min_abbrev_data *mad = cb_data;
> -
>  	unsigned int i = mad->init_len;
> +
>  	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
>  		i++;
>  
> -	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
> +	if (mad->cur_len <= i && i < mad->max_len)
>  		mad->cur_len = i + 1;

This combines two checks: Whether we can increment and whether the new
length is greater than the old one.  Only if both are true we take the
new length.  Shouldn't they be separate, though?  Why reject a new
length that happens to be the maximum?  And max_len is not explicitly
needed for the first check:

	/* One more to disambiguate, if possible. */
	if (mad->hex[i])
		i++;

	/* New record? */
	if (i > mad->cur_len)
		mad->cur_len = i;

René


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-12 14:58   ` René Scharfe
@ 2025-08-12 15:17     ` Junio C Hamano
  2025-08-12 15:59       ` René Scharfe
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2025-08-12 15:17 UTC (permalink / raw)
  To: René Scharfe; +Cc: git, Jon Forrest, Derrick Stolee

René Scharfe <l.s.r@web.de> writes:

>>  	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
>>  		i++;
>>  
>> -	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
>> +	if (mad->cur_len <= i && i < mad->max_len)
>>  		mad->cur_len = i + 1;
>
> This combines two checks: Whether we can increment and whether the new
> length is greater than the old one.  Only if both are true we take the
> new length.  Shouldn't they be separate, though?  Why reject a new
> length that happens to be the maximum?  And max_len is not explicitly
> needed for the first check:
>
> 	/* One more to disambiguate, if possible. */
> 	if (mad->hex[i])
> 		i++;
>
> 	/* New record? */
> 	if (i > mad->cur_len)
> 		mad->cur_len = i;
>
> René

Great.

Your observation resolves my puzzlement about the first while() loop
that has been bugging me ever since I started looking at this code.
The mad->hex[] array is NUL terminated, and the loop can terminate
correctly without being told about hexsz at all, and we ought to be
able to use the same information to make sure we stop incrementing
the .cur_len member without running beyond the end of the string.

In other words, wouldn't this be what we want, without any of the
max_len crap?

diff --git c/object-name.c w/object-name.c
index 11aa0e6afc..4cd1d38778 100644
--- c/object-name.c
+++ w/object-name.c
@@ -704,7 +704,7 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
 		i++;
 
-	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+	if (mad->hex[i] && i >= mad->cur_len)
 		mad->cur_len = i + 1;
 
 	return 0;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-12 15:17     ` Junio C Hamano
@ 2025-08-12 15:59       ` René Scharfe
  0 siblings, 0 replies; 14+ messages in thread
From: René Scharfe @ 2025-08-12 15:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jon Forrest, Derrick Stolee



On 8/12/25 5:17 PM, Junio C Hamano wrote:
> René Scharfe <l.s.r@web.de> writes:
> 
>>>  	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
>>>  		i++;
>>>  
>>> -	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
>>> +	if (mad->cur_len <= i && i < mad->max_len)
>>>  		mad->cur_len = i + 1;
>>
>> This combines two checks: Whether we can increment and whether the new
>> length is greater than the old one.  Only if both are true we take the
>> new length.  Shouldn't they be separate, though?  Why reject a new
>> length that happens to be the maximum?  And max_len is not explicitly
>> needed for the first check:
>>
>> 	/* One more to disambiguate, if possible. */
>> 	if (mad->hex[i])
>> 		i++;
>>
>> 	/* New record? */
>> 	if (i > mad->cur_len)
>> 		mad->cur_len = i;
>>
>> René
> 
> Great.
> 
> Your observation resolves my puzzlement about the first while() loop
> that has been bugging me ever since I started looking at this code.
> The mad->hex[] array is NUL terminated, and the loop can terminate
> correctly without being told about hexsz at all, and we ought to be
> able to use the same information to make sure we stop incrementing
> the .cur_len member without running beyond the end of the string.
> 
> In other words, wouldn't this be what we want, without any of the
> max_len crap?
> 
> diff --git c/object-name.c w/object-name.c
> index 11aa0e6afc..4cd1d38778 100644
> --- c/object-name.c
> +++ w/object-name.c
> @@ -704,7 +704,7 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
>  	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
>  		i++;
>  
> -	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
> +	if (mad->hex[i] && i >= mad->cur_len)
>  		mad->cur_len = i + 1;
>  
>  	return 0;

This combines the two checks and I don't see why.  Why not update
cur_len when oid and mad->oid are the same (mad->hex[i] == '\0')?  Ah,
to _ignore_ duplicates on purpose, when we have the same object loose
and packed or in multiple packs, right?  We wouldn't want to let that
affect abbreviations, makes sense.

René


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3] abbrev: allow extending beyond 32 chars to disambiguate
  2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano
                     ` (2 preceding siblings ...)
  2025-08-12 14:58   ` René Scharfe
@ 2025-08-14 15:09   ` Junio C Hamano
  3 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2025-08-14 15:09 UTC (permalink / raw)
  To: git; +Cc: Jon Forrest, Derrick Stolee, René Scharfe

When you have two or more objects with object names that share more
than 32 letters in an SHA-1 repository, find_unique_abbrev() fails
to show disambiguation.

To see how many leading letters of a given full object name is
sufficiently unambiguous, the algorithm starts from a initial
length, guessed based on the estimated number of objects in the
repository, and see if another object that shares the prefix, and
keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
which is counted as the number of bytes, since 5b20ace6 (sha1_name:
unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
change, it extended up to GIT_SHA1_HEXSZ, which meant to stop at the
end of hexadecimal SHA-1 object name.

Because the hexadecimal object name passed to the function is
NUL-terminated, and this fact is used to correctly terminate the
loop that scans for the first difference earlier in the function,
use it to make sure we do not increment the .cur_len member beyond
the end of the string.

Noticed-by: Jon Forrest <nobozo@gmail.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * To tie the loose ends, here is what is in 'seen'.  We may want to
   merge it down once 2.51 final gets tagged.

 object-name.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/object-name.c b/object-name.c
index 11aa0e6afc..4cd1d38778 100644
--- a/object-name.c
+++ b/object-name.c
@@ -704,7 +704,7 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
 		i++;

-	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+	if (mad->hex[i] && i >= mad->cur_len)
 		mad->cur_len = i + 1;

 	return 0;
-- 
2.51.0-rc2-158-gf97fc618fa

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 15:26 [PATCH] abbrev: allow extending beyond 20 chars to disambiguate Junio C Hamano
  2025-08-11 18:53 ` Junio C Hamano
  2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano
@ 2025-08-11 21:17 ` brian m. carlson
  2025-08-11 21:25   ` Junio C Hamano
  2025-08-12 15:26   ` Jon Forrest
  2 siblings, 2 replies; 14+ messages in thread
From: brian m. carlson @ 2025-08-11 21:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jon Forrest, Derrick Stolee

[-- Attachment #1: Type: text/plain, Size: 1884 bytes --]

On 2025-08-11 at 15:26:32, Junio C Hamano wrote:
> When you have two or more objects with object names that share more
> than half the length of the hash algorithm in use (e.g. 10 bytes for
> SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
> fails to show disambiguation.

Is this really the case?  If the restriction is due to using
GIT_MAX_RAWSZ instead of GIT_MAX_HEXSZ, then that's 32 vs. 64 in our
modern codebase.

> To see how many leading letters of a given full object name is
> sufficiently unambiguous, the algorithm starts from a initial
> length, guessed based on the estimated number of objects in the
> repository, and see if another object that shares the prefix, and
> keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
> which is counted as the number of bytes, since 5b20ace6 (sha1_name:
> unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
> change, it extended up to GIT_MAX_HEXSZ, which is the correct limit
> because the loop is adding one output letter per iteration.

Nicely explained.

>  * No tests added, since I do not think I want to find two valid
>    objects with their object names sharing the same prefix that is
>    more than 20 letters long.  The current abbreviation code happens
>    to ignore validity of the object and takes invalid objects into
>    account when disambiguating, but I do not want to see a test rely
>    on that.

Yes, even if we could efficiently create such a collision with SHA-1
using the best known attacks on it, that would still be 2^63.5, which
was estimated to cost about USD 10,000 in 2025.  I don't think doing
that just to produce a test would be a good use of the project's (or
really, anyone else's) funds.  Using SHA-256, of course, would require
at least 2^80 work.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson
@ 2025-08-11 21:25   ` Junio C Hamano
  2025-08-11 21:28     ` Junio C Hamano
  2025-08-12 15:26   ` Jon Forrest
  1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2025-08-11 21:25 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jon Forrest, Derrick Stolee

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2025-08-11 at 15:26:32, Junio C Hamano wrote:
>> When you have two or more objects with object names that share more
>> than half the length of the hash algorithm in use (e.g. 10 bytes for
>> SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
>> fails to show disambiguation.
>
> Is this really the case?  

What I wrote in the above is correct.

> If the restriction is due to using
> GIT_MAX_RAWSZ instead of GIT_MAX_HEXSZ, then that's 32 vs. 64 in our
> modern codebase.

The above numbers are correct but irrelevant ;-).

The thing is, the offending commit changed from 40-bytes
(GIT_SHA1_HEXSZ) to 32-bytes (GIT_MAX_RAWSZ).  Plase see v2 patch.

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 21:25   ` Junio C Hamano
@ 2025-08-11 21:28     ` Junio C Hamano
  0 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2025-08-11 21:28 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jon Forrest, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
>> On 2025-08-11 at 15:26:32, Junio C Hamano wrote:
>>> When you have two or more objects with object names that share more
>>> than half the length of the hash algorithm in use (e.g. 10 bytes for
>>> SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
>>> fails to show disambiguation.
>>
>> Is this really the case?  
>
> What I wrote in the above is correct.

Sorry, I meant "incorrect".  And using GIT_MAX_HEXSZ will break the
abbreviation system rather badly.  It has to be the length of the
hash that is being used.

>
>> If the restriction is due to using
>> GIT_MAX_RAWSZ instead of GIT_MAX_HEXSZ, then that's 32 vs. 64 in our
>> modern codebase.
>
> The above numbers are correct but irrelevant ;-).
>
> The thing is, the offending commit changed from 40-bytes
> (GIT_SHA1_HEXSZ) to 32-bytes (GIT_MAX_RAWSZ).  Plase see v2 patch.
>
> Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson
  2025-08-11 21:25   ` Junio C Hamano
@ 2025-08-12 15:26   ` Jon Forrest
  2025-08-12 16:21     ` René Scharfe
  1 sibling, 1 reply; 14+ messages in thread
From: Jon Forrest @ 2025-08-12 15:26 UTC (permalink / raw)
  To: brian m. carlson, Junio C Hamano, git, Derrick Stolee



On 8/11/25 2:17 PM, brian m. carlson wrote:

>> To see how many leading letters of a given full object name is
>> sufficiently unambiguous, the algorithm starts from a initial
>> length, guessed based on the estimated number of objects in the
>> repository, and see if another object that shares the prefix, and
>> keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
>> which is counted as the number of bytes, since 5b20ace6 (sha1_name:
>> unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
>> change, it extended up to GIT_MAX_HEXSZ, which is the correct limit
>> because the loop is adding one output letter per iteration.

I'm new to all this but the way I did it is much simpler.
What I did was to check all the files in the appropriate
object store directory (e.g. .git/objects/XX, where XX are
the first 2 letters of the object given on the command line.
If any of the filenames in that directory start with the
string given on the command line, minus the first 2 letters,
then that's a match. If more than one filename matches then
that's ambiguous.

What's wrong with this approach?

Jon


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate
  2025-08-12 15:26   ` Jon Forrest
@ 2025-08-12 16:21     ` René Scharfe
  0 siblings, 0 replies; 14+ messages in thread
From: René Scharfe @ 2025-08-12 16:21 UTC (permalink / raw)
  To: Jon Forrest, brian m. carlson, Junio C Hamano, git,
	Derrick Stolee

On 8/12/25 5:26 PM, Jon Forrest wrote:
> 
> 
> On 8/11/25 2:17 PM, brian m. carlson wrote:
> 
>>> To see how many leading letters of a given full object name is
>>> sufficiently unambiguous, the algorithm starts from a initial
>>> length, guessed based on the estimated number of objects in the
>>> repository, and see if another object that shares the prefix, and
>>> keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
>>> which is counted as the number of bytes, since 5b20ace6 (sha1_name:
>>> unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
>>> change, it extended up to GIT_MAX_HEXSZ, which is the correct limit
>>> because the loop is adding one output letter per iteration.
> 
> I'm new to all this but the way I did it is much simpler.
> What I did was to check all the files in the appropriate
> object store directory (e.g. .git/objects/XX, where XX are
> the first 2 letters of the object given on the command line.
> If any of the filenames in that directory start with the
> string given on the command line, minus the first 2 letters,
> then that's a match. If more than one filename matches then
> that's ambiguous.
> 
> What's wrong with this approach?
You also need to check all pack files and alternate object databases
that you might have.

René


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-08-14 15:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-11 15:26 [PATCH] abbrev: allow extending beyond 20 chars to disambiguate Junio C Hamano
2025-08-11 18:53 ` Junio C Hamano
2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano
2025-08-11 22:23   ` brian m. carlson
2025-08-12 13:28   ` Derrick Stolee
2025-08-12 14:58   ` René Scharfe
2025-08-12 15:17     ` Junio C Hamano
2025-08-12 15:59       ` René Scharfe
2025-08-14 15:09   ` [PATCH v3] abbrev: allow extending beyond 32 " Junio C Hamano
2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson
2025-08-11 21:25   ` Junio C Hamano
2025-08-11 21:28     ` Junio C Hamano
2025-08-12 15:26   ` Jon Forrest
2025-08-12 16:21     ` René Scharfe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).