* [PATCH] abbrev: allow extending beyond 20 chars to disambiguate
@ 2025-08-11 15:26 Junio C Hamano
2025-08-11 18:53 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Junio C Hamano @ 2025-08-11 15:26 UTC (permalink / raw)
To: git; +Cc: Jon Forrest, Derrick Stolee
When you have two or more objects with object names that share more
than half the length of the hash algorithm in use (e.g. 10 bytes for
SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
fails to show disambiguation.
To see how many leading letters of a given full object name is
sufficiently unambiguous, the algorithm starts from a initial
length, guessed based on the estimated number of objects in the
repository, and see if another object that shares the prefix, and
keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ,
which is counted as the number of bytes, since 5b20ace6 (sha1_name:
unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that
change, it extended up to GIT_MAX_HEXSZ, which is the correct limit
because the loop is adding one output letter per iteration.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* No tests added, since I do not think I want to find two valid
objects with their object names sharing the same prefix that is
more than 20 letters long. The current abbreviation code happens
to ignore validity of the object and takes invalid objects into
account when disambiguating, but I do not want to see a test rely
on that.
Git 2.15 (that predates 5b20ace6 which is in Git 2.16) does the
right thing, even though it is a bit too old codebase to build
these days (I had to omit curl and pcre, as I didn't need to get
them working again only to see how its disambiguation code works)
with the up-to-date libraries.
object-name.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git c/object-name.c w/object-name.c
index 11aa0e6afc..13e8a4e47d 100644
--- c/object-name.c
+++ w/object-name.c
@@ -704,7 +704,7 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
- if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+ if (i < GIT_MAX_HEXSZ && i >= mad->cur_len)
mad->cur_len = i + 1;
return 0;
^ permalink raw reply related [flat|nested] 14+ messages in thread* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 15:26 [PATCH] abbrev: allow extending beyond 20 chars to disambiguate Junio C Hamano @ 2025-08-11 18:53 ` Junio C Hamano 2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano 2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson 2 siblings, 0 replies; 14+ messages in thread From: Junio C Hamano @ 2025-08-11 18:53 UTC (permalink / raw) To: git; +Cc: Jon Forrest, Derrick Stolee Junio C Hamano <gitster@pobox.com> writes: > keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, > which is counted as the number of bytes, since 5b20ace6 (sha1_name: > unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that > change, it extended up to GIT_MAX_HEXSZ, which is the correct limit > because the loop is adding one output letter per iteration. This is half a truth. It is correct that the loop used to terminate at GIT_SHA1_HEXSZ, and replacing it with GIT_MAX_HEXSZ is wrong, as MAX_HEXSZ can be much larger than the max hexsz for the hash function in use in the repository. I'll be sending a reworked version that takes the current hash function into account. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 15:26 [PATCH] abbrev: allow extending beyond 20 chars to disambiguate Junio C Hamano 2025-08-11 18:53 ` Junio C Hamano @ 2025-08-11 19:06 ` Junio C Hamano 2025-08-11 22:23 ` brian m. carlson ` (3 more replies) 2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson 2 siblings, 4 replies; 14+ messages in thread From: Junio C Hamano @ 2025-08-11 19:06 UTC (permalink / raw) To: git; +Cc: Jon Forrest, Derrick Stolee When you have two or more objects with object names that share more than half the length of the hash algorithm in use (e.g. 10 bytes for SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev() fails to show disambiguation. To see how many leading letters of a given full object name is sufficiently unambiguous, the algorithm starts from a initial length, guessed based on the estimated number of objects in the repository, and see if another object that shares the prefix, and keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, which is counted as the number of bytes, since 5b20ace6 (sha1_name: unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that change, it extended up to GIT_SHA1_HEXSZ, which was the correct limit because the loop is adding one output letter per iteration and back then SHA256 was not in the picture. Pass the max length of the hash being in use in the current repository down the code path, and use it to compute the code to update the abbreviation length required to make it unique. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- object-name.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/object-name.c b/object-name.c index 11aa0e6afc..8f9af57c0a 100644 --- a/object-name.c +++ b/object-name.c @@ -680,6 +680,7 @@ static unsigned msb(unsigned long val) struct min_abbrev_data { unsigned int init_len; unsigned int cur_len; + unsigned int max_len; char *hex; struct repository *repo; const struct object_id *oid; @@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid, static int extend_abbrev_len(const struct object_id *oid, void *cb_data) { struct min_abbrev_data *mad = cb_data; - unsigned int i = mad->init_len; + while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) i++; - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) + if (mad->cur_len <= i && i < mad->max_len) mad->cur_len = i + 1; return 0; @@ -864,6 +865,7 @@ int repo_find_unique_abbrev_r(struct repository *r, char *hex, mad.repo = r; mad.init_len = len; mad.cur_len = len; + mad.max_len = hexsz; mad.hex = hex; mad.oid = oid; Range-diff: 1: 2e1d2b4ef6 ! 1: 5c67e57f14 abbrev: allow extending beyond 20 chars to disambiguate @@ Commit message keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, which is counted as the number of bytes, since 5b20ace6 (sha1_name: unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that - change, it extended up to GIT_MAX_HEXSZ, which is the correct limit - because the loop is adding one output letter per iteration. + change, it extended up to GIT_SHA1_HEXSZ, which was the correct + limit because the loop is adding one output letter per iteration and + back then SHA256 was not in the picture. + + Pass the max length of the hash being in use in the current + repository down the code path, and use it to compute the code to + update the abbreviation length required to make it unique. Signed-off-by: Junio C Hamano <gitster@pobox.com> ## object-name.c ## -@@ object-name.c: static int extend_abbrev_len(const struct object_id *oid, void *cb_data) +@@ object-name.c: static unsigned msb(unsigned long val) + struct min_abbrev_data { + unsigned int init_len; + unsigned int cur_len; ++ unsigned int max_len; + char *hex; + struct repository *repo; + const struct object_id *oid; +@@ object-name.c: static inline char get_hex_char_from_oid(const struct object_id *oid, + static int extend_abbrev_len(const struct object_id *oid, void *cb_data) + { + struct min_abbrev_data *mad = cb_data; +- + unsigned int i = mad->init_len; ++ while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) i++; - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) -+ if (i < GIT_MAX_HEXSZ && i >= mad->cur_len) ++ if (mad->cur_len <= i && i < mad->max_len) mad->cur_len = i + 1; return 0; +@@ object-name.c: int repo_find_unique_abbrev_r(struct repository *r, char *hex, + mad.repo = r; + mad.init_len = len; + mad.cur_len = len; ++ mad.max_len = hexsz; + mad.hex = hex; + mad.oid = oid; + -- 2.51.0-rc1-144-g869f44a1ca ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano @ 2025-08-11 22:23 ` brian m. carlson 2025-08-12 13:28 ` Derrick Stolee ` (2 subsequent siblings) 3 siblings, 0 replies; 14+ messages in thread From: brian m. carlson @ 2025-08-11 22:23 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Jon Forrest, Derrick Stolee [-- Attachment #1: Type: text/plain, Size: 1652 bytes --] On 2025-08-11 at 19:06:39, Junio C Hamano wrote: > diff --git a/object-name.c b/object-name.c > index 11aa0e6afc..8f9af57c0a 100644 > --- a/object-name.c > +++ b/object-name.c > @@ -680,6 +680,7 @@ static unsigned msb(unsigned long val) > struct min_abbrev_data { > unsigned int init_len; > unsigned int cur_len; > + unsigned int max_len; > char *hex; > struct repository *repo; > const struct object_id *oid; > @@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid, > static int extend_abbrev_len(const struct object_id *oid, void *cb_data) > { > struct min_abbrev_data *mad = cb_data; > - > unsigned int i = mad->init_len; > + > while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) > i++; > > - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) > + if (mad->cur_len <= i && i < mad->max_len) > mad->cur_len = i + 1; > > return 0; > @@ -864,6 +865,7 @@ int repo_find_unique_abbrev_r(struct repository *r, char *hex, > mad.repo = r; > mad.init_len = len; > mad.cur_len = len; > + mad.max_len = hexsz; > mad.hex = hex; > mad.oid = oid; This definitely looks more sensible, since we're using the algorithm specified in the passed in `oid` variable in `repo_find_unique_abbrev_r` to determine the length. I will admit that despite having touched this code recently in my SHA-1/SHA-256 interoperability work, I'm definitely not an expert in this area, so while I don't see anything that stands out to me as wrong, you probably will want someone else to verify here. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano 2025-08-11 22:23 ` brian m. carlson @ 2025-08-12 13:28 ` Derrick Stolee 2025-08-12 14:58 ` René Scharfe 2025-08-14 15:09 ` [PATCH v3] abbrev: allow extending beyond 32 " Junio C Hamano 3 siblings, 0 replies; 14+ messages in thread From: Derrick Stolee @ 2025-08-12 13:28 UTC (permalink / raw) To: Junio C Hamano, git; +Cc: Jon Forrest On 8/11/2025 3:06 PM, Junio C Hamano wrote: > When you have two or more objects with object names that share more > than half the length of the hash algorithm in use (e.g. 10 bytes for > SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev() > fails to show disambiguation. > > To see how many leading letters of a given full object name is > sufficiently unambiguous, the algorithm starts from a initial > length, guessed based on the estimated number of objects in the > repository, and see if another object that shares the prefix, and > keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, > which is counted as the number of bytes, since 5b20ace6 (sha1_name: > unroll len loop in find_unique_abbrev_r(), 2017-10-08); Wow. My first Git contribution. Some style issues that you're opportunistically cleaning up are due to my newness, for sure. > before that > change, it extended up to GIT_SHA1_HEXSZ, which was the correct > limit because the loop is adding one output letter per iteration > and back then SHA256 was not in the picture. > > Pass the max length of the hash being in use in the current > repository down the code path, and use it to compute the code to > update the abbreviation length required to make it unique. > > Signed-off-by: Junio C Hamano <gitster@pobox.com> > --- > object-name.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/object-name.c b/object-name.c > index 11aa0e6afc..8f9af57c0a 100644 > --- a/object-name.c > +++ b/object-name.c > @@ -680,6 +680,7 @@ static unsigned msb(unsigned long val) > struct min_abbrev_data { > unsigned int init_len; > unsigned int cur_len; > + unsigned int max_len; > char *hex; > struct repository *repo; > const struct object_id *oid; > @@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid, > static int extend_abbrev_len(const struct object_id *oid, void *cb_data) > { > struct min_abbrev_data *mad = cb_data; > - > unsigned int i = mad->init_len; > + > while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) > i++; > > - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) > + if (mad->cur_len <= i && i < mad->max_len) > mad->cur_len = i + 1; This logic is all about not extending the abbreviation length to beyond the length of the hex array, so your limits make sense. Moving the comparisons here helps with readability. > return 0; > @@ -864,6 +865,7 @@ int repo_find_unique_abbrev_r(struct repository *r, char *hex, > mad.repo = r; > mad.init_len = len; > mad.cur_len = len; > + mad.max_len = hexsz; This new parameter required for allowing SHA256 is valuable. I agree that we shouldn't need a test case to guarantee this in the future. Good to clean up unnecessary uses of the macro limits. Thanks, -Stolee ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano 2025-08-11 22:23 ` brian m. carlson 2025-08-12 13:28 ` Derrick Stolee @ 2025-08-12 14:58 ` René Scharfe 2025-08-12 15:17 ` Junio C Hamano 2025-08-14 15:09 ` [PATCH v3] abbrev: allow extending beyond 32 " Junio C Hamano 3 siblings, 1 reply; 14+ messages in thread From: René Scharfe @ 2025-08-12 14:58 UTC (permalink / raw) To: Junio C Hamano, git; +Cc: Jon Forrest, Derrick Stolee On 8/11/25 9:06 PM, Junio C Hamano wrote: > When you have two or more objects with object names that share more > than half the length of the hash algorithm in use (e.g. 10 bytes for > SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev() > fails to show disambiguation. > > To see how many leading letters of a given full object name is > sufficiently unambiguous, the algorithm starts from a initial > length, guessed based on the estimated number of objects in the > repository, and see if another object that shares the prefix, and > keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, > which is counted as the number of bytes, since 5b20ace6 (sha1_name: > unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that > change, it extended up to GIT_SHA1_HEXSZ, which was the correct > limit because the loop is adding one output letter per iteration > and back then SHA256 was not in the picture. > > Pass the max length of the hash being in use in the current > repository down the code path, and use it to compute the code to > update the abbreviation length required to make it unique. > > Signed-off-by: Junio C Hamano <gitster@pobox.com> > --- > object-name.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/object-name.c b/object-name.c > index 11aa0e6afc..8f9af57c0a 100644 > --- a/object-name.c > +++ b/object-name.c > @@ -680,6 +680,7 @@ static unsigned msb(unsigned long val) > struct min_abbrev_data { > unsigned int init_len; > unsigned int cur_len; > + unsigned int max_len; > char *hex; > struct repository *repo; > const struct object_id *oid; > @@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid, > static int extend_abbrev_len(const struct object_id *oid, void *cb_data) > { > struct min_abbrev_data *mad = cb_data; > - > unsigned int i = mad->init_len; > + > while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) > i++; > > - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) > + if (mad->cur_len <= i && i < mad->max_len) > mad->cur_len = i + 1; This combines two checks: Whether we can increment and whether the new length is greater than the old one. Only if both are true we take the new length. Shouldn't they be separate, though? Why reject a new length that happens to be the maximum? And max_len is not explicitly needed for the first check: /* One more to disambiguate, if possible. */ if (mad->hex[i]) i++; /* New record? */ if (i > mad->cur_len) mad->cur_len = i; René ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-12 14:58 ` René Scharfe @ 2025-08-12 15:17 ` Junio C Hamano 2025-08-12 15:59 ` René Scharfe 0 siblings, 1 reply; 14+ messages in thread From: Junio C Hamano @ 2025-08-12 15:17 UTC (permalink / raw) To: René Scharfe; +Cc: git, Jon Forrest, Derrick Stolee René Scharfe <l.s.r@web.de> writes: >> while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) >> i++; >> >> - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) >> + if (mad->cur_len <= i && i < mad->max_len) >> mad->cur_len = i + 1; > > This combines two checks: Whether we can increment and whether the new > length is greater than the old one. Only if both are true we take the > new length. Shouldn't they be separate, though? Why reject a new > length that happens to be the maximum? And max_len is not explicitly > needed for the first check: > > /* One more to disambiguate, if possible. */ > if (mad->hex[i]) > i++; > > /* New record? */ > if (i > mad->cur_len) > mad->cur_len = i; > > René Great. Your observation resolves my puzzlement about the first while() loop that has been bugging me ever since I started looking at this code. The mad->hex[] array is NUL terminated, and the loop can terminate correctly without being told about hexsz at all, and we ought to be able to use the same information to make sure we stop incrementing the .cur_len member without running beyond the end of the string. In other words, wouldn't this be what we want, without any of the max_len crap? diff --git c/object-name.c w/object-name.c index 11aa0e6afc..4cd1d38778 100644 --- c/object-name.c +++ w/object-name.c @@ -704,7 +704,7 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data) while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) i++; - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) + if (mad->hex[i] && i >= mad->cur_len) mad->cur_len = i + 1; return 0; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-12 15:17 ` Junio C Hamano @ 2025-08-12 15:59 ` René Scharfe 0 siblings, 0 replies; 14+ messages in thread From: René Scharfe @ 2025-08-12 15:59 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Jon Forrest, Derrick Stolee On 8/12/25 5:17 PM, Junio C Hamano wrote: > René Scharfe <l.s.r@web.de> writes: > >>> while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) >>> i++; >>> >>> - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) >>> + if (mad->cur_len <= i && i < mad->max_len) >>> mad->cur_len = i + 1; >> >> This combines two checks: Whether we can increment and whether the new >> length is greater than the old one. Only if both are true we take the >> new length. Shouldn't they be separate, though? Why reject a new >> length that happens to be the maximum? And max_len is not explicitly >> needed for the first check: >> >> /* One more to disambiguate, if possible. */ >> if (mad->hex[i]) >> i++; >> >> /* New record? */ >> if (i > mad->cur_len) >> mad->cur_len = i; >> >> René > > Great. > > Your observation resolves my puzzlement about the first while() loop > that has been bugging me ever since I started looking at this code. > The mad->hex[] array is NUL terminated, and the loop can terminate > correctly without being told about hexsz at all, and we ought to be > able to use the same information to make sure we stop incrementing > the .cur_len member without running beyond the end of the string. > > In other words, wouldn't this be what we want, without any of the > max_len crap? > > diff --git c/object-name.c w/object-name.c > index 11aa0e6afc..4cd1d38778 100644 > --- c/object-name.c > +++ w/object-name.c > @@ -704,7 +704,7 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data) > while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) > i++; > > - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) > + if (mad->hex[i] && i >= mad->cur_len) > mad->cur_len = i + 1; > > return 0; This combines the two checks and I don't see why. Why not update cur_len when oid and mad->oid are the same (mad->hex[i] == '\0')? Ah, to _ignore_ duplicates on purpose, when we have the same object loose and packed or in multiple packs, right? We wouldn't want to let that affect abbreviations, makes sense. René ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v3] abbrev: allow extending beyond 32 chars to disambiguate 2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano ` (2 preceding siblings ...) 2025-08-12 14:58 ` René Scharfe @ 2025-08-14 15:09 ` Junio C Hamano 3 siblings, 0 replies; 14+ messages in thread From: Junio C Hamano @ 2025-08-14 15:09 UTC (permalink / raw) To: git; +Cc: Jon Forrest, Derrick Stolee, René Scharfe When you have two or more objects with object names that share more than 32 letters in an SHA-1 repository, find_unique_abbrev() fails to show disambiguation. To see how many leading letters of a given full object name is sufficiently unambiguous, the algorithm starts from a initial length, guessed based on the estimated number of objects in the repository, and see if another object that shares the prefix, and keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, which is counted as the number of bytes, since 5b20ace6 (sha1_name: unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that change, it extended up to GIT_SHA1_HEXSZ, which meant to stop at the end of hexadecimal SHA-1 object name. Because the hexadecimal object name passed to the function is NUL-terminated, and this fact is used to correctly terminate the loop that scans for the first difference earlier in the function, use it to make sure we do not increment the .cur_len member beyond the end of the string. Noticed-by: Jon Forrest <nobozo@gmail.com> Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> --- * To tie the loose ends, here is what is in 'seen'. We may want to merge it down once 2.51 final gets tagged. object-name.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/object-name.c b/object-name.c index 11aa0e6afc..4cd1d38778 100644 --- a/object-name.c +++ b/object-name.c @@ -704,7 +704,7 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data) while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i)) i++; - if (i < GIT_MAX_RAWSZ && i >= mad->cur_len) + if (mad->hex[i] && i >= mad->cur_len) mad->cur_len = i + 1; return 0; -- 2.51.0-rc2-158-gf97fc618fa ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 15:26 [PATCH] abbrev: allow extending beyond 20 chars to disambiguate Junio C Hamano 2025-08-11 18:53 ` Junio C Hamano 2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano @ 2025-08-11 21:17 ` brian m. carlson 2025-08-11 21:25 ` Junio C Hamano 2025-08-12 15:26 ` Jon Forrest 2 siblings, 2 replies; 14+ messages in thread From: brian m. carlson @ 2025-08-11 21:17 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Jon Forrest, Derrick Stolee [-- Attachment #1: Type: text/plain, Size: 1884 bytes --] On 2025-08-11 at 15:26:32, Junio C Hamano wrote: > When you have two or more objects with object names that share more > than half the length of the hash algorithm in use (e.g. 10 bytes for > SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev() > fails to show disambiguation. Is this really the case? If the restriction is due to using GIT_MAX_RAWSZ instead of GIT_MAX_HEXSZ, then that's 32 vs. 64 in our modern codebase. > To see how many leading letters of a given full object name is > sufficiently unambiguous, the algorithm starts from a initial > length, guessed based on the estimated number of objects in the > repository, and see if another object that shares the prefix, and > keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, > which is counted as the number of bytes, since 5b20ace6 (sha1_name: > unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that > change, it extended up to GIT_MAX_HEXSZ, which is the correct limit > because the loop is adding one output letter per iteration. Nicely explained. > * No tests added, since I do not think I want to find two valid > objects with their object names sharing the same prefix that is > more than 20 letters long. The current abbreviation code happens > to ignore validity of the object and takes invalid objects into > account when disambiguating, but I do not want to see a test rely > on that. Yes, even if we could efficiently create such a collision with SHA-1 using the best known attacks on it, that would still be 2^63.5, which was estimated to cost about USD 10,000 in 2025. I don't think doing that just to produce a test would be a good use of the project's (or really, anyone else's) funds. Using SHA-256, of course, would require at least 2^80 work. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson @ 2025-08-11 21:25 ` Junio C Hamano 2025-08-11 21:28 ` Junio C Hamano 2025-08-12 15:26 ` Jon Forrest 1 sibling, 1 reply; 14+ messages in thread From: Junio C Hamano @ 2025-08-11 21:25 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Jon Forrest, Derrick Stolee "brian m. carlson" <sandals@crustytoothpaste.net> writes: > On 2025-08-11 at 15:26:32, Junio C Hamano wrote: >> When you have two or more objects with object names that share more >> than half the length of the hash algorithm in use (e.g. 10 bytes for >> SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev() >> fails to show disambiguation. > > Is this really the case? What I wrote in the above is correct. > If the restriction is due to using > GIT_MAX_RAWSZ instead of GIT_MAX_HEXSZ, then that's 32 vs. 64 in our > modern codebase. The above numbers are correct but irrelevant ;-). The thing is, the offending commit changed from 40-bytes (GIT_SHA1_HEXSZ) to 32-bytes (GIT_MAX_RAWSZ). Plase see v2 patch. Thanks. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 21:25 ` Junio C Hamano @ 2025-08-11 21:28 ` Junio C Hamano 0 siblings, 0 replies; 14+ messages in thread From: Junio C Hamano @ 2025-08-11 21:28 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Jon Forrest, Derrick Stolee Junio C Hamano <gitster@pobox.com> writes: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > >> On 2025-08-11 at 15:26:32, Junio C Hamano wrote: >>> When you have two or more objects with object names that share more >>> than half the length of the hash algorithm in use (e.g. 10 bytes for >>> SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev() >>> fails to show disambiguation. >> >> Is this really the case? > > What I wrote in the above is correct. Sorry, I meant "incorrect". And using GIT_MAX_HEXSZ will break the abbreviation system rather badly. It has to be the length of the hash that is being used. > >> If the restriction is due to using >> GIT_MAX_RAWSZ instead of GIT_MAX_HEXSZ, then that's 32 vs. 64 in our >> modern codebase. > > The above numbers are correct but irrelevant ;-). > > The thing is, the offending commit changed from 40-bytes > (GIT_SHA1_HEXSZ) to 32-bytes (GIT_MAX_RAWSZ). Plase see v2 patch. > > Thanks. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson 2025-08-11 21:25 ` Junio C Hamano @ 2025-08-12 15:26 ` Jon Forrest 2025-08-12 16:21 ` René Scharfe 1 sibling, 1 reply; 14+ messages in thread From: Jon Forrest @ 2025-08-12 15:26 UTC (permalink / raw) To: brian m. carlson, Junio C Hamano, git, Derrick Stolee On 8/11/25 2:17 PM, brian m. carlson wrote: >> To see how many leading letters of a given full object name is >> sufficiently unambiguous, the algorithm starts from a initial >> length, guessed based on the estimated number of objects in the >> repository, and see if another object that shares the prefix, and >> keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, >> which is counted as the number of bytes, since 5b20ace6 (sha1_name: >> unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that >> change, it extended up to GIT_MAX_HEXSZ, which is the correct limit >> because the loop is adding one output letter per iteration. I'm new to all this but the way I did it is much simpler. What I did was to check all the files in the appropriate object store directory (e.g. .git/objects/XX, where XX are the first 2 letters of the object given on the command line. If any of the filenames in that directory start with the string given on the command line, minus the first 2 letters, then that's a match. If more than one filename matches then that's ambiguous. What's wrong with this approach? Jon ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] abbrev: allow extending beyond 20 chars to disambiguate 2025-08-12 15:26 ` Jon Forrest @ 2025-08-12 16:21 ` René Scharfe 0 siblings, 0 replies; 14+ messages in thread From: René Scharfe @ 2025-08-12 16:21 UTC (permalink / raw) To: Jon Forrest, brian m. carlson, Junio C Hamano, git, Derrick Stolee On 8/12/25 5:26 PM, Jon Forrest wrote: > > > On 8/11/25 2:17 PM, brian m. carlson wrote: > >>> To see how many leading letters of a given full object name is >>> sufficiently unambiguous, the algorithm starts from a initial >>> length, guessed based on the estimated number of objects in the >>> repository, and see if another object that shares the prefix, and >>> keeps extending the abbreviation. The loop stops at GIT_MAX_RAWSZ, >>> which is counted as the number of bytes, since 5b20ace6 (sha1_name: >>> unroll len loop in find_unique_abbrev_r(), 2017-10-08); before that >>> change, it extended up to GIT_MAX_HEXSZ, which is the correct limit >>> because the loop is adding one output letter per iteration. > > I'm new to all this but the way I did it is much simpler. > What I did was to check all the files in the appropriate > object store directory (e.g. .git/objects/XX, where XX are > the first 2 letters of the object given on the command line. > If any of the filenames in that directory start with the > string given on the command line, minus the first 2 letters, > then that's a match. If more than one filename matches then > that's ambiguous. > > What's wrong with this approach? You also need to check all pack files and alternate object databases that you might have. René ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-08-14 15:09 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-11 15:26 [PATCH] abbrev: allow extending beyond 20 chars to disambiguate Junio C Hamano 2025-08-11 18:53 ` Junio C Hamano 2025-08-11 19:06 ` [PATCH v2] " Junio C Hamano 2025-08-11 22:23 ` brian m. carlson 2025-08-12 13:28 ` Derrick Stolee 2025-08-12 14:58 ` René Scharfe 2025-08-12 15:17 ` Junio C Hamano 2025-08-12 15:59 ` René Scharfe 2025-08-14 15:09 ` [PATCH v3] abbrev: allow extending beyond 32 " Junio C Hamano 2025-08-11 21:17 ` [PATCH] abbrev: allow extending beyond 20 " brian m. carlson 2025-08-11 21:25 ` Junio C Hamano 2025-08-11 21:28 ` Junio C Hamano 2025-08-12 15:26 ` Jon Forrest 2025-08-12 16:21 ` René Scharfe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).