git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Namhyung Kim <namhyung@gmail.com>
Subject: Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better
Date: Thu, 10 Mar 2011 18:21:36 -0800	[thread overview]
Message-ID: <AANLkTinHCsyN4TLBAxzxOn_QAK67VkT1pLgcu7XB0JCD@mail.gmail.com> (raw)
In-Reply-To: <7vmxl21lwr.fsf@alter.siamese.dyndns.org>

On Thu, Mar 10, 2011 at 5:33 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> That 979f79 one already have enough other objects with similar names, so
> compared to 83c3c that doesn't, it is natural that you would need more
> digits to protect its uniqueness, no?  The result shouldn't be affected by
> the value of "short" as long as it is not long enough, as that is merely
> specifying "at least this many letters"

Yes, uniqueness in that sense is sane and has a good definition.

But that's _not_ the case when you then randomly add extra <n> digits to it.

Why? Because that <n> is meaningless, because what <n> means depends
on what the base number was. And the base number is different for
different objects.

The case of n=0 is special, because it is the "current state". But
what does "n=1" mean?

Let me make it more explicit by making an extreme example about this.
It's extreme just because I'm going to assume that the shortest
abbreviation is 1 (in reality it's 4) but that doesn't really change
the math, it just makes the numbers smaller/easier.

So let's say that we have a repository with just 100 objects. What
does that mean? In practical terms, it means that it is not impossible
that you will have some object that is unique in a single digit (yes,
it may be a bit unlikely, but it's not unreasonable), while you'll
have other objects that need three digits. And most will be unique in
two.

Shortening the numbers that way has a _meaning_: the notion of
"unique" is clearly meaningful. Sure, different objects get different
lengths, but the different lengths have a very real reason for them.

So then (again, to make the numbers small, and the math simple), let's
assume abbrevguard=1. What does that MEAN?

And I claim that it means something totally _different_ for the
different objects, and that's the crazy thing. Because now we're
talking about possible _future_ objects, but the likely _future_
uniqueness of "unique in one digit" is TOTALLY DIFFERENT from the
future uniqueness of "unique in three digits"!

The single-digit uniqueness is going to be gone _long long_ before the
three-digit uniqueness is gone. Adding a single digit to the object
that currently happens to need only a single-digit unique id will
_not_ do a whole lot of future-proofing - if you add another one
hundred objects, that object may well need three digits to be unique.
But if you add a single digit to the one that currently already needed
three digits, you're likely to need to add an order of magnitude more
objects to need to extend the three digits to four.

See what I'm trying to say? This is why I think abbrevguard is a
broken concept, when it is relative to "how unique is the object now".

If the abbrevguard was related to the maximum number of digits
required for _any_ object in the current repository, it would be
meaningful - it would actually be about the _size_ of the current
repository, and thus indirectly about the size of a future one. But it
isn't. It's always relative to the "local uniqueness", which is only
valid for the *current* state, and has very little to do with future
growth.

Now, to put things in terms of a real repository ("git" itself), and
two extreme cases from it:

 - commit 1dae(0d38b8119de2b67f87e800c860ed958bbed6) currently unique
in four digits

 - commit 979f7929(51913d75f992f87022b75610303a614f) currently unique
in eight digits

and think about what "abbrevguard" means for those two commits.

Let's pick an abbrev-guard of two digits. For the first one, it means
that you use six digits total, and for the second one, you'd use 10
digits total.

What does that mean for future work? How many objects do we need to
add for clashes to start happening?

For the first commit, it's _almost certain_ that if you double the
size of the repository, those six digits will no longer be unique. For
the second case? I can pretty much guarantee that EVEN IF you didn't
have any abbrev-guard at all, and you doubled the size of the git
repository, the thing would still be unique in eight digits.

Why? It's simply *much*much* less likely that you'll get future
clashes from new objects in the eight digits. The likelihood of a
clash with the currently unique 4-digit object is 16^4=65536 times
higher than a clash with the currently 8-digit unique shortening.

So it's senseless to add an equal number of digits to the two objects.
They simply don't have the same likelihood of future collisions.

So what is mathematically the sensible thing? It's actually to extend
both objects to the _same_ number of digits. It's _more_ sensible to
extend the current four-digit number to eight digits, than it is to
extend the currently unique in eight digits by even a single digit. It
would future-proof things a fair amount, exactly because the
likelihood of future objects clashing with the two objects are totally
different.

That's why I said that it would be sensible to make the abbrevguard be
relative to the current worst-case uniqueness. Because THAT actually
is what is probable. If we currently have a maximum uniqueness
requirement of 8 characters, it is _probable_ that by the time the
project has grown by a factor of 4, we'll need 9 characters (I think,
I may have gotten the math wrong).

But it is somewhat expensive to calculate "max current uniqueness", so
I would suggest ditching the whole notion of "how many extra numbers
do I need for futureproofing", and go for just setting the absolute
value of DEFAULT_ABBREV.

                               Linus

  reply	other threads:[~2011-03-11  2:22 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <E1PBKT9-0004Uk-Sm@tytso-glaptop>
     [not found] ` <20101028075631.GA7690@elte.hu>
     [not found]   ` <AANLkTi=8SbOZizWpxLg=Bgp7atdgr8MsR6tnRDYr1+eW@mail.gmail.com>
     [not found]     ` <20101028163854.GA15450@elte.hu>
     [not found]       ` <AANLkTin62vAwJxcsrFk6Yn7Q6tzr-D=EmKKwPazuAJ11@mail.gmail.com>
     [not found]         ` <20101028171701.GA18368@elte.hu>
2010-10-28 17:27           ` Minimum git commit abbrev length (Was Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update) for 2.6.37) Ted Ts'o
2010-10-28 18:28             ` Linus Torvalds
2010-10-28 18:54               ` Linus Torvalds
2010-10-29  0:14                 ` Minimum git commit abbrev length (Was Re: -tip: origin tree build failure Brandon Casey
     [not found]         ` <7veiba9ev2.fsf@alter.siamese.dyndns.org>
2011-03-10 22:37           ` [PATCH] find_unique_abbrev(): honor caller-supplied "len" better Junio C Hamano
2011-03-10 23:07             ` Linus Torvalds
2011-03-11  0:40               ` Junio C Hamano
2011-03-11  1:16                 ` Linus Torvalds
2011-03-11  1:33                   ` Junio C Hamano
2011-03-11  2:21                     ` Linus Torvalds [this message]
2011-03-11  3:09                       ` Junio C Hamano
2011-03-11  3:03                     ` Junio C Hamano
2011-03-11  5:22                       ` Jeff King
2011-03-11  5:33                         ` Jeff King
2011-03-11 22:45                       ` Junio C Hamano
2011-03-13 13:30                         ` Namhyung Kim
2011-03-19  1:22                         ` Jay Soffian
2011-03-19 16:24                           ` Namhyung Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=AANLkTinHCsyN4TLBAxzxOn_QAK67VkT1pLgcu7XB0JCD@mail.gmail.com \
    --to=torvalds@linux-foundation.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=namhyung@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).