From: Johan Herland <johan@herland.net>
To: Shawn Pearce <spearce@spearce.org>
Cc: git@vger.kernel.org
Subject: Re: bug in git notes fanout
Date: Wed, 03 Nov 2010 00:36:28 +0100 [thread overview]
Message-ID: <201011030036.28318.johan@herland.net> (raw)
In-Reply-To: <AANLkTi=JfwJq8wq_1an014WO+A_Dwp5cDKa+xHi10awr@mail.gmail.com>
On Tuesday 02 November 2010, Shawn Pearce wrote:
> Why doesn't the fan-out work for this case?
>
> git notes --ref buggy.fanout add -m test HEAD
> perl -e 'for($i=0;$i<1024;$i++){printf "%s %s%4.4x\n", $ARGV[0],
> "0"x36, $i}' $(git rev-parse HEAD) | git notes --ref buggy.fanout copy
> --stdin
> git ls-tree refs/notes/buggy.fanout
> 100644 blob ... 0000000000000000000000000000000000000000
> 100644 blob ... 0000000000000000000000000000000000000001
> 100644 blob ... 0000000000000000000000000000000000000002
> 100644 blob ... 0000000000000000000000000000000000000003
> ...
> 100644 blob ... 000000000000000000000000000000000000018c
> ...
> 100644 blob ... 00000000000000000000000000000000000003e1
> ...
> 100644 blob ... 00000000000000000000000000000000000003e2
> ...
> 100644 blob ... 00000000000000000000000000000000000003fd
> 100644 blob ... 00000000000000000000000000000000000003fe
> 100644 blob ... 00000000000000000000000000000000000003ff
> 100644 blob ... 638d3d9244720e0f07f22a953d25db878e9ad3f5
Yes, for your extremely far-fetched testcase, the current implementation is
pessimal. However, please, consider exactly _what_ the notes tree contains.
Yes, that's right: SHA1 sums. Now, try to come up with a real-world scenario
where your notes tree would end up looking like the above.
I guess the same "problem" exists with the 2/38 fanout that is done for
loose objects in .git/objects/. If you happen to only add objects whose SHA1
start with 00..., you will get worst case behaviour there as well. Again, if
you can construct a real-world testcase for this, I am probably not the only
one interested in looking at it...
> I thought the entire points of the notes fanout being 0/40, 2/38,
> 2/2/36, 2/2/2/34 was to prevent git from having a big linear search
> within any single tree when a large number of notes are added to a
> note branch.
Yes, and by assuming that SHA1 gives a fairly uniform distribution of keys
(a pretty safe assumption if the rest of Git's design is anything to go by),
we can conclude that the large number of keys would be fairly evenly spread
out across the notes tree.
> IIRC when the notes stuff was being debated on this list we insisted
> that the fan-out algorithm had to be consistent, ensuring that if I
> create a note for 638d3d92 and Junio creates a note on the same
> object, they will wind up at the same path in the notes, and we won't
> have doppleganger notes (two different notes in the same tree with the
> same object).
Let's separate the cases here:
A. There is absolutely no guarantee that two different notes trees will
place a note for 638d3d92 at the same location/path. I.e. if you add a note
to your notes tree, and Junio adds one to his, they might end up at
different paths. This depends on how many, and which other notes are in your
respective notes trees.
B. When combining/merging those notes tree, the notes code _does_ guarantee
that there won't be doppelganger notes. I.e. if you're both working in the
same notes tree (in the same repo), the notes code will not allow you to add
two notes to 638d3d92 at different locations in the notes tree.
Of course you might artificially construct a notes tree that contains
doppelganger notes, and the current implementation will successfully read
that notes tree (combining the doppelganger notes using the configured
combine_notes function), but - crucially - the notes code will _never_
generate a notes tree with doppelganger notes.
> Basically this issue arises because we are adding node support to
> JGit. Our planned implementation would have inserted
> 00000000000000000000000000000000000003ff into
> 00/00/00/00/00/00/00/00/00/00/00/00/00/00/00/00/00/00/03/ff because
> there are so many notes that at each depth we wind up with >255
> entries, and split the tree again. But that results in a very
> different structure from what the C implementation is doing today.
>
> As far as I can tell from the C code, it won't split unless every hex
> digit in the 0..f range is used. That is just a weird heuristic, and
> actual paths used depend on the actual SHA-1s that we add to the tree.
Read that sentence again. As explained above, since SHA1 provides a near-
uniform distribution, looking at hex digits in the 0..f range is a very
inexpensive and moderately good heuristic to estimating whether it's time to
make a split.
Granted, there might be better heuristics, and the current design allows us
to implement such heuristics without making pre-existing notes trees
invalid.
> If I get "unlucky" and add 5 commits that start with "f" and never
> add one that starts with "0", I won't get fan-out, even though my
> notes tree has more than 256 items in it.
True, however, the uniform distribution of SHA1 pretty much guarantees that
as you keep adding notes, you're going to get one that starts with "0", and
then you'll get the needed split.
Hope this helps,
...Johan
--
Johan Herland, <johan@herland.net>
www.herland.net
prev parent reply other threads:[~2010-11-02 23:37 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-02 22:06 bug in git notes fanout Shawn Pearce
2010-11-02 23:36 ` Johan Herland [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=201011030036.28318.johan@herland.net \
--to=johan@herland.net \
--cc=git@vger.kernel.org \
--cc=spearce@spearce.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).