From: Sam Vilain <sam.vilain@catalyst.net.nz>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
Jeff King <peff@peff.net>,
gitster@pobox.com, git@vger.kernel.org
Subject: Re: RFC: Flat directory for notes, or fan-out? Both!
Date: Wed, 11 Feb 2009 16:19:27 +1300
Message-ID: <499243BF.1010701@catalyst.net.nz>
In-Reply-To: <20090210164430.GN30949@spearce.org>

Shawn O. Pearce wrote:
>> The point you raised earlier, that there would be a lot of ambiguity if
>> we allow both flat and fan-out directory structures, is a valid point,
>> though.
>
> Yup. The flat vs. fan-out is a problem.
[...]
> Notes on commits though are a hell of a problem. SHA-1 is just so
> uniform at distributing the commits around the namespace that even
> with just the 200 most recent commits we wind up with a commit in
> almost every "bucket", assuming a two hex digit fan-out bucket like
> the loose object directory.

I think my patch from 1 Feb addressed this, at least for the operations
it implemented.

I just don't see why you need to decide up front what the split is going
to be. Just read the next tree, descend into the closest matching tree
until you find the record you are looking for and that's it. Sure, my
patch just loads it all and throws it into a hash - this should still be
efficient for short log operations even if the hash table ends up 1MB.
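The descend-until-found idea can be sketched roughly as below. This is not the code from the 1 Feb patch: it uses a plain directory as a stand-in for the notes tree, find_note is a hypothetical helper name, and it takes the shortest matching prefix without backtracking, which is a simplification.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the note for object name $2
# below directory $1, whatever mix of flat and fan-out layout is used.
# A checked-out directory stands in for git tree objects here.
find_note() {
    dir=$1
    rest=$2
    while [ -n "$rest" ]; do
        # A flat entry named after the remaining hex digits ends the search.
        if [ -f "$dir/$rest" ]; then
            printf '%s\n' "$dir/$rest"
            return 0
        fi
        # Otherwise look for a subtree whose name is a proper prefix of
        # what is left, trying the shortest prefix first.
        n=1
        next=
        while [ "$n" -lt "${#rest}" ]; do
            prefix=$(printf '%s' "$rest" | cut -c "1-$n")
            if [ -d "$dir/$prefix" ]; then
                next=$prefix
                break
            fi
            n=$((n + 1))
        done
        # No entry and no matching subtree: there is no note.
        [ -n "$next" ] || return 1
        dir=$dir/$next
        rest=${rest#"$next"}
    done
    return 1
}
```

Called as `find_note notes-dir <40-hex-digit-name>`, where notes-dir is an assumed checkout of the notes tree, it finds ab/cd/ef... and a flat abcdef... entry alike, so nothing has to fix the split up front.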
But why take my guess? Let's stress test it.

'lorem' is the binary in the Text::Lorem Perl module. It generates a
paragraph of random Latin text.
wilber:~/src/git$ time git-log | wc -l
256072
real 0m0.709s
user 0m0.608s
sys 0m0.116s
wilber:~/src/git$ git rev-list HEAD | wc -l
17678
wilber:~/src/git$ cat > my-editor
#!/bin/sh
( lorem; echo ) > $1
wilber:~/src/git$ chmod +x my-editor
wilber:~/src/git$ export EDITOR=`pwd`/my-editor
wilber:~/src/git$ export GIT_NOTES_SPLIT=2
wilber:~/src/git$ time git-rev-list HEAD | while read rev
> do ./git-notes.sh edit $rev; done
fatal: unable to create '.git/refs/notes/commits.lock': File exists
error: Ref refs/notes/commits is at
5f0732975b4acf237912a31e7ce14aa86d2e8179 but expected
725a2d119d2725e7d821906ad085bfbadbf43c8e
fatal: Cannot lock the ref 'refs/notes/commits'.
[...]
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
real 76m16.927s
user 43m55.909s
sys 19m33.005s
Oo. Nasty errors there but never mind that for now. Obviously some
remaining issues in the shell script.
What did I get out of that?
wilber:~/src/git$ git-ls-tree -r refs/notes/commits | wc
12043 48172 1144085
wilber:~/src/git$
Hey well that's not too bad. Enough to be a good test. How long does
"git-log" take now?
wilber:~/src/git$ time ./git-log | wc -l
292201
real 0m13.740s
user 0m0.852s
sys 0m0.716s
wilber:~/src/git$ time ./git-log | wc -l
292201
real 0m1.335s
user 0m0.856s
sys 0m0.512s
Not bad! Cold cache performance sucked there, but warm it's under a 2x
slowdown in wall-clock time (about 40% more user time) for reading
almost twice the number of objects. Let's try 200 commits:
wilber:~/src/git$ time git-log -200 | wc -l
2877
real 0m0.027s
user 0m0.008s
sys 0m0.020s
wilber:~/src/git$ time ./git-log -200 | wc -l
3477
real 0m0.081s
user 0m0.056s
sys 0m0.020s
Quite a big slowdown proportionally, but not a huge amount in absolute
terms. And we didn't even make the builtin-log machinery smart enough
to skip unneeded trees!
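For a rough sense of how much skipping unneeded trees could save, here is a back-of-the-envelope estimate. It assumes that GIT_NOTES_SPLIT=2 means a two-hex-digit fan-out (256 subtrees) and that SHA-1 spreads names uniformly; expected_buckets is a hypothetical helper, not part of any patch.

```shell
#!/bin/sh
# Expected number of non-empty buckets when n uniformly distributed
# object names fall into b buckets: b * (1 - (1 - 1/b)^n).
# Both the uniformity and the bucket count are assumptions.
expected_buckets() {
    awk -v n="$1" -v b="$2" \
        'BEGIN { printf "%.0f\n", b * (1 - (1 - 1/b)^n) }'
}

# Subtrees touched by the notes for 200 commits with a 256-way split.
expected_buckets 200 256
```

Under those assumptions 200 commits land in roughly 139 of the 256 subtrees, so a smarter reader could skip a bit under half of them for a short log.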
> In a slightly unrelated
> thread offlist I have been talking with Sam Vilain about using Git
> as a database backend for tuple storage.
[...]
> This would make the git-notes.sh code a *lot* more complex, as you
> can't just toss everything into an index file and then update it with
> a single update-index call. Doing a tree split is much more work and
> requires removing and adding back all of the affected path names.
> (It's also perhaps unreasonable anyway to load 17,491 paths into a
> temporary index just to twiddle a note for the latest commit.)
Hehe, horribly overcomplicated for this use case... many applicable
ideas though.
> For the "git database" thing above, I've been contemplating the
> idea of an index stored external from the Git object database.
> Sam thinks indexes should be in the object database tree, but
> I'm considering storing them outside entirely because we can
> make the indexes more easily searched by a hash or binary search,
> like pack-*.idx. Whenever the "database ref" gets moved we'd need
> to run a "sync" utility to bring these external indexes current.
> But they could also be more efficiently scanned.
Well either way it's a file you've got to scan somehow ... guess it
doesn't matter much whether it's in-tree or not. I was actually saying
that there are some use cases where you might want to keep indexes in
the history and some where you don't. Keeping them in-tree is not
normalised, but there are good use cases for it, e.g. efficient retrieval
of pre-computed aggregates that don't need to be up to the second, or
for instances where you want your nodes to be able to "hit the ground
running" after synchronisation without having to reindex.
For the use case we originally talked about I don't think you'd want any
indexes in-tree at all.
But I'd like to steer this thread well away from the database stuff I'm
drafting ... it's a lot more comprehensive; notes are a very simple hash
relationship.
--
Sam Vilain, Perl Hacker, Catalyst IT (NZ) Ltd.
phone: +64 4 499 2267 PGP ID: 0x66B25843