From: "Shawn O. Pearce" <spearce@spearce.org>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Jeff King <peff@peff.net>,
gitster@pobox.com, git@vger.kernel.org,
Sam Vilain <sam.vilain@catalyst.net.nz>
Subject: Re: RFC: Flat directory for notes, or fan-out? Both!
Date: Tue, 10 Feb 2009 08:44:30 -0800
Message-ID: <20090210164430.GN30949@spearce.org>
In-Reply-To: <alpine.DEB.1.00.0902101427490.10279@pacific.mpi-cbg.de>
Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> [Junio: seems like both Peff and me would like to hold the notes out of
> 1.6.2, would you mind?]
Sorry I'm getting involved in this notes thing so late. I was way
too focused on Gerrit2 and just didn't pay much attention to what
was on the git ML recently. Like Dscho and Peff, I think we may
want to hold notes out of 1.6.2.
> On Tue, 10 Feb 2009, Jeff King wrote:
> > On Tue, Feb 10, 2009 at 01:59:06PM +0100, Johannes Schindelin wrote:
>
> The thing is: Shawn is correct when he says that a tree object to hold the
> notes of all commits (which is not an unlikely scenario if you are
> thinking about corporate processes) would be huge.
A notes tree entry requires 6+1+40+1+20=68 bytes. If I
use it for what I want in Gerrit, which is to annotate every commit,
on a project like git.git with 17,491 commits we're talking about
a tree that is 1.13 MB.
That tree grows at a rate of 276 KB/year.
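The size estimate above can be reproduced with a few lines of Python. This is only a back-of-envelope sketch. The labels on the five terms are an assumed reading of the sum (file mode, space, 40-character hex path name, NUL, 20-byte binary SHA-1), and the commits-per-year figure is simply derived from the quoted 276 KB/year, not measured.

```python
# Back-of-envelope size of a flat notes tree, using the figures in the message.
# The meaning attached to each term is an assumed reading of "6+1+40+1+20".
MODE = 6        # e.g. "100644"
SPACE = 1       # separator between mode and name
NAME = 40       # hex SHA-1 of the annotated commit, used as the path name
NUL = 1         # terminator after the name
OBJECT_ID = 20  # binary SHA-1 of the note blob

ENTRY_BYTES = MODE + SPACE + NAME + NUL + OBJECT_ID


def flat_tree_bytes(commits: int) -> int:
    """Uncompressed size of one flat tree with one entry per commit."""
    return commits * ENTRY_BYTES


def commits_per_year(growth_kib_per_year: float) -> float:
    """Commit rate implied by a given yearly growth of the tree."""
    return growth_kib_per_year * 1024 / ENTRY_BYTES


size = flat_tree_bytes(17_491)
print(ENTRY_BYTES, "bytes per entry")
print(size, "bytes, about", round(size / 2**20, 2), "MB")
print("implied commit rate:", round(commits_per_year(276)), "per year")
```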
I'm not sure I want to think about the cost to unpack that tree,
just so I can look at "git log --since=1.week.ago".
My fear here is that over time we will be spending a lot of CPU
time unpacking and indexing the tree in memory, only to then pull
out a handful of recent commits, and then see the pager abort and
kill the revision walk.
> The point you raised earlier, that there would be a lot of ambiguity if
> we allow both flat and fan-out directory structures, is a valid point,
> though.
Yup. The flat vs. fan-out is a problem. In a slightly unrelated
thread offlist I have been talking with Sam Vilain about using Git
as a database backend for tuple storage. There is a related issue
there about making the tree structure consistent, but never stored
in a way that we wind up with these massive multi-megabyte objects.
We've only started to kick it around, but I think we are both in
agreement that a "database tree" is owned by the database code
and must not be twiddled manually. Not unless you can honor the
formatting rules. Just like you shouldn't use "git hash-object"
to create a tree, unless you can honor the basic formatting rules
for trees.
This also means that the "database trees" probably are not going
to be mergeable with a basic merge-recursive sort of algorithm,
but instead need specialized handling to perform the combination.
I think we're leaning in a direction of something more like this
for trees:
- Tuples are stored under a path constructed from their primary key.
The analog here is, the commit SHA-1 the note is annotating.
- Trees are capped at some reasonable size limit. For sake of
argument let's call that MAX_TREE. My feeling is this would be
closer to the 16 KB side of the spectrum than to the 1 MB side.
- Initially the database tree starts out as a single root tree that
is empty.
- Records are inserted, creating new tree entries, until MAX_TREE
is reached for the root level tree. Up until this point it is
a flat tree structure, like the current notes design.
- Once MAX_TREE is reached the root is split, and ranges are used
to point to the subtrees, which are now flat, and are approximately
MAX_TREE/2 in size.
Etc.
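The split rule in the list above can be sketched in memory as follows. This is a hedged illustration, not the design under discussion: the class name RangeTree, the 16 KB value for MAX_TREE, the 68-byte entry size, and the use of each subtree's lowest key as its range marker are all assumptions. It models only two levels (a root holding ranges, plus flat leaves) and never writes Git objects.

```python
import bisect
import hashlib

ENTRY_BYTES = 68       # assumed per-entry cost, taken from the estimate earlier
MAX_TREE = 16 * 1024   # hypothetical cap, at the small end of the stated range


class RangeTree:
    """Flat until it overflows, then split into key ranges (two levels only)."""

    def __init__(self, max_entries=MAX_TREE // ENTRY_BYTES):
        self.max_entries = max_entries
        self.leaves = [[]]        # each leaf is a sorted list of keys
        self.lower_bounds = [""]  # lowest key that each leaf may hold

    def _leaf_index(self, key):
        # The rightmost leaf whose lower bound is <= key owns that key.
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def insert(self, key):
        i = self._leaf_index(key)
        leaf = self.leaves[i]
        j = bisect.bisect_left(leaf, key)
        if j < len(leaf) and leaf[j] == key:
            return  # already present
        leaf.insert(j, key)
        if len(leaf) > self.max_entries:
            # Split the overflowing leaf into two halves of roughly equal size.
            mid = len(leaf) // 2
            right = leaf[mid:]
            del leaf[mid:]
            self.leaves.insert(i + 1, right)
            self.lower_bounds.insert(i + 1, right[0])

    def contains(self, key):
        leaf = self.leaves[self._leaf_index(key)]
        j = bisect.bisect_left(leaf, key)
        return j < len(leaf) and leaf[j] == key


def demo(n=1000):
    """Insert n made-up keys and report how many leaves the root points at."""
    tree = RangeTree()
    for i in range(n):
        tree.insert(hashlib.sha1(str(i).encode()).hexdigest())
    return len(tree.leaves), max(len(leaf) for leaf in tree.leaves)
```

A real implementation would also have to split the root once its list of ranges outgrows MAX_TREE, and each split rewrites every path in the affected subtree.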
This would make the git-notes.sh code a *lot* more complex, as you
can't just toss everything into an index file and then update it with
a single update-index call. Doing a tree split is much more work and
requires removing and adding back all of the affected path names.
(It's also perhaps unreasonable anyway to load 17,491 paths into a
temporary index just to twiddle a note for the latest commit.)
Notes on commits though are a hell of a problem. SHA-1 is just so
uniform at distributing the commits around the namespace that even
with just the 200 most recent commits we wind up with a commit in
almost every "bucket", assuming a two hex digit fan-out bucket like
the loose object directory.
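How widely a small window of commits spreads over a two-hex-digit fan-out can be checked with a short sketch. The keys below are made-up stand-ins for commit IDs, and the closed form assumes SHA-1 output is uniformly distributed. Under that assumption the expected count for 200 commits comes out at a bit over half of the 256 buckets rather than nearly all of them, so the effect is somewhat less extreme than described, though it still means most subtrees get touched.

```python
import hashlib


def occupied_buckets(n_commits: int, hex_digits: int = 2) -> int:
    """Count distinct fan-out prefixes hit by n made-up commit IDs."""
    prefixes = {
        hashlib.sha1(b"commit %d" % i).hexdigest()[:hex_digits]
        for i in range(n_commits)
    }
    return len(prefixes)


def expected_occupied(n_commits: int, hex_digits: int = 2) -> float:
    """Expected number of non-empty buckets for uniformly random IDs."""
    total = 16 ** hex_digits
    return total * (1 - (1 - 1 / total) ** n_commits)


for n in (200, 1000, 17_491):
    print(n, occupied_buckets(n), round(expected_occupied(n), 1))
```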
For the "git database" thing above, I've been contemplating the
idea of an index stored outside the Git object database.
Sam thinks indexes should be in the object database tree, but
I'm considering storing them outside entirely because we can
make the indexes more easily searched by a hash or binary search,
like pack-*.idx. Whenever the "database ref" gets moved we'd need
to run a "sync" utility to bring these external indexes current.
But they could also be more efficiently scanned.
E.g. in the case of commit notes, we could just mmap() the index into
memory and perform our lookups through the mmap. Thus we wouldn't
pay massive penalties to index all 17,491 names just to access 200.
We may still wind up paging in a good part of the index due to the
random access pattern, but we can't really do anything about that.
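A minimal sketch of such an mmap()-based lookup is below. The file layout is purely an assumption for illustration (sorted fixed-width records of a 20-byte commit SHA-1 followed by a 20-byte note blob SHA-1, with no header or fan-out table), and the helper names write_index, lookup and demo are hypothetical. It is not the pack-*.idx format.

```python
import hashlib
import mmap
import os
import tempfile

KEY = 20         # binary SHA-1 of the annotated commit
VAL = 20         # binary SHA-1 of the note blob
REC = KEY + VAL  # one fixed-width record


def write_index(path, mapping):
    """Write records sorted by key so the file can be binary searched."""
    with open(path, "wb") as f:
        for key in sorted(mapping):
            f.write(key + mapping[key])


def lookup(mm, key):
    """Binary search the mapped file; only the probed pages get touched."""
    lo, hi = 0, len(mm) // REC
    while lo < hi:
        mid = (lo + hi) // 2
        off = mid * REC
        probe = mm[off:off + KEY]
        if probe < key:
            lo = mid + 1
        elif probe > key:
            hi = mid
        else:
            return mm[off + KEY:off + REC]
    return None


def demo(n=1000):
    """Build a throwaway index of made-up IDs and look one of them up."""
    notes = {
        hashlib.sha1(b"commit %d" % i).digest(): hashlib.sha1(b"note %d" % i).digest()
        for i in range(n)
    }
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_index(path, notes)
        with open(path, "rb") as f:
            # Mapping an empty file raises ValueError, so n must be at least 1.
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                wanted = hashlib.sha1(b"commit 0").digest()
                return lookup(mm, wanted) == notes[wanted]
            finally:
                mm.close()
    finally:
        os.remove(path)
```

Each lookup costs about log2(N) probes, roughly 14 for 17,491 records, and nothing has to be parsed up front.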
Keeping the indexes current would perhaps mean teaching "git fetch"
to run something after the fetch is complete. Rather trivial in
the grand scheme of things. I also liken the external index to the
pack-*.idx, in that it's derived from the real sources in the object
database, and can always be generated client side. So making fetch
do it is really no different than making fetch run index-pack.
Eh. That wound up being a lot longer than I wanted it to be.
Sam and I may be putting some effort into this "git as a database"
thing, and it could be used as an efficient notes store. It's just
a very complex notes store. Much more complex to implement than
the simple notes currently slated for 1.6.2.
--
Shawn.