From: David Turner <dturner@twopensource.com>
To: Jeff King <peff@peff.net>
Cc: git mailing list <git@vger.kernel.org>
Subject: Re: RFC/Pull Request: Refs db backend
Date: Wed, 24 Jun 2015 13:29:59 -0400
Message-ID: <1435166999.26709.12.camel@twopensource.com>
In-Reply-To: <20150624091417.GB5436@peff.net>

On Wed, 2015-06-24 at 05:14 -0400, Jeff King wrote:
> On Tue, Jun 23, 2015 at 02:18:36PM -0400, David Turner wrote:
>
> > > Can you describe a bit more about the reflog handling?
> > >
> > > One of the problems we've had with large-ref repos is that the reflog
> > > storage is quite inefficient. You can pack all the refs, but you may
> > > still be stuck with a bunch of reflog files with one entry, wasting a
> > > whole inode. Doing a "git repack" when you have a million of those has
> > > horrible cold-cache performance. Basically anything that isn't
> > > one-file-per-reflog would be a welcome change. :)
> >
> > Reflogs are stored in the database as well. There is one header entry
> > per ref to indicate that a reflog is present, and then one database
> > entry per reflog entry; the entries are stored consecutively and
> > immediately following the header so that it's fast to iterate over them.
>
> OK, that makes sense. I did notice that the storage for the refdb grows
> rapidly. If I add a million refs (like refs/tags/$i) with a simple
> reflog message "foo", I ended up with a 500MB database file.
>
> That's _probably_ OK, because a million is getting into crazy
> territory[1]. But it's 500 bytes per ref, each with one reflog entry.
> Our ideal lower bound is probably something like 100 bytes per reflog
> entry:
>
> - 20 bytes for old sha1
> - 20 bytes for new sha1
> - ~50 bytes for name, email, timestamp
> - ~6 bytes for refname (1000000 is the longest unique part)
>
> That assumes we store binary[2] (and not just the raw reflog lines), and
> reconstruct the reflog lines on the fly. It also assumes we use some
> kind of trie-like storage (where we can amortize the cost of storing
> "refs/tags/" across all of the entries).
>
> Of course that neglects lmdb's overhead, and the storage of the ref tip
> itself. But it would hopefully give us a ballpark for an optimal
> solution. We don't have to hit that, of course, but it's food for
> thought.
>
> [1] The homebrew/homebrew repository on GitHub has almost half a million
> ref updates. Since this is storing not just refs but all ref
> updates, that's actually the interesting number (and optimizing the
> per-reflog-entry size is more interesting than the per-ref size).
>
> [2] I'm hesitant to suggest binary formats in general, but given that
> this is a blob embedded inside lmdb, I think it's OK. If we were to
> pursue the log-structured idea I suggested earlier, I'm torn on
> whether it should be binary or not.

I could try a binary format. I chose the text format to optimize for
simplicity, debuggability, recoverability, and compatibility, but I
wouldn't have to. I don't know how much this will save.
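
For a rough feel of the text-versus-binary trade-off, here is a minimal
Python sketch that encodes the same reflog entry both ways. The text form
follows the line layout of git's file-based reflogs. The binary layout
(raw hashes, a fixed-width timestamp and timezone, length-prefixed
strings) and the function names are assumptions made up for illustration;
they are not the format used by the LMDB backend.

```python
import struct


def text_entry(old_hex, new_hex, ident, timestamp, tz, message):
    """Encode one entry like a line in git's file-based reflog."""
    line = f"{old_hex} {new_hex} {ident} {timestamp} {tz}\t{message}\n"
    return line.encode()


def binary_entry(old_hex, new_hex, ident, timestamp, tz, message):
    """Hypothetical binary layout (an assumption, not an existing format)."""
    ident_b = ident.encode()
    msg_b = message.encode()
    # Convert "+HHMM"/"-HHMM" into a signed offset in minutes.
    tz_minutes = int(tz[:3]) * 60 + int(tz[0] + tz[3:])
    # 8-byte timestamp, 2-byte tz offset, two 2-byte string lengths.
    fixed = struct.pack(">qhHH", timestamp, tz_minutes, len(ident_b), len(msg_b))
    return bytes.fromhex(old_hex) + bytes.fromhex(new_hex) + fixed + ident_b + msg_b


old = "0" * 40
new = "a" * 40
args = (old, new, "A U Thor <author@example.com>", 1435166999, "-0400", "foo")
print("text bytes:  ", len(text_entry(*args)))
print("binary bytes:", len(binary_entry(*args)))
```

Most of the saving comes from storing each sha1 as 20 raw bytes instead
of 40 hex characters; the identity string dominates what remains.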

Unfortunately, given the way LMDB works, a trie-like storage to save
refs/tags does not seem possible. (Of course, we could hard-code some
hacks like \001=refs/tags, \002=refs/heads, etc., but that is a
micro-optimization that might not be worth it.)
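
The hard-coded prefix hack could look roughly like the Python sketch
below. The table contents, the trailing slashes on the prefixes, and the
function names are assumptions for illustration only. The trick relies on
the fact that git refnames may not contain ASCII control characters, so a
leading \001 or \002 byte cannot collide with a real refname.

```python
# Hypothetical one-byte codes for common prefixes (an assumption).
PREFIX_CODES = {
    b"refs/tags/": b"\x01",
    b"refs/heads/": b"\x02",
}
CODE_PREFIXES = {code: prefix for prefix, code in PREFIX_CODES.items()}


def compress_refname(refname: bytes) -> bytes:
    """Replace a known prefix with its one-byte code, if any."""
    for prefix, code in PREFIX_CODES.items():
        if refname.startswith(prefix):
            return code + refname[len(prefix):]
    return refname


def expand_key(key: bytes) -> bytes:
    """Inverse of compress_refname."""
    prefix = CODE_PREFIXES.get(key[:1])
    if prefix is not None:
        return prefix + key[1:]
    return key


key = compress_refname(b"refs/tags/1000000")
print(len(b"refs/tags/1000000"), "->", len(key), "bytes")
print(expand_key(key))
```

This saves a fixed number of bytes per key, which is why it is only a
micro-optimization compared with a real trie.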

Also, the reflog header has some overhead (it's an entire extra record
per ref). The header exists to implement reflog creation/existence
checking. I didn't really try to understand why we have the distinction
between empty and nonexistent reflogs; I just copied it. If we didn't
have that distinction, we could eliminate that overhead.

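
To illustrate the header-plus-entries layout, here is a small Python
sketch using an in-memory sorted store as a stand-in for an ordered
key-value database such as LMDB. The key scheme ("logs/" + refname + NUL
for the header, followed by an 8-byte big-endian sequence number for each
entry) and all names are assumptions; the actual backend may lay out its
keys differently. The sketch shows why the header record is what lets an
empty reflog be told apart from a nonexistent one.

```python
import bisect
import struct


class OrderedStore:
    """Tiny in-memory stand-in for an ordered key-value store."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def has(self, key):
        return key in self._values

    def scan_from(self, start):
        """Yield (key, value) pairs in key order, starting at `start`."""
        i = bisect.bisect_left(self._keys, start)
        for key in self._keys[i:]:
            yield key, self._values[key]


def header_key(refname):
    return b"logs/" + refname + b"\x00"


def entry_key(refname, seq):
    # Big-endian sequence numbers sort in insertion order after the header.
    return header_key(refname) + struct.pack(">Q", seq)


def create_reflog(store, refname):
    store.put(header_key(refname), b"")


def reflog_exists(store, refname):
    return store.has(header_key(refname))


def append_entry(store, refname, seq, entry):
    store.put(entry_key(refname, seq), entry)


def read_reflog(store, refname):
    """Scan forward from the header until the key prefix changes."""
    head = header_key(refname)
    entries = []
    for key, value in store.scan_from(head):
        if not key.startswith(head):
            break
        if key != head:
            entries.append(value)
    return entries


store = OrderedStore()
create_reflog(store, b"refs/tags/1")
append_entry(store, b"refs/tags/1", 0, b"first")
append_entry(store, b"refs/tags/1", 1, b"second")
create_reflog(store, b"refs/tags/2")  # exists, but empty
print(read_reflog(store, b"refs/tags/1"))
print(reflog_exists(store, b"refs/tags/2"), reflog_exists(store, b"refs/tags/3"))
```

Dropping the empty/nonexistent distinction would mean dropping the
header record and treating "no entries" as "no reflog".
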
> > Thanks, that's valuable. For the refs backend, opening the LMDB
> > database for writing is sufficient to block other writers. Do you think
> > it would be valuable to provide a git hold-ref-lock command that simply
> > reads refs from stdin and keeps them locked until it reads EOF from
> > stdin? That would allow cross-backend ref locking.
>
> I'm not sure what you would use it for. If you want to update the refs,
> then you can specify a whole transaction with "git update-ref --stdin",
> and that should work whatever backend you choose. Is there some other
> operation you want where you hold the lock for a longer period of time?

I'm sure I had a reason for this at the time I wrote it, but now I can't
think of what it was. Never mind!