From: David Turner <dturner@twopensource.com>
To: Jeff King <peff@peff.net>
Cc: git mailing list <git@vger.kernel.org>
Subject: Re: RFC/Pull Request: Refs db backend
Date: Wed, 24 Jun 2015 13:29:59 -0400
Message-ID: <1435166999.26709.12.camel@twopensource.com>
In-Reply-To: <20150624091417.GB5436@peff.net>
On Wed, 2015-06-24 at 05:14 -0400, Jeff King wrote:
> On Tue, Jun 23, 2015 at 02:18:36PM -0400, David Turner wrote:
>
> > > Can you describe a bit more about the reflog handling?
> > >
> > > One of the problems we've had with large-ref repos is that the reflog
> > > storage is quite inefficient. You can pack all the refs, but you may
> > > still be stuck with a bunch of reflog files with one entry, wasting a
> > > whole inode. Doing a "git repack" when you have a million of those has
> > > horrible cold-cache performance. Basically anything that isn't
> > > one-file-per-reflog would be a welcome change. :)
> >
> > Reflogs are stored in the database as well. There is one header entry
> > per ref to indicate that a reflog is present, and then one database
> > entry per reflog entry; the entries are stored consecutively and
> > immediately following the header so that it's fast to iterate over them.
>
> OK, that makes sense. I did notice that the storage for the refdb grows
> rapidly. If I add a million refs (like refs/tags/$i) with a simple
> reflog message "foo", I ended up with a 500MB database file.
>
> That's _probably_ OK, because a million is getting into crazy
> territory[1]. But it's 500 bytes per ref, each with one reflog entry.
> Our ideal lower bound is probably something like 100 bytes per reflog
> entry:
>
> - 20 bytes for old sha1
> - 20 bytes for new sha1
> - ~50 bytes for name, email, timestamp
> - ~6 bytes for refname (1000000 is the longest unique part)
>
> That assumes we store binary[2] (and not just the raw reflog lines), and
> reconstruct the reflog lines on the fly. It also assumes we use some
> kind of trie-like storage (where we can amortize the cost of storing
> "refs/tags/" across all of the entries).
>
> Of course that neglects lmdb's overhead, and the storage of the ref tip
> itself. But it would hopefully give us a ballpark for an optimal
> solution. We don't have to hit that, of course, but it's food for
> thought.
>
> [1] The homebrew/homebrew repository on GitHub has almost half a million
> ref updates. Since this is storing not just refs but all ref
> updates, that's actually the interesting number (and optimizing the
> per-reflog-entry size is more interesting than the per-ref size).
>
> [2] I'm hesitant to suggest binary formats in general, but given that
> this is a blob embedded inside lmdb, I think it's OK. If we were to
> pursue the log-structured idea I suggested earlier, I'm torn on
> whether it should be binary or not.
I could try a binary format. I chose text to optimize for simplicity,
debuggability, recoverability, and compatibility with the existing text
format, but that isn't a requirement. I don't know how much space a
binary format would save.
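
To get a rough feel for the saving, here is a minimal sketch that compares
one reflog entry stored as a raw text line with the same entry stored in a
binary record. The text form follows the usual reflog line layout (old sha1,
new sha1, identity with timestamp, tab, message). The binary layout (two raw
20-byte sha1s, then the identity and message separated by a NUL) and all
function names are assumptions made for illustration, not what the LMDB
backend does.

```python
def text_entry(old_hex, new_hex, ident, msg):
    """One reflog entry as a raw text line."""
    return f"{old_hex} {new_hex} {ident}\t{msg}\n".encode()


def binary_entry(old_hex, new_hex, ident, msg):
    """Hypothetical binary record: raw sha1s, then ident NUL message."""
    return (bytes.fromhex(old_hex) + bytes.fromhex(new_hex)
            + ident.encode() + b"\0" + msg.encode())


def binary_to_text(record):
    """Reconstruct the text line from the hypothetical binary record."""
    old_hex = record[:20].hex()
    new_hex = record[20:40].hex()
    ident, msg = record[40:].split(b"\0", 1)
    return f"{old_hex} {new_hex} {ident.decode()}\t{msg.decode()}\n".encode()


# Example values are made up for the comparison.
OLD = "0" * 40
NEW = "1234567890abcdef1234567890abcdef12345678"
IDENT = "A U Thor <author@example.com> 1435166999 -0400"
MSG = "foo"

text = text_entry(OLD, NEW, IDENT, MSG)
binary = binary_entry(OLD, NEW, IDENT, MSG)
print(len(text), len(binary), len(text) - len(binary))
```

With this particular layout the binary record is 43 bytes shorter than the
text line, whatever the identity and message are: 40 bytes from the two
sha1s and 3 from separators. Packing the timestamp and timezone as integers
would save a little more, at some cost in debuggability.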
Unfortunately, given the way LMDB works, trie-like storage that avoids
repeating refs/tags does not seem possible. (Of course, we could hard-code
some hacks like \001=refs/tags, \002=refs/heads, etc., but that is a
micro-optimization that might not be worth it.)
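
For concreteness, the hard-coded prefix hack could look roughly like the
sketch below. The table contents, the trailing slashes, and the function
names are hypothetical. It relies on refnames never containing ASCII control
characters, so a leading \001 or \002 byte cannot collide with a real name.

```python
# Hypothetical one-byte codes for common refname prefixes.
PREFIX_CODES = {
    b"refs/tags/": b"\x01",
    b"refs/heads/": b"\x02",
}


def compress_refname(refname: bytes) -> bytes:
    """Replace a known prefix with its one-byte code, if any."""
    for prefix, code in PREFIX_CODES.items():
        if refname.startswith(prefix):
            return code + refname[len(prefix):]
    return refname


def expand_key(key: bytes) -> bytes:
    """Inverse of compress_refname."""
    for prefix, code in PREFIX_CODES.items():
        if key.startswith(code):
            return prefix + key[len(code):]
    return key


key = compress_refname(b"refs/tags/1000000")
print(key, expand_key(key))
```

This saves 9 bytes per refs/tags/ key, or about 9MB across a million tags,
which is small next to a 500MB database.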
Also, the reflog header has some overhead (it's an entire extra record
per ref). The header exists to implement reflog creation/existence
checking. I didn't really try to understand why we have the distinction
between empty and nonexistent reflogs; I just copied it. If we didn't
have that distinction, we could eliminate that overhead.
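
The header-plus-entries layout can be sketched with a small in-memory
stand-in for an ordered key-value store. Everything here is an assumption
made for illustration: the SortedStore class, the "logs/" key prefix, the
NUL terminator, and the 8-byte big-endian sequence number are not the
backend's real key format. The point is only that a header record makes
"exists but empty" distinguishable from "does not exist", and that sorted
keys put a ref's entries right after its header.

```python
import bisect
import struct


class SortedStore:
    """Tiny stand-in for an ordered key-value store."""

    def __init__(self):
        self._keys = []      # kept sorted
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def has(self, key):
        return key in self._values

    def scan(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._values[self._keys[i]]
            i += 1


def header_key(refname):
    # NUL cannot appear in a refname, so it safely terminates the name.
    return b"logs/" + refname.encode() + b"\0"


def entry_key(refname, seq):
    # Big-endian sequence numbers sort in numeric order.
    return header_key(refname) + struct.pack(">Q", seq)


def create_reflog(store, refname):
    store.put(header_key(refname), b"")


def append_entry(store, refname, seq, line):
    store.put(entry_key(refname, seq), line)


def reflog_exists(store, refname):
    return store.has(header_key(refname))


def read_reflog(store, refname):
    hk = header_key(refname)
    return [value for key, value in store.scan(hk) if key != hk]


store = SortedStore()
create_reflog(store, "refs/heads/a")
append_entry(store, "refs/heads/a", 1, b"first")
append_entry(store, "refs/heads/a", 2, b"second")
create_reflog(store, "refs/heads/ab")
append_entry(store, "refs/heads/ab", 1, b"other")
create_reflog(store, "refs/heads/empty")
print(read_reflog(store, "refs/heads/a"))
```

Dropping the empty/nonexistent distinction would let reflog_exists become
"the scan yields at least one entry", and the header record could go away.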
> > Thanks, that's valuable. For the refs backend, opening the LMDB
> > database for writing is sufficient to block other writers. Do you think
> > it would be valuable to provide a git hold-ref-lock command that simply
> > reads refs from stdin and keeps them locked until it reads EOF from
> > stdin? That would allow cross-backend ref locking.
>
> I'm not sure what you would use it for. If you want to update the refs,
> then you can specify a whole transaction with "git update-ref --stdin",
> and that should work whatever backend you choose. Is there some other
> operation you want where you hold the lock for a longer period of time?
I'm sure I had a reason for this at the time I wrote it, but now I can't
think of what it was. Never mind!