git.vger.kernel.org archive mirror
From: David Turner <dturner@twopensource.com>
To: Jeff King <peff@peff.net>
Cc: git mailing list <git@vger.kernel.org>
Subject: Re: RFC/Pull Request: Refs db backend
Date: Wed, 24 Jun 2015 13:29:59 -0400
Message-ID: <1435166999.26709.12.camel@twopensource.com>
In-Reply-To: <20150624091417.GB5436@peff.net>

On Wed, 2015-06-24 at 05:14 -0400, Jeff King wrote:
> On Tue, Jun 23, 2015 at 02:18:36PM -0400, David Turner wrote:
> 
> > > Can you describe a bit more about the reflog handling?
> > > 
> > > One of the problems we've had with large-ref repos is that the reflog
> > > storage is quite inefficient. You can pack all the refs, but you may
> > > still be stuck with a bunch of reflog files with one entry, wasting a
> > > whole inode. Doing a "git repack" when you have a million of those has
> > > horrible cold-cache performance. Basically anything that isn't
> > > one-file-per-reflog would be a welcome change. :)
> > 
> > Reflogs are stored in the database as well.  There is one header entry
> > per ref to indicate that a reflog is present, and then one database
> > entry per reflog entry; the entries are stored consecutively and
> > immediately following the header so that it's fast to iterate over them.
> 
> OK, that makes sense. I did notice that the storage for the refdb grows
> rapidly. If I add a million refs (like refs/tags/$i) with a simple
> reflog message "foo", I end up with a 500MB database file.
> 
> That's _probably_ OK, because a million is getting into crazy
> territory[1].  But it's 500 bytes per ref, each with one reflog entry.
> Our ideal lower bound is probably something like 100 bytes per reflog
> entry:
> 
>   - 20 bytes for old sha1
>   - 20 bytes for new sha1
>   - ~50 bytes for name, email, timestamp
>   - ~6 bytes for refname (1000000 is the longest unique part)
> 
> That assumes we store binary[2] (and not just the raw reflog lines), and
> reconstruct the reflog lines on the fly. It also assumes we use some
> kind of trie-like storage (where we can amortize the cost of storing
> "refs/tags/" across all of the entries).
> 
> Of course that neglects lmdb's overhead, and the storage of the ref tip
> itself. But it would hopefully give us a ballpark for an optimal
> solution. We don't have to hit that, of course, but it's food for
> thought.
> 
> [1] The homebrew/homebrew repository on GitHub has almost half a million
>     ref updates. Since this is storing not just refs but all ref
>     updates, that's actually the interesting number (and optimizing the
>     per-reflog-entry size is more interesting than the per-ref size).
> 
> [2] I'm hesitant to suggest binary formats in general, but given that
>     this is a blob embedded inside lmdb, I think it's OK. If we were to
>     pursue the log-structured idea I suggested earlier, I'm torn on
>     whether it should be binary or not.

I could try a binary format.  I was optimizing for simplicity,
debuggability, recoverability, and compatibility with the existing text
format, but I wouldn't have to.  I don't know how much this would save.
Unfortunately, given the way LMDB works, trie-like storage to save
"refs/tags/" does not seem possible.  (Of course, we could hard-code some
hacks like \001=refs/tags, \002=refs/heads, etc., but that is a
micro-optimization that might not be worth it.)
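
To get a rough feel for the numbers, here is a minimal sketch comparing a
text reflog line with one possible binary encoding, plus the one-byte
prefix substitution mentioned above.  The field layout, the function
names and the prefix codes are all assumptions made for illustration;
none of this is what the current patches do.

```python
import struct

# Hypothetical one-byte codes for common refname prefixes (the
# \001/\002 idea above); neither git nor LMDB defines these.
PREFIX_CODES = {"refs/tags/": b"\x01", "refs/heads/": b"\x02"}


def pack_refname(refname):
    """Replace a known prefix with its one-byte code."""
    for prefix, code in PREFIX_CODES.items():
        if refname.startswith(prefix):
            return code + refname[len(prefix):].encode()
    return b"\x00" + refname.encode()  # 0 = no substitution


def encode_text(old_hex, new_hex, ident, timestamp, tz, msg):
    """Text form: hex sha1s, identity, timestamp, tz, TAB, message."""
    line = f"{old_hex} {new_hex} {ident} {timestamp} {tz}\t{msg}\n"
    return line.encode()


def encode_binary(old_hex, new_hex, ident, timestamp, tz, msg):
    """Assumed binary layout: two raw 20-byte sha1s, a 64-bit
    timestamp, the tz offset as a signed 16-bit integer (+HHMM read
    as a number), a NUL-terminated identity, then the message."""
    return (bytes.fromhex(old_hex) + bytes.fromhex(new_hex)
            + struct.pack(">Qh", timestamp, int(tz))
            + ident.encode() + b"\0" + msg.encode())


def binary_to_text(blob):
    """Reconstruct the text reflog line from the binary form."""
    old, new = blob[:20], blob[20:40]
    timestamp, tz_num = struct.unpack(">Qh", blob[40:50])
    ident, _, msg = blob[50:].partition(b"\0")
    line = (f"{old.hex()} {new.hex()} {ident.decode()} "
            f"{timestamp} {tz_num:+05d}\t{msg.decode()}\n")
    return line.encode()


SAMPLE = ("00" * 20, "ab" * 20, "A U Thor <author@example.com>",
          1435166999, "-0400", "foo")
text_len = len(encode_text(*SAMPLE))
binary_len = len(encode_binary(*SAMPLE))
```

For this sample the text line is 133 bytes and the binary form is 83,
so the saving is a fixed 50 bytes per entry, almost all of it from not
hex-encoding the two sha1s.  The prefix code saves another 9 bytes of
key for a refs/tags/ ref.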

Also, the reflog header has some overhead (it's an entire extra record
per ref). The header exists to implement reflog creation/existence
checking.  I didn't really try to understand why we have the distinction
between empty and nonexistent reflogs; I just copied it.  If we didn't
have that distinction, we could eliminate that overhead.
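
To make the header's role concrete, here is a small sketch of the
layout on top of a toy ordered key-value store.  The key scheme
("logs/" + refname + NUL, followed by a big-endian sequence number) and
all of the names are hypothetical, chosen only so that a reflog's
entries sort immediately after its header; the real keys may differ.

```python
import bisect
import struct


class OrderedStore:
    """Toy stand-in for an ordered key-value store such as LMDB."""

    def __init__(self):
        self._keys = []   # kept sorted
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def has(self, key):
        return key in self._vals

    def scan(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._vals[self._keys[i]]
            i += 1


def header_key(refname):
    # The trailing NUL keeps "refs/heads/a" from matching "refs/heads/ab".
    return b"logs/" + refname.encode() + b"\0"


def entry_key(refname, seq):
    # Big-endian so that byte order matches numeric order.
    return header_key(refname) + struct.pack(">Q", seq)


def create_reflog(db, refname):
    db.put(header_key(refname), b"")


def append_entry(db, refname, seq, line):
    db.put(entry_key(refname, seq), line)


def reflog_exists(db, refname):
    # Only the header can distinguish "empty" from "nonexistent".
    return db.has(header_key(refname))


def reflog_entries(db, refname):
    hdr = header_key(refname)
    return [value for key, value in db.scan(hdr) if key != hdr]
```

Without the header record, reflog_exists() would have to be "is there
at least one entry with this prefix", and an empty reflog would look
the same as a missing one.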

> > Thanks, that's valuable.  For the refs backend, opening the LMDB
> > database for writing is sufficient to block other writers.  Do you think
> > it would be valuable to provide a git hold-ref-lock command that simply
> > reads refs from stdin and keeps them locked until it reads EOF from
> > stdin?  That would allow cross-backend ref locking.
> 
> I'm not sure what you would use it for. If you want to update the refs,
> then you can specify a whole transaction with "git update-ref --stdin",
> and that should work whatever backend you choose. Is there some other
> operation you want where you hold the lock for a longer period of time?

I'm sure I had a reason for this at the time I wrote it, but now I can't
think of what it was.  Never mind!

