From: Jeff King <peff@peff.net>
To: Ted Ts'o <tytso@mit.edu>
Cc: Thomas Rast <trast@student.ethz.ch>,
Hallvard B Furuseth <h.b.furuseth@usit.uio.no>,
git@vger.kernel.org, Nicolas Pitre <nico@fluxnic.net>
Subject: Re: Keeping unreachable objects in a separate pack instead of loose?
Date: Mon, 11 Jun 2012 14:34:14 -0400
Message-ID: <20120611183414.GD20134@sigill.intra.peff.net>
In-Reply-To: <20120611172732.GB16086@thunk.org>

On Mon, Jun 11, 2012 at 01:27:32PM -0400, Ted Ts'o wrote:
> The 4.5 megabytes of loose objects packed down to a 224k "cruft" repo,
> and 768k worth of private development objects.
>
> So depending on how you would want to do the comparison, probably the
> fairest thing to say is that I had "good" packs totaling about
> 16 megs, and the loose cruft objects were an additional 4.5 megabytes.

OK, so that 4.5 megabytes is at least a respectable percentage of the
total repo size. I suspect it may be worse for small repos in that
sense, because the 4.5 megabytes is not "how big is this repo" but
probably "how much work did you do in this repo in the last 2 weeks".
That should be roughly constant with respect to the total size of the
repo.

However, for a very busy repo (e.g., one used for automated integration
testing or similar), the "how much work" number could be quite high.
We ran into this at GitHub due to our "merge this" button, which does
test-merges to see if a pull request can be merged cleanly. Whenever the
upstream branch is updated, all of the outstanding pull requests get
re-tested, and the old test-merge objects become unreferenced. We ended
up dropping gc.pruneExpire to 1 day to keep the cruft to a minimum.

> >   1. You run "git repack -Ad". It makes A.pack, with stuff you want,
> >      and B.pack, with unreachable junk. They both get a timestamp of
> >      "now".
> >
> >   2. A day passes. You run "git repack -Ad" again. It makes C.pack,
> >      the new stuff you want, and repacks all of B.pack along with the
> >      new expired cruft from A.pack, making D.pack. B.pack can go away.
> >      D.pack gets a timestamp of "now".
>
> Hmm, yes. What we'd really want to do is to make D.pack contain those
> items that are newly unreachable, not including the objects in
> B.pack, and keep B.pack around until the expiry window goes by. But
> that's a much more complicated thing, and the proof-of-concept
> algorithm I had outlined wouldn't do that.

Right. When dumping the list of unreachable objects from pack-objects,
you'd want to tell it to ignore objects from any pack that contained
only unreachable objects in the first place.
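
To make the timestamp problem in the quoted steps concrete, here is a
toy simulation in Python. It is only a sketch and not git code: the
function name simulate, the skip_pure_cruft flag, and the 14-day window
are assumptions chosen for illustration. It models a single unreachable
object and a repack that runs once per day.

```python
EXPIRE_DAYS = 14  # assumed expiry window (the "2 weeks" discussed in this thread)


def simulate(days, skip_pure_cruft):
    """Return the day on which the cruft object is pruned, or None.

    Model: one object becomes unreachable on day 0 and lands in a cruft
    pack whose mtime is 0. A repack runs once per day.
    """
    cruft_mtime = 0
    for today in range(1, days + 1):
        # A cruft pack older than the expiry window can be deleted.
        if today - cruft_mtime >= EXPIRE_DAYS:
            return today
        if not skip_pure_cruft:
            # Naive repack: old cruft is rewritten into a fresh pack,
            # so its timestamp is reset to "now" and the clock restarts.
            cruft_mtime = today
        # Otherwise the all-unreachable pack is left alone and keeps its mtime.
    return None
```

With the naive repack the object is never pruned within the simulated
period, because its pack is always "new". Skipping packs that contain
only unreachable objects lets the pack age out once the window passes.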

Except that is perhaps not quite right, either. Because if a pack has
100 objects, but you happen to re-reference 1 of them, you'd probably
want to leave it (even though that re-referenced one is now duplicated,
the savings aren't worth the trouble of repacking the cruft objects).
Of course, what is the N (the number of re-referenced objects) that
makes it worth the trouble? Now you're getting into heuristics.
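
One possible shape for such a heuristic, sketched in Python. Nothing
here exists in git: the function name should_rewrite_cruft_pack and the
max_rereferenced threshold (10% by default) are hypothetical, and the
right cutoff is exactly the open question above.

```python
def should_rewrite_cruft_pack(pack_objects, reachable, max_rereferenced=0.1):
    """Decide whether a cruft pack is worth rewriting.

    pack_objects: set of object ids stored in the cruft pack
    reachable: set of object ids that are reachable right now
    max_rereferenced: hypothetical tunable; the fraction of the pack that
        may have become reachable again before a rewrite is worthwhile
    """
    if not pack_objects:
        return False
    rereferenced = len(pack_objects & reachable)
    return rereferenced / len(pack_objects) > max_rereferenced
```

With the 100-object pack from the example, one re-referenced object is
1% of the pack, so the pack is left alone under the assumed threshold.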
You _could_ make a separate cruft pack for each pack that you repack. So
if I have A.pack and B.pack, I'd pack all of the reachable objects into
C.pack, and then make D.pack containing the unreachable objects from
A.pack, and E.pack with the unreachable objects from B.pack. And then
set the mtime of the cruft packs to that of their parent packs.

And then the next time you pack, repacking D and E would probably be a
no-op that preserves mtime, but might create a new pack that ejects some
now-reachable object.
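
The per-parent scheme can be sketched as a small Python model. This is
an illustration under assumptions, not how git-repack works: packs are
modeled as a dict of name to (mtime, set of object ids), and the
function name repack is made up for the example.

```python
def repack(packs, reachable):
    """Split packs into reachable objects and per-parent cruft packs.

    packs: dict mapping pack name -> (mtime, set of object ids)
    reachable: set of object ids that are reachable right now

    Returns (kept, cruft), where kept is the set of objects for the new
    "good" pack and cruft maps each parent pack name to
    (parent mtime, its unreachable objects).
    """
    kept = set()
    cruft = {}
    for name, (mtime, objects) in packs.items():
        kept |= objects & reachable
        leftover = objects - reachable
        if leftover:
            # The cruft pack inherits the parent's mtime instead of "now".
            cruft[name] = (mtime, leftover)
    return kept, cruft
```

Feeding the resulting cruft packs back in with an unchanged reachable
set returns them as they were, mtime included. If one of their objects
has become reachable again, it moves into the kept set and drops out of
its cruft pack.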
To implement that, I think your --list-unreachable would just have to
print a list of "<pack-mtime> <sha1>" pairs, and then you would pack
each set with an identical mtime (or even a "close enough" mtime within
some slop).
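
A minimal Python sketch of that grouping step. It assumes the proposed
(not existing) --list-unreachable output is one "<pack-mtime> <sha1>"
pair per line, with the mtime as integer seconds. The function name
group_by_mtime and the slop parameter are hypothetical.

```python
def group_by_mtime(lines, slop=0):
    """Group "<pack-mtime> <sha1>" lines into buckets of nearby mtimes.

    A new bucket starts whenever an mtime is more than `slop` seconds
    later than the first mtime in the current bucket.
    Returns a list of (bucket_mtime, [sha1, ...]) in ascending mtime order.
    """
    pairs = []
    for line in lines:
        mtime, sha1 = line.split()
        pairs.append((int(mtime), sha1))
    pairs.sort()

    buckets = []
    for mtime, sha1 in pairs:
        if buckets and mtime - buckets[-1][0] <= slop:
            # Close enough to the current bucket: pack it together.
            buckets[-1][1].append(sha1)
        else:
            buckets.append((mtime, [sha1]))
    return buckets
```

Each bucket would then become one cruft pack stamped with the bucket's
mtime. With slop=0 only identical mtimes share a pack.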

But yes, this is all getting complicated. :)

> > I think solving it for good would involve a separate list of
> > per-object expiration dates. Obviously we get that easily with loose
> > objects (since it is one object per file).
>
> Well, either that or we need to teach git-repack the difference
> between packs that are expected to contain good stuff, and packs that
> contain cruft, and to not copy "old cruft" to new packs, so the old
> pack can finally get nuked 2 weeks (or whatever the expire window
> might happen to be) later.

That is harder because those objects may become re-reachable during that
window. So I think you don't want to deal with "expected to contain..."
but rather "what does it contain now?". The latter is easy to figure out
by doing a reachability analysis (which we do as part of the repack
anyway).
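
As a rough illustration of classifying a pack by what it contains now,
here is a Python sketch. The function name classify_pack and the three
labels are assumptions for the example. It takes the reachable set as
given, standing in for the reachability analysis done during a repack.

```python
def classify_pack(objects, reachable):
    """Classify a pack by its current contents, not by how it was created.

    objects: set of object ids stored in the pack
    reachable: set of object ids that are reachable right now
    """
    live = objects & reachable
    if not live:
        return "cruft"      # nothing in it is reachable right now
    if live == objects:
        return "reachable"  # everything in it is reachable
    return "mixed"          # some of its objects are reachable (again)
```

A pack that was written as cruft but has since had an object
re-referenced comes out as "mixed", which is the case a fixed "expected
to contain cruft" label would miss.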
-Peff