From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Nicolas Pitre <nico@fluxnic.net>,
	git@vger.kernel.org, Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>,
	Jay Soffian <jaysoffian@gmail.com>,
	Shawn Pearce <spearce@spearce.org>
Subject: Re: gc --aggressive
Date: Tue, 1 May 2012 16:01:23 -0400
Message-ID: <20120501200123.GB26245@sigill.intra.peff.net>
In-Reply-To: <7vr4v391s1.fsf@alter.siamese.dyndns.org>

On Tue, May 01, 2012 at 11:47:26AM -0700, Junio C Hamano wrote:

> > While keeping the size comparison commented out, you could try to 
> > replace this line with:
> >
> > 	return b < a ? -1 : (b > a);
> >
> > If this doesn't improve things then it would be clear that this avenue 
> > should be abandoned.
> 
> Very interesting.  The difference between the two should only matter if
> there are many blobs with exactly the same size, and most of them delta
> horribly with each other.  Does the problematic repository exhibit such
> a characteristic?

No. Here are the most common (type, size) pairs among its objects:

  $ git rev-list --objects --all |
    cut -d' ' -f1 |
    git cat-file --batch-check |
    cut -d' ' -f2,3 |
    sort | uniq -c | sort -rn | head

  19722 tree 2222
  14068 tree 4393
  11418 tree 2156
   9994 tree 4676
   9479 tree 2189
   7944 tree 2255
   6454 commit 251
   6437 tree 4611
   5328 tree 4439
   4586 commit 254

So it's mostly trees and commits (the first repeated blob size is on
line 332 of the output). The commits aren't all that big even without
deltafication, but the trees are. They should be sorted by name_hash,
but within a single name, there are going to be a lot of repetitions (I
think each of those size clusters is just a repetition of the same "po"
directory getting lots of tiny modifications).

So we are triggering that part of the sort quite a bit. But by your
reasoning here:

> The original tie-breaks based on the address (the earlier object we read
> in the original input comes earlier in the output) and yours make the
> objects later we read (which in turn are from older parts of the history)
> come early, but adjacency between two objects of the same type and the
> same size would not change (if A and B were next to each other in this
> order, your updated sorter will give B and then A still next to each
> other), so I suspect not much would change in the candidate selection.

I don't think it makes a big difference (and indeed, switching the
tie-break and repacking the phpmyadmin repository yields a pack of the
same size, although a lot more CPU time is spent).

-Peff

