git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: Nicolas Pitre <nico@fluxnic.net>
Cc: git@vger.kernel.org, Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>,
	Jay Soffian <jaysoffian@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Shawn Pearce <spearce@spearce.org>
Subject: Re: gc --aggressive
Date: Sun, 29 Apr 2012 07:34:32 -0400
Message-ID: <20120429113431.GA24254@sigill.intra.peff.net>
In-Reply-To: <alpine.LFD.2.02.1204281258050.21030@xanadu.home>

On Sat, Apr 28, 2012 at 01:11:48PM -0400, Nicolas Pitre wrote:

> > Here's a list of commands and the pack sizes they yield on the repo:
> > 
> >   1. `git repack -ad`: 246M
> >   2. `git repack -ad -f`: 376M
> >   3. `git repack -ad --window=250`: 246M
> >   4. `git repack -ad -f --window=250`: 145M
> > 
> > The most interesting thing is (4): repacking with a larger window size
> > yields a 100M (40%) space improvement. The other commands show that it
> > is not that the current pack is simply bad; command (2) repacks from
> > scratch and actually ends up with a worse pack. So the increased window
> > size really is important.
> 
> Absolutely.  This doesn't surprise me.

I was somewhat surprised, because this repo behaves very differently
from other ones as the window size increases. Our default window of 10
is somewhat arbitrary, but I think there was a sense from early tests
that you got diminishing returns from increasing it (this is my vague
recollection; I didn't actually search for old discussions). But here
are some charts showing "repack -adf" with various window sizes on a few
repositories. The first column is the window size; the second is the
resulting pack size (and its percentage of the window=10 case); the
third is the number of seconds of CPU time (and again, the percentage of
the window=10 case).

Here's git.git:

  10 | 31.3M (100%) |   54s (100%)
  20 | 28.8M ( 92%) |   72s (133%)
  40 | 27.4M ( 87%) |  101s (187%)
  80 | 26.3M ( 84%) |  153s (282%)
 160 | 25.7M ( 82%) |  247s (455%)
 320 | 25.4M ( 81%) |  415s (763%)

You can see we get some benefit from increasing window size to 20 or
even 40, but we hit an asymptote around 80%. Meanwhile, CPU time keeps
jumping. Something like 20 or 40 seems like it might be a nice
compromise.
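
To make the diminishing returns concrete, here is a small sketch (not one of the scripts included at the end of this message) that takes the git.git numbers from the table above and prints, for each doubling of the window, how much smaller the pack got and how many extra CPU seconds that cost. The function name `marginal` is made up for illustration.

```shell
#!/bin/sh
# marginal: read "window size_in_MB cpu_seconds" lines on stdin and print,
# for each row after the first, the size saved and the extra CPU time
# relative to the previous row.
marginal() {
  awk '
    NR > 1 {
      printf "%d->%d: -%.1fM for +%ds\n", prev_w, $1, prev_size - $2, $3 - prev_time
    }
    { prev_w = $1; prev_size = $2; prev_time = $3 }
  '
}

# git.git numbers copied from the table above
marginal <<'EOF'
10 31.3 54
20 28.8 72
40 27.4 101
80 26.3 153
160 25.7 247
320 25.4 415
EOF
```

On these numbers the savings shrink at every step (2.5M, 1.4M, 1.1M, 0.6M, 0.3M) while the extra CPU cost grows by more than half each time (18s, 29s, 52s, 94s, 168s).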

Here's linux-2.6:

  10 | 564M (100%) |  990s (100%)
  20 | 521M ( 92%) | 1323s (134%)
  40 | 495M ( 88%) | 1855s (187%)
  80 | 479M ( 85%) | 2743s (277%)
 160 | 470M ( 83%) | 4284s (432%)
 320 | 463M ( 82%) | 7064s (713%)

It's quite similar, asymptotically heading towards ~80%. And the CPU
numbers look quite similar, too.

And here's the phpmyadmin repository (the one I linked to earlier):

  10 | 386M (100%) | 1592s (100%)
  20 | 280M ( 72%) | 1947s (122%)
  40 | 209M ( 54%) | 2514s (158%)
  80 | 169M ( 44%) | 3386s (213%)
 160 | 151M ( 39%) | 4822s (303%)
 320 | 142M ( 37%) | 6948s (436%)

The packfile size improvements go on for much longer as we increase the
window size. For this repo, a window size of 80-100 is probably a good
spot.
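
If one wanted to pin such a window for a single repository rather than change the default, the existing `pack.window` configuration variable is one way to do it. The following is a minimal sketch; the helper name `set_pack_window` and the example path are hypothetical.

```shell
#!/bin/sh
# set_pack_window REPO WINDOW: record WINDOW as pack.window in REPO's
# config, so that later repacks in that repository use it whenever no
# --window option is given on the command line.
set_pack_window() {
  git -C "$1" config pack.window "$2"
}

# Example (hypothetical path):
#   set_pack_window /path/to/phpmyadmin.git 80
#   git -C /path/to/phpmyadmin.git repack -adf
```

As results (3) and (4) quoted at the top show, the larger window only pays off together with `-f`; without it, existing deltas are reused and the pack size does not change.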

That leads me to a few questions:

  1. Should we bump our default window size? The numbers above show that
     typical repos would benefit from jumping to 20 or even 40.

  2. Is there a heuristic or other metric we can figure out to
     differentiate the first two repositories from the third, and use a
     larger window size on the latter?

  3. Does the phpmyadmin case give us any insight into whether we can
     improve our window sorting algorithm? Looking at the repo, ~55K of
     the ~75K commits are small changes in the po/ directory (it looks
     like they were using a web-based tool to let non-committers tweak
     the translation files). In particular, I see a lot of commits in
     which most of the changes are simply line number changes as the po
     files are refreshed from the source. I wonder if that is making the
     size-sorting heuristics perform poorly, as we end up with many
     files of the same size, and the good deltas get pushed further
     along the window.

  4. What is typical? I suspect that git.git and linux-2.6 are typical,
     and the weird po-files in the phpmyadmin repository are not. But
     I'd be happy to test more repos if people have suggestions. And the
     scripts that generated the charts are included below if anybody
     wants to try it themselves.
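
For question 3, a rough way to check whether some other repository has a similar shape is to count how many commits touch a given path. This is only a sketch: the function name `path_commit_share` is made up, and because `git rev-list` simplifies history when given a path, the count is approximate.

```shell
#!/bin/sh
# path_commit_share PATH: print "<commits touching PATH> <all commits> <percent>%"
# for the history reachable from HEAD in the current repository.
path_commit_share() {
  touching=$(git rev-list --count HEAD -- "$1") || return 1
  total=$(git rev-list --count HEAD) || return 1
  echo "$touching $total $((touching * 100 / total))%"
}

# Example, run from inside a repository:
#   path_commit_share po/
```

The percentage uses integer division, so it is rounded down.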

-Peff

-- >8 --
cat >collect <<\EOF
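#!/bin/sh
# usage: collect /path/to/repo >foo.out
#
# The path should be a bare repository (or a .git directory), since the
# packs are looked up directly under objects/pack/. Assumes that "time"
# resolves to the external GNU time(1) rather than a shell keyword (for
# -f and -o), and that du is GNU du (for -b).

windows='10 20 40 80 160 320'

for i in $windows; do
  echo >&2 "Repacking with window $i..."
  # Work on a fresh copy so every run starts from the same pack.
  rm -rf tmp && cp -a "$1" tmp && (
    cd tmp &&
    # User CPU seconds for the repack, as reported by time(1).
    time=`time -f %U -o /dev/stdout git repack -adf --window=$i`
    # Total size in bytes of all resulting packs.
    size=`du -bc objects/pack/pack-*.pack | tail -1 | awk '{print $1}'`
    echo "$i $size $time"
  )
done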
EOF

cat >chart <<\EOF
#!/usr/bin/perl
# usage: chart <foo.out

use strict;
use warnings;

# Size and time of the first row; all percentages are relative to it.
my @base;
while (<>) {
  chomp;
  my ($window, $size, $time) = split;

  @base = ($size, $time) unless @base;

  printf '%4s', $window;
  print ' | ', humanize($size);
  printf ' (%3d%%)', int($size / $base[0] * 100 + 0.5);
  printf ' | %4ds', $time;
  printf ' (%d%%)', int($time / $base[1] * 100 + 0.5);
  print "\n";
}

sub human_digits {
  my $n = shift;
  my $digits = $n >= 100 ? 0 :
               $n >=  10 ? 1 :
               2;
  return sprintf '%.*f', $digits, $n;
}

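sub humanize {
  my $n = shift;
  # Divide by 1024 until the value is small enough to print; switching
  # units at 900 rather than 1024 avoids four-digit output.
  foreach my $u ('', qw(K M G)) {
    return human_digits($n) . $u if $n < 900;
    $n /= 1024;
  }
  # After four divisions $n is in units of 2^40 bytes. (The loop
  # variable is not visible here, so name the unit explicitly.)
  return human_digits($n) . 'T';
}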
EOF
