git.vger.kernel.org archive mirror
From: "Björn Steinbrink" <B.Steinbrink@gmx.de>
To: Antony Stubbs <antony.stubbs@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [script] find largest pack objects
Date: Fri, 10 Jul 2009 13:43:16 +0200
Message-ID: <20090710114316.GA6880@atjola.homenet>
In-Reply-To: <A67AA762-487D-4CFB-B555-718C88C5787D@gmail.com>

On 2009.07.10 13:16:50 +1200, Antony Stubbs wrote:
> Blog post about git pruning history and finding large objects in
> your repo: http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
> 
> This is a script I put together after migrating the Spring Modules
> project from CVS, using git-cvsimport (which I also had to patch, to
> get to work on OS X / MacPorts). I wrote it because I wanted to get
> rid of all the large jar files, and documentation etc, that had been
> put into source control. However, if _large files_ are deleted in
> the latest revision, then they can be hard to track down.

Here's my script, basically for the same purpose. Instead of looking at
the packfiles, it looks at the rev-list output to find those objects
that aren't prunable (ignoring the reflog). I'm also using a rather ugly
sed invocation so that rev-list runs only twice, regardless of the
number of objects to be shown, which greatly reduces the time required
to run the script.

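#!/bin/sh
# Usage: git find-large <count>
# Prints "<size> <path>" for the <count> largest blobs reachable from any ref.
# The $(...) is deliberately unquoted: it expands to one
# "-e s/<hash>/<size>/p" per blob, each passed to sed as separate arguments.
git rev-list --all --objects |
	sed -n $(git rev-list --objects --all |
		cut -f1 -d' ' | git cat-file --batch-check | grep ' blob ' |
		sort -n -k3 | tail -n "$1" | while read -r hash type size
		do
			# printf, because "echo -n" is not portable across /bin/sh
			printf '%s ' "-e s/$hash/$size/p"
		done) |
	sort -n -k1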

It takes the number of objects to be shown as an argument, so for the
top ten run as "git find-large 10" (assuming that the script is in $PATH
and called git-find-large).
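
To see what the sed part does without a real repository, here is a
minimal sketch on canned data. The function names and the sample hashes,
sizes and paths are all made up for illustration; only the shape of the
pipeline is taken from the script.

```shell
#!/bin/sh
# Hypothetical helper: turn "hash type size" lines (the format printed by
# "git cat-file --batch-check") into one "-e s/hash/size/p" per line.
build_sed_args() {
	while read -r hash type size
	do
		printf '%s ' "-e s/$hash/$size/p"
	done
}

# Canned stand-in for "git rev-list --all --objects" output (made-up hashes).
sample_objects() {
	printf '%s\n' \
		'aaaa1111 lib/big.jar' \
		'bbbb2222 README' \
		'cccc3333 docs/manual.pdf'
}

# Canned stand-in for the two largest blobs as reported by --batch-check.
sample_largest() {
	printf '%s\n' \
		'cccc3333 blob 52000' \
		'aaaa1111 blob 910000'
}

# Word splitting of the unquoted substitution is intentional: every
# "-e" and every "s/.../p" must reach sed as a separate argument.
demo() {
	sample_objects | sed -n $(sample_largest | build_sed_args) | sort -n -k1
}
```

Calling demo prints only the two lines whose hash was in the "largest"
list, with the hash replaced by the size. The README line is dropped
because no expression matches it and sed runs with -n.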

It doesn't list as much information as yours does (the compressed size
is missing, for example), but it's good enough for me, and speed was far
more important. That matters especially because the "rev-list --all
--objects" trick gets you only a single filename per blob, so if there
were renames, you may need to run it again after having deleted one
version via filter-branch.

Something similar applies to deltified stuff. As verify-pack shows the
size of the delta, your script might miss some file B if it is currently
stored as a delta against some other large file A. Only after the blob
for A has been deleted will B be shown (as it is no longer deltified).
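
One way to see both numbers at once, as a sketch: git cat-file accepts a
format string for --batch-check, with %(objectsize) for the full size,
%(objectsize:disk) for the size as stored, and %(deltabase) for the
delta base (all zeros when the object is not stored as a delta). As far
as I know these are available in any reasonably recent Git, but check
git-cat-file(1) for your version. The helper name show_deltified is made
up, and it only filters text in the format described in its comment.

```shell
#!/bin/sh
# Assumed input format, one object per line:
#   <hash> <type> <full size> <on-disk size> <delta base hash>
# e.g. produced inside a repository by:
#   git rev-list --all --objects | cut -f1 -d' ' |
#     git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(objectsize:disk) %(deltabase)'

# Hypothetical helper: print blobs that are stored as deltas, largest
# full size last, as "<full size> <on-disk size> <hash> <delta base>".
show_deltified() {
	awk '$2 == "blob" && $5 !~ /^0+$/ { print $3, $4, $1, $5 }' |
		sort -n -k1
}
```

A blob with a large full size but a tiny on-disk size is exactly the
kind of file B that a listing based on verify-pack sizes would rank too
low.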

OTOH, this means that the output of my script is likely to have the same
filename over and over again. If that gets out of hand, I usually do
something like:
git find-large 100 | cut -d' ' -f2 | sort -u

So I get just the filenames, hoping that the top 100 include all
interesting things ;-)
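
If you would rather keep the sizes, a small awk filter can reduce the
list to one line per path. This is only a sketch: the name
largest_per_path is made up, and it assumes input in the "<size> <path>"
form that find-large prints. Unlike the cut variant, it keeps paths that
contain spaces intact.

```shell
#!/bin/sh
# Hypothetical helper: read "<size> <path>" lines and print each path once,
# with the largest size seen for it, sorted by that size.
largest_per_path() {
	awk '{
		size = $1
		path = $0
		sub(/^[0-9]+ /, "", path)    # everything after the size is the path
		# Compare numerically, but store the original string so that
		# very large sizes are printed back unchanged.
		if (!(path in max) || size + 0 > max[path] + 0)
			max[path] = size
	}
	END {
		for (p in max)
			print max[p], p
	}' | sort -n -k1
}
```

It would be used as "git find-large 100 | largest_per_path".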

Maybe this helps someone to come up with a smart combination of our
scripts.

Björn
