git.vger.kernel.org archive mirror
From: "Björn Steinbrink" <B.Steinbrink@gmx.de>
To: Michael J Gruber <git@drmicha.warpmail.net>
Cc: Thomas Jarosch <thomas.jarosch@intra2net.com>, git@vger.kernel.org
Subject: Re: help needed: Splitting a git repository after subversion migration
Date: Mon, 8 Dec 2008 15:24:47 +0100
Message-ID: <20081208142447.GA20186@atjola.homenet>
In-Reply-To: <493D2174.80500@drmicha.warpmail.net>

On 2008.12.08 14:30:28 +0100, Michael J Gruber wrote:
> Thomas Jarosch venit, vidit, dixit 07.12.2008 18:41:
> > Hello together,
> > 
> > I've successfully imported a large subversion repository into git.
> > The tree contains source code and binary data ("releases"),
> > the resulting .git directory is about 11GB.
> > 
> > After the import I recreated the tags/branches by converting the refs
> > to the subversion tags using a small shell script from the web:
> > 
> > for branch in `git branch -r`; do
> >      ...
> >      version=`basename $branch`
> >      git tag -s -f -m "$subject" "$version" "$branch^"
> >      git branch -d -r $branch
> > done
> > 
> > Ok, so far everything went really smoothly. I wanted to split this repository
> > into two repositories, one for the source code and one for the binary data.
> > The current tree layout is like this:
> > 
> > sources/c++_xyz
> > releases/large_binary_data
> > ...
> > 
> > The original tree was imported from CVS to subversion, and the layout
> > of the trunk was reorganized/moved at a later point. Here's the command
> > I used to split out the "source" tree:
> > 
> > git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r -f
> > CVSROOT Attic source/Attic develpkg/Attic
> > source/packages/Attic releases update_pkg' -- --all
> > 
> > After that I ran these commands to reclaim the space:
> > - git clone --no-hardlinks filtered_tree final_output
> > - cd final_output
> > - git gc
> > - git prune
> > - git repack -a -d --depth=250 --window=250
> > 
> > Unfortunately, the .git directory of the "source" tree is still 7.5GB.
> > 
> > When I just imported the "trunk" from subversion without any tags
> > and then ran "git filter-branch --subdirectory-filter source" + git gc,
> > the .git directory was about 1.5GB afterwards.
> > 
> > How can I find out where those other 6GB are going?
> > I already looked at the tags with gitk,
> > there's no sign of the releases/* stuff left.
> 
> I strongly suspect the reorganization/move to be the cause. Most
> probably some releases were put in places where you don't expect them,
> and therefore they are not filtered out by removing the releases subdir.
> If they have distinguished file names (say you know a name from before
> the move) you can find them using "git log". Or use gitk --all, switch
> to "tree display" and look for unexpected files in the earliest revisions.

If it's about huge objects, and not just lots of small objects, you can
use this:

# Find large objects
git rev-list --objects --all | cut -f1 -d' ' | \
	git cat-file --batch-check | grep blob | sort -n -k 3

This outputs lines in the format:
<object_hash> blob <object_size>

sorted by object size, with the largest objects last. To make use of that
information, you'll likely also need to find the filename(s) under which
these blobs were committed:

# Find filenames for objects
git rev-list --all --objects | grep <object_hash>
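
The two steps can also be folded into a single pass. This is only a rough
sketch: it assumes a Git recent enough that "git cat-file --batch-check"
accepts a format string with %(rest), and paths that contain no newlines.
The function name largest_blobs is made up for illustration.

```shell
# Print the N largest blobs reachable from any ref as:
#   <object_size> <object_hash> <path>
# Assumptions: cat-file supports --batch-check=<format> with %(rest),
# and no path contains a newline.
largest_blobs() {
    n=${1:-20}
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -n -k 1 |
        tail -n "$n"
}
```

Run it from inside the repository, e.g. "largest_blobs 10". A blob that was
committed under several names shows up once, with the first path that
rev-list reported for it, so the grep above is still useful for finding
the other names.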

And then you can use the filenames to do some more filtering.
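
As a sketch of what that extra filtering could look like, modelled on the
filter-branch call quoted above: drop_path is a hypothetical helper, and
the path passed to it stands for whatever filename the lookup turned up.
It assumes the path contains no single quotes.

```shell
# Remove one path from the history of every ref.
# $1: path to drop (hypothetical; take it from the rev-list/grep lookup).
# Assumption: the path contains no single quotes.
drop_path() {
    git filter-branch --index-filter \
        "git rm --cached --ignore-unmatch -r -q -- '$1'" -- --all
}
```

Note that filter-branch keeps the pre-rewrite refs under refs/original/,
so the old blobs remain reachable (and "git rev-list --all" still lists
them) until those refs are gone, for example in a fresh clone like the
one Thomas made.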

Björn


