From: "Björn Steinbrink" <B.Steinbrink@gmx.de>
To: Michael J Gruber <git@drmicha.warpmail.net>
Cc: Thomas Jarosch <thomas.jarosch@intra2net.com>, git@vger.kernel.org
Subject: Re: help needed: Splitting a git repository after subversion migration
Date: Mon, 8 Dec 2008 15:24:47 +0100
Message-ID: <20081208142447.GA20186@atjola.homenet>
In-Reply-To: <493D2174.80500@drmicha.warpmail.net>
On 2008.12.08 14:30:28 +0100, Michael J Gruber wrote:
> Thomas Jarosch venit, vidit, dixit 07.12.2008 18:41:
> > Hello together,
> >
> > I've successfully imported a large subversion repository into git.
> > The tree contains source code and binary data ("releases"),
> > the resulting .git directory is about 11GB.
> >
> > After the import I recreated the tags/branches by converting the refs
> > to the subversion tags using a small shell script from the web:
> >
> > for branch in `git branch -r`; do
> > ...
> > version=`basename $branch`
> > git tag -s -f -m "$subject" "$version" "$branch^"
> > git branch -d -r $branch
> > done
> >
> > Ok, so far everything went really smooth. I wanted to split this repository
> > into two repositories, one for the source code and one for the binary data.
> > The current tree layout is like this:
> >
> > sources/c++_xyz
> > releases/large_binary_data
> > ...
> >
> > The original tree was imported from CVS to subversion and the layout
> > of the trunk was once reorganized/moved later. Here's the command
> > I used to split out the "source" tree:
> >
> > git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r -f
> > CVSROOT Attic source/Attic develpkg/Attic
> > source/packages/Attic releases update_pkg' -- --all
> >
> > After that I ran these commands to reclaim the space:
> > - git clone --no-hardlinks filtered_tree final_output
> > - cd final_output
> > - git gc
> > - git prune
> > - git repack -a -d --depth=250 --window=250
> >
> > Unfortunately the .git directory of the "source" tree is still 7.5GB big.
> >
> > When I just imported the "trunk" from subversion without any tags
> > and then ran "git filter-branch --subdirectory-filter source" + git gc,
> > the .git directory was about 1.5GB afterwards.
> >
> > How can I find out where those other 6GB go to?
> > I already looked at the tags with gitk,
> > there's no sign of the releases/* stuff left.
>
> I strongly suspect the reorganization/move to be the cause. Most
> probably some releases were put in places where you don't expect them,
> and therefore they are not filtered out by removing the releases subdir.
> If they have distinguished file names (say you know a name from before
> the move) you can find them using "git log". Or use gitk --all, switch
> to "tree display" and look for unexpected files in the earliest revisions.
If it's about huge objects, and not just lots of small objects, you can
use this:

	# Find large objects
	git rev-list --objects --all | cut -f1 -d' ' | \
		git cat-file --batch-check | grep blob | sort -n -k 3

This outputs lines in the format:

	<object_hash> blob <object_size>

sorted by object size, so the largest objects come last. To make use of
that information, you'll likely also need to find the filename(s) that
are used for these blobs:

	# Find filenames for objects
	git rev-list --all --objects | grep <object_hash>

And then you can use the filenames to do some more filtering.
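
The two steps can probably be folded into one pass. What follows is only
a sketch: largest_blobs is a made-up helper name, and it assumes a git
recent enough that "cat-file --batch-check" accepts a format string with
the %(rest) atom, which carries the path from rev-list through to the
output. Each line then reads "blob <object_size> <object_hash> <path>".

```shell
#!/bin/sh
# Sketch only: largest_blobs is a hypothetical helper, not a git command.
# Assumes a git whose "cat-file --batch-check" accepts a format string
# with the %(rest) atom.

# Print the N largest blobs reachable from any ref (default: 10),
# smallest first, as: blob <object_size> <object_hash> <path>
largest_blobs() {
	count=${1:-10}
	# rev-list prints "<hash> <path>"; cat-file looks up the hash and
	# echoes whatever follows it on the input line via %(rest).
	git rev-list --objects --all |
		git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
		grep '^blob ' |
		sort -n -k 2 |
		tail -n "$count"
}
```

Running "largest_blobs 20" inside the repository should list the twenty
biggest blobs with one path each. A blob that appears under several
paths is listed once, so the grep above is still useful for finding the
remaining names.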
Björn