* help needed: Splitting a git repository after subversion migration
From: Thomas Jarosch @ 2008-12-07 17:41 UTC (permalink / raw)
To: git
Hello everyone,
I've successfully imported a large subversion repository into git.
The tree contains source code and binary data ("releases"),
the resulting .git directory is about 11GB.
After the import I recreated the tags/branches by converting the refs
to the subversion tags using a small shell script from the web:
for branch in `git branch -r`; do
...
version=`basename $branch`
git tag -s -f -m "$subject" "$version" "$branch^"
git branch -d -r $branch
done
Ok, so far everything went really smoothly. I wanted to split this repository
into two repositories, one for the source code and one for the binary data.
The current tree layout is like this:
sources/c++_xyz
releases/large_binary_data
...
The original tree was imported from CVS to subversion and the layout
of the trunk was once reorganized/moved later. Here's the command
I used to split out the "source" tree:
git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r -f \
CVSROOT Attic source/Attic develpkg/Attic \
source/packages/Attic releases update_pkg' -- --all
After that I ran these commands to reclaim the space:
- git clone --no-hardlinks filtered_tree final_output
- cd final_output
- git gc
- git prune
- git repack -a -d --depth=250 --window=250
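One detail of the sequence above that may matter: a clone from a plain local path copies the object store as it is, so objects that are no longer referenced come along, whereas a clone through a file:// URL goes through the normal transfer mechanism and only receives reachable objects. A minimal sketch, assuming a git with "git clone -q" and directory names without unusual characters; fresh_copy is a hypothetical helper name, not something from this thread:

```shell
# Hypothetical helper: clone SRC into DST through a file:// URL so that
# only objects reachable from SRC's branches and tags are transferred.
fresh_copy() {
    # Resolve SRC to an absolute path, because file:// URLs need one.
    src_abs=$(cd "$1" && pwd) || return 1
    git clone -q "file://$src_abs" "$2"
}
```

Usage would look like "fresh_copy filtered_tree final_output", followed by the usual "git gc" in the new directory.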
Unfortunately the .git directory of the "source" tree is still 7.5GB big.
When I just imported the "trunk" from subversion without any tags
and then ran "git filter-branch --subdirectory-filter source" + git gc,
the .git directory was about 1.5GB afterwards.
How can I find out where those other 6GB go to?
I already looked at the tags with gitk,
there's no sign of the releases/* stuff left.
The "--all" switch for "git filter-branch"
doesn't seem documented in git 1.6.0.4?
I just learned about it from the example usage.
"git filter-branch" also had trouble converting the tags
and suggested I should add "--tag-name-filter cat", which I did.
Maybe that's something for the examples, too?
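For reference, a minimal sketch of how the index filter and "--tag-name-filter cat" can be combined in one run. The function name strip_paths_everywhere is hypothetical, the paths are passed unquoted into the filter (so names containing spaces would break it), and FILTER_BRANCH_SQUELCH_WARNING is only relevant to recent git versions that print a notice before filter-branch starts. As far as I know, signed tags lose their signature when rewritten this way.

```shell
# Hypothetical helper: remove the given paths from every commit on every
# ref, and let filter-branch move the tags onto the rewritten commits.
strip_paths_everywhere() {
    # "$*" is expanded here, so the filter sees a plain list of paths.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
        --index-filter "git rm -r -q --cached --ignore-unmatch -- $*" \
        --tag-name-filter cat \
        -- --all
}
```

A call such as "strip_paths_everywhere releases update_pkg" would then correspond to the command used earlier in this mail, minus the Attic directories.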
I also tried running "git filter-branch --tag-name-filter cat
--subdirectory-filter source -- --all", but that command aborts
with these messages:
WARNING: 'refs/tags/v5-0-8' was rewritten into multiple commits:
ee180f6117597b60ee237e9da92047946dfdeec5
fd7824d1926ce9e4c89b685583eb9a9c2f2537af
WARNING: Ref 'refs/tags/v5-0-8' points to the first one now.
error: Ref refs/tags/v5-0-8 is at 4ea78238cfd6ee259c4e8bde7be4a90bc86295b0
but expected 06c60261502acfb7b2bbe44c2e2ec371bea65827
fatal: Cannot lock the ref 'refs/tags/v5-0-8'.
Could not rewrite refs/tags/v5-0-8
Besides that git really rocks :-)
Thanks in advance,
Thomas
* Re: help needed: Splitting a git repository after subversion migration
From: Michael J Gruber @ 2008-12-08 13:30 UTC (permalink / raw)
To: Thomas Jarosch; +Cc: git
Thomas Jarosch venit, vidit, dixit 07.12.2008 18:41:
> Hello everyone,
>
> I've successfully imported a large subversion repository into git.
> The tree contains source code and binary data ("releases"),
> the resulting .git directory is about 11GB.
>
> After the import I recreated the tags/branches by converting the refs
> to the subversion tags using a small shell script from the web:
>
> for branch in `git branch -r`; do
> ...
> version=`basename $branch`
> git tag -s -f -m "$subject" "$version" "$branch^"
> git branch -d -r $branch
> done
>
> Ok, so far everything went really smoothly. I wanted to split this repository
> into two repositories, one for the source code and one for the binary data.
> The current tree layout is like this:
>
> sources/c++_xyz
> releases/large_binary_data
> ...
>
> The original tree was imported from CVS to subversion and the layout
> of the trunk was once reorganized/moved later. Here's the command
> I used to split out the "source" tree:
>
> git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r -f
> CVSROOT Attic source/Attic develpkg/Attic
> source/packages/Attic releases update_pkg' -- --all
>
> After that I ran these commands to reclaim the space:
> - git clone --no-hardlinks filtered_tree final_output
> - cd final_output
> - git gc
> - git prune
> - git repack -a -d --depth=250 --window=250
>
> Unfortunately the .git directory of the "source" tree is still 7.5GB big.
>
> When I just imported the "trunk" from subversion without any tags
> and then ran "git filter-branch --subdirectory-filter source" + git gc,
> the .git directory was about 1.5GB afterwards.
>
> How can I find out where those other 6GB go to?
> I already looked at the tags with gitk,
> there's no sign of the releases/* stuff left.
I strongly suspect the reorganization/move to be the cause. Most
probably some releases were put in places where you don't expect them,
and therefore they are not filtered out by removing the releases subdir.
If they have distinctive file names (say you know a name from before
the move) you can find them using "git log". Or use gitk --all, switch
to "tree display" and look for unexpected files in the earliest revisions.
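The "git log" search could look roughly like the sketch below. It assumes you know a glob that matches the file names in question (the '*.iso' in the usage note is made up), and find_paths_in_history is a hypothetical helper name.

```shell
# Hypothetical helper: list every path, in any commit reachable from any
# ref, whose name matches the given glob. Deleted files show up as well.
find_paths_in_history() {
    # --pretty=format: suppresses the commit headers, leaving only paths.
    git log --all --name-only --pretty=format: -- "$1" |
    sed '/^$/d' |
    sort -u
}
```

For example, "find_paths_in_history '*.iso'" would print each directory location such files ever lived in, which can then be added to the filter.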
Also, it may be better to do the tag creation (from tags/... branches)
after the filter-branch. If you don't rewrite the tags (have you?) then
the tags will still point to the original commits (before the rewrite)
and therefore include all the "fat blobs". You avoid this best by
creating them after the rewrite.
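A minimal sketch of creating the tags after the rewrite. It assumes the imported SVN tags still live below refs/remotes/tags (your git-svn layout may differ, so the prefix is a parameter), it creates plain lightweight tags rather than signed ones, and it tags the ref itself rather than its parent as the script quoted above does. tags_from_remote_refs is a hypothetical helper name.

```shell
# Hypothetical helper: turn every ref below PREFIX into a lightweight tag
# of the same short name, then delete the original ref.
tags_from_remote_refs() {
    prefix=${1:-refs/remotes/tags}
    git for-each-ref --format='%(refname)' "$prefix" |
    while read -r ref; do
        name=${ref#"$prefix"/}
        # Only drop the remote ref once the tag exists.
        git tag -f "$name" "$ref" &&
        git update-ref -d "$ref"
    done
}
```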
Michael
* Re: help needed: Splitting a git repository after subversion migration
From: Björn Steinbrink @ 2008-12-08 14:24 UTC (permalink / raw)
To: Michael J Gruber; +Cc: Thomas Jarosch, git
On 2008.12.08 14:30:28 +0100, Michael J Gruber wrote:
> I strongly suspect the reorganization/move to be the cause. Most
> probably some releases were put in places where you don't expect them,
> and therefore they are not filtered out by removing the releases subdir.
> If they have distinctive file names (say you know a name from before
> the move) you can find them using "git log". Or use gitk --all, switch
> to "tree display" and look for unexpected files in the earliest revisions.
If it's about huge objects, and not just lots of small objects, you can
use this:
# Find large objects
git rev-list --objects --all | cut -f1 -d' ' | \
git cat-file --batch-check | grep blob | sort -n -k 3
This outputs lines in the format:
<object_hash> blob <object_size>
sorted by object size, large objects come last. To make use of that
information, you'll likely need to also find the filename(s) that are
used for these blobs:
# Find filenames for objects
git rev-list --all --objects | grep <object_hash>
And then you can use the filenames to do some more filtering.
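The two steps can also be folded into one pass. This is only a sketch: it assumes a git new enough for "cat-file --batch-check" to accept a format string with %(rest), which the git version discussed in this thread does not have, and largest_blobs is a hypothetical helper name. Each blob is listed once, with the first path rev-list happened to see it under.

```shell
# Hypothetical helper: print "<size> <hash> <path>" for the N largest
# blobs reachable from any ref (default N=20), largest last.
largest_blobs() {
    git rev-list --objects --all |
    # %(rest) carries the path that rev-list printed after the hash.
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -n -k 1 |
    tail -n "${1:-20}"
}
```

Running "largest_blobs 50" in the filtered tree should show directly which paths still hold the bulk of the data.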
Björn
* Re: help needed: Splitting a git repository after subversion migration
From: Thomas Jarosch @ 2008-12-08 17:34 UTC (permalink / raw)
To: Björn Steinbrink; +Cc: Michael J Gruber, git
On Monday, 8. December 2008 15:24:47 you wrote:
> If it's about huge objects, and not just lots of small objects, you can
> use this:
Thanks, those two commands have been really helpful. I've found some objects
that shouldn't be there and now I have two more questions:
1. When I run "git rev-list --all --objects", I can see file names that look
like "SVN-branchname/directory/filename". Is it normal that "git svn"
creates a directory with the name of the branch and puts files below it?
"git rev-list --all --objects |grep 5-0-3-hotfix":
5fe3265b6941c2fa74c12da799ea23e2801efa8a 5-0-3-hotfix/source
...
The branch in question existed for a limited time in branches/xyz
on the SVN tree and was deleted later on. Judging by the version number
in the filename, it looks like a copy of the files from when I started the
branch, since it is an old version number from before I committed changes
to it (e.g. upgraded libpng). When I just grep for "libpng" on the whole index,
I see all the various updates I made over the years.
2. Something goes wrong after the filter-branch run:
Output from the full 11GB tree:
git rev-list --all --objects |grep 5-0-3-hotfix |grep xyz
-> No match
Output from the filtered tree:
git rev-list --all --objects |grep 5-0-3-hotfix |grep xyz
3a13f87bc116aee96e031441eaafc416652ba4bd 5-0-3-hotfix/update_pkg/xyz
ebebb84ccff26c949fb1f803c60034074e6603fe 5-0-3-hotfix/update_pkg/xyz
5529ef51de887cc905fe460e4c4f6cd34b93b5a6 5-0-3-hotfix/update_pkg/xyz
c264a9d5db30ebb131c96c4f93192bfe9a5c0a7b 5-0-3-hotfix/update_pkg/xyz
I have no idea how those objects suddenly appeared there.
It feels like something was stitched together wrongly.
When I converted the SVN tag to a git tag, I tagged the branches
with a "branch-" prefix. Might that be a problem, is "branch-" reserved?
Cheers,
Thomas
* Re: help needed: Splitting a git repository after subversion migration
From: Thomas Jarosch @ 2008-12-10 16:33 UTC (permalink / raw)
To: Björn Steinbrink; +Cc: Michael J Gruber, git
On Monday, 8. December 2008 18:34:20 Thomas Jarosch wrote:
> 1. When I run "git rev-list --all --objects", I can see file names that
> look like "SVN-branchname/directory/filename". Is it normal that "git svn"
> creates a directory with the name of the branch and puts files below it?
Ok, this seems to be a PEBKAC: at one point in the history of the subversion
repository I copied the "branches" root folder to tags/xyz, for example. One
revision later I noticed this and retagged the correct branch. git-svn imports
all branches from the first tag, which is the correct thing to do :o)
Now I'll manually check the history of the tags/ and branches/ folders
for more funny tags and write down the revisions. If I understood
the git-svn man page correctly, I should be able to specify the
revision ranges it's going to import. I'll try to skip the broken tags.
Cheers,
Thomas
* Re: help needed: Splitting a git repository after subversion migration
From: Björn Steinbrink @ 2008-12-11 8:10 UTC (permalink / raw)
To: Thomas Jarosch; +Cc: Michael J Gruber, git
On 2008.12.10 17:33:28 +0100, Thomas Jarosch wrote:
> On Monday, 8. December 2008 18:34:20 Thomas Jarosch wrote:
> > 1. When I run "git rev-list --all --objects", I can see file names that
> > look like "SVN-branchname/directory/filename". Is it normal that "git svn"
> > creates a directory with the name of the branch and puts files below it?
>
> Ok, this seems to be a PEBKAC: In the history of the subversion repository,
> e.g. I once copied the "branches" root folder to tags/xyz. One revision later
> I noticed this and retagged the correct branch. git-svn imports all branches
> from the first tag, which is the correct thing to do :o)
>
> Now I'll manually check the history of the tags/ and branches/ folder
> for more funny tags and write down the revision. If I understood
> the git-svn man page correctly, I should be able to specify
> revision ranges it's going to import. I'll try to skip the broken tags.
As long as the breakage only involves branches/tags that are completely
useless, it's probably a lot easier to just delete them afterwards.
And if you accidentally added changes to a tag after it was created, it's
also easier to manually tag the right version in git and just forget
about the additional commit.
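Both cleanups are one-liners. A minimal sketch, with hypothetical helper names; it assumes the accidental change is exactly one commit on top of the intended one, and note that the moved tag ends up as a plain lightweight tag:

```shell
# Hypothetical helper: delete a tag that only exists because of a
# mistake in the SVN history.
drop_tag() {
    git tag -d "$1"
}

# Hypothetical helper: move a tag back to the parent of the commit it
# currently points at, leaving the extra commit unreferenced.
retag_to_parent() {
    git tag -f "$1" "$1^"
}
```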
And for a bunch of other cases, rebase -i/filter-branch are probably
also better options ;-)
Skipping revisions in a git-svn import sounds rather annoying and
error-prone.
Björn
* Re: help needed: Splitting a git repository after subversion migration
From: Thomas Jarosch @ 2008-12-12 14:22 UTC (permalink / raw)
To: Björn Steinbrink; +Cc: Michael J Gruber, git
On Thursday, 11. December 2008 09:10:09 you wrote:
> > Now I'll manually check the history of the tags/ and branches/ folder
> > for more funny tags and write down the revision. If I understood
> > the git-svn man page correctly, I should be able to specify
> > revision ranges it's going to import. I'll try to skip the broken tags.
>
> As long as the breakage only involves branches/tags that are completely
> useless, it's probably a lot easier to just delete them afterwards.
>
> And if you accidentally added changes to a tag after it was created, it's
> also easier to manually tag the right version in git and just forget
> about the additional commit.
>
> And for a bunch of other cases, rebase -i/filter-branch are probably
> also better options ;-)
>
> Skipping revisions in a git-svn import sounds rather annoying and
> error-prone.
Sounds very reasonable. When I'm done filtering with filter-branch,
the original commits are still stored under "refs/original" and in the reflogs.
What's the best way to get rid of those to free up the space?
A nice way to find the corresponding commit for a file can be found here:
http://stackoverflow.com/questions/223678/git-which-commit-has-this-blob
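Newer git versions can answer this directly. A minimal sketch, assuming a git recent enough to have the --find-object option of "git log" (which the version discussed in this thread does not have); commits_touching_blob is a hypothetical helper name and this is not the script from the page linked above:

```shell
# Hypothetical helper: list the commits, on any ref, that add or remove
# the given blob (abbreviated hash and subject line).
commits_touching_blob() {
    git log --all --format='%h %s' --find-object="$1"
}
```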
Thanks for your help so far!
Thomas
PS: Yes, I have a backup copy of the repository ;-)
* Re: help needed: Splitting a git repository after subversion migration
From: Björn Steinbrink @ 2008-12-12 14:49 UTC (permalink / raw)
To: Thomas Jarosch; +Cc: Michael J Gruber, git
On 2008.12.12 15:22:15 +0100, Thomas Jarosch wrote:
> Sounds very reasonable. When I'm done filtering with filter-branch,
> the original commits are still stored under "refs/original" and in the reflogs.
> What's the best way to get rid of those to free up the space?
See the "purging unwanted history" thread:
http://n2.nabble.com/purging-unwanted-history-td1507638.html
The commands there (starting with the "git for-each-ref") should clean
out all the pre-filter-branch stuff.
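In case the link goes stale, here is a sketch of the usual sequence for this. It is written from memory of the general approach, not copied from that thread, and purge_prefilter_history is a hypothetical helper name. It is destructive, so only run it once you are sure the rewritten history is what you want.

```shell
# Hypothetical helper: drop the backup refs that filter-branch leaves
# under refs/original/, expire all reflog entries, and prune the
# objects that are then unreachable.
purge_prefilter_history() {
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
    git reflog expire --expire=now --all
    git gc --prune=now
}
```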
> A nice way to find the corresponding commit for a file can be found here:
> http://stackoverflow.com/questions/223678/git-which-commit-has-this-blob
Yeah, I think something similar (or even the same?) is in the git wiki
somewhere. I never had any use for it though ;-)
Björn