* git-svn: .git/svn disk usage
From: Ollie Wild @ 2007-12-03 6:17 UTC
To: git

Hi,

I've been using git-svn to mirror the gcc repository at
svn://gcc.gnu.org/svn/gcc. Recently, I noticed that my .git directory
is consuming 11GB of disk space. Digging further, I discovered that
9.8GB of this is attributable to the .git/svn directory (which
includes 200 branches and 2,588 tags). Given that my .git/objects
directory is 652MB, it seems that it ought to be possible to store
this information in a more compact form.

I'm curious if other developers have run into this issue. If so, are
there any proposals / plans for improving the storage of git-svn
metadata?

Thanks,
Ollie

* Re: git-svn: .git/svn disk usage
From: Pascal Obry @ 2007-12-03 6:37 UTC
To: Ollie Wild; +Cc: git

Ollie,

> I'm curious if other developers have run into this issue. If so, are
> there any proposals / plans for improving the storage of git-svn
> metadata?

Did you run "git gc" after importing code from the subversion
repository? On my side I found that it drastically reduced the size
of the local Git repository.

Pascal.

-- 

--|------------------------------------------------------
--| Pascal Obry                           Team-Ada Member
--| 45, rue Gabriel Peri - 78114 Magny Les Hameaux FRANCE
--|------------------------------------------------------
--|              http://www.obry.net
--| "The best way to travel is by means of imagination"
--|
--| gpg --keyserver wwwkeys.pgp.net --recv-key C1082595

* Re: git-svn: .git/svn disk usage
From: David Brown @ 2007-12-03 6:46 UTC
To: Pascal Obry; +Cc: Ollie Wild, git

On Mon, Dec 03, 2007 at 07:37:51AM +0100, Pascal Obry wrote:
>Ollie,
>
>> I'm curious if other developers have run into this issue. If so, are
>> there any proposals / plans for improving the storage of git-svn
>> metadata?
>
>Did you run "git gc" after importing code from the subversion
>repository? On my side I found that it drastically reduced the size
>of the local Git repository.

I think the original poster is probably finding the space in the .git/svn
directory. 'git-svn' keeps an index file for every branch in SVN.

I suspect it does this for speed, at least on a large import, since the SVN
commits will come across numerically, affecting the branches out of order.

However, the index could fairly easily be extracted from git (since that is
what it normally does). In this case, where all of the indexes take
significant space, I wonder if this is worth it.

Ollie, if you look in these svn branch directories, is most of the space
taken up with files called 'index'?

Browsing through the few svn clones that I have, the space seems to be
roughly split between 'index' files and 'unhandled.log' files.

Dave

* Re: git-svn: .git/svn disk usage
From: Kelvie Wong @ 2007-12-03 6:53 UTC
To: Pascal Obry, Ollie Wild, git

I'm going to have to say this is due to the unhandled.log as well.

Just gzip -9 it (AFAIK it's not used for anything, but keep it just in case).

On Dec 2, 2007 10:46 PM, David Brown <git@davidb.org> wrote:
> On Mon, Dec 03, 2007 at 07:37:51AM +0100, Pascal Obry wrote:
> >Ollie,
> >
> >> I'm curious if other developers have run into this issue. If so, are
> >> there any proposals / plans for improving the storage of git-svn
> >> metadata?
> >
> >Did you run "git gc" after importing code from the subversion
> >repository? On my side I found that it drastically reduced the size
> >of the local Git repository.
>
> I think the original poster is probably finding the space in the .git/svn
> directory. 'git-svn' keeps an index file for every branch in SVN.
>
> I suspect it does this for speed, at least on a large import, since the SVN
> commits will come across numerically, affecting the branches out of order.
>
> However, the index could fairly easily be extracted from git (since that is
> what it normally does). In this case, where all of the indexes take
> significant space, I wonder if this is worth it.
>
> Ollie, if you look in these svn branch directories, is most of the space
> taken up with files called 'index'?
>
> Browsing through the few svn clones that I have, the space seems to be
> roughly split between 'index' files and 'unhandled.log' files.
>
> Dave

-- 
Kelvie Wong

* Re: git-svn: .git/svn disk usage
From: Ollie Wild @ 2007-12-03 17:35 UTC
To: Pascal Obry, Ollie Wild, git

On Dec 2, 2007 10:46 PM, David Brown <git@davidb.org> wrote:
>
> Ollie, if you look in these svn branch directories, is most of the space
> taken up with files called 'index'?

I'm seeing the following breakdown:

  4.3G  index
   77M  unhandled.log
  5.5G  .rev_db.138bc75d-0d04-0410-961f-82ee72b054a4

What exactly are the index and .rev_db files used for?

Ollie

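A per-file-name breakdown like the one above can be gathered with a short script. The following is only a sketch, written in Python rather than git-svn's Perl; the function names `usage_by_name` and `human` are made up for illustration, and it assumes it is run from the top of a git-svn clone so that .git/svn exists.

```python
import os
from collections import defaultdict


def usage_by_name(root):
    """Sum file sizes (in bytes) under `root`, grouped by file name."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not count symlinks
            totals[name] += os.path.getsize(path)
    return dict(totals)


def human(nbytes):
    """Format a byte count with binary units (B, K, M, G)."""
    size = float(nbytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024


if __name__ == "__main__":
    # Assumption: the current directory is the top of a git-svn clone.
    totals = usage_by_name(os.path.join(".git", "svn"))
    for name, nbytes in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{human(nbytes):>8}  {name}")
```

Grouping by file name rather than by directory is what makes the 'index' and '.rev_db.*' totals stand out when there are thousands of branch and tag directories.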
* Re: git-svn: .git/svn disk usage
From: Karl Hasselström @ 2007-12-04 8:29 UTC
To: Ollie Wild; +Cc: Pascal Obry, git

On 2007-12-03 09:35:22 -0800, Ollie Wild wrote:

> I'm seeing the following breakdown:
>
>   4.3G  index
>    77M  unhandled.log
>   5.5G  .rev_db.138bc75d-0d04-0410-961f-82ee72b054a4
>
> What exactly are the index and .rev_db files used for?

The indexes are just normal git index files, one for each branch and
tag. They're used to speed up importing new commits to the branch or
tag.

My guess is that the performance impact of deleting them between
git-svn runs would be very small, since recreating an index is cheap,
and we'd still get the speed benefit when importing several revisions
to a branch in the same run. And it'd be a very small code change too,
I think.

If nothing else, it's insane to keep the index for the tags. :-)

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle

* Re: git-svn: .git/svn disk usage
From: David Voit @ 2007-12-03 18:51 UTC
To: git

Ollie Wild <aaw <at> google.com> writes:

>
> Hi,
>
> I've been using git-svn to mirror the gcc repository at
> svn://gcc.gnu.org/svn/gcc. Recently, I noticed that my .git directory
> is consuming 11GB of disk space. Digging further, I discovered that
> 9.8GB of this is attributable to the .git/svn directory (which
> includes 200 branches and 2,588 tags). Given that my .git/objects
> directory is 652MB, it seems that it ought to be possible to store
> this information in a more compact form.
>
> I'm curious if other developers have run into this issue. If so, are
> there any proposals / plans for improving the storage of git-svn
> metadata?
>
> Thanks,
> Ollie
>

Hi all,

I've seen the same effect, so I tried to reduce the size of the revdb and made a
new format:
First, in the .bin files the SHA1s are stored as raw binary values rather than
as ASCII hex, which reduces a single SHA1 from 41 bytes to 20.
Second, only the non-zero commits are saved; that's what the .idx files are
used for.
An idx file has three 32-bit integers per entry.
The first integer represents the first zero svn revision, the second the last
zero revision, and the last integer is the position of the next non-zero revision
in the bin.

Example:
If revisions 0-373006 are zero revisions, 373007 is the first actually used
revision, and 373008-373623 are again zero revisions,
the idx has the following content:
0 373006 0
373007 373007 1

and the bin only saves
59037b8043268c9ca0d87ba86519ed0b5358c8a1
eef3f7e25993a46e3c4242aa502d93e909b08c57

The format currently used produces a file of 373624*41 bytes.

Used on a git-svn clone here, the results are:
old:
1,1G hadoop (1004M svn/)
new:
47M hadoop (5,9M svn/)

in detail:
.git/svn/trunk:
old:
-rw-r--r-- 1 david david  24M 2007-11-29 10:26 .rev_db.13f79535-47bb-0310-9956-ffa450edef68
-rw-r--r-- 1 david david  75K 2007-11-29 10:26 .rev_db.7fecf15c-03ad-4724-994c-e980afa7160c
new:
-rw-r--r-- 1 david david  32K 2007-12-03 18:40 revdb-13f79535-47bb-0310-9956-ffa450edef68.bin
-rw-r--r-- 1 david david  18K 2007-12-03 18:40 revdb-13f79535-47bb-0310-9956-ffa450edef68.idx
-rw-r--r-- 1 david david  32K 2007-12-03 18:44 revdb-7fecf15c-03ad-4724-994c-e980afa7160c.bin
-rw-r--r-- 1 david david 2,0K 2007-12-03 18:44 revdb-7fecf15c-03ad-4724-994c-e980afa7160c.idx

Here is some example source code to test this idea.

I'll try to integrate this into git-svn this week.

NOTE: I'm not a perl hacker, so use at your own risk.

Bye David
ps.: I'm not a member of this list, please reply directly to me.

migrate.pl:

use strict;
use warnings;

my $uuid = "7fecf15c-03ad-4724-994c-e980afa7160c";

# Old format: one line of 40 hex digits per SVN revision.
open(my $old, '<', ".rev_db.$uuid") or die "open .rev_db.$uuid: $!\n";
open(my $idx, '>', "revdb-$uuid.idx") or die "open revdb-$uuid.idx: $!\n";
open(my $bin, '>', "revdb-$uuid.bin") or die "open revdb-$uuid.bin: $!\n";
binmode($_) for ($idx, $bin);

my $first_zero = 0;  # first revision of the current run of zero revisions
my $pos = 0;         # number of SHA1s written to the bin file so far
my $rev = 0;         # current SVN revision (line number in the old file)

while (my $sha1 = <$old>) {
    chomp($sha1);
    if ($sha1 ne ("0" x 40)) {
        # Store the SHA1 as 20 raw bytes instead of 40 hex digits.
        print $bin pack("H40", $sha1);
        if ($first_zero != $rev) {
            # Record the run of zero revisions that just ended.
            print $idx pack("N N N", $first_zero, ($rev-1), $pos);
        }
        $first_zero = $rev+1;
        $pos++;
    }
    $rev++;
}
close($bin);
close($idx);
close($old);

parse.pl:

use strict;
use Fcntl;

my(@index, $buf, $i);
my($uuid) = "13f79535-47bb-0310-9956-ffa450edef68";
my($db_path) = "revdb-$uuid.bin";

# Load the whole idx file: 12 bytes (three 32-bit integers) per entry.
sysopen(IDX, "revdb-$uuid.idx", O_RDONLY) or die "open revdb-$uuid.idx: $!\n";
while (sysread(IDX, $buf, 12)) {
    my($minrev, $maxrev, $pos) = unpack("N N N", $buf);
    push @index, [$minrev, $maxrev, $pos];
}
close(IDX);

my($lastindex) = scalar(@index)-1;
my($lastindexpos) = $index[$lastindex][2];
my($lastindexrev) = $index[$lastindex][1];

my @stat = stat $db_path;
($stat[7] % 20) == 0 or die "$db_path inconsistent size: $stat[7]\n";
my ($maxrev) = ($stat[7]/20)-($lastindexpos)+($lastindexrev);
my ($minrev) = $index[0][1]+1;

# Coarse cache of about ten idx entries to narrow the binary search.
my($cachestep) = int((scalar(@index))/9);
$cachestep = 1 if $cachestep < 1;  # avoid an endless loop on small idx files
my(@cache);
for (my($i)=0; $i < scalar(@index); $i += $cachestep) {
    push @cache, [$index[$i][0], $i];
}
my($lastsearch) = 0;

# Read the 20-byte SHA1 at byte offset $pos of the bin file as hex.
sub pos2sha1 {
    my($pos) = @_;
    my($buf);
    sysopen(BINDB, $db_path, O_RDONLY) or die "open $db_path: $!\n";
    sysseek(BINDB, $pos, 0);
    sysread(BINDB, $buf, 20);
    close(BINDB);  # close the handle before returning
    return unpack("H40", $buf);
}

sub get_sha1 {
    my($rev) = @_;
    my($i) = 0;
    if (($rev <= 0) || ($rev > $maxrev) || $rev <= $index[0][1]) {
        return ("0" x 40);
    }
    # Past the last idx entry: every remaining revision is in the bin.
    if ($rev > $lastindexrev) {
        my($pos) = (((($rev-1) - $lastindexrev)+$lastindexpos))*20;
        return pos2sha1($pos);
    }
    # Fast path: try the neighbourhood of the previous lookup first.
    if (($rev >= $index[$lastsearch][0] && $rev <= $index[$lastsearch][1]) ||
        ($rev >= $index[$lastsearch+1][0] && $rev <= $index[$lastsearch+1][1])) {
        return ("0" x 40);
    } elsif ($rev > $index[$lastsearch][1] && $rev < $index[$lastsearch+1][0]) {
        my($pos) = (($rev-1) - $index[$lastsearch][1] + $index[$lastsearch][2]) * 20;
        return pos2sha1($pos);
    } elsif ($rev > $index[$lastsearch+1][1] && $rev < $index[$lastsearch+2][0]) {
        $lastsearch++;
        my($pos) = (($rev-1) - $index[$lastsearch][1] + $index[$lastsearch][2]) * 20;
        return pos2sha1($pos);
    } elsif ($lastsearch != 0 && $rev > $index[$lastsearch-1][1] && $rev < $index[$lastsearch][0]) {
        $lastsearch--;
        my($pos) = (($rev-1) - $index[$lastsearch][1] + $index[$lastsearch][2]) * 20;
        return pos2sha1($pos);
    }
    # Slow path: narrow the range with the cache, then binary search.
    my($l, $r);
    $l = 0;
    $r = scalar(@index)-1;
    my($lastcache) = scalar(@cache)-1;
    for (my($i)=0; $i <= $lastcache; $i++) {
        if ($rev >= $cache[$i][0]) {
            $l = $cache[$i][1];
        }
        if ($rev < $cache[$lastcache-$i][0]) {
            $r = $cache[$lastcache-$i][1];
        }
    }
    if ($rev <= $index[$l][1]) {
        return ("0" x 40);
    }
    while ($l <= $r) {
        $i = int(($l + $r)/2);
        if ($rev >= $index[$i][0] && $rev <= $index[$i][1]) {
            $lastsearch = $i;
            return ("0" x 40);
        } elsif ($rev <= $index[$i][0]) {
            $r = $i-1;
        } elsif ($rev >= $index[$i+1][0]) {
            $l = $i+1;
        } else {
            $lastsearch = $i;
            my($pos) = (($rev-1) - $index[$i][1] + $index[$i][2]) * 20;
            return pos2sha1($pos);
        }
    }
    return ("0" x 40);
}

# Print every revision that maps to a commit, newest first.
for (my($i)=$maxrev; $i >= $minrev; $i--) {
    my($sha1) = get_sha1($i);
    if ($sha1 ne ("0" x 40)) {
        print "$i = $sha1\n";
    }
}

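The size argument in the message above can be checked with a little arithmetic. This is a sketch in Python (not the Perl used in the thread); the helper names `old_size` and `new_size` are invented here, and the entry counts are taken from the example as given (two idx rows, two SHA1s in the bin).

```python
import struct

SHA1_HEX_LINE = 41                   # 40 hex digits plus a newline (old .rev_db)
SHA1_RAW = 20                        # raw 20-byte SHA1 (proposed .bin file)
IDX_ENTRY = struct.calcsize(">III")  # three big-endian 32-bit integers


def old_size(revisions):
    """Old format: one 41-byte line per SVN revision, used or not."""
    return revisions * SHA1_HEX_LINE


def new_size(idx_entries, commits):
    """Proposed format: 12 bytes per idx entry plus 20 bytes per stored SHA1."""
    return idx_entries * IDX_ENTRY + commits * SHA1_RAW


# Figures from the example: revisions 0..373623, two idx entries, two SHA1s.
print(old_size(373624))  # 15318584 bytes, roughly 14.6 MiB
print(new_size(2, 2))    # 64 bytes
```

Perl's `pack("N", ...)` writes an unsigned 32-bit big-endian integer, which is why the sketch uses the `>I` format code for the idx entries.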
* Re: git-svn: .git/svn disk usage
From: Eric Wong @ 2007-12-05 8:54 UTC
To: David Voit; +Cc: git

David Voit <david.voit@gmail.com> wrote:
> Ollie Wild <aaw <at> google.com> writes:
>
> >
> > Hi,
> >
> > I've been using git-svn to mirror the gcc repository at
> > svn://gcc.gnu.org/svn/gcc. Recently, I noticed that my .git directory
> > is consuming 11GB of disk space. Digging further, I discovered that
> > 9.8GB of this is attributable to the .git/svn directory (which
> > includes 200 branches and 2,588 tags). Given that my .git/objects
> > directory is 652MB, it seems that it ought to be possible to store
> > this information in a more compact form.
> >
> > I'm curious if other developers have run into this issue. If so, are
> > there any proposals / plans for improving the storage of git-svn
> > metadata?
> >
> > Thanks,
> > Ollie
> >
>
> Hi all,
>
> I've seen the same effect, so I tried to reduce the size of the revdb and made a
> new format:
> First, in the .bin files the SHA1s are stored as raw binary values rather than
> as ASCII hex, which reduces a single SHA1 from 41 bytes to 20.
> Second, only the non-zero commits are saved; that's what the .idx files are
> used for.
> An idx file has three 32-bit integers per entry.
> The first integer represents the first zero svn revision, the second the last
> zero revision, and the last integer is the position of the next non-zero revision
> in the bin.
>
> Example:
> If revisions 0-373006 are zero revisions, 373007 is the first actually used
> revision, and 373008-373623 are again zero revisions,
> the idx has the following content:
> 0 373006 0
> 373007 373007 1
>
> and the bin only saves
> 59037b8043268c9ca0d87ba86519ed0b5358c8a1
> eef3f7e25993a46e3c4242aa502d93e909b08c57

I'd very much like rev_db to be smaller, but I find the idea of the data
relying on a separate index too fragile and difficult to recover from if
corruption occurs (mainly for --no-metadata users).

The rev_db is simply a lookup for mapping SVN revision numbers to git
commit SHA1s.

I have an idea for a more compact .rev_db format:

  All records are 24 bytes:
     4 bytes for a 32-bit integer representing the SVN revision
    20 bytes for the git commit SHA1

  rev_db is an append-only format, so the 32-bit integer will be
  monotonically increasing over time, which allows:

    Lookups by revision number done via binary search:
      Which means empty revisions never need to be entered anymore.

Of course there needs to be a migration strategy for existing
repositories (mainly the ones using --no-metadata), too.
Users not using --no-metadata (nor the option for svk metadata)
can just remove their .rev_db* files and git-svn will automatically
recreate them as needed.

> The format currently used produces a file of 373624*41 bytes.
>
> Used on a git-svn clone here, the results are:
> old:
> 1,1G hadoop (1004M svn/)
> new:
> 47M hadoop (5,9M svn/)

Very nice reduction!

> Here is some example source code to test this idea.
>
> I'll try to integrate this into git-svn this week.
>
> NOTE: I'm not a perl hacker, so use at your own risk.
>
> Bye David
> ps.: I'm not a member of this list, please reply directly to me.

If you don't have time, I'll try to implement my ideas sometime this
week or weekend (assuming I have time, too).

-- 
Eric Wong

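The 24-byte record idea above can be illustrated with a small in-memory sketch. It is written in Python rather than git-svn's Perl, it is not git-svn's actual code, and it assumes big-endian integers (the message does not specify byte order). The names `append_record` and `lookup` and the revision numbers 10 and 42 are made up; only the two SHA1s come from the thread.

```python
import struct

# One record: 4-byte big-endian SVN revision followed by a 20-byte raw SHA1.
RECORD = struct.Struct(">I20s")


def append_record(db, rev, sha1_hex):
    """Append one record to the bytearray `db`; revisions must increase."""
    if len(db) >= RECORD.size:
        last_rev, _ = RECORD.unpack_from(db, len(db) - RECORD.size)
        if rev <= last_rev:
            raise ValueError(f"r{rev} is not after r{last_rev}")
    db.extend(RECORD.pack(rev, bytes.fromhex(sha1_hex)))


def lookup(db, rev):
    """Binary-search the sorted records; return hex SHA1 or None if absent."""
    lo, hi = 0, len(db) // RECORD.size - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        mid_rev, sha1 = RECORD.unpack_from(db, mid * RECORD.size)
        if mid_rev == rev:
            return sha1.hex()
        if mid_rev < rev:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


db = bytearray()
append_record(db, 10, "59037b8043268c9ca0d87ba86519ed0b5358c8a1")
append_record(db, 42, "eef3f7e25993a46e3c4242aa502d93e909b08c57")
print(len(db))         # 48: two 24-byte records
print(lookup(db, 42))  # the second SHA1
print(lookup(db, 11))  # None: empty revisions are simply not stored
```

Because each record carries its own revision number, the file needs no separate index, and a lookup touches about log2(n) records.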
* Re: git-svn: .git/svn disk usage
From: Steven Grimm @ 2007-12-05 21:30 UTC
To: Eric Wong; +Cc: David Voit, git

How about using git itself to keep some of this information? I'll just
throw this idea out there; might or might not make any actual sense.

Create a new "git-svn metadata" branch. This branch contains a fake
directory (never intended for checkout, though you could do it) that
has a "file" for each svn revision. The filename is just the svn
revision number, maybe divided into subdirectories in case you want to
check the branch out for debugging purposes or whatever. The contents
are the git commit SHA1 and whatever other metadata you want to keep
in the future.

The advantage of doing it this way? You can pass around svn metadata
using the normal git fetch/push tools, query the metadata using "git
show", etc. In terms of data integrity, it's as secure as anything
else in a git repository, much more so than a separately maintained db
file under .git.

Along similar lines, a separate branch where the filenames are commit
SHA1s and the file contents are the stuff that currently gets written
into the git-svn-id: lines would mean no more need to rewrite history
when doing dcommit, and thus easier mixing of native git workflows and
interactions with an svn repository.

It would be great if you could clone a git-svn repository and then do
"git svn dcommit" from the clone, secure in the knowledge that things
will stay consistent even if the origin gets your changes via "git svn
fetch" rather than from you.

-Steve

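One way to picture the "one file per svn revision, divided into subdirectories" layout is the sketch below. It is purely hypothetical: the fan-out of 1000 files per directory, the zero-padded names, and the functions `rev_path` and `metadata_files` are assumptions for illustration, not anything git-svn implements. It is in Python rather than Perl.

```python
def rev_path(rev, fanout=1000):
    """Hypothetical layout: group revisions into directories of `fanout` files."""
    if rev < 0:
        raise ValueError("SVN revision numbers are non-negative")
    width = len(str(fanout - 1))
    return f"{rev // fanout:06d}/{rev % fanout:0{width}d}"


def metadata_files(rev_map):
    """Return {path: file contents} for a {revision: sha1_hex} mapping."""
    return {rev_path(rev): sha1 + "\n" for rev, sha1 in sorted(rev_map.items())}


# The revision number is only an example value.
files = metadata_files({373007: "59037b8043268c9ca0d87ba86519ed0b5358c8a1"})
for path, contents in files.items():
    print(path, contents.strip())
```

If such a tree were committed under some ref, a single entry could be read back with `git show <ref>:<path>`, which is the kind of query the message has in mind.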
* Re: git-svn: .git/svn disk usage
From: Eric Wong @ 2007-12-06 6:47 UTC
To: Steven Grimm; +Cc: David Voit, git

Steven Grimm <koreth@midwinter.com> wrote:
> How about using git itself to keep some of this information? I'll just
> throw this idea out there; might or might not make any actual sense.
>
> Create a new "git-svn metadata" branch. This branch contains a fake
> directory (never intended for checkout, though you could do it) that
> has a "file" for each svn revision. The filename is just the svn
> revision number, maybe divided into subdirectories in case you want to
> check the branch out for debugging purposes or whatever. The contents
> are the git commit SHA1 and whatever other metadata you want to keep
> in the future.
>
> The advantage of doing it this way? You can pass around svn metadata
> using the normal git fetch/push tools, query the metadata using "git
> show", etc. In terms of data integrity, it's as secure as anything
> else in a git repository, much more so than a separately maintained db
> file under .git.

I've thought of doing it the way you describe in the past, too.

However, a missing ref to the tree you proposed would mean that the
metadata becomes inaccessible unless git-svn-id: lines are retained.
Right now there's a single ref for all data and metadata. Going to two
refs would mean those two refs would always need to be in sync with each
other.

The basic idea of the git-svn-id: lines is that with the default
settings, the .rev_db files are deletable and can be regenerated from
that metadata. git-svn will automatically re-create .rev_db files it
cannot find.

This is why the rev_db code in git-svn uses slow, synchronous writes iff
svk props or no-metadata is enabled; and fast, asynchronous writes when
the user sticks with the git-svn defaults.

> Along similar lines, a separate branch where the filenames are commit
> SHA1s and the file contents are the stuff that currently gets written
> into the git-svn-id: lines would mean no more need to rewrite history
> when doing dcommit, and thus easier mixing of native git workflows and
> interactions with an svn repository.

The current dcommit still has the advantage that commit times match
those in the SVN repository.

> It would be great if you could clone a git-svn repository and then do
> "git svn dcommit" from the clone, secure in the knowledge that things
> will stay consistent even if the origin gets your changes via "git svn
> fetch" rather than from you.

It's actually doable after the [svn-remote "..."] section of .git/config
is copied and the refs/remotes/* structure is cloned via git.

The [svn-remote "..."] information can be regenerated based on
git-svn-id: lines (there's no automated way to do that, currently).

-- 
Eric Wong

end of thread, last message 2007-12-06 6:47 UTC

Thread overview: 10+ messages
2007-12-03  6:17 git-svn: .git/svn disk usage Ollie Wild
2007-12-03  6:37 ` Pascal Obry
2007-12-03  6:46   ` David Brown
2007-12-03  6:53     ` Kelvie Wong
2007-12-03 17:35     ` Ollie Wild
2007-12-04  8:29       ` Karl Hasselström
2007-12-03 18:51 ` David Voit
2007-12-05  8:54   ` Eric Wong
2007-12-05 21:30     ` Steven Grimm
2007-12-06  6:47       ` Eric Wong