git.vger.kernel.org archive mirror
* git-svn: .git/svn disk usage
@ 2007-12-03  6:17 Ollie Wild
  2007-12-03  6:37 ` Pascal Obry
  2007-12-03 18:51 ` David Voit
  0 siblings, 2 replies; 10+ messages in thread
From: Ollie Wild @ 2007-12-03  6:17 UTC (permalink / raw)
  To: git

Hi,

I've been using git-svn to mirror the gcc repository at
svn://gcc.gnu.org/svn/gcc.  Recently, I noticed that my .git directory
is consuming 11GB of disk space.  Digging further, I discovered that
9.8GB of this is attributable to the .git/svn directory (which
includes 200 branches and 2,588 tags).  Given that my .git/objects
directory is 652MB, it seems that it ought to be possible to store
this information in a more compact form.

I'm curious if other developers have run into this issue.  If so, are
there any proposals / plans for improving the storage of git-svn
metadata?

Thanks,
Ollie


* Re: git-svn: .git/svn disk usage
  2007-12-03  6:17 git-svn: .git/svn disk usage Ollie Wild
@ 2007-12-03  6:37 ` Pascal Obry
  2007-12-03  6:46   ` David Brown
  2007-12-03 18:51 ` David Voit
  1 sibling, 1 reply; 10+ messages in thread
From: Pascal Obry @ 2007-12-03  6:37 UTC (permalink / raw)
  To: Ollie Wild; +Cc: git

Ollie,

> I'm curious if other developers have run into this issue.  If so, are
> there any proposals / plans for improving the storage of git-svn
> metadata?

Did you run "git gc" after importing code from the Subversion
repository? On my side, I found that it drastically reduced the size
of the local Git repository.

Pascal.

-- 

--|------------------------------------------------------
--| Pascal Obry                           Team-Ada Member
--| 45, rue Gabriel Peri - 78114 Magny Les Hameaux FRANCE
--|------------------------------------------------------
--|              http://www.obry.net
--| "The best way to travel is by means of imagination"
--|
--| gpg --keyserver wwwkeys.pgp.net --recv-key C1082595


* Re: git-svn: .git/svn disk usage
  2007-12-03  6:37 ` Pascal Obry
@ 2007-12-03  6:46   ` David Brown
  2007-12-03  6:53     ` Kelvie Wong
  2007-12-03 17:35     ` Ollie Wild
  0 siblings, 2 replies; 10+ messages in thread
From: David Brown @ 2007-12-03  6:46 UTC (permalink / raw)
  To: Pascal Obry; +Cc: Ollie Wild, git

On Mon, Dec 03, 2007 at 07:37:51AM +0100, Pascal Obry wrote:
>Ollie,
>
>> I'm curious if other developers have run into this issue.  If so, are
>> there any proposals / plans for improving the storage of git-svn
>> metadata?
>
>Did you run "git gc" after importing code from the Subversion
>repository? On my side, I found that it drastically reduced the size
>of the local Git repository.

I think the original poster is probably finding the space in the .git/svn
directory.  'git-svn' keeps an index file for every branch in SVN.

I suspect it does this for speed, at least on a large import, since the SVN
commits will come across numerically, affecting the branches out of order.

However, the index could fairly easily be extracted from git (since that is
what it normally does).  In this case, where the indexes together take
significant space, I wonder whether keeping them is worth it.

Ollie, if you look in these svn branch directories, is most of the space
taken up with files called 'index'?

Browsing through the few svn clones that I have, the space seems to be
roughly split between 'index' files and 'unhandled.log' files.

Dave


* Re: git-svn: .git/svn disk usage
  2007-12-03  6:46   ` David Brown
@ 2007-12-03  6:53     ` Kelvie Wong
  2007-12-03 17:35     ` Ollie Wild
  1 sibling, 0 replies; 10+ messages in thread
From: Kelvie Wong @ 2007-12-03  6:53 UTC (permalink / raw)
  To: Pascal Obry, Ollie Wild, git

I'm going to have to say this is due to the unhandled.log as well.

Just gzip -9 it (AFAIK it's not used for anything, but keep it just in case).




-- 
Kelvie Wong


* Re: git-svn: .git/svn disk usage
  2007-12-03  6:46   ` David Brown
  2007-12-03  6:53     ` Kelvie Wong
@ 2007-12-03 17:35     ` Ollie Wild
  2007-12-04  8:29       ` Karl Hasselström
  1 sibling, 1 reply; 10+ messages in thread
From: Ollie Wild @ 2007-12-03 17:35 UTC (permalink / raw)
  To: Pascal Obry, Ollie Wild, git

On Dec 2, 2007 10:46 PM, David Brown <git@davidb.org> wrote:
>
> Ollie, if you look in these svn branch directories, is most of the space
> taken up with files called 'index'?

I'm seeing the following breakdown:

4.3G index
77M  unhandled.log
5.5G .rev_db.138bc75d-0d04-0410-961f-82ee72b054a4

What exactly are the index and .rev_db files used for?

Ollie
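
A breakdown like the one above can be produced with a short script. This is a minimal sketch, not part of git-svn: the sub name usage_by_kind and the grouping into 'index', 'unhandled.log', '.rev_db' and 'other' are assumptions made for illustration, based only on the file names mentioned in this thread.

```perl
use strict;
use warnings;
use File::Find;

# Sum file sizes under $root, grouped by a coarse label taken from the
# file name: 'index', 'unhandled.log', '.rev_db' (any UUID suffix) or 'other'.
sub usage_by_kind {
    my ($root) = @_;
    my %bytes;
    find({
        no_chdir => 1,
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path;
            my ($name) = $path =~ m{([^/]+)\z};
            my $kind = $name =~ /\A\.rev_db\./                        ? '.rev_db'
                     : ($name eq 'index' || $name eq 'unhandled.log') ? $name
                     :                                                  'other';
            $bytes{$kind} += (-s $path) || 0;
        },
    }, $root);
    return \%bytes;
}

# Print the totals, largest first, when a directory is given on the command line.
if (@ARGV) {
    my $usage = usage_by_kind($ARGV[0]);
    printf "%12d  %s\n", $usage->{$_}, $_
        for sort { $usage->{$b} <=> $usage->{$a} } keys %$usage;
}
```

Saved under any name and run with .git/svn as its only argument, it prints one byte total per kind.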


* Re: git-svn: .git/svn disk usage
  2007-12-03  6:17 git-svn: .git/svn disk usage Ollie Wild
  2007-12-03  6:37 ` Pascal Obry
@ 2007-12-03 18:51 ` David Voit
  2007-12-05  8:54   ` Eric Wong
  1 sibling, 1 reply; 10+ messages in thread
From: David Voit @ 2007-12-03 18:51 UTC (permalink / raw)
  To: git

Ollie Wild <aaw <at> google.com> writes:

> 
> Hi,
> 
> I've been using git-svn to mirror the gcc repository at
> svn://gcc.gnu.org/svn/gcc.  Recently, I noticed that my .git directory
> is consuming 11GB of disk space.  Digging further, I discovered that
> 9.8GB of this is attributable to the .git/svn directory (which
> includes 200 branches and 2,588 tags).  Given that my .git/objects
> directory is 652MB, it seems that it ought to be possible to store
> this information in a more compact form.
> 
> I'm curious if other developers have run into this issue.  If so, are
> there any proposals / plans for improving the storage of git-svn
> metadata?
> 
> Thanks,
> Ollie
> 

Hi all,

I've seen the same effect, so I tried to reduce the size of the rev_db and made a
new format:
First, in the bin files the SHA1s are stored as raw binary values rather than as
ASCII hex; this reduces a single SHA1 from 41 bytes (40 hex digits plus a
newline) to 20.
Second, only the non-zero commits are saved; that's what the idx files are for.
An idx file has three 32-bit integers per entry.
The first integer is the first zero SVN revision of a run, the second is the
last zero revision of that run, and the third is the position in the bin of the
next non-zero revision.

Example:
Revisions 0-373006 are zero revisions, 373007 is the first revision actually
used, and 373008-373623 are zero revisions again.
The idx has the following content:
the idx has the following content:
0 373006 0
373007 373007 1

and the bin only saves
59037b8043268c9ca0d87ba86519ed0b5358c8a1
eef3f7e25993a46e3c4242aa502d93e909b08c57

The format currently used produces a file of 373624 * 41 bytes.
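
The arithmetic behind that comparison can be sketched as follows. The sub names old_size and new_size are made up for illustration, and the "new" figure simply takes the two idx entries and two SHA1s shown in the example at face value.

```perl
use strict;
use warnings;

# Old format: one 41-byte line (40 hex digits plus a newline) per SVN
# revision, whether or not the revision touched this branch.
sub old_size {
    my ($revisions) = @_;
    return $revisions * 41;
}

# Proposed format: 20 raw bytes per stored SHA1 in the bin, plus
# 12 bytes (three 32-bit integers) per entry in the idx.
sub new_size {
    my ($sha1s, $idx_entries) = @_;
    return $sha1s * 20 + $idx_entries * 12;
}

printf "old: %d bytes\n", old_size(373624);   # 15318584, roughly 14.6 MiB
printf "new: %d bytes\n", new_size(2, 2);     # 64
```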

Used on a git-svn clone here, I get the following results:
old:
1,1G    hadoop (1004M   svn/)
new:
47M     hadoop (5,9M    svn/)

in detail:

.git/svn/trunk:
old:
-rw-r--r-- 1 david david  24M 2007-11-29 10:26 .rev_db.13f79535-47bb-0310-9956-ffa450edef68
-rw-r--r-- 1 david david  75K 2007-11-29 10:26 .rev_db.7fecf15c-03ad-4724-994c-e980afa7160c
new:
-rw-r--r-- 1 david david  32K 2007-12-03 18:40 revdb-13f79535-47bb-0310-9956-ffa450edef68.bin
-rw-r--r-- 1 david david  18K 2007-12-03 18:40 revdb-13f79535-47bb-0310-9956-ffa450edef68.idx
-rw-r--r-- 1 david david  32K 2007-12-03 18:44 revdb-7fecf15c-03ad-4724-994c-e980afa7160c.bin
-rw-r--r-- 1 david david 2,0K 2007-12-03 18:44 revdb-7fecf15c-03ad-4724-994c-e980afa7160c.idx

Example source code to test this idea is at the end of this mail.

I will try to integrate this into git-svn this week.

NOTE: I'm not a Perl hacker, so use at your own risk.

Bye David
P.S.: I'm not a member of this list, so please reply directly to me.

migrate.pl:
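use strict;
use warnings;

my $uuid = "7fecf15c-03ad-4724-994c-e980afa7160c";

# Old format: one 41-byte line (40 hex digits plus newline) per SVN revision.
open(my $old, '<', ".rev_db.$uuid")   or die "cannot read .rev_db.$uuid: $!\n";
open(my $idx, '>', "revdb-$uuid.idx") or die "cannot write revdb-$uuid.idx: $!\n";
open(my $bin, '>', "revdb-$uuid.bin") or die "cannot write revdb-$uuid.bin: $!\n";
binmode($idx);
binmode($bin);

my $first_zero = 0;   # first revision of the current run of zero entries
my $pos = 0;          # number of SHA1s written to the bin so far
my $rev = 0;          # SVN revision of the line being read

while (my $sha1 = <$old>) {

chomp($sha1);

if ($sha1 ne ("0" x 40))
{
        # Store the SHA1 as 20 raw bytes.
        print $bin pack("H40", $sha1);

        # Record the run of zero revisions that ended just before $rev.
        if ($first_zero != $rev)
        {
                print $idx pack("N N N", $first_zero, ($rev-1), $pos);
        }

        $first_zero=$rev+1;
        $pos++;
}

$rev++;

}

close($bin) or die "cannot close revdb-$uuid.bin: $!\n";
close($idx) or die "cannot close revdb-$uuid.idx: $!\n";
close($old);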

parse.pl:
use strict;
use Fcntl;

my(@index, $buf, $i);
my($uuid) = "13f79535-47bb-0310-9956-ffa450edef68";
my($db_path) = "revdb-$uuid.bin";

sysopen(IDX, "revdb-$uuid.idx", O_RDONLY) or die "cannot open revdb-$uuid.idx: $!\n";

while (sysread(IDX, $buf, 12)) {
   my($minrev, $maxrev, $pos) = unpack("N N N", $buf);

   push @index, [$minrev, $maxrev, $pos];
}

close(IDX);

my($lastindex) = scalar(@index)-1;
my($lastindexpos) = $index[$lastindex][2];
my($lastindexrev) = $index[$lastindex][1];

my @stat = stat $db_path;
($stat[7] % 20) == 0 or die "$db_path inconsistent size: $stat[7]\n";
my ($maxrev) = ($stat[7]/20)-($lastindexpos)+($lastindexrev);
my ($minrev) = $index[0][1]+1;

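my($cachestep) = int((scalar(@index))/9);
# With fewer than 9 idx entries the step would be 0 and the loop below
# would never terminate.
$cachestep = 1 if $cachestep < 1;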
my(@cache);
for (my($i)=0; $i < scalar(@index); $i += $cachestep) {
   push @cache, [$index[$i][0], $i];
}

my($lastsearch) = 0;

sub pos2sha1 {
   my($pos) = @_;
   sysopen(BINDB, $db_path, O_RDONLY) or die "cannot open $db_path: $!\n";

   sysseek(BINDB, $pos, 0);

   my($buf);
   sysread(BINDB, $buf, 20);

   # Close the handle before returning so it is not leaked.
   close(BINDB);

   return unpack ("H40", $buf);
}

sub get_sha1 {
   my($rev) = @_;
   my($i) = 0;

   if (($rev <= 0) || ($rev > $maxrev) || $rev <= $index[0][1]) {
      return ("0" x 40);
   }

   if ($rev > $lastindexrev) {
      my($pos) = (((($rev-1) - $lastindexrev)+$lastindexpos))*20;

      return pos2sha1($pos);
   }

   if(($rev >= $index[$lastsearch][0] && $rev <= $index[$lastsearch][1]) ||
($rev >= $index[$lastsearch+1][0] && $rev <= $index[$lastsearch+1][1])) {
      return ("0" x 40);
   }
   elsif ($rev > $index[$lastsearch][1] && $rev < $index[$lastsearch+1][0]) {
      my($pos) = (($rev-1) - $index[$lastsearch][1] + $index[$lastsearch][2]) * 20;

      return pos2sha1($pos);
   }
   elsif($rev > $index[$lastsearch+1][1] && $rev < $index[$lastsearch+2][0]) {
      $lastsearch++;

      my($pos) = (($rev-1) - $index[$lastsearch][1] + $index[$lastsearch][2]) * 20;

      return pos2sha1($pos);
   }
   elsif($lastsearch != 0 && $rev > $index[$lastsearch-1][1] && $rev <
$index[$lastsearch][0]) {
      $lastsearch--;

      my($pos) = (($rev-1) - $index[$lastsearch][1] + $index[$lastsearch][2]) * 20;

      return pos2sha1($pos);
   }

   my($l, $r);
   $l = 0;
   $r = scalar(@index)-1;

   my($lastcache) = scalar(@cache)-1;

   for (my($i)=0; $i <= $lastcache; $i++) {
      if ($rev >= $cache[$i][0]) {
         $l = $cache[$i][1];
      }

      if ($rev < $cache[$lastcache-$i][0]) {
         $r = $cache[$lastcache-$i][1];
      }
   }

   if ($rev <= $index[$l][1]) {
      return ("0" x 40);
   }

   while ($l <= $r) {
      $i = int(($l + $r)/2);

      if ($rev >= $index[$i][0]  && $rev <= $index[$i][1]) {
         $lastsearch = $i;

         return ("0" x 40);
      }
      elsif ($rev <= $index[$i][0]) {
         $r = $i-1;
      }
      elsif ($rev >= $index[$i+1][0]) {
         $l = $i+1;
      } 
      else {
         $lastsearch = $i;
         my($pos) = (($rev-1) - $index[$i][1] + $index[$i][2]) * 20;

         return pos2sha1($pos);
      }
   }

   return ("0" x 40);
}

for (my($i)=$maxrev; $i >= $minrev; $i--) {
   my($sha1) = get_sha1($i);
   if ($sha1 ne ("0" x 40)) {
      print "$i = $sha1\n";
   }
}


* Re: git-svn: .git/svn disk usage
  2007-12-03 17:35     ` Ollie Wild
@ 2007-12-04  8:29       ` Karl Hasselström
  0 siblings, 0 replies; 10+ messages in thread
From: Karl Hasselström @ 2007-12-04  8:29 UTC (permalink / raw)
  To: Ollie Wild; +Cc: Pascal Obry, git

On 2007-12-03 09:35:22 -0800, Ollie Wild wrote:

> I'm seeing the following breakdown:
>
> 4.3G index
> 77M  unhandled.log
> 5.5G .rev_db.138bc75d-0d04-0410-961f-82ee72b054a4
>
> What exactly are the index and .rev_db files used for?

The indexes are just normal git index files, one for each branch and
tag. They're used to speed up importing new commits to the branch or
tag.

My guess is that the performance impact of deleting them between
git-svn runs would be very small, since recreating an index is cheap,
and we'd still get the speed benefit when importing several revisions
to a branch in the same run. And it'd be a very small code change too,
I think.

If nothing else, it's insane to keep the index for the tags. :-)

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle


* Re: git-svn: .git/svn disk usage
  2007-12-03 18:51 ` David Voit
@ 2007-12-05  8:54   ` Eric Wong
  2007-12-05 21:30     ` Steven Grimm
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Wong @ 2007-12-05  8:54 UTC (permalink / raw)
  To: David Voit; +Cc: git

David Voit <david.voit@gmail.com> wrote:
> Ollie Wild <aaw <at> google.com> writes:
> 
> > 
> > Hi,
> > 
> > I've been using git-svn to mirror the gcc repository at
> > svn://gcc.gnu.org/svn/gcc.  Recently, I noticed that my .git directory
> > is consuming 11GB of disk space.  Digging further, I discovered that
> > 9.8GB of this is attributable to the .git/svn directory (which
> > includes 200 branches and 2,588 tags).  Given that my .git/objects
> > directory is 652MB, it seems that it ought to be possible to store
> > this information in a more compact form.
> > 
> > I'm curious if other developers have run into this issue.  If so, are
> > there any proposals / plans for improving the storage of git-svn
> > metadata?
> > 
> > Thanks,
> > Ollie
> > 
> 
> Hi all,
> 
> I've seen the same effect, so I tried to reduce the size of the rev_db and made a
> new format:
> First, in the bin files the SHA1s are stored as raw binary values rather than as
> ASCII hex; this reduces a single SHA1 from 41 bytes (40 hex digits plus a
> newline) to 20.
> Second, only the non-zero commits are saved; that's what the idx files are for.
> An idx file has three 32-bit integers per entry.
> The first integer is the first zero SVN revision of a run, the second is the
> last zero revision of that run, and the third is the position in the bin of the
> next non-zero revision.
> 
> Example:
> Revisions 0-373006 are zero revisions, 373007 is the first revision actually
> used, and 373008-373623 are zero revisions again.
> The idx has the following content:
> the idx has the following content:
> 0 373006 0
> 373007 373007 1
> 
> and the bin only saves
> 59037b8043268c9ca0d87ba86519ed0b5358c8a1
> eef3f7e25993a46e3c4242aa502d93e909b08c57

I'd very much like rev_db to be smaller, but I find the idea of the data
relying on a separate index too fragile and difficult to recover
from if corruption occurs (mainly for --no-metadata users).

The rev_db is simply a lookup for mapping SVN revision numbers to
git commit SHA1s.

I have an idea for a more compact .rev_db format:

  All records are 24 bytes:
    4 bytes for a 32-bit integer representing the SVN revision
    20 bytes for the git commit SHA1

  rev_db is an append-only format, so the 32-bit integer will be
  monotonically increasing over time, which allows lookups by revision
  number to be done via binary search.

  This means empty revisions never need to be entered anymore.
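
The record layout and lookup described above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not git-svn code: the names pack_record and lookup are made up, the byte order of the 32-bit integer is assumed to be big-endian ("N"), and the database is held in a string rather than read from a file.

```perl
use strict;
use warnings;

use constant RECORD_SIZE => 24;   # 4-byte revision + 20-byte SHA1

# Pack one record: 32-bit revision (big-endian, an assumption) followed by
# the SHA1 as 20 raw bytes.
sub pack_record {
    my ($rev, $sha1_hex) = @_;
    return pack('N H40', $rev, $sha1_hex);
}

# Binary search over a string of concatenated records, which must be in
# increasing revision order (true for an append-only file).
# Returns the 40-digit hex SHA1, or undef if the revision has no record.
sub lookup {
    my ($db, $rev) = @_;
    my ($lo, $hi) = (0, length($db) / RECORD_SIZE - 1);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($r, $sha1) = unpack('N H40', substr($db, $mid * RECORD_SIZE, RECORD_SIZE));
        return $sha1 if $r == $rev;
        if ($r < $rev) { $lo = $mid + 1 } else { $hi = $mid - 1 }
    }
    return;   # an "empty" revision: nothing was ever written for it
}
```

Because only revisions that produced a commit are appended, the file size is 24 bytes times the number of commits on the branch, independent of the highest SVN revision number.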

Of course there needs to be a migration strategy for existing
repositories (mainly the ones using --no-metadata), too.

Users not using --no-metadata (nor the option for svk metadata) can just
remove their .rev_db* files and git-svn will automatically recreate them
as needed.

> The format currently used produces a file of 373624 * 41 bytes.
> 
> Used on a git-svn clone here, I get the following results:
> old:
> 1,1G    hadoop (1004M   svn/)
> new:
> 47M     hadoop (5,9M    svn/)

Very nice reduction!

> Example source code to test this idea is at the end of this mail.
> 
> I will try to integrate this into git-svn this week.
> 
> NOTE: I'm not a Perl hacker, so use at your own risk.
> 
> Bye David
> P.S.: I'm not a member of this list, so please reply directly to me.

If you don't have time, I'll try to implement my ideas sometime this
week or weekend (assuming I have time, too).

-- 
Eric Wong


* Re: git-svn: .git/svn disk usage
  2007-12-05  8:54   ` Eric Wong
@ 2007-12-05 21:30     ` Steven Grimm
  2007-12-06  6:47       ` Eric Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Steven Grimm @ 2007-12-05 21:30 UTC (permalink / raw)
  To: Eric Wong; +Cc: David Voit, git

How about using git itself to keep some of this information? I'll just  
throw this idea out there; might or might not make any actual sense.

Create a new "git-svn metadata" branch. This branch contains a fake  
directory (never intended for checkout, though you could do it) that  
has a "file" for each svn revision. The filename is just the svn  
revision number, maybe divided into subdirectories in case you want to  
check the branch out for debugging purposes or whatever. The contents  
are the git commit SHA1 and whatever other metadata you want to keep  
in the future.
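
One possible shape for such a layout, purely as a sketch: the sub names, the choice of one directory per 1000 revisions, and the "SHA1 on the first line" file format are all assumptions for illustration; nothing like this exists in git-svn.

```perl
use strict;
use warnings;

# Hypothetical layout: one directory per 1000 revisions, with the file
# itself named after the full revision number, e.g. r373007 -> "373/373007".
sub rev_to_path {
    my ($rev) = @_;
    return sprintf('%d/%d', int($rev / 1000), $rev);
}

# File contents: the git commit SHA1 on the first line. Further lines
# are left free for whatever other metadata might be kept later.
sub metadata_blob {
    my ($sha1_hex, @extra) = @_;
    return join("\n", $sha1_hex, @extra) . "\n";
}

# Recover the SHA1 from a blob produced by metadata_blob().
sub blob_to_sha1 {
    my ($blob) = @_;
    my ($sha1) = $blob =~ /\A([0-9a-f]{40})$/m;
    return $sha1;
}
```

With a layout like this, a lookup could be a single "git show <metadata-branch>:373/373007", where the branch name is whatever the metadata branch ends up being called.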

The advantage of doing it this way? You can pass around svn metadata  
using the normal git fetch/push tools, query the metadata using "git  
show", etc. In terms of data integrity, it's as secure as anything  
else in a git repository, much more so than a separately maintained db  
file under .git.

Along similar lines, a separate branch where the filenames are commit  
SHA1s and the file contents are the stuff that currently gets written  
into the git-svn-id: lines would mean no more need to rewrite history  
when doing dcommit, and thus easier mixing of native git workflows and  
interactions with an svn repository.

It would be great if you could clone a git-svn repository and then do  
"git svn dcommit" from the clone, secure in the knowledge that things  
will stay consistent even if the origin gets your changes via "git svn  
fetch" rather than from you.

-Steve


* Re: git-svn: .git/svn disk usage
  2007-12-05 21:30     ` Steven Grimm
@ 2007-12-06  6:47       ` Eric Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2007-12-06  6:47 UTC (permalink / raw)
  To: Steven Grimm; +Cc: David Voit, git

Steven Grimm <koreth@midwinter.com> wrote:
> How about using git itself to keep some of this information? I'll just  
> throw this idea out there; might or might not make any actual sense.
> 
> Create a new "git-svn metadata" branch. This branch contains a fake  
> directory (never intended for checkout, though you could do it) that  
> has a "file" for each svn revision. The filename is just the svn  
> revision number, maybe divided into subdirectories in case you want to  
> check the branch out for debugging purposes or whatever. The contents  
> are the git commit SHA1 and whatever other metadata you want to keep  
> in the future.
> 
> The advantage of doing it this way? You can pass around svn metadata  
> using the normal git fetch/push tools, query the metadata using "git  
> show", etc. In terms of data integrity, it's as secure as anything  
> else in a git repository, much more so than a separately maintained db  
> file under .git.

I've thought of doing the way you describe in the past, too.

However, a missing ref to the tree you proposed would mean that the
metadata becomes inaccessible unless git-svn-id: lines are retained.

Right now there's a single ref for all data and metadata.  Going to
two refs would mean those two refs would always need to be in sync
with each other.

The basic idea of the git-svn-id: lines is that with the default
settings, the .rev_db files are deletable and can be regenerated from
that metadata.  git-svn will automatically re-create .rev_db files it
cannot find.
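
The regeneration idea can be sketched as follows. This is an illustration, not the code git-svn uses: the sub names are made up, and the commits are passed in as [sha1, message] pairs rather than read from "git log". The trailer format "git-svn-id: URL@REV UUID" is the one git-svn writes with its default settings.

```perl
use strict;
use warnings;

# Parse a "git-svn-id: URL@REV UUID" trailer out of a commit message.
# Returns (url, revision, uuid), or an empty list if there is no trailer.
sub parse_git_svn_id {
    my ($message) = @_;
    return $message =~ /^\s*git-svn-id:\s+(.*)\@(\d+)\s+([0-9a-f\-]+)\s*$/m;
}

# Rebuild a revision => SHA1 map for one repository UUID from a list of
# [sha1, message] pairs. Commits without a matching trailer are skipped.
sub rebuild_rev_map {
    my ($uuid, @commits) = @_;
    my %rev_to_sha1;
    for my $commit (@commits) {
        my ($sha1, $message) = @$commit;
        my ($url, $rev, $id_uuid) = parse_git_svn_id($message);
        next unless defined $rev && $id_uuid eq $uuid;
        $rev_to_sha1{$rev} = $sha1;
    }
    return \%rev_to_sha1;
}
```

This also shows why --no-metadata users cannot simply delete their .rev_db files: without the trailers there is nothing in the commits to parse.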

This is why the rev_db code in git-svn uses slow, synchronous writes iff
svk props or no-metadata is enabled, and fast, asynchronous writes
when the user sticks with the git-svn defaults.

> Along similar lines, a separate branch where the filenames are commit  
> SHA1s and the file contents are the stuff that currently gets written  
> into the git-svn-id: lines would mean no more need to rewrite history  
> when doing dcommit, and thus easier mixing of native git workflows and  
> interactions with an svn repository.

The current dcommit still has the advantage that commit times match
those in the SVN repository.

> It would be great if you could clone a git-svn repository and then do  
> "git svn dcommit" from the clone, secure in the knowledge that things  
> will stay consistent even if the origin gets your changes via "git svn  
> fetch" rather than from you.

It's actually doable after the [svn-remote "..."] section of .git/config is
copied and the refs/remotes/* structure is cloned via git.

The [svn-remote "..."] information can be regenerated based on
git-svn-id: lines (there's no automated way to do that, currently).

-- 
Eric Wong
