All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Wong <normalperson@yhbt.net>
To: David Voit <david.voit@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: git-svn: .git/svn disk usage
Date: Wed, 5 Dec 2007 00:54:52 -0800	[thread overview]
Message-ID: <20071205085451.GA347@soma> (raw)
In-Reply-To: <loom.20071203T182924-435@post.gmane.org>

David Voit <david.voit@gmail.com> wrote:
> Ollie Wild <aaw <at> google.com> writes:
> 
> > 
> > Hi,
> > 
> > I've been using git-svn to mirror the gcc repository at
> > svn://gcc.gnu.org/svn/gcc.  Recently, I noticed that my .git directory
> > is consuming 11GB of disk space.  Digging further, I discovered that
> > 9.8GB of this is attributable to the .git/svn directory (which
> > includes 200 branches and 2,588 tags).  Given that my .git/objects
> > directory is 652MB, it seems that it ought to be possible to store
> > this information in a more compact form.
> > 
> > I'm curious if other developers have run into this issue.  If so, are
> > there any proposals / plans for improving the storage of git-svn
> > metadata?
> > 
> > Thanks,
> > Ollie
> > 
> 
> Hi all,
> 
> I've seen the same effect, so i tried to reduce the size of the revdb and made a
> new format:
> First, in the bin files the sha1 are stored as hexvalues not as ascii, this
> reduces the a single sha1 from 41 bytes to 20.
> Second, only save the non-zero commits, thats what the idx are used for.
> A idx file has three 32bit integers per entry.
> The first integer represents the first zero svn revision, the second the last
> zero revision and the last integer is the position of the next non-zero revision
> in the bin.
> 
> Example:
> Revision 0-373006 are zero revision and 373007 is the first actualy used revision
> and 373008-373623 are again zero revisions
> the idx has the following content:
> 0 373006 0
> 373007 373007 1
> 
> and the bin only saves
> 59037b8043268c9ca0d87ba86519ed0b5358c8a1
> eef3f7e25993a46e3c4242aa502d93e909b08c57

I'd very much like rev_db to be smaller, but I find the idea of the data
relying on a separate index too fragile and difficult to recover
from if corruption occurs (mainly for --no-metadata users).

The rev_db is simply a lookup for mapping SVN revision numbers to
git commit SHA1s.

I have an idea for a more compact .rev_db format:

  All records are 24 bytes:
    4 bytes for a 32-bit integer representing the SVN revision
    20 bytes for the git commit SHA1

  rev_db is an append-only format, so the 32-bit integer will be
  monotonically increasing over time, which allows:

  Lookups by revision number done via binary search:

  Which means empty revisions never need to be entered anymore.

Of course there needs to be a migration strategy for existing
repositories (mainly the ones using --no-metadata), too.

Users not using --no-metadata (nor the option for svk metadata) can just
remove their .rev_db* files and git-svn will automatically recreate them
as needed.

> The format currently used produce a 373624*41bytes large file.
> 
> Used on a git-svn clone here, i get:
> The results are:
> old:
> 1,1G    hadoop (1004M   svn/)
> new:
> 47M     hadoop (5,9M    svn/)

Very nice reduction!

> Here a example sourcecode to test this idea:
> 
> I try to integrate this in git-svn this week.
> 
> NOTE: I'm not a perl hacker, so use at your own risk.
> 
> Bye David
> ps.: I'm not a member of this list please reply directly to me.

If you don't have time, I'll try to implement my ideas sometime this
week or weekend (assuming I have time, too).

-- 
Eric Wong

  reply	other threads:[~2007-12-05  8:55 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-12-03  6:17 git-svn: .git/svn disk usage Ollie Wild
2007-12-03  6:37 ` Pascal Obry
2007-12-03  6:46   ` David Brown
2007-12-03  6:53     ` Kelvie Wong
2007-12-03 17:35     ` Ollie Wild
2007-12-04  8:29       ` Karl Hasselström
2007-12-03 18:51 ` David Voit
2007-12-05  8:54   ` Eric Wong [this message]
2007-12-05 21:30     ` Steven Grimm
2007-12-06  6:47       ` Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20071205085451.GA347@soma \
    --to=normalperson@yhbt.net \
    --cc=david.voit@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.