Re: fast-import and unique objects.

From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Shawn Pearce" <spearce@spearce.org>
Cc: git <git@vger.kernel.org>
Subject: Re: fast-import and unique objects.
Date: Tue, 8 Aug 2006 19:56:32 -0400
Message-ID: <9e4733910608081656p4bcd68d9xb3e60b58d9afbcc9@mail.gmail.com>
In-Reply-To: <20060808224537.GB18163@spearce.org>

On 8/8/06, Shawn Pearce <spearce@spearce.org> wrote:
> Jon Smirl <jonsmirl@gmail.com> wrote:
> > We're designing a dumpfile format for git like the one SVN has.
>
> I'm not sure I'd call it a dumpfile format.  More like an importfile
> format.  Reading a GIT pack is really pretty trivial; if someone was
> going to write a parser/reader to pull apart a GIT repository and
> use that information in another way they would just do it against
> the pack files.  It's really not that much code.  But generating a
> pack efficiently for a large volume of data is slightly less trivial;
> the attempt here is to produce some tool that can take a relatively
> trivial data stream and produce a reasonable (but not necessarily
> absolute smallest) pack from it in the least amount of CPU and
> disk time necessary to do the job.  I would hope that nobody would
> seriously consider dumping a GIT repository back INTO this format!
>
> [snip]
> > AFAIK the svn code doesn't do merge commits. We probably need a post
> > processing pass in the git repo that finds the merges and closes off
> > the branches. gitk won't be pretty with 1,500 open branches. This may
> > need some manual clues.
>
> *wince* 1500 open branches.  Youch.  OK, that answers a lot of
> questions for me with regards to memory handling in fast-import.
> Which you provide excellent suggestions for below.  I guess I didn't
> think you had nearly that many...
>
> [snip]
> > The file names are used over and over. Alloc a giant chunk of memory
> > and keep appending the file name strings to it. Then build a little
> > tree so that you can look up existing names, i.e. turn the file names
> > into atoms. Never delete anything.
>
> Agreed.  For 1500 branches it's worth doing.
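
To make the atom idea quoted above concrete, here is a minimal sketch in
Python. It is not code from fast-import: the class name AtomTable and its
methods are made up for illustration, and a dict stands in for the "little
tree" used to look up existing names. The dict keeps its own copy of each
key, so this only illustrates the structure, not the memory savings a C
implementation would get.

```python
class AtomTable:
    """Append-only pool of file names; each distinct name is stored once."""

    def __init__(self):
        self._pool = bytearray()  # the "giant chunk": names appended back to back
        self._spans = []          # atom id -> (offset, length) into the pool
        self._index = {}          # name bytes -> atom id (stand-in for the lookup tree)

    def intern(self, name: str) -> int:
        """Return the atom id for name, adding it to the pool on first sight."""
        key = name.encode("utf-8")
        atom = self._index.get(key)
        if atom is None:
            atom = len(self._spans)
            self._spans.append((len(self._pool), len(key)))
            self._pool += key
            self._index[key] = atom
        return atom

    def name(self, atom: int) -> str:
        """Recover the file name for an atom id from the pool."""
        offset, length = self._spans[atom]
        return bytes(self._pool[offset:offset + length]).decode("utf-8")

    def __len__(self) -> int:
        return len(self._spans)
```

Nothing is ever deleted, so an atom id stays valid for the whole import and
trees on every branch can refer to a name by its small integer id.
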
>
> [snip]
> > About 100,000 files in the initial change set that builds the repo.
> > Final repo has 120,000 files.
> >
> > There are 1,500 branches. I haven't looked at the svn dump file format
> > for branches, but I suspect that it sends everything on a branch out
> > at once and doesn't intersperse it with the trunk commits.
>
> If you can tell fast-import you are completely done processing a
> branch I can recycle the memory I have tied up for that branch; but
> if that's going to be difficult then...  hmm.
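
A rough sketch of what such a "branch done" signal could look like on the
importer side, in Python. This is only an illustration of the idea quoted
above, not how fast-import is implemented; BranchStates, update() and
finish() are hypothetical names, and the per-branch state is reduced to a
path -> blob id mapping.

```python
class BranchStates:
    """Per-branch working state, released once a branch is reported done."""

    def __init__(self):
        self._active = {}  # branch name -> {path: blob id}

    def update(self, branch, path, blob_id):
        """Record the current blob for path on branch, creating state if needed."""
        self._active.setdefault(branch, {})[path] = blob_id

    def finish(self, branch):
        """Hypothetical 'branch done' signal: drop the state so it can be reclaimed.

        Returns the dropped state, or None if the branch was not active.
        """
        return self._active.pop(branch, None)

    def active_count(self):
        return len(self._active)
```

With a signal like this, memory use follows the number of branches open at
one time rather than the total number of branches in the import.
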
>
> Right now I'm looking at around 5 MB/branch, based on implementing
> the memory handling optimizations you suggested.  That's still *huge*
> for 1500 branches.  I clearly can't hang onto every branch in memory
> for the entire life of the import like I was planning on doing.
> I'll kick that around for a couple of hours and see what I come
> up with.
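
For scale, the figures quoted above multiply out as follows. A small Python
sketch, assuming "MB" means 2**20 bytes (the message does not say) and that
every branch costs the full 5 MB; the function name is made up.

```python
MIB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


def estimate_total_bytes(mb_per_branch, branches):
    """Total memory if every branch stays resident for the whole import."""
    return mb_per_branch * MIB * branches


total = estimate_total_bytes(5, 1500)
print(f"{total // MIB} MB, about {total / (1024 * MIB):.1f} GiB")
# prints "7500 MB, about 7.3 GiB"
```
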

Some of these branches are what cvs2svn calls unlabeled branches.
cvs2svn is probably creating more of these than necessary since the
code for coalescing them into a single big unlabeled branch is not
that good.

I attached the list of branch names being generated.



>
> --
> Shawn.
>


-- 
Jon Smirl
jonsmirl@gmail.com

[-- Attachment #2: cvs2svn-branches.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 24830 bytes --]