From: Shawn Pearce <spearce@spearce.org>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: git <git@vger.kernel.org>
Subject: Re: fast-import and unique objects.
Date: Tue, 8 Aug 2006 18:45:37 -0400
Message-ID: <20060808224537.GB18163@spearce.org>
In-Reply-To: <9e4733910608080511t5aa96865p41d6bc1b85e236fa@mail.gmail.com>
Jon Smirl <jonsmirl@gmail.com> wrote:
> We're designing a dumpfile format for git like the one SVN has.
I'm not sure I'd call it a dumpfile format. More like an importfile
format. Reading a GIT pack is really pretty trivial; if someone were
going to write a parser/reader to pull apart a GIT repository and
use that information in another way they would just do it against
the pack files. It's really not that much code. But generating a
pack efficiently for a large volume of data is slightly less trivial;
the attempt here is to produce some tool that can take a relatively
trivial data stream and produce a reasonable (but not necessarily
absolute smallest) pack from it in the least amount of CPU and
disk time necessary to do the job. I would hope that nobody would
seriously consider dumping a GIT repository back INTO this format!
[snip]
> AFAIK the svn code doesn't do merge commits. We probably need a post
> processing pass in the git repo that finds the merges and closes off
> the branches. gitk won't be pretty with 1,500 open branches. This may
> need some manual clues.
*wince* 1500 open branches. Youch. OK, that answers a lot of
questions for me with regard to memory handling in fast-import,
for which you provide excellent suggestions below. I guess I didn't
think you had nearly that many...
[snip]
> The file names are used over and over. Alloc a giant chunk of memory
> and keep appending the file name strings to it. Then build a little
> tree so that you can look up existing names. i.e. turn the file names
> into atoms. Never delete anything.
Agreed. For 1500 branches it's worth doing.
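
The atom idea quoted above can be sketched roughly as follows. This is only an illustration in Python, not fast-import's code: the class name `AtomTable` and its methods are hypothetical, and a dict stands in for the "little tree" used to look up existing names.

```python
class AtomTable:
    """Interns file-name strings so each distinct name is stored once.

    Hypothetical sketch: names are appended to one growing buffer and
    never deleted; callers keep small integer atom ids instead of strings.
    """

    def __init__(self):
        self.pool = bytearray()  # the "giant chunk": names are only appended
        self.spans = []          # atom id -> (offset, length) into the pool
        self.index = {}          # name bytes -> atom id (stands in for the tree)

    def intern(self, name: str) -> int:
        """Return the atom id for name, adding it to the pool if unseen."""
        key = name.encode("utf-8")
        atom = self.index.get(key)
        if atom is None:
            atom = len(self.spans)
            self.spans.append((len(self.pool), len(key)))
            self.pool += key
            self.index[key] = atom
        return atom

    def name(self, atom: int) -> str:
        """Recover the original string for an atom id."""
        offset, length = self.spans[atom]
        return self.pool[offset:offset + length].decode("utf-8")
```

Interning the same path twice returns the same id, so per-branch structures can hold integers rather than repeated copies of each path. In Python the dict keeps its own copy of every key, so the pool here only mirrors the layout described in the quote; a C version would index into the pool directly.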
[snip]
> About 100,000 files in the initial change set that builds the repo.
> Final repo has 120,000 files.
>
> There are 1,500 branches. I haven't looked at the svn dump file format
> for branches, but I suspect that it sends everything on a branch out
> at once and doesn't intersperse it with the trunk commits.
If you can tell fast-import you are completely done processing a
branch I can recycle the memory I have tied up for that branch; but
if that's going to be difficult then... hmm.
Right now I'm looking at around 5 MB/branch, based on implementing
the memory handling optimizations you suggested. That's still *huge*
for 1500 branches. I clearly can't hang onto every branch in memory
for the entire life of the import like I was planning on doing.
I'll kick that around for a couple of hours and see what I come
up with.
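
To put numbers on the problem: 1500 branches at roughly 5 MB each is about 7500 MB if every branch stays resident. The sketch below shows the "tell me when a branch is done" option discussed above. It is a hypothetical illustration, not what fast-import does: `BranchTable`, `commit`, and `finish` are invented names, and it assumes the input stream can signal the end of a branch.

```python
MB_PER_BRANCH = 5       # rough per-branch estimate quoted above
TOTAL_BRANCHES = 1500   # branch count quoted above


class BranchTable:
    """Keeps full state only for branches that are still being imported."""

    def __init__(self):
        self.active = {}    # branch name -> in-memory state (placeholder dict)
        self.finished = {}  # branch name -> tip commit id only

    def commit(self, branch, commit_id):
        """Record a new tip for a branch, creating its state on first use."""
        if branch in self.finished:
            raise ValueError("branch already finished: " + branch)
        state = self.active.setdefault(branch, {"tip": None, "tree": {}})
        state["tip"] = commit_id

    def finish(self, branch):
        """Drop the heavy state for a branch, keeping only its tip."""
        state = self.active.pop(branch)
        self.finished[branch] = state["tip"]

    def estimated_mb(self):
        """Estimated resident memory, counting only active branches."""
        return len(self.active) * MB_PER_BRANCH


# Worst case if no branch is ever released:
worst_case_mb = TOTAL_BRANCHES * MB_PER_BRANCH  # 7500
```

With explicit release, the footprint tracks the number of branches open at once rather than the total. If the stream cannot signal completion, some other eviction policy would be needed, which this sketch does not cover.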
--
Shawn.