On 8/8/06, Shawn Pearce wrote:
> Jon Smirl wrote:
> > We're designing a dumpfile format for git like the one SVN has.
>
> I'm not sure I'd call it a dumpfile format. More like an importfile
> format. Reading a GIT pack is really pretty trivial; if someone was
> going to write a parser/reader to pull apart a GIT repository and
> use that information in another way they would just do it against
> the pack files. It's really not that much code. But generating a
> pack efficiently for a large volume of data is slightly less trivial;
> the attempt here is to produce some tool that can take a relatively
> trivial data stream and produce a reasonable (but not necessarily
> absolute smallest) pack from it in the least amount of CPU and
> disk time necessary to do the job. I would hope that nobody would
> seriously consider dumping a GIT repository back INTO this format!
>
> [snip]
> > AFAIK the svn code doesn't do merge commits. We probably need a
> > post-processing pass in the git repo that finds the merges and
> > closes off the branches. gitk won't be pretty with 1,500 open
> > branches. This may need some manual clues.
>
> *wince* 1,500 open branches. Youch. OK, that answers a lot of
> questions for me with regards to memory handling in fast-import,
> which you provide excellent suggestions for below. I guess I didn't
> think you had nearly that many...
>
> [snip]
> > The file names are used over and over. Alloc a giant chunk of
> > memory and keep appending the file name strings to it. Then build
> > a little tree so that you can look up existing names, i.e. turn
> > the file names into atoms. Never delete anything.
>
> Agreed. For 1,500 branches it's worth doing.
>
> [snip]
> > About 100,000 files in the initial change set that builds the
> > repo. Final repo has 120,000 files.
> >
> > There are 1,500 branches.
> > I haven't looked at the svn dump file format
> > for branches, but I suspect that it sends everything on a branch
> > out at once and doesn't intersperse it with the trunk commits.
>
> If you can tell fast-import you are completely done processing a
> branch I can recycle the memory I have tied up for that branch; but
> if that's going to be difficult then... hmm.
>
> Right now I'm looking at around 5 MB/branch, based on implementing
> the memory handling optimizations you suggested. That's still *huge*
> for 1,500 branches (roughly 7.5 GB in total). I clearly can't hang
> onto every branch in memory for the entire life of the import like
> I was planning on doing. I'll kick that around for a couple of
> hours and see what I come up with.

Some of these branches are what cvs2svn calls unlabeled branches.
cvs2svn is probably creating more of these than necessary, since the
code for coalescing them into a single big unlabeled branch is not
that good. I attached the list of branch names being generated.

> --
> Shawn.

--
Jon Smirl
jonsmirl@gmail.com