From: "Jon Smirl"
To: "Shawn Pearce"
Cc: git
Subject: Re: fast-import and unique objects.
Date: Mon, 7 Aug 2006 10:37:30 -0400
Message-ID: <9e4733910608070737k52aaea7clf871d716d16547c2@mail.gmail.com>
In-Reply-To: <20060807050422.GD20514@spearce.org>
References: <9e4733910608060532w51fca2c0r8038828df0d41eeb@mail.gmail.com>
 <9e4733910608060853ua0eabc1w9b35b8414d3c9bae@mail.gmail.com>
 <20060806180323.GA19120@spearce.org>
 <9e4733910608062148u4341dabag451c3f49f1a792a1@mail.gmail.com>
 <20060807050422.GD20514@spearce.org>
X-Mailing-List: git@vger.kernel.org

On 8/7/06, Shawn Pearce wrote:
> > I'm staring at the cvs2svn code now trying to figure out how to modify
> > it without rewriting everything. I may just leave it all alone and
> > build a table with cvs_file:rev to sha-1 mappings. It would be much
> > more efficient to carry sha-1 throughout the stages but that may
> > require significant rework.
>
> Does it matter? How long does the cvs2svn processing take,
> excluding the GIT blob processing that's now known to take 2 hours?
> What's your target for an acceptable conversion time on the system
> you are working on?

As is, it takes the code about a week to import MozCVS into
Subversion. But I've already addressed the core of why that was taking
so long. The original code forks off a copy of cvs for each revision
to extract the text. Doing that 1M times takes about two days. The
version with fast-import takes two hours.

At the end of the process cvs2svn forks off svn 250K times to import
the change sets. That takes about four days to finish. Doing a
fast-import backend should fix that.

> Any thoughts yet on how you might want to feed trees and commits
> to a fast pack writer? I was thinking about doing a stream into
> fast-import such as:

The data I have generates an output that indicates add/change/delete
for each file name. Add/change should have an associated sha-1 for the
new revision. cvs/svn have no concept of trees.

How about sending out a stream of add/change/delete operations
interspersed with commits? That would let fast-import track the tree
and only generate tree nodes when they change.

The protocol may need some thought. I need to be able to handle
branches and labels too.
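
A minimal sketch of the tracking idea, assuming the importer keeps the
current tree in memory and marks a directory dirty whenever something
under it changes. Everything here is hypothetical: TreeTracker, the
op names and the hashing are illustration only, and the hash is a
simplified stand-in rather than git's real tree object encoding.

```python
import hashlib


def _new_dir():
    # One directory node: blob ids by file name, child dirs, cached tree id.
    return {"files": {}, "dirs": {}, "id": None}


class TreeTracker:
    """Hypothetical tracker for add/change/delete operations.

    The tree id is a simplified stand-in, NOT git's tree encoding.
    """

    def __init__(self):
        self.root = _new_dir()
        self.trees_written = 0  # how many tree nodes were (re)generated

    def apply(self, op, path, sha1=None):
        """op is 'add', 'change' or 'delete'; sha1 is the blob id."""
        *dirs, name = path.split("/")
        chain = [self.root]  # directory nodes from the root down to the file
        for d in dirs:
            children = chain[-1]["dirs"]
            if d not in children:
                if op == "delete":
                    return  # nothing to delete under a missing directory
                children[d] = _new_dir()
            chain.append(children[d])
        if op == "delete":
            chain[-1]["files"].pop(name, None)
            # Drop directories that became empty, deepest first.
            for i in range(len(chain) - 1, 0, -1):
                if chain[i]["files"] or chain[i]["dirs"]:
                    break
                del chain[i - 1]["dirs"][dirs[i - 1]]
        else:
            chain[-1]["files"][name] = sha1
        for node in chain:
            node["id"] = None  # every ancestor directory is now dirty

    def _tree_id(self, node):
        # Only dirty nodes are regenerated; clean ones reuse the cached id.
        if node["id"] is None:
            lines = [f"blob {sha} {name}"
                     for name, sha in sorted(node["files"].items())]
            lines += [f"tree {self._tree_id(child)} {name}"
                      for name, child in sorted(node["dirs"].items())]
            node["id"] = hashlib.sha1("\n".join(lines).encode()).hexdigest()
            self.trees_written += 1
        return node["id"]

    def commit(self):
        """Return the root tree id for the current state."""
        return self._tree_id(self.root)
```

With this, a commit that only touches doc/README regenerates the root
and doc trees and reuses the cached ids for src and src/module.
Branches and labels are not covered by the sketch.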

> <4 byte length of commit>*
>
> where is the raw commit minus the first "tree nnn\n" line, and
> is:
>
>
>
> where is one of 'B' (normal blob), 'L' (symlink), 'X'
> (executable blob), is the 40 byte hex, is the file from
> the root of the repository ("src/module/foo.c"), and and
> are the obvious values. You would feed all tree entries and the pack
> writer would split the stream up into the individual tree objects.
>
> fast-import would generate the tree(s) delta'ing them against the
> prior tree of the same path, prefix "tree nnn\n" to the commit
> blob you supplied, generate the commit, and print out its ID.
> By working from the first commit up to the most recent each tree
> deltas would be using the older tree as the base which may not be
> ideal if a large number of items get added to a tree but should be
> effective enough to generate a reasonably sized initial pack.
>
> It would however mean you need to monitor the output pipe from
> fast-import to get back the commit id so you can use it to prep
> the next commit's parent(s) as you can't produce that in Python.
>
> --
> Shawn.
>

--
Jon Smirl
jonsmirl@gmail.com