From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matt Mackall Subject: Re: Mercurial 0.4b vs git patchbomb benchmark Date: Fri, 29 Apr 2005 13:23:41 -0700 Message-ID: <20050429202341.GB21897@waste.org> References: <20050426004111.GI21897@waste.org> <20050429060157.GS21897@waste.org> <3817.10.10.10.24.1114756831.squirrel@linux1> <20050429074043.GT21897@waste.org> <20050429163705.GU21897@waste.org> <20050429191207.GX21897@waste.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Sean , linux-kernel , git@vger.kernel.org X-From: git-owner@vger.kernel.org Fri Apr 29 22:20:16 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DRbxd-0006Q9-7N for gcvg-git@gmane.org; Fri, 29 Apr 2005 22:19:41 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S262961AbVD2UZC (ORCPT ); Fri, 29 Apr 2005 16:25:02 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S262959AbVD2UYp (ORCPT ); Fri, 29 Apr 2005 16:24:45 -0400 Received: from waste.org ([216.27.176.166]:23448 "EHLO waste.org") by vger.kernel.org with ESMTP id S262950AbVD2UXq (ORCPT ); Fri, 29 Apr 2005 16:23:46 -0400 Received: from waste.org (localhost [127.0.0.1]) by waste.org (8.13.4/8.13.4/Debian-1) with ESMTP id j3TKNfgC018537 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Fri, 29 Apr 2005 15:23:41 -0500 Received: (from oxymoron@localhost) by waste.org (8.13.4/8.13.4/Submit) id j3TKNfu2018534; Fri, 29 Apr 2005 15:23:41 -0500 To: Linus Torvalds Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.6+20040907i X-Virus-Scanned: by amavisd-new Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org On Fri, Apr 29, 2005 at 12:50:55PM -0700, Linus Torvalds wrote: > > > On Fri, 29 Apr 2005, Matt Mackall wrote: > > > > Here's an excerpt from http://selenic.com/mercurial/notes.txt on how > > the back-end works. > > Any notes on how you maintain repository-level information? > > For example, the expense in BK wasn't the single-file history, it was the > _repository_ history, ie the "ChangeSet" file. Which grows quite slowly, > but because it _always_ grows, it ends up being quite big and expensive to > parse after three years. > > Ie do you have the git kind of "independent trees/commits", or do you > create a revision history of those too? The changeset log (and everything else) has an external index. The index is basically an array of (base, offset, length, parent1-hash, parent2-hash, my-hash). This has everything you need to reconstruct a given file revision with one seek/read into the data stream itself, and also everything you need for doing graph merging. This is small enough (68 bytes, currently) that the index for a million changesets can be read into memory in a couple seconds or so, even in Python. It can also be mmapped and random accessed since the index entries are fixed-sized. (And it's already stored big-endian.) So you never have to read all the data. You also never need more than a few indices in memory at once. And you never have to rewrite the data (it's all append-only), except to do a bulk copy when you break a hardlink. -- Mathematics is the supreme nostalgia of our time.