From mboxrd@z Thu Jan 1 00:00:00 1970 From: Magnus Damm Subject: Re: Mercurial 0.3 vs git benchmarks Date: Tue, 26 Apr 2005 18:23:11 +0200 Message-ID: References: <20050426004111.GI21897@waste.org> <200504260713.26020.mason@suse.com> <200504261138.46339.mason@suse.com> Reply-To: Magnus Damm Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Cc: Linus Torvalds , Mike Taht , Matt Mackall , linux-kernel , git@vger.kernel.org X-From: git-owner@vger.kernel.org Tue Apr 26 18:26:01 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DQSqx-0002eP-Go for gcvg-git@gmane.org; Tue, 26 Apr 2005 18:24:03 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261672AbVDZQ2F (ORCPT ); Tue, 26 Apr 2005 12:28:05 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261688AbVDZQYg (ORCPT ); Tue, 26 Apr 2005 12:24:36 -0400 Received: from wproxy.gmail.com ([64.233.184.201]:23377 "EHLO wproxy.gmail.com") by vger.kernel.org with ESMTP id S261658AbVDZQXO convert rfc822-to-8bit (ORCPT ); Tue, 26 Apr 2005 12:23:14 -0400 Received: by wproxy.gmail.com with SMTP id 49so92300wri for ; Tue, 26 Apr 2005 09:23:11 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=BPWzYhvCeKrLr0IctWIZhkzH4uozT+4Vds3LSOaTW/i74PnwcNMszrHjQm4EFCxy6pTE7qSo3VQxiBMKG4KqY1iF9VVPUP/86DUkwUd1+KKsq1seRARR8/fpzzGW4frfiFIFjHSO3zmMnNyHINOil2WUIT/nk9kutwFDMxRkHt8= Received: by 10.54.17.66 with SMTP id 66mr69484wrq; Tue, 26 Apr 2005 09:23:11 -0700 (PDT) Received: by 10.54.17.63 with HTTP; Tue, 26 Apr 2005 09:23:11 -0700 (PDT) To: Chris Mason In-Reply-To: <200504261138.46339.mason@suse.com> Content-Disposition: inline Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org On 4/26/05, Chris Mason wrote: > On Tuesday 26 April 2005 11:09, Magnus Damm wrote: > > On 4/26/05, Chris Mason wrote: > > > This agrees with my tests here, the time to apply patches is somewhat > > > disk bound, even for the small 100 or 200 patch series. The io should be > > > coming from data=ordered, since the commits are still every 5 seconds or > > > so. > > > > Yes, as long as you apply the patches to disk that is. I've hacked up > > a small backend tool that applies patches to files kept in memory and > > uses a modifed rabin-karp search to match hunks. So you basically read > > once and write once per file instead of moving data around for each > > applied patch. But it needs two passes. > > > > And no, the source code for the entire Linux kernel is not kept in > > memory - you need a smart frontend to manage the file cache. Drop me a > > line if you are interested. > > Sorry, you've lost me. Right now the cycle goes like this: Ehrm, maybe I'm way off. =) > 1) patch reads patch file, reads source file, writes source file > 2) update-cache reads source file, writes git file Ok. > Which of those writes are you avoiding? We have a smart way to manage the > cache already for the source files...the vm does pretty well. There's > nothing to manage for the git files. For the apply a bunch of patches > workload, they are write once, read never (except for the index). Well, maybe I misunderstood everything, but I thought you were applying a lot of patches and complained that it took a lot of time due to the data order. When I applied a lot of patches to the kernel recently the cpu load dropped to zero after a while and the HD worked hard a sec or two and then things came back again. My primitive guess is that it was because the ext3 journal became full. To workaround this fact I started hacking on this in-memory patcher. In the cycle above, I'm trying to speed up step 1: If the patch modifies each source file multiple times (either using multiple hunks or multiple ---/+++) then the lines below the hunk in the source file will be moved multiple times. And if the source file is written to disk after each hunk or ---/+++ is applied then this will generate a lot of writes that can be avoided if the entire patch procedure is broken down into a first pass that analyzes the patches and a second pass that applies the patches and keeps source files in memory. But my rather trivial observation above is of course only suitable if you have a lot of patches that should be applied and you are only interested in the final version of the patched source files. If you apply one patch at a time and import each source file as a new revision then my little hack is probably not for you. / magnus