From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dmitry Potapov Subject: Re: [PATCH] don't use mmap() to hash files Date: Mon, 15 Feb 2010 15:23:59 +0300 Message-ID: <37fcd2781002150423n36378105t9bee9c0e5106ca4c@mail.gmail.com> References: <20100211234753.22574.48799.reportbug@gibbs.hungrycats.org> <20100214011812.GA2175@dpotapov.dyndns.org> <20100214024259.GB9704@dpotapov.dyndns.org> <37fcd2781002141106v761ce6e0kc5c5bdd5001f72a9@mail.gmail.com> <37fcd2781002141156n7e2b9673s1eb6c12869facdb2@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Johannes Schindelin , Zygo Blaxell , Ilari Liusvaara , Thomas Rast , Jonathan Nieder , git@vger.kernel.org To: Nicolas Pitre X-From: git-owner@vger.kernel.org Mon Feb 15 13:24:15 2010 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1Ngzzo-0007Ha-JA for gcvg-git-2@lo.gmane.org; Mon, 15 Feb 2010 13:24:12 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755437Ab0BOMYG convert rfc822-to-quoted-printable (ORCPT ); Mon, 15 Feb 2010 07:24:06 -0500 Received: from mail-fx0-f215.google.com ([209.85.220.215]:60169 "EHLO mail-fx0-f215.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754860Ab0BOMYB convert rfc822-to-8bit (ORCPT ); Mon, 15 Feb 2010 07:24:01 -0500 Received: by fxm7 with SMTP id 7so5628948fxm.28 for ; Mon, 15 Feb 2010 04:24:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=UxXTmvm3hEX47+XFvCL+GQqGF8/BwqigZMgdxSre4mI=; b=rc6BqgYqjwg+KfMFo4evUDUjYWTwB4OXe3VwCGSsDGbcJ/O/JiG2SBtNqPQkn8LbTL 3DywUMlSYsDA7oGcMwGV9xkhoGJyAfqp6FUtAR9JMRDlCNSuuX1mq3P85V+Ipdxd09d9 YgRpeKLxHIyJ9y+Ori7GUSoey/IqEZ9J6/OJE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=snNVapVT/NToJJO4Z6B13RExywvDi8WVIMrUJ+YvIgmlr1e9HjLtltTPuMAwwA4Hn/ o4qm5Vlm3R3dH1h2IbGxrfvBON28DWfqXlD2HYfDyXa/ZImsRusPWpBMeMn0jqmLG0dp v+/CZnpeG2tSlVd+k2QZm6o4lWr42IwnJ2D24= Received: by 10.239.193.74 with SMTP id h10mr524151hbi.163.1266236639895; Mon, 15 Feb 2010 04:23:59 -0800 (PST) In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Mon, Feb 15, 2010 at 8:05 AM, Nicolas Pitre wrote= : > On Sun, 14 Feb 2010, Dmitry Potapov wrote: > >> 1. to introduce a configuration parameter that will define whether t= o use >> mmap() to hash files or not. It is a trivial change, but the real qu= estion >> is what default value for this option (should we do some heuristic b= ased >> on filesize vs available memory?) > > I don't like such kind of heuristic. =A0They're almost always wrong, = and > any issue is damn hard to reproduce. I tend to believe that mmap() wo= rks > better by letting the OS paging in and out memory as needed while > reading data into allocated memory is only going to force the system > into swap. Probably, you are right. Heuristic is a bad idea. Still, we may want to add an option to disable mmap() during hash calculation if we preserve mmap() here. Though, I don't like keeping mmap() there if we go for #2.= =2E See below... > >> 2. to stream files in chunks. It is better because it is faster, esp= ecially on >> large files, as you calculate SHA-1 and zip data while they are in C= PU >> cache. However, it may be more difficult to implement, because we ha= ve >> filters that should be apply to files that are put to the repository= =2E > > So? =A0"More difficult" when it is the right thing to do is no excuse= not > to do it and satisfy ourselves with an half solution. =A0Barely repla= cing > mmap() with read() has drawbacks while the advantages aren't that man= y. > Gaining a few speed percentage while making it less robust when memor= y > is tight isn't such a great compromize to me. =A0BUT if you were to > replace mmap() with read() and make the process chunked then you do > improve both speed _and_ memory usage. I have not had time to look closely at this, but there is one problem that I noticed -- the header of any git object contains the blob length= =2E We know this length in advance (without reading all data) only for regular files and only if they do not have any filter to be applied. In all other cases, it seems we cannot do much better than we do now, assuming that we do not want to change the storage format... If so, the question remains what to do about regular files with some filter. Currently, we use mmap() for the original data but store the processed data in memory anyway. The question is whether want to keep this use of mmap() here? Considering that it is a potential source of a repository corruption and these filters should not be used for big file= s because they take a lot of memory anyway, I think we should get rid of mmap() in hashing file completely, once we can process regular files without filters in chunks. Dmitry