From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff King Subject: Re: [RFC/PATCH] Proof-of-concept streamed hashing, demoed in `git hash-object` Date: Tue, 17 Feb 2009 12:33:03 -0500 Message-ID: <20090217173302.GC31297@sigill.intra.peff.net> References: <0984029E-57D0-4EFA-A060-E0B6FFA77D58@hoskings.net> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Cc: git@vger.kernel.org To: Ben Hoskings X-From: git-owner@vger.kernel.org Tue Feb 17 18:34:43 2009 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1LZTqA-0007JG-MO for gcvg-git-2@gmane.org; Tue, 17 Feb 2009 18:34:39 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752863AbZBQRdL (ORCPT ); Tue, 17 Feb 2009 12:33:11 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752692AbZBQRdJ (ORCPT ); Tue, 17 Feb 2009 12:33:09 -0500 Received: from peff.net ([208.65.91.99]:52788 "EHLO peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751460AbZBQRdI (ORCPT ); Tue, 17 Feb 2009 12:33:08 -0500 Received: (qmail 30473 invoked by uid 107); 17 Feb 2009 17:33:28 -0000 Received: from sigill.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.7) (smtp-auth username relayok, mechanism cram-md5) by peff.net (qpsmtpd/0.40) with ESMTPA; Tue, 17 Feb 2009 12:33:28 -0500 Received: by sigill.intra.peff.net (sSMTP sendmail emulation); Tue, 17 Feb 2009 12:33:03 -0500 Content-Disposition: inline In-Reply-To: <0984029E-57D0-4EFA-A060-E0B6FFA77D58@hoskings.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Wed, Feb 18, 2009 at 12:31:35AM +1000, Ben Hoskings wrote: > This patch adds a proof-of-concept implementation of streaming SHA1 > calculation in sha1_file.c, as demoed with `git hash-object file>`. Instead of the command's memory footprint being equal to the input > file's size, this caps it at SHA1_CHUNK_SIZE (currently 64MB). This might be nice to reduce the memory footprint for _some_ operations that only need to look at the data once, but... > Capping memory use this easily seems like a win, but then all this code > does is stream-calculate a SHA1 and print it to stdout. There seem to be a > lot of disparate places throughout the codebase where objects have their > SHA1 calculated. ...as you noticed, there are a lot of other places where we need the whole object. So you are not suddenly making git not suck with a 700MB file on a 512MB system. But more importantly, hash-object (via index_fd) already mmaps the input when it is possible to do so. So on memory-tight systems, won't the OS already release memory when appropriate (i.e., your virtual size is large, but the OS is free to optimize what is actually paged in as appropriate). I suppose this is an extra hint to the OS that we don't care about past chunks. On systems which support it, perhaps using madvise with MADV_SEQUENTIAL would be another useful hint. > Then again, I presume most of these are working with blobs and not entire > files, and hence wouldn't require streaming anyway. (I'm assuming blobs > don't grow large enough to warrant it - is that necessarily true?) Either I don't understand what you are saying here, or you are missing something fundamental about hwo git works: blobs _are_ entire files. > On my machine, the original implementation hashed a 700MB file in 5.8sec. > My patch does it in 6.2sec with SHA1_CHUNK_SIZE set to 64MB. Hmph. So it's slower? I'm not sure what the advantage is, then. On a low-memory machine, the OS's paging strategy should be reasonable, especially with MADV_SEQUENTIAL (though I haven't tested it). And on a machine with enough memory, it's slower. I guess you are evicting fewer pages from other processes on the system in the meantime. > +inline void write_sha1_fd_process_chunk(int fd, unsigned long len, > + unsigned long offset, > git_SHA_CTX *c, > + void *buf) > +{ > + buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset); > + git_SHA1_Update(c, buf, len); > + munmap(buf, len); > +} What is the point of the buf input parameter, which is promptly overwritten? > +int hash_sha1_fd(int fd, unsigned long len, const char *type, > + unsigned char *sha1) > +{ > + char hdr[32]; > + int hdrlen; > + write_sha1_fd_prepare(fd, len, type, sha1, hdr, &hdrlen); > + return 0; > +} What is the point of the hdr and hdrlen variables being passed in to receive values, when we never actually look at them? I would think you are conforming to an existing interface except that you just added the function earlier in the patch. -Peff