From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jonathan Nieder
Subject: Re: large files and low memory
Date: Tue, 5 Oct 2010 15:34:50 -0500
Message-ID: <20101005203450.GA2096@burratino>
References: <20101004092046.GA4382@nibiru.local> <20101004185854.GA6466@burratino> <20101004191657.GC6466@burratino>
To: Nicolas Pitre
Cc: Shawn Pearce, weigelt@metux.de, git@vger.kernel.org

Nicolas Pitre wrote:

> You can't do a one-pass calculation.  The first one is required to
> compute the SHA1 of the file being added, and if that corresponds to an
> object that we already have then the operation stops right there as
> there is actually nothing to do.

Ah.  Thanks for the reminder.

> In the case of big files, what we need to do is to stream the file data
> in, compute the SHA1 and deflate it, in order to stream it out into a
> temporary file, then rename it according to the final SHA1.  This would
> allow Git to work with big files, but of course it won't be possible to
> know if the object corresponding to the file is already known until all
> the work has been done, possibly just to throw it away.

To make sure I understand correctly: are you suggesting that for big
files we should skip the first pass?

I suppose that makes sense: for small files, using a patch application
tool to reach a postimage that matches an existing object is something
git historically needed to expect, but for typical big files:

 - once you've computed the SHA1, you've already invested a noticeable
   amount of time.
 - emailing patches around is difficult, making "git am" etc. less
   important.
 - hopefully git or zlib can notice when files are incompressible,
   making the deflate not cost so much in that case.
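
For concreteness, the one-pass scheme Nicolas describes (stream the file in, feed the SHA1 and the deflate stream as you go, write to a temporary file, rename by the final SHA1) might look roughly like the sketch below.  This is Python rather than git's C and is not git's actual code: the function name stream_blob and its parameters are made up for illustration.  The "blob <size>\0" header and the objects/xx/... layout are the usual loose-object format.  It assumes the file size reported by stat does not change while the file is being read.

```python
import hashlib
import os
import tempfile
import zlib


def stream_blob(src_path, objects_dir, chunk_size=1 << 16):
    """Hypothetical helper: hash and deflate src_path in a single pass.

    Returns (hex_sha1, created).  created is False when the object was
    already present, in which case the deflate work is thrown away.
    """
    size = os.path.getsize(src_path)      # assumed stable during the read
    header = b"blob %d\0" % size          # loose-object header
    sha = hashlib.sha1(header)
    deflater = zlib.compressobj()

    fd, tmp_path = tempfile.mkstemp(dir=objects_dir, prefix="tmp_obj_")
    try:
        with os.fdopen(fd, "wb") as out, open(src_path, "rb") as src:
            out.write(deflater.compress(header))
            while True:
                chunk = src.read(chunk_size)   # only one chunk in memory
                if not chunk:
                    break
                sha.update(chunk)
                out.write(deflater.compress(chunk))
            out.write(deflater.flush())

        hex_id = sha.hexdigest()
        dest_dir = os.path.join(objects_dir, hex_id[:2])
        dest = os.path.join(dest_dir, hex_id[2:])
        if os.path.exists(dest):
            # Only now do we learn the object was already known.
            return hex_id, False
        os.makedirs(dest_dir, exist_ok=True)
        os.replace(tmp_path, dest)             # rename by the final SHA1
        return hex_id, True
    finally:
        # Remove the temporary file if it was not renamed into place.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

Memory use is bounded by chunk_size regardless of file size.  The cost is the one described above: for a file whose object already exists, the whole deflate is done before the duplicate is noticed.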