Re: large files and low memory

From: Nicolas Pitre <nico@fluxnic.net>
To: Jonathan Nieder <jrnieder@gmail.com>
Cc: Shawn Pearce <spearce@spearce.org>,
	weigelt@metux.de, git@vger.kernel.org
Subject: Re: large files and low memory
Date: Tue, 05 Oct 2010 17:11:45 -0400 (EDT)
Message-ID: <alpine.LFD.2.00.1010051657440.3107@xanadu.home>
In-Reply-To: <20101005203450.GA2096@burratino>

On Tue, 5 Oct 2010, Jonathan Nieder wrote:

> Nicolas Pitre wrote:
> 
> > You can't do a one-pass calculation.  The first one is required to 
> > compute the SHA1 of the file being added, and if that corresponds to an 
> > object that we already have then the operation stops right there as 
> > there is actually nothing to do.
> 
> Ah.  Thanks for a reminder.
> 
> > In the case of big files, what we need to do is to stream the file data 
> > in, compute the SHA1 and deflate it, in order to stream it out into a 
> > temporary file, then rename it according to the final SHA1.  This would 
> > allow Git to work with big files, but of course it won't be possible to 
> > know if the object corresponding to the file is already known until all 
> > the work has been done, possibly just to throw it away.
> 
> To make sure I understand correctly: are you suggesting that for big
> files we should skip the first pass?

For big files we need a totally separate code path to process the file 
data in small chunks at 'git add' time, using a loop containing 
read()+SHA1sum()+deflate()+write().  Then, if the SHA1 matches an 
existing object we delete the temporary output file, otherwise we rename 
it as a valid object.  No CRLF, no smudge filters, no diff, no deltas, 
just plain storage of huge objects, based on the value of the 
core.bigFileThreshold config option.
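
A minimal sketch of that add-time loop, written in Python rather than Git's C and not taken from any actual Git code. The names stream_add, CHUNK and objects_dir are hypothetical. It assumes the loose-object layout, where the hashed and deflated stream is a "blob <size>\0" header followed by the raw file bytes, and the object lives under the first two hex digits of its SHA1.

```python
import hashlib
import os
import tempfile
import zlib

CHUNK = 64 * 1024  # hypothetical chunk size; any small constant works


def stream_add(path, objects_dir):
    """Hash and deflate `path` in chunks; return (sha1_hex, created).

    Illustrative only: no CRLF conversion, no filters, no deltas.
    """
    size = os.stat(path).st_size
    header = b"blob %d\0" % size          # loose-object header (assumed layout)
    sha = hashlib.sha1(header)
    deflater = zlib.compressobj()
    fd, tmp = tempfile.mkstemp(dir=objects_dir)
    try:
        with os.fdopen(fd, "wb") as out, open(path, "rb") as src:
            out.write(deflater.compress(header))
            while True:
                chunk = src.read(CHUNK)               # read()
                if not chunk:
                    break
                sha.update(chunk)                     # SHA1sum()
                out.write(deflater.compress(chunk))   # deflate() + write()
            out.write(deflater.flush())
        name = sha.hexdigest()
        final = os.path.join(objects_dir, name[:2], name[2:])
        if os.path.exists(final):
            # Object already known: the deflate work is thrown away.
            os.unlink(tmp)
            return name, False
        os.makedirs(os.path.dirname(final), exist_ok=True)
        os.rename(tmp, final)
        return name, True
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Memory use stays bounded by CHUNK regardless of file size, at the cost described above: whether the object already exists is only known after the whole file has been read and deflated.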

Same thing on the checkout path: a simple loop to 
read()+inflate()+write() in small chunks.
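
The checkout side could be sketched the same way, again in Python and under the same assumptions: the object file is a zlib stream holding a "blob <size>\0" header followed by the content. inflate_chunks and stream_checkout are hypothetical names, not Git functions.

```python
import zlib

CHUNK = 64 * 1024  # hypothetical chunk size


def inflate_chunks(src, chunk=CHUNK):
    """Yield inflated data from file object `src`, piece by piece."""
    inflater = zlib.decompressobj()
    while True:
        raw = src.read(chunk)                         # read()
        if not raw:
            break
        while raw:
            # Cap the output of each call at `chunk` bytes; input that
            # was not consumed yet is handed back in unconsumed_tail.
            data = inflater.decompress(raw, chunk)    # inflate()
            raw = inflater.unconsumed_tail
            if data:
                yield data
    tail = inflater.flush()
    if tail:
        yield tail


def stream_checkout(object_path, dest_path):
    """Inflate a loose object into `dest_path`, dropping the header."""
    with open(object_path, "rb") as src, open(dest_path, "wb") as out:
        header = b""
        in_header = True
        for data in inflate_chunks(src):
            if in_header:
                # Accumulate until the NUL that ends "blob <size>\0".
                header += data
                nul = header.find(b"\0")
                if nul < 0:
                    continue
                data = header[nul + 1:]
                in_header = False
            out.write(data)                           # write()
```

Passing a maximum output length to decompress() is a choice made for this sketch so that a highly compressible object cannot expand into one huge buffer.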

That's the only sane way to kinda support big files with Git.

> I suppose that makes sense: for small files, using a patch application
> tool to reach a postimage that matches an existing object is something
> git historically needed to expect, but for typical big files:
> 
>  - once you've computed the SHA1, you've already invested a noticeable
>    amount of time.
>  - emailing patches around is difficult, making "git am" etc. less important.
>  - hopefully git or zlib can notice when files are uncompressible,
>    making the deflate not cost so much in that case.

Emailing is out of the question.  We're talking file sizes in the 
hundreds of megabytes and above here.  So yes, simply computing the SHA1 
is a significant cost, given that you are going to trash your page cache 
in the process already, so better pay the price of deflating it at the 
same time even if it turns out to be unnecessary.

Nicolas
