* large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-04  9:20 UTC
  To: git

Hi folks,

when adding files which are larger than available physical memory,
git performs very slowly. Perhaps it has to do with git's mmap()ing
the whole file. Is there any way to do it w/o mmap (hoping that
might perform a bit better)?

cu
--
----------------------------------------------------------------------
 Enrico Weigelt, metux IT service -- http://www.metux.de/
 phone: +49 36207 519931   email: weigelt@metux.de
 mobile: +49 151 27565287  icq: 210169427  skype: nekrad666
----------------------------------------------------------------------
 Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
----------------------------------------------------------------------

* Re: large files and low memory
  From: Shawn Pearce
  Date: 2010-10-04 18:05 UTC
  To: weigelt, git

On Mon, Oct 4, 2010 at 2:20 AM, Enrico Weigelt <weigelt@metux.de> wrote:
>
> when adding files which are larger than available physical memory,
> git performs very slowly. Perhaps it has to do with git's mmap()ing
> the whole file. Is there any way to do it w/o mmap (hoping that
> might perform a bit better)?

The mmap() isn't the problem. It's the allocation of a buffer that is
larger than the file in order to hold the result of deflating the file
before it gets written to disk. When the file is bigger than physical
memory, the kernel has to page in parts of the file as well as swap in
and out parts of that allocated buffer to hold the deflated file.

This is a known area in Git where big files aren't handled well.

--
Shawn.

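For illustration, the single-buffer pattern being described looks roughly
like this (a sketch assuming plain zlib, not git's actual code; the helper
name is made up, and real files bigger than zlib's uInt counters would need
more care). The point is the output buffer sized via deflateBound(), i.e.
at least as large as the input:

  #include <stdlib.h>
  #include <zlib.h>

  /* Sketch: deflate an entire (possibly mmap'd) file in one pass into one
   * malloc'd buffer.  For an N-byte file this needs roughly another N bytes
   * of output buffer on top of the N mapped bytes -- the allocation that
   * starts thrashing once N exceeds physical memory. */
  static void *deflate_whole(void *src, unsigned long len, unsigned long *out_len)
  {
      z_stream s = {0};
      unsigned long bound;
      void *out;

      deflateInit(&s, Z_DEFAULT_COMPRESSION);
      bound = deflateBound(&s, len);      /* >= len for incompressible data */
      out = malloc(bound);

      s.next_in = src;
      s.avail_in = len;
      s.next_out = out;
      s.avail_out = bound;
      deflate(&s, Z_FINISH);              /* one pass over the whole file */

      *out_len = s.total_out;
      deflateEnd(&s);
      return out;                         /* caller writes it out, then frees */
  }
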
* Re: large files and low memory
  From: Joshua Jensen
  Date: 2010-10-04 18:24 UTC
  To: Shawn Pearce
  Cc: weigelt, git

----- Original Message -----
From: Shawn Pearce
Date: 10/4/2010 12:05 PM
> On Mon, Oct 4, 2010 at 2:20 AM, Enrico Weigelt <weigelt@metux.de> wrote:
>> when adding files which are larger than available physical memory,
>> git performs very slowly. Perhaps it has to do with git's mmap()ing
>> the whole file. Is there any way to do it w/o mmap (hoping that
>> might perform a bit better)?
> The mmap() isn't the problem. It's the allocation of a buffer that is
> larger than the file in order to hold the result of deflating the file
> before it gets written to disk. When the file is bigger than physical
> memory, the kernel has to page in parts of the file as well as swap in
> and out parts of that allocated buffer to hold the deflated file.
>
> This is a known area in Git where big files aren't handled well.

As a curiosity, I've always done streaming decompression with zlib using
minimal buffer sizes (64k, perhaps). I'm sure there is a good reason why
Git doesn't do this (delta application?). Do you know what it is?

Josh

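The kind of streaming loop being alluded to, as a rough sketch (plain zlib,
a made-up helper, not git code): inflating between two stdio streams with
only a pair of fixed 64 KiB buffers, no matter how large the object is.

  #include <stdio.h>
  #include <zlib.h>

  #define CHUNK (64 * 1024)

  /* Sketch: stream-inflate from 'in' to 'out' using two fixed buffers. */
  static int inflate_stream(FILE *in, FILE *out)
  {
      unsigned char ibuf[CHUNK], obuf[CHUNK];
      z_stream s = {0};
      int ret = Z_OK;

      if (inflateInit(&s) != Z_OK)
          return -1;

      while (ret != Z_STREAM_END) {
          s.avail_in = fread(ibuf, 1, CHUNK, in);
          if (ferror(in) || s.avail_in == 0)
              break;                      /* read error or truncated input */
          s.next_in = ibuf;

          do {                            /* drain everything zlib produced */
              s.next_out = obuf;
              s.avail_out = CHUNK;
              ret = inflate(&s, Z_NO_FLUSH);
              if (ret != Z_OK && ret != Z_STREAM_END)
                  goto done;              /* corrupt stream or out of memory */
              fwrite(obuf, 1, CHUNK - s.avail_out, out);
          } while (s.avail_out == 0 && ret != Z_STREAM_END);
      }
  done:
      inflateEnd(&s);
      return ret == Z_STREAM_END ? 0 : -1;
  }
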
* Re: large files and low memory
  From: Shawn Pearce
  Date: 2010-10-04 18:57 UTC
  To: Joshua Jensen
  Cc: weigelt, git

On Mon, Oct 4, 2010 at 11:24 AM, Joshua Jensen <jjensen@workspacewhiz.com> wrote:
>> On Mon, Oct 4, 2010 at 2:20 AM, Enrico Weigelt <weigelt@metux.de> wrote:
>>>
>>> when adding files which are larger than available physical memory,
>>> git performs very slowly.
>>
>> The mmap() isn't the problem. It's the allocation of a buffer that is
>> larger than the file in order to hold the result of deflating the file
>> before it gets written to disk. ...
>> This is a known area in Git where big files aren't handled well.
>
> As a curiosity, I've always done streaming decompression with zlib using
> minimal buffer sizes (64k, perhaps). I'm sure there is a good reason why
> Git doesn't do this (delta application?). Do you know what it is?

Laziness. Git originally assumed it would only be used for smaller
source files written by humans. It's easier to write the code as a
single malloc'd buffer than to stream it. We'd like to fix it, but
it's harder than it sounds. Today we copy the file into a buffer
before we deflate and compute the SHA-1, as this prevents us from
getting into a consistency error when the file is modified between
these two stages.

--
Shawn.

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  0:59 UTC
  To: git

* Shawn Pearce <spearce@spearce.org> wrote:

> Laziness. Git originally assumed it would only be used for smaller
> source files written by humans. It's easier to write the code as a
> single malloc'd buffer than to stream it. We'd like to fix it, but
> it's harder than it sounds. Today we copy the file into a buffer
> before we deflate and compute the SHA-1, as this prevents us from
> getting into a consistency error when the file is modified between
> these two stages.

hmm, perhaps copy it to a temporary file if it's too large?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  7:41 UTC
  To: git

* Enrico Weigelt <weigelt@metux.de> wrote:

<snip>

Found another possible bottleneck: git-commit seems to scan through
a lot of files. Shouldn't it just create a commit object from the
current index and update the head?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Matthieu Moy
  Date: 2010-10-05  8:01 UTC
  To: git

Enrico Weigelt <weigelt@metux.de> writes:

> * Enrico Weigelt <weigelt@metux.de> wrote:
>
> <snip>
>
> Found another possible bottleneck: git-commit seems to scan through
> a lot of files. Shouldn't it just create a commit object from the
> current index and update the head?

git commit will show what's being committed (the output of "git commit
--dry-run") in your editor, hence it needs to compute that.

--
Matthieu Moy
http://www-verimag.imag.fr/~moy/

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  8:17 UTC
  To: git

* Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:

> git commit will show what's being committed (the output of "git commit
> --dry-run") in your editor, hence it needs to compute that.

hmm, is there any way to get around this?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Alex Riesen
  Date: 2010-10-05 11:29 UTC
  To: weigelt, git

On Tue, Oct 5, 2010 at 10:17, Enrico Weigelt <weigelt@metux.de> wrote:
> * Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:
>
>> git commit will show what's being committed (the output of "git commit
>> --dry-run") in your editor, hence it needs to compute that.
>
> hmm, is there any way to get around this?

Try "git commit -q -uno". This should skip creation of the summary in
the commit message and the lookup for untracked files. This will
somewhat speed things up.

* Re: large files and low memory
  From: Matthieu Moy
  Date: 2010-10-05 11:38 UTC
  To: Alex Riesen
  Cc: weigelt, git

Alex Riesen <raa.lkml@gmail.com> writes:

> On Tue, Oct 5, 2010 at 10:17, Enrico Weigelt <weigelt@metux.de> wrote:
>> * Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:
>>
>>> git commit will show what's being committed (the output of "git commit
>>> --dry-run") in your editor, hence it needs to compute that.
>>
>> hmm, is there any way to get around this?
>
> Try "git commit -q -uno". This should skip creation of the summary in
> the commit message and the lookup for untracked files.

To avoid including the summary, the option would be --no-status (-q
makes commit less verbose on stdout, not in COMMIT_EDITMSG).

But

  strace -fe lstat64 git commit -uno --no-status -q

still shows an lstat64 for each tracked file in my working tree (even
when using -m to avoid launching the editor). I don't know if this is
intended, or just that nobody cared enough to optimize this.

--
Matthieu Moy
http://www-verimag.imag.fr/~moy/

* Re: large files and low memory
  From: Nguyen Thai Ngoc Duy
  Date: 2010-10-05 11:55 UTC
  To: Matthieu Moy
  Cc: Alex Riesen, weigelt, git, Junio C Hamano

On Tue, Oct 5, 2010 at 6:38 PM, Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:
> Alex Riesen <raa.lkml@gmail.com> writes:
>
>> On Tue, Oct 5, 2010 at 10:17, Enrico Weigelt <weigelt@metux.de> wrote:
>>> * Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:
>>>
>>>> git commit will show what's being committed (the output of "git commit
>>>> --dry-run") in your editor, hence it needs to compute that.
>>>
>>> hmm, is there any way to get around this?
>>
>> Try "git commit -q -uno". This should skip creation of the summary in
>> the commit message and the lookup for untracked files.
>
> To avoid including the summary, the option would be --no-status (-q
> makes commit less verbose on stdout, not in COMMIT_EDITMSG).
>
> But
>
>   strace -fe lstat64 git commit -uno --no-status -q
>
> still shows an lstat64 for each tracked file in my working tree (even
> when using -m to avoid launching the editor). I don't know if this is
> intended, or just that nobody cared enough to optimize this.

I assume you do git-commit with no index modification at all. The
index refresh part dates back to 2888605 (builtin-commit: fix
partial-commit support - 2007-11-18). The commit message does not tell
why the index refresh is needed (for the summary stat maybe?). If so,
then we can skip refreshing if -q is given.
--
Duy

* Re: large files and low memory
  From: Junio C Hamano
  Date: 2010-10-05 16:42 UTC
  To: Nguyen Thai Ngoc Duy
  Cc: Matthieu Moy, Alex Riesen, weigelt, git, Junio C Hamano

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> I assume you do git-commit with no index modification at all. The
> index refresh part dates back to 2888605 (builtin-commit: fix
> partial-commit support - 2007-11-18). The commit message does not tell
> why the index refresh is needed (for the summary stat maybe?). If so,
> then we can skip refreshing if -q is given.

Most likely to give a clean index to the post-commit hook.

* Re: large files and low memory
  From: Nguyen Thai Ngoc Duy
  Date: 2010-10-05 10:13 UTC
  To: weigelt, git

On Tue, Oct 5, 2010 at 2:41 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> * Enrico Weigelt <weigelt@metux.de> wrote:
>
> <snip>
>
> Found another possible bottleneck: git-commit seems to scan through
> a lot of files. Shouldn't it just create a commit object from the
> current index and update the head?

You mean a lot of stat()? There is no way to avoid that unless you set
assume-unchanged bits. Or you could use
write-tree/commit-tree/update-ref directly.
--
Duy

* Re: large files and low memory
  From: Nicolas Pitre
  Date: 2010-10-05 19:12 UTC
  To: Nguyen Thai Ngoc Duy
  Cc: weigelt, git

On Tue, 5 Oct 2010, Nguyen Thai Ngoc Duy wrote:

> On Tue, Oct 5, 2010 at 2:41 PM, Enrico Weigelt <weigelt@metux.de> wrote:
>> * Enrico Weigelt <weigelt@metux.de> wrote:
>>
>> <snip>
>>
>> Found another possible bottleneck: git-commit seems to scan through
>> a lot of files. Shouldn't it just create a commit object from the
>> current index and update the head?
>
> You mean a lot of stat()? There is no way to avoid that unless you set
> assume-unchanged bits. Or you could use
> write-tree/commit-tree/update-ref directly.

Avoiding memory exhaustion is also going to help a lot, as the stat()
information will remain cached instead of requiring disk access. Just a
guess given $subject.

Nicolas

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-04 18:58 UTC
  To: Shawn Pearce
  Cc: weigelt, git

Shawn Pearce wrote:

> The mmap() isn't the problem. It's the allocation of a buffer that is
> larger than the file in order to hold the result of deflating the file
> before it gets written to disk.

Wasn't this already fixed, at least in some cases?

commit 9892bebafe0865d8f4f3f18d60a1cfa2d1447cd7 (tags/v1.7.0.2~11^2~1)
Author: Nicolas Pitre <nico@fluxnic.net>
Date:   Sat Feb 20 23:27:31 2010 -0500

    sha1_file: don't malloc the whole compressed result when writing out objects

    There is no real advantage to malloc the whole output buffer and
    deflate the data in a single pass when writing loose objects. That is
    like only 1% faster while using more memory, especially with large
    files where memory usage is far more. It is best to deflate and write
    the data out in small chunks reusing the same memory instead.

    For example, using 'git add' on a few large files averaging 40 MB ...

    Before:
    21.45user 1.10system 0:22.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+828040outputs (0major+142640minor)pagefaults 0swaps

    After:
    21.50user 1.25system 0:22.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+828040outputs (0major+104408minor)pagefaults 0swaps

    While the runtime stayed relatively the same, the number of minor page
    faults went down significantly.

    Signed-off-by: Nicolas Pitre <nico@fluxnic.net>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

* Re: large files and low memory
  From: Shawn Pearce
  Date: 2010-10-04 19:11 UTC
  To: Jonathan Nieder
  Cc: weigelt, git

On Mon, Oct 4, 2010 at 11:58 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Shawn Pearce wrote:
>
>> The mmap() isn't the problem. It's the allocation of a buffer that is
>> larger than the file in order to hold the result of deflating the file
>> before it gets written to disk.
>
> Wasn't this already fixed, at least in some cases?
>
> commit 9892bebafe0865d8f4f3f18d60a1cfa2d1447cd7 (tags/v1.7.0.2~11^2~1)
> Author: Nicolas Pitre <nico@fluxnic.net>
> Date:   Sat Feb 20 23:27:31 2010 -0500
>
>     sha1_file: don't malloc the whole compressed result when writing out objects

This change only removes the deflate copy. But due to the SHA-1
consistency issue I alluded to earlier, I think we're still making a
full copy of the file in memory before we SHA-1 it or deflate it. So
Nico halved the memory usage, but we're still using 1x the size of the
file rather than ~2x.

--
Shawn.

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-04 19:16 UTC
  To: Shawn Pearce
  Cc: weigelt, git

Shawn Pearce wrote:

> This change only removes the deflate copy. But due to the SHA-1
> consistency issue I alluded to earlier, I think we're still making a
> full copy of the file in memory before we SHA-1 it or deflate it.

Hmm, I _think_ we still use mmap for that (which is why 748af44c needs
to compare the sha1 before and after).

But

 1) a one-pass calculation would presumably be a little (5%?) faster
 2) if there are smudge/clean filters or autocrlf involved, the
    cleaned-up file is backed by swap and this all becomes moot.

* Re: large files and low memory
  From: Nguyen Thai Ngoc Duy
  Date: 2010-10-05 10:59 UTC
  To: Jonathan Nieder
  Cc: Shawn Pearce, weigelt, git

On Tue, Oct 5, 2010 at 2:16 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Shawn Pearce wrote:
>
>> This change only removes the deflate copy. But due to the SHA-1
>> consistency issue I alluded to earlier, I think we're still making a
>> full copy of the file in memory before we SHA-1 it or deflate it.
>
> Hmm, I _think_ we still use mmap for that (which is why 748af44c needs
> to compare the sha1 before and after).

I just tried valgrind massif on a 200MB file with master. It used
~270kb of heap. I haven't tested, but I believe git-checkout will keep
the whole inflated copy in memory, so git-add alone does not help much.
--
Duy

* Re: large files and low memory
  From: Nicolas Pitre
  Date: 2010-10-05 20:17 UTC
  To: Jonathan Nieder
  Cc: Shawn Pearce, weigelt, git

On Mon, 4 Oct 2010, Jonathan Nieder wrote:

> Shawn Pearce wrote:
>
>> This change only removes the deflate copy. But due to the SHA-1
>> consistency issue I alluded to earlier, I think we're still making a
>> full copy of the file in memory before we SHA-1 it or deflate it.
>
> Hmm, I _think_ we still use mmap for that (which is why 748af44c needs
> to compare the sha1 before and after).
>
> But
>
>  1) a one-pass calculation would presumably be a little (5%?) faster

You can't do a one-pass calculation. The first one is required to
compute the SHA1 of the file being added, and if that corresponds to an
object that we already have, then the operation stops right there as
there is actually nothing to do. The second pass is to deflate the
data and recompute the SHA1 to make sure what we deflated and wrote out
is still the same data.

In the case of big files, what we need to do is to stream the file data
in, compute the SHA1 and deflate it, in order to stream it out into a
temporary file, then rename it according to the final SHA1. This would
allow Git to work with big files, but of course it won't be possible to
know whether the object corresponding to the file is already known until
all the work has been done, possibly just to throw it away. But
normally big files are the minority.

Nicolas

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-05 20:34 UTC
  To: Nicolas Pitre
  Cc: Shawn Pearce, weigelt, git

Nicolas Pitre wrote:

> You can't do a one-pass calculation. The first one is required to
> compute the SHA1 of the file being added, and if that corresponds to an
> object that we already have, then the operation stops right there as
> there is actually nothing to do.

Ah. Thanks for the reminder.

> In the case of big files, what we need to do is to stream the file data
> in, compute the SHA1 and deflate it, in order to stream it out into a
> temporary file, then rename it according to the final SHA1. This would
> allow Git to work with big files, but of course it won't be possible to
> know whether the object corresponding to the file is already known until
> all the work has been done, possibly just to throw it away.

To make sure I understand correctly: are you suggesting that for big
files we should skip the first pass?

I suppose that makes sense: for small files, using a patch application
tool to reach a postimage that matches an existing object is something
git historically needed to expect, but for typical big files:

 - once you've computed the SHA1, you've already invested a noticeable
   amount of time.
 - emailing patches around is difficult, making "git am" etc. less
   important.
 - hopefully git or zlib can notice when files are uncompressible,
   making the deflate not cost so much in that case.

* Re: large files and low memory
  From: Nicolas Pitre
  Date: 2010-10-05 21:11 UTC
  To: Jonathan Nieder
  Cc: Shawn Pearce, weigelt, git

On Tue, 5 Oct 2010, Jonathan Nieder wrote:

> Nicolas Pitre wrote:
>
>> You can't do a one-pass calculation. The first one is required to
>> compute the SHA1 of the file being added, and if that corresponds to an
>> object that we already have, then the operation stops right there as
>> there is actually nothing to do.
>
> Ah. Thanks for the reminder.
>
>> In the case of big files, what we need to do is to stream the file data
>> in, compute the SHA1 and deflate it, in order to stream it out into a
>> temporary file, then rename it according to the final SHA1. This would
>> allow Git to work with big files, but of course it won't be possible to
>> know whether the object corresponding to the file is already known until
>> all the work has been done, possibly just to throw it away.
>
> To make sure I understand correctly: are you suggesting that for big
> files we should skip the first pass?

For big files we need a totally separate code path that processes the
file data in small chunks at 'git add' time, using a loop containing
read()+SHA1sum()+deflate()+write(). Then, if the SHA1 matches an
existing object we delete the temporary output file, otherwise we
rename it as a valid object. No CRLF, no smudge filters, no diff, no
deltas, just plain storage of huge objects, based on the value of the
core.bigFileThreshold config option. Same thing on the checkout path:
a simple loop to read()+inflate()+write() in small chunks. That's the
only sane way to kinda support big files with Git.

> I suppose that makes sense: for small files, using a patch application
> tool to reach a postimage that matches an existing object is something
> git historically needed to expect, but for typical big files:
>
>  - once you've computed the SHA1, you've already invested a noticeable
>    amount of time.
>  - emailing patches around is difficult, making "git am" etc. less
>    important.
>  - hopefully git or zlib can notice when files are uncompressible,
>    making the deflate not cost so much in that case.

Emailing is out of the question. We're talking file sizes in the
hundreds of megabytes and above here. So yes, simply computing the
SHA1 is a significant cost, given that you are going to trash your page
cache in the process already, so better to pay the price of deflating
it at the same time even if it turns out to be unnecessary.

Nicolas

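The loop being proposed would look roughly like this (a rough sketch only,
assuming OpenSSL's SHA1 functions and plain zlib; the helper and its
parameters are made up for illustration, this is not git's code): read the
file in fixed-size chunks, feed each chunk to both the SHA-1 and the
deflater, stream the deflated bytes into a temporary file, and only learn
the object name at the very end.

  #include <stdio.h>
  #include <unistd.h>
  #include <zlib.h>
  #include <openssl/sha.h>

  #define CHUNK (128 * 1024)

  /* Sketch of a streaming add path: constant memory, one read pass.
   * The caller renames the temp file into the object store if the
   * resulting SHA-1 names a new object, or deletes it otherwise. */
  static int stream_blob(int in_fd, off_t size, int tmp_fd,
                         unsigned char sha1[SHA_DIGEST_LENGTH])
  {
      unsigned char ibuf[CHUNK], obuf[CHUNK];
      char hdr[32];
      int hdrlen, ret;
      ssize_t n;
      SHA_CTX ctx;
      z_stream z = {0};

      /* Loose objects hash and store a "blob <size>\0" header first. */
      hdrlen = snprintf(hdr, sizeof(hdr), "blob %lu", (unsigned long)size) + 1;

      SHA1_Init(&ctx);
      deflateInit(&z, Z_DEFAULT_COMPRESSION);

      SHA1_Update(&ctx, hdr, hdrlen);
      z.next_in = (unsigned char *)hdr;
      z.avail_in = hdrlen;

      for (;;) {
          if (!z.avail_in) {              /* previous chunk fully consumed */
              n = read(in_fd, ibuf, CHUNK);
              if (n < 0)
                  return -1;
              if (!n)
                  break;                  /* end of input, go flush */
              SHA1_Update(&ctx, ibuf, n);
              z.next_in = ibuf;
              z.avail_in = n;
          }
          z.next_out = obuf;
          z.avail_out = CHUNK;
          deflate(&z, Z_NO_FLUSH);
          if (write(tmp_fd, obuf, CHUNK - z.avail_out) < 0)
              return -1;
      }

      do {                                /* flush whatever zlib still holds */
          z.next_out = obuf;
          z.avail_out = CHUNK;
          ret = deflate(&z, Z_FINISH);
          if (write(tmp_fd, obuf, CHUNK - z.avail_out) < 0)
              return -1;
      } while (ret != Z_STREAM_END);

      deflateEnd(&z);
      SHA1_Final(sha1, &ctx);
      return 0;
  }
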
* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  0:57 UTC
  To: git

* Jonathan Nieder <jrnieder@gmail.com> wrote:
> Shawn Pearce wrote:
>
>> The mmap() isn't the problem. It's the allocation of a buffer that is
>> larger than the file in order to hold the result of deflating the file
>> before it gets written to disk.
>
> Wasn't this already fixed, at least in some cases?
>
> commit 9892bebafe0865d8f4f3f18d60a1cfa2d1447cd7 (tags/v1.7.0.2~11^2~1)
> Author: Nicolas Pitre <nico@fluxnic.net>
> Date:   Sat Feb 20 23:27:31 2010 -0500

I guess I'll have to do an update.

But: the latest tag (1.7.3.1) doesn't build:

    CC read-cache.o
read-cache.c: In function `fill_stat_cache_info':
read-cache.c:73: structure has no member named `st_ctim'
read-cache.c:74: structure has no member named `st_mtim'
read-cache.c: In function `read_index_from':
read-cache.c:1334: structure has no member named `st_mtim'
read-cache.c: In function `write_index':
read-cache.c:1614: structure has no member named `st_mtim'
make: *** [read-cache.o] Fehler 1

Is my libc too old?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Ævar Arnfjörð Bjarmason
  Date: 2010-10-05  1:07 UTC
  To: weigelt, git

On Tue, Oct 5, 2010 at 00:57, Enrico Weigelt <weigelt@metux.de> wrote:
> * Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Shawn Pearce wrote:
>>
>>> The mmap() isn't the problem. It's the allocation of a buffer that is
>>> larger than the file in order to hold the result of deflating the file
>>> before it gets written to disk.
>>
>> Wasn't this already fixed, at least in some cases?
>>
>> commit 9892bebafe0865d8f4f3f18d60a1cfa2d1447cd7 (tags/v1.7.0.2~11^2~1)
>> Author: Nicolas Pitre <nico@fluxnic.net>
>> Date:   Sat Feb 20 23:27:31 2010 -0500
>
> I guess I'll have to do an update.
>
> But: the latest tag (1.7.3.1) doesn't build:
>
>     CC read-cache.o
> read-cache.c: In function `fill_stat_cache_info':
> read-cache.c:73: structure has no member named `st_ctim'
> read-cache.c:74: structure has no member named `st_mtim'
> read-cache.c: In function `read_index_from':
> read-cache.c:1334: structure has no member named `st_mtim'
> read-cache.c: In function `write_index':
> read-cache.c:1614: structure has no member named `st_mtim'
> make: *** [read-cache.o] Fehler 1
>
> Is my libc too old?

Those lines are accessing members called st_ctime, i.e. with an "e" at
the end, but your errors just report "st_ctim". What gives?

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-05  1:10 UTC
  To: git
  Cc: Enrico Weigelt

Enrico Weigelt wrote:

>     CC read-cache.o
> read-cache.c: In function `fill_stat_cache_info':
> read-cache.c:73: structure has no member named `st_ctim'
> read-cache.c:74: structure has no member named `st_mtim'
> read-cache.c: In function `read_index_from':
> read-cache.c:1334: structure has no member named `st_mtim'
> read-cache.c: In function `write_index':
> read-cache.c:1614: structure has no member named `st_mtim'
> make: *** [read-cache.o] Fehler 1
>
> Is my libc too old?

What platform are you on? You probably need USE_ST_TIMESPEC; if so,
please send a makefile patch so the next person trying it doesn't need
to worry about it.

Also, please don't destroy the cc lists; it makes it hard for people
with some mail setups (e.g., mine) to notice when you've replied to
them.

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  7:35 UTC
  To: git

* Jonathan Nieder <jrnieder@gmail.com> wrote:

> What platform are you on?

GNU/Linux. glibc-2.25

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-05 13:47 UTC
  To: git
  Cc: Enrico Weigelt, Ævar Arnfjörð Bjarmason

Enrico Weigelt wrote:
> * Jonathan Nieder <jrnieder@gmail.com> wrote:
>> What platform are you on?
>
> GNU/Linux. glibc-2.25

Hmm, I've heard of glib 2.25 but never glibc 2.25. :)

  $ /lib/libc.so.6 | head -1
  GNU C Library (Debian EGLIBC 2.11.2-6) stable release version 2.11.2, by Roland McGrath et al.

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  0:50 UTC
  To: git

* Shawn Pearce <spearce@spearce.org> wrote:

> The mmap() isn't the problem. It's the allocation of a buffer that is
> larger than the file in order to hold the result of deflating the file
> before it gets written to disk. When the file is bigger than physical
> memory, the kernel has to page in parts of the file as well as swap in
> and out parts of that allocated buffer to hold the deflated file.

What are the access patterns of these memory areas? Perhaps madvise()
could help?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Nicolas Pitre
  Date: 2010-10-05 19:06 UTC
  To: Enrico Weigelt
  Cc: git

On Tue, 5 Oct 2010, Enrico Weigelt wrote:

> * Shawn Pearce <spearce@spearce.org> wrote:
>
>> The mmap() isn't the problem. It's the allocation of a buffer that is
>> larger than the file in order to hold the result of deflating the file
>> before it gets written to disk. When the file is bigger than physical
>> memory, the kernel has to page in parts of the file as well as swap in
>> and out parts of that allocated buffer to hold the deflated file.
>
> What are the access patterns of these memory areas?

Perfectly linear.

> Perhaps madvise() could help?

Perhaps.

Nicolas

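For reference, the kind of hint being discussed looks roughly like this
(an illustrative sketch with a made-up helper, not a patch; where such a
call would actually belong in git is exactly the open question here):
telling the kernel a mapping will be read strictly sequentially, so pages
behind the read cursor become cheap eviction candidates.

  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Sketch: map a file read-only and hint a linear access pattern so the
   * kernel can read ahead aggressively and drop already-consumed pages
   * instead of letting them crowd out other memory. */
  static const void *map_sequential(const char *path, size_t *len)
  {
      struct stat st;
      void *map;
      int fd = open(path, O_RDONLY);

      if (fd < 0)
          return NULL;
      if (fstat(fd, &st) < 0) {
          close(fd);
          return NULL;
      }

      map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      close(fd);                        /* mapping stays valid after close */
      if (map == MAP_FAILED)
          return NULL;

      madvise(map, st.st_size, MADV_SEQUENTIAL);
      *len = st.st_size;
      return map;
  }
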
* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05 22:51 UTC
  To: git

* Nicolas Pitre <nico@fluxnic.net> wrote:

>>> The mmap() isn't the problem. It's the allocation of a buffer that is
>>> larger than the file in order to hold the result of deflating the file
>>> before it gets written to disk. When the file is bigger than physical
>>> memory, the kernel has to page in parts of the file as well as swap in
>>> and out parts of that allocated buffer to hold the deflated file.
>>
>> What are the access patterns of these memory areas?
>
> Perfectly linear.

In this case, I wonder why my machine goes into thrashing so easily
(P3 w/ 256MB RAM). Seems the mmu/paging code doesn't recognize that the
previously-used pages can be kicked off quickly ;-o Perhaps I should
talk to the kernel folks.

>> Perhaps madvise() could help?
>
> Perhaps.

hmm, so we should try it ;-p  Where'd be the right place to add it?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/
