* Re: malloc fails when dealing with huge files
2008-12-10 19:32 ` Linus Torvalds
@ 2008-12-11 0:16 ` Jeff Whiteside
2008-12-11 9:11 ` Johannes Schindelin
1 sibling, 0 replies; 4+ messages in thread
From: Jeff Whiteside @ 2008-12-11 0:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jonathan Blanton, git
I tried to do something like that over a year ago, having gotten the
insane idea that I wanted to version my whole hard drive. Binaries
were a huge problem.
Clones were also a problem over slow connections because there is
no git-clone --resume, so if your connection is interrupted, you're
back at square one. Perhaps git-torrent will fix that.
Git wasn't designed to be file-based so much as line/code-based. Let
me know if you find a better alternative to git for versioning filesystems.
It's too bad there isn't a better way to keep resources tagged to a
version by a SHA-1 while keeping them separate from the source.
On Wed, Dec 10, 2008 at 11:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Wed, 10 Dec 2008, Jonathan Blanton wrote:
>>
>> I'm using Git for a project that contains huge (multi-gigabyte) files.
>> I need to track these files, but with some of the really big ones,
>> git-add aborts with the message "fatal: Out of memory, malloc failed".
>
> git is _really_ not designed for huge files.
>
> By design - good or bad - git does pretty much all single file operations
> with the whole file in memory as one single allocation.
>
> Now, some of that is hard to fix - or at least would generate much more
> complex code. The _particular_ case of "git add" could be fixed without
> undue pain, but it's not entirely trivial either.
>
> The main offender is probably "index_fd()" that just mmap's the whole file
> in one go and then calls write_sha1_file() which really expects it to be
> one single memory area both for the initial SHA1 create and for the
> compression and writing out of the result.
>
> Changing that to do big files in pieces would not be _too_ painful, but
> it's not just a couple of lines either.
>
> However, git performance with big files would never be wonderful, and
> things like "git diff" would still end up reading not just the whole file,
> but _both_versions_ at the same time. Marking the big files as being
> no-diff might help, though.
>
>
> Linus
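For the curious, here is a minimal sketch of the "in pieces" approach for
the hashing half of what Linus describes: feed the "blob <size>\0" header
and then the file contents to SHA-1 incrementally, so that no single
allocation ever holds the whole file. This is illustrative rather than git
source; it borrows OpenSSL's SHA1 routines, and the chunk size, function
name and error handling are made up for the example. The compression half
of write_sha1_file() would need the same streaming treatment via zlib's
deflate().

/*
 * Sketch only, not git code: hash a blob in fixed-size chunks
 * instead of one big mmap.  The "blob <size>\0" header is what
 * git really prepends before hashing; everything else here is
 * illustrative.
 */
#include <stdio.h>
#include <openssl/sha.h>

#define CHUNK (8 * 1024 * 1024)    /* 8MB at a time; arbitrary */

static int hash_blob_chunked(FILE *fp, unsigned long size,
                             unsigned char sha1[20])
{
    static unsigned char buf[CHUNK];
    char hdr[32];
    int hdrlen;
    SHA_CTX ctx;

    /* git hashes "blob <decimal size>" plus the trailing NUL */
    hdrlen = snprintf(hdr, sizeof(hdr), "blob %lu", size) + 1;

    SHA1_Init(&ctx);
    SHA1_Update(&ctx, hdr, hdrlen);

    while (size) {
        size_t n = size < CHUNK ? size : CHUNK;
        if (fread(buf, 1, n, fp) != n)
            return -1;                 /* short read */
        SHA1_Update(&ctx, buf, n);     /* only CHUNK bytes live at once */
        size -= n;
    }
    SHA1_Final(sha1, &ctx);
    return 0;
}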
* Re: malloc fails when dealing with huge files
2008-12-10 19:32 ` Linus Torvalds
2008-12-11 0:16 ` Jeff Whiteside
@ 2008-12-11 9:11 ` Johannes Schindelin
1 sibling, 0 replies; 4+ messages in thread
From: Johannes Schindelin @ 2008-12-11 9:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jonathan Blanton, git
Hi,
On Wed, 10 Dec 2008, Linus Torvalds wrote:
> However, git performance with big files would never be wonderful, and
> things like "git diff" would still end up reading not just the whole
> file, but _both_versions_ at the same time. Marking the big files as
> being no-diff might help, though.
Makes me wonder if we should not have a default cut-off, say, 10MB, at
which files are automatically tagged with the no-diff attribute (unless
overridden explicitly in .gitattributes)?
Ciao,
Dscho
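Until something like that exists, the cut-off can be approximated by hand.
A hypothetical .gitattributes along these lines (the patterns are made up;
only the -diff attribute itself is real) makes diff treat the big files as
binary instead of reading both versions into memory:

    # mark known-huge files as no-diff by hand
    *.iso -diff
    *.img -diff

An automatic 10MB cut-off would presumably just set the same attribute
behind the user's back.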