* Git memory usage (1): fast-import
From: Sam Hocevar @ 2009-03-07 20:19 UTC
  To: git

   I joined a project that uses very large binary files (up to 1 GiB) in
a p4 repository and as I would like to use Git, I am trying to make it
more memory-efficient when handling huge files.

   The first problem I am hitting is with fast-import. Currently it
keeps the last imported file in memory (at the end of store_object())
in order to find interesting deltas against the next file. Since most
huge binary files are already compressed, fast-importing two large
X MiB files is going to use at least 3X MiB of memory: once for the
first file, once for the second file, and once for the deflated data,
which is going to be about as large as the file itself.
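
   Schematically, the pattern looks something like this (a simplified
sketch using plain zlib, not the actual store_object() code, but it
shows where the three X MiB buffers come from):

#include <zlib.h>
#include <stdlib.h>
#include <string.h>

static unsigned char *last_blob;    /* previous file, kept for deltas: X MiB */
static size_t last_len;

static void store_blob_sketch(unsigned char *data, size_t len)
{
    z_stream s;
    memset(&s, 0, sizeof(s));
    deflateInit(&s, Z_DEFAULT_COMPRESSION);

    /* deflateBound() is >= len, and data that is already compressed
     * barely shrinks, so this buffer is another X MiB. */
    unsigned long out_sz = deflateBound(&s, len);
    unsigned char *out = malloc(out_sz);

    s.next_in = data;               /* the current file itself: X MiB */
    s.avail_in = len;
    s.next_out = out;
    s.avail_out = out_sz;
    deflate(&s, Z_FINISH);          /* out_sz is large enough for one call */
    /* ... write s.total_out deflated bytes to the pack here ... */
    deflateEnd(&s);
    free(out);

    /* keep the current file in memory to deltify against the next one */
    free(last_blob);
    last_blob = malloc(len);
    memcpy(last_blob, data, len);
    last_len = len;
}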

   In practice, it takes even more memory than that. Experiment shows
that importing six 100 MiB files made of urandom data takes 370 MiB of
memory (http://zoy.org/~sam/git/git-memory-usage.png) (simple script
available at http://zoy.org/~sam/git/gencommit.txt). I am unable to plot
how it behaves with 1 GiB files since I don't have enough memory, but I
don't see why the trend wouldn't hold.

   I can understand this not being a priority, but I'm trying to think
of acceptable ways to fix it that do not mess with Git's performance in
the more common cases. Here is what I can think of:

   - stop trying to compute deltas in fast-import and leave that task
   to other tools (optionally, define a file size threshold beyond
   which the last file is not kept in memory, and maybe make that a
   configuration option).

   - use a temporary file to store the deflate data when it reaches a
   given size threshold (and maybe make that a configuration option).

   - also, I haven't tracked all strbuf_* uses in fast-import, but I got
   the feeling that strbuf_release() could be used in a few places
   instead of strbuf_setlen(0) in order to free some memory (see the
   sketch right after this list).
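
   To illustrate that last point, here is a trivial sketch of the
difference between the two calls (using git's strbuf API, but not a
claim about any particular spot in fast-import):

#include "strbuf.h"

static void sketch(void)
{
    struct strbuf buf = STRBUF_INIT;

    strbuf_grow(&buf, 100 << 20);   /* the buffer grows to ~100 MiB */

    strbuf_setlen(&buf, 0);         /* length reset to 0, but the 100 MiB
                                     * allocation is kept around for reuse */

    strbuf_release(&buf);           /* the allocation is actually freed */
}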

   Any thoughts?

-- 
Sam.


* Re: Git memory usage (1): fast-import
From: Shawn O. Pearce @ 2009-03-07 21:01 UTC
  To: Sam Hocevar; +Cc: git

Sam Hocevar <sam@zoy.org> wrote:
>    I joined a project that uses very large binary files (up to 1 GiB) in
> a p4 repository and as I would like to use Git, I am trying to make it
> more memory-efficient when handling huge files.

Yikes.  As you saw, this won't play well...
 
>    In practice, it takes even more memory than that. Experiment shows
> that importing six 100 MiB files made of urandom data takes 370 MiB of
> memory [...]

Yes.

As you saw, this is the last object, the current object, the delta
index of the last object (in order to more efficiently compare the
current one to it), and the deflate buffer for the current object,
oh, and probably memory fragmentation....

I'm not surprised a 100 MiB file turned into 370 MiB heap usage.
 
>    - stop trying to compute deltas in fast-import and leave that task
>    to other tools

This isn't practical for source code imports, unless we do...

> (optionally, define a file size threshold beyond
>    which the last file is not kept in memory, and maybe make that a
>    configuration option).

what you suggest here.  fast-import is faster than other methods
because we get some delta compression on the content, so the output
pack uses up less virtual memory when the front-end or end-user
finally gets around to doing `git repack -a -d -f` to recompute
the delta chains.

>    - use a temporary file to store the deflate data when it reaches a
>    given size threshold (and maybe make that a configuration option).

Zoiks.  There's no reason for that.

A better method would be to just look at the size of the incoming
blob, and if it's over some configured threshold (a default of, say,
100 MB is perhaps sane) we just stream the data through deflate()
and into the pack file, with no delta compression.

That would also bypass the "massive" buffer in the last object slot,
as you point out above.
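
Something along these lines, say (made-up names and plain zlib; the
real thing would also have to hash the content and write the pack
object header before the deflated bytes):

#include <zlib.h>
#include <stdio.h>
#include <string.h>

#define CHUNK (128 * 1024)

/* Deflate `len` bytes from `in` straight into `pack`, one fixed-size
 * chunk at a time, so memory use stays O(CHUNK) instead of O(len). */
static int stream_blob_sketch(FILE *in, FILE *pack, size_t len)
{
    unsigned char ibuf[CHUNK], obuf[CHUNK];
    z_stream s;

    memset(&s, 0, sizeof(s));
    if (deflateInit(&s, Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;

    while (len) {
        size_t n = len < CHUNK ? len : CHUNK;
        if (fread(ibuf, 1, n, in) != n) {
            deflateEnd(&s);
            return -1;
        }
        len -= n;

        s.next_in = ibuf;
        s.avail_in = n;
        do {
            s.next_out = obuf;
            s.avail_out = CHUNK;
            deflate(&s, len ? Z_NO_FLUSH : Z_FINISH);
            fwrite(obuf, 1, CHUNK - s.avail_out, pack);
        } while (s.avail_out == 0);
    }
    deflateEnd(&s);
    return 0;
}

fast-import would only take this path for blobs over the configured
size threshold; everything below it keeps the current in-memory path
and the delta search.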
 
>    - also, I haven't tracked all strbuf_* uses in fast-import, but I got
>    the feeling that strbuf_release() could be used in a few places
>    instead of strbuf_setlen(0) in order to free some memory.

Examples?  I haven't gone through the code in detail since it
was modified to use strbufs.  But I had the feeling that the strbufs
the code doesn't free are ones it simply reuses on the next command,
and that those are likely to be "smallish", e.g. just a few KiB in size.

-- 
Shawn.

