From: Sam Hocevar <sam@zoy.org>
To: git@vger.kernel.org
Subject: Git memory usage (1): fast-import
Date: Sat, 7 Mar 2009 21:19:20 +0100
Message-ID: <20090307201920.GE12880@zoy.org>

I joined a project that uses very large binary files (up to 1 GiB) in
a p4 repository and as I would like to use Git, I am trying to make it
more memory-efficient when handling huge files.

The first problem I am hitting is with fast-import. Currently it
keeps the last imported file in memory (see the end of store_object())
in order to find interesting deltas with the next file. Since most huge
binary files are already compressed, fast-importing two large X MiB
files is going to use at least 3X MiB of memory: once for the first
file, once for the second file, and once for the deflated data, which
is going to be as large as the file itself.
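
As a rough illustration of that lower bound, here is a small sketch. It
is not fast-import code: the function name is hypothetical, and it
assumes the deflated output is as large as the input, as with data that
is already compressed.

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: lower bound (in MiB) on
 * memory held while importing a second incompressible file of
 * file_mib MiB right after a first one of the same size.
 */
static size_t peak_lower_bound_mib(size_t file_mib)
{
    size_t previous = file_mib; /* last imported file, kept as delta base */
    size_t current  = file_mib; /* file currently being imported */
    size_t deflated = file_mib; /* deflate output, assumed ~input size */

    return previous + current + deflated;
}
```

For X = 100 this gives 300 MiB, and for X = 1024 (1 GiB files) it gives
3072 MiB.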

In practice, it takes even more memory than that. An experiment shows
that importing six 100 MiB files made of urandom data takes 370 MiB of
memory (http://zoy.org/~sam/git/git-memory-usage.png; a simple script
is available at http://zoy.org/~sam/git/gencommit.txt). I am unable to
plot how it behaves with 1 GiB files since I don't have enough memory,
but I don't see why the trend wouldn't hold.

I can understand it not being a priority, but I'm trying to think of
acceptable ways to fix this that do not mess with Git's performance in
more usual cases. Here is what I can think of:

 - stop trying to compute deltas in fast-import and leave that task
   to other tools (optionally, define a file size threshold beyond
   which the last file is not kept in memory, and maybe make that a
   configuration option).

 - use a temporary file to store the deflate data when it reaches a
   given size threshold (and maybe make that a configuration option).

 - also, I haven't tracked all strbuf_* uses in fast-import, but I got
   the feeling that strbuf_release() could be used in a few places
   instead of strbuf_setlen(0) in order to free some memory.
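
The size-threshold idea from the first item could look roughly like the
sketch below. Everything in it is an assumption made for illustration:
struct last_blob, keep_threshold and remember_last_blob() are
hypothetical names, not fast-import's actual data structures, and the
32 MiB default is arbitrary.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the "last imported file" slot. */
struct last_blob {
    char *data;
    size_t len;
};

/* Hypothetical threshold in bytes; would come from configuration. */
static size_t keep_threshold = 32u * 1024u * 1024u;

/*
 * Takes ownership of buf. Keeps it as the delta base for the next
 * file only if it does not exceed the threshold; otherwise frees it
 * right away so a huge file is never held across two imports.
 * Returns 1 if the buffer was kept, 0 if it was dropped.
 */
static int remember_last_blob(struct last_blob *last, char *buf, size_t len)
{
    free(last->data); /* the previous base is no longer needed */

    if (len > keep_threshold) {
        free(buf);
        last->data = NULL;
        last->len = 0;
        return 0;
    }

    last->data = buf;
    last->len = len;
    return 1;
}
```

With such a policy, small files would still be delta-compressed against
their predecessor, while files above the threshold would cost one copy
less in memory and be left for a later repack to deltify.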

Any thoughts?

--
Sam.