From: Jeff King <peff@peff.net>
To: John <john@puckerupgames.com>
Cc: git@vger.kernel.org
Subject: Re: serious performance issues with images, audio files, and other "non-code" data
Date: Mon, 17 May 2010 19:16:43 -0400
Message-ID: <20100517231642.GB12092@coredump.intra.peff.net>
In-Reply-To: <4BED47EA.9090905@puckerupgames.com>
On Fri, May 14, 2010 at 08:54:02AM -0400, John wrote:
> Thanks so much. It's version 1.5.6.5. I compiled it 3 months ago. For
By git standards, that version is ancient. You may want to try with a
more recent version of git (at the very least, multithreaded delta
compression has been enabled by default since then).
> I packed the bare repo, then ran `gc --aggressive`.
Note that "gc --aggressive" will repack from scratch, throwing away the
previous pack.
> Then I did a `git pull`, which took 35 minutes.
That sounds like a long time. What was taking so long? Was delta
compression pegging the CPU? Or was the time spent in the "Writing objects"
phase, which is bound by either disk I/O or network speed?
How big is your packed repo? Given the pattern you describe below, I am
beginning to wonder whether, even though a single checkout of your repo
isn't that large, the complete history of your project is simply
gigantic (e.g., because you are repeatedly writing new, apparently
random versions of each file, so your repository size grows quite
quickly).
Remember that a git clone transfers the full history (and a pull will
transfer all of the intermediate history). If you have rewritten those
files many times, you may be transferring many times your working
directory size in history.
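One way to check this is to compare the size of a single checkout with the size of the packed history. The following is only a sketch: the helper names checkout_kib and packed_kib are made up for illustration, and it assumes a git recent enough to support "git ls-tree -l" and "git count-objects -v".

```shell
#!/bin/sh
# Sketch: compare the size of one checkout with the size of the packed
# history. The helper names are hypothetical, not git commands.

# Total size of all files in the HEAD commit, in KiB.
# "git ls-tree -r -l" prints each blob's size in the fourth column.
checkout_kib() {
    git ls-tree -r -l HEAD | awk '{ s += $4 } END { print int(s / 1024) }'
}

# Disk space used by packfiles, in KiB (the "size-pack" field).
packed_kib() {
    git count-objects -v | sed -n 's/^size-pack: //p'
}

# Only report when run inside a repository that has at least one commit.
if git rev-parse --verify -q HEAD >/dev/null 2>&1; then
    echo "checkout: $(checkout_kib) KiB"
    echo "packed:   $(packed_kib) KiB"
fi
```

If the packed figure is many times the checkout figure, most of a clone's transfer time is history rather than the current files. Note that size-pack only counts packed objects, so run this after a repack.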
> You can simulate it all by generating a batch of 1-100 MB files from
> /dev/urandom (since they won't compress), commit them, then do it
> again many times to simulate edits. Every few iterates, push it
> somewhere.
I tried this script to make a 100M working directory with a 400M .git
directory:
-- >8 --
#!/bin/sh

rm -rf big-repo
mkdir big-repo && cd big-repo && git init

mark() {
    echo "$(date) $*"
}

randomize() {
    mark randomize start
    for i in $(seq 1 100); do
        openssl rand $((1024*1024)) >$i.rand
    done
    mark randomize end
}

commit() {
    mark add start
    git add .
    mark add end
    mark commit start
    git commit -m "$1"
    mark commit end
}

randomize; commit base
randomize; commit one
randomize; commit two
randomize; commit three
-- 8< --
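The 100M versus 400M figures follow from simple arithmetic: random data neither compresses nor deltas, so each commit stores a full new copy of every file. A back-of-the-envelope sketch (history_mib is a hypothetical helper, and the estimate ignores per-object overhead):

```shell
#!/bin/sh
# Rough estimate of history size for incompressible files that are
# fully rewritten in every commit.
# Usage: history_mib FILES MIB_PER_FILE COMMITS
history_mib() {
    echo $(( $1 * $2 * $3 ))
}

# The script above: 100 files of 1 MiB each, committed 4 times.
history_mib 100 1 4    # prints 400
```

The working directory only ever holds one generation (100 MiB), while the history grows by that much with every commit.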
Here are a few timings I noted:
  - it takes about 5 seconds to generate and write the random data

  - git add runs in about 13 seconds. It pegs the CPU hashing all of the
    data.

  - the first commit is nearly instantaneous, as the summary diff takes
    no work; subsequent commits spend about 9 seconds creating the
    summary diff. Changing commit to "commit -q" drops that back to
    near-instantaneous.

  - with no attributes set, "time git gc --aggressive" reports:

      real    1m31.983s
      user    2m29.621s
      sys     0m3.732s

    Note the real/user discrepancy. It's a dual-core machine, and recent
    git will multi-thread the delta phase, which is what dominates the
    time. This should correspond roughly to the delta-compression phase
    of your pull time, as that was just making a pack on the fly (but
    now that we are packed, pulls will be limited only by the time to
    transfer the objects themselves).

  - Turning off delta compression for the .rand files makes repacking
    much faster:

      $ echo '*.rand -delta' >.gitattributes
      $ time git gc --aggressive
      ...
      real    0m25.354s
      user    0m22.057s
      sys     0m1.316s

    The delta compression phase is very quick, and we spend most of our
    time writing out the packfile to disk.
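Before repacking, it can be worth confirming that the attribute pattern matches the files you expect. A small sketch using "git check-attr" (delta_state is a hypothetical wrapper, and main.c is just an example path with no rule attached):

```shell
#!/bin/sh
# Hypothetical helper: print how git resolves the "delta" attribute for
# a path. "unset" means -delta applies; "unspecified" means no rule
# matched and git will attempt delta compression as usual.
delta_state() {
    git check-attr delta -- "$1" | sed 's/^.*: delta: //'
}

# Demo in a throwaway repository (run in a subshell so the caller's
# working directory is left alone).
(
    tmp=$(mktemp -d) && cd "$tmp" && git init -q || exit 1
    echo '*.rand -delta' >.gitattributes
    delta_state 1.rand     # prints: unset
    delta_state main.c     # prints: unspecified
    cd / && rm -rf "$tmp"
)
```

The paths do not need to exist; check-attr only evaluates the patterns.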
So I stand by my earlier statements:
  1. Use "git commit -q" to avoid wasting time on the commit diff
     summary (we should perhaps have a commit.quiet config option for
     repos like this where you would almost always want to suppress it).

  2. Make sure your upstream repo is packed so pullers do not have to
     generate a new packfile all the time.

  3. Use -delta where appropriate to avoid useless delta compression.
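For point 2, a quick way to tell whether the upstream repo is already packed is to look at the "count" (loose objects) and "packs" fields of "git count-objects -v". This is a sketch only; is_fully_packed is a hypothetical helper, and "at most one pack, no loose objects" is an assumed definition of "packed".

```shell
#!/bin/sh
# Hypothetical helper: succeed if the repository in the current
# directory has no loose objects and at most one packfile.
is_fully_packed() {
    git count-objects -v | awk '
        $1 == "count:" { loose = $2; seen = 1 }
        $1 == "packs:" { packs = $2 }
        END { exit !(seen && loose == 0 && packs <= 1) }'
}

# Usage sketch, run inside the upstream (bare) repository:
#   is_fully_packed || git repack -a -d
#
# For point 1, an alias saves typing -q on every commit (the alias name
# "cq" is made up here):
#   git config alias.cq 'commit -q'
```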
If things are still slow after that, you'll need to be more specific
about your exact workload and exactly what is slow (I am still not sure
if delta compression or network bandwidth is the limiting factor for
your slow pulls).
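One possible way to separate the two is to build the pack locally on the machine being pulled from and throw it away, which takes the network out of the picture. The sketch below assumes that approach; pack_bytes is a hypothetical helper, and a full pack of all refs is only an approximation of what a particular pull would send.

```shell
#!/bin/sh
# Hypothetical helper: build a pack of everything reachable from all
# refs, discard it, and print how many bytes it contained.
pack_bytes() {
    git pack-objects --all --stdout -q </dev/null | wc -c | tr -d ' '
}

# Usage sketch, inside the repository being pulled from, in a shell
# where "time" is a keyword (e.g. bash):
#   time pack_bytes
```

If this alone takes most of the 35 minutes, delta compression is the bottleneck; if it is quick, dividing the byte count by your link speed gives a rough lower bound on transfer time.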
-Peff