Re: serious performance issues with images, audio files, and other "non-code" data

From: John <john@puckerupgames.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: serious performance issues with images, audio files, and other "non-code" data
Date: Tue, 18 May 2010 14:50:16 -0400
Message-ID: <4BF2E168.2020706@puckerupgames.com>
In-Reply-To: <20100517231642.GB12092@coredump.intra.peff.net>

On 05/17/2010 07:16 PM, Jeff King wrote:
> On Fri, May 14, 2010 at 08:54:02AM -0400, John wrote:
>
>> Thanks so much. It's version 1.5.6.5. I compiled it 3 months ago. For
>
> By git standards, that version is ancient. You may want to try with a
> more recent version of git (at the very least, multithreaded delta
> compression has been enabled by default since then).

I just compiled the latest git. It got worse!!

$  git --version
git version 1.5.6.5
$ time git gc --aggressive
Counting objects: 2086, done.
Compressing objects: 100% (2054/2054), done.
Writing objects: 100% (2086/2086), done.
Total 2086 (delta 676), reused 0 (delta 0)

real    4m28.573s
user    3m38.650s
sys     0m5.156s
$  git --version
git version 1.7.1
$ time git gc --aggressive
Counting objects: 2086, done.
Compressing objects: 100% (2054/2054), done.
Writing objects: 100% (2086/2086), done.
Total 2086 (delta 676), reused 0 (delta 0)

real    6m16.406s
user    5m28.665s
sys     0m6.196s
$ du -hs .git
203M	.git


>
>> I packed the bare repo, then ran `gc --aggressive`.
>
> Note that "gc --aggressive" will repack from scratch, throwing away the
> previous pack.
>
>> Then I did a `git pull`, which took 35 minutes.
>
> That sounds like a long time. What was taking so long? Was delta
> compression pegging the CPU? Was it limited during the "Writing objects"
> phase, which is going to be limited by either disk I/O or network speed?

The compressing objects phase. Yes, pegging the CPU and hogging memory.
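
For anyone trying to reproduce this, here is a minimal sketch of capping what the compression phase is allowed to use. `pack.threads` and `pack.windowMemory` are standard git config keys, but the values below are illustrative assumptions rather than tuned recommendations, and the scratch repository exists only so the snippet runs on its own.

```shell
# Scratch repository so the snippet is self-contained (location is arbitrary).
repo=$(mktemp -d)
git init -q "$repo"

# Restrict delta compression to one thread and cap the memory used by the
# delta window. Both values are assumptions chosen for illustration.
git -C "$repo" config pack.threads 1
git -C "$repo" config pack.windowMemory 256m

# List the pack-related settings that are now in effect for this repository.
git -C "$repo" config --get-regexp '^pack\.'
```

Lower limits trade pack size for memory and CPU, so they are worth measuring against the real repository before keeping them.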


> How big is your packed repo? Given the pattern you describe below, I am
> beginning to wonder if it is simply the case that even though a single
> checkout of your repo isn't that large, the complete history of your
> project may simply be gigantic (e.g., because you are repeatedly writing
> new apparently-random versions of each file, so your repository size
> will grow quite quickly).

The packed .git dir is 203 MB. Yes, we make frequent changes to these files, and push/pull 
frequently as well. Just a normal development pattern, though. It's all manually done -- i.e., 
there's no automated bot doing excessive git operations.

>
> Remember that a git clone transfers the full history (and a pull will
> transfer all of the intermediate history). If you have rewritten those
> files many times, you may be transferring many times your working
> directory size in history.
>
>> You can simulate it all by generating a batch of 1-100 MB files from
>> /dev/urandom (since they won't compress), commit them, then do it
>> again many times to simulate edits. Every few iterations, push it
>> somewhere.
>
> I tried this script to make a 100M working directory with a 400M .git
> directory:
>
> -- >8 --
> #!/bin/sh
>
> rm -rf big-repo
> mkdir big-repo && cd big-repo && git init
>
> mark() {
>    echo "`date` $*"
> }
>
> randomize() {
>    mark randomize start
>    for i in `seq 1 100`; do
>      openssl rand $((1024*1024))>$i.rand
>    done
>    mark randomize end
> }
>
> commit() {
>    mark add start
>    git add .
>    mark add end
>    mark commit start
>    git commit -m "$1"
>    mark commit end
> }
>
> randomize; commit base
> randomize; commit one
> randomize; commit two
> randomize; commit three
> -- 8<  --
>
> Here are a few timings I noted:
>
>    - it takes about 5 seconds to generate and write the random data
>
>    - git add runs in about 13 seconds. It pegs the CPU hashing all of the
>      data.
>
>    - the first commit is nearly instantaneous, as the summary diff takes
>      no work; subsequent commits spend about 9 seconds to create the
>      summary diff.  Changing commit to "commit -q" drops that back to
>      near-instantaneous.
>
>    - with no attributes set, "time git gc --aggressive" reports:
>
>        real    1m31.983s
>        user    2m29.621s
>        sys     0m3.732s
>
>      Note the real/user discrepancy. It's a dual-core machine, and recent
>      git will multi-thread the delta phase, which is what dominates the
>      time. This should correspond roughly to the delta-compression phase
>      of your pull time, as that was just making a pack on the fly (but
>      now that we are packed, pulls will be limited only by the time to
>      transfer the objects themselves).
>
>    - Turning off delta compression for the .rand files makes repacking
>      much faster:
>
>        $ echo '*.rand -delta'>.gitattributes
>        $ time git gc --aggressive
>        ...
>        real    0m25.354s
>        user    0m22.057s
>        sys     0m1.316s
>
>      The delta compression phase is very quick, and we spend most of our
>      time writing out the packfile to disk.
>
> So I stand by my earlier statements:
>
>    1. Use "git commit -q" to avoid wasting time on the commit diff
>       summary (we should perhaps have a commit.quiet config option for
>       repos like this where you would almost always want to suppress it).

Thanks, I will try that.

>
>    2. Make sure your upstream repo is packed so pullers do not have to
>       generate a new packfile all the time.

Got that in cron now.
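
A minimal sketch of what such a cron job could call. The function name, script path, repository path, and schedule are all hypothetical; only `git gc` itself comes from the thread.

```shell
# Hypothetical maintenance helper meant to be invoked from cron.
# $1 is the path to the bare upstream repository.
repack_upstream() {
    git --git-dir="$1" gc --quiet
}

# Example crontab entry (script path, repo path, and schedule are assumptions):
#   0 3 * * * /usr/local/bin/repack-upstream.sh /srv/git/project.git
```

Plain `git gc` is used here rather than `gc --aggressive`, since the aggressive mode repacks from scratch on every run.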


>    3. Use -delta where appropriate to avoid useless delta compression.

Already in there (thanks to your previous advice).
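
A quick way to confirm the attribute really applies to a given path is `git check-attr`. This is a sketch in a scratch repository; the file names are placeholders and do not need to exist on disk.

```shell
# Scratch repository so the check is self-contained.
repo=$(mktemp -d)
git init -q "$repo"

# Same attribute line as suggested earlier in the thread.
echo '*.rand -delta' > "$repo/.gitattributes"

# Ask git how it resolves the "delta" attribute for two placeholder paths.
rand_attr=$(git -C "$repo" check-attr delta -- 1.rand)
code_attr=$(git -C "$repo" check-attr delta -- main.c)

echo "$rand_attr"   # prints: 1.rand: delta: unset
echo "$code_attr"   # prints: main.c: delta: unspecified
```

"unset" means delta compression is disabled for that path; "unspecified" means the default behaviour applies.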


> If things are still slow after that, you'll need to be more specific
> about your exact workload and exactly what is slow (I am still not sure
> if delta compression or network bandwidth is the limiting factor for
> your slow pulls).

It's definitely the pull/push in git. Not knowing my way around git internals at all, I don't know 
(nor do I really want to know, to be honest) which "sub-processes" of `git pull` or `git push` are 
the culprit. Yes, network bandwidth is always a factor, but I guess my expectation is that git 
shouldn't transfer too much more info than the amount of recent changes. For example, if we change 
10 files for a total of 10MB, then my admittedly naive expectation is that git will send that 10MB 
of changes, plus some small constant amount of meta info... not the whole repo every time. No?
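
One way to check that expectation is to build the pack a push would need and count its bytes. The sketch below assumes a hypothetical helper name (`pack_bytes`) and that the two revisions passed in exist locally; it does not use `--thin`, so it gives a rough upper estimate rather than the exact size a real push sends.

```shell
# Hypothetical helper: size in bytes of a pack containing every object
# reachable from $2 that is not reachable from $1.
pack_bytes() {
    printf '%s\n' "$2" "^$1" |
        git pack-objects --revs --stdout -q |
        wc -c | tr -d ' '
}

# Usage, assuming a remote-tracking branch named origin/master exists:
#   pack_bytes origin/master master
```

If that number is close to the size of the recent changes, the transfer itself is reasonable and the time is going into compression rather than into sending the whole repository.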
