git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "Thomas Rast" <trast@student.ethz.ch>,
	"René Scharfe" <rene.scharfe@lsrfire.ath.cx>,
	"Eric Herman" <eric@freesa.org>,
	git@vger.kernel.org, "Junio C Hamano" <gitster@pobox.com>,
	"Fredrik Kuivinen" <frekui@gmail.com>
Subject: Re: [PATCH v2 3/3] grep: disable threading in all but worktree case
Date: Sat, 24 Dec 2011 02:07:15 -0500
Message-ID: <20111224070715.GA32267@sigill.intra.peff.net>
In-Reply-To: <CACBZZX67WhcdhXdqOm8gZHW7C3YMbV2KzeytwjHwsnF=8-M_+w@mail.gmail.com>

On Sat, Dec 24, 2011 at 02:39:11AM +0100, Ævar Arnfjörð Bjarmason wrote:

> Is the expensive part of git-grep all the setup work, or the actual
> traversal and searching? I'm guessing it's the latter.
> 
> In that case an easy way to do git-grep in parallel would be to simply
> spawn multiple sub-processes, e.g. if we had 1000 files and 4 cores:
> 
>  1. Split the 1000 into 4 parts 250 each.
>  2. Spawn 4 processes as: git grep <pattern> -- <250 files>
>  3. Aggregate all of the results in the parent process

That's an interesting idea. The expense of the traversal and searching
depends on two things:

  - how complex is your regex?

  - are you reading from objects (which need zlib inflated) or disk?

But you should be able to approximate it by compiling with NO_PTHREADS
and doing (assuming you have GNU xargs):

  # grep in working tree
  git ls-files | xargs -P 8 git grep "$re" --

  # grep tree-ish
  git ls-tree -r --name-only $tree | xargs -P 8 git grep "$re" $tree --
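The split/spawn/aggregate idea can be sketched as a small wrapper around xargs. This is only a sketch under a few assumptions: the helper name run_in_chunks is made up for illustration, an xargs that supports -0, -n and -P is assumed, and the paths are fed NUL-separated (via the -z option of ls-files and ls-tree) so that names containing whitespace survive. An explicit chunk size matters because xargs otherwise packs each command line as full as it can, so -P alone may start fewer processes than expected.

```shell
# Hypothetical helper (not part of git): read NUL-separated paths on stdin
# and run the given command over them, $1 paths per invocation, with up to
# $2 invocations running at once.
run_in_chunks() {
    chunk_size=$1
    procs=$2
    shift 2
    xargs -0 -n "$chunk_size" -P "$procs" "$@"
}

# Intended use, inside a repository:
#
#   # grep in working tree
#   git ls-files -z | run_in_chunks 4000 8 git grep "$re" --
#
#   # grep tree-ish
#   git ls-tree -r -z --name-only "$tree" |
#       run_in_chunks 4000 8 git grep "$re" "$tree" --
```

The "aggregation" here is simply every child writing to the same stdout, so output from different chunks can arrive in any order. Also note that git grep exits non-zero when a chunk has no match, which xargs reports as a failure of the whole run.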

I tried to get some timings for this, but ran across some quite
surprising results. Here's a simple grep of the linux-2.6 working tree,
using a single-threaded grep:

  $ time git grep SIMPLE >/dev/null
  real    0m0.439s
  user    0m0.272s
  sys     0m0.160s

and then the same thing, via xargs, without even turning on
parallelization. This should give us a measurement of the overhead for
going through xargs at all. We'd expect it to be slower, but not too
much so:

  $ time git ls-files | xargs git grep SIMPLE -- >/dev/null
  real    0m11.989s
  user    0m11.769s
  sys     0m0.268s

About 27 times slower! Running 'perf' reports the culprit as pathspec
matching:

  +  63.23%    git  git                 [.] match_pathspec_depth
  +  28.60%    git  libc-2.13.so        [.] __strncmp_sse42
  +   2.22%    git  git                 [.] strncmp@plt
  +   1.67%    git  git                 [.] kwsexec

where the strncmps are called as part of match_pathspec_depth. So over
90% of the CPU time is spent on matching the pathspecs, compared to less
than 2% actually grepping.
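As a sanity check, the arithmetic behind those figures can be redone from the numbers quoted above. awk is used purely as a calculator here; it is not part of the original measurement.

```shell
# Wall-clock slowdown of the xargs run relative to the plain run
# (11.989s versus 0.439s real time).
slowdown=$(awk 'BEGIN { printf "%.1f", 11.989 / 0.439 }')

# Share of samples spent on pathspec matching: match_pathspec_depth plus
# the two strncmp entries from the perf output.
pathspec_share=$(awk 'BEGIN { printf "%.2f", 63.23 + 28.60 + 2.22 }')

echo "slowdown: ${slowdown}x, pathspec matching: ${pathspec_share}%"
# prints "slowdown: 27.3x, pathspec matching: 94.05%"
```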

Which really makes me wonder if our pathspec matching could stand to be
faster. True, giving a bunch of single files is the least efficient way
to use pathspecs, but that's pretty amazingly slow.

The case where we would most expect the setup cost to be drowned out is
using a more complex regex, grepping tree objects. There we have a
baseline of:

  $ time git grep 'a.*c' HEAD >/dev/null
  real    0m5.684s
  user    0m5.472s
  sys     0m0.196s

  $ time git ls-tree --name-only -r HEAD |
      xargs git grep 'a.*c' HEAD -- >/dev/null
  real    0m10.906s
  user    0m10.725s
  sys     0m0.240s

Here, we still almost double our time. It looks like we don't use the
same pathspec matching code in this case. But we do waste a lot of extra
time zlib-inflating the trees in "ls-tree", only to do it separately in
"grep".

Doing it in parallel yields:

  $ time git ls-tree --name-only -r HEAD |
      xargs -n 4000 -P 8 git grep 'a.*c' HEAD -- >/dev/null
  real    0m3.573s
  user    0m21.885s
  sys     0m0.400s

So that does at least yield a real speedup over the 5.7s single-process
baseline, albeit only about 1.6x, despite burning about 22 seconds of
CPU, roughly four times the baseline (though my numbers are skewed
somewhat, as this is a quad i7 with hyperthreading and turbo boost).
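For reference, the ratios can be worked out directly from the three tree-grep timings above. Again awk is only a calculator, and the variable names are made up for this sketch.

```shell
# Wall-clock speedup of the parallel run over the plain "git grep ... HEAD".
speedup=$(awk 'BEGIN { printf "%.2f", 5.684 / 3.573 }')

# Total CPU (user + sys) of the parallel run relative to the plain run.
cpu_vs_baseline=$(awk 'BEGIN { printf "%.2f", (21.885 + 0.400) / (5.472 + 0.196) }')

# Total CPU of the parallel run relative to its own wall-clock time, i.e.
# how many cores were kept busy on average.
cpu_vs_wall=$(awk 'BEGIN { printf "%.2f", (21.885 + 0.400) / 3.573 }')

echo "speedup: ${speedup}x"
echo "CPU vs baseline: ${cpu_vs_baseline}x"
echo "CPU vs wall-clock: ${cpu_vs_wall}x"
```

With these inputs the three values come out as 1.59, 3.93 and 6.24 respectively.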

-Peff
