From: Jeff King <peff@peff.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "Thomas Rast" <trast@student.ethz.ch>,
"René Scharfe" <rene.scharfe@lsrfire.ath.cx>,
"Eric Herman" <eric@freesa.org>,
git@vger.kernel.org, "Junio C Hamano" <gitster@pobox.com>,
"Fredrik Kuivinen" <frekui@gmail.com>
Subject: Re: [PATCH v2 3/3] grep: disable threading in all but worktree case
Date: Sat, 24 Dec 2011 02:07:15 -0500 [thread overview]
Message-ID: <20111224070715.GA32267@sigill.intra.peff.net> (raw)
In-Reply-To: <CACBZZX67WhcdhXdqOm8gZHW7C3YMbV2KzeytwjHwsnF=8-M_+w@mail.gmail.com>
On Sat, Dec 24, 2011 at 02:39:11AM +0100, Ævar Arnfjörð Bjarmason wrote:
> Is the expensive part of git-grep all the setup work, or the actual
> traversal and searching? I'm guessing it's the latter.
>
> In that case an easy way to do git-grep in parallel would be to simply
> spawn multiple sub-processes, e.g. if we had 1000 files and 4 cores:
>
> 1. Split the 1000 into 4 parts 250 each.
> 2. Spawn 4 processes as: git grep <pattern> -- <250 files>
> 3. Aggregate all of the results in the parent process
That's an interesting idea. The expense of the traversal and searching
depends on two things:
- how complex is your regex?
- are you reading from objects (which need zlib inflated) or disk?
But you should be able to approximate it by compiling with NO_PTHREADS
and doing (assuming you have GNU xargs):
# grep in working tree
git ls-files | xargs -P 8 git grep "$re" --
# grep tree-ish
git ls-tree -r --name-only $tree | xargs -P 8 git grep "$re" $tree --
I tried to get some timings for this, but ran across some quite
surprising results. Here's a simple grep of the linux-2.6 working tree,
using a single-threaded grep:
$ time git grep SIMPLE >/dev/null
real 0m0.439s
user 0m0.272s
sys 0m0.160s
and then the same thing, via xargs, without even turning on
parallelization. This should give us a measurement of the overhead for
going through xargs at all. We'd expect it to be slower, but not too
much so:
$ time git ls-files | xargs git grep SIMPLE -- >/dev/null
real 0m11.989s
user 0m11.769s
sys 0m0.268s
Twenty-five times slower! Running 'perf' reports the culprit as pathspec
matching:
+ 63.23% git git [.] match_pathspec_depth
+ 28.60% git libc-2.13.so [.] __strncmp_sse42
+ 2.22% git git [.] strncmp@plt
+ 1.67% git git [.] kwsexec
where the strncmps are called as part of match_pathspec_depth. So over
90% of the CPU time is spent on matching the pathspecs, compared to less
than 2% actually grepping.
Which really makes me wonder if our pathspec matching could stand to be
faster. True, giving a bunch of single files is the least efficient way
to use pathspecs, but that's pretty amazingly slow.
The case where we would most expect the setup cost to be drowned out is
using a more complex regex, grepping tree objects. There we have a
baseline of:
$ time git grep 'a.*c' HEAD >/dev/null
real 0m5.684s
user 0m5.472s
sys 0m0.196s
$ time git ls-tree --name-only -r HEAD |
xargs git grep 'a.*c' HEAD -- >/dev/null
real 0m10.906s
user 0m10.725s
sys 0m0.240s
Here, we still almost double our time. It looks like we don't use the
same pathspec matching code in this case. But we do waste a lot of extra
time zlib-inflating the trees in "ls-tree", only to do it separately in
"grep".
Doing it in parallel yields:
$ time git ls-tree --name-only -r HEAD |
xargs -n 4000 -P 8 git grep 'a.*c' HEAD -- >/dev/null
real 0m3.573s
user 0m21.885s
sys 0m0.400s
So that does at least yield a real speedup, albeit only by about half,
despite using over six times as much CPU (though my numbers are skewed
somewhat, as this is a quad i7 with hyperthreading and turbo boost).
-Peff
next prev parent reply other threads:[~2011-12-24 7:07 UTC|newest]
Thread overview: 50+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-11-25 14:46 [PATCH] grep: load funcname patterns for -W Thomas Rast
2011-11-25 16:32 ` René Scharfe
2011-11-26 12:15 ` [PATCH] grep: enable multi-threading for -p and -W René Scharfe
2011-11-29 9:54 ` Thomas Rast
2011-11-29 13:49 ` René Scharfe
2011-11-29 14:07 ` Thomas Rast
2011-12-02 13:07 ` [PATCH v2 0/3] grep multithreading and scaling Thomas Rast
2011-12-02 13:07 ` [PATCH v2 1/3] grep: load funcname patterns for -W Thomas Rast
2011-12-02 13:07 ` [PATCH v2 2/3] grep: enable threading with -p and -W using lazy attribute lookup Thomas Rast
2011-12-02 13:07 ` [PATCH v2 3/3] grep: disable threading in all but worktree case Thomas Rast
2011-12-02 16:15 ` René Scharfe
2011-12-05 9:02 ` Thomas Rast
2011-12-06 22:48 ` René Scharfe
2011-12-06 23:01 ` [PATCH 4/2] grep: turn off threading for non-worktree René Scharfe
2011-12-07 4:42 ` Jeff King
2011-12-07 17:11 ` René Scharfe
2011-12-07 18:28 ` Jeff King
2011-12-07 20:11 ` J. Bruce Fields
2011-12-07 20:45 ` Jeff King
2011-12-07 8:12 ` Thomas Rast
2011-12-07 17:00 ` René Scharfe
2011-12-10 13:13 ` Pete Wyckoff
2011-12-12 22:37 ` René Scharfe
2011-12-07 4:24 ` [PATCH v2 3/3] grep: disable threading in all but worktree case Jeff King
2011-12-07 16:52 ` René Scharfe
2011-12-07 18:10 ` Jeff King
2011-12-07 8:11 ` Thomas Rast
2011-12-07 16:54 ` René Scharfe
2011-12-12 21:16 ` [PATCH v3 0/3] grep attributes and multithreading Thomas Rast
2011-12-12 21:16 ` [PATCH v3 1/3] grep: load funcname patterns for -W Thomas Rast
2011-12-12 21:16 ` [PATCH v3 2/3] grep: enable threading with -p and -W using lazy attribute lookup Thomas Rast
2011-12-16 8:22 ` Johannes Sixt
2011-12-16 17:34 ` Junio C Hamano
2011-12-12 21:16 ` [PATCH v3 3/3] grep: disable threading in non-worktree case Thomas Rast
2011-12-12 22:37 ` [PATCH v3 0/3] grep attributes and multithreading René Scharfe
2011-12-12 23:44 ` Junio C Hamano
2011-12-13 8:44 ` Thomas Rast
2011-12-23 22:37 ` [PATCH v2 3/3] grep: disable threading in all but worktree case Ævar Arnfjörð Bjarmason
2011-12-23 22:49 ` Thomas Rast
2011-12-24 1:39 ` Ævar Arnfjörð Bjarmason
2011-12-24 7:07 ` Jeff King [this message]
2011-12-24 10:49 ` Nguyen Thai Ngoc Duy
2011-12-24 10:55 ` Nguyen Thai Ngoc Duy
2011-12-24 13:38 ` Jeff King
2011-12-25 3:32 ` Nguyen Thai Ngoc Duy
2011-12-02 17:34 ` [PATCH v2 0/3] grep multithreading and scaling Jeff King
2011-12-05 9:38 ` Thomas Rast
2011-12-05 20:16 ` Thomas Rast
2011-12-06 0:40 ` Jeff King
2011-12-02 20:02 ` Eric Herman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20111224070715.GA32267@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=avarab@gmail.com \
--cc=eric@freesa.org \
--cc=frekui@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=rene.scharfe@lsrfire.ath.cx \
--cc=trast@student.ethz.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).