git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: git@vger.kernel.org
Subject: Re: git grep with leading inverted bracket expression
Date: Thu, 07 Jun 2018 21:09:25 +0200	[thread overview]
Message-ID: <87602uza56.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20180607152711.GA27079@bombadil.infradead.org>


On Thu, Jun 07 2018, Matthew Wilcox wrote:

> If the first atom of a regex is a bracket expression with an inverted range,
> git grep is very slow.
>
> $ time git grep 'struct_size' >/dev/null
>
> real	0m0.368s
> user	0m0.563s
> sys	0m0.453s
>
> $ time git grep '[^t]truct_size' >/dev/null
>
> real	0m31.529s
> user	1m54.909s
> sys	0m0.805s
>
> If the bracket expression is moved to even the second position in the string,
> it runs much faster:
>
> $ time git grep 's[^p]ruct_size' >/dev/null
>
> real	0m3.989s
> user	0m13.939s
> sys	0m0.403s
>
> It's pretty bad with even a '.' as the first character:
>
> $ time git grep '.truct_size' >/dev/null
>
> real	0m14.514s
> user	0m52.624s
> sys	0m0.598s
>
> $ git --version
> git version 2.17.1
>
> Setting LANG=C improves matters by a factor of 3-4 (depending if you
> count real or user time):
>
> $ time git grep '[^t]truct_size' >/dev/null
> real	0m10.035s
> user	0m28.795s
> sys	0m0.537s
>
> (this is using something pretty close to Linus' current HEAD of the
> linux repository, an i7-7500, 16GB memory).

I have some WIP patches to fix all of this, which I'll hopefully submit
before 2.19 is out the door.

What you've discovered here is how shitty your libc regex engine is,
because unless you provide -P and compile with a reasonably up-to-date
libpcre (preferably v2) with JIT that's what you'll get.

The reason stuff like 'struct_size' is so much faster is because there
we don't use any regex engine at all, but rather an optimized
fixed-string searcher.

With our own benchmarks modified per your E-Mail:
    
    diff --git a/t/perf/p7820-grep-engines.sh b/t/perf/p7820-grep-engines.sh
    index 8b09c5bf32..fe4c5681da 100755
    --- a/t/perf/p7820-grep-engines.sh
    +++ b/t/perf/p7820-grep-engines.sh
    @@ -28,11 +28,10 @@ then
     fi
    
     for pattern in \
    -       'how.to' \
    -       '^how to' \
    -       '[how] to' \
    -       '\(e.t[^ ]*\|v.ry\) rare' \
    -       'm\(ú\|u\)lt.b\(æ\|y\)te'
    +       'struct size' \
    +       '[^t]truct_size' \
    +       's[^p]ruct_size' \
    +       '.truct_size'
     do
            for engine in basic extended perl
            do

I get these results against linux.git:

    $ GIT_PERF_LARGE_REPO=~/g/linux ./run p7820-grep-engines.sh
    [...]
    Test                                      this tree
    ----------------------------------------------------------
    7820.1: basic grep 'struct size'          0.23(0.52+0.76)
    7820.2: extended grep 'struct size'       0.22(0.60+0.61)
    7820.3: perl grep 'struct size'           0.22(0.56+0.65)
    7820.5: basic grep '[^t]truct_size'       4.29(29.43+0.51)
    7820.6: extended grep '[^t]truct_size'    4.27(29.59+0.36)
    7820.7: perl grep '[^t]truct_size'        0.21(0.40+0.69)
    7820.9: basic grep 's[^p]ruct_size'       0.49(2.22+0.49)
    7820.10: extended grep 's[^p]ruct_size'   0.43(2.24+0.48)
    7820.11: perl grep 's[^p]ruct_size'       0.21(0.38+0.71)
    7820.13: basic grep '.truct_size'         4.42(31.29+0.44)
    7820.14: extended grep '.truct_size'      4.50(31.18+0.46)
    7820.15: perl grep '.truct_size'          0.21(0.35+0.75)

So you need to just use an up-to-date libpcre2 & -P and performance
won't suck.

My WIP patches will make us use PCRE for all grep modes, using an API it
has to convert basic & extended regexp syntax to its own syntax, so
we'll be able to do that transparently.

  reply	other threads:[~2018-06-07 19:09 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-07 15:27 git grep with leading inverted bracket expression Matthew Wilcox
2018-06-07 19:09 ` Ævar Arnfjörð Bjarmason [this message]
2018-06-07 19:22   ` Matthew Wilcox
2018-06-07 19:29     ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87602uza56.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).