git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0
@ 2017-04-24 21:12 Ævar Arnfjörð Bjarmason
  2017-04-24 21:12 ` [PATCH 1/2] perf: add function to setup a fresh test repo Ævar Arnfjörð Bjarmason
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-24 21:12 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason

Russ Cox just published an article about how various glob()
implementations suffer from pathological performance when fed certain
pathological patterns like "a*a*a*a*b" given a file like "aaaaaaa...":
https://research.swtch.com/glob

I was curious to see if this impacted git. It turns out it does. This
used to be a per-platform issue with git, since globbing was
implemented via fnmatch() by default before v1.8.4, and support for
using the OS fnmatch() was removed entirely in v2.0.0.

This performance test shows the regression:

    $ GIT_PERF_REPEAT_COUNT=1 GIT_PERF_MAKE_OPTS="[...]NO_WILDMATCH=YesPlease[...]" ./run v1.9.5 v2.0.0 v2.12.0 p0100-globbing.sh
    [...]
    Test                                                       v1.9.5            v2.0.0                    v2.12.0
    ------------------------------------------------------------------------------------------------------------------------------
    [...]
    0100.7: fileglob((a*)^nb) against file (a^100).t; n = 1    0.01(0.00+0.00)   0.00(0.00+0.00) -100.0%   0.01(0.00+0.00) +0.0%
    0100.8: fileglob((a*)^nb) against file (a^100).t; n = 2    0.01(0.00+0.00)   0.00(0.00+0.00) -100.0%   0.01(0.00+0.00) +0.0%
    0100.9: fileglob((a*)^nb) against file (a^100).t; n = 3    0.00(0.00+0.00)   0.00(0.00+0.00) =         0.01(0.00+0.00) +inf
    0100.10: fileglob((a*)^nb) against file (a^100).t; n = 4   0.00(0.00+0.00)   0.01(0.01+0.00) +inf      0.02(0.02+0.00) +inf
    0100.11: fileglob((a*)^nb) against file (a^100).t; n = 5   0.00(0.00+0.00)   0.20(0.19+0.00) +inf      0.24(0.23+0.00) +inf
    0100.12: fileglob((a*)^nb) against file (a^100).t; n = 6   0.00(0.00+0.00)   3.03(3.00+0.00) +inf      3.08(3.05+0.00) +inf

And here's a one-liner to do the same:

    $ time (rm -rf test; git init -q test && (cd test && touch $(perl -e 'print "a" x 100').t && git add a* && git commit -q -m"test" && git ls-files 'a*a*a*a*a*a*a*b'))

Add or remove "a*"'s to adjust the runtime. With 6 that executes in 3
seconds on my system, 40 seconds with 7 etc.

I don't think this is something we need to worry much about, if you
have a file like this and feed Git insane patterns you probably
deserve what you get.

The real concern is if we have behavior like this and ever e.g. expose
globbing over the network, e.g. in some future upload-pack protocol.

There are probably some web-based programs built on top of git that
are vulnerable to DoS attacks as a result of this, e.g. if they take
user-supplied globs and feed them to ls-files.

Ævar Arnfjörð Bjarmason (2):
  perf: add function to setup a fresh test repo
  perf: add test showing exponential growth in path globbing

 t/perf/README            |  1 +
 t/perf/p0100-globbing.sh | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 t/perf/perf-lib.sh       | 17 +++++++++++++----
 3 files changed, 62 insertions(+), 4 deletions(-)
 create mode 100755 t/perf/p0100-globbing.sh

-- 
2.11.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] perf: add function to setup a fresh test repo
  2017-04-24 21:12 [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0 Ævar Arnfjörð Bjarmason
@ 2017-04-24 21:12 ` Ævar Arnfjörð Bjarmason
  2017-04-24 21:12 ` [PATCH 2/2] perf: add test showing exponential growth in path globbing Ævar Arnfjörð Bjarmason
  2017-04-24 21:34 ` [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0 Brandon Williams
  2 siblings, 0 replies; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-24 21:12 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason

Add a function to setup a fresh test repo via 'git init' to compliment
the existing functions to copy over a normal & large repo.

Some performance tests don't need any existing repository data at all
to be significant, e.g. tests which stress glob matches against
revisions or files, which I'm about to add in a subsequent commit.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/perf/README      |  1 +
 t/perf/perf-lib.sh | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/t/perf/README b/t/perf/README
index 49ea4349be..de2fe15696 100644
--- a/t/perf/README
+++ b/t/perf/README
@@ -106,6 +106,7 @@ sources perf-lib.sh:
 
 After that you will want to use some of the following:
 
+	test_perf_fresh_repo    # sets up an empty repository
 	test_perf_default_repo  # sets up a "normal" repository
 	test_perf_large_repo    # sets up a "large" repository
 
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index ab4b8b06ae..f51fc773e8 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -78,6 +78,10 @@ if test -z "$GIT_PERF_LARGE_REPO"; then
 	GIT_PERF_LARGE_REPO=$TEST_DIRECTORY/..
 fi
 
+test_perf_do_repo_symlink_config_ () {
+	test_have_prereq SYMLINKS || git config core.symlinks false
+}
+
 test_perf_create_repo_from () {
 	test "$#" = 2 ||
 	error "bug in the test script: not 2 parameters to test-create-repo"
@@ -102,15 +106,20 @@ test_perf_create_repo_from () {
 	) &&
 	(
 		cd "$repo" &&
-		"$MODERN_GIT" init -q && {
-			test_have_prereq SYMLINKS ||
-			git config core.symlinks false
-		} &&
+		"$MODERN_GIT" init -q &&
+		test_perf_do_repo_symlink_config_ &&
 		mv .git/hooks .git/hooks-disabled 2>/dev/null
 	) || error "failed to copy repository '$source' to '$repo'"
 }
 
 # call at least one of these to establish an appropriately-sized repository
+test_perf_fresh_repo () {
+	repo="${1:-$TRASH_DIRECTORY}"
+	"$MODERN_GIT" init -q "$repo" &&
+	cd "$repo" &&
+	test_perf_do_repo_symlink_config_
+}
+
 test_perf_default_repo () {
 	test_perf_create_repo_from "${1:-$TRASH_DIRECTORY}" "$GIT_PERF_REPO"
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] perf: add test showing exponential growth in path globbing
  2017-04-24 21:12 [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0 Ævar Arnfjörð Bjarmason
  2017-04-24 21:12 ` [PATCH 1/2] perf: add function to setup a fresh test repo Ævar Arnfjörð Bjarmason
@ 2017-04-24 21:12 ` Ævar Arnfjörð Bjarmason
  2017-04-24 21:34 ` [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0 Brandon Williams
  2 siblings, 0 replies; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-24 21:12 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason

Add a test showing that ls-files times grow exponentially in the face
of some pathological globs, whereas refglobs via for-each-ref don't in
practice suffer from the same issue.

As noted in the test description this is a test to see whether Git
suffers from the issue noted in an article Russ Cox posted today about
common bugs in various glob implementations:
https://research.swtch.com/glob

The pathological git-ls-files globbing is done by wildmatch() in
wildmatch.c. The for-each-ref codepath also uses wildmatch(), but will
always match against e.g. "refs/tags/aaa...", not "aaa.." as
git-ls-files will.

I'm unsure why the pathological case isn't triggered by for-each-ref,
but in any case, now we have a performance test for it.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/perf/p0100-globbing.sh | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100755 t/perf/p0100-globbing.sh

diff --git a/t/perf/p0100-globbing.sh b/t/perf/p0100-globbing.sh
new file mode 100755
index 0000000000..e98fd7ce4b
--- /dev/null
+++ b/t/perf/p0100-globbing.sh
@@ -0,0 +1,48 @@
+#!/bin/sh
+
+test_description="Tests pathalogical globbing performance
+
+Shows how Git's globbing performance performs when given the sort of
+pathalogical patterns described in at https://research.swtch.com/glob
+"
+
+. ./perf-lib.sh
+
+test_globs_big='10 25 50 75 100'
+test_globs_small='1 2 3 4 5 6'
+
+test_perf_fresh_repo
+
+test_expect_success 'setup' '
+	for i in $(test_seq 1 100)
+	do
+		printf "a" >>refname &&
+		for j in $(test_seq 1 $i)
+		do
+			printf "a*" >>refglob.$i
+		done &&
+		echo b >>refglob.$i
+	done &&
+	test_commit $(cat refname) &&
+	for i in $(test_seq 1 100)
+	do
+	echo	git tag $(cat refname)-$i
+	done &&
+	test_commit hello
+'
+
+for i in $test_globs_big
+do
+	test_perf "refglob((a*)^nb) against tag a^100; n = $i" '
+		git for-each-ref "refs/tags/$(cat refglob.'$i')b"
+	'
+done
+
+for i in $test_globs_small
+do
+	test_perf "fileglob((a*)^nb) against file (a^100).t; n = $i" '
+		git ls-files "$(cat refglob.'$i')b"
+	'
+done
+
+test_done
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0
  2017-04-24 21:12 [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0 Ævar Arnfjörð Bjarmason
  2017-04-24 21:12 ` [PATCH 1/2] perf: add function to setup a fresh test repo Ævar Arnfjörð Bjarmason
  2017-04-24 21:12 ` [PATCH 2/2] perf: add test showing exponential growth in path globbing Ævar Arnfjörð Bjarmason
@ 2017-04-24 21:34 ` Brandon Williams
  2017-04-24 22:45   ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 5+ messages in thread
From: Brandon Williams @ 2017-04-24 21:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Nguyễn Thái Ngọc Duy

On 04/24, Ævar Arnfjörð Bjarmason wrote:
> Russ Cox just published an article about how various glob()
> implementations suffer from pathological performance when fed certain
> pathological patterns like "a*a*a*a*b" given a file like "aaaaaaa...":
> https://research.swtch.com/glob
> 
> I was curious to see if this impacted git. It turns out it does. This
> used to be a per-platform issue with git, since globbing was
> implemented via fnmatch() by default before v1.8.4, and support for
> using the OS fnmatch() was removed entirely in v2.0.0.
> 
> This performance test shows the regression:
> 
>     $ GIT_PERF_REPEAT_COUNT=1 GIT_PERF_MAKE_OPTS="[...]NO_WILDMATCH=YesPlease[...]" ./run v1.9.5 v2.0.0 v2.12.0 p0100-globbing.sh
>     [...]
>     Test                                                       v1.9.5            v2.0.0                    v2.12.0
>     ------------------------------------------------------------------------------------------------------------------------------
>     [...]
>     0100.7: fileglob((a*)^nb) against file (a^100).t; n = 1    0.01(0.00+0.00)   0.00(0.00+0.00) -100.0%   0.01(0.00+0.00) +0.0%
>     0100.8: fileglob((a*)^nb) against file (a^100).t; n = 2    0.01(0.00+0.00)   0.00(0.00+0.00) -100.0%   0.01(0.00+0.00) +0.0%
>     0100.9: fileglob((a*)^nb) against file (a^100).t; n = 3    0.00(0.00+0.00)   0.00(0.00+0.00) =         0.01(0.00+0.00) +inf
>     0100.10: fileglob((a*)^nb) against file (a^100).t; n = 4   0.00(0.00+0.00)   0.01(0.01+0.00) +inf      0.02(0.02+0.00) +inf
>     0100.11: fileglob((a*)^nb) against file (a^100).t; n = 5   0.00(0.00+0.00)   0.20(0.19+0.00) +inf      0.24(0.23+0.00) +inf
>     0100.12: fileglob((a*)^nb) against file (a^100).t; n = 6   0.00(0.00+0.00)   3.03(3.00+0.00) +inf      3.08(3.05+0.00) +inf
> 
> And here's a one-liner to do the same:
> 
>     $ time (rm -rf test; git init -q test && (cd test && touch $(perl -e 'print "a" x 100').t && git add a* && git commit -q -m"test" && git ls-files 'a*a*a*a*a*a*a*b'))
> 
> Add or remove "a*"'s to adjust the runtime. With 6 that executes in 3
> seconds on my system, 40 seconds with 7 etc.
> 
> I don't think this is something we need to worry much about, if you
> have a file like this and feed Git insane patterns you probably
> deserve what you get.
> 
> The real concern is if we have behavior like this and ever e.g. expose
> globbing over the network, e.g. in some future upload-pack protocol.
> 
> There are probably some web-based programs built on top of git that
> are vulnerable to DoS attacks as a result of this, e.g. if they take
> user-supplied globs and feed them to ls-files.

I was taking a look at wildmatch a few months ago and have an unfinished
patch to do some cleanup there.  I noticed this was inefficient but
didn't expect those kinds of numbers.  I wonder how difficult it would
be to rewrite it so that we don't have this issue.

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0
  2017-04-24 21:34 ` [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0 Brandon Williams
@ 2017-04-24 22:45   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-24 22:45 UTC (permalink / raw)
  To: Brandon Williams
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

On Mon, Apr 24, 2017 at 11:34 PM, Brandon Williams <bmwill@google.com> wrote:
> On 04/24, Ęvar Arnfjörš Bjarmason wrote:
>> Russ Cox just published an article about how various glob()
>> implementations suffer from pathological performance when fed certain
>> pathological patterns like "a*a*a*a*b" given a file like "aaaaaaa...":
>> https://research.swtch.com/glob
>>
>> I was curious to see if this impacted git. It turns out it does. This
>> used to be a per-platform issue with git, since globbing was
>> implemented via fnmatch() by default before v1.8.4, and support for
>> using the OS fnmatch() was removed entirely in v2.0.0.
>>
>> This performance test shows the regression:
>>
>>     $ GIT_PERF_REPEAT_COUNT=1 GIT_PERF_MAKE_OPTS="[...]NO_WILDMATCH=YesPlease[...]" ./run v1.9.5 v2.0.0 v2.12.0 p0100-globbing.sh
>>     [...]
>>     Test                                                       v1.9.5            v2.0.0                    v2.12.0
>>     ------------------------------------------------------------------------------------------------------------------------------
>>     [...]
>>     0100.7: fileglob((a*)^nb) against file (a^100).t; n = 1    0.01(0.00+0.00)   0.00(0.00+0.00) -100.0%   0.01(0.00+0.00) +0.0%
>>     0100.8: fileglob((a*)^nb) against file (a^100).t; n = 2    0.01(0.00+0.00)   0.00(0.00+0.00) -100.0%   0.01(0.00+0.00) +0.0%
>>     0100.9: fileglob((a*)^nb) against file (a^100).t; n = 3    0.00(0.00+0.00)   0.00(0.00+0.00) =         0.01(0.00+0.00) +inf
>>     0100.10: fileglob((a*)^nb) against file (a^100).t; n = 4   0.00(0.00+0.00)   0.01(0.01+0.00) +inf      0.02(0.02+0.00) +inf
>>     0100.11: fileglob((a*)^nb) against file (a^100).t; n = 5   0.00(0.00+0.00)   0.20(0.19+0.00) +inf      0.24(0.23+0.00) +inf
>>     0100.12: fileglob((a*)^nb) against file (a^100).t; n = 6   0.00(0.00+0.00)   3.03(3.00+0.00) +inf      3.08(3.05+0.00) +inf
>>
>> And here's a one-liner to do the same:
>>
>>     $ time (rm -rf test; git init -q test && (cd test && touch $(perl -e 'print "a" x 100').t && git add a* && git commit -q -m"test" && git ls-files 'a*a*a*a*a*a*a*b'))
>>
>> Add or remove "a*"'s to adjust the runtime. With 6 that executes in 3
>> seconds on my system, 40 seconds with 7 etc.
>>
>> I don't think this is something we need to worry much about, if you
>> have a file like this and feed Git insane patterns you probably
>> deserve what you get.
>>
>> The real concern is if we have behavior like this and ever e.g. expose
>> globbing over the network, e.g. in some future upload-pack protocol.
>>
>> There are probably some web-based programs built on top of git that
>> are vulnerable to DoS attacks as a result of this, e.g. if they take
>> user-supplied globs and feed them to ls-files.
>
> I was taking a look at wildmatch a few months ago and have an unfinished
> patch to do some cleanup there.  I noticed this was inefficient but
> didn't expect those kinds of numbers.  I wonder how difficult it would
> be to rewrite it so that we don't have this issue.

Something I've started hacking on as part of a post-PCRE v2 series I'm
working on, which I'll submit after PCRE v2 support is merged, is to
replace various parts of regex / matching machinery with PCRE under
the hood if it's available, even when the user doesn't ask or care
that we're using PCRE.

The thing I'm working on now is to replace fixed-string grep/log
searches that we currently do via kwset with PCRE v2. It's quite a bit
faster than kwset, and allows us to fix a bunch of outstanding grep
TODO items.

It would similarly be interesting to replace wildmatch() with a shimmy
wrapper for PCRE that translates glob patterns to regexes. Not that we
need PCRE for that, we could probably just use regcomp(), but PCRE
would almost certainly be faster than wildmatch().

Of course the use-cases are different. For the kwset replacement we'd
just use PCRE under the hood because it's faster, whereas the reason
we integrated wildmatch() in the first place was to get consistent
behavior across all platforms, and PCRE is currently an optional
dependency.

I think it might make sense to solve this general class of problem by
eventually making PCRE a mandatory dependency of Git. Right now we
have copy/pasted versions of whatever the last GPLv2 version was of
kwset/wildmatch, which we then have to maintain.

Between kwset/wildmatch/various code in grep.c & diffcore-pickaxe.c we
probably have 2-3k lines of very tricky string matching code which
could be replaced by relatively trivial calls into PCRE.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-04-24 22:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-04-24 21:12 [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0 Ævar Arnfjörð Bjarmason
2017-04-24 21:12 ` [PATCH 1/2] perf: add function to setup a fresh test repo Ævar Arnfjörð Bjarmason
2017-04-24 21:12 ` [PATCH 2/2] perf: add test showing exponential growth in path globbing Ævar Arnfjörð Bjarmason
2017-04-24 21:34 ` [PATCH 0/2] perf: show that wildmatch() regressed for pathological cases in v2.0 Brandon Williams
2017-04-24 22:45   ` Ævar Arnfjörð Bjarmason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).