git.vger.kernel.org archive mirror
From: "René Scharfe" <l.s.r@web.de>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Git List" <git@vger.kernel.org>,
	"David Coppa" <dcoppa@openbsd.org>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: Re: [PATCH] t4062: stop using repetition in regex
Date: Wed, 9 Aug 2017 00:27:58 +0200
Message-ID: <6e721fde-0c6d-1539-b4fb-110aac47989b@web.de>
In-Reply-To: <xmqqinhxaf0i.fsf@gitster.mtv.corp.google.com>

Am 09.08.2017 um 00:09 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
> 
>> Am 08.08.2017 um 16:49 schrieb Johannes Schindelin:
>>> Hi René,
>>>
>>> On Tue, 8 Aug 2017, René Scharfe wrote:
>>>
>>>> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
>>>> That's the minimum acceptable value according to POSIX.  In t4062 we use
>>>> 4096 repetitions in the test "-G matches", though, causing it to fail.
>>>>
>>>> Do the same as the test "-S --pickaxe-regex" in the same file and search
>>>> for a single zero instead.  That still suffices to trigger the buffer
>>>> overrun in older versions (checked with b7d36ffca02^ and --valgrind on
>>>> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit.
>>>
>>> I am afraid not. The 4096 is precisely the page size required to trigger
>>> the bug on Windows against which this regression test tries to safeguard.
>>
>> Checked with b7d36ffca02^ on MinGW now as well and found that it
>> segfaults with the proposed change ten out of ten times.
> 
> That is a strange result but I can believe it.
> 
> The reason why I find it strange is that the test wants to run
> diff_grep() in diffcore-pickaxe.c with one == NULL (because we are
> looking at difference between an initial empty commit and the
> current commit that adds 4096-zeroes.txt file), which makes the
> current blob (i.e. a page of '0' that may be mmap(2)ed without
> trailing NUL to terminate it) scanned via regexec() to look for the
> search string.
> 
> I can understand why Dscho originally did "^0{4096}$"; it is to
> ensure that the whole page is scanned for 4096 zeroes and then the
> code would somehow make sure that there is no more byte until the
> end of line, which will force regexec (but not regexec_buf that knows
> where the buffer ends) to look at the 4097th byte that does not exist.
> 
> If you change the pattern to just "0" that is not anchored, I'd expect
> regexec() that does not know how big the haystack is to just find "0"
> at the first byte and happily return without segfaulting (because it
> does not even have to scan the remainder of the buffer).
> 
> So I find Dscho's concern quite valid, even though I do believe you
> when you say the code somehow segfaults.  I just can not tell
> how/why it would segfault, though---it is possible that regexec()
> implementation is stupid and does not realize that it can return early
> reporting success without looking at the rest of the buffer, but
> somehow I find it unlikely.
> 
> Puzzled.

Good point.  Valgrind reports:

==57466== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==57466==  Access not within mapped region at address 0x4027000
==57466==    at 0x4C2EDF4: strlen (vg_replace_strmem.c:458)
==57466==    by 0x59D9F76: regexec@@GLIBC_2.3.4 (regexec.c:240)
==57466==    by 0x54D96E: diff_grep (diffcore-pickaxe.c:0)
==57466==    by 0x54DAC3: pickaxe_match (diffcore-pickaxe.c:149)

And you can see in our version in compat/regex/regexec.c:241 that the
first thing regexec() does is call strlen() on the haystack, so it
runs off the end of the unterminated buffer before any matching starts.
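
A minimal sketch of how a caller can keep regexec() from running
strlen() over an unterminated buffer. The helper names bounded_regexec()
and match_zero_page() are hypothetical and this is not Git's actual
regexec_buf(); REG_STARTEND is a non-POSIX extension that glibc and the
BSDs provide, so the sketch assumes it may be missing and falls back to
a NUL-terminated copy.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: run a compiled regex over buf[0..len) without
 * assuming that buf[len] is a readable NUL byte.
 * Returns 0 on match, REG_NOMATCH on no match, REG_ESPACE if the
 * fallback copy cannot be allocated.
 */
static int bounded_regexec(const regex_t *re, const char *buf, size_t len)
{
	regmatch_t m[1];

	assert(re != NULL && buf != NULL);
#ifdef REG_STARTEND
	/* Extension: m[0] delimits the haystack, so no strlen() is needed. */
	m[0].rm_so = 0;
	m[0].rm_eo = (regoff_t)len;
	return regexec(re, buf, 1, m, REG_STARTEND);
#else
	/* Portable fallback: match against a NUL-terminated copy. */
	char *copy = malloc(len + 1);
	int ret;

	if (!copy)
		return REG_ESPACE;
	memcpy(copy, buf, len);
	copy[len] = '\0';
	ret = regexec(re, copy, 1, m, 0);
	free(copy);
	return ret;
#endif
}

/* Search a 4096-byte run of '0' that has no trailing NUL. */
static int match_zero_page(const char *pattern)
{
	static char page[4096];
	regex_t re;
	int ret;

	memset(page, '0', sizeof(page));
	if (regcomp(&re, pattern, REG_EXTENDED))
		return -1; /* pattern did not compile */
	ret = bounded_regexec(&re, page, sizeof(page));
	regfree(&re);
	return ret;
}
```

Calling match_zero_page("0") is expected to report a match and
match_zero_page("1") to report REG_NOMATCH, in both cases without
reading past the 4096th byte.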

So to avoid depending on that implementation detail we'd need to use
a search string that won't be found (e.g. "1") or one with unlimited
repetition (e.g. "0*"), right?
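
A small sketch of why those two alternatives differ from a plain "0".
It only reports where the leftmost match ends on a properly
NUL-terminated string; it does not observe which bytes a given
regexec() implementation actually reads. The helper name match_end()
is hypothetical. Neither alternative uses a bounded repetition, so
RE_DUP_MAX does not come into play.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the end offset of the leftmost match of
 * pattern in the NUL-terminated haystack, or -1 if the pattern does
 * not compile or does not match.
 */
static long match_end(const char *pattern, const char *haystack)
{
	regex_t re;
	regmatch_t m[1];
	long end = -1;

	assert(pattern != NULL && haystack != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED))
		return -1;
	if (regexec(&re, haystack, 1, m, 0) == 0)
		end = (long)m[0].rm_eo;
	regfree(&re);
	return end;
}
```

On the haystack "00000000", the pattern "0" is satisfied by the first
byte alone (match_end() gives 1), "0*" has to extend to the end of the
run (8), and "1" can only be rejected after every byte has been
considered (-1).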

René
