From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, git-packagers@googlegroups.com,
gitgitgadget@gmail.com, johannes.schindelin@gmx.de,
peff@peff.net, sandals@crustytoothpaste.net,
szeder.dev@gmail.com
Subject: Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
Date: Tue, 02 Jul 2019 13:10:27 +0200
Message-ID: <87pnms7kv0.fsf@evledraar.gmail.com>
In-Reply-To: <xmqqef395tmi.fsf@gitster-ct.c.googlers.com>

On Mon, Jul 01 2019, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> This v3 has a new patch (3/10) that I believe fixes the regression on
>> MinGW Johannes noted in
>> https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@tvgsbejvaqbjf.bet/
>>
>> As noted in the updated commit message in 10/10 I believe just
>> skipping this test & documenting this in a commit message is the least
>> amount of suck for now. It's really an existing issue with us doing
>> nothing sensible when the log/grep haystack encoding doesn't match the
>> needle encoding supplied via the command line.
>
> Is that quite the case? If they do not match, not finding the match
> is the right answer, because we are byte-for-byte matching/searching
> IIUC.
>
>> We swept that under the carpet with the kwset backend, but PCRE v2
>> exposes it.
>
> Is it exposing, or just showing the limitation of the rewritten
> implementation where it cannot do byte-for-byte matching/searching
> as we used to be able to?
>
> Without having a way to know what encoding is used on the command
> line, there is no sensible way to reencode them to match the
> haystack encoding (even when it is known), so "you got to feed the
> strings in the same encoding, as we are going to match/search
> byte-for-byte" is the only sensible way to work, given the design
> space, I would think.
>
> Not that it is all that useful to be able to match/search
> byte-for-byte, of course, so I am OK if we punt with these tests,
> but I'd prefer to see us admit we are punting when we do ;-).

I'm guilty as charged of punting on this larger encoding issue. As far
as this patch series is concerned, it unearths an obscure case that I
think nobody cares about in practice, and I'd like to move on with the
"remove kwset" optimization.

But I strongly believe that the new behavior with the PCRE v2
optimization is the only sane thing to do, and to the extent we have
anything left to do (#leftoverbits) it's that we should modify git more
generally (aside from string searching) to do the same thing where
appropriate.

Remember, this only happens if the user has set a UTF-8 locale and thus
promised that they're going to give us UTF-8. We then take that promise
and make e.g. "æ" match "Æ" under --ignore-case.

Just falling back on raw byte matching isn't going to cut it, because
then "æ<invalid utf8>" won't match "Æ<same invalid utf8>" under
--ignore-case, and there are other cases like that, such as matching
word boundaries & other Unicode gotchas.

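To make the byte-fallback problem concrete, here is a minimal sketch in
Python. It uses Python's "re" module as a stand-in, not PCRE v2 or git's
own matching code, and the helper names and the 0xFF sample byte are
made up for illustration. The point it shows is that a case-insensitive
search over raw bytes only folds ASCII letters, so it cannot equate "æ"
with "Æ", whether or not invalid bytes follow.

```python
import re

# Hypothetical sample data: U+00E6 / U+00C6 followed by the byte 0xFF,
# which can never appear in well-formed UTF-8.
needle = "æ".encode("utf-8") + b"\xff"
haystack = "Æ".encode("utf-8") + b"\xff"


def bytewise_icase_match(pattern: bytes, subject: bytes) -> bool:
    """Case-insensitive search on raw bytes; only ASCII letters are folded."""
    return re.search(re.escape(pattern), subject, re.IGNORECASE) is not None


def unicode_icase_match(pattern: str, subject: str) -> bool:
    """Case-insensitive search on decoded text, using Unicode case mappings."""
    return re.search(re.escape(pattern), subject, re.IGNORECASE) is not None


# Valid UTF-8 on both sides, matched as text: the two letters are equated.
print(unicode_icase_match("æ", "Æ"))

# Same letters plus the same invalid byte, matched as raw bytes: no match,
# because the second byte of each letter differs (0xA6 vs 0x86).
print(bytewise_icase_match(needle, haystack))
```

The first call prints True and the second prints False, which is the
mismatch described above.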
The best that can be hoped for at that point is some "loose UTF-8"
mode. Both perl & GNU grep seem to support that (although I'm sure
it falls apart at some point). GNU grep will also die in the same way
that we now die with --perl-regexp (since it also uses PCRE).

I think that's saner: if the user thinks they're feeding us UTF-8 but
they're not, I think they'd like to know rather than have the string
matching library fall back.
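The two options discussed here, dying on invalid input versus a "loose
UTF-8" mode, can be sketched as follows. This is only an analogy in
Python: strict decoding stands in for the matcher rejecting the input,
and the "surrogateescape" error handler stands in for a loose mode. It
is not how PCRE v2, perl or GNU grep implement either behavior, and the
function names are hypothetical.

```python
import re


def strict_icase_match(pattern: bytes, subject: bytes) -> bool:
    """Reject invalid UTF-8: decoding raises UnicodeDecodeError."""
    p = pattern.decode("utf-8")
    s = subject.decode("utf-8")
    return re.search(re.escape(p), s, re.IGNORECASE) is not None


def loose_icase_match(pattern: bytes, subject: bytes) -> bool:
    """Tolerate invalid bytes by mapping each one to a lone surrogate."""
    p = pattern.decode("utf-8", errors="surrogateescape")
    s = subject.decode("utf-8", errors="surrogateescape")
    return re.search(re.escape(p), s, re.IGNORECASE) is not None


# Hypothetical sample data: a valid letter followed by an invalid byte.
needle = "æ".encode("utf-8") + b"\xff"
haystack = "Æ".encode("utf-8") + b"\xff"

try:
    strict_icase_match(needle, haystack)
except UnicodeDecodeError as err:
    # The caller learns that the input was not the UTF-8 it promised.
    print("strict mode rejects the input:", err.reason)

# Loose mode still equates the two letters and carries the bad byte along.
print(loose_icase_match(needle, haystack))
```

The strict variant surfaces the broken promise to the caller, while the
loose variant quietly keeps matching. That is the trade-off being
weighed above.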