From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.8 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, FROM_EXCESS_BASE64,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,SPF_HELO_NONE,SPF_NONE shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id F3E721F462 for ; Fri, 26 Jul 2019 20:05:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726277AbfGZUFf (ORCPT ); Fri, 26 Jul 2019 16:05:35 -0400 Received: from mail-ed1-f67.google.com ([209.85.208.67]:44241 "EHLO mail-ed1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726000AbfGZUFf (ORCPT ); Fri, 26 Jul 2019 16:05:35 -0400 Received: by mail-ed1-f67.google.com with SMTP id k8so54208327edr.11 for ; Fri, 26 Jul 2019 13:05:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:references:user-agent:in-reply-to:date :message-id:mime-version:content-transfer-encoding; bh=PeOlix8IQQdCV9e1BX81+zA5dOLcLFiuqaHgIW/4HUE=; b=j4EZvUH4qxGPc7C7vANQNAVGU5mi0A6u4VDxoFm2E3V9etkO9gOe+sc30HlJcSdbO5 uZYB4kiTI0u7ihEhj0t5UAUwQefJopCLNDq+Jt+2fT67NqMeuwhXohxfoW818kTWdWeQ N/KAeMGC0nRg7UoEZgjOLA/FEbT61LzhII6BTf9GUgQWCUrgns5mHm6/3NWoufkROMMJ Lx8oRKME6YMYJRecIa7E9EGYewK2yWQq6yQwHLSqG/D9m7hsoTtAXiueKPKnYomJL/Tz 52AyY1TPiLPhv84NtbzFO+GqC+d48IESDAe5sru6FSzaZV4uX76Ismne03HXxXgsxaf9 ph4g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:references:user-agent :in-reply-to:date:message-id:mime-version:content-transfer-encoding; bh=PeOlix8IQQdCV9e1BX81+zA5dOLcLFiuqaHgIW/4HUE=; b=cyw9FgKke2Bm/ByJyZ1fSOPM7mP4TgCr/rrCSkhVCYdqmwPFqRtbFrMB/4iMR+k4nM J4uwuAi+QQ/+loOlPXDHhmXOLVdtEDD1+fN2SB/5QEK7M9Cq0rRRUiJtGVaNM4JTAW5A ZPpyM6cPdWfR1DhGJ5r54HrNmYHrFsJH3SpcWX7l0pcXdaoZYnxfoCSaoe51rUgsSiaM q5jjw4QA9ealUfHNui9XjCibVGf2RGQSn94yjCMztEL41Yy0xZiFPt9VYBDW9tKP6qFw cqRjJ1orZ0P/pkFP8nqieJlklR2JOou+J5VB6LOhb5sermVK8Ab0jvtXzuKVvbbZT75a trVg== X-Gm-Message-State: APjAAAU5+zByBHHlBt2bqNi5ZbVm0f7n8k/xaTvWX1Sfq0c5w04LZ/2h Mtiz3Fg612GeR2KRVgQX7SqMNa+yH74= X-Google-Smtp-Source: APXvYqzuofZFDDACmNZwYhfOSUiD8bsID6VJSI36a7DEKkN7TcRLabcdoUOzWnE5P6RJQ2t41x3U9Q== X-Received: by 2002:a05:6402:782:: with SMTP id d2mr85108755edy.80.1564171532928; Fri, 26 Jul 2019 13:05:32 -0700 (PDT) Received: from evledraar (i237193.upc-i.chello.nl. [62.195.237.193]) by smtp.gmail.com with ESMTPSA id d44sm14515893eda.75.2019.07.26.13.05.31 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Fri, 26 Jul 2019 13:05:31 -0700 (PDT) From: =?utf-8?B?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason To: Carlo Arenas Cc: Johannes Schindelin , Junio C Hamano , git@vger.kernel.org, =?utf-8?B?Tmd1eeG7hW4gVGjDoWkg?= =?utf-8?B?Tmfhu41j?= Duy Subject: Re: [PATCH] grep: skip UTF8 checks explicitally References: <20190721183115.14985-1-carenas@gmail.com> <87muh57t5r.fsf@evledraar.gmail.com> <87lfwn70nb.fsf@evledraar.gmail.com> <87ef2c7roy.fsf@evledraar.gmail.com> User-agent: Debian GNU/Linux 10 (buster); Emacs 26.1; mu4e 1.1.0 In-reply-to: Date: Fri, 26 Jul 2019 22:05:31 +0200 Message-ID: <87blxg7e9g.fsf@evledraar.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Fri, Jul 26 2019, Carlo Arenas wrote: > On Fri, Jul 26, 2019 at 8:15 AM =C3=86var Arnfj=C3=B6r=C3=B0 Bjarmason > wrote: >> I'm not sure what a real fix for that is. Part of it is probably 8/8 in >> the series I mention below, but more generally we'd need to be more >> encoding aware at a much higher callsite than "grep". So e.g. we'd know >> that we match "binary" data as not-UTF-8. Now we just throw arbitrary >> bytes around and hope something sticks. > > I haven't look yet at your proposed changes, but my gut feeling is that > the work to support invalid UTF in the yet unreleased PCRE version would > be needed as part of it, and therefore it might be better to keep PCRE > out of the main path until that gets released and can be relied upon. I'm hoping my 8-part series is good enough to move it forward, but as 48de2a768c ("grep: remove the kwset optimization", .2019-07-01) shows we can always just fall back on using regcomp instead. > kwset is not going away with this series anyway, regardless of the no-kws= et > name on the branch. The larger context here is that this is the 1st step of a 2-step series to get rid of kwset. If I can pull that off successfully is another matter, but that's the plan. After it's applied we just use it in the pickaxe code, and it's relatively straightforward to convert that. See: https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/ >> > If we're already deciding to paper over things, I'd much rather prefer >> > the simpler patch, i.e. Carlo's. >> >> As I noted upthread PCRE's own docs promise undefined behavior and fire >> and brimstone if that patch is applied. Those last two not >> guaranteed. So we need another solution. > > in my original reply I mentioned I explicitly didn't do a test because of= this > "undefined behavior", but I think it should be fair to mention that we are > already affected by that because using the JIT fast path does skip any > UTF-8 validations and is currently possible to get git into an infinite l= oop > or make it segfault when using PCRE. Right, this is a good point that we should take notice of. I.e. this is *not* a new bug per-se, you can do this on master and get a UTF-8 bug from git.git: git grep -P '(*NO_JIT)[=C3=A6]' > in that line, I am not sure I understand the pushback against making that > explicit since it only makes both codepaths behave the same (bugs and > risks of burning alike) Because with my kwset series we're getting a lot more users of this until-now obscure code, so we're finding old-but-new-to-us bugs. We've had this bug dating all the way back to Duy's 18547aacf5 ("grep/pcre: support utf-8", 2016-06-25). It was first released with git 2.10. So why are we getting list discussion about it *now*? Because my kwset series got merged to "next", and we apparently have a lot of users who'd use fixed-string git-grep under locales, but never used PCRE via -P explicitly before. So it's worth getting the semantics right. As noted in the E-Mail I linked to earlier my ulterior motive here is to get to a point where we'll funnel all regex matching through PCRE implicitly if it's available. We need to get these UTF-8 edge cases right. I don't know if my recent 8-part series gets us 100% there, but hopefully it at least gets us closer to it.