From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2E885C48BCF for ; Wed, 9 Jun 2021 14:17:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0BD96613B1 for ; Wed, 9 Jun 2021 14:17:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234271AbhFIOTC convert rfc822-to-8bit (ORCPT ); Wed, 9 Jun 2021 10:19:02 -0400 Received: from elephants.elehost.com ([216.66.27.132]:44949 "EHLO elephants.elehost.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233642AbhFIOTB (ORCPT ); Wed, 9 Jun 2021 10:19:01 -0400 X-Virus-Scanned: amavisd-new at elehost.com Received: from gnash (cpe00fc8d49d843-cm00fc8d49d840.cpe.net.cable.rogers.com [173.33.197.34]) (authenticated bits=0) by elephants.elehost.com (8.15.2/8.15.2) with ESMTPSA id 159EGum1089466 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Wed, 9 Jun 2021 10:16:56 -0400 (EDT) (envelope-from rsbecker@nexbridge.com) From: "Randall S. Becker" To: "'Jeff King'" , "'Junio C Hamano'" Cc: References: <01f901d75c7c$5a8bcb10$0fa36130$@nexbridge.com> In-Reply-To: Subject: RE: [RFE] Teach git textconv to support %f Date: Wed, 9 Jun 2021 10:16:49 -0400 Message-ID: <002401d75d3a$19a837a0$4cf8a6e0$@nexbridge.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8BIT X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQHpOr3p7X//LpuBa7CIpZ84sn1qVwGW4J1qAk/TNOiqyMgkgA== Content-Language: en-ca Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On June 9, 2021 12:12 AM, Peff wrote: >On Wed, Jun 09, 2021 at 09:42:39AM +0900, Junio C Hamano wrote: > >> "Randall S. Becker" writes: >> >> > The filter structure provides a mechanism for providing the working >> > directory's file name path to a filter using a %f argument. This >> > request is to teach the textconv mechanism to support the same >> > capability. >> > >> > The use case comes from a complex content renderer that needs to >> > know what the original file name is, so as to be able to find >> > additional content, by name, that describes the file (base >> > name+different extension). >> > >> > If this is considered a good idea, I would be happy to implement >> > this but need a pointer or two of where to look in the code to make >> > it happen. >> >> Both in diff, grep and cat-file, textconv eventually triggers >> diff.c::fill_textconv() and calls run_textconv() unless there is a >> cached copy of the resut of running textconv earlier on the same >> contents. This is because for each textconv driver, the output is >> expected to be purely a function of the input bytestream, and that is >> why it does not take any other input. >> >> So, if we have two identical blobs in a tree object under different >> pathnames, making the output from textconv different for them because >> they sit at different pathnames directly goes against the basic design >> of the system at the philosophical level. >> >> Having said that, I _suspect_ (but not verified) that as long as the >> driver is marked as non-cacheable, it may be acceptable to export a >> new environment variable, say, GIT_TEXTCONV_PATH, and allow the >> textconv program to produce different results for the same input. I >> am not offhand sure if it is OK to allow command line substitutions >> like the filter scripts, though. It would be nice from the point of >> view of consistency if we could do so, but those who use an existing >> textconv program at a pathname with per-cent in it may get negatively >> affected. > >As the person who implemented both textconv and its caching, all of that sounds quite sensible to me. :) > >The caching mechanism has never been turned on by default, and probably will never be (because one of the original uses for which I >wanted textconv is decrypting blobs on the fly, and storing the result would defeat the purpose). So it may be sufficient to say "if it hurts to >turn on caching with a filter that depends on the path, then don't do it". > >It may be possible to make it work with a cache, too. The caching mechanism also has a validity key, which looks something like this: > > [prime the cache with a silly filter] > $ git -c diff.perl.textconv='tr a-z A-Z <' \ > -c diff.perl.cachetextconv=true \ > log -p '*.perl' >/dev/null > > [the result is stored via git-notes; the body is the cache-key] > $ git cat-file commit refs/notes/textconv/perl > tree a27c3c18d222e47a7943c2844f9ed6c75710d8c4 > author Jeff King 1623211550 -0400 > committer Jeff King 1623211550 -0400 > > tr a-z A-Z < > >I thought at first we could put the pathname into that key, but it's really a key for the _whole_ cache, not an individual entry. So that >wouldn't work. You'd have to redesign the cache mechanism (which in turn might require assistance from the notes code). So I lean >towards "if it hurts, don't do it". :) > >One final note. Randall said: > >> > The use case comes from a complex content renderer that needs to >> > know what the original file name is, so as to be able to find >> > additional content, by name, that describes the file (base >> > name+different extension). > >That needs the filename, but there's also the implication there that ancillary data would be coming from _other_ files based on that >name. >That seems a lot more complicated. If I'm a filter and am asked to convert "foo.one", I might want to see "foo.two". But how do I know >from which commit to get it? It is OK if "foo.two" is a globally unambiguous name across time, but I'd imagine in most cases you want >"foo.two" that accompanied "foo.one" at the time. And now the cache is not even a property of a blob/filename pair, but rather of the >whole tree. Well, sadly, I see the points you both make. It might be better for my team to come up with a self-describing file format with embedded metadata instead. Thing is, the whole approach of idempotency in git where the commit would immutably contain both the file and its metadata in a separate blob but same commit is appealing - which is the only condition where I would see my original suggestion applying. -Randall