[RFE] Teach git textconv to support %f

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFE] Teach git textconv to support %f
@ 2021-06-08 15:38 Randall S. Becker
  2021-06-09  0:42 ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Randall S. Becker @ 2021-06-08 15:38 UTC (permalink / raw)
  To: git

The filter structure provides a mechanism for providing the working directory's file name path to a filter using a %f argument. This
request is to teach the textconv mechanism to support the same capability.

The use case comes from a complex content renderer that needs to know what the original file name is, so as to be able to find
additional content, by name, that describes the file (base name+different extension).

If this is considered a good idea, I would be happy to implement this but need a pointer or two of where to look in the code to make
it happen.

Thanks,
Randall

-- Brief whoami:
 NonStop developer since approximately 211288444200000000
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFE] Teach git textconv to support %f
  2021-06-08 15:38 [RFE] Teach git textconv to support %f Randall S. Becker
@ 2021-06-09  0:42 ` Junio C Hamano
  2021-06-09  4:11   ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2021-06-09  0:42 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: git

"Randall S. Becker" <rsbecker@nexbridge.com> writes:

> The filter structure provides a mechanism for providing the
> working directory's file name path to a filter using a %f
> argument. This request is to teach the textconv mechanism to
> support the same capability.
>
> The use case comes from a complex content renderer that needs to
> know what the original file name is, so as to be able to find
> additional content, by name, that describes the file (base
> name+different extension).
>
> If this is considered a good idea, I would be happy to implement
> this but need a pointer or two of where to look in the code to
> make it happen.

Both in diff, grep and cat-file, textconv eventually triggers
diff.c::fill_textconv() and calls run_textconv() unless there is a
cached copy of the resut of running textconv earlier on the same
contents.  This is because for each textconv driver, the output is
expected to be purely a function of the input bytestream, and that
is why it does not take any other input.

So, if we have two identical blobs in a tree object under different
pathnames, making the output from textconv different for them
because they sit at different pathnames directly goes against the
basic design of the system at the philosophical level.

Having said that, I _suspect_ (but not verified) that as long as the
driver is marked as non-cacheable, it may be acceptable to export a
new environment variable, say, GIT_TEXTCONV_PATH, and allow the
textconv program to produce different results for the same input.  I
am not offhand sure if it is OK to allow command line substitutions
like the filter scripts, though.  It would be nice from the point of
view of consistency if we could do so, but those who use an existing
textconv program at a pathname with per-cent in it may get negatively
affected.

Thanks.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFE] Teach git textconv to support %f
  2021-06-09  0:42 ` Junio C Hamano
@ 2021-06-09  4:11   ` Jeff King
  2021-06-09 14:16     ` Randall S. Becker
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2021-06-09  4:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Randall S. Becker, git

On Wed, Jun 09, 2021 at 09:42:39AM +0900, Junio C Hamano wrote:

> "Randall S. Becker" <rsbecker@nexbridge.com> writes:
> 
> > The filter structure provides a mechanism for providing the
> > working directory's file name path to a filter using a %f
> > argument. This request is to teach the textconv mechanism to
> > support the same capability.
> >
> > The use case comes from a complex content renderer that needs to
> > know what the original file name is, so as to be able to find
> > additional content, by name, that describes the file (base
> > name+different extension).
> >
> > If this is considered a good idea, I would be happy to implement
> > this but need a pointer or two of where to look in the code to
> > make it happen.
> 
> Both in diff, grep and cat-file, textconv eventually triggers
> diff.c::fill_textconv() and calls run_textconv() unless there is a
> cached copy of the resut of running textconv earlier on the same
> contents.  This is because for each textconv driver, the output is
> expected to be purely a function of the input bytestream, and that
> is why it does not take any other input.
> 
> So, if we have two identical blobs in a tree object under different
> pathnames, making the output from textconv different for them
> because they sit at different pathnames directly goes against the
> basic design of the system at the philosophical level.
> 
> Having said that, I _suspect_ (but not verified) that as long as the
> driver is marked as non-cacheable, it may be acceptable to export a
> new environment variable, say, GIT_TEXTCONV_PATH, and allow the
> textconv program to produce different results for the same input.  I
> am not offhand sure if it is OK to allow command line substitutions
> like the filter scripts, though.  It would be nice from the point of
> view of consistency if we could do so, but those who use an existing
> textconv program at a pathname with per-cent in it may get negatively
> affected.

As the person who implemented both textconv and its caching, all of that
sounds quite sensible to me. :)

The caching mechanism has never been turned on by default, and probably
will never be (because one of the original uses for which I wanted
textconv is decrypting blobs on the fly, and storing the result would
defeat the purpose). So it may be sufficient to say "if it hurts to turn
on caching with a filter that depends on the path, then don't do it".

It may be possible to make it work with a cache, too. The caching
mechanism also has a validity key, which looks something like this:

  [prime the cache with a silly filter]
  $ git -c diff.perl.textconv='tr a-z A-Z <' \
        -c diff.perl.cachetextconv=true \
        log -p '*.perl' >/dev/null

  [the result is stored via git-notes; the body is the cache-key]
  $ git cat-file commit refs/notes/textconv/perl
  tree a27c3c18d222e47a7943c2844f9ed6c75710d8c4
  author Jeff King <peff@peff.net> 1623211550 -0400
  committer Jeff King <peff@peff.net> 1623211550 -0400

  tr a-z A-Z <

I thought at first we could put the pathname into that key, but it's
really a key for the _whole_ cache, not an individual entry. So that
wouldn't work. You'd have to redesign the cache mechanism (which in turn
might require assistance from the notes code). So I lean towards "if it
hurts, don't do it". :)

One final note. Randall said:

> > The use case comes from a complex content renderer that needs to
> > know what the original file name is, so as to be able to find
> > additional content, by name, that describes the file (base
> > name+different extension).

That needs the filename, but there's also the implication there that
ancillary data would be coming from _other_ files based on that name.
That seems a lot more complicated. If I'm a filter and am asked to
convert "foo.one", I might want to see "foo.two". But how do I know from
which commit to get it? It is OK if "foo.two" is a globally unambiguous
name across time, but I'd imagine in most cases you want "foo.two" that
accompanied "foo.one" at the time. And now the cache is not even a
property of a blob/filename pair, but rather of the whole tree.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [RFE] Teach git textconv to support %f
  2021-06-09  4:11   ` Jeff King
@ 2021-06-09 14:16     ` Randall S. Becker
  0 siblings, 0 replies; 4+ messages in thread
From: Randall S. Becker @ 2021-06-09 14:16 UTC (permalink / raw)
  To: 'Jeff King', 'Junio C Hamano'; +Cc: git

On June 9, 2021 12:12 AM, Peff wrote:
>On Wed, Jun 09, 2021 at 09:42:39AM +0900, Junio C Hamano wrote:
>
>> "Randall S. Becker" <rsbecker@nexbridge.com> writes:
>>
>> > The filter structure provides a mechanism for providing the working
>> > directory's file name path to a filter using a %f argument. This
>> > request is to teach the textconv mechanism to support the same
>> > capability.
>> >
>> > The use case comes from a complex content renderer that needs to
>> > know what the original file name is, so as to be able to find
>> > additional content, by name, that describes the file (base
>> > name+different extension).
>> >
>> > If this is considered a good idea, I would be happy to implement
>> > this but need a pointer or two of where to look in the code to make
>> > it happen.
>>
>> Both in diff, grep and cat-file, textconv eventually triggers
>> diff.c::fill_textconv() and calls run_textconv() unless there is a
>> cached copy of the resut of running textconv earlier on the same
>> contents.  This is because for each textconv driver, the output is
>> expected to be purely a function of the input bytestream, and that is
>> why it does not take any other input.
>>
>> So, if we have two identical blobs in a tree object under different
>> pathnames, making the output from textconv different for them because
>> they sit at different pathnames directly goes against the basic design
>> of the system at the philosophical level.
>>
>> Having said that, I _suspect_ (but not verified) that as long as the
>> driver is marked as non-cacheable, it may be acceptable to export a
>> new environment variable, say, GIT_TEXTCONV_PATH, and allow the
>> textconv program to produce different results for the same input.  I
>> am not offhand sure if it is OK to allow command line substitutions
>> like the filter scripts, though.  It would be nice from the point of
>> view of consistency if we could do so, but those who use an existing
>> textconv program at a pathname with per-cent in it may get negatively
>> affected.
>
>As the person who implemented both textconv and its caching, all of that sounds quite sensible to me. :)
>
>The caching mechanism has never been turned on by default, and probably will never be (because one of the original uses for which I
>wanted textconv is decrypting blobs on the fly, and storing the result would defeat the purpose). So it may be sufficient to say "if it hurts to
>turn on caching with a filter that depends on the path, then don't do it".
>
>It may be possible to make it work with a cache, too. The caching mechanism also has a validity key, which looks something like this:
>
>  [prime the cache with a silly filter]
>  $ git -c diff.perl.textconv='tr a-z A-Z <' \
>        -c diff.perl.cachetextconv=true \
>        log -p '*.perl' >/dev/null
>
>  [the result is stored via git-notes; the body is the cache-key]
>  $ git cat-file commit refs/notes/textconv/perl
>  tree a27c3c18d222e47a7943c2844f9ed6c75710d8c4
>  author Jeff King <peff@peff.net> 1623211550 -0400
>  committer Jeff King <peff@peff.net> 1623211550 -0400
>
>  tr a-z A-Z <
>
>I thought at first we could put the pathname into that key, but it's really a key for the _whole_ cache, not an individual entry. So that
>wouldn't work. You'd have to redesign the cache mechanism (which in turn might require assistance from the notes code). So I lean
>towards "if it hurts, don't do it". :)
>
>One final note. Randall said:
>
>> > The use case comes from a complex content renderer that needs to
>> > know what the original file name is, so as to be able to find
>> > additional content, by name, that describes the file (base
>> > name+different extension).
>
>That needs the filename, but there's also the implication there that ancillary data would be coming from _other_ files based on that
>name.
>That seems a lot more complicated. If I'm a filter and am asked to convert "foo.one", I might want to see "foo.two". But how do I know
>from which commit to get it? It is OK if "foo.two" is a globally unambiguous name across time, but I'd imagine in most cases you want
>"foo.two" that accompanied "foo.one" at the time. And now the cache is not even a property of a blob/filename pair, but rather of the
>whole tree.

Well, sadly, I see the points you both make. It might be better for my team to come up with a self-describing file format with embedded metadata instead. Thing is, the whole approach of idempotency in git where the commit would immutably contain both the file and its metadata in a separate blob but same commit is appealing - which is the only condition where I would see my original suggestion applying.

-Randall


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-06-09 14:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-06-08 15:38 [RFE] Teach git textconv to support %f Randall S. Becker
2021-06-09  0:42 ` Junio C Hamano
2021-06-09  4:11   ` Jeff King
2021-06-09 14:16     ` Randall S. Becker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).