git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Looking for help to understand external filter driver code
@ 2016-07-19 16:40 Lars Schneider
  2016-07-19 17:56 ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Lars Schneider @ 2016-07-19 16:40 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Junio C Hamano, Johannes Sixt

Hi,

a long time ago in aa4ed4 Junio introduced the external filter driver definition. Since that time we fork the Git process and then we fork again to run the external filter. This is probably a super stupid question with an obvious answer... but can anyone help me to understand the code and explain why we fork twice? Wouldn't it be sufficient to just fork once for the external filter since we cannot process anything in parallel anyways?

In 546bb5 Hannes refactored Junio's code to use the async infrastructure that is used today.

Thank you,
Lars

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-19 16:40 Looking for help to understand external filter driver code Lars Schneider
@ 2016-07-19 17:56 ` Junio C Hamano
  2016-07-19 18:53   ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2016-07-19 17:56 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Git Mailing List, Johannes Sixt

Lars Schneider <larsxschneider@gmail.com> writes:

> a long time ago in aa4ed4 Junio introduced the external filter
> driver definition. Since that time we fork the Git process and
> then we fork again to run the external filter. This is probably a
> super stupid question with an obvious answer... but can anyone
> help me to understand the code and explain why we fork twice?

In "git show aa4ed4" you find this picture:

       /*
        * Create a pipeline to have the command filter the buffer's
        * contents.
        *
        * (child --> cmd) --> us
        */

where "child" is a fork of the original Git process; it still has
the <src, size> buffered data inherited from the original process,
so it can write_in_full() into the pipe going to the cmd process.
"us" is the original process that has the reading end of the pipe
that connects us to the "(child --> cmd)" processes, so that it can
read the filtered result from them.

The key benefit of this arrangement is the above can be done without
having to do poll() to flip between reading and writing that is
needed to avoid deadlocking, which kept the code simpler.  A later
conversion of the write side into async does not fundamentally
change anything from the original arrangement.







^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-19 17:56 ` Junio C Hamano
@ 2016-07-19 18:53   ` Junio C Hamano
  2016-07-19 20:44     ` Lars Schneider
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2016-07-19 18:53 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Git Mailing List, Johannes Sixt

Junio C Hamano <gitster@pobox.com> writes:

> The key benefit of this arrangement is the above can be done without
> having to do poll() to flip between reading and writing that is
> needed to avoid deadlocking, which kept the code simpler.  A later
> conversion of the write side into async does not fundamentally
> change anything from the original arrangement.

Translation: I was too lazy to worry about doing poll()/select()
when I did it originally.  As long as you can do so correctly, be my
guest to reduce one process by having the main process do both
reading and writing.

;-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-19 18:53   ` Junio C Hamano
@ 2016-07-19 20:44     ` Lars Schneider
  2016-07-19 21:33       ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Lars Schneider @ 2016-07-19 20:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Johannes Sixt


On 19 Jul 2016, at 20:53, Junio C Hamano <gitster@pobox.com> wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
>> The key benefit of this arrangement is the above can be done without
>> having to do poll() to flip between reading and writing that is
>> needed to avoid deadlocking, which kept the code simpler.  A later
>> conversion of the write side into async does not fundamentally
>> change anything from the original arrangement.
> 
> Translation: I was too lazy to worry about doing poll()/select()
> when I did it originally.  As long as you can do so correctly, be my
> guest to reduce one process by having the main process do both
> reading and writing.
> 
> ;-)

Thanks a lot for this explanation. My goal is to add an option to
the clean/smudge filter config that instructs Git to keep the
external filter process running. Whenever Git encounters a file
to be filtered then it would continue to talk to the filter process
over a simple pipe protocol:

Git writes --> 4 byte filename length
Git writes --> filename string
Git writes --> 4 byte content length
Git writes --> content string
Git reads <-- 4 byte filtered content length
Git reads <-- filtered content

I still need to read more about poll()/select() but it looks to me
as if these functions are only required if Git doesn't know what to
expect. With the sketched protocol above that wouldn't be the case.
Therefore, I wonder if I would need to use poll()/select()?

Thanks,
Lars


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-19 20:44     ` Lars Schneider
@ 2016-07-19 21:33       ` Junio C Hamano
  2016-07-19 22:01         ` Lars Schneider
                           ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Junio C Hamano @ 2016-07-19 21:33 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Git Mailing List, Johannes Sixt

Lars Schneider <larsxschneider@gmail.com> writes:

> Git writes --> 4 byte filename length
> Git writes --> filename string

Why limit to 32GB?  Perhaps NUL termination is more appropriate
here?

> Git writes --> 4 byte content length
> Git writes --> content string
> Git reads <-- 4 byte filtered content length
> Git reads <-- filtered content

Do you really need to force the sender to know the length in
advance?  Together with the sequential nature of the above exchange,
i.e. the filter is forbidden from producing even a single byte of
its output before reading everything Git feeds it, you are making it
impossible to use filters that perform streaming conversion.

Of course, with the "sequential" thing, you do not have to worry
about deadlocking hence no need for poll/select, but I am not sure
that is a good thing.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-19 21:33       ` Junio C Hamano
@ 2016-07-19 22:01         ` Lars Schneider
  2016-07-20  2:33           ` Torsten Bögershausen
  2016-07-20  8:59         ` Jakub Narębski
  2016-07-20 13:49         ` Jeff King
  2 siblings, 1 reply; 10+ messages in thread
From: Lars Schneider @ 2016-07-19 22:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Johannes Sixt


On 19 Jul 2016, at 23:33, Junio C Hamano <gitster@pobox.com> wrote:

> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> Git writes --> 4 byte filename length
>> Git writes --> filename string
> 
> Why limit to 32GB?  Perhaps NUL termination is more appropriate
> here?
OK, I will use NUL termination for the filename. 
You're also right about the limit - I will use 8 byte to encode the 
content length.


>> Git writes --> 4 byte content length
>> Git writes --> content string
>> Git reads <-- 4 byte filtered content length
>> Git reads <-- filtered content
> 
> Do you really need to force the sender to know the length in
> advance?  Together with the sequential nature of the above exchange,
> i.e. the filter is forbidden from producing even a single byte of
> its output before reading everything Git feeds it, you are making it
> impossible to use filters that perform streaming conversion.
That is correct. However, for my particular use case streaming
conversion wouldn't be useful anyways:
https://github.com/github/git-lfs/pull/1382


> Of course, with the "sequential" thing, you do not have to worry
> about deadlocking hence no need for poll/select, but I am not sure
> that is a good thing.
Thanks for the confirmation. I consider to exchange a "filter protocol 
version" right after the filter process has started. That way someone 
could add a more evolved "filter driver protocol" later on that supports 
streaming and the external filter could pick whatever protocol is most
appropriate (and supported). Could that be an acceptable compromise
to get a serious review of the "sequential" thing?

Thanks,
Lars



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-19 22:01         ` Lars Schneider
@ 2016-07-20  2:33           ` Torsten Bögershausen
  0 siblings, 0 replies; 10+ messages in thread
From: Torsten Bögershausen @ 2016-07-20  2:33 UTC (permalink / raw)
  To: Lars Schneider, Junio C Hamano; +Cc: Git Mailing List, Johannes Sixt



On 07/20/2016 12:01 AM, Lars Schneider wrote:
>
> On 19 Jul 2016, at 23:33, Junio C Hamano <gitster@pobox.com> wrote:
>
>> Lars Schneider <larsxschneider@gmail.com> writes:
>>
>>> Git writes --> 4 byte filename length
>>> Git writes --> filename string
>>
>> Why limit to 32GB?  Perhaps NUL termination is more appropriate
>> here?
> OK, I will use NUL termination for the filename.
> You're also right about the limit - I will use 8 byte to encode the
> content length.
Is there any reason to encode the file length in binary format?
With all the discussions about big endianess, little endianess, 4GiB or 
32 GiB.
How about simply writing the length as ASCII ?

Unless we don't want to have a "spare" field for future extensions,
it could be good to add an option field, which may be empty.
On top of that, do we want a field separator different from the line
separator ?

How about this:
<options><TAB><length><TAB><filename><NUL>

<options> may be "var=value;var2=value2" or simply ""

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-19 21:33       ` Junio C Hamano
  2016-07-19 22:01         ` Lars Schneider
@ 2016-07-20  8:59         ` Jakub Narębski
  2016-07-20  9:43           ` Lars Schneider
  2016-07-20 13:49         ` Jeff King
  2 siblings, 1 reply; 10+ messages in thread
From: Jakub Narębski @ 2016-07-20  8:59 UTC (permalink / raw)
  To: Junio C Hamano, Lars Schneider; +Cc: Git Mailing List, Johannes Sixt

W dniu 2016-07-19 o 23:33, Junio C Hamano pisze:
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> > Git writes --> 4 byte filename length
>> > Git writes --> filename string
>
> Why limit to 32GB?  Perhaps NUL termination is more appropriate
> here?

Errr, I think limiting _filename_ to 32GB is a reasonable
assumption, for forever...
 
>> > Git writes --> 4 byte content length
>> > Git writes --> content string

...while limiting file _content_ to 32GB might not be
future-proof enough.

;-)
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-20  8:59         ` Jakub Narębski
@ 2016-07-20  9:43           ` Lars Schneider
  0 siblings, 0 replies; 10+ messages in thread
From: Lars Schneider @ 2016-07-20  9:43 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Junio C Hamano, Git Mailing List, Johannes Sixt


> On 20 Jul 2016, at 10:59, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 2016-07-19 o 23:33, Junio C Hamano pisze:
>> Lars Schneider <larsxschneider@gmail.com> writes:
>> 
>>>> Git writes --> 4 byte filename length
>>>> Git writes --> filename string
>> 
>> Why limit to 32GB?  Perhaps NUL termination is more appropriate
>> here?
> 
> Errr, I think limiting _filename_ to 32GB is a reasonable
> assumption, for forever...

Well, Java packages can get reaaaally long :D

> 
>>>> Git writes --> 4 byte content length
>>>> Git writes --> content string
> 
> ...while limiting file _content_ to 32GB might not be
> future-proof enough.
> 
> ;-)

I think this is what Junio meant to say. At least I 
interpreted it that way.


- Lars

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Looking for help to understand external filter driver code
  2016-07-19 21:33       ` Junio C Hamano
  2016-07-19 22:01         ` Lars Schneider
  2016-07-20  8:59         ` Jakub Narębski
@ 2016-07-20 13:49         ` Jeff King
  2 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2016-07-20 13:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Lars Schneider, Git Mailing List, Johannes Sixt

On Tue, Jul 19, 2016 at 02:33:09PM -0700, Junio C Hamano wrote:

> > Git writes --> 4 byte content length
> > Git writes --> content string
> > Git reads <-- 4 byte filtered content length
> > Git reads <-- filtered content
> 
> Do you really need to force the sender to know the length in
> advance?  Together with the sequential nature of the above exchange,
> i.e. the filter is forbidden from producing even a single byte of
> its output before reading everything Git feeds it, you are making it
> impossible to use filters that perform streaming conversion.

Another option: use pkt-lines with a flush packet to indicate
end-of-input. That allows arbitrary sized data, with streaming, and
reuses existing concepts from git. There is proportional overhead, but
it's only 4 bytes per 64k, which is a tiny percent.

It does make some implementations easier if they know the size ahead of
time, though, so if we are _sure_ that nobody will want streaming later,
it may not be a good tradeoff. If we do print a size ahead of time, the
"normal" thing in git would be to do so in base-10 ascii followed by a
newline (e.g., as found in "cat-file --batch", or fast-import's "data"
command).

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-07-20 13:49 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-07-19 16:40 Looking for help to understand external filter driver code Lars Schneider
2016-07-19 17:56 ` Junio C Hamano
2016-07-19 18:53   ` Junio C Hamano
2016-07-19 20:44     ` Lars Schneider
2016-07-19 21:33       ` Junio C Hamano
2016-07-19 22:01         ` Lars Schneider
2016-07-20  2:33           ` Torsten Bögershausen
2016-07-20  8:59         ` Jakub Narębski
2016-07-20  9:43           ` Lars Schneider
2016-07-20 13:49         ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).