Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] diff: document behavior of relative diff.orderFile
From: Junio C Hamano @ 2017-01-11 20:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Richard Hansen, git
In-Reply-To: <20170111144158.ef6kle3vw3ejgmut@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> On Tue, Jan 10, 2017 at 02:15:11PM -0800, Junio C Hamano wrote:
>
>> Richard Hansen <hansenr@google.com> writes:
>> ...
>> > I think so.  (For bare repositories anyway; non-bare should be
>> > relative to GIT_WORK_TREE.)  Perhaps git_config_pathname() itself
>> > should convert relative paths to absolute so that every pathname
>> > setting automatically works without changing any calling code.
>> 
>> Yes, that was what I was alluding to.  We might have to wait until
>> major version boundary to do so, but I think that it is the sensible
>> way forward in the longer term to convert relative to absolute in
>> git_config_pathname().
>
> Yeah, I'd agree.
>
> I'm undecided on whether it would need to happen at a major version
> bump. ...
>
> So I dunno. I do hate to break even corner cases, but I'm having trouble
> imagining the scenario where somebody is actually using the current
> behavior in a useful way.

Thanks for a detailed analysis (I probably should have spelt them
out when I said "we might have to" to save you the trouble).


^ permalink raw reply

* Re: [musl] Re: Test failures when Git is built with libpcre and grep is built without it
From: Junio C Hamano @ 2017-01-11 20:49 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Andreas Schwab, A. Wilcox
In-Reply-To: <20170111100400.vhd5ytarqpujigbn@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> Another difference I found is that "[\d]" matches a literal "\" or "d"
> in ERE, but behaves like "[0-9]" in PCRE. I'll work up a patch based on
> that.

Wow, clever.

Thanks.

^ permalink raw reply

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
From: Junio C Hamano @ 2017-01-11 20:45 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git mailing list, Eric Wong, Jakub Narębski,
	Torsten Bögershausen, Taylor Blau
In-Reply-To: <3165D057-7486-4ACB-8336-E63F49182CBE@gmail.com>

Lars Schneider <larsxschneider@gmail.com> writes:

>> Hmm, I would have expected that the basic flow would become
>> 
>> 	for each paths to be processed:
>> 		convert-to-worktree to buf
>> 		if not delayed:
>> 			do the caller's thing to use buf
>> 		else:
>> 			remember path
>> 
>> 	for each delayed paths:
>> 		ensure filter process finished processing for path
>> 		fetch the thing to buf from the process
>> 		do the caller's thing to use buf
>> 
>> and that would make quite a lot of sense.  However, what is actually
>> implemented is a bit disappointing from that point of view.  While
>> its first part is the same as above, the latter part instead does:
>> 
>> 	for each delayed paths:
>> 		checkout the path
>> ...
>
> I am not sure I can follow you here.
> ...
> I implemented the "checkout_delayed_entries" function in v1 because
> it solved the problem with minimal changes in the existing code. Our previous 
> discussion made me think that this is the preferred way:
>
>      I do not think we want to see such a rewrite all over the
>      codepaths.  It might be OK to add such a "these entries are known
>      to be delayed" list in struct checkout so that the above becomes
>      more like this:
>
>        for (i = 0; i < active_nr; i++)
>           checkout_entry(active_cache[i], state, NULL);
>      + checkout_entry_finish(state);
>
>      That is, addition of a single "some of the checkout_entry() calls
>      done so far might have been lazy, and I'll give them a chance to
>      clean up" might be palatable.  Anything more than that on the
>      caller side is not.

But that is apples-and-oranges comparision, no?  The old discussion
assumes there is no "caller's thing to use buf" other than "checkout
to the working tree", which is why the function its main loop calls
is "checkout_entry()" and the caller does not see the contents of
the filtered blob at all.  In that context, checkout_entry_finish()
that equally does not let the caller see the contents of the
filtered blob is quite appropriate.


^ permalink raw reply

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
From: Junio C Hamano @ 2017-01-11 20:41 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Lars Schneider, Git mailing list, Eric Wong, Taylor Blau
In-Reply-To: <17fa31a5-8689-2766-952b-704f433a5b3a@gmail.com>

Jakub Narębski <jnareb@gmail.com> writes:

>> Yes, this problem happens every day with filters that perform network
>> requests (e.g. GitLFS). 
>
> Do I understand it correctly that the expected performance improvement
> thanks to this feature is possible only if there is some amount of
> parallelism and concurrency in the filter?  That is, filter can be sending
> one blob to Git while processing other one, or filter can be fetching blobs
> in parallel.

The first-object latency may not be helped, but by allowing
"delayed", the latency to retrieve the second and subsequent objects
can be hidden, I would think.

^ permalink raw reply

* Re: git cat-file on a submodule
From: Junio C Hamano @ 2017-01-11 18:56 UTC (permalink / raw)
  To: Jeff King; +Cc: David Turner, git
In-Reply-To: <20170111125330.3skwxdleoooacts6@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> On Tue, Jan 10, 2017 at 07:11:40PM -0500, David Turner wrote:
>
>> Why does git cat-file -t $sha:foo, where foo is a submodule, not work?
> ...
> I'm not sure if I'm complaining or not. I can't immediately think of
> something that would be horribly broken. But it really feels like you
> are using the wrong tool, and patching the tool to handle this case will
> probably lead to weird cognitive dissonance down the road.

"git cat-file [any option] $sha" should fail and complain for any
$sha that does not name an object that exists in the object database
of the repository it is working on.  

So I'd complain if the first example command quoted above from
David's mail stopped failing when the commit bound at 'foo' in the
top-level treeish $sha (i.e. a commit in the submodule) does not
exist in the top-level repository's object database.

> Maybe it would help to describe your use case more fully. If what you
> care about is the presumed type based on the surrounding tree, then
> maybe:
>
>   git --literal-pathspecs ls-tree $sha -- foo
>
> would be a better match.

Yup.

^ permalink raw reply

* [PATCH] lib-submodule-update.sh: drop unneeded shell
From: Stefan Beller @ 2017-01-11 18:47 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller

In modern Git we prefer "git -C <cmd" over "(cd <somewhere && git <cmd>)"
as it doesn't need an extra shell.

Signed-off-by: Stefan Beller <sbeller@google.com>
---

And because it is in a setup function, we actually save the invocation
of 22 shells for a single run of the whole test suite.

Noticed while adding a lot more in near vincinity, though not as near
to cause merge conflicts, so sending it extra.

Thanks,
Stefan

 t/lib-submodule-update.sh | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/t/lib-submodule-update.sh b/t/lib-submodule-update.sh
index 79cdd34a54..915eb4a7c6 100755
--- a/t/lib-submodule-update.sh
+++ b/t/lib-submodule-update.sh
@@ -69,10 +69,7 @@ create_lib_submodule_repo () {
 
 		git checkout -b "replace_sub1_with_directory" "add_sub1" &&
 		git submodule update &&
-		(
-			cd sub1 &&
-			git checkout modifications
-		) &&
+		git -C sub1 checkout modifications &&
 		git rm --cached sub1 &&
 		rm sub1/.git* &&
 		git config -f .gitmodules --remove-section "submodule.sub1" &&
-- 
2.11.0.259.g7b30ecf4f0


^ permalink raw reply related

* Re: [PATCH 2/2] diff: document the pattern format for diff.orderFile
From: Richard Hansen @ 2017-01-11 18:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <xmqqwpe1o43k.fsf@gitster.mtv.corp.google.com>

On 2017-01-11 13:15, Junio C Hamano wrote:
> Richard Hansen <hansenr@google.com> writes:
>
>> On 2017-01-10 21:46, Junio C Hamano wrote:
>>> Richard Hansen <hansenr@google.com> writes:
>>>
>>>> I was looking at the code to see how the two file formats differed and
>>>> noticed that match_order() doesn't set the WM_PATHNAME flag when it
>>>> calls wildmatch().  That's unintentional (a bug), right?
>>>
>>> It has been that way from day one IIRC even before we introduced
>>> wildmatch()---IOW it may be intentional that the current code that
>>> uses wildmatch() does not use WM_PATHNAME.
>>
>> You are the original author (af5323e027 2005-05-30).  Do you remember
>> what your intention was?
>
> Yes.
>
> Back then we didn't even have wildmatch(), and used fnmatch()
> instead, so forcing FNM_PATHNAME would have meant that people
> wouldn't be able to say "foo*bar" to match "foo/other/bar"; with
> wildmatch, "foo**bar" lets you defeat WM_PATHNAME so having
> WM_PATHNAME always in effect is less of an issue, but with
> fnmatch(), having FNM_PATHNAME always in effect has a lot of
> downside.

Ah, that makes sense.

>
> I'd expect that orderfile people have today will be broken and
> require tweaking if you switched WM_PATHNAME on.

OK, so we don't want to turn on WM_PATHNAME unless we do it for a new 
major version.

I'll do another re-roll and document the non-WM_PATHNAME behavior. 
Perhaps I'll encourage users to prefer ** over * if they want to match 
slash (even though they are equivalent) so that migration is easier if 
we ever do turn on WM_PATHNAME.

-Richard

^ permalink raw reply

* Re: git cat-file on a submodule
From: David Turner @ 2017-01-11 18:21 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20170111125330.3skwxdleoooacts6@sigill.intra.peff.net>

On Wed, 2017-01-11 at 07:53 -0500, Jeff King wrote:
> On Tue, Jan 10, 2017 at 07:11:40PM -0500, David Turner wrote:
> 
> > Why does git cat-file -t $sha:foo, where foo is a submodule, not work?
> 
> Because "cat-file" is about inspecting items in the object database, and
> typically the submodule commit is not present in the superproject's
> database. So we cannot know its type. You can infer what it _should_ be
> from the surrounding tree, but you cannot actually do the object lookup.
> 
> Likewise, "git cat-file -t $sha1:Makefile" is not just telling you that
> we found a 100644 entry in the tree, so we expect a blob. It's resolving
> to a sha1, and then checking the type of that sha1 in the database. It
> _should_ be a blob, but if it isn't, then cat-file is the tool that
> should tell you that it is not.
> 
> > git rev-parse $sha:foo works.
> 
> Right. Because that command is about resolving a name to a sha1, which
> we can do even without the object.
> 
> > By "why", I mean "would anyone complain if I fixed it?"  FWIW, I think
> > -p should just return the submodule's sha.
> 
> I'm not sure if I'm complaining or not. I can't immediately think of
> something that would be horribly broken. But it really feels like you
> are using the wrong tool, and patching the tool to handle this case will
> probably lead to weird cognitive dissonance down the road.

OK, this makes sense to me.  I tried cat-file because that is the tool I
was familiar with, but that doesn't mean that it was the right thing to
do.

> Maybe it would help to describe your use case more fully. If what you
> care about is the presumed type based on the surrounding tree, then
> maybe:
> 
>   git --literal-pathspecs ls-tree $sha -- foo

That (minus --literal-pathspecs, which does not exist in the version of
git I happen to be using) is fine for my purposes.  Thanks.


^ permalink raw reply

* Re: [PATCH 2/2] diff: document the pattern format for diff.orderFile
From: Junio C Hamano @ 2017-01-11 18:15 UTC (permalink / raw)
  To: Richard Hansen; +Cc: git
In-Reply-To: <21b416ae-8bf6-4f82-25d3-e51a574e7746@google.com>

Richard Hansen <hansenr@google.com> writes:

> On 2017-01-10 21:46, Junio C Hamano wrote:
>> Richard Hansen <hansenr@google.com> writes:
>>
>>> I was looking at the code to see how the two file formats differed and
>>> noticed that match_order() doesn't set the WM_PATHNAME flag when it
>>> calls wildmatch().  That's unintentional (a bug), right?
>>
>> It has been that way from day one IIRC even before we introduced
>> wildmatch()---IOW it may be intentional that the current code that
>> uses wildmatch() does not use WM_PATHNAME.
>
> You are the original author (af5323e027 2005-05-30).  Do you remember
> what your intention was?

Yes.  

Back then we didn't even have wildmatch(), and used fnmatch()
instead, so forcing FNM_PATHNAME would have meant that people
wouldn't be able to say "foo*bar" to match "foo/other/bar"; with
wildmatch, "foo**bar" lets you defeat WM_PATHNAME so having
WM_PATHNAME always in effect is less of an issue, but with
fnmatch(), having FNM_PATHNAME always in effect has a lot of
downside.

I'd expect that orderfile people have today will be broken and
require tweaking if you switched WM_PATHNAME on.

^ permalink raw reply

* Re: [RFC PATCH 0/5] Localise error headers
From: Junio C Hamano @ 2017-01-11 18:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, Michael J Gruber, git@vger.kernel.org
In-Reply-To: <20170111113725.avl3wetwrfezdni2@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> Yes, I would think die_errno() is a no-brainer for translation, since
> the strerror() will be translated.
>
>>     apply.c:                die(_("internal error"));
>> 
>> That is funny, too. I think we should substitute that with
>> 
>>     die("BUG: untranslated, but what went wrong instead")
>
> Yep. We did not consistently use "BUG:" in the early days. I would say
> that "BUG" lines do not need to be translated. The point is that nobody
> should ever see them, so it seems like there is little point in giving
> extra work to translators.

In addition, "BUG: " is relatively recent introduction to our
codebase.  Perhaps having a separate BUG(<string>) function help the
distinction further?

^ permalink raw reply

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
From: Taylor Blau @ 2017-01-11 17:59 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Junio C Hamano, git, e, jnareb
In-Reply-To: <F533857D-9B51-44C1-8889-AA0542AD8250@gmail.com>

On Wed, Jan 11, 2017 at 11:13:00AM +0100, Lars Schneider wrote:
>
> In v1 I implemented a) with the busy-loop problem in mind.
>
> My thinking was this:
>
> If the filter sees at least one filter request twice then the filter knows that
> Git has already requested all files that require filtering. At that point the
> filter could just block the "delayed" answer to the latest filter request until
> at least one of the previously delayed requests can be fulfilled. Then the filter
> answers "delay" to Git until Git requests the blob that can be fulfilled. This
> process cycles until all requests can be fulfilled. Wouldn't that work?
>
> I think a "done" message by the filter is not easy. Right now the protocol works
> in a mode were Git always asks and the filter always answers. I believe changing
> the filter to be able to initiate a "done" message would complicated the protocol.
>
> > Additionally, the protocol should specify a sentinel "no more entries" value
> > that could be sent from Git to the filter to signal that there are no more files
> > to checkout. Some filters may implement mechanisms for converting files that
> > require a signal to know when all files have been sent. Specifically, Git LFS
> > (https://git-lfs.github.com) batches files to be transferred together, and needs
> > to know when all files have been announced to truncate and send the last batch,
> > if it is not yet full. I'm sure other filter implementations use a similar
> > mechanism and would benefit from this as well.
>
> I agree. I think the filter already has this info implicitly as explained above
> but an explicit message would be better!

This works, but forces some undesirable constraints:

- The filter has to keep track of all items in the checkout to determine how
  many times Git has asked about them. This is probably fine, though annoying.
  Cases where this is not fine is when there are _many_ items in the checkout
  forcing filter implementations to keep track of large sets of data.
- The protocol is now _must_ ask for the status of checkout items in a
  particular order. If the assumption is that "Git has sent me all items in the
  checkout by the point I see Git ask for the status of an item more than once",
  then Git _cannot_ ask for the status of a delayed item while there are still
  more items to checkout. This could be OK, so long as the protocol commits to
  it and documents this behavior.

What are your thoughts?

--
Thanks,
Taylor Blau

ttaylorr@github.com

^ permalink raw reply

* Re: [PATCH 2/2] diff: document the pattern format for diff.orderFile
From: Richard Hansen @ 2017-01-11 17:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <xmqq60lmpb4j.fsf@gitster.mtv.corp.google.com>

On 2017-01-10 21:46, Junio C Hamano wrote:
> Richard Hansen <hansenr@google.com> writes:
>
>> I was looking at the code to see how the two file formats differed and
>> noticed that match_order() doesn't set the WM_PATHNAME flag when it
>> calls wildmatch().  That's unintentional (a bug), right?
>
> It has been that way from day one IIRC even before we introduced
> wildmatch()---IOW it may be intentional that the current code that
> uses wildmatch() does not use WM_PATHNAME.

You are the original author (af5323e027 2005-05-30).  Do you remember 
what your intention was?

I would like to change it to pass WM_PATHNAME, but I'm not sure if that 
would break anyone.  I'm guessing probably not, because users probably 
expect WM_PATHNAME and would be surprised (like I was) to learn otherwise.

If we want to keep it as-is, I'll have to adjust [PATCH v2 2/2].

-Richard

^ permalink raw reply

* Re: [RFC PATCH 0/5] Localise error headers
From: Stefan Beller @ 2017-01-11 17:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael J Gruber, git@vger.kernel.org
In-Reply-To: <20170111113725.avl3wetwrfezdni2@sigill.intra.peff.net>

On Wed, Jan 11, 2017 at 3:37 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Jan 10, 2017 at 10:28:42AM -0800, Stefan Beller wrote:
>
>> > And then presumably that mix would gradually move to 100% consistency as
>> > more messages are translated. But the implicit question is: are there
>> > die() messages that should never be translated? I'm not sure.
>>
>> I would assume any plumbing command is not localizing?
>> Because in plumbing land, (easily scriptable) you may find
>> a grep on the output/stderr for a certain condition?
>
> That's the assumption I'm challenging. Certainly the behavior and
> certain aspects of the output of a plumbing command should remain the
> same over time. But error messages to stderr?

In an ideal world that assumption would hold true and any machine
readable information from the plumbing commands are on stdout and
nobody in their sane mind would ever try to parse stderr.

There is no easy way to check though except auditing all the code.
Having pointed out some string handling mistakes in the prior email,
my confidence is not high that plumbing commands are that strict about
separating useful machine output and errors for human consumption.

>
> It seems like they should be translated, because plumbing invoked on
> behalf of porcelain scripts is going to send its stderr directly to the
> user.

Well that could be solved, c.f. unpack-trees.c lines 15-55.

As another data point (coming from that area of strings):
If you grep for these plumbing strings in the project,
i.e.
$ git grep "would be overwritten by merge. Cannot merge"
$ git grep "not uptodate. Cannot merge."
...

you only find the occurrence in the  unpack-trees.c lines 15-55,
which means our test suite at least doesn't grep for error messages
but relies on the exit code of plumbing commands(?!)

>
>> To find a good example, "git grep die" giving me some food of though:
>>
>> die_errno(..) should always take a string marked up for translation,
>> because the errno string is translated?
>
> Yes, I would think die_errno() is a no-brainer for translation, since
> the strerror() will be translated.
>
>>     apply.c:                die(_("internal error"));
>>
>> That is funny, too. I think we should substitute that with
>>
>>     die("BUG: untranslated, but what went wrong instead")
>
> Yep. We did not consistently use "BUG:" in the early days. I would say
> that "BUG" lines do not need to be translated. The point is that nobody
> should ever see them, so it seems like there is little point in giving
> extra work to translators.
>
> -Peff

^ permalink raw reply

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
From: Jakub Narębski @ 2017-01-11 14:53 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Junio C Hamano, Git mailing list, Eric Wong, Taylor Blau
In-Reply-To: <9A1064BB-DA72-44DB-A875-39E007708A69@gmail.com>

W dniu 11.01.2017 o 11:20, Lars Schneider pisze: 
> On 10 Jan 2017, at 23:11, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
>>> larsxschneider@gmail.com writes:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>>
>>>> Some `clean` / `smudge` filters might require a significant amount of
>>>> time to process a single blob. During this process the Git checkout
>>>> operation is blocked and Git needs to wait until the filter is done to
>>>> continue with the checkout.
>>
>> Lars, what is expected use case for this feature; that is when do you
>> think this problem may happen?  Is it something that happened IRL?
> 
> Yes, this problem happens every day with filters that perform network
> requests (e.g. GitLFS). 

Do I understand it correctly that the expected performance improvement
thanks to this feature is possible only if there is some amount of
parallelism and concurrency in the filter?  That is, filter can be sending
one blob to Git while processing other one, or filter can be fetching blobs
in parallel.

This means that filter process works as a kind of (de)multiplexer for
fetching and/or processing blob contents, I think.

> [...] In GitLFS we even implemented Git wrapper
> commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988
> The ultimate goal of this patch is to be able to get rid of the wrapper 
> commands.

I'm sorry, I don't see it how the wrapper helps here.

> 
>>>> Teach the filter process protocol (introduced in edcc858) to accept the
>>>> status "delayed" as response to a filter request. Upon this response Git
>>>> continues with the checkout operation and asks the filter to process the
>>>> blob again after all other blobs have been processed.
>>>
>>> Hmm, I would have expected that the basic flow would become
>>>
>>> 	for each paths to be processed:
>>> 		convert-to-worktree to buf
>>> 		if not delayed:
>>> 			do the caller's thing to use buf
>>> 		else:
>>> 			remember path
>>>
>>> 	for each delayed paths:
>>> 		ensure filter process finished processing for path
>>> 		fetch the thing to buf from the process
>>> 		do the caller's thing to use buf
>>
>> I would expect here to have a kind of event loop, namely
>>
>>        while there are delayed paths:
>>                get path that is ready from filter
>>                fetch the thing to buf (supporting "delayed")
>>                if path done
>>                        do the caller's thing to use buf 
>>                        (e.g. finish checkout path, eof convert, etc.)
>>
>> We can either trust filter process to tell us when it finished sending
>> delayed paths, or keep list of paths that are being delayed in Git.
> 
> I could implement "get path that is ready from filter" but wouldn't
> that complicate the filter protocol? I think we can use the protocol pretty 
> much as if with the strategy outlined here:
> http://public-inbox.org/git/F533857D-9B51-44C1-8889-AA0542AD8250@gmail.com/

You are talking about the "busy-loop" solution, isn't it?  In the
same notation, it would look like this:

          while there are delayed paths:
                  for each delayed path:
                          request path from filter [1]
                          fetch the thing (supporting "delayed") [2]
                          if path done
                                  do the caller's thing to use buf
                                  remove it from delayed paths list


Footnotes:
----------
1) We don't send the Git-side contents of blob again, isn't it?
   So we need some protocol extension / new understanding anyway.
   for example that we don't send contents if we request path again.
2) If path is not ready at all, filter protocol would send status=delayed
   with empty contents.  This means that we would immediately go to the
   next path, if there is one.

There are some cases where busy loop is preferable, but I don't think
it is the case here.


The event loop solution would require additional protocol extension,
but I don't think those complicate protocol too much:

A. Git would need to signal filter process that it has sent all paths,
and that it should be sending delayed paths when they are ready.  This
could be done for example with "command=continue".

    packet:          git> command=continue


B. Filter driver, in the event-loop phase, when (de)multiplexing fetching
or processing of data, it would need now to initialize transfer, instead
of waiting for Git to ask.  It could look like this:

    packet:          git< status=resumed [3]
    packet:          git< pathname=file/to/be/resumed [4]
    packet:          git< 0000
    packet:          git< SMUDGED_CONTENT_CONTINUED
    packet:          git< 0000
    packet:          git< 0000  # empty list, means "status=success" [5]

Footnotes:
----------
3.) It could be "status=success", "status=delayed", "command=resumed", etc.
4.) In the future we can add byte at which we resume, size of file, etc.
5.) Of course sending reminder of contents may be further delayed.

-- 
Jakub Narębski

^ permalink raw reply

* Re: [PATCH 1/2] diff: document behavior of relative diff.orderFile
From: Jeff King @ 2017-01-11 14:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Richard Hansen, git
In-Reply-To: <xmqqtw96pno0.fsf@gitster.mtv.corp.google.com>

On Tue, Jan 10, 2017 at 02:15:11PM -0800, Junio C Hamano wrote:

> Richard Hansen <hansenr@google.com> writes:
> 
> >> A related tangent.
> >>
> >> I wonder if anything that uses git_config_pathname() should be
> >> relative to GIT_DIR when it is not absolute.
> >
> > I think so.  (For bare repositories anyway; non-bare should be
> > relative to GIT_WORK_TREE.)  Perhaps git_config_pathname() itself
> > should convert relative paths to absolute so that every pathname
> > setting automatically works without changing any calling code.
> 
> Yes, that was what I was alluding to.  We might have to wait until
> major version boundary to do so, but I think that it is the sensible
> way forward in the longer term to convert relative to absolute in
> git_config_pathname().

Yeah, I'd agree.

I'm undecided on whether it would need to happen at a major version
bump. The existing semantics are fairly insane, and would cause a lot of
confusing breakages. We can imagine use of relative paths in a bare
repository falls into one of a few categories:

  1. The user generally runs by "cd /path/to/bare.git && git ...". This
     would be unaffected, as relative and $GIT_DIR are the same.

  2. The user runs via "cd /path/to/bare.git/some-subdir". This would be
     broken, but I have trouble imagining that they really wanted to
     read something like "objects/orderfile".

  3. The user runs via "GIT_DIR=/path/to/bare.git" from various
     directories. This case is probably horribly broken, as things like
     diff.orderFile will complain if they ever run from a directory that
     doesn't have the order file.

  4. They run GIT_DIR=/path/to/bare.git from a consistent origin
     directory. This _does_ work, and we'd be breaking it. Though I kind
     of question why the config in $GIT_DIR is meant to apply to a file
     in a totally unrelated directory.

     I suppose somebody could be relying on the behavior where setting
     GIT_DIR uses the current directory as the working tree (i.e., if
     core.bare is "true" in bare.git). But then, we'd consider their
     working directory as the working tree and read from that anyway. So
     the behavior would stay the same.

So I dunno. I do hate to break even corner cases, but I'm having trouble
imagining the scenario where somebody is actually using the current
behavior in a useful way.

-Peff

^ permalink raw reply

* [RFC PATCH 3/2] vreportf: add prefix to each line
From: Jeff King @ 2017-01-11 14:07 UTC (permalink / raw)
  To: git
In-Reply-To: <20170111140138.5p647xuqpqrej63b@sigill.intra.peff.net>

On Wed, Jan 11, 2017 at 09:01:38AM -0500, Jeff King wrote:

>   [1/2]: Revert "vreportf: avoid intermediate buffer"
>   [2/2]: vreport: sanitize ASCII control chars

We've talked before about repeating the "error:" header for multi-line
messages. The reversion in patch 1 makes this easy to play with, so I
did. I kind of hate it. But here it is, for your viewing pleasure.

-- >8 --
Subject: vreportf: add prefix to each line

Some error and warning messages are multi-line, but we put
the prefix only on the first line. This means that either
the subsequent lines don't have any indication that they're
connected to the original error, or the caller has to make
multiple calls to error() or warning(). And that's not even
possible for die().

Instead, let's put the prefix at the start of each line,
just as advise() does.

Note that this patch doesn't update the tests yet, so it
causes tons of failures. This is just to see what it might
look like.

It's actually kind of ugly.  For instance, a failing test in
t3600 now produces:

   error: the following files have staged content different from both the
   error: file and the HEAD:
   error:     bar.txt
   error:     foo.txt
   error: (use -f to force removal)

which seems a bit aggressive.  It also screws up messages
which indent with tabs (the prefix eats up some of the
tabstop, making the indentation smaller). And the result is
a little harder if you need to cut-and-paste the file lines
(if your terminal lets you triple-click to select a whole
line, now you have this "error:" cruft on the front).

It may be possible to address some of that by using some
other kind of continuation marker (instead of just repeating
the prefix), and expanding initial tabs.

Signed-off-by: Jeff King <peff@peff.net>
---
 usage.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/usage.c b/usage.c
index ad6d2910f..8a1f6ff4e 100644
--- a/usage.c
+++ b/usage.c
@@ -8,18 +8,30 @@

 static FILE *error_handle;

+static void show_line(FILE *fh, const char *prefix, const char *line, int len)
+{
+	fprintf(fh, "%s%.*s\n", prefix, len, line);
+}
+
 void vreportf(const char *prefix, const char *err, va_list params)
 {
 	char msg[4096];
 	FILE *fh = error_handle ? error_handle : stderr;
+	const char *base;
 	char *p;

 	vsnprintf(msg, sizeof(msg), err, params);
+
+	base = msg;
 	for (p = msg; *p; p++) {
-		if (iscntrl(*p) && *p != '\t' && *p != '\n')
+		if (*p == '\n') {
+			show_line(fh, prefix, base, p - base);
+			base = p + 1;
+		} else if (iscntrl(*p) && *p != '\t')
 			*p = '?';
 	}
-	fprintf(fh, "%s%s\n", prefix, msg);
+
+	show_line(fh, prefix, base, p - base);
 }

 static NORETURN void usage_builtin(const char *err, va_list params)
-- 
2.11.0.627.gfa6151259

^ permalink raw reply related

* [PATCH 2/2] vreport: sanitize ASCII control chars
From: Jeff King @ 2017-01-11 14:02 UTC (permalink / raw)
  To: git
In-Reply-To: <20170111140138.5p647xuqpqrej63b@sigill.intra.peff.net>

Our error() and die() calls may report messages with
arbitrary data (e.g., filenames or even data from a remote
server). Let's make it harder to cause confusion with
mischievous filenames. E.g., try:

  git rev-parse "$(printf "\rfatal: this argument is too sneaky")" --

or

  git rev-parse "$(printf "\x1b[5mblinky\x1b[0m")" --

Let's block all ASCII control characters, with the exception
of TAB and LF. We use both in our own messages (and we are
necessarily sanitizing the complete output of snprintf here,
as we do not have access to the individual varargs). And TAB
and LF are unlikely to cause confusion (you could put
"\nfatal: sneaky\n" in your filename, but it would at least
not _cover up_ the message leading to it, unlike "\r").

We'll replace the characters with a "?", which is similar to
how "ls" behaves. It might be nice to do something less
lossy, like converting them to "\x" hex codes. But replacing
with a single character makes it easy to do in-place and
without worrying about length limitations. This feature
should kick in rarely enough that the "?" marks are almost
never seen.

We'll leave high-bit characters as-is, as they are likely to
be UTF-8 (though there may be some Unicode mischief you
could cause, which may require further patches).

Signed-off-by: Jeff King <peff@peff.net>
---
 usage.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/usage.c b/usage.c
index b1cbe6799..ad6d2910f 100644
--- a/usage.c
+++ b/usage.c
@@ -12,7 +12,13 @@ void vreportf(const char *prefix, const char *err, va_list params)
 {
 	char msg[4096];
 	FILE *fh = error_handle ? error_handle : stderr;
+	char *p;
+
 	vsnprintf(msg, sizeof(msg), err, params);
+	for (p = msg; *p; p++) {
+		if (iscntrl(*p) && *p != '\t' && *p != '\n')
+			*p = '?';
+	}
 	fprintf(fh, "%s%s\n", prefix, msg);
 }

-- 
2.11.0.627.gfa6151259

^ permalink raw reply related

* [PATCH 1/2] Revert "vreportf: avoid intermediate buffer"
From: Jeff King @ 2017-01-11 14:02 UTC (permalink / raw)
  To: git
In-Reply-To: <20170111140138.5p647xuqpqrej63b@sigill.intra.peff.net>

This reverts commit f4c3edc0b156362a92bf9de4f0ec794e90a757fc.

The purpose of that commit was to let us write errors of
arbitrary length to stderr by skipping the intermediate
buffer and sending our varargs straight to fprintf. That
works, but it comes with a downside: we do not get access to
the varargs before they are sent to stderr.

On balance, it's not a good tradeoff. Error messages larger
than our 4K buffer are quite uncommon, and we've lost the
ability to make any modifications to the output (e.g., to
remove non-printable characters).

The only way to have both would be one of:

  1. Write into a dynamic buffer. But this is a bad idea for
     a low-level function that may be called when malloc()
     has failed.

  2. Do our own printf-format varargs parsing. This is too
     complex to be worth the trouble.

Let's just revert that change and go back to a fixed buffer.

Signed-off-by: Jeff King <peff@peff.net>
---
 usage.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/usage.c b/usage.c
index 17f52c1b5..b1cbe6799 100644
--- a/usage.c
+++ b/usage.c
@@ -7,21 +7,13 @@
 #include "cache.h"
 
 static FILE *error_handle;
-static int tweaked_error_buffering;
 
 void vreportf(const char *prefix, const char *err, va_list params)
 {
+	char msg[4096];
 	FILE *fh = error_handle ? error_handle : stderr;
-
-	fflush(fh);
-	if (!tweaked_error_buffering) {
-		setvbuf(fh, NULL, _IOLBF, 0);
-		tweaked_error_buffering = 1;
-	}
-
-	fputs(prefix, fh);
-	vfprintf(fh, err, params);
-	fputc('\n', fh);
+	vsnprintf(msg, sizeof(msg), err, params);
+	fprintf(fh, "%s%s\n", prefix, msg);
 }
 
 static NORETURN void usage_builtin(const char *err, va_list params)
@@ -93,7 +85,6 @@ void set_die_is_recursing_routine(int (*routine)(void))
 void set_error_handle(FILE *fh)
 {
 	error_handle = fh;
-	tweaked_error_buffering = 0;
 }
 
 void NORETURN usagef(const char *err, ...)
-- 
2.11.0.627.gfa6151259


^ permalink raw reply related

* [PATCH 0/2] sanitizing error message contents
From: Jeff King @ 2017-01-11 14:01 UTC (permalink / raw)
  To: git

When adding a warning() call in 50d341374 (http: make redirects more
obvious, 2016-12-06), somebody brought up that evil servers can redirect
you to something like:

  https://evil.example.com/some/repo?unused=\rwarning:+rainbows+and_unicorns_ahead

(where "\r" is a literal CR), and instead of seeing:

  warning: redirecting to https://evil.example.com/...

you just get:

  warning: rainbows and unicorns ahead

or whatever innocuous looking line they prefer (probably just ANSI
"clear to beginning of line" would be even more effective).

Since it's hard to figure out which error messages could potentially
contain malicious contents, and since spewing control characters to the
terminal is generally bad anyway, this series sanitizes at the lowest
level.

Note that this doesn't cover "remote:" lines coming over the sideband.
Those are already covered for "\r", as we have to parse it to handle
printing "remote:" consistently. But you can play tricks like putting:

  printf '\0331K\033[0Efatal: this looks local\n'

into a pre-receive hook. I'm not sure if we would want to do more
sanitizing there. The goal of this series is not so much that a remote
can't send funny strings that may look local, but that they can't
prevent local strings from being displayed. OTOH, I suspect clever use
of ANSI codes (moving the cursor, clearing lines, etc) could get you
pretty far.

I'd be hesitant to disallow control codes entirely, though, as I suspect
some servers do send colors over the sideband. So I punted on that here,
but I think this is at least an incremental improvement.

  [1/2]: Revert "vreportf: avoid intermediate buffer"
  [2/2]: vreport: sanitize ASCII control chars

 usage.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

-Peff

^ permalink raw reply

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
From: Lars Schneider @ 2017-01-11 12:57 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Git Mailing List
In-Reply-To: <CAGZ79kYDPLDU5Dg_CTnpEX+D9bs6BUSSNTHkqpW2nY-b=e9+SQ@mail.gmail.com>


> On 09 Jan 2017, at 21:44, Stefan Beller <sbeller@google.com> wrote:
> 
> On Mon, Nov 14, 2016 at 1:09 PM, Lars Schneider
> <larsxschneider@gmail.com> wrote:
>> Hi,
>> 
>> Git always performs a clean/smudge filter on files in sequential order.
>> Sometimes a filter operation can take a noticeable amount of time.
>> This blocks the entire Git process.
>> 
>> I would like to give a filter process the possibility to answer Git with
>> "I got your request, I am processing it, ask me for the result later!".
>> 
>> I see the following way to realize this:
>> 
>> In unpack-trees.c:check_updates() [1] we loop through the cache
>> entries and "ask me later" could be an acceptable return value of the
>> checkout_entry() call. The loop could run until all entries returned
>> success or error.
> 
> Late to this thread, but here is an answer nevertheless.
> 
> I am currently working on getting submodules working
> for working tree modifying commands (prominently checkout, but
> also read-tree -u and any other caller that uses the code in
> unpack-trees.)
> 
> Once the submodules are supported and used, I anticipate that
> putting the files in the working tree on disk will become a bottle neck,
> i.e. the checkout taking way too long for an oversized project.
> 
> So in the future we have to do something to make checkout fast
> again, which IMHO is threading. My current vision is to have checkout
> automatically choose a number of threads based on expected workload,
> c.f. preload-index.c, line 18-25.

That sounds interesting! We are using "submodule.fetchjobs=0" to process
submodules in parallel already and it works great! Thanks a lot for
implementing this!


>> The filter machinery is triggered in various other places in Git and
>> all places that want to support "ask me later" would need to be patched
>> accordingly.
> 
> I think this makes sense, even in a threaded git-checkout.
> I assume this idea is implemented before threading hits checkout,
> so a question on the design:
> 
> Who determines the workload that is acceptable?
> From reading this email, it seems to be solely the filter that uses
> as many threads/processes as it thinks is ok.

Correct.


> Would it be possible to enhance the protocol further to have
> Git also mingle with the workload, i.e. tell the filter it is
> allowed to use up (N-M) threads, as it itself already uses
> M out of N configured threads?
> 
> (I do not want to discuss the details here, but only if such a thing
> is viable with this approach as well)

Yes, I think we could give a filter these kind of hints. However, it
would, of course, be up to the filter implementation to follow the hints.

In case you curious, here is the discussion on v1 of the delay implementation:
http://public-inbox.org/git/20170108191736.47359-1-larsxschneider@gmail.com/


Cheers,
Lars

^ permalink raw reply

* Re: git cat-file on a submodule
From: Jeff King @ 2017-01-11 12:53 UTC (permalink / raw)
  To: David Turner; +Cc: git
In-Reply-To: <1484093500.17967.6.camel@frank>

On Tue, Jan 10, 2017 at 07:11:40PM -0500, David Turner wrote:

> Why does git cat-file -t $sha:foo, where foo is a submodule, not work?

Because "cat-file" is about inspecting items in the object database, and
typically the submodule commit is not present in the superproject's
database. So we cannot know its type. You can infer what it _should_ be
from the surrounding tree, but you cannot actually do the object lookup.

Likewise, "git cat-file -t $sha1:Makefile" is not just telling you that
we found a 100644 entry in the tree, so we expect a blob. It's resolving
to a sha1, and then checking the type of that sha1 in the database. It
_should_ be a blob, but if it isn't, then cat-file is the tool that
should tell you that it is not.

> git rev-parse $sha:foo works.

Right. Because that command is about resolving a name to a sha1, which
we can do even without the object.

> By "why", I mean "would anyone complain if I fixed it?"  FWIW, I think
> -p should just return the submodule's sha.

I'm not sure if I'm complaining or not. I can't immediately think of
something that would be horribly broken. But it really feels like you
are using the wrong tool, and patching the tool to handle this case will
probably lead to weird cognitive dissonance down the road.

Maybe it would help to describe your use case more fully. If what you
care about is the presumed type based on the surrounding tree, then
maybe:

  git --literal-pathspecs ls-tree $sha -- foo

would be a better match.

-Peff

^ permalink raw reply

* Re: [RFC PATCH 0/5] Localise error headers
From: Jeff King @ 2017-01-11 11:37 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Michael J Gruber, git@vger.kernel.org
In-Reply-To: <CAGZ79kYVc0YQ4okrTHGiYQzPqfiVAm_f7orXdkhwgf5kMPXj-w@mail.gmail.com>

On Tue, Jan 10, 2017 at 10:28:42AM -0800, Stefan Beller wrote:

> > And then presumably that mix would gradually move to 100% consistency as
> > more messages are translated. But the implicit question is: are there
> > die() messages that should never be translated? I'm not sure.
> 
> I would assume any plumbing command is not localizing?
> Because in plumbing land, (easily scriptable) you may find
> a grep on the output/stderr for a certain condition?

That's the assumption I'm challenging. Certainly the behavior and
certain aspects of the output of a plumbing command should remain the
same over time. But error messages to stderr?

It seems like they should be translated, because plumbing invoked on
behalf of porcelain scripts is going to send its stderr directly to the
user.

> To find a good example, "git grep die" giving me some food of though:
> 
> die_errno(..) should always take a string marked up for translation,
> because the errno string is translated?

Yes, I would think die_errno() is a no-brainer for translation, since
the strerror() will be translated.

>     apply.c:                die(_("internal error"));
> 
> That is funny, too. I think we should substitute that with
> 
>     die("BUG: untranslated, but what went wrong instead")

Yep. We did not consistently use "BUG:" in the early days. I would say
that "BUG" lines do not need to be translated. The point is that nobody
should ever see them, so it seems like there is little point in giving
extra work to translators.

-Peff

^ permalink raw reply

* [PATCH] t7810: avoid assumption about invalid regex syntax
From: Jeff King @ 2017-01-11 11:10 UTC (permalink / raw)
  To: A. Wilcox; +Cc: git, Andreas Schwab
In-Reply-To: <20170111100400.vhd5ytarqpujigbn@sigill.intra.peff.net>

A few of the tests want to check that "git grep -P -E" will
override -P with -E, and vice versa. To do so, we use a
regex with "\x{..}", which is valid in PCRE but not defined
by POSIX (for basic or extended regular expressions).

However, POSIX declares quite a lot of syntax, including
"\x", as "undefined". That leaves implementations free to
extend the standard if they choose. At least one, musl libc,
implements "\x" in the same way as PCRE.  Our tests check
that "-E" complains about "\x", which fails with musl.

We can fix this by finding some construct which behaves
reliably on both PCRE and POSIX, but differently in each
system.

One such construct is the use of backslash inside brackets.
In PCRE, "[\d]" interprets "\d" as it would outside the
brackets, matching a digit. Whereas in POSIX, the backslash
must be treated literally, and we match either it or a
literal "d".  Moreover, implementations are not free to
change this according to POSIX, so we should be able to rely
on it.

Signed-off-by: Jeff King <peff@peff.net>
---
I've tested this with glibc, but I wasn't able to do so with musl. The
two complications are:

  1. Recent versions of git won't build with musl's regex at all,
     because it doesn't support the non-standard REG_STARTEND that we
     rely on since b7d36ffca (regex: use regexec_buf(), 2016-09-21).

     So if applied on an older git, this patch should help, but newer
     versions need NO_REGEX (to use the fallback glibc regex code)
     either way, which would also make the problem go away.

     Still, I think it's the right thing to do, since we are relying on
     something that POSIX clearly leaves up to the implementation. It
     may also help on other systems, or if musl ends up supporting
     REG_STARTEND in the future.

  2. I tried to cherry-pick to v2.7.x and test it with musl. Debian
     ships with a "musl-gcc" wrapper, but it doesn't work out of the
     box. Both zlib and pcre are compiled against glibc, so I'd have to
     rebuild those, too. At which point I gave up and decided to just
     let you test it on your musl-based system. :)

 t/t7810-grep.sh | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index de2405ccb..19f0108f8 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -39,6 +39,10 @@ test_expect_success setup '
 		echo "a+bc"
 		echo "abc"
 	} >ab &&
+	{
+		echo d &&
+		echo 0
+	} >d0 &&
 	echo vvv >v &&
 	echo ww w >w &&
 	echo x x xx x >x &&
@@ -1105,36 +1109,36 @@ test_expect_success 'grep pattern with grep.patternType=fixed, =basic, =extended
 '
 
 test_expect_success 'grep -G -F -P -E pattern' '
-	>empty &&
-	test_must_fail git grep -G -F -P -E "a\x{2b}b\x{2a}c" ab >actual &&
-	test_cmp empty actual
+	echo "d0:d" >expected &&
+	git grep -G -F -P -E "[\d]" d0 >actual &&
+	test_cmp expected actual
 '
 
 test_expect_success 'grep pattern with grep.patternType=fixed, =basic, =perl, =extended' '
-	>empty &&
-	test_must_fail git \
+	echo "d0:d" >expected &&
+	git \
 		-c grep.patterntype=fixed \
 		-c grep.patterntype=basic \
 		-c grep.patterntype=perl \
 		-c grep.patterntype=extended \
-		grep "a\x{2b}b\x{2a}c" ab >actual &&
-	test_cmp empty actual
+		grep "[\d]" d0 >actual &&
+	test_cmp expected actual
 '
 
 test_expect_success LIBPCRE 'grep -G -F -E -P pattern' '
-	echo "ab:a+b*c" >expected &&
-	git grep -G -F -E -P "a\x{2b}b\x{2a}c" ab >actual &&
+	echo "d0:0" >expected &&
+	git grep -G -F -E -P "[\d]" d0 >actual &&
 	test_cmp expected actual
 '
 
 test_expect_success LIBPCRE 'grep pattern with grep.patternType=fixed, =basic, =extended, =perl' '
-	echo "ab:a+b*c" >expected &&
+	echo "d0:0" >expected &&
 	git \
 		-c grep.patterntype=fixed \
 		-c grep.patterntype=basic \
 		-c grep.patterntype=extended \
 		-c grep.patterntype=perl \
-		grep "a\x{2b}b\x{2a}c" ab >actual &&
+		grep "[\d]" d0 >actual &&
 	test_cmp expected actual
 '
 
-- 
2.11.0.627.gfa6151259

^ permalink raw reply related

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
From: Lars Schneider @ 2017-01-11 10:20 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, Git mailing list, Eric Wong, Taylor Blau
In-Reply-To: <ec8078ef-8ff2-d26f-ef73-5ef612737eee@gmail.com>


> On 10 Jan 2017, at 23:11, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
>> larsxschneider@gmail.com writes:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>> 
>>> Some `clean` / `smudge` filters might require a significant amount of
>>> time to process a single blob. During this process the Git checkout
>>> operation is blocked and Git needs to wait until the filter is done to
>>> continue with the checkout.
> 
> Lars, what is expected use case for this feature; that is when do you
> think this problem may happen?  Is it something that happened IRL?

Yes, this problem happens every day with filters that perform network
requests (e.g. GitLFS). In GitLFS we even implemented Git wrapper
commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988
The ultimate goal of this patch is to be able to get rid of the wrapper 
commands.


>>> Teach the filter process protocol (introduced in edcc858) to accept the
>>> status "delayed" as response to a filter request. Upon this response Git
>>> continues with the checkout operation and asks the filter to process the
>>> blob again after all other blobs have been processed.
>> 
>> Hmm, I would have expected that the basic flow would become
>> 
>> 	for each paths to be processed:
>> 		convert-to-worktree to buf
>> 		if not delayed:
>> 			do the caller's thing to use buf
>> 		else:
>> 			remember path
>> 
>> 	for each delayed paths:
>> 		ensure filter process finished processing for path
>> 		fetch the thing to buf from the process
>> 		do the caller's thing to use buf
> 
> I would expect here to have a kind of event loop, namely
> 
>        while there are delayed paths:
>                get path that is ready from filter
>                fetch the thing to buf (supporting "delayed")
>                if path done
>                        do the caller's thing to use buf 
>                        (e.g. finish checkout path, eof convert, etc.)
> 
> We can either trust filter process to tell us when it finished sending
> delayed paths, or keep list of paths that are being delayed in Git.

I could implement "get path that is ready from filter" but wouldn't
that complicate the filter protocol? I think we can use the protocol pretty 
much as if with the strategy outlined here:
http://public-inbox.org/git/F533857D-9B51-44C1-8889-AA0542AD8250@gmail.com/


Thanks,
Lars

^ permalink raw reply

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
From: Lars Schneider @ 2017-01-11 10:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, e, jnareb
In-Reply-To: <20170109233816.GA70151@Ida>


> On 10 Jan 2017, at 00:38, Taylor Blau <ttaylorr@github.com> wrote:
> 
> I've been considering some alternative approaches in order to make the
> communication between Git and any extension that implements this protocol more
> intuitive.
> 
> In particular, I'm considering alternatives to:
> 
>> 	for each delayed paths:
>> 		ensure filter process finished processing for path
>> 		fetch the thing to buf from the process
>> 		do the caller's thing to use buf
> 
> As I understand it, the above sequence of steps would force Git to either:
> 
> a) loop over all delayed paths and ask the filter if it's done processing,
>   creating a busy-loop between the filter and Git, or...
> b) loop over all delayed paths sequentially, checking out each path in sequence
> 
> I would like to avoid both of those situations, and instead opt for an
> asynchronous approach. In (a), the protocol is far too chatty. In (b), the
> protocol is much less chatty, but forces the checkout to be the very last step,
> which has negative performance implications on checkouts with many large files.
> 
> For instance, checking out several multi-gigabyte files one after the other
> means that a significant amount of time is lost while the filter has some of the
> items ready. Instead of checking them out as they become available, Git waits
> until the very end when they are all available.
> 
> I think it would be preferable for the protocol to specify a sort of "done"
> signal against each path such that Git could check out delayed paths as they
> become available. If implemented this way, Git could checkout files
> asynchronously, while the filter continues to do work on the other end.

In v1 I implemented a) with the busy-loop problem in mind. 

My thinking was this:

If the filter sees at least one filter request twice then the filter knows that
Git has already requested all files that require filtering. At that point the
filter could just block the "delayed" answer to the latest filter request until
at least one of the previously delayed requests can be fulfilled. Then the filter
answers "delay" to Git until Git requests the blob that can be fulfilled. This
process cycles until all requests can be fulfilled. Wouldn't that work?

I think a "done" message by the filter is not easy. Right now the protocol works 
in a mode were Git always asks and the filter always answers. I believe changing
the filter to be able to initiate a "done" message would complicated the protocol.


> Additionally, the protocol should specify a sentinel "no more entries" value
> that could be sent from Git to the filter to signal that there are no more files
> to checkout. Some filters may implement mechanisms for converting files that
> require a signal to know when all files have been sent. Specifically, Git LFS
> (https://git-lfs.github.com) batches files to be transferred together, and needs
> to know when all files have been announced to truncate and send the last batch,
> if it is not yet full. I'm sure other filter implementations use a similar
> mechanism and would benefit from this as well.

I agree. I think the filter already has this info implicitly as explained above
but an explicit message would be better!

Thanks,
Lars

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox