git smudge filter fails

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* git smudge filter fails
@ 2016-03-09 18:29 Stephen Morton
  2016-03-10  1:59 ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Morton @ 2016-03-09 18:29 UTC (permalink / raw)
  To: git@vger.kernel.org

A git smudge filter, at least one that relies on the results from 'git
log' does not seem to work
on file A when doing a 'git update' from a revision where file A
doesn't exist to a revision where
it does exist.

Below is a simple recipe to reproduce.

This appears to me to be a bug. If not, why is it expected and is
there anything I can do to
work around this behaviour?

Steve

mkdir git_test
cd git_test/
git init .
touch bar.c
git add .
git commit -am "Initial commit. foo.c not here yet."
git tag no_foo

touch foo.c
git add .
git commit -am "Add foo, no content"
echo 'Date is $Date$' >> foo.c
git commit -am "Add date to foo.c"
echo 'foo.c filter=dater' > .git/info/attributes
git config --local filter.dater.smudge 'myDate=`git log
--pretty=format:"%cd" --date=iso -1 -- %f`; sed -e
"s/\(\\$\)Date[^\\$]*\\$/\1Date: $myDate \\$/g"'
git config --local filter.dater.clean 'sed -e
"s/\(\\$\)Date[^\\$]*\\$/\1Date\\$/g"'
rm -f foo.c
git checkout -- foo.c
cat foo.c
# observe keyword expansion

git checkout no_foo
git checkout master
cat foo.c
#observe keyword expansion lost

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git smudge filter fails
  2016-03-09 18:29 git smudge filter fails Stephen Morton
@ 2016-03-10  1:59 ` Jeff King
  2016-03-10 14:45   ` Stephen Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2016-03-10  1:59 UTC (permalink / raw)
  To: Stephen Morton; +Cc: git@vger.kernel.org

On Wed, Mar 09, 2016 at 01:29:31PM -0500, Stephen Morton wrote:

> git config --local filter.dater.smudge 'myDate=`git log
> --pretty=format:"%cd" --date=iso -1 -- %f`; sed -e
> "s/\(\\$\)Date[^\\$]*\\$/\1Date: $myDate \\$/g"'

Your filter is running "git log" without a revision parameter, which
means it is looking at HEAD.

And here....

> git checkout no_foo
> git checkout master
> cat foo.c
> #observe keyword expansion lost

You are expecting this second one to do:

  1. Switch HEAD to "master".

  2. Checkout files which need updating. Looking at HEAD in your filter
     then examines "master", and you see the commit timestamp of the
     destination.

But that isn't how it is implemented. Checkout will handle the file
checkout _first_, as that is the part that is likely to run into
problems (e.g., rejecting a switch because it would lose changes in the
working tree). Only at the very end, after the change to the working
tree has succeeded, do we update HEAD.

I think the order you are expecting is conceptually cleaner, but I don't
think we would want to switch it around (for reasons of efficiency and
robustness). And I don't think we would want to make a promise about the
ordering to callers either way, as it binds our implementation.

So is there a way to reliably know the destination of a checkout? My
first thought was that we could add a placeholder similar to "%f" that
your filter could use. I'm not sure how we would handle the corner cases
there, though (e.g., do we always have a "destination" to report? If
not, what do we give the script?).

I suspect you could also hack something together with a post-checkout
script, though it would probably be a lot less efficient (and might also
have some weird corner cases).

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git smudge filter fails
  2016-03-10  1:59 ` Jeff King
@ 2016-03-10 14:45   ` Stephen Morton
  2016-03-10 21:05     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Morton @ 2016-03-10 14:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git@vger.kernel.org

I am a bit confused because this is basically the example used in
ProGit [1] and it is fundamentally broken. In fact, if I understand
correctly, this means that smudge filters cannot be relied upon to
provide any 'keyword expansion' type tasks because they will all by
nature have to query the file with 'git log'.


(Note that although in my example I used 'git checkout', with an only
slightly more complicated example I can make it fail on 'git pull'
which is perhaps a much more realistic use case. That was probably
implied in your answer, I just wanted to mention it.)

Steve

[1] https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes



On Wed, Mar 9, 2016 at 8:59 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Mar 09, 2016 at 01:29:31PM -0500, Stephen Morton wrote:
>
>> git config --local filter.dater.smudge 'myDate=`git log
>> --pretty=format:"%cd" --date=iso -1 -- %f`; sed -e
>> "s/\(\\$\)Date[^\\$]*\\$/\1Date: $myDate \\$/g"'
>
> Your filter is running "git log" without a revision parameter, which
> means it is looking at HEAD.
>
> And here....
>
>> git checkout no_foo
>> git checkout master
>> cat foo.c
>> #observe keyword expansion lost
>
> You are expecting this second one to do:
>
>   1. Switch HEAD to "master".
>
>   2. Checkout files which need updating. Looking at HEAD in your filter
>      then examines "master", and you see the commit timestamp of the
>      destination.
>
> But that isn't how it is implemented. Checkout will handle the file
> checkout _first_, as that is the part that is likely to run into
> problems (e.g., rejecting a switch because it would lose changes in the
> working tree). Only at the very end, after the change to the working
> tree has succeeded, do we update HEAD.
>
> I think the order you are expecting is conceptually cleaner, but I don't
> think we would want to switch it around (for reasons of efficiency and
> robustness). And I don't think we would want to make a promise about the
> ordering to callers either way, as it binds our implementation.
>
> So is there a way to reliably know the destination of a checkout? My
> first thought was that we could add a placeholder similar to "%f" that
> your filter could use. I'm not sure how we would handle the corner cases
> there, though (e.g., do we always have a "destination" to report? If
> not, what do we give the script?).
>
> I suspect you could also hack something together with a post-checkout
> script, though it would probably be a lot less efficient (and might also
> have some weird corner cases).
>
> -Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git smudge filter fails
  2016-03-10 14:45   ` Stephen Morton
@ 2016-03-10 21:05     ` Jeff King
  2016-03-10 22:04       ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2016-03-10 21:05 UTC (permalink / raw)
  To: Stephen Morton; +Cc: git@vger.kernel.org

On Thu, Mar 10, 2016 at 09:45:19AM -0500, Stephen Morton wrote:

> I am a bit confused because this is basically the example used in
> ProGit [1] and it is fundamentally broken. In fact, if I understand
> correctly, this means that smudge filters cannot be relied upon to
> provide any 'keyword expansion' type tasks because they will all by
> nature have to query the file with 'git log'.

Interesting. Perhaps I am missing something (I am far from an expert in
clean/smudge filters, which I do not generally use myself), but the
example in ProGit looks kind of bogus to me. I don't think it ever would
have worked reliably, under any version of git.

> (Note that although in my example I used 'git checkout', with an only
> slightly more complicated example I can make it fail on 'git pull'
> which is perhaps a much more realistic use case. That was probably
> implied in your answer, I just wanted to mention it.)

Yeah, I think the issue is basically the same for several commands which
update the worktree and the HEAD. Most of them are going to do the
worktree first.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git smudge filter fails
  2016-03-10 21:05     ` Jeff King
@ 2016-03-10 22:04       ` Junio C Hamano
  2016-03-15 16:17         ` Stephen Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-03-10 22:04 UTC (permalink / raw)
  To: Jeff King; +Cc: Stephen Morton, git@vger.kernel.org

Jeff King <peff@peff.net> writes:

> On Thu, Mar 10, 2016 at 09:45:19AM -0500, Stephen Morton wrote:
>
>> I am a bit confused because this is basically the example used in
>> ProGit [1] and it is fundamentally broken. In fact, if I understand
>> correctly, this means that smudge filters cannot be relied upon to
>> provide any 'keyword expansion' type tasks because they will all by
>> nature have to query the file with 'git log'.
>
> Interesting. Perhaps I am missing something (I am far from an expert in
> clean/smudge filters, which I do not generally use myself), but the
> example in ProGit looks kind of bogus to me. I don't think it ever would
> have worked reliably, under any version of git.
>
>> (Note that although in my example I used 'git checkout', with an only
>> slightly more complicated example I can make it fail on 'git pull'
>> which is perhaps a much more realistic use case. That was probably
>> implied in your answer, I just wanted to mention it.)
>
> Yeah, I think the issue is basically the same for several commands which
> update the worktree and the HEAD. Most of them are going to do the
> worktree first.

You can have a pair of branches A and B that have forked long time
ago, and have a path F that has been changed identically since they
forked, perhaps by cherry-picking the same change.  This happens all
the time.

If there were some mechanism that modifies the checked out version
of F with information that depends on the history that leads to A
(e.g. "which commit that is an ancestor of A last modified F?")
when you check out branch A, it will become invalid when you do "git
checkout B", because the path F will not change because they are the
same between the branches.  In short, CVS $Id$-style substitutions
that depend on the history fundamentally does not work, unless you
are willing to always rewrite the whole working tree every time you
switch branches.

The smudge and clean filters are given _only_ the blob contents and
nothing else, not "which commit (or tree) the blob is taken from",
not "which path this blob sits in that tree-ish", not "what branch
am I on" and this is a very much deliberate design decision made in
order to avoid leading people to a misguided attempt to mimick CVS
$Id$-style substitutions.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git smudge filter fails
  2016-03-10 22:04       ` Junio C Hamano
@ 2016-03-15 16:17         ` Stephen Morton
  2016-03-15 16:48           ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Morton @ 2016-03-15 16:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git@vger.kernel.org

On Thu, Mar 10, 2016 at 5:04 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> On Thu, Mar 10, 2016 at 09:45:19AM -0500, Stephen Morton wrote:
>>
>>> I am a bit confused because this is basically the example used in
>>> ProGit [1] and it is fundamentally broken. In fact, if I understand
>>> correctly, this means that smudge filters cannot be relied upon to
>>> provide any 'keyword expansion' type tasks because they will all by
>>> nature have to query the file with 'git log'.
>>
>> Interesting. Perhaps I am missing something (I am far from an expert in
>> clean/smudge filters, which I do not generally use myself), but the
>> example in ProGit looks kind of bogus to me. I don't think it ever would
>> have worked reliably, under any version of git.
>>
>>> (Note that although in my example I used 'git checkout', with an only
>>> slightly more complicated example I can make it fail on 'git pull'
>>> which is perhaps a much more realistic use case. That was probably
>>> implied in your answer, I just wanted to mention it.)
>>
>> Yeah, I think the issue is basically the same for several commands which
>> update the worktree and the HEAD. Most of them are going to do the
>> worktree first.
>
> You can have a pair of branches A and B that have forked long time
> ago, and have a path F that has been changed identically since they
> forked, perhaps by cherry-picking the same change.  This happens all
> the time.
>
> If there were some mechanism that modifies the checked out version
> of F with information that depends on the history that leads to A
> (e.g. "which commit that is an ancestor of A last modified F?")
> when you check out branch A, it will become invalid when you do "git
> checkout B", because the path F will not change because they are the
> same between the branches.  In short, CVS $Id$-style substitutions
> that depend on the history fundamentally does not work, unless you
> are willing to always rewrite the whole working tree every time you
> switch branches.
>
> The smudge and clean filters are given _only_ the blob contents and
> nothing else, not "which commit (or tree) the blob is taken from",
> not "which path this blob sits in that tree-ish", not "what branch
> am I on" and this is a very much deliberate design decision made in
> order to avoid leading people to a misguided attempt to mimick CVS
> $Id$-style substitutions.
>


I will raise an Issue with ProGit.

It's perhaps beyond the scope of my original question, but for
situations where I need a "last change date" embedded in a file (e.g.
because a protocol standard requires it), is there any recommended way
to do so? We've the hard way that hardcoding makes
merging/cherry-picking a bit of a nightmare and should be avoided. Is
a post-checkout hook the way to go? I've actually found the smudge
filter to be very slow for this application as each file is processed
in series; a post-commit hook that could operate on files in parallel
would likely be substantially faster.

Stephen

(Sorry about the earlier top-posting. I didn't realize what gmail was
doing until after it had happened.)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git smudge filter fails
  2016-03-15 16:17         ` Stephen Morton
@ 2016-03-15 16:48           ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2016-03-15 16:48 UTC (permalink / raw)
  To: Stephen Morton; +Cc: Jeff King, git@vger.kernel.org

Stephen Morton <stephen.c.morton@gmail.com> writes:

> It's perhaps beyond the scope of my original question, but for
> situations where I need a "last change date" embedded in a file (e.g.
> because a protocol standard requires it), is there any recommended way
> to do so? We've the hard way that hardcoding makes
> merging/cherry-picking a bit of a nightmare and should be avoided.

Does that "last change date" have to be embedded in a file with
other stuff in there, or can it be a standalone file by itself
(which may be used by other things via linking or inclusion)?

If it can be a standalone file, a custom ll-merge driver that knows
how yoru datestring looks like and takes the later of the versions
in the two branches being merged would not be too hard to write to
eliminate the "nightmare", I would think.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-03-15 16:48 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-09 18:29 git smudge filter fails Stephen Morton
2016-03-10  1:59 ` Jeff King
2016-03-10 14:45   ` Stephen Morton
2016-03-10 21:05     ` Jeff King
2016-03-10 22:04       ` Junio C Hamano
2016-03-15 16:17         ` Stephen Morton
2016-03-15 16:48           ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).