Determining if a merge was produced automatically

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Determining if a merge was produced automatically
@ 2024-06-30 18:06 Pavel Rappo
       [not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com>
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Pavel Rappo @ 2024-06-30 18:06 UTC (permalink / raw)
  To: Git mailing list

Hello,

I'm looking for a robust way to determine if a given merge commit
could've been produced automatically by `git merge`, without any
manual intervention or tampering, such as:

  - resolving conflicts,
  - stopping (`--no-commit`) and modifying,
  - amending the commit.

My initial idea was to re-enact the merge. If the merge failed, I
would conclude that the original merge couldn't have been produced
automatically. If the merge succeeded, I would compare it with the
original merge. Any differences would indicate that the original merge
couldn't have been produced automatically. Otherwise, I would conclude
that it could've been. This approach is simple, but involves multiple
steps and requires clean-up.

My second idea was to use `git show --diff-merges=dense-combined`,
which only prints hunks that come from neither parent. If nothing is
printed, I would conclude that the merge could've been produced
automatically. This approach is simple, single-step, but seems to have
an issue. In my experiments, I found that if some hunks from different
parents were located closely enough, output was produced. So, checking
if nothing is output could lead to false negatives: a merge that
could've been produced automatically might look like it was tampered
with.

My third idea was to use a recently added feature, `git show
--remerge-diff`, which seemingly embodies my first idea and is immune
to the issue of the second. It is also single-step and requires no
clean-up:

> Remerge two-parent merge commits to create a temporary tree object—potentially containing files with conflict markers and such. A diff is then shown between that temporary tree and the actual merge commit.

However, this bit means that I shouldn't entirely trust its output:

> The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented).

What is my best course of action?

Thanks,
-Pavel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
       [not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com>
@ 2024-07-01  0:45   ` Martin von Zweigbergk
  2024-07-01 15:11     ` Elijah Newren
  0 siblings, 1 reply; 11+ messages in thread
From: Martin von Zweigbergk @ 2024-07-01  0:45 UTC (permalink / raw)
  To: Pavel Rappo; +Cc: Git mailing list

Forwarding to the list without HTML so others can correct me if I was wrong.

On Sun, Jun 30, 2024 at 3:32 PM Martin von Zweigbergk
<martinvonz@gmail.com> wrote:
>
>
>
> On Sun, Jun 30, 2024, 11:06 Pavel Rappo <pavel.rappo@gmail.com> wrote:
>>
>> Hello,
>>
>> I'm looking for a robust way to determine if a given merge commit
>> could've been produced automatically by `git merge`, without any
>> manual intervention or tampering, such as:
>>
>>   - resolving conflicts,
>>   - stopping (`--no-commit`) and modifying,
>>   - amending the commit.
>>
>> My initial idea was to re-enact the merge. If the merge failed, I
>> would conclude that the original merge couldn't have been produced
>> automatically. If the merge succeeded, I would compare it with the
>> original merge. Any differences would indicate that the original merge
>> couldn't have been produced automatically. Otherwise, I would conclude
>> that it could've been. This approach is simple, but involves multiple
>> steps and requires clean-up.
>>
>> My second idea was to use `git show --diff-merges=dense-combined`,
>> which only prints hunks that come from neither parent. If nothing is
>> printed, I would conclude that the merge could've been produced
>> automatically. This approach is simple, single-step, but seems to have
>> an issue. In my experiments, I found that if some hunks from different
>> parents were located closely enough, output was produced. So, checking
>> if nothing is output could lead to false negatives: a merge that
>> could've been produced automatically might look like it was tampered
>> with.
>>
>> My third idea was to use a recently added feature, `git show
>> --remerge-diff`, which seemingly embodies my first idea and is immune
>> to the issue of the second. It is also single-step and requires no
>> clean-up:
>>
>> > Remerge two-parent merge commits to create a temporary tree object—potentially containing files with conflict markers and such. A diff is then shown between that temporary tree and the actual merge commit.
>>
>> However, this bit means that I shouldn't entirely trust its output:
>>
>> > The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented).
>
>
> There's basically only one way to display an empty diff, so I suspect that checking that the diff is empty is still going to be enough for your purposes.
>
> Note that you can specify e.g. the rename detection threshold to use while merging, and the person doing the merge might have used a different threshold than you're using when you're trying to check if they added other changes. There are also different merge strategies and diff algorithms to choose. That means that you might get false positives and false negatives. Maybe that's still good enough for you.
>
>>
>> What is my best course of action?
>>
>> Thanks,
>> -Pavel
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-06-30 18:06 Determining if a merge was produced automatically Pavel Rappo
       [not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com>
@ 2024-07-01 11:49 ` Jonathan Nieder
  2024-07-01 13:13   ` Pavel Rappo
  2024-07-01 15:43   ` Junio C Hamano
  2024-07-01 15:39 ` Junio C Hamano
  2 siblings, 2 replies; 11+ messages in thread
From: Jonathan Nieder @ 2024-07-01 11:49 UTC (permalink / raw)
  To: Pavel Rappo; +Cc: Git mailing list

Hi Pavel,

Pavel Rappo wrote:

> However, this bit means that I shouldn't entirely trust its output:
>
>> The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented).
>
> What is my best course of action?

I'd encourage proposing a patch to make the documentation say what you
wish it said.  What guarantees would you like it to make?  That will
help with others in the project being able to decide what it should
guarantee and what it shouldn't, and regardless of the outcome of
discussion it would make the documentation more helpful for the next
person.

See Documentation/MyFirstContribution for some more details on how
proposing a patch works.

Thanks and hope that helps,
Jonathan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-07-01 11:49 ` Jonathan Nieder
@ 2024-07-01 13:13   ` Pavel Rappo
  2024-07-01 15:43   ` Junio C Hamano
  1 sibling, 0 replies; 11+ messages in thread
From: Pavel Rappo @ 2024-07-01 13:13 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Git mailing list

I think the documentation is quite clear. It's just that it makes that
feature unhelpful for my use case, as output from the following
command is subject to change:

    git show --remerge-diff --pretty=format:%b <merge-commit>

Now, if that's untrue, and the output from the above command cannot
change, then yes, the documentation should be improved.

-Pavel

On Mon, Jul 1, 2024 at 12:49 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Hi Pavel,
>
> Pavel Rappo wrote:
>
> > However, this bit means that I shouldn't entirely trust its output:
> >
> >> The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented).
> >
> > What is my best course of action?
>
> I'd encourage proposing a patch to make the documentation say what you
> wish it said.  What guarantees would you like it to make?  That will
> help with others in the project being able to decide what it should
> guarantee and what it shouldn't, and regardless of the outcome of
> discussion it would make the documentation more helpful for the next
> person.
>
> See Documentation/MyFirstContribution for some more details on how
> proposing a patch works.
>
> Thanks and hope that helps,
> Jonathan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-07-01  0:45   ` Martin von Zweigbergk
@ 2024-07-01 15:11     ` Elijah Newren
  2024-07-01 16:43       ` Pavel Rappo
  0 siblings, 1 reply; 11+ messages in thread
From: Elijah Newren @ 2024-07-01 15:11 UTC (permalink / raw)
  To: Martin von Zweigbergk; +Cc: Pavel Rappo, Git mailing list

Hi,

On Sun, Jun 30, 2024 at 5:45 PM Martin von Zweigbergk
<martinvonz@gmail.com> wrote:
>
> Forwarding to the list without HTML so others can correct me if I was wrong.
>
> On Sun, Jun 30, 2024 at 3:32 PM Martin von Zweigbergk
> <martinvonz@gmail.com> wrote:
> >
> >
> >
> > On Sun, Jun 30, 2024, 11:06 Pavel Rappo <pavel.rappo@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I'm looking for a robust way to determine if a given merge commit
> >> could've been produced automatically by `git merge`, without any
> >> manual intervention or tampering, such as:
> >>
> >>   - resolving conflicts,
> >>   - stopping (`--no-commit`) and modifying,
> >>   - amending the commit.
> >>
> >> My initial idea was to re-enact the merge. If the merge failed, I
> >> would conclude that the original merge couldn't have been produced
> >> automatically. If the merge succeeded, I would compare it with the
> >> original merge. Any differences would indicate that the original merge
> >> couldn't have been produced automatically. Otherwise, I would conclude
> >> that it could've been. This approach is simple, but involves multiple
> >> steps and requires clean-up.

Further, your strategy has some blind spots.  What if the original
person creating the merge used special flags, such as changing the
rename threshold, ignoring space changes, or a different underlying
diff algorithm?  It may be that the merge was clean for the original
merger, but if you don't use the same options it doesn't look clean to
you -- or vice versa.  (In short, this method has both false positives
and false negatives.)

(The odds that someone used specialized options and then had the merge
succeed, when it wouldn't have otherwise, is pretty low.  So your
strategy is probably good enough, but it's good to be aware of other
possibilities.)

> >> My second idea was to use `git show --diff-merges=dense-combined`,
> >> which only prints hunks that come from neither parent. If nothing is
> >> printed, I would conclude that the merge could've been produced
> >> automatically. This approach is simple, single-step, but seems to have
> >> an issue. In my experiments, I found that if some hunks from different
> >> parents were located closely enough, output was produced. So, checking
> >> if nothing is output could lead to false negatives: a merge that
> >> could've been produced automatically might look like it was tampered
> >> with.
> >>
> >> My third idea was to use a recently added feature, `git show
> >> --remerge-diff`, which seemingly embodies my first idea and is immune
> >> to the issue of the second. It is also single-step and requires no
> >> clean-up:
> >>
> >> > Remerge two-parent merge commits to create a temporary tree object—potentially containing files with conflict markers and such. A diff is then shown between that temporary tree and the actual merge commit.

Yes, this is exactly your strategy, except that instead of checking
whether it "completes" you are checking whether the output is empty,
and it avoids messing up the working tree and index and thus also
avoids the need for clean-up.

> >> However, this bit means that I shouldn't entirely trust its output:
> >>
> >> > The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented).

--remerge-diff output uses diff headers a bit inventively.  And, being
a somewhat new option, I didn't want a repeat of issues like we had
with --cc (where the output format is documented in detail and was for
a few years before we realized that showing diff headers for exactly
two files when there are at least three files is kinda dumb when
renames or mode changes are involved).  So, we needed the flexibility
to change the output in the future.

However, this wording was not intended to detract from the main point
that "empty output means clean merge and non-empty output means
conflicted merge" (it never even occurred to me that someone might
read that part of the documentation and assume that it presents a
problem for checking-if-diff-is-empty).  I use it for the same
purpose, and that's absolutely a guarantee we want to provide.  If you
want to guarantee output format beyond that, though, then I object.

Anyway, in summary...

> > There's basically only one way to display an empty diff, so I suspect that checking that the diff is empty is still going to be enough for your purposes.
> >
> > Note that you can specify e.g. the rename detection threshold to use while merging, and the person doing the merge might have used a different threshold than you're using when you're trying to check if they added other changes. There are also different merge strategies and diff algorithms to choose. That means that you might get false positives and false negatives. Maybe that's still good enough for you.

Yes, Martin has it exactly right.  And it has possible false positives
and false negatives precisely because your original strategy did too.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-06-30 18:06 Determining if a merge was produced automatically Pavel Rappo
       [not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com>
  2024-07-01 11:49 ` Jonathan Nieder
@ 2024-07-01 15:39 ` Junio C Hamano
  2024-07-01 16:08   ` Pavel Rappo
  2 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2024-07-01 15:39 UTC (permalink / raw)
  To: Pavel Rappo; +Cc: Git mailing list

Pavel Rappo <pavel.rappo@gmail.com> writes:

> I'm looking for a robust way to determine if a given merge commit
> could've been produced automatically by `git merge`, without any
> manual intervention or tampering, such as:
>
>   - resolving conflicts,
>   - stopping (`--no-commit`) and modifying,
>   - amending the commit.

This fundamentally cannot be done perfectly and it is by design.

The automated merge algorithms, backends, and strategies will
improve over time, so a merge attempted 7 years ago with the
then-current version of Git may have conflicted and required manual
resolution, but a newer Git you use to "look for manual intervention
or tampering" may have improved enough to resolve the same merge
cleanly and automatically.  Even if the person who created the merge
originally and you are using with the same version of Git, the former
may be using a custom low-level merge driver to resolve conflicts in
certain types of files better than the barebones textual merge driver,
while you do not have access to the same driver.

So you'd need to tighten the definition a bit more, at bit more like
constraints like "using the same version of Git" and "without any
low-level merge driver customization".

Now, with problem narrowed down a bit with tightened constraints.

> My initial idea was to re-enact the merge. If the merge failed, ...
> My second idea was to use `git show --diff-merges=dense-combined` ...
> My third idea was to use a recently added feature, `git show --remerge-diff`...

The first and the third are equivalent (the latter is an automation
of what the former would do).  The --cc output, as you said, is
about showing resolution that is different from any of the parents,
and serves the need quite different from what you are trying to do
(e.g., if the merge was created with "git merge -s ours", there is
nothing _interesting_ in the result from the "--cc"'s point of view),
but it is likely that you are interested in noticing such an unusual
"cauterizing the history" merge.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-07-01 11:49 ` Jonathan Nieder
  2024-07-01 13:13   ` Pavel Rappo
@ 2024-07-01 15:43   ` Junio C Hamano
  1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2024-07-01 15:43 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Pavel Rappo, Git mailing list

Jonathan Nieder <jrnieder@gmail.com> writes:

>> However, this bit means that I shouldn't entirely trust its output:
>>
>>> The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented).
>>
>> What is my best course of action?
>
> I'd encourage proposing a patch to make the documentation say what you
> wish it said.

I do not think that was what Pavel wanted to get suggestion on, but
it is certainly something that would benefit not just a single
person but other users ;-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-07-01 15:39 ` Junio C Hamano
@ 2024-07-01 16:08   ` Pavel Rappo
  2024-07-01 18:16     ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Pavel Rappo @ 2024-07-01 16:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git mailing list

I can see that I used "robustly" misleadingly. My intent was to get
robust automation, not a robust boolean classifier; sorry.

Here's my use case. In our project, every commit to a PR means that
the PR needs to be re-reviewed. During a PR's lifetime, the PR's
target branch might be merged into the PR multiple times. Since
re-reviewing is costly, I'm exploring the possibility of not requiring
it for such merge commits produced automatically because of the
assumption that nothing bad can happen there.

In that use case, both false positives and false negatives are _fine_,
if only annoying. If it's false positive, meaning that a PR has been
truly merged, but the check couldn't figure it out, re-review will be
required. Not a big deal. If it's false negative, meaning that a PR
has been tampered with, but looks like it hasn't, then it's okay. Why?
Because the result will be the _same_ [^*] if that PR is eventually,
automatically merged into the target branch by that very Git that
performed the merge check.

[*]: Or very, very similar.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-07-01 15:11     ` Elijah Newren
@ 2024-07-01 16:43       ` Pavel Rappo
  0 siblings, 0 replies; 11+ messages in thread
From: Pavel Rappo @ 2024-07-01 16:43 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Martin von Zweigbergk, Git mailing list

On Mon, Jul 1, 2024 at 4:11 PM Elijah Newren <newren@gmail.com> wrote:

> However, this wording was not intended to detract from the main point
> that "empty output means clean merge and non-empty output means
> conflicted merge" (it never even occurred to me that someone might
> read that part of the documentation and assume that it presents a
> problem for checking-if-diff-is-empty).  I use it for the same
> purpose, and that's absolutely a guarantee we want to provide.  If you
> want to guarantee output format beyond that, though, then I object.
>

That's good to know; thanks! Back to Jonathan's suggestion: I don't
know how to express that in documentation. I cannot suggest any good
wording at this time.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-07-01 16:08   ` Pavel Rappo
@ 2024-07-01 18:16     ` Junio C Hamano
  2024-07-01 22:26       ` Pavel Rappo
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2024-07-01 18:16 UTC (permalink / raw)
  To: Pavel Rappo; +Cc: Git mailing list

Pavel Rappo <pavel.rappo@gmail.com> writes:

> it for such merge commits produced automatically because of the
> assumption that nothing bad can happen there.

I do not think that assumption holds in the first place, though.  A
typical and often cited example is when one side changed a helper
function's behaviour while the other side added new callers to the
helper function, still assuming the original behaviour.  In such a
case there may not even be an textual conflict but the end results
may be broken, and if the breakage is subtle, it may take weeks or
months before somebody notices such a semantic mismerge.

Your "review a conflicted-and-resolved merge on one integration
branch once, and skip the re-review as long as the resolution is the
same way as the original one when the same branch gets merged into
another integration branch" is a neat idea (and the integration
branches we have in our project are run more-or-less like that).

But there, you'd need more than "both are cleanly auto-merged"; more
like "both may have conflicted but they are resolved the same way"
is what you are interested, no?  Since at that point, your primary
interest shouldn't be "does it cleanly auto-merge?" but "do these
two merges do the same thing?", determining if a merge was created
automatically becomes a problem you do not need to solve, or solving
it would not further your true goal.

If you have two integration branches A and B, and a topic branch T
first gets merged to A and then after proving its worth it gets
merged down to B, I wonder if you can verify somebody's merge of B
into T by comparing the result with your "verification merge", which
you preform locally and on a throw-away branch by using "git rebase
--rebase-merges" or some mechanism, to replay the original merge of
T into A on top of B (before the merge of T you are verifying).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Determining if a merge was produced automatically
  2024-07-01 18:16     ` Junio C Hamano
@ 2024-07-01 22:26       ` Pavel Rappo
  0 siblings, 0 replies; 11+ messages in thread
From: Pavel Rappo @ 2024-07-01 22:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git mailing list

On Mon, Jul 1, 2024 at 7:16 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Pavel Rappo <pavel.rappo@gmail.com> writes:
>
> > it for such merge commits produced automatically because of the
> > assumption that nothing bad can happen there.
>
> I do not think that assumption holds in the first place, though.  A
> typical and often cited example is when one side changed a helper
> function's behaviour while the other side added new callers to the
> helper function, still assuming the original behaviour.  In such a
> case there may not even be an textual conflict but the end results
> may be broken, and if the breakage is subtle, it may take weeks or
> months before somebody notices such a semantic mismerge.

Junio, I'm under no illusion that Git can resolve semantic conflicts,
such as the one you described. When I said "nothing bad can happen" I
didn't mean broken code, I meant unreviewed, possibly malicious
changes making their way into the target branch.

Suppose, we have the master branch and a PR against that branch. If a
new commit is pushed into the PR branch, we want it to be reviewed,
unless that commit is a merge from master to the PR branch. In that
case, we try to replicate the merge to see if it yields the same
result. If the result is the same, we will not require re-review.

Why do we not require a re-review in that case? Because when a PR is
integrated, it is merged with the master anyway. That merge is
unsupervised and unreviewed. If we flip sides, it means that we may
want to not require a review for merging from master to the PR. It's
this sense that "nothing bad can happen".

> But there, you'd need more than "both are cleanly auto-merged"; more
> like "both may have conflicted but they are resolved the same way"
> is what you are interested, no?  Since at that point, your primary
> interest shouldn't be "does it cleanly auto-merge?" but "do these
> two merges do the same thing?", determining if a merge was created
> automatically becomes a problem you do not need to solve, or solving
> it would not further your true goal.

Automatic merge is just a merge made by git merge or another command
that I expect authors to use for occasional merge from master into
their PRs. It's not a special merge. It's just a reference merge.

Sorry, I'm not good at writing text. I really hope this email
clarifies my use case and what I am trying to do about it.

> If you have two integration branches A and B, and a topic branch T
> first gets merged to A and then after proving its worth it gets
> merged down to B, I wonder if you can verify somebody's merge of B
> into T by comparing the result with your "verification merge", which
> you preform locally and on a throw-away branch by using "git rebase
> --rebase-merges" or some mechanism, to replay the original merge of
> T into A on top of B (before the merge of T you are verifying).

That sounds like my first idea with extra steps; not sure.

-Pavel

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-07-01 22:26 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-30 18:06 Determining if a merge was produced automatically Pavel Rappo
     [not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com>
2024-07-01  0:45   ` Martin von Zweigbergk
2024-07-01 15:11     ` Elijah Newren
2024-07-01 16:43       ` Pavel Rappo
2024-07-01 11:49 ` Jonathan Nieder
2024-07-01 13:13   ` Pavel Rappo
2024-07-01 15:43   ` Junio C Hamano
2024-07-01 15:39 ` Junio C Hamano
2024-07-01 16:08   ` Pavel Rappo
2024-07-01 18:16     ` Junio C Hamano
2024-07-01 22:26       ` Pavel Rappo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).