* Determining if a merge was produced automatically
@ 2024-06-30 18:06 Pavel Rappo
[not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com>
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Pavel Rappo @ 2024-06-30 18:06 UTC (permalink / raw)
To: Git mailing list
Hello,
I'm looking for a robust way to determine if a given merge commit
could've been produced automatically by `git merge`, without any
manual intervention or tampering, such as:
- resolving conflicts,
- stopping (`--no-commit`) and modifying,
- amending the commit.
My initial idea was to re-enact the merge. If the merge failed, I
would conclude that the original merge couldn't have been produced
automatically. If the merge succeeded, I would compare it with the
original merge. Any differences would indicate that the original merge
couldn't have been produced automatically. Otherwise, I would conclude
that it could've been. This approach is simple, but involves multiple
steps and requires clean-up.
My second idea was to use `git show --diff-merges=dense-combined`,
which only prints hunks that come from neither parent. If nothing is
printed, I would conclude that the merge could've been produced
automatically. This approach is simple, single-step, but seems to have
an issue. In my experiments, I found that if some hunks from different
parents were located closely enough, output was produced. So, checking
if nothing is output could lead to false negatives: a merge that
could've been produced automatically might look like it was tampered
with.
My third idea was to use a recently added feature, `git show
--remerge-diff`, which seemingly embodies my first idea and is immune
to the issue of the second. It is also single-step and requires no
clean-up:
> Remerge two-parent merge commits to create a temporary tree object—potentially containing files with conflict markers and such. A diff is then shown between that temporary tree and the actual merge commit.
However, this bit means that I shouldn't entirely trust its output:
> The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented).
What is my best course of action?
Thanks,
-Pavel
^ permalink raw reply [flat|nested] 11+ messages in thread[parent not found: <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com>]
* Re: Determining if a merge was produced automatically [not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com> @ 2024-07-01 0:45 ` Martin von Zweigbergk 2024-07-01 15:11 ` Elijah Newren 0 siblings, 1 reply; 11+ messages in thread From: Martin von Zweigbergk @ 2024-07-01 0:45 UTC (permalink / raw) To: Pavel Rappo; +Cc: Git mailing list Forwarding to the list without HTML so others can correct me if I was wrong. On Sun, Jun 30, 2024 at 3:32 PM Martin von Zweigbergk <martinvonz@gmail.com> wrote: > > > > On Sun, Jun 30, 2024, 11:06 Pavel Rappo <pavel.rappo@gmail.com> wrote: >> >> Hello, >> >> I'm looking for a robust way to determine if a given merge commit >> could've been produced automatically by `git merge`, without any >> manual intervention or tampering, such as: >> >> - resolving conflicts, >> - stopping (`--no-commit`) and modifying, >> - amending the commit. >> >> My initial idea was to re-enact the merge. If the merge failed, I >> would conclude that the original merge couldn't have been produced >> automatically. If the merge succeeded, I would compare it with the >> original merge. Any differences would indicate that the original merge >> couldn't have been produced automatically. Otherwise, I would conclude >> that it could've been. This approach is simple, but involves multiple >> steps and requires clean-up. >> >> My second idea was to use `git show --diff-merges=dense-combined`, >> which only prints hunks that come from neither parent. If nothing is >> printed, I would conclude that the merge could've been produced >> automatically. This approach is simple, single-step, but seems to have >> an issue. In my experiments, I found that if some hunks from different >> parents were located closely enough, output was produced. So, checking >> if nothing is output could lead to false negatives: a merge that >> could've been produced automatically might look like it was tampered >> with. >> >> My third idea was to use a recently added feature, `git show >> --remerge-diff`, which seemingly embodies my first idea and is immune >> to the issue of the second. It is also single-step and requires no >> clean-up: >> >> > Remerge two-parent merge commits to create a temporary tree object—potentially containing files with conflict markers and such. A diff is then shown between that temporary tree and the actual merge commit. >> >> However, this bit means that I shouldn't entirely trust its output: >> >> > The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented). > > > There's basically only one way to display an empty diff, so I suspect that checking that the diff is empty is still going to be enough for your purposes. > > Note that you can specify e.g. the rename detection threshold to use while merging, and the person doing the merge might have used a different threshold than you're using when you're trying to check if they added other changes. There are also different merge strategies and diff algorithms to choose. That means that you might get false positives and false negatives. Maybe that's still good enough for you. > >> >> What is my best course of action? >> >> Thanks, >> -Pavel >> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-07-01 0:45 ` Martin von Zweigbergk @ 2024-07-01 15:11 ` Elijah Newren 2024-07-01 16:43 ` Pavel Rappo 0 siblings, 1 reply; 11+ messages in thread From: Elijah Newren @ 2024-07-01 15:11 UTC (permalink / raw) To: Martin von Zweigbergk; +Cc: Pavel Rappo, Git mailing list Hi, On Sun, Jun 30, 2024 at 5:45 PM Martin von Zweigbergk <martinvonz@gmail.com> wrote: > > Forwarding to the list without HTML so others can correct me if I was wrong. > > On Sun, Jun 30, 2024 at 3:32 PM Martin von Zweigbergk > <martinvonz@gmail.com> wrote: > > > > > > > > On Sun, Jun 30, 2024, 11:06 Pavel Rappo <pavel.rappo@gmail.com> wrote: > >> > >> Hello, > >> > >> I'm looking for a robust way to determine if a given merge commit > >> could've been produced automatically by `git merge`, without any > >> manual intervention or tampering, such as: > >> > >> - resolving conflicts, > >> - stopping (`--no-commit`) and modifying, > >> - amending the commit. > >> > >> My initial idea was to re-enact the merge. If the merge failed, I > >> would conclude that the original merge couldn't have been produced > >> automatically. If the merge succeeded, I would compare it with the > >> original merge. Any differences would indicate that the original merge > >> couldn't have been produced automatically. Otherwise, I would conclude > >> that it could've been. This approach is simple, but involves multiple > >> steps and requires clean-up. Further, your strategy has some blind spots. What if the original person creating the merge used special flags, such as changing the rename threshold, ignoring space changes, or a different underlying diff algorithm? It may be that the merge was clean for the original merger, but if you don't use the same options it doesn't look clean to you -- or vice versa. (In short, this method has both false positives and false negatives.) (The odds that someone used specialized options and then had the merge succeed, when it wouldn't have otherwise, is pretty low. So your strategy is probably good enough, but it's good to be aware of other possibilities.) > >> My second idea was to use `git show --diff-merges=dense-combined`, > >> which only prints hunks that come from neither parent. If nothing is > >> printed, I would conclude that the merge could've been produced > >> automatically. This approach is simple, single-step, but seems to have > >> an issue. In my experiments, I found that if some hunks from different > >> parents were located closely enough, output was produced. So, checking > >> if nothing is output could lead to false negatives: a merge that > >> could've been produced automatically might look like it was tampered > >> with. > >> > >> My third idea was to use a recently added feature, `git show > >> --remerge-diff`, which seemingly embodies my first idea and is immune > >> to the issue of the second. It is also single-step and requires no > >> clean-up: > >> > >> > Remerge two-parent merge commits to create a temporary tree object—potentially containing files with conflict markers and such. A diff is then shown between that temporary tree and the actual merge commit. Yes, this is exactly your strategy, except that instead of checking whether it "completes" you are checking whether the output is empty, and it avoids messing up the working tree and index and thus also avoids the need for clean-up. > >> However, this bit means that I shouldn't entirely trust its output: > >> > >> > The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented). --remerge-diff output uses diff headers a bit inventively. And, being a somewhat new option, I didn't want a repeat of issues like we had with --cc (where the output format is documented in detail and was for a few years before we realized that showing diff headers for exactly two files when there are at least three files is kinda dumb when renames or mode changes are involved). So, we needed the flexibility to change the output in the future. However, this wording was not intended to detract from the main point that "empty output means clean merge and non-empty output means conflicted merge" (it never even occurred to me that someone might read that part of the documentation and assume that it presents a problem for checking-if-diff-is-empty). I use it for the same purpose, and that's absolutely a guarantee we want to provide. If you want to guarantee output format beyond that, though, then I object. Anyway, in summary... > > There's basically only one way to display an empty diff, so I suspect that checking that the diff is empty is still going to be enough for your purposes. > > > > Note that you can specify e.g. the rename detection threshold to use while merging, and the person doing the merge might have used a different threshold than you're using when you're trying to check if they added other changes. There are also different merge strategies and diff algorithms to choose. That means that you might get false positives and false negatives. Maybe that's still good enough for you. Yes, Martin has it exactly right. And it has possible false positives and false negatives precisely because your original strategy did too. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-07-01 15:11 ` Elijah Newren @ 2024-07-01 16:43 ` Pavel Rappo 0 siblings, 0 replies; 11+ messages in thread From: Pavel Rappo @ 2024-07-01 16:43 UTC (permalink / raw) To: Elijah Newren; +Cc: Martin von Zweigbergk, Git mailing list On Mon, Jul 1, 2024 at 4:11 PM Elijah Newren <newren@gmail.com> wrote: > However, this wording was not intended to detract from the main point > that "empty output means clean merge and non-empty output means > conflicted merge" (it never even occurred to me that someone might > read that part of the documentation and assume that it presents a > problem for checking-if-diff-is-empty). I use it for the same > purpose, and that's absolutely a guarantee we want to provide. If you > want to guarantee output format beyond that, though, then I object. > That's good to know; thanks! Back to Jonathan's suggestion: I don't know how to express that in documentation. I cannot suggest any good wording at this time. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-06-30 18:06 Determining if a merge was produced automatically Pavel Rappo [not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com> @ 2024-07-01 11:49 ` Jonathan Nieder 2024-07-01 13:13 ` Pavel Rappo 2024-07-01 15:43 ` Junio C Hamano 2024-07-01 15:39 ` Junio C Hamano 2 siblings, 2 replies; 11+ messages in thread From: Jonathan Nieder @ 2024-07-01 11:49 UTC (permalink / raw) To: Pavel Rappo; +Cc: Git mailing list Hi Pavel, Pavel Rappo wrote: > However, this bit means that I shouldn't entirely trust its output: > >> The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented). > > What is my best course of action? I'd encourage proposing a patch to make the documentation say what you wish it said. What guarantees would you like it to make? That will help with others in the project being able to decide what it should guarantee and what it shouldn't, and regardless of the outcome of discussion it would make the documentation more helpful for the next person. See Documentation/MyFirstContribution for some more details on how proposing a patch works. Thanks and hope that helps, Jonathan ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-07-01 11:49 ` Jonathan Nieder @ 2024-07-01 13:13 ` Pavel Rappo 2024-07-01 15:43 ` Junio C Hamano 1 sibling, 0 replies; 11+ messages in thread From: Pavel Rappo @ 2024-07-01 13:13 UTC (permalink / raw) To: Jonathan Nieder; +Cc: Git mailing list I think the documentation is quite clear. It's just that it makes that feature unhelpful for my use case, as output from the following command is subject to change: git show --remerge-diff --pretty=format:%b <merge-commit> Now, if that's untrue, and the output from the above command cannot change, then yes, the documentation should be improved. -Pavel On Mon, Jul 1, 2024 at 12:49 PM Jonathan Nieder <jrnieder@gmail.com> wrote: > > Hi Pavel, > > Pavel Rappo wrote: > > > However, this bit means that I shouldn't entirely trust its output: > > > >> The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented). > > > > What is my best course of action? > > I'd encourage proposing a patch to make the documentation say what you > wish it said. What guarantees would you like it to make? That will > help with others in the project being able to decide what it should > guarantee and what it shouldn't, and regardless of the outcome of > discussion it would make the documentation more helpful for the next > person. > > See Documentation/MyFirstContribution for some more details on how > proposing a patch works. > > Thanks and hope that helps, > Jonathan ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-07-01 11:49 ` Jonathan Nieder 2024-07-01 13:13 ` Pavel Rappo @ 2024-07-01 15:43 ` Junio C Hamano 1 sibling, 0 replies; 11+ messages in thread From: Junio C Hamano @ 2024-07-01 15:43 UTC (permalink / raw) To: Jonathan Nieder; +Cc: Pavel Rappo, Git mailing list Jonathan Nieder <jrnieder@gmail.com> writes: >> However, this bit means that I shouldn't entirely trust its output: >> >>> The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented). >> >> What is my best course of action? > > I'd encourage proposing a patch to make the documentation say what you > wish it said. I do not think that was what Pavel wanted to get suggestion on, but it is certainly something that would benefit not just a single person but other users ;-) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-06-30 18:06 Determining if a merge was produced automatically Pavel Rappo [not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com> 2024-07-01 11:49 ` Jonathan Nieder @ 2024-07-01 15:39 ` Junio C Hamano 2024-07-01 16:08 ` Pavel Rappo 2 siblings, 1 reply; 11+ messages in thread From: Junio C Hamano @ 2024-07-01 15:39 UTC (permalink / raw) To: Pavel Rappo; +Cc: Git mailing list Pavel Rappo <pavel.rappo@gmail.com> writes: > I'm looking for a robust way to determine if a given merge commit > could've been produced automatically by `git merge`, without any > manual intervention or tampering, such as: > > - resolving conflicts, > - stopping (`--no-commit`) and modifying, > - amending the commit. This fundamentally cannot be done perfectly and it is by design. The automated merge algorithms, backends, and strategies will improve over time, so a merge attempted 7 years ago with the then-current version of Git may have conflicted and required manual resolution, but a newer Git you use to "look for manual intervention or tampering" may have improved enough to resolve the same merge cleanly and automatically. Even if the person who created the merge originally and you are using with the same version of Git, the former may be using a custom low-level merge driver to resolve conflicts in certain types of files better than the barebones textual merge driver, while you do not have access to the same driver. So you'd need to tighten the definition a bit more, at bit more like constraints like "using the same version of Git" and "without any low-level merge driver customization". Now, with problem narrowed down a bit with tightened constraints. > My initial idea was to re-enact the merge. If the merge failed, ... > My second idea was to use `git show --diff-merges=dense-combined` ... > My third idea was to use a recently added feature, `git show --remerge-diff`... The first and the third are equivalent (the latter is an automation of what the former would do). The --cc output, as you said, is about showing resolution that is different from any of the parents, and serves the need quite different from what you are trying to do (e.g., if the merge was created with "git merge -s ours", there is nothing _interesting_ in the result from the "--cc"'s point of view), but it is likely that you are interested in noticing such an unusual "cauterizing the history" merge. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-07-01 15:39 ` Junio C Hamano @ 2024-07-01 16:08 ` Pavel Rappo 2024-07-01 18:16 ` Junio C Hamano 0 siblings, 1 reply; 11+ messages in thread From: Pavel Rappo @ 2024-07-01 16:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git mailing list I can see that I used "robustly" misleadingly. My intent was to get robust automation, not a robust boolean classifier; sorry. Here's my use case. In our project, every commit to a PR means that the PR needs to be re-reviewed. During a PR's lifetime, the PR's target branch might be merged into the PR multiple times. Since re-reviewing is costly, I'm exploring the possibility of not requiring it for such merge commits produced automatically because of the assumption that nothing bad can happen there. In that use case, both false positives and false negatives are _fine_, if only annoying. If it's false positive, meaning that a PR has been truly merged, but the check couldn't figure it out, re-review will be required. Not a big deal. If it's false negative, meaning that a PR has been tampered with, but looks like it hasn't, then it's okay. Why? Because the result will be the _same_ [^*] if that PR is eventually, automatically merged into the target branch by that very Git that performed the merge check. [*]: Or very, very similar. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-07-01 16:08 ` Pavel Rappo @ 2024-07-01 18:16 ` Junio C Hamano 2024-07-01 22:26 ` Pavel Rappo 0 siblings, 1 reply; 11+ messages in thread From: Junio C Hamano @ 2024-07-01 18:16 UTC (permalink / raw) To: Pavel Rappo; +Cc: Git mailing list Pavel Rappo <pavel.rappo@gmail.com> writes: > it for such merge commits produced automatically because of the > assumption that nothing bad can happen there. I do not think that assumption holds in the first place, though. A typical and often cited example is when one side changed a helper function's behaviour while the other side added new callers to the helper function, still assuming the original behaviour. In such a case there may not even be an textual conflict but the end results may be broken, and if the breakage is subtle, it may take weeks or months before somebody notices such a semantic mismerge. Your "review a conflicted-and-resolved merge on one integration branch once, and skip the re-review as long as the resolution is the same way as the original one when the same branch gets merged into another integration branch" is a neat idea (and the integration branches we have in our project are run more-or-less like that). But there, you'd need more than "both are cleanly auto-merged"; more like "both may have conflicted but they are resolved the same way" is what you are interested, no? Since at that point, your primary interest shouldn't be "does it cleanly auto-merge?" but "do these two merges do the same thing?", determining if a merge was created automatically becomes a problem you do not need to solve, or solving it would not further your true goal. If you have two integration branches A and B, and a topic branch T first gets merged to A and then after proving its worth it gets merged down to B, I wonder if you can verify somebody's merge of B into T by comparing the result with your "verification merge", which you preform locally and on a throw-away branch by using "git rebase --rebase-merges" or some mechanism, to replay the original merge of T into A on top of B (before the merge of T you are verifying). ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Determining if a merge was produced automatically 2024-07-01 18:16 ` Junio C Hamano @ 2024-07-01 22:26 ` Pavel Rappo 0 siblings, 0 replies; 11+ messages in thread From: Pavel Rappo @ 2024-07-01 22:26 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git mailing list On Mon, Jul 1, 2024 at 7:16 PM Junio C Hamano <gitster@pobox.com> wrote: > > Pavel Rappo <pavel.rappo@gmail.com> writes: > > > it for such merge commits produced automatically because of the > > assumption that nothing bad can happen there. > > I do not think that assumption holds in the first place, though. A > typical and often cited example is when one side changed a helper > function's behaviour while the other side added new callers to the > helper function, still assuming the original behaviour. In such a > case there may not even be an textual conflict but the end results > may be broken, and if the breakage is subtle, it may take weeks or > months before somebody notices such a semantic mismerge. Junio, I'm under no illusion that Git can resolve semantic conflicts, such as the one you described. When I said "nothing bad can happen" I didn't mean broken code, I meant unreviewed, possibly malicious changes making their way into the target branch. Suppose, we have the master branch and a PR against that branch. If a new commit is pushed into the PR branch, we want it to be reviewed, unless that commit is a merge from master to the PR branch. In that case, we try to replicate the merge to see if it yields the same result. If the result is the same, we will not require re-review. Why do we not require a re-review in that case? Because when a PR is integrated, it is merged with the master anyway. That merge is unsupervised and unreviewed. If we flip sides, it means that we may want to not require a review for merging from master to the PR. It's this sense that "nothing bad can happen". > But there, you'd need more than "both are cleanly auto-merged"; more > like "both may have conflicted but they are resolved the same way" > is what you are interested, no? Since at that point, your primary > interest shouldn't be "does it cleanly auto-merge?" but "do these > two merges do the same thing?", determining if a merge was created > automatically becomes a problem you do not need to solve, or solving > it would not further your true goal. Automatic merge is just a merge made by git merge or another command that I expect authors to use for occasional merge from master into their PRs. It's not a special merge. It's just a reference merge. Sorry, I'm not good at writing text. I really hope this email clarifies my use case and what I am trying to do about it. > If you have two integration branches A and B, and a topic branch T > first gets merged to A and then after proving its worth it gets > merged down to B, I wonder if you can verify somebody's merge of B > into T by comparing the result with your "verification merge", which > you preform locally and on a throw-away branch by using "git rebase > --rebase-merges" or some mechanism, to replay the original merge of > T into A on top of B (before the merge of T you are verifying). That sounds like my first idea with extra steps; not sure. -Pavel ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-07-01 22:26 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-30 18:06 Determining if a merge was produced automatically Pavel Rappo
[not found] ` <CANiSa6hVbrCpPtBCL_W8+43uWGL0LFJkFhSJYGtfFgxX75zE8w@mail.gmail.com>
2024-07-01 0:45 ` Martin von Zweigbergk
2024-07-01 15:11 ` Elijah Newren
2024-07-01 16:43 ` Pavel Rappo
2024-07-01 11:49 ` Jonathan Nieder
2024-07-01 13:13 ` Pavel Rappo
2024-07-01 15:43 ` Junio C Hamano
2024-07-01 15:39 ` Junio C Hamano
2024-07-01 16:08 ` Pavel Rappo
2024-07-01 18:16 ` Junio C Hamano
2024-07-01 22:26 ` Pavel Rappo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).