Beyond Merge and Rebase: The Upstream Import Approach in Git


* Beyond Merge and Rebase: The Upstream Import Approach in Git
@ 2023-07-11  8:24 Aleksander Korzyński
  2023-07-11 17:02 ` Junio C Hamano
  2023-07-12 11:34 ` Johannes Schindelin
  0 siblings, 2 replies; 9+ messages in thread
From: Aleksander Korzyński @ 2023-07-11  8:24 UTC (permalink / raw)
  To: git

Hello,

Git users often have to make a choice: to merge or rebase. I'm going
to describe a third way that has the characteristics of both and is
very well suited for tracking an open-source project or any other
upstream branch. I'm looking for feedback on the approach.

MERGE OR REBASE?

Let's assume that you have forked an upstream open-source repository
and keep the fork in your own repo. The default branch of the upstream
repository is called "main" and is called the same in your own fork.
You have made a few changes to the source code and committed them to
the "main" branch of your fork. In the meantime, new changes have been
committed to the upstream "main" branch of the project. How do you
import the upstream changes to your fork?

Let's assume that your local fork also contains a branch called
"upstream/main", which reflects the state of the upstream's "main"
branch. So the "main" branch contains your own changes and the
"upstream/main" branch contains the community's changes:

  time -->

  o---o---o---o---o  upstream/main
       \
        o---o---o  main

So a different way to ask the question is: how do you bring
upstream/main's changes into main?

One solution is to merge "upstream/main" into "main":

  o---o---o---o---o  upstream/main
       \           \
        o---o---o---M  main

The merge above would certainly work, but it becomes problematic as
time passes and you get a lot of these merges in your "main" branch.
You then no longer have visibility into the differences between
"upstream/main" and "main", because your commits get lost deep in the
history of the branch, as illustrated below:

  o---o---o---o---o---o---o---o---o---o---o  upstream/main
       \           \       \       \       \
        o---o---o---M---o---M---o---M---o---M  main

So the alternative solution is to rebase your "main" branch on top of
"upstream/main":

  o---o---o---o---o  upstream/main
                   \
                    o'---o'---o'  main

You now have the advantage of having greater visibility into the
differences between "upstream/main" and "main". However, a rebase
comes with a different problem: if any user of your fork had the
"main" branch checked out in their local repository and they run "git
pull", they are going to get an error stating that the local and
upstream branches have diverged. They will have to take special steps
to recover from the rebase of the "main" branch.

So how can that problem be solved?

THE THIRD WAY - UPSTREAM IMPORT

The proposed third way is a special operation that (in the described
use case) has the advantages of both a merge and a rebase, without the
disadvantages. The approach is illustrated below:

  o---o---o---o---o  upstream/main
       \           \
        \           o'---o'---o'
         \                     \
          o---o---o-------------S  main

First, the divergent commits from "main" are rebased on top of
"upstream/main", but then they are combined back with "main" using a
special merge commit, which has a custom strategy: it replaces the old
content of "main" with the new rebased content. This last commit is
the secret sauce of this solution: the commit has two parents, like an
ordinary merge, but has the semantics of a rebase.

The structure above has the advantages of both a merge and a rebase.
On the one hand, just like with an ordinary merge, a user who runs
"git pull" on their local copy of "main" is not going to see the error
about divergent branches. On the other hand, just like with an
ordinary rebase, there is visibility into the last imported commit
from "upstream/main" and the differences between that commit and the
tip of "main".
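The structure above can be reproduced with stock git commands. The
following is only a minimal sketch under stated assumptions, not the
way git-upstream implements it: it builds a throwaway repository, the
"commit" helper and the "import" branch name are made up for the
example, and recording the old "main" as the first parent of S is an
arbitrary choice.

```shell
#!/bin/sh
# Sketch only: perform one "upstream import" in a throwaway repository.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/upstream/main

# Hypothetical helper: each commit adds one file named after the commit.
commit() { echo "$1" >"$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

# Upstream history, then a fork with three local commits.
commit u1; commit u2
git checkout -q -b main
commit a; commit b; commit c
git checkout -q upstream/main
commit u3; commit u4

# 1. Rebase a copy of main onto upstream/main; main itself is untouched.
old=$(git rev-parse main)
git checkout -q -B import main
git rebase -q upstream/main

# 2. Create S: it takes the tree of the rebased tip and has two parents,
#    the old main and the rebased tip.
tree=$(git rev-parse 'import^{tree}')
S=$(git commit-tree -p "$old" -p import -m "Import upstream/main" "$tree")

# 3. Move main to S. This is a fast-forward, so nobody's pull diverges.
git checkout -q main
git merge -q --ff-only "$S"

# The downstream patches are visible as a plain range.
git log --oneline upstream/main..import
```

Because S is created with "git commit-tree", no merge strategy runs at
all: the content of S is exactly the content of the rebased tip.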

DROPPING PATCHES

What is supposed to happen if one of the commits from "main" is ported
to "upstream/main", as illustrated below?

  o---o---o---A'---o  upstream/main
       \
        \
         \
          A---B---C  main

In that case, the upstream importing operation should drop that patch,
as illustrated below:

  o---o---o---A'---o  upstream/main
       \            \
        \            B'---C'
         \                 \
          A---B---C---------S  main

But how would the upstream importing operation know which patches to
drop? There are two ways.

Firstly, it can look at Git's patch ID, which is a hash of the diff a
commit introduces, with line numbers and whitespace ignored. This is
the same mechanism that rebase uses to drop duplicate commits.

Secondly, it can use an arbitrary change-id associated with a commit
(for example, for projects that use Gerrit, it can be Gerrit's
Change-Id, which is saved in the commit message). This is useful when
a given patch lands upstream in a slightly changed form, but is meant
to replace the version in "main".
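The patch-id approach can be tried out with stock git. The sketch below
is an illustration under assumptions, not code from git-upstream: the
repository is a throwaway one, and the "commit" and "patch_id" helpers
and the "import" branch are invented for the example. It shows that a
commit and its cherry-picked copy share a patch ID, that "git cherry"
flags the copy, and that a plain rebase leaves it out. The change-id
approach is not shown.

```shell
#!/bin/sh
# Sketch only: detect a patch that has already landed upstream.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/upstream/main

# Hypothetical helper: each commit adds one file named after the commit.
commit() { echo "$1" >"$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit u1
git checkout -q -b main
commit A; commit B
git checkout -q upstream/main
commit u2
git cherry-pick main~1 >/dev/null   # A lands upstream as A'
commit u3

# Hypothetical helper: print the stable patch ID of one commit.
patch_id() { git show "$1" | git patch-id --stable | cut -d' ' -f1; }
id_A=$(patch_id main~1)
id_A2=$(patch_id upstream/main~1)

# "git cherry" prefixes commits that have an equivalent upstream with "-".
marks=$(git cherry upstream/main main)
echo "$marks"

# A rebase of the downstream commits leaves A out and keeps only B.
git checkout -q -B import main
git rebase -q upstream/main
git log --oneline upstream/main..import
```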

IMPLEMENTATION

The solution above has already been implemented in an open-source
Python script called git-upstream[1], published 10 years ago. It was
originally implemented for the OpenStack project, but the solution is
generic and applicable to any open-source project. It is going to be
easier for users to benefit from the ideas behind git-upstream if the
functionality is integrated directly into git.

Would you like to see the above functionality integrated directly into git?

Best regards,
Aleksander Korzynski

www.linkedin.com/in/akorzy
www.devopsera.com/blog

P.S.

For completeness, I'm providing links to alternative solutions for
tracking patches:

* git-upstream[1] uses the strategy described above
* quilt[2] uses patch files saved in a source code repository
* StGit[3] is inspired by quilt and uses git commits to store patches
* MQ[4] is also inspired by quilt and implements a patch queue in Mercurial

[1] https://opendev.org/x/git-upstream
[2] https://savannah.nongnu.org/projects/quilt
[3] https://stacked-git.github.io
[4] https://wiki.mercurial-scm.org/MqExtension


* Re: Beyond Merge and Rebase: The Upstream Import Approach in Git
  2023-07-11  8:24 Beyond Merge and Rebase: The Upstream Import Approach in Git Aleksander Korzyński
@ 2023-07-11 17:02 ` Junio C Hamano
  2023-07-13 10:55   ` Aleksander Korzyński
  2023-07-12 11:34 ` Johannes Schindelin
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2023-07-11 17:02 UTC (permalink / raw)
  To: Aleksander Korzyński; +Cc: git

Aleksander Korzyński <ak@akorzy.net> writes:

> So the alternative solution is to rebase your "main" branch on top of
> "upstream/main":
>
>   o---o---o---o---o  upstream/main
>                    \
>                     o'---o'---o'  main
>
> You now have the advantage of having greater visibility into the
> differences between "upstream/main" and "main". However, a rebase
> comes with a different problem: if any user of your fork had the
> "main" branch checked out in their local repository and they run "git
> pull", they are going to get an error stating that the local and
> upstream branches have diverged. They will have to take special steps
> to recover from the rebase of the "main" branch.
>
> So how to solve that problem?

In short, what you wrote is a way to use rebase but help those who
have older versions of your work to bring themselves up to date.
That is a useful thing for downstream contributors to have, and
helping them coordinate the sharing of their work is a valuable
goal, because downstream contributors generally outnumber upstream
maintainers.  It would help you to hear the perspective of upstream
maintainers as well, and here are a few things that come to my
mind.

    o---o---o---o---o  upstream/main
         \           \
          \           a'---b'---c'
           \                     \
            a---b---c-------------S  main

 * It certainly would help folks who received a copy of c from you
   and then want to observe your progress after you rebased c to c',
   but how does this help those who have older versions of your
   work, *and* built their own changes on top?  They would not just
   need to update their remote-tracking branch that has your older
   version of the work to the latest, but also rebase their work on
   top.

    o---o---o---o---o  upstream/main
         \           \
          \           a'---b'---c'---d'---e'
           \                     \
            a---b---c-------------S  main
                     \
                      d---e  your coworker

 * It is a reasonable way to keep your work, as a fork of the
   upstream, up to date, but it is unclear what the eventual
   presentation to and adoption by the upstream would look like.  As
   an upstream maintainer, for example, I do not want to merge S
   above to the upstream tree.

 * There is no need to say that it is undesirable to merge from
   upstream to your working topic branch like 'main' repeatedly, as
   everybody knows it will clutter your history, but more
   importantly, the resulting history becomes more useless from the
   upstream's point of view as you have more such reverse merges.
   The upstream wants to see your work and only your work delineated
   on your repository.  If you repeat the "rebase and merge", then
   the next round would create a new history a", b" and c" forked on
   top of an updated upstream/main, merged on top of S, perhaps
   looking like this:

    o---o---o---o---o---------------o  upstream/main
         \           \               \
          \           a'---b'---c'    a"--b"--c"
           \                     \             \
            a---b---c-------------S-------------T  main

   However, once you keep going this way for several rounds, would
   the result really be much better than a bushy history full of
   reverse merges from upstream?  Would it help to add new history
   simplification mechanisms and options to help visualize the
   history, or do we already have the necessary support?  For
   example, if the convention for these "merge to cauterize the
   older versions of history with the newly rebased history" S and
   T merges is to record the rebased history as the first parent,
   then "git log --first-parent upstream/main..main" should be
   sufficient.  Users would also benefit from an easy way, given
   only T or S, to get a range-diff among (a,b,c), (a',b',c') and
   (a",b",c").
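For a single round, both ideas in the last point can be tried with
existing commands. This is only a sketch under assumptions: the
repository is a throwaway one, the "commit" helper and the "import"
branch are invented, and S is built by hand with the rebased history
recorded as its first parent, which is the convention mentioned above.

```shell
#!/bin/sh
# Sketch only: inspect one cauterizing merge S with existing git commands.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/upstream/main

# Hypothetical helper: each commit adds one file named after the commit.
commit() { echo "$1" >"$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit u1; commit u2
git checkout -q -b main
commit a; commit b; commit c
git checkout -q upstream/main
commit u3; commit u4

# Rebase a copy of main, then build S with the rebased tip as first parent.
old=$(git rev-parse main)
git checkout -q -B import main
git rebase -q upstream/main
rebased=$(git rev-parse import)
S=$(git commit-tree -p "$rebased" -p "$old" -m S \
      "$(git rev-parse "$rebased^{tree}")")
git checkout -q main
git merge -q --ff-only "$S"

# With that parent order, the first-parent walk shows S and a', b', c'.
first_parent=$(git log --first-parent --format=%s upstream/main..main)
echo "$first_parent"

# Given only S: compare the old series (second parent) with the rebased
# series (first parent), both taken relative to upstream/main.
pairs=$(git range-diff --no-color upstream/main "$S^2" "$S^1")
echo "$pairs"
```

After a later round T, upstream/main would no longer be the right base
for the older series, so a real tool would have to work out the base
of each series on its own.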

What is interesting is that, because S and T are essentially "ours"
merges of your local history into the history that would result if
you rebased on top of the upstream (i.e. merge S and T would have
the same tree as c' and c"), what is tested and used by the holder
of S and T are the changes represented by the latest rebased
versions of the commits.  So from the upstream point of view,
throwing a pull request for c" (not the original a, b and c) would
be a reasonable way to finalize your work.  That way, what you are
offering to the upstream is not the ancient original commits (i.e. a,
b, and c) that you haven't been using at all once you created S and
T.

Thanks.


* Re: Beyond Merge and Rebase: The Upstream Import Approach in Git
  2023-07-11  8:24 Beyond Merge and Rebase: The Upstream Import Approach in Git Aleksander Korzyński
  2023-07-11 17:02 ` Junio C Hamano
@ 2023-07-12 11:34 ` Johannes Schindelin
  2023-07-12 15:37   ` Junio C Hamano
  2023-07-14 10:06   ` Aleksander Korzyński
  1 sibling, 2 replies; 9+ messages in thread
From: Johannes Schindelin @ 2023-07-12 11:34 UTC (permalink / raw)
  To: Aleksander Korzyński; +Cc: git


Hi Aleksander,

On Tue, 11 Jul 2023, Aleksander Korzyński wrote:

> THE THIRD WAY - UPSTREAM IMPORT
>
> The proposed third way is a special operation that (in the described
> use case) has the advantages of both a merge and a rebase, without the
> disadvantages. The approach is illustrated below:
>
>   o---o---o---o---o  upstream/main
>        \           \
>         \           o'---o'---o'
>          \                     \
>           o---o---o-------------S  main
>
> First, the divergent commits from "main" are rebased on top of
> "upstream/main", but then they are combined back with "main" using a
> special merge commit, which has a custom strategy: it replaces the old
> content of "main" with the new rebased content. This last commit is
> the secret sauce of this solution: the commit has two parents, like an
> ordinary merge, but has the semantics of a rebase.
>
> The structure above has the advantages of both a merge and a rebase.
> On the one hand, just like with an ordinary merge, a user who runs
> "git pull" on their local copy of "main" is not going to see the error
> about divergent branches. On the other hand, just like with an
> ordinary rebase, there is visibility into the last imported commit
> from "upstream/main" and the differences between that commit and the
> tip of "main".

I know this strategy well, having used it initially to maintain Git for
Windows' patches on top of Git releases. I refer to it as `rebasing merge`
strategy.

The main benefit for me was that the patches were always kept in an
"upstreamable state", which incidentally also helped resolving the
merge conflicts that occurred by continually rebasing them onto upstream
releases.

However, I soon realized that the delineation between upstream and
downstream patches was unsatisfactory, in particular when new downstream
patches are added. In the context of the example above, try to find a `git
rebase` invocation that rebases the current set of downstream patches:

   o---o---o---o---o---o---o---o  upstream/main
        \           \
         \           o'---o'---o'
          \                     \
           o---o---o-------------S---o---o---o  main

A candidate to describe this in a commit range would be
`upstream/main..main ^S^`, but you cannot pass that to `git rebase -i`,
which expects a single upstream.

Side note: You could _simulate_ this by calling `git replace --graft
upstream/main upstream/main^ S^` before calling `git rebase -i
upstream/main`, but I found it really easy to forget to remove the replace
object afterwards, and I managed to confuse myself many times before
deciding to use replace objects only very rarely.

So I switched to a different scheme instead that I dub "merging rebase".
Instead of finishing the rebase with a merge, I start it with that merge.
In your example, it would look like this:

   o---o---o---o---o  upstream/main
        \           \
         o---o---o---M---o'---o'---o' main

Naturally, `M` needs to be a merge that _must_ be made with `-s ours` in
order to be "tree-same with upstream/main".

This strategy was implemented initially in
https://github.com/msysgit/msysgit/commit/95ae63b8c6c0b275f460897c15a44a7df5246dfb
and is in use to this day:
https://github.com/git-for-windows/build-extra/blob/main/shears.sh
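The shape of a merging rebase can be sketched with stock git commands.
This is an illustration under assumptions and does not reproduce what
shears.sh does: the repository is a throwaway one, the "commit" helper
is invented, and the commits are replayed with a plain cherry-pick of
the range.

```shell
#!/bin/sh
# Sketch only: start with an "ours" merge, then replay the local commits.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/upstream/main

# Hypothetical helper: each commit adds one file named after the commit.
commit() { echo "$1" >"$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit u1; commit u2
git checkout -q -b main
commit a; commit b; commit c
git checkout -q upstream/main
commit u3; commit u4

old=$(git rev-parse main)
base=$(git merge-base upstream/main main)

# 1. Start from upstream/main and merge the old main with "-s ours", so
#    that M keeps the upstream tree but has the old main as a parent.
git checkout -q --detach upstream/main
git merge -q -s ours -m "Start the merging rebase" "$old"
M=$(git rev-parse HEAD)

# 2. Replay the downstream commits on top of M.
git cherry-pick "$base..$old" >/dev/null

# 3. Point main at the result; it descends from the old main.
git branch -f main HEAD
git checkout -q main
git log --oneline "$M..main"
```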

This strategy is not without problems, though, which becomes quite clear
when you accept PRs that are based on commits prior to the most recent
merging rebase (or rebasing merge, both strategies suffer from the same
problem): the _next_ merging rebase will not necessarily find the most
appropriate base commit, in particular when rebasing with
`--rebase-merges`, causing unnecessary merge conflicts.

The underlying problem is, of course, the lack of mapping between
pre-rebase and post-rebase versions of the commits: Git has no idea
that two commits should be considered identical for the purposes of the
rebase, even if their SHA-1 differs. And in my hands, the patch ID has
been a poor tool to address this lack of mapping, almost always failing
for me. Not even hacked-up `git range-diff` was able to reconstruct the
mapping reliably enough.

And that problem, as far as I can tell, is still unsolved.

There have been efforts to this end, including
https://lore.kernel.org/git/pull.1356.v2.git.1664981957.gitgitgadget@gmail.com/,
but I do not think that any satisfying consensus was reached.

Ciao,
Johannes


* Re: Beyond Merge and Rebase: The Upstream Import Approach in Git
  2023-07-12 11:34 ` Johannes Schindelin
@ 2023-07-12 15:37   ` Junio C Hamano
  2023-07-12 20:27     ` Junio C Hamano
  2023-07-14 10:06   ` Aleksander Korzyński
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2023-07-12 15:37 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Aleksander Korzyński, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> I know this strategy well, having used it initially to maintain Git for
> Windows' patches on top of Git releases. I refer to it as `rebasing merge`
> strategy.

;-) Thanks. This did look familiar.

>    o---o---o---o---o---o---o---o  upstream/main
>         \           \
>          \           o'---o'---o'
>           \                     \
>            o---o---o-------------S---o---o---o  main
>
> A candidate to describe this in a commit range would be
> `upstream/main..main ^S^`, but you cannot pass that to `git rebase -i`,
> which expects a single upstream.

If "git rebase" is taught the `--ancestry-path` option and made to
pass it down to the underlying "which commits do I want to replay
and in what order" logic, it would be sufficient to help the above
topology, I would think.  But offhand I do not know what other
rev-list options will become useful in different scenarios.

> So I switched to a different scheme instead that I dub "merging rebase".
> Instead of finishing the rebase with a merge, I start it with that merge.
> In your example, it would look like this:
>
>    o---o---o---o---o  upstream/main
>         \           \
>          o---o---o---M---o'---o'---o' main
>
> Naturally, `M` needs to be a merge that _must_ be made with `-s ours` in
> order to be "tree-same with upstream/main".

And this will let you say "rebase -i upstream/main" to further
rebase the most recent round of commits.  That does look quite
natural.

> This strategy is not without problems, though, which becomes quite clear
> when you accept PRs that are based on commits prior to the most recent
> merging rebase (or rebasing merge, both strategies suffer from the same
> problem): the _next_ merging rebase will not necessarily find the most
> appropriate base commit, in particular when rebasing with
> `--rebase-merges`, causing unnecessary merge conflicts.

Even without rebasing merge or merging rebase, changes that would
be useful if only they were not based on a stale base do happen, and
it is more effective to have the original authors of these changes
update them to your most recent tree than to deal with them
yourself, for two reasons.  There are more ICs than you alone, and
they are more familiar with their work.

In other words, isn't the real cause of the above that the workflow
is not taking advantage of the distributed development?  "This PR
seems to solve the right problem, but it is based on an old version
of the code, please update?"


* Re: Beyond Merge and Rebase: The Upstream Import Approach in Git
  2023-07-12 15:37   ` Junio C Hamano
@ 2023-07-12 20:27     ` Junio C Hamano
  2023-07-14 10:56       ` Aleksander Korzyński
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2023-07-12 20:27 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Aleksander Korzyński, git

Junio C Hamano <gitster@pobox.com> writes:

>> So I switched to a different scheme instead that I dub "merging rebase".
>> Instead of finishing the rebase with a merge, I start it with that merge.
>> In your example, it would look like this:
>>
>>    o---o---o---o---o  upstream/main
>>         \           \
>>          o---o---o---M---o'---o'---o' main
> ...
>> This strategy is not without problems, though, which becomes quite clear
>> when you accept PRs that are based on commits prior to the most recent
>> merging rebase (or rebasing merge, both strategies suffer from the same
>> problem): the _next_ merging rebase will not necessarily find the most
>> appropriate base commit, in particular when rebasing with
>> `--rebase-merges`, causing unnecessary merge conflicts.

In Git, any commit, be it a single parent commit or a merge, makes
this statement:

    I considered all the parents of this commit, and it is my belief
    that it suits the purpose of the branch I am growing better than
    all of them.

This is the foundation of the correctness of three-way merges.
Coming from a common ancestor, because M suits the purpose of the
branch better than M^1 or M^2, when merging anything forked from M^1
(or M^2) into a descendant of M (say, 'main'), as long as the
descendant of M still shares the same purpose of the branch, it does
not need to consider what the commits before M^1 (or M^2) did.

M in the "merging rebase", however, claims that M, i.e. the recent
upstream, fits the purpose of the branch better than the earlier
three commits did, which is not quite right.  In contrast, rebasing
merge does not have such a problem, i.e.

    o---o---o---o---o  upstream/main
         \           \
          \           a'---b'---c'
           \                     \
            a---b---c-------------M main

The commit c, a parent of M, implemented the features the topic
wanted to, and the commit c', another parent of M, implements the
same on top of a newer upstream.  The tree of M is the same as c'
and it matches the purpose, which presumably is to implement
whatever (a,b,c) or (a',b',c') wanted to on top of reasonably recent
upstream, of the branch.

Anyway, I do not think building on top of M would help from this
state, so let's stop seeing if there is a way to make rebasing merge
a bit more useful.

Thanks.


* Re: Beyond Merge and Rebase: The Upstream Import Approach in Git
  2023-07-11 17:02 ` Junio C Hamano
@ 2023-07-13 10:55   ` Aleksander Korzyński
  0 siblings, 0 replies; 9+ messages in thread
From: Aleksander Korzyński @ 2023-07-13 10:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hello Junio,

Thanks for the insightful response.

Firstly, let me add some context, which will be helpful for responding
to your comments. This strategy was created at a large organization,
where the "main" branch was used as the integration branch for changes
made within the organization and for the organization. So whenever an
employee needed to make a change, they would start a topic branch off
"main" and the topic branch would be later merged or rebased onto
"main". When the employee was ready to contribute the change upstream,
they would only present the topic branch upstream. The "main" branch
was never presented upstream. Release branches were created from the
"main" branch to work on releases to production at the organization.
Tags were created on the release branches to mark stable and tested
versions. Deployments to production were made from the tagged
versions.

The rest of my response is inline:

> So from the upstream point of view,
> throwing a pull request for c" (not the original a, b and c) would
> be a reasonable way to finalize your work.  That way, what you are
> offering to the upstream is not an ancient original commits (i.e. a,
> b, and c) that you haven't been using at all once you created S and
> T.

Yes, this is how it's intended to be used.

> it is unclear what the eventual
> presentation to and adoption by the upstream would look like.  As
> an upstream maintainer, for example, I do not want to merge S
> above to the upstream tree.

Presentation to upstream would only involve submitting a pull request
for c". Merge S would not be presented to upstream.

> how does this help those who have older versions of your
> work, *and* built their own changes on top?  They would not just
> need to update their remote-tracking branch that has your older
> version of the work to the latest, but also rebase their work on
> top.
>
>     o---o---o---o---o  upstream/main
>          \           \
>           \           a'---b'---c'---d'---e'
>            \                     \
>             a---b---c-------------S  main
>                      \
>                       d---e  your coworker

The coworker would either merge their topic branch to "main" or rebase
it on top of "main":

    o---o---o---o---o  upstream/main
         \           \
          \           a'---b'---c'
           \                     \
            a---b---c-------------S---M  main
                     \               /
                      d-------------e

    o---o---o---o---o  upstream/main
         \           \
          \           a'---b'---c'
           \                     \
            a---b---c-------------S---d'---e'  main

> However, once you keep going this way for several rounds, would
> the result really be much better than bushy history with full of
> reverse merges from upstream?   Would it help to add new history
> simplification mechanisms and options to help visualize the
> history, or do we already have necessary support (e.g. if the
> convention for these "merge to cauterize the older versions of
> history with the newly rebased history" S and T merges is to
> record the rebased history as the first-parent, then "git log
> --first-parent upstream/main..main" should be sufficient).

Good point. I believe the proposed method has two advantages over
using "git log --first-parent":

* consider the scenario where the cauterizing merge is not the only
merge in the "main" branch - a topic branch from a coworker has also
been merged to "main". In that case, "git log --first-parent" would
not show the commits from the merged topic branch.

* a user unfamiliar with the command-line interface can use a history
visualization tool (such as gitk or tig) to obtain a clear view of the
differences between the last imported version of upstream and the tip
of "main".

> The users would benefit to have an easy way, given only T or S, to
> get range-diff among (a,b,c) and (a',b',c') and (a",b",c").

I like that idea. It would be great to have a simple git command to
display such a range-diff. The command would have to correctly
identify the commits.

--
Best regards,
Aleksander Korzynski


* Re: Beyond Merge and Rebase: The Upstream Import Approach in Git
  2023-07-12 11:34 ` Johannes Schindelin
  2023-07-12 15:37   ` Junio C Hamano
@ 2023-07-14 10:06   ` Aleksander Korzyński
  1 sibling, 0 replies; 9+ messages in thread
From: Aleksander Korzyński @ 2023-07-14 10:06 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Hi Johannes,

My responses are inline:

> I know this strategy well, having used it initially to maintain Git for
> Windows' patches on top of Git releases.

It's good to know others are using similar ideas :-)

> However, I soon realized that the delineation between upstream and
> downstream patches was unsatisfactory, in particular when new downstream
> patches are added. In the context of the example above, try to find a `git
> rebase` invocation that rebases the current set of downstream patches:
>
>    o---o---o---o---o---o---o---o  upstream/main
>         \           \
>          \           o'---o'---o'
>           \                     \
>            o---o---o-------------S---o---o---o  main

We have solved that problem with custom scripting. The git-upstream[1]
tool properly rebases the commits in that case. This is one of the
reasons why I would like to see the git-upstream functionality
reimplemented in git itself. With git today, you can't achieve that
with a single `git rebase` command, but you can with a series of
commands. Introducing a new command or switch to git would allow us to
perform that operation with a single command.

Let's recap what the automation needs to do. Assume the following situation:

  o---o---o---o---x---o---o---t  tag=v1.2.3  branch=upstream/main
       \           \
        \           a'---b'---c'
         \                     \
          a---b---c-------------S---d---e  branch=main

Let's assume the user has the "main" branch checked out and they want
to import the latest tag from the "upstream/main" branch. The commands
they run are:

git checkout main
git upstream import v1.2.3

The automation should now perform the following:

* create a new branch "import/v1.2.3" starting from tag "v1.2.3"
* rebase a', b', c' onto "import/v1.2.3"
* rebase d, e onto "import/v1.2.3"
* perform the cauterizing merge of "import/v1.2.3" to "main"

Here are important observations:

* the first rebase operates on commits present on the main branch,
starting from the first commit after x, ending with the last commit
before S
* the second rebase operates on commits present on the main branch,
starting from the first commit after S, ending with the tip of main

So the problem boils down to identifying commits x and S. Once we
identify these commits, we can perform the rebases.

To identify x we need to find the most recent common ancestor of
"main" and "v1.2.3". To identify S we need to iterate over branch
"main" starting from x and forward in time until we find the first
merge. That's the logic that needs to be implemented. If that logic
was available under a single command or switch in git, we'd be able to
perform the upstream import operation without a helper script such as
git-upstream.
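The logic above can be sketched with stock git commands. This is an
illustration under assumptions, not the git-upstream code: the
repository is a throwaway one, the "commit" helper and the "tmp" branch
are invented, commits are replayed with cherry-pick, and the rebased
series is assumed to be the second parent of S.

```shell
#!/bin/sh
# Sketch only: locate x and S, then perform the next upstream import.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/upstream/main

# Hypothetical helper: each commit adds one file named after the commit.
commit() { echo "$1" >"$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

# Build the starting situation: a first import at x, then d and e.
commit u1; commit u2
git checkout -q -b main
commit a; commit b; commit c
git checkout -q upstream/main
commit u3; commit x
git checkout -q -B tmp main
git rebase -q upstream/main
first=$(git commit-tree -p main -p tmp -m S "$(git rev-parse 'tmp^{tree}')")
git checkout -q main
git merge -q --ff-only "$first"
git branch -q -D tmp
commit d; commit e
git checkout -q upstream/main
commit u5; commit t
git tag v1.2.3
git checkout -q main

# x: the most recent common ancestor of main and the tag.
x=$(git merge-base main v1.2.3)
# S: the oldest merge on main that descends from x.
S=$(git rev-list --ancestry-path --merges --topo-order --reverse \
      "$x..main" | head -n 1)

# Replay a', b', c' (after x, up to the rebased parent of S), then d, e.
git checkout -q -b import/v1.2.3 v1.2.3
git cherry-pick "$x..$S^2" >/dev/null
git cherry-pick "$S..main" >/dev/null

# Cauterizing merge: tree of the import branch, parents main and import.
new=$(git commit-tree -p main -p import/v1.2.3 -m "Import v1.2.3" \
        "$(git rev-parse 'import/v1.2.3^{tree}')")
git checkout -q main
git merge -q --ff-only "$new"
git log --oneline v1.2.3..import/v1.2.3
```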

> This strategy is not without problems, though, which becomes quite clear
> when you accept PRs that are based on commits prior to the most recent
> merging rebase (or rebasing merge, both strategies suffer from the same
> problem): the _next_ merging rebase will not necessarily find the most
> appropriate base commit, in particular when rebasing with
> `--rebase-merges`, causing unnecessary merge conflicts.

This can also be solved with custom logic. Let's consider the scenario
in detail:

  o---o---o---o---x---o---o---t  tag=v1.2.3  branch=upstream/main
       \           \
        \           a'---b'
         \                \
          a---b------------S---c---M---f  branch=main
               \                  /
                d----------------e  branch=topic

As before, the user runs the following commands:

git checkout main
git upstream import v1.2.3

In this case, the automation should rebase d and e between c and f:

  o---o---o---o---x---o---o---t  tag=v1.2.3  branch=upstream/main
       \           \           \
        \           a'---b'     a"--b"--c"--d"--e"--f"
         \                \                          \
          a---b------------S---c---M---f--------------S'  branch=main
               \                  /
                d----------------e  branch=topic

This logic can be implemented as follows. When the automation reaches
the merge commit M, it finds the second parent e and then searches for
the most recent common ancestor of e and the first parent of M (the
state of main just before the merge), so that it finds b. The rebase
then operates on commits starting from the first commit after b and
ending with the second parent of M.
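The lookup of b can be sketched with "git merge-base" applied to the
two parents of M. This is an illustration under assumptions: the
repository is a throwaway one and the "commit" helper and "tmp" branch
are invented. Note that asking for the merge base of e and the current
tip of "main" would return e itself, because e is already merged, so
the sketch uses the first parent of M instead.

```shell
#!/bin/sh
# Sketch only: find where a merged topic branch forked from main.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/upstream/main

# Hypothetical helper: each commit adds one file named after the commit.
commit() { echo "$1" >"$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit u1; commit u2
git checkout -q -b main
commit a; commit b
git checkout -q -b topic
commit d; commit e
git checkout -q upstream/main
commit u3; commit x

# First import: rebase a, b onto x and cauterize with S.
git checkout -q -B tmp main
git rebase -q upstream/main
S=$(git commit-tree -p main -p tmp -m S "$(git rev-parse 'tmp^{tree}')")
git checkout -q main
git merge -q --ff-only "$S"

# Work after the import: c, the merge M of the topic branch, then f.
commit c
git merge -q --no-ff -m M topic
commit f

# M is the only merge made after S in this example.
M=$(git rev-list --merges "$S..main")
# Fork point of the topic: merge base of the two parents of M.
fork=$(git merge-base "$M^1" "$M^2")
topic_commits=$(git log --reverse --format=%s "$fork..$M^2")
echo "$topic_commits"
```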

The logic above could also be incorporated into git.

> The underlying problem is, of course, the lack of mapping between
> pre-rebase and post-rebase versions of the commits: Git has no idea
> that two commits should be considered identical for the purposes of the
> rebase, even if their SHA-1 differs. And in my hands, the patch ID has
> been a poor tool to address this lack of mapping, almost always failing
> for me. Not even hacked-up `git range-diff` was able to reconstruct the
> mapping reliably enough.
>
> And that problem, as far as I can tell, is still unsolved.

As shown above, we don't actually need to be able to map pre-rebase
and post-rebase versions of the commits in order to correctly perform
the "git upstream import" operation. The "git-upstream" helper script
is a working implementation of the strategy that does no such mapping.

That being said, being able to map pre-rebase and post-rebase versions
of the commits is useful for something else: dropping patches that
have been incorporated upstream. The "git-upstream" script utilizes
two strategies for that purpose. One of them is to use patch-id. The
other one is to use an arbitrary identifier that you attach to the
commit both in the "main" and "upstream/main" branches. In our case,
we have used Gerrit's Change-Id as the identifier, but it could be
something else. Gerrit's Change-Id is just a random string added to
the bottom of a commit message by a git commit-msg hook.
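
The identifier-based strategy could look roughly like the sketch below.
It is not taken from git-upstream: the function names and the sample
commit messages are hypothetical, and it assumes the usual Change-Id
trailer shape of "I" followed by 40 hex digits. A local commit is
dropped when its Change-Id also appears on a commit upstream.

```python
import re

# Assumed trailer shape: "Change-Id: I" followed by 40 hex digits.
CHANGE_ID = re.compile(r"^Change-Id: (I[0-9a-f]{40})$", re.MULTILINE)


def change_id(message):
    """Return the last Change-Id trailer in a commit message, or None."""
    found = CHANGE_ID.findall(message)
    return found[-1] if found else None


def drop_incorporated(local_messages, upstream_messages):
    """Keep only local commits whose Change-Id does not appear upstream."""
    upstream_ids = {change_id(m) for m in upstream_messages} - {None}
    return [m for m in local_messages if change_id(m) not in upstream_ids]


# Made-up identifiers and messages for the example.
ID_FIX = "I" + "1" * 40
ID_FEATURE = "I" + "2" * 40

local = [
    f"Fix quota check\n\nChange-Id: {ID_FIX}\n",
    f"Add internal feature\n\nChange-Id: {ID_FEATURE}\n",
]
upstream = [f"Fix quota check (reworded upstream)\n\nChange-Id: {ID_FIX}\n"]

remaining = drop_incorporated(local, upstream)
print([m.splitlines()[0] for m in remaining])  # ['Add internal feature']
```

Unlike patch-id, this match survives upstream rewording or reworking
the patch, as long as the trailer is preserved.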

> So I switched to a different scheme instead that I dub "merging rebase".
> Instead of finishing the rebase with a merge, I start it with that merge.
> In your example, it would look like this:
>
>    o---o---o---o---o  upstream/main
>         \           \
>          o---o---o---M---o'---o'---o' main

I like Junio's word "cauterize" to describe the special merge :-) So
I'm going to call this strategy "cauterize & rebase" and the strategy
I described in the initial email "rebase & cauterize".

We also considered "cauterize & rebase" instead of "rebase &
cauterize"; we opted for the latter because of peer review in
Gerrit. When we rebase first, we can store the rebased commits on a
temporary import branch and push the import branch to a shared
repository. The import branch then contains everything except for the
last cauterizing merge. We then need to push only the cauterizing
merge into the Gerrit review system. The reviewer then only has to
approve the cauterizing merge to approve the entire "upstream import"
structure. We didn't need to make any changes to the Gerrit review
system to utilize it in that way. These considerations may not apply
to other review systems.

> This strategy was implemented initially in
> https://github.com/msysgit/msysgit/commit/95ae63b8c6c0b275f460897c15a44a7df5246dfb
> and is in use to this day:
> https://github.com/git-for-windows/build-extra/blob/main/shears.sh
> (...)
> https://lore.kernel.org/git/pull.1356.v2.git.1664981957.gitgitgadget@gmail.com/

Thanks for the links, they are useful :-)

With the content of this email in mind, what are your thoughts? Would
you like to see the strategy become a first-class feature in git?

Best regards,
Aleksander Korzynski

[1] https://opendev.org/x/git-upstream

* Re: Beyond Merge and Rebase: The Upstream Import Approach in Git
  2023-07-12 20:27     ` Junio C Hamano
@ 2023-07-14 10:56       ` Aleksander Korzyński
  2023-08-01  9:17         ` Aleksander Korzyński
  0 siblings, 1 reply; 9+ messages in thread
From: Aleksander Korzyński @ 2023-07-14 10:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git

> it is more effective to have the original authors of these changes to
> update them to your most recent tree, than you dealing with them
> yourself, for two reasons.  There are more ICs than you alone, and
> they are more familiar with their work.
>
> In other words, isn't the real cause of the above that the workflow
> is not taking advantage of the distributed development?  "This PR
> seems to solve the right problem, but it is based on an old version
> of the code, please update?"

That's a valid point. Let me describe how it used to work.

We tracked a busy project (OpenStack Nova), which used to have as many
as 50 commits per day. At night we used to run an automated job that
would attempt to import the latest upstream (rebase and cauterize),
deploy to a test environment and test it with our configuration. It
would have been impractically slow to require the developer of every
internal patch to manually update to the latest version of the code,
before deploying and testing. Also, it wouldn't be acceptable to an
enterprise to always require the original author to rebase their
patch, because the author may be on holiday or may have left the
company, but the business has to move on.

In the morning, we used to check if the automated job returned green
or red. In practice, however, most of the time our patches would
cleanly rebase automatically without manual intervention. That was
because of the way we used to write the patches - changing as few
lines as possible. Also, patches were typically only temporary in
nature, as they were eventually contributed to the open-source
upstream project.

If the automated job returned red, there was a designated engineer who
would investigate the issue on a given day. They would try to rebase
the patches themselves and fix any issues. If they had any questions
or concerns they would contact the original author, as long as the
original author was at work. Most of the time contacting the original
author wasn't necessary.

> In Git, any commit, be it a single parent commit or a merge, makes
> this statement:
>
>     I considered all the parents of this commit, and it is my belief
>     that it suits the purpose of the branch I am growing better than
>     all of them.
> (...)
> M in the "merging rebase", however, claims that M, i.e. the recent
> upstream, fits the purpose of the branch better than the earlier
> three commits did, which is not quite right.  In contrast, rebasing
> merge does not have such a problem, i.e.
>
>     o---o---o---o---o  upstream/main
>          \           \
>           \           a'---b'---c'
>            \                     \
>             a---b---c-------------M main

I second that observation.

Any other comments? :-)

--
Best regards,
Aleksander Korzynski

* Re: Beyond Merge and Rebase: The Upstream Import Approach in Git
  2023-07-14 10:56       ` Aleksander Korzyński
@ 2023-08-01  9:17         ` Aleksander Korzyński
  0 siblings, 0 replies; 9+ messages in thread
From: Aleksander Korzyński @ 2023-08-01  9:17 UTC (permalink / raw)
  To: ak; +Cc: Junio C Hamano, Johannes Schindelin, git

Hi,

Thanks again for your responses, Junio and Johannes. I'm looking to
implement the discussed structure in git. As the first step, I'd like
to implement:

git merge -s theirs

The name of the `theirs` strategy above is inspired by the existing
`ours` strategy. The command above is going to be the equivalent of
the following three commands:

git merge --no-commit -s ours <commit>
git read-tree --reset -u <commit>
git commit --no-edit
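
To make the intended result concrete, here is a toy model of the commit
those three commands produce. It does not call git: the Commit class,
the two merge functions, and the file contents are illustrative
assumptions. The point is only that a "theirs" merge records both
parents but takes its tree wholesale from the merged-in commit, which
mirrors what `ours` does for the current branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    name: str
    tree: tuple          # stand-in for a tree: (path, content) pairs
    parents: tuple = ()


def merge_ours(ours, theirs):
    """Model of `-s ours`: both parents recorded, tree kept from our side."""
    return Commit("M", ours.tree, (ours, theirs))


def merge_theirs(ours, theirs):
    """Model of the proposed `-s theirs`: both parents, tree from their side."""
    return Commit("M", theirs.tree, (ours, theirs))


# c is the tip of main, c' the tip of the rebased copy (made-up content).
c = Commit("c", (("nova.conf", "old upstream + local patch"),))
c_rebased = Commit("c'", (("nova.conf", "new upstream + local patch"),))

weld = merge_theirs(c, c_rebased)
print(weld.tree == c_rebased.tree)              # True
print([p.name for p in weld.parents])           # ['c', "c'"]
```

Because the tree is never combined from both sides, such a merge cannot
conflict. Its first parent keeps the old history of main reachable,
while its content is exactly the rebased result.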

The new command is going to be used to create the last "welding merge"
at the end of the structure below:

     o---o---o---o---o  upstream/main
          \           \
           \           a'---b'---c'
            \                     \
             a---b---c-------------M main

The strategy above could be called "rebase & weld". The new command
can also be used with the "weld & rebase" strategy described by
Johannes:

   o---o---o---o---o  upstream/main
        \           \
         a---b---c---M---a'---b'---c' main

In addition, `git merge -s theirs` could be called `git weld`, which
would make it shorter to type. What do you think?

Also, I'm thinking about the eventual interface for creating the
entire structure. Perhaps "rebase & weld" could be created with the
following command:

git rebase --weld

and "weld & rebase" with the following:

git rebase --pre-weld

What are your thoughts?

--
Best regards,
Aleksander Korzynski

Thread overview: 9+ messages
2023-07-11  8:24 Beyond Merge and Rebase: The Upstream Import Approach in Git Aleksander Korzyński
2023-07-11 17:02 ` Junio C Hamano
2023-07-13 10:55   ` Aleksander Korzyński
2023-07-12 11:34 ` Johannes Schindelin
2023-07-12 15:37   ` Junio C Hamano
2023-07-12 20:27     ` Junio C Hamano
2023-07-14 10:56       ` Aleksander Korzyński
2023-08-01  9:17         ` Aleksander Korzyński
2023-07-14 10:06   ` Aleksander Korzyński
