git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Christian Couder <christian.couder@gmail.com>
To: Zach FettersMoore via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Zach FettersMoore <zach.fetters@apollographql.com>
Subject: Re: [PATCH v4] subtree: fix split processing with multiple subtrees present
Date: Sat, 18 Nov 2023 12:28:42 +0100	[thread overview]
Message-ID: <CAP8UFD18Hh=m8HQibAgZW1KNAn6zg_rxe9asg0ViC5z27W=Smw@mail.gmail.com> (raw)
In-Reply-To: <pull.1587.v4.git.1698347871200.gitgitgadget@gmail.com>

On Thu, Oct 26, 2023 at 9:18 PM Zach FettersMoore via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Zach FettersMoore <zach.fetters@apollographql.com>
>
> When there are multiple subtrees present in a repository and they are
> all using 'git subtree split', the 'split' command can take a
> significant (and constantly growing) amount of time to run even when
> using the '--rejoin' flag. This is due to the fact that when processing
> commits to determine the last known split to start from when looking
> for changes, if there has been a split/merge done from another subtree
> there will be 2 split commits, one mainline and one subtree, for the
> second subtree that are part of the processing. The non-mainline
> subtree split commit will cause the processing to always need to search
> the entire history of the given subtree as part of its processing even
> though those commits are totally irrelevant to the current subtree
> split being run.

Thanks for your continued work on this!

I am not familiar with git subtree so I might miss obvious things. On
the other hand, my comments might help increase a bit the number of
people who could review this patch.

> In the diagram below, 'M' represents the mainline repo branch, 'A'
> represents one subtree, and 'B' represents another. M3 and B1 represent
> a split commit for subtree B that was created from commit M4. M2 and A1
> represent a split commit made from subtree A that was also created
> based on changes back to and including M4. M1 represents new changes to
> the repo, in this scenario if you try to run a 'git subtree split
> --rejoin' for subtree B, commits M1, M2, and A1, will be included in
> the processing of changes for the new split commit since the last
> split/rejoin for subtree B was at M3. The issue is that by having A1
> included in this processing the command ends up needing to processing
> every commit down tree A even though none of that is needed or relevant
> to the current command and result.
>
> M1
>  |        \       \
> M2         |       |
>  |        A1       |
> M3         |       |
>  |         |      B1
> M4         |       |

About the above, Junio already commented the following:

-> The above paragraph explains which different things you drew in the
-> diagram are representing, but it is not clear how they relate to
-> each other.  Do they for example depict parent-child commit
-> relationship?  What are the wide gaps between these three tracks and
-> what are the short angled lines leaning to the left near the tip?
-> Is the time/topology flowing from bottom to top?

and it doesn't look like you have addressed that comment.

When you say "M3 and B1 represent a split commit for subtree B that
was created from commit M4." I am not sure what it means exactly.
Could you give example commands that could have created the M3 and B1
commits?

> So this commit makes a change to the processing of commits for the split
> command in order to ignore non-mainline commits from other subtrees such
> as A1 in the diagram by adding a new function
> 'should_ignore_subtree_commit' which is called during
> 'process_split_commit'. This allows the split/rejoin processing to still
> function as expected but removes all of the unnecessary processing that
> takes place currently which greatly inflates the processing time.

Could you tell a bit more what kind of processing time reduction is or
would be possible on what kind of repo? Have you benchmark-ed or just
timed this somehow on one of your repos or better on an open source
repo (so that we could reproduce if we wanted)?

> Added a test to validate that the proposed fix
> solves the issue.
>
> The test accomplishes this by checking the output
> of the split command to ensure the output from
> the progress of 'process_split_commit' function
> that represents the 'extracount' of commits
> processed does not increment.

Does not increment compared to what?

> This was tested against the original functionality
> to show the test failed, and then with this fix
> to show the test passes.
>
> This illustrated that when using multiple subtrees,
> A and B, when doing a split on subtree B, the
> processing does not traverse the entire history
> of subtree A which is unnecessary and would cause
> the 'extracount' of processed commits to climb
> based on the number of commits in the history of
> subtree A.

Does this mean that the test checks that the extracount is the same
when subtree A exists as when it doesn't exist?

[...]

>  contrib/subtree/git-subtree.sh     | 29 ++++++++++++++++++++-
>  contrib/subtree/t/t7900-subtree.sh | 42 ++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+), 1 deletion(-)
>
> diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
> index e0c5d3b0de6..e69991a9d80 100755
> --- a/contrib/subtree/git-subtree.sh
> +++ b/contrib/subtree/git-subtree.sh
> @@ -778,6 +778,20 @@ ensure_valid_ref_format () {
>                 die "fatal: '$1' does not look like a ref"
>  }
>
> +# Usage: check if a commit from another subtree should be
> +# ignored from processing for splits
> +should_ignore_subtree_split_commit () {

Maybe adding:

    assert test $# = 1
    local rev="$1"

here, and using $rev instead of $1 in this function could make things
a bit clearer and similar to what is done in other functions.

> +  if test -n "$(git log -1 --grep="git-subtree-dir:" $1)"
> +  then
> +    if test -z "$(git log -1 --grep="git-subtree-mainline:" $1)" &&
> +                       test -z "$(git log -1 --grep="git-subtree-dir: $arg_prefix$" $1)"
> +    then
> +      return 0
> +    fi
> +  fi
> +  return 1
> +}

The above doesn't seem to be properly indented. We use tabs not spaces.

>  # Usage: process_split_commit REV PARENTS
>  process_split_commit () {
>         assert test $# = 2
> @@ -963,7 +977,20 @@ cmd_split () {
>         eval "$grl" |
>         while read rev parents
>         do
> -               process_split_commit "$rev" "$parents"
> +               if should_ignore_subtree_split_commit "$rev"
> +               then
> +                       continue
> +               fi
> +               parsedParents=''

It seems to me that we name variables "parsed_parents" (or sometimes
"parsedparents") rather than "parsedParents".

> +               for parent in $parents
> +               do
> +                       should_ignore_subtree_split_commit "$parent"
> +                       if test $? -eq 1

I think the 2 lines above could be replaced by:

+                       if ! should_ignore_subtree_split_commit "$parent"

> +                       then
> +                               parsedParents+="$parent "

It doesn't seem to me that we use "+=" much in our shell scripts.
https://www.shellcheck.net/ emits the following:

(warning): In POSIX sh, += is undefined.

so I guess we don't use it because it's not available in some usual shells.

(I haven't checked the script with https://www.shellcheck.net/ before
and after your patch, but it might help avoid bash-isms and such
issues.)

> +                       fi
> +               done
> +               process_split_commit "$rev" "$parsedParents"
>         done || exit $?

It looks like we use "exit $?" a lot in git-subtree.sh while we use
just "exit" most often elsewhere. Not sure why.

>         latest_new=$(cache_get latest_new) || exit $?
> diff --git a/contrib/subtree/t/t7900-subtree.sh b/contrib/subtree/t/t7900-subtree.sh
> index 49a21dd7c9c..87d59afd761 100755
> --- a/contrib/subtree/t/t7900-subtree.sh
> +++ b/contrib/subtree/t/t7900-subtree.sh
> @@ -385,6 +385,48 @@ test_expect_success 'split sub dir/ with --rejoin' '
>         )
>  '
>
> +test_expect_success 'split with multiple subtrees' '
> +       subtree_test_create_repo "$test_count" &&
> +       subtree_test_create_repo "$test_count/subA" &&
> +       subtree_test_create_repo "$test_count/subB" &&
> +       test_create_commit "$test_count" main1 &&
> +       test_create_commit "$test_count/subA" subA1 &&
> +       test_create_commit "$test_count/subA" subA2 &&
> +       test_create_commit "$test_count/subA" subA3 &&
> +       test_create_commit "$test_count/subB" subB1 &&
> +       (
> +               cd "$test_count" &&
> +               git fetch ./subA HEAD &&
> +               git subtree add --prefix=subADir FETCH_HEAD
> +       ) &&
> +       (
> +               cd "$test_count" &&
> +               git fetch ./subB HEAD &&
> +               git subtree add --prefix=subBDir FETCH_HEAD
> +       ) &&
> +       test_create_commit "$test_count" subADir/main-subA1 &&
> +       test_create_commit "$test_count" subBDir/main-subB1 &&
> +       (
> +               cd "$test_count" &&
> +               git subtree split --prefix=subADir --squash --rejoin -m "Sub A Split 1"
> +       ) &&

Not sure why there are so many sub-shells used, and why the -C option
is not used instead to tell Git to work in a subdirectory. I guess you
copied what most existing (old) tests in this test script do.

For example perhaps the 4 line above could be replaced by just:

+               git -C "$test_count" subtree split --prefix=subADir
--squash --rejoin -m "Sub A Split 1" &&

> +       (
> +               cd "$test_count" &&
> +               git subtree split --prefix=subBDir --squash --rejoin -m "Sub B Split 1"
> +       ) &&
> +       test_create_commit "$test_count" subADir/main-subA2 &&
> +       test_create_commit "$test_count" subBDir/main-subB2 &&
> +       (
> +               cd "$test_count" &&
> +               git subtree split --prefix=subADir --squash --rejoin -m "Sub A Split 2"
> +       ) &&
> +       (
> +               cd "$test_count" &&
> +               test "$(git subtree split --prefix=subBDir --squash --rejoin \
> +                -d -m "Sub B Split 1" 2>&1 | grep -w "\[1\]")" = ""
> +       )
> +'

It's not clear to me what the test is doing. Maybe you could split it
into 2 tests. Perhaps one setting up a repo with multiple subtrees and
one checking that a new split ignores other subtree split commits.
Perhaps adding a few comments would help too.

Best,
Christian.

  reply	other threads:[~2023-11-18 11:28 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-18 20:05 [PATCH] subtree: fix split processing with multiple subtrees present Zach FettersMoore via GitGitGadget
2023-09-18 23:31 ` Junio C Hamano
2023-09-19  1:04 ` Junio C Hamano
2023-10-26 19:59   ` Zach FettersMoore
2023-09-22 16:25 ` [PATCH v2 0/2] " Zach FettersMoore via GitGitGadget
2023-09-22 16:25   ` [PATCH v2 1/2] " Zach FettersMoore via GitGitGadget
2023-09-22 16:25   ` [PATCH v2 2/2] subtree: changing location of commit ignore processing Zach FettersMoore via GitGitGadget
2023-09-29 20:32   ` [PATCH v3 0/3] subtree: fix split processing with multiple subtrees present Zach FettersMoore via GitGitGadget
2023-09-29 20:32     ` [PATCH v3 1/3] " Zach FettersMoore via GitGitGadget
2023-09-29 20:32     ` [PATCH v3 2/3] subtree: changing location of commit ignore processing Zach FettersMoore via GitGitGadget
2023-09-29 20:33     ` [PATCH v3 3/3] subtree: adding test to validate fix Zach FettersMoore via GitGitGadget
2023-10-17 20:02       ` Zach FettersMoore
2023-10-26 19:17     ` [PATCH v4] subtree: fix split processing with multiple subtrees present Zach FettersMoore via GitGitGadget
2023-11-18 11:28       ` Christian Couder [this message]
2023-11-28 21:04         ` Zach FettersMoore
2023-11-28 21:17       ` [PATCH v5] " Zach FettersMoore via GitGitGadget
2023-11-30 20:33         ` Christian Couder
2023-11-30 21:01           ` Zach FettersMoore
2023-12-01 14:54         ` [PATCH v6] " Zach FettersMoore via GitGitGadget
2023-12-04 11:08           ` Christian Couder
2023-12-11 15:39             ` Zach FettersMoore
2023-12-12 16:06               ` Christian Couder
2023-12-12 22:28                 ` Junio C Hamano
2023-12-13 15:20                   ` Zach FettersMoore
2024-01-03 16:33                     ` Christian Couder
2023-12-20 15:25                 ` Christian Couder
2024-01-25 10:09                   ` Christian Couder
2024-01-25 16:38                     ` Junio C Hamano
2024-01-25 18:52                       ` Christian Couder
2024-01-25 18:56                         ` Junio C Hamano
2025-08-21  3:13           ` subtree: [v2.44 regression] split may produce different history Colin Stagner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAP8UFD18Hh=m8HQibAgZW1KNAn6zg_rxe9asg0ViC5z27W=Smw@mail.gmail.com' \
    --to=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=zach.fetters@apollographql.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).