Git development
 help / color / mirror / Atom feed
* [PATCH v4 1/4] rebase: fully ignore rebase.autoSquash without -i
From: Andy Koppe @ 2023-11-11 13:27 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, Andy Koppe
In-Reply-To: <20231111132720.78877-1-andy.koppe@gmail.com>

Setting the rebase.autoSquash config variable to true implies a couple
of restrictions: it prevents preemptive fast-forwarding and it triggers
conflicts with amend backend options. However, it only actually results
in auto-squashing when combined with the --interactive (or -i) option,
due to code in run_specific_rebase() that disables auto-squashing unless
the REBASE_INTERACTIVE_EXPLICIT flag is set.

Doing autosquashing for rebase.autoSquash without --interactive would be
problematic in terms of backward compatibility, but conversely, there is
no need for the aforementioned restrictions without --interactive.

So drop the options.config_autosquash check from the conditions for
clearing allow_preemptive_ff, as the case where it is combined with
--interactive is already covered by the REBASE_INTERACTIVE_EXPLICIT
flag check above it.

Also drop the "apply options are incompatible with rebase.autoSquash"
error, because it is unreachable if it is restricted to --interactive,
as apply options already cause an error when used with --interactive.
Drop the tests for the error from t3422-rebase-incompatible-options.sh,
which has separate tests for the conflicts of --interactive with apply
options.

When neither --autosquash nor --no-autosquash are given, only set
options.autosquash to true if rebase.autosquash is combined with
--interactive.

Don't initialize options.config_autosquash to -1, as there is no need to
distinguish between rebase.autoSquash being unset or explicitly set to
false.

Finally, amend the rebase.autoSquash documentation to say it only
affects interactive mode.

Signed-off-by: Andy Koppe <andy.koppe@gmail.com>
---
 Documentation/config/rebase.txt        |  4 +++-
 builtin/rebase.c                       | 13 ++++++-------
 t/t3422-rebase-incompatible-options.sh | 12 ------------
 3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/Documentation/config/rebase.txt b/Documentation/config/rebase.txt
index 9c248accec..d59576dbb2 100644
--- a/Documentation/config/rebase.txt
+++ b/Documentation/config/rebase.txt
@@ -9,7 +9,9 @@ rebase.stat::
 	rebase. False by default.
 
 rebase.autoSquash::
-	If set to true enable `--autosquash` option by default.
+	If set to true, enable the `--autosquash` option of
+	linkgit:git-rebase[1] by default for interactive mode.
+	This can be overridden with the `--no-autosquash` option.
 
 rebase.autoStash::
 	When set to true, automatically create a temporary stash entry
diff --git a/builtin/rebase.c b/builtin/rebase.c
index 043c65dccd..a73de7892b 100644
--- a/builtin/rebase.c
+++ b/builtin/rebase.c
@@ -149,7 +149,6 @@ struct rebase_options {
 		.reapply_cherry_picks = -1,             \
 		.allow_empty_message = 1,               \
 		.autosquash = -1,                       \
-		.config_autosquash = -1,                \
 		.rebase_merges = -1,                    \
 		.config_rebase_merges = -1,             \
 		.update_refs = -1,                      \
@@ -1405,7 +1404,6 @@ int cmd_rebase(int argc, const char **argv, const char *prefix)
 	if ((options.flags & REBASE_INTERACTIVE_EXPLICIT) ||
 	    (options.action != ACTION_NONE) ||
 	    (options.exec.nr > 0) ||
-	    (options.autosquash == -1 && options.config_autosquash == 1) ||
 	    options.autosquash == 1) {
 		allow_preemptive_ff = 0;
 	}
@@ -1508,8 +1506,6 @@ int cmd_rebase(int argc, const char **argv, const char *prefix)
 			if (is_merge(&options))
 				die(_("apply options and merge options "
 					  "cannot be used together"));
-			else if (options.autosquash == -1 && options.config_autosquash == 1)
-				die(_("apply options are incompatible with rebase.autoSquash.  Consider adding --no-autosquash"));
 			else if (options.rebase_merges == -1 && options.config_rebase_merges == 1)
 				die(_("apply options are incompatible with rebase.rebaseMerges.  Consider adding --no-rebase-merges"));
 			else if (options.update_refs == -1 && options.config_update_refs == 1)
@@ -1529,10 +1525,13 @@ int cmd_rebase(int argc, const char **argv, const char *prefix)
 	options.rebase_merges = (options.rebase_merges >= 0) ? options.rebase_merges :
 				((options.config_rebase_merges >= 0) ? options.config_rebase_merges : 0);
 
-	if (options.autosquash == 1)
+	if (options.autosquash == 1) {
 		imply_merge(&options, "--autosquash");
-	options.autosquash = (options.autosquash >= 0) ? options.autosquash :
-			     ((options.config_autosquash >= 0) ? options.config_autosquash : 0);
+	} else if (options.autosquash == -1) {
+		options.autosquash =
+			options.config_autosquash &&
+			(options.flags & REBASE_INTERACTIVE_EXPLICIT);
+	}
 
 	if (options.type == REBASE_UNSPECIFIED) {
 		if (!strcmp(options.default_backend, "merge"))
diff --git a/t/t3422-rebase-incompatible-options.sh b/t/t3422-rebase-incompatible-options.sh
index 2eba00bdf5..b40f26250b 100755
--- a/t/t3422-rebase-incompatible-options.sh
+++ b/t/t3422-rebase-incompatible-options.sh
@@ -100,12 +100,6 @@ test_rebase_am_only () {
 		test_must_fail git rebase $opt --root A
 	"
 
-	test_expect_success "$opt incompatible with rebase.autosquash" "
-		git checkout B^0 &&
-		test_must_fail git -c rebase.autosquash=true rebase $opt A 2>err &&
-		grep -e --no-autosquash err
-	"
-
 	test_expect_success "$opt incompatible with rebase.rebaseMerges" "
 		git checkout B^0 &&
 		test_must_fail git -c rebase.rebaseMerges=true rebase $opt A 2>err &&
@@ -118,12 +112,6 @@ test_rebase_am_only () {
 		grep -e --no-update-refs err
 	"
 
-	test_expect_success "$opt okay with overridden rebase.autosquash" "
-		test_when_finished \"git reset --hard B^0\" &&
-		git checkout B^0 &&
-		git -c rebase.autosquash=true rebase --no-autosquash $opt A
-	"
-
 	test_expect_success "$opt okay with overridden rebase.rebaseMerges" "
 		test_when_finished \"git reset --hard B^0\" &&
 		git checkout B^0 &&
-- 
2.43.0-rc1


^ permalink raw reply related

* [PATCH v4 2/4] rebase: support --autosquash without -i
From: Andy Koppe @ 2023-11-11 13:27 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, Andy Koppe
In-Reply-To: <20231111132720.78877-1-andy.koppe@gmail.com>

The --autosquash option prevents preemptive fast-forwarding and triggers
conflicts with amend backend options, yet it only actually performs
auto-squashing when combined with the --interactive (or -i) option.

Remove the latter restriction and tweak the --autosquash description
accordingly.

Signed-off-by: Andy Koppe <andy.koppe@gmail.com>
---
 Documentation/git-rebase.txt | 2 +-
 builtin/rebase.c             | 4 +---
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index b4526ca246..10548e715c 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -592,7 +592,7 @@ See also INCOMPATIBLE OPTIONS below.
 	When the commit log message begins with "squash! ..." or "fixup! ..."
 	or "amend! ...", and there is already a commit in the todo list that
 	matches the same `...`, automatically modify the todo list of
-	`rebase -i`, so that the commit marked for squashing comes right after
+	`rebase`, so that the commit marked for squashing comes right after
 	the commit to be modified, and change the action of the moved commit
 	from `pick` to `squash` or `fixup` or `fixup -C` respectively. A commit
 	matches the `...` if the commit subject matches, or if the `...` refers
diff --git a/builtin/rebase.c b/builtin/rebase.c
index a73de7892b..9f8192e0a5 100644
--- a/builtin/rebase.c
+++ b/builtin/rebase.c
@@ -710,10 +710,8 @@ static int run_specific_rebase(struct rebase_options *opts)
 	if (opts->type == REBASE_MERGE) {
 		/* Run sequencer-based rebase */
 		setenv("GIT_CHERRY_PICK_HELP", resolvemsg, 1);
-		if (!(opts->flags & REBASE_INTERACTIVE_EXPLICIT)) {
+		if (!(opts->flags & REBASE_INTERACTIVE_EXPLICIT))
 			setenv("GIT_SEQUENCE_EDITOR", ":", 1);
-			opts->autosquash = 0;
-		}
 		if (opts->gpg_sign_opt) {
 			/* remove the leading "-S" */
 			char *tmp = xstrdup(opts->gpg_sign_opt + 2);
-- 
2.43.0-rc1


^ permalink raw reply related

* [PATCH v4 0/4] rebase: support --autosquash without -i
From: Andy Koppe @ 2023-11-11 13:27 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, Andy Koppe
In-Reply-To: <20231105000808.10171-1-andy.koppe@gmail.com>

Make rebase --autosquash work without --interactive, but limit
rebase.autoSquash's effects to interactive mode, and improve testing
and documentation.

Changes from v3:
- Separate commits for rebase.autoSquash, --autosquash and testing.
- Go back to interpreting rebase.autoSquash as a boolean, but document
  it as only affecting interactive mode.
- In the --autosquash documentation, bring back partial matching as a
  fallback and show what todo list command each squash marker
  corresponds to.

Thanks very much to Junio and Phillip for their reviews. I made plenty
of use of non-interactive autosquash in preparing this version. :)

Andy Koppe (4):
  rebase: fully ignore rebase.autoSquash without -i
  rebase: support --autosquash without -i
  rebase: test autosquash with and without -i
  rebase: rewrite --(no-)autosquash documentation

 Documentation/config/rebase.txt        |  4 ++-
 Documentation/git-rebase.txt           | 34 +++++++++++++----------
 builtin/rebase.c                       | 17 +++++-------
 t/t3415-rebase-autosquash.sh           | 38 +++++++++++++++++++-------
 t/t3422-rebase-incompatible-options.sh | 12 --------
 5 files changed, 58 insertions(+), 47 deletions(-)

-- 
2.43.0-rc1


^ permalink raw reply

* Re: [RFC PATCH] status: avoid reporting worktrees as "Untracked files"
From: Edmundo Carmona Antoranz @ 2023-11-11  9:22 UTC (permalink / raw)
  To: Junio C Hamano, Eric Sunshine; +Cc: git
In-Reply-To: <xmqqjzqygg3i.fsf@gitster.g>

Hey, guys! Thanks Junio and Eric for sharing your thoughts.

And I candidly thought this was going to be an "easy sell".... :-D

On Sat, Nov 4, 2023 at 7:58 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> What problem are you trying to solve?  "git add foo" where "foo" is
> actually a different worktree of the repository would add it as a
> submodule that causes confusion?  If that is the case, I think the
> right solution is not to get into such a state, i.e. not create a
> worktree of the repository inside a different worktree in the first
> place.
>

I am not against the idea of creating worktrees outside of the
repository... however, I like them to be _inside_ the repository. Am I
the only one? IDK. I might be! It feels completely natural, if you ask
me.... but that's just my opinion, I acknowledge that.

While I was running a couple of quick tests to add more information
about git behaviour with "my use case" I think I found something to
work on so more RFCs might be on the way in the next few days or
weeks.

About adding an error message when 'git add' will skip doing something
because it is working on a different worktree, I think it makes
sense.... will probably work on that too.

Thanks again! BR!

^ permalink raw reply

* Re: first-class conflicts?
From: Sandra Snan @ 2023-11-11  7:48 UTC (permalink / raw)
  To: git
In-Reply-To: <xmqqh6ltrs6t.fsf@gitster.g>

Junio C Hamano <gitster@pobox.com> writes:
> Correct but with a caveat: it is too easy for lazy folks to 
> circumvent the safety by mistake with "commit -a". 

Lazy and ignorant like myself because I didn't know -a was that
dangerous. Thank you both!

^ permalink raw reply

* Re: Git Rename Detection Bug
From: Elijah Newren @ 2023-11-11  5:46 UTC (permalink / raw)
  To: Jeremy Pridmore; +Cc: git@vger.kernel.org, Paul Baumgartner
In-Reply-To: <LO6P265MB6736F5F9E8368A9DE95D294FFAA9A@LO6P265MB6736.GBRP265.PROD.OUTLOOK.COM>

Hi Jeremy,

On Fri, Nov 10, 2023 at 3:28 AM Jeremy Pridmore <jpridmore@rdt.co.uk> wrote:
>
> Hi Elijah,
>
> Many thanks for your reply, the detail is much appreciated.  I was aware, from recently read articles, that git doesn't record renames as such, hence my investigations into the rename detection, but I also found some interesting points in your email, such as the "git status --no-renames" flag.

The fact that you were trying to "undo" renames and "redo the correct
ones" suggested there's something you still didn't understand about
rename detection, though.  The fact that you were worried that "git
status" showing the "wrong renames" and the implication that it needed
to show the right ones before you committed also suggested there's
something that was still not being understood.  Likewise, the fact
that the renames reported by "git status" change even when you haven't
renamed files further but have simply made additional changes to the
contents of some files suggests there is something that was still not
being understood about the phrase that "renames are detected rather
than recorded".

While renames are used in the merge algorithm in order to know what
files to match up for three-way content merges (so that changes from
both sides to the "same" file can be incorporated into the end
result), once the merge stops to ask you to resolve conflicts, the
detection of renames doesn't matter beyond that point.  The fact that
renames aren't recorded means whatever renames are shown is only there
as a guide to help you understand.  All that matters to Git is that
all files have the intended content.  If some of the files have the
wrong content, then by all means go and correct it.  If "git mv"
commands help you do that, great.  If simply editing all files of
interest (including adding and deleting files) until they match the
expected contents works, that's fine too.  Once all files have the
correct content, commit it.  Git will have no way whatsoever of
knowing which of those two routes you picked, and won't behave any
differently in the future based on which way you ended up with the
right contents in each file.  You could have even done the extreme of
"git merge -s ours --no-commit ${OTHER_BRANCH}" (merge the other
branch but completely ignore *every* change the other side made and
then stop for user to make further changes), followed by opening and
editing every relevant file to include changes made by the other side,
deleting and adding files as relevant, to end up with the right
contents in all files and then committed, and Git wouldn't know the
difference and no one who pulled your merge commit would be able to
tell the difference either.

> I think the issue I'm encountering is described by what you say here:
> "Exact renames are detected first, before any other method of rename
> detection is employed, and other than giving a preference to files
> with the same basename, if there are multiple choices it just picks
> one (what I'd call at random, though technically based on what the
> internal processing order happens to be)"
>
> That is close to the behaviour I'm seeing.  As I mentioned, git seems to think a file has been deleted and then as it continues to detect renames, it's as if it is going through lists of "Local-Base" and "Base-Remote" changes trying to match them up, but the directories of the files being matched are "offset", as highlighted by this list of mismatches:
>
> (I'd put the paths in a table for easier analysis, but for some reason the emails need to be plain text)
> > Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> > Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> > Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> > Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> > Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> > Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> > Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
>
> Given git compares both the content and the directory\filenames,

For exact renames, it only will look at filenames if there are
multiple 100% matches.  If one match is 100%, and the other 99.9999%
match, the 100% match is taken.

(Since we think that exact renames seem to be describing your problem,
there's no point discussing the inexact rename handling.)

> and as the repositories have unrelated histories,

What do you mean by this?  If there's no common point in history (i.e.
no merge base), rename detection doesn't even get invoked, so you must
mean something different by this phrase than what I would normally
take it to mean.

Any chance, if you're still in the middle of the merge, that you could run
   git merge-base --all HEAD MERGE_HEAD
and report the output?  It'll be commit hashes that don't mean a lot
to anyone who doesn't have your repository, but I'm interested in how
many commit hashes it responds with (0, 1, or more than 1).

> the "Base" file is going to be empty, therefore, even if Local and Remote are identical, they are both 100% different to Base.

I assume by "Base" you are referring to the commit in history common
to both branches being merged (i.e. what we call the "merge base"), or
to the relevant file from that commit.  Is that right?

Even if I'm right about that, this comment has me lost.  Could you
clarify it, with one particular example?  For example (I'm making
stuff up since I'm not familiar with your repo, but showing how to
clarify for a given set of path(s)):

   * In this repository, with the renames detected above,
Landscape/Main/somefile.licx is an empty file in the merge base.
   * On my local side, I renamed Landscape/Main/somefile.licx to
Landscape/src/Main/somefile.licx and populated it with some content
(with hash A).  There is also another new file on the local side (at
least new relative to the merge base) named
Landscape/src/Other/somefile.licx that happens to have hash A.
   * On the remote side, Landscape/Main/somefile.licx was left in
place but populated with some content (with hash A).
   * Git is detecting the rename as Landscape/Main/somefile.licx ->
Landscape/src/Other/somefile.licx, when I wanted it to detect a rename
to Landscape/src/Main/somefile.licx.

I'm pretty sure this example is not what you're seeing, even if
components of it are, because the empty file thing is impossible with
the rest of the story.

>  That given, I'm not sure why git would state that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx and Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 are a "both added" conflict given their file names and paths are completely different.  Any ideas?

The fact that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
is marked as "both added" means that this file
   * did not exist in the merge base
   * did exist on your local side
   * did exist on the remote side
   * the version of this file on the local and remote sides do not match
Basically, both sides added a new file and they are not the same, so
you have a conflict.

The answer for Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1,
is an identical set of bullet points; it was a file added by both
sides of history that did not exist in the merge base, and the sides
are different so you have a conflict in this file too.

> I wrote a script to resolve the conflicts best I can which categorises the files into sets according to the file status (i.e. "added by them", "added by us" etc), and then either does a "git checkout head --
<file>" or a "git rm <file>" based upon which set the file is in and
whether it is in another set or not.  This has worked really well and
helped me through the large changeset with 3k conflicts.

So, you're simply throwing away the changes made by the remote side?
I mean, that's one way to merge, and it might be right in your case,
but to someone unfamiliar with your repo it smells like a hack to just
ignore conflicts and throw away other people's changes in order to
complete the merge.

> As git only needs to try and match files in the "deleted by us" and "deleted by them" sets (although including the "deleted in both" set would allow matching renames/moves on both sides),

"deleted by us" and "deleted by them" means no rename was detected for
the file (at least on the side that the delete is reported for).  So,
a "deleted in both" only happens when neither side detects a rename,
and if the file isn't renamed on either side and both removed it, then
there's no conflict -- just delete the file.

>  an idea for a potential improvement to the matching algorithm (where you say there's a comment "too many alternatives, pick one") could be to compute a "difference value" for the path\filename of those files in one of the other sets (i.e. "added by us", "added by them" or "added in both"), and chose a potential rename based upon the smallest calculated difference.  The difference value would be the number of differences in folder names, e.g.
>
> deleted in both: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
>
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> (path\name difference = 2)
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> (path\name difference = 1)
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> (path\name difference = 2)
>
> So, given the above, git would chose the second "added in both" entry.

Three problems here:

* If git were to report "deleted in both" for one path and "added in
both" in another as you suggest, that would only be because the files
are dissimilar.  Not only would they not be an exact match, their
contents would have less than 50% similarity.  Thus augmenting exact
matching like this just wouldn't work, because the files aren't
matches at all.  The only correct thing to do would be to not report
the "deleted in both" because that file is not a rename of anything
else and has simply been deleted by both sides.

* filename similarity is extraordinarily expensive compared to exact
renames, and if not carefully handled, can sometimes rival the cost of
file content similarity computations given our spanhash
representations.  Exact renames are tasked with finding renames even
if they are known to not be relevant, simply because exact renames can
do so very quickly.  If we change that, we throw a monkey wrench in
our performance handling elsewhere and have to rethink a number of
other things.

* While I was optimizing rename detection while investigating the new
merge backend, I actually attempted a few versions of filename
similarity looking for something that was predictive and useful.
While I think the idea was potentially helpful for some repositories,
it has a significant risk of hurting merges in other repositories.
While what I tried was far from an exhaustive checking of all filename
similarity ideas, I came away doubting there was a useful heuristic
other than exact matches of basenames (i.e. exact matches of
everything in the filename after the final slash).  If someone else
wants to try more ideas, and do a study on various existing
repositories, they can go ahead, but I suspect most work here is going
to end up at a dead end and I'm unwilling to put further time of my
own into it.

> Food for thought?  Happy to discuss the idea further.

So, I've occasionally seen repositories that have something like the following:

  * base version: directory named library-x-1.7/
  * stable branch: many changes to files under library-x-1.7/
  * development branch: library-x-1.7/ no longer exists.  However,
library-x-1.8/ and library-x-1.9/ both do.  Both are obviously
"similar" to library-x-1.7/ but both have many changes.

What happens when someone tries to merge the stable branch into the
development branch?  There are two obvious guesses:

  * Changes from library-x-1.7/ on the stable branch are applied to
library-x-1.8/
  * Changes from library-x-1.7/ on the stable branch are applied to
library-x-1.9/

Either answer can be suboptimal depending on your viewpoint.
(Applying the changes to both directories would also have other
suboptimal effects even if it might sound right based for this exact
problem as I've worded it.  But git doesn't do copy detection as part
of merges so Git won't choose this third choice ever.)  So, which of
those two happens?  Well, since renames are detected based upon file
similarity, the changes will go to whatever file is most similar.
What does that mean?  It means that both answers above are wrong.
Instead:

  * Some of the changes from library-x-1.7/ on the stable branch are
applied to files from library-x-1.8/, while others are applied to
files from library-x-1.9/, and to determine which files from which
directory are matched up is an individual file choice based on which
file in library-x-1.8/ or library-x-1.9/ is most similar to the file
from the base version in library-x-1.7/.

This answer is clearly worse than either of the two above, and is
virtually never what people would want.  But it's also fundamental to
the idea of matching up files and detecting renames individually based
upon file similarity.  It's part of both the old and new merge
backends in Git, because both were based upon this fundamental idea.

So, if you have this kind of situation, or even something like it
where files from one old directory could match files from multiple
other directories, it's just something you have to be aware of.

All that said, here's something that might help:

*** Hack to workaround rename detection in special cases where there
are directories of multiple possible matches ***

1. Get back to a clean slate from before the merge.

$ git merge --abort
OR
$ cd ${OTHER_DIRECTORY} && git clone ${url} && cd ${REPONAME} && git
checkout ${relevant_branch}

2. Temporarily undo your local renames and make a temporary commit

$ git mv Landscape/src/* Landscape/
$ git commit -m "TEMPORARY COMMIT"

3. Perform the merge (files will be in Landscape/ instead of
Landscape/src/ for now).  Don't worry, we'll fix the merge commit
later.

$ git merge ${REMOTE_BRANCH}
[...fix up any conflicts and commit, if needed...]

4. Rename Landscape/ back to Landscape/src/ and make another (temporary) commit.

5. Create a corrected merge commit with the current tree, the commit
message from your merge commit, and the correct parents:
$ git commit-tree -p HEAD~3 -p HEAD~1^2 -F $(git log -1 --format=%B
HEAD~1) HEAD^{tree}
[...the above command will print out a new commit id for a corrected
merge commit.  You can inspect it first, but we just need to pass this
to reset --hard...]

6. Reset your branch to this corrected merge commit (which will orphan
the temporary commits from steps 2, 3, and 4 so they can later be
garbage collected)
$ git reset --hard [...output of commit-tree command...]


Hope that helps,
Elijah

^ permalink raw reply

* Re: [PATCH 1/1] attr: add native file mode values support
From: Junio C Hamano @ 2023-11-11  4:48 UTC (permalink / raw)
  To: Joanna Wang; +Cc: git
In-Reply-To: <20231111040309.2848560-1-jojwang@google.com>

Joanna Wang <jojwang@google.com> writes:

> +/*
> + * This implements native file mode attr support.
> + * We always return the current mode of the path, not the mode stored
> + * in the index, unless the path is a directory where we check the index
> + * to see if it is a GITLINK. It is ok to rely on the index for GITLINK
> + * modes because a path cannot become a GITLINK (which is a git concept only)
> + * without having it indexed with a GITLINK mode in git.
> + */
> +static unsigned int native_mode_attr(struct index_state *istate, const char *path)
> +{
> +	struct stat st;
> +	unsigned int mode;

Please have a blank line here between decls and the first statement.

> +	if (lstat(path, &st))
> +		die_errno(_("unable to stat '%s'"), path);

For checking in, this is probably OK, but don't we need to switch
between lstat() on the filesystem entity and inspecting ce_mode()
for the in-index cache_entry, depending on the git_attr_direction?

> +	mode = canon_mode(st.st_mode);
> +	if (S_ISDIR(mode)) {
> +		int pos = index_name_pos(istate, path, strlen(path));
> +		if (pos >= 0)
> +			if (S_ISGITLINK(istate->cache[pos]->ce_mode))
> +				return istate->cache[pos]->ce_mode;

Even if we assume that this code is currently meant to work only
with GIT_ATTR_CHECKIN, I do not think this is what you want.  When
asked to perform "git add . ':(exclude,mode=160000)'", you not only
want to exclude the submodules that are already known to this
superproject, but also a repository that _can_ become a submodule of
this superproject when added, no?  You are missing "if (pos < 0)"
case where you'd probably want to see if the directory at the path
looks like the top of a working tree with ".git" subdirectory that
is a valid repository, or something like that to detect such a case.

On the other hand, the GIT_ATTR_CHECKOUT direction is hopefully much
simpler.  You'd see what the path in the index is, among a gitlink,
a regular non-executable file, an executable file, or a symlink.

> +	}
> +	return mode;
> +}
> +
> +
>  void git_check_attr(struct index_state *istate,
>  		    const char *path,
>  		    struct attr_check *check)
>  {
>  	int i;
>  	const struct object_id *tree_oid = default_attr_source();
> +	struct strbuf sb = STRBUF_INIT;
>  
>  	collect_some_attrs(istate, tree_oid, path, check);
>  
>  	for (i = 0; i < check->nr; i++) {
>  		unsigned int n = check->items[i].attr->attr_nr;
>  		const char *value = check->all_attrs[n].value;
> -		if (value == ATTR__UNKNOWN)
> -			value = ATTR__UNSET;
> +		if (value == ATTR__UNKNOWN){

Style.  Missing SP between "){".

> +			if (strcmp(check->all_attrs[n].attr->name, "mode") == 0) {

Style.  We avoid comparison with 0; so say

			if (!strcmp(check->all_attrs[n].attr->name, "mode") == 0) {

instead.

It is disturbing that we always need to perform a string comparison
in this loop (and the function is called repeatedly while traversing
the tree looking for paths that match pathspec).  The attribute objects
are designed to be interned, so at least you should be able to do

	mode_attr = git_attr("mode");

before the loop and then compare check->all_attrs[n].attr and
mode_attr as two addresses.

> +				/*
> +				 * If value is ATTR_UNKNOWN then it has not been set
> +				 * anywhere with -mode (ATTR_FALSE), !mode (ATTR_UNSET),
> +				 * mode (ATTR_TRUE), or an explicit value. We fill
> +				 * value with the native mode value.
> +				 */
> +				strbuf_addf(&sb, "%06o", native_mode_attr(istate, path));
> +				value = sb.buf;

Or even better yet, as this is not a place to write too many lines
for a single oddball attribute, it might make even more sense to
introduce a helper function and make a call to it like so:

		if (value == ATTR__UNKNOWN)
-			value = ATTR__UNSET;
+			value = compute_attr_from_path(istate, path, check_items[i].attr);

with the new helper function do all the heavy lifting, e.g.,

	static const char *compute_attr_from_path(struct index_state *istate,
						  const char *path,
						  const struct git_attr *attr) {
		static struct git_attr mode_attr;

		if (!mode_attr)
			mode_attr = git_attr("mode");

		if (attr == mode_attr) {
			call native_mode_attr(), stringify, and
			return it;
		}
		return ATTR__UNSET;
	}

which will allow us to easily add in the future special attribute
similar to "mode" whose fallback value is computed.

Otherwise, looking pretty good.  Thanks for working on this.

^ permalink raw reply

* Re: [RFC PATCH 0/3] replay: implement support for writing new objects to a pack
From: Elijah Newren @ 2023-11-11  4:04 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Jeff King, Patrick Steinhardt, Junio C Hamano,
	Johannes Schindelin
In-Reply-To: <cover.1699381371.git.me@ttaylorr.com>

On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> (Based on a combination of Christian's cc/git-replay and my
> tb/merge-tree-write-pack branches).
>
> This RFC demonstrates extending the new `--write-pack` option that
> `merge-tree` recently learned to the `replay` builtin as well.
>
> The approach is as follows:
>
>   - write a pack out after each step in the replay operation, so that
>     subsequent steps may see any new object(s) created during previous
>     steps
>
>   - combine those packs into one before migrating them back into the
>     main object store
>
> This is accomplished with a combination of the bulk-checkin and
> tmp-objdir APIs, with some minor modifications made to when we flush out
> and finalize bulk-checkin transactions.

Just thought I'd note that I finished reading all of this series as
well as the five-commit series it's based on.  Just wanted to note
that any commits I didn't comment on from either series looked good to
me.

^ permalink raw reply

* [PATCH 1/1] attr: add native file mode values support
From: Joanna Wang @ 2023-11-11  4:03 UTC (permalink / raw)
  To: git; +Cc: Joanna Wang

Gives all paths inherent 'mode' attribute values based on the paths'
modes (one of 100644, 100755, 120000, 040000, 160000). Users may use
this feature to filter by file types. For example a pathspec such as
':(attr:mode=160000)' could filter for submodules without needing
`mode=160000` to actually be specified for each subdmoule path in
.gitattributes. The native mode values are also reflected in
`git check-attr` results.

If there is any existing mode attribute for a path (e.g. there is
!mode, -mode, mode, mode=<value> in .gitattributes) that setting will
take precedence over the native mode value.

---

I went with 'mode' because that is the word used in documentation
(e.g. https://git-scm.com/book/sv/v2/Git-Internals-Git-Objects)
and in the code (e.g. `ce_mode`). Please let me know what you think
of this UX.

The implementation happens within git_check_attr() method which is
the common mathod called for getting a pathspec attribute value.

The previous discussed idea did not work with `git check-attr`.
(https://lore.kernel.org/all/CAMmZTi8swsSMcLUcW+YwUDg8GcrY_ks2+i35-nsHE3o9MNpsUQ@mail.gmail.com/).

There are no tests for excluding based on pathspec attrs in subdirectories
due to an existing bug. Since it is not specific to native mode, I thought
it would be better to fix separately.
https://lore.kernel.org/all/CAMmZTi89Un+bsLXdEdYs44oT8eLNp8y=Pm8ywaurcQ7ccRKGdQ@mail.gmail.com/

 Documentation/gitattributes.txt | 10 +++++++
 attr.c                          | 42 ++++++++++++++++++++++++--
 t/t0003-attributes.sh           | 40 +++++++++++++++++++++++++
 t/t6135-pathspec-with-attrs.sh  | 53 +++++++++++++++++++++++++++++++++
 4 files changed, 143 insertions(+), 2 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 8c1793c148..bb3c11f151 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -156,6 +156,16 @@ Unspecified::
 Any other value causes Git to act as if `text` has been left
 unspecified.
 
+`mode`
+^^^^^
+
+This attribute is for filtering files by their file bit modes
+(40000, 120000, 160000, 100755, 100644) and has native support in git,
+meaning values for this attribute are natively set (e.g. mode=100644) by
+git without having to specify them in .gitattributes. However, if
+'mode' is set in .gitattributest for some path, that value takes precendence,
+whether it is 'set', 'unset', 'unspecified', or some other value.
+
 `eol`
 ^^^^^
 
diff --git a/attr.c b/attr.c
index e62876dfd3..95dc1cf695 100644
--- a/attr.c
+++ b/attr.c
@@ -1240,20 +1240,58 @@ static struct object_id *default_attr_source(void)
 	return &attr_source;
 }
 
+
+/*
+ * This implements native file mode attr support.
+ * We always return the current mode of the path, not the mode stored
+ * in the index, unless the path is a directory where we check the index
+ * to see if it is a GITLINK. It is ok to rely on the index for GITLINK
+ * modes because a path cannot become a GITLINK (which is a git concept only)
+ * without having it indexed with a GITLINK mode in git.
+ */
+static unsigned int native_mode_attr(struct index_state *istate, const char *path)
+{
+	struct stat st;
+	unsigned int mode;
+	if (lstat(path, &st))
+		die_errno(_("unable to stat '%s'"), path);
+	mode = canon_mode(st.st_mode);
+	if (S_ISDIR(mode)) {
+		int pos = index_name_pos(istate, path, strlen(path));
+		if (pos >= 0)
+			if (S_ISGITLINK(istate->cache[pos]->ce_mode))
+				return istate->cache[pos]->ce_mode;
+	}
+	return mode;
+}
+
+
 void git_check_attr(struct index_state *istate,
 		    const char *path,
 		    struct attr_check *check)
 {
 	int i;
 	const struct object_id *tree_oid = default_attr_source();
+	struct strbuf sb = STRBUF_INIT;
 
 	collect_some_attrs(istate, tree_oid, path, check);
 
 	for (i = 0; i < check->nr; i++) {
 		unsigned int n = check->items[i].attr->attr_nr;
 		const char *value = check->all_attrs[n].value;
-		if (value == ATTR__UNKNOWN)
-			value = ATTR__UNSET;
+		if (value == ATTR__UNKNOWN){
+			if (strcmp(check->all_attrs[n].attr->name, "mode") == 0) {
+				/*
+				 * If value is ATTR_UNKNOWN then it has not been set
+				 * anywhere with -mode (ATTR_FALSE), !mode (ATTR_UNSET),
+				 * mode (ATTR_TRUE), or an explicit value. We fill
+				 * value with the native mode value.
+				 */
+				strbuf_addf(&sb, "%06o", native_mode_attr(istate, path));
+				value = sb.buf;
+			} else
+				value = ATTR__UNSET;
+		}
 		check->items[i].value = value;
 	}
 }
diff --git a/t/t0003-attributes.sh b/t/t0003-attributes.sh
index aee2298f01..9c2603d8e2 100755
--- a/t/t0003-attributes.sh
+++ b/t/t0003-attributes.sh
@@ -19,6 +19,15 @@ attr_check () {
 	test_must_be_empty err
 }
 
+attr_check_mode () {
+	path="$1" expect="$2" git_opts="$3" &&
+
+	git $git_opts check-attr mode -- "$path" >actual 2>err &&
+	echo "$path: mode: $expect" >expect &&
+	test_cmp expect actual &&
+	test_must_be_empty err
+}
+
 attr_check_quote () {
 	path="$1" quoted_path="$2" expect="$3" &&
 
@@ -558,4 +567,35 @@ test_expect_success EXPENSIVE 'large attributes file ignored in index' '
 	test_cmp expect err
 '
 
+test_expect_success 'submodule with .git file' '
+	mkdir sub &&
+	(cd sub &&
+	git init &&
+	mv .git .real &&
+	echo "gitdir: .real" >.git &&
+	test_commit first) &&
+	git update-index --add -- sub
+'
+
+test_expect_success 'native mode attributes work' '
+	>exec && chmod +x exec && attr_check_mode exec 100755 &&
+	>normal && attr_check_mode normal 100644 &&
+	mkdir dir && attr_check_mode dir 040000 &&
+	ln -s normal normal_sym && attr_check_mode normal_sym 120000 &&
+	attr_check_mode sub 160000
+'
+
+test_expect_success '.gitattributes mode values take precedence' '
+	(
+		echo "mode_value* mode=myownmode" &&
+		echo "mode_set* mode" &&
+		echo "mode_unset* -mode" &&
+		echo "mode_unspecified* !mode"
+	) >.gitattributes &&
+	>mode_value && attr_check_mode mode_value myownmode &&
+	>mode_unset && attr_check_mode mode_unset unset &&
+	>mode_unspecified && attr_check_mode mode_unspecified unspecified &&
+	>mode_set && attr_check_mode mode_set set
+'
+
 test_done
diff --git a/t/t6135-pathspec-with-attrs.sh b/t/t6135-pathspec-with-attrs.sh
index a9c1e4e0ec..fd9569d39b 100755
--- a/t/t6135-pathspec-with-attrs.sh
+++ b/t/t6135-pathspec-with-attrs.sh
@@ -64,6 +64,9 @@ test_expect_success 'setup .gitattributes' '
 	fileSetLabel label
 	fileValue label=foo
 	fileWrongLabel label☺
+	mode_set* mode=1234
+	mode_unset* -mode
+	mode_unspecified* !mode
 	EOF
 	echo fileSetLabel label1 >sub/.gitattributes &&
 	git add .gitattributes sub/.gitattributes &&
@@ -295,4 +298,54 @@ test_expect_success 'reading from .gitattributes in a subdirectory (3)' '
 	test_cmp expect actual
 '
 
+test_expect_success 'mode attr is handled correctly for overriden values' '
+	>mode_set_1 &&
+	>mode_unset_1 &&
+	>mode_unspecified_1 &&
+	>mode_regular_file_1 &&
+
+	git status -s ":(attr:mode=1234)mode*" >actual &&
+	cat <<-\EOF >expect &&
+	?? mode_set_1
+	EOF
+	test_cmp expect actual &&
+
+	git status -s ":(attr:-mode)mode*" >actual &&
+	echo ?? mode_unset_1 >expect &&
+	test_cmp expect actual &&
+
+	git status -s ":(attr:!mode)mode*" >actual &&
+	echo ?? mode_unspecified_1 >expect &&
+	test_cmp expect actual &&
+
+	git status -s ":(attr:mode=100644)mode*" >actual &&
+	echo ?? mode_regular_file_1 >expect &&
+	test_cmp expect actual
+'
+
+test_expect_success 'mode attr values are current file modes, not indexed modes' '
+	>mode_exec_file_1 &&
+
+	git status -s ":(attr:mode=100644)mode_exec_*" >actual &&
+	echo ?? mode_exec_file_1 >expect &&
+	test_cmp expect actual &&
+
+	git add mode_exec_file_1 && chmod +x mode_exec_file_1 &&
+	git status -s ":(attr:mode=100755)mode_exec_*" >actual &&
+	echo AM mode_exec_file_1 >expect &&
+	test_cmp expect actual
+'
+
+test_expect_success 'mode attr can be excluded' '
+	>mode_1_regular &&
+	>mode_1_exec  && chmod +x mode_1_exec &&
+	git status -s ":(exclude,attr:mode=100644)" "mode_1_*" >actual &&
+	echo ?? mode_1_exec >expect &&
+	test_cmp expect actual &&
+
+	git status -s ":(exclude,attr:mode=100755)" "mode_1_*" >actual &&
+	echo ?? mode_1_regular >expect &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related

* [PATCH] RelNotes: minor wording fixes in 2.43.0 release notes
From: Elijah Newren via GitGitGadget @ 2023-11-11  4:02 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    RelNotes: minor wording fixes in 2.43.0 release notes

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1611%2Fnewren%2Frelnotes-wording-improvements-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1611/newren/relnotes-wording-improvements-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1611

 Documentation/RelNotes/2.43.0.txt | 36 +++++++++++++++----------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/RelNotes/2.43.0.txt b/Documentation/RelNotes/2.43.0.txt
index 74f304b1b2b..770543c464e 100644
--- a/Documentation/RelNotes/2.43.0.txt
+++ b/Documentation/RelNotes/2.43.0.txt
@@ -27,14 +27,14 @@ UI, Workflows & Features
    a branch that is checked out and protect it.  Rephrase the message
    to say that the branch is in use.
 
- * Hourly and other schedule of "git maintenance" jobs are randomly
+ * Hourly and other schedules of "git maintenance" jobs are randomly
    distributed now.
 
  * "git cmd -h" learned to signal which options can be negated by
    listing such options like "--[no-]opt".
 
- * The way authentication related data other than passwords (e.g.
-   oath token and password expiration data) are stored in libsecret
+ * The way authentication related data other than passwords (e.g.,
+   oauth token and password expiration data) are stored in libsecret
    keyrings has been rethought.
 
  * Update the libsecret and wincred credential helpers to correctly
@@ -60,7 +60,7 @@ UI, Workflows & Features
 
  * The default log message created by "git revert", when reverting a
    commit that records a revert, has been tweaked, to encourage people
-   describe complex "revert of revert of revert" situation better in
+   to describe complex "revert of revert of revert" situations better in
    their own words.
 
  * The command-line completion support (in contrib/) learned to
@@ -77,8 +77,8 @@ UI, Workflows & Features
 
  * The command line completion script (in contrib/) can be told to
    complete aliases by including ": git <cmd> ;" in the alias to tell
-   it that the alias should be completed similar to how "git <cmd>" is
-   completed.  The parsing code for the alias as been loosened to
+   it that the alias should be completed in a similar way to how "git <cmd>" is
+   completed.  The parsing code for the alias has been loosened to
    allow ';' without an extra space before it.
 
  * "git for-each-ref" and friends learned to apply mailmap to
@@ -117,8 +117,8 @@ Performance, Internal Implementation, Development Support etc.
  * Flaky "git p4" tests, as well as "git svn" tests, are now skipped
    in the (rather expensive) sanitizer CI job.
 
- * Tests with LSan from time to time seem to emit harmless message
-   that makes our tests unnecessarily flaky; we work it around by
+ * Tests with LSan from time to time seem to emit harmless messages
+   that make our tests unnecessarily flaky; we work around it by
    filtering the uninteresting output.
 
  * Unused parameters to functions are marked as such, and/or removed,
@@ -135,7 +135,7 @@ Performance, Internal Implementation, Development Support etc.
 
  * Test coverage for trailers has been improved.
 
- * The code to iterate over loose references have been optimized to
+ * The code to iterate over loose references has been optimized to
    reduce the number of lstat() system calls.
 
  * The codepaths that read "chunk" formatted files have been corrected
@@ -157,7 +157,7 @@ Fixes since v2.42
    branch tips at the same time will not waste building and testing
    the same thing twice.
 
- * The commit-graph verification code that detects mixture of zero and
+ * The commit-graph verification code that detects a mixture of zero and
    non-zero generation numbers has been updated.
 
  * "git diff -w --exit-code" with various options did not work
@@ -170,20 +170,20 @@ Fixes since v2.42
    the sequencer code has been cleaned up for consistency.
 
  * "git diff --no-such-option" and other corner cases around the exit
-   status of the "diff" command has been corrected.
+   status of the "diff" command have been corrected.
 
  * "git for-each-ref --sort='contents:size'" sorts the refs according
    to size numerically, giving a ref that points at a blob twelve-byte
    (12) long before showing a blob hundred-byte (100) long.
 
- * We now limit depth of the tree objects and maximum length of
+ * We now limit the depth of the tree objects and maximum length of
    pathnames recorded in tree objects.
    (merge 4d5693ba05 jk/tree-name-and-depth-limit later to maint).
 
  * Various fixes to the behavior of "rebase -i" when the command got
    interrupted by conflicting changes.
 
- * References from description of the `--patch` option in various
+ * References from a description of the `--patch` option in various
    manual pages have been simplified and improved.
 
  * "git grep -e A --no-or -e B" is accepted, even though the negation
@@ -203,8 +203,8 @@ Fixes since v2.42
    information for a file when fsmonitor knows it is clean and ended
    up behaving as if it is not clean, which has been corrected.
 
- * Clarify how "alias.foo = : git cmd ; aliased-command-string" should
-   be spelled with necessary whitespaces around punctuation marks to
+ * Clarify how "alias.foo = : git cmd ; aliased-command-string" should be
+   spelled with necessary whitespace around punctuation marks to
    work.
 
  * HTTP Header redaction code has been adjusted for a newer version of
@@ -256,9 +256,9 @@ Fixes since v2.42
    by "git stash create" now errors out.
    (merge d9b6634589 jc/fail-stash-to-store-non-stash later to maint).
 
- * The index file has room only for lower 32-bit of the file size in
+ * The index file has room only for the lower 32-bit of the file size in
    the cached stat information, which means cached stat information
-   will have 0 in its sd_size member for a file whose size is multiple
+   will have 0 in its sd_size member for a file whose size is a multiple
    of 4GiB.  This is mistaken for a racily clean path.  Avoid it by
    storing a bogus sd_size value instead for such files.
    (merge 5143ac07b1 bc/racy-4gb-files later to maint).
@@ -281,7 +281,7 @@ Fixes since v2.42
    20 months or so, which has been corrected.
 
  * "git send-email" did not have certain pieces of data computed yet
-   when it tried to validate the outging messages and its recipient
+   when it tried to validate the outgoing messages and its recipient
    addresses, which has been sorted out.
 
  * "git bugreport" learned to complain when it received a command line

base-commit: dadef801b365989099a9929e995589e455c51fed
-- 
gitgitgadget

^ permalink raw reply related

* Re: [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step
From: Elijah Newren @ 2023-11-11  3:45 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Jeff King, Patrick Steinhardt, Junio C Hamano,
	Johannes Schindelin
In-Reply-To: <c615a61d32644b64ef8f47feb47ec909286c56b3.1699381371.git.me@ttaylorr.com>

On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> In a previous commit, the ORT merge backend learned how to use the
> bulk-checkin mechanism to emit a single pack containing any new objects
> created during the merge. This functionality was implemented by setting
> up a new ODB transaction, and finalizing it at the end of the merge via
> `process_entries()`.
>
> In a future commit, we will extend this functionality to the new `git
> replay` command, which needs to see objects from previous steps in order
> to replay each commit.
>
> As a step towards implementing this, teach the ORT backend to flush the
> ODB transaction at the end of each step in `process_entries()`, and then
> finalize the result with `end_odb_transaction()` when calling
> `merge_finalize()`.

process_entries() contains a for loop inside it, so "end of each step
in `process_entries()`" sounds like you are flushing after entry, i.e.
several times per commit being replayed.

Perhaps "at the end of `process_entries()` (thus creating one pack per
commit)" ?


(Of course, the fact that we need this change earlier, in the other
series, kinda makes this point moot.  Just thought I'd mention it
anyway in case it comes back up in your restructuring.)

^ permalink raw reply

* Re: [RFC PATCH 0/3] replay: implement support for writing new objects to a pack
From: Elijah Newren @ 2023-11-11  3:42 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Jeff King, Patrick Steinhardt, Junio C Hamano,
	Johannes Schindelin
In-Reply-To: <cover.1699381371.git.me@ttaylorr.com>

On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> (Based on a combination of Christian's cc/git-replay and my
> tb/merge-tree-write-pack branches).
>
> This RFC demonstrates extending the new `--write-pack` option that
> `merge-tree` recently learned to the `replay` builtin as well.
>
> The approach is as follows:
>
>   - write a pack out after each step in the replay operation, so that
>     subsequent steps may see any new object(s) created during previous
>     steps
>
>   - combine those packs into one before migrating them back into the
>     main object store
>
> This is accomplished with a combination of the bulk-checkin and
> tmp-objdir APIs, with some minor modifications made to when we flush out
> and finalize bulk-checkin transactions.
>
> The benefit to this approach is that we bound the number of inodes
> required per replayed commit to a constant (by default, 3: one for the
> .pack, one for the .idx, and another for the .rev file), instead of
> having each operation take an unbounded number of inodes proportional to
> the number of new objects created during that step. We also only migrate
> a single pack back to the main object store.

Isn't it actually 4?  Since you only put blobs and trees into the bulk
checkin packfiles, the commit object will still be loose.

> In other words, this makes the maximum number of inodes required by
> 'replay' grow proportional to the number of commits being replayed,
> instead of the number of new *objects* created as a result of the replay
> operation.

As per comments on the other series, we actually need 2 packs per
commit (when replaying non-merge commits, we don't have to worry about
recursive merges, so 2*depth is just 2), so this would be 7 inodes per
commit (3 files per pack * 2 packs + 1 loose commit object).

I was curious what the comparative number of loose objects might be if
we didn't use the bulk checking, so:
$ cd linux.git
$ git rev-list --no-merges --count HEAD
1141436
$ time git log --oneline --no-merges --name-only | wc -l
3781628
$ python -c 'print(3781628/1141436)'
3.3130442705504293

So, if repositories are similar to linux, that's a bit over 3 files
modified per commit.  It's a bit harder to count trees, but let's take
a wild guess and say that each file is 7 directories (because I'm too
lazy to do real research and 7 happens to give me nice round numbers
later), so that's up to 7*3 tree objects.  So 3 file objects + 21 tree
objects + 1 commit object = 25 loose objects.

So, your scheme certainly seems to reduce number of inodes, but does
it introduce a different scaling issue?  git-gc repacks when there are
>= 50 packs, or when there are >= 6700 loose objects, suggesting that
if we exceed those numbers, we might start seeing performance suffer.
We would hit 50 packs with a mere 25 commits being replayed, and
wouldn't expect to get to 6700 loose objects until we replayed about
268 commits (6700/25).  Does this mean we are risking worse
performance degradation with this scheme than with just using loose
objects until the end of the operation?

^ permalink raw reply

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
From: Taylor Blau @ 2023-11-11  1:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren, git, Eric W. Biederman, Jeff King,
	Patrick Steinhardt
In-Reply-To: <xmqqmsvlrv5t.fsf@gitster.g>

On Sat, Nov 11, 2023 at 09:27:42AM +0900, Junio C Hamano wrote:
> Elijah Newren <newren@gmail.com> writes:
>
> > I believe the above is built on an assumption that any objects written
> > will not need to be read until after the merge is completed.  And we
> > might have a nesting issue too...
> > ...
> > This is unsafe; the object may need to be read later within the same
> > merge.
>
> Thanks for a good analysis.  I agree.

Ditto. I responded to Elijah more in-depth elsewhere in the thread, but
I think for your purposes it is OK to discard this series.

Thanks,
Taylor

^ permalink raw reply

* Re: first-class conflicts?
From: Junio C Hamano @ 2023-11-11  1:31 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Sandra Snan, git, Dragan Simic, rsbecker
In-Reply-To: <ZUmJyFs7z7wdmLVK@mit.edu>

"Theodore Ts'o" <tytso@mit.edu> writes:

> And if you attempt to commit the merge without resolving the
> conflicts, git won't let you:
>
>    error: Committing is not possible because you have unmerged files.
>    hint: Fix them up in the work tree, and then use 'git add/rm <file>'
>    hint: as appropriate to mark resolution and make a commit.
>
> So it's hard to miss the indications of the content conflict, because
> if you try to commit without resolving them, it's not a warning, it's
> an outright error.

Correct but with a caveat: it is too easy for lazy folks to
circumvent the safety by mistake with "commit -a".

I wonder if it would help users to add a new configuration option
for those who want to live safer that tells "commit -a" to leave
unmerged paths alone and require the unmerged paths to be added
explicitly (which may have to extend to cover things like "add -u"
and "add .").

Perhaps not.  I often find myself doing "git add -u" after resolving
conflicts and re-reading the result, without an explicit pathspec.



^ permalink raw reply

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
From: Taylor Blau @ 2023-11-11  1:24 UTC (permalink / raw)
  To: Elijah Newren
  Cc: git, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <CABPp-BEfy9VOvimP9==ry_rZXu=metOQ8s=_-XiG_Pdx9c06Ww@mail.gmail.com>

On Fri, Nov 10, 2023 at 03:51:18PM -0800, Elijah Newren wrote:
> > @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
> >                 if ((merge_status < 0) || !result_buf.ptr)
> >                         ret = error(_("failed to execute internal merge"));
> >
> > -               if (!ret &&
> > -                   write_object_file(result_buf.ptr, result_buf.size,
> > -                                     OBJ_BLOB, &result->oid))
> > -                       ret = error(_("unable to add %s to database"), path);
> > +               if (!ret) {
> > +                       ret = opt->write_pack
> > +                               ? index_blob_bulk_checkin_incore(&result->oid,
> > +                                                                result_buf.ptr,
> > +                                                                result_buf.size,
> > +                                                                path, 1)
> > +                               : write_object_file(result_buf.ptr,
> > +                                                   result_buf.size,
> > +                                                   OBJ_BLOB, &result->oid);
> > +                       if (ret)
> > +                               ret = error(_("unable to add %s to database"),
> > +                                           path);
> > +               }
> >
> >                 free(result_buf.ptr);
> >                 if (ret)
>
> This is unsafe; the object may need to be read later within the same
> merge.  Perhaps the simplest example related to your testcase is
> modifying the middle section of that testcase (I'll highlight where
> when I comment on the testcase) to read:
>
>     test_commit -C repo base file "$(test_seq 3 5)" &&
>     git -C repo branch -M main &&
>     git -C repo checkout -b side main &&
>     test_commit -C repo side-file file "$(test_seq 1 5)" &&
>     test_commit -C repo side-file2 file2 "$(test_seq 1 6)" &&
>     git -C repo checkout -b other main &&
>     test_commit -C repo other-file file "$(test_seq 3 7)" &&
>     git -C repo mv file file2 &&
>     git -C repo commit -m other-file2 &&
>
>     find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
>     git -C repo merge-tree --write-pack side other &&
>
> In words, what I'm doing here is having both sides modify "file" (the
> same as you did) but also having one side rename "file"->"file2".  The
> side that doesn't rename "file" also introduces a new "file2".  ort
> needs to merge the three versions of "file" to get a new blob object,
> and then merge that with the content from the brand new "file2".  More
> complicated cases are possible, of course.  Anyway, with the modified
> testcase above, merge-tree will fail with:
>
>     fatal: unable to read blob object 06e567b11dfdafeaf7d3edcc89864149383aeab6
>
> I think (untested) that you could fix this by creating two packs
> instead of just one.  In particular, add a call to
> flush_odb_transcation() after the "redo_after_renames" block in
> merge_ort_nonrecursive_internal().  (It's probably easier to see that
> you could place the flush_odb_transaction() call inside
> detect_and_process_renames() just after the process_renames() call,
> but when redo_after_renames is true you'd end up with three packs
> instead of just two).

Great analysis, thanks for catching this error. I tested your approach,
and indeed a flush_odb_transaction() call after the process_renames()
call in detect_and_process_renames() does do the trick.

> What happens with the odb transaction stuff if no new objects are
> written before the call to flush_odb_transaction?  Will that be a
> problem?

I think that the bulk-checkin code is flexible enough to understand that
we shouldn't do anything when there aren't any objects to pack.

> (Since any tree written will not be re-read within the same merge, the
> other write_object_file() call you changed does not have the same
> problem as the one here.)
>
> >@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
> >                fflush(stdout);
> >                BUG("dir_metadata accounting completely off; shouldn't happen");
> >        }
> >-       if (write_tree(result_oid, &dir_metadata.versions, 0,
> >+       if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
> >                       opt->repo->hash_algo->rawsz) < 0)
> >                ret = -1;
> >
> > +
> > +       if (opt->write_pack)
> > +               end_odb_transaction();
> > +
>
> Is the end_odb_transaction() here going to fail with an "Unbalanced
> ODB transaction nesting" when faced with a recursive merge?

I think so, and we should have a test-case demonstrating that. In the
remaining three patches that I posted to extend this approach to 'git
replay', I moved this call elsewhere in such a way that I think

> Perhaps flushing here, and then calling end_odb_transaction() in
> merge_finalize(), much as you do in your replay-and-write-pack series,
> should actually be moved to this commit and included here?

Yep, exactly.

> This does mean that for a recursive merge, that you'll get up to 2*N
> packfiles, where N is the depth of the recursive merge.

We definitely want to avoid that ;-). I think there are a couple of
potential directions forward here, but the most promising one I think is
due to Johannes who suggests that we write loose objects into a
temporary directory with a replace_tmp_objdir() call, and then repack
that side directory before migrating a single pack back into the main
object store.

Thanks,
eaylor

^ permalink raw reply

* Re: first-class conflicts?
From: Theodore Ts'o @ 2023-11-07  0:50 UTC (permalink / raw)
  To: Sandra Snan; +Cc: git, Dragan Simic, rsbecker
In-Reply-To: <Gr..Y5kkszDx87g@idiomdrottning.org>

On Mon, Nov 06, 2023 at 10:45:03PM +0000, Sandra Snan wrote:
> Randall, thank you for that.
> 
> I just have sometimes wish git could be a little more aware of them beyond
> just storing them with ASCII art in the files themselves (and alerting /
> warning when they happen but I often can't properly see those warnings flash
> by so I end up having to search for the conflict markers manually). So if
> conflicts are a thing that *can* happen, it'd be better if vc could know
> about them which would make some of the rebases simpler as in jj. That
> doesn't mean we wanna adopt the jj workflow of deliberately checking in
> conflicts (not even locally), just be able to deal with them better if it
> does happen.

Well, if you miss them, "git status" does show that there are conflicts:

   Unmerged paths:
     (use "git add <file>..." to mark resolution)
           both modified:   version.h

And if you attempt to commit the merge without resolving the
conflicts, git won't let you:

   error: Committing is not possible because you have unmerged files.
   hint: Fix them up in the work tree, and then use 'git add/rm <file>'
   hint: as appropriate to mark resolution and make a commit.

So it's hard to miss the indications of the content conflict, because
if you try to commit without resolving them, it's not a warning, it's
an outright error.

Cheers,

					- Ted

^ permalink raw reply

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
From: Junio C Hamano @ 2023-11-11  0:27 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Taylor Blau, git, Eric W. Biederman, Jeff King,
	Patrick Steinhardt
In-Reply-To: <CABPp-BEfy9VOvimP9==ry_rZXu=metOQ8s=_-XiG_Pdx9c06Ww@mail.gmail.com>

Elijah Newren <newren@gmail.com> writes:

> I believe the above is built on an assumption that any objects written
> will not need to be read until after the merge is completed.  And we
> might have a nesting issue too...
> ...
> This is unsafe; the object may need to be read later within the same
> merge.

Thanks for a good analysis.  I agree.


^ permalink raw reply

* Re: [PATCH v2 1/4] global: convert trivial usages of `test <expr> -a/-o <expr>`
From: Junio C Hamano @ 2023-11-11  0:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, git
In-Reply-To: <xmqq34xdtac3.fsf@gitster.g>

Junio C Hamano <gitster@pobox.com> writes:

> I am not so surprised that this one was missed, though.  I didn't
> point this one out during my review of the previous round, either,
> and not everybody is as careful as you are.

Ah, sorry, thist came out in a way I did not mean to.

I didn't mean "I did not point it out explicitly.  It is not
surprising if a contributor who was not careful did not find it on
their own and took initiative to fix it themselves".

I meant "I failed to spot it myself hence I didn't point it out in
my review---I was not being so careful to aim for thoroughly cover
and find all the similar issues".

In any case, I'll tweak it while queueing.  Thanks for noticing.

diff --git i/t/valgrind/valgrind.sh w/t/valgrind/valgrind.sh
index 9fbf90cee7..3c8ee19975 100755
--- i/t/valgrind/valgrind.sh
+++ w/t/valgrind/valgrind.sh
@@ -23,7 +23,7 @@ memcheck)
 	VALGRIND_MAJOR=$(expr "$VALGRIND_VERSION" : '[^0-9]*\([0-9]*\)')
 	VALGRIND_MINOR=$(expr "$VALGRIND_VERSION" : '[^0-9]*[0-9]*\.\([0-9]*\)')
 	test 3 -gt "$VALGRIND_MAJOR" ||
-	( test 3 -eq "$VALGRIND_MAJOR" && test 4 -gt "$VALGRIND_MINOR" ) ||
+	{ test 3 -eq "$VALGRIND_MAJOR" && test 4 -gt "$VALGRIND_MINOR"; } ||
 	TOOL_OPTIONS="$TOOL_OPTIONS --track-origins=yes"
 	;;
 *)

^ permalink raw reply related

* Re: [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
From: Elijah Newren @ 2023-11-11  0:17 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Taylor Blau, Patrick Steinhardt, git, Eric W. Biederman,
	Jeff King, Junio C Hamano
In-Reply-To: <CAPig+cTjQe6FWo98LxvDS=s3dOs33SUUJa=x-bkyWHNBMx+XFw@mail.gmail.com>

On Wed, Oct 25, 2023 at 10:22 AM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Wed, Oct 25, 2023 at 11:44 AM Taylor Blau <me@ttaylorr.com> wrote:
> > On Wed, Oct 25, 2023 at 09:58:06AM +0200, Patrick Steinhardt wrote:
> > > On Mon, Oct 23, 2023 at 06:45:01PM -0400, Taylor Blau wrote:
> > > > In order to support streaming from a location in memory, we must
> > > > implement a new kind of bulk_checkin_source that does just that. These
> > > > implementation in spread out across:
> > >
>
> Perhaps:
>
>     s/implementation in/implementations are/

Was just about to comment with the same suggestion; I had a hard time
parsing the commit message as well because of this.

^ permalink raw reply

* Re: [PATCH v2 1/4] global: convert trivial usages of `test <expr> -a/-o <expr>`
From: Junio C Hamano @ 2023-11-11  0:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, git
In-Reply-To: <20231110214423.GC2758295@coredump.intra.peff.net>

Jeff King <peff@peff.net> writes:

> On Fri, Nov 10, 2023 at 11:01:15AM +0100, Patrick Steinhardt wrote:
>
>> diff --git a/t/valgrind/valgrind.sh b/t/valgrind/valgrind.sh
>> index 669ebaf68be..9fbf90cee7c 100755
>> --- a/t/valgrind/valgrind.sh
>> +++ b/t/valgrind/valgrind.sh
>> @@ -23,7 +23,7 @@ memcheck)
>>  	VALGRIND_MAJOR=$(expr "$VALGRIND_VERSION" : '[^0-9]*\([0-9]*\)')
>>  	VALGRIND_MINOR=$(expr "$VALGRIND_VERSION" : '[^0-9]*[0-9]*\.\([0-9]*\)')
>>  	test 3 -gt "$VALGRIND_MAJOR" ||
>> -	test 3 -eq "$VALGRIND_MAJOR" -a 4 -gt "$VALGRIND_MINOR" ||
>> +	( test 3 -eq "$VALGRIND_MAJOR" && test 4 -gt "$VALGRIND_MINOR" ) ||
>>  	TOOL_OPTIONS="$TOOL_OPTIONS --track-origins=yes"
>
> I was surprised to see this one as a subshell after you adjusted the
> other.

;-)

I am not so surprised that this one was missed, though.  I didn't
point this one out during my review of the previous round, either,
and not everybody is as careful as you are.

Thanks, both.

^ permalink raw reply

* Re: [PATCH v2 1/4] global: convert trivial usages of `test <expr> -a/-o <expr>`
From: Junio C Hamano @ 2023-11-11  0:12 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King
In-Reply-To: <2967c8ebb460934eb4aaaaebe5941bff643d4a94.1699609940.git.ps@pks.im>

Patrick Steinhardt <ps@pks.im> writes:

> Our coding guidelines say to not use `test` with `-a` and `-o` because
> it can easily lead to bugs. Convert trivial cases where we still use
> these to instead instead concatenate multiple invocations of `test` via
> `&&` and `||`, respectively.
>
> While not all of the converted instances can cause ambiguity, it is
> worth getting rid of all of them regardless:
>
>     - It becomes easier to reason about the code as we do not have to
>       argue why one use of `-a`/`-o` is okay while another one isn't.
>
>     - We don't encourage people to use these expressions.

Thanks for these additional notes.  Nicely done.

^ permalink raw reply

* Re: [PATCH v4 0/3] t: improve compatibility with NixOS
From: Junio C Hamano @ 2023-11-11  0:10 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King
In-Reply-To: <cover.1699596457.git.ps@pks.im>

Patrick Steinhardt <ps@pks.im> writes:

> Three changes compared to v3:
>
>     - Switched from `test <expr> -a <expr>` to `test <expr> && test
>       <expr>`.
>
>     - Improved the commit message to explain why the new
>       runtime-detected paths are only used as a fallback.
>
>     - Rebased on top of 0e3b67e2aa (ci: add support for GitLab CI,
>       2023-11-09), which has been merged to `next` and caused conflicts.

Please, no.  "The other topic has been merged to 'next'" is *not* a
good reason to do this.  Before that, the other topic was in 'seen'
and causing conflicts already, so getting into 'next' did not create
any new reason for rebasing.

I'll manage this time, but please do not do such a rebase unless you
are asked to do so.  The rebase will force me to do (1) detach from
'next' and apply these, (2) discard the result and detach from the
base of where the previous iteration is queued, (3) apply the new
iteration with "am -3" to undo the rebase, before I can compare the
new iteration with the old iteration.

> Technically speaking this series also depends on 0763c3a2c4 (http:
> update curl http/2 info matching for curl 8.3.0, 2023-09-15), without
> which the tests will fail on NixOS machines with a recent libcurl.

Thanks for that note.  This topic has been queued on top of
v2.43.0-rc1 which has 0763c3a2c4, so we'd be safe.

Will queue.

^ permalink raw reply

* tb/merge-tree-write-pack [Was: Re: What's cooking in git.git (Nov 2023, #04; Thu, 9)]
From: Elijah Newren @ 2023-11-11  0:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, Jeff King, Git Mailing List
In-Reply-To: <ZUzcmsfJe6jk4fTk@nand.local>

Hi,

On Thu, Nov 9, 2023 at 5:21 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Thu, Nov 09, 2023 at 02:40:28AM +0900, Junio C Hamano wrote:
> > * tb/merge-tree-write-pack (2023-10-23) 5 commits
> >  - builtin/merge-tree.c: implement support for `--write-pack`
> >  - bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
> >  - bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
> >  - bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
> >  - bulk-checkin: extract abstract `bulk_checkin_source`
> >
> >  "git merge-tree" learned "--write-pack" to record its result
> >  without creating loose objects.
> >
> >  Comments?
> >  source: <cover.1698101088.git.me@ttaylorr.com>
>
[...]
> I am fairly confident that tb/merge-tree-write-pack is ready to go. I'll
> spin off a separate thread based on that branch and cc/git-replay as a
> non-RFC series that extends this approach to 'git replay', so we'll be
> ready to go there once Christian's series progresses.

Sorry for being so late to review.  I posted some comments on patch 5
in relation to repeated file merges within a single (non-recursive)
merge, and in relation to recursive merges.  Both are cases that'd be
really easy to miss if you're not familiar with the merge code, but I
think will cause the code to die with fatal errors.  I provided my
guesses at some potential fixes for both issues.

^ permalink raw reply

* Re: [PATCH v4 1/3] t/lib-httpd: dynamically detect httpd and modules path
From: Junio C Hamano @ 2023-11-11  0:00 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King
In-Reply-To: <41b9dada2e0b2e713328e6a4d31f713a2d3ffa38.1699596457.git.ps@pks.im>

Patrick Steinhardt <ps@pks.im> writes:

> +if test -x "$DEFAULT_HTTPD_PATH"
> +then
> +	DETECTED_HTTPD_ROOT="$("$DEFAULT_HTTPD_PATH" -V | sed -n 's/^ -D HTTPD_ROOT="\(.*\)"$/\1/p')"
> +fi

With this patch, my test run starts like so:

    rm -f -r 'test-results'
    *** prove ***
    apache2: Could not open configuration file /etc/apache2/apache2.conf: No such file or directory
    ...

I find the error message leaking mildly annoying, and would suggest
doing something like the following on top.

diff --git c/t/lib-httpd.sh w/t/lib-httpd.sh
index 0a74922d7f..03493ee72b 100644
--- c/t/lib-httpd.sh
+++ w/t/lib-httpd.sh
@@ -68,7 +68,7 @@ done
 
 if test -x "$DEFAULT_HTTPD_PATH"
 then
-	DETECTED_HTTPD_ROOT="$("$DEFAULT_HTTPD_PATH" -V | sed -n 's/^ -D HTTPD_ROOT="\(.*\)"$/\1/p')"
+	DETECTED_HTTPD_ROOT="$("$DEFAULT_HTTPD_PATH" -V 2>/dev/null | sed -n 's/^ -D HTTPD_ROOT="\(.*\)"$/\1/p')"
 fi
 
 for DEFAULT_HTTPD_MODULE_PATH in '/usr/libexec/apache2' \

^ permalink raw reply related

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
From: Elijah Newren @ 2023-11-10 23:51 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <3595db76a525fcebc3c896e231246704b044310c.1698101088.git.me@ttaylorr.com>

Hi,

Sorry for taking so long to find some time to review.  And sorry for
the bad news, but...

On Mon, Oct 23, 2023 at 3:45 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> When using merge-tree often within a repository[^1], it is possible to
> generate a relatively large number of loose objects, which can result in
> degraded performance, and inode exhaustion in extreme cases.
>
> Building on the functionality introduced in previous commits, the
> bulk-checkin machinery now has support to write arbitrary blob and tree
> objects which are small enough to be held in-core. We can use this to
> write any blob/tree objects generated by ORT into a separate pack
> instead of writing them out individually as loose.
>
> This functionality is gated behind a new `--write-pack` option to
> `merge-tree` that works with the (non-deprecated) `--write-tree` mode.
>
> The implementation is relatively straightforward. There are two spots
> within the ORT mechanism where we call `write_object_file()`, one for
> content differences within blobs, and another to assemble any new trees
> necessary to construct the merge. In each of those locations,
> conditionally replace calls to `write_object_file()` with
> `index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
> depending on which kind of object we are writing.
>
> The only remaining task is to begin and end the transaction necessary to
> initialize the bulk-checkin machinery, and move any new pack(s) it
> created into the main object store.

I believe the above is built on an assumption that any objects written
will not need to be read until after the merge is completed.  And we
might have a nesting issue too...

> @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
>                 if ((merge_status < 0) || !result_buf.ptr)
>                         ret = error(_("failed to execute internal merge"));
>
> -               if (!ret &&
> -                   write_object_file(result_buf.ptr, result_buf.size,
> -                                     OBJ_BLOB, &result->oid))
> -                       ret = error(_("unable to add %s to database"), path);
> +               if (!ret) {
> +                       ret = opt->write_pack
> +                               ? index_blob_bulk_checkin_incore(&result->oid,
> +                                                                result_buf.ptr,
> +                                                                result_buf.size,
> +                                                                path, 1)
> +                               : write_object_file(result_buf.ptr,
> +                                                   result_buf.size,
> +                                                   OBJ_BLOB, &result->oid);
> +                       if (ret)
> +                               ret = error(_("unable to add %s to database"),
> +                                           path);
> +               }
>
>                 free(result_buf.ptr);
>                 if (ret)

This is unsafe; the object may need to be read later within the same
merge.  Perhaps the simplest example related to your testcase is
modifying the middle section of that testcase (I'll highlight where
when I comment on the testcase) to read:

    test_commit -C repo base file "$(test_seq 3 5)" &&
    git -C repo branch -M main &&
    git -C repo checkout -b side main &&
    test_commit -C repo side-file file "$(test_seq 1 5)" &&
    test_commit -C repo side-file2 file2 "$(test_seq 1 6)" &&
    git -C repo checkout -b other main &&
    test_commit -C repo other-file file "$(test_seq 3 7)" &&
    git -C repo mv file file2 &&
    git -C repo commit -m other-file2 &&

    find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
    git -C repo merge-tree --write-pack side other &&

In words, what I'm doing here is having both sides modify "file" (the
same as you did) but also having one side rename "file"->"file2".  The
side that doesn't rename "file" also introduces a new "file2".  ort
needs to merge the three versions of "file" to get a new blob object,
and then merge that with the content from the brand new "file2".  More
complicated cases are possible, of course.  Anyway, with the modified
testcase above, merge-tree will fail with:

    fatal: unable to read blob object 06e567b11dfdafeaf7d3edcc89864149383aeab6

I think (untested) that you could fix this by creating two packs
instead of just one.  In particular, add a call to
flush_odb_transcation() after the "redo_after_renames" block in
merge_ort_nonrecursive_internal().  (It's probably easier to see that
you could place the flush_odb_transaction() call inside
detect_and_process_renames() just after the process_renames() call,
but when redo_after_renames is true you'd end up with three packs
instead of just two).

What happens with the odb transaction stuff if no new objects are
written before the call to flush_odb_transaction?  Will that be a
problem?

(Since any tree written will not be re-read within the same merge, the
other write_object_file() call you changed does not have the same
problem as the one here.)

>@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
>                fflush(stdout);
>                BUG("dir_metadata accounting completely off; shouldn't happen");
>        }
>-       if (write_tree(result_oid, &dir_metadata.versions, 0,
>+       if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
>                       opt->repo->hash_algo->rawsz) < 0)
>                ret = -1;
>
> +
> +       if (opt->write_pack)
> +               end_odb_transaction();
> +

Is the end_odb_transaction() here going to fail with an "Unbalanced
ODB transaction nesting" when faced with a recursive merge?

Perhaps flushing here, and then calling end_odb_transaction() in
merge_finalize(), much as you do in your replay-and-write-pack series,
should actually be moved to this commit and included here?

This does mean that for a recursive merge, that you'll get up to 2*N
packfiles, where N is the depth of the recursive merge.

> +       test_commit -C repo base file "$(test_seq 3 5)" &&
> +       git -C repo branch -M main &&
> +       git -C repo checkout -b side main &&
> +       test_commit -C repo side file "$(test_seq 1 5)" &&
> +       git -C repo checkout -b other main &&
> +       test_commit -C repo other file "$(test_seq 3 7)" &&
> +
> +       find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
> +       tree="$(git -C repo merge-tree --write-pack \
> +               refs/tags/side refs/tags/other)" &&

These were the lines from your testcase that I replaced to show the bug.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox