Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
From: Elijah Newren @ 2023-11-10 23:51 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <3595db76a525fcebc3c896e231246704b044310c.1698101088.git.me@ttaylorr.com>

Hi,

Sorry for taking so long to find some time to review.  And sorry for
the bad news, but...

On Mon, Oct 23, 2023 at 3:45 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> When using merge-tree often within a repository[^1], it is possible to
> generate a relatively large number of loose objects, which can result in
> degraded performance, and inode exhaustion in extreme cases.
>
> Building on the functionality introduced in previous commits, the
> bulk-checkin machinery now has support to write arbitrary blob and tree
> objects which are small enough to be held in-core. We can use this to
> write any blob/tree objects generated by ORT into a separate pack
> instead of writing them out individually as loose.
>
> This functionality is gated behind a new `--write-pack` option to
> `merge-tree` that works with the (non-deprecated) `--write-tree` mode.
>
> The implementation is relatively straightforward. There are two spots
> within the ORT mechanism where we call `write_object_file()`, one for
> content differences within blobs, and another to assemble any new trees
> necessary to construct the merge. In each of those locations,
> conditionally replace calls to `write_object_file()` with
> `index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
> depending on which kind of object we are writing.
>
> The only remaining task is to begin and end the transaction necessary to
> initialize the bulk-checkin machinery, and move any new pack(s) it
> created into the main object store.

I believe the above is built on an assumption that any objects written
will not need to be read until after the merge is completed.  And we
might have a nesting issue too...

> @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
>                 if ((merge_status < 0) || !result_buf.ptr)
>                         ret = error(_("failed to execute internal merge"));
>
> -               if (!ret &&
> -                   write_object_file(result_buf.ptr, result_buf.size,
> -                                     OBJ_BLOB, &result->oid))
> -                       ret = error(_("unable to add %s to database"), path);
> +               if (!ret) {
> +                       ret = opt->write_pack
> +                               ? index_blob_bulk_checkin_incore(&result->oid,
> +                                                                result_buf.ptr,
> +                                                                result_buf.size,
> +                                                                path, 1)
> +                               : write_object_file(result_buf.ptr,
> +                                                   result_buf.size,
> +                                                   OBJ_BLOB, &result->oid);
> +                       if (ret)
> +                               ret = error(_("unable to add %s to database"),
> +                                           path);
> +               }
>
>                 free(result_buf.ptr);
>                 if (ret)

This is unsafe; the object may need to be read later within the same
merge.  Perhaps the simplest example related to your testcase is
modifying the middle section of that testcase (I'll highlight where
when I comment on the testcase) to read:

    test_commit -C repo base file "$(test_seq 3 5)" &&
    git -C repo branch -M main &&
    git -C repo checkout -b side main &&
    test_commit -C repo side-file file "$(test_seq 1 5)" &&
    test_commit -C repo side-file2 file2 "$(test_seq 1 6)" &&
    git -C repo checkout -b other main &&
    test_commit -C repo other-file file "$(test_seq 3 7)" &&
    git -C repo mv file file2 &&
    git -C repo commit -m other-file2 &&

    find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
    git -C repo merge-tree --write-pack side other &&

In words, what I'm doing here is having both sides modify "file" (the
same as you did) but also having one side rename "file"->"file2".  The
side that doesn't rename "file" also introduces a new "file2".  ort
needs to merge the three versions of "file" to get a new blob object,
and then merge that with the content from the brand new "file2".  More
complicated cases are possible, of course.  Anyway, with the modified
testcase above, merge-tree will fail with:

    fatal: unable to read blob object 06e567b11dfdafeaf7d3edcc89864149383aeab6

I think (untested) that you could fix this by creating two packs
instead of just one.  In particular, add a call to
flush_odb_transcation() after the "redo_after_renames" block in
merge_ort_nonrecursive_internal().  (It's probably easier to see that
you could place the flush_odb_transaction() call inside
detect_and_process_renames() just after the process_renames() call,
but when redo_after_renames is true you'd end up with three packs
instead of just two).

What happens with the odb transaction stuff if no new objects are
written before the call to flush_odb_transaction?  Will that be a
problem?

(Since any tree written will not be re-read within the same merge, the
other write_object_file() call you changed does not have the same
problem as the one here.)

>@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
>                fflush(stdout);
>                BUG("dir_metadata accounting completely off; shouldn't happen");
>        }
>-       if (write_tree(result_oid, &dir_metadata.versions, 0,
>+       if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
>                       opt->repo->hash_algo->rawsz) < 0)
>                ret = -1;
>
> +
> +       if (opt->write_pack)
> +               end_odb_transaction();
> +

Is the end_odb_transaction() here going to fail with an "Unbalanced
ODB transaction nesting" when faced with a recursive merge?

Perhaps flushing here, and then calling end_odb_transaction() in
merge_finalize(), much as you do in your replay-and-write-pack series,
should actually be moved to this commit and included here?

This does mean that for a recursive merge, that you'll get up to 2*N
packfiles, where N is the depth of the recursive merge.

> +       test_commit -C repo base file "$(test_seq 3 5)" &&
> +       git -C repo branch -M main &&
> +       git -C repo checkout -b side main &&
> +       test_commit -C repo side file "$(test_seq 1 5)" &&
> +       git -C repo checkout -b other main &&
> +       test_commit -C repo other file "$(test_seq 3 7)" &&
> +
> +       find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
> +       tree="$(git -C repo merge-tree --write-pack \
> +               refs/tags/side refs/tags/other)" &&

These were the lines from your testcase that I replaced to show the bug.

^ permalink raw reply

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
From: Junio C Hamano @ 2023-11-10 23:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git
In-Reply-To: <20231110220142.GH2758295@coredump.intra.peff.net>

Jeff King <peff@peff.net> writes:

> I think dropping those is a bad direction. The point of adding
> pair_chunk_expect() is that we could use it consistently.

I think so, too.  If we are adding a helper to avoid common
mistakes, and if it cannot be used in some situations, then the
helper needs to be improved.

^ permalink raw reply

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
From: Junio C Hamano @ 2023-11-10 23:38 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King
In-Reply-To: <ZU5Z/Z4PcdNP5U1r@nand.local>

Taylor Blau <me@ttaylorr.com> writes:

> But I don't think we enjoy the same benefits in the MIDX scenario. In
> this case, the num_objects field is just:
>
>     m->num_objects = ntohl(m->chunk_oid_fanout[255])
>
> so I don't think we can make the same guarantees about what is and isn't
> save to compute under size_t arithmetic.

So ..., from the correctness's point of view, if we do not mind
st_mult() dying, the "multiply-and-compare" should give us much more
robust results.  If we do mind st_mult() dying, we could pair the
"truncating-division-and-compare" you wrote with "ensure that chunk
size is a multiple of record size" to achieve the same, I would
think.  I.e.,

	if (chunk_size % pcd->record_size ||
	    chunk_size / pcd->record_size != pcd->record_nr)
		return -1;

or something like that.

^ permalink raw reply

* Re: git-send-email: Send with mutt(1)
From: Junio C Hamano @ 2023-11-10 23:31 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jeff King, git
In-Reply-To: <ZU4widBlHljjg9lL@debian>

Alejandro Colomar <alx@kernel.org> writes:

> [sendemail]
> 	sendmailcmd = mutt -H - && true

Ah, this is fun, an ugly but serviceable trick to ignore the command
line arguments.  I briefly wondered it would work equally well and
much more readable to simply append "#", but the above should work
just fine.

^ permalink raw reply

* Re: [PATCH 2/4] contrib/subtree: stop using `-o` to test for number of args
From: Junio C Hamano @ 2023-11-10 23:27 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jeff King, git
In-Reply-To: <ZU4DiTQbKyuuT55k@tanuki>

Patrick Steinhardt <ps@pks.im> writes:

> On Fri, Nov 10, 2023 at 08:02:32AM +0900, Junio C Hamano wrote:
>> Jeff King <peff@peff.net> writes:
>> 
>> >>  # Usage: process_subtree_split_trailer SPLIT_HASH MAIN_HASH [REPOSITORY]
>> >>  process_subtree_split_trailer () {
>> >> -	assert test $# = 2 -o $# = 3
>> >> +	assert test $# -ge 2
>> >> +	assert test $# -le 3
>> > ...
>> >> -	elif test $# -eq 1 -o $# -eq 2
>> >> +	elif test $# -eq 1 || test $# -eq 2
>> >
>> > OK, this one is a straight-forward use of "||".
>> 
>> Yes, but why not consistently use the range notation like the
>> earlier one here, or below?
>
> I opted to go for the "obvious" conversion, if there was one easily
> available, to make the diff easier to read.

... and due to the limitation of "assert" we cannot do the obvious

	test $# = 2 || test $# = 3

and feed it to "assert" (and for equality with $#, = and -eq would
work equally fine, and = is much more readable, by the way).

OK, then.

^ permalink raw reply

* Re: first-class conflicts?
From: Elijah Newren @ 2023-11-10 22:57 UTC (permalink / raw)
  To: phillip.wood; +Cc: Sandra Snan, git, Martin von Zweigbergk, Randall S. Becker
In-Reply-To: <a1cd2dba-3a74-4b41-8585-209b4a13b8c4@gmail.com>

Hi Phillip,

On Thu, Nov 9, 2023 at 6:45 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
[...]
> > This is a great thing to think about and bring up.  However, I'm not
> > sure what part of it actually needs to be preserved; in fact, it's not
> > clear to me that any of it needs preserving -- especially not the
> > files read by "git commit".  A commit was already created, after all.
> >
> > It seems that CHERRY_PICK_HEAD/REVERT_HEAD files exist primarily to
> > clue in that we are in-the-middle-of-<op>, and the conflict header
> > (the "tree A + tree B - tree C" thing; whatever that's called)
> > similarly provides signal that the commit still has conflicts.
> > Secondarily, these files contain information about the tree we came
> > from and its parent tree, which allows users to investigate the diff
> > between those...but that information is also available from the
> > conflict header in the recorded commit.  The CHERRY_PICK_HEAD and
> > REVERT_HEAD files could also be used to access the commit message, but
> > that would have been stored in the conflicted commit as well.  Are
> > there any other pieces of information I'm missing?
>
> Mainly that I'm an idiot and forgot we were actually creating a commit
> and can store the message and authorship there!

You're definitely not an idiot.  The whole problem space is new and
different, so it's easy to overlook or forget certain details, and
even to make completely different assumptions than others and have no
one aware that we're operating with similar sounding but entirely
different mental models.

> More seriously I think
> being able to inspect the commit being cherry-picked (including the
> original commit message) is useful so we'd need to recreate something
> like CHERRY_PICK_HEAD when the conflict commit is checked out.

So, I see a few issues with this:

1) Even if we were to create CHERRY_PICK_HEAD as you envision, that
doesn't necessarily guarantee the user can view the original commit
because they may not have it.  It may have been a local-only commit
that wasn't pushed or pulled to the person who is now investigating
it.

2a) You highlight the original commit message, but if someone doesn't
want to immediately resolve conflicts, why would they be modifying the
commit message?

2b) Even if users did want to modify the commit message without
resolving conflicts, how would they do so?  Rebasing has
interactivity, but cherry-picking doesn't.  And interactivity seems to
be something people probably wouldn't use together with storing
conflicts; the point of interactivity is to tweak things further and
fix them up, suggesting they'd want to be running in
address-conflicts-now mode.

> Recreating CHERRY_PICK_HEAD is useful for "git status" as well.

"git status" uses this file to determine if it should display
information about currently being in the middle of a cherry-pick
operation.  Putting such a file in place would thus be misleading,
because we aren't in a cherry-pick operation anymore; that has
completed already.  I would not expect the suggested commands printed
by git-status while it thinks we're in such a state (namely, "git
cherry-pick [--continue|--skip|--abort]") to work either.  So, I'd
argue it would be a bug to create such a file when checking out a
conflicted-commit.

Of course, we would want git-status to display information about the
current commit being conflicted, but I think that could be based on
the simple conflict header without additional info.

> I think
> that means storing a little more that just the "tree A + tree B - tree
> C" thing.

I'm totally willing to believe there will be cases where more info is
needed.  I'm suspecting that conflicts with certain kinds of renames,
or which were performed with certain types of strategies or strategy
options might be some examples.  But I'm not sure I'm understanding
why CHERRY_PICK_HEAD should be one of those cases.

> > I think the big piece here is whether we also want to adopt jj's
> > behavior of automatically rebasing all descendant commits when
> > checking out and amending some historical commit (or at least having
> > the option of doing so).  That behavior allows users to amend commits
> > to resolve conflicts without figuring out complicated interactive
> > rebases to fix all the descendant commits across all relevant
> > branches.
>
> That's a potentially attractive option which is fairly simple to
> implement locally as I think you can use the commit DAG to find all the
> descendants though that could be expensive if there are lots of
> branches. However, if we're going to share conflicts I think we'd need
> something like "hg evolve" - if I push a commit with conflicts and you
> base some work on it and then I resolve the conflict and push again you
> would want to your work to be rebased onto my conflict resolution.

Ooh, that's an interesting point.

> To handle "rebase --exec" we could store the exec command and run it when
> the  conflicts are resolved.

So, my assumption is that even if we add the ability to commit
conflicts and even if we default to auto-committing them during
cherry-picks or non-interactive rebases, there will still be people
who want to resolve conflicts as they are hit rather than
auto-committing them, and thus that stop-on-conflict should always be
an option.  In the world where a user has this choice, I think it'd be
rare for users to want to auto-commit conflicts with --exec.  I'd
suggest that --exec, and even --interactive, would default to stopping
on conflicts and waiting for the user to resolve even if
auto-commit-on-conflict is the default in other cases.

That leaves me wondering if there are any cases where users want to
auto-commit conflicts in.conjunction with --exec, which I'm already
struggling to come up with, _and_ that would further want the exec
commands to be preserved in the conflicted commits (and any descendant
commits?) for later usage.  Maybe there's a case for that, but I'm not
coming up with it right now.

Also, another way of looking at this is that my current mental model
is that the cherry-pick or rebase operation is completed once it has
handled each of the commits in its list; the operation does not extend
until all the conflicts in the commits it creates are resolved.  The
fact that rebases do not extend until conflicts are resolved is
important because you can later further rebase conflicted-commits (as
Martin alludes to in his emails); considering the old rebase(s) to
still be in progress while a new one starts might get excessively
complex to handle.  The reason all of this matters to --exec is that
--exec is part of the rebase operation; once the rebase operation is
done, the --exec stuff is also done.  (And thus, if you don't want
--exec to run on conflicted commits, then don't opt for
auto-committing conflicts.).

> Also I wonder how annoying it would be in cases where I just want to
> rebase and resolve the conflicts now. At the moment "git rebase" stops
> at the conflict, with this feature I'd have to go and checkout the
> conflicted commit and fix the conflicts after the rebase had finished.

I agree that would often be annoying.  Personally, I think that
auto-committing conflicts as a feature should at most be an option
(even if perhaps the default in some cases), not a new mandatory
worldview.  And I'm currently not convinced that even if it were
implemented it should be the default in any cases.

> > Without that feature, I agree this might be a bit more
> > difficult,
>
> Yes, when I wrote my original message I was imagining that we'd stop at
> the first conflicting pick and store all the rebase state like some kind
> of stash on steroids so it could be continued when the conflict was
> resolved. It would be much simpler to try and avoid that.

Yeah, this is an example of how completely different mental models we
can come up with when none of us (other than Martin) know much about
the problem space.  I suspect there's at least a few more examples
like this where we still have very different mental models, and
perhaps some gems to be found by mixing and matching them.

> > There are some special state files related to half-completed
> > operations (e.g. squash commits when we haven't yet reached the final
> > one in the sequence, a file to note that we want to edit a commit
> > message once the user has finished resolving conflicts, whether we
> > need to create a new root commit), but again, the operation has
> > completed and commits have been created with appropriate parentage and
> > commit messages so I don't think these are useful anymore either.
>
> Yes, though we may want to remember which commits were squashed together
> so the user can inspect that when resolving conflicts.

Ooh, that's interesting...though it does run into the problem of users
not having access to the original commits.

> > The biggest issue is perhaps that REBASE_HEAD is used in the
> > implementation of `git rebase --show-current-patch`, but all
> > information stored in that is still accessible -- the commit message
> > is stored in the commit, the author time is stored in the commit, and
> > the trees involved are in the conflict header.  The only thing missing
> > is committer timestamp, which isn't relevant anyway.
>
> The commit message may have been edited so we lose the original message
> but I'm not sure how important that is.

Is this a reversal from your comment earlier in your email about the
importance of the original commit message for CHERRY_PICK_HEAD?  :-)

> > The only ones I'm pausing a bit on are the strategy and
> > strategy-options.  Those might be useful somehow...but I can't
> > currently quite put my finger on explaining how they would be useful
> > and I'm not sure they are.
>
> I can't think of an immediate use for them. When we re-create conflicts
> we do it per-file based on the index entries created by the original
> merge so I don't think we need to know anything about the strategy or
> strategy-options.

But we don't have index entries.  We only have trees in this
conflicted commit, and when users check it out, they probably expect
conflicted index entries to be put into place.  Can we correctly
regenerate the right conflicted index entries from the original trees
without the strategy and strategy-options command line flags?  I
suspect there might be problems here, and user-defined merge
strategies could really throw a wrench in the works.  Hmm...

> > Am I missing anything?
>
> exec commands? If the user runs "git rebase --exec" and there are
> conflicts then we'd need to defer running the exec commands until the
> conflicts are resolved. For something like "git rebase --exec 'make
> test'" that should be fine. I wonder if there are corner cases where the
> exec command changes HEAD though.

We talked about exec commands above, as well as the assumption whether
auto-committing conflicts should be mandatory vs. an option, so I
won't repeat that here.  It was definitely a very interesting topic to
bring up though; thanks!

[...]

> Yes, it would certainly be lots of work.

...but even if none of us get time and prioritization to work on it, I
personally find it a really interesting topic to discuss and explore.
Thanks for joining in and bringing up many good points!

^ permalink raw reply

* Re: [PATCH 2/7] commit-graph: read `OIDL` chunk with `pair_chunk_expect()`
From: Jeff King @ 2023-11-10 22:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano
In-Reply-To: <5b3c0b99f8052733bb714122582ab229556c94ef.1699569246.git.me@ttaylorr.com>

On Thu, Nov 09, 2023 at 05:34:14PM -0500, Taylor Blau wrote:

> @@ -435,8 +425,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
>  		error(_("commit-graph required OID fanout chunk missing or corrupted"));
>  		goto free_and_return;
>  	}
> -	if (read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph)) {
> -		error(_("commit-graph required OID lookup chunk missing or corrupted"));
> +	if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDLOOKUP,
> +			      &graph->chunk_oid_lookup, graph->hash_len,
> +			      graph->num_commits)) {
> +		error(_("commit-graph OID lookup chunk is the wrong size"));
>  		goto free_and_return;

I know the original message was vague, but I think the new one is
actively misleading in the case of a missing chunk. We'll say "wrong
size", but it was not present at all!

-Peff

^ permalink raw reply

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
From: Jeff King @ 2023-11-10 22:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git
In-Reply-To: <20231110215747.GG2758295@coredump.intra.peff.net>

On Fri, Nov 10, 2023 at 04:57:47PM -0500, Jeff King wrote:

> One of those patches calls out the truncating division issue, but to
> summarize: IMHO this is OK, as what we really want to know is "is it big
> enough that we can always ask for NR records of size ELEM", which
> division gives us. If we do want to be more precise, but also avoid
> die(), we'd need a variant of st_mult() that returns a boolean. I didn't
> think it was worth it for this case (but arguably it is something that
> would be useful to have in general).

Oh, and obviously there is another option here if we want to be more
careful but don't want to introduce an st_mult() variant: we can use "%"
to check for divisibility ourselves.

I don't think it's worth doing that in every individual size-check, but
maybe it would be in a central pair_chunk_expect().

-Peff

^ permalink raw reply

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
From: Jeff King @ 2023-11-10 22:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano
In-Reply-To: <af5fe3b7237caeba8f970e967933db96c83a230e.1699569246.git.me@ttaylorr.com>

On Thu, Nov 09, 2023 at 05:34:11PM -0500, Taylor Blau wrote:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}

I think this is taking us backwards in terms of the output the user sees
(see the message I just sent elsewhere in the thread). The user won't be
able to tell the difference between a missing chunk and one with the
wrong size.

And we miss the opportunity to give more details about the expected and
detected sizes of the chunks.

If the two-line error message really bothers you, then...

> +int pair_chunk_expect(struct chunkfile *cf,
> +		      uint32_t chunk_id,
> +		      const unsigned char **p,
> +		      size_t record_size,
> +		      size_t record_nr)
> +{
> +	struct pair_chunk_data pcd = {
> +		.p = p,
> +		.record_size = record_size,
> +		.record_nr = record_nr,
> +	};
> +	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
> +}

...this is where we could take a CHUNK_REQUIRED flag, and so:

  ret = read_chunk(...);
  /* pair_chunk_expect_fn() wrote an error already for other cases */
  if (ret == CHUNK_MISSING)
	error("chunk not found");
  return ret;

And then the callers become a very nice:

  if (pair_chunk_expect(cf, id, &ptr, size, nr, CHUNK_REQUIRED))
	return -1;

You probably would need to pass some kind of string giving more context
for the error messages, though. We can show the chunk id, but the
context of "commit-graph" vs "midx" is important.

-Peff

^ permalink raw reply

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
From: Jeff King @ 2023-11-10 22:01 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git
In-Reply-To: <ZU5Z/Z4PcdNP5U1r@nand.local>

On Fri, Nov 10, 2023 at 11:27:41AM -0500, Taylor Blau wrote:

> Hmm. I was thinking of Peff's "commit-graph: handle overflow in
> chunk_size checks", but I think that I was overly eager in applying the
> same reasoning to the MIDX code.
> 
> The important piece of the rationale in that patch is as follows:
> 
>     In the current code this is only possible for the CDAT chunk, but
>     the reasons are quite subtle. We compute g->num_commits by dividing
>     the size of the OIDL chunk by the hash length (since it consists of
>     a bunch of hashes). So we know that any size_t multiplication that
>     uses a value smaller than the hash length cannot overflow. And the
>     CDAT records are the only ones that are larger (the others are just
>     4-byte records). So it's worth fixing all of these, to make it clear
>     that they're not subject to overflow (without having to reason about
>     seemingly unrelated code).
> 
> In particular, that g->num_commits is computed by dividing the length of
> the OIDL chunk by the hash length, thus any size_t multiplication of
> g->num_commits with a value smaller than that hash length cannot
> overflow.
> 
> But I don't think we enjoy the same benefits in the MIDX scenario. In
> this case, the num_objects field is just:
> 
>     m->num_objects = ntohl(m->chunk_oid_fanout[255])
> 
> so I don't think we can make the same guarantees about what is and isn't
> save to compute under size_t arithmetic.

Yes, the logic does not apply to the midx code (just like the graph code
after we switch to using the fanout value later in my series). But that
paragraph was just explaining why only the CDAT chunk was _currently_
vulnerable.

The more interesting question is the paragraphs after that, about
whether it is OK to die() or not when we see overflow (and IMHO it is
not for commit-graphs).

The situation _is_ different there for midx's, because we immediately
die() if we see a bogus midx anyway. But I think that is wrong, and
something we may want to change in the long run. Both because it's the
wrong thing for lib-ification, but also because we can easily keep going
if the midx is broken; we can still use the individual pack idx files.

> I'd be curious what Peff has to say, but if he agrees with me, I'd
> recommend taking the first five patches, and dropping the two
> MIDX-related ones.

I think dropping those is a bad direction. The point of adding
pair_chunk_expect() is that we could use it consistently.

-Peff

^ permalink raw reply

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
From: Jeff King @ 2023-11-10 21:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git
In-Reply-To: <xmqqedgyw6jv.fsf@gitster.g>

On Fri, Nov 10, 2023 at 01:55:48PM +0900, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
> 
> I know one of the original places did the "divide the whole by
> per-record size and see if it matches the number of records", the
> same as we see above, but the check in the above could also be
> 
> 	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
> 		return -1;
> 
> which would also catch the case where chunk_size is not a multiple
> of the record size.  Your conversion of OOFF in midx.c loses this
> protection as the original uses the multiplication-and-compare, but
> the rewrite to call pair_chunk_expect would call the above and
> checks with the truncating-divide-and-compare.
> 
> Does the distinction matter?  I dunno.  If the record/chunk
> alignment is asserted elsewhere, then the distinction should not
> matter, but even if it were, seeing a truncating division used in
> any validation makes my skin tingle.

Yes, the distinction does matter. If pair_chunk_expect_fn() used
st_mult(), then it would die on overflow, rather than returning an
error. For commit-graph files this is propagated up, and we continue the
operation by falling back to the non-graph code. There's a test in t5318
that will fail in this case. See patches 1 and 6 in my series for more
discussion.

One of those patches calls out the truncating division issue, but to
summarize: IMHO this is OK, as what we really want to know is "is it big
enough that we can always ask for NR records of size ELEM", which
division gives us. If we do want to be more precise, but also avoid
die(), we'd need a variant of st_mult() that returns a boolean. I didn't
think it was worth it for this case (but arguably it is something that
would be useful to have in general).

-Peff

^ permalink raw reply

* Re: [PATCH 6/9] commit-graph: use fanout value for graph size
From: Jeff King @ 2023-11-10 21:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git
In-Reply-To: <ZU1Z6WUrEhcxLBTO@nand.local>

On Thu, Nov 09, 2023 at 05:15:05PM -0500, Taylor Blau wrote:

> > This is one of the ways that pair_chunk_expect() could do better without
> > adding a lot of code, because it can check the return value from
> > read_chunk(). It doesn't help the other cases (like OIDF) that still
> > have to call read_chunk() themselves, though. Possibly read_chunk()
> > should just take a flag to indicate that it should complain when the
> > chunk is missing. And then callers could just do:
> >
> >   if (read_chunk(cf, id, our_callback, CHUNK_REQUIRED)
> > 	return -1; /* no need to complain; either our_callback() did, or
> > 	              read_chunk() itself */
> 
> We do return CHUNK_NOT_FOUND when we have a missing chunk, which we
> could check for individually. But TBH, I don't find the first error all
> that useful. In this instance, there's really only one way for the OIDL
> chunk to be corrupt, which is that it has the wrong size.

But aren't there two ways it can go wrong? It can be the wrong size, or
it can be missing. In the first we say:

  error: wrong size
  error: missing or corrupted

and in the second we say:

  error: missing or corrupted

Which is why I think issuing a message from the callback has value. Of
course the ideal would be:

  error: wrong size

and:

  error: missing

but I didn't think it was worth making the code much longer for an error
condition we don't really expect anybody to see in practice.

And also...

> And short of making the error much more robust, e.g.:
> 
>     error: commit-graph OID lookup chunk is the wrong size (got: $X, wanted: $Y)

...yes, I think that would be worth doing, especially if you are
centralizing the error messages in pair_chunk_expect(). But your series
goes the opposite direction.

> and dropping the generic "missing or corrupt" error, I think that just
> the generic error is fine here.

If you drop that error, then who will warn about the missing-chunk case?
The user would see nothing at all.

-Peff

^ permalink raw reply

* Re: [PATCH] format-patch: fix ignored encode_email_headers for cover letter
From: Jeff King @ 2023-11-10 21:48 UTC (permalink / raw)
  To: Simon Ser; +Cc: René Scharfe, git, Junio C Hamano
In-Reply-To: <VTz8XT3MCqWUh1HFQon62NxmGJiYFfNmBeWTtR8MwmeuaSkovfBJ02P-S79GsD94XwlxlrL6W80YZ8OwfpDd1BqA0E4GwFQlDKN5DWq0Qtg=@emersion.fr>

On Fri, Nov 10, 2023 at 10:36:22AM +0000, Simon Ser wrote:

> > I don't think that answering those questions needs to hold up your
> > patch. We can take it as a quick fix for a real bug, and then anybody
> > interested can dig further as a separate topic on top.
> 
> These are good questions indeed. Unfortunately I don't hink I'll have time to
> work on this though.

That's OK. I think it's fine to stop here for now.

> > Some of these long lines (and the in-string newlines!) make this ugly
> > and hard to read. But it is also just copying the already-ugly style of
> > nearby tests. So I'm OK with that. But a prettier version might be:
> > 
> >   test_expect_success 'cover letter respects --encode-email-headers' '
> >         test_config branch.rebuild-1.description "Café?" &&
> >         git checkout rebuild-1 &&
> >         git format-patch --stdout --encode-email-headers \
> >                 --cover-letter --cover-from-description=subject \
> >                 main >actual &&
> >         ...
> >   '
> 
> Yeah, that sounds better indeed. Let me know if you want me to resend a cleaner
> version of the test.

I don't have a strong opinion, so I'd leave it up to you.

> > I also wondered if we could be just be testing this much more easily
> > with another header like "--to". But I guess that would be found in both
> > the cover letter and the actual patches (we also don't seem to encode
> > it even in the regular patches; is that a bug?).
> 
> That sounds like another bug indeed… But maybe that'll be harder to fix. To
> Q-encode this field one needs to split off the full name and actual mail
> address ("André <andre@example.org>" would be split into "André" and
> "andre@example.org"), then Q-encode the full name, then join the strings
> together again. In particular, it's incorrect to Q-encode the full string.

Yeah, without even looking at the code, I had a suspicion that this
would be an issue. I doubt that format-patch is doing much parsing at
all of what we feed it via --to.

-Peff

^ permalink raw reply

* Re: [PATCH v2 0/4] Replace use of `test <expr> -o/a <expr>`
From: Jeff King @ 2023-11-10 21:46 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano
In-Reply-To: <cover.1699609940.git.ps@pks.im>

On Fri, Nov 10, 2023 at 11:01:11AM +0100, Patrick Steinhardt wrote:

> this is the second version of my patch series that replaces all uses of
> `test <expr> -o/a <expr`.
> 
> Changes compared to v1:
> 
>     - I've expanded a bit on why we want to do these conversions in the
>       first place in the first commit message.
> 
>     - Dropped a needless subshell and added missing quoting while at it.
> 
>     - Explained why we need to decompose the asserts in the second patch
>       into two asserts.

These look OK to me. I mentioned a small nit on the first patch, but I
am OK with ignoring it (and we are reaching diminishing returns
polishing an otherwise trivial series).

-Peff

^ permalink raw reply

* Re: [PATCH v2 1/4] global: convert trivial usages of `test <expr> -a/-o <expr>`
From: Jeff King @ 2023-11-10 21:44 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano
In-Reply-To: <2967c8ebb460934eb4aaaaebe5941bff643d4a94.1699609940.git.ps@pks.im>

On Fri, Nov 10, 2023 at 11:01:15AM +0100, Patrick Steinhardt wrote:

> diff --git a/t/valgrind/valgrind.sh b/t/valgrind/valgrind.sh
> index 669ebaf68be..9fbf90cee7c 100755
> --- a/t/valgrind/valgrind.sh
> +++ b/t/valgrind/valgrind.sh
> @@ -23,7 +23,7 @@ memcheck)
>  	VALGRIND_MAJOR=$(expr "$VALGRIND_VERSION" : '[^0-9]*\([0-9]*\)')
>  	VALGRIND_MINOR=$(expr "$VALGRIND_VERSION" : '[^0-9]*[0-9]*\.\([0-9]*\)')
>  	test 3 -gt "$VALGRIND_MAJOR" ||
> -	test 3 -eq "$VALGRIND_MAJOR" -a 4 -gt "$VALGRIND_MINOR" ||
> +	( test 3 -eq "$VALGRIND_MAJOR" && test 4 -gt "$VALGRIND_MINOR" ) ||
>  	TOOL_OPTIONS="$TOOL_OPTIONS --track-origins=yes"

I was surprised to see this one as a subshell after you adjusted the
other. It probably isn't that big a deal either way, though (as a style
thing I generally try to use braces unless I am relying on the separate
environment provided by the subshell, but it's certainly not wrong in
this case).

-Peff

^ permalink raw reply

* Re: first-class conflicts?
From: Elijah Newren @ 2023-11-10 21:41 UTC (permalink / raw)
  To: Martin von Zweigbergk; +Cc: phillip.wood, Sandra Snan, git, Randall S. Becker
In-Reply-To: <CAESOdVBjEYAp+P_mYdByYrPmbiu9DWL=Z_r19H8D9bxkJrquFA@mail.gmail.com>

Hi Martin,

On Wed, Nov 8, 2023 at 10:23 AM Martin von Zweigbergk
<martinvonz@google.com> wrote:
> On Tue, Nov 7, 2023 at 11:31 PM Elijah Newren <newren@gmail.com> wrote:
> > On Tue, Nov 7, 2023 at 9:38 AM Martin von Zweigbergk
> > <martinvonz@google.com> wrote:
> > >
[...]
> > I am curious more about the data you do store.  My fuzzy memory is
> > that you store a commit header involving something of the form "A + B
> > - C", where those are all commit IDs.  Is that correct?
>
> We actually store it outside the Git repo (together with the "change
> id"). We have avoided using commit headers because I wasn't sure how
> well different tools deal with unexpected commit headers, and because
> I wanted commits to be indistinguishable from commits created by a
> regular Git binary. The latter argument doesn't apply to commits with
> conflicts since those are clearly not from a regular Git binary
> anyway, and we don't allow pushing them to a remote.
>
> >  Is this in
> > addition to a normal "tree" header as in Git, or are one of A or B
> > found in the tree header?
>
> It's in addition. For the tree, we actually write a tree object with
> three subtrees:
>
> .jjconflict-base-0: C
> .jjconflict-side-0: A
> .jjconflict-side-1: B
>
> The tree is not authoritative - we use the Git-external storage for
> that. The reason we write the trees is mostly to prevent them from
> getting GC'd.

Oh, that seems like a clever way to handle reachability and make sure
the relevant trees are automatically included in any pushes or pulls.

> Also, if a user does `git checkout <conflicted commit>`,
> they'll see those subdirectories and will hopefully be reminded that
> they did something odd (perhaps we should drop the leading `.` so `ls`
> will show them...). They can also diff the directories in a diff tool
> if they like.

Oh, so they don't get a regular top-level looking tree with
possibly-conflicted-files present?  Or is this in addition to the
regular repository contents?  If in addition, are you worried about
users ever creating real entries named ".jjconflict-base-<N>" in their
repository?

> >  I think you said there was also the
> > possibility for more than three terms.  Are those for when a
> > conflicted commit is merged with another branch that adds more
> > conflicts, or are there other cases too?  (Octopus merges?)
>
> Yes, they can happen in both of those cases you mention. More
> generally, whenever you apply a diff between two trees onto another
> tree, you might end up with a higher-arity conflict. So merging in
> another branch can do that, or doing an octopus merge (which is the
> same thing at the tree level, just different at the commit level), or
> rebasing or reverting a commit.
>
> We simplify conflicts algebraically, so rebasing a commit multiple
> times does not increase the arity - the intermediate parents were both
> added and removed and thus cancel out. These simple algorithms for
> simplifying conflicts are encapsulated in
> https://github.com/martinvonz/jj/blob/main/lib/src/merge.rs. Most of
> them are independent of the type of values being merged; they can be
> used for doing algebra on tree ids, content hunks, refs, etc. (in the
> test cases, we mostly merge integers because integer literals are
> compact).

It's done on content hunks as well?  That's interesting.

When exactly would it be done on refs, though?  I'm not following that one.

And what else is in that "etc."?

> > What about recursive merges, i.e. merges where the two sides do not
> > have a unique merge base.  What is the form of those?  (Would "- C" be
> > replaced by "- C1 - C2 - ... - Cn"?  Or would we create the virtual
> > merge base V and then do a " - V"?  Or do we only have "A + B"?)
>
> We do that by recursively creating a virtual tree just like Git does,
> I think (https://github.com/martinvonz/jj/blob/084b99e1e2c42c40f2d52038cdc97687b76fed89/lib/src/rewrite.rs#L56-L71).
> I think the main difference is that by modeling conflicts, we can
> avoid recursive conflict markers (if that's what Git does), and we can
> even automatically resolve some cases where the virtual tree has a
> conflict.

Okay, but that talks about the mechanics of creating a recursive
merge, omitting all the details about how the conflict header is
written when you record the merge.  Is the virtual merge base
represented in the algebraic "A + B - C" expressions, or is the "- C"
part omitted?  If it is represented, and the virtual merge base had
conflicts which you could not automatically resolve, what exactly does
the conflicted header for the outer merge get populated with?

[...]

> Great questions! We don't have support for renames, so we haven't had
> to worry about these things. We have talked a little about divergent
> renames and the need for recording that in the commit so we can tell
> the user about it and maybe ask them which name they want to keep. I
> had not considered the interaction with partial conflict resolution,
> so thanks for bringing that up. I don't have any answers now, but
> we'll probably need to start thinking about this soon.

I was wondering if that might be the answer.  When you do tackle this,
I'd be interested to hear your thoughts.  I'm wondering if we just
need to augment the data in the conflict header to handle such cases
(though I guess this could risk having commit objects that are
significantly bigger than normal in theoretical cases where many such
paths are involved?)

> > I'm curious to hear what happens when you do start dogfooding, on
> > projects with many developers and which are jj-only.  Do commits with
> > conflicts accidentally end up in mainline branches, or are there good
> > ways to make sure they don't hit anything considered stable?
>
> That won't happen at Google because our source of truth for "merged
> PRs" (in GitHub-speak) is in our existing VCS. We will necessarily
> have to translate from jj's data model to its data model before a
> commit can even be sent for review.

That makes sense, but I was just hoping we'd have an example to look
to for how to keep things safe if we were to implement this.  Sadly, I
don't think we have the benefit of relying on folks to first push
their commits into some other VCS which lacks this feature.  ;-)

^ permalink raw reply

* Re: [PATCH v4 0/3] t: improve compatibility with NixOS
From: Jeff King @ 2023-11-10 21:41 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano
In-Reply-To: <cover.1699596457.git.ps@pks.im>

On Fri, Nov 10, 2023 at 09:16:56AM +0100, Patrick Steinhardt wrote:

> this is the fourth version of my patch series to improve compatibility
> of our test suite with NixOS.
> 
> Three changes compared to v3:
> 
>     - Switched from `test <expr> -a <expr>` to `test <expr> && test
>       <expr>`.
> 
>     - Improved the commit message to explain why the new
>       runtime-detected paths are only used as a fallback.
> 
>     - Rebased on top of 0e3b67e2aa (ci: add support for GitLab CI,
>       2023-11-09), which has been merged to `next` and caused conflicts.

Thanks, these all look good to me.

-Peff

^ permalink raw reply

* Re: git-send-email: Send with mutt(1)
From: Jeff King @ 2023-11-10 21:41 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: git
In-Reply-To: <ZU4widBlHljjg9lL@debian>

On Fri, Nov 10, 2023 at 02:30:49PM +0100, Alejandro Colomar wrote:

> > Having it directly in sendmailcmd causes some glitch: It repeats all CCs
> > in TO.  See a log:

Ah, right. That makes sense. send-email has to pass all of the envelope
recipients on the command-line, which mutt then interprets as
destinations to add to "to". Mutt is smart enough to de-duplicate the
"to", but it (correctly) allows duplicate to/cc. (My basic test didn't
have any cc's).

> > So maybe we need the wrapper script to ignore the arguments.
> 
> Heh!  The following trick works as well, without needing a script:
> 
> [sendemail]
> 	sendmailcmd = mutt -H - && true
> 
> It probably relies too much on git-send-email(1)'s current
> implementation, but it works.  :)

I was going to suggest the same thing. And no, I don't think it is
relying on any send-email implementation details. Git always tries to
feed command-invoking config like this to the shell for consistency (and
to allow tricks like this).

-Peff

^ permalink raw reply

* Re: git-send-email: Send with mutt(1)
From: Alejandro Colomar @ 2023-11-10 21:06 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Jeff King, git
In-Reply-To: <5yn2w52iymgnobesoi2jdwpyzaf5foc4sytxfdzl4vavlox62j@7att57ya6scz>

[-- Attachment #1: Type: text/plain, Size: 4055 bytes --]

Hi Konstantin,

On Thu, Nov 09, 2023 at 12:59:23PM -0500, Konstantin Ryabitsev wrote:
> On Thu, Nov 09, 2023 at 06:42:19PM +0100, Alejandro Colomar wrote:
> > I haven't yet tried b4(1), and considered trying it some time ago, but
> > really didn't find git-send-email(1) and mutt(1) so difficult to use
> > that b4(1) would simplify much.
> 
> Well, sending is only a small part of what b4 will do for you -- the core
> benefits are really cover letter management, automatic versioning and
> simplified list trailer collection. It's all tailored to kernel needs, but it
> will work for any project that depends on mailed patches.
> 
> > But I have tried patatt(1) before, which is what I think b4(1) uses for
> > signing.  Here are some of my concerns about patatt(4):
> > 
> > It doesn't sign the mail, but just the patch.
> 
> Well, no, it signs the entire thing, not just the patch, but it's true that
> it's specifically targeted at patches (hence the name).
> 
> > There's not much
> > difference, if any, in authenticability terms, but there's a big
> > difference in usability terms:
> > 
> > To authenticate a given patch submitted to a mailing list, the receiver
> > needs to also have patatt(1) configured.  Otherwise, it looks like a
> > random message.
> 
> Yes, but that's a feature.
> 
> > MUAs normally don't show random headers (patatt(1)
> > signs by adding the signature header), so unless one is searching for
> > that header, it will be ignored.  This means, if I contribute to other
> > projects, I need to tell them my patch is signed via patatt(1) in
> > order for them to verify.  If instead, I sign the email as usual with my
> > MUA, every MUA will recognize the signature by default and show it to
> > the reader.
> 
> I go into this in the FAQ for patatt:
> https://github.com/mricon/patatt#why-not-simply-pgp-sign-all-patches
> 
> Basically, developers really hated getting patches signed with PGP, either
> inline or MIME, which is why it never took off. Putting it into the header
> where it's not seen except by specialized tooling was a design choice.

It would be interesting if MUAs would support PGP signatures in a
header.  Did you consider that option back then?  Maybe patching a
mutt(1) or neomutt(1) to do that would have been simpler than developing
patatt(1).

> 
> > It also doesn't allow encrypting mail, so let's say I send some patch
> > fixing a security vulnerability, I'll need a custom tool to send it.  If
> > instead, I use mutt(1) to send it signed+encrypted to a mailing list
> > that provides a PGP public key, I can reuse my usual tools.
> 
> Right, the goal was really *just* attestation. For encrypted patch exchange we
> have remail (https://korg.docs.kernel.org/remail.html), which worked
> significantly better than any other alternative we've considered.
> 
> > Also, and I don't know if b4(1) handles this automagically, but AFAIR,
> > patatt(1) didn't: fo signing a patch, I had to configure per-directory
> > with `patatt install-hook`.  I have more than a hundred git worktrees
> > (think of dozens of git repositories, and half a dozen worktrees --see
> > git-worktree(1)-- per repository).  If I need to configure every one of
> > those worktrees to sign all of my patches, that's going to be
> > cumbersome.  Also, I scrape and re-create worktrees for new branches
> > all the time, so I'd need to be installing hooks for patatt(1) all the
> > time.  Compare that to having mutt(1) configured once.  It doesn't
> > scale that well.
> 
> Also true -- patatt was really envisioned as a library for b4, where you can
> configure patch signing in your ~/.gitconfig for all projects.

Overall, I think mutt(1) works better for me than patatt(1) via b4(1)
for crypto operations.  I don't use software that doesn't work with
PGP/MIME, and it's more versatile.  I'm still curious about the other
features of b4(1), so I'll try it for those.

Thanks!
Alex

> 
> -K

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: Error when "git mv" file in a sparsed checkout
From: Elijah Newren @ 2023-11-10 20:11 UTC (permalink / raw)
  To: Josef Wolf, git
In-Reply-To: <20231108113636.GT7041@raven.inka.de>

Hi,

On Wed, Nov 8, 2023 at 3:38 AM Josef Wolf <jw@raven.inka.de> wrote:
>
> Thanks for the reply, Elijah!
>
[...]
> > > Error message suggests, there already exists a file named "new-filename". This
> > > is not true at all. There is no file named "new-filename" in the entire
> > > repository. Not in any directory of any branch.
> >
> > You are correct; the wording of the error message here is suboptimal
> > and seems to have been focused more on the git-add case (the error
> > message is shared by git-add, git-mv, and git-rm).  Thanks for
> > pointing it out!  We could improve that wording, perhaps with
> > something like:
> >
> >     The following paths and/or pathspecs match paths that are
> >     outside of your sparse-checkout definition, so will not be
> >     updated:
> >
> > Which is still slightly slanted towards git-add and git-rm cases, but
> > I hope it works better than the current message.  Thoughts?
>
> Yes, the wording was pretty much confusing me, since i could not find a file
> named "new-file" anywhere in the repo.
>
>
> There are more things confusing concerning sparse mode:

Sweet, thanks for taking the time to write these up.  It certainly
helps confirm some of the directions we picked and changes we made to
make things a little clearer, and helps us continue working in that
direction.  Some comments below on individual points...

> - It is not clear from git-sparse-checkout(1) when changes to
>   $GIT_DIR/info/sparse-checkout are catched up. In my case: would it be enough
>   to add the new pathname just before git-mv or would a fresh git-checkout be
>   needed after modifying $GIT_DIR/info/sparse-checkout? You have clarified
>   this in your response, but shouldn't this be clear from the manpage?

I believe at least part of this confusion is due to using the old
style of handling sparse checkouts; namely, by actually editing
$GIT_DIR/info/sparse-checkout.  We have taken pains to guide people
away from that workflow, because it is both more work, and leads to
more confusion.  If you instead do a
   git sparse-checkout set --no-cone <pattern1> <pattern2> ... <patternN>
or a
   git sparse-checkout add <another-pattern>
then the sparse-checkout command handles populating the
$GIT_DIR/info/sparse-checkout for you as well as any needed checkout,
meaning that there isn't a "catch up" step as there traditionally was.
And it makes it a bit clearer that if you add some path to your
sparse-checkout, then your sparse-checkout is ready to handle the
additional path right away.

> - git-sparse-checkout(1) refers to "skip-worktree bit". This concept is
>   potentially not very familiar to the average git user which uses mostly
>   porcelain. Thus, edge cases remain to be unclear.

I totally agree that the "skip-worktree bit" is something we should
avoid exposing to the user.  I called it out previously:
"""
Most sparse checkout users are unaware of this implementation
detail, and the term should generally be avoided in user-facing
descriptions and command flags.  Unfortunately, prior to the
`sparse-checkout` subcommand this low-level detail was exposed,
and as of time of writing, is still exposed in various places.
"""

However, we should also note that you reported using v2.34.1, which is
really quite old.  The current git-sparse-checkout(1) has been almost
completely overhauled in the meantime:

$ git show v2.34.1:Documentation/git-sparse-checkout.txt | wc -l
263
$ git diff --stat v2.34.1 v2.43.0-rc1 -- Documentation/git-sparse-checkout.txt
 Documentation/git-sparse-checkout.txt | 446 +++++++++++++++++++++++++---------
 1 file changed, 333 insertions(+), 113 deletions(-)

And, in particular, "skip-worktree" doesn't appear until quite a bit
later in the file, and then only appears in a section labelled
"INTERNALS -- SPARSE CHECKOUT".

> - The pathspecs refers to .gitignore (which by itself is not very clear). But
>   there are differences:
>   1. giignore is relative to containing directoy, which don't seem to make
>      much sense for sparse mode
>   2. sparse specs are the opposite of gitignore, which seems to have different
>      meaning in some edge-cases.

Yeah, copying .gitignore syntax and merely referring to the gitignore
manual for specification of the patterns was a huge design mistake.
I've hated it since the beginning.  The internals actually managed to
make it *even more* confusing for quite some time as it referred to
everything as "excludes", regardless of whether used for gitignore or
sparse-checkout.  But yeah, these and other inherent problems with
non-cone mode are called out in the "INTERNALS -- NON-CONE PROBLEMS"
section of the manual, and is a big piece of why we recommend users
migrate away from it if possible.

> - For cone, it is not clear how the two "accepted patterns" look like what the
>   semantics are.

Yeah, it is a bit complex, and I advocated for just using a different
control file for cone mode to help us step away from the blight of
$GIT_DIR/info/sparse-checkout and its inherent tie to gitignore.  I
didn't win that one.

However, why does it actually matter?  You shouldn't be bothering with
the patterns or editing the $GIT_DIR/info/sparse-checkout file for
cone mode.  You just
    git sparse-checkout set <directory1> <directory2> ... <directoryN>
and then the file will be set up for you.

>  I understand that specifying a directory adds siblings
>   recursively. But what does the "Parent" mode mean exactly and when/how is
>   this recognized? I guess, this is just a mis-namer? IMHO, parent of /a/b/c would
>   be /a/b and not /a/b/c/* (as git-sparse-chekout(1) suggests).

No, /a/b/c/* is not the parent, that's the portion that says that all
things below a/b/c (i.e. all descendant files) should be included.

The parent parts would be where it adds /a/b/ and /a/ as patterns to
ensure things directly under a/ and directly under a/b/ are included.

I think the newer version of the manual might explain this a little
better, but honestly, attempting to explain it is a losing battle.
Users shouldn't read or edit the sparse-checkout file in cone mode.
We let them, but we strongly recommend against it.  Just pass the
actual directory names to `git sparse-checkout {set,add}` and let it
take care of the patterns for you.

^ permalink raw reply

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
From: Taylor Blau @ 2023-11-10 16:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King
In-Reply-To: <xmqqedgyw6jv.fsf@gitster.g>

On Fri, Nov 10, 2023 at 01:55:48PM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
>
> I know one of the original places did the "divide the whole by
> per-record size and see if it matches the number of records", the
> same as we see above, but the check in the above could also be
>
> 	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
> 		return -1;
>
> which would also catch the case where chunk_size is not a multiple
> of the record size.  Your conversion of OOFF in midx.c loses this
> protection as the original uses the multiplication-and-compare, but
> the rewrite to call pair_chunk_expect would call the above and
> checks with the truncating-divide-and-compare.

Hmm. I was thinking of Peff's "commit-graph: handle overflow in
chunk_size checks", but I think that I was overly eager in applying the
same reasoning to the MIDX code.

The important piece of the rationale in that patch is as follows:

    In the current code this is only possible for the CDAT chunk, but
    the reasons are quite subtle. We compute g->num_commits by dividing
    the size of the OIDL chunk by the hash length (since it consists of
    a bunch of hashes). So we know that any size_t multiplication that
    uses a value smaller than the hash length cannot overflow. And the
    CDAT records are the only ones that are larger (the others are just
    4-byte records). So it's worth fixing all of these, to make it clear
    that they're not subject to overflow (without having to reason about
    seemingly unrelated code).

In particular, that g->num_commits is computed by dividing the length of
the OIDL chunk by the hash length, thus any size_t multiplication of
g->num_commits with a value smaller than that hash length cannot
overflow.

But I don't think we enjoy the same benefits in the MIDX scenario. In
this case, the num_objects field is just:

    m->num_objects = ntohl(m->chunk_oid_fanout[255])

so I don't think we can make the same guarantees about what is and isn't
save to compute under size_t arithmetic.

I'd be curious what Peff has to say, but if he agrees with me, I'd
recommend taking the first five patches, and dropping the two
MIDX-related ones.

Thanks,
Taylor

^ permalink raw reply

* Re: git-send-email: Send with mutt(1)
From: Alejandro Colomar @ 2023-11-10 13:30 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <ZU1-l4PwMU5H4_VN@debian>

[-- Attachment #1: Type: text/plain, Size: 2964 bytes --]

On Fri, Nov 10, 2023 at 01:51:34AM +0100, Alejandro Colomar wrote:
> Hi Jeff,
> 
> On Thu, Nov 09, 2023 at 01:03:08PM -0500, Jeff King wrote:
> > On Thu, Nov 09, 2023 at 04:26:23PM +0100, Alejandro Colomar wrote:
> > 
> > > I've tried something even simpler:
> > > 
> > > ---8<---
> > > #!/bin/sh
> > > 
> > > mutt -H -;
> > > --->8---
> > > 
> > > I used it for sending a couple of patches to linux-man@, and it seems to
> > > work.  I don't have much experience with mutt, so maybe I'm missing some
> > > corner cases.  Do you expect it to not work for some case?  Otherwise,
> > > we might have a winner.  :)
> > 
> > Wow, I don't know how I missed that when I read the manual. That was
> > exactly the feature I was thinking that mutt would need. ;)
> > 
> > So yeah, that is obviously better than the "postponed" hackery I showed.
> > I notice that "-H" even causes mutt to ignore "-i" (a sendmail flag that
> > Git adds to sendemail.sendmailcmd). So you can just invoke it directly
> > from your config like:
> > 
> >   git config sendemail.sendmailcmd "mutt -H -"
> 
> Having it directly in sendmailcmd causes some glitch: It repeats all CCs
> in TO.  See a log:
> 
> 	Send this email? ([y]es|[n]o|[e]dit|[q]uit|[a]ll): y
> 	OK. Log says:
> 	Sendmail: mutt -H - -i kevin@8t8.us mutt-dev@mutt.org alx@kernel.org e.sovetkin@gmail.com neomutt-devel@neomutt.org
> 	From: Alejandro Colomar <alx@kernel.org>
> 	To: Kevin McCarthy <kevin@8t8.us>,
> 		mutt-dev@mutt.org
> 	Cc: Alejandro Colomar <alx@kernel.org>,
> 		Jenya Sovetkin <e.sovetkin@gmail.com>,
> 		neomutt-devel@neomutt.org
> 	Subject: [PATCH] send.c: Allow crypto operations in batch and mailx modes.
> 	Date: Fri, 10 Nov 2023 01:41:24 +0100
> 	Message-ID: <20231110004128.5972-2-alx@kernel.org>
> 	X-Mailer: git-send-email 2.42.0
> 	MIME-Version: 1.0
> 	Content-Transfer-Encoding: 8bit
> 
> 	Result: OK
> 
> The sent mail ended up being
> 
> 	From: Alejandro Colomar <alx@kernel.org>
> 	To: Kevin McCarthy <kevin@8t8.us>, mutt-dev@mutt.org, alx@kernel.org,
> 		e.sovetkin@gmail.com, neomutt-devel@neomutt.org
> 	Cc: Alejandro Colomar <alx@kernel.org>,
> 		Jenya Sovetkin <e.sovetkin@gmail.com>, neomutt-devel@neomutt.org
> 
> So maybe we need the wrapper script to ignore the arguments.

Heh!  The following trick works as well, without needing a script:

[sendemail]
	sendmailcmd = mutt -H - && true

It probably relies too much on git-send-email(1)'s current
implementation, but it works.  :)

Cheers,
Alex

> 
> Cheers,
> Alex
> 
> > 
> > Annoyingly, "-E" doesn't work when reading over stdin (I guess mutt
> > isn't willing to re-open the tty itself). But if you're happy with not
> > editing as they go through, then "-H" is then that's enough (in my
> > workflow, I do the final proofread via mutt).
> > 
> > -Peff
> 
> -- 
> <https://www.alejandro-colomar.es/>



-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: Git Rename Detection Bug
From: Jeremy Pridmore @ 2023-11-10 11:28 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git@vger.kernel.org, Paul Baumgartner
In-Reply-To: <CABPp-BHYaxa7QoXabM=7hW-93hQLK-=KayGtDHtWxxdAnJCcJw@mail.gmail.com>

Hi Elijah,

Many thanks for your reply, the detail is much appreciated.  I was aware, from recently read articles, that git doesn't record renames as such, hence my investigations into the rename detection, but I also found some interesting points in your email, such as the "git status --no-renames" flag.

I think the issue I'm encountering is described by what you say here:
"Exact renames are detected first, before any other method of rename
detection is employed, and other than giving a preference to files
with the same basename, if there are multiple choices it just picks
one (what I'd call at random, though technically based on what the
internal processing order happens to be)"

That is close to the behaviour I'm seeing.  As I mentioned, git seems to think a file has been deleted and then as it continues to detect renames, it's as if it is going through lists of "Local-Base" and "Base-Remote" changes trying to match them up, but the directories of the files being matched are "offset", as highlighted by this list of mismatches:

(I'd put the paths in a table for easier analysis, but for some reason the emails need to be plain text)
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx

Given git compares both the content and the directory\filenames, and as the repositories have unrelated histories, the "Base" file is going to be empty, therefore, even if Local and Remote are identical, they are both 100% different to Base.  That given, I'm not sure why git would state that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx and Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 are a "both added" conflict given their file names and paths are completely different.  Any ideas?

I wrote a script to resolve the conflicts best I can which categorises the files into sets according to the file status (i.e. "added by them", "added by us" etc), and then either does a "git checkout head -- <file>" or a "git rm <file>" based upon which set the file is in and whether it is in another set or not.  This has worked really well and helped me through the large changeset with 3k conflicts.

As git only needs to try and match files in the "deleted by us" and "deleted by them" sets (although including the "deleted in both" set would allow matching renames/moves on both sides), an idea for a potential improvement to the matching algorithm (where you say there's a comment "too many alternatives, pick one") could be to compute a "difference value" for the path\filename of those files in one of the other sets (i.e. "added by us", "added by them" or "added in both"), and chose a potential rename based upon the smallest calculated difference.  The difference value would be the number of differences in folder names, e.g.

deleted in both: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1

added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
(path\name difference = 2)
added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
(path\name difference = 1)
added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
(path\name difference = 2)

So, given the above, git would chose the second "added in both" entry.

Food for thought?  Happy to discuss the idea further.

Regards,

Jeremy Pridmore
Lead Solution Architect

From: Elijah Newren <newren@gmail.com>
Sent: 07 November 2023 08:05
To: Jeremy Pridmore <jpridmore@rdt.co.uk>
Cc: git@vger.kernel.org <git@vger.kernel.org>; Paul Baumgartner <pbaumgartner@rdt.co.uk>
Subject: Re: Git Rename Detection Bug

[You don't often get email from newren@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]

Caution: This email originated from outside of the organisation. Please treat any attachments or links with caution. If in doubt please contact IT

Hi,

On Mon, Nov 6, 2023 at 4:01 AM Jeremy Pridmore <jpridmore@rdt.co.uk> wrote:
>
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.

I think it might be worthwhile to point out a few facts about rename
handling in Git, as background information that might clarify a few
things about how Git's mental model seems to differ from yours:

  * In git, renames are not tracked; they are detected (based on file
similarity of the commits being compared).
  * So, when you run "git mv A B", there is no rename recorded.  It's
basically the same as "git rm A", followed by creating B with the same
contents, followed by "git add B".
  * The detection happens whenever an operation (diff, log -p, merge,
status, etc.) needs or wants to know about renames.
  * In git, directory renames are detected _after_ normal renames, and
via amalgamation of the individual renames.
  * As a corollary of the last item, the only way individual renames
can be affected by directory renames, is if the individual rename on
one side of history was into a directory that the other side of
history renamed away; in such a case, we apply an _extra_ rename to
move it into the new directory.  But we don't "undo" individual
renames to make them fit the majority-determined directory rename or
anything like that.

(Also, if it matters, all of this is true of both `recursive` and
`ort` merge strategies, i.e. the old default merge backend and the new
one.)

> What did you do before the bug happened? (Steps to reproduce your issue)
> I have two GIT repositories (A and B). Both migrated from the same TFS server using git-tfs tool. I migrated code into A and made lots of changes, including moving 50,000+ files from folder "/Landscape" to "/Landscape/src".  B contains the same code but with various other changes made since my original migration from TFS to A.  All the files in B are still in the "/Landscape" folder.  I recently needed to merge my changes from A to B, so I added A as a remote to B and then performed a number of cherry-picks from A to B, but got stuck when trying to cherry-pick the commit containing the results of moving all files into "/Landscape/src".

In case anyone else wants to dig into this, note that this problem
setup precludes directory rename detection being involved.  Directory
rename detection has a rule where if the source directory wasn't
entirely removed on one side, then that directory was not renamed on
that side.  Seems obvious, but the upshot of that rule is that a
directory cannot be renamed into a subdirectory of itself, because by
virtue of being a subdirectory that means its parent directory still
exists.

So, this is a problem where only regular rename detection (i.e. rename
detection of individual files) is going to be at play.

> What did you expect to happen? (Expected behavior)
> I expected the git rename detection to match all files in A "/Landscape" to files in B "/Landscape/src".

Are all files under "/Landscape" from the merge base commit
individually more similar to the counterpart under "/Landscape/src"
than to files under any other directory?  If not, the expectation goes
against how rename detection has worked in git from the beginning.

> What happened instead? (Actual behavior)
> Although many files were matched successfully, git mismatched over two dozen similarly named files, e.g.
>
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IAccountsIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IAccountsIntegration.vb
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/LandscapeApiService.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/Main/uiMain.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx -> Landscape/src/Main/uiMain/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Main/uiMain/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI/Properties/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rates/uiRates/My Project/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Claim.UI/Properties/licenses.licx -> Landscape/src/Rates/uiRates/My Project/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Claim.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/StandardLetters/uiStandardLetters/My Project/licenses.licx -> Landscape/src/Rdt.Landscape.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb -> Landscape/src/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/SystemEvents/uiSystemEvents/My Project/licenses.licx -> Landscape/src/StandardLetters/uiStandardLetters/My Project/licenses.licx
> Incorrect path match: Landscape/Services/busServices/RDT_Logo.ico -> Landscape/src/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup32/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/newrdlogogradiant48shad.ico -> Landscape/src/Startup/uiStartup32/newrdlogogradiant48shad.ico
> Incorrect path match: Landscape/Templates/uiTemplates.Workflow/My Project/licenses.licx -> Landscape/src/SystemEvents/uiSystemEvents/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/Rdt.Utils.UI/Properties/licenses.licx -> Landscape/src/Templates/uiTemplates.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/uiUtils/My Project/licenses.licx -> Landscape/src/Utils/Rdt.Utils.UI/Properties/licenses.licx
> Incorrect name match: Landscape/WebServices/ServiceFabric/Policy/Rdt.Policy.Repository.Service.Fabric.Host/PackageRoot/Data/Swagger/Examples/POST_UKSTasks_Response.json -> Landscape/src/Utils/uiUtils/My Project/licenses.licx
>
>
> What's different between what you expected and what actually happened?
>
> As you can see, although the filenames (and content) are the same,

The content is the same as well?  So, these renames that you label as
incorrect are actually _exact_ renames -- and further, in most cases
they also have an identical basename for the file as well.

Exact renames are detected first, before any other method of rename
detection is employed, and other than giving a preference to files
with the same basename, if there are multiple choices it just picks
one (what I'd call at random, though technically based on what the
internal processing order happens to be; see the "Too many identical
alternatives? Pick one" code comment).

And this, too, is true of both the `recursve` and `ort` backends; no
change has been made to how exact renames are handled.

>  In some cases, it seems that the catalyst has been git thinking that a file from B has been deleted from A, when in fact it has not actually been deleted at all.
>
> For example, the file Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1 has not been deleted in A or B, therefore git should not have attempted to rename Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 to Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1, especially as it then attempts to rename Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 to Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 and so on.

Renamed files, from Git's perspective, always involve files that have
been deleted.

> Git status contains, for example:
>         deleted by them: Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1

This means that it wasn't sufficiently similar to any of the new
files...or that _other_ deleted files were more similar to the new
files and thus that they were paired up instead of this file, leaving
this file to simply be marked as deleted.  (Or that other deleted
files were just as similar; tie-breakers are kinda random in such a
case.)

[...]

> Anything else you want to add:
> I can't help but think that this is related to changes made by Palantir:
> https://blog.palantir.com/optimizing-gits-merge-machinery-1-127ceb0ef2a1

Curious.  What makes you think it's related?

If there is some reason you think it's related, there's an easy way to
check -- just repeat the cherry-pick with the "-s recursive" flag to
use the old merge backend and compare the results.

I'll be somewhat surprised if it's related, though.

> I have tried to unstage these renames using "git restore --staged <file_name>" so I can then apply the correct "git mv" commands

Why?  Just modify all the files to have the correct end results and then commit.

>, but bizzarely, this then results in "git status" reporting a different, smaller set of mismatched names:

As mentioned earlier, git does _not_ record renames.  So, running the
correct "git mv" command doesn't really mean much.  If you use
completely "incorrect" git-mv commands, but then manually tweak files
until they have the correct results, then what's recorded is exactly
the same as if you had used the "correct" git-mv commands.

Further, when you run "git status", it can't access any renames you
did because that information isn't recorded anywhere.  It instead
recomputes renames on the fly.  And it does so each and every time you
run "git status", even if you make no changes between two invocations.
In fact, from this you can probably also deduce that there are other
ways to affect what will be shown as renames, when you have multiple
files similar to any given source file.  In particular, you can cause
a different pairing modifying one of the similar files enough that it
becomes the most similar to the source file, or so that it becomes no
longer the most similar to the source file.  However, what "git
status" reports for renames is irrelevant, since that info won't be
recorded in the commit.  Renames are never recorded.  Anywhere.

In fact, you can even run "git status --no-renames" to just see the
old filenames that were removed and the new ones that were added
without having all the files be paired up as renames.

Hope that helps,
Elijah

________________________________

DISCLAIMER This email is confidential. It should only be read by those persons to whom it is addressed. RDT Ltd accept no liability for the consequences of any person acting, or refraining from acting, on any information contained within this e-mail or any attached documents prior to the receipt by those persons of subsequent written confirmation of that information. If you think this e-mail may not be intended for you, do not use, pass on or copy the transmission in any way. While all reasonable precautions are taken to minimise the risk of transmitting software viruses we advise you to carry out your own virus checks on any attachment to this message. We cannot accept liability for any loss or damage caused by software viruses.

^ permalink raw reply

* Re: [PATCH] format-patch: fix ignored encode_email_headers for cover letter
From: Simon Ser @ 2023-11-10 10:36 UTC (permalink / raw)
  To: Jeff King; +Cc: René Scharfe, git, Junio C Hamano
In-Reply-To: <20231109183506.GB2711684@coredump.intra.peff.net>

On Thursday, November 9th, 2023 at 19:35, Jeff King <peff@peff.net> wrote:

> That makes sense, and your patch looks the right thing to do as an
> immediate fix. But I have to wonder:
> 
>   1. Are there other bits that need to be copied? Grepping for other
>      code that does the same thing, I see that show_log() and
>      cmd_format_patch() copy a lot more. (For that matter, why doesn't
>      make_cover_letter() just use the context set up by
>      cmd_format_patch()?).
> 
>   2. Why are we copying this stuff at all? When we introduced the
>      pretty-print context back in 6bf139440c (clean up calling
>      conventions for pretty.c functions, 2011-05-26), the idea was just
>      to keep all of the format options together. But later, 6d167fd7cc
>      (pretty: use fmt_output_email_subject(), 2017-03-01) added a
>      pointer to the rev_info directly. So could/should we just be using
>      pp->rev->encode_email_headers here?
> 
>      Or if that field is not always set (or we want to override some
>      elements), should there be a single helper function to initialize
>      the pretty_print_context from a rev_info, that could be shared
>      between spots like show_log() and make_cover_letter()?
> 
> I don't think that answering those questions needs to hold up your
> patch. We can take it as a quick fix for a real bug, and then anybody
> interested can dig further as a separate topic on top.

These are good questions indeed. Unfortunately I don't hink I'll have time to
work on this though.

> > +test_expect_success 'cover letter with --cover-from-description subject (UTF-8 subject line)' '
> > +	test_config branch.rebuild-1.description "Café?
> > +
> > +body" &&
> > +	git checkout rebuild-1 &&
> > +	git format-patch --stdout --cover-letter --cover-from-description subject --encode-email-headers main >actual &&
> > +	grep "^Subject: \[PATCH 0/2\] =?UTF-8?q?Caf=C3=A9=3F?=$" actual &&
> > +	! grep "Café" actual
> > +'
> 
> The test looks correct to me.
> 
> Some of these long lines (and the in-string newlines!) make this ugly
> and hard to read. But it is also just copying the already-ugly style of
> nearby tests. So I'm OK with that. But a prettier version might be:
> 
>   test_expect_success 'cover letter respects --encode-email-headers' '
>         test_config branch.rebuild-1.description "Café?" &&
>         git checkout rebuild-1 &&
>         git format-patch --stdout --encode-email-headers \
>                 --cover-letter --cover-from-description=subject \
>                 main >actual &&
>         ...
>   '

Yeah, that sounds better indeed. Let me know if you want me to resend a cleaner
version of the test.

> I also wondered if we could be just be testing this much more easily
> with another header like "--to". But I guess that would be found in both
> the cover letter and the actual patches (we also don't seem to encode
> it even in the regular patches; is that a bug?).

That sounds like another bug indeed… But maybe that'll be harder to fix. To
Q-encode this field one needs to split off the full name and actual mail
address ("André <andre@example.org>" would be split into "André" and
"andre@example.org"), then Q-encode the full name, then join the strings
together again. In particular, it's incorrect to Q-encode the full string.

^ permalink raw reply

* Re: [PATCH 2/4] contrib/subtree: stop using `-o` to test for number of args
From: Patrick Steinhardt @ 2023-11-10 10:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git
In-Reply-To: <xmqq8r76zg1j.fsf@gitster.g>

[-- Attachment #1: Type: text/plain, Size: 2509 bytes --]

On Fri, Nov 10, 2023 at 08:02:32AM +0900, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> >>  # Usage: process_subtree_split_trailer SPLIT_HASH MAIN_HASH [REPOSITORY]
> >>  process_subtree_split_trailer () {
> >> -	assert test $# = 2 -o $# = 3
> >> +	assert test $# -ge 2
> >> +	assert test $# -le 3
> >
> > It took me a minute to figure out why we were swapping "=" for "-ge". It
> > is because we want to logical-OR the two conditions, but "assert"
> > requires that we test one at a time. I think that is probably worth
> > explaining in the commit message.
> 
> I wish we could write something like
> 
> 	assert test $# -ge 2 && test $# -le 3
> 
> (and I'd allow double quoting the whole thing after assert if
> needed) but we cannot do so without tweaking the implementation of
> assert.
> 
> >
> >> @@ -916,7 +919,7 @@ cmd_split () {
> >>  	if test $# -eq 0
> >>  	then
> >>  		rev=$(git rev-parse HEAD)
> >> -	elif test $# -eq 1 -o $# -eq 2
> >> +	elif test $# -eq 1 || test $# -eq 2
> >
> > OK, this one is a straight-forward use of "||".
> 
> Yes, but why not consistently use the range notation like the
> earlier one here, or below?

I opted to go for the "obvious" conversion, if there was one easily
available, to make the diff easier to read. We could of course use a
range notation like this:

 		rev=$(git rev-parse HEAD)
-	elif test $# -eq 1 || test $# -eq 2
+	elif test $# -ge 1 && test $# -le 2
 	then
 		rev=$(git rev-parse -q --verify "$1^{commit}") ||
 			die "fatal: '$1' does not refer to a commit"

Or :

 		rev=$(git rev-parse HEAD)
-	elif test $# -eq 1 || test $# -eq 2
+	elif ! { test $# -lt 1 || test $# -gt 2; }
 	then
 		rev=$(git rev-parse -q --verify "$1^{commit}") ||
 			die "fatal: '$1' does not refer to a commit"

But both of these are not consistent with the other cases due to the
negation here, and both of them are harder to read in my opinion. So
I'm not sure whether we gain anything by trying to make this a bit more
consistent with the other conversions.

Patrick

> 	elif test $# -ge 1 && test $# -le 2
> 
> >>  cmd_merge () {
> >> -	test $# -eq 1 -o $# -eq 2 ||
> >> +	if test $# -lt 1 || test $# -gt 2
> >> ...
> > (I am OK with either, it just took me a minute to verify that your
> > conversion was correct. But that is a one-time issue now while
> > reviewing, and I think the code is readable going forward).
> 
> Yeah, the end result looks good.
> 
> Thanks, both.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox