Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH 2/2] ref-filter: support filtering of operational refs
From: Junio C Hamano @ 2024-01-03 14:38 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: Patrick Steinhardt, Taylor Blau, git, christian.couder
In-Reply-To: <CAOLa=ZS4OOAmyRvf4HH-c_3GvnVkh6zS2kD3hEhRZ7NZT-rvyA@mail.gmail.com>

Karthik Nayak <karthik.188@gmail.com> writes:

> The confusion was that I thought Junio was referring to using
>
>     $ git for-each-ref ""
>
> to print all refs under $GIT_DIR, while he was actually talking about
> "$GIT_DIR/refs/" directory.

I do not think you misunderstood me here, though.  

When you have your master branch (refs/heads/master), your v1.0 tag
(refs/tags/v1.0), and the usual pseudorefs, giving "refs" to "git
for-each-ref" would yield refs/heads/master and refs/tags/v1.0 but
not HEAD and others, simply because the pattern "refs" in

    $ git for-each-ref "refs"

works as a hierarchy prefix match.  You give "refs/heads" and you
get only your master branch but not tags or HEAD in such a
repository.  As a natural extension to that behaviour, an empty
string as a hierarchy prefix that matches everything would work
well: you'd get HEAD, refs/heads/master, and refs/tags/v1.0 as an
empty prefix would cover all of the hiearchies these three refs (and
pseudorefs if you had ORIG_HEAD and MERGE_HEAD there) live in.

In any case, it is not a very much interesting to define the syntax
to tell for-each-ref not to limit itself under "refs/".  My point
was that you do not need a special option for that, as shown above.

What is more interesting is what to do with refs that are specific
to other worktrees, e.g.

    $ git rev-parse "worktrees/$name/refs/bisect/bad"

would currently let you peek into (and worse yet, muck with, if you
really wanted to, with something like "git update-ref") refs that
should be only visible in another worktree.  Should for-each-ref and
friends learn a way to iterate over them?  I have no answer to that
question.

^ permalink raw reply

* Re: Concurrent fetch commands
From: Patrick Steinhardt @ 2024-01-03 10:40 UTC (permalink / raw)
  To: Oswald Buddenhagen; +Cc: Junio C Hamano, Stefan Haller, git
In-Reply-To: <ZZU1TCyQdLqoLxPw@ugly>

[-- Attachment #1: Type: text/plain, Size: 2292 bytes --]

On Wed, Jan 03, 2024 at 11:22:04AM +0100, Oswald Buddenhagen wrote:
> On Wed, Jan 03, 2024 at 09:11:07AM +0100, Patrick Steinhardt wrote:
> > Ah, one thing I didn't think of is parallel fetches. It's expected that
> > all of the fetches write into FETCH_HEAD at the same point in time
> > concurrently
> > 
> is it, though? given that the contents could be already randomly scrambled,
> it would not seem particularly bad if the behavior changed.
> 
> the one real complication i see is the --append option, which requires using
> a waiting lock after the actual fetch, rather than acquiring it immediately
> and erroring out on failure (and ideally giving a hint to use
> --no-write-fetch-head).

I should probably clarify, but with "parallel fetches" I meant `git
fetch --jobs=`, not two separate executions of git-fetch(1). And these
do in fact use `--append` internally: the main process first truncates
FETCH_HEAD and then spawns its children, which will then append to
FETCH_HEAD in indeterministic order.

But even though the order is indeterministic, I wouldn't go as far as
claiming that the complete feature is broken. It works and records all
updated refs in FETCH_HEAD just fine, even if it's not particularly
elegant. Which to me shows that we should try hard not to break it.

> an extra complication is that concurrent runs with and without --append
> should be precluded, because that would again result in undefined behavior.
> it generally seems tricky to get --append straight if parallel fetches are
> supposed to work.

Yeah, the `--append` flag indeed complicates things. There are two ways
to handle this:

  - `--append` should refrain from running when there is a lockfile.
    This breaks `git fetch --jobs` without extra infra to handle this
    case, and furthermore a user may (rightfully?) expect that two
    manually spawned `git fetch --append` processes should work just
    fine.

  - `--append` should handle concurrency just fine, that is it knows to
    append to a preexisting lockfile. This is messy though, and the
    original creator of the lockfile wouldn't know when it can commit it
    into place.

Both options are kind of ugly, so I'm less sure now whether lockfiles
are the way to go.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: What's the recommendation for forgetting all rerere's records?
From: Sean Allred @ 2024-01-03  9:27 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano
In-Reply-To: <m0frzeu89w.fsf@epic96565.epic.com>


There was enough going on with that prior email that I gave it another
look and found some errors.

>     $ echo 'bad merge' >file
>     $ git add file
>
>     $ EDITOR=: git rebase --continue
>     file: needs merge
>     You must edit all merge conflicts and then
>     mark them as resolved using git add
>
>     $ git rebase --abort
>
>     $ git rebase main
>     Auto-merging file
>     CONFLICT (content): Merge conflict in file
>     error: could not apply b4d7aeb... local
>     hint: Resolve all conflicts manually, mark them as resolved with
>     hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
>     hint: You can instead skip this commit: run "git rebase --skip".
>     hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
>     Recorded preimage for 'file'
>     Could not apply b4d7aeb... local
>
>     $ git checkout --merge .
>     Recreated 1 merge conflict
>
>     $ git rerere forget .
>     error: no remembered resolution for 'file'
>
>     $ echo 'good merge' >file
>
>     $ EDITOR=: git rebase --continue
>     file: needs merge
>     You must edit all merge conflicts and then
>     mark them as resolved using git add

This section should have read:

    $ echo 'bad merge' >file
    $ git add file

    $ EDITOR=: git rebase --continue
    Recorded resolution for 'file'.
    [detached HEAD 5e3c431] local
     1 file changed, 1 insertion(+), 1 deletion(-)
    Successfully rebased and updated refs/heads/feature.

    $ git reset --hard @{1}
    HEAD is now at b4d7aeb local

    $ git rebase main
    Auto-merging file
    CONFLICT (content): Merge conflict in file
    error: could not apply b4d7aeb... local
    hint: Resolve all conflicts manually, mark them as resolved with
    hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
    hint: You can instead skip this commit: run "git rebase --skip".
    hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
    Resolved 'file' using previous resolution.
    Could not apply b4d7aeb... local

    $ git checkout --merge .
    Recreated 1 merge conflict

    $ git rerere forget .
    error: no remembered resolution for 'file'

I've driven myself a little nuts trying to reproduce it this morning,
but in doing so I've come to an important discovery: this bug presents
if `core.autocrlf=true` but does *not* present if `core.autocrlf=input`.

For completeness and future reference, the following script reproduces
the issue on Windows:

    git init
    echo aaa >file
    git add file
    git commit -ambase
    git branch feature
    echo bbb >file
    git commit -amremote
    git switch feature
    echo ccc >file
    git commit -amlocal
    git config rerere.enabled true
    git config core.autocrlf true     # <--

    # setup complete; let's rebase!
    git rebase main
    echo 'bad merge' >file
    git add file
    EDITOR=: git rebase --continue

    # uh oh; that was a bad resolution; let's try again
    git reset --hard @{1}
    git rebase main
    git checkout --merge .
    git rerere forget .               # fails
    echo 'good merge' >file
    git add file
    EDITOR=: git rebase --continue

At the end of this script, the 'bad merge' is still the recorded
resolution and no rerere record exists for the 'good merge'.

Just in case there's another piece of config somehow relevant, here's a
dump of the system that reproduced this:

    $ git config --list --show-scope | sort
    global	user.email=[clip]
    global	user.name=[clip]
    local	core.autocrlf=true
    local	core.bare=false
    local	core.filemode=false
    local	core.ignorecase=true
    local	core.logallrefupdates=true
    local	core.repositoryformatversion=0
    local	core.symlinks=false
    local	rerere.enabled=true
    system	core.autocrlf=input
    system	core.fscache=true
    system	core.fsmonitor=true
    system	core.symlinks=false
    system	credential.helper=manager
    system	credential.https://dev.azure.com.usehttppath=true
    system	diff.astextplain.textconv=astextplain
    system	filter.lfs.clean=git-lfs clean -- %f
    system	filter.lfs.process=git-lfs filter-process
    system	filter.lfs.required=true
    system	filter.lfs.smudge=git-lfs smudge -- %f
    system	http.sslbackend=schannel
    system	http.sslcainfo=C:/Program Files/Git/mingw64/etc/ssl/certs/ca-bundle.crt
    system	init.defaultbranch=main
    system	pull.rebase=true

    $ git --version
    git version 2.43.0.windows.1

It's worth noting at this point that while I believe I reproduced on
macOS last week, that doesn't jive with the available evidence (and I
can't reproduce it on macOS this morning, though I suspect that has more
to do with native use of LF over CRLF than anything else).

--
Sean Allred

^ permalink raw reply

* Re: [PATCH 2/2] ref-filter: support filtering of operational refs
From: Karthik Nayak @ 2024-01-03 10:22 UTC (permalink / raw)
  To: Patrick Steinhardt, Taylor Blau; +Cc: Junio C Hamano, git, christian.couder
In-Reply-To: <ZZUgUUlB8A-rhep5@tanuki>

[-- Attachment #1: Type: text/plain, Size: 3991 bytes --]

Patrick Steinhardt <ps@pks.im> writes:

> On Tue, Jan 02, 2024 at 01:47:22PM -0500, Taylor Blau wrote:
>> On Tue, Jan 02, 2024 at 07:18:48AM -0800, Karthik Nayak wrote:
>> > > As "git for-each-ref" takes pattern that is prefix match, e.g.,
>> > >
>> > >     $ git for-each-ref refs/remotes/
>> > >
>> > > shows everything like refs/remotes/origin/main that begins with
>> > > refs/remotes/, I wonder if
>> > >
>> > >     $ git for-each-ref ""
>> > >
>> > > should mean what you are asking for.  After all, "git for-each-ref"
>> > > does *not* take "--branches" and others like "git log" family to
>> > > limit its operation to subhierarchy of "refs/" to begin with.
>> >
>> > But I don't think using an empty pattern is the best way to go forward.
>> > This would break the pattern matching feature. For instance, what if the
>> > user wanted to print all refs, but pattern match "*_HEAD"?
>> >
>> > Would that be
>> >
>> >       $ git for-each-ref "" "*_HEAD"
>> >
>> > I think this would be confusing, since the first pattern is now acting
>> > as an option, since its not really filtering rather its changing the
>> > search space.
>> >
>> > Maybe "--all-refs" or "--all-ref-types" instead?
>>
>> I tend to agree that the special empty pattern would be a good shorthand
>> for listing all references underneath refs/, including any top-level
>> psuedo-refs.
>>
>> But I don't think that I quite follow what Karthik is saying here.
>> for-each-ref returns the union of references that match the given
>> pattern(s), not their intersection. So if you wanted to list just the
>> psudo-refs ending in '_HEAD', you'd do:
>>
>>   $ git for-each-ref "*_HEAD"
>>
>> I think if you wanted to list all pseudo-refs, calling the option
>> `--pseudo-refs` seems reasonable. But if you want to list some subset of
>> psueod-refs matching a given pattern, you should specify that pattern
>> directly.
>
> Where I think this proposal falls short is if you have refs outside of
> the "refs/" hierarchy. Granted, this is nothing that should usually
> happen nowadays. But I think we should safeguard us for the future:
>
>   - There may be bugs in the reftable backend that allow for such refs
>     to be created.
>
>   - We may even eventually end up saying that it's valid for refs to not
>     start with "refs/". I consider this to be mostly an artifact of how
>     the files backend works, so it is not entirely unreasonable for us
>     to eventually lift the restriction for the reftable backend.
>
> I do not want to push for the second bullet point anytime soon, nor do I
> have any plans to do so in the future. But regardless of that I would
> really love to have a way to ask the ref backend for _any_ reference
> that it knows of, regardless of its prefix. Otherwise it becomes next to
> impossible for a user to learn about what the reftable binary-format
> actually contains. So I think that the current focus on pseudo-refs is
> too short-sighted, and would want to aim for a more complete solution to
> this problem.
>
> This could be in the form of a `--all-refs` flag that gets translated
> into a new `DO_FOR_EACH_REF_ALL_REFS` bit, which would indicate to the
> ref backend to also enumerate refs outside of the "refs/" hierarchy.
> This is orthogonal to the already existing `--all` pseudo-opt, because
> `--all` would only ever enumerate refs inside of the "refs/" hierarchy.
>
> Patrick

Thanks Patrick. I think the confusion was because I was referring to add
a new command to print all refs, meaning all refs including the ones
outside the "refs/" folder.

The confusion was that I thought Junio was referring to using

    $ git for-each-ref ""

to print all refs under $GIT_DIR, while he was actually talking about
"$GIT_DIR/refs/" directory.

So to loop back, I'm suggesting to add `--all-refs` to print all the
refs under $GIT_DIR. This would include refs traditionally found under
"refs/" and other refs like pseudo refs, HEAD and perhaps user created
refs under $GIT_DIR.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 690 bytes --]

^ permalink raw reply

* Re: What's cooking in git.git (Jan 2024, #01; Tue, 2)
From: Jeff King @ 2024-01-03  9:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: René Scharfe, git
In-Reply-To: <xmqq5y0bcjpw.fsf@gitster.g>

On Tue, Jan 02, 2024 at 05:02:35PM -0800, Junio C Hamano wrote:

> * jk/t1006-cat-file-objectsize-disk (2023-12-21) 1 commit
>   (merged to 'next' on 2023-12-28 at d82812e636)
>  + t1006: add tests for %(objectsize:disk)
> 
>  Test update.
> 
>  Will merge to 'master'.
>  source: <20231221094722.GA570888@coredump.intra.peff.net>

It looks like this is the original version. I posted a v2 that took
René's suggestion to swap out the awk for shell, but it got overlooked.
I'm happy enough either way, but if we want to salvage that effort,
here's a patch which could go on top:

-- >8 --
From: René Scharfe <l.s.r@web.de>
Subject: [PATCH] t1006: prefer shell loop to awk for packed object sizes

To compute the expected on-disk size of packed objects, we sort the
output of show-index by pack offset and then compute the difference
between adjacent entries using awk. This works but has a few readability
problems:

  1. Reading the index in pack order means don't find out the size of an
     oid's entry until we see the _next_ entry. So we have to save it to
     print later.

     We can instead iterate in reverse order, so we compute each oid's
     size as we see it.

  2. Since the awk invocation is inside a text_expect block, we can't
     easily use single-quotes to hold the script. So we use
     double-quotes, but then have to escape the dollar signs in the awk
     script.

     We can swap this out for a shell loop instead (which is made much
     easier by the first change).

Signed-off-by: Jeff King <peff@peff.net>
---
I gave René authorship since this was his cleverness, but obviously I
wrote the commit message. Giving an explicit signoff would be nice,
though.

 t/t1006-cat-file.sh | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 0c2eafae65..5ea3326128 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1117,14 +1117,16 @@ test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
 		while read idx
 		do
 			git show-index <"$idx" >idx.raw &&
-			sort -n <idx.raw >idx.sorted &&
+			sort -nr <idx.raw >idx.sorted &&
 			packsz=$(test_file_size "${idx%.idx}.pack") &&
 			end=$((packsz - rawsz)) &&
-			awk -v end="$end" "
-			  NR > 1 { print oid, \$1 - start }
-			  { start = \$1; oid = \$2 }
-			  END { print oid, end - start }
-			" idx.sorted ||
+			while read start oid rest
+			do
+				size=$((end - start)) &&
+				end=$start &&
+				echo "$oid $size" ||
+				return 1
+			done <idx.sorted ||
 			return 1
 		done
 	} >expect.raw &&
-- 
2.43.0.514.g7147b80757


^ permalink raw reply related

* Re: [PATCH 2/2] ref-filter: support filtering of operational refs
From: Patrick Steinhardt @ 2024-01-03  8:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Karthik Nayak, Junio C Hamano, git, christian.couder
In-Reply-To: <ZZRaOhK869S1Sg1h@nand.local>

[-- Attachment #1: Type: text/plain, Size: 3338 bytes --]

On Tue, Jan 02, 2024 at 01:47:22PM -0500, Taylor Blau wrote:
> On Tue, Jan 02, 2024 at 07:18:48AM -0800, Karthik Nayak wrote:
> > > As "git for-each-ref" takes pattern that is prefix match, e.g.,
> > >
> > >     $ git for-each-ref refs/remotes/
> > >
> > > shows everything like refs/remotes/origin/main that begins with
> > > refs/remotes/, I wonder if
> > >
> > >     $ git for-each-ref ""
> > >
> > > should mean what you are asking for.  After all, "git for-each-ref"
> > > does *not* take "--branches" and others like "git log" family to
> > > limit its operation to subhierarchy of "refs/" to begin with.
> >
> > But I don't think using an empty pattern is the best way to go forward.
> > This would break the pattern matching feature. For instance, what if the
> > user wanted to print all refs, but pattern match "*_HEAD"?
> >
> > Would that be
> >
> >       $ git for-each-ref "" "*_HEAD"
> >
> > I think this would be confusing, since the first pattern is now acting
> > as an option, since its not really filtering rather its changing the
> > search space.
> >
> > Maybe "--all-refs" or "--all-ref-types" instead?
> 
> I tend to agree that the special empty pattern would be a good shorthand
> for listing all references underneath refs/, including any top-level
> psuedo-refs.
> 
> But I don't think that I quite follow what Karthik is saying here.
> for-each-ref returns the union of references that match the given
> pattern(s), not their intersection. So if you wanted to list just the
> psudo-refs ending in '_HEAD', you'd do:
> 
>   $ git for-each-ref "*_HEAD"
> 
> I think if you wanted to list all pseudo-refs, calling the option
> `--pseudo-refs` seems reasonable. But if you want to list some subset of
> psueod-refs matching a given pattern, you should specify that pattern
> directly.

Where I think this proposal falls short is if you have refs outside of
the "refs/" hierarchy. Granted, this is nothing that should usually
happen nowadays. But I think we should safeguard us for the future:

  - There may be bugs in the reftable backend that allow for such refs
    to be created.

  - We may even eventually end up saying that it's valid for refs to not
    start with "refs/". I consider this to be mostly an artifact of how
    the files backend works, so it is not entirely unreasonable for us
    to eventually lift the restriction for the reftable backend.

I do not want to push for the second bullet point anytime soon, nor do I
have any plans to do so in the future. But regardless of that I would
really love to have a way to ask the ref backend for _any_ reference
that it knows of, regardless of its prefix. Otherwise it becomes next to
impossible for a user to learn about what the reftable binary-format
actually contains. So I think that the current focus on pseudo-refs is
too short-sighted, and would want to aim for a more complete solution to
this problem.

This could be in the form of a `--all-refs` flag that gets translated
into a new `DO_FOR_EACH_REF_ALL_REFS` bit, which would indicate to the
ref backend to also enumerate refs outside of the "refs/" hierarchy.
This is orthogonal to the already existing `--all` pseudo-opt, because
`--all` would only ever enumerate refs inside of the "refs/" hierarchy.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 2/6] setup: move creation of "refs/" into the files backend
From: Patrick Steinhardt @ 2024-01-03  8:33 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git
In-Reply-To: <CAOLa=ZTR6aW5aoxcMOS3U3TL1VxSfmyVno9fu7B5201pJTqyyg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1520 bytes --]

On Tue, Jan 02, 2024 at 05:23:18AM -0800, Karthik Nayak wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Move the code to create the directory into the files backend itself to
> > make it so. This means that future ref backends will also need to have
> > equivalent logic around to ensure that the directory exists, but it
> > seems a lot more sensible to have it this way round than to require
> > callers to create the directory themselves.
> >
> 
> Why not move it to refs.c:refs_init_db() instead? this way each
> implementation doesn't have to do it?

True, that would be another possibility. But I think it is conceptually
unclean to split up creation of the refdb into multiple locations. The
"files" backend already has to create "refs/heads/" and "refs/tags/",
and the "reftable" backend will set up "refs/heads" as a file so that
it's impossible to create branches as loose files by accident. So both
do have specific knowledge around how exactly this directory hierarchy
should look like, and thus I think it is sensible to make the code self
contained inside of the backends.

My opinion on this would be different if we expected a proliferation of
backends to happen, but that's quite unlikely. Furthermore, we may at
some point decide that repos don't need "refs/" and/or "HEAD" at all
anymore, at which point it is easier to drop the creation of those files
and dirs from the reftable backend.

I'll update the commit message to include these considerations.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: What's the recommendation for forgetting all rerere's records?
From: Sean Allred @ 2024-01-03  8:30 UTC (permalink / raw)
  To: Sean Allred; +Cc: Junio C Hamano, Sean Allred, git
In-Reply-To: <35f24a01d15ce28932bb6be098d6a164a49cc542008f75673cd6221a9b24b578@mu.id>


(Doesn't look like this actually got picked up by lore when I originally
sent it at Fri, 22 Dec 2023 12:19:50 -0600. This is a re-send; apologies
if you get this message twice.)

==

There might be a bug here.

Junio C Hamano <gitster@pobox.com> writes:
     # Now the fun command you seem to have missed.  You MUST give
     # "git checkout --merge" a pathspec.  I do not encourage it but
     # using "." to say "unresolve everything under the sun" should
     # also work.
     $ git checkout --merge builtin/mv.c
     Recreated 1 merge conflict

Yep, I definitely missed this! Very handy, thank you :-)

     # You should then be able to correct the resolution with your
     # editor.
     $ edit builtin/mv.c

     # If this is one-time fix (you are happy with the original
     # resolution and wanted to deviate from it only once this time),
     # there is nothing else need to be done.  If you want to record
     # this as a new resolution, you'd get rid of the old one and
     # record this one.
     $ git rerere forget builtin/mv.c
     $ git rerere

It's taken some time to investigate on our end, but it appears that the
issue we're seeing is particular to git-rebase.

Consider a run using git-merge, which works perfectly as you describe:

    # Let's set up our conflict; most output here elided for brevity.
    $ git init
    $ echo aaa >file
    $ git add file
    $ git commit -ambase
    $ git branch feature
    $ echo bbb >file
    $ git commit -amremote
    $ git switch feature
    $ echo ccc >file
    $ git commit -amlocal
    $ git --no-pager log --oneline --graph --all
    * 6b33f42 (HEAD -> feature) local
    | * 7e189f6 (main) remote
    |/
    * c5901f5 base

    # Now for the fun part!
    $ git config rerere.enabled true

    $ git merge main
    Auto-merging file
    CONFLICT (content): Merge conflict in file
    Recorded preimage for 'file'
    Automatic merge failed; fix conflicts and then commit the result.

    $ echo 'bad merge' >file
    $ git commit -ammerge
    Recorded resolution for 'file'.
    [feature 75d45f0] merge

    # Ack! That merge was bad. Let's try that again.
    $ git reset --hard @^
    HEAD is now at 6b33f42 local

    $ git merge main
    Auto-merging file
    CONFLICT (content): Merge conflict in file
    Resolved 'file' using previous resolution.
    Automatic merge failed; fix conflicts and then commit the result.

    # Your method to correct this single bad merge works flawlessly:
    $ git checkout --merge file
    Recreated 1 merge conflict

    $ git rerere forget file
    Updated preimage for 'file'
    Forgot resolution for 'file'

    $ echo 'good merge' >file

    $ git commit -ammerge
    Recorded resolution for 'file'.
    [feature 1770541] merge

Let's compare this with git-rebase:

    # Same setup as before
    $ git init
    $ echo aaa >file
    $ git add file
    $ git commit -ambase
    $ git branch feature
    $ echo bbb >file
    $ git commit -amremote
    $ git switch feature
    $ echo ccc >file
    $ git commit -amlocal
    $ git --no-pager log --oneline --graph --all
    * b4d7aeb (HEAD -> feature) local
    | * 2a0978d (main) remote
    |/
    * 91140a6 base

    # Now for the fun part! Just like before, but we're going to use
    # git-rebase instead of git-merge.
    $ git config rerere.enabled true

    $ git rebase main
    Auto-merging file
    CONFLICT (content): Merge conflict in file
    error: could not apply b4d7aeb... local
    hint: Resolve all conflicts manually, mark them as resolved with
    hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
    hint: You can instead skip this commit: run "git rebase --skip".
    hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
    Recorded preimage for 'file'
    Could not apply b4d7aeb... local

    $ echo 'bad merge' >file
    $ git add file

    $ EDITOR=: git rebase --continue
    file: needs merge
    You must edit all merge conflicts and then
    mark them as resolved using git add

    $ git rebase --abort

    $ git rebase main
    Auto-merging file
    CONFLICT (content): Merge conflict in file
    error: could not apply b4d7aeb... local
    hint: Resolve all conflicts manually, mark them as resolved with
    hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
    hint: You can instead skip this commit: run "git rebase --skip".
    hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
    Recorded preimage for 'file'
    Could not apply b4d7aeb... local

    $ git checkout --merge .
    Recreated 1 merge conflict

    $ git rerere forget .
    error: no remembered resolution for 'file'

    $ echo 'good merge' >file

    $ EDITOR=: git rebase --continue
    file: needs merge
    You must edit all merge conflicts and then
    mark them as resolved using git add

Is this a bug?

--
Sean Allred

^ permalink raw reply

* Re: [PATCH v2] write-or-die: make GIT_FLUSH a Boolean environment variable
From: Patrick Steinhardt @ 2024-01-03  8:22 UTC (permalink / raw)
  To: Chandra Pratap via GitGitGadget; +Cc: git, Chandra Pratap, Chandra Pratap
In-Reply-To: <pull.1628.v2.git.1704268708720.gitgitgadget@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1411 bytes --]

On Wed, Jan 03, 2024 at 07:58:28AM +0000, Chandra Pratap via GitGitGadget wrote:
[snip]
> diff --git a/write-or-die.c b/write-or-die.c
> index 42a2dc73cd3..a6acabd329f 100644
> --- a/write-or-die.c
> +++ b/write-or-die.c
> @@ -20,15 +20,12 @@ void maybe_flush_or_die(FILE *f, const char *desc)
>  {
>  	static int skip_stdout_flush = -1;
>  	struct stat st;
> -	char *cp;
>  
>  	if (f == stdout) {
>  		if (skip_stdout_flush < 0) {
> -			/* NEEDSWORK: make this a normal Boolean */
> -			cp = getenv("GIT_FLUSH");
> -			if (cp)
> -				skip_stdout_flush = (atoi(cp) == 0);
> -			else if ((fstat(fileno(stdout), &st) == 0) &&
> +			if (!git_env_bool("GIT_FLUSH", -1))
> +				skip_stdout_flush = 1;

It's a bit surprising to pass `-1` as default value to `git_env_bool()`
here, as this value would hint that the caller wants to explicitly
handle the case where the "GIT_FLUSH" envvar is not set at all. We don't
though, and essentially fall back to "GIT_FLUSH=1", so passing `1` as
the fallback value would be less confusing.

Anyway, the resulting behaviour is the same regardless of whether we
pass `1` or `-1`, so I'm not sure whether this is worth a reroll.

Patrick

> +			else if (!fstat(fileno(stdout), &st) &&
>  				 S_ISREG(st.st_mode))
>  				skip_stdout_flush = 1;
>  			else
> 
> base-commit: 1a87c842ece327d03d08096395969aca5e0a6996
> -- 
> gitgitgadget
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH V4 1/1] Replace SID with domain/username
From: Matthias Aßhauer @ 2024-01-03  8:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Sören Krecker, Johannes Schindelin, git, sunshine
In-Reply-To: <xmqqa5pnckm4.fsf@gitster.g>



On Tue, 2 Jan 2024, Junio C Hamano wrote:

> Sören Krecker <soekkle@freenet.de> writes:
>
>> Replace SID with domain/username in error message, if owner of repository
>> and user are not equal on windows systems. Each user should have a unique
>> SID (https://learn.microsoft.com/en-us/windows-server/identity/ad-ds/manage/understand-security-identifiers#what-are-security-identifiers).
>
> That paragraph your URL refers to does say that a SID that is used
> for an account will never be reused to identify a different account.
> But I am not sure if it means a user will never be assigned more
> than one SID (in other words, the reverse is not necessarily true).

To my knowledge a user account will never have multiple active SIDs, but
the documentation of LookupAccountSidA [1] explicitly mentions that it 
does look up historic SIDs.

> In addition to looking up SIDs for local accounts, local domain 
> accounts, and explicitly trusted domain accounts, LookupAccountSid can 
> look up SIDs for any account in any domain in the forest, including SIDs 
> that appear only in the SIDhistory field of an account in the forest. 
> The SIDhistory field stores former SIDs of an account that has been
> moved from another domain.

[1] https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-lookupaccountsida#remarks

>
> The paragraph also mentions that a SID can identify a non-user
> entity like a computer account (as opposed to "a user account")---I
> do not know what its implications are in the context of this patch,
> though.
>
>> This means that domain/username is not a loss of information.
>
> This statement does not (grammatically) make sense, but more
> importantly, loss of information may not be a bad thing in this
> case.  If more than one SIDs are given to a user account and
> processes working for that account, these different SIDs may be
> translated, by using LookupAccountSidA(), to the same string for a
> single user@domain, and it would be an operation that loses
> information in that sense.
>
> But if what we *care* about is user@domain between the current
> process and the owner of the directory in question being the same
> (or not), then such a loss of information is a *good* thing.

This patch only changes the output of our error message, though.
It does not change what ownership information we actually compare.
So if we had a hypothetical user Bob that was part of the  domain 
example.com (SID S-1-5-21-100000001-1000000001-10000001-1001) and
had been moved over from the example.org domain (old SID S-1-5-21-
2000000002-2000000002-20000002-2002) and we would detect a repository 
owned by bobs old SID, we would now lookup the old SID, find it 
attached to a user named example.com\Bob, look up Bobs  current SID, find 
it belongs to a user named example.com\Bob and print a confusing error 
message.

> So I dunno.  Arguing what we care about (is that exact SID equality
> between the "owner of the directory" and the "user, which the
> current process is working on behalf of", or do we care about the
> equality of the "accounts"?) may be a better way to justify this
> change, if you ask me.
>
>> +static BOOL user_sid_to_user_name(PSID sid, LPSTR *str)
>> +{
>> +	SID_NAME_USE pe_use;
>> +	DWORD len_user = 0, len_domain = 0;
>> +	BOOL translate_sid_to_user;
>> +
>> +	/* returns only FALSE, because the string pointers are NULL*/
>> +	LookupAccountSidA(NULL, sid, NULL, &len_user, NULL, &len_domain,
>> +			  &pe_use);
>> +	/*Alloc needed space of the strings*/
>> +	ALLOC_ARRAY((*str), (size_t)len_domain + (size_t)len_user);
>> +	translate_sid_to_user = LookupAccountSidA(NULL, sid, (*str) + len_domain, &len_user,
>> +				   *str, &len_domain, &pe_use);
>> +	if (translate_sid_to_user == FALSE) {
>> +		FREE_AND_NULL(*str);
>> +	}
>
> Style: do not enclose a single-statement block inside {}.
>
>> +	else
>> +		(*str)[len_domain] = '/';
>> +	return translate_sid_to_user;
>> +}
>
>> @@ -2767,7 +2788,9 @@ int is_path_owned_by_current_sid(const char *path, struct strbuf *report)
>>  		} else if (report) {
>>  			LPSTR str1, str2, to_free1 = NULL, to_free2 = NULL;
>>
>> -			if (ConvertSidToStringSidA(sid, &str1))
>> +			if (user_sid_to_user_name(sid, &str1))
>> +				to_free1 = str1;
>> +			else if (ConvertSidToStringSidA(sid, &str1))
>>  				to_free1 = str1;
>
> Do these two helper functions return pointers pointing into the same
> kind of memory that you can free with the same function?  That is ...
>
>> ...
>>  				    "'%s' is owned by:\n"
>>  				    "\t'%s'\nbut the current user is:\n"
>>  				    "\t'%s'\n", path, str1, str2);
>> -			LocalFree(to_free1);
>> -			LocalFree(to_free2);
>> +			free(to_free1);
>> +			free(to_free2);
>
> ... the original code seems to say that the piece of memory we
> obtain from ConvertSidToStringSidA() must not be freed by calling
> free() but use something special called LocalFree().  I am assuing
> that your user_sid_to_user_name() returns a regular piece of memory
> that can be freed by calling regular free()?  Do we need to keep
> track of where we got the memory from and use different function to
> free each variable, or something (again I do not do Windows so I'll
> defer all of these to Dscho, who is CC'ed this time).
>
> Thanks and a happy new year.
>
>>  		}
>>  	}
>
>

^ permalink raw reply

* Re: Concurrent fetch commands
From: Patrick Steinhardt @ 2024-01-03  8:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Stefan Haller, git
In-Reply-To: <ZZUNxNciNb_xZveY@tanuki>

[-- Attachment #1: Type: text/plain, Size: 3619 bytes --]

On Wed, Jan 03, 2024 at 08:33:24AM +0100, Patrick Steinhardt wrote:
> On Sun, Dec 31, 2023 at 09:27:19AM -0800, Junio C Hamano wrote:
> > Stefan Haller <lists@haller-berlin.de> writes:
> > 
> > > I can reliably reproduce this by doing
> > >
> > >    $ git fetch&; sleep 0.1; git pull
> > >    [1] 42160
> > >    [1]  + done       git fetch
> > >    fatal: Cannot rebase onto multiple branches.
> > 
> > I see a bug here.
> > 
> > How this _ought_ to work is
> > 
> >  - The first "git fetch" wants to report what it fetched by writing
> >    into the $GIT_DIR/FETCH_HEAD file ("git merge FETCH_HEAD" after
> >    the fetch finishes can consume its contents).
> > 
> >  - The second "git pull" runs "git fetch" under the hood.  Because
> >    it also wants to write to $GIT_DIR/FETCH_HEAD, and because there
> >    is already somebody writing to the file, it should notice and
> >    barf, saying "fatal: a 'git fetch' is already working" or
> >    something.
> > 
> > But because there is no "Do not overwrite FETCH_HEAD somebody else
> > is using" protection, "git merge" or "git rebase" that is run as the
> > second half of the "git pull" ends up working on the contents of
> > FETCH_HEAD that is undefined, and GIGO result follows.
> > 
> > The "bug" that the second "git fetch" does not notice an already
> > running one (who is in possession of FETCH_HEAD) and refrain from
> > starting is not easy to design a fix for---we cannot just abort by
> > opening it with O_CREAT|O_EXCL because it is a normal thing for
> > $GIT_DIR/FETCH_HEAD to exist after the "last" fetch.  We truncate
> > its contents before starting to avoid getting affected by contents
> > leftover by the last fetch, but when there is a "git fetch" that is
> > actively running, and it finishes _after_ the second one starts and
> > truncates the file, the second one will end up seeing the contents
> > the first one left.  We have the "--no-write-fetch-head" option for
> > users to explicitly tell which invocation of "git fetch" should not
> > write FETCH_HEAD.
> 
> While I agree that it's the right thing to use "--no-write-fetch-head"
> in this context, I still wonder whether we want to fix this "bug". It
> would be a rather easy change on our side to start using the lockfile
> API to write to FETCH_HEAD, which has a bunch of benefits:
> 
>   - We would block concurrent processes of writing to FETCH_HEAD at the
>     same time (well, at least for clients aware of the new semantics).
> 
>   - Consequentially, we do not write a corrupted FETCH_HEAD anymore when
>     multiple processes write to it at the same time.
> 
>   - We're also more robust against corruption in the context of hard
>     crashes due to atomic rename semantics and proper flushing.
> 
> I don't really see much of a downside except for the fact that we change
> how this special ref is being written to, so other implementations would
> need to adapt accordingly. But even if they didn't, if clients with both
> the new and old behaviour write FETCH_HEAD at the same point in time the
> result would still be a consistent FETCH_HEAD if both writes finish. We
> do have a race now which of both versions of FETCH_HEAD we see, but that
> feels better than corrupted contents to me.

Ah, one thing I didn't think of is parallel fetches. It's expected that
all of the fetches write into FETCH_HEAD at the same point in time
concurrently, so the end result is the collection of updated refs even
though the order isn't guaranteed. That makes things a bit more
complicated indeed.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v2] write-or-die: make GIT_FLUSH a Boolean environment variable
From: Chandra Pratap via GitGitGadget @ 2024-01-03  7:58 UTC (permalink / raw)
  To: git; +Cc: Chandra Pratap, Chandra Pratap
In-Reply-To: <pull.1628.git.1703955246308.gitgitgadget@gmail.com>

From: Chandra Pratap <chandrapratap3519@gmail.com>

Among Git's environment variable, the ones marked as "Boolean"
accept values in a way similar to Boolean configuration variables,
i.e. values like 'yes', 'on', 'true' and positive numbers are
taken as "on" and values like 'no', 'off', 'false' are taken as
"off".
GIT_FLUSH can be used to force Git to use non-buffered I/O when
writing to stdout. It can only accept two values, '1' which causes
Git to flush more often and '0' which makes all output buffered.
Make GIT_FLUSH accept more values besides '0' and '1' by turning it
into a Boolean environment variable, modifying the required logic.
Update the related documentation.

Signed-off-by: Chandra Pratap <chandrapratap3519@gmail.com>
---
    write-or-die: make GIT_FLUSH a Boolean environment variable
    
    Helped-by: Torsten Bögershausen <tboegi@web.de>

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1628%2FChand-ra%2Fgit_flush-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1628/Chand-ra/git_flush-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1628

Range-diff vs v1:

 1:  2123b43a080 ! 1:  585c76fff68 write-or-die: make GIT_FLUSH a Boolean environment variable
     @@ write-or-die.c: void maybe_flush_or_die(FILE *f, const char *desc)
       	static int skip_stdout_flush = -1;
       	struct stat st;
      -	char *cp;
     -+	int cp;
       
       	if (f == stdout) {
       		if (skip_stdout_flush < 0) {
     @@ write-or-die.c: void maybe_flush_or_die(FILE *f, const char *desc)
      -			cp = getenv("GIT_FLUSH");
      -			if (cp)
      -				skip_stdout_flush = (atoi(cp) == 0);
     -+			cp = git_env_bool("GIT_FLUSH", -1);
     -+			if (cp >= 0)
     -+				skip_stdout_flush = (cp == 0);
     - 			else if ((fstat(fileno(stdout), &st) == 0) &&
     +-			else if ((fstat(fileno(stdout), &st) == 0) &&
     ++			if (!git_env_bool("GIT_FLUSH", -1))
     ++				skip_stdout_flush = 1;
     ++			else if (!fstat(fileno(stdout), &st) &&
       				 S_ISREG(st.st_mode))
       				skip_stdout_flush = 1;
     + 			else


 Documentation/git.txt | 16 +++++++---------
 write-or-die.c        |  9 +++------
 2 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/Documentation/git.txt b/Documentation/git.txt
index 2535a30194f..83fd60f2d11 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -724,16 +724,14 @@ for further details.
 	waiting for someone with sufficient permissions to fix it.
 
 `GIT_FLUSH`::
-// NEEDSWORK: make it into a usual Boolean environment variable
-	If this environment variable is set to "1", then commands such
+	If this Boolean environment variable is set to true, then commands such
 	as 'git blame' (in incremental mode), 'git rev-list', 'git log',
-	'git check-attr' and 'git check-ignore' will
-	force a flush of the output stream after each record have been
-	flushed. If this
-	variable is set to "0", the output of these commands will be done
-	using completely buffered I/O.   If this environment variable is
-	not set, Git will choose buffered or record-oriented flushing
-	based on whether stdout appears to be redirected to a file or not.
+	'git check-attr' and 'git check-ignore' will force a flush of the output
+	stream after each record have been flushed. If this variable is set to
+	false, the output of these commands will be done using completely buffered
+	I/O. If this environment variable is not set, Git will choose buffered or
+	record-oriented flushing based on whether stdout appears to be redirected
+	to a file or not.
 
 `GIT_TRACE`::
 	Enables general trace messages, e.g. alias expansion, built-in
diff --git a/write-or-die.c b/write-or-die.c
index 42a2dc73cd3..a6acabd329f 100644
--- a/write-or-die.c
+++ b/write-or-die.c
@@ -20,15 +20,12 @@ void maybe_flush_or_die(FILE *f, const char *desc)
 {
 	static int skip_stdout_flush = -1;
 	struct stat st;
-	char *cp;
 
 	if (f == stdout) {
 		if (skip_stdout_flush < 0) {
-			/* NEEDSWORK: make this a normal Boolean */
-			cp = getenv("GIT_FLUSH");
-			if (cp)
-				skip_stdout_flush = (atoi(cp) == 0);
-			else if ((fstat(fileno(stdout), &st) == 0) &&
+			if (!git_env_bool("GIT_FLUSH", -1))
+				skip_stdout_flush = 1;
+			else if (!fstat(fileno(stdout), &st) &&
 				 S_ISREG(st.st_mode))
 				skip_stdout_flush = 1;
 			else

base-commit: 1a87c842ece327d03d08096395969aca5e0a6996
-- 
gitgitgadget

^ permalink raw reply related

* Re: Concurrent fetch commands
From: Patrick Steinhardt @ 2024-01-03  7:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Stefan Haller, git
In-Reply-To: <xmqqy1daffk8.fsf@gitster.g>

[-- Attachment #1: Type: text/plain, Size: 3123 bytes --]

On Sun, Dec 31, 2023 at 09:27:19AM -0800, Junio C Hamano wrote:
> Stefan Haller <lists@haller-berlin.de> writes:
> 
> > I can reliably reproduce this by doing
> >
> >    $ git fetch&; sleep 0.1; git pull
> >    [1] 42160
> >    [1]  + done       git fetch
> >    fatal: Cannot rebase onto multiple branches.
> 
> I see a bug here.
> 
> How this _ought_ to work is
> 
>  - The first "git fetch" wants to report what it fetched by writing
>    into the $GIT_DIR/FETCH_HEAD file ("git merge FETCH_HEAD" after
>    the fetch finishes can consume its contents).
> 
>  - The second "git pull" runs "git fetch" under the hood.  Because
>    it also wants to write to $GIT_DIR/FETCH_HEAD, and because there
>    is already somebody writing to the file, it should notice and
>    barf, saying "fatal: a 'git fetch' is already working" or
>    something.
> 
> But because there is no "Do not overwrite FETCH_HEAD somebody else
> is using" protection, "git merge" or "git rebase" that is run as the
> second half of the "git pull" ends up working on the contents of
> FETCH_HEAD that is undefined, and GIGO result follows.
> 
> The "bug" that the second "git fetch" does not notice an already
> running one (who is in possession of FETCH_HEAD) and refrain from
> starting is not easy to design a fix for---we cannot just abort by
> opening it with O_CREAT|O_EXCL because it is a normal thing for
> $GIT_DIR/FETCH_HEAD to exist after the "last" fetch.  We truncate
> its contents before starting to avoid getting affected by contents
> leftover by the last fetch, but when there is a "git fetch" that is
> actively running, and it finishes _after_ the second one starts and
> truncates the file, the second one will end up seeing the contents
> the first one left.  We have the "--no-write-fetch-head" option for
> users to explicitly tell which invocation of "git fetch" should not
> write FETCH_HEAD.

While I agree that it's the right thing to use "--no-write-fetch-head"
in this context, I still wonder whether we want to fix this "bug". It
would be a rather easy change on our side to start using the lockfile
API to write to FETCH_HEAD, which has a bunch of benefits:

  - We would block concurrent processes of writing to FETCH_HEAD at the
    same time (well, at least for clients aware of the new semantics).

  - Consequentially, we do not write a corrupted FETCH_HEAD anymore when
    multiple processes write to it at the same time.

  - We're also more robust against corruption in the context of hard
    crashes due to atomic rename semantics and proper flushing.

I don't really see much of a downside except for the fact that we change
how this special ref is being written to, so other implementations would
need to adapt accordingly. But even if they didn't, if clients with both
the new and old behaviour write FETCH_HEAD at the same point in time the
result would still be a consistent FETCH_HEAD if both writes finish. We
do have a race now which of both versions of FETCH_HEAD we see, but that
feels better than corrupted contents to me.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v3 8/8] reftable/merged: transfer ownership of records when iterating
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1704262787.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2156 bytes --]

When iterating over records with the merged iterator we put the records
into a priority queue before yielding them to the caller. This means
that we need to allocate the contents of these records before we can
pass them over to the caller.

The handover to the caller is quite inefficient though because we first
deallocate the record passed in by the caller and then copy over the new
record, which requires us to reallocate memory.

Refactor the code to instead transfer ownership of the new record to the
caller. So instead of reallocating all contents, we now release the old
record and then copy contents of the new record into place.

The following benchmark of `git show-ref --quiet` in a repository with
around 350k refs shows a clear improvement. Before:

    HEAP SUMMARY:
        in use at exit: 21,163 bytes in 193 blocks
      total heap usage: 708,058 allocs, 707,865 frees, 36,783,255 bytes allocated

After:

    HEAP SUMMARY:
        in use at exit: 21,163 bytes in 193 blocks
      total heap usage: 357,007 allocs, 356,814 frees, 24,193,602 bytes allocated

This shows that we now have roundabout a single allocation per record
that we're yielding from the iterator. Ideally, we'd also get rid of
this allocation so that the number of allocations doesn't scale with the
number of refs anymore. This would require some larger surgery though
because the memory is owned by the priority queue before transferring it
over to the caller.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/merged.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/reftable/merged.c b/reftable/merged.c
index a28bb99aaf..a52914d667 100644
--- a/reftable/merged.c
+++ b/reftable/merged.c
@@ -124,10 +124,12 @@ static int merged_iter_next_entry(struct merged_iter *mi,
 		reftable_record_release(&top.rec);
 	}

-	reftable_record_copy_from(rec, &entry.rec, hash_size(mi->hash_id));
+	reftable_record_release(rec);
+	*rec = entry.rec;

 done:
-	reftable_record_release(&entry.rec);
+	if (err)
+		reftable_record_release(&entry.rec);
 	return err;
 }

-- 
2.43.GIT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 7/8] reftable/merged: really reuse buffers to compute record keys
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1704262787.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1689 bytes --]

In 829231dc20 (reftable/merged: reuse buffer to compute record keys,
2023-12-11), we have refactored the merged iterator to reuse a pair of
long-living strbufs by relying on the fact that `reftable_record_key()`
tries to reuse already allocated strbufs by calling `strbuf_reset()`,
which should give us significantly fewer reallocations compared to the
old code that used on-stack strbufs that are allocated for each and
every iteration. Unfortunately, we called `strbuf_release()` on these
long-living strbufs that we meant to reuse on each iteration, defeating
the optimization.

Fix this performance issue by not releasing those buffers on iteration
anymore, where we instead rely on `merged_iter_close()` to release the
buffers for us.

Using `git show-ref --quiet` in a repository with ~350k refs this leads
to a significant drop in allocations. Before:

    HEAP SUMMARY:
        in use at exit: 21,163 bytes in 193 blocks
      total heap usage: 1,410,148 allocs, 1,409,955 frees, 61,976,068 bytes allocated

After:

    HEAP SUMMARY:
        in use at exit: 21,163 bytes in 193 blocks
      total heap usage: 708,058 allocs, 707,865 frees, 36,783,255 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/merged.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/reftable/merged.c b/reftable/merged.c
index 556bb5c556..a28bb99aaf 100644
--- a/reftable/merged.c
+++ b/reftable/merged.c
@@ -128,8 +128,6 @@ static int merged_iter_next_entry(struct merged_iter *mi,

 done:
 	reftable_record_release(&entry.rec);
-	strbuf_release(&mi->entry_key);
-	strbuf_release(&mi->key);
 	return err;
 }

-- 
2.43.GIT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 6/8] reftable/record: store "val2" hashes as static arrays
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1704262787.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 5214 bytes --]

Similar to the preceding commit, convert ref records of type "val2" to
store their object IDs in static arrays instead of allocating them for
every single record.

We're using the same benchmark as in the preceding commit, with `git
show-ref --quiet` in a repository with ~350k refs. This time around
though the effects aren't this huge. Before:

    HEAP SUMMARY:
        in use at exit: 21,163 bytes in 193 blocks
      total heap usage: 1,419,040 allocs, 1,418,847 frees, 62,153,868 bytes allocated

After:

    HEAP SUMMARY:
        in use at exit: 21,163 bytes in 193 blocks
      total heap usage: 1,410,148 allocs, 1,409,955 frees, 61,976,068 bytes allocated

This is because "val2"-type records are typically only stored for peeled
tags, and the number of annotated tags in the benchmark repository is
rather low. Still, it can be seen that this change leads to a reduction
of allocations overall, even if only a small one.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/readwrite_test.c  | 12 ++++--------
 reftable/record.c          |  6 ------
 reftable/record_test.c     |  4 ----
 reftable/reftable-record.h |  4 ++--
 4 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
index 87b238105c..178766dfa8 100644
--- a/reftable/readwrite_test.c
+++ b/reftable/readwrite_test.c
@@ -547,8 +547,6 @@ static void test_table_refs_for(int indexed)
 		uint8_t hash[GIT_SHA1_RAWSZ];
 		char fill[51] = { 0 };
 		char name[100];
-		uint8_t hash1[GIT_SHA1_RAWSZ];
-		uint8_t hash2[GIT_SHA1_RAWSZ];
 		struct reftable_ref_record ref = { NULL };
 
 		memset(hash, i, sizeof(hash));
@@ -558,11 +556,9 @@ static void test_table_refs_for(int indexed)
 		name[40] = 0;
 		ref.refname = name;
 
-		set_test_hash(hash1, i / 4);
-		set_test_hash(hash2, 3 + i / 4);
 		ref.value_type = REFTABLE_REF_VAL2;
-		ref.value.val2.value = hash1;
-		ref.value.val2.target_value = hash2;
+		set_test_hash(ref.value.val2.value, i / 4);
+		set_test_hash(ref.value.val2.target_value, 3 + i / 4);
 
 		/* 80 bytes / entry, so 3 entries per block. Yields 17
 		 */
@@ -570,8 +566,8 @@ static void test_table_refs_for(int indexed)
 		n = reftable_writer_add_ref(w, &ref);
 		EXPECT(n == 0);
 
-		if (!memcmp(hash1, want_hash, GIT_SHA1_RAWSZ) ||
-		    !memcmp(hash2, want_hash, GIT_SHA1_RAWSZ)) {
+		if (!memcmp(ref.value.val2.value, want_hash, GIT_SHA1_RAWSZ) ||
+		    !memcmp(ref.value.val2.target_value, want_hash, GIT_SHA1_RAWSZ)) {
 			want_names[want_names_len++] = xstrdup(name);
 		}
 	}
diff --git a/reftable/record.c b/reftable/record.c
index a67a6b4d8a..5c3fbb7b2a 100644
--- a/reftable/record.c
+++ b/reftable/record.c
@@ -222,9 +222,7 @@ static void reftable_ref_record_copy_from(void *rec, const void *src_rec,
 		memcpy(ref->value.val1, src->value.val1, hash_size);
 		break;
 	case REFTABLE_REF_VAL2:
-		ref->value.val2.value = reftable_malloc(hash_size);
 		memcpy(ref->value.val2.value, src->value.val2.value, hash_size);
-		ref->value.val2.target_value = reftable_malloc(hash_size);
 		memcpy(ref->value.val2.target_value,
 		       src->value.val2.target_value, hash_size);
 		break;
@@ -298,8 +296,6 @@ void reftable_ref_record_release(struct reftable_ref_record *ref)
 		reftable_free(ref->value.symref);
 		break;
 	case REFTABLE_REF_VAL2:
-		reftable_free(ref->value.val2.target_value);
-		reftable_free(ref->value.val2.value);
 		break;
 	case REFTABLE_REF_VAL1:
 		break;
@@ -401,11 +397,9 @@ static int reftable_ref_record_decode(void *rec, struct strbuf key,
 			return -1;
 		}
 
-		r->value.val2.value = reftable_malloc(hash_size);
 		memcpy(r->value.val2.value, in.buf, hash_size);
 		string_view_consume(&in, hash_size);
 
-		r->value.val2.target_value = reftable_malloc(hash_size);
 		memcpy(r->value.val2.target_value, in.buf, hash_size);
 		string_view_consume(&in, hash_size);
 		break;
diff --git a/reftable/record_test.c b/reftable/record_test.c
index 5c94d26e35..2876db7d27 100644
--- a/reftable/record_test.c
+++ b/reftable/record_test.c
@@ -122,11 +122,7 @@ static void test_reftable_ref_record_roundtrip(void)
 			set_hash(in.u.ref.value.val1, 1);
 			break;
 		case REFTABLE_REF_VAL2:
-			in.u.ref.value.val2.value =
-				reftable_malloc(GIT_SHA1_RAWSZ);
 			set_hash(in.u.ref.value.val2.value, 1);
-			in.u.ref.value.val2.target_value =
-				reftable_malloc(GIT_SHA1_RAWSZ);
 			set_hash(in.u.ref.value.val2.target_value, 2);
 			break;
 		case REFTABLE_REF_SYMREF:
diff --git a/reftable/reftable-record.h b/reftable/reftable-record.h
index 7f3a0df635..bb6e99acd3 100644
--- a/reftable/reftable-record.h
+++ b/reftable/reftable-record.h
@@ -41,8 +41,8 @@ struct reftable_ref_record {
 	union {
 		unsigned char val1[GIT_MAX_RAWSZ];
 		struct {
-			uint8_t *value; /* first value, malloced hash  */
-			uint8_t *target_value; /* second value, malloced hash */
+			unsigned char value[GIT_MAX_RAWSZ]; /* first hash  */
+			unsigned char target_value[GIT_MAX_RAWSZ]; /* second hash */
 		} val2;
 		char *symref; /* referent, malloced 0-terminated string */
 	} value;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 5/8] reftable/record: store "val1" hashes as static arrays
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1704262787.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 8589 bytes --]

When reading ref records of type "val1", we store its object ID in an
allocated array. This results in an additional allocation for every
single ref record we read, which is rather inefficient especially when
iterating over refs.

Refactor the code to instead use an embedded array of `GIT_MAX_RAWSZ`
bytes. While this means that `struct ref_record` is bigger now, we
typically do not store all refs in an array anyway and instead only
handle a limited number of records at the same point in time.

Using `git show-ref --quiet` in a repository with ~350k refs this leads
to a significant drop in allocations. Before:

    HEAP SUMMARY:
        in use at exit: 21,098 bytes in 192 blocks
      total heap usage: 2,116,683 allocs, 2,116,491 frees, 76,098,060 bytes allocated

After:

    HEAP SUMMARY:
        in use at exit: 21,098 bytes in 192 blocks
      total heap usage: 1,419,031 allocs, 1,418,839 frees, 62,145,036 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block_test.c      |  4 +---
 reftable/merged_test.c     | 16 ++++++----------
 reftable/readwrite_test.c  | 14 ++++----------
 reftable/record.c          |  3 ---
 reftable/record_test.c     |  1 -
 reftable/reftable-record.h |  3 ++-
 reftable/stack_test.c      |  2 --
 7 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/reftable/block_test.c b/reftable/block_test.c
index c00bbc8aed..dedb05c7d8 100644
--- a/reftable/block_test.c
+++ b/reftable/block_test.c
@@ -49,13 +49,11 @@ static void test_block_read_write(void)
 
 	for (i = 0; i < N; i++) {
 		char name[100];
-		uint8_t hash[GIT_SHA1_RAWSZ];
 		snprintf(name, sizeof(name), "branch%02d", i);
-		memset(hash, i, sizeof(hash));
 
 		rec.u.ref.refname = name;
 		rec.u.ref.value_type = REFTABLE_REF_VAL1;
-		rec.u.ref.value.val1 = hash;
+		memset(rec.u.ref.value.val1, i, GIT_SHA1_RAWSZ);
 
 		names[i] = xstrdup(name);
 		n = block_writer_add(&bw, &rec);
diff --git a/reftable/merged_test.c b/reftable/merged_test.c
index d08c16abef..b3927a5d73 100644
--- a/reftable/merged_test.c
+++ b/reftable/merged_test.c
@@ -123,13 +123,11 @@ static void readers_destroy(struct reftable_reader **readers, size_t n)
 
 static void test_merged_between(void)
 {
-	uint8_t hash1[GIT_SHA1_RAWSZ] = { 1, 2, 3, 0 };
-
 	struct reftable_ref_record r1[] = { {
 		.refname = "b",
 		.update_index = 1,
 		.value_type = REFTABLE_REF_VAL1,
-		.value.val1 = hash1,
+		.value.val1 = { 1, 2, 3, 0 },
 	} };
 	struct reftable_ref_record r2[] = { {
 		.refname = "a",
@@ -165,26 +163,24 @@ static void test_merged_between(void)
 
 static void test_merged(void)
 {
-	uint8_t hash1[GIT_SHA1_RAWSZ] = { 1 };
-	uint8_t hash2[GIT_SHA1_RAWSZ] = { 2 };
 	struct reftable_ref_record r1[] = {
 		{
 			.refname = "a",
 			.update_index = 1,
 			.value_type = REFTABLE_REF_VAL1,
-			.value.val1 = hash1,
+			.value.val1 = { 1 },
 		},
 		{
 			.refname = "b",
 			.update_index = 1,
 			.value_type = REFTABLE_REF_VAL1,
-			.value.val1 = hash1,
+			.value.val1 = { 1 },
 		},
 		{
 			.refname = "c",
 			.update_index = 1,
 			.value_type = REFTABLE_REF_VAL1,
-			.value.val1 = hash1,
+			.value.val1 = { 1 },
 		}
 	};
 	struct reftable_ref_record r2[] = { {
@@ -197,13 +193,13 @@ static void test_merged(void)
 			.refname = "c",
 			.update_index = 3,
 			.value_type = REFTABLE_REF_VAL1,
-			.value.val1 = hash2,
+			.value.val1 = { 2 },
 		},
 		{
 			.refname = "d",
 			.update_index = 3,
 			.value_type = REFTABLE_REF_VAL1,
-			.value.val1 = hash1,
+			.value.val1 = { 1 },
 		},
 	};
 
diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
index 9c16e0504e..87b238105c 100644
--- a/reftable/readwrite_test.c
+++ b/reftable/readwrite_test.c
@@ -60,18 +60,15 @@ static void write_table(char ***names, struct strbuf *buf, int N,
 	*names = reftable_calloc(sizeof(char *) * (N + 1));
 	reftable_writer_set_limits(w, update_index, update_index);
 	for (i = 0; i < N; i++) {
-		uint8_t hash[GIT_SHA256_RAWSZ] = { 0 };
 		char name[100];
 		int n;
 
-		set_test_hash(hash, i);
-
 		snprintf(name, sizeof(name), "refs/heads/branch%02d", i);
 
 		ref.refname = name;
 		ref.update_index = update_index;
 		ref.value_type = REFTABLE_REF_VAL1;
-		ref.value.val1 = hash;
+		set_test_hash(ref.value.val1, i);
 		(*names)[i] = xstrdup(name);
 
 		n = reftable_writer_add_ref(w, &ref);
@@ -675,11 +672,10 @@ static void test_write_object_id_min_length(void)
 	struct strbuf buf = STRBUF_INIT;
 	struct reftable_writer *w =
 		reftable_new_writer(&strbuf_add_void, &buf, &opts);
-	uint8_t hash[GIT_SHA1_RAWSZ] = {42};
 	struct reftable_ref_record ref = {
 		.update_index = 1,
 		.value_type = REFTABLE_REF_VAL1,
-		.value.val1 = hash,
+		.value.val1 = {42},
 	};
 	int err;
 	int i;
@@ -711,11 +707,10 @@ static void test_write_object_id_length(void)
 	struct strbuf buf = STRBUF_INIT;
 	struct reftable_writer *w =
 		reftable_new_writer(&strbuf_add_void, &buf, &opts);
-	uint8_t hash[GIT_SHA1_RAWSZ] = {42};
 	struct reftable_ref_record ref = {
 		.update_index = 1,
 		.value_type = REFTABLE_REF_VAL1,
-		.value.val1 = hash,
+		.value.val1 = {42},
 	};
 	int err;
 	int i;
@@ -814,11 +809,10 @@ static void test_write_multiple_indices(void)
 	writer = reftable_new_writer(&strbuf_add_void, &writer_buf, &opts);
 	reftable_writer_set_limits(writer, 1, 1);
 	for (i = 0; i < 100; i++) {
-		unsigned char hash[GIT_SHA1_RAWSZ] = {i};
 		struct reftable_ref_record ref = {
 			.update_index = 1,
 			.value_type = REFTABLE_REF_VAL1,
-			.value.val1 = hash,
+			.value.val1 = {i},
 		};
 
 		strbuf_reset(&buf);
diff --git a/reftable/record.c b/reftable/record.c
index 5e258c734b..a67a6b4d8a 100644
--- a/reftable/record.c
+++ b/reftable/record.c
@@ -219,7 +219,6 @@ static void reftable_ref_record_copy_from(void *rec, const void *src_rec,
 	case REFTABLE_REF_DELETION:
 		break;
 	case REFTABLE_REF_VAL1:
-		ref->value.val1 = reftable_malloc(hash_size);
 		memcpy(ref->value.val1, src->value.val1, hash_size);
 		break;
 	case REFTABLE_REF_VAL2:
@@ -303,7 +302,6 @@ void reftable_ref_record_release(struct reftable_ref_record *ref)
 		reftable_free(ref->value.val2.value);
 		break;
 	case REFTABLE_REF_VAL1:
-		reftable_free(ref->value.val1);
 		break;
 	case REFTABLE_REF_DELETION:
 		break;
@@ -394,7 +392,6 @@ static int reftable_ref_record_decode(void *rec, struct strbuf key,
 			return -1;
 		}
 
-		r->value.val1 = reftable_malloc(hash_size);
 		memcpy(r->value.val1, in.buf, hash_size);
 		string_view_consume(&in, hash_size);
 		break;
diff --git a/reftable/record_test.c b/reftable/record_test.c
index 70ae78feca..5c94d26e35 100644
--- a/reftable/record_test.c
+++ b/reftable/record_test.c
@@ -119,7 +119,6 @@ static void test_reftable_ref_record_roundtrip(void)
 		case REFTABLE_REF_DELETION:
 			break;
 		case REFTABLE_REF_VAL1:
-			in.u.ref.value.val1 = reftable_malloc(GIT_SHA1_RAWSZ);
 			set_hash(in.u.ref.value.val1, 1);
 			break;
 		case REFTABLE_REF_VAL2:
diff --git a/reftable/reftable-record.h b/reftable/reftable-record.h
index f7eb2d6015..7f3a0df635 100644
--- a/reftable/reftable-record.h
+++ b/reftable/reftable-record.h
@@ -9,6 +9,7 @@ license that can be found in the LICENSE file or at
 #ifndef REFTABLE_RECORD_H
 #define REFTABLE_RECORD_H
 
+#include "hash-ll.h"
 #include <stdint.h>
 
 /*
@@ -38,7 +39,7 @@ struct reftable_ref_record {
 #define REFTABLE_NR_REF_VALUETYPES 4
 	} value_type;
 	union {
-		uint8_t *val1; /* malloced hash. */
+		unsigned char val1[GIT_MAX_RAWSZ];
 		struct {
 			uint8_t *value; /* first value, malloced hash  */
 			uint8_t *target_value; /* second value, malloced hash */
diff --git a/reftable/stack_test.c b/reftable/stack_test.c
index 14a3fc11ee..feab49d7f7 100644
--- a/reftable/stack_test.c
+++ b/reftable/stack_test.c
@@ -463,7 +463,6 @@ static void test_reftable_stack_add(void)
 		refs[i].refname = xstrdup(buf);
 		refs[i].update_index = i + 1;
 		refs[i].value_type = REFTABLE_REF_VAL1;
-		refs[i].value.val1 = reftable_malloc(GIT_SHA1_RAWSZ);
 		set_test_hash(refs[i].value.val1, i);
 
 		logs[i].refname = xstrdup(buf);
@@ -600,7 +599,6 @@ static void test_reftable_stack_tombstone(void)
 		refs[i].update_index = i + 1;
 		if (i % 2 == 0) {
 			refs[i].value_type = REFTABLE_REF_VAL1;
-			refs[i].value.val1 = reftable_malloc(GIT_SHA1_RAWSZ);
 			set_test_hash(refs[i].value.val1, i);
 		}
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 4/8] reftable/record: constify some parts of the interface
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1704262787.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2714 bytes --]

We're about to convert reftable records to stop storing their object IDs
as allocated hashes. Prepare for this refactoring by constifying some
parts of the interface that will be impacted by this.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/record.c          | 8 ++++----
 reftable/reftable-record.h | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/reftable/record.c b/reftable/record.c
index fbaa1fbef5..5e258c734b 100644
--- a/reftable/record.c
+++ b/reftable/record.c
@@ -76,7 +76,7 @@ int reftable_is_block_type(uint8_t typ)
 	return 0;
 }
 
-uint8_t *reftable_ref_record_val1(const struct reftable_ref_record *rec)
+const unsigned char *reftable_ref_record_val1(const struct reftable_ref_record *rec)
 {
 	switch (rec->value_type) {
 	case REFTABLE_REF_VAL1:
@@ -88,7 +88,7 @@ uint8_t *reftable_ref_record_val1(const struct reftable_ref_record *rec)
 	}
 }
 
-uint8_t *reftable_ref_record_val2(const struct reftable_ref_record *rec)
+const unsigned char *reftable_ref_record_val2(const struct reftable_ref_record *rec)
 {
 	switch (rec->value_type) {
 	case REFTABLE_REF_VAL2:
@@ -242,7 +242,7 @@ static char hexdigit(int c)
 	return 'a' + (c - 10);
 }
 
-static void hex_format(char *dest, uint8_t *src, int hash_size)
+static void hex_format(char *dest, const unsigned char *src, int hash_size)
 {
 	assert(hash_size > 0);
 	if (src) {
@@ -1164,7 +1164,7 @@ int reftable_record_equal(struct reftable_record *a, struct reftable_record *b,
 		reftable_record_data(a), reftable_record_data(b), hash_size);
 }
 
-static int hash_equal(uint8_t *a, uint8_t *b, int hash_size)
+static int hash_equal(const unsigned char *a, const unsigned char *b, int hash_size)
 {
 	if (a && b)
 		return !memcmp(a, b, hash_size);
diff --git a/reftable/reftable-record.h b/reftable/reftable-record.h
index 67104f8fbf..f7eb2d6015 100644
--- a/reftable/reftable-record.h
+++ b/reftable/reftable-record.h
@@ -49,11 +49,11 @@ struct reftable_ref_record {
 
 /* Returns the first hash, or NULL if `rec` is not of type
  * REFTABLE_REF_VAL1 or REFTABLE_REF_VAL2. */
-uint8_t *reftable_ref_record_val1(const struct reftable_ref_record *rec);
+const unsigned char *reftable_ref_record_val1(const struct reftable_ref_record *rec);
 
 /* Returns the second hash, or NULL if `rec` is not of type
  * REFTABLE_REF_VAL2. */
-uint8_t *reftable_ref_record_val2(const struct reftable_ref_record *rec);
+const unsigned char *reftable_ref_record_val2(const struct reftable_ref_record *rec);
 
 /* returns whether 'ref' represents a deletion */
 int reftable_ref_record_is_deletion(const struct reftable_ref_record *ref);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 3/8] reftable/writer: fix index corruption when writing multiple indices
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1704262787.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 5241 bytes --]

Each reftable may contain multiple types of blocks for refs, objects and
reflog records, where each of these may have an index that makes it more
efficient to find the records. It was observed that the index for log
records can become corrupted under certain circumstances, where the
first entry of the index points into the object index instead of to the
log records.

As it turns out, this corruption can occur whenever we write a log index
as well as at least one additional index. Writing records and their index
is basically a two-step process:

  1. We write all blocks for the corresponding record. Each block that
     gets written is added to a list of blocks to index.

  2. Once all blocks were written we finish the section. If at least two
     blocks have been added to the list of blocks to index then we will
     now write the index for those blocks and flush it, as well.

When we have a very large number of blocks then we may decide to write a
multi-level index, which is why we also keep track of the list of the
index blocks in the same way as we previously kept track of the blocks
to index.

Now when we have finished writing all index blocks we clear the index
and flush the last block to disk. This is done in the wrong order though
because flushing the block to disk will re-add it to the list of blocks
to be indexed. The result is that the next section we are about to write
will have an entry in the list of blocks to index that points to the
last block of the preceding section's index, which will corrupt the log
index.

Fix this corruption by clearing the index after having written the last
block.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/readwrite_test.c | 80 +++++++++++++++++++++++++++++++++++++++
 reftable/writer.c         |  4 +-
 2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
index 278663f22d..9c16e0504e 100644
--- a/reftable/readwrite_test.c
+++ b/reftable/readwrite_test.c
@@ -798,6 +798,85 @@ static void test_write_key_order(void)
 	strbuf_release(&buf);
 }
 
+static void test_write_multiple_indices(void)
+{
+	struct reftable_write_options opts = {
+		.block_size = 100,
+	};
+	struct strbuf writer_buf = STRBUF_INIT, buf = STRBUF_INIT;
+	struct reftable_block_source source = { 0 };
+	struct reftable_iterator it = { 0 };
+	const struct reftable_stats *stats;
+	struct reftable_writer *writer;
+	struct reftable_reader *reader;
+	int err, i;
+
+	writer = reftable_new_writer(&strbuf_add_void, &writer_buf, &opts);
+	reftable_writer_set_limits(writer, 1, 1);
+	for (i = 0; i < 100; i++) {
+		unsigned char hash[GIT_SHA1_RAWSZ] = {i};
+		struct reftable_ref_record ref = {
+			.update_index = 1,
+			.value_type = REFTABLE_REF_VAL1,
+			.value.val1 = hash,
+		};
+
+		strbuf_reset(&buf);
+		strbuf_addf(&buf, "refs/heads/%04d", i);
+		ref.refname = buf.buf,
+
+		err = reftable_writer_add_ref(writer, &ref);
+		EXPECT_ERR(err);
+	}
+
+	for (i = 0; i < 100; i++) {
+		unsigned char hash[GIT_SHA1_RAWSZ] = {i};
+		struct reftable_log_record log = {
+			.update_index = 1,
+			.value_type = REFTABLE_LOG_UPDATE,
+			.value.update = {
+				.old_hash = hash,
+				.new_hash = hash,
+			},
+		};
+
+		strbuf_reset(&buf);
+		strbuf_addf(&buf, "refs/heads/%04d", i);
+		log.refname = buf.buf,
+
+		err = reftable_writer_add_log(writer, &log);
+		EXPECT_ERR(err);
+	}
+
+	reftable_writer_close(writer);
+
+	/*
+	 * The written data should be sufficiently large to result in indices
+	 * for each of the block types.
+	 */
+	stats = reftable_writer_stats(writer);
+	EXPECT(stats->ref_stats.index_offset > 0);
+	EXPECT(stats->obj_stats.index_offset > 0);
+	EXPECT(stats->log_stats.index_offset > 0);
+
+	block_source_from_strbuf(&source, &writer_buf);
+	err = reftable_new_reader(&reader, &source, "filename");
+	EXPECT_ERR(err);
+
+	/*
+	 * Seeking the log uses the log index now. In case there is any
+	 * confusion regarding indices we would notice here.
+	 */
+	err = reftable_reader_seek_log(reader, &it, "");
+	EXPECT_ERR(err);
+
+	reftable_iterator_destroy(&it);
+	reftable_writer_free(writer);
+	reftable_reader_free(reader);
+	strbuf_release(&writer_buf);
+	strbuf_release(&buf);
+}
+
 static void test_corrupt_table_empty(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -847,5 +926,6 @@ int readwrite_test_main(int argc, const char *argv[])
 	RUN_TEST(test_log_overflow);
 	RUN_TEST(test_write_object_id_length);
 	RUN_TEST(test_write_object_id_min_length);
+	RUN_TEST(test_write_multiple_indices);
 	return 0;
 }
diff --git a/reftable/writer.c b/reftable/writer.c
index 2e322a5683..ee4590e20f 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -432,12 +432,12 @@ static int writer_finish_section(struct reftable_writer *w)
 		reftable_free(idx);
 	}
 
-	writer_clear_index(w);
-
 	err = writer_flush_block(w);
 	if (err < 0)
 		return err;
 
+	writer_clear_index(w);
+
 	bstats = writer_reftable_block_stats(w, typ);
 	bstats->index_blocks = w->stats.idx_stats.blocks - before_blocks;
 	bstats->index_offset = index_start;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 2/8] reftable/stack: do not auto-compact twice in `reftable_stack_add()`
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1704262787.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1171 bytes --]

In 5c086453ff (reftable/stack: perform auto-compaction with
transactional interface, 2023-12-11), we fixed a bug where the
transactional interface to add changes to a reftable stack did not
perform auto-compaction by calling `reftable_stack_auto_compact()` in
`reftable_stack_addition_commit()`. While correct, this change may now
cause us to perform auto-compaction twice in the non-transactional
interface `reftable_stack_add()`:

  - It performs auto-compaction by itself.

  - It now transitively performs auto-compaction via the transactional
    interface.

Remove the first instance so that we only end up doing auto-compaction
once.

Reported-by: Han-Wen Nienhuys <hanwenn@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/reftable/stack.c b/reftable/stack.c
index 8729508dc3..7ffeb3ee10 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -425,9 +425,6 @@ int reftable_stack_add(struct reftable_stack *st,
 		return err;
 	}

-	if (!st->disable_auto_compact)
-		return reftable_stack_auto_compact(st);
-
 	return 0;
 }

-- 
2.43.GIT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 1/8] reftable/stack: do not overwrite errors when compacting
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1704262787.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1854 bytes --]

In order to compact multiple stacks we iterate through the merged ref
and log records. When there is any error either when reading the records
from the old merged table or when writing the records to the new table
then we break out of the respective loops. When breaking out of the loop
for the ref records though the error code will be overwritten, which may
cause us to inadvertently skip over bad ref records. In the worst case,
this can lead to a compacted stack that is missing records.

Fix the code by using `goto done` instead so that any potential error
codes are properly returned to the caller.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/reftable/stack.c b/reftable/stack.c
index 16bab82063..8729508dc3 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -801,18 +801,16 @@ static int stack_write_compact(struct reftable_stack *st,
 			err = 0;
 			break;
 		}
-		if (err < 0) {
-			break;
-		}
+		if (err < 0)
+			goto done;
 
 		if (first == 0 && reftable_ref_record_is_deletion(&ref)) {
 			continue;
 		}
 
 		err = reftable_writer_add_ref(wr, &ref);
-		if (err < 0) {
-			break;
-		}
+		if (err < 0)
+			goto done;
 		entries++;
 	}
 	reftable_iterator_destroy(&it);
@@ -827,9 +825,8 @@ static int stack_write_compact(struct reftable_stack *st,
 			err = 0;
 			break;
 		}
-		if (err < 0) {
-			break;
-		}
+		if (err < 0)
+			goto done;
 		if (first == 0 && reftable_log_record_is_deletion(&log)) {
 			continue;
 		}
@@ -845,9 +842,8 @@ static int stack_write_compact(struct reftable_stack *st,
 		}
 
 		err = reftable_writer_add_log(wr, &log);
-		if (err < 0) {
-			break;
-		}
+		if (err < 0)
+			goto done;
 		entries++;
 	}
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 0/8] reftable: fixes and optimizations (pt.2)
From: Patrick Steinhardt @ 2024-01-03  6:22 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Junio C Hamano
In-Reply-To: <cover.1703063544.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4803 bytes --]

Hi,

this is the third version of my patch series that contains additional
fixes and optimizations for the reftable library.

The only changes in this iteration are improvements to the commit
messages to hopefully make them easier to understand. Thanks Junio for
your suggestions!

Patrick

Patrick Steinhardt (8):
  reftable/stack: do not overwrite errors when compacting
  reftable/stack: do not auto-compact twice in `reftable_stack_add()`
  reftable/writer: fix index corruption when writing multiple indices
  reftable/record: constify some parts of the interface
  reftable/record: store "val1" hashes as static arrays
  reftable/record: store "val2" hashes as static arrays
  reftable/merged: really reuse buffers to compute record keys
  reftable/merged: transfer ownership of records when iterating

 reftable/block_test.c      |   4 +-
 reftable/merged.c          |   8 +--
 reftable/merged_test.c     |  16 +++---
 reftable/readwrite_test.c  | 102 +++++++++++++++++++++++++++++++------
 reftable/record.c          |  17 ++-----
 reftable/record_test.c     |   5 --
 reftable/reftable-record.h |  11 ++--
 reftable/stack.c           |  23 +++------
 reftable/stack_test.c      |   2 -
 reftable/writer.c          |   4 +-
 10 files changed, 117 insertions(+), 75 deletions(-)

Range-diff against v2:
1:  22a57020c6 = 1:  1dc8ddf04a reftable/stack: do not overwrite errors when compacting
2:  a08f7efc99 = 2:  eccc34a4b6 reftable/stack: do not auto-compact twice in `reftable_stack_add()`
3:  c00e08d97f = 3:  15e12b8f29 reftable/writer: fix index corruption when writing multiple indices
4:  3416268087 = 4:  017e8943c7 reftable/record: constify some parts of the interface
5:  46ca3a37f8 ! 5:  f1d2190489 reftable/record: store "val1" hashes as static arrays
    @@ Metadata
      ## Commit message ##
         reftable/record: store "val1" hashes as static arrays
     
    -    When reading ref records of type "val1" we store its object ID in an
    +    When reading ref records of type "val1", we store its object ID in an
         allocated array. This results in an additional allocation for every
         single ref record we read, which is rather inefficient especially when
         iterating over refs.
     
    -    Refactor the code to instead use a static array of `GIT_MAX_RAWSZ`
    +    Refactor the code to instead use an embedded array of `GIT_MAX_RAWSZ`
         bytes. While this means that `struct ref_record` is bigger now, we
         typically do not store all refs in an array anyway and instead only
         handle a limited number of records at the same point in time.
6:  c8a36917b1 = 6:  6ec02d6775 reftable/record: store "val2" hashes as static arrays
7:  6313f8affd ! 7:  845dec2390 reftable/merged: really reuse buffers to compute record keys
    @@ Commit message
         reftable/merged: really reuse buffers to compute record keys
     
         In 829231dc20 (reftable/merged: reuse buffer to compute record keys,
    -    2023-12-11), we have refactored the merged iterator to reuse a set of
    -    buffers that it would otherwise have to reallocate on every single
    -    iteration. Unfortunately, there was a brown-paper-bag-style bug here as
    -    we continue to release these buffers after the iteration, and thus we
    -    have essentially gained nothing.
    +    2023-12-11), we have refactored the merged iterator to reuse a pair of
    +    long-living strbufs by relying on the fact that `reftable_record_key()`
    +    tries to reuse already allocated strbufs by calling `strbuf_reset()`,
    +    which should give us significantly fewer reallocations compared to the
    +    old code that used on-stack strbufs that are allocated for each and
    +    every iteration. Unfortunately, we called `strbuf_release()` on these
    +    long-living strbufs that we meant to reuse on each iteration, defeating
    +    the optimization.
     
         Fix this performance issue by not releasing those buffers on iteration
         anymore, where we instead rely on `merged_iter_close()` to release the
8:  25a3919e58 ! 8:  eea0e161ad reftable/merged: transfer ownership of records when iterating
    @@ Metadata
      ## Commit message ##
         reftable/merged: transfer ownership of records when iterating
     
    -    When iterating over recods with the merged iterator we put the records
    +    When iterating over records with the merged iterator we put the records
         into a priority queue before yielding them to the caller. This means
         that we need to allocate the contents of these records before we can
         pass them over to the caller.

base-commit: a26002b62827b89a19b1084bd75d9371d565d03c
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: ps/refstorage-extension (was: What's cooking in git.git (Jan 2024, #01; Tue, 2))
From: Patrick Steinhardt @ 2024-01-03  5:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <xmqq5y0bcjpw.fsf@gitster.g>

[-- Attachment #1: Type: text/plain, Size: 1200 bytes --]

On Tue, Jan 02, 2024 at 05:02:35PM -0800, Junio C Hamano wrote:
[snip]
> * ps/refstorage-extension (2024-01-02) 13 commits
>  - t9500: write "extensions.refstorage" into config
>  - builtin/clone: introduce `--ref-format=` value flag
>  - builtin/init: introduce `--ref-format=` value flag
>  - builtin/rev-parse: introduce `--show-ref-format` flag
>  - t: introduce GIT_TEST_DEFAULT_REF_FORMAT envvar
>  - setup: introduce GIT_DEFAULT_REF_FORMAT envvar
>  - setup: introduce "extensions.refStorage" extension
>  - setup: set repository's formats on init
>  - setup: start tracking ref storage format
>  - refs: refactor logic to look up storage backends
>  - worktree: skip reading HEAD when repairing worktrees
>  - t: introduce DEFAULT_REPO_FORMAT prereq
>  - Merge branch 'ps/clone-into-reftable-repository' into ps/refstorage-extension
> 
>  Introduce a new extension "refstorage" so that we can mark a
>  repository that uses a non-default ref backend, like reftable.
> 
>  Will merge to 'next'?
>  source: <cover.1703833818.git.ps@pks.im>

It's ready from my point of view, but I'm happy to wait a few days for
people to come back from holidays.

Thanks!

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v2 0/2] doc: bisect: change plural paths to singular pathspec
From: Britton Leo Kerin @ 2024-01-03  4:02 UTC (permalink / raw)
  To: git; +Cc: Britton Leo Kerin
In-Reply-To: <a5a8c257-8550-492e-a6fa-e88ee59d4d66@smtp-relay.sendinblue.com>

Britton Leo Kerin (2):
  doc: use singular form of repeatable path arg
  doc: refer to pathspec instead of path

 Documentation/git-bisect.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Range-diff against v1:
1:  90c081dcab ! 1:  da40e4736b doc: use singular form of repeatable path arg
    @@ Commit message
         later document text mentions 'path' arguments, while it doesn't mention
         'paths'.

    -    Signed-off-by: Britton L Kerin <britton.kergin@gmail.com>
    +    Signed-off-by: Britton Leo Kerin <britton.kergin@gmail.com>

      ## Documentation/git-bisect.txt ##
     @@ Documentation/git-bisect.txt: The command takes various subcommands, and different options depending
-:  ---------- > 2:  d932b6d501 doc: refer to pathspec instead of path
--
2.43.0



^ permalink raw reply

* [PATCH v2 1/2] doc: use singular form of repeatable path arg
From: Britton Leo Kerin @ 2024-01-03  4:02 UTC (permalink / raw)
  To: git; +Cc: Britton Leo Kerin, Britton Leo Kerin
In-Reply-To: <20240103040207.661413-1-britton.kerin@gmail.com>

This is more correct because the <path>... doc syntax already indicates
that the arg is "array-type".  It's how other tools do it.  Finally, the
later document text mentions 'path' arguments, while it doesn't mention
'paths'.

Signed-off-by: Britton Leo Kerin <britton.kergin@gmail.com>
---
 Documentation/git-bisect.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/git-bisect.txt b/Documentation/git-bisect.txt
index aa02e46224..b798282788 100644
--- a/Documentation/git-bisect.txt
+++ b/Documentation/git-bisect.txt
@@ -17,7 +17,7 @@ The command takes various subcommands, and different options depending
 on the subcommand:
 
  git bisect start [--term-(new|bad)=<term-new> --term-(old|good)=<term-old>]
-		  [--no-checkout] [--first-parent] [<bad> [<good>...]] [--] [<paths>...]
+		  [--no-checkout] [--first-parent] [<bad> [<good>...]] [--] [<path>...]
  git bisect (bad|new|<term-new>) [<rev>]
  git bisect (good|old|<term-old>) [<rev>...]
  git bisect terms [--term-good | --term-bad]
-- 
2.43.0



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox