Git development
 help / color / mirror / Atom feed
* [PATCH v2 02/11] reftable: handle interrupted reads
From: Patrick Steinhardt @ 2023-12-08 14:53 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <cover.1702047081.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1260 bytes --]

There are calls to pread(3P) and read(3P) where we don't properly handle
interrupts. Convert them to use `pread_in_full()` and `read_in_full()`,
respectively.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/blocksource.c | 2 +-
 reftable/stack.c       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/reftable/blocksource.c b/reftable/blocksource.c
index 8331b34e82..a1ea304429 100644
--- a/reftable/blocksource.c
+++ b/reftable/blocksource.c
@@ -109,7 +109,7 @@ static int file_read_block(void *v, struct reftable_block *dest, uint64_t off,
 	struct file_block_source *b = v;
 	assert(off + size <= b->size);
 	dest->data = reftable_malloc(size);
-	if (pread(b->fd, dest->data, size, off) != size)
+	if (pread_in_full(b->fd, dest->data, size, off) != size)
 		return -1;
 	dest->len = size;
 	return size;
diff --git a/reftable/stack.c b/reftable/stack.c
index ddbdf1b9c8..ed108a929b 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -92,7 +92,7 @@ static int fd_read_lines(int fd, char ***namesp)
 	}
 
 	buf = reftable_malloc(size + 1);
-	if (read(fd, buf, size) != size) {
+	if (read_in_full(fd, buf, size) != size) {
 		err = REFTABLE_IO_ERROR;
 		goto done;
 	}
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 00/11] reftable: small set of fixes
From: Patrick Steinhardt @ 2023-12-08 14:52 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <cover.1700549493.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2189 bytes --]

Hi,

this is the second version of my patch series that addresses several
smallish issues in the reftable backend. Given that the first version
didn't receive any reviews yet I decided to squash in additional
findings into this series. This is both to reduce the number of follow
up series, but also to hopefully push this topic onto the radar of folks
on the mailing list.

Changes compared to v1:

  - Allocations were optimized further for `struct merged_iter` by
    making the buffers part of the structure itself so that they can
    be reused across iterations.

  - Allocations were optimized for `struct block_iter` in the same
    way.

  - Temporary stacks have a supposedly-random suffix so that concurrent
    writers don't conflict with each other. We used unseeded `rand()`
    calls for it though, so they weren't random after all. This is fixed
    by converting to use `git_rand()` instead.

Patrick

Patrick Steinhardt (11):
  reftable: wrap EXPECT macros in do/while
  reftable: handle interrupted reads
  reftable: handle interrupted writes
  reftable/stack: verify that `reftable_stack_add()` uses
    auto-compaction
  reftable/stack: perform auto-compaction with transactional interface
  reftable/stack: reuse buffers when reloading stack
  reftable/stack: fix stale lock when dying
  reftable/stack: fix use of unseeded randomness
  reftable/merged: reuse buffer to compute record keys
  reftable/block: introduce macro to initialize `struct block_iter`
  reftable/block: reuse buffer to compute record keys

 reftable/block.c          |  23 ++++-----
 reftable/block.h          |   6 +++
 reftable/block_test.c     |   4 +-
 reftable/blocksource.c    |   2 +-
 reftable/iter.h           |   8 +--
 reftable/merged.c         |  31 +++++------
 reftable/merged.h         |   2 +
 reftable/reader.c         |   7 ++-
 reftable/readwrite_test.c |   6 +--
 reftable/stack.c          |  73 +++++++++++---------------
 reftable/stack_test.c     | 105 +++++++++++++++++++++++++++++++++++++-
 reftable/test_framework.h |  58 +++++++++++----------
 12 files changed, 211 insertions(+), 114 deletions(-)

-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v2 01/11] reftable: wrap EXPECT macros in do/while
From: Patrick Steinhardt @ 2023-12-08 14:52 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <cover.1702047081.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4427 bytes --]

The `EXPECT` macros used by the reftable test framework are all using a
single `if` statement with the actual condition. This results in weird
syntax when using them in if/else statements like the following:

```
if (foo)
	EXPECT(foo == 2)
else
	EXPECT(bar == 2)
```

Note that there need not be a trailing semicolon. Furthermore, it is not
immediately obvious whether the else now belongs to the `if (foo)` or
whether it belongs to the expanded `if (foo == 2)` from the macro.

Fix this by wrapping the macros in a do/while loop.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/test_framework.h | 58 +++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/reftable/test_framework.h b/reftable/test_framework.h
index 774cb275bf..ee44f735ae 100644
--- a/reftable/test_framework.h
+++ b/reftable/test_framework.h
@@ -12,32 +12,38 @@ license that can be found in the LICENSE file or at
 #include "system.h"
 #include "reftable-error.h"
 
-#define EXPECT_ERR(c)                                                  \
-	if (c != 0) {                                                  \
-		fflush(stderr);                                        \
-		fflush(stdout);                                        \
-		fprintf(stderr, "%s: %d: error == %d (%s), want 0\n",  \
-			__FILE__, __LINE__, c, reftable_error_str(c)); \
-		abort();                                               \
-	}
-
-#define EXPECT_STREQ(a, b)                                               \
-	if (strcmp(a, b)) {                                              \
-		fflush(stderr);                                          \
-		fflush(stdout);                                          \
-		fprintf(stderr, "%s:%d: %s (%s) != %s (%s)\n", __FILE__, \
-			__LINE__, #a, a, #b, b);                         \
-		abort();                                                 \
-	}
-
-#define EXPECT(c)                                                          \
-	if (!(c)) {                                                        \
-		fflush(stderr);                                            \
-		fflush(stdout);                                            \
-		fprintf(stderr, "%s: %d: failed assertion %s\n", __FILE__, \
-			__LINE__, #c);                                     \
-		abort();                                                   \
-	}
+#define EXPECT_ERR(c)                                                          \
+	do {                                                                   \
+		if (c != 0) {                                                  \
+			fflush(stderr);                                        \
+			fflush(stdout);                                        \
+			fprintf(stderr, "%s: %d: error == %d (%s), want 0\n",  \
+				__FILE__, __LINE__, c, reftable_error_str(c)); \
+			abort();                                               \
+		}                                                              \
+	} while (0)
+
+#define EXPECT_STREQ(a, b)                                                       \
+	do {                                                                     \
+		if (strcmp(a, b)) {                                              \
+			fflush(stderr);                                          \
+			fflush(stdout);                                          \
+			fprintf(stderr, "%s:%d: %s (%s) != %s (%s)\n", __FILE__, \
+				__LINE__, #a, a, #b, b);                         \
+			abort();                                                 \
+		}                                                                \
+	} while (0)
+
+#define EXPECT(c)                                                                  \
+	do {                                                                       \
+		if (!(c)) {                                                        \
+			fflush(stderr);                                            \
+			fflush(stdout);                                            \
+			fprintf(stderr, "%s: %d: failed assertion %s\n", __FILE__, \
+				__LINE__, #c);                                     \
+			abort();                                                   \
+		}                                                                  \
+	} while (0)
 
 #define RUN_TEST(f)                          \
 	fprintf(stderr, "running %s\n", #f); \
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* Re: [PATCH v2 1/2] completion: refactor existence checks for pseudorefs
From: Patrick Steinhardt @ 2023-12-08  8:24 UTC (permalink / raw)
  To: Stan Hu; +Cc: git
In-Reply-To: <1c6a747691f36ede4224b6d4c2e0c8fd4c0575fd.1701928891.git.stanhu@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1222 bytes --]

On Wed, Dec 06, 2023 at 10:06:39PM -0800, Stan Hu wrote:
> In preparation for the reftable backend, this commit introduces a
> '__git_pseudoref_exists' function that continues to use 'test -f' to
> determine whether a given pseudoref exists in the local filesystem.
> 
> Signed-off-by: Stan Hu <stanhu@gmail.com>
> ---
>  contrib/completion/git-completion.bash | 20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
> index 13a39ebd2e..9fbdc13f9a 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -122,6 +122,15 @@ __git ()
>  		${__git_dir:+--git-dir="$__git_dir"} "$@" 2>/dev/null
>  }
>  
> +# Runs git in $__git_repo_path to determine whether a ref exists.
> +# 1: The ref to search
> +__git_ref_exists ()

I first thought that you missed Junio's point that `__git_ref_exists`
may better be renamed to something lkie `__git_pseudoref_exists`. But
you do indeed change the name in the second patch. I'd propose to
instead squash the rename into the first patch so that the series
becomes easier to read.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
From: Patrick Steinhardt @ 2023-12-08  8:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZXIsbO++u9n/yDYi@nand.local>

[-- Attachment #1: Type: text/plain, Size: 1465 bytes --]

On Thu, Dec 07, 2023 at 03:34:52PM -0500, Taylor Blau wrote:
> On Thu, Dec 07, 2023 at 02:13:13PM +0100, Patrick Steinhardt wrote:
[snip]
> > Can't it happen that we have no pack here? In the MIDX-case we skip all
> > packs that either do not have a bitmap or are not preferred. So does it
> > mean that in reverse, every preferred packfile must have a a bitmap? I'd
> > think that to not be true in case bitmaps are turned off.
> 
> It's subtle, but this state is indeed not possible. If we have a MIDX
> and it has a bitmap, we know that there is at least one object at least
> one pack.
> 
> On the "at least one object front", that check was added in eb57277ba3
> (midx: prevent writing a .bitmap without any objects, 2022-02-09). And
> we know that our preferred pack (either explicitly given or the one we
> infer automatically) is non-empty, via the check added in 5d3cd09a80
> (midx: reject empty `--preferred-pack`'s, 2021-08-31).
> 
> (As a fun/non-fun aside, looking these up gave me some serious deja-vu
> and reminded me of how painful discovering and fixing those bugs was!)
> 
> So we're OK here. We could add a comment which captures what I wrote
> above here, but since this is a temporary state (and we're going to
> change how we select which packs are reuse candidates in a later patch),
> I think it's OK to avoid (but please let me know if you feel differently).

Makes sense, thanks for the explanation!

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
From: Patrick Steinhardt @ 2023-12-08  8:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZXIq4mjDUoqlGvgW@nand.local>

[-- Attachment #1: Type: text/plain, Size: 4326 bytes --]

On Thu, Dec 07, 2023 at 03:28:18PM -0500, Taylor Blau wrote:
> On Thu, Dec 07, 2023 at 02:13:08PM +0100, Patrick Steinhardt wrote:
> > >   - cruft packs (which may necessarily need to include an object from a
> > >     disjoint pack in order to freshen it in certain circumstances)
> >
> > This one took me a while to figure out. If we'd mark crufts as disjoint,
> > then it would mean that new packfiles cannot be marked as disjoint if
> > objects which were previously unreachable do become reachable again.
> > So we'd be pessimizing packfiles for live objects in favor of others
> > which aren't.
> 
> Yeah, that's right, too. There are a couple of cases where more than one
> cruft pack may contain the same object, one of them being the
> flip-flopping between reachable and unreachable as you suggest above.
> Another is that you have a non-prunable unreachable object which is
> already in a cruft pack. If the object's mtime gets updated (and still
> cannot be pruned), we'll end up freshening the object loose, and then
> packing it again (with the more recent mtime) into a new cruft pack.
> 
> That aside, I actually think that there are ways to mark cruft packs
> disjoint. But they're complicated, and moreover, I don't think you'd
> ever *want* to mark a cruft pack as disjoint. Cruft packs usually
> contain garbage, which is unlikely to be useful to any fetches/clones.
> 
> If we did mark them as disjoint, it would mean that we could reuse
> verbatim sections of the cruft pack in our output, but we would likely
> end up with very few such sections.

Makes sense. It also doesn't feel worth it to introduce additional
complexity for objects that for most of the part wouldn't ever be served
on a fetch anyway.

[snip]
> > Okay. I had a bit of trouble to sift through the various different
> > flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint"
> > and so on. But well, they do different things and it's been a few days
> > since I've reviewed the preceding patches, so this is probably fine.
> 
> Yeah, I am definitely open to better naming conventions here? I figured
> that:
> 
>   - --retain-disjoint was a good name for the MIDX option, since it is
>     retaining existing disjoint packs in the new MIDX
>   - --extend-disjoint was a good name for the repack option, since it is
>     extending the set of disjoint packs
>   - --ignore-disjoint was a good name for the pack-objects option, since
>     it is ignoring objects in disjoint packs
> 
> Writing this out, I think that you could make an argument that
> `--exclude-disjoint` is a better name for the last option. So I'm
> definitely open to suggestions here, but I don't want to get too bogged
> down on command-line option naming (so long as we're all reasonably
> happy with the result).

Yeah, as said, I don't mind it too much. It's a complex area and the
flags all do different things, so it's expected that you may have to do
some research on what exactly they do. That being said, I do like your
proposed `--exclude-disjoint` a lot more than `--ignore-disjoint`.

> > One thing I wondered: do we need to consider the `-l` flag? When using
> > an alternate object directory it is totally feasible that the alternate
> > may be creating new disjoint packages without us knowing, and thus we
> > may not be able to guarantee the disjoint property anymore.
> 
> I don't think so. We'd only care about one direction of this (that
> alternates do not create disjoint packs which overlap with ours, instead
> of the other way around), but since we don't put non-local packs in the
> MIDX, I think we're OK.
> 
> I suppose that you might run into trouble if you use the chained MIDX
> thing (via its `->next` pointer). I haven't used that feature myself, so
> I'd have to play around with it.

We do use this feature at GitLab for forks, where forks connect to a
common alternate object directory to deduplicate objects. As both the
fork repository and the alternate object directory use an MIDX I think
they would be set up exactly like that.

I guess the only really viable solution here is to ignore disjoint packs
in the main repo that connects to the alternate in the case where the
alternate has any disjoint packs itself.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [BUG] rev-list doesn't validate arguments to -n option
From: Britton Kerin @ 2023-12-07 22:12 UTC (permalink / raw)
  To: git

It tolerates non-numeric arguments and garbage after a number:

For example:

$ # -n 1 means same as -n 0:
$ git rev-list -n q newest_commit
$ git rev-list -n 0 newest_commit
$ # Garbage after number is tolerated:
$ git rev-list -n 1q newest_commit
3be33f83695088d968cf084a1a08bdcde25a8d7a
$ git rev-list -n 2q newest_commit
3be33f83695088d968cf084a1a08bdcde25a8d7a
286e62e1b68e39334978e6222cbff187ecc17bcf

^ permalink raw reply

* Re: [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse
From: Taylor Blau @ 2023-12-07 20:47 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZXHE-cLrP7iRHWHY@tanuki>

On Thu, Dec 07, 2023 at 02:13:29PM +0100, Patrick Steinhardt wrote:
> On Tue, Nov 28, 2023 at 02:08:37PM -0500, Taylor Blau wrote:
> > The function `write_reused_pack()` within `builtin/pack-objects.c` is
> > responsible for performing pack-reuse on a single pack, and has two main
> > functions:
> >
> >   - it dispatches a call to `write_reused_pack_verbatim()` to see if we
> >     can reuse portions of the packfile in whole-word chunks
> >
> >   - for any remaining objects (that is, any objects that appear after
> >     the first "gap" in the bitmap), call write_reused_pack_one() on that
> >     object to record it for reuse.
> >
> > Prepare this function for multi-pack reuse by removing the assumption
> > that the bit position corresponding to the first object being reused
> > from a given pack may not be at bit position zero.
>
> Is this double-negation intended? We remove the assumption that we start
> reading at position zero, but the paragraph here states that we remove
> the assumption that we do _not_ start at bit zero.

Oops, great catch. I'll s/may not/must in the last paragraph to clarify.

> I'll stop reviewing here and will come back to this series somewhen next
> week.

Thanks as usual for your review -- I appreciate you digging through this
rather dense series.

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack
From: Taylor Blau @ 2023-12-07 20:43 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZXHE9L7iqXQAit_1@tanuki>

On Thu, Dec 07, 2023 at 02:13:24PM +0100, Patrick Steinhardt wrote:
> > In order to compute this value correctly, we need to know not only where
> > we are in the packfile we're assembling (with `hashfile_total(f)`) but
> > also the position of the first byte of the packfile that we are
> > currently reusing.
> >
> > Together, these two allow us to compute the reused chunk's offset
> > difference relative to the start of the reused pack, as desired.
>
> Hm. I'm not quite sure I fully understand the motivation here. Is this
> something that was broken all along? Why does it become a problem now?
> Sorry if I'm missing the obvious here.

No worries, I should have explained this better. Indeed we do have to
worry about patching deltas today when reusing objects from a pack. But
we have to extend the implementation in order to perform reuse over
multiple packs when any of them (excluding the first, which would work
with the existing logic) have delta/base pairs on either side of a gap.

I'll try to make it a little clearer, thanks for pointing that out.

> > @@ -1126,10 +1129,12 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
> >  {
> >  	size_t i = 0;
> >  	uint32_t offset;
> > +	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
>
> Given that this patch in its current state doesn't seem to do anything
> yet, am I right in assuming that `hashfile_total(f) - sizeof(struct
> pack_header)` is always expected to be zero for now?

Yep, that's right.

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
From: Taylor Blau @ 2023-12-07 20:34 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZXHE6Ym3CICtNxFd@tanuki>

On Thu, Dec 07, 2023 at 02:13:13PM +0100, Patrick Steinhardt wrote:
> > +	if (bitmap_is_midx(bitmap_git)) {
> > +		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
> > +			struct bitmapped_pack pack;
> > +			if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
> > +				warning(_("unable to load pack: '%s', disabling pack-reuse"),
> > +					bitmap_git->midx->pack_names[i]);
> > +				free(packs);
> > +				return -1;
> > +			}
> > +			if (!pack.bitmap_nr)
> > +				continue; /* no objects from this pack */
> > +			if (pack.bitmap_pos)
> > +				continue; /* not preferred pack */
> > +
> > +			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
> > +			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
> > +
> > +			objects_nr += pack.p->num_objects;
> > +		}
> > +	} else {
> > +		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
> > +
> > +		packs[packs_nr].p = bitmap_git->pack;
> > +		packs[packs_nr].bitmap_pos = 0;
> > +		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
> > +		packs[packs_nr].disjoint = 1;
> > +
> > +		objects_nr = packs[packs_nr++].p->num_objects;
> > +	}
> > +
> > +	word_alloc = objects_nr / BITS_IN_EWORD;
> > +	if (objects_nr % BITS_IN_EWORD)
> > +		word_alloc++;
> > +	reuse = bitmap_word_alloc(word_alloc);
> > +
> > +	if (packs_nr != 1)
> > +		BUG("pack reuse not yet implemented for multiple packs");
>
> Can't it happen that we have no pack here? In the MIDX-case we skip all
> packs that either do not have a bitmap or are not preferred. So does it
> mean that in reverse, every preferred packfile must have a a bitmap? I'd
> think that to not be true in case bitmaps are turned off.

It's subtle, but this state is indeed not possible. If we have a MIDX
and it has a bitmap, we know that there is at least one object at least
one pack.

On the "at least one object front", that check was added in eb57277ba3
(midx: prevent writing a .bitmap without any objects, 2022-02-09). And
we know that our preferred pack (either explicitly given or the one we
infer automatically) is non-empty, via the check added in 5d3cd09a80
(midx: reject empty `--preferred-pack`'s, 2021-08-31).

(As a fun/non-fun aside, looking these up gave me some serious deja-vu
and reminded me of how painful discovering and fixing those bugs was!)

So we're OK here. We could add a comment which captures what I wrote
above here, but since this is a temporary state (and we're going to
change how we select which packs are reuse candidates in a later patch),
I think it's OK to avoid (but please let me know if you feel differently).

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
From: Taylor Blau @ 2023-12-07 20:28 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZXHE5Lce_6CAWKFT@tanuki>

On Thu, Dec 07, 2023 at 02:13:08PM +0100, Patrick Steinhardt wrote:
> > The gotchas mostly have to do with making sure that we do not generate a
> > disjoint pack in the following scenarios:
>
> Okay, let me verify whether I understand the reasons:
>
> >   - promisor packs
>
> Which is because promisor packs actually don't contain any objects?

Right.

> >   - cruft packs (which may necessarily need to include an object from a
> >     disjoint pack in order to freshen it in certain circumstances)
>
> This one took me a while to figure out. If we'd mark crufts as disjoint,
> then it would mean that new packfiles cannot be marked as disjoint if
> objects which were previously unreachable do become reachable again.
> So we'd be pessimizing packfiles for live objects in favor of others
> which aren't.

Yeah, that's right, too. There are a couple of cases where more than one
cruft pack may contain the same object, one of them being the
flip-flopping between reachable and unreachable as you suggest above.
Another is that you have a non-prunable unreachable object which is
already in a cruft pack. If the object's mtime gets updated (and still
cannot be pruned), we'll end up freshening the object loose, and then
packing it again (with the more recent mtime) into a new cruft pack.

That aside, I actually think that there are ways to mark cruft packs
disjoint. But they're complicated, and moreover, I don't think you'd
ever *want* to mark a cruft pack as disjoint. Cruft packs usually
contain garbage, which is unlikely to be useful to any fetches/clones.

If we did mark them as disjoint, it would mean that we could reuse
verbatim sections of the cruft pack in our output, but we would likely
end up with very few such sections.

> >   - all-into-one repacks without '-d'
>
> Because here the old packfiles that this would make redundant aren't
> deleted and thus the objects are duplicate now.

Yep.

> > Otherwise, we mark which packs were created as disjoint by using a new
> > bit in the `generated_pack_data` struct, and then marking those pack(s)
> > as disjoint accordingly when generating the MIDX. Non-deleted packs
> > which are marked as disjoint are retained as such by passing the
> > equivalent of `--retain-disjoint` when calling the MIDX API to update
> > the MIDX.
>
> Okay. I had a bit of trouble to sift through the various different
> flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint"
> and so on. But well, they do different things and it's been a few days
> since I've reviewed the preceding patches, so this is probably fine.

Yeah, I am definitely open to better naming conventions here? I figured
that:

  - --retain-disjoint was a good name for the MIDX option, since it is
    retaining existing disjoint packs in the new MIDX
  - --extend-disjoint was a good name for the repack option, since it is
    extending the set of disjoint packs
  - --ignore-disjoint was a good name for the pack-objects option, since
    it is ignoring objects in disjoint packs

Writing this out, I think that you could make an argument that
`--exclude-disjoint` is a better name for the last option. So I'm
definitely open to suggestions here, but I don't want to get too bogged
down on command-line option naming (so long as we're all reasonably
happy with the result).

> One thing I wondered: do we need to consider the `-l` flag? When using
> an alternate object directory it is totally feasible that the alternate
> may be creating new disjoint packages without us knowing, and thus we
> may not be able to guarantee the disjoint property anymore.

I don't think so. We'd only care about one direction of this (that
alternates do not create disjoint packs which overlap with ours, instead
of the other way around), but since we don't put non-local packs in the
MIDX, I think we're OK.

I suppose that you might run into trouble if you use the chained MIDX
thing (via its `->next` pointer). I haven't used that feature myself, so
I'd have to play around with it.

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
From: Taylor Blau @ 2023-12-07 14:36 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZXHE7_KwukSRBET1@tanuki>

On Thu, Dec 07, 2023 at 02:13:19PM +0100, Patrick Steinhardt wrote:
> > +	if (reuse_packfile) {
> > +		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
> > +		if (!reuse_packfile_objects)
> > +			BUG("expected non-empty reuse bitmap");
>
> We're now re-computing `bitmap_popcount()` for the bitmap a second time.
> But I really don't think this is ever going to be a problem in practice
> given that it only does a bunch of math. Any performance regression
> would thus ultimately be drowned out by everything else.
>
> In other words: this is probably fine.

I definitely agree that any performance regression from calling
bitmap_popcount() twice would be drowned out by the rest of what
pack-objects is doing.

For what it's worth:

- The bitmap_popcount() call is a loop over ewah_bit_popcount64() for
  each of the allocated words. And the latter is more or less three
  copies of:

      b7:	55 55 55
      ba:	48 23 45 f8          	and    -0x8(%rbp),%rax
      be:	48 8b 55 f8          	mov    -0x8(%rbp),%rdx
      c2:	48 89 d1             	mov    %rdx,%rcx
      c5:	48 d1 e9             	shr    %rcx
      c8:	48 ba 55 55 55 55 55 	movabs $0x5555555555555555,%rdx
      cf:	55 55 55
      d2:	48 21 ca             	and    %rcx,%rdx
      d5:	48 01 d0             	add    %rdx,%rax
      d8:	48 89 45 f8          	mov    %rax,-0x8(%rbp)
      dc:	48 b8 33 33 33 33 33 	movabs $0x3333333333333333,%rax

  Followed by:

     144:	48 0f af c2          	imul   %rdx,%rax
     148:	48 c1 e8 38          	shr    $0x38,%rax
     14c:	5d                   	pop    %rbp
     14d:	c3                   	ret

  With the usual x86 ABI preamble and postamble. So this should be an
  extremely cheap function to compute.

- But, the earlier bitmap_popcount() call in
  reuse_partial_packfile_from_bitmap() is not necessary, since we only
  care whether or not there are _any_ bits set in the bitmap, not how
  many of them there are.

  So we could write something like `bitmap_empty(reuse)` instead, which
  would be much cheaper (again, not that I think we'll notice this
  either way, but throwing away the result of bitmap_popcount() and
  calling it twice does leave me a little unsatisfied).

So I think we could reasonably do something like:

--- 8< ---
diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 7b525b1ecd..ac7e0af622 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -169,6 +169,15 @@ size_t bitmap_popcount(struct bitmap *self)
 	return count;
 }

+int bitmap_is_empty(struct bitmap *self)
+{
+	size_t i;
+	for (i = 0; i < self->word_alloc; i++)
+		if (self->words[i])
+			return 0;
+	return 1;
+}
+
 int bitmap_equals(struct bitmap *self, struct bitmap *other)
 {
 	struct bitmap *big, *small;
diff --git a/ewah/ewok.h b/ewah/ewok.h
index 7eb8b9b630..c11d76c6f3 100644
--- a/ewah/ewok.h
+++ b/ewah/ewok.h
@@ -189,5 +189,6 @@ void bitmap_or_ewah(struct bitmap *self, struct ewah_bitmap *other);
 void bitmap_or(struct bitmap *self, const struct bitmap *other);

 size_t bitmap_popcount(struct bitmap *self);
+int bitmap_is_empty(struct bitmap *self);

 #endif
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 614fc09a4e..e50b322779 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2045,7 +2045,7 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,

 	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);

-	if (!bitmap_popcount(reuse)) {
+	if (bitmap_is_empty(reuse)) {
 		free(packs);
 		bitmap_free(reuse);
 		return;
--- >8 ---

Thanks,
Taylor

^ permalink raw reply related

* Re: [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <67a8196978244b56d4f60861f140b78c59d15e30.1701198172.git.me@ttaylorr.com>

[-- Attachment #1: Type: text/plain, Size: 1055 bytes --]

On Tue, Nov 28, 2023 at 02:08:37PM -0500, Taylor Blau wrote:
> The function `write_reused_pack()` within `builtin/pack-objects.c` is
> responsible for performing pack-reuse on a single pack, and has two main
> functions:
> 
>   - it dispatches a call to `write_reused_pack_verbatim()` to see if we
>     can reuse portions of the packfile in whole-word chunks
> 
>   - for any remaining objects (that is, any objects that appear after
>     the first "gap" in the bitmap), call write_reused_pack_one() on that
>     object to record it for reuse.
> 
> Prepare this function for multi-pack reuse by removing the assumption
> that the bit position corresponding to the first object being reused
> from a given pack may not be at bit position zero.

Is this double-negation intended? We remove the assumption that we start
reading at position zero, but the paragraph here states that we remove
the assumption that we do _not_ start at bit zero.

I'll stop reviewing here and will come back to this series somewhen next
week.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <6f4fba861b59f909148775ee64c3ba89afc872b5.1701198172.git.me@ttaylorr.com>

[-- Attachment #1: Type: text/plain, Size: 4093 bytes --]

On Tue, Nov 28, 2023 at 02:08:32PM -0500, Taylor Blau wrote:
> When reusing objects from a pack, we keep track of a set of one or more
> `reused_chunk`s, corresponding to sections of one or more object(s) from
> a source pack that we are reusing. Each chunk contains two pieces of
> information:
> 
>   - the offset of the first object in the source pack (relative to the
>     beginning of the source pack)
>   - the difference between that offset, and the corresponding offset in
>     the pack we're generating
> 
> The purpose of keeping track of these is so that we can patch an
> OFS_DELTAs that cross over a section of the reuse pack that we didn't
> take.
> 
> For instance, consider a hypothetical pack as shown below:
> 
>                                                 (chunk #2)
>                                                 __________...
>                                                /
>                                               /
>       +--------+---------+-------------------+---------+
>   ... | <base> | <other> |      (unused)     | <delta> | ...
>       +--------+---------+-------------------+---------+
>        \                /
>         \______________/
>            (chunk #1)
> 
> Suppose that we are sending objects "base", "other", and "delta", and
> that the "delta" object is stored as an OFS_DELTA, and that its base is
> "base". If we don't send any objects in the "(unused)" range, we can't
> copy the delta'd object directly, since its delta offset includes a
> range of the pack that we didn't copy, so we have to account for that
> difference when patching and reassembling the delta.
> 
> In order to compute this value correctly, we need to know not only where
> we are in the packfile we're assembling (with `hashfile_total(f)`) but
> also the position of the first byte of the packfile that we are
> currently reusing.
> 
> Together, these two allow us to compute the reused chunk's offset
> difference relative to the start of the reused pack, as desired.

Hm. I'm not quite sure I fully understand the motivation here. Is this
something that was broken all along? Why does it become a problem now?
Sorry if I'm missing the obvious here.

> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/pack-objects.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 7682bd65bb..eb8be514d1 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1016,6 +1016,7 @@ static off_t find_reused_offset(off_t where)
>  
>  static void write_reused_pack_one(struct packed_git *reuse_packfile,
>  				  size_t pos, struct hashfile *out,
> +				  off_t pack_start,
>  				  struct pack_window **w_curs)
>  {
>  	off_t offset, next, cur;
> @@ -1025,7 +1026,8 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
>  	offset = pack_pos_to_offset(reuse_packfile, pos);
>  	next = pack_pos_to_offset(reuse_packfile, pos + 1);
>  
> -	record_reused_object(offset, offset - hashfile_total(out));
> +	record_reused_object(offset,
> +			     offset - (hashfile_total(out) - pack_start));
>  
>  	cur = offset;
>  	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
> @@ -1095,6 +1097,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
>  
>  static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
>  					 struct hashfile *out,
> +					 off_t pack_start UNUSED,
>  					 struct pack_window **w_curs)
>  {
>  	size_t pos = 0;
> @@ -1126,10 +1129,12 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
>  {
>  	size_t i = 0;
>  	uint32_t offset;
> +	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);

Given that this patch in its current state doesn't seem to do anything
yet, am I right in assuming that `hashfile_total(f) - sizeof(struct
pack_header)` is always expected to be zero for now?

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <432854b27c6731bd6ab1fa739b3a086ec0a90be8.1701198172.git.me@ttaylorr.com>

[-- Attachment #1: Type: text/plain, Size: 2520 bytes --]

On Tue, Nov 28, 2023 at 02:08:24PM -0500, Taylor Blau wrote:
> The signature of `reuse_partial_packfile_from_bitmap()` currently takes
> in a bitmap, as well as three output parameters (filled through
> pointers, and passed as arguments), and also returns an integer result.
> 
> The output parameters are filled out with: (a) the packfile used for
> pack-reuse, (b) the number of objects from that pack that we can reuse,
> and (c) a bitmap indicating which objects we can reuse. The return value
> is either -1 (when there are no objects to reuse), or 0 (when there is
> at least one object to reuse).
> 
> Some of these parameters are redundant. Notably, we can infer from the
> bitmap how many objects are reused by calling bitmap_popcount(). And we
> can similar compute the return value based on that number as well.
> 
> As such, clean up the signature of this function to drop the "*entries"
> parameter, as well as the int return value, since the single caller of
> this function can infer these values themself.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/pack-objects.c | 16 +++++++++-------
>  pack-bitmap.c          | 16 +++++++---------
>  pack-bitmap.h          |  7 +++----
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 107154db34..2bb1b64e8f 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3946,13 +3946,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
>  	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
>  		return -1;
>  
> -	if (pack_options_allow_reuse() &&
> -	    !reuse_partial_packfile_from_bitmap(
> -			bitmap_git,
> -			&reuse_packfile,
> -			&reuse_packfile_objects,
> -			&reuse_packfile_bitmap)) {
> -		assert(reuse_packfile_objects);
> +	if (pack_options_allow_reuse())
> +		reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
> +						   &reuse_packfile_bitmap);
> +
> +	if (reuse_packfile) {
> +		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
> +		if (!reuse_packfile_objects)
> +			BUG("expected non-empty reuse bitmap");

We're now re-computing `bitmap_popcount()` for the bitmap a second time.
But I really don't think this is ever going to be a problem in practice
given that it only does a bunch of math. Any performance regression
would thus ultimately be drowned out by everything else.

In other words: this is probably fine.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <970bd9eaeae038adb6e7d4c399c9b033668a8864.1701198172.git.me@ttaylorr.com>

[-- Attachment #1: Type: text/plain, Size: 2313 bytes --]

On Tue, Nov 28, 2023 at 02:08:21PM -0500, Taylor Blau wrote:
[snip]
> @@ -2002,6 +1986,65 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
>  
>  done:
>  	unuse_pack(&w_curs);
> +}
> +
> +int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
> +				       struct packed_git **packfile_out,
> +				       uint32_t *entries,
> +				       struct bitmap **reuse_out)
> +{
> +	struct repository *r = the_repository;
> +	struct bitmapped_pack *packs = NULL;
> +	struct bitmap *result = bitmap_git->result;
> +	struct bitmap *reuse;
> +	size_t i;
> +	size_t packs_nr = 0, packs_alloc = 0;
> +	size_t word_alloc;
> +	uint32_t objects_nr = 0;
> +
> +	assert(result);
> +
> +	load_reverse_index(r, bitmap_git);
> +
> +	if (bitmap_is_midx(bitmap_git)) {
> +		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
> +			struct bitmapped_pack pack;
> +			if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
> +				warning(_("unable to load pack: '%s', disabling pack-reuse"),
> +					bitmap_git->midx->pack_names[i]);
> +				free(packs);
> +				return -1;
> +			}
> +			if (!pack.bitmap_nr)
> +				continue; /* no objects from this pack */
> +			if (pack.bitmap_pos)
> +				continue; /* not preferred pack */
> +
> +			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
> +			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
> +
> +			objects_nr += pack.p->num_objects;
> +		}
> +	} else {
> +		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
> +
> +		packs[packs_nr].p = bitmap_git->pack;
> +		packs[packs_nr].bitmap_pos = 0;
> +		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
> +		packs[packs_nr].disjoint = 1;
> +
> +		objects_nr = packs[packs_nr++].p->num_objects;
> +	}
> +
> +	word_alloc = objects_nr / BITS_IN_EWORD;
> +	if (objects_nr % BITS_IN_EWORD)
> +		word_alloc++;
> +	reuse = bitmap_word_alloc(word_alloc);
> +
> +	if (packs_nr != 1)
> +		BUG("pack reuse not yet implemented for multiple packs");

Can't it happen that we have no pack here? In the MIDX-case we skip all
packs that either do not have a bitmap or are not preferred. So does it
mean that in reverse, every preferred packfile must have a a bitmap? I'd
think that to not be true in case bitmaps are turned off.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <b75869befba26899d88d6c6d413cc756aeadbd80.1701198172.git.me@ttaylorr.com>

[-- Attachment #1: Type: text/plain, Size: 3299 bytes --]

On Tue, Nov 28, 2023 at 02:08:18PM -0500, Taylor Blau wrote:
> Now that we can generate packs which are disjoint with respect to the
> set of currently-disjoint packs, implement a mode of `git repack` which
> extends the set of disjoint packs with any new (non-cruft) pack(s)
> generated during the repack.
> 
> The idea is mostly straightforward, with a couple of gotcha's. The
> straightforward part is to make sure that any new packs are disjoint
> with respect to the set of currently disjoint packs which are _not_
> being removed from the repository as a result of the repack.
> 
> If a pack which is currently marked as disjoint is, on the other hand,
> about to be removed from the repository, it is OK (and expected) that
> new pack(s) will contain some or all of its objects. Since the pack
> originally marked as disjoint will be removed, it will necessarily leave
> the disjoint set, making room for new packs with its same objects to
> take its place. In other words, the resulting set of disjoint packs will
> be disjoint with respect to one another.
> 
> The gotchas mostly have to do with making sure that we do not generate a
> disjoint pack in the following scenarios:

Okay, let me verify whether I understand the reasons:

>   - promisor packs

Which is because promisor packs actually don't contain any objects?

>   - cruft packs (which may necessarily need to include an object from a
>     disjoint pack in order to freshen it in certain circumstances)

This one took me a while to figure out. If we'd mark crufts as disjoint,
then it would mean that new packfiles cannot be marked as disjoint if
objects which were previously unreachable do become reachable again.
So we'd be pessimizing packfiles for live objects in favor of others
which aren't.

>   - all-into-one repacks without '-d'

Because here the old packfiles that this would make redundant aren't
deleted and thus the objects are duplicate now.

>   - `--filter-to`, which conceptually could work with the new
>     `--extend-disjoint` option, but only in limited circumstances

We're probably also not properly set up to handle the new alternate
object directory and exclude objects that are part of a potentially
disjoint packfile that exists already. Also, the current MIDX may not
even cover the alternate.

> Otherwise, we mark which packs were created as disjoint by using a new
> bit in the `generated_pack_data` struct, and then marking those pack(s)
> as disjoint accordingly when generating the MIDX. Non-deleted packs
> which are marked as disjoint are retained as such by passing the
> equivalent of `--retain-disjoint` when calling the MIDX API to update
> the MIDX.

Okay. I had a bit of trouble to sift through the various different
flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint"
and so on. But well, they do different things and it's been a few days
since I've reviewed the preceding patches, so this is probably fine.

One thing I wondered: do we need to consider the `-l` flag? When using
an alternate object directory it is totally feasible that the alternate
may be creating new disjoint packages without us knowing, and thus we
may not be able to guarantee the disjoint property anymore.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 0/9] bonus config cleanups
From: Patrick Steinhardt @ 2023-12-07  8:58 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20231207072338.GA1277727@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 1763 bytes --]

On Thu, Dec 07, 2023 at 02:23:38AM -0500, Jeff King wrote:
> While looking carefully at various config callbacks for the series at:
> 
>   https://lore.kernel.org/git/20231207071030.GA1275835@coredump.intra.peff.net/
> 
> I noticed a bunch of other small bugs/cleanups. I split these into their
> own series here, which should be applied on top (it could go straight to
> "master", but there is a small conflict in patch 6, as the option it
> touches was fixed in the other series). I'm happy to prepare it as an
> independent series if we prefer.

The whole patch series looks good to me, thanks!

Patrick

>   [1/9]: config: reject bogus values for core.checkstat
>   [2/9]: git_xmerge_config(): prefer error() to die()
>   [3/9]: imap-send: don't use git_die_config() inside callback
>   [4/9]: config: use config_error_nonbool() instead of custom messages
>   [5/9]: diff: give more detailed messages for bogus diff.* config
>   [6/9]: config: use git_config_string() for core.checkRoundTripEncoding
>   [7/9]: push: drop confusing configset/callback redundancy
>   [8/9]: gpg-interface: drop pointless config_error_nonbool() checks
>   [9/9]: sequencer: simplify away extra git_config_string() call
> 
>  builtin/push.c      | 31 +++++++++++++------------------
>  builtin/send-pack.c | 27 ++++++++++++---------------
>  config.c            | 11 +++++------
>  convert.h           |  2 +-
>  diff.c              |  8 ++++++--
>  environment.c       |  2 +-
>  gpg-interface.c     | 15 +++------------
>  imap-send.c         |  2 +-
>  merge-ll.c          |  2 +-
>  sequencer.c         | 21 ++++++++-------------
>  xdiff-interface.c   |  7 ++++---
>  11 files changed, 55 insertions(+), 73 deletions(-)
> 
> -Peff
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 3/9] imap-send: don't use git_die_config() inside callback
From: Patrick Steinhardt @ 2023-12-07  8:57 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20231207072458.GC1277973@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 617 bytes --]

On Thu, Dec 07, 2023 at 02:24:58AM -0500, Jeff King wrote:
[snip]
> diff --git a/imap-send.c b/imap-send.c
> index 996651e4f8..5b0fe4f95a 100644
> --- a/imap-send.c
> +++ b/imap-send.c
> @@ -1346,7 +1346,7 @@ static int git_imap_config(const char *var, const char *val,
>  		server.port = git_config_int(var, val, ctx->kvi);
>  	else if (!strcmp("imap.host", var)) {
>  		if (!val) {
> -			git_die_config("imap.host", "Missing value for 'imap.host'");
> +			return error("Missing value for 'imap.host'");

Nit: while at it we might also mark this error for translation. Not
worth a reroll on its own though.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] setup: recognize bare repositories with packed-refs
From: Patrick Steinhardt @ 2023-12-07  8:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Adam Majer, git
In-Reply-To: <20231207073451.GA1278091@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 3353 bytes --]

On Thu, Dec 07, 2023 at 02:34:51AM -0500, Jeff King wrote:
> On Thu, Dec 07, 2023 at 08:01:41AM +0100, Patrick Steinhardt wrote:
> 
> > > So forgetting at all about how we structure the code, it seems to me
> > > that the problem is not new code, but all of the existing code which
> > > looks for access("refs", X_OK).
> > 
> > True. The question is of course how much value there is in an old tool
> > to be able to discover a new repository that it wouldn't be able to read
> > in the first place due to it not understanding the reference format. So
> > I'd very much like to see that eventually, we're able to get rid of
> > "legacy" cruft that doesn't serve any purpose anymore.
> 
> Right, nobody is going to be mad about not being able to use the
> repository with old code. My concern is that by skipping it in the
> discovery phase, though, the user may see unexpected behavior (like
> continuing and finding some _other_ repo). I admit it's a pretty narrow
> case, though.

Agreed, that's also an angle I brought up in a separate thread [1]. The
second benefit is that the user would get a proper error message stating
that the "extensions.refFormat" is not understood compared to Git just
skipping over the repository completely.

> > The question is whether we can do a better job of this going forward so
> > that at least we don't have to pose the same question in the future.
> > Right now, we'll face the same problem whenever any part of the current
> > on-disk repository data structures changes.
> > 
> > I wonder whether it would make sense to introduce something like a
> > filesystem-level hint, e.g. in the form of a new ".is-git-repository"
> > file. If Git discovers that file then it assumes the directory to be a
> > Git repository -- and everything else is set up by parsing the config
> > and thus the repository's configured format.
> 
> I kind of think the presence of a well-formed HEAD is a good indicator
> that it is a Git directory. IOW, I don't have any real problem with
> simply loosening is_git_directory() to remove the "refs/" check
> entirely. But again, that can only help us going forward; older versions
> will still get confused if we truly ditch "refs/" for other formats.
> 
> Of course some ref formats may want to avoid the "HEAD" file itself, so
> your .is-git-repository would be cleaner. I'm just not sure if it's
> worth the headache to try to switch things now.

I think that both "HEAD" and "refs/" are in the same spirit and consider
both to be legacy cruft that ideally wouldn't exist with the reftable
backend. I think dropping just one of these requirements ("refs/") is
the least favorable option though:

  - We'd still have unneeded files that only exist to aid old clients.

  - At the same time, the old clients wouldn't be able to detect the
    repository anymore and need an update. So we could just as well drop
    both files and would have the same outcome.

  - This is not a long-term solution in case anything else in the
    on-disk format will ever change.

Whether it's worth getting rid of them now... probably not, at least not
for the time being. But if we want to address this issue I'd rather want
to aim for a proper solution that also works for future changes.

Patrick

[1]: <ZXFy0_T1AZLh058g@tanuki>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] setup: recognize bare repositories with packed-refs
From: Adam Majer @ 2023-12-07  8:20 UTC (permalink / raw)
  To: Jeff King, Taylor Blau; +Cc: git
In-Reply-To: <20231206210836.GA106480@coredump.intra.peff.net>



On 12/6/23 22:08, Jeff King wrote:
> On Wed, Nov 29, 2023 at 04:30:46PM -0500, Taylor Blau wrote:
> 
>> On Tue, Nov 28, 2023 at 02:04:46PM -0500, Jeff King wrote:
>>>    - whatever is consuming the embedded repos could "mkdir -p refs
>>>      objects" as needed. This is a minor pain, but I think in the long
>>>      term we are moving to a world where you have to explicitly do
>>>      "GIT_DIR=$PWD/embedded.git" to access an embedded bare repo. So
>>>      they're already special and require some setup; adding an extra step
>>>      may not be so bad.
>>
>> I hope not. I suppose that using embedded bare repositories in a test
>> requires additional setup at least to "cd" into the directory (if they
>> are not using `$GIT_DIR` or `--git-dir` already). But I fear that
>> imposing even a small change like this is too tall an order for how many
>> millions of these exist in the wild across all sorts of projects.
> 
> I dunno. I am skeptical that there are millions of these. Who really
> wants to embed bare git repos except for projects related to Git itself,
> which want test vectors? Is there a use case I'm missing?
Well, it's an "easy" thing to do, instead of recreating these test cases 
from sources like it's done here. It seems this is what happens in 
projects like Gitea.

As to the original questions you've raised earlier in the thread, I 
thought about it, and I don't really have a compelling reason to try to 
force this patch into Git. At least, I do not feel it necessary to try 
to argue the points you've raised. If that means the patch is ignored, 
I'm ok with that.

The reasons I put it here is simply I found that it fixes an issue I 
came across and that "everything else" worked. I don't know the 
intricacies of current or future git plans and I would rather delegate 
such discussion to the experts.

Best regards,
Adam

^ permalink raw reply

* Re: [PATCH 7/7] fsck: handle NULL value when parsing message config
From: Patrick Steinhardt @ 2023-12-07  8:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Carlos Andrés Ramírez Cataño
In-Reply-To: <20231207071135.GG1276005@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 2380 bytes --]

On Thu, Dec 07, 2023 at 02:11:35AM -0500, Jeff King wrote:
> When parsing fsck.*, receive.fsck.*, or fetch.fsck.*, we don't check for
> an implicit bool. So any of:
> 
>   [fsck]
>   badTree
>   [receive "fsck"]
>   badTree
>   [fetch "fsck"]
>   badTree
> 
> will cause us to segfault. We can fix it with config_error_nonbool() in
> the usual way, but we have to make a few more changes to get good error
> messages. The problem is that all three spots do:
> 
>   if (skip_prefix(var, "fsck.", &var))
> 
> to match and parse the actual message id. But that means that "var" now
> just says "badTree" instead of "receive.fsck.badTree", making the
> resulting message confusing. We can fix that by storing the parsed
> message id in its own separate variable.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  builtin/receive-pack.c | 11 +++++++----
>  fetch-pack.c           | 12 ++++++++----
>  fsck.c                 |  8 ++++++--
>  3 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
> index 8c4f0cb90a..ccf9738bce 100644
> --- a/builtin/receive-pack.c
> +++ b/builtin/receive-pack.c
> @@ -142,6 +142,7 @@ static enum deny_action parse_deny_action(const char *var, const char *value)
>  static int receive_pack_config(const char *var, const char *value,
>  			       const struct config_context *ctx, void *cb)
>  {
> +	const char *msg_id;
>  	int status = parse_hide_refs_config(var, value, "receive", &hidden_refs);
>  
>  	if (status)
> @@ -178,12 +179,14 @@ static int receive_pack_config(const char *var, const char *value,
>  		return 0;
>  	}
>  
> -	if (skip_prefix(var, "receive.fsck.", &var)) {
> -		if (is_valid_msg_type(var, value))
> +	if (skip_prefix(var, "receive.fsck.", &msg_id)) {
> +		if (!value)
> +			return config_error_nonbool(var);
> +		if (is_valid_msg_type(msg_id, value))
>  			strbuf_addf(&fsck_msg_types, "%c%s=%s",
> -				fsck_msg_types.len ? ',' : '=', var, value);
> +				fsck_msg_types.len ? ',' : '=', msg_id, value);
>  		else
> -			warning("skipping unknown msg id '%s'", var);
> +			warning("skipping unknown msg id '%s'", msg_id);
>  		return 0;
>  	}

Same remark here. We would only generate warnings for an actual boolean,
so I'd think we might consider doing the same for implicit booleans.

Partick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 6/7] trailer: handle NULL value when parsing trailer-specific config
From: Patrick Steinhardt @ 2023-12-07  8:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Carlos Andrés Ramírez Cataño
In-Reply-To: <20231207071132.GF1276005@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 1779 bytes --]

On Thu, Dec 07, 2023 at 02:11:32AM -0500, Jeff King wrote:
> When parsing the "key", "command", and "cmd" trailer config, we just
> make a copy of the value string.  If we see an implicit bool like:
> 
>   [trailer "foo"]
>   key
> 
> we'll segfault trying to copy a NULL pointer. We can fix this with the
> usual config_error_nonbool() check.
> 
> I split this out from the other vanilla cases, because at first glance
> it looks like a better fix here would be to move the NULL check out of
> the switch statement. But it would change the behavior of other keys
> like trailer.*.ifExists, where an implicit bool is interpreted as
> EXISTS_DEFAULT.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  trailer.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/trailer.c b/trailer.c
> index b0e2ec224a..e4b08ed267 100644
> --- a/trailer.c
> +++ b/trailer.c
> @@ -553,16 +553,22 @@ static int git_trailer_config(const char *conf_key, const char *value,
>  	case TRAILER_KEY:
>  		if (conf->key)
>  			warning(_("more than one %s"), conf_key);
> +		if (!value)
> +			return config_error_nonbool(conf_key);
>  		conf->key = xstrdup(value);
>  		break;
>  	case TRAILER_COMMAND:
>  		if (conf->command)
>  			warning(_("more than one %s"), conf_key);
> +		if (!value)
> +			return config_error_nonbool(conf_key);
>  		conf->command = xstrdup(value);
>  		break;
>  	case TRAILER_CMD:
>  		if (conf->cmd)
>  			warning(_("more than one %s"), conf_key);
> +		if (!value)
> +			return config_error_nonbool(conf_key);
>  		conf->cmd = xstrdup(value);
>  		break;
>  	case TRAILER_WHERE:

For the other cases we only generate warnings for unknown values, but
return successfully. Should we do the same here?

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 5/7] submodule: handle NULL value when parsing submodule.*.branch
From: Patrick Steinhardt @ 2023-12-07  8:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Carlos Andrés Ramírez Cataño
In-Reply-To: <20231207071129.GE1276005@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 1920 bytes --]

On Thu, Dec 07, 2023 at 02:11:29AM -0500, Jeff King wrote:
> We record the submodule branch config value as a string, so config that
> uses an implicit bool like:
> 
>   [submodule "foo"]
>   branch
> 
> will cause us to segfault. Note that unlike most other config-parsing
> bugs of this class, this can be triggered by parsing a bogus .gitmodules
> file (which we might do after cloning a malicious repository).
> 
> I don't think the security implications are important, though. It's
> always a strict NULL dereference, not an out-of-bounds read or write. So
> we should reliably kill the process. That may be annoying, but the
> impact is limited to the attacker preventing the victim from
> successfully using "git clone --recurse-submodules", etc, on the
> malicious repo.
> 
> The "branch" entry is the only one with this problem; other strings like
> "path" and "url" already check for NULL.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  submodule-config.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/submodule-config.c b/submodule-config.c
> index 6a48fd12f6..f4dd482abc 100644
> --- a/submodule-config.c
> +++ b/submodule-config.c
> @@ -516,7 +516,9 @@ static int parse_config(const char *var, const char *value,
>  			submodule->recommend_shallow =
>  				git_config_bool(var, value);
>  	} else if (!strcmp(item.buf, "branch")) {
> -		if (!me->overwrite && submodule->branch)
> +		if (!value)
> +			ret = config_error_nonbool(var);
> +		else if (!me->overwrite && submodule->branch)
>  			warn_multiple_config(me->treeish_name, submodule->name,
>  					     "branch");
>  		else {

I was about to say that I'd rather expect us to handle this gracefully
so that Git doesn't die when parsing an invalid gitmodules file. But
there are other cases where we already fail in the same way, so this
looks good to me.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 1/7] config: handle NULL value when parsing non-bools
From: Patrick Steinhardt @ 2023-12-07  8:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Carlos Andrés Ramírez Cataño
In-Reply-To: <20231207071114.GA1276005@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 7461 bytes --]

On Thu, Dec 07, 2023 at 02:11:14AM -0500, Jeff King wrote:
> When the config parser sees an "implicit" bool like:
> 
>   [core]
>   someVariable
> 
> it passes NULL to the config callback. Any callback code which expects a
> string must check for NULL. This usually happens via helpers like
> git_config_string(), etc, but some custom code forgets to do so and will
> segfault.
> 
> These are all fairly vanilla cases where the solution is just the usual
> pattern of:
> 
>   if (!value)
>         return config_error_nonbool(var);
> 
> though note that in a few cases we have to split initializers like:
> 
>   int some_var = initializer();
> 
> into:
> 
>   int some_var;
>   if (!value)
>         return config_error_nonbool(var);
>   some_var = initializer();
> 
> There are still some broken instances after this patch, which I'll
> address on their own in individual patches after this one.
> 
> Reported-by: Carlos Andrés Ramírez Cataño <antaigroupltda@gmail.com>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  builtin/blame.c        |  2 ++
>  builtin/checkout.c     |  2 ++
>  builtin/clone.c        |  2 ++
>  builtin/log.c          |  5 ++++-
>  builtin/pack-objects.c |  6 +++++-
>  compat/mingw.c         |  2 ++
>  config.c               |  8 ++++++++
>  diff.c                 | 19 ++++++++++++++++---
>  mailinfo.c             |  2 ++
>  notes-utils.c          |  2 ++
>  trailer.c              |  2 ++
>  11 files changed, 47 insertions(+), 5 deletions(-)
> 
> diff --git a/builtin/blame.c b/builtin/blame.c
> index 9c987d6567..2433b7da5c 100644
> --- a/builtin/blame.c
> +++ b/builtin/blame.c
> @@ -748,6 +748,8 @@ static int git_blame_config(const char *var, const char *value,
>  	}
>  
>  	if (!strcmp(var, "blame.coloring")) {
> +		if (!value)
> +			return config_error_nonbool(var);

In the `else` statement where we fail to parse the value we only
generate a warning and return successfully regardless. Should we do the
same here?

>  		if (!strcmp(value, "repeatedLines")) {
>  			coloring_mode |= OUTPUT_COLOR_LINE;
>  		} else if (!strcmp(value, "highlightRecent")) {
> diff --git a/builtin/checkout.c b/builtin/checkout.c
> index f02434bc15..d5c784854f 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -1202,6 +1202,8 @@ static int git_checkout_config(const char *var, const char *value,
>  	struct checkout_opts *opts = cb;
>  
>  	if (!strcmp(var, "diff.ignoresubmodules")) {
> +		if (!value)
> +			return config_error_nonbool(var);
>  		handle_ignore_submodules_arg(&opts->diff_options, value);
>  		return 0;
>  	}

This one is fine, `handle_ignore_submodules_arg()` dies if it gets an
unknown value and `git_config()` will die when the callback function
returns an error. The same is true for many other cases you've
converted.

[snip]
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 89a8b5a976..62c540b4db 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3204,14 +3204,18 @@ static int git_pack_config(const char *k, const char *v,
>  		return 0;
>  	}
>  	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
> -		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
> +		struct configured_exclusion *ex;
>  		const char *oid_end, *pack_end;
>  		/*
>  		 * Stores the pack hash. This is not a true object ID, but is
>  		 * of the same form.
>  		 */
>  		struct object_id pack_hash;
>  
> +		if (!v)
> +			return config_error_nonbool(k);
> +
> +		ex = xmalloc(sizeof(*ex));
>  		if (parse_oid_hex(v, &ex->e.oid, &oid_end) ||
>  		    *oid_end != ' ' ||
>  		    parse_oid_hex(oid_end + 1, &pack_hash, &pack_end) ||

This isn't part of the diff and not a new issue, but why don't we
`return 0` when parsing this config correctly? We fall through to
`git_default_config()` even if we've successfully parsed the config key,
which seems like a bug to me.

Anyway, this case looks fine.

[snip]
> diff --git a/config.c b/config.c
> index b330c7adb4..18085c7e38 100644
> --- a/config.c
> +++ b/config.c
> @@ -1386,6 +1386,8 @@ static int git_default_core_config(const char *var, const char *value,
>  		return 0;
>  	}
>  	if (!strcmp(var, "core.checkstat")) {
> +		if (!value)
> +			return config_error_nonbool(var);
>  		if (!strcasecmp(value, "default"))
>  			check_stat = 1;
>  		else if (!strcasecmp(value, "minimal"))

We would ignore `true` here, so should we ignore implicit `true`, as
well?

> @@ -1547,11 +1549,15 @@ static int git_default_core_config(const char *var, const char *value,
>  	}
>  
>  	if (!strcmp(var, "core.checkroundtripencoding")) {
> +		if (!value)
> +			return config_error_nonbool(var);
>  		check_roundtrip_encoding = xstrdup(value);
>  		return 0;
>  	}
>  
>  	if (!strcmp(var, "core.notesref")) {
> +		if (!value)
> +			return config_error_nonbool(var);
>  		notes_ref_name = xstrdup(value);
>  		return 0;
>  	}

I wonder the same here. We might as well use `xstrdup_or_null()`, but it
feels like the right thing to do to convert these to actual errors.

> @@ -426,10 +429,15 @@ int git_diff_ui_config(const char *var, const char *value,
>  	if (!strcmp(var, "diff.orderfile"))
>  		return git_config_pathname(&diff_order_file_cfg, var, value);
>  
> -	if (!strcmp(var, "diff.ignoresubmodules"))
> +	if (!strcmp(var, "diff.ignoresubmodules")) {
> +		if (!value)
> +			return config_error_nonbool(var);
>  		handle_ignore_submodules_arg(&default_diff_options, value);
> +	}
>  
>  	if (!strcmp(var, "diff.submodule")) {
> +		if (!value)
> +			return config_error_nonbool(var);
>  		if (parse_submodule_params(&default_diff_options, value))
>  			warning(_("Unknown value for 'diff.submodule' config variable: '%s'"),
>  				value);

Should we generate a warning instead according to the preexisting code
for "diff.submodule"?

> @@ -490,6 +501,8 @@ int git_diff_basic_config(const char *var, const char *value,
>  
>  	if (!strcmp(var, "diff.dirstat")) {
>  		struct strbuf errmsg = STRBUF_INIT;
> +		if (!value)
> +			return config_error_nonbool(var);
>  		default_diff_options.dirstat_permille = diff_dirstat_permille_default;
>  		if (parse_dirstat_params(&default_diff_options, value, &errmsg))
>  			warning(_("Found errors in 'diff.dirstat' config variable:\n%s"),

Same here, should we generate a warning instead?

> diff --git a/notes-utils.c b/notes-utils.c
> index 97c031c26e..01f4f5b424 100644
> --- a/notes-utils.c
> +++ b/notes-utils.c
> @@ -112,6 +112,8 @@ static int notes_rewrite_config(const char *k, const char *v,
>  		}
>  		return 0;
>  	} else if (!c->refs_from_env && !strcmp(k, "notes.rewriteref")) {
> +		if (!v)
> +			return config_error_nonbool(k);
>  		/* note that a refs/ prefix is implied in the
>  		 * underlying for_each_glob_ref */
>  		if (starts_with(v, "refs/notes/"))

Here, as well.

> diff --git a/trailer.c b/trailer.c
> index b6de5d9cb2..b0e2ec224a 100644
> --- a/trailer.c
> +++ b/trailer.c
> @@ -507,6 +507,8 @@ static int git_trailer_default_config(const char *conf_key, const char *value,
>  				warning(_("unknown value '%s' for key '%s'"),
>  					value, conf_key);
>  		} else if (!strcmp(trailer_item, "separators")) {
> +			if (!value)
> +				return config_error_nonbool(conf_key);
>  			separators = xstrdup(value);
>  		}
>  	}

And here.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox