Bug report: v2.47.0 cannot fetch version 1 pack indexes

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Bug report: v2.47.0 cannot fetch version 1 pack indexes
@ 2024-10-19 23:22 fox
  2024-10-20  0:37 ` Eric Sunshine
  2024-10-20  1:24 ` Jeff King
  0 siblings, 2 replies; 31+ messages in thread
From: fox @ 2024-10-19 23:22 UTC (permalink / raw)
  To: git

What did you do before the bug happened? (Steps to reproduce your issue)

1. Run git clone https://www.townlong-yak.com/test.git

What did you expect to happen? (Expected behavior)

A local "test" folder to be created with a clone of the remote repository.

What happened instead? (Actual behavior)

The command produces the following output:
Cloning into 'test'...
error: files '/Users/me/test/.git/objects/pack/tmp_idx_WT81vv' and '/Users/me/test/.git/objects/pack/pack-427331d91391b00844273eeb3879cb479ce2c995.idx' differ in contents
fatal: unable to rename temporary '*.idx' file to '/Users/me/test/.git/objects/pack/pack-427331d91391b00844273eeb3879cb479ce2c995.idx'
error: Unable to find 6261a9d9f7704c02a5421ff733919ab18793aa7d under https://www.townlong-yak.com/test.git
Cannot obtain needed object 6261a9d9f7704c02a5421ff733919ab18793aa7d
error: fetch failed.

What's different between what you expected and what actually happened?

The clone fails and no test folder exists after the command completes.

Anything else you want to add:

This scenario works correctly with git v2.46.2 and earlier versions, and began failing with v2.47.0.

Running git-bisect identifies b1b8dfde6929ec9463eca0a858c4adb9786d7c93 as the first bad commit,
suggesting that the .idx file downloaded from the remote is now expected to be byte-for-byte
identical with a locally-generated version; due to format differences, they are not.

The remote idx is in the original (version 1) format, and git verify-pack seems satisfied with it.
Did v2.47.0 intend to block fetching such indices?

[System Info]
git version:
git version 2.47.0.163.g1226f6d8fa
cpu: x86_64
built from commit: 1226f6d8faf60d03cbb3b021c68d48364bf67ac0
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
feature: fsmonitor--daemon
libcurl: 8.7.1
zlib: 1.2.12
uname: Darwin 24.0.0 Darwin Kernel Version 24.0.0: Tue Sep 24 23:36:30 PDT 2024; root:xnu-11215.1.12~1/RELEASE_X86_64 x86_64
compiler info: clang: 16.0.0 (clang-1600.0.26.3)
libc info: no libc information available
$SHELL (typically, interactive shell): /bin/bash

[Enabled Hooks]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-19 23:22 Bug report: v2.47.0 cannot fetch version 1 pack indexes fox
@ 2024-10-20  0:37 ` Eric Sunshine
  2024-10-21 20:06   ` Taylor Blau
  2024-10-20  1:24 ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Sunshine @ 2024-10-20  0:37 UTC (permalink / raw)
  To: fox; +Cc: git, Jeff King, Taylor Blau, Elijah Newren

On Sat, Oct 19, 2024 at 7:31 PM fox <fox.gbr@townlong-yak.com> wrote:
> What did you do before the bug happened? (Steps to reproduce your issue)
> 1. Run git clone https://www.townlong-yak.com/test.git
> Cloning into 'test'...
> error: files '/Users/me/test/.git/objects/pack/tmp_idx_WT81vv' and '/Users/me/test/.git/objects/pack/pack-427331d91391b00844273eeb3879cb479ce2c995.idx' differ in contents
> fatal: unable to rename temporary '*.idx' file to '/Users/me/test/.git/objects/pack/pack-427331d91391b00844273eeb3879cb479ce2c995.idx'
> error: Unable to find 6261a9d9f7704c02a5421ff733919ab18793aa7d under https://www.townlong-yak.com/test.git
> Cannot obtain needed object 6261a9d9f7704c02a5421ff733919ab18793aa7d
> error: fetch failed.

I can reproduce this problem.

> Running git-bisect identifies b1b8dfde6929ec9463eca0a858c4adb9786d7c93 as the first bad commit,
> suggesting that the .idx file downloaded from the remote is now expected to be byte-for-byte
> identical with a locally-generated version; due to format differences, they are not.

Cc:'ing the authors of that commit.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-19 23:22 Bug report: v2.47.0 cannot fetch version 1 pack indexes fox
  2024-10-20  0:37 ` Eric Sunshine
@ 2024-10-20  1:24 ` Jeff King
  2024-10-20  2:40   ` Jeff King
  2024-10-21 20:23   ` Bug report: v2.47.0 cannot fetch version 1 pack indexes Taylor Blau
  1 sibling, 2 replies; 31+ messages in thread
From: Jeff King @ 2024-10-20  1:24 UTC (permalink / raw)
  To: fox; +Cc: Taylor Blau, Eric Sunshine, git

On Sun, Oct 20, 2024 at 01:22:34AM +0200, fox wrote:

> Running git-bisect identifies b1b8dfde6929ec9463eca0a858c4adb9786d7c93
> as the first bad commit, suggesting that the .idx file downloaded from
> the remote is now expected to be byte-for-byte identical with a
> locally-generated version; due to format differences, they are not.
> 
> The remote idx is in the original (version 1) format, and git
> verify-pack seems satisfied with it.  Did v2.47.0 intend to block
> fetching such indices?

Interesting case. No, we did not intend to block fetching those indices.

We knew there was a potential problem with dumb http here, but the idea
was that if we had a local file "foo.idx" (or "foo.pack"), that we'd
never try to download the other side in the first place, preferring
ours.

And AFAICT that does hold. But here we have the opposite problem! We
download the pack idx from the remote first, so that we know what it
claims to have (which tells us whether we want to grab the matching
pack). And then we grab the matching pack, but instead of using the .idx
we downloaded, we index it again ourselves.

Which is probably a good idea to make sure that we prefer our
locally-checked copy versus what the other side has. But obviously there
is no guarantee that we'll come up with the same bytes as the other side
if there are version or configuration differences.

One option is to just discard the remote version before re-indexing,
like this:

diff --git a/http.c b/http.c
index d59e59f66b..6879d7ba1b 100644
--- a/http.c
+++ b/http.c
@@ -2500,6 +2500,7 @@ int finish_http_pack_request(struct http_pack_request *preq)
 	struct child_process ip = CHILD_PROCESS_INIT;
 	int tmpfile_fd;
 	int ret = 0;
+	size_t len;

 	fclose(preq->packfile);
 	preq->packfile = NULL;
@@ -2517,6 +2518,18 @@ int finish_http_pack_request(struct http_pack_request *preq)
 	else
 		ip.no_stdout = 1;

+	/*
+	 * We're about to reindex the packfile, so throw away the .idx
+	 * we downloaded from the remote in order to prefer ours.
+	 */
+	if (strip_suffix(preq->tmpfile.buf, ".pack.temp", &len)) {
+		struct strbuf idx = STRBUF_INIT;
+		strbuf_add(&idx, preq->tmpfile.buf, len);
+		strbuf_addstr(&idx, ".idx");
+		unlink(idx.buf);
+		strbuf_release(&idx);
+	}
+
 	if (run_command(&ip)) {
 		ret = -1;
 		goto cleanup;

There are a few things I don't like there, though:

  - obviously reversing the tempfile name back to the idx is hacky. We
    could probably store the original idx that led us to this pack and
    use that.

  - I don't _think_ there is any case where that .idx file is precious.
    We wouldn't be indexing the .pack file if we didn't just download
    it, and we wouldn't have downloaded it if we already had a
    .idx/.pack pair. OTOH the name we got from the other side isn't
    necessarily the same one we'll use locally; we're feeding the pack
    via "index-pack --stdin", so it will do its own hash to come up with
    the name. The other side could have used a different scheme, or even
    be trying to intentionally trick us.

So I feel like the root of the problem is that we moved the other side's
"pack-1234abcd.idx" into place at all! We do not plan to use it, but
only need to access it via our custom packed_git structs to see whether
we want to download the matching pack. And in fact I'd suspect there are
other possible bugs/trickery lurking, if the other side gives us a
broken .idx that does not match its pack (our odb would now have the
broken file with no matching pack).

So IMHO a cleaner fix is that we should keep the stuff we download from
the remote marked as temporary files, and then throw them away when we
complete the fetch (rather than just expecting index-pack to overwrite
them).

-Peff

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-20  1:24 ` Jeff King
@ 2024-10-20  2:40   ` Jeff King
  2024-10-21 20:33     ` Taylor Blau
  2024-10-21 20:23   ` Bug report: v2.47.0 cannot fetch version 1 pack indexes Taylor Blau
  1 sibling, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-10-20  2:40 UTC (permalink / raw)
  To: fox; +Cc: Taylor Blau, Eric Sunshine, git

On Sat, Oct 19, 2024 at 09:24:55PM -0400, Jeff King wrote:

> So I feel like the root of the problem is that we moved the other side's
> "pack-1234abcd.idx" into place at all! We do not plan to use it, but
> only need to access it via our custom packed_git structs to see whether
> we want to download the matching pack. And in fact I'd suspect there are
> other possible bugs/trickery lurking, if the other side gives us a
> broken .idx that does not match its pack (our odb would now have the
> broken file with no matching pack).
> 
> So IMHO a cleaner fix is that we should keep the stuff we download from
> the remote marked as temporary files, and then throw them away when we
> complete the fetch (rather than just expecting index-pack to overwrite
> them).

And here's what that might look like. It stores the temporary index
outside of the pack/ directory, and then we don't "finalize" it into
place. We have to teach parse_pack to retain the original name, but
that's OK. It's used only by the http walker anyway.

diff --git a/http.c b/http.c
index d59e59f66b..6df032e40f 100644
--- a/http.c
+++ b/http.c
@@ -19,6 +19,7 @@
 #include "string-list.h"
 #include "object-file.h"
 #include "object-store-ll.h"
+#include "tempfile.h"
 
 static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
 static int trace_curl_data = 1;
@@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
 	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
 	url = strbuf_detach(&buf, NULL);
 
-	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
-	tmp = strbuf_detach(&buf, NULL);
+	/*
+	 * Don't put this into packs/, since it's just temporary and we don't
+	 * want to confuse it with our local .idx files.  We'll generate our
+	 * own index if we choose to download the matching packfile.
+	 *
+	 * It's tempting to use xmks_tempfile() here, but it's important that
+	 * the file not exist, otherwise http_get_file() complains. So we
+	 * create a filename that should be unique, and then just register it
+	 * as a tempfile so that it will get cleaned up on exit.
+	 *
+	 * Arguably it would be better to hold on to the tempfile handle so
+	 * that we can remove it as soon as we download the pack and generate
+	 * the real index, but that might need more surgery.
+	 */
+	tmp = xstrfmt("%s/tmp_pack_%s.idx",
+		      repo_get_object_directory(the_repository),
+		      hash_to_hex(hash));
+	register_tempfile(tmp);
 
 	if (http_get_file(url, tmp, NULL) != HTTP_OK) {
 		error("Unable to get pack index %s", url);
@@ -2427,10 +2444,8 @@ static int fetch_and_setup_pack_index(struct packed_git **packs_head,
 	}
 
 	ret = verify_pack_index(new_pack);
-	if (!ret) {
+	if (!ret)
 		close_pack_index(new_pack);
-		ret = finalize_object_file(tmp_idx, sha1_pack_index_name(sha1));
-	}
 	free(tmp_idx);
 	if (ret)
 		return -1;
diff --git a/packfile.c b/packfile.c
index df4ba67719..16d3bcf7f7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -237,13 +237,22 @@ static struct packed_git *alloc_packed_git(int extra)
 	return p;
 }
 
+static char *pack_path_from_idx(const char *idx_path)
+{
+	size_t len;
+	if (!strip_suffix(idx_path, ".idx", &len))
+		BUG("idx path does not end in .idx: %s", idx_path);
+	return xstrfmt("%.*s.pack", (int)len, idx_path);
+}
+
 struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path)
 {
-	const char *path = sha1_pack_name(sha1);
+	char *path = pack_path_from_idx(idx_path);
 	size_t alloc = st_add(strlen(path), 1);
 	struct packed_git *p = alloc_packed_git(alloc);
 
 	memcpy(p->pack_name, path, alloc); /* includes NUL */
+	free(path);
 	hashcpy(p->hash, sha1, the_repository->hash_algo);
 	if (check_packed_git_idx(idx_path, p)) {
 		free(p);

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-20  0:37 ` Eric Sunshine
@ 2024-10-21 20:06   ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-21 20:06 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: fox, git, Jeff King, Taylor Blau, Elijah Newren

On Sat, Oct 19, 2024 at 08:37:08PM -0400, Eric Sunshine wrote:
> On Sat, Oct 19, 2024 at 7:31 PM fox <fox.gbr@townlong-yak.com> wrote:
> > What did you do before the bug happened? (Steps to reproduce your issue)
> > 1. Run git clone https://www.townlong-yak.com/test.git
> > Cloning into 'test'...
> > error: files '/Users/me/test/.git/objects/pack/tmp_idx_WT81vv' and '/Users/me/test/.git/objects/pack/pack-427331d91391b00844273eeb3879cb479ce2c995.idx' differ in contents
> > fatal: unable to rename temporary '*.idx' file to '/Users/me/test/.git/objects/pack/pack-427331d91391b00844273eeb3879cb479ce2c995.idx'
> > error: Unable to find 6261a9d9f7704c02a5421ff733919ab18793aa7d under https://www.townlong-yak.com/test.git
> > Cannot obtain needed object 6261a9d9f7704c02a5421ff733919ab18793aa7d
> > error: fetch failed.
>
> I can reproduce this problem.

Thanks for reproducing. This is definitely unintentional breakage, which
I see that Peff has already diagnosed and proposed a fix for. Let's read
on ...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-20  1:24 ` Jeff King
  2024-10-20  2:40   ` Jeff King
@ 2024-10-21 20:23   ` Taylor Blau
  2024-10-22  5:00     ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Taylor Blau @ 2024-10-21 20:23 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Sat, Oct 19, 2024 at 09:24:55PM -0400, Jeff King wrote:
> There are a few things I don't like there, though:
>
>   - obviously reversing the tempfile name back to the idx is hacky. We
>     could probably store the original idx that led us to this pack and
>     use that.

TBH, I don't think that this is all that hacky, but...

>   - I don't _think_ there is any case where that .idx file is precious.
>     We wouldn't be indexing the .pack file if we didn't just download
>     it, and we wouldn't have downloaded it if we already had a
>     .idx/.pack pair. OTOH the name we got from the other side isn't
>     necessarily the same one we'll use locally; we're feeding the pack
>     via "index-pack --stdin", so it will do its own hash to come up with
>     the name. The other side could have used a different scheme, or even
>     be trying to intentionally trick us.
>
> So I feel like the root of the problem is that we moved the other side's
> "pack-1234abcd.idx" into place at all! We do not plan to use it, but
> only need to access it via our custom packed_git structs to see whether
> we want to download the matching pack. And in fact I'd suspect there are
> other possible bugs/trickery lurking, if the other side gives us a
> broken .idx that does not match its pack (our odb would now have the
> broken file with no matching pack).
>
> So IMHO a cleaner fix is that we should keep the stuff we download from
> the remote marked as temporary files, and then throw them away when we
> complete the fetch (rather than just expecting index-pack to overwrite
> them).

...this seems like a much better idea to me. We shouldn't be keeping the
provided *.idx file given that we plan on overwriting it later on in the
process.

This feels like more fallout similar to the one described by c177d3dc50
(pack-objects: use finalize_object_file() to rename pack/idx/etc,
2024-09-26), in particularly the paragraph beginning with "This has some
test and real-world fallout ..." and the one immediately below it.

So I think that your "cleaner fix" is the way to go.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-20  2:40   ` Jeff King
@ 2024-10-21 20:33     ` Taylor Blau
  2024-10-22  5:14       ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Taylor Blau @ 2024-10-21 20:33 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Sat, Oct 19, 2024 at 10:40:22PM -0400, Jeff King wrote:
> diff --git a/http.c b/http.c
> index d59e59f66b..6df032e40f 100644
> --- a/http.c
> +++ b/http.c
> @@ -19,6 +19,7 @@
>  #include "string-list.h"
>  #include "object-file.h"
>  #include "object-store-ll.h"
> +#include "tempfile.h"
>
>  static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
>  static int trace_curl_data = 1;
> @@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
>  	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
>  	url = strbuf_detach(&buf, NULL);
>
> -	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
> -	tmp = strbuf_detach(&buf, NULL);
> +	/*
> +	 * Don't put this into packs/, since it's just temporary and we don't
> +	 * want to confuse it with our local .idx files.  We'll generate our
> +	 * own index if we choose to download the matching packfile.
> +	 *
> +	 * It's tempting to use xmks_tempfile() here, but it's important that
> +	 * the file not exist, otherwise http_get_file() complains. So we
> +	 * create a filename that should be unique, and then just register it
> +	 * as a tempfile so that it will get cleaned up on exit.
> +	 *
> +	 * Arguably it would be better to hold on to the tempfile handle so
> +	 * that we can remove it as soon as we download the pack and generate
> +	 * the real index, but that might need more surgery.
> +	 */
> +	tmp = xstrfmt("%s/tmp_pack_%s.idx",
> +		      repo_get_object_directory(the_repository),
> +		      hash_to_hex(hash));
> +	register_tempfile(tmp);

Makes perfect sense, and the comment above here is much appreciated.

I thought about trying to use some intermediate state of the strbuf here
to avoid an extra xstrfmt() call, but couldn't come up with anything I
didn't think was awkward.

>  	if (http_get_file(url, tmp, NULL) != HTTP_OK) {
>  		error("Unable to get pack index %s", url);
> @@ -2427,10 +2444,8 @@ static int fetch_and_setup_pack_index(struct packed_git **packs_head,
>  	}
>
>  	ret = verify_pack_index(new_pack);
> -	if (!ret) {
> +	if (!ret)
>  		close_pack_index(new_pack);
> -		ret = finalize_object_file(tmp_idx, sha1_pack_index_name(sha1));

And here's where we stop calling finalize_object_file(). Good.

> -	}
>  	free(tmp_idx);
>  	if (ret)
>  		return -1;
> diff --git a/packfile.c b/packfile.c
> index df4ba67719..16d3bcf7f7 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -237,13 +237,22 @@ static struct packed_git *alloc_packed_git(int extra)
>  	return p;
>  }
>
> +static char *pack_path_from_idx(const char *idx_path)
> +{
> +	size_t len;
> +	if (!strip_suffix(idx_path, ".idx", &len))
> +		BUG("idx path does not end in .idx: %s", idx_path);
> +	return xstrfmt("%.*s.pack", (int)len, idx_path);
> +}
> +
>  struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path)
>  {
> -	const char *path = sha1_pack_name(sha1);
> +	char *path = pack_path_from_idx(idx_path);

Huh. I would have thought we have such a helper function already. I
guess we probably do, but that it's also defined statically because it's
so easy to write.

In any case, this looks like the right thing to do here. It would be
nice to have a corresponding test here, since unlike the other
finalize_object_file() changes, this one can be provoked
deterministically.

Would you mind submitting this as a bona-fide patch, which I can then
pick up and start merging down?

In any event, I appreciate you figuring out what was going on here and
outlining a couple of paths forward.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-21 20:23   ` Bug report: v2.47.0 cannot fetch version 1 pack indexes Taylor Blau
@ 2024-10-22  5:00     ` Jeff King
  2024-10-22 15:50       ` Taylor Blau
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-10-22  5:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

On Mon, Oct 21, 2024 at 04:23:57PM -0400, Taylor Blau wrote:

> On Sat, Oct 19, 2024 at 09:24:55PM -0400, Jeff King wrote:
> > There are a few things I don't like there, though:
> >
> >   - obviously reversing the tempfile name back to the idx is hacky. We
> >     could probably store the original idx that led us to this pack and
> >     use that.
> 
> TBH, I don't think that this is all that hacky, but...

Mostly I was worried that there was some case where the filename would
not be the usual pack-$hash.idx that we expected. But having now looked
at the code quite a bit to produce the follow-on patch, I think these
names are always generated locally from sha1_pack_name(), etc.

Another variant of this approach is to actually teach index-pack a flag
to say "it's OK to overwrite an existing .idx file, even if the bytes
don't match". That's the simplest fix _conceptually_, in that it takes
us back to the pre-2.47 behavior. But plumbing through that option would
require a lot of ugly and confusing boilerplate code.

> > So IMHO a cleaner fix is that we should keep the stuff we download from
> > the remote marked as temporary files, and then throw them away when we
> > complete the fetch (rather than just expecting index-pack to overwrite
> > them).
> 
> ...this seems like a much better idea to me. We shouldn't be keeping the
> provided *.idx file given that we plan on overwriting it later on in the
> process.

I dug on this a little more, and...I don't love the design of the
dumb-http protocol.

A big question here is: what should the life cycle of those downloaded
.idx files be? And how much should we try to avoid re-downloading them
(keeping in mind that the .idx for linux.git is almost 300MB)?

In a normal dumb-http clone, something like this happens with the
current code:

  1. We see we need some object X.

  2. We download _all_ of the .idx files offered by the remote, since we
     have no idea in which one we'll find X. (There's an obvious
     low-hanging optimization to stop when we find X, but in the worst
     case we need all of them anyway).

  3. For each one, we download the matching pack if needed. In most
     cases you need all of them, but there might be some pack P for
     which we only have the .idx. We leave that bare .idx sitting in the
     objects directory.

So now we have all of the other side's .idx files locally, but not
necessarily all of their .packs. And here are some interesting follow-on
cases:

  a. You do a follow-up fetch, and now you need object Y. The saved idx
     of P is useful, because you can avoid downloading it again. Yay.
     You likewise should avoid re-downloading any for which you did get
     the matching pack, because you'll have pack-$hash.idx locally (just
     your generated copy, not the one you got from the other side).

  b. The other side repacks, and then you fetch. Now all their pack
     names are changed. So your saved idx of P is useless. But also you
     end up downloading all of the new .idx files anyway (even though
     you probably already have 99% of the objects).

  c. You repack, then fetch. Now the other side has the same packs, but
     your local copies all have new names. You re-download every idx
     from the other side, not realizing that you already saw them last
     time.

With my patch, case (a) doesn't save P (because now we treat it like a
tempfile), so we re-download it unnecessarily. That's a little worse.
I think the other two cases are largely the same.

But there's a flip side here. What if we kept the .idx files we got from
the other side in objects/pack/remote-idx/, basically as a cache? Now
case (c) avoids a lot of pointless downloading, though it comes at the
cost of storing those extra .idx files (which, before we repack, are
basically duplicates of the .idx files we generated ourselves; if we
really wanted to get clever we could probably hard-link or something).

We didn't improve (b). And in fact, we've created a new issue: we're
still holding on to the old cached .idx files. We'd have to have some
scheme to get rid of them. I suppose that happens when we fetch the
other side's objects/info/packs file and see that they're no longer
mentioned (though if you want to be pedantic, this really should be a
per-remote cache, since in theory two remotes could hold the same pack).

So...what does it all mean for this fix?

I think the patch I posted earlier, to keep the .idx files as separate
tempfiles, is only slightly worse than the status quo (it doesn't keep
P.idx that we didn't need). And given my general feelings towards the
dumb-http protocol, maybe that is the right place to stop.

It just feels like we _could_ be improving things with a better managed
.idx cache system. And that would fix the regression at the same time.
But I'm not sure it's worth sinking time and complexity into this
terribly inefficient protocol.

(Part of why I laid all this out is that until wrote all of the above, I
wasn't completely convinced that case (a) was the only one that
suffered, and that it was so mild).

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-21 20:33     ` Taylor Blau
@ 2024-10-22  5:14       ` Jeff King
  2024-10-22 15:18         ` Taylor Blau
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
  0 siblings, 2 replies; 31+ messages in thread
From: Jeff King @ 2024-10-22  5:14 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

On Mon, Oct 21, 2024 at 04:33:15PM -0400, Taylor Blau wrote:

> > @@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
> >  	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
> >  	url = strbuf_detach(&buf, NULL);
> >
> > -	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
> > -	tmp = strbuf_detach(&buf, NULL);
> > +	/*
> > +	 * Don't put this into packs/, since it's just temporary and we don't
> > +	 * want to confuse it with our local .idx files.  We'll generate our
> > +	 * own index if we choose to download the matching packfile.
> > +	 *
> > +	 * It's tempting to use xmks_tempfile() here, but it's important that
> > +	 * the file not exist, otherwise http_get_file() complains. So we
> > +	 * create a filename that should be unique, and then just register it
> > +	 * as a tempfile so that it will get cleaned up on exit.
> > +	 *
> > +	 * Arguably it would be better to hold on to the tempfile handle so
> > +	 * that we can remove it as soon as we download the pack and generate
> > +	 * the real index, but that might need more surgery.
> > +	 */
> > +	tmp = xstrfmt("%s/tmp_pack_%s.idx",
> > +		      repo_get_object_directory(the_repository),
> > +		      hash_to_hex(hash));
> > +	register_tempfile(tmp);
> 
> Makes perfect sense, and the comment above here is much appreciated.
> 
> I thought about trying to use some intermediate state of the strbuf here
> to avoid an extra xstrfmt() call, but couldn't come up with anything I
> didn't think was awkward.

I don't think there's any useful intermediate state. The earlier %s is
the base url, but here it's our local directory.

We could continue to re-use the scratch strbuf as the existing code did
(and which xstrfmt() is doing under the hood). It wasn't really
intentional for me to change that, but I went through a lot of attempts
to get here (using mks_tempfile(), and so on).

> > +static char *pack_path_from_idx(const char *idx_path)
> > +{
> > +	size_t len;
> > +	if (!strip_suffix(idx_path, ".idx", &len))
> > +		BUG("idx path does not end in .idx: %s", idx_path);
> > +	return xstrfmt("%.*s.pack", (int)len, idx_path);
> > +}
> > +
> >  struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path)
> >  {
> > -	const char *path = sha1_pack_name(sha1);
> > +	char *path = pack_path_from_idx(idx_path);
> 
> Huh. I would have thought we have such a helper function already. I
> guess we probably do, but that it's also defined statically because it's
> so easy to write.

I thought so, too, but couldn't find one. We have pack_bitmap_filename()
(and so on for .rev and .midx files) that goes from .pack to those
extensions. But here we want to go from .idx to .pack. I think most
stuff goes from ".pack" because that's what we store in the packed_git
struct.

There's also sha1_pack_index_name(), but that goes from a csum-file hash
to a filename.

I grepped around and strip_suffix() seems to be par for the course in
similar situations within pack/repack code, so I think it's OK here.

> In any case, this looks like the right thing to do here. It would be
> nice to have a corresponding test here, since unlike the other
> finalize_object_file() changes, this one can be provoked
> deterministically.
> 
> Would you mind submitting this as a bona-fide patch, which I can then
> pick up and start merging down?

Yeah, the test is easy:

diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index 58189c9f7d..50a7b98813 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -507,4 +507,14 @@ test_expect_success 'fetching via http alternates works' '
 	git -c http.followredirects=true clone "$HTTPD_URL/dumb/alt-child.git"
 '
 
+test_expect_success 'dumb http can fetch index v1' '
+	server=$HTTPD_DOCUMENT_ROOT_PATH/idx-v1.git &&
+	git init --bare "$server" &&
+	git -C "$server" --work-tree=. commit --allow-empty -m foo &&
+	git -C "$server" -c pack.indexVersion=1 gc &&
+
+	git clone "$HTTPD_URL/dumb/idx-v1.git" &&
+	git -C idx-v1 fsck
+'
+
 test_done

I raised some other more philosophical issues in the other part of the
thread, but assuming the answer is "no, let's do the simplest thing",
then I think this approach is OK.

I'd also like to see if I can clean things up around parse_pack_index(),
whose semantics I'm changing here (and which violates all manner of
assumptions that we usually have about packed_git structs). It's used
only by the dumb-http code, and I think we want to refactor it a bit so
that nobody else is tempted to use it.

I'll try to send something out tonight or tomorrow.

-Peff

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-22  5:14       ` Jeff King
@ 2024-10-22 15:18         ` Taylor Blau
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-22 15:18 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Tue, Oct 22, 2024 at 01:14:02AM -0400, Jeff King wrote:
> > > +static char *pack_path_from_idx(const char *idx_path)
> > > +{
> > > +	size_t len;
> > > +	if (!strip_suffix(idx_path, ".idx", &len))
> > > +		BUG("idx path does not end in .idx: %s", idx_path);
> > > +	return xstrfmt("%.*s.pack", (int)len, idx_path);
> > > +}
> > > +
> > >  struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path)
> > >  {
> > > -	const char *path = sha1_pack_name(sha1);
> > > +	char *path = pack_path_from_idx(idx_path);
> >
> > Huh. I would have thought we have such a helper function already. I
> > guess we probably do, but that it's also defined statically because it's
> > so easy to write.
>
> I thought so, too, but couldn't find one. We have pack_bitmap_filename()
> (and so on for .rev and .midx files) that goes from .pack to those
> extensions. But here we want to go from .idx to .pack. I think most
> stuff goes from ".pack" because that's what we store in the packed_git
> struct.
>
> There's also sha1_pack_index_name(), but that goes from a csum-file hash
> to a filename.
>
> I grepped around and strip_suffix() seems to be par for the course in
> similar situations within pack/repack code, so I think it's OK here.

It would kind of be nice to have a convenient function like:

    const char *pack_ext_name(const struct packed_git *p,
                              const char *ext);

in our codebase, but that is well outside the scope of this patch ;-).

> > In any case, this looks like the right thing to do here. It would be
> > nice to have a corresponding test here, since unlike the other
> > finalize_object_file() changes, this one can be provoked
> > deterministically.
> >
> > Would you mind submitting this as a bona-fide patch, which I can then
> > pick up and start merging down?
>
> Yeah, the test is easy:
>
> diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
> index 58189c9f7d..50a7b98813 100755
> --- a/t/t5550-http-fetch-dumb.sh
> +++ b/t/t5550-http-fetch-dumb.sh
> @@ -507,4 +507,14 @@ test_expect_success 'fetching via http alternates works' '
>  	git -c http.followredirects=true clone "$HTTPD_URL/dumb/alt-child.git"
>  '
>
> +test_expect_success 'dumb http can fetch index v1' '
> +	server=$HTTPD_DOCUMENT_ROOT_PATH/idx-v1.git &&
> +	git init --bare "$server" &&
> +	git -C "$server" --work-tree=. commit --allow-empty -m foo &&
> +	git -C "$server" -c pack.indexVersion=1 gc &&
> +
> +	git clone "$HTTPD_URL/dumb/idx-v1.git" &&
> +	git -C idx-v1 fsck
> +'
> +
>  test_done

Beautifully written. I wondered why you didn't use `test_commit -C
$server`, but it's because that repository is bare, and so needs
`--work-tree=.`, so what you wrote here makes sense.

> I raised some other more philosophical issues in the other part of the
> thread, but assuming the answer is "no, let's do the simplest thing",
> then I think this approach is OK.
>
> I'd also like to see if I can clean things up around parse_pack_index(),
> whose semantics I'm changing here (and which violates all manner of
> assumptions that we usually have about packed_git structs). It's used
> only by the dumb-http code, and I think we want to refactor it a bit so
> that nobody else is tempted to use it.
>
> I'll try to send something out tonight or tomorrow.

Sounds all good. Thanks!

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Bug report: v2.47.0 cannot fetch version 1 pack indexes
  2024-10-22  5:00     ` Jeff King
@ 2024-10-22 15:50       ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-22 15:50 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Tue, Oct 22, 2024 at 01:00:40AM -0400, Jeff King wrote:
> So...what does it all mean for this fix?
>
> I think the patch I posted earlier, to keep the .idx files as separate
> tempfiles, is only slightly worse than the status quo (it doesn't keep
> P.idx that we didn't need). And given my general feelings towards the
> dumb-http protocol, maybe that is the right place to stop.
>
> It just feels like we _could_ be improving things with a better managed
> .idx cache system. And that would fix the regression at the same time.
> But I'm not sure it's worth sinking time and complexity into this
> terribly inefficient protocol.
>
> (Part of why I laid all this out is that until wrote all of the above, I
> wasn't completely convinced that case (a) was the only one that
> suffered, and that it was so mild).

All very well analyzed. I share your feelings about the dumb-HTTP
protocol. It would be nice to fix, but I have a feeling that our
collective time is probably better spent pursuing a reasonable band-aid
and then focusing our attention elsewhere ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 0/11] dumb-http pack index v1 regression + cleanups
  2024-10-22  5:14       ` Jeff King
  2024-10-22 15:18         ` Taylor Blau
@ 2024-10-25  6:41         ` Jeff King
  2024-10-25  6:43           ` [PATCH 01/11] midx: avoid duplicate packed_git entries Jeff King
                             ` (11 more replies)
  1 sibling, 12 replies; 31+ messages in thread
From: Jeff King @ 2024-10-25  6:41 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

On Tue, Oct 22, 2024 at 01:14:02AM -0400, Jeff King wrote:

> I'd also like to see if I can clean things up around parse_pack_index(),
> whose semantics I'm changing here (and which violates all manner of
> assumptions that we usually have about packed_git structs). It's used
> only by the dumb-http code, and I think we want to refactor it a bit so
> that nobody else is tempted to use it.
> 
> I'll try to send something out tonight or tomorrow.

This took me a little longer than I expected, mostly due to triggering
an (existing) bug related to midx. So here's the series I came up with.

The first patch fixes the midx bug, which otherwise causes spurious test
failures in patch 3.

The second patch beefs up the existing tests for the area we're
touching.

Patch 3 is the actual regression fix. I apologize in advance for the
length of the commit message. ;) I tried to summarize all of the weird
dumb-http implications I discovered.

If we wanted to we could stop there for maint, and treat the rest as a
separate series. But since it's cleanup with no intended behavior
change, it's pretty low-risk.

Patches 4-10 are mostly about cleaning up crufty interfaces that I ran
across while working on the earlier patches.

And then the final one is an oddity where I think the code is doing the
wrong thing, but it's not possible to actually trigger the bug in
practice.

  [01/11]: midx: avoid duplicate packed_git entries
  [02/11]: t5550: count fetches in "previously-fetched .idx" test
  [03/11]: dumb-http: store downloaded pack idx as tempfile
  [04/11]: packfile: drop has_pack_index()
  [05/11]: packfile: drop sha1_pack_name()
  [06/11]: packfile: drop sha1_pack_index_name()
  [07/11]: packfile: warn people away from parse_packed_git()
  [08/11]: http-walker: use object_id instead of bare hash
  [09/11]: packfile: convert find_sha1_pack() to use object_id
  [10/11]: packfile: use object_id in find_pack_entry_one()
  [11/11]: packfile: use oidread() instead of hashcpy() to fill object_id

 builtin/fast-import.c         |  6 ++---
 builtin/pack-objects.c        |  4 +--
 builtin/pack-redundant.c      |  5 +++-
 connected.c                   |  4 +--
 http-push.c                   |  4 +--
 http-walker.c                 | 25 ++++++++---------
 http.c                        | 43 ++++++++++++++++++++---------
 midx.c                        | 22 ++++++++++++---
 pack-bitmap.c                 |  4 +--
 packfile.c                    | 51 ++++++++++++++---------------------
 packfile.h                    | 40 +++++++++++++--------------
 t/helper/test-find-pack.c     |  2 +-
 t/t5200-update-server-info.sh |  8 ++++++
 t/t5550-http-fetch-dumb.sh    | 28 +++++++++++++++++--
 walker.c                      |  4 +--
 walker.h                      |  4 +--
 16 files changed, 153 insertions(+), 101 deletions(-)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 01/11] midx: avoid duplicate packed_git entries
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
@ 2024-10-25  6:43           ` Jeff King
  2024-10-25 21:09             ` Taylor Blau
  2024-10-25  6:44           ` [PATCH 02/11] t5550: count fetches in "previously-fetched .idx" test Jeff King
                             ` (10 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-10-25  6:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Derrick Stolee, fox, Eric Sunshine, git

When we scan a pack directory to load the idx entries we find into the
packed_git list, we skip any of them that are contained in a midx. We
then load them later lazily if we actually need to access the
corresponding pack, referencing them both from the midx struct and the
packed_git list.

The lazy-load in the midx code checks to see if the midx already
mentions the pack, but doesn't otherwise check the packed_git list. This
makes sense, since we should have added any pack to both lists.

But there's a loophole! If we call close_object_store(), that frees the
midx entirely, but _not_ the packed_git structs, which we must keep
around for Reasons[1]. If we then try to look up more objects, we'll
auto-load the midx again, which won't realize that we've already loaded
those packs, and will create duplicate entries in the packed_git list.

This is possibly inefficient, because it means we may open and map the
pack redundantly. But it can also lead to weird user-visible behavior.
The case I found is in "git repack", which closes and reopens the midx
after repacking and then calls update_server_info(). We end up writing
the duplicate entries into objects/info/packs.

We could obviously de-dup them while writing that file, but it seems
like a violation of more core assumptions that we end up with these
duplicate structs at all.

We can avoid the duplicates reasonably efficiently by checking their
names in the pack_map hash. This annoyingly does require a little more
than a straight hash lookup due to the naming conventions, but it should
only happen when we are about to actually open a pack. I don't think one
extra malloc will be noticeable there.

[1] I'm not entirely sure of all the details, except that we generally
    assume the packed_git structs never go away. We noted this
    restriction in the comment added by 6f1e9394e2 (object: fix leaking
    packfiles when closing object store, 2024-08-08), but it's somewhat
    vague. At any rate, if you try freeing the structs in
    close_object_store(), you can observe segfaults all over the test
    suite. So it might be fixable, but it's not trivial.

Signed-off-by: Jeff King <peff@peff.net>
---
+cc Stolee here as the original midx author. I can't think of a good
reason we'd want to avoid this dup-detection here.

 midx.c                        | 20 +++++++++++++++++---
 t/t5200-update-server-info.sh |  8 ++++++++
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 67e0d64004..479812cb9b 100644
--- a/midx.c
+++ b/midx.c
@@ -445,6 +445,7 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
 		      uint32_t pack_int_id)
 {
 	struct strbuf pack_name = STRBUF_INIT;
+	struct strbuf key = STRBUF_INIT;
 	struct packed_git *p;

 	pack_int_id = midx_for_pack(&m, pack_int_id);
@@ -455,16 +456,29 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
 	strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir,
 		    m->pack_names[pack_int_id]);

-	p = add_packed_git(pack_name.buf, pack_name.len, m->local);
+	/* pack_map holds the ".pack" name, but we have the .idx */
+	strbuf_addbuf(&key, &pack_name);
+	strbuf_strip_suffix(&key, ".idx");
+	strbuf_addstr(&key, ".pack");
+	p = hashmap_get_entry_from_hash(&r->objects->pack_map,
+					strhash(key.buf), key.buf,
+					struct packed_git, packmap_ent);
+	if (!p) {
+		p = add_packed_git(pack_name.buf, pack_name.len, m->local);
+		if (p) {
+			install_packed_git(r, p);
+			list_add_tail(&p->mru, &r->objects->packed_git_mru);
+		}
+	}
+
 	strbuf_release(&pack_name);
+	strbuf_release(&key);

 	if (!p)
 		return 1;

 	p->multi_pack_index = 1;
 	m->packs[pack_int_id] = p;
-	install_packed_git(r, p);
-	list_add_tail(&p->mru, &r->objects->packed_git_mru);

 	return 0;
 }
diff --git a/t/t5200-update-server-info.sh b/t/t5200-update-server-info.sh
index ed9dfd624c..cc51c73986 100755
--- a/t/t5200-update-server-info.sh
+++ b/t/t5200-update-server-info.sh
@@ -39,4 +39,12 @@ test_expect_success 'info/refs updates when changes are made' '
 	! test_cmp a b
 '

+test_expect_success 'midx does not create duplicate pack entries' '
+	git repack -d --write-midx &&
+	git repack -d &&
+	grep ^P .git/objects/info/packs >packs &&
+	uniq -d <packs >dups &&
+	test_must_be_empty dups
+'
+
 test_done
-- 
2.47.0.363.g6e72b256be

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 02/11] t5550: count fetches in "previously-fetched .idx" test
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
  2024-10-25  6:43           ` [PATCH 01/11] midx: avoid duplicate packed_git entries Jeff King
@ 2024-10-25  6:44           ` Jeff King
  2024-10-25  6:58           ` [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile Jeff King
                             ` (9 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-10-25  6:44 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

We have a test in t5550 that looks at index fetching over dumb http. It
creates two branches, each of which is completely stored in its own
pack, then fetches the branches independently. What should (and does)
happen is that the first fetch grabs both .idx files and one .pack file,
and then the fetch of the second branch re-uses the previously
downloaded .idx files (fetching none) and grabs the now-required .pack
file.

Since the next few patches will be touching this area of the code, let's
beef up the test a little by checking that we're downloading the
expected items at each step.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5550-http-fetch-dumb.sh | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index 58189c9f7d..3968b82260 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -307,6 +307,14 @@ test_expect_success 'fetch notices corrupt idx' '
 	)
 '
 
+# usage: count_fetches <nr> <extension> <trace_file>
+count_fetches () {
+	# ignore grep exit code; it may return non-zero if we are expecting no
+	# matches
+	grep "GET .*objects/pack/pack-[a-z0-9]*.$2" "$3" >trace.count
+	test_line_count = "$1" trace.count
+}
+
 test_expect_success 'fetch can handle previously-fetched .idx files' '
 	git checkout --orphan branch1 &&
 	echo base >file &&
@@ -321,8 +329,14 @@ test_expect_success 'fetch can handle previously-fetched .idx files' '
 	git push "$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git branch2 &&
 	git --git-dir="$HTTPD_DOCUMENT_ROOT_PATH"/repo_packed_branches.git repack -d &&
 	git --bare init clone_packed_branches.git &&
-	git --git-dir=clone_packed_branches.git fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch1:branch1 &&
-	git --git-dir=clone_packed_branches.git fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2
+	GIT_TRACE_CURL=$PWD/one.trace git --git-dir=clone_packed_branches.git \
+		fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch1:branch1 &&
+	count_fetches 2 idx one.trace &&
+	count_fetches 1 pack one.trace &&
+	GIT_TRACE_CURL=$PWD/two.trace git --git-dir=clone_packed_branches.git \
+		fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2 &&
+	count_fetches 0 idx two.trace &&
+	count_fetches 1 pack two.trace
 '
 
 test_expect_success 'did not use upload-pack service' '
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
  2024-10-25  6:43           ` [PATCH 01/11] midx: avoid duplicate packed_git entries Jeff King
  2024-10-25  6:44           ` [PATCH 02/11] t5550: count fetches in "previously-fetched .idx" test Jeff King
@ 2024-10-25  6:58           ` Jeff King
  2024-10-25 21:18             ` Taylor Blau
  2024-10-25  7:00           ` [PATCH 04/11] packfile: drop has_pack_index() Jeff King
                             ` (8 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-10-25  6:58 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

This patch fixes a regression in b1b8dfde69 (finalize_object_file():
implement collision check, 2024-09-26) where fetching a v1 pack idx file
over the dumb-http protocol would cause the fetch to fail.

The core of the issue is that dumb-http stores the idx we fetch from the
remote at the same path that will eventually hold the idx we generate
from "index-pack --stdin". The sequence is something like this:

  0. We realize we need some object X, which we don't have locally, and
     nor does the other side have it as a loose object.

  1. We download the list of remote packs from objects/info/packs.

  2. For each entry in that file, we download each pack index and store
     it locally in .git/objects/pack/pack-$hash.idx (the $hash is not
     something we can verify yet and is given to us by the remote).

  3. We check each pack index we got to see if it has object X. When we
     find a match, we download the matching .pack file from the remote
     to a tempfile. We feed that to "index-pack --stdin", which
     reindexes the pack, rather than trusting that it has what the other
     side claims it does. In most cases, this will end up generating the
     exact same (byte-for-byte) pack index which we'll store at the same
     pack-$hash.idx path, because the index generation and $hash id are
     computed based on what's in the packfile. But:

       a. The other side might have used other options to generate the
          index. For instance we use index v2 by default, but long ago
          it was v1 (and you can still ask for v1 explicitly).

       b. The other side might even use a different mechanism to
          determine $hash. E.g., long ago it was based on the sorted
          list of objects in the packfile, but we switched to using the
          pack checksum in 1190a1acf8 (pack-objects: name pack files
          after trailer hash, 2013-12-05).

The regression we saw in the real world was (3a). A recent client
fetching from a server with a v1 index downloaded that index, then
complained about trying to overwrite it with its own v2 index. This
collision is otherwise harmless; we know we want to replace the remote
version with our local one, but the collision check doesn't realize
that.

There are a few options to fix it:

  - we could teach index-pack a command-line option to ignore only pack
    idx collisions, and use it when the dumb-http code invokes
    index-pack. This would be an awkward thing to expose users to and
    would involve a lot of boilerplate to get the option down to the
    collision code.

  - we could delete the remote .idx file right before running
    index-pack. It should be redundant at that point (since we've just
    downloaded the matching pack). But it feels risky to delete
    something from our own .git/objects based on what the other side has
    said. I'm not entirely positive that a malicious server couldn't lie
    about which pack-$hash.idx it has and get us to delete something
    precious.

  - we can stop co-mingling the downloaded idx files in our local
    objects directory. This is a slightly bigger change but I think
    fixes the root of the problem more directly.

This patch implements the third option. The big design questions are:
where do we store the downloaded files, and how do we manage their
lifetimes?

There are some additional quirks to the dumb-http system we should
consider. Remember that in step 2 we downloaded every pack index, but in
step 3 we may only download some of the matching packs. What happens to
those other idx files now? They sit in the .git/objects/pack directory,
possibly waiting to be used at a later date. That may save bandwidth for
a subsequent fetch, but it also creates a lot of weird corner cases:

  - our local object directory now has semi-untrusted .idx files sitting
    around, without their matching .pack

  - in case 3b, we noted that we might not generate the same hash as the
    other side. In that case even if we download the matching pack,
    our index-pack invocation will store it in a different
    pack-$hash.idx file. And the unmatched .idx will sit there forever.

  - if the server repacks, it may delete the old packs. Now we have
    these orphaned .idx files sitting around locally that will never be
    used (nor deleted).

  - if we repack locally we may delete our local version of the server's
    pack index and not realize we have it. So we'll download it again,
    even though we have all of the objects it mentions.

I think the right solution here is probably some more complex cache
management system: download the remote .idx files to their own storage
directory, mark them as "seen" when we get their matching pack (to avoid
re-downloading even if we repack), and then delete them when the
server's objects/info/refs no longer mentions them.

But since the dumb http protocol is so ancient and so inferior to the
smart http protocol, I don't think it's worth spending a lot of time
creating such a system. For this patch I'm just downloading the idx
files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
deleted when we exit (and due to the name, any we miss due to a crash,
etc, should eventually be removed by "git gc" runs based on timestamps).

That is slightly worse for one case: if we download an idx but not the
matching pack, we won't retain that idx for subsequent runs. But the
flip side is that we're making other cases better (we never hold on to
useless idx files forever). I suspect that worse case does not even come
up often, since it implies that the packs are generated to match
distinct parts of history (i.e., in practice even in a repo with many
packs you're going to end up grabbing all of those packs to do a clone).
If somebody really cares about that, I think the right path forward is a
managed cache directory as above, and this patch is providing the first
step in that direction anyway (by moving things out of the objects/pack/
directory).

There are two test changes. One demonstrates the broken v1 index case
(it double-checks the resulting clone with fsck to be careful, but prior
to this patch it actually fails at the clone step). The other tweaks the
expectation for a test that covers the "slightly worse" case to
accommodate the extra index download.

The code changes are fairly simple. We stop using finalize_object_file()
to copy the remote's index file into place, and leave it as a tempfile.
We give the tempfile a real ".idx" name, since the packfile code expects
that, and thus we make sure it is out of the usual packs/ directory (so
we'd never mistake it for a real local .idx).

We also have to change parse_pack_index(), which creates a temporary
packed_git to access our index (we need this because all of the pack idx
code assumes we have that struct). It reads the index data from the
tempfile, but prior to this patch would speculatively write the
finalized name into the packed_git struct using the pack-$hash we expect
to use.

I was mildly surprised that this worked at all, since we call
verify_pack_index() on the packed_git which mentions the final name
before moving the file into place! But it works because
parse_pack_index() leaves the mmap-ed data in the struct, so the
lazy-open in verify_pack_index() never triggers, and we read from the
tempfile, ignoring the filename in the struct completely. Hacky, but it
works.

After this patch, parse_pack_index() now uses the index filename we pass
in to derive a matching .pack name. This is OK to change because there
are only two callers, both in the dumb http code (and the other passes
in an existing pack-$hash.idx name, so the derived name is going to be
pack-$hash.pack, which is what we were using anyway).

I'll follow up with some more cleanups in that area, but this patch is
sufficient to fix the regression.

Reported-by: fox <fox.gbr@townlong-yak.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 http.c                     | 25 ++++++++++++++++++++-----
 packfile.c                 | 11 ++++++++++-
 t/t5550-http-fetch-dumb.sh | 12 +++++++++++-
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/http.c b/http.c
index d59e59f66b..9642cad2e3 100644
--- a/http.c
+++ b/http.c
@@ -19,6 +19,7 @@
 #include "string-list.h"
 #include "object-file.h"
 #include "object-store-ll.h"
+#include "tempfile.h"
 
 static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
 static int trace_curl_data = 1;
@@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
 	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
 	url = strbuf_detach(&buf, NULL);
 
-	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
-	tmp = strbuf_detach(&buf, NULL);
+	/*
+	 * Don't put this into packs/, since it's just temporary and we don't
+	 * want to confuse it with our local .idx files.  We'll generate our
+	 * own index if we choose to download the matching packfile.
+	 *
+	 * It's tempting to use xmks_tempfile() here, but it's important that
+	 * the file not exist, otherwise http_get_file() complains. So we
+	 * create a filename that should be unique, and then just register it
+	 * as a tempfile so that it will get cleaned up on exit.
+	 *
+	 * In theory we could hold on to the tempfile and delete these as soon
+	 * as we download the matching pack, but it would take a bit of
+	 * refactoring. Leaving them until the process ends is probably OK.
+	 */
+	tmp = xstrfmt("%s/tmp_pack_%s.idx",
+		      repo_get_object_directory(the_repository),
+		      hash_to_hex(hash));
+	register_tempfile(tmp);
 
 	if (http_get_file(url, tmp, NULL) != HTTP_OK) {
 		error("Unable to get pack index %s", url);
@@ -2427,10 +2444,8 @@ static int fetch_and_setup_pack_index(struct packed_git **packs_head,
 	}
 
 	ret = verify_pack_index(new_pack);
-	if (!ret) {
+	if (!ret)
 		close_pack_index(new_pack);
-		ret = finalize_object_file(tmp_idx, sha1_pack_index_name(sha1));
-	}
 	free(tmp_idx);
 	if (ret)
 		return -1;
diff --git a/packfile.c b/packfile.c
index df4ba67719..16d3bcf7f7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -237,13 +237,22 @@ static struct packed_git *alloc_packed_git(int extra)
 	return p;
 }
 
+static char *pack_path_from_idx(const char *idx_path)
+{
+	size_t len;
+	if (!strip_suffix(idx_path, ".idx", &len))
+		BUG("idx path does not end in .idx: %s", idx_path);
+	return xstrfmt("%.*s.pack", (int)len, idx_path);
+}
+
 struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path)
 {
-	const char *path = sha1_pack_name(sha1);
+	char *path = pack_path_from_idx(idx_path);
 	size_t alloc = st_add(strlen(path), 1);
 	struct packed_git *p = alloc_packed_git(alloc);
 
 	memcpy(p->pack_name, path, alloc); /* includes NUL */
+	free(path);
 	hashcpy(p->hash, sha1, the_repository->hash_algo);
 	if (check_packed_git_idx(idx_path, p)) {
 		free(p);
diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index 3968b82260..706540ab74 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -335,7 +335,7 @@ test_expect_success 'fetch can handle previously-fetched .idx files' '
 	count_fetches 1 pack one.trace &&
 	GIT_TRACE_CURL=$PWD/two.trace git --git-dir=clone_packed_branches.git \
 		fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2 &&
-	count_fetches 0 idx two.trace &&
+	count_fetches 1 idx two.trace &&
 	count_fetches 1 pack two.trace
 '
 
@@ -521,4 +521,14 @@ test_expect_success 'fetching via http alternates works' '
 	git -c http.followredirects=true clone "$HTTPD_URL/dumb/alt-child.git"
 '
 
+test_expect_success 'dumb http can fetch index v1' '
+	server=$HTTPD_DOCUMENT_ROOT_PATH/idx-v1.git &&
+	git init --bare "$server" &&
+	git -C "$server" --work-tree=. commit --allow-empty -m foo &&
+	git -C "$server" -c pack.indexVersion=1 gc &&
+
+	git clone "$HTTPD_URL/dumb/idx-v1.git" &&
+	git -C idx-v1 fsck
+'
+
 test_done
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 04/11] packfile: drop has_pack_index()
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (2 preceding siblings ...)
  2024-10-25  6:58           ` [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile Jeff King
@ 2024-10-25  7:00           ` Jeff King
  2024-10-25 21:27             ` Taylor Blau
  2024-10-25  7:00           ` [PATCH 05/11] packfile: drop sha1_pack_name() Jeff King
                             ` (7 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-10-25  7:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

The has_pack_index() function has several oddities that may make it
surprising if you are trying to find out if we have a pack with some
$hash:

  - it is not looking for a valid pack that we found while searching
    object directories. It just looks for any pack-$hash.idx file in the
    pack directory.

  - it only looks in the local directory, not any alternates

  - it takes a bare "unsigned char" hash, which we try to avoid these
    days

The only caller it has is in the dumb http code; it wants to know if we
already have the pack idx in question. This can happen if we downloaded
the pack (and generated its index) during a previous fetch.

Before the previous patch ("dumb-http: store downloaded pack idx as
tempfile"), it could also happen if we downloaded the .idx from the
remote but didn't get the matching .pack. But since that patch, we don't
hold on to those .idx files. So there's no need to look for the .idx
file in the filesystem; we can just scan through the packed_git list to
see if we have it.

That lets us simplify the dumb http code a bit, as we know that if we
have the .idx we have the matching .pack already. And it lets us get rid
of this odd function that is unlikely to be needed again.

Signed-off-by: Jeff King <peff@peff.net>
---
 http.c     | 15 ++++++++-------
 packfile.c |  8 --------
 packfile.h |  2 --
 3 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/http.c b/http.c
index 9642cad2e3..03802b9049 100644
--- a/http.c
+++ b/http.c
@@ -2420,15 +2420,17 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
 static int fetch_and_setup_pack_index(struct packed_git **packs_head,
 	unsigned char *sha1, const char *base_url)
 {
-	struct packed_git *new_pack;
+	struct packed_git *new_pack, *p;
 	char *tmp_idx = NULL;
 	int ret;
 
-	if (has_pack_index(sha1)) {
-		new_pack = parse_pack_index(sha1, sha1_pack_index_name(sha1));
-		if (!new_pack)
-			return -1; /* parse_pack_index() already issued error message */
-		goto add_pack;
+	/*
+	 * If we already have the pack locally, no need to fetch its index or
+	 * even add it to list; we already have all of its objects.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (hasheq(p->hash, sha1, the_repository->hash_algo))
+			return 0;
 	}
 
 	tmp_idx = fetch_pack_index(sha1, base_url);
@@ -2450,7 +2452,6 @@ static int fetch_and_setup_pack_index(struct packed_git **packs_head,
 	if (ret)
 		return -1;
 
-add_pack:
 	new_pack->next = *packs_head;
 	*packs_head = new_pack;
 	return 0;
diff --git a/packfile.c b/packfile.c
index 16d3bcf7f7..0ead2290d4 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2160,14 +2160,6 @@ int has_object_kept_pack(const struct object_id *oid, unsigned flags)
 	return find_kept_pack_entry(the_repository, oid, flags, &e);
 }
 
-int has_pack_index(const unsigned char *sha1)
-{
-	struct stat st;
-	if (stat(sha1_pack_index_name(sha1), &st))
-		return 0;
-	return 1;
-}
-
 int for_each_object_in_pack(struct packed_git *p,
 			    each_packed_object_fn cb, void *data,
 			    enum for_each_object_flags flags)
diff --git a/packfile.h b/packfile.h
index 0f78658229..b4df3546a3 100644
--- a/packfile.h
+++ b/packfile.h
@@ -193,8 +193,6 @@ int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsi
 int has_object_pack(const struct object_id *oid);
 int has_object_kept_pack(const struct object_id *oid, unsigned flags);
 
-int has_pack_index(const unsigned char *sha1);
-
 /*
  * Return 1 if an object in a promisor packfile is or refers to the given
  * object, 0 otherwise.
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 05/11] packfile: drop sha1_pack_name()
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (3 preceding siblings ...)
  2024-10-25  7:00           ` [PATCH 04/11] packfile: drop has_pack_index() Jeff King
@ 2024-10-25  7:00           ` Jeff King
  2024-10-25  7:01           ` [PATCH 06/11] packfile: drop sha1_pack_index_name() Jeff King
                             ` (6 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-10-25  7:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

The sha1_pack_name() function has a few ugly bits:

  - it writes into a static strbuf (and not even a ring buffer of them),
    which can lead to subtle invalidation problems

  - it uses the term "sha1", but it's really using the_hash_algo, which
    could be sha256

There's only one caller of it left. And in fact that caller is better
off using the underlying odb_pack_name() function itself, since it's
just copying the result into its own strbuf anyway.

Converting that caller lets us get rid of this now-obselete function.

Signed-off-by: Jeff King <peff@peff.net>
---
 http.c     | 3 ++-
 packfile.c | 6 ------
 packfile.h | 7 -------
 3 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/http.c b/http.c
index 03802b9049..510332ab04 100644
--- a/http.c
+++ b/http.c
@@ -2579,7 +2579,8 @@ struct http_pack_request *new_direct_http_pack_request(
 
 	preq->url = url;
 
-	strbuf_addf(&preq->tmpfile, "%s.temp", sha1_pack_name(packed_git_hash));
+	odb_pack_name(&preq->tmpfile, packed_git_hash, "pack");
+	strbuf_addstr(&preq->tmpfile, ".temp");
 	preq->packfile = fopen(preq->tmpfile.buf, "a");
 	if (!preq->packfile) {
 		error("Unable to open local file %s for pack",
diff --git a/packfile.c b/packfile.c
index 0ead2290d4..48d650161f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -35,12 +35,6 @@ char *odb_pack_name(struct strbuf *buf,
 	return buf->buf;
 }
 
-char *sha1_pack_name(const unsigned char *sha1)
-{
-	static struct strbuf buf = STRBUF_INIT;
-	return odb_pack_name(&buf, sha1, "pack");
-}
-
 char *sha1_pack_index_name(const unsigned char *sha1)
 {
 	static struct strbuf buf = STRBUF_INIT;
diff --git a/packfile.h b/packfile.h
index b4df3546a3..2bbcc58571 100644
--- a/packfile.h
+++ b/packfile.h
@@ -31,13 +31,6 @@ struct pack_entry {
  */
 char *odb_pack_name(struct strbuf *buf, const unsigned char *sha1, const char *ext);
 
-/*
- * Return the name of the (local) packfile with the specified sha1 in
- * its name.  The return value is a pointer to memory that is
- * overwritten each time this function is called.
- */
-char *sha1_pack_name(const unsigned char *sha1);
-
 /*
  * Return the name of the (local) pack index file with the specified
  * sha1 in its name.  The return value is a pointer to memory that is
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 06/11] packfile: drop sha1_pack_index_name()
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (4 preceding siblings ...)
  2024-10-25  7:00           ` [PATCH 05/11] packfile: drop sha1_pack_name() Jeff King
@ 2024-10-25  7:01           ` Jeff King
  2024-10-25  7:02           ` [PATCH 07/11] packfile: warn people away from parse_packed_git() Jeff King
                             ` (5 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-10-25  7:01 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

Like sha1_pack_name() that we dropped in the previous commit, this
function uses an error-prone static strbuf and the somewhat misleading
name "sha1".

The only caller left is in pack-redundant.c. While this command is
marked for potential removal in our BreakingChanges document, we still
have it for now. But it's simple enough to convert it to use its own
strbuf with the underlying odb_pack_name() function, letting us drop the
otherwise obsolete function.

Note that odb_pack_name() does its own strbuf_reset(), so it's safe to
use directly within a loop like this.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-redundant.c | 5 ++++-
 packfile.c               | 6 ------
 packfile.h               | 7 -------
 3 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/builtin/pack-redundant.c b/builtin/pack-redundant.c
index 5809613002..d2c1c4e5ec 100644
--- a/builtin/pack-redundant.c
+++ b/builtin/pack-redundant.c
@@ -13,6 +13,7 @@
 
 #include "packfile.h"
 #include "object-store-ll.h"
+#include "strbuf.h"
 
 #define BLKSIZE 512
 
@@ -591,6 +592,7 @@ static void load_all(void)
 int cmd_pack_redundant(int argc, const char **argv, const char *prefix UNUSED, struct repository *repo UNUSED) {
 	int i; int i_still_use_this = 0; struct pack_list *min = NULL, *red, *pl;
 	struct llist *ignore;
+	struct strbuf idx_name = STRBUF_INIT;
 	char buf[GIT_MAX_HEXSZ + 2]; /* hex hash + \n + \0 */
 
 	if (argc == 2 && !strcmp(argv[1], "-h"))
@@ -688,7 +690,7 @@ int cmd_pack_redundant(int argc, const char **argv, const char *prefix UNUSED, s
 	pl = red = pack_list_difference(local_packs, min);
 	while (pl) {
 		printf("%s\n%s\n",
-		       sha1_pack_index_name(pl->pack->hash),
+		       odb_pack_name(&idx_name, pl->pack->hash, "idx"),
 		       pl->pack->pack_name);
 		pl = pl->next;
 	}
@@ -699,5 +701,6 @@ int cmd_pack_redundant(int argc, const char **argv, const char *prefix UNUSED, s
 	pack_list_free(red);
 	pack_list_free(min);
 	llist_free(ignore);
+	strbuf_release(&idx_name);
 	return 0;
 }
diff --git a/packfile.c b/packfile.c
index 48d650161f..0d4fd2737a 100644
--- a/packfile.c
+++ b/packfile.c
@@ -35,12 +35,6 @@ char *odb_pack_name(struct strbuf *buf,
 	return buf->buf;
 }
 
-char *sha1_pack_index_name(const unsigned char *sha1)
-{
-	static struct strbuf buf = STRBUF_INIT;
-	return odb_pack_name(&buf, sha1, "idx");
-}
-
 static unsigned int pack_used_ctr;
 static unsigned int pack_mmap_calls;
 static unsigned int peak_pack_open_windows;
diff --git a/packfile.h b/packfile.h
index 2bbcc58571..6812abaae0 100644
--- a/packfile.h
+++ b/packfile.h
@@ -31,13 +31,6 @@ struct pack_entry {
  */
 char *odb_pack_name(struct strbuf *buf, const unsigned char *sha1, const char *ext);
 
-/*
- * Return the name of the (local) pack index file with the specified
- * sha1 in its name.  The return value is a pointer to memory that is
- * overwritten each time this function is called.
- */
-char *sha1_pack_index_name(const unsigned char *sha1);
-
 /*
  * Return the basename of the packfile, omitting any containing directory
  * (e.g., "pack-1234abcd[...].pack").
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 07/11] packfile: warn people away from parse_packed_git()
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (5 preceding siblings ...)
  2024-10-25  7:01           ` [PATCH 06/11] packfile: drop sha1_pack_index_name() Jeff King
@ 2024-10-25  7:02           ` Jeff King
  2024-10-25 21:28             ` Taylor Blau
  2024-10-25  7:03           ` [PATCH 08/11] http-walker: use object_id instead of bare hash Jeff King
                             ` (4 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-10-25  7:02 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

With a name like parse_packed_git(), you might think it's the right way
to access a local pack index and its associated objects. But not so!
It's a one-off used by the dumb-http code to access pack idx files we've
downloaded from the remote, but whose packs we might not have.

There's only one caller left for this function, and ideally we'd drop it
completely and just inline it there. But that would require exposing
other internals from packfile.[ch], like alloc_packed_git() and
check_packed_git_idx().

So let's leave it be for now, and just warn people that it's probably
not what they're looking for. Perhaps in the long run if we eventually
drop dumb-http support, we can remove the function entirely then.

Signed-off-by: Jeff King <peff@peff.net>
---
 packfile.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/packfile.h b/packfile.h
index 6812abaae0..ad393be9f1 100644
--- a/packfile.h
+++ b/packfile.h
@@ -37,6 +37,15 @@ char *odb_pack_name(struct strbuf *buf, const unsigned char *sha1, const char *e
  */
 const char *pack_basename(struct packed_git *p);
 
+/*
+ * Parse the pack idx file found at idx_path and create a packed_git struct
+ * which can be used with find_pack_entry_one().
+ *
+ * You probably don't want to use this function! It skips most of the normal
+ * sanity checks (including whether we even have the matching .pack file),
+ * and does not add the resulting packed_git struct to the internal list of
+ * packs. You probably want add_packed_git() instead.
+ */
 struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
 typedef void each_file_in_pack_dir_fn(const char *full_path, size_t full_path_len,
@@ -193,7 +202,7 @@ int is_promisor_object(const struct object_id *oid);
  *
  * This function should not be used directly. It is exposed here only so that we
  * have a convenient entry-point for fuzz testing. For real uses, you should
- * probably use open_pack_index() or parse_pack_index() instead.
+ * probably use open_pack_index() instead.
  */
 int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
 	     size_t idx_size, struct packed_git *p);
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 08/11] http-walker: use object_id instead of bare hash
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (6 preceding siblings ...)
  2024-10-25  7:02           ` [PATCH 07/11] packfile: warn people away from parse_packed_git() Jeff King
@ 2024-10-25  7:03           ` Jeff King
  2024-10-25  7:05           ` [PATCH 09/11] packfile: convert find_sha1_pack() to use object_id Jeff King
                             ` (3 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-10-25  7:03 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

We long ago switched most code to using object_id structs instead of
bare "unsigned char *" hashes. This gives us more type safety from the
compiler, and generally makes it easier to understand what we expect in
each parameter.

But the dumb-http code has lagged behind. And indeed, the whole "walker"
subsystem interface has the same problem, though http-walker is the only
user left.

So let's update the walker interface to pass object_id structs (which we
already have anyway at all call sites!), and likewise use those within
the http-walker methods that it calls.

This cleans up the dumb-http code a bit, but will also let us fix a few
more commonly used helper functions.

Signed-off-by: Jeff King <peff@peff.net>
---
 http-walker.c | 25 +++++++++++++------------
 walker.c      |  4 ++--
 walker.h      |  4 ++--
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/http-walker.c b/http-walker.c
index fb2d86d5e7..36dd1f33c0 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -147,14 +147,14 @@ static int fill_active_slot(void *data UNUSED)
 	return 0;
 }
 
-static void prefetch(struct walker *walker, unsigned char *sha1)
+static void prefetch(struct walker *walker, const struct object_id *oid)
 {
 	struct object_request *newreq;
 	struct walker_data *data = walker->data;
 
 	newreq = xmalloc(sizeof(*newreq));
 	newreq->walker = walker;
-	oidread(&newreq->oid, sha1, the_repository->hash_algo);
+	oidcpy(&newreq->oid, oid);
 	newreq->repo = data->alt;
 	newreq->state = WAITING;
 	newreq->req = NULL;
@@ -422,7 +422,8 @@ static int fetch_indices(struct walker *walker, struct alt_base *repo)
 	return ret;
 }
 
-static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigned char *sha1)
+static int http_fetch_pack(struct walker *walker, struct alt_base *repo,
+			   const struct object_id *oid)
 {
 	struct packed_git *target;
 	int ret;
@@ -431,7 +432,7 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigne
 
 	if (fetch_indices(walker, repo))
 		return -1;
-	target = find_sha1_pack(sha1, repo->packs);
+	target = find_sha1_pack(oid->hash, repo->packs);
 	if (!target)
 		return -1;
 	close_pack_index(target);
@@ -440,7 +441,7 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigne
 		fprintf(stderr, "Getting pack %s\n",
 			hash_to_hex(target->hash));
 		fprintf(stderr, " which contains %s\n",
-			hash_to_hex(sha1));
+			oid_to_hex(oid));
 	}
 
 	preq = new_http_pack_request(target->hash, repo->base);
@@ -477,17 +478,17 @@ static void abort_object_request(struct object_request *obj_req)
 	release_object_request(obj_req);
 }
 
-static int fetch_object(struct walker *walker, unsigned char *hash)
+static int fetch_object(struct walker *walker, const struct object_id *oid)
 {
-	char *hex = hash_to_hex(hash);
+	char *hex = oid_to_hex(oid);
 	int ret = 0;
 	struct object_request *obj_req = NULL;
 	struct http_object_request *req;
 	struct list_head *pos, *head = &object_queue_head;
 
 	list_for_each(pos, head) {
 		obj_req = list_entry(pos, struct object_request, node);
-		if (hasheq(obj_req->oid.hash, hash, the_repository->hash_algo))
+		if (oideq(&obj_req->oid, oid))
 			break;
 	}
 	if (!obj_req)
@@ -548,20 +549,20 @@ static int fetch_object(struct walker *walker, unsigned char *hash)
 	return ret;
 }
 
-static int fetch(struct walker *walker, unsigned char *hash)
+static int fetch(struct walker *walker, const struct object_id *oid)
 {
 	struct walker_data *data = walker->data;
 	struct alt_base *altbase = data->alt;
 
-	if (!fetch_object(walker, hash))
+	if (!fetch_object(walker, oid))
 		return 0;
 	while (altbase) {
-		if (!http_fetch_pack(walker, altbase, hash))
+		if (!http_fetch_pack(walker, altbase, oid))
 			return 0;
 		fetch_alternates(walker, data->alt->base);
 		altbase = altbase->next;
 	}
-	return error("Unable to find %s under %s", hash_to_hex(hash),
+	return error("Unable to find %s under %s", oid_to_hex(oid),
 		     data->alt->base);
 }
 
diff --git a/walker.c b/walker.c
index 807a7a3881..5ea7e5b392 100644
--- a/walker.c
+++ b/walker.c
@@ -157,7 +157,7 @@ static int process(struct walker *walker, struct object *obj)
 	else {
 		if (obj->flags & COMPLETE)
 			return 0;
-		walker->prefetch(walker, obj->oid.hash);
+		walker->prefetch(walker, &obj->oid);
 	}
 
 	object_list_insert(obj, process_queue_end);
@@ -186,7 +186,7 @@ static int loop(struct walker *walker)
 		 * the queue because we needed to fetch it first.
 		 */
 		if (! (obj->flags & TO_SCAN)) {
-			if (walker->fetch(walker, obj->oid.hash)) {
+			if (walker->fetch(walker, &obj->oid)) {
 				stop_progress(&progress);
 				report_missing(obj);
 				return -1;
diff --git a/walker.h b/walker.h
index d40b016bab..25aaa3631c 100644
--- a/walker.h
+++ b/walker.h
@@ -6,8 +6,8 @@
 struct walker {
 	void *data;
 	int (*fetch_ref)(struct walker *, struct ref *ref);
-	void (*prefetch)(struct walker *, unsigned char *sha1);
-	int (*fetch)(struct walker *, unsigned char *sha1);
+	void (*prefetch)(struct walker *, const struct object_id *oid);
+	int (*fetch)(struct walker *, const struct object_id *oid);
 	void (*cleanup)(struct walker *);
 	int get_verbosely;
 	int get_progress;
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 09/11] packfile: convert find_sha1_pack() to use object_id
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (7 preceding siblings ...)
  2024-10-25  7:03           ` [PATCH 08/11] http-walker: use object_id instead of bare hash Jeff King
@ 2024-10-25  7:05           ` Jeff King
  2024-10-25  7:06           ` [PATCH 10/11] packfile: use object_id in find_pack_entry_one() Jeff King
                             ` (2 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-10-25  7:05 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

The find_sha1_pack() function has a few problems:

  - it's badly named, since it works with any object hash

  - it takes the hash as a bare pointer rather than an object_id struct

We can fix both of these easily, as all callers actually have a real
object_id anyway.

I also found the existence of this function somewhat confusing, as it is
about looking in an arbitrary set of linked packed_git structs. It's
good for things like dumb-http which are looking in downloaded remote
packs, and not our local packs. But despite the name, it is not a good
way to find the pack which contains a local object (it skips the use of
the midx, the pack mru list, and so on).

So let's also add an explanatory comment above the declaration that may
point people in the right direction.

I suspect the calls in fast-import.c, which use the packed_git list from
the repository struct, could actually just be using find_pack_entry().
But since we'd need to keep it anyway for dumb-http, I didn't dig
further there. If we eventually drop dumb-http support, then it might be
worth examining them to see if we can get rid of the function entirely.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-import.c | 6 ++----
 http-push.c           | 4 ++--
 http-walker.c         | 2 +-
 packfile.c            | 6 +++---
 packfile.h            | 9 +++++++--
 5 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 1e7ab67f6e..76d5c20f14 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -966,8 +966,7 @@ static int store_object(
 	if (e->idx.offset) {
 		duplicate_count_by_type[type]++;
 		return 1;
-	} else if (find_sha1_pack(oid.hash,
-				  get_all_packs(the_repository))) {
+	} else if (find_oid_pack(&oid, get_all_packs(the_repository))) {
 		e->type = type;
 		e->pack_id = MAX_PACK_ID;
 		e->idx.offset = 1; /* just not zero! */
@@ -1167,8 +1166,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 		duplicate_count_by_type[OBJ_BLOB]++;
 		truncate_pack(&checkpoint);
 
-	} else if (find_sha1_pack(oid.hash,
-				  get_all_packs(the_repository))) {
+	} else if (find_oid_pack(&oid, get_all_packs(the_repository))) {
 		e->type = OBJ_BLOB;
 		e->pack_id = MAX_PACK_ID;
 		e->idx.offset = 1; /* just not zero! */
diff --git a/http-push.c b/http-push.c
index aad89f2eab..4d24e6b8d4 100644
--- a/http-push.c
+++ b/http-push.c
@@ -309,7 +309,7 @@ static void start_fetch_packed(struct transfer_request *request)
 	struct transfer_request *check_request = request_queue_head;
 	struct http_pack_request *preq;
 
-	target = find_sha1_pack(request->obj->oid.hash, repo->packs);
+	target = find_oid_pack(&request->obj->oid, repo->packs);
 	if (!target) {
 		fprintf(stderr, "Unable to fetch %s, will not be able to update server info refs\n", oid_to_hex(&request->obj->oid));
 		repo->can_update_info_refs = 0;
@@ -681,7 +681,7 @@ static int add_send_request(struct object *obj, struct remote_lock *lock)
 		get_remote_object_list(obj->oid.hash[0]);
 	if (obj->flags & (REMOTE | PUSHING))
 		return 0;
-	target = find_sha1_pack(obj->oid.hash, repo->packs);
+	target = find_oid_pack(&obj->oid, repo->packs);
 	if (target) {
 		obj->flags |= REMOTE;
 		return 0;
diff --git a/http-walker.c b/http-walker.c
index 36dd1f33c0..43cde0ebe5 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -432,7 +432,7 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo,
 
 	if (fetch_indices(walker, repo))
 		return -1;
-	target = find_sha1_pack(oid->hash, repo->packs);
+	target = find_oid_pack(oid, repo->packs);
 	if (!target)
 		return -1;
 	close_pack_index(target);
diff --git a/packfile.c b/packfile.c
index 0d4fd2737a..c51eab15a5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2010,13 +2010,13 @@ int is_pack_valid(struct packed_git *p)
 	return !open_packed_git(p);
 }
 
-struct packed_git *find_sha1_pack(const unsigned char *sha1,
-				  struct packed_git *packs)
+struct packed_git *find_oid_pack(const struct object_id *oid,
+				 struct packed_git *packs)
 {
 	struct packed_git *p;
 
 	for (p = packs; p; p = p->next) {
-		if (find_pack_entry_one(sha1, p))
+		if (find_pack_entry_one(oid->hash, p))
 			return p;
 	}
 	return NULL;
diff --git a/packfile.h b/packfile.h
index ad393be9f1..3baffa940c 100644
--- a/packfile.h
+++ b/packfile.h
@@ -79,8 +79,13 @@ struct packed_git *get_all_packs(struct repository *r);
  */
 unsigned long repo_approximate_object_count(struct repository *r);
 
-struct packed_git *find_sha1_pack(const unsigned char *sha1,
-				  struct packed_git *packs);
+/*
+ * Find the pack within the "packs" list whose index contains the object "oid".
+ * For general object lookups, you probably don't want this; use
+ * find_pack_entry() instead.
+ */
+struct packed_git *find_oid_pack(const struct object_id *oid,
+				 struct packed_git *packs);
 
 void pack_report(void);
 
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 10/11] packfile: use object_id in find_pack_entry_one()
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (8 preceding siblings ...)
  2024-10-25  7:05           ` [PATCH 09/11] packfile: convert find_sha1_pack() to use object_id Jeff King
@ 2024-10-25  7:06           ` Jeff King
  2024-10-25 21:33             ` Taylor Blau
  2024-10-25  7:08           ` [PATCH 11/11] packfile: use oidread() instead of hashcpy() to fill object_id Jeff King
  2024-10-25 21:35           ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Taylor Blau
  11 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-10-25  7:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

The main function we use to search a pack index for an object is
find_pack_entry_one(). That function still takes a bare pointer to the
hash, despite the fact that its underlying bsearch_pack() function needs
an object_id struct. And so we end up making an extra copy of the hash
into the struct just to do a lookup.

As it turns out, all callers but one already have such an object_id. So
we can just take a pointer to that struct and use it directly. This
avoids the extra copy and provides a more type-safe interface.

The one exception is get_delta_base() in packfile.c, when we are chasing
a REF_DELTA from inside the pack (and thus we have a pointer directly to
the mmap'd pack memory, not a struct). We can just bump the hashcpy()
from inside find_pack_entry_one() to this one caller that needs it.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c    |  4 ++--
 connected.c               |  4 ++--
 midx.c                    |  2 +-
 pack-bitmap.c             |  4 ++--
 packfile.c                | 16 ++++++++--------
 packfile.h                |  4 ++--
 t/helper/test-find-pack.c |  2 +-
 7 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0fc0680b40..0800714267 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1556,7 +1556,7 @@ static int want_object_in_pack_one(struct packed_git *p,
 	if (p == *found_pack)
 		offset = *found_offset;
 	else
-		offset = find_pack_entry_one(oid->hash, p);
+		offset = find_pack_entry_one(oid, p);
 
 	if (offset) {
 		if (!*found_pack) {
@@ -3984,7 +3984,7 @@ static int has_sha1_pack_kept_or_nonlocal(const struct object_id *oid)
 	while (p) {
 		if ((!p->pack_local || p->pack_keep ||
 				p->pack_keep_in_core) &&
-			find_pack_entry_one(oid->hash, p)) {
+			find_pack_entry_one(oid, p)) {
 			last_found = p;
 			return 1;
 		}
diff --git a/connected.c b/connected.c
index 87cc4b57a1..a9e2e13995 100644
--- a/connected.c
+++ b/connected.c
@@ -78,7 +78,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 			for (p = get_all_packs(the_repository); p; p = p->next) {
 				if (!p->pack_promisor)
 					continue;
-				if (find_pack_entry_one(oid->hash, p))
+				if (find_pack_entry_one(oid, p))
 					goto promisor_pack_found;
 			}
 			/*
@@ -144,7 +144,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		 * are sure the ref is good and not sending it to
 		 * rev-list for verification.
 		 */
-		if (new_pack && find_pack_entry_one(oid->hash, new_pack))
+		if (new_pack && find_pack_entry_one(oid, new_pack))
 			continue;
 
 		if (fprintf(rev_list_in, "%s\n", oid_to_hex(oid)) < 0)
diff --git a/midx.c b/midx.c
index 479812cb9b..e82d4f2e65 100644
--- a/midx.c
+++ b/midx.c
@@ -987,7 +987,7 @@ int verify_midx_file(struct repository *r, const char *object_dir, unsigned flag
 		}
 
 		m_offset = e.offset;
-		p_offset = find_pack_entry_one(oid.hash, e.p);
+		p_offset = find_pack_entry_one(&oid, e.p);
 
 		if (m_offset != p_offset)
 			midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 32b222a7af..4fa9dfc771 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -935,7 +935,7 @@ static inline int bitmap_position_packfile(struct bitmap_index *bitmap_git,
 					   const struct object_id *oid)
 {
 	uint32_t pos;
-	off_t offset = find_pack_entry_one(oid->hash, bitmap_git->pack);
+	off_t offset = find_pack_entry_one(oid, bitmap_git->pack);
 	if (!offset)
 		return -1;
 
@@ -1609,7 +1609,7 @@ static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
 			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
 				return 1;
 		} else {
-			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
+			if (find_pack_entry_one(&object->oid, bitmap_git->pack) > 0)
 				return 1;
 		}
 	}
diff --git a/packfile.c b/packfile.c
index c51eab15a5..005ca670b4 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1239,7 +1239,9 @@ off_t get_delta_base(struct packed_git *p,
 		*curpos += used;
 	} else if (type == OBJ_REF_DELTA) {
 		/* The base entry _must_ be in the same pack */
-		base_offset = find_pack_entry_one(base_info, p);
+		struct object_id oid;
+		hashcpy(oid.hash, base_info, the_repository->hash_algo);
+		base_offset = find_pack_entry_one(&oid, p);
 		*curpos += the_hash_algo->rawsz;
 	} else
 		die("I am totally screwed");
@@ -1971,20 +1973,18 @@ off_t nth_packed_object_offset(const struct packed_git *p, uint32_t n)
 	}
 }
 
-off_t find_pack_entry_one(const unsigned char *sha1,
-				  struct packed_git *p)
+off_t find_pack_entry_one(const struct object_id *oid,
+			  struct packed_git *p)
 {
 	const unsigned char *index = p->index_data;
-	struct object_id oid;
 	uint32_t result;
 
 	if (!index) {
 		if (open_pack_index(p))
 			return 0;
 	}
 
-	hashcpy(oid.hash, sha1, the_repository->hash_algo);
-	if (bsearch_pack(&oid, p, &result))
+	if (bsearch_pack(oid, p, &result))
 		return nth_packed_object_offset(p, result);
 	return 0;
 }
@@ -2016,7 +2016,7 @@ struct packed_git *find_oid_pack(const struct object_id *oid,
 	struct packed_git *p;
 
 	for (p = packs; p; p = p->next) {
-		if (find_pack_entry_one(oid->hash, p))
+		if (find_pack_entry_one(oid, p))
 			return p;
 	}
 	return NULL;
@@ -2033,7 +2033,7 @@ static int fill_pack_entry(const struct object_id *oid,
 	    oidset_contains(&p->bad_objects, oid))
 		return 0;
 
-	offset = find_pack_entry_one(oid->hash, p);
+	offset = find_pack_entry_one(oid, p);
 	if (!offset)
 		return 0;
 
diff --git a/packfile.h b/packfile.h
index 3baffa940c..08f88a7ff5 100644
--- a/packfile.h
+++ b/packfile.h
@@ -154,10 +154,10 @@ int nth_packed_object_id(struct object_id *, struct packed_git *, uint32_t n);
 off_t nth_packed_object_offset(const struct packed_git *, uint32_t n);
 
 /*
- * If the object named sha1 is present in the specified packfile,
+ * If the object named by oid is present in the specified packfile,
  * return its offset within the packfile; otherwise, return 0.
  */
-off_t find_pack_entry_one(const unsigned char *sha1, struct packed_git *);
+off_t find_pack_entry_one(const struct object_id *oid, struct packed_git *);
 
 int is_pack_valid(struct packed_git *);
 void *unpack_entry(struct repository *r, struct packed_git *, off_t, enum object_type *, unsigned long *);
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
index 14b2b0c12c..85a69a4e55 100644
--- a/t/helper/test-find-pack.c
+++ b/t/helper/test-find-pack.c
@@ -40,7 +40,7 @@ int cmd__find_pack(int argc, const char **argv)
 		die("cannot parse %s as an object name", argv[0]);
 
 	for (p = get_all_packs(the_repository); p; p = p->next)
-		if (find_pack_entry_one(oid.hash, p)) {
+		if (find_pack_entry_one(&oid, p)) {
 			printf("%s\n", p->pack_name);
 			actual_count++;
 		}
-- 
2.47.0.363.g6e72b256be


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 11/11] packfile: use oidread() instead of hashcpy() to fill object_id
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (9 preceding siblings ...)
  2024-10-25  7:06           ` [PATCH 10/11] packfile: use object_id in find_pack_entry_one() Jeff King
@ 2024-10-25  7:08           ` Jeff King
  2024-10-25 21:35           ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Taylor Blau
  11 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-10-25  7:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Patrick Steinhardt, fox, Eric Sunshine, git

When chasing a REF_DELTA, we need to pull the raw hash bytes out of the
mmap'd packfile into an object_id struct. We do that with a raw
hashcpy() of the appropriate length (that happens directly now, though
before the previous commit it happened inside find_pack_entry_one(),
also using a hashcpy).

But I think this creates a potentially dangerous situation due to
d4d364b2c7 (hash: convert `oidcmp()` and `oideq()` to compare whole
hash, 2024-06-14). When using sha1, we'll have uninitialized bytes in
the latter part of the object_id.hash buffer, which could fool oideq(),
etc.

We should use oidread() instead, which correctly zero-pads the extra
bytes, as of c98d762ed9 (global: ensure that object IDs are always
padded, 2024-06-14).

As far as I can see, this has not been a problem in practice because the
object_id we feed to find_pack_entry_one() is never used with oideq(),
etc. It is being compared to the bytes mmap'd from a pack idx file,
which of course do not have the extra padding bytes themselves. So
there's no bug here, but this just puzzled me while looking at the code.
We should do the more obviously safe thing, both for future-proofing and
to avoid confusing readers.

Signed-off-by: Jeff King <peff@peff.net>
---
+cc Patrick for any wisdom here. I'd guess that the conversions from
c98d762ed9 were found by using ASan, valgrind, or similar with the new
oideq() implementation.  And so this case, because it actually is safe,
was not flagged.

 packfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packfile.c b/packfile.c
index 005ca670b4..9560f0a33c 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1240,7 +1240,7 @@ off_t get_delta_base(struct packed_git *p,
 	} else if (type == OBJ_REF_DELTA) {
 		/* The base entry _must_ be in the same pack */
 		struct object_id oid;
-		hashcpy(oid.hash, base_info, the_repository->hash_algo);
+		oidread(&oid, base_info, the_repository->hash_algo);
 		base_offset = find_pack_entry_one(&oid, p);
 		*curpos += the_hash_algo->rawsz;
 	} else
-- 
2.47.0.363.g6e72b256be

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 01/11] midx: avoid duplicate packed_git entries
  2024-10-25  6:43           ` [PATCH 01/11] midx: avoid duplicate packed_git entries Jeff King
@ 2024-10-25 21:09             ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-25 21:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, fox, Eric Sunshine, git

On Fri, Oct 25, 2024 at 02:43:40AM -0400, Jeff King wrote:
> +cc Stolee here as the original midx author. I can't think of a good
> reason we'd want to avoid this dup-detection here.
>
>  midx.c                        | 20 +++++++++++++++++---
>  t/t5200-update-server-info.sh |  8 ++++++++
>  2 files changed, 25 insertions(+), 3 deletions(-)

:-). I swear I have written this exact patch before. I think this was
way back in the early days of MIDX bitmaps, and working around this bug
was the reason that we don't generate the MIDX/bitmap at the end of a
repack in the same process.

I don't recall all of the details at the time (or even if the above
rationale was what I was thinking back then or not), but I think the
change below here is obviously correct.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile
  2024-10-25  6:58           ` [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile Jeff King
@ 2024-10-25 21:18             ` Taylor Blau
  2024-10-26  6:02               ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Taylor Blau @ 2024-10-25 21:18 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Fri, Oct 25, 2024 at 02:58:06AM -0400, Jeff King wrote:
> This patch fixes a regression in b1b8dfde69 (finalize_object_file():
> implement collision check, 2024-09-26) where fetching a v1 pack idx file
> over the dumb-http protocol would cause the fetch to fail.

Makes sense, and matches our discussion from elsewhere on the list
sometime last week.

> There are a few options to fix it:
>
>   - we could teach index-pack a command-line option to ignore only pack
>     idx collisions, and use it when the dumb-http code invokes
>     index-pack. This would be an awkward thing to expose users to and
>     would involve a lot of boilerplate to get the option down to the
>     collision code.
>
>   - we could delete the remote .idx file right before running
>     index-pack. It should be redundant at that point (since we've just
>     downloaded the matching pack). But it feels risky to delete
>     something from our own .git/objects based on what the other side has
>     said. I'm not entirely positive that a malicious server couldn't lie
>     about which pack-$hash.idx it has and get us to delete something
>     precious.
>
>   - we can stop co-mingling the downloaded idx files in our local
>     objects directory. This is a slightly bigger change but I think
>     fixes the root of the problem more directly.
>
> This patch implements the third option. The big design questions are:
> where do we store the downloaded files, and how do we manage their
> lifetimes?

Yep, the third option is definitely the most sensible here.

> I think the right solution here is probably some more complex cache
> management system: download the remote .idx files to their own storage
> directory, mark them as "seen" when we get their matching pack (to avoid
> re-downloading even if we repack), and then delete them when the
> server's objects/info/refs no longer mentions them.
>
> But since the dumb http protocol is so ancient and so inferior to the
> smart http protocol, I don't think it's worth spending a lot of time
> creating such a system. For this patch I'm just downloading the idx
> files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
> deleted when we exit (and due to the name, any we miss due to a crash,
> etc, should eventually be removed by "git gc" runs based on timestamps).
>
> That is slightly worse for one case: [...]

OK. I think the more robust version of the solution you identified here
makes sense and is probably the right thing to do if we wanted to spend
more time and effort on fixing this part of the dumb HTTP protocol.

But I'm willing to believe that there are so few users of the protocol
at this point that it's almost certainly not worth the effort. So I'm
supportive of the stopping point that you chose here.

> diff --git a/http.c b/http.c
> index d59e59f66b..9642cad2e3 100644
> --- a/http.c
> +++ b/http.c
> @@ -19,6 +19,7 @@
>  #include "string-list.h"
>  #include "object-file.h"
>  #include "object-store-ll.h"
> +#include "tempfile.h"
>
>  static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
>  static int trace_curl_data = 1;
> @@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
>  	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
>  	url = strbuf_detach(&buf, NULL);
>
> -	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
> -	tmp = strbuf_detach(&buf, NULL);
> +	/*
> +	 * Don't put this into packs/, since it's just temporary and we don't
> +	 * want to confuse it with our local .idx files.  We'll generate our
> +	 * own index if we choose to download the matching packfile.

I actually think putting it in $GIT_DIR/objects/pack is fine and perhaps
preferable, since that's where we put existing in-progress temporary
files. We'll ignore anything that doesn't end with "*.idx", so I think
it would be fine.

I don't feel strongly about it. It just feels a little weird to stick
these temporary files in one place, and stick other (similar) temporary
flies in another.

> diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
> index 3968b82260..706540ab74 100755
> --- a/t/t5550-http-fetch-dumb.sh
> +++ b/t/t5550-http-fetch-dumb.sh
> @@ -335,7 +335,7 @@ test_expect_success 'fetch can handle previously-fetched .idx files' '
>  	count_fetches 1 pack one.trace &&
>  	GIT_TRACE_CURL=$PWD/two.trace git --git-dir=clone_packed_branches.git \
>  		fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2 &&
> -	count_fetches 0 idx two.trace &&
> +	count_fetches 1 idx two.trace &&
>  	count_fetches 1 pack two.trace
>  '
>
> @@ -521,4 +521,14 @@ test_expect_success 'fetching via http alternates works' '
>  	git -c http.followredirects=true clone "$HTTPD_URL/dumb/alt-child.git"
>  '
>
> +test_expect_success 'dumb http can fetch index v1' '
> +	server=$HTTPD_DOCUMENT_ROOT_PATH/idx-v1.git &&
> +	git init --bare "$server" &&
> +	git -C "$server" --work-tree=. commit --allow-empty -m foo &&
> +	git -C "$server" -c pack.indexVersion=1 gc &&
> +
> +	git clone "$HTTPD_URL/dumb/idx-v1.git" &&
> +	git -C idx-v1 fsck
> +'

Very nicely done.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/11] packfile: drop has_pack_index()
  2024-10-25  7:00           ` [PATCH 04/11] packfile: drop has_pack_index() Jeff King
@ 2024-10-25 21:27             ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-25 21:27 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Fri, Oct 25, 2024 at 03:00:09AM -0400, Jeff King wrote:
> ---
>  http.c     | 15 ++++++++-------
>  packfile.c |  8 --------
>  packfile.h |  2 --
>  3 files changed, 8 insertions(+), 17 deletions(-)

All makes sense. Good, let's keep reading...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 07/11] packfile: warn people away from parse_packed_git()
  2024-10-25  7:02           ` [PATCH 07/11] packfile: warn people away from parse_packed_git() Jeff King
@ 2024-10-25 21:28             ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-25 21:28 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Fri, Oct 25, 2024 at 03:02:22AM -0400, Jeff King wrote:
> With a name like parse_packed_git(), you might think it's the right way
> to access a local pack index and its associated objects. But not so!
> It's a one-off used by the dumb-http code to access pack idx files we've
> downloaded from the remote, but whose packs we might not have.
>
> There's only one caller left for this function, and ideally we'd drop it
> completely and just inline it there. But that would require exposing
> other internals from packfile.[ch], like alloc_packed_git() and
> check_packed_git_idx().
>
> So let's leave it be for now, and just warn people that it's probably
> not what they're looking for. Perhaps in the long run if we eventually
> drop dumb-http support, we can remove the function entirely then.

That seems like a good step and the right place to stop rather than
exposing more internals from packfile.c. Nicely done.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 10/11] packfile: use object_id in find_pack_entry_one()
  2024-10-25  7:06           ` [PATCH 10/11] packfile: use object_id in find_pack_entry_one() Jeff King
@ 2024-10-25 21:33             ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-25 21:33 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Fri, Oct 25, 2024 at 03:06:06AM -0400, Jeff King wrote:
> The one exception is get_delta_base() in packfile.c, when we are chasing
> a REF_DELTA from inside the pack (and thus we have a pointer directly to
> the mmap'd pack memory, not a struct). We can just bump the hashcpy()
> from inside find_pack_entry_one() to this one caller that needs it.

Makes sense.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 0fc0680b40..0800714267 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1556,7 +1556,7 @@ static int want_object_in_pack_one(struct packed_git *p,
>  	if (p == *found_pack)
>  		offset = *found_offset;
>  	else
> -		offset = find_pack_entry_one(oid->hash, p);
> +		offset = find_pack_entry_one(oid, p);

This and all of the similar changes that follow it are trivially correct
and pleasing to read.

> diff --git a/packfile.c b/packfile.c
> index c51eab15a5..005ca670b4 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1239,7 +1239,9 @@ off_t get_delta_base(struct packed_git *p,
>  		*curpos += used;
>  	} else if (type == OBJ_REF_DELTA) {
>  		/* The base entry _must_ be in the same pack */
> -		base_offset = find_pack_entry_one(base_info, p);
> +		struct object_id oid;
> +		hashcpy(oid.hash, base_info, the_repository->hash_algo);
> +		base_offset = find_pack_entry_one(&oid, p);

Here's the one that needed to turn its bare hash into an object_id that
it can pass a pointer to. And...

> @@ -1971,20 +1973,18 @@ off_t nth_packed_object_offset(const struct packed_git *p, uint32_t n)
>  	}
>  }
>
> -off_t find_pack_entry_one(const unsigned char *sha1,
> -				  struct packed_git *p)
> +off_t find_pack_entry_one(const struct object_id *oid,
> +			  struct packed_git *p)
>  {
>  	const unsigned char *index = p->index_data;
> -	struct object_id oid;
>  	uint32_t result;
>
>  	if (!index) {
>  		if (open_pack_index(p))
>  			return 0;
>  	}
>
> -	hashcpy(oid.hash, sha1, the_repository->hash_algo);

...here's the original location of that hashcpy() that moved to its only
useful caller in get_delta_base(). Makes sense.

The remaining conversions are trivial.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/11] dumb-http pack index v1 regression + cleanups
  2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
                             ` (10 preceding siblings ...)
  2024-10-25  7:08           ` [PATCH 11/11] packfile: use oidread() instead of hashcpy() to fill object_id Jeff King
@ 2024-10-25 21:35           ` Taylor Blau
  11 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-25 21:35 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Fri, Oct 25, 2024 at 02:41:48AM -0400, Jeff King wrote:
>  16 files changed, 153 insertions(+), 101 deletions(-)

I was glad that you worked on this, because I figured I might end up
~tricking you into~ getting you to take a closer look at some of the
other potential fixups in the dumb http code.

The series was a pleasant read, and it looks quite good to me. Let's
wait for some other reviewers to chime in before we start merging this
down, but otherwise I think this is good to go.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile
  2024-10-25 21:18             ` Taylor Blau
@ 2024-10-26  6:02               ` Jeff King
  2024-10-28  0:14                 ` Taylor Blau
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-10-26  6:02 UTC (permalink / raw)
  To: Taylor Blau; +Cc: fox, Eric Sunshine, git

On Fri, Oct 25, 2024 at 05:18:51PM -0400, Taylor Blau wrote:

> > -	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
> > -	tmp = strbuf_detach(&buf, NULL);
> > +	/*
> > +	 * Don't put this into packs/, since it's just temporary and we don't
> > +	 * want to confuse it with our local .idx files.  We'll generate our
> > +	 * own index if we choose to download the matching packfile.
> 
> I actually think putting it in $GIT_DIR/objects/pack is fine and perhaps
> preferable, since that's where we put existing in-progress temporary
> files. We'll ignore anything that doesn't end with "*.idx", so I think
> it would be fine.

But the new tempfile _does_ end with .idx. And it (kinda) has to,
because that's an assumption made by packed_git: it stores
path/to/foo.pack, and then assumes it can access /path/to/foo.idx as
needed.

I say "kinda" because as I mentioned elsewhere, the prior code cheats a
bit and assumes that we can access the idx file through the original
mmap we made, even if the ".idx" the packed_git mentions does not yet
exist. I _think_ that's pretty foolproof and that we'd never unmap or
close an idx (though we certainly do for packs themselves). But it
doesn't feel like a great thing to rely on.

> I don't feel strongly about it. It just feels a little weird to stick
> these temporary files in one place, and stick other (similar) temporary
> flies in another.

One difference is that these are pure tempfiles. We'll never "finalize"
them by renaming them into place, so there's no chance of having to do a
cross-directory link/rename. They just get created and then deleted[1].

Perhaps something like "objects/temp-remote-index/*.idx" would be a
more descriptive name (and certainly what I'd use if making a more full
caching system), but there are complications:

  1. We don't have good cleanup routines for directories. Would that
     just become a semi-permanent empty directory in a dumb-http repo?

  2. git-prune cleans up tmp_* in objects/ and objects/pack, but would
     have to learn to look there, too.

So just sticking them in objects/ seemed like the path of least
resistance to me.

-Peff

[1] They really could be opened anywhere, like /tmp. Which I was tempted
    to do, but since they're non-trivial in size it seemed like keeping
    them inside the repo directory was the best bet.

    Curiously, if we lean into the "we will never close a mapped idx",
    then we could create them, open and map them, and then delete them
    immediately, which is a somewhat common tempfile idiom. At least on
    Unix systems. I'm not sure how Windows would like that. :)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile
  2024-10-26  6:02               ` Jeff King
@ 2024-10-28  0:14                 ` Taylor Blau
  0 siblings, 0 replies; 31+ messages in thread
From: Taylor Blau @ 2024-10-28  0:14 UTC (permalink / raw)
  To: Jeff King; +Cc: fox, Eric Sunshine, git

On Sat, Oct 26, 2024 at 02:02:38AM -0400, Jeff King wrote:
> So just sticking them in objects/ seemed like the path of least
> resistance to me.

OK, reading it all back, I tend to agree that sticking them in objects/
makes sense. Thanks!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-10-28  0:14 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-19 23:22 Bug report: v2.47.0 cannot fetch version 1 pack indexes fox
2024-10-20  0:37 ` Eric Sunshine
2024-10-21 20:06   ` Taylor Blau
2024-10-20  1:24 ` Jeff King
2024-10-20  2:40   ` Jeff King
2024-10-21 20:33     ` Taylor Blau
2024-10-22  5:14       ` Jeff King
2024-10-22 15:18         ` Taylor Blau
2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
2024-10-25  6:43           ` [PATCH 01/11] midx: avoid duplicate packed_git entries Jeff King
2024-10-25 21:09             ` Taylor Blau
2024-10-25  6:44           ` [PATCH 02/11] t5550: count fetches in "previously-fetched .idx" test Jeff King
2024-10-25  6:58           ` [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile Jeff King
2024-10-25 21:18             ` Taylor Blau
2024-10-26  6:02               ` Jeff King
2024-10-28  0:14                 ` Taylor Blau
2024-10-25  7:00           ` [PATCH 04/11] packfile: drop has_pack_index() Jeff King
2024-10-25 21:27             ` Taylor Blau
2024-10-25  7:00           ` [PATCH 05/11] packfile: drop sha1_pack_name() Jeff King
2024-10-25  7:01           ` [PATCH 06/11] packfile: drop sha1_pack_index_name() Jeff King
2024-10-25  7:02           ` [PATCH 07/11] packfile: warn people away from parse_packed_git() Jeff King
2024-10-25 21:28             ` Taylor Blau
2024-10-25  7:03           ` [PATCH 08/11] http-walker: use object_id instead of bare hash Jeff King
2024-10-25  7:05           ` [PATCH 09/11] packfile: convert find_sha1_pack() to use object_id Jeff King
2024-10-25  7:06           ` [PATCH 10/11] packfile: use object_id in find_pack_entry_one() Jeff King
2024-10-25 21:33             ` Taylor Blau
2024-10-25  7:08           ` [PATCH 11/11] packfile: use oidread() instead of hashcpy() to fill object_id Jeff King
2024-10-25 21:35           ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Taylor Blau
2024-10-21 20:23   ` Bug report: v2.47.0 cannot fetch version 1 pack indexes Taylor Blau
2024-10-22  5:00     ` Jeff King
2024-10-22 15:50       ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).