git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Jeff King <peff@peff.net>
Cc: fox <fox.gbr@townlong-yak.com>,
	Eric Sunshine <sunshine@sunshineco.com>,
	git@vger.kernel.org
Subject: Re: [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile
Date: Fri, 25 Oct 2024 17:18:51 -0400	[thread overview]
Message-ID: <ZxwLOxT17OhLX/Rd@nand.local> (raw)
In-Reply-To: <20241025065806.GC2110355@coredump.intra.peff.net>

On Fri, Oct 25, 2024 at 02:58:06AM -0400, Jeff King wrote:
> This patch fixes a regression in b1b8dfde69 (finalize_object_file():
> implement collision check, 2024-09-26) where fetching a v1 pack idx file
> over the dumb-http protocol would cause the fetch to fail.

Makes sense, and matches our discussion from elsewhere on the list
sometime last week.

> There are a few options to fix it:
>
>   - we could teach index-pack a command-line option to ignore only pack
>     idx collisions, and use it when the dumb-http code invokes
>     index-pack. This would be an awkward thing to expose users to and
>     would involve a lot of boilerplate to get the option down to the
>     collision code.
>
>   - we could delete the remote .idx file right before running
>     index-pack. It should be redundant at that point (since we've just
>     downloaded the matching pack). But it feels risky to delete
>     something from our own .git/objects based on what the other side has
>     said. I'm not entirely positive that a malicious server couldn't lie
>     about which pack-$hash.idx it has and get us to delete something
>     precious.
>
>   - we can stop co-mingling the downloaded idx files in our local
>     objects directory. This is a slightly bigger change but I think
>     fixes the root of the problem more directly.
>
> This patch implements the third option. The big design questions are:
> where do we store the downloaded files, and how do we manage their
> lifetimes?

Yep, the third option is definitely the most sensible here.

> I think the right solution here is probably some more complex cache
> management system: download the remote .idx files to their own storage
> directory, mark them as "seen" when we get their matching pack (to avoid
> re-downloading even if we repack), and then delete them when the
> server's objects/info/refs no longer mentions them.
>
> But since the dumb http protocol is so ancient and so inferior to the
> smart http protocol, I don't think it's worth spending a lot of time
> creating such a system. For this patch I'm just downloading the idx
> files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
> deleted when we exit (and due to the name, any we miss due to a crash,
> etc, should eventually be removed by "git gc" runs based on timestamps).
>
> That is slightly worse for one case: [...]

OK. I think the more robust version of the solution you identified here
makes sense and is probably the right thing to do if we wanted to spend
more time and effort on fixing this part of the dumb HTTP protocol.

But I'm willing to believe that there are so few users of the protocol
at this point that it's almost certainly not worth the effort. So I'm
supportive of the stopping point that you chose here.

> diff --git a/http.c b/http.c
> index d59e59f66b..9642cad2e3 100644
> --- a/http.c
> +++ b/http.c
> @@ -19,6 +19,7 @@
>  #include "string-list.h"
>  #include "object-file.h"
>  #include "object-store-ll.h"
> +#include "tempfile.h"
>
>  static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
>  static int trace_curl_data = 1;
> @@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
>  	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
>  	url = strbuf_detach(&buf, NULL);
>
> -	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
> -	tmp = strbuf_detach(&buf, NULL);
> +	/*
> +	 * Don't put this into packs/, since it's just temporary and we don't
> +	 * want to confuse it with our local .idx files.  We'll generate our
> +	 * own index if we choose to download the matching packfile.

I actually think putting it in $GIT_DIR/objects/pack is fine and perhaps
preferable, since that's where we put existing in-progress temporary
files. We'll ignore anything that doesn't end with "*.idx", so I think
it would be fine.

I don't feel strongly about it. It just feels a little weird to stick
these temporary files in one place, and stick other (similar) temporary
flies in another.

> diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
> index 3968b82260..706540ab74 100755
> --- a/t/t5550-http-fetch-dumb.sh
> +++ b/t/t5550-http-fetch-dumb.sh
> @@ -335,7 +335,7 @@ test_expect_success 'fetch can handle previously-fetched .idx files' '
>  	count_fetches 1 pack one.trace &&
>  	GIT_TRACE_CURL=$PWD/two.trace git --git-dir=clone_packed_branches.git \
>  		fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2 &&
> -	count_fetches 0 idx two.trace &&
> +	count_fetches 1 idx two.trace &&
>  	count_fetches 1 pack two.trace
>  '
>
> @@ -521,4 +521,14 @@ test_expect_success 'fetching via http alternates works' '
>  	git -c http.followredirects=true clone "$HTTPD_URL/dumb/alt-child.git"
>  '
>
> +test_expect_success 'dumb http can fetch index v1' '
> +	server=$HTTPD_DOCUMENT_ROOT_PATH/idx-v1.git &&
> +	git init --bare "$server" &&
> +	git -C "$server" --work-tree=. commit --allow-empty -m foo &&
> +	git -C "$server" -c pack.indexVersion=1 gc &&
> +
> +	git clone "$HTTPD_URL/dumb/idx-v1.git" &&
> +	git -C idx-v1 fsck
> +'

Very nicely done.

Thanks,
Taylor

  reply	other threads:[~2024-10-25 21:18 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-19 23:22 Bug report: v2.47.0 cannot fetch version 1 pack indexes fox
2024-10-20  0:37 ` Eric Sunshine
2024-10-21 20:06   ` Taylor Blau
2024-10-20  1:24 ` Jeff King
2024-10-20  2:40   ` Jeff King
2024-10-21 20:33     ` Taylor Blau
2024-10-22  5:14       ` Jeff King
2024-10-22 15:18         ` Taylor Blau
2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
2024-10-25  6:43           ` [PATCH 01/11] midx: avoid duplicate packed_git entries Jeff King
2024-10-25 21:09             ` Taylor Blau
2024-10-25  6:44           ` [PATCH 02/11] t5550: count fetches in "previously-fetched .idx" test Jeff King
2024-10-25  6:58           ` [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile Jeff King
2024-10-25 21:18             ` Taylor Blau [this message]
2024-10-26  6:02               ` Jeff King
2024-10-28  0:14                 ` Taylor Blau
2024-10-25  7:00           ` [PATCH 04/11] packfile: drop has_pack_index() Jeff King
2024-10-25 21:27             ` Taylor Blau
2024-10-25  7:00           ` [PATCH 05/11] packfile: drop sha1_pack_name() Jeff King
2024-10-25  7:01           ` [PATCH 06/11] packfile: drop sha1_pack_index_name() Jeff King
2024-10-25  7:02           ` [PATCH 07/11] packfile: warn people away from parse_packed_git() Jeff King
2024-10-25 21:28             ` Taylor Blau
2024-10-25  7:03           ` [PATCH 08/11] http-walker: use object_id instead of bare hash Jeff King
2024-10-25  7:05           ` [PATCH 09/11] packfile: convert find_sha1_pack() to use object_id Jeff King
2024-10-25  7:06           ` [PATCH 10/11] packfile: use object_id in find_pack_entry_one() Jeff King
2024-10-25 21:33             ` Taylor Blau
2024-10-25  7:08           ` [PATCH 11/11] packfile: use oidread() instead of hashcpy() to fill object_id Jeff King
2024-10-25 21:35           ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Taylor Blau
2024-10-21 20:23   ` Bug report: v2.47.0 cannot fetch version 1 pack indexes Taylor Blau
2024-10-22  5:00     ` Jeff King
2024-10-22 15:50       ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZxwLOxT17OhLX/Rd@nand.local \
    --to=me@ttaylorr.com \
    --cc=fox.gbr@townlong-yak.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).