Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH 0/7] fix segfaults with implicit-bool config
From: Patrick Steinhardt @ 2023-12-12  4:10 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Carlos Andrés Ramírez Cataño
In-Reply-To: <20231212005228.GB376323@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 1338 bytes --]

On Mon, Dec 11, 2023 at 07:52:28PM -0500, Jeff King wrote:
> On Thu, Dec 07, 2023 at 09:14:36AM +0100, Patrick Steinhardt wrote:
> 
> > Thanks for working on this topic! I've left a couple of comments, most
> > of which are about whether we should retain previous behaviour where we
> > generate a warning instead of raising an error for unknown values.
> 
> Thanks for taking a look. I see what you're saying about the warnings,
> but IMHO it's not worth the extra complexity. Returning early means the
> existing code can proceed without worrying about NULLs. Though I suppose
> we could have a "warn_error_nonbool()" which issues a warning and
> returns 0.
> 
> Still, the "return config_error_nonbool()" pattern is pretty
> well-established and used for most options. I would go so far as to say
> the ones that warn for invalid values are the odd ones out, and probably
> should be returning errors as well (though doing so now may not be worth
> the trouble and risk of annoyance).
> 
> And certainly there should be no regressions in this series; every case
> is currently a segfault, so returning an error is a strict improvement.
> I'd just as soon stay strict there, as it's easier to loosen later if we
> choose than to tighten.

Fair enough, I'm perfectly fine with this reasoning. Thanks!

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v2 0/7] clone: fix init of refdb with wrong object format
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git
In-Reply-To: <cover.1701863960.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 8774 bytes --]

Hi,

this is the second version of my patch series to make git-clone(1)
initialize the reference database with the correct hash format. This is
a preparatory step for the reftable backend, which encodes the hash
format on-disk and thus requires us to get the hash format right at the
time of initialization.

Changes compared to v1:

  - Patch 1: Extend the comment explaining why we always create the
    "refs/" directory.

  - Patch 3: Explain better why the patch is required in the first
    place.

  - Patch 4: Fix a typo.

  - Patch 7: Adapt a failing test to now assert the new behaviour.

Thanks for your reviews!

Patrick

Patrick Steinhardt (7):
  setup: extract function to create the refdb
  setup: allow skipping creation of the refdb
  remote-curl: rediscover repository when fetching refs
  builtin/clone: fix bundle URIs with mismatching object formats
  builtin/clone: set up sparse checkout later
  builtin/clone: skip reading HEAD when retrieving remote
  builtin/clone: create the refdb with the correct object format

 builtin/clone.c             |  65 ++++++++++----------
 remote-curl.c               |   7 ++-
 remote.c                    |  26 ++++----
 remote.h                    |   1 +
 setup.c                     | 114 ++++++++++++++++++++++--------------
 setup.h                     |   6 +-
 t/t5550-http-fetch-dumb.sh  |   4 +-
 t/t5558-clone-bundle-uri.sh |  18 ++++++
 8 files changed, 150 insertions(+), 91 deletions(-)

Range-diff against v1:
1:  b69c57d272 ! 1:  2f34daa082 setup: extract function to create the refdb
    @@ Commit message
         calling `init_db()`. Extract the logic into a standalone function so
         that it becomes easier to do this refactoring.
     
    +    While at it, expand the comment that explains why we always create the
    +    "refs/" directory.
    +
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## setup.c ##
    @@ setup.c: void initialize_repository_version(int hash_algo, int reinit)
     +	int reinit = is_reinit();
     +
     +	/*
    -+	 * We need to create a "refs" dir in any case so that older
    -+	 * versions of git can tell that this is a repository.
    ++	 * We need to create a "refs" dir in any case so that older versions of
    ++	 * Git can tell that this is a repository. This serves two main purposes:
    ++	 *
    ++	 * - Clients will know to stop walking the parent-directory chain when
    ++	 *   detecting the Git repository. Otherwise they may end up detecting
    ++	 *   a Git repository in a parent directory instead.
    ++	 *
    ++	 * - Instead of failing to detect a repository with unknown reference
    ++	 *   format altogether, old clients will print an error saying that
    ++	 *   they do not understand the reference format extension.
     +	 */
     +	safe_create_dir(git_path("refs"), 1);
     +	adjust_shared_perm(git_path("refs"));
2:  090c52423e = 2:  40005ac1f1 setup: allow skipping creation of the refdb
3:  a1b86a0cbb ! 3:  374a1c514b remote-curl: rediscover repository when fetching refs
    @@ Metadata
      ## Commit message ##
         remote-curl: rediscover repository when fetching refs
     
    -    We're about to change git-clone(1) so that we set up the reference
    -    database at a later point. This change will cause git-remote-curl(1) to
    -    not detect the repository anymore due to "HEAD" not having been created
    -    yet at the time it spawns, and thus causes it to error out once it is
    -    asked to fetch the references.
    +    The reftable format encodes the hash function used by the repository
    +    inside of its tables. The reftable backend thus needs to be initialized
    +    with the correct hash function right from the start, or otherwise we may
    +    end up writing tables with the wrong hash function. But git-clone(1)
    +    initializes the reference database before learning about the hash
    +    function used by the remote repository, which has never been a problem
    +    with the reffiles backend.
    +
    +    To fix this, we'll have to change git-clone(1) to be more careful and
    +    only create the reference backend once it learned about the remote hash
    +    function. This creates a problem for git-remote-curl(1), which will then
    +    be spawned at a time where the repository is not yet fully-initialized.
    +    Consequentially, git-remote-curl(1) will fail to detect the repository,
    +    which eventually causes it to error out once it is asked to fetch remote
    +    objects.
     
         We can address this issue by trying to re-discover the Git repository in
         case none was detected at startup time. With this change, the clone will
    @@ Commit message
              repository
     
           4. git-clone(1) creates the reference database as it has now learned
    -         about the object format.
    +         about the hash function.
     
           5. git-clone(1) asks git-remote-curl(1) to fetch the remote packfile.
              The latter notices that it doesn't have a repository available, but
4:  c7a9d6ef74 ! 4:  3bef564b57 builtin/clone: fix bundle URIs with mismatching object formats
    @@ Commit message
         is indeed not necessarily the case for the hash algorithm right now. So
         in practice it is the right thing to detect the remote's object format
         before downloading bundle URIs anyway, and not doing so causes clones
    -    with bundle URIS to fail when the local default object format does not
    +    with bundle URIs to fail when the local default object format does not
         match the remote repository's format.
     
         Unfortunately though, this creates a new issue: downloading bundles may
5:  703ff77eea = 5:  917f15055f builtin/clone: set up sparse checkout later
6:  6c919fb19c = 6:  f3485a2708 builtin/clone: skip reading HEAD when retrieving remote
7:  eb5530e6a8 ! 7:  f062b11550 builtin/clone: create the refdb with the correct object format
    @@ Commit message
         upcoming reftable backend when cloning repositories with the SHA256
         object format.
     
    +    This change breaks a test in "t5550-http-fetch-dumb.sh" when cloning an
    +    empty repository with `GIT_TEST_DEFAULT_HASH=sha256`. The test expects
    +    the resulting hash format of the empty cloned repository to match the
    +    default hash, but now we always end up with a sha1 repository. The
    +    problem is that for dumb HTTP fetches, we have no easy way to figure out
    +    the remote's hash function except for deriving it based on the hash
    +    length of refs in `info/refs`. But as the remote repository is empty we
    +    cannot rely on this detection mechanism.
    +
    +    Before the change in this commit we already initialized the repository
    +    with the default hash function and then left it as-is. With this patch
    +    we always use the hash function detected via the remote, where we fall
    +    back to "sha1" in case we cannot detect it.
    +
    +    Neither the old nor the new behaviour are correct as we second-guess the
    +    remote hash function in both cases. But given that this is a rather
    +    unlikely edge case (we use the dumb HTTP protocol, the remote repository
    +    uses SHA256 and the remote repository is empty), let's simply adapt the
    +    test to assert the new behaviour. If we want to properly address this
    +    edge case in the future we will have to extend the dumb HTTP protocol so
    +    that we can properly detect the hash function for empty repositories.
    +
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## builtin/clone.c ##
    @@ setup.h: int init_db(const char *git_dir, const char *real_git_dir,
      
      /*
       * NOTE NOTE NOTE!!
    +
    + ## t/t5550-http-fetch-dumb.sh ##
    +@@ t/t5550-http-fetch-dumb.sh: test_expect_success 'create empty remote repository' '
    + 	setup_post_update_server_info_hook "$HTTPD_DOCUMENT_ROOT_PATH/empty.git"
    + '
    + 
    +-test_expect_success 'empty dumb HTTP repository has default hash algorithm' '
    ++test_expect_success 'empty dumb HTTP repository falls back to SHA1' '
    + 	test_when_finished "rm -fr clone-empty" &&
    + 	git clone $HTTPD_URL/dumb/empty.git clone-empty &&
    + 	git -C clone-empty rev-parse --show-object-format >empty-format &&
    +-	test "$(cat empty-format)" = "$(test_oid algo)"
    ++	test "$(cat empty-format)" = sha1
    + '
    + 
    + setup_askpass_helper

base-commit: 1a87c842ece327d03d08096395969aca5e0a6996
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v2 1/7] setup: extract function to create the refdb
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git
In-Reply-To: <cover.1702361370.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4810 bytes --]

We're about to let callers skip creation of the reference database when
calling `init_db()`. Extract the logic into a standalone function so
that it becomes easier to do this refactoring.

While at it, expand the comment that explains why we always create the
"refs/" directory.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 setup.c | 103 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 65 insertions(+), 38 deletions(-)

diff --git a/setup.c b/setup.c
index fc592dc6dd..865cfe6743 100644
--- a/setup.c
+++ b/setup.c
@@ -1885,6 +1885,68 @@ void initialize_repository_version(int hash_algo, int reinit)
 		git_config_set_gently("extensions.objectformat", NULL);
 }
 
+static int is_reinit(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	char junk[2];
+	int ret;
+
+	git_path_buf(&buf, "HEAD");
+	ret = !access(buf.buf, R_OK) || readlink(buf.buf, junk, sizeof(junk) - 1) != -1;
+	strbuf_release(&buf);
+	return ret;
+}
+
+static void create_reference_database(const char *initial_branch, int quiet)
+{
+	struct strbuf err = STRBUF_INIT;
+	int reinit = is_reinit();
+
+	/*
+	 * We need to create a "refs" dir in any case so that older versions of
+	 * Git can tell that this is a repository. This serves two main purposes:
+	 *
+	 * - Clients will know to stop walking the parent-directory chain when
+	 *   detecting the Git repository. Otherwise they may end up detecting
+	 *   a Git repository in a parent directory instead.
+	 *
+	 * - Instead of failing to detect a repository with unknown reference
+	 *   format altogether, old clients will print an error saying that
+	 *   they do not understand the reference format extension.
+	 */
+	safe_create_dir(git_path("refs"), 1);
+	adjust_shared_perm(git_path("refs"));
+
+	if (refs_init_db(&err))
+		die("failed to set up refs db: %s", err.buf);
+
+	/*
+	 * Point the HEAD symref to the initial branch with if HEAD does
+	 * not yet exist.
+	 */
+	if (!reinit) {
+		char *ref;
+
+		if (!initial_branch)
+			initial_branch = git_default_branch_name(quiet);
+
+		ref = xstrfmt("refs/heads/%s", initial_branch);
+		if (check_refname_format(ref, 0) < 0)
+			die(_("invalid initial branch name: '%s'"),
+			    initial_branch);
+
+		if (create_symref("HEAD", ref, NULL) < 0)
+			exit(1);
+		free(ref);
+	}
+
+	if (reinit && initial_branch)
+		warning(_("re-init: ignored --initial-branch=%s"),
+			initial_branch);
+
+	strbuf_release(&err);
+}
+
 static int create_default_files(const char *template_path,
 				const char *original_git_dir,
 				const char *initial_branch,
@@ -1896,10 +1958,8 @@ static int create_default_files(const char *template_path,
 	struct stat st1;
 	struct strbuf buf = STRBUF_INIT;
 	char *path;
-	char junk[2];
 	int reinit;
 	int filemode;
-	struct strbuf err = STRBUF_INIT;
 	const char *init_template_dir = NULL;
 	const char *work_tree = get_git_work_tree();
 
@@ -1919,6 +1979,8 @@ static int create_default_files(const char *template_path,
 	reset_shared_repository();
 	git_config(git_default_config, NULL);
 
+	reinit = is_reinit();
+
 	/*
 	 * We must make sure command-line options continue to override any
 	 * values we might have just re-read from the config.
@@ -1962,39 +2024,7 @@ static int create_default_files(const char *template_path,
 		adjust_shared_perm(get_git_dir());
 	}
 
-	/*
-	 * We need to create a "refs" dir in any case so that older
-	 * versions of git can tell that this is a repository.
-	 */
-	safe_create_dir(git_path("refs"), 1);
-	adjust_shared_perm(git_path("refs"));
-
-	if (refs_init_db(&err))
-		die("failed to set up refs db: %s", err.buf);
-
-	/*
-	 * Point the HEAD symref to the initial branch with if HEAD does
-	 * not yet exist.
-	 */
-	path = git_path_buf(&buf, "HEAD");
-	reinit = (!access(path, R_OK)
-		  || readlink(path, junk, sizeof(junk)-1) != -1);
-	if (!reinit) {
-		char *ref;
-
-		if (!initial_branch)
-			initial_branch = git_default_branch_name(quiet);
-
-		ref = xstrfmt("refs/heads/%s", initial_branch);
-		if (check_refname_format(ref, 0) < 0)
-			die(_("invalid initial branch name: '%s'"),
-			    initial_branch);
-
-		if (create_symref("HEAD", ref, NULL) < 0)
-			exit(1);
-		free(ref);
-	}
-
+	create_reference_database(initial_branch, quiet);
 	initialize_repository_version(fmt->hash_algo, 0);
 
 	/* Check filemode trustability */
@@ -2158,9 +2188,6 @@ int init_db(const char *git_dir, const char *real_git_dir,
 				      prev_bare_repository,
 				      init_shared_repository,
 				      flags & INIT_DB_QUIET);
-	if (reinit && initial_branch)
-		warning(_("re-init: ignored --initial-branch=%s"),
-			initial_branch);
 
 	create_object_directory();
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 2/7] setup: allow skipping creation of the refdb
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git
In-Reply-To: <cover.1702361370.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2755 bytes --]

Allow callers to skip creation of the reference database via a new flag
`INIT_DB_SKIP_REFDB`, which is required for git-clone(1) so that we can
create it at a later point once the object format has been discovered
from the remote repository.

Note that we also uplift the call to `create_reference_database()` into
`init_db()`, which makes it easier to handle the new flag for us. This
changes the order in which we do initialization so that we now set up
the Git configuration before we create the reference database. In
practice this move should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 setup.c | 13 +++++--------
 setup.h |  5 +++--
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/setup.c b/setup.c
index 865cfe6743..d6a1c59b7b 100644
--- a/setup.c
+++ b/setup.c
@@ -1949,11 +1949,9 @@ static void create_reference_database(const char *initial_branch, int quiet)
 
 static int create_default_files(const char *template_path,
 				const char *original_git_dir,
-				const char *initial_branch,
 				const struct repository_format *fmt,
 				int prev_bare_repository,
-				int init_shared_repository,
-				int quiet)
+				int init_shared_repository)
 {
 	struct stat st1;
 	struct strbuf buf = STRBUF_INIT;
@@ -2024,7 +2022,6 @@ static int create_default_files(const char *template_path,
 		adjust_shared_perm(get_git_dir());
 	}
 
-	create_reference_database(initial_branch, quiet);
 	initialize_repository_version(fmt->hash_algo, 0);
 
 	/* Check filemode trustability */
@@ -2184,11 +2181,11 @@ int init_db(const char *git_dir, const char *real_git_dir,
 	validate_hash_algorithm(&repo_fmt, hash);
 
 	reinit = create_default_files(template_dir, original_git_dir,
-				      initial_branch, &repo_fmt,
-				      prev_bare_repository,
-				      init_shared_repository,
-				      flags & INIT_DB_QUIET);
+				      &repo_fmt, prev_bare_repository,
+				      init_shared_repository);
 
+	if (!(flags & INIT_DB_SKIP_REFDB))
+		create_reference_database(initial_branch, flags & INIT_DB_QUIET);
 	create_object_directory();
 
 	if (get_shared_repository()) {
diff --git a/setup.h b/setup.h
index b48cf1c43b..cbf538286b 100644
--- a/setup.h
+++ b/setup.h
@@ -169,8 +169,9 @@ int verify_repository_format(const struct repository_format *format,
  */
 void check_repository_format(struct repository_format *fmt);
 
-#define INIT_DB_QUIET 0x0001
-#define INIT_DB_EXIST_OK 0x0002
+#define INIT_DB_QUIET      (1 << 0)
+#define INIT_DB_EXIST_OK   (1 << 1)
+#define INIT_DB_SKIP_REFDB (1 << 2)
 
 int init_db(const char *git_dir, const char *real_git_dir,
 	    const char *template_dir, int hash_algo,
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 3/7] remote-curl: rediscover repository when fetching refs
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git
In-Reply-To: <cover.1702361370.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2586 bytes --]

The reftable format encodes the hash function used by the repository
inside of its tables. The reftable backend thus needs to be initialized
with the correct hash function right from the start, or otherwise we may
end up writing tables with the wrong hash function. But git-clone(1)
initializes the reference database before learning about the hash
function used by the remote repository, which has never been a problem
with the reffiles backend.

To fix this, we'll have to change git-clone(1) to be more careful and
only create the reference backend once it learned about the remote hash
function. This creates a problem for git-remote-curl(1), which will then
be spawned at a time where the repository is not yet fully-initialized.
Consequentially, git-remote-curl(1) will fail to detect the repository,
which eventually causes it to error out once it is asked to fetch remote
objects.

We can address this issue by trying to re-discover the Git repository in
case none was detected at startup time. With this change, the clone will
look as following:

  1. git-clone(1) sets up the initial repository, excluding the
     reference database.

  2. git-clone(1) spawns git-remote-curl(1), which will be unable to
     detect the repository due to a missing "HEAD".

  3. git-clone(1) asks git-remote-curl(1) to list remote references.
     This works just fine as this step does not require a local
     repository

  4. git-clone(1) creates the reference database as it has now learned
     about the hash function.

  5. git-clone(1) asks git-remote-curl(1) to fetch the remote packfile.
     The latter notices that it doesn't have a repository available, but
     it now knows to try and re-discover it.

If the re-discovery succeeds in the last step we can continue with the
clone.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 remote-curl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/remote-curl.c b/remote-curl.c
index ef05752ca5..fc29757b65 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -1564,8 +1564,11 @@ int cmd_main(int argc, const char **argv)
 		if (buf.len == 0)
 			break;
 		if (starts_with(buf.buf, "fetch ")) {
-			if (nongit)
-				die(_("remote-curl: fetch attempted without a local repo"));
+			if (nongit) {
+				setup_git_directory_gently(&nongit);
+				if (nongit)
+					die(_("remote-curl: fetch attempted without a local repo"));
+			}
 			parse_fetch(&buf);

 		} else if (!strcmp(buf.buf, "list") || starts_with(buf.buf, "list ")) {
-- 
2.43.GIT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 4/7] builtin/clone: fix bundle URIs with mismatching object formats
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git
In-Reply-To: <cover.1702361370.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 6668 bytes --]

We create the reference database in git-clone(1) quite early before
connecting to the remote repository. Given that we do not yet know about
the object format that the remote repository uses at that point in time
the consequence is that the refdb may be initialized with the wrong
object format.

This is not a problem in the context of the files backend as we do not
encode the object format anywhere, and furthermore the only reference
that we write between initializing the refdb and learning about the
object format is the "HEAD" symref. It will become a problem though once
we land the reftable backend, which indeed does require to know about
the proper object format at the time of creation. We thus need to
rearrange the logic in git-clone(1) so that we only initialize the refdb
once we have learned about the actual object format.

As a first step, move listing of remote references to happen earlier,
which also allow us to set up the hash algorithm of the repository
earlier now. While we aim to execute this logic as late as possible
until after most of the setup has happened already, detection of the
object format and thus later the setup of the reference database must
happen before any other logic that may spawn Git commands or otherwise
these Git commands may not recognize the repository as such.

The first Git step where we expect the repository to be fully initalized
is when we fetch bundles via bundle URIs. Funny enough, the comments
there also state that "the_repository must match the cloned repo", which
is indeed not necessarily the case for the hash algorithm right now. So
in practice it is the right thing to detect the remote's object format
before downloading bundle URIs anyway, and not doing so causes clones
with bundle URIs to fail when the local default object format does not
match the remote repository's format.

Unfortunately though, this creates a new issue: downloading bundles may
take a long time, so if we list refs beforehand they might've grown
stale meanwhile. It is not clear how to solve this issue except for a
second reference listing though after we have downloaded the bundles,
which may be an expensive thing to do.

Arguably though, it's preferable to have a staleness issue compared to
being unable to clone a repository altogether.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c             | 48 ++++++++++++++++++-------------------
 t/t5558-clone-bundle-uri.sh | 18 ++++++++++++++
 2 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index c6357af949..d188650881 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1266,6 +1266,26 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (transport->smart_options && !deepen && !filter_options.choice)
 		transport->smart_options->check_self_contained_and_connected = 1;

+	strvec_push(&transport_ls_refs_options.ref_prefixes, "HEAD");
+	refspec_ref_prefixes(&remote->fetch,
+			     &transport_ls_refs_options.ref_prefixes);
+	if (option_branch)
+		expand_ref_prefix(&transport_ls_refs_options.ref_prefixes,
+				  option_branch);
+	if (!option_no_tags)
+		strvec_push(&transport_ls_refs_options.ref_prefixes,
+			    "refs/tags/");
+
+	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
+
+	/*
+	 * Now that we know what algorithm the remote side is using, let's set
+	 * ours to the same thing.
+	 */
+	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
+	initialize_repository_version(hash_algo, 1);
+	repo_set_hash_algo(the_repository, hash_algo);
+
 	/*
 	 * Before fetching from the remote, download and install bundle
 	 * data from the --bundle-uri option.
@@ -1281,24 +1301,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 				bundle_uri);
 		else if (has_heuristic)
 			git_config_set_gently("fetch.bundleuri", bundle_uri);
-	}
-
-	strvec_push(&transport_ls_refs_options.ref_prefixes, "HEAD");
-	refspec_ref_prefixes(&remote->fetch,
-			     &transport_ls_refs_options.ref_prefixes);
-	if (option_branch)
-		expand_ref_prefix(&transport_ls_refs_options.ref_prefixes,
-				  option_branch);
-	if (!option_no_tags)
-		strvec_push(&transport_ls_refs_options.ref_prefixes,
-			    "refs/tags/");
-
-	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
-
-	if (refs)
-		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
-
-	if (!bundle_uri) {
+	} else {
 		/*
 		* Populate transport->got_remote_bundle_uri and
 		* transport->bundle_uri. We might get nothing.
@@ -1319,13 +1322,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}

-		/*
-		 * Now that we know what algorithm the remote side is using,
-		 * let's set ours to the same thing.
-		 */
-	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
-	initialize_repository_version(hash_algo, 1);
-	repo_set_hash_algo(the_repository, hash_algo);
+	if (refs)
+		mapped_refs = wanted_peer_refs(refs, &remote->fetch);

 	if (mapped_refs) {
 		/*
diff --git a/t/t5558-clone-bundle-uri.sh b/t/t5558-clone-bundle-uri.sh
index 996a08e90c..1ca5f745e7 100755
--- a/t/t5558-clone-bundle-uri.sh
+++ b/t/t5558-clone-bundle-uri.sh
@@ -33,6 +33,15 @@ test_expect_success 'clone with path bundle' '
 	test_cmp expect actual
 '

+test_expect_success 'clone with path bundle and non-default hash' '
+	test_when_finished "rm -rf clone-path-non-default-hash" &&
+	GIT_DEFAULT_HASH=sha256 git clone --bundle-uri="clone-from/B.bundle" \
+		clone-from clone-path-non-default-hash &&
+	git -C clone-path-non-default-hash rev-parse refs/bundles/topic >actual &&
+	git -C clone-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
 test_expect_success 'clone with file:// bundle' '
 	git clone --bundle-uri="file://$(pwd)/clone-from/B.bundle" \
 		clone-from clone-file &&
@@ -284,6 +293,15 @@ test_expect_success 'clone HTTP bundle' '
 	test_config -C clone-http log.excludedecoration refs/bundle/
 '

+test_expect_success 'clone HTTP bundle with non-default hash' '
+	test_when_finished "rm -rf clone-http-non-default-hash" &&
+	GIT_DEFAULT_HASH=sha256 git clone --bundle-uri="$HTTPD_URL/B.bundle" \
+		"$HTTPD_URL/smart/fetch.git" clone-http-non-default-hash &&
+	git -C clone-http-non-default-hash rev-parse refs/bundles/topic >actual &&
+	git -C clone-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
 test_expect_success 'clone bundle list (HTTP, no heuristic)' '
 	test_when_finished rm -f trace*.txt &&

-- 
2.43.GIT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 5/7] builtin/clone: set up sparse checkout later
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git
In-Reply-To: <cover.1702361370.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]

When asked to do a sparse checkout, then git-clone(1) will spawn
`git sparse-checkout set` to set up the configuration accordingly. This
requires a proper Git repository or otherwise the command will fail. But
as we are about to move creation of the reference database to a later
point, this prerequisite will not hold anymore.

Move the logic to a later point in time where we know to have created
the reference database already.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index d188650881..9c60923f31 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1185,9 +1185,6 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (option_required_reference.nr || option_optional_reference.nr)
 		setup_reference();
 
-	if (option_sparse_checkout && git_sparse_checkout_init(dir))
-		return 1;
-
 	remote = remote_get(remote_name);
 
 	refspec_appendf(&remote->fetch, "+%s*:%s*", src_ref_prefix,
@@ -1426,6 +1423,9 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		dissociate_from_references();
 	}
 
+	if (option_sparse_checkout && git_sparse_checkout_init(dir))
+		return 1;
+
 	junk_mode = JUNK_LEAVE_REPO;
 	err = checkout(submodule_progress, filter_submodules);
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 6/7] builtin/clone: skip reading HEAD when retrieving remote
From: Patrick Steinhardt @ 2023-12-12  7:01 UTC (permalink / raw)
  To: git
In-Reply-To: <cover.1702361370.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 5199 bytes --]

After we have set up the remote configuration in git-clone(1) we'll call
`remote_get()` to read the remote from the on-disk configuration. But
next to reading the on-disk configuration, `remote_get()` will also
cause us to try and read the repository's HEAD reference so that we can
figure out the current branch. Besides being pointless in git-clone(1)
because we're operating in an empty repository anyway, this will also
break once we move creation of the reference database to a later point
in time.

Refactor the code to introduce a new `remote_get_early()` function that
will skip reading the HEAD reference to address this issue.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c |  2 +-
 remote.c        | 26 ++++++++++++++++----------
 remote.h        |  1 +
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 9c60923f31..06966c5d4c 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1185,7 +1185,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (option_required_reference.nr || option_optional_reference.nr)
 		setup_reference();
 
-	remote = remote_get(remote_name);
+	remote = remote_get_early(remote_name);
 
 	refspec_appendf(&remote->fetch, "+%s*:%s*", src_ref_prefix,
 			branch_top.buf);
diff --git a/remote.c b/remote.c
index abb24822be..051d0a64a0 100644
--- a/remote.c
+++ b/remote.c
@@ -509,7 +509,7 @@ static void alias_all_urls(struct remote_state *remote_state)
 	}
 }
 
-static void read_config(struct repository *repo)
+static void read_config(struct repository *repo, int early)
 {
 	int flag;
 
@@ -518,7 +518,7 @@ static void read_config(struct repository *repo)
 	repo->remote_state->initialized = 1;
 
 	repo->remote_state->current_branch = NULL;
-	if (startup_info->have_repository) {
+	if (startup_info->have_repository && !early) {
 		const char *head_ref = refs_resolve_ref_unsafe(
 			get_main_ref_store(repo), "HEAD", 0, NULL, &flag);
 		if (head_ref && (flag & REF_ISSYMREF) &&
@@ -561,7 +561,7 @@ static const char *remotes_remote_for_branch(struct remote_state *remote_state,
 
 const char *remote_for_branch(struct branch *branch, int *explicit)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	return remotes_remote_for_branch(the_repository->remote_state, branch,
@@ -587,7 +587,7 @@ remotes_pushremote_for_branch(struct remote_state *remote_state,
 
 const char *pushremote_for_branch(struct branch *branch, int *explicit)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	return remotes_pushremote_for_branch(the_repository->remote_state,
@@ -599,7 +599,7 @@ static struct remote *remotes_remote_get(struct remote_state *remote_state,
 
 const char *remote_ref_for_branch(struct branch *branch, int for_push)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	if (branch) {
@@ -709,7 +709,13 @@ remotes_remote_get(struct remote_state *remote_state, const char *name)
 
 struct remote *remote_get(const char *name)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
+	return remotes_remote_get(the_repository->remote_state, name);
+}
+
+struct remote *remote_get_early(const char *name)
+{
+	read_config(the_repository, 1);
 	return remotes_remote_get(the_repository->remote_state, name);
 }
 
@@ -722,7 +728,7 @@ remotes_pushremote_get(struct remote_state *remote_state, const char *name)
 
 struct remote *pushremote_get(const char *name)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	return remotes_pushremote_get(the_repository->remote_state, name);
 }
 
@@ -738,7 +744,7 @@ int remote_is_configured(struct remote *remote, int in_repo)
 int for_each_remote(each_remote_fn fn, void *priv)
 {
 	int i, result = 0;
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	for (i = 0; i < the_repository->remote_state->remotes_nr && !result;
 	     i++) {
 		struct remote *remote =
@@ -1831,7 +1837,7 @@ struct branch *branch_get(const char *name)
 {
 	struct branch *ret;
 
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	if (!name || !*name || !strcmp(name, "HEAD"))
 		ret = the_repository->remote_state->current_branch;
 	else
@@ -1973,7 +1979,7 @@ static const char *branch_get_push_1(struct remote_state *remote_state,
 
 const char *branch_get_push(struct branch *branch, struct strbuf *err)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	if (!branch)
diff --git a/remote.h b/remote.h
index cdc8b1db42..79353ba226 100644
--- a/remote.h
+++ b/remote.h
@@ -118,6 +118,7 @@ struct remote {
  * and configuration.
  */
 struct remote *remote_get(const char *name);
+struct remote *remote_get_early(const char *name);
 
 struct remote *pushremote_get(const char *name);
 int remote_is_configured(struct remote *remote, int in_repo);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 7/7] builtin/clone: create the refdb with the correct object format
From: Patrick Steinhardt @ 2023-12-12  7:01 UTC (permalink / raw)
  To: git
In-Reply-To: <cover.1702361370.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 5283 bytes --]

We're currently creating the reference database with a potentially
incorrect object format when the remote repository's object format is
different from the local default object format. This works just fine for
now because the files backend never records the object format anywhere.
But this logic will fail with any new reference backend that encodes
this information in some form either on-disk or in-memory.

The preceding commits have reshuffled code in git-clone(1) so that there
is no code path that will access the reference database before we have
detected the remote's object format. With these refactorings we can now
defer initialization of the reference database until after we have
learned the remote's object format and thus initialize it with the
correct format from the get-go.

These refactorings are required to make git-clone(1) work with the
upcoming reftable backend when cloning repositories with the SHA256
object format.

This change breaks a test in "t5550-http-fetch-dumb.sh" when cloning an
empty repository with `GIT_TEST_DEFAULT_HASH=sha256`. The test expects
the resulting hash format of the empty cloned repository to match the
default hash, but now we always end up with a sha1 repository. The
problem is that for dumb HTTP fetches, we have no easy way to figure out
the remote's hash function except for deriving it based on the hash
length of refs in `info/refs`. But as the remote repository is empty we
cannot rely on this detection mechanism.

Before the change in this commit we already initialized the repository
with the default hash function and then left it as-is. With this patch
we always use the hash function detected via the remote, where we fall
back to "sha1" in case we cannot detect it.

Neither the old nor the new behaviour are correct as we second-guess the
remote hash function in both cases. But given that this is a rather
unlikely edge case (we use the dumb HTTP protocol, the remote repository
uses SHA256 and the remote repository is empty), let's simply adapt the
test to assert the new behaviour. If we want to properly address this
edge case in the future we will have to extend the dumb HTTP protocol so
that we can properly detect the hash function for empty repositories.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c            | 9 ++++++++-
 setup.c                    | 2 +-
 setup.h                    | 1 +
 t/t5550-http-fetch-dumb.sh | 4 ++--
 4 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 06966c5d4c..fd052b8b54 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1097,8 +1097,14 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}

+	/*
+	 * Initialize the repository, but skip initializing the reference
+	 * database. We do not yet know about the object format of the
+	 * repository, and reference backends may persist that information into
+	 * their on-disk data structures.
+	 */
 	init_db(git_dir, real_git_dir, option_template, GIT_HASH_UNKNOWN, NULL,
-		do_not_override_repo_unix_permissions, INIT_DB_QUIET);
+		do_not_override_repo_unix_permissions, INIT_DB_QUIET | INIT_DB_SKIP_REFDB);

 	if (real_git_dir) {
 		free((char *)git_dir);
@@ -1282,6 +1288,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
 	initialize_repository_version(hash_algo, 1);
 	repo_set_hash_algo(the_repository, hash_algo);
+	create_reference_database(NULL, 1);

 	/*
 	 * Before fetching from the remote, download and install bundle
diff --git a/setup.c b/setup.c
index d6a1c59b7b..155fe13f70 100644
--- a/setup.c
+++ b/setup.c
@@ -1897,7 +1897,7 @@ static int is_reinit(void)
 	return ret;
 }

-static void create_reference_database(const char *initial_branch, int quiet)
+void create_reference_database(const char *initial_branch, int quiet)
 {
 	struct strbuf err = STRBUF_INIT;
 	int reinit = is_reinit();
diff --git a/setup.h b/setup.h
index cbf538286b..3f0f17c351 100644
--- a/setup.h
+++ b/setup.h
@@ -178,6 +178,7 @@ int init_db(const char *git_dir, const char *real_git_dir,
 	    const char *initial_branch, int init_shared_repository,
 	    unsigned int flags);
 void initialize_repository_version(int hash_algo, int reinit);
+void create_reference_database(const char *initial_branch, int quiet);

 /*
  * NOTE NOTE NOTE!!
diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index e444b30bf6..4c3b32785d 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -66,11 +66,11 @@ test_expect_success 'create empty remote repository' '
 	setup_post_update_server_info_hook "$HTTPD_DOCUMENT_ROOT_PATH/empty.git"
 '

-test_expect_success 'empty dumb HTTP repository has default hash algorithm' '
+test_expect_success 'empty dumb HTTP repository falls back to SHA1' '
 	test_when_finished "rm -fr clone-empty" &&
 	git clone $HTTPD_URL/dumb/empty.git clone-empty &&
 	git -C clone-empty rev-parse --show-object-format >empty-format &&
-	test "$(cat empty-format)" = "$(test_oid algo)"
+	test "$(cat empty-format)" = sha1
 '

 setup_askpass_helper
-- 
2.43.GIT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* Re: [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
From: Jeff King @ 2023-12-12  7:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <6f5ff96998946f3f49da56fd05c096b949521339.1701198172.git.me@ttaylorr.com>

On Tue, Nov 28, 2023 at 02:07:59PM -0500, Taylor Blau wrote:

> +static void clear_bb_commit(struct bb_commit *commit)
> +{
> +	free(commit->reverse_edges);
> +	bitmap_free(commit->commit_mask);
> +	bitmap_free(commit->bitmap);
> +}

I think these bitmaps may sometimes be NULL. But double-checking
bitmap_free(), it sensibly is noop when passed NULL. So this look good
to me.

-Peff

^ permalink raw reply

* Re: [PATCH 03/24] pack-bitmap: plug leak in find_objects()
From: Jeff King @ 2023-12-12  7:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <bc38fba65547f5716a2ff9904dd615e755ea84bb.1701198172.git.me@ttaylorr.com>

On Tue, Nov 28, 2023 at 02:08:02PM -0500, Taylor Blau wrote:

> The `find_objects()` function creates an object_list for any tips of the
> reachability query which do not have corresponding bitmaps.
> 
> The object_list is not used outside of `find_objects()`, but we never
> free it with `object_list_free()`, resulting in a leak. Let's plug that
> leak by calling `object_list_free()`, which results in t6113 becoming
> leak-free.

Makes sense.

> @@ -1280,6 +1280,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
>  		base = fill_in_bitmap(bitmap_git, revs, base, seen);
>  	}
>  
> +	object_list_free(&not_mapped);
> +
>  	return base;
>  }

There's an extra return earlier in the function, but it triggers only
when not_mapped is NULL. So this covers all cases. Good.

> +++ b/t/t6113-rev-list-bitmap-filters.sh
> [..]
> +TEST_PASSES_SANITIZE_LEAK=true

Yay. :)

-Peff

^ permalink raw reply

* [PATCH v2 0/4] refs: improve handling of special refs
From: Patrick Steinhardt @ 2023-12-12  7:18 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Phillip Wood
In-Reply-To: <cover.1701243201.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4823 bytes --]

Hi,

this is the second version of my patch series that aims to improve our
handling of special refs. This patch series is in preparation for the
upcoming reftable backend.

Changes compared to v1:

  - Patch 1: Stop needlessly resetting `errno` before calling
    `strbuf_read_file()`. Also, clean up state we create in our test
    repos properly.

  - Patch 3: The discussion with Phillip made me reevaluate my claim
    that "rebase-apply/" and "rebase-merge/" contain special refs.
    Indeed they don't, all files in there seem to be exclusively
    read and written via the filesystem without ever going via the
    refdb. I've thus dropped them from the set of special refs.

  - Patch 4: The array of static refs is now static.

Thanks for your reviews!

Patrick

Patrick Steinhardt (4):
  wt-status: read HEAD and ORIG_HEAD via the refdb
  refs: propagate errno when reading special refs fails
  refs: complete list of special refs
  bisect: consistently write BISECT_EXPECTED_REV via the refdb

 bisect.c                    | 25 +++-------------
 builtin/bisect.c            |  8 ++---
 refs.c                      | 59 +++++++++++++++++++++++++++++++++++--
 t/t1403-show-ref.sh         | 10 +++++++
 t/t6030-bisect-porcelain.sh |  2 +-
 wt-status.c                 | 17 ++++++-----
 6 files changed, 82 insertions(+), 39 deletions(-)

Range-diff against v1:
1:  35b74eb972 = 1:  1db3eb3945 wt-status: read HEAD and ORIG_HEAD via the refdb
2:  691552a17e ! 2:  24032a62e5 refs: propagate errno when reading special refs fails
    @@ refs.c: static int refs_read_special_head(struct ref_store *ref_store,
      	strbuf_addf(&full_path, "%s/%s", ref_store->gitdir, refname);
      
     -	if (strbuf_read_file(&content, full_path.buf, 0) < 0)
    -+	errno = 0;
    -+
     +	if (strbuf_read_file(&content, full_path.buf, 0) < 0) {
     +		*failure_errno = errno;
      		goto done;
    @@ t/t1403-show-ref.sh: test_expect_success '--exists with directory fails with gen
     +'
     +
     +test_expect_success '--exists with existing special ref' '
    ++	test_when_finished "rm .git/FETCH_HEAD" &&
     +	git rev-parse HEAD >.git/FETCH_HEAD &&
     +	git show-ref --exists FETCH_HEAD
     +'
3:  0e38103114 ! 3:  3dd9089fd5 refs: complete list of special refs
    @@ refs.c: static int refs_read_special_head(struct ref_store *ref_store,
     +	 * - MERGE_HEAD may contain multiple object IDs when merging multiple
     +	 *   heads.
     +	 *
    -+	 * - "rebase-apply/" and "rebase-merge/" contain all of the state for
    -+	 *   rebases, where keeping it closely together feels sensible.
    -+	 *
     +	 * There are some exceptions that you might expect to see on this list
     +	 * but which are handled exclusively via the reference backend:
     +	 *
     +	 * - CHERRY_PICK_HEAD
    ++	 *
     +	 * - HEAD
    ++	 *
     +	 * - ORIG_HEAD
     +	 *
    ++	 * - "rebase-apply/" and "rebase-merge/" contain all of the state for
    ++	 *   rebases, including some reference-like files. These are
    ++	 *   exclusively read and written via the filesystem and never go
    ++	 *   through the refdb.
    ++	 *
     +	 * Writing or deleting references must consistently go either through
     +	 * the filesystem (special refs) or through the reference backend
     +	 * (normal ones).
     +	 */
    -+	const char * const special_refs[] = {
    ++	static const char * const special_refs[] = {
     +		"AUTO_MERGE",
     +		"BISECT_EXPECTED_REV",
     +		"FETCH_HEAD",
     +		"MERGE_AUTOSTASH",
     +		"MERGE_HEAD",
     +	};
    -+	int i;
    ++	size_t i;
     +
     +	for (i = 0; i < ARRAY_SIZE(special_refs); i++)
     +		if (!strcmp(refname, special_refs[i]))
     +			return 1;
     +
    -+	/*
    -+	 * git-rebase(1) stores its state in `rebase-apply/` or
    -+	 * `rebase-merge/`, including various reference-like bits.
    -+	 */
    -+	if (starts_with(refname, "rebase-apply/") ||
    -+	    starts_with(refname, "rebase-merge/"))
    -+		return 1;
    -+
     +	return 0;
     +}
     +
4:  c7ab26fb7e ! 4:  4a4447a3f5 bisect: consistently write BISECT_EXPECTED_REV via the refdb
    @@ refs.c: static int is_special_ref(const char *refname)
      	 * but which are handled exclusively via the reference backend:
      	 *
     +	 * - BISECT_EXPECTED_REV
    ++	 *
      	 * - CHERRY_PICK_HEAD
    + 	 *
      	 * - HEAD
    - 	 * - ORIG_HEAD
     @@ refs.c: static int is_special_ref(const char *refname)
      	 */
    - 	const char * const special_refs[] = {
    + 	static const char * const special_refs[] = {
      		"AUTO_MERGE",
     -		"BISECT_EXPECTED_REV",
      		"FETCH_HEAD",

base-commit: 1a87c842ece327d03d08096395969aca5e0a6996
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v2 1/4] wt-status: read HEAD and ORIG_HEAD via the refdb
From: Patrick Steinhardt @ 2023-12-12  7:18 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Phillip Wood
In-Reply-To: <cover.1702365291.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2585 bytes --]

We read both the HEAD and ORIG_HEAD references directly from the
filesystem in order to figure out whether we're currently splitting a
commit. If both of the following are true:

  - HEAD points to the same object as "rebase-merge/amend".

  - ORIG_HEAD points to the same object as "rebase-merge/orig-head".

Then we are currently splitting commits.

The current code only works by chance because we only have a single
reference backend implementation. Refactor it to instead read both refs
via the refdb layer so that we'll also be compatible with alternate
reference backends.

Note that we pass `RESOLVE_REF_NO_RECURSE` to `read_ref_full()`. This is
because we didn't resolve symrefs before either, and in practice none of
the refs in "rebase-merge/" would be symbolic. We thus don't want to
resolve symrefs with the new code either to retain the old behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 wt-status.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/wt-status.c b/wt-status.c
index 9f45bf6949..fe9e590b80 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -1295,26 +1295,27 @@ static char *read_line_from_git_path(const char *filename)
 static int split_commit_in_progress(struct wt_status *s)
 {
 	int split_in_progress = 0;
-	char *head, *orig_head, *rebase_amend, *rebase_orig_head;
+	struct object_id head_oid, orig_head_oid;
+	char *rebase_amend, *rebase_orig_head;
 
 	if ((!s->amend && !s->nowarn && !s->workdir_dirty) ||
 	    !s->branch || strcmp(s->branch, "HEAD"))
 		return 0;
 
-	head = read_line_from_git_path("HEAD");
-	orig_head = read_line_from_git_path("ORIG_HEAD");
+	if (read_ref_full("HEAD", RESOLVE_REF_NO_RECURSE, &head_oid, NULL) ||
+	    read_ref_full("ORIG_HEAD", RESOLVE_REF_NO_RECURSE, &orig_head_oid, NULL))
+		return 0;
+
 	rebase_amend = read_line_from_git_path("rebase-merge/amend");
 	rebase_orig_head = read_line_from_git_path("rebase-merge/orig-head");
 
-	if (!head || !orig_head || !rebase_amend || !rebase_orig_head)
+	if (!rebase_amend || !rebase_orig_head)
 		; /* fall through, no split in progress */
 	else if (!strcmp(rebase_amend, rebase_orig_head))
-		split_in_progress = !!strcmp(head, rebase_amend);
-	else if (strcmp(orig_head, rebase_orig_head))
+		split_in_progress = !!strcmp(oid_to_hex(&head_oid), rebase_amend);
+	else if (strcmp(oid_to_hex(&orig_head_oid), rebase_orig_head))
 		split_in_progress = 1;
 
-	free(head);
-	free(orig_head);
 	free(rebase_amend);
 	free(rebase_orig_head);
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 2/4] refs: propagate errno when reading special refs fails
From: Patrick Steinhardt @ 2023-12-12  7:18 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Phillip Wood
In-Reply-To: <cover.1702365291.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1787 bytes --]

Some refs in Git are more special than others due to reasons explained
in the next commit. These refs are read via `refs_read_special_head()`,
but this function doesn't behave the same as when we try to read a
normal ref. Most importantly, we do not propagate `failure_errno` in the
case where the reference does not exist, which is behaviour that we rely
on in many parts of Git.

Fix this bug by propagating errno when `strbuf_read_file()` fails.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 refs.c              |  4 +++-
 t/t1403-show-ref.sh | 10 ++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/refs.c b/refs.c
index fcae5dddc6..00e72a2abf 100644
--- a/refs.c
+++ b/refs.c
@@ -1806,8 +1806,10 @@ static int refs_read_special_head(struct ref_store *ref_store,
 	int result = -1;
 	strbuf_addf(&full_path, "%s/%s", ref_store->gitdir, refname);
 
-	if (strbuf_read_file(&content, full_path.buf, 0) < 0)
+	if (strbuf_read_file(&content, full_path.buf, 0) < 0) {
+		*failure_errno = errno;
 		goto done;
+	}
 
 	result = parse_loose_ref_contents(content.buf, oid, referent, type,
 					  failure_errno);
diff --git a/t/t1403-show-ref.sh b/t/t1403-show-ref.sh
index b50ae6fcf1..66e6e77fa9 100755
--- a/t/t1403-show-ref.sh
+++ b/t/t1403-show-ref.sh
@@ -266,4 +266,14 @@ test_expect_success '--exists with directory fails with generic error' '
 	test_cmp expect err
 '
 
+test_expect_success '--exists with non-existent special ref' '
+	test_expect_code 2 git show-ref --exists FETCH_HEAD
+'
+
+test_expect_success '--exists with existing special ref' '
+	test_when_finished "rm .git/FETCH_HEAD" &&
+	git rev-parse HEAD >.git/FETCH_HEAD &&
+	git show-ref --exists FETCH_HEAD
+'
+
 test_done
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 3/4] refs: complete list of special refs
From: Patrick Steinhardt @ 2023-12-12  7:18 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Phillip Wood
In-Reply-To: <cover.1702365291.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4042 bytes --]

We have some references that are more special than others. The reason
for them being special is that they either do not follow the usual
format of references, or that they are written to the filesystem
directly by the respective owning subsystem and thus circumvent the
reference backend.

This works perfectly fine right now because the reffiles backend will
know how to read those refs just fine. But with the prospect of gaining
a new reference backend implementation we need to be a lot more careful
here:

  - We need to make sure that we are consistent about how those refs are
    written. They must either always be written via the filesystem, or
    they must always be written via the reference backend. Any mixture
    will lead to inconsistent state.

  - We need to make sure that such special refs are always handled
    specially when reading them.

We're already mostly good with regard to the first item, except for
`BISECT_EXPECTED_REV` which will be addressed in a subsequent commit.
But the current list of special refs is missing a lot of refs that
really should be treated specially. Right now, we only treat
`FETCH_HEAD` and `MERGE_HEAD` specially here.

Introduce a new function `is_special_ref()` that contains all current
instances of special refs to fix the reading path.

Based-on-patch-by: Han-Wen Nienhuys <hanwenn@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 refs.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/refs.c b/refs.c
index 00e72a2abf..8fe34d51e4 100644
--- a/refs.c
+++ b/refs.c
@@ -1820,15 +1820,65 @@ static int refs_read_special_head(struct ref_store *ref_store,
 	return result;
 }
 
+static int is_special_ref(const char *refname)
+{
+	/*
+	 * Special references get written and read directly via the filesystem
+	 * by the subsystems that create them. Thus, they must not go through
+	 * the reference backend but must instead be read directly. It is
+	 * arguable whether this behaviour is sensible, or whether it's simply
+	 * a leaky abstraction enabled by us only having a single reference
+	 * backend implementation. But at least for a subset of references it
+	 * indeed does make sense to treat them specially:
+	 *
+	 * - FETCH_HEAD may contain multiple object IDs, and each one of them
+	 *   carries additional metadata like where it came from.
+	 *
+	 * - MERGE_HEAD may contain multiple object IDs when merging multiple
+	 *   heads.
+	 *
+	 * There are some exceptions that you might expect to see on this list
+	 * but which are handled exclusively via the reference backend:
+	 *
+	 * - CHERRY_PICK_HEAD
+	 *
+	 * - HEAD
+	 *
+	 * - ORIG_HEAD
+	 *
+	 * - "rebase-apply/" and "rebase-merge/" contain all of the state for
+	 *   rebases, including some reference-like files. These are
+	 *   exclusively read and written via the filesystem and never go
+	 *   through the refdb.
+	 *
+	 * Writing or deleting references must consistently go either through
+	 * the filesystem (special refs) or through the reference backend
+	 * (normal ones).
+	 */
+	static const char * const special_refs[] = {
+		"AUTO_MERGE",
+		"BISECT_EXPECTED_REV",
+		"FETCH_HEAD",
+		"MERGE_AUTOSTASH",
+		"MERGE_HEAD",
+	};
+	size_t i;
+
+	for (i = 0; i < ARRAY_SIZE(special_refs); i++)
+		if (!strcmp(refname, special_refs[i]))
+			return 1;
+
+	return 0;
+}
+
 int refs_read_raw_ref(struct ref_store *ref_store, const char *refname,
 		      struct object_id *oid, struct strbuf *referent,
 		      unsigned int *type, int *failure_errno)
 {
 	assert(failure_errno);
-	if (!strcmp(refname, "FETCH_HEAD") || !strcmp(refname, "MERGE_HEAD")) {
+	if (is_special_ref(refname))
 		return refs_read_special_head(ref_store, refname, oid, referent,
 					      type, failure_errno);
-	}
 
 	return ref_store->be->read_raw_ref(ref_store, refname, oid, referent,
 					   type, failure_errno);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 4/4] bisect: consistently write BISECT_EXPECTED_REV via the refdb
From: Patrick Steinhardt @ 2023-12-12  7:19 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Phillip Wood
In-Reply-To: <cover.1702365291.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 5649 bytes --]

We're inconsistently writing BISECT_EXPECTED_REV both via the filesystem
and via the refdb, which violates the newly established rules for how
special refs must be treated. This works alright in practice with the
reffiles reference backend, but will cause bugs once we gain additional
backends.

Fix this issue and consistently write BISECT_EXPECTED_REV via the refdb
so that it is no longer a special ref.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 bisect.c                    | 25 ++++---------------------
 builtin/bisect.c            |  8 ++------
 refs.c                      |  3 ++-
 t/t6030-bisect-porcelain.sh |  2 +-
 4 files changed, 9 insertions(+), 29 deletions(-)

diff --git a/bisect.c b/bisect.c
index 1be8e0a271..985b96ed13 100644
--- a/bisect.c
+++ b/bisect.c
@@ -471,7 +471,6 @@ static int read_bisect_refs(void)
 }
 
 static GIT_PATH_FUNC(git_path_bisect_names, "BISECT_NAMES")
-static GIT_PATH_FUNC(git_path_bisect_expected_rev, "BISECT_EXPECTED_REV")
 static GIT_PATH_FUNC(git_path_bisect_ancestors_ok, "BISECT_ANCESTORS_OK")
 static GIT_PATH_FUNC(git_path_bisect_run, "BISECT_RUN")
 static GIT_PATH_FUNC(git_path_bisect_start, "BISECT_START")
@@ -707,26 +706,10 @@ static enum bisect_error error_if_skipped_commits(struct commit_list *tried,
 
 static int is_expected_rev(const struct object_id *oid)
 {
-	const char *filename = git_path_bisect_expected_rev();
-	struct stat st;
-	struct strbuf str = STRBUF_INIT;
-	FILE *fp;
-	int res = 0;
-
-	if (stat(filename, &st) || !S_ISREG(st.st_mode))
+	struct object_id expected_oid;
+	if (read_ref("BISECT_EXPECTED_REV", &expected_oid))
 		return 0;
-
-	fp = fopen_or_warn(filename, "r");
-	if (!fp)
-		return 0;
-
-	if (strbuf_getline_lf(&str, fp) != EOF)
-		res = !strcmp(str.buf, oid_to_hex(oid));
-
-	strbuf_release(&str);
-	fclose(fp);
-
-	return res;
+	return oideq(oid, &expected_oid);
 }
 
 enum bisect_error bisect_checkout(const struct object_id *bisect_rev,
@@ -1185,10 +1168,10 @@ int bisect_clean_state(void)
 	struct string_list refs_for_removal = STRING_LIST_INIT_NODUP;
 	for_each_ref_in("refs/bisect", mark_for_removal, (void *) &refs_for_removal);
 	string_list_append(&refs_for_removal, xstrdup("BISECT_HEAD"));
+	string_list_append(&refs_for_removal, xstrdup("BISECT_EXPECTED_REV"));
 	result = delete_refs("bisect: remove", &refs_for_removal, REF_NO_DEREF);
 	refs_for_removal.strdup_strings = 1;
 	string_list_clear(&refs_for_removal, 0);
-	unlink_or_warn(git_path_bisect_expected_rev());
 	unlink_or_warn(git_path_bisect_ancestors_ok());
 	unlink_or_warn(git_path_bisect_log());
 	unlink_or_warn(git_path_bisect_names());
diff --git a/builtin/bisect.c b/builtin/bisect.c
index 35938b05fd..4e2c43daf5 100644
--- a/builtin/bisect.c
+++ b/builtin/bisect.c
@@ -17,7 +17,6 @@
 #include "revision.h"
 
 static GIT_PATH_FUNC(git_path_bisect_terms, "BISECT_TERMS")
-static GIT_PATH_FUNC(git_path_bisect_expected_rev, "BISECT_EXPECTED_REV")
 static GIT_PATH_FUNC(git_path_bisect_ancestors_ok, "BISECT_ANCESTORS_OK")
 static GIT_PATH_FUNC(git_path_bisect_start, "BISECT_START")
 static GIT_PATH_FUNC(git_path_bisect_log, "BISECT_LOG")
@@ -921,7 +920,6 @@ static enum bisect_error bisect_state(struct bisect_terms *terms, int argc,
 	const char *state;
 	int i, verify_expected = 1;
 	struct object_id oid, expected;
-	struct strbuf buf = STRBUF_INIT;
 	struct oid_array revs = OID_ARRAY_INIT;
 
 	if (!argc)
@@ -976,10 +974,8 @@ static enum bisect_error bisect_state(struct bisect_terms *terms, int argc,
 		oid_array_append(&revs, &commit->object.oid);
 	}
 
-	if (strbuf_read_file(&buf, git_path_bisect_expected_rev(), 0) < the_hash_algo->hexsz ||
-	    get_oid_hex(buf.buf, &expected) < 0)
+	if (read_ref("BISECT_EXPECTED_REV", &expected))
 		verify_expected = 0; /* Ignore invalid file contents */
-	strbuf_release(&buf);
 
 	for (i = 0; i < revs.nr; i++) {
 		if (bisect_write(state, oid_to_hex(&revs.oid[i]), terms, 0)) {
@@ -988,7 +984,7 @@ static enum bisect_error bisect_state(struct bisect_terms *terms, int argc,
 		}
 		if (verify_expected && !oideq(&revs.oid[i], &expected)) {
 			unlink_or_warn(git_path_bisect_ancestors_ok());
-			unlink_or_warn(git_path_bisect_expected_rev());
+			delete_ref(NULL, "BISECT_EXPECTED_REV", NULL, REF_NO_DEREF);
 			verify_expected = 0;
 		}
 	}
diff --git a/refs.c b/refs.c
index 8fe34d51e4..c76ce86bef 100644
--- a/refs.c
+++ b/refs.c
@@ -1840,6 +1840,8 @@ static int is_special_ref(const char *refname)
 	 * There are some exceptions that you might expect to see on this list
 	 * but which are handled exclusively via the reference backend:
 	 *
+	 * - BISECT_EXPECTED_REV
+	 *
 	 * - CHERRY_PICK_HEAD
 	 *
 	 * - HEAD
@@ -1857,7 +1859,6 @@ static int is_special_ref(const char *refname)
 	 */
 	static const char * const special_refs[] = {
 		"AUTO_MERGE",
-		"BISECT_EXPECTED_REV",
 		"FETCH_HEAD",
 		"MERGE_AUTOSTASH",
 		"MERGE_HEAD",
diff --git a/t/t6030-bisect-porcelain.sh b/t/t6030-bisect-porcelain.sh
index 2a5b7d8379..792c1504bc 100755
--- a/t/t6030-bisect-porcelain.sh
+++ b/t/t6030-bisect-porcelain.sh
@@ -1176,7 +1176,7 @@ test_expect_success 'git bisect reset cleans bisection state properly' '
 	git bisect bad $HASH4 &&
 	git bisect reset &&
 	test -z "$(git for-each-ref "refs/bisect/*")" &&
-	test_path_is_missing ".git/BISECT_EXPECTED_REV" &&
+	test_ref_missing BISECT_EXPECTED_REV &&
 	test_path_is_missing ".git/BISECT_ANCESTORS_OK" &&
 	test_path_is_missing ".git/BISECT_LOG" &&
 	test_path_is_missing ".git/BISECT_RUN" &&
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* Re: [PATCH 05/24] midx: implement `DISP` chunk
From: Jeff King @ 2023-12-12  8:03 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, Patrick Steinhardt
In-Reply-To: <ZXPRL5yCTQKr0k7e@nand.local>

On Fri, Dec 08, 2023 at 09:30:07PM -0500, Taylor Blau wrote:

> > I did wonder how expensive to recompute and validate the "distinct"
> > information (in other words, is it too expensive for the consumers
> > of an existing midx file to see which packs are distinct on demand
> > before they stream contents out of the underlying packs?), as the
> > way the packs are marked as distinct looked rather error prone (you
> > can very easily list packfiles with overlaps with "+" prefix and the
> > DISK chunk writer does not even notice that you lied to it).  As long
> > as "git fsck" catches when two packs that are marked as distinct share
> > an object, that is OK, but the arrangement did look rather brittle
> > to me.
> 
> It's likely too expensive to do on the reading side for every
> pack-objects operation or MIDX load.

This made me think a bit. Obviously we can check for disjointedness in
O(n log k), where n is the total number of objects and k is the number
of packs, by doing a k-merge of the sorted lists. But that's a
non-starter, because we may be serving a request that is much smaller
than all "n" objects (e.g., any small fetch, but also even clones when
the pack contains a bunch of irrelevant objects).

But we can relax our condition a bit. The packs only need to be disjoint
with respect to the set of objects that we are going to output (we'll
call that "m"). So at the very least, you can do O(mk) lookups (each one
itself "log n", of course). We know that the work is already going to
be linear in "m". In theory we want to generally keep "k" small, but
part of the point of using midx in this way is to let "k" grow a bit.
So that might turn out pretty bad in practice.

So let's take one more step back. One thing I didn't feel like I saw
either in this patch or the cover letter is exactly why we care about
disjointedness. IIRC, the main issue is making sure that for any object
X we send verbatim, it is either a non-delta or its delta base is
viable. And the two reasons I can think of for the base to be non-viable
are:

  1. We are not sending the base at all.

  2. The base is in another pack, and we are worried about creating a
     cycle (i.e., in pack A we have X as a delta against Y, and in pack
     B we have Y as a delta against X, and we send both deltas).

We already deal with (1) for the single-pack case by finding the base
object offset, converting it to a pack position, and then making sure
that position is also marked for verbatim reuse.

The interesting one is (2), which is new for midx (since a single pack
cannot have two copies of an object). But I'm not sure if it's possible.
The verbatim reuse code depends on using bitmaps in the first place. And
there is already a preference-order to the packs in the midx bitmaps.

That is, we'll choose all of the objects for pack A over any duplicates
in B, and duplicates from B over C, and so on. If we likewise try
verbatim reuse in that order, then we know that an object in pack A can
never have a base that is selected from pack B or C (because if they do
have duplicates, we'd have selected A's copy to put in the midx bitmap).
And likewise, a copy of an object in pack B will always have its base
either in A or B, but never in C.

So it kind of seems to me that this would "just work" if
try_partial_reuse() did something like for the midx case:

  - convert offset of base object into an object id hash using the pack
    revindex (similar to offset_to_pack_pos)

  - look up the object id in the midx to get a pack/offset combo

  - use the midx revindex to convert that into a bit position

  - check if that bit is marked for verbatim reuse

The assumption there is that in the second step, the midx has resolved
duplicates by putting all of pack A before pack B, and so on, as above.
It also assumes that we are trying verbatim reuse in the same order
(though a different order would not produce wrong results, it would only
result in less verbatim reuse).

All of which makes me think I'm missing some other case that is a
problem. While I wait for you to explain it, though, let's continue our
thought experiment for a moment.

If we assume that any cross-pack deltas are a problem, we could always
just skip them for verbatim reuse. That is, once we look up the object
id in the midx, we can see if it's in the same pack we're currently
processing. If not, we could punt and let the non-verbatim code paths
handle it as usual.

That still leaves the problem of realizing we skipped over a chunk of
the packfile (say we are pulling object X from pack B, and it is a delta
of Y, but we already decided to reuse Y from pack A). But the reuse code
already has to accommodate gaps. I think it happens naturally in
write_reused_pack_one(), where we feed the actual byte offsets into
record_reused_object(). You'd have to take some care not to go past gaps
in the blind copy that happens write_reused_pack_verbatim(). So you
might need to mark the first such gap somehow (if it's hard, I'd suggest
just skipping write_reused_pack_verbatim() entirely; it is a fairly
minor optimization compared to the rest of it).

And of course there's a bunch of hard work teaching all of those
functions about midx's and multiple packs in the first place, but you
already had to do all that in the latter part of your series, and it
would still be applicable.

OK, this is the part where you tell me what I'm missing. ;)

-Peff

^ permalink raw reply

* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
From: Jeff King @ 2023-12-12  8:12 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>

On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:

> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.

Right, it's definitely a tradeoff. So taking a really big step back,
there are a few optimizations all tied up in the verbatim reuse code:

  1. in some cases we get to dump whole swaths of the on-disk packfile
     to the output, covering many objects with a few memcpy() calls.
     (This is still O(n), of course, but it's fewer instructions per
     object).

  2. any other reused objects have only a small-ish amount of work to
     fix up ofs deltas, handle gaps, and so on. We get to skip adding
     them to the packing_list struct (this saves some CPU, but also a
     lot of memory)

  3. we skip the delta search for these reused objects. This is where
     your big CPU / output size tradeoff comes into play, I'd think.

So my question is: how much of what you're seeing is from (1) and (2),
and how much is from (3)? Because there are other ways to trigger (3),
such as lowering the window size. For example, if you try your same
packing example with --window=0, how do the CPU and output size compare
to the results of your series? (I'd also check peak memory usage).

-Peff

^ permalink raw reply

* [PATCH] tests: prefer host Git to verify chainlint self-checks
From: Patrick Steinhardt @ 2023-12-12 11:32 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine

[-- Attachment #1: Type: text/plain, Size: 2192 bytes --]

The "check-chainlint" target runs automatically when running tests and
performs self-checks to verify that the chainlinter itself produces the
expected output. Originally, the chainlinter was implemented via sed,
but the infrastructure has been rewritten in fb41727b7e (t: retire
unused chainlint.sed, 2022-09-01) to use a Perl script instead.

The rewrite caused some slight whitespace changes in the output that are
ultimately not of much importance. In order to be able to assert that
the actual chainlinter errors match our expectations we thus have to
ignore whitespace characters when diffing them. As the `-w` flag is not
in POSIX we try to use `git diff -w --no-index` before we fall back to
`diff -w -u`.

To accomodate for cases where the host system has no Git installation we
use the locally-compiled version of Git. This can result in problems
though when the Git project's repository is using extensions that the
locally-compiled version of Git doesn't understand. It will refuse to
run and thus cause the checks to fail.

Fix this issue by prefering the host's Git resolved via PATH. If it
doesn't exist, then we fall back to the locally-compiled Git version and
diff as before.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

I've started to dogfood the reftable backend on my local machine and
have converted many repositories to use the reftable backend. This
surfaced the described issue because the repository now sets up the
"extensions.refStorage" extension, and thus "check-chainlint" fails
depending on which versions of Git I'm trying to compile and test.

 t/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/t/Makefile b/t/Makefile
index 225aaf78ed..8b7f7aceaa 100644
--- a/t/Makefile
+++ b/t/Makefile
@@ -111,7 +111,9 @@ check-chainlint:
 	if test -f ../GIT-BUILD-OPTIONS; then \
 		. ../GIT-BUILD-OPTIONS; \
 	fi && \
-	if test -x ../git$$X; then \
+	if command -v git >/dev/null 2>&1; then \
+		DIFFW="git --no-pager diff -w --no-index"; \
+	elif test -x ../git$$X; then \
 		DIFFW="../git$$X --no-pager diff -w --no-index"; \
 	else \
 		DIFFW="diff -w -u"; \
-- 
2.43.GIT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* Test breakage with zlib-ng
From: Ondrej Pohorelsky @ 2023-12-12 14:16 UTC (permalink / raw)
  To: git

Hi everyone,

As some might have heard, there is a proposal for Fedora 40 to
transition from zlib to zlib-ng[0]. Because of this, there has been a
rebuild of all packages to ensure every package works under zlib-ng.

Git test suit has three breakages in t6300-for-each-ref.sh.
To be precise, it is:

not ok 35 - basic atom: refs/heads/main objectsize:disk
not ok 107 - basic atom: refs/tags/testtag objectsize:disk
not ok 108 - basic atom: refs/tags/testtag *objectsize:disk

All of these tests are atomic, and they compare the result against
$disklen. I discussed it with Tulio Magno Quites Machado Filho, who
ran the tests and is owner of the proposal.
It seems like the compression of zlib-ng is shaving/adding some bytes
to the actual output, which then fails the comparison.

Here is an example:

```
expecting success of 6300.35 'basic atom: refs/heads/main objectsize:disk':
git for-each-ref --format="%($format)" "$ref" >actual &&
sanitize_pgp <actual >actual.clean &&
test_cmp expected actual.clean

++ git for-each-ref '--format=%(objectsize:disk)' refs/heads/main
++ sanitize_pgp
++ perl -ne '
/^-----END PGP/ and $in_pgp = 0;
print unless $in_pgp;
/^-----BEGIN PGP/ and $in_pgp = 1;
'
++ command /usr/bin/perl -ne '
/^-----END PGP/ and $in_pgp = 0;
print unless $in_pgp;
/^-----BEGIN PGP/ and $in_pgp = 1;
'
++ test_cmp expected actual.clean
++ test 2 -ne 2
++ eval 'diff -u' '"$@"'
+++ diff -u expected actual.clean
--- expected 2023-12-06 21:06:07.808849497 +0000
+++ actual.clean 2023-12-06 21:06:07.812849541 +0000
@@ -1 +1 @@
-138
+137
error: last command exited with $?=1
not ok 35 - basic atom: refs/heads/main objectsize:disk
```

The whole build log can be found here[1].

I can easily patch these tests in Fedora to be compatible with zlib-ng
only by not comparing to $disklen, but a concrete value, however I
would like to have a universal solution that works with both zlib and
zlib-ng. So if anyone has an idea on how to do it, please let me know.
Thanks

[0]https://discussion.fedoraproject.org/t/f40-change-proposal-transitioning-to-zlib-ng-as-a-compatible-replacement-for-zlib-system-wide/95807
[1]https://download.copr.fedorainfracloud.org/results/tuliom/zlib-ng-compat-mpb/fedora-rawhide-x86_64/06729801-git/builder-live.log.gz

Cheers,
Ondřej Pohořelský

^ permalink raw reply

* Propose a change in open for passing in the file type.
From: Haritha D @ 2023-12-12 14:46 UTC (permalink / raw)
  To: git@vger.kernel.org

Hi Everyone,

Am working on porting git to z/OS. For reference, the pull request am working on https://github.com/git/git/pull/1537.

On z/OS there is a notion of file tag attributes. Files can be tagged as binary, ASCII, UTF8, EBCDIC, etc. z/OS uses these attributes to determine if auto-conversion is necessary. It was recommended in PR that we add logic directly to xopen . In order for me to do this in xopen , I have to pass an extra parameter to xopen that specifies the file type. 

Ex: 
xopen(output_file, O_CREAT | O_WRONLY | O_TRUNC, 0666);

To :
xopen(output_file, O_CREAT | O_WRONLY | O_TRUNC, 0666, BINARY);

BINARY: would be an enum value.

Would this be okay to do? Or are there other recommendations?

Best regards
Haritha

^ permalink raw reply

* Re: [PATCH] revision: parse integer arguments to --max-count, --skip, etc., more carefully
From: Junio C Hamano @ 2023-12-12 15:09 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Britton Kerin
In-Reply-To: <20231212013045.GE376323@coredump.intra.peff.net>

Jeff King <peff@peff.net> writes:

> This all looks pretty reasonable to me.
>
> I couldn't help but think, though, that surely we have some helpers for
> this already? But the closest seems to be git_parse_int(), which also
> allows unit factors. I'm not sure if allowing "-n 1k" would be a feature
> or a bug. ;)

The change in question is meant to be a pure fix to replace a careless
use of atoi().  I do not mind to see a separate patch to add such a
feature later on top.

> I guess "strtol_i()" maybe is that helper already, though I did not even
> know it existed. Looks like it goes back to 2007, and is seldom used.

I didn't know about it, either ;-) I used it only because there is
one use of it, among places that used atoi() I wanted to tighten.

> I wonder if there are more spots that could benefit.

"git grep -e 'atoi('" would give somebody motivated a decent set of
#microproject ideas, but many hits are not suited for strtol_i(),
which is designed to parse an integer at the end of a string.  Some
places use atoi() immediately followed by strspn() to skip over
digits, which means they are parsing an integer and want to continue
reading after the integer, which is incompatible with what
strtol_i() wants to do.  They need either a separate helper or an
updated strtol_i() that optionally allows you to parse the prefix
and report where the integer ended, e.g., something like:

#define strtol_i(s,b,r) strtol_i2((s), (b), (r), NULL)
static inline int strtol_i2(char const *s, int base, int *result, char **endp)
{
	long ul;
        char *dummy = NULL;

        if (!endp)
		endp = &dummy;
	errno = 0;
	ul = strtol(s, &endp, base);
	if (errno ||
	    /*
	     * if we are told to parse to the end of the string by
	     * passing NULL to endp, it is an error to have any
	     * remaining character after the digits.
	     */
            (dummy && *dummy) ||
	    endp == s || (int) ul != ul)
		return -1;
	*result = ul;
	return 0;
}

perhaps.

^ permalink raw reply

* Feature Request: Add confirm message when executed prune
From: Miguel de Dios Matias @ 2023-12-12 15:18 UTC (permalink / raw)
  To: git

Hi.

I accidentallly did a git push --prune --all in a git from the company
where I am working (and hope I will be working...well sorry for the
joke), because it is simple I mistook git pull with git push and I
deleted all branches of my jobmates.

I thing it might be a good idea to add by default a confirmation
message (can disabled by conf).

Or add a parameter such as rm --interactive for to avoid or minimice
(depends of coffe level) these kind mistakes.

Regards.

^ permalink raw reply

* Re: [PATCH v6] subtree: fix split processing with multiple subtrees present
From: Christian Couder @ 2023-12-12 16:06 UTC (permalink / raw)
  To: Zach FettersMoore; +Cc: Zach FettersMoore via GitGitGadget, git
In-Reply-To: <CAEWN6q3RTbVuMb0VyCYz196ZL+OGAAHbJLZ2-MnW1RVVabg7Mw@mail.gmail.com>

On Mon, Dec 11, 2023 at 4:39 PM Zach FettersMoore
<zach.fetters@apollographql.com> wrote:
>
> >>
> >> From: Zach FettersMoore <zach.fetters@apollographql.com>

> >> To see this in practice you can use the open source GitHub repo
> >> 'apollo-ios-dev' and do the following in order:
> >>
> >> -Make a changes to a file in 'apollo-ios' and 'apollo-ios-codegen'
> >> directories
> >> -Create a commit containing these changes
> >> -Do a split on apollo-ios-codegen
> >> - Do a fetch on the subtree repo
> >> - git fetch git@github.com:apollographql/apollo-ios-codegen.git
> >> - git subtree split --prefix=apollo-ios-codegen --squash --rejoin
>
> > Now I get the following without your patch at this step:
> >
> > $ git subtree split --prefix=apollo-ios-codegen --squash --rejoin
> > [...]/libexec/git-core/git-subtree: 318: Maximum function recursion
> > depth (1000) reached
> >
> > Line 318 in git-subtree.sh contains the following:
> >
> > missed=$(cache_miss "$@") || exit $?
> >
> > With your patch it seems to work:
> >
> > $ git subtree split --prefix=apollo-ios-codegen --squash --rejoin
> > Merge made by the 'ort' strategy.
> > e274aed3ba6d0659fb4cc014587cf31c1e8df7f4
>
> Looking into this some it looks like it could be a bash config
> difference? My machine always runs it all the way through vs
> failing for recursion depth. Although that would also be an issue
> which is solved by this fix.

I use Ubuntu where /bin/sh is dash so my current guess is that dash
might have a smaller recursion limit than bash.

I just found https://stackoverflow.com/questions/69493528/git-subtree-maximum-function-recursion-depth
which seems to agree.

I will try to test using bash soon.

> >> - Depending on the current state of the 'apollo-ios-dev' repo
> >> you may see the issue at this point if the last split was on
> >> apollo-ios
>
> > I guess I see it, but it seems a bit different for me than what you describe.
> >
> > Otherwise your patch looks good to me now.
>
> Yea I hadn't accounted for/realized that some folks may see a recursion
> depth error vs it just taking a long time like it does for me. Also what
> I was saying with the apollo-ios-dev repo is you may not need all the steps
> to see the issue, because its possible the state of the repo is already
> in a position to display the issue just by doing a split on
> apollo-ios-codegen.
>
> Great! Thanks again for all the feedback and guidance! Is there anything
> else I need to do to get this across the finish line and merged in?

Hopefully I will be able to confirm I see the same error as you with
bash soon, and it will be enough to get it merged.

^ permalink raw reply

* Re: [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
From: Taylor Blau @ 2023-12-12 16:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <20231212070406.GA1117953@coredump.intra.peff.net>

On Tue, Dec 12, 2023 at 02:04:06AM -0500, Jeff King wrote:
> On Tue, Nov 28, 2023 at 02:07:59PM -0500, Taylor Blau wrote:
>
> > +static void clear_bb_commit(struct bb_commit *commit)
> > +{
> > +	free(commit->reverse_edges);
> > +	bitmap_free(commit->commit_mask);
> > +	bitmap_free(commit->bitmap);
> > +}
>
> I think these bitmaps may sometimes be NULL. But double-checking
> bitmap_free(), it sensibly is noop when passed NULL. So this look good
> to me.

Yeah, bitamp_free() handles a NULL input correctly, so we can pass a
possibly-NULL `commit->commit_mask` or `commit->bitmap` argument and be
OK.

Thanks,
Taylor

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox