Git development
 help / color / mirror / Atom feed
* [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
From: Karthik Nayak @ 2023-10-19 12:10 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps
In-Reply-To: <20231019121024.194317-1-karthik.188@gmail.com>

The `--missing` object option in rev-list currently works only with
missing blobs/trees. For missing commits the revision walker fails with
a fatal error.

Let's extend the functionality of `--missing` option to also support
commit objects. This is done by adding a new `MISSING` flag that the
revision walker sets whenever it encounters a missing tree/commit. The
revision walker will now continue the traversal and call `show_commit()`
even for missing commits. In rev-list we can then check for this flag
and call the existing code for parsing `--missing` objects.

A scenario where this option would be used is to find the boundary
objects between different object directories. Consider a repository with
a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
repository, using the `--missing=print` option while disabling the
alternate object directory allows us to find the boundary objects
between the main and alternate object directory.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c          |  6 +++
 list-objects.c              |  2 +-
 object.h                    |  4 +-
 revision.c                  | 11 ++++--
 revision.h                  |  2 +
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
 6 files changed, 93 insertions(+), 6 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 98542e8b3c..604ae01d0c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (revs->do_not_die_on_missing_objects &&
+	    commit->object.flags & MISSING) {
+		finish_object__ma(&commit->object);
+		return;
+	}
+
 	if (show_disk_usage)
 		total_disk_usage += get_object_disk_usage(&commit->object);
 
diff --git a/list-objects.c b/list-objects.c
index 47296dff2f..60c5533721 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -387,7 +387,7 @@ static void do_traverse(struct traversal_context *ctx)
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
 		 */
-		if (!ctx->revs->tree_objects)
+		if (!ctx->revs->tree_objects || commit->object.flags & MISSING)
 			; /* do not bother loading tree */
 		else if (repo_get_commit_tree(the_repository, commit)) {
 			struct tree *tree = repo_get_commit_tree(the_repository,
diff --git a/object.h b/object.h
index 114d45954d..b76830fce1 100644
--- a/object.h
+++ b/object.h
@@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);
 
 /*
  * object flag allocation:
- * revision.h:               0---------10         15             23------27
+ * revision.h:               0---------10         15             22------28
  * fetch-pack.c:             01    67
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -82,7 +82,7 @@ void object_array_init(struct object_array *array);
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  28
+#define FLAG_BITS  29
 
 #define TYPE_BITS 3
 
diff --git a/revision.c b/revision.c
index 219dc76716..8100806068 100644
--- a/revision.c
+++ b/revision.c
@@ -1110,7 +1110,7 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	struct commit_list *parent = commit->parents;
 	unsigned pass_flags;
 
-	if (commit->object.flags & ADDED)
+	if (commit->object.flags & (ADDED | MISSING))
 		return 0;
 	commit->object.flags |= ADDED;
 
@@ -1168,7 +1168,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
 		int gently = revs->ignore_missing_links ||
-			     revs->exclude_promisor_objects;
+			     revs->exclude_promisor_objects ||
+			     revs->do_not_die_on_missing_objects;
 		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
 			if (revs->exclude_promisor_objects &&
 			    is_promisor_object(&p->object.oid)) {
@@ -1176,7 +1177,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 					break;
 				continue;
 			}
-			return -1;
+
+			if (!revs->do_not_die_on_missing_objects)
+				return -1;
+			else
+				p->object.flags |= MISSING;
 		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, p);
diff --git a/revision.h b/revision.h
index c73c92ef40..d3c1ca0f4e 100644
--- a/revision.h
+++ b/revision.h
@@ -53,6 +53,8 @@
 #define ANCESTRY_PATH	(1u<<27)
 #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR | PULL_MERGE)
 
+#define MISSING (1u<<28)
+
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
 
diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
new file mode 100755
index 0000000000..40265a4f66
--- /dev/null
+++ b/t/t6022-rev-list-missing.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description='handling of missing objects in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We setup the repository with two commits, this way HEAD is always
+# available and we can hide commit 1.
+test_expect_success 'create repository and alternate directory' '
+	test_commit 1 &&
+	test_commit 2 &&
+	test_commit 3
+'
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	test_expect_success "rev-list --missing=error fails with missing object $obj" '
+		oid="$(git rev-parse $obj)" &&
+		path=".git/objects/$(test_oid_to_path $oid)" &&
+
+		mv "$path" "$path.hidden" &&
+		test_when_finished "mv $path.hidden $path" &&
+
+		test_must_fail git rev-list --missing=error --objects \
+			--no-object-names HEAD
+	'
+done
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	for action in "allow-any" "print"
+	do
+		test_expect_success "rev-list --missing=$action with missing $obj" '
+			oid="$(git rev-parse $obj)" &&
+			path=".git/objects/$(test_oid_to_path $oid)" &&
+
+			# Before the object is made missing, we use rev-list to
+			# get the expected oids.
+			git rev-list --objects --no-object-names \
+				HEAD ^$obj >expect.raw &&
+
+			# Blobs are shared by all commits, so evethough a commit/tree
+			# might be skipped, its blob must be accounted for.
+			if [ $obj != "HEAD:1.t" ]; then
+				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
+				echo $(git rev-parse HEAD:2.t) >>expect.raw
+			fi &&
+
+			mv "$path" "$path.hidden" &&
+			test_when_finished "mv $path.hidden $path" &&
+
+			git rev-list --missing=$action --objects --no-object-names \
+				HEAD >actual.raw &&
+
+			# When the action is to print, we should also add the missing
+			# oid to the expect list.
+			case $action in
+			allow-any)
+				;;
+			print)
+				grep ?$oid actual.raw &&
+				echo ?$oid >>expect.raw
+				;;
+			esac &&
+
+			sort actual.raw >actual &&
+			sort expect.raw >expect &&
+			test_cmp expect actual
+		'
+	done
+done
+
+test_done
-- 
2.42.0


^ permalink raw reply related

* [PATCH] typo: fix the typo 'neeed' into 'needed' in the comment under merge-ort.c
From: 王常新 @ 2023-10-19 12:24 UTC (permalink / raw)
  To: git; +Cc: Wangchangxin, foril

From: foril <1571825323@qq.com>

Signed-off-by: 王常新 (Wang Changxin) <foril@foril.space>
---
   typo: fix the typo 'neeed' into 'needed' in the comment under merge-o…

   the comments on line 2039 under merge-ort.c should be :
   this is needed if we have content merges of content merges rather than
   this is neeed if we have content merges of content merges

   fix the typo

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1592%2FforiLLL%2Fcomment_patch-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1592/foriLLL/comment_patch-v1
Pull-Request: https://github.com/git/git/pull/1592

merge-ort.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 7857ce9fbd1..aee6f7d8173 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2036,7 +2036,7 @@ static int handle_content_merge(struct merge_options *opt,
	 * the three blobs to merge on various sides of history.
	 *
	 * extra_marker_size is the amount to extend conflict markers in
-	 * ll_merge; this is neeed if we have content merges of content
+	 * ll_merge; this is needed if we have content merges of content
	 * merges, which happens for example with rename/rename(2to1) and
	 * rename/add conflicts.
	 */

base-commit: a9ecda2788e229afc9b611acaa26d0d9d4da53ed
-- 
gitgitgadget


^ permalink raw reply related

* [PATCH v2] merge-ort.c: fix typo 'neeed' to 'needed'
From: Wangchangxin via GitGitGadget @ 2023-10-19 12:40 UTC (permalink / raw)
  To: git; +Cc: Wangchangxin, foril
In-Reply-To: <pull.1592.git.git.1697378928693.gitgitgadget@gmail.com>

From: foril <1571825323@qq.com>

Signed-off-by: 王常新 (Wang Changxin) <foril@foril.space>
---
    typo: fix the typo 'neeed' into 'needed' in the comment under merge-o…
    
    the comments on line 2039 under merge-ort.c should be :
    this is needed if we have content merges of content merges rather than
    this is neeed if we have content merges of content merges
    
    fix the typo

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1592%2FforiLLL%2Fcomment_patch-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1592/foriLLL/comment_patch-v2
Pull-Request: https://github.com/git/git/pull/1592

Range-diff vs v1:

 1:  737eccd2811 ! 1:  3934ee6b684 typo: fix the typo 'neeed' into 'needed' in the comment under merge-ort.c
     @@ Metadata
      Author: foril <1571825323@qq.com>
      
       ## Commit message ##
     -    typo: fix the typo 'neeed' into 'needed' in the comment under merge-ort.c
     +    merge-ort.c: fix typo 'neeed' to 'needed'
      
          Signed-off-by: 王常新 (Wang Changxin) <foril@foril.space>
      


 merge-ort.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 7857ce9fbd1..aee6f7d8173 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2036,7 +2036,7 @@ static int handle_content_merge(struct merge_options *opt,
 	 * the three blobs to merge on various sides of history.
 	 *
 	 * extra_marker_size is the amount to extend conflict markers in
-	 * ll_merge; this is neeed if we have content merges of content
+	 * ll_merge; this is needed if we have content merges of content
 	 * merges, which happens for example with rename/rename(2to1) and
 	 * rename/add conflicts.
 	 */

base-commit: a9ecda2788e229afc9b611acaa26d0d9d4da53ed
-- 
gitgitgadget

^ permalink raw reply related

* Re: [PATCH] diagnose: require repository
From: Martin Ågren @ 2023-10-19 13:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: ks1322 ks1322, git, Victoria Dye
In-Reply-To: <xmqq5y39unvc.fsf@gitster.g>

On Sat, 14 Oct 2023 at 19:15, Junio C Hamano <gitster@pobox.com> wrote:
>
> Martin Ågren <martin.agren@gmail.com> writes:
>
> > Switch from the gentle setup to requiring a git directory. Without a git
> > repo, there isn't really much to diagnose.
> >
> > We could possibly do a best-effort collection of information about the
> > machine and then give up. That would roughly be today's behavior but
> > with a controlled exit rather than a segfault. However, the purpose of
> > this tool is largely to create a zip archive. Rather than creating an
> > empty zip file or no zip file at all, and having to explain that

Correcting myself: The zip archive would actually contain
`diagnostics.log` with some general info about the machine and Git
build.

> > behavior, it seems more helpful to bail out clearly and early with a
> > succinct error message.
>
> Without having thought things through, offhand I agree with your "no
> repository?  there is nothing worth tarring up then" assessment.
>
> Because "git bugreport --diag" unconditionally spawns "git
> diagnose", the former may also want to be extra careful, perhaps
> like the attached patch.

Good point. TBH, I had no idea about `git bugreport --diagnose`.

> +       if (!startup_info->have_repository && diagnose != DIAGNOSE_NONE) {
> +               warning(_("no repository--diagnostic output disabled"));
> +               diagnose = DIAGNOSE_NONE;
> +       }
> +

When the user explicitly provides that option, it seems unfortunate to
me to drop it. Yes, we'd warn, but `git bugreport` then pops a text
editor, so you would only see the warning after finishing up the report.
(Maybe. By the time you quit your editor, you might not consider
checking the terminal for warnings and such.)

So I'm inclined to instead just die if we see the option outside a repo.
If `diagnose` the command fundamentally requires a repo (as with my
patch) it seems surprising to me to not have `--diagnose` the option
behave the same.

Martin

^ permalink raw reply

* [PATCH 1/2] t5574: test porcelain output of atomic fetch
From: Jiang Xin @ 2023-10-19 14:34 UTC (permalink / raw)
  To: Git List, Junio C Hamano, Patrick Steinhardt; +Cc: Jiang Xin

From: Jiang Xin <zhiyou.jx@alibaba-inc.com>

The test case "fetch porcelain output" checks output of the fetch
command. The error output must be empty with the follow assertion:

    test_must_be_empty stderr

Refactor this test case to run it twice. The first time will be run
using non-atomic fetch and the other time will be run using atomic
fetch. We can see that the above assertion fails for atomic get, as
shown below:

    ok 5 - fetch porcelain output  # TODO known breakage vanished
    not ok 6 - fetch porcelain output (atomic) # TODO known breakage

The failed test case had an error message with only the error prompt but
no message body, as follows:

    'stderr' is not empty, it contains:
    error:

In a later commit, we will fix this issue.

Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
---
 t/t5574-fetch-output.sh | 96 ++++++++++++++++++++++++-----------------
 1 file changed, 57 insertions(+), 39 deletions(-)

diff --git a/t/t5574-fetch-output.sh b/t/t5574-fetch-output.sh
index 90e6dcb9a7..1397101629 100755
--- a/t/t5574-fetch-output.sh
+++ b/t/t5574-fetch-output.sh
@@ -61,9 +61,7 @@ test_expect_success 'fetch compact output' '
 	test_cmp expect actual
 '
 
-test_expect_success 'fetch porcelain output' '
-	test_when_finished "rm -rf porcelain" &&
-
+test_expect_success 'setup for fetch porcelain output' '
 	# Set up a bunch of references that we can use to demonstrate different
 	# kinds of flag symbols in the output format.
 	MAIN_OLD=$(git rev-parse HEAD) &&
@@ -74,15 +72,10 @@ test_expect_success 'fetch porcelain output' '
 	FORCE_UPDATED_OLD=$(git rev-parse HEAD) &&
 	git checkout main &&
 
-	# Clone and pre-seed the repositories. We fetch references into two
-	# namespaces so that we can test that rejected and force-updated
-	# references are reported properly.
-	refspecs="refs/heads/*:refs/unforced/* +refs/heads/*:refs/forced/*" &&
-	git clone . porcelain &&
-	git -C porcelain fetch origin $refspecs &&
+	# Backup to preseed.git
+	git clone --mirror . preseed.git &&
 
-	# Now that we have set up the client repositories we can change our
-	# local references.
+	# Continue changing our local references.
 	git branch new-branch &&
 	git branch -d deleted-branch &&
 	git checkout fast-forward &&
@@ -91,36 +84,61 @@ test_expect_success 'fetch porcelain output' '
 	git checkout force-updated &&
 	git reset --hard HEAD~ &&
 	test_commit --no-tag force-update-new &&
-	FORCE_UPDATED_NEW=$(git rev-parse HEAD) &&
-
-	cat >expect <<-EOF &&
-	- $MAIN_OLD $ZERO_OID refs/forced/deleted-branch
-	- $MAIN_OLD $ZERO_OID refs/unforced/deleted-branch
-	  $MAIN_OLD $FAST_FORWARD_NEW refs/unforced/fast-forward
-	! $FORCE_UPDATED_OLD $FORCE_UPDATED_NEW refs/unforced/force-updated
-	* $ZERO_OID $MAIN_OLD refs/unforced/new-branch
-	  $MAIN_OLD $FAST_FORWARD_NEW refs/forced/fast-forward
-	+ $FORCE_UPDATED_OLD $FORCE_UPDATED_NEW refs/forced/force-updated
-	* $ZERO_OID $MAIN_OLD refs/forced/new-branch
-	  $MAIN_OLD $FAST_FORWARD_NEW refs/remotes/origin/fast-forward
-	+ $FORCE_UPDATED_OLD $FORCE_UPDATED_NEW refs/remotes/origin/force-updated
-	* $ZERO_OID $MAIN_OLD refs/remotes/origin/new-branch
-	EOF
-
-	# Execute a dry-run fetch first. We do this to assert that the dry-run
-	# and non-dry-run fetches produces the same output. Execution of the
-	# fetch is expected to fail as we have a rejected reference update.
-	test_must_fail git -C porcelain fetch \
-		--porcelain --dry-run --prune origin $refspecs >actual &&
-	test_cmp expect actual &&
-
-	# And now we perform a non-dry-run fetch.
-	test_must_fail git -C porcelain fetch \
-		--porcelain --prune origin $refspecs >actual 2>stderr &&
-	test_cmp expect actual &&
-	test_must_be_empty stderr
+	FORCE_UPDATED_NEW=$(git rev-parse HEAD)
 '
 
+for opt in off on
+do
+	case $opt in
+	on)
+		opt=--atomic
+		;;
+	off)
+		opt=
+		;;
+	esac
+	test_expect_failure "fetch porcelain output ${opt:+(atomic)}" '
+		test_when_finished "rm -rf porcelain" &&
+
+		# Clone and pre-seed the repositories. We fetch references into two
+		# namespaces so that we can test that rejected and force-updated
+		# references are reported properly.
+		refspecs="refs/heads/*:refs/unforced/* +refs/heads/*:refs/forced/*" &&
+		git clone preseed.git porcelain &&
+		git -C porcelain fetch origin $opt $refspecs &&
+
+		cat >expect <<-EOF &&
+		- $MAIN_OLD $ZERO_OID refs/forced/deleted-branch
+		- $MAIN_OLD $ZERO_OID refs/unforced/deleted-branch
+		  $MAIN_OLD $FAST_FORWARD_NEW refs/unforced/fast-forward
+		! $FORCE_UPDATED_OLD $FORCE_UPDATED_NEW refs/unforced/force-updated
+		* $ZERO_OID $MAIN_OLD refs/unforced/new-branch
+		  $MAIN_OLD $FAST_FORWARD_NEW refs/forced/fast-forward
+		+ $FORCE_UPDATED_OLD $FORCE_UPDATED_NEW refs/forced/force-updated
+		* $ZERO_OID $MAIN_OLD refs/forced/new-branch
+		  $MAIN_OLD $FAST_FORWARD_NEW refs/remotes/origin/fast-forward
+		+ $FORCE_UPDATED_OLD $FORCE_UPDATED_NEW refs/remotes/origin/force-updated
+		* $ZERO_OID $MAIN_OLD refs/remotes/origin/new-branch
+		EOF
+
+		# Change the URL of the repository to fetch different references.
+		git -C porcelain remote set-url origin .. &&
+
+		# Execute a dry-run fetch first. We do this to assert that the dry-run
+		# and non-dry-run fetches produces the same output. Execution of the
+		# fetch is expected to fail as we have a rejected reference update.
+		test_must_fail git -C porcelain fetch $opt \
+			--porcelain --dry-run --prune origin $refspecs >actual &&
+		test_cmp expect actual &&
+
+		# And now we perform a non-dry-run fetch.
+		test_must_fail git -C porcelain fetch $opt \
+			--porcelain --prune origin $refspecs >actual 2>stderr &&
+		test_cmp expect actual &&
+		test_must_be_empty stderr
+	'
+done
+
 test_expect_success 'fetch porcelain with multiple remotes' '
 	test_when_finished "rm -rf porcelain" &&
 
-- 
2.42.0.411.g813d9a9188


^ permalink raw reply related

* [PATCH 2/2] fetch: no redundant error message for atomic fetch
From: Jiang Xin @ 2023-10-19 14:34 UTC (permalink / raw)
  To: Git List, Junio C Hamano, Patrick Steinhardt; +Cc: Jiang Xin
In-Reply-To: <38b0b22038399265407f7fc5f126f471dcc6f1a3.1697725898.git.zhiyou.jx@alibaba-inc.com>

From: Jiang Xin <zhiyou.jx@alibaba-inc.com>

If an error occurs during an atomic fetch, a redundant error message
will appear at the end of do_fetch(). It was introduced in b3a804663c
(fetch: make `--atomic` flag cover backfilling of tags, 2022-02-17).

Instead of displaying the error message unconditionally, the final error
output should follow the pattern in update-ref.c and files-backend.c as
follows:

    if (ref_transaction_abort(transaction, &error))
        error("abort: %s", error.buf);

This will fix the test case "fetch porcelain output (atomic)" in t5574.

Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
---
 builtin/fetch.c         | 4 +---
 t/t5574-fetch-output.sh | 2 +-
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index fd134ba74d..01a573cf8d 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1775,10 +1775,8 @@ static int do_fetch(struct transport *transport,
 	}
 
 cleanup:
-	if (retcode && transaction) {
-		ref_transaction_abort(transaction, &err);
+	if (retcode && transaction && ref_transaction_abort(transaction, &err))
 		error("%s", err.buf);
-	}
 
 	display_state_release(&display_state);
 	close_fetch_head(&fetch_head);
diff --git a/t/t5574-fetch-output.sh b/t/t5574-fetch-output.sh
index 1397101629..3c72fc693f 100755
--- a/t/t5574-fetch-output.sh
+++ b/t/t5574-fetch-output.sh
@@ -97,7 +97,7 @@ do
 		opt=
 		;;
 	esac
-	test_expect_failure "fetch porcelain output ${opt:+(atomic)}" '
+	test_expect_success "fetch porcelain output ${opt:+(atomic)}" '
 		test_when_finished "rm -rf porcelain" &&
 
 		# Clone and pre-seed the repositories. We fetch references into two
-- 
2.42.0.411.g813d9a9188


^ permalink raw reply related

* Re: [PATCH v3 05/10] bulk-checkin: extract abstract `bulk_checkin_source`
From: Taylor Blau @ 2023-10-19 15:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King,
	Patrick Steinhardt
In-Reply-To: <xmqqa5sfplvw.fsf@gitster.g>

On Wed, Oct 18, 2023 at 04:10:43PM -0700, Junio C Hamano wrote:
> Looks OK, even though I expected to see a bit more involved object
> orientation with something like
>
> 	struct bulk_checkin_source {
> 		off_t (*read)(struct bulk_checkin_source *, void *, size_t);
> 		off_t (*seek)(struct bulk_checkin_source *, off_t);
> 		union {
> 			struct {
> 				int fd;
> 				size_t size;
> 				const char *path;
> 			} from_fd;
> 			struct {
> 				...
> 			} incore;
> 		} data;
> 	};
>
> As there will only be two subclasses of this thing, it may not
> matter all that much right now, but it would be much nicer as your
> methods do not have to care about "switch (enum) { case FILE: ... }".

I want to be cautious of going too far in this direction. I anticipate
that "two" is probably the maximum number of kinds of sources we can
reasonably expect for the foreseeable future. If that changes, it's easy
enough to convert from the existing implementation to something closer
to the above.

Thanks,
Taylor

^ permalink raw reply

* [PATCH 0/3] CMake unit test fixups
From: Phillip Wood @ 2023-10-19 15:21 UTC (permalink / raw)
  To: gitster
  Cc: calvinwan, git, johannes.schindelin, linusa, phillip.wood123,
	rsbecker, steadmon
In-Reply-To: <xmqqh6mzwe24.fsf@gitster.g>

From: Phillip Wood <phillip.wood@dunelm.org.uk>

Hi Junio

> The other topic to adjust for cmake by Dscho builds on this topic,
> and it needs to be rebased on this updated round.  I think I did so
> correctly, but because I use neither cmake or Windows, the result is
> not even compile tested.  Sanity checking the result is very much
> appreciated when I push out the result of today's integration cycle.

I need these fixups to get our CI to successfully build an run the
unit tests using CMake & MSVC. They are all adjusting paths now that
the unit test programs are built in t/unit-tests/bin

You can see the unit tests passing at
https://github.com/phillipwood/git/actions/runs/6575606322/job/17863460719
Note that I have split up the patches since that run but the changes
are the same.

Best Wishes

Phillip


Phillip Wood (3):
  fixup! cmake: also build unit tests
  fixup! artifacts-tar: when including `.dll` files, don't forget the
    unit-tests
  fixup! cmake: handle also unit tests

 Makefile                            | 2 +-
 contrib/buildsystems/CMakeLists.txt | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

-- 
2.42.0.506.g0dd4464cfd3


^ permalink raw reply

* [PATCH 1/3] fixup! cmake: also build unit tests
From: Phillip Wood @ 2023-10-19 15:21 UTC (permalink / raw)
  To: gitster
  Cc: calvinwan, git, johannes.schindelin, linusa, phillip.wood123,
	rsbecker, steadmon
In-Reply-To: <20231019152726.14624-1-phillip.wood123@gmail.com>

From: Phillip Wood <phillip.wood@dunelm.org.uk>

---
 contrib/buildsystems/CMakeLists.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/contrib/buildsystems/CMakeLists.txt b/contrib/buildsystems/CMakeLists.txt
index d21835ca65..20f38e94c9 100644
--- a/contrib/buildsystems/CMakeLists.txt
+++ b/contrib/buildsystems/CMakeLists.txt
@@ -973,12 +973,12 @@ foreach(unit_test ${unit_test_PROGRAMS})
 	add_executable("${unit_test}" "${CMAKE_SOURCE_DIR}/t/unit-tests/${unit_test}.c")
 	target_link_libraries("${unit_test}" unit-test-lib common-main)
 	set_target_properties("${unit_test}"
-		PROPERTIES RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/t/unit-tests)
+		PROPERTIES RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/t/unit-tests/bin)
 	if(MSVC)
 		set_target_properties("${unit_test}"
-			PROPERTIES RUNTIME_OUTPUT_DIRECTORY_DEBUG ${CMAKE_BINARY_DIR}/t/unit-tests)
+			PROPERTIES RUNTIME_OUTPUT_DIRECTORY_DEBUG ${CMAKE_BINARY_DIR}/t/unit-tests/bin)
 		set_target_properties("${unit_test}"
-			PROPERTIES RUNTIME_OUTPUT_DIRECTORY_RELEASE ${CMAKE_BINARY_DIR}/t/unit-tests)
+			PROPERTIES RUNTIME_OUTPUT_DIRECTORY_RELEASE ${CMAKE_BINARY_DIR}/t/unit-tests/bin)
 	endif()
 	list(APPEND PROGRAMS_BUILT "${unit_test}")
 
-- 
2.42.0.506.g0dd4464cfd3


^ permalink raw reply related

* [PATCH 2/3] fixup! artifacts-tar: when including `.dll` files, don't forget the unit-tests
From: Phillip Wood @ 2023-10-19 15:21 UTC (permalink / raw)
  To: gitster
  Cc: calvinwan, git, johannes.schindelin, linusa, phillip.wood123,
	rsbecker, steadmon
In-Reply-To: <20231019152726.14624-1-phillip.wood123@gmail.com>

From: Phillip Wood <phillip.wood@dunelm.org.uk>

---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 075d8e4899..e15c34e506 100644
--- a/Makefile
+++ b/Makefile
@@ -3597,7 +3597,7 @@ rpm::
 .PHONY: rpm
 
 ifneq ($(INCLUDE_DLLS_IN_ARTIFACTS),)
-OTHER_PROGRAMS += $(shell echo *.dll t/helper/*.dll t/unit-tests/*.dll)
+OTHER_PROGRAMS += $(shell echo *.dll t/helper/*.dll t/unit-tests/bin/*.dll)
 endif
 
 artifacts-tar:: $(ALL_COMMANDS_TO_INSTALL) $(SCRIPT_LIB) $(OTHER_PROGRAMS) \
-- 
2.42.0.506.g0dd4464cfd3


^ permalink raw reply related

* [PATCH 3/3] fixup! cmake: handle also unit tests
From: Phillip Wood @ 2023-10-19 15:21 UTC (permalink / raw)
  To: gitster
  Cc: calvinwan, git, johannes.schindelin, linusa, phillip.wood123,
	rsbecker, steadmon
In-Reply-To: <20231019152726.14624-1-phillip.wood123@gmail.com>

From: Phillip Wood <phillip.wood@dunelm.org.uk>

---
 contrib/buildsystems/CMakeLists.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/buildsystems/CMakeLists.txt b/contrib/buildsystems/CMakeLists.txt
index 20f38e94c9..671c7ead75 100644
--- a/contrib/buildsystems/CMakeLists.txt
+++ b/contrib/buildsystems/CMakeLists.txt
@@ -990,7 +990,7 @@ foreach(unit_test ${unit_test_PROGRAMS})
 	if(NOT ${unit_test} STREQUAL "t-basic")
 		add_test(NAME "t.unit-tests.${unit_test}"
 			COMMAND "./${unit_test}"
-			WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}/t/unit-tests)
+			WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}/t/unit-tests/bin)
 	endif()
 endforeach()
 
-- 
2.42.0.506.g0dd4464cfd3


^ permalink raw reply related

* Re: [PATCH v3 08/10] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
From: Taylor Blau @ 2023-10-19 15:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King,
	Patrick Steinhardt
In-Reply-To: <xmqq34y7plj4.fsf@gitster.g>

On Wed, Oct 18, 2023 at 04:18:23PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Now that we have factored out many of the common routines necessary to
> > index a new object into a pack created by the bulk-checkin machinery, we
> > can introduce a variant of `index_blob_bulk_checkin()` that acts on
> > blobs whose contents we can fit in memory.
>
> Hmph.
>
> Doesn't the duplication between the main loop of the new
> deflate_obj_contents_to_pack() with existing deflate_blob_to_pack()
> bother you?

Yeah, I am not sure how I missed seeing the opportunity to clean that
up. Another (hopefully final...) reroll incoming.

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH v5 0/2] attr: add attr.tree config
From: John Cai @ 2023-10-19 15:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: John Cai via GitGitGadget, git, Jeff King, Jonathan Tan,
	Eric Sunshine
In-Reply-To: <xmqqzg0mxnaj.fsf@gitster.g>

Hi Junio,

On 13 Oct 2023, at 16:47, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> So in a sense, for !!ignore_bad_attr_tree case, the code ends up
>> doing the right thing.  But if !ignore_bad_attr_tree is true, i.e.,
>> a blob object name is given via --attr-source or GIT_ATTR_SOURCE,
>> then the bug will be uncovered.
>
> Having said all that, I suspect that this problem is not new and
> certainly not caused by this topic.  We should have unconditionally
> died when GIT_ATTR_SOURCE gave a blob object name, but pretended as
> if an empty tree was given.  There may even be existing users who
> now assume that is working as intended and depend on this bug.
>
> So, let's leave it as a "possible bug" that we might want to fix in
> the future, outside the scope of this series.
>
> Thanks.

That sounds good--let's fix in a separate series. Thanks for the careful review!

thanks
John

>
>
>>  t/t0003-attributes.sh | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git c/t/t0003-attributes.sh w/t/t0003-attributes.sh
>> index ecf43ab545..0f02f22171 100755
>> --- c/t/t0003-attributes.sh
>> +++ w/t/t0003-attributes.sh
>> @@ -394,6 +394,18 @@ test_expect_success 'bare repo defaults to reading .gitattributes from HEAD' '
>>  	test_cmp expect actual
>>  '
>>
>> +test_expect_success '--attr-source that points at a non-treeish' '
>> +	test_when_finished rm -rf empty &&
>> +	git init empty &&
>> +	(
>> +		cd empty &&
>> +		echo "$bad_attr_source_err" >expect_err &&
>> +		H=$(git hash-object -t blob --stdin -w </dev/null) &&
>> +		test_must_fail git --attr-source=$H check-attr test -- f/path 2>err &&
>> +		test_cmp expect_err err
>> +	)
>> +'
>> +
>>  test_expect_success 'precedence of --attr-source, GIT_ATTR_SOURCE, then attr.tree' '
>>  	test_when_finished rm -rf empty &&
>>  	git init empty &&

^ permalink raw reply

* Re: What's cooking in git.git (Oct 2023, #04; Tue, 10)
From: SZEDER Gábor @ 2023-10-19 17:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, Jonathan Tan, Elijah Newren, git
In-Reply-To: <ZSb292VxDQoeSu2o@nand.local>

On Wed, Oct 11, 2023 at 03:26:47PM -0400, Taylor Blau wrote:
> On Tue, Oct 10, 2023 at 06:32:16PM -0700, Junio C Hamano wrote:
> > * tb/path-filter-fix (2023-08-30) 15 commits
> >  - bloom: introduce `deinit_bloom_filters()`
> >  - commit-graph: reuse existing Bloom filters where possible
> >  - object.h: fix mis-aligned flag bits table
> >  - commit-graph: drop unnecessary `graph_read_bloom_data_context`
> >  - commit-graph.c: unconditionally load Bloom filters
> >  - t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
> >  - bloom: prepare to discard incompatible Bloom filters
> >  - bloom: annotate filters with hash version
> >  - commit-graph: new filter ver. that fixes murmur3
> >  - repo-settings: introduce commitgraph.changedPathsVersion
> >  - t4216: test changed path filters with high bit paths
> >  - t/helper/test-read-graph: implement `bloom-filters` mode
> >  - bloom.h: make `load_bloom_filter_from_graph()` public
> >  - t/helper/test-read-graph.c: extract `dump_graph_info()`
> >  - gitformat-commit-graph: describe version 2 of BDAT
> >
> >  The Bloom filter used for path limited history traversal was broken
> >  on systems whose "char" is unsigned; update the implementation and
> >  bump the format version to 2.
> >
> >  Reroll exists, not picked up yet.
> >  cf. <20230830200218.GA5147@szeder.dev>
> >  cf. <20230901205616.3572722-1-jonathantanmy@google.com>
> >  cf. <20230924195900.GA1156862@szeder.dev>
> >  cf. <20231008143523.GA18858@szeder.dev>
> >  source: <cover.1693413637.git.jonathantanmy@google.com>
> 
> Great, thanks for noting that you saw it ;-). I think that this one is
> ready to go, but I'm obviously biased and I'd feel better if Jonathan or
> Gábor (both CC'd) would take a look before you merge this down.

The test I posted in 20230830200218.GA5147@szeder.dev checking
different Bloom filter versions in different commit-graph layers still
fails in current seen.


^ permalink raw reply

* Re: [PATCH] typo: fix the typo 'neeed' into 'needed' in the comment under merge-ort.c
From: Junio C Hamano @ 2023-10-19 17:05 UTC (permalink / raw)
  To: 王常新; +Cc: git, Wangchangxin
In-Reply-To: <2DB9ED79-FE58-4072-91E0-B4C51A3F6C5B@gmail.com>

王常新 <wchangxin824@gmail.com> writes:

> From: foril <1571825323@qq.com>
>
> Signed-off-by: 王常新 (Wang Changxin) <foril@foril.space>
> ---

Thanks.  

We want to make sure that the "Name <e-mail-address>" on the From:
and Signed-off-by: lines match.  Is your official name/address the
one on the Singed-off-by: line?

^ permalink raw reply

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
From: Junio C Hamano @ 2023-10-19 17:16 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Karthik Nayak, Taylor Blau
In-Reply-To: <ZTDn-Wd5xsFrBmqI@tanuki>

Patrick Steinhardt <ps@pks.im> writes:

> There's another way to handle this, which is to conditionally enable the
> object existence check. This would be less of a performance hit compared
> to disabling commit graphs altogether with `--missing`, but still ensure
> that repository corruption was detected. Second, it would not regress
> performance for all preexisting users of `repo_parse_commit_gently()`.

The above was what I meant to suggest when you demonstrated that the
code with additional check is still much more performant than
running without the commit-graph optimization, yet has observable
impact on performance for normal codepaths that do not need the
extra carefulness.

But I wasn't sure if it is easy to plumb the "do we want to double
check?  in other words, are we running something like --missing that
care the correctness a bit more than usual cases?" bit down from the
caller, because this check is so deep in the callchain.

Thanks.


^ permalink raw reply

* [PATCH v4 0/7] merge-ort: implement support for packing objects together
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

(Rebased onto the current tip of 'master', which is 813d9a9188 (The
nineteenth batch, 2023-10-18) at the time of writing).

This series implements support for a new merge-tree option,
`--write-pack`, which causes any newly-written objects to be packed
together instead of being stored individually as loose.

I intentionally broke this off from the existing thread, since I
accidentally rerolled mine and Jonathan Tan's Bloom v2 series into it,
causing some confusion.

This is a new round that is significantly simplified thanks to
another very helpful suggestion[1] from Junio. By factoring out a common
"deflate object to pack" that takes an abstract bulk_checkin_source as a
parameter, all of the earlier refactorings can be dropped since we
retain only a single caller instead of multiple.

This resulted in a rather satisfying range-diff (included below, as
usual), and a similarly satisfying inter-diff:

    $ git diff --stat tb/ort-bulk-checkin.v3..
     bulk-checkin.c | 203 ++++++++++++++++---------------------------------
     1 file changed, 64 insertions(+), 139 deletions(-)

Beyond that, the changes since last time can be viewed in the range-diff
below. Thanks in advance for any review!

[1]: https://lore.kernel.org/git/xmqq34y7plj4.fsf@gitster.g/

Taylor Blau (7):
  bulk-checkin: extract abstract `bulk_checkin_source`
  bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
  bulk-checkin: refactor deflate routine to accept a
    `bulk_checkin_source`
  bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
  bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
  builtin/merge-tree.c: implement support for `--write-pack`

 Documentation/git-merge-tree.txt |   4 +
 builtin/merge-tree.c             |   5 +
 bulk-checkin.c                   | 161 ++++++++++++++++++++++++++-----
 bulk-checkin.h                   |   8 ++
 merge-ort.c                      |  42 ++++++--
 merge-recursive.h                |   1 +
 t/t4301-merge-tree-write-tree.sh |  93 ++++++++++++++++++
 7 files changed, 280 insertions(+), 34 deletions(-)

Range-diff against v3:
 1:  2dffa45183 <  -:  ---------- bulk-checkin: factor out `format_object_header_hash()`
 2:  7a10dc794a <  -:  ---------- bulk-checkin: factor out `prepare_checkpoint()`
 3:  20c32d2178 <  -:  ---------- bulk-checkin: factor out `truncate_checkpoint()`
 4:  893051d0b7 <  -:  ---------- bulk-checkin: factor out `finalize_checkpoint()`
 5:  da52ec8380 !  1:  97bb6e9f59 bulk-checkin: extract abstract `bulk_checkin_source`
    @@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *sta
      			if (*already_hashed_to < offset) {
      				size_t hsize = offset - *already_hashed_to;
     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    - 	git_hash_ctx ctx;
    + 	unsigned header_len;
      	struct hashfile_checkpoint checkpoint = {0};
      	struct pack_idx_entry *idx = NULL;
     +	struct bulk_checkin_source source = {
    @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
      	seekback = lseek(fd, 0, SEEK_CUR);
      	if (seekback == (off_t) -1)
     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    - 	while (1) {
    - 		prepare_checkpoint(state, &checkpoint, idx, flags);
    + 			crc32_begin(state->f);
    + 		}
      		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
     -					 fd, size, path, flags))
     +					 &source, flags))
      			break;
    - 		truncate_checkpoint(state, &checkpoint, idx);
    + 		/*
    + 		 * Writing this object to the current pack will make
    +@@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    + 		hashfile_truncate(state->f, &checkpoint);
    + 		state->offset = checkpoint.offset;
    + 		flush_bulk_checkin_packfile(state);
     -		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
     +		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
      			return error("cannot seek back");
      	}
    - 	finalize_checkpoint(state, &ctx, &checkpoint, idx, result_oid);
    + 	the_hash_algo->final_oid_fn(result_oid, &ctx);
 7:  04ec74e357 !  2:  9d633df339 bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
    @@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *sta
      	s.avail_out = sizeof(obuf) - hdrlen;
      
     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    - 
    - 	while (1) {
    - 		prepare_checkpoint(state, &checkpoint, idx, flags);
    + 			idx->offset = state->offset;
    + 			crc32_begin(state->f);
    + 		}
     -		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
     -					 &source, flags))
     +		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
     +					&source, OBJ_BLOB, flags))
      			break;
    - 		truncate_checkpoint(state, &checkpoint, idx);
    - 		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
    + 		/*
    + 		 * Writing this object to the current pack will make
 -:  ---------- >  3:  d5bbd7810e bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
 6:  4e9bac5bc1 =  4:  e427fe6ad3 bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
 8:  8667b76365 !  5:  48095afe80 bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
    @@ Commit message
         objects individually as loose.
     
         Similar to the existing `index_blob_bulk_checkin()` function, the
    -    entrypoint delegates to `deflate_blob_to_pack_incore()`, which is
    -    responsible for formatting the pack header and then deflating the
    -    contents into the pack. The latter is accomplished by calling
    -    deflate_obj_contents_to_pack_incore(), which takes advantage of the
    -    earlier refactorings and is responsible for writing the object to the
    -    pack and handling any overage from pack.packSizeLimit.
    -
    -    The bulk of the new functionality is implemented in the function
    -    `stream_obj_to_pack()`, which can handle streaming objects from memory
    -    to the bulk-checkin pack as a result of the earlier refactoring.
    +    entrypoint delegates to `deflate_obj_to_pack_incore()`. That function in
    +    turn delegates to deflate_obj_to_pack(), which is responsible for
    +    formatting the pack header and then deflating the contents into the
    +    pack.
     
         Consistent with the rest of the bulk-checkin mechanism, there are no
         direct tests here. In future commits when we expose this new
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## bulk-checkin.c ##
    -@@ bulk-checkin.c: static void finalize_checkpoint(struct bulk_checkin_packfile *state,
    - 	}
    +@@ bulk-checkin.c: static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
    + 	return 0;
      }
      
    -+static int deflate_obj_contents_to_pack_incore(struct bulk_checkin_packfile *state,
    -+					       git_hash_ctx *ctx,
    -+					       struct hashfile_checkpoint *checkpoint,
    -+					       struct object_id *result_oid,
    -+					       const void *buf, size_t size,
    -+					       enum object_type type,
    -+					       const char *path, unsigned flags)
    ++static int deflate_obj_to_pack_incore(struct bulk_checkin_packfile *state,
    ++				       struct object_id *result_oid,
    ++				       const void *buf, size_t size,
    ++				       const char *path, enum object_type type,
    ++				       unsigned flags)
     +{
    -+	struct pack_idx_entry *idx = NULL;
    -+	off_t already_hashed_to = 0;
     +	struct bulk_checkin_source source = {
     +		.type = SOURCE_INCORE,
     +		.buf = buf,
    @@ bulk-checkin.c: static void finalize_checkpoint(struct bulk_checkin_packfile *st
     +		.path = path,
     +	};
     +
    -+	/* Note: idx is non-NULL when we are writing */
    -+	if (flags & HASH_WRITE_OBJECT)
    -+		CALLOC_ARRAY(idx, 1);
    -+
    -+	while (1) {
    -+		prepare_checkpoint(state, checkpoint, idx, flags);
    -+
    -+		if (!stream_obj_to_pack(state, ctx, &already_hashed_to, &source,
    -+					type, flags))
    -+			break;
    -+		truncate_checkpoint(state, checkpoint, idx);
    -+		bulk_checkin_source_seek_to(&source, 0);
    -+	}
    -+
    -+	finalize_checkpoint(state, ctx, checkpoint, idx, result_oid);
    -+
    -+	return 0;
    -+}
    -+
    -+static int deflate_blob_to_pack_incore(struct bulk_checkin_packfile *state,
    -+				       struct object_id *result_oid,
    -+				       const void *buf, size_t size,
    -+				       const char *path, unsigned flags)
    -+{
    -+	git_hash_ctx ctx;
    -+	struct hashfile_checkpoint checkpoint = {0};
    -+
    -+	format_object_header_hash(the_hash_algo, &ctx, &checkpoint, OBJ_BLOB,
    -+				  size);
    -+
    -+	return deflate_obj_contents_to_pack_incore(state, &ctx, &checkpoint,
    -+						   result_oid, buf, size,
    -+						   OBJ_BLOB, path, flags);
    ++	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
     +}
     +
      static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    @@ bulk-checkin.c: int index_blob_bulk_checkin(struct object_id *oid,
     +				   const void *buf, size_t size,
     +				   const char *path, unsigned flags)
     +{
    -+	int status = deflate_blob_to_pack_incore(&bulk_checkin_packfile, oid,
    -+						 buf, size, path, flags);
    ++	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
    ++						buf, size, path, OBJ_BLOB,
    ++						flags);
     +	if (!odb_transaction_nesting)
     +		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
     +	return status;
 9:  cba043ef14 !  6:  60568f9281 bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
    @@ Commit message
         machinery will have enough to keep track of the converted object's hash
         in order to update the compatibility mapping.
     
    -    Within `deflate_tree_to_pack_incore()`, the changes should be limited
    -    to something like:
    +    Within some thin wrapper around `deflate_obj_to_pack_incore()` (perhaps
    +    `deflate_tree_to_pack_incore()`), the changes should be limited to
    +    something like:
     
             struct strbuf converted = STRBUF_INIT;
             if (the_repository->compat_hash_algo) {
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## bulk-checkin.c ##
    -@@ bulk-checkin.c: static int deflate_blob_to_pack_incore(struct bulk_checkin_packfile *state,
    - 						   OBJ_BLOB, path, flags);
    - }
    - 
    -+static int deflate_tree_to_pack_incore(struct bulk_checkin_packfile *state,
    -+				       struct object_id *result_oid,
    -+				       const void *buf, size_t size,
    -+				       const char *path, unsigned flags)
    -+{
    -+	git_hash_ctx ctx;
    -+	struct hashfile_checkpoint checkpoint = {0};
    -+
    -+	format_object_header_hash(the_hash_algo, &ctx, &checkpoint, OBJ_TREE,
    -+				  size);
    -+
    -+	return deflate_obj_contents_to_pack_incore(state, &ctx, &checkpoint,
    -+						   result_oid, buf, size,
    -+						   OBJ_TREE, path, flags);
    -+}
    -+
    - static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    - 				struct object_id *result_oid,
    - 				int fd, size_t size,
     @@ bulk-checkin.c: int index_blob_bulk_checkin_incore(struct object_id *oid,
      	return status;
      }
    @@ bulk-checkin.c: int index_blob_bulk_checkin_incore(struct object_id *oid,
     +				   const void *buf, size_t size,
     +				   const char *path, unsigned flags)
     +{
    -+	int status = deflate_tree_to_pack_incore(&bulk_checkin_packfile, oid,
    -+						 buf, size, path, flags);
    ++	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
    ++						buf, size, path, OBJ_TREE,
    ++						flags);
     +	if (!odb_transaction_nesting)
     +		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
     +	return status;
10:  ae70508037 =  7:  b9be9df122 builtin/merge-tree.c: implement support for `--write-pack`
-- 
2.42.0.405.g86fe3250c2

^ permalink raw reply

* [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source`
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <cover.1697736516.git.me@ttaylorr.com>

A future commit will want to implement a very similar routine as in
`stream_blob_to_pack()` with two notable changes:

  - Instead of streaming just OBJ_BLOBs, this new function may want to
    stream objects of arbitrary type.

  - Instead of streaming the object's contents from an open
    file-descriptor, this new function may want to "stream" its contents
    from memory.

To avoid duplicating a significant chunk of code between the existing
`stream_blob_to_pack()`, extract an abstract `bulk_checkin_source`. This
concept currently is a thin layer of `lseek()` and `read_in_full()`, but
will grow to understand how to perform analogous operations when writing
out an object's contents from memory.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 61 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 8 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6ce62999e5..c05d06e1e1 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -140,8 +140,41 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
 	return 0;
 }
 
+struct bulk_checkin_source {
+	enum { SOURCE_FILE } type;
+
+	/* SOURCE_FILE fields */
+	int fd;
+
+	/* common fields */
+	size_t size;
+	const char *path;
+};
+
+static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
+					 off_t offset)
+{
+	switch (source->type) {
+	case SOURCE_FILE:
+		return lseek(source->fd, offset, SEEK_SET);
+	default:
+		BUG("unknown bulk-checkin source: %d", source->type);
+	}
+}
+
+static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
+					void *buf, size_t nr)
+{
+	switch (source->type) {
+	case SOURCE_FILE:
+		return read_in_full(source->fd, buf, nr);
+	default:
+		BUG("unknown bulk-checkin source: %d", source->type);
+	}
+}
+
 /*
- * Read the contents from fd for size bytes, streaming it to the
+ * Read the contents from 'source' for 'size' bytes, streaming it to the
  * packfile in state while updating the hash in ctx. Signal a failure
  * by returning a negative value when the resulting pack would exceed
  * the pack size limit and this is not the first object in the pack,
@@ -157,7 +190,7 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
  */
 static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 			       git_hash_ctx *ctx, off_t *already_hashed_to,
-			       int fd, size_t size, const char *path,
+			       struct bulk_checkin_source *source,
 			       unsigned flags)
 {
 	git_zstream s;
@@ -167,22 +200,28 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 	int status = Z_OK;
 	int write_object = (flags & HASH_WRITE_OBJECT);
 	off_t offset = 0;
+	size_t size = source->size;
 
 	git_deflate_init(&s, pack_compression_level);
 
-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB, size);
+	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
+					      size);
 	s.next_out = obuf + hdrlen;
 	s.avail_out = sizeof(obuf) - hdrlen;
 
 	while (status != Z_STREAM_END) {
 		if (size && !s.avail_in) {
 			ssize_t rsize = size < sizeof(ibuf) ? size : sizeof(ibuf);
-			ssize_t read_result = read_in_full(fd, ibuf, rsize);
+			ssize_t read_result;
+
+			read_result = bulk_checkin_source_read(source, ibuf,
+							       rsize);
 			if (read_result < 0)
-				die_errno("failed to read from '%s'", path);
+				die_errno("failed to read from '%s'",
+					  source->path);
 			if (read_result != rsize)
 				die("failed to read %d bytes from '%s'",
-				    (int)rsize, path);
+				    (int)rsize, source->path);
 			offset += rsize;
 			if (*already_hashed_to < offset) {
 				size_t hsize = offset - *already_hashed_to;
@@ -258,6 +297,12 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
+	struct bulk_checkin_source source = {
+		.type = SOURCE_FILE,
+		.fd = fd,
+		.size = size,
+		.path = path,
+	};
 
 	seekback = lseek(fd, 0, SEEK_CUR);
 	if (seekback == (off_t) -1)
@@ -283,7 +328,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			crc32_begin(state->f);
 		}
 		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 fd, size, path, flags))
+					 &source, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
@@ -295,7 +340,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 		hashfile_truncate(state->f, &checkpoint);
 		state->offset = checkpoint.offset;
 		flush_bulk_checkin_packfile(state);
-		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
 			return error("cannot seek back");
 	}
 	the_hash_algo->final_oid_fn(result_oid, &ctx);
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related

* [PATCH v4 2/7] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <cover.1697736516.git.me@ttaylorr.com>

The existing `stream_blob_to_pack()` function is named based on the fact
that it knows only how to stream blobs into a bulk-checkin pack.

But there is no longer anything in this function which prevents us from
writing objects of arbitrary types to the bulk-checkin pack. Prepare to
write OBJ_TREEs by removing this assumption, adding an `enum
object_type` parameter to this function's argument list, and renaming it
to `stream_obj_to_pack()` accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index c05d06e1e1..7e6b52112e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -188,10 +188,10 @@ static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
  * status before calling us just in case we ask it to call us again
  * with a new pack.
  */
-static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
-			       git_hash_ctx *ctx, off_t *already_hashed_to,
-			       struct bulk_checkin_source *source,
-			       unsigned flags)
+static int stream_obj_to_pack(struct bulk_checkin_packfile *state,
+			      git_hash_ctx *ctx, off_t *already_hashed_to,
+			      struct bulk_checkin_source *source,
+			      enum object_type type, unsigned flags)
 {
 	git_zstream s;
 	unsigned char ibuf[16384];
@@ -204,8 +204,7 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 
 	git_deflate_init(&s, pack_compression_level);
 
-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
-					      size);
+	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size);
 	s.next_out = obuf + hdrlen;
 	s.avail_out = sizeof(obuf) - hdrlen;
 
@@ -327,8 +326,8 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			idx->offset = state->offset;
 			crc32_begin(state->f);
 		}
-		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 &source, flags))
+		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
+					&source, OBJ_BLOB, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related

* [PATCH v4 3/7] bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <cover.1697736516.git.me@ttaylorr.com>

Prepare for a future change where we will want to use a routine very
similar to the existing `deflate_blob_to_pack()` but over arbitrary
sources (i.e. either open file-descriptors, or a location in memory).

Extract out a common "deflate_obj_to_pack()" routine that acts on a
bulk_checkin_source, instead of a (int, size_t) pair. Then rewrite
`deflate_blob_to_pack()` in terms of it.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 52 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 7e6b52112e..28bc8d5ab4 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -285,30 +285,23 @@ static void prepare_to_stream(struct bulk_checkin_packfile *state,
 		die_errno("unable to write pack header");
 }
 
-static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
-				struct object_id *result_oid,
-				int fd, size_t size,
-				const char *path, unsigned flags)
+
+static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
+			       struct object_id *result_oid,
+			       struct bulk_checkin_source *source,
+			       enum object_type type,
+			       off_t seekback,
+			       unsigned flags)
 {
-	off_t seekback, already_hashed_to;
+	off_t already_hashed_to = 0;
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
-	struct bulk_checkin_source source = {
-		.type = SOURCE_FILE,
-		.fd = fd,
-		.size = size,
-		.path = path,
-	};
 
-	seekback = lseek(fd, 0, SEEK_CUR);
-	if (seekback == (off_t) -1)
-		return error("cannot find the current offset");
-
-	header_len = format_object_header((char *)obuf, sizeof(obuf),
-					  OBJ_BLOB, size);
+	header_len = format_object_header((char *)obuf, sizeof(obuf), type,
+					  source->size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
 	the_hash_algo->init_fn(&checkpoint.ctx);
@@ -317,8 +310,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	if ((flags & HASH_WRITE_OBJECT) != 0)
 		CALLOC_ARRAY(idx, 1);
 
-	already_hashed_to = 0;
-
 	while (1) {
 		prepare_to_stream(state, flags);
 		if (idx) {
@@ -327,7 +318,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			crc32_begin(state->f);
 		}
 		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
-					&source, OBJ_BLOB, flags))
+					source, type, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
@@ -339,7 +330,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 		hashfile_truncate(state->f, &checkpoint);
 		state->offset = checkpoint.offset;
 		flush_bulk_checkin_packfile(state);
-		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
+		if (bulk_checkin_source_seek_to(source, seekback) == (off_t)-1)
 			return error("cannot seek back");
 	}
 	the_hash_algo->final_oid_fn(result_oid, &ctx);
@@ -361,6 +352,25 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	return 0;
 }
 
+static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
+				struct object_id *result_oid,
+				int fd, size_t size,
+				const char *path, unsigned flags)
+{
+	struct bulk_checkin_source source = {
+		.type = SOURCE_FILE,
+		.fd = fd,
+		.size = size,
+		.path = path,
+	};
+	off_t seekback = lseek(fd, 0, SEEK_CUR);
+	if (seekback == (off_t) -1)
+		return error("cannot find the current offset");
+
+	return deflate_obj_to_pack(state, result_oid, &source, OBJ_BLOB,
+				   seekback, flags);
+}
+
 void prepare_loose_object_bulk_checkin(void)
 {
 	/*
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related

* [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <cover.1697736516.git.me@ttaylorr.com>

Continue to prepare for streaming an object's contents directly from
memory by teaching `bulk_checkin_source` how to perform reads and seeks
based on an address in memory.

Unlike file descriptors, which manage their own offset internally, we
have to keep track of how many bytes we've read out of the buffer, and
make sure we don't read past the end of the buffer.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 28bc8d5ab4..60361b3e2e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -141,11 +141,15 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
 }
 
 struct bulk_checkin_source {
-	enum { SOURCE_FILE } type;
+	enum { SOURCE_FILE, SOURCE_INCORE } type;
 
 	/* SOURCE_FILE fields */
 	int fd;
 
+	/* SOURCE_INCORE fields */
+	const void *buf;
+	size_t read;
+
 	/* common fields */
 	size_t size;
 	const char *path;
@@ -157,6 +161,11 @@ static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
 	switch (source->type) {
 	case SOURCE_FILE:
 		return lseek(source->fd, offset, SEEK_SET);
+	case SOURCE_INCORE:
+		if (!(0 <= offset && offset < source->size))
+			return (off_t)-1;
+		source->read = offset;
+		return source->read;
 	default:
 		BUG("unknown bulk-checkin source: %d", source->type);
 	}
@@ -168,6 +177,13 @@ static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
 	switch (source->type) {
 	case SOURCE_FILE:
 		return read_in_full(source->fd, buf, nr);
+	case SOURCE_INCORE:
+		assert(source->read <= source->size);
+		if (nr > source->size - source->read)
+			nr = source->size - source->read;
+		memcpy(buf, (unsigned char *)source->buf + source->read, nr);
+		source->read += nr;
+		return nr;
 	default:
 		BUG("unknown bulk-checkin source: %d", source->type);
 	}
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related

* [PATCH v4 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <cover.1697736516.git.me@ttaylorr.com>

Now that we have factored out many of the common routines necessary to
index a new object into a pack created by the bulk-checkin machinery, we
can introduce a variant of `index_blob_bulk_checkin()` that acts on
blobs whose contents we can fit in memory.

This will be useful in a couple of more commits in order to provide the
`merge-tree` builtin with a mechanism to create a new pack containing
any objects it created during the merge, instead of storing those
objects individually as loose.

Similar to the existing `index_blob_bulk_checkin()` function, the
entrypoint delegates to `deflate_obj_to_pack_incore()`. That function in
turn delegates to deflate_obj_to_pack(), which is responsible for
formatting the pack header and then deflating the contents into the
pack.

Consistent with the rest of the bulk-checkin mechanism, there are no
direct tests here. In future commits when we expose this new
functionality via the `merge-tree` builtin, we will test it indirectly
there.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 29 +++++++++++++++++++++++++++++
 bulk-checkin.h |  4 ++++
 2 files changed, 33 insertions(+)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 60361b3e2e..655a583b06 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -368,6 +368,23 @@ static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
 	return 0;
 }
 
+static int deflate_obj_to_pack_incore(struct bulk_checkin_packfile *state,
+				       struct object_id *result_oid,
+				       const void *buf, size_t size,
+				       const char *path, enum object_type type,
+				       unsigned flags)
+{
+	struct bulk_checkin_source source = {
+		.type = SOURCE_INCORE,
+		.buf = buf,
+		.size = size,
+		.read = 0,
+		.path = path,
+	};
+
+	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
+}
+
 static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 				struct object_id *result_oid,
 				int fd, size_t size,
@@ -431,6 +448,18 @@ int index_blob_bulk_checkin(struct object_id *oid,
 	return status;
 }
 
+int index_blob_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags)
+{
+	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
+						buf, size, path, OBJ_BLOB,
+						flags);
+	if (!odb_transaction_nesting)
+		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
+	return status;
+}
+
 void begin_odb_transaction(void)
 {
 	odb_transaction_nesting += 1;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index aa7286a7b3..1b91daeaee 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -13,6 +13,10 @@ int index_blob_bulk_checkin(struct object_id *oid,
 			    int fd, size_t size,
 			    const char *path, unsigned flags);
 
+int index_blob_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags);
+
 /*
  * Tell the object database to optimize for adding
  * multiple objects. end_odb_transaction must be called
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related

* [PATCH v4 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <cover.1697736516.git.me@ttaylorr.com>

The remaining missing piece in order to teach the `merge-tree` builtin
how to write the contents of a merge into a pack is a function to index
tree objects into a bulk-checkin pack.

This patch implements that missing piece, which is a thin wrapper around
all of the functionality introduced in previous commits.

If and when Git gains support for a "compatibility" hash algorithm, the
changes to support that here will be minimal. The bulk-checkin machinery
will need to convert the incoming tree to compute its length under the
compatibility hash, necessary to reconstruct its header. With that
information (and the converted contents of the tree), the bulk-checkin
machinery will have enough to keep track of the converted object's hash
in order to update the compatibility mapping.

Within some thin wrapper around `deflate_obj_to_pack_incore()` (perhaps
`deflate_tree_to_pack_incore()`), the changes should be limited to
something like:

    struct strbuf converted = STRBUF_INIT;
    if (the_repository->compat_hash_algo) {
      if (convert_object_file(&compat_obj,
                              the_repository->hash_algo,
                              the_repository->compat_hash_algo, ...) < 0)
        die(...);

      format_object_header_hash(the_repository->compat_hash_algo,
                                OBJ_TREE, size);
    }
    /* compute the converted tree's hash using the compat algorithm */
    strbuf_release(&converted);

, assuming related changes throughout the rest of the bulk-checkin
machinery necessary to update the hash of the converted object, which
are likewise minimal in size.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 12 ++++++++++++
 bulk-checkin.h |  4 ++++
 2 files changed, 16 insertions(+)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 655a583b06..c1faf75f5f 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -460,6 +460,18 @@ int index_blob_bulk_checkin_incore(struct object_id *oid,
 	return status;
 }
 
+int index_tree_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags)
+{
+	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
+						buf, size, path, OBJ_TREE,
+						flags);
+	if (!odb_transaction_nesting)
+		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
+	return status;
+}
+
 void begin_odb_transaction(void)
 {
 	odb_transaction_nesting += 1;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index 1b91daeaee..89786b3954 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -17,6 +17,10 @@ int index_blob_bulk_checkin_incore(struct object_id *oid,
 				   const void *buf, size_t size,
 				   const char *path, unsigned flags);
 
+int index_tree_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags);
+
 /*
  * Tell the object database to optimize for adding
  * multiple objects. end_odb_transaction must be called
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related

* [PATCH v4 7/7] builtin/merge-tree.c: implement support for `--write-pack`
From: Taylor Blau @ 2023-10-19 17:29 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt
In-Reply-To: <cover.1697736516.git.me@ttaylorr.com>

When using merge-tree often within a repository[^1], it is possible to
generate a relatively large number of loose objects, which can result in
degraded performance, and inode exhaustion in extreme cases.

Building on the functionality introduced in previous commits, the
bulk-checkin machinery now has support to write arbitrary blob and tree
objects which are small enough to be held in-core. We can use this to
write any blob/tree objects generated by ORT into a separate pack
instead of writing them out individually as loose.

This functionality is gated behind a new `--write-pack` option to
`merge-tree` that works with the (non-deprecated) `--write-tree` mode.

The implementation is relatively straightforward. There are two spots
within the ORT mechanism where we call `write_object_file()`, one for
content differences within blobs, and another to assemble any new trees
necessary to construct the merge. In each of those locations,
conditionally replace calls to `write_object_file()` with
`index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
depending on which kind of object we are writing.

The only remaining task is to begin and end the transaction necessary to
initialize the bulk-checkin machinery, and move any new pack(s) it
created into the main object store.

[^1]: Such is the case at GitHub, where we run presumptive "test merges"
  on open pull requests to see whether or not we can light up the merge
  button green depending on whether or not the presumptive merge was
  conflicted.

  This is done in response to a number of user-initiated events,
  including viewing an open pull request whose last test merge is stale
  with respect to the current base and tip of the pull request. As a
  result, merge-tree can be run very frequently on large, active
  repositories.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-merge-tree.txt |  4 ++
 builtin/merge-tree.c             |  5 ++
 merge-ort.c                      | 42 +++++++++++----
 merge-recursive.h                |  1 +
 t/t4301-merge-tree-write-tree.sh | 93 ++++++++++++++++++++++++++++++++
 5 files changed, 136 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt
index ffc4fbf7e8..9d37609ef1 100644
--- a/Documentation/git-merge-tree.txt
+++ b/Documentation/git-merge-tree.txt
@@ -69,6 +69,10 @@ OPTIONS
 	specify a merge-base for the merge, and specifying multiple bases is
 	currently not supported. This option is incompatible with `--stdin`.
 
+--write-pack::
+	Write any new objects into a separate packfile instead of as
+	individual loose objects.
+
 [[OUTPUT]]
 OUTPUT
 ------
diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c
index 0de42aecf4..672ebd4c54 100644
--- a/builtin/merge-tree.c
+++ b/builtin/merge-tree.c
@@ -18,6 +18,7 @@
 #include "quote.h"
 #include "tree.h"
 #include "config.h"
+#include "bulk-checkin.h"
 
 static int line_termination = '\n';
 
@@ -414,6 +415,7 @@ struct merge_tree_options {
 	int show_messages;
 	int name_only;
 	int use_stdin;
+	int write_pack;
 };
 
 static int real_merge(struct merge_tree_options *o,
@@ -440,6 +442,7 @@ static int real_merge(struct merge_tree_options *o,
 	init_merge_options(&opt, the_repository);
 
 	opt.show_rename_progress = 0;
+	opt.write_pack = o->write_pack;
 
 	opt.branch1 = branch1;
 	opt.branch2 = branch2;
@@ -548,6 +551,8 @@ int cmd_merge_tree(int argc, const char **argv, const char *prefix)
 			   &merge_base,
 			   N_("commit"),
 			   N_("specify a merge-base for the merge")),
+		OPT_BOOL(0, "write-pack", &o.write_pack,
+			 N_("write new objects to a pack instead of as loose")),
 		OPT_END()
 	};
 
diff --git a/merge-ort.c b/merge-ort.c
index 3653725661..523577d71e 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -48,6 +48,7 @@
 #include "tree.h"
 #include "unpack-trees.h"
 #include "xdiff-interface.h"
+#include "bulk-checkin.h"
 
 /*
  * We have many arrays of size 3.  Whenever we have such an array, the
@@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
 		if ((merge_status < 0) || !result_buf.ptr)
 			ret = error(_("failed to execute internal merge"));
 
-		if (!ret &&
-		    write_object_file(result_buf.ptr, result_buf.size,
-				      OBJ_BLOB, &result->oid))
-			ret = error(_("unable to add %s to database"), path);
+		if (!ret) {
+			ret = opt->write_pack
+				? index_blob_bulk_checkin_incore(&result->oid,
+								 result_buf.ptr,
+								 result_buf.size,
+								 path, 1)
+				: write_object_file(result_buf.ptr,
+						    result_buf.size,
+						    OBJ_BLOB, &result->oid);
+			if (ret)
+				ret = error(_("unable to add %s to database"),
+					    path);
+		}
 
 		free(result_buf.ptr);
 		if (ret)
@@ -3597,7 +3607,8 @@ static int tree_entry_order(const void *a_, const void *b_)
 				 b->string, strlen(b->string), bmi->result.mode);
 }
 
-static int write_tree(struct object_id *result_oid,
+static int write_tree(struct merge_options *opt,
+		      struct object_id *result_oid,
 		      struct string_list *versions,
 		      unsigned int offset,
 		      size_t hash_size)
@@ -3631,8 +3642,14 @@ static int write_tree(struct object_id *result_oid,
 	}
 
 	/* Write this object file out, and record in result_oid */
-	if (write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid))
+	ret = opt->write_pack
+		? index_tree_bulk_checkin_incore(result_oid,
+						 buf.buf, buf.len, "", 1)
+		: write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid);
+
+	if (ret)
 		ret = -1;
+
 	strbuf_release(&buf);
 	return ret;
 }
@@ -3797,8 +3814,8 @@ static int write_completed_directory(struct merge_options *opt,
 		 */
 		dir_info->is_null = 0;
 		dir_info->result.mode = S_IFDIR;
-		if (write_tree(&dir_info->result.oid, &info->versions, offset,
-			       opt->repo->hash_algo->rawsz) < 0)
+		if (write_tree(opt, &dir_info->result.oid, &info->versions,
+			       offset, opt->repo->hash_algo->rawsz) < 0)
 			ret = -1;
 	}
 
@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
 		fflush(stdout);
 		BUG("dir_metadata accounting completely off; shouldn't happen");
 	}
-	if (write_tree(result_oid, &dir_metadata.versions, 0,
+	if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
 		       opt->repo->hash_algo->rawsz) < 0)
 		ret = -1;
+
+	if (opt->write_pack)
+		end_odb_transaction();
+
 cleanup:
 	string_list_clear(&plist, 0);
 	string_list_clear(&dir_metadata.versions, 0);
@@ -4878,6 +4899,9 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	 */
 	strmap_init(&opt->priv->conflicts);
 
+	if (opt->write_pack)
+		begin_odb_transaction();
+
 	trace2_region_leave("merge", "allocate/init", opt->repo);
 }
 
diff --git a/merge-recursive.h b/merge-recursive.h
index b88000e3c2..156e160876 100644
--- a/merge-recursive.h
+++ b/merge-recursive.h
@@ -48,6 +48,7 @@ struct merge_options {
 	unsigned renormalize : 1;
 	unsigned record_conflict_msgs_as_headers : 1;
 	const char *msg_header_prefix;
+	unsigned write_pack : 1;
 
 	/* internal fields used by the implementation */
 	struct merge_options_internal *priv;
diff --git a/t/t4301-merge-tree-write-tree.sh b/t/t4301-merge-tree-write-tree.sh
index 250f721795..2d81ff4de5 100755
--- a/t/t4301-merge-tree-write-tree.sh
+++ b/t/t4301-merge-tree-write-tree.sh
@@ -922,4 +922,97 @@ test_expect_success 'check the input format when --stdin is passed' '
 	test_cmp expect actual
 '
 
+packdir=".git/objects/pack"
+
+test_expect_success 'merge-tree can pack its result with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+
+	# base has lines [3, 4, 5]
+	#   - side adds to the beginning, resulting in [1, 2, 3, 4, 5]
+	#   - other adds to the end, resulting in [3, 4, 5, 6, 7]
+	#
+	# merging the two should result in a new blob object containing
+	# [1, 2, 3, 4, 5, 6, 7], along with a new tree.
+	test_commit -C repo base file "$(test_seq 3 5)" &&
+	git -C repo branch -M main &&
+	git -C repo checkout -b side main &&
+	test_commit -C repo side file "$(test_seq 1 5)" &&
+	git -C repo checkout -b other main &&
+	test_commit -C repo other file "$(test_seq 3 7)" &&
+
+	find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
+	tree="$(git -C repo merge-tree --write-pack \
+		refs/tags/side refs/tags/other)" &&
+	blob="$(git -C repo rev-parse $tree:file)" &&
+	find repo/$packdir -type f -name "pack-*.idx" >packs.after &&
+
+	test_must_be_empty packs.before &&
+	test_line_count = 1 packs.after &&
+
+	git show-index <$(cat packs.after) >objects &&
+	test_line_count = 2 objects &&
+	grep "^[1-9][0-9]* $tree" objects &&
+	grep "^[1-9][0-9]* $blob" objects
+'
+
+test_expect_success 'merge-tree can write multiple packs with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		git config pack.packSizeLimit 512 &&
+
+		test_seq 512 >f &&
+
+		# "f" contains roughly ~2,000 bytes.
+		#
+		# Each side ("foo" and "bar") adds a small amount of data at the
+		# beginning and end of "base", respectively.
+		git add f &&
+		test_tick &&
+		git commit -m base &&
+		git branch -M main &&
+
+		git checkout -b foo main &&
+		{
+			echo foo && cat f
+		} >f.tmp &&
+		mv f.tmp f &&
+		git add f &&
+		test_tick &&
+		git commit -m foo &&
+
+		git checkout -b bar main &&
+		echo bar >>f &&
+		git add f &&
+		test_tick &&
+		git commit -m bar &&
+
+		find $packdir -type f -name "pack-*.idx" >packs.before &&
+		# Merging either side should result in a new object which is
+		# larger than 1M, thus the result should be split into two
+		# separate packs.
+		tree="$(git merge-tree --write-pack \
+			refs/heads/foo refs/heads/bar)" &&
+		blob="$(git rev-parse $tree:f)" &&
+		find $packdir -type f -name "pack-*.idx" >packs.after &&
+
+		test_must_be_empty packs.before &&
+		test_line_count = 2 packs.after &&
+		for idx in $(cat packs.after)
+		do
+			git show-index <$idx || return 1
+		done >objects &&
+
+		# The resulting set of packs should contain one copy of both
+		# objects, each in a separate pack.
+		test_line_count = 2 objects &&
+		grep "^[1-9][0-9]* $tree" objects &&
+		grep "^[1-9][0-9]* $blob" objects
+
+	)
+'
+
 test_done
-- 
2.42.0.405.g86fe3250c2

^ permalink raw reply related

* Re: [PATCH 00/11] t: reduce direct disk access to data structures
From: Junio C Hamano @ 2023-10-19 17:55 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: Patrick Steinhardt, git
In-Reply-To: <CAFQ2z_Om724+o+EG1FAhC9VrvJECnQ5UA+Z04Rzycpi_mXvMHg@mail.gmail.com>

Han-Wen Nienhuys <hanwen@google.com> writes:

> I think it would be really great if there were separate unittests for
> the ref backend API. Some of the reftable work was needlessly
> difficult because the contract of the API was underspecified. The API
> is well compartmentalized in refs-internal.h, and a lot of the API
> behavior can be tested as a black box, eg.
>
> * setup symref HEAD pointing to R1
> * setup transaction updating ref R1 from C1 to C2
> * commit transaction, check that it succeeds
> * read ref R1, check if it is C2
> * read reflog for R1, see that it has a C1 => C2 update
> * read reflog for HEAD, see that it has a C1 => C2 update
>
> Tests for the loose/packed backend could directly mess with the
> on-disk files to test failure scenarios.
>
> With unittests like that, the tests can zoom in on the functionality
> of the ref backend, and provide more convenient coverage for
> dynamic/static analysis.

Yeah, I agree that something like that would really be great.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox