From: Patrick Steinhardt <ps@pks.im>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, Elijah Newren <newren@gmail.com>,
	"Eric W. Biederman" <ebiederm@gmail.com>,
	Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
Date: Wed, 25 Oct 2023 09:58:11 +0200
Message-ID: <ZTjKk8E55M7lQN1m@tanuki>
In-Reply-To: <3595db76a525fcebc3c896e231246704b044310c.1698101088.git.me@ttaylorr.com>

On Mon, Oct 23, 2023 at 06:45:06PM -0400, Taylor Blau wrote:
> When using merge-tree often within a repository[^1], it is possible to
> generate a relatively large number of loose objects, which can result in
> degraded performance, and inode exhaustion in extreme cases.
> 
> Building on the functionality introduced in previous commits, the
> bulk-checkin machinery now has support to write arbitrary blob and tree
> objects which are small enough to be held in-core. We can use this to
> write any blob/tree objects generated by ORT into a separate pack
> instead of writing them out individually as loose.
> 
> This functionality is gated behind a new `--write-pack` option to
> `merge-tree` that works with the (non-deprecated) `--write-tree` mode.
> 
> The implementation is relatively straightforward. There are two spots
> within the ORT mechanism where we call `write_object_file()`, one for
> content differences within blobs, and another to assemble any new trees
> necessary to construct the merge. In each of those locations,
> conditionally replace calls to `write_object_file()` with
> `index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
> depending on which kind of object we are writing.
> 
> The only remaining task is to begin and end the transaction necessary to
> initialize the bulk-checkin machinery, and move any new pack(s) it
> created into the main object store.
> 
> [^1]: Such is the case at GitHub, where we run presumptive "test merges"
>   on open pull requests to see whether or not we can light up the merge
>   button green depending on whether or not the presumptive merge was
>   conflicted.
> 
>   This is done in response to a number of user-initiated events,
>   including viewing an open pull request whose last test merge is stale
>   with respect to the current base and tip of the pull request. As a
>   result, merge-tree can be run very frequently on large, active
>   repositories.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/git-merge-tree.txt |  4 ++
>  builtin/merge-tree.c             |  5 ++
>  merge-ort.c                      | 42 +++++++++++----
>  merge-recursive.h                |  1 +
>  t/t4301-merge-tree-write-tree.sh | 93 ++++++++++++++++++++++++++++++++
>  5 files changed, 136 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt
> index ffc4fbf7e8..9d37609ef1 100644
> --- a/Documentation/git-merge-tree.txt
> +++ b/Documentation/git-merge-tree.txt
> @@ -69,6 +69,10 @@ OPTIONS
>  	specify a merge-base for the merge, and specifying multiple bases is
>  	currently not supported. This option is incompatible with `--stdin`.
>  
> +--write-pack::
> +	Write any new objects into a separate packfile instead of as
> +	individual loose objects.
> +
>  [[OUTPUT]]
>  OUTPUT
>  ------
> diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c
> index a35e0452d6..218442ac9b 100644
> --- a/builtin/merge-tree.c
> +++ b/builtin/merge-tree.c
> @@ -19,6 +19,7 @@
>  #include "tree.h"
>  #include "config.h"
>  #include "strvec.h"
> +#include "bulk-checkin.h"
>  
>  static int line_termination = '\n';
>  
> @@ -416,6 +417,7 @@ struct merge_tree_options {
>  	int name_only;
>  	int use_stdin;
>  	struct merge_options merge_options;
> +	int write_pack;
>  };
>  
>  static int real_merge(struct merge_tree_options *o,
> @@ -441,6 +443,7 @@ static int real_merge(struct merge_tree_options *o,
>  				 _("not something we can merge"));
>  
>  	opt.show_rename_progress = 0;
> +	opt.write_pack = o->write_pack;
>  
>  	opt.branch1 = branch1;
>  	opt.branch2 = branch2;
> @@ -553,6 +556,8 @@ int cmd_merge_tree(int argc, const char **argv, const char *prefix)
>  			   N_("specify a merge-base for the merge")),
>  		OPT_STRVEC('X', "strategy-option", &xopts, N_("option=value"),
>  			N_("option for selected merge strategy")),
> +		OPT_BOOL(0, "write-pack", &o.write_pack,
> +			 N_("write new objects to a pack instead of as loose")),
>  		OPT_END()
>  	};
>  
> diff --git a/merge-ort.c b/merge-ort.c
> index 3653725661..523577d71e 100644
> --- a/merge-ort.c
> +++ b/merge-ort.c
> @@ -48,6 +48,7 @@
>  #include "tree.h"
>  #include "unpack-trees.h"
>  #include "xdiff-interface.h"
> +#include "bulk-checkin.h"
>  
>  /*
>   * We have many arrays of size 3.  Whenever we have such an array, the
> @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
>  		if ((merge_status < 0) || !result_buf.ptr)
>  			ret = error(_("failed to execute internal merge"));
>  
> -		if (!ret &&
> -		    write_object_file(result_buf.ptr, result_buf.size,
> -				      OBJ_BLOB, &result->oid))
> -			ret = error(_("unable to add %s to database"), path);
> +		if (!ret) {
> +			ret = opt->write_pack
> +				? index_blob_bulk_checkin_incore(&result->oid,
> +								 result_buf.ptr,
> +								 result_buf.size,
> +								 path, 1)
> +				: write_object_file(result_buf.ptr,
> +						    result_buf.size,
> +						    OBJ_BLOB, &result->oid);
> +			if (ret)
> +				ret = error(_("unable to add %s to database"),
> +					    path);
> +		}
>  
>  		free(result_buf.ptr);
>  		if (ret)
> @@ -3597,7 +3607,8 @@ static int tree_entry_order(const void *a_, const void *b_)
>  				 b->string, strlen(b->string), bmi->result.mode);
>  }
>  
> -static int write_tree(struct object_id *result_oid,
> +static int write_tree(struct merge_options *opt,
> +		      struct object_id *result_oid,
>  		      struct string_list *versions,
>  		      unsigned int offset,
>  		      size_t hash_size)
> @@ -3631,8 +3642,14 @@ static int write_tree(struct object_id *result_oid,
>  	}
>  
>  	/* Write this object file out, and record in result_oid */
> -	if (write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid))
> +	ret = opt->write_pack
> +		? index_tree_bulk_checkin_incore(result_oid,
> +						 buf.buf, buf.len, "", 1)
> +		: write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid);
> +
> +	if (ret)
>  		ret = -1;
> +
>  	strbuf_release(&buf);
>  	return ret;
>  }
> @@ -3797,8 +3814,8 @@ static int write_completed_directory(struct merge_options *opt,
>  		 */
>  		dir_info->is_null = 0;
>  		dir_info->result.mode = S_IFDIR;
> -		if (write_tree(&dir_info->result.oid, &info->versions, offset,
> -			       opt->repo->hash_algo->rawsz) < 0)
> +		if (write_tree(opt, &dir_info->result.oid, &info->versions,
> +			       offset, opt->repo->hash_algo->rawsz) < 0)
>  			ret = -1;
>  	}
>  
> @@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
>  		fflush(stdout);
>  		BUG("dir_metadata accounting completely off; shouldn't happen");
>  	}
> -	if (write_tree(result_oid, &dir_metadata.versions, 0,
> +	if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
>  		       opt->repo->hash_algo->rawsz) < 0)
>  		ret = -1;
> +
> +	if (opt->write_pack)
> +		end_odb_transaction();
> +
>  cleanup:
>  	string_list_clear(&plist, 0);
>  	string_list_clear(&dir_metadata.versions, 0);
> @@ -4878,6 +4899,9 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
>  	 */
>  	strmap_init(&opt->priv->conflicts);
>  
> +	if (opt->write_pack)
> +		begin_odb_transaction();
> +
>  	trace2_region_leave("merge", "allocate/init", opt->repo);
>  }
>  
> diff --git a/merge-recursive.h b/merge-recursive.h
> index 3d3b3e3c29..5c5ff380a8 100644
> --- a/merge-recursive.h
> +++ b/merge-recursive.h
> @@ -48,6 +48,7 @@ struct merge_options {
>  	unsigned renormalize : 1;
>  	unsigned record_conflict_msgs_as_headers : 1;
>  	const char *msg_header_prefix;
> +	unsigned write_pack : 1;
>  
>  	/* internal fields used by the implementation */
>  	struct merge_options_internal *priv;
> diff --git a/t/t4301-merge-tree-write-tree.sh b/t/t4301-merge-tree-write-tree.sh
> index b2c8a43fce..d2a8634523 100755
> --- a/t/t4301-merge-tree-write-tree.sh
> +++ b/t/t4301-merge-tree-write-tree.sh
> @@ -945,4 +945,97 @@ test_expect_success 'check the input format when --stdin is passed' '
>  	test_cmp expect actual
>  '
>  
> +packdir=".git/objects/pack"
> +
> +test_expect_success 'merge-tree can pack its result with --write-pack' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +
> +	# base has lines [3, 4, 5]
> +	#   - side adds to the beginning, resulting in [1, 2, 3, 4, 5]
> +	#   - other adds to the end, resulting in [3, 4, 5, 6, 7]
> +	#
> +	# merging the two should result in a new blob object containing
> +	# [1, 2, 3, 4, 5, 6, 7], along with a new tree.
> +	test_commit -C repo base file "$(test_seq 3 5)" &&
> +	git -C repo branch -M main &&
> +	git -C repo checkout -b side main &&
> +	test_commit -C repo side file "$(test_seq 1 5)" &&
> +	git -C repo checkout -b other main &&
> +	test_commit -C repo other file "$(test_seq 3 7)" &&
> +
> +	find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
> +	tree="$(git -C repo merge-tree --write-pack \
> +		refs/tags/side refs/tags/other)" &&
> +	blob="$(git -C repo rev-parse $tree:file)" &&
> +	find repo/$packdir -type f -name "pack-*.idx" >packs.after &&

While we do assert that a new packfile gets written, we don't assert
that none of the new objects were written as loose objects as well. Do
we want to tighten the checks to verify that?
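
One possible shape for such a check, as a rough sketch rather than a
tested change: snapshot the set of loose object files before and after
the merge and require the two snapshots to be identical. The helper name
`list_loose_objects` and the `loose.before`/`loose.after` file names are
made up for illustration and are not part of the patch. The demonstration
runs against a fake object directory so that it does not depend on
`--write-pack` being available.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): print every loose object
# file below the object directory "$1". Loose objects live in fan-out
# directories named after two hex digits, so "pack/" and "info/" are
# skipped by the -path pattern.
list_loose_objects () {
	(
		cd "$1" &&
		find . -type f -path "./[0-9a-f][0-9a-f]/*" | sort
	)
}

# Stand-alone demonstration on a fake object directory.
objdir=$(mktemp -d) &&
mkdir "$objdir/pack" "$objdir/info" "$objdir/ab" &&
: >"$objdir/pack/pack-1234.idx" &&
: >"$objdir/ab/cdef" &&
list_loose_objects "$objdir"

# In the test itself, the idea would be roughly:
#
#	list_loose_objects repo/.git/objects >loose.before &&
#	tree="$(git -C repo merge-tree --write-pack \
#		refs/tags/side refs/tags/other)" &&
#	list_loose_objects repo/.git/objects >loose.after &&
#	test_cmp loose.before loose.after
```

Comparing the two listings rather than asserting that no loose objects
exist keeps the check independent of the loose objects that the
preceding `test_commit` calls create.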

Patrick

> +	test_must_be_empty packs.before &&
> +	test_line_count = 1 packs.after &&
> +
> +	git show-index <$(cat packs.after) >objects &&
> +	test_line_count = 2 objects &&
> +	grep "^[1-9][0-9]* $tree" objects &&
> +	grep "^[1-9][0-9]* $blob" objects
> +'
> +
> +test_expect_success 'merge-tree can write multiple packs with --write-pack' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		git config pack.packSizeLimit 512 &&
> +
> +		test_seq 512 >f &&
> +
> +		# "f" contains roughly ~2,000 bytes.
> +		#
> +		# Each side ("foo" and "bar") adds a small amount of data at the
> +		# beginning and end of "base", respectively.
> +		git add f &&
> +		test_tick &&
> +		git commit -m base &&
> +		git branch -M main &&
> +
> +		git checkout -b foo main &&
> +		{
> +			echo foo && cat f
> +		} >f.tmp &&
> +		mv f.tmp f &&
> +		git add f &&
> +		test_tick &&
> +		git commit -m foo &&
> +
> +		git checkout -b bar main &&
> +		echo bar >>f &&
> +		git add f &&
> +		test_tick &&
> +		git commit -m bar &&
> +
> +		find $packdir -type f -name "pack-*.idx" >packs.before &&
> +		# Merging either side should result in a new object which is
> +		# larger than 1M, thus the result should be split into two
> +		# separate packs.
> +		tree="$(git merge-tree --write-pack \
> +			refs/heads/foo refs/heads/bar)" &&
> +		blob="$(git rev-parse $tree:f)" &&
> +		find $packdir -type f -name "pack-*.idx" >packs.after &&
> +
> +		test_must_be_empty packs.before &&
> +		test_line_count = 2 packs.after &&
> +		for idx in $(cat packs.after)
> +		do
> +			git show-index <$idx || return 1
> +		done >objects &&
> +
> +		# The resulting set of packs should contain one copy of both
> +		# objects, each in a separate pack.
> +		test_line_count = 2 objects &&
> +		grep "^[1-9][0-9]* $tree" objects &&
> +		grep "^[1-9][0-9]* $blob" objects
> +
> +	)
> +'
> +
>  test_done
> -- 
> 2.42.0.425.g963d08ddb3.dirty
