From: Elijah Newren <newren@gmail.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, "Eric W. Biederman" <ebiederm@gmail.com>,
Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>,
Patrick Steinhardt <ps@pks.im>
Subject: Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
Date: Fri, 10 Nov 2023 15:51:18 -0800 [thread overview]
Message-ID: <CABPp-BEfy9VOvimP9==ry_rZXu=metOQ8s=_-XiG_Pdx9c06Ww@mail.gmail.com> (raw)
In-Reply-To: <3595db76a525fcebc3c896e231246704b044310c.1698101088.git.me@ttaylorr.com>
Hi,
Sorry for taking so long to find some time to review. And sorry for
the bad news, but...
On Mon, Oct 23, 2023 at 3:45 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> When using merge-tree often within a repository[^1], it is possible to
> generate a relatively large number of loose objects, which can result in
> degraded performance, and inode exhaustion in extreme cases.
>
> Building on the functionality introduced in previous commits, the
> bulk-checkin machinery now has support to write arbitrary blob and tree
> objects which are small enough to be held in-core. We can use this to
> write any blob/tree objects generated by ORT into a separate pack
> instead of writing them out individually as loose.
>
> This functionality is gated behind a new `--write-pack` option to
> `merge-tree` that works with the (non-deprecated) `--write-tree` mode.
>
> The implementation is relatively straightforward. There are two spots
> within the ORT mechanism where we call `write_object_file()`, one for
> content differences within blobs, and another to assemble any new trees
> necessary to construct the merge. In each of those locations,
> conditionally replace calls to `write_object_file()` with
> `index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
> depending on which kind of object we are writing.
>
> The only remaining task is to begin and end the transaction necessary to
> initialize the bulk-checkin machinery, and move any new pack(s) it
> created into the main object store.
I believe the above is built on an assumption that any objects written
will not need to be read until after the merge is completed. And we
might have a nesting issue too...
> @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
> if ((merge_status < 0) || !result_buf.ptr)
> ret = error(_("failed to execute internal merge"));
>
> - if (!ret &&
> - write_object_file(result_buf.ptr, result_buf.size,
> - OBJ_BLOB, &result->oid))
> - ret = error(_("unable to add %s to database"), path);
> + if (!ret) {
> + ret = opt->write_pack
> + ? index_blob_bulk_checkin_incore(&result->oid,
> + result_buf.ptr,
> + result_buf.size,
> + path, 1)
> + : write_object_file(result_buf.ptr,
> + result_buf.size,
> + OBJ_BLOB, &result->oid);
> + if (ret)
> + ret = error(_("unable to add %s to database"),
> + path);
> + }
>
> free(result_buf.ptr);
> if (ret)
This is unsafe; the object may need to be read later within the same
merge. Perhaps the simplest example related to your testcase is
modifying the middle section of that testcase (I'll highlight where
when I comment on the testcase) to read:
test_commit -C repo base file "$(test_seq 3 5)" &&
git -C repo branch -M main &&
git -C repo checkout -b side main &&
test_commit -C repo side-file file "$(test_seq 1 5)" &&
test_commit -C repo side-file2 file2 "$(test_seq 1 6)" &&
git -C repo checkout -b other main &&
test_commit -C repo other-file file "$(test_seq 3 7)" &&
git -C repo mv file file2 &&
git -C repo commit -m other-file2 &&
find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
git -C repo merge-tree --write-pack side other &&
In words, what I'm doing here is having both sides modify "file" (the
same as you did) but also having one side rename "file"->"file2". The
side that doesn't rename "file" also introduces a new "file2". ort
needs to merge the three versions of "file" to get a new blob object,
and then merge that with the content from the brand new "file2". More
complicated cases are possible, of course. Anyway, with the modified
testcase above, merge-tree will fail with:
fatal: unable to read blob object 06e567b11dfdafeaf7d3edcc89864149383aeab6
I think (untested) that you could fix this by creating two packs
instead of just one. In particular, add a call to
flush_odb_transcation() after the "redo_after_renames" block in
merge_ort_nonrecursive_internal(). (It's probably easier to see that
you could place the flush_odb_transaction() call inside
detect_and_process_renames() just after the process_renames() call,
but when redo_after_renames is true you'd end up with three packs
instead of just two).
What happens with the odb transaction stuff if no new objects are
written before the call to flush_odb_transaction? Will that be a
problem?
(Since any tree written will not be re-read within the same merge, the
other write_object_file() call you changed does not have the same
problem as the one here.)
>@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
> fflush(stdout);
> BUG("dir_metadata accounting completely off; shouldn't happen");
> }
>- if (write_tree(result_oid, &dir_metadata.versions, 0,
>+ if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
> opt->repo->hash_algo->rawsz) < 0)
> ret = -1;
>
> +
> + if (opt->write_pack)
> + end_odb_transaction();
> +
Is the end_odb_transaction() here going to fail with an "Unbalanced
ODB transaction nesting" when faced with a recursive merge?
Perhaps flushing here, and then calling end_odb_transaction() in
merge_finalize(), much as you do in your replay-and-write-pack series,
should actually be moved to this commit and included here?
This does mean that for a recursive merge, that you'll get up to 2*N
packfiles, where N is the depth of the recursive merge.
> + test_commit -C repo base file "$(test_seq 3 5)" &&
> + git -C repo branch -M main &&
> + git -C repo checkout -b side main &&
> + test_commit -C repo side file "$(test_seq 1 5)" &&
> + git -C repo checkout -b other main &&
> + test_commit -C repo other file "$(test_seq 3 7)" &&
> +
> + find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
> + tree="$(git -C repo merge-tree --write-pack \
> + refs/tags/side refs/tags/other)" &&
These were the lines from your testcase that I replaced to show the bug.
next prev parent reply other threads:[~2023-11-10 23:51 UTC|newest]
Thread overview: 63+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
2023-10-19 17:28 ` [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
2023-10-20 7:35 ` Jeff King
2023-10-20 16:55 ` Junio C Hamano
2023-10-19 17:28 ` [PATCH v4 2/7] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
2023-10-19 17:28 ` [PATCH v4 3/7] bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source` Taylor Blau
2023-10-19 17:28 ` [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source` Taylor Blau
2023-10-23 9:19 ` Patrick Steinhardt
2023-10-23 18:58 ` Jeff King
2023-10-24 6:34 ` Patrick Steinhardt
2023-10-24 17:08 ` Junio C Hamano
2023-10-19 17:28 ` [PATCH v4 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-19 17:28 ` [PATCH v4 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-19 17:29 ` [PATCH v4 7/7] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-19 21:47 ` [PATCH v4 0/7] merge-ort: implement support for packing objects together Junio C Hamano
2023-10-20 7:29 ` Jeff King
2023-10-20 16:53 ` Junio C Hamano
2023-10-23 9:19 ` Patrick Steinhardt
2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
2023-10-23 22:44 ` [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
2023-10-25 7:37 ` Jeff King
2023-10-25 15:39 ` Taylor Blau
2023-10-27 23:12 ` Junio C Hamano
2023-10-23 22:44 ` [PATCH v5 2/5] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
2023-10-23 22:45 ` [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-25 7:58 ` Patrick Steinhardt
2023-10-25 15:44 ` Taylor Blau
2023-10-25 17:21 ` Eric Sunshine
2023-10-26 8:16 ` Patrick Steinhardt
2023-11-11 0:17 ` Elijah Newren
2023-10-23 22:45 ` [PATCH v5 4/5] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-23 22:45 ` [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-25 7:58 ` Patrick Steinhardt
2023-10-25 15:46 ` Taylor Blau
2023-11-10 23:51 ` Elijah Newren [this message]
2023-11-11 0:27 ` Junio C Hamano
2023-11-11 1:34 ` Taylor Blau
2023-11-11 1:24 ` Taylor Blau
2023-11-13 22:05 ` Jeff King
2023-11-14 1:40 ` Junio C Hamano
2023-11-14 2:54 ` Elijah Newren
2023-11-14 21:55 ` Jeff King
2023-11-14 3:08 ` Elijah Newren
2023-11-13 22:02 ` Jeff King
2023-11-13 22:34 ` Taylor Blau
2023-11-14 2:50 ` Elijah Newren
2023-11-14 21:53 ` Jeff King
2023-11-14 22:04 ` Jeff King
2023-10-23 23:31 ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Junio C Hamano
2023-11-06 15:46 ` Johannes Schindelin
2023-11-06 23:19 ` Junio C Hamano
2023-11-07 3:42 ` Jeff King
2023-11-07 15:58 ` Taylor Blau
2023-11-07 18:22 ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Taylor Blau
2023-11-07 18:22 ` [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step Taylor Blau
2023-11-11 3:45 ` Elijah Newren
2023-11-07 18:22 ` [RFC PATCH 2/3] tmp-objdir: introduce `tmp_objdir_repack()` Taylor Blau
2023-11-08 7:05 ` Patrick Steinhardt
2023-11-09 19:26 ` Taylor Blau
2023-11-07 18:23 ` [RFC PATCH 3/3] builtin/replay.c: introduce `--write-pack` Taylor Blau
2023-11-11 3:42 ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Elijah Newren
2023-11-11 4:04 ` Elijah Newren
2023-10-25 7:58 ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Patrick Steinhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CABPp-BEfy9VOvimP9==ry_rZXu=metOQ8s=_-XiG_Pdx9c06Ww@mail.gmail.com' \
--to=newren@gmail.com \
--cc=ebiederm@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).