From: Patrick Steinhardt <ps@pks.im>
To: Justin Tobler <jltobler@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 4/6] object-file: avoid fd seekback by checking object size upfront
Date: Tue, 31 Mar 2026 09:48:43 +0200 [thread overview]
Message-ID: <act8W1BEg6iyUpHB@pks.im> (raw)
In-Reply-To: <20260331033835.2863514-5-jltobler@gmail.com>
On Mon, Mar 30, 2026 at 10:38:33PM -0500, Justin Tobler wrote:
> In certain scenarios, Git handles writing blobs that exceed
> "core.bigFilesThreshold" differently by streaming the object directly
> into a packfile. When there is an active ODB transaction, these blobs
> are streamed to the same packfile instead of using a separate packfile
> for each. If "pack.packSizeLimit" is configured and streaming another
> object causes the packfile to exceed the configured limit, the packfile
> is truncated back to the previous object and the object write is
> restarted in a new packfile.
>
> This works fine, but requires the fd being read from to save a
> checkpoint so it becomes possible to rewind the input source via seeking
> back to a known offset at the beginning. In a subsequent commit, blob
> streaming is converted to use `struct odb_write_stream` as a more
> generic input source instead of an fd which doesn't provide a mechanism
> for rewinding.
>
> For this use case though, rewinding the fd is not strictly necessary
> because the inflated size of the object is known and can be used to
> approximate whether writing the object would cause the packfile to
> exceed the configured limit prior to writing anything. These blobs
> written to the packfile are never deltafied thus the size difference
s/deltafied/deltified/
> between what is written versus the inflated size is due to zlib
> compression. While this does prevent packfiles from being filled to the
> potential maximum is some cases, it should be good enough and still
> prevents the packfile from exceeding any configured limit.
>
> Use the inflated blob size to determine whether writing an object to a
> packfile will exceed the configured "pack.packSizeLimit".
I agree that this is a reasonable tradeoff:
- For small objects it's probably not going to make a huge difference,
as we'd at most waste a couple kilobytes.
- For large objects we can expect that we wouldn't use deltification
anyway due to "core.bigFileThreshold".
- We can expect that in many cases large files will be not compress
well, either, as it's more likely than not that a file >512MB (our
default limit for "core.bigFileThreshold") is going to be a binary
file.
We also nicely document "pack.packSizeLimit" as "rarely useful, and may
result in a larger total on-disk size" in git-config(1), so I think it's
fair to not bend ourselves over backwards just to make a rarely-useful
feature work exactly the same.
> diff --git a/object-file.c b/object-file.c
> index 493173eaf4..1de2244ac5 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1473,15 +1461,10 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
> if ((size_t)read_result != rsize)
> die("failed to read %u bytes from '%s'",
> (unsigned)rsize, path);
> - offset += rsize;
> - if (*already_hashed_to < offset) {
> - size_t hsize = offset - *already_hashed_to;
> - if (rsize < hsize)
> - hsize = rsize;
> - if (hsize)
> - git_hash_update(ctx, ibuf, hsize);
> - *already_hashed_to = offset;
> - }
> +
> + if (rsize)
> + git_hash_update(ctx, ibuf, rsize);
Is this guard really needed? I wouldn't expect that we ever try to read
zero bytes into `ibuf`, and we bail in case we didn't receive the
expected number of bytes.
And even if we did, `git_hash_update()` works just fine with no data.
> @@ -1592,48 +1566,34 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
> size_t size, const char *path)
> {
> struct transaction_packfile *state = &transaction->packfile;
> - off_t seekback, already_hashed_to;
> struct git_hash_ctx ctx;
> unsigned char obuf[16384];
> unsigned header_len;
> struct hashfile_checkpoint checkpoint;
> struct pack_idx_entry *idx;
>
> - seekback = lseek(fd, 0, SEEK_CUR);
> - if (seekback == (off_t)-1)
> - return error("cannot find the current offset");
Okay, no seeking necessary because we don't restart the write anymore.
> header_len = format_object_header((char *)obuf, sizeof(obuf),
> OBJ_BLOB, size);
> transaction->base.source->odb->repo->hash_algo->init_fn(&ctx);
> git_hash_update(&ctx, obuf, header_len);
>
> + /*
> + * If writing another object to the packfile could result in it
> + * exceeding the configured size limit, flush the current packfile
> + * transaction.
> + */
Do we want to document that this intentionally works on the inflated
size, not the deflated one, with the arguments mentioned in the commit
message?
> + if (state->nr_written && pack_size_limit_cfg &&
> + pack_size_limit_cfg < state->offset + size)
> + flush_packfile_transaction(transaction);
And we now flush the packfile before writing any object that may cause
us to bust the size limit. Makes sense.
> CALLOC_ARRAY(idx, 1);
> prepare_packfile_transaction(transaction);
> hashfile_checkpoint_init(state->f, &checkpoint);
>
> - already_hashed_to = 0;
> -
> - while (1) {
> - prepare_packfile_transaction(transaction);
> - hashfile_checkpoint(state->f, &checkpoint);
> - idx->offset = state->offset;
> - crc32_begin(state->f);
> -
> - if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
> - fd, size, path))
> - break;
> - /*
> - * Writing this object to the current pack will make
> - * it too big; we need to truncate it, start a new
> - * pack, and write into it.
> - */
> - hashfile_truncate(state->f, &checkpoint);
> - state->offset = checkpoint.offset;
> - flush_packfile_transaction(transaction);
> - if (lseek(fd, seekback, SEEK_SET) == (off_t)-1)
> - return error("cannot seek back");
> - }
Hm. I was briefly wondering whether we'd loop indefinitely in case the
object alone is bigger than the packsize. But we have an escape hatch in
`stream_blob_to_pack()` that special-cases when we haven't written any
data yet, so the answer is "no".
> + hashfile_checkpoint(state->f, &checkpoint);
> + idx->offset = state->offset;
> + crc32_begin(state->f);
> + stream_blob_to_pack(state, &ctx, fd, size, path);
> git_hash_final_oid(result_oid, &ctx);
Thanks!
Patrick
next prev parent reply other threads:[~2026-03-31 7:48 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-31 3:38 [PATCH 0/6] odb: add write operation to ODB transaction interface Justin Tobler
2026-03-31 3:38 ` [PATCH 1/6] odb: split `struct odb_transaction` into separate header Justin Tobler
2026-03-31 7:48 ` Patrick Steinhardt
2026-03-31 13:56 ` Justin Tobler
2026-03-31 15:58 ` Junio C Hamano
2026-03-31 16:44 ` Justin Tobler
2026-03-31 3:38 ` [PATCH 2/6] odb/transaction: use pluggable `begin_transaction()` Justin Tobler
2026-03-31 7:48 ` Patrick Steinhardt
2026-03-31 3:38 ` [PATCH 3/6] object-file: remove flags from transaction packfile writes Justin Tobler
2026-03-31 7:48 ` Patrick Steinhardt
2026-03-31 14:10 ` Justin Tobler
2026-03-31 3:38 ` [PATCH 4/6] object-file: avoid fd seekback by checking object size upfront Justin Tobler
2026-03-31 7:48 ` Patrick Steinhardt [this message]
2026-03-31 14:14 ` Justin Tobler
2026-03-31 3:38 ` [PATCH 5/6] object-file: generalize packfile writes to use odb_write_stream Justin Tobler
2026-03-31 7:48 ` Patrick Steinhardt
2026-03-31 14:31 ` Justin Tobler
2026-03-31 22:59 ` Patrick Steinhardt
2026-03-31 23:21 ` Justin Tobler
2026-03-31 23:40 ` Patrick Steinhardt
2026-03-31 3:38 ` [PATCH 6/6] odb/transaction: make `write_object_stream()` pluggable Justin Tobler
2026-03-31 7:48 ` Patrick Steinhardt
2026-03-31 14:40 ` Justin Tobler
2026-04-01 3:03 ` [PATCH v2 0/7] odb: add write operation to ODB transaction interface Justin Tobler
2026-04-01 3:03 ` [PATCH v2 1/7] odb: split `struct odb_transaction` into separate header Justin Tobler
2026-04-01 3:03 ` [PATCH v2 2/7] odb/transaction: use pluggable `begin_transaction()` Justin Tobler
2026-04-01 3:03 ` [PATCH v2 3/7] odb: update `struct odb_write_stream` read() callback Justin Tobler
2026-04-01 11:23 ` Patrick Steinhardt
2026-04-01 3:03 ` [PATCH v2 4/7] object-file: remove flags from transaction packfile writes Justin Tobler
2026-04-01 11:23 ` Patrick Steinhardt
2026-04-01 14:02 ` Justin Tobler
2026-04-01 3:03 ` [PATCH v2 5/7] object-file: avoid fd seekback by checking object size upfront Justin Tobler
2026-04-01 3:03 ` [PATCH v2 6/7] object-file: generalize packfile writes to use odb_write_stream Justin Tobler
2026-04-01 3:03 ` [PATCH v2 7/7] odb/transaction: make `write_object_stream()` pluggable Justin Tobler
2026-04-01 11:24 ` [PATCH v2 0/7] odb: add write operation to ODB transaction interface Patrick Steinhardt
2026-04-02 21:32 ` [PATCH v3 " Justin Tobler
2026-04-02 21:32 ` [PATCH v3 1/7] odb: split `struct odb_transaction` into separate header Justin Tobler
2026-04-02 21:32 ` [PATCH v3 2/7] odb/transaction: use pluggable `begin_transaction()` Justin Tobler
2026-04-02 21:32 ` [PATCH v3 3/7] odb: update `struct odb_write_stream` read() callback Justin Tobler
2026-05-11 17:58 ` Jeff King
2026-05-12 15:19 ` Justin Tobler
2026-04-02 21:32 ` [PATCH v3 4/7] object-file: remove flags from transaction packfile writes Justin Tobler
2026-04-06 20:16 ` Jeff King
2026-04-06 20:19 ` Jeff King
2026-04-02 21:32 ` [PATCH v3 5/7] object-file: avoid fd seekback by checking object size upfront Justin Tobler
2026-04-02 21:32 ` [PATCH v3 6/7] object-file: generalize packfile writes to use odb_write_stream Justin Tobler
2026-04-02 21:32 ` [PATCH v3 7/7] odb/transaction: make `write_object_stream()` pluggable Justin Tobler
2026-04-08 7:25 ` [PATCH v3 0/7] odb: add write operation to ODB transaction interface Patrick Steinhardt
2026-05-14 18:37 ` [PATCH v4 " Justin Tobler
2026-05-14 18:37 ` [PATCH v4 1/7] odb: split `struct odb_transaction` into separate header Justin Tobler
2026-05-14 18:37 ` [PATCH v4 2/7] odb/transaction: use pluggable `begin_transaction()` Justin Tobler
2026-05-14 18:37 ` [PATCH v4 3/7] odb: update `struct odb_write_stream` read() callback Justin Tobler
2026-05-14 18:37 ` [PATCH v4 4/7] object-file: remove flags from transaction packfile writes Justin Tobler
2026-05-14 18:37 ` [PATCH v4 5/7] object-file: avoid fd seekback by checking object size upfront Justin Tobler
2026-05-14 18:37 ` [PATCH v4 6/7] object-file: generalize packfile writes to use odb_write_stream Justin Tobler
2026-05-14 18:37 ` [PATCH v4 7/7] odb/transaction: make `write_object_stream()` pluggable Justin Tobler
2026-05-15 3:56 ` [PATCH v4 0/7] odb: add write operation to ODB transaction interface Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=act8W1BEg6iyUpHB@pks.im \
--to=ps@pks.im \
--cc=git@vger.kernel.org \
--cc=jltobler@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox