From: Justin Tobler <jltobler@gmail.com>
To: git@vger.kernel.org
Cc: ps@pks.im, gitster@pobox.com, Justin Tobler <jltobler@gmail.com>
Subject: [PATCH v2 5/7] object-file: avoid fd seekback by checking object size upfront
Date: Tue, 31 Mar 2026 22:03:13 -0500 [thread overview]
Message-ID: <20260401030316.1847362-6-jltobler@gmail.com> (raw)
In-Reply-To: <20260401030316.1847362-1-jltobler@gmail.com>
In certain scenarios, Git handles writing blobs that exceed
"core.bigFileThreshold" differently by streaming the object directly
into a packfile. When there is an active ODB transaction, these blobs
are all streamed to the same packfile instead of each using a separate
packfile. If "pack.packSizeLimit" is configured and streaming another
object would cause the packfile to exceed the configured limit, the
packfile is truncated back to the previous object and the object write
is restarted in a new packfile.
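For reference, the pre-change retry flow looks roughly like this
(condensed from the removed lines in the hunk below; error handling
elided):

    seekback = lseek(fd, 0, SEEK_CUR);
    while (1) {
            hashfile_checkpoint(state->f, &checkpoint);
            if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
                                     fd, size, path))
                    break;  /* object fit within the size limit */
            /* over the limit: truncate, start a new pack, rewind input */
            hashfile_truncate(state->f, &checkpoint);
            flush_packfile_transaction(transaction);
            lseek(fd, seekback, SEEK_SET);
    }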
This works fine, but requires saving a checkpoint of the fd being read
from so that the input source can be rewound to a known offset and the
write restarted from the beginning. In a subsequent commit, blob
streaming is converted to use `struct odb_write_stream` as a more
generic input source instead of an fd; this interface does not provide
a mechanism for rewinding.
For this use case though, rewinding the fd is not strictly necessary:
the inflated size of the object is known up front and can be used to
approximate, before anything is written, whether writing the object
would cause the packfile to exceed the configured limit. Blobs written
to the packfile this way are never deltified, so any difference between
the on-disk size and the inflated size is due solely to zlib
compression. While this may prevent packfiles from being filled to
their potential maximum in some cases, it should be good enough and
still prevents the packfile from exceeding any configured limit.
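Concretely, the upfront check (as added in the hunk below) reduces to:

    /* flush first if even the inflated size would bust the limit */
    if (state->nr_written && pack_size_limit_cfg &&
        pack_size_limit_cfg < state->offset + size)
            flush_packfile_transaction(transaction);

For example, with "pack.packSizeLimit" set to 100 MiB and the current
pack at 90 MiB, a blob with an inflated size of 15 MiB triggers a flush
even though it might deflate to well under 10 MiB; the pack is closed
slightly early, but the limit is never exceeded.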
Use the inflated blob size to determine upfront whether writing an
object would cause the packfile to exceed the configured
"pack.packSizeLimit".
Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
object-file.c | 86 +++++++++++++++------------------------------------
1 file changed, 25 insertions(+), 61 deletions(-)
diff --git a/object-file.c b/object-file.c
index f317a24ccf..23229fbd95 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1445,29 +1445,17 @@ static int hash_blob_stream(struct odb_write_stream *stream,
/*
* Read the contents from fd for size bytes, streaming it to the
- * packfile in state while updating the hash in ctx. Signal a failure
- * by returning a negative value when the resulting pack would exceed
- * the pack size limit and this is not the first object in the pack,
- * so that the caller can discard what we wrote from the current pack
- * by truncating it and opening a new one. The caller will then call
- * us again after rewinding the input fd.
- *
- * The already_hashed_to pointer is kept untouched by the caller to
- * make sure we do not hash the same byte when we are called
- * again. This way, the caller does not have to checkpoint its hash
- * status before calling us just in case we ask it to call us again
- * with a new pack.
+ * packfile in state while updating the hash in ctx.
*/
-static int stream_blob_to_pack(struct transaction_packfile *state,
- struct git_hash_ctx *ctx, off_t *already_hashed_to,
- int fd, size_t size, const char *path)
+static void stream_blob_to_pack(struct transaction_packfile *state,
+ struct git_hash_ctx *ctx, int fd, size_t size,
+ const char *path)
{
git_zstream s;
unsigned char ibuf[16384];
unsigned char obuf[16384];
unsigned hdrlen;
int status = Z_OK;
- off_t offset = 0;
git_deflate_init(&s, pack_compression_level);
@@ -1484,15 +1472,9 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
if ((size_t)read_result != rsize)
die("failed to read %u bytes from '%s'",
(unsigned)rsize, path);
- offset += rsize;
- if (*already_hashed_to < offset) {
- size_t hsize = offset - *already_hashed_to;
- if (rsize < hsize)
- hsize = rsize;
- if (hsize)
- git_hash_update(ctx, ibuf, hsize);
- *already_hashed_to = offset;
- }
+
+ git_hash_update(ctx, ibuf, rsize);
+
s.next_in = ibuf;
s.avail_in = rsize;
size -= rsize;
@@ -1503,14 +1485,6 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
if (!s.avail_out || status == Z_STREAM_END) {
size_t written = s.next_out - obuf;
- /* would we bust the size limit? */
- if (state->nr_written &&
- pack_size_limit_cfg &&
- pack_size_limit_cfg < state->offset + written) {
- git_deflate_abort(&s);
- return -1;
- }
-
hashwrite(state->f, obuf, written);
state->offset += written;
s.next_out = obuf;
@@ -1527,7 +1501,6 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
}
}
git_deflate_end(&s);
- return 0;
}
static void flush_packfile_transaction(struct odb_transaction_files *transaction)
@@ -1603,48 +1576,39 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
size_t size, const char *path)
{
struct transaction_packfile *state = &transaction->packfile;
- off_t seekback, already_hashed_to;
struct git_hash_ctx ctx;
unsigned char obuf[16384];
unsigned header_len;
struct hashfile_checkpoint checkpoint;
struct pack_idx_entry *idx;
- seekback = lseek(fd, 0, SEEK_CUR);
- if (seekback == (off_t)-1)
- return error("cannot find the current offset");
-
header_len = format_object_header((char *)obuf, sizeof(obuf),
OBJ_BLOB, size);
transaction->base.source->odb->repo->hash_algo->init_fn(&ctx);
git_hash_update(&ctx, obuf, header_len);
+ /*
+ * If writing another object to the packfile could result in it
+ * exceeding the configured size limit, flush the current packfile
+ * transaction.
+ *
+ * Note that this uses the inflated object size as an approximation.
+ * Blob objects written in this manner are not delta-compressed, so
+ * the difference between the inflated and on-disk size is limited
+ * to zlib compression and is sufficient for this check.
+ */
+ if (state->nr_written && pack_size_limit_cfg &&
+ pack_size_limit_cfg < state->offset + size)
+ flush_packfile_transaction(transaction);
+
CALLOC_ARRAY(idx, 1);
prepare_packfile_transaction(transaction);
hashfile_checkpoint_init(state->f, &checkpoint);
- already_hashed_to = 0;
-
- while (1) {
- prepare_packfile_transaction(transaction);
- hashfile_checkpoint(state->f, &checkpoint);
- idx->offset = state->offset;
- crc32_begin(state->f);
-
- if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
- fd, size, path))
- break;
- /*
- * Writing this object to the current pack will make
- * it too big; we need to truncate it, start a new
- * pack, and write into it.
- */
- hashfile_truncate(state->f, &checkpoint);
- state->offset = checkpoint.offset;
- flush_packfile_transaction(transaction);
- if (lseek(fd, seekback, SEEK_SET) == (off_t)-1)
- return error("cannot seek back");
- }
+ hashfile_checkpoint(state->f, &checkpoint);
+ idx->offset = state->offset;
+ crc32_begin(state->f);
+ stream_blob_to_pack(state, &ctx, fd, size, path);
git_hash_final_oid(result_oid, &ctx);
idx->crc32 = crc32_end(state->f);
--
2.53.0.381.g628a66ccf6