Git development
From: Patrick Steinhardt <ps@pks.im>
To: Justin Tobler <jltobler@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 4/6] object-file: avoid fd seekback by checking object size upfront
Date: Tue, 31 Mar 2026 09:48:43 +0200	[thread overview]
Message-ID: <act8W1BEg6iyUpHB@pks.im> (raw)
In-Reply-To: <20260331033835.2863514-5-jltobler@gmail.com>

On Mon, Mar 30, 2026 at 10:38:33PM -0500, Justin Tobler wrote:
> In certain scenarios, Git handles writing blobs that exceed
> "core.bigFileThreshold" differently by streaming the object directly
> into a packfile. When there is an active ODB transaction, these blobs
> are streamed to the same packfile instead of using a separate packfile
> for each. If "pack.packSizeLimit" is configured and streaming another
> object causes the packfile to exceed the configured limit, the packfile
> is truncated back to the previous object and the object write is
> restarted in a new packfile.
> 
> This works fine, but requires the fd being read from to save a
> checkpoint so it becomes possible to rewind the input source via seeking
> back to a known offset at the beginning. In a subsequent commit, blob
> streaming is converted to use `struct odb_write_stream` as a more
> generic input source instead of an fd which doesn't provide a mechanism
> for rewinding.
> 
> For this use case though, rewinding the fd is not strictly necessary
> because the inflated size of the object is known and can be used to
> approximate whether writing the object would cause the packfile to
> exceed the configured limit prior to writing anything. These blobs
> written to the packfile are never deltafied thus the size difference

s/deltafied/deltified/

> between what is written versus the inflated size is due to zlib
> compression. While this does prevent packfiles from being filled to the
> potential maximum in some cases, it should be good enough and still
> prevents the packfile from exceeding any configured limit.
> 
> Use the inflated blob size to determine whether writing an object to a
> packfile will exceed the configured "pack.packSizeLimit".

I agree that this is a reasonable tradeoff:

  - For small objects it's probably not going to make a huge difference,
    as we'd at most waste a couple kilobytes.

  - For large objects we can expect that we wouldn't use deltification
    anyway due to "core.bigFileThreshold".

  - We can expect that in many cases large files will not compress
    well either, as it's more likely than not that a file >512MB (our
    default limit for "core.bigFileThreshold") is going to be a binary
    file.
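That third point is easy to sanity-check. The following is an illustrative
Python sketch using the stdlib zlib module (not Git's actual zlib usage):
repetitive text shrinks dramatically under deflate, while random "binary"
data barely changes size, so for large binary blobs the inflated size is a
close stand-in for the on-disk size:

```python
import os
import zlib

# Highly repetitive text: compresses extremely well.
text = b"hello world\n" * 10000          # 120,000 bytes inflated
text_ratio = len(zlib.compress(text)) / len(text)

# Random bytes: essentially incompressible, deflate adds only
# a few bytes of framing overhead per stored block.
binary = os.urandom(120_000)
binary_ratio = len(zlib.compress(binary)) / len(binary)
```

Here `text_ratio` comes out well under 1% while `binary_ratio` stays at
roughly 1.0, which is the scenario the inflated-size approximation banks on.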

We also nicely document "pack.packSizeLimit" as "rarely useful, and may
result in a larger total on-disk size" in git-config(1), so I think it's
fair not to bend over backwards just to make a rarely-useful feature
work exactly the same.

> diff --git a/object-file.c b/object-file.c
> index 493173eaf4..1de2244ac5 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1473,15 +1461,10 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
>  			if ((size_t)read_result != rsize)
>  				die("failed to read %u bytes from '%s'",
>  				    (unsigned)rsize, path);
> -			offset += rsize;
> -			if (*already_hashed_to < offset) {
> -				size_t hsize = offset - *already_hashed_to;
> -				if (rsize < hsize)
> -					hsize = rsize;
> -				if (hsize)
> -					git_hash_update(ctx, ibuf, hsize);
> -				*already_hashed_to = offset;
> -			}
> +
> +			if (rsize)
> +				git_hash_update(ctx, ibuf, rsize);

Is this guard really needed? I wouldn't expect that we ever try to read
zero bytes into `ibuf`, and we bail in case we didn't receive the
expected number of bytes.

And even if we did, `git_hash_update()` works just fine with no data.
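For what it's worth, that no-data property is the usual contract for
incremental hash APIs. Here is a quick demonstration with Python's hashlib
standing in for `git_hash_update()` (an analogy, not Git's code):

```python
import hashlib

a = hashlib.sha1(b"some data")

b = hashlib.sha1(b"some data")
b.update(b"")  # a zero-length update is a no-op

# Both contexts produce the identical digest.
assert a.hexdigest() == b.hexdigest()
```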

> @@ -1592,48 +1566,34 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
>  					   size_t size, const char *path)
>  {
>  	struct transaction_packfile *state = &transaction->packfile;
> -	off_t seekback, already_hashed_to;
>  	struct git_hash_ctx ctx;
>  	unsigned char obuf[16384];
>  	unsigned header_len;
>  	struct hashfile_checkpoint checkpoint;
>  	struct pack_idx_entry *idx;
>  
> -	seekback = lseek(fd, 0, SEEK_CUR);
> -	if (seekback == (off_t)-1)
> -		return error("cannot find the current offset");

Okay, no seeking necessary because we don't restart the write anymore.

>  	header_len = format_object_header((char *)obuf, sizeof(obuf),
>  					  OBJ_BLOB, size);
>  	transaction->base.source->odb->repo->hash_algo->init_fn(&ctx);
>  	git_hash_update(&ctx, obuf, header_len);
>  
> +	/*
> +	 * If writing another object to the packfile could result in it
> +	 * exceeding the configured size limit, flush the current packfile
> +	 * transaction.
> +	 */

Do we want to document that this intentionally works on the inflated
size, not the deflated one, with the arguments mentioned in the commit
message?

> +	if (state->nr_written && pack_size_limit_cfg &&
> +	    pack_size_limit_cfg < state->offset + size)
> +		flush_packfile_transaction(transaction);

And we now flush the packfile before writing any object that may cause
us to bust the size limit. Makes sense.
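The quoted condition can be modeled as a small predicate. This is a
hypothetical Python sketch (the name `should_flush` is invented, the
parameters mirror `nr_written`, `pack_size_limit_cfg`, `state->offset`,
and the inflated object size):

```python
def should_flush(nr_written, size_limit, pack_offset, object_size):
    """Flush the current packfile before writing if adding the object's
    inflated size would push the pack past the configured limit."""
    return bool(nr_written and size_limit
                and size_limit < pack_offset + object_size)

# An empty pack is never flushed, even for an oversized object,
# which is also what keeps a single huge object from flushing forever:
assert not should_flush(0, 1000, 32, 5000)
# A non-empty pack is flushed when the next object would exceed the limit:
assert should_flush(3, 1000, 900, 200)
# ...but not while the object still fits:
assert not should_flush(3, 1000, 900, 50)
```

The `nr_written` guard matters: without it, an object larger than the
limit on its own would trigger a flush before every write.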

>  	CALLOC_ARRAY(idx, 1);
>  	prepare_packfile_transaction(transaction);
>  	hashfile_checkpoint_init(state->f, &checkpoint);
>  
> -	already_hashed_to = 0;
> -
> -	while (1) {
> -		prepare_packfile_transaction(transaction);
> -		hashfile_checkpoint(state->f, &checkpoint);
> -		idx->offset = state->offset;
> -		crc32_begin(state->f);
> -
> -		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
> -					 fd, size, path))
> -			break;
> -		/*
> -		 * Writing this object to the current pack will make
> -		 * it too big; we need to truncate it, start a new
> -		 * pack, and write into it.
> -		 */
> -		hashfile_truncate(state->f, &checkpoint);
> -		state->offset = checkpoint.offset;
> -		flush_packfile_transaction(transaction);
> -		if (lseek(fd, seekback, SEEK_SET) == (off_t)-1)
> -			return error("cannot seek back");
> -	}

Hm. I was briefly wondering whether we'd loop indefinitely in case the
object alone is bigger than the packsize. But we have an escape hatch in
`stream_blob_to_pack()` that special-cases when we haven't written any
data yet, so the answer is "no".

> +	hashfile_checkpoint(state->f, &checkpoint);
> +	idx->offset = state->offset;
> +	crc32_begin(state->f);
> +	stream_blob_to_pack(state, &ctx, fd, size, path);
>  	git_hash_final_oid(result_oid, &ctx);

Thanks!

Patrick
