From: Patrick Steinhardt <ps@pks.im>
To: Justin Tobler <jltobler@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 4/6] object-file: avoid fd seekback by checking object size upfront
Date: Tue, 31 Mar 2026 09:48:43 +0200	[thread overview]
Message-ID: <act8W1BEg6iyUpHB@pks.im> (raw)
In-Reply-To: <20260331033835.2863514-5-jltobler@gmail.com>

On Mon, Mar 30, 2026 at 10:38:33PM -0500, Justin Tobler wrote:
> In certain scenarios, Git handles writing blobs that exceed
> "core.bigFileThreshold" differently by streaming the object directly
> into a packfile. When there is an active ODB transaction, these blobs
> are streamed to the same packfile instead of using a separate packfile
> for each. If "pack.packSizeLimit" is configured and streaming another
> object causes the packfile to exceed the configured limit, the packfile
> is truncated back to the previous object and the object write is
> restarted in a new packfile.
> 
> This works fine, but requires the fd being read from to save a
> checkpoint so it becomes possible to rewind the input source via seeking
> back to a known offset at the beginning. In a subsequent commit, blob
> streaming is converted to use `struct odb_write_stream` as a more
> generic input source instead of an fd which doesn't provide a mechanism
> for rewinding.
> 
> For this use case though, rewinding the fd is not strictly necessary
> because the inflated size of the object is known and can be used to
> approximate whether writing the object would cause the packfile to
> exceed the configured limit prior to writing anything. These blobs
> written to the packfile are never deltafied thus the size difference

s/deltafied/deltified/

> between what is written versus the inflated size is due to zlib
> compression. While this does prevent packfiles from being filled to the
> potential maximum in some cases, it should be good enough and still
> prevents the packfile from exceeding any configured limit.
> 
> Use the inflated blob size to determine whether writing an object to a
> packfile will exceed the configured "pack.packSizeLimit".

I agree that this is a reasonable tradeoff:

  - For small objects it's probably not going to make a huge difference,
    as we'd at most waste a couple kilobytes.

  - For large objects we can expect that we wouldn't use deltification
    anyway due to "core.bigFileThreshold".

  - We can expect that in many cases large files will not compress
    well, either, as it's more likely than not that a file >512MB (our
    default limit for "core.bigFileThreshold") is going to be a binary
    file.

We also nicely document "pack.packSizeLimit" as "rarely useful, and may
result in a larger total on-disk size" in git-config(1), so I think it's
fair not to bend over backwards just to make a rarely-useful feature
work exactly the same.

> diff --git a/object-file.c b/object-file.c
> index 493173eaf4..1de2244ac5 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1473,15 +1461,10 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
>  			if ((size_t)read_result != rsize)
>  				die("failed to read %u bytes from '%s'",
>  				    (unsigned)rsize, path);
> -			offset += rsize;
> -			if (*already_hashed_to < offset) {
> -				size_t hsize = offset - *already_hashed_to;
> -				if (rsize < hsize)
> -					hsize = rsize;
> -				if (hsize)
> -					git_hash_update(ctx, ibuf, hsize);
> -				*already_hashed_to = offset;
> -			}
> +
> +			if (rsize)
> +				git_hash_update(ctx, ibuf, rsize);

Is this guard really needed? I wouldn't expect that we ever try to read
zero bytes into `ibuf`, and we bail in case we didn't receive the
expected number of bytes.

And even if we did, `git_hash_update()` works just fine with no data.

> @@ -1592,48 +1566,34 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
>  					   size_t size, const char *path)
>  {
>  	struct transaction_packfile *state = &transaction->packfile;
> -	off_t seekback, already_hashed_to;
>  	struct git_hash_ctx ctx;
>  	unsigned char obuf[16384];
>  	unsigned header_len;
>  	struct hashfile_checkpoint checkpoint;
>  	struct pack_idx_entry *idx;
>  
> -	seekback = lseek(fd, 0, SEEK_CUR);
> -	if (seekback == (off_t)-1)
> -		return error("cannot find the current offset");

Okay, no seeking necessary because we don't restart the write anymore.

>  	header_len = format_object_header((char *)obuf, sizeof(obuf),
>  					  OBJ_BLOB, size);
>  	transaction->base.source->odb->repo->hash_algo->init_fn(&ctx);
>  	git_hash_update(&ctx, obuf, header_len);
>  
> +	/*
> +	 * If writing another object to the packfile could result in it
> +	 * exceeding the configured size limit, flush the current packfile
> +	 * transaction.
> +	 */

Do we want to document that this intentionally works on the inflated
size, not the deflated one, with the arguments mentioned in the commit
message?

> +	if (state->nr_written && pack_size_limit_cfg &&
> +	    pack_size_limit_cfg < state->offset + size)
> +		flush_packfile_transaction(transaction);

And we now flush the packfile before writing any object that may cause
us to bust the size limit. Makes sense.

>  	CALLOC_ARRAY(idx, 1);
>  	prepare_packfile_transaction(transaction);
>  	hashfile_checkpoint_init(state->f, &checkpoint);
>  
> -	already_hashed_to = 0;
> -
> -	while (1) {
> -		prepare_packfile_transaction(transaction);
> -		hashfile_checkpoint(state->f, &checkpoint);
> -		idx->offset = state->offset;
> -		crc32_begin(state->f);
> -
> -		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
> -					 fd, size, path))
> -			break;
> -		/*
> -		 * Writing this object to the current pack will make
> -		 * it too big; we need to truncate it, start a new
> -		 * pack, and write into it.
> -		 */
> -		hashfile_truncate(state->f, &checkpoint);
> -		state->offset = checkpoint.offset;
> -		flush_packfile_transaction(transaction);
> -		if (lseek(fd, seekback, SEEK_SET) == (off_t)-1)
> -			return error("cannot seek back");
> -	}

Hm. I was briefly wondering whether we'd loop indefinitely in case the
object alone is bigger than the packsize. But we have an escape hatch in
`stream_blob_to_pack()` that special-cases when we haven't written any
data yet, so the answer is "no".

> +	hashfile_checkpoint(state->f, &checkpoint);
> +	idx->offset = state->offset;
> +	crc32_begin(state->f);
> +	stream_blob_to_pack(state, &ctx, fd, size, path);
>  	git_hash_final_oid(result_oid, &ctx);

Thanks!

Patrick
