From: "Scott Bauersfeld via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
	Derrick Stolee <stolee@gmail.com>,
	Scott Bauersfeld <sbauersfeld@g.ucla.edu>
Subject: [PATCH v3] index-pack, unpack-objects: increase input buffer from 4 KiB to 128 KiB
Date: Mon, 27 Apr 2026 19:26:38 +0000
Message-ID: <pull.2282.v3.git.git.1777317998098.gitgitgadget@gmail.com>
In-Reply-To: <pull.2282.v2.git.git.1777306114914.gitgitgadget@gmail.com>

From: Scott Bauersfeld <sbauersfeld@g.ucla.edu>

index-pack and unpack-objects both read pack data from stdin through
a 4 KiB static buffer. In index-pack, each fill() flushes consumed
bytes to the pack file via write_or_die(), capping every write(2)
at 4 KiB. unpack-objects uses the same buffer pattern for reads.

On FUSE-backed filesystems every write(2) is a synchronous round
trip through the FUSE protocol (userspace -> kernel -> userspace ->
back), so the 4 KiB buffer turns a clone into many unnecessary tiny
writes with noticeable latency overhead.

Increase the buffer from 4 KiB to 128 KiB. Introduce a shared
DEFAULT_IO_BUFFER_SIZE constant in git-compat-util.h (next to
MAX_IO_SIZE) and use it in index-pack, unpack-objects, and the
hashfile layer in csum-file (which already used 128 KiB but
hardcoded the value).

Syscall counts via strace on HTTPS clones of git/git (~296 MB pack,
5 runs per variant, isolated builds from the same v2.54.0 source):

  index-pack pack file writes: 72,465 -> 24,943 avg (65% fewer)
  total write() syscalls:     310,192 -> 259,530 avg (16% fewer)
  writes of exactly 4096 bytes: ~40,077 -> 0

Wall-clock time of git clone over HTTPS onto a FUSE passthrough
filesystem with writeback caching disabled, 3 runs per variant:

  vscode (~1.26 GB pack): 84.5s -> 75.7s avg (10% faster)
  git/git (~306 MB pack):  22.6s -> 20.0s avg (11% faster)

Signed-off-by: Scott Bauersfeld <sbauersfeld@g.ucla.edu>
---
    index-pack, unpack-objects: increase input buffer from 4 KiB to 128 KiB
    
    index-pack and unpack-objects read pack data from stdin through a 4 KiB
    static buffer. In index-pack, each fill() flushes consumed bytes to the
    pack file via write_or_die(), capping every write(2) at 4 KiB.
    unpack-objects uses the same buffer pattern for reads.
    
    On FUSE-backed filesystems every write(2) is a synchronous round trip
    through the FUSE protocol (userspace → kernel → userspace → back), so
    the 4 KiB buffer turns a clone into many unnecessary tiny writes with
    noticeable latency overhead.
    
    Increase the buffer from 4 KiB to 128 KiB. Introduce a shared
    DEFAULT_IO_BUFFER_SIZE constant in git-compat-util.h (next to
    MAX_IO_SIZE) and use it in index-pack, unpack-objects, and the hashfile
    layer in csum-file (which already used 128 KiB but hardcoded the value).
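
A rough way to see why the buffer size caps the write size: in the fill-then-flush pattern described above, each refill first writes out what the previous read delivered, so no single write can exceed the buffer. The C sketch below is a simplified model, not the code in builtin/index-pack.c. The function count_flush_writes() and its max_read parameter (the most a single read is assumed to return) are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model (an assumption, not git's implementation): every
 * refill flushes the whole buffer with one write, then performs one
 * read that returns min(bytes left, buf_size, max_read) bytes.
 * Returns the number of writes needed to pass `total` bytes through.
 */
size_t count_flush_writes(size_t total, size_t buf_size, size_t max_read)
{
	size_t remaining = total, buffered = 0, writes = 0;

	assert(buf_size > 0 && max_read > 0);

	while (remaining || buffered) {
		size_t n;

		if (buffered) {		/* flush before refilling */
			writes++;
			buffered = 0;
		}

		n = remaining < buf_size ? remaining : buf_size;
		if (n > max_read)	/* model a short read */
			n = max_read;
		buffered = n;
		remaining -= n;
	}
	return writes;
}
```

Under this model, count_flush_writes(296000000, 4096, 4096) gives 72,266, in the same range as the 72,465 pack file writes measured for the unpatched build if "~296 MB" is taken as 296,000,000 bytes. With a 128 KiB buffer the model's best case (every read filling the buffer) is far below the measured 24,943, which would be consistent with many reads returning less than a full buffer. That explanation is a guess, not something measured here.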
    
    
    Syscall reduction
    =================
    
    Measured via strace -f on HTTPS clones of git/git (~296 MB pack, 5 runs
    per variant, isolated builds from the same v2.54.0 source):
    
    Metric                                  Unpatched (4 KiB)  Patched (128 KiB)  Change
    index-pack writes to pack file          72,465 avg         24,943 avg         −65%
    Total write() syscalls (all processes)  310,192 avg        259,530 avg        −16%
    Writes of exactly 4096 bytes            ~40,077 avg        0                  eliminated
    HEAD / file count / fsck                ✓                  ✓                  identical
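
The −65% and −16% figures above follow from the averages by plain arithmetic. A minimal C sketch of that calculation, where percent_fewer() is a hypothetical helper and not part of the patch:

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before` (hypothetical helper). */
double percent_fewer(double before, double after)
{
	assert(before > 0);
	return (before - after) / before * 100.0;
}
```

percent_fewer(72465, 24943) is about 65.6 and percent_fewer(310192, 259530) is about 16.3, which the figures above report as 65% and 16%.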
    
    
    Wall-clock time on FUSE
    =======================
    
    Measured wall-clock time of git clone over HTTPS onto a FUSE passthrough
    filesystem with writeback caching disabled. 3 runs per variant:
    
    Repo                              Unpatched avg  Patched avg  Change
    microsoft/vscode (~1.26 GB pack)  84.5s          75.7s        −10%
    git/git (~306 MB pack)            22.6s          20.0s        −11%
    
    
    Changes since v2
    ================
    
     * Renamed DEFAULT_PACKFILE_BUFFER_SIZE → DEFAULT_IO_BUFFER_SIZE per
       Stolee's feedback. The constant is not packfile-specific, since it is
       also used by the hashfile layer.
     * Stolee noted that WRITE_BUFFER_SIZE in read-cache.c could be
       consolidated. That constant was already removed in f6e2cd0625
       ("read-cache: delete unused hashing methods", 2021-05-18) when
       read-cache.c was converted to use the hashfile API, so there is
       nothing left to unify. The more general DEFAULT_IO_BUFFER_SIZE name
       covers all current users of the constant.
    
    
    Changes since v1
    ================
    
     * Introduced shared DEFAULT_PACKFILE_BUFFER_SIZE constant in
       git-compat-util.h (next to MAX_IO_SIZE), replacing per-file #define
       and the hardcoded value in csum-file.c. Placed here rather than
       environment.h since it is an I/O buffer size, not an environment
       variable or repo config.
     * Added wall-clock timing on a FUSE filesystem.
     * Cleaned up the commit description a bit.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2282%2Fsbauersfeld%2Fsb%2Fincrease-index-pack-input-buffer-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2282/sbauersfeld/sb/increase-index-pack-input-buffer-v3
Pull-Request: https://github.com/git/git/pull/2282

Range-diff vs v2:

 1:  ac2559ccb5 ! 1:  df754ac879 index-pack, unpack-objects: increase input buffer from 4 KiB to 128 KiB
     @@ Commit message
          writes with noticeable latency overhead.
      
          Increase the buffer from 4 KiB to 128 KiB. Introduce a shared
     -    DEFAULT_PACKFILE_BUFFER_SIZE constant in git-compat-util.h (next to
     +    DEFAULT_IO_BUFFER_SIZE constant in git-compat-util.h (next to
          MAX_IO_SIZE) and use it in index-pack, unpack-objects, and the
          hashfile layer in csum-file (which already used 128 KiB but
          hardcoded the value).
     @@ builtin/index-pack.c: static int check_self_contained_and_connected;
       
      -/* We always read in 4kB chunks. */
      -static unsigned char input_buffer[4096];
     -+static unsigned char input_buffer[DEFAULT_PACKFILE_BUFFER_SIZE];
     ++static unsigned char input_buffer[DEFAULT_IO_BUFFER_SIZE];
       static unsigned int input_offset, input_len;
       static off_t consumed_bytes;
       static off_t max_input_size;
     @@ builtin/unpack-objects.c
       
      -/* We always read in 4kB chunks. */
      -static unsigned char buffer[4096];
     -+static unsigned char buffer[DEFAULT_PACKFILE_BUFFER_SIZE];
     ++static unsigned char buffer[DEFAULT_IO_BUFFER_SIZE];
       static unsigned int offset, len;
       static off_t consumed_bytes;
       static off_t max_input_size;
     @@ csum-file.c: struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
       	f->algop->init_fn(&f->ctx);
       
      -	f->buffer_len = opts->buffer_len ? opts->buffer_len : 128 * 1024;
     -+	f->buffer_len = opts->buffer_len ? opts->buffer_len : DEFAULT_PACKFILE_BUFFER_SIZE;
     ++	f->buffer_len = opts->buffer_len ? opts->buffer_len : DEFAULT_IO_BUFFER_SIZE;
       	f->buffer = xmalloc(f->buffer_len);
       	f->check_buffer = NULL;
       
     @@ git-compat-util.h: static inline uint64_t u64_add(uint64_t a, uint64_t b)
       #endif
       
      +/*
     -+ * Default buffer size for buffered I/O in pack file operations (index-pack,
     -+ * unpack-objects) and the hashfile layer in csum-file.
     ++ * Default buffer size for buffered I/O in index-pack, unpack-objects,
     ++ * and the hashfile layer in csum-file.
      + */
     -+#define DEFAULT_PACKFILE_BUFFER_SIZE (128 * 1024)
     ++#define DEFAULT_IO_BUFFER_SIZE (128 * 1024)
      +
       #ifdef HAVE_ALLOCA_H
       # include <alloca.h>


 builtin/index-pack.c     | 3 +--
 builtin/unpack-objects.c | 3 +--
 csum-file.c              | 2 +-
 git-compat-util.h        | 6 ++++++
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index ca7784dc2c..bb3639641c 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -145,8 +145,7 @@ static int check_self_contained_and_connected;
 
 static struct progress *progress;
 
-/* We always read in 4kB chunks. */
-static unsigned char input_buffer[4096];
+static unsigned char input_buffer[DEFAULT_IO_BUFFER_SIZE];
 static unsigned int input_offset, input_len;
 static off_t consumed_bytes;
 static off_t max_input_size;
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index e01cf6e360..af67d1a1d3 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -23,8 +23,7 @@
 static int dry_run, quiet, recover, has_errors, strict;
 static const char unpack_usage[] = "git unpack-objects [-n] [-q] [-r] [--strict]";
 
-/* We always read in 4kB chunks. */
-static unsigned char buffer[4096];
+static unsigned char buffer[DEFAULT_IO_BUFFER_SIZE];
 static unsigned int offset, len;
 static off_t consumed_bytes;
 static off_t max_input_size;
diff --git a/csum-file.c b/csum-file.c
index 9558177a11..d7a682c2b6 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -178,7 +178,7 @@ struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
 	f->algop = unsafe_hash_algo(algop);
 	f->algop->init_fn(&f->ctx);
 
-	f->buffer_len = opts->buffer_len ? opts->buffer_len : 128 * 1024;
+	f->buffer_len = opts->buffer_len ? opts->buffer_len : DEFAULT_IO_BUFFER_SIZE;
 	f->buffer = xmalloc(f->buffer_len);
 	f->check_buffer = NULL;
 
diff --git a/git-compat-util.h b/git-compat-util.h
index ae1bdc90a4..5024814bd4 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -712,6 +712,12 @@ static inline uint64_t u64_add(uint64_t a, uint64_t b)
 # endif
 #endif
 
+/*
+ * Default buffer size for buffered I/O in index-pack, unpack-objects,
+ * and the hashfile layer in csum-file.
+ */
+#define DEFAULT_IO_BUFFER_SIZE (128 * 1024)
+
 #ifdef HAVE_ALLOCA_H
 # include <alloca.h>
 # define xalloca(size)      (alloca(size))

base-commit: 94f057755b7941b321fd11fec1b2e3ca5313a4e0
-- 
gitgitgadget

