git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: Elijah Newren <newren@gmail.com>, Jeff King <peff@peff.net>,
	Junio C Hamano <gitster@pobox.com>,
	Patrick Steinhardt <ps@pks.im>
Subject: [PATCH 00/17] midx: incremental MIDX/bitmap layer compaction
Date: Sat, 6 Dec 2025 15:30:57 -0500	[thread overview]
Message-ID: <cover.1765053054.git.me@ttaylorr.com> (raw)

Note to the maintainer:

 * This series is the spiritual successor to my earlier series queued
   as 'tb/incremental-midx-part-3.1', but is based on 'master'
   (currently f0ef5b6d9bc (The fifth batch, 2025-11-30) at the time of
   writing). I suggest queueing as 'tb/incremental-midx-part-3.2'.

OVERVIEW
========

This series implements the second of three major components for an
incremental MIDX/bitmap-based repacking strategy. Those components
are:

 1. Refactoring code out of builtin/repack.c into individual
    compilation units outside of the builtin directory.

 2. (this series) Implementing MIDX layer compaction, i.e. the ability
    to take an contiguous sequence of MIDX layers in an existing MIDX
    chain and replace them with a single layer representing the union
    of objects/packs among the compacted layers.

 3. An incremental repacking strategy that, unlike our current
    '--geometric' approach, does not require periodic all-into-one
    repacks.

BACKGROUND
==========

I'll briefly discuss that repacking strategy here in order to provide
some context on why MIDX layer compaction is important. When doing a
--geometric repack today, any update to the MIDX and/or its .bitmap
requires a full rewrite. For most repositories this is not such a big
deal, but in large monorepos these costs can add up significantly over
time.

To deal with this, incremental MIDXs (and corresponding support for
reachability bitmaps) were introduced as a way to append updates to
the end of a MIDX chain without having to rewrite anything.

Of course, growing a MIDX chain infinitely without ever shrinking it
would be pointless. So any repacking strategy which leverages
incremental MIDXs must take some action which encourages the MIDX
chain to avoid growing too large.

The strategy that I will share in the forthcoming topic related to
this effort does (roughly) the following:

 1. Repack all non-MIDX'd packs geometrically, optionally including
    any packs in the most recent MIDX layer iff it has more than some
    threshold of packs.

 2. Write out a new MIDX layer which is the result of the previous
    step, adding it onto the end of the chain (or replacing the
    previous end, depending on whether the pack threshold condition
    was met).

 3. Perform a compaction pass, combining adjacent MIDX layers to
    restore a geometric progression on object count among MIDX layers
    in the chain.

(As an aside, this description is not the most efficient way to
implement this behavior, and so the actual implementation varies
slightly, though achieves the same effect.)

This strategy encourages MIDX chains where older layers have fewer,
larger packs, and newer layers have a greater number of smaller packs.
The compaction pass prevents the MIDX chain from ever growing too
large. (In my simulation[1], even a large number of simulated pushes
yields a MIDX chain which is consistently fewer than ~10 or so layers
long.)

COMPACTION
==========

MIDX compaction works by producing a new MIDX layer which is the
logical union of any contiguous sequence within the existing MIDX
chain. Importantly, the packs/objects from those layers within the
compaction range have their order preserved. That means that the
pseudo-pack order of the compacted layer is guaranteed to be the same
as the concatenated pseudo-pack order across all layers in the
compaction range.

As a result, we can reuse any selected bitmaps within the compaction
range verbatim. More importantly, any bitmaps belonging to layers
which are descendants of the compacted one do not need to permute any
existing bitmaps in order for them to be valid post-compaction.

This allows us to cheaply combine parts of the MIDX chain without
needing to do any work outside of the combine layers, which is
essential in the above-described repacking strategy.

THIS SERIES
===========

This series implements MIDX compaction, and is organized roughly as
follows:

 * The first two patches are stray code clean-up and minor refactoring
   changes that I made while writing this series.

 * The next three patches clean up the SYNOPSIS for in the manual page
   for git-multi-pack-index(1), and removes the builtin from the list
   of known failures in t0450.

 * The next patch is a typofix, and the one following that is a
   preemptive bugfix.

 * The eighth patch is a code refactoring patch that makes the
   argument list for midx-write.c::write_midx_internal() easier to
   reason about.

 * The ninth patch relaxes a MIDX constraint that forces all packs
   within a layer to be sorted lexicographically.

 * The remaining patches prepare for and introduce MIDX compaction.

The test coverage in the new t5335-compact-multi-pack-index.sh is
fairly sparse in my opinion, though most of the interesting behavior
is already exercised via t5319 and t5334. If others have ideas on how
to improve test coverage here, please let me know!

This series is somewhat lengthy, though the individual changes are
relatively small for the most part, so I am hoping that they aren't
too bad to review.

Thanks in advance for your review!

[1]: https://gist.github.com/ttaylorr/7ac0305f130197744f3583096c8198a1

Taylor Blau (17):
  midx: mark `get_midx_checksum()` arguments as const
  midx: split `get_midx_checksum()` by adding `get_midx_hash()`
  builtin/multi-pack-index.c: make '--progress' a common option
  git-multi-pack-index(1): remove non-existent incompatibility
  git-multi-pack-index(1): align SYNOPSIS with 'git multi-pack-index -h'
  t/t5319-multi-pack-index.sh: fix copy-and-paste error in t5319.39
  midx-write.c: don't use `pack_perm` when assigning `bitmap_pos`
  midx-write.c: introduce `struct write_midx_opts`
  midx: do not require packs to be sorted in lexicographic order
  git-compat-util.h: introduce `u32_add()`
  midx-write.c: introduce `midx_pack_perm()` helper
  midx-write.c: extract `fill_pack_from_midx()`
  midx-write.c: enumerate `pack_int_id` values directly
  midx-write.c: factor fanout layering from `compute_sorted_entries()`
  t/helper/test-read-midx.c: plug memory leak when selecting layer
  midx: implement MIDX compaction
  midx: enable reachability bitmaps during MIDX compaction

 Documentation/git-multi-pack-index.adoc |  24 +-
 builtin/multi-pack-index.c              |  84 ++++-
 git-compat-util.h                       |   8 +
 midx-write.c                            | 463 ++++++++++++++++++------
 midx.c                                  |  36 +-
 midx.h                                  |   9 +-
 pack-bitmap.c                           |   9 +-
 pack-revindex.c                         |   4 +-
 t/helper/test-read-midx.c               |  21 +-
 t/meson.build                           |   1 +
 t/t0450/adoc-help-mismatches            |   1 -
 t/t5319-multi-pack-index.sh             |   7 +-
 t/t5335-compact-multi-pack-index.sh     | 218 +++++++++++
 13 files changed, 737 insertions(+), 148 deletions(-)
 create mode 100755 t/t5335-compact-multi-pack-index.sh


base-commit: f0ef5b6d9bcc258e4cbef93839d1b7465d5212b9
-- 
2.52.0.171.gd6a4e6b6955

             reply	other threads:[~2025-12-06 20:31 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-06 20:30 Taylor Blau [this message]
2025-12-06 20:31 ` [PATCH 01/17] midx: mark `get_midx_checksum()` arguments as const Taylor Blau
2025-12-08 18:26   ` Patrick Steinhardt
2025-12-09  1:41     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 02/17] midx: split `get_midx_checksum()` by adding `get_midx_hash()` Taylor Blau
2025-12-08 18:25   ` Patrick Steinhardt
2025-12-09  1:42     ` Taylor Blau
2025-12-09  1:50       ` Taylor Blau
2025-12-09  6:27         ` Patrick Steinhardt
2025-12-06 20:31 ` [PATCH 03/17] builtin/multi-pack-index.c: make '--progress' a common option Taylor Blau
2025-12-06 20:31 ` [PATCH 04/17] git-multi-pack-index(1): remove non-existent incompatibility Taylor Blau
2025-12-06 20:31 ` [PATCH 05/17] git-multi-pack-index(1): align SYNOPSIS with 'git multi-pack-index -h' Taylor Blau
2025-12-06 20:31 ` [PATCH 06/17] t/t5319-multi-pack-index.sh: fix copy-and-paste error in t5319.39 Taylor Blau
2025-12-06 20:31 ` [PATCH 07/17] midx-write.c: don't use `pack_perm` when assigning `bitmap_pos` Taylor Blau
2025-12-08 18:26   ` Patrick Steinhardt
2025-12-09  1:59     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 08/17] midx-write.c: introduce `struct write_midx_opts` Taylor Blau
2025-12-08 18:26   ` Patrick Steinhardt
2025-12-09  2:04     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 09/17] midx: do not require packs to be sorted in lexicographic order Taylor Blau
2025-12-08 18:26   ` Patrick Steinhardt
2025-12-09  2:07     ` Taylor Blau
2025-12-09  2:11       ` Taylor Blau
2025-12-06 20:31 ` [PATCH 10/17] git-compat-util.h: introduce `u32_add()` Taylor Blau
2025-12-08 18:27   ` Patrick Steinhardt
2025-12-09  2:13     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 11/17] midx-write.c: introduce `midx_pack_perm()` helper Taylor Blau
2025-12-06 20:31 ` [PATCH 12/17] midx-write.c: extract `fill_pack_from_midx()` Taylor Blau
2025-12-06 20:31 ` [PATCH 13/17] midx-write.c: enumerate `pack_int_id` values directly Taylor Blau
2025-12-08 18:27   ` Patrick Steinhardt
2025-12-09  2:14     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 14/17] midx-write.c: factor fanout layering from `compute_sorted_entries()` Taylor Blau
2025-12-06 20:31 ` [PATCH 15/17] t/helper/test-read-midx.c: plug memory leak when selecting layer Taylor Blau
2025-12-08 18:27   ` Patrick Steinhardt
2025-12-09  2:16     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 16/17] midx: implement MIDX compaction Taylor Blau
2025-12-09  7:21   ` Patrick Steinhardt
2025-12-06 20:31 ` [PATCH 17/17] midx: enable reachability bitmaps during " Taylor Blau
2025-12-09  7:21   ` Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1765053054.git.me@ttaylorr.com \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).