From: Patrick Steinhardt <ps@pks.im>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
Ezekiel Newren <ezekielnewren@gmail.com>
Subject: Re: [PATCH 12/14] rust: add a new binary loose object map format
Date: Tue, 28 Oct 2025 10:18:32 +0100
Message-ID: <aQCKaK6kTfwoj28O@pks.im>
In-Reply-To: <20251027004404.2152927-13-sandals@crustytoothpaste.net>
On Mon, Oct 27, 2025 at 12:44:02AM +0000, brian m. carlson wrote:
> Our current loose object format has a few problems. First, it is not
> efficient: the list of object IDs is not sorted and even if it were,
> there would not be an efficient way to look up objects in both
> algorithms.
>
> Second, we need to store mappings for things which are not technically
> loose objects but are not packed objects, either, and so cannot be
> stored in a pack index. These kinds of things include shallows, their
> parents, and their trees, as well as submodules. Yet we also need to
> implement a sensible way to store the kind of object so that we can
> prune unneeded entries. For instance, if the user has updated the
> shallows, we can remove the old values.
Doesn't this indicate that calling this "loose object map" is kind of a
misnomer? If we want to be able to store arbitrary objects regardless of
the way those are stored (or not stored) in the ODB then I think it's
overall quite confusing to have "loose" in the name.
This isn't something we can fix for the old loose object map. But
shouldn't we fix this now for the new format you're about to introduce?
> For these reasons, introduce a new binary loose object map format. The
> careful reader will notice that it resembles very closely the pack index
> v3 format. Add an in-memory loose object map as well, and allow
> enabling writing to a batched map, which can then be written later as
> one of the binary loose object maps. Include several tests for round
> tripping and data lookup across algorithms.
s/enabling//
> diff --git a/Documentation/gitformat-loose.adoc b/Documentation/gitformat-loose.adoc
> index 947993663e..4850c91669 100644
> --- a/Documentation/gitformat-loose.adoc
> +++ b/Documentation/gitformat-loose.adoc
> @@ -48,6 +50,108 @@ stored under
> Similarly, a blob containing the contents `abc` would have the uncompressed
> data of `blob 3\0abc`.
>
> +== Loose object mapping
> +
> +When the `compatObjectFormat` option is used, Git needs to store a mapping
> +between the repository's main algorithm and the compatibility algorithm. There
> +are two formats for this: the legacy mapping and the modern mapping.
> +
> +=== Legacy mapping
> +
> +The compatibility mapping is stored in a file called
> +`$GIT_DIR/objects/loose-object-idx`. The format of this file looks like this:
> +
> + # loose-object-idx
> + (main-name SP compat-name LF)*
> +
> +`main-name` refers to hexadecimal object ID of the object in the main
> +repository format and `compat-name` refers to the same thing, but for the
> +compatibility format.
> +
> +This format is read if it exists but is not written.
> +
> +Note that carriage returns are not permitted in this file, regardless of the
> +host system or configuration.
As far as I understood, this legacy mapping wasn't really used anywhere
as it is basically nonfunctional in the first place. Can we get away
with dropping it altogether?
> +=== Modern mapping
> +
> +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose`
> +ending in `.map`. The portion of the filename before the extension is that of
> +the hash checksum in hex format.
Given that we're talking about multiple different hashes: which hash
function is used for this checksum? I assume it's the main hash, but it
might be sensible to document this.
> +`git pack-objects` will repack existing entries into one file, removing any
> +unnecessary objects, such as obsolete shallow entries or loose objects that
> +have been packed.
Curious that this is put into git-pack-objects(1), as it doesn't quite
feel related to the task. Sure, it generates packfiles, but it doesn't
really handle the logic to manage loose objects/packfiles in the repo.
This feels closer to what git-repack(1) is doing, so would that be a
better place to put it?
> +==== Mapping file format
> +
> +- A header appears at the beginning and consists of the following:
> + * A 4-byte mapping signature: `LMAP`
> + * 4-byte version number: 1
> + * 4-byte length of the header section.
> + * 4-byte number of objects declared in this map file.
> + * 4-byte number of object formats declared in this map file.
> + * For each object format:
> + ** 4-byte format identifier (e.g., `sha1` for SHA-1)
> + ** 4-byte length in bytes of shortened object names. This is the
> + shortest possible length needed to make names in the shortened
> + object name table unambiguous.
> + ** 8-byte integer, recording where tables relating to this format
> + are stored in this index file, as an offset from the beginning.
As far as I understand, this even allows us to store multiple
compatibility hashes if we were ever to grow a third hash. We would
still be able to binary-search through the file, as we can compute the
size of every record from this header.
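Just to double-check my understanding of the layout, I'd expect a reader
to parse these per-format entries roughly like below. All of the names
are made up by me, and I'm assuming network byte order as for the pack
index formats:

    struct FormatEntry {
        format_id: [u8; 4], // e.g. b"sha1"
        abbrev_len: u32,    // length of shortened object names in bytes
        table_offset: u64,  // offset of this format's tables in the file
    }

    // `data` points at the per-format portion of the header.
    fn parse_format_entries(data: &[u8], nr_formats: usize) -> Vec<FormatEntry> {
        // Each per-format entry is 4 + 4 + 8 = 16 bytes.
        data.chunks_exact(16)
            .take(nr_formats)
            .map(|e| FormatEntry {
                format_id: e[0..4].try_into().unwrap(),
                abbrev_len: u32::from_be_bytes(e[4..8].try_into().unwrap()),
                table_offset: u64::from_be_bytes(e[8..16].try_into().unwrap()),
            })
            .collect()
    }

Please correct me if I got any of that wrong.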
> + * 8-byte offset to the trailer from the beginning of this file.
> + * Zero or more additional key/value pairs (4-byte key, 4-byte value), which
> + may optionally declare one or more chunks. No chunks are currently
> + defined. Readers must ignore unrecognized keys.
How does the reader identify these key/value pairs and know how many of
those there are? Also, do you already have an idea what those should be
used for?
> +- Zero or more NUL bytes. These are used to improve the alignment of the
> + 4-byte quantities below.
How does one figure out how many NUL bytes there are going to be? I guess
the reader doesn't need to know as it simply uses the length of the
header section to seek to the tables?
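If so, then I'd assume only the writer ever needs to compute that count,
with something like the following hypothetical helper, while the reader
just follows the offsets recorded in the header:

    // Hypothetical writer-side helper: pad to the next 4-byte boundary.
    // The reader never counts NULs, it only follows the recorded offsets.
    fn nul_padding(offset: u64) -> u64 {
        (4 - offset % 4) % 4
    }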
> +- Tables for the first object format:
> + * A sorted table of shortened object names. These are prefixes of the names
> + of all objects in this file, packed together without offset values to
> + reduce the cache footprint of the binary search for a specific object name.
Okay. The length of the shortened object names is encoded in the header,
so all of the shortened names have the same length.
Does the reader have a way to disambiguate the shortened object names?
They may be unambiguous at the point in time when the mapping is
written, but because they are shortened it is plausible that they
become ambiguous at a later point in time.
> + * A sorted table of full object names.
Ah, I see! We have a second table further down that encodes full object
names, so yes, we can fully disambiguate.
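So the lookup I'd expect is roughly the following: binary-search the
table of shortened names and then verify the candidate against the full
name at the same position. This is only a sketch with made-up names to
illustrate my understanding:

    use std::cmp::Ordering;

    // `short_names` holds nr_objects entries of `abbrev_len` bytes each,
    // `full_names` holds nr_objects entries of `hash_len` bytes each,
    // both sorted in the same order.
    fn lookup(short_names: &[u8], full_names: &[u8],
              abbrev_len: usize, hash_len: usize, oid: &[u8]) -> Option<usize> {
        let nr = short_names.len() / abbrev_len;
        let (mut lo, mut hi) = (0, nr);
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let short = &short_names[mid * abbrev_len..(mid + 1) * abbrev_len];
            match short.cmp(&oid[..abbrev_len]) {
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
                Ordering::Equal => {
                    // Disambiguate via the table of full object names.
                    let full = &full_names[mid * hash_len..(mid + 1) * hash_len];
                    return if full == oid { Some(mid) } else { None };
                }
            }
        }
        None
    }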
> + * A table of 4-byte metadata values.
> + * Zero or more chunks. A chunk starts with a four-byte chunk identifier and
> + a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte
> + size (not including the identifier, parameter, or size), plus the chunk
> + data.
> +- Zero or more NUL bytes.
> +- Tables for subsequent object formats:
> + * A sorted table of shortened object names. These are prefixes of the names
> + of all objects in this file, packed together without offset values to
> + reduce the cache footprint of the binary search for a specific object name.
> + * A table of full object names in the order specified by the first object format.
Interesting, why are these sorted by the first object format again?
Doesn't that mean that I have to do a linear search now to locate the
entry for the second object format?
Disclaimer: the following paragraphs go into how I would have
designed this. This is _not_ meant as a "you have to do it this
way", but as a discussion starter to figure out why you have picked
the proposed format and for me to get a better understanding of it.
Stepping back a bit, my expectation is that we'd have one lookup table
per object format so that we can map in all directions: SHA1 -> SHA256
and in reverse. If we had more than two hash functions we'd also need
tables for e.g. Blake3 -> SHA1 and Blake3 -> SHA256 and their reverses.
One way to do this is to have three tables, one for each object format.
The object formats would be ordered lexicographically by their own
object ID, so that one can perform a binary search for an object ID in
every format.
Each row could contain all compatibility hashes directly, but that
would blow up storage space quite fast. An optimization would thus be
to have one table per object format that contains the shortened object
ID plus an offset where the actual record can be found.
You know where to find the tables from the header, and you know the
exact size of each entry, so you can trivially perform a binary search
for the abbreviated object ID in that index.
Once you've found that index you take the stored offset to look up the
record in the "main" table. This main table contains the full object IDs
for all object hashes. So something like the following simplified
format:
+---------------------------------+
| header |
| Format version |
| Number of object IDs |
| SHA1: abbrev, offset |
| SHA256: abbrev, offset |
| Blake3: abbrev, offset |
| Main: offset |
+---------------------------------+
| table for SHA1 |
| 11111 -> 1 |
| 22222 -> 2 |
+---------------------------------+
| table for SHA256 |
| aaaaa -> 2 |
| bbbbb -> 1 |
+---------------------------------+
| table for Blake3 |
| 88888 -> 2 |
| 99999 -> 1 |
+---------------------------------+
| main table |
| 11111111 -> bbbbbbbb -> 9999999 |
| 22222222 -> aaaaaaaa -> 8888888 |
+---------------------------------+
| trailer |
| trailer hash |
+---------------------------------+
Overall you only have to store the full object ID for each hash exactly
once, and the mappings also only have to be stored once. But you can
look up an ID by each of its formats via its indices.
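Expressed as code, the lookup in such a format would boil down to
something like the following. All of the types and names here are of
course made up and only meant to illustrate the idea:

    // One row of the "main" table: the full object ID in every format,
    // stored in a fixed order (say SHA-1, then SHA-256, then Blake3).
    type MainRow = Vec<Vec<u8>>;

    // A per-format table entry: abbreviated object ID plus the index of
    // the corresponding row in the main table.
    type Abbrev = (Vec<u8>, usize);

    // Binary-search the per-format table (sorted by abbreviated ID) and
    // use the stored index to fetch the row that contains the full
    // object IDs for all formats at once.
    fn map_oid<'a>(per_format: &[Abbrev], main: &'a [MainRow],
                   oid: &[u8], abbrev_len: usize) -> Option<&'a MainRow> {
        let pos = per_format
            .binary_search_by(|(short, _)| short.as_slice().cmp(&oid[..abbrev_len]))
            .ok()?;
        Some(&main[per_format[pos].1])
    }

A single binary search thus gives you the mapping into every other
format, regardless of which format you started from.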
With some slight adjustments one could also adapt this format to become
streamable:
- The header only contains the format information as well as which
hash functions are contained.
- The header is followed by the main table. The order of these objects
is basically the streaming order; we don't care about it. We also
don't have to abbreviate any hashes here. This way we can stream
the mappings to disk one by one, and we only need to remember the
specific offsets where each mapping was stored.
- Once all mappings have been streamed we can then write the lookup
tables. We remember the starting index for each lookup table.
- The trailer contains the number of records stored in the table as
well as the individual abbreviated object ID lengths per hash. From
those values it becomes trivial to compute the offsets of every
single lookup table. The offset of the main table is static.
+---------------------------------+
| header |
| Format version |
| SHA1 |
| SHA256 |
| Blake3 |
+---------------------------------+
| main table |
| 11111111 -> bbbbbbbb -> 9999999 |
| 22222222 -> aaaaaaaa -> 8888888 |
+---------------------------------+
| table for SHA1 |
| 11111 -> 1 |
| 22222 -> 2 |
+---------------------------------+
| table for SHA256 |
| aaaaa -> 2 |
| bbbbb -> 1 |
+---------------------------------+
| table for Blake3 |
| 88888 -> 2 |
| 99999 -> 1 |
+---------------------------------+
| trailer |
| number of objects |
| SHA1 abbrev |
| SHA256 abbrev |
| Blake3 abbrev |
| hash |
+---------------------------------+
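In (very hand-wavy) code, the writer for that streamable variant would
then look something like this. Again, all names and the exact encoding
are made up; this is only meant to illustrate the streaming order:

    use std::io::{self, Seek, Write};

    fn write_streamed<W: Write + Seek>(out: &mut W,
                                       mappings: &[(Vec<u8>, Vec<u8>)]) -> io::Result<()> {
        // 1. Header: format version plus the hash functions contained.
        out.write_all(&1u32.to_be_bytes())?;

        // 2. Main table: stream each mapping as it arrives and remember
        //    the offset at which it was written.
        let mut offsets = Vec::with_capacity(mappings.len());
        for (main_oid, compat_oid) in mappings {
            offsets.push(out.stream_position()?);
            out.write_all(main_oid)?;
            out.write_all(compat_oid)?;
        }

        // 3. Lookup tables, one per hash function: sort the remembered
        //    offsets by the respective object ID and write abbreviated
        //    IDs plus offsets (omitted here).

        // 4. Trailer: number of records plus the abbreviated object ID
        //    length per hash, from which the reader can compute where
        //    each lookup table starts.
        out.write_all(&(mappings.len() as u32).to_be_bytes())?;
        Ok(())
    }

The nice property is that we never have to seek backwards, so the whole
file can be written in a single streaming pass.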
Anyway, this is how I would have designed this format, and I think your
format works differently. As I said, my intent is not to say that you
should take my format; I mostly mean it as a discussion starter to
figure out why you have chosen the proposed design so that I can get a
better understanding of it.
Thanks!
Patrick