Re: [PATCH 12/14] rust: add a new binary loose object map format

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: <git@vger.kernel.org>,  Patrick Steinhardt <ps@pks.im>,
	 Ezekiel Newren <ezekielnewren@gmail.com>
Subject: Re: [PATCH 12/14] rust: add a new binary loose object map format
Date: Wed, 29 Oct 2025 11:21:55 -0700	[thread overview]
Message-ID: <xmqqqzul8t6k.fsf@gitster.g> (raw)
In-Reply-To: <20251027004404.2152927-13-sandals@crustytoothpaste.net> (brian m. carlson's message of "Mon, 27 Oct 2025 00:44:02 +0000")

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> +=== Modern mapping
> +
> +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose`
> +ending in `.map`.  The portion of the filename before the extension is that of
> +the hash checksum in hex format.
> +
> +`git pack-objects` will repack existing entries into one file, removing any
> +unnecessary objects, such as obsolete shallow entries or loose objects that
> +have been packed.
> +
> +==== Mapping file format

I know near the end of this document we talk about network-byte
order, but let's say that upfront here.

> +- A header appears at the beginning and consists of the following:
> +	* A 4-byte mapping signature: `LMAP`
> +	* 4-byte version number: 1
> +	* 4-byte length of the header section.
> +	* 4-byte number of objects declared in this map file.
> +	* 4-byte number of object formats declared in this map file.
> +  * For each object format:
> +    ** 4-byte format identifier (e.g., `sha1` for SHA-1)
> +    ** 4-byte length in bytes of shortened object names. This is the
> +      shortest possible length needed to make names in the shortened
> +      object name table unambiguous.

This number typically represents a small integer up to 32 or so,
right?  No objection to spend 4-byte for it, but initially I somehow
was confused into thinking that this is the number of bytes for
shortened object names of all the objects in this map file (i.e., (N
* 6) if the map describes N objects, and 6-byte is sufficient prefix
of the object names).  I wonder if there is a way to rephrase the
above to avoid such confusion?

Also I assume that "shorten" refers to "take the first N-byte
prefix".  How about calling them "unique prefix of object names" or
something?

> +    ** 8-byte integer, recording where tables relating to this format
> +      are stored in this index file, as an offset from the beginning.
> +  * 8-byte offset to the trailer from the beginning of this file.

OK.

> +	* Zero or more additional key/value pairs (4-byte key, 4-byte value), which
> +		may optionally declare one or more chunks.  No chunks are currently
> +		defined. Readers must ignore unrecognized keys.

Is this misindented?  In other words, shouldn't the "padding" sit
immediately after "offset of the trailer in the file" and at the
same level?

This uses the word "chunk", which risks implying some relationship
with what is described in Documentation/gitformat-chunk.adoc, but I
suspect this file format has nothing to do with "Chunk-based file
format" described there.  "4-byte key plus 4-byte value" gives an
impression that it is a dictionary to associate bunch of 4-byte
words with 4-byte values, and it is hard to guess where the word
"chunk" comes from.  4-byte keyword plus 4-byte offset into (a later
part of) the file where the chunk defined by that keyword is stored?

The length of the header part minus the size up to the 8-byte offset
to the trailer defines the size occupied by "additional key/value
pairs", so the reader is supposed to tell if the next 4-byte is a
key that it cannot recognise or beyond the end of the header part?

How about replacing this with

* The remainder of the header section is reserved for future use.
  Readers must ignore this section.

until we know what kind of "chunks" are needed?

> +- Zero or more NUL bytes.  These are used to improve the alignment of the
> +	4-byte quantities below.

Everything we saw so far, if the tail end of the header section that
is reserved for future use would hold zero or more <4-byte key,
4-byte value> pairs, are of size divisible by 4.

If anything, we may be better off saying 

 * all the sections described below are placed contiguously without
   gap in the file

 * all the sections are padded with zero or more NUL bytes to make
   their length a multiple of 4

upfront, even before we start talking about the "header" section.
Then the "Zero or more NUL bytes" here, and the padding between
tables do not have to be explicitly described.

> +- Tables for the first object format:
> +	* A sorted table of shortened object names.  These are prefixes of the names
> +		of all objects in this file, packed together without offset values to
> +		reduce the cache footprint of the binary search for a specific object name.

"packed together without offset values...", while understandable,
smells a bit out of place, especially since you haven't explained
what you are trying to let readers find out from this table when
they have one object name.  Presumably, you have them take the first
"length in bytes of shortened object names" bytes from the object
name they have, binary search in this unique-prefix table for an
entry that matches the prefix, to find out that their object may
appear as the N-th object in the table (but the document hasn't told
the readers that is how this table is designed to be used yet)?  And
using that offset, the reader would probably ensure that the N-th
entry that appears in the next "full object names" table does indeed
fully match the object they have?  If that is the case, it is
obvious that there is no "offset value" needed here, but when the
reader does not even know how this table is supposed to be used,
a sudden mention of "offset values" only confuses them.

> +  * A sorted table of full object names.

I assume that the above two "*" bullet points are supposed to be
aligned (iow, sit at the same level under "Tables for the first
object format").

In any case, our reader with a single object name would have found
out that their object appears as the N-th entry of these two tables.

> +	* A table of 4-byte metadata values.

Again, is this (and the next) "*" bullet point at the same level as
the above two tables?

The number of entries in this table is not specified.  Is it one
4-byte metadata per object described in the table (i.e. our reader
recalls that the header has a 4-byte number of objects declared in
this file)?  IOW, would our reader, after finding out that the
object they have is found as the N-th entry in the previous "full
object names" table, look at the N-th entry of this metadata value
table to find the metadata for their object?

> +	* Zero or more chunks.  A chunk starts with a four-byte chunk identifier and
> +		a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte
> +		size (not including the identifier, parameter, or size), plus the chunk
> +		data.

When the chunk data is not multiple of 4-byte, don't we pad?  If we
do, would the padding included in the 8-byte size?  Or if the first
chunk is of an odd size, would the second chunk be unaligned from
its identifier, parameter and size fields?

Presumably, you will allow older readers to safely skip chunks of
newer type they do not recognise, so a reader is expected to grab
the first 16 bytes for (id, param, size), and if it does not care
about the id, just skip the size bytes to reach the next chunk, so
if we were to pad (which I think would be reasonable, given that you
are padding sections to 4-byte boundaries), the eight-byte size
would also count the padding at the end of the chunk data (if the
chunk data needs padding at the end, that is).  If we make it clear
that these chunks are aligned at 4-byte (or 8-byte, I dunno)
boundaries, then ...

> +- Zero or more NUL bytes.

... we do not need to have this entry whose length is unspecified (I
can guess that you added it to allow the reader to skip to the next
4-byte boundary, but this document does not really specify it).

> +- Tables for subsequent object formats:
> +	* A sorted table of shortened object names.  These are prefixes of the names
> +		of all objects in this file, packed together without offset values to
> +		reduce the cache footprint of the binary search for a specific object name.
> +  * A table of full object names in the order specified by the first object format.
> +	* A table of 4-byte values mapping object name order to the order of the
> +		first object format. For an object in the table of sorted shortened object
> +		names, the value at the corresponding index in this table is the index in
> +		the previous table for that same object.
> +	* Zero or more NUL bytes.

The same comment as the section for the primary object format.  I
assume that the above four "*" bullet points are at the same level,
i.e. one unique-prefix table to let reader with a single object name
to find that their object may be the one at N-th location in the
table, followed by the full object name table to verify that the
N-th object indeed is their object, and then find from that N that
the correponding object name in the other hash is the M-th object
in the table in the first object format, and they go from this M to
the 4-byte metadata for that object?

> +- The trailer consists of the following:
> +  * Hash checksum of all of the above.
> +
> +The lower six bits of each metadata table contain a type field indicating the
> +reason that this object is stored:
> +
> +0::
> +	Reserved.
> +1::
> +	This object is stored as a loose object in the repository.
> +2::
> +	This object is a shallow entry.  The mapping refers to a shallow value
> +	returned by a remote server.
> +3::
> +	This object is a submodule entry.  The mapping refers to the commit stored
> +	representing a submodule.
> +
> +Other data may be stored in this field in the future.  Bits that are not used
> +must be zero.
> +
> +All 4-byte numbers are in network order and must be 4-byte aligned in the file,
> +so the NUL padding may be required in some cases.

The document needs to be clear if the "length" field for each
section counts these padding.

> +impl LooseObjectMemoryMap {
> +    /// Create a new `LooseObjectMemoryMap`.
> +    ///
> +    /// The storage and compatibility `HashAlgorithm` instances are used to store the object IDs in
> +    /// the correct map.
> +    fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> LooseObjectMemoryMap {
> +        LooseObjectMemoryMap {
> +            to_compat: BTreeMap::new(),
> +            to_storage: BTreeMap::new(),
> +            compat,
> +            storage,
> +        }
> +    }
> +
> +    fn len(&self) -> usize {
> +        self.to_compat.len()
> +    }
> +
> +    /// Write this map to an interface implementing `std::io::Write`.
> +    fn write<W: Write>(&self, wrtr: W) -> io::Result<()> {
> +        const VERSION_NUMBER: u32 = 1;
> +        const NUM_OBJECT_FORMATS: u32 = 2;
> +        const PADDING: [u8; 4] = [0u8; 4];
> +
> +        let mut wrtr = wrtr;
> +        let header_size: u32 = 4 + 4 + 4 + 4 + 4 + (4 + 4 + 8) * 2 + 8;

Yikes.  Can this be written in a way that is easier to maintain?
Certainly the earlier run of 4's corresponds to what the code below
writes to wrtr, and I am wondering if we can ask wrtr how many bytes
we have asked it to write so far, or something, without having the
above hard-to-read numbers.

> +        wrtr.write_all(b"LMAP")?;
> +        wrtr.write_all(&VERSION_NUMBER.to_be_bytes())?;
> +        wrtr.write_all(&header_size.to_be_bytes())?;
> +        wrtr.write_all(&(self.to_compat.len() as u32).to_be_bytes())?;
> +        wrtr.write_all(&NUM_OBJECT_FORMATS.to_be_bytes())?;
> +
> +        let storage_short_len = self.find_short_name_len(&self.to_compat, self.storage);
> +        let compat_short_len = self.find_short_name_len(&self.to_storage, self.compat);
> +
> +        let storage_npadding = Self::required_nul_padding(self.to_compat.len(), storage_short_len);
> +        let compat_npadding = Self::required_nul_padding(self.to_compat.len(), compat_short_len);

I said 100-column limit is OK, but I am already hating myself saying
so.

next prev parent reply	other threads:[~2025-10-29 18:21 UTC|newest]

Thread overview: 101+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-27  0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson
2025-10-27  0:43 ` [PATCH 01/14] repository: require Rust support for interoperability brian m. carlson
2025-10-28  9:16   ` Patrick Steinhardt
2025-10-27  0:43 ` [PATCH 02/14] conversion: don't crash when no destination algo brian m. carlson
2025-10-27  0:43 ` [PATCH 03/14] hash: use uint32_t for object_id algorithm brian m. carlson
2025-10-28  9:16   ` Patrick Steinhardt
2025-10-28 18:28     ` Ezekiel Newren
2025-10-28 19:33     ` Junio C Hamano
2025-10-28 19:58       ` Ezekiel Newren
2025-10-28 20:20         ` Junio C Hamano
2025-10-30  0:23       ` brian m. carlson
2025-10-30  1:58         ` Collin Funk
2025-11-03  1:30           ` brian m. carlson
2025-10-29  0:33     ` brian m. carlson
2025-10-29  9:07       ` Patrick Steinhardt
2025-10-27  0:43 ` [PATCH 04/14] rust: add a ObjectID struct brian m. carlson
2025-10-28  9:17   ` Patrick Steinhardt
2025-10-28 19:07     ` Ezekiel Newren
2025-10-29  0:42       ` brian m. carlson
2025-10-28 19:40     ` Junio C Hamano
2025-10-29  0:47       ` brian m. carlson
2025-10-29  0:36     ` brian m. carlson
2025-10-29  9:08       ` Patrick Steinhardt
2025-10-30  0:32         ` brian m. carlson
2025-10-27  0:43 ` [PATCH 05/14] rust: add a hash algorithm abstraction brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-28 17:09     ` Ezekiel Newren
2025-10-28 20:00   ` Junio C Hamano
2025-10-28 20:03     ` Ezekiel Newren
2025-10-29 13:27       ` Junio C Hamano
2025-10-29 14:32         ` Junio C Hamano
2025-10-27  0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-28 20:12   ` Junio C Hamano
2025-11-04  1:48     ` brian m. carlson
2025-11-04 10:24       ` Junio C Hamano
2025-10-27  0:43 ` [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t brian m. carlson
2025-10-28 17:22   ` Ezekiel Newren
2025-10-27  0:43 ` [PATCH 08/14] write-or-die: add an fsync component for the loose object map brian m. carlson
2025-10-27  0:43 ` [PATCH 09/14] hash: expose hash context functions to Rust brian m. carlson
2025-10-29 16:32   ` Junio C Hamano
2025-10-30 21:42     ` brian m. carlson
2025-10-30 21:52       ` Junio C Hamano
2025-10-27  0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-28 17:42     ` Ezekiel Newren
2025-10-29 16:43   ` Junio C Hamano
2025-10-29 22:10     ` Ezekiel Newren
2025-10-29 23:12       ` Junio C Hamano
2025-10-30  6:26         ` Patrick Steinhardt
2025-10-30 13:54           ` Junio C Hamano
2025-10-31 22:43             ` Ezekiel Newren
2025-11-01 11:18               ` Junio C Hamano
2025-10-27  0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-29  0:53     ` brian m. carlson
2025-10-29  9:07       ` Patrick Steinhardt
2025-10-28 18:05   ` Ezekiel Newren
2025-10-29  1:05     ` brian m. carlson
2025-10-29 16:02       ` Ben Knoble
2025-10-27  0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-29  1:37     ` brian m. carlson
2025-10-29  9:07       ` Patrick Steinhardt
2025-10-29 17:03   ` Junio C Hamano
2025-10-29 18:21   ` Junio C Hamano [this message]
2025-10-27  0:44 ` [PATCH 13/14] rust: add a small wrapper around the hashfile code brian m. carlson
2025-10-28 18:19   ` Ezekiel Newren
2025-10-29  1:39     ` brian m. carlson
2025-10-27  0:44 ` [PATCH 14/14] object-file-convert: always make sure object ID algo is valid brian m. carlson
2025-10-29 20:07 ` [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 Junio C Hamano
2025-10-29 20:15   ` Junio C Hamano
2025-11-11  0:12 ` Ezekiel Newren
2025-11-14 17:25 ` Junio C Hamano
2025-11-14 21:11   ` Junio C Hamano
2025-11-17  6:56   ` Junio C Hamano
2025-11-17 22:09     ` brian m. carlson
2025-11-18  0:13       ` Junio C Hamano
2025-11-19 23:04         ` brian m. carlson
2025-11-19 23:24           ` Junio C Hamano
2025-11-19 23:37           ` Ezekiel Newren
2025-11-20 19:52             ` Ezekiel Newren
2025-11-20 23:02               ` brian m. carlson
2025-11-20 23:11                 ` Ezekiel Newren
2025-11-20 23:14                   ` Junio C Hamano
2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson
2025-11-17 22:16   ` [PATCH v2 01/15] repository: require Rust support for interoperability brian m. carlson
2025-11-17 22:16   ` [PATCH v2 02/15] conversion: don't crash when no destination algo brian m. carlson
2025-11-17 22:16   ` [PATCH v2 03/15] hash: use uint32_t for object_id algorithm brian m. carlson
2025-11-17 22:16   ` [PATCH v2 04/15] rust: add a ObjectID struct brian m. carlson
2025-11-17 22:16   ` [PATCH v2 05/15] rust: add a hash algorithm abstraction brian m. carlson
2025-11-17 22:16   ` [PATCH v2 06/15] hash: add a function to look up hash algo structs brian m. carlson
2025-11-17 22:16   ` [PATCH v2 07/15] rust: add additional helpers for ObjectID brian m. carlson
2025-11-17 22:16   ` [PATCH v2 08/15] csum-file: define hashwrite's count as a uint32_t brian m. carlson
2025-11-17 22:16   ` [PATCH v2 09/15] write-or-die: add an fsync component for the object map brian m. carlson
2025-11-17 22:16   ` [PATCH v2 10/15] hash: expose hash context functions to Rust brian m. carlson
2025-11-17 22:16   ` [PATCH v2 11/15] rust: add a build.rs script for tests brian m. carlson
2025-11-17 22:16   ` [PATCH v2 12/15] rust: add functionality to hash an object brian m. carlson
2025-11-17 22:16   ` [PATCH v2 13/15] rust: add a new binary object map format brian m. carlson
2025-11-17 22:16   ` [PATCH v2 14/15] rust: add a small wrapper around the hashfile code brian m. carlson
2025-11-17 22:16   ` [PATCH v2 15/15] object-file-convert: always make sure object ID algo is valid brian m. carlson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqqzul8t6k.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=ezekielnewren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=ps@pks.im \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).