Re: [PATCH 12/14] rust: add a new binary loose object map format

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Patrick Steinhardt <ps@pks.im>
To: "brian m. carlson" <sandals@crustytoothpaste.net>,
	git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Ezekiel Newren <ezekielnewren@gmail.com>
Subject: Re: [PATCH 12/14] rust: add a new binary loose object map format
Date: Wed, 29 Oct 2025 10:07:36 +0100	[thread overview]
Message-ID: <aQHZWE104-cXb8Ny@pks.im> (raw)
In-Reply-To: <aQFv7cJUYaSUipF-@fruit.crustytoothpaste.net>

On Wed, Oct 29, 2025 at 01:37:49AM +0000, brian m. carlson wrote:
> On 2025-10-28 at 09:18:32, Patrick Steinhardt wrote:
> > Doesn't this indicate that calling this "loose object map" is kind of a
> > misnomer? If we want to be able to store arbitrary objects regardless of
> > the way those are stored (or not stored) in the ODB then I think it's
> > overall quite confusing to have "loose" in the name.
> > 
> > This isn't something we can fix for the old loose object map. But
> > shouldn't we fix this now for the new format you're about to introduce?
> 
> Sure.  I will admit I'm terrible at naming things.  What do you think it
> should be called.

I think the name is quite descriptive despite the misleading "loose"
part. So can't we simply drop that part and call it "object map"?

[snip]
> > > +	* A table of 4-byte metadata values.
> > > +	* Zero or more chunks.  A chunk starts with a four-byte chunk identifier and
> > > +		a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte
> > > +		size (not including the identifier, parameter, or size), plus the chunk
> > > +		data.
> > > +- Zero or more NUL bytes.
> > > +- Tables for subsequent object formats:
> > > +	* A sorted table of shortened object names.  These are prefixes of the names
> > > +		of all objects in this file, packed together without offset values to
> > > +		reduce the cache footprint of the binary search for a specific object name.
> > > +  * A table of full object names in the order specified by the first object format.
> > 
> > Interesting, why are these sorted by the first object format again?
> > Doesn't that mean that I have to do a linear search now to locate the
> > entry for the second object format?
> 
> No, it doesn't.  The full object names are always in the order of the
> first format.  The shortened names for second and subsequent formats
> point into an offset table that finds the offset in the first format.
> 
> Therefore, to look up an OID in the second format knowing its OID in the
> first format, you use the first format's prefixes to find its offset,
> verify its OID in the full object names, and then look up that offset in
> the list of full object names in the second format.
> 
> To go the other way, you find the prefix in the second format, find its
> corresponding offset in the mapping table, verify the full object ID in
> the second format, and then look up that offset in the full object names
> in the first format.

Okay.

[snip]
> > Overall you only have to store the full object ID for each hash exactly
> > once, and the mappings also only have to be stored once. But you can
> > look up an ID by each of its formats via its indices.
> 
> This is very similar to what we have now, except that it has mapping
> offsets for each algorithm instead of the second and subsequent
> algorithms and it re-orders the location of the full object IDs.
> 
> I also intentionally wanted to produce completely deterministic output,
> since in `git verify-pack` we verify that the output is byte-for-byte
> identical and I wanted to have the ability to do that here as well.  (It
> isn't implemented yet, but that's a goal.)  In order to do that, we need
> to write every part of the data in a fixed order, so we'd have to define
> the main table as being sorted by the first algorithm.

Okay.

> > With some slight adjustments one could also adapt this format to become
> > streamable:
> 
> I don't think these formats are as streamable as you might like.  In
> order to create the tables, we need to sort the data for each algorithm
> to find the short name length, which requires knowing all of the data up
> front in order.
> 
> I, too, thought that might be a nice idea, but when I implemented pack
> index v3, I realized that effectively all of the data has to be computed
> up front.  Once you do that, computing the offsets isn't hard because
> it's just some addition and multiplication.

I guess you can make it streamable if you don't care about deterministic
output and if you're willing to have a separate ordered lookup table for
the first hash. But in any case you'd have to keep all object IDs in
memory regardless of that so that those can be sorted. I'm not sure that
this really buys us much.

So overall I'm fine with it not being streamable.

> I personally like a header with offsets better than a trailer since it
> makes parsing easier.  We can peek at the first 64 bytes of the file to
> see if it meets our needs or has data we're interested in.

It's not all that bad -- we for example use this for reftables. Both for
reftables and also for your format we'd mmap anyway, and in order to
mmap you need to figure out the overall size of the file first. From
there on it shouldn't be hard to figure out whether the trailer starts
based on the number of hashes and their respective sizes announced in
the header.

But I remember that this led to some head scratching for myself when I
initially dived into the reftable library, so I very much acknowledge
that it at least adds _some_ complexity.

Anyway, thanks for these explanations! One suggestion: it helped me
quite a bit to draw the ASCII diagrams I had in my previous mail. How
about we add such a diagram to help readers a bit with the high-level
structure of the format?

Patrick

next prev parent reply	other threads:[~2025-10-29  9:07 UTC|newest]

Thread overview: 101+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-27  0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson
2025-10-27  0:43 ` [PATCH 01/14] repository: require Rust support for interoperability brian m. carlson
2025-10-28  9:16   ` Patrick Steinhardt
2025-10-27  0:43 ` [PATCH 02/14] conversion: don't crash when no destination algo brian m. carlson
2025-10-27  0:43 ` [PATCH 03/14] hash: use uint32_t for object_id algorithm brian m. carlson
2025-10-28  9:16   ` Patrick Steinhardt
2025-10-28 18:28     ` Ezekiel Newren
2025-10-28 19:33     ` Junio C Hamano
2025-10-28 19:58       ` Ezekiel Newren
2025-10-28 20:20         ` Junio C Hamano
2025-10-30  0:23       ` brian m. carlson
2025-10-30  1:58         ` Collin Funk
2025-11-03  1:30           ` brian m. carlson
2025-10-29  0:33     ` brian m. carlson
2025-10-29  9:07       ` Patrick Steinhardt
2025-10-27  0:43 ` [PATCH 04/14] rust: add a ObjectID struct brian m. carlson
2025-10-28  9:17   ` Patrick Steinhardt
2025-10-28 19:07     ` Ezekiel Newren
2025-10-29  0:42       ` brian m. carlson
2025-10-28 19:40     ` Junio C Hamano
2025-10-29  0:47       ` brian m. carlson
2025-10-29  0:36     ` brian m. carlson
2025-10-29  9:08       ` Patrick Steinhardt
2025-10-30  0:32         ` brian m. carlson
2025-10-27  0:43 ` [PATCH 05/14] rust: add a hash algorithm abstraction brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-28 17:09     ` Ezekiel Newren
2025-10-28 20:00   ` Junio C Hamano
2025-10-28 20:03     ` Ezekiel Newren
2025-10-29 13:27       ` Junio C Hamano
2025-10-29 14:32         ` Junio C Hamano
2025-10-27  0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-28 20:12   ` Junio C Hamano
2025-11-04  1:48     ` brian m. carlson
2025-11-04 10:24       ` Junio C Hamano
2025-10-27  0:43 ` [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t brian m. carlson
2025-10-28 17:22   ` Ezekiel Newren
2025-10-27  0:43 ` [PATCH 08/14] write-or-die: add an fsync component for the loose object map brian m. carlson
2025-10-27  0:43 ` [PATCH 09/14] hash: expose hash context functions to Rust brian m. carlson
2025-10-29 16:32   ` Junio C Hamano
2025-10-30 21:42     ` brian m. carlson
2025-10-30 21:52       ` Junio C Hamano
2025-10-27  0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-28 17:42     ` Ezekiel Newren
2025-10-29 16:43   ` Junio C Hamano
2025-10-29 22:10     ` Ezekiel Newren
2025-10-29 23:12       ` Junio C Hamano
2025-10-30  6:26         ` Patrick Steinhardt
2025-10-30 13:54           ` Junio C Hamano
2025-10-31 22:43             ` Ezekiel Newren
2025-11-01 11:18               ` Junio C Hamano
2025-10-27  0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-29  0:53     ` brian m. carlson
2025-10-29  9:07       ` Patrick Steinhardt
2025-10-28 18:05   ` Ezekiel Newren
2025-10-29  1:05     ` brian m. carlson
2025-10-29 16:02       ` Ben Knoble
2025-10-27  0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson
2025-10-28  9:18   ` Patrick Steinhardt
2025-10-29  1:37     ` brian m. carlson
2025-10-29  9:07       ` Patrick Steinhardt [this message]
2025-10-29 17:03   ` Junio C Hamano
2025-10-29 18:21   ` Junio C Hamano
2025-10-27  0:44 ` [PATCH 13/14] rust: add a small wrapper around the hashfile code brian m. carlson
2025-10-28 18:19   ` Ezekiel Newren
2025-10-29  1:39     ` brian m. carlson
2025-10-27  0:44 ` [PATCH 14/14] object-file-convert: always make sure object ID algo is valid brian m. carlson
2025-10-29 20:07 ` [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 Junio C Hamano
2025-10-29 20:15   ` Junio C Hamano
2025-11-11  0:12 ` Ezekiel Newren
2025-11-14 17:25 ` Junio C Hamano
2025-11-14 21:11   ` Junio C Hamano
2025-11-17  6:56   ` Junio C Hamano
2025-11-17 22:09     ` brian m. carlson
2025-11-18  0:13       ` Junio C Hamano
2025-11-19 23:04         ` brian m. carlson
2025-11-19 23:24           ` Junio C Hamano
2025-11-19 23:37           ` Ezekiel Newren
2025-11-20 19:52             ` Ezekiel Newren
2025-11-20 23:02               ` brian m. carlson
2025-11-20 23:11                 ` Ezekiel Newren
2025-11-20 23:14                   ` Junio C Hamano
2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson
2025-11-17 22:16   ` [PATCH v2 01/15] repository: require Rust support for interoperability brian m. carlson
2025-11-17 22:16   ` [PATCH v2 02/15] conversion: don't crash when no destination algo brian m. carlson
2025-11-17 22:16   ` [PATCH v2 03/15] hash: use uint32_t for object_id algorithm brian m. carlson
2025-11-17 22:16   ` [PATCH v2 04/15] rust: add a ObjectID struct brian m. carlson
2025-11-17 22:16   ` [PATCH v2 05/15] rust: add a hash algorithm abstraction brian m. carlson
2025-11-17 22:16   ` [PATCH v2 06/15] hash: add a function to look up hash algo structs brian m. carlson
2025-11-17 22:16   ` [PATCH v2 07/15] rust: add additional helpers for ObjectID brian m. carlson
2025-11-17 22:16   ` [PATCH v2 08/15] csum-file: define hashwrite's count as a uint32_t brian m. carlson
2025-11-17 22:16   ` [PATCH v2 09/15] write-or-die: add an fsync component for the object map brian m. carlson
2025-11-17 22:16   ` [PATCH v2 10/15] hash: expose hash context functions to Rust brian m. carlson
2025-11-17 22:16   ` [PATCH v2 11/15] rust: add a build.rs script for tests brian m. carlson
2025-11-17 22:16   ` [PATCH v2 12/15] rust: add functionality to hash an object brian m. carlson
2025-11-17 22:16   ` [PATCH v2 13/15] rust: add a new binary object map format brian m. carlson
2025-11-17 22:16   ` [PATCH v2 14/15] rust: add a small wrapper around the hashfile code brian m. carlson
2025-11-17 22:16   ` [PATCH v2 15/15] object-file-convert: always make sure object ID algo is valid brian m. carlson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQHZWE104-cXb8Ny@pks.im \
    --to=ps@pks.im \
    --cc=ezekielnewren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).