From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from fhigh-b6-smtp.messagingengine.com (fhigh-b6-smtp.messagingengine.com [202.12.124.157]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9AF9513C914 for ; Wed, 29 Oct 2025 18:21:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=202.12.124.157 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761762121; cv=none; b=o86KFLhMcI0NGpPfNCdh1jdnHfXNVCh8AsRI4M6M91jgeKsqQHmyzR+WHAsQwD6j0VqeciiR3H3qtYi1iUXK1HVFfeFwKUvRvsW8AOdnMrw6cmSnkW+LwQxlTdzHEvWLW4dvhHTixWORgFfziBWmqBIvuzdxZk5xyQuVDd+1NWM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761762121; c=relaxed/simple; bh=gGmISzrUiRs+L9C2PLBeIry88nHIOpvIxYlg2v9O2oY=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=USBoJ5Bt5zFl8osoRANckqvXPhaGdob9D7oyL57hkcfJsUXrA01FUBqyoWXtfBv1gQg80muLiznjnsgTdviyVP275vh5nF+wUaPxNp8qli/VFvG5RbLUrNGOK8ZGZHovCi4IiCWH6ASuPq1OlXhXumEu3nZXM+kQ3cA0YxZcCqA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=pobox.com; spf=pass smtp.mailfrom=pobox.com; dkim=pass (2048-bit key) header.d=pobox.com header.i=@pobox.com header.b=Zf+c/eDG; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b=UlyCLWzv; arc=none smtp.client-ip=202.12.124.157 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=pobox.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=pobox.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=pobox.com header.i=@pobox.com header.b="Zf+c/eDG"; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b="UlyCLWzv" Received: from phl-compute-04.internal (phl-compute-04.internal [10.202.2.44]) by mailfhigh.stl.internal (Postfix) with ESMTP id AE4C57A0084; Wed, 29 Oct 2025 14:21:57 -0400 (EDT) Received: from phl-frontend-01 ([10.202.2.160]) by phl-compute-04.internal (MEProxy); Wed, 29 Oct 2025 14:21:57 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=pobox.com; h=cc :cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm2; t=1761762117; x=1761848517; bh=11pNGhvuk2 KX67n5U/rvi9YJLWmCf+ZU2WxvnpspUwM=; b=Zf+c/eDGmMvTibQpC2hxZgKXOD Axz+8Fnl/UMIRROZ1c/UcIBXCKfjB0brtlefOKhyGF0chbIOwS78giaI6+U+pdCi c2PoY1JALt5rxcUtK72LTFXa7X/CqbT9tkhG0XqxEQ4EGK8HFYBP5ecMlIH12Uzr RbMiAgcm/LKbICr9Wr4kks6KDz2wT9bP4r0+mg+TD5nQNuPnd6dzAuhNbDKRvzoc j/QkGMMRhYign/V3ZlIrKf7oV3OPlQqjkauAxEwi6rTc9bxdTnP8HdxbdA2rbB4d TxJmdkq/LMlAcqk/clHsKqXYSfz5N93KdAQDRfUkv3mYL50VjS/vgUZbSuEg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t= 1761762117; x=1761848517; bh=11pNGhvuk2KX67n5U/rvi9YJLWmCf+ZU2Wx vnpspUwM=; b=UlyCLWzvoAUy53LV1547l+NbG7V1EEyVbsu91tMD3S4o8z2i89n P6Sz2wctL5fqiWXT+LHpuv1rBCi/t5LwXIX12dLG9EDP9YvHR9sBHm3XtYbywR8A qGG4/FtaVGkx8JQlbccIbXMJ7cjzj95s768czJHZGeGmtnSKx0F0jrAutv3rVhXf QxUR09dlJnEEHYULitQMkEm65Nkke45u0GUCFb0hLkcfbmWa09/wt9WQn5Oaobmj DO3KzpHmfw06cgi6mUgxcruct6Lo/bbpjBuIzbGGEHu4pbaOmarfL5k+64v5f+y1 uJYmicJY1dpydVPAta6TLk7iWtMwJQOvWgg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdeggdduieeggeefucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhephffvvefujghffffkfgggtgesthdtredttdertdenucfhrhhomheplfhunhhiohcu vecujfgrmhgrnhhouceoghhithhsthgvrhesphhosghogidrtghomheqnecuggftrfgrth htvghrnhepfeevteetjeehueegffelvdetieevffeufeejleeuffetiefggfeftdfhfeei geeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomhepgh hithhsthgvrhesphhosghogidrtghomhdpnhgspghrtghpthhtohephedpmhhouggvpehs mhhtphhouhhtpdhrtghpthhtohepshgrnhgurghlshestghruhhsthihthhoohhthhhprg hsthgvrdhnvghtpdhrtghpthhtohepghhithesvhhgvghrrdhkvghrnhgvlhdrohhrghdp rhgtphhtthhopehpshesphhkshdrihhmpdhrtghpthhtohepvgiivghkihgvlhhnvgifrh gvnhesghhmrghilhdrtghomhdprhgtphhtthhopehgihhtshhtvghrsehpohgsohigrdgt ohhm X-ME-Proxy: Feedback-ID: if26b431b:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 29 Oct 2025 14:21:57 -0400 (EDT) From: Junio C Hamano To: "brian m. carlson" Cc: , Patrick Steinhardt , Ezekiel Newren Subject: Re: [PATCH 12/14] rust: add a new binary loose object map format In-Reply-To: <20251027004404.2152927-13-sandals@crustytoothpaste.net> (brian m. carlson's message of "Mon, 27 Oct 2025 00:44:02 +0000") References: <20251027004404.2152927-1-sandals@crustytoothpaste.net> <20251027004404.2152927-13-sandals@crustytoothpaste.net> Date: Wed, 29 Oct 2025 11:21:55 -0700 Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain "brian m. carlson" writes: > +=== Modern mapping > + > +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose` > +ending in `.map`. The portion of the filename before the extension is that of > +the hash checksum in hex format. > + > +`git pack-objects` will repack existing entries into one file, removing any > +unnecessary objects, such as obsolete shallow entries or loose objects that > +have been packed. > + > +==== Mapping file format I know near the end of this document we talk about network-byte order, but let's say that upfront here. > +- A header appears at the beginning and consists of the following: > + * A 4-byte mapping signature: `LMAP` > + * 4-byte version number: 1 > + * 4-byte length of the header section. > + * 4-byte number of objects declared in this map file. > + * 4-byte number of object formats declared in this map file. > + * For each object format: > + ** 4-byte format identifier (e.g., `sha1` for SHA-1) > + ** 4-byte length in bytes of shortened object names. This is the > + shortest possible length needed to make names in the shortened > + object name table unambiguous. This number typically represents a small integer up to 32 or so, right? No objection to spend 4-byte for it, but initially I somehow was confused into thinking that this is the number of bytes for shortened object names of all the objects in this map file (i.e., (N * 6) if the map describes N objects, and 6-byte is sufficient prefix of the object names). I wonder if there is a way to rephrase the above to avoid such confusion? Also I assume that "shorten" refers to "take the first N-byte prefix". How about calling them "unique prefix of object names" or something? > + ** 8-byte integer, recording where tables relating to this format > + are stored in this index file, as an offset from the beginning. > + * 8-byte offset to the trailer from the beginning of this file. OK. > + * Zero or more additional key/value pairs (4-byte key, 4-byte value), which > + may optionally declare one or more chunks. No chunks are currently > + defined. Readers must ignore unrecognized keys. Is this misindented? In other words, shouldn't the "padding" sit immediately after "offset of the trailer in the file" and at the same level? This uses the word "chunk", which risks implying some relationship with what is described in Documentation/gitformat-chunk.adoc, but I suspect this file format has nothing to do with "Chunk-based file format" described there. "4-byte key plus 4-byte value" gives an impression that it is a dictionary to associate bunch of 4-byte words with 4-byte values, and it is hard to guess where the word "chunk" comes from. 4-byte keyword plus 4-byte offset into (a later part of) the file where the chunk defined by that keyword is stored? The length of the header part minus the size up to the 8-byte offset to the trailer defines the size occupied by "additional key/value pairs", so the reader is supposed to tell if the next 4-byte is a key that it cannot recognise or beyond the end of the header part? How about replacing this with * The remainder of the header section is reserved for future use. Readers must ignore this section. until we know what kind of "chunks" are needed? > +- Zero or more NUL bytes. These are used to improve the alignment of the > + 4-byte quantities below. Everything we saw so far, if the tail end of the header section that is reserved for future use would hold zero or more <4-byte key, 4-byte value> pairs, are of size divisible by 4. If anything, we may be better off saying * all the sections described below are placed contiguously without gap in the file * all the sections are padded with zero or more NUL bytes to make their length a multiple of 4 upfront, even before we start talking about the "header" section. Then the "Zero or more NUL bytes" here, and the padding between tables do not have to be explicitly described. > +- Tables for the first object format: > + * A sorted table of shortened object names. These are prefixes of the names > + of all objects in this file, packed together without offset values to > + reduce the cache footprint of the binary search for a specific object name. "packed together without offset values...", while understandable, smells a bit out of place, especially since you haven't explained what you are trying to let readers find out from this table when they have one object name. Presumably, you have them take the first "length in bytes of shortened object names" bytes from the object name they have, binary search in this unique-prefix table for an entry that matches the prefix, to find out that their object may appear as the N-th object in the table (but the document hasn't told the readers that is how this table is designed to be used yet)? And using that offset, the reader would probably ensure that the N-th entry that appears in the next "full object names" table does indeed fully match the object they have? If that is the case, it is obvious that there is no "offset value" needed here, but when the reader does not even know how this table is supposed to be used, a sudden mention of "offset values" only confuses them. > + * A sorted table of full object names. I assume that the above two "*" bullet points are supposed to be aligned (iow, sit at the same level under "Tables for the first object format"). In any case, our reader with a single object name would have found out that their object appears as the N-th entry of these two tables. > + * A table of 4-byte metadata values. Again, is this (and the next) "*" bullet point at the same level as the above two tables? The number of entries in this table is not specified. Is it one 4-byte metadata per object described in the table (i.e. our reader recalls that the header has a 4-byte number of objects declared in this file)? IOW, would our reader, after finding out that the object they have is found as the N-th entry in the previous "full object names" table, look at the N-th entry of this metadata value table to find the metadata for their object? > + * Zero or more chunks. A chunk starts with a four-byte chunk identifier and > + a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte > + size (not including the identifier, parameter, or size), plus the chunk > + data. When the chunk data is not multiple of 4-byte, don't we pad? If we do, would the padding included in the 8-byte size? Or if the first chunk is of an odd size, would the second chunk be unaligned from its identifier, parameter and size fields? Presumably, you will allow older readers to safely skip chunks of newer type they do not recognise, so a reader is expected to grab the first 16 bytes for (id, param, size), and if it does not care about the id, just skip the size bytes to reach the next chunk, so if we were to pad (which I think would be reasonable, given that you are padding sections to 4-byte boundaries), the eight-byte size would also count the padding at the end of the chunk data (if the chunk data needs padding at the end, that is). If we make it clear that these chunks are aligned at 4-byte (or 8-byte, I dunno) boundaries, then ... > +- Zero or more NUL bytes. ... we do not need to have this entry whose length is unspecified (I can guess that you added it to allow the reader to skip to the next 4-byte boundary, but this document does not really specify it). > +- Tables for subsequent object formats: > + * A sorted table of shortened object names. These are prefixes of the names > + of all objects in this file, packed together without offset values to > + reduce the cache footprint of the binary search for a specific object name. > + * A table of full object names in the order specified by the first object format. > + * A table of 4-byte values mapping object name order to the order of the > + first object format. For an object in the table of sorted shortened object > + names, the value at the corresponding index in this table is the index in > + the previous table for that same object. > + * Zero or more NUL bytes. The same comment as the section for the primary object format. I assume that the above four "*" bullet points are at the same level, i.e. one unique-prefix table to let reader with a single object name to find that their object may be the one at N-th location in the table, followed by the full object name table to verify that the N-th object indeed is their object, and then find from that N that the correponding object name in the other hash is the M-th object in the table in the first object format, and they go from this M to the 4-byte metadata for that object? > +- The trailer consists of the following: > + * Hash checksum of all of the above. > + > +The lower six bits of each metadata table contain a type field indicating the > +reason that this object is stored: > + > +0:: > + Reserved. > +1:: > + This object is stored as a loose object in the repository. > +2:: > + This object is a shallow entry. The mapping refers to a shallow value > + returned by a remote server. > +3:: > + This object is a submodule entry. The mapping refers to the commit stored > + representing a submodule. > + > +Other data may be stored in this field in the future. Bits that are not used > +must be zero. > + > +All 4-byte numbers are in network order and must be 4-byte aligned in the file, > +so the NUL padding may be required in some cases. The document needs to be clear if the "length" field for each section counts these padding. > +impl LooseObjectMemoryMap { > + /// Create a new `LooseObjectMemoryMap`. > + /// > + /// The storage and compatibility `HashAlgorithm` instances are used to store the object IDs in > + /// the correct map. > + fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> LooseObjectMemoryMap { > + LooseObjectMemoryMap { > + to_compat: BTreeMap::new(), > + to_storage: BTreeMap::new(), > + compat, > + storage, > + } > + } > + > + fn len(&self) -> usize { > + self.to_compat.len() > + } > + > + /// Write this map to an interface implementing `std::io::Write`. > + fn write(&self, wrtr: W) -> io::Result<()> { > + const VERSION_NUMBER: u32 = 1; > + const NUM_OBJECT_FORMATS: u32 = 2; > + const PADDING: [u8; 4] = [0u8; 4]; > + > + let mut wrtr = wrtr; > + let header_size: u32 = 4 + 4 + 4 + 4 + 4 + (4 + 4 + 8) * 2 + 8; Yikes. Can this be written in a way that is easier to maintain? Certainly the earlier run of 4's corresponds to what the code below writes to wrtr, and I am wondering if we can ask wrtr how many bytes we have asked it to write so far, or something, without having the above hard-to-read numbers. > + wrtr.write_all(b"LMAP")?; > + wrtr.write_all(&VERSION_NUMBER.to_be_bytes())?; > + wrtr.write_all(&header_size.to_be_bytes())?; > + wrtr.write_all(&(self.to_compat.len() as u32).to_be_bytes())?; > + wrtr.write_all(&NUM_OBJECT_FORMATS.to_be_bytes())?; > + > + let storage_short_len = self.find_short_name_len(&self.to_compat, self.storage); > + let compat_short_len = self.find_short_name_len(&self.to_storage, self.compat); > + > + let storage_npadding = Self::required_nul_padding(self.to_compat.len(), storage_short_len); > + let compat_npadding = Self::required_nul_padding(self.to_compat.len(), compat_short_len); I said 100-column limit is OK, but I am already hating myself saying so.