git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org,  Jeff King <peff@peff.net>
Subject: Re: [PATCH 2/2] Documentation/gitformat-pack.txt: fix incorrect MIDX documentation
Date: Thu, 12 Oct 2023 14:54:17 -0700	[thread overview]
Message-ID: <xmqq5y3b4id2.fsf@gitster.g> (raw)
In-Reply-To: <af2742e05dff48c4ba0a9f36d58bcbfc052dca40.1697144959.git.me@ttaylorr.com> (Taylor Blau's message of "Thu, 12 Oct 2023 17:09:31 -0400")

Taylor Blau <me@ttaylorr.com> writes:

> diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
> index d7153962d4..54000c9412 100644
> --- a/Documentation/gitformat-pack.txt
> +++ b/Documentation/gitformat-pack.txt
> @@ -392,8 +392,9 @@ CHUNK DATA:
>  	Packfile Names (ID: {'P', 'N', 'A', 'M'})
>  	    Stores the packfile names as concatenated, NUL-terminated strings.

Not a problem this series (neither this or the previous step)
introduces, but I had to read the implementation of
write_midx_pack_names() to find out what "concatenated
NUL-terminated string" really means.  The code has a list of
strings, writes each of them as a NUL-terminated string in sequence,
and to align the beginning of the next chunk, NULs are added to make
the whole thing multiple of MIDX_CHUNK_ALIGNMENT bytes.

A naive reader code might implement a loop like so:

	while (ptr[0] != '\0') {
		endp = strlen(ptr);
		... ptr[0..endp] is one pathname ...
		ptr = endp + 1;
	}
		
expecting that the terminating NUL of the last entry would be
followed by a NUL, but that is buggy.  The sum of the pathname
strings (with one NUL after each) may happen to be multiple of
MIDX_CHUNK_ALIGNMENT bytes, in which case no extra padding NUL bytes
will be there.  So the reader also needs to pay attention to the
chunk size to notice when to stop reading.  It feels somewhat
suboptimal.

>  	    Packfiles must be listed in lexicographic order for fast lookups by
> -	    name. This is the only chunk not guaranteed to be a multiple of four
> -	    bytes in length, so should be the last chunk for alignment reasons.
> +	    name. Individual entries in this chunk are not guarenteed to be
> +	    aligned. The chunk is externally padded with zeros to align
> +	    remaining chunks.

I am not sure what "externally padded" means.

Before write_midx_pack_names() returns, there is a code to pad the
space after it writes those names, which does not look any external
than the bytes used to record the pathnames or NUL bytes used to
terminate these pathnames.

To me, "externally padded" is an appropriate phrase to use if this
function does not bother with the alignment after what it needs to
record, but the caller, i.e., write_chunkfile(), has code to notice
that the number of bytes written by cf->chunks[i].write_fn() it just
called is not a multiple of some required alignment and adds padding
bytes after .write_fn() returned.  I do not think that is what is
going on here.

My observations.

 * The packfile names chunk is used to record N pathnames; N is not
   recorded anywhere explicitly.  To record N pathnames, a single
   chunk with N names in it is used.  It is not like N chunks, each
   with one name is used.

 * Each of these pathname is recorded literally, followed by a NUL,
   in some order, without any extra padding.

 * To keep the size of the chunk multiple of alignment requirement,
   there may be extra NUL bytes after the names.

This data structure does not allow you to binary search or hash
lookup without an extra table of pointers into this table of strings
at runtime, and once you build such a table, the source being sorted
does not help all that much.  So I am not sure how strict the
"lexicographic" requirement needs to be, but of course, starting
strict and loosening later is much easier than starting loose, so
documenting "must be listed" and following that rule is fine.

Thanks.

  reply	other threads:[~2023-10-12 21:54 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-12 21:09 [PATCH 0/2] Documentation/gitformat-pack.txt: correct a few issues/typos Taylor Blau
2023-10-12 21:09 ` [PATCH 1/2] Documentation/gitformat-pack.txt: fix typo Taylor Blau
2023-10-12 21:09 ` [PATCH 2/2] Documentation/gitformat-pack.txt: fix incorrect MIDX documentation Taylor Blau
2023-10-12 21:54   ` Junio C Hamano [this message]
2023-10-30 21:55     ` Taylor Blau
2023-10-31  0:42       ` Junio C Hamano
2023-10-31 19:24 ` [PATCH v2 0/2] Documentation/gitformat-pack.txt: correct a few issues/typos Taylor Blau
2023-10-31 19:24   ` [PATCH v2 1/2] Documentation/gitformat-pack.txt: fix typo Taylor Blau
2023-10-31 19:24   ` [PATCH v2 2/2] Documentation/gitformat-pack.txt: fix incorrect MIDX documentation Taylor Blau
2023-10-31 20:00     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqq5y3b4id2.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).