Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode

From: Junio C Hamano <gitster@pobox.com>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org,  Jeff King <peff@peff.net>,
	 Taylor Blau <me@ttaylorr.com>,
	 Derrick Stolee <stolee@gmail.com>,
	 Patrick Steinhardt <ps@pks.im>,
	 Jonathan Nieder <jrnieder@gmail.com>
Subject: Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
Date: Thu, 14 Aug 2025 07:22:18 -0700
Message-ID: <xmqq1ppe9e5h.fsf@gitster.g>
In-Reply-To: <aJ03RTHaE_JvHA1t@fruit.crustytoothpaste.net> (brian m. carlson's message of "Thu, 14 Aug 2025 01:09:25 +0000")

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

I do not know if you want my input (as I wasn't CC'ed), but anyway...

> ...  We can store them in the
> `loose-object-idx`, but since it's not sorted or easily searchable, it's
> going to perform really terribly when we store enough of them.  Right
> now, we read the entire file into two hashmaps (one in each direction)
> and we sometimes need to re-read it when other processes add items, so
> it won't take much to make it be slow and take a lot of memory.
>
> For these reasons, I think we need a different datastore for this and
> I'd like to solicit opinions on what that should look like.  Here are
> some things that come to mind:

I do not see why loose-object-idx is not sorted in the first place,
but because new objects keep getting into the object store,
maintaining a single sorted file would not be a viable way forward.
We obviously do not want to keep rewriting it in its entirety all
the time.

> Some rough ideas of what this could look like:
>
> * We could repurpose the top-bit of the pack order value in pack index
>   v3 to indicate an object that's not in the pack (this would limit us
>   to 2^31 items per pack).

It is nice to see an effort to get by with a small incremental
change, but would a single bit be sufficient to cover all the needs?

I suspect that the answer is no, in which case the v3 pack .idx
format would need to be further tweaked, but in that case we do not
have to resort to such a trick of stealing a single bit from here
and abusing it for other purposes.  We should just make sure that
the new .idx file format can have extensions, unlike the older
formats that have fixed sections in a fixed order.
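
For concreteness, the bit-stealing trick in the quoted proposal amounts
to something like the sketch below. It assumes a 32-bit pack order
field (which the quoted 2^31 limit implies); the names NOT_IN_PACK,
encode and decode are hypothetical and do not come from any Git source.

```python
# Sketch only: assumes the pack order value is an unsigned 32-bit field.
NOT_IN_PACK = 1 << 31          # hypothetical flag stored in the top bit
ORDER_MASK = NOT_IN_PACK - 1   # the remaining 31 bits hold the order value


def encode(order, in_pack=True):
    """Pack an order value and the 'not in this pack' flag into one integer."""
    if not 0 <= order <= ORDER_MASK:
        raise ValueError("order must fit in 31 bits")
    return order if in_pack else order | NOT_IN_PACK


def decode(value):
    """Return (order, in_pack) for a value produced by encode()."""
    return value & ORDER_MASK, not (value & NOT_IN_PACK)
```

The single flag is all the spare room such a field offers, which is why
any further per-object information would need a format change anyway.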

If there isn't any radically novel idea, I would imagine that our
design would default to having a big base file that is optimized for
reading and searching, plus another format that is easier and
quicker to write that would overlay it, possibly in a way similar to
how packed and loose refs work?
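
A minimal sketch of such a base-plus-overlay arrangement follows. It is
an illustration under stated assumptions, not a proposed on-disk
format: the class name MappingStore and its methods are hypothetical,
an in-memory sorted list stands in for the read-optimized base file,
and a dict stands in for the cheap-to-append overlay.

```python
import bisect


class MappingStore:
    """Sorted, read-optimized base plus a small write-optimized overlay."""

    def __init__(self, base_pairs):
        # Base: (sha1, sha256) pairs sorted by sha1, searched by bisection.
        self.base = sorted(base_pairs)
        self.base_keys = [sha1 for sha1, _ in self.base]
        # Overlay: stands in for a small file that is cheap to append to.
        self.overlay = {}

    def add(self, sha1, sha256):
        """Record a new mapping without touching the base."""
        self.overlay[sha1] = sha256

    def lookup(self, sha1):
        """Check the overlay first, then binary-search the base."""
        if sha1 in self.overlay:
            return self.overlay[sha1]
        i = bisect.bisect_left(self.base_keys, sha1)
        if i < len(self.base_keys) and self.base_keys[i] == sha1:
            return self.base[i][1]
        return None

    def consolidate(self):
        """Fold the overlay into a freshly sorted base and empty the overlay."""
        merged = dict(self.base)
        merged.update(self.overlay)
        self.base = sorted(merged.items())
        self.base_keys = [sha1 for sha1, _ in self.base]
        self.overlay = {}
```

Writers only ever touch the overlay; the base is rewritten only when
consolidate() runs, much as loose refs are folded into packed-refs.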

> * We could write some sort of quadratic rollup format like reftable.

The mapping between two hash formats is stable and once computed can
be cast in stone.  Other attributes like the type of each object may
fall into the same category.  Multi-level roll-up may be overkill
for such static data items, especially if consolidation would be a
simple "merge two sorted files into one sorted file" operation.
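
The consolidation step mentioned above can be sketched as a streaming
two-way merge. This is an illustration only: merge_sorted_mappings is a
hypothetical name, and the inputs are assumed to be iterables of
(sha1, sha256) pairs already sorted by sha1.

```python
import heapq


def merge_sorted_mappings(a, b):
    """Yield the pairs of two sorted inputs in sorted order, once per key.

    Because the mapping between the two hashes is stable, a key that
    appears in both inputs carries the same value, so the duplicate
    can simply be dropped.
    """
    last = None
    for sha1, sha256 in heapq.merge(a, b):
        if sha1 == last:
            continue  # same object listed in both inputs
        last = sha1
        yield sha1, sha256
```

The merge reads each input once and never needs both in memory, so its
cost is linear in the combined size of the two files.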

There are some objects for which we need to carry dynamic
information, e.g. "we expect not to have this in our object store
and that is fine", which may be set for objects immediately behind
the shallow-clone boundary and may need to be cleared when the depth
of shallowness changes.  Would it make sense to store these auxiliary
pieces of information in separate place(s)?  I suspect that the
objects that need these extra bits of information form a small
subset of all objects for which we need to have the conversion data,
so a separate table that is indexed into using the order in the main
table may not be a bad way to go.
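
One way to picture the separate table is the sketch below, which keys
the auxiliary data by an object's position in the sorted main table.
All names (main_keys, aux, expected_missing and the helper functions)
are hypothetical, and short strings stand in for real object names.

```python
import bisect

# Main table: sorted object names, one row per object (static data).
main_keys = ["aa", "bb", "cc", "dd"]

# Auxiliary table: sparse, keyed by row position in the main table.
# Only the few objects that need dynamic information have an entry.
aux = {2: {"expected_missing": True}}


def position(name):
    """Return the row of name in the main table, or None if absent."""
    i = bisect.bisect_left(main_keys, name)
    return i if i < len(main_keys) and main_keys[i] == name else None


def is_expected_missing(name):
    """Look up the dynamic flag through the object's main-table position."""
    pos = position(name)
    return pos is not None and aux.get(pos, {}).get("expected_missing", False)


def clear_expected_missing(name):
    """Drop the flag, e.g. after the shallow boundary has moved."""
    pos = position(name)
    if pos is not None:
        aux.pop(pos, None)
```

Clearing a flag rewrites only the small auxiliary table, so the static
main table can stay untouched when the depth of a shallow clone changes.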
