* Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: brian m. carlson @ 2025-08-14 1:09 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Patrick Steinhardt,
Jonathan Nieder
[-- Attachment #1: Type: text/plain, Size: 5388 bytes --]
TL;DR: We need a different datastore than a flat file for storing
mappings between SHA-1 and SHA-256 in compatibility mode. Advice and
opinions sought.
As I've mentioned earlier on the list, I'm working on some of the code
for interoperability between SHA-1 and SHA-256. The good news is that
it's relatively advanced so far.
Right now, we have pack index v3 (so we can store both loose and packed
objects) and it's possible to clone and fetch from and push to a
single-algorithm server if the client repository supports both
algorithms and there are no shallow clones, partial clones, or
submodules involved. Those who are interested can look at my
`sha256-interop` branch[0], learn more at my talk at Git Merge (which
I'll be giving remotely), or talk to me at the Contributor Summit.
Our approach for mapping object IDs between algorithms uses data in pack
index v3 (outlined in the transition document), plus a flat file called
`loose-object-idx` for loose objects. However, we didn't anticipate
that we'd need to handle mappings long-term for data that is neither a
loose object nor a packed object.
For instance, with shallow clones, we must store a mapping for the
shallows the server has sent us[1], since we lack the history to convert
objects otherwise. Similarly, if there are submodules or we're using a
partial clone, we must store those mappings as well, since we cannot
convert trees without them. We can store them in the
`loose-object-idx`, but since it's not sorted or easily searchable, it's
going to perform really terribly when we store enough of them. Right
now, we read the entire file into two hashmaps (one in each direction)
and we sometimes need to re-read it when other processes add items, so
it won't take much to make it slow and take a lot of memory.
For these reasons, I think we need a different datastore for this and
I'd like to solicit opinions on what that should look like. Here are
some things that come to mind:
* The format should be fast to read and relatively fast to write.
* We need to efficiently read and map objects in both directions. This
is required for many reasons, including efficient fetches and pushes.
* We still require an in-memory store because we stuff entries in there
without writing them during pack indexing and other operations, but
that doesn't mean we need to load data from the data files into the
in-memory structure (in fact, we probably should try to avoid it).
* We want to be able to write small updates to the data without having
to re-write the entire thing (e.g., `git add`). We often know that
we'll be writing a whole batch at once, such as with shallows or
submodules from a clone or fetch, so many places in the code will be
able to start a batch and then write, but we shouldn't assume that
will always be the case. (In other words, we will write more
frequently than we do packs or indexes.)
* It would be helpful if we can determine the type of object being
stored. For instance, if we've stored an object mapping because of a
shallow, `git gc` could remove that mapping if the shallows have been
updated and the mapping is no longer useful.
* We should try not to assume only two hash algorithms. Pack index v3
allows for effectively an arbitrary number and while much of the
compatibility code assumes one main and one compatibility algorithm,
we should try to minimize that if possible.[2]
* Being able to mmap it would be convenient, so if we can make it
relatively small, that's nice.
Some rough ideas of what this could look like:
* We could repurpose the top-bit of the pack order value in pack index
v3 to indicate an object that's not in the pack (this would limit us
to 2^31 items per pack).
* We could put this in new entries in multi-pack index and require that
(although I'm not sure that I love the idea of requiring multi-pack
index in all repositories and I have yet to implement compatibility
mode there).
* We could write some sort of quadratic rollup format like reftable.
I would recommend reading the pack index v3 format documentation in
`Documentation/technical/hash-function-transition.adoc`, since I think
it's helpful to understand what we have now. I have implemented a small
variant on it (documented in my branch) and will send some documentation
updates before code, although the differences are minor and not relevant
here.
I've taken the liberty of CCing some people who have worked deeply with
our existing formats (packs, indexes, reftable, and so on), but I've
almost certainly missed some people and I'd love thoughts from anyone
about this. Once we have an approach that we think is useful, I'm happy
to write up a document for it and send it out.
[0] Available from https://github.com/bk2204/git.git.
[1] This assumes that the server also supports both algorithms to
generate and send the mapping. I am implementing that work now and it
has yet to be pushed to the branch (because it presently doesn't work).
[2] We might find that SHA-256 becomes weak well before SHA-1 is
completely dead and we need to deal with a third algorithm suddenly, so
we should not mortgage our future unnecessarily. Cryptographic attacks
only ever get better.
--
brian m. carlson (they/them)
Toronto, Ontario, CA
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: Junio C Hamano @ 2025-08-14 14:22 UTC (permalink / raw)
To: brian m. carlson
Cc: git, Jeff King, Taylor Blau, Derrick Stolee, Patrick Steinhardt,
Jonathan Nieder
"brian m. carlson" <sandals@crustytoothpaste.net> writes:
I do not know if you want my input (as I wasn't CC'ed), but anyway...
> ... We can store them in the
> `loose-object-idx`, but since it's not sorted or easily searchable, it's
> going to perform really terribly when we store enough of them. Right
> now, we read the entire file into two hashmaps (one in each direction)
> and we sometimes need to re-read it when other processes add items, so
> it won't take much to make it be slow and take a lot of memory.
>
> For these reasons, I think we need a different datastore for this and
> I'd like to solicit opinions on what that should look like. Here are
> some things that come to mind:
I do not see why loose-object-idx is not sorted in the first place,
but to account for new objects getting into the object store, it
would not be a viable way forward to maintain a single sorted file.
We obviously do not want to keep rewriting it in its entirety all
the time.
> Some rough ideas of what this could look like:
>
> * We could repurpose the top-bit of the pack order value in pack index
> v3 to indicate an object that's not in the pack (this would limit us
> to 2^31 items per pack).
Nice to see an effort to see if we can do with a small incremental
change, but would a single bit be sufficient to cover all the needs?
I suspect that the answer is no, in which case the v3 pack .idx
format would need to be further tweaked, but in that case we do not
have to resort to such a trick of stealing a single bit from here
and abusing it for other purposes. We should just make sure that
the new .idx file format can have extensions, unlike older format
that has fixed sections in fixed order.
If there aren't any radically novel ideas, I would imagine that our
design would default to having a big base file that is optimized for
reading and searching, plus another format that is easier and
quicker to write that would overlay it, possibly in a way similar to
how packed and loose refs work?
> * We could write some sort of quadratic rollup format like reftable.
The mapping between two hash formats is stable and once computed can
be cast in stone. Other attributes like the type of each object may
fall into the same category. Multi-level roll-up may be overkill
for such static data items, especially if consolidation would be a
simple "merge two sorted files into one sorted file" operation.
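To illustrate, that consolidation step really is just a streaming merge
of sorted files. A rough Python sketch (the one-line-per-mapping text
format here is hypothetical, chosen only for illustration):

```python
import heapq

def merge_mapping_files(path_a, path_b, out_path):
    """Merge two mapping files, each sorted by main-algorithm OID.

    Each line is assumed to hold "<sha256> <sha1>" in hex; the record
    layout is made up purely to demonstrate the merge.
    """
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
        # heapq.merge streams both inputs lazily, so memory use stays
        # constant no matter how large the base file grows.
        last = None
        for line in heapq.merge(fa, fb):
            if line != last:  # drop entries present in both inputs
                out.write(line)
            last = line
```

Since the mapping pairs are immutable once computed, the merge never has
to resolve conflicts, only deduplicate.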
There are some objects for which we need to carry dynamic
information, e.g. "we expect not to have this in our object store
and that is fine", which may be set for objects immediately behind
the shallow-clone boundary and may need to be cleared when the depth
of shallowness changes. Would it make sense to store these auxiliary
pieces of information in separate place(s)? I suspect that the
objects that need these extra bits of information form a small
subset of all objects for which we need the conversion data, so a
separate table that is indexed into using the order in the main
table may not be a bad way to go.
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: brian m. carlson @ 2025-08-14 22:06 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Jeff King, Taylor Blau, Derrick Stolee, Patrick Steinhardt,
Jonathan Nieder
On 2025-08-14 at 14:22:18, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
> I do not know if you want my input (as I wasn't CC'ed), but anyway...
>
> > ... We can store them in the
> > `loose-object-idx`, but since it's not sorted or easily searchable, it's
> > going to perform really terribly when we store enough of them. Right
> > now, we read the entire file into two hashmaps (one in each direction)
> > and we sometimes need to re-read it when other processes add items, so
> > it won't take much to make it be slow and take a lot of memory.
> >
> > For these reasons, I think we need a different datastore for this and
> > I'd like to solicit opinions on what that should look like. Here are
> > some things that come to mind:
>
> I do not see why loose-object-idx is not sorted in the first place,
> but to account for new objects getting into the object store, it
> would not be a viable way forward to maintain a single sorted file.
> We obviously do not want to keep rewriting it in its entirety all
> the time.
It's not sorted because there's no way to do so and efficiently handle
both lookups. If we sorted it in SHA-256 order, then we would still
have to look up items in SHA-1 order with a linear search, and vice
versa.
What we do for pack index v3 is a sorted table of abbreviated names, a
mapping of that order to pack order, and then full object names in pack
order, with a set for each algorithm. The abbreviated names all use the
same prefix size, which is just long enough to be unambiguous. This
means that we can easily look up an object, find its index into pack
order, and then find the full object ID in any algorithm.
We could probably write some sort of data file that contains these
same mappings except that since we don't have a pack order, we could
just use a sorted order in the main algorithm and omit the main
algorithm's mapping table. We could then have a single table for the
necessary object metadata.
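As a toy model of that layout (full object names instead of abbreviated
prefixes, and in-memory lists standing in for the on-disk tables; purely
illustrative):

```python
import bisect

class MappingTable:
    """Toy model of the proposed file: entries sorted by the main
    algorithm's OID, plus a sorted index per compat algorithm that
    maps back to positions in the main table."""

    def __init__(self, pairs):
        # pairs: (sha256, sha1) tuples; sort by the main algorithm.
        self.main = sorted(pairs)
        # Compat index: sorted compat names, each carrying the
        # position of its entry in the main table.
        self.compat = sorted(
            (sha1, pos) for pos, (_, sha1) in enumerate(self.main)
        )

    def main_to_compat(self, sha256):
        i = bisect.bisect_left(self.main, (sha256,))
        if i < len(self.main) and self.main[i][0] == sha256:
            return self.main[i][1]
        return None

    def compat_to_main(self, sha1):
        i = bisect.bisect_left(self.compat, (sha1,))
        if i < len(self.compat) and self.compat[i][0] == sha1:
            return self.main[self.compat[i][1]][0]
        return None
```

Both directions are a binary search plus one positional lookup, which is
the same O(log N) shape the pack index v3 tables give.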
> If there aren't any radically novel ideas, I would imagine that our
> design would default to having a big base file that is optimized for
> reading and searching, plus another format that is easier and
> quicker to write that would overlay it, possibly in a way similar to
> how packed and loose refs work?
Yeah, that could be an option. Or we could have a base file and some
incrementals, with a `git gc` when we hit 50 items, just like when we
hit 50 packfiles.
> There are some objects for which we need to carry dynamic
> information, e.g. "we expect not to have this in our object store
> and that is fine", which may be set for objects immediately behind
> the shallow-clone boundary and may need to be cleared when the depth
> of shallowness changes. Would it make sense to store these auxiliary
> pieces of information in separate place(s)? I suspect that the
> objects that need these extra bits of information form a small
> subset of all objects for which we need the conversion data, so a
> separate table that is indexed into using the order in the main
> table may not be a bad way to go.
My plan is to just wire this up to `git gc`. We'd know what entries are
potentially disposable (such as shallows) and omit the unneeded entries
when repacking.
--
brian m. carlson (they/them)
Toronto, Ontario, CA
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: Junio C Hamano @ 2025-08-14 22:51 UTC (permalink / raw)
To: brian m. carlson
Cc: git, Jeff King, Taylor Blau, Derrick Stolee, Patrick Steinhardt,
Jonathan Nieder
"brian m. carlson" <sandals@crustytoothpaste.net> writes:
>> There are some objects for which we need to carry dynamic
>> information, e.g. "we expect not to have this in our object store
>> and that is fine", which may be set for objects immediately behind
>> the shallow-clone boundary and may need to be cleared when the depth
>> of shallowness changes. Would it make sense to store these auxiliary
>> pieces of information in separate place(s)? I suspect that the
>> objects that need these extra bits of information form a small
>> subset of all objects for which we need the conversion data, so a
>> separate table that is indexed into using the order in the main
>> table may not be a bad way to go.
>
> My plan is to just wire this up to `git gc`. We'd know what entries are
> potentially disposable (such as shallows) and omit the unneeded entries
> when repacking.
I wasn't talking about when the information is (gathered|consumed),
though. In response to the request for comments on file format, I
was suggesting to have at least two separate files, one for the static
part and the other for the dynamic part, so that the former does not
have to be rewritten all the time.
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: Derrick Stolee @ 2025-08-15 15:27 UTC (permalink / raw)
To: brian m. carlson, git, Jeff King, Taylor Blau, Patrick Steinhardt,
Jonathan Nieder
On 8/13/2025 9:09 PM, brian m. carlson wrote:
> TL;DR: We need a different datastore than a flat file for storing
> mappings between SHA-1 and SHA-256 in compatibility mode. Advice and
> opinions sought.
...
> Our approach for mapping object IDs between algorithms uses data in pack
> index v3 (outlined in the transition document), plus a flat file called
> `loose-object-idx` for loose objects. However, we didn't anticipate
> that we'd need to handle mappings long-term for data that is neither a
> loose object nor a packed object.
I'm generally not a fan of this approach of (ab)using the pack index format
for this, especially when the translation needs to expand beyond "objects
in the local repo".
The requirements, as I see them, are:
1. Given an OID in Hash1, load its mapped OID in Hash2 in O(log N) time.
2. Given an OID in Hash2, load its mapped OID in Hash1 in O(log N) time.
3. As OID pairs are discovered, add them to the data structure in ~O(new)
time.
> Some rough ideas of what this could look like:
>
> * We could repurpose the top-bit of the pack order value in pack index
> v3 to indicate an object that's not in the pack (this would limit us
> to 2^31 items per pack).
> * We could put this in new entries in multi-pack index and require that
> (although I'm not sure that I love the idea of requiring multi-pack
> index in all repositories and I have yet to implement compatibility
> mode there).
> * We could write some sort of quadratic rollup format like reftable.
My thought is that the last option is going to be best. It does require
starting a new file format from scratch, but it doesn't need to be
complicated:
* Header information includes:
- file version info.
- hash versions in mapping.
- the number of OIDs in the format
- the previous mapping file(s) in the chain
- offsets to the Hash1 and Hash2 tables.
- room for expansion to other data being added to the format,
as necessary in the future.
* Hash1 table has a lex-ordered list of Hash1 OIDs and int IDs to do
lookups of the mapped Hash2 OIDs from the second table (by position).
* Hash2 table has a lex-ordered list of Hash2 OIDs and int IDs to do
lookups of the mapped Hash1 OIDs from the first table (by position).
Lookup time would be O(L * log N) where L is the number of layers in
the collection of files. Writing time could be as low as the size of
a new layer on top, with squashing of layers handled in the background
or in the foreground (opportunistically for small layers or as needed
if background maintenance is not available).
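The layered lookup and opportunistic squash can be sketched as follows,
with in-memory dicts standing in for the binary-searchable layer files
(the threshold of 50 echoes the repack-at-50-packfiles heuristic; all
names here are my invention):

```python
def lookup(layers, key, direction="h1h2"):
    """Search newest layer first, so recently written pairs win."""
    for layer in reversed(layers):
        value = layer[direction].get(key)
        if value is not None:
            return value
    return None

def add_layer(layers, pairs, max_layers=50):
    """Append a new layer; squash everything once too many pile up."""
    pairs = list(pairs)
    layers.append({
        "h1h2": dict(pairs),
        "h2h1": {h2: h1 for h1, h2 in pairs},
    })
    if len(layers) > max_layers:
        merged = {}
        for layer in layers:  # oldest first, newer entries override
            merged.update(layer["h1h2"])
        layers[:] = [{
            "h1h2": merged,
            "h2h1": {h2: h1 for h1, h2 in merged.items()},
        }]
```

Writes touch only the newest (small) layer, and reads degrade gracefully
with the number of layers, matching the O(L * log N) estimate above.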
I'm sure that things are more complicated than I'm making it out to
be in this email. I haven't looked at your branch to see the subtle
details around this. Hopefully this just gives you ideas that you
can use as you compare options.
Thanks,
-Stolee
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: Eric Wong @ 2025-08-27 19:08 UTC (permalink / raw)
To: brian m. carlson, git, Jeff King, Taylor Blau, Derrick Stolee,
Patrick Steinhardt, Jonathan Nieder
"brian m. carlson" <sandals@crustytoothpaste.net> wrote:
> TL;DR: We need a different datastore than a flat file for storing
> mappings between SHA-1 and SHA-256 in compatibility mode. Advice and
> opinions sought.
<snip>
> Our approach for mapping object IDs between algorithms uses data in pack
> index v3 (outlined in the transition document), plus a flat file called
> `loose-object-idx` for loose objects. However, we didn't anticipate
> that we'd need to handle mappings long-term for data that is neither a
> loose object nor a packed object.
>
> For instance, with shallow clones, we must store a mapping for the
> shallows the server has sent us[1], since we lack the history to convert
> objects otherwise. Similarly, if there are submodules or we're using a
> partial clone, we must store those mappings as well, since we cannot
> convert trees without them. We can store them in the
> `loose-object-idx`, but since it's not sorted or easily searchable, it's
> going to perform really terribly when we store enough of them. Right
> now, we read the entire file into two hashmaps (one in each direction)
> and we sometimes need to re-read it when other processes add items, so
> it won't take much to make it be slow and take a lot of memory.
This really seems ideal for SQLite, which has come a long way
since 2005 when git started.
I really wish git would've relied more on existing formats
(e.g. LMDB refs) rather than introducing more one-off data
formats that require more cognitive overhead to document and
learn[1], especially when SQLite is extremely portable and works
on tiny devices.
> For these reasons, I think we need a different datastore for this and
> I'd like to solicit opinions on what that should look like. Here are
> some things that come to mind:
>
> * The format should be fast to read and relatively fast to write.
> * We need to efficiently read and map objects in both directions. This
> is required for many reasons, including efficient fetches and pushes.
SQLite seems to do these well, in my experience. It's not the
fastest possible data store, but it's no slouch, either.
> * We still require an in-memory store because we stuff entries in there
> without writing them during pack indexing and other operations, but
> that doesn't mean we need to load data from the data files into the
> in-memory structure (in fact, we probably should try to avoid it).
SQLite supports in-memory DBs, and also mmap. I always prefer
to put larger structures in TMPDIR and rely on the page cache,
because sometimes code ends up running on machines with
too little memory/swap (but git has never been great w.r.t.
memory use :<).
> * We want to be able to write small updates to the data without having
> to re-write the entire thing (e.g., `git add`). We often know that
> we'll be writing a whole batch at once, such as with shallows or
> submodules from a clone or fetch, so many places in the code will be
> able to start a batch and then write, but we shouldn't assume that
> will always be the case. (In other words, we will write more
> frequently than we do packs or indexes.)
Transactions and atomicity are included, of course.
> * It would be helpful if we can determine the type of object being
> stored. For instance, if we've stored an object mapping because of a
> shallow, `git gc` could remove that mapping if the shallows have been
> updated and the mapping is no longer useful.
Column names should be enough.
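For concreteness, the kind of schema this points at might look like the
following (table and column names are my invention, shown via Python's
sqlite3):

```python
import sqlite3

# Hypothetical mapping table; UNIQUE on each OID column gives a btree
# index in both directions, so lookups either way are O(log N).
conn = sqlite3.connect(":memory:")  # a file under .git/ in practice
conn.execute("""
    CREATE TABLE oid_map (
        sha1   BLOB NOT NULL UNIQUE,  -- 20-byte binary OID
        sha256 BLOB NOT NULL UNIQUE,  -- 32-byte binary OID
        kind   TEXT NOT NULL          -- e.g. 'loose', 'shallow', 'submodule'
    )
""")

# Batched writes (e.g. all shallows from one fetch) go into a single
# transaction for atomicity and much better write throughput.
with conn:
    conn.executemany(
        "INSERT INTO oid_map (sha1, sha256, kind) VALUES (?, ?, ?)",
        [(b"\x01" * 20, b"\x02" * 32, "shallow")],
    )

# Lookup works in either direction:
row = conn.execute(
    "SELECT sha256 FROM oid_map WHERE sha1 = ?", (b"\x01" * 20,)
).fetchone()

# And gc could drop mappings whose reason no longer applies:
with conn:
    conn.execute("DELETE FROM oid_map WHERE kind = 'shallow'")
```

The `kind` column covers the "why is this mapping here" requirement from
the original mail without any extra side table.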
> * We should try not to assume only two hash algorithms. Pack index v3
> allows for effectively an arbitrary number and while much of the
> compatibility code assumes one main and one compatibility algorithm,
> we should try to minimize that if possible.[2]
I haven't used it much, but ALTER TABLE should work well nowadays
for adding (maybe not removing) columns.
> * Being able to mmap it would be convenient, so if we can make it
> relatively small, that's nice.
mmap is possible, but default builds of SQLite default to a
relatively small mmap limit (2G?). I don't know why and never
bothered to sign up for their Fossil (JS required :<, last
I checked) to ask about the small default limit.
I don't like SQLite's approach to rejecting outside
contributions; but otherwise it's served me well with various
bits of Perl code for the last 15 years or so. Yeah, the SQLite
developer doesn't have the highest opinion of git, but we
shouldn't let that affect our decision making.
[1] Fwiw, I enjoyed working on git a lot more when it used more
high-level scripting glue. I'm disappointed in the overall
movement towards AOT languages (C, now Rust) due to large
toolchains, slow builds + linkers. Hacking was much more
discoverable when I could just edit installed scripts like
config files and not have to deal with builds at all :>
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: Junio C Hamano @ 2025-08-28 14:53 UTC (permalink / raw)
To: Eric Wong
Cc: brian m. carlson, git, Jeff King, Taylor Blau, Derrick Stolee,
Patrick Steinhardt, Jonathan Nieder
Eric Wong <e@80x24.org> writes:
> I don't like SQLite's approach to rejecting outside
> contributions; but otherwise it's served me well with various
> bits of Perl code for the last 15 years or so. Yeah, the SQLite
> developer doesn't have the highest opinion of git, but we
> shouldn't let that affect our decision making.
They're in the "public domain", IIRC. And more portable than we are ;-)
> [1] Fwiw, I enjoyed working on git a lot more when it used more
> high-level scripting glue. I'm disappointed in the overall
> movement towards AOT languages (C, now Rust) due to large
> toolchains, slow builds + linkers. Hacking was much more
> discoverable when I could just edit installed scripts like
> config files and not have to deal with builds at all :>
Ahh, the halcyon days...
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: brian m. carlson @ 2025-08-28 21:43 UTC (permalink / raw)
To: Eric Wong
Cc: git, Jeff King, Taylor Blau, Derrick Stolee, Patrick Steinhardt,
Jonathan Nieder
On 2025-08-27 at 19:08:16, Eric Wong wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> wrote:
> > TL;DR: We need a different datastore than a flat file for storing
> > mappings between SHA-1 and SHA-256 in compatibility mode. Advice and
> > opinions sought.
>
> <snip>
>
> > Our approach for mapping object IDs between algorithms uses data in pack
> > index v3 (outlined in the transition document), plus a flat file called
> > `loose-object-idx` for loose objects. However, we didn't anticipate
> > that we'd need to handle mappings long-term for data that is neither a
> > loose object nor a packed object.
> >
> > For instance, with shallow clones, we must store a mapping for the
> > shallows the server has sent us[1], since we lack the history to convert
> > objects otherwise. Similarly, if there are submodules or we're using a
> > partial clone, we must store those mappings as well, since we cannot
> > convert trees without them. We can store them in the
> > `loose-object-idx`, but since it's not sorted or easily searchable, it's
> > going to perform really terribly when we store enough of them. Right
> > now, we read the entire file into two hashmaps (one in each direction)
> > and we sometimes need to re-read it when other processes add items, so
> > it won't take much to make it be slow and take a lot of memory.
>
> This really seems ideal for SQLite, which has come a long way
> since 2005 when git started.
>
> I really wish git would've relied on more on existing formats
> (e.g. LMDB refs) rather than introducing more one-off data
> formats that require more cognitive overhead to document and
> learn[1], especially when SQLite is extremely portable and works
> on tiny devices.
SQLite is not an option because it performs poorly with Java and we want
our formats to work with other implementations, like JGit. That's why
we created reftable instead of using SQLite.
Also, in general, I'm not interested in being tied to a single
implementation. If the developers of SQLite decide to dramatically
change the license of all their code like Oracle did with Berkeley DB,
we're going to have a problem. Yes, we can use the older versions, but
we'd still need people to maintain the library and update it.
--
brian m. carlson (they/them)
Toronto, Ontario, CA
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: Eric Wong @ 2025-08-29 19:51 UTC (permalink / raw)
To: brian m. carlson, git, Jeff King, Taylor Blau, Derrick Stolee,
Patrick Steinhardt, Jonathan Nieder
"brian m. carlson" <sandals@crustytoothpaste.net> wrote:
> SQLite is not an option because it performs poorly with Java and we want
> our formats to work with other implementations, like JGit. That's why
> we created reftable instead of using SQLite.
Interesting. I would've thought it'd be a solved problem by now
given the popularity of both SQLite and Java. I actually
expected choosing a more common/standard file format would make
life easier for hackers of alternative git implementations.
Searching for `pure java sqlite' reveals some projects, but
I'm not a Java user at all so can't comment on the quality of
implementations:
https://html.duckduckgo.com/html/?q=pure+java+sqlite
Fwiw, the US Library of Congress has had SQLite as a recommended
data format for several years now:
https://www.loc.gov/preservation/digital/formats/fdd/fdd000461.shtml
Nowadays I'm more concerned about the continued relevance or
even existence of the LoC than SQLite.
> Also, in general, I'm not interested in being tied to a single
> implementation. If the developers of SQLite decide to dramatically
> change the license of all their code like Oracle did with Berkeley DB,
> we're going to have a problem. Yes, we can use the older versions, but
> we'd still need people to maintain the library and update it.
Whereas we'd be implementing and maintaining yet another one-off
data format from day one instead of waiting for an eventuality
which may never come. SQLite 3 has far surpassed the stability
and use of Berkeley DB; there's not a lot of similar formats
that might replace it (e.g. GDBM, LMDB). So I'd expect there'd
be no shortage of hackers able to maintain a usable fork if push
comes to shove.
Fwiw, I consider the proliferation of data formats and protocols
the biggest threat to digital freedom and security. Even when
the implementations are open source it feels like a huge drain
to have to constantly review and port code to deal with
more formats.
* Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
From: Patrick Steinhardt @ 2025-09-03 6:43 UTC (permalink / raw)
To: Derrick Stolee
Cc: brian m. carlson, git, Jeff King, Taylor Blau, Jonathan Nieder
On Fri, Aug 15, 2025 at 11:27:45AM -0400, Derrick Stolee wrote:
Sorry, a bit late to the party due to various things going on that
distracted me for the last couple weeks.
> On 8/13/2025 9:09 PM, brian m. carlson wrote:
> > TL;DR: We need a different datastore than a flat file for storing
> > mappings between SHA-1 and SHA-256 in compatibility mode. Advice and
> > opinions sought.
> ...
> > Our approach for mapping object IDs between algorithms uses data in pack
> > index v3 (outlined in the transition document), plus a flat file called
> > `loose-object-idx` for loose objects. However, we didn't anticipate
> > that we'd need to handle mappings long-term for data that is neither a
> > loose object nor a packed object.
>
> I'm generally not a fan of this approach to (ab)use the pack index format
> for this, especially when the translation needs to expand beyond "objects
> in the local repo".
>
> The requirements, as I see them, are:
>
> 1. Given an OID in Hash1, load its mapped OID in Hash2 in O(log N) time.
> 2. Given an OID in Hash2, load its mapped OID in Hash1 in O(log N) time.
> 3. As OID pairs are discovered, add them to the data structure in ~O(new)
> time.
>
> > Some rough ideas of what this could look like:
> >
> > * We could repurpose the top-bit of the pack order value in pack index
> > v3 to indicate an object that's not in the pack (this would limit us
> > to 2^31 items per pack).
> > * We could put this in new entries in multi-pack index and require that
> > (although I'm not sure that I love the idea of requiring multi-pack
> > index in all repositories and I have yet to implement compatibility
> > mode there).
> > * We could write some sort of quadratic rollup format like reftable.
>
> My thought is that the last option is going to be best.
Yeah, agreed. One thing that we tend to end up with eventually is
using geometric sequences for repacking data structures. Because
ultimately, the bigger a repository grows the more expensive it'll
become over time to rewrite the base file. And furthermore, there's
always going to be use cases where rewriting the base file needs to
happen a whole lot more frequently than one would reasonably expect.
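The geometric condition can be sketched like this, mirroring the spirit
of `git repack --geometric=<factor>` (purely illustrative, not how git
implements it):

```python
def layers_to_squash(sizes, factor=2):
    """Given layer sizes ordered oldest-to-newest, return how many of
    the newest layers should be squashed into one so that every
    remaining layer is at least `factor` times larger than the sum of
    all layers newer than it.
    """
    total = 0   # combined size of the newest layers examined so far
    squash = 0  # how many of those layers belong in the squash
    for size in reversed(sizes):
        if squash == 0 or size < factor * total:
            total += size
            squash += 1
        else:
            break  # geometric progression holds from here on down
    return squash
```

A result of 1 means the newest layer already satisfies the progression
and no merging is needed; larger results bound how much data a single
maintenance pass has to rewrite.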
> It does require starting a new file format from scratch, but it
> doesn't need to be complicated:
>
> * Header information includes:
> - file version info.
> - hash versions in mapping.
> - the number of OIDs in the format
> - the previous mapping file(s) in the chain
> - offsets to the Hash1 and Hash2 tables.
> - room for expansion to other data being added to the format,
> as necessary in the future.
> * Hash1 table has a lex-ordered list of Hash1 OIDs and int IDs to do
> lookups of the mapped Hash2 OIDs from the second table (by position).
> * Hash2 table has a lex-ordered list of Hash2 OIDs and int IDs to do
> lookups of the mapped Hash1 OIDs from the first table (by position).
I was wondering whether we need to fully reinvent the wheel here. We
already use our chunk format for multiple different data formats (commit
graphs, MIDX), so maybe we can also reuse it for this type of mapping?
> Lookup time would be O(L * log N) where L is the number of layers in
> the collection of files. Writing time could be as low as the size of
> a new layer on top, with squashing of layers handled in the background
> or in the foreground (opportunistically for small layers or as needed
> if background maintenance is not available).
>
> I'm sure that things are more complicated than I'm making it out to
> be in this email. I haven't looked at your branch to see the subtle
> details around this. Hopefully this just gives you ideas that you
> can use as you compare options.
Likewise, and I'm very sure that due to me being late the ship has
already sailed :)
Patrick