[NOTES 05/11] Pluggable object databases

From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Subject: [NOTES 05/11] Pluggable object databases
Date: Mon, 6 Oct 2025 15:19:18 -0400
Message-ID: <aOQWNrIDSTTZtbnG@nand.local>
In-Reply-To: <aOQVeVYY6zadPjln@nand.local>

Topic: Pluggable object databases
Leader: Patrick Steinhardt

* Work towards this has been under way since Git 2.50.
* Allow innovation on the server side for large binary files.
* The design will soon be up for discussion.
* Allow migration between different object formats, and allow the format to be
	picked later by the implementer.
* The planned work is to make the new db more pluggable; right now the work is
	still about refactoring. Git 2.53 will have a proof of concept. It might take
	until the second half of 2026 to be done.
* Blocker 1: The current db format is still not clear, particularly with regard
	to latency-related performance issues.
	 * It might use content chunking and hashing, or it might use an existing db
		 implementation like Cassandra.
* Blocker 2: The second problem is how to generate the packfile.
* Taylor wonders whether we can reuse the current object db, but Patrick thinks
	the current implementation is too large/complex to adopt. The current
	refactoring effort, with better abstractions, might speed up future changes.
* Gitster wonders whether we can just use the hash of the chunks' hashes.
* Taylor also thinks a new obj db might become just as complex.
* Patrick thinks the new obj db can be more maintainable. Starting off with a
	brand new abstraction allows faster iteration.
* Rewriting the obj db in a new world might be challenging because the pack obj
	is so intimately tied to so many usages and optimizations (e.g. bitmaps).
	There is also the need to identify big binary objects over the wire.
* Taylor thinks maybe we don't need to rewrite pack obj, but abstracting the
	packfile could make it worse and more verbose.
* Patrick mentions that many adjacent projects already abstract away from the
	pack format, e.g. JGit and libgit2. JGit identified early on that Cassandra's
	performance would never work due to latency overhead.
* Taylor suggests we identify a proof of concept with comparable latency to
	existing obj db before doing additional refactoring.
* Ezekiel refocuses the discussion on targeting large binary files. Maybe with
	large binary files, latency degradation is not as important.
* In Git, we already have a divergent code path for large binary files; we just
	chose to store them in the packfile. Technically, people can change the
	storage selection without refactoring.
* Patrick still thinks having sub-system abstraction would make code more
	maintainable.
* Taylor is supportive of letting some objects use the current db while only
	the large binary files use the new db; at least we don't impose the overhead
	on all objects.
* The obj chunk design Patrick is proposing is meant to benefit both client-side
	storage and the server side.
* We should resume this discussion with more concrete use cases; right now we
	are still talking about potential scenarios.
* The promisor feature on the server side cannot satisfy all clients, since some
	clients don't want to use promisors, so the server side might still be
	expected to have the large binary files on disk.
* The packfile URL might still be the main direction we can use to fix the large
	binary issue without having to explode objects into chunks.
* Another benefit of obj chunking is reduced hash time for large binary files.
	Gerrit currently sees that 50% of clone time is due to hashing. Parallel
	hashing is also possible with obj chunking.
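
To make the "hash of the chunks' hashes" idea and the parallel hashing point
above more concrete, here is a minimal Python sketch. It is an illustration
only, not Git's object ID scheme and not a design proposed at the summit: the
fixed chunk size, the choice of SHA-256, and every function name below are
assumptions made for the example. In particular, the notes do not say whether
chunk boundaries would be fixed-size or derived from the content.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fixed chunk size; the notes do not specify one.
CHUNK_SIZE = 4 * 1024 * 1024


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    # Empty input still yields one (empty) chunk so it gets a well-defined ID.
    return [data[i:i + size] for i in range(0, len(data), size)] or [b""]


def chunk_digest(chunk: bytes) -> bytes:
    """Hash one chunk independently of all the others."""
    return hashlib.sha256(chunk).digest()


def hash_of_chunk_hashes(data: bytes, size: int = CHUNK_SIZE,
                         workers: int = 4) -> str:
    """Hypothetical ID: SHA-256 over the concatenated per-chunk digests."""
    chunks = split_chunks(data, size)
    # Chunks are independent, so they can be hashed concurrently.
    # pool.map() returns results in input order, keeping the ID deterministic.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(chunk_digest, chunks))
    return hashlib.sha256(b"".join(digests)).hexdigest()
```

The resulting ID depends on the chunk size and differs from a plain hash of the
whole content, so an ID computed this way would not match an existing object
ID. Threads are enough here because CPython's hashlib releases the GIL while
hashing larger buffers.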
