From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Subject: [TOPIC 4/12] Scaling Git from a forge's perspective
Date: Mon, 2 Oct 2023 11:19:06 -0400
Message-ID: <ZRrfamSepdiQU9CH@nand.local>
In-Reply-To: <ZRregi3JJXFs4Msb@nand.local>
(Presenter: Taylor Blau, Notetaker: Karthik Nayak)
* Things on my mind!
* There's been a bunch of work from the forges over the last few years:
bitmaps, commit-graphs, etc.
* Q: What should we do next? Curious to hear from everyone, including Keanen's
team.
* Boundary-based bitmap traversals: already spoke about these last year. They
are relevant if you have lots of tips that you're excluding from the rev-list
query. Checking the perf of this is on the backlog.
* Patrick: We have still not activated it in production. We faced some issues
the last time it was activated. We do plan to experiment with this
(https://gitlab.com/gitlab-org/gitaly/-/issues/5537).
* Taylor: Curious about the impact.
* In almost all cases they perform better, in some cases the same, and in very
few cases worse.
* (Jonathan Nieder) Two open-ended questions:
* Different forges run into the same problems. Maybe it's worth comparing
notes. Do we have a good way to do this? In the Git Discord there is a server
operator channel, but it has only two messages.
* Taylor and Patrick discuss this over email.
* Keanen: Used to have a quarterly meeting. Attendance is low.
* From an opportunistic perspective, when people want to do this, it currently
seems like 1:1 conversations take place, but there hasn't been a wider-group
forum.
* Server operator monthly might be fun to revive
* Git contributor summit is where this generally happens. :)
* At the last Git Merge there was a talk by Stolee about Git as a database
and how as a user that can guide you in scaling. Potential roadmap for how
a git server could do some of that automatically. Potential idea? For
example, sharding by time? Like gc automatically generating a pack to serve
shallow clones for recent history.
* Extending cruft-pack implementation to more organically have a threshold
on the number of bytes. The current scheme of rewriting the entire
cruft-pack might not be the best for big repos.
* Patrick: We currently have such a mechanism for geometric repacking.
* (Taylor Blau) Geometric repacking was done a number of years ago, to more
gradually compress the repository from many to few packfiles. We still have
periodic cases where the repository is reduced to 2 packs: one cruft pack and
one pack with the objects. If you had some set of packs which contained
disjoint objects (no duplicates), could we extend verbatim pack reuse to work
with these multiple packs? Has anyone had similar issues?
* Jonathan: One problem is how to know whether a pack has a non-redundant
reachable object or not without worrying about things like TTL. In Git,
there is "push quarantine" code: if the hook rejects a push, its objects don't
get added to the repo. In JGit there is nothing similar yet, so someone could
push a bunch of objects, which get stored even though they're rejected by a
pre-receive hook. That could leave us with packs of unreachable objects.
With history rewriting we also run into complexity about knowing what packs
are "live".
* Patrick: Deterministically pruning objects from the repository is hard
to solve. In GitLab it's a problem where replicas of the repository
contain objects which probably need to be deleted.
* Jeff H: Can we have a classification of refs wherein some refs are
transient and some are long-term?
* Jeff King: There are a bunch of heuristic inputs which can help with
this. For example, older objects are less likely to change than newer ones.
* Taylor: Order by recency, so older ones are in one bitmap and newer
changeable ones could be one clump of bitmaps.
* Minh: I have a question about Taylor's proposal of a single pack composed of
multiple disjoint packs. Midx can notice duplicate objects. Does that help
with knowing what can be streamed through?
* Taylor: The pack reuse code is a bit too naive at this point, but
conceptually this would work. We already have tools for working with packs
like this. But this does give more flexibility.
* Taylor: GitHub recently switched to merge-ort for test merges, with
tremendous improvements, but it sometimes creates a bunch of loose objects.
Could there be an option to have merge-ort sidestep loose objects (write to
fast-import or write a pack directly)?
* Things slow down when writing to the filesystem so much.
* Jonathan Tan: one thing we've discussed is having support in git for a pack
handle representing a still-open pack file that you can append to and read
from in the context of an operation.
* Dscho: that sounds like the sanest thing to do. There's a robust invariant
of needing an idx for the pack file that you need for working with it
efficiently, which requires the pack file to be closed. So some things to
figure out there, I'm interested to follow it.
* Junio: There was a patch sent to the list to restrict the streaming
interface. I wonder if that moves in the opposite direction of what we're
describing.
* brian: In the sha256 work I noticed it currently only works on blobs. But I
don't think adapting it to other object types would be a major departure.
As long as we don't make the interop harder, I don't see a big problem with
doing that. Conversion happens at pack-indexing time.
* Elijah: Did I understand correctly that this produces a lot of cruft
objects?
* Dscho: Yes. We perform test merges and then no ref points to them.
* Elijah: Nice. "git log --remerge-diff" similarly produces objects that
don't need to be stored when it performs test merges; that code path is
careful not to commit them to the object store. You might be able to reuse
some of that code.
* Dscho: Thanks! I'll take a look.
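
One way to experiment with keeping test-merge objects out of the main object
store, in the spirit of the discussion above, is to point Git at a throwaway
object directory and list the real one as an alternate. The sketch below only
builds the environment for such a command. It relies on the documented
`GIT_OBJECT_DIRECTORY` and `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment
variables; the helper name `scratch_object_env` and its parameters are
hypothetical, and this is not how merge-ort or `--remerge-diff` work
internally.

```python
import os


def scratch_object_env(repo_objects_dir, scratch_dir, base_env=None):
    """Return an environment in which Git writes new objects to `scratch_dir`
    while still reading existing objects from `repo_objects_dir`.

    `scratch_dir` is assumed to be an existing, empty directory that the
    caller deletes once the test merge has been inspected.
    """
    env = dict(os.environ if base_env is None else base_env)

    # New objects are written to the primary object directory.
    env["GIT_OBJECT_DIRECTORY"] = scratch_dir

    # Existing objects stay reachable through the alternates list; keep any
    # alternates the caller had already configured.
    alternates = [repo_objects_dir]
    existing = env.get("GIT_ALTERNATE_OBJECT_DIRECTORIES")
    if existing:
        alternates.append(existing)
    env["GIT_ALTERNATE_OBJECT_DIRECTORIES"] = os.pathsep.join(alternates)

    return env
```

A caller could pass the result as `env=` to `subprocess.run` when invoking
something like `git merge-tree --write-tree A B` (with `A` and `B` as
placeholder revisions), then remove the scratch directory so the merge result
never lands as loose objects in the repository itself.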