From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Subject: [TOPIC 7/8] Speeding up the connectivity check
Date: Thu, 29 Sep 2022 15:21:52 -0400 [thread overview]
Message-ID: <YzXwUEre3PP22FDU@nand.local> (raw)
In-Reply-To: <YzXvMRc6X60kjVeY@nand.local>
# Ideas for speeding up object connectivity checks in git-receive-pack
(Patrick Steinhardt)
- Patrick: At GitLab we've noticed some repositories have become slow
with a lot of references.
- Initially I thought it was the object negotiation.
- The connectivity checks seem to be the culprit. The connectivity check
is implemented quite naively. "git rev-list <commits> --not --all"
- A while ago I tried to optimize it. Stop walking when you reach an
existing object. List feedback was that I had not gotten the semantics
right, since that existing object is not necessarily pointed to by a
ref and not everything it references is necessarily present in the
repository.
- Trial 2: optimize rev-list. Sped up connectivity check by 60-70%! But
the motivating repositories had grown, so it was still too slow.
- How do other people do it, with millions of references?
- Terry: Yes, we've seen this problem.
- Just the setup time for the revision walk takes a long time. Initially
we weren't using reachability bitmaps; using those also helped. Then
using subsets of references — first HEAD, then refs/heads/-, …
- Jrnieder: Initially we did this for reachability checks; do
connectivity checks do it, too? Terry: Yes, Demetr did that.
- Demetr: When the connectivity check fails, we fall back to a fuller
set of refs.
- Patrick: Many of our references are not even visible to the public. If
Git makes configurable which refs are part of the connectivity check,
that would already make things faster in our case.
- Taylor: Did you experiment with reachability bitmaps? Am I remembering
correctly that they made things slower?
- Patrick: We did some experiments with reachability bitmaps, but not
specifically for this problem. In those experiments they did make some
things slower.
- Taylor: one thing that you could do is make the bitmap traversal by
building up a complete bitmap of the boundary between haves and
wants instead of a bitmap of all of the haves. Involves far less
object traversal, and there are some clever ways to do this.
- Taylor: As long as you can quickly determine the boundary between
the haves and wants for a given request, the connectivity check
should be as fast as you need.
- Peff: One difference between how Git and JGit does this is JGit is
building up a structure of "what is reachable". You could persist a
bitmap representing "here's everything that's reachable from this
repository" and subtract that out; that would help with many cases.
The problem with one reachability bitmap like that is that it goes
stale whenever someone pushes something. But you could make a set of
"pseudo-merge bitmaps" for each group of 10,000 refs or so. Especially
if you're clever about which refs you do that for, that can be a
significant improvement ("these 2,000,000 refs haven't been touched
for a year, so I can use this bitmap and don't even have to examine
them").
- Terry: There are two ways that unreachable objects appear. One is by
branches being deleted or rewound. The other is failed pushes where
objects were persisted and the ref update didn't succeed. I wanted to
distinguish between objects that are "applied" and "unapplied", became
very thorny. And with that, on a branch rewind, we can calculate what
just became unreachable.
- Waleed Khan: Object negotiation with bloom filters paper
- Kleppman and Howard 2020
- Blog post
https://martin.kleppmann.com/2020/12/02/bloom-filter-hash-graph-sync.html
- Paper https://arxiv.org/abs/2012.00472
- Quote: "Unlike the skipping algorithm used by Git, our algorithm
never unnecessarily sends any commits that the other side already
has, and the Bloom filters are very compact, even for large commit
histories." Alternative ways to write to VFS-backed worktrees (e.g.
write($HASH) instead of write(<bytes>))
- Google tried to use the sparse-index to control what gets materialized
by in a VFS-like system. What if we replaced the write() syscall +
writing bytes and write a hash instead. It could save a lot, e.g. in
an online IDE.
- Stolee: The "write" analog of FSMonitor for VFS-for-Git is the
post-index-change hook. We suppress the writes by manipulating the
index and then communicating the write back to the VFS.
- Jrnieder: This is bigger than just the "write()" call; if we have a
git-aware filesystem, we can be much less wasteful than we are today.
E.g. FSMonitor can only make `git status` so fast because we still
have to stat(), but with a VFS, we could ask the VFS what has changed.
- Do we think this is useful for anything other than a VFS? Do we still
want this even if it's only useful for VFS?
- Stolee: We could make FSMonitor more git-aware e.g. it doesn't know
about writes that we make. JeffH: We also can't say "ignore .o files,
i only care about source files". That also helps greatly. Writing a
hash instead of the contents is probably about as expensive, most of
the savings are in avoiding the stat() calls. This also sounds racy.
- Emily: Another reason to do this work is that this is a good jumping
off point to libify the Git internals. Is there any reason not to do
that? Jrnieder: To make this concrete: do you mean for example
creating a worktree.h with a vtable of worktree operations and having
things talk to that instead of the FS? Emily: Yes. Things like that
are the reason why we have libgit2, so what if Git could just ship its
own library. Bmc: Libgit2 is used in lots of places like the Rust
package manager (Cargo). The problem is that Git is GPLv2, which is
not usable by lots of folks.
- Stolee: The stance of the Git project is that the API is the CLI, not
individual files. But I think this is a good thing to have for the
project as a whole, even just internally.
- We can finally have unit tests!!??!
next prev parent reply other threads:[~2022-09-29 19:22 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-29 19:17 Notes from the Git Contributor's Summit, 2022 Taylor Blau
2022-09-29 19:19 ` [TOPIC 1/8] Bundle URIs Taylor Blau
2022-09-29 19:19 ` [TOPIC 2/8] State of SHA-256 transition Taylor Blau
2022-09-29 19:20 ` [TOPIC 3/8] Merge ORT timeline Taylor Blau
2022-09-29 19:20 ` [TOPIC 4/8] Commit `--filter`'s Taylor Blau
2022-09-29 19:21 ` [TOPIC 5/8] Server side merges and rebases Taylor Blau
2022-09-29 19:21 ` [TOPIC 6/8] State of sparsity work Taylor Blau
2022-09-29 19:21 ` Taylor Blau [this message]
2022-09-29 19:22 ` [TOPIC 8/8] Using Git securely in shared services Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YzXwUEre3PP22FDU@nand.local \
--to=me@ttaylorr.com \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).