From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Pavel Rappo <pavel.rappo@gmail.com>
Cc: Git mailing list <git@vger.kernel.org>
Subject: Re: How to reduce pickaxe times for a particular repo?
Date: Tue, 28 Jun 2022 13:35:19 +0200
Message-ID: <220628.86bkudf19g.gmgdl@evledraar.gmail.com>
In-Reply-To: <CAChcVumN66OxOjag9gPqgLq7gQrgdaEkZAJabusE-gGC7LLVyw@mail.gmail.com>
On Tue, Jun 28 2022, Pavel Rappo wrote:
> I have a repo of the following characteristics:
>
> * 1 branch
> * 100,000 commits
> * 1TB in size
> * The tip of the branch has 55,000 files
> * No new commits are expected: the repo is abandoned and kept for
> archaeological purposes.
>
> Typically, a `git log -S/-G` lookup takes around a minute to complete.
> I would like to significantly reduce that time. How can I do that? I
> can spend up to 10x more disk space, if required. The machine has 10
> cores and 32GB of RAM.
In git as it stands now, the main thing you can do is to limit your search
by paths. If you use the commit-graph and have a git that's using
"commitGraph.readChangedPaths" (defaults to true), doing e.g.:
    git log -p -G<rx> -- tests/
can really help, as can any other filter, such as --author or whatever.
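The changed-path data that "commitGraph.readChangedPaths" reads has to be written into the commit-graph first. A minimal sketch, assuming a git recent enough to have the --changed-paths option of git-commit-graph(1); the two function names are made up for illustration:

```shell
# write_changed_paths REPO: (re)write the commit-graph for REPO, including
# the changed-path Bloom filters that path-limited history walks can use.
write_changed_paths() {
    git -C "$1" commit-graph write --reachable --changed-paths
}

# search_in_path REPO REGEX PATH: -G search limited to PATH.
search_in_path() {
    git -C "$1" log -p -G"$2" -- "$3"
}
```

For a repo that no longer gets new commits, "write_changed_paths ." should only be needed once; after that something like "search_in_path . 'int main' tests/" is the path-limited search from above.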
But eventually you'll simply run into the regex engine being slow. If
you're feeling very adventurous, I have a very WIP branch that makes this
a lot faster by having -S and -G use PCREv2 as a backend:
http://github.com/avar/git/tree/avar/pcre2-conversion-of-diffcore-pickaxe
Benchmark results (made sometime last year; times are in seconds, as
elapsed(user+sys)) were:
Test                                                                     origin/next      HEAD
------------------------------------------------------------------------------------------------------------------
4209.1: git log -S'int main' <limit-rev>..                               0.38(0.36+0.01)  0.37(0.33+0.04) -2.6%
4209.2: git log -S'æ' <limit-rev>..                                      0.51(0.47+0.04)  0.32(0.27+0.05) -37.3%
4209.3: git log --pickaxe-regex -S'(int|void|null)' <limit-rev>..        0.72(0.68+0.03)  0.57(0.54+0.03) -20.8%
4209.4: git log --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..         0.60(0.55+0.02)  0.39(0.34+0.05) -35.0%
4209.5: git log --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..      0.43(0.40+0.03)  0.50(0.44+0.06) +16.3%
4209.6: git log -G'(int|void|null)' <limit-rev>..                        0.64(0.55+0.09)  0.63(0.56+0.05) -1.6%
4209.7: git log -G'if *\([^ ]+ & ' <limit-rev>..                         0.64(0.59+0.05)  0.63(0.56+0.06) -1.6%
4209.8: git log -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                      0.63(0.54+0.08)  0.62(0.55+0.06) -1.6%
4209.9: git log -i -S'int main' <limit-rev>..                            0.39(0.35+0.03)  0.38(0.35+0.02) -2.6%
4209.10: git log -i -S'æ' <limit-rev>..                                  0.39(0.33+0.06)  0.32(0.28+0.04) -17.9%
4209.11: git log -i --pickaxe-regex -S'(int|void|null)' <limit-rev>..    0.90(0.84+0.05)  0.58(0.53+0.04) -35.6%
4209.12: git log -i --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..     0.71(0.64+0.06)  0.40(0.37+0.03) -43.7%
4209.13: git log -i --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..  0.43(0.40+0.03)  0.50(0.46+0.04) +16.3%
4209.14: git log -i -G'(int|void|null)' <limit-rev>..                    0.64(0.57+0.06)  0.62(0.56+0.05) -3.1%
4209.15: git log -i -G'if *\([^ ]+ & ' <limit-rev>..                     0.65(0.59+0.06)  0.63(0.54+0.08) -3.1%
4209.16: git log -i -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                  0.63(0.55+0.08)  0.62(0.56+0.05) -1.6%
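So it's much faster on some queries in particular (up to about 44% on
the -i --pickaxe-regex cases), although the non-ASCII bracket expression
with --pickaxe-regex got about 16% slower. I don't think that code is
ready for git.git in its current form, but if you're desperate for
performance and need to run ad-hoc queries...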
I don't know the full shape of your repo, but 1TB in size probably means
some very big files? I think you might want to experiment with e.g. a
filtered repo to filter out big blobs or something else you may be
needlessly searching through (binaries?).
I.e. I think you're probably getting a lot of OS cache churn, where we
can't have the working data in memory for your whole search, so you're
mainly I/O bound.
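To check whether a handful of huge blobs dominate the repo, something like the following might do. It's only a sketch: the function names and the default of 20 entries are made up for illustration, and PATTERN is whatever turns out to be worth skipping.

```shell
# largest_blobs REPO [N]: print "<size in bytes> <path>" for the N largest
# blobs reachable from any ref, biggest first (N defaults to 20).
largest_blobs() {
    git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print substr($0, 6) }' |
    sort -k1,1nr |
    head -n "${2:-20}"
}

# search_excluding REPO REGEX PATTERN: -G search over everything except
# paths matching PATTERN, using the ":(exclude)" pathspec magic.
search_excluding() {
    git -C "$1" log --format=%H -G"$2" -- . ":(exclude)$3"
}
```

If "largest_blobs ." shows that, say, a few binary file types make up most of the size, "search_excluding . '<rx>' '*.bin'" is a cheap way to see how much a search speeds up without them, before going as far as building a filtered repo.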
I did want to (as a future infinite-time project) create a search index
for regexes in git for -S and -G, i.e. we'd store something like
trigrams of potentially matchable content, so we could skip commits &
trees quickly if the diff e.g. didn't contain the fixed string "int" or
whatever.
But that's a much bigger project...
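To illustrate the trigram idea (this is not anything git implements today, just a sketch with made-up function names): a literal string can only occur in some text if every one of its 3-character substrings occurs there too, so a stored set of trigrams is enough to rule a diff out without reading it.

```shell
# trigrams: print the distinct 3-byte substrings of each line on stdin.
trigrams() {
    LC_ALL=C awk '{
        for (i = 1; i + 2 <= length($0); i++)
            print substr($0, i, 3)
    }' | LC_ALL=C sort -u
}

# literal_may_match LITERAL TRIGRAM_FILE: succeed unless some trigram of
# LITERAL is missing from TRIGRAM_FILE (one trigram per line). A literal
# shorter than 3 bytes has no trigrams, so it can never be ruled out.
literal_may_match() {
    missing=$(printf '%s\n' "$1" | trigrams | grep -F -x -v -c -f "$2")
    [ "$missing" -eq 0 ]
}
```

E.g. with "git show <commit> | trigrams >tri" stored for a commit, "literal_may_match int tri" failing means the pickaxe machinery never has to look at that commit for "int". Succeeding only means "maybe", so the real search still has to run on whatever is left.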
If you're really desperate for performance & willing to hack on
something custom, you could emulate that with a hacky solution, e.g.:
1. Create a COMMIT=DIFF pair for all commits in your repo, or e.g.
   PATH=DIFF (so one concat'd diff with all modifications ever to a
   given path)
2. Stick that into Lucene with trigram indexing, e.g. ElasticSearch
   might make this easy. Make sure not to "store documents" in the
   index, you just want the reverse index from say "int" to "documents"
   that contain it.
3. Do a two-step search, where a search like "foo.*bar" is first run
   against the index, where you find say all commits that have "foo" in
   the diff OR "bar" in the diff, ditto changed paths.
4. Feed that list into the "real" git log -S or -G search, either
   limiting by commits, or by paths (taking advantage of the
   commit-graph path index).
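A toy version of those four steps, to show how the pieces fit together. Big assumption: a sorted flat file of "<commit> <trigram>" lines stands in for Lucene/ElasticSearch here, which will not scale to a 1TB repo; all function names are made up, and the prefilter takes one fixed string that the regex is known to require.

```shell
# build_index REPO INDEX: steps 1-2, write one "<commit> <trigram>" line
# for every distinct 3-byte substring in each commit's diff (no context).
build_index() {
    git -C "$1" rev-list --all |
    while read -r commit; do
        git -C "$1" show --format= --unified=0 "$commit" |
        LC_ALL=C awk -v c="$commit" '{
            for (i = 1; i + 2 <= length($0); i++)
                print c " " substr($0, i, 3)
        }'
    done | LC_ALL=C sort -u > "$2"
}

# candidates INDEX LITERAL: step 3, print the commits whose diff has
# every trigram of LITERAL (a superset of the real matches).
candidates() {
    printf '%s\n' "$2" |
    LC_ALL=C awk '{
        for (i = 1; i + 2 <= length($0); i++)
            print substr($0, i, 3)
    }' | LC_ALL=C sort -u > "$1.query"
    if [ -s "$1.query" ]; then
        LC_ALL=C awk '
            NR == FNR { want[$0] = 1; n++; next }
            {
                p = index($0, " ")
                if (substr($0, p + 1) in want)
                    hits[substr($0, 1, p - 1)]++
            }
            END { for (c in hits) if (hits[c] == n) print c }
        ' "$1.query" "$1"
    else
        # LITERAL is too short to have trigrams: nothing can be ruled out
        LC_ALL=C awk '{ print substr($0, 1, index($0, " ") - 1) }' "$1" |
        LC_ALL=C sort -u
    fi
    rm -f "$1.query"
}

# search REPO INDEX LITERAL REGEX: step 4, run the real -G search, but
# only on the candidate commits.
search() {
    list=$(candidates "$2" "$3")
    [ -n "$list" ] || return 0
    printf '%s\n' "$list" |
    git -C "$1" log --no-walk --stdin --format=%H -G"$4"
}
```

After a one-off "build_index . /tmp/idx", a query like "search . /tmp/idx 'foo' 'foo.*bar'" only runs the regex on commits whose diff can contain "foo" at all. The same shape should carry over to a real trigram index; only candidates() would change.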
For someone familiar with the tools involved that should be about a day
to get to a rough hacky solution; it's mostly gluing existing
off-the-shelf software together.
You should be able to get your searches down to the tens of milliseconds
range with that if you also carefully manage which parts are in cache,
but it depends a lot on the exact shape of data in your repo, how much
memory you have etc.