From: Derrick Stolee <derrickstolee@github.com>
To: Pavel Rappo <pavel.rappo@gmail.com>,
	Git mailing list <git@vger.kernel.org>
Subject: Re: How to reduce pickaxe times for a particular repo?
Date: Tue, 28 Jun 2022 09:01:17 -0400
Message-ID: <6439e948-ff79-9e10-97f5-378806e25b5b@github.com>
In-Reply-To: <CAChcVumN66OxOjag9gPqgLq7gQrgdaEkZAJabusE-gGC7LLVyw@mail.gmail.com>

On 6/28/2022 6:50 AM, Pavel Rappo wrote:

Hi Pavel! Welcome.

> I have a repo of the following characteristics:
> 
>   * 1 branch
>   * 100,000 commits

This is not too large.

>   * 1TB in size

This _is_ large.

>   * The tip of the branch has 55,000 files

And again, this is not large.

This means you have some very large files in your repo, perhaps
even binary files that you don't intend to search.

>   * No new commits are expected: the repo is abandoned and kept for
> archaeological purposes.
> 
> Typically, a `git log -S/-G` lookup takes around a minute to complete.
> I would like to significantly reduce that time. How can I do that? I
> can spend up to 10x more disk space, if required. The machine has 10
> cores and 32GB of RAM.

You are using -S<string> or -G<regex> to see which commits change the
number of matches of that <string> or <regex>. If you don't provide a
pathspec, then Git will search every changed file, including those
very large binary files.

Perhaps you'd like to start by providing a pathspec that limits the
search to only the meaningful code files?
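To make that concrete, here is a minimal sketch on a throwaway repo. The
file names (api.h, dump.bin) and the search string are hypothetical
stand-ins; substitute pathspecs that match your own source files.

```shell
set -e
# Throwaway repo so the example is self-contained.
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

# One commit touches source, another touches a stand-in for a large asset.
printf 'int frobnicate(void);\n' > api.h
git add api.h
git commit -q -m "add api"
printf 'frobnicate blob\n' > dump.bin
git add dump.bin
git commit -q -m "add dump"

# Limit the pickaxe to C sources and headers.
matches_src=$(git log -S'frobnicate' --format=%s -- '*.c' '*.h')
# Or search everything except a known-large file type.
matches_excl=$(git log -S'frobnicate' --format=%s -- . ':(exclude)*.bin')
echo "$matches_src"
echo "$matches_excl"
```

In both forms Git never has to load the blobs excluded by the pathspec,
which is where I would expect most of the time to go in a repo like yours.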

As far as I know, Git doesn't have any data structures that can speed
up content-based matches like this. The commit-graph's changed-path
Bloom filters only help Git with questions like "did this specific file
change?" which is not going to be a critical code path in what you're
describing.
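For reference, this is roughly how those filters are written and the kind
of query they serve. It is a sketch on a throwaway repo with hypothetical
file names, and it shows a path-limited history walk, not a pickaxe search.

```shell
set -e
# Throwaway repo so the example is self-contained.
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

printf 'int frobnicate(void);\n' > api.h
git add api.h
git commit -q -m "add api"
printf 'notes\n' > notes.txt
git add notes.txt
git commit -q -m "add notes"

# Write a commit-graph file that includes changed-path Bloom filters.
git commit-graph write --reachable --changed-paths

# The filters can help a path-limited walk like this one skip commits
# that did not touch api.h.
history=$(git log --format=%s -- api.h)
echo "$history"
```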

I'm not sure what you're actually trying to ask with -S or -G, so maybe
it is worth considering other types of queries, such as -L<n>,<m>:<file>
or something. This is just a shot in the dark, as you might be doing the
only thing you _can_ do to solve your problem.
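In case it helps, here is a small sketch of what an -L query looks like.
The repo, file name, and line range are hypothetical; the point is only
that -L follows a line range in one file instead of searching content
across every changed file.

```shell
set -e
# Throwaway repo so the example is self-contained.
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

printf 'int frobnicate(void);\nint b;\nint c;\nint d;\nint e;\n' > api.h
git add api.h
git commit -q -m "add api"

# Change only line 5.
printf 'int frobnicate(void);\nint b;\nint c;\nint d;\nlong e;\n' > api.h
git commit -q -am "change line 5"

# Change only line 1.
printf 'long frobnicate(void);\nint b;\nint c;\nint d;\nlong e;\n' > api.h
git commit -q -am "change line 1"

# Follow the history of line 1 of api.h. The output also contains the
# diff for each commit; keep just the subject lines here.
touched=$(git log -L1,1:api.h --format='subject=%s' | grep '^subject=')
echo "$touched"
```

The commit that only changed line 5 should not appear in that listing.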

Thanks,
-Stolee
