git.vger.kernel.org archive mirror
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Pavel Rappo <pavel.rappo@gmail.com>
Cc: Git mailing list <git@vger.kernel.org>
Subject: Re: How to reduce pickaxe times for a particular repo?
Date: Wed, 29 Jun 2022 14:31:15 +0200
Message-ID: <220629.8635fnfxnz.gmgdl@evledraar.gmail.com>
In-Reply-To: <CAChcVu=w8mxFtXHukZkf-VswchH_sRppCm=0XZbwh=9-Y4P8cg@mail.gmail.com>


On Tue, Jun 28 2022, Pavel Rappo wrote:

> On Tue, Jun 28, 2022 at 12:58 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>
> <snip>
>
>> But eventually you'll simply run into the regex engine being slow
>
> Since I know very little about git internals, I was under a naive
> impression that a significant, if not comparable to that of regex,
> portion of pickaxe's time is spent on computing diffs between
> revisions. So I assumed that there was a way to pre-compute those
> diffs.

Yes and no, maybe sort of :)

Firstly, -S doesn't involve a diff; it searches the raw pre- and
post-image and compares how many times we match in each.

-G does involve computing the diff.
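
As a rough illustration of that difference (a sketch in Python, not
git's actual implementation; the function names and the use of difflib
as the diff engine are assumptions made for the example):

```python
import difflib
import re


def pickaxe_s(pre: str, post: str, needle: str) -> bool:
    """-S style: no diff, just compare occurrence counts in both images."""
    return pre.count(needle) != post.count(needle)


def pickaxe_g(pre: str, post: str, pattern: str) -> bool:
    """-G style: compute a line diff, then search the changed lines."""
    regex = re.compile(pattern)
    pre_lines = pre.splitlines()
    post_lines = post.splitlines()
    matcher = difflib.SequenceMatcher(a=pre_lines, b=post_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Removed lines come from the pre-image, added lines from the post-image.
        changed = pre_lines[i1:i2] + post_lines[j1:j2]
        if any(regex.search(line) for line in changed):
            return True
    return False


pre = "hit = frotz(nitfol, mf2.ptr, 1, 0);\n"
post = "return frotz(nitfol, two->ptr, 1, 0);\n"

# The string occurs once on each side, so the count does not change.
print(pickaxe_s(pre, post, "frotz(nitfol"))
# The line itself changed and the changed lines match the pattern.
print(pickaxe_g(pre, post, r"frotz\(nitfol"))
```

The -S variant never builds a diff, which is why a change that keeps
the number of occurrences the same is invisible to it but visible to
the -G variant.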

On the one hand we're fast at making diffs, so that really shouldn't
be significant compared to the time spent in the regex engine.

The other side of this is that we're really stupid about how we invoke
the regex engine, historical reasons, backwards compatibility & all
that, but we:

 * Aren't compiling the regex once, and using it N times in some cases
   (I have some local patches to fix this)
 * Are computing matches one line at a time, when we could e.g. point
   PCRE to an entire diff with the right line-split options.
 * Are often doing needless work, e.g. in v2.33 I solved an issue with
   us continuing to create diffs when we could abort early (see
   f97fe358576 (pickaxe -G: don't special-case create/delete,
   2021-04-12)), which resulted in some speed-up.
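
To make the first two points concrete, here is a minimal sketch using
Python's re module as a stand-in for PCRE. It is not git's code: the
function names are hypothetical, and re.MULTILINE is used here as a
rough analogue of PCRE's multiline matching.

```python
import re


def any_line_matches_per_line(text: str, pattern: str) -> bool:
    """Slow shape: set up the pattern and call the engine once per line."""
    for line in text.splitlines():
        # re.compile() is called again for every line. Python happens to
        # cache compiled patterns, so this mainly shows the call pattern.
        if re.compile(pattern).search(line):
            return True
    return False


def any_line_matches_one_pass(text: str, regex: re.Pattern) -> bool:
    """Faster shape: one precompiled pattern, one call over the whole buffer."""
    return regex.search(text) is not None


changed_lines = "int x = 0;\nfor (i = 0; i < n; i++)\nreturn x;\n"

# Compile once; MULTILINE makes ^ and $ match at every line boundary.
for_loop = re.compile(r"^for\b", re.MULTILINE)

print(any_line_matches_per_line(changed_lines, r"^for\b"))
print(any_line_matches_one_pass(changed_lines, for_loop))
```

The two functions agree only when the pattern cannot match across a
newline (for example, \s can), which is one reason a change like this
is harder to make safely than it looks.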

Some of these are tricky to fix.

> <snip>
>
>>  2. Stick that into Lucene with trigram indexing, e.g. ElasticSearch
>>     might make this easy.
>
> <snip>
>
>> For someone familiar with the tools involved that should be about a day
>> to get to a rough hacky solution, it's mostly gluing existing OTS
>> software together.
>
> <snip>
>
> I'll see what I can do with external systems. You see, I initially
> came from a similar repository exposed through OpenGrok. But I think
> that something was wrong with the index or query syntax because I
> couldn't find the things that I knew were there. I was able to secure
> a git repo that was close to that of OpenGrok as I found pickaxe to be
> robust albeit slow alternative for my searches.

This is the first time I've heard of OpenGrok, so no idea, sorry.

One common pitfall with search indexes is that they tend to have a
blacklist of words ("stop words"), e.g. Lucene will have "for", "or"
and other common English words as part of its defaults, so if you're
trying to e.g. find when you altered a for-loop you might silently be
getting no results.
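
A toy sketch of how that happens, in Python. The stop-word set below is
a small illustrative subset, and analyze() and search() are hypothetical
helpers, not the API of Lucene or any other real search engine:

```python
import re

# Illustrative subset of common English stop words (an assumption for
# this example; real analyzers ship their own default lists).
STOP_WORDS = frozenset({"a", "an", "and", "for", "if", "in", "of", "or", "the", "to"})


def analyze(text: str) -> list[str]:
    """Lowercase, split into word tokens, and drop stop words."""
    return [tok for tok in re.findall(r"\w+", text.lower()) if tok not in STOP_WORDS]


def search(docs: dict[str, str], query: str) -> list[str]:
    """Return names of documents containing every analyzed query term."""
    terms = analyze(query)
    if not terms:
        # The query consisted only of stop words, so nothing can match.
        return []
    return [name for name, body in docs.items()
            if all(term in analyze(body) for term in terms)]


docs = {"loop.c": "for (i = 0; i < n; i++) total += i;"}

print(search(docs, "for"))    # "for" is filtered out before the lookup
print(search(docs, "total"))  # an ordinary term is found as expected
```

The same filter runs at index time and at query time, so the word is
neither stored nor searched for, and the empty result comes back
without any warning.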

Thread overview: 6+ messages
2022-06-28 10:50 How to reduce pickaxe times for a particular repo? Pavel Rappo
2022-06-28 11:35 ` Ævar Arnfjörð Bjarmason
2022-06-28 12:35   ` Pavel Rappo
2022-06-29 12:31     ` Ævar Arnfjörð Bjarmason [this message]
2022-06-28 13:01 ` Derrick Stolee
2022-07-01 18:21   ` Jeff King
