Re: [RFC] Faster git grep.

From: "Ondřej Bílka" <neleai@seznam.cz>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC] Faster git grep.
Date: Thu, 25 Jul 2013 23:31:00 +0200
Message-ID: <20130725213100.GA28551@domone.kolej.mff.cuni.cz>
In-Reply-To: <7vli4u4bkm.fsf@alter.siamese.dyndns.org>

On Thu, Jul 25, 2013 at 01:41:13PM -0700, Junio C Hamano wrote:
> Ondřej Bílka <neleai@seznam.cz> writes:
> 
> > One solution would be to use the same trick as was done in Google Code.
> > Build and keep a database of trigraphs and which files contain how many
> > of them. When a query is made, check only those files that have the
> > appropriate combination of trigraphs.
> 
> This depends on how you go about trying to reduce the database
> overhead, I think.  For example, a very naive approach would be to
> create such a trigraph hit index for each and every commit for all
> paths.  When "git grep $commit $pattern" is run, you would consult
> such a table with $commit and potential trigraphs derived from the
> $pattern to grab the potential paths your hits _might_ be in.
>
Do you think that "git grep $commit $pattern" is run more than 1% as
often as "git grep $pattern"?

If grepping a random commit in history is an important use case, then
keeping the db information in history makes sense. Otherwise, just having
a database for the current version and updating it on the fly as the
version changes is enough.
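
A minimal sketch of the "database for the current version" variant, to
make the idea concrete. It is an illustration under stated assumptions,
not code from git: the class and method names are hypothetical, it
handles only literal patterns (no regular expressions), and it records
only whether a trigraph occurs in a file rather than how many times.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical index: trigraph -> paths whose contents contain it."""

    def __init__(self):
        self.paths_by_trigram = defaultdict(set)
        self.trigrams_by_path = {}

    def update(self, path, text):
        """(Re)index one path; call this whenever the file changes."""
        self.remove(path)
        grams = trigrams(text)
        self.trigrams_by_path[path] = grams
        for gram in grams:
            self.paths_by_trigram[gram].add(path)

    def remove(self, path):
        """Forget a path, e.g. after it is deleted or before reindexing."""
        for gram in self.trigrams_by_path.pop(path, set()):
            self.paths_by_trigram[gram].discard(path)

    def candidates(self, literal):
        """Paths that might contain literal; a superset of the real hits."""
        grams = trigrams(literal)
        if not grams:
            # Shorter than 3 characters: the index cannot narrow anything.
            return set(self.trigrams_by_path)
        sets = [self.paths_by_trigram.get(g, set()) for g in grams]
        return set.intersection(*sets)
```

Only the files returned by candidates() would then need to be scanned
by the real matcher, since a file can contain every trigraph of the
pattern without containing the pattern itself.
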
> But the contents of a path usually do not change in each and every
> commit.  So you may want to instead index with the blob object names
> (i.e. which trigraphs appear in what blobs).  But once you go that
> route, your "git grep $commit $pattern" needs to read and enumerate
> all the blobs that appear in $commit's tree, and see which blobs may
> potentially have hits.  Then you would need to build an index every
> time you make a new commit for blobs whose trigraphs have not been
> counted.
> 
> Nice thing is that once a blob (or a commit for that matter) is
> created and its object name is known, its contents will not change,
> so you can index once and reuse it many times.  But I am not yet
> convinced if pre-indexing is an overall win, compared to the cost of
> maintaining such a database.
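
The blob-keyed variant described above can be sketched as follows. This
is an assumption-laden illustration, not git code: BlobTrigramCache,
read_blob and the dict standing in for a commit's tree are hypothetical,
and only literal patterns are handled. The one git-specific fact used is
that, in a SHA-1 repository, a blob's object name is the SHA-1 of the
header "blob <size>\0" followed by the contents.

```python
import hashlib


def blob_id(data):
    """Object name git gives a blob with these bytes (SHA-1 repository)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def trigrams(data):
    """Return the set of all 3-byte substrings of data."""
    return frozenset(data[i:i + 3] for i in range(len(data) - 2))


class BlobTrigramCache:
    """Hypothetical cache: blob object name -> trigraphs in that blob."""

    def __init__(self, read_blob):
        self.read_blob = read_blob  # callable: object name -> bytes
        self.grams_by_blob = {}
        self.blobs_indexed = 0

    def grams_for(self, oid):
        # A blob's contents never change once its name is known,
        # so each blob is read and indexed at most once.
        if oid not in self.grams_by_blob:
            self.grams_by_blob[oid] = trigrams(self.read_blob(oid))
            self.blobs_indexed += 1
        return self.grams_by_blob[oid]

    def candidate_paths(self, tree, literal):
        """tree: dict of path -> blob object name for one commit."""
        wanted = trigrams(literal)
        return sorted(p for p, oid in tree.items()
                      if wanted <= self.grams_for(oid))
```

Every path in the commit's tree still has to be enumerated on each
query, but only blobs that have not been seen before are read and
indexed, which is the trade-off the quoted paragraphs weigh.
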