* what's a useful definition of full text index on a repository?
@ 2007-10-01 16:33 David Tweed
2007-10-01 17:25 ` Jon Smirl
0 siblings, 1 reply; 4+ messages in thread
From: David Tweed @ 2007-10-01 16:33 UTC (permalink / raw)
To: Git Mailing List, Jon Smirl
Basically a "blue sky" question about full-text indexing git repositories.
A while back, whilst talking about overall git structure
(see
http://marc.info/?l=git&m=118891945402778&w=2
), Jon Smirl raised the question of putting a full-text index on a
repository. I doubt a full-text index is of much use on a code
repository because the queries tend to be focused on either
released versions or immediate git-blame stuff. However, for
repositories of things like evolving documents/presentations/notes
where content is deleted because it doesn't fit the context rather
than being superseded by better stuff, it might be more natural to
search for "where's that paragraph I wrote on 'human' 'vision' and
'kinematic' 'feedback' ?". So I got to thinking about experimenting
with a full text index and even started writing some code. However, I
then realised that it's not obvious what the most useful definition of
a full text index is on evolving files. (To be clear, I'm _not_
thinking about changing the database fundamentals as discussed in the
referenced thread and indeed would put the full-text index into a
different file that just references the existing git db stuff
precisely because I doubt the text index will be of use to most
people.)
A "classical" full-text index seems inappropriate because, if I've got
a long text document whose blob in commit n1 uses word 'x' in one
section, and the corresponding file in descendant commit n2 keeps the
same text using word 'x' but changes a different section of the
document, there's probably no point showing me both documents (and I
can always track through the history once I've got one (commit-id,file
pair)). So in the case of a single word, a "useful" definition would
be that the entry for word w in the full-text index consists of those
(commit-id,file) pairs whose diff with their parent contains an
addition of text containing word w. (This will catch the creation of
the text containing w, and then precisely those files which are close
modifications of it, while ignoring changes to other areas.) This seems to
make sense for a single word. Let's call this the "appearance diff"
definition of a full text index.
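A quick way to see what this definition computes is a toy prototype. The sketch below (Python; the in-memory history format and helper names are invented for illustration, where a real version would walk the commit DAG via git plumbing) builds exactly the word → (commit, file) map described above:

```python
import difflib
import re

def words(line):
    return set(re.findall(r"\w+", line.lower()))

def appearance_diff_index(history):
    """history: (commit_id, filename, parent_text, new_text) tuples.
    Returns {word: [(commit_id, filename), ...]} where each entry's
    diff against its parent *adds* a line containing the word."""
    index = {}
    for commit, fname, parent_text, new_text in history:
        added = set()
        for d in difflib.ndiff(parent_text.splitlines(),
                               new_text.splitlines()):
            if d.startswith("+ "):  # a line added relative to the parent
                added |= words(d[2:])
        for w in added:
            index.setdefault(w, []).append((commit, fname))
    return index

history = [
    ("n0", "p1.tex", "",
     "human vision is studied here\nsecond section about colour"),
    ("n1", "p1.tex",
     "human vision is studied here\nsecond section about colour",
     "human vision is studied here\n"
     "second section about colour and kinematic feedback"),
]
idx = appearance_diff_index(history)
print(idx["vision"])    # [('n0', 'p1.tex')] -- n1's change elsewhere is ignored
print(idx["feedback"])  # [('n1', 'p1.tex')]
```

Note how the edit in n1 only touches the second section, so n1 is indexed under "feedback" but not under "vision", which is the whole point of the definition.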
However, things become unclear if you consider a query with multiple
words. Consider the simplest case of a linear history where commit n0
adds word "vision" to file p1.tex (with respect to its single parent),
there are some intermediate commits and then commit n7 adds word
"feedback" to p1.tex. Then there's no commit whose diff with its
single parent contains both words "vision" and "feedback". In the
linear history case you could imagine trying to find the first commit
which is a child of _all_ the commits under the "appearance diff"
definition. However, that clearly doesn't "obviously" extend to
general full DAG histories, and in any case it's probably not fully
correct even in the linear case. So maybe a different definition would
be better. So I'm just throwing the question out to the list in case
anyone has any better ideas for what a full-text index on an evolving
set of files ought to be.
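One candidate for the linear case can at least be written down concretely. Under the assumption (my reading, not anything settled here) that the index keeps, per word, the set of commits where it was added, the answer to a multi-word query is the latest of the words' earliest appearances:

```python
def first_covering_commit(order, appearances):
    """order: commit ids of a linear history, oldest first.
    appearances: {word: set of commits whose diffs add that word}.
    Returns the earliest commit by which every word has appeared,
    i.e. the first descendant of all the per-word appearance commits."""
    pos = {c: i for i, c in enumerate(order)}
    latest_needed = max(min(pos[c] for c in commits)
                        for commits in appearances.values())
    return order[latest_needed]

order = ["n0", "n1", "n2", "n3", "n4", "n5", "n6", "n7"]
print(first_covering_commit(order, {"vision": {"n0"},
                                    "feedback": {"n7"}}))  # n7
```

As the text says, none of this extends obviously to a DAG, where a "first common descendant" need not even be unique.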
(One question is "why do you want to build a table rather than
actively search the full git repo for each query (i.e., combination of
words) as you make it?" My primary motivation is that I might in the
future like to run queries on some sort of low-processor-power
UMPC-type device, having built the file containing a "full text index"
data structure for the index on a quite beefy desktop. The other point
is that when searching natural-language text from a fallible memory
you're more likely to try different combinations of search terms
iteratively to hit the right one, so there might be some point in
building an index.)
Anyway, it's currently an idle speculation,
--
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"we had no idea that when we added templates we were adding a Turing-
complete compile-time language." -- C++ standardisation committee
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: what's a useful definition of full text index on a repository?
From: Jon Smirl @ 2007-10-01 17:25 UTC (permalink / raw)
To: David Tweed; +Cc: Git Mailing List
On 10/1/07, David Tweed <david.tweed@gmail.com> wrote:
> Basically a "blue sky" question about full-text indexing git repositories.
>
> A while back, whilst talking about overall git structure
> (see
>
> http://marc.info/?l=git&m=118891945402778&w=2
>
> ), Jon Smirl raised the question of putting a full-text index on a
> repository. I doubt a full-text index is of much use on a code
> repository because the queries tend to be focused on either
> released versions or immediate git-blame stuff. However, for
> repositories of things like evolving documents/presentations/notes
> where content is deleted because it doesn't fit the context rather
> than being superseded by better stuff, it might be more natural to
> search for "where's that paragraph I wrote on 'human' 'vision' and
> 'kinematic' 'feedback' ?". So I got to thinking about experimenting
> with a full text index and even started writing some code. However, I
> then realised that it's not obvious what the most useful definition of
> a full text index is on evolving files. (To be clear, I'm _not_
> thinking about changing the database fundamentals as discussed in the
> referenced thread and indeed would put the full-text index into a
> different file that just references the existing git db stuff
> precisely because I doubt the text index will be of use to most
> people.)
This is what full text is used for with code:
http://lxr.mozilla.org/
It makes grep instant.
For source code you can take the full text concept further and store
parse trees. This lets you instantly find the callers of a function,
or all users of a variable.
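The thread is about C code, but the flavour of query is easy to demonstrate with Python's own parser; this is only a sketch of the lookup a precomputed parse-tree database would answer instantly, not anything resembling a real indexer:

```python
import ast

def find_callers(source, target):
    """Return the line numbers of every call to `target` in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            f = node.func
            # plain calls carry a Name node, method calls an Attribute
            name = getattr(f, "id", getattr(f, "attr", None))
            if name == target:
                hits.append(node.lineno)
    return hits

src = """
def helper(x):
    return x + 1

def a():
    return helper(1)

def b():
    return helper(2) + helper(3)
"""
print(find_callers(src, "helper"))  # call sites on lines 6 and 9
```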
Once you have parse trees in the database you can offer refactoring
too. I have used a powerful proprietary system that used parse trees
to make complicated refactoring quite easy.
Note that a parse tree database doesn't have to be generated for all
the old revisions; it is mainly useful for the current HEAD. Same
for the full text index. When you generate this data you end up with
lots of tiny files that need to be kept in sync with the current HEAD.
git is good for holding those files.
You want all this analysis coordinated with git so that when you
commit a change the right parts get regenerated. Linux seems to be
missing good, automated refactoring assistance. Instead the kernel
janitorial work is being done manually.
An example I noticed today: use of pr_debug() is chaotic in the kernel
source. Many drivers use #if DEBUG to define their own versions of pr_debug().
Fixing everything to have consistent use of pr_debug is a refactoring
that could probably be mostly automated.
Mozilla is undergoing some massive automated rewrites.
http://blog.mozilla.com/tglek/category/decomtamination/
Full text indexing can also achieve high levels of compression as
stated in the earlier threads. It is full scale dictionary
compression. When it is being used for compression you want to apply
it to all revisions.
> A "classical" full-text index seems inappropriate because, if I've got
> a long text document whose blob in commit n1 uses word 'x' in one
> section, and the corresponding file in descendant commit n2 keeps the
> same text using word 'x' but changes a different section of the
> document, there's probably no point showing me both documents (and I
> can always track through the history once I've got one (commit-id,file
> pair)). So in the case of a single word, a "useful" definition would
> be that the entry for word w in the full-text index consists of those
> (commit-id,file) pairs whose diff with their parent contains an
> addition of text containing word w. (This will catch the creation of
> the text containing w, and then precisely those files which are close
> modifications of it, while ignoring changes to other areas.) This seems to
> make sense for a single word. Let's call this the "appearance diff"
> definition of a full text index.
You would full text index the expanded source text for each revision,
not the delta. There are forms of full-text index that record each
word's position in the document. These let you search for "vision NEAR
feedback".
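A positional index of the kind Jon describes is simple to sketch: store each word's token offsets per document, and NEAR becomes a pure index lookup. The function and document names below are invented for illustration:

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """docs: {doc_id: text} -> {word: {doc_id: [token positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][doc_id].append(pos)
    return index

def near(index, w1, w2, k=5):
    """Documents where w1 and w2 occur within k tokens of each other."""
    hits = []
    for doc_id in set(index[w1]) & set(index[w2]):
        if any(abs(p1 - p2) <= k
               for p1 in index[w1][doc_id]
               for p2 in index[w2][doc_id]):
            hits.append(doc_id)
    return hits

docs = {
    "p1": "human vision gives kinematic feedback to the controller",
    "p2": "vision systems were described early but the feedback"
          " section arrives much later in this text",
}
idx = build_positional_index(docs)
print(near(idx, "vision", "feedback", k=3))  # ['p1']
```

Both documents contain both words, but only in p1 are they close together, which is exactly the distinction a non-positional index cannot make.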
A good feature of this is finding when a variable or function was first added.
>
> However, things become unclear if you consider a query with multiple
> words. Consider the simplest case of a linear history where commit n0
> adds word "vision" to file p1.tex (with respect to its single parent),
> there are some intermediate commits and then commit n7 adds word
> "feedback" to p1.tex. Then there's no commit whose diff with its
> single parent contains both words "vision" and "feedback". In the
> linear history case you could imagine trying to find the first commit
> which is a child of _all_ the commits under the "appearance diff"
> definition. However, that clearly doesn't "obviously" extend to
> general full DAG histories, and in any case it's probably not fully
> correct even in the linear case. So maybe a different definition would
> be better. So I'm just throwing the question out to the list in case
> anyone has any better ideas for what a full-text index on an evolving
> set of files ought to be.
>
> (One question is "why do you want to build a table rather than
> actively search the full git repo for each query (i.e., combination of
> words) as you make it?" My primary motivation is that I might in the
> future like to run queries on some sort of low-processor-power
> UMPC-type device, having built the file containing a "full text index"
> data structure for the index on a quite beefy desktop. The other point
> is that when searching natural-language text from a fallible memory
> you're more likely to try different combinations of search terms
> iteratively to hit the right one, so there might be some point in
> building an index.)
I do admit that these indexes are used to make functions that can be
done with brute force faster. As computers get faster the need for
these decreases. Right now the size of the kernel repo is not growing
faster than the progress of hardware. If you went back and tried to do
these things on a 386 you'd be shouting for indexes tomorrow.
>
> Anyway, it's currently an idle speculation,
>
> --
> cheers, dave tweed__________________________
> david.tweed@gmail.com
> Rm 124, School of Systems Engineering, University of Reading.
> "we had no idea that when we added templates we were adding a Turing-
> complete compile-time language." -- C++ standardisation committee
>
--
Jon Smirl
jonsmirl@gmail.com
* Re: what's a useful definition of full text index on a repository?
From: David Tweed @ 2007-10-02 9:34 UTC (permalink / raw)
To: Jon Smirl; +Cc: Git Mailing List
On 10/1/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> This is what full text is used for with code:
> http://lxr.mozilla.org/
>
> It makes grep instant.
I'd thought that keeping a full-text index of all my program files was
my dirty little secret that shows I'm not a "pro" programmer ;-)
> For source code you can take the full text concept further and store
> parse trees.
[details snipped]
This sounds interesting in principle but is beyond what I'm thinking
of in practice (particularly since I'm not in the "C is the only language
worth ever using" camp).
> Full text indexing can also achieve high levels of compression as
> stated in the earlier threads. It is full scale dictionary
> compression. When it is being used for compression you want to apply
> it to all revisions.
Well, as I say, I'm not convinced it makes sense to integrate this
with the existing pack stuff, precisely because I don't think it's
universally useful. So you seem to end up with all the usual tricks,
e.g., Golomb coding of inverted indexes, etc., _if_ you treat each
blob as completely independent. I was wondering if there is anything
else you can do, given the special structure, that might be both more
useful and more compact?
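For reference, the Golomb-coding trick is small enough to sketch. The Rice special case below (modulus a power of two) encodes the d-gaps of a sorted posting list as a unary quotient plus a fixed-width remainder; a real index would choose the parameter from the gap statistics rather than hard-code it:

```python
def rice_encode(gaps, m_bits=3):
    """Encode positive gaps: unary quotient, then m_bits of remainder."""
    out = ""
    for g in gaps:
        q, r = (g - 1) >> m_bits, (g - 1) & ((1 << m_bits) - 1)
        out += "1" * q + "0" + format(r, f"0{m_bits}b")
    return out

def rice_decode(bits, m_bits=3):
    gaps, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":
            q, i = q + 1, i + 1
        i += 1                       # skip the unary terminator
        r = int(bits[i:i + m_bits], 2)
        i += m_bits
        gaps.append((q << m_bits) + r + 1)
    return gaps

postings = [7, 18, 19, 45]           # sorted posting ids, say
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
code = rice_encode(gaps)
assert rice_decode(code) == gaps
print(f"{len(code)} bits for gaps {gaps}")
```

Small gaps (dense words) cost few bits, which is why the coding works so well on inverted indexes; the open question in the text is whether the repository's history structure allows anything better still.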
> You would full text index the expanded source text for each revision,
> not the delta. There are forms of full-text index that record each
> word's position in the document. These let you search for "vision NEAR
> feedback".
Well, the kind of question I was thinking of was "clearly you can use
the existing sort of full-text indexing (e.g., the stuff covered in
Witten, Moffat & Bell's Managing Gigabytes), but is that the most
useful way of doing things in the context of an evolving database?" If
you treat every blob as essentially a different document there are
indexing tools out there already you can use. What I was wondering was
if it's really that useful to a human user to report every revision of
a document containing those keywords even if the differences are in
other parts of the text far removed from the text containing the
keywords. I don't know the answer.
> > (One question is "why do you want to build a table rather than
> > actively search the full git repo for each query (i.e., combination of
> > words) as you make it?" My primary motivation is that I might in the
> > future like to run queries on some sort of low-processor-power
> > UMPC-type device, having built the file containing a "full text index"
> > data structure for the index on a quite beefy desktop. The other point
> > is that when searching natural-language text from a fallible memory
> > you're more likely to try different combinations of search terms
> > iteratively to hit the right one, so there might be some point in
> > building an index.)
>
> I do admit that these indexes are used to make functions that can be
> done with brute force faster. As computers get faster the need for
> these decreases. Right now the size of the kernel repo is not growing
> faster than the progress of hardware. If you went back and tried to do
> these things on a 386 you'd be shouting for indexes tomorrow.
The other point is that direct searching is easier because you know
exactly what the query is at the point you have access to the full
text, whereas when building an index you have to extract no more and
no less information than is needed to answer all allowed queries. But
I still like the idea of getting a UMPC-type thing if they become
affordable.
--
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"we had no idea that when we added templates we were adding a Turing-
complete compile-time language." -- C++ standardisation committee
* Re: what's a useful definition of full text index on a repository?
From: Jon Smirl @ 2007-10-02 15:48 UTC (permalink / raw)
To: David Tweed; +Cc: Git Mailing List
On 10/2/07, David Tweed <david.tweed@gmail.com> wrote:
> > Full text indexing can also achieve high levels of compression as
> > stated in the earlier threads. It is full scale dictionary
> > compression. When it is being used for compression you want to apply
> > it to all revisions.
>
> Well, as I say, I'm not convinced it makes sense to integrate this
> with the existing pack stuff, precisely because I don't think it's
> universally useful. So you seem to end up with all the usual tricks,
> e.g., Golomb coding of inverted indexes, etc., _if_ you treat each
> blob as completely independent. I was wondering if there is anything
> else you can do, given the special structure, that might be both more
> useful and more compact?
Dictionary compression can be used without full-text indexes. It is
just really easy to build the full-text index if the data is already
dictionary compressed. Dictionary compression works for everything
except binary or random data.
Git is already using a small-scale dictionary compressor via zlib. I
suspect doing a full-scale dictionary for a pack file and then using
arithmetic encoding of the tokens would provide substantially more
compression. The big win is having a single dictionary instead of a
new dictionary each time zlib is used.
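A toy version of the single-dictionary idea makes the win concrete. This is an illustration of the concept only, not git's actual pack format:

```python
def build_dictionary(revisions):
    """One shared token dictionary across every revision."""
    vocab = {}
    for text in revisions:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def tokenize(text, vocab):
    return [vocab[tok] for tok in text.split()]

# A repeated license header across three revisions of a file.
license_text = "licensed under the MPL see the MPL for terms " * 50
revisions = [license_text + f"rev {i}" for i in range(3)]

vocab = build_dictionary(revisions)
streams = [tokenize(t, vocab) for t in revisions]
# Every copy of the repeated text becomes a run of small integers over
# one dictionary, which an entropy coder then squeezes down; per-blob
# zlib would rediscover essentially the same dictionary again and again.
print(len(vocab), "entries;", sum(len(s) for s in streams), "tokens")
```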
While we were working on Mozilla, it changed licenses three times.
The license text ended up taking about 30MB in the current scheme.
With full dictionary compression this would reduce down to a few kb.
More compression is good for git. It means we can keep more data in
RAM and reduce download times. With current hardware it is almost
always better to trade off CPU to reduce IO.
--
Jon Smirl
jonsmirl@gmail.com