A tracking tree for the active work space

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* A tracking tree for the active work space
@ 2007-03-11 14:06 Jon Smirl
  2007-03-11 20:15 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jon Smirl @ 2007-03-11 14:06 UTC (permalink / raw)
  To: Shawn O. Pearce, Git Mailing List

Reading the other thread on tracking temporary changes made me think
of using inotify with git. The basic idea would be to a daemon running
that uses inotify to listen for changes in the working tree. As these
changes happen they get committed to a tracking tree.

The tracking tree serves two purposes. First it is a good way to
recover from programmer error. I have definitely written big chunks of
code, discarded them, and then realized later that they were the right
solution and had to write them again.

The tracking tree also makes a 'git grep' for uncommitted changes
easier to implement since the changes are always committed with this
model. For dual processors the inverted index can be computed in
parallel with editing.

Of course this daemon needs some smarts. You don't want it generating
a delta the hard way from a git check out of a different workspace It
will also make real check-ins instant since you  just copy the tip of
the tacking tree.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A tracking tree for the active work space
  2007-03-11 14:06 A tracking tree for the active work space Jon Smirl
@ 2007-03-11 20:15 ` Junio C Hamano
  2007-03-11 20:35   ` Jon Smirl
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2007-03-11 20:15 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn O. Pearce, Git Mailing List

"Jon Smirl" <jonsmirl@gmail.com> writes:

> Reading the other thread on tracking temporary changes made me think
> of using inotify with git. The basic idea would be to a daemon running
> that uses inotify to listen for changes in the working tree. As these
> changes happen they get committed to a tracking tree.

I think it is an interesting idea, but can be used with any SCM
not just git ;-).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A tracking tree for the active work space
  2007-03-11 20:15 ` Junio C Hamano
@ 2007-03-11 20:35   ` Jon Smirl
  2007-03-11 21:31     ` Johannes Schindelin
  2007-03-11 23:18     ` Johannes Schindelin
  0 siblings, 2 replies; 7+ messages in thread
From: Jon Smirl @ 2007-03-11 20:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, Git Mailing List

On 3/11/07, Junio C Hamano <junkio@cox.net> wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> > Reading the other thread on tracking temporary changes made me think
> > of using inotify with git. The basic idea would be to a daemon running
> > that uses inotify to listen for changes in the working tree. As these
> > changes happen they get committed to a tracking tree.
>
> I think it is an interesting idea, but can be used with any SCM
> not just git ;-).

As for the part about 'git grep'  Shawn and I have been talking off
and on about experimenting with an inverted index for a packfile
format. The basic idea is that you tokenize the input and turn a
source file into a list of tokens. You diff with the list of tokens
like you would normally do with text. There is a universal dictionary
for tokens, a token's id is it's position in the dictionary.

Tokenized text is one of the most compact compression schemes known.
It can get even more compact by tokenizing common phrases and using
variable length token ids. Compression schemes like this are used in
web search engines. Of course you keep a check in place for input that
doesn't tokenize (binary) and fallback to gzip.

To build 'git grep' you make a bitmap index for each token in the
dictionary and put a one in it if the file has the token. Gzip these
indexes and then there are algorithms for doing and/or operations on
the zipped indexes without expanding them. grep is almost instant over
gigabytes of text if indexes like this are available.

Keeping everything up to date on a dual core system is pretty much
free since that second core is rarely doing anything while you are
editing.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A tracking tree for the active work space
  2007-03-11 20:35   ` Jon Smirl
@ 2007-03-11 21:31     ` Johannes Schindelin
  2007-03-11 21:40       ` Jeff King
  2007-03-12  1:39       ` Jon Smirl
  2007-03-11 23:18     ` Johannes Schindelin
  1 sibling, 2 replies; 7+ messages in thread
From: Johannes Schindelin @ 2007-03-11 21:31 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Junio C Hamano, Shawn O. Pearce, Git Mailing List

Hi,

On Sun, 11 Mar 2007, Jon Smirl wrote:

> As for the part about 'git grep' Shawn and I have been talking off and 
> on about experimenting with an inverted index for a packfile format. The 
> basic idea is that you tokenize the input and turn a source file into a 
> list of tokens. You diff with the list of tokens like you would normally 
> do with text. There is a universal dictionary for tokens, a token's id 
> is it's position in the dictionary.

All in all, this is an interesting idea.

However, I see some problems I'd like to know solutions for:

- how to determine the optimal length of the tokens? (It is easy if you 
  tokenize on the word level, but you suggested that it is more efficient 
  to have longer phrases.)

- the search terms can be _part of_ the tokens. In fact, a search term can 
  be the postfix of one token, then a list of other tokens, and then a 
  prefix of yet another token. It might not be really cheap to construct 
  _all_ possible combinations of tokens which make up the search term...

- how do you want to cope with regular expressions? (The previous problem 
  only addresses simple, constant search terms, i.e. no true regular 
  expressions.)

- at the moment, most objects which are contained by a pack are relatively 
  cheaply transported via the pack protocol. IIUC your new pack format 
  would need _exactly_ the same dictionary to be transmitted as-is. IOW
  how do you want to make on-the-fly pack generation cheap again?

Don't get me wrong: I don't want to discourage you, but it is too easy to 
optimize for the wrong use cases (I expact a repack, or a fetch, to happen 
much more often than a grep). If you can address above-mentioned issues, 
I see no reason why the new pack format should not be used instead of the 
current one.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A tracking tree for the active work space
  2007-03-11 21:31     ` Johannes Schindelin
@ 2007-03-11 21:40       ` Jeff King
  2007-03-12  1:39       ` Jon Smirl
  1 sibling, 0 replies; 7+ messages in thread
From: Jeff King @ 2007-03-11 21:40 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Shawn O. Pearce, Git Mailing List

On Sun, Mar 11, 2007 at 10:31:58PM +0100, Johannes Schindelin wrote:

> - how do you want to cope with regular expressions? (The previous problem 
>   only addresses simple, constant search terms, i.e. no true regular 
>   expressions.)

I expect you could extract any obvious tokens from the regex, and then
run the regex only over the files which contain those tokens. Obviously
your worst case performance will be the same as the original grep (plus
the overhead of looking up the tokens), but in practice, I expect you
could end up searching through only a fraction of the files (depending
on your regexp and how diverse the data set is).

Of course, I have never had a complaint about the speed of git-grep, so
maybe it's not all that compelling. :)

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A tracking tree for the active work space
  2007-03-11 20:35   ` Jon Smirl
  2007-03-11 21:31     ` Johannes Schindelin
@ 2007-03-11 23:18     ` Johannes Schindelin
  1 sibling, 0 replies; 7+ messages in thread
From: Johannes Schindelin @ 2007-03-11 23:18 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Junio C Hamano, Shawn O. Pearce, Git Mailing List

Hi,

On Sun, 11 Mar 2007, Jon Smirl wrote:

> On 3/11/07, Junio C Hamano <junkio@cox.net> wrote:
> > "Jon Smirl" <jonsmirl@gmail.com> writes:
> > 
> > > Reading the other thread on tracking temporary changes made me think
> > > of using inotify with git. The basic idea would be to a daemon running
> > > that uses inotify to listen for changes in the working tree. As these
> > > changes happen they get committed to a tracking tree.
> > 
> > I think it is an interesting idea, but can be used with any SCM
> > not just git ;-).

I just stumbled over this:

	http://arcs.unixtreaty.com/

which might or might not do what you want, judging by the description.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: A tracking tree for the active work space
  2007-03-11 21:31     ` Johannes Schindelin
  2007-03-11 21:40       ` Jeff King
@ 2007-03-12  1:39       ` Jon Smirl
  1 sibling, 0 replies; 7+ messages in thread
From: Jon Smirl @ 2007-03-12  1:39 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Shawn O. Pearce, Git Mailing List

On 3/11/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Sun, 11 Mar 2007, Jon Smirl wrote:
>
> > As for the part about 'git grep' Shawn and I have been talking off and
> > on about experimenting with an inverted index for a packfile format. The
> > basic idea is that you tokenize the input and turn a source file into a
> > list of tokens. You diff with the list of tokens like you would normally
> > do with text. There is a universal dictionary for tokens, a token's id
> > is it's position in the dictionary.
>
> All in all, this is an interesting idea.
>
> However, I see some problems I'd like to know solutions for:
>
> - how to determine the optimal length of the tokens? (It is easy if you
>   tokenize on the word level, but you suggested that it is more efficient
>   to have longer phrases.)
>
> - the search terms can be _part of_ the tokens. In fact, a search term can
>   be the postfix of one token, then a list of other tokens, and then a
>   prefix of yet another token. It might not be really cheap to construct
>   _all_ possible combinations of tokens which make up the search term...
>
> - how do you want to cope with regular expressions? (The previous problem
>   only addresses simple, constant search terms, i.e. no true regular
>   expressions.)

I just described a simple scheme. This is an area where a lot of
research has been done and the previous three problems have been
addressed by many papers. For example there are more complicated index
forms which support queries like find hat within ten words of cat.
Another feature is word stemming which can be used to do regular
expression searching. Let's get a basic version working first and then
go for the fancier ones.

> - at the moment, most objects which are contained by a pack are relatively
>   cheaply transported via the pack protocol. IIUC your new pack format
>   would need _exactly_ the same dictionary to be transmitted as-is. IOW
>   how do you want to make on-the-fly pack generation cheap again?

You can either copy down an entire pack including the dictionaries; or
expand the tokenized text back into normal text, gzip it and send it
down the wire. Dictionaries are only a few MB so everything fits into
RAM.

> Don't get me wrong: I don't want to discourage you, but it is too easy to
> optimize for the wrong use cases (I expact a repack, or a fetch, to happen
> much more often than a grep). If you can address above-mentioned issues,

When I am working on other people's code in the kernel I do greps all
of the time. When working on your own code you don't need it as much
since you know where everything is.

Support like this could also be the basis for future IDE integration.
The indexes would allow an IDE to quickly do refactoring operations
like renames. Mozilla is doing some major refactoring currently and
they are writing custom programs to do it since they lack efficient
tools.

The indexes don't have to be simple word lists. I've used systems
where the index is closer to being a parse tree of the C code. In
those systems refactoring is trivial to do, it's more like doing
operations on a database of code than using a normal text editor. You
can also easily query things like all users of a variable, the
definition, etc. Operations like these aren't a problem on small
projects but trying doing them when there are 20M lines of source.

Also note that these are just indexes and pack formats. I can chose to
convert my git db to this form and you don't have to follow my choice.

> I see no reason why the new pack format should not be used instead of the
> current one.
>
> Ciao,
> Dscho
>
>
>

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2007-03-12  1:39 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-11 14:06 A tracking tree for the active work space Jon Smirl
2007-03-11 20:15 ` Junio C Hamano
2007-03-11 20:35   ` Jon Smirl
2007-03-11 21:31     ` Johannes Schindelin
2007-03-11 21:40       ` Jeff King
2007-03-12  1:39       ` Jon Smirl
2007-03-11 23:18     ` Johannes Schindelin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).