From: "Shawn O. Pearce" <spearce@spearce.org>
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org
Subject: Re: More precise tag following
Date: Sat, 27 Jan 2007 03:01:26 -0500 [thread overview]
Message-ID: <20070127080126.GC9966@spearce.org> (raw)
In-Reply-To: <7vy7nqxd08.fsf@assigned-by-dhcp.cox.net>
Junio C Hamano <junkio@cox.net> wrote:
> What if (I know, this discussion does not belong here until
> 1.5.0 final) we had a "reverse" database that keeps track of
> what tag references which object, and "git rev-list" knows how
> to exploit it?
It'd be useful. In as many ways as you suggest. But it also
would be downright difficult to transfer between repositories,
as you have already pointed out in a public forum to yourself. :-)
> - If a single-path following turns out to be too expensive
> (there was a longstanding talk about "git log --single-follow
> $path"; "git blame" also follows a single path although the
> target it follows can fork into two or more when following
> cut&pastes) because we need to explode multi-level trees for
> each commit while traversing commit ancestry, we could define
> an annotation to a commit that lists the set of paths the
> commit touches relative to each of its parents (so the object
> contains N lists of paths), so that pathspec limiting needs
> to open and read only one object to figure out that the trees
> do not have to be opened to skip the commit and/or a merge
> can be simplified.
_THIS_ is worth doing. I've been having a lot of discussion on
#git with Simon 'corecode' Schubert and Chris Lee about how poorly
git-blame performs compared to its counterpart in Subversion.
Basically Subversion is storing file-level revision ancestry data
within its commit data, allowing it to quickly skip back through
the commits which modified that file.
Git doesn't have this information and must instead traverse the
entire DAG, at least until the file was initially added to the tree.
With 440,000+ revisions and a file which exists in nearly all
of them, getting anything from "git-blame" or "git-log -- foo.c"
takes ages.
Based on some (limited) profiling with Shark it seems we spend about
50% of our CPU time doing zlib decompression of objects and almost
another 14% parsing the tree objects to apply the path limiter.
The only way to speed up blame/log operations is to reduce the number
of decompressions we need to do to the bare minium, and maybe also
reduce the tree parsing overheads. Do that and we can maybe drop
the running time to 1/4th the current time.
One idea Simon and I were talking about was to store a reverse
file/tree-level DAG in the header of each tree/blob object in the
pack file. This additional header would be something like a list
of triplets:
(descendant-commit, ancestor-commit, ancestor-object)
where:
descendant-commit: the "current" commit being looked at.
ancestor-commit: the commit which descendant-commit
derives from (directly or indirectly)
ancestor-object: prior version (descendant-commit^:path)
This triplet would probably be encoded with descendant-commit using
OBJ_REF, ancestor-commit being an OBJ_OFS style back reference within
the pack (or OBJ_REF if not in this pack) and ancestor-object would
also be an OBJ_REF. So a triplet probably would wind up costing
~60 bytes.
Triplets would only be stored if ancestor-object != this-object, so
basically only for the commits which changed the path the tree/blob
is occupying in descendant-commit.
Finding the prior revision of a tree or file would be a matter of
finding the triplet which matches the current commit, then jumping
through to the ancestor-* values. If no triplet matches the current
commit then we peel back the parents of the current commit and try
again with those. Worst case we do what we do now, which is walk
the DAG. ;-)
This of course penalizes objects which don't ever change, as we'd
have to walk back a good chunk of the DAG before we find a matching
triplet. But I would suspect that files which never change are
also not given to log/blame very often either. And once we do find
a triplet, we can skip through the DAG in time proportional to the
rate of change for the path, rather than to the entire repository.
Thoughts?
--
Shawn.
next prev parent reply other threads:[~2007-01-27 8:01 UTC|newest]
Thread overview: 92+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-01-26 11:07 More precise tag following Junio C Hamano
2007-01-26 11:53 ` Junio C Hamano
2007-01-27 8:01 ` Shawn O. Pearce [this message]
2007-01-27 8:41 ` Junio C Hamano
2007-01-27 13:33 ` Jeff King
2007-01-27 17:47 ` Nicolas Pitre
2007-01-27 9:04 ` Simon 'corecode' Schubert
2007-01-27 12:58 ` Johannes Schindelin
2007-01-27 13:50 ` Simon 'corecode' Schubert
2007-01-27 16:30 ` Jakub Narebski
2007-01-27 17:36 ` Linus Torvalds
2007-01-27 16:46 ` Johannes Schindelin
2007-01-27 17:12 ` Simon 'corecode' Schubert
2007-01-27 19:13 ` Johannes Schindelin
2007-01-27 19:55 ` Simon 'corecode' Schubert
2007-01-27 19:41 ` Nicolas Pitre
2007-01-27 17:22 ` Linus Torvalds
2007-01-27 17:56 ` Linus Torvalds
2007-01-27 22:00 ` Junio C Hamano
2007-01-27 22:54 ` Linus Torvalds
2007-01-28 9:27 ` Junio C Hamano
2007-01-28 9:44 ` [PATCH] git-blame --porcelain: quote filename in c-style when needed Junio C Hamano
2007-01-28 14:25 ` [PATCH] git-blame --incremental: don't use pager René Scharfe
2007-01-28 19:09 ` Junio C Hamano
2007-01-28 19:14 ` Junio C Hamano
2007-01-29 0:32 ` René Scharfe
2007-01-29 2:35 ` [PATCH] git blame --progress Junio C Hamano
2007-01-29 7:00 ` Simon 'corecode' Schubert
2007-01-29 16:54 ` Alex Riesen
2007-01-29 18:12 ` Matthias Lederhofer
2007-01-29 19:06 ` Junio C Hamano
2007-01-29 19:59 ` René Scharfe
2007-01-29 20:24 ` Linus Torvalds
2007-01-30 1:53 ` Junio C Hamano
2007-01-28 19:08 ` More precise tag following Linus Torvalds
2007-01-28 19:18 ` Junio C Hamano
2007-01-28 19:57 ` Linus Torvalds
2007-01-28 20:01 ` Junio C Hamano
2007-01-28 20:20 ` [PATCH] document 'blame --incremental' Junio C Hamano
2007-01-28 21:06 ` More precise tag following Junio C Hamano
2007-01-28 23:01 ` Jeff King
2007-01-30 9:22 ` Junio C Hamano
2007-01-30 15:31 ` Shawn O. Pearce
2007-01-30 17:02 ` Linus Torvalds
2007-01-28 19:58 ` Junio C Hamano
2007-01-29 6:18 ` Shawn O. Pearce
2007-01-29 10:17 ` Junio C Hamano
2007-01-29 10:31 ` Shawn O. Pearce
2007-01-29 16:24 ` Linus Torvalds
2007-01-29 18:07 ` Simon 'corecode' Schubert
2007-01-29 19:29 ` Theodore Tso
2007-01-29 19:45 ` Linus Torvalds
2007-01-29 20:25 ` Jakub Narebski
2007-01-29 20:47 ` Shawn O. Pearce
2007-01-29 21:02 ` Jakub Narebski
2007-02-09 7:41 ` Shawn O. Pearce
2007-01-31 8:39 ` David Kågedal
2007-01-31 10:59 ` David Kågedal
2007-01-31 16:12 ` Peter Eriksen
2007-01-31 17:04 ` David Kågedal
2007-01-31 17:12 ` Peter Eriksen
2007-01-31 17:35 ` Jakub Narebski
2007-01-31 20:59 ` David Kågedal
2007-01-27 18:40 ` Simon 'corecode' Schubert
2007-01-27 19:02 ` Johannes Schindelin
2007-01-27 19:12 ` Simon 'corecode' Schubert
2007-01-27 19:19 ` Johannes Schindelin
2007-01-27 19:59 ` Jakub Narebski
2007-01-27 19:15 ` Linus Torvalds
2007-01-27 19:25 ` Linus Torvalds
2007-01-27 19:54 ` Jakub Narebski
2007-01-27 20:13 ` Linus Torvalds
2007-01-27 19:36 ` Chris Lee
2007-01-28 18:10 ` Theodore Tso
2007-01-28 18:27 ` Linus Torvalds
2007-01-28 22:26 ` David Lang
2007-01-29 17:34 ` Nicolas Pitre
2007-01-29 17:42 ` Linus Torvalds
2007-01-29 17:58 ` Nicolas Pitre
2007-01-29 19:16 ` Chris Lee
2007-01-29 23:00 ` Eric Wong
2007-01-30 0:42 ` Eric Wong
2007-01-30 0:48 ` Eric Wong
2007-01-30 8:51 ` Eric Wong
2007-01-27 18:52 ` Jakub Narebski
2007-01-27 20:16 ` Jeff King
2007-01-27 22:39 ` Linus Torvalds
2007-01-27 23:52 ` Jeff King
2007-01-28 2:39 ` Theodore Tso
2007-01-28 3:17 ` Randal L. Schwartz
2007-01-28 13:15 ` Jeff King
2007-01-28 7:40 ` Shawn O. Pearce
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070127080126.GC9966@spearce.org \
--to=spearce@spearce.org \
--cc=git@vger.kernel.org \
--cc=junkio@cox.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).