* Trac+Git: rev-list with pathspec performance?
       [not found] <13399611.436896.1286218134223.JavaMail.root@mail.hq.genarts.com>
@ 2010-10-04 20:21 ` Stephen Bash
  2010-10-04 20:38   ` Jonathan Nieder
  2010-10-05  1:09   ` Jakub Narebski
  0 siblings, 2 replies; 6+ messages in thread
From: Stephen Bash @ 2010-10-04 20:21 UTC
To: Git Mailing List

Hi all-

I'm trying to improve the performance of Trac [1], the GitPlugin for Trac[2], and Git. Trac is being extremely sluggish while browsing source, and profiling revealed the majority of the time was the GitPlugin calling git rev-list. When I directly entered the rev-list calls from the shell, I found Git itself was performing slower than I would expect...

The bottleneck is while Trac is populating the "last change to file" column in the source browser (see the "rev" column of [3] for an *cough* SVN *cough* example). This concept of "find the last change to a file" was discussed a few weeks ago [4], but unlike that thread, the GitPlugin is simply calling

  git rev-list --max-count=1 branchName -- fileName

for each file in the current directory. For files modified recently this is very fast (thousandths of a second), but for older files rev-list takes a long time to come up with an answer (~2-3 seconds on our server).

I created a script [5] that reproduces the rev-list behavior with 10k commits (ours is about 17k) and 500 files (we peaked at just under 600 in the root of our repo -- that's been cleaned up in the current version, but the history is still there). On our system the test script fast case is:

  real 0m0.003s, user 0m0.000s, sys 0m0.010s

The slow case is

  real 0m1.072s, user 0m1.050s, sys 0m0.000s

If I naively profile Git I find the worst time offender is tree_entry_interesting with over 10 million calls in the slow case. That seems high (even every commit, every file would be 500*10000=5 million), but I don't know anything about the actual search algorithm.

Is there anything obvious I can do about this performance bottleneck or is it just the nature of our repository? Is there potentially a bug in how rev-list works with a pathspec? Is there a more efficient way to obtain the last commit that changed each file in a directory? (A hack I'm currently testing is just always return the current commit when Trac asks for the last change... that speeds things up but changes the user experience)

Thanks,
Stephen

References:
[1] http://trac.edgewall.org
[2] http://trac-hacks.org/wiki/GitPlugin
[3] http://trac.edgewall.org/browser/trunk
[4] http://article.gmane.org/gmane.comp.version-control.git/150183/
[5]
#!/bin/bash

git init big-repo
cd big-repo
touch foo
touch bar
for ii in {1..500}
do
    # create some files for background noise
    touch $ii
done
git add .
git commit -qm "initial import"

for ii in {1..10000}
do
    echo "Creating commit $ii"
    echo $ii >> foo
    git add foo
    git commit -qm "simple change $ii"
    if [ $(( $ii % 250 )) == 0 ]
    then
        echo "Running git gc ($ii)"
        git gc --quiet
    fi
done

# Fast case (last change is close to HEAD)
echo "git rev-list --max-count=1 HEAD -- foo ..."
time git rev-list --max-count=1 HEAD -- foo

# Slow case (last change is long before HEAD)
echo "git rev-list --max-count=1 HEAD -- bar ..."
time git rev-list --max-count=1 HEAD -- bar

cd ..
#rm -rf big-repo
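The per-file lookup described in this message walks the history once per file. One alternative is to walk it a single time and record the first sighting of each path. The following is only a minimal sketch, assuming a mostly linear history; the function name last_change_per_file is hypothetical and is not part of Git, Trac, or the GitPlugin. It is written as a filter over `git log` output so that it can be tried on canned input.

```shell
#!/bin/bash
# Sketch (assumption, not the GitPlugin's code): resolve the last commit
# touching each file in ONE history walk instead of one rev-list per file.
#
# Reads on stdin the output of something like:
#   git log --format='commit %H' --name-only <branch> -- <dir>
# and prints "<sha> <path>" the first time each path is seen. git log lists
# newer commits first by default, so the first sighting of a path is its most
# recent change. Merge commits list no file names by default, so changes made
# only in a merge are not seen by this sketch.
last_change_per_file() {
    awk '
        /^commit [0-9a-f]+$/ { sha = $2; next }   # remember the current commit
        NF == 0              { next }             # skip blank separator lines
        !($0 in seen)        { seen[$0] = 1; print sha, $0 }
    '
}
```

A possible invocation would be `git log --format='commit %H' --name-only HEAD -- . | last_change_per_file`. In the worst case from the test script (a file untouched since the initial import) the walk still has to reach the first commit, but it does so once for the whole directory rather than once per file.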
* Re: Trac+Git: rev-list with pathspec performance?
  2010-10-04 20:21 ` Trac+Git: rev-list with pathspec performance? Stephen Bash
@ 2010-10-04 20:38   ` Jonathan Nieder
  2010-10-05  1:09   ` Jakub Narebski
  1 sibling, 0 replies; 6+ messages in thread
From: Jonathan Nieder @ 2010-10-04 20:38 UTC
To: Stephen Bash; +Cc: Git Mailing List

Stephen Bash wrote:

> Is there a more efficient way to obtain the last commit that changed
> each file in a directory?

Yes. http://yhbt.net/git-set-file-times
* Re: Trac+Git: rev-list with pathspec performance?
  2010-10-04 20:21 ` Trac+Git: rev-list with pathspec performance? Stephen Bash
  2010-10-04 20:38   ` Jonathan Nieder
@ 2010-10-05  1:09   ` Jakub Narebski
  2010-10-06 15:26     ` Stephen Bash
  1 sibling, 1 reply; 6+ messages in thread
From: Jakub Narebski @ 2010-10-05 1:09 UTC
To: Stephen Bash; +Cc: Git Mailing List

Stephen Bash <bash@genarts.com> writes:

> I'm trying to improve the performance of Trac [1], the GitPlugin for
> Trac[2], and Git. Trac is being extremely sluggish while browsing
> source, and profiling revealed the majority of the time was the
> GitPlugin calling git rev-list. When I directly entered the
> rev-list calls from the shell, I found Git itself was performing
> slower than I would expect...
>
> The bottleneck is while Trac is populating the "last change to file"
> column in the source browser (see the "rev" column of [3] for an
> *cough* SVN *cough* example). This concept of "find the last change
> to a file" was discussed a few weeks ago [4], but unlike that
> thread, the GitPlugin is simply calling git rev-list --max-count=1
> branchName -- fileName for each file in the current directory. For
> files modified recently this is very fast (thousandths of a second),
> but for older files rev-list takes a long time to come up with an
> answer (~2-3 seconds on our server).

[...]

> References:
> [1] http://trac.edgewall.org
> [2] http://trac-hacks.org/wiki/GitPlugin
> [3] http://trac.edgewall.org/browser/trunk
> [4] http://article.gmane.org/gmane.comp.version-control.git/150183/

Note that later[5] in mentioned thread[4] there is proof of concept "tree blame" (in Perl) which generates such 'last change to file' information, I think faster than running 'git rev-list -1 <file>' for each file. Even better would be to encode used algorithm in C.

[5] http://thread.gmane.org/gmane.comp.version-control.git/150063/focus=150183

P.S. Alternate solution would be to simply get rid of SVN-inspired view. Git tracks history of a *project* as a whole, not set of histories for individual files (like CVS).

--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: Trac+Git: rev-list with pathspec performance?
  2010-10-05  1:09   ` Jakub Narebski
@ 2010-10-06 15:26     ` Stephen Bash
  2010-10-07 17:49       ` Stephen Bash
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen Bash @ 2010-10-06 15:26 UTC
To: Jakub Narebski; +Cc: Git Mailing List

> Note that there is proof of concept
> "tree blame" (in Perl) which generates such 'last change to file'
> information, I think faster than running 'git rev-list -1 <file>' for
> each file. Even better would be to encode used algorithm in C.
>
> http://thread.gmane.org/gmane.comp.version-control.git/150063/focus=150183

My early experiments with your script are good for speed, but for some reason I'm always getting the first commit for a file rather than the most recent. I'll do some experimenting to see if I can uncover the issue.

(I had seen the script earlier, but I didn't realize the fix for diff-tree had made it into a release already)

Thanks,
Stephen
* Re: Trac+Git: rev-list with pathspec performance?
  2010-10-06 15:26     ` Stephen Bash
@ 2010-10-07 17:49       ` Stephen Bash
  2010-10-07 20:33         ` Jakub Narebski
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen Bash @ 2010-10-07 17:49 UTC
To: Jakub Narebski; +Cc: Git Mailing List

> > Note that there is proof of concept
> > "tree blame" (in Perl) which generates such 'last change to file'
> > information, I think faster than running 'git rev-list -1 <file>'
> > for each file. Even better would be to encode used algorithm in C.
> >
> > http://thread.gmane.org/gmane.comp.version-control.git/150063/focus=150183
>
> My early experiments with your script are good for speed, but for some
> reason I'm always getting the first commit for a file rather than the
> most recent. I'll do some experimenting to see if I can uncover the
> issue.

Following up, I had to add -r to the diff-tree command line when requesting a subdirectory to work around the problem (script always returned the first commit).

I'm curious if it's faster to get the SHA of the sub-tree and compare that before actually running diff-tree? And for that matter, just run diff-tree on the sub-tree that we care about rather than a recursive sub-tree on the root? These may be early optimizations, but they're ideas that occurred to me while debugging the code...

> > P.S. Alternate solution would be to simply get rid of SVN-inspired
> > view. Git tracks history of a *project* as a whole, not set of
> > histories for individual files (like CVS).

After a lot of experimentation, this is basically what we did. I modified the Trac templates to not list the last change SHA or log message in the directory view. After all my testing, I just don't think there's a fast way to get this information from Git. This blame-dir script is the fastest alternative I've tried (about 5x faster than rev-list'ing each file), but it's still ~30 seconds on my machine (which is faster than our web server), and IMHO that's too long to ask a user to wait for a page to load.

Thanks,
Stephen
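The sub-tree comparison idea raised in this message can be illustrated as a filter. The following is only a sketch under stated assumptions: the history is linear, and the input has already been prepared as one "<commit> <subtree-id>" line per commit, newest first (for a single commit the id can be obtained with `git rev-parse <commit>:<dir>`). The function name subtree_changes is hypothetical and does not come from the blame-dir script.

```shell
#!/bin/bash
# Sketch (assumption, not part of the blame-dir script): find the commits at
# which a directory actually changed by comparing its tree id between
# neighbouring commits, so that diff-tree only needs to run on those commits.
#
# Input  (stdin): "<commit> <subtree-id>" per line, newest commit first.
# Output: each commit whose subtree id differs from the next-older line,
#         plus the oldest listed commit (nothing older to compare against).
subtree_changes() {
    awk '
        # The previous (newer) line changed the subtree relative to this one.
        NR > 1 && $2 != prev_id { print prev_commit }
        { prev_commit = $1; prev_id = $2 }
        END { if (NR > 0) print prev_commit }
    '
}
```

Commits that leave the directory untouched carry an identical tree id and are skipped without looking inside the tree. Whether this is faster in practice depends on how cheaply the id list can be produced, which this sketch does not address.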
* Re: Trac+Git: rev-list with pathspec performance?
  2010-10-07 17:49       ` Stephen Bash
@ 2010-10-07 20:33         ` Jakub Narebski
  0 siblings, 0 replies; 6+ messages in thread
From: Jakub Narebski @ 2010-10-07 20:33 UTC
To: Stephen Bash; +Cc: Git Mailing List

On Thu, 7 Oct 2010, Stephen Bash wrote:

>>> Note that there is proof of concept
>>> "tree blame" (in Perl) which generates such 'last change to file'
>>> information, I think faster than running 'git rev-list -1 <file>'
>>> for each file. Even better would be to encode used algorithm in C.
>>>
>>> http://thread.gmane.org/gmane.comp.version-control.git/150063/focus=150183
>>
>> My early experiments with your script are good for speed, but for some
>> reason I'm always getting the first commit for a file rather than the
>> most recent. I'll do some experimenting to see if I can uncover the
>> issue.
>
> Following up, I had to add -r to the diff-tree command line when
> requesting a subdirectory to work around the problem (script always
> returned the first commit).

Hmmm... I thought that I had added '-r' if a path is provided, i.e. when we don't run tree blame on the root commit.

> I'm curious if it's faster to get the SHA of the sub-tree and compare
> that before actually running diff-tree? And for that matter, just run
> diff-tree on the sub-tree that we care about rather than a recursive
> sub-tree on the root? These may be early optimizations, but they're
> ideas that occurred to me while debugging the code...

There are many possible optimizations (see also below); for the time being I was concerned with getting the fast tree blame algorithm right (and, as you can see, didn't get it, not completely).

>>> P.S. Alternate solution would be to simply get rid of SVN-inspired
>>> view. Git tracks history of a *project* as a whole, not set of
>>> histories for individual files (like CVS).
>
> After a lot of experimentation, this is basically what we did.
> I modified the Trac templates to not list the last change SHA or log
> message in the directory view. After all my testing, I just don't
> think there's a fast way to get this information from Git. This
> blame-dir script is the fastest alternative I've tried (about 5x
> faster than rev-list'ing each file), but it's still ~30 seconds on my
> machine (which is faster than our web server), and IMHO that's too
> long to ask a user to wait for a page to load.

First, there is a lot of room for optimization in the tree blame script; some of it I have noted in comments, some you have found. While developing this script I noticed that the current plumbing doesn't completely fit the tree blame algorithm; for example, we need '-r' to blame a subtree (subdirectory), while we need paths only up to the depth of the blamed directory, no more.

Rewriting tree-blame in C, using in-core revision and tree traversal, should be faster, though I'm not sure by how much. Unfortunately I don't know enough of the git API; I thought that writing a Perl script would be easier.

But you are right that such a view will always be expensive in Git, because Git tracks the history of the project *as a whole*. If a file was created in the root commit (first commit) and left unchanged, it would be easy to find in a VCS that stores history on a per-file basis, at least to some extent; in git you have to go through commits all the way to the root commit in this case. If the history is long, it might take some time.

Second, you can use the trick that the GitHub web interface uses to display a similar view, namely first displaying just a tree of files, and then incrementally filling in the 'last changed' info. Gitweb does something similar in the 'blame_incremental' view; that is why the idea was for tree blame ("git blame <directory>") to support an incremental format, similar to ordinary blame. This might take some effort to develop, though...

--
Jakub Narebski
Poland
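The incremental filling described in this message can be approximated with existing plumbing by emitting each answer as soon as it is known and stopping once every listed file has one. The following is a rough sketch only, not how GitHub or gitweb implement it; the function name resolve_incrementally is hypothetical, and it assumes a stream in the shape produced by `git log --format='commit %H' --name-only` with newer commits first.

```shell
#!/bin/bash
# Sketch (assumption): print "<sha> <path>" for each requested path as soon as
# its most recent change is seen, flushing after every answer so a front end
# can fill in rows one by one, and stop reading once all paths are resolved.
#
# Usage: git log --format='commit %H' --name-only HEAD | resolve_incrementally foo bar
resolve_incrementally() {
    awk '
        BEGIN {
            # Treat the arguments as wanted paths, not as input files.
            for (i = 1; i < ARGC; i++)
                if (!(ARGV[i] in want)) { want[ARGV[i]] = 1; left++ }
            ARGC = 1
        }
        /^commit [0-9a-f]+$/ { sha = $2; next }   # remember the current commit
        ($0 in want) {
            print sha, $0
            fflush()                              # hand the row over right away
            delete want[$0]
            if (--left == 0) exit                 # nothing left to look for
        }
    ' "$@"
}
```

Recently changed files are reported almost immediately, while a file untouched since the root commit still forces the walk to the end of history; the benefit is that the page does not have to wait for that last answer before showing the others.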