From: Linus Torvalds <torvalds@linux-foundation.org>
To: Roman Zippel <zippel@linux-m68k.org>
Cc: Tim Harper <timcharper@gmail.com>, git@vger.kernel.org
Subject: Re: Bizarre missing changes (git bug?)
Date: Tue, 29 Jul 2008 18:49:41 -0700 (PDT)
Message-ID: <alpine.LFD.1.10.0807291822590.3334@nehalem.linux-foundation.org>
In-Reply-To: <Pine.LNX.4.64.0807300223010.6791@localhost.localdomain>
On Wed, 30 Jul 2008, Roman Zippel wrote:
> >
> > time sh -c "git log <filename> | head"
> >
> > nothing else matters. If you can make that one be fast, I'm happy.
>
> I already explained it, but you simply dismissed it. It's possible, but it
> requires a bit of cached information (e.g. as part of the pack file, which
> is needed for decent performance anyway).
Bzzt. Wrong. Try again.
> > In fact, you can see what I'm talking about by trying --topo-order in the
> > above timing test.
>
> Please give me full example.
> gitk --topo-order kernel/printk.c shows no difference (e.g. it doesn't
> show 02630a12c7f72fa294981c8d86e38038781c25b7), several experiments with
> git-rev-list show no improvement either.
Roman, what the f*ck is wrong with you? Let me repeat that thing one more
time:

    you can see what I'm talking about by trying --topo-order in the
    above timing test.
          ^^^^^^^^^^^

The fact is, --topo-order is a post-processing thing, exactly the way your
half-way simplification would be. It requires _all_ commits, and it
requires them because we cannot guarantee that we output all children
before the parents when there are multiple threads without a central clock
(ie any distributed environment).
So for --topo-order, we generate the whole history, and then we sort it.
As a result, it has horrible interactivity behavior. Try it. Here are some
random command lines, and the times:

    time git log --topo-order drivers/scsi/scsi_lib.c | head

    real    0m0.688s
    user    0m0.652s
    sys     0m0.036s

and without:

    time git log drivers/scsi/scsi_lib.c | head

    real    0m0.033s
    user    0m0.024s
    sys     0m0.008s

do you see the difference? They happen to output _exactly_ the same ten
lines, but one of them takes the better part of a second (and that's on
pretty much the fastest machine you can find right now - on a laptop with
a slow disk and without things in cache, it would take many many seconds).
The other one is instantaneous.
Now, I realize that 0.033s vs 0.688s doesn't sound like a big deal, even
though that's a 20x difference, but that 20x difference is a _really_ big
deal when the machine is slower, or when "old history" isn't in the disk
cache any more.
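
The gap between "time until the first line shows up" and "time for the whole log" can also be measured separately, rather than inferred from piping into head. The following is a minimal sketch under stated assumptions: `first_line_latency` is a hypothetical helper name (not part of git), and it relies only on Python's standard `subprocess` and `time` modules.

```python
import subprocess
import time


def first_line_latency(argv):
    """Run argv and return (seconds to first output line, total seconds).

    Hypothetical helper for illustration; not part of git.
    """
    start = time.perf_counter()
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc:
        proc.stdout.readline()            # blocks until the first line (or EOF)
        first = time.perf_counter() - start
        for _ in proc.stdout:             # drain the rest of the output
            pass
    # Leaving the with-block closes the pipe and waits for the process.
    total = time.perf_counter() - start
    return first, total
```

Calling it as `first_line_latency(["git", "log", "--topo-order", "kernel/sched.c"])` inside a kernel checkout, and again without `--topo-order`, should give two "first line" numbers to compare alongside two "total" numbers; the actual values depend on the machine and cache state.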
For example, try doing the timings after flushing the disk caches to
simulate cold-cache behavior. Do it with a slow disk. Or do it over NFS.
Yes, even the "fast" case will actually be painfully slow (well, it is for
me, people who are used to CVS probably think it's just "normal").
And yes, it will depend a lot on the file in question too. Obviously, if
the first change is far back in history, it will be slow _regardless_, but
I've at least personally found that in practice, you tend to look at logs
of _recent_ things much much much more than you look at things that
haven't changed lately.
It will also depend a lot on whether you are packed or not. For example,
if you are well packed, the pack-file IO locality is really really good,
and the 20x slowdown is much less. I just tested with a laptop with a slow
disk, and the --topo-order case was "only" 2.5x slower, almost certainly
because the IO required to bring in the first part of the history ended up
being a large portion of the total IO, and so the "whole history" case was
not 20x slower, because there was not 20x more IO due to the good locality
and the kernel doing readahead etc.
But 2.5x slower is really bad, wouldn't you agree? We're not talking about
a few percent here, we're talking about more than twice as long. It's very
noticeable, especially when the end result was

    --topo-order:  29.8s
    no topo-order: 12.1s

(Yeah, that wasn't a very realistic example, but on that same machine,
once it's in the cache, it's 0.13s vs 1.6s: one is "instant", the other is
very much a "wait for it" kind of thing.)
THAT is the kind of performance difference you see.
And trust me, it's a performance difference that you can really notice in
real life. I'm not kidding you. Just try it:

    git log kernel/sched.c

vs

    git log --topo-order kernel/sched.c

and one is instant, the other one pauses before it starts showing
something. One feels fast, the other feels slow.
At the same time, if you actually time the _whole_ log, it's all exactly
the same speed:

    [torvalds@nehalem linux]$ time git log --topo-order kernel/sched.c > /dev/null

    real    0m0.708s
    user    0m0.684s
    sys     0m0.020s

    [torvalds@nehalem linux]$ time git log kernel/sched.c > /dev/null

    real    0m0.703s
    user    0m0.672s
    sys     0m0.032s

Notice? The cost of the topological sort itself is basically zero. But
from an interactivity standpoint, it's _deadly_.
And please note that here "--topo-order" is just an example of a random
"global history post-processing" thing. It's not that I want you to use
the topological sort per se, it's just an example of the whole issue with
_any_ post-factum operation. The topological sort is not expensive as a
sort. What is expensive is that it needs to get the whole history to work.
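
To make the "needs the whole history" point concrete, here is a simplified model, not git's actual implementation. The toy `HISTORY` table and the names `read`, `date_order` and `topo_order` are all hypothetical. The date-ordered walk reads parents only as it goes, while the topological sort (Kahn's algorithm here) has to count every commit's children before it can safely emit anything.

```python
import heapq

# Toy history: commit id -> (commit date, parent ids). Illustrative data only.
HISTORY = {
    "f": (6, ["e", "d"]),   # merge commit, newest
    "e": (5, ["c"]),
    "d": (4, ["b"]),
    "c": (3, ["b"]),
    "b": (2, ["a"]),
    "a": (1, []),           # root commit
}


def read(history, commit, loaded):
    """Look up a commit and record that it had to be read."""
    loaded.append(commit)
    return history[commit]


def date_order(history, tip, loaded):
    """Yield commits newest-first, reading parents only when needed."""
    date, _ = read(history, tip, loaded)
    heap, seen = [(-date, tip)], {tip}
    while heap:
        _, commit = heapq.heappop(heap)
        yield commit
        for parent in history[commit][1]:
            if parent not in seen:
                seen.add(parent)
                parent_date, _ = read(history, parent, loaded)
                heapq.heappush(heap, (-parent_date, parent))


def topo_order(history, tip, loaded):
    """Yield commits so that every child comes before its parents."""
    # Pass 1: walk everything reachable to count each commit's children.
    children, stack = {tip: 0}, [tip]
    while stack:
        commit = stack.pop()
        _, parents = read(history, commit, loaded)
        for parent in parents:
            if parent not in children:
                children[parent] = 0
                stack.append(parent)
            children[parent] += 1
    # Pass 2: emit a commit once all of its children have been emitted.
    ready = [tip]
    while ready:
        commit = ready.pop()
        yield commit
        for parent in history[commit][1]:
            children[parent] -= 1
            if children[parent] == 0:
                ready.append(parent)
```

Asking each generator for just its first commit shows the difference: `date_order` has read one commit at that point, while `topo_order` has already read all six, even though the second pass itself is cheap.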
And also please notice that this is a huge scalability issue. "git log"
should not become slower as a project gets more history. Sure, the full
log will take longer to generate (because there's _more_ of it), but the
top commits should always show up immediately.
Again, if you have a filter (where "topological sort" is just an example
of such a filter) that requires the full history to work, it simply
_fundamentally_ cannot scale well. It very fundamentally will slow down
with bigger history.
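
The scaling argument can be sketched with a synthetic history. Everything here is an assumption made for illustration: the commits are just numbers, `touches_file` pretends every 50th commit changed the file, and the function names are hypothetical. The sketch counts how many commits each style of filter reads before the first ten log entries are available.

```python
from itertools import islice


def history(n, reads):
    """Yield commit numbers newest-first, counting each commit read."""
    for commit in range(n, 0, -1):
        reads[0] += 1
        yield commit


def touches_file(commit):
    # Hypothetical stand-in for "this commit changed the file".
    return commit % 50 == 0


def streaming_log(commits):
    """Incremental filter: emits each match as soon as it is found."""
    return (c for c in commits if touches_file(c))


def postprocessed_log(commits):
    """Post-processing filter: collects the full history, then emits."""
    everything = list(commits)
    return (c for c in everything if touches_file(c))


def reads_for_first(make_log, n, count=10):
    """Commits read in order to produce the first `count` log entries."""
    reads = [0]
    list(islice(make_log(history(n, reads)), count))
    return reads[0]
```

With these assumptions, `reads_for_first(streaming_log, n)` stays at 451 whether `n` is 1,000 or 100,000, while `reads_for_first(postprocessed_log, n)` is always `n`: the incremental filter's cost for the top of the log does not depend on how much history sits underneath it.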
> The problem is that your picture doesn't include my specific problem, I'm
> very interested in the big picture, but I'd like to be in it.
Roman, I've been trying to explain this "interactive" thing for _days_
now. That's the big picture. The whole "you have to be able to generate
history incrementally" thing.
First generating the whole global history, and then simplifying it, is
simply not acceptable. It's too slow, and it doesn't scale.
Linus