* git annotate runs out of memory
@ 2007-12-11 17:33 Daniel Berlin
2007-12-11 17:47 ` Nicolas Pitre
` (3 more replies)
0 siblings, 4 replies; 51+ messages in thread
From: Daniel Berlin @ 2007-12-11 17:33 UTC (permalink / raw)
To: git
On the gcc repository (which is now a 234 meg pack for me), git
annotate ChangeLog takes > 800 meg of memory (I stopped it at about
1.6 gig, since it started swapping my machine).
I assume it will run out of memory. I stopped it after 2 minutes.
Mercurial, on the same file, takes 50 meg and 30 seconds.
git annotate fold-const.c takes 300 meg of memory and > 30 seconds.
Mercurial, on the same file, takes 50 meg of memory and 10 seconds.
svn takes 15 seconds and 20 meg of memory.
I have excluded the mmap memory from mmap'ing the pack/file (in
git/mercurial respectively).
Annotate is treasured by gcc developers (this was a key sticking point
in svn conversion).
Having an annotate that is 2x slower and takes 15x memory would not
fly (regardless of how good the results are).
This seems to be a common problem with git. It seems to use a lot of
memory to perform common operations on the gcc repository (even though
it is faster in some cases than hg).
--Dan
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 17:33 git annotate runs out of memory Daniel Berlin
@ 2007-12-11 17:47 ` Nicolas Pitre
2007-12-11 17:53 ` Daniel Berlin
2007-12-11 18:32 ` Marco Costalba
` (2 subsequent siblings)
3 siblings, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2007-12-11 17:47 UTC (permalink / raw)
To: Daniel Berlin; +Cc: git
On Tue, 11 Dec 2007, Daniel Berlin wrote:
> On the gcc repository (which is now a 234 meg pack for me), git
> annotate ChangeLog takes > 800 meg of memory (I stopped it at about
> 1.6 gig, since it started swapping my machine).
> I assume it will run out of memory. I stopped it after 2 minutes.
And I bet this is the exact same issue as the repack one.
Do you still have the 2.1GB pack around? I bet annotate would eat much
less memory in that case.
Nicolas
* Re: git annotate runs out of memory
2007-12-11 17:47 ` Nicolas Pitre
@ 2007-12-11 17:53 ` Daniel Berlin
2007-12-11 18:01 ` Nicolas Pitre
0 siblings, 1 reply; 51+ messages in thread
From: Daniel Berlin @ 2007-12-11 17:53 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
On 12/11/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 11 Dec 2007, Daniel Berlin wrote:
>
> > On the gcc repository (which is now a 234 meg pack for me), git
> > annotate ChangeLog takes > 800 meg of memory (I stopped it at about
> > 1.6 gig, since it started swapping my machine).
> > I assume it will run out of memory. I stopped it after 2 minutes.
>
> And I bet this is the exact same issue as the repack one.
>
> Do you still have the 2.1GB pack around? I bet annotate would eat much
> less memory in that case.
I do not, but I could remake it in a few days if it would help.
* Re: git annotate runs out of memory
2007-12-11 17:53 ` Daniel Berlin
@ 2007-12-11 18:01 ` Nicolas Pitre
0 siblings, 0 replies; 51+ messages in thread
From: Nicolas Pitre @ 2007-12-11 18:01 UTC (permalink / raw)
To: Daniel Berlin; +Cc: git
On Tue, 11 Dec 2007, Daniel Berlin wrote:
> On 12/11/07, Nicolas Pitre <nico@cam.org> wrote:
> > On Tue, 11 Dec 2007, Daniel Berlin wrote:
> >
> > > On the gcc repository (which is now a 234 meg pack for me), git
> > > annotate ChangeLog takes > 800 meg of memory (I stopped it at about
> > > 1.6 gig, since it started swapping my machine).
> > > I assume it will run out of memory. I stopped it after 2 minutes.
> >
> > And I bet this is the exact same issue as the repack one.
> >
> > Do you still have the 2.1GB pack around? I bet annotate would eat much
> > less memory in that case.
>
> I do not, but I could remake it in a few days if it would help.
Well, depending on the amount of RAM in your machine, you might even not
be able to remake it at the moment. I currently can't reproduce it
myself due to the same out-of-memory issue.
Nicolas
* Re: git annotate runs out of memory
2007-12-11 17:33 git annotate runs out of memory Daniel Berlin
2007-12-11 17:47 ` Nicolas Pitre
@ 2007-12-11 18:32 ` Marco Costalba
2007-12-11 19:03 ` Daniel Berlin
2007-12-11 18:40 ` Linus Torvalds
2007-12-12 10:36 ` Florian Weimer
3 siblings, 1 reply; 51+ messages in thread
From: Marco Costalba @ 2007-12-11 18:32 UTC (permalink / raw)
To: Daniel Berlin; +Cc: git
On Dec 11, 2007 6:33 PM, Daniel Berlin <dberlin@dberlin.org> wrote:
>
> Annotate is treasured by gcc developers (this was a key sticking point
> in svn conversion).
> Having an annotate that is 2x slower and takes 15x memory would not
> fly (regardless of how good the results are).
>
The time spent on annotation is mainly due to retrieving the file
history rather than calculating the actual annotation.
I don't know *how* file history is stored in the other SCMs; perhaps
it is easier to retrieve, i.e. without a full walk across the
revisions...
If you have qgit (especially the 2.0 version, which is much faster at
this), I would be very interested in the annotation times for this
file. The annotation times are shown split between file history
retrieval, based on something along the lines of "git log -p
-- <path>", and the actual annotation calculation (fully internal to
qgit).
I would be interested in cold start and warm cache start (close the
annotation tab and start annotation again).
Thanks (a lot)
Marco
* Re: git annotate runs out of memory
2007-12-11 17:33 git annotate runs out of memory Daniel Berlin
2007-12-11 17:47 ` Nicolas Pitre
2007-12-11 18:32 ` Marco Costalba
@ 2007-12-11 18:40 ` Linus Torvalds
2007-12-11 19:01 ` Matthieu Moy
` (3 more replies)
2007-12-12 10:36 ` Florian Weimer
3 siblings, 4 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-11 18:40 UTC (permalink / raw)
To: Daniel Berlin; +Cc: git
On Tue, 11 Dec 2007, Daniel Berlin wrote:
>
> This seems to be a common problem with git. It seems to use a lot of
> memory to perform common operations on the gcc repository (even though
> it is faster in some cases than hg).
The thing is, git has a very different notion of "common operations" than
you do.
To git, "git annotate" is just about the *last* thing you ever want to do.
It's not a common operation, it's a "last resort" operation. In git, the
whole workflow is designed for "git log -p <pathnamepattern>" rather than
annotate/blame.
In fact, we didn't support annotate at all for the first year or so of
git.
The reason for git being relatively slow is exactly that git doesn't have
"file history" at all, and only tracks full snapshots. So "git blame" is
really a very complex operation that basically looks at the global history
(because nothing else exists) and will basically generate a totally
different "view" of local history from that one.
The disadvantage is that it's much slower and much more costly than just
having a local history view to begin with.
However, the absolutely *huge* advantage is that it isn't then limited to
local history.
So where git shines is when you actually use the global history, and do
merges or when you track more than one file (which others find hard, but
git finds much more natural).
An example of this is content that actually comes from multiple files.
File-based systems simply cannot do this at all. They aren't just slower,
they are totally unable to do it sanely. For git, it's all the same: it
never really cares about file boundaries in the first place.
The other example is doing things like "git log -p drivers/char", where
you don't ask for the log of a single file, but a general file pattern,
and get (still atomic!) commits as the result.
And perhaps the best example is just tracking code when you have two files
that merge into one (possibly because the "same" file was created
independently in two different branches). git gets things like that right
without even thinking about it. Others tend to just flounder about and
can't do anything at all about it.
That said, I'll see if I can speed up "git blame" on the gcc repository.
It _is_ a fundamentally much more expensive operation than it is for
systems that do single-file things.
Linus
* Re: git annotate runs out of memory
2007-12-11 18:40 ` Linus Torvalds
@ 2007-12-11 19:01 ` Matthieu Moy
2007-12-11 19:22 ` Linus Torvalds
2007-12-11 19:06 ` Nicolas Pitre
` (2 subsequent siblings)
3 siblings, 1 reply; 51+ messages in thread
From: Matthieu Moy @ 2007-12-11 19:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Berlin, git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> The other example is doing things like "git log -p drivers/char", where
> you don't ask for the log of a single file, but a general file pattern,
> and get (still atomic!) commits as the result.
I've seen you point to this kind of example many times, but is it
really different from what even SVN does? "svn log drivers/char" will
also list atomic commits, and give me a filtered view of the global
log.
So, yes, that's cool, but I don't see a real difference between git
and almost anything else (except CVS which really got this wrong, no
big surprise).
--
Matthieu
* Re: git annotate runs out of memory
2007-12-11 18:32 ` Marco Costalba
@ 2007-12-11 19:03 ` Daniel Berlin
2007-12-11 19:14 ` Marco Costalba
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Daniel Berlin @ 2007-12-11 19:03 UTC (permalink / raw)
To: Marco Costalba; +Cc: git
On 12/11/07, Marco Costalba <mcostalba@gmail.com> wrote:
> On Dec 11, 2007 6:33 PM, Daniel Berlin <dberlin@dberlin.org> wrote:
> >
> > Annotate is treasured by gcc developers (this was a key sticking point
> > in svn conversion).
> > Having an annotate that is 2x slower and takes 15x memory would not
> > fly (regardless of how good the results are).
> >
>
> The time spent on annotation is mainly due to retrieving the file
> history rather than calculating the actual annotation.
>
Yes, I figured as much.
> I don't know *how* file history is stored in the other SCMs; perhaps
> it is easier to retrieve, i.e. without a full walk across the
> revisions...
It is stored in an easier format. However, could you not simply
provide side indexes to do the annotation?
I guess that won't work in git because you can change history (in
other SCMs, history is read-only, so you know the results for
committed revisions will never change).
> I would be interested in cold start and warm cache start (close the
> annotation tab and start annotation again).
I will try to do this.
* Re: git annotate runs out of memory
2007-12-11 18:40 ` Linus Torvalds
2007-12-11 19:01 ` Matthieu Moy
@ 2007-12-11 19:06 ` Nicolas Pitre
2007-12-11 20:31 ` Jon Smirl
2007-12-11 19:09 ` Daniel Berlin
2007-12-11 19:29 ` Steven Grimm
3 siblings, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2007-12-11 19:06 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Berlin, git
On Tue, 11 Dec 2007, Linus Torvalds wrote:
> That said, I'll see if I can speed up "git blame" on the gcc repository.
> It _is_ a fundamentally much more expensive operation than it is for
> systems that do single-file things.
It has no excuse for eating up to 1.6GB of RAM though. That's plainly
wrong.
Nicolas
* Re: git annotate runs out of memory
2007-12-11 18:40 ` Linus Torvalds
2007-12-11 19:01 ` Matthieu Moy
2007-12-11 19:06 ` Nicolas Pitre
@ 2007-12-11 19:09 ` Daniel Berlin
2007-12-11 19:26 ` Daniel Barkalow
` (3 more replies)
2007-12-11 19:29 ` Steven Grimm
3 siblings, 4 replies; 51+ messages in thread
From: Daniel Berlin @ 2007-12-11 19:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 11 Dec 2007, Daniel Berlin wrote:
> >
> > This seems to be a common problem with git. It seems to use a lot of
> > memory to perform common operations on the gcc repository (even though
> > it is faster in some cases than hg).
>
> The thing is, git has a very different notion of "common operations" than
> you do.
>
> To git, "git annotate" is just about the *last* thing you ever want to do.
> It's not a common operation, it's a "last resort" operation. In git, the
> whole workflow is designed for "git log -p <pathnamepattern>" rather than
> annotate/blame.
>
I understand this, and completely agree with you.
However, I cannot force GCC people to adopt a completely new workflow
in this regard.
The ChangeLogs are not useful enough (and we've had huge fights over
this) to do git log -p and figure out the info we want.
Looking through thousands of diffs to find the one that happened to
your line is also pretty annoying.
Annotate is a major use for gcc developers as a result.
I wish I could fix this silliness, but I can't :)
> That said, I'll see if I can speed up "git blame" on the gcc repository.
> It _is_ a fundamentally much more expensive operation than it is for
> systems that do single-file things.
SVN had the same problem (the file retrieval was the most expensive op
on FSFS). One of the things I did to speed it up tremendously was to
do the annotate from newest to oldest (i.e. in reverse), and stop
annotating when we had come up with annotate info for all the lines.
If you can't speed up file retrieval itself, you can make it need
fewer files :)
In GCC history, it is likely you will be able to cut off at least 30%
of the time if you do this, because files often have changed entirely
multiple times.
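The reverse sweep with early cutoff can be sketched roughly like this (a toy model, not git's or SVN's actual implementation: revisions are plain lists of lines, newest first, and Python's difflib stands in for the real diff machinery):

```python
# A toy model of newest-to-oldest annotation with early cutoff.
# Revisions are plain lists of lines, newest first; difflib stands in
# for the real diff machinery.
from difflib import SequenceMatcher

def annotate(revisions):
    """revisions: list of (rev_id, lines) pairs, newest first.
    Returns the rev_id that introduced each line of the newest revision."""
    newest_id, newest_lines = revisions[0]
    blame = [None] * len(newest_lines)
    # origin[i] maps line i of the current (older) revision back to the
    # line of the newest revision it survives into, or None.
    origin = list(range(len(newest_lines)))
    cur_id, cur_lines = newest_id, newest_lines
    for old_id, old_lines in revisions[1:]:
        sm = SequenceMatcher(a=old_lines, b=cur_lines, autojunk=False)
        new_origin = [None] * len(old_lines)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == 'equal':
                # these lines existed before cur_id: carry the mapping back
                for i, j in zip(range(i1, i2), range(j1, j2)):
                    new_origin[i] = origin[j]
            else:
                # lines j1..j2 first appear in cur_id: blame them on it
                for j in range(j1, j2):
                    if origin[j] is not None and blame[origin[j]] is None:
                        blame[origin[j]] = cur_id
        origin, cur_id, cur_lines = new_origin, old_id, old_lines
        if all(b is not None for b in blame):
            return blame  # early cutoff: every line already attributed
    # whatever is left dates back to the oldest revision
    return [b if b is not None else cur_id for b in blame]
```

The early return is the cutoff being described: once a file has been rewritten entirely, no older revision needs to be retrieved at all.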
* Re: git annotate runs out of memory
2007-12-11 19:03 ` Daniel Berlin
@ 2007-12-11 19:14 ` Marco Costalba
2007-12-11 19:27 ` Jason Sewall
2007-12-11 19:46 ` Daniel Barkalow
2 siblings, 0 replies; 51+ messages in thread
From: Marco Costalba @ 2007-12-11 19:14 UTC (permalink / raw)
To: Daniel Berlin; +Cc: git
On Dec 11, 2007 8:03 PM, Daniel Berlin <dberlin@dberlin.org> wrote:
>
> > I don't know *how* file history is stored in the others scm, perhaps
> > is easier to retrieve, i.e. without a full walk across the
> > revisions...
>
> It is stored in an easier format. However, could you not simply
> provide side indexes to do the annotation?
>
> I guess that won't work in git because you can change history (in
> other SCMs, history is read-only, so you know the results for
> committed revisions will never change).
>
As Linus pointed out, annotation in git is "much slower and much more
costly than just having a local history view to begin with".
Indeed, to annotate say kernel/sched.c, the time spent by git
executing "git log -p -- kernel/sched.c" can be as much as 10x the
subsequent annotation processing time that starts from the git log
output.
Unfortunately my knowledge of git internals falls far short of
guessing what could be done to speed up the *one file* history case
that _seems_ to be the common one.
> > I would be interested in cold start and warm cache start (close the
> > annotation tab and start annotation again).
>
> I will try to do this.
>
Thanks. Very appreciated.
* Re: git annotate runs out of memory
2007-12-11 19:01 ` Matthieu Moy
@ 2007-12-11 19:22 ` Linus Torvalds
2007-12-11 19:24 ` Daniel Berlin
2007-12-11 23:37 ` Matthieu Moy
0 siblings, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-11 19:22 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Daniel Berlin, git
On Tue, 11 Dec 2007, Matthieu Moy wrote:
>
> I've seen you point to this kind of example many times, but is it
> really different from what even SVN does? "svn log drivers/char" will
> also list atomic commits, and give me a filtered view of the global
> log.
Ok, BK and CVS both got this horribly wrong, which is why I care. Maybe
this is one of the things SVN gets right.
I seriously doubt it, though. Do you get *history* right, or do you just
get a random list of commits?
Of course, to see the difference, you need to do "gitk drivers/char" or
use another of the log viewers that actually show you history too. A plain
"git log" won't make it obvious (unless you actually ask for parent
information and then just track the history in your head, in which case
you don't really need an SCM in the first place ;)
Linus
* Re: git annotate runs out of memory
2007-12-11 19:22 ` Linus Torvalds
@ 2007-12-11 19:24 ` Daniel Berlin
2007-12-11 19:42 ` Pierre Habouzit
2007-12-11 23:37 ` Matthieu Moy
1 sibling, 1 reply; 51+ messages in thread
From: Daniel Berlin @ 2007-12-11 19:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git
On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 11 Dec 2007, Matthieu Moy wrote:
> >
> > I've seen you point to this kind of example many times, but is it
> > really different from what even SVN does? "svn log drivers/char" will
> > also list atomic commits, and give me a filtered view of the global
> > log.
>
> Ok, BK and CVS both got this horribly wrong, which is why I care. Maybe
> this is one of the things SVN gets right.
>
> I seriously doubt it, though. Do you get *history* right, or do you just
> get a random list of commits?
No, it will get actual history (i.e. not just things that happen to
have that path in the repository).
* Re: git annotate runs out of memory
2007-12-11 19:09 ` Daniel Berlin
@ 2007-12-11 19:26 ` Daniel Barkalow
2007-12-11 19:34 ` Pierre Habouzit
` (2 subsequent siblings)
3 siblings, 0 replies; 51+ messages in thread
From: Daniel Barkalow @ 2007-12-11 19:26 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Linus Torvalds, git
On Tue, 11 Dec 2007, Daniel Berlin wrote:
> On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> >
> > On Tue, 11 Dec 2007, Daniel Berlin wrote:
> > >
> > > This seems to be a common problem with git. It seems to use a lot of
> > > memory to perform common operations on the gcc repository (even though
> > > it is faster in some cases than hg).
> >
> > The thing is, git has a very different notion of "common operations" than
> > you do.
> >
> > To git, "git annotate" is just about the *last* thing you ever want to do.
> > It's not a common operation, it's a "last resort" operation. In git, the
> > whole workflow is designed for "git log -p <pathnamepattern>" rather than
> > annotate/blame.
> >
> I understand this, and completely agree with you.
> However, I cannot force GCC people to adopt a completely new workflow
> in this regard.
> The ChangeLogs are not useful enough (and we've had huge fights over
> this) to do git log -p and figure out the info we want.
> Looking through thousands of diffs to find the one that happened to
> your line is also pretty annoying.
> Annotate is a major use for gcc developers as a result.
> I wish I could fix this silliness, but I can't :)
>
> > That said, I'll see if I can speed up "git blame" on the gcc repository.
> > It _is_ a fundamentally much more expensive operation than it is for
> > systems that do single-file things.
>
> SVN had the same problem (the file retrieval was the most expensive op
> on FSFS). One of the things I did to speed it up tremendously was to
> do the annotate from newest to oldest (i.e. in reverse), and stop
> annotating when we had come up with annotate info for all the lines.
> If you can't speed up file retrieval itself, you can make it need
> fewer files :)
> In GCC history, it is likely you will be able to cut off at least 30%
> of the time if you do this, because files often have changed entirely
> multiple times.
Unfortunately, we're doing that already. One improvement that is already
available is that we can do progressive annotate: we can output lines we
find in the order we find them, such that lines that changed recently
(which are usually the more interesting ones) get annotated more quickly.
Obviously, you need a GUI-ish thing to do this, because pagers don't like
having stuff written out of order, but there's a good chance that a user
annotating fold-const.c will have the info for the interesting lines in a
few seconds, and go on while git is still trying to find where the boring
old lines came from.
There's also the possibility of generating caches of commit:file pairs
you've annotated, which would make generating the annotation for something
you'd annotated for a recent commit blindingly fast.
-Daniel
*This .sig left intentionally blank*
* Re: git annotate runs out of memory
2007-12-11 19:03 ` Daniel Berlin
2007-12-11 19:14 ` Marco Costalba
@ 2007-12-11 19:27 ` Jason Sewall
2007-12-11 19:46 ` Daniel Barkalow
2 siblings, 0 replies; 51+ messages in thread
From: Jason Sewall @ 2007-12-11 19:27 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Marco Costalba, git
On Dec 11, 2007 2:03 PM, Daniel Berlin <dberlin@dberlin.org> wrote:
> It is stored in an easier format. However, could you not simply
> provide side indexes to do the annotation?
>
> I guess that won't work in git because you can change history (in
> other SCMs, history is read-only, so you know the results for
> committed revisions will never change).
>
I don't know how other SCMs work, but history is definitely read-only
in git - whatever SHA1 you have that describes a commit was calculated
based on its ancestor commits.
If you have a commit's id, it will *always* refer to the same thing -
a tree state and its complete ancestry.
* Re: git annotate runs out of memory
2007-12-11 18:40 ` Linus Torvalds
` (2 preceding siblings ...)
2007-12-11 19:09 ` Daniel Berlin
@ 2007-12-11 19:29 ` Steven Grimm
2007-12-11 20:14 ` Jakub Narebski
3 siblings, 1 reply; 51+ messages in thread
From: Steven Grimm @ 2007-12-11 19:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Berlin, git
On Dec 11, 2007, at 10:40 AM, Linus Torvalds wrote:
> To git, "git annotate" is just about the *last* thing you ever want
> to do.
> It's not a common operation, it's a "last resort" operation. In git,
> the
> whole workflow is designed for "git log -p <pathnamepattern>" rather
> than
> annotate/blame.
My use of "git blame" is perhaps not typical, but I use it fairly
often when I'm looking at a part of my company's code base that I'm
not terribly familiar with. I've found it's the fastest way to figure
out who to go ask about a particular block of code that I think is
responsible for a bug, or more commonly, who to ask to review a change
I'm making.
"git log" is too coarse-grained to be useful for that purpose; it
usually doesn't tell me which of the 500 revisions to the file I'm
looking at introduced the actual line of code I want to change.
To me that really has nothing whatsoever to do with git workflow or
svn workflow; it happens well before I'm ready to do any kind of
integration or commit or even, sometimes, before I've made any changes
to any code at all.
Given infinite spare time, one of the things I'd be strongly tempted
to try to build would be some kind of blame cache. You could
theoretically make blame pretty much instantaneous by doing something
as simple as caching the per-line revision ID for each file in each
revision in a shadow repository (or a shadow branch in the main repo)
and keeping a map between shadow-repo revisions and real-repo ones. If
the cache was of the form "one SHA1 hash per line in the original
file" it would delta-compress pretty well. It'd be easy to update
incrementally since you only need to walk back in history until you
get to the most recently cached revision for each file, at which point
you use the cached value for all the lines that haven't changed.
Yeah, I know, code talks louder than words...
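A minimal sketch of that cache idea, assuming revisions can be modeled as (rev_id, lines) pairs in oldest-first order, with difflib standing in for git's diff and a plain dict standing in for the shadow repository (all names here are hypothetical, not an existing git feature):

```python
# Hypothetical blame cache: store one revision id per line of each
# (rev, path) already annotated, and update incrementally by replaying
# only the revisions since the most recently cached one.
from difflib import SequenceMatcher

cache = {}  # (rev_id, path) -> list of per-line rev_ids

def blame_with_cache(path, revisions):
    """revisions: list of (rev_id, lines) pairs, oldest first."""
    # find the most recent revision we already annotated
    start = 0
    lines_blame = [revisions[0][0]] * len(revisions[0][1])
    for i, (rev_id, _) in enumerate(revisions):
        if (rev_id, path) in cache:
            start = i
            lines_blame = list(cache[(rev_id, path)])
    # replay only the uncached revisions on top of the cached result
    prev_lines = revisions[start][1]
    for rev_id, lines in revisions[start + 1:]:
        sm = SequenceMatcher(a=prev_lines, b=lines, autojunk=False)
        new_blame = []
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == 'equal':
                new_blame.extend(lines_blame[i1:i2])    # unchanged lines keep their blame
            else:
                new_blame.extend([rev_id] * (j2 - j1))  # new lines blamed on this rev
        lines_blame, prev_lines = new_blame, lines
        cache[(rev_id, path)] = list(lines_blame)
    return lines_blame
```

A second blame of the same file at the same (or a nearby) revision then only diffs the few revisions since the cached one, which is the "blindingly fast" incremental update described above.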
-Steve
* Re: git annotate runs out of memory
2007-12-11 19:09 ` Daniel Berlin
2007-12-11 19:26 ` Daniel Barkalow
@ 2007-12-11 19:34 ` Pierre Habouzit
2007-12-11 19:59 ` Junio C Hamano
2007-12-11 19:42 ` Linus Torvalds
2007-12-11 20:29 ` Marco Costalba
3 siblings, 1 reply; 51+ messages in thread
From: Pierre Habouzit @ 2007-12-11 19:34 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Linus Torvalds, git
On Tue, Dec 11, 2007 at 07:09:03PM +0000, Daniel Berlin wrote:
> On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> >
> > On Tue, 11 Dec 2007, Daniel Berlin wrote:
> > >
> > > This seems to be a common problem with git. It seems to use a lot of
> > > memory to perform common operations on the gcc repository (even though
> > > it is faster in some cases than hg).
> >
> > The thing is, git has a very different notion of "common operations" than
> > you do.
> >
> > To git, "git annotate" is just about the *last* thing you ever want to do.
> > It's not a common operation, it's a "last resort" operation. In git, the
> > whole workflow is designed for "git log -p <pathnamepattern>" rather than
> > annotate/blame.
> >
> I understand this, and completely agree with you.
> However, I cannot force GCC people to adopt a completely new workflow
> in this regard.
> The ChangeLogs are not useful enough (and we've had huge fights over
> this) to do git log -p and figure out the info we want.
> Looking through thousands of diffs to find the one that happened to
> your line is also pretty annoying.
If the question you want to answer is "what happened to that line"
then using git annotate is using a big hammer for no good reason.
git log -S'<put the content of the line here>' -- path/to/file.c
will give you the very same answer, pointing you to the changes that
added or removed that line directly. It's not a fast command either,
but it should be less resource-hungry than annotate, which has to do
roughly the same work for all lines, whereas you're interested in only
one.
The direct plus here is that git log output is incremental, so you get
answers about the first diffs quite quickly, which lets you examine
them while the rest is still being computed.
Unlike git annotate, this also allows you to restrict the revisions
it searches to a range where you know the change happened, which makes
it almost instantaneous in most cases.
Of course, if the line is ' free(p);\n' then you will probably have
quite a few false positives, but with the path restriction, I assume
this will still be quite accurate.
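The pickaxe behavior described here can be modeled in a few lines (a toy sketch, assuming a linear history with revisions as (rev_id, text) pairs, newest first; like -S, it reports revisions where the number of occurrences of the string changed, i.e. the change added or removed it):

```python
# Toy model of git log -S (the "pickaxe"): a revision is a hit when its
# change altered how many times the search string occurs in the file.
def pickaxe(revisions, needle):
    """revisions: list of (rev_id, text) pairs, newest first, linear history.
    Returns ids of revisions whose change added or removed `needle`."""
    hits = []
    # pair each revision with its parent; the oldest gets an empty parent
    for (rid, text), (_pid, parent_text) in zip(revisions, revisions[1:] + [(None, "")]):
        if text.count(needle) != parent_text.count(needle):
            hits.append(rid)
    return hits
```

This also makes the false-positive point concrete: a needle like `free(p);` matches anywhere its occurrence count changes, not only the line you care about, which is why the path restriction matters.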
What is important here is to know the real question the GCC
programmers want answered. It seems to me that `blame` is overkill
for the underlying issue.
Note that this does not justify the current memory consumption, which
just looks bad and wrong to me; rather, it aims at finding a way to
answer your question by doing just what you need and not gazillions of
other things :)
--
·O· Pierre Habouzit
··O madcoder@debian.org
OOO http://www.madism.org
* Re: git annotate runs out of memory
2007-12-11 19:09 ` Daniel Berlin
2007-12-11 19:26 ` Daniel Barkalow
2007-12-11 19:34 ` Pierre Habouzit
@ 2007-12-11 19:42 ` Linus Torvalds
2007-12-11 19:50 ` Linus Torvalds
` (3 more replies)
2007-12-11 20:29 ` Marco Costalba
3 siblings, 4 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-11 19:42 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Git Mailing List
On Tue, 11 Dec 2007, Daniel Berlin wrote:
>
> I understand this, and completely agree with you.
> However, I cannot force GCC people to adopt completely new workflow in
> this regard.
Oh, I agree. It's why we do have "git blame" these days, and it's why I've
tried to make people use the nicer incremental mode, which is not at all
faster, but it's a hell of a lot more pleasant to use because you get some
output immediately.
In other words,
git blame gcc/ChangeLog
is virtually useless because it's too expensive, but try doing
git gui blame gcc/ChangeLog
instead, and doesn't that just seem nicer? (*)
The difference is that the GUI one does it incrementally, and doesn't have
to get _all_ the results before it can start reporting blame.
Not that I claim that the gui blame is perfect either (I dunno why it
delays the nice coloring so long, for example), but it was something I
pushed - and others made the gui for - exactly to help people with the
fact that git internally really does it that incremental way.
> SVN had the same problem (the file retrieval was the most expensive op
> on FSFS). One of the things I did to speed it up tremendously was to
> do the annotate from newest to oldest (i.e. in reverse), and stop
> annotating when we had come up with annotate info for all the lines.
We do that. The expense for git is that we don't do the revisions as a
single file at all. We'll look through each commit and check whether
the "gcc" directory changed; if it did, we'll go into it and check
whether the "ChangeLog" file changed - and if it did, we'll actually
diff it against the previous version.
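That walk can be sketched as a toy model (hypothetical and simplified, not git's code: commits are (id, tree) pairs in newest-first linear order, trees are nested dicts, and Python equality stands in for comparing object hashes, which is what makes the real check cheap):

```python
# Toy model of path-limited history in a snapshot store: a path's
# history is the set of commits whose subtree for that path differs
# from the parent's.
def subtree(tree, path):
    """Descend component by component; None means the path is absent."""
    for part in path.split('/'):
        if tree is None or part not in tree:
            return None
        tree = tree[part]
    return tree

def path_history(commits, path):
    """commits: list of (commit_id, tree) pairs, newest first, linear.
    Returns the commits in which `path` changed relative to its parent."""
    changed = []
    for (cid, tree), (_pid, parent_tree) in zip(commits, commits[1:] + [(None, None)]):
        if subtree(tree, path) != subtree(parent_tree, path):
            changed.append(cid)
    return changed
```

In real git the comparison at each level is just an object-hash comparison, so unchanged directories are skipped without being opened.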
> In GCC history, it is likely you will be able to cut off at least 30%
> of the time if you do this, because files often have changed entirely
> multiple times.
Not gcc/ChangeLog, though (apart from the renames that happen
occasionally).
Btw, an example of something git *should* do right, but is just too damn
expensive, is doing
git gui blame gcc/ChangeLog-2000
and have it actually be able to track the original source of each of those
annotations across that "ChangeLog split from hell".
I bet it would eventually get it right, but that's a large file, way back
in history, and it will try to do a non-whitespace blame with copy
detection.
That's *expensive*, although it is an amusing thing to try to do ;)
Linus
PS. I also do agree that we seem to use an excessive amount of memory
there. As to whether it's the same issue or not, I'd not go as far as Nico
and say "yes" yet. But it's interesting.
It's not entirely surprising that we see multiple issues with the gcc
repo, simply because it's not the kind of repo that people have ever
really worked on. So I don't think it's necessarily related at all, except
in the sense of it being a different load and showing issues.
* Re: git annotate runs out of memory
2007-12-11 19:24 ` Daniel Berlin
@ 2007-12-11 19:42 ` Pierre Habouzit
2007-12-11 21:09 ` Daniel Berlin
0 siblings, 1 reply; 51+ messages in thread
From: Pierre Habouzit @ 2007-12-11 19:42 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Linus Torvalds, Matthieu Moy, git
On Tue, Dec 11, 2007 at 07:24:54PM +0000, Daniel Berlin wrote:
> On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> >
> > On Tue, 11 Dec 2007, Matthieu Moy wrote:
> > >
> > > I've seen you point to this kind of example many times, but is it
> > > really different from what even SVN does? "svn log drivers/char" will
> > > also list atomic commits, and give me a filtered view of the global
> > > log.
> >
> > Ok, BK and CVS both got this horribly wrong, which is why I care. Maybe
> > this is one of the things SVN gets right.
> >
> > I seriously doubt it, though. Do you get *history* right, or do you just
> > get a random list of commits?
>
> No, it will get actual history (i.e. not just things that happen to
> have that path in the repository).
OTOH svn gets the result right, but the way it does it is horrible.
When you svn log some/path, I think it just (basically) asks for the svn
log of each file in that directory, and merges the logs together. This is
"easy" for svn since it remembers "where this specific file" came from.
So for svn it's just a matter of merging the individual files' histories
together. It may have a more clever implementation, but basically I
believe it would be similar to that in the end.
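As a toy illustration of that speculated merge (purely hypothetical; not
svn's actual implementation), per-file revision lists sorted newest-first
could be combined like this:

```c
/* Merge two per-file revision histories (each sorted newest-first)
 * into one directory log, collapsing a commit that touched both files
 * into a single entry.  Illustrative sketch only, not real svn code. */
static int merge_logs(const int *a, int na, const int *b, int nb, int *out)
{
	int i = 0, j = 0, n = 0;

	while (i < na || j < nb) {
		int rev;

		if (j == nb || (i < na && a[i] >= b[j]))
			rev = a[i++];
		else
			rev = b[j++];
		if (n == 0 || out[n - 1] != rev)	/* drop duplicates */
			out[n++] = rev;
	}
	return n;
}
```

A log over a whole directory would then repeat this merge (or an n-way
version of it) across every file under the path.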
Of course, if you do something as stupid as:
svn cp Makefile some/path/foo.c
# completely rewrite foo.c
svn commit
then you'll have the history of `Makefile` melded into the
some/path/foo.c svn log, which is completely horribly wrong.
or if you do (which, unlike the previous example, isn't silly; there are
many good reasons):
cp bar.c foo.c
svn add foo.c
svn commit
then foo.c won't have bar.c history in its svn log.
--
·O· Pierre Habouzit
··O madcoder@debian.org
OOO http://www.madism.org
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:03 ` Daniel Berlin
2007-12-11 19:14 ` Marco Costalba
2007-12-11 19:27 ` Jason Sewall
@ 2007-12-11 19:46 ` Daniel Barkalow
2007-12-11 20:14 ` Marco Costalba
2 siblings, 1 reply; 51+ messages in thread
From: Daniel Barkalow @ 2007-12-11 19:46 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Marco Costalba, git
On Tue, 11 Dec 2007, Daniel Berlin wrote:
> It is stored in an easier format. However, can you not simply provide
> side-indexes to do the annotation?
>
> I guess that won't work in git because you can change history (in
> other scm's, history is readonly so you could know the results for
> committed revisions will never change).
History in git is read-only. It's just that git lets you fork and move
forward with something different. Each commit can never change (and, in
fact, you'd have to badly break SHA1 to change it), but which commits are
relevant to the history can change.
Keeping extra information is fine; at worst, it'll go irrelevant.
-Daniel
*This .sig left intentionally blank*
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:42 ` Linus Torvalds
@ 2007-12-11 19:50 ` Linus Torvalds
2007-12-11 21:14 ` Daniel Berlin
2007-12-12 7:57 ` Jeff King
2007-12-11 21:14 ` Linus Torvalds
` (2 subsequent siblings)
3 siblings, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-11 19:50 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Git Mailing List
On Tue, 11 Dec 2007, Linus Torvalds wrote:
>
> We do that. The expense for git is that we don't do the revisions as a
> single file at all. We'll look through each commit, check whether the
> "gcc" directory changed, if it did, we'll go into it, and check whether
> the "ChangeLog" file changed - and if it did, we'll actually diff it
> against the previous version.
And, btw: the diff is totally different from the xdelta we have, so even
if we have an already prepared nice xdelta between the two versions, we'll
end up re-generating the files in full, and then do a diff on the end
result.
Of course, part of that is that git logically *never* works with deltas,
except in the actual code-paths that generate objects (or generate packs,
of course). So even if we had used a delta algorithm that would be
amenable to be turned into a diff directly, it would have been a layering
violation to actually do that.
Other systems can sometimes just re-use their deltas to generate the
diffs and/or blame information. I dunno whether SVN does that. CVS does,
afaik.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:34 ` Pierre Habouzit
@ 2007-12-11 19:59 ` Junio C Hamano
0 siblings, 0 replies; 51+ messages in thread
From: Junio C Hamano @ 2007-12-11 19:59 UTC (permalink / raw)
To: Pierre Habouzit; +Cc: Daniel Berlin, Linus Torvalds, git
Pierre Habouzit <madcoder@debian.org> writes:
>> Looking through thousands of diffs to find the one that happened to
>> your line is also pretty annoying.
>
> If the question you want to answer is "what happened to that line"
> then using git annotate is using a big hammer for no good reason.
>
> git log -S'<put the content of the line here>' -- path/to/file.c
>
> will give you the very same answer, pointing you to the changes that
> added or removed that line directly. It's not a fast command either, but
> it should be less resource-hungry than annotate, which has to do roughly
> the same work for all lines whereas you're interested in only one.
>
> The direct plus here is that git log output is incremental, so you get
> answers about the first diffs quite quickly, which lets you examine the
> first answers while the rest is still being computed.
Yes.
> Unlike git annotate, this also allows you to restrict the revisions
> where it searches to a range where you know this happened, which makes
> it almost instantaneous in most cases.
Yes, but blame also takes revision bottoms (obviously you have to start
digging from a single revision so "blame master..next pu" would not
work, but "blame ^foo ^bar baz" would).
> Of course, if the line is ' free(p);\n' then you will probably have
> quite a few false positives,...
You can feed -S more than a single line, and the assumed and recommended
typical use case is to do so.
> Note that it does not justify the current memory consumption, which just
> looks bad and wrong to me,...
Right.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:46 ` Daniel Barkalow
@ 2007-12-11 20:14 ` Marco Costalba
0 siblings, 0 replies; 51+ messages in thread
From: Marco Costalba @ 2007-12-11 20:14 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Daniel Berlin, git
On Dec 11, 2007 8:46 PM, Daniel Barkalow <barkalow@iabervon.org> wrote:
> On Tue, 11 Dec 2007, Daniel Berlin wrote:
>
> > It is stored in an easier format. However, can you not simply provide
> > side-indexes to do the annotation?
> >
> > I guess that won't work in git because you can change history (in
> > other scm's, history is readonly so you could know the results for
> > committed revisions will never change).
>
> History in git is read-only. It's just that git lets you fork and move
> forward with something different. Each commit can never change (and, in
> fact, you'd have to badly break SHA1 to change it), but which commits are
> relevant to the history can change.
>
Well, revisions never change, but history, in the sense of a revision's
parent information, can and does change when you use a path limiter.
So does the graph, which is a direct visualization of that parent
information.
For a single revision (one that modifies, say, 3 files) you can have at
least 3 different histories, and actually more if you also want to
visualize the history of the directory trees that own the modified files.
You end up with quite a big number of different histories, all showing
your revisions in different ways according to the path limiter you use.
Perhaps this is the intended meaning of "changing histories", and in any
case it is the reason you cannot (or it makes no sense to) "save" a
single file's history in git.
Marco
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:29 ` Steven Grimm
@ 2007-12-11 20:14 ` Jakub Narebski
0 siblings, 0 replies; 51+ messages in thread
From: Jakub Narebski @ 2007-12-11 20:14 UTC (permalink / raw)
To: Steven Grimm; +Cc: Linus Torvalds, Daniel Berlin, git
Steven Grimm <koreth@midwinter.com> writes:
> On Dec 11, 2007, at 10:40 AM, Linus Torvalds wrote:
> > To git, "git annotate" is just about the *last* thing you ever want
> > to do.
> > It's not a common operation, it's a "last resort" operation. In git,
> > the
> > whole workflow is designed for "git log -p <pathnamepattern>" rather
> > than
> > annotate/blame.
>
> My use of "git blame" is perhaps not typical, but I use it fairly
> often when I'm looking at a part of my company's code base that I'm
> not terribly familiar with. I've found it's the fastest way to figure
> out who to go ask about a particular block of code that I think is
> responsible for a bug, or more commonly, who to ask to review a change
> I'm making.
>
> "git log" is too coarse-grained to be useful for that purpose; it
> usually doesn't tell me which of the 500 revisions to the file I'm
> looking at introduced the actual line of code I want to change.
There is always "pickaxe" search, i.e.
$ git log -p -S'<string>' -- <file or pathspec>
which can be used instead of blame (perhaps with --follow).
And you can limit blame to the interesting region of file, and to
interesting (important) range of revisions.
[about blame cache]
"git gui blame" uses incremental blame; if only it accepted range
(file fragment) limiting, and if "reblame" (blame --reference=<rev>,
blaming incrementally only lines which changed wrt. given revision)
was implemented.
BTW. qgit actually does blame using its own "multiple files bottom-up
blame" code (it would be nice to have it in core-git if possible,
hint, hint), and does some caching, although I'm not sure if it caches
the blame info too. You should try it, I think.
--
Jakub Narebski
Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:09 ` Daniel Berlin
` (2 preceding siblings ...)
2007-12-11 19:42 ` Linus Torvalds
@ 2007-12-11 20:29 ` Marco Costalba
3 siblings, 0 replies; 51+ messages in thread
From: Marco Costalba @ 2007-12-11 20:29 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Linus Torvalds, git
On Dec 11, 2007 8:09 PM, Daniel Berlin <dberlin@dberlin.org> wrote:
>
> In GCC history, it is likely you will be able to cut off at least 30%
> of the time if you do this, because files often have changed entirely
> multiple times.
>
This could be useful for a command line tool, but for a GUI the top-down
approach is a myth IMHO.
In the GUI case what you actually end up doing (because a GUI allows
it) is to start from the latest file version and check the code region
you are interested in; then, when you find the changed lines, you _may_
want to double-click to see how the file looked before that change, and
then perhaps start a new round of digging.
I've found this is my typical workflow with annotation info, because I'm
interested not in _what_ lines have changed but _why_ they have changed,
and to find that out you naturally end up digging into the past (also
checking the corresponding revision's patch, for example in another tab).
In this case the advantage of an oldest-to-newest annotation algorithm is
that you have _already_ annotated all the history, so you can walk and
dig back and forth among the different file versions without *any*
additional delay.
Marco
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:06 ` Nicolas Pitre
@ 2007-12-11 20:31 ` Jon Smirl
0 siblings, 0 replies; 51+ messages in thread
From: Jon Smirl @ 2007-12-11 20:31 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Daniel Berlin, git
On 12/11/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 11 Dec 2007, Linus Torvalds wrote:
>
> > That said, I'll see if I can speed up "git blame" on the gcc repository.
> > It _is_ a fundamentally much more expensive operation than it is for
> > systems that do single-file things.
>
It has no excuse for eating up to 1.6GB of RAM though. That's plainly
wrong.
git blame gcc/ChangeLog
It needs 2.25GB of RAM to run without swapping
That is pretty close to the same number the repack needs.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:42 ` Pierre Habouzit
@ 2007-12-11 21:09 ` Daniel Berlin
0 siblings, 0 replies; 51+ messages in thread
From: Daniel Berlin @ 2007-12-11 21:09 UTC (permalink / raw)
To: Pierre Habouzit, Daniel Berlin, Linus Torvalds, Matthieu Moy, git
On 12/11/07, Pierre Habouzit <madcoder@debian.org> wrote:
> On Tue, Dec 11, 2007 at 07:24:54PM +0000, Daniel Berlin wrote:
> > On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > >
> > > On Tue, 11 Dec 2007, Matthieu Moy wrote:
> > > >
> > > > I've seen you pointing to this kind of example many times, but is that
> > > > really different from what even SVN does? "svn log drivers/char" will
> > > > also list atomic commits, and give me a filtered view of the global
> > > > log.
> > >
> > > Ok, BK and CVS both got this horribly wrong, which is why I care. Maybe
> > > this is one of the things SVN gets right.
> > >
> > > I seriously doubt it, though. Do you get *history* right, or do you just
> > > get a random list of commits?
> >
> > No, it will get actual history (i.e. not just things that happen to have
> > that path in the repository)
>
> OTOH svn has the result right, but the way it does that is horrible.
> When you svn log some/path, I think it just (basically) ask svn log for
> each file in that directory, and merge the logs together. This is "easy"
> for svn since it remembers "where this specific file" came from.
What?
We version directories too.
We don't do svn log for each file in the directory when you request a path.
We look at the history of the path, follow renames, etc.
When you change foo/bar/fred.c, we consider it a change to foo/bar and
foo/, and thus, they have new versions.
I'm not sure where you get this crazy notion that we do anything with
files when you ask about directories.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:42 ` Linus Torvalds
2007-12-11 19:50 ` Linus Torvalds
@ 2007-12-11 21:14 ` Linus Torvalds
2007-12-11 21:54 ` Junio C Hamano
2007-12-11 21:24 ` Daniel Berlin
2007-12-12 3:57 ` Shawn O. Pearce
3 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2007-12-11 21:14 UTC (permalink / raw)
To: Daniel Berlin, Junio C Hamano; +Cc: Git Mailing List
On Tue, 11 Dec 2007, Linus Torvalds wrote:
>
> PS. I also do agree that we seem to use an excessive amount of memory
> there. As to whether it's the same issue or not, I'd not go as far as Nico
> and say "yes" yet. But it's interesting.
I think the answer here is that git-annotate is a totally different issue.
The blame machinery keeps around all the blobs it has ever needed to do a
diff, which explains why something like gcc/ChangeLog blows up badly.
Try this trivial patch.
It will cause us to potentially re-generate some blobs much more often,
but that's a reasonably cheap operation, and our delta base cache will
handle the expensive cases.
It's still not a free operation, but I get
[torvalds@woody gcc]$ /usr/bin/time ~/git/git-blame gcc/ChangeLog > /dev/null
20.68user 1.25system 0:21.94elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+599833minor)pagefaults 0swaps
so it took 22s and I never saw it grow very large either (it grew to 72M
resident, but I don't know how much of that was the mmap of the
pack-file, so that number is pretty meaningless). Valgrind reports that
it used a maximum heap of about 24M, and almost all of that seems to have
been in the delta cache (which is all good).
Linus
----
builtin-blame.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..18f9924 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -87,6 +87,14 @@ struct origin {
char path[FLEX_ARRAY];
};
+static void drop_origin_blob(struct origin *o)
+{
+ if (o->file.ptr) {
+ free(o->file.ptr);
+ o->file.ptr = NULL;
+ }
+}
+
/*
* Given an origin, prepare mmfile_t structure to be used by the
* diff machinery
@@ -558,6 +566,8 @@ static struct patch *get_patch(struct origin *parent, struct origin *origin)
if (!file_p.ptr || !file_o.ptr)
return NULL;
patch = compare_buffer(&file_p, &file_o, 0);
+ drop_origin_blob(parent);
+ drop_origin_blob(origin);
num_get_patch++;
return patch;
}
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:50 ` Linus Torvalds
@ 2007-12-11 21:14 ` Daniel Berlin
2007-12-11 21:34 ` Linus Torvalds
2007-12-12 7:57 ` Jeff King
1 sibling, 1 reply; 51+ messages in thread
From: Daniel Berlin @ 2007-12-11 21:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 11 Dec 2007, Linus Torvalds wrote:
> >
> > We do that. The expense for git is that we don't do the revisions as a
> > single file at all. We'll look through each commit, check whether the
> > "gcc" directory changed, if it did, we'll go into it, and check whether
> > the "ChangeLog" file changed - and if it did, we'll actually diff it
> > against the previous version.
>
> And, btw: the diff is totally different from the xdelta we have, so even
> if we have an already prepared nice xdelta between the two versions, we'll
> end up re-generating the files in full, and then do a diff on the end
> result.
This is what SVN does as well.
>
> Of course, part of that is that git logically *never* works with deltas,
> except in the actual code-paths that generate objects (or generate packs,
> of course). So even if we had used a delta algorithm that would be
> amenable to be turned into a diff directly, it would have been a layering
> violation to actually do that.
Right. SVN has the same problem.
>
> Other systems can sometimes just re-use their deltas to generate the
> diffs and/or blame information. I dunno whether SVN does that. CVS does,
> afaik.
CVS does because its delta is line-based, so it's easy.
You can theoretically generate blame info from SVN/GIT's block deltas,
but you, of course, have the problem GIT does, which is that the delta
is not meant to represent the actual changes that occurred, but
instead the smallest way to reconstruct data x from data y.
This only sometimes has any relation to how the file actually changed.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:42 ` Linus Torvalds
2007-12-11 19:50 ` Linus Torvalds
2007-12-11 21:14 ` Linus Torvalds
@ 2007-12-11 21:24 ` Daniel Berlin
2007-12-12 3:57 ` Shawn O. Pearce
3 siblings, 0 replies; 51+ messages in thread
From: Daniel Berlin @ 2007-12-11 21:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> It's not entirely surprising that we see multiple issues with the gcc
> repo, simply because it's not the kind of repo that people have ever
> really worked on. So I don't think it's necessarily related at all, except
> in the sense of it being a different load and showing issues.
>
I'm not surprised at all.
We had a number of issues with SVN that needed to be resolved.
I'm basically trying to get issues worked (both on git and mercurial)
out to the point where it is fair for our users to try their branch
and trunk workflows with git and mercurial, and see which they like
more.
:)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 21:14 ` Daniel Berlin
@ 2007-12-11 21:34 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-11 21:34 UTC (permalink / raw)
To: Daniel Berlin; +Cc: Git Mailing List
On Tue, 11 Dec 2007, Daniel Berlin wrote:
>
> You can theoretically generate blame info from SVN/GIT's block deltas,
> but you, of course, have the problem GIT does, which is that the delta
> is not meant to represent the actual changes that occurred, but
> instead the smallest way to reconstruct data x from data y.
> This only sometimes has any relation to how the file actually changed.
Exactly. Git objects in themselves have no history or relationships, and
being a delta against another object means nothing at all except for the
fact that the data seems to resemble that other object (which has a
_correlation_ with being related, but nothing more).
Anyway, I think the git annotate memory usage was simply a real bug
that nobody had noticed before because the memory leak wasn't all that
noticeable with smaller files and/or shallower histories. Can you verify
that it works for you with the patch I sent out?
With that fix, I could even run
git blame -C gcc/ChangeLog-2000
to see the blame machinery work past the strange "combine many different
changelogs into year-based ones" commit. Now, I cannot honestly claim that
it was really *usable* (it did take three minutes to run!), but sometimes
those three minutes of CPU time may be worth it, if it shows the real
historical context it came from.
In the case of the ChangeLog-2000 file, all the original lines obviously
came from older versions of a file called "gcc/ChangeLog", so the end
result doesn't really show what an involved situation it was to track the
sources back through not just renames, but actually file splits and
merges. Sad, but once you know what it did it's still a bit cool to see
that it worked ;)
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 21:14 ` Linus Torvalds
@ 2007-12-11 21:54 ` Junio C Hamano
2007-12-11 23:36 ` Linus Torvalds
2007-12-12 4:48 ` Junio C Hamano
0 siblings, 2 replies; 51+ messages in thread
From: Junio C Hamano @ 2007-12-11 21:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Berlin, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> builtin-blame.c | 10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/builtin-blame.c b/builtin-blame.c
> index c158d31..18f9924 100644
> --- a/builtin-blame.c
> +++ b/builtin-blame.c
> @@ -87,6 +87,14 @@ struct origin {
> char path[FLEX_ARRAY];
> };
>
> /*
> * Given an origin, prepare mmfile_t structure to be used by the
> * diff machinery
> @@ -558,6 +566,8 @@ static struct patch *get_patch(struct origin *parent, struct origin *origin)
> if (!file_p.ptr || !file_o.ptr)
> return NULL;
> patch = compare_buffer(&file_p, &file_o, 0);
> + drop_origin_blob(parent);
> + drop_origin_blob(origin);
> num_get_patch++;
> return patch;
> }
While this should be safe (because the user of the blob lazily re-fetches it),
it feels a bit too aggressive, especially when -C or other "retry and
try harder to assign blame elsewhere" option is used.
Instead, how about discarding after we are done with each origin, like
this?
---
builtin-blame.c | 17 +++++++++++++++--
1 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..eda79d0 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -130,6 +130,14 @@ static void origin_decref(struct origin *o)
}
}
+static void drop_origin_blob(struct origin *o)
+{
+ if (o->file.ptr) {
+ free(o->file.ptr);
+ o->file.ptr = NULL;
+ }
+}
+
/*
* Each group of lines is described by a blame_entry; it can be split
* as we pass blame to the parents. They form a linked list in the
@@ -1274,8 +1282,13 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
}
finish:
- for (i = 0; i < MAXPARENT; i++)
- origin_decref(parent_origin[i]);
+ for (i = 0; i < MAXPARENT; i++) {
+ if (parent_origin[i]) {
+ drop_origin_blob(parent_origin[i]);
+ origin_decref(parent_origin[i]);
+ }
+ }
+ drop_origin_blob(origin);
}
/*
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 21:54 ` Junio C Hamano
@ 2007-12-11 23:36 ` Linus Torvalds
2007-12-12 0:02 ` Linus Torvalds
2007-12-12 4:48 ` Junio C Hamano
1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2007-12-11 23:36 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Junio C Hamano wrote:
>
> Instead, how about discarding after we are done with each origin, like
> this?
Sure, looks fine to me. With either of these patches, all of the cost is
in the diffing routines:
samples % image name app name symbol name
191317 31.4074 git git xdl_hash_record
120060 19.7096 git git xdl_recmatch
99286 16.2992 git git xdl_prepare_ctx
56370 9.2539 libc-2.7.so libc-2.7.so memcpy
23315 3.8275 git git xdl_prepare_env
..
and while I suspect xdiff could be optimized a bit more for the cases
where we have no changes at the end, that's beyond my skills.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:22 ` Linus Torvalds
2007-12-11 19:24 ` Daniel Berlin
@ 2007-12-11 23:37 ` Matthieu Moy
2007-12-11 23:48 ` Linus Torvalds
1 sibling, 1 reply; 51+ messages in thread
From: Matthieu Moy @ 2007-12-11 23:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Berlin, git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Tue, 11 Dec 2007, Matthieu Moy wrote:
>>
>> I've seen you pointing to this kind of example many times, but is that
>> really different from what even SVN does? "svn log drivers/char" will
>> also list atomic commits, and give me a filtered view of the global
>> log.
>
> Ok, BK and CVS both got this horribly wrong, which is why I care. Maybe
> this is one of the things SVN gets right.
>
> I seriously doubt it, though. Do you get *history* right, or do you just
> get a random list of commits?
Well, you don't get merge commits right with SVN, but that's a
different issue (svn 1.5 is supposed to have something about merge
history, I don't know how it's done ...). So, if by "history" you
mean how branches interfered with each other, obviously SVN is bad at this.
But it's equally bad at "svn log dir/" and plain "svn log".
But to simplify, if you take a linear history (no merge commits),
"svn log dir/" gives you the list of commits which changed something
inside "dir/". As pointed out in other messages, the way it's done is
really different from what git does. SVN does know a lot about
directories, and records a lot about them at commit time, while git
just considers them as file containers.
Yeah, CVS got this terribly wrong. IIRC, it just took the logs for
the individual files and mixed them together, so a commit touching
multiple files would appear several times.
I've taken SVN as an extreme example, but at least bzr and mercurial
have an approach very similar to git.
So, to me, this particular point is something git obviously got right,
but not a point where git is so different from the others.
--
Matthieu
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 23:37 ` Matthieu Moy
@ 2007-12-11 23:48 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-11 23:48 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Daniel Berlin, git
On Wed, 12 Dec 2007, Matthieu Moy wrote:
>
> > I seriously doubt it, though. Do you get *history* right, or do you just
> > get a random list of commits?
>
> Well, you don't get merge commit right with SVN, but that's a
> different issue (svn 1.5 is supposed to have something about merge
> history, I don't know how it's done ...). So, if by "history", you
> mean how branches interferred together, obviously, SVN is bad at this.
> But it's equally bad at "svn log dir/" and plain "svn log".
Yeah, git just has higher goals.
The time history really matters (or rather, what I call the "shape" of
history) is when you are trying to merge, and you get a merge conflict.
That's when you want to do
gitk master merge ^merge-base -- files-that-are-unmerged
and in fact this is such an important thing for me that there is a
shorthand argument to do exactly that, ie:
gitk --merge
which shows the commits that touched the unmerged files graphically *with*
the history being correct (ie you don't just get a random log of "these
changes happened", you get the real history of the two branches as it
pertains to the files you care about!)
> But to simplify, if you take a linear history (no merge commits),
> "svn log dir/" gives you the list of commits which changed something
> inside "dir/"
Sure, linear history is trivial. But it's also almost totally
uninteresting.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 23:36 ` Linus Torvalds
@ 2007-12-12 0:02 ` Linus Torvalds
2007-12-12 0:22 ` Davide Libenzi
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-12 0:02 UTC (permalink / raw)
To: Junio C Hamano, Davide Libenzi; +Cc: Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Linus Torvalds wrote:
>
> and while I suspect xdiff could be optimized a bit more for the cases
> where we have no changes at the end, that's beyond my skills.
Ok, I lied.
Nothing is beyond my skills. My mad k0der skillz are unbeatable.
This speeds up git-blame on ChangeLog-style files by a big amount, by just
ignoring the common end that we don't care about, since we don't want any
context anyway at that point. So I now get:
[torvalds@woody gcc]$ time git blame gcc/ChangeLog > /dev/null
real 0m7.031s
user 0m6.852s
sys 0m0.180s
which seems quite reasonable, and is about three times faster than trying
to diff those big files.
Davide: this really _does_ make a huge difference. Maybe xdiff itself
should do this optimization on its own, rather than have the caller hack
around the fact that xdiff doesn't handle this common case all that well?
The same thing obviously works for the beginning-of-file too, but then you
have to play games with line numbers being affected etc, so the end is
the much easier case and is the case that a ChangeLog-style file cares
about.
Daniel, this is obviously on top of the patches that fix the memory leak.
Linus
---
diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..677188c 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -543,6 +551,20 @@ static struct patch *compare_buffer(mmfile_t *file_p, mmfile_t *file_o,
return state.ret;
}
+#define BLOCK 1024
+
+static void truncate_common_data(mmfile_t *a, mmfile_t *b)
+{
+ long l1 = a->size, l2 = b->size;
+
+ while ((l1 -= BLOCK) > 0 && (l2 -= BLOCK) > 0) {
+ if (memcmp(a->ptr + l1, b->ptr + l2, BLOCK))
+ break;
+ a->size = l1;
+ b->size = l2;
+ }
+}
+
/*
* Run diff between two origins and grab the patch output, so that
* we can pass blame for lines origin is currently suspected for
@@ -557,6 +579,7 @@ static struct patch *get_patch(struct origin *parent, struct origin *origin)
fill_origin_blob(origin, &file_o);
if (!file_p.ptr || !file_o.ptr)
return NULL;
+ truncate_common_data(&file_p, &file_o);
patch = compare_buffer(&file_p, &file_o, 0);
num_get_patch++;
return patch;
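For anyone reading along outside the git tree, the same idea can be
sketched as a standalone function (the struct and names here are made up,
not git's actual types):

```c
#include <string.h>

/* Shrink two in-memory files past their common tail, comparing
 * BLOCK-sized chunks from the end with memcmp, so that the line-based
 * diff which follows only hashes the part that can actually differ. */
#define BLOCK 1024

struct mem_buf {
	char *ptr;
	long size;
};

static void trim_common_tail(struct mem_buf *a, struct mem_buf *b)
{
	long l1 = a->size, l2 = b->size;

	while ((l1 -= BLOCK) > 0 && (l2 -= BLOCK) > 0) {
		if (memcmp(a->ptr + l1, b->ptr + l2, BLOCK))
			break;
		a->size = l1;
		b->size = l2;
	}
}
```

Note that this trims whole blocks with no regard for line boundaries, so
a careful implementation would extend the kept region forward to the next
newline to make sure the diff still sees complete lines.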
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 0:02 ` Linus Torvalds
@ 2007-12-12 0:22 ` Davide Libenzi
2007-12-12 0:50 ` Linus Torvalds
2007-12-12 0:56 ` Junio C Hamano
2007-12-12 19:43 ` Daniel Berlin
2 siblings, 1 reply; 51+ messages in thread
From: Davide Libenzi @ 2007-12-12 0:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Linus Torvalds wrote:
> On Tue, 11 Dec 2007, Linus Torvalds wrote:
> >
> > and while I suspect xdiff could be optimized a bit more for the cases
> > where we have no changes at the end, that's beyond my skills.
>
> Ok, I lied.
>
> Nothing is beyond my skills. My mad k0der skillz are unbeatable.
>
> This speeds up git-blame on ChangeLog-style files by a big amount, by just
> ignoring the common end that we don't care about, since we don't want any
> context anyway at that point. So I now get:
>
> [torvalds@woody gcc]$ time git blame gcc/ChangeLog > /dev/null
>
> real 0m7.031s
> user 0m6.852s
> sys 0m0.180s
>
> which seems quite reasonable, and is about three times faster than trying
> to diff those big files.
>
> Davide: this really _does_ make a huge difference. Maybe xdiff itself
> should do this optimization on its own, rather than have the caller hack
> around the fact that xdiff doesn't handle this common case all that well?
I didn't follow the thread, but I can guess from the subject that this is
about memory, isn't it?
Libxdiff already has a xdl_trim_ends() that strips all the common
beginning and ending records, but at that point files are already loaded.
Since libxdiff works with in-memory files in order to keep any sort of
system dependency out of the picture, the optimization would be useless
on the libxdiff side: the user would already have to have the file loaded
in memory in order to pass it to libxdiff.
If this is really about memory, this had better be kept on the libxdiff
caller side, so that it can avoid loading the terminal file sections
altogether.
About your code: you may want to add extend-till-next-eol logic after the
trimming part, since the last line may be used for context in the diffs.
- Davide
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 0:22 ` Davide Libenzi
@ 2007-12-12 0:50 ` Linus Torvalds
2007-12-12 1:12 ` Davide Libenzi
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2007-12-12 0:50 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Junio C Hamano, Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Davide Libenzi wrote:
>
> I didn't follow the thread, but I can guess from the subject that this is
> about memory, isn't it?
No, it started out that way, but now it's about performance.
> Libxdiff already has a xdl_trim_ends() that strips all the common
> beginning and ending records, but at that point files are already loaded.
That's not the problem. The problem with xdl_trim_ends() is that it
happens *after* you have done all the hashing, so as an optimization it's
fairly useless, because it still leaves the real cost (the per-line
hashing) on the table.
So doing the trimming of the ends before you do even that, allows you to
just do the trivial "let's see if the ends are identical" with a plain
memcmp, which is much faster.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 0:02 ` Linus Torvalds
2007-12-12 0:22 ` Davide Libenzi
@ 2007-12-12 0:56 ` Junio C Hamano
2007-12-12 2:20 ` Linus Torvalds
2007-12-12 19:43 ` Daniel Berlin
2 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2007-12-12 0:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Davide Libenzi, Daniel Berlin, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Tue, 11 Dec 2007, Linus Torvalds wrote:
>>
>> and while I suspect xdiff could be optimized a bit more for the cases
>> where we have no changes at the end, that's beyond my skills.
>
> Ok, I lied.
>
> Nothing is beyond my skills. My mad k0der skillz are unbeatable.
>
> This speeds up git-blame on ChangeLog-style files by a big amount, by just
> ignoring the common end that we don't care about, since we don't want any
> context anyway at that point. So I now get:
>
> [torvalds@woody gcc]$ time git blame gcc/ChangeLog > /dev/null
>
> real 0m7.031s
> user 0m6.852s
> sys 0m0.180s
>
> which seems quite reasonable, and is about three times faster than trying
> to diff those big files.
Funny. I did not understand what you meant by "no changes at the end"
when I read it ('cause I am at work and do not have the data you are
looking at handy), but now I see it. It is a cute hack that optimizes
for a very special case of "prepend only" files (aka "ChangeLog").
I suspect that this optimization has an interesting corner case, though.
What happens if you chop in the middle of the last line that differs
between the two files? xdiff will report the line number, but wouldn't
its (now artificial) "No newline at end of file" affect the blame logic?
Besides, "prepend only" (or "append only") files would be good
candidates for the original -S"pickaxe" search, I would imagine, and
unless you are looking at that ChangeLog-2000 consolidated log, isn't
blame way overkill?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 0:50 ` Linus Torvalds
@ 2007-12-12 1:12 ` Davide Libenzi
2007-12-12 2:10 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Davide Libenzi @ 2007-12-12 1:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Linus Torvalds wrote:
> > Libxdiff already has a xdl_trim_ends() that strips all the common
> > beginning and ending records, but at that point files are already loaded.
>
> That's not the problem. The problem with xdl_trim_ends() is that it
> happens *after* you have done all the hashing, so as an optimization it's
> fairly useless, because it still leaves the real cost (the per-line
> hashing) on the table.
Careful. The real cost of diffing is not the single linear pass of the
prepare phase; it's the potentially O(N*M) worst case of the cross-record
compare. So that optimization is far from useless: it is mainly targeted
at avoiding that worst case.
> So doing the trimming of the ends before you do even that, allows you to
> just do the trivial "let's see if the ends are identical" with a plain
> memcmp, which is much faster.
Yes, tail trimming done on a block basis is faster and does not consume
memory. The code for libxdiff would have to be a bit more complex though,
since memory files can be composed of many sections of different sizes
(so you cannot just assume it's a single block whose end you're trimming).
Also, you'd need some code at the end that hands you back at least the N
lines you want for context.
- Davide
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 1:12 ` Davide Libenzi
@ 2007-12-12 2:10 ` Linus Torvalds
2007-12-12 3:35 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2007-12-12 2:10 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Junio C Hamano, Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Davide Libenzi wrote:
>
> > That's not the problem. The problem with xdl_trim_ends() is that it
> > happens *after* you have done all the hashing, so as an optimization it's
> > fairly useless, because it still leaves the real cost (the per-line
> > hashing) on the table.
>
> Careful. The real cost of diffing, is not the O(1) pass of the prepare
> phase. It's the potentially O(N*M) worst case of the cross-record compare.
> So that optimization is far from useless. That optimization is indeed
> mainly targeted to avoid such worst case.
I'm not saying it's useless. I'm saying it's ineffective.
My simple patch that you saw sped up a real-life case by A FACTOR OF
THREE. We're not talking small potatoes here.
> Also, you'd need some code at the end that hands you back at least the N
> lines you want for context.
Sure. The special case I added it to specifically wanted a context of zero
in the caller, so I could just ignore that.
But doing this in general and handing back the context is a simple matter
of
	while (size < orig && context_lines) {
		if (src->buffer[size++] == '\n')
			context_lines--;
	}
which will usually hit in a really short time (ie three lines by default,
just a few tens of bytes).
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 0:56 ` Junio C Hamano
@ 2007-12-12 2:20 ` Linus Torvalds
2007-12-12 2:39 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2007-12-12 2:20 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Davide Libenzi, Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Junio C Hamano wrote:
>
> I suspect that this optimization has an interesting corner case, though.
> What happens if you chomp at the middle of the last line that is
> different between the two files? xdiff will report the line number but
> wouldn't its (now artificial) "No newline at the end of the file" affect
> the blame logic?
It shouldn't. I thought about it, but there doesn't seem to be any reason
why blame could possibly care - the message can come at the end of a
_real_ file, of course, so if the extra message confuses the blame logic,
there's already a bug there.
But no, I didn't create a test-case.
> Besides, "prepend only" (or "append only") files would be good
> candidates for the original -S"pickaxe" search, I would imagine, and
> unless you are looking at that ChangeLog-2000 consolidated log, isn't
> blame way overkill?
Actually, I suspect that this makes a difference for totally normal files
too. I bet it cuts the size of the files to be tested for the common case
(ie just a few small changes) down by 30-50% even on average. The fact
that it cuts it down by 99.9% on ChangeLog files is just an added bonus.
As Davide mentioned, xdiff actually does something like that hack for the
beginning and end of files internally _anyway_. The problem is that it
does it so late -- after it has already done a fairly expensive hash of
the file (and allocated space for it based on guesses that are in turn
based on the original size) -- that it doesn't actually get the full
effect of the optimization.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 2:20 ` Linus Torvalds
@ 2007-12-12 2:39 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-12 2:39 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Davide Libenzi, Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Linus Torvalds wrote:
>
> But no, I didn't create a test-case.
This *should* trigger the special case:
mkdir test-dir
cd test-dir
git init
(echo -n a ; yes '' | dd count=2) > file
git add file
git commit -m "'a' + 1k newlines"
(echo -n b ; yes '' | dd count=2) > file
git add file
git commit -m "'b' + 1k newlines"
and it all seems to work fine.
But I didn't actually check that it really triggered; this just creates
a 1025-byte file that has a single character and then 1024 newlines. So
when the logic removes the shared tail (all the newlines), it leaves a
single-character, newline-less buffer for diff, and no, git-blame didn't
care, and got the right answer.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 2:10 ` Linus Torvalds
@ 2007-12-12 3:35 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-12 3:35 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Junio C Hamano, Daniel Berlin, Git Mailing List
On Tue, 11 Dec 2007, Linus Torvalds wrote:
>
> I'm not saying it's useless. I'm saying it's ineffective.
Sorry, I _did_ call it "fairly useless".
The rest of the comment stands. I'm sure the trimming that xdiff does is
good at avoiding some common O(n*m) cases, it's just not as good as it
could be, and leaves a big constant factor of the O(n) case on the table.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:42 ` Linus Torvalds
` (2 preceding siblings ...)
2007-12-11 21:24 ` Daniel Berlin
@ 2007-12-12 3:57 ` Shawn O. Pearce
3 siblings, 0 replies; 51+ messages in thread
From: Shawn O. Pearce @ 2007-12-12 3:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Berlin, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
...
> is virtually useless because it's too expensive, but try doing
>
> git gui blame gcc ChangeLog
>
> instead, and doesn't that just seem nicer? (*)
>
> The difference is that the GUI one does it incrementally, and doesn't have
> to get _all_ the results before it can start reporting blame.
>
> Not that I claim that the gui blame is perfect either (I dunno why it
> delays the nice coloring so long ...
git-gui waits to color until after it gets the move/copy annotations
back from the -C -C -w second pass it does. This way the coloring
is based on the original source location, not on the move/copy that
caused it to be placed where it is now.
I played around with this for a while and finally made it work the
way it does as I assumed most users would want to see where something
originally came from more than how it got moved to where it is now.
IOW the (very expensive) -C -C -w pass is usually much more
interesting than the default (fast) pass, so that is the line
annotation data we color with. But it takes longer to get and
is run second, so yea, coloring takes a while.
--
Shawn.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 21:54 ` Junio C Hamano
2007-12-11 23:36 ` Linus Torvalds
@ 2007-12-12 4:48 ` Junio C Hamano
1 sibling, 0 replies; 51+ messages in thread
From: Junio C Hamano @ 2007-12-12 4:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Berlin, Git Mailing List
Junio C Hamano <gitster@pobox.com> writes:
> While this should be safe (because the user of blob lazily re-fetches),
> it feels a bit too aggressive, especially when -C or other "retry and
> try harder to assign blame elsewhere" option is used.
>
> Instead, how about discarding after we are done with each origin, like
> this?
It's been a while for me to look at the blame engine, and it hit me that
it would be interesting to run assign_blame() loop on multi-core machine
in parallel threads.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 19:50 ` Linus Torvalds
2007-12-11 21:14 ` Daniel Berlin
@ 2007-12-12 7:57 ` Jeff King
2007-12-17 23:24 ` Jan Hudec
1 sibling, 1 reply; 51+ messages in thread
From: Jeff King @ 2007-12-12 7:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Berlin, Git Mailing List
On Tue, Dec 11, 2007 at 11:50:08AM -0800, Linus Torvalds wrote:
> And, btw: the diff is totally different from the xdelta we have, so even
> if we have an already prepared nice xdelta between the two versions, we'll
> end up re-generating the files in full, and then do a diff on the end
> result.
>
> Of course, part of that is that git logically *never* works with deltas,
> except in the actual code-paths that generate objects (or generate packs,
> of course). So even if we had used a delta algorithm that would be
> amenable to be turned into a diff directly, it would have been a layering
> violation to actually do that.
That doesn't mean we can't opportunistically jump layers when available,
and fall back on the regular behavior otherwise. The nice thing about
clean and simple layers is that you can always add optimizations later
by poking sane holes.
Let's assume for the sake of argument that we can convert an xdelta into
a diff fairly cheaply. Using the patch below, we can count the places
where we are diffing two blobs, and one blob is a delta base of the
other (assuming our magical conversion function can also reverse diffs.
;) ).
For a "git log -p" on git.git, I get:
9951 diffs could be optimized
10958 diffs could not be optimized
or about 48%. It would be nice if we could drop the cost by almost 50%
(if our magical function is free to call, too!).
Of course, I haven't even looked at whether converting xdeltas to
unified diffs is possible. I suspect in some cases it is (e.g., pure
addition of text) and in some cases it isn't (I assume xdelta doesn't
have any context lines, which might hurt). And it's possible that a
specialized diff user like git-blame can just learn to use the xdeltas
by itself (I didn't get a "could optimize" count for git-blame since
it seems to follow a different codepath for its diffs).
---
diff --git a/cache.h b/cache.h
index 27d90fe..0d672be 100644
--- a/cache.h
+++ b/cache.h
@@ -569,6 +569,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
extern unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int have_xdelta(unsigned char from[20], unsigned char to[20]);
extern int matches_pack_name(struct packed_git *p, const char *name);
/* Dumb servers support */
diff --git a/diff.c b/diff.c
index f780e3e..5402900 100644
--- a/diff.c
+++ b/diff.c
@@ -1299,6 +1299,10 @@ static void builtin_diff(const char *name_a,
}
}
+	fprintf(stderr, "could optimize: %s\n",
+		(have_xdelta(one->sha1, two->sha1) ||
+		 have_xdelta(two->sha1, one->sha1)) ? "yes" : "no");
+
if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
die("unable to read files to diff");
diff --git a/sha1_file.c b/sha1_file.c
index b0c2435..f811ddc 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2422,3 +2422,20 @@ int read_pack_header(int fd, struct pack_header *header)
return PH_ERROR_PROTOCOL;
return 0;
}
+
+int have_xdelta(unsigned char from[20], unsigned char to[20])
+{
+	struct pack_entry e;
+	unsigned char base_sha1[20];
+	const char *type;
+	unsigned long size;
+	unsigned long store_size;
+	unsigned int delta_chain_length;
+
+	if (!find_pack_entry(to, &e, NULL))
+		return 0;
+
+	type = packed_object_info_detail(e.p, e.offset, &size, &store_size,
+					 &delta_chain_length, base_sha1);
+	return !hashcmp(base_sha1, from);
+}
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-11 17:33 git annotate runs out of memory Daniel Berlin
` (2 preceding siblings ...)
2007-12-11 18:40 ` Linus Torvalds
@ 2007-12-12 10:36 ` Florian Weimer
3 siblings, 0 replies; 51+ messages in thread
From: Florian Weimer @ 2007-12-12 10:36 UTC (permalink / raw)
To: Daniel Berlin; +Cc: git
* Daniel Berlin:
> On the gcc repository (which is now a 234 meg pack for me), git
> annotate ChangeLog takes > 800 meg of memory (I stopped it at about
> 1.6 gig, since it started swapping my machine).
> I assume it will run out of memory. I stopped it after 2 minutes.
A less unwieldy repository that shows the same problem is:
svn://svn.debian.org/secure-testing/
It's annotating the data/CVE/list file that uses tons of memory. I
guess you don't need to clone the full history to exhibit the problem.
--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 0:02 ` Linus Torvalds
2007-12-12 0:22 ` Davide Libenzi
2007-12-12 0:56 ` Junio C Hamano
@ 2007-12-12 19:43 ` Daniel Berlin
2 siblings, 0 replies; 51+ messages in thread
From: Daniel Berlin @ 2007-12-12 19:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Davide Libenzi, Git Mailing List
On 12/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> Daniel, this is obviously on top of the patches that fix the memory leak.
Thanks, these patches work *great*.
I'm starting to have a few users who have no experience with git or hg
try their daily workflow with it, to see what UI issues they come up
with :)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-12 7:57 ` Jeff King
@ 2007-12-17 23:24 ` Jan Hudec
2007-12-18 0:05 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Jan Hudec @ 2007-12-17 23:24 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Daniel Berlin, Git Mailing List
On Wed, Dec 12, 2007 at 02:57:25 -0500, Jeff King wrote:
> On Tue, Dec 11, 2007 at 11:50:08AM -0800, Linus Torvalds wrote:
> > And, btw: the diff is totally different from the xdelta we have, so even
> > if we have an already prepared nice xdelta between the two versions, we'll
> > end up re-generating the files in full, and then do a diff on the end
> > result.
The question is whether git ends up re-generating the same file multiple
times. When it needs to construct the diff between two versions of a file
and one is the delta base (even indirectly) of the other, does it know to
create the first, remember it, continue to the other, and calculate the diff?
> > Of course, part of that is that git logically *never* works with deltas,
> > except in the actual code-paths that generate objects (or generate packs,
> > of course). So even if we had used a delta algorithm that would be
> > amenable to be turned into a diff directly, it would have been a layering
> > violation to actually do that.
>
> That doesn't mean we can't opportunistically jump layers when available,
> and fall back on the regular behavior otherwise. The nice thing about
> clean and simple layers is that you can always add optimizations later
> by poking sane holes.
>
> Let's assume for the sake of argument that we can convert an xdelta into
> a diff fairly cheaply. Using the patch below, we can count the places
> where we are diffing two blobs, and one blob is a delta base of the
> other (assuming our magical conversion function can also reverse diffs.
> ;) ).
>
> For a "git log -p" on git.git, I get:
>
> 9951 diffs could be optimized
> 10958 diffs could not be optimized
>
> or about 48%. It would be nice if we could drop the cost by almost 50%
> (if our magical function is free to call, too!).
This is actually a gross underestimation. The idea would be to know all the
diffs we need to calculate and then remember all useful results. I.e., if we
know we'll want objects A and C, A's delta base is B, and B's delta base is
C, start calculating A, and when it turns out to need C at some point, just
remember it for the purpose of doing the final diff. On the other hand, B
can be thrown away early (because we don't need it) to save memory.
Now git can know the list of deltas it will need in advance. First generate
the list of revisions -- nothing helps there, but their delta bases are
likely to be randomish anyway -- and then, with the knowledge of the full
list of trees, start doing the diffs to see which touched the subtree in
question. Repeat for each level.
Since the list of deltas that will be needed is known, the objects from
which all deltas were already generated can be expired from the cache (but
not thrown away immediately, as they may help in building other objects).
> Of course, I haven't even looked at whether converting xdeltas to
> unified diffs is possible. I suspect in some cases it is (e.g., pure
> addition of text) and in some cases it isn't (I assume xdelta doesn't
> have any context lines, which might hurt). And it's possible that a
> specialized diff user like git-blame can just learn to use the xdeltas
> by itself (I didn't get a "could optimize" count for git-blame since
> it seems to follow a different codepath for its diffs).
Well, it's about as hard as applying them, because you can remember the
necessary stuff when applying. The important bit would be to avoid applying
the same delta more than once during the whole annotate.
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: git annotate runs out of memory
2007-12-17 23:24 ` Jan Hudec
@ 2007-12-18 0:05 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2007-12-18 0:05 UTC (permalink / raw)
To: Jan Hudec; +Cc: Jeff King, Daniel Berlin, Git Mailing List
On Tue, 18 Dec 2007, Jan Hudec wrote:
> On Tue, Dec 11, 2007 at 11:50:08AM -0800, Linus Torvalds wrote:
> > And, btw: the diff is totally different from the xdelta we have, so even
> > if we have an already prepared nice xdelta between the two versions, we'll
> > end up re-generating the files in full, and then do a diff on the end
> > result.
>
> The problem is whether git does not end-up re-generating the same file
> multiple times. When it needs to construct the diff between two versions of
> a file and one is delta-base (even indirect) of the other, does it know to
> create the first, remember it, continue to the other and calculate the diff?
Yes.
Actually, it doesn't "know" anything at all - what happens is that git
internally has a simple "delta-cache", which just caches the latest
objects we've generated from deltas, and which automatically handles this
common case (and others).
So when we tend to work with multiple versions of the same file (which is
obviously very common with diff, and even more so with something like
"annotate"), those multiple versions will obviously also tend to be deltas
against each other and/or against some shared base object, and when we see
a delta, we'll look the base object up in the delta cache, and if it has
been generated earlier we'll be able to short-circuit the whole delta
chain and just use the whole object we already cached.
So if you compare two objects that each have a very deep delta chain, you
will obviously have to walk the whole delta chain _once_ (to generate
whichever version of the file you happen to look up first), but you won't
need to do it twice, because the second time you'll end up hitting in the
delta cache.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2007-12-18 0:06 UTC | newest]
Thread overview: 51+ messages
2007-12-11 17:33 git annotate runs out of memory Daniel Berlin
2007-12-11 17:47 ` Nicolas Pitre
2007-12-11 17:53 ` Daniel Berlin
2007-12-11 18:01 ` Nicolas Pitre
2007-12-11 18:32 ` Marco Costalba
2007-12-11 19:03 ` Daniel Berlin
2007-12-11 19:14 ` Marco Costalba
2007-12-11 19:27 ` Jason Sewall
2007-12-11 19:46 ` Daniel Barkalow
2007-12-11 20:14 ` Marco Costalba
2007-12-11 18:40 ` Linus Torvalds
2007-12-11 19:01 ` Matthieu Moy
2007-12-11 19:22 ` Linus Torvalds
2007-12-11 19:24 ` Daniel Berlin
2007-12-11 19:42 ` Pierre Habouzit
2007-12-11 21:09 ` Daniel Berlin
2007-12-11 23:37 ` Matthieu Moy
2007-12-11 23:48 ` Linus Torvalds
2007-12-11 19:06 ` Nicolas Pitre
2007-12-11 20:31 ` Jon Smirl
2007-12-11 19:09 ` Daniel Berlin
2007-12-11 19:26 ` Daniel Barkalow
2007-12-11 19:34 ` Pierre Habouzit
2007-12-11 19:59 ` Junio C Hamano
2007-12-11 19:42 ` Linus Torvalds
2007-12-11 19:50 ` Linus Torvalds
2007-12-11 21:14 ` Daniel Berlin
2007-12-11 21:34 ` Linus Torvalds
2007-12-12 7:57 ` Jeff King
2007-12-17 23:24 ` Jan Hudec
2007-12-18 0:05 ` Linus Torvalds
2007-12-11 21:14 ` Linus Torvalds
2007-12-11 21:54 ` Junio C Hamano
2007-12-11 23:36 ` Linus Torvalds
2007-12-12 0:02 ` Linus Torvalds
2007-12-12 0:22 ` Davide Libenzi
2007-12-12 0:50 ` Linus Torvalds
2007-12-12 1:12 ` Davide Libenzi
2007-12-12 2:10 ` Linus Torvalds
2007-12-12 3:35 ` Linus Torvalds
2007-12-12 0:56 ` Junio C Hamano
2007-12-12 2:20 ` Linus Torvalds
2007-12-12 2:39 ` Linus Torvalds
2007-12-12 19:43 ` Daniel Berlin
2007-12-12 4:48 ` Junio C Hamano
2007-12-11 21:24 ` Daniel Berlin
2007-12-12 3:57 ` Shawn O. Pearce
2007-12-11 20:29 ` Marco Costalba
2007-12-11 19:29 ` Steven Grimm
2007-12-11 20:14 ` Jakub Narebski
2007-12-12 10:36 ` Florian Weimer