git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Shawn O. Pearce" <spearce@spearce.org>
To: Nicolas Pitre <nico@cam.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: cleaner/better zlib sources?
Date: Sat, 17 Mar 2007 01:19:21 -0400	[thread overview]
Message-ID: <20070317051921.GA5731@spearce.org> (raw)
In-Reply-To: <alpine.LFD.0.83.0703162257560.18328@xanadu.home>

Nicolas Pitre <nico@cam.org> wrote:
> On Fri, 16 Mar 2007, Linus Torvalds wrote:
> > I also didn't worry about it, because I felt that if it became a problem, 
> > it would be easy to just add a cache of base objects (we probably do *not* 
> > want to keep the whole unpacked object info in memory all the time just 
> > because of memory pressure issues, so "cache of base objects" is better). 
> > However, the "pack file + offset" thing makes it harder to do, since we 
> > now don't even have the SHA1 of the base object before we unpack it.
> > 
> > But I guess we could just index this by a <packfile, offset> tuple.
...
> Then it would only be a matter of coming up with a clever cache 
> eviction algorithm.

Yes.  Linus above seems to imply (at least to me) that we wouldn't
want to cache the original object requested by read_sha1_file(), as
its not the delta base.  But given our packing rules, we should be
(in general anyway) first asking for the most recent revision of
a file, which is stored whole, then for an older revision, which
will be a delta of the more recent revision we just saw.

Hence we probably would want to cache an object.  Well, at least
anything that had been packed as a delta.  Caching a deflated
OBJ_BLOB may not be worth it.
 
> > Anyway, I bet that this is a much bigger issue than the pack format 
> > itself (and is largely independent).
> 
> Well, I think the pack format issue is significant too.  But because 
> those are independent issues the gain in performance will be additive.

I'm torn there.

There's two places that we do lots of unpacks of objects where we
run into this difficult case of unpacking the same base object many
times: git-blame and a rev-list with a path limiter.

Now the git-blame case is obvious: we are constantly unpacking
various revisions of the same file, and these are probably delta'd
against each other, so the unpacking gets really brutal after a
while.  A blob cache here would probably *really* help out git-blame.

What's slightly less obvious about git-blame is we are probably also
traversing the different versions of the same trees over and over, as
we resolve the path to the correct blob in each commit we traverse.
So again here we are hitting lots of the same trees multiple times.

That last part about git-blame also obviously applies to the rev-list
with a path limiter.

But most other operations don't seem like they would benefit from a
base object cache; actually they might slow down from having such
a cache present!

Commits tend not to delta well; if they delta it is a very rare
occurrance.  So we aren't getting huge unpacking benefits there
by caching them.  Scratch any benefit of the cache for any sort of
rev-list operation that doesn't require tree access.

As for the other common operations (diff, read-tree, checkout-index,
merge-recursive): I don't think these will benefit from a cache
either.  Their data access patterns are pretty spread out over
the tree.  With the exception of rename detection we hit everything
only once.  After touching a path, we tend to not go back to it.
So unless we are really lucky and one blob acts as a base object
for many others at different paths (possible, but I suspect not
very likely) its not worth caching the base.

If we do hit something twice, its probably because we are doing two
distinct passes over the data.  In this case the passes are probably
because we either don't want to hold all of the data in memory (too
big of a set for some projects) or because we tried one algorithm,
failed, and are now trying a different one (internal read-tree
in merge-recursive).

Caching in merge-recursive may help, but just making the dirty
cache (index) that resulted from the internal read-tree available
for the remainder of the merge-recursive process might be faster;
especially if we only have one base and don't need to recursively
merge multiple bases.


So where does that leave us?  The only places I see a base object
cache really helping is in git-blame for blob access, repeated
tree access (git-blame and path limiting), and maybe we could do
better with the common cases in merge-recursive by being smarter
with the cache.

But with pack v4 I don't think I need a tree object cache.
With a 6 byte fixed record format, a strict ordering requirement,
a finite delta depth within a packfile, a stricter tree-specific
delta encoder, and a minor API change to tree-walk.h, I think we
can unpack the delta at the same time that we are walking the tree.
No upfront unpack required.  Hence no reason to cache.


So yea, a base object cache may help us today.  It will most
definately help in git-blame.  But I doubt it will help with trees
in pack v4, and I think it will just hurt in most cases.  So maybe
it should be local to git-blame only.

-- 
Shawn.

  reply	other threads:[~2007-03-17  5:19 UTC|newest]

Thread overview: 85+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-03-16  1:04 cleaner/better zlib sources? Linus Torvalds
2007-03-16  1:10 ` Shawn O. Pearce
2007-03-16  1:11 ` Jeff Garzik
2007-03-16  1:14   ` Matt Mackall
2007-03-16  1:46   ` Linus Torvalds
2007-03-16  1:54     ` Linus Torvalds
2007-03-16  2:43       ` Davide Libenzi
2007-03-16  2:56         ` Linus Torvalds
2007-03-16  3:16           ` Davide Libenzi
2007-03-16 16:21             ` Linus Torvalds
2007-03-16 16:24               ` Davide Libenzi
2007-03-16 16:35                 ` Linus Torvalds
2007-03-16 19:21                   ` Davide Libenzi
2007-03-17  0:01                     ` Linus Torvalds
2007-03-17  1:11                       ` Linus Torvalds
2007-03-17  3:28                         ` Nicolas Pitre
2007-03-17  5:19                           ` Shawn O. Pearce [this message]
2007-03-17 17:55                           ` Linus Torvalds
2007-03-17 19:40                             ` Linus Torvalds
2007-03-17 19:42                               ` [PATCH 1/2] Make trivial wrapper functions around delta base generation and freeing Linus Torvalds
2007-03-17 19:44                               ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
2007-03-17 21:45                                 ` Linus Torvalds
2007-03-17 22:37                                   ` Junio C Hamano
2007-03-17 23:09                                     ` Linus Torvalds
2007-03-17 23:54                                       ` Linus Torvalds
2007-03-18  1:13                                     ` Nicolas Pitre
2007-03-18  7:47                                       ` Junio C Hamano
2007-03-17 23:12                                   ` Junio C Hamano
2007-03-17 23:24                                     ` Linus Torvalds
2007-03-17 23:52                                       ` Jon Smirl
2007-03-18  1:14                                   ` Morten Welinder
2007-03-18  1:29                                     ` Linus Torvalds
2007-03-18  1:38                                       ` Nicolas Pitre
2007-03-18  1:55                                         ` Linus Torvalds
2007-03-18  2:03                                           ` Nicolas Pitre
2007-03-18  2:20                                             ` Linus Torvalds
2007-03-18  3:00                                               ` Nicolas Pitre
2007-03-18  3:31                                                 ` Linus Torvalds
2007-03-18  5:30                                                   ` Julian Phillips
2007-03-18 17:23                                                     ` Linus Torvalds
2007-03-18 10:53                                                   ` Robin Rosenberg
2007-03-18 17:34                                                     ` Linus Torvalds
2007-03-18 18:29                                                       ` Robin Rosenberg
2007-03-18 21:25                                                         ` Shawn O. Pearce
2007-03-19 13:16                                                         ` David Brodsky
2007-03-20  6:35                                                           ` Robin Rosenberg
2007-03-20  9:13                                                             ` David Brodsky
2007-03-21  2:37                                                               ` Linus Torvalds
2007-03-21  2:54                                                                 ` Nicolas Pitre
2007-03-18  3:06                                               ` [PATCH 3/2] Avoid unnecessary strlen() calls Linus Torvalds
2007-03-18  9:45                                                 ` Junio C Hamano
2007-03-18 15:54                                                   ` Linus Torvalds
2007-03-18 15:57                                                     ` Linus Torvalds
2007-03-18 21:38                                                       ` Shawn O. Pearce
2007-03-18 21:48                                                         ` Linus Torvalds
2007-03-20  3:05                                                     ` Johannes Schindelin
2007-03-20  3:29                                                       ` Shawn O. Pearce
2007-03-20  3:40                                                         ` Shawn O. Pearce
2007-03-20  4:11                                                           ` Linus Torvalds
2007-03-20  4:18                                                             ` Shawn O. Pearce
2007-03-20  4:45                                                               ` Linus Torvalds
2007-03-20  5:44                                                             ` Junio C Hamano
2007-03-20  3:16                                                     ` Junio C Hamano
2007-03-20  4:31                                                       ` Linus Torvalds
2007-03-20  4:39                                                         ` Shawn O. Pearce
2007-03-20  4:57                                                           ` Linus Torvalds
2007-03-18  1:44                                       ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
2007-03-18  6:28                                     ` Avi Kivity
2007-03-17 22:44                                 ` Linus Torvalds
2007-03-16 16:35               ` cleaner/better zlib sources? Jeff Garzik
2007-03-16 16:42                 ` Matt Mackall
2007-03-16 16:51                 ` Linus Torvalds
2007-03-16 17:12                 ` Nicolas Pitre
2007-03-16 23:22                 ` Shawn O. Pearce
2007-03-16 17:06               ` Nicolas Pitre
2007-03-16 17:51                 ` Linus Torvalds
2007-03-16 18:09                   ` Nicolas Pitre
2007-03-16  1:33 ` Davide Libenzi
2007-03-16  2:06   ` Davide Libenzi
  -- strict thread matches above, loose matches on Subject: below --
2007-03-16  6:08 linux
2007-03-16 11:34 ` Florian Weimer
2007-03-16 15:51 ` Josef Weidendorfer
2007-03-16 16:11 ` Linus Torvalds
2007-03-16 17:39   ` linux
2007-03-16 22:45   ` Josef Weidendorfer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070317051921.GA5731@spearce.org \
    --to=spearce@spearce.org \
    --cc=git@vger.kernel.org \
    --cc=nico@cam.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).