git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 6/7] diff: cache textconv output
Date: Fri, 2 Apr 2010 03:38:32 -0400	[thread overview]
Message-ID: <20100402073832.GB2111@sigill.intra.peff.net> (raw)
In-Reply-To: <7vpr2il3mt.fsf@alter.siamese.dyndns.org>

On Fri, Apr 02, 2010 at 12:23:06AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Running a textconv filter can take a long time. It's
> > particularly bad for a large file which needs to be spooled
> > to disk, but even for small files, the fork+exec overhead
> > can add up for something like "git log -p".
> 
> Another reason that "log -p" gets benefit from caching is that you would
> typically end up running textconv twice on the same blob, once when you
> compare $commit:$path with $commit~1:$path, and again when you compare
> $commit~$n-1:$path with $commit~$n:$path (assuming that the $path didn't
> change between $commit~$n-1 and $commit~1).

Yep. I pointed out in one of my timing tests a slight slowdown in "git
show" when generating the cache. But for revision walking, it should
actually be faster, since you see each blob twice but cache after the
first time.

> It _might_ give you even better performance characteristics if you noice
> that you are walking history running many textconv, and cache the textconv
> result from the "older" (i.e. "one" side) tree in-core, until it is used
> in a "newer" (i.e. "two" side) tree, at which time you would evict it.

I doubt it is worth the effort. We are already caching the sha1 in-core
due to the notes mechanism. So we could really only save one object
retrieval. Which is already what a non-textconv diff will need to do, so
we should have performance on par with regular diffs at this point.

In fact, your optimization could be applied to all diff revision
walking, not just textconv, and you can halve the number of object
retrievals. The problem is that you may have blobs sitting in the
in-core cache as you walk many revisions, waiting for them to be changed
again. Depending on the locality of changes and the size of your
project, you won't be able to fit it all comfortably in memory, and will
end up discarding entries.

And all of that to save a few object retrievals, which are something
that git does very quickly already. Not to mention the ugly code that
would be involved.

-Peff

  reply	other threads:[~2010-04-02  7:38 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
2010-04-02  0:03 ` [PATCH 1/7] fix const-correctness of write_sha1_file Jeff King
2010-04-02  0:04 ` [PATCH 2/7] fix textconv leak in emit_rewrite_diff Jeff King
2010-04-02  0:05 ` [PATCH 3/7] make commit_tree a library function Jeff King
2010-04-02  0:07 ` [PATCH 4/7] introduce notes-cache interface Jeff King
2010-04-02  0:09 ` [PATCH 5/7] textconv: refactor calls to run_textconv Jeff King
2010-04-02  0:12 ` [PATCH 6/7] diff: cache textconv output Jeff King
2010-04-02  7:23   ` Junio C Hamano
2010-04-02  7:38     ` Jeff King [this message]
2010-04-02  0:14 ` [PATCH 7/7] diff: avoid useless filespec population Jeff King
2010-04-02  7:12   ` Junio C Hamano
2010-04-02  7:24     ` Jeff King
2010-04-02  6:14 ` [PATCH 0/7] textconv caching Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100402073832.GB2111@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).