git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: larsxschneider@gmail.com, git@vger.kernel.org
Subject: Re: [PATCH] diff: do not reuse worktree files that need "clean" conversion
Date: Fri, 22 Jul 2016 14:10:24 -0400	[thread overview]
Message-ID: <20160722181024.GA16595@sigill.intra.peff.net> (raw)
In-Reply-To: <xmqq60rxbmaf.fsf@gitster.mtv.corp.google.com>

On Fri, Jul 22, 2016 at 10:44:08AM -0700, Junio C Hamano wrote:

> > diff --git a/diff.c b/diff.c
> > index 7d03419..b43d3dd 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2683,6 +2683,13 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
> >  	if (!FAST_WORKING_DIRECTORY && !want_file && has_sha1_pack(sha1))
> >  		return 0;
> >  
> > +	/*
> > +	 * Similarly, if we'd have to convert the file contents anyway, that
> > +	 * makes the optimization not worthwhile.
> > +	 */
> > +	if (!want_file && would_convert_to_git(name))
> > +		return 0;
> 
> The would_convert_to_git() function is not a free operation.  It
> needs to prime the attribute stack, so it needs to open/read/parse a
> few files ($GIT_DIR/info/attributes and .gitattributes at least, and
> more if your directory hierarchy is deep) on the filesystem.  The
> cost is amortized across paths, but we do not even enable the
> optimization if we have to pay the cost of reading the index
> ourselves.

Yeah, I almost commented on that, and its position in the function, but
forgot to.

The only code path which will trigger this is diff_populate_filespec.
After reuse_worktree_file() says "yes, we can reuse", we drop into a
conditional that will end in us calling convert_to_git() anyway, which
will do the same lookup. We don't need to cache, because the expensive
parts of the attribute-lookup are already cached for us by the attribute
code.

So my initial thought was to put it at the end of reuse_worktree_file(),
where it would have the least impact. If we find we cannot reuse the
file, then we would skip both this new attr lookup and the one in
diff_populate_filespec().

But in practice, I think we'll already have cached those attrs before we
even hit this function, because we'll hit the userdiff_find_by_path()
code earlier in the diff process (e.g., to see if it's binary, if we
need to textconv, etc). Those look for different attributes, but I think
the expensive bits (finding, opening, reading attribute files) are
cached across all lookups.

So I think we actually _can_ think of would_convert_to_git() as
basically free. Or as free as other efficient-lookup functions we call
like cache_name_pos(). And so I moved it further up in the function,
where it lets us avoid doing more out-of-process work (like calling
lstat() so we can ce_match_stat() on the result).

Possibly it should go after the cache_name_pos() call, though. That's
likely to be less expensive than the actual walk of the attr tree.

> I suspect that we may be better off disabling this optimization if
> we need to always call the helper.

The thought "does this tree reuse even speed things up enough to justify
its complexity" definitely crossed my mind. It's basically swapping
open/mmap for zlib inflating the content.

But I do think it helps in the "want_file" case (i.e., when we are
writing out a tempfile for an external command via prepare_temp_file()).
There it helps us omit writing a tempfile to disk entirely, including
any possible conversion.

-Peff

  reply	other threads:[~2016-07-22 18:10 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-21 20:59 [PATCH v1] convert: add test to demonstrate clean invocations larsxschneider
2016-07-21 21:37 ` Jeff King
2016-07-22 15:27   ` [PATCH] diff: do not reuse worktree files that need "clean" conversion Jeff King
2016-07-22 17:44     ` Junio C Hamano
2016-07-22 18:10       ` Jeff King [this message]
2016-07-22 19:30         ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160722181024.GA16595@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=larsxschneider@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).