From: David Turner <dturner@twopensource.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH v2] unpack-trees: fix accidentally quadratic behavior
Date: Thu, 21 Jan 2016 18:24:20 -0500 [thread overview]
Message-ID: <1453418660.16226.53.camel@twopensource.com> (raw)
In-Reply-To: <20160121213056.GA6664@sigill.intra.peff.net>
On Thu, 2016-01-21 at 16:30 -0500, Jeff King wrote:
> On Thu, Jan 21, 2016 at 04:11:48PM -0500, David Turner wrote:
>
> > While unpacking trees (e.g. during git checkout), when we hit a
> > cache
> > entry that's past and outside our path, we cut off iteration.
> >
> > This provides about a 45% speedup on git checkout between master
> > and
> > master^20000 on Twitter's monorepo. Speedup in general will depend
> > on
> > repostitory structure, number of changes, and packfile packing
> > decisions.
>
> I feel like I'm missing the explanation of the quadratic part. From
> looking at the patch, my guess is:
>
> 1. We're doing a linear walk in a data structure (a "struct
> index_state").
>
> 2. For each element, we look it up in another structure
> ("struct traverse_info") with a linear search.
>
> That leaves us at O(m*n), but if we assume both are on the same
> order of magnitude, that's quadratic.
No, I think, it's the opposite order: we're doing a linear walk over
the incoming tree and for each entry, we're calling find_cache_pos.
find_cache_pos was doing a linear walk over struct index_state. But
the same algorithmic complexity holds.
> 3. The fix works by knowing that once a lookup in (2) fails once,
> it's
> likely to fail for all the remainder, and we short-cut that case
> and skip out of (1) completely.
>
> But that makes me wonder. Aren't we still quadratic in the case that
> ce_in_traverse_path() returns true?
I think that doesn't happen very often, because it requires that the
paths match up.
> If so, would we benefit from either:
>
> a. Improving the complexity of ce_in_traverse_path, to say O(log
> n),
> which would give us O(n log n) for the whole operation in all
> cases?
>
> b. If both lists are already sorted, maybe doing a list-merge to
> compare them in O(2n) time?
(b) appears to be now (roughly) what we're now doing.
> I'm fairly ignorant of this part of the code, so there's probably a
> good
> reason why my suggestion is unworkable.
I am also quite ignorant of this part of the code; I just looked at
perf and did some simple counting.
prev parent reply other threads:[~2016-01-21 23:24 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-01-21 21:11 [PATCH v2] unpack-trees: fix accidentally quadratic behavior David Turner
2016-01-21 21:30 ` Jeff King
2016-01-21 23:24 ` David Turner [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1453418660.16226.53.camel@twopensource.com \
--to=dturner@twopensource.com \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).