git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] unpack-trees: fix accidentally quadratic behavior
@ 2016-01-21  4:05 David Turner
  2016-01-21  4:58 ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: David Turner @ 2016-01-21  4:05 UTC (permalink / raw)
  To: git; +Cc: David Turner

While unpacking trees (e.g. during git checkout), when we hit a cache
entry that's past and outside our path, we cut off iteration.

This provides about a 45% speedup on git checkout between master and
master^20000 on Twitter's monorepo.  Speedup in general will depend on
repostitory structure, number of changes, and packfile packing
decisions.

Signed-off-by: David Turner <dturner@twopensource.com>
---
 unpack-trees.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 5f541c2..b18a611 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -695,8 +695,25 @@ static int find_cache_pos(struct traverse_info *info,
 				++o->cache_bottom;
 			continue;
 		}
-		if (!ce_in_traverse_path(ce, info))
+		if (!ce_in_traverse_path(ce, info)) {
+			/*
+			 * Check if we can skip future cache checks
+			 * (because we're already past all possible
+			 * entries in the traverse path).
+			 */
+			if (info->prev && info->traverse_path) {
+				int prefix_cmp = strncmp(ce->name, info->traverse_path, info->pathlen);
+				if (prefix_cmp > 0)
+					break;
+				else if (prefix_cmp == 0 &&
+					 ce_namelen(ce) >= info->pathlen &&
+					 strcmp(ce->name + info->pathlen,
+						 info->name.path) > 0) {
+					break;
+				}
+			}
 			continue;
+		}
 		ce_name = ce->name + pfxlen;
 		ce_slash = strchr(ce_name, '/');
 		if (ce_slash)
-- 
2.4.2.749.g730654d-twtrsrc

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] unpack-trees: fix accidentally quadratic behavior
  2016-01-21  4:05 [PATCH] unpack-trees: fix accidentally quadratic behavior David Turner
@ 2016-01-21  4:58 ` Junio C Hamano
  2016-01-21 19:09   ` David Turner
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2016-01-21  4:58 UTC (permalink / raw)
  To: David Turner; +Cc: git

David Turner <dturner@twopensource.com> writes:

> While unpacking trees (e.g. during git checkout), when we hit a cache
> entry that's past and outside our path, we cut off iteration.
>
> This provides about a 45% speedup on git checkout between master and
> master^20000 on Twitter's monorepo.  Speedup in general will depend on
> repostitory structure, number of changes, and packfile packing
> decisions.
>
> Signed-off-by: David Turner <dturner@twopensource.com>
> ---

I haven't thought things through, but does this get fooled by the
somewhat strange ordering rules of tree entries (i.e. a subtree
sorts as if its name is suffixed with a '/' in a tree object)?

Other than that, I like this.  "We know the list is sorted, and
after seeing this entry we know there is nothing that will match" is
an obvious optimization that we already use elsewhere.

Thanks.

>  unpack-trees.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 5f541c2..b18a611 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -695,8 +695,25 @@ static int find_cache_pos(struct traverse_info *info,
>  				++o->cache_bottom;
>  			continue;
>  		}
> -		if (!ce_in_traverse_path(ce, info))
> +		if (!ce_in_traverse_path(ce, info)) {
> +			/*
> +			 * Check if we can skip future cache checks
> +			 * (because we're already past all possible
> +			 * entries in the traverse path).
> +			 */
> +			if (info->prev && info->traverse_path) {
> +				int prefix_cmp = strncmp(ce->name, info->traverse_path, info->pathlen);
> +				if (prefix_cmp > 0)
> +					break;
> +				else if (prefix_cmp == 0 &&
> +					 ce_namelen(ce) >= info->pathlen &&
> +					 strcmp(ce->name + info->pathlen,
> +						 info->name.path) > 0) {
> +					break;
> +				}
> +			}
>  			continue;
> +		}
>  		ce_name = ce->name + pfxlen;
>  		ce_slash = strchr(ce_name, '/');
>  		if (ce_slash)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] unpack-trees: fix accidentally quadratic behavior
  2016-01-21  4:58 ` Junio C Hamano
@ 2016-01-21 19:09   ` David Turner
  2016-01-21 19:51     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: David Turner @ 2016-01-21 19:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git mailing list

On Wed, 2016-01-20 at 20:58 -0800, Junio C Hamano wrote:
> David Turner <dturner@twopensource.com> writes:
> 
> > While unpacking trees (e.g. during git checkout), when we hit a
> > cache
> > entry that's past and outside our path, we cut off iteration.
> > 
> > This provides about a 45% speedup on git checkout between master
> > and
> > master^20000 on Twitter's monorepo.  Speedup in general will depend
> > on
> > repostitory structure, number of changes, and packfile packing
> > decisions.
> > 
> > Signed-off-by: David Turner <dturner@twopensource.com>
> > ---
> 
> I haven't thought things through, but does this get fooled by the
> somewhat strange ordering rules of tree entries (i.e. a subtree
> sorts as if its name is suffixed with a '/' in a tree object)?
> 
> Other than that, I like this.  "We know the list is sorted, and
> after seeing this entry we know there is nothing that will match" is
> an obvious optimization that we already use elsewhere.
> 
> Thanks.

I think this is correct, because we first do the more complicated check
(ce_in_traverse_path), and only check the ordering once that has
failed.  The tests all pass, so this should be good.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] unpack-trees: fix accidentally quadratic behavior
  2016-01-21 19:09   ` David Turner
@ 2016-01-21 19:51     ` Junio C Hamano
  2016-01-21 20:59       ` David Turner
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2016-01-21 19:51 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

David Turner <dturner@twopensource.com> writes:

> On Wed, 2016-01-20 at 20:58 -0800, Junio C Hamano wrote:
>> David Turner <dturner@twopensource.com> writes:
>> 
>> > While unpacking trees (e.g. during git checkout), when we hit a
>> > cache
>> > entry that's past and outside our path, we cut off iteration.
>> > 
>> > This provides about a 45% speedup on git checkout between master
>> > and
>> > master^20000 on Twitter's monorepo.  Speedup in general will depend
>> > on
>> > repostitory structure, number of changes, and packfile packing
>> > decisions.
>> > 
>> > Signed-off-by: David Turner <dturner@twopensource.com>
>> > ---
>> 
>> I haven't thought things through, but does this get fooled by the
>> somewhat strange ordering rules of tree entries (i.e. a subtree
>> sorts as if its name is suffixed with a '/' in a tree object)?
>> 
>> Other than that, I like this.  "We know the list is sorted, and
>> after seeing this entry we know there is nothing that will match" is
>> an obvious optimization that we already use elsewhere.
>> 
>> Thanks.
>
> I think this is correct, because we first do the more complicated check
> (ce_in_traverse_path), and only check the ordering once that has
> failed.  

But the patch does this:

> +			if (info->prev && info->traverse_path) {
> +				int prefix_cmp = strncmp(ce->name, info->traverse_path, info->pathlen);
> +				if (prefix_cmp > 0)
> +					break;
> +				else if (prefix_cmp == 0 &&
> +					 ce_namelen(ce) >= info->pathlen &&
> +					 strcmp(ce->name + info->pathlen,
> +						 info->name.path) > 0) {
> +					break;
> +				}
> +			}
>  			continue;

The first break is correct, but I am not sure about the "else if"
part.  Shouldn't it be doing something similar to the logic to "keep
looking" that talks about "t-i", "t" and "t/a" at the end of the
loop?

> The tests all pass, so this should be good.

Please don't ever say that again.  The existing tests do not cover
all corner cases, and they fundamentally cannot cover the corner
cases future changes may introduce.

Saying "Even test X breaks, so it is clearly broken.  Why did you
send it out without testing?" is OK, though.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] unpack-trees: fix accidentally quadratic behavior
  2016-01-21 19:51     ` Junio C Hamano
@ 2016-01-21 20:59       ` David Turner
  2016-01-21 21:06         ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: David Turner @ 2016-01-21 20:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git mailing list

On Thu, 2016-01-21 at 11:51 -0800, Junio C Hamano wrote:
> David Turner <dturner@twopensource.com> writes:
> 
> > On Wed, 2016-01-20 at 20:58 -0800, Junio C Hamano wrote:
> > > David Turner <dturner@twopensource.com> writes:
> > > 
> > > > While unpacking trees (e.g. during git checkout), when we hit a
> > > > cache
> > > > entry that's past and outside our path, we cut off iteration.
> > > > 
> > > > This provides about a 45% speedup on git checkout between
> > > > master
> > > > and
> > > > master^20000 on Twitter's monorepo.  Speedup in general will
> > > > depend
> > > > on
> > > > repostitory structure, number of changes, and packfile packing
> > > > decisions.
> > > > 
> > > > Signed-off-by: David Turner <dturner@twopensource.com>
> > > > ---
> > > 
> > > I haven't thought things through, but does this get fooled by the
> > > somewhat strange ordering rules of tree entries (i.e. a subtree
> > > sorts as if its name is suffixed with a '/' in a tree object)?
> > > 
> > > Other than that, I like this.  "We know the list is sorted, and
> > > after seeing this entry we know there is nothing that will match"
> > > is
> > > an obvious optimization that we already use elsewhere.
> > > 
> > > Thanks.
> > 
> > I think this is correct, because we first do the more complicated
> > check
> > (ce_in_traverse_path), and only check the ordering once that has
> > failed.  
> 
> But the patch does this:
> 
> > +			if (info->prev && info->traverse_path) {
> > +				int prefix_cmp = strncmp(ce->name,
> > info->traverse_path, info->pathlen);
> > +				if (prefix_cmp > 0)
> > +					break;
> > +				else if (prefix_cmp == 0 &&
> > +					 ce_namelen(ce) >= info
> > ->pathlen &&
> > +					 strcmp(ce->name + info
> > ->pathlen,
> > +						 info->name.path)
> > > 0) {
> > +					break;
> > +				}
> > +			}
> >  			continue;
> 
> The first break is correct, but I am not sure about the "else if"
> part.  Shouldn't it be doing something similar to the logic to "keep
> looking" that talks about "t-i", "t" and "t/a" at the end of the
> loop?

Rather than doing more complicated logic, let's just do the first
check; it seems about as fast for our repo, and I think will usually be
so.  does that seem reasonable to you?

> > The tests all pass, so this should be good.
> 
> Please don't ever say that again.  

OK.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] unpack-trees: fix accidentally quadratic behavior
  2016-01-21 20:59       ` David Turner
@ 2016-01-21 21:06         ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2016-01-21 21:06 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

David Turner <dturner@twopensource.com> writes:

>> The first break is correct, but I am not sure about the "else if"
>> part.  Shouldn't it be doing something similar to the logic to "keep
>> looking" that talks about "t-i", "t" and "t/a" at the end of the
>> loop?
>
> Rather than doing more complicated logic, let's just do the first
> check; it seems about as fast for our repo, and I think will usually be
> so.  does that seem reasonable to you?

Very reasonable.  I was mostly afraid of leaving the loop
prematurely and giving a wrong answer to the caller.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-01-21 21:06 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-21  4:05 [PATCH] unpack-trees: fix accidentally quadratic behavior David Turner
2016-01-21  4:58 ` Junio C Hamano
2016-01-21 19:09   ` David Turner
2016-01-21 19:51     ` Junio C Hamano
2016-01-21 20:59       ` David Turner
2016-01-21 21:06         ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).