[PATCH] parse_object: try internal cache before reading object db

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] parse_object: try internal cache before reading object db
@ 2012-01-05 21:00 Jeff King
  2012-01-05 21:35 ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2012-01-05 21:00 UTC (permalink / raw)
  To: git; +Cc: git-dev

When parse_object is called, we do the following:

  1. read the object data into a buffer via read_sha1_file

  2. call parse_object_buffer, which then:

     a. calls the appropriate lookup_{commit,tree,blob,tag}
	to either create a new "struct object", or to find
	an existing one. We know the appropriate type from
	the lookup in step 1.

     b. calls the appropriate parse_{commit,tree,blob,tag}
        to parse the buffer for the new (or existing) object

In step 2b, all of the called functions are no-ops for
object "X" if "X->object.parsed" is set. I.e., when we have
already parsed an object, we end up going to a lot of work
just to find out at a low level that there is nothing left
for us to do (and we throw away the data from read_sha1_file
unread).

We can optimize this by moving the check for "do we have an
in-memory object" from 2a before the expensive call to
read_sha1_file in step 1.

This might seem circular, since step 2a uses the type
information determined in step 1 to call the appropriate
lookup function. However, we can notice that all of the
lookup_* functions are backed by lookup_object. In other
words, all of the objects are kept in a master hash table,
and we don't actually need the type to do the "do we have
it" part of the lookup, only to do the "and create it if it
doesn't exist" part.

This can save time whenever we call parse_object on the same
sha1 twice in a single program. Some code paths already
perform this optimization manually, with either:

  if (!obj->parsed)
	  obj = parse_object(obj->sha1);

if you already have a "struct object", or:

  struct object *obj = lookup_unknown_object(sha1);
  if (!obj || !obj->parsed)
	  obj = parse_object(sha1);

if you don't.  This patch moves the optimization into
parse_object itself.

Most git operations won't notice any impact. Either they
don't parse a lot of duplicate sha1s, or the calling code
takes special care not to re-parse objects. I timed two
code paths that do benefit (there may be more, but these two
were immediately obvious and easy to time).

The first is fast-export, which calls parse_object on each
object it outputs, like this:

  object = parse_object(sha1);
  if (!object)
	  die(...);
  if (object->flags & SHOWN)
	  return;

which means that just to realize we have already shown an
object, we will read the whole object from disk!

With this patch, my best-of-five time for "fast-export --all" on
git.git dropped from 26.3s to 21.3s.

The second case is upload-pack, which will call parse_object
for each advertised ref (because it needs to peel tags to
show "^{}" entries). This doesn't matter for most
repositories, because they don't have a lot of refs pointing
to the same objects. However, if you have a big alternates
repository with a shared object db for a number of child
repositories, then the alternates repository will have
duplicated refs representing each of its children.

For example, GitHub's alternates repository for git.git has
~120,000 refs, of which only ~3200 are unique. The time for
upload-pack to print its list of advertised refs dropped
from 3.4s to 0.76s.

Signed-off-by: Jeff King <peff@peff.net>
---
 object.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/object.c b/object.c
index d8d09f9..6b06297 100644
--- a/object.c
+++ b/object.c
@@ -191,10 +191,15 @@ struct object *parse_object(const unsigned char *sha1)
 	enum object_type type;
 	int eaten;
 	const unsigned char *repl = lookup_replace_object(sha1);
-	void *buffer = read_sha1_file(sha1, &type, &size);
+	void *buffer;
+	struct object *obj;
+
+	obj = lookup_object(sha1);
+	if (obj && obj->parsed)
+		return obj;

+	buffer = read_sha1_file(sha1, &type, &size);
 	if (buffer) {
-		struct object *obj;
 		if (check_sha1_signature(repl, buffer, size, typename(type)) < 0) {
 			free(buffer);
 			error("sha1 mismatch %s\n", sha1_to_hex(repl));
-- 
1.7.6.5.6.ge6248

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-05 21:00 [PATCH] parse_object: try internal cache before reading object db Jeff King
@ 2012-01-05 21:35 ` Junio C Hamano
  2012-01-05 21:49   ` Jeff King
                     ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Junio C Hamano @ 2012-01-05 21:35 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git-dev

Jeff King <peff@peff.net> writes:

> This might seem circular, since step 2a uses the type
> information determined in step 1 to call the appropriate
> lookup function. However, we can notice that all of the
> lookup_* functions are backed by lookup_object. In other
> words, all of the objects are kept in a master hash table,
> and we don't actually need the type to do the "do we have
> it" part of the lookup,...

The only case that might matter is where you read one object, you have
written another object of a different type but that happens to hash to the
same SHA-1 value. The other existing optimizations do not take that into
account, so I do not think there is any new issue here.

> For example, GitHub's alternates repository for git.git has
> ~120,000 refs, of which only ~3200 are unique. The time for
> upload-pack to print its list of advertised refs dropped
> from 3.4s to 0.76s.

Nice. I am more impressed by 120k/3.4 than 3.2k/0.76, though ;-)

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-05 21:35 ` Junio C Hamano
@ 2012-01-05 21:49   ` Jeff King
  2012-01-05 21:55     ` Junio C Hamano
  2012-01-06 19:16   ` Jeff King
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2012-01-05 21:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git-dev

On Thu, Jan 05, 2012 at 01:35:50PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > This might seem circular, since step 2a uses the type
> > information determined in step 1 to call the appropriate
> > lookup function. However, we can notice that all of the
> > lookup_* functions are backed by lookup_object. In other
> > words, all of the objects are kept in a master hash table,
> > and we don't actually need the type to do the "do we have
> > it" part of the lookup,...
> 
> The only case that might matter is where you read one object, you have
> written another object of a different type but that happens to hash to the
> same SHA-1 value. The other existing optimizations do not take that into
> account, so I do not think there is any new issue here.

Yeah, I tried to think of issues like that. Even if you protected
against that, you'd still have the issue of reading one object, then
writing another of the _same_ type but with different content. We
wouldn't notice with the current code path (we'd just recreationally
read the data from disk and then throw it away).

The worst potential problem I could come up with is if you somehow had
an object whose "parsed" flag was set, but somehow didn't have its other
fields set (like type). But I think you'd have to be abusing the lookup
functions pretty hard to get into such a state (how would you be parsing
if you didn't know the type?). The parsed flag only gets set by the
type-specific lookup functions.

So I think it is safe short of somebody doing some horrible manual
munging of a "struct object".

> > For example, GitHub's alternates repository for git.git has
> > ~120,000 refs, of which only ~3200 are unique. The time for
> > upload-pack to print its list of advertised refs dropped
> > from 3.4s to 0.76s.
> 
> Nice. I am more impressed by 120k/3.4 than 3.2k/0.76, though ;-)

You can thank optimized zlib for that. We spent 60% of our time there.
:)

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-05 21:49   ` Jeff King
@ 2012-01-05 21:55     ` Junio C Hamano
  2012-01-05 22:18       ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2012-01-05 21:55 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git-dev

Jeff King <peff@peff.net> writes:

> The worst potential problem I could come up with is if you somehow had
> an object whose "parsed" flag was set, but somehow didn't have its other
> fields set (like type).
> ...
> So I think it is safe short of somebody doing some horrible manual
> munging of a "struct object".

Yeah, I was worried about codepaths like commit-pretty-printing might be
mucking with the contents of commit->buffer, perhaps reencoding the text
and then calling parse_object() to get the unmodified original back, or
something silly like that. But the lookup_object() call at the beginning
of the parse_object() already prevents us from doing such a thing, so we
should be OK, I would think.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-05 21:55     ` Junio C Hamano
@ 2012-01-05 22:18       ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2012-01-05 22:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git-dev

On Thu, Jan 05, 2012 at 01:55:22PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The worst potential problem I could come up with is if you somehow had
> > an object whose "parsed" flag was set, but somehow didn't have its other
> > fields set (like type).
> > ...
> > So I think it is safe short of somebody doing some horrible manual
> > munging of a "struct object".
> 
> Yeah, I was worried about codepaths like commit-pretty-printing might be
> mucking with the contents of commit->buffer, perhaps reencoding the text
> and then calling parse_object() to get the unmodified original back, or
> something silly like that. But the lookup_object() call at the beginning
> of the parse_object() already prevents us from doing such a thing, so we
> should be OK, I would think.

Er, without my patch there is no such lookup_object, is there?

What saves you is that the parse_*_buffer functions all do nothing when
the object.parsed flag is set, and the code I added makes sure that
object.parsed is set in the object that lookup_object returns.

So yeah, anytime you tweak the contents of commit->buffer but don't
unset the "parsed" flag, you are asking for trouble.

Here's another possible code path where the behavior is changed:

  1. You set the global save_commit_buffer to 0.

  2. You call parse_commit(commit) on an unparsed commit object, which
     does not save the commit buffer, but does set
     commit->object.parsed.

  3. You call parse_object(commit->object.sha1).

     a. Without my patch, we read the file contents again, do _not_
        re-parse them (because we look up the existing object and notice
        that its "parsed" flag is set), but we _do_ assign the buffer to
        commit->buffer.

     b. With my patch, we see that there is an existing object that is
        already parsed, and return early. commit->buffer remains NULL.

I would argue that this doesn't matter, since "parse_commit" uses the
exact same optimization (it returns early without setting commit->buffer
if the parsed flag is set). So any program turning off
save_commit_buffer has to be ready to deal with a NULL commit->buffer in
the first place. The only exception would be a program that then tries
to fill in the commit->buffer field by manually running parse_object on
an already-parsed, buffer-less commit object. I don't think we do that.

You can verify that commit->buffer is the only place where these issues
can happen by following the logic in parse_object_buffer.

Sorry to belabor the discussion, but this is such a core piece of code,
I want to make sure the optimization isn't hurting anybody (I don't
think it is, and certainly the tests are all happy, but I think talking
through the cases is a good thing).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-05 21:35 ` Junio C Hamano
  2012-01-05 21:49   ` Jeff King
@ 2012-01-06 19:16   ` Jeff King
  2012-01-06 21:27     ` Junio C Hamano
  2012-01-06 19:17   ` [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement Jeff King
  2012-01-06 19:18   ` [PATCH 2/2] upload-pack: avoid parsing tag destinations Jeff King
  3 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2012-01-06 19:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git-dev

On Thu, Jan 05, 2012 at 01:35:50PM -0800, Junio C Hamano wrote:

> > For example, GitHub's alternates repository for git.git has
> > ~120,000 refs, of which only ~3200 are unique. The time for
> > upload-pack to print its list of advertised refs dropped
> > from 3.4s to 0.76s.
> 
> Nice. I am more impressed by 120k/3.4 than 3.2k/0.76, though ;-)

Actually, we can do much better than that. Here are a few patches that
avoid parsing objects when possible. They drop the 3.4s to 2.0s. If you
combine them with the parse_object optimization, my 120K case drops to
around 0.68s.

I don't know if it is really that worth it on top of the parse_object
optimization. It's almost negligible for the normal case (though I get a
tiny speedup on my ~900-ref git.git repo), and a minor speedup on the
crazy alternates case. OTOH, if you had some totally insane ref
structure, like 120K _unique_ refs (which would probably imply that
you're making one ref per commit or something silly like that. But hey,
people have suggested it in the past), then it could be a big
improvement.

  [1/2]: upload-pack: avoid parsing objects during ref advertisement
  [2/2]: upload-pack: avoid parsing tag destinations

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-06 19:16   ` Jeff King
@ 2012-01-06 21:27     ` Junio C Hamano
  2012-01-06 22:33       ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2012-01-06 21:27 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git-dev

Jeff King <peff@peff.net> writes:

> Actually, we can do much better than that. Here are a few patches that
> avoid parsing objects when possible. They drop the 3.4s to 2.0s. If you
> combine them with the parse_object optimization, my 120K case drops to
> around 0.68s.
>
> I don't know if it is really that worth it on top of the parse_object
> optimization. It's almost negligible for the normal case...
> ... OTOH, if you had some totally insane ref
> structure, like 120K _unique_ refs (which would probably imply that
> you're making one ref per commit or something silly like that. But hey,
> people have suggested it in the past), then it could be a big
> improvement.

Even though it is a bit scary kind of loosening of sanity checks that I
hesitate to take at this late in the cycle, I think it makes sense. Let's
queue them on 'pu' and aim for the next cycle.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-06 21:27     ` Junio C Hamano
@ 2012-01-06 22:33       ` Jeff King
  2012-01-06 22:45         ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2012-01-06 22:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git-dev

On Fri, Jan 06, 2012 at 01:27:40PM -0800, Junio C Hamano wrote:

> > I don't know if it is really that worth it on top of the parse_object
> > optimization. It's almost negligible for the normal case...
> > ... OTOH, if you had some totally insane ref
> > structure, like 120K _unique_ refs (which would probably imply that
> > you're making one ref per commit or something silly like that. But hey,
> > people have suggested it in the past), then it could be a big
> > improvement.
> 
> Even though it is a bit scary kind of loosening of sanity checks that I
> hesitate to take at this late in the cycle, I think it makes sense. Let's
> queue them on 'pu' and aim for the next cycle.

Did you want to leave the parse_object optimization until next cycle,
too? It's not loosening checks, but it's such a core piece of code that
it makes me nervous somebody somewhere is abusing "struct object" in a
way that will break it.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-06 22:33       ` Jeff King
@ 2012-01-06 22:45         ` Junio C Hamano
  2012-01-06 22:46           ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2012-01-06 22:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git-dev

Jeff King <peff@peff.net> writes:

>> Even though it is a bit scary kind of loosening of sanity checks that I
>> hesitate to take at this late in the cycle, I think it makes sense. Let's
>> queue them on 'pu' and aim for the next cycle.
>
> Did you want to leave the parse_object optimization until next cycle,
> too? It's not loosening checks, but it's such a core piece of code that
> it makes me nervous somebody somewhere is abusing "struct object" in a
> way that will break it.

I was just updating the "What's cooking" report and my current thinking is
that we should keep all three in "next" to give it a bit of exposure for
now, and merge them to "master" early in the 1.7.10 cycle.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] parse_object: try internal cache before reading object db
  2012-01-06 22:45         ` Junio C Hamano
@ 2012-01-06 22:46           ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2012-01-06 22:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git-dev

On Fri, Jan 06, 2012 at 02:45:15PM -0800, Junio C Hamano wrote:

> > Did you want to leave the parse_object optimization until next cycle,
> > too? It's not loosening checks, but it's such a core piece of code that
> > it makes me nervous somebody somewhere is abusing "struct object" in a
> > way that will break it.
> 
> I was just updating the "What's cooking" report and my current thinking is
> that we should keep all three in "next" to give it a bit of exposure for
> now, and merge them to "master" early in the 1.7.10 cycle.

That sounds perfect. Thanks.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement
  2012-01-05 21:35 ` Junio C Hamano
  2012-01-05 21:49   ` Jeff King
  2012-01-06 19:16   ` Jeff King
@ 2012-01-06 19:17   ` Jeff King
  2013-01-18 23:12     ` Junio C Hamano
  2013-01-29  8:10     ` Shawn Pearce
  2012-01-06 19:18   ` [PATCH 2/2] upload-pack: avoid parsing tag destinations Jeff King
  3 siblings, 2 replies; 17+ messages in thread
From: Jeff King @ 2012-01-06 19:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git-dev

When we advertise a ref, the first thing we do is parse the
pointed-to object. This gives us two things:

  1. a "struct object" we can use to store flags

  2. the type of the object, so we know whether we need to
     dereference it as a tag

Instead, we can just use lookup_unknown_object to get an
object struct, and then fill in just the type field using
sha1_object_info (which, in the case of packed files, can
find the information without actually inflating the object
data).

This can save time if you have a large number of refs, and
the client isn't actually going to request those refs (e.g.,
because most of them are already up-to-date).

The downside is that we are no longer verifying objects that
we advertise by fully parsing them (however, we do still
know we actually have them, because sha1_object_info must
find them to get the type). While we might fail to detect a
corrupt object here, if the client actually fetches the
object, we will parse (and verify) it then.

On a repository with 120K refs, the advertisement portion of
upload-pack goes from ~3.4s to 3.2s (the failure to speed up
more is largely due to the fact that most of these refs are
tags, which need dereferenced to find the tag destination
anyway).

Signed-off-by: Jeff King <peff@peff.net>
---
 upload-pack.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 6f36f62..65cb0ff 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -720,11 +720,14 @@ static int send_ref(const char *refname, const unsigned char *sha1, int flag, vo
 	static const char *capabilities = "multi_ack thin-pack side-band"
 		" side-band-64k ofs-delta shallow no-progress"
 		" include-tag multi_ack_detailed";
-	struct object *o = parse_object(sha1);
+	struct object *o = lookup_unknown_object(sha1);
 	const char *refname_nons = strip_namespace(refname);
 
-	if (!o)
-		die("git upload-pack: cannot find object %s:", sha1_to_hex(sha1));
+	if (o->type == OBJ_NONE) {
+		o->type = sha1_object_info(sha1, NULL);
+		if (o->type < 0)
+		    die("git upload-pack: cannot find object %s:", sha1_to_hex(sha1));
+	}
 
 	if (capabilities)
 		packet_write(1, "%s %s%c%s%s\n", sha1_to_hex(sha1), refname_nons,
@@ -738,6 +741,7 @@ static int send_ref(const char *refname, const unsigned char *sha1, int flag, vo
 		nr_our_refs++;
 	}
 	if (o->type == OBJ_TAG) {
+		o = parse_object(o->sha1);
 		o = deref_tag(o, refname, 0);
 		if (o)
 			packet_write(1, "%s %s^{}\n", sha1_to_hex(o->sha1), refname_nons);
-- 
1.7.6.5.14.g7b06f

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement
  2012-01-06 19:17   ` [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement Jeff King
@ 2013-01-18 23:12     ` Junio C Hamano
  2013-01-24  7:50       ` Jeff King
  2013-01-29  8:10     ` Shawn Pearce
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2013-01-18 23:12 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git-dev

Jeff King <peff@peff.net> writes:

> When we advertise a ref, the first thing we do is parse the
> pointed-to object. This gives us two things:
>
>   1. a "struct object" we can use to store flags
>
>   2. the type of the object, so we know whether we need to
>      dereference it as a tag
>
> Instead, we can just use lookup_unknown_object to get an
> object struct, and then fill in just the type field using
> sha1_object_info (which, in the case of packed files, can
> find the information without actually inflating the object
> data).
>
> This can save time if you have a large number of refs, and
> the client isn't actually going to request those refs (e.g.,
> because most of them are already up-to-date).
>
> The downside is that we are no longer verifying objects that
> we advertise by fully parsing them (however, we do still
> know we actually have them, because sha1_object_info must
> find them to get the type). While we might fail to detect a
> corrupt object here, if the client actually fetches the
> object, we will parse (and verify) it then.
>...

This is an old news, but do you recall why this patch did not update
the code in mark_our_ref() that gets "struct object *o" from parse_object()
the same way and mark them with OUR_REF flag?

I was wondering about code consolidation like this.

 upload-pack.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 95d8313..609cd6c 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -722,15 +722,18 @@ static void receive_needs(void)
 	free(shallows.objects);
 }
 
+static int mark_our_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data);
+
 static int send_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
 {
 	static const char *capabilities = "multi_ack thin-pack side-band"
 		" side-band-64k ofs-delta shallow no-progress"
 		" include-tag multi_ack_detailed";
-	struct object *o = lookup_unknown_object(sha1);
 	const char *refname_nons = strip_namespace(refname);
 	unsigned char peeled[20];
 
+	mark_our_ref(refname, sha1, flag, cb_data);
+
 	if (capabilities)
 		packet_write(1, "%s %s%c%s%s agent=%s\n",
 			     sha1_to_hex(sha1), refname_nons,
@@ -740,10 +743,6 @@ static int send_ref(const char *refname, const unsigned char *sha1, int flag, vo
 	else
 		packet_write(1, "%s %s\n", sha1_to_hex(sha1), refname_nons);
 	capabilities = NULL;
-	if (!(o->flags & OUR_REF)) {
-		o->flags |= OUR_REF;
-		nr_our_refs++;
-	}
 	if (!peel_ref(refname, peeled))
 		packet_write(1, "%s %s^{}\n", sha1_to_hex(peeled), refname_nons);
 	return 0;
@@ -751,7 +750,7 @@ static int send_ref(const char *refname, const unsigned char *sha1, int flag, vo
 
 static int mark_our_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
 {
-	struct object *o = parse_object(sha1);
+	struct object *o = parse_object(sha1); /* lookup-unknown??? */
 	if (!o)
 		die("git upload-pack: cannot find object %s:", sha1_to_hex(sha1));
 	if (!(o->flags & OUR_REF)) {

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement
  2013-01-18 23:12     ` Junio C Hamano
@ 2013-01-24  7:50       ` Jeff King
  2013-01-24 17:25         ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2013-01-24  7:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git-dev

On Fri, Jan 18, 2013 at 03:12:52PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > When we advertise a ref, the first thing we do is parse the
> > pointed-to object. This gives us two things:
> >
> >   1. a "struct object" we can use to store flags
> >
> >   2. the type of the object, so we know whether we need to
> >      dereference it as a tag
> >
> > Instead, we can just use lookup_unknown_object to get an
> > object struct, and then fill in just the type field using
> > sha1_object_info (which, in the case of packed files, can
> > find the information without actually inflating the object
> > data).
> >
> > This can save time if you have a large number of refs, and
> > the client isn't actually going to request those refs (e.g.,
> > because most of them are already up-to-date).
> 
> This is an old news, but do you recall why this patch did not update
> the code in mark_our_ref() that gets "struct object *o" from parse_object()
> the same way and mark them with OUR_REF flag?
> 
> I was wondering about code consolidation like this.

It was just because I did my measuring on raw upload-pack, so I didn't
notice that mark_our_ref was doing the same potentially slow thing. We
only call mark_our_ref during the second half of the stateless-rpc
conversation, and I did not measure that (and it would be a pain to do
so in isolation).

But it should be able to get the exact same speedups that we get from
send_ref. It probably matters less in the long run, because the
advertising phase is going to be called more frequently (e.g., for every
no-op fetch), and once we are calling mark_our_ref, we are presumably
about to do do actual packing work. However, there's no reason not to
get what speed we can there, too.

> diff --git a/upload-pack.c b/upload-pack.c
> index 95d8313..609cd6c 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -722,15 +722,18 @@ static void receive_needs(void)
>  	free(shallows.objects);
>  }
>  
> +static int mark_our_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data);
> +
>  static int send_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
>  {
>  	static const char *capabilities = "multi_ack thin-pack side-band"
>  		" side-band-64k ofs-delta shallow no-progress"
>  		" include-tag multi_ack_detailed";
> -	struct object *o = lookup_unknown_object(sha1);
>  	const char *refname_nons = strip_namespace(refname);
>  	unsigned char peeled[20];
>  
> +	mark_our_ref(refname, sha1, flag, cb_data);
> +
>  	if (capabilities)
>  		packet_write(1, "%s %s%c%s%s agent=%s\n",
>  			     sha1_to_hex(sha1), refname_nons,
> @@ -740,10 +743,6 @@ static int send_ref(const char *refname, const unsigned char *sha1, int flag, vo
>  	else
>  		packet_write(1, "%s %s\n", sha1_to_hex(sha1), refname_nons);
>  	capabilities = NULL;
> -	if (!(o->flags & OUR_REF)) {
> -		o->flags |= OUR_REF;
> -		nr_our_refs++;
> -	}
>  	if (!peel_ref(refname, peeled))
>  		packet_write(1, "%s %s^{}\n", sha1_to_hex(peeled), refname_nons);
>  	return 0;

Right, I think this is a nice cleanup.

> @@ -751,7 +750,7 @@ static int send_ref(const char *refname, const unsigned char *sha1, int flag, vo
>  
>  static int mark_our_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
>  {
> -	struct object *o = parse_object(sha1);
> +	struct object *o = parse_object(sha1); /* lookup-unknown??? */
>  	if (!o)
>  		die("git upload-pack: cannot find object %s:", sha1_to_hex(sha1));
>  	if (!(o->flags & OUR_REF)) {

And yeah, this should use lookup_unknown_object to extend the
optimization to mark_our_ref (and avoid removing it for the
ref-advertisement case, of course).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement
  2013-01-24  7:50       ` Jeff King
@ 2013-01-24 17:25         ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2013-01-24 17:25 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git-dev

Jeff King <peff@peff.net> writes:

> And yeah, this should use lookup_unknown_object to extend the
> optimization to mark_our_ref (and avoid removing it for the
> ref-advertisement case, of course).

Thanks for sanity checking.  Here is what is queued at the bottom of
the hide-refs topic.

-- >8 --
Date: Fri, 18 Jan 2013 15:48:49 -0800
Subject: [PATCH] upload-pack: share more code

We mark the objects pointed at our refs with "OUR_REF" flag in two
functions (mark_our_ref() and send_ref()), but we can just use the
former as a helper for the latter.

Update the way mark_our_ref() prepares in-core object to use
lookup_unknown_object() to delay reading the actual object data,
just like we did in 435c833 (upload-pack: use peel_ref for ref
advertisements, 2012-10-04).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 upload-pack.c | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 95d8313..3dd220d 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -722,15 +722,28 @@ static void receive_needs(void)
 	free(shallows.objects);
 }
 
+static int mark_our_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
+{
+	struct object *o = lookup_unknown_object(sha1);
+	if (!o)
+		die("git upload-pack: cannot find object %s:", sha1_to_hex(sha1));
+	if (!(o->flags & OUR_REF)) {
+		o->flags |= OUR_REF;
+		nr_our_refs++;
+	}
+	return 0;
+}
+
 static int send_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
 {
 	static const char *capabilities = "multi_ack thin-pack side-band"
 		" side-band-64k ofs-delta shallow no-progress"
 		" include-tag multi_ack_detailed";
-	struct object *o = lookup_unknown_object(sha1);
 	const char *refname_nons = strip_namespace(refname);
 	unsigned char peeled[20];
 
+	mark_our_ref(refname, sha1, flag, cb_data);
+
 	if (capabilities)
 		packet_write(1, "%s %s%c%s%s agent=%s\n",
 			     sha1_to_hex(sha1), refname_nons,
@@ -740,27 +753,11 @@ static int send_ref(const char *refname, const unsigned char *sha1, int flag, vo
 	else
 		packet_write(1, "%s %s\n", sha1_to_hex(sha1), refname_nons);
 	capabilities = NULL;
-	if (!(o->flags & OUR_REF)) {
-		o->flags |= OUR_REF;
-		nr_our_refs++;
-	}
 	if (!peel_ref(refname, peeled))
 		packet_write(1, "%s %s^{}\n", sha1_to_hex(peeled), refname_nons);
 	return 0;
 }
 
-static int mark_our_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
-{
-	struct object *o = parse_object(sha1);
-	if (!o)
-		die("git upload-pack: cannot find object %s:", sha1_to_hex(sha1));
-	if (!(o->flags & OUR_REF)) {
-		o->flags |= OUR_REF;
-		nr_our_refs++;
-	}
-	return 0;
-}
-
 static void upload_pack(void)
 {
 	if (advertise_refs || !stateless_rpc) {
-- 
1.8.1.1.525.gdace530

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement
  2012-01-06 19:17   ` [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement Jeff King
  2013-01-18 23:12     ` Junio C Hamano
@ 2013-01-29  8:10     ` Shawn Pearce
  2013-01-29  8:14       ` Jeff King
  1 sibling, 1 reply; 17+ messages in thread
From: Shawn Pearce @ 2013-01-29  8:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, git-dev

On Fri, Jan 6, 2012 at 11:17 AM, Jeff King <peff@peff.net> wrote:
> When we advertise a ref, the first thing we do is parse the
> pointed-to object. This gives us two things:
...
> The downside is that we are no longer verifying objects that
> we advertise by fully parsing them (however, we do still
> know we actually have them, because sha1_object_info must
> find them to get the type). While we might fail to detect a
> corrupt object here, if the client actually fetches the
> object, we will parse (and verify) it then.

As you explain, its not necessary to verify during the advertisement
phase. Its fine to delay verification to when a client actually
"wants" the object.

> On a repository with 120K refs, the advertisement portion of
> upload-pack goes from ~3.4s to 3.2s (the failure to speed up
> more is largely due to the fact that most of these refs are
> tags, which need dereferenced to find the tag destination
> anyway).

Why aren't we using the peeled information from the packed-refs file?
JGit does this and it saves a lot of time on advertisements from a
well packed repository.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement
  2013-01-29  8:10     ` Shawn Pearce
@ 2013-01-29  8:14       ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2013-01-29  8:14 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Junio C Hamano, git, git-dev

On Tue, Jan 29, 2013 at 12:10:59AM -0800, Shawn O. Pearce wrote:

> > On a repository with 120K refs, the advertisement portion of
> > upload-pack goes from ~3.4s to 3.2s (the failure to speed up
> > more is largely due to the fact that most of these refs are
> > tags, which need dereferenced to find the tag destination
> > anyway).
> 
> Why aren't we using the peeled information from the packed-refs file?
> JGit does this and it saves a lot of time on advertisements from a
> well packed repository.

The patch you are replying to is a year old. Since then, I did:

  435c833 (upload-pack: use peel_ref for ref advertisements, 2012-10-04)

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/2] upload-pack: avoid parsing tag destinations
  2012-01-05 21:35 ` Junio C Hamano
                     ` (2 preceding siblings ...)
  2012-01-06 19:17   ` [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement Jeff King
@ 2012-01-06 19:18   ` Jeff King
  3 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2012-01-06 19:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git-dev

When upload-pack advertises refs, it dereferences any tags
it sees, and shows the resulting sha1 to the client. It does
this by calling deref_tag. That function must load and parse
each tag object to find the sha1 of the tagged object.
However, it also ends up parsing the tagged object itself,
which is not strictly necessary for upload-pack's use.

Each tag produces two object loads (assuming it is not a
recursive tag), when it could get away with only a single
one. Dropping the second load halves the effort we spend.

The downside is that we are no longer verifying the
resulting object by loading it. In particular:

  1. We never cross-check the "type" field given in the tag
     object with the type of the pointed-to object.  If the
     tag says it points to a tag but doesn't, then we will
     keep peeling and realize the error.  If the tag says it
     points to a non-tag but actually points to a tag, we
     will stop peeling and just advertise the pointed-to
     tag.

  2. If we are missing the pointed-to object, we will not
     realize (because we never even look it up in the object
     db).

However, both of these are errors in the object database,
and both will be detected if a client actually requests the
broken objects in question. So we are simply pushing the
verification away from the advertising stage, and down to
the actual fetching stage.

On my test repo with 120K refs, this drops the time to
advertise the refs from ~3.2s to ~2.0s.

Signed-off-by: Jeff King <peff@peff.net>
---
 tag.c         |   12 ++++++++++++
 tag.h         |    1 +
 upload-pack.c |    3 +--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/tag.c b/tag.c
index 3aa186d..78d272b 100644
--- a/tag.c
+++ b/tag.c
@@ -24,6 +24,18 @@ struct object *deref_tag(struct object *o, const char *warn, int warnlen)
 	return o;
 }
 
+struct object *deref_tag_noverify(struct object *o)
+{
+	while (o && o->type == OBJ_TAG) {
+		o = parse_object(o->sha1);
+		if (o && o->type == OBJ_TAG && ((struct tag *)o)->tagged)
+			o = ((struct tag *)o)->tagged;
+		else
+			o = NULL;
+	}
+	return o;
+}
+
 struct tag *lookup_tag(const unsigned char *sha1)
 {
 	struct object *obj = lookup_object(sha1);
diff --git a/tag.h b/tag.h
index 5ee88e6..bc8a1e4 100644
--- a/tag.h
+++ b/tag.h
@@ -16,6 +16,7 @@ extern struct tag *lookup_tag(const unsigned char *sha1);
 extern int parse_tag_buffer(struct tag *item, const void *data, unsigned long size);
 extern int parse_tag(struct tag *item);
 extern struct object *deref_tag(struct object *, const char *, int);
+extern struct object *deref_tag_noverify(struct object *);
 extern size_t parse_signature(const char *buf, unsigned long size);
 
 #endif /* TAG_H */
diff --git a/upload-pack.c b/upload-pack.c
index 65cb0ff..c01e161 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -741,8 +741,7 @@ static int send_ref(const char *refname, const unsigned char *sha1, int flag, vo
 		nr_our_refs++;
 	}
 	if (o->type == OBJ_TAG) {
-		o = parse_object(o->sha1);
-		o = deref_tag(o, refname, 0);
+		o = deref_tag_noverify(o);
 		if (o)
 			packet_write(1, "%s %s^{}\n", sha1_to_hex(o->sha1), refname_nons);
 	}
-- 
1.7.6.5.14.g7b06f

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2013-01-29  8:14 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-01-05 21:00 [PATCH] parse_object: try internal cache before reading object db Jeff King
2012-01-05 21:35 ` Junio C Hamano
2012-01-05 21:49   ` Jeff King
2012-01-05 21:55     ` Junio C Hamano
2012-01-05 22:18       ` Jeff King
2012-01-06 19:16   ` Jeff King
2012-01-06 21:27     ` Junio C Hamano
2012-01-06 22:33       ` Jeff King
2012-01-06 22:45         ` Junio C Hamano
2012-01-06 22:46           ` Jeff King
2012-01-06 19:17   ` [PATCH 1/2] upload-pack: avoid parsing objects during ref advertisement Jeff King
2013-01-18 23:12     ` Junio C Hamano
2013-01-24  7:50       ` Jeff King
2013-01-24 17:25         ` Junio C Hamano
2013-01-29  8:10     ` Shawn Pearce
2013-01-29  8:14       ` Jeff King
2012-01-06 19:18   ` [PATCH 2/2] upload-pack: avoid parsing tag destinations Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).