From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: git-dev@github.com
Subject: [PATCH] parse_object: try internal cache before reading object db
Date: Thu, 5 Jan 2012 16:00:01 -0500
Message-ID: <20120105210001.GA30549@sigill.intra.peff.net>

When parse_object is called, we do the following:

  1. read the object data into a buffer via read_sha1_file

  2. call parse_object_buffer, which then:

     a. calls the appropriate lookup_{commit,tree,blob,tag}
        to either create a new "struct object", or to find
        an existing one. We know the appropriate type from
        the lookup in step 1.

     b. calls the appropriate parse_{commit,tree,blob,tag}
        to parse the buffer for the new (or existing) object

In step 2b, all of the called functions are no-ops for
object "X" if "X->object.parsed" is set. I.e., when we have
already parsed an object, we end up going to a lot of work
just to find out at a low level that there is nothing left
for us to do (and we throw away the data from read_sha1_file
unread).

We can optimize this by moving the check for "do we have an
in-memory object" from step 2a to before the expensive call
to read_sha1_file in step 1.
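
As a rough illustration of the reordering (a toy model only; the
toy_* names are made up for this sketch and are not git's real
types or functions), counting how often the "object db" is hit in
each order:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are git's real types or functions. */
struct toy_object {
	int id;      /* stands in for the sha1 */
	int parsed;  /* mirrors the "parsed" flag */
};

static struct toy_object toy_table[4] = { {0, 0}, {1, 0}, {2, 0}, {3, 0} };
static int toy_reads;	/* how often the "object db" was hit */

/* Pretend to read an object from disk; this is the expensive step. */
static int toy_read(int id)
{
	assert(id >= 0 && id < 4);
	toy_reads++;
	return id;
}

/* Old order: read first, find out afterwards that nothing is left to do. */
static struct toy_object *toy_parse_old(int id)
{
	struct toy_object *obj;

	toy_read(id);
	obj = &toy_table[id];
	if (!obj->parsed)
		obj->parsed = 1;
	return obj;
}

/* New order: consult the in-memory object before touching the db. */
static struct toy_object *toy_parse_new(int id)
{
	struct toy_object *obj;

	assert(id >= 0 && id < 4);
	obj = &toy_table[id];
	if (obj->parsed)
		return obj;
	toy_read(id);
	obj->parsed = 1;
	return obj;
}
```

Calling toy_parse_old twice on the same id bumps toy_reads twice;
calling toy_parse_new twice bumps it only once.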

This might seem circular, since step 2a uses the type
information determined in step 1 to call the appropriate
lookup function. However, we can notice that all of the
lookup_* functions are backed by lookup_object. In other
words, all of the objects are kept in a master hash table,
and we don't actually need the type to do the "do we have
it" part of the lookup, only to do the "and create it if it
doesn't exist" part.
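
To make the split between the two halves of the lookup concrete,
here is a minimal sketch of a single table keyed only by id. It is
not the real object.c code: the toy_* names, the fixed-size table
with linear probing, and returning NULL on a type mismatch are all
assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum toy_type { TOY_NONE = 0, TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_object {
	unsigned id;         /* stands in for the sha1 */
	enum toy_type type;  /* TOY_NONE marks an empty slot */
	int parsed;
};

#define TOY_SLOTS 64
static struct toy_object toy_slots[TOY_SLOTS];
static unsigned toy_used;

/* "Do we have it": needs only the id, never the type. */
static struct toy_object *toy_lookup_object(unsigned id)
{
	unsigned i = id % TOY_SLOTS;

	while (toy_slots[i].type != TOY_NONE) {
		if (toy_slots[i].id == id)
			return &toy_slots[i];
		i = (i + 1) % TOY_SLOTS;	/* linear probing */
	}
	return NULL;
}

/* "And create it if it doesn't exist": only this half needs the type. */
static struct toy_object *toy_lookup_typed(unsigned id, enum toy_type type)
{
	struct toy_object *obj = toy_lookup_object(id);
	unsigned i;

	if (obj)
		return obj->type == type ? obj : NULL;

	/* Keep one slot free so that probing always terminates. */
	assert(toy_used < TOY_SLOTS - 1);
	i = id % TOY_SLOTS;
	while (toy_slots[i].type != TOY_NONE)
		i = (i + 1) % TOY_SLOTS;
	toy_slots[i].id = id;
	toy_slots[i].type = type;
	toy_slots[i].parsed = 0;
	toy_used++;
	return &toy_slots[i];
}
```

The point is that toy_lookup_object can answer "is this object
already in memory" before anything has told us what type it is.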

This can save time whenever we call parse_object on the same
sha1 twice in a single program. Some code paths already
perform this optimization manually, with either:

  if (!obj->parsed)
	  obj = parse_object(obj->sha1);

if you already have a "struct object", or:

  struct object *obj = lookup_unknown_object(sha1);
  if (!obj || !obj->parsed)
	  obj = parse_object(sha1);

if you don't.  This patch moves the optimization into
parse_object itself.

Most git operations won't notice any impact. Either they
don't parse a lot of duplicate sha1s, or the calling code
takes special care not to re-parse objects. I timed two
code paths that do benefit (there may be more, but these two
were immediately obvious and easy to time).

The first is fast-export, which calls parse_object on each
object it outputs, like this:

  object = parse_object(sha1);
  if (!object)
	  die(...);
  if (object->flags & SHOWN)
	  return;

which means that just to realize we have already shown an
object, we will read the whole object from disk!

With this patch, my best-of-five time for "fast-export --all" on
git.git dropped from 26.3s to 21.3s.

The second case is upload-pack, which will call parse_object
for each advertised ref (because it needs to peel tags to
show "^{}" entries). This doesn't matter for most
repositories, because they don't have a lot of refs pointing
to the same objects. However, if you have a big alternates
repository with a shared object db for a number of child
repositories, then the alternates repository will have
duplicated refs representing each of its children.

For example, GitHub's alternates repository for git.git has
~120,000 refs, of which only ~3200 are unique. The time for
upload-pack to print its list of advertised refs dropped
from 3.4s to 0.76s.
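
A back-of-the-envelope model of the duplicated-refs case, counting
object db reads rather than time. The function name and the even
spread of refs over objects (ref % nunique) are assumptions made
for this sketch; it does not model upload-pack itself or try to
reproduce the timings above.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Walk nrefs refs that point at nunique distinct objects and return
 * how many times the object db would be read. With use_cache set, an
 * object that is already marked parsed costs no read.
 * Returns -1 if the scratch array cannot be allocated.
 */
static long toy_count_reads(long nrefs, long nunique, int use_cache)
{
	unsigned char *parsed;
	long reads = 0;
	long ref;

	assert(nrefs >= 0 && nunique > 0);
	parsed = calloc((size_t)nunique, 1);
	if (!parsed)
		return -1;

	for (ref = 0; ref < nrefs; ref++) {
		long obj = ref % nunique;	/* hypothetical ref -> object mapping */

		if (use_cache && parsed[obj])
			continue;	/* already in memory, skip the read */
		reads++;
		parsed[obj] = 1;
	}

	free(parsed);
	return reads;
}
```

With the approximate numbers quoted above, toy_count_reads(120000,
3200, 0) counts one read per ref, while toy_count_reads(120000,
3200, 1) counts one read per unique object.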

Signed-off-by: Jeff King <peff@peff.net>
---
 object.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/object.c b/object.c
index d8d09f9..6b06297 100644
--- a/object.c
+++ b/object.c
@@ -191,10 +191,15 @@ struct object *parse_object(const unsigned char *sha1)
 	enum object_type type;
 	int eaten;
 	const unsigned char *repl = lookup_replace_object(sha1);
-	void *buffer = read_sha1_file(sha1, &type, &size);
+	void *buffer;
+	struct object *obj;
+
+	obj = lookup_object(sha1);
+	if (obj && obj->parsed)
+		return obj;
 
+	buffer = read_sha1_file(sha1, &type, &size);
 	if (buffer) {
-		struct object *obj;
 		if (check_sha1_signature(repl, buffer, size, typename(type)) < 0) {
 			free(buffer);
 			error("sha1 mismatch %s\n", sha1_to_hex(repl));
-- 
1.7.6.5.6.ge6248
