git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Subject: [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
Date: Fri, 7 Dec 2012 09:04:44 -0500	[thread overview]
Message-ID: <20121207140444.GB10964@sigill.intra.peff.net> (raw)
In-Reply-To: <20121207135351.GA10538@sigill.intra.peff.net>

When we look up a sha1 object for reading, we first check
packfiles, and then loose objects. If we still haven't found
it, we re-scan the list of packfiles in `objects/pack`. This
final step ensures that we can co-exist with a simultaneous
repack process which creates a new pack and then prunes the
old object.

This extra re-scan usually does not have a performance
impact for two reasons:

  1. If an object is missing, then typically the re-scan
     will find a new pack, then no more misses will occur.
     Or if it truly is missing, then our next step is
     usually to die().

  2. Re-scanning is cheap enough that we do not even notice.

However, these do not always hold. The assumption in (1) is
that the caller is expecting to find the object. This is
usually the case, but the call to `parse_object` in
`everything_local` does not follow this pattern. It is
looking to see whether we have objects that the remote side
is advertising, not something we expect to have. Therefore
if we are fetching from a remote which has many refs
pointing to objects we do not have, we may end up
re-scanning the pack directory many times.

Even with this extra re-scanning, the impact is often not
noticeable due to (2); we just readdir() the packs directory
and skip any packs that are already loaded. However, if
there are a large number of packs, then even enumerating the
directory directory can be expensive (especially if we do it
repeatedly). Having this many packs is a good sign the user
should run `git gc`, but it would still be nice to avoid
having to scan the directory at all.

This patch checks has_sha1_file (which does not have the
re-scan and re-check behavior) in the critical loop, and
avoids calling parse_object at all if we do not have the
object.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm lukewarm on this patch.  The re-scan _shouldn't_ be that expensive,
so maybe patch 1 is enough to be a reasonable fix. The fact that we
re-scan repeatedly seems ugly and hacky to me, but it really is just
opendir/readdir/closedir in the case that nothing has changed (and if
something has changed, then it's a good thing to be checking). And with
my patch, fetch-pack would not notice new packs from a simultaneous
repack process (although it's OK, as the result is not incorrect, but
merely that we may ask for the object from the server).

Another option would be to make the reprepare_packed_git re-scan less
expensive by checking the mtime of the directory before scanning it.

 fetch-pack.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fetch-pack.c b/fetch-pack.c
index 099ff4d..b4383c6 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -594,6 +594,9 @@ static int everything_local(struct fetch_pack_args *args,
 	for (ref = *refs; ref; ref = ref->next) {
 		struct object *o;
 
+		if (!has_sha1_file(ref->old_sha1))
+			continue;
+
 		o = parse_object(ref->old_sha1);
 		if (!o)
 			continue;
-- 
1.8.1.rc0.4.g5948dfd.dirty

  parent reply	other threads:[~2012-12-07 14:05 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-07 13:53 [PATCH 0/2] make repeated fetch faster on "read only" repos Jeff King
2012-12-07 13:58 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King
2012-12-07 14:04 ` Jeff King [this message]
  -- strict thread matches above, loose matches on Subject: below --
2013-01-26 22:40 [PATCH 0/2] optimizing pack access on "read only" fetch repos Jeff King
2013-01-26 22:40 ` [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory Jeff King
2013-01-27 10:27   ` Jonathan Nieder
2013-01-27 20:09     ` Junio C Hamano
2013-01-27 23:20       ` Jonathan Nieder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121207140444.GB10964@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).