* [PATCH 0/2] make repeated fetch faster on "read only" repos
@ 2012-12-07 13:53 Jeff King
2012-12-07 13:58 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King
2012-12-07 14:04 ` [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory Jeff King
0 siblings, 2 replies; 7+ messages in thread
From: Jeff King @ 2012-12-07 13:53 UTC (permalink / raw)
To: git
Like many dev shops, we run a CI server that basically does:
git fetch $some_branch &&
git checkout $some_branch &&
make test
all day long. Sometimes the fetches would get very slow, and the problem
turned out to be a combination of:
1. Never running "git gc". This means you can end up with a ton of
loose objects, or even a bunch of small packs[1].
2. One of the loops in fetch caused us to re-scan the entire
objects/pack directory repeatedly, proportional to the number of
refs on the remote.
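A quick way to get a feel for how bad it has gotten (for illustration
only; the exact numbers will of course vary) is:

  # reports both the loose object count and the number of packs
  git count-objects -v

  # or count the packs directly
  ls .git/objects/pack/*.pack | wc -l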
I think the fundamental fix is to gc more often, as it makes the re-scans
less expensive, along with making general object lookup faster. But the
repeated re-scans strike me as kind of hacky. This series tries to
address both:
[1/2]: fetch: run gc --auto after fetching
[2/2]: fetch-pack: avoid repeatedly re-scanning pack directory
-Peff
[1] It turns out we had our transfer.unpacklimit set unreasonably low,
leading to a large number of tiny packs, but even with the defaults,
you will end up with a ton of loose objects if you do repeated small
fetches.
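For reference, that knob looks something like this (the value below is
made up purely for illustration; the stock default is 100 objects):

  # a fetch that brings in fewer objects than the limit is exploded
  # into loose objects; anything bigger is kept as a single pack, so
  # a very low value keeps essentially every fetch as its own tiny pack
  git config transfer.unpackLimit 1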
* [PATCH 1/2] fetch: run gc --auto after fetching
2012-12-07 13:53 [PATCH 0/2] make repeated fetch faster on "read only" repos Jeff King
@ 2012-12-07 13:58 ` Jeff King
2012-12-07 14:04 ` [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory Jeff King
1 sibling, 0 replies; 7+ messages in thread
From: Jeff King @ 2012-12-07 13:58 UTC (permalink / raw)
To: git
We generally try to run "gc --auto" after any commands that
might introduce a large number of new objects. An obvious
place to do so is after running "fetch", which may introduce
new loose objects or packs (depending on the size of the
fetch).
While an active developer repository will probably
eventually trigger a "gc --auto" on another action (e.g.,
git-rebase), there are two good reasons why it is nicer to
do it at fetch time:
1. Read-only repositories which track an upstream (e.g., a
continuous integration server which fetches and builds,
but never makes new commits) will accrue loose objects
and small packs, but never coalesce them into a more
efficient larger pack.
2. Fetching is often already perceived to be slow to the
user, since they have to wait on the network. It's much
more pleasant to include a potentially slow auto-gc as
part of the already-long network fetch than in the
middle of productive work with git-rebase or similar.
Signed-off-by: Jeff King <peff@peff.net>
---
The original "gc --auto" patch in 2007 suggested adding this to fetch,
but we never did (there was some brief mention that maybe tweaking
fetch.unpacklimit would be relevant, but I think you would eventually
want to gc, whether you are accruing packs or loose objects).
I didn't bother protecting this with an option (as we do with
receive.autogc on the receiving side). I don't see much point, as
anybody who doesn't want gc on their client repos will already have
set gc.auto to prevent it from triggering via other commands.
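For anyone who does want to opt out, it is just:

  # disable "gc --auto" entirely in a given repository
  git config gc.auto 0

And since auto-gc only kicks in once roughly 6700 loose objects
(gc.auto) or 50 packs (gc.autoPackLimit) have piled up, on most
fetches this new call does nothing at all.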
builtin/fetch.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 4b5a898..1ddbf0d 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -959,6 +959,9 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
struct string_list list = STRING_LIST_INIT_NODUP;
struct remote *remote;
int result = 0;
+ static const char *argv_gc_auto[] = {
+ "gc", "--auto", NULL,
+ };
packet_trace_identity("fetch");
@@ -1026,5 +1029,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
list.strdup_strings = 1;
string_list_clear(&list, 0);
+ run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
+
return result;
}
--
1.8.1.rc0.4.g5948dfd.dirty
* [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
2012-12-07 13:53 [PATCH 0/2] make repeated fetch faster on "read only" repos Jeff King
2012-12-07 13:58 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King
@ 2012-12-07 14:04 ` Jeff King
1 sibling, 0 replies; 7+ messages in thread
From: Jeff King @ 2012-12-07 14:04 UTC (permalink / raw)
To: git
When we look up a sha1 object for reading, we first check
packfiles, and then loose objects. If we still haven't found
it, we re-scan the list of packfiles in `objects/pack`. This
final step ensures that we can co-exist with a simultaneous
repack process which creates a new pack and then prunes the
old object.
This extra re-scan usually does not have a performance
impact for two reasons:
1. If an object is missing, then typically the re-scan
will find a new pack, then no more misses will occur.
Or if it truly is missing, then our next step is
usually to die().
2. Re-scanning is cheap enough that we do not even notice.
However, these do not always hold. The assumption in (1) is
that the caller is expecting to find the object. This is
usually the case, but the call to `parse_object` in
`everything_local` does not follow this pattern. It is
looking to see whether we have objects that the remote side
is advertising, not something we expect to have. Therefore
if we are fetching from a remote which has many refs
pointing to objects we do not have, we may end up
re-scanning the pack directory many times.
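One rough way to observe this from the outside (assuming Linux and
strace; not part of this patch) is to count directory reads during a
fetch:

  # each re-scan of objects/pack shows up as getdents()/getdents64()
  # calls; against a remote with many refs we do not have, the count
  # grows with the number of refs
  strace -f -e trace=getdents,getdents64 git fetch origin 2>&1 | grep -c getdents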
Even with this extra re-scanning, the impact is often not
noticeable due to (2); we just readdir() the packs directory
and skip any packs that are already loaded. However, if
there are a large number of packs, then even enumerating the
directory can be expensive (especially if we do it
repeatedly). Having this many packs is a good sign the user
should run `git gc`, but it would still be nice to avoid
having to scan the directory at all.
This patch checks has_sha1_file (which does not have the
re-scan and re-check behavior) in the critical loop, and
avoids calling parse_object at all if we do not have the
object.
Signed-off-by: Jeff King <peff@peff.net>
---
I'm lukewarm on this patch. The re-scan _shouldn't_ be that expensive,
so maybe patch 1 is enough to be a reasonable fix. The fact that we
re-scan repeatedly seems ugly and hacky to me, but it really is just
opendir/readdir/closedir in the case that nothing has changed (and if
something has changed, then it's a good thing to be checking). And with
my patch, fetch-pack would not notice new packs from a simultaneous
repack process (though that is OK: the result is not incorrect, we
may merely ask the server for an object we already have).
Another option would be to make the reprepare_packed_git re-scan less
expensive by checking the mtime of the directory before scanning it.
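In shell terms the idea is roughly this (just a sketch of the
heuristic; GNU stat syntax assumed):

  # remember the mtime of the pack directory from the previous scan...
  last_scan=$(stat -c %Y .git/objects/pack)

  # ...and on a later lookup miss, only re-read the directory if
  # something has actually been added or removed since then
  now=$(stat -c %Y .git/objects/pack)
  test "$now" = "$last_scan" || echo "pack directory changed, re-scan"

The usual caveat is filesystem timestamp granularity: a real
implementation would still want to re-scan when the cached mtime is
equal to the current one but very recent.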
fetch-pack.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fetch-pack.c b/fetch-pack.c
index 099ff4d..b4383c6 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -594,6 +594,9 @@ static int everything_local(struct fetch_pack_args *args,
for (ref = *refs; ref; ref = ref->next) {
struct object *o;
+ if (!has_sha1_file(ref->old_sha1))
+ continue;
+
o = parse_object(ref->old_sha1);
if (!o)
continue;
--
1.8.1.rc0.4.g5948dfd.dirty
* [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
2013-01-26 22:40 [PATCH 0/2] optimizing pack access on "read only" fetch repos Jeff King
@ 2013-01-26 22:40 ` Jeff King
2013-01-27 10:27 ` Jonathan Nieder
0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2013-01-26 22:40 UTC (permalink / raw)
To: git
When we look up a sha1 object for reading, we first check
packfiles, and then loose objects. If we still haven't found
it, we re-scan the list of packfiles in `objects/pack`. This
final step ensures that we can co-exist with a simultaneous
repack process which creates a new pack and then prunes the
old object.
This extra re-scan usually does not have a performance
impact for two reasons:
1. If an object is missing, then typically the re-scan
will find a new pack, then no more misses will occur.
Or if it truly is missing, then our next step is
usually to die().
2. Re-scanning is cheap enough that we do not even notice.
However, these do not always hold. The assumption in (1) is
that the caller is expecting to find the object. This is
usually the case, but the call to `parse_object` in
`everything_local` does not follow this pattern. It is
looking to see whether we have objects that the remote side
is advertising, not something we expect to have. Therefore
if we are fetching from a remote which has many refs
pointing to objects we do not have, we may end up
re-scanning the pack directory many times.
Even with this extra re-scanning, the impact is often not
noticeable due to (2); we just readdir() the packs directory
and skip any packs that are already loaded. However, if
there are a large number of packs, then even enumerating the
directory can be expensive (especially if we do it
repeatedly). Having this many packs is a good sign the user
should run `git gc`, but it would still be nice to avoid
having to scan the directory at all.
Signed-off-by: Jeff King <peff@peff.net>
---
fetch-pack.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fetch-pack.c b/fetch-pack.c
index f0acdf7..6d8926a 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -594,6 +594,9 @@ static int everything_local(struct fetch_pack_args *args,
for (ref = *refs; ref; ref = ref->next) {
struct object *o;
+ if (!has_sha1_file(ref->old_sha1))
+ continue;
+
o = parse_object(ref->old_sha1);
if (!o)
continue;
--
1.8.0.2.16.g72e2fc9
* Re: [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
2013-01-26 22:40 ` [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory Jeff King
@ 2013-01-27 10:27 ` Jonathan Nieder
2013-01-27 20:09 ` Junio C Hamano
0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Nieder @ 2013-01-27 10:27 UTC (permalink / raw)
To: Jeff King; +Cc: git, Junio C Hamano
Hi,
Jeff King wrote:
> When we look up a sha1 object for reading, we first check
> packfiles, and then loose objects. If we still haven't found
> it, we re-scan the list of packfiles in `objects/pack`. This
> final step ensures that we can co-exist with a simultaneous
> repack process which creates a new pack and then prunes the
> old object.
I like the context above and what follows it, but I think you forgot
to mention what the patch actually does. :)
I guess it is:
However, in the first scan over refs in fetch-pack.c::everything_local,
this double-check of packfiles is not necessary since we are only
trying to get a rough estimate of the last time we fetched from this
remote repository in order to find good candidate common commits ---
a missed object would only result in a slightly slower fetch.
Avoid that slow second scan in the common case by guarding the object
lookup with has_sha1_file().
Sounds like it would not affect most fetches except by making them
a lot faster in the many-refs case, so for what it's worth I like it.
I had not read this codepath before. I'm left with a few questions:
* What is 49bb805e ("Do not ask for objects known to be complete",
2005-10-19) trying to do? Are we hurting that in any way?
For the sake of an example, suppose in my stalled project I
maintain 20 topic branches against an unmoving mainline I do not
advertise and you regularly fetch from me. The cutoff is the
*newest* commit date of any of my topic branches you already have.
By declaring you have that topic branch you avoid a complicated
negotiation to discover that we both have the mainline. Is that
the goal?
* Is has_sha1_file() generally susceptible to the race against repack
you mentioned? How is that normally dealt with?
* Can a slow operation get confused if an object is incorporated into
a pack and then expelled again by two repacks in sequence?
Thanks,
Jonathan
* Re: [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
2013-01-27 10:27 ` Jonathan Nieder
@ 2013-01-27 20:09 ` Junio C Hamano
2013-01-27 23:20 ` Jonathan Nieder
0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2013-01-27 20:09 UTC (permalink / raw)
To: Jonathan Nieder; +Cc: Jeff King, git
Jonathan Nieder <jrnieder@gmail.com> writes:
> Jeff King wrote:
>
>> When we look up a sha1 object for reading, we first check
>> packfiles, and then loose objects. If we still haven't found
>> it, we re-scan the list of packfiles in `objects/pack`. This
>> final step ensures that we can co-exist with a simultaneous
>> repack process which creates a new pack and then prunes the
>> old object.
>
> I like the context above and what follows it, but I think you forgot
> to mention what the patch actually does. :)
>
> I guess it is:
>
> However, in the first scan over refs in fetch-pack.c::everything_local,
> this double-check of packfiles is not necessary since we are only
> trying to get a rough estimate of the last time we fetched from this
> remote repository in order to find good candidate common commits ---
> a missed object would only result in a slightly slower fetch.
It is not about a rough estimate or common commits, though. The
"everything local" check in question is interested in only one
thing: are we _clearly_ up to date without fetching anything from
them?
Loosening the check may miss the rare case where we race against a
simultaneous repack, causing us to go to the network when we do not
have to. It becomes a trade-off: the common, non-racy case goes
faster because the "are we clearly up to date" check is allowed to
cheat, at the expense of the rare racy case suffering unnecessary
object transfer overhead.
> Avoid that slow second scan in the common case by guarding the object
> lookup with has_sha1_file().
This conclusion is correct.
> I had not read this codepath before. I'm left with a few questions:
>
> * What is 49bb805e ("Do not ask for objects known to be complete",
> 2005-10-19) trying to do? Are we hurting that in any way?
An earlier fetch may have acquired all the necessary objects but may
not have updated our refs for some reason (e.g. fast-forward check
may have fired). In such a case, we may already have a history that
is good (i.e. not missing paths down to the common history) in our
repository that is not connected to any of our refs, and we can
update our refs (or write to FETCH_HEAD) without asking the remote
end to do any common ancestor computation or object transfer.
That was the primary thing the patch wanted to do.
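As a made-up example of that scenario (branch name invented):

  # the first fetch downloads and stores the objects, but the ref
  # update itself is rejected by the fast-forward check
  git fetch origin topic:topic

  # retrying with a forced refspec can then update the ref without
  # asking the remote end for any common ancestor computation or
  # object transfer
  git fetch origin +topic:topic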
As a side-effect, we know more objects than just the objects at the
tips of our refs are complete and that may help the later common
history discovery step, but obviously we do not want to dig the
history down to root. The cutoff value is merely a heuristic
chosen without any deep thought.
> * Is has_sha1_file() generally susceptible to the race against repack
> you mentioned? How is that normally dealt with?
By failing to find the object, so that the user will restart. When
the caller really wants to use the object, the parse_object() =>
read_sha1_file() => read_object() path is used and we will see the
retry.
> * Can a slow operation get confused if an object is incorporated into
> a pack and then expelled again by two repacks in sequence?
If it checks "the object should be there" first, waits for a long
time, and then tries to find that object's data, the later access
will go through the parse_object() call path and I think it should do
the right thing. If that slow operation stops inside read_object(),
it could find itself unable to map the loose object file and then
unable to find it in a pack, either. Is that what you are worried
about?
* Re: [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
2013-01-27 20:09 ` Junio C Hamano
@ 2013-01-27 23:20 ` Jonathan Nieder
0 siblings, 0 replies; 7+ messages in thread
From: Jonathan Nieder @ 2013-01-27 23:20 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jeff King, git
Junio C Hamano wrote:
> It is not about a rough estimate nor common commits, though. The
> "everything local" check in question is interested in only one
> thing: are we _clearly_ up to date without fetching anything from
> them?
[...]
> Jonathan Nieder <jrnieder@gmail.com> writes:
>> * What is 49bb805e ("Do not ask for objects known to be complete",
>> 2005-10-19) trying to do? Are we hurting that in any way?
>
> An earlier fetch may have acquired all the necessary objects but may
> not have updated our refs for some reason (e.g. fast-forward check
> may have fired). In such a case, we may already have a history that
> is good (i.e. not missing paths down to the common history) in our
> repository that is not connected to any of our refs, and we can
> update our refs (or write to FETCH_HEAD) without asking the remote
> end to do any common ancestor computation or object transfer.
>
> That was the primary thing the patch wanted to do.
Interesting.
After a fetch of branches a, b, and c (which all have different
commit times) fails due to the fast-forward check, I have the
following objects:
a commit date = 25 January 2013
b commit date = 26 January 2013
c commit date = 27 January 2013
When I try to fetch again (forcibly this time), git notices that the
objects are available locally and sets "cutoff" to 27 January 2013 as
a hint about the last time I fetched.
mark_recent_complete_commits() is called, and since these objects are
not reachable from any local refs, none are visited in the walk, and
49bb805e does not affect the outcome of the fetch positively or
negatively.
On the other hand, if I fetched a, b, and c to local branches and then
built on top of them, 49bb805e ensures that a second fetch of the same
refs will know right away not to request objects for c. So it brings
some of the benefits of 2759cbc7 (git-fetch-pack: avoid unnecessary
zero packing, 2005-10-18) when the receiving side has either (A)
renamed refs or (B) built new history on top of them.
Correct?
It is only in case (B) that the cutoff matters. If we miscalculate
the cutoff, that means either (i) cutoff == 0, and we would lose the
benefit of the walk that finds complete commits, or (ii) cutoff is a
little earlier, and the walk to find complete commits would take a
little longer. Neither affects the result of the fetch, and neither
is likely to be a significant enough performance difference to offset
the benefit of Jeff's patch.
Thanks for explaining.
Jonathan