From: Michael Haggerty <mhagger@alum.mit.edu>
To: Jeff King <peff@peff.net>, git@vger.kernel.org
Subject: Re: [PATCH 13/16] prune: keep objects reachable from recent objects
Date: Tue, 07 Oct 2014 18:29:00 +0200 [thread overview]
Message-ID: <543414CC.9020104@alum.mit.edu> (raw)
In-Reply-To: <20141003203931.GM16293@peff.net>
On 10/03/2014 10:39 PM, Jeff King wrote:
> [...]
> Instead, this patch pushes the extra work onto prune, which
> runs less frequently (and has to look at the whole object
> graph anyway). It creates a new category of objects: objects
> which are not recent, but which are reachable from a recent
> object. We do not prune these objects, just like the
> reachable and recent ones.
>
> This lets us avoid the recursive check above, because if we
> have an object, even if it is unreachable, we should have
> its referent:
>
> - if we are creating new objects, then we cannot create
> the parent object without having the child
>
> - and if we are pruning objects, will not prune the child
> if we are keeping the parent
>
> The big exception would be if one were to write the object
> in a way that avoided referential integrity (e.g., using
> hash-object). But if you are in the habit of doing that, you
> deserve what you get.
>
> Naively, the simplest way to implement this would be to add
> all recent objects as tips to the reachability traversal.
> However, this does not perform well. In a recently-packed
> repository, all reachable objects will also be recent, and
> therefore we have to consider each object twice (both as a
> tip, and when we reach it in the traversal). I tested this,
> and it added about 10s to a 30s prune on linux.git. This
> patch instead performs the normal reachability traversal
> first, then follows up with a second traversal for recent
> objects, skipping any that have already been marked.
I haven't read all of the old code, but if I understand correctly this
is your new algorithm:
1. Walk from all references etc., marking reachable objects.
2. Iterate over *all* objects, in no particular order, skipping the
objects that are already known to be reachable. Use any unreachable
object that has a recent mtime as a tip for a second traversal that
marks all of its references as "to-keep".
3. Iterate over any objects that are not marked "to-keep". (I assume
that this iteration is in no particular order.) For each object:
* [Presumably] verify that its mtime is still "old"
* If so, prune the object
I see some problems with this.
* The one that you mentioned in your cover letter, namely that prune's
final mtime check is not atomic with the object deletion. I agree
that this race is so much shorter than the others that we can accept
a solution that doesn't eliminate it, so let's forget about this one.
* If the final mtime check fails, then the object is recognized as new
and not pruned. But does that prevent its referents from being pruned?
* When this situation is encountered, you would have to start another
object traversal starting at the "renewed" object to mark its
referents "to-keep". I don't see that you do this. Another, much
less attractive alternative would be to abort the prune operation
if this situation arises. But even if you do one of these...
* ...assuming the iteration in step 3 is in no defined order, a
referent might *already* have been pruned before you notice the
"renewed" object.
So although your changes are a big improvement, it seems to me that they
still leave a race with a window approximately as long as the time it
takes to scan and prune the unreachable objects.
I think that the only way to get rid of that race is to delete objects
in parent-to-child order; in other words, *only prune an object after
all objects that refer to it have been pruned*. This could be done by
maintaining reference counts of the to-be-pruned objects and only
deleting an object once its reference count is zero.
The next point that I'm confused by is what happens when a new object or
reference is created while prune is running, and the new object or
reference refers to old objects. I think when we discussed this
privately I claimed that the following freshenings would not be
necessary, but now I think that they are (sorry about that!).
Let's take the simpler case first. Suppose I run the following command
between steps 1 and 3:
git update-ref refs/heads/newbranch $COMMIT
, where $COMMIT is a previously-unreachable object. This doesn't affect
the mtime of $COMMIT, does it? So how does prune know that it shouldn't
delete $COMMIT?
-> So ISTM that updating a reference (or any other traversal starting
point, like the index) must freshen the mtime of any object newly
referred to.
A more complicated case: suppose I create a new $COMMIT referring to an
old $TREE during step 2, *after* prune has scanned the directory that
now contains $COMMIT. (I.e., the scan in step 2 never notices $COMMIT.)
Then I create a new reference pointing at $COMMIT. (I.e., the scan in
step 1 never noticed that the reference exists.) None of this affects
the mtime of $TREE, does it? So how does prune know that it mustn't
prune $TREE?
-> It seems to me that the creation of $COMMIT has to freshen the mtime
of $TREE, so that the final mtime check in step 3 realizes that it
shouldn't prune $TREE. Or to generalize, whenever a new object is
created which refers to existing objects, the direct referents of the
new object have to have their mtimes freshened. However, when an attempt
to write a new object accidentally coincides with an object that already
exists, *that* object needs to be freshened but *its* referents do *not*.
I hope I understood that all correctly...
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
next prev parent reply other threads:[~2014-10-07 16:29 UTC|newest]
Thread overview: 58+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-03 20:20 [PATCH 0/16] make prune mtime-checking more careful Jeff King
2014-10-03 20:21 ` [PATCH 01/16] foreach_alt_odb: propagate return value from callback Jeff King
2014-10-03 22:55 ` René Scharfe
2014-10-04 0:31 ` Jeff King
2014-10-03 20:22 ` [PATCH 02/16] isxdigit: cast input to unsigned char Jeff King
2014-10-03 20:22 ` [PATCH 03/16] object_array: factor out slopbuf-freeing logic Jeff King
2014-10-07 11:25 ` Michael Haggerty
2014-10-08 7:36 ` Jeff King
2014-10-08 8:40 ` Michael Haggerty
2014-10-08 8:55 ` Jeff King
2014-10-03 20:22 ` [PATCH 04/16] object_array: add a "clear" function Jeff King
2014-10-03 20:23 ` [PATCH 05/16] clean up name allocation in prepare_revision_walk Jeff King
2014-10-03 20:24 ` [PATCH 06/16] reachable: clear pending array after walking it Jeff King
2014-10-03 20:25 ` [PATCH 07/16] t5304: use test_path_is_* instead of "test -f" Jeff King
2014-10-03 22:12 ` Junio C Hamano
2014-10-03 20:27 ` [PATCH 08/16] t5304: use helper to report failure of "test foo = bar" Jeff King
2014-10-03 22:17 ` Junio C Hamano
2014-10-04 0:13 ` Jeff King
2014-10-07 13:21 ` Michael Haggerty
2014-10-07 17:29 ` Junio C Hamano
2014-10-07 20:18 ` Jeff King
2014-10-07 20:35 ` Junio C Hamano
2014-10-07 21:29 ` Jeff King
2014-10-07 21:53 ` Junio C Hamano
2014-10-07 22:17 ` Michael Haggerty
2014-10-08 1:13 ` Jeff King
2014-10-08 16:58 ` Junio C Hamano
2014-10-07 21:16 ` Junio C Hamano
2014-10-03 20:29 ` [PATCH 09/16] prune: factor out loose-object directory traversal Jeff King
2014-10-03 22:19 ` Junio C Hamano
2014-10-04 0:24 ` Jeff King
2014-10-07 14:07 ` Michael Haggerty
2014-10-08 7:33 ` Jeff King
2014-10-03 20:31 ` [PATCH 10/16] count-objects: do not use xsize_t when counting object size Jeff King
2014-10-03 20:31 ` [PATCH 11/16] count-objects: use for_each_loose_file_in_objdir Jeff King
2014-10-03 20:32 ` [PATCH 12/16] sha1_file: add for_each iterators for loose and packed objects Jeff King
2014-10-05 8:15 ` René Scharfe
2014-10-05 10:47 ` Ramsay Jones
2014-10-03 20:39 ` [PATCH 13/16] prune: keep objects reachable from recent objects Jeff King
2014-10-03 21:47 ` Junio C Hamano
2014-10-04 0:09 ` Jeff King
2014-10-04 0:30 ` Jeff King
2014-10-04 3:04 ` Junio C Hamano
2014-10-07 16:29 ` Michael Haggerty [this message]
2014-10-08 7:19 ` Jeff King
2014-10-08 10:37 ` Michael Haggerty
2014-10-03 20:39 ` [PATCH 14/16] pack-objects: refactor unpack-unreachable expiration check Jeff King
2014-10-03 20:40 ` [PATCH 15/16] pack-objects: match prune logic for discarding objects Jeff King
2014-10-03 20:41 ` [PATCH 16/16] write_sha1_file: freshen existing objects Jeff King
2014-10-03 21:29 ` Junio C Hamano
2014-10-04 0:01 ` Jeff King
2014-10-05 9:12 ` René Scharfe
2014-10-03 22:20 ` [PATCH 0/16] make prune mtime-checking more careful Junio C Hamano
2014-10-04 22:22 ` Junio C Hamano
2014-10-05 9:19 ` René Scharfe
2014-10-06 1:42 ` Jeff King
2014-10-08 8:31 ` Jeff King
2014-10-08 17:03 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=543414CC.9020104@alum.mit.edu \
--to=mhagger@alum.mit.edu \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.