Re: another packed-refs race - Michael Haggerty

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Michael Haggerty <mhagger@alum.mit.edu>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Johan Herland <johan@herland.net>
Subject: Re: another packed-refs race
Date: Tue, 07 May 2013 06:32:12 +0200	[thread overview]
Message-ID: <518883CC.7050609@alum.mit.edu> (raw)
In-Reply-To: <20130506184122.GA23568@sigill.intra.peff.net>

On 05/06/2013 08:41 PM, Jeff King wrote:
> On Mon, May 06, 2013 at 02:03:40PM +0200, Michael Haggerty wrote:
> [...]
>> The loose refs cache is only used by the for_each_ref() functions, not
>> for looking up individual references.  Another approach would be to
>> change the top-level for_each_ref() functions to re-stat() all of the
>> loose references within the namespace that interests it, *then* verify
>> that the packed-ref cache is not stale, *then* start the iteration.
>> Then there would be no need to re-stat() files during the iteration.
>> (This would mean that we have to prohibit a second reference iteration
>> from being started while one is already in progress.)
> 
> Hmm. Thinking on this more, I'm not sure that we need to stat the loose
> references at all. We do not need to care if the loose refs are up to
> date or not. Well, we might care, but the point here is not to pretend
> that we have an up-to-date atomic view of the loose refs; it is only to
> make sure that the fallback-to-packed behavior does not lie to us about
> the existence or value of a ref.
> 
> IOW, it is OK to come up with a value for ref X that was true at the
> beginning of the program, even if it has been simultaneously updated.
> Our program can operate as if it happened in the instant it started,
> even though in real life it takes longer. But it is _not_ OK to miss the
> existence of a ref, or to come up with a value that it did not hold at
> some point during the program (e.g., it is not OK to return some cruft
> we wrote into the packed-refs file when we packed it three weeks ago).

This all sounds correct to me.

Another potential problem caused by the non-atomicity of loose reference
reading could be to confuse reachability tests if process 1 is reading
loose references while process 2 is renaming a reference:

1. Process 1 looks for refs/heads/aaaaa and finds it missing.

2. Process 2 renames zzzzz -> aaaaa

3. Process 1 looks for refs/heads/zzzzz and finds it missing.

Process 2 would think that any objects pointed to by aaaaa (formerly
zzzzz) are unreachable.  This would be unfortunate if it is doing an
upload-pack and very bad if it is doing a gc.  I wonder if this could be
a problem in practice?  (Gee, wouldn't it be nice to keep reflogs for
deleted refs? :-) )

> That is a weaker guarantee, and I think we can provide it with:
> 
>   1. Load all loose refs into cache for a particular enumeration.
> 
>   2. Make sure the packed-refs cache is up-to-date (by checking its
>      stat() information and reloading if necessary).
> 
>   3. Run the usual iteration over the loose/packed ref caches.

This sounds reasonable, too, though I'll need some more time to digest
your suggestion in detail.

> [...]
>> Of course, clearing (part of) the loose reference cache invalidates any
>> pointers that other callers might have retained to refnames in the old
>> version of the cache.  I've never really investigated what callers might
>> hold onto such pointers under the assumption that they will live to the
>> end of the process.
> 
> My proposal above gets rid of the need to invalidate the loose refs
> cache, so we can ignore that complexity.

The same concern applies to invalidating the packed-ref cache, which you
still want to do.

>> Given all of this trouble, there is an obvious question: why do we have
>> a loose reference cache in the first place?  I think there are a few
>> reasons:
>>
>> 1. In case one git process has to iterate through the same part of the
>> reference namespace more than once.  (Does this frequently happen?)
> 
> I'm not sure how often it happens. There are a few obvious candidates if
> you "git grep 'for_each[a-z_]*ref'", but I'm not sure if the actual
> performance impact is measurable (the cache should be warm after the
> first run-through).
> 
>> 2. Reading a bunch of loose references at the same time is more
>> efficient than reading them one by one interleaved with other file
>> reads.  (I think this is a significant win.)
> 
> Makes some sense, though I don't recall whether it was ever measured.

I think I measured it once and found it a significant benefit, though I
can't remember whether this was with a warm cache or only with a cold cache.

In fact, I experimented with some other strategies for loose reference
loading for performance reasons.  Currently loose references are read
into the cache lazily, one directory at a time.  I experimented with
changes in both directions:

* Preloading the whole tree of loose references before starting an
iteration.  As I recall, this was a performance *win*.  It was on my
to-do list of things to pursue when I have some free time (ha, ha).  I
mostly wanted to check first that there are not callers who abort the
iteration soon after starting it.  For example, imagine a caller who
tries to determine "are there any tags at all" by iterating over
"refs/tags" with a callback that just returns 1; such a caller would
suffer the cost of reading all of the loose references in "refs/tags".

* Lazy loading loose references one reference at a time.  The ideas was
that this would allow the reference cache to be used for
single-reference lookups.  This change alone caused a significant
performance loss, so it would have had to be combined with code for
preloading directories or subtrees before a for_each_ref() iteration.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

next prev parent reply	other threads:[~2013-05-07  4:32 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-05-03  8:38 another packed-refs race Jeff King
2013-05-03  9:26 ` Johan Herland
2013-05-03 17:28   ` Jeff King
2013-05-03 18:26     ` Jeff King
2013-05-03 21:02       ` Johan Herland
2013-05-06 12:12     ` Michael Haggerty
2013-05-06 18:44       ` Jeff King
2013-05-03 21:21 ` Jeff King
2013-05-06 12:03 ` Michael Haggerty
2013-05-06 18:41   ` Jeff King
2013-05-06 22:18     ` Jeff King
2013-05-07  4:32     ` Michael Haggerty [this message]
2013-05-07  4:44       ` Jeff King
2013-05-07  8:03         ` Michael Haggerty
2013-05-07  2:36 ` [PATCH 0/4] fix packed-refs races Jeff King
2013-05-07  2:38   ` [PATCH 1/4] resolve_ref: close race condition for packed refs Jeff King
2013-05-12 22:56     ` Michael Haggerty
2013-05-16  3:47       ` Jeff King
2013-05-16  5:50         ` Michael Haggerty
2013-05-12 23:26     ` Michael Haggerty
2013-06-11 14:26     ` [PATCH 0/4] Fix a race condition when reading loose refs Michael Haggerty
2013-06-11 14:26       ` [PATCH 1/4] resolve_ref_unsafe(): extract function handle_missing_loose_ref() Michael Haggerty
2013-06-11 14:26       ` [PATCH 2/4] resolve_ref_unsafe(): handle the case of an SHA-1 within loop Michael Haggerty
2013-06-11 14:26       ` [PATCH 3/4] resolve_ref_unsafe(): nest reference-reading code in an infinite loop Michael Haggerty
2013-06-11 14:26       ` [PATCH 4/4] resolve_ref_unsafe(): close race condition reading loose refs Michael Haggerty
2013-06-12  8:04         ` Jeff King
2013-06-13  8:22         ` Thomas Rast
2013-06-14  7:17           ` Michael Haggerty
2013-06-11 20:57       ` [PATCH 0/4] Fix a race condition when " Junio C Hamano
2013-05-07  2:39   ` [PATCH 2/4] add a stat_validity struct Jeff King
2013-05-13  2:29     ` Michael Haggerty
2013-05-13  3:00       ` [RFC 0/2] Separate stat_data from cache_entry Michael Haggerty
2013-05-13  3:00         ` [RFC 1/2] Extract a struct " Michael Haggerty
2013-05-13  3:00         ` [RFC 2/2] add a stat_validity struct Michael Haggerty
2013-05-13  5:10         ` [RFC 0/2] Separate stat_data from cache_entry Junio C Hamano
2013-05-16  3:51       ` [PATCH 2/4] add a stat_validity struct Jeff King
2013-05-07  2:43   ` [PATCH 3/4] get_packed_refs: reload packed-refs file when it changes Jeff King
2013-05-07  2:54     ` [PATCH 0/2] peel_ref cleanups changes Jeff King
2013-05-07  2:56       ` [PATCH 1/2] peel_ref: rename "sha1" argument to "peeled" Jeff King
2013-05-07  3:06       ` [PATCH 2/2] peel_ref: refactor for safety with simultaneous update Jeff King
2013-05-09 19:18     ` [PATCH 3/4] get_packed_refs: reload packed-refs file when it changes Eric Sunshine
2013-05-13  2:43     ` Michael Haggerty
2013-05-07  2:51   ` [PATCH 4/4] for_each_ref: load all loose refs before packed refs Jeff King
2013-05-07  6:40   ` [PATCH 0/4] fix packed-refs races Junio C Hamano
2013-05-07 14:19     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=518883CC.7050609@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=johan@herland.net \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.