[Ocfs2-devel] [PATCH 3/4] ocfs2: Fixes tiny race between mount and recovery

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sunil Mushran <Sunil.Mushran@oracle.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] [PATCH 3/4] ocfs2: Fixes tiny race between mount and recovery
Date: Tue, 08 Jul 2008 10:08:38 -0700	[thread overview]
Message-ID: <48739F16.7000205@oracle.com> (raw)
In-Reply-To: <4872C7A4.1010507@oracle.com>

Instead of using generation, we could use ij_pad. Renamed to
something appropriate.

Sunil Mushran wrote:
> ok.. going back to the drawing board. I am putting in writing the
> scheme the two of us thrashed out here. Please review.
>
> Problem:
> How does a recovery thread know that the slot it is trying to recover
> has already been mounted by another node (and thus already recovered)?
>
> Our first solution was to trylock the journal lock with the assumption
> that if it failed, the slot was recovered by another node.
>
> As Joel pointed out, the problem here is that as the fs is informed
> of node death before the dlm, a trylock on journal failing could
> be a case of dlm not yet registering the node death. And skipping
> journal replay in these circumstances would be asking for trouble.
> So we have to make a blocking request... but if we do that blindly,
> we could hang as another node could have mounted that slot.
>
> Possible Solution:
> Use the i_generation in the journal inode as a recovery counter.
> Meaning, we increment it each time we recover the journal. That
> is, from mount thread in ocfs2_journal_load() and from recovery
> thread in ocfs2_replay_journal(). In both places, we have the
> superblock lock making the writes safe.
>
> The mount thread will read the generation#s and save it. In the
> recovery thread, after taking the superblock lock, the thread
> reads the journal inode. If it finds the generation is not the same,
> it can safely skip the recovery. If not, it will go ahead with the
> blocking lock.
>
> As all nodes get the down event, they all go thru with the recovery
> process if only to detect that no recovery is required. We will
> use that to keep the recovery generation for all journal slots uptodate.
>
> Sunil
>
> Joel Becker wrote:
>> On Mon, Jul 07, 2008 at 01:26:19PM -0700, Sunil Mushran wrote:
>>  
>>> This patch fixes a tiny race between recovery and mount that could 
>>> otherwise
>>> lead to a hang.
>>>     
>>
>>     First off, does this still exist in mainline?
>>
>> <snip>
>>  
>>  
>>> The last step of the mount process calls for it to drop the EX on 
>>> the superblock
>>> lock. The node holds onto the same lock level on its journal/slot lock.
>>>
>>> The recovery thread then picks up the EX on the superblock lock and 
>>> then takes
>>> the same lock for the journal/slot it is recovering.
>>>
>>> This exposes a tiny race as another node mounting the volume could 
>>> race the
>>> recovery thread for the EX on the superblock lock.
>>>
>>> If that happens, that node could be assigned a dirty slot which it 
>>> would recover.
>>> It too will then drop the EX on the superblock lock but hold onto 
>>> the same
>>> lock level on its journal/slot lock.
>>>
>>> Now when the recovery thread on the first node gets back the EX on 
>>> the superblock
>>> lock, it will promptly attempt to get the EX on the journal/slot 
>>> lock of the node
>>> it thinks is dirty but since has not only been recovered but also 
>>> locked by
>>> another node.
>>>
>>> Hang.
>>>
>>> The fix here is to make the journal/slot EX lock request during 
>>> recovery a trylock
>>> operation. If the trylock fails, it would indicate that that slot is 
>>> in use and
>>> thus recovered.
>>>     
>>
>>     How do we distinguish this from a node that hasn't been evicted
>> yet?  If you recall, we first tried to eliminate the recovery map
>> completely, having recovery just trylock every journal to find the dead
>> node.  But if the node hasn't been evicted, the trylock fails.  Isn't
>> this the same thing?
>>     Let me try putting it another way.  The recovery thread just
>> sees entries.  It doesn't know whether they came from mount or from a
>> dead node event.  So, we have an entry.  Sure, it could be an entry that
>> was locked during mount, then unlocked, and now we're trying to relock.
>> The only way the trylock fails is if someone else recovered it.  That's
>> safe.  If it came from another path, though, we expect to sleep on that
>> lock - that's how we know the dlm has evicted it, because the lock
>> blocks until then.  How do we determine this is the case and sleep on
>> it?  Otherwise, our trylock fails, we skip it, then the dlm evicts it,
>> and noone has recovered it.  We then continue assuming the journal was
>> replayed, which it was not.
>>     I'm missing something, right?
>>
>> Joel
>>
>>   
>

next prev parent reply	other threads:[~2008-07-08 17:08 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-07-07 20:26 [Ocfs2-devel] Four patches for ocfs2 1.4 Sunil Mushran
2008-07-07 20:26 ` [Ocfs2-devel] [PATCH 1/4] ocfs2: Silence an error message in ocfs2_file_aio_read() Sunil Mushran
2008-07-07 20:26 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: Change the recovery map to an array of node numbers Sunil Mushran
2008-07-07 20:26 ` [Ocfs2-devel] [PATCH 3/4] ocfs2: Fixes tiny race between mount and recovery Sunil Mushran
2008-07-07 20:50   ` Joel Becker
     [not found]     ` <4872C7A4.1010507@oracle.com>
2008-07-08 17:08       ` Sunil Mushran [this message]
2008-07-08 18:17         ` Joel Becker
2008-07-09 17:35           ` Mark Fasheh
2008-07-09 18:29             ` Joel Becker
2008-07-07 20:51   ` Joel Becker
2008-07-07 20:26 ` [Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Fixes oops in dlm_new_lockres() Sunil Mushran

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48739F16.7000205@oracle.com \
    --to=sunil.mushran@oracle.com \
    --cc=ocfs2-devel@oss.oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.