From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sunil Mushran Date: Tue, 08 Jul 2008 10:08:38 -0700 Subject: [Ocfs2-devel] [PATCH 3/4] ocfs2: Fixes tiny race between mount and recovery In-Reply-To: <4872C7A4.1010507@oracle.com> References: <1215462380-12975-1-git-send-email-sunil.mushran@oracle.com> <1215462380-12975-4-git-send-email-sunil.mushran@oracle.com> <20080707205037.GB8651@mail.oracle.com> <4872C7A4.1010507@oracle.com> Message-ID: <48739F16.7000205@oracle.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: ocfs2-devel@oss.oracle.com Instead of using generation, we could use ij_pad. Renamed to something appropriate. Sunil Mushran wrote: > ok.. going back to the drawing board. I am putting in writing the > scheme the two of us thrashed out here. Please review. > > Problem: > How does a recovery thread know that the slot it is trying to recover > has already been mounted by another node (and thus already recovered)? > > Our first solution was to trylock the journal lock with the assumption > that if it failed, the slot was recovered by another node. > > As Joel pointed out, the problem here is that as the fs is informed > of node death before the dlm, a trylock on journal failing could > be a case of dlm not yet registering the node death. And skipping > journal replay in these circumstances would be asking for trouble. > So we have to make a blocking request... but if we do that blindly, > we could hang as another node could have mounted that slot. > > Possible Solution: > Use the i_generation in the journal inode as a recovery counter. > Meaning, we increment it each time we recover the journal. That > is, from mount thread in ocfs2_journal_load() and from recovery > thread in ocfs2_replay_journal(). In both places, we have the > superblock lock making the writes safe. > > The mount thread will read the generation#s and save it. In the > recovery thread, after taking the superblock lock, the thread > reads the journal inode. If it finds the generation is not the same, > it can safely skip the recovery. If not, it will go ahead with the > blocking lock. > > As all nodes get the down event, they all go thru with the recovery > process if only to detect that no recovery is required. We will > use that to keep the recovery generation for all journal slots uptodate. > > Sunil > > Joel Becker wrote: >> On Mon, Jul 07, 2008 at 01:26:19PM -0700, Sunil Mushran wrote: >> >>> This patch fixes a tiny race between recovery and mount that could >>> otherwise >>> lead to a hang. >>> >> >> First off, does this still exist in mainline? >> >> >> >> >>> The last step of the mount process calls for it to drop the EX on >>> the superblock >>> lock. The node holds onto the same lock level on its journal/slot lock. >>> >>> The recovery thread then picks up the EX on the superblock lock and >>> then takes >>> the same lock for the journal/slot it is recovering. >>> >>> This exposes a tiny race as another node mounting the volume could >>> race the >>> recovery thread for the EX on the superblock lock. >>> >>> If that happens, that node could be assigned a dirty slot which it >>> would recover. >>> It too will then drop the EX on the superblock lock but hold onto >>> the same >>> lock level on its journal/slot lock. >>> >>> Now when the recovery thread on the first node gets back the EX on >>> the superblock >>> lock, it will promptly attempt to get the EX on the journal/slot >>> lock of the node >>> it thinks is dirty but since has not only been recovered but also >>> locked by >>> another node. >>> >>> Hang. >>> >>> The fix here is to make the journal/slot EX lock request during >>> recovery a trylock >>> operation. If the trylock fails, it would indicate that that slot is >>> in use and >>> thus recovered. >>> >> >> How do we distinguish this from a node that hasn't been evicted >> yet? If you recall, we first tried to eliminate the recovery map >> completely, having recovery just trylock every journal to find the dead >> node. But if the node hasn't been evicted, the trylock fails. Isn't >> this the same thing? >> Let me try putting it another way. The recovery thread just >> sees entries. It doesn't know whether they came from mount or from a >> dead node event. So, we have an entry. Sure, it could be an entry that >> was locked during mount, then unlocked, and now we're trying to relock. >> The only way the trylock fails is if someone else recovered it. That's >> safe. If it came from another path, though, we expect to sleep on that >> lock - that's how we know the dlm has evicted it, because the lock >> blocks until then. How do we determine this is the case and sleep on >> it? Otherwise, our trylock fails, we skip it, then the dlm evicts it, >> and noone has recovered it. We then continue assuming the journal was >> replayed, which it was not. >> I'm missing something, right? >> >> Joel >> >> >