From mboxrd@z Thu Jan 1 00:00:00 1970
From: Steven Whitehouse
Date: Tue, 11 Aug 2009 09:42:39 +0100
Subject: [Cluster-devel] GFS2: Umount recovery race fix
In-Reply-To: <20090810223159.GA28306@redhat.com>
References: <1242306797.29604.330.camel@localhost.localdomain>
	<20090810223159.GA28306@redhat.com>
Message-ID: <1249980159.3337.3.camel@localhost.localdomain>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi,

On Mon, 2009-08-10 at 17:31 -0500, David Teigland wrote:
> On Thu, May 14, 2009 at 02:13:17PM +0100, Steven Whitehouse wrote:
> >
> > This patch fixes a race condition where we can receive recovery
> > requests part way through processing a umount. This was causing
> > problems since the recovery thread had already gone away.
>
> Do you have some logs showing specifically what happened in both
> kernel and userland?
>
Yes, the one you sent to me on Fri, 8 May 2009 11:34:54 -0500 (17:34
BST). Next time, please file a bugzilla so that we have a proper
record of the issue.

> > Looking in more detail at the recovery code, it was really trying
> > to implement a slight variation on a work queue, and that happens to
> > align nicely with the recently introduced slow-work subsystem. As a
> > result I've updated the code to use slow-work, rather than its own
> > home-grown variety of work queue.
> >
> > When using the wait_on_bit() function, I noticed that the wait
> > function that was supplied as an argument was appearing in the WCHAN
> > field, so I've updated the function names in order to produce more
> > meaningful output.
>
> That description doesn't explain how the specific bug was fixed.
>
The bug was fixed by no longer allowing recovery requests on a
filesystem once its umount has begun.

> I'm guessing that this is the patch that broke gfs2 recovery, although
> there are others that muck around with the sysfs control files.
>
> This is what appears in /var/log/messages,
>
> gfs_controld[7901]: start_journal_recovery 3 error -1
>
> And from the daemon debug log,
>
> 1249942342 foo start_journal_recovery jid 3
> 1249942342 foo set /sys/fs/gfs2/bull:foo/lock_module/recover to 3
> 1249942342 foo set open /sys/fs/gfs2/bull:foo/lock_module/recover error -1 13
> 1249942342 start_journal_recovery 3 error -1
>
> Dave
>
I'll have a look - EACCES (errno 13) is not one of the error values
which the recover code returns, though,

Steve.
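
P.S. A few sketches for the archives, in case they help. On the error
value first: the daemon log line shows "open ... error -1 13", which
suggests it is the open(2) of the sysfs file that failed, with errno 13
(EACCES) coming from the VFS permission check - so the kernel recover
code never ran at all. The daemon side presumably does something along
these lines (illustrative only, not the actual gfs_controld code):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Ask the kernel to recover journal "jid" by writing its number to
 * the sysfs recover file. Returns 0 on success, -1 on error. */
static int set_sysfs_recover(const char *table, int jid)
{
	char path[256], val[16];
	int fd, len, rv = 0;

	snprintf(path, sizeof(path),
		 "/sys/fs/gfs2/%s/lock_module/recover", table);

	fd = open(path, O_WRONLY);
	if (fd < 0) {
		/* This is the "error -1 13" case above: open() fails
		 * with errno 13 (EACCES) before the kernel's store
		 * routine is ever reached. */
		fprintf(stderr, "set open %s error %d %d\n",
			path, fd, errno);
		return -1;
	}

	len = snprintf(val, sizeof(val), "%d", jid);
	if (write(fd, val, len) != len)
		rv = -1;
	close(fd);
	return rv;
}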
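
As for how the race itself was closed: the store routine behind that
sysfs file now refuses recovery requests once umount has begun, instead
of handing work to a recovery thread which may already have gone away.
Schematically, it is a test of a "no recovery" flag set during umount
(this is the shape of the check, not the literal code from the tree):

static ssize_t recover_store(struct gfs2_sbd *sdp, const char *buf,
			     size_t len)
{
	unsigned jid;

	if (sscanf(buf, "%u", &jid) != 1)
		return -EINVAL;

	/* Refuse recovery requests once umount has started;
	 * previously such a request raced with the recovery
	 * thread being torn down. */
	if (test_bit(SDF_NORECOVERY, &sdp->sd_flags))
		return -ESHUTDOWN;

	/* ... look up journal "jid" and queue its recovery ... */
	return len;
}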
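
The slow-work conversion presumably follows the subsystem's standard
pattern: each journal embeds a struct slow_work, a recovery request is
just a slow_work_enqueue() on it, and the three slow_work_ops callbacks
handle reference counting plus the actual work. Roughly (struct and
function names here are illustrative, not the ones in the tree):

#include <linux/slow-work.h>

struct my_jdesc {
	struct slow_work jd_work;	/* one work item per journal */
	/* ... journal state ... */
};

static int recover_get_ref(struct slow_work *work)
{
	/* take a reference on the object containing the work item */
	return 0;
}

static void recover_put_ref(struct slow_work *work)
{
	/* drop the reference taken in recover_get_ref() */
}

static void recover_execute(struct slow_work *work)
{
	struct my_jdesc *jd = container_of(work, struct my_jdesc,
					   jd_work);
	/* perform the journal recovery for jd here */
}

static const struct slow_work_ops recover_ops = {
	.get_ref = recover_get_ref,
	.put_ref = recover_put_ref,
	.execute = recover_execute,
};

/* Once at startup: slow_work_register_user();
 * Per journal:     slow_work_init(&jd->jd_work, &recover_ops);
 * To request:      slow_work_enqueue(&jd->jd_work);           */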
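
And the WCHAN point: the action function passed to wait_on_bit() is
where the waiting task actually sleeps, so its name is what ps(1)
reports in the WCHAN field. Giving it a descriptive name costs nothing
and makes that output far more useful, e.g. (again, names illustrative):

/* The name of this function shows up in WCHAN while a task waits
 * for journal recovery, so make it say what is being waited for. */
static int gfs2_recovery_wait(void *word)
{
	schedule();
	return 0;
}

	/* caller, waiting for the recovery bit to clear: */
	wait_on_bit(&jd->jd_flags, JDF_RECOVERY, gfs2_recovery_wait,
		    TASK_UNINTERRUPTIBLE);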