From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: bob.mastors@solidfire.com, snitzer@redhat.com, xfs@oss.sgi.com
Subject: Re: [PATCH] xfs: unmount does not wait for shutdown during unmount
Date: Mon, 14 Apr 2014 15:28:24 -0400
Message-ID: <20140414192824.GC62307@bfoster.bfoster>
In-Reply-To: <1397104955-7247-1-git-send-email-david@fromorbit.com>
On Thu, Apr 10, 2014 at 02:42:35PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> An interesting situation can occur if a log IO error occurs during
> the unmount of a filesystem. The cases reported have the same
> signature - the update of the superblock counters fails due to a log
> write IO error:
>
> XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa08a44a1
> XFS (dm-16): Log I/O Error Detected. Shutting down filesystem
> XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.
> XFS (dm-16): xfs_log_force: error 5 returned.
> XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)
>
> It can be seen that the last line of output contains a corrupt
> device name - this is because the log and xfs_mount structures have
> already been freed by the time this message is printed. A kernel
> oops closely follows.
>
> The issue is that the shutdown is occurring in an IO completion
> thread separate from the unmount. Once the shutdown processing has
> started and all the iclogs are marked with XLOG_STATE_IOERROR, the
> log shutdown code wakes anyone waiting on a log force so they can
> process the shutdown error. This wakes up the unmount code that
> is doing a synchronous transaction to update the superblock
> counters.
>
> The unmount path now sees all the iclogs are marked with
> XLOG_STATE_IOERROR and so never waits on them again, knowing that if
> it did, there would be no wakeup trigger for it and the unmount would
> hang. Hence the unmount runs through all the
> remaining code and frees all the filesystem structures while the
> xlog_iodone() is still processing the shutdown. When the log
> shutdown processing completes, xfs_do_force_shutdown() emits the
> "Please umount the filesystem and rectify the problem(s)" message,
> and xlog_iodone() then aborts all the objects attached to the iclog.
> An iclog that has already been freed....
>
> The real issue here is that there is no serialisation point between
> the log IO and the unmount. We have serialisation points for log
> writes, log forces, reservations, etc, but we don't actually have
> any code that waits for log IO to fully complete. We do that for all
> other types of object, so why not iclogbufs?
>
> Well, it turns out that we can easily do this. We've got xfs_buf
> handles, and that's what everyone else uses for IO serialisation.
> i.e. bp->b_sema. So, let's hold iclogbufs locked over IO, and only
> release the lock in xlog_iodone() when we are finished with the
> buffer. That way before we tear down the iclog, we can lock and
> unlock the buffer to ensure IO completion has finished completely
> before we tear it down.
>
Thanks for the write up...
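Just to make sure I have the idea straight: stripped of the XFS specifics,
the pattern is to take the buffer lock before issuing the I/O, drop it only
in the completion handler once nothing references the buffer any more, and
have the teardown path cycle that same lock before freeing anything. A
rough, purely illustrative userspace sketch of the idiom (none of these
names are XFS code; a POSIX semaphore stands in for bp->b_sema):

/*
 * Illustration only, not XFS code. The submitter takes the lock before
 * kicking off async work, the completion side drops it when it is done
 * touching the object, and teardown cycles the lock (lock + unlock) to
 * guarantee completion has finished before the object is freed.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct fake_iclog {
	sem_t	lock;		/* plays the role of bp->b_sema */
	int	data;
};

/* stand-in for xlog_iodone(): runs in a separate "completion" thread */
static void *fake_iodone(void *arg)
{
	struct fake_iclog *ic = arg;

	sleep(1);			/* pretend the I/O takes a while */
	printf("iodone: done with iclog, data=%d\n", ic->data);

	/* nothing may touch ic after the lock is dropped */
	sem_post(&ic->lock);
	return NULL;
}

int main(void)
{
	struct fake_iclog *ic = malloc(sizeof(*ic));
	pthread_t io;

	if (!ic)
		return 1;
	sem_init(&ic->lock, 0, 1);
	ic->data = 42;

	/* "submit" side: hold the lock over the async work */
	sem_wait(&ic->lock);
	pthread_create(&io, NULL, fake_iodone, ic);

	/* "unmount" side: cycle the lock to wait out I/O completion */
	sem_wait(&ic->lock);
	sem_post(&ic->lock);
	printf("teardown: completion has finished, safe to free\n");

	/*
	 * The join only cleans up the userspace thread; the kernel path
	 * has no such handle, which is why the lock cycle is needed.
	 */
	pthread_join(io, NULL);
	sem_destroy(&ic->lock);
	free(ic);
	return 0;
}

That matches what the diff below does in xlog_bdstrat()/xlog_iodone() and
xlog_dealloc_log(), if I'm reading it right.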
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_log.c | 53 ++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 44 insertions(+), 9 deletions(-)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 8497a00..08624dc 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1181,11 +1181,14 @@ xlog_iodone(xfs_buf_t *bp)
> /* log I/O is always issued ASYNC */
> ASSERT(XFS_BUF_ISASYNC(bp));
> xlog_state_done_syncing(iclog, aborted);
> +
> /*
> - * do not reference the buffer (bp) here as we could race
> - * with it being freed after writing the unmount record to the
> - * log.
> + * drop the buffer lock now that we are done. Nothing references
> + * the buffer after this, so an unmount waiting on this lock can now
> + * tear it down safely. As such, it is unsafe to reference the buffer
> + * (bp) after the unlock as we could race with it being freed.
> */
> + xfs_buf_unlock(bp);
> }
>
> /*
> @@ -1368,8 +1371,16 @@ xlog_alloc_log(
> bp = xfs_buf_alloc(mp->m_logdev_targp, 0, BTOBB(log->l_iclog_size), 0);
> if (!bp)
> goto out_free_log;
> - bp->b_iodone = xlog_iodone;
> +
> + /*
> + * The iclogbuf buffer locks are held over IO but we are not going to do
> + * IO yet. Hence unlock the buffer so that the log IO path can grab it
> + * when appropriate.
> + */
> ASSERT(xfs_buf_islocked(bp));
> + xfs_buf_unlock(bp);
> +
> + bp->b_iodone = xlog_iodone;
> log->l_xbuf = bp;
>
> spin_lock_init(&log->l_icloglock);
> @@ -1398,6 +1409,9 @@ xlog_alloc_log(
> if (!bp)
> goto out_free_iclog;
>
> + ASSERT(xfs_buf_islocked(bp));
> + xfs_buf_unlock(bp);
> +
> bp->b_iodone = xlog_iodone;
> iclog->ic_bp = bp;
> iclog->ic_data = bp->b_addr;
> @@ -1422,7 +1436,6 @@ xlog_alloc_log(
> iclog->ic_callback_tail = &(iclog->ic_callback);
> iclog->ic_datap = (char *)iclog->ic_data + log->l_iclog_hsize;
>
> - ASSERT(xfs_buf_islocked(iclog->ic_bp));
> init_waitqueue_head(&iclog->ic_force_wait);
> init_waitqueue_head(&iclog->ic_write_wait);
>
> @@ -1631,6 +1644,12 @@ xlog_cksum(
> * we transition the iclogs to IOERROR state *after* flushing all existing
> * iclogs to disk. This is because we don't want anymore new transactions to be
> * started or completed afterwards.
> + *
> + * We lock the iclogbufs here so that we can serialise against IO completion
> + * during unmount. We might be processing a shutdown triggered during unmount,
> + * and that can occur asynchronously to the unmount thread, and hence we need to
> + * ensure that it completes before tearing down the iclogbufs. Hence we need to
> + * hold the buffer lock across the log IO to achieve that.
> */
> STATIC int
> xlog_bdstrat(
> @@ -1638,6 +1657,7 @@ xlog_bdstrat(
> {
> struct xlog_in_core *iclog = bp->b_fspriv;
>
> + xfs_buf_lock(bp);
> if (iclog->ic_state & XLOG_STATE_IOERROR) {
> xfs_buf_ioerror(bp, EIO);
> xfs_buf_stale(bp);
> @@ -1645,7 +1665,8 @@ xlog_bdstrat(
> /*
> * It would seem logical to return EIO here, but we rely on
> * the log state machine to propagate I/O errors instead of
> - * doing it here.
> + * doing it here. Similarly, IO completion will unlock the
> + * buffer, so we don't do it here.
> */
> return 0;
> }
> @@ -1847,14 +1868,28 @@ xlog_dealloc_log(
> xlog_cil_destroy(log);
>
> /*
> - * always need to ensure that the extra buffer does not point to memory
> - * owned by another log buffer before we free it.
> + * Cycle all the iclogbuf locks to make sure all log IO completion
> + * is done before we tear down these buffers.
> */
> + iclog = log->l_iclog;
> + for (i = 0; i < log->l_iclog_bufs; i++) {
> + xfs_buf_lock(iclog->ic_bp);
> + xfs_buf_unlock(iclog->ic_bp);
> + iclog = iclog->ic_next;
> + }
> +
> + /*
> + * Always need to ensure that the extra buffer does not point to memory
> + * owned by another log buffer before we free it. Also, cycle the lock
> + * first to ensure we've completed IO on it.
> + */
> + xfs_buf_lock(log->l_xbuf);
> + xfs_buf_unlock(log->l_xbuf);
> xfs_buf_set_empty(log->l_xbuf, BTOBB(log->l_iclog_size));
> xfs_buf_free(log->l_xbuf);
>
> iclog = log->l_iclog;
> - for (i=0; i<log->l_iclog_bufs; i++) {
> + for (i = 0; i < log->l_iclog_bufs; i++) {
> xfs_buf_free(iclog->ic_bp);
> next_iclog = iclog->ic_next;
> kmem_free(iclog);
On reading the code, my initial thought was that the source of this race is
the xlog_state_do_callback() call down in the shutdown path, when invoked
from the log I/O completion handler. I think you pointed out in your
previous reply that even if we made that call selective (e.g., based on
whether the shutdown is due to a log error and thus xlog_state_do_callback()
can be expected to run from I/O completion anyway), we still access the
relevant data structures after the ic_force_wait wait_queue is woken.
Therefore, there would still be a race even if we bypassed the call from
within the shutdown path in this particular case.
The logic seems sane to me and I don't see any problems with it. My only
question is why use the buffer lock, as opposed to wiring b_iowait (or
something similar) into the log I/O completion handler? I ask because it
looks just a _bit_ funny to see lock/unlock cycles used purely as a
serialization mechanism -- do we use this kind of pattern anywhere else?
On the other hand, you could argue the lock protects the I/O in progress,
and yet another wait_queue in this codepath might be overkill, so I do
like the reuse of an existing mechanism from that standpoint. To be
concrete about the alternative I have in mind, see the (purely
illustrative) sketch below.
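None of the names below exist in XFS -- this is just a hypothetical,
userspace stand-in for the b_iowait/completion-style alternative: the I/O
completion handler signals an explicit "done" object (mutex + condvar here,
a wait_queue or struct completion in the kernel) and the teardown path
waits on it instead of cycling the buffer lock:

/*
 * Hypothetical illustration only, not XFS code: an explicit completion
 * object that the I/O done handler signals and teardown waits on.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct io_done {
	pthread_mutex_t	lock;
	pthread_cond_t	cond;
	bool		done;
};

/* would be called from the I/O completion handler */
static void io_done_signal(struct io_done *d)
{
	pthread_mutex_lock(&d->lock);
	d->done = true;
	pthread_cond_signal(&d->cond);
	pthread_mutex_unlock(&d->lock);
}

/* would be called from the teardown path before freeing the buffer */
static void io_done_wait(struct io_done *d)
{
	pthread_mutex_lock(&d->lock);
	while (!d->done)
		pthread_cond_wait(&d->cond, &d->lock);
	pthread_mutex_unlock(&d->lock);
}

/* stand-in for the log I/O completion handler */
static void *fake_iodone(void *arg)
{
	printf("iodone: finished with buffer, signalling waiter\n");
	io_done_signal(arg);
	return NULL;
}

int main(void)
{
	struct io_done d;
	pthread_t t;

	pthread_mutex_init(&d.lock, NULL);
	pthread_cond_init(&d.cond, NULL);
	d.done = false;

	pthread_create(&t, NULL, fake_iodone, &d);

	io_done_wait(&d);	/* teardown: wait for completion to finish */
	printf("teardown: safe to free the buffer now\n");

	pthread_join(t, NULL);
	pthread_cond_destroy(&d.cond);
	pthread_mutex_destroy(&d.lock);
	return 0;
}

The obvious downside is that this adds a second wait mechanism where the
buffer semaphore already gives us one, which is probably your point.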
Brian
> --
> 1.9.0
>