* [PATCH] xfs: log mount failures don't wait for buffers to be released
@ 2016-01-12 1:35 Dave Chinner
2016-01-12 14:15 ` Brian Foster
From: Dave Chinner @ 2016-01-12 1:35 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Recently I've been seeing xfs/051 fail on 1k block size filesystems.
Trying to trace the events during the test made the problem go
away, indicating that a race condition led to this ASSERT failure:
XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
.....
[<ffffffff814e1257>] xfs_free_perag+0x87/0xb0
[<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
[<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
[<ffffffff811d8800>] mount_bdev+0x180/0x1b0
[<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
[<ffffffff811d90a8>] mount_fs+0x38/0x170
[<ffffffff811f4347>] vfs_kern_mount+0x67/0x120
[<ffffffff811f7018>] do_mount+0x218/0xd60
[<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0
When I finally caught it with tracing enabled, I saw that AG 2 had
an elevated reference count and a buffer was responsible for it. I
tracked down the specific buffer, and found that it was missing the
final reference count release that would put it back on the LRU and
hence be found by xfs_wait_buftarg() calls in the log mount failure
handling.
The last four traces for the buffer before the assert were (trimmed
for relevance)
kworker/0:1-5259 xfs_buf_iodone: hold 2 lock 0 flags ASYNC
kworker/0:1-5259 xfs_buf_ioerror: hold 2 lock 0 error -5
mount-7163 xfs_buf_lock_done: hold 2 lock 0 flags ASYNC
mount-7163 xfs_buf_unlock: hold 2 lock 1 flags ASYNC
This is an async write that is completing, so there's nobody waiting
for it directly. Hence we call xfs_buf_relse() once all the
processing is complete. That does:
static inline void xfs_buf_relse(xfs_buf_t *bp)
{
	xfs_buf_unlock(bp);
	xfs_buf_rele(bp);
}
Now, it's clear that mount is waiting on the buffer lock, and that
it has been released by xfs_buf_relse() and gained by mount. This is
expected, because at this point the mount process is in
xfs_buf_delwri_submit() waiting for all the IO it submitted to
complete.
The mount process, however, only waits on the buffer lock.
xfs_buf_delwri_submit() waits for IO completion, but it doesn't wait
for the buffer reference owned by the IO to go away. The mount
process collects all the completions, fails the log recovery, and
the higher level code then calls xfs_wait_buftarg() to free all the
remaining buffers in the filesystem.
The issue is that on unlocking the buffer, the scheduler has decided
that the mount process has higher priority than the kworker
thread that is running the IO completion, and so immediately
switched contexts to the mount process from the semaphore unlock
code, hence preventing the kworker thread from finishing the IO
completion and releasing the IO reference to the buffer.
By the time xfs_wait_buftarg() is run, the buffer still has an
active reference and so isn't on the LRU list that the function
walks to free the remaining buffers. We therefore miss that buffer
and continue onwards to tear down the mount structures, at which
time we find a stray reference count on the perag
structure. On a non-debug kernel, this will be ignored and the
structure torn down and freed. Hence when the kworker thread is then
rescheduled and the buffer released and freed, it will access a
freed perag structure.
The problem here is that when the log mount fails, we still need to
quiesce the log to ensure that the IO workqueues have returned to
idle before we run xfs_wait_buftarg(). By synchronising the
workqueues, we ensure that all IO completions are fully processed,
not just to the point where buffers have been unlocked. This ensures
we don't end up in the situation above.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_buf.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 04d2bdd..08ad701 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1529,6 +1529,16 @@ xfs_wait_buftarg(
 	LIST_HEAD(dispose);
 	int loop = 0;
 
+	/*
+	 * We need to flush the buffer workqueue to ensure that all IO
+	 * completion processing is 100% done. Just waiting on buffer locks is
+	 * not sufficient for async IO as the reference count held over IO is
+	 * not released until after the buffer lock is dropped. Hence we need to
+	 * ensure here that all reference counts have been dropped before we
+	 * start walking the LRU list.
+	 */
+	drain_workqueue(btp->bt_mount->m_buf_workqueue);
+
 	/* loop until there is nothing left on the lru list. */
 	while (list_lru_count(&btp->bt_lru)) {
 		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
--
2.5.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH] xfs: log mount failures don't wait for buffers to be released
2016-01-12 1:35 [PATCH] xfs: log mount failures don't wait for buffers to be released Dave Chinner
@ 2016-01-12 14:15 ` Brian Foster
From: Brian Foster @ 2016-01-12 14:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Tue, Jan 12, 2016 at 12:35:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> [...]
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>