public inbox for linux-xfs@vger.kernel.org
* [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push
@ 2013-01-29 20:42 Brian Foster
  2013-01-29 20:42 ` [PATCH RFC 1/2] xfs: conditionally force log on trylock failure of pinned/stale buf Brian Foster
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Brian Foster @ 2013-01-29 20:42 UTC (permalink / raw)
  To: xfs

Hi all,

This patchset fixes a spinlock recursion we've reproduced initially on RHEL
kernels[1]. The problem is that we issue an xfs_log_force() via
xfs_buf_trylock() with the xa_lock held and ultimately drive down into
xlog_assign_tail_lsn(), which attempts to reacquire xa_lock[2].
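
The deadlock pattern can be modeled outside the kernel. The sketch below is illustrative only: the model_* names are hypothetical stand-ins for the real call chain (xfsaild takes xa_lock, pushes the buf item, and the log force path reacquires xa_lock via xlog_assign_tail_lsn()), and a flag plus owner check stands in for the debug spinlock:

```c
/*
 * Userspace model of the recursion, for illustration only: all model_*
 * names are hypothetical stand-ins for the XFS call chain.
 */
#include <assert.h>
#include <pthread.h>

static pthread_t xa_lock_owner;
static int xa_lock_held;

/*
 * A non-recursive lock: return 0 on acquire, -1 if the calling thread
 * already holds it.  A real spinlock would spin forever here; the
 * kernel's debug variant prints "BUG: spinlock recursion" instead.
 */
static int model_spin_lock(void)
{
	if (xa_lock_held && pthread_equal(xa_lock_owner, pthread_self()))
		return -1;		/* recursion detected */
	xa_lock_held = 1;
	xa_lock_owner = pthread_self();
	return 0;
}

/* Stand-in for xlog_assign_tail_lsn(), which takes xa_lock. */
static int model_assign_tail_lsn(void)
{
	return model_spin_lock();
}

/*
 * Stand-in for the xfsaild push path: xa_lock is already held when the
 * trylock-failure log force drives down into the tail-lsn update.
 */
static int model_buf_item_push(void)
{
	assert(model_spin_lock() == 0);	/* xfsaild acquires xa_lock */
	return model_assign_tail_lsn();	/* reacquisition attempt fails */
}
```

Calling model_buf_item_push() returns -1, mirroring the reacquisition point in the stack trace below.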

Note that this lockup was difficult to reproduce and I was not able to
reproduce on an upstream kernel without a hack to comment out the pinned buf
check in xfs_buf_item_push() (presumably because the log force itself only
happens when the buf is pinned, so the window here is tight).

This patchset is what I'm testing to avoid the lockup, but I'm posting this RFC
to get some early thoughts:

- Patch 1 - Creates a flag to conditionally force the log in xfs_buf_trylock().
  The alternative I considered is to pull out the check and log force and
  sprinkle that code amongst the trylock callers.
- Patch 2 - Utilizes the flag created in patch 1 and duplicates the log force
  in xfs_buf_item_push() after dropping xa_lock.

The change in patch 2 makes me wonder how important the immediate flush is in
the context of xfsaild_push(), where we already pend up a flush if the item is
pinned. IOWs, I wonder if replacing what I have now with something like the
following would be acceptable and cleaner:

	if (!__xfs_buf_trylock(bp, false)) {
		if (xfs_buf_ispinned(bp))
			return XFS_ITEM_PINNED;
		return XFS_ITEM_LOCKED;
	}

Thoughts appreciated.

Brian

[1] - http://bugzilla.redhat.com/show_bug.cgi?id=896224
[2] - stacktrace:

BUG: spinlock recursion on CPU#5, xfsaild/dm-3/2690
Pid: 2690, comm: xfsaild/dm-3 Not tainted 3.8.0-rc1+ #46
Call Trace: 
 [<ffffffff8163211c>] spin_dump+0x8a/0x8f
 [<ffffffff81632142>] spin_bug+0x21/0x26
 [<ffffffff812f66a1>] do_raw_spin_lock+0x101/0x150
 [<ffffffff816378ce>] _raw_spin_lock+0xe/0x10
 [<ffffffffa0522c85>] xlog_assign_tail_lsn+0x25/0x50 [xfs]
 [<ffffffffa0523286>] xlog_state_release_iclog+0x86/0xd0 [xfs]
 [<ffffffffa0523c89>] xlog_write+0x569/0x710 [xfs]
 [<ffffffffa052555c>] xlog_cil_push+0x29c/0x3c0 [xfs]
 [<ffffffffa04cbfe2>] ? xfs_buf_get_map+0xf2/0x1b0 [xfs]
 [<ffffffffa0525d17>] xlog_cil_force_lsn+0x157/0x160 [xfs]
 [<ffffffffa04cced1>] ? xfs_buf_read_map+0x31/0x130 [xfs]
 [<ffffffffa0529e99>] ? xfs_trans_read_buf_map+0x279/0x4b0 [xfs]
 [<ffffffff8117e45d>] ? __kmalloc+0x15d/0x1b0
 [<ffffffffa0523f7d>] _xfs_log_force+0x6d/0x290 [xfs]
 [<ffffffffa051450f>] ? xfs_iflush_cluster+0x25f/0x3d0 [xfs]
 [<ffffffffa05241d9>] xfs_log_force+0x39/0xc0 [xfs]
 [<ffffffffa04cbaa0>] xfs_buf_trylock+0xd0/0xe0 [xfs]
 [<ffffffffa0526369>] xfs_buf_item_push+0x39/0xd0 [xfs]
 [<ffffffffa0527bdf>] ? xfs_inode_item_push+0x8f/0x140 [xfs]
 [<ffffffffa0528c01>] xfsaild+0x2e1/0x6e0 [xfs]
 [<ffffffff8108aa08>] ? __wake_up_common+0x58/0x90
 [<ffffffffa0528920>] ? xfs_trans_ail_cursor_first+0xc0/0xc0 [xfs]
 [<ffffffff81081708>] kthread+0xd8/0xe0
 [<ffffffff81081630>] ? flush_kthread_work+0x150/0x150
 [<ffffffff816400ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81081630>] ? flush_kthread_work+0x150/0x150

Brian Foster (2):
  xfs: conditionally force log on trylock failure of pinned/stale buf
  xfs: drop xa_lock around log force in xfs_buf_item push

 fs/xfs/xfs_buf.c      |    8 +++++---
 fs/xfs/xfs_buf.h      |    3 ++-
 fs/xfs/xfs_buf_item.c |   10 +++++++++-
 3 files changed, 16 insertions(+), 5 deletions(-)

-- 
1.7.7.6

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH RFC 1/2] xfs: conditionally force log on trylock failure of pinned/stale buf
  2013-01-29 20:42 [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push Brian Foster
@ 2013-01-29 20:42 ` Brian Foster
  2013-01-29 20:42 ` [PATCH RFC 2/2] xfs: drop xa_lock around log force in xfs_buf_item push Brian Foster
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2013-01-29 20:42 UTC (permalink / raw)
  To: xfs

xfs_log_force() is not safe from all contexts. Add a flag parameter
to xfs_buf_trylock() to specify when the force is appropriate and
create a macro to preserve current behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c |    8 +++++---
 fs/xfs/xfs_buf.h |    3 ++-
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index fbbb9eb..2e04a44 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -943,15 +943,17 @@ xfs_buf_rele(
  *	to push on stale inode buffers.
  */
 int
-xfs_buf_trylock(
-	struct xfs_buf		*bp)
+__xfs_buf_trylock(
+	struct xfs_buf		*bp,
+	bool			force_log)
 {
 	int			locked;
 
 	locked = down_trylock(&bp->b_sema) == 0;
 	if (locked)
 		XB_SET_OWNER(bp);
-	else if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
+	else if (force_log &&
+		 atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
 		xfs_log_force(bp->b_target->bt_mount, 0);
 
 	trace_xfs_buf_trylock(bp, _RET_IP_);
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 433a12e..667b723 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -258,7 +258,8 @@ extern void xfs_buf_free(xfs_buf_t *);
 extern void xfs_buf_rele(xfs_buf_t *);
 
 /* Locking and Unlocking Buffers */
-extern int xfs_buf_trylock(xfs_buf_t *);
+#define xfs_buf_trylock(bp) __xfs_buf_trylock(bp, true)
+extern int __xfs_buf_trylock(xfs_buf_t *, bool);
 extern void xfs_buf_lock(xfs_buf_t *);
 extern void xfs_buf_unlock(xfs_buf_t *);
 #define xfs_buf_islocked(bp) \
-- 
1.7.7.6


* [PATCH RFC 2/2] xfs: drop xa_lock around log force in xfs_buf_item push
  2013-01-29 20:42 [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push Brian Foster
  2013-01-29 20:42 ` [PATCH RFC 1/2] xfs: conditionally force log on trylock failure of pinned/stale buf Brian Foster
@ 2013-01-29 20:42 ` Brian Foster
  2013-01-29 22:41 ` [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push Mark Tinguely
  2013-01-30  6:05 ` Dave Chinner
  3 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2013-01-29 20:42 UTC (permalink / raw)
  To: xfs

If the trylock fails and the buffer is pinned and stale, the
resulting xfs_log_force() can lead to lock recursion on the ailp
xa_lock. Call __xfs_buf_trylock() to attempt the buf lock with
xa_lock held, but to avoid the log force. We subsequently pull the
check and log force up into xfs_buf_item_push() where we can drop
and reacquire xa_lock appropriately.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 9c4c050..7b6e98a 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -469,8 +469,16 @@ xfs_buf_item_push(
 
 	if (xfs_buf_ispinned(bp))
 		return XFS_ITEM_PINNED;
-	if (!xfs_buf_trylock(bp))
+
+	if (!__xfs_buf_trylock(bp, false)) {
+		if (atomic_read(&bp->b_pin_count) &&
+		    (bp->b_flags & XBF_STALE)) {
+			spin_unlock(&lip->li_ailp->xa_lock);
+			xfs_log_force(bp->b_target->bt_mount, 0);
+			spin_lock(&lip->li_ailp->xa_lock);
+		}
 		return XFS_ITEM_LOCKED;
+	}
 
 	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
 
-- 
1.7.7.6


* Re: [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push
  2013-01-29 20:42 [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push Brian Foster
  2013-01-29 20:42 ` [PATCH RFC 1/2] xfs: conditionally force log on trylock failure of pinned/stale buf Brian Foster
  2013-01-29 20:42 ` [PATCH RFC 2/2] xfs: drop xa_lock around log force in xfs_buf_item push Brian Foster
@ 2013-01-29 22:41 ` Mark Tinguely
  2013-01-30  6:05 ` Dave Chinner
  3 siblings, 0 replies; 10+ messages in thread
From: Mark Tinguely @ 2013-01-29 22:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On 01/29/13 14:42, Brian Foster wrote:
> Hi all,
>
> This patchset fixes a spinlock recursion we've reproduced initially on RHEL
> kernels[1]. The problem is that we issue an xfs_log_force() via
> xfs_buf_trylock() with the xa_lock held and ultimately drive down into
> xlog_assign_tail_lsn(), which attempts to reacquire xa_lock[2].
>
> Note that this lockup was difficult to reproduce and I was not able to
> reproduce on an upstream kernel without a hack to comment out the pinned buf
> check in xfs_buf_item_push() (presumably because the log force itself only
> happens when the buf is pinned, so the window here is tight).

Interesting. So this is a result of the b_pin_count changing between the
xfs_buf_ispinned() and xfs_buf_trylock() tests in xfs_buf_item_push()?

>
> This patchset is what I'm testing to avoid the lockup, but I'm posting this RFC
> to get some early thoughts:
>
> - Patch 1 - Creates a flag to conditionally force the log in xfs_buf_trylock().
>    The alternative I considered is to pull out the check and log force and
>    sprinkle that code amongst the trylock callers.
> - Patch 2 - Utilizes the flag created in patch 1 and duplicates the log force
>    in xfs_buf_item_push() after dropping xa_lock.
>
> The change in patch 2 makes me wonder how important the immediate flush is in
> the context of xfsaild_push(), where we already pend up a flush if the item is
> pinned. IOWs, I wonder if replacing what I have now with something like the
> following would be acceptable and cleaner:
>
> 	if (!__xfs_buf_trylock(bp, false)) {
> 		if (xfs_buf_ispinned(bp))
> 			return XFS_ITEM_PINNED;
> 		return XFS_ITEM_LOCKED;

This looks better to me than patch 2.

> 	}
>
> Thoughts appreciated.
>
> Brian
>
> [1] - http://bugzilla.redhat.com/show_bug.cgi?id=896224
> [2] - stacktrace:
>
> BUG: spinlock recursion on CPU#5, xfsaild/dm-3/2690
> Pid: 2690, comm: xfsaild/dm-3 Not tainted 3.8.0-rc1+ #46
> Call Trace:
>   [<ffffffff8163211c>] spin_dump+0x8a/0x8f
>   [<ffffffff81632142>] spin_bug+0x21/0x26
>   [<ffffffff812f66a1>] do_raw_spin_lock+0x101/0x150
>   [<ffffffff816378ce>] _raw_spin_lock+0xe/0x10
>   [<ffffffffa0522c85>] xlog_assign_tail_lsn+0x25/0x50 [xfs]
>   [<ffffffffa0523286>] xlog_state_release_iclog+0x86/0xd0 [xfs]
>   [<ffffffffa0523c89>] xlog_write+0x569/0x710 [xfs]
>   [<ffffffffa052555c>] xlog_cil_push+0x29c/0x3c0 [xfs]
>   [<ffffffffa04cbfe2>] ? xfs_buf_get_map+0xf2/0x1b0 [xfs]
>   [<ffffffffa0525d17>] xlog_cil_force_lsn+0x157/0x160 [xfs]
>   [<ffffffffa04cced1>] ? xfs_buf_read_map+0x31/0x130 [xfs]
>   [<ffffffffa0529e99>] ? xfs_trans_read_buf_map+0x279/0x4b0 [xfs]
>   [<ffffffff8117e45d>] ? __kmalloc+0x15d/0x1b0
>   [<ffffffffa0523f7d>] _xfs_log_force+0x6d/0x290 [xfs]
>   [<ffffffffa051450f>] ? xfs_iflush_cluster+0x25f/0x3d0 [xfs]
>   [<ffffffffa05241d9>] xfs_log_force+0x39/0xc0 [xfs]
>   [<ffffffffa04cbaa0>] xfs_buf_trylock+0xd0/0xe0 [xfs]
>   [<ffffffffa0526369>] xfs_buf_item_push+0x39/0xd0 [xfs]
>   [<ffffffffa0527bdf>] ? xfs_inode_item_push+0x8f/0x140 [xfs]
>   [<ffffffffa0528c01>] xfsaild+0x2e1/0x6e0 [xfs]
>   [<ffffffff8108aa08>] ? __wake_up_common+0x58/0x90
>   [<ffffffffa0528920>] ? xfs_trans_ail_cursor_first+0xc0/0xc0 [xfs]
>   [<ffffffff81081708>] kthread+0xd8/0xe0
>   [<ffffffff81081630>] ? flush_kthread_work+0x150/0x150
>   [<ffffffff816400ac>] ret_from_fork+0x7c/0xb0
>   [<ffffffff81081630>] ? flush_kthread_work+0x150/0x150
>
> Brian Foster (2):
>    xfs: conditionally force log on trylock failure of pinned/stale buf
>    xfs: drop xa_lock around log force in xfs_buf_item push
>
>   fs/xfs/xfs_buf.c      |    8 +++++---
>   fs/xfs/xfs_buf.h      |    3 ++-
>   fs/xfs/xfs_buf_item.c |   10 +++++++++-
>   3 files changed, 16 insertions(+), 5 deletions(-)
>

--Mark.


* Re: [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push
  2013-01-29 20:42 [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push Brian Foster
                   ` (2 preceding siblings ...)
  2013-01-29 22:41 ` [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push Mark Tinguely
@ 2013-01-30  6:05 ` Dave Chinner
       [not found]   ` <5109291E.6090303@sgi.com>
  3 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2013-01-30  6:05 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Jan 29, 2013 at 03:42:35PM -0500, Brian Foster wrote:
> Hi all,
> 
> This patchset fixes a spinlock recursion we've reproduced initially on RHEL
> kernels[1]. The problem is that we issue an xfs_log_force() via
> xfs_buf_trylock() with the xa_lock held and ultimately drive down into
> xlog_assign_tail_lsn(), which attempts to reacquire xa_lock[2].
> 
> Note that this lockup was difficult to reproduce and I was not able to
> reproduce on an upstream kernel without a hack to comment out the pinned buf
> check in xfs_buf_item_push() (presumably because the log force itself only
> happens when the buf is pinned, so the window here is tight).
> 
> This patchset is what I'm testing to avoid the lockup, but I'm posting this RFC
> to get some early thoughts:
> 
> - Patch 1 - Creates a flag to conditionally force the log in xfs_buf_trylock().
>   The alternative I considered is to pull out the check and log force and
>   sprinkle that code amongst the trylock callers.
> - Patch 2 - Utilizes the flag created in patch 1 and duplicates the log force
>   in xfs_buf_item_push() after dropping xa_lock.
> 
> The change in patch 2 makes me wonder how important the immediate flush is in
> the context of xfsaild_push(), where we already pend up a flush if the item is
> pinned. IOWs, I wonder if replacing what I have now with something like the
> following would be acceptable and cleaner:
> 
> 	if (!__xfs_buf_trylock(bp, false)) {
> 		if (xfs_buf_ispinned(bp)
> 			return XFS_ITEM_PINNED;
> 		return XFS_ITEM_LOCKED;
> 	}
> 
> Thoughts appreciated.

I haven't had time to really think about this, but I'll say up front
that I don't really like the fix. The log force in xfs_buf_trylock()
was added here:

commit 90810b9e82a36c3c57c1aeb8b2918b242a130b26
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Nov 30 15:16:16 2010 +1100

    xfs: push stale, pinned buffers on trylock failures
    
    As reported by Nick Piggin, XFS is suffering from long pauses under
    highly concurrent workloads when hosted on ramdisks. The problem is
    that an inode buffer is stuck in the pinned state in memory and as a
    result either the inode buffer or one of the inodes within the
    buffer is stopping the tail of the log from being moved forward.
    
    The system remains in this state until a periodic log force issued
    by xfssyncd causes the buffer to be unpinned. The main problem is
    that these are stale buffers, and are hence held locked until the
    transaction/checkpoint that marked them stale has been committed to
    disk. When the filesystem gets into this state, only the xfssyncd
    can cause the async transactions to be committed to disk and hence
    unpin the inode buffer.
    
    This problem was encountered when scaling the busy extent list, but
    only the blocking lock interface was fixed to solve the problem.
    Extend the same fix to the buffer trylock operations - if we fail to
    lock a pinned, stale buffer, then force the log immediately so that
    when the next attempt to lock it comes around, it will have been
    unpinned.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

And the initial commit that introduced this dependency was this:

commit ed3b4d6cdc81e8feefdbfa3c584614be301b6d39
Author: Dave Chinner <david@fromorbit.com>
Date:   Fri May 21 12:07:08 2010 +1000

    xfs: Improve scalability of busy extent tracking
....
    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), then buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting for
    the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for log
    force to occur.
....

So essentially what is happening here is that we are trying to lock
a stale buffer in the AIL to flush it. Well, we can't flush it from
the AIL, and indeed the next line of code is this:

	if (!xfs_buf_trylock(bp))
		return XFS_ITEM_LOCKED;

>>>>>	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));

The only reason this ASSERT is not firing is that we are failing to
lock stale buffers. Hence we are relying on the failed lock to force
the log, instead of detecting that we need to force the log after we
drop the AIL lock and letting the caller handle it.

So, wouldn't a better solution be to do something like simply like:

+	if (bp->b_flags & XBF_STALE)
+		return XFS_ITEM_PINNED;
+
	if (!xfs_buf_trylock(bp))
		return XFS_ITEM_LOCKED;

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push
       [not found]   ` <5109291E.6090303@sgi.com>
@ 2013-01-30 16:02     ` Brian Foster
  2013-01-30 21:59       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2013-01-30 16:02 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

(added Dave and the list back on CC)

On 01/30/2013 09:07 AM, Mark Tinguely wrote:
> On 01/30/13 00:05, Dave Chinner wrote:
>> On Tue, Jan 29, 2013 at 03:42:35PM -0500, Brian Foster wrote:
...
> 
>> So essentially what is happening here is that we are trying to lock
>> a stale buffer in the AIL to flush it. Well, we can't flush it from
>> the AIL, and indeed the next line of code is this:
>>
>>     if (!xfs_buf_trylock(bp))
>>         return XFS_ITEM_LOCKED;
>>
>>>>>>>     ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
>>
>> The only reason this ASSERT is not firing is that we are failing to
>> lock stale buffers. Hence we are relying on the failed lock to force
>> the log, instead of detecting that we need to force the log after we
>> drop the AIL lock and letting the caller handle it.
>>
>> So, wouldn't a better solution be to do something like simply like:
>>
>> +    if (bp->b_flags & XBF_STALE)
>> +        return XFS_ITEM_PINNED;
>> +
>>     if (!xfs_buf_trylock(bp))
>>         return XFS_ITEM_LOCKED;
...
> 

Thanks guys. This certainly looks nicer than messing with the lock
wrapper, but is it susceptible to the same problem? In other words, does
this fix the problem or just tighten the window?

I'm going to go back to my original reproduction case and enable some
select tracepoints to try and get a specific sequence of events, but
given the code as it is, the problem seems to be that the buffer goes
from !pinned to pinned between the time we actually check for pinned and
try the buf lock.

So if the buf lock covers the pinned state (e.g., buffer gets locked,
added to a transaction, the transaction gets committed and pins and
unlocks the buffer, IIUC) and the stale state (buf gets locked, added to
a new transaction and inval'd before the original transaction was
written ?), but we don't hold the buf lock in xfs_buf_item_push(), how
can we guarantee the state of either doesn't change between the time we
check the flags and the time the lock fails?

> Makes sense. It would prevent the lock recursion. The more that I think
> about it, the more I think we do not want to release xa_lock during an AIL scan.
> 

FWIW, the other log item abstractions appear to use this model (e.g.,
xfs_inode_item_push()), where it appears safe to drop xa_lock once the
actual object lock/ref is acquired and reacquire xa_lock before
returning. It looks like this behavior was introduced in:

43ff2122 xfs: on-stack delayed write buffer lists

Brian

> We would still want to see if the buffer is re-pinned (and not STALE)
> while in the AIL.
> 
> --Mark.


* Re: [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push
  2013-01-30 16:02     ` Brian Foster
@ 2013-01-30 21:59       ` Dave Chinner
  2013-01-31 17:45         ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2013-01-30 21:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: Mark Tinguely, xfs

On Wed, Jan 30, 2013 at 11:02:43AM -0500, Brian Foster wrote:
> (added Dave and the list back on CC)
> 
> On 01/30/2013 09:07 AM, Mark Tinguely wrote:
> > On 01/30/13 00:05, Dave Chinner wrote:
> >> On Tue, Jan 29, 2013 at 03:42:35PM -0500, Brian Foster wrote:
> ...
> > 
> >> So essentially what is happening here is that we are trying to lock
> >> a stale buffer in the AIL to flush it. Well, we can't flush it from
> >> the AIL, and indeed the next line of code is this:
> >>
> >>     if (!xfs_buf_trylock(bp))
> >>         return XFS_ITEM_LOCKED;
> >>
> >>>>>>>     ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
> >>
> >> The only reason this ASSERT is not firing is that we are failing to
> >> lock stale buffers. Hence we are relying on the failed lock to force
> >> the log, instead of detecting that we need to force the log after we
> >> drop the AIL lock and letting the caller handle it.
> >>
> >> So, wouldn't a better solution be to do something like simply like:
> >>
> >> +    if (bp->b_flags & XBF_STALE)
> >> +        return XFS_ITEM_PINNED;
> >> +
> >>     if (!xfs_buf_trylock(bp))
> >>         return XFS_ITEM_LOCKED;
> ...
> > 
> 
> Thanks guys. This certainly looks nicer than messing with the lock
> wrapper, but is it susceptible to the same problem? In other words, does
> this fix the problem or just tighten the window?

That's what I need to think about more - the only difference here is
that we are checking the flag before the down_trylock() instead of
after....

> So if the buf lock covers the pinned state (e.g., buffer gets locked,
> added to a transaction, the transaction gets committed and pins and
> unlocks the buffer, IIUC) and the stale state (buf gets locked, added to
> a new transaction and inval'd before the original transaction was
> written ?), but we don't hold the buf lock in xfs_buf_item_push(), how
> can we guarantee the state of either doesn't change between the time we
> check the flags and the time the lock fails?

... but the order of them being set and checked may be significant
and hence checking the stale flag first might be sufficient to avoid
the pin count race and hence the log force. Hence this might just
need a pair of memory barriers - one here and one in xfs_buf_stale()
- to ensure that we always see the XBF_STALE flag without needing to
lock the buffer first.
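
That ordering argument can be sketched with C11 atomics. This is a hedged userspace illustration only, not the proposed kernel change: the kernel code would use smp_wmb()/smp_rmb(), and the model_* names and the acquire/release placement here just mirror the discussion:

```c
/*
 * Illustration of the proposed barrier pairing: the transaction side
 * marks the buffer stale before pinning it, so if the push side reads
 * the pin count with acquire semantics and sees it nonzero, the
 * XBF_STALE bit set before the release-store is guaranteed visible.
 * Userspace sketch only; the kernel equivalent would be smp_wmb() in
 * xfs_buf_stale() paired with smp_rmb() in the push path.
 */
#include <assert.h>
#include <stdatomic.h>

#define XBF_STALE 0x1

static atomic_int b_flags;
static atomic_int b_pin_count;

/* Transaction commit path: stale first, then pin (release). */
static void model_commit_side(void)
{
	atomic_fetch_or_explicit(&b_flags, XBF_STALE, memory_order_relaxed);
	atomic_fetch_add_explicit(&b_pin_count, 1, memory_order_release);
}

/*
 * Push path: an acquire load of the pin count.  Returns XBF_STALE if a
 * pinned buffer is seen stale (as the pairing guarantees), 0 if pinned
 * but the stale bit was missed, -1 if not pinned yet.
 */
static int model_push_side(void)
{
	if (!atomic_load_explicit(&b_pin_count, memory_order_acquire))
		return -1;
	return atomic_load_explicit(&b_flags, memory_order_relaxed)
		& XBF_STALE;
}
```

After model_commit_side() runs, model_push_side() cannot observe the pin without also observing XBF_STALE.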

> > Makes sense. It would prevent the lock recursion. The more that I think
> > about, we do not want to release xa_lock during an AIL scan.
> > 
> 
> FWIW, the other log item abstractions appear to use this model (e.g.,
> xfs_inode_item_push()), where it appears safe to drop xa_lock once the
> actual object lock/ref is acquired and reacquire xa_lock before
> returning.

Right. We can't hold the AIL while doing blocking lock operations
because the AIL lock is an innermost lock.  That's fundamentally the
cause of this problem - the trylock is doing more work than we can
allow in the context that it is being called in...

We also do not hold the AIL lock for the entire
traversal as long holds of the AIL lock result in serialisation
across the entire transaction subsystem. A successful IOP_PUSH()
typically drops and reacquires the AIL lock, but a failure doesn't.
Hence we limit the number of failed pushes under a single traversal
so that we don't spend a long time holding the lock doing no work
and instead back off and try again later....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push
  2013-01-30 21:59       ` Dave Chinner
@ 2013-01-31 17:45         ` Brian Foster
  2013-02-01  1:26           ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2013-01-31 17:45 UTC (permalink / raw)
  To: Mark Tinguely, xfs, david

On 01/30/2013 04:59 PM, Dave Chinner wrote:
> On Wed, Jan 30, 2013 at 11:02:43AM -0500, Brian Foster wrote:
>> (added Dave and the list back on CC)
>>
>> On 01/30/2013 09:07 AM, Mark Tinguely wrote:
>>> On 01/30/13 00:05, Dave Chinner wrote:
>>>> On Tue, Jan 29, 2013 at 03:42:35PM -0500, Brian Foster wrote:
...
>>
>> Thanks guys. This certainly looks nicer than messing with the lock
>> wrapper, but is it susceptible to the same problem? In other words, does
>> this fix the problem or just tighten the window?
> 
> That's what I need to think about more - the only difference here is
> that we are checking the flag before the down_trylock() instead of
> after....
> 
>> So if the buf lock covers the pinned state (e.g., buffer gets locked,
>> added to a transaction, the transaction gets committed and pins and
>> unlocks the buffer, IIUC) and the stale state (buf gets locked, added to
>> a new transaction and inval'd before the original transaction was
>> written ?), but we don't hold the buf lock in xfs_buf_item_push(), how
>> can we guarantee the state of either doesn't change between the time we
>> check the flags and the time the lock fails?
> 
> ... but the order of them being set and checked may be significant
> and hence checking the stale flag first might be sufficient to avoid
> the pin count race and hence the log force. Hence this might just
> need a pair of memory barriers - one here and one in xfs_buf_stale()
> - to ensure that we always see the XBF_STALE flag without needing to
> lock the buffer first.
> 

I _think_ I follow your train of thought. If we're racing on the pin
check, presumably the lock holder is committing the transaction and we
should either already see the buffer being stale, being pinned or we
should get the lock (assuming the order is: stale, pinned, unlocked).

That aside for a moment, here's some specific tracepoint (some of which
I've hacked in) data for when the recursion occurs:

      xfsalloc/5-3220  [005]  4304.223158: xfs_buf_item_stale: dev 253:3
bno 0x847feb28 len 0x1000 hold 2 pincount 0 lock 0 flags
|MAPPED|ASYNC|DONE|STALE|PAGES recur 0 refcount 1 bliflags |DIRTY|STALE
lidesc 0xffff88067ada5610 liflags IN_AIL
 smallfile_cli.p-3702  [007]  4304.223209: xfs_buf_item_format_stale:
dev 253:3 bno 0x847feb28 len 0x1000 hold 2 pincount 0 lock 0 flags
|MAPPED|ASYNC|DONE|STALE|PAGES recur 0 refcount 1 bliflags |STALE lidesc
0xffff88067ada5610 liflags IN_AIL
...
xfsaild/dm-3-3695  [002]  4304.223217: xfs_buf_item_trylock: dev 253:3
bno 0x847feb28 len 0x1000 hold 2 pincount 1 lock 0 flags
|MAPPED|ASYNC|DONE|STALE|PAGES recur 0 refcount 1 bliflags |STALE lidesc
0xffff88067ada5610 liflags IN_AIL
 smallfile_cli.p-3702  [007]  4304.223217: xfs_buf_item_pin: dev 253:3
bno 0x847feb28 len 0x1000 hold 2 pincount 0 lock 0 flags
|MAPPED|ASYNC|DONE|STALE|PAGES recur 0 refcount 1 bliflags |STALE lidesc
0xffff88067ada5610 liflags IN_AIL
...
    xfsaild/dm-3-3695  [002]  4304.223219: xfs_buf_cond_lock_log_force:
dev 253:3 bno 0x847feb28 len 0x1000 hold 2 pincount 1 lock 0 flags
MAPPED|ASYNC|DONE|STALE|PAGES caller xfs_buf_item_trylock

[NOTES:
 - I moved xfs_buf_item_trylock up to after the pinned check but before
the trylock.
 - xfs_buf_item_stale() is in xfs_trans_binval(). This was an oversight,
as I was looking at bli_flags, but still illustrates the sequence.]

... so as expected, the buffer is marked stale, we attempt the trylock,
the buf is pinned, we run the log force and we're dead.

From the looks of the trace, I'd expect an additional stale check to
eliminate the ability to reproduce this, but that doesn't necessarily
make it correct of course. Regardless, I'm putting that to the test now
and letting it run for a bit while we get this sorted out.

I also need to stare at the code some more. My pending questions are:

- Is it always reasonable to assume/consider a stale buf as pinned in
the context of xfsaild?
- If we currently reproduce the following sequence:

A		xfsaild
stale
		(!pinned) ==> trylock()
pinned
		(!trylock && pinned && stale)
			==> xfs_log_force() (boom)

... what prevents the following sequence from occurring sometime in the
future or with some alternate high-level sequence of events?

A		xfsaild
locked
		(!pinned && !stale) ==> trylock()
pinned
stale
		(!trylock && pinned && stale)
			==> xfs_log_force()

Granted the window is seriously tight, but is there some crazy sequence
where we could race with the pin count and the stale state? E.g., I
notice some error or forced shutdown sequences mark buffers as stale, etc.
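
For what it's worth, the reproduced interleaving condenses into a single-threaded model (illustrative only: model_push() is a simplification of the current xfs_buf_item_push() logic, with the racing commit made an explicit callback that runs between the pinned check and the trylock, and all names are hypothetical):

```c
/*
 * Single-threaded model of the reproduced race.  Illustration only:
 * model_push() condenses the current xfs_buf_item_push() logic, with
 * the racing transaction commit injected between the pinned check and
 * the trylock.
 */
#include <assert.h>

struct model_buf {
	int locked;		/* b_sema held by the committing tx */
	int pinned;		/* b_pin_count != 0 */
	int stale;		/* XBF_STALE */
};

enum {
	MODEL_ITEM_PINNED,
	MODEL_ITEM_LOCKED,
	MODEL_ITEM_LOCKED_FORCED,	/* log force under xa_lock: boom */
	MODEL_ITEM_SUCCESS,
};

static int model_push(struct model_buf *bp,
		      void (*race)(struct model_buf *))
{
	if (bp->pinned)
		return MODEL_ITEM_PINNED;
	race(bp);			/* commit slips in right here */
	if (bp->locked) {		/* trylock fails... */
		if (bp->pinned && bp->stale)
			return MODEL_ITEM_LOCKED_FORCED;
		return MODEL_ITEM_LOCKED;
	}
	return MODEL_ITEM_SUCCESS;
}

/* The commit marks the buffer stale, then pins it; it stays locked. */
static void model_racing_commit(struct model_buf *bp)
{
	bp->stale = 1;
	bp->pinned = 1;
}

/* Buffer already locked by the committing tx, not yet pinned. */
static int model_demo(void)
{
	struct model_buf bp = { .locked = 1 };

	return model_push(&bp, model_racing_commit);
}
```

model_demo() lands in the log-force branch even though the initial pinned check passed, which is the window the tracepoints show.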

Brian

...
> Cheers,
> 
> Dave.
> 


* Re: [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push
  2013-01-31 17:45         ` Brian Foster
@ 2013-02-01  1:26           ` Dave Chinner
  2013-02-01 16:01             ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2013-02-01  1:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: Mark Tinguely, xfs

On Thu, Jan 31, 2013 at 12:45:41PM -0500, Brian Foster wrote:
> On 01/30/2013 04:59 PM, Dave Chinner wrote:
> > On Wed, Jan 30, 2013 at 11:02:43AM -0500, Brian Foster wrote:
> >> (added Dave and the list back on CC)
> >>
> >> On 01/30/2013 09:07 AM, Mark Tinguely wrote:
> >>> On 01/30/13 00:05, Dave Chinner wrote:
> >>>> On Tue, Jan 29, 2013 at 03:42:35PM -0500, Brian Foster wrote:
> ...
> >>
> >> Thanks guys. This certainly looks nicer than messing with the lock
> >> wrapper, but is it susceptible to the same problem? In other words, does
> >> this fix the problem or just tighten the window?
> > 
> > That's what I need to think about more - the only difference here is
> > that we are checking the flag before the down_trylock() instead of
> > after....
> > 
> >> So if the buf lock covers the pinned state (e.g., buffer gets locked,
> >> added to a transaction, the transaction gets committed and pins and
> >> unlocks the buffer, IIUC) and the stale state (buf gets locked, added to
> >> a new transaction and inval'd before the original transaction was
> >> written ?), but we don't hold the buf lock in xfs_buf_item_push(), how
> >> can we guarantee the state of either doesn't change between the time we
> >> check the flags and the time the lock fails?
> > 
> > ... but the order of them being set and checked may be significant
> > and hence checking the stale flag first might be sufficient to avoid
> > the pin count race and hence the log force. Hence this might just
> > need a pair of memory barriers - one here and one in xfs_buf_stale()
> > - to ensure that we always see the XBF_STALE flag without needing to
> > lock the buffer first.
> > 
> 
> I _think_ I follow your train of thought. If we're racing on the pin
> check, presumably the lock holder is committing the transaction and we
> should either already see the buffer being stale, being pinned or we
> should get the lock (assuming the order is: stale, pinned, unlocked).

Right, that is the order in which operations occur. Marking a buffer
stale happens in the transaction body; the pin and subsequent unlock
occur some time later during the transaction commit.

> That aside for a moment, here's some specific tracepoint (some of which
> I've hacked in) data for when the recursion occurs:
....
> ... so as expected, the buffer is marked stale, we attempt the trylock,
> the buf is pinned, we run the log force and we're dead.

*nod*

> From the looks of the trace, I'd expect an additional stale check to
> eliminate the ability to reproduce this, but that doesn't necessarily
> make it correct of course. Regardless, I'm putting that to the test now
> and letting it run for a bit while we get this sorted out.
> 
> I also need to stare at the code some more. My pending questions are:
> 
> - Is it always reasonable to assume/consider a stale buf as pinned in
> the context of xfsaild?

A stale buffer in the AIL is either pinned (transaction committed)
or on its way to being pinned (transaction that marked it stale is
currently in progress). When the transaction commit completes, the
IOP_UNPIN() call will remove the item from the AIL if it is the last
reference to the item.

Hence we generally do not get unpinned, stale buffers in the AIL
because the last reference to the stale buffer is usually the
transaction that marked it stale....

> - If we currently reproduce the following sequence:
> 
> A		xfsaild
> stale
> 		(!pinned) ==> trylock()
> pinned
> 		(!trylock && pinned && stale)
> 			==> xfs_log_force() (boom)
> 
> ... what prevents the following sequence from occurring sometime in the
> future or with some alternate high-level sequence of events?
> 
> A		xfsaild
> locked
> 		(!pinned && !stale) ==> trylock()
> pinned
> stale
> 		(!trylock && pinned && stale)
> 			==> xfs_log_force()

You can't get that order and trigger the race. If the item is pinned
before it is marked stale, that means we will always see the pin count
because it was pinned by a previous transaction commit.

Pinning occurs on the commit of the first transaction commit that
inserts the item into the CIL. It doesn't get unpinned until the CIL
is checkpointed and inserted into the AIL. Hence the order of
operations that marks an uncommitted buffer stale should always be
lock->stale->pinned->unlock.

However, if the buffer has been previously modified and is in the
CIL, you'll see:

	lock->pinned->CIL insert->unlock.....->lock->stale->unlock

If the buffer was modified a while back and the CIL is committed, the
buffer will be in the AIL but not the CIL. If the buffer was
modified a while back, and then again recently, it will be both
pinned in the CIL and in the AIL. Neither of these cases can trigger
this problem because the pinned check will fire reliably.

Hence the specific case here is the buffer has been previously modified,
the CIL committed so it's in the AIL, and we are marking the buffer
stale as the first modification of the buffer after it was added to
the AIL. IOWs, the order that is of concern is this while the item
is in the AIL:

	lock->stale->pin->CIL insert->unlock

So in terms of the xfsaild racing and causing problems, the only case
it will occur in is this:

	Thread 1	xfsaild
	lock		xfs_buf_item_push
	....		not pinned
	stale
			trylock
			<delay>
	commit
	pin
			  if(pinned && stale)
				<splat>
	unlock

So what I was asking is whether we can do checks in the order of:

	smp_mb();
	if (stale)
		return
	if (pinned)
		return

What I've just realised is that we really don't care if we race in
xfs_buf_item_push(). The race that matters is in xfs_buf_trylock()
and at that point it is too late to avoid it.

So I think your original patch is on the right path but having the
xfsaild handle the log force gets rid of most of the nasty cruft it
had.... ie. we're going to have to tell xfs_buf_trylock() that we
should not do a log force if the lock fails, and return
XFS_ITEM_PINNED rather than XFS_ITEM_LOCKED in
xfs_buf_item_push()....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push
  2013-02-01  1:26           ` Dave Chinner
@ 2013-02-01 16:01             ` Brian Foster
  0 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2013-02-01 16:01 UTC (permalink / raw)
  To: Mark Tinguely, xfs, david

On 01/31/2013 08:26 PM, Dave Chinner wrote:
> On Thu, Jan 31, 2013 at 12:45:41PM -0500, Brian Foster wrote:
>> On 01/30/2013 04:59 PM, Dave Chinner wrote:
>>> On Wed, Jan 30, 2013 at 11:02:43AM -0500, Brian Foster wrote:
>>>> (added Dave and the list back on CC)
>>>>
>>>> On 01/30/2013 09:07 AM, Mark Tinguely wrote:
>>>>> On 01/30/13 00:05, Dave Chinner wrote:
>>>>>> On Tue, Jan 29, 2013 at 03:42:35PM -0500, Brian Foster wrote:
...
>> ... what prevents the following sequence from occurring sometime in the
>> future or with some alternate high-level sequence of events?
>>
>> A		xfsaild
>> locked
>> 		(!pinned && !stale) ==> trylock()
>> pinned
>> stale
>> 		(!trylock && pinned && stale)
>> 			==> xfs_log_force()
> 
> You can't get that order and trigger the race. If the item is pinned
> before it is marked stale, that means we will always see the pin count
> because it was pinned by a previous transaction commit.
> 

Ok, so if we're seeing the pin count race, we have a buffer that is
obviously in the ail and we're racing with another transaction commit on
the buffer. This means the buf is either (!pinned && locked), (pinned &&
locked) or (pinned && !locked). The first two cases imply that if the
buf is stale, it is marked stale by this transaction, or as you point
out, by a previous transaction (because it would have required the lock).

If we suppose it is possible for the buf to go from (!pinned && locked)
to (pinned && !locked) after the trylock fails but before the log force
check, what I'm not quite following is what would prevent some third
thread (I guess my diagram was wrong to only include xfsaild and thread
A) from squeezing in there and setting the buf stale? For example, I was looking
at xfs_buf_iodone_callbacks() and the forced shutdown check (is the bp
always locked here?).

> Pinning occurs on the commit of the first transaction commit that
> inserts the item into the CIL. It doesn't get unpinned until the CIL
> is checkpointed and inserted into the AIL. Hence the order of
> operations that marks an uncommitted buffer stale should always be
> lock->stale->pinned->unlock.
> 
> However, if the buffer has been previously modified and is in the
> CIL, you'll see:
> 
> 	lock->pinned->CIL insert->unlock.....->lock->stale->unlock
> 
> If the buffer was modified a while back and the CIL is committed, the
> buffer will be in the AIL but not the CIL. If the buffer was
> modified a while back, and then again recently, it will be both
> pinned in the CIL and in the AIL. Neither of these cases can trigger
> this problem because the pinned check will fire reliably.
> 
> Hence the specific case here is the buffer has been previously modified,
> the CIL committed so it's in the AIL, and we are marking the buffer
> stale as the first modification of the buffer after it was added to
> the AIL. IOWs, the order that is of concern is this while the item
> is in the AIL:
> 
> 	lock->stale->pin->CIL insert->unlock
> 
> So in terms of the xfsaild racing and causing problems, the only case
> it will occur in is this:
> 
> 	Thread 1	xfsaild
> 	lock		xfs_buf_item_push
> 	....		not pinned
> 	stale
> 			trylock
> 			<delay>
> 	commit
> 	pin
> 			  if(pinned && stale)
> 				<splat>
> 	unlock
> 

Ok, thanks for the context.

> So what I was asking is whether we can do checks in the order of:
> 
> 	smp_mb();
> 	if (stale)
> 		return
> 	if (pinned)
> 		return
> 

Right...

> What I've just realised is that we really don't care if we race in
> xfs_buf_item_push(). The race that matters is in xfs_buf_trylock()
> and at that point it is too late to avoid it.
> 
> So I think your original patch is on the right path but having the
> xfsaild handle the log force gets rid of most of the nasty cruft it
> had.... ie. we're going to have to tell xfs_buf_trylock() that we
> should not do a log force if the lock fails, and return
> XFS_ITEM_PINNED rather than XFS_ITEM_LOCKED in
> xfs_buf_item_push()....
> 

... and that's the bit I wasn't so sure about. If it's Ok to pend up the
log force as such, the xfs_buf_item_push() part can be made cleaner
(xfs_buf_trylock() is still uglified with the new parameter). I'll start
testing something like that and follow up with a new set if it addresses
the problem and survives some sanity testing. Thanks.

Brian

> Cheers,
> 
> Dave.
> 


end of thread, other threads:[~2013-02-01 16:03 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-29 20:42 [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push Brian Foster
2013-01-29 20:42 ` [PATCH RFC 1/2] xfs: conditionally force log on trylock failure of pinned/stale buf Brian Foster
2013-01-29 20:42 ` [PATCH RFC 2/2] xfs: drop xa_lock around log force in xfs_buf_item push Brian Foster
2013-01-29 22:41 ` [PATCH RFC 0/2] fix spinlock recursion on xa_lock in xfs_buf_item_push Mark Tinguely
2013-01-30  6:05 ` Dave Chinner
     [not found]   ` <5109291E.6090303@sgi.com>
2013-01-30 16:02     ` Brian Foster
2013-01-30 21:59       ` Dave Chinner
2013-01-31 17:45         ` Brian Foster
2013-02-01  1:26           ` Dave Chinner
2013-02-01 16:01             ` Brian Foster
