From: Dave Chinner <david@fromorbit.com>
To: Xiong Zhou <xzhou@redhat.com>
Cc: linux-next@vger.kernel.org, viro@zeniv.linux.org.uk,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: Linux-next parallel cp workload hang
Date: Thu, 19 May 2016 00:17:26 +1000 [thread overview]
Message-ID: <20160518141726.GY26977@dastard> (raw)
In-Reply-To: <20160518114617.GC6551@dhcp12-144.nay.redhat.com>
On Wed, May 18, 2016 at 07:46:17PM +0800, Xiong Zhou wrote:
>
> On Wed, May 18, 2016 at 07:54:09PM +1000, Dave Chinner wrote:
> > On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote:
> > > Hi,
> > >
> > > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> > > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > > > > Hi,
> > > > >
> > > > > Parallel cp workload (xfstests generic/273) hangs like blow.
> > > > > It's reproducible with a small chance, less the 1/100 i think.
> > > > >
> > > > > Have hit this in linux-next 20160504 0506 0510 trees, testing on
> > > > > xfs with loop or block device. Ext4 survived several rounds
> > > > > of testing.
> > > > >
> > > > > Linux next 20160510 tree hangs within 500 rounds testing several
> > > > > times. The same tree with vfs parallel lookup patchset reverted
> > > > > survived 900 rounds testing. Reverted commits are attached. >
Ok, this is trivial to reproduce. Al - I've hit this 9 times out of
ten running it on a 4p VM with a pair of 4GB ram disks using all
the current upstream default mkfs and mount configurations. On the
tenth attempt I got the tracing to capture what I needed to see -
process 7340 was the last xfs_buf_lock_done trace without an unlock
trace, and that process had this trace:
schedule
rwsem_down_read_failed
call_rwsem_down_read_failed
down_read
xfs_ilock
xfs_ilock_data_map_shared
xfs_dir2_leaf_getdents
xfs_readdir
xfs_file_readdir
iterate_dir
SyS_getdents
entry_SYSCALL_64_fastpath
Which means it's holding a buffer lock while trying to get the
ilock(shared). That's never going to end well - I'm now wondering
why lockdep hasn't been all over this lock order inversion....
Essentially, it's a three-process deadlock involving
shared/exclusive barriers and inverted lock orders. It's a
pre-existing problem with buffer mapping lock orders, nothing to do
with with the VFS parallelisation code.
process 1 process 2 process 3
--------- --------- ---------
readdir
iolock(shared)
get_leaf_dents
iterate entries
ilock(shared)
map, lock and read buffer
iunlock(shared)
process entries in buffer
.....
readdir
iolock(shared)
get_leaf_dents
iterate entries
ilock(shared)
map, lock buffer
<blocks>
finish ->iterate_shared
file_accessed()
->update_time
start transaction
ilock(excl)
<blocks>
.....
finishes processing buffer
get next buffer
ilock(shared)
<blocks>
And that's the deadlock.
Now I know what the problem is I can say that process 2 - the
transactional timestamp update - is the reason the readdir
operations are blocking like this. And I know why CXFS never hit
this - it doesn't use the VFS paths, so the VFS calls to update
timestamps don't exist during concurrent readdir operations on the
CXFS metadata server. Hence process 2 doesn't exist and no exclusive
barriers are put in amongst the shared locking....
Patch below should fix the deadlock.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
xfs: concurrent readdir hangs on data buffer locks
From: Dave Chinner <dchinner@redhat.com>
There's a three-process deadlock involving shared/exclusive barriers
and inverted lock orders in the directory readdir implementation.
It's a pre-existing problem with lock ordering, exposed by the
VFS parallelisation code.
process 1 process 2 process 3
--------- --------- ---------
readdir
iolock(shared)
get_leaf_dents
iterate entries
ilock(shared)
map, lock and read buffer
iunlock(shared)
process entries in buffer
.....
readdir
iolock(shared)
get_leaf_dents
iterate entries
ilock(shared)
map, lock buffer
<blocks>
finish ->iterate_shared
file_accessed()
->update_time
start transaction
ilock(excl)
<blocks>
.....
finishes processing buffer
get next buffer
ilock(shared)
<blocks>
And that's the deadlock.
Fix this by dropping the current buffer lock in process 1 before
trying to map the next buffer. This means we keep the lock order of
ilock -> buffer lock intact and hence will allow process 3 to make
progress and drop it's ilock(shared) once it is done.
Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_dir2_readdir.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 93b3ab0..21501dc 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -273,10 +273,11 @@ xfs_dir2_leaf_readbuf(
size_t bufsize,
struct xfs_dir2_leaf_map_info *mip,
xfs_dir2_off_t *curoff,
- struct xfs_buf **bpp)
+ struct xfs_buf **bpp,
+ bool trim_map)
{
struct xfs_inode *dp = args->dp;
- struct xfs_buf *bp = *bpp;
+ struct xfs_buf *bp = NULL;
struct xfs_bmbt_irec *map = mip->map;
struct blk_plug plug;
int error = 0;
@@ -286,13 +287,10 @@ xfs_dir2_leaf_readbuf(
struct xfs_da_geometry *geo = args->geo;
/*
- * If we have a buffer, we need to release it and
- * take it out of the mapping.
+ * If the caller just finished processing a buffer, it will tell us
+ * we need to trim that block out of the mapping now it is done.
*/
-
- if (bp) {
- xfs_trans_brelse(NULL, bp);
- bp = NULL;
+ if (trim_map) {
mip->map_blocks -= geo->fsbcount;
/*
* Loop to get rid of the extents for the
@@ -533,10 +531,17 @@ xfs_dir2_leaf_getdents(
*/
if (!bp || ptr >= (char *)bp->b_addr + geo->blksize) {
int lock_mode;
+ bool trim_map = false;
+
+ if (bp) {
+ xfs_trans_brelse(NULL, bp);
+ bp = NULL;
+ trim_map = true;
+ }
lock_mode = xfs_ilock_data_map_shared(dp);
error = xfs_dir2_leaf_readbuf(args, bufsize, map_info,
- &curoff, &bp);
+ &curoff, &bp, trim_map);
xfs_iunlock(dp, lock_mode);
if (error || !map_info->map_valid)
break;
next prev parent reply other threads:[~2016-05-18 14:17 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-05-18 1:46 Linux-next parallel cp workload hang Xiong Zhou
2016-05-18 5:56 ` Dave Chinner
2016-05-18 8:31 ` Xiong Zhou
2016-05-18 9:54 ` Dave Chinner
2016-05-18 11:46 ` Xiong Zhou
2016-05-18 14:17 ` Dave Chinner [this message]
2016-05-18 23:02 ` Dave Chinner
2016-05-19 6:22 ` Xiong Zhou
[not found] <20160518013802.GA4679@ZX.nay.redhat.com>
2016-05-18 2:06 ` Al Viro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160518141726.GY26977@dastard \
--to=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-next@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
--cc=xzhou@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.