Re: Linux-next parallel cp workload hang

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dave Chinner <david@fromorbit.com>
To: Xiong Zhou <xzhou@redhat.com>
Cc: linux-next@vger.kernel.org, viro@zeniv.linux.org.uk,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: Linux-next parallel cp workload hang
Date: Thu, 19 May 2016 00:17:26 +1000	[thread overview]
Message-ID: <20160518141726.GY26977@dastard> (raw)
In-Reply-To: <20160518114617.GC6551@dhcp12-144.nay.redhat.com>

On Wed, May 18, 2016 at 07:46:17PM +0800, Xiong Zhou wrote:
> 
> On Wed, May 18, 2016 at 07:54:09PM +1000, Dave Chinner wrote:
> > On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote:
> > > Hi,
> > > 
> > > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> > > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > > > > Hi,
> > > > > 
> > > > > Parallel cp workload (xfstests generic/273) hangs like blow.
> > > > > It's reproducible with a small chance, less the 1/100 i think.
> > > > > 
> > > > > Have hit this in linux-next 20160504 0506 0510 trees, testing on
> > > > > xfs with loop or block device. Ext4 survived several rounds
> > > > > of testing.
> > > > > 
> > > > > Linux next 20160510 tree hangs within 500 rounds testing several
> > > > > times. The same tree with vfs parallel lookup patchset reverted
> > > > > survived 900 rounds testing. Reverted commits are attached.  > 

Ok, this is trivial to reproduce. Al - I've hit this 9 times out of
ten running it on a 4p VM with a pair of 4GB ram disks using all
the current upstream default mkfs and mount configurations. On the
tenth attempt I got the tracing to capture what I needed to see -
process 7340 was the last xfs_buf_lock_done trace without an unlock
trace, and that process had this trace:

schedule
rwsem_down_read_failed
call_rwsem_down_read_failed
down_read
xfs_ilock
xfs_ilock_data_map_shared
xfs_dir2_leaf_getdents
xfs_readdir
xfs_file_readdir
iterate_dir
SyS_getdents
entry_SYSCALL_64_fastpath

Which means it's holding a buffer lock while trying to get the
ilock(shared). That's never going to end well - I'm now wondering
why lockdep hasn't been all over this lock order inversion....

Essentially, it's a three-process deadlock involving
shared/exclusive barriers and inverted lock orders. It's a
pre-existing problem with buffer mapping lock orders, nothing to do
with with the VFS parallelisation code.

process 1		process 2		process 3
---------		---------		---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
       ilock(shared)
       map, lock and read buffer
       iunlock(shared)
       process entries in buffer
       .....
       						readdir
						iolock(shared)
						  get_leaf_dents
						    iterate entries
						      ilock(shared)
						      map, lock buffer
						      <blocks>
       			finish ->iterate_shared
			file_accessed()
			  ->update_time
			    start transaction
			    ilock(excl)
			    <blocks>
	.....
	finishes processing buffer
	get next buffer
	  ilock(shared)
	  <blocks>

And that's the deadlock.

Now I know what the problem is I can say that process 2 - the
transactional timestamp update - is the reason the readdir
operations are blocking like this. And I know why CXFS never hit
this - it doesn't use the VFS paths, so the VFS calls to update
timestamps don't exist during concurrent readdir operations on the
CXFS metadata server. Hence process 2 doesn't exist and no exclusive
barriers are put in amongst the shared locking....

Patch below should fix the deadlock.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: concurrent readdir hangs on data buffer locks

From: Dave Chinner <dchinner@redhat.com>

There's a three-process deadlock involving shared/exclusive barriers
and inverted lock orders in the directory readdir implementation.
It's a pre-existing problem with lock ordering, exposed by the
VFS parallelisation code.

process 1               process 2               process 3
---------               ---------               ---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
       ilock(shared)
       map, lock and read buffer
       iunlock(shared)
       process entries in buffer
       .....
                                                readdir
                                                iolock(shared)
                                                  get_leaf_dents
                                                    iterate entries
                                                      ilock(shared)
                                                      map, lock buffer
                                                      <blocks>
                        finish ->iterate_shared
                        file_accessed()
                          ->update_time
                            start transaction
                            ilock(excl)
                            <blocks>
        .....
        finishes processing buffer
        get next buffer
          ilock(shared)
          <blocks>

And that's the deadlock.

Fix this by dropping the current buffer lock in process 1 before
trying to map the next buffer. This means we keep the lock order of
ilock -> buffer lock intact and hence will allow process 3 to make
progress and drop it's ilock(shared) once it is done.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_dir2_readdir.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 93b3ab0..21501dc 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -273,10 +273,11 @@ xfs_dir2_leaf_readbuf(
 	size_t			bufsize,
 	struct xfs_dir2_leaf_map_info *mip,
 	xfs_dir2_off_t		*curoff,
-	struct xfs_buf		**bpp)
+	struct xfs_buf		**bpp,
+	bool			trim_map)
 {
 	struct xfs_inode	*dp = args->dp;
-	struct xfs_buf		*bp = *bpp;
+	struct xfs_buf		*bp = NULL;
 	struct xfs_bmbt_irec	*map = mip->map;
 	struct blk_plug		plug;
 	int			error = 0;
@@ -286,13 +287,10 @@ xfs_dir2_leaf_readbuf(
 	struct xfs_da_geometry	*geo = args->geo;
 
 	/*
-	 * If we have a buffer, we need to release it and
-	 * take it out of the mapping.
+	 * If the caller just finished processing a buffer, it will tell us
+	 * we need to trim that block out of the mapping now it is done.
 	 */
-
-	if (bp) {
-		xfs_trans_brelse(NULL, bp);
-		bp = NULL;
+	if (trim_map) {
 		mip->map_blocks -= geo->fsbcount;
 		/*
 		 * Loop to get rid of the extents for the
@@ -533,10 +531,17 @@ xfs_dir2_leaf_getdents(
 		 */
 		if (!bp || ptr >= (char *)bp->b_addr + geo->blksize) {
 			int	lock_mode;
+			bool	trim_map = false;
+
+			if (bp) {
+				xfs_trans_brelse(NULL, bp);
+				bp = NULL;
+				trim_map = true;
+			}
 
 			lock_mode = xfs_ilock_data_map_shared(dp);
 			error = xfs_dir2_leaf_readbuf(args, bufsize, map_info,
-						      &curoff, &bp);
+						      &curoff, &bp, trim_map);
 			xfs_iunlock(dp, lock_mode);
 			if (error || !map_info->map_valid)
 				break;

next prev parent reply	other threads:[~2016-05-18 14:17 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-18  1:46 Linux-next parallel cp workload hang Xiong Zhou
2016-05-18  5:56 ` Dave Chinner
2016-05-18  8:31   ` Xiong Zhou
2016-05-18  9:54     ` Dave Chinner
2016-05-18 11:46       ` Xiong Zhou
2016-05-18 14:17         ` Dave Chinner [this message]
2016-05-18 23:02           ` Dave Chinner
2016-05-19  6:22             ` Xiong Zhou
     [not found] <20160518013802.GA4679@ZX.nay.redhat.com>
2016-05-18  2:06 ` Al Viro

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:93b3ab0 dfblob:21501dc )
 OR (
bs:"Re: Linux-next parallel cp workload hang" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160518141726.GY26977@dastard \
    --to=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-next@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xzhou@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.