public inbox for linux-xfs@vger.kernel.org
* [PATCH 0/3] xfs: lockdep and stack reduction fixes
@ 2014-02-19  4:16 Dave Chinner
  2014-02-19  4:16 ` [PATCH 1/3] xfs: always do log forces via the workqueue Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Dave Chinner @ 2014-02-19  4:16 UTC (permalink / raw)
  To: xfs

Hi folks,

The following patches fix the lockdep regression caused by adding
the inode ilock to the readdir code, and reduce the stack usage of
XFS in a couple of critical paths. The log force change is the
critical one, as log forces can occur in leaf functions of those
code paths and hence greatly increase the maximum stack depth of any
path that needs to unpin an object or lock a buffer.

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-19  4:16 [PATCH 0/3] xfs: lockdep and stack reduction fixes Dave Chinner
@ 2014-02-19  4:16 ` Dave Chinner
  2014-02-19 18:24   ` Brian Foster
  2014-02-19  4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner
  2014-02-19  4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner
  2 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2014-02-19  4:16 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Log forces can occur deep in the call chain when we have relatively
little stack free. Log forces can also happen close to the leaves of
the call chain (e.g. xfs_buf_lock()) and hence we can trigger IO
from places where we really don't want to add more stack overhead.

This stack overhead occurs because log forces do foreground CIL
pushes (xlog_cil_push_foreground()) rather than waking the
background push workqueue and waiting for the push to complete.
The foreground push was done to avoid confusing the CFQ IO
scheduler when fsync()s were issued, as it has trouble dealing with
dependent IOs being issued from different process contexts.

Avoiding blowing the stack is much more critical than performance
optimisations for CFQ, especially as we've been recommending against
the use of CFQ for XFS since the 3.2 kernels were released because
of its problems with multi-threaded IO workloads.

Hence convert xlog_cil_push_foreground() to move the push work
to the CIL workqueue. We already do the waiting for the push to
complete in xlog_cil_force_lsn(), so there's nothing else we need to
modify to make this work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b57a8e0..7e54553 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -499,13 +499,6 @@ xlog_cil_push(
 	cil->xc_ctx = new_ctx;
 
 	/*
-	 * mirror the new sequence into the cil structure so that we can do
-	 * unlocked checks against the current sequence in log forces without
-	 * risking deferencing a freed context pointer.
-	 */
-	cil->xc_current_sequence = new_ctx->sequence;
-
-	/*
 	 * The switch is now done, so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
@@ -523,8 +516,15 @@ xlog_cil_push(
 	 * Hence we need to add this context to the committing context list so
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
+	 *
+	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+	 * structure atomically with the addition of this sequence to the
+	 * committing list. This also ensures that we can do unlocked checks
+	 * against the current sequence in log forces without risking
+	 * dereferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
+	cil->xc_current_sequence = new_ctx->sequence;
 	list_add(&ctx->committing, &cil->xc_committing);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
@@ -662,8 +662,14 @@ xlog_cil_push_background(
 
 }
 
+/*
+ * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
+ * number that is passed. When it returns, the work will be queued for
+ * @push_seq, but it won't be completed. The caller is expected to do any
+ * waiting for push_seq to complete if it is required.
+ */
 static void
-xlog_cil_push_foreground(
+xlog_cil_push_now(
 	struct xlog	*log,
 	xfs_lsn_t	push_seq)
 {
@@ -688,10 +694,8 @@ xlog_cil_push_foreground(
 	}
 
 	cil->xc_push_seq = push_seq;
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	spin_unlock(&cil->xc_push_lock);
-
-	/* do the push now */
-	xlog_cil_push(log);
 }
 
 bool
@@ -795,7 +799,8 @@ xlog_cil_force_lsn(
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
-	xlog_cil_push_foreground(log, sequence);
+restart:
+	xlog_cil_push_now(log, sequence);
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -803,7 +808,6 @@ xlog_cil_force_lsn(
 	 * before allowing the force of push_seq to go ahead. Hence block
 	 * on commits for those as well.
 	 */
-restart:
 	spin_lock(&cil->xc_push_lock);
 	list_for_each_entry(ctx, &cil->xc_committing, committing) {
 		if (ctx->sequence > sequence)
@@ -821,6 +825,28 @@ restart:
 		/* found it! */
 		commit_lsn = ctx->commit_lsn;
 	}
+
+	/*
+	 * The call to xlog_cil_push_now() executes the push in the background.
+	 * Hence by the time we have got here our sequence may not have been
+	 * pushed yet. This is true if the current sequence still matches the
+	 * push sequence after the above wait loop and the CIL still contains
+	 * dirty objects.
+	 *
+	 * When the push occurs, it will empty the CIL and
+	 * atomically increment the current sequence past the push sequence and
+	 * move it into the committing list. Of course, if the CIL is clean at
+	 * the time of the push, it won't have pushed the CIL at all, so in that
+	 * case we should try the push for this sequence again from the start
+	 * just in case.
+	 */
+
+	if (sequence == cil->xc_current_sequence &&
+	    !list_empty(&cil->xc_cil)) {
+		spin_unlock(&cil->xc_push_lock);
+		goto restart;
+	}
+
 	spin_unlock(&cil->xc_push_lock);
 	return commit_lsn;
 }
-- 
1.8.4.rc3

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive
  2014-02-19  4:16 [PATCH 0/3] xfs: lockdep and stack reduction fixes Dave Chinner
  2014-02-19  4:16 ` [PATCH 1/3] xfs: always do log forces via the workqueue Dave Chinner
@ 2014-02-19  4:16 ` Dave Chinner
  2014-02-19 18:25   ` Brian Foster
  2014-02-20 14:51   ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Christoph Hellwig
  2014-02-19  4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner
  2 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2014-02-19  4:16 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The change to add the IO lock to protect the directory extent map
during readdir operations has caused lockdep to have a heart attack,
as it now sees a different locking order on inodes w.r.t. the
mmap_sem because readdir has a different ordering to write().

Add a new lockdep class for directory inodes to avoid this false
positive.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 9ddfb81..bb3bb65 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -48,6 +48,18 @@
 #include <linux/fiemap.h>
 #include <linux/slab.h>
 
+/*
+ * Directories have different lock order w.r.t. mmap_sem compared to regular
+ * files. This is due to readdir potentially triggering page faults on a user
+ * buffer inside filldir(), and this happens with the ilock on the directory
+ * held. For regular files, the lock order is the other way around - the
+ * mmap_sem is taken during the page fault, and then we lock the ilock to do
+ * block mapping. Hence we need a different class for the directory ilock so
+ * that lockdep can tell them apart.
+ */
+static struct lock_class_key xfs_nondir_ilock_class;
+static struct lock_class_key xfs_dir_ilock_class;
+
 static int
 xfs_initxattrs(
 	struct inode		*inode,
@@ -1191,6 +1203,7 @@ xfs_setup_inode(
 	xfs_diflags_to_iflags(inode, ip);
 
 	ip->d_ops = ip->i_mount->m_nondir_inode_ops;
+	lockdep_set_class(&ip->i_lock.mr_lock, &xfs_nondir_ilock_class);
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
@@ -1198,6 +1211,7 @@ xfs_setup_inode(
 		inode->i_mapping->a_ops = &xfs_address_space_operations;
 		break;
 	case S_IFDIR:
+		lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class);
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
 			inode->i_op = &xfs_dir_ci_inode_operations;
 		else
-- 
1.8.4.rc3

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint
  2014-02-19  4:16 [PATCH 0/3] xfs: lockdep and stack reduction fixes Dave Chinner
  2014-02-19  4:16 ` [PATCH 1/3] xfs: always do log forces via the workqueue Dave Chinner
  2014-02-19  4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner
@ 2014-02-19  4:16 ` Dave Chinner
  2014-02-19 18:25   ` Brian Foster
  2014-02-20 14:56   ` Christoph Hellwig
  2 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2014-02-19  4:16 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The struct xfs_da_args used to pass directory/attribute operation
information to the lower layers is 128 bytes in size and is
allocated on the stack. Allocate it dynamically instead to reduce
the stack footprint of directory operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_dir2.c | 344 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 214 insertions(+), 130 deletions(-)

diff --git a/fs/xfs/xfs_dir2.c b/fs/xfs/xfs_dir2.c
index ce16ef0..fc9a41f 100644
--- a/fs/xfs/xfs_dir2.c
+++ b/fs/xfs/xfs_dir2.c
@@ -180,16 +180,23 @@ xfs_dir_init(
 	xfs_inode_t	*dp,
 	xfs_inode_t	*pdp)
 {
-	xfs_da_args_t	args;
+	struct xfs_da_args *args;
 	int		error;
 
-	memset((char *)&args, 0, sizeof(args));
-	args.dp = dp;
-	args.trans = tp;
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
-	if ((error = xfs_dir_ino_validate(tp->t_mountp, pdp->i_ino)))
+	error = xfs_dir_ino_validate(tp->t_mountp, pdp->i_ino);
+	if (error)
 		return error;
-	return xfs_dir2_sf_create(&args, pdp->i_ino);
+
+	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
+	if (!args)
+		return ENOMEM;
+
+	args->dp = dp;
+	args->trans = tp;
+	error = xfs_dir2_sf_create(args, pdp->i_ino);
+	kmem_free(args);
+	return error;
 }
 
 /*
@@ -205,41 +212,56 @@ xfs_dir_createname(
 	xfs_bmap_free_t		*flist,		/* bmap's freeblock list */
 	xfs_extlen_t		total)		/* bmap's total block count */
 {
-	xfs_da_args_t		args;
+	struct xfs_da_args	*args;
 	int			rval;
 	int			v;		/* type-checking value */
 
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
-	if ((rval = xfs_dir_ino_validate(tp->t_mountp, inum)))
+	rval = xfs_dir_ino_validate(tp->t_mountp, inum);
+	if (rval)
 		return rval;
 	XFS_STATS_INC(xs_dir_create);
 
-	memset(&args, 0, sizeof(xfs_da_args_t));
-	args.name = name->name;
-	args.namelen = name->len;
-	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
-	args.inumber = inum;
-	args.dp = dp;
-	args.firstblock = first;
-	args.flist = flist;
-	args.total = total;
-	args.whichfork = XFS_DATA_FORK;
-	args.trans = tp;
-	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
-
-	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
-		rval = xfs_dir2_sf_addname(&args);
-	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_block_addname(&args);
-	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_leaf_addname(&args);
+	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
+	if (!args)
+		return ENOMEM;
+
+	args->name = name->name;
+	args->namelen = name->len;
+	args->filetype = name->type;
+	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args->inumber = inum;
+	args->dp = dp;
+	args->firstblock = first;
+	args->flist = flist;
+	args->total = total;
+	args->whichfork = XFS_DATA_FORK;
+	args->trans = tp;
+	args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+
+	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
+		rval = xfs_dir2_sf_addname(args);
+		goto out_free;
+	}
+
+	rval = xfs_dir2_isblock(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v) {
+		rval = xfs_dir2_block_addname(args);
+		goto out_free;
+	}
+
+	rval = xfs_dir2_isleaf(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v)
+		rval = xfs_dir2_leaf_addname(args);
 	else
-		rval = xfs_dir2_node_addname(&args);
+		rval = xfs_dir2_node_addname(args);
+
+out_free:
+	kmem_free(args);
 	return rval;
 }
 
@@ -282,46 +304,68 @@ xfs_dir_lookup(
 	xfs_ino_t	*inum,		/* out: inode number */
 	struct xfs_name *ci_name)	/* out: actual name if CI match */
 {
-	xfs_da_args_t	args;
+	struct xfs_da_args *args;
 	int		rval;
 	int		v;		/* type-checking value */
 
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
 	XFS_STATS_INC(xs_dir_lookup);
 
-	memset(&args, 0, sizeof(xfs_da_args_t));
-	args.name = name->name;
-	args.namelen = name->len;
-	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
-	args.dp = dp;
-	args.whichfork = XFS_DATA_FORK;
-	args.trans = tp;
-	args.op_flags = XFS_DA_OP_OKNOENT;
+	/*
+	 * If we don't use KM_NOFS here, lockdep will throw false positive
+	 * deadlock warnings when we come through here via the non-transactional
+	 * lookup path because the allocation can recurse into inode reclaim.
+	 * Doing this avoids having to add a bunch of lockdep class
+	 * annotations into the reclaim path for the ilock.
+	 */
+	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
+	if (!args)
+		return ENOMEM;
+
+	args->name = name->name;
+	args->namelen = name->len;
+	args->filetype = name->type;
+	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args->dp = dp;
+	args->whichfork = XFS_DATA_FORK;
+	args->trans = tp;
+	args->op_flags = XFS_DA_OP_OKNOENT;
 	if (ci_name)
-		args.op_flags |= XFS_DA_OP_CILOOKUP;
+		args->op_flags |= XFS_DA_OP_CILOOKUP;
 
-	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
-		rval = xfs_dir2_sf_lookup(&args);
-	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_block_lookup(&args);
-	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_leaf_lookup(&args);
+	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
+		rval = xfs_dir2_sf_lookup(args);
+		goto out_check_rval;
+	}
+
+	rval = xfs_dir2_isblock(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v) {
+		rval = xfs_dir2_block_lookup(args);
+		goto out_check_rval;
+	}
+
+	rval = xfs_dir2_isleaf(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v)
+		rval = xfs_dir2_leaf_lookup(args);
 	else
-		rval = xfs_dir2_node_lookup(&args);
+		rval = xfs_dir2_node_lookup(args);
+
+out_check_rval:
 	if (rval == EEXIST)
 		rval = 0;
 	if (!rval) {
-		*inum = args.inumber;
+		*inum = args->inumber;
 		if (ci_name) {
-			ci_name->name = args.value;
-			ci_name->len = args.valuelen;
+			ci_name->name = args->value;
+			ci_name->len = args->valuelen;
 		}
 	}
+out_free:
+	kmem_free(args);
 	return rval;
 }
 
@@ -338,38 +382,51 @@ xfs_dir_removename(
 	xfs_bmap_free_t	*flist,		/* bmap's freeblock list */
 	xfs_extlen_t	total)		/* bmap's total block count */
 {
-	xfs_da_args_t	args;
+	struct xfs_da_args *args;
 	int		rval;
 	int		v;		/* type-checking value */
 
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
 	XFS_STATS_INC(xs_dir_remove);
 
-	memset(&args, 0, sizeof(xfs_da_args_t));
-	args.name = name->name;
-	args.namelen = name->len;
-	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
-	args.inumber = ino;
-	args.dp = dp;
-	args.firstblock = first;
-	args.flist = flist;
-	args.total = total;
-	args.whichfork = XFS_DATA_FORK;
-	args.trans = tp;
-
-	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
-		rval = xfs_dir2_sf_removename(&args);
-	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_block_removename(&args);
-	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_leaf_removename(&args);
+	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
+	if (!args)
+		return ENOMEM;
+
+	args->name = name->name;
+	args->namelen = name->len;
+	args->filetype = name->type;
+	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args->inumber = ino;
+	args->dp = dp;
+	args->firstblock = first;
+	args->flist = flist;
+	args->total = total;
+	args->whichfork = XFS_DATA_FORK;
+	args->trans = tp;
+
+	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
+		rval = xfs_dir2_sf_removename(args);
+		goto out_free;
+	}
+
+	rval = xfs_dir2_isblock(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v) {
+		rval = xfs_dir2_block_removename(args);
+		goto out_free;
+	}
+
+	rval = xfs_dir2_isleaf(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v)
+		rval = xfs_dir2_leaf_removename(args);
 	else
-		rval = xfs_dir2_node_removename(&args);
+		rval = xfs_dir2_node_removename(args);
+out_free:
+	kmem_free(args);
 	return rval;
 }
 
@@ -386,40 +443,54 @@ xfs_dir_replace(
 	xfs_bmap_free_t	*flist,		/* bmap's freeblock list */
 	xfs_extlen_t	total)		/* bmap's total block count */
 {
-	xfs_da_args_t	args;
+	struct xfs_da_args *args;
 	int		rval;
 	int		v;		/* type-checking value */
 
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
 
-	if ((rval = xfs_dir_ino_validate(tp->t_mountp, inum)))
+	rval = xfs_dir_ino_validate(tp->t_mountp, inum);
+	if (rval)
 		return rval;
 
-	memset(&args, 0, sizeof(xfs_da_args_t));
-	args.name = name->name;
-	args.namelen = name->len;
-	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
-	args.inumber = inum;
-	args.dp = dp;
-	args.firstblock = first;
-	args.flist = flist;
-	args.total = total;
-	args.whichfork = XFS_DATA_FORK;
-	args.trans = tp;
-
-	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
-		rval = xfs_dir2_sf_replace(&args);
-	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_block_replace(&args);
-	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_leaf_replace(&args);
+	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
+	if (!args)
+		return ENOMEM;
+
+	args->name = name->name;
+	args->namelen = name->len;
+	args->filetype = name->type;
+	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args->inumber = inum;
+	args->dp = dp;
+	args->firstblock = first;
+	args->flist = flist;
+	args->total = total;
+	args->whichfork = XFS_DATA_FORK;
+	args->trans = tp;
+
+	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
+		rval = xfs_dir2_sf_replace(args);
+		goto out_free;
+	}
+
+	rval = xfs_dir2_isblock(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v) {
+		rval = xfs_dir2_block_replace(args);
+		goto out_free;
+	}
+
+	rval = xfs_dir2_isleaf(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v)
+		rval = xfs_dir2_leaf_replace(args);
 	else
-		rval = xfs_dir2_node_replace(&args);
+		rval = xfs_dir2_node_replace(args);
+out_free:
+	kmem_free(args);
 	return rval;
 }
 
@@ -434,7 +505,7 @@ xfs_dir_canenter(
 	struct xfs_name	*name,		/* name of entry to add */
 	uint		resblks)
 {
-	xfs_da_args_t	args;
+	struct xfs_da_args *args;
 	int		rval;
 	int		v;		/* type-checking value */
 
@@ -443,29 +514,42 @@ xfs_dir_canenter(
 
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
 
-	memset(&args, 0, sizeof(xfs_da_args_t));
-	args.name = name->name;
-	args.namelen = name->len;
-	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
-	args.dp = dp;
-	args.whichfork = XFS_DATA_FORK;
-	args.trans = tp;
-	args.op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME |
+	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
+	if (!args)
+		return ENOMEM;
+
+	args->name = name->name;
+	args->namelen = name->len;
+	args->filetype = name->type;
+	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args->dp = dp;
+	args->whichfork = XFS_DATA_FORK;
+	args->trans = tp;
+	args->op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME |
 							XFS_DA_OP_OKNOENT;
 
-	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
-		rval = xfs_dir2_sf_addname(&args);
-	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_block_addname(&args);
-	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
-	else if (v)
-		rval = xfs_dir2_leaf_addname(&args);
+	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
+		rval = xfs_dir2_sf_addname(args);
+		goto out_free;
+	}
+
+	rval = xfs_dir2_isblock(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v) {
+		rval = xfs_dir2_block_addname(args);
+		goto out_free;
+	}
+
+	rval = xfs_dir2_isleaf(tp, dp, &v);
+	if (rval)
+		goto out_free;
+	if (v)
+		rval = xfs_dir2_leaf_addname(args);
 	else
-		rval = xfs_dir2_node_addname(&args);
+		rval = xfs_dir2_node_addname(args);
+out_free:
+	kmem_free(args);
 	return rval;
 }
 
-- 
1.8.4.rc3

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-19  4:16 ` [PATCH 1/3] xfs: always do log forces via the workqueue Dave Chinner
@ 2014-02-19 18:24   ` Brian Foster
  2014-02-20  0:23     ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2014-02-19 18:24 UTC (permalink / raw)
  To: Dave Chinner, xfs

On 02/18/2014 11:16 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Log forces can occur deep in the call chain when we have relatively
> little stack free. Log forces can also happen close to the leaves
> of the call chain (e.g. xfs_buf_lock()) and hence we can trigger IO from
> places where we really don't want to add more stack overhead.
> 
> This stack overhead occurs because log forces do foreground CIL
> pushes (xlog_cil_push_foreground()) rather than waking the
> background push workqueue and waiting for the push to complete.
> The foreground push was done to avoid confusing the CFQ IO
> scheduler when fsync()s were issued, as it has trouble dealing with
> dependent IOs being issued from different process contexts.
> 
> Avoiding blowing the stack is much more critical than performance
> optimisations for CFQ, especially as we've been recommending against
> the use of CFQ for XFS since the 3.2 kernels were released because
> of its problems with multi-threaded IO workloads.
> 
> Hence convert xlog_cil_push_foreground() to move the push work
> to the CIL workqueue. We already do the waiting for the push to
> complete in xlog_cil_force_lsn(), so there's nothing else we need to
> modify to make this work.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 39 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b57a8e0..7e54553 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -499,13 +499,6 @@ xlog_cil_push(
>  	cil->xc_ctx = new_ctx;
>  
>  	/*
> -	 * mirror the new sequence into the cil structure so that we can do
> -	 * unlocked checks against the current sequence in log forces without
> -	 * risking deferencing a freed context pointer.
> -	 */
> -	cil->xc_current_sequence = new_ctx->sequence;
> -
> -	/*
>  	 * The switch is now done, so we can drop the context lock and move out
>  	 * of a shared context. We can't just go straight to the commit record,
>  	 * though - we need to synchronise with previous and future commits so
> @@ -523,8 +516,15 @@ xlog_cil_push(
>  	 * Hence we need to add this context to the committing context list so
>  	 * that higher sequences will wait for us to write out a commit record
>  	 * before they do.
> +	 *
> +	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
> +	 * structure atomically with the addition of this sequence to the
> +	 * committing list. This also ensures that we can do unlocked checks
> +	 * against the current sequence in log forces without risking
> +	 * dereferencing a freed context pointer.
>  	 */
>  	spin_lock(&cil->xc_push_lock);
> +	cil->xc_current_sequence = new_ctx->sequence;
>  	list_add(&ctx->committing, &cil->xc_committing);
>  	spin_unlock(&cil->xc_push_lock);
>  	up_write(&cil->xc_ctx_lock);
> @@ -662,8 +662,14 @@ xlog_cil_push_background(
>  
>  }
>  
> +/*
> + * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
> + * number that is passed. When it returns, the work will be queued for
> + * @push_seq, but it won't be completed. The caller is expected to do any
> + * waiting for push_seq to complete if it is required.
> + */
>  static void
> -xlog_cil_push_foreground(
> +xlog_cil_push_now(
>  	struct xlog	*log,
>  	xfs_lsn_t	push_seq)
>  {
> @@ -688,10 +694,8 @@ xlog_cil_push_foreground(
>  	}
>  
>  	cil->xc_push_seq = push_seq;
> +	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
>  	spin_unlock(&cil->xc_push_lock);
> -
> -	/* do the push now */
> -	xlog_cil_push(log);
>  }
>  
>  bool
> @@ -795,7 +799,8 @@ xlog_cil_force_lsn(
>  	 * xlog_cil_push() handles racing pushes for the same sequence,
>  	 * so no need to deal with it here.
>  	 */
> -	xlog_cil_push_foreground(log, sequence);
> +restart:
> +	xlog_cil_push_now(log, sequence);
>  
>  	/*
>  	 * See if we can find a previous sequence still committing.
> @@ -803,7 +808,6 @@ xlog_cil_force_lsn(
>  	 * before allowing the force of push_seq to go ahead. Hence block
>  	 * on commits for those as well.
>  	 */
> -restart:
>  	spin_lock(&cil->xc_push_lock);
>  	list_for_each_entry(ctx, &cil->xc_committing, committing) {
>  		if (ctx->sequence > sequence)
> @@ -821,6 +825,28 @@ restart:
>  		/* found it! */
>  		commit_lsn = ctx->commit_lsn;
>  	}
> +
> +	/*
> +	 * The call to xlog_cil_push_now() executes the push in the background.
> +	 * Hence by the time we have got here our sequence may not have been
> +	 * pushed yet. This is true if the current sequence still matches the
> +	 * push sequence after the above wait loop and the CIL still contains
> +	 * dirty objects.
> +	 *
> +	 * When the push occurs, it will empty the CIL and
> +	 * atomically increment the current sequence past the push sequence and
> +	 * move it into the committing list. Of course, if the CIL is clean at
> +	 * the time of the push, it won't have pushed the CIL at all, so in that
> +	 * case we should try the push for this sequence again from the start
> +	 * just in case.
> +	 */
> +
> +	if (sequence == cil->xc_current_sequence &&
> +	    !list_empty(&cil->xc_cil)) {
> +		spin_unlock(&cil->xc_push_lock);
> +		goto restart;
> +	}
> +

IIUC, the objective here is to make sure we don't leave this code path
before the push even starts and the ctx makes it onto the committing
list, due to xlog_cil_push_now() moving things to a workqueue.

Given that, what's the purpose of re-executing the background push as
opposed to restarting the wait sequence (as done previously)? It looks
like push_now() won't queue the work again due to cil->xc_push_seq, but
it will flush the queue and I suppose make it more likely the push
starts. Is that the intent?

Brian

>  	spin_unlock(&cil->xc_push_lock);
>  	return commit_lsn;
>  }
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive
  2014-02-19  4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner
@ 2014-02-19 18:25   ` Brian Foster
  2014-02-20  0:13     ` mmap_sem -> isec->lock lockdep issues with shmem (was Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive) Dave Chinner
  2014-02-20 14:51   ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Brian Foster @ 2014-02-19 18:25 UTC (permalink / raw)
  To: Dave Chinner, xfs

[-- Attachment #1: Type: text/plain, Size: 2347 bytes --]

On 02/18/2014 11:16 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The change to add the IO lock to protect the directory extent map
> during readdir operations has caused lockdep to have a heart attack
> as it now sees a different locking order on inodes w.r.t. the
> mmap_sem because readdir has a different ordering to write().
> 
> Add a new lockdep class for directory inodes to avoid this false
> positive.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Hey Dave,

I'm not terribly familiar with lockdep, but I hit the attached "possible
circular locking dependency detected" warning when running with this patch.

(Reproduces by running generic/001 after a reboot).

Brian

>  fs/xfs/xfs_iops.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 9ddfb81..bb3bb65 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -48,6 +48,18 @@
>  #include <linux/fiemap.h>
>  #include <linux/slab.h>
>  
> +/*
> + * Directories have different lock order w.r.t. mmap_sem compared to regular
> + * files. This is due to readdir potentially triggering page faults on a user
> + * buffer inside filldir(), and this happens with the ilock on the directory
> + * held. For regular files, the lock order is the other way around - the
> + * mmap_sem is taken during the page fault, and then we lock the ilock to do
> + * block mapping. Hence we need a different class for the directory ilock so
> + * that lockdep can tell them apart.
> + */
> +static struct lock_class_key xfs_nondir_ilock_class;
> +static struct lock_class_key xfs_dir_ilock_class;
> +
>  static int
>  xfs_initxattrs(
>  	struct inode		*inode,
> @@ -1191,6 +1203,7 @@ xfs_setup_inode(
>  	xfs_diflags_to_iflags(inode, ip);
>  
>  	ip->d_ops = ip->i_mount->m_nondir_inode_ops;
> +	lockdep_set_class(&ip->i_lock.mr_lock, &xfs_nondir_ilock_class);
>  	switch (inode->i_mode & S_IFMT) {
>  	case S_IFREG:
>  		inode->i_op = &xfs_inode_operations;
> @@ -1198,6 +1211,7 @@ xfs_setup_inode(
>  		inode->i_mapping->a_ops = &xfs_address_space_operations;
>  		break;
>  	case S_IFDIR:
> +		lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class);
>  		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
>  			inode->i_op = &xfs_dir_ci_inode_operations;
>  		else
> 



[-- Attachment #2: messages.lockdep --]
[-- Type: text/plain, Size: 17959 bytes --]

Feb 19 12:22:03 localhost kernel: [  101.486725] 
Feb 19 12:22:03 localhost kernel: [  101.486903] ======================================================
Feb 19 12:22:03 localhost kernel: [  101.487018] [ INFO: possible circular locking dependency detected ]
Feb 19 12:22:03 localhost kernel: [  101.487018] 3.14.0-rc1+ #6 Tainted: GF       W  O
Feb 19 12:22:03 localhost kernel: [  101.487018] -------------------------------------------------------
Feb 19 12:22:03 localhost kernel: [  101.487018] rm/4171 is trying to acquire lock:
Feb 19 12:22:03 localhost kernel: [  101.487018]  (&mm->mmap_sem){++++++}, at: [<ffffffff811cc8cf>] might_fault+0x5f/0xb0
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] but task is already holding lock:
Feb 19 12:22:03 localhost kernel: [  101.487018]  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] which lock already depends on the new lock.
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] the existing dependency chain (in reverse order) is:
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] -> #2 (&xfs_dir_ilock_class){++++..}:
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810ed147>] down_read_nested+0x57/0xa0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa05a01af>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa0565d50>] xfs_attr_get+0x90/0xe0 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa055b9d7>] xfs_xattr_get+0x37/0x50 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff812483ef>] generic_getxattr+0x4f/0x70
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8133fd5e>] inode_doinit_with_dentry+0x1ae/0x650
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff813402d8>] sb_finish_set_opts+0xd8/0x270
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81340702>] selinux_set_mnt_opts+0x292/0x5f0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81340ac8>] superblock_doinit+0x68/0xd0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81340b8d>] selinux_sb_kern_mount+0x3d/0xa0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81335536>] security_sb_kern_mount+0x16/0x20
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8122333a>] mount_fs+0x8a/0x1b0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8124285b>] vfs_kern_mount+0x6b/0x150
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8124561e>] do_mount+0x23e/0xb90
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff812462a3>] SyS_mount+0x83/0xc0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] -> #1 (&isec->lock){+.+.+.}:
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81780d77>] mutex_lock_nested+0x77/0x3f0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8133fc42>] inode_doinit_with_dentry+0x92/0x650
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81340dcc>] selinux_d_instantiate+0x1c/0x20
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8133517b>] security_d_instantiate+0x1b/0x30
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81237d70>] d_instantiate+0x50/0x70
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811bcb70>] __shmem_file_setup+0xe0/0x1d0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811bf988>] shmem_zero_setup+0x28/0x70
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811d8653>] mmap_region+0x543/0x5a0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811d89b1>] do_mmap_pgoff+0x301/0x3c0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811c18f0>] vm_mmap_pgoff+0x90/0xc0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811d6f26>] SyS_mmap_pgoff+0x116/0x270
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8101f9b2>] SyS_mmap+0x22/0x30
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] -> #0 (&mm->mmap_sem){++++++}:
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810f351c>] __lock_acquire+0x18ec/0x1aa0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811cc8fc>] might_fault+0x8c/0xb0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff812341c1>] filldir+0x91/0x120
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa053f2f7>] xfs_dir2_sf_getdents+0x317/0x380 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa054001b>] xfs_readdir+0x16b/0x230 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa05427fb>] xfs_file_readdir+0x2b/0x40 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81234008>] iterate_dir+0xa8/0xe0
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff812344b3>] SyS_getdents+0x93/0x120
Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] other info that might help us debug this:
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] Chain exists of:
Feb 19 12:22:03 localhost kernel: [  101.487018]   &mm->mmap_sem --> &isec->lock --> &xfs_dir_ilock_class
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018]  Possible unsafe locking scenario:
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018]        CPU0                    CPU1
Feb 19 12:22:03 localhost kernel: [  101.487018]        ----                    ----
Feb 19 12:22:03 localhost kernel: [  101.487018]   lock(&xfs_dir_ilock_class);
Feb 19 12:22:03 localhost kernel: [  101.487018]                                lock(&isec->lock);
Feb 19 12:22:03 localhost kernel: [  101.487018]                                lock(&xfs_dir_ilock_class);
Feb 19 12:22:03 localhost kernel: [  101.487018]   lock(&mm->mmap_sem);
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018]  *** DEADLOCK ***
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] 2 locks held by rm/4171:
Feb 19 12:22:03 localhost kernel: [  101.487018]  #0:  (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff81233fc2>] iterate_dir+0x62/0xe0
Feb 19 12:22:03 localhost kernel: [  101.487018]  #1:  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018] 
Feb 19 12:22:03 localhost kernel: [  101.487018] stack backtrace:
Feb 19 12:22:03 localhost kernel: [  101.487018] CPU: 1 PID: 4171 Comm: rm Tainted: GF       W  O 3.14.0-rc1+ #6
Feb 19 12:22:03 localhost kernel: [  101.487018] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Feb 19 12:22:03 localhost kernel: [  101.487018]  ffffffff82597d80 ffff8800c43cdc60 ffffffff8177ba90 ffffffff825cd9c0
Feb 19 12:22:03 localhost kernel: [  101.487018]  ffff8800c43cdca0 ffffffff81777168 ffff8800c43cdcf0 ffff8800d44ba630
Feb 19 12:22:03 localhost kernel: [  101.487018]  ffff8800d44b9aa0 0000000000000002 0000000000000002 ffff8800d44ba630
Feb 19 12:22:03 localhost kernel: [  101.487018] Call Trace:
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff8177ba90>] dump_stack+0x4d/0x66
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff81777168>] print_circular_bug+0x201/0x20f
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff810f351c>] __lock_acquire+0x18ec/0x1aa0
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff811cc8cf>] ? might_fault+0x5f/0xb0
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff811cc8fc>] might_fault+0x8c/0xb0
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff811cc8cf>] ? might_fault+0x5f/0xb0
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff812341c1>] filldir+0x91/0x120
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffffa053f2f7>] xfs_dir2_sf_getdents+0x317/0x380 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffffa05a0022>] ? xfs_ilock+0x122/0x250 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffffa054001b>] xfs_readdir+0x16b/0x230 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffffa05427fb>] xfs_file_readdir+0x2b/0x40 [xfs]
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff81234008>] iterate_dir+0xa8/0xe0
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff812344b3>] SyS_getdents+0x93/0x120
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff81234130>] ? fillonedir+0xf0/0xf0
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff8114a2cc>] ? __audit_syscall_entry+0x9c/0xf0
Feb 19 12:22:03 localhost kernel: [  101.487018]  [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b


[-- Attachment #3: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint
  2014-02-19  4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner
@ 2014-02-19 18:25   ` Brian Foster
  2014-02-20 14:56   ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Brian Foster @ 2014-02-19 18:25 UTC (permalink / raw)
  To: Dave Chinner, xfs

On 02/18/2014 11:16 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The struct xfs_da_args used to pass directory/attribute operation
> information to the lower layers is 128 bytes in size and is
> allocated on the stack. Dynamically allocate them to reduce the
> stack footprint of directory operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_dir2.c | 344 +++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 214 insertions(+), 130 deletions(-)
> 
> diff --git a/fs/xfs/xfs_dir2.c b/fs/xfs/xfs_dir2.c
> index ce16ef0..fc9a41f 100644
> --- a/fs/xfs/xfs_dir2.c
> +++ b/fs/xfs/xfs_dir2.c
> @@ -180,16 +180,23 @@ xfs_dir_init(
>  	xfs_inode_t	*dp,
>  	xfs_inode_t	*pdp)
>  {
> -	xfs_da_args_t	args;
> +	struct xfs_da_args *args;
>  	int		error;
>  
> -	memset((char *)&args, 0, sizeof(args));
> -	args.dp = dp;
> -	args.trans = tp;
>  	ASSERT(S_ISDIR(dp->i_d.di_mode));
> -	if ((error = xfs_dir_ino_validate(tp->t_mountp, pdp->i_ino)))
> +	error = xfs_dir_ino_validate(tp->t_mountp, pdp->i_ino);
> +	if (error)
>  		return error;
> -	return xfs_dir2_sf_create(&args, pdp->i_ino);
> +
> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> +	if (!args)
> +		return ENOMEM;
> +
> +	args->dp = dp;
> +	args->trans = tp;
> +	error = xfs_dir2_sf_create(args, pdp->i_ino);
> +	kmem_free(args);
> +	return error;
>  }
>  
>  /*
> @@ -205,41 +212,56 @@ xfs_dir_createname(
>  	xfs_bmap_free_t		*flist,		/* bmap's freeblock list */
>  	xfs_extlen_t		total)		/* bmap's total block count */
>  {
> -	xfs_da_args_t		args;
> +	struct xfs_da_args	*args;
>  	int			rval;
>  	int			v;		/* type-checking value */
>  
>  	ASSERT(S_ISDIR(dp->i_d.di_mode));
> -	if ((rval = xfs_dir_ino_validate(tp->t_mountp, inum)))
> +	rval = xfs_dir_ino_validate(tp->t_mountp, inum);
> +	if (rval)
>  		return rval;
>  	XFS_STATS_INC(xs_dir_create);
>  
> -	memset(&args, 0, sizeof(xfs_da_args_t));
> -	args.name = name->name;
> -	args.namelen = name->len;
> -	args.filetype = name->type;
> -	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
> -	args.inumber = inum;
> -	args.dp = dp;
> -	args.firstblock = first;
> -	args.flist = flist;
> -	args.total = total;
> -	args.whichfork = XFS_DATA_FORK;
> -	args.trans = tp;
> -	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
> -
> -	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
> -		rval = xfs_dir2_sf_addname(&args);
> -	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_block_addname(&args);
> -	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_leaf_addname(&args);
> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> +	if (!args)
> +		return ENOMEM;
> +
> +	args->name = name->name;
> +	args->namelen = name->len;
> +	args->filetype = name->type;
> +	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
> +	args->inumber = inum;
> +	args->dp = dp;
> +	args->firstblock = first;
> +	args->flist = flist;
> +	args->total = total;
> +	args->whichfork = XFS_DATA_FORK;
> +	args->trans = tp;
> +	args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
> +
> +	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
> +		rval = xfs_dir2_sf_addname(args);
> +		goto out_free;
> +	}
> +
> +	rval = xfs_dir2_isblock(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v) {
> +		rval = xfs_dir2_block_addname(args);
> +		goto out_free;
> +	}
> +
> +	rval = xfs_dir2_isleaf(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v)
> +		rval = xfs_dir2_leaf_addname(args);
>  	else
> -		rval = xfs_dir2_node_addname(&args);
> +		rval = xfs_dir2_node_addname(args);
> +
> +out_free:
> +	kmem_free(args);
>  	return rval;
>  }
>  
> @@ -282,46 +304,68 @@ xfs_dir_lookup(
>  	xfs_ino_t	*inum,		/* out: inode number */
>  	struct xfs_name *ci_name)	/* out: actual name if CI match */
>  {
> -	xfs_da_args_t	args;
> +	struct xfs_da_args *args;
>  	int		rval;
>  	int		v;		/* type-checking value */
>  
>  	ASSERT(S_ISDIR(dp->i_d.di_mode));
>  	XFS_STATS_INC(xs_dir_lookup);
>  
> -	memset(&args, 0, sizeof(xfs_da_args_t));
> -	args.name = name->name;
> -	args.namelen = name->len;
> -	args.filetype = name->type;
> -	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
> -	args.dp = dp;
> -	args.whichfork = XFS_DATA_FORK;
> -	args.trans = tp;
> -	args.op_flags = XFS_DA_OP_OKNOENT;
> +	/*
> +	 * If we don't use KM_NOFS here, lockdep will through false positive

							throw?

Reviewed-by: Brian Foster <bfoster@redhat.com>


> +	 * deadlock warnings when we come through here of the non-transactional
> +	 * lookup path because the allocation can recurse into inode reclaim.
> +	 * Doing this avoids having to add a bunch of lockdep class
> +	 * annotations into the reclaim path for the ilock.
> +	 */
> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> +	if (!args)
> +		return ENOMEM;
> +
> +	args->name = name->name;
> +	args->namelen = name->len;
> +	args->filetype = name->type;
> +	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
> +	args->dp = dp;
> +	args->whichfork = XFS_DATA_FORK;
> +	args->trans = tp;
> +	args->op_flags = XFS_DA_OP_OKNOENT;
>  	if (ci_name)
> -		args.op_flags |= XFS_DA_OP_CILOOKUP;
> +		args->op_flags |= XFS_DA_OP_CILOOKUP;
>  
> -	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
> -		rval = xfs_dir2_sf_lookup(&args);
> -	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_block_lookup(&args);
> -	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_leaf_lookup(&args);
> +	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
> +		rval = xfs_dir2_sf_lookup(args);
> +		goto out_check_rval;
> +	}
> +
> +	rval = xfs_dir2_isblock(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v) {
> +		rval = xfs_dir2_block_lookup(args);
> +		goto out_check_rval;
> +	}
> +
> +	rval = xfs_dir2_isleaf(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v)
> +		rval = xfs_dir2_leaf_lookup(args);
>  	else
> -		rval = xfs_dir2_node_lookup(&args);
> +		rval = xfs_dir2_node_lookup(args);
> +
> +out_check_rval:
>  	if (rval == EEXIST)
>  		rval = 0;
>  	if (!rval) {
> -		*inum = args.inumber;
> +		*inum = args->inumber;
>  		if (ci_name) {
> -			ci_name->name = args.value;
> -			ci_name->len = args.valuelen;
> +			ci_name->name = args->value;
> +			ci_name->len = args->valuelen;
>  		}
>  	}
> +out_free:
> +	kmem_free(args);
>  	return rval;
>  }
>  
> @@ -338,38 +382,51 @@ xfs_dir_removename(
>  	xfs_bmap_free_t	*flist,		/* bmap's freeblock list */
>  	xfs_extlen_t	total)		/* bmap's total block count */
>  {
> -	xfs_da_args_t	args;
> +	struct xfs_da_args *args;
>  	int		rval;
>  	int		v;		/* type-checking value */
>  
>  	ASSERT(S_ISDIR(dp->i_d.di_mode));
>  	XFS_STATS_INC(xs_dir_remove);
>  
> -	memset(&args, 0, sizeof(xfs_da_args_t));
> -	args.name = name->name;
> -	args.namelen = name->len;
> -	args.filetype = name->type;
> -	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
> -	args.inumber = ino;
> -	args.dp = dp;
> -	args.firstblock = first;
> -	args.flist = flist;
> -	args.total = total;
> -	args.whichfork = XFS_DATA_FORK;
> -	args.trans = tp;
> -
> -	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
> -		rval = xfs_dir2_sf_removename(&args);
> -	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_block_removename(&args);
> -	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_leaf_removename(&args);
> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> +	if (!args)
> +		return ENOMEM;
> +
> +	args->name = name->name;
> +	args->namelen = name->len;
> +	args->filetype = name->type;
> +	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
> +	args->inumber = ino;
> +	args->dp = dp;
> +	args->firstblock = first;
> +	args->flist = flist;
> +	args->total = total;
> +	args->whichfork = XFS_DATA_FORK;
> +	args->trans = tp;
> +
> +	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
> +		rval = xfs_dir2_sf_removename(args);
> +		goto out_free;
> +	}
> +
> +	rval = xfs_dir2_isblock(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v) {
> +		rval = xfs_dir2_block_removename(args);
> +		goto out_free;
> +	}
> +
> +	rval = xfs_dir2_isleaf(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v)
> +		rval = xfs_dir2_leaf_removename(args);
>  	else
> -		rval = xfs_dir2_node_removename(&args);
> +		rval = xfs_dir2_node_removename(args);
> +out_free:
> +	kmem_free(args);
>  	return rval;
>  }
>  
> @@ -386,40 +443,54 @@ xfs_dir_replace(
>  	xfs_bmap_free_t	*flist,		/* bmap's freeblock list */
>  	xfs_extlen_t	total)		/* bmap's total block count */
>  {
> -	xfs_da_args_t	args;
> +	struct xfs_da_args *args;
>  	int		rval;
>  	int		v;		/* type-checking value */
>  
>  	ASSERT(S_ISDIR(dp->i_d.di_mode));
>  
> -	if ((rval = xfs_dir_ino_validate(tp->t_mountp, inum)))
> +	rval = xfs_dir_ino_validate(tp->t_mountp, inum);
> +	if (rval)
>  		return rval;
>  
> -	memset(&args, 0, sizeof(xfs_da_args_t));
> -	args.name = name->name;
> -	args.namelen = name->len;
> -	args.filetype = name->type;
> -	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
> -	args.inumber = inum;
> -	args.dp = dp;
> -	args.firstblock = first;
> -	args.flist = flist;
> -	args.total = total;
> -	args.whichfork = XFS_DATA_FORK;
> -	args.trans = tp;
> -
> -	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
> -		rval = xfs_dir2_sf_replace(&args);
> -	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_block_replace(&args);
> -	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_leaf_replace(&args);
> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> +	if (!args)
> +		return ENOMEM;
> +
> +	args->name = name->name;
> +	args->namelen = name->len;
> +	args->filetype = name->type;
> +	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
> +	args->inumber = inum;
> +	args->dp = dp;
> +	args->firstblock = first;
> +	args->flist = flist;
> +	args->total = total;
> +	args->whichfork = XFS_DATA_FORK;
> +	args->trans = tp;
> +
> +	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
> +		rval = xfs_dir2_sf_replace(args);
> +		goto out_free;
> +	}
> +
> +	rval = xfs_dir2_isblock(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v) {
> +		rval = xfs_dir2_block_replace(args);
> +		goto out_free;
> +	}
> +
> +	rval = xfs_dir2_isleaf(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v)
> +		rval = xfs_dir2_leaf_replace(args);
>  	else
> -		rval = xfs_dir2_node_replace(&args);
> +		rval = xfs_dir2_node_replace(args);
> +out_free:
> +	kmem_free(args);
>  	return rval;
>  }
>  
> @@ -434,7 +505,7 @@ xfs_dir_canenter(
>  	struct xfs_name	*name,		/* name of entry to add */
>  	uint		resblks)
>  {
> -	xfs_da_args_t	args;
> +	struct xfs_da_args *args;
>  	int		rval;
>  	int		v;		/* type-checking value */
>  
> @@ -443,29 +514,42 @@ xfs_dir_canenter(
>  
>  	ASSERT(S_ISDIR(dp->i_d.di_mode));
>  
> -	memset(&args, 0, sizeof(xfs_da_args_t));
> -	args.name = name->name;
> -	args.namelen = name->len;
> -	args.filetype = name->type;
> -	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
> -	args.dp = dp;
> -	args.whichfork = XFS_DATA_FORK;
> -	args.trans = tp;
> -	args.op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME |
> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> +	if (!args)
> +		return ENOMEM;
> +
> +	args->name = name->name;
> +	args->namelen = name->len;
> +	args->filetype = name->type;
> +	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
> +	args->dp = dp;
> +	args->whichfork = XFS_DATA_FORK;
> +	args->trans = tp;
> +	args->op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME |
>  							XFS_DA_OP_OKNOENT;
>  
> -	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
> -		rval = xfs_dir2_sf_addname(&args);
> -	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_block_addname(&args);
> -	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
> -		return rval;
> -	else if (v)
> -		rval = xfs_dir2_leaf_addname(&args);
> +	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
> +		rval = xfs_dir2_sf_addname(args);
> +		goto out_free;
> +	}
> +
> +	rval = xfs_dir2_isblock(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v) {
> +		rval = xfs_dir2_block_addname(args);
> +		goto out_free;
> +	}
> +
> +	rval = xfs_dir2_isleaf(tp, dp, &v);
> +	if (rval)
> +		goto out_free;
> +	if (v)
> +		rval = xfs_dir2_leaf_addname(args);
>  	else
> -		rval = xfs_dir2_node_addname(&args);
> +		rval = xfs_dir2_node_addname(args);
> +out_free:
> +	kmem_free(args);
>  	return rval;
>  }
>  
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* mmap_sem -> isec->lock lockdep issues with shmem (was Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive)
  2014-02-19 18:25   ` Brian Foster
@ 2014-02-20  0:13     ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2014-02-20  0:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-mm, linux-security-module, xfs

[cc linux-mm because it is shmem craziness that is causing the problem]
[cc linux-security-module because it is security contexts that need
 lockdep annotations.]

On Wed, Feb 19, 2014 at 01:25:16PM -0500, Brian Foster wrote:
> On 02/18/2014 11:16 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The change to add the IO lock to protect the directory extent map
> > during readdir operations has caused lockdep to have a heart attack
> > as it now sees a different locking order on inodes w.r.t. the
> > mmap_sem because readdir has a different ordering to write().
> > 
> > Add a new lockdep class for directory inodes to avoid this false
> > positive.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Hey Dave,
> 
> I'm not terribly familiar with lockdep, but I hit the attached "possible
> circular locking dependency detected" warning when running with this patch.
> 
> (Reproduces by running generic/001 after a reboot).

Ok, you're testing on an selinux-enabled system; I didn't.

> Feb 19 12:22:03 localhost kernel: [  101.487018] 
> Feb 19 12:22:03 localhost kernel: [  101.487018] -> #2 (&xfs_dir_ilock_class){++++..}:
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810ed147>] down_read_nested+0x57/0xa0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs]
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa05a01af>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa0565d50>] xfs_attr_get+0x90/0xe0 [xfs]
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa055b9d7>] xfs_xattr_get+0x37/0x50 [xfs]
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff812483ef>] generic_getxattr+0x4f/0x70
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8133fd5e>] inode_doinit_with_dentry+0x1ae/0x650
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff813402d8>] sb_finish_set_opts+0xd8/0x270
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81340702>] selinux_set_mnt_opts+0x292/0x5f0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81340ac8>] superblock_doinit+0x68/0xd0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81340b8d>] selinux_sb_kern_mount+0x3d/0xa0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81335536>] security_sb_kern_mount+0x16/0x20
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8122333a>] mount_fs+0x8a/0x1b0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8124285b>] vfs_kern_mount+0x6b/0x150
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8124561e>] do_mount+0x23e/0xb90
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff812462a3>] SyS_mount+0x83/0xc0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b

So, we take the ilock on the directory xattr read path during
security attribute initialisation, so we have an inode->i_isec->lock -> ilock
path, which is normal.

> Feb 19 12:22:03 localhost kernel: [  101.487018] -> #1 (&isec->lock){+.+.+.}:
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81780d77>] mutex_lock_nested+0x77/0x3f0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8133fc42>] inode_doinit_with_dentry+0x92/0x650
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81340dcc>] selinux_d_instantiate+0x1c/0x20
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8133517b>] security_d_instantiate+0x1b/0x30
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81237d70>] d_instantiate+0x50/0x70
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811bcb70>] __shmem_file_setup+0xe0/0x1d0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811bf988>] shmem_zero_setup+0x28/0x70
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811d8653>] mmap_region+0x543/0x5a0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811d89b1>] do_mmap_pgoff+0x301/0x3c0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811c18f0>] vm_mmap_pgoff+0x90/0xc0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811d6f26>] SyS_mmap_pgoff+0x116/0x270
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8101f9b2>] SyS_mmap+0x22/0x30
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b

What the hell?  We instantiate a shmem filesystem *inode* under the
mmap_sem? 

And so we have a mmap_sem -> inode->i_isec->lock path on a *shmem* inode.


> Feb 19 12:22:03 localhost kernel: [  101.487018] 
> Feb 19 12:22:03 localhost kernel: [  101.487018] -> #0 (&mm->mmap_sem){++++++}:
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810f351c>] __lock_acquire+0x18ec/0x1aa0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff811cc8fc>] might_fault+0x8c/0xb0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff812341c1>] filldir+0x91/0x120
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa053f2f7>] xfs_dir2_sf_getdents+0x317/0x380 [xfs]
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa054001b>] xfs_readdir+0x16b/0x230 [xfs]
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffffa05427fb>] xfs_file_readdir+0x2b/0x40 [xfs]
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff81234008>] iterate_dir+0xa8/0xe0
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff812344b3>] SyS_getdents+0x93/0x120
> Feb 19 12:22:03 localhost kernel: [  101.487018]        [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b

And then we have the mmap_sem in readdir, which gives an
inode->ilock -> mmap_sem ordering.


> Feb 19 12:22:03 localhost kernel: [  101.487018] 
> Feb 19 12:22:03 localhost kernel: [  101.487018] other info that might help us debug this:
> Feb 19 12:22:03 localhost kernel: [  101.487018] 
> Feb 19 12:22:03 localhost kernel: [  101.487018] Chain exists of:
> Feb 19 12:22:03 localhost kernel: [  101.487018]   &mm->mmap_sem --> &isec->lock --> &xfs_dir_ilock_class
> Feb 19 12:22:03 localhost kernel: [  101.487018] 
> Feb 19 12:22:03 localhost kernel: [  101.487018]  Possible unsafe locking scenario:
> Feb 19 12:22:03 localhost kernel: [  101.487018] 
> Feb 19 12:22:03 localhost kernel: [  101.487018]        CPU0                    CPU1
> Feb 19 12:22:03 localhost kernel: [  101.487018]        ----                    ----
> Feb 19 12:22:03 localhost kernel: [  101.487018]   lock(&xfs_dir_ilock_class);
> Feb 19 12:22:03 localhost kernel: [  101.487018]                                lock(&isec->lock);
> Feb 19 12:22:03 localhost kernel: [  101.487018]                                lock(&xfs_dir_ilock_class);
> Feb 19 12:22:03 localhost kernel: [  101.487018]   lock(&mm->mmap_sem);

So that's just another goddamn false positive.

The problem here is that it's many, many layers away from XFS, and
really doesn't involve XFS at all. It's caused by shmem
instantiating an inode under the mmap_sem...

Basically, the only way I can see that this is even remotely
preventable is that inode->isec->ilock needs a per-sb lockdep
context so that lockdep doesn't confuse the lock hierarchies of
completely unrelated filesystems when someone does something crazy
like the page fault path is currently doing.
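What lockdep tracks here can be modelled as cycle detection in a directed
graph of "class A was held while class B was acquired" edges; a small
self-contained sketch of that idea (class names illustrative, nothing below
is the kernel's real lockdep code):

```c
#include <assert.h>
#include <stdbool.h>

#define NCLASSES 8

/* held_while[a][b]: class a has been observed held while acquiring b. */
static bool held_while[NCLASSES][NCLASSES];

static void record_order(int held, int acquired)
{
	held_while[held][acquired] = true;
}

/* Depth-first search: can we walk recorded edges from 'from' to 'to'? */
static bool reaches(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NCLASSES; next++) {
		if (held_while[from][next] && !seen[next] &&
		    reaches(next, to, seen))
			return true;
	}
	return false;
}

/*
 * Acquiring 'acquired' while holding 'held' adds a held->acquired edge;
 * if 'acquired' can already reach 'held', that edge closes a cycle,
 * which is the shape of the three-lock report quoted above.
 */
static bool would_deadlock(int held, int acquired)
{
	bool seen[NCLASSES] = { false };

	return reaches(acquired, held, seen);
}
```

Because the graph is global across all filesystems, an edge recorded on the
shmem inode gets glued to edges recorded on XFS inodes, which is exactly
how the false positive arises.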

Fmeh:

struct super_block {
....
#ifdef CONFIG_SECURITY
        void                    *s_security;
#endif

So I can't even isolate it to the security subsystem pointer in
the superblock because there isn't a generic structure to abstract
security specific stuff from the superblock without having to
implement the same lockdep annotations in every security module that
uses xattrs to store security information.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-19 18:24   ` Brian Foster
@ 2014-02-20  0:23     ` Dave Chinner
  2014-02-20 14:51       ` Mark Tinguely
  2014-02-21 15:04       ` Brian Foster
  0 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2014-02-20  0:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
> On 02/18/2014 11:16 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Log forces can occur deep in the call chain when we have relatively
> > little stack free. Log forces can also happen close to the call
> > chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
> > places where we really don't want to add more stack overhead.
> > 
> > This stack overhead occurs because log forces do foreground CIL
> > pushes (xlog_cil_push_foreground()) rather than waking the
> > background push wq and waiting for the push to complete.
> > This foreground push was done to avoid confusing the CFQ IO
> > scheduler when fsync()s were issued, as it has trouble dealing with
> > dependent IOs being issued from different process contexts.
> > 
> > Avoiding blowing the stack is much more critical than performance
> > optimisations for CFQ, especially as we've been recommending against
> > the use of CFQ for XFS since 3.2 kernels were released because of
> > its problems with multi-threaded IO workloads.
> > 
> > Hence convert xlog_cil_push_foreground() to move the push work
> > to the CIL workqueue. We already do the waiting for the push to
> > complete in xlog_cil_force_lsn(), so there's nothing else we need to
> > modify to make this work.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 39 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index b57a8e0..7e54553 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -499,13 +499,6 @@ xlog_cil_push(
> >  	cil->xc_ctx = new_ctx;
> >  
> >  	/*
> > -	 * mirror the new sequence into the cil structure so that we can do
> > -	 * unlocked checks against the current sequence in log forces without
> > -	 * risking dereferencing a freed context pointer.
> > -	 */
> > -	cil->xc_current_sequence = new_ctx->sequence;
> > -
> > -	/*
> >  	 * The switch is now done, so we can drop the context lock and move out
> >  	 * of a shared context. We can't just go straight to the commit record,
> >  	 * though - we need to synchronise with previous and future commits so
> > @@ -523,8 +516,15 @@ xlog_cil_push(
> >  	 * Hence we need to add this context to the committing context list so
> >  	 * that higher sequences will wait for us to write out a commit record
> >  	 * before they do.
> > +	 *
> > +	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
> > +	 * structure atomically with the addition of this sequence to the
> > +	 * committing list. This also ensures that we can do unlocked checks
> > +	 * against the current sequence in log forces without risking
> > +	 * dereferencing a freed context pointer.
> >  	 */
> >  	spin_lock(&cil->xc_push_lock);
> > +	cil->xc_current_sequence = new_ctx->sequence;
> >  	list_add(&ctx->committing, &cil->xc_committing);
> >  	spin_unlock(&cil->xc_push_lock);
> >  	up_write(&cil->xc_ctx_lock);
> > @@ -662,8 +662,14 @@ xlog_cil_push_background(
> >  
> >  }
> >  
> > +/*
> > + * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
> > + * number that is passed. When it returns, the work will be queued for
> > + * @push_seq, but it won't be completed. The caller is expected to do any
> > + * waiting for push_seq to complete if it is required.
> > + */
> >  static void
> > -xlog_cil_push_foreground(
> > +xlog_cil_push_now(
> >  	struct xlog	*log,
> >  	xfs_lsn_t	push_seq)
> >  {
> > @@ -688,10 +694,8 @@ xlog_cil_push_foreground(
> >  	}
> >  
> >  	cil->xc_push_seq = push_seq;
> > +	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> >  	spin_unlock(&cil->xc_push_lock);
> > -
> > -	/* do the push now */
> > -	xlog_cil_push(log);
> >  }
> >  
> >  bool
> > @@ -795,7 +799,8 @@ xlog_cil_force_lsn(
> >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> >  	 * so no need to deal with it here.
> >  	 */
> > -	xlog_cil_push_foreground(log, sequence);
> > +restart:
> > +	xlog_cil_push_now(log, sequence);
> >  
> >  	/*
> >  	 * See if we can find a previous sequence still committing.
> > @@ -803,7 +808,6 @@ xlog_cil_force_lsn(
> >  	 * before allowing the force of push_seq to go ahead. Hence block
> >  	 * on commits for those as well.
> >  	 */
> > -restart:
> >  	spin_lock(&cil->xc_push_lock);
> >  	list_for_each_entry(ctx, &cil->xc_committing, committing) {
> >  		if (ctx->sequence > sequence)
> > @@ -821,6 +825,28 @@ restart:
> >  		/* found it! */
> >  		commit_lsn = ctx->commit_lsn;
> >  	}
> > +
> > +	/*
> > +	 * The call to xlog_cil_push_now() executes the push in the background.
> > +	 * Hence by the time we have got here our sequence may not have been
> > +	 * pushed yet. This is true if the current sequence still matches the
> > +	 * push sequence after the above wait loop and the CIL still contains
> > +	 * dirty objects.
> > +	 *
> > +	 * When the push occurs, it will empty the CIL and
> > +	 * atomically increment the current sequence past the push sequence and
> > +	 * move it into the committing list. Of course, if the CIL is clean at
> > +	 * the time of the push, it won't have pushed the CIL at all, so in that
> > +	 * case we should try the push for this sequence again from the start
> > +	 * just in case.
> > +	 */
> > +
> > +	if (sequence == cil->xc_current_sequence &&
> > +	    !list_empty(&cil->xc_cil)) {
> > +		spin_unlock(&cil->xc_push_lock);
> > +		goto restart;
> > +	}
> > +
> 
> IIUC, the objective here is to make sure we don't leave this code path
> before the push even starts and the ctx makes it onto the committing
> list, due to xlog_cil_push_now() moving things to a workqueue.

Right.

> Given that, what's the purpose of re-executing the background push as
> opposed to restarting the wait sequence (as done previously)? It looks
> like push_now() won't queue the work again due to cil->xc_push_seq, but
> it will flush the queue and I suppose make it more likely the push
> starts. Is that the intent?

Effectively. But the other thing that it is protecting against is
that foreground push is done without holding the cil->xc_ctx_lock,
and so we can get the situation where we try a foreground push
of the current sequence, see that the CIL is empty and return
without pushing, wait for previous sequences to commit, then find
that the CIL has items on the CIL in the sequence we are supposed to
be committing.

In this case, we don't know if this occurred because the workqueue
has not started working on our push, or whether we raced on an empty
CIL, and hence we need to make sure that everything in the sequence
we are supposed to commit is pushed to the log.

Hence if the current sequence is dirty after we've ensured that all
prior sequences are fully checkpointed, we need to go back and
push the CIL again to ensure that when we return to the caller the
CIL is checkpointed up to the point in time of the log force
occurring.
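The retry logic described above can be reduced to a single-threaded model:
queue the push, wait, and if the sequence we care about is still current
and dirty, start over. A sketch under the assumption that the queued work
may lag behind the caller (all names invented, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Single-threaded model of the race: queueing the push only schedules
 * work, so the force path may observe the CIL still dirty at its own
 * sequence and must loop back.  Illustrative only.
 */
struct cil_model {
	int current_sequence;	/* sequence still accepting new items */
	bool cil_dirty;		/* items logged against the current ctx */
	int delay;		/* times the queued work fails to run yet */
};

/* The background worker: empty the CIL and bump the sequence. */
static void run_queued_push(struct cil_model *cil)
{
	if (cil->delay > 0) {	/* work queued but not yet scheduled */
		cil->delay--;
		return;
	}
	if (cil->cil_dirty) {
		cil->cil_dirty = false;
		cil->current_sequence++;
	}
}

/*
 * xlog_cil_force_lsn()-style loop: after waiting, if our sequence is
 * still the current one and the CIL still holds dirty items, the push
 * has not happened yet, so restart.  Returns how many restarts it took.
 */
static int force_sequence(struct cil_model *cil, int sequence)
{
	int restarts = 0;

	for (;;) {
		run_queued_push(cil);	/* push_now + wait, collapsed */
		if (!(sequence == cil->current_sequence && cil->cil_dirty))
			return restarts;
		restarts++;		/* the goto restart in the patch */
	}
}
```

The termination condition is the same unlocked check as in the patch: once
the worker has run, either the current sequence has moved past ours or the
CIL is clean, and the loop exits.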

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive
  2014-02-19  4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner
  2014-02-19 18:25   ` Brian Foster
@ 2014-02-20 14:51   ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2014-02-20 14:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-20  0:23     ` Dave Chinner
@ 2014-02-20 14:51       ` Mark Tinguely
  2014-02-20 22:07         ` Dave Chinner
  2014-02-21 15:04       ` Brian Foster
  1 sibling, 1 reply; 19+ messages in thread
From: Mark Tinguely @ 2014-02-20 14:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, xfs

On 02/19/14 18:23, Dave Chinner wrote:
> On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
>> On 02/18/2014 11:16 PM, Dave Chinner wrote:
>>> From: Dave Chinner<dchinner@redhat.com>
>>>
>>> Log forces can occur deep in the call chain when we have relatively
>>> little stack free. Log forces can also happen close to the call
>>> chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
>>> places where we really don't want to add more stack overhead.
>>>
>>> This stack overhead occurs because log forces do foreground CIL
>>> pushes (xlog_cil_push_foreground()) rather than waking the
>>> background push wq and waiting for the push to complete.
>>> This foreground push was done to avoid confusing the CFQ IO
>>> scheduler when fsync()s were issued, as it has trouble dealing with
>>> dependent IOs being issued from different process contexts.
>>>
>>> Avoiding blowing the stack is much more critical than performance
>>> optimisations for CFQ, especially as we've been recommending against
>>> the use of CFQ for XFS since 3.2 kernels were released because of
>>> its problems with multi-threaded IO workloads.
>>>
>>> Hence convert xlog_cil_push_foreground() to move the push work
>>> to the CIL workqueue. We already do the waiting for the push to
>>> complete in xlog_cil_force_lsn(), so there's nothing else we need to
>>> modify to make this work.
>>>
>>> Signed-off-by: Dave Chinner<dchinner@redhat.com>
>>> ---
>>>   fs/xfs/xfs_log_cil.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
>>>   1 file changed, 39 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
>>> index b57a8e0..7e54553 100644
>>> --- a/fs/xfs/xfs_log_cil.c
>>> +++ b/fs/xfs/xfs_log_cil.c
>>> @@ -499,13 +499,6 @@ xlog_cil_push(
>>>   	cil->xc_ctx = new_ctx;
>>>
>>>   	/*
>>> -	 * mirror the new sequence into the cil structure so that we can do
>>> -	 * unlocked checks against the current sequence in log forces without
>>> -	 * risking dereferencing a freed context pointer.
>>> -	 */
>>> -	cil->xc_current_sequence = new_ctx->sequence;
>>> -
>>> -	/*
>>>   	 * The switch is now done, so we can drop the context lock and move out
>>>   	 * of a shared context. We can't just go straight to the commit record,
>>>   	 * though - we need to synchronise with previous and future commits so
>>> @@ -523,8 +516,15 @@ xlog_cil_push(
>>>   	 * Hence we need to add this context to the committing context list so
>>>   	 * that higher sequences will wait for us to write out a commit record
>>>   	 * before they do.
>>> +	 *
>>> +	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
>>> +	 * structure atomically with the addition of this sequence to the
>>> +	 * committing list. This also ensures that we can do unlocked checks
>>> +	 * against the current sequence in log forces without risking
>>> +	 * dereferencing a freed context pointer.
>>>   	 */
>>>   	spin_lock(&cil->xc_push_lock);
>>> +	cil->xc_current_sequence = new_ctx->sequence;
>>>   	list_add(&ctx->committing,&cil->xc_committing);
>>>   	spin_unlock(&cil->xc_push_lock);
>>>   	up_write(&cil->xc_ctx_lock);
>>> @@ -662,8 +662,14 @@ xlog_cil_push_background(
>>>
>>>   }
>>>
>>> +/*
>>> + * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
>>> + * number that is passed. When it returns, the work will be queued for
>>> + * @push_seq, but it won't be completed. The caller is expected to do any
>>> + * waiting for push_seq to complete if it is required.
>>> + */
>>>   static void
>>> -xlog_cil_push_foreground(
>>> +xlog_cil_push_now(
>>>   	struct xlog	*log,
>>>   	xfs_lsn_t	push_seq)
>>>   {
>>> @@ -688,10 +694,8 @@ xlog_cil_push_foreground(
>>>   	}
>>>
>>>   	cil->xc_push_seq = push_seq;
>>> +	queue_work(log->l_mp->m_cil_workqueue,&cil->xc_push_work);
>>>   	spin_unlock(&cil->xc_push_lock);
>>> -
>>> -	/* do the push now */
>>> -	xlog_cil_push(log);
>>>   }
>>>
>>>   bool
>>> @@ -795,7 +799,8 @@ xlog_cil_force_lsn(
>>>   	 * xlog_cil_push() handles racing pushes for the same sequence,
>>>   	 * so no need to deal with it here.
>>>   	 */
>>> -	xlog_cil_push_foreground(log, sequence);
>>> +restart:
>>> +	xlog_cil_push_now(log, sequence);
>>>
>>>   	/*
>>>   	 * See if we can find a previous sequence still committing.
>>> @@ -803,7 +808,6 @@ xlog_cil_force_lsn(
>>>   	 * before allowing the force of push_seq to go ahead. Hence block
>>>   	 * on commits for those as well.
>>>   	 */
>>> -restart:
>>>   	spin_lock(&cil->xc_push_lock);
>>>   	list_for_each_entry(ctx,&cil->xc_committing, committing) {
>>>   		if (ctx->sequence>  sequence)
>>> @@ -821,6 +825,28 @@ restart:
>>>   		/* found it! */
>>>   		commit_lsn = ctx->commit_lsn;
>>>   	}
>>> +
>>> +	/*
>>> +	 * The call to xlog_cil_push_now() executes the push in the background.
>>> +	 * Hence by the time we have got here our sequence may not have been
>>> +	 * pushed yet. This is true if the current sequence still matches the
>>> +	 * push sequence after the above wait loop and the CIL still contains
>>> +	 * dirty objects.
>>> +	 *
>>> +	 * When the push occurs, it will empty the CIL and
>>> +	 * atomically increment the current sequence past the push sequence and
>>> +	 * move it into the committing list. Of course, if the CIL is clean at
>>> +	 * the time of the push, it won't have pushed the CIL at all, so in that
>>> +	 * case we should try the push for this sequence again from the start
>>> +	 * just in case.
>>> +	 */
>>> +
>>> +	if (sequence == cil->xc_current_sequence&&
>>> +	    !list_empty(&cil->xc_cil)) {
>>> +		spin_unlock(&cil->xc_push_lock);
>>> +		goto restart;
>>> +	}
>>> +
>>
>> IIUC, the objective here is to make sure we don't leave this code path
>> before the push even starts and the ctx makes it onto the committing
>> list, due to xlog_cil_push_now() moving things to a workqueue.
>
> Right.
>
>> Given that, what's the purpose of re-executing the background push as
>> opposed to restarting the wait sequence (as done previously)? It looks
>> like push_now() won't queue the work again due to cil->xc_push_seq, but
>> it will flush the queue and I suppose make it more likely the push
>> starts. Is that the intent?
>
> Effectively. But the other thing that it is protecting against is
> that foreground push is done without holding the cil->xc_ctx_lock,
> and so we can get the situation where we try a foreground push
> of the current sequence, see that the CIL is empty and return
> without pushing, wait for previous sequences to commit, then find
> that the CIL has items in the sequence we are supposed to
> be committing.
>
> In this case, we don't know if this occurred because the workqueue
> has not started working on our push, or whether we raced on an empty
> CIL, and hence we need to make sure that everything in the sequence
> we are supposed to commit is pushed to the log.
>
> Hence if the current sequence is dirty after we've ensured that all
> prior sequences are fully checkpointed, we need to go back and
> push the CIL again to ensure that when we return to the caller the
> CIL is checkpointed up to the point in time of the log force
> occurring.

The desired push sequence was taken from an item on the CIL (either when 
added or from a pinned item). How could the CIL now be empty other than
that someone else has pushed to at least the desired sequence?

A flush_work() should be enough in the case where the ctx of the desired 
sequence is not on the xc_committing list. The flush_work will wait for 
the worker to start and place the ctx of the desired sequence into the 
xc_committing list. This prevents a tight loop waiting for the cil 
push worker to start.

Starting the cil push worker for every wakeup of a smaller sequence in the 
list_for_each_entry loop seems wasteful.

We know the later error paths in xfs_cil_push() will not do a wake, so now 
is a good time to fix that.

--Mark.


* Re: [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint
  2014-02-19  4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner
  2014-02-19 18:25   ` Brian Foster
@ 2014-02-20 14:56   ` Christoph Hellwig
  2014-02-20 21:09     ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2014-02-20 14:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Feb 19, 2014 at 03:16:42PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The struct xfs_da_args used to pass directory/attribute operation
> information to the lower layers is 128 bytes in size and is
> allocated on the stack. Dynamically allocate them to reduce the
> stack footprint of directory operations.

Are we having stack space problems in the directory code as well,
without all the VM code above it?  I'm definitely a bit scared about
adding another memory allocation to every single directory operation.

> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> +	if (!args)
> +		return ENOMEM;

KM_SLEEP is the default when KM_NOFS is set.  Also KM_SLEEP will never
return a NULL pointer, either remove the handling or make it an
KM_MAYFAIL allocation.

> +	/*
> > +	 * If we don't use KM_NOFS here, lockdep will throw false positive
> > +	 * deadlock warnings when we come through here via the non-transactional
> +	 * lookup path because the allocation can recurse into inode reclaim.
> +	 * Doing this avoids having to add a bunch of lockdep class
> > +	 * annotations into the reclaim path for the ilock.
> +	 */
> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);

I don't understand that comment.  We do use KM_NOFS here unlike what the
comment claims, and the comment seems to explain why we actually need
KM_NOFS as far as I can tell.


* Re: [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint
  2014-02-20 14:56   ` Christoph Hellwig
@ 2014-02-20 21:09     ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2014-02-20 21:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Feb 20, 2014 at 06:56:01AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 19, 2014 at 03:16:42PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The struct xfs_da_args used to pass directory/attribute operation
> > information to the lower layers is 128 bytes in size and is
> > allocated on the stack. Dynamically allocate them to reduce the
> > stack footprint of directory operations.
> 
> Are we having stack space problems in the directory code as well,
> without all the VM code above it?  I'm defintively a bit scared about
> adding another memory allocation to every single directory operation.

Oh, yeah. We've been having problems for some time now. When I
introduced the stack switch in the allocation path, I also did it
for metadata allocations because we'd seen a couple of reports
of stack overflows through the directory code. That got reverted
because of the performance issues it caused, so the directory code
is just another ticking time bomb.

See the thread here:

http://oss.sgi.com/pipermail/xfs/2014-February/034018.html

and, specifically:

http://oss.sgi.com/pipermail/xfs/2014-February/034156.html

Of further interest:

http://oss.sgi.com/pipermail/xfs/2014-February/034051.html
http://oss.sgi.com/pipermail/xfs/2014-February/034057.html
http://oss.sgi.com/pipermail/xfs/2014-February/034053.html



> 
> > +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> > +	if (!args)
> > +		return ENOMEM;
> 
> KM_SLEEP is the default when KM_NOFS is set.  Also KM_SLEEP will never
> return a NULL pointer, either remove the handling or make it an
> KM_MAYFAIL allocation.

It might be, but it's the same pattern we use everywhere else. It's
as much documentation of what the code requires, because we have
KM_MAYFAIL | KM_NOFS in other places. I'll kill the ENOMEM check -
that's just habit....

> 
> > +	/*
> > +	 * If we don't use KM_NOFS here, lockdep will throw false positive
> > +	 * deadlock warnings when we come through here via the non-transactional
> > +	 * lookup path because the allocation can recurse into inode reclaim.
> > +	 * Doing this avoids having to add a bunch of lockdep class
> > +	 * annotations into the reclaim path for the ilock.
> > +	 */
> > +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> 
> I don't understand that comment.  We do use KM_NOFS here unlike what the
> comment claims, and the comment seems to explain why we actually need
> KM_NOFS as far as I can tell.

The comment says "if we *don't* use KM_NOFS", so AFAICT, it's
correct. I can rewrite it to be more obvious.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-20 14:51       ` Mark Tinguely
@ 2014-02-20 22:07         ` Dave Chinner
  2014-02-20 22:35           ` Mark Tinguely
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2014-02-20 22:07 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Brian Foster, xfs

On Thu, Feb 20, 2014 at 08:51:55AM -0600, Mark Tinguely wrote:
> On 02/19/14 18:23, Dave Chinner wrote:
> >On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
> >>On 02/18/2014 11:16 PM, Dave Chinner wrote:
> >>>From: Dave Chinner<dchinner@redhat.com>
> >>>
> >>>Log forces can occur deep in the call chain when we have relatively
> >>>little stack free. Log forces can also happen close to the call
> >>>chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
> >>>places where we really don't want to add more stack overhead.
> >>>
> >>>This stack overhead occurs because log forces do foreground CIL
> >>>pushes (xlog_cil_push_foreground()) rather than waking the
> >>>background push wq and waiting for the push to complete.
> >>>This foreground push was done to avoid confusing the CFQ IO
> >>>scheduler when fsync()s were issued, as it has trouble dealing with
> >>>dependent IOs being issued from different process contexts.
> >>>
> >>>Avoiding blowing the stack is much more critical than performance
> >>>optimisations for CFQ, especially as we've been recommending against
> >>>the use of CFQ for XFS since 3.2 kernels were released because of
> >>>its problems with multi-threaded IO workloads.
> >>>
> >>>Hence convert xlog_cil_push_foreground() to move the push work
> >>>to the CIL workqueue. We already do the waiting for the push to
> >>>complete in xlog_cil_force_lsn(), so there's nothing else we need to
> >>>modify to make this work.
> >>>
> >>>Signed-off-by: Dave Chinner<dchinner@redhat.com>
.....
> >>>@@ -803,7 +808,6 @@ xlog_cil_force_lsn(
> >>>  	 * before allowing the force of push_seq to go ahead. Hence block
> >>>  	 * on commits for those as well.
> >>>  	 */
> >>>-restart:
> >>>  	spin_lock(&cil->xc_push_lock);
> >>>  	list_for_each_entry(ctx,&cil->xc_committing, committing) {
> >>>  		if (ctx->sequence>  sequence)
> >>>@@ -821,6 +825,28 @@ restart:
> >>>  		/* found it! */
> >>>  		commit_lsn = ctx->commit_lsn;
> >>>  	}
> >>>+
> >>>+	/*
> >>>+	 * The call to xlog_cil_push_now() executes the push in the background.
> >>>+	 * Hence by the time we have got here it our sequence may not have been
> >>>+	 * pushed yet. This is true if the current sequence still matches the
> >>>+	 * push sequence after the above wait loop and the CIL still contains
> >>>+	 * dirty objects.
> >>>+	 *
> >>>+	 * When the push occurs, it will empty the CIL and
> >>>+	 * atomically increment the currect sequence past the push sequence and
> >>>+	 * move it into the committing list. Of course, if the CIL is clean at
> >>>+	 * the time of the push, it won't have pushed the CIL at all, so in that
> >>>+	 * case we should try the push for this sequence again from the start
> >>>+	 * just in case.
> >>>+	 */
> >>>+
> >>>+	if (sequence == cil->xc_current_sequence&&
                                             ^^^^^
FYI, your mailer is still mangling whitespace when quoting code....

> >>>+	    !list_empty(&cil->xc_cil)) {
> >>>+		spin_unlock(&cil->xc_push_lock);
> >>>+		goto restart;
> >>>+	}
> >>>+
> >>
> >>IIUC, the objective here is to make sure we don't leave this code path
> >>before the push even starts and the ctx makes it onto the committing
> >>list, due to xlog_cil_push_now() moving things to a workqueue.
> >
> >Right.
> >
> >>Given that, what's the purpose of re-executing the background push as
> >>opposed to restarting the wait sequence (as done previously)? It looks
> >>like push_now() won't queue the work again due to cil->xc_push_seq, but
> >>it will flush the queue and I suppose make it more likely the push
> >>starts. Is that the intent?
> >
> >Effectively. But the other thing that it is protecting against is
> >that foreground push is done without holding the cil->xc_ctx_lock,
> >and so we can get the situation where we try a foreground push
> >of the current sequence, see that the CIL is empty and return
> >without pushing, wait for previous sequences to commit, then find
> >that the CIL has items on the CIL in the sequence we are supposed to
> >be committing.
> >
> >In this case, we don't know if this occurred because the workqueue
> >has not started working on our push, or whether we raced on an empty
> >CIL, and hence we need to make sure that everything in the sequence
> >we are support to commit is pushed to the log.
> >
> >Hence if the current sequence is dirty after we've ensure that all
> >prior sequences are fully checkpointed, need to go back and
> >push the CIL again to ensure that when we return to the caller the
> >CIL is checkpointed up to the point in time of the log force
> >occurring.
> 
> The desired push sequence was taken from an item on the CIL (either
> when added or from a pinned item). How could the CIL now be empty
> other than someone else pushed to at least the desire sequence?

The push sequence is only taken from an object on the CIL through
xfs_log_force_lsn(). For xfs_log_force(), the sequence is taken
directly from the current CIL context:

static inline void
xlog_cil_force(struct xlog *log)
{
        xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
}

And that's how you get an empty CIL when entering
xlog_cil_force_lsn(), and hence how you can get the race condition
that the code is protecting against.

> A flush_work() should be enough in the case where the ctx of the
> desire sequence is not on the xc_committing list. The flush_work
> will wait for the worker to start and place the ctx of the desired
> sequence into the xc_committing list. This preventing a tight loop
> waiting for the cil push worker to start.

Yes, that's exactly what the code does.

> Starting the cil push worker for every wakeup of smaller sequence in
> the list_for_each_entry loop seems wasteful.

As Brian pointed out, it won't restart on every wakeup - the
cil->xc_push_seq checks prevent that from happening, so a specific
sequence will only ever be queued for a push once.

> We know the later error paths in xfs_cil_push() will not do a wake,
> now is a good time to fix that.

I'm not sure what you are talking about here. If there's a problem,
please send patches.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________

xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-20 22:07         ` Dave Chinner
@ 2014-02-20 22:35           ` Mark Tinguely
  2014-02-21  0:02             ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Tinguely @ 2014-02-20 22:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, xfs

On 02/20/14 16:07, Dave Chinner wrote:
> On Thu, Feb 20, 2014 at 08:51:55AM -0600, Mark Tinguely wrote:
>> On 02/19/14 18:23, Dave Chinner wrote:
>>> On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
>>>> On 02/18/2014 11:16 PM, Dave Chinner wrote:
>>>>> From: Dave Chinner<dchinner@redhat.com>
>>>>>
>>>>> Log forces can occur deep in the call chain when we have relatively
>>>>> little stack free. Log forces can also happen at close to the call
>>>>> chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
>>>>> places where we really don't want to add more stack overhead.
>>>>>
>>>>> This stack overhead occurs because log forces do foreground CIL
>>>>> pushes (xlog_cil_push_foreground()) rather than waking the
>>>>> background push wq and waiting for the for the push to complete.
>>>>> This foreground push was done to avoid confusing the CFQ Io
>>>>> scheduler when fsync()s were issued, as it has trouble dealing with
>>>>> dependent IOs being issued from different process contexts.
>>>>>
>>>>> Avoiding blowing the stack is much more critical than performance
>>>>> optimisations for CFQ, especially as we've been recommending against
>>>>> the use of CFQ for XFS since 3.2 kernels were release because of
>>>>> it's problems with multi-threaded IO workloads.
>>>>>
>>>>> Hence convert xlog_cil_push_foreground() to move the push work
>>>>> to the CIL workqueue. We already do the waiting for the push to
>>>>> complete in xlog_cil_force_lsn(), so there's nothing else we need to
>>>>> modify to make this work.
>>>>>
>>>>> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> .....
>>>>> @@ -803,7 +808,6 @@ xlog_cil_force_lsn(
>>>>>   	 * before allowing the force of push_seq to go ahead. Hence block
>>>>>   	 * on commits for those as well.
>>>>>   	 */
>>>>> -restart:
>>>>>   	spin_lock(&cil->xc_push_lock);
>>>>>   	list_for_each_entry(ctx,&cil->xc_committing, committing) {
>>>>>   		if (ctx->sequence>   sequence)
>>>>> @@ -821,6 +825,28 @@ restart:
>>>>>   		/* found it! */
>>>>>   		commit_lsn = ctx->commit_lsn;
>>>>>   	}
>>>>> +
>>>>> +	/*
>>>>> +	 * The call to xlog_cil_push_now() executes the push in the background.
>>>>> +	 * Hence by the time we have got here it our sequence may not have been
>>>>> +	 * pushed yet. This is true if the current sequence still matches the
>>>>> +	 * push sequence after the above wait loop and the CIL still contains
>>>>> +	 * dirty objects.
>>>>> +	 *
>>>>> +	 * When the push occurs, it will empty the CIL and
>>>>> +	 * atomically increment the currect sequence past the push sequence and
>>>>> +	 * move it into the committing list. Of course, if the CIL is clean at
>>>>> +	 * the time of the push, it won't have pushed the CIL at all, so in that
>>>>> +	 * case we should try the push for this sequence again from the start
>>>>> +	 * just in case.
>>>>> +	 */
>>>>> +
>>>>> +	if (sequence == cil->xc_current_sequence&&
>                                               ^^^^^
> FYI, your mailer is still mangling whitespace when quoting code....
>
>>>>> +	    !list_empty(&cil->xc_cil)) {
>>>>> +		spin_unlock(&cil->xc_push_lock);
>>>>> +		goto restart;
>>>>> +	}
>>>>> +
>>>>
>>>> IIUC, the objective here is to make sure we don't leave this code path
>>>> before the push even starts and the ctx makes it onto the committing
>>>> list, due to xlog_cil_push_now() moving things to a workqueue.
>>>
>>> Right.
>>>
>>>> Given that, what's the purpose of re-executing the background push as
>>>> opposed to restarting the wait sequence (as done previously)? It looks
>>>> like push_now() won't queue the work again due to cil->xc_push_seq, but
>>>> it will flush the queue and I suppose make it more likely the push
>>>> starts. Is that the intent?
>>>
>>> Effectively. But the other thing that it is protecting against is
>>> that foreground push is done without holding the cil->xc_ctx_lock,
>>> and so we can get the situation where we try a foreground push
>>> of the current sequence, see that the CIL is empty and return
>>> without pushing, wait for previous sequences to commit, then find
>>> that the CIL has items on the CIL in the sequence we are supposed to
>>> be committing.
>>>
>>> In this case, we don't know if this occurred because the workqueue
>>> has not started working on our push, or whether we raced on an empty
>>> CIL, and hence we need to make sure that everything in the sequence
>>> we are support to commit is pushed to the log.
>>>
>>> Hence if the current sequence is dirty after we've ensure that all
>>> prior sequences are fully checkpointed, need to go back and
>>> push the CIL again to ensure that when we return to the caller the
>>> CIL is checkpointed up to the point in time of the log force
>>> occurring.
>>
>> The desired push sequence was taken from an item on the CIL (either
>> when added or from a pinned item). How could the CIL now be empty
>> other than someone else pushed to at least the desire sequence?
>
> The push sequence is only taken from an object on the CIL through
> xfs_log_force_lsn(). For xfs_log_force(), the sequence is taken
> directly from the current CIL context:
>
> static inline void
> xlog_cil_force(struct xlog *log)
> {
>          xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
> }
>
> And that's how you get an empty CIL when entering
> xlog_cil_force_lsn(), and hence how you can get the race condition
> that the code is protecting against.
>
>> A flush_work() should be enough in the case where the ctx of the
>> desire sequence is not on the xc_committing list. The flush_work
>> will wait for the worker to start and place the ctx of the desired
>> sequence into the xc_committing list. This preventing a tight loop
>> waiting for the cil push worker to start.
>
> Yes, that's exactly what the code does.
>
>> Starting the cil push worker for every wakeup of smaller sequence in
>> the list_for_each_entry loop seems wasteful.
>
> As Brian pointed out, it won't restart on every wakeup - the
> cil->xc_push_seq checks prevent that from happening, so a specific
> sequence will only ever be queued for a push once.
>
>> We know the later error paths in xfs_cil_push() will not do a wake,
>> now is a good time to fix that.
>
> I'm not sure what you are talking about here. If there's a problem,
> please send patches.
>
> Cheers,
>
> Dave.

http://oss.sgi.com/archives/xfs/2013-12/msg00870.html

--Mark.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-20 22:35           ` Mark Tinguely
@ 2014-02-21  0:02             ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2014-02-21  0:02 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Brian Foster, xfs

On Thu, Feb 20, 2014 at 04:35:41PM -0600, Mark Tinguely wrote:
> On 02/20/14 16:07, Dave Chinner wrote:
> >On Thu, Feb 20, 2014 at 08:51:55AM -0600, Mark Tinguely wrote:
> >>We know the later error paths in xfs_cil_push() will not do a wake,
> >>now is a good time to fix that.
> >
> >I'm not sure what you are talking about here. If there's a problem,
> >please send patches.
> 
> http://oss.sgi.com/archives/xfs/2013-12/msg00870.html

Which resulted in a discussion and discovery of more problems
that also needed fixing, so I'm waiting for patches to be posted
that fix them.

You know what the problems are, you know how they should be fixed
(it's in the discussion thread), and you are capable of writing
the fixes and testing them. So, please send patches.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-20  0:23     ` Dave Chinner
  2014-02-20 14:51       ` Mark Tinguely
@ 2014-02-21 15:04       ` Brian Foster
  2014-02-21 22:21         ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Brian Foster @ 2014-02-21 15:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 02/19/2014 07:23 PM, Dave Chinner wrote:
> On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
>> On 02/18/2014 11:16 PM, Dave Chinner wrote:
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
...
>>> +
>>> +	/*
>>> +	 * The call to xlog_cil_push_now() executes the push in the background.
>>> +	 * Hence by the time we have got here it our sequence may not have been
>>> +	 * pushed yet. This is true if the current sequence still matches the
>>> +	 * push sequence after the above wait loop and the CIL still contains
>>> +	 * dirty objects.
>>> +	 *
>>> +	 * When the push occurs, it will empty the CIL and
>>> +	 * atomically increment the currect sequence past the push sequence and
>>> +	 * move it into the committing list. Of course, if the CIL is clean at
>>> +	 * the time of the push, it won't have pushed the CIL at all, so in that
>>> +	 * case we should try the push for this sequence again from the start
>>> +	 * just in case.
>>> +	 */
>>> +
>>> +	if (sequence == cil->xc_current_sequence &&
>>> +	    !list_empty(&cil->xc_cil)) {
>>> +		spin_unlock(&cil->xc_push_lock);
>>> +		goto restart;
>>> +	}
>>> +
>>
>> IIUC, the objective here is to make sure we don't leave this code path
>> before the push even starts and the ctx makes it onto the committing
>> list, due to xlog_cil_push_now() moving things to a workqueue.
> 
> Right.
> 
>> Given that, what's the purpose of re-executing the background push as
>> opposed to restarting the wait sequence (as done previously)? It looks
>> like push_now() won't queue the work again due to cil->xc_push_seq, but
>> it will flush the queue and I suppose make it more likely the push
>> starts. Is that the intent?
> 
> Effectively. But the other thing that it is protecting against is
> that foreground push is done without holding the cil->xc_ctx_lock,
> and so we can get the situation where we try a foreground push
> of the current sequence, see that the CIL is empty and return
> without pushing, wait for previous sequences to commit, then find
> that the CIL has items on the CIL in the sequence we are supposed to
> be committing.
> 
> In this case, we don't know if this occurred because the workqueue
> has not started working on our push, or whether we raced on an empty
> CIL, and hence we need to make sure that everything in the sequence
> we are support to commit is pushed to the log.
> 
> Hence if the current sequence is dirty after we've ensure that all
> prior sequences are fully checkpointed, need to go back and
> push the CIL again to ensure that when we return to the caller the
> CIL is checkpointed up to the point in time of the log force
> occurring.
> 

Ok, I see. The foreground variant checks the push sequence and xc_cil
list state without xc_ctx_lock. The background variant (down in
xlog_cil_push()) will repeat the list check while under xc_ctx_lock.

From what I can see, this potential behavior of a foreground push seeing
an empty cil is possible even before this patch. xlog_cil_force_lsn()
will just wait on all of the previous sequences and return. Another push
can move things along for that sequence, as xc_push_seq has not been
updated.

The workqueue pattern opens up potential for the "wait before handler
runs" race, and the sequence and list check is necessary to detect that.
Doing the push_now() again seems technically unnecessary, but the flush
executes and thus should guarantee we iterate this sequence twice at
most. Seems reasonable.

General follow up question - what makes not taking xc_ctx_lock anywhere
in here safe in the first place? In the current implementation, if the
push has already been queued (note that we flush before we take the
spinlock and check the push sequence) and we get into the ctx wait
sequence, isn't it possible to see xc_committing before the ctx we're
pushing is even added?

With this patch, what prevents us from seeing the updated
xc_current_sequence and thus skipping the restart (xc_current_sequence
isn't updated under the spinlock) before the pushed ctx has been added
to xc_committing?

Brian

> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-21 15:04       ` Brian Foster
@ 2014-02-21 22:21         ` Dave Chinner
  2014-02-24 13:35           ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2014-02-21 22:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 21, 2014 at 10:04:37AM -0500, Brian Foster wrote:
> On 02/19/2014 07:23 PM, Dave Chinner wrote:
> > On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
> >> On 02/18/2014 11:16 PM, Dave Chinner wrote:
> >>> From: Dave Chinner <dchinner@redhat.com>
> >>>
> ...
> >>> +
> >>> +	/*
> >>> +	 * The call to xlog_cil_push_now() executes the push in the background.
> >>> +	 * Hence by the time we have got here it our sequence may not have been
> >>> +	 * pushed yet. This is true if the current sequence still matches the
> >>> +	 * push sequence after the above wait loop and the CIL still contains
> >>> +	 * dirty objects.
> >>> +	 *
> >>> +	 * When the push occurs, it will empty the CIL and
> >>> +	 * atomically increment the currect sequence past the push sequence and
> >>> +	 * move it into the committing list. Of course, if the CIL is clean at
> >>> +	 * the time of the push, it won't have pushed the CIL at all, so in that
> >>> +	 * case we should try the push for this sequence again from the start
> >>> +	 * just in case.
> >>> +	 */
> >>> +
> >>> +	if (sequence == cil->xc_current_sequence &&
> >>> +	    !list_empty(&cil->xc_cil)) {
> >>> +		spin_unlock(&cil->xc_push_lock);
> >>> +		goto restart;
> >>> +	}
> >>> +
> >>
> >> IIUC, the objective here is to make sure we don't leave this code path
> >> before the push even starts and the ctx makes it onto the committing
> >> list, due to xlog_cil_push_now() moving things to a workqueue.
> > 
> > Right.
> > 
> >> Given that, what's the purpose of re-executing the background push as
> >> opposed to restarting the wait sequence (as done previously)? It looks
> >> like push_now() won't queue the work again due to cil->xc_push_seq, but
> >> it will flush the queue and I suppose make it more likely the push
> >> starts. Is that the intent?
> > 
> > Effectively. But the other thing that it is protecting against is
> > that foreground push is done without holding the cil->xc_ctx_lock,
> > and so we can get the situation where we try a foreground push
> > of the current sequence, see that the CIL is empty and return
> > without pushing, wait for previous sequences to commit, then find
> > that the CIL has items on the CIL in the sequence we are supposed to
> > be committing.
> > 
> > In this case, we don't know if this occurred because the workqueue
> > has not started working on our push, or whether we raced on an empty
> > CIL, and hence we need to make sure that everything in the sequence
> > we are support to commit is pushed to the log.
> > 
> > Hence if the current sequence is dirty after we've ensure that all
> > prior sequences are fully checkpointed, need to go back and
> > push the CIL again to ensure that when we return to the caller the
> > CIL is checkpointed up to the point in time of the log force
> > occurring.
> > 
> 
> Ok, I see. The foreground variant checks the push sequence and xc_cil
> list state without xc_ctx_lock. The background variant (down in
> xlog_cil_push()) will repeat the list check while under xc_ctx_lock.
> 
> From what I can see, this potential behavior of a foreground push seeing
> an empty cil is possible even before this patch. xlog_cil_force_lsn()
> will just wait on all of the previous sequences and return. Another push
> can move things along for that sequence, as xc_push_seq has not been
> updated.
> 
> The workqueue pattern opens up potential for the "wait before handler
> runs" race, and the sequence and list check is necessary to detect that.
> Doing the push_now() again seems technically unnecessary, but the flush
> executes and thus should guarantee we iterate this sequence twice at
> most. Seems reasonable.
> 
> General follow up question - what makes not taking xc_ctx_lock anywhere
> in here safe in the first place? In the current implementation, if the
> push has already been queued (note that we flush before we take the
> spinlock and check the push sequence) and we get into the ctx wait
> sequence, isn't it possible to see xc_committing before the ctx we're
> pushing is even added?

The waiting is serialised on the push lock, not the context lock.

The context lock is used to serialise addition to a CIL context
against the pushing of that sequence. Triggering a push of a CIL
context does not need to be serialised against addition to the CIL, nor
directly against the push of the CIL. A blocking push needs to be
serialised against the checkpoint of a CIL context to the iclog,
which is a different thing altogether.

Hence we don't want to use the xc_ctx_lock for this - it is already
a contended lock and we don't want to hold off commits into a new
sequence while we wait for a previous sequence to finish pushing.

Yes, there are potential races in the existing code. They are fixed by
this patch.

> With this patch, what prevents us from seeing the updated
> xc_current_sequence and thus skipping the restart (xc_current_sequence
> isn't updated under the spinlock) before the pushed ctx has been added
> to xc_committing?

The fact that the patch moves the xc_current_sequence update under
the push_lock avoids this, i.e. it is now only updated atomically
with adding the context to the committing list. Both are now
explicitly updated at the same time, so you can't see a sequence
number greater than what you might find on the list...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
  2014-02-21 22:21         ` Dave Chinner
@ 2014-02-24 13:35           ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2014-02-24 13:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Feb 22, 2014 at 09:21:06AM +1100, Dave Chinner wrote:
> On Fri, Feb 21, 2014 at 10:04:37AM -0500, Brian Foster wrote:
> > On 02/19/2014 07:23 PM, Dave Chinner wrote:
> > > On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
> > >> On 02/18/2014 11:16 PM, Dave Chinner wrote:
> > >>> From: Dave Chinner <dchinner@redhat.com>
> > >>>
> > ...
...
> > 
> > General follow up question - what makes not taking xc_ctx_lock anywhere
> > in here safe in the first place? In the current implementation, if the
> > push has already been queued (note that we flush before we take the
> > spinlock and check the push sequence) and we get into the ctx wait
> > sequence, isn't it possible to see xc_committing before the ctx we're
> > pushing is even added?
> 
> The waiting is serialised on the push lock, not the context lock.
> 
> The context lock is used to serialise addition to a CIL context with
> the against the pushing of that sequence. Triggering a push of a CIL
> context does not need to be serialised addition to the CIL, nor
> directly against the push of the CIL. A blocking push needs to be
> serialised against the checkpoint of a CIL context to the iclog,
> which is a different thing altogether.
> 
> Hence we don't want to use the xc_ctx_lock for this - it is already
> a contended lock and we don't want to hold off commits into a new
> sequence while we wait for a previous sequence to finish pushing.
> 
> Yes, there are potential races in the exist code. They are fixed by
> this patch.
> 

Ok, thanks.

> > With this patch, what prevents us from seeing the updated
> > xc_current_sequence and thus skipping the restart (xc_current_sequence
> > isn't updated under the spinlock) before the pushed ctx has been added
> > to xc_committing?
> 
> The fact that the patch moves the xc_current_sequence update under
> the the push_lock avoids this. i.e. it is now only updated atomically
> with adding the context to the committing list. Both are now
> explicitly updated at the same time, so you can't see a sequence
> number greater than what you might find on the list...
> 

Ah, right. I was reading through your patch and the original code to
understand it better and lost the fact that you moved
xc_current_sequence under spinlock (e.g., my assumption above about it
not updated under lock is incorrect). That clears that up. Thanks for the
explanations.

Reviewed-by: Brian Foster <bfoster@redhat.com>

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-02-24 13:35 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-19  4:16 [PATCH 0/3] xfs: lockdep and stack reduction fixes Dave Chinner
2014-02-19  4:16 ` [PATCH 1/3] xfs: always do log forces via the workqueue Dave Chinner
2014-02-19 18:24   ` Brian Foster
2014-02-20  0:23     ` Dave Chinner
2014-02-20 14:51       ` Mark Tinguely
2014-02-20 22:07         ` Dave Chinner
2014-02-20 22:35           ` Mark Tinguely
2014-02-21  0:02             ` Dave Chinner
2014-02-21 15:04       ` Brian Foster
2014-02-21 22:21         ` Dave Chinner
2014-02-24 13:35           ` Brian Foster
2014-02-19  4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner
2014-02-19 18:25   ` Brian Foster
2014-02-20  0:13     ` mmap_sem -> isec->lock lockdep issues with shmem (was Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive) Dave Chinner
2014-02-20 14:51   ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Christoph Hellwig
2014-02-19  4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner
2014-02-19 18:25   ` Brian Foster
2014-02-20 14:56   ` Christoph Hellwig
2014-02-20 21:09     ` Dave Chinner
