* [PATCH 0/3, V2] xfs: regression fixes for the 3.8 cycle
@ 2012-11-23 3:24 Dave Chinner
2012-11-23 3:24 ` [PATCH 1/3] xfs: inode allocation should use unmapped buffers Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Dave Chinner @ 2012-11-23 3:24 UTC (permalink / raw)
To: xfs
Version 2:
- ensure append dio is added to the data workqueue
- rework endio logic back to the if/else if style
- fixed bogus assert in sub-block zeroing in xfs_zero_file_space
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/3] xfs: inode allocation should use unmapped buffers.
2012-11-23 3:24 [PATCH 0/3, V2] xfs: regression fixes for the 3.8 cycle Dave Chinner
@ 2012-11-23 3:24 ` Dave Chinner
2012-11-26 19:07 ` Mark Tinguely
2012-11-26 22:12 ` Ben Myers
2012-11-23 3:24 ` [PATCH 2/3] xfs: fix direct IO nested transaction deadlock Dave Chinner
2012-11-23 3:24 ` [PATCH 3/3] xfs: byte range granularity for XFS_IOC_ZERO_RANGE Dave Chinner
2 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2012-11-23 3:24 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/xfs_ialloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 2d6495e..a815412 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -200,7 +200,8 @@ xfs_ialloc_inode_init(
*/
d = XFS_AGB_TO_DADDR(mp, agno, agbno + (j * blks_per_cluster));
fbuf = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
- mp->m_bsize * blks_per_cluster, 0);
+ mp->m_bsize * blks_per_cluster,
+ XBF_UNMAPPED);
if (!fbuf)
return ENOMEM;
/*
--
1.7.10
* [PATCH 2/3] xfs: fix direct IO nested transaction deadlock.
2012-11-23 3:24 [PATCH 0/3, V2] xfs: regression fixes for the 3.8 cycle Dave Chinner
2012-11-23 3:24 ` [PATCH 1/3] xfs: inode allocation should use unmapped buffers Dave Chinner
@ 2012-11-23 3:24 ` Dave Chinner
2012-11-23 8:31 ` Christoph Hellwig
2012-11-26 15:54 ` Mark Tinguely
2012-11-23 3:24 ` [PATCH 3/3] xfs: byte range granularity for XFS_IOC_ZERO_RANGE Dave Chinner
2 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2012-11-23 3:24 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The direct IO path can do a nested transaction reservation when
writing past the EOF. The first transaction is the append
transaction for setting the filesize at IO completion, but we can
also need a transaction for allocation of blocks. If the log is low
on space due to reservations and a small log, the append transaction
can be granted after waiting for space as the only active transaction
in the system. This then attempts a reservation for an allocation,
which there isn't space in the log for, and the reservation sleeps.
The result is that there is nothing left in the system to wake up
all the processes waiting for log space to come free.
The stack trace that shows this deadlock is relatively innocuous:
xlog_grant_head_wait
xlog_grant_head_check
xfs_log_reserve
xfs_trans_reserve
xfs_iomap_write_direct
__xfs_get_blocks
xfs_get_blocks_direct
do_blockdev_direct_IO
__blockdev_direct_IO
xfs_vm_direct_IO
generic_file_direct_write
xfs_file_dio_aio_write
xfs_file_aio_write
do_sync_write
vfs_write
This was discovered on a filesystem with a log of only 10MB, and a
log stripe unit of 256k which increased the base reservations by
512k. Hence an allocation transaction requires 1.2MB of log space to
be available instead of only 260k, and so greatly increased the
chance that there wouldn't be enough log space available for the
nested transaction to succeed. The key to reproducing it is this
mkfs command:
mkfs.xfs -f -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b $SCRATCH_DEV
The test case was 1000 fsstress processes running with random
freezes and unfreezes every few seconds. Thanks to Eryu Guan
(eguan@redhat.com) for writing the test that found this on a system
with a somewhat unique default configuration....
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/xfs_aops.c | 81 +++++++++++++++++++----------------------------------
fs/xfs/xfs_log.c | 3 +-
2 files changed, 31 insertions(+), 53 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 71361da..4111a40 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
ioend->io_append_trans = tp;
/*
- * We will pass freeze protection with a transaction. So tell lockdep
+ * We may pass freeze protection with a transaction. So tell lockdep
* we released it.
*/
rwsem_release(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
@@ -149,11 +149,13 @@ xfs_setfilesize(
xfs_fsize_t isize;
/*
- * The transaction was allocated in the I/O submission thread,
- * thus we need to mark ourselves as beeing in a transaction
- * manually.
+ * The transaction may have been allocated in the I/O submission thread,
+ * thus we need to mark ourselves as being in a transaction manually.
+ * Similarly for freeze protection.
*/
current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ rwsem_acquire_read(&VFS_I(ip)->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
+ 0, 1, _THIS_IP_);
xfs_ilock(ip, XFS_ILOCK_EXCL);
isize = xfs_new_eof(ip, ioend->io_offset + ioend->io_size);
@@ -187,7 +189,8 @@ xfs_finish_ioend(
if (ioend->io_type == XFS_IO_UNWRITTEN)
queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
- else if (ioend->io_append_trans)
+ else if (ioend->io_append_trans ||
+ (ioend->io_isdirect && xfs_ioend_is_append(ioend)))
queue_work(mp->m_data_workqueue, &ioend->io_work);
else
xfs_destroy_ioend(ioend);
@@ -205,15 +208,6 @@ xfs_end_io(
struct xfs_inode *ip = XFS_I(ioend->io_inode);
int error = 0;
- if (ioend->io_append_trans) {
- /*
- * We've got freeze protection passed with the transaction.
- * Tell lockdep about it.
- */
- rwsem_acquire_read(
- &ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
- 0, 1, _THIS_IP_);
- }
if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
ioend->io_error = -EIO;
goto done;
@@ -226,35 +220,31 @@ xfs_end_io(
* range to normal written extens after the data I/O has finished.
*/
if (ioend->io_type == XFS_IO_UNWRITTEN) {
+ error = xfs_iomap_write_unwritten(ip, ioend->io_offset,
+ ioend->io_size);
+ } else if (ioend->io_isdirect && xfs_ioend_is_append(ioend)) {
/*
- * For buffered I/O we never preallocate a transaction when
- * doing the unwritten extent conversion, but for direct I/O
- * we do not know if we are converting an unwritten extent
- * or not at the point where we preallocate the transaction.
+ * For direct I/O we do not know if we need to allocate blocks
+ * or not so we can't preallocate an append transaction as that
+ * results in nested reservations and log space deadlocks. Hence
+ * allocate the transaction here. While this is sub-optimal and
+ * can block IO completion for some time, we're stuck with doing
+ * it this way until we can pass the ioend to the direct IO
+ * allocation callbacks and avoid nesting that way.
*/
- if (ioend->io_append_trans) {
- ASSERT(ioend->io_isdirect);
-
- current_set_flags_nested(
- &ioend->io_append_trans->t_pflags, PF_FSTRANS);
- xfs_trans_cancel(ioend->io_append_trans, 0);
- }
-
- error = xfs_iomap_write_unwritten(ip, ioend->io_offset,
- ioend->io_size);
- if (error) {
- ioend->io_error = -error;
+ error = xfs_setfilesize_trans_alloc(ioend);
+ if (error)
goto done;
- }
+ error = xfs_setfilesize(ioend);
} else if (ioend->io_append_trans) {
error = xfs_setfilesize(ioend);
- if (error)
- ioend->io_error = -error;
} else {
ASSERT(!xfs_ioend_is_append(ioend));
}
done:
+ if (error)
+ ioend->io_error = -error;
xfs_destroy_ioend(ioend);
}
@@ -1432,25 +1422,21 @@ xfs_vm_direct_IO(
size_t size = iov_length(iov, nr_segs);
/*
- * We need to preallocate a transaction for a size update
- * here. In the case that this write both updates the size
- * and converts at least on unwritten extent we will cancel
- * the still clean transaction after the I/O has finished.
+ * We cannot preallocate a size update transaction here as we
+ * don't know whether allocation is necessary or not. Hence we
+ * can only tell IO completion that one is necessary if we are
+ * not doing unwritten extent conversion.
*/
iocb->private = ioend = xfs_alloc_ioend(inode, XFS_IO_DIRECT);
- if (offset + size > XFS_I(inode)->i_d.di_size) {
- ret = xfs_setfilesize_trans_alloc(ioend);
- if (ret)
- goto out_destroy_ioend;
+ if (offset + size > XFS_I(inode)->i_d.di_size)
ioend->io_isdirect = 1;
- }
ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov,
offset, nr_segs,
xfs_get_blocks_direct,
xfs_end_io_direct_write, NULL, 0);
if (ret != -EIOCBQUEUED && iocb->private)
- goto out_trans_cancel;
+ goto out_destroy_ioend;
} else {
ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov,
offset, nr_segs,
@@ -1460,15 +1446,6 @@ xfs_vm_direct_IO(
return ret;
-out_trans_cancel:
- if (ioend->io_append_trans) {
- current_set_flags_nested(&ioend->io_append_trans->t_pflags,
- PF_FSTRANS);
- rwsem_acquire_read(
- &inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
- 0, 1, _THIS_IP_);
- xfs_trans_cancel(ioend->io_append_trans, 0);
- }
out_destroy_ioend:
xfs_destroy_ioend(ioend);
return ret;
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c6d6e13..c49e2c1 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -460,7 +460,8 @@ xfs_log_reserve(
tic->t_trans_type = t_type;
*ticp = tic;
- xlog_grant_push_ail(log, tic->t_unit_res * tic->t_cnt);
+ xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
+ : tic->t_unit_res);
trace_xfs_log_reserve(log, tic);
--
1.7.10
* [PATCH 3/3] xfs: byte range granularity for XFS_IOC_ZERO_RANGE
2012-11-23 3:24 [PATCH 0/3, V2] xfs: regression fixes for the 3.8 cycle Dave Chinner
2012-11-23 3:24 ` [PATCH 1/3] xfs: inode allocation should use unmapped buffers Dave Chinner
2012-11-23 3:24 ` [PATCH 2/3] xfs: fix direct IO nested transaction deadlock Dave Chinner
@ 2012-11-23 3:24 ` Dave Chinner
2012-11-23 8:34 ` Christoph Hellwig
2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2012-11-23 3:24 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
XFS_IOC_ZERO_RANGE simply does not work properly for non page cache
aligned ranges. Neither test 242 nor 290 exercises this correctly, so
the behaviour is completely busted even though the tests pass.
Fix it to support full byte range granularity as was originally
intended for this ioctl.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/xfs_file.c | 2 +-
fs/xfs/xfs_vnodeops.c | 84 ++++++++++++++++++++++++++++++++++++-------------
fs/xfs/xfs_vnodeops.h | 1 +
3 files changed, 65 insertions(+), 22 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 400b187..67284ed 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -86,7 +86,7 @@ xfs_rw_ilock_demote(
* valid before the operation, it will be read from disk before
* being partially zeroed.
*/
-STATIC int
+int
xfs_iozero(
struct xfs_inode *ip, /* inode */
loff_t pos, /* offset in file */
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 2688079..06ce0be 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -2095,6 +2095,61 @@ xfs_free_file_space(
return error;
}
+
+STATIC int
+xfs_zero_file_space(
+ struct xfs_inode *ip,
+ xfs_off_t offset,
+ xfs_off_t len,
+ int attr_flags)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ uint rounding;
+ xfs_off_t start;
+ xfs_off_t end;
+ int error;
+
+ rounding = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
+
+ /* round the range of extents we are going to convert inwards */
+ start = round_up(offset, rounding);
+ end = round_down(offset + len, rounding);
+
+ ASSERT(start >= offset);
+ ASSERT(end <= offset + len);
+
+ if (!(attr_flags & XFS_ATTR_NOLOCK))
+ xfs_ilock(ip, XFS_IOLOCK_EXCL);
+
+ if (start < end - 1) {
+ /* punch out the page cache over the conversion range */
+ truncate_pagecache_range(VFS_I(ip), start, end - 1);
+ /* convert the blocks */
+ error = xfs_alloc_file_space(ip, start, end - start - 1,
+ XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT,
+ attr_flags);
+ if (error)
+ goto out_unlock;
+ } else {
+ /* it's a sub-rounding range */
+ ASSERT(offset + len <= start);
+ error = xfs_iozero(ip, offset, len);
+ goto out_unlock;
+ }
+
+ /* now we've handled the interior of the range, handle the edges */
+ if (start != offset)
+ error = xfs_iozero(ip, offset, start - offset);
+ if (!error && end != offset + len)
+ error = xfs_iozero(ip, end, offset + len - end);
+
+out_unlock:
+ if (!(attr_flags & XFS_ATTR_NOLOCK))
+ xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ return error;
+
+}
+
/*
* xfs_change_file_space()
* This routine allocates or frees disk space for the given file.
@@ -2120,10 +2175,8 @@ xfs_change_file_space(
xfs_fsize_t fsize;
int setprealloc;
xfs_off_t startoffset;
- xfs_off_t end;
xfs_trans_t *tp;
struct iattr iattr;
- int prealloc_type;
if (!S_ISREG(ip->i_d.di_mode))
return XFS_ERROR(EINVAL);
@@ -2172,31 +2225,20 @@ xfs_change_file_space(
startoffset = bf->l_start;
fsize = XFS_ISIZE(ip);
- /*
- * XFS_IOC_RESVSP and XFS_IOC_UNRESVSP will reserve or unreserve
- * file space.
- * These calls do NOT zero the data space allocated to the file,
- * nor do they change the file size.
- *
- * XFS_IOC_ALLOCSP and XFS_IOC_FREESP will allocate and free file
- * space.
- * These calls cause the new file data to be zeroed and the file
- * size to be changed.
- */
setprealloc = clrprealloc = 0;
- prealloc_type = XFS_BMAPI_PREALLOC;
-
switch (cmd) {
case XFS_IOC_ZERO_RANGE:
- prealloc_type |= XFS_BMAPI_CONVERT;
- end = round_down(startoffset + bf->l_len, PAGE_SIZE) - 1;
- if (startoffset <= end)
- truncate_pagecache_range(VFS_I(ip), startoffset, end);
- /* FALLTHRU */
+ error = xfs_zero_file_space(ip, startoffset, bf->l_len,
+ attr_flags);
+ if (error)
+ return error;
+ setprealloc = 1;
+ break;
+
case XFS_IOC_RESVSP:
case XFS_IOC_RESVSP64:
error = xfs_alloc_file_space(ip, startoffset, bf->l_len,
- prealloc_type, attr_flags);
+ XFS_BMAPI_PREALLOC, attr_flags);
if (error)
return error;
setprealloc = 1;
diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h
index 91a03fa..5163022 100644
--- a/fs/xfs/xfs_vnodeops.h
+++ b/fs/xfs/xfs_vnodeops.h
@@ -49,6 +49,7 @@ int xfs_attr_remove(struct xfs_inode *dp, const unsigned char *name, int flags);
int xfs_attr_list(struct xfs_inode *dp, char *buffer, int bufsize,
int flags, struct attrlist_cursor_kern *cursor);
+int xfs_iozero(struct xfs_inode *, loff_t, size_t);
int xfs_zero_eof(struct xfs_inode *, xfs_off_t, xfs_fsize_t);
int xfs_free_eofblocks(struct xfs_mount *, struct xfs_inode *, bool);
--
1.7.10
* Re: [PATCH 2/3] xfs: fix direct IO nested transaction deadlock.
2012-11-23 3:24 ` [PATCH 2/3] xfs: fix direct IO nested transaction deadlock Dave Chinner
@ 2012-11-23 8:31 ` Christoph Hellwig
2012-11-26 15:54 ` Mark Tinguely
1 sibling, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2012-11-23 8:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 3/3] xfs: byte range granularity for XFS_IOC_ZERO_RANGE
2012-11-23 3:24 ` [PATCH 3/3] xfs: byte range granularity for XFS_IOC_ZERO_RANGE Dave Chinner
@ 2012-11-23 8:34 ` Christoph Hellwig
2012-11-23 8:59 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2012-11-23 8:34 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Fri, Nov 23, 2012 at 02:24:25PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> XFS_IOC_ZERO_RANGE simply does not work properly for non page cache
> aligned ranges. Neither test 242 nor 290 exercises this correctly, so
> the behaviour is completely busted even though the tests pass.
>
> Fix it to support full byte range granularity as was originally
> intended for this ioctl.
Looks good, but a couple cosmetic comments below:
> + rounding = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
I'd call this granularity.
> + /* round the range of extents we are going to convert inwards */
> + start = round_up(offset, rounding);
> + end = round_down(offset + len, rounding);
start_boundary, end_boundary?
> + if (start < end - 1) {
> + /* punch out the page cache over the conversion range */
> + truncate_pagecache_range(VFS_I(ip), start, end - 1);
> + /* convert the blocks */
> + error = xfs_alloc_file_space(ip, start, end - start - 1,
> + XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT,
> + attr_flags);
> + if (error)
> + goto out_unlock;
> + } else {
> + /* it's a sub-rounding range */
> + ASSERT(offset + len <= start);
> + error = xfs_iozero(ip, offset, len);
> + goto out_unlock;
> + }
> +
> + /* now we've handled the interior of the range, handle the edges */
> + if (start != offset)
> + error = xfs_iozero(ip, offset, start - offset);
> + if (!error && end != offset + len)
> + error = xfs_iozero(ip, end, offset + len - end);
I'd move the edge iozero calls into the if branch, that gives a natural
code flow and avoids the goto unlock in the sub-page case.
* Re: [PATCH 3/3] xfs: byte range granularity for XFS_IOC_ZERO_RANGE
2012-11-23 8:34 ` Christoph Hellwig
@ 2012-11-23 8:59 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2012-11-23 8:59 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Fri, Nov 23, 2012 at 03:34:03AM -0500, Christoph Hellwig wrote:
> On Fri, Nov 23, 2012 at 02:24:25PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > XFS_IOC_ZERO_RANGE simply does not work properly for non page cache
> > aligned ranges. Neither test 242 or 290 exercise this correctly, so
> > the behaviour is completely busted even though the tests pass.
> >
> > Fix it to support full byte range granularity as was originally
> > intended for this ioctl.
>
> Looks good, but a couple cosmetic comments below:
>
> > + rounding = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
>
> I'd call this granularity.
OK. I just copied that from the hole punch code. ;)
> > + /* round the range of extents we are going to convert inwards */
> > + start = round_up(offset, rounding);
> > + end = round_down(offset + len, rounding);
>
> start_boundary, end_boundary?
OK.
> > + if (start < end - 1) {
> > + /* punch out the page cache over the conversion range */
> > + truncate_pagecache_range(VFS_I(ip), start, end - 1);
> > + /* convert the blocks */
> > + error = xfs_alloc_file_space(ip, start, end - start - 1,
> > + XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT,
> > + attr_flags);
> > + if (error)
> > + goto out_unlock;
> > + } else {
> > + /* it's a sub-rounding range */
> > + ASSERT(offset + len <= start);
> > + error = xfs_iozero(ip, offset, len);
> > + goto out_unlock;
> > + }
> > +
> > + /* now we've handled the interior of the range, handle the edges */
> > + if (start != offset)
> > + error = xfs_iozero(ip, offset, start - offset);
> > + if (!error && end != offset + len)
> > + error = xfs_iozero(ip, end, offset + len - end);
>
> I'd move the edge iozero calls into the if branch, that gives a natural
> code flow and avoids the goto unlock in the sub-page case.
Sure. I added the sub-page case after the edge cases when I realised
the edge cases didn't handle that. I'll rework it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/3] xfs: fix direct IO nested transaction deadlock.
2012-11-23 3:24 ` [PATCH 2/3] xfs: fix direct IO nested transaction deadlock Dave Chinner
2012-11-23 8:31 ` Christoph Hellwig
@ 2012-11-26 15:54 ` Mark Tinguely
2012-11-26 21:45 ` Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Mark Tinguely @ 2012-11-26 15:54 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 11/22/12 21:24, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> The direct IO path can do a nested transaction reservation when
> writing past the EOF. The first transaction is the append
> transaction for setting the filesize at IO completion, but we can
> also need a transaction for allocation of blocks. If the log is low
> on space due to reservations and a small log, the append transaction
> can be granted after waiting for space as the only active transaction
> in the system. This then attempts a reservation for an allocation,
> which there isn't space in the log for, and the reservation sleeps.
> The result is that there is nothing left in the system to wake up
> all the processes waiting for log space to come free.
>
> The stack trace that shows this deadlock is relatively innocuous:
>
> xlog_grant_head_wait
> xlog_grant_head_check
> xfs_log_reserve
> xfs_trans_reserve
> xfs_iomap_write_direct
> __xfs_get_blocks
> xfs_get_blocks_direct
> do_blockdev_direct_IO
> __blockdev_direct_IO
> xfs_vm_direct_IO
> generic_file_direct_write
> xfs_file_dio_aio_write
> xfs_file_aio_write
> do_sync_write
> vfs_write
>
> This was discovered on a filesystem with a log of only 10MB, and a
> log stripe unit of 256k which increased the base reservations by
> 512k. Hence an allocation transaction requires 1.2MB of log space to
> be available instead of only 260k, and so greatly increased the
> chance that there wouldn't be enough log space available for the
> nested transaction to succeed. The key to reproducing it is this
> mkfs command:
>
> mkfs.xfs -f -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b $SCRATCH_DEV
>
> The test case was 1000 fsstress processes running with random
> freezes and unfreezes every few seconds. Thanks to Eryu Guan
> (eguan@redhat.com) for writing the test that found this on a system
> with a somewhat unique default configuration....
>
> cc:<stable@vger.kernel.org>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> Signed-off-by: Dave Chinner<david@fromorbit.com>
> ---
This is a good work-around but with a bit more filesystem traffic (I
used your filesystem, fsstress and added the rm/copies from bug 922
because I knew it stressed the log), we can easily (couple hours) hang
the filesystem. There are many processes waiting for log space, here is
one in xfs_end_io():
PID: 20256 TASK: ffff88034c5b4140 CPU: 2 COMMAND: "fsstress"
#0 [ffff88034b8dd688] __schedule at ffffffff8142fbbe
#1 [ffff88034b8dd710] schedule at ffffffff8142ffe4
#2 [ffff88034b8dd720] xlog_grant_head_wait at ffffffffa05231f8 [xfs]
#3 [ffff88034b8dd790] xlog_grant_head_check at ffffffffa05233fc [xfs]
#4 [ffff88034b8dd7d0] xfs_log_reserve at ffffffffa05249c2 [xfs]
#5 [ffff88034b8dd830] xfs_trans_reserve at ffffffffa0520b43 [xfs]
#6 [ffff88034b8dd890] xfs_iomap_write_unwritten at ffffffffa04c874e [xfs]
#7 [ffff88034b8dd960] xfs_end_io at ffffffffa04b6e62 [xfs]
#8 [ffff88034b8dd990] xfs_finish_ioend_sync at ffffffffa04b6f81 [xfs]
#9 [ffff88034b8dd9a0] xfs_end_io_direct_write at ffffffffa04b6fd9 [xfs]
#10 [ffff88034b8dd9b0] dio_complete at ffffffff81184f8e
#11 [ffff88034b8dd9d0] do_blockdev_direct_IO at ffffffff81187578
#12 [ffff88034b8ddba0] __blockdev_direct_IO at ffffffff81187850
#13 [ffff88034b8ddbe0] xfs_vm_direct_IO at ffffffffa04b71a8 [xfs]
#14 [ffff88034b8ddc70] generic_file_direct_write at ffffffff810f8b9e
#15 [ffff88034b8ddce0] xfs_file_dio_aio_write at ffffffffa04bf9b4 [xfs]
#16 [ffff88034b8ddd80] xfs_file_aio_write at ffffffffa04bfeea [xfs]
#17 [ffff88034b8ddde0] do_sync_write at ffffffff8114cf83
#18 [ffff88034b8ddf00] vfs_write at ffffffff8114d5fe
#19 [ffff88034b8ddf30] sys_write at ffffffff8114d91d
#20 [ffff88034b8ddf80] system_call_fastpath at ffffffff814391e9
I still believe we should continue to push the AIL when the last push
was not enough to satisfy the first waiter, but that is just a minor
optimization so we do not have to wait until the next sync.
Now that Brian fixed the locking issue, the current hangs happen because
the sync worker is responsible for forcing the writing of the CIL, but the
sync worker needs log space for the dummy record.
xfs_ail ffff88034c798640:
struct xfs_ail {
xa_mount = 0xffff88034fdd3000,
xa_task = 0xffff88034fe364c0,
xa_ail = {
next = 0xffff88034c798650, <- empty
prev = 0xffff88034c798650
},
The CIL is holding some entries,
xfs_cil_ctx ffff88035159b7c0
struct xfs_cil_ctx {
cil = 0xffff88034c159ec0,
sequence = 0x1de0,
start_lsn = 0x0,
commit_lsn = 0x0,
ticket = 0xffff8802b80c0e30,
nvecs = 0xa9,
space_used = 0x3d30,
busy_extents = {
next = 0xffff880333ae2f98,
prev = 0xffff880333ae2f98
},
lv_chain = 0x0,
log_cb = {
cb_next = 0x0,
cb_func = 0,
cb_arg = 0x0
},
committing = {
next = 0xffff88035159b820,
prev = 0xffff88035159b820
}
}
PID: 21396 TASK: ffff8802b816e080 CPU: 0 COMMAND: "kworker/0:0"
#0 [ffff8802b8361b78] __schedule at ffffffff8142fbbe
#1 [ffff8802b8361c00] schedule at ffffffff8142ffe4
#2 [ffff8802b8361c10] xlog_grant_head_wait at ffffffffa05231f8 [xfs]
#3 [ffff8802b8361c80] xlog_grant_head_check at ffffffffa05233fc [xfs]
#4 [ffff8802b8361cc0] xfs_log_reserve at ffffffffa05249c2 [xfs]
#5 [ffff8802b8361d20] xfs_trans_reserve at ffffffffa0520b43 [xfs]
#6 [ffff8802b8361d80] xfs_fs_log_dummy at ffffffffa04c28ae [xfs]
#7 [ffff8802b8361db0] xfs_log_worker at ffffffffa0525ea8 [xfs]
#8 [ffff8802b8361dd0] process_one_work at ffffffff8105afc6
#9 [ffff8802b8361e50] worker_thread at ffffffff8105be00
#10 [ffff8802b8361ec0] kthread at ffffffff81061b1b
#11 [ffff8802b8361f50] ret_from_fork at ffffffff8143913c
This patch is a good band aid, but we need to do a log rework as
discussed in:
http://oss.sgi.com/archives/xfs/2012-09/msg00020.html
--Mark.
* Re: [PATCH 1/3] xfs: inode allocation should use unmapped buffers.
2012-11-23 3:24 ` [PATCH 1/3] xfs: inode allocation should use unmapped buffers Dave Chinner
@ 2012-11-26 19:07 ` Mark Tinguely
2012-11-26 22:12 ` Ben Myers
1 sibling, 0 replies; 14+ messages in thread
From: Mark Tinguely @ 2012-11-26 19:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 11/22/12 21:24, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> Inode buffers do not need to be mapped as inodes are read or written
> directly from/to the pages underlying the buffer. This fixes a
> regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
> default behaviour").
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig<hch@lst.de>
> Signed-off-by: Dave Chinner<david@fromorbit.com>
> ---
Good catch.
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
* Re: [PATCH 2/3] xfs: fix direct IO nested transaction deadlock.
2012-11-26 15:54 ` Mark Tinguely
@ 2012-11-26 21:45 ` Dave Chinner
2012-11-27 22:46 ` Mark Tinguely
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2012-11-26 21:45 UTC (permalink / raw)
To: Mark Tinguely; +Cc: xfs
On Mon, Nov 26, 2012 at 09:54:50AM -0600, Mark Tinguely wrote:
> On 11/22/12 21:24, Dave Chinner wrote:
> >From: Dave Chinner<dchinner@redhat.com>
> >
> >The direct IO path can do a nested transaction reservation when
> >writing past the EOF. The first transaction is the append
> >transaction for setting the filesize at IO completion, but we can
> >also need a transaction for allocation of blocks. If the log is low
> >on space due to reservations and a small log, the append transaction
> >can be granted after waiting for space as the only active transaction
> >in the system. This then attempts a reservation for an allocation,
> >which there isn't space in the log for, and the reservation sleeps.
> >The result is that there is nothing left in the system to wake up
> >all the processes waiting for log space to come free.
> >
> >The stack trace that shows this deadlock is relatively innocuous:
> >
> > xlog_grant_head_wait
> > xlog_grant_head_check
> > xfs_log_reserve
> > xfs_trans_reserve
> > xfs_iomap_write_direct
> > __xfs_get_blocks
> > xfs_get_blocks_direct
> > do_blockdev_direct_IO
> > __blockdev_direct_IO
> > xfs_vm_direct_IO
> > generic_file_direct_write
> > xfs_file_dio_aio_write
> > xfs_file_aio_write
> > do_sync_write
> > vfs_write
> >
> >This was discovered on a filesystem with a log of only 10MB, and a
> >log stripe unit of 256k which increased the base reservations by
> >512k. Hence an allocation transaction requires 1.2MB of log space to
> >be available instead of only 260k, and so greatly increased the
> >chance that there wouldn't be enough log space available for the
> >nested transaction to succeed. The key to reproducing it is this
> >mkfs command:
> >
> >mkfs.xfs -f -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b $SCRATCH_DEV
> >
> >The test case was 1000 fsstress processes running with random
> >freezes and unfreezes every few seconds. Thanks to Eryu Guan
> >(eguan@redhat.com) for writing the test that found this on a system
> >with a somewhat unique default configuration....
> >
> >cc:<stable@vger.kernel.org>
> >Signed-off-by: Dave Chinner<dchinner@redhat.com>
> >Signed-off-by: Dave Chinner<david@fromorbit.com>
> >---
>
>
> This is a good work-around
It's not a work-around. Nested transaction reserves are a bug.
Always have been, always will be.
> but with a bit more filesystem traffic (I
> used your filesystem, fsstress and added the rm/copies from bug 922
> because I knew it stressed the log), we can easily (couple hours) hang
Script, please.
> the filesystem. There are many processes waiting for log space, here is
> one in xfs_end_io():
>
> PID: 20256 TASK: ffff88034c5b4140 CPU: 2 COMMAND: "fsstress"
> #0 [ffff88034b8dd688] __schedule at ffffffff8142fbbe
> #1 [ffff88034b8dd710] schedule at ffffffff8142ffe4
> #2 [ffff88034b8dd720] xlog_grant_head_wait at ffffffffa05231f8 [xfs]
> #3 [ffff88034b8dd790] xlog_grant_head_check at ffffffffa05233fc [xfs]
> #4 [ffff88034b8dd7d0] xfs_log_reserve at ffffffffa05249c2 [xfs]
> #5 [ffff88034b8dd830] xfs_trans_reserve at ffffffffa0520b43 [xfs]
> #6 [ffff88034b8dd890] xfs_iomap_write_unwritten at ffffffffa04c874e [xfs]
> #7 [ffff88034b8dd960] xfs_end_io at ffffffffa04b6e62 [xfs]
> #8 [ffff88034b8dd990] xfs_finish_ioend_sync at ffffffffa04b6f81 [xfs]
> #9 [ffff88034b8dd9a0] xfs_end_io_direct_write at ffffffffa04b6fd9 [xfs]
> #10 [ffff88034b8dd9b0] dio_complete at ffffffff81184f8e
> #11 [ffff88034b8dd9d0] do_blockdev_direct_IO at ffffffff81187578
> #12 [ffff88034b8ddba0] __blockdev_direct_IO at ffffffff81187850
> #13 [ffff88034b8ddbe0] xfs_vm_direct_IO at ffffffffa04b71a8 [xfs]
> #14 [ffff88034b8ddc70] generic_file_direct_write at ffffffff810f8b9e
> #15 [ffff88034b8ddce0] xfs_file_dio_aio_write at ffffffffa04bf9b4 [xfs]
> #16 [ffff88034b8ddd80] xfs_file_aio_write at ffffffffa04bfeea [xfs]
> #17 [ffff88034b8ddde0] do_sync_write at ffffffff8114cf83
> #18 [ffff88034b8ddf00] vfs_write at ffffffff8114d5fe
> #19 [ffff88034b8ddf30] sys_write at ffffffff8114d91d
> #20 [ffff88034b8ddf80] system_call_fastpath at ffffffff814391e9
That's waiting on unwritten extent conversion, so is a different
problem to the one that this patch fixes because there are no nested
transactions with the bug fix in place.
> I still believe we should continue to push the AIL when the last push
> was not enough to satisfy the first waiter, but that is just a minor
> optimization so we do not have to wait until the next sync.
There's no indication that the push target is the problem because
the AIL is empty, i.e. the push has worked correctly.
> Now that Brian fixed the locking issue,
Remind me again what that was?
> the current hangs happen because
> the sync worker is responsible for forcing the writing of the CIL, but the
> sync worker needs log space for the dummy record.
The sync worker is responsible for idling the log, not much else.
> xfs_ail ffff88034c798640:
> struct xfs_ail {
> xa_mount = 0xffff88034fdd3000,
> xa_task = 0xffff88034fe364c0,
> xa_ail = {
> next = 0xffff88034c798650, <- empty
> prev = 0xffff88034c798650
> },
If the AIL is empty, then the log is not being pinned. What's the
rest of the AIL look like? The push target would be very
enlightening, especially if that's compared against the grant heads
and the LSNs held in the log structure.
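For reference, those packed LSN values can be cracked like this (a quick sketch mirroring the kernel's CYCLE_LSN()/BLOCK_LSN() convention of cycle in the upper 32 bits and block number in the lower 32; the value below is the xa_target from the dump further down the thread):

```python
# Sketch: decode an XFS LSN into (cycle, block), assuming the usual
# packing of (cycle << 32) | block as done by CYCLE_LSN()/BLOCK_LSN().
def crack_lsn(lsn):
    return lsn >> 32, lsn & 0xFFFFFFFF

cycle, block = crack_lsn(0xd000004000)   # xa_target from the AIL dump
print(hex(cycle), hex(block))            # 0xd0 0x4000
```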
> The CIL is holding some entries,
> xfs_cil_ctx ffff88035159b7c0
> struct xfs_cil_ctx {
> cil = 0xffff88034c159ec0,
> sequence = 0x1de0,
> start_lsn = 0x0,
> commit_lsn = 0x0,
> ticket = 0xffff8802b80c0e30,
> nvecs = 0xa9,
> space_used = 0x3d30,
space_used = 15,664 bytes. There's very little space consumed by the
CIL.
But without the struct log, struct xfs_cil or the full struct
xfs_ail, I have no way of even confirming that it is a similar
problem.
Further, I need the full output of the first ticket on the reserve
grant head wait queue to determine if there really is enough space
in the log for that ticket to be woken.....
> PID: 21396 TASK: ffff8802b816e080 CPU: 0 COMMAND: "kworker/0:0"
> #0 [ffff8802b8361b78] __schedule at ffffffff8142fbbe
> #1 [ffff8802b8361c00] schedule at ffffffff8142ffe4
> #2 [ffff8802b8361c10] xlog_grant_head_wait at ffffffffa05231f8 [xfs]
> #3 [ffff8802b8361c80] xlog_grant_head_check at ffffffffa05233fc [xfs]
> #4 [ffff8802b8361cc0] xfs_log_reserve at ffffffffa05249c2 [xfs]
> #5 [ffff8802b8361d20] xfs_trans_reserve at ffffffffa0520b43 [xfs]
> #6 [ffff8802b8361d80] xfs_fs_log_dummy at ffffffffa04c28ae [xfs]
> #7 [ffff8802b8361db0] xfs_log_worker at ffffffffa0525ea8 [xfs]
> #8 [ffff8802b8361dd0] process_one_work at ffffffff8105afc6
> #9 [ffff8802b8361e50] worker_thread at ffffffff8105be00
> #10 [ffff8802b8361ec0] kthread at ffffffff81061b1b
> #11 [ffff8802b8361f50] ret_from_fork at ffffffff8143913c
>
> This patch is a good band-aid, but we need to do a log rework as
> discussed in:
>
> http://oss.sgi.com/archives/xfs/2012-09/msg00020.html
The AIL list is empty, and the CIL is close to empty - I'm not sure
that a log force would do anything because it won't change the
AIL/CIL state very much at all. Besides, that thread is referring to
*removing* regular log forces while the log is active, not adding
them, so it's not likely to help here at all, either.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/3] xfs: inode allocation should use unmapped buffers.
2012-11-23 3:24 ` [PATCH 1/3] xfs: inode allocation should use unmapped buffers Dave Chinner
2012-11-26 19:07 ` Mark Tinguely
@ 2012-11-26 22:12 ` Ben Myers
2012-11-26 23:00 ` Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Ben Myers @ 2012-11-26 22:12 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Hi Dave,
On Fri, Nov 23, 2012 at 02:24:23PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Inode buffers do not need to be mapped as inodes are read or written
> directly from/to the pages underlying the buffer. This fixes a
> regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
> default behaviour").
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
committed to git://oss.sgi.com/xfs/xfs.git, master and for-next.
What's with the multiple Signed-off-by? Is your cousin Dave getting involved
in XFS development too? Welcome! ;)
Thanks,
Ben
* Re: [PATCH 1/3] xfs: inode allocation should use unmapped buffers.
2012-11-26 22:12 ` Ben Myers
@ 2012-11-26 23:00 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2012-11-26 23:00 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs
On Mon, Nov 26, 2012 at 04:12:53PM -0600, Ben Myers wrote:
> Hi Dave,
>
> On Fri, Nov 23, 2012 at 02:24:23PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Inode buffers do not need to be mapped as inodes are read or written
> > directly from/to the pages underlying the buffer. This fixes a
> > regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
> > default behaviour").
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
>
> committed to git://oss.sgi.com/xfs/xfs.git, master and for-next.
>
> What's with the multiple Signed-off-by? Is your cousin Dave getting involved
> in XFS development too? Welcome! ;)
Ah, I forgot the "-s" flag to guilt patchbomb, which prevents
git-send-email from adding signed-off-by lines from the local
repository committer. That's my mistake (it's the first time I've
ever done that); you can kill the ones from the david@fromorbit.com
address.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/3] xfs: fix direct IO nested transaction deadlock.
2012-11-26 21:45 ` Dave Chinner
@ 2012-11-27 22:46 ` Mark Tinguely
2012-11-28 0:14 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Mark Tinguely @ 2012-11-27 22:46 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 11/26/12 15:45, Dave Chinner wrote:
>> but with a bit more filesystem traffic (I
>> used your filesystem, fsstress and added the rm/copies from bug 922
>> because I knew it stressed the log), we can easily (couple hours) hang
>
> Script, please.
I started with your filesystem, fsstress and added the programs from Bug
922:
http://oss.sgi.com/bugzilla/attachment.cgi?id=304
http://oss.sgi.com/bugzilla/attachment.cgi?id=305
#!/bin/sh
perl ./copy.pl &
sleep 180
while true
do
# ./check -g auto
/root/xfstests/ltp/fsstress -d /test2 -n 1000 -p 1000
done
I ran it on 2 different x86_64 boxes and both hung in a few hours.
I did a command-line sync on one of the boxes and the test continued,
but the disk utilization was low.
Today, I re-ran the test on both machines with the following line
commented out of copy.pl:
system("sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'");
So far both machines are still running, although one machine that has
been running overnight started a pattern of performing at 100% disk
utilization for 2 minutes and then nearly a minute of near-0%
utilization. Now the utilization is bouncing between 1% and 60% with no
apparent pattern. Unfortunately, I cannot core that machine.
> That's waiting on unwritten extent conversion, so is a different
> problem to the one that this patch fixes because there are no nested
> transactions with the bug fix in place.
>
>> I still believe we should continue to push the AIL when the last push
>> was not enough to satisfy the first waiter, but that is just a minor
>> optimization so we do not have to wait until the next sync.
>
> There's no indication that the push target is the problem because
> the AIL is empty i.e. the push has worked correctly.
Yeah, I know this is not the problem. It was an observation.
>
>> Now that Brian fixed the locking issue,
>
> Remind me again what that was?
>
http://oss.sgi.com/archives/xfs/2012-06/msg00134.html
check for stale inode before acquiring iflock on push
The AIL is empty, so we know that is working fine.
>> the current hangs happen because
>> the sync worker is responsible for forcing the writing of the CIL, but the
>> sync worker needs log space for the dummy record.
>
> The sync worker is responsible for idling the log, not much else.
>
My mistake. The sync worker will only push the AIL. I was thinking of
xfsaild_push(), which will force the CIL if there is a pinned item - not
appropriate if the AIL is empty. It is xfs_fs_sync_fs() that can force
the CIL, as can other commands.
Side question:
What keeps the CIL from consuming more than half the log space?
Here are the data structures. This is strange: the hang normally occurs
with an empty AIL, and the space could be accounted for in the remaining
grant space and the CIL.
===============
AIL at ffff88034c798640:
struct xfs_ail {
xa_mount = 0xffff88034fdd3000,
xa_task = 0xffff88034fe364c0,
xa_ail = {
next = 0xffff88034c798650,
prev = 0xffff88034c798650
},
xa_target = 0xd000004000,
xa_target_prev = 0xd000004000,
xa_cursors = {
next = 0xffff88034c798670,
prev = 0xffff88034c798670
},
xa_lock = {
},
xa_last_pushed_lsn = 0x0,
xa_log_flush = 0x0,
xa_buf_list = {
next = 0xffff88034c798698,
prev = 0xffff88034c798698
},
xa_empty = {
lock = {
},
task_list = {
next = 0xffff88034c7986b0,
prev = 0xffff88034c7986b0
}
}
}
=====================
xlog at ffff88034ec0a400
struct xlog {
l_mp = 0xffff88034fdd3000,
l_ailp = 0xffff88034c798640,
l_cilp = 0xffff88034c159ec0,
l_xbuf = 0xffff880349e94c80,
l_targ = 0xffff88034f78a8c0,
l_work = {
work = {
data = {
counter = 0x0
},
entry = {
next = 0xffff88034ec0a430,
prev = 0xffff88034ec0a430
},
func = 0xffffffffa0525e60 <_xfs_log_force+624>
},
timer = {
entry = {
next = 0x0,
prev = 0xdead000000200200
},
expires = 0x10024c0d0,
base = 0xffffffff81cb9e42,
function = 0xffffffff8105a410 <delayed_work_timer_fn>,
data = 0xffff88034ec0a428,
slack = 0xffffffff,
start_pid = 0xffffffff,
start_site = 0x0,
start_comm =
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
},
cpu = 0x1000
},
l_flags = 0x10,
l_quotaoffs_flag = 0x0,
l_buf_cancel_table = 0x0,
l_iclog_hsize = 0x1000,
l_iclog_heads = 0x8,
l_sectBBsize = 0x1,
l_iclog_size = 0x40000,
l_iclog_size_log = 0x12,
l_iclog_bufs = 0x8,
l_logBBstart = 0x4001200,
l_logsize = 0xa00000,
l_logBBsize = 0x5000,
l_flush_wait = {
lock = {
},
task_list = {
next = 0xffff88034ec0a508,
prev = 0xffff88034ec0a508
}
},
l_covered_state = 0x1,
l_iclog = 0xffff88034fd64b80,
l_icloglock = {
},
l_curr_cycle = 0xd0,
l_prev_cycle = 0xd0,
l_curr_block = 0x4200,
l_prev_block = 0x4000,
l_last_sync_lsn = {
counter = 0xd000004000
},
l_tail_lsn = {
counter = 0xd000004000
},
l_reserve_head = {
lock = {
},
waiters = {
next = 0xffff8802ba98fe30,
prev = 0xffff8802b7c312b0
},
grant = {
counter = 0xd1006656f0
}
},
0x800000-0x6656F0 == 1681680 reserve space left
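That figure can be cross-checked by cracking the grant counter the way the kernel's xlog_crack_grant_head() does - cycle in the upper 32 bits, byte offset in the lower 32 (a quick sketch, not kernel code):

```python
# Sketch: decode the reserve grant head quoted above into (cycle, bytes),
# assuming the (cycle << 32) | space packing used by the grant heads.
def crack_grant_head(grant):
    return grant >> 32, grant & 0xFFFFFFFF

cycle, space = crack_grant_head(0xd1006656f0)
print(hex(cycle), space)     # 0xd1 6706928 (== 0x6656f0)
print(0x800000 - space)      # 1681680 bytes, matching the figure above
```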
=====================
Top waiter ticket at ffff8802ba98fe30
struct xlog_ticket {
t_queue = {
next = 0xffff8802b7f1d140,
prev = 0xffff88034ec0a5c8
},
t_task = 0xffff8802bac3a3c0,
t_tid = 0x5d1c97b2,
t_ref = {
counter = 0x1
},
t_curr_res = 0xa8b6c,
t_unit_res = 0xa8b6c,
t_ocnt = 0x3,
t_cnt = 0x3,
t_clientid = 0x69,
t_flags = 0x3,
wants 0xa8b6c * 3 == 2073156 bytes.
=====================
CIL at ffff88034c159ec0
struct xfs_cil {
xc_log = 0xffff88034ec0a400,
xc_cil = {
next = 0xffff8802b8001808,
prev = 0xffff8802b82d79c8
},
xc_cil_lock = {
},
xc_ctx = 0xffff88035159b7c0,
xc_ctx_lock = {
count = 0x0,
wait_lock = {
},
wait_list = {
next = 0xffff88034c159ef8,
prev = 0xffff88034c159ef8
}
},
xc_committing = {
next = 0xffff88034c159f08,
prev = 0xffff88034c159f08
},
xc_commit_wait = {
lock = {
},
task_list = {
next = 0xffff88034c159f20,
prev = 0xffff88034c159f20
}
},
xc_current_sequence = 0x1de0,
xc_push_work = {
data = {
counter = 0x200200
},
entry = {
next = 0xffff88034c159f40,
prev = 0xffff88034c159f40
},
func = 0xffffffffa0527bd0 <xlog_cil_push+688>
},
xc_push_seq = 0x1ddf
}
=====================
struct xfs_cil_ctx {
cil = 0xffff88034c159ec0,
sequence = 0x1de0,
start_lsn = 0x0,
commit_lsn = 0x0,
ticket = 0xffff8802b80c0e30,
nvecs = 0xa9,
space_used = 0x3d30, <- only 15664 bytes
busy_extents = {
next = 0xffff880333ae2f98,
prev = 0xffff880333ae2f98
},
lv_chain = 0x0,
log_cb = {
cb_next = 0x0,
cb_func = 0,
cb_arg = 0x0
},
committing = {
next = 0xffff88035159b820,
prev = 0xffff88035159b820
}
}
The ticket held by the CIL CTX at ffff8802b80c0e30:
struct xlog_ticket {
t_queue = {
next = 0xffff8802b80c0e30,
prev = 0xffff8802b80c0e30
},
t_task = 0xffff88034fe364c0,
t_tid = 0x79652e47,
t_ref = {
counter = 0x1
},
t_curr_res = 0x82034,
t_unit_res = 0x82034, <- 532532 bytes
t_ocnt = 0x1,
t_cnt = 0x1,
t_clientid = 0x69,
t_flags = 0x1,
t_trans_type = 0x2a,
t_res_num = 0x0,
t_res_num_ophdrs = 0x0,
t_res_arr_sum = 0x0,
t_res_o_flow = 0x0,
The xfs_log_item in the CIL have li_seq numbers from 0x1de0 to 0x1dbc.
0x1de0 is the xc_current_sequence of the CIL.
All the xlog_in_core entries' state are XLOG_STATE_ACTIVE.
* Re: [PATCH 2/3] xfs: fix direct IO nested transaction deadlock.
2012-11-27 22:46 ` Mark Tinguely
@ 2012-11-28 0:14 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2012-11-28 0:14 UTC (permalink / raw)
To: Mark Tinguely; +Cc: xfs
On Tue, Nov 27, 2012 at 04:46:11PM -0600, Mark Tinguely wrote:
> On 11/26/12 15:45, Dave Chinner wrote:
> >>but with a bit more filesystem traffic (I
> >>used your filesystem, fsstress and added the rm/copies from bug 922
> >>because I knew it stressed the log), we can easily (couple hours) hang
> >
> >Script, please.
>
> I started with your filesystem, fsstress and added the programs from
> Bug 922:
>
> http://oss.sgi.com/bugzilla/attachment.cgi?id=304
> http://oss.sgi.com/bugzilla/attachment.cgi?id=305
>
> #!/bin/sh
> perl ./copy.pl &
> sleep 180
> while true
> do
> # ./check -g auto
> /root/xfstests/ltp/fsstress -d /test2 -n 1000 -p 1000
> done
Ok, I'll try it.
> >>the current hangs happen because
> >>the sync worker is responsible for forcing the writing of the CIL, but the
> >>sync worker needs log space for the dummy record.
> >
> >The sync worker is responsible for idling the log, not much else.
> >
>
> My mistake. The sync worker will only push the AIL. I am thinking of
> the xfsaild_push() will force the CIL if there is a pinned item -
> not appropriate if the AIL is empty. It is xfs_fs_sync_fs() can
> force the CIL as can other commands.
>
> Side question:
> What keeps the CIL from consuming more than half the log space?
See the comments in xfs_log_priv.h above these two lines:
#define XLOG_CIL_SPACE_LIMIT(log) (log->l_logsize >> 3)
#define XLOG_CIL_HARD_SPACE_LIMIT(log) (3 * (log->l_logsize >> 4))
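Plugging in the log from this report (l_logsize = 0xa00000), those limits work out like this (quick sketch):

```python
# Sketch: evaluate the two CIL space limits above for this 10MB log.
l_logsize = 0xa00000               # 10485760 bytes, from the xlog dump
space_limit = l_logsize >> 3       # XLOG_CIL_SPACE_LIMIT: 12.5% of the log
hard_limit = 3 * (l_logsize >> 4)  # XLOG_CIL_HARD_SPACE_LIMIT: 18.75%
print(space_limit, hard_limit)     # 1310720 1966080
```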
> Here are the data structures. This is strange: the hang normally occurs
> with an empty AIL, and the space could be accounted for in the remaining
> grant space and the CIL.
> ===============
> AIL at ffff88034c798640:
> struct xfs_ail {
....
> xa_target = 0xd000004000,
> xa_target_prev = 0xd000004000,
....
> xa_last_pushed_lsn = 0x0,
> xa_log_flush = 0x0,
....
> =====================
> xlog at ffff88034ec0a400
> struct xlog {
....
> l_logsize = 0xa00000,
Which means the log is 10485760 bytes in size
...
> l_curr_cycle = 0xd0,
> l_prev_cycle = 0xd0,
> l_curr_block = 0x4200,
> l_prev_block = 0x4000,
> l_last_sync_lsn = {
> counter = 0xd000004000
> },
> l_tail_lsn = {
> counter = 0xd000004000
> },
> l_reserve_head = {
> lock = {
> },
> waiters = {
> next = 0xffff8802ba98fe30,
> prev = 0xffff8802b7c312b0
> },
> grant = {
> counter = 0xd1006656f0
> }
> },
>
> 0x800000-0x6656F0 == 1681680 reserve space left
BTW, when I ask for the full structure, I'd like to see the full
structure, so that as I walk through the code I can refer back to the
structure to see that the values are what I think they should be...
> =====================
> Top waiter ticket at ffff8802ba98fe30
> struct xlog_ticket {
....
> t_curr_res = 0xa8b6c,
> t_unit_res = 0xa8b6c,
> t_ocnt = 0x3,
> t_cnt = 0x3,
> t_clientid = 0x69,
> t_flags = 0x3,
>
> wants 0xa8b6c * 3 == 2073156 bytes.
And that is close to 20% of the log size but not an invalid size.
The question is "what is holding all the grant space"? That's why I
need the whole struct xlog output - I suspect that there are
permanent transactions waiting on the write head and so can't move
forward and hence can't release grant space.
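A quick sketch with the ticket values quoted above bears out the "close to 20%" figure:

```python
# Sketch: the top waiter's total reservation against the 10MB log.
t_unit_res = 0xa8b6c      # from the top waiter's ticket dump
t_cnt = 3                 # permanent transaction, 3 reservation counts
l_logsize = 0xa00000      # 10485760 bytes

total = t_unit_res * t_cnt
print(total, round(100 * total / l_logsize, 1))  # 2073156 19.8
```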
> =====================
> CIL at ffff88034c159ec0
>
> struct xfs_cil {
.....
> xc_current_sequence = 0x1de0,
.....
> xc_push_seq = 0x1ddf
> }
Which indicates that there is an active sequence that hasn't been
pushed.
> =====================
> struct xfs_cil_ctx {
....
> space_used = 0x3d30, <- only 15664 bytes
So very little.
> The ticket held by the CIL CTX at ffff8802b80c0e30:
> struct xlog_ticket {
....
> t_unit_res = 0x82034, <- 532532 bytes
Which means it has already stolen 532532 bytes from the existing
committed transactions in the CIL. This is accounted as used space
in the log as it is stolen from committed transactions that reserved
the space in the first place.
However, this is such a small log, I'm wondering whether this
overhead is the source of the issue. It can't be freed up by pushing
on the AIL, and at 5% of the log space it increases the CIL
background push threshold by close to 50% (i.e. from 12.5% of the
log space used to 17.5%). And it's larger than the difference
between the reserve grant head and the ticket reservation size.
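Putting numbers on that (a quick sketch using the values quoted earlier in the thread):

```python
# Sketch: quantify the CIL ticket's stolen space against the 10MB log.
l_logsize = 0xa00000       # 10485760 bytes
stolen = 0x82034           # CIL ctx ticket t_unit_res: 532532 bytes
background = l_logsize >> 3  # XLOG_CIL_SPACE_LIMIT: 1310720 bytes

print(round(100 * stolen / l_logsize, 1))  # 5.1 -> "at 5% of the log space"

# Effective background push threshold once the stolen space is counted:
print(round(100 * (background + stolen) / l_logsize, 1))  # 17.6 -> "~17.5%"

# And it exceeds the top waiter's shortfall (reservation wanted minus
# the reserve space left, per the figures quoted earlier):
shortfall = 0xa8b6c * 3 - 1681680
print(shortfall, stolen > shortfall)       # 391476 True
```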
I think I need to reproduce this locally to be able to debug it...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads: [~2012-11-28 0:11 UTC | newest]
Thread overview: 14+ messages
2012-11-23 3:24 [PATCH 0/3, V2] xfs: regression fixes for the 3.8 cycle Dave Chinner
2012-11-23 3:24 ` [PATCH 1/3] xfs: inode allocation should use unmapped buffers Dave Chinner
2012-11-26 19:07 ` Mark Tinguely
2012-11-26 22:12 ` Ben Myers
2012-11-26 23:00 ` Dave Chinner
2012-11-23 3:24 ` [PATCH 2/3] xfs: fix direct IO nested transaction deadlock Dave Chinner
2012-11-23 8:31 ` Christoph Hellwig
2012-11-26 15:54 ` Mark Tinguely
2012-11-26 21:45 ` Dave Chinner
2012-11-27 22:46 ` Mark Tinguely
2012-11-28 0:14 ` Dave Chinner
2012-11-23 3:24 ` [PATCH 3/3] xfs: byte range granularity for XFS_IOC_ZERO_RANGE Dave Chinner
2012-11-23 8:34 ` Christoph Hellwig
2012-11-23 8:59 ` Dave Chinner