public inbox for linux-xfs@vger.kernel.org
* [PATCH 0/7] Delayed write metadata writeback V3
@ 2010-01-25  6:22 Dave Chinner
  2010-01-25  6:22 ` [PATCH 1/7] xfs: Make inode reclaim states explicit Dave Chinner
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  6:22 UTC (permalink / raw)
  To: xfs

(a.k.a. kill async inode writeback V3)

While I started with killing async inode writeback, the series has
grown. It's not really limited to inode writeback - it touches dquot
flushing, changes the way the AIL pushes on buffers, adds xfsbufd
sorting for delayed write buffers, adds a real non-blocking mode to
inode reclaim and avoids physical inode writeback from the VFS while
fixing bugs in handling delayed write inodes.  Hence this is more
about enabling efficient delayed write metadata than it is about
killing async inode writeback.

The idea behind this series is to make metadata buffers get
written from xfsbufd via the delayed write queue rather than being
issued asynchronously from all over the place. To do this, async
buffer writeback is almost entirely removed from XFS, replaced
instead by delayed writes and a method to expedite flushing of
delayed write buffers when required.

The result of funnelling all the buffer IO into a single place
is that we can more tightly control and therefore optimise the
submission of metadata IO. Aggregating the buffers before dispatch
allows much better sort efficiency of the buffers as the sort window
is not limited to the size of the elevator congestion hysteresis
limit. Hence we can approach 100% merge efficiency on large numbers
of buffers when dispatched for IO and greatly reduce the amount
of seeking metadata writeback causes.

The major change is to the inode flushing and reclaim code. Delayed
write inodes hold the flush lock for much longer than for async
writeback, and hence blocking on the flush lock can cause extremely
long latencies without other mechanisms to expedite the release of
the flush locks. To prevent needing to flush inodes immediately,
all operations are done non-blocking unless synchronous. This
required a significant rework of the inode reclaim code, but it
greatly simplified other pieces of code (e.g. log item pushing).

Version 3:
- rework inode reclaim to:
	- separate it from xfs_iflush return values
	- provide a non-blocking mode for background operation
- apply delwri buffer promotion tricks to dquot flushing
- kill unneeded dquot flushing flags, similar to inode flushing flag
  removal
- fix sync inode flush bug when trying to flush delwri inodes

Version 2:
- use generic list sort function
- when unmounting, push the delwri buffers first, then do sync inode
  reclaim so that reclaim doesn't block for 15 seconds waiting for
  delwri inode buffers to be aged and written before the inodes can
  be reclaimed.

Performance numbers for this version are the same as V2, which were
as follows:

Perf results (average of 3 runs) on a debug XFS build (which means
allocation patterns are randomly varied, so runtimes are also a bit
variable):

Untar 2.6.32 kernel tarball, sync, then remove:

                  Untar+sync     rm -rf
xfs-dev:           25.2s          13.0s
xfs-dev-delwri-1:  22.5s           9.1s
xfs-dev-delwri-2:  21.9s           8.4s

4 processes each creating 100,000 five-byte files in separate
directories concurrently, then 4 processes removing a directory each
concurrently.

                  create          rm -rf
xfs-dev:           8m32s           4m10s
xfs-dev-delwri-1:  4m55s           3m42s
xfs-dev-delwri-2:  4m56s           3m33s


The patch series (plus a couple of previous bug fixes) is
available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/dgc/xfs for-2.6.34

Dave Chinner (9):
      xfs: don't hold onto reserved blocks on remount,ro
      xfs: turn off sign warnings
      xfs: Make inode reclaim states explicit
      xfs: Use delayed write for inodes rather than async
      xfs: Don't issue buffer IO direct from AIL push
      xfs: Sort delayed write buffers before dispatch
      xfs: Use delay write promotion for dquot flushing
      xfs: kill the unused XFS_QMOPT_* flush flags
      xfs: xfs_fs_write_inode() can fail to write inodes synchronously

 fs/xfs/Makefile               |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c    |  117 +++++++++++++++++++++++++++++---------
 fs/xfs/linux-2.6/xfs_buf.h    |    2 +
 fs/xfs/linux-2.6/xfs_super.c  |   72 ++++++++++++++++++------
 fs/xfs/linux-2.6/xfs_sync.c   |  124 ++++++++++++++++++++++++++++++++--------
 fs/xfs/linux-2.6/xfs_trace.h  |    1 +
 fs/xfs/quota/xfs_dquot.c      |   38 +++++-------
 fs/xfs/quota/xfs_dquot_item.c |   87 ++++------------------------
 fs/xfs/quota/xfs_dquot_item.h |    4 -
 fs/xfs/quota/xfs_qm.c         |   14 ++---
 fs/xfs/xfs_buf_item.c         |   64 ++++++++++++----------
 fs/xfs/xfs_inode.c            |   86 ++--------------------------
 fs/xfs/xfs_inode.h            |   11 +---
 fs/xfs/xfs_inode_item.c       |  108 +++++++----------------------------
 fs/xfs/xfs_inode_item.h       |    6 --
 fs/xfs/xfs_mount.c            |   13 ++++-
 fs/xfs/xfs_mount.h            |    1 +
 fs/xfs/xfs_quota.h            |    8 +--
 fs/xfs/xfs_trans_ail.c        |    7 ++
 19 files changed, 367 insertions(+), 398 deletions(-)


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/7] xfs: Make inode reclaim states explicit
  2010-01-25  6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
@ 2010-01-25  6:22 ` Dave Chinner
  2010-01-25 11:47   ` Christoph Hellwig
  2010-01-25  6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  6:22 UTC (permalink / raw)
  To: xfs

A.K.A.: don't rely on xfs_iflush() return value in reclaim

We have gradually been moving checks out of the reclaim code because
they are duplicated in xfs_iflush(). We've had a history of problems
in this area, and many of them stem from the overloading of the
return values from xfs_iflush() and its interaction with inode flush
locking to determine if the inode is safe to reclaim.

With the desire to move to delayed write flushing of inodes and
non-blocking inode tree reclaim walks, the overloading of the
return value of xfs_iflush() makes it very difficult to determine
the correct thing to do next.

This patch explicitly re-adds the checks to the inode reclaim code,
removing the reliance on the return value of xfs_iflush() to
determine what to do next. It also means that we can clearly
document all the inode states that reclaim must handle and hence
we can easily see that we handled all the necessary cases.

This also removes the need for the xfs_inode_clean() check in
xfs_iflush() as all callers now check this first (safely).

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   80 ++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_inode.c          |   11 +-----
 fs/xfs/xfs_inode.h          |    1 +
 3 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index c9b863e..98b8937 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -706,12 +706,43 @@ __xfs_inode_clear_reclaim_tag(
 			XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
 }
 
+/*
+ * Inodes in different states need to be treated differently, and the return
+ * value of xfs_iflush is not sufficient to get this right. The following table
+ * lists the inode states and the reclaim actions necessary for non-blocking
+ * reclaim:
+ *
+ *
+ *	inode state	     iflush ret		required action
+ *      ---------------      ----------         ---------------
+ *	bad			-		reclaim
+ *	shutdown		EIO		unpin and reclaim
+ *	clean, unpinned		0		reclaim
+ *	stale, unpinned		0		reclaim
+ *	clean, pinned(*)	0		unpin and reclaim
+ *	stale, pinned		0		unpin and reclaim
+ *	dirty, async		0		block on flush lock, reclaim
+ *	dirty, sync flush	0		block on flush lock, reclaim
+ *
+ * (*) dgc: I don't think the clean, pinned state is possible but it gets
+ * handled anyway given the order of checks implemented.
+ *
+ * Hence the order of actions after gaining the locks should be:
+ *	bad		=> reclaim
+ *	shutdown	=> unpin and reclaim
+ *	pinned		=> unpin
+ *	stale		=> reclaim
+ *	clean		=> reclaim
+ *	dirty		=> flush, wait and reclaim
+ */
 STATIC int
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
+	int	error;
+
 	/*
 	 * The radix tree lock here protects a thread in xfs_iget from racing
 	 * with us starting reclaim on the inode.  Once we have the
@@ -729,30 +760,41 @@ xfs_reclaim_inode(
 	spin_unlock(&ip->i_flags_lock);
 	write_unlock(&pag->pag_ici_lock);
 
-	/*
-	 * If the inode is still dirty, then flush it out.  If the inode
-	 * is not in the AIL, then it will be OK to flush it delwri as
-	 * long as xfs_iflush() does not keep any references to the inode.
-	 * We leave that decision up to xfs_iflush() since it has the
-	 * knowledge of whether it's OK to simply do a delwri flush of
-	 * the inode or whether we need to wait until the inode is
-	 * pulled from the AIL.
-	 * We get the flush lock regardless, though, just to make sure
-	 * we don't free it while it is being flushed.
-	 */
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_iflock(ip);
 
-	/*
-	 * In the case of a forced shutdown we rely on xfs_iflush() to
-	 * wait for the inode to be unpinned before returning an error.
-	 */
-	if (!is_bad_inode(VFS_I(ip)) && xfs_iflush(ip, sync_mode) == 0) {
-		/* synchronize with xfs_iflush_done */
-		xfs_iflock(ip);
-		xfs_ifunlock(ip);
+	if (is_bad_inode(VFS_I(ip)))
+		goto reclaim;
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		xfs_iunpin_wait(ip);
+		goto reclaim;
+	}
+	if (xfs_ipincount(ip))
+		xfs_iunpin_wait(ip);
+	if (xfs_iflags_test(ip, XFS_ISTALE))
+		goto reclaim;
+	if (xfs_inode_clean(ip))
+		goto reclaim;
+
+	/* Now we have an inode that needs flushing */
+	error = xfs_iflush(ip, sync_mode);
+	if (!error) {
+		switch(sync_mode) {
+		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
+		case XFS_IFLUSH_DELWRI:
+		case XFS_IFLUSH_ASYNC:
+		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
+		case XFS_IFLUSH_SYNC:
+			/* IO issued, synchronise with IO completion */
+			xfs_iflock(ip);
+		default:
+			ASSERT(0);
+			break;
+		}
 	}
 
+reclaim:
+	xfs_ifunlock(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_ireclaim(ip);
 	return 0;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d0d1b5a..8d0666d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2493,7 +2493,7 @@ __xfs_iunpin_wait(
 		wait_event(ip->i_ipin_wait, (atomic_read(&ip->i_pincount) == 0));
 }
 
-static inline void
+void
 xfs_iunpin_wait(
 	xfs_inode_t	*ip)
 {
@@ -2849,15 +2849,6 @@ xfs_iflush(
 	mp = ip->i_mount;
 
 	/*
-	 * If the inode isn't dirty, then just release the inode flush lock and
-	 * do nothing.
-	 */
-	if (xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		return 0;
-	}
-
-	/*
 	 * We can't flush the inode until it is unpinned, so wait for it if we
 	 * are allowed to block.  We know noone new can pin it, because we are
 	 * holding the inode lock shared and you need to hold it exclusively to
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ec1f28c..8b618ea 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -483,6 +483,7 @@ int		xfs_iunlink(struct xfs_trans *, xfs_inode_t *);
 void		xfs_iext_realloc(xfs_inode_t *, int, int);
 void		xfs_ipin(xfs_inode_t *);
 void		xfs_iunpin(xfs_inode_t *);
+void		xfs_iunpin_wait(xfs_inode_t *);
 int		xfs_iflush(xfs_inode_t *, uint);
 void		xfs_ichgtime(xfs_inode_t *, int);
 void		xfs_lock_inodes(xfs_inode_t **, int, uint);
-- 
1.6.5

* [PATCH 2/7] xfs: Use delayed write for inodes rather than async
  2010-01-25  6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
  2010-01-25  6:22 ` [PATCH 1/7] xfs: Make inode reclaim states explicit Dave Chinner
@ 2010-01-25  6:22 ` Dave Chinner
  2010-01-25 11:53   ` Christoph Hellwig
  2010-01-25 15:29   ` Christoph Hellwig
  2010-01-25  6:22 ` [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push Dave Chinner
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  6:22 UTC (permalink / raw)
  To: xfs

We currently do background inode flushing asynchronously, resulting in
inodes being written in whatever order the background writeback
issues them. Not only that, there are also blocking and non-blocking
asynchronous inode flushes, depending on where the flush comes from.

This patch completely removes asynchronous inode writeback. It
removes all the strange writeback modes and replaces them with
either a synchronous flush or a non-blocking delayed write flush.
That is, inode flushes will only issue IO directly if they are
synchronous, and background flushing may do nothing if the operation
would block (e.g. on a pinned inode or buffer lock).

Delayed write flushes will now result in the inode buffer sitting in
the delwri queue of the buffer cache to be flushed by either an AIL
push or by the xfsbufd timing out the buffer. This will allow
accumulation of dirty inode buffers in memory and allow optimisation
of inode cluster writeback at the xfsbufd level where we have much
greater queue depths than the block layer elevators. We will also
get adjacent inode cluster buffer IO merging for free when a later
patch in the series allows sorting of the delayed write buffers
before dispatch.

This effectively means that any inode that is written back by
background writeback will be seen as flush locked during AIL
pushing, and will result in the buffers being pushed from there.
This writeback path is currently non-optimal, but the next patch
in the series will fix that problem.

A side effect of this delayed write mechanism is that background
inode reclaim will no longer directly flush inodes, nor can it wait
on the flush lock. The result is that inode reclaim must leave the
inode in the reclaimable state until it is clean. Hence attempts to
reclaim a dirty inode in the background will simply skip the inode
until it is clean and this allows other mechanisms (i.e. xfsbufd) to
do more optimal writeback of the dirty buffers. As a result, the
inode reclaim code has been rewritten so that it no longer relies on
the ambiguous return values of xfs_iflush() to determine whether it
is safe to reclaim an inode.

Portions of this patch are derived from a patch by Christoph
Hellwig.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_super.c |    4 +-
 fs/xfs/linux-2.6/xfs_sync.c  |   88 ++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_inode.c           |   75 ++---------------------------------
 fs/xfs/xfs_inode.h           |   10 -----
 fs/xfs/xfs_inode_item.c      |   10 +++-
 fs/xfs/xfs_mount.c           |   13 ++++++-
 6 files changed, 86 insertions(+), 114 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 67001ec..5ed0468 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1064,7 +1064,7 @@ xfs_fs_write_inode(
 		xfs_ilock(ip, XFS_ILOCK_SHARED);
 		xfs_iflock(ip);
 
-		error = xfs_iflush(ip, XFS_IFLUSH_SYNC);
+		error = xfs_iflush(ip, SYNC_WAIT);
 	} else {
 		error = EAGAIN;
 		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
@@ -1072,7 +1072,7 @@ xfs_fs_write_inode(
 		if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
 			goto out_unlock;
 
-		error = xfs_iflush(ip, XFS_IFLUSH_ASYNC_NOBLOCK);
+		error = xfs_iflush(ip, 0);
 	}
 
  out_unlock:
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 98b8937..ca0cc59 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -270,8 +270,7 @@ xfs_sync_inode_attr(
 		goto out_unlock;
 	}
 
-	error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
-			   XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
+	error = xfs_iflush(ip, (flags & SYNC_WAIT));
 
  out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
@@ -460,16 +459,18 @@ xfs_quiesce_fs(
 {
 	int	count = 0, pincount;
 
+	xfs_reclaim_inodes(mp, 0);
 	xfs_flush_buftarg(mp->m_ddev_targp, 0);
-	xfs_reclaim_inodes(mp, XFS_IFLUSH_DELWRI_ELSE_ASYNC);
 
 	/*
 	 * This loop must run at least twice.  The first instance of the loop
 	 * will flush most meta data but that will generate more meta data
 	 * (typically directory updates).  Which then must be flushed and
-	 * logged before we can write the unmount record.
+	 * logged before we can write the unmount record. We also do sync
+	 * reclaim of inodes to catch any that the above delwri flush skipped.
 	 */
 	do {
+		xfs_reclaim_inodes(mp, SYNC_WAIT);
 		xfs_sync_attr(mp, SYNC_WAIT);
 		pincount = xfs_flush_buftarg(mp->m_ddev_targp, 1);
 		if (!pincount) {
@@ -585,7 +586,7 @@ xfs_sync_worker(
 
 	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		xfs_log_force(mp, 0);
-		xfs_reclaim_inodes(mp, XFS_IFLUSH_DELWRI_ELSE_ASYNC);
+		xfs_reclaim_inodes(mp, 0);
 		/* dgc: errors ignored here */
 		error = xfs_qm_sync(mp, SYNC_TRYLOCK);
 		error = xfs_sync_fsdata(mp, SYNC_TRYLOCK);
@@ -719,21 +720,42 @@ __xfs_inode_clear_reclaim_tag(
  *	shutdown		EIO		unpin and reclaim
  *	clean, unpinned		0		reclaim
  *	stale, unpinned		0		reclaim
- *	clean, pinned(*)	0		unpin and reclaim
- *	stale, pinned		0		unpin and reclaim
- *	dirty, async		0		block on flush lock, reclaim
- *	dirty, sync flush	0		block on flush lock, reclaim
+ *	clean, pinned(*)	0		requeue
+ *	stale, pinned		EAGAIN		requeue
+ *	dirty, delwri ok	0		requeue
+ *	dirty, delwri blocked	EAGAIN		requeue
+ *	dirty, sync flush	0		reclaim
  *
  * (*) dgc: I don't think the clean, pinned state is possible but it gets
  * handled anyway given the order of checks implemented.
  *
+ * As can be seen from the table, the return value of xfs_iflush() is not
+ * sufficient to correctly decide the reclaim action here. The checks in
+ * xfs_iflush() might look like duplicates, but they are not.
+ *
+ * Also, because we get the flush lock first, we know that any inode that has
+ * been flushed delwri has had the flush completed by the time we check that
+ * the inode is clean. The clean inode check needs to be done before flushing
+ * the inode delwri otherwise we would loop forever requeuing clean inodes as
+ * we cannot tell apart a successful delwri flush and a clean inode from the
+ * return value of xfs_iflush().
+ *
+ * Note that because the inode is flushed delayed write by background
+ * writeback, the flush lock may already be held here and waiting on it can
+ * result in very long latencies. Hence for sync reclaims, where we wait on the
+ * flush lock, the caller should push out delayed write inodes first before
+ * trying to reclaim them to minimise the amount of time spent waiting. For
+ * background reclaim, we just requeue the inode for the next pass.
+ *
  * Hence the order of actions after gaining the locks should be:
  *	bad		=> reclaim
  *	shutdown	=> unpin and reclaim
- *	pinned		=> unpin
+ *	pinned, delwri	=> requeue
+ *	pinned, sync	=> unpin
  *	stale		=> reclaim
  *	clean		=> reclaim
- *	dirty		=> flush, wait and reclaim
+ *	dirty, delwri	=> flush and requeue
+ *	dirty, sync	=> flush, wait and reclaim
  */
 STATIC int
 xfs_reclaim_inode(
@@ -741,7 +763,7 @@ xfs_reclaim_inode(
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
-	int	error;
+	int	error = 0;
 
 	/*
 	 * The radix tree lock here protects a thread in xfs_iget from racing
@@ -761,7 +783,11 @@ xfs_reclaim_inode(
 	write_unlock(&pag->pag_ici_lock);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_iflock(ip);
+	if (!xfs_iflock_nowait(ip)) {
+		if (!(sync_mode & SYNC_WAIT))
+			goto requeue_no_flock;
+		xfs_iflock(ip);
+	}
 
 	if (is_bad_inode(VFS_I(ip)))
 		goto reclaim;
@@ -769,8 +795,11 @@ xfs_reclaim_inode(
 		xfs_iunpin_wait(ip);
 		goto reclaim;
 	}
-	if (xfs_ipincount(ip))
+	if (xfs_ipincount(ip)) {
+		if (!(sync_mode & SYNC_WAIT))
+			goto requeue;
 		xfs_iunpin_wait(ip);
+	}
 	if (xfs_iflags_test(ip, XFS_ISTALE))
 		goto reclaim;
 	if (xfs_inode_clean(ip))
@@ -778,25 +807,28 @@ xfs_reclaim_inode(
 
 	/* Now we have an inode that needs flushing */
 	error = xfs_iflush(ip, sync_mode);
-	if (!error) {
-		switch(sync_mode) {
-		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
-		case XFS_IFLUSH_DELWRI:
-		case XFS_IFLUSH_ASYNC:
-		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
-		case XFS_IFLUSH_SYNC:
-			/* IO issued, synchronise with IO completion */
-			xfs_iflock(ip);
-		default:
-			ASSERT(0);
-			break;
-		}
-	}
+	if (!(sync_mode & SYNC_WAIT))
+		goto requeue_no_flock;
+	xfs_iflock(ip);
 
 reclaim:
 	xfs_ifunlock(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_ireclaim(ip);
+	return error;
+
+requeue:
+	xfs_ifunlock(ip);
+requeue_no_flock:
+	xfs_iflags_clear(ip, XFS_IRECLAIM);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	/*
+	 * We could return EAGAIN here to make reclaim rescan the inode tree in
+	 * a short while. However, this just burns CPU time scanning the tree
+	 * waiting for IO to complete and xfssyncd never goes back to the idle
+	 * state. Instead, return 0 to let the next scheduled background reclaim
+	 * attempt to reclaim the inode again.
+	 */
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8d0666d..fa31360 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2835,8 +2835,6 @@ xfs_iflush(
 	xfs_dinode_t		*dip;
 	xfs_mount_t		*mp;
 	int			error;
-	int			noblock = (flags == XFS_IFLUSH_ASYNC_NOBLOCK);
-	enum { INT_DELWRI = (1 << 0), INT_ASYNC = (1 << 1) };
 
 	XFS_STATS_INC(xs_iflush_count);
 
@@ -2859,7 +2857,7 @@ xfs_iflush(
 	 * in the same cluster are dirty, they will probably write the inode
 	 * out for us if they occur after the log force completes.
 	 */
-	if (noblock && xfs_ipincount(ip)) {
+	if (!(flags & SYNC_WAIT) && xfs_ipincount(ip)) {
 		xfs_iunpin_nowait(ip);
 		xfs_ifunlock(ip);
 		return EAGAIN;
@@ -2893,60 +2891,10 @@ xfs_iflush(
 	}
 
 	/*
-	 * Decide how buffer will be flushed out.  This is done before
-	 * the call to xfs_iflush_int because this field is zeroed by it.
-	 */
-	if (iip != NULL && iip->ili_format.ilf_fields != 0) {
-		/*
-		 * Flush out the inode buffer according to the directions
-		 * of the caller.  In the cases where the caller has given
-		 * us a choice choose the non-delwri case.  This is because
-		 * the inode is in the AIL and we need to get it out soon.
-		 */
-		switch (flags) {
-		case XFS_IFLUSH_SYNC:
-		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
-			flags = 0;
-			break;
-		case XFS_IFLUSH_ASYNC_NOBLOCK:
-		case XFS_IFLUSH_ASYNC:
-		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
-			flags = INT_ASYNC;
-			break;
-		case XFS_IFLUSH_DELWRI:
-			flags = INT_DELWRI;
-			break;
-		default:
-			ASSERT(0);
-			flags = 0;
-			break;
-		}
-	} else {
-		switch (flags) {
-		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
-		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
-		case XFS_IFLUSH_DELWRI:
-			flags = INT_DELWRI;
-			break;
-		case XFS_IFLUSH_ASYNC_NOBLOCK:
-		case XFS_IFLUSH_ASYNC:
-			flags = INT_ASYNC;
-			break;
-		case XFS_IFLUSH_SYNC:
-			flags = 0;
-			break;
-		default:
-			ASSERT(0);
-			flags = 0;
-			break;
-		}
-	}
-
-	/*
 	 * Get the buffer containing the on-disk inode.
 	 */
 	error = xfs_itobp(mp, NULL, ip, &dip, &bp,
-				noblock ? XBF_TRYLOCK : XBF_LOCK);
+				(flags & SYNC_WAIT) ? XBF_LOCK : XBF_TRYLOCK);
 	if (error || !bp) {
 		xfs_ifunlock(ip);
 		return error;
@@ -2974,13 +2922,10 @@ xfs_iflush(
 	if (error)
 		goto cluster_corrupt_out;
 
-	if (flags & INT_DELWRI) {
-		xfs_bdwrite(mp, bp);
-	} else if (flags & INT_ASYNC) {
-		error = xfs_bawrite(mp, bp);
-	} else {
+	if (flags & SYNC_WAIT)
 		error = xfs_bwrite(mp, bp);
-	}
+	else
+		xfs_bdwrite(mp, bp);
 	return error;
 
 corrupt_out:
@@ -3015,16 +2960,6 @@ xfs_iflush_int(
 	iip = ip->i_itemp;
 	mp = ip->i_mount;
 
-
-	/*
-	 * If the inode isn't dirty, then just release the inode
-	 * flush lock and do nothing.
-	 */
-	if (xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		return 0;
-	}
-
 	/* set *dip = inode's place in the buffer */
 	dip = (xfs_dinode_t *)xfs_buf_offset(bp, ip->i_imap.im_boffset);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 8b618ea..6c912b0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -420,16 +420,6 @@ static inline void xfs_ifunlock(xfs_inode_t *ip)
 #define XFS_ILOCK_DEP(flags)	(((flags) & XFS_ILOCK_DEP_MASK) >> XFS_ILOCK_SHIFT)
 
 /*
- * Flags for xfs_iflush()
- */
-#define	XFS_IFLUSH_DELWRI_ELSE_SYNC	1
-#define	XFS_IFLUSH_DELWRI_ELSE_ASYNC	2
-#define	XFS_IFLUSH_SYNC			3
-#define	XFS_IFLUSH_ASYNC		4
-#define	XFS_IFLUSH_DELWRI		5
-#define	XFS_IFLUSH_ASYNC_NOBLOCK	6
-
-/*
  * Flags for xfs_itruncate_start().
  */
 #define	XFS_ITRUNC_DEFINITE	0x1
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 48ec1c0..9ec41e6 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -866,10 +866,14 @@ xfs_inode_item_push(
 	       iip->ili_format.ilf_fields != 0);
 
 	/*
-	 * Write out the inode.  The completion routine ('iflush_done') will
-	 * pull it from the AIL, mark it clean, unlock the flush lock.
+	 * Push the inode to its backing buffer. This will not remove
+	 * the inode from the AIL - a further push will be required to trigger
+	 * a buffer push. However, this allows all the dirty inodes to be pushed to
+	 * the buffer before it is pushed to disk. The buffer IO completion
+	 * will pull the inode from the AIL, mark it clean and unlock the flush
+	 * lock.
 	 */
-	(void) xfs_iflush(ip, XFS_IFLUSH_ASYNC);
+	(void) xfs_iflush(ip, 0);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	return;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 7f81ed7..9a036f4 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1456,7 +1456,18 @@ xfs_unmountfs(
 	 * need to force the log first.
 	 */
 	xfs_log_force(mp, XFS_LOG_SYNC);
-	xfs_reclaim_inodes(mp, XFS_IFLUSH_ASYNC);
+
+	/*
+	 * Do a delwri reclaim pass first so that as many dirty inodes are
+	 * queued up for IO as possible. Then flush the buffers before making
+	 * a synchronous reclaim pass to catch all the remaining inodes.
+	 * This makes the reclaim process as quick as possible by avoiding
+	 * synchronous writeout and blocking on inodes already in the delwri
+	 * state as much as possible.
+	 */
+	xfs_reclaim_inodes(mp, 0);
+	XFS_bflush(mp->m_ddev_targp);
+	xfs_reclaim_inodes(mp, SYNC_WAIT);
 
 	xfs_qm_unmount(mp);
 
-- 
1.6.5

* [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push
  2010-01-25  6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
  2010-01-25  6:22 ` [PATCH 1/7] xfs: Make inode reclaim states explicit Dave Chinner
  2010-01-25  6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
@ 2010-01-25  6:22 ` Dave Chinner
  2010-01-25 11:59   ` Christoph Hellwig
  2010-01-25  6:22 ` [PATCH 4/7] xfs: Sort delayed write buffers before dispatch Dave Chinner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  6:22 UTC (permalink / raw)
  To: xfs

All buffers logged into the AIL are marked as delayed write.
When the AIL needs to push the buffer out, it issues an async write of the
buffer. This means that IO patterns are dependent on the order of
buffers in the AIL.

Instead of flushing the buffer, promote the buffer in the delayed
write list so that the next time the xfsbufd is run the buffer will
be flushed by the xfsbufd. Return state to the xfsaild indicating
that the buffer was promoted, so the xfsaild knows it needs to kick
the xfsbufd into running to flush the promoted buffers.

Using the xfsbufd for issuing the IO allows us to dispatch all
buffer IO from the one queue. This means that we can make much more
enlightened decisions on what order to flush buffers to disk as
we don't have multiple places issuing IO. Optimisations to xfsbufd
will be in a future patch.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_buf.c    |   29 ++++++++++++
 fs/xfs/linux-2.6/xfs_buf.h    |    2 +
 fs/xfs/linux-2.6/xfs_trace.h  |    1 +
 fs/xfs/quota/xfs_dquot_item.c |   85 +++++------------------------------
 fs/xfs/quota/xfs_dquot_item.h |    4 --
 fs/xfs/xfs_buf_item.c         |   64 +++++++++++++++------------
 fs/xfs/xfs_inode_item.c       |   98 ++++++----------------------------------
 fs/xfs/xfs_inode_item.h       |    6 ---
 fs/xfs/xfs_trans_ail.c        |    7 +++
 9 files changed, 101 insertions(+), 195 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 44e20e5..b306265 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1778,6 +1778,35 @@ xfs_buf_delwri_dequeue(
 	trace_xfs_buf_delwri_dequeue(bp, _RET_IP_);
 }
 
+/*
+ * If a delwri buffer needs to be pushed before it has aged out, then promote
+ * it to the head of the delwri queue so that it will be flushed on the next
+ * xfsbufd run. We do this by resetting the queuetime of the buffer to be older
+ * than the age currently needed to flush the buffer. Hence the next time the
+ * xfsbufd sees it is guaranteed to be considered old enough to flush.
+ */
+void
+xfs_buf_delwri_promote(
+	struct xfs_buf	*bp)
+{
+	struct xfs_buftarg *btp = bp->b_target;
+	long		age = xfs_buf_age_centisecs * msecs_to_jiffies(10) + 1;
+
+	ASSERT(bp->b_flags & XBF_DELWRI);
+	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
+
+	/*
+	 * Check the buffer age before locking the delayed write queue as we
+	 * don't need to promote buffers that are already past the flush age.
+	 */
+	if (bp->b_queuetime < jiffies - age)
+		return;
+	bp->b_queuetime = jiffies - age;
+	spin_lock(&btp->bt_delwrite_lock);
+	list_move(&bp->b_list, &btp->bt_delwrite_queue);
+	spin_unlock(&btp->bt_delwrite_lock);
+}
+
 STATIC void
 xfs_buf_runall_queues(
 	struct workqueue_struct	*queue)
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index ea8c198..be45e8c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -266,6 +266,7 @@ extern int xfs_buf_ispin(xfs_buf_t *);
 
 /* Delayed Write Buffer Routines */
 extern void xfs_buf_delwri_dequeue(xfs_buf_t *);
+extern void xfs_buf_delwri_promote(xfs_buf_t *);
 
 /* Buffer Daemon Setup Routines */
 extern int xfs_buf_init(void);
@@ -395,6 +396,7 @@ extern void xfs_free_buftarg(struct xfs_mount *, struct xfs_buftarg *);
 extern void xfs_wait_buftarg(xfs_buftarg_t *);
 extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
 extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
+
 #ifdef CONFIG_KDB_MODULES
 extern struct list_head *xfs_get_buftarg_list(void);
 #endif
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index 1bb09e7..a4574dc 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -483,6 +483,7 @@ DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock_stale);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_committed);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_push);
+DEFINE_BUF_ITEM_EVENT(xfs_buf_item_pushbuf);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf_recur);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_getsb);
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index 1b56437..dda0fb0 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -212,66 +212,31 @@ xfs_qm_dquot_logitem_pushbuf(
 	xfs_dquot_t	*dqp;
 	xfs_mount_t	*mp;
 	xfs_buf_t	*bp;
-	uint		dopush;
 
 	dqp = qip->qli_dquot;
 	ASSERT(XFS_DQ_IS_LOCKED(dqp));
 
 	/*
-	 * The qli_pushbuf_flag keeps others from
-	 * trying to duplicate our effort.
-	 */
-	ASSERT(qip->qli_pushbuf_flag != 0);
-	ASSERT(qip->qli_push_owner == current_pid());
-
-	/*
 	 * If flushlock isn't locked anymore, chances are that the
 	 * inode flush completed and the inode was taken off the AIL.
 	 * So, just get out.
 	 */
 	if (completion_done(&dqp->q_flush)  ||
 	    ((qip->qli_item.li_flags & XFS_LI_IN_AIL) == 0)) {
-		qip->qli_pushbuf_flag = 0;
 		xfs_dqunlock(dqp);
 		return;
 	}
 	mp = dqp->q_mount;
 	bp = xfs_incore(mp->m_ddev_targp, qip->qli_format.qlf_blkno,
 		    XFS_QI_DQCHUNKLEN(mp), XBF_TRYLOCK);
-	if (bp != NULL) {
-		if (XFS_BUF_ISDELAYWRITE(bp)) {
-			dopush = ((qip->qli_item.li_flags & XFS_LI_IN_AIL) &&
-				  !completion_done(&dqp->q_flush));
-			qip->qli_pushbuf_flag = 0;
-			xfs_dqunlock(dqp);
-
-			if (XFS_BUF_ISPINNED(bp))
-				xfs_log_force(mp, 0);
-
-			if (dopush) {
-				int	error;
-#ifdef XFSRACEDEBUG
-				delay_for_intr();
-				delay(300);
-#endif
-				error = xfs_bawrite(mp, bp);
-				if (error)
-					xfs_fs_cmn_err(CE_WARN, mp,
-	"xfs_qm_dquot_logitem_pushbuf: pushbuf error %d on qip %p, bp %p",
-							error, qip, bp);
-			} else {
-				xfs_buf_relse(bp);
-			}
-		} else {
-			qip->qli_pushbuf_flag = 0;
-			xfs_dqunlock(dqp);
-			xfs_buf_relse(bp);
-		}
+	xfs_dqunlock(dqp);
+	if (!bp)
 		return;
-	}
+	if (XFS_BUF_ISDELAYWRITE(bp))
+		xfs_buf_delwri_promote(bp);
+	xfs_buf_relse(bp);
+	return;
 
-	qip->qli_pushbuf_flag = 0;
-	xfs_dqunlock(dqp);
 }
 
 /*
@@ -289,50 +254,24 @@ xfs_qm_dquot_logitem_trylock(
 	xfs_dq_logitem_t	*qip)
 {
 	xfs_dquot_t		*dqp;
-	uint			retval;
 
 	dqp = qip->qli_dquot;
 	if (atomic_read(&dqp->q_pincount) > 0)
-		return (XFS_ITEM_PINNED);
+		return XFS_ITEM_PINNED;
 
 	if (! xfs_qm_dqlock_nowait(dqp))
-		return (XFS_ITEM_LOCKED);
+		return XFS_ITEM_LOCKED;
 
-	retval = XFS_ITEM_SUCCESS;
 	if (!xfs_dqflock_nowait(dqp)) {
 		/*
-		 * The dquot is already being flushed.	It may have been
-		 * flushed delayed write, however, and we don't want to
-		 * get stuck waiting for that to complete.  So, we want to check
-		 * to see if we can lock the dquot's buffer without sleeping.
-		 * If we can and it is marked for delayed write, then we
-		 * hold it and send it out from the push routine.  We don't
-		 * want to do that now since we might sleep in the device
-		 * strategy routine.  We also don't want to grab the buffer lock
-		 * here because we'd like not to call into the buffer cache
-		 * while holding the AIL lock.
-		 * Make sure to only return PUSHBUF if we set pushbuf_flag
-		 * ourselves.  If someone else is doing it then we don't
-		 * want to go to the push routine and duplicate their efforts.
+		 * dquot has already been flushed to the backing buffer,
+		 * leave it locked, pushbuf routine will unlock it.
 		 */
-		if (qip->qli_pushbuf_flag == 0) {
-			qip->qli_pushbuf_flag = 1;
-			ASSERT(qip->qli_format.qlf_blkno == dqp->q_blkno);
-#ifdef DEBUG
-			qip->qli_push_owner = current_pid();
-#endif
-			/*
-			 * The dquot is left locked.
-			 */
-			retval = XFS_ITEM_PUSHBUF;
-		} else {
-			retval = XFS_ITEM_FLUSHING;
-			xfs_dqunlock_nonotify(dqp);
-		}
+		return XFS_ITEM_PUSHBUF;
 	}
 
 	ASSERT(qip->qli_item.li_flags & XFS_LI_IN_AIL);
-	return (retval);
+	return XFS_ITEM_SUCCESS;
 }
 
 
diff --git a/fs/xfs/quota/xfs_dquot_item.h b/fs/xfs/quota/xfs_dquot_item.h
index 5a63253..5acae2a 100644
--- a/fs/xfs/quota/xfs_dquot_item.h
+++ b/fs/xfs/quota/xfs_dquot_item.h
@@ -27,10 +27,6 @@ typedef struct xfs_dq_logitem {
 	xfs_log_item_t		 qli_item;	   /* common portion */
 	struct xfs_dquot	*qli_dquot;	   /* dquot ptr */
 	xfs_lsn_t		 qli_flush_lsn;	   /* lsn at last flush */
-	unsigned short		 qli_pushbuf_flag; /* 1 bit used in push_ail */
-#ifdef DEBUG
-	uint64_t		 qli_push_owner;
-#endif
 	xfs_dq_logformat_t	 qli_format;	   /* logged structure */
 } xfs_dq_logitem_t;
 
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index e0a1158..f3c49e6 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -467,8 +467,10 @@ xfs_buf_item_unpin_remove(
 /*
  * This is called to attempt to lock the buffer associated with this
  * buf log item.  Don't sleep on the buffer lock.  If we can't get
- * the lock right away, return 0.  If we can get the lock, pull the
- * buffer from the free list, mark it busy, and return 1.
+ * the lock right away, return 0.  If we can get the lock, take a
+ * reference to the buffer. If this is a delayed write buffer that
+ * needs AIL help to be written back, invoke the pushbuf routine
+ * rather than the normal success path.
  */
 STATIC uint
 xfs_buf_item_trylock(
@@ -477,24 +479,18 @@ xfs_buf_item_trylock(
 	xfs_buf_t	*bp;
 
 	bp = bip->bli_buf;
-
-	if (XFS_BUF_ISPINNED(bp)) {
+	if (XFS_BUF_ISPINNED(bp))
 		return XFS_ITEM_PINNED;
-	}
-
-	if (!XFS_BUF_CPSEMA(bp)) {
+	if (!XFS_BUF_CPSEMA(bp))
 		return XFS_ITEM_LOCKED;
-	}
 
-	/*
-	 * Remove the buffer from the free list.  Only do this
-	 * if it's on the free list.  Private buffers like the
-	 * superblock buffer are not.
-	 */
+	/* take a reference to the buffer.  */
 	XFS_BUF_HOLD(bp);
 
 	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
 	trace_xfs_buf_item_trylock(bip);
+	if (XFS_BUF_ISDELAYWRITE(bp))
+		return XFS_ITEM_PUSHBUF;
 	return XFS_ITEM_SUCCESS;
 }
 
@@ -626,11 +622,9 @@ xfs_buf_item_committed(
 }
 
 /*
- * This is called to asynchronously write the buffer associated with this
- * buf log item out to disk. The buffer will already have been locked by
- * a successful call to xfs_buf_item_trylock().  If the buffer still has
- * B_DELWRI set, then get it going out to disk with a call to bawrite().
- * If not, then just release the buffer.
+ * The buffer is locked, but is not a delayed write buffer. This happens
+ * if we race with IO completion and hence we don't want to try to write it
+ * again. Just release the buffer.
  */
 STATIC void
 xfs_buf_item_push(
@@ -642,17 +636,29 @@ xfs_buf_item_push(
 	trace_xfs_buf_item_push(bip);
 
 	bp = bip->bli_buf;
+	ASSERT(!XFS_BUF_ISDELAYWRITE(bp));
+	xfs_buf_relse(bp);
+}
 
-	if (XFS_BUF_ISDELAYWRITE(bp)) {
-		int	error;
-		error = xfs_bawrite(bip->bli_item.li_mountp, bp);
-		if (error)
-			xfs_fs_cmn_err(CE_WARN, bip->bli_item.li_mountp,
-			"xfs_buf_item_push: pushbuf error %d on bip %p, bp %p",
-					error, bip, bp);
-	} else {
-		xfs_buf_relse(bp);
-	}
+/*
+ * The buffer is locked and is a delayed write buffer. Promote the buffer
+ * in the delayed write queue as the caller knows that they must invoke
+ * the xfsbufd to get this buffer written. We have to unlock the buffer
+ * to allow the xfsbufd to write it, too.
+ */
+STATIC void
+xfs_buf_item_pushbuf(
+	xfs_buf_log_item_t	*bip)
+{
+	xfs_buf_t	*bp;
+
+	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
+	trace_xfs_buf_item_pushbuf(bip);
+
+	bp = bip->bli_buf;
+	ASSERT(XFS_BUF_ISDELAYWRITE(bp));
+	xfs_buf_delwri_promote(bp);
+	xfs_buf_relse(bp);
 }
 
 /* ARGSUSED */
@@ -677,7 +683,7 @@ static struct xfs_item_ops xfs_buf_item_ops = {
 	.iop_committed	= (xfs_lsn_t(*)(xfs_log_item_t*, xfs_lsn_t))
 					xfs_buf_item_committed,
 	.iop_push	= (void(*)(xfs_log_item_t*))xfs_buf_item_push,
-	.iop_pushbuf	= NULL,
+	.iop_pushbuf	= (void(*)(xfs_log_item_t*))xfs_buf_item_pushbuf,
 	.iop_committing = (void(*)(xfs_log_item_t*, xfs_lsn_t))
 					xfs_buf_item_committing
 };
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 9ec41e6..25387b0 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -602,33 +602,20 @@ xfs_inode_item_trylock(
 
 	if (!xfs_iflock_nowait(ip)) {
 		/*
-		 * If someone else isn't already trying to push the inode
-		 * buffer, we get to do it.
+		 * inode has already been flushed to the backing buffer,
+		 * leave it locked in shared mode, pushbuf routine will
+		 * unlock it.
 		 */
-		if (iip->ili_pushbuf_flag == 0) {
-			iip->ili_pushbuf_flag = 1;
-#ifdef DEBUG
-			iip->ili_push_owner = current_pid();
-#endif
-			/*
-			 * Inode is left locked in shared mode.
-			 * Pushbuf routine gets to unlock it.
-			 */
-			return XFS_ITEM_PUSHBUF;
-		} else {
-			/*
-			 * We hold the AIL lock, so we must specify the
-			 * NONOTIFY flag so that we won't double trip.
-			 */
-			xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
-			return XFS_ITEM_FLUSHING;
-		}
-		/* NOTREACHED */
+		return XFS_ITEM_PUSHBUF;
 	}
 
 	/* Stale items should force out the iclog */
 	if (ip->i_flags & XFS_ISTALE) {
 		xfs_ifunlock(ip);
+		/*
+		 * we hold the AIL lock - notify the unlock routine of this
+		 * so it doesn't try to get the lock again.
+		 */
 		xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
 		return XFS_ITEM_PINNED;
 	}
@@ -746,11 +733,8 @@ xfs_inode_item_committed(
  * This gets called by xfs_trans_push_ail(), when IOP_TRYLOCK
  * failed to get the inode flush lock but did get the inode locked SHARED.
  * Here we're trying to see if the inode buffer is incore, and if so whether it's
- * marked delayed write. If that's the case, we'll initiate a bawrite on that
- * buffer to expedite the process.
- *
- * We aren't holding the AIL lock (or the flush lock) when this gets called,
- * so it is inherently race-y.
+ * marked delayed write. If that's the case, we'll promote it and that will
+ * allow the caller to write the buffer by triggering the xfsbufd to run.
  */
 STATIC void
 xfs_inode_item_pushbuf(
@@ -759,26 +743,16 @@ xfs_inode_item_pushbuf(
 	xfs_inode_t	*ip;
 	xfs_mount_t	*mp;
 	xfs_buf_t	*bp;
-	uint		dopush;
 
 	ip = iip->ili_inode;
-
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
 
 	/*
-	 * The ili_pushbuf_flag keeps others from
-	 * trying to duplicate our effort.
-	 */
-	ASSERT(iip->ili_pushbuf_flag != 0);
-	ASSERT(iip->ili_push_owner == current_pid());
-
-	/*
 	 * If a flush is not in progress anymore, chances are that the
 	 * inode was taken off the AIL. So, just get out.
 	 */
 	if (completion_done(&ip->i_flush) ||
 	    ((iip->ili_item.li_flags & XFS_LI_IN_AIL) == 0)) {
-		iip->ili_pushbuf_flag = 0;
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		return;
 	}
@@ -787,53 +761,12 @@ xfs_inode_item_pushbuf(
 	bp = xfs_incore(mp->m_ddev_targp, iip->ili_format.ilf_blkno,
 		    iip->ili_format.ilf_len, XBF_TRYLOCK);
 
-	if (bp != NULL) {
-		if (XFS_BUF_ISDELAYWRITE(bp)) {
-			/*
-			 * We were racing with iflush because we don't hold
-			 * the AIL lock or the flush lock. However, at this point,
-			 * we have the buffer, and we know that it's dirty.
-			 * So, it's possible that iflush raced with us, and
-			 * this item is already taken off the AIL.
-			 * If not, we can flush it async.
-			 */
-			dopush = ((iip->ili_item.li_flags & XFS_LI_IN_AIL) &&
-				  !completion_done(&ip->i_flush));
-			iip->ili_pushbuf_flag = 0;
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
-			trace_xfs_inode_item_push(bp, _RET_IP_);
-
-			if (XFS_BUF_ISPINNED(bp))
-				xfs_log_force(mp, 0);
-
-			if (dopush) {
-				int	error;
-				error = xfs_bawrite(mp, bp);
-				if (error)
-					xfs_fs_cmn_err(CE_WARN, mp,
-		"xfs_inode_item_pushbuf: pushbuf error %d on iip %p, bp %p",
-							error, iip, bp);
-			} else {
-				xfs_buf_relse(bp);
-			}
-		} else {
-			iip->ili_pushbuf_flag = 0;
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-			xfs_buf_relse(bp);
-		}
-		return;
-	}
-	/*
-	 * We have to be careful about resetting pushbuf flag too early (above).
-	 * Even though in theory we can do it as soon as we have the buflock,
-	 * we don't want others to be doing work needlessly. They'll come to
-	 * this function thinking that pushing the buffer is their
-	 * responsibility only to find that the buffer is still locked by
-	 * another doing the same thing
-	 */
-	iip->ili_pushbuf_flag = 0;
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	if (!bp)
+		return;
+	if (XFS_BUF_ISDELAYWRITE(bp))
+		xfs_buf_delwri_promote(bp);
+	xfs_buf_relse(bp);
 	return;
 }
 
@@ -937,7 +870,6 @@ xfs_inode_item_init(
 	/*
 	   We have zeroed memory. No need ...
 	   iip->ili_extents_buf = NULL;
-	   iip->ili_pushbuf_flag = 0;
 	 */
 
 	iip->ili_format.ilf_type = XFS_LI_INODE;
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index cc8df1a..9a46795 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -144,12 +144,6 @@ typedef struct xfs_inode_log_item {
 						      data exts */
 	struct xfs_bmbt_rec	*ili_aextents_buf; /* array of logged
 						      attr exts */
-	unsigned int            ili_pushbuf_flag;  /* one bit used in push_ail */
-
-#ifdef DEBUG
-	uint64_t                ili_push_owner;    /* one who sets pushbuf_flag
-						      above gets to push the buf */
-#endif
 #ifdef XFS_TRANS_DEBUG
 	int			ili_root_size;
 	char			*ili_orig_root;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index d7b1af8..760ceeb 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -253,6 +253,7 @@ xfsaild_push(
 	int		flush_log, count, stuck;
 	xfs_mount_t	*mp = ailp->xa_mount;
 	struct xfs_ail_cursor	*cur = &ailp->xa_cursors;
+	int		push_xfsbufd = 0;
 
 	spin_lock(&ailp->xa_lock);
 	xfs_trans_ail_cursor_init(ailp, cur);
@@ -308,6 +309,7 @@ xfsaild_push(
 			XFS_STATS_INC(xs_push_ail_pushbuf);
 			IOP_PUSHBUF(lip);
 			last_pushed_lsn = lsn;
+			push_xfsbufd = 1;
 			break;
 
 		case XFS_ITEM_PINNED:
@@ -374,6 +376,11 @@ xfsaild_push(
 		xfs_log_force(mp, 0);
 	}
 
+	if (push_xfsbufd) {
+		/* we've got delayed write buffers to flush */
+		wake_up_process(mp->m_ddev_targp->bt_task);
+	}
+
 	if (!count) {
 		/* We're past our target or empty, so idle */
 		last_pushed_lsn = 0;
-- 
1.6.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/7] xfs: Sort delayed write buffers before dispatch
  2010-01-25  6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
                   ` (2 preceding siblings ...)
  2010-01-25  6:22 ` [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push Dave Chinner
@ 2010-01-25  6:22 ` Dave Chinner
  2010-01-25 12:00   ` Christoph Hellwig
  2010-01-25  6:22 ` [PATCH 5/7] xfs: Use delay write promotion for dquot flushing Dave Chinner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  6:22 UTC (permalink / raw)
  To: xfs

Currently when the xfsbufd writes delayed write buffers, it pushes
them to disk in the order they come off the delayed write list. If
there are lots of buffers spread widely over the disk, this results
in overwhelming the elevator sort queues in the block layer and we
end up losing the possibility of merging adjacent buffers to minimise
the number of IOs.

Use the new generic list_sort function to sort the delwri dispatch
queue before issue to ensure that the buffers are pushed in the most
friendly order possible to the lower layers.
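The compare callback that drives such a sort has to reduce a 64-bit
block-number difference to a 32-bit `int` return value, which is why the
patch below checks the sign explicitly instead of returning the raw
difference. A stand-alone user-space sketch (with an invented
`daddr64_t` type, not the kernel's `xfs_daddr_t`) shows why naive
truncation is unsafe:

```c
#include <stdint.h>

/* Stand-in for a 64-bit disk address type; illustration only. */
typedef int64_t daddr64_t;

/* Broken: truncating the 64-bit difference to int loses the sign
 * whenever the low 32 bits of the difference happen to be zero. */
static int cmp_truncating(daddr64_t a, daddr64_t b)
{
	return (int)(a - b);
}

/* Correct: reduce the difference to -1/0/1 before narrowing. */
static int cmp_safe(daddr64_t a, daddr64_t b)
{
	daddr64_t diff = a - b;

	if (diff < 0)
		return -1;
	if (diff > 0)
		return 1;
	return 0;
}
```

With `a = 0` and `b = 1LL << 32`, the truncating version claims the two
addresses are equal while the safe version orders them correctly; a sort
driven by the broken compare would interleave buffers and defeat the
elevator merging this patch is after.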

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |   88 ++++++++++++++++++++++++++++++-------------
 1 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b306265..a53afb3 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -33,6 +33,7 @@
 #include <linux/migrate.h>
 #include <linux/backing-dev.h>
 #include <linux/freezer.h>
+#include <linux/list_sort.h>
 
 #include "xfs_sb.h"
 #include "xfs_inum.h"
@@ -1877,14 +1878,42 @@ xfs_buf_delwri_split(
 
 }
 
+/*
+ * Compare function is more complex than it needs to be because
+ * the return value is only 32 bits and we are doing comparisons
+ * on 64 bit values
+ */
+static int
+xfs_buf_cmp(
+	void		*priv,
+	struct list_head *a,
+	struct list_head *b)
+{
+	struct xfs_buf	*ap = container_of(a, struct xfs_buf, b_list);
+	struct xfs_buf	*bp = container_of(b, struct xfs_buf, b_list);
+	xfs_daddr_t		diff;
+
+	diff = ap->b_bn - bp->b_bn;
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
+void
+xfs_buf_delwri_sort(
+	xfs_buftarg_t	*target,
+	struct list_head *list)
+{
+	list_sort(NULL, list, xfs_buf_cmp);
+}
+
 STATIC int
 xfsbufd(
 	void		*data)
 {
-	struct list_head tmp;
-	xfs_buftarg_t	*target = (xfs_buftarg_t *)data;
-	int		count;
-	xfs_buf_t	*bp;
+	xfs_buftarg_t   *target = (xfs_buftarg_t *)data;
 
 	current->flags |= PF_MEMALLOC;
 
@@ -1893,6 +1922,8 @@ xfsbufd(
 	do {
 		long	age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
 		long	tout = xfs_buf_timer_centisecs * msecs_to_jiffies(10);
+		int	count = 0;
+		struct list_head tmp;
 
 		if (unlikely(freezing(current))) {
 			set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
@@ -1907,11 +1938,10 @@ xfsbufd(
 		schedule_timeout_interruptible(tout);
 
 		xfs_buf_delwri_split(target, &tmp, age);
-		count = 0;
+		list_sort(NULL, &tmp, xfs_buf_cmp);
 		while (!list_empty(&tmp)) {
-			bp = list_entry(tmp.next, xfs_buf_t, b_list);
-			ASSERT(target == bp->b_target);
-
+			struct xfs_buf *bp;
+			bp = list_first_entry(&tmp, struct xfs_buf, b_list);
 			list_del_init(&bp->b_list);
 			xfs_buf_iostrategy(bp);
 			count++;
@@ -1937,42 +1967,46 @@ xfs_flush_buftarg(
 	xfs_buftarg_t	*target,
 	int		wait)
 {
-	struct list_head tmp;
-	xfs_buf_t	*bp, *n;
+	xfs_buf_t	*bp;
 	int		pincount = 0;
+	LIST_HEAD(tmp_list);
+	LIST_HEAD(wait_list);
 
 	xfs_buf_runall_queues(xfsconvertd_workqueue);
 	xfs_buf_runall_queues(xfsdatad_workqueue);
 	xfs_buf_runall_queues(xfslogd_workqueue);
 
 	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
-	pincount = xfs_buf_delwri_split(target, &tmp, 0);
+	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
 
 	/*
-	 * Dropped the delayed write list lock, now walk the temporary list
+	 * Dropped the delayed write list lock, now walk the temporary list.
+	 * All I/O is issued async and then if we need to wait for completion
+	 * we do that after issuing all the IO.
 	 */
-	list_for_each_entry_safe(bp, n, &tmp, b_list) {
+	list_sort(NULL, &tmp_list, xfs_buf_cmp);
+	while (!list_empty(&tmp_list)) {
+		struct xfs_buf *bp;
+		bp = list_first_entry(&tmp_list, struct xfs_buf, b_list);
 		ASSERT(target == bp->b_target);
-		if (wait)
+		list_del_init(&bp->b_list);
+		if (wait) {
 			bp->b_flags &= ~XBF_ASYNC;
-		else
-			list_del_init(&bp->b_list);
-
+			list_add(&bp->b_list, &wait_list);
+		}
 		xfs_buf_iostrategy(bp);
 	}
 
-	if (wait)
+	if (wait) {
+		/* Expedite and wait for IO to complete. */
 		blk_run_address_space(target->bt_mapping);
+		while (!list_empty(&wait_list)) {
+			bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
-	/*
-	 * Remaining list items must be flushed before returning
-	 */
-	while (!list_empty(&tmp)) {
-		bp = list_entry(tmp.next, xfs_buf_t, b_list);
-
-		list_del_init(&bp->b_list);
-		xfs_iowait(bp);
-		xfs_buf_relse(bp);
+			list_del_init(&bp->b_list);
+			xfs_iowait(bp);
+			xfs_buf_relse(bp);
+		}
 	}
 
 	return pincount;
-- 
1.6.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/7] xfs: Use delay write promotion for dquot flushing
  2010-01-25  6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
                   ` (3 preceding siblings ...)
  2010-01-25  6:22 ` [PATCH 4/7] xfs: Sort delayed write buffers before dispatch Dave Chinner
@ 2010-01-25  6:22 ` Dave Chinner
  2010-01-25 12:03   ` Christoph Hellwig
  2010-01-25  6:22 ` [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags Dave Chinner
  2010-01-25  6:22 ` [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously Dave Chinner
  6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  6:22 UTC (permalink / raw)
  To: xfs

xfs_qm_dqflock_pushbuf_wait() does a very similar trick to item
pushing used to do to flush out delayed write dquot buffers. Change
it to use the new promotion method rather than an async flush.

Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock
held, yet the callers make the assumption that after this call the
flush lock is held. Always return with the flush lock held.
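The promotion mechanism can be mimicked in user space: a delayed-write
queue flushes entries older than an age threshold, and promoting an
entry simply backdates its queue time so the very next flusher pass
picks it up. The names below are invented for illustration and are not
the kernel's xfs_buf API:

```c
#include <stdint.h>

/* Toy delayed-write entry; illustration only. */
struct fake_buf {
	int64_t queuetime;	/* tick at which the buffer was queued */
};

struct delwri_queue {
	int64_t now;		/* current tick (a jiffies analogue) */
	int64_t age;		/* entries older than now - age get flushed */
};

/* Would the flusher write this buffer on its next pass? */
static int flushable(const struct delwri_queue *q, const struct fake_buf *bp)
{
	return bp->queuetime <= q->now - q->age;
}

/* Backdate the queue time so the buffer looks past its flush age,
 * mirroring the xfs_buf_delwri_promote() logic added in patch 3. */
static void promote(const struct delwri_queue *q, struct fake_buf *bp)
{
	/* Already past the flush age: nothing to do. */
	if (bp->queuetime < q->now - q->age)
		return;
	bp->queuetime = q->now - q->age;
}
```

Promotion alone only changes which buffers the next pass selects, so the
caller still has to wake the flusher thread afterwards; that is exactly
why this patch adds the `wake_up_process()` call.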

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/quota/xfs_dquot.c |   25 ++++++++++---------------
 1 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index f9baeed..1620a56 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -1528,21 +1528,16 @@ xfs_qm_dqflock_pushbuf_wait(
 	 */
 	bp = xfs_incore(dqp->q_mount->m_ddev_targp, dqp->q_blkno,
 		    XFS_QI_DQCHUNKLEN(dqp->q_mount), XBF_TRYLOCK);
-	if (bp != NULL) {
-		if (XFS_BUF_ISDELAYWRITE(bp)) {
-			int	error;
-
-			if (XFS_BUF_ISPINNED(bp))
-				xfs_log_force(dqp->q_mount, 0);
-			error = xfs_bawrite(dqp->q_mount, bp);
-			if (error)
-				xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
-					"xfs_qm_dqflock_pushbuf_wait: "
-					"pushbuf error %d on dqp %p, bp %p",
-					error, dqp, bp);
-		} else {
-			xfs_buf_relse(bp);
-		}
+	if (!bp)
+		goto out_lock;
+
+	if (XFS_BUF_ISDELAYWRITE(bp)) {
+		if (XFS_BUF_ISPINNED(bp))
+			xfs_log_force(dqp->q_mount, 0);
+		xfs_buf_delwri_promote(bp);
+		wake_up_process(bp->b_target->bt_task);
 	}
+	xfs_buf_relse(bp);
+out_lock:
 	xfs_dqflock(dqp);
 }
-- 
1.6.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags
  2010-01-25  6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
                   ` (4 preceding siblings ...)
  2010-01-25  6:22 ` [PATCH 5/7] xfs: Use delay write promotion for dquot flushing Dave Chinner
@ 2010-01-25  6:22 ` Dave Chinner
  2010-01-25 12:04   ` Christoph Hellwig
  2010-01-25  6:22 ` [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously Dave Chinner
  6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  6:22 UTC (permalink / raw)
  To: xfs

dquots are never flushed asynchronously. Remove the flag and the
async write support from the flush function. Make the default flush
a delwri flush to match the inode flush code, which leaves
XFS_QMOPT_SYNC as the only flag remaining.  Convert that to use
SYNC_WAIT instead, just like the inode flush code.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/quota/xfs_dquot.c      |   13 ++++++-------
 fs/xfs/quota/xfs_dquot_item.c |    2 +-
 fs/xfs/quota/xfs_qm.c         |   14 ++++++--------
 fs/xfs/xfs_quota.h            |    8 +-------
 4 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index 1620a56..5f79dd7 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -1187,7 +1187,7 @@ xfs_qm_dqflush(
 	 * block, nada.
 	 */
 	if (!XFS_DQ_IS_DIRTY(dqp) ||
-	    (!(flags & XFS_QMOPT_SYNC) && atomic_read(&dqp->q_pincount) > 0)) {
+	    (!(flags & SYNC_WAIT) && atomic_read(&dqp->q_pincount) > 0)) {
 		xfs_dqfunlock(dqp);
 		return 0;
 	}
@@ -1251,18 +1251,17 @@ xfs_qm_dqflush(
 		xfs_log_force(mp, 0);
 	}
 
-	if (flags & XFS_QMOPT_DELWRI) {
-		xfs_bdwrite(mp, bp);
-	} else {
+	if (flags & SYNC_WAIT)
 		error = xfs_bwrite(mp, bp);
-	}
+	else
+		xfs_bdwrite(mp, bp);
 
 	trace_xfs_dqflush_done(dqp);
 
 	/*
 	 * dqp is still locked, but caller is free to unlock it now.
 	 */
-	return (error);
+	return error;
 
 }
 
@@ -1443,7 +1442,7 @@ xfs_qm_dqpurge(
 		 * We don't care about getting disk errors here. We need
 		 * to purge this dquot anyway, so we go ahead regardless.
 		 */
-		error = xfs_qm_dqflush(dqp, XFS_QMOPT_SYNC);
+		error = xfs_qm_dqflush(dqp, SYNC_WAIT);
 		if (error)
 			xfs_fs_cmn_err(CE_WARN, mp,
 				"xfs_qm_dqpurge: dquot %p flush failed", dqp);
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index dda0fb0..4e4ee9a 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -153,7 +153,7 @@ xfs_qm_dquot_logitem_push(
 	 * lock without sleeping, then there must not have been
 	 * anyone in the process of flushing the dquot.
 	 */
-	error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+	error = xfs_qm_dqflush(dqp, 0);
 	if (error)
 		xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
 			"xfs_qm_dquot_logitem_push: push error %d on dqp %p",
diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c
index 11cfd82..8112a26 100644
--- a/fs/xfs/quota/xfs_qm.c
+++ b/fs/xfs/quota/xfs_qm.c
@@ -450,7 +450,7 @@ xfs_qm_unmount_quotas(
 STATIC int
 xfs_qm_dqflush_all(
 	xfs_mount_t	*mp,
-	int		flags)
+	int		sync_mode)
 {
 	int		recl;
 	xfs_dquot_t	*dqp;
@@ -486,7 +486,7 @@ again:
 		 * across a disk write.
 		 */
 		xfs_qm_mplist_unlock(mp);
-		error = xfs_qm_dqflush(dqp, flags);
+		error = xfs_qm_dqflush(dqp, sync_mode);
 		xfs_dqunlock(dqp);
 		if (error)
 			return error;
@@ -926,13 +926,11 @@ xfs_qm_sync(
 {
 	int		recl, restarts;
 	xfs_dquot_t	*dqp;
-	uint		flush_flags;
 	int		error;
 
 	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
 		return 0;
 
-	flush_flags = (flags & SYNC_WAIT) ? XFS_QMOPT_SYNC : XFS_QMOPT_DELWRI;
 	restarts = 0;
 
   again:
@@ -992,7 +990,7 @@ xfs_qm_sync(
 		 * across a disk write
 		 */
 		xfs_qm_mplist_unlock(mp);
-		error = xfs_qm_dqflush(dqp, flush_flags);
+		error = xfs_qm_dqflush(dqp, (flags & SYNC_WAIT));
 		xfs_dqunlock(dqp);
 		if (error && XFS_FORCED_SHUTDOWN(mp))
 			return 0;	/* Need to prevent umount failure */
@@ -1796,7 +1794,7 @@ xfs_qm_quotacheck(
 	 * successfully.
 	 */
 	if (!error)
-		error = xfs_qm_dqflush_all(mp, XFS_QMOPT_DELWRI);
+		error = xfs_qm_dqflush_all(mp, 0);
 
 	/*
 	 * We can get this error if we couldn't do a dquot allocation inside
@@ -2018,7 +2016,7 @@ xfs_qm_shake_freelist(
 			 * We flush it delayed write, so don't bother
 			 * releasing the mplock.
 			 */
-			error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+			error = xfs_qm_dqflush(dqp, 0);
 			if (error) {
 				xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
 			"xfs_qm_dqflush_all: dquot %p flush failed", dqp);
@@ -2201,7 +2199,7 @@ xfs_qm_dqreclaim_one(void)
 			 * We flush it delayed write, so don't bother
 			 * releasing the freelist lock.
 			 */
-			error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+			error = xfs_qm_dqflush(dqp, 0);
 			if (error) {
 				xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
 			"xfs_qm_dqreclaim: dquot %p flush failed", dqp);
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 21d11d9..fdcab3f 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -223,15 +223,9 @@ typedef struct xfs_qoff_logformat {
 #define XFS_QMOPT_RES_INOS	0x0800000
 
 /*
- * flags for dqflush and dqflush_all.
- */
-#define XFS_QMOPT_SYNC		0x1000000
-#define XFS_QMOPT_DELWRI	0x4000000
-
-/*
  * flags for dqalloc.
  */
-#define XFS_QMOPT_INHERIT	0x8000000
+#define XFS_QMOPT_INHERIT	0x1000000
 
 /*
  * flags to xfs_trans_mod_dquot.
-- 
1.6.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously
  2010-01-25  6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
                   ` (5 preceding siblings ...)
  2010-01-25  6:22 ` [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags Dave Chinner
@ 2010-01-25  6:22 ` Dave Chinner
  2010-01-25 16:03   ` Christoph Hellwig
  6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  6:22 UTC (permalink / raw)
  To: xfs

When an inode has already been flushed delayed write,
xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
return on a synchronous inode write without having written the
inode. Currently these synchronous writes only come from the unmount
path or the nfsd on a synchronous export, so they should be fairly rare.

Realistically, a synchronous inode write is not necessary here; we
can treat this like fsync where we either force the log if there are
no unlogged changes, or do a sync transaction if there are unlogged
changes. This will result in real synchronous semantics as the fsync
will issue barriers, but may slow down the above two configurations
as a result. However, if the inode is not pinned and has no unlogged
changes, then the fsync code is a no-op and hence it may be faster
than the existing code.

For the asynchronous write, move the clean check until after we have
locked up the inode. With the inode locked and the flush lock held,
we know that if the inode is clean there are no pending changes in the
log and there are no current outstanding delayed writes or IO in
progress, so if it reports clean now it really is clean. This matches
the order of locking and checks in xfs_sync_inode_attr().
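The ordering described here — take the locks non-blockingly first, then
do the clean check, so the check cannot race with a concurrent flush —
can be sketched with a toy object and a pthread mutex standing in for
the inode and flush locks (invented names, not the XFS API):

```c
#include <errno.h>
#include <pthread.h>

/* Toy 'inode': a dirty flag guarded by one lock; illustration only. */
struct toy_inode {
	pthread_mutex_t lock;
	int dirty;
	int writes;		/* how many times we actually flushed */
};

/*
 * Non-blocking writeback: if the object is contended, give up with
 * EAGAIN and let a later sync pass catch it. Only once the lock is
 * held is the clean check meaningful.
 */
static int toy_write_inode(struct toy_inode *ip)
{
	if (pthread_mutex_trylock(&ip->lock) != 0)
		return EAGAIN;

	/* Check cleanliness only now: nothing can race with us here. */
	if (ip->dirty) {
		ip->writes++;
		ip->dirty = 0;
	}
	pthread_mutex_unlock(&ip->lock);
	return 0;
}
```

Checking the dirty state before taking the lock would reintroduce the
race this patch removes: another thread could flush or re-dirty the
object between the check and the lock.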

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_super.c |   44 ++++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 5ed0468..4cfc82a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1045,14 +1045,18 @@ xfs_fs_write_inode(
 		error = xfs_wait_on_pages(ip, 0, -1);
 		if (error)
 			goto out;
+		/*
+		 * The fsync operation makes inode changes stable and it
+		 * reduces the IOs we have to do here from two (log and inode)
+		 * to just the log. We still need to do a delwri write of the
+		 * inode after this to flush it to the backing buffer so that
+		 * bulkstat works properly.
+		 */
+		error = xfs_fsync(ip);
+		if (error)
+			goto out;
 	}
 
-	/*
-	 * Bypass inodes which have already been cleaned by
-	 * the inode flush clustering code inside xfs_iflush
-	 */
-	if (xfs_inode_clean(ip))
-		goto out;
 
 	/*
 	 * We make this non-blocking if the inode is contended, return
@@ -1060,20 +1064,24 @@ xfs_fs_write_inode(
 	 * This prevents the flush path from blocking on inodes inside
 	 * another operation right now, they get caught later by xfs_sync.
 	 */
-	if (sync) {
-		xfs_ilock(ip, XFS_ILOCK_SHARED);
-		xfs_iflock(ip);
-
-		error = xfs_iflush(ip, SYNC_WAIT);
-	} else {
-		error = EAGAIN;
-		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
-			goto out;
-		if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
-			goto out_unlock;
+	error = EAGAIN;
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
+		goto out;
+	if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
+		goto out_unlock;
 
-		error = xfs_iflush(ip, 0);
+	/*
+	 * Now we have the flush lock and the inode is not pinned, we can check
+	 * if the inode is really clean as we know that there are no pending
+	 * transaction completions, it is not waiting on the delayed write
+	 * queue and there is no IO in progress.
+	 */
+	error = 0;
+	if (xfs_inode_clean(ip)) {
+		xfs_ifunlock(ip);
+		goto out_unlock;
 	}
+	error = xfs_iflush(ip, 0);
 
  out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-- 
1.6.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/7] xfs: Make inode reclaim states explicit
  2010-01-25  6:22 ` [PATCH 1/7] xfs: Make inode reclaim states explicit Dave Chinner
@ 2010-01-25 11:47   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 11:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
  2010-01-25  6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
@ 2010-01-25 11:53   ` Christoph Hellwig
  2010-01-25 22:31     ` Dave Chinner
  2010-01-25 15:29   ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 11:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index 98b8937..ca0cc59 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -270,8 +270,7 @@ xfs_sync_inode_attr(
>  		goto out_unlock;
>  	}
>  
> -	error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
> -			   XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
> +	error = xfs_iflush(ip, (flags & SYNC_WAIT));

No need for the masking here, as xfs_iflush simply ignores SYNC_TRYLOCK.

>  	/* Now we have an inode that needs flushing */
>  	error = xfs_iflush(ip, sync_mode);
> +	if (!(sync_mode & SYNC_WAIT))
> +		goto requeue_no_flock;

So for the !wait case we entirely ignore the return value?  We should
at least check for an I/O error here I think.  Also in this context
the requeue label name doesn't fit too well, even if it's the same
action as the requeue.


* Re: [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push
  2010-01-25  6:22 ` [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push Dave Chinner
@ 2010-01-25 11:59   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 11:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

Btw, this patch kills the last two places returning XFS_ITEM_FLUSHING, so
you might want to kill that, too.


* Re: [PATCH 4/7] xfs: Sort delayed write buffers before dispatch
  2010-01-25  6:22 ` [PATCH 4/7] xfs: Sort delayed write buffers before dispatch Dave Chinner
@ 2010-01-25 12:00   ` Christoph Hellwig
  2010-01-26  4:13     ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 12:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> @@ -1937,42 +1967,46 @@ xfs_flush_buftarg(
>  	xfs_buftarg_t	*target,
>  	int		wait)
>  {
> -	struct list_head tmp;
> -	xfs_buf_t	*bp, *n;
> +	xfs_buf_t	*bp;
>  	int		pincount = 0;
> +	LIST_HEAD(tmp_list);
> +	LIST_HEAD(wait_list);
>  
>  	xfs_buf_runall_queues(xfsconvertd_workqueue);
>  	xfs_buf_runall_queues(xfsdatad_workqueue);
>  	xfs_buf_runall_queues(xfslogd_workqueue);
>  
>  	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> -	pincount = xfs_buf_delwri_split(target, &tmp, 0);
> +	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
>  
>  	/*
> -	 * Dropped the delayed write list lock, now walk the temporary list
> +	 * Dropped the delayed write list lock, now walk the temporary list.
> +	 * All I/O is issued async and then if we need to wait for completion
> +	 * we do that after issuing all the IO.
>  	 */
> -	list_for_each_entry_safe(bp, n, &tmp, b_list) {
> +	list_sort(NULL, &tmp_list, xfs_buf_cmp);
> +	while (!list_empty(&tmp_list)) {
> +		struct xfs_buf *bp;

This now has a bp variable both at function-wide scope and a local one
here.  Might be worth deciding on one style or the other.

Otherwise looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 5/7] xfs: Use delay write promotion for dquot flushing
  2010-01-25  6:22 ` [PATCH 5/7] xfs: Use delay write promotion for dquot flushing Dave Chinner
@ 2010-01-25 12:03   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 12:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jan 25, 2010 at 05:22:42PM +1100, Dave Chinner wrote:
> xfs_qm_dqflock_pushbuf_wait() plays a very similar trick to the one
> item pushing used to play to flush out delayed write dquot buffers.
> Change it to use the new promotion method rather than an async flush.
> 
> Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock
> held, yet the callers make the assumption that after this call the
> flush lock is held. Always return with the flush lock held.
> 
> Signed-off-by: Dave Chinner <david@fromorbit.com>

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

This also removes the last user of xfs_bawrite, so you might want to
kill it.


* Re: [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags
  2010-01-25  6:22 ` [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags Dave Chinner
@ 2010-01-25 12:04   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 12:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> +		error = xfs_qm_dqflush(dqp, (flags & SYNC_WAIT));

Just pass down the full flags; the other flag is ignored by
xfs_qm_dqflush.


* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
  2010-01-25  6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
  2010-01-25 11:53   ` Christoph Hellwig
@ 2010-01-25 15:29   ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 15:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

>  	/*
> -	 * Write out the inode.  The completion routine ('iflush_done') will
> -	 * pull it from the AIL, mark it clean, unlock the flush lock.
> +	 * Push the inode to its backing buffer. This will not remove
> +	 * the inode from the AIL - a further push will be required to trigger
> +	 * a buffer push. However, this allows all the dirty inodes to be pushed to

Btw, this creates a rather annoying and pointless over-long line in the
source file.


* Re: [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously
  2010-01-25  6:22 ` [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously Dave Chinner
@ 2010-01-25 16:03   ` Christoph Hellwig
  2010-01-26  4:51     ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 16:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jan 25, 2010 at 05:22:44PM +1100, Dave Chinner wrote:
> When an inode has already been flushed as a delayed write,
> xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
> return on a synchronous inode write without having written the
> inode. Currently these synchronous writes only come from the unmount
> path or the nfsd on a synchronous export, so they should be fairly rare.

They also come from sync_filesystem, which is used by the sync system
call, the unmount code and cachefiles.

> Realistically, a synchronous inode write is not necessary here; we
> can treat this like fsync where we either force the log if there are
> no unlogged changes, or do a sync transaction if there are unlogged
> changes. This will result in real synchronous semantics as the fsync
> will issue barriers, but may slow down the above two configurations
> as a result. However, if the inode is not pinned and has no unlogged
> changes, then the fsync code is a no-op and hence it may be faster
> than the existing code.

If we get a lot of cases where we need to write out the inode
synchronously, the barrier might hit us really hard, though.  If
we have a lot of delalloc I/O outstanding I fear this might actually
happen in practice, as the inode gets modified by I/O completion
after the first ->write_inode call with wait == 0.

> +	error = EAGAIN;
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> +		goto out;
> +	if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
> +		goto out_unlock;

So if we make this non-blocking even for the wait case, don't we
still have a race window where bulkstat could miss the updates, even
after a sync?


* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
  2010-01-25 11:53   ` Christoph Hellwig
@ 2010-01-25 22:31     ` Dave Chinner
  2010-01-26  7:28       ` Dave Chinner
  2010-01-26 16:10       ` Christoph Hellwig
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 22:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Jan 25, 2010 at 06:53:08AM -0500, Christoph Hellwig wrote:
> > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > index 98b8937..ca0cc59 100644
> > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > @@ -270,8 +270,7 @@ xfs_sync_inode_attr(
> >  		goto out_unlock;
> >  	}
> >  
> > -	error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
> > -			   XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
> > +	error = xfs_iflush(ip, (flags & SYNC_WAIT));
> 
> No need for the masking here, as xfs_iflush simply ignores SYNC_TRYLOCK.

OK.

> >  	/* Now we have an inode that needs flushing */
> >  	error = xfs_iflush(ip, sync_mode);
> > +	if (!(sync_mode & SYNC_WAIT))
> > +		goto requeue_no_flock;
> 
> So for the !wait case we entirely ignore the return value?  We should
> at least check for an I/O error here I think.

I'm not sure we can get an error on a delayed write that we
need to take action on. We don't actually issue the IO from this
call, so the only error case is on reading the buffer. If we get
an error there, what should we do?

My thinking was that if we can't write back the inode, we can't
reclaim it, and if the error is from a transient read error (e.g. FC
path failover) we should be retrying anyway as we do in
xfs_buf_iodone_callbacks()...

> Also in this context
> the requeue label name doesn't fit too well, even if it's the same
> action as the requeue.

I'm open to suggestions for a better name ;)

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/7] xfs: Sort delayed write buffers before dispatch
  2010-01-25 12:00   ` Christoph Hellwig
@ 2010-01-26  4:13     ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-26  4:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Jan 25, 2010 at 07:00:54AM -0500, Christoph Hellwig wrote:
> > @@ -1937,42 +1967,46 @@ xfs_flush_buftarg(
> >  	xfs_buftarg_t	*target,
> >  	int		wait)
> >  {
> > -	struct list_head tmp;
> > -	xfs_buf_t	*bp, *n;
> > +	xfs_buf_t	*bp;
> >  	int		pincount = 0;
> > +	LIST_HEAD(tmp_list);
> > +	LIST_HEAD(wait_list);
> >  
> >  	xfs_buf_runall_queues(xfsconvertd_workqueue);
> >  	xfs_buf_runall_queues(xfsdatad_workqueue);
> >  	xfs_buf_runall_queues(xfslogd_workqueue);
> >  
> >  	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> > -	pincount = xfs_buf_delwri_split(target, &tmp, 0);
> > +	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
> >  
> >  	/*
> > -	 * Dropped the delayed write list lock, now walk the temporary list
> > +	 * Dropped the delayed write list lock, now walk the temporary list.
> > +	 * All I/O is issued async and then if we need to wait for completion
> > +	 * we do that after issuing all the IO.
> >  	 */
> > -	list_for_each_entry_safe(bp, n, &tmp, b_list) {
> > +	list_sort(NULL, &tmp_list, xfs_buf_cmp);
> > +	while (!list_empty(&tmp_list)) {
> > +		struct xfs_buf *bp;
> 
> This now has a bp variable both at function-wide scope and a local one
> here.  Might be worth deciding on one style or the other.

I removed the second declaration.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously
  2010-01-25 16:03   ` Christoph Hellwig
@ 2010-01-26  4:51     ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-26  4:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Jan 25, 2010 at 11:03:54AM -0500, Christoph Hellwig wrote:
> On Mon, Jan 25, 2010 at 05:22:44PM +1100, Dave Chinner wrote:
> > When an inode has already been flushed as a delayed write,
> > xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
> > return on a synchronous inode write without having written the
> > inode. Currently these synchronous writes only come from the unmount
> > path or the nfsd on a synchronous export, so they should be fairly rare.
> 
> They also come from sync_filesystem, which is uses by the sync system
> call, in the unmount code and from cachefiles.

True - I'll update the comment - but I still think it'll be fairly
rare.

> > Realistically, a synchronous inode write is not necessary here; we
> > can treat this like fsync where we either force the log if there are
> > no unlogged changes, or do a sync transaction if there are unlogged
> > changes. This will result in real synchronous semantics as the fsync
> > will issue barriers, but may slow down the above two configurations
> > as a result. However, if the inode is not pinned and has no unlogged
> > changes, then the fsync code is a no-op and hence it may be faster
> > than the existing code.
> 
> If we get a lot of cases where we need to write out the inode
> synchronously the barrier might hit us really hard, though.

No different to running wsync, though, where all transactions
are synchronous and will issue barriers all the time.

> If
> we have a lot of delalloc I/O outstanding I fear this might actually
> happen in practice, as the inode gets modified by I/O completion
> after the first ->write_inode call with wait == 0.

So far in my testing I haven't seen a big hit - the performance
tests I've done are on filesystems with barriers enabled. I just
checked barrier vs nobarrier sync times after creating 400,000
single block files in parallel - nobarrier = 27s, barrier = 29s.

> > +	error = EAGAIN;
> > +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> > +		goto out;
> > +	if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
> > +		goto out_unlock;
> 
> So if we make this non-blocking even for the wait case, don't we
> still have a race window there bulkstat could miss the updates, even
> after a sync?

Yes, you're right. But even if we lock here properly, a delwri flush
is non-blocking and hence can still return EAGAIN. We really only need
this flush for bulkstat to work correctly when a newly allocated inode
has not been previously flushed. We would need to race with a concurrent
transaction between the fsync call and the checks below for this
flush to fail, which I think should be a relatively rare occurrence.

What I will look at is whether I can get xfs_fsync() to take a
locked inode and return with it still locked. Then this race
condition will go away completely and hence the delwri flush
will only occur if the inode has not been flushed yet (based
on the flock).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
  2010-01-25 22:31     ` Dave Chinner
@ 2010-01-26  7:28       ` Dave Chinner
  2010-01-26 16:10       ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-26  7:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Jan 26, 2010 at 09:31:40AM +1100, Dave Chinner wrote:
> On Mon, Jan 25, 2010 at 06:53:08AM -0500, Christoph Hellwig wrote:
> > > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > > index 98b8937..ca0cc59 100644
> > > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > > @@ -270,8 +270,7 @@ xfs_sync_inode_attr(
> > >  		goto out_unlock;
> > >  	}
> > >  
> > > -	error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
> > > -			   XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
> > > +	error = xfs_iflush(ip, (flags & SYNC_WAIT));
> > 
> > No need for the masking here, as xfs_iflush simply ignores SYNC_TRYLOCK.
> 
> OK.
> 
> > >  	/* Now we have an inode that needs flushing */
> > >  	error = xfs_iflush(ip, sync_mode);
> > > +	if (!(sync_mode & SYNC_WAIT))
> > > +		goto requeue_no_flock;
> > 
> > So for the !wait case we entirely ignore the return value?  We should
> > at least check for an I/O error here I think.
> 
> I'm not sure we can get an error on a delayed write that we
> need to take action on. We don't actually issue the IO from this
> call, so the only error case is on reading the buffer. If we get
> > an error there, what should we do?
> 
> My thinking was that if we can't write back the inode, we can't
> reclaim it, and if the error is from a transient read error (e.g. FC
> path failover) we should be retrying anyway as we do in
> xfs_buf_iodone_callbacks()...

I forgot to mention that it will get caught by a sync
reclaim pass when there is a context to return the error to....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
  2010-01-25 22:31     ` Dave Chinner
  2010-01-26  7:28       ` Dave Chinner
@ 2010-01-26 16:10       ` Christoph Hellwig
  2010-02-01 22:59         ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-26 16:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

> > Also in this context
> > the requeue label name doesn't fit too well, even if it's the same
> > action as the requeue.
> 
> I'm open to suggestions for a better name ;)

What do you think about this version?

Index: xfs/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_sync.c	2010-01-26 17:01:49.610012226 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_sync.c	2010-01-26 17:06:50.533006103 +0100
@@ -785,7 +785,7 @@ xfs_reclaim_inode(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
-			goto requeue_no_flock;
+			goto out;
 		xfs_iflock(ip);
 	}
 
@@ -796,8 +796,10 @@ xfs_reclaim_inode(
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto requeue;
+		if (!(sync_mode & SYNC_WAIT)) {
+			xfs_ifunlock(ip);
+			goto out;
+		}
 		xfs_iunpin_wait(ip);
 	}
 	if (xfs_iflags_test(ip, XFS_ISTALE))
@@ -807,19 +809,18 @@ xfs_reclaim_inode(
 
 	/* Now we have an inode that needs flushing */
 	error = xfs_iflush(ip, sync_mode);
-	if (!(sync_mode & SYNC_WAIT))
-		goto requeue_no_flock;
-	xfs_iflock(ip);
-
-reclaim:
-	xfs_ifunlock(ip);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	xfs_ireclaim(ip);
-	return error;
+	if (sync_mode & SYNC_WAIT) {
+		xfs_iflock(ip);
+		goto reclaim;
+	}
 
-requeue:
-	xfs_ifunlock(ip);
-requeue_no_flock:
+	/*
+	 * When we have to flush an inode but don't have SYNC_WAIT set, we
+	 * flush the inode out using a delwri buffer and wait for the next
+	 * call into reclaim to find it in a clean state instead of doing
+	 * it now.
+	 */
+out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	/*
@@ -830,6 +831,12 @@ requeue_no_flock:
 	 * attempt to reclaim the inode again.
 	 */
 	return 0;
+
+reclaim:
+	xfs_ifunlock(ip);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_ireclaim(ip);
+	return error;
 }
 
 int


* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
  2010-01-26 16:10       ` Christoph Hellwig
@ 2010-02-01 22:59         ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-02-01 22:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Jan 26, 2010 at 11:10:45AM -0500, Christoph Hellwig wrote:
> > > Also in this context
> > > the requeue label name doesn't fit too well, even if it's the same
> > > action as the requeue.
> > 
> > I'm open to suggestions for a better name ;)
> 
> What do you think about this version?

Yeah, that looks a lot better. I'll incorporate that into the
patch. Thanks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 22+ messages
2010-01-25  6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
2010-01-25  6:22 ` [PATCH 1/7] xfs: Make inode reclaim states explicit Dave Chinner
2010-01-25 11:47   ` Christoph Hellwig
2010-01-25  6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
2010-01-25 11:53   ` Christoph Hellwig
2010-01-25 22:31     ` Dave Chinner
2010-01-26  7:28       ` Dave Chinner
2010-01-26 16:10       ` Christoph Hellwig
2010-02-01 22:59         ` Dave Chinner
2010-01-25 15:29   ` Christoph Hellwig
2010-01-25  6:22 ` [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push Dave Chinner
2010-01-25 11:59   ` Christoph Hellwig
2010-01-25  6:22 ` [PATCH 4/7] xfs: Sort delayed write buffers before dispatch Dave Chinner
2010-01-25 12:00   ` Christoph Hellwig
2010-01-26  4:13     ` Dave Chinner
2010-01-25  6:22 ` [PATCH 5/7] xfs: Use delay write promotion for dquot flushing Dave Chinner
2010-01-25 12:03   ` Christoph Hellwig
2010-01-25  6:22 ` [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags Dave Chinner
2010-01-25 12:04   ` Christoph Hellwig
2010-01-25  6:22 ` [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously Dave Chinner
2010-01-25 16:03   ` Christoph Hellwig
2010-01-26  4:51     ` Dave Chinner
