* [PATCH 1/7] xfs: Make inode reclaim states explicit
2010-01-25 6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
@ 2010-01-25 6:22 ` Dave Chinner
2010-01-25 11:47 ` Christoph Hellwig
2010-01-25 6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
` (5 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 6:22 UTC (permalink / raw)
To: xfs
A.K.A.: don't rely on xfs_iflush() return value in reclaim
We have gradually been moving checks out of the reclaim code because
they are duplicated in xfs_iflush(). We've had a history of problems
in this area, and many of them stem from the overloading of the
return values from xfs_iflush() and interaction with inode flush
locking to determine if the inode is safe to reclaim.
With the desire to move to delayed write flushing of inodes and
non-blocking inode tree reclaim walks, the overloading of the
return value of xfs_iflush makes it very difficult to determine
the correct thing to do next.
This patch explicitly re-adds the checks to the inode reclaim code,
removing the reliance on the return value of xfs_iflush() to
determine what to do next. It also means that we can clearly
document all the inode states that reclaim must handle and hence
we can easily see that we handled all the necessary cases.
This also removes the need for the xfs_inode_clean() check in
xfs_iflush() as all callers now check this first (safely).
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_sync.c | 80 ++++++++++++++++++++++++++++++++----------
fs/xfs/xfs_inode.c | 11 +-----
fs/xfs/xfs_inode.h | 1 +
3 files changed, 63 insertions(+), 29 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index c9b863e..98b8937 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -706,12 +706,43 @@ __xfs_inode_clear_reclaim_tag(
XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
}
+/*
+ * Inodes in different states need to be treated differently, and the return
+ * value of xfs_iflush is not sufficient to get this right. The following table
+ * lists the inode states and the reclaim actions necessary for non-blocking
+ * reclaim:
+ *
+ *
+ * inode state iflush ret required action
+ * --------------- ---------- ---------------
+ * bad - reclaim
+ * shutdown EIO unpin and reclaim
+ * clean, unpinned 0 reclaim
+ * stale, unpinned 0 reclaim
+ * clean, pinned(*) 0 unpin and reclaim
+ * stale, pinned 0 unpin and reclaim
+ * dirty, async 0 block on flush lock, reclaim
+ * dirty, sync flush 0 block on flush lock, reclaim
+ *
+ * (*) dgc: I don't think the clean, pinned state is possible but it gets
+ * handled anyway given the order of checks implemented.
+ *
+ * Hence the order of actions after gaining the locks should be:
+ * bad => reclaim
+ * shutdown => unpin and reclaim
+ * pinned => unpin
+ * stale => reclaim
+ * clean => reclaim
+ * dirty => flush, wait and reclaim
+ */
STATIC int
xfs_reclaim_inode(
struct xfs_inode *ip,
struct xfs_perag *pag,
int sync_mode)
{
+ int error;
+
/*
* The radix tree lock here protects a thread in xfs_iget from racing
* with us starting reclaim on the inode. Once we have the
@@ -729,30 +760,41 @@ xfs_reclaim_inode(
spin_unlock(&ip->i_flags_lock);
write_unlock(&pag->pag_ici_lock);
- /*
- * If the inode is still dirty, then flush it out. If the inode
- * is not in the AIL, then it will be OK to flush it delwri as
- * long as xfs_iflush() does not keep any references to the inode.
- * We leave that decision up to xfs_iflush() since it has the
- * knowledge of whether it's OK to simply do a delwri flush of
- * the inode or whether we need to wait until the inode is
- * pulled from the AIL.
- * We get the flush lock regardless, though, just to make sure
- * we don't free it while it is being flushed.
- */
xfs_ilock(ip, XFS_ILOCK_EXCL);
xfs_iflock(ip);
- /*
- * In the case of a forced shutdown we rely on xfs_iflush() to
- * wait for the inode to be unpinned before returning an error.
- */
- if (!is_bad_inode(VFS_I(ip)) && xfs_iflush(ip, sync_mode) == 0) {
- /* synchronize with xfs_iflush_done */
- xfs_iflock(ip);
- xfs_ifunlock(ip);
+ if (is_bad_inode(VFS_I(ip)))
+ goto reclaim;
+ if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+ xfs_iunpin_wait(ip);
+ goto reclaim;
+ }
+ if (xfs_ipincount(ip))
+ xfs_iunpin_wait(ip);
+ if (xfs_iflags_test(ip, XFS_ISTALE))
+ goto reclaim;
+ if (xfs_inode_clean(ip))
+ goto reclaim;
+
+ /* Now we have an inode that needs flushing */
+ error = xfs_iflush(ip, sync_mode);
+ if (!error) {
+ switch(sync_mode) {
+ case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
+ case XFS_IFLUSH_DELWRI:
+ case XFS_IFLUSH_ASYNC:
+ case XFS_IFLUSH_DELWRI_ELSE_SYNC:
+ case XFS_IFLUSH_SYNC:
+ /* IO issued, synchronise with IO completion */
+ xfs_iflock(ip);
+ default:
+ ASSERT(0);
+ break;
+ }
}
+reclaim:
+ xfs_ifunlock(ip);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
xfs_ireclaim(ip);
return 0;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d0d1b5a..8d0666d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2493,7 +2493,7 @@ __xfs_iunpin_wait(
wait_event(ip->i_ipin_wait, (atomic_read(&ip->i_pincount) == 0));
}
-static inline void
+void
xfs_iunpin_wait(
xfs_inode_t *ip)
{
@@ -2849,15 +2849,6 @@ xfs_iflush(
mp = ip->i_mount;
/*
- * If the inode isn't dirty, then just release the inode flush lock and
- * do nothing.
- */
- if (xfs_inode_clean(ip)) {
- xfs_ifunlock(ip);
- return 0;
- }
-
- /*
* We can't flush the inode until it is unpinned, so wait for it if we
* are allowed to block. We know noone new can pin it, because we are
* holding the inode lock shared and you need to hold it exclusively to
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ec1f28c..8b618ea 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -483,6 +483,7 @@ int xfs_iunlink(struct xfs_trans *, xfs_inode_t *);
void xfs_iext_realloc(xfs_inode_t *, int, int);
void xfs_ipin(xfs_inode_t *);
void xfs_iunpin(xfs_inode_t *);
+void xfs_iunpin_wait(xfs_inode_t *);
int xfs_iflush(xfs_inode_t *, uint);
void xfs_ichgtime(xfs_inode_t *, int);
void xfs_lock_inodes(xfs_inode_t **, int, uint);
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 22+ messages in thread* [PATCH 2/7] xfs: Use delayed write for inodes rather than async
2010-01-25 6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
2010-01-25 6:22 ` [PATCH 1/7] xfs: Make inode reclaim states explicit Dave Chinner
@ 2010-01-25 6:22 ` Dave Chinner
2010-01-25 11:53 ` Christoph Hellwig
2010-01-25 15:29 ` Christoph Hellwig
2010-01-25 6:22 ` [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push Dave Chinner
` (4 subsequent siblings)
6 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 6:22 UTC (permalink / raw)
To: xfs
We currently do background inode flush asynchronously, resulting in
inodes being written in whatever order the background writeback
issues them. Not only that, there are also blocking and non-blocking
asynchronous inode flushes, depending on where the flush comes from.
This patch completely removes asynchronous inode writeback. It
removes all the strange writeback modes and replaces them with
either a synchronous flush or a non-blocking delayed write flush.
That is, inode flushes will only issue IO directly if they are
synchronous, and background flushing may do nothing if the operation
would block (e.g. on a pinned inode or buffer lock).
Delayed write flushes will now result in the inode buffer sitting in
the delwri queue of the buffer cache to be flushed by either an AIL
push or by the xfsbufd timing out the buffer. This will allow
accumulation of dirty inode buffers in memory and allow optimisation
of inode cluster writeback at the xfsbufd level where we have much
greater queue depths than the block layer elevators. We will also
get adjacent inode cluster buffer IO merging for free when a later
patch in the series allows sorting of the delayed write buffers
before dispatch.
This effectively means that any inode that is written back by
background writeback will be seen as flush locked during AIL
pushing, and will result in the buffers being pushed from there.
This writeback path is currently non-optimal, but the next patch
in the series will fix that problem.
A side effect of this delayed write mechanism is that background
inode reclaim will no longer directly flush inodes, nor can it wait
on the flush lock. The result is that inode reclaim must leave the
inode in the reclaimable state until it is clean. Hence attempts to
reclaim a dirty inode in the background will simply skip the inode
until it is clean and this allows other mechanisms (i.e. xfsbufd) to
do more optimal writeback of the dirty buffers. As a result, the
inode reclaim code has been rewritten so that it no longer relies on
the ambiguous return values of xfs_iflush() to determine whether it
is safe to reclaim an inode.
Portions of this patch are derived from a patch by Christoph
Hellwig.
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_super.c | 4 +-
fs/xfs/linux-2.6/xfs_sync.c | 88 ++++++++++++++++++++++++++++-------------
fs/xfs/xfs_inode.c | 75 ++---------------------------------
fs/xfs/xfs_inode.h | 10 -----
fs/xfs/xfs_inode_item.c | 10 +++-
fs/xfs/xfs_mount.c | 13 ++++++-
6 files changed, 86 insertions(+), 114 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 67001ec..5ed0468 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1064,7 +1064,7 @@ xfs_fs_write_inode(
xfs_ilock(ip, XFS_ILOCK_SHARED);
xfs_iflock(ip);
- error = xfs_iflush(ip, XFS_IFLUSH_SYNC);
+ error = xfs_iflush(ip, SYNC_WAIT);
} else {
error = EAGAIN;
if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
@@ -1072,7 +1072,7 @@ xfs_fs_write_inode(
if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
goto out_unlock;
- error = xfs_iflush(ip, XFS_IFLUSH_ASYNC_NOBLOCK);
+ error = xfs_iflush(ip, 0);
}
out_unlock:
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 98b8937..ca0cc59 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -270,8 +270,7 @@ xfs_sync_inode_attr(
goto out_unlock;
}
- error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
- XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
+ error = xfs_iflush(ip, (flags & SYNC_WAIT));
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_SHARED);
@@ -460,16 +459,18 @@ xfs_quiesce_fs(
{
int count = 0, pincount;
+ xfs_reclaim_inodes(mp, 0);
xfs_flush_buftarg(mp->m_ddev_targp, 0);
- xfs_reclaim_inodes(mp, XFS_IFLUSH_DELWRI_ELSE_ASYNC);
/*
* This loop must run at least twice. The first instance of the loop
* will flush most meta data but that will generate more meta data
* (typically directory updates). Which then must be flushed and
- * logged before we can write the unmount record.
+ * logged before we can write the unmount record. We also so sync
+ * reclaim of inodes to catch any that the above delwri flush skipped.
*/
do {
+ xfs_reclaim_inodes(mp, SYNC_WAIT);
xfs_sync_attr(mp, SYNC_WAIT);
pincount = xfs_flush_buftarg(mp->m_ddev_targp, 1);
if (!pincount) {
@@ -585,7 +586,7 @@ xfs_sync_worker(
if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
xfs_log_force(mp, 0);
- xfs_reclaim_inodes(mp, XFS_IFLUSH_DELWRI_ELSE_ASYNC);
+ xfs_reclaim_inodes(mp, 0);
/* dgc: errors ignored here */
error = xfs_qm_sync(mp, SYNC_TRYLOCK);
error = xfs_sync_fsdata(mp, SYNC_TRYLOCK);
@@ -719,21 +720,42 @@ __xfs_inode_clear_reclaim_tag(
* shutdown EIO unpin and reclaim
* clean, unpinned 0 reclaim
* stale, unpinned 0 reclaim
- * clean, pinned(*) 0 unpin and reclaim
- * stale, pinned 0 unpin and reclaim
- * dirty, async 0 block on flush lock, reclaim
- * dirty, sync flush 0 block on flush lock, reclaim
+ * clean, pinned(*) 0 requeue
+ * stale, pinned EAGAIN requeue
+ * dirty, delwri ok 0 requeue
+ * dirty, delwri blocked EAGAIN requeue
+ * dirty, sync flush 0 reclaim
*
* (*) dgc: I don't think the clean, pinned state is possible but it gets
* handled anyway given the order of checks implemented.
*
+ * As can be seen from the table, the return value of xfs_iflush() is not
+ * sufficient to correctly decide the reclaim action here. The checks in
+ * xfs_iflush() might look like duplicates, but they are not.
+ *
+ * Also, because we get the flush lock first, we know that any inode that has
+ * been flushed delwri has had the flush completed by the time we check that
+ * the inode is clean. The clean inode check needs to be done before flushing
+ * the inode delwri otherwise we would loop forever requeuing clean inodes as
+ * we cannot tell apart a successful delwri flush and a clean inode from the
+ * return value of xfs_iflush().
+ *
+ * Note that because the inode is flushed delayed write by background
+ * writeback, the flush lock may already be held here and waiting on it can
+ * result in very long latencies. Hence for sync reclaims, where we wait on the
+ * flush lock, the caller should push out delayed write inodes first before
+ * trying to reclaim them to minimise the amount of time spent waiting. For
+ * background relaim, we just requeue the inode for the next pass.
+ *
* Hence the order of actions after gaining the locks should be:
* bad => reclaim
* shutdown => unpin and reclaim
- * pinned => unpin
+ * pinned, delwri => requeue
+ * pinned, sync => unpin
* stale => reclaim
* clean => reclaim
- * dirty => flush, wait and reclaim
+ * dirty, delwri => flush and requeue
+ * dirty, sync => flush, wait and reclaim
*/
STATIC int
xfs_reclaim_inode(
@@ -741,7 +763,7 @@ xfs_reclaim_inode(
struct xfs_perag *pag,
int sync_mode)
{
- int error;
+ int error = 0;
/*
* The radix tree lock here protects a thread in xfs_iget from racing
@@ -761,7 +783,11 @@ xfs_reclaim_inode(
write_unlock(&pag->pag_ici_lock);
xfs_ilock(ip, XFS_ILOCK_EXCL);
- xfs_iflock(ip);
+ if (!xfs_iflock_nowait(ip)) {
+ if (!(sync_mode & SYNC_WAIT))
+ goto requeue_no_flock;
+ xfs_iflock(ip);
+ }
if (is_bad_inode(VFS_I(ip)))
goto reclaim;
@@ -769,8 +795,11 @@ xfs_reclaim_inode(
xfs_iunpin_wait(ip);
goto reclaim;
}
- if (xfs_ipincount(ip))
+ if (xfs_ipincount(ip)) {
+ if (!(sync_mode & SYNC_WAIT))
+ goto requeue;
xfs_iunpin_wait(ip);
+ }
if (xfs_iflags_test(ip, XFS_ISTALE))
goto reclaim;
if (xfs_inode_clean(ip))
@@ -778,25 +807,28 @@ xfs_reclaim_inode(
/* Now we have an inode that needs flushing */
error = xfs_iflush(ip, sync_mode);
- if (!error) {
- switch(sync_mode) {
- case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
- case XFS_IFLUSH_DELWRI:
- case XFS_IFLUSH_ASYNC:
- case XFS_IFLUSH_DELWRI_ELSE_SYNC:
- case XFS_IFLUSH_SYNC:
- /* IO issued, synchronise with IO completion */
- xfs_iflock(ip);
- default:
- ASSERT(0);
- break;
- }
- }
+ if (!(sync_mode & SYNC_WAIT))
+ goto requeue_no_flock;
+ xfs_iflock(ip);
reclaim:
xfs_ifunlock(ip);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
xfs_ireclaim(ip);
+ return error;
+
+requeue:
+ xfs_ifunlock(ip);
+requeue_no_flock:
+ xfs_iflags_clear(ip, XFS_IRECLAIM);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ /*
+ * We could return EAGAIN here to make reclaim rescan the inode tree in
+ * a short while. However, this just burns CPU time scanning the tree
+ * waiting for IO to complete and xfssyncd never goes back to the idle
+ * state. Instead, return 0 to let the next scheduled background reclaim
+ * attempt to reclaim the inode again.
+ */
return 0;
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8d0666d..fa31360 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2835,8 +2835,6 @@ xfs_iflush(
xfs_dinode_t *dip;
xfs_mount_t *mp;
int error;
- int noblock = (flags == XFS_IFLUSH_ASYNC_NOBLOCK);
- enum { INT_DELWRI = (1 << 0), INT_ASYNC = (1 << 1) };
XFS_STATS_INC(xs_iflush_count);
@@ -2859,7 +2857,7 @@ xfs_iflush(
* in the same cluster are dirty, they will probably write the inode
* out for us if they occur after the log force completes.
*/
- if (noblock && xfs_ipincount(ip)) {
+ if (!(flags & SYNC_WAIT) && xfs_ipincount(ip)) {
xfs_iunpin_nowait(ip);
xfs_ifunlock(ip);
return EAGAIN;
@@ -2893,60 +2891,10 @@ xfs_iflush(
}
/*
- * Decide how buffer will be flushed out. This is done before
- * the call to xfs_iflush_int because this field is zeroed by it.
- */
- if (iip != NULL && iip->ili_format.ilf_fields != 0) {
- /*
- * Flush out the inode buffer according to the directions
- * of the caller. In the cases where the caller has given
- * us a choice choose the non-delwri case. This is because
- * the inode is in the AIL and we need to get it out soon.
- */
- switch (flags) {
- case XFS_IFLUSH_SYNC:
- case XFS_IFLUSH_DELWRI_ELSE_SYNC:
- flags = 0;
- break;
- case XFS_IFLUSH_ASYNC_NOBLOCK:
- case XFS_IFLUSH_ASYNC:
- case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
- flags = INT_ASYNC;
- break;
- case XFS_IFLUSH_DELWRI:
- flags = INT_DELWRI;
- break;
- default:
- ASSERT(0);
- flags = 0;
- break;
- }
- } else {
- switch (flags) {
- case XFS_IFLUSH_DELWRI_ELSE_SYNC:
- case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
- case XFS_IFLUSH_DELWRI:
- flags = INT_DELWRI;
- break;
- case XFS_IFLUSH_ASYNC_NOBLOCK:
- case XFS_IFLUSH_ASYNC:
- flags = INT_ASYNC;
- break;
- case XFS_IFLUSH_SYNC:
- flags = 0;
- break;
- default:
- ASSERT(0);
- flags = 0;
- break;
- }
- }
-
- /*
* Get the buffer containing the on-disk inode.
*/
error = xfs_itobp(mp, NULL, ip, &dip, &bp,
- noblock ? XBF_TRYLOCK : XBF_LOCK);
+ (flags & SYNC_WAIT) ? XBF_LOCK : XBF_TRYLOCK);
if (error || !bp) {
xfs_ifunlock(ip);
return error;
@@ -2974,13 +2922,10 @@ xfs_iflush(
if (error)
goto cluster_corrupt_out;
- if (flags & INT_DELWRI) {
- xfs_bdwrite(mp, bp);
- } else if (flags & INT_ASYNC) {
- error = xfs_bawrite(mp, bp);
- } else {
+ if (flags & SYNC_WAIT)
error = xfs_bwrite(mp, bp);
- }
+ else
+ xfs_bdwrite(mp, bp);
return error;
corrupt_out:
@@ -3015,16 +2960,6 @@ xfs_iflush_int(
iip = ip->i_itemp;
mp = ip->i_mount;
-
- /*
- * If the inode isn't dirty, then just release the inode
- * flush lock and do nothing.
- */
- if (xfs_inode_clean(ip)) {
- xfs_ifunlock(ip);
- return 0;
- }
-
/* set *dip = inode's place in the buffer */
dip = (xfs_dinode_t *)xfs_buf_offset(bp, ip->i_imap.im_boffset);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 8b618ea..6c912b0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -420,16 +420,6 @@ static inline void xfs_ifunlock(xfs_inode_t *ip)
#define XFS_ILOCK_DEP(flags) (((flags) & XFS_ILOCK_DEP_MASK) >> XFS_ILOCK_SHIFT)
/*
- * Flags for xfs_iflush()
- */
-#define XFS_IFLUSH_DELWRI_ELSE_SYNC 1
-#define XFS_IFLUSH_DELWRI_ELSE_ASYNC 2
-#define XFS_IFLUSH_SYNC 3
-#define XFS_IFLUSH_ASYNC 4
-#define XFS_IFLUSH_DELWRI 5
-#define XFS_IFLUSH_ASYNC_NOBLOCK 6
-
-/*
* Flags for xfs_itruncate_start().
*/
#define XFS_ITRUNC_DEFINITE 0x1
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 48ec1c0..9ec41e6 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -866,10 +866,14 @@ xfs_inode_item_push(
iip->ili_format.ilf_fields != 0);
/*
- * Write out the inode. The completion routine ('iflush_done') will
- * pull it from the AIL, mark it clean, unlock the flush lock.
+ * Push the inode to it's backing buffer. This will not remove
+ * the inode from the AIL - a further push will be required to trigger
+ * a buffer push. However, this allows all the dirty inodes to be pushed to
+ * the buffer before it is pushed to disk. THe buffer IO completion
+ * will pull th einode from the AIL, mark it clean and unlock the flush
+ * lock.
*/
- (void) xfs_iflush(ip, XFS_IFLUSH_ASYNC);
+ (void) xfs_iflush(ip, 0);
xfs_iunlock(ip, XFS_ILOCK_SHARED);
return;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 7f81ed7..9a036f4 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1456,7 +1456,18 @@ xfs_unmountfs(
* need to force the log first.
*/
xfs_log_force(mp, XFS_LOG_SYNC);
- xfs_reclaim_inodes(mp, XFS_IFLUSH_ASYNC);
+
+ /*
+ * Do a delwri reclaim pass first so that as many dirty inodes are
+ * queued up for IO as possible. Then flush the buffers before making
+ * a synchronous path to catch all the remaining inodes are reclaimed.
+ * This makes the reclaim process as quick as possible by avoiding
+ * synchronous writeout and blocking on inodes already in the delwri
+ * state as much as possible.
+ */
+ xfs_reclaim_inodes(mp, 0);
+ XFS_bflush(mp->m_ddev_targp);
+ xfs_reclaim_inodes(mp, SYNC_WAIT);
xfs_qm_unmount(mp);
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 22+ messages in thread* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
2010-01-25 6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
@ 2010-01-25 11:53 ` Christoph Hellwig
2010-01-25 22:31 ` Dave Chinner
2010-01-25 15:29 ` Christoph Hellwig
1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 11:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index 98b8937..ca0cc59 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -270,8 +270,7 @@ xfs_sync_inode_attr(
> goto out_unlock;
> }
>
> - error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
> - XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
> + error = xfs_iflush(ip, (flags & SYNC_WAIT));
No need for the masking here, as xfs_iflush simply ignores SYNC_TRYLOCK.
> /* Now we have an inode that needs flushing */
> error = xfs_iflush(ip, sync_mode);
> + if (!(sync_mode & SYNC_WAIT))
> + goto requeue_no_flock;
So for the !wait case we entirely ignore the return value? We should
at least check for an I/O error here I think. Also in this context
the requeue label name doesn't fit too well, even if it's the same
action as the requeue.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
2010-01-25 11:53 ` Christoph Hellwig
@ 2010-01-25 22:31 ` Dave Chinner
2010-01-26 7:28 ` Dave Chinner
2010-01-26 16:10 ` Christoph Hellwig
0 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 22:31 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Jan 25, 2010 at 06:53:08AM -0500, Christoph Hellwig wrote:
> > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > index 98b8937..ca0cc59 100644
> > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > @@ -270,8 +270,7 @@ xfs_sync_inode_attr(
> > goto out_unlock;
> > }
> >
> > - error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
> > - XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
> > + error = xfs_iflush(ip, (flags & SYNC_WAIT));
>
> No need for the masking here, as xfs_iflush simply ignores SYNC_TRYLOCK.
OK.
> > /* Now we have an inode that needs flushing */
> > error = xfs_iflush(ip, sync_mode);
> > + if (!(sync_mode & SYNC_WAIT))
> > + goto requeue_no_flock;
>
> So for the !wait case we entirely ignore the return value? We should
> at least check for an I/O error here I think.
I'm not sure we can get an error on a delayed write that we
need to take action on. We don't actually issue the IO from this
call, so the only error case is on reading the buffer. If we get
an error there, what should be do?
My thinking was that if we can't write back the inode, we can't
reclaim it, and if the error is from a transient read error (e.g. FC
path failover) we should be retrying anyway as we do in
xfs_buf_iodone_callbacks()...
> Also in this context
> the requeue label name doesn't fit too well, even if it's the same
> action as the requeue.
I'm open to suggestions for a better name ;)
Cheers,
Dave
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
2010-01-25 22:31 ` Dave Chinner
@ 2010-01-26 7:28 ` Dave Chinner
2010-01-26 16:10 ` Christoph Hellwig
1 sibling, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-26 7:28 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Tue, Jan 26, 2010 at 09:31:40AM +1100, Dave Chinner wrote:
> On Mon, Jan 25, 2010 at 06:53:08AM -0500, Christoph Hellwig wrote:
> > > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > > index 98b8937..ca0cc59 100644
> > > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > > @@ -270,8 +270,7 @@ xfs_sync_inode_attr(
> > > goto out_unlock;
> > > }
> > >
> > > - error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
> > > - XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
> > > + error = xfs_iflush(ip, (flags & SYNC_WAIT));
> >
> > No need for the masking here, as xfs_iflush simply ignores SYNC_TRYLOCK.
>
> OK.
>
> > > /* Now we have an inode that needs flushing */
> > > error = xfs_iflush(ip, sync_mode);
> > > + if (!(sync_mode & SYNC_WAIT))
> > > + goto requeue_no_flock;
> >
> > So for the !wait case we entirely ignore the return value? We should
> > at least check for an I/O error here I think.
>
> I'm not sure we can get an error on a delayed write that we
> need to take action on. We don't actually issue the IO from this
> call, so the only error case is on reading the buffer. If we get
> an error there, what should be do?
>
> My thinking was that if we can't write back the inode, we can't
> reclaim it, and if the error is from a transient read error (e.g. FC
> path failover) we should be retrying anyway as we do in
> xfs_buf_iodone_callbacks()...
I forgot to mention that it will get caught by a sync
reclaim pass when there is a context to return the error to....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
2010-01-25 22:31 ` Dave Chinner
2010-01-26 7:28 ` Dave Chinner
@ 2010-01-26 16:10 ` Christoph Hellwig
2010-02-01 22:59 ` Dave Chinner
1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-26 16:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs
> > Also in this context
> > the requeue label name doesn't fit too well, even if it's the same
> > action as the requeue.
>
> I'm open to suggestions for a better name ;)
What do you think about this version?
Index: xfs/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_sync.c 2010-01-26 17:01:49.610012226 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_sync.c 2010-01-26 17:06:50.533006103 +0100
@@ -785,7 +785,7 @@ xfs_reclaim_inode(
xfs_ilock(ip, XFS_ILOCK_EXCL);
if (!xfs_iflock_nowait(ip)) {
if (!(sync_mode & SYNC_WAIT))
- goto requeue_no_flock;
+ goto out;
xfs_iflock(ip);
}
@@ -796,8 +796,10 @@ xfs_reclaim_inode(
goto reclaim;
}
if (xfs_ipincount(ip)) {
- if (!(sync_mode & SYNC_WAIT))
- goto requeue;
+ if (!(sync_mode & SYNC_WAIT)) {
+ xfs_ifunlock(ip);
+ goto out;
+ }
xfs_iunpin_wait(ip);
}
if (xfs_iflags_test(ip, XFS_ISTALE))
@@ -807,19 +809,18 @@ xfs_reclaim_inode(
/* Now we have an inode that needs flushing */
error = xfs_iflush(ip, sync_mode);
- if (!(sync_mode & SYNC_WAIT))
- goto requeue_no_flock;
- xfs_iflock(ip);
-
-reclaim:
- xfs_ifunlock(ip);
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
- xfs_ireclaim(ip);
- return error;
+ if (sync_mode & SYNC_WAIT) {
+ xfs_iflock(ip);
+ goto reclaim;
+ }
-requeue:
- xfs_ifunlock(ip);
-requeue_no_flock:
+ /*
+ * When we have to flush an inode but don't have SYNC_WAIT set, we
+ * flush the inode out with using a delwri buffer and wait for
+ * the next call into the reclaim to find it in a clean state
+ * instead of doing it now.
+ */
+out:
xfs_iflags_clear(ip, XFS_IRECLAIM);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
/*
@@ -830,6 +831,12 @@ requeue_no_flock:
* attempt to reclaim the inode again.
*/
return 0;
+
+reclaim:
+ xfs_ifunlock(ip);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ xfs_ireclaim(ip);
+ return error;
}
int
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
2010-01-26 16:10 ` Christoph Hellwig
@ 2010-02-01 22:59 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-02-01 22:59 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Tue, Jan 26, 2010 at 11:10:45AM -0500, Christoph Hellwig wrote:
> > > Also in this context
> > > the requeue label name doesn't fit too well, even if it's the same
> > > action as the requeue.
> >
> > I'm open to suggestions for a better name ;)
>
> What do you think about this version?
Yeah, that looks a lot better. I'll incorporate that into the
patch. Thanks.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/7] xfs: Use delayed write for inodes rather than async
2010-01-25 6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
2010-01-25 11:53 ` Christoph Hellwig
@ 2010-01-25 15:29 ` Christoph Hellwig
1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 15:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> /*
> - * Write out the inode. The completion routine ('iflush_done') will
> - * pull it from the AIL, mark it clean, unlock the flush lock.
> + * Push the inode to it's backing buffer. This will not remove
> + * the inode from the AIL - a further push will be required to trigger
> + * a buffer push. However, this allows all the dirty inodes to be pushed to
Btw, this creates a rather annoying and pointless too long line in the
source file.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push
2010-01-25 6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
2010-01-25 6:22 ` [PATCH 1/7] xfs: Make inode reclaim states explicit Dave Chinner
2010-01-25 6:22 ` [PATCH 2/7] xfs: Use delayed write for inodes rather than async Dave Chinner
@ 2010-01-25 6:22 ` Dave Chinner
2010-01-25 11:59 ` Christoph Hellwig
2010-01-25 6:22 ` [PATCH 4/7] xfs: Sort delayed write buffers before dispatch Dave Chinner
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 6:22 UTC (permalink / raw)
To: xfs
All buffers logged into the AIL are marked as delayed write.
When the AIL needs to push the buffer out, it issues an async write of the
buffer. This means that IO patterns are dependent on the order of
buffers in the AIL.
Instead of flushing the buffer, promote the buffer in the delayed
write list so that the next time the xfsbufd is run the buffer will
be flushed by the xfsbufd. Return the state to the xfsaild that the
buffer was promoted so that the xfsaild knows that it needs to cause
the xfsbufd to run to flush the buffers that were promoted.
Using the xfsbufd for issuing the IO allows us to dispatch all
buffer IO from the one queue. This means that we can make much more
enlightened decisions on what order to flush buffers to disk as
we don't have multiple places issuing IO. Optimisations to xfsbufd
will be in a future patch.
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_buf.c | 29 ++++++++++++
fs/xfs/linux-2.6/xfs_buf.h | 2 +
fs/xfs/linux-2.6/xfs_trace.h | 1 +
fs/xfs/quota/xfs_dquot_item.c | 85 +++++------------------------------
fs/xfs/quota/xfs_dquot_item.h | 4 --
fs/xfs/xfs_buf_item.c | 64 +++++++++++++++------------
fs/xfs/xfs_inode_item.c | 98 ++++++----------------------------------
fs/xfs/xfs_inode_item.h | 6 ---
fs/xfs/xfs_trans_ail.c | 7 +++
9 files changed, 101 insertions(+), 195 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 44e20e5..b306265 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1778,6 +1778,35 @@ xfs_buf_delwri_dequeue(
trace_xfs_buf_delwri_dequeue(bp, _RET_IP_);
}
+/*
+ * If a delwri buffer needs to be pushed before it has aged out, then promote
+ * it to the head of the delwri queue so that it will be flushed on the next
+ * xfsbufd run. We do this by resetting the queuetime of the buffer to be older
+ * than the age currently needed to flush the buffer. Hence the next time the
+ * xfsbufd sees it is guaranteed to be considered old enough to flush.
+ */
+void
+xfs_buf_delwri_promote(
+ struct xfs_buf *bp)
+{
+ struct xfs_buftarg *btp = bp->b_target;
+ long age = xfs_buf_age_centisecs * msecs_to_jiffies(10) + 1;
+
+ ASSERT(bp->b_flags & XBF_DELWRI);
+ ASSERT(bp->b_flags & _XBF_DELWRI_Q);
+
+ /*
+ * Check the buffer age before locking the delayed write queue as we
+ * don't need to promote buffers that are already past the flush age.
+ */
+ if (bp->b_queuetime < jiffies - age)
+ return;
+ bp->b_queuetime = jiffies - age;
+ spin_lock(&btp->bt_delwrite_lock);
+ list_move(&bp->b_list, &btp->bt_delwrite_queue);
+ spin_unlock(&btp->bt_delwrite_lock);
+}
+
STATIC void
xfs_buf_runall_queues(
struct workqueue_struct *queue)
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index ea8c198..be45e8c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -266,6 +266,7 @@ extern int xfs_buf_ispin(xfs_buf_t *);
/* Delayed Write Buffer Routines */
extern void xfs_buf_delwri_dequeue(xfs_buf_t *);
+extern void xfs_buf_delwri_promote(xfs_buf_t *);
/* Buffer Daemon Setup Routines */
extern int xfs_buf_init(void);
@@ -395,6 +396,7 @@ extern void xfs_free_buftarg(struct xfs_mount *, struct xfs_buftarg *);
extern void xfs_wait_buftarg(xfs_buftarg_t *);
extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
+
#ifdef CONFIG_KDB_MODULES
extern struct list_head *xfs_get_buftarg_list(void);
#endif
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index 1bb09e7..a4574dc 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -483,6 +483,7 @@ DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock);
DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock_stale);
DEFINE_BUF_ITEM_EVENT(xfs_buf_item_committed);
DEFINE_BUF_ITEM_EVENT(xfs_buf_item_push);
+DEFINE_BUF_ITEM_EVENT(xfs_buf_item_pushbuf);
DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf);
DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf_recur);
DEFINE_BUF_ITEM_EVENT(xfs_trans_getsb);
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index 1b56437..dda0fb0 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -212,66 +212,31 @@ xfs_qm_dquot_logitem_pushbuf(
xfs_dquot_t *dqp;
xfs_mount_t *mp;
xfs_buf_t *bp;
- uint dopush;
dqp = qip->qli_dquot;
ASSERT(XFS_DQ_IS_LOCKED(dqp));
/*
- * The qli_pushbuf_flag keeps others from
- * trying to duplicate our effort.
- */
- ASSERT(qip->qli_pushbuf_flag != 0);
- ASSERT(qip->qli_push_owner == current_pid());
-
- /*
* If flushlock isn't locked anymore, chances are that the
* inode flush completed and the inode was taken off the AIL.
* So, just get out.
*/
if (completion_done(&dqp->q_flush) ||
((qip->qli_item.li_flags & XFS_LI_IN_AIL) == 0)) {
- qip->qli_pushbuf_flag = 0;
xfs_dqunlock(dqp);
return;
}
mp = dqp->q_mount;
bp = xfs_incore(mp->m_ddev_targp, qip->qli_format.qlf_blkno,
XFS_QI_DQCHUNKLEN(mp), XBF_TRYLOCK);
- if (bp != NULL) {
- if (XFS_BUF_ISDELAYWRITE(bp)) {
- dopush = ((qip->qli_item.li_flags & XFS_LI_IN_AIL) &&
- !completion_done(&dqp->q_flush));
- qip->qli_pushbuf_flag = 0;
- xfs_dqunlock(dqp);
-
- if (XFS_BUF_ISPINNED(bp))
- xfs_log_force(mp, 0);
-
- if (dopush) {
- int error;
-#ifdef XFSRACEDEBUG
- delay_for_intr();
- delay(300);
-#endif
- error = xfs_bawrite(mp, bp);
- if (error)
- xfs_fs_cmn_err(CE_WARN, mp,
- "xfs_qm_dquot_logitem_pushbuf: pushbuf error %d on qip %p, bp %p",
- error, qip, bp);
- } else {
- xfs_buf_relse(bp);
- }
- } else {
- qip->qli_pushbuf_flag = 0;
- xfs_dqunlock(dqp);
- xfs_buf_relse(bp);
- }
+ xfs_dqunlock(dqp);
+ if (!bp)
return;
- }
+ if (XFS_BUF_ISDELAYWRITE(bp))
+ xfs_buf_delwri_promote(bp);
+ xfs_buf_relse(bp);
+ return;
- qip->qli_pushbuf_flag = 0;
- xfs_dqunlock(dqp);
}
/*
@@ -289,50 +254,24 @@ xfs_qm_dquot_logitem_trylock(
xfs_dq_logitem_t *qip)
{
xfs_dquot_t *dqp;
- uint retval;
dqp = qip->qli_dquot;
if (atomic_read(&dqp->q_pincount) > 0)
- return (XFS_ITEM_PINNED);
+ return XFS_ITEM_PINNED;
if (! xfs_qm_dqlock_nowait(dqp))
- return (XFS_ITEM_LOCKED);
+ return XFS_ITEM_LOCKED;
- retval = XFS_ITEM_SUCCESS;
if (!xfs_dqflock_nowait(dqp)) {
/*
- * The dquot is already being flushed. It may have been
- * flushed delayed write, however, and we don't want to
- * get stuck waiting for that to complete. So, we want to check
- * to see if we can lock the dquot's buffer without sleeping.
- * If we can and it is marked for delayed write, then we
- * hold it and send it out from the push routine. We don't
- * want to do that now since we might sleep in the device
- * strategy routine. We also don't want to grab the buffer lock
- * here because we'd like not to call into the buffer cache
- * while holding the AIL lock.
- * Make sure to only return PUSHBUF if we set pushbuf_flag
- * ourselves. If someone else is doing it then we don't
- * want to go to the push routine and duplicate their efforts.
+ * dquot has already been flushed to the backing buffer,
+ * leave it locked, pushbuf routine will unlock it.
*/
- if (qip->qli_pushbuf_flag == 0) {
- qip->qli_pushbuf_flag = 1;
- ASSERT(qip->qli_format.qlf_blkno == dqp->q_blkno);
-#ifdef DEBUG
- qip->qli_push_owner = current_pid();
-#endif
- /*
- * The dquot is left locked.
- */
- retval = XFS_ITEM_PUSHBUF;
- } else {
- retval = XFS_ITEM_FLUSHING;
- xfs_dqunlock_nonotify(dqp);
- }
+ return XFS_ITEM_PUSHBUF;
}
ASSERT(qip->qli_item.li_flags & XFS_LI_IN_AIL);
- return (retval);
+ return XFS_ITEM_SUCCESS;
}
diff --git a/fs/xfs/quota/xfs_dquot_item.h b/fs/xfs/quota/xfs_dquot_item.h
index 5a63253..5acae2a 100644
--- a/fs/xfs/quota/xfs_dquot_item.h
+++ b/fs/xfs/quota/xfs_dquot_item.h
@@ -27,10 +27,6 @@ typedef struct xfs_dq_logitem {
xfs_log_item_t qli_item; /* common portion */
struct xfs_dquot *qli_dquot; /* dquot ptr */
xfs_lsn_t qli_flush_lsn; /* lsn at last flush */
- unsigned short qli_pushbuf_flag; /* 1 bit used in push_ail */
-#ifdef DEBUG
- uint64_t qli_push_owner;
-#endif
xfs_dq_logformat_t qli_format; /* logged structure */
} xfs_dq_logitem_t;
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index e0a1158..f3c49e6 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -467,8 +467,10 @@ xfs_buf_item_unpin_remove(
/*
* This is called to attempt to lock the buffer associated with this
* buf log item. Don't sleep on the buffer lock. If we can't get
- * the lock right away, return 0. If we can get the lock, pull the
- * buffer from the free list, mark it busy, and return 1.
+ * the lock right away, return 0. If we can get the lock, take a
+ * reference to the buffer. If this is a delayed write buffer that
+ * needs AIL help to be written back, invoke the pushbuf routine
+ * rather than the normal success path.
*/
STATIC uint
xfs_buf_item_trylock(
@@ -477,24 +479,18 @@ xfs_buf_item_trylock(
xfs_buf_t *bp;
bp = bip->bli_buf;
-
- if (XFS_BUF_ISPINNED(bp)) {
+ if (XFS_BUF_ISPINNED(bp))
return XFS_ITEM_PINNED;
- }
-
- if (!XFS_BUF_CPSEMA(bp)) {
+ if (!XFS_BUF_CPSEMA(bp))
return XFS_ITEM_LOCKED;
- }
- /*
- * Remove the buffer from the free list. Only do this
- * if it's on the free list. Private buffers like the
- * superblock buffer are not.
- */
+ /* take a reference to the buffer. */
XFS_BUF_HOLD(bp);
ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
trace_xfs_buf_item_trylock(bip);
+ if (XFS_BUF_ISDELAYWRITE(bp))
+ return XFS_ITEM_PUSHBUF;
return XFS_ITEM_SUCCESS;
}
@@ -626,11 +622,9 @@ xfs_buf_item_committed(
}
/*
- * This is called to asynchronously write the buffer associated with this
- * buf log item out to disk. The buffer will already have been locked by
- * a successful call to xfs_buf_item_trylock(). If the buffer still has
- * B_DELWRI set, then get it going out to disk with a call to bawrite().
- * If not, then just release the buffer.
+ * The buffer is locked, but is not a delayed write buffer. This happens
+ * if we race with IO completion and hence we don't want to try to write it
+ * again. Just release the buffer.
*/
STATIC void
xfs_buf_item_push(
@@ -642,17 +636,29 @@ xfs_buf_item_push(
trace_xfs_buf_item_push(bip);
bp = bip->bli_buf;
+ ASSERT(!XFS_BUF_ISDELAYWRITE(bp));
+ xfs_buf_relse(bp);
+}
- if (XFS_BUF_ISDELAYWRITE(bp)) {
- int error;
- error = xfs_bawrite(bip->bli_item.li_mountp, bp);
- if (error)
- xfs_fs_cmn_err(CE_WARN, bip->bli_item.li_mountp,
- "xfs_buf_item_push: pushbuf error %d on bip %p, bp %p",
- error, bip, bp);
- } else {
- xfs_buf_relse(bp);
- }
+/*
+ * The buffer is locked and is a delayed write buffer. Promote the buffer
+ * in the delayed write queue as the caller knows that they must invoke
+ * the xfsbufd to get this buffer written. We have to unlock the buffer
+ * to allow the xfsbufd to write it, too.
+ */
+STATIC void
+xfs_buf_item_pushbuf(
+ xfs_buf_log_item_t *bip)
+{
+ xfs_buf_t *bp;
+
+ ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
+ trace_xfs_buf_item_pushbuf(bip);
+
+ bp = bip->bli_buf;
+ ASSERT(XFS_BUF_ISDELAYWRITE(bp));
+ xfs_buf_delwri_promote(bp);
+ xfs_buf_relse(bp);
}
/* ARGSUSED */
@@ -677,7 +683,7 @@ static struct xfs_item_ops xfs_buf_item_ops = {
.iop_committed = (xfs_lsn_t(*)(xfs_log_item_t*, xfs_lsn_t))
xfs_buf_item_committed,
.iop_push = (void(*)(xfs_log_item_t*))xfs_buf_item_push,
- .iop_pushbuf = NULL,
+ .iop_pushbuf = (void(*)(xfs_log_item_t*))xfs_buf_item_pushbuf,
.iop_committing = (void(*)(xfs_log_item_t*, xfs_lsn_t))
xfs_buf_item_committing
};
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 9ec41e6..25387b0 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -602,33 +602,20 @@ xfs_inode_item_trylock(
if (!xfs_iflock_nowait(ip)) {
/*
- * If someone else isn't already trying to push the inode
- * buffer, we get to do it.
+ * inode has already been flushed to the backing buffer,
+ * leave it locked in shared mode, pushbuf routine will
+ * unlock it.
*/
- if (iip->ili_pushbuf_flag == 0) {
- iip->ili_pushbuf_flag = 1;
-#ifdef DEBUG
- iip->ili_push_owner = current_pid();
-#endif
- /*
- * Inode is left locked in shared mode.
- * Pushbuf routine gets to unlock it.
- */
- return XFS_ITEM_PUSHBUF;
- } else {
- /*
- * We hold the AIL lock, so we must specify the
- * NONOTIFY flag so that we won't double trip.
- */
- xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
- return XFS_ITEM_FLUSHING;
- }
- /* NOTREACHED */
+ return XFS_ITEM_PUSHBUF;
}
/* Stale items should force out the iclog */
if (ip->i_flags & XFS_ISTALE) {
xfs_ifunlock(ip);
+ /*
+ * we hold the AIL lock - notify the unlock routine of this
+ * so it doesn't try to get the lock again.
+ */
xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
return XFS_ITEM_PINNED;
}
@@ -746,11 +733,8 @@ xfs_inode_item_committed(
* This gets called by xfs_trans_push_ail(), when IOP_TRYLOCK
* failed to get the inode flush lock but did get the inode locked SHARED.
* Here we're trying to see if the inode buffer is incore, and if so whether it's
- * marked delayed write. If that's the case, we'll initiate a bawrite on that
- * buffer to expedite the process.
- *
- * We aren't holding the AIL lock (or the flush lock) when this gets called,
- * so it is inherently race-y.
+ * marked delayed write. If that's the case, we'll promote it and that will
+ * allow the caller to write the buffer by triggering the xfsbufd to run.
*/
STATIC void
xfs_inode_item_pushbuf(
@@ -759,26 +743,16 @@ xfs_inode_item_pushbuf(
xfs_inode_t *ip;
xfs_mount_t *mp;
xfs_buf_t *bp;
- uint dopush;
ip = iip->ili_inode;
-
ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
/*
- * The ili_pushbuf_flag keeps others from
- * trying to duplicate our effort.
- */
- ASSERT(iip->ili_pushbuf_flag != 0);
- ASSERT(iip->ili_push_owner == current_pid());
-
- /*
* If a flush is not in progress anymore, chances are that the
* inode was taken off the AIL. So, just get out.
*/
if (completion_done(&ip->i_flush) ||
((iip->ili_item.li_flags & XFS_LI_IN_AIL) == 0)) {
- iip->ili_pushbuf_flag = 0;
xfs_iunlock(ip, XFS_ILOCK_SHARED);
return;
}
@@ -787,53 +761,12 @@ xfs_inode_item_pushbuf(
bp = xfs_incore(mp->m_ddev_targp, iip->ili_format.ilf_blkno,
iip->ili_format.ilf_len, XBF_TRYLOCK);
- if (bp != NULL) {
- if (XFS_BUF_ISDELAYWRITE(bp)) {
- /*
- * We were racing with iflush because we don't hold
- * the AIL lock or the flush lock. However, at this point,
- * we have the buffer, and we know that it's dirty.
- * So, it's possible that iflush raced with us, and
- * this item is already taken off the AIL.
- * If not, we can flush it async.
- */
- dopush = ((iip->ili_item.li_flags & XFS_LI_IN_AIL) &&
- !completion_done(&ip->i_flush));
- iip->ili_pushbuf_flag = 0;
- xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
- trace_xfs_inode_item_push(bp, _RET_IP_);
-
- if (XFS_BUF_ISPINNED(bp))
- xfs_log_force(mp, 0);
-
- if (dopush) {
- int error;
- error = xfs_bawrite(mp, bp);
- if (error)
- xfs_fs_cmn_err(CE_WARN, mp,
- "xfs_inode_item_pushbuf: pushbuf error %d on iip %p, bp %p",
- error, iip, bp);
- } else {
- xfs_buf_relse(bp);
- }
- } else {
- iip->ili_pushbuf_flag = 0;
- xfs_iunlock(ip, XFS_ILOCK_SHARED);
- xfs_buf_relse(bp);
- }
- return;
- }
- /*
- * We have to be careful about resetting pushbuf flag too early (above).
- * Even though in theory we can do it as soon as we have the buflock,
- * we don't want others to be doing work needlessly. They'll come to
- * this function thinking that pushing the buffer is their
- * responsibility only to find that the buffer is still locked by
- * another doing the same thing
- */
- iip->ili_pushbuf_flag = 0;
xfs_iunlock(ip, XFS_ILOCK_SHARED);
+ if (!bp)
+ return;
+ if (XFS_BUF_ISDELAYWRITE(bp))
+ xfs_buf_delwri_promote(bp);
+ xfs_buf_relse(bp);
return;
}
@@ -937,7 +870,6 @@ xfs_inode_item_init(
/*
We have zeroed memory. No need ...
iip->ili_extents_buf = NULL;
- iip->ili_pushbuf_flag = 0;
*/
iip->ili_format.ilf_type = XFS_LI_INODE;
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index cc8df1a..9a46795 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -144,12 +144,6 @@ typedef struct xfs_inode_log_item {
data exts */
struct xfs_bmbt_rec *ili_aextents_buf; /* array of logged
attr exts */
- unsigned int ili_pushbuf_flag; /* one bit used in push_ail */
-
-#ifdef DEBUG
- uint64_t ili_push_owner; /* one who sets pushbuf_flag
- above gets to push the buf */
-#endif
#ifdef XFS_TRANS_DEBUG
int ili_root_size;
char *ili_orig_root;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index d7b1af8..760ceeb 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -253,6 +253,7 @@ xfsaild_push(
int flush_log, count, stuck;
xfs_mount_t *mp = ailp->xa_mount;
struct xfs_ail_cursor *cur = &ailp->xa_cursors;
+ int push_xfsbufd = 0;
spin_lock(&ailp->xa_lock);
xfs_trans_ail_cursor_init(ailp, cur);
@@ -308,6 +309,7 @@ xfsaild_push(
XFS_STATS_INC(xs_push_ail_pushbuf);
IOP_PUSHBUF(lip);
last_pushed_lsn = lsn;
+ push_xfsbufd = 1;
break;
case XFS_ITEM_PINNED:
@@ -374,6 +376,11 @@ xfsaild_push(
xfs_log_force(mp, 0);
}
+ if (push_xfsbufd) {
+ /* we've got delayed write buffers to flush */
+ wake_up_process(mp->m_ddev_targp->bt_task);
+ }
+
if (!count) {
/* We're past our target or empty, so idle */
last_pushed_lsn = 0;
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 22+ messages in thread* [PATCH 4/7] xfs: Sort delayed write buffers before dispatch
2010-01-25 6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
` (2 preceding siblings ...)
2010-01-25 6:22 ` [PATCH 3/7] xfs: Don't issue buffer IO direct from AIL push Dave Chinner
@ 2010-01-25 6:22 ` Dave Chinner
2010-01-25 12:00 ` Christoph Hellwig
2010-01-25 6:22 ` [PATCH 5/7] xfs: Use delay write promotion for dquot flushing Dave Chinner
` (2 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 6:22 UTC (permalink / raw)
To: xfs
Currently when the xfsbufd writes delayed write buffers, it pushes
them to disk in the order they come off the delayed write list. If
there are lots of buffers ѕpread widely over the disk, this results
in overwhelming the elevator sort queues in the block layer and we
end up losing the posibility of merging adjacent buffers to minimise
the number of IOs.
Use the new generic list_sort function to sort the delwri dispatch
queue before issue to ensure that the buffers are pushed in the most
friendly order possible to the lower layers.
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_buf.c | 88 ++++++++++++++++++++++++++++++-------------
1 files changed, 61 insertions(+), 27 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b306265..a53afb3 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -33,6 +33,7 @@
#include <linux/migrate.h>
#include <linux/backing-dev.h>
#include <linux/freezer.h>
+#include <linux/list_sort.h>
#include "xfs_sb.h"
#include "xfs_inum.h"
@@ -1877,14 +1878,42 @@ xfs_buf_delwri_split(
}
+/*
+ * Compare function is more complex than it needs to be because
+ * the return value is only 32 bits and we are doing comparisons
+ * on 64 bit values
+ */
+static int
+xfs_buf_cmp(
+ void *priv,
+ struct list_head *a,
+ struct list_head *b)
+{
+ struct xfs_buf *ap = container_of(a, struct xfs_buf, b_list);
+ struct xfs_buf *bp = container_of(b, struct xfs_buf, b_list);
+ xfs_daddr_t diff;
+
+ diff = ap->b_bn - bp->b_bn;
+ if (diff < 0)
+ return -1;
+ if (diff > 0)
+ return 1;
+ return 0;
+}
+
+void
+xfs_buf_delwri_sort(
+ xfs_buftarg_t *target,
+ struct list_head *list)
+{
+ list_sort(NULL, list, xfs_buf_cmp);
+}
+
STATIC int
xfsbufd(
void *data)
{
- struct list_head tmp;
- xfs_buftarg_t *target = (xfs_buftarg_t *)data;
- int count;
- xfs_buf_t *bp;
+ xfs_buftarg_t *target = (xfs_buftarg_t *)data;
current->flags |= PF_MEMALLOC;
@@ -1893,6 +1922,8 @@ xfsbufd(
do {
long age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
long tout = xfs_buf_timer_centisecs * msecs_to_jiffies(10);
+ int count = 0;
+ struct list_head tmp;
if (unlikely(freezing(current))) {
set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
@@ -1907,11 +1938,10 @@ xfsbufd(
schedule_timeout_interruptible(tout);
xfs_buf_delwri_split(target, &tmp, age);
- count = 0;
+ list_sort(NULL, &tmp, xfs_buf_cmp);
while (!list_empty(&tmp)) {
- bp = list_entry(tmp.next, xfs_buf_t, b_list);
- ASSERT(target == bp->b_target);
-
+ struct xfs_buf *bp;
+ bp = list_first_entry(&tmp, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
xfs_buf_iostrategy(bp);
count++;
@@ -1937,42 +1967,46 @@ xfs_flush_buftarg(
xfs_buftarg_t *target,
int wait)
{
- struct list_head tmp;
- xfs_buf_t *bp, *n;
+ xfs_buf_t *bp;
int pincount = 0;
+ LIST_HEAD(tmp_list);
+ LIST_HEAD(wait_list);
xfs_buf_runall_queues(xfsconvertd_workqueue);
xfs_buf_runall_queues(xfsdatad_workqueue);
xfs_buf_runall_queues(xfslogd_workqueue);
set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
- pincount = xfs_buf_delwri_split(target, &tmp, 0);
+ pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
/*
- * Dropped the delayed write list lock, now walk the temporary list
+ * Dropped the delayed write list lock, now walk the temporary list.
+ * All I/O is issued async and then if we need to wait for completion
+ * we do that after issuing all the IO.
*/
- list_for_each_entry_safe(bp, n, &tmp, b_list) {
+ list_sort(NULL, &tmp_list, xfs_buf_cmp);
+ while (!list_empty(&tmp_list)) {
+ struct xfs_buf *bp;
+ bp = list_first_entry(&tmp_list, struct xfs_buf, b_list);
ASSERT(target == bp->b_target);
- if (wait)
+ list_del_init(&bp->b_list);
+ if (wait) {
bp->b_flags &= ~XBF_ASYNC;
- else
- list_del_init(&bp->b_list);
-
+ list_add(&bp->b_list, &wait_list);
+ }
xfs_buf_iostrategy(bp);
}
- if (wait)
+ if (wait) {
+ /* Expedite and wait for IO to complete. */
blk_run_address_space(target->bt_mapping);
+ while (!list_empty(&wait_list)) {
+ bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
- /*
- * Remaining list items must be flushed before returning
- */
- while (!list_empty(&tmp)) {
- bp = list_entry(tmp.next, xfs_buf_t, b_list);
-
- list_del_init(&bp->b_list);
- xfs_iowait(bp);
- xfs_buf_relse(bp);
+ list_del_init(&bp->b_list);
+ xfs_iowait(bp);
+ xfs_buf_relse(bp);
+ }
}
return pincount;
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 22+ messages in thread* Re: [PATCH 4/7] xfs: Sort delayed write buffers before dispatch
2010-01-25 6:22 ` [PATCH 4/7] xfs: Sort delayed write buffers before dispatch Dave Chinner
@ 2010-01-25 12:00 ` Christoph Hellwig
2010-01-26 4:13 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 12:00 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> @@ -1937,42 +1967,46 @@ xfs_flush_buftarg(
> xfs_buftarg_t *target,
> int wait)
> {
> - struct list_head tmp;
> - xfs_buf_t *bp, *n;
> + xfs_buf_t *bp;
> int pincount = 0;
> + LIST_HEAD(tmp_list);
> + LIST_HEAD(wait_list);
>
> xfs_buf_runall_queues(xfsconvertd_workqueue);
> xfs_buf_runall_queues(xfsdatad_workqueue);
> xfs_buf_runall_queues(xfslogd_workqueue);
>
> set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> - pincount = xfs_buf_delwri_split(target, &tmp, 0);
> + pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
>
> /*
> - * Dropped the delayed write list lock, now walk the temporary list
> + * Dropped the delayed write list lock, now walk the temporary list.
> + * All I/O is issued async and then if we need to wait for completion
> + * we do that after issuing all the IO.
> */
> - list_for_each_entry_safe(bp, n, &tmp, b_list) {
> + list_sort(NULL, &tmp_list, xfs_buf_cmp);
> + while (!list_empty(&tmp_list)) {
> + struct xfs_buf *bp;
This now has a bp variable both in functionp-wide scope and a local one
here. Might be worth to decide for either style.
Otherwise looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH 4/7] xfs: Sort delayed write buffers before dispatch
2010-01-25 12:00 ` Christoph Hellwig
@ 2010-01-26 4:13 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-26 4:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Jan 25, 2010 at 07:00:54AM -0500, Christoph Hellwig wrote:
> > @@ -1937,42 +1967,46 @@ xfs_flush_buftarg(
> > xfs_buftarg_t *target,
> > int wait)
> > {
> > - struct list_head tmp;
> > - xfs_buf_t *bp, *n;
> > + xfs_buf_t *bp;
> > int pincount = 0;
> > + LIST_HEAD(tmp_list);
> > + LIST_HEAD(wait_list);
> >
> > xfs_buf_runall_queues(xfsconvertd_workqueue);
> > xfs_buf_runall_queues(xfsdatad_workqueue);
> > xfs_buf_runall_queues(xfslogd_workqueue);
> >
> > set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> > - pincount = xfs_buf_delwri_split(target, &tmp, 0);
> > + pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
> >
> > /*
> > - * Dropped the delayed write list lock, now walk the temporary list
> > + * Dropped the delayed write list lock, now walk the temporary list.
> > + * All I/O is issued async and then if we need to wait for completion
> > + * we do that after issuing all the IO.
> > */
> > - list_for_each_entry_safe(bp, n, &tmp, b_list) {
> > + list_sort(NULL, &tmp_list, xfs_buf_cmp);
> > + while (!list_empty(&tmp_list)) {
> > + struct xfs_buf *bp;
>
> This now has a bp variable both in functionp-wide scope and a local one
> here. Might be worth to decide for either style.
I removed the second declaration.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 5/7] xfs: Use delay write promotion for dquot flushing
2010-01-25 6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
` (3 preceding siblings ...)
2010-01-25 6:22 ` [PATCH 4/7] xfs: Sort delayed write buffers before dispatch Dave Chinner
@ 2010-01-25 6:22 ` Dave Chinner
2010-01-25 12:03 ` Christoph Hellwig
2010-01-25 6:22 ` [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags Dave Chinner
2010-01-25 6:22 ` [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously Dave Chinner
6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 6:22 UTC (permalink / raw)
To: xfs
xfs_qm_dqflock_pushbuf_wait() does a very similar trick to item
pushing used to do to flush out delayed write dquot buffers. Change
it to use the new promotion method rather than an async flush.
Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock
held, yet the callers make the assumption that after this call the
flush lock is held. Always return with the flush lock held.
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/quota/xfs_dquot.c | 25 ++++++++++---------------
1 files changed, 10 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index f9baeed..1620a56 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -1528,21 +1528,16 @@ xfs_qm_dqflock_pushbuf_wait(
*/
bp = xfs_incore(dqp->q_mount->m_ddev_targp, dqp->q_blkno,
XFS_QI_DQCHUNKLEN(dqp->q_mount), XBF_TRYLOCK);
- if (bp != NULL) {
- if (XFS_BUF_ISDELAYWRITE(bp)) {
- int error;
-
- if (XFS_BUF_ISPINNED(bp))
- xfs_log_force(dqp->q_mount, 0);
- error = xfs_bawrite(dqp->q_mount, bp);
- if (error)
- xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
- "xfs_qm_dqflock_pushbuf_wait: "
- "pushbuf error %d on dqp %p, bp %p",
- error, dqp, bp);
- } else {
- xfs_buf_relse(bp);
- }
+ if (!bp)
+ goto out_lock;
+
+ if (XFS_BUF_ISDELAYWRITE(bp)) {
+ if (XFS_BUF_ISPINNED(bp))
+ xfs_log_force(dqp->q_mount, 0);
+ xfs_buf_delwri_promote(bp);
+ wake_up_process(bp->b_target->bt_task);
}
+ xfs_buf_relse(bp);
+out_lock:
xfs_dqflock(dqp);
}
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 22+ messages in thread* Re: [PATCH 5/7] xfs: Use delay write promotion for dquot flushing
2010-01-25 6:22 ` [PATCH 5/7] xfs: Use delay write promotion for dquot flushing Dave Chinner
@ 2010-01-25 12:03 ` Christoph Hellwig
0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 12:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Jan 25, 2010 at 05:22:42PM +1100, Dave Chinner wrote:
> xfs_qm_dqflock_pushbuf_wait() does a very similar trick to item
> pushing used to do to flush out delayed write dquot buffers. Change
> it to use the new promotion method rather than an async flush.
>
> Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock
> held, yet the callers make the assumption that after this call the
> flush lock is held. Always return with the flush lock held.
>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
This also removes the last user of xfs_bawrite, so you might want to
kill it.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags
2010-01-25 6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
` (4 preceding siblings ...)
2010-01-25 6:22 ` [PATCH 5/7] xfs: Use delay write promotion for dquot flushing Dave Chinner
@ 2010-01-25 6:22 ` Dave Chinner
2010-01-25 12:04 ` Christoph Hellwig
2010-01-25 6:22 ` [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously Dave Chinner
6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 6:22 UTC (permalink / raw)
To: xfs
dquots are never flushed asynchronously. Remove the flag and the
async write support from the flush function. Make the default flush
a delwri flush to make the inode flush code, which leaves the
XFS_QMOPT_SYNC the only flag remaining. Convert that to use
SYNC_WAIT instead, just like the inode flush code.
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/quota/xfs_dquot.c | 13 ++++++-------
fs/xfs/quota/xfs_dquot_item.c | 2 +-
fs/xfs/quota/xfs_qm.c | 14 ++++++--------
fs/xfs/xfs_quota.h | 8 +-------
4 files changed, 14 insertions(+), 23 deletions(-)
diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index 1620a56..5f79dd7 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -1187,7 +1187,7 @@ xfs_qm_dqflush(
* block, nada.
*/
if (!XFS_DQ_IS_DIRTY(dqp) ||
- (!(flags & XFS_QMOPT_SYNC) && atomic_read(&dqp->q_pincount) > 0)) {
+ (!(flags & SYNC_WAIT) && atomic_read(&dqp->q_pincount) > 0)) {
xfs_dqfunlock(dqp);
return 0;
}
@@ -1251,18 +1251,17 @@ xfs_qm_dqflush(
xfs_log_force(mp, 0);
}
- if (flags & XFS_QMOPT_DELWRI) {
- xfs_bdwrite(mp, bp);
- } else {
+ if (flags & SYNC_WAIT)
error = xfs_bwrite(mp, bp);
- }
+ else
+ xfs_bdwrite(mp, bp);
trace_xfs_dqflush_done(dqp);
/*
* dqp is still locked, but caller is free to unlock it now.
*/
- return (error);
+ return error;
}
@@ -1443,7 +1442,7 @@ xfs_qm_dqpurge(
* We don't care about getting disk errors here. We need
* to purge this dquot anyway, so we go ahead regardless.
*/
- error = xfs_qm_dqflush(dqp, XFS_QMOPT_SYNC);
+ error = xfs_qm_dqflush(dqp, SYNC_WAIT);
if (error)
xfs_fs_cmn_err(CE_WARN, mp,
"xfs_qm_dqpurge: dquot %p flush failed", dqp);
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index dda0fb0..4e4ee9a 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -153,7 +153,7 @@ xfs_qm_dquot_logitem_push(
* lock without sleeping, then there must not have been
* anyone in the process of flushing the dquot.
*/
- error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+ error = xfs_qm_dqflush(dqp, 0);
if (error)
xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
"xfs_qm_dquot_logitem_push: push error %d on dqp %p",
diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c
index 11cfd82..8112a26 100644
--- a/fs/xfs/quota/xfs_qm.c
+++ b/fs/xfs/quota/xfs_qm.c
@@ -450,7 +450,7 @@ xfs_qm_unmount_quotas(
STATIC int
xfs_qm_dqflush_all(
xfs_mount_t *mp,
- int flags)
+ int sync_mode)
{
int recl;
xfs_dquot_t *dqp;
@@ -486,7 +486,7 @@ again:
* across a disk write.
*/
xfs_qm_mplist_unlock(mp);
- error = xfs_qm_dqflush(dqp, flags);
+ error = xfs_qm_dqflush(dqp, sync_mode);
xfs_dqunlock(dqp);
if (error)
return error;
@@ -926,13 +926,11 @@ xfs_qm_sync(
{
int recl, restarts;
xfs_dquot_t *dqp;
- uint flush_flags;
int error;
if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
return 0;
- flush_flags = (flags & SYNC_WAIT) ? XFS_QMOPT_SYNC : XFS_QMOPT_DELWRI;
restarts = 0;
again:
@@ -992,7 +990,7 @@ xfs_qm_sync(
* across a disk write
*/
xfs_qm_mplist_unlock(mp);
- error = xfs_qm_dqflush(dqp, flush_flags);
+ error = xfs_qm_dqflush(dqp, (flags & SYNC_WAIT));
xfs_dqunlock(dqp);
if (error && XFS_FORCED_SHUTDOWN(mp))
return 0; /* Need to prevent umount failure */
@@ -1796,7 +1794,7 @@ xfs_qm_quotacheck(
* successfully.
*/
if (!error)
- error = xfs_qm_dqflush_all(mp, XFS_QMOPT_DELWRI);
+ error = xfs_qm_dqflush_all(mp, 0);
/*
* We can get this error if we couldn't do a dquot allocation inside
@@ -2018,7 +2016,7 @@ xfs_qm_shake_freelist(
* We flush it delayed write, so don't bother
* releasing the mplock.
*/
- error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+ error = xfs_qm_dqflush(dqp, 0);
if (error) {
xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
"xfs_qm_dqflush_all: dquot %p flush failed", dqp);
@@ -2201,7 +2199,7 @@ xfs_qm_dqreclaim_one(void)
* We flush it delayed write, so don't bother
* releasing the freelist lock.
*/
- error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+ error = xfs_qm_dqflush(dqp, 0);
if (error) {
xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
"xfs_qm_dqreclaim: dquot %p flush failed", dqp);
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 21d11d9..fdcab3f 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -223,15 +223,9 @@ typedef struct xfs_qoff_logformat {
#define XFS_QMOPT_RES_INOS 0x0800000
/*
- * flags for dqflush and dqflush_all.
- */
-#define XFS_QMOPT_SYNC 0x1000000
-#define XFS_QMOPT_DELWRI 0x4000000
-
-/*
* flags for dqalloc.
*/
-#define XFS_QMOPT_INHERIT 0x8000000
+#define XFS_QMOPT_INHERIT 0x1000000
/*
* flags to xfs_trans_mod_dquot.
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 22+ messages in thread* [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously
2010-01-25 6:22 [PATCH 0/7] Delayed write metadata writeback V3 Dave Chinner
` (5 preceding siblings ...)
2010-01-25 6:22 ` [PATCH 6/7] xfs: kill the unused XFS_QMOPT_* flush flags Dave Chinner
@ 2010-01-25 6:22 ` Dave Chinner
2010-01-25 16:03 ` Christoph Hellwig
6 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-25 6:22 UTC (permalink / raw)
To: xfs
When an inode has already be flushed delayed write,
xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
return on a synchronous inode write without having written the
inode. Currently these sycnhronous writes only come from the unmount
path or the nfsd on a synchronous export so should be fairly rare.
Realistically, a synchronous inode write is not necessary here; we
can treat this like fsync where we either force the log if there are
no unlogged changes, or do a sync transaction if there are unlogged
changes. The will result real synchronous semantics as the fsync
will issue barriers, but may slow down the above two configurations
as a result. However, if the inode is not pinned and has no unlogged
changes, then the fsync code is a no-op and hence it may be faster
than the existing code.
For the asynchronous write, move the clean check until after we have
locked up the inode. With the inode locked and the flush lock held,
we know that if the inodis clean there are no pending changes in the
log and there are no current outstanding delayed writes or IO in
progress, so if it reports clean now it really is clean. This matches
the order of locking and checks in xfs_sync_inode_attr().
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_super.c | 44 ++++++++++++++++++++++++-----------------
1 files changed, 26 insertions(+), 18 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 5ed0468..4cfc82a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1045,14 +1045,18 @@ xfs_fs_write_inode(
error = xfs_wait_on_pages(ip, 0, -1);
if (error)
goto out;
+ /*
+ * The fsync operation makes inode changes stable and it
+ * reduces the IOs we have to do here from two (log and inode)
+ * to just the log. We still need to do a delwri write of the
+ * inode after this to flush it to the bacing buffer so that
+ * bulkstat works properly.
+ */
+ error = xfs_fsync(ip);
+ if (error)
+ goto out;
}
- /*
- * Bypass inodes which have already been cleaned by
- * the inode flush clustering code inside xfs_iflush
- */
- if (xfs_inode_clean(ip))
- goto out;
/*
* We make this non-blocking if the inode is contended, return
@@ -1060,20 +1064,24 @@ xfs_fs_write_inode(
* This prevents the flush path from blocking on inodes inside
* another operation right now, they get caught later by xfs_sync.
*/
- if (sync) {
- xfs_ilock(ip, XFS_ILOCK_SHARED);
- xfs_iflock(ip);
-
- error = xfs_iflush(ip, SYNC_WAIT);
- } else {
- error = EAGAIN;
- if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
- goto out;
- if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
- goto out_unlock;
+ error = EAGAIN;
+ if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
+ goto out;
+ if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
+ goto out_unlock;
- error = xfs_iflush(ip, 0);
+ /*
+ * Now we have the flush lock and the inode is not pinned, we can check
+ * if the inode is really clean as we know that there are no pending
+ * transaction completions, it is not waiting on the delayed write
+ * queue and there is no IO in progress.
+ */
+ error = 0;
+ if (xfs_inode_clean(ip)) {
+ xfs_ifunlock(ip);
+ goto out_unlock;
}
+ error = xfs_iflush(ip, 0);
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_SHARED);
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 22+ messages in thread* Re: [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously
2010-01-25 6:22 ` [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously Dave Chinner
@ 2010-01-25 16:03 ` Christoph Hellwig
2010-01-26 4:51 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-01-25 16:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Jan 25, 2010 at 05:22:44PM +1100, Dave Chinner wrote:
> When an inode has already be flushed delayed write,
> xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
> return on a synchronous inode write without having written the
> inode. Currently these sycnhronous writes only come from the unmount
> path or the nfsd on a synchronous export so should be fairly rare.
They also come from sync_filesystem, which is uses by the sync system
call, in the unmount code and from cachefiles.
> Realistically, a synchronous inode write is not necessary here; we
> can treat this like fsync where we either force the log if there are
> no unlogged changes, or do a sync transaction if there are unlogged
> changes. The will result real synchronous semantics as the fsync
> will issue barriers, but may slow down the above two configurations
> as a result. However, if the inode is not pinned and has no unlogged
> changes, then the fsync code is a no-op and hence it may be faster
> than the existing code.
If we get a lot of cases where we need to write out the inode
synchronously the barrier might hit us really hard, though. If
we have a lot of delalloc I/O outstanding I fear this might actually
happen in practice as the inode gets modified between the first
->write_inode with wait == 0 by I/O completion.
> + error = EAGAIN;
> + if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> + goto out;
> + if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
> + goto out_unlock;
So if we make this non-blocking even for the wait case, don't we
still have a race window there bulkstat could miss the updates, even
after a sync?
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously
2010-01-25 16:03 ` Christoph Hellwig
@ 2010-01-26 4:51 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-26 4:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Jan 25, 2010 at 11:03:54AM -0500, Christoph Hellwig wrote:
> On Mon, Jan 25, 2010 at 05:22:44PM +1100, Dave Chinner wrote:
> > When an inode has already be flushed delayed write,
> > xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
> > return on a synchronous inode write without having written the
> > inode. Currently these sycnhronous writes only come from the unmount
> > path or the nfsd on a synchronous export so should be fairly rare.
>
> They also come from sync_filesystem, which is uses by the sync system
> call, in the unmount code and from cachefiles.
True - I'll update the comment - but I still think it'll be fairly
rare.
> > Realistically, a synchronous inode write is not necessary here; we
> > can treat this like fsync where we either force the log if there are
> > no unlogged changes, or do a sync transaction if there are unlogged
> > changes. The will result real synchronous semantics as the fsync
> > will issue barriers, but may slow down the above two configurations
> > as a result. However, if the inode is not pinned and has no unlogged
> > changes, then the fsync code is a no-op and hence it may be faster
> > than the existing code.
>
> If we get a lot of cases where we need to write out the inode
> synchronously the barrier might hit us really hard, though.
No different to running wsync, though, where all transactions
are synchronous and will issue barriers all the time.
> If
> we have a lot of delalloc I/O outstanding I fear this might actually
> happen in practice as the inode gets modified between the first
> ->write_inode with wait == 0 by I/O completion.
So far in my testing I haven't seen a big hit - the performance
tests I've done are on filesystems with barriers enabled. I just
checked barrier vs nobarrier sync times after creating 400,000
single block files in parallel - nobarrier = 27s, barrier = 29s.
> > + error = EAGAIN;
> > + if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> > + goto out;
> > + if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
> > + goto out_unlock;
>
> So if we make this non-blocking even for the wait case, don't we
> still have a race window there bulkstat could miss the updates, even
> after a sync?
Yes, you're right. But even if we lock here properly, a delwri flush
is non-blocking and hence can still return EAGAIN. We really only need
this if a newly allocated inode has not been previously flushed for
bulkstat to work correctly. We would need to race with a concurrent
transaction between the fsync call and the below checks for this
flush to fail, which I think should be a relatively rare ocurrence.
What I will look at is whether I can get xfs_fsync() to take a
locked inode and return with it still locked. Then this race
condition will go away completely and hence the delwri flush
will only occur if the inode has not been flushed yet (based
on the flock).
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread