* [PATCH 0/9] Delayed write metadata writeback V5
@ 2010-02-09 3:56 Dave Chinner
2010-02-09 3:56 ` [PATCH 1/9] xfs: Make inode reclaim states explicit Dave Chinner
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs
While I started with killing async inode writeback, the series has
grown. It's not really limited to inode writeback - it touches dquot
flushing, changes the way the AIL pushes on buffers, adds xfsbufd
sorting for delayed write buffers, adds a real non-blocking mode to
inode reclaim and avoids physical inode writeback from the VFS while
fixing bugs in handling delayed write inodes. Hence this is more
about enabling efficient delayed write metadata than it is able
killing async inode writeback.
The idea behind this series is to make metadata buffers get
written from xfsbufd via the delayed write queue rather than being
issued asynchronously from all over the place. To do this, async
buffer writeback is almost entirely removed from XFS, replaced
instead by delayed writes and a method to expedite flushing of
delayed write buffers when required.
The result of funnelling all the buffer IO into a single place
is that we can more tightly control and therefore optimise the
submission of metadata IO. Aggregating the buffers before dispatch
allows much better sort efficiency of the buffers as the sort window
is not limited to the size of the elevator congestion hysteresis
limit. Hence we can approach 100% merge effeciency on large numbers
of buffers when dispatched for IO and greatly reduce the amount
of seeking metadata writeback causes.
The major change is to the inode flushing and reclaim code. Delayed
write inodes hold the flush lock for much longer than for async
writeback, and hence blocking on the flush lock can cause extremely
long latencies without other mechanisms to expedite the release of
the flush locks. To prevent needing to flush inodes immediately,
all operations are done non-blocking unless synchronous. This
required a significant rework of the inode reclaim code, but it
greatly simplified other pieces of code (e.g. log item pushing).
Version 5
- drop the fsync changes to xfs_fs_write_inode() and the associated
locking changes, replace them with a targeted inode logging
function from Christoph Hellwig to fix a performance regression on
fs_mark -S4 workloads on an SSD.
Version 4
- rework inode reclaim checks for better legibility
- add warning to reclaim code when delwri flush errors occur
- kill XFS_ITEM_FLUSHING now it is not used
- clean up sync_mode flags being pushed into xfs_iflush()
- kill the now unused xfs_bawrite() function
- include Christoph's fsync cache flush fix
- rework the inode locking and call to xfs_fsync() when doing
synchronous inode writes to close races between the fsync and
the background delwri flush afterwards.
Version 3
- rework inode reclaim to:
- separate it from xfs_iflush return values
- provide a non-blocking mode for background operation
- apply delwri buffer promotion tricks to dquot flushing
- kill unneeded dquot flushing flags, similar to inode flushing flag
removal
- fix sync inode flush bug when trying to flush delwri inodes
Version 2:
- use generic list sort function
- when unmounting, push the delwri buffers first, then do sync inode
reclaim so that reclaim doesn't block for 15 seconds waiting for
delwri inode buffers to be aged and written before the inodes can
be reclaimed.
Alex, the patch series is available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/dgc/xfs for-2.6.34
Christoph Hellwig (2):
xfs: remove invalid barrier optimization from xfs_fsync
xfs: log changed inodes instead of writing them synchronously
Dave Chinner (7):
xfs: Make inode reclaim states explicit
xfs: Use delayed write for inodes rather than async V2
xfs: Don't issue buffer IO direct from AIL push V2
xfs: Sort delayed write buffers before dispatch
xfs: Use delay write promotion for dquot flushing
xfs: kill the unused XFS_QMOPT_* flush flags V2
xfs: kill xfs_bawrite
fs/xfs/linux-2.6/xfs_buf.c | 135 ++++++++++++++++++++++++++--------------
fs/xfs/linux-2.6/xfs_buf.h | 3 +-
fs/xfs/linux-2.6/xfs_super.c | 111 ++++++++++++++++++++++++---------
fs/xfs/linux-2.6/xfs_sync.c | 138 +++++++++++++++++++++++++++++++++-------
fs/xfs/linux-2.6/xfs_trace.h | 1 +
fs/xfs/quota/xfs_dquot.c | 38 +++++-------
fs/xfs/quota/xfs_dquot_item.c | 87 ++++----------------------
fs/xfs/quota/xfs_dquot_item.h | 4 -
fs/xfs/quota/xfs_qm.c | 14 ++---
fs/xfs/xfs_buf_item.c | 64 ++++++++++---------
fs/xfs/xfs_inode.c | 86 ++------------------------
fs/xfs/xfs_inode.h | 11 +---
fs/xfs/xfs_inode_item.c | 108 +++++++-------------------------
fs/xfs/xfs_inode_item.h | 6 --
fs/xfs/xfs_mount.c | 13 ++++-
fs/xfs/xfs_quota.h | 8 +--
fs/xfs/xfs_trans.h | 3 +-
fs/xfs/xfs_trans_ail.c | 13 ++--
fs/xfs/xfs_vnodeops.c | 12 +---
19 files changed, 410 insertions(+), 445 deletions(-)
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/9] xfs: Make inode reclaim states explicit
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 3:56 ` [PATCH 2/9] xfs: Use delayed write for inodes rather than async V2 Dave Chinner
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs
A.K.A.: don't rely on xfs_iflush() return value in reclaim
We have gradually been moving checks out of the reclaim code because
they are duplicated in xfs_iflush(). We've had a history of problems
in this area, and many of them stem from the overloading of the
return values from xfs_iflush() and interaction with inode flush
locking to determine if the inode is safe to reclaim.
With the desire to move to delayed write flushing of inodes and
non-blocking inode tree reclaim walks, the overloading of the
return value of xfs_iflush makes it very difficult to determine
the correct thing to do next.
This patch explicitly re-adds the checks to the inode reclaim code,
removing the reliance on the return value of xfs_iflush() to
determine what to do next. It also means that we can clearly
document all the inode states that reclaim must handle and hence
we can easily see that we handled all the necessary cases.
This also removes the need for the xfs_inode_clean() check in
xfs_iflush() as all callers now check this first (safely).
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/linux-2.6/xfs_sync.c | 81 +++++++++++++++++++++++++++++++++----------
fs/xfs/xfs_inode.c | 11 +-----
fs/xfs/xfs_inode.h | 1 +
3 files changed, 64 insertions(+), 29 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index c9b863e..525260c 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -706,12 +706,43 @@ __xfs_inode_clear_reclaim_tag(
XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
}
+/*
+ * Inodes in different states need to be treated differently, and the return
+ * value of xfs_iflush is not sufficient to get this right. The following table
+ * lists the inode states and the reclaim actions necessary for non-blocking
+ * reclaim:
+ *
+ *
+ * inode state iflush ret required action
+ * --------------- ---------- ---------------
+ * bad - reclaim
+ * shutdown EIO unpin and reclaim
+ * clean, unpinned 0 reclaim
+ * stale, unpinned 0 reclaim
+ * clean, pinned(*) 0 unpin and reclaim
+ * stale, pinned 0 unpin and reclaim
+ * dirty, async 0 block on flush lock, reclaim
+ * dirty, sync flush 0 block on flush lock, reclaim
+ *
+ * (*) dgc: I don't think the clean, pinned state is possible but it gets
+ * handled anyway given the order of checks implemented.
+ *
+ * Hence the order of actions after gaining the locks should be:
+ * bad => reclaim
+ * shutdown => unpin and reclaim
+ * pinned => unpin
+ * stale => reclaim
+ * clean => reclaim
+ * dirty => flush, wait and reclaim
+ */
STATIC int
xfs_reclaim_inode(
struct xfs_inode *ip,
struct xfs_perag *pag,
int sync_mode)
{
+ int error;
+
/*
* The radix tree lock here protects a thread in xfs_iget from racing
* with us starting reclaim on the inode. Once we have the
@@ -729,30 +760,42 @@ xfs_reclaim_inode(
spin_unlock(&ip->i_flags_lock);
write_unlock(&pag->pag_ici_lock);
- /*
- * If the inode is still dirty, then flush it out. If the inode
- * is not in the AIL, then it will be OK to flush it delwri as
- * long as xfs_iflush() does not keep any references to the inode.
- * We leave that decision up to xfs_iflush() since it has the
- * knowledge of whether it's OK to simply do a delwri flush of
- * the inode or whether we need to wait until the inode is
- * pulled from the AIL.
- * We get the flush lock regardless, though, just to make sure
- * we don't free it while it is being flushed.
- */
xfs_ilock(ip, XFS_ILOCK_EXCL);
xfs_iflock(ip);
- /*
- * In the case of a forced shutdown we rely on xfs_iflush() to
- * wait for the inode to be unpinned before returning an error.
- */
- if (!is_bad_inode(VFS_I(ip)) && xfs_iflush(ip, sync_mode) == 0) {
- /* synchronize with xfs_iflush_done */
- xfs_iflock(ip);
- xfs_ifunlock(ip);
+ if (is_bad_inode(VFS_I(ip)))
+ goto reclaim;
+ if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+ xfs_iunpin_wait(ip);
+ goto reclaim;
+ }
+ if (xfs_ipincount(ip))
+ xfs_iunpin_wait(ip);
+ if (xfs_iflags_test(ip, XFS_ISTALE))
+ goto reclaim;
+ if (xfs_inode_clean(ip))
+ goto reclaim;
+
+ /* Now we have an inode that needs flushing */
+ error = xfs_iflush(ip, sync_mode);
+ if (!error) {
+ switch(sync_mode) {
+ case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
+ case XFS_IFLUSH_DELWRI:
+ case XFS_IFLUSH_ASYNC:
+ case XFS_IFLUSH_DELWRI_ELSE_SYNC:
+ case XFS_IFLUSH_SYNC:
+ /* IO issued, synchronise with IO completion */
+ xfs_iflock(ip);
+ break;
+ default:
+ ASSERT(0);
+ break;
+ }
}
+reclaim:
+ xfs_ifunlock(ip);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
xfs_ireclaim(ip);
return 0;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d0d1b5a..8d0666d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2493,7 +2493,7 @@ __xfs_iunpin_wait(
wait_event(ip->i_ipin_wait, (atomic_read(&ip->i_pincount) == 0));
}
-static inline void
+void
xfs_iunpin_wait(
xfs_inode_t *ip)
{
@@ -2849,15 +2849,6 @@ xfs_iflush(
mp = ip->i_mount;
/*
- * If the inode isn't dirty, then just release the inode flush lock and
- * do nothing.
- */
- if (xfs_inode_clean(ip)) {
- xfs_ifunlock(ip);
- return 0;
- }
-
- /*
* We can't flush the inode until it is unpinned, so wait for it if we
* are allowed to block. We know noone new can pin it, because we are
* holding the inode lock shared and you need to hold it exclusively to
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ec1f28c..8b618ea 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -483,6 +483,7 @@ int xfs_iunlink(struct xfs_trans *, xfs_inode_t *);
void xfs_iext_realloc(xfs_inode_t *, int, int);
void xfs_ipin(xfs_inode_t *);
void xfs_iunpin(xfs_inode_t *);
+void xfs_iunpin_wait(xfs_inode_t *);
int xfs_iflush(xfs_inode_t *, uint);
void xfs_ichgtime(xfs_inode_t *, int);
void xfs_lock_inodes(xfs_inode_t **, int, uint);
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/9] xfs: Use delayed write for inodes rather than async V2
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
2010-02-09 3:56 ` [PATCH 1/9] xfs: Make inode reclaim states explicit Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 3:56 ` [PATCH 3/9] xfs: Don't issue buffer IO direct from AIL push V2 Dave Chinner
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs
We currently do background inode flush asynchronously, resulting in
inodes being written in whatever order the background writeback
issues them. Not only that, there are also blocking and non-blocking
asynchronous inode flushes, depending on where the flush comes from.
This patch completely removes asynchronous inode writeback. It
removes all the strange writeback modes and replaces them with
either a synchronous flush or a non-blocking delayed write flush.
That is, inode flushes will only issue IO directly if they are
synchronous, and background flushing may do nothing if the operation
would block (e.g. on a pinned inode or buffer lock).
Delayed write flushes will now result in the inode buffer sitting in
the delwri queue of the buffer cache to be flushed by either an AIL
push or by the xfsbufd timing out the buffer. This will allow
accumulation of dirty inode buffers in memory and allow optimisation
of inode cluster writeback at the xfsbufd level where we have much
greater queue depths than the block layer elevators. We will also
get adjacent inode cluster buffer IO merging for free when a later
patch in the series allows sorting of the delayed write buffers
before dispatch.
This effectively means that any inode that is written back by
background writeback will be seen as flush locked during AIL
pushing, and will result in the buffers being pushed from there.
This writeback path is currently non-optimal, but the next patch
in the series will fix that problem.
A side effect of this delayed write mechanism is that background
inode reclaim will no longer directly flush inodes, nor can it wait
on the flush lock. The result is that inode reclaim must leave the
inode in the reclaimable state until it is clean. Hence attempts to
reclaim a dirty inode in the background will simply skip the inode
until it is clean and this allows other mechanisms (i.e. xfsbufd) to
do more optimal writeback of the dirty buffers. As a result, the
inode reclaim code has been rewritten so that it no longer relies on
the ambiguous return values of xfs_iflush() to determine whether it
is safe to reclaim an inode.
Portions of this patch are derived from patches by Christoph
Hellwig.
Version 2:
- cleanup reclaim code as suggested by Christoph
- log background reclaim inode flush errors
- just pass sync flags to xfs_iflush
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/linux-2.6/xfs_super.c | 4 +-
fs/xfs/linux-2.6/xfs_sync.c | 105 ++++++++++++++++++++++++++++++------------
fs/xfs/xfs_inode.c | 75 ++----------------------------
fs/xfs/xfs_inode.h | 10 ----
fs/xfs/xfs_inode_item.c | 10 +++-
fs/xfs/xfs_mount.c | 13 +++++-
6 files changed, 102 insertions(+), 115 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 6ce828e..3b5b46b 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1064,7 +1064,7 @@ xfs_fs_write_inode(
xfs_ilock(ip, XFS_ILOCK_SHARED);
xfs_iflock(ip);
- error = xfs_iflush(ip, XFS_IFLUSH_SYNC);
+ error = xfs_iflush(ip, SYNC_WAIT);
} else {
error = EAGAIN;
if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
@@ -1072,7 +1072,7 @@ xfs_fs_write_inode(
if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
goto out_unlock;
- error = xfs_iflush(ip, XFS_IFLUSH_ASYNC_NOBLOCK);
+ error = xfs_iflush(ip, 0);
}
out_unlock:
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 525260c..a9f6d20 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -270,8 +270,7 @@ xfs_sync_inode_attr(
goto out_unlock;
}
- error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
- XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
+ error = xfs_iflush(ip, flags);
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_SHARED);
@@ -460,16 +459,18 @@ xfs_quiesce_fs(
{
int count = 0, pincount;
+ xfs_reclaim_inodes(mp, 0);
xfs_flush_buftarg(mp->m_ddev_targp, 0);
- xfs_reclaim_inodes(mp, XFS_IFLUSH_DELWRI_ELSE_ASYNC);
/*
* This loop must run at least twice. The first instance of the loop
* will flush most meta data but that will generate more meta data
* (typically directory updates). Which then must be flushed and
- * logged before we can write the unmount record.
+ * logged before we can write the unmount record. We also so sync
+ * reclaim of inodes to catch any that the above delwri flush skipped.
*/
do {
+ xfs_reclaim_inodes(mp, SYNC_WAIT);
xfs_sync_attr(mp, SYNC_WAIT);
pincount = xfs_flush_buftarg(mp->m_ddev_targp, 1);
if (!pincount) {
@@ -585,7 +586,7 @@ xfs_sync_worker(
if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
xfs_log_force(mp, 0);
- xfs_reclaim_inodes(mp, XFS_IFLUSH_DELWRI_ELSE_ASYNC);
+ xfs_reclaim_inodes(mp, 0);
/* dgc: errors ignored here */
error = xfs_qm_sync(mp, SYNC_TRYLOCK);
error = xfs_sync_fsdata(mp, SYNC_TRYLOCK);
@@ -719,21 +720,42 @@ __xfs_inode_clear_reclaim_tag(
* shutdown EIO unpin and reclaim
* clean, unpinned 0 reclaim
* stale, unpinned 0 reclaim
- * clean, pinned(*) 0 unpin and reclaim
- * stale, pinned 0 unpin and reclaim
- * dirty, async 0 block on flush lock, reclaim
- * dirty, sync flush 0 block on flush lock, reclaim
+ * clean, pinned(*) 0 requeue
+ * stale, pinned EAGAIN requeue
+ * dirty, delwri ok 0 requeue
+ * dirty, delwri blocked EAGAIN requeue
+ * dirty, sync flush 0 reclaim
*
* (*) dgc: I don't think the clean, pinned state is possible but it gets
* handled anyway given the order of checks implemented.
*
+ * As can be seen from the table, the return value of xfs_iflush() is not
+ * sufficient to correctly decide the reclaim action here. The checks in
+ * xfs_iflush() might look like duplicates, but they are not.
+ *
+ * Also, because we get the flush lock first, we know that any inode that has
+ * been flushed delwri has had the flush completed by the time we check that
+ * the inode is clean. The clean inode check needs to be done before flushing
+ * the inode delwri otherwise we would loop forever requeuing clean inodes as
+ * we cannot tell apart a successful delwri flush and a clean inode from the
+ * return value of xfs_iflush().
+ *
+ * Note that because the inode is flushed delayed write by background
+ * writeback, the flush lock may already be held here and waiting on it can
+ * result in very long latencies. Hence for sync reclaims, where we wait on the
+ * flush lock, the caller should push out delayed write inodes first before
+ * trying to reclaim them to minimise the amount of time spent waiting. For
+ * background relaim, we just requeue the inode for the next pass.
+ *
* Hence the order of actions after gaining the locks should be:
* bad => reclaim
* shutdown => unpin and reclaim
- * pinned => unpin
+ * pinned, delwri => requeue
+ * pinned, sync => unpin
* stale => reclaim
* clean => reclaim
- * dirty => flush, wait and reclaim
+ * dirty, delwri => flush and requeue
+ * dirty, sync => flush, wait and reclaim
*/
STATIC int
xfs_reclaim_inode(
@@ -741,7 +763,7 @@ xfs_reclaim_inode(
struct xfs_perag *pag,
int sync_mode)
{
- int error;
+ int error = 0;
/*
* The radix tree lock here protects a thread in xfs_iget from racing
@@ -761,7 +783,11 @@ xfs_reclaim_inode(
write_unlock(&pag->pag_ici_lock);
xfs_ilock(ip, XFS_ILOCK_EXCL);
- xfs_iflock(ip);
+ if (!xfs_iflock_nowait(ip)) {
+ if (!(sync_mode & SYNC_WAIT))
+ goto out;
+ xfs_iflock(ip);
+ }
if (is_bad_inode(VFS_I(ip)))
goto reclaim;
@@ -769,8 +795,13 @@ xfs_reclaim_inode(
xfs_iunpin_wait(ip);
goto reclaim;
}
- if (xfs_ipincount(ip))
+ if (xfs_ipincount(ip)) {
+ if (!(sync_mode & SYNC_WAIT)) {
+ xfs_ifunlock(ip);
+ goto out;
+ }
xfs_iunpin_wait(ip);
+ }
if (xfs_iflags_test(ip, XFS_ISTALE))
goto reclaim;
if (xfs_inode_clean(ip))
@@ -778,27 +809,43 @@ xfs_reclaim_inode(
/* Now we have an inode that needs flushing */
error = xfs_iflush(ip, sync_mode);
- if (!error) {
- switch(sync_mode) {
- case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
- case XFS_IFLUSH_DELWRI:
- case XFS_IFLUSH_ASYNC:
- case XFS_IFLUSH_DELWRI_ELSE_SYNC:
- case XFS_IFLUSH_SYNC:
- /* IO issued, synchronise with IO completion */
- xfs_iflock(ip);
- break;
- default:
- ASSERT(0);
- break;
- }
+ if (sync_mode & SYNC_WAIT) {
+ xfs_iflock(ip);
+ goto reclaim;
}
+ /*
+ * When we have to flush an inode but don't have SYNC_WAIT set, we
+ * flush the inode out using a delwri buffer and wait for the next
+ * call into reclaim to find it in a clean state instead of waiting for
+ * it now. We also don't return errors here - if the error is transient
+ * then the next reclaim pass will flush the inode, and if the error
+ * is permanent then the next sync reclaim will relcaim the inode and
+ * pass on the error.
+ */
+ if (error && !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+ xfs_fs_cmn_err(CE_WARN, ip->i_mount,
+ "inode 0x%llx background reclaim flush failed with %d",
+ (long long)ip->i_ino, error);
+ }
+out:
+ xfs_iflags_clear(ip, XFS_IRECLAIM);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ /*
+ * We could return EAGAIN here to make reclaim rescan the inode tree in
+ * a short while. However, this just burns CPU time scanning the tree
+ * waiting for IO to complete and xfssyncd never goes back to the idle
+ * state. Instead, return 0 to let the next scheduled background reclaim
+ * attempt to reclaim the inode again.
+ */
+ return 0;
+
reclaim:
xfs_ifunlock(ip);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
xfs_ireclaim(ip);
- return 0;
+ return error;
+
}
int
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8d0666d..fa31360 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2835,8 +2835,6 @@ xfs_iflush(
xfs_dinode_t *dip;
xfs_mount_t *mp;
int error;
- int noblock = (flags == XFS_IFLUSH_ASYNC_NOBLOCK);
- enum { INT_DELWRI = (1 << 0), INT_ASYNC = (1 << 1) };
XFS_STATS_INC(xs_iflush_count);
@@ -2859,7 +2857,7 @@ xfs_iflush(
* in the same cluster are dirty, they will probably write the inode
* out for us if they occur after the log force completes.
*/
- if (noblock && xfs_ipincount(ip)) {
+ if (!(flags & SYNC_WAIT) && xfs_ipincount(ip)) {
xfs_iunpin_nowait(ip);
xfs_ifunlock(ip);
return EAGAIN;
@@ -2893,60 +2891,10 @@ xfs_iflush(
}
/*
- * Decide how buffer will be flushed out. This is done before
- * the call to xfs_iflush_int because this field is zeroed by it.
- */
- if (iip != NULL && iip->ili_format.ilf_fields != 0) {
- /*
- * Flush out the inode buffer according to the directions
- * of the caller. In the cases where the caller has given
- * us a choice choose the non-delwri case. This is because
- * the inode is in the AIL and we need to get it out soon.
- */
- switch (flags) {
- case XFS_IFLUSH_SYNC:
- case XFS_IFLUSH_DELWRI_ELSE_SYNC:
- flags = 0;
- break;
- case XFS_IFLUSH_ASYNC_NOBLOCK:
- case XFS_IFLUSH_ASYNC:
- case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
- flags = INT_ASYNC;
- break;
- case XFS_IFLUSH_DELWRI:
- flags = INT_DELWRI;
- break;
- default:
- ASSERT(0);
- flags = 0;
- break;
- }
- } else {
- switch (flags) {
- case XFS_IFLUSH_DELWRI_ELSE_SYNC:
- case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
- case XFS_IFLUSH_DELWRI:
- flags = INT_DELWRI;
- break;
- case XFS_IFLUSH_ASYNC_NOBLOCK:
- case XFS_IFLUSH_ASYNC:
- flags = INT_ASYNC;
- break;
- case XFS_IFLUSH_SYNC:
- flags = 0;
- break;
- default:
- ASSERT(0);
- flags = 0;
- break;
- }
- }
-
- /*
* Get the buffer containing the on-disk inode.
*/
error = xfs_itobp(mp, NULL, ip, &dip, &bp,
- noblock ? XBF_TRYLOCK : XBF_LOCK);
+ (flags & SYNC_WAIT) ? XBF_LOCK : XBF_TRYLOCK);
if (error || !bp) {
xfs_ifunlock(ip);
return error;
@@ -2974,13 +2922,10 @@ xfs_iflush(
if (error)
goto cluster_corrupt_out;
- if (flags & INT_DELWRI) {
- xfs_bdwrite(mp, bp);
- } else if (flags & INT_ASYNC) {
- error = xfs_bawrite(mp, bp);
- } else {
+ if (flags & SYNC_WAIT)
error = xfs_bwrite(mp, bp);
- }
+ else
+ xfs_bdwrite(mp, bp);
return error;
corrupt_out:
@@ -3015,16 +2960,6 @@ xfs_iflush_int(
iip = ip->i_itemp;
mp = ip->i_mount;
-
- /*
- * If the inode isn't dirty, then just release the inode
- * flush lock and do nothing.
- */
- if (xfs_inode_clean(ip)) {
- xfs_ifunlock(ip);
- return 0;
- }
-
/* set *dip = inode's place in the buffer */
dip = (xfs_dinode_t *)xfs_buf_offset(bp, ip->i_imap.im_boffset);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 8b618ea..6c912b0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -420,16 +420,6 @@ static inline void xfs_ifunlock(xfs_inode_t *ip)
#define XFS_ILOCK_DEP(flags) (((flags) & XFS_ILOCK_DEP_MASK) >> XFS_ILOCK_SHIFT)
/*
- * Flags for xfs_iflush()
- */
-#define XFS_IFLUSH_DELWRI_ELSE_SYNC 1
-#define XFS_IFLUSH_DELWRI_ELSE_ASYNC 2
-#define XFS_IFLUSH_SYNC 3
-#define XFS_IFLUSH_ASYNC 4
-#define XFS_IFLUSH_DELWRI 5
-#define XFS_IFLUSH_ASYNC_NOBLOCK 6
-
-/*
* Flags for xfs_itruncate_start().
*/
#define XFS_ITRUNC_DEFINITE 0x1
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 48ec1c0..207553e 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -866,10 +866,14 @@ xfs_inode_item_push(
iip->ili_format.ilf_fields != 0);
/*
- * Write out the inode. The completion routine ('iflush_done') will
- * pull it from the AIL, mark it clean, unlock the flush lock.
+ * Push the inode to it's backing buffer. This will not remove the
+ * inode from the AIL - a further push will be required to trigger a
+ * buffer push. However, this allows all the dirty inodes to be pushed
+ * to the buffer before it is pushed to disk. THe buffer IO completion
+ * will pull th einode from the AIL, mark it clean and unlock the flush
+ * lock.
*/
- (void) xfs_iflush(ip, XFS_IFLUSH_ASYNC);
+ (void) xfs_iflush(ip, 0);
xfs_iunlock(ip, XFS_ILOCK_SHARED);
return;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 5061149..6afaaeb 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1468,7 +1468,18 @@ xfs_unmountfs(
* need to force the log first.
*/
xfs_log_force(mp, XFS_LOG_SYNC);
- xfs_reclaim_inodes(mp, XFS_IFLUSH_ASYNC);
+
+ /*
+ * Do a delwri reclaim pass first so that as many dirty inodes are
+ * queued up for IO as possible. Then flush the buffers before making
+ * a synchronous path to catch all the remaining inodes are reclaimed.
+ * This makes the reclaim process as quick as possible by avoiding
+ * synchronous writeout and blocking on inodes already in the delwri
+ * state as much as possible.
+ */
+ xfs_reclaim_inodes(mp, 0);
+ XFS_bflush(mp->m_ddev_targp);
+ xfs_reclaim_inodes(mp, SYNC_WAIT);
xfs_qm_unmount(mp);
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 3/9] xfs: Don't issue buffer IO direct from AIL push V2
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
2010-02-09 3:56 ` [PATCH 1/9] xfs: Make inode reclaim states explicit Dave Chinner
2010-02-09 3:56 ` [PATCH 2/9] xfs: Use delayed write for inodes rather than async V2 Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 3:56 ` [PATCH 4/9] xfs: Sort delayed write buffers before dispatch Dave Chinner
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs
All buffers logged into the AIL are marked as delayed write.
When the AIL needs to push the buffer out, it issues an async write of the
buffer. This means that IO patterns are dependent on the order of
buffers in the AIL.
Instead of flushing the buffer, promote the buffer in the delayed
write list so that the next time the xfsbufd is run the buffer will
be flushed by the xfsbufd. Return the state to the xfsaild that the
buffer was promoted so that the xfsaild knows that it needs to cause
the xfsbufd to run to flush the buffers that were promoted.
Using the xfsbufd for issuing the IO allows us to dispatch all
buffer IO from the one queue. This means that we can make much more
enlightened decisions on what order to flush buffers to disk as
we don't have multiple places issuing IO. Optimisations to xfsbufd
will be in a future patch.
Version 2
- kill XFS_ITEM_FLUSHING as it is now unused.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/linux-2.6/xfs_buf.c | 29 ++++++++++++
fs/xfs/linux-2.6/xfs_buf.h | 2 +
fs/xfs/linux-2.6/xfs_trace.h | 1 +
fs/xfs/quota/xfs_dquot_item.c | 85 +++++------------------------------
fs/xfs/quota/xfs_dquot_item.h | 4 --
fs/xfs/xfs_buf_item.c | 64 +++++++++++++++------------
fs/xfs/xfs_inode_item.c | 98 ++++++----------------------------------
fs/xfs/xfs_inode_item.h | 6 ---
fs/xfs/xfs_trans.h | 3 +-
fs/xfs/xfs_trans_ail.c | 13 +++---
10 files changed, 102 insertions(+), 203 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 44e20e5..b306265 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1778,6 +1778,35 @@ xfs_buf_delwri_dequeue(
trace_xfs_buf_delwri_dequeue(bp, _RET_IP_);
}
+/*
+ * If a delwri buffer needs to be pushed before it has aged out, then promote
+ * it to the head of the delwri queue so that it will be flushed on the next
+ * xfsbufd run. We do this by resetting the queuetime of the buffer to be older
+ * than the age currently needed to flush the buffer. Hence the next time the
+ * xfsbufd sees it is guaranteed to be considered old enough to flush.
+ */
+void
+xfs_buf_delwri_promote(
+ struct xfs_buf *bp)
+{
+ struct xfs_buftarg *btp = bp->b_target;
+ long age = xfs_buf_age_centisecs * msecs_to_jiffies(10) + 1;
+
+ ASSERT(bp->b_flags & XBF_DELWRI);
+ ASSERT(bp->b_flags & _XBF_DELWRI_Q);
+
+ /*
+ * Check the buffer age before locking the delayed write queue as we
+ * don't need to promote buffers that are already past the flush age.
+ */
+ if (bp->b_queuetime < jiffies - age)
+ return;
+ bp->b_queuetime = jiffies - age;
+ spin_lock(&btp->bt_delwrite_lock);
+ list_move(&bp->b_list, &btp->bt_delwrite_queue);
+ spin_unlock(&btp->bt_delwrite_lock);
+}
+
STATIC void
xfs_buf_runall_queues(
struct workqueue_struct *queue)
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index ea8c198..be45e8c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -266,6 +266,7 @@ extern int xfs_buf_ispin(xfs_buf_t *);
/* Delayed Write Buffer Routines */
extern void xfs_buf_delwri_dequeue(xfs_buf_t *);
+extern void xfs_buf_delwri_promote(xfs_buf_t *);
/* Buffer Daemon Setup Routines */
extern int xfs_buf_init(void);
@@ -395,6 +396,7 @@ extern void xfs_free_buftarg(struct xfs_mount *, struct xfs_buftarg *);
extern void xfs_wait_buftarg(xfs_buftarg_t *);
extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
+
#ifdef CONFIG_KDB_MODULES
extern struct list_head *xfs_get_buftarg_list(void);
#endif
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index 1bb09e7..a4574dc 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -483,6 +483,7 @@ DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock);
DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock_stale);
DEFINE_BUF_ITEM_EVENT(xfs_buf_item_committed);
DEFINE_BUF_ITEM_EVENT(xfs_buf_item_push);
+DEFINE_BUF_ITEM_EVENT(xfs_buf_item_pushbuf);
DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf);
DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf_recur);
DEFINE_BUF_ITEM_EVENT(xfs_trans_getsb);
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index 1b56437..dda0fb0 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -212,66 +212,31 @@ xfs_qm_dquot_logitem_pushbuf(
xfs_dquot_t *dqp;
xfs_mount_t *mp;
xfs_buf_t *bp;
- uint dopush;
dqp = qip->qli_dquot;
ASSERT(XFS_DQ_IS_LOCKED(dqp));
/*
- * The qli_pushbuf_flag keeps others from
- * trying to duplicate our effort.
- */
- ASSERT(qip->qli_pushbuf_flag != 0);
- ASSERT(qip->qli_push_owner == current_pid());
-
- /*
* If flushlock isn't locked anymore, chances are that the
* inode flush completed and the inode was taken off the AIL.
* So, just get out.
*/
if (completion_done(&dqp->q_flush) ||
((qip->qli_item.li_flags & XFS_LI_IN_AIL) == 0)) {
- qip->qli_pushbuf_flag = 0;
xfs_dqunlock(dqp);
return;
}
mp = dqp->q_mount;
bp = xfs_incore(mp->m_ddev_targp, qip->qli_format.qlf_blkno,
XFS_QI_DQCHUNKLEN(mp), XBF_TRYLOCK);
- if (bp != NULL) {
- if (XFS_BUF_ISDELAYWRITE(bp)) {
- dopush = ((qip->qli_item.li_flags & XFS_LI_IN_AIL) &&
- !completion_done(&dqp->q_flush));
- qip->qli_pushbuf_flag = 0;
- xfs_dqunlock(dqp);
-
- if (XFS_BUF_ISPINNED(bp))
- xfs_log_force(mp, 0);
-
- if (dopush) {
- int error;
-#ifdef XFSRACEDEBUG
- delay_for_intr();
- delay(300);
-#endif
- error = xfs_bawrite(mp, bp);
- if (error)
- xfs_fs_cmn_err(CE_WARN, mp,
- "xfs_qm_dquot_logitem_pushbuf: pushbuf error %d on qip %p, bp %p",
- error, qip, bp);
- } else {
- xfs_buf_relse(bp);
- }
- } else {
- qip->qli_pushbuf_flag = 0;
- xfs_dqunlock(dqp);
- xfs_buf_relse(bp);
- }
+ xfs_dqunlock(dqp);
+ if (!bp)
return;
- }
+ if (XFS_BUF_ISDELAYWRITE(bp))
+ xfs_buf_delwri_promote(bp);
+ xfs_buf_relse(bp);
+ return;
- qip->qli_pushbuf_flag = 0;
- xfs_dqunlock(dqp);
}
/*
@@ -289,50 +254,24 @@ xfs_qm_dquot_logitem_trylock(
xfs_dq_logitem_t *qip)
{
xfs_dquot_t *dqp;
- uint retval;
dqp = qip->qli_dquot;
if (atomic_read(&dqp->q_pincount) > 0)
- return (XFS_ITEM_PINNED);
+ return XFS_ITEM_PINNED;
if (! xfs_qm_dqlock_nowait(dqp))
- return (XFS_ITEM_LOCKED);
+ return XFS_ITEM_LOCKED;
- retval = XFS_ITEM_SUCCESS;
if (!xfs_dqflock_nowait(dqp)) {
/*
- * The dquot is already being flushed. It may have been
- * flushed delayed write, however, and we don't want to
- * get stuck waiting for that to complete. So, we want to check
- * to see if we can lock the dquot's buffer without sleeping.
- * If we can and it is marked for delayed write, then we
- * hold it and send it out from the push routine. We don't
- * want to do that now since we might sleep in the device
- * strategy routine. We also don't want to grab the buffer lock
- * here because we'd like not to call into the buffer cache
- * while holding the AIL lock.
- * Make sure to only return PUSHBUF if we set pushbuf_flag
- * ourselves. If someone else is doing it then we don't
- * want to go to the push routine and duplicate their efforts.
+ * dquot has already been flushed to the backing buffer,
+ * leave it locked, pushbuf routine will unlock it.
*/
- if (qip->qli_pushbuf_flag == 0) {
- qip->qli_pushbuf_flag = 1;
- ASSERT(qip->qli_format.qlf_blkno == dqp->q_blkno);
-#ifdef DEBUG
- qip->qli_push_owner = current_pid();
-#endif
- /*
- * The dquot is left locked.
- */
- retval = XFS_ITEM_PUSHBUF;
- } else {
- retval = XFS_ITEM_FLUSHING;
- xfs_dqunlock_nonotify(dqp);
- }
+ return XFS_ITEM_PUSHBUF;
}
ASSERT(qip->qli_item.li_flags & XFS_LI_IN_AIL);
- return (retval);
+ return XFS_ITEM_SUCCESS;
}
diff --git a/fs/xfs/quota/xfs_dquot_item.h b/fs/xfs/quota/xfs_dquot_item.h
index 5a63253..5acae2a 100644
--- a/fs/xfs/quota/xfs_dquot_item.h
+++ b/fs/xfs/quota/xfs_dquot_item.h
@@ -27,10 +27,6 @@ typedef struct xfs_dq_logitem {
xfs_log_item_t qli_item; /* common portion */
struct xfs_dquot *qli_dquot; /* dquot ptr */
xfs_lsn_t qli_flush_lsn; /* lsn at last flush */
- unsigned short qli_pushbuf_flag; /* 1 bit used in push_ail */
-#ifdef DEBUG
- uint64_t qli_push_owner;
-#endif
xfs_dq_logformat_t qli_format; /* logged structure */
} xfs_dq_logitem_t;
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index e0a1158..f3c49e6 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -467,8 +467,10 @@ xfs_buf_item_unpin_remove(
/*
* This is called to attempt to lock the buffer associated with this
* buf log item. Don't sleep on the buffer lock. If we can't get
- * the lock right away, return 0. If we can get the lock, pull the
- * buffer from the free list, mark it busy, and return 1.
+ * the lock right away, return 0. If we can get the lock, take a
+ * reference to the buffer. If this is a delayed write buffer that
+ * needs AIL help to be written back, invoke the pushbuf routine
+ * rather than the normal success path.
*/
STATIC uint
xfs_buf_item_trylock(
@@ -477,24 +479,18 @@ xfs_buf_item_trylock(
xfs_buf_t *bp;
bp = bip->bli_buf;
-
- if (XFS_BUF_ISPINNED(bp)) {
+ if (XFS_BUF_ISPINNED(bp))
return XFS_ITEM_PINNED;
- }
-
- if (!XFS_BUF_CPSEMA(bp)) {
+ if (!XFS_BUF_CPSEMA(bp))
return XFS_ITEM_LOCKED;
- }
- /*
- * Remove the buffer from the free list. Only do this
- * if it's on the free list. Private buffers like the
- * superblock buffer are not.
- */
+ /* take a reference to the buffer. */
XFS_BUF_HOLD(bp);
ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
trace_xfs_buf_item_trylock(bip);
+ if (XFS_BUF_ISDELAYWRITE(bp))
+ return XFS_ITEM_PUSHBUF;
return XFS_ITEM_SUCCESS;
}
@@ -626,11 +622,9 @@ xfs_buf_item_committed(
}
/*
- * This is called to asynchronously write the buffer associated with this
- * buf log item out to disk. The buffer will already have been locked by
- * a successful call to xfs_buf_item_trylock(). If the buffer still has
- * B_DELWRI set, then get it going out to disk with a call to bawrite().
- * If not, then just release the buffer.
+ * The buffer is locked, but is not a delayed write buffer. This happens
+ * if we race with IO completion and hence we don't want to try to write it
+ * again. Just release the buffer.
*/
STATIC void
xfs_buf_item_push(
@@ -642,17 +636,29 @@ xfs_buf_item_push(
trace_xfs_buf_item_push(bip);
bp = bip->bli_buf;
+ ASSERT(!XFS_BUF_ISDELAYWRITE(bp));
+ xfs_buf_relse(bp);
+}
- if (XFS_BUF_ISDELAYWRITE(bp)) {
- int error;
- error = xfs_bawrite(bip->bli_item.li_mountp, bp);
- if (error)
- xfs_fs_cmn_err(CE_WARN, bip->bli_item.li_mountp,
- "xfs_buf_item_push: pushbuf error %d on bip %p, bp %p",
- error, bip, bp);
- } else {
- xfs_buf_relse(bp);
- }
+/*
+ * The buffer is locked and is a delayed write buffer. Promote the buffer
+ * in the delayed write queue as the caller knows that they must invoke
+ * the xfsbufd to get this buffer written. We have to unlock the buffer
+ * to allow the xfsbufd to write it, too.
+ */
+STATIC void
+xfs_buf_item_pushbuf(
+ xfs_buf_log_item_t *bip)
+{
+ xfs_buf_t *bp;
+
+ ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
+ trace_xfs_buf_item_pushbuf(bip);
+
+ bp = bip->bli_buf;
+ ASSERT(XFS_BUF_ISDELAYWRITE(bp));
+ xfs_buf_delwri_promote(bp);
+ xfs_buf_relse(bp);
}
/* ARGSUSED */
@@ -677,7 +683,7 @@ static struct xfs_item_ops xfs_buf_item_ops = {
.iop_committed = (xfs_lsn_t(*)(xfs_log_item_t*, xfs_lsn_t))
xfs_buf_item_committed,
.iop_push = (void(*)(xfs_log_item_t*))xfs_buf_item_push,
- .iop_pushbuf = NULL,
+ .iop_pushbuf = (void(*)(xfs_log_item_t*))xfs_buf_item_pushbuf,
.iop_committing = (void(*)(xfs_log_item_t*, xfs_lsn_t))
xfs_buf_item_committing
};
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 207553e..d4dc063 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -602,33 +602,20 @@ xfs_inode_item_trylock(
if (!xfs_iflock_nowait(ip)) {
/*
- * If someone else isn't already trying to push the inode
- * buffer, we get to do it.
+ * inode has already been flushed to the backing buffer,
+ * leave it locked in shared mode, pushbuf routine will
+ * unlock it.
*/
- if (iip->ili_pushbuf_flag == 0) {
- iip->ili_pushbuf_flag = 1;
-#ifdef DEBUG
- iip->ili_push_owner = current_pid();
-#endif
- /*
- * Inode is left locked in shared mode.
- * Pushbuf routine gets to unlock it.
- */
- return XFS_ITEM_PUSHBUF;
- } else {
- /*
- * We hold the AIL lock, so we must specify the
- * NONOTIFY flag so that we won't double trip.
- */
- xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
- return XFS_ITEM_FLUSHING;
- }
- /* NOTREACHED */
+ return XFS_ITEM_PUSHBUF;
}
/* Stale items should force out the iclog */
if (ip->i_flags & XFS_ISTALE) {
xfs_ifunlock(ip);
+ /*
+ * we hold the AIL lock - notify the unlock routine of this
+ * so it doesn't try to get the lock again.
+ */
xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
return XFS_ITEM_PINNED;
}
@@ -746,11 +733,8 @@ xfs_inode_item_committed(
* This gets called by xfs_trans_push_ail(), when IOP_TRYLOCK
* failed to get the inode flush lock but did get the inode locked SHARED.
* Here we're trying to see if the inode buffer is incore, and if so whether it's
- * marked delayed write. If that's the case, we'll initiate a bawrite on that
- * buffer to expedite the process.
- *
- * We aren't holding the AIL lock (or the flush lock) when this gets called,
- * so it is inherently race-y.
+ * marked delayed write. If that's the case, we'll promote it and that will
+ * allow the caller to write the buffer by triggering the xfsbufd to run.
*/
STATIC void
xfs_inode_item_pushbuf(
@@ -759,26 +743,16 @@ xfs_inode_item_pushbuf(
xfs_inode_t *ip;
xfs_mount_t *mp;
xfs_buf_t *bp;
- uint dopush;
ip = iip->ili_inode;
-
ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
/*
- * The ili_pushbuf_flag keeps others from
- * trying to duplicate our effort.
- */
- ASSERT(iip->ili_pushbuf_flag != 0);
- ASSERT(iip->ili_push_owner == current_pid());
-
- /*
* If a flush is not in progress anymore, chances are that the
* inode was taken off the AIL. So, just get out.
*/
if (completion_done(&ip->i_flush) ||
((iip->ili_item.li_flags & XFS_LI_IN_AIL) == 0)) {
- iip->ili_pushbuf_flag = 0;
xfs_iunlock(ip, XFS_ILOCK_SHARED);
return;
}
@@ -787,53 +761,12 @@ xfs_inode_item_pushbuf(
bp = xfs_incore(mp->m_ddev_targp, iip->ili_format.ilf_blkno,
iip->ili_format.ilf_len, XBF_TRYLOCK);
- if (bp != NULL) {
- if (XFS_BUF_ISDELAYWRITE(bp)) {
- /*
- * We were racing with iflush because we don't hold
- * the AIL lock or the flush lock. However, at this point,
- * we have the buffer, and we know that it's dirty.
- * So, it's possible that iflush raced with us, and
- * this item is already taken off the AIL.
- * If not, we can flush it async.
- */
- dopush = ((iip->ili_item.li_flags & XFS_LI_IN_AIL) &&
- !completion_done(&ip->i_flush));
- iip->ili_pushbuf_flag = 0;
- xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
- trace_xfs_inode_item_push(bp, _RET_IP_);
-
- if (XFS_BUF_ISPINNED(bp))
- xfs_log_force(mp, 0);
-
- if (dopush) {
- int error;
- error = xfs_bawrite(mp, bp);
- if (error)
- xfs_fs_cmn_err(CE_WARN, mp,
- "xfs_inode_item_pushbuf: pushbuf error %d on iip %p, bp %p",
- error, iip, bp);
- } else {
- xfs_buf_relse(bp);
- }
- } else {
- iip->ili_pushbuf_flag = 0;
- xfs_iunlock(ip, XFS_ILOCK_SHARED);
- xfs_buf_relse(bp);
- }
- return;
- }
- /*
- * We have to be careful about resetting pushbuf flag too early (above).
- * Even though in theory we can do it as soon as we have the buflock,
- * we don't want others to be doing work needlessly. They'll come to
- * this function thinking that pushing the buffer is their
- * responsibility only to find that the buffer is still locked by
- * another doing the same thing
- */
- iip->ili_pushbuf_flag = 0;
xfs_iunlock(ip, XFS_ILOCK_SHARED);
+ if (!bp)
+ return;
+ if (XFS_BUF_ISDELAYWRITE(bp))
+ xfs_buf_delwri_promote(bp);
+ xfs_buf_relse(bp);
return;
}
@@ -937,7 +870,6 @@ xfs_inode_item_init(
/*
We have zeroed memory. No need ...
iip->ili_extents_buf = NULL;
- iip->ili_pushbuf_flag = 0;
*/
iip->ili_format.ilf_type = XFS_LI_INODE;
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index cc8df1a..9a46795 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -144,12 +144,6 @@ typedef struct xfs_inode_log_item {
data exts */
struct xfs_bmbt_rec *ili_aextents_buf; /* array of logged
attr exts */
- unsigned int ili_pushbuf_flag; /* one bit used in push_ail */
-
-#ifdef DEBUG
- uint64_t ili_push_owner; /* one who sets pushbuf_flag
- above gets to push the buf */
-#endif
#ifdef XFS_TRANS_DEBUG
int ili_root_size;
char *ili_orig_root;
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index ca64f33..c93e3a1 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -861,8 +861,7 @@ typedef struct xfs_item_ops {
#define XFS_ITEM_SUCCESS 0
#define XFS_ITEM_PINNED 1
#define XFS_ITEM_LOCKED 2
-#define XFS_ITEM_FLUSHING 3
-#define XFS_ITEM_PUSHBUF 4
+#define XFS_ITEM_PUSHBUF 3
/*
* This structure is used to maintain a list of block ranges that have been
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index d7b1af8..e799824 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -253,6 +253,7 @@ xfsaild_push(
int flush_log, count, stuck;
xfs_mount_t *mp = ailp->xa_mount;
struct xfs_ail_cursor *cur = &ailp->xa_cursors;
+ int push_xfsbufd = 0;
spin_lock(&ailp->xa_lock);
xfs_trans_ail_cursor_init(ailp, cur);
@@ -308,6 +309,7 @@ xfsaild_push(
XFS_STATS_INC(xs_push_ail_pushbuf);
IOP_PUSHBUF(lip);
last_pushed_lsn = lsn;
+ push_xfsbufd = 1;
break;
case XFS_ITEM_PINNED:
@@ -322,12 +324,6 @@ xfsaild_push(
stuck++;
break;
- case XFS_ITEM_FLUSHING:
- XFS_STATS_INC(xs_push_ail_flushing);
- last_pushed_lsn = lsn;
- stuck++;
- break;
-
default:
ASSERT(0);
break;
@@ -374,6 +370,11 @@ xfsaild_push(
xfs_log_force(mp, 0);
}
+ if (push_xfsbufd) {
+ /* we've got delayed write buffers to flush */
+ wake_up_process(mp->m_ddev_targp->bt_task);
+ }
+
if (!count) {
/* We're past our target or empty, so idle */
last_pushed_lsn = 0;
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 4/9] xfs: Sort delayed write buffers before dispatch
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
` (2 preceding siblings ...)
2010-02-09 3:56 ` [PATCH 3/9] xfs: Don't issue buffer IO direct from AIL push V2 Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 3:56 ` [PATCH 5/9] xfs: Use delay write promotion for dquot flushing Dave Chinner
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs
Currently when the xfsbufd writes delayed write buffers, it pushes
them to disk in the order they come off the delayed write list. If
there are lots of buffers ѕpread widely over the disk, this results
in overwhelming the elevator sort queues in the block layer and we
end up losing the posibility of merging adjacent buffers to minimise
the number of IOs.
Use the new generic list_sort function to sort the delwri dispatch
queue before issue to ensure that the buffers are pushed in the most
friendly order possible to the lower layers.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/linux-2.6/xfs_buf.c | 87 ++++++++++++++++++++++++++++++--------------
1 files changed, 60 insertions(+), 27 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b306265..4556a4c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -33,6 +33,7 @@
#include <linux/migrate.h>
#include <linux/backing-dev.h>
#include <linux/freezer.h>
+#include <linux/list_sort.h>
#include "xfs_sb.h"
#include "xfs_inum.h"
@@ -1877,14 +1878,42 @@ xfs_buf_delwri_split(
}
+/*
+ * Compare function is more complex than it needs to be because
+ * the return value is only 32 bits and we are doing comparisons
+ * on 64 bit values
+ */
+static int
+xfs_buf_cmp(
+ void *priv,
+ struct list_head *a,
+ struct list_head *b)
+{
+ struct xfs_buf *ap = container_of(a, struct xfs_buf, b_list);
+ struct xfs_buf *bp = container_of(b, struct xfs_buf, b_list);
+ xfs_daddr_t diff;
+
+ diff = ap->b_bn - bp->b_bn;
+ if (diff < 0)
+ return -1;
+ if (diff > 0)
+ return 1;
+ return 0;
+}
+
+void
+xfs_buf_delwri_sort(
+ xfs_buftarg_t *target,
+ struct list_head *list)
+{
+ list_sort(NULL, list, xfs_buf_cmp);
+}
+
STATIC int
xfsbufd(
void *data)
{
- struct list_head tmp;
- xfs_buftarg_t *target = (xfs_buftarg_t *)data;
- int count;
- xfs_buf_t *bp;
+ xfs_buftarg_t *target = (xfs_buftarg_t *)data;
current->flags |= PF_MEMALLOC;
@@ -1893,6 +1922,8 @@ xfsbufd(
do {
long age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
long tout = xfs_buf_timer_centisecs * msecs_to_jiffies(10);
+ int count = 0;
+ struct list_head tmp;
if (unlikely(freezing(current))) {
set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
@@ -1907,11 +1938,10 @@ xfsbufd(
schedule_timeout_interruptible(tout);
xfs_buf_delwri_split(target, &tmp, age);
- count = 0;
+ list_sort(NULL, &tmp, xfs_buf_cmp);
while (!list_empty(&tmp)) {
- bp = list_entry(tmp.next, xfs_buf_t, b_list);
- ASSERT(target == bp->b_target);
-
+ struct xfs_buf *bp;
+ bp = list_first_entry(&tmp, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
xfs_buf_iostrategy(bp);
count++;
@@ -1937,42 +1967,45 @@ xfs_flush_buftarg(
xfs_buftarg_t *target,
int wait)
{
- struct list_head tmp;
- xfs_buf_t *bp, *n;
+ xfs_buf_t *bp;
int pincount = 0;
+ LIST_HEAD(tmp_list);
+ LIST_HEAD(wait_list);
xfs_buf_runall_queues(xfsconvertd_workqueue);
xfs_buf_runall_queues(xfsdatad_workqueue);
xfs_buf_runall_queues(xfslogd_workqueue);
set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
- pincount = xfs_buf_delwri_split(target, &tmp, 0);
+ pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
/*
- * Dropped the delayed write list lock, now walk the temporary list
+ * Dropped the delayed write list lock, now walk the temporary list.
+ * All I/O is issued async and then if we need to wait for completion
+ * we do that after issuing all the IO.
*/
- list_for_each_entry_safe(bp, n, &tmp, b_list) {
+ list_sort(NULL, &tmp_list, xfs_buf_cmp);
+ while (!list_empty(&tmp_list)) {
+ bp = list_first_entry(&tmp_list, struct xfs_buf, b_list);
ASSERT(target == bp->b_target);
- if (wait)
+ list_del_init(&bp->b_list);
+ if (wait) {
bp->b_flags &= ~XBF_ASYNC;
- else
- list_del_init(&bp->b_list);
-
+ list_add(&bp->b_list, &wait_list);
+ }
xfs_buf_iostrategy(bp);
}
- if (wait)
+ if (wait) {
+ /* Expedite and wait for IO to complete. */
blk_run_address_space(target->bt_mapping);
+ while (!list_empty(&wait_list)) {
+ bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
- /*
- * Remaining list items must be flushed before returning
- */
- while (!list_empty(&tmp)) {
- bp = list_entry(tmp.next, xfs_buf_t, b_list);
-
- list_del_init(&bp->b_list);
- xfs_iowait(bp);
- xfs_buf_relse(bp);
+ list_del_init(&bp->b_list);
+ xfs_iowait(bp);
+ xfs_buf_relse(bp);
+ }
}
return pincount;
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 5/9] xfs: Use delay write promotion for dquot flushing
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
` (3 preceding siblings ...)
2010-02-09 3:56 ` [PATCH 4/9] xfs: Sort delayed write buffers before dispatch Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 3:56 ` [PATCH 6/9] xfs: kill the unused XFS_QMOPT_* flush flags V2 Dave Chinner
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs
xfs_qm_dqflock_pushbuf_wait() does a very similar trick to item
pushing used to do to flush out delayed write dquot buffers. Change
it to use the new promotion method rather than an async flush.
Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock
held, yet the callers make the assumption that after this call the
flush lock is held. Always return with the flush lock held.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/quota/xfs_dquot.c | 25 ++++++++++---------------
1 files changed, 10 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index f9baeed..1620a56 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -1528,21 +1528,16 @@ xfs_qm_dqflock_pushbuf_wait(
*/
bp = xfs_incore(dqp->q_mount->m_ddev_targp, dqp->q_blkno,
XFS_QI_DQCHUNKLEN(dqp->q_mount), XBF_TRYLOCK);
- if (bp != NULL) {
- if (XFS_BUF_ISDELAYWRITE(bp)) {
- int error;
-
- if (XFS_BUF_ISPINNED(bp))
- xfs_log_force(dqp->q_mount, 0);
- error = xfs_bawrite(dqp->q_mount, bp);
- if (error)
- xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
- "xfs_qm_dqflock_pushbuf_wait: "
- "pushbuf error %d on dqp %p, bp %p",
- error, dqp, bp);
- } else {
- xfs_buf_relse(bp);
- }
+ if (!bp)
+ goto out_lock;
+
+ if (XFS_BUF_ISDELAYWRITE(bp)) {
+ if (XFS_BUF_ISPINNED(bp))
+ xfs_log_force(dqp->q_mount, 0);
+ xfs_buf_delwri_promote(bp);
+ wake_up_process(bp->b_target->bt_task);
}
+ xfs_buf_relse(bp);
+out_lock:
xfs_dqflock(dqp);
}
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 6/9] xfs: kill the unused XFS_QMOPT_* flush flags V2
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
` (4 preceding siblings ...)
2010-02-09 3:56 ` [PATCH 5/9] xfs: Use delay write promotion for dquot flushing Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 3:56 ` [PATCH 7/9] xfs: remove invalid barrier optimization from xfs_fsync Dave Chinner
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs
dquots are never flushed asynchronously. Remove the flag and the
async write support from the flush function. Make the default flush
a delwri flush to make the inode flush code, which leaves the
XFS_QMOPT_SYNC the only flag remaining. Convert that to use
SYNC_WAIT instead, just like the inode flush code.
V2:
- just pass flush flags straight through
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/quota/xfs_dquot.c | 13 ++++++-------
fs/xfs/quota/xfs_dquot_item.c | 2 +-
fs/xfs/quota/xfs_qm.c | 14 ++++++--------
fs/xfs/xfs_quota.h | 8 +-------
4 files changed, 14 insertions(+), 23 deletions(-)
diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index 1620a56..5f79dd7 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -1187,7 +1187,7 @@ xfs_qm_dqflush(
* block, nada.
*/
if (!XFS_DQ_IS_DIRTY(dqp) ||
- (!(flags & XFS_QMOPT_SYNC) && atomic_read(&dqp->q_pincount) > 0)) {
+ (!(flags & SYNC_WAIT) && atomic_read(&dqp->q_pincount) > 0)) {
xfs_dqfunlock(dqp);
return 0;
}
@@ -1251,18 +1251,17 @@ xfs_qm_dqflush(
xfs_log_force(mp, 0);
}
- if (flags & XFS_QMOPT_DELWRI) {
- xfs_bdwrite(mp, bp);
- } else {
+ if (flags & SYNC_WAIT)
error = xfs_bwrite(mp, bp);
- }
+ else
+ xfs_bdwrite(mp, bp);
trace_xfs_dqflush_done(dqp);
/*
* dqp is still locked, but caller is free to unlock it now.
*/
- return (error);
+ return error;
}
@@ -1443,7 +1442,7 @@ xfs_qm_dqpurge(
* We don't care about getting disk errors here. We need
* to purge this dquot anyway, so we go ahead regardless.
*/
- error = xfs_qm_dqflush(dqp, XFS_QMOPT_SYNC);
+ error = xfs_qm_dqflush(dqp, SYNC_WAIT);
if (error)
xfs_fs_cmn_err(CE_WARN, mp,
"xfs_qm_dqpurge: dquot %p flush failed", dqp);
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index dda0fb0..4e4ee9a 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -153,7 +153,7 @@ xfs_qm_dquot_logitem_push(
* lock without sleeping, then there must not have been
* anyone in the process of flushing the dquot.
*/
- error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+ error = xfs_qm_dqflush(dqp, 0);
if (error)
xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
"xfs_qm_dquot_logitem_push: push error %d on dqp %p",
diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c
index 11cfd82..8699e51 100644
--- a/fs/xfs/quota/xfs_qm.c
+++ b/fs/xfs/quota/xfs_qm.c
@@ -450,7 +450,7 @@ xfs_qm_unmount_quotas(
STATIC int
xfs_qm_dqflush_all(
xfs_mount_t *mp,
- int flags)
+ int sync_mode)
{
int recl;
xfs_dquot_t *dqp;
@@ -486,7 +486,7 @@ again:
* across a disk write.
*/
xfs_qm_mplist_unlock(mp);
- error = xfs_qm_dqflush(dqp, flags);
+ error = xfs_qm_dqflush(dqp, sync_mode);
xfs_dqunlock(dqp);
if (error)
return error;
@@ -926,13 +926,11 @@ xfs_qm_sync(
{
int recl, restarts;
xfs_dquot_t *dqp;
- uint flush_flags;
int error;
if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
return 0;
- flush_flags = (flags & SYNC_WAIT) ? XFS_QMOPT_SYNC : XFS_QMOPT_DELWRI;
restarts = 0;
again:
@@ -992,7 +990,7 @@ xfs_qm_sync(
* across a disk write
*/
xfs_qm_mplist_unlock(mp);
- error = xfs_qm_dqflush(dqp, flush_flags);
+ error = xfs_qm_dqflush(dqp, flags);
xfs_dqunlock(dqp);
if (error && XFS_FORCED_SHUTDOWN(mp))
return 0; /* Need to prevent umount failure */
@@ -1796,7 +1794,7 @@ xfs_qm_quotacheck(
* successfully.
*/
if (!error)
- error = xfs_qm_dqflush_all(mp, XFS_QMOPT_DELWRI);
+ error = xfs_qm_dqflush_all(mp, 0);
/*
* We can get this error if we couldn't do a dquot allocation inside
@@ -2018,7 +2016,7 @@ xfs_qm_shake_freelist(
* We flush it delayed write, so don't bother
* releasing the mplock.
*/
- error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+ error = xfs_qm_dqflush(dqp, 0);
if (error) {
xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
"xfs_qm_dqflush_all: dquot %p flush failed", dqp);
@@ -2201,7 +2199,7 @@ xfs_qm_dqreclaim_one(void)
* We flush it delayed write, so don't bother
* releasing the freelist lock.
*/
- error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+ error = xfs_qm_dqflush(dqp, 0);
if (error) {
xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
"xfs_qm_dqreclaim: dquot %p flush failed", dqp);
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 21d11d9..fdcab3f 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -223,15 +223,9 @@ typedef struct xfs_qoff_logformat {
#define XFS_QMOPT_RES_INOS 0x0800000
/*
- * flags for dqflush and dqflush_all.
- */
-#define XFS_QMOPT_SYNC 0x1000000
-#define XFS_QMOPT_DELWRI 0x4000000
-
-/*
* flags for dqalloc.
*/
-#define XFS_QMOPT_INHERIT 0x8000000
+#define XFS_QMOPT_INHERIT 0x1000000
/*
* flags to xfs_trans_mod_dquot.
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 7/9] xfs: remove invalid barrier optimization from xfs_fsync
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
` (5 preceding siblings ...)
2010-02-09 3:56 ` [PATCH 6/9] xfs: kill the unused XFS_QMOPT_* flush flags V2 Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 3:56 ` [PATCH 8/9] xfs: log changed inodes instead of writing them synchronously Dave Chinner
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs; +Cc: Christoph Hellwig
From: Christoph Hellwig <hch@lst.de>
We always need to flush the disk write cache and can't skip it just because
the no inode attributes have changed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/xfs_vnodeops.c | 12 ++----------
1 files changed, 2 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index fd108b7..43241e2 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -597,7 +597,7 @@ xfs_fsync(
{
xfs_trans_t *tp;
int error = 0;
- int log_flushed = 0, changed = 1;
+ int log_flushed = 0;
xfs_itrace_entry(ip);
@@ -627,18 +627,10 @@ xfs_fsync(
* disk yet, the inode will be still be pinned. If it is,
* force the log.
*/
-
xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
if (xfs_ipincount(ip)) {
error = _xfs_log_force(ip->i_mount, XFS_LOG_SYNC,
&log_flushed);
- } else {
- /*
- * If the inode is not pinned and nothing has changed
- * we don't need to flush the cache.
- */
- changed = 0;
}
} else {
/*
@@ -673,7 +665,7 @@ xfs_fsync(
xfs_iunlock(ip, XFS_ILOCK_EXCL);
}
- if ((ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
+ if (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) {
/*
* If the log write didn't issue an ordered tag we need
* to flush the disk cache for the data device now.
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 8/9] xfs: log changed inodes instead of writing them synchronously
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
` (6 preceding siblings ...)
2010-02-09 3:56 ` [PATCH 7/9] xfs: remove invalid barrier optimization from xfs_fsync Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 3:56 ` [PATCH 9/9] xfs: kill xfs_bawrite Dave Chinner
2010-02-09 19:10 ` [PATCH 0/9] Delayed write metadata writeback V5 Alex Elder
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs; +Cc: Christoph Hellwig
From: Christoph Hellwig <hch@infradead.org>
When an inode has already be flushed delayed write,
xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
return on a synchronous inode write without having written the
inode. Currently these sycnhronous writes only come sync(1),
unmount, a sycnhronous NFS export and cachefiles so should be
relatively rare and out of common performance paths.
Realistically, a synchronous inode write is not necessary here; we
can avoid writing the inode by logging any non-transactional changes
that are pending. This needs to be done with synchronous
transactions, but it avoids seeking between the log and inode
clusters as we do now. We don't force the log if the inode is
pinned, though, so this differs from the fsync case. For normal
sys_sync and unmount behaviour this is fine because we do a
synchronous log force in xfs_sync_data which is called from the
->sync_fs code.
It does however break the NFS synchronous export guarantees for now,
but work is under way to fix this at a higher level or for the
higher level to provide an additional flag in the writeback control
to tell us that a log force is needed.
Portions of this patch are based on work from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_super.c | 111 +++++++++++++++++++++++++++++++-----------
1 files changed, 82 insertions(+), 29 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 3b5b46b..25ea240 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1021,12 +1021,45 @@ xfs_fs_dirty_inode(
XFS_I(inode)->i_update_core = 1;
}
-/*
- * Attempt to flush the inode, this will actually fail
- * if the inode is pinned, but we dirty the inode again
- * at the point when it is unpinned after a log write,
- * since this is when the inode itself becomes flushable.
- */
+STATIC int
+xfs_log_inode(
+ struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_trans *tp;
+ int error;
+
+ xfs_iunlock(ip, XFS_ILOCK_SHARED);
+ tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
+ error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0);
+
+ if (error) {
+ xfs_trans_cancel(tp, 0);
+ /* we need to return with the lock hold shared */
+ xfs_ilock(ip, XFS_ILOCK_SHARED);
+ return error;
+ }
+
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+ /*
+ * Note - it's possible that we might have pushed ourselves out of the
+ * way during trans_reserve which would flush the inode. But there's
+ * no guarantee that the inode buffer has actually gone out yet (it's
+ * delwri). Plus the buffer could be pinned anyway if it's part of
+ * an inode in another recent transaction. So we play it safe and
+ * fire off the transaction anyway.
+ */
+ xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+ xfs_trans_ihold(tp, ip);
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ xfs_trans_set_sync(tp);
+ error = xfs_trans_commit(tp, 0);
+ xfs_ilock_demote(ip, XFS_ILOCK_EXCL);
+
+ return error;
+}
+
STATIC int
xfs_fs_write_inode(
struct inode *inode,
@@ -1034,7 +1067,7 @@ xfs_fs_write_inode(
{
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
- int error = 0;
+ int error = EAGAIN;
xfs_itrace_entry(ip);
@@ -1045,35 +1078,55 @@ xfs_fs_write_inode(
error = xfs_wait_on_pages(ip, 0, -1);
if (error)
goto out;
- }
-
- /*
- * Bypass inodes which have already been cleaned by
- * the inode flush clustering code inside xfs_iflush
- */
- if (xfs_inode_clean(ip))
- goto out;
- /*
- * We make this non-blocking if the inode is contended, return
- * EAGAIN to indicate to the caller that they did not succeed.
- * This prevents the flush path from blocking on inodes inside
- * another operation right now, they get caught later by xfs_sync.
- */
- if (sync) {
+ /*
+ * Make sure the inode has hit stable storage. By using the
+ * log and the fsync transactions we reduce the IOs we have
+ * to do here from two (log and inode) to just the log.
+ *
+ * Note: We still need to do a delwri write of the inode after
+ * this to flush it to the backing buffer so that bulkstat
+ * works properly if this is the first time the inode has been
+ * written. Because we hold the ilock atomically over the
+ * transaction commit and the inode flush we are guaranteed
+ * that the inode is not pinned when it returns. If the flush
+ * lock is already held, then the inode has already been
+ * flushed once and we don't need to flush it again. Hence
+ * the code will only flush the inode if it isn't already
+ * being flushed.
+ */
xfs_ilock(ip, XFS_ILOCK_SHARED);
- xfs_iflock(ip);
-
- error = xfs_iflush(ip, SYNC_WAIT);
+ if (ip->i_update_core) {
+ error = xfs_log_inode(ip);
+ if (error)
+ goto out_unlock;
+ }
} else {
- error = EAGAIN;
+ /*
+ * We make this non-blocking if the inode is contended, return
+ * EAGAIN to indicate to the caller that they did not succeed.
+ * This prevents the flush path from blocking on inodes inside
+ * another operation right now, they get caught later by xfs_sync.
+ */
if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
goto out;
- if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
- goto out_unlock;
+ }
+
+ if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
+ goto out_unlock;
- error = xfs_iflush(ip, 0);
+ /*
+ * Now we have the flush lock and the inode is not pinned, we can check
+ * if the inode is really clean as we know that there are no pending
+ * transaction completions, it is not waiting on the delayed write
+ * queue and there is no IO in progress.
+ */
+ if (xfs_inode_clean(ip)) {
+ xfs_ifunlock(ip);
+ error = 0;
+ goto out_unlock;
}
+ error = xfs_iflush(ip, 0);
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_SHARED);
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 9/9] xfs: kill xfs_bawrite
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
` (7 preceding siblings ...)
2010-02-09 3:56 ` [PATCH 8/9] xfs: log changed inodes instead of writing them synchronously Dave Chinner
@ 2010-02-09 3:56 ` Dave Chinner
2010-02-09 19:10 ` [PATCH 0/9] Delayed write metadata writeback V5 Alex Elder
9 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-02-09 3:56 UTC (permalink / raw)
To: xfs
There are no more users of this function left in the XFS code
now that we've switched everything to delayed write flushing.
Remove it.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/linux-2.6/xfs_buf.c | 19 -------------------
fs/xfs/linux-2.6/xfs_buf.h | 1 -
2 files changed, 0 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 4556a4c..d50df3a 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1078,25 +1078,6 @@ xfs_bwrite(
return error;
}
-int
-xfs_bawrite(
- void *mp,
- struct xfs_buf *bp)
-{
- trace_xfs_buf_bawrite(bp, _RET_IP_);
-
- ASSERT(bp->b_bn != XFS_BUF_DADDR_NULL);
-
- xfs_buf_delwri_dequeue(bp);
-
- bp->b_flags &= ~(XBF_READ | XBF_DELWRI | XBF_READ_AHEAD);
- bp->b_flags |= (XBF_WRITE | XBF_ASYNC | _XBF_RUN_QUEUES);
-
- bp->b_mount = mp;
- bp->b_strat = xfs_bdstrat_cb;
- return xfs_bdstrat_cb(bp);
-}
-
void
xfs_bdwrite(
void *mp,
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index be45e8c..386e736 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -233,7 +233,6 @@ extern void xfs_buf_unlock(xfs_buf_t *);
/* Buffer Read and Write Routines */
extern int xfs_bwrite(struct xfs_mount *mp, struct xfs_buf *bp);
-extern int xfs_bawrite(void *mp, xfs_buf_t *bp);
extern void xfs_bdwrite(void *mp, xfs_buf_t *bp);
extern void xfsbdstrat(struct xfs_mount *, struct xfs_buf *);
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 0/9] Delayed write metadata writeback V5
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
` (8 preceding siblings ...)
2010-02-09 3:56 ` [PATCH 9/9] xfs: kill xfs_bawrite Dave Chinner
@ 2010-02-09 19:10 ` Alex Elder
9 siblings, 0 replies; 11+ messages in thread
From: Alex Elder @ 2010-02-09 19:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Tue, 2010-02-09 at 14:56 +1100, Dave Chinner wrote:
> While I started with killing async inode writeback, the series has
> grown. It's not really limited to inode writeback - it touches dquot
> flushing, changes the way the AIL pushes on buffers, adds xfsbufd
> sorting for delayed write buffers, adds a real non-blocking mode to
> inode reclaim and avoids physical inode writeback from the VFS while
> fixing bugs in handling delayed write inodes. Hence this is more
> about enabling efficient delayed write metadata than it is able
> killing async inode writeback.
>
> The idea behind this series is to make metadata buffers get
> written from xfsbufd via the delayed write queue rather than being
> issued asynchronously from all over the place. To do this, async
> buffer writeback is almost entirely removed from XFS, replaced
> instead by delayed writes and a method to expedite flushing of
> delayed write buffers when required.
>
> The result of funnelling all the buffer IO into a single place
> is that we can more tightly control and therefore optimise the
> submission of metadata IO. Aggregating the buffers before dispatch
> allows much better sort efficiency of the buffers as the sort window
> is not limited to the size of the elevator congestion hysteresis
> limit. Hence we can approach 100% merge effeciency on large numbers
> of buffers when dispatched for IO and greatly reduce the amount
> of seeking metadata writeback causes.
>
> The major change is to the inode flushing and reclaim code. Delayed
> write inodes hold the flush lock for much longer than for async
> writeback, and hence blocking on the flush lock can cause extremely
> long latencies without other mechanisms to expedite the release of
> the flush locks. To prevent needing to flush inodes immediately,
> all operations are done non-blocking unless synchronous. This
> required a significant rework of the inode reclaim code, but it
> greatly simplified other pieces of code (e.g. log item pushing).
>
> Version 5
> - drop the fsync changes to xfs_fs_write_inode() and the associated
> locking changes, replace them with a targeted inode logging
> function from Christoph Hellwig to fix a performance regression on
> fs_mark -S4 workloads on an SSD.
>
> Version 4
> - rework inode reclaim checks for better legibility
> - add warning to reclaim code when delwri flush errors occur
> - kill XFS_ITEM_FLUSHING now it is not used
> - clean up sync_mode flags being pushed into xfs_iflush()
> - kill the now unused xfs_bawrite() function
> - include Christoph's fsync cache flush fix
> - rework the inode locking and call to xfs_fsync() when doing
> synchronous inode writes to close races between the fsync and
> the background delwri flush afterwards.
>
> Version 3
> - rework inode reclaim to:
> - separate it from xfs_iflush return values
> - provide a non-blocking mode for background operation
> - apply delwri buffer promotion tricks to dquot flushing
> - kill unneeded dquot flushing flags, similar to inode flushing flag
> removal
> - fix sync inode flush bug when trying to flush delwri inodes
>
> Version 2:
> - use generic list sort function
> - when unmounting, push the delwri buffers first, then do sync inode
> reclaim so that reclaim doesn't block for 15 seconds waiting for
> delwri inode buffers to be aged and written before the inodes can
> be reclaimed.
>
> Alex, the patch series is available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/dgc/xfs for-2.6.34
I looked over the whole series again and it all looks good
to me. I will pull from your for-2.6.34 branch and will
post it on OSS after I've tested it a bit.
Signed-off-by: Alex Elder <aelder@sgi.com>
-Alex
> Christoph Hellwig (2):
> xfs: remove invalid barrier optimization from xfs_fsync
> xfs: log changed inodes instead of writing them synchronously
>
> Dave Chinner (7):
> xfs: Make inode reclaim states explicit
> xfs: Use delayed write for inodes rather than async V2
> xfs: Don't issue buffer IO direct from AIL push V2
> xfs: Sort delayed write buffers before dispatch
> xfs: Use delay write promotion for dquot flushing
> xfs: kill the unused XFS_QMOPT_* flush flags V2
> xfs: kill xfs_bawrite
>
> fs/xfs/linux-2.6/xfs_buf.c | 135 ++++++++++++++++++++++++++--------------
> fs/xfs/linux-2.6/xfs_buf.h | 3 +-
> fs/xfs/linux-2.6/xfs_super.c | 111 ++++++++++++++++++++++++---------
> fs/xfs/linux-2.6/xfs_sync.c | 138 +++++++++++++++++++++++++++++++++-------
> fs/xfs/linux-2.6/xfs_trace.h | 1 +
> fs/xfs/quota/xfs_dquot.c | 38 +++++-------
> fs/xfs/quota/xfs_dquot_item.c | 87 ++++----------------------
> fs/xfs/quota/xfs_dquot_item.h | 4 -
> fs/xfs/quota/xfs_qm.c | 14 ++---
> fs/xfs/xfs_buf_item.c | 64 ++++++++++---------
> fs/xfs/xfs_inode.c | 86 ++------------------------
> fs/xfs/xfs_inode.h | 11 +---
> fs/xfs/xfs_inode_item.c | 108 +++++++-------------------------
> fs/xfs/xfs_inode_item.h | 6 --
> fs/xfs/xfs_mount.c | 13 ++++-
> fs/xfs/xfs_quota.h | 8 +--
> fs/xfs/xfs_trans.h | 3 +-
> fs/xfs/xfs_trans_ail.c | 13 ++--
> fs/xfs/xfs_vnodeops.c | 12 +---
> 19 files changed, 410 insertions(+), 445 deletions(-)
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2010-02-09 19:08 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-02-09 3:56 [PATCH 0/9] Delayed write metadata writeback V5 Dave Chinner
2010-02-09 3:56 ` [PATCH 1/9] xfs: Make inode reclaim states explicit Dave Chinner
2010-02-09 3:56 ` [PATCH 2/9] xfs: Use delayed write for inodes rather than async V2 Dave Chinner
2010-02-09 3:56 ` [PATCH 3/9] xfs: Don't issue buffer IO direct from AIL push V2 Dave Chinner
2010-02-09 3:56 ` [PATCH 4/9] xfs: Sort delayed write buffers before dispatch Dave Chinner
2010-02-09 3:56 ` [PATCH 5/9] xfs: Use delay write promotion for dquot flushing Dave Chinner
2010-02-09 3:56 ` [PATCH 6/9] xfs: kill the unused XFS_QMOPT_* flush flags V2 Dave Chinner
2010-02-09 3:56 ` [PATCH 7/9] xfs: remove invalid barrier optimization from xfs_fsync Dave Chinner
2010-02-09 3:56 ` [PATCH 8/9] xfs: log changed inodes instead of writing them synchronously Dave Chinner
2010-02-09 3:56 ` [PATCH 9/9] xfs: kill xfs_bawrite Dave Chinner
2010-02-09 19:10 ` [PATCH 0/9] Delayed write metadata writeback V5 Alex Elder
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox