* [PATCH 0/2] Fix inode reclaim problems (hopefully)
@ 2010-01-06 23:05 Dave Chinner
2010-01-06 23:05 ` [PATCH 1/2] xfs: reclaim inodes under a write lock Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Dave Chinner @ 2010-01-06 23:05 UTC (permalink / raw)
To: xfs
These two patches seem to fix the inode reclaim issues I've been
able to reproduce lately. The changes are still running xfsqa in a
loop to confirm this, but the directory/small file stress test I've
been running to trigger the problem has run for 10 hours with these
fixes instead of dying after 20-30 minutes.
The first patch is a rewrite of Christoph's reclaim under write lock
fixes without all the code duplication, and the second avoids direct
reclaim altogether because xfs_inode_clean() needs to be run when the
flush lock is held to ensure the inode not under IO and really is
reclaimable.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/2] xfs: reclaim inodes under a write lock
2010-01-06 23:05 [PATCH 0/2] Fix inode reclaim problems (hopefully) Dave Chinner
@ 2010-01-06 23:05 ` Dave Chinner
2010-01-08 10:20 ` Christoph Hellwig
2010-01-06 23:05 ` [PATCH 2/2] xfs: reclaim all inodes by background tree walks Dave Chinner
2010-01-07 10:49 ` [PATCH 0/2] Fix inode reclaim problems (hopefully) Dave Chinner
2 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2010-01-06 23:05 UTC (permalink / raw)
To: xfs
Make the inode tree reclaim walk exclusive to avoid races with
concurrent sync walkers and lookups. This is a version of
a patch posted by Christoph Hellwig that avoids all the code
duplication.
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_sync.c | 90 +++++++++++++++++++--------------------
fs/xfs/linux-2.6/xfs_sync.h | 2 +-
fs/xfs/quota/xfs_qm_syscalls.c | 2 +-
3 files changed, 46 insertions(+), 48 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index f974d1a..6d1cd6e 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -55,7 +55,8 @@ xfs_inode_ag_lookup(
struct xfs_mount *mp,
struct xfs_perag *pag,
uint32_t *first_index,
- int tag)
+ int tag,
+ int write_lock)
{
int nr_found;
struct xfs_inode *ip;
@@ -65,7 +66,10 @@ xfs_inode_ag_lookup(
* as the tree is sparse and a gang lookup walks to find
* the number of objects requested.
*/
- read_lock(&pag->pag_ici_lock);
+ if (write_lock)
+ write_lock(&pag->pag_ici_lock);
+ else
+ read_lock(&pag->pag_ici_lock);
if (tag == XFS_ICI_NO_TAG) {
nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
(void **)&ip, *first_index, 1);
@@ -89,7 +93,10 @@ xfs_inode_ag_lookup(
return ip;
unlock:
- read_unlock(&pag->pag_ici_lock);
+ if (write_lock)
+ write_unlock(&pag->pag_ici_lock);
+ else
+ read_unlock(&pag->pag_ici_lock);
return NULL;
}
@@ -100,7 +107,8 @@ xfs_inode_ag_walk(
int (*execute)(struct xfs_inode *ip,
struct xfs_perag *pag, int flags),
int flags,
- int tag)
+ int tag,
+ int write_lock)
{
uint32_t first_index;
int last_error = 0;
@@ -113,7 +121,8 @@ restart:
int error = 0;
xfs_inode_t *ip;
- ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
+ ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag,
+ write_lock);
if (!ip)
break;
@@ -145,7 +154,8 @@ xfs_inode_ag_iterator(
int (*execute)(struct xfs_inode *ip,
struct xfs_perag *pag, int flags),
int flags,
- int tag)
+ int tag,
+ int write_lock)
{
int error = 0;
int last_error = 0;
@@ -159,7 +169,8 @@ xfs_inode_ag_iterator(
xfs_perag_put(pag);
continue;
}
- error = xfs_inode_ag_walk(mp, pag, execute, flags, tag);
+ error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
+ write_lock);
xfs_perag_put(pag);
if (error) {
last_error = error;
@@ -184,18 +195,20 @@ xfs_sync_inode_valid(
return EFSCORRUPTED;
}
- /*
- * If we can't get a reference on the inode, it must be in reclaim.
- * Leave it for the reclaim code to flush. Also avoid inodes that
- * haven't been fully initialised.
- */
+ /* avoid new or reclaimable inodes. Leave for reclaim code to flush */
+ if (xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM)) {
+ read_unlock(&pag->pag_ici_lock);
+ return ENOENT;
+ }
+
+ /* If we can't get a reference on the inode, it must be in reclaim. */
if (!igrab(inode)) {
read_unlock(&pag->pag_ici_lock);
return ENOENT;
}
read_unlock(&pag->pag_ici_lock);
- if (is_bad_inode(inode) || xfs_iflags_test(ip, XFS_INEW)) {
+ if (is_bad_inode(inode)) {
IRELE(ip);
return ENOENT;
}
@@ -285,7 +298,7 @@ xfs_sync_data(
ASSERT((flags & ~(SYNC_TRYLOCK|SYNC_WAIT)) == 0);
error = xfs_inode_ag_iterator(mp, xfs_sync_inode_data, flags,
- XFS_ICI_NO_TAG);
+ XFS_ICI_NO_TAG, 0);
if (error)
return XFS_ERROR(error);
@@ -307,7 +320,7 @@ xfs_sync_attr(
ASSERT((flags & ~SYNC_WAIT) == 0);
return xfs_inode_ag_iterator(mp, xfs_sync_inode_attr, flags,
- XFS_ICI_NO_TAG);
+ XFS_ICI_NO_TAG, 0);
}
STATIC int
@@ -669,33 +682,8 @@ xfs_reclaim_inode(
xfs_inode_t *ip,
int sync_mode)
{
- struct xfs_mount *mp = ip->i_mount;
- struct xfs_perag *pag;
-
-
/*
- * The radix tree lock here protects a thread in xfs_iget_core from
- * racing with us on linking the inode back with a vnode.
- * Once we have the XFS_IRECLAIM flag set it will not touch
- * us.
- */
- pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
- write_lock(&pag->pag_ici_lock);
- spin_lock(&ip->i_flags_lock);
- if (__xfs_iflags_test(ip, XFS_IRECLAIM) ||
- !__xfs_iflags_test(ip, XFS_IRECLAIMABLE)) {
- spin_unlock(&ip->i_flags_lock);
- write_unlock(&pag->pag_ici_lock);
- xfs_perag_put(pag);
- return EAGAIN;
- }
- __xfs_iflags_set(ip, XFS_IRECLAIM);
- spin_unlock(&ip->i_flags_lock);
- write_unlock(&pag->pag_ici_lock);
- xfs_perag_put(pag);
-
- /*
- * The inode is flushed delayed write. That means the flush lock
+ * Inodes are flushed delayed write. That means the flush lock
* may be held here and we will block for some time on it. Further,
* if we hold the inode lock, we prevent the AIL from locking and
* therefore being able to push the buffer. This means that we'll end
@@ -791,12 +779,22 @@ xfs_reclaim_inode_now(
struct xfs_perag *pag,
int flags)
{
- /* ignore if already under reclaim */
- if (xfs_iflags_test(ip, XFS_IRECLAIM)) {
- read_unlock(&pag->pag_ici_lock);
+ /*
+ * The radix tree lock here protects a thread in xfs_iget from racing
+ * with us starting reclaim on the inode. Once we have the
+ * XFS_IRECLAIM flag set it will not touch us.
+ */
+ spin_lock(&ip->i_flags_lock);
+ ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+ if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
+ /* ignore as it is already under reclaim */
+ spin_unlock(&ip->i_flags_lock);
+ write_unlock(&pag->pag_ici_lock);
return 0;
}
- read_unlock(&pag->pag_ici_lock);
+ __xfs_iflags_set(ip, XFS_IRECLAIM);
+ spin_unlock(&ip->i_flags_lock);
+ write_unlock(&pag->pag_ici_lock);
return xfs_reclaim_inode(ip, flags);
}
@@ -807,5 +805,5 @@ xfs_reclaim_inodes(
int mode)
{
return xfs_inode_ag_iterator(mp, xfs_reclaim_inode_now, mode,
- XFS_ICI_RECLAIM_TAG);
+ XFS_ICI_RECLAIM_TAG, 1);
}
diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
index a500b4d..ea932b4 100644
--- a/fs/xfs/linux-2.6/xfs_sync.h
+++ b/fs/xfs/linux-2.6/xfs_sync.h
@@ -54,6 +54,6 @@ void __xfs_inode_clear_reclaim_tag(struct xfs_mount *mp, struct xfs_perag *pag,
int xfs_sync_inode_valid(struct xfs_inode *ip, struct xfs_perag *pag);
int xfs_inode_ag_iterator(struct xfs_mount *mp,
int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag, int flags),
- int flags, int tag);
+ int flags, int tag, int write_lock);
#endif
diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
index 71af76f..873e07e 100644
--- a/fs/xfs/quota/xfs_qm_syscalls.c
+++ b/fs/xfs/quota/xfs_qm_syscalls.c
@@ -891,7 +891,7 @@ xfs_qm_dqrele_all_inodes(
uint flags)
{
ASSERT(mp->m_quotainfo);
- xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, XFS_ICI_NO_TAG);
+ xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, XFS_ICI_NO_TAG, 0);
}
/*------------------------------------------------------------------------*/
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/2] xfs: reclaim all inodes by background tree walks
2010-01-06 23:05 [PATCH 0/2] Fix inode reclaim problems (hopefully) Dave Chinner
2010-01-06 23:05 ` [PATCH 1/2] xfs: reclaim inodes under a write lock Dave Chinner
@ 2010-01-06 23:05 ` Dave Chinner
2010-01-08 10:24 ` Christoph Hellwig
2010-01-07 10:49 ` [PATCH 0/2] Fix inode reclaim problems (hopefully) Dave Chinner
2 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2010-01-06 23:05 UTC (permalink / raw)
To: xfs
We cannot do direct inode reclaim without taking the flush lock to
ensure that we do not reclaim an inode under IO. We check the inode
is clean before doing direct reclaim, but this is not good enough
because the inode flush code marks the inode clean once it has
copied the in-core dirty state to the backing buffer.
It is the flush lock that determines whether the inode is still
under IO, even though it is marked clean, and the inode is still
required at IO completion so we can't reclaim it even though it is
clean in core. Hence the requirement that we need to take the
flush lock even on clean inodes because this guarantees that the
inode writeback IO has completed and it is safe to reclaim the
inode.
With delayed write inode flushing, we coul dend up waiting a long
time on the flush lock even for a clean inode. The background
reclaim already handles this efficiently, so avoid all the problems
by killing the direct reclaim path altogether.
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_super.c | 14 ++++++--------
fs/xfs/linux-2.6/xfs_sync.c | 11 ++++++++++-
2 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index f3dd67d..23768f4 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -953,16 +953,14 @@ xfs_fs_destroy_inode(
ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
/*
- * If we have nothing to flush with this inode then complete the
- * teardown now, otherwise delay the flush operation.
+ * we always use background reclaim here because even if the
+ * inode is clean, it still may be under IO and hence we have
+ * to take the flush lock. The background reclaim path handles
+ * this more efficiently than we can here, so simply let background
+ * reclaim tear down all inodes.
*/
- if (!xfs_inode_clean(ip)) {
- xfs_inode_set_reclaim_tag(ip);
- return;
- }
-
out_reclaim:
- xfs_ireclaim(ip);
+ xfs_inode_set_reclaim_tag(ip);
}
/*
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 6d1cd6e..a1d7876 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -700,6 +700,8 @@ xfs_reclaim_inode(
/*
* In the case of a forced shutdown we rely on xfs_iflush() to
* wait for the inode to be unpinned before returning an error.
+ * Because we hold the flush lock, we know that the inode cannot
+ * be under IO, so if it reports clean it can be reclaimed.
*/
if (!is_bad_inode(VFS_I(ip)) && !xfs_inode_clean(ip)) {
/*
@@ -726,9 +728,16 @@ xfs_reclaim_inode(
return 0;
unlock_and_requeue:
+ /*
+ * We could return EAGAIN here to make reclaim rescan the inode tree in
+ * a short while. However, this just burns CPU time scanning the tree
+ * waiting for IO to complete and xfssyncd never goes back to the idle
+ * state. Instead, return 0 to let the next scheduled background reclaim
+ * attempt to reclaim the inode again.
+ */
xfs_iflags_clear(ip, XFS_IRECLAIM);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
- return EAGAIN;
+ return 0;
}
void
--
1.6.5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 0/2] Fix inode reclaim problems (hopefully)
2010-01-06 23:05 [PATCH 0/2] Fix inode reclaim problems (hopefully) Dave Chinner
2010-01-06 23:05 ` [PATCH 1/2] xfs: reclaim inodes under a write lock Dave Chinner
2010-01-06 23:05 ` [PATCH 2/2] xfs: reclaim all inodes by background tree walks Dave Chinner
@ 2010-01-07 10:49 ` Dave Chinner
2 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2010-01-07 10:49 UTC (permalink / raw)
To: xfs
On Thu, Jan 07, 2010 at 10:05:23AM +1100, Dave Chinner wrote:
> These two patches seem to fix the inode reclaim issues I've been
> able to reproduce lately. The changes are still running xfsqa in a
> loop to confirm this, but the directory/small file stress test I've
> been running to trigger the problem has run for 10 hours with these
> fixes instead of dying after 20-30 minutes.
Still running xfsqa 11 hours later without any problem, so it looks
like this series fixes the problem I was seeing. Score another win
for the new tracing code :)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] xfs: reclaim inodes under a write lock
2010-01-06 23:05 ` [PATCH 1/2] xfs: reclaim inodes under a write lock Dave Chinner
@ 2010-01-08 10:20 ` Christoph Hellwig
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2010-01-08 10:20 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Thu, Jan 07, 2010 at 10:05:24AM +1100, Dave Chinner wrote:
> Make the inode tree reclaim walk exclusive to avoid races with
> concurrent sync walkers and lookups. This is a version of
> a patch posted by Christoph Hellwig that avoids all the code
> duplication.
I don't like the write_lock flag very much, but given that the other
option is duplication we might have to live it.
> - /*
> - * If we can't get a reference on the inode, it must be in reclaim.
> - * Leave it for the reclaim code to flush. Also avoid inodes that
> - * haven't been fully initialised.
> - */
> + /* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> + if (xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM)) {
> + read_unlock(&pag->pag_ici_lock);
> + return ENOENT;
> + }
> +
> + /* If we can't get a reference on the inode, it must be in reclaim. */
> if (!igrab(inode)) {
> read_unlock(&pag->pag_ici_lock);
> return ENOENT;
> }
> read_unlock(&pag->pag_ici_lock);
>
> - if (is_bad_inode(inode) || xfs_iflags_test(ip, XFS_INEW)) {
> + if (is_bad_inode(inode)) {
> IRELE(ip);
> return ENOENT;
That's an unrelated change and should be a separate patch.
> @@ -791,12 +779,22 @@ xfs_reclaim_inode_now(
> struct xfs_perag *pag,
> int flags)
> {
> + /*
> + * The radix tree lock here protects a thread in xfs_iget from racing
> + * with us starting reclaim on the inode. Once we have the
> + * XFS_IRECLAIM flag set it will not touch us.
> + */
> + spin_lock(&ip->i_flags_lock);
> + ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
> + if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
> + /* ignore as it is already under reclaim */
> + spin_unlock(&ip->i_flags_lock);
> + write_unlock(&pag->pag_ici_lock);
> return 0;
> }
> + __xfs_iflags_set(ip, XFS_IRECLAIM);
> + spin_unlock(&ip->i_flags_lock);
> + write_unlock(&pag->pag_ici_lock);
>
> return xfs_reclaim_inode(ip, flags);
Once you move things around please merge xfs_reclaim_inode_now and
xfs_reclaim_inode into a single function.
And yes, all this currently doesn't apply against the XFS tree or
mainline, but you know that already.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] xfs: reclaim all inodes by background tree walks
2010-01-06 23:05 ` [PATCH 2/2] xfs: reclaim all inodes by background tree walks Dave Chinner
@ 2010-01-08 10:24 ` Christoph Hellwig
2010-01-08 10:43 ` Dave Chinner
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2010-01-08 10:24 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Looks safe to me. I wonder whaimpact leaving the inodes around for
longer has to memory usage for inode heavy workloads, though.
> unlock_and_requeue:
> + /*
> + * We could return EAGAIN here to make reclaim rescan the inode tree in
> + * a short while. However, this just burns CPU time scanning the tree
> + * waiting for IO to complete and xfssyncd never goes back to the idle
> + * state. Instead, return 0 to let the next scheduled background reclaim
> + * attempt to reclaim the inode again.
> + */
> xfs_iflags_clear(ip, XFS_IRECLAIM);
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> - return EAGAIN;
> + return 0;
This is an unrelated change and should be a patch of it's own.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] xfs: reclaim all inodes by background tree walks
2010-01-08 10:24 ` Christoph Hellwig
@ 2010-01-08 10:43 ` Dave Chinner
0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2010-01-08 10:43 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Fri, Jan 08, 2010 at 05:24:08AM -0500, Christoph Hellwig wrote:
>
> Looks safe to me. I wonder whaimpact leaving the inodes around for
> longer has to memory usage for inode heavy workloads, though.
In terms of wall time, nothing I can measure, but it seems to reduce
system time slightly. My main concern is memory pressure - maybe it
needs a shrinker registered to reclaim inodes immediately rather
than waiting for the next xfsssyncd run...
> > unlock_and_requeue:
> > + /*
> > + * We could return EAGAIN here to make reclaim rescan the inode tree in
> > + * a short while. However, this just burns CPU time scanning the tree
> > + * waiting for IO to complete and xfssyncd never goes back to the idle
> > + * state. Instead, return 0 to let the next scheduled background reclaim
> > + * attempt to reclaim the inode again.
> > + */
> > xfs_iflags_clear(ip, XFS_IRECLAIM);
> > xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > - return EAGAIN;
> > + return 0;
>
> This is an unrelated change and should be a patch of it's own.
Yup.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2010-01-08 10:43 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-01-06 23:05 [PATCH 0/2] Fix inode reclaim problems (hopefully) Dave Chinner
2010-01-06 23:05 ` [PATCH 1/2] xfs: reclaim inodes under a write lock Dave Chinner
2010-01-08 10:20 ` Christoph Hellwig
2010-01-06 23:05 ` [PATCH 2/2] xfs: reclaim all inodes by background tree walks Dave Chinner
2010-01-08 10:24 ` Christoph Hellwig
2010-01-08 10:43 ` Dave Chinner
2010-01-07 10:49 ` [PATCH 0/2] Fix inode reclaim problems (hopefully) Dave Chinner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox