* [PATCH 1/5] xfs: fix inode validity check in xfs_iflush_cluster
2016-04-06 9:22 [PATCH 0/5] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
@ 2016-04-06 9:22 ` Dave Chinner
2016-04-06 12:51 ` Brian Foster
2016-04-07 15:47 ` Christoph Hellwig
2016-04-06 9:22 ` [PATCH 2/5] xfs: skip stale inodes " Dave Chinner
` (4 subsequent siblings)
5 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2016-04-06 9:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Some careless idiot wrote crap code in commit 1a3e8f3 ("xfs: convert
inode cache lookups to use RCU locking") back in late 2010, and so
xfs_iflush_cluster checks the wrong inode for whether it is still
valid under RCU protection. Fix it to lock and check the correct
inode.
Part of the reason for the screwup was the unconventional naming of
the cluster inode variable - iq - so also rename all the cluster
inode variables to use more conventional prefixes to reduce
potential future confusion (cilist, cilist_size, cip).
Discovered-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_inode.c | 64 +++++++++++++++++++++++++++---------------------------
1 file changed, 32 insertions(+), 32 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f79ea59..2718d10 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3149,16 +3149,16 @@ out_release_wip:
STATIC int
xfs_iflush_cluster(
- xfs_inode_t *ip,
- xfs_buf_t *bp)
+ struct xfs_inode *ip,
+ struct xfs_buf *bp)
{
- xfs_mount_t *mp = ip->i_mount;
+ struct xfs_mount *mp = ip->i_mount;
struct xfs_perag *pag;
unsigned long first_index, mask;
unsigned long inodes_per_cluster;
- int ilist_size;
- xfs_inode_t **ilist;
- xfs_inode_t *iq;
+ int cilist_size;
+ struct xfs_inode **cilist;
+ struct xfs_inode *cip;
int nr_found;
int clcount = 0;
int bufwasdelwri;
@@ -3167,23 +3167,23 @@ xfs_iflush_cluster(
pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
inodes_per_cluster = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
- ilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
- ilist = kmem_alloc(ilist_size, KM_MAYFAIL|KM_NOFS);
- if (!ilist)
+ cilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
+ cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
+ if (!cilist)
goto out_put;
mask = ~(((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
rcu_read_lock();
/* really need a gang lookup range call here */
- nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
+ nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
first_index, inodes_per_cluster);
if (nr_found == 0)
goto out_free;
for (i = 0; i < nr_found; i++) {
- iq = ilist[i];
- if (iq == ip)
+ cip = cilist[i];
+ if (cip == ip)
continue;
/*
@@ -3192,20 +3192,20 @@ xfs_iflush_cluster(
* We need to check under the i_flags_lock for a valid inode
* here. Skip it if it is not valid or the wrong inode.
*/
- spin_lock(&ip->i_flags_lock);
- if (!ip->i_ino ||
- (XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index) {
- spin_unlock(&ip->i_flags_lock);
+ spin_lock(&cip->i_flags_lock);
+ if (!cip->i_ino ||
+ (XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+ spin_unlock(&cip->i_flags_lock);
continue;
}
- spin_unlock(&ip->i_flags_lock);
+ spin_unlock(&cip->i_flags_lock);
/*
* Do an un-protected check to see if the inode is dirty and
* is a candidate for flushing. These checks will be repeated
* later after the appropriate locks are acquired.
*/
- if (xfs_inode_clean(iq) && xfs_ipincount(iq) == 0)
+ if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
continue;
/*
@@ -3213,15 +3213,15 @@ xfs_iflush_cluster(
* then this inode cannot be flushed and is skipped.
*/
- if (!xfs_ilock_nowait(iq, XFS_ILOCK_SHARED))
+ if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
continue;
- if (!xfs_iflock_nowait(iq)) {
- xfs_iunlock(iq, XFS_ILOCK_SHARED);
+ if (!xfs_iflock_nowait(cip)) {
+ xfs_iunlock(cip, XFS_ILOCK_SHARED);
continue;
}
- if (xfs_ipincount(iq)) {
- xfs_ifunlock(iq);
- xfs_iunlock(iq, XFS_ILOCK_SHARED);
+ if (xfs_ipincount(cip)) {
+ xfs_ifunlock(cip);
+ xfs_iunlock(cip, XFS_ILOCK_SHARED);
continue;
}
@@ -3229,18 +3229,18 @@ xfs_iflush_cluster(
* arriving here means that this inode can be flushed. First
* re-check that it's dirty before flushing.
*/
- if (!xfs_inode_clean(iq)) {
+ if (!xfs_inode_clean(cip)) {
int error;
- error = xfs_iflush_int(iq, bp);
+ error = xfs_iflush_int(cip, bp);
if (error) {
- xfs_iunlock(iq, XFS_ILOCK_SHARED);
+ xfs_iunlock(cip, XFS_ILOCK_SHARED);
goto cluster_corrupt_out;
}
clcount++;
} else {
- xfs_ifunlock(iq);
+ xfs_ifunlock(cip);
}
- xfs_iunlock(iq, XFS_ILOCK_SHARED);
+ xfs_iunlock(cip, XFS_ILOCK_SHARED);
}
if (clcount) {
@@ -3250,7 +3250,7 @@ xfs_iflush_cluster(
out_free:
rcu_read_unlock();
- kmem_free(ilist);
+ kmem_free(cilist);
out_put:
xfs_perag_put(pag);
return 0;
@@ -3293,8 +3293,8 @@ cluster_corrupt_out:
/*
* Unlocks the flush lock
*/
- xfs_iflush_abort(iq, false);
- kmem_free(ilist);
+ xfs_iflush_abort(cip, false);
+ kmem_free(cilist);
xfs_perag_put(pag);
return -EFSCORRUPTED;
}
--
2.7.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 1/5] xfs: fix inode validity check in xfs_iflush_cluster
2016-04-06 9:22 ` [PATCH 1/5] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
@ 2016-04-06 12:51 ` Brian Foster
2016-04-07 15:47 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Brian Foster @ 2016-04-06 12:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Apr 06, 2016 at 07:22:50PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Some careless idiot wrote crap code in commit 1a3e8f3 ("xfs: convert
> inode cache lookups to use RCU locking") back in late 2010, and so
> xfs_iflush_cluster checks the wrong inode for whether it is still
> valid under RCU protection. Fix it to lock and check the correct
> inode.
>
> Part of the reason for the screwup was the unconventional naming of
> the cluster inode variable - iq - so also rename all the cluster
> inode variables to use more conventional prefixes to reduce
> potential future confusion (cilist, cilist_size, cip).
>
> Discovered-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Heh, the variable name changes are probably a good idea. It's very easy
to misread between iq/ip. I'm not sure how many times I read through
xfs_iflush_cluster() before I realized it wasn't doing what I thought it
was doing, but it was at least once or twice. ;)
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/xfs_inode.c | 64 +++++++++++++++++++++++++++---------------------------
> 1 file changed, 32 insertions(+), 32 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index f79ea59..2718d10 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3149,16 +3149,16 @@ out_release_wip:
>
> STATIC int
> xfs_iflush_cluster(
> - xfs_inode_t *ip,
> - xfs_buf_t *bp)
> + struct xfs_inode *ip,
> + struct xfs_buf *bp)
> {
> - xfs_mount_t *mp = ip->i_mount;
> + struct xfs_mount *mp = ip->i_mount;
> struct xfs_perag *pag;
> unsigned long first_index, mask;
> unsigned long inodes_per_cluster;
> - int ilist_size;
> - xfs_inode_t **ilist;
> - xfs_inode_t *iq;
> + int cilist_size;
> + struct xfs_inode **cilist;
> + struct xfs_inode *cip;
> int nr_found;
> int clcount = 0;
> int bufwasdelwri;
> @@ -3167,23 +3167,23 @@ xfs_iflush_cluster(
> pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
>
> inodes_per_cluster = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
> - ilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
> - ilist = kmem_alloc(ilist_size, KM_MAYFAIL|KM_NOFS);
> - if (!ilist)
> + cilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
> + cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
> + if (!cilist)
> goto out_put;
>
> mask = ~(((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
> first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> rcu_read_lock();
> /* really need a gang lookup range call here */
> - nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
> + nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
> first_index, inodes_per_cluster);
> if (nr_found == 0)
> goto out_free;
>
> for (i = 0; i < nr_found; i++) {
> - iq = ilist[i];
> - if (iq == ip)
> + cip = cilist[i];
> + if (cip == ip)
> continue;
>
> /*
> @@ -3192,20 +3192,20 @@ xfs_iflush_cluster(
> * We need to check under the i_flags_lock for a valid inode
> * here. Skip it if it is not valid or the wrong inode.
> */
> - spin_lock(&ip->i_flags_lock);
> - if (!ip->i_ino ||
> - (XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index) {
> - spin_unlock(&ip->i_flags_lock);
> + spin_lock(&cip->i_flags_lock);
> + if (!cip->i_ino ||
> + (XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> + spin_unlock(&cip->i_flags_lock);
> continue;
> }
> - spin_unlock(&ip->i_flags_lock);
> + spin_unlock(&cip->i_flags_lock);
>
> /*
> * Do an un-protected check to see if the inode is dirty and
> * is a candidate for flushing. These checks will be repeated
> * later after the appropriate locks are acquired.
> */
> - if (xfs_inode_clean(iq) && xfs_ipincount(iq) == 0)
> + if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
> continue;
>
> /*
> @@ -3213,15 +3213,15 @@ xfs_iflush_cluster(
> * then this inode cannot be flushed and is skipped.
> */
>
> - if (!xfs_ilock_nowait(iq, XFS_ILOCK_SHARED))
> + if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
> continue;
> - if (!xfs_iflock_nowait(iq)) {
> - xfs_iunlock(iq, XFS_ILOCK_SHARED);
> + if (!xfs_iflock_nowait(cip)) {
> + xfs_iunlock(cip, XFS_ILOCK_SHARED);
> continue;
> }
> - if (xfs_ipincount(iq)) {
> - xfs_ifunlock(iq);
> - xfs_iunlock(iq, XFS_ILOCK_SHARED);
> + if (xfs_ipincount(cip)) {
> + xfs_ifunlock(cip);
> + xfs_iunlock(cip, XFS_ILOCK_SHARED);
> continue;
> }
>
> @@ -3229,18 +3229,18 @@ xfs_iflush_cluster(
> * arriving here means that this inode can be flushed. First
> * re-check that it's dirty before flushing.
> */
> - if (!xfs_inode_clean(iq)) {
> + if (!xfs_inode_clean(cip)) {
> int error;
> - error = xfs_iflush_int(iq, bp);
> + error = xfs_iflush_int(cip, bp);
> if (error) {
> - xfs_iunlock(iq, XFS_ILOCK_SHARED);
> + xfs_iunlock(cip, XFS_ILOCK_SHARED);
> goto cluster_corrupt_out;
> }
> clcount++;
> } else {
> - xfs_ifunlock(iq);
> + xfs_ifunlock(cip);
> }
> - xfs_iunlock(iq, XFS_ILOCK_SHARED);
> + xfs_iunlock(cip, XFS_ILOCK_SHARED);
> }
>
> if (clcount) {
> @@ -3250,7 +3250,7 @@ xfs_iflush_cluster(
>
> out_free:
> rcu_read_unlock();
> - kmem_free(ilist);
> + kmem_free(cilist);
> out_put:
> xfs_perag_put(pag);
> return 0;
> @@ -3293,8 +3293,8 @@ cluster_corrupt_out:
> /*
> * Unlocks the flush lock
> */
> - xfs_iflush_abort(iq, false);
> - kmem_free(ilist);
> + xfs_iflush_abort(cip, false);
> + kmem_free(cilist);
> xfs_perag_put(pag);
> return -EFSCORRUPTED;
> }
> --
> 2.7.0
>
* Re: [PATCH 1/5] xfs: fix inode validity check in xfs_iflush_cluster
2016-04-06 9:22 ` [PATCH 1/5] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
2016-04-06 12:51 ` Brian Foster
@ 2016-04-07 15:47 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-04-07 15:47 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Apr 06, 2016 at 07:22:50PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Some careless idiot wrote crap code in commit 1a3e8f3 ("xfs: convert
> inode cache lookups to use RCU locking") back in late 2010, and so
> xfs_iflush_cluster checks the wrong inode for whether it is still
> valid under RCU protection. Fix it to lock and check the correct
> inode.
>
> Part of the reason for the screwup was the unconventional naming of
> the cluster inode variable - iq - so also rename all the cluster
> inode variables to use more conventional prefixes to reduce
> potential future confusion (cilist, cilist_size, cip).
Between all that renaming I didn't actually manage to find the fix :)
Can you split the rename and the fix into separate patches?
* [PATCH 2/5] xfs: skip stale inodes in xfs_iflush_cluster
2016-04-06 9:22 [PATCH 0/5] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
2016-04-06 9:22 ` [PATCH 1/5] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
@ 2016-04-06 9:22 ` Dave Chinner
2016-04-06 12:51 ` Brian Foster
2016-04-07 15:47 ` Christoph Hellwig
2016-04-06 9:22 ` [PATCH 3/5] xfs: xfs_iflush_cluster has range issues Dave Chinner
` (3 subsequent siblings)
5 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2016-04-06 9:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
We don't write back stale inodes, so we should skip them in
xfs_iflush_cluster, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2718d10..6598104 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3190,10 +3190,11 @@ xfs_iflush_cluster(
* because this is an RCU protected lookup, we could find a
* recently freed or even reallocated inode during the lookup.
* We need to check under the i_flags_lock for a valid inode
- * here. Skip it if it is not valid or the wrong inode.
+ * here. Skip it if it is not valid, stale or the wrong inode.
*/
spin_lock(&cip->i_flags_lock);
if (!cip->i_ino ||
+ __xfs_iflags_test(cip, XFS_ISTALE) ||
(XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
spin_unlock(&cip->i_flags_lock);
continue;
--
2.7.0
* Re: [PATCH 2/5] xfs: skip stale inodes in xfs_iflush_cluster
2016-04-06 9:22 ` [PATCH 2/5] xfs: skip stale inodes " Dave Chinner
@ 2016-04-06 12:51 ` Brian Foster
2016-04-07 15:47 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Brian Foster @ 2016-04-06 12:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Apr 06, 2016 at 07:22:51PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> We don't write back stale inodes, so we should skip them in
> xfS_iflush_cluster, too.
xfs_iflush_cluster
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/xfs_inode.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 2718d10..6598104 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3190,10 +3190,11 @@ xfs_iflush_cluster(
> * because this is an RCU protected lookup, we could find a
> * recently freed or even reallocated inode during the lookup.
> * We need to check under the i_flags_lock for a valid inode
> - * here. Skip it if it is not valid or the wrong inode.
> + * here. Skip it if it is not valid, stale or the wrong inode.
> */
> spin_lock(&cip->i_flags_lock);
> if (!cip->i_ino ||
> + __xfs_iflags_test(cip, XFS_ISTALE) ||
> (XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> spin_unlock(&cip->i_flags_lock);
> continue;
> --
> 2.7.0
>
* Re: [PATCH 2/5] xfs: skip stale inodes in xfs_iflush_cluster
2016-04-06 9:22 ` [PATCH 2/5] xfs: skip stale inodes " Dave Chinner
2016-04-06 12:51 ` Brian Foster
@ 2016-04-07 15:47 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-04-07 15:47 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Looks fine,
Reviewed-by: Christoph Hellwig <hch@lst.de>
* [PATCH 3/5] xfs: xfs_iflush_cluster has range issues
2016-04-06 9:22 [PATCH 0/5] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
2016-04-06 9:22 ` [PATCH 1/5] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
2016-04-06 9:22 ` [PATCH 2/5] xfs: skip stale inodes " Dave Chinner
@ 2016-04-06 9:22 ` Dave Chinner
2016-04-06 12:52 ` Brian Foster
2016-04-07 15:48 ` Christoph Hellwig
2016-04-06 9:22 ` [PATCH 4/5] xfs: xfs_inode_free() isn't RCU safe Dave Chinner
` (2 subsequent siblings)
5 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2016-04-06 9:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
it can find inodes beyond the current cluster if there is sparse
cache population. Gang lookups return results in ascending index
order, so stop trying to cluster inodes once the first inode outside
the cluster mask is detected.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_inode.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6598104..b984be4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3194,11 +3194,20 @@ xfs_iflush_cluster(
*/
spin_lock(&cip->i_flags_lock);
if (!cip->i_ino ||
- __xfs_iflags_test(cip, XFS_ISTALE) ||
- (XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+ __xfs_iflags_test(cip, XFS_ISTALE)) {
spin_unlock(&cip->i_flags_lock);
continue;
}
+
+ /*
+ * Once we fall off the end of the cluster, no point checking
+ * any more inodes in the list because they will also all be
+ * outside the cluster.
+ */
+ if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+ spin_unlock(&cip->i_flags_lock);
+ break;
+ }
spin_unlock(&cip->i_flags_lock);
/*
--
2.7.0
* Re: [PATCH 3/5] xfs: xfs_iflush_cluster has range issues
2016-04-06 9:22 ` [PATCH 3/5] xfs: xfs_iflush_cluster has range issues Dave Chinner
@ 2016-04-06 12:52 ` Brian Foster
2016-04-07 15:48 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Brian Foster @ 2016-04-06 12:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Apr 06, 2016 at 07:22:52PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
> it can find inode beyond the current cluster if there is sparse
inodes
> cache population. gang lookups return results in ascending index
> order, so stop trying to cluster iodes once the first inode outside
inodes
> the cluster mask is detected.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/xfs_inode.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 6598104..b984be4 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3194,11 +3194,20 @@ xfs_iflush_cluster(
> */
> spin_lock(&cip->i_flags_lock);
> if (!cip->i_ino ||
> - __xfs_iflags_test(cip, XFS_ISTALE) ||
> - (XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> + __xfs_iflags_test(cip, XFS_ISTALE)) {
> spin_unlock(&cip->i_flags_lock);
> continue;
> }
> +
> + /*
> + * Once we fall off the end of the cluster, no point checking
> + * any more inodes in the list because they will also all be
> + * outside the cluster.
> + */
> + if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> + spin_unlock(&cip->i_flags_lock);
> + break;
> + }
> spin_unlock(&cip->i_flags_lock);
>
> /*
> --
> 2.7.0
>
* Re: [PATCH 3/5] xfs: xfs_iflush_cluster has range issues
2016-04-06 9:22 ` [PATCH 3/5] xfs: xfs_iflush_cluster has range issues Dave Chinner
2016-04-06 12:52 ` Brian Foster
@ 2016-04-07 15:48 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-04-07 15:48 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
* [PATCH 4/5] xfs: xfs_inode_free() isn't RCU safe
2016-04-06 9:22 [PATCH 0/5] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
` (2 preceding siblings ...)
2016-04-06 9:22 ` [PATCH 3/5] xfs: xfs_iflush_cluster has range issues Dave Chinner
@ 2016-04-06 9:22 ` Dave Chinner
2016-04-07 15:50 ` Christoph Hellwig
2016-04-06 9:22 ` [PATCH 5/5] xfs: mark reclaimed inodes invalid earlier Dave Chinner
2016-04-06 12:52 ` [PATCH 0/5] xfs; xfs_iflush_cluster vs xfs_reclaim_inode Brian Foster
5 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2016-04-06 9:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The xfs_inode freed in xfs_inode_free() has multiple allocated
structures attached to it. We free these in xfs_inode_free() before
we mark the inode as invalid, and before we run call_rcu() to queue
the structure for freeing.
Unfortunately, this freeing can race with other accesses that are in
the RCU current grace period that have found the inode in the radix
tree with a valid state. This includes xfs_iflush_cluster(), which
calls xfs_inode_clean(), and that accesses the inode log item on the
xfs_inode.
The log item structure is freed in xfs_inode_free(), so there is the
possibility we can be accessing freed memory in xfs_iflush_cluster()
after validating the xfs_inode structure as being valid for this RCU
context. Hence we can get spuriously incorrect clean state returned
from such checks. This can lead to us thinking the inode is dirty
when it is, in fact, clean, and so incorrectly attaching it to the
buffer for IO and completion processing.
This then leads to use-after-free situations on the xfs_inode itself
if the IO completes after the current RCU grace period expires. The
buffer callbacks will access the xfs_inode and try to do all sorts
of things they shouldn't with freed memory.
IOWs, xfs_iflush_cluster() only works correctly when racing with
inode reclaim if the inode log item is present and correctly stating
the inode is clean. If the inode is being freed, then reclaim has
already made sure the inode is clean, and hence xfs_iflush_cluster
can skip it. However, we are accessing the inode under RCU
read lock protection and so also must ensure that all dynamically
allocated memory we reference in this context is not freed until the
RCU grace period expires.
To fix this, move all the potential memory freeing into
xfs_inode_free_callback() so that we guarantee RCU protected
lookup code will always have the memory structures it needs
available during the RCU grace period that lookup races can occur
in.
Discovered-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_icache.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index bf2d607..0c94cde 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -94,13 +94,6 @@ xfs_inode_free_callback(
struct inode *inode = container_of(head, struct inode, i_rcu);
struct xfs_inode *ip = XFS_I(inode);
- kmem_zone_free(xfs_inode_zone, ip);
-}
-
-void
-xfs_inode_free(
- struct xfs_inode *ip)
-{
switch (VFS_I(ip)->i_mode & S_IFMT) {
case S_IFREG:
case S_IFDIR:
@@ -118,6 +111,13 @@ xfs_inode_free(
ip->i_itemp = NULL;
}
+ kmem_zone_free(xfs_inode_zone, ip);
+}
+
+void
+xfs_inode_free(
+ struct xfs_inode *ip)
+{
/*
* Because we use RCU freeing we need to ensure the inode always
* appears to be reclaimed with an invalid inode number when in the
--
2.7.0
* [PATCH 5/5] xfs: mark reclaimed inodes invalid earlier
2016-04-06 9:22 [PATCH 0/5] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
` (3 preceding siblings ...)
2016-04-06 9:22 ` [PATCH 4/5] xfs: xfs_inode_free() isn't RCU safe Dave Chinner
@ 2016-04-06 9:22 ` Dave Chinner
2016-04-06 12:52 ` [PATCH 0/5] xfs; xfs_iflush_cluster vs xfs_reclaim_inode Brian Foster
5 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2016-04-06 9:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The last thing we do before using call_rcu() on an xfs_inode to be
freed is mark it as invalid. This means there is a window between
when we know for certain that the inode is going to be freed and
when we do actually mark it as "freed".
This is important in the context of RCU lookups - we can look up the
inode, find that it is valid, and then use it as such, not realising
that it is in the final stages of being freed.
As such, mark the inode as being invalid the moment we know it is
going to be reclaimed. This can be done while we still hold the
XFS_ILOCK_EXCL and the flush lock in xfs_reclaim_inode(), meaning that
it occurs well before we remove it from the radix tree, and that
the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
synchronisation points for detecting that an inode is about to go
away.
For defensive purposes, this allows us to add a further check to
xfs_iflush_cluster to ensure we skip inodes that are being freed
after we grab the XFS_ILOCK_SHARED and the flush lock - we know that
if the inode number is valid while we have these locks held, the inode
has not progressed through reclaim to the point where it is
clean and is about to be freed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_icache.c | 40 ++++++++++++++++++++++++++++++++--------
fs/xfs/xfs_inode.c | 14 +++++++++++++-
2 files changed, 45 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0c94cde..a60db43 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -114,6 +114,18 @@ xfs_inode_free_callback(
kmem_zone_free(xfs_inode_zone, ip);
}
+static void
+__xfs_inode_free(
+ struct xfs_inode *ip)
+{
+ /* asserts to verify all state is correct here */
+ ASSERT(atomic_read(&ip->i_pincount) == 0);
+ ASSERT(!xfs_isiflocked(ip));
+ XFS_STATS_DEC(ip->i_mount, vn_active);
+
+ call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
+}
+
void
xfs_inode_free(
struct xfs_inode *ip)
@@ -129,12 +141,7 @@ xfs_inode_free(
ip->i_ino = 0;
spin_unlock(&ip->i_flags_lock);
- /* asserts to verify all state is correct here */
- ASSERT(atomic_read(&ip->i_pincount) == 0);
- ASSERT(!xfs_isiflocked(ip));
- XFS_STATS_DEC(ip->i_mount, vn_active);
-
- call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
+ __xfs_inode_free(ip);
}
/*
@@ -929,6 +936,7 @@ xfs_reclaim_inode(
int sync_mode)
{
struct xfs_buf *bp = NULL;
+ xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */
int error;
restart:
@@ -993,6 +1001,22 @@ restart:
xfs_iflock(ip);
reclaim:
+ /*
+ * Because we use RCU freeing we need to ensure the inode always appears
+ * to be reclaimed with an invalid inode number when in the free state.
+ * We do this as early as possible under the ILOCK and flush lock so
+ * that xfs_iflush_cluster() can be guaranteed to detect races with us
+ * here. By doing this, we guarantee that once xfs_iflush_cluster has
+ * locked both the XFS_ILOCK and the flush lock that it will see either
+ * a valid, flushable inode that will serialise correctly against the
+ * locks below, or it will see a clean (and invalid) inode that it can
+ * skip.
+ */
+ spin_lock(&ip->i_flags_lock);
+ ip->i_flags = XFS_IRECLAIM;
+ ip->i_ino = 0;
+ spin_unlock(&ip->i_flags_lock);
+
xfs_ifunlock(ip);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1006,7 +1030,7 @@ reclaim:
*/
spin_lock(&pag->pag_ici_lock);
if (!radix_tree_delete(&pag->pag_ici_root,
- XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino)))
+ XFS_INO_TO_AGINO(ip->i_mount, ino)))
ASSERT(0);
__xfs_inode_clear_reclaim(pag, ip);
spin_unlock(&pag->pag_ici_lock);
@@ -1023,7 +1047,7 @@ reclaim:
xfs_qm_dqdetach(ip);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
- xfs_inode_free(ip);
+ __xfs_inode_free(ip);
return error;
out_ifunlock:
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b984be4..5b84bbc 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3222,7 +3222,6 @@ xfs_iflush_cluster(
* Try to get locks. If any are unavailable or it is pinned,
* then this inode cannot be flushed and is skipped.
*/
-
if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
continue;
if (!xfs_iflock_nowait(cip)) {
@@ -3235,6 +3234,19 @@ xfs_iflush_cluster(
continue;
}
+
+ /*
+ * Check the inode number again, just to be certain we are not
+ * racing with freeing in xfs_reclaim_inode(). See the comments
+ * in that function for more information as to why the initial
+ * check is not sufficient.
+ */
+ if (!cip->i_ino) {
+ xfs_ifunlock(cip);
+ xfs_iunlock(cip, XFS_ILOCK_SHARED);
+ continue;
+ }
+
/*
* arriving here means that this inode can be flushed. First
* re-check that it's dirty before flushing.
--
2.7.0
* Re: [PATCH 0/5] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
2016-04-06 9:22 [PATCH 0/5] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
` (4 preceding siblings ...)
2016-04-06 9:22 ` [PATCH 5/5] xfs: mark reclaimed inodes invalid earlier Dave Chinner
@ 2016-04-06 12:52 ` Brian Foster
5 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2016-04-06 12:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Apr 06, 2016 at 07:22:49PM +1000, Dave Chinner wrote:
> Hi folks,
>
> There is a problem that RHEL QE tripped over on a long-running
> fsstress test on a RHEL 6.6. Brian did all the hard work of working
> out the initial cause of the GPF that was being tripped, but
> had not yet got past how to fix the issues around xfs_inode_free().
> The code is the same as the current upstream code, so the problem
> still exists....
>
> The first patch fixes an obvious (now!) bug in xfs_iflush_cluster
> where it checks the wrong inode after lookup for validity. It still
> kind-of works, because the correct inode number is used for the "are
> we still in the right cluster" check, so it's not quite a hole the
> size of a truck. Still something that should not have slipped
> through 6 years ago and not been discovered until now...
>
> The most important patch (#4) addresses the use-after-free issues that the
> xfs inode has w.r.t. RCU freeing and the lookup that
> xfs_iflush_cluster is doing under the rcu_read_lock. All the
> structures accessed under the RCU context need to be freed after the
> current RCU grace period expires, as RCU lookups may attempt to
> access them at any time during the grace period. hence we have to
> move them into the RCU callback so that we don't free them
> prematurely.
>
> The rest of the patches are defensive in that they make
> xfs_iflush_cluster only act on relevant inodes and be able to
> guarantee detection of inodes that are in the process of being freed.
> While these aren't absolutely necessary, it seems silly to ignore
> these obvious issues while I'm fixing up other issues with the same
> code.
>
> There's some more detail on the fixes in the commit descriptions.
>
> Brian, I've only run this through xfstests, so I have no real idea
> if it fixes the problem fsstress has uncovered. AIUI it takes 3 or 4
> days to reproduce the issue so this is kind of a pre-emptive strike
> on what I think is the underlying issue based on your description
> and commentary. I figured having code to explain the problems would
> save some time while you sleep....
>
> Comments, thoughts, testing and flames all welcome...
>
Thanks.. it looks mostly clean and logical to me on a first pass
through. I've reviewed the first three, which look straightforward, and
I'll follow up on the last couple once I've run through testing and had
a closer look.
To migrate some of the discussion over to the list... what I'm testing
now is an XFS_IRECLAIM flag check (and inode skip) on the inode in the
cluster flush path. This has run clean for the better part of yesterday
through overnight. While the test requires a few days to verify, the
hacked up kernel I've been testing with usually detects this a bit
sooner (within 6-8 hours or so), so I'm fairly confident that test at
least corners the problem.
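For reference, the XFS_IRECLAIM check being described might look roughly like this kernel-style pseudocode (a sketch of the idea under test, not the posted patch; the placement and exact flag test are assumptions):

```
/* in the xfs_iflush_cluster() lookup loop, under cip->i_flags_lock */
spin_lock(&cip->i_flags_lock);
if (!cip->i_ino || __xfs_iflags_test(cip, XFS_IRECLAIM)) {
        spin_unlock(&cip->i_flags_lock);
        continue;       /* reclaim owns this inode; don't flush it */
}
spin_unlock(&cip->i_flags_lock);
```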
The concern raised with the XFS_IRECLAIM check is that of performance
(skipping an inode flush when we don't need to). I'm not totally
convinced that's much of a problem considering how rare this situation
is. The best/only reproducer we have takes a couple days on a debug
kernel, for a bug that's been present for several years. The concern I
had with this approach was more that both codepaths seemed to have
trylock or checks that cause the reclaim or flush to "back off," and I
didn't want to create a situation where neither path would actually do
its job (potentially repeatedly) for a particular inode.
So that approach seemed cleaner than the alternative I was considering,
but the thought of pushing the i_itemp deallocation to the RCU callback
also crossed my mind later on yesterday after writing up my last report
on the bug. This series appears to move all of the deallocation (incl.
fork destruction) to the RCU cb, but that seems reasonable to me as
well. I think relying on RCU is probably the cleanest/most correct
approach so far, so I'll kill the currently running test, pull these
back to the rhel kernel and see how that goes.
Note that I haven't even attempted to reproduce this on an upstream
kernel simply because it's so difficult to reproduce. It's not even
consistently reproducible with the original kernel on different
hardware. I agree that the problem likely exists upstream, but I'm going
to continue to use the original kernel for verification because that
hw/kernel has provided a consistent reproducer...
Brian
> -Dave.
>