* [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
@ 2006-10-04 9:20 Takenori Nagano
2006-10-06 3:26 ` David Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Takenori Nagano @ 2006-10-04 9:20 UTC (permalink / raw)
To: xfs
[-- Attachment #1: Type: text/plain, Size: 4558 bytes --]
Hi,
The patch attached to this mail fixes a race between xfs_iunpin()
and generic_delete_inode().
generic_delete_inode() checks inode->i_state with BUG_ON() after
clear_inode(); at that point inode->i_state must be I_CLEAR.
However, we observed that inode->i_state was not I_CLEAR after
clear_inode(), and a kernel panic occurred at
BUG_ON(inode->i_state != I_CLEAR). Analyzing the memory dump, we found
that both I_DIRTY_SYNC and I_CLEAR were set. The only function that
sets I_DIRTY_SYNC is __mark_inode_dirty(), so we captured the
following backtrace at the point where __mark_inode_dirty() runs with
inode->i_state == I_CLEAR:
> > Call Trace:
> > [<a000000100019980>] show_stack+0x80/0xa0
> > sp=e00000012c077970 bsp=e00000012c0713e8
> > [<a00000010003d1e0>] die+0x1c0/0x2e0
> > sp=e00000012c077b40 bsp=e00000012c0713b0
> > [<a00000010003e810>] ia64_bad_break+0x2f0/0x400
> > sp=e00000012c077b40 bsp=e00000012c071388
> > [<a000000100010180>] ia64_leave_kernel+0x0/0x260
> > sp=e00000012c077bd0 bsp=e00000012c071388
> > [<a0000001001c6170>] __mark_inode_dirty+0x390/0x3a0
> > sp=e00000012c077da0 bsp=e00000012c071330
> > [<a00000021c88a070>] xfs_iunpin+0x110/0x120 [xfs]
> > sp=e00000012c077da0 bsp=e00000012c071310
> > [<a00000021c892550>] xfs_inode_item_unpin+0x30/0x60 [xfs]
> > sp=e00000012c077da0 bsp=e00000012c0712f0
> > [<a00000021c8b1d00>] xfs_trans_chunk_committed+0x280/0x380 [xfs]
> > sp=e00000012c077da0 bsp=e00000012c071298
> > [<a00000021c8b1e80>] xfs_trans_committed+0x80/0x320 [xfs]
> > sp=e00000012c077da0 bsp=e00000012c071248
> > [<a00000021c898280>] xlog_state_do_callback+0x4a0/0xa20 [xfs]
> > sp=e00000012c077da0 bsp=e00000012c0711b0
> > [<a00000021c898990>] xlog_iodone+0x190/0x300 [xfs]
> > sp=e00000012c077da0 bsp=e00000012c071168
> > [<a00000021c8e6280>] pagebuf_iodone_work+0xc0/0x120 [xfs]
> > sp=e00000012c077da0 bsp=e00000012c071148
> > [<a0000001000d2f50>] worker_thread+0x3f0/0x5c0
> > sp=e00000012c077da0 bsp=e00000012c0710b0
> > [<a0000001000dd3e0>] kthread+0x220/0x280
> > sp=e00000012c077e10 bsp=e00000012c071068
> > [<a0000001000181a0>] kernel_thread_helper+0xe0/0x100
> > sp=e00000012c077e30 bsp=e00000012c071040
We found that __mark_inode_dirty() was called from xfs_iunpin().
xfs_iunpin() sets I_DIRTY_SYNC on inode->i_state when i_pincount
drops to 0. If __mark_inode_dirty() runs between the clear_inode()
and the BUG_ON() in generic_delete_inode(), the BUG_ON() fires. We
believe this is the cause of the bug.
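The interleaving can be sketched as a timeline (conceptual only, not
real kernel code; the right-hand side follows the backtrace above):

```
CPU A: generic_delete_inode()     CPU B: log I/O completion worker
-----------------------------     --------------------------------
clear_inode(inode)
  i_state = I_CLEAR
                                  xfs_trans_committed()
                                    xfs_iunpin(ip)      [pincount -> 0]
                                      __mark_inode_dirty(inode)
                                        i_state |= I_DIRTY_SYNC
BUG_ON(i_state != I_CLEAR)        <-- fires: I_CLEAR | I_DIRTY_SYNC
```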
All dirty buffers are invalidated by clear_inode(), but the in-core
log is not discarded, leaving the state inconsistent: the in-core log
is still written out by xfs_logd even after the inode has been
deleted. The root cause is that XFS does not take care of the in-core
log after deleting the inode.
xfs_fs_clear_inode() calls xfs_reclaim(). We think the recent fixes to
xfs_iunpin() were not correct. With those patches, xfs_iunpin() can
now determine whether the xfs_inode has been recycled, but that is not
a fundamental fix for this bug. xfs_iunpin() must never touch an
xfs_inode that has already been freed. If try_to_free_pages() reclaims
slabs containing a pinned inode, memory corruption can result.
We came up with three possible solutions:
1. xfs_fs_clear_inode() waits for i_pincount to become 0.
2. xfs_fs_clear_inode() syncs in-core log if i_pincount is not 0.
3. xfs_fs_clear_inode() invalidates in-core log that relates to the
deleted inode.
We chose option 2 because the frequency of the sync is about the same
as the frequency of the BUG(), and it is the same way xfs_fsync()
syncs the in-core log when the inode is still pinned. It has very
little effect on XFS performance.
This patch syncs the in-core log in xfs_fs_clear_inode() when
i_pincount is not 0, which we think is the essential fix.
We have already tested this patch for more than 100 hours on kernel
2.6.18. Without it, BUG() was hit within five minutes on a 32-way
Itanium server.
We used a test program that repeats open(), write() and unlink() in
parallel.
Best Regards,
--
Takenori Nagano, NEC
t-nagano@ah.jp.nec.com
[-- Attachment #2: xfs-fix-log-race.patch --]
[-- Type: text/plain, Size: 879 bytes --]
diff -Naru linux-2.6.18.orig/fs/xfs/linux-2.6/xfs_super.c linux-2.6.18/fs/xfs/linux-2.6/xfs_super.c
--- linux-2.6.18.orig/fs/xfs/linux-2.6/xfs_super.c 2006-09-20 12:42:06.000000000 +0900
+++ linux-2.6.18/fs/xfs/linux-2.6/xfs_super.c 2006-09-28 18:16:02.280000000 +0900
@@ -433,6 +433,7 @@
struct inode *inode)
{
bhv_vnode_t *vp = vn_from_inode(inode);
+ xfs_inode_t *ip;
vn_trace_entry(vp, __FUNCTION__, (inst_t *)__return_address);
@@ -452,10 +453,14 @@
vp->v_flag &= ~VMODIFIED;
VN_UNLOCK(vp, 0);
- if (VNHEAD(vp))
+ if (VNHEAD(vp)) {
+ ip = XFS_BHVTOI(VNHEAD(vp));
+ if (xfs_ipincount(ip))
+ xfs_log_force(ip->i_mount, (xfs_lsn_t)0,
+ XFS_LOG_FORCE | XFS_LOG_SYNC);
if (bhv_vop_reclaim(vp))
panic("%s: cannot reclaim 0x%p\n", __FUNCTION__, vp);
-
+ }
ASSERT(VNHEAD(vp) == NULL);
#ifdef XFS_VNODE_TRACE
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-04 9:20 [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode() Takenori Nagano
@ 2006-10-06 3:26 ` David Chinner
2006-10-11 6:43 ` David Chinner
0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2006-10-06 3:26 UTC (permalink / raw)
To: Takenori Nagano; +Cc: xfs
On Wed, Oct 04, 2006 at 06:20:14PM +0900, Takenori Nagano wrote:
> Hi,
>
> The patch attached to this mail fixes a race between xfs_iunpin()
> and generic_delete_inode().
Ahh, that problem, still. :/
> If __mark_inode_dirty() runs between the clear_inode() and the
> BUG_ON() in generic_delete_inode(), the BUG_ON() fires. We believe
> this is the cause of the bug.
*nod*
> xfs_fs_clear_inode() calls xfs_reclaim(). We think the recent fixes to
> xfs_iunpin() were not correct. With those patches, xfs_iunpin() can
> now determine whether the xfs_inode has been recycled, but that is not
> a fundamental fix for this bug.
Agreed - it was always a pretty ugly way to fix the problem.
> xfs_iunpin() must never touch an xfs_inode
> that has already been freed.
It must never touch the linux inode, not the xfs_inode....
> If try_to_free_pages() reclaims slabs
> containing a pinned inode, memory corruption can result.
It's the linux inode that gets used after it's been freed, not the
xfs_inode (which doesn't get freed until after it is unpinned
and the async reclaim is run).
> We came up with three possible solutions:
> 1. xfs_fs_clear_inode() waits for i_pincount to become 0.
> 2. xfs_fs_clear_inode() syncs in-core log if i_pincount is not 0.
> 3. xfs_fs_clear_inode() invalidates in-core log that relates to the
> deleted inode.
>
> We chose option 2 because the frequency of the sync is about the same
> as the frequency of the BUG(), and it is the same way xfs_fsync()
> syncs the in-core log when the inode is still pinned. It has very
> little effect on XFS performance.
Option 1 is the correct solution, and we already have a function for
doing exactly that: xfs_iunpin_wait(). It also forces the log in the
most optimal manner (i.e. only up to the LSN of the pinned log item)
before waiting for the pin count to fall to zero, so it covers
option 2 as well.
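The pin-count protocol under discussion can be modelled in a few lines
of userspace C. All names below are hypothetical stand-ins, not the
real XFS types; the point is only that the caller that drops the count
to zero is the one responsible for the wakeup, mirroring the
atomic_dec_and_test() in xfs_iunpin():

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the xfs_inode pin count. */
struct toy_inode {
	atomic_int pincount;
};

/* Pin the inode: its log items must stay in the in-core log. */
static inline void toy_ipin(struct toy_inode *ip)
{
	atomic_fetch_add(&ip->pincount, 1);
}

/*
 * Unpin the inode. Returns true only on the 1 -> 0 transition,
 * which is the point where the real xfs_iunpin() wakes anyone
 * sleeping in xfs_iunpin_wait() on i_ipin_wait.
 */
static inline bool toy_iunpin(struct toy_inode *ip)
{
	return atomic_fetch_sub(&ip->pincount, 1) == 1;
}
```

A waiter (the analogue of xfs_iunpin_wait()) would simply sleep until
pincount reads zero; only the last unpinner ever needs to issue the
wakeup.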
> This patch syncs the in-core log in xfs_fs_clear_inode() when
> i_pincount is not 0, which we think is the essential fix.
The fix should go into xfs_reclaim() rather than xfs_fs_clear_inode(),
as that is where the wait actually matters.
I think this is a much better way of fixing the problem, but it needs
a little tweaking. Also, it indicates that we can probably revert
some of the previous changes made in attempting to fix this bug.
I'll put together a new patch with this fix and as much of the
other fixes removed as possible and run some tests on it here.
It'll be a day or two before I have a tested patch ready....
Thanks for persisting with this, Takenori - this looks like
it will be a good, robust solution to the problem.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-06 3:26 ` David Chinner
@ 2006-10-11 6:43 ` David Chinner
2006-10-12 12:20 ` Takenori Nagano
0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2006-10-11 6:43 UTC (permalink / raw)
To: Takenori Nagano; +Cc: xfs
On Fri, Oct 06, 2006 at 01:26:17PM +1000, David Chinner wrote:
> I think this is a much better way of fixing the problem, but it needs
> a little tweaking. Also, it indicates that we can probably revert
> some of the previous changes made in attempting to fix this bug.
> I'll put together a new patch with this fix and as much of the
> other fixes removed as possible and run some tests on it here.
> It'll be a day or two before I have a tested patch ready....
I've run the attached patch through xfsqa but have not stress tested
it yet.
Takenori - can you give this a run through your tests to see if
it passes. I expect any races to trigger the BUG_ON statements
in xfs_iunpin().
This patch sits on top of iflags locking cleanup I posted here:
http://oss.sgi.com/archives/xfs/2006-10/msg00014.html
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
fs/xfs/xfs_inode.c | 59 ++++++++++++++++++--------------------------------
fs/xfs/xfs_inode.h | 1
fs/xfs/xfs_vnodeops.c | 12 +++++++++-
3 files changed, 34 insertions(+), 38 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2006-10-11 14:01:47.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2006-10-11 14:33:59.055638165 +1000
@@ -2728,9 +2728,16 @@ xfs_ipin(
}
/*
- * Decrement the pin count of the given inode, and wake up
- * anyone in xfs_iwait_unpin() if the count goes to 0. The
- * inode must have been previously pinned with a call to xfs_ipin().
+ * Decrement the pin count of the given inode, and wake up anyone in
+ * xfs_iunpin_wait() if the count goes to 0. The inode must have been
+ * previously pinned with a call to xfs_ipin().
+ *
+ * Note that xfs_reclaim() _must_ wait for all transactions to complete by
+ * calling xfs_iunpin_wait() before either reclaiming the linux inode or
+ * breaking the link between the xfs_inode and the xfs_vnode to prevent races
+ * and use-after-frees here in this code due to asynchronous log I/O
+ * completion. Hence we should never see the XFS_IRECLAIM* state,
+ * a NULL vnode or a linux inode with I_CLEAR set here.
*/
void
xfs_iunpin(
@@ -2739,41 +2746,19 @@ xfs_iunpin(
ASSERT(atomic_read(&ip->i_pincount) > 0);
if (atomic_dec_and_test(&ip->i_pincount)) {
- /*
- * If the inode is currently being reclaimed, the
- * linux inode _and_ the xfs vnode may have been
- * freed so we cannot reference either of them safely.
- * Hence we should not try to do anything to them
- * if the xfs inode is currently in the reclaim
- * path.
- *
- * However, we still need to issue the unpin wakeup
- * call as the inode reclaim may be blocked waiting for
- * the inode to become unpinned.
- */
- struct inode *inode = NULL;
+ bhv_vnode_t *vp = XFS_ITOV_NULL(ip);
+ struct inode *inode;
+
+ BUG_ON(xfs_iflags_test(ip, XFS_IRECLAIM|XFS_IRECLAIMABLE));
+ BUG_ON(vp == NULL);
+
+ /* make sync come back and flush this inode */
+ inode = vn_to_inode(vp);
+ BUG_ON(inode->i_state & I_CLEAR);
+ if (!(inode->i_state & (I_NEW|I_FREEING)))
+ mark_inode_dirty_sync(inode);
- spin_lock(&ip->i_flags_lock);
- if (!__xfs_iflags_test(ip, XFS_IRECLAIM|XFS_IRECLAIMABLE)) {
- bhv_vnode_t *vp = XFS_ITOV_NULL(ip);
-
- /* make sync come back and flush this inode */
- if (vp) {
- inode = vn_to_inode(vp);
-
- if (!(inode->i_state &
- (I_NEW|I_FREEING|I_CLEAR))) {
- inode = igrab(inode);
- if (inode)
- mark_inode_dirty_sync(inode);
- } else
- inode = NULL;
- }
- }
- spin_unlock(&ip->i_flags_lock);
wake_up(&ip->i_ipin_wait);
- if (inode)
- xfs_inode_iput(ip);
}
}
@@ -2784,7 +2769,7 @@ xfs_iunpin(
* be subsequently pinned once someone is waiting for it to be
* unpinned.
*/
-STATIC void
+void
xfs_iunpin_wait(
xfs_inode_t *ip)
{
Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2006-10-11 14:01:46.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2006-10-11 14:18:08.307190867 +1000
@@ -3817,7 +3817,17 @@ xfs_reclaim(
return 0;
}
+ /*
+ * We can't reclaim the inode until all I/O has completed, and we don't
+ * want to break the link between the vnode and xfs_inode until all log
+ * transactions have been written to disk. By waiting here we provide
+ * the guarantee to xfs_iunpin that the linux inode will always be
+ * referenceable because it won't be freed until after this wait and no
+ * new transactions can be issued on this inode now.
+ */
vn_iowait(vp);
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ xfs_iunpin_wait(ip);
ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
@@ -3834,12 +3844,12 @@ xfs_reclaim(
* itself.
*/
if (!ip->i_update_core && (ip->i_itemp == NULL)) {
- xfs_ilock(ip, XFS_ILOCK_EXCL);
xfs_iflock(ip);
return xfs_finish_reclaim(ip, 1, XFS_IFLUSH_DELWRI_ELSE_SYNC);
} else {
xfs_mount_t *mp = ip->i_mount;
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
/* Protect sync from us */
XFS_MOUNT_ILOCK(mp);
vn_bhv_remove(VN_BHV_HEAD(vp), XFS_ITOBHV(ip));
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.h 2006-10-11 14:01:46.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.h 2006-10-11 14:34:57.376058950 +1000
@@ -482,6 +482,7 @@ void xfs_iext_realloc(xfs_inode_t *, in
void xfs_iroot_realloc(xfs_inode_t *, int, int);
void xfs_ipin(xfs_inode_t *);
void xfs_iunpin(xfs_inode_t *);
+void xfs_iunpin_wait(xfs_inode_t *);
int xfs_iextents_copy(xfs_inode_t *, xfs_bmbt_rec_t *, int);
int xfs_iflush(xfs_inode_t *, uint);
void xfs_iflush_all(struct xfs_mount *);
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-11 6:43 ` David Chinner
@ 2006-10-12 12:20 ` Takenori Nagano
2006-10-13 1:46 ` David Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Takenori Nagano @ 2006-10-12 12:20 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
Hi David,
I tried those patches, but they caused degradation.
These are results of "vmstat 10" while my test program was running.
- Before applying those patches:
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
7 0 0 2768240 37632 210512 0 0 7 43367 268 676 1 49 50 0 0
9 0 0 2716352 37632 210672 0 0 0 362864 2154 47915 1 51 48 0 0
9 0 0 2663136 37664 210048 0 0 0 361745 2154 48258 1 50 49 0 0
10 0 0 2610688 37664 211184 0 0 0 360908 2152 48068 1 51 49 0 0
9 0 0 2557904 37680 210512 0 0 0 360254 2154 49036 1 51 48 0 0
10 0 0 2504832 37696 210304 0 0 0 362525 2153 48460 1 50 49 0 0
- After applying those patches:
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 15584608 21776 153072 0 0 69 403 256 394 1 3 95 1 0
0 0 0 15586032 21824 153024 0 0 1 2319 2161 2944 0 2 98 0 0
1 0 0 15585920 21824 153104 0 0 0 2342 2161 2951 0 2 98 0 0
0 0 0 15585696 21824 152976 0 0 0 2364 2160 2978 0 2 98 0 0
1 0 0 15585360 21824 153168 0 0 0 2380 2161 3027 0 2 98 0 0
0 0 0 15585248 21824 152976 0 0 0 2348 2161 2983 0 2 98 0 0
Block I/O performance degradation was very serious.
Now, I am trying to ease the degradation.
Do you have any idea for resolving the degradation?
By the way, I found some mistakes in your patch.
Please correct them.
> > Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
> > ===================================================================
> > --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c 2006-09-14 11:18:52.000000000
> > +1000
> > +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2006-09-14 12:01:04.648209950 +1000
> > @@ -625,7 +617,7 @@ xfs_iput_new(xfs_inode_t *ip,
> > vn_trace_entry(vp, "xfs_iput_new", (inst_t *)__return_address);
> >
> > if ((ip->i_d.di_mode == 0)) {
> > - ASSERT(!(ip->i_flags & XFS_IRECLAIMABLE));
> > + ASSERT(!xfs_iflags_test(ip, XFS_IRECLAIMABLE)) {
+ ASSERT(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
> > Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.h
> > ===================================================================
> > --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.h 2006-09-14 11:18:52.000000000
> > +1000
> > +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.h 2006-09-14 12:32:16.395321563 +1000
> > @@ -305,6 +305,47 @@ typedef struct xfs_inode {
> > #endif
> > } xfs_inode_t;
> >
> > +
> > +/*
> > + * i_flags helper functions
> > + */
> > +static inline void
> > +__xfs_iflags_set(xfs_inode_t *ip, unsigned short flags)
> > +{
> > + ip->i_flags |= flags;
> > +}
> > +
> > +static inline void
> > +xfs_iflags_set(xfs_inode_t *ip, unsigned short flags)
> > +{
> > + spin_lock(&ip->i_flags_lock);
> > + __xfs_iflag_set(ip, flags);
+ __xfs_iflags_set(ip, flags);
David Chinner wrote:
> On Fri, Oct 06, 2006 at 01:26:17PM +1000, David Chinner wrote:
>> I think this is a much better way of fixing the problem, but it needs
>> a little tweaking. Also, it indicates that we can probably revert
>> some of the previous changes made in attempting to fix this bug.
>> I'll put together a new patch with this fix and as much of the
>> other fixes removed as possible and run some tests on it here.
>> It'll be a day or two before I have a tested patch ready....
>
> I've run the attached patch through xfsqa but have not stress tested
> it yet.
>
> Takenori - can you give this a run through your tests to see if
> it passes. I expect any races to trigger the BUG_ON statements
> in xfs_iunpin().
>
> This patch sits on top of iflags locking cleanup I posted here:
>
> http://oss.sgi.com/archives/xfs/2006-10/msg00014.html
>
> Cheers,
>
> Dave.
Best Regards,
--
Takenori Nagano, NEC
t-nagano@ah.jp.nec.com
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-12 12:20 ` Takenori Nagano
@ 2006-10-13 1:46 ` David Chinner
2006-10-13 8:06 ` Timothy Shimmin
2006-10-13 12:17 ` Takenori Nagano
0 siblings, 2 replies; 14+ messages in thread
From: David Chinner @ 2006-10-13 1:46 UTC (permalink / raw)
To: Takenori Nagano; +Cc: xfs
On Thu, Oct 12, 2006 at 09:20:15PM +0900, Takenori Nagano wrote:
> Hi David,
>
> I tried those patches, but they caused degradation.
> These are results of "vmstat 10" while my test program was running.
>
> - Before applying those patches:
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 7 0 0 2768240 37632 210512 0 0 7 43367 268 676 1 49 50 0 0
> 9 0 0 2716352 37632 210672 0 0 0 362864 2154 47915 1 51 48 0 0
> 9 0 0 2663136 37664 210048 0 0 0 361745 2154 48258 1 50 49 0 0
> 10 0 0 2610688 37664 211184 0 0 0 360908 2152 48068 1 51 49 0 0
> 9 0 0 2557904 37680 210512 0 0 0 360254 2154 49036 1 51 48 0 0
> 10 0 0 2504832 37696 210304 0 0 0 362525 2153 48460 1 50 49 0 0
>
>
> - After applying those patches:
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 0 0 0 15584608 21776 153072 0 0 69 403 256 394 1 3 95 1 0
> 0 0 0 15586032 21824 153024 0 0 1 2319 2161 2944 0 2 98 0 0
> 1 0 0 15585920 21824 153104 0 0 0 2342 2161 2951 0 2 98 0 0
> 0 0 0 15585696 21824 152976 0 0 0 2364 2160 2978 0 2 98 0 0
> 1 0 0 15585360 21824 153168 0 0 0 2380 2161 3027 0 2 98 0 0
> 0 0 0 15585248 21824 152976 0 0 0 2348 2161 2983 0 2 98 0 0
>
>
> Block I/O performance degradation was very serious.
That was unexpected. :/
> Now, I am trying to ease the degradation.
> Do you have any idea for resolving the degradation?
Did you see a degradation with your original fix? I suspect
not.
I think that the lsn-based flush is tripping over a sync
transaction "optimisation" - if the previous log buffer needs
syncing or is currently being synced, then we don't try to flush the
active log buffer straight away - we wait for the previous log
buffer to complete its I/O in the hope that we get more
transactions into the current log buffer.
IOWs, we introduce a pipeline bubble where we don't force the
current log buffer until the I/O on the previous log buffer has
completed, and this effectively serialises these log forces. I
suspect that this is not needed anymore, but I'll look
into this separately.
When flushing the entire log (using 0 as the lsn, as your
original patch did), we simply close off the current buffer
and flush it out, waiting on its completion if we need a sync
flush. IOWs, no pipeline bubble is introduced and we continue
to issue concurrent log I/O.
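The difference between the two force paths can be sketched in
pseudocode (a paraphrase of the behaviour described above, not the
actual xlog code):

```
force_log(lsn != 0):                 # LSN-based force (xfs_iunpin_wait)
    if previous log buffer is being synced:
        wait for its I/O to complete     # <-- pipeline bubble; forces
    flush the buffer containing lsn      #     serialise behind prior I/O

force_log(lsn == 0):                 # whole-log force (original patch)
    close off the current log buffer
    issue its I/O immediately            # no wait on the previous buffer;
                                         # concurrent log I/O continues
```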
Can you test this patch (on top of the last patch I sent)
and see if it fixes the degradation?
Regards,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
fs/xfs/xfs_inode.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2006-10-11 17:09:12.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2006-10-13 11:43:38.236562541 +1000
@@ -2773,26 +2773,16 @@ void
xfs_iunpin_wait(
xfs_inode_t *ip)
{
- xfs_inode_log_item_t *iip;
- xfs_lsn_t lsn;
-
ASSERT(ismrlocked(&ip->i_lock, MR_UPDATE | MR_ACCESS));
if (atomic_read(&ip->i_pincount) == 0) {
return;
}
- iip = ip->i_itemp;
- if (iip && iip->ili_last_lsn) {
- lsn = iip->ili_last_lsn;
- } else {
- lsn = (xfs_lsn_t)0;
- }
-
/*
* Give the log a push so we don't wait here too long.
*/
- xfs_log_force(ip->i_mount, lsn, XFS_LOG_FORCE);
+ xfs_log_force(ip->i_mount, 0, XFS_LOG_FORCE);
wait_event(ip->i_ipin_wait, (atomic_read(&ip->i_pincount) == 0));
}
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-13 1:46 ` David Chinner
@ 2006-10-13 8:06 ` Timothy Shimmin
2006-10-13 12:17 ` Takenori Nagano
1 sibling, 0 replies; 14+ messages in thread
From: Timothy Shimmin @ 2006-10-13 8:06 UTC (permalink / raw)
To: David Chinner, Takenori Nagano; +Cc: xfs
--On 13 October 2006 11:46:51 AM +1000 David Chinner <dgc@sgi.com> wrote:
>> Now, I am trying to ease the degradation.
>> Do you have any idea for resolving the degradation?
>
> Did you see a degradation with your original fix? I suspect
> not.
>
> I think that the lsn based flush is tripping over a sync
> transaction "optimisation" - if the previous log buffer needs
> syncing or is currently being synced, then we don't try to flush the
> active log buffer straight away - we wait for the previous log
> buffer to complete its I/O in the hope that we get more
> transactions into the current log buffer.
You're referring to the code in xlog_state_sync() for the
XLOG_STATE_ACTIVE case with the big header comment.
It looks to be the one place where we sleep (sv_wait) even
when we haven't got XFS_LOG_SYNC set (which we don't have here AFAICT).
In this case, we wait on the iclog previous to the given
lsn's iclog.
--Tim
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-13 1:46 ` David Chinner
2006-10-13 8:06 ` Timothy Shimmin
@ 2006-10-13 12:17 ` Takenori Nagano
2006-10-17 2:02 ` David Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Takenori Nagano @ 2006-10-13 12:17 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
Hi David,
David Chinner wrote:
>> Block I/O performance degradation was very serious.
>
> That was unexpected. :/
>
>> Now, I am trying to ease the degradation.
>> Do you have any idea for resolving the degradation?
>
> Did you see a degradation with your original fix? I suspect
> not.
No, I didn't see any degradation with my patch.
But my patch is not perfect: xfs_log_force() does not guarantee
that the log has been written by the time it returns.
>
> Can you test this patch (on top of the last patch I sent)
> and see if it fixes the degradation?
I tried your patch, but it seems the degradation was not resolved.
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 15585968 21856 151696 0 0 73 273 258 298 1 3 94 2 0
0 0 0 15585968 21856 151840 0 0 0 2270 2156 2693 0 2 98 0 0
1 0 0 15585744 21856 151920 0 0 0 2362 2161 2797 0 2 98 0 0
0 0 0 15585408 21856 151824 0 0 0 2291 2156 2732 0 2 98 0 0
0 0 0 15584848 21856 152288 0 0 0 2300 2156 2765 0 2 98 0 0
2 0 0 15584848 21856 151952 0 0 0 2346 2161 2809 0 2 98 0 0
Ummmm, I think schedule() is being called many times from wait_event().
Best Regards,
--
Takenori Nagano, NEC
t-nagano@ah.jp.nec.com
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-13 12:17 ` Takenori Nagano
@ 2006-10-17 2:02 ` David Chinner
2006-10-18 2:33 ` David Chinner
0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2006-10-17 2:02 UTC (permalink / raw)
To: Takenori Nagano; +Cc: xfs
On Fri, Oct 13, 2006 at 09:17:01PM +0900, Takenori Nagano wrote:
> Hi David,
>
> David Chinner wrote:
> >> Block I/O performance degradation was very serious.
> >
> > That was unexpected. :/
> >
> >> Now, I am trying to ease the degradation.
> >> Do you have any idea for resolving the degradation?
> >
> > Did you see a degradation with your original fix? I suspect
> > not.
>
> No, I didn't see any degradation with my patch.
> But my patch is not perfect.
It still needs the iput() in xfs_iunpin() to be pushed off to an
external thread, because we can deadlock in xfslogd:
Stack traceback for pid 123
0xe00000b9edda0000 123 19 0 0 D 0xe00000b9edda02f0 xfslogd/0
0xa0000001008352c0 schedule+0xf00
0xa000000100838650 schedule_timeout+0x110
0xa0000001003d55a0 xlog_state_sync_all+0x380
0xa0000001003d5d90 _xfs_log_force+0x210
0xa0000001004200f0 xfs_fs_clear_inode+0x2b0
0xa0000001001c9a20 clear_inode+0x200
0xa0000001001c9f60 generic_delete_inode+0x300
0xa0000001001ca320 generic_drop_inode+0x300
0xa0000001001c8e80 iput+0x180
0xa0000001003c99b0 xfs_iunpin+0x190
0xa0000001003cdb40 xfs_inode_item_unpin+0x20
0xa0000001003ee4c0 xfs_trans_chunk_committed+0x280
0xa0000001003ee730 xfs_trans_committed+0xd0
0xa0000001003d3e80 xlog_state_do_callback+0x520
0xa0000001003d4420 xlog_iodone+0x160
0xa0000001004274c0 xfs_buf_iodone_work+0x60
0xa0000001000eb640 run_workqueue+0x180
0xa0000001000eda00 worker_thread+0x260
0xa0000001000f6340 kthread+0x260
0xa0000001000121d0 kernel_thread_helper+0xd0
0xa0000001000094c0 start_kernel_thread+0x20
The patch I sent does not deadlock because it removed the igrab/iput
in xfs_iunpin().
In my testing the performance penalty is identical for the patch you
wrote and the one I wrote. In both cases performance is limited by
the maximum number of log forces that can be issued. This results in
about a 70% degradation in single-threaded sequential deletes (from
about 7,500/s to 2,500/s)....
So, unconditional log forces are not the solution here - the code is
neat, but the performance tradeoff is unacceptable. IOWs, to
maintain performance we cannot do an unconditional log force in the
->clear_inode() path.
Hmmm.....
In doing the previous patch that removed the igrab/iput, I used the
log force to provide synchronisation that prevented the unpin from
ever seeing an invalid state and hence we couldn't ever get a
use-after-free situation. What I failed to see was that we already
have this mechanism - the i_flags_lock and the XFS_IRECLAIM* flags.
If we synchronise the setting of either of the XFS_IRECLAIM* flags
with the breakage of the bhv_vnode<->xfs_inode link, then we can
never get the state in xfs_iunpin() where the link has been broken
and the XFS_IRECLAIM* flags are not set. The current usage of
the i_flags_lock in xfs_iunpin is sufficient to provide this
guarantee, but we are breaking the link before setting the
XFS_IRECLAIMABLE flag in xfs_reclaim()....
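The invariant being relied on here can be illustrated with a minimal
userspace sketch (hypothetical names; in the kernel both sides run
under ip->i_flags_lock, which this sketch omits): because reclaim sets
the flag before breaking the link, a checker that tests the flag first
can never observe a broken link while the flag is clear.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_IRECLAIMABLE 0x1	/* stand-in for XFS_IRECLAIMABLE */

/* Hypothetical stand-ins for the xfs_inode fields involved. */
struct toy_inode {
	unsigned int flags;	/* i_flags, protected by i_flags_lock */
	void *vnode;		/* the bhv_vnode<->xfs_inode link */
};

/*
 * Reclaim side: set the flag *before* breaking the link. In the
 * kernel, both steps sit in one i_flags_lock critical section.
 */
static void toy_reclaim(struct toy_inode *ip)
{
	ip->flags |= TOY_IRECLAIMABLE;
	ip->vnode = NULL;
}

/*
 * Unpin side: only touch the linux inode when the flag is clear.
 * Returns true if mark_inode_dirty_sync() would be safe to call.
 */
static bool toy_iunpin_safe(const struct toy_inode *ip)
{
	if (ip->flags & TOY_IRECLAIMABLE)
		return false;		/* being reclaimed: hands off */
	return ip->vnode != NULL;	/* link intact, inode valid */
}
```

With the old ordering (link broken first, flag set later) there is a
window where toy_iunpin_safe() would have to dereference a NULL link;
setting the flag first closes that window.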
So, here's another patch that doesn't have the performance problems,
but removes the iput/igrab while still (I think) fixing the
use-after-free problem. Can you try this one out, Takenori? I've
run it through some stress tests and haven't been able to trigger
any problems.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
fs/xfs/xfs_inode.c | 44 +++++++++++++++++++++-----------------------
fs/xfs/xfs_vnodeops.c | 21 ++++++++++++++-------
2 files changed, 35 insertions(+), 30 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2006-10-16 15:55:18.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2006-10-17 10:03:19.586174311 +1000
@@ -2739,41 +2739,39 @@ xfs_iunpin(
ASSERT(atomic_read(&ip->i_pincount) > 0);
if (atomic_dec_and_test(&ip->i_pincount)) {
+
/*
- * If the inode is currently being reclaimed, the
- * linux inode _and_ the xfs vnode may have been
- * freed so we cannot reference either of them safely.
- * Hence we should not try to do anything to them
- * if the xfs inode is currently in the reclaim
- * path.
+ * If the inode is currently being reclaimed, the link between
+ * the bhv_vnode and the xfs_inode will be broken after the
+ * XFS_IRECLAIM* flag is set. Hence, if these flags are not
+ * set, then we can move forward and mark the linux inode dirty
+ * knowing that it is still valid as it won't be freed until after
+ * the bhv_vnode<->xfs_inode link is broken in xfs_reclaim. The
+ * i_flags_lock is used to synchronise the setting of the
+ * XFS_IRECLAIM* flags and the breaking of the link, and so we
+ * can execute atomically w.r.t to reclaim by holding this lock
+ * here.
*
- * However, we still need to issue the unpin wakeup
- * call as the inode reclaim may be blocked waiting for
- * the inode to become unpinned.
+ * However, we still need to issue the unpin wakeup call as the
+ * inode reclaim may be blocked waiting for the inode to become
+ * unpinned.
*/
- struct inode *inode = NULL;
spin_lock(&ip->i_flags_lock);
if (!__xfs_iflags_test(ip, XFS_IRECLAIM|XFS_IRECLAIMABLE)) {
bhv_vnode_t *vp = XFS_ITOV_NULL(ip);
+ struct inode *inode = NULL;
- /* make sync come back and flush this inode */
- if (vp) {
- inode = vn_to_inode(vp);
+ BUG_ON(vp == NULL);
+ inode = vn_to_inode(vp);
+ BUG_ON(inode->i_state & I_CLEAR);
- if (!(inode->i_state &
- (I_NEW|I_FREEING|I_CLEAR))) {
- inode = igrab(inode);
- if (inode)
- mark_inode_dirty_sync(inode);
- } else
- inode = NULL;
- }
+ /* make sync come back and flush this inode */
+ if (!(inode->i_state & (I_NEW|I_FREEING)))
+ mark_inode_dirty_sync(inode);
}
spin_unlock(&ip->i_flags_lock);
wake_up(&ip->i_ipin_wait);
- if (inode)
- iput(inode);
}
}
Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2006-10-16 15:55:18.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2006-10-17 10:27:38.447315865 +1000
@@ -3827,11 +3827,16 @@ xfs_reclaim(
*/
xfs_synchronize_atime(ip);
- /* If we have nothing to flush with this inode then complete the
- * teardown now, otherwise break the link between the xfs inode
- * and the linux inode and clean up the xfs inode later. This
- * avoids flushing the inode to disk during the delete operation
- * itself.
+ /*
+ * If we have nothing to flush with this inode then complete the
+ * teardown now, otherwise break the link between the xfs inode and the
+ * linux inode and clean up the xfs inode later. This avoids flushing
+ * the inode to disk during the delete operation itself.
+ *
+ * When breaking the link, we need to set the XFS_IRECLAIMABLE flag
+ * first to ensure that xfs_iunpin() will never see an xfs inode
+ * that has a linux inode being reclaimed. Synchronisation is provided
+ * by the i_flags_lock.
*/
if (!ip->i_update_core && (ip->i_itemp == NULL)) {
xfs_ilock(ip, XFS_ILOCK_EXCL);
@@ -3840,11 +3845,13 @@ xfs_reclaim(
} else {
xfs_mount_t *mp = ip->i_mount;
- /* Protect sync from us */
+ /* Protect sync and unpin from us */
XFS_MOUNT_ILOCK(mp);
+ spin_lock(&ip->i_flags_lock);
+ __xfs_iflags_set(ip, XFS_IRECLAIMABLE);
vn_bhv_remove(VN_BHV_HEAD(vp), XFS_ITOBHV(ip));
+ spin_unlock(&ip->i_flags_lock);
list_add_tail(&ip->i_reclaim, &mp->m_del_inodes);
- xfs_iflags_set(ip, XFS_IRECLAIMABLE);
XFS_MOUNT_IUNLOCK(mp);
}
return 0;
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-17 2:02 ` David Chinner
@ 2006-10-18 2:33 ` David Chinner
2006-10-18 9:07 ` David Chinner
0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2006-10-18 2:33 UTC (permalink / raw)
To: xfs; +Cc: Takenori Nagano
On Tue, Oct 17, 2006 at 12:02:18PM +1000, David Chinner wrote:
> In doing the previous patch that removed the igrab/iput, I used the
> log force to provide synchronisation that prevented the unpin from
> ever seeing an invalid state and hence we couldn't ever get a
> use-after-free situation. What I failed to see was that we already
> have this mechanism - the i_flags_lock and the XFS_IRECLAIM* flags.
>
> If we synchronise the setting of either of the XFS_IRECLAIM* flags
> with the breakage of the bhv_vnode<->xfs_inode link, then we can
> never get the state in xfs_iunpin() where the link has been broken
> and the XFS_IRECLAIM* flags are not set. The current usage of
> the i_flags_lock in xfs_iunpin is sufficient to provide this
> guarantee, but we are breaking the link before setting the
> XFS_IRECLAIMABLE flag in xfs_reclaim()....
>
> So, here's another patch that doesn't have the performance problems,
> but removes the iput/igrab while still (I think) fixing the use
> after free problem. Can you try this one out, Takenori? I've
> run it through some stress tests and haven't been able to trigger
> problems.
I just hit the BUG_ON(vp == NULL) that I put in xfs_iunpin()
in this patch. The xfs inode had no link to the bhv_vnode, nor
did it have either XFS_IRECLAIM* flag set, and hence it triggered
the BUG.
The problem appears to be a race with a lookup on an inode that
is getting deleted - xfs_iget_core() finds the xfs_inode in the
cache with the XFS_IRECLAIMABLE flag set, so it removes that flag.
It then removes the inode from the reclaim list. Then it checks to
see if the inode has been unlinked, and if the create flag is not
set we return ENOENT.
Hence if we have transactions still to be written to disk on this
inode, when xfs_iunpin finally gets called there is no reclaim flag
set, so it assumes that there's a vnode associated with the xfs inode
and we go kaboom.
I think this is a pre-existing bug in xfs_iget_core() that can
result in a memory leak because xfs_iget_core() removes it from
the reclaim list and then forgets about it...
The following patch sits on top of the others - it may not apply
because the tree I just pulled it from has the radix tree
inode cache patches applied earlier in the series.
Comments?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
fs/xfs/xfs_iget.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c 2006-10-18 11:27:04.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2006-10-18 12:20:20.748107559 +1000
@@ -164,6 +164,34 @@ again:
goto again;
}
+
+ /*
+ * If IRECLAIMABLE is set on this inode and lookup is
+ * racing with unlink, then we should return an error
+ * immediately so we don't remove it from the reclaim
+ * list and potentially leak the inode.
+ *
+ * Also, there may be transactions sitting in the
+ * incore log buffers or being flushed to disk at this
+ * time. We can't clear the XFS_IRECLAIMABLE flag
+ * until these transactions have hit the disk,
+ * otherwise we will void the guarantee the flag
+ * provides to xfs_iunpin().
+ */
+ if (xfs_iflags_test(ip, XFS_IRECLAIMABLE)) {
+ if ((ip->i_d.di_mode == 0) &&
+ !(flags & XFS_IGET_CREATE)) {
+ read_unlock(&ih->ih_lock);
+ return ENOENT;
+ }
+ if (xfs_ipincount(ip)) {
+ read_unlock(&ih->ih_lock);
+ xfs_log_force(mp, 0,
+ XFS_LOG_FORCE|XFS_LOG_SYNC);
+ XFS_STATS_INC(xs_ig_frecycle);
+ goto again;
+ }
+ }
vn_trace_exit(vp, "xfs_iget.alloc",
(inst_t *)__return_address);
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-18 2:33 ` David Chinner
@ 2006-10-18 9:07 ` David Chinner
2006-10-19 2:23 ` Takenori Nagano
0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2006-10-18 9:07 UTC (permalink / raw)
To: Takenori Nagano; +Cc: xfs
On Wed, Oct 18, 2006 at 12:33:25PM +1000, David Chinner wrote:
> On Tue, Oct 17, 2006 at 12:02:18PM +1000, David Chinner wrote:
> > So, here's another patch that doesn't have the performance problems,
> > but removes the iput/igrab while still (I think) fixing the use
> > after free problem. Can you try this one out, Takenori? I've
> > run it through some stress tests and haven't been able to trigger
> > problems.
>
> I just hit the BUG_ON(vp == NULL) that I put in xfs_iunpin()
> in this patch. The xfs inode had no link to the bhv_vnode, nor
> did it have either XFS_IRECLAIM* flag set, and hence it triggered
> the BUG.
And again. The xfs_iget_core change is valid - there's still a
race in xfs_iunpin (how many of them can we find?):
xfs_iunpin                             xfs_iget_core
if (atomic_dec_and_test(pincount))
                                       if (vp == NULL)
                                       if (IRECLAIMABLE)
                                       if (pincount)
                                           force + restart
                                       .....
                                       clear IRECLAIMABLE
spin_lock(i_flags_lock)
if (!IRECLAIMABLE)
BUG_ON(vp == NULL)
So the solution is this:
---
fs/xfs/xfs_inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2006-10-18 11:27:04.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2006-10-18 16:45:12.658102093 +1000
@@ -2738,7 +2738,7 @@ xfs_iunpin(
{
ASSERT(atomic_read(&ip->i_pincount) > 0);
- if (atomic_dec_and_test(&ip->i_pincount)) {
+ if (atomic_dec_and_lock(&ip->i_pincount, &ip->i_flags_lock)) {
/*
* If the inode is currently being reclaimed, the link between
@@ -2757,7 +2757,6 @@ xfs_iunpin(
* unpinned.
*/
- spin_lock(&ip->i_flags_lock);
if (!__xfs_iflags_test(ip, XFS_IRECLAIM|XFS_IRECLAIMABLE)) {
bhv_vnode_t *vp = XFS_ITOV_NULL(ip);
struct inode *inode = NULL;
I'm running stress tests on this now - if it survives until morning
I'll send out a new set of patches for testing...
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-18 9:07 ` David Chinner
@ 2006-10-19 2:23 ` Takenori Nagano
2006-10-19 4:58 ` David Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Takenori Nagano @ 2006-10-19 2:23 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
[-- Attachment #1: Type: text/plain, Size: 2645 bytes --]
Hi David,
I'm testing your three patches.
I am not seeing any degradation with your patches.
But I think the patch I have attached to this mail is necessary.
Isn't it?
David Chinner wrote:
> On Wed, Oct 18, 2006 at 12:33:25PM +1000, David Chinner wrote:
>> On Tue, Oct 17, 2006 at 12:02:18PM +1000, David Chinner wrote:
>>> So, here's another patch that doesn't have the performance problems,
>>> but removes the iput/igrab while still (I think) fixing the use
>>> after free problem. Can you try this one out, Takenori? I've
>>> run it through some stress tests and haven't been able to trigger
>>> problems.
>> I just hit the BUG_ON(vp == NULL) that I put in xfs_iunpin()
>> in this patch. The xfs inode had no link to the bhv_vnode, nor
>> did it have either XFS_IRECLAIM* flag set, and hence it triggered
>> the BUG.
>
> And again. The xfs_iget_core change is valid - there's still a
> race in xfs_iunpin (how many of them can we find?):
>
> xfs_iunpin                             xfs_iget_core
> if (atomic_dec_and_test(pincount))
>                                        if (vp == NULL)
>                                        if (IRECLAIMABLE)
>                                        if (pincount)
>                                            force + restart
>                                        .....
>                                        clear IRECLAIMABLE
>
> spin_lock(i_flags_lock)
> if (!IRECLAIMABLE)
> BUG_ON(vp == NULL)
>
>
> So the solution is this:
>
> ---
> fs/xfs/xfs_inode.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2006-10-18 11:27:04.000000000 +1000
> +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2006-10-18 16:45:12.658102093 +1000
> @@ -2738,7 +2738,7 @@ xfs_iunpin(
> {
> ASSERT(atomic_read(&ip->i_pincount) > 0);
>
> - if (atomic_dec_and_test(&ip->i_pincount)) {
> + if (atomic_dec_and_lock(&ip->i_pincount, &ip->i_flags_lock)) {
>
> /*
> * If the inode is currently being reclaimed, the link between
> @@ -2757,7 +2757,6 @@ xfs_iunpin(
> * unpinned.
> */
>
> - spin_lock(&ip->i_flags_lock);
> if (!__xfs_iflags_test(ip, XFS_IRECLAIM|XFS_IRECLAIMABLE)) {
> bhv_vnode_t *vp = XFS_ITOV_NULL(ip);
> struct inode *inode = NULL;
>
> I'm running stress tests on this now - if it survives until morning
> I'll send out a new set of patches for testing...
>
> Cheers,
>
> Dave.
--
+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+
NEC Computer Software Operations Unit
OSS Promotion Center, Technology Development Group
Takenori Nagano (永野 武則)
TEL: 8-23-57270 (MyLine)  042-333-5383 (outside line)
e-mail:t-nagano@ah.jp.nec.com
+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+
[-- Attachment #2: xfs_idestroy.patch --]
[-- Type: text/plain, Size: 1143 bytes --]
diff -Naru linux-2.6.19-rc1/fs/xfs.orig/xfs_inode.c linux-2.6.19-rc1/fs/xfs/xfs_inode.c
--- linux-2.6.19-rc1/fs/xfs.orig/xfs_inode.c 2006-10-19 01:51:43.020000000 +0900
+++ linux-2.6.19-rc1/fs/xfs/xfs_inode.c 2006-10-19 01:53:47.248000000 +0900
@@ -2713,6 +2713,7 @@
XFS_FORCED_SHUTDOWN(ip->i_mount));
xfs_inode_item_destroy(ip);
}
+ xfs_iunpin_wait(ip);
kmem_zone_free(xfs_inode_zone, ip);
}
@@ -2784,7 +2785,7 @@
* be subsequently pinned once someone is waiting for it to be
* unpinned.
*/
-STATIC void
+void
xfs_iunpin_wait(
xfs_inode_t *ip)
{
diff -Naru linux-2.6.19-rc1/fs/xfs.orig/xfs_inode.h linux-2.6.19-rc1/fs/xfs/xfs_inode.h
--- linux-2.6.19-rc1/fs/xfs.orig/xfs_inode.h 2006-10-19 01:51:42.980000000 +0900
+++ linux-2.6.19-rc1/fs/xfs/xfs_inode.h 2006-10-19 01:52:17.980000000 +0900
@@ -498,6 +498,7 @@
void xfs_iroot_realloc(xfs_inode_t *, int, int);
void xfs_ipin(xfs_inode_t *);
void xfs_iunpin(xfs_inode_t *);
+void xfs_iunpin_wait(xfs_inode_t *);
int xfs_iextents_copy(xfs_inode_t *, xfs_bmbt_rec_t *, int);
int xfs_iflush(xfs_inode_t *, uint);
void xfs_iflush_all(struct xfs_mount *);
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-19 2:23 ` Takenori Nagano
@ 2006-10-19 4:58 ` David Chinner
2006-10-20 4:25 ` Takenori Nagano
0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2006-10-19 4:58 UTC (permalink / raw)
To: Takenori Nagano; +Cc: xfs
On Thu, Oct 19, 2006 at 11:23:02AM +0900, Takenori Nagano wrote:
> Hi David,
>
> I'm testing your three patches.
> I am not seeing any degradation with your patches.
That is good to hear ;)
> But I think the patch that I attach to this mail is necessary.
> Isn't it?
I don't think so - in the lookup code where we find an existing
inode, we don't destroy the inode if XFS_IRECLAIMABLE is set.
Instead we do a log force and repeat the lookup. We only destroy
the inode in xfs_iget_core() if we raced with another thread
reading the inode in off disk after the cache lookup has
failed. In this case, we free the inode we read off disk which,
by definition, cannot be dirty or pinned at this point so we
don't need to wait for anything to be unpinned.
In the case of reclaim, when we flush a dirty inode we already
do a xfs_iunpin_wait() (xfs_finish_reclaim()->xfs_iflush()->wait)
so we should never get to the point of xfs_idestroy with an inode
that is still pinned.
Hence I don't think this patch is necessary. Did I miss something
that I shouldn't have, Takenori?
FYI, the three patches have survived my testing for almost a day now,
so if they pass your testing I think we have a viable fix. I'll
send out a set of updated patches later this afternoon.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-19 4:58 ` David Chinner
@ 2006-10-20 4:25 ` Takenori Nagano
2006-10-23 6:53 ` Takenori Nagano
0 siblings, 1 reply; 14+ messages in thread
From: Takenori Nagano @ 2006-10-20 4:25 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
Hi,
David Chinner wrote:
> I don't think so - in the lookup code where we find an existing
> inode, we don't destroy the inode if XFS_IRECLAIMABLE is set.
> Instead we do a log force and repeat the lookup. We only destroy
> the inode in xfs_iget_core() if we raced with another thread
> reading the inode in off disk after the cache lookup has
> failed. In this case, we free the inode we read off disk which,
> by definition, cannot be dirty or pinned at this point so we
> don't need to wait for anything to be unpinned.
>
> In the case of reclaim, when we flush a dirty inode we already
> do a xfs_iunpin_wait() (xfs_finish_reclaim()->xfs_iflush()->wait)
> so we should never get to the point of xfs_idestroy with an inode
> that is still pinned.
>
> Hence I don't think this is patch is necessary. Did I miss something
> that I shouldn't have, Takenori?
Sorry, you are right. I forgot that xfs_iget_core() was modified so that it
doesn't reuse an xfs_inode while i_pincount > 0.
>
> FYI, the three patches have survived my testing for almost a day now,
> so if they pass your testing I think we have a viable fix. I'll
> sned out a set of updated patches later this afternoon.
Your patches have been working well for 20 hours. I intend to continue
testing your patches until next Monday, and I'll report the result.
Best Regards,
--
Takenori Nagano
<t-nagano@ah.jp.nec.com>
* Re: [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode().
2006-10-20 4:25 ` Takenori Nagano
@ 2006-10-23 6:53 ` Takenori Nagano
0 siblings, 0 replies; 14+ messages in thread
From: Takenori Nagano @ 2006-10-23 6:53 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
Hi David,
Your patches are surviving my testing. I have checked results of vmstat and
/var/log/messages, but I can't see any errors or degradation. I think they are
a viable fix, too.
Takenori Nagano wrote:
> Hi,
>
> David Chinner wrote:
>> I don't think so - in the lookup code where we find an existing
>> inode, we don't destroy the inode if XFS_IRECLAIMABLE is set.
>> Instead we do a log force and repeat the lookup. We only destroy
>> the inode in xfs_iget_core() if we raced with another thread
>> reading the inode in off disk after the cache lookup has
>> failed. In this case, we free the inode we read off disk which,
>> by definition, cannot be dirty or pinned at this point so we
>> don't need to wait for anything to be unpinned.
>>
>> In the case of reclaim, when we flush a dirty inode we already
>> do a xfs_iunpin_wait() (xfs_finish_reclaim()->xfs_iflush()->wait)
>> so we should never get to the point of xfs_idestroy with an inode
>> that is still pinned.
>>
>> Hence I don't think this is patch is necessary. Did I miss something
>> that I shouldn't have, Takenori?
>
> Sorry, you are right. I forgot that xfs_iget_core() was modified so that it
> doesn't reuse an xfs_inode while i_pincount > 0.
>
>> FYI, the three patches have survived my testing for almost a day now,
>> so if they pass your testing I think we have a viable fix. I'll
> send out a set of updated patches later this afternoon.
>
> Your patches have been working well for 20 hours. I intend to continue
> testing your patches until next Monday, and I'll report the result.
>
> Best Regards,
--
Takenori Nagano <t-nagano@ah.jp.nec.com>
end of thread, other threads:[~2006-10-23 6:54 UTC | newest]
Thread overview: 14+ messages
2006-10-04 9:20 [patch] Fix xfs_iunpin() sets I_DIRTY_SYNC after clear_inode() Takenori Nagano
2006-10-06 3:26 ` David Chinner
2006-10-11 6:43 ` David Chinner
2006-10-12 12:20 ` Takenori Nagano
2006-10-13 1:46 ` David Chinner
2006-10-13 8:06 ` Timothy Shimmin
2006-10-13 12:17 ` Takenori Nagano
2006-10-17 2:02 ` David Chinner
2006-10-18 2:33 ` David Chinner
2006-10-18 9:07 ` David Chinner
2006-10-19 2:23 ` Takenori Nagano
2006-10-19 4:58 ` David Chinner
2006-10-20 4:25 ` Takenori Nagano
2006-10-23 6:53 ` Takenori Nagano