public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
@ 2005-10-18  8:26 Andrea Arcangeli
  2005-10-19  0:13 ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2005-10-18  8:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Chris Mason

Hello,

@@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
 			list_move(&inode->i_list, &inode_in_use);
 		} else {
 			list_move(&inode->i_list, &inode_unused);
+			inodes_stat.nr_unused++;
 		}
 	}
 	wake_up_inode(inode);

Are you sure the above diff is correct? It was added somewhere between
2.6.5 and 2.6.8. I think it's wrong.

The only way I can imagine the i_count to be zero in the above path, is
that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
increase nr_unused. So I believe the above change is buggy and it will
definitely overstate the number of unused inodes and it should be backed
out.

Note that __writeback_single_inode before calling __sync_single_inode, can
drop the spinlock and we can have both the dirty and locked bitflags
clear here:

		spin_unlock(&inode_lock);
		__wait_on_inode(inode);
		iput(inode);
XXXXXXX
		spin_lock(&inode_lock);
	}
	use inode again here

a construct like the above makes zero sense from a reference counting
standpoint.

Either we don't ever use the inode again after the iput, or the
inode_lock should be taken _before_ executing the iput (i.e. a __iput
would be required). Taking the inode_lock after iput means the iget was
useless if we keep using the inode after the iput.

So the only chance the 2.6 was safe to call __writeback_single_inode
with the i_count == 0, is that I_WILL_FREE is set (I_WILL_FREE will
prevent the VM to free the inode in XXXXX).

Potentially calling the above iput with I_WILL_FREE was also wrong
because it would recurse in iput_final (the second mainline bug).

The below (untested) patch fixes the nr_unused accounting, avoids
recursing in iput when I_WILL_FREE is set and makes sure (with the
BUG_ON) that we don't corrupt memory and that all holders that don't set
I_WILL_FREE, keeps a reference on the inode!

Comments welcome, thanks.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

 fs-writeback.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
--- linux-2.6/fs/fs-writeback.c.~1~	2005-07-28 17:08:53.000000000 +0200
+++ linux-2.6/fs/fs-writeback.c	2005-10-17 15:43:53.000000000 +0200
@@ -230,7 +230,6 @@ __sync_single_inode(struct inode *inode,
 			 * The inode is clean, unused
 			 */
 			list_move(&inode->i_list, &inode_unused);
-			inodes_stat.nr_unused++;
 		}
 	}
 	wake_up_inode(inode);
@@ -246,6 +245,8 @@ __writeback_single_inode(struct inode *i
 {
 	wait_queue_head_t *wqh;
 
+	BUG_ON(!atomic_read(&inode->i_count) ^ !!(inode->i_state & I_WILL_FREE));
+
 	if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
 		list_move(&inode->i_list, &inode->i_sb->s_dirty);
 		return 0;
@@ -259,11 +260,9 @@ __writeback_single_inode(struct inode *i
 
 		wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
 		do {
-			__iget(inode);
 			spin_unlock(&inode_lock);
 			__wait_on_bit(wqh, &wq, inode_wait,
 							TASK_UNINTERRUPTIBLE);
-			iput(inode);
 			spin_lock(&inode_lock);
 		} while (inode->i_state & I_LOCK);
 	}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-18  8:26 [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set Andrea Arcangeli
@ 2005-10-19  0:13 ` Andrew Morton
  2005-10-19  0:40   ` Chris Mason
  2005-10-19  7:47   ` Andrea Arcangeli
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2005-10-19  0:13 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, mason

Andrea Arcangeli <andrea@suse.de> wrote:
>
> Hello,
> 
> @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
>  			list_move(&inode->i_list, &inode_in_use);
>  		} else {
>  			list_move(&inode->i_list, &inode_unused);
> +			inodes_stat.nr_unused++;
>  		}
>  	}
>  	wake_up_inode(inode);
> 
> Are you sure the above diff is correct? It was added somewhere between
> 2.6.5 and 2.6.8. I think it's wrong.
> 
> The only way I can imagine the i_count to be zero in the above path, is
> that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
> increase nr_unused. So I believe the above change is buggy and it will
> definitely overstate the number of unused inodes and it should be backed
> out.

Well according to my assertion (below), the inode in __sync_single_inode()
cannot have a zero refcount, so the whole if() statement is never executed.

The thinking behind that increment is that __sync_single_inode() has just
taken a dirty, zero-refcount inode and has cleaned it.  A dirty inode
cannot have previously been on inode_unused, hence we now are newly moving
it to inode_unused.

I'll stick a WARN_ON in there for now, wait and see if anyone can hit it.

> Note that __writeback_single_inode before calling __sync_single_inode, can
> drop the spinlock and we can have both the dirty and locked bitflags
> clear here:
> 
> 		spin_unlock(&inode_lock);
> 		__wait_on_inode(inode);
> 		iput(inode);
> XXXXXXX
> 		spin_lock(&inode_lock);
> 	}
> 	use inode again here
> 
> a construct like the above makes zero sense from a reference counting
> standpoint.
> 
> Either we don't ever use the inode again after the iput, or the
> inode_lock should be taken _before_ executing the iput (i.e. a __iput
> would be required). Taking the inode_lock after iput means the iget was
> useless if we keep using the inode after the iput.
> 
> So the only chance the 2.6 was safe to call __writeback_single_inode
> with the i_count == 0, is that I_WILL_FREE is set (I_WILL_FREE will
> prevent the VM to free the inode in XXXXX).
> 
> Potentially calling the above iput with I_WILL_FREE was also wrong
> because it would recurse in iput_final (the second mainline bug).
>
> The below (untested) patch fixes the nr_unused accounting, avoids
> recursing in iput when I_WILL_FREE is set and makes sure (with the
> BUG_ON) that we don't corrupt memory and that all holders that don't set
> I_WILL_FREE, keeps a reference on the inode!
> 

That's something which Bill snuck in there during some waitqueue rework.

I agree that the iget/iput is unneeded: all callers to
__writeback_single_inode() already have a ref on the inode: either via
sync_sb_inodes()'s iget() or via syscall(fd, ...).

So the BUG_ON() in __writeback_single_inode() becomes
BUG_ON(!atomic_read(&inode->i_count)); - I'll make it WARN_ON for now coz
I'm not so sure any more ;)

So...  With updated comments:


diff -puN fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set fs/fs-writeback.c
--- 25/fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set	Tue Oct 18 17:07:49 2005
+++ 25-akpm/fs/fs-writeback.c	Tue Oct 18 17:12:53 2005
@@ -229,8 +229,8 @@ __sync_single_inode(struct inode *inode,
 			/*
 			 * The inode is clean, unused
 			 */
+			WARN_ON(1);
 			list_move(&inode->i_list, &inode_unused);
-			inodes_stat.nr_unused++;
 		}
 	}
 	wake_up_inode(inode);
@@ -238,14 +238,16 @@ __sync_single_inode(struct inode *inode,
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_lock.
+ * Write out an inode's dirty pages.  Called under inode_lock.  The caller has
+ * ref on the inode (either via __iget or via syscall against an fd).
  */
 static int
-__writeback_single_inode(struct inode *inode,
-			struct writeback_control *wbc)
+__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	wait_queue_head_t *wqh;
 
+	WARN_ON(!atomic_read(&inode->i_count));
+
 	if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
 		list_move(&inode->i_list, &inode->i_sb->s_dirty);
 		return 0;
@@ -259,11 +261,9 @@ __writeback_single_inode(struct inode *i
 
 		wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
 		do {
-			__iget(inode);
 			spin_unlock(&inode_lock);
 			__wait_on_bit(wqh, &wq, inode_wait,
 							TASK_UNINTERRUPTIBLE);
-			iput(inode);
 			spin_lock(&inode_lock);
 		} while (inode->i_state & I_LOCK);
 	}
@@ -541,14 +541,15 @@ void sync_inodes(int wait)
 }
 
 /**
- *	write_inode_now	-	write an inode to disk
- *	@inode: inode to write to disk
- *	@sync: whether the write should be synchronous or not
+ * write_inode_now	-	write an inode to disk
+ * @inode: inode to write to disk
+ * @sync: whether the write should be synchronous or not
+ *
+ * This function commits an inode to disk immediately if it is dirty. This is
+ * primarily needed by knfsd.
  *
- *	This function commits an inode to disk immediately if it is
- *	dirty. This is primarily needed by knfsd.
+ * The caller must have a ref on the inode.
  */
- 
 int write_inode_now(struct inode *inode, int sync)
 {
 	int ret;
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-19  0:13 ` Andrew Morton
@ 2005-10-19  0:40   ` Chris Mason
  2005-10-19  1:15     ` Andrew Morton
  2005-10-19  7:47   ` Andrea Arcangeli
  1 sibling, 1 reply; 10+ messages in thread
From: Chris Mason @ 2005-10-19  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-kernel

On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > Hello,
> > 
> > @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
> >  			list_move(&inode->i_list, &inode_in_use);
> >  		} else {
> >  			list_move(&inode->i_list, &inode_unused);
> > +			inodes_stat.nr_unused++;
> >  		}
> >  	}
> >  	wake_up_inode(inode);
> > 
> > Are you sure the above diff is correct? It was added somewhere between
> > 2.6.5 and 2.6.8. I think it's wrong.
> > 
> > The only way I can imagine the i_count to be zero in the above path, is
> > that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
> > increase nr_unused. So I believe the above change is buggy and it will
> > definitely overstate the number of unused inodes and it should be backed
> > out.
> 
> Well according to my assertion (below), the inode in __sync_single_inode()
> cannot have a zero refcount, so the whole if() statement is never executed.

generic_forget_inode->write_inode_now->__writeback_single_inode->
__sync_single_inode

We do have I_WILL_FREE, but i_count will be zero.

> 
> The thinking behind that increment is that __sync_single_inode() has just
> taken a dirty, zero-refcount inode and has cleaned it.  A dirty inode
> cannot have previously been on inode_unused, hence we now are newly moving
> it to inode_unused.

nr_unused doesn't seem to count the number of inodes on the unused list.
It is actually counting the number of inodes whose i_count is 0.  See
generic_forget_inode and invalidate_list to see what I mean.

generic_forget_inode took care of incrementing the unused count when
i_count went to zero. So, I don't think we need to worry about the
unused count in __writeback_single_inode.

-chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-19  0:40   ` Chris Mason
@ 2005-10-19  1:15     ` Andrew Morton
  2005-10-19  1:58       ` Chris Mason
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2005-10-19  1:15 UTC (permalink / raw)
  To: Chris Mason; +Cc: andrea, linux-kernel

Chris Mason <mason@suse.com> wrote:
>
> On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > Hello,
> > > 
> > > @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
> > >  			list_move(&inode->i_list, &inode_in_use);
> > >  		} else {
> > >  			list_move(&inode->i_list, &inode_unused);
> > > +			inodes_stat.nr_unused++;
> > >  		}
> > >  	}
> > >  	wake_up_inode(inode);
> > > 
> > > Are you sure the above diff is correct? It was added somewhere between
> > > 2.6.5 and 2.6.8. I think it's wrong.
> > > 
> > > The only way I can imagine the i_count to be zero in the above path, is
> > > that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
> > > increase nr_unused. So I believe the above change is buggy and it will
> > > definitely overstate the number of unused inodes and it should be backed
> > > out.
> > 
> > Well according to my assertion (below), the inode in __sync_single_inode()
> > cannot have a zero refcount, so the whole if() statement is never executed.
> 
> generic_forget_inode->write_inode_now->__writeback_single_inode->
> __sync_single_inode

oshit.

> We do have I_WILL_FREE, but i_count will be zero.

yup.

> > 
> > The thinking behind that increment is that __sync_single_inode() has just
> > taken a dirty, zero-refcount inode and has cleaned it.  A dirty inode
> > cannot have previously been on inode_unused, hence we now are newly moving
> > it to inode_unused.
> 
> nr_unused doesn't seem to count the number of inodes on the unused list.
> It is actually counting the number of inodes whose i_count is 0.  See
> generic_forget_inode and invalidate_list to see what I mean.

hm, OK.  It'd be nice to make that more explicit.  Something like this?

--- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup	2005-10-18 18:13:22.000000000 -0700
+++ devel-akpm/fs/inode.c	2005-10-18 18:13:57.000000000 -0700
@@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct 
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_LOCK)))
 			list_move(&inode->i_list, &inode_unused);
-		inodes_stat.nr_unused++;
 		if (!sb || (sb->s_flags & MS_ACTIVE)) {
+			inodes_stat.nr_unused++;  /* One more 0-ref inode */
 			spin_unlock(&inode_lock);
 			return;
 		}
@@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct 
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
 		inode->i_state &= ~I_WILL_FREE;
-		inodes_stat.nr_unused--;
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
_

> generic_forget_inode took care of incrementing the unused count when
> i_count went to zero. So, I don't think we need to worry about the
> unused count in __writeback_single_inode.
> 

How about this for now?


diff -puN fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set fs/fs-writeback.c
--- devel/fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set	2005-10-18 18:02:51.000000000 -0700
+++ devel-akpm/fs/fs-writeback.c	2005-10-18 18:05:22.000000000 -0700
@@ -229,8 +229,8 @@ __sync_single_inode(struct inode *inode,
 			/*
 			 * The inode is clean, unused
 			 */
+			WARN_ON(1);
 			list_move(&inode->i_list, &inode_unused);
-			inodes_stat.nr_unused++;
 		}
 	}
 	wake_up_inode(inode);
@@ -238,14 +238,20 @@ __sync_single_inode(struct inode *inode,
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_lock.
+ * Write out an inode's dirty pages.  Called under inode_lock.  Either the
+ * caller has ref on the inode (either via __iget or via syscall against an fd)
+ * or the inode has I_WILL_FREE set (via generic_forget_inode)
  */
 static int
-__writeback_single_inode(struct inode *inode,
-			struct writeback_control *wbc)
+__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	wait_queue_head_t *wqh;
 
+	if (!atomic_read(&inode->i_count))
+		WARN_ON(!(inode->i_flags & I_WILL_FREE));
+	else
+		WARN_ON(inode->i_flags & I_WILL_FREE);
+
 	if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
 		list_move(&inode->i_list, &inode->i_sb->s_dirty);
 		return 0;
@@ -259,11 +265,9 @@ __writeback_single_inode(struct inode *i
 
 		wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
 		do {
-			__iget(inode);
 			spin_unlock(&inode_lock);
 			__wait_on_bit(wqh, &wq, inode_wait,
 							TASK_UNINTERRUPTIBLE);
-			iput(inode);
 			spin_lock(&inode_lock);
 		} while (inode->i_state & I_LOCK);
 	}
@@ -541,14 +545,15 @@ void sync_inodes(int wait)
 }
 
 /**
- *	write_inode_now	-	write an inode to disk
- *	@inode: inode to write to disk
- *	@sync: whether the write should be synchronous or not
+ * write_inode_now	-	write an inode to disk
+ * @inode: inode to write to disk
+ * @sync: whether the write should be synchronous or not
+ *
+ * This function commits an inode to disk immediately if it is dirty. This is
+ * primarily needed by knfsd.
  *
- *	This function commits an inode to disk immediately if it is
- *	dirty. This is primarily needed by knfsd.
+ * The caller must either have a ref on the inode or must have set I_WILL_FREE.
  */
- 
 int write_inode_now(struct inode *inode, int sync)
 {
 	int ret;
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-19  1:15     ` Andrew Morton
@ 2005-10-19  1:58       ` Chris Mason
  2005-10-19  2:26         ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2005-10-19  1:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, linux-kernel

On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote:
> > > Well according to my assertion (below), the inode in __sync_single_inode()
> > > cannot have a zero refcount, so the whole if() statement is never executed.
> > 
> > generic_forget_inode->write_inode_now->__writeback_single_inode->
> > __sync_single_inode
> 
> oshit.

When does this ever happen?  Just for private inodes released during
put_super right?

> 
> > We do have I_WILL_FREE, but i_count will be zero.
> 
> yup.
> 
> > > 
> > > The thinking behind that increment is that __sync_single_inode() has just
> > > taken a dirty, zero-refcount inode and has cleaned it.  A dirty inode
> > > cannot have previously been on inode_unused, hence we now are newly moving
> > > it to inode_unused.
> > 
> > nr_unused doesn't seem to count the number of inodes on the unused list.
> > It is actually counting the number of inodes whose i_count is 0.  See
> > generic_forget_inode and invalidate_list to see what I mean.
> 
> hm, OK.  It'd be nice to make that more explicit.  Something like this?

Well, I can't quite convince myself it is wrong, but when 
(!sb || (sb->s_flags & MS_ACTIVE), we're dropping the
inode_lock with an inode with i_count == 0 and nr_unused hasn't been
incremented.

So, if someone (sync_sb_inodes?) comes in and runs __iget,
the counts end up wrong.  Then again, whoever ran __iget would also run
iput and things would go horribly wrong anyway.

Did I mention the part where Andrea and I are hunting a bug where the
count of unused inodes goes negative and the everyone ends up spinning
in shrink_icache_memory?  Andrea's patch doesn't fix the spinning, but
it might have fixed the unused inode count going negative.  We're
waiting for another reproduce on the ppc64 race monster.

> 
> --- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup	2005-10-18 18:13:22.000000000 -0700
> +++ devel-akpm/fs/inode.c	2005-10-18 18:13:57.000000000 -0700
> @@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct 
>  	if (!hlist_unhashed(&inode->i_hash)) {
>  		if (!(inode->i_state & (I_DIRTY|I_LOCK)))
>  			list_move(&inode->i_list, &inode_unused);
> -		inodes_stat.nr_unused++;
>  		if (!sb || (sb->s_flags & MS_ACTIVE)) {
> +			inodes_stat.nr_unused++;  /* One more 0-ref inode */
>  			spin_unlock(&inode_lock);
>  			return;
>  		}
> @@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct 
>  		write_inode_now(inode, 1);
>  		spin_lock(&inode_lock);
>  		inode->i_state &= ~I_WILL_FREE;
> -		inodes_stat.nr_unused--;
>  		hlist_del_init(&inode->i_hash);
>  	}
>  	list_del_init(&inode->i_list);
> _
> 
> > generic_forget_inode took care of incrementing the unused count when
> > i_count went to zero. So, I don't think we need to worry about the
> > unused count in __writeback_single_inode.
> > 
> 
> How about this for now?

This part looks good.

-chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-19  1:58       ` Chris Mason
@ 2005-10-19  2:26         ` Andrew Morton
  2005-10-19  2:58           ` Chris Mason
  2005-10-25  2:21           ` Chris Mason
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2005-10-19  2:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: andrea, linux-kernel

Chris Mason <mason@suse.com> wrote:
>
> On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote:
> > > > Well according to my assertion (below), the inode in __sync_single_inode()
> > > > cannot have a zero refcount, so the whole if() statement is never executed.
> > > 
> > > generic_forget_inode->write_inode_now->__writeback_single_inode->
> > > __sync_single_inode
> > 
> > oshit.
> 
> When does this ever happen?  Just for private inodes released during
> put_super right?

I suppose so, yes.

> > 
> > > We do have I_WILL_FREE, but i_count will be zero.
> > 
> > yup.
> > 
> > > > 
> > > > The thinking behind that increment is that __sync_single_inode() has just
> > > > taken a dirty, zero-refcount inode and has cleaned it.  A dirty inode
> > > > cannot have previously been on inode_unused, hence we now are newly moving
> > > > it to inode_unused.
> > > 
> > > nr_unused doesn't seem to count the number of inodes on the unused list.
> > > It is actually counting the number of inodes whose i_count is 0.  See
> > > generic_forget_inode and invalidate_list to see what I mean.
> > 
> > hm, OK.  It'd be nice to make that more explicit.  Something like this?
> 
> Well, I can't quite convince myself it is wrong, but when 
> (!sb || (sb->s_flags & MS_ACTIVE), we're dropping the
> inode_lock with an inode with i_count == 0 and nr_unused hasn't been
> incremented.
> 
> So, if someone (sync_sb_inodes?) comes in and runs __iget,
> the counts end up wrong.  Then again, whoever ran __iget would also run
> iput and things would go horribly wrong anyway.

Nope, it's equivalent:

--- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup	2005-10-18 18:13:22.000000000 -0700
+++ devel-akpm/fs/inode.c	2005-10-18 18:13:57.000000000 -0700
@@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct 
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_LOCK)))
 			list_move(&inode->i_list, &inode_unused);
-		inodes_stat.nr_unused++;
 		if (!sb || (sb->s_flags & MS_ACTIVE)) {
+			inodes_stat.nr_unused++;  /* One more 0-ref inode */
 			spin_unlock(&inode_lock);
 			return;
 		}
@@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct 
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
 		inode->i_state &= ~I_WILL_FREE;
-		inodes_stat.nr_unused--;
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
_


> Did I mention the part where Andrea and I are hunting a bug where the
> count of unused inodes goes negative and the everyone ends up spinning
> in shrink_icache_memory?

No.

>  Andrea's patch doesn't fix the spinning, but
> it might have fixed the unused inode count going negative.  We're
> waiting for another reproduce on the ppc64 race monster.

I assume you have BUG_ON(inode_stat.nr_unused < 0)s in there everywhere?

In fact WARN_ON(inode_stat.nr_unused < 100) might be better - something's
obviously doing a bogus decrement a lot of times.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-19  2:26         ` Andrew Morton
@ 2005-10-19  2:58           ` Chris Mason
  2005-10-25  2:21           ` Chris Mason
  1 sibling, 0 replies; 10+ messages in thread
From: Chris Mason @ 2005-10-19  2:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, linux-kernel

On Tue, Oct 18, 2005 at 07:26:46PM -0700, Andrew Morton wrote:
> > > hm, OK.  It'd be nice to make that more explicit.  Something like this?
> > 
> > Well, I can't quite convince myself it is wrong, but when 
> > (!sb || (sb->s_flags & MS_ACTIVE), we're dropping the
> > inode_lock with an inode with i_count == 0 and nr_unused hasn't been
> > incremented.
> > 
> > So, if someone (sync_sb_inodes?) comes in and runs __iget,
> > the counts end up wrong.  Then again, whoever ran __iget would also run
> > iput and things would go horribly wrong anyway.
> 
> Nope, it's equivalent:

The math ends up the same, but for your version there is a window where
the lock isn't held and the count doesn't reflect reality.  I don't know
if anyone can race in and mess with the inode though.  It is still on
various lists, but if we're only in that part of generic_forget_inode
during unmount, the super semaphore will keep out most potential racers.

I need to read harder.

> > Did I mention the part where Andrea and I are hunting a bug where the
> > count of unused inodes goes negative and the everyone ends up spinning
> > in shrink_icache_memory?
> 
> No.
> 
> >  Andrea's patch doesn't fix the spinning, but
> > it might have fixed the unused inode count going negative.  We're
> > waiting for another reproduce on the ppc64 race monster.
> 
> I assume you have BUG_ON(inode_stat.nr_unused < 0)s in there everywhere?
> 
> In fact WARN_ON(inode_stat.nr_unused < 100) might be better - something's
> obviously doing a bogus decrement a lot of times.
> 

It goes negative in the invalidate_inodes run during unmount.  I
think Andrea's patch will solve that part, hopefully we'll know more
tomorrow.

-chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-19  0:13 ` Andrew Morton
  2005-10-19  0:40   ` Chris Mason
@ 2005-10-19  7:47   ` Andrea Arcangeli
  1 sibling, 0 replies; 10+ messages in thread
From: Andrea Arcangeli @ 2005-10-19  7:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, mason

On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote:
> Well according to my assertion (below), the inode in __sync_single_inode()
> cannot have a zero refcount, so the whole if() statement is never executed.

It can, but only if it comes from I_WILL_FREE path
(generic_forget_inode).  See the write_inode_now in
generic_forget_inode.

My BUG_ON already makes sure that when i_count is zero, I_WILL_FREE is
set (I_WILL_FREE will prevent the inode to be freed by the vm as well,
so it's ok).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-19  2:26         ` Andrew Morton
  2005-10-19  2:58           ` Chris Mason
@ 2005-10-25  2:21           ` Chris Mason
  2005-10-25 14:14             ` Andrea Arcangeli
  1 sibling, 1 reply; 10+ messages in thread
From: Chris Mason @ 2005-10-25  2:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, linux-kernel

On Tue, Oct 18, 2005 at 07:26:46PM -0700, Andrew Morton wrote:
> Chris Mason <mason@suse.com> wrote:
> >
> > On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote:
> > > > > Well according to my assertion (below), the inode in __sync_single_inode()
> > > > > cannot have a zero refcount, so the whole if() statement is never executed.
> > > > 
> > > > generic_forget_inode->write_inode_now->__writeback_single_inode->
> > > > __sync_single_inode
> > > 
> > > oshit.
> > 
> > When does this ever happen?  Just for private inodes released during
> > put_super right?
> 
> I suppose so, yes.

It's not related to the bug, but prune_icache can jump in at any
time during generic_shutdown_super, except during the invalidate_inodes
runs.  Something like this:

proc1				proc2
generic_shutdown_super
s->s_flags &= ~MS_ACTIVE
invalidate_inodes
put_super
				shrink_icache_memory
				prune_icache
				invalidate_inode_pages
				try_to_release_page
				
I doubt any FS triggers this.  They would need to generate inodes
with pages during the put_super call, and get them onto the unused list.
But, I think prune_icache should just skip any inodes where the super
doesn't have MS_ACTIVE set.

At any rate, this wasn't the race I was looking for.  Aside from the
bugs fixed by Andrea's patch, we were seeing inodes go negative thanks
to a bad interaction between a latency fix and a backport of something
else from mainline.   Our latency code has a goto again, and mainline
has a big fat comment explaining why goto again isn't needed.  

If the super->s_inodes list was long enough to reschedule in invalidate_list,
we would process the same inodes in multiple times without removing them.

The short version is that no additional patches should be needed for
mainline.

-chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
  2005-10-25  2:21           ` Chris Mason
@ 2005-10-25 14:14             ` Andrea Arcangeli
  0 siblings, 0 replies; 10+ messages in thread
From: Andrea Arcangeli @ 2005-10-25 14:14 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrew Morton, linux-kernel

On Mon, Oct 24, 2005 at 10:21:02PM -0400, Chris Mason wrote:
> The short version is that no additional patches should be needed for
> mainline.

This one may be needed too. Perhaps it's unnecessary for the MS_ACTIVE
case (I would assume in that case by design nobody can find the inode
while MS_ACTIVE is not set), but the unhashed case sounds more
interesting. At the moment I'm unsure who is using the unhashed
last-iput feature to get rid of the inode but the below looks needed at
the light of that feature.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff -r 2c7561cc445e fs/inode.c
--- a/fs/inode.c	Mon Oct 24 00:24:54 2005 +0011
+++ b/fs/inode.c	Tue Oct 25 16:06:25 2005 +0200
@@ -1088,6 +1088,7 @@
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
+	wake_up_inode(inode);
 	destroy_inode(inode);
 }
 


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2005-10-25 14:15 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-10-18  8:26 [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set Andrea Arcangeli
2005-10-19  0:13 ` Andrew Morton
2005-10-19  0:40   ` Chris Mason
2005-10-19  1:15     ` Andrew Morton
2005-10-19  1:58       ` Chris Mason
2005-10-19  2:26         ` Andrew Morton
2005-10-19  2:58           ` Chris Mason
2005-10-25  2:21           ` Chris Mason
2005-10-25 14:14             ` Andrea Arcangeli
2005-10-19  7:47   ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox