* [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
@ 2005-10-18 8:26 Andrea Arcangeli
2005-10-19 0:13 ` Andrew Morton
0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2005-10-18 8:26 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Chris Mason
Hello,
@@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
list_move(&inode->i_list, &inode_in_use);
} else {
list_move(&inode->i_list, &inode_unused);
+ inodes_stat.nr_unused++;
}
}
wake_up_inode(inode);
Are you sure the above diff is correct? It was added somewhere between
2.6.5 and 2.6.8. I think it's wrong.
The only way I can imagine the i_count to be zero in the above path, is
that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
increase nr_unused. So I believe the above change is buggy and it will
definitely overstate the number of unused inodes and it should be backed
out.
Note that __writeback_single_inode before calling __sync_single_inode, can
drop the spinlock and we can have both the dirty and locked bitflags
clear here:
spin_unlock(&inode_lock);
__wait_on_inode(inode);
iput(inode);
XXXXXXX
spin_lock(&inode_lock);
}
use inode again here
a construct like the above makes zero sense from a reference counting
standpoint.
Either we don't ever use the inode again after the iput, or the
inode_lock should be taken _before_ executing the iput (i.e. a __iput
would be required). Taking the inode_lock after iput means the iget was
useless if we keep using the inode after the iput.
So the only chance the 2.6 was safe to call __writeback_single_inode
with the i_count == 0, is that I_WILL_FREE is set (I_WILL_FREE will
prevent the VM to free the inode in XXXXX).
Potentially calling the above iput with I_WILL_FREE was also wrong
because it would recurse in iput_final (the second mainline bug).
The below (untested) patch fixes the nr_unused accounting, avoids
recursing in iput when I_WILL_FREE is set and makes sure (with the
BUG_ON) that we don't corrupt memory and that all holders that don't set
I_WILL_FREE, keeps a reference on the inode!
Comments welcome, thanks.
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
fs-writeback.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)
Index: linux-2.6/fs/fs-writeback.c
--- linux-2.6/fs/fs-writeback.c.~1~ 2005-07-28 17:08:53.000000000 +0200
+++ linux-2.6/fs/fs-writeback.c 2005-10-17 15:43:53.000000000 +0200
@@ -230,7 +230,6 @@ __sync_single_inode(struct inode *inode,
* The inode is clean, unused
*/
list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
}
}
wake_up_inode(inode);
@@ -246,6 +245,8 @@ __writeback_single_inode(struct inode *i
{
wait_queue_head_t *wqh;
+ BUG_ON(!atomic_read(&inode->i_count) ^ !!(inode->i_state & I_WILL_FREE));
+
if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
list_move(&inode->i_list, &inode->i_sb->s_dirty);
return 0;
@@ -259,11 +260,9 @@ __writeback_single_inode(struct inode *i
wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
do {
- __iget(inode);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait,
TASK_UNINTERRUPTIBLE);
- iput(inode);
spin_lock(&inode_lock);
} while (inode->i_state & I_LOCK);
}
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-18 8:26 [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set Andrea Arcangeli @ 2005-10-19 0:13 ` Andrew Morton 2005-10-19 0:40 ` Chris Mason 2005-10-19 7:47 ` Andrea Arcangeli 0 siblings, 2 replies; 10+ messages in thread From: Andrew Morton @ 2005-10-19 0:13 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel, mason Andrea Arcangeli <andrea@suse.de> wrote: > > Hello, > > @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode, > list_move(&inode->i_list, &inode_in_use); > } else { > list_move(&inode->i_list, &inode_unused); > + inodes_stat.nr_unused++; > } > } > wake_up_inode(inode); > > Are you sure the above diff is correct? It was added somewhere between > 2.6.5 and 2.6.8. I think it's wrong. > > The only way I can imagine the i_count to be zero in the above path, is > that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not > increase nr_unused. So I believe the above change is buggy and it will > definitely overstate the number of unused inodes and it should be backed > out. Well according to my assertion (below), the inode in __sync_single_inode() cannot have a zero refcount, so the whole if() statement is never executed. The thinking behind that increment is that __sync_single_inode() has just taken a dirty, zero-refcount inode and has cleaned it. A dirty inode cannot have previously been on inode_unused, hence we now are newly moving it to inode_unused. I'll stick a WARN_ON in there for now, wait and see if anyone can hit it. > Note that __writeback_single_inode before calling __sync_single_inode, can > drop the spinlock and we can have both the dirty and locked bitflags > clear here: > > spin_unlock(&inode_lock); > __wait_on_inode(inode); > iput(inode); > XXXXXXX > spin_lock(&inode_lock); > } > use inode again here > > a construct like the above makes zero sense from a reference counting > standpoint. > > Either we don't ever use the inode again after the iput, or the > inode_lock should be taken _before_ executing the iput (i.e. a __iput > would be required). Taking the inode_lock after iput means the iget was > useless if we keep using the inode after the iput. > > So the only chance the 2.6 was safe to call __writeback_single_inode > with the i_count == 0, is that I_WILL_FREE is set (I_WILL_FREE will > prevent the VM to free the inode in XXXXX). > > Potentially calling the above iput with I_WILL_FREE was also wrong > because it would recurse in iput_final (the second mainline bug). > > The below (untested) patch fixes the nr_unused accounting, avoids > recursing in iput when I_WILL_FREE is set and makes sure (with the > BUG_ON) that we don't corrupt memory and that all holders that don't set > I_WILL_FREE, keeps a reference on the inode! > That's something which Bill snuck in there during some waitqueue rework. I agree that the iget/iput is unneeded: all callers to __writeback_single_inode() already have a ref on the inode: either via sync_sb_inodes()'s iget() or via syscall(fd, ...). So the BUG_ON() in __writeback_single_inode() becomes BUG_ON(!atomic_read(&inode->i_count)); - I'll make it WARN_ON for now coz I'm not so sure any more ;) So... With updated comments: diff -puN fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set fs/fs-writeback.c --- 25/fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set Tue Oct 18 17:07:49 2005 +++ 25-akpm/fs/fs-writeback.c Tue Oct 18 17:12:53 2005 @@ -229,8 +229,8 @@ __sync_single_inode(struct inode *inode, /* * The inode is clean, unused */ + WARN_ON(1); list_move(&inode->i_list, &inode_unused); - inodes_stat.nr_unused++; } } wake_up_inode(inode); @@ -238,14 +238,16 @@ __sync_single_inode(struct inode *inode, } /* - * Write out an inode's dirty pages. Called under inode_lock. + * Write out an inode's dirty pages. Called under inode_lock. The caller has + * ref on the inode (either via __iget or via syscall against an fd). */ static int -__writeback_single_inode(struct inode *inode, - struct writeback_control *wbc) +__writeback_single_inode(struct inode *inode, struct writeback_control *wbc) { wait_queue_head_t *wqh; + WARN_ON(!atomic_read(&inode->i_count)); + if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) { list_move(&inode->i_list, &inode->i_sb->s_dirty); return 0; @@ -259,11 +261,9 @@ __writeback_single_inode(struct inode *i wqh = bit_waitqueue(&inode->i_state, __I_LOCK); do { - __iget(inode); spin_unlock(&inode_lock); __wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE); - iput(inode); spin_lock(&inode_lock); } while (inode->i_state & I_LOCK); } @@ -541,14 +541,15 @@ void sync_inodes(int wait) } /** - * write_inode_now - write an inode to disk - * @inode: inode to write to disk - * @sync: whether the write should be synchronous or not + * write_inode_now - write an inode to disk + * @inode: inode to write to disk + * @sync: whether the write should be synchronous or not + * + * This function commits an inode to disk immediately if it is dirty. This is + * primarily needed by knfsd. * - * This function commits an inode to disk immediately if it is - * dirty. This is primarily needed by knfsd. + * The caller must have a ref on the inode. */ - int write_inode_now(struct inode *inode, int sync) { int ret; _ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-19 0:13 ` Andrew Morton @ 2005-10-19 0:40 ` Chris Mason 2005-10-19 1:15 ` Andrew Morton 2005-10-19 7:47 ` Andrea Arcangeli 1 sibling, 1 reply; 10+ messages in thread From: Chris Mason @ 2005-10-19 0:40 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, linux-kernel On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote: > Andrea Arcangeli <andrea@suse.de> wrote: > > > > Hello, > > > > @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode, > > list_move(&inode->i_list, &inode_in_use); > > } else { > > list_move(&inode->i_list, &inode_unused); > > + inodes_stat.nr_unused++; > > } > > } > > wake_up_inode(inode); > > > > Are you sure the above diff is correct? It was added somewhere between > > 2.6.5 and 2.6.8. I think it's wrong. > > > > The only way I can imagine the i_count to be zero in the above path, is > > that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not > > increase nr_unused. So I believe the above change is buggy and it will > > definitely overstate the number of unused inodes and it should be backed > > out. > > Well according to my assertion (below), the inode in __sync_single_inode() > cannot have a zero refcount, so the whole if() statement is never executed. generic_forget_inode->write_inode_now->__writeback_single_inode-> __sync_single_inode We do have I_WILL_FREE, but i_count will be zero. > > The thinking behind that increment is that __sync_single_inode() has just > taken a dirty, zero-refcount inode and has cleaned it. A dirty inode > cannot have previously been on inode_unused, hence we now are newly moving > it to inode_unused. nr_unused doesn't seem to count the number of inodes on the unused list. It is actually counting the number of inodes whose i_count is 0. See generic_forget_inode and invalidate_list to see what I mean. generic_forget_inode took care of incrementing the unused count when i_count went to zero. So, I don't think we need to worry about the unused count in __writeback_single_inode. -chris ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-19 0:40 ` Chris Mason @ 2005-10-19 1:15 ` Andrew Morton 2005-10-19 1:58 ` Chris Mason 0 siblings, 1 reply; 10+ messages in thread From: Andrew Morton @ 2005-10-19 1:15 UTC (permalink / raw) To: Chris Mason; +Cc: andrea, linux-kernel Chris Mason <mason@suse.com> wrote: > > On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote: > > Andrea Arcangeli <andrea@suse.de> wrote: > > > > > > Hello, > > > > > > @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode, > > > list_move(&inode->i_list, &inode_in_use); > > > } else { > > > list_move(&inode->i_list, &inode_unused); > > > + inodes_stat.nr_unused++; > > > } > > > } > > > wake_up_inode(inode); > > > > > > Are you sure the above diff is correct? It was added somewhere between > > > 2.6.5 and 2.6.8. I think it's wrong. > > > > > > The only way I can imagine the i_count to be zero in the above path, is > > > that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not > > > increase nr_unused. So I believe the above change is buggy and it will > > > definitely overstate the number of unused inodes and it should be backed > > > out. > > > > Well according to my assertion (below), the inode in __sync_single_inode() > > cannot have a zero refcount, so the whole if() statement is never executed. > > generic_forget_inode->write_inode_now->__writeback_single_inode-> > __sync_single_inode oshit. > We do have I_WILL_FREE, but i_count will be zero. yup. > > > > The thinking behind that increment is that __sync_single_inode() has just > > taken a dirty, zero-refcount inode and has cleaned it. A dirty inode > > cannot have previously been on inode_unused, hence we now are newly moving > > it to inode_unused. > > nr_unused doesn't seem to count the number of inodes on the unused list. > It is actually counting the number of inodes whose i_count is 0. See > generic_forget_inode and invalidate_list to see what I mean. hm, OK. It'd be nice to make that more explicit. Something like this? --- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup 2005-10-18 18:13:22.000000000 -0700 +++ devel-akpm/fs/inode.c 2005-10-18 18:13:57.000000000 -0700 @@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct if (!hlist_unhashed(&inode->i_hash)) { if (!(inode->i_state & (I_DIRTY|I_LOCK))) list_move(&inode->i_list, &inode_unused); - inodes_stat.nr_unused++; if (!sb || (sb->s_flags & MS_ACTIVE)) { + inodes_stat.nr_unused++; /* One more 0-ref inode */ spin_unlock(&inode_lock); return; } @@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct write_inode_now(inode, 1); spin_lock(&inode_lock); inode->i_state &= ~I_WILL_FREE; - inodes_stat.nr_unused--; hlist_del_init(&inode->i_hash); } list_del_init(&inode->i_list); _ > generic_forget_inode took care of incrementing the unused count when > i_count went to zero. So, I don't think we need to worry about the > unused count in __writeback_single_inode. > How about this for now? diff -puN fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set fs/fs-writeback.c --- devel/fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set 2005-10-18 18:02:51.000000000 -0700 +++ devel-akpm/fs/fs-writeback.c 2005-10-18 18:05:22.000000000 -0700 @@ -229,8 +229,8 @@ __sync_single_inode(struct inode *inode, /* * The inode is clean, unused */ + WARN_ON(1); list_move(&inode->i_list, &inode_unused); - inodes_stat.nr_unused++; } } wake_up_inode(inode); @@ -238,14 +238,20 @@ __sync_single_inode(struct inode *inode, } /* - * Write out an inode's dirty pages. Called under inode_lock. + * Write out an inode's dirty pages. Called under inode_lock. Either the + * caller has ref on the inode (either via __iget or via syscall against an fd) + * or the inode has I_WILL_FREE set (via generic_forget_inode) */ static int -__writeback_single_inode(struct inode *inode, - struct writeback_control *wbc) +__writeback_single_inode(struct inode *inode, struct writeback_control *wbc) { wait_queue_head_t *wqh; + if (!atomic_read(&inode->i_count)) + WARN_ON(!(inode->i_flags & I_WILL_FREE)); + else + WARN_ON(inode->i_flags & I_WILL_FREE); + if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) { list_move(&inode->i_list, &inode->i_sb->s_dirty); return 0; @@ -259,11 +265,9 @@ __writeback_single_inode(struct inode *i wqh = bit_waitqueue(&inode->i_state, __I_LOCK); do { - __iget(inode); spin_unlock(&inode_lock); __wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE); - iput(inode); spin_lock(&inode_lock); } while (inode->i_state & I_LOCK); } @@ -541,14 +545,15 @@ void sync_inodes(int wait) } /** - * write_inode_now - write an inode to disk - * @inode: inode to write to disk - * @sync: whether the write should be synchronous or not + * write_inode_now - write an inode to disk + * @inode: inode to write to disk + * @sync: whether the write should be synchronous or not + * + * This function commits an inode to disk immediately if it is dirty. This is + * primarily needed by knfsd. * - * This function commits an inode to disk immediately if it is - * dirty. This is primarily needed by knfsd. + * The caller must either have a ref on the inode or must have set I_WILL_FREE. */ - int write_inode_now(struct inode *inode, int sync) { int ret; _ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-19 1:15 ` Andrew Morton @ 2005-10-19 1:58 ` Chris Mason 2005-10-19 2:26 ` Andrew Morton 0 siblings, 1 reply; 10+ messages in thread From: Chris Mason @ 2005-10-19 1:58 UTC (permalink / raw) To: Andrew Morton; +Cc: andrea, linux-kernel On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote: > > > Well according to my assertion (below), the inode in __sync_single_inode() > > > cannot have a zero refcount, so the whole if() statement is never executed. > > > > generic_forget_inode->write_inode_now->__writeback_single_inode-> > > __sync_single_inode > > oshit. When does this ever happen? Just for private inodes released during put_super right? > > > We do have I_WILL_FREE, but i_count will be zero. > > yup. > > > > > > > The thinking behind that increment is that __sync_single_inode() has just > > > taken a dirty, zero-refcount inode and has cleaned it. A dirty inode > > > cannot have previously been on inode_unused, hence we now are newly moving > > > it to inode_unused. > > > > nr_unused doesn't seem to count the number of inodes on the unused list. > > It is actually counting the number of inodes whose i_count is 0. See > > generic_forget_inode and invalidate_list to see what I mean. > > hm, OK. It'd be nice to make that more explicit. Something like this? Well, I can't quite convince myself it is wrong, but when (!sb || (sb->s_flags & MS_ACTIVE), we're dropping the inode_lock with an inode with i_count == 0 and nr_unused hasn't been incremented. So, if someone (sync_sb_inodes?) comes in and runs __iget, the counts end up wrong. Then again, whoever ran __iget would also run iput and things would go horribly wrong anyway. Did I mention the part where Andrea and I are hunting a bug where the count of unused inodes goes negative and the everyone ends up spinning in shrink_icache_memory? Andrea's patch doesn't fix the spinning, but it might have fixed the unused inode count going negative. We're waiting for another reproduce on the ppc64 race monster. > > --- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup 2005-10-18 18:13:22.000000000 -0700 > +++ devel-akpm/fs/inode.c 2005-10-18 18:13:57.000000000 -0700 > @@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct > if (!hlist_unhashed(&inode->i_hash)) { > if (!(inode->i_state & (I_DIRTY|I_LOCK))) > list_move(&inode->i_list, &inode_unused); > - inodes_stat.nr_unused++; > if (!sb || (sb->s_flags & MS_ACTIVE)) { > + inodes_stat.nr_unused++; /* One more 0-ref inode */ > spin_unlock(&inode_lock); > return; > } > @@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct > write_inode_now(inode, 1); > spin_lock(&inode_lock); > inode->i_state &= ~I_WILL_FREE; > - inodes_stat.nr_unused--; > hlist_del_init(&inode->i_hash); > } > list_del_init(&inode->i_list); > _ > > > generic_forget_inode took care of incrementing the unused count when > > i_count went to zero. So, I don't think we need to worry about the > > unused count in __writeback_single_inode. > > > > How about this for now? This part looks good. -chris ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-19 1:58 ` Chris Mason @ 2005-10-19 2:26 ` Andrew Morton 2005-10-19 2:58 ` Chris Mason 2005-10-25 2:21 ` Chris Mason 0 siblings, 2 replies; 10+ messages in thread From: Andrew Morton @ 2005-10-19 2:26 UTC (permalink / raw) To: Chris Mason; +Cc: andrea, linux-kernel Chris Mason <mason@suse.com> wrote: > > On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote: > > > > Well according to my assertion (below), the inode in __sync_single_inode() > > > > cannot have a zero refcount, so the whole if() statement is never executed. > > > > > > generic_forget_inode->write_inode_now->__writeback_single_inode-> > > > __sync_single_inode > > > > oshit. > > When does this ever happen? Just for private inodes released during > put_super right? I suppose so, yes. > > > > > We do have I_WILL_FREE, but i_count will be zero. > > > > yup. > > > > > > > > > > The thinking behind that increment is that __sync_single_inode() has just > > > > taken a dirty, zero-refcount inode and has cleaned it. A dirty inode > > > > cannot have previously been on inode_unused, hence we now are newly moving > > > > it to inode_unused. > > > > > > nr_unused doesn't seem to count the number of inodes on the unused list. > > > It is actually counting the number of inodes whose i_count is 0. See > > > generic_forget_inode and invalidate_list to see what I mean. > > > > hm, OK. It'd be nice to make that more explicit. Something like this? > > Well, I can't quite convince myself it is wrong, but when > (!sb || (sb->s_flags & MS_ACTIVE), we're dropping the > inode_lock with an inode with i_count == 0 and nr_unused hasn't been > incremented. > > So, if someone (sync_sb_inodes?) comes in and runs __iget, > the counts end up wrong. Then again, whoever ran __iget would also run > iput and things would go horribly wrong anyway. Nope, it's equivalent: --- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup 2005-10-18 18:13:22.000000000 -0700 +++ devel-akpm/fs/inode.c 2005-10-18 18:13:57.000000000 -0700 @@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct if (!hlist_unhashed(&inode->i_hash)) { if (!(inode->i_state & (I_DIRTY|I_LOCK))) list_move(&inode->i_list, &inode_unused); - inodes_stat.nr_unused++; if (!sb || (sb->s_flags & MS_ACTIVE)) { + inodes_stat.nr_unused++; /* One more 0-ref inode */ spin_unlock(&inode_lock); return; } @@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct write_inode_now(inode, 1); spin_lock(&inode_lock); inode->i_state &= ~I_WILL_FREE; - inodes_stat.nr_unused--; hlist_del_init(&inode->i_hash); } list_del_init(&inode->i_list); _ > Did I mention the part where Andrea and I are hunting a bug where the > count of unused inodes goes negative and the everyone ends up spinning > in shrink_icache_memory? No. > Andrea's patch doesn't fix the spinning, but > it might have fixed the unused inode count going negative. We're > waiting for another reproduce on the ppc64 race monster. I assume you have BUG_ON(inode_stat.nr_unused < 0)s in there everywhere? In fact WARN_ON(inode_stat.nr_unused < 100) might be better - something's obviously doing a bogus decrement a lot of times. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-19 2:26 ` Andrew Morton @ 2005-10-19 2:58 ` Chris Mason 2005-10-25 2:21 ` Chris Mason 1 sibling, 0 replies; 10+ messages in thread From: Chris Mason @ 2005-10-19 2:58 UTC (permalink / raw) To: Andrew Morton; +Cc: andrea, linux-kernel On Tue, Oct 18, 2005 at 07:26:46PM -0700, Andrew Morton wrote: > > > hm, OK. It'd be nice to make that more explicit. Something like this? > > > > Well, I can't quite convince myself it is wrong, but when > > (!sb || (sb->s_flags & MS_ACTIVE), we're dropping the > > inode_lock with an inode with i_count == 0 and nr_unused hasn't been > > incremented. > > > > So, if someone (sync_sb_inodes?) comes in and runs __iget, > > the counts end up wrong. Then again, whoever ran __iget would also run > > iput and things would go horribly wrong anyway. > > Nope, it's equivalent: The math ends up the same, but for your version there is a window where the lock isn't held and the count doesn't reflect reality. I don't know if anyone can race in and mess with the inode though. It is still on various lists, but if we're only in that part of generic_forget_inode during unmount, the super semaphore will keep out most potential racers. I need to read harder. > > Did I mention the part where Andrea and I are hunting a bug where the > > count of unused inodes goes negative and the everyone ends up spinning > > in shrink_icache_memory? > > No. > > > Andrea's patch doesn't fix the spinning, but > > it might have fixed the unused inode count going negative. We're > > waiting for another reproduce on the ppc64 race monster. > > I assume you have BUG_ON(inode_stat.nr_unused < 0)s in there everywhere? > > In fact WARN_ON(inode_stat.nr_unused < 100) might be better - something's > obviously doing a bogus decrement a lot of times. > It goes negative in the invalidate_inodes run during unmount. I think Andrea's patch will solve that part, hopefully we'll know more tomorrow. -chris ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-19 2:26 ` Andrew Morton 2005-10-19 2:58 ` Chris Mason @ 2005-10-25 2:21 ` Chris Mason 2005-10-25 14:14 ` Andrea Arcangeli 1 sibling, 1 reply; 10+ messages in thread From: Chris Mason @ 2005-10-25 2:21 UTC (permalink / raw) To: Andrew Morton; +Cc: andrea, linux-kernel On Tue, Oct 18, 2005 at 07:26:46PM -0700, Andrew Morton wrote: > Chris Mason <mason@suse.com> wrote: > > > > On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote: > > > > > Well according to my assertion (below), the inode in __sync_single_inode() > > > > > cannot have a zero refcount, so the whole if() statement is never executed. > > > > > > > > generic_forget_inode->write_inode_now->__writeback_single_inode-> > > > > __sync_single_inode > > > > > > oshit. > > > > When does this ever happen? Just for private inodes released during > > put_super right? > > I suppose so, yes. It's not related to the bug, but prune_icache can jump in at any time during generic_shutdown_super, except during the invalidate_inodes runs. Something like this: proc1 proc2 generic_shutdown_super s->s_flags &= ~MS_ACTIVE invalidate_inodes put_super shrink_icache_memory prune_icache invalidate_inode_pages try_to_release_page I doubt any FS triggers this. They would need to generate inodes with pages during the put_super call, and get them onto the unused list. But, I think prune_icache should just skip any inodes where the super doesn't have MS_ACTIVE set. At any rate, this wasn't the race I was looking for. Aside from the bugs fixed by Andrea's patch, we were seeing inodes go negative thanks to a bad interaction between a latency fix and a backport of something else from mainline. Our latency code has a goto again, and mainline has a big fat comment explaining why goto again isn't needed. If the super->s_inodes list was long enough to reschedule in invalidate_list, we would process the same inodes in multiple times without removing them. The short version is that no additional patches should be needed for mainline. -chris ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-25 2:21 ` Chris Mason @ 2005-10-25 14:14 ` Andrea Arcangeli 0 siblings, 0 replies; 10+ messages in thread From: Andrea Arcangeli @ 2005-10-25 14:14 UTC (permalink / raw) To: Chris Mason; +Cc: Andrew Morton, linux-kernel On Mon, Oct 24, 2005 at 10:21:02PM -0400, Chris Mason wrote: > The short version is that no additional patches should be needed for > mainline. This one may be needed too. Perhaps it's unnecessary for the MS_ACTIVE case (I would assume in that case by design nobody can find the inode while MS_ACTIVE is not set), but the unhashed case sounds more interesting. At the moment I'm unsure who is using the unhashed last-iput feature to get rid of the inode but the below looks needed at the light of that feature. Signed-off-by: Andrea Arcangeli <andrea@suse.de> diff -r 2c7561cc445e fs/inode.c --- a/fs/inode.c Mon Oct 24 00:24:54 2005 +0011 +++ b/fs/inode.c Tue Oct 25 16:06:25 2005 +0200 @@ -1088,6 +1088,7 @@ if (inode->i_data.nrpages) truncate_inode_pages(&inode->i_data, 0); clear_inode(inode); + wake_up_inode(inode); destroy_inode(inode); } ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set 2005-10-19 0:13 ` Andrew Morton 2005-10-19 0:40 ` Chris Mason @ 2005-10-19 7:47 ` Andrea Arcangeli 1 sibling, 0 replies; 10+ messages in thread From: Andrea Arcangeli @ 2005-10-19 7:47 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, mason On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote: > Well according to my assertion (below), the inode in __sync_single_inode() > cannot have a zero refcount, so the whole if() statement is never executed. It can, but only if it comes from I_WILL_FREE path (generic_forget_inode). See the write_inode_now in generic_forget_inode. My BUG_ON already makes sure that when i_count is zero, I_WILL_FREE is set (I_WILL_FREE will prevent the inode to be freed by the vm as well, so it's ok). ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2005-10-25 14:15 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-10-18 8:26 [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set Andrea Arcangeli 2005-10-19 0:13 ` Andrew Morton 2005-10-19 0:40 ` Chris Mason 2005-10-19 1:15 ` Andrew Morton 2005-10-19 1:58 ` Chris Mason 2005-10-19 2:26 ` Andrew Morton 2005-10-19 2:58 ` Chris Mason 2005-10-25 2:21 ` Chris Mason 2005-10-25 14:14 ` Andrea Arcangeli 2005-10-19 7:47 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox