public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] skip I_CLEAR state inodes
       [not found]       ` <20090324120502.GC23439@duck.suse.cz>
@ 2009-03-24 12:40         ` Wu Fengguang
  2009-03-30  7:18           ` [PATCH][RESEND for 2.6.29-rc8-mm1] " Wu Fengguang
  2009-06-01 21:38           ` [PATCH] " Eric Sandeen
  0 siblings, 2 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-03-24 12:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable, LKML, Jan Kara, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().

clear_inode() will switch inode state from I_FREEING to I_CLEAR,
and do so _outside_ of inode_lock. So any I_FREEING testing is
incomplete without the testing of I_CLEAR.

Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
Jan Kara reminds fixing the other two cases. Thanks!

Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/dquot.c        |    2 +-
 fs/drop_caches.c  |    2 +-
 fs/fs-writeback.c |    3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)

--- mm.orig/fs/drop_caches.c
+++ mm/fs/drop_caches.c
@@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
 		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 			struct address_space *mapping;
 
-			if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+			if (inode->i_state &
+					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 				continue;
 			mapping = inode->i_mapping;
 			if (mapping->nrpages == 0)
--- mm.orig/fs/dquot.c
+++ mm/fs/dquot.c
@@ -793,7 +793,7 @@ static void add_dquot_ref(struct super_b
 			continue;
 		if (!dqinit_needed(inode, type))
 			continue;
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
 			continue;
 
 		__iget(inode);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes
  2009-03-24 12:40         ` [PATCH] skip I_CLEAR state inodes Wu Fengguang
@ 2009-03-30  7:18           ` Wu Fengguang
  2009-03-31 23:43             ` Andrew Morton
  2009-06-01 21:38           ` [PATCH] " Eric Sandeen
  1 sibling, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-03-30  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable, LKML, Jan Kara, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

clear_inode() will switch inode state from I_FREEING to I_CLEAR,
and do so _outside_ of inode_lock. So any I_FREEING testing is
incomplete without a coupled testing of I_CLEAR.

So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().

Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
reminds fixing the other two cases.

Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/drop_caches.c  |    2 +-
 fs/fs-writeback.c |    3 ++-
 fs/quota/dquot.c  |    2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

--- mm.orig/fs/drop_caches.c
+++ mm/fs/drop_caches.c
@@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
 		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 			struct address_space *mapping;
 
-			if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+			if (inode->i_state &
+					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 				continue;
 			mapping = inode->i_mapping;
 			if (mapping->nrpages == 0)
--- mm.orig/fs/quota/dquot.c
+++ mm/fs/quota/dquot.c
@@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 			continue;
 		if (!atomic_read(&inode->i_writecount))
 			continue;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes
  2009-03-30  7:18           ` [PATCH][RESEND for 2.6.29-rc8-mm1] " Wu Fengguang
@ 2009-03-31 23:43             ` Andrew Morton
  2009-04-01  0:53               ` Wu Fengguang
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-03-31 23:43 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: stable, linux-kernel, jack, m.mizuma, linux-fsdevel, viro,
	npiggin

On Mon, 30 Mar 2009 15:18:24 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> and do so _outside_ of inode_lock. So any I_FREEING testing is
> incomplete without a coupled testing of I_CLEAR.
> 
> So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> add_dquot_ref().
> 
> Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
> reminds fixing the other two cases.

ok...

But what is the user-visible consequence of this?  You cc'ed
stable@kernel.org so I assume it's serious.  People will want to know
what problem we're fixing!

> 
> --- mm.orig/fs/drop_caches.c
> +++ mm/fs/drop_caches.c
> @@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
>  
>  	spin_lock(&inode_lock);
>  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
>  			continue;
>  		if (inode->i_mapping->nrpages == 0)
>  			continue;
> --- mm.orig/fs/fs-writeback.c
> +++ mm/fs/fs-writeback.c
> @@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
>  		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>  			struct address_space *mapping;
>  
> -			if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> +			if (inode->i_state &
> +					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
>  				continue;
>  			mapping = inode->i_mapping;
>  			if (mapping->nrpages == 0)
> --- mm.orig/fs/quota/dquot.c
> +++ mm/fs/quota/dquot.c
> @@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b
>  
>  	spin_lock(&inode_lock);
>  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
>  			continue;
>  		if (!atomic_read(&inode->i_writecount))
>  			continue;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes
  2009-03-31 23:43             ` Andrew Morton
@ 2009-04-01  0:53               ` Wu Fengguang
  0 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-01  0:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable@kernel.org, linux-kernel@vger.kernel.org, jack@suse.cz,
	m.mizuma@jp.fujitsu.com, linux-fsdevel@vger.kernel.org,
	viro@zeniv.linux.org.uk, npiggin@suse.de

On Wed, Apr 01, 2009 at 07:43:32AM +0800, Andrew Morton wrote:
> On Mon, 30 Mar 2009 15:18:24 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > incomplete without a coupled testing of I_CLEAR.
> >
> > So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > add_dquot_ref().
> >
> > Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
> > reminds fixing the other two cases.
>
> ok...
>
> But what is the user-visible consequence of this?  You cc'ed
> stable@kernel.org so I assume it's serious.  People will want to know
> what problem we're fixing!

Sorry, the changelog could be expanded with the following paragraph:

Fix real kernel panics.  Masayoshi MIZUMA has a nice panic flow:
----------------------------------------------------------------------
            [process A]               |        [process B]
 |                                    |
 |    prune_icache()                  | drop_pagecache()
 |      spin_lock(&inode_lock)        |   drop_pagecache_sb()
 |      inode->i_state |= I_FREEING;  |       |
 |      spin_unlock(&inode_lock)      |       V
 |          |                         |     spin_lock(&inode_lock)
 |          V                         |         |
 |      dispose_list()                |         |
 |        list_del()                  |         |
 |        clear_inode()               |         |
 |          inode->i_state = I_CLEAR  |         |
 |            |                       |         V
 |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
 |            |                       |              continue;           <==== NOT MATCH
 |            |                       |
 |            |                       | (DANGER from here on! Accessing disposing inode!)
 |            |                       |
 |            |                       |      __iget()
 |            |                       |        list_move() <===== PANIC on poisoned list !!
 V            V                       |
(time)
----------------------------------------------------------------------

Thanks,
Fengguang

> >
> > --- mm.orig/fs/drop_caches.c
> > +++ mm/fs/drop_caches.c
> > @@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
> >
> >  	spin_lock(&inode_lock);
> >  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> >  			continue;
> >  		if (inode->i_mapping->nrpages == 0)
> >  			continue;
> > --- mm.orig/fs/fs-writeback.c
> > +++ mm/fs/fs-writeback.c
> > @@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
> >  		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> >  			struct address_space *mapping;
> >
> > -			if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > +			if (inode->i_state &
> > +					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> >  				continue;
> >  			mapping = inode->i_mapping;
> >  			if (mapping->nrpages == 0)
> > --- mm.orig/fs/quota/dquot.c
> > +++ mm/fs/quota/dquot.c
> > @@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b
> >
> >  	spin_lock(&inode_lock);
> >  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> >  			continue;
> >  		if (!atomic_read(&inode->i_writecount))
> >  			continue;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-03-24 12:40         ` [PATCH] skip I_CLEAR state inodes Wu Fengguang
  2009-03-30  7:18           ` [PATCH][RESEND for 2.6.29-rc8-mm1] " Wu Fengguang
@ 2009-06-01 21:38           ` Eric Sandeen
  2009-06-02  8:55             ` Wu Fengguang
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Sandeen @ 2009-06-01 21:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

Wu Fengguang wrote:
> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> add_dquot_ref().
> 
> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> and do so _outside_ of inode_lock. So any I_FREEING testing is
> incomplete without the testing of I_CLEAR.
> 
> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> Jan Kara reminds fixing the other two cases. Thanks!

Is there a reason it's not done for __sync_single_inode as well?

Jeff Layton asked the question and I'm following it up :)

__sync_single_inode currently only tests I_FREEING, but I think we are
safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
I_SYNC to be cleared before it changes I_STATE.

On the other hand, testing I_CLEAR here probably would be safe anyway,
and it'd be bonus points for consistency?

Same basic question for generic_sync_sb_inodes, which has a
BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
as well?

Thanks,
-Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-01 21:38           ` [PATCH] " Eric Sandeen
@ 2009-06-02  8:55             ` Wu Fengguang
  2009-06-02 10:27               ` Jeff Layton
  2009-06-02 11:37               ` Jan Kara
  0 siblings, 2 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-06-02  8:55 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Andrew Morton, LKML, Jan Kara, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> Wu Fengguang wrote:
> > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > add_dquot_ref().
> > 
> > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > incomplete without the testing of I_CLEAR.
> > 
> > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > Jan Kara reminds fixing the other two cases. Thanks!
> 
> Is there a reason it's not done for __sync_single_inode as well?

It missed the glance because it don't have an obvious '|' in the line ;)

> Jeff Layton asked the question and I'm following it up :)
> 
> __sync_single_inode currently only tests I_FREEING, but I think we are
> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> I_SYNC to be cleared before it changes I_STATE.

But I_SYNC is removed just before the I_FREEING test, so we still have
a small race window?

> On the other hand, testing I_CLEAR here probably would be safe anyway,
> and it'd be bonus points for consistency?

So let's add the I_CLEAR test?

> Same basic question for generic_sync_sb_inodes, which has a
> BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> as well?

Yes, we can add I_CLEAR here to catch more error condition.

Thanks,
Fengguang

---
skip I_CLEAR state inodes in writeback routines

The I_FREEING test in __sync_single_inode() is racy because
clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
and the test of I_FREEING.

Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.

Reported-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
 	spin_lock(&inode_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
+	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
 		if (!(inode->i_state & I_DIRTY) &&
 		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
@@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
 		if (current_is_pdflush() && !writeback_acquire(bdi))
 			break;
 
-		BUG_ON(inode->i_state & I_FREEING);
+		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02  8:55             ` Wu Fengguang
@ 2009-06-02 10:27               ` Jeff Layton
  2009-06-02 11:37               ` Jan Kara
  1 sibling, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2009-06-02 10:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Eric Sandeen, Andrew Morton, LKML, Jan Kara, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

On Tue, 2 Jun 2009 16:55:23 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > Wu Fengguang wrote:
> > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > add_dquot_ref().
> > > 
> > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > incomplete without the testing of I_CLEAR.
> > > 
> > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > Jan Kara reminds fixing the other two cases. Thanks!
> > 
> > Is there a reason it's not done for __sync_single_inode as well?
> 
> It missed the glance because it don't have an obvious '|' in the line ;)
> 
> > Jeff Layton asked the question and I'm following it up :)
> > 
> > __sync_single_inode currently only tests I_FREEING, but I think we are
> > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > I_SYNC to be cleared before it changes I_STATE.
> 
> But I_SYNC is removed just before the I_FREEING test, so we still have
> a small race window?
> 

Yes, I think so. __sync_single_inode clears I_SYNC but doesn't wake up
the wait queue until the end of the function. So I think it's possible
(though unlikely) that another thread can race in and change the state
to I_CLEAR before the I_FREEING check.

> > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > and it'd be bonus points for consistency?
> 
> So let's add the I_CLEAR test?
> 
> > Same basic question for generic_sync_sb_inodes, which has a
> > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > as well?
> 
> Yes, we can add I_CLEAR here to catch more error condition.
> 
> Thanks,
> Fengguang
> 
> ---
> skip I_CLEAR state inodes in writeback routines
> 
> The I_FREEING test in __sync_single_inode() is racy because
> clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> and the test of I_FREEING.
> 
> Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> 
> Reported-by: Jeff Layton <jlayton@redhat.com>
> Reported-by: Eric Sandeen <sandeen@sandeen.net>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>  	spin_lock(&inode_lock);
>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state &= ~I_SYNC;
> -	if (!(inode->i_state & I_FREEING)) {
> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>  		if (!(inode->i_state & I_DIRTY) &&
>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
>  		if (current_is_pdflush() && !writeback_acquire(bdi))
>  			break;
>  
> -		BUG_ON(inode->i_state & I_FREEING);
> +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
>  		__iget(inode);
>  		pages_skipped = wbc->pages_skipped;
>  		__writeback_single_inode(inode, wbc);

Acked-by: Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02  8:55             ` Wu Fengguang
  2009-06-02 10:27               ` Jeff Layton
@ 2009-06-02 11:37               ` Jan Kara
  2009-06-02 21:48                 ` Eric Sandeen
                                   ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Jan Kara @ 2009-06-02 11:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Eric Sandeen, Andrew Morton, LKML, Jan Kara, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > Wu Fengguang wrote:
> > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > add_dquot_ref().
> > > 
> > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > incomplete without the testing of I_CLEAR.
> > > 
> > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > Jan Kara reminds fixing the other two cases. Thanks!
> > 
> > Is there a reason it's not done for __sync_single_inode as well?
> 
> It missed the glance because it don't have an obvious '|' in the line ;)
> 
> > Jeff Layton asked the question and I'm following it up :)
> > 
> > __sync_single_inode currently only tests I_FREEING, but I think we are
> > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > I_SYNC to be cleared before it changes I_STATE.
> 
> But I_SYNC is removed just before the I_FREEING test, so we still have
> a small race window?
> 
> > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > and it'd be bonus points for consistency?
> 
> So let's add the I_CLEAR test?
> 
> > Same basic question for generic_sync_sb_inodes, which has a
> > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > as well?
> 
> Yes, we can add I_CLEAR here to catch more error condition.
> 
> Thanks,
> Fengguang
> 
> ---
> skip I_CLEAR state inodes in writeback routines
> 
> The I_FREEING test in __sync_single_inode() is racy because
> clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> and the test of I_FREEING.
> 
> Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> 
> Reported-by: Jeff Layton <jlayton@redhat.com>
> Reported-by: Eric Sandeen <sandeen@sandeen.net>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>  	spin_lock(&inode_lock);
>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state &= ~I_SYNC;
> -	if (!(inode->i_state & I_FREEING)) {
> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>  		if (!(inode->i_state & I_DIRTY) &&
>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
  Is the whole if needed? I had an impression that everyone calling
__sync_single_inode() should better take care it does not race with inode
freeing... So WARN_ON would be more appropriate IMHO.

>  			/*
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
>  		if (current_is_pdflush() && !writeback_acquire(bdi))
>  			break;
>  
> -		BUG_ON(inode->i_state & I_FREEING);
> +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
>  		__iget(inode);
>  		pages_skipped = wbc->pages_skipped;
>  		__writeback_single_inode(inode, wbc);
  Looking at this code, it looks a bit suspicious. What prevents this s_io
list scan to race with inode freeing? In particular generic_forget_inode()
can drop inode_lock to write the inode and in the mean time
generic_sync_sb_inodes() can come, get a reference to the inode and start
it's writeback... Subsequent iput() would then call generic_forget_inode()
on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
inodes in this scan like we do for later in the function for another scan?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02 11:37               ` Jan Kara
@ 2009-06-02 21:48                 ` Eric Sandeen
  2009-06-03 10:45                   ` Jeff Layton
  2009-06-03 13:32                 ` Wu Fengguang
  2009-06-03 14:10                 ` Wu Fengguang
  2 siblings, 1 reply; 22+ messages in thread
From: Eric Sandeen @ 2009-06-02 21:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
>> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
>>> Wu Fengguang wrote:
>>>> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
>>>> add_dquot_ref().
>>>>
>>>> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
>>>> and do so _outside_ of inode_lock. So any I_FREEING testing is
>>>> incomplete without the testing of I_CLEAR.
>>>>
>>>> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
>>>> Jan Kara reminds fixing the other two cases. Thanks!
>>> Is there a reason it's not done for __sync_single_inode as well?
>> It missed the glance because it don't have an obvious '|' in the line ;)
>>
>>> Jeff Layton asked the question and I'm following it up :)
>>>
>>> __sync_single_inode currently only tests I_FREEING, but I think we are
>>> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
>>> I_SYNC to be cleared before it changes I_STATE.
>> But I_SYNC is removed just before the I_FREEING test, so we still have
>> a small race window?

yep that's right.

        inode->i_state &= ~I_SYNC;
	>>> clear_inode->inode_sync_wait here and find it clear <<<
        if (!(inode->i_state & I_FREEING)) {

...

>> --- linux.orig/fs/fs-writeback.c
>> +++ linux/fs/fs-writeback.c
>> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>>  	spin_lock(&inode_lock);
>>  	WARN_ON(inode->i_state & I_NEW);
>>  	inode->i_state &= ~I_SYNC;
>> -	if (!(inode->i_state & I_FREEING)) {
>> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>>  		if (!(inode->i_state & I_DIRTY) &&
>>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {

>   Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.

Maybe both then (both a WARN on and then the test (defensive here, I
guess)) because if we continue we may wander into a poisoned list
pointer and explode, right?

-Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02 21:48                 ` Eric Sandeen
@ 2009-06-03 10:45                   ` Jeff Layton
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2009-06-03 10:45 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Wu Fengguang, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

On Tue, 02 Jun 2009 16:48:57 -0500
Eric Sandeen <sandeen@sandeen.net> wrote:

> Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> >> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> >>> Wu Fengguang wrote:
> >>>> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> >>>> add_dquot_ref().
> >>>>
> >>>> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> >>>> and do so _outside_ of inode_lock. So any I_FREEING testing is
> >>>> incomplete without the testing of I_CLEAR.
> >>>>
> >>>> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> >>>> Jan Kara reminds fixing the other two cases. Thanks!
> >>> Is there a reason it's not done for __sync_single_inode as well?
> >> It missed the glance because it don't have an obvious '|' in the line ;)
> >>
> >>> Jeff Layton asked the question and I'm following it up :)
> >>>
> >>> __sync_single_inode currently only tests I_FREEING, but I think we are
> >>> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> >>> I_SYNC to be cleared before it changes I_STATE.
> >> But I_SYNC is removed just before the I_FREEING test, so we still have
> >> a small race window?
> 
> yep that's right.
> 
>         inode->i_state &= ~I_SYNC;
> 	>>> clear_inode->inode_sync_wait here and find it clear <<<
>         if (!(inode->i_state & I_FREEING)) {
> 
> ...
> 
> >> --- linux.orig/fs/fs-writeback.c
> >> +++ linux/fs/fs-writeback.c
> >> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> >>  	spin_lock(&inode_lock);
> >>  	WARN_ON(inode->i_state & I_NEW);
> >>  	inode->i_state &= ~I_SYNC;
> >> -	if (!(inode->i_state & I_FREEING)) {
> >> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> >>  		if (!(inode->i_state & I_DIRTY) &&
> >>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> 
> >   Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
> 
> Maybe both then (both a WARN on and then the test (defensive here, I
> guess)) because if we continue we may wander into a poisoned list
> pointer and explode, right?
> 

Right.

I think removing this check would be roughly equivalent to adding a
BUG_ON. It just wouldn't reliably pop since it would depend on the
timing of the race.

Keeping the check and adding a WARN seems like a reasonable thing to
do. Maybe after a few releases if no one has seen the WARN fire we can
look at removing the check (or maybe turn it into a BUG)?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02 11:37               ` Jan Kara
  2009-06-02 21:48                 ` Eric Sandeen
@ 2009-06-03 13:32                 ` Wu Fengguang
  2009-06-03 14:00                   ` Jan Kara
  2009-06-03 14:10                 ` Wu Fengguang
  2 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-06-03 13:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

[reply the easy part first]
On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > Wu Fengguang wrote:
> > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > add_dquot_ref().
> > > > 
> > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > incomplete without the testing of I_CLEAR.
> > > > 
> > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > 
> > > Is there a reason it's not done for __sync_single_inode as well?
> > 
> > It missed the glance because it don't have an obvious '|' in the line ;)
> > 
> > > Jeff Layton asked the question and I'm following it up :)
> > > 
> > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > I_SYNC to be cleared before it changes I_STATE.
> > 
> > But I_SYNC is removed just before the I_FREEING test, so we still have
> > a small race window?
> > 
> > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > and it'd be bonus points for consistency?
> > 
> > So let's add the I_CLEAR test?
> > 
> > > Same basic question for generic_sync_sb_inodes, which has a
> > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > as well?
> > 
> > Yes, we can add I_CLEAR here to catch more error condition.
> > 
> > Thanks,
> > Fengguang
> > 
> > ---
> > skip I_CLEAR state inodes in writeback routines
> > 
> > The I_FREEING test in __sync_single_inode() is racy because
> > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > and the test of I_FREEING.
> > 
> > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > 
> > Reported-by: Jeff Layton <jlayton@redhat.com>
> > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> >  	spin_lock(&inode_lock);
> >  	WARN_ON(inode->i_state & I_NEW);
> >  	inode->i_state &= ~I_SYNC;
> > -	if (!(inode->i_state & I_FREEING)) {
> > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> >  		if (!(inode->i_state & I_DIRTY) &&
> >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>   Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.

The caller doesn't matter here, because we temporarily dropped inode_lock
in __sync_single_inode() ?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-03 13:32                 ` Wu Fengguang
@ 2009-06-03 14:00                   ` Jan Kara
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2009-06-03 14:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin

On Wed 03-06-09 21:32:02, Wu Fengguang wrote:
> [reply the easy part first]
> On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > Wu Fengguang wrote:
> > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > add_dquot_ref().
> > > > > 
> > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > incomplete without the testing of I_CLEAR.
> > > > > 
> > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > > 
> > > > Is there a reason it's not done for __sync_single_inode as well?
> > > 
> > > It missed the glance because it don't have an obvious '|' in the line ;)
> > > 
> > > > Jeff Layton asked the question and I'm following it up :)
> > > > 
> > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > I_SYNC to be cleared before it changes I_STATE.
> > > 
> > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > a small race window?
> > > 
> > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > and it'd be bonus points for consistency?
> > > 
> > > So let's add the I_CLEAR test?
> > > 
> > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > as well?
> > > 
> > > Yes, we can add I_CLEAR here to catch more error condition.
> > > 
> > > Thanks,
> > > Fengguang
> > > 
> > > ---
> > > skip I_CLEAR state inodes in writeback routines
> > > 
> > > The I_FREEING test in __sync_single_inode() is racy because
> > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > and the test of I_FREEING.
> > > 
> > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > > 
> > > Reported-by: Jeff Layton <jlayton@redhat.com>
> > > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  fs/fs-writeback.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- linux.orig/fs/fs-writeback.c
> > > +++ linux/fs/fs-writeback.c
> > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > >  	spin_lock(&inode_lock);
> > >  	WARN_ON(inode->i_state & I_NEW);
> > >  	inode->i_state &= ~I_SYNC;
> > > -	if (!(inode->i_state & I_FREEING)) {
> > > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > >  		if (!(inode->i_state & I_DIRTY) &&
> > >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> >   Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
> 
> The caller doesn't matter here, because we temporarily dropped inode_lock
> in __sync_single_inode() ?
  But the problem is basically what I described in the part you cut from
the email - if the caller of __sync_single_inode() does not have a
reference to the inode or some other means of protecting inode from being
released, then we have a problem because e.g. __writeback_single_inode()
can sleep and before I_SYNC is set, all the clear_inode() can finish and
we end up calling do_writepages() on invalidated mapping.
  So in my opinion everybody calling __writeback_single_inode() and thus
__sync_single_inode() should do an equivalent of igrab() or be sure inode
cannot be freed e.g. because we have a file open.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02 11:37               ` Jan Kara
  2009-06-02 21:48                 ` Eric Sandeen
  2009-06-03 13:32                 ` Wu Fengguang
@ 2009-06-03 14:10                 ` Wu Fengguang
  2009-06-03 14:16                   ` Jan Kara
  2 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-06-03 14:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin, Jeff Layton

On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > Wu Fengguang wrote:
> > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > add_dquot_ref().
> > > > 
> > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > incomplete without the testing of I_CLEAR.
> > > > 
> > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > 
> > > Is there a reason it's not done for __sync_single_inode as well?
> > 
> > It missed the glance because it don't have an obvious '|' in the line ;)
> > 
> > > Jeff Layton asked the question and I'm following it up :)
> > > 
> > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > I_SYNC to be cleared before it changes I_STATE.
> > 
> > But I_SYNC is removed just before the I_FREEING test, so we still have
> > a small race window?
> > 
> > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > and it'd be bonus points for consistency?
> > 
> > So let's add the I_CLEAR test?
> > 
> > > Same basic question for generic_sync_sb_inodes, which has a
> > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > as well?
> > 
> > Yes, we can add I_CLEAR here to catch more error condition.
> > 
> > Thanks,
> > Fengguang
> > 
> > ---
> > skip I_CLEAR state inodes in writeback routines
> > 
> > The I_FREEING test in __sync_single_inode() is racy because
> > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > and the test of I_FREEING.
> > 
> > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > 
> > Reported-by: Jeff Layton <jlayton@redhat.com>
> > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> >  	spin_lock(&inode_lock);
> >  	WARN_ON(inode->i_state & I_NEW);
> >  	inode->i_state &= ~I_SYNC;
> > -	if (!(inode->i_state & I_FREEING)) {
> > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> >  		if (!(inode->i_state & I_DIRTY) &&
> >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>   Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.
> 
> >  			/*
> > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> >  		if (current_is_pdflush() && !writeback_acquire(bdi))
> >  			break;
> >  
> > -		BUG_ON(inode->i_state & I_FREEING);
> > +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> >  		__iget(inode);
> >  		pages_skipped = wbc->pages_skipped;
> >  		__writeback_single_inode(inode, wbc);
>   Looking at this code, it looks a bit suspicious. What prevents this s_io
> list scan to race with inode freeing? In particular generic_forget_inode()

Good catch.

> can drop inode_lock to write the inode and in the mean time
> generic_sync_sb_inodes() can come, get a reference to the inode and start
> it's writeback... Subsequent iput() would then call generic_forget_inode()

Another possibility:

generic_forget_inode
  inode->i_state |= I_WILL_FREE;
  spin_unlock(&inode_lock);
                                                generic_sync_sb_inodes()
                                                  spin_lock(&inode_lock);
                                                  __iget(inode);
                                                  __writeback_single_inode
                                                    // see non zero i_count
                                                    WARN_ON(inode->i_state & I_WILL_FREE);

I'm wondering why didn't we saw reports on the last WARN_ON()?
Did we missed something?

> on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> inodes in this scan like we do for later in the function for another scan?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-03 14:10                 ` Wu Fengguang
@ 2009-06-03 14:16                   ` Jan Kara
  2009-06-03 14:47                     ` Wu Fengguang
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2009-06-03 14:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin, Jeff Layton

On Wed 03-06-09 22:10:21, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > Wu Fengguang wrote:
> > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > add_dquot_ref().
> > > > > 
> > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > incomplete without the testing of I_CLEAR.
> > > > > 
> > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > > 
> > > > Is there a reason it's not done for __sync_single_inode as well?
> > > 
> > > It missed the glance because it don't have an obvious '|' in the line ;)
> > > 
> > > > Jeff Layton asked the question and I'm following it up :)
> > > > 
> > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > I_SYNC to be cleared before it changes I_STATE.
> > > 
> > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > a small race window?
> > > 
> > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > and it'd be bonus points for consistency?
> > > 
> > > So let's add the I_CLEAR test?
> > > 
> > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > as well?
> > > 
> > > Yes, we can add I_CLEAR here to catch more error condition.
> > > 
> > > Thanks,
> > > Fengguang
> > > 
> > > ---
> > > skip I_CLEAR state inodes in writeback routines
> > > 
> > > The I_FREEING test in __sync_single_inode() is racy because
> > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > and the test of I_FREEING.
> > > 
> > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > > 
> > > Reported-by: Jeff Layton <jlayton@redhat.com>
> > > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  fs/fs-writeback.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- linux.orig/fs/fs-writeback.c
> > > +++ linux/fs/fs-writeback.c
> > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > >  	spin_lock(&inode_lock);
> > >  	WARN_ON(inode->i_state & I_NEW);
> > >  	inode->i_state &= ~I_SYNC;
> > > -	if (!(inode->i_state & I_FREEING)) {
> > > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > >  		if (!(inode->i_state & I_DIRTY) &&
> > >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> >   Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
> > 
> > >  			/*
> > > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> > >  		if (current_is_pdflush() && !writeback_acquire(bdi))
> > >  			break;
> > >  
> > > -		BUG_ON(inode->i_state & I_FREEING);
> > > +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> > >  		__iget(inode);
> > >  		pages_skipped = wbc->pages_skipped;
> > >  		__writeback_single_inode(inode, wbc);
> >   Looking at this code, it looks a bit suspicious. What prevents this s_io
> > list scan to race with inode freeing? In particular generic_forget_inode()
> 
> Good catch.
> 
> > can drop inode_lock to write the inode and in the mean time
> > generic_sync_sb_inodes() can come, get a reference to the inode and start
> > it's writeback... Subsequent iput() would then call generic_forget_inode()
> 
> Another possibility:
> 
> generic_forget_inode
>   inode->i_state |= I_WILL_FREE;
>   spin_unlock(&inode_lock);
>                                                 generic_sync_sb_inodes()
>                                                   spin_lock(&inode_lock);
>                                                   __iget(inode);
>                                                   __writeback_single_inode
>                                                     // see non zero i_count
>                                                     WARN_ON(inode->i_state & I_WILL_FREE);
> 
> I'm wondering why didn't we saw reports on the last WARN_ON()?
> Did we missed something?
  I meant the above race in my description ;-). Anyway, the race can happen
only if we are unmounting the filesystem (normally, we bail out on
sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
a while to understand why we weren't seeing tons of warnings...).

> > on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> > inodes in this scan like we do for later in the function for another scan?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-03 14:16                   ` Jan Kara
@ 2009-06-03 14:47                     ` Wu Fengguang
  2009-06-06  3:07                       ` [PATCH] writeback: skip new or to-be-freed inodes Wu Fengguang
  0 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-06-03 14:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin, Jeff Layton

On Wed, Jun 03, 2009 at 10:16:36PM +0800, Jan Kara wrote:
> On Wed 03-06-09 22:10:21, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > > Wu Fengguang wrote:
> > > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > > add_dquot_ref().
> > > > > > 
> > > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > > incomplete without the testing of I_CLEAR.
> > > > > > 
> > > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > > > 
> > > > > Is there a reason it's not done for __sync_single_inode as well?
> > > > 
> > > > It missed the glance because it don't have an obvious '|' in the line ;)
> > > > 
> > > > > Jeff Layton asked the question and I'm following it up :)
> > > > > 
> > > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > > I_SYNC to be cleared before it changes I_STATE.
> > > > 
> > > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > > a small race window?
> > > > 
> > > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > > and it'd be bonus points for consistency?
> > > > 
> > > > So let's add the I_CLEAR test?
> > > > 
> > > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > > as well?
> > > > 
> > > > Yes, we can add I_CLEAR here to catch more error condition.
> > > > 
> > > > Thanks,
> > > > Fengguang
> > > > 
> > > > ---
> > > > skip I_CLEAR state inodes in writeback routines
> > > > 
> > > > The I_FREEING test in __sync_single_inode() is racy because
> > > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > > and the test of I_FREEING.
> > > > 
> > > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > > > 
> > > > Reported-by: Jeff Layton <jlayton@redhat.com>
> > > > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > > >  fs/fs-writeback.c |    4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > --- linux.orig/fs/fs-writeback.c
> > > > +++ linux/fs/fs-writeback.c
> > > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > > >  	spin_lock(&inode_lock);
> > > >  	WARN_ON(inode->i_state & I_NEW);
> > > >  	inode->i_state &= ~I_SYNC;
> > > > -	if (!(inode->i_state & I_FREEING)) {
> > > > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > > >  		if (!(inode->i_state & I_DIRTY) &&
> > > >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > >   Is the whole if needed? I had an impression that everyone calling
> > > __sync_single_inode() should better take care it does not race with inode
> > > freeing... So WARN_ON would be more appropriate IMHO.
> > > 
> > > >  			/*
> > > > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> > > >  		if (current_is_pdflush() && !writeback_acquire(bdi))
> > > >  			break;
> > > >  
> > > > -		BUG_ON(inode->i_state & I_FREEING);
> > > > +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> > > >  		__iget(inode);
> > > >  		pages_skipped = wbc->pages_skipped;
> > > >  		__writeback_single_inode(inode, wbc);
> > >   Looking at this code, it looks a bit suspicious. What prevents this s_io
> > > list scan to race with inode freeing? In particular generic_forget_inode()
> > 
> > Good catch.
> > 
> > > can drop inode_lock to write the inode and in the mean time
> > > generic_sync_sb_inodes() can come, get a reference to the inode and start
> > > it's writeback... Subsequent iput() would then call generic_forget_inode()
> > 
> > Another possibility:
> > 
> > generic_forget_inode
> >   inode->i_state |= I_WILL_FREE;
> >   spin_unlock(&inode_lock);
> >                                                 generic_sync_sb_inodes()
> >                                                   spin_lock(&inode_lock);
> >                                                   __iget(inode);
> >                                                   __writeback_single_inode
> >                                                     // see non zero i_count
> >                                                     WARN_ON(inode->i_state & I_WILL_FREE);
> > 
> > I'm wondering why didn't we saw reports on the last WARN_ON()?
> > Did we missed something?
>   I meant the above race in my description ;-). Anyway, the race can happen
> only if we are unmounting the filesystem (normally, we bail out on
> sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
> a while to understand why we weren't seeing tons of warnings...).

Ah OK. Just checked that all three callers of generic_sync_sb_inodes():
- writeback_inodes(): umount prevented
- pohmelfs_kill_super(): just before umount
- ubifs calls: too complex to be obvious..
At least the first two cases are safe, so we didn't see the error report ;)

> > > on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> > > inodes in this scan like we do for later in the function for another scan?

Yes we should do this at least for safety.  I_WILL_FREE means
generic_forget_inode() is going to writeback the inode on its own, so
generic_sync_sb_inodes() would better not to wade in.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-03 14:47                     ` Wu Fengguang
@ 2009-06-06  3:07                       ` Wu Fengguang
  2009-06-08  7:03                         ` Artem Bityutskiy
  2009-06-08 17:07                         ` Jan Kara
  0 siblings, 2 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-06-06  3:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin, Jeff Layton

1) I_FREEING tests should be coupled with I_CLEAR

The two I_FREEING tests are racy because clear_inode() can set i_state to
I_CLEAR between the clear of I_SYNC and the test of I_FREEING.

2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
   races with generic_forget_inode()

generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
generic_sync_sb_inodes() shall not try to step in and create possible races:

  generic_forget_inode
    inode->i_state |= I_WILL_FREE;
    spin_unlock(&inode_lock);
                                       generic_sync_sb_inodes()
                                         spin_lock(&inode_lock);
                                         __iget(inode);
                                         __writeback_single_inode
                                           // see non zero i_count
 may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
                                         spin_unlock(&inode_lock);
 may call generic_forget_inode again ==> iput(inode);

The above race and warning didn't turn up because writeback_inodes() holds
the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
early. But we are not sure the UBIFS calls and future callers will guarantee
that. So skip I_WILL_FREE inodes for the sake of safety.

CC: Jan Kara <jack@suse.cz>
CC: Eric Sandeen <sandeen@sandeen.net>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
 	spin_lock(&inode_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
+	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
 		if (!(inode->i_state & I_DIRTY) &&
 		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
@@ -487,7 +487,7 @@ void generic_sync_sb_inodes(struct super
 			break;
 		}
 
-		if (inode->i_state & I_NEW) {
+		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
 			requeue_io(inode);
 			continue;
 		}
@@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
 		if (current_is_pdflush() && !writeback_acquire(bdi))
 			break;
 
-		BUG_ON(inode->i_state & I_FREEING);
+		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-06  3:07                       ` [PATCH] writeback: skip new or to-be-freed inodes Wu Fengguang
@ 2009-06-08  7:03                         ` Artem Bityutskiy
  2009-06-08  9:29                           ` Wu Fengguang
  2009-06-08 17:07                         ` Jan Kara
  1 sibling, 1 reply; 22+ messages in thread
From: Artem Bityutskiy @ 2009-06-08  7:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin, Jeff Layton

Wu Fengguang wrote:
> The above race and warning didn't turn up because writeback_inodes() holds
> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> early. But we are not sure the UBIFS calls and future callers will guarantee
> that. So skip I_WILL_FREE inodes for the sake of safety.

The inode states are a bit vague for me, but vs. UBIFS - feel
free to ask questions.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-08  7:03                         ` Artem Bityutskiy
@ 2009-06-08  9:29                           ` Wu Fengguang
  2009-06-08 10:45                             ` Christoph Hellwig
  2009-06-09  7:03                             ` Artem Bityutskiy
  0 siblings, 2 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-06-08  9:29 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin, Jeff Layton

Hi Artem,

On Mon, Jun 08, 2009 at 03:03:10PM +0800, Artem Bityutskiy wrote:
> Wu Fengguang wrote:
> > The above race and warning didn't turn up because writeback_inodes() holds
> > the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> > early. But we are not sure the UBIFS calls and future callers will guarantee
> > that. So skip I_WILL_FREE inodes for the sake of safety.
>
> The inode states are a bit vague for me, but vs. UBIFS - feel
> free to ask questions.

Thank you. Basically I'm not sure if UBIFS guarantees it won't be
unmounted (hence the MS_ACTIVE bit is on) when calling
generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

Thanks,
Fengguang

PS: our previous discussions

        > > Another possibility:
        > >
        > > generic_forget_inode
        > >   inode->i_state |= I_WILL_FREE;
        > >   spin_unlock(&inode_lock);
        > >                                                 generic_sync_sb_inodes()
        > >                                                   spin_lock(&inode_lock);
        > >                                                   __iget(inode);
        > >                                                   __writeback_single_inode
        > >                                                     // see non zero i_count
        > >                                                     WARN_ON(inode->i_state & I_WILL_FREE);
        > >
        > > I'm wondering why didn't we saw reports on the last WARN_ON()?
        > > Did we missed something?
        >   I meant the above race in my description ;-). Anyway, the race can happen
        > only if we are unmounting the filesystem (normally, we bail out on
        > sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
        > a while to understand why we weren't seeing tons of warnings...).

        Ah OK. Just checked that all three callers of generic_sync_sb_inodes():
        - writeback_inodes(): umount prevented
        - pohmelfs_kill_super(): just before umount
        - ubifs calls: too complex to be obvious..
        At least the first two cases are safe, so we didn't see the error report ;)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-08  9:29                           ` Wu Fengguang
@ 2009-06-08 10:45                             ` Christoph Hellwig
  2009-06-09  7:24                               ` Artem Bityutskiy
  2009-06-09  7:03                             ` Artem Bityutskiy
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2009-06-08 10:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Artem Bityutskiy, Jan Kara, Eric Sandeen, Andrew Morton, LKML,
	Masayoshi MIZUMA, linux-fsdevel@vger.kernel.org,
	viro@zeniv.linux.org.uk, Nick Piggin, Jeff Layton

On Mon, Jun 08, 2009 at 05:29:30PM +0800, Wu Fengguang wrote:
> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
> unmounted (hence the MS_ACTIVE bit is on) when calling
> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

Btw, the call in ubifs_sync_fs should be superflous in the
vfs-2.6#for-next tree.  We now do make sure that all inodes are flushed
before calling ->sync_fs with the wait parameter.

shrink_liability is a more interesting case, I don't understand enough
of ubifs to comment on it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-06  3:07                       ` [PATCH] writeback: skip new or to-be-freed inodes Wu Fengguang
  2009-06-08  7:03                         ` Artem Bityutskiy
@ 2009-06-08 17:07                         ` Jan Kara
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Kara @ 2009-06-08 17:07 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin, Jeff Layton

On Sat 06-06-09 11:07:25, Wu Fengguang wrote:
> 1) I_FREEING tests should be coupled with I_CLEAR
> 
> The two I_FREEING tests are racy because clear_inode() can set i_state to
> I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
> 
> 2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
>    races with generic_forget_inode()
> 
> generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
> generic_sync_sb_inodes() shall not try to step in and create possible races:
> 
>   generic_forget_inode
>     inode->i_state |= I_WILL_FREE;
>     spin_unlock(&inode_lock);
>                                        generic_sync_sb_inodes()
>                                          spin_lock(&inode_lock);
>                                          __iget(inode);
>                                          __writeback_single_inode
>                                            // see non zero i_count
>  may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
>                                          spin_unlock(&inode_lock);
>  may call generic_forget_inode again ==> iput(inode);
> 
> The above race and warning didn't turn up because writeback_inodes() holds
> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> early. But we are not sure the UBIFS calls and future callers will guarantee
> that. So skip I_WILL_FREE inodes for the sake of safety.
  OK, the patch looks fine. Acked-by: Jan Kara <jack@suse.cz>

									Honza 
> CC: Jan Kara <jack@suse.cz>
> CC: Eric Sandeen <sandeen@sandeen.net>
> Acked-by: Jeff Layton <jlayton@redhat.com>
> CC: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>  	spin_lock(&inode_lock);
>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state &= ~I_SYNC;
> -	if (!(inode->i_state & I_FREEING)) {
> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>  		if (!(inode->i_state & I_DIRTY) &&
>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> @@ -487,7 +487,7 @@ void generic_sync_sb_inodes(struct super
>  			break;
>  		}
>  
> -		if (inode->i_state & I_NEW) {
> +		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
>  			requeue_io(inode);
>  			continue;
>  		}
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
>  		if (current_is_pdflush() && !writeback_acquire(bdi))
>  			break;
>  
> -		BUG_ON(inode->i_state & I_FREEING);
> +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
>  		__iget(inode);
>  		pages_skipped = wbc->pages_skipped;
>  		__writeback_single_inode(inode, wbc);
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-08  9:29                           ` Wu Fengguang
  2009-06-08 10:45                             ` Christoph Hellwig
@ 2009-06-09  7:03                             ` Artem Bityutskiy
  1 sibling, 0 replies; 22+ messages in thread
From: Artem Bityutskiy @ 2009-06-09  7:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	Nick Piggin, Jeff Layton

Wu Fengguang wrote:
> Hi Artem,
> 
> On Mon, Jun 08, 2009 at 03:03:10PM +0800, Artem Bityutskiy wrote:
>> Wu Fengguang wrote:
>>> The above race and warning didn't turn up because writeback_inodes() holds
>>> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
>>> early. But we are not sure the UBIFS calls and future callers will guarantee
>>> that. So skip I_WILL_FREE inodes for the sake of safety.
>> The inode states are a bit vague for me, but vs. UBIFS - feel
>> free to ask questions.
> 
> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
> unmounted (hence the MS_ACTIVE bit is on) when calling
> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

To be frank my VFS knowledge is not good enough to give you a
good answer. MS_ACTIVE seems to be set by file-systems when
they are mounted, and cleaned by VFS when unmounting.
I guess MS_ACTIVE is used by FS-es to check whether unmounting
is in progress or not. Anyway, UBIFS does not use it.

The generic_sync_sb_inodes() is called only from within
VFS operations, e.g., from ->create, ->rename, ->mknod,
->write_begin, ->setattr, etc. I mean, it is called from
UBIFS implementations of the above calls. UBIFS never calls
generic_sync_sb_inodes() function by itself, e.g., from
the UBIFS background thread.

Also, all calls to generic_sync_sb_inodes() are from within
VFS operations which need writing, e.g., VFS called
'mnt_want_write()' for all of them.

I think VFS must protect the FS from being unmunted while
it is in the middle of an operation, right? I'm just not
sure how this mechanism works. This would mean that yes,
we cannot be unmounted while we are in
generic_sync_sb_inodes() called by UBIFS.

Did this help :-) ?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-08 10:45                             ` Christoph Hellwig
@ 2009-06-09  7:24                               ` Artem Bityutskiy
  0 siblings, 0 replies; 22+ messages in thread
From: Artem Bityutskiy @ 2009-06-09  7:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, Jan Kara, Eric Sandeen, Andrew Morton, LKML,
	Masayoshi MIZUMA, linux-fsdevel@vger.kernel.org,
	viro@zeniv.linux.org.uk, Nick Piggin, Jeff Layton,
	Hunter Adrian (Nokia-D/Helsinki)

[CCed Adiran Hunter]

Christoph Hellwig wrote:
> On Mon, Jun 08, 2009 at 05:29:30PM +0800, Wu Fengguang wrote:
>> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
>> unmounted (hence the MS_ACTIVE bit is on) when calling
>> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().
> 
> Btw, the call in ubifs_sync_fs should be superflous in the
> vfs-2.6#for-next tree.  We now do make sure that all inodes are flushed
> before calling ->sync_fs with the wait parameter.

OK, thanks for letting know. Once this is merged upstream,
I'll amend UBIFS.

> shrink_liability is a more interesting case, I don't understand enough
> of ubifs to comment on it.

Well... I'm not sure if I can tell why we need this in few
words. But I'll try.

UBIFS supports on-the-flight compression. This, and other factors
lead to a situation when UBIFS does not know how much the dirty
data in the page/inode caches will take on the flash. Indeed,
how do we know how well the data will be compressed?

UBIFS has so called "budgeting" subsystem. This subsystem is
responsible for accounting flash space. If user writes a new
data page, which goes to the page cache and sits there, the
budgeting sub-system decrements the free space counters by
4KiB. And so on.

At some point the free space counters in the budgeting subsystem
reache zero, which means we do not have more space. However,
in 99% of the cases this is not true, because the budgeting
subsystem's calculations are _very_ pessimistic, they always
assume the worst case scenario like the data are uncompressible.

So consider a situation when user writes a new data page. First
of all, the ->write_begin function will call a budgeting
sub-system function in order to reserve flash space for this
new data page.

The budgeting subsystem will see that space counters are zero.
And what it will do it will call the 'shrink_liability()'
function, which, among other things, may call the
'generic_sync_sb_inodes()' function, which will force write-back,
and this will give us some space. Indeed, when we actually
write the data back, we'll see how much flash space they
really take. And in 99% of cases they will take less than
we budgeted for, usually much less.

This is the rough idea. In practice things are more complex,
and there are factors like inability to know how much of dirty
space may be reclaimed, what will be the index size after
commit, etc. This all makes the budgeting subsystem complex
and difficult to understand. Moreover, we still consider it
as a work in progress, because we use too rough calculations,
and there are too many heuristics.

Here you may read some more information about UBIFS flash
accounting issues:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_spaceacc

You may also find a lot of info here:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_documentation

Especially in this document:
http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf
but it is not easy reading. You may search for "budget"
in the doc.

HTH.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2009-06-09  7:25 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20090318170237.8F6C.61FB500B@jp.fujitsu.com>
     [not found] ` <20090323103846.GA16577@localhost>
     [not found]   ` <20090324155655.2684.61FB500B@jp.fujitsu.com>
     [not found]     ` <20090324074457.GA7745@localhost>
     [not found]       ` <20090324120502.GC23439@duck.suse.cz>
2009-03-24 12:40         ` [PATCH] skip I_CLEAR state inodes Wu Fengguang
2009-03-30  7:18           ` [PATCH][RESEND for 2.6.29-rc8-mm1] " Wu Fengguang
2009-03-31 23:43             ` Andrew Morton
2009-04-01  0:53               ` Wu Fengguang
2009-06-01 21:38           ` [PATCH] " Eric Sandeen
2009-06-02  8:55             ` Wu Fengguang
2009-06-02 10:27               ` Jeff Layton
2009-06-02 11:37               ` Jan Kara
2009-06-02 21:48                 ` Eric Sandeen
2009-06-03 10:45                   ` Jeff Layton
2009-06-03 13:32                 ` Wu Fengguang
2009-06-03 14:00                   ` Jan Kara
2009-06-03 14:10                 ` Wu Fengguang
2009-06-03 14:16                   ` Jan Kara
2009-06-03 14:47                     ` Wu Fengguang
2009-06-06  3:07                       ` [PATCH] writeback: skip new or to-be-freed inodes Wu Fengguang
2009-06-08  7:03                         ` Artem Bityutskiy
2009-06-08  9:29                           ` Wu Fengguang
2009-06-08 10:45                             ` Christoph Hellwig
2009-06-09  7:24                               ` Artem Bityutskiy
2009-06-09  7:03                             ` Artem Bityutskiy
2009-06-08 17:07                         ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox