[PATCH] VFS: Fix race with new inode creation

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] VFS: Fix race with new inode creation
@ 2009-04-10 15:31 Curt Wohlgemuth
  2009-04-10 16:01 ` Al Viro
  0 siblings, 1 reply; 5+ messages in thread
From: Curt Wohlgemuth @ 2009-04-10 15:31 UTC (permalink / raw)
  To: linux-fsdevel

This patch fixes a race between a task creating a new inode, and one writing
that same new, dirty inode out to disk.

We found this using a particular workload (fsstress) along with other
ancillary processes running on the same machine.  The symptom is one or more
hung unkillable (uniterruptible sleep) tasks that try to operate on this new
inode.

The original comment block is wrong.  Since the inode gets marked dirty
after it's created, but before its I_LOCK bit is cleared, there _can_ be
somebody else doing something with this inode -- e.g., a writeback task
(in our case, __sync_single_inode()).

	Signed-off-by: Curt Wohlgemuth  <curtw@google.com>
---

--- linux-2.6.29.1/fs/inode.c.orig	2009-04-02 13:55:27.000000000 -0700
+++ linux-2.6.29.1/fs/inode.c	2009-04-09 11:16:03.000000000 -0700
@@ -651,15 +651,14 @@ void unlock_new_inode(struct inode *inod
 	}
 #endif
 	/*
-	 * This is special!  We do not need the spinlock
-	 * when clearing I_LOCK, because we're guaranteed
-	 * that nobody else tries to do anything about the
-	 * state of the inode when it is locked, as we
-	 * just created it (so there can be no old holders
-	 * that haven't tested I_LOCK).
+	 * We need to hold the inode lock while clearing these bits, because
+	 * there is a race between this code and the writeback path, since all
+	 * of the new inode allocation routines will mark the inode as dirty.
 	 */
 	WARN_ON((inode->i_state & (I_LOCK|I_NEW)) != (I_LOCK|I_NEW));
+	spin_lock(&inode_lock);
 	inode->i_state &= ~(I_LOCK|I_NEW);
+	spin_unlock(&inode_lock);
 	wake_up_inode(inode);
 }

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] VFS: Fix race with new inode creation
  2009-04-10 15:31 [PATCH] VFS: Fix race with new inode creation Curt Wohlgemuth
@ 2009-04-10 16:01 ` Al Viro
  2009-04-10 16:08   ` Curt Wohlgemuth
  2009-04-14 16:57   ` Andrew Morton
  0 siblings, 2 replies; 5+ messages in thread
From: Al Viro @ 2009-04-10 16:01 UTC (permalink / raw)
  To: Curt Wohlgemuth; +Cc: linux-fsdevel

On Fri, Apr 10, 2009 at 08:31:40AM -0700, Curt Wohlgemuth wrote:
> This patch fixes a race between a task creating a new inode, and one writing
> that same new, dirty inode out to disk.
> 
> We found this using a particular workload (fsstress) along with other
> ancillary processes running on the same machine.  The symptom is one or more
> hung unkillable (uniterruptible sleep) tasks that try to operate on this new
> inode.
> 
> The original comment block is wrong.  Since the inode gets marked dirty
> after it's created, but before its I_LOCK bit is cleared, there _can_ be
> somebody else doing something with this inode -- e.g., a writeback task
> (in our case, __sync_single_inode()).

Um...  I'd say that the real bug in there is that we shouldn't *get* to
__sync_single_inode() until I_NEW/I_LOCK are removed.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] VFS: Fix race with new inode creation
  2009-04-10 16:01 ` Al Viro
@ 2009-04-10 16:08   ` Curt Wohlgemuth
  2009-04-14 16:57   ` Andrew Morton
  1 sibling, 0 replies; 5+ messages in thread
From: Curt Wohlgemuth @ 2009-04-10 16:08 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel

On Fri, Apr 10, 2009 at 9:01 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Apr 10, 2009 at 08:31:40AM -0700, Curt Wohlgemuth wrote:
>> This patch fixes a race between a task creating a new inode, and one writing
>> that same new, dirty inode out to disk.
>>
>> We found this using a particular workload (fsstress) along with other
>> ancillary processes running on the same machine.  The symptom is one or more
>> hung unkillable (uniterruptible sleep) tasks that try to operate on this new
>> inode.
>>
>> The original comment block is wrong.  Since the inode gets marked dirty
>> after it's created, but before its I_LOCK bit is cleared, there _can_ be
>> somebody else doing something with this inode -- e.g., a writeback task
>> (in our case, __sync_single_inode()).
>
> Um...  I'd say that the real bug in there is that we shouldn't *get* to
> __sync_single_inode() until I_NEW/I_LOCK are removed.

Well, I thought about that too.  But I haven't seen an issue with this
happening, and
the patch I mailed has the benefit of extreme simplicity.  Plus I
couldn't be sure that
there weren't other spots that assume taking the inode lock meant that
they could
manipulate the i_state field with confidence (though I didn't find any).

Curt
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] VFS: Fix race with new inode creation
  2009-04-10 16:01 ` Al Viro
  2009-04-10 16:08   ` Curt Wohlgemuth
@ 2009-04-14 16:57   ` Andrew Morton
  2009-04-14 17:30     ` Nick Piggin
  1 sibling, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2009-04-14 16:57 UTC (permalink / raw)
  To: Al Viro; +Cc: Curt Wohlgemuth, linux-fsdevel, Nick Piggin

On Fri, 10 Apr 2009 17:01:39 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Fri, Apr 10, 2009 at 08:31:40AM -0700, Curt Wohlgemuth wrote:
> > This patch fixes a race between a task creating a new inode, and one writing
> > that same new, dirty inode out to disk.
> > 
> > We found this using a particular workload (fsstress) along with other
> > ancillary processes running on the same machine.  The symptom is one or more
> > hung unkillable (uniterruptible sleep) tasks that try to operate on this new
> > inode.
> > 
> > The original comment block is wrong.  Since the inode gets marked dirty
> > after it's created, but before its I_LOCK bit is cleared, there _can_ be
> > somebody else doing something with this inode -- e.g., a writeback task
> > (in our case, __sync_single_inode()).
> 
> Um...  I'd say that the real bug in there is that we shouldn't *get* to
> __sync_single_inode() until I_NEW/I_LOCK are removed.

I suspect Nick recently fixed this?


commit aabb8fdb41128705fd1627f56fdd571e45fdbcdb
Author: Nick Piggin <npiggin@suse.de>
Date:   Wed Mar 11 13:17:36 2009 -0700

    fs: avoid I_NEW inodes
    
    To be on the safe side, it should be less fragile to exclude I_NEW inodes
    from inode list scans by default (unless there is an important reason to
    have them).
    
    Normally they will get excluded (eg.  by zero refcount or writecount etc),
    however it is a bit fragile for list walkers to know exactly what parts of
    the inode state is set up and valid to test when in I_NEW.  So along these
    lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc
    checks with them too -- this shouldn't be a problem should it?)
    
    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Acked-by: Jan Kara <jack@suse.cz>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/dquot.c b/fs/dquot.c
index bca3cac..cb1c3bc 100644
--- a/fs/dquot.c
+++ b/fs/dquot.c
@@ -789,12 +789,12 @@ static void add_dquot_ref(struct super_block *sb, int type)
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+			continue;
 		if (!atomic_read(&inode->i_writecount))
 			continue;
 		if (!dqinit_needed(inode, type))
 			continue;
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
-			continue;
 
 		__iget(inode);
 		spin_unlock(&inode_lock);
@@ -870,6 +870,12 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		/*
+		 *  We have to scan also I_NEW inodes because they can already
+		 *  have quota pointer initialized. Luckily, we need to touch
+		 *  only quota pointers and these have separate locking
+		 *  (dqptr_sem).
+		 */
 		if (!IS_NOQUOTA(inode))
 			remove_inode_dquot_ref(inode, type, tofree_head);
 	}
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 3e5637f..44d725f 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct super_block *sb)
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
diff --git a/fs/inode.c b/fs/inode.c
index 826fb0b..06aa5a1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -356,6 +356,8 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		if (tmp == head)
 			break;
 		inode = list_entry(tmp, struct inode, i_sb_list);
+		if (inode->i_state & I_NEW)
+			continue;
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
 			list_move(&inode->i_list, dispose);
diff --git a/fs/notify/inotify/inotify.c b/fs/notify/inotify/inotify.c
index 331f2e8..220c13f 100644
--- a/fs/notify/inotify/inotify.c
+++ b/fs/notify/inotify/inotify.c
@@ -380,6 +380,14 @@ void inotify_unmount_inodes(struct list_head *list)
 		struct list_head *watches;
 
 		/*
+		 * We cannot __iget() an inode in state I_CLEAR, I_FREEING,
+		 * I_WILL_FREE, or I_NEW which is fine because by that point
+		 * the inode cannot have any associated watches.
+		 */
+		if (inode->i_state & (I_CLEAR|I_FREEING|I_WILL_FREE|I_NEW))
+			continue;
+
+		/*
 		 * If i_count is zero, the inode cannot have any watches and
 		 * doing an __iget/iput with MS_ACTIVE clear would actually
 		 * evict all inodes with zero i_count from icache which is
@@ -388,14 +396,6 @@ void inotify_unmount_inodes(struct list_head *list)
 		if (!atomic_read(&inode->i_count))
 			continue;
 
-		/*
-		 * We cannot __iget() an inode in state I_CLEAR, I_FREEING, or
-		 * I_WILL_FREE which is fine because by that point the inode
-		 * cannot have any associated watches.
-		 */
-		if (inode->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE))
-			continue;
-
 		need_iput_tmp = need_iput;
 		need_iput = NULL;
 		/* In case inotify_remove_watch_locked() drops a reference. */


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] VFS: Fix race with new inode creation
  2009-04-14 16:57   ` Andrew Morton
@ 2009-04-14 17:30     ` Nick Piggin
  0 siblings, 0 replies; 5+ messages in thread
From: Nick Piggin @ 2009-04-14 17:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Al Viro, Curt Wohlgemuth, linux-fsdevel

On Wednesday 15 April 2009 02:57:29 Andrew Morton wrote:
> On Fri, 10 Apr 2009 17:01:39 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > On Fri, Apr 10, 2009 at 08:31:40AM -0700, Curt Wohlgemuth wrote:
> > > This patch fixes a race between a task creating a new inode, and one writing
> > > that same new, dirty inode out to disk.
> > > 
> > > We found this using a particular workload (fsstress) along with other
> > > ancillary processes running on the same machine.  The symptom is one or more
> > > hung unkillable (uniterruptible sleep) tasks that try to operate on this new
> > > inode.
> > > 
> > > The original comment block is wrong.  Since the inode gets marked dirty
> > > after it's created, but before its I_LOCK bit is cleared, there _can_ be
> > > somebody else doing something with this inode -- e.g., a writeback task
> > > (in our case, __sync_single_inode()).
> > 
> > Um...  I'd say that the real bug in there is that we shouldn't *get* to
> > __sync_single_inode() until I_NEW/I_LOCK are removed.
> 
> I suspect Nick recently fixed this?
> 
> 
> commit aabb8fdb41128705fd1627f56fdd571e45fdbcdb
> Author: Nick Piggin <npiggin@suse.de>
> Date:   Wed Mar 11 13:17:36 2009 -0700
> 
>     fs: avoid I_NEW inodes

You probably meant this one: 7ef0d7377cb287e08f3ae94cebc919448e1f5dff ?
Yes, I think that should fix it.

The "fs: avoid I_NEW inodes" patch I don't think we actually found a bug
which it fixes, but Jan and I both thought it is less fragile to just
avoid I_NEW inodes unless explicitly required for some reason.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2009-04-14 17:31 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-04-10 15:31 [PATCH] VFS: Fix race with new inode creation Curt Wohlgemuth
2009-04-10 16:01 ` Al Viro
2009-04-10 16:08   ` Curt Wohlgemuth
2009-04-14 16:57   ` Andrew Morton
2009-04-14 17:30     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).