linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org, Jan Kara <jack@suse.cz>
Subject: [PATCH 7/7] writeback: Avoid iput() from flusher thread
Date: Tue, 20 Mar 2012 23:56:31 +0100	[thread overview]
Message-ID: <1332284191-21076-8-git-send-email-jack@suse.cz> (raw)
In-Reply-To: <1332284191-21076-1-git-send-email-jack@suse.cz>

Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
because iput() can do a lot of work - for example truncate the inode if it's
the last iput on unlinked file. Some filesystems depend on flusher thread
progressing (e.g. because they need to flush delay allocated blocks to reduce
allocation uncertainty) and so flusher thread doing truncate creates
interesting dependencies and possibilities for deadlocks.

We get rid of iput() in flusher thread by using the fact that I_SYNC inode
flag effectively pins the inode in memory. So if we take care to either hold
i_lock or have I_SYNC set, we can get away without taking inode reference
in writeback_sb_inodes().

To make things work we have to move waiting for I_SYNC from end_writeback() to
evict() just before calling of ->evict_inode. This is because several
filesystems call end_writeback() after they have deleted the inode (btrfs,
gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
waiting for I_SYNC because they call end_writeback() from within a transaction.
Both were not really a problem previously because flusher thread and
->evict_inode() could not run in parallel but now these two could race.
So moving of I_SYNC wait prevents possible races..

As a side effect of these changes, we also fix possible use-after-free in
wb_writeback() because inode_wait_for_writeback() call could try to reacquire
i_lock on the inode that was already free.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c         |   52 ++++++++++++++++++++++++++++++++++----------
 fs/inode.c                |   13 ++++++++++-
 include/linux/fs.h        |    7 +++--
 include/linux/writeback.h |    7 +-----
 4 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6f5f930..e5ca4b3 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -325,9 +325,10 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
 }
 
 /*
- * Wait for writeback on an inode to complete.
+ * Wait for writeback on an inode to complete. Called with i_lock held.
+ * Caller must make sure inode cannot go away when we drop i_lock.
  */
-static void inode_wait_for_writeback(struct inode *inode)
+void inode_wait_for_writeback(struct inode *inode)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -340,6 +341,26 @@ static void inode_wait_for_writeback(struct inode *inode)
 	}
 }
 
+/*
+ * Sleep until I_SYNC is cleared. This function must be called with i_lock
+ * held and drops it. It is aimed for callers not holding any inode reference
+ * so once i_lock is dropped, inode can go away.
+ */
+static void inode_sleep_on_writeback(struct inode *inode)
+	__releases(inode->i_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
+	int sleep;
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	sleep = inode->i_state & I_SYNC;
+	spin_unlock(&inode->i_lock);
+	if (sleep)
+		schedule();
+	finish_wait(wqh, &wait);
+}
+
 static void inode_wb_requeue(struct inode *inode, struct bdi_writeback *wb,
 			     struct writeback_control *wbc)
 {
@@ -470,7 +491,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 		if (wbc->sync_mode != WB_SYNC_ALL)
 			goto out;
 		/*
-		 * It's a data-integrity sync.  We must wait.
+		 * It's a data-integrity sync. We must wait. Since callers hold
+		 * inode reference or inode has I_WILL_FREE set, it cannot go
+		 * away under us.
 		 */
 		inode_wait_for_writeback(inode);
 	}
@@ -611,15 +634,23 @@ static long writeback_sb_inodes(struct super_block *sb,
 		}
 		spin_unlock(&wb->list_lock);
 
-		__iget(inode);
-		if (inode->i_state & I_SYNC)
-			inode_wait_for_writeback(inode);
+		if (inode->i_state & I_SYNC) {
+			/* Wait for I_SYNC. This function drops i_lock... */
+			inode_sleep_on_writeback(inode);
+			/* Inode may be gone, start again */
+			continue;
+		}
 		inode->i_state |= I_SYNC;
 		spin_unlock(&inode->i_lock);
+
 		write_chunk = writeback_chunk_size(wb->bdi, work);
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
+		/*
+		 * We use I_SYNC to pin the inode in memory. While it is set
+		 * end_writeback() will wait so the inode cannot be freed.
+		 */
 		__writeback_single_inode(inode, wb, &wbc);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
@@ -631,10 +662,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		inode_wb_requeue(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&wb->list_lock);
-		iput(inode);
-		cond_resched();
-		spin_lock(&wb->list_lock);
+		cond_resched_lock(&wb->list_lock);
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.
@@ -829,8 +857,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 			inode = wb_inode(wb->b_more_io.prev);
 			spin_lock(&inode->i_lock);
 			spin_unlock(&wb->list_lock);
-			inode_wait_for_writeback(inode);
-			spin_unlock(&inode->i_lock);
+			/* This function drops i_lock... */
+			inode_sleep_on_writeback(inode);
 			spin_lock(&wb->list_lock);
 		}
 	}
diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..3869714 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
-	inode_sync_wait(inode);
 	/* don't need i_lock here, no concurrent mods to i_state */
 	inode->i_state = I_FREEING | I_CLEAR;
 }
@@ -541,6 +540,18 @@ static void evict(struct inode *inode)
 
 	inode_sb_list_del(inode);
 
+	/*
+	 * Wait for flusher thread to be done with the inode so that filesystem
+	 * does not start destroying it while writeback is still running. Since
+	 * the inode has I_FREEING set, flusher thread won't start new work on
+	 * the inode.  We just have to wait for running writeback to finish. We
+	 * must use i_lock here because flusher thread might be working with
+	 * the inode without I_SYNC set but under i_lock.
+	 */
+	spin_lock(&inode->i_lock);
+	inode_wait_for_writeback(inode);
+	spin_unlock(&inode->i_lock);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 69cd5bb..ddb2f55 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1742,9 +1742,10 @@ struct super_operations {
  *			anew.  Other functions will just ignore such inodes,
  *			if appropriate.  I_NEW is used for waiting.
  *
- * I_SYNC		Synchonized write of dirty inode data.  The bits is
- *			set during data writeback, and cleared with a wakeup
- *			on the bit address once it is done.
+ * I_SYNC		Writeback of inode is running. The bit is set during
+ *			data writeback, and cleared with a wakeup on the bit
+ *			address once it is done. The bit is also used to pin
+ *			the inode in memory for flusher thread.
  *
  * I_REFERENCED		Marks the inode as recently references on the LRU list.
  *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 995b8bf..b785b74 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -94,6 +94,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 				enum wb_reason reason);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+void inode_wait_for_writeback(struct inode *inode);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
@@ -101,12 +102,6 @@ static inline void wait_on_inode(struct inode *inode)
 	might_sleep();
 	wait_on_bit(&inode->i_state, __I_NEW, inode_wait, TASK_UNINTERRUPTIBLE);
 }
-static inline void inode_sync_wait(struct inode *inode)
-{
-	might_sleep();
-	wait_on_bit(&inode->i_state, __I_SYNC, inode_wait,
-							TASK_UNINTERRUPTIBLE);
-}
 
 
 /*
-- 
1.7.1


  parent reply	other threads:[~2012-03-20 22:56 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
2012-03-20 22:56 ` [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete() Jan Kara
2012-04-30 14:36   ` Christoph Hellwig
2012-04-30 21:30     ` Jan Kara
2012-03-20 22:56 ` [PATCH 2/7] writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() Jan Kara
2012-04-30 14:38   ` Christoph Hellwig
2012-03-20 22:56 ` [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling Jan Kara
2012-03-22  2:41   ` Fengguang Wu
2012-03-22  8:35     ` Jan Kara
2012-03-28  3:11       ` Fengguang Wu
2012-03-28 15:12         ` Christoph Hellwig
2012-04-30 14:39           ` Christoph Hellwig
2012-04-30 14:41   ` Christoph Hellwig
2012-04-30 21:21     ` Jan Kara
2012-03-20 22:56 ` [PATCH 4/7] writeback: Separate inode requeueing after writeback Jan Kara
2012-04-30 14:43   ` Christoph Hellwig
2012-04-30 21:42     ` Jan Kara
2012-03-20 22:56 ` [PATCH 5/7] writeback: Remove wb->list_lock from writeback_single_inode() Jan Kara
2012-04-30 14:44   ` Christoph Hellwig
2012-03-20 22:56 ` [PATCH 6/7] writeback: Refactor writeback_single_inode() Jan Kara
2012-03-20 23:35   ` Dave Chinner
2012-03-21 10:03     ` Jan Kara
2012-04-30 14:46   ` Christoph Hellwig
2012-03-20 22:56 ` Jan Kara [this message]
2012-03-20 23:50   ` [PATCH 7/7] writeback: Avoid iput() from flusher thread Dave Chinner
2012-03-21 10:25     ` Jan Kara
2012-03-22  3:03       ` Fengguang Wu
2012-03-22  6:27       ` Dave Chinner
2012-03-22  9:50         ` Jan Kara
2012-03-22  3:01   ` Fengguang Wu
2012-04-30 15:30   ` Christoph Hellwig
2012-04-30 21:58     ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1332284191-21076-8-git-send-email-jack@suse.cz \
    --to=jack@suse.cz \
    --cc=fengguang.wu@intel.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).