linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Fengguang Wu <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
Date: Thu, 22 Mar 2012 11:03:48 +0800	[thread overview]
Message-ID: <20120322030348.GC20507@localhost> (raw)
In-Reply-To: <20120321102550.GB22938@quack.suse.cz>

On Wed, Mar 21, 2012 at 11:25:50AM +0100, Jan Kara wrote:
> On Wed 21-03-12 10:50:05, Dave Chinner wrote:
> > On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote:
> > > Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
> > > because iput() can do a lot of work - for example truncate the inode if it's
> > > the last iput on unlinked file. Some filesystems depend on flusher thread
> > > progressing (e.g. because they need to flush delay allocated blocks to reduce
> > > allocation uncertainty) and so flusher thread doing truncate creates
> > > interesting dependencies and possibilities for deadlocks.
> > > 
> > > We get rid of iput() in flusher thread by using the fact that I_SYNC inode
> > > flag effectively pins the inode in memory. So if we take care to either hold
> > > i_lock or have I_SYNC set, we can get away without taking inode reference
> > > in writeback_sb_inodes().
> > > 
> > > To make things work we have to move waiting for I_SYNC from end_writeback() to
> > > evict() just before calling of ->evict_inode. This is because several
> > > filesystems call end_writeback() after they have deleted the inode (btrfs,
> > > gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
> > > waiting for I_SYNC because they call end_writeback() from within a transaction.
> > > Both were not really a problem previously because flusher thread and
> > > ->evict_inode() could not run in parallel but now these two could race.
> > > So moving of I_SYNC wait prevents possible races..
> > > 
> > > As a side effect of these changes, we also fix possible use-after-free in
> > > wb_writeback() because inode_wait_for_writeback() call could try to reacquire
> > > i_lock on the inode that was already free.
> > .....
> > 
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index d3ebdbe..3869714 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
> > >  	BUG_ON(!list_empty(&inode->i_data.private_list));
> > >  	BUG_ON(!(inode->i_state & I_FREEING));
> > >  	BUG_ON(inode->i_state & I_CLEAR);
> > > -	inode_sync_wait(inode);
> > >  	/* don't need i_lock here, no concurrent mods to i_state */
> > >  	inode->i_state = I_FREEING | I_CLEAR;
> > >  }
> > > @@ -541,6 +540,18 @@ static void evict(struct inode *inode)
> > >  
> > >  	inode_sb_list_del(inode);
> > >  
> > > +	/*
> > > +	 * Wait for flusher thread to be done with the inode so that filesystem
> > > +	 * does not start destroying it while writeback is still running. Since
> > > +	 * the inode has I_FREEING set, flusher thread won't start new work on
> > > +	 * the inode.  We just have to wait for running writeback to finish. We
> > > +	 * must use i_lock here because flusher thread might be working with
> > > +	 * the inode without I_SYNC set but under i_lock.
> > > +	 */
> > > +	spin_lock(&inode->i_lock);
> > > +	inode_wait_for_writeback(inode);
> > > +	spin_unlock(&inode->i_lock);
> > > +
> > 
> > Why move this wait from end_writeback() to here?  The whole point
> > of end_writeback() is to provide a barrier that guarantees that
> > there is no async writeback running when it returns, so it seems
> > strange to move the barrier out of the function that is supposed to
> > provide the barrier....
>   I agree that end_writeback() will be misnamed after this change. The
> thing is (as I tried to explain in the changelog) that a lot of filesystems
> get it wrong and call end_writeback() from places where
>   a) it is too late and writeback could be scribbling over a freed inode
>   b) they cannot really handle waiting for writeback to finish
> And nobody really noticed because writeback couldn't be racing with
> ->evict_inode().
> 
> It is not easy to fix this e.g. in GFS2 because end_writeback() does
> actually two things: It checks that inode is properly teared down (has no
> pages etc.) and waits for writeback. And while waiting for writeback should
> happen outside of a running transaction, checking of inode has to happen
> between truncate_inode_pages() and deleting inode which, in case of GFS2,
> has to be inside a transaction.
> 
> So what I decided to do is to split off waiting for writeback from checking
> the inode and move waiting for writeback before ->evict_inode() is called.
> That makes life easier for the filesystems and AFAICS it is a safe thing to
> do.
> 
> Maybe we could rename end_writeback() to something like clear_inode() to
> reflect this change?

end_writeback() does not imply "sleeping", but you'll at least remove
that explicit might_sleep() line inside it ;)

Thanks,
Fengguang

  reply	other threads:[~2012-03-22  3:03 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
2012-03-20 22:56 ` [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete() Jan Kara
2012-04-30 14:36   ` Christoph Hellwig
2012-04-30 21:30     ` Jan Kara
2012-03-20 22:56 ` [PATCH 2/7] writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() Jan Kara
2012-04-30 14:38   ` Christoph Hellwig
2012-03-20 22:56 ` [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling Jan Kara
2012-03-22  2:41   ` Fengguang Wu
2012-03-22  8:35     ` Jan Kara
2012-03-28  3:11       ` Fengguang Wu
2012-03-28 15:12         ` Christoph Hellwig
2012-04-30 14:39           ` Christoph Hellwig
2012-04-30 14:41   ` Christoph Hellwig
2012-04-30 21:21     ` Jan Kara
2012-03-20 22:56 ` [PATCH 4/7] writeback: Separate inode requeueing after writeback Jan Kara
2012-04-30 14:43   ` Christoph Hellwig
2012-04-30 21:42     ` Jan Kara
2012-03-20 22:56 ` [PATCH 5/7] writeback: Remove wb->list_lock from writeback_single_inode() Jan Kara
2012-04-30 14:44   ` Christoph Hellwig
2012-03-20 22:56 ` [PATCH 6/7] writeback: Refactor writeback_single_inode() Jan Kara
2012-03-20 23:35   ` Dave Chinner
2012-03-21 10:03     ` Jan Kara
2012-04-30 14:46   ` Christoph Hellwig
2012-03-20 22:56 ` [PATCH 7/7] writeback: Avoid iput() from flusher thread Jan Kara
2012-03-20 23:50   ` Dave Chinner
2012-03-21 10:25     ` Jan Kara
2012-03-22  3:03       ` Fengguang Wu [this message]
2012-03-22  6:27       ` Dave Chinner
2012-03-22  9:50         ` Jan Kara
2012-03-22  3:01   ` Fengguang Wu
2012-04-30 15:30   ` Christoph Hellwig
2012-04-30 21:58     ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120322030348.GC20507@localhost \
    --to=fengguang.wu@intel.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).