From: Jan Kara <jack@suse.cz>
To: NeilBrown <neilb@suse.com>
Cc: Jan Kara <jack@suse.cz>, Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
Christoph Hellwig <hch@infradead.org>, Tejun Heo <tj@kernel.org>,
Dan Williams <dan.j.williams@intel.com>,
Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>,
NeilBrown <neilb@suse.de>
Subject: Re: [PATCH 07/10] writeback: Implement reliable switching to default writeback structure
Date: Fri, 10 Feb 2017 14:20:07 +0100 [thread overview]
Message-ID: <20170210132007.GA5491@quack2.suse.cz> (raw)
In-Reply-To: <87zihuu6rj.fsf@notabene.neil.brown.name>
On Fri 10-02-17 13:19:44, NeilBrown wrote:
> On Thu, Feb 09 2017, Jan Kara wrote:
>
> > Currently switching of inode between different writeback structures is
> > asynchronous and not guaranteed to succeed. Add a variant of switching
> > that is synchronous and reliable so that it can reliably move inode to
> > the default writeback structure (bdi->wb) when writeback on bdi is going
> > to be shutdown.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > fs/fs-writeback.c | 60 ++++++++++++++++++++++++++++++++++++++++-------
> > include/linux/fs.h | 3 ++-
> > include/linux/writeback.h | 6 +++++
> > 3 files changed, 60 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 23dc97cf2a50..52992a1036b1 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -332,14 +332,11 @@ struct inode_switch_wbs_context {
> > struct work_struct work;
> > };
> >
> > -static void inode_switch_wbs_work_fn(struct work_struct *work)
> > +static void do_inode_switch_wbs(struct inode *inode,
> > + struct bdi_writeback *new_wb)
> > {
> > - struct inode_switch_wbs_context *isw =
> > - container_of(work, struct inode_switch_wbs_context, work);
> > - struct inode *inode = isw->inode;
> > struct address_space *mapping = inode->i_mapping;
> > struct bdi_writeback *old_wb = inode->i_wb;
> > - struct bdi_writeback *new_wb = isw->new_wb;
> > struct radix_tree_iter iter;
> > bool switched = false;
> > void **slot;
> > @@ -436,15 +433,29 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
> > spin_unlock(&new_wb->list_lock);
> > spin_unlock(&old_wb->list_lock);
> >
> > + /*
> > + * Make sure waitqueue_active() check in wake_up_bit() cannot happen
> > + * before I_WB_SWITCH is cleared. Pairs with the barrier in
> > + * set_task_state() after wait_on_bit() added waiter to the wait queue.
>
> I think you mean "set_current_state()" ??
Yes, I'll fix that.
> It's rather a trap for the unwary, this need for a smp_mb().
> Greping for wake_up_bit(), I find quite a few places with barriers -
> sometimes clear_bit_unlock() or spin_unlock() - but
>
> fs/block_dev.c- whole->bd_claiming = NULL;
> fs/block_dev.c: wake_up_bit(&whole->bd_claiming, 0);
>
> fs/cifs/connect.c- clear_bit(TCON_LINK_PENDING, &tlink->tl_flags);
> fs/cifs/connect.c: wake_up_bit(&tlink->tl_flags, TCON_LINK_PENDING);
>
> fs/cifs/misc.c- clear_bit(CIFS_INODE_PENDING_WRITERS, &cinode->flags);
> fs/cifs/misc.c: wake_up_bit(&cinode->flags, CIFS_INODE_PENDING_WRITERS);
>
> (several more in cifs)
>
> net/sunrpc/xprt.c- clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
> net/sunrpc/xprt.c- xprt->ops->close(xprt);
> net/sunrpc/xprt.c- xprt_release_write(xprt, NULL);
> net/sunrpc/xprt.c: wake_up_bit(&xprt->state, XPRT_LOCKED);
> (there might be a barrier in ->close or xprt_release_write() I guess)
>
> security/keys/gc.c- clear_bit(KEY_GC_REAPING_KEYTYPE, &key_gc_flags);
> security/keys/gc.c: wake_up_bit(&key_gc_flags, KEY_GC_REAPING_KEYTYPE);
Yup, the above look like bugs.
> I wonder if there is a good way to make this less error-prone.
> I would suggest that wake_up_bit() should always have a barrier, and
> __wake_up_bit() is needed to avoid it, but there is already a
> __wake_up_bit() with a slightly different interface.
Yeah, it is error-prone as all waitqueue_active() optimizations...
> In this case, you have a spin_unlock() just before the wake_up_bit().
> It is my understand that it would provide enough of a barrier (all
> writes before are globally visible after), so do you really need
> the barrier here?
I believe I do. spin_unlock() is a semi-permeable barrier - i.e., any read
or write from "outside" can be moved inside. So CPU is free to prefetch
values for waitqueue active checks before the spinlock is unlocked or even
before clearing I_WB_SWITCH bit.
> > + */
> > + smp_mb();
> > + wake_up_bit(&inode->i_state, __I_WB_SWITCH);
> > +
> > if (switched) {
> > wb_wakeup(new_wb);
> > wb_put(old_wb);
> > }
> > - wb_put(new_wb);
> > +}
> >
> > - iput(inode);
> > - kfree(isw);
> > +static void inode_switch_wbs_work_fn(struct work_struct *work)
> > +{
> > + struct inode_switch_wbs_context *isw =
> > + container_of(work, struct inode_switch_wbs_context, work);
> >
> > + do_inode_switch_wbs(isw->inode, isw->new_wb);
> > + wb_put(isw->new_wb);
> > + iput(isw->inode);
> > + kfree(isw);
> > atomic_dec(&isw_nr_in_flight);
> > }
> >
> > @@ -521,6 +532,39 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > }
> >
> > /**
> > + * inode_switch_to_default_wb_sync - change the wb association of an inode to
> > + * the default writeback structure synchronously
> > + * @inode: target inode
> > + *
> > + * Switch @inode's wb association to the default writeback structure (bdi->wb).
> > + * Unlike inode_switch_wbs() the switching is performed synchronously and we
> > + * guarantee the inode is switched to the default writeback structure when this
> > + * function returns. Nothing prevents from someone else switching inode to
> > + * another writeback structure just when we are done though. Preventing that is
> > + * upto the caller if needed.
> > + */
> > +void inode_switch_to_default_wb_sync(struct inode *inode)
> > +{
> > + struct backing_dev_info *bdi = inode_to_bdi(inode);
> > +
> > + /* while holding I_WB_SWITCH, no one else can update the association */
> > + spin_lock(&inode->i_lock);
> > + if (WARN_ON_ONCE(inode->i_state & I_FREEING) ||
> > + !inode_to_wb_is_valid(inode) || inode_to_wb(inode) == &bdi->wb) {
> > + spin_unlock(&inode->i_lock);
> > + return;
> > + }
> > + __inode_wait_for_state_bit(inode, __I_WB_SWITCH);
>
> I note that __inode_wait_for_state_bit() can drop and reclaim ->i_lock.
> is it possible that:
> !inode_to_wb_is_valid(inode) || inode_to_wb(inode) == &bdi->wb)
>
> could change while ->i_lock is unlocked?
> It would be particular unfortunate if inode_to_wb(inode) became &bdi->wb
> due to some thing thread, as do_inode_switch_wbs() will deadlock if
> inode_to_wb(inode) == &bdi->wb
>
> i.e. do you need to repeat the test?
That's a very good question and I think you are right that I need to repeat
the checks for inode->i_wb. Will fix. Thanks for review!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2017-02-10 13:20 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-02-09 12:44 [PATCH 0/10] block: Fix block device shutdown related races Jan Kara
2017-02-09 12:44 ` [PATCH 01/10] block: Move bdev_unhash_inode() after invalidate_partition() Jan Kara
2017-02-12 3:58 ` Tejun Heo
2017-02-20 14:53 ` Jan Kara
2017-02-09 12:44 ` [PATCH 02/10] block: Unhash also block device inode for the whole device Jan Kara
2017-02-12 4:16 ` Tejun Heo
2017-02-09 12:44 ` [PATCH 03/10] block: Revalidate i_bdev reference in bd_aquire() Jan Kara
2017-02-09 15:54 ` Jan Kara
2017-02-12 4:22 ` Tejun Heo
2017-02-09 12:44 ` [PATCH 04/10] block: Move bdi_unregister() to del_gendisk() Jan Kara
2017-02-10 2:21 ` NeilBrown
2017-02-12 4:31 ` Tejun Heo
2017-02-09 12:44 ` [PATCH 05/10] writeback: Generalize and standardize I_SYNC waiting function Jan Kara
2017-02-12 4:32 ` Tejun Heo
2017-02-09 12:44 ` [PATCH 06/10] writeback: Move __inode_wait_for_state_bit Jan Kara
2017-02-09 12:44 ` [PATCH 07/10] writeback: Implement reliable switching to default writeback structure Jan Kara
2017-02-10 2:19 ` NeilBrown
2017-02-10 13:20 ` Jan Kara [this message]
2017-02-09 12:44 ` [PATCH 08/10] block: Fix oops in locked_inode_to_wb_and_lock_list() Jan Kara
2017-02-12 4:40 ` Tejun Heo
2017-02-20 16:58 ` Jan Kara
2017-02-09 12:44 ` [PATCH 09/10] kobject: Export kobject_get_unless_zero() Jan Kara
2017-02-12 4:41 ` Tejun Heo
2017-02-09 12:44 ` [PATCH 10/10] block: Fix oops scsi_disk_get() Jan Kara
2017-02-12 4:43 ` Tejun Heo
2017-02-09 14:52 ` [PATCH 0/10] block: Fix block device shutdown related races Thiago Jung Bauermann
2017-02-09 15:48 ` Jan Kara
2017-02-13 14:27 ` Thiago Jung Bauermann
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170210132007.GA5491@quack2.suse.cz \
--to=jack@suse.cz \
--cc=axboe@kernel.dk \
--cc=bauerman@linux.vnet.ibm.com \
--cc=dan.j.williams@intel.com \
--cc=hch@infradead.org \
--cc=linux-block@vger.kernel.org \
--cc=neilb@suse.com \
--cc=neilb@suse.de \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox