Re: [locks] 6d390e4b5d: will-it-scale.per_process_ops -96.6% regression

From: Jeff Layton <jlayton@kernel.org>
To: NeilBrown <neilb@suse.de>, yangerkun <yangerkun@huawei.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel test robot <rong.a.chen@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, Bruce Fields <bfields@fieldses.org>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [locks] 6d390e4b5d: will-it-scale.per_process_ops -96.6% regression
Date: Tue, 10 Mar 2020 17:14:11 -0400
Message-ID: <c4ef31a663fbf7a3de349696e9f00f2f5c4ec89a.camel@kernel.org>
In-Reply-To: <878sk7vs8q.fsf@notabene.neil.brown.name>

On Wed, 2020-03-11 at 08:01 +1100, NeilBrown wrote:
> On Tue, Mar 10 2020, Jeff Layton wrote:
> 
> > On Tue, 2020-03-10 at 08:52 -0400, Jeff Layton wrote:
> > 
> > [snip]
> > 
> > > On Tue, 2020-03-10 at 11:24 +0800, yangerkun wrote:
> > > > > One other thing: I think there is no need to call locks_delete_block
> > > > > in every case in functions like flock_lock_inode_wait. What patch
> > > > > 16306a61d3b7 ("fs/locks: always delete_block after waiting.")
> > > > > describes is that we need to call locks_delete_block for any error,
> > > > > not only for -ERESTARTSYS (please point out if I am wrong). This
> > > > > change may fix the regression too, since a simple lock that succeeds,
> > > > > or an unlock, will not try to acquire blocked_lock_lock.
> > > > 
> > > > 
> > > 
> > > Nice! This looks like it would work too, and it's a simpler fix.
> > > 
> > > I'd be inclined to add a WARN_ON_ONCE(fl->fl_blocker) after the if
> > > statements to make sure we never exit with one still queued. Also, I
> > > think we can do a similar optimization in __break_lease.
> > > 
> > > There are some other callers of locks_delete_block:
> > > 
> > > cifs_posix_lock_set: already only calls it in these cases
> > > 
> > > nlmsvc_unlink_block: I think we need to call this in most cases, and
> > > they're not going to be high-performance codepaths in general
> > > 
> > > nfsd4 callback handling: Several calls here, most need to always be
> > > called. find_blocked_lock could be reworked to take the
> > > blocked_lock_lock only once (I'll do that in a separate patch).
> > > 
> > > > How about something like this?
> > > 
> > > ----------------------8<---------------------
> > > 
> > > From: yangerkun <yangerkun@huawei.com>
> > > 
> > > [PATCH] filelock: fix regression in unlock performance
> > > 
> > > '6d390e4b5d48 ("locks: fix a potential use-after-free problem when
> > > wakeup a waiter")' introduces a regression since we will acquire
> > > blocked_lock_lock every time locks_delete_block is called.
> > > 
> > > In many cases we can just avoid calling locks_delete_block at all,
> > > when we know that the wait was awoken by the condition becoming true.
> > > Change several callers of locks_delete_block to only call it when
> > > waking up due to signal or other error condition.
> > > 
> > > [ jlayton: add similar optimization to __break_lease, reword changelog,
> > > 	   add WARN_ON_ONCE calls ]
> > > 
> > > Fixes: 16306a61d3b7 ("fs/locks: always delete_block after waiting.")
> > > Fixes: 6d390e4b5d48 ("locks: fix a potential use-after-free problem when wakeup a waiter")
> > > Signed-off-by: yangerkun <yangerkun@huawei.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/locks.c | 29 ++++++++++++++++++++++-------
> > >  1 file changed, 22 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/locks.c b/fs/locks.c
> > > index 426b55d333d5..b88a5b11c464 100644
> > > --- a/fs/locks.c
> > > +++ b/fs/locks.c
> > > @@ -1354,7 +1354,10 @@ static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
> > >  		if (error)
> > >  			break;
> > >  	}
> > > -	locks_delete_block(fl);
> > > +	if (error)
> > > +		locks_delete_block(fl);
> > > +	WARN_ON_ONCE(fl->fl_blocker);
> > > +
> > >  	return error;
> > >  }
> > >  
> > > @@ -1447,7 +1450,9 @@ int locks_mandatory_area(struct inode *inode, struct file *filp, loff_t start,
> > >  
> > >  		break;
> > >  	}
> > > -	locks_delete_block(&fl);
> > > +	if (error)
> > > +		locks_delete_block(&fl);
> > > +	WARN_ON_ONCE(fl.fl_blocker);
> > >  
> > >  	return error;
> > >  }
> > > @@ -1638,23 +1643,28 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
> > >  
> > >  	locks_dispose_list(&dispose);
> > >  	error = wait_event_interruptible_timeout(new_fl->fl_wait,
> > > -						!new_fl->fl_blocker, break_time);
> > > +						 !new_fl->fl_blocker,
> > > +						 break_time);
> > >  
> > >  	percpu_down_read(&file_rwsem);
> > >  	spin_lock(&ctx->flc_lock);
> > >  	trace_break_lease_unblock(inode, new_fl);
> > > -	locks_delete_block(new_fl);
> > >  	if (error >= 0) {
> > >  		/*
> > >  		 * Wait for the next conflicting lease that has not been
> > >  		 * broken yet
> > >  		 */
> > > -		if (error == 0)
> > > +		if (error == 0) {
> > > +			locks_delete_block(new_fl);
> > >  			time_out_leases(inode, &dispose);
> > > +		}
> > >  		if (any_leases_conflict(inode, new_fl))
> > >  			goto restart;
> > >  		error = 0;
> > > +	} else {
> > > +		locks_delete_block(new_fl);
> > >  	}
> > > +	WARN_ON_ONCE(fl->fl_blocker);
> > >  out:
> > >  	spin_unlock(&ctx->flc_lock);
> > >  	percpu_up_read(&file_rwsem);
> > > @@ -2126,7 +2136,10 @@ static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
> > >  		if (error)
> > >  			break;
> > >  	}
> > > -	locks_delete_block(fl);
> > > +	if (error)
> > > +		locks_delete_block(fl);
> > > +	WARN_ON_ONCE(fl->fl_blocker);
> > > +
> > >  	return error;
> > >  }
> > >  
> > > @@ -2403,7 +2416,9 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
> > >  		if (error)
> > >  			break;
> > >  	}
> > > -	locks_delete_block(fl);
> > > +	if (error)
> > > +		locks_delete_block(fl);
> > > +	WARN_ON_ONCE(fl->fl_blocker);
> > >  
> > >  	return error;
> > >  }
> > 
> > I've gone ahead and added the above patch to linux-next. Linus, Neil,
> > are you ok with this one? I think this is probably the simplest
> > approach.
> 
> I think this patch contains an assumption which is not justified.  It
> assumes that if a wait_event completes without error, then the wake_up()
> must have happened.  I don't think that is correct.
> 
> In the patch that caused the recent regression, the race described
> involved a signal arriving just as __locks_wake_up_blocks() was being
> called on another thread.
> So the waiting process was woken by a signal *after* ->fl_blocker was set
> to NULL, and *before* the wake_up().  If wait_event_interruptible()
> finds that the condition is true, it will report success whether there
> was a signal or not.
> 
> If you skip the locks_delete_block() after a wait, you get exactly the
> same race as the optimization - which only skipped most of
> locks_delete_block().
> 
> I have a better solution.  I did like your patch except that it changed
> too much code.  So I revised it to change less code.  See below.
> 
> NeilBrown
> 
> From: NeilBrown <neilb@suse.de>
> Date: Wed, 11 Mar 2020 07:39:04 +1100
> Subject: [PATCH] locks: restore locks_delete_block optimization
> 
> A recent patch (see Fixes: below) removed an optimization which is
> important as it avoids taking a lock in a common case.
> 
> The comment justifying the optimisation was correct as far as it went,
> in that if the tests succeeded, then the values would remain stable and
> the test result would remain valid even without a lock.
> 
> However after the test succeeds the lock can be freed while some other
> thread might have only just set ->fl_blocker to NULL (thus allowing the
> test to succeed) but has not yet called wake_up() on the wq in the lock.
> If the wake_up happens after the lock is freed, a use-after-free error
> occurs.
> 
> This patch restores the optimization and reorders code to avoid the
> use-after-free.  Specifically we move the list_del_init on
> fl_blocked_member to *after* the wake_up(), and add an extra test on
> fl_blocked_member to locks_delete_block() before deciding to avoid taking
> the spinlock.
> 
> As this involves breaking code out of __locks_delete_block(), we discard
> the function completely and open-code it in the two places it was
> called.
> 
> These lockless accesses do not require any memory barriers.  The failure
> mode from possible memory access reordering is that the test at the top
> of locks_delete_block() will fail, and in that case we fall through into
> the locked region which provides sufficient memory barriers implicitly.
> 
> Fixes: 6d390e4b5d48 ("locks: fix a potential use-after-free problem when wakeup a waiter")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/locks.c | 42 ++++++++++++++++++++++++++++--------------
>  1 file changed, 28 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 426b55d333d5..dc99ab2262ea 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -716,18 +716,6 @@ static void locks_delete_global_blocked(struct file_lock *waiter)
>  	hash_del(&waiter->fl_link);
>  }
>  
> -/* Remove waiter from blocker's block list.
> - * When blocker ends up pointing to itself then the list is empty.
> - *
> - * Must be called with blocked_lock_lock held.
> - */
> -static void __locks_delete_block(struct file_lock *waiter)
> -{
> -	locks_delete_global_blocked(waiter);
> -	list_del_init(&waiter->fl_blocked_member);
> -	waiter->fl_blocker = NULL;
> -}
> -
>  static void __locks_wake_up_blocks(struct file_lock *blocker)
>  {
>  	while (!list_empty(&blocker->fl_blocked_requests)) {
> @@ -735,11 +723,13 @@ static void __locks_wake_up_blocks(struct file_lock *blocker)
>  
>  		waiter = list_first_entry(&blocker->fl_blocked_requests,
>  					  struct file_lock, fl_blocked_member);
> -		__locks_delete_block(waiter);
> +		locks_delete_global_blocked(waiter);
> +		waiter->fl_blocker = NULL;
>  		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
>  			waiter->fl_lmops->lm_notify(waiter);
>  		else
>  			wake_up(&waiter->fl_wait);
> +		list_del_init(&waiter->fl_blocked_member);

Are you sure you don't need a memory barrier here? Could the
list_del_init be hoisted just above the if condition?
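
To make the question concrete, here is a rough userspace model of the
ordering I'm worried about. This is only a sketch using C11 atomics, not
kernel code: struct model_waiter, model_init(), model_wake() and
model_can_skip_lock() are all made-up names, and the list membership is
reduced to a single flag. It assumes the waker's final store is a release
and the waiter's lockless check is an acquire, so that the "off the list"
store cannot become visible before the wakeup work that precedes it.
Whether the kernel code really needs that pairing is exactly what I'm
asking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct file_lock; not kernel code. */
struct model_waiter {
	_Atomic(void *) blocker;	/* models fl_blocker */
	atomic_bool on_list;		/* models !list_empty(&fl_blocked_member) */
	int wakeups;			/* models the side effect of wake_up() */
};

/* A waiter starts out blocked on something and queued on its list. */
void model_init(struct model_waiter *w, void *blocker)
{
	assert(blocker != NULL);
	atomic_init(&w->blocker, blocker);
	atomic_init(&w->on_list, true);
	w->wakeups = 0;
}

/* Waker side: the release store is the waker's last access to *w. */
void model_wake(struct model_waiter *w)
{
	atomic_store_explicit(&w->blocker, NULL, memory_order_relaxed);
	w->wakeups++;	/* stands in for wake_up(&waiter->fl_wait) */
	atomic_store_explicit(&w->on_list, false, memory_order_release);
}

/* Waiter side: true means the lockless early return would be taken. */
bool model_can_skip_lock(struct model_waiter *w)
{
	/* Acquire pairs with the release in model_wake(). */
	if (atomic_load_explicit(&w->on_list, memory_order_acquire))
		return false;
	return atomic_load_explicit(&w->blocker, memory_order_relaxed) == NULL;
}
```

In this model, a waiter that sees on_list == false is guaranteed the
waker has finished touching the structure. With plain stores instead, I
don't see what would stop the compiler or CPU from reordering them.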

>  	}
>  }
>  
> @@ -753,11 +743,35 @@ int locks_delete_block(struct file_lock *waiter)
>  {
>  	int status = -ENOENT;
>  
> +	/*
> +	 * If fl_blocker is NULL, it won't be set again as this thread
> +	 * "owns" the lock and is the only one that might try to claim
> +	 * the lock.  So it is safe to test fl_blocker locklessly.
> +	 * Also if fl_blocker is NULL, this waiter is not listed on
> +	 * fl_blocked_requests for some lock, so no other request can
> +	 * be added to the list of fl_blocked_requests for this
> +	 * request.  So if fl_blocker is NULL, it is safe to
> +	 * locklessly check if fl_blocked_requests is empty.  If both
> +	 * of these checks succeed, there is no need to take the lock.
> +	 * We also check fl_blocked_member is empty.  This is logically
> +	 * redundant with the test of fl_blocker, but it ensures that
> +	 * __locks_wake_up_blocks() has finished the wakeup and will not
> +	 * access the lock again, so it is safe to return and free.
> +	 * There is no need for any memory barriers with these lockless
> +	 * tests: if the reads happen before the corresponding writes are
> +	 * seen, we fall through to the locked code.
> +	 */
> +	if (waiter->fl_blocker == NULL &&
> +	    list_empty(&waiter->fl_blocked_member) &&
> +	    list_empty(&waiter->fl_blocked_requests))
> +		return status;
>  	spin_lock(&blocked_lock_lock);
>  	if (waiter->fl_blocker)
>  		status = 0;
>  	__locks_wake_up_blocks(waiter);
> -	__locks_delete_block(waiter);
> +	locks_delete_global_blocked(waiter);
> +	list_del_init(&waiter->fl_blocked_member);
> +	waiter->fl_blocker = NULL;
>  	spin_unlock(&blocked_lock_lock);
>  	return status;
>  }

-- 
Jeff Layton <jlayton@kernel.org>
