Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Waiman Long <waiman.long@hpe.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Jonathan Corbet <corbet@lwn.net>, <linux-kernel@vger.kernel.org>,
	<linux-doc@vger.kernel.org>, Davidlohr Bueso <dave@stgolabs.net>,
	Scott J Norton <scott.norton@hpe.com>,
	Douglas Hatch <doug.hatch@hpe.com>
Subject: Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes
Date: Thu, 22 Sep 2016 22:44:30 -0400	[thread overview]
Message-ID: <57E4970E.9020602@hpe.com> (raw)
In-Reply-To: <alpine.DEB.2.20.1609221247570.5599@nanos>

On 09/22/2016 09:32 AM, Thomas Gleixner wrote:
> On Tue, 6 Sep 2016, Waiman Long wrote:
>> +enum futex_type {
>> +	TYPE_PI = 0,
>> +	TYPE_TO,
>> +};
> Please introduce the futex_type magic and the related changes to the pi
> code in a seperate patch so it can be verified independently.
>
> It's sad that one has to explain that to you over and over ....

I didn't break it out because the changes to the PI code was pretty 
small. I will break it out in the next version.

>> @@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
>>   		return;
>>
>>   	/*
>> -	 * If state->owner is NULL, the owner is most probably dying
>> -	 * and has cleaned up the futex state already
>> +	 * If state->owner is NULL and the type is TYPE_PI, the owner
>> +	 * is most probably dying and has cleaned up the state already
>>   	 */
>> -	if (state->owner) {
>> +	if (state->owner&&  (state->type == TYPE_PI)) {
>>   		raw_spin_lock_irq(&state->owner->pi_lock);
>>   		list_del_init(&state->list);
>>   		raw_spin_unlock_irq(&state->owner->pi_lock);
>> @@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
>>   		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
>>   	}
>>
>> +	/*
>> +	 * Dequeue it from the HB futex state list.
>> +	 */
>> +	list_del_init(&state->hb_list);
> The comment above this list_del() is really pointless. I can see that from
> the code itself.
>
> Aside of that: Why do you need seperate list heads? You explain the
> seperate list somewhere in that big comment below, but it should be
> explained at the point where you add it to the state and the hash bucket.

Sure. Will fix the comment.

>>   	if (current->pi_state_cache)
>>   		kfree(state);
>>   	else {
>> @@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
>>   			continue;
>>   		}
>>
>> -		WARN_ON(pi_state->owner != curr);
>>   		WARN_ON(list_empty(&pi_state->list));
>> +		if (pi_state->type == TYPE_PI) {
>> +			WARN_ON(pi_state->owner != curr);
>> +			pi_state->owner = NULL;
>> +		}
>>   		list_del_init(&pi_state->list);
>> -		pi_state->owner = NULL;
>>   		raw_spin_unlock_irq(&curr->pi_lock);
>>
>> -		rt_mutex_unlock(&pi_state->pi_mutex);
>> +		if (pi_state->type == TYPE_PI)
> lacks curly braces

Yes, you are right.

>> +			rt_mutex_unlock(&pi_state->pi_mutex);
>> +		else if (pi_state->type == TYPE_TO) {
>> +			/*
>> +			 * Need to wakeup the mutex owner.
>> +			 */
> Another completely useless comment. Because you tell what you do, but not
> WHY.

Will elaborate on why the wakeup here.

>> +			WARN_ON(!pi_state->owner);
>> +			if (pi_state->owner)
>> +				wake_up_process(pi_state->owner);
> And what handles or sanity checks the state->hb_list ???
>

The exit_pi_state_list() function doesn't need to deal with 
state->hb_list. The hb_list is used to locate the futex state, but the 
futex owner doesn't have a reference to the futex state. So it won't 
need to decrement it and potentially free it.

>> +/*
>> + * Try to lock the userspace futex word (0 =>  vpid).
>> + *
>> + * Return: 1 if lock acquired or an error happens, 0 if not.
>> + *	   The status code will be 0 if no error, or<  0 if an error happens.
>> + *	   *puval will contain the latest futex value when trylock fails.
>> + *
>> + * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
>> + * The HB spinlock should NOT be held while calling this function. A
>> + * successful lock acquisition will clear the waiter and died bits.
>> + */
>> +static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
>> +				   const bool waiter, int *status)
>> +{
>> +	u32 uval;
>> +
>> +	*status = 0;
>> +
>> +	if (unlikely(get_user(uval, uaddr)))
>> +		goto efault;
>> +
>> +	*puval = uval;
>> +
>> +	if (waiter ? (uval&  FUTEX_TID_MASK) : uval)
>> +		return 0;	/* Trylock fails */
> Please do not use tail comments. They are hard to parse.
>

OK, will move the comment up.

>> +
>> +	if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
>> +		goto efault;
>> +
>> +	return *puval == uval;
>> +
>> +efault:
>> +	*status = -EFAULT;
>> +	return 1;
>> +}
> Do we really need another variant of cmpxchg and why do you need that extra
> status? What's wrong in doing the magic in the return value?

This is not another variant of cmpxchg. It is the cmpxchg used by 
cmpxchg_futex_value_locked().  The only difference is that page fault 
was disabled with the locked version. I call 
futex_atomic_cmpxchg_inatomic() directly because it is called without 
the HB spinlock. So I don't need to disable page fault. I will add a 
separate patch to introduce the helper function 
cmpxchg_futex_value_unlocked() to indicate that the cmpxchg is done 
without lock.

I will remove the extra status parameter.

>> +
>> +/*
>> + * Spinning threshold for futex word without setting FUTEX_WAITERS.
>> + */
>> +#define FUTEX_SPIN_THRESHOLD	(1<<  10)
> That number is pulled out of thin air or what is the rationale for chosing
> 1024?

It is kind of arbitrary. I want a value large enough to encourage lock 
stealing, while not too large that it may take too long to get the lock. 
Will elaborate more about the choice in the comment.


>> +/*
>> + * Spin on the futex word while the futex owner is active. Otherwise, set
>> + * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
>> + * owner's task structure, we don't need to use RCU to ensure that the task
>> + * structure is valid. The function will directly grab the lock if the
>> + * owner is dying or the pid is invalid. That should take care of the problem
>> + * of dead lock owners unless the pid wraps around and the preceived owner is
>> + * not the real owner.
>> + *
>> + * Return: 0 if futex acquired,<  0 if an error happens.
> If you document functions then please follow the style of this file which
> uses kerneldoc comments.
>

Sorry, I forgot to use the kerneldoc notation. It is an RFC patch, and 
my focus was to make it work. I haven't spent much time in cleaning up 
the comment. I will do that in the next version.

>> + */
>> +static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
>> +			       struct futex_state *state)
>> +{
>> +	int ret, loop = FUTEX_SPIN_THRESHOLD;
>> +	u32 uval, curval;
>> +	u32 opid = 0;				/* Futex owner task ID */
>> +	struct task_struct *otask = NULL;	/* Futex owner task struct */
> Please use understandable variable names instead of this horrible tail
> comments. What's wrong with using owner and owner_pid?

Yes, I will use more descriptive variable name.

>> +	bool on_owner_list = false;
>> +
>> +	preempt_disable();
>> +	WRITE_ONCE(state->owner, current);
>> +	for (;; loop--) {
>> +		if (futex_trylock_to(uaddr, vpid,&uval, true,&ret))
>> +			break;
> And here you are. What the hell is wrong with:
>
>      	     	ret = futex_trylock(uaddr, vpid,&uval, true);
> 		if (ret<  0)
> 			break;
>
> Nothing is wrong. It's just way simpler to read and understand....

Will do that.

>> +		WARN_ON_ONCE(!(uval&  FUTEX_TID_MASK));
>> +
>> +		if ((uval&  FUTEX_TID_MASK) != opid) {
>> +			/*
>> +			 * Get the new task structure
>> +			 */
>> +			if (otask)
>> +				put_task_struct(otask);
>> +
>> +			WARN_ON_ONCE(on_owner_list);
>> +			opid  = uval&  FUTEX_TID_MASK;
>> +			otask = futex_find_get_task(opid);
>> +		}
> Can you pretty please use split out functions for this otask handling? This
> for loop does not fit on a single screen and can't be understood in one go.

Yes, this is the largest function in the new code, I will try to add 
helper functions to make it easier to understand.

>> +		if (unlikely(!otask || (otask->flags&  PF_EXITING) ||
>> +			    (uval&  FUTEX_OWNER_DIED))) {
>> +			/*
>> +			 * PID invalid or exiting/dead task, try to grab
>> +			 * the lock now.
>> +			 */
>> +			ret = futex_atomic_cmpxchg_inatomic(&curval,
>> +					uaddr, uval, vpid);
>> +			if (unlikely(ret))
>> +				goto efault;
>> +			if (curval != uval)
>> +				continue;	/* Futex value changed */
> This code flow is completely non parseable. Stop this and put proper
> comments above decisions which explain why and not what.

I will spend more time cleaning up the comment and streamline the code.

>
>> +			pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
>> +				vpid, opid, otask ? "dying" : "invalid");
> Really useful information to confuse sysadmins. A proper comment explaining
> the issue and the implications and how we deal with it would have been the
> right thing to do...

Yes, the message may be too cryptic. I will make it more clear in the 
next version.

>> +			break;
>> +		}
>> +		while ((loop<= 0)&&  !(uval&  FUTEX_WAITERS)) {
> Groan. This is more than horrible to read.
>
>         	        if (loop<= 0) {
> 		        if (futex_set_waiters_bit(....))
> 			   	goto fault;
> 		}
>
> If at all. This loop<= 0 thing is utterly confusing.
>

I will improve this code.

>> +			/*
>> +			 * Need to set the FUTEX_WAITERS bit.
>> +			 */
>> +			if (futex_atomic_cmpxchg_inatomic(&curval, uaddr, uval,
>> +							  uval | FUTEX_WAITERS))
>> +				goto efault;
>> +			if (curval == uval) {
>> +				uval |= FUTEX_WAITERS;
>> +				break;
> I had to look five times to figure out to which loop this break belongs. I
> really wonder how you manage to keep track of this mess.

Because I wrote it :-)

In the v2 patch, the FUTEX_WAITERS bit setting was put into a helper 
function which should clarify the code. I will add a few more to 
simplify the main loop.

>> +			}
>> +			uval = curval;
>> +		}
>> +
>> +		if (!(uval&  ~FUTEX_TID_MASK))
>> +			continue;	/* Do trylock again */
> Gah. I can see that you go over to start and do trylock, but WHY ? I know
> it, but for the casual reader it's fcking non obvious.
>
>> +
>> +		if (need_resched()) {
>> +			__set_current_state(TASK_RUNNING);
>> +			schedule_preempt_disabled();
>> +			continue;
>> +		}
>> +
>> +		if (signal_pending(current)) {
>> +			ret = -EINTR;
>> +			break;
>> +		}
>> +
>> +		/*
>> +		 * If the owner isn't active, we need to go to sleep after
>> +		 * making sure that the FUTEX_WAITERS bit is set. We also
>> +		 * need to put the futex state into the futex owner's
>> +		 * pi_state_list to prevent deadlock when the owner dies.
>> +		 */
>> +		if (!otask->on_cpu) {
>> +			if (!(uval&  FUTEX_WAITERS)) {
>> +				loop = 0;
>> +				continue;
> This is completely fucked, really.
>
>       		if (owner->on_cpu) {
> 		   	cpu_relax();
> 			continue;
> 		}	
>
> 		ret = futex_sleep();
> 		if (ret<  0)
> 		   	goto fault;
> 		if (ret == FUTEX_AQUIRED)
> 		   	break;
>
> Your attempt to resue code which is in the loop above is just making this
> completely unreadable. Hell no! Futexes are complex enough already. We
> really can do without obfuscation. This not the obfuscated C-code contest.

As said above, I will add more helper functions and streamline the code 
to make it more readable.

>> +	if (futex_trylock_to(uaddr, vpid,&uval, false,&ret))
>> +		/* Lock acquired or an error happens */
>> +		return ret;
> This lacks curly braces. See: https://marc.info/?l=linux-kernel&m=147351236615103

Got it.

>> +
>> +	/*
>> +	 * Detect deadlocks.
>> +	 */
>> +	if (unlikely(((uval&  FUTEX_TID_MASK) == vpid) ||
>> +			should_fail_futex(true)))
>> +		return -EDEADLK;
>> +
>> +	if (refill_futex_state_cache())
>> +		return -ENOMEM;
>> +
>> +	futex_set_timer(time,&to,&timeout, flags, current->timer_slack_ns);
>> +
>> +	ret = get_futex_key(uaddr, flags&  FLAGS_SHARED,&key, VERIFY_WRITE);
>> +	if (unlikely(ret))
>> +		goto out;
>> +
>> +	hb = hash_futex(&key);
>> +	spin_lock(&hb->lock);
> Why are you using the global hash for this? As we have shown the global
> hash has horrible performance due to cross node access and potential hash
> bucket lock contention of unrelated processes. If we add something new
> which is performance optimized then we pretty please make it use a seperate
> process private storage. I don't see any point in making this optimized for
> process shared futexes.

I used the hash lock for managing the futex state objects only. I don't 
need it for other purpose. It is true that if there is a hash collision 
that another wait-wake futex is using the same hash. It may cause 
performance problem. I think I will add another spinlock in the hash 
bucket just for TO futexes. In this way, a collision with wait-wake 
futex won't cause unexpected spinlock contention, though a bit of extra 
hash bucket cachline contention may still happen.

BTW, running my microbenchmark with wait-wake futex, over 90% of the CPU 
time were in the spinlock contention:

   96.23%  futex_test  [kernel.kallsyms] [k] 
native_queued_spin_lock_slowpath\v
       - queued_spin_lock_slowpath
          - _raw_spin_lock\v
             + 51.81% futex_wake\v
             + 48.08% futex_wait_setup

For the TO futexes, the perf profile was:

   97.33%   0.97%  futex_test  [kernel.kallsyms]  [k] futex_lock_to
\v  92.45%  92.45%  futex_test  [kernel.kallsyms]  [k] osq_lock
\v   1.65%   1.65%  futex_test  [kernel.kallsyms]  [k] 
mutex_spin_on_owner.is\v
    0.83%   0.83%  futex_test  [kernel.kallsyms]  [k] __get_user_4

The %cpu time in spinlock was about 0.5%. So spinlock contention wasn't 
an issue for the TO futexes.

>> +
>> +	/*
>> +	 * Locate the futex state by looking up the futex state list in the
>> +	 * hash bucket. If it isn't found, create a new one and put it into
>> +	 * the list.
>> +	 */
>> +	state = lookup_futex_state(hb,&key, true);
>> +
>> +	/*
>> +	 * We don't need to hold the HB lock after looking up the futex state
>> +	 * as we have incremented the reference count.
>> +	 */
>> +	spin_unlock(&hb->lock);
>> +	BUG_ON(!state);
>> +
>> +	/*
>> +	 * Acquiring the serialization mutex.
>> +	 */
>> +	if (state->type != TYPE_TO) {
>> +		ret = -EINVAL;
>> +	} else {
>> +		if (to)
>> +			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
>> +		ret = mutex_lock_interruptible(&state->mutex);
>> +	}
>> +
>> +	if (unlikely(ret))
>> +		/*
>> +		 * We got a signal or has some other error, need to abort
>> +		 * the lock operation and return.
>> +		 */
>> +		goto out_put_state_key;
>> +
>> +	/*
>> +	 * As the mutex owner, we can now spin on the futex word as well as
>> +	 * the active-ness of the futex owner.
>> +	 */
>> +	ret = futex_spin_on_owner(uaddr, vpid, state);
> So if futex_spin_on_owner() faults, then you return -EFAULT to user space
> w/o trying to page in the stuff? That's just wrong.

I am not so sure about the proper fault handling in futex. However, 
futex_atomic_cmpxchg_inatomic() was doing cmpxchg without disabling page 
fault. So does that mean if I get a fault, it is probably other problem 
and not because of a lock of page fault?

>
>> +static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
>> +{
>> +	u32 uval, pid, vpid = task_pid_vnr(current);
>> +	union futex_key key = FUTEX_KEY_INIT;
>> +	struct futex_hash_bucket *hb;
>> +	struct futex_state *state;
>> +	struct task_struct *owner;
>> +	int ret;
>> +
>> +	if (get_user(uval, uaddr))
>> +		return -EFAULT;
>> +	/*
>> +	 * If there is a new lock owner, we can exit now.
>> +	 * If uval is 0, another task may have acquired and release the
>> +	 * lock in the mean time, so we should also exit.
> If there is a new lock owner or the lock has been released? Voodoo magic or
> what? How can user space take over the lock when the current task owns it?
>
> That's just broken. The only reason to get here is that user space called
> in because it was not able to release the lock atomically, i.e. because the
> waiters bit was set. If anything fiddled with the uval then returning 0 is
> beyond stupid.

I had changed it in the v2 patch to not allow that and return error if 
that happens.

>> +	 */
>> +	pid = uval&  FUTEX_TID_MASK;
>> +	if ((pid&&  (pid != vpid)) || !uval)
>> +		return 0;
>> +	WARN_ON_ONCE(!(uval&  FUTEX_WAITERS));
> Crap. It's legit to call here even if the waiters bit is not set.

Again, it was changed in the v2 patch to return error in this case.

>> +	/*
>> +	 * If the TID isn't cleared in the userspace, clear it here to
>> +	 * encourage faster lock transfer to the mutex owner.
> What do you encourage here? How is the mutex transfer accelerated?

It was removed in the v2 patch because of the need to do lock handoff.

>> +	 */
>> +	if (pid == vpid) {
>> +		futex_atomic_cmpxchg_inatomic(&uval, uaddr, uval,
>> +					       uval&  ~FUTEX_TID_MASK);
>> +		WARN_ON((uval&  FUTEX_TID_MASK) != vpid);
>> +	}
>> +	ret = get_futex_key(uaddr, flags&  FLAGS_SHARED,&key, VERIFY_WRITE);
>> +	if (ret)
>> +		return ret;
>> +
>> +	hb = hash_futex(&key);
>> +	spin_lock(&hb->lock);
>> +
>> +	/*
>> +	 * Check the hash bucket only for matching futex state.
>> +	 */
>> +	state = lookup_futex_state(hb,&key, false);
>> +
>> +	if (!state)
>> +		goto out_unlock;
>> +
>> +	if (state->type != TYPE_TO) {
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	}
>> +
>> +	/*
>> +	 * We don't need to do the wakeup if the futex isn't equal to
>> +	 * FUTEX_WAITERS as it implies that someone else has taken over
>> +	 * the futex.
>> +	 */
>> +	if (get_user(uval, uaddr)) {
>> +		ret = -EFAULT;
>> +		goto out_unlock;
>> +	}
>> +
>> +	owner = READ_ONCE(state->owner);
>> +	if ((uval == FUTEX_WAITERS)&&  owner)
>> +		ret = wake_up_process(owner);
> What guarantees that owner cannot exit between reading owner state and
> calling wake_up_process?

I used the wake_q function in the v2 patch which increment the reference 
count.

>
> Dammit. If you read the source in this file carefully, then you notice that
> it contains a lot of documentation about life time rules, protection and
> serialization.
>
> If you think, that it's the reviewers problem to figure that out from your
> w/o proper comments in place, then you are completely on the wrong track.
>
> I'm not even trying to think about the concept itself as long as this is
> presented as a pile of unpenetrable clusterfuck.

I am sorry that patch wasn't as well-documented as it should. I will 
spend more time cleaning up the comments and streamline the code before 
sending out the next v3 version.

Thanks for the review.

Cheers,
Longman

next prev parent reply	other threads:[~2016-09-23  2:44 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-06 18:53 [RFC PATCH 0/4] futex: Introducing throughput-optimized futexes Waiman Long
2016-09-06 18:53 ` [RFC PATCH 1/4] futex: Add futex_set_timer() helper function Waiman Long
2016-09-06 18:53 ` [RFC PATCH 2/4] futex: Rename futex_pi_state to futex_state Waiman Long
2016-09-06 18:53 ` [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes Waiman Long
2016-09-22 13:32   ` Thomas Gleixner
2016-09-23  2:44     ` Waiman Long [this message]
2016-09-23 10:16       ` Thomas Gleixner
2016-09-06 18:53 ` [RFC PATCH 4/4] futex, doc: TO futexes document Waiman Long
2016-09-22 10:40   ` Thomas Gleixner
2016-09-22 13:06     ` Waiman Long
2016-09-22 13:51       ` Thomas Gleixner
2016-09-22 19:14         ` Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57E4970E.9020602@hpe.com \
    --to=waiman.long@hpe.com \
    --cc=corbet@lwn.net \
    --cc=dave@stgolabs.net \
    --cc=doug.hatch@hpe.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=scott.norton@hpe.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.