kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	peterz@infradead.org, torvalds@linux-foundation.org,
	konrad.wilk@oracle.com, pbonzini@redhat.com,
	paulmck@linux.vnet.ibm.com, waiman.long@hp.com, davej@redhat.com,
	oleg@redhat.com, x86@kernel.org, paul.gortmaker@windriver.com,
	ak@linux.intel.com, jasowang@redhat.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	xen-devel@lists.xenproject.org, riel@redhat.com,
	borntraeger@de.ibm.com, akpm@linux-foundation.org,
	a.ryabinin@samsung.com, sasha.levin@oracle.com
Subject: Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions
Date: Mon, 09 Feb 2015 15:04:22 +0530	[thread overview]
Message-ID: <54D87F1E.9060307@linux.vnet.ibm.com> (raw)
In-Reply-To: <54D7D19B.1000103@goop.org>

On 02/09/2015 02:44 AM, Jeremy Fitzhardinge wrote:
> On 02/06/2015 06:49 AM, Raghavendra K T wrote:
[...]
>
>> Linus suggested that we should not do any writes to lock after unlock(),
>> and we can move slowpath clearing to fastpath lock.
>
> Yep, that seems like a sound approach.

Current approach seem to be working now. (though we could not avoid read).
Related question: Do you think we could avoid SLOWPATH_FLAG itself by
checking head and tail difference. or is it costly because it may
result in unnecessary unlock_kicks?

>> However it brings additional case to be handled, viz., slowpath still
>> could be set when somebody does arch_trylock. Handle that too by ignoring
>> slowpath flag during lock availability check.
>>
>> Reported-by: Sasha Levin <sasha.levin@oracle.com>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   arch/x86/include/asm/spinlock.h | 70 ++++++++++++++++++++---------------------
>>   1 file changed, 34 insertions(+), 36 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
>> index 625660f..0829f86 100644
>> --- a/arch/x86/include/asm/spinlock.h
>> +++ b/arch/x86/include/asm/spinlock.h
>> @@ -49,6 +49,23 @@ static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
>>   	set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
>>   }
>>
>> +static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
>> +{
>> +	arch_spinlock_t old, new;
>> +	__ticket_t diff;
>> +
>> +	old.tickets = READ_ONCE(lock->tickets);
>
> Couldn't the caller pass in the lock state that it read rather than
> re-reading it?
>

Yes we could. do you mean we could pass additional read value apart from 
lock, (because lock will be anyway needed for cmpxchg).

>>
>> +static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
>> +{
>> +}
>> +
>>   #endif /* CONFIG_PARAVIRT_SPINLOCKS */
>>
>>   static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>> @@ -84,7 +105,7 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
>>   	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
>>
>>   	inc = xadd(&lock->tickets, inc);
>> -	if (likely(inc.head == inc.tail))
>> +	if (likely(inc.head == (inc.tail & ~TICKET_SLOWPATH_FLAG)))
>

good point, we can get rid of this as well.

> The intent of this conditional was to be the quickest possible path when
> taking a fastpath lock, with the code below being used for all slowpath
> locks (free or taken). So I don't think masking out SLOWPATH_FLAG is
> necessary here.
>
>>   		goto out;
>>
>>   	inc.tail &= ~TICKET_SLOWPATH_FLAG;
>> @@ -98,7 +119,10 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
>>   		} while (--count);
>>   		__ticket_lock_spinning(lock, inc.tail);
>>   	}
>> -out:	barrier();	/* make sure nothing creeps before the lock is taken */
>> +out:
>> +	__ticket_check_and_clear_slowpath(lock);
>> +
>> +	barrier();	/* make sure nothing creeps before the lock is taken */
>
> Which means that if "goto out" path is only ever used for fastpath
> locks, you can limit calling __ticket_check_and_clear_slowpath() to the
> slowpath case.
>

Yes, I ll move that call up.

>>   }
>>
>>   static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
>> @@ -115,47 +139,21 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
>>   	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
>>   }
>>
>> -static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
>> -					    arch_spinlock_t old)
>> -{
>> -	arch_spinlock_t new;
>> -
>> -	BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
>> -
>> -	/* Perform the unlock on the "before" copy */
>> -	old.tickets.head += TICKET_LOCK_INC;
>
> NB (see below)

Thanks for pointing, this solved the hang issue. I
missed this exact addition.

>
>> -
>> -	/* Clear the slowpath flag */
>> -	new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);
>> -
>> -	/*
>> -	 * If the lock is uncontended, clear the flag - use cmpxchg in
>> -	 * case it changes behind our back though.
>> -	 */
>> -	if (new.tickets.head != new.tickets.tail ||
>> -	    cmpxchg(&lock->head_tail, old.head_tail,
>> -					new.head_tail) != old.head_tail) {
>> -		/*
>> -		 * Lock still has someone queued for it, so wake up an
>> -		 * appropriate waiter.
>> -		 */
>> -		__ticket_unlock_kick(lock, old.tickets.head);
>> -	}
>> -}
>> -
>>   static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>>   {
>>   	if (TICKET_SLOWPATH_FLAG &&
>> -	    static_key_false(&paravirt_ticketlocks_enabled)) {
>> -		arch_spinlock_t prev;
>> +		static_key_false(&paravirt_ticketlocks_enabled)) {
>> +		__ticket_t prev_head;
>>
>> -		prev = *lock;
>> +		prev_head = lock->tickets.head;
>>   		add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>>
>>   		/* add_smp() is a full mb() */
>>
>> -		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
>> -			__ticket_unlock_slowpath(lock, prev);
>> +		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) {
>
> So we're OK with still having a ("speculative"?) read-after-unlock here?
> I guess the only way to avoid it is to make the add_smp an xadd, but
> that's pretty expensive even compared to a locked add (at least last
> time I checked, which was at least a couple of microarchitectures ago).
> An unlocked add followed by lfence should also do the trick, but that
> was also much worse in practice.

So we have 3 choices,
1. xadd
2. continue with current approach.
3. a read before unlock and also after that.

>
>> +			BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
>> +			__ticket_unlock_kick(lock, prev_head);
>
> Should be "prev_head + TICKET_LOCK_INC" to match the previous code,
> otherwise it won't find the CPU waiting for the new head.

Yes it is :)

  reply	other threads:[~2015-02-09  9:34 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-06 14:49 [PATCH] x86 spinlock: Fix memory corruption on completing completions Raghavendra K T
2015-02-06 15:20 ` Sasha Levin
2015-02-06 16:15   ` Linus Torvalds
2015-02-06 17:03     ` Andrey Ryabinin
2015-02-08 17:14   ` Oleg Nesterov
2015-02-06 16:25 ` Linus Torvalds
2015-02-06 19:42   ` Davidlohr Bueso
2015-02-06 21:15     ` Sasha Levin
2015-02-06 23:24       ` Davidlohr Bueso
2015-02-08 17:49   ` Raghavendra K T
2015-02-06 18:57 ` Sasha Levin
2015-02-08 17:57   ` Raghavendra K T
2015-02-08 21:14 ` Jeremy Fitzhardinge
2015-02-09  9:34   ` Raghavendra K T [this message]
2015-02-09 12:02     ` Peter Zijlstra
2015-02-09 12:52       ` Raghavendra K T
2015-02-10  0:53       ` Linus Torvalds
2015-02-10  9:30         ` Raghavendra K T
2015-02-10 13:18           ` Denys Vlasenko
2015-02-10 13:20             ` Denys Vlasenko
2015-02-10 14:24             ` Oleg Nesterov
2015-02-10 13:23           ` Sasha Levin
2015-02-10 13:26           ` Oleg Nesterov
2015-02-11  1:18             ` Jeremy Fitzhardinge
2015-02-11 17:24               ` Oleg Nesterov
2015-02-11 23:15                 ` Jeremy Fitzhardinge
2015-02-11 23:28                   ` Linus Torvalds
2015-02-12  7:08                     ` Jeremy Fitzhardinge
2015-02-12 14:18                   ` Oleg Nesterov
2015-02-11 11:08             ` Raghavendra K T
2015-02-11 17:38               ` Oleg Nesterov
2015-02-11 18:38                 ` Raghavendra K T

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54D87F1E.9060307@linux.vnet.ibm.com \
    --to=raghavendra.kt@linux.vnet.ibm.com \
    --cc=a.ryabinin@samsung.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=borntraeger@de.ibm.com \
    --cc=davej@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jasowang@redhat.com \
    --cc=jeremy@goop.org \
    --cc=konrad.wilk@oracle.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=oleg@redhat.com \
    --cc=paul.gortmaker@windriver.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=sasha.levin@oracle.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=waiman.long@hp.com \
    --cc=x86@kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).