linux-arch.vger.kernel.org archive mirror
From: Waiman Long <waiman.long@hp.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	virtualization@lists.linux-foundation.org,
	Andi Kleen <andi@firstfloor.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Michel Lespinasse <walken@google.com>,
	Alok Kataria <akataria@vmware.com>,
	linux-arch@vger.kernel.org, x86@kernel.org,
	Ingo Molnar <mingo@redhat.com>,
	Scott J Norton <scott.norton@hp.com>,
	xen-devel@lists.xenproject.org,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Alexander Fyodorov <halcy@yandex.ru>,
	Rik van Riel <riel@redhat.com>, Arnd Bergmann <arnd@arndb.de>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Daniel J Blueman <daniel@numascale.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Chris Wright <chrisw@sous-sol.org>,
	George Spelvin <linux@horizon.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Date: Fri, 28 Feb 2014 11:38:25 -0500	[thread overview]
Message-ID: <5310BB81.3090508@hp.com> (raw)
In-Reply-To: <20140228092945.GG27965@twins.programming.kicks-ass.net>


On 02/28/2014 04:29 AM, Peter Zijlstra wrote:
> On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:
>>>> +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
>>>> +
>>>> +	if (old == 0) {
>>>> +		/*
>>>> +		 * Got the lock, can clear the waiting bit now
>>>> +		 */
>>>> +		smp_u8_store_release(&qlock->wait, 0);
>>> So we just did an atomic op, and now you're trying to optimize this
>>> write. Why do you need a whole byte for that?
>>>
>>> Surely a cmpxchg loop with the right atomic op can't be _that_ much
>>> slower? Its far more readable and likely avoids that steal fail below as
>>> well.
>> At low contention levels, atomic operations that require a lock prefix are
>> the major contributor to the total execution time. I have seen estimates online
>> that a lock-prefixed instruction can easily take 50X longer to execute
>> than a regular instruction that can be pipelined. That is why I try to do it
>> with as few lock-prefixed instructions as possible. If I have to do an atomic
>> cmpxchg, it probably won't be faster than the regular qspinlock slowpath.
> At low contention the cmpxchg won't have to be retried (much) so using
> it won't be a problem and you get to have arbitrary atomic ops.
>
>> Given that speed at the low contention level, which is the common case, is
>> important to get this patch accepted, I have to do what I can to make it run
>> as fast as possible for this 2-contending-task case.
> What I'm saying is that you can do the whole thing with a single
> cmpxchg. No extra ops needed. And at that point you don't need a whole
> byte, you can use a single bit.
>
> that removes the whole NR_CPUS dependent logic.

After modifying it to do a deterministic cmpxchg, the test run time of 2 
contending tasks jumps from 600ms (best case) to about 1700ms, which is 
worse than the original qspinlock's 1300-1500ms. It is the opportunistic 
nature of the xchg() code, which can potentially combine multiple steps 
of the deterministic atomic sequence, that saves time. Without that, I 
would rather go back to the basic qspinlock queuing sequence for 2 
contending tasks.
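The two acquisition strategies under discussion can be sketched in userspace with C11 atomics (the field and constant names below are hypothetical stand-ins for the patch's qlock->lock_wait word and _QSPINLOCK_* bits, not the actual kernel code):

```c
#include <stdint.h>
#include <stdatomic.h>

#define QSL_LOCKED   0x01   /* lock held (hypothetical bit values) */
#define QSL_WAITING  0x02   /* a second task is spinning for the lock */

/* Opportunistic path: a single xchg publishes the waiting bit and an
 * acquisition attempt in one atomic operation.  When the lock happens
 * to be free, both steps are combined into one locked instruction. */
static int try_lock_xchg(_Atomic uint16_t *lock_wait)
{
    uint16_t old = atomic_exchange(lock_wait, QSL_WAITING | QSL_LOCKED);

    if (old == 0) {
        /* Got the lock; the waiting bit can be cleared with a plain
         * release store, no second lock-prefixed instruction. */
        atomic_store_explicit(lock_wait, QSL_LOCKED, memory_order_release);
        return 1;
    }
    return 0;   /* lock held (or another waiter present); spin again */
}

/* Deterministic alternative: a cmpxchg that only succeeds when the
 * lock word is exactly 0, so every attempt is a full locked RMW. */
static int try_lock_cmpxchg(_Atomic uint16_t *lock_wait)
{
    uint16_t expected = 0;
    return atomic_compare_exchange_strong(lock_wait, &expected, QSL_LOCKED);
}
```

In the xchg variant a failed attempt still leaves the waiting bit set as a side effect, which is exactly the step-combining the timing numbers above come from; the cmpxchg variant has to perform that bookkeeping as separate atomic operations.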

Please take a look at the performance data in my patch 3 to see if the 
slowdown at 2 and 3 contending tasks is acceptable or not.

The reason why I need a whole byte for the lock bit is the simple 
unlock code: the lock holder just assigns 0 to the lock byte. 
Using other bits in the low byte for other purposes would complicate 
the unlock path and slow down the no-contention case.
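The byte-wide layout keeps the uncontended unlock down to a single plain store. A minimal sketch of the idea, again with hypothetical names and a modeled struct layout rather than the patch's actual definitions:

```c
#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical layout: the locked flag exclusively owns its own byte
 * of the 32-bit lock word, so no other state shares it. */
struct qspinlock_model {
    _Atomic uint8_t locked;   /* 0 = free, 1 = held */
    uint8_t         waiting;  /* pending-waiter byte (separate byte) */
    uint16_t        tail;     /* queue tail; unused in this sketch */
};

static void queue_spin_unlock(struct qspinlock_model *lock)
{
    /* Because the holder exclusively owns this byte, releasing the
     * lock needs only a release store, not a lock-prefixed RMW. */
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

If other flags lived in the same byte, this store would clobber them and the unlock path would need a read-modify-write (e.g. an atomic AND), reintroducing a lock-prefixed instruction on the no-contention fast path.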

>>>> +		/*
>>>> +		 * Someone has stolen the lock, so wait again
>>>> +		 */
>>>> +		goto try_again;
>>> That's just a fail.. steals should not ever be allowed. It's a fair lock
>>> after all.
>> The code is unfair, but this unfairness helps it run faster than the
>> ticket spinlock in this particular case. The regular qspinlock slowpath
>> is fair; a little bit of unfairness in this particular case improves its speed.
> *groan*, no, unfairness not cool. ticket lock is absolutely fair; we
> should preserve this.

We can preserve that by removing patch 3.

> BTW; can you share your benchmark thingy?

I have attached the test program that I used to generate the timing data 
for patch 3.

-Longman



[-- Attachment #2: locktest.tar.gz --]
[-- Type: application/x-gzip, Size: 5244 bytes --]
