linux-arch.vger.kernel.org archive mirror
From: Waiman Long <waiman.long@hp.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: x86@kernel.org, Gleb Natapov <gleb@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	linux-arch@vger.kernel.org, kvm@vger.kernel.org,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Ingo Molnar <mingo@redhat.com>,
	xen-devel@lists.xenproject.org,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Rik van Riel <riel@redhat.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Scott J Norton <scott.norton@hp.com>,
	Paolo Bonzini <paolo.bonzini@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	virtualization@lists.linux-foundation.org,
	Chegu Vinod <chegu_vinod@hp.com>, Oleg Nesterov <oleg@redhat.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v10 03/19] qspinlock: Add pending bit
Date: Mon, 19 May 2014 16:17:23 -0400	[thread overview]
Message-ID: <537A66D3.8070607@hp.com> (raw)
In-Reply-To: <20140514191339.GA22813@potion.brq.redhat.com>

On 05/14/2014 03:13 PM, Radim Krčmář wrote:
> 2014-05-14 19:00+0200, Peter Zijlstra:
>> On Wed, May 14, 2014 at 06:51:24PM +0200, Radim Krčmář wrote:
>>> Ok.
>>> I've seen merit in pvqspinlock even with slightly slower first-waiter,
>>> so I would have happily sacrificed those horrible branches.
>>> (I prefer elegant to optimized code, but I can see why we want to be
>>>   strictly better than ticketlock.)
>>> Peter mentioned that we are focusing on bare-metal patches, so I'll
>>> withhold my other paravirt rants until they are polished.
> (It was an ambiguous sentence, I have comments for later patches.)
>
>> Well, paravirt must happen too, but comes later in this series, patch 3
>> which we're replying to is still very much in the bare metal part of the
>> series.
> (I think that bare metal spans the first 7 patches.)
>
>> I've not had time yet to decode all that Waiman has done to make
>> paravirt work.
>>
>> But as a general rule I like patches that start with something simple
>> and working and then optimize it, this series doesn't seem to quite
>> grasp that.
>>
>>> And to forcefully bring this thread a little bit on-topic:
>>>
>>> Pending-bit is effectively a lock in a lock, so I was wondering why
>>> don't we use more pending bits; advantages are the same, just diminished
>>> by the probability of having an ideally contended lock:
>>>   - waiter won't be blocked on RAM access if critical section (or more)
>>>     ends sooner
>>>   - some unlucky cacheline is not forgotten
>>>   - faster unlock (no need for tail operations)
>>> (- ?)
>>> disadvantages are magnified:
>>>   - increased complexity
>>>   - intense cacheline sharing
>>>     (I thought that this is the main disadvantage of ticketlock.)
>>> (- ?)
>>>
>>> One bit still improved performance, is it the best we got?
>> So, the advantage of one bit is that if we use a whole byte for 1 bit we
>> can avoid some atomic ops.
>>
>> The entire reason for this in-word spinner is to amortize the cost of
>> hitting the external node cacheline.
> Every pending CPU removes one length of the critical section from the
> delay caused by cacheline propagation and really cold cache is
> hundreds(?) of cycles, so we could burn some to ensure correctness and
> still be waiting when the first pending CPU unlocks.

Given that taking a spinlock is a fairly frequent operation in the kernel, 
the node structure cacheline won't be so cold after all.

>> So traditional locks like test-and-test-and-set and the ticket lock only
>> ever access the spinlock word itself, this MCS style queueing lock has a
>> second (and, see my other rants in this thread, when done wrong more
>> than 2) cacheline to touch.
>>
>> That said, all our benchmarking is pretty much for the cache-hot case,
>> so I'm not entirely convinced yet that the one pending bit makes up for
>> it, it does in the cache-hot case.
> Yeah, we probably use the faster pre-lock quite a lot.
> Cover letter states that queue depth 1-3 is a bit slower than ticket
> spinlock, so we might not be losing if we implemented a faster
> in-word-lock of this capacity.  (Not that I'm a fan of the hybrid lock.)

I tried an experimental patch with 2 pending bits. However, the 
benchmark test that I used showed that performance was even worse than 
without any pending bit. I will need to revisit that later to understand 
why. For now, I will focus on having just one pending bit. If we find a 
way to get better performance out of more than one pending bit later on, 
we can always submit another patch to do that.
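
For reference, the pending-bit idea can be sketched with plain C11 
atomics. This is a simplified, hypothetical model of the lock word in 
this series: the names try_fast/lock_pending are mine, it handles only a 
single pending waiter, and the real code uses cmpxchg loops plus an MCS 
tail field that this sketch omits entirely.

```c
#include <stdatomic.h>

/* Hypothetical 32-bit lock word, loosely modeled on this series:
 * bits 0-7 hold the locked byte, bit 8 the pending bit; the MCS
 * tail (bits 16-31 in the real code) is omitted here. */
#define _Q_LOCKED_VAL  (1U << 0)
#define _Q_PENDING_VAL (1U << 8)

static atomic_uint lock_word;

/* Uncontended fast path: 0 -> locked with a single cmpxchg. */
static int try_fast(void)
{
	unsigned int old = 0;

	return atomic_compare_exchange_strong(&lock_word, &old,
					      _Q_LOCKED_VAL);
}

/* First-waiter path: set the pending bit and spin on the lock word
 * itself, so the external MCS node cacheline is never touched. */
static void lock_pending(void)
{
	atomic_fetch_or(&lock_word, _Q_PENDING_VAL);
	while (atomic_load(&lock_word) & _Q_LOCKED_VAL)
		;	/* wait for the holder to release */
	/* become the owner: set the locked byte, clear pending */
	atomic_store(&lock_word, _Q_LOCKED_VAL);
}

static void unlock(void)
{
	/* Because the locked bit owns a whole byte, the real series
	 * can release with a plain byte store; an atomic RMW is used
	 * here only to keep the sketch portable. */
	atomic_fetch_and(&lock_word, ~_Q_LOCKED_VAL);
}
```

A second pending bit would turn lock_pending into a small queue of its 
own, which is exactly where the extra complexity and cacheline sharing 
come from.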

>> But... writing cache-cold benchmarks is _hard_ :/
> Wouldn't clflush of the second cacheline before trying for the lock give
> us a rough estimate?

clflush is a very expensive operation, and I doubt that it would be 
indicative of real-life performance at all. BTW, there is no way to 
write a cache-cold benchmark for that 2nd cacheline, as any call to 
spin_lock will likely access it if there is enough contention.
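
That said, a rough clflush probe is easy to sketch if someone wants a 
ballpark number. This is x86-only and hypothetical: timed_access and the 
64-byte node stand-in are my own names, and rdtsc cycle counts here are 
only a crude estimate, not a benchmark.

```c
#include <stdint.h>
#include <x86intrin.h>		/* _mm_clflush, _mm_mfence, __rdtsc */

/* 64-byte stand-in for the external MCS node cacheline. */
static char node[64] __attribute__((aligned(64)));

/* Time one access to the line, optionally flushing it first so the
 * access sees a cold cache. */
static uint64_t timed_access(int flush_first)
{
	uint64_t t0;

	if (flush_first) {
		_mm_clflush(node);
		_mm_mfence();	/* make sure the flush completed */
	}
	t0 = __rdtsc();
	node[0]++;		/* touch the line */
	_mm_mfence();		/* order the access before rdtsc */
	return __rdtsc() - t0;
}
```

Comparing timed_access(1) against timed_access(0) shows the cold/hot 
gap, though the cost of clflush itself and of the fences will distort 
the picture, per the objection above.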

-Longman


Thread overview: 97+ messages
2014-05-07 15:01 [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
2014-05-07 15:01 ` [PATCH v10 01/19] qspinlock: A simple generic 4-byte queue spinlock Waiman Long
2014-05-07 15:01 ` [PATCH v10 02/19] qspinlock, x86: Enable x86-64 to use " Waiman Long
2014-05-07 15:01 ` [PATCH v10 03/19] qspinlock: Add pending bit Waiman Long
2014-05-08 18:57   ` Peter Zijlstra
2014-05-10  0:49     ` Waiman Long
2014-05-12 15:22   ` Radim Krčmář
2014-05-12 17:29     ` Peter Zijlstra
2014-05-13 19:47     ` Waiman Long
2014-05-14 16:51       ` Radim Krčmář
2014-05-14 17:00         ` Peter Zijlstra
2014-05-14 19:13           ` Radim Krčmář
2014-05-19 20:17             ` Waiman Long [this message]
     [not found]               ` <20140521164930.GA26199@potion.brq.redhat.com>
2014-05-21 17:02                 ` [RFC 08/07] qspinlock: integrate pending bit into queue Radim Krčmář
2014-05-07 15:01 ` [PATCH v10 04/19] qspinlock: Extract out the exchange of tail code word Waiman Long
2014-05-07 15:01 ` [PATCH v10 05/19] qspinlock: Optimize for smaller NR_CPUS Waiman Long
2014-05-07 15:01 ` [PATCH v10 06/19] qspinlock: prolong the stay in the pending bit path Waiman Long
2014-05-08 18:58   ` Peter Zijlstra
2014-05-10  0:58     ` Waiman Long
2014-05-10 13:38       ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 07/19] qspinlock: Use a simple write to grab the lock, if applicable Waiman Long
2014-05-08 19:00   ` Peter Zijlstra
2014-05-10  1:05     ` Waiman Long
2014-05-08 19:02   ` Peter Zijlstra
2014-05-10  1:06     ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 08/19] qspinlock: Make a new qnode structure to support virtualization Waiman Long
2014-05-08 19:04   ` Peter Zijlstra
2014-05-10  1:08     ` Waiman Long
2014-05-10 14:14       ` Peter Zijlstra
2014-05-10 18:21         ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 09/19] qspinlock: Prepare for unfair lock support Waiman Long
2014-05-08 19:06   ` Peter Zijlstra
2014-05-10  1:19     ` Waiman Long
2014-05-10 14:13       ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest Waiman Long
2014-05-08 19:12   ` Peter Zijlstra
2014-05-19 20:30     ` Waiman Long
2014-05-12 18:57   ` Radim Krčmář
2014-05-07 15:01 ` [PATCH v10 11/19] qspinlock: Split the MCS queuing code into a separate slowerpath Waiman Long
2014-05-07 15:01 ` [PATCH v10 12/19] unfair qspinlock: Variable frequency lock stealing mechanism Waiman Long
2014-05-08 19:19   ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 13/19] unfair qspinlock: Enable lock stealing in lock waiters Waiman Long
2014-05-07 15:01 ` [PATCH v10 14/19] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
2014-05-07 15:01 ` [PATCH v10 15/19] pvqspinlock, x86: Add PV data structure & methods Waiman Long
2014-05-07 15:01 ` [PATCH v10 16/19] pvqspinlock: Enable coexistence with the unfair lock Waiman Long
2014-05-07 15:01 ` [PATCH v10 17/19] pvqspinlock: Add qspinlock para-virtualization support Waiman Long
2014-05-07 15:01 ` [PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock PV for KVM Waiman Long
2014-05-07 19:07   ` Konrad Rzeszutek Wilk
2014-05-08 17:54     ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 19/19] pvqspinlock, x86: Enable PV qspinlock for XEN Waiman Long
2014-05-07 19:07 ` [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support Konrad Rzeszutek Wilk
2014-05-08 17:54   ` Waiman Long
