linux-arch.vger.kernel.org archive mirror
From: Waiman Long <waiman.long@hp.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: x86@kernel.org, Gleb Natapov <gleb@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	linux-arch@vger.kernel.org, kvm@vger.kernel.org,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Ingo Molnar <mingo@redhat.com>,
	xen-devel@lists.xenproject.org,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Rik van Riel <riel@redhat.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Scott J Norton <scott.norton@hp.com>,
	Paolo Bonzini <paolo.bonzini@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	virtualization@lists.linux-foundation.org,
	Chegu Vinod <chegu_vinod@hp.com>, Oleg Nesterov <oleg@redhat.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v10 03/19] qspinlock: Add pending bit
Date: Mon, 19 May 2014 16:17:23 -0400	[thread overview]
Message-ID: <537A66D3.8070607@hp.com> (raw)
In-Reply-To: <20140514191339.GA22813@potion.brq.redhat.com>

On 05/14/2014 03:13 PM, Radim Krčmář wrote:
> 2014-05-14 19:00+0200, Peter Zijlstra:
>> On Wed, May 14, 2014 at 06:51:24PM +0200, Radim Krčmář wrote:
>>> Ok.
>>> I've seen merit in pvqspinlock even with slightly slower first-waiter,
>>> so I would have happily sacrificed those horrible branches.
>>> (I prefer elegant to optimized code, but I can see why we want to be
>>>   strictly better than ticketlock.)
>>> Peter mentioned that we are focusing on bare-metal patches, so I'll
>>> withhold my other paravirt rants until they are polished.
> (It was an ambiguous sentence, I have comments for later patches.)
>
>> Well, paravirt must happen too, but comes later in this series, patch 3
>> which we're replying to is still very much in the bare metal part of the
>> series.
> (I think that bare metal spans the first 7 patches.)
>
>> I've not had time yet to decode all that Waiman has done to make
>> paravirt work.
>>
>> But as a general rule I like patches that start with something simple
>> and working and then optimize it, this series doesn't seem to quite
>> grasp that.
>>
>>> And to forcefully bring this thread a little bit on-topic:
>>>
>>> Pending-bit is effectively a lock in a lock, so I was wondering why
>>> don't we use more pending bits; advantages are the same, just diminished
>>> by the probability of having an ideally contended lock:
>>>   - waiter won't be blocked on RAM access if critical section (or more)
>>>     ends sooner
>>>   - some unlucky cacheline is not forgotten
>>>   - faster unlock (no need for tail operations)
>>> (- ?)
>>> disadvantages are magnified:
>>>   - increased complexity
>>>   - intense cacheline sharing
>>>     (I thought that this is the main disadvantage of ticketlock.)
>>> (- ?)
>>>
>>> One bit still improved performance, is it the best we got?
>> So, the advantage of one bit is that if we use a whole byte for 1 bit we
>> can avoid some atomic ops.
>>
>> The entire reason for this in-word spinner is to amortize the cost of
>> hitting the external node cacheline.
> Every pending CPU removes one length of the critical section from the
> delay caused by cacheline propagation and really cold cache is
> hundreds(?) of cycles, so we could burn some to ensure correctness and
> still be waiting when the first pending CPU unlocks.

Given that taking a spinlock is a fairly frequent operation in the kernel, 
the node structure cacheline won't be so cold after all.

>> So traditional locks like test-and-test-and-set and the ticket lock only
>> ever access the spinlock word itself, this MCS style queueing lock has a
>> second (and, see my other rants in this thread, when done wrong more
>> than 2) cacheline to touch.
>>
>> That said, all our benchmarking is pretty much for the cache-hot case,
>> so I'm not entirely convinced yet that the one pending bit makes up for
>> it, it does in the cache-hot case.
> Yeah, we probably use the faster pre-lock quite a lot.
> Cover letter states that queue depth 1-3 is a bit slower than ticket
> spinlock, so we might not be losing if we implemented a faster
> in-word-lock of this capacity.  (Not that I'm a fan of the hybrid lock.)

I tried an experimental patch with 2 pending bits. However, the 
benchmark test that I used showed that performance was even worse than 
without any pending bit. I will need to revisit that later to understand 
why. For now, I will focus on having just one pending bit. If we find a 
way to get better performance out of more than one pending bit later on, 
we can always submit another patch to do that.
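
For reference, the pending-bit idea can be sketched with plain C11 
atomics. This is a simplified, hypothetical model of the lock word in 
this series: the names try_fast/lock_pending are mine, it handles only a 
single pending waiter, and the real code uses cmpxchg loops plus an MCS 
tail field that this sketch omits entirely.

```c
#include <stdatomic.h>

/* Hypothetical 32-bit lock word, loosely modeled on this series:
 * bits 0-7 hold the locked byte, bit 8 the pending bit; the MCS
 * tail (bits 16-31 in the real code) is omitted here. */
#define _Q_LOCKED_VAL  (1U << 0)
#define _Q_PENDING_VAL (1U << 8)

static atomic_uint lock_word;

/* Uncontended fast path: 0 -> locked with a single cmpxchg. */
static int try_fast(void)
{
	unsigned int old = 0;

	return atomic_compare_exchange_strong(&lock_word, &old,
					      _Q_LOCKED_VAL);
}

/* First-waiter path: set the pending bit and spin on the lock word
 * itself, so the external MCS node cacheline is never touched. */
static void lock_pending(void)
{
	atomic_fetch_or(&lock_word, _Q_PENDING_VAL);
	while (atomic_load(&lock_word) & _Q_LOCKED_VAL)
		;	/* wait for the holder to release */
	/* become the owner: set the locked byte, clear pending */
	atomic_store(&lock_word, _Q_LOCKED_VAL);
}

static void unlock(void)
{
	/* Because the locked bit owns a whole byte, the real series
	 * can release with a plain byte store; an atomic RMW is used
	 * here only to keep the sketch portable. */
	atomic_fetch_and(&lock_word, ~_Q_LOCKED_VAL);
}
```

A second pending bit would turn lock_pending into a small queue of its 
own, which is exactly where the extra complexity and cacheline sharing 
come from.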

>> But... writing cache-cold benchmarks is _hard_ :/
> Wouldn't clflush of the second cacheline before trying for the lock give
> us a rough estimate?

clflush is a very expensive operation, and I doubt that it would be 
indicative of real-life performance at all. BTW, there is no way to 
write a cache-cold benchmark for that 2nd cacheline, as any call to 
spin_lock will likely access it if there is enough contention.
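
That said, a rough clflush probe is easy to sketch if someone wants a 
ballpark number. This is x86-only and hypothetical: timed_access and the 
64-byte node stand-in are my own names, and rdtsc cycle counts here are 
only a crude estimate, not a benchmark.

```c
#include <stdint.h>
#include <x86intrin.h>		/* _mm_clflush, _mm_mfence, __rdtsc */

/* 64-byte stand-in for the external MCS node cacheline. */
static char node[64] __attribute__((aligned(64)));

/* Time one access to the line, optionally flushing it first so the
 * access sees a cold cache. */
static uint64_t timed_access(int flush_first)
{
	uint64_t t0;

	if (flush_first) {
		_mm_clflush(node);
		_mm_mfence();	/* make sure the flush completed */
	}
	t0 = __rdtsc();
	node[0]++;		/* touch the line */
	_mm_mfence();		/* order the access before rdtsc */
	return __rdtsc() - t0;
}
```

Comparing timed_access(1) against timed_access(0) shows the cold/hot 
gap, though the cost of clflush itself and of the fences will distort 
the picture, per the objection above.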

-Longman


Thread overview: 97+ messages
2014-05-07 15:01 [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
2014-05-07 15:01 ` [PATCH v10 01/19] qspinlock: A simple generic 4-byte queue spinlock Waiman Long
2014-05-07 15:01 ` [PATCH v10 02/19] qspinlock, x86: Enable x86-64 to use " Waiman Long
2014-05-07 15:01 ` [PATCH v10 03/19] qspinlock: Add pending bit Waiman Long
2014-05-08 18:57   ` Peter Zijlstra
2014-05-10  0:49     ` Waiman Long
2014-05-12 15:22   ` Radim Krčmář
2014-05-12 17:29     ` Peter Zijlstra
2014-05-13 19:47     ` Waiman Long
2014-05-14 16:51       ` Radim Krčmář
2014-05-14 17:00         ` Peter Zijlstra
2014-05-14 19:13           ` Radim Krčmář
2014-05-19 20:17             ` Waiman Long [this message]
     [not found]               ` <20140521164930.GA26199@potion.brq.redhat.com>
2014-05-21 17:02                 ` [RFC 08/07] qspinlock: integrate pending bit into queue Radim Krčmář
2014-05-07 15:01 ` [PATCH v10 04/19] qspinlock: Extract out the exchange of tail code word Waiman Long
2014-05-07 15:01 ` [PATCH v10 05/19] qspinlock: Optimize for smaller NR_CPUS Waiman Long
2014-05-07 15:01 ` [PATCH v10 06/19] qspinlock: prolong the stay in the pending bit path Waiman Long
2014-05-08 18:58   ` Peter Zijlstra
2014-05-10  0:58     ` Waiman Long
2014-05-10 13:38       ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 07/19] qspinlock: Use a simple write to grab the lock, if applicable Waiman Long
2014-05-08 19:00   ` Peter Zijlstra
2014-05-10  1:05     ` Waiman Long
2014-05-08 19:02   ` Peter Zijlstra
2014-05-10  1:06     ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 08/19] qspinlock: Make a new qnode structure to support virtualization Waiman Long
2014-05-08 19:04   ` Peter Zijlstra
2014-05-10  1:08     ` Waiman Long
2014-05-10 14:14       ` Peter Zijlstra
2014-05-10 18:21         ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 09/19] qspinlock: Prepare for unfair lock support Waiman Long
2014-05-08 19:06   ` Peter Zijlstra
2014-05-10  1:19     ` Waiman Long
2014-05-10 14:13       ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest Waiman Long
2014-05-08 19:12   ` Peter Zijlstra
2014-05-19 20:30     ` Waiman Long
2014-05-12 18:57   ` Radim Krčmář
2014-05-07 15:01 ` [PATCH v10 11/19] qspinlock: Split the MCS queuing code into a separate slowerpath Waiman Long
2014-05-07 15:01 ` [PATCH v10 12/19] unfair qspinlock: Variable frequency lock stealing mechanism Waiman Long
2014-05-08 19:19   ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 13/19] unfair qspinlock: Enable lock stealing in lock waiters Waiman Long
2014-05-07 15:01 ` [PATCH v10 14/19] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
2014-05-07 15:01 ` [PATCH v10 15/19] pvqspinlock, x86: Add PV data structure & methods Waiman Long
2014-05-07 15:01 ` [PATCH v10 16/19] pvqspinlock: Enable coexistence with the unfair lock Waiman Long
2014-05-07 15:01 ` [PATCH v10 17/19] pvqspinlock: Add qspinlock para-virtualization support Waiman Long
2014-05-07 15:01 ` [PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock PV for KVM Waiman Long
2014-05-07 19:07   ` Konrad Rzeszutek Wilk
2014-05-08 17:54     ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 19/19] pvqspinlock, x86: Enable PV qspinlock for XEN Waiman Long
2014-05-07 19:07 ` [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support Konrad Rzeszutek Wilk
2014-05-08 17:54   ` Waiman Long
