From: Waiman Long <waiman.long@hp.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: x86@kernel.org, Gleb Natapov <gleb@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
linux-kernel@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
Boris Ostrovsky <boris.ostrovsky@oracle.com>,
linux-arch@vger.kernel.org, kvm@vger.kernel.org,
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
Ingo Molnar <mingo@redhat.com>,
xen-devel@lists.xenproject.org,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Rik van Riel <riel@redhat.com>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Scott J Norton <scott.norton@hp.com>,
Paolo Bonzini <paolo.bonzini@gmail.com>,
Thomas Gleixner <tglx@linutronix.de>,
virtualization@lists.linux-foundation.org,
Chegu Vinod <chegu_vinod@hp.com>, Oleg Nesterov <oleg@redhat.com>,
David Vrabel <david.vrabel@citrix.com>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v10 03/19] qspinlock: Add pending bit
Date: Mon, 19 May 2014 16:17:23 -0400 [thread overview]
Message-ID: <537A66D3.8070607@hp.com> (raw)
In-Reply-To: <20140514191339.GA22813@potion.brq.redhat.com>
On 05/14/2014 03:13 PM, Radim Krčmář wrote:
> 2014-05-14 19:00+0200, Peter Zijlstra:
>> On Wed, May 14, 2014 at 06:51:24PM +0200, Radim Krčmář wrote:
>>> Ok.
>>> I've seen merit in pvqspinlock even with slightly slower first-waiter,
>>> so I would have happily sacrificed those horrible branches.
>>> (I prefer elegant to optimized code, but I can see why we want to be
>>> strictly better than ticketlock.)
>>> Peter mentioned that we are focusing on bare-metal patches, so I'll
>>> withold my other paravirt rants until they are polished.
> (It was an ambiguous sentence, I have comments for later patches.)
>
>> Well, paravirt must happen too, but comes later in this series, patch 3
>> which we're replying to is still very much in the bare metal part of the
>> series.
> (I think that bare metal spans the first 7 patches.)
>
>> I've not had time yet to decode all that Waiman has done to make
>> paravirt work.
>>
>> But as a general rule I like patches that start with something simple
>> and working and then optimize it, this series doesn't seem to quite
>> grasp that.
>>
>>> And to forcefully bring this thread a little bit on-topic:
>>>
>>> Pending-bit is effectively a lock in a lock, so I was wondering why
>>> don't we use more pending bits; advantages are the same, just diminished
>>> by the probability of having an ideally contended lock:
>>> - waiter won't be blocked on RAM access if critical section (or more)
>>> ends sooner
>>> - some unlucky cacheline is not forgotten
>>> - faster unlock (no need for tail operations)
>>> (- ?)
>>> disadvantages are magnified:
>>> - increased complexity
>>> - intense cacheline sharing
>>> (I thought that this is the main disadvantage of ticketlock.)
>>> (- ?)
>>>
>>> One bit still improved performance, is it the best we got?
>> So, the advantage of one bit is that if we use a whole byte for 1 bit we
>> can avoid some atomic ops.
>>
>> The entire reason for this in-word spinner is to amortize the cost of
>> hitting the external node cacheline.
> Every pending CPU removes one length of the critical section from the
> delay caused by cacheline propagation and really cold cache is
> hundreds(?) of cycles, so we could burn some to ensure correctness and
> still be waiting when the first pending CPU unlocks.
Assuming that taking a spinlock is fairly frequent in the kernel, the
node structure cacheline won't be so cold after all.
>> So traditional locks like test-and-test and the ticket lock only ever
>> access the spinlock word itsef, this MCS style queueing lock has a
>> second (and, see my other rants in this thread, when done wrong more
>> than 2) cacheline to touch.
>>
>> That said, all our benchmarking is pretty much for the cache-hot case,
>> so I'm not entirely convinced yet that the one pending bit makes up for
>> it, it does in the cache-hot case.
> Yeah, we probably use the faster pre-lock quite a lot.
> Cover letter states that queue depth 1-3 is a bit slower than ticket
> spinlock, so we might not be losing if we implemented a faster
> in-word-lock of this capacity. (Not that I'm a fan of the hybrid lock.)
I had tried an experimental patch with 2 pending bits. However, the
benchmark test that I used show the performance is even worse than
without any pending bit. I probably need to revisit that later as to why
this is the case. As for now, I will focus on just having one pending
bit. If we could find a way to get better performance out of more than 1
pending bit later on, we could always submit another patch to do that.
>> But... writing cache-cold benchmarks is _hard_ :/
> Wouldn't clflush of the second cacheline before trying for the lock give
> us a rough estimate?
clflush is a very expensive operation and I doubt that it will be
indicative of real life performance at all. BTW, there is no way to
write a cache-cold benchmark for that 2nd cacheline as any call to
spin_lock will likely to access it if there is enough contention.
-Longman
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
WARNING: multiple messages have this Message-ID (diff)
From: Waiman Long <waiman.long@hp.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
linux-arch@vger.kernel.org, x86@kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org,
xen-devel@lists.xenproject.org, kvm@vger.kernel.org,
Paolo Bonzini <paolo.bonzini@gmail.com>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Boris Ostrovsky <boris.ostrovsky@oracle.com>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Rik van Riel <riel@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
David Vrabel <david.vrabel@citrix.com>,
Oleg Nesterov <oleg@redhat.com>, Gleb Natapov <gleb@redhat.com>,
Scott J Norton <scott.norton@hp.com>,
Chegu Vinod <chegu_vinod@hp.com>
Subject: Re: [PATCH v10 03/19] qspinlock: Add pending bit
Date: Mon, 19 May 2014 16:17:23 -0400 [thread overview]
Message-ID: <537A66D3.8070607@hp.com> (raw)
Message-ID: <20140519201723.PfMagbhP7zGVxDuwdl_ZczzoDnjzEO7XczDP9yCojXw@z> (raw)
In-Reply-To: <20140514191339.GA22813@potion.brq.redhat.com>
On 05/14/2014 03:13 PM, Radim Krčmář wrote:
> 2014-05-14 19:00+0200, Peter Zijlstra:
>> On Wed, May 14, 2014 at 06:51:24PM +0200, Radim Krčmář wrote:
>>> Ok.
>>> I've seen merit in pvqspinlock even with slightly slower first-waiter,
>>> so I would have happily sacrificed those horrible branches.
>>> (I prefer elegant to optimized code, but I can see why we want to be
>>> strictly better than ticketlock.)
>>> Peter mentioned that we are focusing on bare-metal patches, so I'll
>>> withold my other paravirt rants until they are polished.
> (It was an ambiguous sentence, I have comments for later patches.)
>
>> Well, paravirt must happen too, but comes later in this series, patch 3
>> which we're replying to is still very much in the bare metal part of the
>> series.
> (I think that bare metal spans the first 7 patches.)
>
>> I've not had time yet to decode all that Waiman has done to make
>> paravirt work.
>>
>> But as a general rule I like patches that start with something simple
>> and working and then optimize it, this series doesn't seem to quite
>> grasp that.
>>
>>> And to forcefully bring this thread a little bit on-topic:
>>>
>>> Pending-bit is effectively a lock in a lock, so I was wondering why
>>> don't we use more pending bits; advantages are the same, just diminished
>>> by the probability of having an ideally contended lock:
>>> - waiter won't be blocked on RAM access if critical section (or more)
>>> ends sooner
>>> - some unlucky cacheline is not forgotten
>>> - faster unlock (no need for tail operations)
>>> (- ?)
>>> disadvantages are magnified:
>>> - increased complexity
>>> - intense cacheline sharing
>>> (I thought that this is the main disadvantage of ticketlock.)
>>> (- ?)
>>>
>>> One bit still improved performance, is it the best we got?
>> So, the advantage of one bit is that if we use a whole byte for 1 bit we
>> can avoid some atomic ops.
>>
>> The entire reason for this in-word spinner is to amortize the cost of
>> hitting the external node cacheline.
> Every pending CPU removes one length of the critical section from the
> delay caused by cacheline propagation and really cold cache is
> hundreds(?) of cycles, so we could burn some to ensure correctness and
> still be waiting when the first pending CPU unlocks.
Assuming that taking a spinlock is fairly frequent in the kernel, the
node structure cacheline won't be so cold after all.
>> So traditional locks like test-and-test and the ticket lock only ever
>> access the spinlock word itsef, this MCS style queueing lock has a
>> second (and, see my other rants in this thread, when done wrong more
>> than 2) cacheline to touch.
>>
>> That said, all our benchmarking is pretty much for the cache-hot case,
>> so I'm not entirely convinced yet that the one pending bit makes up for
>> it, it does in the cache-hot case.
> Yeah, we probably use the faster pre-lock quite a lot.
> Cover letter states that queue depth 1-3 is a bit slower than ticket
> spinlock, so we might not be losing if we implemented a faster
> in-word-lock of this capacity. (Not that I'm a fan of the hybrid lock.)
I had tried an experimental patch with 2 pending bits. However, the
benchmark test that I used show the performance is even worse than
without any pending bit. I probably need to revisit that later as to why
this is the case. As for now, I will focus on just having one pending
bit. If we could find a way to get better performance out of more than 1
pending bit later on, we could always submit another patch to do that.
>> But... writing cache-cold benchmarks is _hard_ :/
> Wouldn't clflush of the second cacheline before trying for the lock give
> us a rough estimate?
clflush is a very expensive operation and I doubt that it will be
indicative of real life performance at all. BTW, there is no way to
write a cache-cold benchmark for that 2nd cacheline as any call to
spin_lock will likely to access it if there is enough contention.
-Longman
next prev parent reply other threads:[~2014-05-19 20:17 UTC|newest]
Thread overview: 163+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-05-07 15:01 [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 01/19] qspinlock: A simple generic 4-byte queue spinlock Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 02/19] qspinlock, x86: Enable x86-64 to use " Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 03/19] qspinlock: Add pending bit Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-08 18:57 ` Peter Zijlstra
2014-05-08 18:57 ` Peter Zijlstra
2014-05-10 0:49 ` Waiman Long
2014-05-10 0:49 ` Waiman Long
2014-05-10 0:49 ` Waiman Long
2014-05-08 18:57 ` Peter Zijlstra
2014-05-12 15:22 ` Radim Krčmář
2014-05-12 17:29 ` Peter Zijlstra
2014-05-12 17:29 ` Peter Zijlstra
2014-05-12 17:29 ` Peter Zijlstra
2014-05-13 19:47 ` Waiman Long
2014-05-13 19:47 ` Waiman Long
2014-05-14 16:51 ` Radim Krčmář
2014-05-14 17:00 ` Peter Zijlstra
2014-05-14 17:00 ` Peter Zijlstra
2014-05-14 19:13 ` Radim Krčmář
2014-05-14 19:13 ` Radim Krčmář
2014-05-14 19:13 ` Radim Krčmář
2014-05-19 20:17 ` Waiman Long
2014-05-19 20:17 ` Waiman Long [this message]
2014-05-19 20:17 ` Waiman Long
[not found] ` <20140521164930.GA26199@potion.brq.redhat.com>
2014-05-21 17:02 ` [RFC 08/07] qspinlock: integrate pending bit into queue Radim Krčmář
2014-05-21 17:02 ` Radim Krčmář
2014-05-21 17:02 ` Radim Krčmář
2014-05-14 17:00 ` [PATCH v10 03/19] qspinlock: Add pending bit Peter Zijlstra
2014-05-14 16:51 ` Radim Krčmář
2014-05-14 16:51 ` Radim Krčmář
2014-05-13 19:47 ` Waiman Long
2014-05-12 15:22 ` Radim Krčmář
2014-05-12 15:22 ` Radim Krčmář
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 04/19] qspinlock: Extract out the exchange of tail code word Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 05/19] qspinlock: Optimize for smaller NR_CPUS Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 06/19] qspinlock: prolong the stay in the pending bit path Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-08 18:58 ` Peter Zijlstra
2014-05-08 18:58 ` Peter Zijlstra
2014-05-10 0:58 ` Waiman Long
2014-05-10 0:58 ` Waiman Long
2014-05-10 0:58 ` Waiman Long
2014-05-10 13:38 ` Peter Zijlstra
2014-05-10 13:38 ` Peter Zijlstra
2014-05-10 13:38 ` Peter Zijlstra
2014-05-08 18:58 ` Peter Zijlstra
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 07/19] qspinlock: Use a simple write to grab the lock, if applicable Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-08 19:00 ` Peter Zijlstra
2014-05-08 19:00 ` Peter Zijlstra
2014-05-08 19:00 ` Peter Zijlstra
2014-05-10 1:05 ` Waiman Long
2014-05-10 1:05 ` Waiman Long
2014-05-10 1:05 ` Waiman Long
2014-05-08 19:02 ` Peter Zijlstra
2014-05-08 19:02 ` Peter Zijlstra
2014-05-08 19:02 ` Peter Zijlstra
2014-05-10 1:06 ` Waiman Long
2014-05-10 1:06 ` Waiman Long
2014-05-10 1:06 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 08/19] qspinlock: Make a new qnode structure to support virtualization Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-08 19:04 ` Peter Zijlstra
2014-05-08 19:04 ` Peter Zijlstra
2014-05-10 1:08 ` Waiman Long
2014-05-10 1:08 ` Waiman Long
2014-05-10 1:08 ` Waiman Long
2014-05-10 14:14 ` Peter Zijlstra
2014-05-10 14:14 ` Peter Zijlstra
2014-05-10 18:21 ` Peter Zijlstra
2014-05-10 18:21 ` Peter Zijlstra
2014-05-10 18:21 ` Peter Zijlstra
2014-05-10 14:14 ` Peter Zijlstra
2014-05-08 19:04 ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 09/19] qspinlock: Prepare for unfair lock support Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-08 19:06 ` Peter Zijlstra
2014-05-08 19:06 ` Peter Zijlstra
2014-05-10 1:19 ` Waiman Long
2014-05-10 14:13 ` Peter Zijlstra
2014-05-10 14:13 ` Peter Zijlstra
2014-05-10 14:13 ` Peter Zijlstra
2014-05-10 1:19 ` Waiman Long
2014-05-10 1:19 ` Waiman Long
2014-05-08 19:06 ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-08 19:12 ` Peter Zijlstra
2014-05-08 19:12 ` Peter Zijlstra
2014-05-19 20:30 ` Waiman Long
2014-05-19 20:30 ` Waiman Long
2014-05-19 20:30 ` Waiman Long
2014-05-08 19:12 ` Peter Zijlstra
2014-05-12 18:57 ` Radim Krčmář
2014-05-12 18:57 ` Radim Krčmář
2014-05-12 18:57 ` Radim Krčmář
2014-05-07 15:01 ` [PATCH v10 11/19] qspinlock: Split the MCS queuing code into a separate slowerpath Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 12/19] unfair qspinlock: Variable frequency lock stealing mechanism Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-08 19:19 ` Peter Zijlstra
2014-05-08 19:19 ` Peter Zijlstra
2014-05-08 19:19 ` Peter Zijlstra
2014-05-07 15:01 ` [PATCH v10 13/19] unfair qspinlock: Enable lock stealing in lock waiters Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 14/19] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 15/19] pvqspinlock, x86: Add PV data structure & methods Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 16/19] pvqspinlock: Enable coexistence with the unfair lock Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 17/19] pvqspinlock: Add qspinlock para-virtualization support Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` [PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock PV for KVM Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 19:07 ` Konrad Rzeszutek Wilk
2014-05-07 19:07 ` Konrad Rzeszutek Wilk
2014-05-08 17:54 ` Waiman Long
2014-05-08 17:54 ` Waiman Long
2014-05-08 17:54 ` Waiman Long
2014-05-07 19:07 ` Konrad Rzeszutek Wilk
2014-05-07 15:01 ` [PATCH v10 19/19] pvqspinlock, x86: Enable PV qspinlock for XEN Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 15:01 ` Waiman Long
2014-05-07 19:07 ` [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support Konrad Rzeszutek Wilk
2014-05-07 19:07 ` Konrad Rzeszutek Wilk
2014-05-08 17:54 ` Waiman Long
2014-05-08 17:54 ` Waiman Long
2014-05-08 17:54 ` Waiman Long
2014-05-07 19:07 ` Konrad Rzeszutek Wilk
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=537A66D3.8070607@hp.com \
--to=waiman.long@hp.com \
--cc=boris.ostrovsky@oracle.com \
--cc=chegu_vinod@hp.com \
--cc=david.vrabel@citrix.com \
--cc=gleb@redhat.com \
--cc=hpa@zytor.com \
--cc=konrad.wilk@oracle.com \
--cc=kvm@vger.kernel.org \
--cc=linux-arch@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=oleg@redhat.com \
--cc=paolo.bonzini@gmail.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=raghavendra.kt@linux.vnet.ibm.com \
--cc=riel@redhat.com \
--cc=rkrcmar@redhat.com \
--cc=scott.norton@hp.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=virtualization@lists.linux-foundation.org \
--cc=x86@kernel.org \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.