* Question about cacheline bounching with percpu-rwsem and rcu-sync
@ 2019-05-31 13:10 Joel Fernandes
2019-05-31 13:45 ` Oleg Nesterov
2019-05-31 13:50 ` Paul E. McKenney
0 siblings, 2 replies; 8+ messages in thread
From: Joel Fernandes @ 2019-05-31 13:10 UTC (permalink / raw)
To: Oleg Nesterov, Eric Dumazet, Paul E. McKenney; +Cc: rcu
Hi,
The documentation on the rationale for percpu-rwsem says:
The problem with traditional read-write semaphores is that when multiple
cores take the lock for reading, the cache line containing the semaphore
is bouncing between L1 caches of the cores, causing performance
degradation.
However, it appears to me that the "rss" element of struct
percpu_rw_semaphore, which is used by RCU-sync, is not a per-CPU element.
So even in the fastpath case (only readers and no writers), the cacheline
containing rss is shared and will bounce among multiple CPUs. For that
matter, even the cacheline containing the percpu_rw_semaphore itself will
bounce among multiple reader CPUs.
So how does percpu-rwsem eliminate cache line bouncing in the common
case? Could you let me know what I am missing?
Thanks a lot.
* Re: Question about cacheline bounching with percpu-rwsem and rcu-sync
2019-05-31 13:10 Question about cacheline bounching with percpu-rwsem and rcu-sync Joel Fernandes
@ 2019-05-31 13:45 ` Oleg Nesterov
2019-05-31 14:42 ` Joel Fernandes
2019-05-31 13:50 ` Paul E. McKenney
1 sibling, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2019-05-31 13:45 UTC (permalink / raw)
To: Joel Fernandes; +Cc: Eric Dumazet, Paul E. McKenney, rcu
On 05/31, Joel Fernandes wrote:
>
> The problem with traditional read-write semaphores is that when multiple
> cores take the lock for reading, the cache line containing the semaphore
> is bouncing between L1 caches of the cores, causing performance
> degradation.
>
> However, it appears to me that the struct percpu_rwsem "rss" element
> which is used by the RCU-sync is not a per-cpu element. So even in the
> fastpath case (only readers and no writers), the cacheline containing
> rss is shared and will bounce by multiple CPUs. For that matter, even
> the cacheline containing the percpu_rw_semaphore itself will be bounce
> among multiple reader CPUs.
The readers won't modify this memory? read_lock/unlock will only update
the per-cpu counter, ->read_count.
Oleg.
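[Editor's sketch] The point above can be illustrated with a simplified userspace model of the reader fast path. This is not the kernel implementation: the names sketch_rwsem, writer_pending, and the 64-byte line size are assumptions made for illustration, the "cpu" index is passed in by hand, and the writer side and all slow-path waiting are left out. What it tries to show is that a reader stores only to its own padded counter and merely loads the shared field.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS   4
#define SKETCH_CACHELINE 64 /* assumed cache line size */

/* One counter per CPU, padded so that no two counters share a line. */
struct sketch_percpu_count {
    _Alignas(SKETCH_CACHELINE) long read_count;
};

struct sketch_rwsem {
    /* Shared but read-mostly: readers only load this field. */
    atomic_bool writer_pending;
    /* Written by readers, but each CPU touches only its own slot. */
    struct sketch_percpu_count count[SKETCH_NR_CPUS];
};

/* Returns true if the fast path succeeded; the slow path is elided. */
static bool sketch_down_read(struct sketch_rwsem *sem, int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    sem->count[cpu].read_count++;             /* store to a private line */
    if (!atomic_load_explicit(&sem->writer_pending, memory_order_acquire))
        return true;                          /* load of a shared line   */
    sem->count[cpu].read_count--;             /* back out, writer active */
    return false;
}

static void sketch_up_read(struct sketch_rwsem *sem, int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    sem->count[cpu].read_count--;
}

/* A writer would sum all slots to learn whether any reader remains. */
static long sketch_readers(const struct sketch_rwsem *sem)
{
    long sum = 0;
    for (int i = 0; i < SKETCH_NR_CPUS; i++)
        sum += sem->count[i].read_count;
    return sum;
}
```

Because writer_pending is only loaded on this path, its cache line can stay in the shared state in every reader's cache; only a writer's store would invalidate those copies.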
* Re: Question about cacheline bounching with percpu-rwsem and rcu-sync
2019-05-31 13:10 Question about cacheline bounching with percpu-rwsem and rcu-sync Joel Fernandes
2019-05-31 13:45 ` Oleg Nesterov
@ 2019-05-31 13:50 ` Paul E. McKenney
2019-05-31 14:43 ` Joel Fernandes
1 sibling, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2019-05-31 13:50 UTC (permalink / raw)
To: Joel Fernandes; +Cc: Oleg Nesterov, Eric Dumazet, rcu
On Fri, May 31, 2019 at 09:10:16AM -0400, Joel Fernandes wrote:
> Hi,
> As per the documentation for rationale of percpu-rwsem, the Documentation says:
>
> The problem with traditional read-write semaphores is that when multiple
> cores take the lock for reading, the cache line containing the semaphore
> is bouncing between L1 caches of the cores, causing performance
> degradation.
>
> However, it appears to me that the struct percpu_rwsem "rss" element
> which is used by the RCU-sync is not a per-cpu element. So even in the
> fastpath case (only readers and no writers), the cacheline containing
> rss is shared and will bounce by multiple CPUs. For that matter, even
> the cacheline containing the percpu_rw_semaphore itself will be bounce
> among multiple reader CPUs.
>
> So how does percpu-rwsem eliminate cache line bouncing in the common
> case. Could you let me know what I am missing?
>
> Thanks a lot.
The accesses are loads, except for the __this_cpu_inc(), which updates
a per-CPU variable. The locations loaded will replicate across the
CPUs' caches and the per-CPU variables are private to each CPU. Hence
no cacheline bouncing.
Or am I missing the point of your question?
Either way, it would be good for you to just try it. Create a kernel
module or similar that hammers on percpu_down_read() and percpu_up_read(),
and empirically check the scalability on a largish system. Then compare
this to down_read() and up_read().
Thanx, Paul
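[Editor's sketch] The experiment suggested above needs a kernel module, but the same effect can be approximated in userspace. The following is a rough analogue under stated assumptions, not the test later posted in this thread: run_counters(), MAX_THREADS, and the 64-byte padding are hypothetical names and values, a shared atomic counter stands in for the rwsem read side, and per-thread padded counters stand in for the per-CPU read_count.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Shared counter: every increment must own the line exclusively. */
static atomic_ulong shared_count;

/* Private counters, one per thread, each on its own (assumed 64-byte) line. */
struct padded_count {
    _Alignas(64) volatile unsigned long v; /* volatile keeps the loop honest */
};
static struct padded_count private_count[MAX_THREADS];

struct worker_arg {
    int id;
    unsigned long loops;
    bool use_shared;
};

static void *worker(void *p)
{
    struct worker_arg *a = p;

    for (unsigned long i = 0; i < a->loops; i++) {
        if (a->use_shared)
            atomic_fetch_add_explicit(&shared_count, 1, memory_order_relaxed);
        else
            private_count[a->id].v++;
    }
    return NULL;
}

/* Runs nthreads workers and returns the total number of increments seen. */
static unsigned long run_counters(int nthreads, unsigned long loops,
                                  bool use_shared)
{
    pthread_t tid[MAX_THREADS];
    struct worker_arg arg[MAX_THREADS];
    unsigned long total = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    atomic_store(&shared_count, 0);
    for (int i = 0; i < nthreads; i++) {
        private_count[i].v = 0;
        arg[i] = (struct worker_arg){ .id = i, .loops = loops,
                                      .use_shared = use_shared };
        int rc = pthread_create(&tid[i], NULL, worker, &arg[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += private_count[i].v;
    }
    return total + atomic_load(&shared_count);
}
```

Timing each call (for example with clock_gettime(CLOCK_MONOTONIC, ...)) while varying nthreads should show the shared variant slowing down as threads are added and the private variant staying roughly flat, though the exact numbers depend on the machine.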
* Re: Question about cacheline bounching with percpu-rwsem and rcu-sync
2019-05-31 13:45 ` Oleg Nesterov
@ 2019-05-31 14:42 ` Joel Fernandes
0 siblings, 0 replies; 8+ messages in thread
From: Joel Fernandes @ 2019-05-31 14:42 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Eric Dumazet, Paul E. McKenney, rcu
On Fri, May 31, 2019 at 9:45 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 05/31, Joel Fernandes wrote:
> >
> > The problem with traditional read-write semaphores is that when multiple
> > cores take the lock for reading, the cache line containing the semaphore
> > is bouncing between L1 caches of the cores, causing performance
> > degradation.
> >
> > However, it appears to me that the struct percpu_rwsem "rss" element
> > which is used by the RCU-sync is not a per-cpu element. So even in the
> > fastpath case (only readers and no writers), the cacheline containing
> > rss is shared and will bounce by multiple CPUs. For that matter, even
> > the cacheline containing the percpu_rw_semaphore itself will be bounce
> > among multiple reader CPUs.
>
> The readers won't modify this memory? read_lock/unlock will only update
> the per-cpu counter, ->read_count.
Makes sense, I was confusing cache misses with cache bouncing. Thanks
for the clarification!
* Re: Question about cacheline bounching with percpu-rwsem and rcu-sync
2019-05-31 13:50 ` Paul E. McKenney
@ 2019-05-31 14:43 ` Joel Fernandes
2019-06-09 0:24 ` Joel Fernandes
0 siblings, 1 reply; 8+ messages in thread
From: Joel Fernandes @ 2019-05-31 14:43 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Oleg Nesterov, Eric Dumazet, rcu
On Fri, May 31, 2019 at 9:52 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>
> On Fri, May 31, 2019 at 09:10:16AM -0400, Joel Fernandes wrote:
> > Hi,
> > As per the documentation for rationale of percpu-rwsem, the Documentation says:
> >
> > The problem with traditional read-write semaphores is that when multiple
> > cores take the lock for reading, the cache line containing the semaphore
> > is bouncing between L1 caches of the cores, causing performance
> > degradation.
> >
> > However, it appears to me that the struct percpu_rwsem "rss" element
> > which is used by the RCU-sync is not a per-cpu element. So even in the
> > fastpath case (only readers and no writers), the cacheline containing
> > rss is shared and will bounce by multiple CPUs. For that matter, even
> > the cacheline containing the percpu_rw_semaphore itself will be bounce
> > among multiple reader CPUs.
> >
> > So how does percpu-rwsem eliminate cache line bouncing in the common
> > case. Could you let me know what I am missing?
> >
> > Thanks a lot.
>
> The accesses are loads, except for the __this_cpu_inc(), which updates
> a per-CPU variable. The locations loaded will replicate across the
> CPUs' caches and the per-CPU variables are private to each CPU. Hence
> no cacheline bouncing.
Makes sense, thanks for the answer!
>
> Either way, it would be good for you to just try it. Create a kernel
> module or similar than hammers on percpu_down_read() and percpu_up_read(),
> and empirically check the scalability on a largish system. Then compare
> this to down_read() and up_read()
Will do! thanks.
* Re: Question about cacheline bounching with percpu-rwsem and rcu-sync
2019-05-31 14:43 ` Joel Fernandes
@ 2019-06-09 0:24 ` Joel Fernandes
2019-06-09 12:22 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Joel Fernandes @ 2019-06-09 0:24 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Oleg Nesterov, Eric Dumazet, rcu, LKML
On Fri, May 31, 2019 at 10:43 AM Joel Fernandes <joel@joelfernandes.org> wrote:
[snip]
> >
> > Either way, it would be good for you to just try it. Create a kernel
> > module or similar than hammers on percpu_down_read() and percpu_up_read(),
> > and empirically check the scalability on a largish system. Then compare
> > this to down_read() and up_read()
>
> Will do! thanks.
I created a test for this, and the results are quite amazing. The test
just stresses read lock/unlock for rwsem vs. percpu-rwsem.
It was run on a dual-socket Intel x86_64 machine with 14 cores per
socket.
Test runs 10,000,000 loops of rwsem vs percpu-rwsem:
https://github.com/joelagnel/linux-kernel/commit/8fe968116bd887592301179a53b7b3200db84424
Graphs/Results here:
https://docs.google.com/spreadsheets/d/1cbVLNK8tzTZNTr-EDGDC0T0cnFCdFK3wg2Foj5-Ll9s/edit?usp=sharing
The completion time of the test goes up somewhat exponentially with
the number of threads for the rwsem case, whereas for percpu-rwsem
it stays the same. I could add this data to some of the documentation as
well.
Thanks!
- Joel
* Re: Question about cacheline bounching with percpu-rwsem and rcu-sync
2019-06-09 0:24 ` Joel Fernandes
@ 2019-06-09 12:22 ` Paul E. McKenney
2019-06-09 21:25 ` Joel Fernandes
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2019-06-09 12:22 UTC (permalink / raw)
To: Joel Fernandes; +Cc: Oleg Nesterov, Eric Dumazet, rcu, LKML
On Sat, Jun 08, 2019 at 08:24:36PM -0400, Joel Fernandes wrote:
> On Fri, May 31, 2019 at 10:43 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> [snip]
> > >
> > > Either way, it would be good for you to just try it. Create a kernel
> > > module or similar than hammers on percpu_down_read() and percpu_up_read(),
> > > and empirically check the scalability on a largish system. Then compare
> > > this to down_read() and up_read()
> >
> > Will do! thanks.
>
> I created a test for this and the results are quite amazing just
> stressed read lock/unlock for rwsem vs percpu-rwsem.
> The test is conducted on a dual socket Intel x86_64 machine with 14
> cores each socket.
>
> Test runs 10,000,000 loops of rwsem vs percpu-rwsem:
> https://github.com/joelagnel/linux-kernel/commit/8fe968116bd887592301179a53b7b3200db84424
Interesting location, but looks functional. ;-)
> Graphs/Results here:
> https://docs.google.com/spreadsheets/d/1cbVLNK8tzTZNTr-EDGDC0T0cnFCdFK3wg2Foj5-Ll9s/edit?usp=sharing
>
> The completion time of the test goes up somewhat exponentially with
> the number of threads, for the rwsem case, where as for percpu-rwsem
> it is the same. I could add this data to some of the documentation as
> well.
Actually, the completion time looks to be pretty close to linear in the
number of CPUs. Which is still really bad, don't get me wrong.
Thank you for doing this, and it might be good to have some documentation
on this. In perfbook, I use counters to make this point, and perhaps
I need to emphasize more that it also applies to other algorithms,
including locking. Me, I learned this lesson from a logic analyzer
back in the very early 1990s. This was back in the days before on-CPU
caches when a logic analyzer could actually tell you something about
the detailed execution. ;-)
The key point is that you can often closely approximate the performance
of synchronization algorithms by counting the number of cache misses and
the number of CPUs competing for each cache line.
If you want to get the microbenchmark test code itself upstream,
one approach might be to have a kernel/locking/lockperf.c similar to
kernel/rcu/rcuperf.c.
Thoughts?
Thanx, Paul
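[Editor's sketch] The rule of thumb above can be turned into a crude back-of-the-envelope estimate. This is a toy model under loud assumptions, not a measured result and not a formula from perfbook: it supposes that every contended miss waits for one line transfer per competing CPU, and est_completion_ns() and the miss_ns latency parameter are hypothetical.

```c
#include <assert.h>

/*
 * Rough completion-time estimate for one thread:
 *   ops            number of lock/unlock pairs it performs
 *   misses_per_op  cache misses on contended lines per pair
 *   competing_cpus CPUs fighting over each such line (at least 1)
 *   miss_ns        assumed cost of one cache line transfer
 */
static unsigned long long est_completion_ns(unsigned long long ops,
                                            unsigned int misses_per_op,
                                            unsigned int competing_cpus,
                                            unsigned int miss_ns)
{
    assert(competing_cpus >= 1);
    return ops * misses_per_op * competing_cpus * miss_ns;
}
```

With misses_per_op fixed, the estimate grows linearly with competing_cpus, which matches the shape described for rwsem; with misses_per_op equal to zero, as on a fast path that only touches private per-CPU data, it stays flat regardless of CPU count.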
* Re: Question about cacheline bounching with percpu-rwsem and rcu-sync
2019-06-09 12:22 ` Paul E. McKenney
@ 2019-06-09 21:25 ` Joel Fernandes
0 siblings, 0 replies; 8+ messages in thread
From: Joel Fernandes @ 2019-06-09 21:25 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Oleg Nesterov, Eric Dumazet, rcu, LKML
On Sun, Jun 09, 2019 at 05:22:26AM -0700, Paul E. McKenney wrote:
> On Sat, Jun 08, 2019 at 08:24:36PM -0400, Joel Fernandes wrote:
> > On Fri, May 31, 2019 at 10:43 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > [snip]
> > > >
> > > > Either way, it would be good for you to just try it. Create a kernel
> > > > module or similar than hammers on percpu_down_read() and percpu_up_read(),
> > > > and empirically check the scalability on a largish system. Then compare
> > > > this to down_read() and up_read()
> > >
> > > Will do! thanks.
> >
> > I created a test for this and the results are quite amazing just
> > stressed read lock/unlock for rwsem vs percpu-rwsem.
> > The test is conducted on a dual socket Intel x86_64 machine with 14
> > cores each socket.
> >
> > Test runs 10,000,000 loops of rwsem vs percpu-rwsem:
> > https://github.com/joelagnel/linux-kernel/commit/8fe968116bd887592301179a53b7b3200db84424
>
> Interesting location, but looks functional. ;-)
>
> > Graphs/Results here:
> > https://docs.google.com/spreadsheets/d/1cbVLNK8tzTZNTr-EDGDC0T0cnFCdFK3wg2Foj5-Ll9s/edit?usp=sharing
> >
> > The completion time of the test goes up somewhat exponentially with
> > the number of threads, for the rwsem case, where as for percpu-rwsem
> > it is the same. I could add this data to some of the documentation as
> > well.
>
> Actually, the completion time looks to be pretty close to linear in the
> number of CPUs. Which is still really bad, don't get me wrong.
Sure, yes, on second thought it is more linear than exponential :)
> Thank you for doing this, and it might be good to have some documentation
> on this. In perfbook, I use counters to make this point, and perhaps
> I need to emphasize more that it also applies to other algorithms,
> including locking. Me, I learned this lesson from a logic analyzer
> back in the very early 1990s. This was back in the days before on-CPU
> caches when a logic analyzer could actually tell you something about
> the detailed execution. ;-)
>
> The key point is that you can often closely approximate the performance
> of synchronization algorithms by counting the number of cache misses and
> the number of CPUs competing for each cache line.
Cool, thanks for that insight. It has been some years since I used a logic
analyzer for some bus protocol debugging, but those are fun!
> If you want to get the microbenchmark test code itself upstream,
> one approach might be to have a kernel/locking/lockperf.c similar to
> kernel/rcu/rcuperf.c.
> Thoughts?
That sounds great to me. There are no other locking performance tests in the
kernel. There are the locking API selftests at boot (DEBUG_LOCKING_API_SELFTESTS),
which just test whether lockdep catches locking issues, and there's
locktorture, but I believe none of these test lock performance.
I think a lockperf.c could also test other things about locking mechanisms,
such as how they perform when the owner of the lock is currently running vs.
sleeping while another thread is trying to acquire it. What do you think? I
can add this to my to-do list. Right now I'm working on the list-RCU lockdep
checking I started on [1], and I want to post another series soon.
Thanks a lot,
- Joel
[1] https://lkml.org/lkml/2019/6/1/495
https://lore.kernel.org/patchwork/patch/1082846/
>
> Thanx, Paul
>