* [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches
@ 2004-05-25 5:35 Manfred Spraul
2004-05-27 23:22 ` [Lse-tech] " Paul E. McKenney
0 siblings, 1 reply; 4+ messages in thread
From: Manfred Spraul @ 2004-05-25 5:35 UTC (permalink / raw)
To: linux-kernel, lse-tech
Hi,
Step two for reducing cacheline thrashing within rcupdate.c:
rcu_process_callbacks always acquires rcu_ctrlblk.state.mutex and
calls rcu_start_batch, even if the batch is already running or already
scheduled to run.
This can be avoided with a sequence lock: a sequence lock allows
reading the current batch number and next_pending atomically. If
next_pending is already set, then there is no need to acquire the
global mutex.
This means that for each grace period, there will be
- one write access to the rcu_ctrlblk.batch cacheline
- lots of read accesses to rcu_ctrlblk.batch (3-10*cpus_online(),
perhaps even more).
Behavior is similar to the jiffies cacheline, so this shouldn't be a problem.
- cpus_online()+1 write accesses to rcu_ctrlblk.state, all of them
starting with spin_lock(&rcu_ctrlblk.state.mutex).
For large enough cpus_online() this will be a problem, but all
except two of the spin_lock calls only protect the rcu_cpu_mask
bitmap, so a hierarchical bitmap would allow splitting the write
accesses across multiple cachelines.
Tested on an 8-way with reaim. Unfortunately it probably won't help
with Jack Steiner's 'ls' test since in this test only one cpu
generates rcu entries.
What do you think?
--
Manfred
// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 6
// SUBLEVEL = 6
// EXTRAVERSION = -mm4
--- 2.6/kernel/rcupdate.c 2004-05-23 11:53:46.000000000 +0200
+++ build-2.6/kernel/rcupdate.c 2004-05-23 11:53:52.000000000 +0200
@@ -47,7 +47,7 @@
 
 /* Definition for rcupdate control block. */
 struct rcu_ctrlblk rcu_ctrlblk =
-	{ .batch = { .cur = -300, .completed = -300 },
+	{ .batch = { .cur = -300, .completed = -300 , .lock = SEQCNT_ZERO },
 	  .state = {.mutex = SPIN_LOCK_UNLOCKED, .rcu_cpu_mask = CPU_MASK_NONE } };
 DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
 
@@ -124,16 +124,18 @@
 	cpumask_t active;
 
 	if (next_pending)
-		rcu_ctrlblk.state.next_pending = 1;
+		rcu_ctrlblk.batch.next_pending = 1;
 
-	if (rcu_ctrlblk.state.next_pending &&
+	if (rcu_ctrlblk.batch.next_pending &&
 			rcu_ctrlblk.batch.completed == rcu_ctrlblk.batch.cur) {
-		rcu_ctrlblk.state.next_pending = 0;
 		/* Can't change, since spin lock held. */
 		active = nohz_cpu_mask;
 		cpus_complement(active);
 		cpus_and(rcu_ctrlblk.state.rcu_cpu_mask, cpu_online_map, active);
+		write_seqcount_begin(&rcu_ctrlblk.batch.lock);
+		rcu_ctrlblk.batch.next_pending = 0;
 		rcu_ctrlblk.batch.cur++;
+		write_seqcount_end(&rcu_ctrlblk.batch.lock);
 	}
 }
 
@@ -261,6 +263,8 @@
 
 	local_irq_disable();
 	if (!list_empty(&RCU_nxtlist(cpu)) && list_empty(&RCU_curlist(cpu))) {
+		int next_pending, seq;
+
 		__list_splice(&RCU_nxtlist(cpu), &RCU_curlist(cpu));
 		INIT_LIST_HEAD(&RCU_nxtlist(cpu));
 		local_irq_enable();
@@ -268,10 +272,19 @@
 		/*
 		 * start the next batch of callbacks
 		 */
-		spin_lock(&rcu_ctrlblk.state.mutex);
-		RCU_batch(cpu) = rcu_ctrlblk.batch.cur + 1;
-		rcu_start_batch(1);
-		spin_unlock(&rcu_ctrlblk.state.mutex);
+		do {
+			seq = read_seqcount_begin(&rcu_ctrlblk.batch.lock);
+			/* determine batch number */
+			RCU_batch(cpu) = rcu_ctrlblk.batch.cur + 1;
+			next_pending = rcu_ctrlblk.batch.next_pending;
+		} while (read_seqcount_retry(&rcu_ctrlblk.batch.lock, seq));
+
+		if (!next_pending) {
+			/* and start it/schedule start if it's a new batch */
+			spin_lock(&rcu_ctrlblk.state.mutex);
+			rcu_start_batch(1);
+			spin_unlock(&rcu_ctrlblk.state.mutex);
+		}
 	} else {
 		local_irq_enable();
 	}
--- 2.6/include/linux/rcupdate.h	2004-05-23 11:53:46.000000000 +0200
+++ build-2.6/include/linux/rcupdate.h	2004-05-23 11:53:52.000000000 +0200
@@ -41,6 +41,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
+#include <linux/seqlock.h>
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -69,11 +70,14 @@
 	struct {
 		long cur;		/* Current batch number. */
 		long completed;		/* Number of the last completed batch */
+		int next_pending;	/* Is the next batch already waiting? */
+		seqcount_t lock;	/* for atomically reading cur and */
+					/* next_pending. Spinlock not used, */
+					/* protected by state.mutex */
 	} batch ____cacheline_maxaligned_in_smp;
 	/* remaining members: bookkeeping of the progress of the grace period */
 	struct {
 		spinlock_t mutex;	/* Guard this struct */
-		int next_pending;	/* Is the next batch already waiting? */
 		cpumask_t rcu_cpu_mask;	/* CPUs that need to switch */
 					/* in order for current batch to proceed. */
 	} state ____cacheline_maxaligned_in_smp;
* Re: [Lse-tech] [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches
2004-05-25 5:35 [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches Manfred Spraul
@ 2004-05-27 23:22 ` Paul E. McKenney
2004-05-28 13:57 ` Manfred Spraul
0 siblings, 1 reply; 4+ messages in thread
From: Paul E. McKenney @ 2004-05-27 23:22 UTC (permalink / raw)
To: Manfred Spraul; +Cc: linux-kernel, lse-tech
Hello, Manfred,
I am still digging through these, and things look quite good in general,
but I have a question on your second patch.
Given the following sequence of events:
1. CPU 0 executes the
rcu_ctrlblk.batch.next_pending = 1;
at the beginning of rcu_start_batch().
2. CPU 1 executes the read_seqcount code sequence in
rcu_process_callbacks(), setting RCU_batch(cpu) to
the next batch number, and setting next_pending to 1.
3. CPU 0 executes the remainder of rcu_start_batch(),
setting rcu_ctrlblk.batch.next_pending to 0 and
incrementing rcu_ctrlblk.batch.cur.
4. CPU 1's state is now as if the grace period had already
completed for the callbacks that were just moved to
RCU_curlist(), which would be very bad.
First, can this sequence of events really happen? I cannot see
anything that prevents it, but on the other hand, am writing this
while under the influence of heavy-duty decongestants.
Second, is CPU 1's state really "grace period completed"?
See previous disclaimer.
If this really is a problem, it seems like the fix would be
to extend the range of the write_seqcount to cover the entire
body of rcu_start_batch(). Which would increase the number
of write_seqcount calls, degrading performance. Might not be
enough to worry about, but 512 CPUs is 512 CPUs...
Another approach would be to do the write_seqcount_begin() in
both "if" statements, and to do the corresponding write_seqcount_end()
if the begin had happened. If this is a concern, something like the
following might be appropriate:
static void rcu_start_batch(int next_pending)
{
	cpumask_t active;
	int writeseq = 0;

	if (next_pending) {
		write_seqcount_begin(&rcu_ctrlblk.batch.lock);
		writeseq = 1;
		rcu_ctrlblk.batch.next_pending = 1;
	}
	if (rcu_ctrlblk.batch.next_pending &&
	    rcu_ctrlblk.batch.completed == rcu_ctrlblk.batch.cur) {
		/* Can't change, since spin lock held. */
		active = nohz_cpu_mask;
		cpus_complement(active);
		cpus_and(rcu_ctrlblk.state.rcu_cpu_mask, cpu_online_map, active);
		if (!writeseq) {
			write_seqcount_begin(&rcu_ctrlblk.batch.lock);
			writeseq = 1;
		}
		rcu_ctrlblk.batch.next_pending = 0;
		rcu_ctrlblk.batch.cur++;
	}
	if (writeseq)
		write_seqcount_end(&rcu_ctrlblk.batch.lock);
}
Thoughts?
Will continue looking through the patches, more later.
Thanx, Paul
* Re: [Lse-tech] [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches
2004-05-27 23:22 ` [Lse-tech] " Paul E. McKenney
@ 2004-05-28 13:57 ` Manfred Spraul
2004-05-28 16:28 ` Paul E. McKenney
0 siblings, 1 reply; 4+ messages in thread
From: Manfred Spraul @ 2004-05-28 13:57 UTC (permalink / raw)
To: paulmck; +Cc: Manfred Spraul, linux-kernel, lse-tech
Paul E. McKenney wrote:
>Hello, Manfred,
>
>I am still digging through these, and things look quite good in general,
>but I have a question on your second patch.
>
>
Let's assume that
batch.completed = 5;
batch.cur = 5;
batch.next_pending = 0;
>Given the following sequence of events:
>
>1. CPU 0 executes the
>
> rcu_ctrlblk.batch.next_pending = 1;
>
>
batch.next_pending = 1.
> at the beginning of rcu_start_batch().
>
>2. CPU 1 executes the read_seqcount code sequence in
> rcu_process_callbacks(), setting RCU_batch(cpu) to
> the next batch number, and setting next_pending to 1.
>
>
RCU_batch(1) is now 6.
next_pending is 1, rcu_process_callbacks continues without calling
rcu_start_batch().
>3. CPU 0 executes the remainder of rcu_start_batch(),
> setting rcu_ctrlblk.batch.next_pending to 0 and
> incrementing rcu_ctrlblk.batch.cur.
>
>
batch.cur = 6.
>4. CPU 1's state is now as if the grace period had already
> completed for the callbacks that were just moved to
> RCU_curlist(), which would be very bad.
>
>
AFAICS: No. RCU_batch(1) is 6 and rcu_ctrlblk.batch.completed is still
5. The test for grace period completed is
>	if (!list_empty(&RCU_curlist(cpu)) &&
>	    !rcu_batch_before(rcu_ctrlblk.batch.completed, RCU_batch(cpu))) {
>		__list_splice(&RCU_curlist(cpu), &list);
>		INIT_LIST_HEAD(&RCU_curlist(cpu));
>	}
5 is before 6, thus the callbacks won't be processed.
The only write to rcu_ctrlblk.batch.completed is in cpu_quiet, after
checking that the cpu bitmap is empty and while holding
rcu_ctrlblk.state.mutex.
Thanks for looking at my patches,
--
Manfred
* Re: [Lse-tech] [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches
2004-05-28 13:57 ` Manfred Spraul
@ 2004-05-28 16:28 ` Paul E. McKenney
0 siblings, 0 replies; 4+ messages in thread
From: Paul E. McKenney @ 2004-05-28 16:28 UTC (permalink / raw)
To: Manfred Spraul; +Cc: Manfred Spraul, linux-kernel, lse-tech
On Fri, May 28, 2004 at 03:57:07PM +0200, Manfred Spraul wrote:
> Paul E. McKenney wrote:
>
> >Hello, Manfred,
> >
> >I am still digging through these, and things look quite good in general,
> >but I have a question on your second patch.
Sorry for the bother, will keep looking, hopefully with more accurate
comments in the future. :-/
Thanx, Paul