* Linux-kernel examples for LKMM recipes
@ 2017-10-11 22:32 Paul E. McKenney
2017-10-12 1:23 ` Boqun Feng
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-11 22:32 UTC (permalink / raw)
To: stern, parri.andrea, will.deacon, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget
Cc: linux-kernel
Hello!
At Linux Plumbers Conference, we got requests for a recipes document,
and a further request to point to actual code in the Linux kernel.
I have pulled together some examples for various litmus-test families,
as shown below. The decoder ring for the abbreviations (ISA2, LB, SB,
MP, ...) is here:
https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
This document is also checked into the memory-models git archive:
https://github.com/aparri/memory-model.git
I would be especially interested in simpler examples in general, and
of course any example at all for the cases where I was unable to find
any. Thoughts?
Thanx, Paul
------------------------------------------------------------------------
This document lists the litmus-test patterns that we have been discussing,
along with examples from the Linux kernel. This is intended to feed into
the recipes document. All examples are from v4.13.
0. Single-variable SC.
a. Within a single CPU, the use of the ->dynticks_nmi_nesting
counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
(see kernel/rcu/tree.c). The counter is accessed by
interrupts and NMIs as well as by process-level code.
This counter can be accessed by other CPUs, but only
for debug output.
b. Between CPUs, I would put forward the ->dflags
updates, but this is anything but simple. Still, it
might be OK for an illustration?
1. MP (see test6.pdf for nickname translation)
a. smp_store_release() / smp_load_acquire()
init_stack_slab() in lib/stackdepot.c uses release-acquire
to handle initialization of a slab of the stack. Working
out the mutual-exclusion design is left as an exercise for
the reader.
b. rcu_assign_pointer() / rcu_dereference()
expand_to_next_prime() does the rcu_assign_pointer(),
and next_prime_number() does the rcu_dereference().
This mediates access to a bit vector that is expanded
as additional primes are needed. These two functions
are in lib/prime_numbers.c.
c. smp_wmb() / smp_rmb()
xlog_state_switch_iclogs() contains the following:
		log->l_curr_block -= log->l_logBBsize;
		ASSERT(log->l_curr_block >= 0);
		smp_wmb();
		log->l_curr_cycle++;
And xlog_valid_lsn() contains the following:
		cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
		smp_rmb();
		cur_block = ACCESS_ONCE(log->l_curr_block);
d. Replacing either of the above with smp_mb()
Holding off on this one for the moment...
2. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
Lots of variety here, can in some cases substitute:
a. READ_ONCE() for smp_load_acquire()
b. WRITE_ONCE() for smp_store_release()
c. Dependencies for both smp_load_acquire() and
smp_store_release().
d. smp_wmb() for smp_store_release() in first thread
of ISA2 and Z6.2.
e. smp_rmb() for smp_load_acquire() in last thread of ISA2.
The canonical illustration of LB involves the various memory
allocators, where you don't want a load from about-to-be-freed
memory to see a store initializing a later incarnation of that
same memory area. But the per-CPU caches make this a very
long and complicated example.
I am not aware of any three-CPU release-acquire chains in the
Linux kernel. There are three-CPU lock-based chains in RCU,
but these are not at all simple, either.
Thoughts?
3. SB
a. smp_mb(), as in lockless wait-wakeup coordination.
And as in sys_membarrier()-scheduler coordination,
for that matter.
Examples seem to be lacking. Most cases use locking.
Here is one rather strange one from RCU:
		void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
		{
			unsigned long flags;
			bool needwake;
			bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);

			rhp->next = NULL;
			rhp->func = func;
			raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
			needwake = !rcu_tasks_cbs_head;
			*rcu_tasks_cbs_tail = rhp;
			rcu_tasks_cbs_tail = &rhp->next;
			raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
			/* We can't create the thread unless interrupts are enabled. */
			if ((needwake && havetask) ||
			    (!havetask && !irqs_disabled_flags(flags))) {
				rcu_spawn_tasks_kthread();
				wake_up(&rcu_tasks_cbs_wq);
			}
		}
And for the wait side, using synchronize_sched() to supply
the barrier for both ends, with the preemption disabling
due to raw_spin_lock_irqsave() serving as the read-side
critical section:
		if (!list) {
			wait_event_interruptible(rcu_tasks_cbs_wq,
						 rcu_tasks_cbs_head);
			if (!rcu_tasks_cbs_head) {
				WARN_ON(signal_pending(current));
				schedule_timeout_interruptible(HZ/10);
			}
			continue;
		}
		synchronize_sched();
-----------------
Here is another one that uses atomic_cmpxchg() as a
full memory barrier:
		if (!wait_event_timeout(*wait, !atomic_read(stopping),
					msecs_to_jiffies(1000))) {
			atomic_set(stopping, 0);
			smp_mb();
			return -ETIMEDOUT;
		}

		int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
						     atomic_t *stopping)
		{
			if (atomic_cmpxchg(stopping, 1, 0)) {
				wake_up(wait);
				return 1;
			}

			return 0;
		}
* Re: Linux-kernel examples for LKMM recipes
From: Boqun Feng @ 2017-10-12  1:23 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, parri.andrea, will.deacon, peterz, npiggin, dhowells,
    j.alglave, luc.maranget, linux-kernel

On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote:
[ . . . ]
> 	I am not aware of any three-CPU release-acquire chains in the
> 	Linux kernel.  There are three-CPU lock-based chains in RCU,
> 	but these are not at all simple, either.

The "Program-Order guarantees" case in the scheduler?  See the comments
written by Peter above try_to_wake_up():

 * The basic program-order guarantee on SMP systems is that when a task [t]
 * migrates, all its activity on its old CPU [c0] happens-before any subsequent
 * execution on its new CPU [c1].
 ...
 * For blocking we (obviously) need to provide the same guarantee as for
 * migration.  However the means are completely different as there is no lock
 * chain to provide order.  Instead we do:
 *
 *   1) smp_store_release(X->on_cpu, 0)
 *   2) smp_cond_load_acquire(!X->on_cpu)
 *
 * Example:
 *
 *   CPU0 (schedule)                   CPU1 (try_to_wake_up)    CPU2 (schedule)
 *
 *   LOCK rq(0)->lock                  LOCK X->pi_lock
 *   dequeue X
 *   sched-out X
 *   smp_store_release(X->on_cpu, 0);
 *
 *                                     smp_cond_load_acquire(&X->on_cpu, !VAL);
 *                                     X->state = WAKING
 *                                     set_task_cpu(X,2)
 *
 *                                     LOCK rq(2)->lock
 *                                     enqueue X
 *                                     X->state = RUNNING
 *                                     UNLOCK rq(2)->lock
 *
 *                                                              LOCK rq(2)->lock // orders against CPU1
 *                                                              sched-out Z
 *                                                              sched-in X
 *                                                              UNLOCK rq(2)->lock
 *
 *                                     UNLOCK X->pi_lock
 *   UNLOCK rq(0)->lock

This is a chain that mixes locks with acquire-release (maybe even better?).

And another example would be osq_{lock,unlock}() on multiple (more than
three) CPUs.

Regards,
Boqun
* Re: Linux-kernel examples for LKMM recipes
From: Will Deacon @ 2017-10-12 11:27 UTC (permalink / raw)
To: Boqun Feng
Cc: Paul E. McKenney, stern, parri.andrea, peterz, npiggin, dhowells,
    j.alglave, luc.maranget, linux-kernel

On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote:
> On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote:
> > 	I am not aware of any three-CPU release-acquire chains in the
> > 	Linux kernel.  There are three-CPU lock-based chains in RCU,
> > 	but these are not at all simple, either.
>
> The "Program-Order guarantees" case in scheduler?  See the comments
> written by Peter above try_to_wake_up():
[ . . . ]
> And another example would be osq_{lock,unlock}() on multiple (more than
> three) CPUs.

I think the qrwlock also has something similar with the writer fairness
issue fixed:

CPU0: (writer doing an unlock)
	smp_store_release(&lock->wlocked, 0);	// Bottom byte of lock->cnts

CPU1: (waiting writer on slowpath)
	atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
	...
	arch_spin_unlock(&lock->wait_lock);

CPU2: (reader on slowpath)
	arch_spin_lock(&lock->wait_lock);

and there's mixed-size accesses here too.  Fun stuff!

Will
* Re: Linux-kernel examples for LKMM recipes
From: Paul E. McKenney @ 2017-10-17 20:37 UTC (permalink / raw)
To: Will Deacon
Cc: Boqun Feng, stern, parri.andrea, peterz, npiggin, dhowells,
    j.alglave, luc.maranget, linux-kernel

On Thu, Oct 12, 2017 at 12:27:19PM +0100, Will Deacon wrote:
[ . . . ]
> I think the qrwlock also has something similar with the writer fairness
> issue fixed:
[ . . . ]
> and there's mixed-size accesses here too.  Fun stuff!

You had me going there until you mentioned the mixed-size accesses.  ;-)

							Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
From: Paul E. McKenney @ 2017-10-17 20:56 UTC (permalink / raw)
To: Boqun Feng
Cc: stern, parri.andrea, will.deacon, peterz, npiggin, dhowells,
    j.alglave, luc.maranget, linux-kernel

On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote:
[ . . . ]
> The "Program-Order guarantees" case in scheduler?  See the comments
> written by Peter above try_to_wake_up():
[ . . . ]
> This is a chain mixed with lock and acquire-release (maybe even better?).

I added this one, though it might be outside of the scope of recipes.

							Thanx, Paul

> And another example would be osq_{lock,unlock}() on multiple (more than
> three) CPUs.
* Re: Linux-kernel examples for LKMM recipes
From: Andrea Parri @ 2017-10-12 13:27 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, will.deacon, peterz, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, linux-kernel

Hi Paul,

On Wed, Oct 11, 2017 at 03:32:30PM -0700, Paul E. McKenney wrote:
[ . . . ]
> I would be especially interested in simpler examples in general, and
> of course any example at all for the cases where I was unable to find
> any.  Thoughts?

Below are some examples we did discuss (at some point):

The comment in kernel/events/ring_buffer.c:perf_output_put_handle()
describes instances of MP+wmb+rmb and LB+ctrl+mb.

The comments in kernel/sched/core.c:try_to_wake_up() describe more
instances of MP ("plus locking") and LB (see finish_lock_switch()).

The comment in kernel/sched/core.c:task_rq_lock() describes an
instance of MP+wmb+addr-acqpo.

The comment in include/linux/wait.h:waitqueue_active() describes an
instance of SB+mb+mb.

63cae12bce986 ("perf/core: Fix sys_perf_event_open() vs. hotplug")
describes an instance of W+RWC+porel+mb+mb.

[...]

I wish we could say "any barrier (explicit or implicit) in sources
is accompanied by a comment mentioning the interested pattern...",
but life is not always this simple.  ;-)

  Andrea
* Re: Linux-kernel examples for LKMM recipes
From: Paul E. McKenney @ 2017-10-17 20:18 UTC (permalink / raw)
To: Andrea Parri
Cc: stern, will.deacon, peterz, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, linux-kernel

On Thu, Oct 12, 2017 at 03:27:44PM +0200, Andrea Parri wrote:
[ . . . ]
> Below are some examples we did discuss (at some point):
>
> The comment in kernel/events/ring_buffer.c:perf_output_put_handle()
> describes instances of MP+wmb+rmb and LB+ctrl+mb.

I added this as an alternative for MP and as the example for LB.

> The comments in kernel/sched/core.c:try_to_wake_up() describe more
> instances of MP ("plus locking") and LB (see finish_lock_switch()).

This one looks a bit more messy, so I will set it aside, for the
moment, anyway.

> The comment in kernel/sched/core.c:task_rq_lock() describes an
> instance of MP+wmb+addr-acqpo.

This one also looks a bit messy, so I am setting it aside as well.

> The comment in include/linux/wait.h:waitqueue_active() describes an
> instance of SB+mb+mb.

Very good, I took this as the generic pattern for the current pair of
SB examples.

> 63cae12bce986 ("perf/core: Fix sys_perf_event_open() vs. hotplug")
> describes an instance of W+RWC+porel+mb+mb.

Well, this one certainly is of historical interest.  After all, it
might well be the first Linux-kernel commit containing a litmus test.
;-)

I put it in recipes-LKcode-63cae12bce986.txt and reference it from
recipes-LKcode.txt.

> I wish we could say "any barrier (explicit or implicit) in sources
> is accompanied by a comment mentioning the interested pattern...",
> but life is not always this simple.  ;-)

Well, at least scripts/checkpatch.pl now complains if you try to add a
new comment-free barrier.  Not that these complaints are always paid
attention to...

							Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
  2017-10-11 22:32 Linux-kernel examples for LKMM recipes Paul E. McKenney
  2017-10-12  1:23 ` Boqun Feng
  2017-10-12 13:27 ` Andrea Parri
@ 2017-10-13 19:44   ` Alan Stern
  2017-10-13 20:00     ` Paul E. McKenney
  2 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-13 19:44 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
	Jade Alglave, Luc Maranget, Kernel development list

On Wed, 11 Oct 2017, Paul E. McKenney wrote:

> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel.  This is intended to feed into
> the recipes document.  All examples are from v4.13.
>
> 0.	Single-variable SC.
>
> 	a.	Within a single CPU, the use of the ->dynticks_nmi_nesting
> 		counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> 		(see kernel/rcu/tree.c).  The counter is accessed by
> 		interrupts and NMIs as well as by process-level code.
> 		This counter can be accessed by other CPUs, but only
> 		for debug output.

I'm not sure that single-variable SC can really be represented by an
example.  It gets used literally all over the kernel -- it's such a
large part of the way we think about computer programs that we rely on
it unconsciously.

For example, the very first function in the very first C source file
in the kernel/ directory (namely, check_free_space() in kernel/acct.c)
includes this code:

	if (acct->active) {
		u64 suspend = sbuf.f_blocks * SUSPEND;
		do_div(suspend, 100);

How do we know that the value which gets divided by 100 is
sbuf.f_blocks * SUSPEND and not the random garbage which was stored in
suspend's memory location before it was initialized?  Answer:
per-variable SC.

Okay, maybe that's not really applicable, since it doesn't involve
accesses to shared memory.  Here's an example that does.
get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice.
How do we know that the value retrieved by the second call was not
stored _earlier_ than the value retrieved by the first call?
Per-variable SC.

> 	b.	Between CPUs, I would put forward the ->dflags
> 		updates, but this is anything but simple.  But maybe
> 		OK for an illustration?

Pretty much any code that accesses the same shared variable twice on
the same CPU could be an example of per-variable SC.  But I don't think
people would learn much by studying such examples.

Alan
* Re: Linux-kernel examples for LKMM recipes 2017-10-13 19:44 ` Alan Stern @ 2017-10-13 20:00 ` Paul E. McKenney 2017-10-13 20:09 ` Alan Stern 0 siblings, 1 reply; 19+ messages in thread From: Paul E. McKenney @ 2017-10-13 20:00 UTC (permalink / raw) To: Alan Stern Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells, Jade Alglave, Luc Maranget, Kernel development list On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote: > On Wed, 11 Oct 2017, Paul E. McKenney wrote: > > > This document lists the litmus-test patterns that we have been discussing, > > along with examples from the Linux kernel. This is intended to feed into > > the recipes document. All examples are from v4.13. > > > > 0. Single-variable SC. > > > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting > > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies > > (see kernel/rcu/tree.c). The counter is accessed by > > interrupts and NMIs as well as by process-level code. > > This counter can be accessed by other CPUs, but only > > for debug output. > > I'm not sure that single-variable SC can really be represented by an > example. It gets used literally all over the kernel -- it's such a > large part of the way we think about computer programs that we rely on > it unconsciously. > > For example, the very first function in the very first C source file > in the kernel/ directory (namely, check_free_space() in kernel/acct.c) > includes this code: > > if (acct->active) { > u64 suspend = sbuf.f_blocks * SUSPEND; > do_div(suspend, 100); > > How do we know that the value which gets divided by 100 is > sbuf.f_blocks * SUSPEND and not the random garbage which was stored in > suspend's memory location before it was initialized? Answer: > per-variable SC. > > Okay, maybe that's not really applicable, since it doesn't involve > accesses to shared memory. Here's an example that does. > get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice. 
> How do we know that the value retrieved by the second call was not > stored _earlier_ than the value retrieved by the first call? > Per-variable SC. > > > b. Between CPUs, I would put forward the ->dflags > > updates, but this is anything but simple. But maybe > > OK for an illustration? > > Pretty much any code that accesses the same shared variable twice on > the same CPU could be an example of per-variable SC. But I don't think > people would learn much by studying such examples. Perhaps the recipes document should just baldly state that any execution having only one thread and/or having only one variable will be fully ordered? Thanx, Paul ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Linux-kernel examples for LKMM recipes 2017-10-13 20:00 ` Paul E. McKenney @ 2017-10-13 20:09 ` Alan Stern 2017-10-17 18:56 ` Paul E. McKenney 0 siblings, 1 reply; 19+ messages in thread From: Alan Stern @ 2017-10-13 20:09 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells, Jade Alglave, Luc Maranget, Kernel development list On Fri, 13 Oct 2017, Paul E. McKenney wrote: > On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote: > > On Wed, 11 Oct 2017, Paul E. McKenney wrote: > > > > > This document lists the litmus-test patterns that we have been discussing, > > > along with examples from the Linux kernel. This is intended to feed into > > > the recipes document. All examples are from v4.13. > > > > > > 0. Single-variable SC. > > > > > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting > > > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies > > > (see kernel/rcu/tree.c). The counter is accessed by > > > interrupts and NMIs as well as by process-level code. > > > This counter can be accessed by other CPUs, but only > > > for debug output. > > > > I'm not sure that single-variable SC can really be represented by an > > example. It gets used literally all over the kernel -- it's such a > > large part of the way we think about computer programs that we rely on > > it unconsciously. > > > > For example, the very first function in the very first C source file > > in the kernel/ directory (namely, check_free_space() in kernel/acct.c) > > includes this code: > > > > if (acct->active) { > > u64 suspend = sbuf.f_blocks * SUSPEND; > > do_div(suspend, 100); > > > > How do we know that the value which gets divided by 100 is > > sbuf.f_blocks * SUSPEND and not the random garbage which was stored in > > suspend's memory location before it was initialized? Answer: > > per-variable SC. > > > > Okay, maybe that's not really applicable, since it doesn't involve > > accesses to shared memory. 
Here's an example that does. > > get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice. > > How do we know that the value retrieved by the second call was not > > stored _earlier_ than the value retrieved by the first call? > > Per-variable SC. > > > > > b. Between CPUs, I would put forward the ->dflags > > > updates, but this is anything but simple. But maybe > > > OK for an illustration? > > > > Pretty much any code that accesses the same shared variable twice on > > the same CPU could be an example of per-variable SC. But I don't think > > people would learn much by studying such examples. > > Perhaps the recipes document should just baldly state that any execution > having only one thread and/or having only one variable will be fully > ordered? That wouldn't be a bad idea. (Although the part about only one thread should be pretty obvious.) Also, you have to be a little careful because the ordering of the execution may not always agree with the ordering of the source code. The compiler is allowed to evaluate the arguments to a function call, for example, in any order it likes. Alan ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Linux-kernel examples for LKMM recipes
  2017-10-13 20:09 ` Alan Stern
@ 2017-10-17 18:56   ` Paul E. McKenney
  2017-10-17 19:38     ` Alan Stern
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-17 18:56 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
	Jade Alglave, Luc Maranget, Kernel development list

On Fri, Oct 13, 2017 at 04:09:26PM -0400, Alan Stern wrote:
> On Fri, 13 Oct 2017, Paul E. McKenney wrote:
> > On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote:
> > > On Wed, 11 Oct 2017, Paul E. McKenney wrote:

[ . . . ]

> > Perhaps the recipes document should just baldly state that any execution
> > having only one thread and/or having only one variable will be fully
> > ordered?
>
> That wouldn't be a bad idea.  (Although the part about only one thread
> should be pretty obvious.)
>
> Also, you have to be a little careful because the ordering of the
> execution may not always agree with the ordering of the source code.
> The compiler is allowed to evaluate the arguments to a function call,
> for example, in any order it likes.

How about this?

0.	Simple special cases

	If there is only one CPU on the one hand or only one variable
	on the other, the code will execute in order.  There are (as
	usual) some things to be careful of:

	a.	There are some aspects of the C language that are
		unordered.  For example, the compiler can output code
		computing arguments of a multi-parameter function in
		any order it likes, or even interleaved if it so chooses.

	b.	Compilers are permitted to use the "as-if" rule.
		That is, a compiler can emit whatever code it likes,
		as long as the results appear just as if the compiler
		had followed all the relevant rules.  To see this,
		compiler with a high level of optimization and run
		the debugger on the resulting binary.

	c.	If there is only one variable but multiple CPUs, all
		accesses to that variable must be aligned and full sized.
		Variables that straddle cachelines or pages void your
		full-ordering warranty, as do undersized accesses that
		load from or store to only part of the variable.

							Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
  2017-10-17 18:56 ` Paul E. McKenney
@ 2017-10-17 19:38   ` Alan Stern
  2017-10-17 20:33     ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-17 19:38 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
	Jade Alglave, Luc Maranget, Kernel development list

On Tue, 17 Oct 2017, Paul E. McKenney wrote:

> How about this?
>
> 0.	Simple special cases
>
> 	If there is only one CPU on the one hand or only one variable
> 	on the other, the code will execute in order.  There are (as
> 	usual) some things to be careful of:
>
> 	a.	There are some aspects of the C language that are
> 		unordered.  For example, the compiler can output code
> 		computing arguments of a multi-parameter function in
> 		any order it likes, or even interleaved if it so chooses.

That parses a little oddly.  I wouldn't agree that the compiler outputs
the code in any order it likes!

In fact, I wouldn't even mention the compiler at all.  Just say that
(with a few exceptions) the language doesn't specify the order in which
the arguments of a function or operation should be evaluated.  For
example, in the expression "f(x) + g(y)", the order in which f and g
are called is not defined; the object code is allowed to use either
order or even to interleave the computations.

> 	b.	Compilers are permitted to use the "as-if" rule.
> 		That is, a compiler can emit whatever code it likes,
> 		as long as the results appear just as if the compiler
> 		had followed all the relevant rules.  To see this,
> 		compiler with a high level of optimization and run
> 		the debugger on the resulting binary.

You might omit the last sentence.  Furthermore, if the accesses don't
use READ_ONCE/WRITE_ONCE then the code might not get the same result as
if it had executed in order (even for a single variable!), and if you
do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
it likes.

> 	c.	If there is only one variable but multiple CPUs, all
> 		accesses to that variable must be aligned and full sized.

I would say that the variable is what needs to be aligned, not the
accesses.  (Although, if the variable is aligned and all the accesses
are full sized, then they must necessarily be aligned as well.)

> 		Variables that straddle cachelines or pages void your
> 		full-ordering warranty, as do undersized accesses that
> 		load from or store to only part of the variable.

How can a variable straddle pages without also straddling cache lines?

Alan
* Re: Linux-kernel examples for LKMM recipes
  2017-10-17 19:38 ` Alan Stern
@ 2017-10-17 20:33   ` Paul E. McKenney
  2017-10-17 21:03     ` Alan Stern
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-17 20:33 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
	Jade Alglave, Luc Maranget, Kernel development list

On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
>
> > How about this?
> >
> > 0.	Simple special cases
> >
> > 	If there is only one CPU on the one hand or only one variable
> > 	on the other, the code will execute in order.  There are (as
> > 	usual) some things to be careful of:
> >
> > 	a.	There are some aspects of the C language that are
> > 		unordered.  For example, the compiler can output code
> > 		computing arguments of a multi-parameter function in
> > 		any order it likes, or even interleaved if it so chooses.
>
> That parses a little oddly.  I wouldn't agree that the compiler outputs
> the code in any order it likes!

When was the last time you talked to a compiler writer?  ;-)

> In fact, I wouldn't even mention the compiler at all.  Just say that
> (with a few exceptions) the language doesn't specify the order in which
> the arguments of a function or operation should be evaluated.  For
> example, in the expression "f(x) + g(y)", the order in which f and g
> are called is not defined; the object code is allowed to use either
> order or even to interleave the computations.

Nevertheless, I took your suggestion:

	a.	There are some aspects of the C language that are
		unordered.  For example, in the expression "f(x) + g(y)",
		the order in which f and g are called is not defined;
		the object code is allowed to use either order or even
		to interleave the computations.

> > 	b.	Compilers are permitted to use the "as-if" rule.
> > 		That is, a compiler can emit whatever code it likes,
> > 		as long as the results appear just as if the compiler
> > 		had followed all the relevant rules.  To see this,
> > 		compiler with a high level of optimization and run
> > 		the debugger on the resulting binary.
>
> You might omit the last sentence.  Furthermore, if the accesses don't
> use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> if it had executed in order (even for a single variable!), and if you
> do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> it likes.

Ah, I omitted an important qualifier:

	b.	Compilers are permitted to use the "as-if" rule.  That is,
		a compiler can emit whatever code it likes, as long as
		the results of a single-threaded execution appear just
		as if the compiler had followed all the relevant rules.
		To see this, compile with a high level of optimization
		and run the debugger on the resulting binary.

I have seen people (including kernel hackers) surprised by what optimizers
do, so I would prefer that the last sentence remain.

> > 	c.	If there is only one variable but multiple CPUs, all
> > 		accesses to that variable must be aligned and full sized.
>
> I would say that the variable is what needs to be aligned, not the
> accesses.  (Although, if the variable is aligned and all the accesses
> are full sized, then they must necessarily be aligned as well.)

I was thinking in terms of an unaligned 16-bit access to a 32-bit
variable.  But how about this?

	c.	If there is only one variable but multiple CPUs, all
		that variable must be properly aligned and all accesses
		to that variable must be full sized.

> > 		Variables that straddle cachelines or pages void your
> > 		full-ordering warranty, as do undersized accesses that
> > 		load from or store to only part of the variable.
>
> How can a variable straddle pages without also straddling cache lines?

Well, a variable -can- straddle cachelines without straddling pages,
which justifies the "or".  Furthermore, given that cacheline sizes have
been growing, but pages are still 4KB, it is probably only a matter
of time.  ;-)

							Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes 2017-10-17 20:33 ` Paul E. McKenney @ 2017-10-17 21:03 ` Alan Stern 2017-10-17 21:55 ` Paul E. McKenney 0 siblings, 1 reply; 19+ messages in thread From: Alan Stern @ 2017-10-17 21:03 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells, Jade Alglave, Luc Maranget, Kernel development list On Tue, 17 Oct 2017, Paul E. McKenney wrote: > On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote: > > On Tue, 17 Oct 2017, Paul E. McKenney wrote: > > > > > How about this? > > > > > > 0. Simple special cases > > > > > > If there is only one CPU on the one hand or only one variable > > > on the other, the code will execute in order. There are (as > > > usual) some things to be careful of: > > > > > > a. There are some aspects of the C language that are > > > unordered. For example, the compiler can output code > > > computing arguments of a multi-parameter function in > > > any order it likes, or even interleaved if it so chooses. > > > > That parses a little oddly. I wouldn't agree that the compiler outputs > > the code in any order it likes! > > When was the last time you talked to a compiler writer? ;-) > > > In fact, I wouldn't even mention the compiler at all. Just say that > > (with a few exceptions) the language doesn't specify the order in which > > the arguments of a function or operation should be evaluated. For > > example, in the expression "f(x) + g(y)", the order in which f and g > > are called is not defined; the object code is allowed to use either > > order or even to interleave the computations. > > Nevertheless, I took your suggestion: > > a. There are some aspects of the C language that are > unordered. For example, in the expression "f(x) + g(y)", > the order in which f and g are called is not defined; > the object code is allowed to use either order or even > to interleave the computations. Good. > > > b. 
Compilers are permitted to use the "as-if" rule. > > > That is, a compiler can emit whatever code it likes, > > > as long as the results appear just as if the compiler > > > had followed all the relevant rules. To see this, > > > compiler with a high level of optimization and run > > > the debugger on the resulting binary. > > > > You might omit the last sentence. Furthermore, if the accesses don't > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as > > if it had executed in order (even for a single variable!), and if you > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code > > it likes. > > Ah, I omitted an important qualifier: > > b. Compilers are permitted to use the "as-if" rule. That is, > a compiler can emit whatever code it likes, as long as > the results of a single-threaded execution appear just > as if the compiler had followed all the relevant rules. > To see this, compile with a high level of optimization > and run the debugger on the resulting binary. That's okay for the single-CPU case. I don't think it covers the multiple-CPU single-variable case correctly, though. If you don't use READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads and stores? And won't that potentially cause the end result to be different from what you would get if the code had appeared to execute in order? > I have seen people (including kernel hackers) surprised by what optimizers > do, so I would prefer that the last sentence remain. > > > > c. If there is only one variable but multiple CPUs, all > > > accesses to that variable must be aligned and full sized. > > > > I would say that the variable is what needs to be aligned, not the > > accesses. (Although, if the variable is aligned and all the accesses > > are full sized, then they must necessarily be aligned as well.) > > I was thinking in terms of an unaligned 16-bit access to a 32-bit > variable. That wouldn't be full sized. > But how about this? > > c. 
If there is only one variable but multiple CPUs, all Extra "all". Otherwise okay. > that variable must be properly aligned and all accesses > to that variable must be full sized. > > > > Variables that straddle cachelines or pages void your > > > full-ordering warranty, as do undersized accesses that > > > load from or store to only part of the variable. > > > > How can a variable straddle pages without also straddling cache lines? > > Well, a variable -can- straddle cachelines without straddling pages, > which justifies the "or". Furthermore, given that cacheline sizes have > been growing, but pages are still 4KB, it is probably only a matter > of time. ;-) By that time, we'll probably be using 64-KB pages. Or even bigger! Alan ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Linux-kernel examples for LKMM recipes 2017-10-17 21:03 ` Alan Stern @ 2017-10-17 21:55 ` Paul E. McKenney 2017-10-18 14:43 ` Alan Stern 0 siblings, 1 reply; 19+ messages in thread From: Paul E. McKenney @ 2017-10-17 21:55 UTC (permalink / raw) To: Alan Stern Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells, Jade Alglave, Luc Maranget, Kernel development list On Tue, Oct 17, 2017 at 05:03:01PM -0400, Alan Stern wrote: > On Tue, 17 Oct 2017, Paul E. McKenney wrote: > > > On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote: > > > On Tue, 17 Oct 2017, Paul E. McKenney wrote: > > > > > > > How about this? > > > > > > > > 0. Simple special cases > > > > > > > > If there is only one CPU on the one hand or only one variable > > > > on the other, the code will execute in order. There are (as > > > > usual) some things to be careful of: > > > > > > > > a. There are some aspects of the C language that are > > > > unordered. For example, the compiler can output code > > > > computing arguments of a multi-parameter function in > > > > any order it likes, or even interleaved if it so chooses. > > > > > > That parses a little oddly. I wouldn't agree that the compiler outputs > > > the code in any order it likes! > > > > When was the last time you talked to a compiler writer? ;-) > > > > > In fact, I wouldn't even mention the compiler at all. Just say that > > > (with a few exceptions) the language doesn't specify the order in which > > > the arguments of a function or operation should be evaluated. For > > > example, in the expression "f(x) + g(y)", the order in which f and g > > > are called is not defined; the object code is allowed to use either > > > order or even to interleave the computations. > > > > Nevertheless, I took your suggestion: > > > > a. There are some aspects of the C language that are > > unordered. 
> > 		For example, in the expression "f(x) + g(y)",
> > 		the order in which f and g are called is not defined;
> > 		the object code is allowed to use either order or even
> > 		to interleave the computations.
>
> Good.
>
> > > > 	b.	Compilers are permitted to use the "as-if" rule.
> > > > 		That is, a compiler can emit whatever code it likes,
> > > > 		as long as the results appear just as if the compiler
> > > > 		had followed all the relevant rules.  To see this,
> > > > 		compiler with a high level of optimization and run
> > > > 		the debugger on the resulting binary.
> > >
> > > You might omit the last sentence.  Furthermore, if the accesses don't
> > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > if it had executed in order (even for a single variable!), and if you
> > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > it likes.
> >
> > Ah, I omitted an important qualifier:
> >
> > 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> > 		a compiler can emit whatever code it likes, as long as
> > 		the results of a single-threaded execution appear just
> > 		as if the compiler had followed all the relevant rules.
> > 		To see this, compile with a high level of optimization
> > 		and run the debugger on the resulting binary.
>
> That's okay for the single-CPU case.  I don't think it covers the
> multiple-CPU single-variable case correctly, though.  If you don't use
> READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> and stores?  And won't that potentially cause the end result to be
> different from what you would get if the code had appeared to execute
> in order?

Ah, good point, I need yet another qualifier.  How about the following?

	b.	Compilers are permitted to use the "as-if" rule.  That is,
		a compiler can emit whatever code it likes for normal
		accesses, as long as the results of a single-threaded
		execution appear just as if the compiler had followed
		all the relevant rules.  To see this, compile with a
		high level of optimization and run the debugger on the
		resulting binary.

I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
and atomics.  This, in conjunction with the previously added
"single-threaded execution" means that yes, the compiler is permitted
to tear normal loads and stores.  The reason is that a single-threaded
run could not tell the difference.  Interrupt handlers or multiple
threads are required to detect load/store tearing.

So, what am I still missing?  ;-)

> > I have seen people (including kernel hackers) surprised by what optimizers
> > do, so I would prefer that the last sentence remain.
> >
> > > > 	c.	If there is only one variable but multiple CPUs, all
> > > > 		accesses to that variable must be aligned and full sized.
> > >
> > > I would say that the variable is what needs to be aligned, not the
> > > accesses.  (Although, if the variable is aligned and all the accesses
> > > are full sized, then they must necessarily be aligned as well.)
> >
> > I was thinking in terms of an unaligned 16-bit access to a 32-bit
> > variable.
>
> That wouldn't be full sized.
>
> > But how about this?
> >
> > 	c.	If there is only one variable but multiple CPUs, all
>
> Extra "all".  Otherwise okay.

Good catch, I removed the extra "all".

> > 		that variable must be properly aligned and all accesses
> > 		to that variable must be full sized.
> >
> > > > 		Variables that straddle cachelines or pages void your
> > > > 		full-ordering warranty, as do undersized accesses that
> > > > 		load from or store to only part of the variable.
> > >
> > > How can a variable straddle pages without also straddling cache lines?
> >
> > Well, a variable -can- straddle cachelines without straddling pages,
> > which justifies the "or".  Furthermore, given that cacheline sizes have
> > been growing, but pages are still 4KB, it is probably only a matter
> > of time.  ;-)
>
> By that time, we'll probably be using 64-KB pages.  Or even bigger!

PowerPC's server builds have had a minimum page size of 64KB for quite
a few years.  This helps in many cases, but of course hurts for those
occasional applications that insist on doing a pile of independent 4K
mappings.  ;-)

So I would guess that the move from 4K pages to 64K (or whatever) pages
could be quite painful for some CPU families.

							Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes 2017-10-17 21:55 ` Paul E. McKenney @ 2017-10-18 14:43 ` Alan Stern 2017-10-18 20:28 ` Paul E. McKenney 0 siblings, 1 reply; 19+ messages in thread From: Alan Stern @ 2017-10-18 14:43 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells, Jade Alglave, Luc Maranget, Kernel development list On Tue, 17 Oct 2017, Paul E. McKenney wrote: > > > > > b. Compilers are permitted to use the "as-if" rule. > > > > > That is, a compiler can emit whatever code it likes, > > > > > as long as the results appear just as if the compiler > > > > > had followed all the relevant rules. To see this, > > > > > compiler with a high level of optimization and run > > > > > the debugger on the resulting binary. > > > > > > > > You might omit the last sentence. Furthermore, if the accesses don't > > > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as > > > > if it had executed in order (even for a single variable!), and if you > > > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code > > > > it likes. > > > > > > Ah, I omitted an important qualifier: > > > > > > b. Compilers are permitted to use the "as-if" rule. That is, > > > a compiler can emit whatever code it likes, as long as > > > the results of a single-threaded execution appear just > > > as if the compiler had followed all the relevant rules. > > > To see this, compile with a high level of optimization > > > and run the debugger on the resulting binary. > > > > That's okay for the single-CPU case. I don't think it covers the > > multiple-CPU single-variable case correctly, though. If you don't use > > READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads > > and stores? And won't that potentially cause the end result to be > > different from what you would get if the code had appeared to execute > > in order? > > Ah, good point, I need yet another qualifier. 
How about the following? > > b. Compilers are permitted to use the "as-if" rule. That is, > a compiler can emit whatever code it likes for normal > accesses, as long as the results of a single-threaded > execution appear just as if the compiler had followed > all the relevant rules. To see this, compile with a > high level of optimization and run the debugger on the > resulting binary. > > I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(), > and atomics. This, in conjunction with the previously added > "single-threaded execution" means that yes, the compiler is permitted > to tear normal loads and stores. The reason is that a single-threaded > run could not tell the difference. Interrupt handlers or multiple > threads are required to detect load/store tearing. > > So, what am I still missing? ;-) Well, you could explicitly mention that in the multi-thread case, this means all accesses to the shared variable had better use READ_ONCE() or WRITE_ONCE(). Alan ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Linux-kernel examples for LKMM recipes
  2017-10-18 14:43             ` Alan Stern
@ 2017-10-18 20:28               ` Paul E. McKenney
  2017-10-18 21:18                 ` Alan Stern
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-18 20:28 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
	Jade Alglave, Luc Maranget, Kernel development list

On Wed, Oct 18, 2017 at 10:43:42AM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
>
> > > > > > b.	Compilers are permitted to use the "as-if" rule.
> > > > > > 	That is, a compiler can emit whatever code it likes,
> > > > > > 	as long as the results appear just as if the compiler
> > > > > > 	had followed all the relevant rules.  To see this,
> > > > > > 	compile with a high level of optimization and run
> > > > > > 	the debugger on the resulting binary.
> > > > >
> > > > > You might omit the last sentence.  Furthermore, if the accesses don't
> > > > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > > > if it had executed in order (even for a single variable!), and if you
> > > > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > > > it likes.
> > > >
> > > > Ah, I omitted an important qualifier:
> > > >
> > > > b.	Compilers are permitted to use the "as-if" rule.  That is,
> > > > 	a compiler can emit whatever code it likes, as long as
> > > > 	the results of a single-threaded execution appear just
> > > > 	as if the compiler had followed all the relevant rules.
> > > > 	To see this, compile with a high level of optimization
> > > > 	and run the debugger on the resulting binary.
> > >
> > > That's okay for the single-CPU case.  I don't think it covers the
> > > multiple-CPU single-variable case correctly, though.  If you don't use
> > > READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> > > and stores?  And won't that potentially cause the end result to be
> > > different from what you would get if the code had appeared to execute
> > > in order?
> >
> > Ah, good point, I need yet another qualifier.  How about the following?
> >
> > b.	Compilers are permitted to use the "as-if" rule.  That is,
> > 	a compiler can emit whatever code it likes for normal
> > 	accesses, as long as the results of a single-threaded
> > 	execution appear just as if the compiler had followed
> > 	all the relevant rules.  To see this, compile with a
> > 	high level of optimization and run the debugger on the
> > 	resulting binary.
> >
> > I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
> > and atomics.  This, in conjunction with the previously added
> > "single-threaded execution", means that yes, the compiler is permitted
> > to tear normal loads and stores.  The reason is that a single-threaded
> > run could not tell the difference.  Interrupt handlers or multiple
> > threads are required to detect load/store tearing.
> >
> > So, what am I still missing?  ;-)
>
> Well, you could explicitly mention that in the multi-thread case, this
> means all accesses to the shared variable had better use READ_ONCE() or
> WRITE_ONCE().

Like this?

							Thanx, Paul

------------------------------------------------------------------------

d.	If there are multiple CPUs, accesses to shared variables
	should use READ_ONCE() and WRITE_ONCE() or stronger
	to prevent load/store tearing, load/store fusing, and
	invented loads and stores.  There are exceptions to
	this rule, for example:

	i.	When there is no possibility of a given
		shared variable being updated, for example,
		while holding the update-side lock, reads
		from that variable need not use READ_ONCE().

	ii.	When there is no possibility of a given shared
		variable being either read or updated, for
		example, when running during early boot, reads
		from that variable need not use READ_ONCE() and
		writes to that variable need not use WRITE_ONCE().
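[Editor's note: rule d. and its exception i. can be sketched in userspace C. The READ_ONCE()/WRITE_ONCE() definitions are simplified analogues of the kernel macros in include/linux/compiler.h, and counter_inc()/counter_read() are hypothetical helper names, not kernel APIs.]

```c
#include <pthread.h>

/* Simplified userspace analogues of the kernel macros. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long counter;	/* updated under lock, read locklessly */

void counter_inc(void)
{
	pthread_mutex_lock(&lock);
	/*
	 * Exception i.: the update-side lock is held, so counter cannot
	 * change underneath us and the plain read in "counter + 1" need
	 * not use READ_ONCE().  The store still needs WRITE_ONCE(),
	 * because lockless readers may run concurrently and must not
	 * observe a torn or invented store.
	 */
	WRITE_ONCE(counter, counter + 1);
	pthread_mutex_unlock(&lock);
}

unsigned long counter_read(void)
{
	/*
	 * Lockless reader: READ_ONCE() prevents the compiler from
	 * tearing or fusing this load (rule d.).
	 */
	return READ_ONCE(counter);
}
```

Note the asymmetry: the lock protects the read-modify-write on the update side, while READ_ONCE()/WRITE_ONCE() protect the individual accesses that race with the lockless readers.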
* Re: Linux-kernel examples for LKMM recipes
  2017-10-18 20:28               ` Paul E. McKenney
@ 2017-10-18 21:18                 ` Alan Stern
  2017-10-18 22:57                   ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-18 21:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
	Jade Alglave, Luc Maranget, Kernel development list

On Wed, 18 Oct 2017, Paul E. McKenney wrote:

> > Well, you could explicitly mention that in the multi-thread case, this
> > means all accesses to the shared variable had better use READ_ONCE() or
> > WRITE_ONCE().
>
> Like this?
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> d.	If there are multiple CPUs, accesses to shared variables
> 	should use READ_ONCE() and WRITE_ONCE() or stronger
> 	to prevent load/store tearing, load/store fusing, and
> 	invented loads and stores.  There are exceptions to
> 	this rule, for example:
>
> 	i.	When there is no possibility of a given
> 		shared variable being updated, for example,
> 		while holding the update-side lock, reads
> 		from that variable need not use READ_ONCE().
>
> 	ii.	When there is no possibility of a given shared
> 		variable being either read or updated, for
> 		example, when running during early boot, reads
> 		from that variable need not use READ_ONCE() and
> 		writes to that variable need not use WRITE_ONCE().

Yeah, except that you mean being read or updated by another thread.

Alan
* Re: Linux-kernel examples for LKMM recipes
  2017-10-18 21:18                 ` Alan Stern
@ 2017-10-18 22:57                   ` Paul E. McKenney
  0 siblings, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-18 22:57 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
	Jade Alglave, Luc Maranget, Kernel development list

On Wed, Oct 18, 2017 at 05:18:37PM -0400, Alan Stern wrote:
> On Wed, 18 Oct 2017, Paul E. McKenney wrote:
>
> > > Well, you could explicitly mention that in the multi-thread case, this
> > > means all accesses to the shared variable had better use READ_ONCE() or
> > > WRITE_ONCE().
> >
> > Like this?
> >
> > 							Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > d.	If there are multiple CPUs, accesses to shared variables
> > 	should use READ_ONCE() and WRITE_ONCE() or stronger
> > 	to prevent load/store tearing, load/store fusing, and
> > 	invented loads and stores.  There are exceptions to
> > 	this rule, for example:
> >
> > 	i.	When there is no possibility of a given
> > 		shared variable being updated, for example,
> > 		while holding the update-side lock, reads
> > 		from that variable need not use READ_ONCE().
> >
> > 	ii.	When there is no possibility of a given shared
> > 		variable being either read or updated, for
> > 		example, when running during early boot, reads
> > 		from that variable need not use READ_ONCE() and
> > 		writes to that variable need not use WRITE_ONCE().
>
> Yeah, except that you mean being read or updated by another thread.

Good point, added that qualifier.

							Thanx, Paul
end of thread, other threads: [~2017-10-18 22:57 UTC | newest]

Thread overview: 19+ messages:
2017-10-11 22:32 Linux-kernel examples for LKMM recipes Paul E. McKenney
2017-10-12  1:23 ` Boqun Feng
2017-10-12 11:27   ` Will Deacon
2017-10-17 20:37     ` Paul E. McKenney
2017-10-17 20:56       ` Paul E. McKenney
2017-10-12 13:27 ` Andrea Parri
2017-10-17 20:18   ` Paul E. McKenney
2017-10-13 19:44 ` Alan Stern
2017-10-13 20:00   ` Paul E. McKenney
2017-10-13 20:09     ` Alan Stern
2017-10-17 18:56       ` Paul E. McKenney
2017-10-17 19:38         ` Alan Stern
2017-10-17 20:33           ` Paul E. McKenney
2017-10-17 21:03             ` Alan Stern
2017-10-17 21:55               ` Paul E. McKenney
2017-10-18 14:43                 ` Alan Stern
2017-10-18 20:28                   ` Paul E. McKenney
2017-10-18 21:18                     ` Alan Stern
2017-10-18 22:57                       ` Paul E. McKenney