* Linux-kernel examples for LKMM recipes
@ 2017-10-11 22:32 Paul E. McKenney
2017-10-12 1:23 ` Boqun Feng
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-11 22:32 UTC (permalink / raw)
To: stern, parri.andrea, will.deacon, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget
Cc: linux-kernel
Hello!
At Linux Plumbers Conference, we got requests for a recipes document,
and a further request to point to actual code in the Linux kernel.
I have pulled together some examples for various litmus-test families,
as shown below. The decoder ring for the abbreviations (ISA2, LB, SB,
MP, ...) is here:
https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
This document is also checked into the memory-models git archive:
https://github.com/aparri/memory-model.git
I would be especially interested in simpler examples in general, and
of course any example at all for the cases where I was unable to find
any. Thoughts?
Thanx, Paul
------------------------------------------------------------------------
This document lists the litmus-test patterns that we have been discussing,
along with examples from the Linux kernel. This is intended to feed into
the recipes document. All examples are from v4.13.
0. Single-variable SC.
a. Within a single CPU, the use of the ->dynticks_nmi_nesting
counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
(see kernel/rcu/tree.c). The counter is accessed by
interrupts and NMIs as well as by process-level code.
This counter can be accessed by other CPUs, but only
for debug output.
b. Between CPUs, I would put forward the ->dflags
updates, but this is anything but simple. But maybe
OK for an illustration?
1. MP (see test6.pdf for nickname translation)
a. smp_store_release() / smp_load_acquire()
init_stack_slab() in lib/stackdepot.c uses release-acquire
to handle initialization of a slab of the stack. Working
out the mutual-exclusion design is left as an exercise for
the reader.
b. rcu_assign_pointer() / rcu_dereference()
expand_to_next_prime() does the rcu_assign_pointer(),
and next_prime_number() does the rcu_dereference().
This mediates access to a bit vector that is expanded
as additional primes are needed. These two functions
are in lib/prime_numbers.c.
c. smp_wmb() / smp_rmb()
xlog_state_switch_iclogs() contains the following:
log->l_curr_block -= log->l_logBBsize;
ASSERT(log->l_curr_block >= 0);
smp_wmb();
log->l_curr_cycle++;
And xlog_valid_lsn() contains the following:
cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
smp_rmb();
cur_block = ACCESS_ONCE(log->l_curr_block);
d. Replacing either of the above with smp_mb()
Holding off on this one for the moment...
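For illustration, the release-acquire MP idiom in 1a can be sketched in
userspace terms, with C11 atomics standing in for the kernel primitives
(this is an analogue with made-up names, not kernel code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int data;          /* payload; ordinary (non-atomic) accesses */
static atomic_int flag;   /* the "message is ready" flag */

static void *writer(void *arg)
{
	data = 42;                                             /* plain store */
	atomic_store_explicit(&flag, 1, memory_order_release); /* ~ smp_store_release() */
	return NULL;
}

/* Spin until the release store is visible, then read the payload. */
static int read_after_acquire(void)
{
	while (!atomic_load_explicit(&flag, memory_order_acquire)) /* ~ smp_load_acquire() */
		;
	return data; /* MP guarantee: this sees 42, never the pre-init value */
}

int run_mp_demo(void)
{
	pthread_t w;
	int val;

	if (pthread_create(&w, NULL, writer, NULL))
		return -1;
	val = read_after_acquire();
	pthread_join(w, NULL);
	return val;
}
```

The point is only the pairing: the acquire load that observes the
release store also observes everything ordered before that store.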
2. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
Lots of variety here; one can in some cases substitute:
a. READ_ONCE() for smp_load_acquire()
b. WRITE_ONCE() for smp_store_release()
c. Dependencies for both smp_load_acquire() and
smp_store_release().
d. smp_wmb() for smp_store_release() in first thread
of ISA2 and Z6.2.
e. smp_rmb() for smp_load_acquire() in last thread of ISA2.
The canonical illustration of LB involves the various memory
allocators, where you don't want a load from about-to-be-freed
memory to see a store initializing a later incarnation of that
same memory area. But the per-CPU caches make this a very
long and complicated example.
I am not aware of any three-CPU release-acquire chains in the
Linux kernel. There are three-CPU lock-based chains in RCU,
but these are not at all simple, either.
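For reference, a three-thread ISA2-style release-acquire chain could be
sketched in the same userspace C11 terms (again an analogue with
made-up names, not kernel code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;
static atomic_int f1, f2;

static void *cpu0(void *arg)
{
	payload = 1;                                          /* plain store */
	atomic_store_explicit(&f1, 1, memory_order_release);  /* ~ smp_store_release() */
	return NULL;
}

static void *cpu1(void *arg)
{
	while (!atomic_load_explicit(&f1, memory_order_acquire)) /* ~ smp_load_acquire() */
		;
	atomic_store_explicit(&f2, 1, memory_order_release);     /* pass it on */
	return NULL;
}

/* Last link of the chain: once f2 is seen set, payload must be visible,
 * by transitivity of the two release-acquire links. */
int cpu2_read(void)
{
	while (!atomic_load_explicit(&f2, memory_order_acquire))
		;
	return payload;
}

int run_isa2_demo(void)
{
	pthread_t t0, t1;
	int val;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	val = cpu2_read();
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return val;
}
```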
Thoughts?
3. SB
a. smp_mb(), as in lockless wait-wakeup coordination.
And as in sys_membarrier()-scheduler coordination,
for that matter.
Examples seem to be lacking. Most cases use locking.
Here is one rather strange one from RCU:
void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
{
	unsigned long flags;
	bool needwake;
	bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);

	rhp->next = NULL;
	rhp->func = func;
	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
	needwake = !rcu_tasks_cbs_head;
	*rcu_tasks_cbs_tail = rhp;
	rcu_tasks_cbs_tail = &rhp->next;
	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
	/* We can't create the thread unless interrupts are enabled. */
	if ((needwake && havetask) ||
	    (!havetask && !irqs_disabled_flags(flags))) {
		rcu_spawn_tasks_kthread();
		wake_up(&rcu_tasks_cbs_wq);
	}
}
And for the wait side, using synchronize_sched() to supply
the barrier for both ends, with the preemption disabling
due to raw_spin_lock_irqsave() serving as the read-side
critical section:
if (!list) {
	wait_event_interruptible(rcu_tasks_cbs_wq,
				 rcu_tasks_cbs_head);
	if (!rcu_tasks_cbs_head) {
		WARN_ON(signal_pending(current));
		schedule_timeout_interruptible(HZ/10);
	}
	continue;
}
synchronize_sched();
-----------------
Here is another one that uses atomic_cmpxchg() as a
full memory barrier:
if (!wait_event_timeout(*wait, !atomic_read(stopping),
			msecs_to_jiffies(1000))) {
	atomic_set(stopping, 0);
	smp_mb();
	return -ETIMEDOUT;
}

int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
				     atomic_t *stopping)
{
	if (atomic_cmpxchg(stopping, 1, 0)) {
		wake_up(wait);
		return 1;
	}

	return 0;
}
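For reference, the bare SB pattern itself can be sketched in userspace
C11 terms, with seq_cst operations playing the role of smp_mb() between
each thread's store and load (an analogue with made-up names, not
kernel code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int sx, sy;
static int r0, r1;

static void *side0(void *arg)
{
	atomic_store(&sx, 1);   /* seq_cst store and seq_cst load together */
	r0 = atomic_load(&sy);  /* play the role of store; smp_mb(); load  */
	return NULL;
}

static void *side1(void *arg)
{
	atomic_store(&sy, 1);
	r1 = atomic_load(&sx);
	return NULL;
}

/* Returns 1 iff the forbidden SB outcome (both loads seeing zero) did
 * not occur, which full ordering guarantees on every run. */
int run_sb_demo(void)
{
	pthread_t a, b;

	atomic_store(&sx, 0);
	atomic_store(&sy, 0);
	pthread_create(&a, NULL, side0, NULL);
	pthread_create(&b, NULL, side1, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return !(r0 == 0 && r1 == 0);
}
```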
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Linux-kernel examples for LKMM recipes
2017-10-11 22:32 Linux-kernel examples for LKMM recipes Paul E. McKenney
@ 2017-10-12 1:23 ` Boqun Feng
2017-10-12 11:27 ` Will Deacon
2017-10-17 20:56 ` Paul E. McKenney
2017-10-12 13:27 ` Andrea Parri
2017-10-13 19:44 ` Alan Stern
2 siblings, 2 replies; 19+ messages in thread
From: Boqun Feng @ 2017-10-12 1:23 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, parri.andrea, will.deacon, peterz, npiggin, dhowells,
j.alglave, luc.maranget, linux-kernel
On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote:
> Hello!
>
> At Linux Plumbers Conference, we got requests for a recipes document,
> and a further request to point to actual code in the Linux kernel.
> I have pulled together some examples for various litmus-test families,
> as shown below. The decoder ring for the abbreviations (ISA2, LB, SB,
> MP, ...) is here:
>
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
>
> This document is also checked into the memory-models git archive:
>
> https://github.com/aparri/memory-model.git
>
> I would be especially interested in simpler examples in general, and
> of course any example at all for the cases where I was unable to find
> any. Thoughts?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel. This is intended to feed into
> the recipes document. All examples are from v4.13.
>
> 0. Single-variable SC.
>
> a. Within a single CPU, the use of the ->dynticks_nmi_nesting
> counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> (see kernel/rcu/tree.c). The counter is accessed by
> interrupts and NMIs as well as by process-level code.
> This counter can be accessed by other CPUs, but only
> for debug output.
>
> b. Between CPUs, I would put forward the ->dflags
> updates, but this is anything but simple. But maybe
> OK for an illustration?
>
> 1. MP (see test6.pdf for nickname translation)
>
> a. smp_store_release() / smp_load_acquire()
>
> init_stack_slab() in lib/stackdepot.c uses release-acquire
> to handle initialization of a slab of the stack. Working
> out the mutual-exclusion design is left as an exercise for
> the reader.
>
> b. rcu_assign_pointer() / rcu_dereference()
>
> expand_to_next_prime() does the rcu_assign_pointer(),
> and next_prime_number() does the rcu_dereference().
> This mediates access to a bit vector that is expanded
> as additional primes are needed. These two functions
> are in lib/prime_numbers.c.
>
> c. smp_wmb() / smp_rmb()
>
> xlog_state_switch_iclogs() contains the following:
>
> log->l_curr_block -= log->l_logBBsize;
> ASSERT(log->l_curr_block >= 0);
> smp_wmb();
> log->l_curr_cycle++;
>
> And xlog_valid_lsn() contains the following:
>
> cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> smp_rmb();
> cur_block = ACCESS_ONCE(log->l_curr_block);
>
> d. Replacing either of the above with smp_mb()
>
> Holding off on this one for the moment...
>
> 2. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
>
> Lots of variety here, can in some cases substitute:
>
> a. READ_ONCE() for smp_load_acquire()
> b. WRITE_ONCE() for smp_store_release()
> c. Dependencies for both smp_load_acquire() and
> smp_store_release().
> d. smp_wmb() for smp_store_release() in first thread
> of ISA2 and Z6.2.
> e. smp_rmb() for smp_load_acquire() in last thread of ISA2.
>
> The canonical illustration of LB involves the various memory
> allocators, where you don't want a load from about-to-be-freed
> memory to see a store initializing a later incarnation of that
> same memory area. But the per-CPU caches make this a very
> long and complicated example.
>
> I am not aware of any three-CPU release-acquire chains in the
> Linux kernel. There are three-CPU lock-based chains in RCU,
> but these are not at all simple, either.
>
The "Program-Order guarantees" case in the scheduler? See the comments
written by Peter above try_to_wake_up():
* The basic program-order guarantee on SMP systems is that when a task [t]
* migrates, all its activity on its old CPU [c0] happens-before any subsequent
* execution on its new CPU [c1].
...
* For blocking we (obviously) need to provide the same guarantee as for
* migration. However the means are completely different as there is no lock
* chain to provide order. Instead we do:
*
* 1) smp_store_release(X->on_cpu, 0)
* 2) smp_cond_load_acquire(!X->on_cpu)
*
* Example:
*
*   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
*
*   LOCK rq(0)->lock LOCK X->pi_lock
*   dequeue X
*   sched-out X
*   smp_store_release(X->on_cpu, 0);
*
*                    smp_cond_load_acquire(&X->on_cpu, !VAL);
*                    X->state = WAKING
*                    set_task_cpu(X,2)
*
*                    LOCK rq(2)->lock
*                    enqueue X
*                    X->state = RUNNING
*                    UNLOCK rq(2)->lock
*
*                                          LOCK rq(2)->lock // orders against CPU1
*                                          sched-out Z
*                                          sched-in X
*                                          UNLOCK rq(2)->lock
*
*                    UNLOCK X->pi_lock
*   UNLOCK rq(0)->lock
This is a chain mixing locks and acquire-release (maybe even better?).
And another example would be osq_{lock,unlock}() on multiple (more
than three) CPUs.
Regards,
Boqun
> Thoughts?
>
> 3. SB
>
> a. smp_mb(), as in lockless wait-wakeup coordination.
> And as in sys_membarrier()-scheduler coordination,
> for that matter.
>
> Examples seem to be lacking. Most cases use locking.
> Here is one rather strange one from RCU:
>
> void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> {
> unsigned long flags;
> bool needwake;
> bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
>
> rhp->next = NULL;
> rhp->func = func;
> raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> needwake = !rcu_tasks_cbs_head;
> *rcu_tasks_cbs_tail = rhp;
> rcu_tasks_cbs_tail = &rhp->next;
> raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> /* We can't create the thread unless interrupts are enabled. */
> if ((needwake && havetask) ||
> (!havetask && !irqs_disabled_flags(flags))) {
> rcu_spawn_tasks_kthread();
> wake_up(&rcu_tasks_cbs_wq);
> }
> }
>
> And for the wait side, using synchronize_sched() to supply
> the barrier for both ends, with the preemption disabling
> due to raw_spin_lock_irqsave() serving as the read-side
> critical section:
>
> if (!list) {
> wait_event_interruptible(rcu_tasks_cbs_wq,
> rcu_tasks_cbs_head);
> if (!rcu_tasks_cbs_head) {
> WARN_ON(signal_pending(current));
> schedule_timeout_interruptible(HZ/10);
> }
> continue;
> }
> synchronize_sched();
>
> -----------------
>
> Here is another one that uses atomic_cmpxchg() as a
> full memory barrier:
>
> if (!wait_event_timeout(*wait, !atomic_read(stopping),
> msecs_to_jiffies(1000))) {
> atomic_set(stopping, 0);
> smp_mb();
> return -ETIMEDOUT;
> }
>
> int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
> atomic_t *stopping)
> {
> if (atomic_cmpxchg(stopping, 1, 0)) {
> wake_up(wait);
> return 1;
> }
>
> return 0;
> }
>
* Re: Linux-kernel examples for LKMM recipes
2017-10-12 1:23 ` Boqun Feng
@ 2017-10-12 11:27 ` Will Deacon
2017-10-17 20:37 ` Paul E. McKenney
2017-10-17 20:56 ` Paul E. McKenney
1 sibling, 1 reply; 19+ messages in thread
From: Will Deacon @ 2017-10-12 11:27 UTC (permalink / raw)
To: Boqun Feng
Cc: Paul E. McKenney, stern, parri.andrea, peterz, npiggin, dhowells,
j.alglave, luc.maranget, linux-kernel
On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote:
> On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote:
> > I am not aware of any three-CPU release-acquire chains in the
> > Linux kernel. There are three-CPU lock-based chains in RCU,
> > but these are not at all simple, either.
> >
>
> The "Program-Order guarantees" case in scheduler? See the comments
> written by Peter above try_to_wake_up():
>
> * The basic program-order guarantee on SMP systems is that when a task [t]
> * migrates, all its activity on its old CPU [c0] happens-before any subsequent
> * execution on its new CPU [c1].
> ...
> * For blocking we (obviously) need to provide the same guarantee as for
> * migration. However the means are completely different as there is no lock
> * chain to provide order. Instead we do:
> *
> * 1) smp_store_release(X->on_cpu, 0)
> * 2) smp_cond_load_acquire(!X->on_cpu)
> *
> * Example:
> *
> *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> *
> *   LOCK rq(0)->lock LOCK X->pi_lock
> *   dequeue X
> *   sched-out X
> *   smp_store_release(X->on_cpu, 0);
> *
> *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
> *                    X->state = WAKING
> *                    set_task_cpu(X,2)
> *
> *                    LOCK rq(2)->lock
> *                    enqueue X
> *                    X->state = RUNNING
> *                    UNLOCK rq(2)->lock
> *
> *                                          LOCK rq(2)->lock // orders against CPU1
> *                                          sched-out Z
> *                                          sched-in X
> *                                          UNLOCK rq(2)->lock
> *
> *                    UNLOCK X->pi_lock
> *   UNLOCK rq(0)->lock
>
> This is a chain mixing locks and acquire-release (maybe even better?).
>
>
> And another example would be osq_{lock,unlock}() on multiple (more
> than three) CPUs.
I think the qrwlock also has something similar with the writer fairness
issue fixed:
CPU0: (writer doing an unlock)
smp_store_release(&lock->wlocked, 0); // Bottom byte of lock->cnts
CPU1: (waiting writer on slowpath)
atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
...
arch_spin_unlock(&lock->wait_lock);
CPU2: (reader on slowpath)
arch_spin_lock(&lock->wait_lock);
and there are mixed-size accesses here too. Fun stuff!
Will
* Re: Linux-kernel examples for LKMM recipes
2017-10-11 22:32 Linux-kernel examples for LKMM recipes Paul E. McKenney
2017-10-12 1:23 ` Boqun Feng
@ 2017-10-12 13:27 ` Andrea Parri
2017-10-17 20:18 ` Paul E. McKenney
2017-10-13 19:44 ` Alan Stern
2 siblings, 1 reply; 19+ messages in thread
From: Andrea Parri @ 2017-10-12 13:27 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, will.deacon, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, linux-kernel
Hi Paul,
On Wed, Oct 11, 2017 at 03:32:30PM -0700, Paul E. McKenney wrote:
> Hello!
>
> At Linux Plumbers Conference, we got requests for a recipes document,
> and a further request to point to actual code in the Linux kernel.
> I have pulled together some examples for various litmus-test families,
> as shown below. The decoder ring for the abbreviations (ISA2, LB, SB,
> MP, ...) is here:
>
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
>
> This document is also checked into the memory-models git archive:
>
> https://github.com/aparri/memory-model.git
>
> I would be especially interested in simpler examples in general, and
> of course any example at all for the cases where I was unable to find
> any. Thoughts?
Below are some examples we did discuss (at some point):
The comment in kernel/events/ring_buffer.c:perf_output_put_handle()
describes instances of MP+wmb+rmb and LB+ctrl+mb.
The comments in kernel/sched/core.c:try_to_wake_up() describe more
instances of MP ("plus locking") and LB (see finish_lock_switch()).
The comment in kernel/sched/core.c:task_rq_lock() describes an
instance of MP+wmb+addr-acqpo.
The comment in include/linux/wait.h:waitqueue_active() describes an
instance of SB+mb+mb.
63cae12bce986 ("perf/core: Fix sys_perf_event_open() vs. hotplug")
describes an instance of W+RWC+porel+mb+mb.
[...]
I wish we could say "any barrier (explicit or implicit) in the sources
is accompanied by a comment mentioning the pattern of interest...",
but life is not always this simple. ;-)
Andrea
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel. This is intended to feed into
> the recipes document. All examples are from v4.13.
>
> 0. Single-variable SC.
>
> a. Within a single CPU, the use of the ->dynticks_nmi_nesting
> counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> (see kernel/rcu/tree.c). The counter is accessed by
> interrupts and NMIs as well as by process-level code.
> This counter can be accessed by other CPUs, but only
> for debug output.
>
> b. Between CPUs, I would put forward the ->dflags
> updates, but this is anything but simple. But maybe
> OK for an illustration?
>
> 1. MP (see test6.pdf for nickname translation)
>
> a. smp_store_release() / smp_load_acquire()
>
> init_stack_slab() in lib/stackdepot.c uses release-acquire
> to handle initialization of a slab of the stack. Working
> out the mutual-exclusion design is left as an exercise for
> the reader.
>
> b. rcu_assign_pointer() / rcu_dereference()
>
> expand_to_next_prime() does the rcu_assign_pointer(),
> and next_prime_number() does the rcu_dereference().
> This mediates access to a bit vector that is expanded
> as additional primes are needed. These two functions
> are in lib/prime_numbers.c.
>
> c. smp_wmb() / smp_rmb()
>
> xlog_state_switch_iclogs() contains the following:
>
> log->l_curr_block -= log->l_logBBsize;
> ASSERT(log->l_curr_block >= 0);
> smp_wmb();
> log->l_curr_cycle++;
>
> And xlog_valid_lsn() contains the following:
>
> cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> smp_rmb();
> cur_block = ACCESS_ONCE(log->l_curr_block);
>
> d. Replacing either of the above with smp_mb()
>
> Holding off on this one for the moment...
>
> 2. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
>
> Lots of variety here, can in some cases substitute:
>
> a. READ_ONCE() for smp_load_acquire()
> b. WRITE_ONCE() for smp_store_release()
> c. Dependencies for both smp_load_acquire() and
> smp_store_release().
> d. smp_wmb() for smp_store_release() in first thread
> of ISA2 and Z6.2.
> e. smp_rmb() for smp_load_acquire() in last thread of ISA2.
>
> The canonical illustration of LB involves the various memory
> allocators, where you don't want a load from about-to-be-freed
> memory to see a store initializing a later incarnation of that
> same memory area. But the per-CPU caches make this a very
> long and complicated example.
>
> I am not aware of any three-CPU release-acquire chains in the
> Linux kernel. There are three-CPU lock-based chains in RCU,
> but these are not at all simple, either.
>
> Thoughts?
>
> 3. SB
>
> a. smp_mb(), as in lockless wait-wakeup coordination.
> And as in sys_membarrier()-scheduler coordination,
> for that matter.
>
> Examples seem to be lacking. Most cases use locking.
> Here is one rather strange one from RCU:
>
> void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> {
> unsigned long flags;
> bool needwake;
> bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
>
> rhp->next = NULL;
> rhp->func = func;
> raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> needwake = !rcu_tasks_cbs_head;
> *rcu_tasks_cbs_tail = rhp;
> rcu_tasks_cbs_tail = &rhp->next;
> raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> /* We can't create the thread unless interrupts are enabled. */
> if ((needwake && havetask) ||
> (!havetask && !irqs_disabled_flags(flags))) {
> rcu_spawn_tasks_kthread();
> wake_up(&rcu_tasks_cbs_wq);
> }
> }
>
> And for the wait side, using synchronize_sched() to supply
> the barrier for both ends, with the preemption disabling
> due to raw_spin_lock_irqsave() serving as the read-side
> critical section:
>
> if (!list) {
> wait_event_interruptible(rcu_tasks_cbs_wq,
> rcu_tasks_cbs_head);
> if (!rcu_tasks_cbs_head) {
> WARN_ON(signal_pending(current));
> schedule_timeout_interruptible(HZ/10);
> }
> continue;
> }
> synchronize_sched();
>
> -----------------
>
> Here is another one that uses atomic_cmpxchg() as a
> full memory barrier:
>
> if (!wait_event_timeout(*wait, !atomic_read(stopping),
> msecs_to_jiffies(1000))) {
> atomic_set(stopping, 0);
> smp_mb();
> return -ETIMEDOUT;
> }
>
> int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
> atomic_t *stopping)
> {
> if (atomic_cmpxchg(stopping, 1, 0)) {
> wake_up(wait);
> return 1;
> }
>
> return 0;
> }
>
* Re: Linux-kernel examples for LKMM recipes
2017-10-11 22:32 Linux-kernel examples for LKMM recipes Paul E. McKenney
2017-10-12 1:23 ` Boqun Feng
2017-10-12 13:27 ` Andrea Parri
@ 2017-10-13 19:44 ` Alan Stern
2017-10-13 20:00 ` Paul E. McKenney
2 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-13 19:44 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Wed, 11 Oct 2017, Paul E. McKenney wrote:
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel. This is intended to feed into
> the recipes document. All examples are from v4.13.
>
> 0. Single-variable SC.
>
> a. Within a single CPU, the use of the ->dynticks_nmi_nesting
> counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> (see kernel/rcu/tree.c). The counter is accessed by
> interrupts and NMIs as well as by process-level code.
> This counter can be accessed by other CPUs, but only
> for debug output.
I'm not sure that single-variable SC can really be represented by an
example. It gets used literally all over the kernel -- it's such a
large part of the way we think about computer programs that we rely on
it unconsciously.
For example, the very first function in the very first C source file
in the kernel/ directory (namely, check_free_space() in kernel/acct.c)
includes this code:
if (acct->active) {
u64 suspend = sbuf.f_blocks * SUSPEND;
do_div(suspend, 100);
How do we know that the value which gets divided by 100 is
sbuf.f_blocks * SUSPEND and not the random garbage which was stored in
suspend's memory location before it was initialized? Answer:
per-variable SC.
Okay, maybe that's not really applicable, since it doesn't involve
accesses to shared memory. Here's an example that does.
get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice.
How do we know that the value retrieved by the second call was not
stored _earlier_ than the value retrieved by the first call?
Per-variable SC.
> b. Between CPUs, I would put forward the ->dflags
> updates, but this is anything but simple. But maybe
> OK for an illustration?
Pretty much any code that accesses the same shared variable twice on
the same CPU could be an example of per-variable SC. But I don't think
people would learn much by studying such examples.
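That said, the guarantee itself can be sketched in userspace C11 terms,
with relaxed atomics standing in for READ_ONCE() (an analogue with
made-up names, not kernel code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int m;  /* the single shared variable */

static void *bump(void *arg)
{
	for (int i = 0; i < 100000; i++)
		atomic_fetch_add_explicit(&m, 1, memory_order_relaxed);
	return NULL;
}

/* Two relaxed loads of the same variable: per-variable SC (coherence)
 * forbids the second load from observing an older value than the first. */
static int reads_do_not_go_backwards(void)
{
	int a = atomic_load_explicit(&m, memory_order_relaxed);  /* ~ READ_ONCE() */
	int b = atomic_load_explicit(&m, memory_order_relaxed);  /* ~ READ_ONCE() */
	return b >= a;
}

int run_coherence_demo(void)
{
	pthread_t w;
	int ok = 1;

	pthread_create(&w, NULL, bump, NULL);
	for (int i = 0; i < 100000; i++)
		ok &= reads_do_not_go_backwards();
	pthread_join(w, NULL);
	return ok;
}
```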
Alan
* Re: Linux-kernel examples for LKMM recipes
2017-10-13 19:44 ` Alan Stern
@ 2017-10-13 20:00 ` Paul E. McKenney
2017-10-13 20:09 ` Alan Stern
0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-13 20:00 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote:
> On Wed, 11 Oct 2017, Paul E. McKenney wrote:
>
> > This document lists the litmus-test patterns that we have been discussing,
> > along with examples from the Linux kernel. This is intended to feed into
> > the recipes document. All examples are from v4.13.
> >
> > 0. Single-variable SC.
> >
> > a. Within a single CPU, the use of the ->dynticks_nmi_nesting
> > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> > (see kernel/rcu/tree.c). The counter is accessed by
> > interrupts and NMIs as well as by process-level code.
> > This counter can be accessed by other CPUs, but only
> > for debug output.
>
> I'm not sure that single-variable SC can really be represented by an
> example. It gets used literally all over the kernel -- it's such a
> large part of the way we think about computer programs that we rely on
> it unconsciously.
>
> For example, the very first function in the very first C source file
> in the kernel/ directory (namely, check_free_space() in kernel/acct.c)
> includes this code:
>
> if (acct->active) {
> u64 suspend = sbuf.f_blocks * SUSPEND;
> do_div(suspend, 100);
>
> How do we know that the value which gets divided by 100 is
> sbuf.f_blocks * SUSPEND and not the random garbage which was stored in
> suspend's memory location before it was initialized? Answer:
> per-variable SC.
>
> Okay, maybe that's not really applicable, since it doesn't involve
> accesses to shared memory. Here's an example that does.
> get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice.
> How do we know that the value retrieved by the second call was not
> stored _earlier_ than the value retrieved by the first call?
> Per-variable SC.
>
> > b. Between CPUs, I would put forward the ->dflags
> > updates, but this is anything but simple. But maybe
> > OK for an illustration?
>
> Pretty much any code that accesses the same shared variable twice on
> the same CPU could be an example of per-variable SC. But I don't think
> people would learn much by studying such examples.
Perhaps the recipes document should just baldly state that any execution
having only one thread and/or having only one variable will be fully
ordered?
Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
2017-10-13 20:00 ` Paul E. McKenney
@ 2017-10-13 20:09 ` Alan Stern
2017-10-17 18:56 ` Paul E. McKenney
0 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-13 20:09 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Fri, 13 Oct 2017, Paul E. McKenney wrote:
> On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote:
> > On Wed, 11 Oct 2017, Paul E. McKenney wrote:
> >
> > > This document lists the litmus-test patterns that we have been discussing,
> > > along with examples from the Linux kernel. This is intended to feed into
> > > the recipes document. All examples are from v4.13.
> > >
> > > 0. Single-variable SC.
> > >
> > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting
> > > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> > > (see kernel/rcu/tree.c). The counter is accessed by
> > > interrupts and NMIs as well as by process-level code.
> > > This counter can be accessed by other CPUs, but only
> > > for debug output.
> >
> > I'm not sure that single-variable SC can really be represented by an
> > example. It gets used literally all over the kernel -- it's such a
> > large part of the way we think about computer programs that we rely on
> > it unconsciously.
> >
> > For example, the very first function in the very first C source file
> > in the kernel/ directory (namely, check_free_space() in kernel/acct.c)
> > includes this code:
> >
> > if (acct->active) {
> > u64 suspend = sbuf.f_blocks * SUSPEND;
> > do_div(suspend, 100);
> >
> > How do we know that the value which gets divided by 100 is
> > sbuf.f_blocks * SUSPEND and not the random garbage which was stored in
> > suspend's memory location before it was initialized? Answer:
> > per-variable SC.
> >
> > Okay, maybe that's not really applicable, since it doesn't involve
> > accesses to shared memory. Here's an example that does.
> > get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice.
> > How do we know that the value retrieved by the second call was not
> > stored _earlier_ than the value retrieved by the first call?
> > Per-variable SC.
> >
> > > b. Between CPUs, I would put forward the ->dflags
> > > updates, but this is anything but simple. But maybe
> > > OK for an illustration?
> >
> > Pretty much any code that accesses the same shared variable twice on
> > the same CPU could be an example of per-variable SC. But I don't think
> > people would learn much by studying such examples.
>
> Perhaps the recipes document should just baldly state that any execution
> having only one thread and/or having only one variable will be fully
> ordered?
That wouldn't be a bad idea. (Although the part about only one thread
should be pretty obvious.)
Also, you have to be a little careful because the ordering of the
execution may not always agree with the ordering of the source code.
The compiler is allowed to evaluate the arguments to a function call,
for example, in any order it likes.
Alan
* Re: Linux-kernel examples for LKMM recipes
2017-10-13 20:09 ` Alan Stern
@ 2017-10-17 18:56 ` Paul E. McKenney
2017-10-17 19:38 ` Alan Stern
0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-17 18:56 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Fri, Oct 13, 2017 at 04:09:26PM -0400, Alan Stern wrote:
> On Fri, 13 Oct 2017, Paul E. McKenney wrote:
> > On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote:
> > > On Wed, 11 Oct 2017, Paul E. McKenney wrote:
[ . . . ]
> > Perhaps the recipes document should just baldly state that any execution
> > having only one thread and/or having only one variable will be fully
> > ordered?
>
> That wouldn't be a bad idea. (Although the part about only one thread
> should be pretty obvious.)
>
> Also, you have to be a little careful because the ordering of the
> execution may not always agree with the ordering of the source code.
> The compiler is allowed to evaluate the arguments to a function call,
> for example, in any order it likes.
How about this?
0. Simple special cases
If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order. There are (as
usual) some things to be careful of:
a. There are some aspects of the C language that are
unordered. For example, the compiler can output code
computing arguments of a multi-parameter function in
any order it likes, or even interleaved if it so chooses.
b. Compilers are permitted to use the "as-if" rule.
That is, a compiler can emit whatever code it likes,
as long as the results appear just as if the compiler
had followed all the relevant rules. To see this,
compile with a high level of optimization and run
the debugger on the resulting binary.
c. If there is only one variable but multiple CPUs, all
accesses to that variable must be aligned and full sized.
Variables that straddle cachelines or pages void your
full-ordering warranty, as do undersized accesses that
load from or store to only part of the variable.
Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
2017-10-17 18:56 ` Paul E. McKenney
@ 2017-10-17 19:38 ` Alan Stern
2017-10-17 20:33 ` Paul E. McKenney
0 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-17 19:38 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> How about this?
>
> 0. Simple special cases
>
> If there is only one CPU on the one hand or only one variable
> on the other, the code will execute in order. There are (as
> usual) some things to be careful of:
>
> a. There are some aspects of the C language that are
> unordered. For example, the compiler can output code
> computing arguments of a multi-parameter function in
> any order it likes, or even interleaved if it so chooses.
That parses a little oddly. I wouldn't agree that the compiler outputs
the code in any order it likes!
In fact, I wouldn't even mention the compiler at all. Just say that
(with a few exceptions) the language doesn't specify the order in which
the arguments of a function or operation should be evaluated. For
example, in the expression "f(x) + g(y)", the order in which f and g
are called is not defined; the object code is allowed to use either
order or even to interleave the computations.
> b. Compilers are permitted to use the "as-if" rule.
> That is, a compiler can emit whatever code it likes,
> as long as the results appear just as if the compiler
> had followed all the relevant rules. To see this,
> compile with a high level of optimization and run
> the debugger on the resulting binary.
You might omit the last sentence. Furthermore, if the accesses don't
use READ_ONCE/WRITE_ONCE then the code might not get the same result as
if it had executed in order (even for a single variable!), and if you
do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
it likes.
> c. If there is only one variable but multiple CPUs, all
> accesses to that variable must be aligned and full sized.
I would say that the variable is what needs to be aligned, not the
accesses. (Although, if the variable is aligned and all the accesses
are full sized, then they must necessarily be aligned as well.)
> Variables that straddle cachelines or pages void your
> full-ordering warranty, as do undersized accesses that
> load from or store to only part of the variable.
How can a variable straddle pages without also straddling cache lines?
Alan
* Re: Linux-kernel examples for LKMM recipes
2017-10-12 13:27 ` Andrea Parri
@ 2017-10-17 20:18 ` Paul E. McKenney
0 siblings, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-17 20:18 UTC (permalink / raw)
To: Andrea Parri
Cc: stern, will.deacon, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, linux-kernel
On Thu, Oct 12, 2017 at 03:27:44PM +0200, Andrea Parri wrote:
> Hi Paul,
>
> On Wed, Oct 11, 2017 at 03:32:30PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > At Linux Plumbers Conference, we got requests for a recipes document,
> > and a further request to point to actual code in the Linux kernel.
> > I have pulled together some examples for various litmus-test families,
> > as shown below. The decoder ring for the abbreviations (ISA2, LB, SB,
> > MP, ...) is here:
> >
> > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
> >
> > This document is also checked into the memory-models git archive:
> >
> > https://github.com/aparri/memory-model.git
> >
> > I would be especially interested in simpler examples in general, and
> > of course any example at all for the cases where I was unable to find
> > any. Thoughts?
>
> Below are some examples we did discuss (at some point):
>
> The comment in kernel/events/ring_buffer.c:perf_output_put_handle()
> describes instances of MP+wmb+rmb and LB+ctrl+mb.
I added this as an alternative for MP and as the example for LB.
> The comments in kernel/sched/core.c:try_to_wake_up() describes more
> instances of MP ("plus locking") and LB (see finish_lock_switch()).
This one looks a bit more messy, so I will set it aside, for the moment,
anyway.
> The comment in kernel/sched/core.c:task_rq_lock() describes an ins-
> tance of MP+wmb+addr-acqpo.
This one also looks a bit messy, so I am setting it aside as well.
> The comment in include/linux/wait.h:waitqueue_active() describes an
> instance of SB+mb+mb.
Very good, I took this as the generic pattern for the current pair
of SB examples.
> 63cae12bce986 ("perf/core: Fix sys_perf_event_open() vs. hotplug")
> describes an instance of W+RWC+porel+mb+mb.
Well, this one certainly is of historical interest. After all, it might
well be the first Linux-kernel commit containing a litmus test. ;-)
I put it in recipes-LKcode-63cae12bce986.txt and reference it from
recipes-LKcode.txt.
> [...]
>
> I wish we could say "any barrier (explicit or implicit) in sources
> is accompanied by a comment mentioning the interested pattern...",
> but life is not always this simple. ;-)
Well, at least scripts/checkpatch.pl now complains if you try to add
a new comment-free barrier. Not that these complaints are always
paid attention to...
Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
2017-10-17 19:38 ` Alan Stern
@ 2017-10-17 20:33 ` Paul E. McKenney
2017-10-17 21:03 ` Alan Stern
0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-17 20:33 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
>
> > How about this?
> >
> > 0. Simple special cases
> >
> > If there is only one CPU on the one hand or only one variable
> > on the other, the code will execute in order. There are (as
> > usual) some things to be careful of:
> >
> > a. There are some aspects of the C language that are
> > unordered. For example, the compiler can output code
> > computing arguments of a multi-parameter function in
> > any order it likes, or even interleaved if it so chooses.
>
> That parses a little oddly. I wouldn't agree that the compiler outputs
> the code in any order it likes!
When was the last time you talked to a compiler writer? ;-)
> In fact, I wouldn't even mention the compiler at all. Just say that
> (with a few exceptions) the language doesn't specify the order in which
> the arguments of a function or operation should be evaluated. For
> example, in the expression "f(x) + g(y)", the order in which f and g
> are called is not defined; the object code is allowed to use either
> order or even to interleave the computations.
Nevertheless, I took your suggestion:
a. There are some aspects of the C language that are
unordered. For example, in the expression "f(x) + g(y)",
the order in which f and g are called is not defined;
the object code is allowed to use either order or even
to interleave the computations.
> > b. Compilers are permitted to use the "as-if" rule.
> > That is, a compiler can emit whatever code it likes,
> > as long as the results appear just as if the compiler
> > had followed all the relevant rules. To see this,
> > compile with a high level of optimization and run
> > the debugger on the resulting binary.
>
> You might omit the last sentence. Furthermore, if the accesses don't
> use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> if it had executed in order (even for a single variable!), and if you
> do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> it likes.
Ah, I omitted an important qualifier:
b. Compilers are permitted to use the "as-if" rule. That is,
a compiler can emit whatever code it likes, as long as
the results of a single-threaded execution appear just
as if the compiler had followed all the relevant rules.
To see this, compile with a high level of optimization
and run the debugger on the resulting binary.
I have seen people (including kernel hackers) surprised by what optimizers
do, so I would prefer that the last sentence remain.
> > c. If there is only one variable but multiple CPUs, all
> > accesses to that variable must be aligned and full sized.
>
> I would say that the variable is what needs to be aligned, not the
> accesses. (Although, if the variable is aligned and all the accesses
> are full sized, then they must necessarily be aligned as well.)
I was thinking in terms of an unaligned 16-bit access to a 32-bit
variable. But how about this?
c. If there is only one variable but multiple CPUs, all
that variable must be properly aligned and all accesses
to that variable must be full sized.
> > Variables that straddle cachelines or pages void your
> > full-ordering warranty, as do undersized accesses that
> > load from or store to only part of the variable.
>
> How can a variable straddle pages without also straddling cache lines?
Well, a variable -can- straddle cachelines without straddling pages,
which justifies the "or". Furthermore, given that cacheline sizes have
been growing, but pages are still 4KB, it is probably only a matter
of time. ;-)
Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
2017-10-12 11:27 ` Will Deacon
@ 2017-10-17 20:37 ` Paul E. McKenney
0 siblings, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-17 20:37 UTC (permalink / raw)
To: Will Deacon
Cc: Boqun Feng, stern, parri.andrea, peterz, npiggin, dhowells,
j.alglave, luc.maranget, linux-kernel
On Thu, Oct 12, 2017 at 12:27:19PM +0100, Will Deacon wrote:
> On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote:
> > On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote:
> > > I am not aware of any three-CPU release-acquire chains in the
> > > Linux kernel. There are three-CPU lock-based chains in RCU,
> > > but these are not at all simple, either.
> > >
> >
> > The "Program-Order guarantees" case in scheduler? See the comments
> > written by Peter above try_to_wake_up():
> >
> > * The basic program-order guarantee on SMP systems is that when a task [t]
> > * migrates, all its activity on its old CPU [c0] happens-before any subsequent
> > * execution on its new CPU [c1].
> > ...
> > * For blocking we (obviously) need to provide the same guarantee as for
> > * migration. However the means are completely different as there is no lock
> > * chain to provide order. Instead we do:
> > *
> > * 1) smp_store_release(X->on_cpu, 0)
> > * 2) smp_cond_load_acquire(!X->on_cpu)
> > *
> > * Example:
> > *
> > * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
> > *
> > * LOCK rq(0)->lock LOCK X->pi_lock
> > * dequeue X
> > * sched-out X
> > * smp_store_release(X->on_cpu, 0);
> > *
> > * smp_cond_load_acquire(&X->on_cpu, !VAL);
> > * X->state = WAKING
> > * set_task_cpu(X,2)
> > *
> > * LOCK rq(2)->lock
> > * enqueue X
> > * X->state = RUNNING
> > * UNLOCK rq(2)->lock
> > *
> > * LOCK rq(2)->lock // orders against CPU1
> > * sched-out Z
> > * sched-in X
> > * UNLOCK rq(2)->lock
> > *
> > * UNLOCK X->pi_lock
> > * UNLOCK rq(0)->lock
> >
> > This is a chain mixed with lock and acquire-release (maybe even better?).
> >
> >
> > And another example would be osq_{lock,unlock}() on multiple (more than
> > three) CPUs.
>
> I think the qrwlock also has something similar with the writer fairness
> issue fixed:
>
> CPU0: (writer doing an unlock)
> smp_store_release(&lock->wlocked, 0); // Bottom byte of lock->cnts
>
>
> CPU1: (waiting writer on slowpath)
> atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> ...
> arch_spin_unlock(&lock->wait_lock);
>
>
> CPU2: (reader on slowpath)
> arch_spin_lock(&lock->wait_lock);
>
> and there's mixed-size accesses here too. Fun stuff!
You had me going there until you mentioned the mixed-size accesses. ;-)
Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
2017-10-12 1:23 ` Boqun Feng
2017-10-12 11:27 ` Will Deacon
@ 2017-10-17 20:56 ` Paul E. McKenney
1 sibling, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-17 20:56 UTC (permalink / raw)
To: Boqun Feng
Cc: stern, parri.andrea, will.deacon, peterz, npiggin, dhowells,
j.alglave, luc.maranget, linux-kernel
On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote:
> On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote:
> > Hello!
> >
> > At Linux Plumbers Conference, we got requests for a recipes document,
> > and a further request to point to actual code in the Linux kernel.
> > I have pulled together some examples for various litmus-test families,
> > as shown below. The decoder ring for the abbreviations (ISA2, LB, SB,
> > MP, ...) is here:
> >
> > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
> >
> > This document is also checked into the memory-models git archive:
> >
> > https://github.com/aparri/memory-model.git
> >
> > I would be especially interested in simpler examples in general, and
> > of course any example at all for the cases where I was unable to find
> > any. Thoughts?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > This document lists the litmus-test patterns that we have been discussing,
> > along with examples from the Linux kernel. This is intended to feed into
> > the recipes document. All examples are from v4.13.
> >
> > 0. Single-variable SC.
> >
> > a. Within a single CPU, the use of the ->dynticks_nmi_nesting
> > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> > (see kernel/rcu/tree.c). The counter is accessed by
> > interrupts and NMIs as well as by process-level code.
> > This counter can be accessed by other CPUs, but only
> > for debug output.
> >
> > b. Between CPUs, I would put forward the ->dflags
> > updates, but this is anything but simple. But maybe
> > OK for an illustration?
> >
> > 1. MP (see test6.pdf for nickname translation)
> >
> > a. smp_store_release() / smp_load_acquire()
> >
> > init_stack_slab() in lib/stackdepot.c uses release-acquire
> > to handle initialization of a slab of the stack. Working
> > out the mutual-exclusion design is left as an exercise for
> > the reader.
> >
> > b. rcu_assign_pointer() / rcu_dereference()
> >
> > expand_to_next_prime() does the rcu_assign_pointer(),
> > and next_prime_number() does the rcu_dereference().
> > This mediates access to a bit vector that is expanded
> > as additional primes are needed. These two functions
> > are in lib/prime_numbers.c.
> >
> > c. smp_wmb() / smp_rmb()
> >
> > xlog_state_switch_iclogs() contains the following:
> >
> > log->l_curr_block -= log->l_logBBsize;
> > ASSERT(log->l_curr_block >= 0);
> > smp_wmb();
> > log->l_curr_cycle++;
> >
> > And xlog_valid_lsn() contains the following:
> >
> > cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> > smp_rmb();
> > cur_block = ACCESS_ONCE(log->l_curr_block);
> >
> > d. Replacing either of the above with smp_mb()
> >
> > Holding off on this one for the moment...
> >
> > 2. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> >
> > Lots of variety here, can in some cases substitute:
> >
> > a. READ_ONCE() for smp_load_acquire()
> > b. WRITE_ONCE() for smp_store_release()
> > c. Dependencies for both smp_load_acquire() and
> > smp_store_release().
> > d. smp_wmb() for smp_store_release() in first thread
> > of ISA2 and Z6.2.
> > e. smp_rmb() for smp_load_acquire() in last thread of ISA2.
> >
> > The canonical illustration of LB involves the various memory
> > allocators, where you don't want a load from about-to-be-freed
> > memory to see a store initializing a later incarnation of that
> > same memory area. But the per-CPU caches make this a very
> > long and complicated example.
> >
> > I am not aware of any three-CPU release-acquire chains in the
> > Linux kernel. There are three-CPU lock-based chains in RCU,
> > but these are not at all simple, either.
> >
>
> The "Program-Order guarantees" case in scheduler? See the comments
> written by Peter above try_to_wake_up():
>
> * The basic program-order guarantee on SMP systems is that when a task [t]
> * migrates, all its activity on its old CPU [c0] happens-before any subsequent
> * execution on its new CPU [c1].
> ...
> * For blocking we (obviously) need to provide the same guarantee as for
> * migration. However the means are completely different as there is no lock
> * chain to provide order. Instead we do:
> *
> * 1) smp_store_release(X->on_cpu, 0)
> * 2) smp_cond_load_acquire(!X->on_cpu)
> *
> * Example:
> *
> * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
> *
> * LOCK rq(0)->lock LOCK X->pi_lock
> * dequeue X
> * sched-out X
> * smp_store_release(X->on_cpu, 0);
> *
> * smp_cond_load_acquire(&X->on_cpu, !VAL);
> * X->state = WAKING
> * set_task_cpu(X,2)
> *
> * LOCK rq(2)->lock
> * enqueue X
> * X->state = RUNNING
> * UNLOCK rq(2)->lock
> *
> * LOCK rq(2)->lock // orders against CPU1
> * sched-out Z
> * sched-in X
> * UNLOCK rq(2)->lock
> *
> * UNLOCK X->pi_lock
> * UNLOCK rq(0)->lock
>
> This is a chain mixed with lock and acquire-release (maybe even better?).
I added this one, though it might be outside of the scope of recipes.
Thanx, Paul
> And another example would be osq_{lock,unlock}() on multiple (more than
> three) CPUs.
>
> Regards,
> Boqun
>
> > Thoughts?
> >
> > 3. SB
> >
> > a. smp_mb(), as in lockless wait-wakeup coordination.
> > And as in sys_membarrier()-scheduler coordination,
> > for that matter.
> >
> > Examples seem to be lacking. Most cases use locking.
> > Here is one rather strange one from RCU:
> >
> > void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> > {
> > unsigned long flags;
> > bool needwake;
> > bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
> >
> > rhp->next = NULL;
> > rhp->func = func;
> > raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > needwake = !rcu_tasks_cbs_head;
> > *rcu_tasks_cbs_tail = rhp;
> > rcu_tasks_cbs_tail = &rhp->next;
> > raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > /* We can't create the thread unless interrupts are enabled. */
> > if ((needwake && havetask) ||
> > (!havetask && !irqs_disabled_flags(flags))) {
> > rcu_spawn_tasks_kthread();
> > wake_up(&rcu_tasks_cbs_wq);
> > }
> > }
> >
> > And for the wait side, using synchronize_sched() to supply
> > the barrier for both ends, with the preemption disabling
> > due to raw_spin_lock_irqsave() serving as the read-side
> > critical section:
> >
> > if (!list) {
> > wait_event_interruptible(rcu_tasks_cbs_wq,
> > rcu_tasks_cbs_head);
> > if (!rcu_tasks_cbs_head) {
> > WARN_ON(signal_pending(current));
> > schedule_timeout_interruptible(HZ/10);
> > }
> > continue;
> > }
> > synchronize_sched();
> >
> > -----------------
> >
> > Here is another one that uses atomic_cmpxchg() as a
> > full memory barrier:
> >
> > if (!wait_event_timeout(*wait, !atomic_read(stopping),
> > msecs_to_jiffies(1000))) {
> > atomic_set(stopping, 0);
> > smp_mb();
> > return -ETIMEDOUT;
> > }
> >
> > int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
> > atomic_t *stopping)
> > {
> > if (atomic_cmpxchg(stopping, 1, 0)) {
> > wake_up(wait);
> > return 1;
> > }
> >
> > return 0;
> > }
> >
* Re: Linux-kernel examples for LKMM recipes
2017-10-17 20:33 ` Paul E. McKenney
@ 2017-10-17 21:03 ` Alan Stern
2017-10-17 21:55 ` Paul E. McKenney
0 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-17 21:03 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> > On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> >
> > > How about this?
> > >
> > > 0. Simple special cases
> > >
> > > If there is only one CPU on the one hand or only one variable
> > > on the other, the code will execute in order. There are (as
> > > usual) some things to be careful of:
> > >
> > > a. There are some aspects of the C language that are
> > > unordered. For example, the compiler can output code
> > > computing arguments of a multi-parameter function in
> > > any order it likes, or even interleaved if it so chooses.
> >
> > That parses a little oddly. I wouldn't agree that the compiler outputs
> > the code in any order it likes!
>
> When was the last time you talked to a compiler writer? ;-)
>
> > In fact, I wouldn't even mention the compiler at all. Just say that
> > (with a few exceptions) the language doesn't specify the order in which
> > the arguments of a function or operation should be evaluated. For
> > example, in the expression "f(x) + g(y)", the order in which f and g
> > are called is not defined; the object code is allowed to use either
> > order or even to interleave the computations.
>
> Nevertheless, I took your suggestion:
>
> a. There are some aspects of the C language that are
> unordered. For example, in the expression "f(x) + g(y)",
> the order in which f and g are called is not defined;
> the object code is allowed to use either order or even
> to interleave the computations.
Good.
> > > b. Compilers are permitted to use the "as-if" rule.
> > > That is, a compiler can emit whatever code it likes,
> > > as long as the results appear just as if the compiler
> > > had followed all the relevant rules. To see this,
> > > compile with a high level of optimization and run
> > > the debugger on the resulting binary.
> >
> > You might omit the last sentence. Furthermore, if the accesses don't
> > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > if it had executed in order (even for a single variable!), and if you
> > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > it likes.
>
> Ah, I omitted an important qualifier:
>
> b. Compilers are permitted to use the "as-if" rule. That is,
> a compiler can emit whatever code it likes, as long as
> the results of a single-threaded execution appear just
> as if the compiler had followed all the relevant rules.
> To see this, compile with a high level of optimization
> and run the debugger on the resulting binary.
That's okay for the single-CPU case. I don't think it covers the
multiple-CPU single-variable case correctly, though. If you don't use
READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
and stores? And won't that potentially cause the end result to be
different from what you would get if the code had appeared to execute
in order?
> I have seen people (including kernel hackers) surprised by what optimizers
> do, so I would prefer that the last sentence remain.
>
> > > c. If there is only one variable but multiple CPUs, all
> > > accesses to that variable must be aligned and full sized.
> >
> > I would say that the variable is what needs to be aligned, not the
> > accesses. (Although, if the variable is aligned and all the accesses
> > are full sized, then they must necessarily be aligned as well.)
>
> I was thinking in terms of an unaligned 16-bit access to a 32-bit
> variable.
That wouldn't be full sized.
> But how about this?
>
> c. If there is only one variable but multiple CPUs, all
Extra "all". Otherwise okay.
> that variable must be properly aligned and all accesses
> to that variable must be full sized.
>
> > > Variables that straddle cachelines or pages void your
> > > full-ordering warranty, as do undersized accesses that
> > > load from or store to only part of the variable.
> >
> > How can a variable straddle pages without also straddling cache lines?
>
> Well, a variable -can- straddle cachelines without straddling pages,
> which justifies the "or". Furthermore, given that cacheline sizes have
> been growing, but pages are still 4KB, it is probably only a matter
> of time. ;-)
By that time, we'll probably be using 64-KB pages. Or even bigger!
Alan
* Re: Linux-kernel examples for LKMM recipes
2017-10-17 21:03 ` Alan Stern
@ 2017-10-17 21:55 ` Paul E. McKenney
2017-10-18 14:43 ` Alan Stern
0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-17 21:55 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Tue, Oct 17, 2017 at 05:03:01PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
>
> > On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> > > On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > >
> > > > How about this?
> > > >
> > > > 0. Simple special cases
> > > >
> > > > If there is only one CPU on the one hand or only one variable
> > > > on the other, the code will execute in order. There are (as
> > > > usual) some things to be careful of:
> > > >
> > > > a. There are some aspects of the C language that are
> > > > unordered. For example, the compiler can output code
> > > > computing arguments of a multi-parameter function in
> > > > any order it likes, or even interleaved if it so chooses.
> > >
> > > That parses a little oddly. I wouldn't agree that the compiler outputs
> > > the code in any order it likes!
> >
> > When was the last time you talked to a compiler writer? ;-)
> >
> > > In fact, I wouldn't even mention the compiler at all. Just say that
> > > (with a few exceptions) the language doesn't specify the order in which
> > > the arguments of a function or operation should be evaluated. For
> > > example, in the expression "f(x) + g(y)", the order in which f and g
> > > are called is not defined; the object code is allowed to use either
> > > order or even to interleave the computations.
> >
> > Nevertheless, I took your suggestion:
> >
> > a. There are some aspects of the C language that are
> > unordered. For example, in the expression "f(x) + g(y)",
> > the order in which f and g are called is not defined;
> > the object code is allowed to use either order or even
> > to interleave the computations.
>
> Good.
>
> > > > b. Compilers are permitted to use the "as-if" rule.
> > > > That is, a compiler can emit whatever code it likes,
> > > > as long as the results appear just as if the compiler
> > > > had followed all the relevant rules. To see this,
> > > > compile with a high level of optimization and run
> > > > the debugger on the resulting binary.
> > >
> > > You might omit the last sentence. Furthermore, if the accesses don't
> > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > if it had executed in order (even for a single variable!), and if you
> > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > it likes.
> >
> > Ah, I omitted an important qualifier:
> >
> > b. Compilers are permitted to use the "as-if" rule. That is,
> > a compiler can emit whatever code it likes, as long as
> > the results of a single-threaded execution appear just
> > as if the compiler had followed all the relevant rules.
> > To see this, compile with a high level of optimization
> > and run the debugger on the resulting binary.
>
> That's okay for the single-CPU case. I don't think it covers the
> multiple-CPU single-variable case correctly, though. If you don't use
> READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> and stores? And won't that potentially cause the end result to be
> different from what you would get if the code had appeared to execute
> in order?
Ah, good point, I need yet another qualifier. How about the following?
b. Compilers are permitted to use the "as-if" rule. That is,
a compiler can emit whatever code it likes for normal
accesses, as long as the results of a single-threaded
execution appear just as if the compiler had followed
all the relevant rules. To see this, compile with a
high level of optimization and run the debugger on the
resulting binary.
I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
and atomics. This, in conjunction with the previously added
"single-threaded execution" means that yes, the compiler is permitted
to tear normal loads and stores. The reason is that a single-threaded
run could not tell the difference. Interrupt handlers or multiple
threads are required to detect load/store tearing.
So, what am I still missing? ;-)
> > I have seen people (including kernel hackers) surprised by what optimizers
> > do, so I would prefer that the last sentence remain.
> >
> > > > c. If there is only one variable but multiple CPUs, all
> > > > accesses to that variable must be aligned and full sized.
> > >
> > > I would say that the variable is what needs to be aligned, not the
> > > accesses. (Although, if the variable is aligned and all the accesses
> > > are full sized, then they must necessarily be aligned as well.)
> >
> > I was thinking in terms of an unaligned 16-bit access to a 32-bit
> > variable.
>
> That wouldn't be full sized.
>
> > But how about this?
> >
> > c. If there is only one variable but multiple CPUs, all
>
> Extra "all". Otherwise okay.
Good catch, I removed the extra "all".
> > that variable must be properly aligned and all accesses
> > to that variable must be full sized.
> >
> > > > Variables that straddle cachelines or pages void your
> > > > full-ordering warranty, as do undersized accesses that
> > > > load from or store to only part of the variable.
> > >
> > > How can a variable straddle pages without also straddling cache lines?
> >
> > Well, a variable -can- straddle cachelines without straddling pages,
> > which justifies the "or". Furthermore, given that cacheline sizes have
> > been growing, but pages are still 4KB, it is probably only a matter
> > of time. ;-)
>
> By that time, we'll probably be using 64-KB pages. Or even bigger!
PowerPC's server builds have had a minimum page size of 64KB for quite
a few years. This helps in many cases, but of course hurts for those
occasional applications that insist on doing a pile of independent 4K
mappings. ;-)
So I would guess that the move from 4K pages to 64K (or whatever)
pages could be quite painful for some CPU families.
Thanx, Paul
* Re: Linux-kernel examples for LKMM recipes
2017-10-17 21:55 ` Paul E. McKenney
@ 2017-10-18 14:43 ` Alan Stern
2017-10-18 20:28 ` Paul E. McKenney
0 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-18 14:43 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > > > > b. Compilers are permitted to use the "as-if" rule.
> > > > > That is, a compiler can emit whatever code it likes,
> > > > > as long as the results appear just as if the compiler
> > > > > had followed all the relevant rules. To see this,
> > > > > compile with a high level of optimization and run
> > > > > the debugger on the resulting binary.
> > > >
> > > > You might omit the last sentence. Furthermore, if the accesses don't
> > > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > > if it had executed in order (even for a single variable!), and if you
> > > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > > it likes.
> > >
> > > Ah, I omitted an important qualifier:
> > >
> > > b. Compilers are permitted to use the "as-if" rule. That is,
> > > a compiler can emit whatever code it likes, as long as
> > > the results of a single-threaded execution appear just
> > > as if the compiler had followed all the relevant rules.
> > > To see this, compile with a high level of optimization
> > > and run the debugger on the resulting binary.
> >
> > That's okay for the single-CPU case. I don't think it covers the
> > multiple-CPU single-variable case correctly, though. If you don't use
> > READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> > and stores? And won't that potentially cause the end result to be
> > different from what you would get if the code had appeared to execute
> > in order?
>
> Ah, good point, I need yet another qualifier. How about the following?
>
> b. Compilers are permitted to use the "as-if" rule. That is,
> a compiler can emit whatever code it likes for normal
> accesses, as long as the results of a single-threaded
> execution appear just as if the compiler had followed
> all the relevant rules. To see this, compile with a
> high level of optimization and run the debugger on the
> resulting binary.
>
> I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
> and atomics. This, in conjunction with the previously added
> "single-threaded execution" means that yes, the compiler is permitted
> to tear normal loads and stores. The reason is that a single-threaded
> run could not tell the difference. Interrupt handlers or multiple
> threads are required to detect load/store tearing.
>
> So, what am I still missing? ;-)
Well, you could explicitly mention that in the multi-thread case, this
means all accesses to the shared variable had better use READ_ONCE() or
WRITE_ONCE().
Alan
* Re: Linux-kernel examples for LKMM recipes
2017-10-18 14:43 ` Alan Stern
@ 2017-10-18 20:28 ` Paul E. McKenney
2017-10-18 21:18 ` Alan Stern
0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-18 20:28 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Wed, Oct 18, 2017 at 10:43:42AM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
>
> > > > > > b. Compilers are permitted to use the "as-if" rule.
> > > > > > That is, a compiler can emit whatever code it likes,
> > > > > > as long as the results appear just as if the compiler
> > > > > > had followed all the relevant rules. To see this,
> > > > > > compile with a high level of optimization and run
> > > > > > the debugger on the resulting binary.
> > > > >
> > > > > You might omit the last sentence. Furthermore, if the accesses don't
> > > > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > > > if it had executed in order (even for a single variable!), and if you
> > > > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > > > it likes.
> > > >
> > > > Ah, I omitted an important qualifier:
> > > >
> > > > b. Compilers are permitted to use the "as-if" rule. That is,
> > > > a compiler can emit whatever code it likes, as long as
> > > > the results of a single-threaded execution appear just
> > > > as if the compiler had followed all the relevant rules.
> > > > To see this, compile with a high level of optimization
> > > > and run the debugger on the resulting binary.
> > >
> > > That's okay for the single-CPU case. I don't think it covers the
> > > multiple-CPU single-variable case correctly, though. If you don't use
> > > READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> > > and stores? And won't that potentially cause the end result to be
> > > different from what you would get if the code had appeared to execute
> > > in order?
> >
> > Ah, good point, I need yet another qualifier. How about the following?
> >
> > b. Compilers are permitted to use the "as-if" rule. That is,
> > a compiler can emit whatever code it likes for normal
> > accesses, as long as the results of a single-threaded
> > execution appear just as if the compiler had followed
> > all the relevant rules. To see this, compile with a
> > high level of optimization and run the debugger on the
> > resulting binary.
> >
> > I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
> > and atomics. This, in conjunction with the previously added
> > "single-threaded execution" means that yes, the compiler is permitted
> > to tear normal loads and stores. The reason is that a single-threaded
> > run could not tell the difference. Interrupt handlers or multiple
> > threads are required to detect load/store tearing.
> >
> > So, what am I still missing? ;-)
>
> Well, you could explicitly mention that in the multi-thread case, this
> means all accesses to the shared variable had better use READ_ONCE() or
> WRITE_ONCE().
Like this?
Thanx, Paul
------------------------------------------------------------------------
d. If there are multiple CPUs, accesses to shared variables
should use READ_ONCE() and WRITE_ONCE() or stronger
to prevent load/store tearing, load/store fusing, and
invented loads and stores. There are exceptions to
this rule, for example:
i. When there is no possibility of a given
shared variable being updated, for example,
while holding the update-side lock, reads
from that variable need not use READ_ONCE().
ii. When there is no possibility of a given shared
variable being either read or updated, for
example, when running during early boot, reads
from that variable need not use READ_ONCE() and
writes to that variable need not use WRITE_ONCE().
* Re: Linux-kernel examples for LKMM recipes
2017-10-18 20:28 ` Paul E. McKenney
@ 2017-10-18 21:18 ` Alan Stern
2017-10-18 22:57 ` Paul E. McKenney
0 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2017-10-18 21:18 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Wed, 18 Oct 2017, Paul E. McKenney wrote:
> > Well, you could explicitly mention that in the multi-thread case, this
> > means all accesses to the shared variable had better use READ_ONCE() or
> > WRITE_ONCE().
>
> Like this?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> d. If there are multiple CPUs, accesses to shared variables
> should use READ_ONCE() and WRITE_ONCE() or stronger
> to prevent load/store tearing, load/store fusing, and
> invented loads and stores. There are exceptions to
> this rule, for example:
>
> i. When there is no possibility of a given
> shared variable being updated, for example,
> while holding the update-side lock, reads
> from that variable need not use READ_ONCE().
>
> ii. When there is no possibility of a given shared
> variable being either read or updated, for
> example, when running during early boot, reads
> from that variable need not use READ_ONCE() and
> writes to that variable need not use WRITE_ONCE().
Yeah, except that you mean being read or updated by another thread.
Alan
* Re: Linux-kernel examples for LKMM recipes
2017-10-18 21:18 ` Alan Stern
@ 2017-10-18 22:57 ` Paul E. McKenney
0 siblings, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2017-10-18 22:57 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Will Deacon, peterz, boqun.feng, npiggin, dhowells,
Jade Alglave, Luc Maranget, Kernel development list
On Wed, Oct 18, 2017 at 05:18:37PM -0400, Alan Stern wrote:
> On Wed, 18 Oct 2017, Paul E. McKenney wrote:
>
> > > Well, you could explicitly mention that in the multi-thread case, this
> > > means all accesses to the shared variable had better use READ_ONCE() or
> > > WRITE_ONCE().
> >
> > Like this?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > d. If there are multiple CPUs, accesses to shared variables
> > should use READ_ONCE() and WRITE_ONCE() or stronger
> > to prevent load/store tearing, load/store fusing, and
> > invented loads and stores. There are exceptions to
> > this rule, for example:
> >
> > i. When there is no possibility of a given
> > shared variable being updated, for example,
> > while holding the update-side lock, reads
> > from that variable need not use READ_ONCE().
> >
> > ii. When there is no possibility of a given shared
> > variable being either read or updated, for
> > example, when running during early boot, reads
> > from that variable need not use READ_ONCE() and
> > writes to that variable need not use WRITE_ONCE().
>
> Yeah, except that you mean being read or updated by another thread.
Good point, added that qualifier.
Thanx, Paul