All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] sched: membarrier updates
@ 2020-08-14 16:43 Mathieu Desnoyers
  2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Andy Lutomirski,
	Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers

Hi,

Please find those three changes based on our recent discussions about
interaction between membarrier and exit_mm and kthread_use_mm. I added
documentation of memory ordering scenarios to membarrier.c as well.

Thanks,

Mathieu

Mathieu Desnoyers (3):
  sched: fix exit_mm vs membarrier (v2)
  sched: membarrier: cover kthread_use_mm (v2)
  sched: membarrier: document memory ordering scenarios

 kernel/exit.c             |   8 +++
 kernel/kthread.c          |  19 +++++++
 kernel/sched/idle.c       |   1 +
 kernel/sched/membarrier.c | 136 ++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 158 insertions(+), 6 deletions(-)

-- 
2.11.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
  2020-08-14 16:43 [RFC PATCH 0/3] sched: membarrier updates Mathieu Desnoyers
@ 2020-08-14 16:43 ` Mathieu Desnoyers
  2020-08-16 15:23   ` Boqun Feng
  2020-08-14 16:43 ` [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2) Mathieu Desnoyers
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Andy Lutomirski,
	Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers,
	Thomas Gleixner, Linus Torvalds, linux-mm

exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

The AFAIU we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: linux-mm@kvack.org
---
Changes since v1:
- Use smp_mb__after_spinlock rather than smp_mb.
- Document race scenario in commit message.
---
 kernel/exit.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..fe64e6e28dd5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -475,6 +475,14 @@ static void exit_mm(void)
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/*
+	 * When a thread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after accessing user-space memory, before
+	 * clearing tsk->mm.
+	 */
+	smp_mb__after_spinlock();
 	current->mm = NULL;
 	mmap_read_unlock(mm);
 	enter_lazy_tlb(mm, current);
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-14 16:43 [RFC PATCH 0/3] sched: membarrier updates Mathieu Desnoyers
  2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
@ 2020-08-14 16:43 ` Mathieu Desnoyers
  2020-08-16 15:29   ` Boqun Feng
  2020-08-14 16:43 ` [RFC PATCH 3/3] sched: membarrier: document memory ordering scenarios Mathieu Desnoyers
       [not found] ` <20200816070932.10752-1-hdanton@sina.com>
  3 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Andy Lutomirski,
	Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers

Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread use this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Changes since v1:
- Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
- Use smp_mb__after_spinlock rather than smp_mb after task_lock.
---
 kernel/kthread.c          | 19 +++++++++++++++++++
 kernel/sched/idle.c       |  1 +
 kernel/sched/membarrier.c |  8 ++------
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3edaa380dc7b..77aaaa7bc8d9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
 	finish_arch_post_lock_switch();
 #endif
 
+	/*
+	 * When a kthread starts operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after storing to tsk->mm, before accessing
+	 * user-space memory. A full memory barrier for membarrier
+	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+	 * mmdrop(), or explicitly with smp_mb().
+	 */
 	if (active_mm != mm)
 		mmdrop(active_mm);
+	else
+		smp_mb();
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	force_uaccess_end(to_kthread(tsk)->oldfs);
 
 	task_lock(tsk);
+	/*
+	 * When a kthread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after accessing user-space memory, before
+	 * clearing tsk->mm.
+	 */
+	smp_mb__after_spinlock();
 	sync_mm_rss(mm);
 	local_irq_disable();
 	tsk->mm = NULL;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6bf34986f45c..3443ee8335d0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
 	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
 	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
 	WARN_ON_ONCE(!duration_ns);
+	WARN_ON_ONCE(current->mm);
 
 	rcu_sleep_check();
 	preempt_disable();
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 168479a7d61b..8a294483074d 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
 		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
 			continue;
 
-		/*
-		 * Skip the CPU if it runs a kernel thread. The scheduler
-		 * leaves the prior task mm in place as an optimization when
-		 * scheduling a kthread.
-		 */
+		/* Skip the CPU if it runs the idle thread. */
 		p = rcu_dereference(cpu_rq(cpu)->curr);
-		if (p->flags & PF_KTHREAD)
+		if (is_idle_task(p))
 			continue;
 
 		__cpumask_set_cpu(cpu, tmpmask);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 3/3] sched: membarrier: document memory ordering scenarios
  2020-08-14 16:43 [RFC PATCH 0/3] sched: membarrier updates Mathieu Desnoyers
  2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
  2020-08-14 16:43 ` [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2) Mathieu Desnoyers
@ 2020-08-14 16:43 ` Mathieu Desnoyers
       [not found] ` <20200816070932.10752-1-hdanton@sina.com>
  3 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Andy Lutomirski,
	Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers

Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
Stern for refreshing my memory. Now that I have those in mind, it seems
appropriate to serialize them to comments for posterity.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 kernel/sched/membarrier.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 8a294483074d..103f5edb8ba5 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,6 +7,134 @@
 #include "sched.h"
 
 /*
+ * For documentation purposes, here are some membarrier ordering
+ * scenarios to keep in mind:
+ *
+ * A) Userspace thread execution after IPI vs membarrier's memory
+ *    barrier before sending the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the start of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU0 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU1 after the IPI-induced memory barrier:
+ *
+ *         CPU0                              CPU1
+ *
+ *         x = 1
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r2 = y
+ *                                           y = 1
+ *                                           barrier()
+ *                                           r1 = x
+ *
+ *                     BUG_ON(r1 == 0 && r2 == 0)
+ *
+ * The write to y and load from x by CPU1 are unordered by the hardware,
+ * so it's possible to have "r1 = x" reordered before "y = 1" at any
+ * point after (b).  If the memory barrier at (a) is omitted, then "x = 1"
+ * can be reordered after (a) (although not after (c)), so we get r1 == 0
+ * and r2 == 0.  This violates the guarantee that membarrier() is
+ * supposed by provide.
+ *
+ * The timing of the memory barrier at (a) has to ensure that it executes
+ * before the IPI-induced memory barrier on CPU1.
+ *
+ * B) Userspace thread execution before IPI vs membarrier's memory
+ *    barrier after completing the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the end of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU1 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU0 after the membarrier():
+ *
+ *         CPU0                              CPU1
+ *
+ *                                           x = 1
+ *                                           barrier()
+ *                                           y = 1
+ *         r2 = y
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r1 = x
+ *         BUG_ON(r1 == 0 && r2 == 1)
+ *
+ * The writes to x and y are unordered by the hardware, so it's possible to
+ * have "r2 = 1" even though the write to x doesn't execute until (b).  If
+ * the memory barrier at (c) is omitted then "r1 = x" can be reordered
+ * before (b) (although not before (a)), so we get "r1 = 0".  This violates
+ * the guarantee that membarrier() is supposed to provide.
+ *
+ * The timing of the memory barrier at (c) has to ensure that it executes
+ * after the IPI-induced memory barrier on CPU1.
+ *
+ * C) Scheduling userspace thread -> kthread -> userspace thread vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           d: switch to kthread (includes mb)
+ *           b: read rq->curr->mm == NULL
+ *                                           e: switch to user (includes mb)
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (e). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ *
+ * D) exit_mm vs membarrier
+ *
+ * Two thread groups are created, A and B.  Thread group B is created by
+ * issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
+ * Let's assume we have a single thread within each thread group (Thread A
+ * and Thread B).  Thread A runs on CPU0, Thread B runs on CPU1.
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *             a: smp_mb()
+ *                                           exit_mm():
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *             b: read rq->curr->mm == NULL
+ *             c: smp_mb()
+ *
+ * Using scenario (B), we can show that (c) needs to be paired with (d).
+ *
+ * E) kthread_{use,unuse}_mm vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           kthread_unuse_mm()
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *           b: read rq->curr->mm == NULL
+ *                                           kthread_use_mm()
+ *                                             f: current->mm = mm
+ *                                             g: smp_mb()
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (g). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ */
+
+/*
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
  2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
@ 2020-08-16 15:23   ` Boqun Feng
  2020-09-24 15:01     ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2020-08-16 15:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, Paul E . McKenney,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin,
	Thomas Gleixner, Linus Torvalds, linux-mm

Hi Mathieu,

On Fri, Aug 14, 2020 at 12:43:56PM -0400, Mathieu Desnoyers wrote:
> exit_mm should issue memory barriers after user-space memory accesses,
> before clearing current->mm, to order user-space memory accesses
> performed prior to exit_mm before clearing tsk->mm, which has the
> effect of skipping the membarrier private expedited IPIs.
> 
> The membarrier system call can be issued concurrently with do_exit
> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
> 
> Here is the scenario I have in mind:
> 
> Two thread groups are created, A and B. Thread group B is created by
> issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
> Let's assume we have a single thread within each thread group (Thread A
> and Thread B).
> 
> The AFAIU we can have:
> 
> Userspace variables:
> 
> int x = 0, y = 0;
> 
> CPU 0                   CPU 1
> Thread A                Thread B
> (in thread group A)     (in thread group B)
> 
> x = 1
> barrier()
> y = 1
> exit()
> exit_mm()
> current->mm = NULL;
>                         r1 = load y
>                         membarrier()
>                           skips CPU 0 (no IPI) because its current mm is NULL
>                         r2 = load x
>                         BUG_ON(r1 == 1 && r2 == 0)
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: linux-mm@kvack.org
> ---
> Changes since v1:
> - Use smp_mb__after_spinlock rather than smp_mb.
> - Document race scenario in commit message.
> ---
>  kernel/exit.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 733e80f334e7..fe64e6e28dd5 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -475,6 +475,14 @@ static void exit_mm(void)
>  	BUG_ON(mm != current->active_mm);
>  	/* more a memory barrier than a real lock */
>  	task_lock(current);
> +	/*
> +	 * When a thread stops operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe

Is it accurate to say that the correctness of
membarrier_global_expedited() relies on the observation of ->mm? Because
IIUC membarrier_global_expedited() loop doesn't check ->mm.

Regards,
Boqun

> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after accessing user-space memory, before
> +	 * clearing tsk->mm.
> +	 */
> +	smp_mb__after_spinlock();
>  	current->mm = NULL;
>  	mmap_read_unlock(mm);
>  	enter_lazy_tlb(mm, current);
> -- 
> 2.11.0
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-14 16:43 ` [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2) Mathieu Desnoyers
@ 2020-08-16 15:29   ` Boqun Feng
  2020-08-24 15:27     ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2020-08-16 15:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, Paul E . McKenney,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
> to allow the effect of membarrier(2) to apply to kthreads accessing
> user-space memory as well.
> 
> Given that no prior kthread use this guarantee and that it only affects
> kthreads, adding this guarantee does not affect user-space ABI.
> 
> Refine the check in membarrier_global_expedited to exclude runqueues
> running the idle thread rather than all kthreads from the IPI cpumask.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> Changes since v1:
> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
> ---
>  kernel/kthread.c          | 19 +++++++++++++++++++
>  kernel/sched/idle.c       |  1 +
>  kernel/sched/membarrier.c |  8 ++------
>  3 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3edaa380dc7b..77aaaa7bc8d9 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
>  	finish_arch_post_lock_switch();
>  #endif
>  
> +	/*
> +	 * When a kthread starts operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe
> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after storing to tsk->mm, before accessing
> +	 * user-space memory. A full memory barrier for membarrier
> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> +	 * mmdrop(), or explicitly with smp_mb().
> +	 */
>  	if (active_mm != mm)
>  		mmdrop(active_mm);
> +	else
> +		smp_mb();

Similar question here: could smp_mb() guarantee the correctness of
GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
kthread_unuse_mm(), too?

Am I miss something here?

Regards,
Boqun

>  
>  	to_kthread(tsk)->oldfs = force_uaccess_begin();
>  }
> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
>  	force_uaccess_end(to_kthread(tsk)->oldfs);
>  
>  	task_lock(tsk);
> +	/*
> +	 * When a kthread stops operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe
> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after accessing user-space memory, before
> +	 * clearing tsk->mm.
> +	 */
> +	smp_mb__after_spinlock();
>  	sync_mm_rss(mm);
>  	local_irq_disable();
>  	tsk->mm = NULL;
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 6bf34986f45c..3443ee8335d0 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>  	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
>  	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
>  	WARN_ON_ONCE(!duration_ns);
> +	WARN_ON_ONCE(current->mm);
>  
>  	rcu_sleep_check();
>  	preempt_disable();
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 168479a7d61b..8a294483074d 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
>  		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
>  			continue;
>  
> -		/*
> -		 * Skip the CPU if it runs a kernel thread. The scheduler
> -		 * leaves the prior task mm in place as an optimization when
> -		 * scheduling a kthread.
> -		 */
> +		/* Skip the CPU if it runs the idle thread. */
>  		p = rcu_dereference(cpu_rq(cpu)->curr);
> -		if (p->flags & PF_KTHREAD)
> +		if (is_idle_task(p))
>  			continue;
>  
>  		__cpumask_set_cpu(cpu, tmpmask);
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
       [not found] ` <20200816070932.10752-1-hdanton@sina.com>
@ 2020-08-16 21:05   ` Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-16 21:05 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

----- On Aug 16, 2020, at 3:09 AM, Hillf Danton hdanton@sina.com wrote:

> On Fri, 14 Aug 2020 12:43:57 -0400 Mathieu Desnoyers wrote:
>> 
>> Given that no prior kthread use this guarantee and that it only affects
>> kthreads, adding this guarantee does not affect user-space ABI.
> 
> Can you expand a bit on why kthreads like ksoftirqd have to ack the
> IPIs from Dave who's not CAP_SYS_ADMIN.

Do ksoftirqd kthreads ever use kthread_use_mm() to access user
processes' memory ? If not, then they won't be disturbed by any IPI.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-16 15:29   ` Boqun Feng
@ 2020-08-24 15:27     ` Mathieu Desnoyers
  2020-08-25  2:06       ` Boqun Feng
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-24 15:27 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:

> On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
>> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
>> to allow the effect of membarrier(2) to apply to kthreads accessing
>> user-space memory as well.
>> 
>> Given that no prior kthread use this guarantee and that it only affects
>> kthreads, adding this guarantee does not affect user-space ABI.
>> 
>> Refine the check in membarrier_global_expedited to exclude runqueues
>> running the idle thread rather than all kthreads from the IPI cpumask.
>> 
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>> Changes since v1:
>> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
>> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
>> ---
>>  kernel/kthread.c          | 19 +++++++++++++++++++
>>  kernel/sched/idle.c       |  1 +
>>  kernel/sched/membarrier.c |  8 ++------
>>  3 files changed, 22 insertions(+), 6 deletions(-)
>> 
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 3edaa380dc7b..77aaaa7bc8d9 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
>>  	finish_arch_post_lock_switch();
>>  #endif
>>  
>> +	/*
>> +	 * When a kthread starts operating on an address space, the loop
>> +	 * in membarrier_{private,global}_expedited() may not observe
>> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> +	 * memory barrier after storing to tsk->mm, before accessing
>> +	 * user-space memory. A full memory barrier for membarrier
>> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
>> +	 * mmdrop(), or explicitly with smp_mb().
>> +	 */
>>  	if (active_mm != mm)
>>  		mmdrop(active_mm);
>> +	else
>> +		smp_mb();
> 
> Similar question here: could smp_mb() guarantee the correctness of
> GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
> kthread_unuse_mm(), too?
> 
> Am I miss something here?

I think you have a good point there. Which brings me to wonder why we
don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
This means an IPI can be sent to a kthread even if it does not use an
mm, just because the membarrier state in the runqueue is that of the
active_mm.

Thoughts ?

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>>  
>>  	to_kthread(tsk)->oldfs = force_uaccess_begin();
>>  }
>> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
>>  	force_uaccess_end(to_kthread(tsk)->oldfs);
>>  
>>  	task_lock(tsk);
>> +	/*
>> +	 * When a kthread stops operating on an address space, the loop
>> +	 * in membarrier_{private,global}_expedited() may not observe
>> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> +	 * memory barrier after accessing user-space memory, before
>> +	 * clearing tsk->mm.
>> +	 */
>> +	smp_mb__after_spinlock();
>>  	sync_mm_rss(mm);
>>  	local_irq_disable();
>>  	tsk->mm = NULL;
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 6bf34986f45c..3443ee8335d0 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>>  	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
>>  	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
>>  	WARN_ON_ONCE(!duration_ns);
>> +	WARN_ON_ONCE(current->mm);
>>  
>>  	rcu_sleep_check();
>>  	preempt_disable();
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index 168479a7d61b..8a294483074d 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
>>  		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
>>  			continue;
>>  
>> -		/*
>> -		 * Skip the CPU if it runs a kernel thread. The scheduler
>> -		 * leaves the prior task mm in place as an optimization when
>> -		 * scheduling a kthread.
>> -		 */
>> +		/* Skip the CPU if it runs the idle thread. */
>>  		p = rcu_dereference(cpu_rq(cpu)->curr);
>> -		if (p->flags & PF_KTHREAD)
>> +		if (is_idle_task(p))
>>  			continue;
>>  
>>  		__cpumask_set_cpu(cpu, tmpmask);
>> --
>> 2.11.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-24 15:27     ` Mathieu Desnoyers
@ 2020-08-25  2:06       ` Boqun Feng
  2020-09-24 15:26         ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2020-08-25  2:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

On Mon, Aug 24, 2020 at 11:27:49AM -0400, Mathieu Desnoyers wrote:
> ----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:
> 
> > On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
> >> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
> >> to allow the effect of membarrier(2) to apply to kthreads accessing
> >> user-space memory as well.
> >> 
> >> Given that no prior kthread use this guarantee and that it only affects
> >> kthreads, adding this guarantee does not affect user-space ABI.
> >> 
> >> Refine the check in membarrier_global_expedited to exclude runqueues
> >> running the idle thread rather than all kthreads from the IPI cpumask.
> >> 
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Paul E. McKenney <paulmck@kernel.org>
> >> Cc: Nicholas Piggin <npiggin@gmail.com>
> >> Cc: Andy Lutomirski <luto@amacapital.net>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> ---
> >> Changes since v1:
> >> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
> >> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
> >> ---
> >>  kernel/kthread.c          | 19 +++++++++++++++++++
> >>  kernel/sched/idle.c       |  1 +
> >>  kernel/sched/membarrier.c |  8 ++------
> >>  3 files changed, 22 insertions(+), 6 deletions(-)
> >> 
> >> diff --git a/kernel/kthread.c b/kernel/kthread.c
> >> index 3edaa380dc7b..77aaaa7bc8d9 100644
> >> --- a/kernel/kthread.c
> >> +++ b/kernel/kthread.c
> >> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
> >>  	finish_arch_post_lock_switch();
> >>  #endif
> >>  
> >> +	/*
> >> +	 * When a kthread starts operating on an address space, the loop
> >> +	 * in membarrier_{private,global}_expedited() may not observe
> >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> >> +	 * memory barrier after storing to tsk->mm, before accessing
> >> +	 * user-space memory. A full memory barrier for membarrier
> >> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> >> +	 * mmdrop(), or explicitly with smp_mb().
> >> +	 */
> >>  	if (active_mm != mm)
> >>  		mmdrop(active_mm);
> >> +	else
> >> +		smp_mb();
> > 
> > Similar question here: could smp_mb() guarantee the correctness of
> > GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
> > kthread_unuse_mm(), too?
> > 
> > Am I miss something here?
> 
> I think you have a good point there. Which brings me to wonder why we
> don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
> This means an IPI can be sent to a kthread even if it does not use an
> mm, just because the membarrier state in the runqueue is that of the
> active_mm.
> 
> Thoughts ?
> 

Right, I think we should also handle the percpu membarrier_state. The
basic rule is whenever we change current->mm or current (i.e. rq->curr)
itself, we need to update the percpu membarrier_state accordingly.

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 
> > 
> > Regards,
> > Boqun
> > 
> >>  
> >>  	to_kthread(tsk)->oldfs = force_uaccess_begin();
> >>  }
> >> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
> >>  	force_uaccess_end(to_kthread(tsk)->oldfs);
> >>  
> >>  	task_lock(tsk);
> >> +	/*
> >> +	 * When a kthread stops operating on an address space, the loop
> >> +	 * in membarrier_{private,global}_expedited() may not observe
> >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> >> +	 * memory barrier after accessing user-space memory, before
> >> +	 * clearing tsk->mm.
> >> +	 */
> >> +	smp_mb__after_spinlock();
> >>  	sync_mm_rss(mm);
> >>  	local_irq_disable();
> >>  	tsk->mm = NULL;
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index 6bf34986f45c..3443ee8335d0 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
> >>  	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> >>  	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> >>  	WARN_ON_ONCE(!duration_ns);
> >> +	WARN_ON_ONCE(current->mm);
> >>  
> >>  	rcu_sleep_check();
> >>  	preempt_disable();
> >> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> >> index 168479a7d61b..8a294483074d 100644
> >> --- a/kernel/sched/membarrier.c
> >> +++ b/kernel/sched/membarrier.c
> >> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
> >>  		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
> >>  			continue;
> >>  
> >> -		/*
> >> -		 * Skip the CPU if it runs a kernel thread. The scheduler
> >> -		 * leaves the prior task mm in place as an optimization when
> >> -		 * scheduling a kthread.
> >> -		 */
> >> +		/* Skip the CPU if it runs the idle thread. */
> >>  		p = rcu_dereference(cpu_rq(cpu)->curr);
> >> -		if (p->flags & PF_KTHREAD)
> >> +		if (is_idle_task(p))
> >>  			continue;
> >>  
> >>  		__cpumask_set_cpu(cpu, tmpmask);
> >> --
> >> 2.11.0
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
  2020-08-16 15:23   ` Boqun Feng
@ 2020-09-24 15:01     ` Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-09-24 15:01 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin,
	Thomas Gleixner, Linus Torvalds, linux-mm

----- On Aug 16, 2020, at 11:23 AM, Boqun Feng boqun.feng@gmail.com wrote:

> Hi Mathieu,
> 
> On Fri, Aug 14, 2020 at 12:43:56PM -0400, Mathieu Desnoyers wrote:
>> exit_mm should issue memory barriers after user-space memory accesses,
>> before clearing current->mm, to order user-space memory accesses
>> performed prior to exit_mm before clearing tsk->mm, which has the
>> effect of skipping the membarrier private expedited IPIs.
>> 
>> The membarrier system call can be issued concurrently with do_exit
>> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
>> 
>> Here is the scenario I have in mind:
>> 
>> Two thread groups are created, A and B. Thread group B is created by
>> issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
>> Let's assume we have a single thread within each thread group (Thread A
>> and Thread B).
>> 
>> The AFAIU we can have:
>> 
>> Userspace variables:
>> 
>> int x = 0, y = 0;
>> 
>> CPU 0                   CPU 1
>> Thread A                Thread B
>> (in thread group A)     (in thread group B)
>> 
>> x = 1
>> barrier()
>> y = 1
>> exit()
>> exit_mm()
>> current->mm = NULL;
>>                         r1 = load y
>>                         membarrier()
>>                           skips CPU 0 (no IPI) because its current mm is NULL
>>                         r2 = load x
>>                         BUG_ON(r1 == 1 && r2 == 0)
>> 
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Alan Stern <stern@rowland.harvard.edu>
>> Cc: linux-mm@kvack.org
>> ---
>> Changes since v1:
>> - Use smp_mb__after_spinlock rather than smp_mb.
>> - Document race scenario in commit message.
>> ---
>>  kernel/exit.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>> 
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 733e80f334e7..fe64e6e28dd5 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -475,6 +475,14 @@ static void exit_mm(void)
>>  	BUG_ON(mm != current->active_mm);
>>  	/* more a memory barrier than a real lock */
>>  	task_lock(current);
>> +	/*
>> +	 * When a thread stops operating on an address space, the loop
>> +	 * in membarrier_{private,global}_expedited() may not observe
> 
> Is it accurate to say that the correctness of
> membarrier_global_expedited() relies on the observation of ->mm? Because
> IIUC membarrier_global_expedited() loop doesn't check ->mm.

Good point, I was wrong. Will instead reword as:

        /*
         * When a thread stops operating on an address space, the loop
         * in membarrier_private_expedited() may not observe that
         * tsk->mm, and the loop in membarrier_global_expedited() may
         * not observe a MEMBARRIER_STATE_GLOBAL_EXPEDITED
         * rq->membarrier_state, so those would not issue an IPI.
         * Membarrier requires a memory barrier after accessing
         * user-space memory, before clearing tsk->mm or the
         * rq->membarrier_state.
         */

And I'll make sure exit_mm clears this_rq()->membarrier_state as well.

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> +	 * memory barrier after accessing user-space memory, before
>> +	 * clearing tsk->mm.
>> +	 */
>> +	smp_mb__after_spinlock();
>>  	current->mm = NULL;
>>  	mmap_read_unlock(mm);
>>  	enter_lazy_tlb(mm, current);
>> --
>> 2.11.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-25  2:06       ` Boqun Feng
@ 2020-09-24 15:26         ` Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-09-24 15:26 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

----- On Aug 24, 2020, at 10:06 PM, Boqun Feng boqun.feng@gmail.com wrote:

> On Mon, Aug 24, 2020 at 11:27:49AM -0400, Mathieu Desnoyers wrote:
>> ----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:
>> 
>> > On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
>> >> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
>> >> to allow the effect of membarrier(2) to apply to kthreads accessing
>> >> user-space memory as well.
>> >> 
>> >> Given that no prior kthread use this guarantee and that it only affects
>> >> kthreads, adding this guarantee does not affect user-space ABI.
>> >> 
>> >> Refine the check in membarrier_global_expedited to exclude runqueues
>> >> running the idle thread rather than all kthreads from the IPI cpumask.
>> >> 
>> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> >> Cc: Will Deacon <will@kernel.org>
>> >> Cc: Paul E. McKenney <paulmck@kernel.org>
>> >> Cc: Nicholas Piggin <npiggin@gmail.com>
>> >> Cc: Andy Lutomirski <luto@amacapital.net>
>> >> Cc: Andrew Morton <akpm@linux-foundation.org>
>> >> ---
>> >> Changes since v1:
>> >> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
>> >> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
>> >> ---
>> >>  kernel/kthread.c          | 19 +++++++++++++++++++
>> >>  kernel/sched/idle.c       |  1 +
>> >>  kernel/sched/membarrier.c |  8 ++------
>> >>  3 files changed, 22 insertions(+), 6 deletions(-)
>> >> 
>> >> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> >> index 3edaa380dc7b..77aaaa7bc8d9 100644
>> >> --- a/kernel/kthread.c
>> >> +++ b/kernel/kthread.c
>> >> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
>> >>  	finish_arch_post_lock_switch();
>> >>  #endif
>> >>  
>> >> +	/*
>> >> +	 * When a kthread starts operating on an address space, the loop
>> >> +	 * in membarrier_{private,global}_expedited() may not observe
>> >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> >> +	 * memory barrier after storing to tsk->mm, before accessing
>> >> +	 * user-space memory. A full memory barrier for membarrier
>> >> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
>> >> +	 * mmdrop(), or explicitly with smp_mb().
>> >> +	 */
>> >>  	if (active_mm != mm)
>> >>  		mmdrop(active_mm);
>> >> +	else
>> >> +		smp_mb();
>> > 
>> > Similar question here: could smp_mb() guarantee the correctness of
>> > GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
>> > kthread_unuse_mm(), too?
>> > 
>> > Am I miss something here?
>> 
>> I think you have a good point there. Which brings me to wonder why we
>> don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
>> This means an IPI can be sent to a kthread even if it does not use an
>> mm, just because the membarrier state in the runqueue is that of the
>> active_mm.
>> 
>> Thoughts ?
>> 
> 
> Right, I think we should also handle the percpu membarrier_state. The
> basic rule is whenever we change current->mm or current (i.e. rq->curr)
> itself, we need to update the percpu membarrier_state accordingly.

OK, so as we introduce IPIs to kthreads which are using kthread_use_mm, we need to
reconsider how the scheduler deals with runqueues entering lazy TLB state. Currently,
membarrier_switch_mm() is not issued when entering lazy TLB state. But as we
start considering kthreads as candidates for sending IPIs, we need to update
the rq->membarrier_state whenever we enter lazy TLB state as well.

So I plan to do this change to cover this:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84758f34cdb0..44521dc5602a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3736,6 +3736,8 @@ context_switch(struct rq *rq, struct task_struct *prev,
         */
        arch_start_context_switch(prev);
 
+       membarrier_switch_mm(rq, prev->mm, next->mm);
+
        /*
         * kernel -> kernel   lazy + transfer active
         *   user -> kernel   lazy + mmgrab() active
@@ -3752,7 +3754,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
                else
                        prev->active_mm = NULL;
        } else {                                        // to user
-               membarrier_switch_mm(rq, prev->active_mm, next->mm);
                /*
                 * sys_membarrier() requires an smp_mb() between setting
                 * rq->curr / membarrier_switch_mm() and returning to userspace.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3fd283892761..481149066086 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2592,12 +2592,13 @@ static inline void membarrier_switch_mm(struct rq *rq,
                                        struct mm_struct *prev_mm,
                                        struct mm_struct *next_mm)
 {
-       int membarrier_state;
+       int membarrier_state = 0;
 
        if (prev_mm == next_mm)
                return;
 
-       membarrier_state = atomic_read(&next_mm->membarrier_state);
+       if (next_mm)
+               membarrier_state = atomic_read(&next_mm->membarrier_state);
        if (READ_ONCE(rq->membarrier_state) == membarrier_state)
                return;

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-09-24 15:26 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-14 16:43 [RFC PATCH 0/3] sched: membarrier updates Mathieu Desnoyers
2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
2020-08-16 15:23   ` Boqun Feng
2020-09-24 15:01     ` Mathieu Desnoyers
2020-08-14 16:43 ` [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2) Mathieu Desnoyers
2020-08-16 15:29   ` Boqun Feng
2020-08-24 15:27     ` Mathieu Desnoyers
2020-08-25  2:06       ` Boqun Feng
2020-09-24 15:26         ` Mathieu Desnoyers
2020-08-14 16:43 ` [RFC PATCH 3/3] sched: membarrier: document memory ordering scenarios Mathieu Desnoyers
     [not found] ` <20200816070932.10752-1-hdanton@sina.com>
2020-08-16 21:05   ` [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2) Mathieu Desnoyers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.