From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-mm <linux-mm@kvack.org>,
Nicholas Piggin <npiggin@gmail.com>,
Anton Blanchard <anton@ozlabs.org>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Paul Mackerras <paulus@ozlabs.org>,
Randy Dunlap <rdunlap@infradead.org>,
linux-arch <linux-arch@vger.kernel.org>, x86 <x86@kernel.org>,
riel <riel@surriel.com>, Dave Hansen <dave.hansen@intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Nadav Amit <nadav.amit@gmail.com>
Subject: Re: [PATCH 04/23] membarrier: Make the post-switch-mm barrier explicit
Date: Wed, 12 Jan 2022 10:52:54 -0500 (EST)
Message-ID: <1992660992.24733.1642002774380.JavaMail.zimbra@efficios.com>
In-Reply-To: <c1bc25b895213921cf36f4ee4ba07c1791fb631e.1641659630.git.luto@kernel.org>
----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:
> membarrier() needs a barrier after any CPU changes mm. There is currently
> a comment explaining why this barrier probably exists in all cases. The
> logic is based on ensuring that the barrier exists on every control flow
> path through the scheduler. It also relies on mmgrab() and mmdrop() being
> full barriers.
>
> mmgrab() and mmdrop() would be better if they were not full barriers. As a
> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> could use a release on architectures that have these operations. Larger
> optimizations are also in the works. Doing any of these optimizations
> while preserving an unnecessary barrier will complicate the code and
> penalize non-membarrier-using tasks.
>
> Simplify the logic by adding an explicit barrier, and allow architectures
> to override it as an optimization if they want to.
>
> One of the deleted comments in this patch said "It is therefore
> possible to schedule between user->kernel->user threads without
> passing through switch_mm()". It is possible to do this without, say,
> writing to CR3 on x86, but the core scheduler indeed calls
> switch_mm_irqs_off() to tell the arch code to go back from lazy mode
> to no-lazy mode.
>
> The membarrier_finish_switch_mm() call in exec_mmap() is a no-op so long as
> there is no way for a newly execed program to register for membarrier prior
> to running user code. Subsequent patches will merge the exec_mmap() code
> with the kthread_use_mm() code, though, and keeping the paths consistent
> will make the result more comprehensible.
I don't agree with the approach here. Adding memory barrier overhead for the
sake of possible future optimization work is IMHO not an appropriate
justification for a performance regression.

However, I think we can allow forward progress on optimizing mmgrab()/mmdrop()
without hurting performance.

One possible approach would be to introduce smp_mb__{before,after}_{mmgrab,mmdrop},
which would initially be no-ops. Membarrier could use those without introducing
any overhead, and the implicit barriers could then be moved gradually from
mmgrab()/mmdrop() to smp_mb__{before,after}_{mmgrab,mmdrop} on a
per-architecture basis.
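Something along these lines (untested sketch; the helper names, which of the
four helpers end up being needed, and the finish_task_switch() placement are
only illustrative):

/* include/linux/sched/mm.h -- sketch */

/*
 * mmgrab()/mmdrop() are currently full barriers (atomic_inc() /
 * atomic_dec_and_test()), so these helpers start out as no-ops on every
 * architecture and membarrier keeps its current overhead.  An architecture
 * that later relaxes mmgrab()/mmdrop() would define the helpers it needs as
 * real barriers, e.g.:
 *
 *	#define smp_mb__after_mmdrop()	smp_mb()
 */
#ifndef smp_mb__after_mmgrab
#define smp_mb__after_mmgrab()		do { } while (0)
#endif
#ifndef smp_mb__after_mmdrop
#define smp_mb__after_mmdrop()		do { } while (0)
#endif

/* kernel/sched/core.c, finish_task_switch() -- paraphrased call site */

	if (mm) {
		membarrier_mm_sync_core_before_usermode(mm);
		mmdrop(mm);
		/*
		 * membarrier requires a full barrier after rq->curr has been
		 * updated and before returning to user-space.  This stays
		 * free as long as mmdrop() is a full barrier, and only turns
		 * into a real smp_mb() on architectures that relax mmdrop().
		 */
		smp_mb__after_mmdrop();
	}

That way the scheduler fast path only pays for an explicit barrier on
architectures that have actually removed it from mmgrab()/mmdrop().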
Thanks,
Mathieu
>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> fs/exec.c | 1 +
> include/linux/sched/mm.h | 18 ++++++++++++++++++
> kernel/kthread.c | 12 +-----------
> kernel/sched/core.c | 34 +++++++++-------------------------
> 4 files changed, 29 insertions(+), 36 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..3abbd0294e73 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1019,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
> activate_mm(active_mm, mm);
> if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> local_irq_enable();
> + membarrier_finish_switch_mm(mm);
> tsk->mm->vmacache_seqnum = 0;
> vmacache_flush(tsk);
> task_unlock(tsk);
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 0df706c099e5..e8919995d8dd 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -349,6 +349,20 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
>
> extern void membarrier_update_current_mm(struct mm_struct *next_mm);
>
> +/*
> + * Called by the core scheduler after calling switch_mm_irqs_off().
> + * Architectures that have implicit barriers when switching mms can
> + * override this as an optimization.
> + */
> +#ifndef membarrier_finish_switch_mm
> +static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
> +{
> + if (atomic_read(&mm->membarrier_state) &
> + (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
> + smp_mb();
> +}
> +#endif
> +
> #else
> static inline void membarrier_exec_mmap(struct mm_struct *mm)
> {
> @@ -356,6 +370,10 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm)
> static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
> {
> }
> +static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
> +{
> +}
> +
> #endif
>
> #endif /* _LINUX_SCHED_MM_H */
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 5b37a8567168..396ae78a1a34 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1361,25 +1361,15 @@ void kthread_use_mm(struct mm_struct *mm)
> tsk->mm = mm;
> membarrier_update_current_mm(mm);
> switch_mm_irqs_off(active_mm, mm, tsk);
> + membarrier_finish_switch_mm(mm);
> local_irq_enable();
> task_unlock(tsk);
> #ifdef finish_arch_post_lock_switch
> finish_arch_post_lock_switch();
> #endif
>
> - /*
> - * When a kthread starts operating on an address space, the loop
> - * in membarrier_{private,global}_expedited() may not observe
> - * that tsk->mm, and not issue an IPI. Membarrier requires a
> - * memory barrier after storing to tsk->mm, before accessing
> - * user-space memory. A full memory barrier for membarrier
> - * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> - * mmdrop(), or explicitly with smp_mb().
> - */
> if (active_mm != mm)
> mmdrop(active_mm);
> - else
> - smp_mb();
>
> to_kthread(tsk)->oldfs = force_uaccess_begin();
> }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6a1db8264c7b..917068b0a145 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4824,14 +4824,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> fire_sched_in_preempt_notifiers(current);
>
> /*
> - * When switching through a kernel thread, the loop in
> - * membarrier_{private,global}_expedited() may have observed that
> - * kernel thread and not issued an IPI. It is therefore possible to
> - * schedule between user->kernel->user threads without passing though
> - * switch_mm(). Membarrier requires a barrier after storing to
> - * rq->curr, before returning to userspace, and mmdrop() provides
> - * this barrier.
> - *
> * If an architecture needs to take a specific action for
> * SYNC_CORE, it can do so in switch_mm_irqs_off().
> */
> @@ -4915,15 +4907,14 @@ context_switch(struct rq *rq, struct task_struct *prev,
> prev->active_mm = NULL;
> } else { // to user
> membarrier_switch_mm(rq, prev->active_mm, next->mm);
> + switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
> /*
> * sys_membarrier() requires an smp_mb() between setting
> - * rq->curr / membarrier_switch_mm() and returning to userspace.
> - *
> - * The below provides this either through switch_mm(), or in
> - * case 'prev->active_mm == next->mm' through
> - * finish_task_switch()'s mmdrop().
> + * rq->curr->mm to a membarrier-enabled mm and returning
> + * to userspace.
> */
> - switch_mm_irqs_off(prev->active_mm, next->mm, next);
> + membarrier_finish_switch_mm(next->mm);
>
> if (!prev->mm) { // from kernel
> /* will mmdrop() in finish_task_switch(). */
> @@ -6264,17 +6255,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> RCU_INIT_POINTER(rq->curr, next);
> /*
> * The membarrier system call requires each architecture
> - * to have a full memory barrier after updating
> - * rq->curr, before returning to user-space.
> - *
> - * Here are the schemes providing that barrier on the
> - * various architectures:
> - * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> - * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> - * - finish_lock_switch() for weakly-ordered
> - * architectures where spin_unlock is a full barrier,
> - * - switch_to() for arm64 (weakly-ordered, spin_unlock
> - * is a RELEASE barrier),
> + * to have a full memory barrier before and after updating
> + * rq->curr->mm, before returning to userspace. This
> + * is provided by membarrier_finish_switch_mm(). Architectures
> + * that want to optimize this can override that function.
> */
> ++*switch_count;
>
> --
> 2.33.1
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com