From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935251AbbCPP6H (ORCPT ); Mon, 16 Mar 2015 11:58:07 -0400 Received: from mail.efficios.com ([78.47.125.74]:48548 "EHLO mail.efficios.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935230AbbCPP6B (ORCPT ); Mon, 16 Mar 2015 11:58:01 -0400 Date: Mon, 16 Mar 2015 15:57:54 +0000 (UTC) From: Mathieu Desnoyers To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, KOSAKI Motohiro , Steven Rostedt , "Paul E. McKenney" , Nicholas Miell , Linus Torvalds , Ingo Molnar , Alan Cox , Lai Jiangshan , Stephen Hemminger , Andrew Morton , Josh Triplett , Thomas Gleixner , David Howells , Nick Piggin Message-ID: <1850734415.10397.1426521474837.JavaMail.zimbra@efficios.com> In-Reply-To: <1203077851.9491.1426520636551.JavaMail.zimbra@efficios.com> References: <1426447459-28620-1-git-send-email-mathieu.desnoyers@efficios.com> <20150316141939.GE21418@twins.programming.kicks-ass.net> <1203077851.9491.1426520636551.JavaMail.zimbra@efficios.com> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [173.246.22.116] X-Mailer: Zimbra 8.0.7_GA_6021 (ZimbraWebClient - FF36 (Linux)/8.0.7_GA_6021) Thread-Topic: sys_membarrier(): system/process-wide memory barrier (x86) (v12) Thread-Index: Mcv8pHQv4t5HaUdpIBWnSJFVfcDlZwUhay/k Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org ----- Original Message ----- > From: "Mathieu Desnoyers" > To: "Peter Zijlstra" > Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" , "Steven Rostedt" > , "Paul E. McKenney" , "Nicholas Miell" , > "Linus Torvalds" , "Ingo Molnar" , "Alan Cox" > , "Lai Jiangshan" , "Stephen Hemminger" > , "Andrew Morton" , "Josh Triplett" , > "Thomas Gleixner" , "David Howells" , "Nick Piggin" > Sent: Monday, March 16, 2015 11:43:56 AM > Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12) > > ----- Original Message ----- > > From: "Peter Zijlstra" > > To: "Mathieu Desnoyers" > > Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" > > , "Steven Rostedt" > > , "Paul E. McKenney" , > > "Nicholas Miell" , > > "Linus Torvalds" , "Ingo Molnar" > > , "Alan Cox" > > , "Lai Jiangshan" , > > "Stephen Hemminger" > > , "Andrew Morton" , > > "Josh Triplett" , > > "Thomas Gleixner" , "David Howells" > > , "Nick Piggin" > > Sent: Monday, March 16, 2015 10:19:39 AM > > Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory > > barrier (x86) (v12) > > > > On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote: > > > > TL;DR > > > > > --- a/arch/x86/include/asm/mmu_context.h > > > +++ b/arch/x86/include/asm/mmu_context.h > > > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, > > > struct mm_struct *next, > > > #endif > > > cpumask_set_cpu(cpu, mm_cpumask(next)); > > > > > > + /* > > > + * smp_mb() between mm_cpumask set and following memory > > > + * accesses to user-space addresses is required by > > > + * sys_membarrier(). A smp_mb() is also needed between > > > + * prior memory accesses and mm_cpumask clear. This > > > + * ensures that all user-space address memory accesses > > > + * performed by the current thread are in program order > > > + * when the mm_cpumask is set. Implied by load_cr3. > > > + */ > > > + > > > /* Re-load page tables */ > > > load_cr3(next->pgd); > > > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > > > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, > > > struct mm_struct *next, > > > * We were in lazy tlb mode and leave_mm disabled > > > * tlb flush IPI delivery. We must reload CR3 > > > * to make sure to use no freed page tables. > > > + * > > > + * smp_mb() between mm_cpumask set and memory accesses > > > + * to user-space addresses is required by > > > + * sys_membarrier(). This ensures that all user-space > > > + * address memory accesses performed by the current > > > + * thread are in program order when the mm_cpumask is > > > + * set. Implied by load_cr3. > > > */ > > > load_cr3(next->pgd); > > > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > > > > > > In both cases the cpumask_set_cpu() will also imply a MB. > > I'm probably missing what exactly in cpumask_set_cpu() > implies this guarantee. cpumask_set_cpu() uses set_bit(). > On x86, set_bit is indeed implemented with a lock-prefixed > orb or bts. However, the comment above set_bit() states: > > * Note: there are no guarantees that this function will not be reordered > * on non x86 architectures, so if you are writing portable code, > * make sure not to rely on its reordering guarantees. > > And it states nothing about memory barriers. Typically, > atomic ops that imply memory barriers always return > something (xchg, cmpxchg, add_return). Ops like atomic_add > do not imply barriers. > > > > > > +enum { > > > + /* > > > + * Private flag set: only synchronize across a single process. If this > > > + * flag is not set, it means "shared": synchronize across multiple > > > + * processes. The shared mode is useful for shared memory mappings > > > + * across processes. > > > + */ > > > + MEMBARRIER_PRIVATE_FLAG = (1 << 0), > > > + > > > + /* > > > + * Expedited flag set: adds some overhead, fast execution (few > > > + * microseconds). If this flag is not set, it means "delayed": low > > > + * overhead, but slow execution (few milliseconds). > > > + */ > > > + MEMBARRIER_EXPEDITED_FLAG = (1 << 1), > > > > > > I suppose this is an unprivileged syscall; so what do we do about: > > > > for (;;) > > sys_membar(EXPEDITED); > > > > Which would spray the entire system with IPIs at break neck speed. > > Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL. > Therefore, if someone cares about issuing barriers on the entire > system, the only option is to use non-EXPEDITED, which rely on > synchronize_rcu(). Sorry, I meant "synchronize_sched()". Thanks, Mathieu > > The only way to invoke expedited barriers in a loop is: > > for (;;) > sys_membarrier(MEMBARRIER_EXPEDITED | MEMBARRIER_PRIVATE); > > Which will only send IPIs to the CPU running threads from the same > process. > > > > > > +static void membarrier_ipi(void *unused) > > > +{ > > > + /* Order memory accesses with respects to sys_membarrier caller. */ > > > + smp_mb(); > > > +} > > > + > > > +/* > > > + * Handle out-of-memory by sending per-cpu IPIs instead. > > > + */ > > > +static void membarrier_fallback(void) > > > +{ > > > + struct mm_struct *mm; > > > + int cpu; > > > + > > > + for_each_cpu(cpu, mm_cpumask(current->mm)) { > > > + raw_spin_lock_irq(&cpu_rq(cpu)->lock); > > > + mm = cpu_curr(cpu)->mm; > > > + raw_spin_unlock_irq(&cpu_rq(cpu)->lock); > > > + if (current->mm == mm) > > > + smp_call_function_single(cpu, membarrier_ipi, NULL, 1); > > > + } > > > +} > > > > > +static void membarrier_expedited(void) > > > +{ > > > + struct mm_struct *mm; > > > + cpumask_var_t tmpmask; > > > + int cpu; > > > + > > > + /* > > > + * Memory barrier on the caller thread between previous memory accesses > > > + * to user-space addresses and sending memory-barrier IPIs. Orders all > > > + * user-space address memory accesses prior to sys_membarrier() before > > > + * mm_cpumask read and membarrier_ipi executions. This barrier is > > > paired > > > + * with memory barriers in: > > > + * - membarrier_ipi() (for each running threads of the current process) > > > + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory > > > + * accesses to user-space addresses) > > > + * - Each CPU ->mm update performed with rq lock held by the scheduler. > > > + * A memory barrier is issued each time ->mm is changed while the rq > > > + * lock is held. > > > + */ > > > + smp_mb(); > > > + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { > > > + membarrier_fallback(); > > > + goto out; > > > + } > > > + cpumask_copy(tmpmask, mm_cpumask(current->mm)); > > > + preempt_disable(); > > > + cpumask_clear_cpu(smp_processor_id(), tmpmask); > > > + for_each_cpu(cpu, tmpmask) { > > > + raw_spin_lock_irq(&cpu_rq(cpu)->lock); > > > + mm = cpu_curr(cpu)->mm; > > > + raw_spin_unlock_irq(&cpu_rq(cpu)->lock); > > > + if (current->mm != mm) > > > + cpumask_clear_cpu(cpu, tmpmask); > > > + } > > > + smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1); > > > + preempt_enable(); > > > + free_cpumask_var(tmpmask); > > > +out: > > > + /* > > > + * Memory barrier on the caller thread between sending & waiting for > > > + * memory-barrier IPIs and following memory accesses to user-space > > > + * addresses. Orders mm_cpumask read and membarrier_ipi executions > > > + * before all user-space address memory accesses following > > > + * sys_membarrier(). This barrier is paired with memory barriers in: > > > + * - membarrier_ipi() (for each running threads of the current process) > > > + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory > > > + * accesses to user-space addresses) > > > + * - Each CPU ->mm update performed with rq lock held by the scheduler. > > > + * A memory barrier is issued each time ->mm is changed while the rq > > > + * lock is held. > > > + */ > > > + smp_mb(); > > > +} > > > > Did you just write: > > > > bool membar_cpu_is_mm(int cpu, void *info) > > { > > struct mm_struct *mm = info; > > struct rq *rq = cpu_rq(cpu); > > bool ret; > > > > raw_spin_lock_irq(&rq->lock); > > ret = rq->curr->mm == mm; > > raw_spin_unlock_irq(&rq->lock); > > > > return ret; > > } > > > > on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT); > > > > It is very similar indeed! The main difference is that my implementation > was starting from a copy of mm_cpumask(current->mm) and clearing the CPUs > for which TLB shootdown is simply pending (confirmed by taking the rq lock > and checking cpu_curr(cpu)->mm against current->mm). > > Now that you mention this, I think we don't really need to use > mm_cpumask(current->mm) at all. Just iterating on each cpu, taking > the rq lock, and comparing the mm should be enough. This would > remove the need to rely on having extra memory barriers around > set/clear of the mm cpumask. > > The main reason why I did not use on_each_cpu_cond() was that it did > not exist back in 2010. ;-) > > > On which; I absolutely hate that rq->lock thing in there. What is > > 'wrong' with doing a lockless compare there? Other than not actually > > being able to deref rq->curr of course, but we need to fix that anyhow. > > If we can make sure rq->curr deref could be done without holding the rq > lock, then I think all we would need is to ensure that updates to rq->curr > are surrounded by memory barriers. Therefore, we would have the following: > > * When a thread is scheduled out, a memory barrier would be issued before > rq->curr is updated to the next thread task_struct. > > * Before a thread is scheduled in, a memory barrier needs to be issued > after rq->curr is updated to the incoming thread. > > In order to be able to dereference rq->curr->mm without holding the > rq->lock, do you envision we should protect task reclaim with RCU-sched ? > > Thanks! > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com > -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com