From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 11158C8300F for ; Tue, 1 Dec 2020 20:52:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AD15921D7A for ; Tue, 1 Dec 2020 20:52:38 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=efficios.com header.i=@efficios.com header.b="tOGrSiiY" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2388006AbgLAUwZ (ORCPT ); Tue, 1 Dec 2020 15:52:25 -0500 Received: from mail.efficios.com ([167.114.26.124]:44904 "EHLO mail.efficios.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726826AbgLAUwY (ORCPT ); Tue, 1 Dec 2020 15:52:24 -0500 Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 3AE2A24D745; Tue, 1 Dec 2020 15:51:43 -0500 (EST) Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id X-8xt0--RzPR; Tue, 1 Dec 2020 15:51:42 -0500 (EST) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id B557124D827; Tue, 1 Dec 2020 15:51:42 -0500 (EST) DKIM-Filter: OpenDKIM Filter v2.10.3 mail.efficios.com B557124D827 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=efficios.com; s=default; t=1606855902; bh=z0Vthxi3itmU/YTFu/3BgzBQQ+djft5+q0IiDG3o8ds=; h=Date:From:To:Message-ID:MIME-Version; b=tOGrSiiYey//0Afs6r6l/eiFEcqFVFRug28gcLrsTMopMJobT/t07ZqQ2UToNDDuS V5jsiFaOe6e2keNHvFSgNc4+HiRMNTHoAcvN5ksmxi8xFSPzE1A2ZQbmQi0ZhGnyoM znKqj9JRMBxeUduCe4iUQS1ahNSYwlMyuYF5QkhF0a84FRLUW0ovLPuPhyQE56NR1t rWzk4aIlutDo28P+0lkX4A7mi3WaucSHf1BWarMKd0wGX9F/0jDODlEgoUwjXCBCSk 08yqnMhmp/iLJW+OI8jD5lF9ndzfZ8ZkU7HAPQlpaFufqFwSiwB+owjNy06aineamy wbfl3ZelCY4IA== X-Virus-Scanned: amavisd-new at efficios.com Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id mEF35TKAlci8; Tue, 1 Dec 2020 15:51:42 -0500 (EST) Received: from mail03.efficios.com (mail03.efficios.com [167.114.26.124]) by mail.efficios.com (Postfix) with ESMTP id 94A0624D7B4; Tue, 1 Dec 2020 15:51:42 -0500 (EST) Date: Tue, 1 Dec 2020 15:51:42 -0500 (EST) From: Mathieu Desnoyers To: Andy Lutomirski Cc: Peter Zijlstra , x86 , linux-kernel , Nicholas Piggin , Arnd Bergmann , Anton Blanchard Message-ID: <466812913.70175.1606855902511.JavaMail.zimbra@efficios.com> In-Reply-To: References: <5495e4c344dc09011ff57756c7e0a1330830eafc.1606758530.git.luto@kernel.org> <20201201101637.GU2414@hirez.programming.kicks-ass.net> <1044280457.69297.1606832917168.JavaMail.zimbra@efficios.com> <763536813.69859.1606847380892.JavaMail.zimbra@efficios.com> Subject: Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Originating-IP: [167.114.26.124] X-Mailer: Zimbra 8.8.15_GA_3975 (ZimbraWebClient - FF83 (Linux)/8.8.15_GA_3975) Thread-Topic: membarrier: Propagate SYNC_CORE and RSEQ actions more carefully Thread-Index: jjs5eM8xrz4RIu7TlLBMd6QlibfXew== Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org ----- On Dec 1, 2020, at 1:48 PM, Andy Lutomirski luto@kernel.org wrote: > On Tue, Dec 1, 2020 at 10:29 AM Mathieu Desnoyers > wrote: >> >> ----- On Dec 1, 2020, at 1:12 PM, Andy Lutomirski luto@kernel.org wrote: >> >> > On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers >> > wrote: >> >> >> >> ----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@infradead.org= wrote: >> >> >> >> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote: >> >> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to al= l >> >> >> other CPUs, but there are two issues. >> >> >> >> >> >> - membarrier() does not sync_core() or rseq_preempt() the calling >> >> >> CPU. Aside from the logic being mind-bending, this also means >> >> >> that it may not be safe to modify user code through an alias, >> >> >> call membarrier(), and then jump to a different executable alia= s >> >> >> of the same code. >> >> > >> >> > I always understood this to be on purpose. The calling CPU can fix = up >> >> > itself just fine. The pain point is fixing up the other CPUs, and t= hat's >> >> > where membarrier() helps. >> >> >> >> Indeed, as documented in the man page: >> >> >> >> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16) >> >> In addition to providing the memory ordering guara= ntees de=E2=80=90 >> >> scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon ret= urn from >> >> system call the calling thread has a guarantee that all= its run=E2=80=90 >> >> ning thread siblings have executed a core serializing = instruc=E2=80=90 >> >> tion. This guarantee is provided only for threads in= the same >> >> process as the calling thread. >> >> >> >> membarrier sync core guarantees a core serializing instruction on the= siblings, >> >> not on the caller thread. This has been done on purpose given that th= e caller >> >> thread can always issue its core serializing instruction from user-sp= ace on >> >> its own. >> >> >> >> > >> >> > That said, I don't mind including self, these aren't fast calls by = any >> >> > means. >> >> >> >> I don't mind including self either, but this would require documentat= ion >> >> updates, including man pages, to state that starting from kernel Y th= is >> >> is the guaranteed behavior. It's then tricky for user-space to query = what >> >> the behavior is unless we introduce a new membarrier command for it. = So this >> >> could introduce issues if software written for the newer kernels runs= on older >> >> kernels. >> > >> > For rseq at least, if we do this now we don't have this issue -- I >> > don't think any released kernel has the rseq mode. >> >> But for rseq, there is no core-sync. And considering that it is invalid >> to issue a system call within an rseq critical section (including membar= rier), >> I don't see what we gain by doing a rseq barrier on self ? >> >> The only case where it really changes the semantic is for core-sync I th= ink. >> And in this case, it would be adding an additional core-sync on self. I >> am OK with doing that considering that it will simplify use of the syste= m >> call. I'm just wondering how we should document this change in the man p= age. >> >> > >> >> >> >> > >> >> >> - membarrier() does not explicitly sync_core() remote CPUs either= ; >> >> >> instead, it relies on the assumption that an IPI will result in= a >> >> >> core sync. On x86, I think this may be true in practice, but >> >> >> it's not architecturally reliable. In particular, the SDM and >> >> >> APM do not appear to guarantee that interrupt delivery is >> >> >> serializing. >> >> > >> >> > Right, I don't think we rely on that, we do rely on interrupt deliv= ery >> >> > providing order though -- as per the previous email. >> >> > >> >> >> On a preemptible kernel, IPI return can schedule, >> >> >> thereby switching to another task in the same mm that was >> >> >> sleeping in a syscall. The new task could then SYSRET back to >> >> >> usermode without ever executing IRET. >> >> > >> >> > This; I think we all overlooked this scenario. >> >> >> >> Indeed, this is an issue which needs to be fixed. >> >> >> >> > >> >> >> This patch simplifies the code to treat the calling CPU just like >> >> >> all other CPUs, and explicitly sync_core() on all target CPUs. Th= is >> >> >> eliminates the need for the smp_mb() at the end of the function >> >> >> except in the special case of a targeted remote membarrier(). Thi= s >> >> >> patch updates that code and the comments accordingly. >> >> >> >> I am not confident that removing the smp_mb at the end of membarrier = is >> >> an appropriate change, nor that it simplifies the model. >> > >> > Ah, but I didn't remove it. I carefully made sure that every possible >> > path through the function does an smp_mb() or stronger after all the >> > cpu_rq reads. ipi_func(), on_each_cpu(), and the explicit smp_mb() >> > cover the three cases. >> > >> > That being said, if you prefer, I can make the change to skip the >> > calling CPU, in which case I'll leave the smp_mb() at the end alone. >> >> For the memory barrier commands, I prefer skipping self and leaving the >> smp_mb at the very beginning/end of the system call. Those are the key >> before/after points we are synchronizing against, and those are simple >> to document. >> >=20 > Is there a reason that doing the barrier at the very end could make an > observable difference? The two models are: >=20 > membarrier() { > smp_mb(); > read a bunch of cpu_rq memory and make decisions; > execute smp_mb() on relevant cpus including self; you forget the fact that you also add a smp_mb() after each individual IPI returns, which is why you can get away with removing the smp_mb at the end of the membarrier syscall without introducing issues. > } >=20 > versus >=20 > membarrier() { > smp_mb(); > read a bunch of cpu_rq memory and make decisions; > execute smp_mb() on relevant non-self cpus; > wait for that to finish (acquire-style on the local cpu); > smp_mb(); > } >=20 > Is the idea that, on a sufficiently weakly ordered architecture, some > remote CPU could do a store before the IPI and a local load after the > membarrier() syscall might not observe the load unless the local > smp_mb() is after the remote smp_mb()? If so, I'm not entirely > convinced that this is observably different from the store simply > occurring after the IPI, but maybe there are some gnarly situations in > which this could happen. >=20 > If your concern is something along these lines, I could try to write > up an appropriate comment, and I'll rework the patch. This is already documented in the scenarios I added as comments in the patch sitting in the tip tree: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=3D25= 595eb6aaa9fbb31330f1e0b400642694bc6574 See "Scenario B Userspace thread execution before IPI vs membarrier's memor= y barrier after completing the IPI" I think the change you proposed would be technically still OK: - The callback on self issuing the smp_mb would take care of ensuring that at least one memory barrier is issued after loading rq->curr state for each cpu. - The smp_mb after each ipi return would ensure we have barrier ordering between the smp_mb in the ipi handler / before the membarrier system call returns to userspace. But then rather than having one clear spot where the smp_mb needs to be placed and documented, we have a more complex maze of conditions we need to consider. Hence my preference for keeping the smp_mb at the beginning/en= d of the membarrier system call. Thanks, Mathieu --=20 Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com