From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.6 required=3.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_SIGNED,DKIM_VALID,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1E441C3A59F for ; Thu, 29 Aug 2019 14:10:54 +0000 (UTC) Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id E30142339E for ; Thu, 29 Aug 2019 14:10:53 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=lists.infradead.org header.i=@lists.infradead.org header.b="ixp9EnE+"; dkim=fail reason="signature verification failed" (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="A62KVnHG" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E30142339E Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-arm-kernel-bounces+infradead-linux-arm-kernel=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20170209; h=Sender: Content-Transfer-Encoding:Content-Type:Cc:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:Message-Id:MIME-Version:In-Reply-To: References:To:Subject:From:Date:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=QinJMjjT1r8O9+sLaZ40E/4gDwLO/FfI6lf86VwwwUI=; b=ixp9EnE+/lwXID TljARFiBJw+/cYryFv89ezoF3XyqVhxOZXwi2+o34J2teKqIe4zxzSdr0E0NUeniX2T6eCM2c++6v f5Slo/ZdnLYrbi8BSIE6Poy+IX1s+pWIKM9zP8uH0J5y7l1pwixHSwZlkhMs/BuuZgzdHLdOJknd+ MdMzLGXY2xUCtEftk292u1WvzC3ZTaJIG6BSeoXlVZcEaOKk60Sp0nM7IWM8H/hZAOnvbLwRdCXC6 b3B8UT6uQR0QiDbozKLY3ZwxDnLRT5L5im65AHUDL789VZGRec92ZVW0Wir9KO6WrItuYIYxnTa+w E8j3xX63+ehk782dc20w==; Received: from localhost ([127.0.0.1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.92 #3 (Red Hat Linux)) id 1i3L8N-00084k-MK; Thu, 29 Aug 2019 14:10:43 +0000 Received: from mail-pl1-x643.google.com ([2607:f8b0:4864:20::643]) by bombadil.infradead.org with esmtps (Exim 4.92 #3 (Red Hat Linux)) id 1i3L8K-00084K-8Y for linux-arm-kernel@lists.infradead.org; Thu, 29 Aug 2019 14:10:42 +0000 Received: by mail-pl1-x643.google.com with SMTP id y8so1624951plr.12 for ; Thu, 29 Aug 2019 07:10:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:subject:to:cc:references:in-reply-to:mime-version :user-agent:message-id:content-transfer-encoding; bh=oHiuqDc5byV42nC75JKsnZDI9kV2zYKc4gOzFLT9ZM8=; b=A62KVnHG27axRu5IWrUH/RBEliE9GokRyGw2eM8uznDtXJh4NU3Ifk5OixlYQGQsaN Ggw5Sm6LMKmHec8W/CdpQ0pRvagf4fnhjfuWj1gcmOxqkvNE+C8DpGKOmAgh5n4P2ApI 6FmdxiGvvCa6CYmSkUCI3nU98XlXFlxIpwa8ab8a6P6GbeZy2OT5UL3Yk7UD2cfY7pA8 xp2fBnbJTEeExznOjo0N4BrnlomV6J3zkjp8a1/hTXBrcDVuROQLVydYCZSU1w5CXxp9 BGiduTobvPB4X9nayNUjv2fNVF/Fgrp1Qx2a77URaT84DX2oMv+zfekBPdAgAgFWt1XV kNLA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:subject:to:cc:references:in-reply-to :mime-version:user-agent:message-id:content-transfer-encoding; bh=oHiuqDc5byV42nC75JKsnZDI9kV2zYKc4gOzFLT9ZM8=; b=s7eyE2Ol6ojT0qjeegx8/Upw52/oqyjdESqN1NMD9WVsJia1LvfbPexHFw9yvz8vo4 a3dp0P7zQTwliZ/wK9iLp/sQUcVnWXnSikEkphrwlnpH+3CDHbf559thjp6F97+h3jpA NctcJmELrsN1/4RKadU9amxIOXSS25LrpCrYTwoLa0qaAY/CX0k+bRTvXtlZwF5eFVQw M1QBd2IptabA57izXj4WGKgW4V1JHpUNBR24bJyOVSDS4DQGaau94+M81PsAHhHPqpHN oLkTlLEhb6PdP6Bj9LcEoHUkw9G3BCaaKU0JS6xPH1+40cdi52pvzgSadMdPjJo50B1F dCHQ== X-Gm-Message-State: APjAAAU8TgqISrshfwJnx/bkcLxG6yX0s7mdsVXk41CgQgtiknMcVpSn CWMlJVVwhNnwDdvd1dvJQk4= X-Google-Smtp-Source: APXvYqxOrP90DAJEm/aa6FQmcaWYwJnTobNSUHKB2eGATGeArTlhM/j50LslZ3c5piDJOK6SQfH1Xg== X-Received: by 2002:a17:902:fe0f:: with SMTP id g15mr9809271plj.2.1567087838150; Thu, 29 Aug 2019 07:10:38 -0700 (PDT) Received: from localhost ([61.68.162.209]) by smtp.gmail.com with ESMTPSA id k64sm4696047pge.65.2019.08.29.07.10.36 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 29 Aug 2019 07:10:37 -0700 (PDT) Date: Fri, 30 Aug 2019 00:08:52 +1000 From: Nicholas Piggin Subject: Re: [PATCH 0/6] Fix TLB invalidation on arm64 To: Will Deacon References: <20190827131818.14724-1-will@kernel.org> <1566947104.2uma6s0pl1.astroid@bobo.none> <20190828161256.uevoohval4sko24m@willie-the-truck> In-Reply-To: <20190828161256.uevoohval4sko24m@willie-the-truck> MIME-Version: 1.0 User-Agent: astroid/0.15.0 (https://github.com/astroidmail/astroid) Message-Id: <1567085427.12jzc6eq6j.astroid@bobo.none> X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20190829_071040_332947_1B8C7F57 X-CRM114-Status: GOOD ( 31.93 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: linux-arch@vger.kernel.org, Mark Rutland , Peter Zijlstra , Catalin Marinas , Marc Zyngier , linux-arm-kernel@lists.infradead.org Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+infradead-linux-arm-kernel=archiver.kernel.org@lists.infradead.org Will Deacon's on August 29, 2019 2:12 am: > Hi Nick, > > On Wed, Aug 28, 2019 at 10:35:24AM +1000, Nicholas Piggin wrote: >> Will Deacon's on August 27, 2019 11:18 pm: >> > So far, so good, but the final piece of the puzzle isn't quite so rosy. >> > >> > *** Other architecture maintainers -- start here! *** >> > >> > In the case that one CPU maps a page and then sets a flag to tell another >> > CPU: >> > >> > CPU 0 >> > ----- >> > >> > MOV X0, >> > STR X0, [Xptep] // Store new PTE to page table >> > DSB ISHST >> > ISB >> > MOV X1, #1 >> > STR X1, [Xflag] // Set the flag >> > >> > CPU 1 >> > ----- >> > >> > loop: LDAR X0, [Xflag] // Poll flag with Acquire semantics >> > CBZ X0, loop >> > LDR X1, [X2] // Translates using the new PTE >> > >> > then the final load on CPU 1 can raise a translation fault for the same >> > reasons as mentioned at the start of this description. >> >> powerpc's ptesync instruction is defined to order MMU memory accesses on >> all other CPUs. ptesync does not go out to the fabric though. How does >> it work then? >> >> Because the MMU coherency problem (at least we have) is not that the >> load will begin to "partially" execute ahead of the store, enough to >> kick off a table walk that goes ahead of the store, but not so much >> that it violates the regular CPU barriers. It's just that the loads >> from the MMU don't participate in the LSU pipeline, they don't snoop >> the store queues aren't inserted into load queues so the regular >> memory barrier instructions won't see stores from other threads cuasing >> ordering violations. >> >> In your first example, if powerpc just has a normal memory barrier >> there instead of a ptesync, it could all execute completely >> non-speculatively and in-order but still cause a fault, because the >> table walker's loads didn't see the store in the store queue. > > Ah, so I think this is where our DSB comes in, as opposed to our usual > DMB. DMB provides ordering guarantees and is generally the only barrier > instruction you need for shared memory communication. DSB, on the other > hand, has additional properties such as making page-table updates visible > to the table walker and completing pending TLB invalidation. > > So in practice, DSB is likely to drain the store buffer to ensure that > pending page-table writes are visible at L1, which is coherent with all > CPUs (and their page-table walkers). > >> From the other side of the fabric you have no such problem. The table >> walker is cache coherent apart from the local stores, so we don't need a >> special barrier on the other side. That's why ptesync doesn't broadcast. > > Curious: but do you need to do anything extra to take into account > instruction fetch on remote CPUs if you're mapping an executable page? > We added an IPI to flush_icache_range() in 3b8c9f1cdfc5 to handle this, > because our broadcast I-cache maintenance doesn't force a pipeline flush > for remote CPUs (and may even execute as a NOP on recent cores). Ah, I think the tlbie does not force re-fetch indeed. We may need something like that as well. What do you do on the user side? Require threads to ISB themselves? >> I would be surprised if ARM's issue is different, but interested to >> hear if it is. > > If you take the four scenarios of Map/Unmap for the UP/SMP cases: > > Map+UP // Some sort of fence instruction (DSB;ISB/ptesync) > Map+SMP // Same as above > Unmap+UP // Local TLB invalidation > Unmap+SMP // Broadcast TLB invalidation > > then the most interesting case is Map+SMP, where one CPU transitions a PTE > from invalid to valid and executes its DSB;ISB/PTESYNC sequence without > affecting the remote CPU. That's what I'm trying to get at in my example > below: > >> > CPU 0 CPU 1 >> > ----- ----- >> > spin_lock(&lock); spin_lock(&lock); >> > set_fixmap(0, paddr, prot); if (mapped) >> > mapped = true; foo = *fix_to_virt(0); >> > spin_unlock(&lock); spin_unlock(&lock); >> > >> > could fault. >> >> This is kind of a different issue, or part of a wider one at least. >> Consider speculative execution more generally, any branch mispredict can >> send us off to crazy town executing instructions using nonsense register >> values. CPU0 does not have to be in the picture, or any kernel page >> table modification at all, CPU1 alone will be doing speculative loads >> wildly all over the kernel address space and trying to access pages with >> no pte. >> >> Yet we don't have to flush TLB when creating a new kernel mapping, and >> we don't get spurious kernel faults. The page table walker won't >> install negative entries, at least not "architectural" i.e., that cause >> faults and require flushing. My guess is ARM is similar, or you would >> have seen bigger problems by now? > > Right, we don't allow negative (invalid) entries to be cached in the TLB, > but CPUs can effectively cache the result of a translation for a load/store > instruction whilst that instruction flows down the pipe after its virtual > address is known. /That/ caching is not restricted to valid translations. Ah, I misunderstood you sorry. Yeah that is interesting, I don't think that is explicitly prohibited in the power ISA either. I believe CPU1 would have to do a ptesync to avoid the fault there. > For example, if we just take a simple message passing example where we have > two global variables: a 'mapped' flag (initialised to zero) and a pointer > (initialised to point at a page that is not yet mapped): > > [please excuse the mess I've made of your assembly language] > > CPU 0 > > // Set PTE which maps the page pointed to by pointer > stw r5, 0(r4) > ptesync > lwsync > > // Set 'mapped' flag to 1 > li r9, 1 > stb r9, 0(r3) > > > CPU 1 > > // Poll for 'mapped' flag to be set > loop: lbz r9, 0(r3) > lwsync > cmpdi cr7, r0, 0 > beq cr7, loop > > // Dereference pointer > lwz r4, 0(r5) > > > So in this example, I think it's surprising that CPU 1's dereference of > 'pointer' can fault. The way this happens on arm64 is that CPU 1 can > translate the 'pointer' dereference before the load of the 'mapped' flag has > returned its data. The walker may come back with a fault, but then if the > flag data later comes back as 1, the fault will be taken when the lwz > commits. In other words, translation table walks can occur out-of-order > with respect to the accesses they are translating for, even in the presence > of memory barriers. > > In practice, I think this kind of code sequence is unusual and triggering > the problem relies on CPU 1 knowing the virtual address it's going to > dereference before it's actually mapped. However, this could conceivably > happen with the fixmap and possibly also if the page-table itself was > being written concurrently using cmpxchg(), in which case you might use > the actual pte value in the same way as the 'mapped' flag above. > > But yes, adding the spurious fault handler is belt and braces, which is > why I've kept a WARN_RATELIMIT() in there if it ever triggers. This could be a theoretical problem for powerpc too, I think. Maybe. I'll ask around, I might not be understanding the architecture or Linux code properly. Thanks, Nick _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel