Message-ID: <81aa23ca-18b1-4430-9ad1-00a2c5af8fc2@arm.com>
Date: Thu, 11 Apr 2024 10:45:38 +0100
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
 Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton,
 Muchun Song
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20240215121756.2734131-1-ryan.roberts@arm.com>
 <0ae22147-e1a1-4bcb-8a4c-f900f3f8c39e@redhat.com>
 <374d8500-4625-4bff-a934-77b5f34cf2ec@arm.com>
 <8bd9e136-8575-4c40-bae2-9b015d823916@redhat.com>
 <86680856-2532-495b-951a-ea7b2b93872f@arm.com>
 <35236bbf-3d9a-40e9-84b5-e10e10295c0c@redhat.com>
 <4fba71aa-8a63-4a27-8eaf-92a69b2cff0d@arm.com>
 <5a23518b-7974-4b03-bd6e-80ecf6c39484@redhat.com>
In-Reply-To: <5a23518b-7974-4b03-bd6e-80ecf6c39484@redhat.com>
Content-Type: text/plain; charset=UTF-8

On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
> 
> Skipping the ptdesc stuff we discussed offline, to not get distracted.
:)

> 
> This stuff is killing me, sorry for the lengthy reply ...

No problem - thanks for taking the time to think it through and reply with
such clarity. :)

> 
>> 
>> So I've been looking at all this again, and getting myself even more confused.
>> 
>> I believe there are 2 classes of ptep_get_lockless() caller:
>> 
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
> 
> Likely only completely lockless page table walkers where we *cannot* recheck
> under PTL is special. Essentially where we disable interrupts and rely on TLB
> flushes to sync against concurrent changes.

Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is
arch-specific. That's not how arm64 does it (the hw broadcasts the TLBIs).
But you make that clear further down.

> 
> Let's take a look where ptep_get_lockless() comes from:
> 
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra
> Date:   Fri Nov 13 11:41:40 2020 +0100
> 
>     mm/gup: Provide gup_get_pte() more generic
> 
>     In order to write another lockless page-table walker, we need
>     gup_get_pte() exposed. While doing that, rename it to
>     ptep_get_lockless() to match the existing ptep_get() naming.
> 
> 
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
> 
> "With get_user_pages_fast(), we walk down the pagetables without taking any
> locks.  For this we would like to load the pointers atomically, but sometimes
> that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What we
> do have is the guarantee that a PTE will only either go from not present to
> present, or present to not present or both -- it will not switch to a completely
> different present page without a TLB flush in between; something that we are
> blocking by holding interrupts off."
> 
> Later, we added support for GUP-fast that introduced the !TLB variant:
> 
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper
> Date:   Thu Oct 9 15:29:14 2014 -0700
> 
>     mm: introduce a general RCU get_user_pages_fast()
> 
> With the pattern
> 
> /*
>  * In the line below we are assuming that the pte can be read
>  * atomically. If this is not the case for your architecture,
>  * please wrap this in a helper function!
>  *
>  * for an example see gup_get_pte in arch/x86/mm/gup.c
>  */
> pte_t pte = ACCESS_ONCE(*ptep);
> ...
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> ...
> 
> 
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
> gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was
> required to be sane, thus the lengthy comment above ptep_get_lockless() that
> talks about TLB flushes.
> 
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
> full of details about TLB flushes syncing against GUP-fast. But as you note, we
> use it even in contexts where we don't disable interrupts and the TLB flush
> can't help.
> 
> We don't disable interrupts during page faults ... so most of the things
> described in ptep_get_lockless() don't really apply.
> 
> That's also the reason why ...
>> 
>>                  vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> -               vmf->orig_pte = *vmf->pte;
>> +               vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>                  vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>> 
>> -               /*
>> -                * some architectures can have larger ptes than wordsize,
>> -                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> -                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> -                * accesses.  The code below just needs a consistent view
>> -                * for the ifs and we later double check anyway with the
>> -                * ptl lock held.
>> -                * So here a barrier will do.
>> -                */
>> -               barrier();
>>                  if (pte_none(vmf->orig_pte)) {
> 
> ... that code was in place. We would possibly read garbage PTEs, but would
> recheck *under PTL* (where the PTE can no longer change) that what we read
> wasn't garbage and didn't change.

Agreed.

> 
>> 
>> --
>> 
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty, uffd-wp or
>> soft-dirty. arm64 can guarantee that all the other bits are self-consistent
>> just by calling ptep_get().
> 
> Let's focus on access+dirty for now ;)
> 
>> 
>> --
>> 
>> So, I'm making the bold claim that it was never necessary to specialize
>> ptep_get_lockless() on arm64, and I *think* we could just delete it so that
>> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
>> aim without needing to introduce "norecency" variants.
>> 
>> Additionally I propose documenting ptep_get_lockless() to describe the set of
>> fields that are guaranteed to be self-consistent and the remaining fields
>> which are self-consistent only with best-effort.
>> 
>> Could it be this easy? My head is hurting...
> 
> I think what has to happen is:
> 
> (1) ptep_get_lockless() must return the same value as ptep_get() as long as
> there are no races. No removal/addition of access/dirty bits etc.

Today's arm64 ptep_get() guarantees this.

> 
> (2) Lockless page table walkers that later verify under the PTL can handle
> serious "garbage PTEs". This is our page fault handler.

This isn't really a property of ptep_get_lockless(); it's a statement about a
class of users. I agree with the statement.

> 
> (3) Lockless page table walkers that cannot verify under PTL cannot handle
> arbitrary garbage PTEs. This is GUP-fast. Two options:
> 
> (3a) ptep_get_lockless() can atomically read the PTE: We re-check later if the
> atomically-read PTE is still unchanged (without PTL).
> No IPI for TLB flushes
> required. This is the common case. HW might concurrently set access/dirty
> bits, so we can race with that. But we don't read garbage.

Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.

But the point I was trying to make is that GUP-fast does not actually care
about *all* the fields being consistent (e.g. access/dirty). So we could spec
ptep_get_lockless() to say that "all fields in the returned pte are guaranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occurred".

This could mean that the access/dirty state *does* change for a given page
while GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
*think* that failing to detect this is benign.

Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must
conform to (1), so either it returns the same pte, or it returns a different
pte, or garbage. But that garbage could just happen to be the same as the
originally obtained pte. So in that case, it would have a false match. I
think this needs to be changed to ptep_get_lockless()?

> 
> (3b) ptep_get_lockless() cannot atomically read the PTE: We need special
> magic to read somewhat-sane PTEs and need IPIs during TLB flushes to sync
> against serious PTE changes (e.g., present -> present). This is weird
> x86-PAE.
> 
> 
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
> 
> My head is hurting ...
> 