Date: Wed, 28 Feb 2024 10:48:03 +0000
From: Quentin Perret
To: David Hildenbrand
Cc: Matthew Wilcox, Fuad Tabba, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev, pbonzini@redhat.com, chenhuacai@kernel.org,
	mpe@ellerman.id.au, anup@brainfault.org, paul.walmsley@sifive.com,
	palmer@dabbelt.com, aou@eecs.berkeley.edu, seanjc@google.com,
	viro@zeniv.linux.org.uk, brauner@kernel.org,
	akpm@linux-foundation.org, xiaoyao.li@intel.com, yilun.xu@intel.com,
	chao.p.peng@linux.intel.com, jarkko@kernel.org, amoorthy@google.com,
	dmatlack@google.com, yu.c.zhang@linux.intel.com,
	isaku.yamahata@intel.com, mic@digikod.net, vbabka@suse.cz,
	vannapurve@google.com, ackerleytng@google.com,
	mail@maciej.szmigiero.name, michael.roth@amd.com,
	wei.w.wang@intel.com, liam.merwick@oracle.com,
	isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com,
	suzuki.poulose@arm.com, steven.price@arm.com,
	quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
	quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
	quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
	catalin.marinas@arm.com, james.morse@arm.com, yuzenghui@huawei.com,
	oliver.upton@linux.dev, maz@kernel.org, will@kernel.org,
	keirf@google.com, linux-mm@kvack.org
Subject: Re: folio_mmapped
References: <20240222161047.402609-1-tabba@google.com>
	<20240222141602976-0800.eberman@hu-eberman-lv.qualcomm.com>
	<40a8fb34-868f-4e19-9f98-7516948fc740@redhat.com>
	<20240226105258596-0800.eberman@hu-eberman-lv.qualcomm.com>
	<925f8f5d-c356-4c20-a6a5-dd7efde5ee86@redhat.com>
In-Reply-To: <925f8f5d-c356-4c20-a6a5-dd7efde5ee86@redhat.com>

On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote:
> >
> > Ah, this was something I hadn't thought about. I think both Fuad and I
> > need to update our series to check the refcount rather than mapcount
> > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me).
>
> An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But
> checking for any unexpected references might be better (there are still some
> GUP users that don't use FOLL_PIN).

As a non-mm person, I'm not sure I understand the consequences of holding
a GUP pin on a page that is not covered by any VMA. The absence of VMAs
implies that userspace cannot access the page, right? Presumably the
kernel can't be coerced into accessing that page either? Is that correct?

> At least concurrent migration/swapout (that temporarily unmaps a folio and
> can give you folio_mapped() "false negatives", which both take a temporary
> folio reference and hold the page lock) should not be a concern because
> guest_memfd doesn't support that yet.
>
> >
> > >
> > > Now, regarding the original question (disallow mapping the page), I see the
> > > following approaches:
> > >
> > > 1) SIGBUS during page fault. There are other cases that can trigger
> > >    SIGBUS during page faults: hugetlb when we are out of free hugetlb
> > >    pages, userfaultfd with UFFD_FEATURE_SIGBUS.
> > >
> > > -> Simple and should get the job done.
> > >
> > > 2) folio_mmapped() + preventing new mmaps covering that folio
> > >
> > > -> More complicated, requires an rmap walk on every conversion.
> > >
> > > 3) Disallow any mmaps of the file while any page is private
> > >
> > > -> Likely not what you want.
> > >
> > >
> > > Why was 1) abandoned? I looks a lot easier and harder to mess up. Why are
> > > you trying to avoid page faults? What's the use case?
> > >
> >
> > We were chatting whether we could do better than the SIGBUS approach.
> > SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to
> > return errors early. One difference between hugetlb and this usecase is
> > that running out of free hugetlb pages isn't something we could detect
>
> With hugetlb reservation one can try detecting it at mmap() time. But as
> reservations are not NUMA aware, it's not reliable.
>
> > at mmap time. In guest_memfd usecase, we should be able to detect when
> > SIGBUS becomes possible due to memory being lent to guest.
> >
> > I can't think of a reason why userspace would want/be able to resume
> > operation after trying to access a page that it shouldn't be allowed, so
> > SIGBUS is functional. The advantage of trying to avoid SIGBUS was
> > better/easier reporting to userspace.
>
> To me, it sounds conceptually easier and less error-prone to
>
> 1) Converting a page to private only if there are no unexpected
>    references (no mappings, GUP pins, ...)
> 2) Disallowing mapping private pages and failing the page fault.
> 3) Handling that small race window only (page lock?)
>
> Instead of
>
> 1) Converting a page to private only if there are no unexpected
>    references (no mappings, GUP pins, ...) and no VMAs covering it where
>    we could fault it in later
> 2) Disallowing mmap when the range would contain any private page
> 3) Handling races between mmap and page conversion

The one thing that makes the second option cleaner from a userspace
perspective (IMO) is that the conversion to private is happening lazily
during guest faults.
So whether or not an mmapped page can actually be accessed from userspace
will be entirely non-deterministic, as it depends on the guest faulting
pattern, which userspace is entirely unaware of. Elliot's suggestion
would prevent spurious crashes caused by that somewhat odd behaviour,
though arguably sane userspace software shouldn't be doing that to start
with.

To add a layer of paint to the shed, using SIGBUS for something that is
really an access permission problem doesn't feel appropriate. Allocating
memory via guest_memfd and donating it to a protected guest is a way for
userspace to voluntarily relinquish access permissions to the memory it
allocated. So a userspace process violating that could, IMO, reasonably
expect a SEGV instead of SIGBUS. By the point that signal would be sent,
the page would have been accounted against that userspace process, so
I'm not sure the paging examples that were discussed earlier are exactly
comparable.

To illustrate that differently, given that pKVM and Gunyah use MMU-based
protection, there is nothing architectural that prevents a guest from
sharing a page back with Linux as RO. Note that we don't currently
support this, so I don't want to conflate this use case, but that
hopefully makes it a little more obvious that this is a "there is a
page, but you don't currently have the permission to access it" problem
rather than a "sorry but we ran out of pages" problem.

Thanks,
Quentin