Date: Mon, 7 Oct 2024 09:45:44 -0700
In-Reply-To: <7c13be04-1d18-45bd-8cfc-f5d37bd39a8e@redhat.com>
References: <20240903232241.43995-1-anthony.yznaga@oracle.com>
 <9927f9a3-efba-4053-8384-cc69c7949ea6@intel.com>
 <8c7fbaf1-61a0-4f55-8466-1ab40464d9db@redhat.com>
 <0a1678d8-0974-4783-a6f6-da85adfa1a34@intel.com>
 <7c13be04-1d18-45bd-8cfc-f5d37bd39a8e@redhat.com>
Subject: Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
From: Sean Christopherson
To: David Hildenbrand
Cc: Dave Hansen, Anthony Yznaga, akpm@linux-foundation.org,
 willy@infradead.org, markhemm@googlemail.com, viro@zeniv.linux.org.uk,
 khalid@kernel.org, andreyknvl@gmail.com, luto@kernel.org,
 brauner@kernel.org, arnd@arndb.de, ebiederm@xmission.com,
 catalin.marinas@arm.com, linux-arch@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhiramat@kernel.org,
 rostedt@goodmis.org, vasily.averin@linux.dev, xhao@linux.alibaba.com,
 pcc@google.com, neilb@suse.de, maz@kernel.org, David Rientjes

On Mon, Oct 07, 2024, David Hildenbrand wrote:
> On 07.10.24 17:58, Dave Hansen wrote:
> > On 10/7/24 01:44, David Hildenbrand wrote:
> > > On 02.10.24 19:35, Dave Hansen wrote:
> > > > We were just chatting about this on David Rientjes's MM alignment call.
> > >
> > > Unfortunately I was not able to attend this time, my body decided it's a
> > > good idea to stay in bed for a couple of days.
> > >
> > > > I thought I'd try to give a little brain
> > > >
> > > > Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> > > > mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> > > > tables get entries that effectively mirror the primary mm page tables
> > > > and constitute a secondary MMU.  If the primary page tables change,
> > > > mmu_notifiers ensure that the changes get reflected into the
> > > > virtualization tables and also that the virtualization paging structure
> > > > caches are flushed.
> > > >
> > > > msharefs is doing something very similar.  But, in the msharefs case,
> > > > the secondary MMUs are actually normal CPU MMUs.  The page tables are
> > > > normal old page tables and the caches are the normal old TLB.  That's
> > > > what makes it so confusing: we have lots of infrastructure for dealing
> > > > with that "stuff" (CPU page tables and TLB), but msharefs has
> > > > short-circuited the infrastructure and it doesn't work any more.
> > >
> > > It's quite different IMHO, to a degree that I believe they are different
> > > beasts:
> > >
> > > Secondary MMUs:
> > > * "Belongs" to same MM context and the primary MMU (process page tables)
> >
> > I think you're speaking to the ratio here. For each secondary MMU, I
> > think you're saying that there's one and only one mm_struct. Is that right?
>
> Yes, that is my understanding (at least with KVM). It's a secondary MMU
> derived from exactly one primary MMU (MM context -> page table hierarchy).

I don't think the ratio is what's important. I think the important takeaway is
that the secondary MMU is explicitly tied to the primary MMU that it is tracking.
This is enforced in code, as the list of mmu_notifiers is stored in mm_struct.

The 1:1 ratio probably holds true today, e.g. for KVM, each VM is associated with
exactly one mm_struct. But fundamentally, nothing would prevent a secondary MMU
that manages a so-called software TLB from tracking multiple primary MMUs.

E.g. it wouldn't be all that hard to implement in KVM (a bit crazy, but not hard),
because KVM's memslots disallow gfn aliases, i.e. each index into KVM's secondary
MMU would be associated with at most one VMA and thus mm_struct.

Pulling Dave's earlier comment in:

 : But the short of it is that the msharefs host mm represents a "secondary
 : MMU". I don't think it is really that special of an MMU other than the
 : fact that it has an mm_struct.

and David's (so. many.
Davids):

 : I better not think about the complexity of seconary MMUs + mshare (e.g.,
 : KVM with mshare in guest memory): MMU notifiers for all MMs must be
 : called ...

mshare() is unique because it creates the possibility of chained "secondary" MMUs.
I.e. the fact that it has an mm_struct makes it *very* special, IMO.

> > > * Maintains separate tables/PTEs, in completely separate page table
> > >   hierarchy
> >
> > This is the case for KVM and the VMX/SVM MMUs, but it's not generally
> > true about hardware. IOMMUs can walk x86 page tables and populate the
> > IOTLB from the _same_ page table hierarchy as the CPU.
>
> Yes, of course.

Yeah, the recent rework of invalidate_range() => arch_invalidate_secondary_tlbs()
sums things up nicely:

commit 1af5a8109904b7f00828e7f9f63f5695b42f8215
Author:     Alistair Popple
AuthorDate: Tue Jul 25 23:42:07 2023 +1000
Commit:     Andrew Morton
CommitDate: Fri Aug 18 10:12:41 2023 -0700

    mmu_notifiers: rename invalidate_range notifier

    There are two main use cases for mmu notifiers.  One is by KVM which uses
    mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

    The other is to manage hardware TLBs which need to use the
    invalidate_range() callback because HW can establish new TLB entries at
    any time.  Hence using start/end() can lead to memory corruption as these
    callbacks happen too soon/late during page unmap.

    mmu notifier users should therefore either use the start()/end() callbacks
    or the invalidate_range() callbacks.  To make this usage clearer rename
    the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
    update documention.
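
To make the "tied to the primary MMU" and "chained" points concrete, here is a
rough userspace toy model. It is not kernel code and not what the mshare series
implements: every toy_* name is hypothetical, and the only thing it borrows from
the discussion is the shape, i.e. that the notifier list is stored in the mm
being tracked, so an invalidation issued against one mm reaches only the
secondary MMUs registered on that mm.

```c
/*
 * Userspace toy model, NOT kernel code.  All toy_* names are hypothetical
 * stand-ins; only the shape (notifier list stored in the mm) mirrors the
 * discussion above.
 */
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NOTIFIERS 4

/* Stand-in for a secondary MMU, e.g. the tables backing one VM. */
struct toy_notifier {
	int invalidations;	/* times this secondary MMU was told to flush */
};

/* Stand-in for mm_struct: the notifier list lives in the mm itself. */
struct toy_mm {
	struct toy_notifier *notifiers[TOY_MAX_NOTIFIERS];
	size_t nr_notifiers;
};

/* Tie a secondary MMU to the primary mm it tracks. */
static int toy_register(struct toy_mm *mm, struct toy_notifier *n)
{
	assert(mm && n);
	if (mm->nr_notifiers == TOY_MAX_NOTIFIERS)
		return -1;
	mm->notifiers[mm->nr_notifiers++] = n;
	return 0;
}

/* A mapping change in @mm reaches only notifiers registered on @mm. */
static void toy_invalidate(struct toy_mm *mm)
{
	for (size_t i = 0; i < mm->nr_notifiers; i++)
		mm->notifiers[i]->invalidations++;
}

/*
 * The "chained" case: a change made through a shared host mm must also be
 * propagated to every mm attached to it, otherwise the secondary MMUs
 * registered on those mms never hear about it.
 */
static void toy_invalidate_chained(struct toy_mm *host,
				   struct toy_mm **attached, size_t nr)
{
	toy_invalidate(host);
	for (size_t i = 0; i < nr; i++)
		toy_invalidate(attached[i]);
}
```

In this model, if a notifier is registered on process mm A and a change is
signalled only on the host mm via toy_invalidate(), the notifier's counter stays
at zero. Only toy_invalidate_chained(), which walks the attached mms as well,
reaches it. That is the "MMU notifiers for all MMs must be called" concern in
miniature.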