From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 20 Jun 2024 09:04:29 -0700
From: Sean Christopherson
To: David Hildenbrand
Cc: Jason Gunthorpe, Fuad Tabba, Christoph Hellwig, John Hubbard,
	Elliot Berman, Andrew Morton, Shuah Khan, Matthew Wilcox,
	maz@kernel.org, kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, pbonzini@redhat.com
Subject: Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning
In-Reply-To: <385a5692-ffc8-455e-b371-0449b828b637@redhat.com>
References: <7fb8cc2c-916a-43e1-9edf-23ed35e42f51@nvidia.com>
	<14bd145a-039f-4fb9-8598-384d6a051737@redhat.com>
	<20240619115135.GE2494510@nvidia.com>
	<20240620135540.GG2494510@nvidia.com>
	<6d7b180a-9f80-43a4-a4cc-fd79a45d7571@redhat.com>
	<20240620142956.GI2494510@nvidia.com>
	<385a5692-ffc8-455e-b371-0449b828b637@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
On Thu, Jun 20, 2024, David Hildenbrand wrote:
> On 20.06.24 16:29, Jason Gunthorpe wrote:
> > On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
> > > On 20.06.24 15:55, Jason Gunthorpe wrote:
> > > > On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
> > > Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is
> > > shared, now the VM requests to make one subpage private.
> >
> > I think the general CC model has the shared/private setup earlier in
> > the VM lifecycle, with large runs of contiguous pages. It would only
> > become a problem if you intend to do high-rate, fine-granularity
> > shared/private switching. Which is why I am asking what the actual
> > "why" is here.
>
> I am not an expert on that, but I remember that the way memory
> shared<->private conversion happens can heavily depend on the VM use case,

Yeah, I forget the details, but there are scenarios where the guest will
share (and unshare) memory at 4KiB (give or take) granularity, at runtime.
There's an RFC[*] for making SWIOTLB operate at 2MiB granularity that is
driven by the same underlying problems.

But even if Linux-as-a-guest were better behaved, we (the host) can't
prevent the guest from doing suboptimal conversions. In practice, killing
the guest or refusing to convert memory isn't an option, i.e. we can't
completely push the problem into the guest.

[*] https://lore.kernel.org/all/20240112055251.36101-1-vannapurve@google.com

> and that under pKVM we might see more frequent conversion, without even
> going to user space.
> >
> > > How to handle that without eventually running into a double
> > > memory-allocation? (in the worst case, allocating a 1GiB huge page
> > > for shared and for private memory).
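To make the cost of a fine-grained conversion concrete before getting to
that: assuming the usual x86-64-style stage-2 geometry (4KiB/2MiB/1GiB
pages, 512 entries per table; that geometry is my assumption, not anything
stated above), carving a single 4KiB hole out of a 1GiB block mapping turns
one S2 entry into 1023. A back-of-the-envelope sketch, not kernel code:

#include <stdio.h>

/* Assumed geometry: x86-64-style 4KiB/2MiB/1GiB, 512 entries per level. */
#define PTES_PER_TABLE 512

int main(void)
{
        /*
         * A single 1GiB block mapping covers the range with one entry.
         * Giving one 4KiB subpage different shared/private attributes
         * shatters it: 512 2MiB entries replace the 1GiB entry, and 512
         * 4KiB entries replace the 2MiB entry containing the subpage.
         */
        int entries = (PTES_PER_TABLE - 1)  /* surviving 2MiB blocks */
                    + PTES_PER_TABLE;       /* 4KiB entries in the hole */

        printf("1 S2 entry -> %d entries after one 4KiB conversion\n",
               entries);
        return 0;
}

And every such shattering has to be undone later if the huge mapping is
ever to be recovered, which is the re-consolidation problem discussed below.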
> >
> > I expect you'd take the linear range of 1G of PFNs and fragment it
> > into three ranges private/shared/private that span the same 1G.
> >
> > When you construct a page table (i.e. an S2) that holds these three
> > ranges and has permission to access all the memory, you want the page
> > table to automatically join them back together into a 1GB entry.
> >
> > When you construct a page table that has access only to the shared
> > range, then you'd install only the shared hole, at its natural best
> > size.
> >
> > So, I think there are two challenges - how to build an allocator and
> > uAPI to manage this sort of stuff so you can keep track of any
> > fractured pfns and ensure things remain in physical order.
> >
> > Then how to re-consolidate this for the KVM side of the world.
>
> Exactly!
>
> > guest_memfd, or something like it, is just really a good answer. You
> > have it obtain the huge folio, and keep track on its own of which
> > subpages can be mapped into a VMA because they are shared. KVM will
> > obtain the PFNs directly from the fd and KVM will not see the shared
> > holes. This means your S2s can be trivially constructed correctly.
> >
> > No need to double allocate.
>
> Yes, that's why my thinking so far was:
>
> Let guest_memfd (or something like that) consume huge pages (somehow, let
> it access the hugetlb reserves). Preallocate that memory once, as the VM
> starts up: just like we do with hugetlb in VMs.
>
> Let KVM track which parts are shared/private, and if required, let it map
> only the shared parts to user space. KVM has all the information to make
> these decisions.
>
> If we could disallow pinning any shared pages, that would make life a lot
> easier, but I think there were reasons why we might require it. To
> convert shared->private, simply unmap that folio (only the shared parts
> could possibly be mapped) from all user page tables.
>
> Of course, there might be alternatives, and I'll be happy to learn about
> them. The allocator part would be fairly easy, and the uAPI part would be
> comparably easy. So far the theory :)
>
> > I'm kind of surprised the CC folks don't want the same thing for
> > exactly the same reason. It is much easier to recover the huge
> > mappings for the S2 in the presence of shared holes if you track it
> > this way. Even CC will have this problem, to some degree, too.
>
> Precisely! RH (and therefore, me) is primarily interested in existing
> guest_memfd users at this point ("CC"), and I don't see an easy way to get
> that running with huge pages in the existing model reasonably well ...

This is the general direction guest_memfd is headed, but getting there is
easier said than done. E.g. as alluded to above, "simply unmap that folio"
is quite difficult, bordering on infeasible, if the kernel is allowed to
gup() shared guest_memfd memory.
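For the sake of discussion, here's a toy user-space model of the tracking
being described: guest_memfd owns the huge folio and records which subpages
are shared, and a huge private S2 mapping is only (re)installable once no
subpage is shared. Every name, the bitmap layout, and the 2MiB/4KiB
geometry below are invented for illustration; this is emphatically not the
kernel implementation:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed geometry: a 2MiB folio tracked at 4KiB subpage granularity. */
#define SUBPAGES_PER_FOLIO 512

struct folio_state {
        uint64_t shared_bitmap[SUBPAGES_PER_FOLIO / 64]; /* 1 bit/subpage */
};

static void convert_to_shared(struct folio_state *f, unsigned int idx)
{
        /* A real flow would also make the subpage mappable into a VMA. */
        f->shared_bitmap[idx / 64] |= 1ULL << (idx % 64);
}

static void convert_to_private(struct folio_state *f, unsigned int idx)
{
        /* A real flow would first unmap it from all user page tables. */
        f->shared_bitmap[idx / 64] &= ~(1ULL << (idx % 64));
}

/* A huge private S2 mapping may be (re)installed only if nothing is shared. */
static bool can_map_huge_private(const struct folio_state *f)
{
        for (size_t i = 0; i < SUBPAGES_PER_FOLIO / 64; i++) {
                if (f->shared_bitmap[i])
                        return false;
        }
        return true;
}

int main(void)
{
        struct folio_state f;

        memset(&f, 0, sizeof(f));
        printf("huge mapping ok? %d\n", can_map_huge_private(&f)); /* 1 */

        convert_to_shared(&f, 7);  /* guest shares one 4KiB subpage */
        printf("huge mapping ok? %d\n", can_map_huge_private(&f)); /* 0 */

        convert_to_private(&f, 7); /* ... and converts it back */
        printf("huge mapping ok? %d\n", can_map_huge_private(&f)); /* 1 */
        return 0;
}

The hard part is hiding in the comment in convert_to_private(): that unmap
can only complete once no gup() pins remain on the shared subpage, which is
exactly why disallowing (or severely restricting) pinning of shared
guest_memfd pages keeps coming up.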