Date: Thu, 26 Jun 2025 16:19:51 -0700
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
From: Ackerley Tng
To: kvm@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    x86@kernel.org, linux-fsdevel@vger.kernel.org
Cc: aik@amd.com, ajones@ventanamicro.com, akpm@linux-foundation.org,
    amoorthy@google.com, anthony.yznaga@oracle.com, anup@brainfault.org,
    aou@eecs.berkeley.edu, bfoster@redhat.com, binbin.wu@linux.intel.com,
    brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@intel.com,
    chenhuacai@kernel.org, dave.hansen@intel.com, david@redhat.com,
    dmatlack@google.com, dwmw@amazon.co.uk, erdemaktas@google.com,
    fan.du@intel.com, fvdl@google.com, graf@amazon.com,
    haibo1.xu@intel.com, hch@infradead.org, hughd@google.com,
    ira.weiny@intel.com, isaku.yamahata@intel.com, jack@suse.cz,
    james.morse@arm.com, jarkko@kernel.org, jgg@ziepe.ca,
    jgowans@amazon.com, jhubbard@nvidia.com, jroedel@suse.de,
    jthoughton@google.com, jun.miao@intel.com, kai.huang@intel.com,
    keirf@google.com, kent.overstreet@linux.dev,
    kirill.shutemov@intel.com, liam.merwick@oracle.com,
    maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name,
    maz@kernel.org, mic@digikod.net, michael.roth@amd.com,
    mpe@ellerman.id.au, muchun.song@linux.dev, nikunj@amd.com,
    nsaenz@amazon.es, oliver.upton@linux.dev, palmer@dabbelt.com,
    pankaj.gupta@amd.com, paul.walmsley@sifive.com, pbonzini@redhat.com,
    pdurrant@amazon.co.uk, peterx@redhat.com, pgonda@google.com,
    pvorel@suse.cz, qperret@google.com, quic_cvanscha@quicinc.com,
    quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
    quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
    quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
    richard.weiyang@gmail.com, rick.p.edgecombe@intel.com,
    rientjes@google.com, roypat@amazon.co.uk, rppt@kernel.org,
    seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
    steven.sistare@oracle.com, suzuki.poulose@arm.com, tabba@google.com,
    thomas.lendacky@amd.com, usama.arif@bytedance.com,
    vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
    vkuznets@redhat.com, wei.w.wang@intel.com, will@kernel.org,
    willy@infradead.org, xiaoyao.li@intel.com, yan.y.zhao@intel.com,
    yilun.xu@intel.com, yuzenghui@huawei.com, zhiquan1.li@intel.com

Ackerley Tng writes:

> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
>
> [...]

At the guest_memfd upstream call today (2025-06-26), we talked about
when to merge folios with respect to conversions.

Just want to call out that in this RFCv2, we managed to get conversions
working with merges happening as soon as possible.

"As soon as possible" means merges happen as long as shareability is all
private (or all meaningless) within an aligned hugepage range. We try to
merge after every conversion request and on truncation. On truncation,
shareability becomes meaningless.

On explicit truncation (e.g. fallocate(FALLOC_FL_PUNCH_HOLE)),
truncation can fail if there are unexpected refcounts (because we can't
merge with unexpected refcounts). Explicit truncation will succeed only
if refcounts are expected, and the merge is performed before the folio
is finally removed from the filemap.

On truncation caused by file close or inode release, guest_memfd may not
hold the last refcount on the folio. Only in this case do we defer
merging to the folio_put() callback, and because the callback can be
called from atomic context, the merge is further deferred to be
performed by a kernel worker.

Deferment of merging is already minimized so that most of the
restructuring is synchronous with some userspace-initiated action
(conversion or explicit truncation). The only deferred merge is when the
file is closed, and in that case there's no way to reject/fail this file
close.

(There are possible optimizations here - Yan suggested [1] checking if
the folio_put() was called from interrupt context - I have not tried
implementing that yet)

I did propose an explicit guest_memfd merge ioctl, but since RFCv2
works, I was thinking to have the merge ioctl be a separate
optimization/project/patch series if it turns out that merging
as-soon-as-possible is an inefficient strategy, or if some VM use cases
prefer to have an explicit merge ioctl.
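
To make the "as soon as possible" rule above concrete, here is a toy
userspace model of the eligibility check. It is only a sketch and not
the code in the series: the Shareability enum mirrors the
SHAREABILITY_GUEST / SHAREABILITY_ALL states, while the NONE value (for
"meaningless"), can_merge(), and PAGES_PER_2M are names made up for
illustration. It assumes "all private (or all meaningless)" means every
4K entry in the aligned range carries the same non-shared state, and it
does not model the refcount check.

```python
from enum import Enum


class Shareability(Enum):
    GUEST = "guest"  # private: only the guest may use the page
    ALL = "all"      # shared: host and guest may use the page
    NONE = "none"    # hypothetical marker for "meaningless" (e.g. truncated)


PAGES_PER_2M = 512  # number of 4K pages in one 2M hugepage


def can_merge(shareability, index, nr_pages=PAGES_PER_2M):
    """Return True if the aligned range containing `index` may be merged.

    Assumption: a merge is allowed only when every 4K entry in the
    aligned range is GUEST, or every entry is NONE.
    """
    start = index - index % nr_pages          # align down to the hugepage
    window = shareability[start:start + nr_pages]
    if len(window) < nr_pages:                # range runs past end of file
        return False
    first = window[0]
    if first is Shareability.ALL:             # any shared page blocks merging
        return False
    return all(state is first for state in window)


# Two 2M ranges, all private, except one 4K page converted to shared.
pages = [Shareability.GUEST] * (2 * PAGES_PER_2M)
pages[5] = Shareability.ALL
assert not can_merge(pages, 5)            # first range has a shared page
assert can_merge(pages, PAGES_PER_2M)     # second range is all private

# Converting the page back to private makes the first range mergeable.
pages[5] = Shareability.GUEST
assert can_merge(pages, 5)
```

In this model the check would run after every conversion request and on
truncation, matching the points at which a merge is attempted.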

During the call, Michael also brought up that SNP adds some constraints
with respect to guest accepting pages/levels.

Could you please expand on that? Suppose for an SNP guest,

1. Guest accepted a page at 2M level
2. Guest converts a 4K sub page to shared
3. guest_memfd requests unmapping of the guest-requested 4K range
   (the rest of the 2M remains mapped into stage 2 page tables)
4. guest_memfd splits the huge page to 4K pages (the 4K is set to
   SHAREABILITY_ALL, the rest of the 2M is still SHAREABILITY_GUEST)

Can the SNP guest continue to use the rest of the 2M page or must it
re-accept all the pages at 4K?

And for the reverse:

1. Guest accepted a 2M range at 4K
2. guest_memfd merges the full 2M range to a single 2M page

Must the SNP guest re-accept at 2M for the guest to continue
functioning, or will the SNP guest continue to work (just with poorer
performance than if the memory was accepted at 2M)?

[1] https://lore.kernel.org/all/aDfT35EsYP%2FByf7Z@yzhao56-desk.sh.intel.com/