From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pf1-f201.google.com (mail-pf1-f201.google.com [209.85.210.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A741525DB1A for ; Thu, 2 Oct 2025 23:51:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1759449079; cv=none; b=n+YQ6NNrNPgt3PBGP59GviGNctLLIDyA5jXAAQqSm2VuCcPrlDZmgorB6j4yQxgyFklehSQMgSPDlhOVg3sRHV/s1aB5bZHcFVSi2YA7yg28XkcN06rNp4sHh3gDJRjwRCw7m84cVnDBQ+us0Y6Ehc+hj4dRFKs0m6DRMPSUrSA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1759449079; c=relaxed/simple; bh=4S0zhWFtG3eZj9J0/xePjP4M0Bwd2/giZoqCZTW1jUc=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=PL6HumJuRiHA4hFdGOH2GPCvl1z+RSDAd4RWiNvtQjxvu5Ni8cBJ2z24LKGEEurHAHmqak9nBKk9dcC5rcLrzE0amsPHzopp4ZhdhB3teK/FbmU4CFNyAZ5nSgikvUv+45E0618ig2uuEHQc8vwE0BDY+RFFKA1I6tMDD5V3svo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=uXCy0j1t; arc=none smtp.client-ip=209.85.210.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="uXCy0j1t" Received: by mail-pf1-f201.google.com with SMTP id d2e1a72fcca58-783604212ebso3113969b3a.3 for ; Thu, 02 Oct 2025 16:51:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1759449077; x=1760053877; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=lV6kU18ljJ0dxk3XSSzUGg8U1vG+s7/1h2PQkgl2qAE=; b=uXCy0j1tIIKwhV3Eo3fNIox1EETN/lmIIaFzkBNeilMMfnqmRQ+CLJqLr2SsdM5C8A Qe0COYk5QH30QqpUoH1DuEEsttXRO0owEOidfGvlOSMRtepH/uqOuNRsyypvI3DPwmwn z113EcsUt7t02BWkvyw/LG9psu5mtTkwhGOx0xlTwwY3JyYETVHQtrQ8zd3UqRs/GYrM vjB4FLuvxPGBx3G+3caQA1dNYUvjuGzRAbb8YNLyQcDECTYZaqy4BsYYH8QOd7B7Xfvs WQIDJbQmw7RZ0wlT9VFLqpWsHWlnyOTmXQMKLbW+quXuk/1wXVUqtHcsJqFZ22oPc9Oq kpXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1759449077; x=1760053877; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=lV6kU18ljJ0dxk3XSSzUGg8U1vG+s7/1h2PQkgl2qAE=; b=dP9+d5AIn/5zJAwB+AYqEB1VFHBHRj834bu1YlbMfUNK7vL+QpmgcuFcrYlUIBQlHH i5M+emxqd5eKloncvBdoz6b6Wxa71hffkf3XLKHzZio0dQYal4DwF0zVhWuBCxZ6lHlT IojL8+mYYWxHFZLCBlWB8HI0K2yFfj8ytoZL5ek0siXZ6HOwJcg+FbE9dnXx5+865Ll/ /DBMC+emP5hJK5qhlRfbKBMUPSZF7OdVD7p3Ll5SjOL1gTCfGtnsqxKqMFT6v0S/nx85 AjShlGa4SxVt9of76oVYyzNjwzB7+ADA7Q2pX3NhBHfDcfmjoVfIOWLvj/of4vezsnlY t1Ng== X-Forwarded-Encrypted: i=1; AJvYcCU1gTY7Z/0B4QeFT2lSCNlohAmIxUanRggtpwuwfMnz/YXG42Ob9mgD5yV9D31unz8IhHz07eJty/lFzP4=@vger.kernel.org X-Gm-Message-State: AOJu0YzW9KN+ircD8FUBaZeqZZTkp3X0LVE+eL04P4gO3/Sij6tzMJ9N WsFv3Ug9Hc9njXroLWMVK5dg7q4qfgTtYYL88IIFFuJr0am8tzg00O1Yl6tcsLPIVXUJsfuaF67 y7e+tkQ== X-Google-Smtp-Source: AGHT+IHBUWpSfoqz6bNDbuRLm77zG7hkwgs6l+b7AyUGmi0F2ScwQi1v6ouRO4Ao1sSxMXa6nEVCtvgk8Kc= X-Received: from pfkm7.prod.google.com ([2002:a05:6a00:807:b0:776:1344:ca77]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a00:1796:b0:781:2177:1c9b with SMTP id d2e1a72fcca58-78c98cb9f4fmr2014775b3a.17.1759449076867; Thu, 02 Oct 2025 16:51:16 -0700 (PDT) Date: Thu, 2 Oct 2025 16:51:15 -0700 In-Reply-To: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: Message-ID: Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting From: Sean Christopherson To: Ackerley Tng Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Yan Zhao , Fuad Tabba , Binbin Wu , Michael Roth , Ira Weiny , Rick P Edgecombe , Vishal Annapurve , David Hildenbrand , Paolo Bonzini Content-Type: text/plain; charset="us-ascii" On Thu, Oct 02, 2025, Ackerley Tng wrote: > Sean Christopherson writes: > > > On Wed, Oct 01, 2025, Ackerley Tng wrote: > >> Sean Christopherson writes: > >> >> I'd prefer not to have the module param choose between the use of > >> >> mem_attr_array and guest_memfd conversion in case we need both > >> >> mem_attr_array to support other stuff in future while supporting > >> >> conversions. > >> > > >> > Luckily, we don't actually need to make a decision on this, because PRIVATE is > >> > the only attribute that exists. Which is partly why I want to go with a module > >> > param. We can make the behavior very definitive without significant risk of > >> > causing ABI hell. > >> > > >> > >> Then maybe I'm misunderstanding the static_call() thing you were > >> describing. Is it like, at KVM module initialization time, > >> > >> if module_param == disable_tracking: > >> .__kvm_get_memory_attributes = read_attributes_from_guest_memfd > >> else > >> .__kvm_get_memory_attributes = read_attributes_from_mem_attr_array > >> > >> With that, I can't have both CoCo private/shared state tracked in > >> guest_memfd and RWX (as an example, could be any future attribute) > >> tracked in mem_attr_array on the same VM. > > > > More or less. > > > > Hm okay. So introducing the module param will only allow the use of one > of the following? > > + KVM_SET_MEMORY_ATTRIBUTES (vm ioctl) > + KVM_SET_MEMORY_ATTRIBUTES2 (guest_memfd ioctl) > > Then I guess using a module param which is a weaker userspace contract > allows us to later enable both vm and guest_memfd ioctl if the need > arises? In theory. More importantly from my perspective, making the knob global instead of per-VM simplifies the implementation. > >> + KVM_SET_MEMORY_ATTRIBUTES > >> + Is VM ioctl > >> + Is a write-only ioctl > >> + Is for setting memory attributes at a VM level > >> + Use struct kvm_memory_attributes for this > >> + KVM_GUEST_MEMFD_SET_MEMORY_ATTRIBUTES (name TBD) > >> + Is guest_memfd ioctl > >> + Is a read/write ioctl > >> + Is for setting memory attributes only for this guest_memfd > >> + Use struct guest_memfd_memory_attributes for this > >> + Also decode errors from this struct > > > > + Has extra padding for future expansion (because why not) > > > > If we really truly need a new ioctl, I'd probably prefer KVM_SET_MEMORY_ATTRIBUTES2. > > Yeah, it's silly, but I don't think baking GUEST_MEMFD into the names buys us > > anything. Then we can use KVM_SET_MEMORY_ATTRIBUTES2 on a VM if the need ever > > arises. > > > > I'm for having a new ioctl number and new struct, which are you leaning > towards? > > As for the naming, I think it's confusing to have something similar, and > Ira mentioned it being confusing in the other email too. It might sound confusing, but I don't think the _code_ will be confusing. If it's a module param, it's guaranteed to be global for any VMM. Then the conversion code can be a simple redirect, e.g. I would expect/hope the QEMU change to be something like: diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index f89568bfa3..5253cf7275 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -1421,6 +1421,10 @@ static int kvm_set_memory_attributes(hwaddr start, uint64_t size, uint64_t attr) int r; assert((attr & kvm_supported_memory_attributes) == attr); + + if (kvm_use_gmem_attributes) + return kvm_get_gmem_fd(start, size, attr); + attrs.attributes = attr; attrs.address = start; attrs.size = size; > At the same time, I accept that it's useful if the same struct were to be > used for a new iteration of the KVM_SET_MEMORY_ATTRIBUTES VM ioctl in future. > No strong preference either way on naming. > > Trying to understand the difference between unwind on failure vs > all-or-nothing: > > > Alternative #1 is to try and unwind on failure, but that gets complex, and it > > simply can't be done for some CoCo VMs. E.g. a private=>shared conversion for > > TDX is descrutive. > > > > Unwind on failure is: > > 1. Store current state > 2. Convert > 3. Restore current state on conversion failure Not quite, the above is missing key steps that cause problems: invalidation and memory allocation. 1. Zap stage-1 mappings 2. Check if range can be converted 3. Zap stage-2 mappings 4. Store state for target range For TDX, #3 is the point of no return. That means #4 must not fail, because at the very least invididual page conversions need to be atomic. > > Alternative #2 is to make the updates atomic and all-or-nothing, which is what > > we did for per-VM attributes. That's doable, but it'd either be much more > > complex than telling userspace to retry, or we'd have to lose the maple tree > > optimizations (which is effectively what we did for per-VM attributes). > > > > All-or-nothing: > > 1. Do everything to make sure conversion doesn't fail, bail early if it > fails > 2. Convert (always successful) > > Is that it? No, because #1 is non-trivial. The wrinkle is that, without doing an initial pass through mtree, it's impossible to know how many new entries will be needed (which you already know). It's a solvable problem, e.g. in this RFC, it's kinda sorta handled by allocating temporary structures to track the mtree metadata, but that's the type of complexity I want to avoid. > Zapping private pages from the stage 2 page tables for TDX can't be > recovered without help from the guest (I think that's what you're > talking about too), Ya. > although technically I think this zapping step could be delayed right till the end. Hmm. If we fully committed to blocking all "gets" via the filemap lock, then yes, I think it could be delayed until the end? Tempting. But as you note below, I'd prefer to give ourselves an out from an ABI perspective. > Maple tree allocations for conversion could fail, Ya, in my local version I handle that by pre-allocating. mas_for_each(&mas, entry, end - 1) { MA_STATE(m2, &gi->attributes, 0, 0); if (attrs->attributes == xa_to_value(entry)) continue; r = kvm_gmem_mas_preallocate(&m2, attrs->attributes, start, end); if (r) { *err_index = m2.index; goto out; } unmap_mapping_pages(mapping, start, nr_pages, false); if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages, err_index)) { mas_destroy(&m2); r = -EAGAIN; goto out; } kvm_gmem_invalidate_begin(inode, start, end); mas_store_prealloc(&m2, xa_mk_value(attrs->attributes)); kvm_gmem_invalidate_end(inode, start, end); } > and allocations are a bit more complicated since we try to compact ranges > with the same shared/private status into one same maple tree node. Still > technically possible, maybe by updating a copy of the maple tree first, then > swapping the current maple tree out atomically. Heh, more complexity I'd prefer to avoid. > With HugeTLB, undoing HVO needs pages to be allocated, will need more > digging into the details to determine if preallocation could work. It has to work, otherwise we're hosed. Though it should be noted that "preallocate" here just means before kvm_gmem_invalidate_begin(). Everything up to that point can fail. The other option in all of this is to add an API to _block_ stage-2 mappings, in TDX terminology. In TDX, BLOCK marks leaf S-EPT entries !PRESENT, but preserves all the metadata. So if we're too scared to do an after-the-fact invalidation, and we need to support failure after kvm_gmem_invalidate_begin(), we could rework the TDX backend to support e.g. kvm_gmem_invalidate_abort(). On begin(), KVM would BLOCK mappings; on abort(), restore; on end() fully remove. > I'd still prefer having the option to return an error so that we don't > paint ourselves into a corner. Yep, my thoughts exactly.