From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f202.google.com (mail-pl1-f202.google.com [209.85.214.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 26B226138 for ; Tue, 9 Apr 2024 01:21:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.202 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712625684; cv=none; b=Vz4VlE3ppfUIqv7co6XsziKrTTGaosZm/xG43GpwP/l12XEbi/KP2xzqqmGicviqNzL/+lT8N6/HZZps65Kx2LRFqp6WVEg16x26wRKKYZaMc9iHsr/D/gRHFtyGBQmfjyYn3TUWY2VtpUaTNnDM27njqeIFcpbIGE+zWuNh7J0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712625684; c=relaxed/simple; bh=HLb3lzTl8sakQLxebX64cAFyuxzRJONTupHQAaLnEhc=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=jxwHedvfVIpXFKuvVIir42U4sIkKcls5txP+5BJsLR6118nJFv/SuybQpEbvpUFb7rWRXCn8Pv6f+s5P6pjBgCIfDw8+3uCtF9z7kJUlmixryvXDl2fd8lhMxMCsvF0+gmnvI0Tr15xcG0wozUGf7b9XuluyhnF7KRBfzTYITNA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=1AgBQzVh; arc=none smtp.client-ip=209.85.214.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="1AgBQzVh" Received: by mail-pl1-f202.google.com with SMTP id d9443c01a7336-1e2ae0153e2so41688075ad.3 for ; Mon, 08 Apr 2024 18:21:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1712625682; x=1713230482; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=DOboyqOmohbafhdvLzQ0D4l7iWv9fFGfP8r0CMbFChY=; b=1AgBQzVhdwOsycJqWH10ruQ8m+NPcKnwIhrIs+Mot4gZWtocWpoLfB8HG+xJafjjVR PnVBbC1hdmOS9O/u/zRH80j6//K+88t1kA7KqqGFKok71QjwPqf2kHUMEJpLycmt8jjV Qvi80XPG5aNyFE8l9tsNfYIdryOS8CB3zQTANJz99uM5zJVd9+hCDb+a7ku0yR58sytf Dmcs0nDTrq6b/Grt69a3Xp4VmMT3w5L2eHza6YAu2kf4gAQDemJyIAaqalF++eZdShn7 jLQS5YKCju5mPR5Y4qdx1WVpUkRbTp1oG7nPdaH3DICKP/UONBTIVSs2Wv64/tjTKZPX zlDg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712625682; x=1713230482; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=DOboyqOmohbafhdvLzQ0D4l7iWv9fFGfP8r0CMbFChY=; b=egjAfHy9m+kpTUqgjroW9fngD2cUcsVYFfOkuBJ7uizsE0yJa/1+tNqMJocnING4cP dXldEXZc6jiOBnkX96sHyWBY4VBKcr63hbUWnUjUfI8GTWnY16NdvHDEEoN45SQnqyd/ o8V0fvByli7BEljtSFz5Zy6dnFc1pqSdhQ6BC9Qf6gEmEoUaAWDHgSipe8DNW6hGwR1P f+4gNsHvpWkl6SCIjJD18Rw5hXB6MmFd45cY7P8yvVNF2guGxkpmz0ymqHKhfOuOtEv1 t/tdzQIg9x8CNEa9lugegFWCJmg/r1q+lVEt1RDPh8Otl98cqzWD+YfMC5Z+WVVNxx4x 2mvQ== X-Forwarded-Encrypted: i=1; AJvYcCWzyIMY4BULZlQ0N/NlA3gPBawMqF6za1fdOBfmyefIZsx81xYCugu01KGPbbLf8pFmPZB2Y+1oqhUBt45dUSfp7Nchkrb07olJPxdq X-Gm-Message-State: AOJu0YxuyK8WOyA6CNHeMBd55sS7c4Bzc5NJ+6GVWsf3JTL8+sibuICf djDooDOQyjDthCzD0K+Nm6iy/RHVQt9e2Di6lU4Jwz4cftLAYvVtroPVFciHN2dak8yW4K3QqRd z+w== X-Google-Smtp-Source: AGHT+IHqFrHMa7zvZVYCVZteRx1klk5aQx2fqWK/tBoc1XrugycZfp9s9t1gg8Ky3mKSw7Fs1naZwSgjsK4= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a17:902:da83:b0:1e3:c934:61d1 with SMTP id j3-20020a170902da8300b001e3c93461d1mr59865plx.10.1712625682465; Mon, 08 Apr 2024 18:21:22 -0700 (PDT) Date: Mon, 8 Apr 2024 18:21:21 -0700 In-Reply-To: <43d1ade0461868016165e964e2bc97f280aee9d4.camel@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240404121327.3107131-1-pbonzini@redhat.com> <20240404121327.3107131-8-pbonzini@redhat.com> <43d1ade0461868016165e964e2bc97f280aee9d4.camel@intel.com> Message-ID: Subject: Re: [PATCH v5 07/17] KVM: x86: add fields to struct kvm_arch for CoCo features From: Sean Christopherson To: Rick P Edgecombe Cc: "kvm@vger.kernel.org" , "pbonzini@redhat.com" , "linux-kernel@vger.kernel.org" , "michael.roth@amd.com" , Isaku Yamahata Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable On Fri, Apr 05, 2024, Rick P Edgecombe wrote: > On Thu, 2024-04-04 at 08:13 -0400, Paolo Bonzini wrote: > > =C2=A0 > > =C2=A0struct kvm_arch { > > -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0unsigned long vm_type; > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0unsigned long n_used_mm= u_pages; > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0unsigned long n_request= ed_mmu_pages; > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0unsigned long n_max_mmu= _pages; > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0unsigned int indirect_s= hadow_pages; > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0u8 mmu_valid_gen; > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0u8 vm_type; > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0bool has_private_mem; > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0bool has_protected_state; >=20 > I'm a little late to this conversation, so hopefully not just complicatin= g > things. But why not deduce has_private_mem and has_protected_state from t= he > vm_type during runtime? Like if kvm.arch.vm_type was instead a bit mask w= ith > the bit position of the KVM_X86_*_VM set, kvm_arch_has_private_mem() coul= d > bitwise-and with a compile time mask of vm_types that have primate memory= . > This also prevents it from ever transitioning through non-nonsensical sta= tes > like vm_type =3D=3D KVM_X86_TDX_VM, but !has_private_memory, so would be = a little > more robust. LOL, time is a circle, or something like that. Paolo actually did this in = v2[*], and I objected, vociferously. KVM advertises VM types to userspace via a 32-bit field, one bit per type. = So without more uAPI changes, the VM type needs to be <=3D31. KVM could embed= the "has private memory" information into the type, but then we cut down on the= number of possible VM types *and* bleed has_private_memory into KVM's ABI. While it's unlikely KVM will ever support TDX without has_private_memory, i= t's entirely possible that KVM could add support for an existing VM "base" type= that doesn't currently support private memory. E.g. with some massaging, KVM co= uld support private memory for SEV and SEV-ES. And then we need to add an enti= rely new VM type just so that KVM can let it use private memory. Obviously KVM could shove in bits after the fact, e.g. store vm_type as a u= 64 instead of u32 (or u8 as in this patch), but then what's the point? Burnin= g a byte instead of a bit for per-VM flag is a complete non-issue, and booleans= tend to yield code that's easier to read and easier to maintain. [*] https://lore.kernel.org/all/ZdjL783FazB6V6Cy@google.com > Partly why I ask is there is logic in the x86 MMU TDX changes that tries = to > be generic but still needs special handling for it. The current solution = is > to look at kvm_gfn_shared_mask() as TDX is the only vm type that sets it,= but > Isaku and I were discussing if we should check something else, that didn'= t > appear to be tying together to unrelated concepts: > https://lore.kernel.org/kvm/20240319235654.GC1994522@ls.amr.corp.intel.co= m/ >=20 > Since it's down the mail, the relevant snippet: > " > > > void kvm_arch_flush_shadow_memslot(struct kvm *kvm, > > > struct kvm_memory_slot *slot) > > > { > > > - kvm_mmu_zap_all_fast(kvm); > > > + if (kvm_gfn_shared_mask(kvm)) Whatever you do that is TDX specific and an internal KVM thing is likely th= e wrong thing :-) The main reason KVM doesn't do a targeted zap on memslot removal is because= of ABI baggage that we _think_ is limited to interaction with VFIO. Since KVM doe= sn't have any ABI for TDX *or* SNP, I want to at least entertain the option of d= oing a target zap for SNP as well as TDX, even though it's only truly "necessary= " for TDX, in quotes because it's not strictly necessary, e.g. KVM could BLOCK th= e S-EPT entries without fully removing the mappings. Whether or not targeted zapping is optimal for SNP (or any VM type) is very= much TBD, and likely highly dependent on use case, but at the same time it would= be nice to not rule it out completely. E.g. ChromeOS currently has a use case where they frequently delete and rec= reate a 2GiB (give or take) memslot. For that use case, zapping _just_ that mems= lot is likely far superious than blasting and rebuilding the entire VM. But if us= erspace deletes a 1TiB for some reason, e.g. for memory unplug?, then the fast zap = is probably better, even though it requires rebuilding all SPTEs. > > There seems to be an attempt to abstract away the existence of Secure- > > EPT in mmu.c, that is not fully successful. In this case the code > > checks kvm_gfn_shared_mask() to see if it needs to handle the zapping > > in a way specific needed by S-EPT. It ends up being a little confusing > > because the actual check is about whether there is a shared bit. It > > only works because only S-EPT is the only thing that has a > > kvm_gfn_shared_mask(). > >=20 > > Doing something like (kvm->arch.vm_type =3D=3D KVM_X86_TDX_VM) looks wr= ong, > > but is more honest about what we are getting up to here. I'm not sure > > though, what do you think? >=20 > Right, I attempted and failed in zapping case. This is due to the restri= ction > that the Secure-EPT pages must be removed from the leaves. the VMX case = (also > NPT, even SNP) heavily depends on zapping root entry as optimization. As above, it's more nuanced than that. KVM has come to depend on the fast = zap, but it got that way *because* KVM has historical zapped everything, and use= rspace has (unknowingly) relied on that behavior. > I can think of > - add TDX check. Looks wrong > - Use kvm_gfn_shared_mask(kvm). confusing Ya, even if we end up making it a hardcoded TDX thing, dress it up a bit. = E.g. even if KVM checks for a shared mask under the hood, add a helper to captur= e the logic, e.g. kvm_zap_all_sptes_on_memslot_deletion(kvm). > - Give other name for this check like zap_from_leafs (or better name?) > The implementation is same to kvm_gfn_shared_mask() with comment. > - Or we can add a boolean variable to struct kvm If we _don't_ hardcode the behavior, a per-memslot flag or a per-VM capabil= ity (and thus boolean) is likely the way to go. My off-the-cuff vote is probab= ly for a per-memslot flag.