Date: Mon, 9 Sep 2024 16:58:19 -0700
In-Reply-To: <72ef77d580d2f16f0b04cbb03235109f5bde48dd.camel@intel.com>
X-Mailing-List: kvm@vger.kernel.org
Subject: Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
From: Sean Christopherson
To: Rick P Edgecombe
Cc: linux-kernel@vger.kernel.org, Yuan Yao, Kai Huang, isaku.yamahata@gmail.com, Yan Y Zhao, dmatlack@google.com, kvm@vger.kernel.org, nik.borisov@suse.com, pbonzini@redhat.com

On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL,
> > > ever.  For its operations, I'm pretty sure the only sane approach is for
> > > KVM to ensure there will be no contention.  And if the TDX module's
> > > single-step protection spuriously kicks in, KVM exits to userspace.  If
> > > the TDX module can't/doesn't/won't communicate that it's mitigating
> > > single-step, e.g. so that KVM can forward the information to userspace,
> > > then that's a TDX module problem to solve.
> > >
> > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > > return TDX_OPERAND_BUSY.
> > >
> > > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > > "module" signals #GP if there's a conflict.  #GP is obviously far, far
> > > worse as it lacks the precision that would help software understand
> > > exactly what went wrong, but I think one of the better decisions we made
> > > with the SGX driver was to have a "zero tolerance" policy where the
> > > driver would _never_ retry due to a potential resource conflict, i.e.
> > > that any conflict in the module would be treated as a kernel bug.
>
> Thanks for the analysis. The direction seems reasonable to me for this lock in
> particular. We need to do some analysis on how much the existing mmu_lock can
> protect us.

I would operate under the assumption that it provides SEPT no meaningful
protection.  I think I would even go so far as to say that it is a
_requirement_ that mmu_lock does NOT provide the ordering required by SEPT,
because I do not want to take on any risk (due to SEPT constraints) that would
limit KVM's ability to do things while holding mmu_lock for read.

> Maybe sprinkle some asserts for documentation purposes.

Not sure I understand, assert on what?

> For the general case of TDX_OPERAND_BUSY, there might be one wrinkle. The
> guest-side operations can take the locks too. From the "Base Architecture
> Specification":
> "
> Host-Side (SEAMCALL) Operation
> ------------------------------
> The host VMM is expected to retry host-side operations that fail with a
> TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at
> most after a limited time (the longest guest-side TDX module flow) there will be
> no contention with a guest TD attempting to acquire access to the same resource.
>
> Lock operations process the HOST_PRIORITY bit as follows:
> - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock's
>   HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is
>   the host VMM's responsibility to re-attempt the SEAMCALL function until it
>   succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD
>   from acquiring the lock.
> - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the
>   lock's HOST_PRIORITY bit.

*sigh*

> Guest-Side (TDCALL) Operation
> -----------------------------
> A TDCALL (guest-side) function that attempts to acquire a lock fails if
> HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest.
> The guest is expected to retry the operation.
>
> Guest-side TDCALL flows that acquire a host priority lock have an upper bound on
> the host-side latency for that lock; once a lock is acquired, the flow either
> releases within a fixed upper time bound, or periodically monitors the
> HOST_PRIORITY flag to see if the host is attempting to acquire the lock.
> "
>
> So KVM can't fully prevent TDX_OPERAND_BUSY with KVM-side locks, because
> TDX_OPERAND_BUSY is also used to sort out contention with the guest. We need
> to double-check this, but I *think* this HOST_PRIORITY bit doesn't come into
> play for the functionality we need to exercise for base support.
>
> The thing that makes me nervous about a retry-based solution is the potential
> for some kind of deadlock-like pattern. Just to gather your opinion, if there
> were some SEAMCALL contention that KVM couldn't lock around, but that came with
> strong, well-described guarantees, would a retry loop still be a hard NAK?

I don't know.  It would depend on what operations can hit BUSY, and what the
alternatives are.  E.g. if we can narrow down the retry paths to a few select
cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of
deadlock, then maybe that's the least awful option.

What I don't think KVM should do is blindly retry N times, because then there
are effectively no rules whatsoever.  E.g. if KVM is tearing down a VM then KVM
should assert on immediate success.  And if KVM is handling a fault on behalf of
a vCPU, then KVM can and should resume the guest and let it retry.  Ugh, but
that would likely trigger the annoying "zero-step mitigation" crap.

What does this actually mean in practice?  What's the threshold, is the
VM-Enter error uniquely identifiable, and can KVM rely on HOST_PRIORITY to be
set if KVM runs afoul of the zero-step mitigation?

  After a pre-determined number of such EPT violations occur on the same
  instruction, the TDX module starts tracking the GPAs that caused Secure EPT
  faults and fails further host VMM attempts to enter the TD VCPU unless
  previously faulting private GPAs are properly mapped in the Secure EPT.

If HOST_PRIORITY is set, then one idea would be to resume the guest if there's
SEPT contention on a fault, and then _if_ the zero-step mitigation is
triggered, kick all vCPUs (via IPI) to ensure that the contended SEPT entry is
unlocked and can't be re-locked by the guest.  That would allow KVM to
guarantee forward progress without an arbitrary retry loop in the TDP MMU.

Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure
the one and only retry is guaranteed to succeed.
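To make the quoted HOST_PRIORITY rules concrete, here is a toy userspace model of a single TDX-module lock. It is a sketch under stated assumptions, not TDX module or KVM code: sim_lock, sim_host_acquire(), sim_guest_acquire(), sim_host_waiting() and sim_release() are hypothetical names, and the model covers only the acquire/release rules in the spec excerpt (no latency bounds, no real status codes).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical toy model of one TDX-module lock, written only from the
 * HOST_PRIORITY rules in the spec excerpt. Not the module's real layout.
 */
enum sim_status { SIM_SUCCESS, SIM_OPERAND_BUSY };

struct sim_lock {
	bool held;          /* the resource is currently owned */
	bool host_priority; /* set by a failed host-side (SEAMCALL) acquire */
};

/* Host side: a failed acquire sets HOST_PRIORITY, a successful one clears it. */
static enum sim_status sim_host_acquire(struct sim_lock *l)
{
	if (l->held) {
		l->host_priority = true;
		return SIM_OPERAND_BUSY;
	}
	l->held = true;
	l->host_priority = false;
	return SIM_SUCCESS;
}

/* Guest side: fails whenever HOST_PRIORITY is set, even if the lock is free. */
static enum sim_status sim_guest_acquire(struct sim_lock *l)
{
	if (l->held || l->host_priority)
		return SIM_OPERAND_BUSY;
	l->held = true;
	return SIM_SUCCESS;
}

/* What a long guest-side flow would poll to see if the host wants the lock. */
static bool sim_host_waiting(const struct sim_lock *l)
{
	return l->host_priority;
}

static void sim_release(struct sim_lock *l)
{
	assert(l->held); /* releasing an unowned lock is a caller bug */
	l->held = false;
}
```

In this model, once a host-side acquire has failed, guest-side acquires keep returning BUSY even after the guest releases the lock, until the host retries and succeeds. That is the "HOST_PRIORITY bit remains set" hazard from the spec: a host that gives up after one BUSY leaves the guest locked out.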