From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A0AC818DF72 for ; Tue, 10 Sep 2024 15:57:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725983826; cv=none; b=W8ftwJLRQ+CPadPYLvJdwPoKdR/7MBvBqmYxJHnDq85HnFYfvk9vgKniSLLS/dhJ/Cbe5qakZs7/ZFpF5HoKXSkKj4z/IkA8OP/rsG9v7Papug/JaQMVWhHw8PA5QePTQ9GzbpC9ueRdC+PUdvpnIXCWNzwI/fusbrNmERVkSSE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725983826; c=relaxed/simple; bh=cW14aL2kByVBLDiSFE3bHdvErdW/UecAKHQcX/CiYRU=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=c8Vogmz2Xmdef73Iy8z7OqlTprjP+ZqvFHgo+kQiDp7J+Txh1ne37I+q3ynDPgc7a+2d6vibpskHhDHEwUKerNq9jWlCUj50Oxo2ZZshRnVJeRGJy6t7VxdXf2Bk00sh4PHqi3Ayi99A6Yk57KQCIT4BUgbh3G1Tg5Yq7kgjRoI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=rF4sBVsr; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="rF4sBVsr" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e1159159528so15406988276.1 for ; Tue, 10 Sep 2024 08:57:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725983823; x=1726588623; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=7tzvNQMp3vCXB8Yr+79tCmZju/CJm9McEg1D0ZzYNM4=; b=rF4sBVsrycR9OEWlOQhYoz81rkeWhADCG4GbD6Q8kdd5KNwU0CiK5oUBpKfuDnh7kf vBm/LFAWi+/Xnc761bApaDAv/izLA03kVvnvMT4ffviTzr8dILs3FCrkqHLoQbAb/6dD 2v12kZRjTaGDGceDSuLOyLRdSdNco5VQ6v7e1QRCvvb7y8CROwfvStzo8jkxPhzIn1E7 ytLRFtK8ornBzkYEivuS9wWhx2tyRjyUgVT8IOLxGig3BOlTrnGOx+PhMbEvUVY/FnsY uuUIyzSXlSSKQ+2tivQpKX75XJC6emxAKxU7hhMU6oZcKnvIIuc4ROSRCHHivaLFmio3 NbPQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725983823; x=1726588623; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=7tzvNQMp3vCXB8Yr+79tCmZju/CJm9McEg1D0ZzYNM4=; b=tlY6AHr9R1dIdblUzaIGctThkX+Q53p+D0ZPfbxkWvfRqyIszefAgh3TM3zies33EU 8KT/U0Ewdu22+DL0hcCi144r/g4aA074C0igsch5/cFRVagYMdbpCN2aP9zJ+Fnexg7W LQPTFIH82jfNomtxxe9dUTwsek7ILJe9tBqVGOLPlYmT2WYfJvsNJqaixNRLuazHwGZM D+AueIkcnfD3jhxWiRLXCQHddWptRsDV4A1nnrNrYNFP9jLS8hGdhN9e+zV6v9l1ZCZo yQ7h7o5tB43a062fXpDF4kzFZItkhMOpcAY5ml0+gojXmPwcfQU2FOpyeufRyH8df7M8 6ivQ== X-Forwarded-Encrypted: i=1; AJvYcCWtkfbCybg+zj0UHpR9/tzlkyt8EFdoNId56jtfjqdGdhFF9CSA3IGbBKtKQFcv55Vz9NtNS9Zz9gVcuXI=@vger.kernel.org X-Gm-Message-State: AOJu0YyiUt8rAtTC7LrqCQ4uwGtLKQoRikTCQG9Rzz5pUu+q9eEXQ+D4 7l8GJefs0ljN6S2SlnsyEi+twkxYM+dFHryX0WEpmNW/oFVkHlHIUnMB2DSFg49SCFfK6+NPr+4 4YA== X-Google-Smtp-Source: AGHT+IH4LocodMSDLJCCpxDd2CcB7y06IMVHPbMZyWXohbePH6HffUlwoCeXp11+A2QcZOcjbMisCFflk+8= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a05:6902:b16:b0:e0b:acc7:b1fd with SMTP id 3f1490d57ef6-e1d7a09b1a0mr17840276.4.1725983823594; Tue, 10 Sep 2024 08:57:03 -0700 (PDT) Date: Tue, 10 Sep 2024 08:57:02 -0700 In-Reply-To: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240904030751.117579-1-rick.p.edgecombe@intel.com> <20240904030751.117579-10-rick.p.edgecombe@intel.com> <6449047b-2783-46e1-b2a9-2043d192824c@redhat.com> <1bbe3a78-8746-4db9-a96c-9dc5f1190f16@redhat.com> Message-ID: Subject: Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT From: Sean Christopherson To: Paolo Bonzini Cc: Rick P Edgecombe , "kvm@vger.kernel.org" , Yan Y Zhao , Yuan Yao , "nik.borisov@suse.com" , "dmatlack@google.com" , Kai Huang , "isaku.yamahata@gmail.com" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable On Tue, Sep 10, 2024, Paolo Bonzini wrote: > On Tue, Sep 10, 2024 at 3:58=E2=80=AFPM Sean Christopherson wrote: > > On Tue, Sep 10, 2024, Paolo Bonzini wrote: > > No, because that defeates the purpose of having mmu_lock be a rwlock. >=20 > But if this part of the TDX module is wrapped in a single big > try_lock, there's no difference in spinning around busy seamcalls, or > doing spin_lock(&kvm->arch.seamcall_lock). All of them hit contention > in the same way. With respect to FROZEN_SPTE... > > > > This way we know that "busy" errors must come from the guest and have= set > > > HOST_PRIORITY. > > > > We should be able to achieve that without a VM-wide spinlock. My thoug= ht (from > > v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock= , i.e. keep > > it set until the SEAMCALL completes. >=20 > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3, > which documents that the TDX module returns TDX_OPERAND_BUSY on a > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough > to prevent contention in the TDX module. Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per= -VM lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read = mode. So for the operations that KVM can do in parallel, the locking should effec= tively be per-entry. Because KVM will never throw away an entire S-EPT root, zapp= ing SPTEs will need to be done while holding mmu_lock for write, i.e. KVM shoul= dn't have problems with host tasks competing for the TDX module's VM-wide lock. > If we want to be a bit more optimistic, let's do something more > sophisticated, like only take the lock after the first busy reply. But > the spinlock is the easiest way to completely remove host-induced > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones. I am not convinced that's necessary or a good idea. I worry that doing so = would just kick the can down the road, and potentially make the problems harder t= o solve, e.g. because we'd have to worry about regressing existing setups. > > > It is still kinda bad that guests can force the VMM to loop, but the = VMM can > > > always say enough is enough. In other words, let's assume that a lim= it of > > > 16 is probably appropriate but we can also increase the limit and cra= sh the > > > VM if things become ridiculous. > > > > 2 :-) > > > > One try that guarantees no other host task is accessing the S-EPT entry= , and a > > second try after blasting IPI to kick vCPUs to ensure no guest-side tas= k has > > locked the S-EPT entry. >=20 > Fair enough. Though in principle it is possible to race and have the > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call. My limit of '2' is predicated on the lock being a "host priority" lock, i.e= . that kicking vCPUs would ensure the lock has been dropped and can't be re-acquir= ed by the guest. > So I would make it 5 or so just to be safe. >=20 > > My concern with an arbitrary retry loop is that we'll essentially propa= gate the > > TDX module issues to the broader kernel. Each of those SEAMCALLs is sl= ooow, so > > retrying even ~20 times could exceed the system's tolerances for schedu= ling, RCU, > > etc... >=20 > How slow are the failed ones? The number of retries is essentially the > cost of successful seamcall / cost of busy seamcall. I haven't measured, but would be surprised if it's less than 2000 cycles. > If HOST_PRIORITY works, even a not-small-but-not-huge number of > retries would be better than the IPIs. IPIs are not cheap either. Agreed, but we also need to account for the operations that are conflicting= . E.g. if KVM is trying to zap a S-EPT that the guest is accessing, then busy= waiting for the to-be-zapped S-EPT entry to be available doesn't make much sense. > > > For zero step detection, my reading is that it's TDH.VP.ENTER that fa= ils; > > > not any of the MEM seamcalls. For that one to be resolved, it should= be > > > enough to do take and release the mmu_lock back to back, which ensure= s that > > > all pending critical sections have completed (that is, > > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And th= en > > > loop. Adding a vCPU stat for that one is a good idea, too. > > > > As above and in my discussion with Rick, I would prefer to kick vCPUs t= o force > > forward progress, especially for the zero-step case. If KVM gets to th= e point > > where it has retried TDH.VP.ENTER on the same fault so many times that = zero-step > > kicks in, then it's time to kick and wait, not keep retrying blindly. >=20 > Wait, zero-step detection should _not_ affect TDH.MEM latency. Only > TDH.VP.ENTER is delayed. Blocked, not delayed. Yes, it's TDH.VP.ENTER that "fails", but to get past TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to guar= antee forward progress for TDH.MEM (or whatever the operations are called). Though I wonder, are there any operations guest/host operations that can co= nflict if the vCPU is faulting? Maybe this particular scenario is a complete non-= issue. > If it is delayed to the point of failing, we can do write_lock/write_unlo= ck() > in the vCPU entry path. I was thinking that KVM could set a flag (another synthetic error code bit?= ) to tell the page fault handler that it needs to kick vCPUs. But as above, it = might be unnecessary. > My issue is that, even if we could make it a bit better by looking at > the TDX module source code, we don't have enough information to make a > good choice. For now we should start with something _easy_, even if > it may not be the greatest. I am not opposed to an easy/simple solution, but I am very much opposed to implementing a retry loop without understanding _exactly_ when and why it's needed.