From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-ot1-f73.google.com (mail-ot1-f73.google.com [209.85.210.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 747184C3C6 for ; Mon, 5 Feb 2024 18:33:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.73 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707158034; cv=none; b=jU+7HbUzurlVNRhtf/MckpYvI3JyO8gA5XLLrYtlUnzG3w5rT8h/a+NkDPWBCB2AVatlILOH1pT5/lEwBsXGoAH9Z0iYqvSWSuZAaS8CVp+MO81NAETwYr4hAEs1Q5kovq3KupNBXwLuvcHUAJKtOOLoDUkmfa2LzCnqu0pQg2s= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707158034; c=relaxed/simple; bh=hTxsNRIaGRjBwJpqKwaNDMab2a4fYA69wIIdAR8tCCI=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=d29BimMA2vVFcCIbfy9e2ztWj/ALI/fIgYYi4nBNlidjwQQVIRCoEeMp6Xj/TGGKv0ZbV7Ilrkc63pojPZHeylX4dvwmjJjsiIU2PAz34w1IXrN6BjEaNr7HtYQdF3zwfemuK9pOMdw6Wb04h2uzAiQg+h7ifZLo6c2K3S3dsOw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=m5B3hf6t; arc=none smtp.client-ip=209.85.210.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="m5B3hf6t" Received: by mail-ot1-f73.google.com with SMTP id 46e09a7af769-6e279e05c4fso3142376a34.1 for ; Mon, 05 Feb 2024 10:33:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1707158031; x=1707762831; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=y6ZPCrlgHUcHY17Dx/9azP9InbsH3BOK0AE11r+Q4/4=; b=m5B3hf6t/GZwEAi8fIE4k9S6tkMhOT8O/lvQjez93Jq1OoIhEuaBF6C4uqoQMK7wgQ 0mak+U0TySkFILZpk3NKBiotQwXdTXo2EqljjwmPjlDOYU3wcb5zac37jYNkIcSf89hr OdNqqGj7511WWMt//aNvBUHX9CcRIj7H18P1FHOSsOkti21mvLZ6Q2190NMM0JpKfXni BWW0C3ooJqZhAMPdkZZrEvagBVFUaP2b36vfYRGpO2PbeCFmS90eNyqqi5qViI4LPVZ/ 4y7rL9KAbHTVQ7zdlPgWWeApjW+FTEkpPkNkK75AcXIRVXlLygrie5saJD5ezFaagSE0 dq1A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1707158031; x=1707762831; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=y6ZPCrlgHUcHY17Dx/9azP9InbsH3BOK0AE11r+Q4/4=; b=anFQODIoSpWRpqGe1wKhPx5fdzCC4e+mwxmTdWSeuwYOQzTdHX4CUOMZ6gB47+DYx3 zEWyKyCdieMhURZTQKJANsVdnxDH/MG83avnUlOXOlZeAj91JVghKB6690yHWiwCfuVK JWZ6PNQ3gX4LIKP5M6qYHI2E18tmXb7kSnYJP+A1Azq4bRnnWaQyIGOk61mJQriomZ9d QrR8CQ1t22RJSIDLDoqRrYOuQ+PD85pvn2CgbCfgswks6aANiQkD7kIpNJckv4E+fKY7 ErCCIgyrpOi35jn0Of1mBEVBFK6L3lbHm88AXnMlV1kSxHBCvEbVZYAinolAgnT2WaW4 Vy7g== X-Gm-Message-State: AOJu0YyL5LUVAip/nF0Jv3l74ahsdsaHPJqqL6Y/Zlt4KKSkmF4y4G8F tFyL2oKSGO0lSHAOnKM27CjZ35BnytcOWsLJleQ/nJEFJLqnDAGDuNiuXn1HYWRiNI67L4zI/w8 DlA== X-Google-Smtp-Source: AGHT+IFtnmfBtSmrd4IucuU5NkyRfvur5Auy6ZOBw6FDjoMVMNgtnq9HbpzBqE1HYDOpuwt7aoDovxA7+hk= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a05:6902:1081:b0:dc6:b7c2:176e with SMTP id v1-20020a056902108100b00dc6b7c2176emr1704153ybu.4.1707157653888; Mon, 05 Feb 2024 10:27:33 -0800 (PST) Date: Mon, 5 Feb 2024 10:27:32 -0800 In-Reply-To: Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240203003518.387220-1-seanjc@google.com> Message-ID: Subject: Re: [PATCH v3] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing From: Sean Christopherson To: Yan Zhao Cc: Paolo Bonzini , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Kai Huang , Yuan Yao , Xu Yilun Content-Type: text/plain; charset="us-ascii" On Mon, Feb 05, 2024, Yan Zhao wrote: > On Fri, Feb 02, 2024 at 04:35:18PM -0800, Sean Christopherson wrote: > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index 3c193b096b45..8ce9898914f1 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -4415,6 +4415,25 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > > if (unlikely(!fault->slot)) > > return kvm_handle_noslot_fault(vcpu, fault, access); > > > > + /* > > + * Pre-check for a relevant mmu_notifier invalidation event prior to > > + * acquiring mmu_lock. If there is an in-progress invalidation and the > > + * kernel allows preemption, the invalidation task may drop mmu_lock > > + * and yield in response to mmu_lock being contended, which is *very* > > + * counter-productive as this vCPU can't actually make forward progress > > + * until the invalidation completes. This "unsafe" check can get false > > + * negatives, i.e. KVM needs to re-check after acquiring mmu_lock. > > + * > > + * Do the pre-check even for non-preemtible kernels, i.e. even if KVM > > + * will never yield mmu_lock in response to contention, as this vCPU is > > + * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held > > + * to detect retry guarantees the worst case latency for the vCPU. > > + */ > > + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) { > > + kvm_release_pfn_clean(fault->pfn); > > + return RET_PF_RETRY; > > + } > > + > Could we also add this pre-check before fault in the pfn? like > @@ -4404,6 +4404,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq; > smp_rmb(); > + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) > + return RET_PF_CONTINUE; > > ret = __kvm_faultin_pfn(vcpu, fault); > if (ret != RET_PF_CONTINUE) > > Though the mmu_seq would be always equal in the check, it can avoid repeated faulting > and release pfn for vain during a long zap cycle. Just to be super claer, by "repeated faulting", you mean repeated faulting in the primary MMU, correct? Yeah, I agree, that makes sense. The only question is whether to check before and after, or only before. When abusing KSM, I see ~99.5% of all invalidations being detected before (21.5M out of just over 21.6M). I think it makes sense to also check after getting the pfn? The check is super cheap, especially when there isn't an invalidation event as the checks should all be quite predictable and cache hot. > > +/* > > + * This lockless version of the range-based retry check *must* be paired with a > > + * call to the locked version after acquiring mmu_lock, i.e. this is safe to > > + * use only as a pre-check to avoid contending mmu_lock. This version *will* > > + * get false negatives and false positives. > > + */ > > +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm, > > + unsigned long mmu_seq, > > + gfn_t gfn) > > +{ > > + /* > > + * Use READ_ONCE() to ensure the in-progress flag and sequence counter > > + * are always read from memory, e.g. so that checking for retry in a > > + * loop won't result in an infinite retry loop. Don't force loads for > > + * start+end, as the key to avoiding infinite retry loops is observing > > + * the 1=>0 transition of in-progress, i.e. getting false negatives > > + * due to stale start+end values is acceptable. > > + */ > > + if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) && > > + gfn >= kvm->mmu_invalidate_range_start && > > + gfn < kvm->mmu_invalidate_range_end) > > + return true; > > + > Is a smp_rmb() before below check better, given this retry is defined in a header > for all platforms? No, because the unsafe check very deliberately doesn't provide any ordering guarantees. The READ_ONCE() is purely to ensure forward progress. And trying to provide ordering for mmu_invalidate_in_progress is rather nonsensical. The smp_rmb() in mmu_invalidate_retry() ensures the caller will see a different mmu_invalidate_seq or an elevated mmu_invalidate_in_progress. For this code, the order in which mmu_invalidate_seq and mmu_invalidate_in_progress are checked truly doesn't matter as the range-based checks are will get false negatives when performed outside of mmu_lock. And from a correctness perspective, there are zero constraints on when the checks are performed (beyond the existing constraints from the earlier smp_rmb() and acquiring mmu_lock). If an arch wants ordering guarantees, it can simply use mmu_invalidate_retry() without a gfn, which has the advantage of being safe outside of mmu_lock (the obvious disadvantage is that it's very imprecise).