Subject: Re: [PATCH 1/2] KVM: x86/mmu: Split NX hugepage recovery flow into TDP and non-TDP flow
From: Sean Christopherson
To: David Matlack
Cc: Vipin Sharma, pbonzini@redhat.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Mon, 19 Aug 2024 11:31:22 -0700
References: <20240812171341.1763297-1-vipinsh@google.com>
 <20240812171341.1763297-2-vipinsh@google.com>
 <20240819172023.GA2210585.vipinsh@google.com>

On Mon, Aug 19, 2024, David Matlack wrote:
> On Mon, Aug 19, 2024 at 10:20 AM Vipin Sharma wrote:
> >
> > On 2024-08-16 16:29:11, Sean Christopherson wrote:
> > > On Mon, Aug 12, 2024, Vipin Sharma wrote:
> > > > +	list_for_each_entry(sp, &kvm->arch.possible_nx_huge_pages, possible_nx_huge_page_link) {
> > > > +		if (i++ >= max)
> > > > +			break;
> > > > +		if (is_tdp_mmu_page(sp) == tdp_mmu)
> > > > +			return sp;
> > > > +	}
> > >
> > > This is silly and wasteful.  E.g. in the (unlikely) case there's one TDP MMU
> > > page amongst hundreds/thousands of shadow MMU pages, this will walk the list
> > > until @max, and then move on to the shadow MMU.
> > >
> > > Why not just use separate lists?
> >
> > Before this patch, NX huge page recovery calculates "to_zap" and then it
> > zaps the first "to_zap" pages from the common list. This series is trying
> > to maintain that invariant.

I wouldn't try to maintain any specific behavior in the existing code, AFAIK
it's 100% arbitrary and wasn't written with any meaningful sophistication.  E.g.
FIFO is little more than blindly zapping pages and hoping for the best.

> > If we use two separate lists then we have to decide how many pages
> > should be zapped from the TDP MMU and shadow MMU lists. A few options I can
> > think of:
> >
> > 1. Zap "to_zap" pages from both the TDP MMU and shadow MMU lists separately.
> >    Effectively, this might double the work for the recovery thread.
> > 2. Try zapping "to_zap" pages from one list and if there are not enough
> >    pages to zap then zap from the other list. This can cause starvation.
> > 3. Do half of "to_zap" from one list and the other half from the other
> >    list. This can lead to situations where only half the work is being done
> >    by the recovery worker thread.
> >
> > Option (1) above seems more reasonable to me.
>
> I vote each should zap 1/nx_huge_pages_recovery_ratio of their
> respective list, i.e. calculate to_zap separately for each list.

Yeah, I don't have a better idea since this is effectively a quick and dirty
solution to reduce guest jitter.  We can at least add a counter so that the zap
is proportional to the number of pages on each list, e.g. this, and then do the
necessary math in the recovery paths.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 94e7b5a4fafe..3ff17ff4f78b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1484,6 +1484,8 @@ struct kvm_arch {
 	 * the code to do so.
 	 */
 	spinlock_t tdp_mmu_pages_lock;
+
+	u64 tdp_mmu_nx_page_splits;
 #endif /* CONFIG_X86_64 */

 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 928cf84778b0..b80fe5d4e741 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -870,6 +870,11 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (!list_empty(&sp->possible_nx_huge_page_link))
 		return;

+#ifdef CONFIG_X86_64
+	if (is_tdp_mmu_page(sp))
+		++kvm->arch.tdp_mmu_nx_page_splits;
+#endif
+
 	++kvm->stat.nx_lpage_splits;
 	list_add_tail(&sp->possible_nx_huge_page_link,
 		      &kvm->arch.possible_nx_huge_pages);
@@ -905,6 +910,10 @@ void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (list_empty(&sp->possible_nx_huge_page_link))
 		return;

+#ifdef CONFIG_X86_64
+	if (is_tdp_mmu_page(sp))
+		--kvm->arch.tdp_mmu_nx_page_splits;
+#endif
 	--kvm->stat.nx_lpage_splits;
 	list_del_init(&sp->possible_nx_huge_page_link);
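
For illustration only, here is a minimal user-space sketch of that "necessary
math": splitting a per-run zap budget proportionally between the two lists
using a per-list counter like the one added above. Every name in it is made up
for the example; it is not the KVM implementation, just the arithmetic the
recovery paths would need.

/*
 * Illustrative, user-space only: the proportional-split arithmetic the
 * recovery worker would need, given the total number of tracked NX-split
 * pages and how many of those belong to the TDP MMU.  All names here are
 * hypothetical.
 */
#include <stdio.h>

struct nx_zap_budget {
	unsigned long tdp_mmu;	/* pages to zap from the TDP MMU list */
	unsigned long shadow;	/* pages to zap from the shadow MMU list */
};

static struct nx_zap_budget compute_nx_zap_budget(unsigned long total_splits,
						  unsigned long tdp_mmu_splits,
						  unsigned int recovery_ratio)
{
	struct nx_zap_budget budget = { 0, 0 };
	unsigned long to_zap;

	if (!recovery_ratio || !total_splits)
		return budget;

	/* Overall budget: zap 1/ratio of all tracked pages per run. */
	to_zap = total_splits / recovery_ratio;

	/* Hand the budget out proportionally to each list's share. */
	budget.tdp_mmu = to_zap * tdp_mmu_splits / total_splits;
	budget.shadow = to_zap - budget.tdp_mmu;

	return budget;
}

int main(void)
{
	/* Example: 1000 tracked pages, 900 on the TDP MMU list, ratio of 4. */
	struct nx_zap_budget b = compute_nx_zap_budget(1000, 900, 4);

	printf("zap %lu TDP MMU and %lu shadow MMU pages\n", b.tdp_mmu, b.shadow);
	return 0;
}

Modulo rounding, handing out the overall 1/ratio budget proportionally works
out the same as taking 1/nx_huge_pages_recovery_ratio of each list separately
(the example splits 225 + 25 either way), so the two formulations discussed
above should behave very similarly.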