Date: Tue, 11 Jun 2024 12:42:35 -0700
From: Sean Christopherson
To: James Houghton
Cc: Yu Zhao, Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
    Catalin Marinas, David Matlack, David Rientjes, James Morse,
    Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
    Ryan Roberts, Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon,
    Zenghui Yu, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier
References: <20240611002145.2078921-1-jthoughton@google.com>
    <20240611002145.2078921-5-jthoughton@google.com>

On Tue, Jun 11, 2024, James Houghton wrote:
> On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton wrote:
> > >
> > > This new notifier is for multi-gen LRU specifically
> >
> > Let me call it out before others do: we can't be this self-serving.
> >
> > > as it wants to be
> > > able to get and clear age information from secondary MMUs only if it can
> > > be done "fast".
> > >
> > > By having this notifier specifically created for MGLRU, what "fast"
> > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > reclaim most of the time.
> > >
> > > Signed-off-by: James Houghton
> >
> > If we'd like this to pass other MM reviewers, especially the MMU
> > notifier maintainers, we'd need to design a generic API that can
> > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > and MGLRU.
> >
> > Also I personally prefer to extend the existing callbacks by adding
> > new parameters, and on top of that, I'd try to consolidate the
> > existing callbacks -- it'd be less of a hard sell if my changes result
> > in less code, not more.
> >
> > (v2 did all these, btw.)
>
> I think consolidating the callbacks is cleanest, like you had it in
> v2. I really wasn't sure about this change honestly, but it was my
> attempt to incorporate feedback like this[3] from v4. I'll consolidate
> the callbacks like you had in v2.

James, wait for others to chime in before committing yourself to a course of
action, otherwise you're going to get ping-ponged to hell and back.

> Instead of the bitmap like you had, I imagine we'll have some kind of
> flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> that sound ok?

Why do we need a bundle of flags?  If we extend .clear_young() and .test_young()
as Yu suggests, then we only need a single "bool fast_only".

As for adding a fast_only versus dedicated APIs, I don't have a strong preference.
Extending will require a small amount of additional churn, e.g. to pass in false,
but that doesn't seem problematic on its own.  On the plus side, there would be
less copy+paste in include/linux/mmu_notifier.h (though that could be solved with
macros :-) ).

E.g.

--
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7b77ad6cf833..07872ae00fa6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 
 int __mmu_notifier_clear_young(struct mm_struct *mm,
 			       unsigned long start,
-			       unsigned long end)
+			       unsigned long end,
+			       bool fast_only)
 {
 	struct mmu_notifier *subscription;
 	int young = 0, id;
@@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 	hlist_for_each_entry_rcu(subscription,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->clear_young)
-			young |= subscription->ops->clear_young(subscription,
-								mm, start, end);
+		if (!subscription->ops->clear_young ||
+		    fast_only && !subscription->ops->has_fast_aging)
+			continue;
+
+		young |= subscription->ops->clear_young(subscription,
+							mm, start, end);
 	}
 	srcu_read_unlock(&srcu, id);
 
@@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 }
 
 int __mmu_notifier_test_young(struct mm_struct *mm,
-			      unsigned long address)
+			      unsigned long address,
+			      bool fast_only)
 {
 	struct mmu_notifier *subscription;
 	int young = 0, id;
@@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	hlist_for_each_entry_rcu(subscription,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->test_young) {
-			young = subscription->ops->test_young(subscription, mm,
-							      address);
-			if (young)
-				break;
-		}
+		if (!subscription->ops->test_young)
+			continue;
+
+		if (fast_only && !subscription->ops->has_fast_aging)
+			continue;
+
+		young = subscription->ops->test_young(subscription, mm, address);
+		if (young)
+			break;
 	}
 	srcu_read_unlock(&srcu, id);
-- 

It might also require multiplexing the return value to differentiate between
"young" and "failed".  Ugh, but the code already does that, just in a bespoke
way.

Double ugh.  Peeking ahead at the "failure" code, NAK to adding
kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
kvm_arch_has_test_clear_young() in v1.  Please stop trying to do anything like
that, I will NAK each and every attempt to have core mm/ code call directly into
KVM.

Anyways, back to this code, before we spin another version, we need to agree on
exactly what behavior we want out of secondary MMUs.  Because to me, the behavior
proposed in this version doesn't make any sense.

Signalling failure because KVM _might_ have relevant aging information in SPTEs
that require taking kvm->mmu_lock is a terrible tradeoff.  And for the test_young
case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
KVM should return "young", not "failed".

If KVM is using the TDP MMU, i.e. has_fast_aging=true, then there will be rmaps
if and only if L1 ran a nested VM at some point.  But as proposed, KVM doesn't
actually check if there are any shadow TDP entries to process.  That could be
fixed by looking at kvm->arch.indirect_shadow_pages, but even then it's not clear
that bailing if kvm->arch.indirect_shadow_pages > 0 makes sense.

E.g. if L1 happens to be running an L2, but <10% of the VM's memory is exposed to
L2, then "failure" is pretty much guaranteed to be a false positive.  And even for
the pages that are exposed to L2, "failure" will occur if and only if the pages
are being accessed _only_ by L2.

There most definitely are use cases where the majority of a VM's memory is accessed
only by L2.  But if those use cases are performing poorly under MGLRU, then IMO
we should figure out a way to enhance KVM to do a fast harvest of nested TDP
Accessed information, not make MGLRU+KVM suck for VMs that run nested VMs.

Oh, and calling into mmu_notifiers to do the "slow" version if the fast version
fails is suboptimal.

So rather than failing the fast aging, I think what we want is to know if an
mmu_notifier found a young SPTE during a fast lookup.  E.g. something like this
in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
is an optional optimization to avoid taking mmu_lock for write in paths where a
(very rare) false negative is acceptable.

static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
{
	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
}

static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
			 bool fast_only)
{
	int young = 0;

	if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
		write_lock(&kvm->mmu_lock);
		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
		write_unlock(&kvm->mmu_lock);
	}

	if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
		young = 1 | MMU_NOTIFY_WAS_FAST;

	return (int)young;
}

and then in lru_gen_look_around():

	if (spin_is_contended(pvmw->ptl))
		return false;

	/* exclude special VMAs containing anon pages from COW */
	if (vma->vm_flags & VM_SPECIAL)
		return false;

	young = ptep_clear_young_notify(vma, addr, pte);
	if (!young)
		return false;

	if (!(young & MMU_NOTIFY_WAS_FAST))
		return true;

	young = 1;

with the lookaround done using ptep_clear_young_notify_fast().

The MMU_NOTIFY_WAS_FAST flag is gross, but AFAICT it would Just Work without
needing to update all users of ptep_clear_young_notify() and friends.
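
To make the return-value multiplexing concrete, here is a standalone userspace
sketch of the idea (not kernel code, and not a proposed patch).  Everything in
it is an assumption for illustration: the bit chosen for MMU_NOTIFY_WAS_FAST,
the YOUNG macro, the fake_kvm struct and the fake_*() helpers are all
hypothetical stand-ins, and the locking is omitted entirely.  It only models
the control flow described above: the slow walk runs only when !fast_only and
shadow SPTEs may exist, a fast hit is tagged, and the caller decodes the tag.

```c
#include <stdbool.h>

/*
 * Assumed encoding (hypothetical values, not from any kernel header):
 * bit 0 reports "young", bit 1 reports that the young result came from
 * a lockless ("fast") lookup.
 */
#define YOUNG			(1 << 0)
#define MMU_NOTIFY_WAS_FAST	(1 << 1)

/* Toy stand-in for the state __kvm_age_gfn() would consult. */
struct fake_kvm {
	bool tdp_mmu_enabled;
	unsigned long indirect_shadow_pages;
	bool shadow_young;	/* what the rmap walk would report */
	bool tdp_young;		/* what the TDP MMU walk would report */
};

static bool fake_has_shadow_mmu_sptes(const struct fake_kvm *kvm)
{
	return !kvm->tdp_mmu_enabled || kvm->indirect_shadow_pages;
}

/* Mirrors the control flow of the __kvm_age_gfn() sketch, minus locking. */
static int fake_age_gfn(const struct fake_kvm *kvm, bool fast_only)
{
	int young = 0;

	/* Slow path: in KVM this is where mmu_lock would be taken. */
	if (!fast_only && fake_has_shadow_mmu_sptes(kvm))
		young = kvm->shadow_young ? YOUNG : 0;

	/* Fast path: a hit here is tagged so the caller can tell. */
	if (kvm->tdp_mmu_enabled && kvm->tdp_young)
		young = YOUNG | MMU_NOTIFY_WAS_FAST;

	return young;
}

enum lookaround { NOT_YOUNG, YOUNG_NO_LOOKAROUND, YOUNG_DO_LOOKAROUND };

/* Caller-side decode, following the lru_gen_look_around() snippet. */
static enum lookaround fake_decode(int young)
{
	if (!young)
		return NOT_YOUNG;
	if (!(young & MMU_NOTIFY_WAS_FAST))
		return YOUNG_NO_LOOKAROUND;
	return YOUNG_DO_LOOKAROUND;
}
```

The point of the encoding is that existing callers which only test the result
for non-zero keep working unchanged, while a caller that cares can mask out
the extra bit.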