Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* [RFC PATCH v3 00/10] mm/damon: introduce DAMOS failed region quota charge ratio
From: SeongJae Park @ 2026-04-07  1:05 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, Brendan Higgins,
	David Gow, David Hildenbrand, Jonathan Corbet, Lorenzo Stoakes,
	Michal Hocko, Mike Rapoport, Shuah Khan, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, damon, kunit-dev, linux-doc,
	linux-kernel, linux-kselftest, linux-mm

TL; DR: Let users set different DAMOS quota charge ratios for DAMOS
action failed regions, for deterministic and consistent DAMOS action
progress.

Common Reports: Unexpectedly Slow DAMOS
=======================================

One common issue report that we get from DAMON users is that DAMOS
action applying progress speed is sometimes much slower than expected.
And one common root cause is that the DAMOS quota is exceeded by the
action applying failed memory regions.

For example, a group of users tried to run DAMOS-based proactive memory
reclamation (DAMON_RECLAIM) with 100 MiB per second DAMOS quota.  They
ran it on a system having no active workload which means all memory of
the system is cold.  The expectation was that the system will show 100
MiB per second reclamation until (nearly) all memory is reclaimed. But
what they found is that the speed is quite inconsistent and sometimes it
becomes very slower than the expectation, sometimes even no reclamation
at all for about tens of seconds.  The upper limit of the speed (100 MiB
per second) was being kept as expected, though.

By monitoring the qt_exceeds (number of DAMOS quota exceed events) DAMOS
stat, we found DAMOS quota is always exceeded when the speed is slow. By
monitoring sz_tried and sz_applied (the total amount of DAMOS action
tried memory and succeeded memory) DAMOS stats together, we found the
reclamation attempts nearly always failed when the speed is slow.

DAMOS quota charges DAMOS action tried regions regardless of the
successfulness of the try.  Hence in the example reported case, there
was unreclaimable memory spread around the system memory.  Sometimes
nearly 100 MiB of memory that DAMOS tried to reclaim in the given quota
interval was reclaimable, and therefore showed nearly 100 MiB per second
speed.  Sometimes nearly 99 MiB of memory that DAMOS was trying to
reclaim in the given quota interval was unreclaimable, and therefore
showing only about 1 MiB per second reclaim speed.

We explained it is an expected behavior of the feature rather than a
bug, as DAMOS quota is there for only the upper-limit of the speed.  The
users agreed and later reported a huge win from the adoption of
DAMON_RECLAIM on their products.

It is Not a Bug but a Feature; But...
=====================================

So nothing is broken.  DAMOS quota is working as intended, as the upper
limit of the speed.  It also provides its behavior observability via
DAMOS stat.  In the real world production environment that runs long
term active workloads and matters stability, the speed sometimes being
slow is not a real problem.

But, the non-deterministic behavior is sometimes annoying, especially in
lab environments.  Even in a realistic production environment, when
there is a huge amount of DAMOS action unapplicable memory, the speed
could be problematically slow.  Let's suppose a virtual machines
provider that setup 99% of the host memory as hugetlb pages that cannot
be reclaimed, to give it to virtual machines.  Also, when aim-oriented
DAMOS auto-tuning is applied, this could also make the internal feedback
loop confused.

The intention of the current behavior was that trying DAMOS action to
regions would anyway impose some overhead, and therefore somehow be
charged.  But in the real world, the overhead for failed action is much
lighter than successful action.  Charging those at the same ratio may be
unfair, or at least suboptimum in some environments.

DAMOS Action Failed Region Quota Charge Ratio
=============================================

Let users set the charge ratio for the action-failed memory, for more
optimal and deterministic use of DAMOS.  It allows users to specify the
numerator and the denominator of the ratio for flexible setup.  For
example, let's suppose the numerator and the denominator are set to 1
and 4,096, respectively.  The ratio is 1 / 4,096.  A DAMOS scheme action
is applied to 5 GiB memory.  For 1 GiB of the memory, the action is
succeeded.  For the rest (4 GiB), the action is failed.  Then, only 1
GiB and 1 MiB quota is charged.

The optimal charge ratio will depend on the use case and
system/workload.  I'd recommend starting from setting the nominator as 1
and the denominator as PAGE_SIZE and tune based on the results, because
many DAMOS actions are applied at page level.

Tests
=====

I tested this feature in the steps below.

1. Allocate 50% of system memory and mlock() it using a test program.
2. Fill up the page cache to exhaust nearly all free memory.
3. Start DAMON-based proactive reclamation with 100 MiB/second DAMOS
   hard-quota.  Auto-tune the DAMOS soft-quota under the hard-quota for
   achieving 40% free memory of the system with 'temporal' tuner.

For step 1, I run a simple C program that is written by Gemini.  It is
quite straightforward, so I'm not sharing the code here.

For step 2, I use dd command like below:

   dd if=/dev/zero of=foo bs=1M count=$50_percent_of_system_memory

For step 3, I use the latest version of DAMON user-space tool (damo)
like below.

    sudo damo start --damos_action pageout \
            ` # Do the pageout only up to 100 MiB per second ` \
            --damos_quota_space 100M --damos_quota_interval 1s \
            ` # Auto-tune the quota below the hard quota aiming` \
            ` # 40% free memory of the node 0 ` \
            ` # (entire node of the test system)` \
            --damos_quota_goal node_mem_free_bp 40% 0 \
            ` # use temporal tuner, which is easy to understnd ` \
            --damos_quota_goal_tuner temporal

As expected, the progress of the reclamation is not consistent, because
the quota is exceeded for the failed reclamation of the unreclaimable
memory.

I do this again, but with the failed region charge ratio feature.  For
this, the above 'damo' command is used, after appending command line
option for setup of the charge ratio like below.  Note that the option
was added to 'damo' after v3.1.9.

    sudo ./damo start --damos_action pageout \
            [...]
            ` # quota-charge only 1/4096 for pageout-failed regions ` \
            --damos_quota_fail_charge_ratio 1 4096

The progress of the reclamation was nearly 100 MiB per second until the
goal was achieved, meeting the expectation.

Patches Sequence
================

Patch 1 updates fully charged quota check to handle <min_region_sz
remaining quota, which will be able to exist after this series is
applied.  Patch 2 implements the feature and exposes it via DAMON core
API.  Patch 3 implements DAMON sysfs ABI for the feature.  Three
following patches (4-6) document the feature and ABI on design, usage,
and ABI documents, respectively.  Four patches for testing of the new
feature follow.  Patch 7 implements a kunit test for the feature.
Patches 8 and 9 extend DAMON selftest helpers for DAMON sysfs control
and internal state dumping for adding a new selftest for the feature.
Patch 10 extends existing DAMON sysfs interface selftest to test the new
feature using the extended helper scripts.

Changelog
=========

Changes from RFC v2
(https://lore.kernel.org/20260405151232.102690-1-sj@kernel.org)
- Handle <min_region_sz remaining quota.
- Document zero denum behavior.
- Fix typos: s/selftets/selftests/
Changes from RFC v1
(https://lore.kernel.org/20260404163943.89278-1-sj@kernel.org)
- Avoid overflows in charge amount calculation.
- Fix/wordsmith documentation for grammar, typo, and wrong examples.
- Improve unit test for more consistent comparison source use.

SeongJae Park (10):
  mm/damon/core: handle <min_region_sz remaining quota as empty
  mm/damon/core: introduce failed region quota charge ratio
  mm/damon/sysfs-schemes: implement fail_charge_{num,denom} files
  Docs/mm/damon/design: document fail_charge_{num,denom}
  Docs/admin-guide/mm/damon/usage: document fail_charge_{num,denom}
    files
  Docs/ABI/damon: document fail_charge_{num,denom}
  mm/damon/tests/core-kunit: test fail_charge_{num,denom} committing
  selftests/damon/_damon_sysfs: support failed region quota charge ratio
  selftests/damon/drgn_dump_damon_status: support failed region quota
    charge ratio
  selftests/damon/sysfs.py: test failed region quota charge ratio

 .../ABI/testing/sysfs-kernel-mm-damon         | 12 +++++
 Documentation/admin-guide/mm/damon/usage.rst  | 18 +++++--
 Documentation/mm/damon/design.rst             | 22 ++++++++
 include/linux/damon.h                         |  9 ++++
 mm/damon/core.c                               | 38 ++++++++++---
 mm/damon/sysfs-schemes.c                      | 54 +++++++++++++++++++
 mm/damon/tests/core-kunit.h                   |  6 +++
 tools/testing/selftests/damon/_damon_sysfs.py | 21 +++++++-
 .../selftests/damon/drgn_dump_damon_status.py |  2 +
 tools/testing/selftests/damon/sysfs.py        |  6 +++
 10 files changed, 175 insertions(+), 13 deletions(-)

base-commit: b1ca86c92674eaf92a32ce3a2d89a0349e406df1
-- 
2.47.3

^ permalink raw reply

* Re: (sashiko review) [PATCH v6 1/1] mm/damon: add node_eligible_mem_bp and node_ineligible_mem_bp goal metrics
From: SeongJae Park @ 2026-04-07  0:13 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <CALa+Y14oWqu5+DbkENy7GgBjc=dCbFTaoOCr1i4=9CN-ZNRgEA@mail.gmail.com>

On Mon, 6 Apr 2026 12:47:56 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On Sun, Apr 5, 2026 at 3:45 PM SeongJae Park <sj@kernel.org> wrote:
> >
> >
> > Ravi, thank you for reposting this patch after the rebase.  This time sashiko
> > was able to review this, and found good points including things that deserve
> > another revision of this patch.
> >
> > Forwarding full sashiko review in a reply format with my inline comments below,
> > for sharing details of my view and doing followup discussions via mails.  Ravi,
> > could you please reply?
> >
> 
> Thanks SJ, providing your comments on top of sashiko's review is very helpful.

I'm glad to hear that it is working for you :)

[...]
> > > +static unsigned long damos_calc_eligible_bytes(struct damon_ctx *c,
> > > > +           struct damos *s, int nid, unsigned long *total)
> > > > +{
[...]
> > > > +                           struct folio *folio;
> > > > +                           unsigned long folio_sz, counted;
> > > > +
> > > > +                           folio = damon_get_folio(PHYS_PFN(addr));
> > >
> > > What happens if this metric is assigned to a DAMON context configured for
> > > virtual address space monitoring? If the context uses DAMON_OPS_VADDR,
> > > passing a user-space virtual address to PHYS_PFN() might cause invalid
> > > memory accesses or out-of-bounds page struct reads. Should this code
> > > explicitly verify the operations type first?
> >
> > Good finding.  We intend to support only paddr ops.  But there is no guard for
> > using this on vaddr ops configuration.  Ravi, could we add underlying ops
> > check?  I think damon_commit_ctx() is a good place to add that.  The check
> > could be something like below?
> >
> 
> I plan to add the ops type check directly in the metric functions
> (damos_get_node_eligible_mem_bp and its counterpart) rather than in
> damon_commit_ctx(). The functions will return 0 early
> if c->ops.id != DAMON_OPS_PADDR.
> 
> That said, if you prefer the damon_commit_ctx() validation approach to
> reject the configuration outright, I can implement it that way instead.
> Please let me know your preference.

I'd prefer damon_commit_ctx() validation approach since it would give users
more clear message of the failure.

> 
> > '''
> > --- a/mm/damon/core.c
> > +++ b/mm/damon/core.c
> > @@ -1515,10 +1515,23 @@ static int damon_commit_sample_control(
> >  int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src)
> >  {
> >         int err;
> > +       struct damos *scheme;
> > +       struct damos_quota_goal *goal;
> >
> >         dst->maybe_corrupted = true;
> >         if (!is_power_of_2(src->min_region_sz))
> >                 return -EINVAL;
> > +       if (src->ops.id != DAMON_OPS_PADDR) {
> > +               damon_for_each_scheme(scheme, src) {
> > +                       damos_for_each_quota_goal(goal, &scheme->quota) {
> > +                               switch (goal->metric) {
> > +                               case DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP:
> > +                               case DAMOS_QUOTA_NODE_INELIGIBLE_MEMPBP:
> > +                                       return -EINVAL;
> > +                               }
> > +                       }
> > +               }
> > +       }
> >
> >         err = damon_commit_schemes(dst, src);
> >         if (err)
> > '''
[...]
> > > > +   /* Compute ineligible ratio directly: 10000 - eligible_bp */
> > > > +   return 10000 - mult_frac(node_eligible, 10000, total_eligible);
> > > > +}
> > >
> > > Does this return value match the documented metric? The formula computes the
> > > percentage of the system's eligible memory located on other NUMA nodes,
> > > rather than the amount of actual ineligible (filtered out) memory residing
> > > on the target node. Could this semantic mismatch cause confusion when
> > > configuring quota policies?
> >
> > Nice catch.  The name and the documentation are confusing.  We actually
> > confused a few times in previous revisions, and I'm again confused now.  IIUC,
> > the current implementation is the intended and right one for the given use
> > case, though.  If my understanding is correct, how about renaming
> > DAMOS_QUOTA_NODE_INELIGIBLE_MEM_BP to
> > DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP_COMPLEMENT, and updating the documentation
> > together?  Ravi, what do you think?
> >
> 
> Agreed, the current name is confusing. How about
> DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP_OFFNODE?
> 
> The rationale is that this metric measures "eligible memory that is off
> this node" (i.e., on other nodes).
> 
>  I think "offnode" conveys the physical meaning more directly than "complement".
> That said, I'm happy to go with "complement" if you prefer.
> both are clearer than "ineligible".

Thank you for the nice suggestion.  I like "offnode" term.  But I think having
"node" twice on the name is not really efficient for people who print code on
papers.  What about DAMOS_QUOTA_OFFNODE_ELIGIBLE_MEM_BP?

But...  Maybe more importantly...  Now I realize this means that
offnode_eligible_mem_bp with target nid 0 is just same to node_eligible_mem_bp
with target nid 1, on your test setup.  Maybe we don't really need
offnode_eligible_mem_bp?  That is, your test setup could be like below.

'''
For maintaining hot memory on DRAM (node 0) and CXL (node 1) in a 7:3
ratio:

    PUSH scheme: migrate_hot from node 0 -> node 1
      goal: node_eligible_mem_bp, nid=1, target=3000
      "Move hot pages from DRAM to CXL if less thatn 30% of hot data is
       in CXL"

    PULL scheme: migrate_hot from node 1 -> node 0
      goal: node_eligible_mem_bp, nid=0, target=7000
      "Move hot pages from CXL to DRAM if less than 70% of hot data is
       in DRAM"
'''

And the schemes are more easy to read and understand for me.  This seems even
straightforward to scale for >2 nodes.  For example, if we want hot memory
distribution of 5:3:2 to nodes 0:1:2,

	Two schemes for migrating hot pages out of node 0
	- migrate_hot from node 0 -> node 1
	  - goal: node_eligible_mem_bp, nid=1, target=3000
	- migrate_hot from node 0 -> node 2
	  - goal: node_eligible_mem_bp, nid=2, target=2000

	Two schemes for migrating hot pages out of node 1
	- migrate_hot from node 1 -> node 0
	  - goal: node_eligible_mem_bp, nid=0, target=5000
	- migrate_hot from node 1 -> node 2
	  - goal: node_eligible_mem_bp, nid=2, target=2000

	Two schemes for migrating hot pages out of node 2
	- migrate_hot from node 2 -> node 0
	  - goal: node_eligible_mem_bp, nid=0, target=5000
	- migrate_hot from node 2 -> node 1
	  - goal: node_eligible_mem_bp, nid=1, target=3000

Do you think this makes sense?  If it makes sense and works for your use case,
what about dropping the offnode goal type?


Thanks,
SJ

[...]

^ permalink raw reply

* Re: [PATCH v7 9/9] KVM: selftests: nSVM: Add svm_nested_pat test
From: Sean Christopherson @ 2026-04-07  0:07 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	kvm, linux-doc, linux-kernel, linux-kselftest, Yosry Ahmed
In-Reply-To: <20260327234023.2659476-10-jmattson@google.com>

On Fri, Mar 27, 2026, Jim Mattson wrote:
> When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled, verify that KVM
> correctly virtualizes the host PAT MSR and the guest PAT register for
> nested SVM guests.
> 
> With nested NPT disabled:
>  * L1 and L2 share the same PAT
>  * The vmcb12.g_pat is ignored
> 
> With nested NPT enabled:
>  * An invalid g_pat in vmcb12 causes VMEXIT_INVALID
>  * RDMSR(IA32_PAT) from L2 returns the value of the guest PAT register
>  * WRMSR(IA32_PAT) from L2 is reflected in vmcb12's g_pat on VMEXIT
>  * RDMSR(IA32_PAT) from L1 returns the value of the host PAT MSR
>  * Save/restore with the vCPU in guest mode preserves both hPAT and gPAT
> 
> Signed-off-by: Jim Mattson <jmattson@google.com>
> ---
>  tools/arch/x86/include/uapi/asm/kvm.h         |   2 +

Don't update uAPI headers in tools/, they're not used by KVM selftests (perf
folks will sync them as needed).

> +#define PAT_DEFAULT		0x0007040600070406ULL
> +#define L1_PAT_VALUE		0x0007040600070404ULL  /* Change PA0 to WT */
> +#define L2_VMCB12_PAT		0x0606060606060606ULL  /* All WB */
> +#define L2_PAT_MODIFIED		0x0606060606060604ULL  /* Change PA0 to WT */
> +#define INVALID_PAT_VALUE	0x0808080808080808ULL  /* 8 is reserved */
> +
> +/*
> + * Shared state between L1 and L2 for verification.
> + */
> +struct pat_test_data {
> +	uint64_t l2_pat_read;
> +	uint64_t l2_pat_after_write;
> +	uint64_t l1_pat_after_vmexit;
> +	uint64_t vmcb12_gpat_after_exit;
> +	bool l2_done;
> +};
> +
> +static struct pat_test_data *pat_data;

This is ridiculous.  Whatever AI you're using is reinventing sync_global_to_guest()
in a very obfuscated way.  Drop the indirection along with the params and the
full page allocation, and just sync the damn struct.

Actually, this is even dumber than that.  The "data" is only ever accessed from
within the guest; it's used to pass info between L1 and L2.  Drop the struct
entirely and just write global variables.

In general, please clean this test up before submitting v8.  All of the L2 code
is basically copy+paste of itself.  This is the second vibe coded selftest (AFAIK)
you've posted, and it has many of the same flaws as the first one[*].   While I'm
not opposed to using fancy tools, and the bar is generally lower for selftests,
the code still needs to be readable and maintainable.  This ain't.

[*] https://lore.kernel.org/all/aXJal3srw2-3J5Dm@google.com

> +static void l2_guest_code(void)
> +{
> +	pat_data->l2_pat_read = rdmsr(MSR_IA32_CR_PAT);
> +	wrmsr(MSR_IA32_CR_PAT, L2_PAT_MODIFIED);
> +	pat_data->l2_pat_after_write = rdmsr(MSR_IA32_CR_PAT);
> +	pat_data->l2_done = true;
> +	vmmcall();
> +}

...

> +static void run_test(void *l1_code, const char *test_name, bool npt_enabled,
> +		     bool do_save_restore)
> +{
> +	struct pat_test_data *data_hva;
> +	vm_vaddr_t svm_gva, data_gva;
> +	struct kvm_x86_state *state;
> +	struct kvm_vcpu *vcpu;
> +	struct kvm_vm *vm;
> +	struct ucall uc;
> +
> +	pr_info("Testing: %s\n", test_name);
> +
> +	vm = vm_create_with_one_vcpu(&vcpu, l1_code);
> +	vm_enable_cap(vm, KVM_CAP_DISABLE_QUIRKS2,
> +		      KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);
> +	if (npt_enabled)
> +		vm_enable_npt(vm);
> +
> +	vcpu_alloc_svm(vm, &svm_gva);
> +
> +	data_gva = vm_vaddr_alloc_page(vm);
> +	data_hva = addr_gva2hva(vm, data_gva);
> +	memset(data_hva, 0, sizeof(*data_hva));

Ugh.

> +
> +	if (npt_enabled)
> +		tdp_identity_map_default_memslots(vm);
> +
> +	vcpu_args_set(vcpu, 2, svm_gva, data_gva);

^ permalink raw reply

* Re: [PATCH v7 8/9] KVM: x86: nSVM: Save/restore gPAT with KVM_{GET,SET}_NESTED_STATE
From: Sean Christopherson @ 2026-04-06 23:47 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	kvm, linux-doc, linux-kernel, linux-kselftest, Yosry Ahmed
In-Reply-To: <20260327234023.2659476-9-jmattson@google.com>

On Fri, Mar 27, 2026, Jim Mattson wrote:
> @@ -1918,6 +1921,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>  	struct vmcb_save_area_cached save_cached;
>  	struct vmcb_ctrl_area_cached ctl_cached;
>  	unsigned long cr0;
> +	bool use_separate_l2_pat;

Land this above "cr0" to preserve the inverted fir tree.

>  	int ret;
>  
>  	BUILD_BUG_ON(sizeof(struct vmcb_control_area) + sizeof(struct vmcb_save_area) >
> @@ -1993,6 +1997,18 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>  	    !nested_vmcb_check_save(vcpu, &save_cached, false))
>  		goto out_free;
>  
> +	/*
> +	 * Validate gPAT when the shared PAT quirk is disabled (i.e. L2
> +	 * has its own gPAT). This is done separately from the
> +	 * vmcb_save_area_cached validation above, because gPAT is L2
> +	 * state, but the vmcb_save_area_cached is populated with L1 state.
> +	 */
> +	use_separate_l2_pat =
> +		(ctl_cached.misc_ctl & SVM_MISC_ENABLE_NP) &&
> +		!kvm_check_has_quirk(vcpu->kvm,
> +				     KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);

I vote for either:

	use_separate_l2_pat = (ctl_cached.misc_ctl & SVM_MISC_ENABLE_NP) &&
			      !kvm_check_has_quirk(vcpu->kvm,
						   KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);

or

	use_separate_l2_pat = (ctl_cached.misc_ctl & SVM_MISC_ENABLE_NP);
	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT))
		use_separate_l2_pat = false;


> +	if (use_separate_l2_pat && !kvm_pat_valid(kvm_state->hdr.svm.gpat))
> +		goto out_free;
>  
>  	/*
>  	 * All checks done, we can enter guest mode. Userspace provides
> @@ -2017,6 +2033,10 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>  	nested_copy_vmcb_control_to_cache(svm, ctl);
>  
>  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> +
> +	if (use_separate_l2_pat)
> +		vmcb_set_gpat(svm->vmcb, kvm_state->hdr.svm.gpat);
> +
>  	nested_vmcb02_prepare_control(svm);
>  
>  	/*
> -- 
> 2.53.0.1018.g2bb0e51243-goog
> 

^ permalink raw reply

* Re: [PATCH v7 5/9] KVM: x86: nSVM: Redirect IA32_PAT accesses to either hPAT or gPAT
From: Sean Christopherson @ 2026-04-06 23:45 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	kvm, linux-doc, linux-kernel, linux-kselftest, Yosry Ahmed
In-Reply-To: <20260327234023.2659476-6-jmattson@google.com>

On Fri, Mar 27, 2026, Jim Mattson wrote:
> When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
> guest mode with nested NPT enabled, guest accesses to IA32_PAT are
> redirected to the gPAT register, which is stored in VMCB02's g_pat field.
> 
> Non-guest accesses (e.g. from userspace) to IA32_PAT are always redirected
> to hPAT, which is stored in vcpu->arch.pat.
> 
> Directing host-initiated accesses to hPAT ensures that KVM_GET/SET_MSRS and
> KVM_GET/SET_NESTED_STATE are independent of each other and can be ordered
> arbitrarily during save and restore. gPAT is saved and restored separately
> via KVM_GET/SET_NESTED_STATE.
> 
> Use WARN_ON_ONCE to flag any host-initiated accesses originating from KVM
> itself rather than userspace.
> 
> Use pr_warn_once to flag any use of the common MSR-handling code (now
> shared by VMX and TDX) for IA32_PAT by a vCPU that is SVM-capable.
> 
> Fixes: 15038e147247 ("KVM: SVM: obey guest PAT")
> Signed-off-by: Jim Mattson <jmattson@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/nested.c |  9 -------
>  arch/x86/kvm/svm/svm.c    | 53 ++++++++++++++++++++++++++++++++++-----
>  arch/x86/kvm/svm/svm.h    |  1 -
>  arch/x86/kvm/x86.c        |  6 +++++
>  4 files changed, 53 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 8170042d5fb3..ccc556eb4d2f 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -703,15 +703,6 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
>  	return 0;
>  }
>  
> -void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm)
> -{
> -	if (!svm->nested.vmcb02.ptr)
> -		return;
> -
> -	/* FIXME: merge g_pat from vmcb01 and vmcb12.  */
> -	vmcb_set_gpat(svm->nested.vmcb02.ptr, svm->vmcb01.ptr->save.g_pat);
> -}
> -
>  static bool nested_vmcb12_has_lbrv(struct kvm_vcpu *vcpu)
>  {
>  	return guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) &&
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index af808e83173e..34fd07d6ad4d 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2776,6 +2776,47 @@ static bool sev_es_prevent_msr_access(struct kvm_vcpu *vcpu,
>  	       !msr_write_intercepted(vcpu, msr_info->index);
>  }
>  
> +static bool svm_pat_accesses_gpat(struct kvm_vcpu *vcpu, bool from_host)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	/*
> +	 * When nested NPT is enabled, L2 has a separate PAT from L1.  Guest

Hmm, do we also want to add "and KVM's quirk is disabled"?

	/*
	 * When nested NPT is enabled and KVM's quirk is disabled, L2 has a
	 * separate PAT from L1.  Guest accesses to IA32_PAT while running L2
	 * target L2's gPAT; host-initiated accesses always target L1's hPAT so
	 * that KVM_GET/SET_MSRS and KVM_GET/SET_NESTED_STATE are independent
	 * of each other and can be ordered arbitrarily during save and restore.
	 */

> +	 * accesses to IA32_PAT while running L2 target L2's gPAT;
> +	 * host-initiated accesses always target L1's hPAT so that
> +	 * KVM_GET/SET_MSRS and KVM_GET/SET_NESTED_STATE are independent of
> +	 * each other and can be ordered arbitrarily during save and restore.
> +	 */
> +	WARN_ON_ONCE(from_host && vcpu->wants_to_run);
> +	return !from_host && is_guest_mode(vcpu) && l2_has_separate_pat(svm);
> +}
> +
> +static u64 svm_get_pat(struct kvm_vcpu *vcpu, bool from_host)
> +{
> +	if (svm_pat_accesses_gpat(vcpu, from_host))
> +		return to_svm(vcpu)->vmcb->save.g_pat;
> +	else
> +		return vcpu->arch.pat;
> +}
> +
> +static void svm_set_pat(struct kvm_vcpu *vcpu, bool from_host, u64 data)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	if (svm_pat_accesses_gpat(vcpu, from_host)) {
> +		vmcb_set_gpat(svm->vmcb, data);
> +		return;
> +	}
> +
> +	svm->vcpu.arch.pat = data;
> +
> +	if (npt_enabled) {
> +		vmcb_set_gpat(svm->vmcb01.ptr, data);
> +		if (is_guest_mode(&svm->vcpu) && !l2_has_separate_pat(svm))
> +			vmcb_set_gpat(svm->vmcb, data);
> +	}
> +}
> +
>  static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -2892,6 +2933,9 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_AMD64_DE_CFG:
>  		msr_info->data = svm->msr_decfg;
>  		break;
> +	case MSR_IA32_CR_PAT:
> +		msr_info->data = svm_get_pat(vcpu, msr_info->host_initiated);
> +		break;
>  	default:
>  		return kvm_get_msr_common(vcpu, msr_info);
>  	}
> @@ -2975,13 +3019,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
>  
>  		break;
>  	case MSR_IA32_CR_PAT:
> -		ret = kvm_set_msr_common(vcpu, msr);
> -		if (ret)
> -			break;
> +		if (!kvm_pat_valid(data))
> +			return 1;
>  
> -		vmcb_set_gpat(svm->vmcb01.ptr, data);
> -		if (is_guest_mode(vcpu))
> -			nested_vmcb02_compute_g_pat(svm);
> +		svm_set_pat(vcpu, msr->host_initiated, data);
>  		break;
>  	case MSR_IA32_SPEC_CTRL:
>  		if (!msr->host_initiated &&
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index b43e37b0448c..0b0279734486 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -868,7 +868,6 @@ void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm,
>  void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
>  				    struct vmcb_save_area *save);
>  void nested_sync_control_from_vmcb02(struct vcpu_svm *svm);
> -void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm);
>  void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb);
>  
>  extern struct kvm_x86_nested_ops svm_nested_ops;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0b5d48e75b65..cfb2517f692a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4025,6 +4025,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		}
>  		break;
>  	case MSR_IA32_CR_PAT:
> +		if (!(efer_reserved_bits & EFER_SVME))
> +			pr_warn_once("%s: MSR_IA32_CR_PAT should be handled by AMD vendor-specific code\n", __func__);

If we want to add a sanity check, simply do

		WARN_ON_ONCE(efer_reserved_bits & EFER_SVME)?

so that bugs fail loudly.

Hrm.  Though I'm leaning towards saying we should keep the "common" behavior in
x86.c.  If VMX and TDX didn't need to add get() APIs, I would say move the common
code to vmx/common.h, but that ends up being pretty ugly.

If we do that, the total diff for svm.c is quite nice.  Duplicating the
kvm_pat_valid() sucks, but it seems like the lesser of all evils :-/

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e7fdd7a9c280..6281599d5311 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2767,6 +2767,21 @@ static bool sev_es_prevent_msr_access(struct kvm_vcpu *vcpu,
               !msr_write_intercepted(vcpu, msr_info->index);
 }
 
+static bool svm_pat_accesses_gpat(struct kvm_vcpu *vcpu, bool from_host)
+{
+       struct vcpu_svm *svm = to_svm(vcpu);
+
+       /*
+        * When nested NPT is enabled, L2 has a separate PAT from L1.  Guest
+        * accesses to IA32_PAT while running L2 target L2's gPAT;
+        * host-initiated accesses always target L1's hPAT so that
+        * KVM_GET/SET_MSRS and KVM_GET/SET_NESTED_STATE are independent of
+        * each other and can be ordered arbitrarily during save and restore.
+        */
+       WARN_ON_ONCE(from_host && vcpu->wants_to_run);
+       return !from_host && is_guest_mode(vcpu) && l2_has_separate_pat(svm);
+}
+
 static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
        struct vcpu_svm *svm = to_svm(vcpu);
@@ -2883,6 +2898,12 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
        case MSR_AMD64_DE_CFG:
                msr_info->data = svm->msr_decfg;
                break;
+       case MSR_IA32_CR_PAT:
+               if (svm_pat_accesses_gpat(vcpu, msr_info->host_initiated)) {
+                       msr_info->data = to_svm(vcpu)->vmcb->save.g_pat;
+                       break;
+               }
+               return kvm_get_msr_common(vcpu, msr_info);
        default:
                return kvm_get_msr_common(vcpu, msr_info);
        }
@@ -2966,14 +2987,23 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 
                break;
        case MSR_IA32_CR_PAT:
+               if (svm_pat_accesses_gpat(vcpu, msr->host_initiated)) {
+                       if (kvm_pat_valid(data))
+                               return 1;
+
+                       vmcb_set_gpat(svm->vmcb, data);
+                       break;
+               }
+
                ret = kvm_set_msr_common(vcpu, msr);
                if (ret)
                        break;
 
-               svm->vmcb01.ptr->save.g_pat = data;
-               if (is_guest_mode(vcpu))
-                       nested_vmcb02_compute_g_pat(svm);
-               vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
+               if (npt_enabled) {
+                       vmcb_set_gpat(svm->vmcb01.ptr, data);
+                       if (is_guest_mode(&svm->vcpu) && !l2_has_separate_pat(svm))
+                               vmcb_set_gpat(svm->vmcb, data);
+               }
                break;
        case MSR_IA32_SPEC_CTRL:
                if (!msr->host_initiated &&

> +
>  		if (!kvm_pat_valid(data))
>  			return 1;
>  
> @@ -4436,6 +4439,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		break;
>  	}
>  	case MSR_IA32_CR_PAT:
> +		if (!(efer_reserved_bits & EFER_SVME))
> +			pr_warn_once("%s: MSR_IA32_CR_PAT should be handled by AMD vendor-specific code\n", __func__);

And obviously the same here..

> +
>  		msr_info->data = vcpu->arch.pat;
>  		break;
>  	case MSR_MTRRcap:
> -- 
> 2.53.0.1018.g2bb0e51243-goog
> 

^ permalink raw reply related

* Re: [PATCH v7 1/9] KVM: x86: Define KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT
From: Sean Christopherson @ 2026-04-06 23:27 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	kvm, linux-doc, linux-kernel, linux-kselftest, Yosry Ahmed
In-Reply-To: <20260327234023.2659476-2-jmattson@google.com>

On Fri, Mar 27, 2026, Jim Mattson wrote:
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index ff1e4b4dc998..74014110b550 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -616,6 +616,17 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
>  	return svm->nested.ctl.misc_ctl & SVM_MISC_ENABLE_NP;
>  }
>  
> +static inline bool l2_has_separate_pat(struct vcpu_svm *svm)

Take @vcpu instead of @svm.  All of the callers have a "vcpu", but not all have
a local "svm".  That will shorten the quirk check far enough to let it poke out.

> +{
> +	/*
> +	 * If KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled while a vCPU
> +	 * is running, the L2 IA32_PAT semantics for that vCPU are undefined.
> +	 */
> +	return nested_npt_enabled(svm) &&
> +		!kvm_check_has_quirk(svm->vcpu.kvm,
> +				     KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);

Align indentation.  With the @svm => @vcpu change, this becomes:

	return nested_npt_enabled(to_svm(vcpu)) &&
	       !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT);

> +}
> +
>  static inline bool nested_vnmi_enabled(struct vcpu_svm *svm)
>  {
>  	return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_VNMI) &&
> -- 
> 2.53.0.1018.g2bb0e51243-goog
> 

^ permalink raw reply

* [PATCH] riscv: Docs: fix unmatched quote warning
From: Randy Dunlap @ 2026-04-06 23:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Randy Dunlap, Deepak Gupta, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Ghiti, linux-riscv, Jonathan Corbet,
	Shuah Khan, linux-doc

'make htmldocs' complains about ``prctrl` -- so add a second '`' to
avoid the warning.

Documentation/arch/riscv/zicfilp.rst:79: WARNING: Inline literal start-string without end-string. [docutils]

Fixes: 08ee1559052b ("prctl: cfi: change the branch landing pad prctl()s to be more descriptive")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: linux-riscv@lists.infradead.org
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org

 Documentation/arch/riscv/zicfilp.rst |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next-20260406.orig/Documentation/arch/riscv/zicfilp.rst
+++ linux-next-20260406/Documentation/arch/riscv/zicfilp.rst
@@ -78,7 +78,7 @@ the program.
 
 Per-task indirect branch tracking state can be monitored and
 controlled via the :c:macro:`PR_GET_CFI` and :c:macro:`PR_SET_CFI`
-``prctl()` arguments (respectively), by supplying
+``prctl()`` arguments (respectively), by supplying
 :c:macro:`PR_CFI_BRANCH_LANDING_PADS` as the second argument.  These
 are architecture-agnostic, and will return -EINVAL if the underlying
 functionality is not supported.

^ permalink raw reply

* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Babu Moger @ 2026-04-06 22:45 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tony.luck, Dave.Martin, james.morse,
	tglx, mingo, bp, dave.hansen
  Cc: skhan, x86, hpa, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, kas,
	rick.p.edgecombe, akpm, pmladek, rdunlap, dapeng1.mi, kees, elver,
	paulmck, lirongqing, safinaskar, fvdl, seanjc, pawan.kumar.gupta,
	xin, tiala, Neeraj.Upadhyay, chang.seok.bae, thomas.lendacky,
	elena.reshetova, linux-doc, linux-kernel, linux-coco, kvm,
	eranian, peternewman
In-Reply-To: <83ae0c18-5c5e-4b52-901d-4126fe7c141b@intel.com>

Hi Reinette,

Sorry for the late response. I was trying to get confirmation about the 
use case.

On 3/31/26 17:24, Reinette Chatre wrote:
> Hi Babu,
> 
> On 3/30/26 11:46 AM, Babu Moger wrote:
>> On 3/27/26 17:11, Reinette Chatre wrote:
>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
> 
>>>>>>         Tony suggested using global variables to store the kernel mode
>>>>>>         CLOSID and RMID. However, the kernel mode CLOSID and RMID are
>>>>>>         coming from rdtgroup structure with the new interface. Accessing
>>>>>>         them requires holding the associated lock, which would make the
>>>>>>         context switch path unnecessarily expensive. So, dropped the idea.
>>>>>>         https://lore.kernel.org/lkml/aXuxVSbk1GR2ttzF@agluck-desk3/
>>>>>>         Let me know if there are other ways to optimize this.
>>>>> I do not see why the context switch path needs to be touched at all with this
>>>>> implementation. Since PLZA only supports global assignment does it not mean that resctrl
>>>>> only needs to update PQR_PLZA_ASSOC when user writes to info/kernel_mode and
>>>>> info/kernel_mode_assignment?
>>>> Each thread has an MSR to configure whether to associate privilege level zero execution with a separate COS and/or RMID, and the value of the COS and/or RMID.  PLZA may be enabled or disabled on a per-thread basis. However, the COS and RMID association and configuration must be the same for all threads in the QOS Domain.
>>> Based on previous comment in https://lore.kernel.org/lkml/abb049fa-3a3d-4601-9ae3-61eeb7fd8fcf@amd.com/
>>> and this implementation all fields of PQR_PLZA_ASSOC except PQR_PLZA_ASSOC.plza_en must be the
>>> same for all CPUs on the system, not just per QoS domain. Could you please confirm?
>>
>> Sorry for the confusion. It is "per QoS domain".
>>
>> All the fields of PQR_PLZA_ASSOC except PQR_PLZA_ASSOC.plza_enmust be set to the same value for all HW threads in the QOS domain for consistent operation (Per-QosDomain).
> 
> Thank you for clarifying. To build on this, what would be best way for resctrl to interpret this?
> As I see it all values in PQR_PLZA_ASSOC apply to *all* resources yet (theoretically?) every resource

Yes.  That is correct. PLZA applies to all the resources.

> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?

Yes.

> 
> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?

Yes. that is correct. It should not be attached to one resource. We need 
to change it to global scope.

> 
> ...
> 
>> Yes, I agree with your concerns. The goal here is to make the interface less disruptive while still addressing the different use cases.
> 
> I consider changing resctrl behavior when values are written to existing resctrl files
> to be disruptive. This is something we explicitly discussed during v1 as something to
> be avoided so this implementation that overloads the tasks file again is unexpected.

Yes. Agree. If required we need to introduce new files (kmode_cpus, 
kmode_cpu_list or kmode_tasks) to handle these cases.

> 
>>       Background: Customers have identified an issue with the QoS
>>       Bandwidth Control feature: when a CLOS is aggressively throttled
>>       and execution transitions into kernel mode, kernel operations are
>>       also subject to the same aggressive throttling.
>>
>>> Privilege-Level Zero Association (PLZA) allows a user to specify a
>> COS and/or RMID to be used during execution at Privilege Level Zero.
>> When PLZA is enabled on a hardware thread, any execution that enters
>> Privilege Level Zero will have its transactions associated with the
>> PLZA COS and/or RMID. Otherwise, the thread continues to use the COS
>> and RMID specified by |PQR_ASSOC|. In other words, the hardware
>> provides a dedicated COS and/or RMID specifically for kernel-mode
>> execution.
> ack.
> 
>>
>> There are multiple ways this feature can be applied. For simplicity, the discussion below focuses only on CLOSID.
>>
>>
>>       1. Global PLZA enablement
>>
>> PLZA can be configured as a global feature by setting |PQR_PLZA_ASSOC.closid = CLOSID| and |PQR_PLZA_ASSOC.plza_en = 1| on all threads in the system. A dedicated CLOSID is reserved for this purpose,
> 
> Also discussed during v1 is that there is no need to dedicate a CLOSID for this purpose.
> There could be an "unthrottled" CLOSID to which all high priority user space tasks as
> well as all kernel work of all tasks are assigned.
> If user space chooses to dedicate a CLOSID for kernel work then that should supported and
> interface can allow that, but there is no need for resctrl to enforce this.
> 
>> and all CPU threads use its allocations whenever they enter Privilege Level Zero. This CLOSID does not need to be associated with any resctrl group.

I misspoke here.

> 
> The CLOSID has to be associated with a resource group to be able to manage its
> resource allocations, no?

Yes. We need to have resource group schemata to enforce the limits.

> 
>> The user can explicitly enable or disable this feature.
> ack.
> 
>> There is no context switch overhead but there is no flexibility with this approach.
> 
> Flexibility is subjective. As I understand this supports the only use case we learned about so far:
> https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
> 
>>       2. Group based PLZA allocation :  PLZA is managed via dedicated
>>       restctrl group. A separate resctrl group can be created
>>       specifically for PLZA, with a dedicated CLOSID used exclusively
>>       for kernel mode execution. This approach can be further divided
>>       into two association models:
> 
> So far this sounds like global allocation since both need a dedicated resource group.
> Whether this group is dedicated to kernel work or shared between kernel and user space work
> is up to the user. There is no motivation why CLOSID should ever be enforced to be
> exclusive for kernel mode execution.

Yes. That is fine.
> 
>>
>> i) CPU based association
>> CPUs are assigned to the PLZA group, and PLZA is enabled only on
>> those CPUs. This effectively creates a dedicated PLZA group. MSRs (|
>> PQR_PLZA_ASSOC)| are programmed only when the user changes CPU
>> assignments. This approach requires no changes to the context switch
>> code and introduces no additional context switch overhead.
>>
>> ii) Task based association
>> Tasks are explicitly assigned by the user to the PLZA group. Tasks
>> need to be updated when user adds a new task. Also, this requires
>> updates during task scheduling so that the MSRs (|PQR_PLZA_ASSOC)|
>> are programmed on each context switch, which introduces additional
>> context switch overhead.
> 
> As discussed during v1 any changes needed to support per task assignment would
> need to be done with new files dedicated to this purpose. Do not overload the
> existing resctrl tasks/cpus/cpus_list files.

Yes. Sure.

>   
>> I tried to fit these requirements into  the interface files in /sys/
>> fs/resctrl/info/.  I may have missed few things while trying to
>> achieve it.  As usual, I am open for the discussion and
>> recommendations.
> 
> Many of these items were already discussed as part of v1 so I think we may be
> talking past each other here. I tried to highlight the relevant points raised
> during v1 discussion that I thought there already was agreement on.
> 
> The one new aspect is that I assumed this implementation will only be for
> global configuration and assignment. It looks like you want to support both
> global configuration and per-task assignment. In the original I did not consider
> configuration and assignment to occur at different scope so we may need to come up
> with new modes to distinguish. Consider the addition of two modes as below:
> 
> 	# cat info/kernel_mode
> 	[inherit_ctrl_and_mon]
> 	global_assign_ctrl_inherit_mon_set_all
> 	global_assign_ctrl_assign_mon_set_all
> 	global_assign_ctrl_inherit_mon_set_individual
> 	global_assign_ctrl_assign_mon_set_individual
> 
> Above introduces a "set_all" and "set_individual" suffix to the original two
> modes.
> 
> global_assign_ctrl_inherit_mon_set_all
> global_assign_ctrl_assign_mon_set_all:
> 
> 	Above are the original two modes but makes it clear that when this mode is
> 	activated _all_ tasks run with the assignment.
> 
> global_assign_ctrl_inherit_mon_set_individual
> global_assign_ctrl_assign_mon_set_individual:
> 
> 	Above are two new modes. In this mode user space also assigns a resource
> 	group globally but then needs to follow that up by activating every task
> 	separately to run with this assignment.
> 	One way in which this can be accomplished could be to have "kernel_mode_tasks",
> 	"kernel_mode_cpus", and "kernel_mode_cpus_list"	files become visible (or be
> 	created) in the resource group found in	info/kernel_mode_assignment. User
> 	space interacts with the new files to set which tasks and/or CPUs run with
> 	PLZA enabled.
> 	
> Even so, as I understand global_assign_ctrl_inherit_mon_set_all and
> global_assign_ctrl_assign_mon_set_all addresses the only known use case. Do you know
> if there are use cases for global_assign_ctrl_inherit_mon_set_individual and
> global_assign_ctrl_assign_mon_set_individual? The latter two adds significant
> complexity to resctrl while I have not heard about any use case for it.
> 

Yes. I agree. The changes in context switch code is a concern.

You covered some of the cases I was thinking(xx_set_individual).

How about this idea?

I suggest splitting the PLZA into two distinct aspects:

1. How PLZA is applied within a resource group

2. How PLZA is monitored


Introduce a new file, "info/kmode_type", to describe how kmode applies 
in the system.

# cat info/kmode_type
[global] <- Kernel mode applies to the entire system (all CPUs/tasks)
   cpus   <- Kernel mode applies only to the CPUs in the group
   tasks  <- Kernel mode applies only to the tasks in the group

The "global" option is the default right now and it is current common 
use-case.

The "info/kmode_type -> cpus" option introduces new files "kmode_cpus" 
and "kmode_cpus_list" for users to apply kmode to specific set of CPUs. 
This lets users change the CPU set for PLZA. The PLZA MSR is updated 
when user changes the association to the file. No context switch code 
changes are needed. This will be dedicated group. The current resctrl 
group files, "cpus, cpus_list and tasks" will not be accessible in this 
mode. This option give some flexibility for the user without the context 
switch overhead.

The "info/kmode_type -> tasks" option introduces a new file, 
"kmode_tasks", for users to apply kmode to specific set of tasks. This 
requires context switch changes. This will be dedicated group. The 
current resctrl group files, "cpus, cpus_list and tasks" will not be 
accessible in this mode. We currently have no use case for this, so it 
will not be supported now.


Add a file, "info/kmode_monitor", to describe how kmode is monitored.

# cat info/kmode_monitor
[inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. 
Default option for the "global"
assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID 
inherited from user.
assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all 
kernel work. Default option for "cpu" type.


Rename “kernel_mode_assignment” to “kmode_group” to assign the specific 
group to kmode. This file usage is same as before.

#cat info/kmode_groups (Renamed "kernel_mode_assignment")
//


Thoughts?

thanks
Babu


^ permalink raw reply

* Re: [PATCH v10 07/21] gpu: nova-core: mm: Add TLB flush support
From: Joel Fernandes @ 2026-04-06 22:10 UTC (permalink / raw)
  To: Matthew Brost
  Cc: linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Dave Airlie, Daniel Almeida, Koen Koning,
	dri-devel, rust-for-linux, Nikola Djukic, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Jonathan Corbet, Alex Deucher, Christian Koenig, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Matthew Auld, Lucas De Marchi, Thomas Hellstrom, Helge Deller,
	Alex Gaynor, Boqun Feng, John Hubbard, Alistair Popple,
	Timur Tabi, Edwin Peer, Alexandre Courbot, Andrea Righi,
	Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
	Elle Rhumsaa, alexeyi, Eliot Courtney, joel, linux-doc, amd-gfx,
	intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <0f5605c1-32e8-4a62-b852-b1db01e42817@nvidia.com>



On 4/6/2026 5:24 PM, Joel Fernandes wrote:
> 
> 
> On 4/2/2026 1:59 AM, Matthew Brost wrote:
>> On Tue, Mar 31, 2026 at 05:20:34PM -0400, Joel Fernandes wrote:
>>> Add TLB (Translation Lookaside Buffer) flush support for GPU MMU.
>>>
>>> After modifying page table entries, the GPU's TLB must be invalidated
>>> to ensure the new mappings take effect. The Tlb struct provides flush
>>> functionality through BAR0 registers.
>>>
>>> The flush operation writes the page directory base address and triggers
>>> an invalidation, polling for completion with a 2 second timeout matching
>>> the Nouveau driver.
>>>
>>> Cc: Nikola Djukic <ndjukic@nvidia.com>
>>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>>> ---
>>>  drivers/gpu/nova-core/mm.rs     |  1 +
>>>  drivers/gpu/nova-core/mm/tlb.rs | 95 +++++++++++++++++++++++++++++++++
>>>  drivers/gpu/nova-core/regs.rs   | 42 +++++++++++++++
>>>  3 files changed, 138 insertions(+)
>>>  create mode 100644 drivers/gpu/nova-core/mm/tlb.rs
>>>
>>> diff --git a/drivers/gpu/nova-core/mm.rs b/drivers/gpu/nova-core/mm.rs
>>> index 8f3089a5fa88..cfe9cbe11d57 100644
>>> --- a/drivers/gpu/nova-core/mm.rs
>>> +++ b/drivers/gpu/nova-core/mm.rs
>>> @@ -5,6 +5,7 @@
>>>  #![expect(dead_code)]
>>>  
>>>  pub(crate) mod pramin;
>>> +pub(crate) mod tlb;
>>>  
>>>  use kernel::sizes::SZ_4K;
>>>  
>>> diff --git a/drivers/gpu/nova-core/mm/tlb.rs b/drivers/gpu/nova-core/mm/tlb.rs
>>> new file mode 100644
>>> index 000000000000..cd3cbcf4c739
>>> --- /dev/null
>>> +++ b/drivers/gpu/nova-core/mm/tlb.rs
>>> @@ -0,0 +1,95 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +//! TLB (Translation Lookaside Buffer) flush support for GPU MMU.
>>> +//!
>>> +//! After modifying page table entries, the GPU's TLB must be flushed to
>>> +//! ensure the new mappings take effect. This module provides TLB flush
>>> +//! functionality for virtual memory managers.
>>> +//!
>>> +//! # Example
>>> +//!
>>> +//! ```ignore
>>> +//! use crate::mm::tlb::Tlb;
>>> +//!
>>> +//! fn page_table_update(tlb: &Tlb, pdb_addr: VramAddress) -> Result<()> {
>>> +//!     // ... modify page tables ...
>>> +//!
>>> +//!     // Flush TLB to make changes visible (polls for completion).
>>> +//!     tlb.flush(pdb_addr)?;
>>> +//!
>>> +//!     Ok(())
>>> +//! }
>>> +//! ```
>>> +
>>> +use kernel::{
>>> +    devres::Devres,
>>> +    io::poll::read_poll_timeout,
>>> +    io::Io,
>>> +    new_mutex,
>>> +    prelude::*,
>>> +    sync::{
>>> +        Arc,
>>> +        Mutex, //
>>> +    },
>>> +    time::Delta, //
>>> +};
>>> +
>>> +use crate::{
>>> +    driver::Bar0,
>>> +    mm::VramAddress,
>>> +    regs, //
>>> +};
>>> +
>>> +/// TLB manager for GPU translation buffer operations.
>>> +#[pin_data]
>>> +pub(crate) struct Tlb {
>>> +    bar: Arc<Devres<Bar0>>,
>>> +    /// TLB flush serialization lock: This lock is acquired during the
>>> +    /// DMA fence signalling critical path. It must NEVER be held across any
>>> +    /// reclaimable CPU memory allocations because the memory reclaim path can
>>> +    /// call `dma_fence_wait()`, which would deadlock with this lock held.
>>> +    #[pin]
>>> +    lock: Mutex<()>,
>>> +}
>>> +
>>> +impl Tlb {
>>> +    /// Create a new TLB manager.
>>> +    pub(super) fn new(bar: Arc<Devres<Bar0>>) -> impl PinInit<Self> {
>>> +        pin_init!(Self {
>>> +            bar,
>>> +            lock <- new_mutex!((), "tlb_flush"),
>>> +        })
>>> +    }
>>> +
>>> +    /// Flush the GPU TLB for a specific page directory base.
>>> +    ///
>>> +    /// This invalidates all TLB entries associated with the given PDB address.
>>> +    /// Must be called after modifying page table entries to ensure the GPU sees
>>> +    /// the updated mappings.
>>> +    pub(crate) fn flush(&self, pdb_addr: VramAddress) -> Result {
>>
>> This landed on my list randomly, so I took a look.
>>
>> Wouldn’t you want to virtualize the invalidation based on your device?
>> For example, what if you need to register interface changes on future hardware?
> 
> Good point, for future hardware it indeed makes sense. I will do that.
Actually, at least in the future as far as I can see, the register definitions
are the same for TLB invalidation are the same, so we are good and I will not be
making any change in this regard.

But, thanks for raising the point and forcing me to double check!

--
Joel Fernandes


^ permalink raw reply

* Re: [PATCH v10 12/21] gpu: nova-core: mm: Add unified page table entry wrapper enums
From: Joel Fernandes @ 2026-04-06 21:55 UTC (permalink / raw)
  To: Eliot Courtney, linux-kernel
  Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Dave Airlie, Daniel Almeida, Koen Koning, dri-devel,
	rust-for-linux, Nikola Djukic, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Jonathan Corbet,
	Alex Deucher, Christian Koenig, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Matthew Auld,
	Matthew Brost, Lucas De Marchi, Thomas Hellstrom, Helge Deller,
	Alex Gaynor, Boqun Feng, John Hubbard, Alistair Popple,
	Timur Tabi, Edwin Peer, Alexandre Courbot, Andrea Righi,
	Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
	Elle Rhumsaa, alexeyi, joel, linux-doc, amd-gfx, intel-gfx,
	intel-xe, linux-fbdev
In-Reply-To: <DHIFF98P1YQ3.1IXUT02E3TF20@nvidia.com>



On 4/2/2026 1:40 AM, Eliot Courtney wrote:
> On Wed Apr 1, 2026 at 6:20 AM JST, Joel Fernandes wrote:
>> Add unified Pte, Pde, and DualPde wrapper enums that abstract over
>> MMU v2 and v3 page table entry formats. These enums allow the page
>> table walker and VMM to work with both MMU versions.
>>
>> Each unified type:
>> - Takes MmuVersion parameter in constructors
>> - Wraps both ver2 and ver3 variants
>> - Delegates method calls to the appropriate variant
>>
>> This enables version-agnostic page table operations while keeping
>> version-specific implementation details encapsulated in the ver2
>> and ver3 modules.
>>
>> Cc: Nikola Djukic <ndjukic@nvidia.com>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>>  drivers/gpu/nova-core/mm/pagetable.rs | 330 ++++++++++++++++++++++++++
>>  1 file changed, 330 insertions(+)
>>
>> diff --git a/drivers/gpu/nova-core/mm/pagetable.rs b/drivers/gpu/nova-core/mm/pagetable.rs
>> index 6e01a1af5222..909df37c3ee8 100644
>> --- a/drivers/gpu/nova-core/mm/pagetable.rs
>> +++ b/drivers/gpu/nova-core/mm/pagetable.rs
>> @@ -12,6 +12,13 @@
>>  pub(crate) mod ver3;
>>  
>>  use crate::gpu::Architecture;
>> +use crate::mm::{
>> +    pramin,
>> +    Pfn,
>> +    VirtualAddress,
>> +    VramAddress, //
>> +};
>> +use kernel::prelude::*;
>>  
>>  /// Extracts the page table index at a given level from a virtual address.
>>  pub(crate) trait VaLevelIndex {
>> @@ -84,6 +91,96 @@ pub(crate) const fn as_index(&self) -> u64 {
>>      }
>>  }
>>  
>> +impl MmuVersion {
>> +    /// Get the `PDE` levels (excluding PTE level) for page table walking.
>> +    pub(crate) fn pde_levels(&self) -> &'static [PageTableLevel] {
>> +        match self {
>> +            Self::V2 => ver2::PDE_LEVELS,
>> +            Self::V3 => ver3::PDE_LEVELS,
>> +        }
>> +    }
>> +
>> +    /// Get the PTE level for this MMU version.
>> +    pub(crate) fn pte_level(&self) -> PageTableLevel {
>> +        match self {
>> +            Self::V2 => ver2::PTE_LEVEL,
>> +            Self::V3 => ver3::PTE_LEVEL,
>> +        }
>> +    }
>> +
>> +    /// Get the dual PDE level (128-bit entries) for this MMU version.
>> +    pub(crate) fn dual_pde_level(&self) -> PageTableLevel {
>> +        match self {
>> +            Self::V2 => ver2::DUAL_PDE_LEVEL,
>> +            Self::V3 => ver3::DUAL_PDE_LEVEL,
>> +        }
>> +    }
>> +
>> +    /// Get the number of PDE levels for this MMU version.
>> +    pub(crate) fn pde_level_count(&self) -> usize {
>> +        self.pde_levels().len()
>> +    }
>> +
>> +    /// Get the entry size in bytes for a given level.
>> +    pub(crate) fn entry_size(&self, level: PageTableLevel) -> usize {
>> +        if level == self.dual_pde_level() {
>> +            16 // 128-bit dual PDE
>> +        } else {
>> +            8 // 64-bit PDE/PTE
>> +        }
>> +    }
>> +
>> +    /// Get the number of entries per page table page for a given level.
>> +    pub(crate) fn entries_per_page(&self, level: PageTableLevel) -> usize {
>> +        match self {
>> +            Self::V2 => match level {
>> +                // TODO: Calculate these values from the bitfield dynamically
>> +                // instead of hardcoding them.
>> +                PageTableLevel::Pdb => 4, // PD3 root: bits [48:47] = 2 bits
>> +                PageTableLevel::L3 => 256, // PD0 dual: bits [28:21] = 8 bits
>> +                _ => 512,                 // PD2, PD1, PT: 9 bits each
>> +            },
>> +            Self::V3 => match level {
>> +                PageTableLevel::Pdb => 2,  // PDE4 root: bit [56] = 1 bit, 2 entries
>> +                PageTableLevel::L4 => 256, // PDE0 dual: bits [28:21] = 8 bits
>> +                _ => 512,                  // PDE3, PDE2, PDE1, PT: 9 bits each
>> +            },
>> +        }
>> +    }
>> +
>> +    /// Extract the page table index at `level` from `va` for this MMU version.
>> +    pub(crate) fn level_index(&self, va: VirtualAddress, level: u64) -> u64 {
>> +        match self {
>> +            Self::V2 => ver2::VirtualAddressV2::new(va).level_index(level),
>> +            Self::V3 => ver3::VirtualAddressV3::new(va).level_index(level),
>> +        }
>> +    }
>> +
>> +    /// Compute upper bound on page table pages needed for `num_virt_pages`.
>> +    ///
>> +    /// Walks from PTE level up through PDE levels, accumulating the tree.
>> +    pub(crate) fn pt_pages_upper_bound(&self, num_virt_pages: usize) -> usize {
>> +        let mut total = 0;
>> +
>> +        // PTE pages at the leaf level.
>> +        let pte_epp = self.entries_per_page(self.pte_level());
>> +        let mut pages_at_level = num_virt_pages.div_ceil(pte_epp);
>> +        total += pages_at_level;
>> +
>> +        // Walk PDE levels bottom-up (reverse of pde_levels()).
>> +        for &level in self.pde_levels().iter().rev() {
>> +            let epp = self.entries_per_page(level);
>> +
>> +            // How many pages at this level do we need to point to
>> +            // the previous pages_at_level?
>> +            pages_at_level = pages_at_level.div_ceil(epp);
>> +            total += pages_at_level;
>> +        }
>> +
>> +        total
>> +    }
>> +}
>> +
> 
> We have a lot of matches on the MMU version here (and below in Pte, Pde,
> DualPde). What about making MmuVersion into a trait (e.g. Mmu) with
> associated types for Pte, Pde, DualPde which can implement traits
> defining their common operations too?

I coded this up and it did not look pretty, there's not much LOC savings and the
code becomes harder to read because of parametrization of several functions. Also:

> Then you can parameterise Vmm/PtWalk on this type.

The match still to be done somewhere, so you end up matching on chipset to call
the correct parametrized functions versus just passing in the parameter or
chipset down, in some cases.

For now I am inclined to leave it as is. Also there's a Rust pitfall we all
learnt during the turing and other patch reviews, sometimes doing a bunch of
matches is good especially if the number of variants are expected to be fixed
(in the mm case, version 2 and version 3). Traits have some disadvantages too,
example dyn traits have to heap-allocated, parametrizing can increase code size
(due to monomorphization) etc.

thanks,

--
Joel Fernandes





^ permalink raw reply

* Re: [PATCH v10 07/21] gpu: nova-core: mm: Add TLB flush support
From: Joel Fernandes @ 2026-04-06 21:24 UTC (permalink / raw)
  To: Matthew Brost
  Cc: linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Dave Airlie, Daniel Almeida, Koen Koning,
	dri-devel, rust-for-linux, Nikola Djukic, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Jonathan Corbet, Alex Deucher, Christian Koenig, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Matthew Auld, Lucas De Marchi, Thomas Hellstrom, Helge Deller,
	Alex Gaynor, Boqun Feng, John Hubbard, Alistair Popple,
	Timur Tabi, Edwin Peer, Alexandre Courbot, Andrea Righi,
	Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
	Elle Rhumsaa, alexeyi, Eliot Courtney, joel, linux-doc, amd-gfx,
	intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <ac4FpcD29XnbbsdD@gsse-cloud1.jf.intel.com>



On 4/2/2026 1:59 AM, Matthew Brost wrote:
> On Tue, Mar 31, 2026 at 05:20:34PM -0400, Joel Fernandes wrote:
>> Add TLB (Translation Lookaside Buffer) flush support for GPU MMU.
>>
>> After modifying page table entries, the GPU's TLB must be invalidated
>> to ensure the new mappings take effect. The Tlb struct provides flush
>> functionality through BAR0 registers.
>>
>> The flush operation writes the page directory base address and triggers
>> an invalidation, polling for completion with a 2 second timeout matching
>> the Nouveau driver.
>>
>> Cc: Nikola Djukic <ndjukic@nvidia.com>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>>  drivers/gpu/nova-core/mm.rs     |  1 +
>>  drivers/gpu/nova-core/mm/tlb.rs | 95 +++++++++++++++++++++++++++++++++
>>  drivers/gpu/nova-core/regs.rs   | 42 +++++++++++++++
>>  3 files changed, 138 insertions(+)
>>  create mode 100644 drivers/gpu/nova-core/mm/tlb.rs
>>
>> diff --git a/drivers/gpu/nova-core/mm.rs b/drivers/gpu/nova-core/mm.rs
>> index 8f3089a5fa88..cfe9cbe11d57 100644
>> --- a/drivers/gpu/nova-core/mm.rs
>> +++ b/drivers/gpu/nova-core/mm.rs
>> @@ -5,6 +5,7 @@
>>  #![expect(dead_code)]
>>  
>>  pub(crate) mod pramin;
>> +pub(crate) mod tlb;
>>  
>>  use kernel::sizes::SZ_4K;
>>  
>> diff --git a/drivers/gpu/nova-core/mm/tlb.rs b/drivers/gpu/nova-core/mm/tlb.rs
>> new file mode 100644
>> index 000000000000..cd3cbcf4c739
>> --- /dev/null
>> +++ b/drivers/gpu/nova-core/mm/tlb.rs
>> @@ -0,0 +1,95 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! TLB (Translation Lookaside Buffer) flush support for GPU MMU.
>> +//!
>> +//! After modifying page table entries, the GPU's TLB must be flushed to
>> +//! ensure the new mappings take effect. This module provides TLB flush
>> +//! functionality for virtual memory managers.
>> +//!
>> +//! # Example
>> +//!
>> +//! ```ignore
>> +//! use crate::mm::tlb::Tlb;
>> +//!
>> +//! fn page_table_update(tlb: &Tlb, pdb_addr: VramAddress) -> Result<()> {
>> +//!     // ... modify page tables ...
>> +//!
>> +//!     // Flush TLB to make changes visible (polls for completion).
>> +//!     tlb.flush(pdb_addr)?;
>> +//!
>> +//!     Ok(())
>> +//! }
>> +//! ```
>> +
>> +use kernel::{
>> +    devres::Devres,
>> +    io::poll::read_poll_timeout,
>> +    io::Io,
>> +    new_mutex,
>> +    prelude::*,
>> +    sync::{
>> +        Arc,
>> +        Mutex, //
>> +    },
>> +    time::Delta, //
>> +};
>> +
>> +use crate::{
>> +    driver::Bar0,
>> +    mm::VramAddress,
>> +    regs, //
>> +};
>> +
>> +/// TLB manager for GPU translation buffer operations.
>> +#[pin_data]
>> +pub(crate) struct Tlb {
>> +    bar: Arc<Devres<Bar0>>,
>> +    /// TLB flush serialization lock: This lock is acquired during the
>> +    /// DMA fence signalling critical path. It must NEVER be held across any
>> +    /// reclaimable CPU memory allocations because the memory reclaim path can
>> +    /// call `dma_fence_wait()`, which would deadlock with this lock held.
>> +    #[pin]
>> +    lock: Mutex<()>,
>> +}
>> +
>> +impl Tlb {
>> +    /// Create a new TLB manager.
>> +    pub(super) fn new(bar: Arc<Devres<Bar0>>) -> impl PinInit<Self> {
>> +        pin_init!(Self {
>> +            bar,
>> +            lock <- new_mutex!((), "tlb_flush"),
>> +        })
>> +    }
>> +
>> +    /// Flush the GPU TLB for a specific page directory base.
>> +    ///
>> +    /// This invalidates all TLB entries associated with the given PDB address.
>> +    /// Must be called after modifying page table entries to ensure the GPU sees
>> +    /// the updated mappings.
>> +    pub(crate) fn flush(&self, pdb_addr: VramAddress) -> Result {
> 
> This landed on my list randomly, so I took a look.
> 
> Wouldn’t you want to virtualize the invalidation based on your device?
> For example, what if you need to register interface changes on future hardware?

Good point, for future hardware it indeed makes sense. I will do that.
> Or, if you’re a VF, can you even do MMIO?

For VFs, TLB flush is typically done directly via VF's BAR0 MMIO. If there is
VF-specific handling for any reason, that should probably be a different
interface than a per-chip interface (more specifically, checking if Nova is
booted on a VF).

thanks,

--
Joel Fernandes


^ permalink raw reply

* Re: [PATCH v10 10/21] gpu: nova-core: mm: Add MMU v2 page table types
From: Joel Fernandes @ 2026-04-06 21:14 UTC (permalink / raw)
  To: Eliot Courtney, linux-kernel
  Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Dave Airlie, Daniel Almeida, Koen Koning, dri-devel,
	rust-for-linux, Nikola Djukic, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Jonathan Corbet,
	Alex Deucher, Christian Koenig, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Matthew Auld,
	Matthew Brost, Lucas De Marchi, Thomas Hellstrom, Helge Deller,
	Alex Gaynor, Boqun Feng, John Hubbard, Alistair Popple,
	Timur Tabi, Edwin Peer, Alexandre Courbot, Andrea Righi,
	Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
	Elle Rhumsaa, alexeyi, joel, linux-doc, amd-gfx, intel-gfx,
	intel-xe, linux-fbdev
In-Reply-To: <DHIFGC4E879U.AXIIQKGRZQAF@nvidia.com>



On 4/2/2026 1:41 AM, Eliot Courtney wrote:
> On Wed Apr 1, 2026 at 6:20 AM JST, Joel Fernandes wrote:
>> Add page table entry and directory structures for MMU version 2
>> used by Turing/Ampere/Ada GPUs.
>>
>> Cc: Nikola Djukic <ndjukic@nvidia.com>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>>  drivers/gpu/nova-core/mm/pagetable.rs      |   2 +
>>  drivers/gpu/nova-core/mm/pagetable/ver2.rs | 232 +++++++++++++++++++++
>>  2 files changed, 234 insertions(+)
>>  create mode 100644 drivers/gpu/nova-core/mm/pagetable/ver2.rs
>>
>> diff --git a/drivers/gpu/nova-core/mm/pagetable.rs b/drivers/gpu/nova-core/mm/pagetable.rs
>> index 50b76d5e5aaf..38d88f8f09a9 100644
>> --- a/drivers/gpu/nova-core/mm/pagetable.rs
>> +++ b/drivers/gpu/nova-core/mm/pagetable.rs
>> @@ -8,6 +8,8 @@
>>  
>>  #![expect(dead_code)]
>>  
>> +pub(crate) mod ver2;
>> +
> 
> This looks like it has more visibility than necessary. And it seems
> incorrect for anyone in the crate to care about MMU version details.
> This can probably be just 'mod ver2'. There are a lot of other types /
> functions in this series that could have tighter visibility. Could you
> go through and see if you can reduce a bunch to private or pub(super)?
> 

Yes, indeed. I am tightening it now, there are several. Thanks!

--
Joel Fernandes


^ permalink raw reply

* Re: [PATCH net-next] docs: netdev: document AI-assisted review tooling
From: Fernando Fernandez Mancera @ 2026-04-06 21:14 UTC (permalink / raw)
  To: Nicolai Buchwitz
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, netdev, workflows,
	linux-doc, linux-kernel, mbloch
In-Reply-To: <56c5bdfe2e37738e47b3b4d22e21697c@tipi-net.de>

On 4/6/26 10:24 PM, Nicolai Buchwitz wrote:
> On 6.4.2026 21:58, Fernando Fernandez Mancera wrote:
>> [...]
> 
>>
>> Hi Nicolai,
>> maybe I am missing something but [2] isn't from sashiko.dev but from 
>> netdev AI CI instead. See: https://netdev-ai.bots.linux.dev/ai- 
>> review.html?id=0b114a22-9aab-4265-8bfc-ea1b5bca5514
> 
> You're right, I mixed up the two systems - the example I linked was
> from the netdev AI bot, not Sashiko. My mistake on the link.
> 
> I stumbled over Sashiko when I noticed the name appearing more often
> in other reviews and then found Jonathan's LWN article about it [1].
> 
> Both tools are actively reviewing patches on the list today. I think
> it makes sense to document both rather than just one:
> 
> The netdev AI bot at netdev-ai.bots.linux.dev
> Sashiko at sashiko.dev, which posts reviews publicly on its website
> Both use the same review prompts by Chris Mason [2], so there is
> common ground - though results will vary between them due to the
> different AI models (Claude Opus for netdev-ai, Gemini for Sashiko)
> on top of the usual AI uncertainty.
> 
> I think it would be useful to document that AI reviews are happening
> but mixing AI bots might confuse people.
> 
> Agreed, I'll rework the patch to distinguish the two systems once
> the discussion has been settled.
> 
>>
>> The documentation mentioned for running the AI locally is correctly 
>> related to netdev AI bot.
>>
>> I think it would be useful to document that AI reviews are happening 
>> but mixing AI bots might confuse people.
>>
>>> Check for findings on your submissions and address
>>> +valid ones before a maintainer has to relay the same questions.
>>> +
>>
>> I wonder what would be the consequences for this. If less experienced 
>> submitters are expected to address issues pointed out by AI bots they 
>> might work on something that isn't valid. AFAIU, the AI output is only 
>> forwarded to the submitter after a maintainer reviewed it and believes 
>> it makes sense.
> 
> Fair point. The wording should make clear that the local tooling is
> an optional aid, not an obligation. I'll soften the language around
> addressing findings.
> 

Thank you! Regarding this topic it seems people have been already 
discussing this around other subsystems [1]. It might be useful to check 
out similar discussions and outcomes.

[1] https://lwn.net/Articles/1064830/

> Would appreciate input on how much detail is appropriate here -
> should the doc just acknowledge that AI review exists and point to
> the tooling, or go into more detail about the workflow?
> 

To be honest that is hard for me to tell, I am not a maintainer and not 
the one doing the forwarding currently. I think there isn't an official 
workflow regarding this. Maybe a good starter would be to just mention 
that they exist. Or maybe this is a good opportunity to define an 
official workflow!

Thanks,
Fernando.

^ permalink raw reply

* Re: [PATCH v10 03/21] gpu: nova-core: gsp: Expose total physical VRAM end from FB region info
From: Joel Fernandes @ 2026-04-06 21:08 UTC (permalink / raw)
  To: Eliot Courtney, linux-kernel
  Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Dave Airlie, Daniel Almeida, Koen Koning, dri-devel,
	rust-for-linux, Nikola Djukic, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Jonathan Corbet,
	Alex Deucher, Christian Koenig, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Matthew Auld,
	Matthew Brost, Lucas De Marchi, Thomas Hellstrom, Helge Deller,
	Alex Gaynor, Boqun Feng, John Hubbard, Alistair Popple,
	Timur Tabi, Edwin Peer, Alexandre Courbot, Andrea Righi,
	Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
	Elle Rhumsaa, alexeyi, joel, linux-doc, amd-gfx, intel-gfx,
	intel-xe, linux-fbdev
In-Reply-To: <DHIFD6N7QSU1.1RGEN0APPDHD8@nvidia.com>



On 4/2/2026 1:37 AM, Eliot Courtney wrote:
> On Wed Apr 1, 2026 at 6:20 AM JST, Joel Fernandes wrote:
>> Add `total_fb_end()` to `GspStaticConfigInfo` that computes the
>> exclusive end address of the highest valid FB region covering both
>> usable and GSP-reserved areas.
>>
>> This allows callers to know the full physical VRAM extent, not just
>> the allocatable portion.
>>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>>  drivers/gpu/nova-core/gsp/commands.rs    | 6 ++++++
>>  drivers/gpu/nova-core/gsp/fw/commands.rs | 7 +++++++
>>  2 files changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/nova-core/gsp/commands.rs b/drivers/gpu/nova-core/gsp/commands.rs
>> index 41742c1633c8..5e0649024637 100644
>> --- a/drivers/gpu/nova-core/gsp/commands.rs
>> +++ b/drivers/gpu/nova-core/gsp/commands.rs
>> @@ -196,6 +196,9 @@ pub(crate) struct GetGspStaticInfoReply {
>>      /// Usable FB (VRAM) region for driver memory allocation.
>>      #[expect(dead_code)]
>>      pub(crate) usable_fb_region: Range<u64>,
>> +    /// End of VRAM.
>> +    #[expect(dead_code)]
>> +    pub(crate) total_fb_end: u64,
>>  }
>>  
>>  impl MessageFromGsp for GetGspStaticInfoReply {
>> @@ -209,9 +212,12 @@ fn read(
>>      ) -> Result<Self, Self::InitError> {
>>          let (base, size) = msg.first_usable_fb_region().ok_or(ENODEV)?;
>>  
>> +        let total_fb_end = msg.total_fb_end().ok_or(ENODEV)?;
>> +
>>          Ok(GetGspStaticInfoReply {
>>              gpu_name: msg.gpu_name_str(),
>>              usable_fb_region: base..base.saturating_add(size),
>> +            total_fb_end,
>>          })
>>      }
>>  }
>> diff --git a/drivers/gpu/nova-core/gsp/fw/commands.rs b/drivers/gpu/nova-core/gsp/fw/commands.rs
>> index 9fffa74d03f9..46932d5c8c1d 100644
>> --- a/drivers/gpu/nova-core/gsp/fw/commands.rs
>> +++ b/drivers/gpu/nova-core/gsp/fw/commands.rs
>> @@ -163,6 +163,13 @@ pub(crate) fn first_usable_fb_region(&self) -> Option<(u64, u64)> {
>>              }
>>          })
>>      }
>> +
>> +    /// Compute the end of physical VRAM from all FB regions.
>> +    pub(crate) fn total_fb_end(&self) -> Option<u64> {
>> +        self.fb_regions()
>> +            .map(|reg| reg.limit.saturating_add(1))
> 
> I think it would be better to used checked_add here.
> 

Done, thanks.

--
Joel Fernandes


^ permalink raw reply

* Re: [PATCH net-next] docs: netdev: document AI-assisted review tooling
From: Laurent Pinchart @ 2026-04-06 21:06 UTC (permalink / raw)
  To: Nicolai Buchwitz
  Cc: Fernando Fernandez Mancera, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Shuah Khan, netdev, workflows, linux-doc, linux-kernel, mbloch
In-Reply-To: <56c5bdfe2e37738e47b3b4d22e21697c@tipi-net.de>

On Mon, Apr 06, 2026 at 10:24:14PM +0200, Nicolai Buchwitz wrote:
> On 6.4.2026 21:58, Fernando Fernandez Mancera wrote:
> > [...]
> 
> > 
> > Hi Nicolai,
> > maybe I am missing something but [2] isn't from sashiko.dev but from 
> > netdev AI CI instead. See: 
> > https://netdev-ai.bots.linux.dev/ai-review.html?id=0b114a22-9aab-4265-8bfc-ea1b5bca5514
> 
> You're right, I mixed up the two systems - the example I linked was
> from the netdev AI bot, not Sashiko. My mistake on the link.
> 
> I stumbled over Sashiko when I noticed the name appearing more often
> in other reviews and then found Jonathan's LWN article about it [1].
> 
> Both tools are actively reviewing patches on the list today. I think
> it makes sense to document both rather than just one:
> 
> The netdev AI bot at netdev-ai.bots.linux.dev
> Sashiko at sashiko.dev, which posts reviews publicly on its website
> Both use the same review prompts by Chris Mason [2], so there is
> common ground - though results will vary between them due to the
> different AI models (Claude Opus for netdev-ai, Gemini for Sashiko)
> on top of the usual AI uncertainty.
> 
> I think it would be useful to document that AI reviews are happening
> but mixing AI bots might confuse people.
> 
> Agreed, I'll rework the patch to distinguish the two systems once
> the discussion has been settled.
> 
> > The documentation mentioned for running the AI locally is correctly 
> > related to netdev AI bot.
> > 
> > I think it would be useful to document that AI reviews are happening 
> > but mixing AI bots might confuse people.
> > 
> >> Check for findings on your submissions and address
> >> +valid ones before a maintainer has to relay the same questions.
> >> +
> > 
> > I wonder what would be the consequences for this. If less experienced 
> > submitters are expected to address issues pointed out by AI bots they 
> > might work on something that isn't valid. AFAIU, the AI output is only 
> > forwarded to the submitter after a maintainer reviewed it and believes 
> > it makes sense.
> 
> Fair point. The wording should make clear that the local tooling is
> an optional aid, not an obligation. I'll soften the language around
> addressing findings.
> 
> Would appreciate input on how much detail is appropriate here -
> should the doc just acknowledge that AI review exists and point to
> the tooling, or go into more detail about the workflow?

In general, if a workflow is expected by a subsystem, it should be
documented. I don't see much to be gained from not telling submitters
what they're expecting to do.

More precisely in this case, as a submitter, I would take it pretty
badly if I was told to act on the output of a tool that is prone to
hallucinations without a maintainer first triaging the comments.

> [1] https://lwn.net/Articles/1063292/
> [2] https://github.com/masoncl/review-prompts/blob/main/kernel/subsystem/networking.md
> 
> >> +You can also run AI reviews locally before submitting. Instructions
> >> +and tooling are available at:
> >> +
> >> +  https://netdev-ai.bots.linux.dev/ai-local.html
> >> +
> >>   Testimonials / feedback
> >>   -----------------------
> >> 
> 
> Thanks for your input
> 
> Nicolai

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* Re: [PATCH v10 07/21] gpu: nova-core: mm: Add TLB flush support
From: Joel Fernandes @ 2026-04-06 20:50 UTC (permalink / raw)
  To: Eliot Courtney
  Cc: linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Dave Airlie, Daniel Almeida, Koen Koning,
	dri-devel, rust-for-linux, Nikola Djukic, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Jonathan Corbet, Alex Deucher, Christian Koenig, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Matthew Auld, Matthew Brost, Lucas De Marchi, Thomas Hellstrom,
	Helge Deller, Alex Gaynor, Boqun Feng, John Hubbard,
	Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
	Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
	Philipp Stanner, Elle Rhumsaa, alexeyi, joel, linux-doc, amd-gfx,
	intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <DHIFLRYSRR3Z.34IFDA1592HCW@nvidia.com>

On Thu, 02 Apr 2026 14:49:05 +0900, Eliot Courtney wrote:

> > +    /// TLB flush serialization lock: This lock is acquired during the
> > +    /// DMA fence signalling critical path. It must NEVER be held across any
> > +    /// reclaimable CPU memory allocations because the memory reclaim path can
> > +    /// call `dma_fence_wait()`, which would deadlock with this lock held.
> > +    #[pin]
>
> This comment says that the lock is acquired during the DMA fence
> signalling critical path, but IIUC we don't have anything like that
> right now. Is this based on future yet to be done work? Can we reword
> this in a way so it makes sense in the current state?

Good point. Will reword. I'll still keep the references this design is
specifically for that, but will refine to avoid making it look like a fence
signalling usecase already exists. Thanks!

> > +    /// This invalidates all TLB entries associated with the given PDB address.
> > +    /// Must be called after modifying page table entries to ensure the GPU sees
> > +    /// the updated mappings.
>
> If this must be called after every operation like that, I wonder if we
> can change the design to require a guard like pattern something to
> ensure flush is called. WDYT?

That's a good thought. I looked into this -- currently there are only 2
call sites (execute_map and unmap_pages in vmm.rs), and both follow the
same pattern of dropping the PRAMIN window lock before flushing. A RAII
guard would need careful lifetime management to maintain this lock
ordering, and error propagation from drop is tricky. With just 2 call
sites, the explicit flush calls are clearer and easier to audit. If more
call sites emerge in the future, we can revisit adding a guard pattern
then.

> > +    pub(crate) fn flush(&self, pdb_addr: VramAddress) -> Result {
>
> Hopefully we don't need to be calling flush() from anywhere in the
> entire crate. Can you tighten the visibility here and in other places?
> Many things seem to be pub(crate) that don't need to be.

Agreed, will tighten flush() to pub(super) and do a broader visibility
audit across the series. Thanks!

> > +        read_poll_timeout(
> > +            || Ok(bar.read(regs::NV_TLB_FLUSH_CTRL)),
> > +            |ctrl: &regs::NV_TLB_FLUSH_CTRL| !ctrl.enable(),
> > +            Delta::ZERO,
> > +            Delta::from_secs(2),
> > +        )?;
>
> This has zero delay on the read_poll_timeout - what about adding a small
> delay of a microsecond or so?

I think Delta::ZERO is fine here -- it still calls cpu_relax() on each
iteration (not a pure busy-spin). This matches what other DRM drivers do
such polls, e.g. lima_mmu.c and panfrost_gpu.c both use read_poll_timeout
with sleep_us=0 in some places. I also measured it on GA102 and its about
~1-1.5us which I think is well within busy-loop territory. From my experience
with the scheduler, if we sleep, such short sleeps will be much longer than 1us
due to scheduling latency. I also checked OpenRM and even there it is a
sleepless spin-loop.

thanks,

-- 
Joel Fernandes

^ permalink raw reply

* Re: [PATCH] checkpatch: add --json output mode
From: Joe Perches @ 2026-04-06 20:41 UTC (permalink / raw)
  To: Sasha Levin, dwaipayanray1, lukas.bulwahn
  Cc: corbet, skhan, apw, workflows, linux-doc, linux-kernel
In-Reply-To: <20260406170039.4034716-1-sashal@kernel.org>

On Mon, 2026-04-06 at 13:00 -0400, Sasha Levin wrote:
> Add a --json flag to checkpatch.pl that emits structured JSON output,
> making results machine-parseable for CI systems, IDE integrations, and
> AI-assisted code review tools.

Seems a reasonable idea but perhaps can be improved

> @@ -1372,7 +1376,7 @@ for my $filename (@ARGV) {
>  	$file = $oldfile if ($is_git_file);
>  }
>  
> -if (!$quiet) {
> +if (!$quiet && !$json) {
>  	hash_show_words(\%use_type, "Used");
>  	hash_show_words(\%ignore_type, "Ignored");

Maybe keep but update?

> @@ -7791,18 +7836,33 @@ sub process {
>  	# If we have no input at all, then there is nothing to report on
>  	# so just keep quiet.
>  	if ($#rawlines == -1) {
> +		if ($json) {
> +			print '{"filename":"' . json_escape($filename) .
> +			      '","total_errors":0,"total_warnings":0,' .

poor formatting for that trailing ".  please separate by content blocks.

> +			      '"total_checks":0,"total_lines":0,"issues":[]}' . "\n";
> +		}

I'd prefer to keep the print() style used elsewhere
and perhaps the JSON:PP module should be used here.

^ permalink raw reply

* Re: [PATCH] checkpatch: add --json output mode
From: Joe Perches @ 2026-04-06 20:34 UTC (permalink / raw)
  To: Konstantin Ryabitsev, Sasha Levin
  Cc: dwaipayanray1, lukas.bulwahn, corbet, skhan, apw, workflows,
	linux-doc, linux-kernel
In-Reply-To: <20260406-futuristic-lilac-gerbil-6ef4a5@lemur>

On Mon, 2026-04-06 at 15:22 -0400, Konstantin Ryabitsev wrote:
> On Mon, Apr 06, 2026 at 03:13:52PM -0400, Sasha Levin wrote:
> I see that it's writing json out manually, implementing its own escaping.
> > > While there are upsides to not requiring a perl json library, I think it's
> > > fair to expect that people who would want to get json output can probably make
> > > sure that JSON::XS is installed.
> > > 
> > > Not a strong object, but seems cleaner that way.

To me too.

JSON:PP is standard since 5.14, and that's 15 years old.

I'd rather just require 5.14 as a minimum and remove
a bunch of other checks too.

^ permalink raw reply

* Re: [PATCH net-next] docs: netdev: document AI-assisted review tooling
From: Nicolai Buchwitz @ 2026-04-06 20:24 UTC (permalink / raw)
  To: Fernando Fernandez Mancera
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, netdev, workflows,
	linux-doc, linux-kernel, mbloch
In-Reply-To: <345722f0-21b1-4970-8c45-ef85edf9d45b@suse.de>

On 6.4.2026 21:58, Fernando Fernandez Mancera wrote:
> [...]

> 
> Hi Nicolai,
> maybe I am missing something but [2] isn't from sashiko.dev but from 
> netdev AI CI instead. See: 
> https://netdev-ai.bots.linux.dev/ai-review.html?id=0b114a22-9aab-4265-8bfc-ea1b5bca5514

You're right, I mixed up the two systems - the example I linked was
from the netdev AI bot, not Sashiko. My mistake on the link.

I stumbled over Sashiko when I noticed the name appearing more often
in other reviews and then found Jonathan's LWN article about it [1].

Both tools are actively reviewing patches on the list today. I think
it makes sense to document both rather than just one:

The netdev AI bot at netdev-ai.bots.linux.dev
Sashiko at sashiko.dev, which posts reviews publicly on its website
Both use the same review prompts by Chris Mason [2], so there is
common ground - though results will vary between them due to the
different AI models (Claude Opus for netdev-ai, Gemini for Sashiko)
on top of the usual AI uncertainty.

I think it would be useful to document that AI reviews are happening
but mixing AI bots might confuse people.

Agreed, I'll rework the patch to distinguish the two systems once
the discussion has been settled.

> 
> The documentation mentioned for running the AI locally is correctly 
> related to netdev AI bot.
> 
> I think it would be useful to document that AI reviews are happening 
> but mixing AI bots might confuse people.
> 
>> Check for findings on your submissions and address
>> +valid ones before a maintainer has to relay the same questions.
>> +
> 
> I wonder what would be the consequences for this. If less experienced 
> submitters are expected to address issues pointed out by AI bots they 
> might work on something that isn't valid. AFAIU, the AI output is only 
> forwarded to the submitter after a maintainer reviewed it and believes 
> it makes sense.

Fair point. The wording should make clear that the local tooling is
an optional aid, not an obligation. I'll soften the language around
addressing findings.

Would appreciate input on how much detail is appropriate here -
should the doc just acknowledge that AI review exists and point to
the tooling, or go into more detail about the workflow?

[1] https://lwn.net/Articles/1063292/
[2] 
https://github.com/masoncl/review-prompts/blob/main/kernel/subsystem/networking.md

> 
> Thanks,
> Fernando.
> 
>> +You can also run AI reviews locally before submitting. Instructions
>> +and tooling are available at:
>> +
>> +  https://netdev-ai.bots.linux.dev/ai-local.html
>> +
>>   Testimonials / feedback
>>   -----------------------
>> 

Thanks for your input

Nicolai

^ permalink raw reply

* Re: [PATCH net-next] docs: netdev: document AI-assisted review tooling
From: Fernando Fernandez Mancera @ 2026-04-06 19:58 UTC (permalink / raw)
  To: Nicolai Buchwitz, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan
  Cc: netdev, workflows, linux-doc, linux-kernel
In-Reply-To: <20260406-nb-docs-ai-review-v1-1-b58943762ca9@tipi-net.de>

On 4/6/26 9:40 PM, Nicolai Buchwitz wrote:
> Add a section about Sashiko, the Linux Foundation's open-source
> AI review system for kernel patches. Contributors can check review
> feedback on the Sashiko website and address findings proactively,
> reducing the need for maintainers to relay the same questions.
> 
> Also point to the local review tooling at netdev-ai.bots.linux.dev
> for contributors who want to run AI reviews before submitting.
> 
> Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
> ---
> Sashiko [1] reviews are already being used on the list (e.g. [2])
> but there's no mention of them in the netdev docs. Add a section
> so contributors know they can check and respond to AI review
> feedback directly.
> 
> Based on Jakub's reviewer guidance patch [3].
> 
> [1] https://sashiko.dev/
> [2] https://lore.kernel.org/all/20260324024235.929875-1-kuba@kernel.org/
> [3] https://lore.kernel.org/all/20260406175334.3153451-1-kuba@kernel.org/
> ---
>   Documentation/process/maintainer-netdev.rst | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/Documentation/process/maintainer-netdev.rst b/Documentation/process/maintainer-netdev.rst
> index bda93b459a0533fa1adfd11b756a4f47d1dbaa22..27296afb05d3828a350b4ed5c16907672db9785d 100644
> --- a/Documentation/process/maintainer-netdev.rst
> +++ b/Documentation/process/maintainer-netdev.rst
> @@ -559,6 +559,19 @@ Reviewers are highly encouraged to do more in-depth review of submissions
>   and not focus exclusively on process issues, trivial or subjective
>   matters like code formatting, tags etc.
>   
> +AI-assisted review
> +~~~~~~~~~~~~~~~~~~
> +
> +Patches posted to netdev are automatically reviewed by the Sashiko
> +AI review system (https://sashiko.dev/). Results are posted publicly
> +on the website. 

Hi Nicolai,

maybe I am missing something but [2] isn't from sashiko.dev but from 
netdev AI CI instead. See: 
https://netdev-ai.bots.linux.dev/ai-review.html?id=0b114a22-9aab-4265-8bfc-ea1b5bca5514

The documentation mentioned for running the AI locally is correctly 
related to netdev AI bot.

I think it would be useful to document that AI reviews are happening but 
mixing AI bots might confuse people.

> Check for findings on your submissions and address
> +valid ones before a maintainer has to relay the same questions.
> +

I wonder what would be the consequences for this. If less experienced 
submitters are expected to address issues pointed out by AI bots they 
might work on something that isn't valid. AFAIU, the AI output is only 
forwarded to the submitter after a maintainer reviewed it and believes 
it makes sense.

Thanks,
Fernando.

> +You can also run AI reviews locally before submitting. Instructions
> +and tooling are available at:
> +
> +  https://netdev-ai.bots.linux.dev/ai-local.html
> +
>   Testimonials / feedback
>   -----------------------
>   

^ permalink raw reply

* Re: (sashiko review) [PATCH v6 1/1] mm/damon: add node_eligible_mem_bp and node_ineligible_mem_bp goal metrics
From: Ravi Jonnalagadda @ 2026-04-06 19:47 UTC (permalink / raw)
  To: SeongJae Park
  Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
	ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <20260405224550.76218-1-sj@kernel.org>

On Sun, Apr 5, 2026 at 3:45 PM SeongJae Park <sj@kernel.org> wrote:
>
>
> Ravi, thank you for reposting this patch after the rebase.  This time sashiko
> was able to review this, and found good points including things that deserve
> another revision of this patch.
>
> Forwarding full sashiko review in a reply format with my inline comments below,
> for sharing details of my view and doing followup discussions via mails.  Ravi,
> could you please reply?
>

Thanks SJ, providing your comments on top of sashiko's review is very helpful.

> > # review url: https://sashiko.dev/#/patchset/20260405184247.2690-2-ravis.opensrc@gmail.com
> > # start of sashiko.dev inline review
> > commit 5f6ea8da06d136b84d97827777acf2b07047d42c
> > Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
> > mm/damon: add node_eligible_mem_bp and node_ineligible_mem_bp goal metrics
> >
> > This commit adds quota goal metrics for tracking scheme-eligible and
> > ineligible memory distribution across NUMA nodes. It iterates over
> > scheme-eligible regions and uses damon_get_folio() to determine placement.
> >
> > > diff --git a/mm/damon/core.c b/mm/damon/core.c
> > > --- a/mm/damon/core.c
> > > +++ b/mm/damon/core.c
> >
> > [ ... ]
> >
> > > +static unsigned long damos_calc_eligible_bytes(struct damon_ctx *c,
> > > +           struct damos *s, int nid, unsigned long *total)
> > > +{
> > > +   struct damon_target *t;
> > > +   struct damon_region *r;
> > > +   unsigned long total_eligible = 0;
> > > +   unsigned long node_eligible = 0;
> >
> > Can total_eligible and node_eligible overflow on 32-bit architectures?
> > Since these accumulators track raw physical bytes, regions spanning multiple
> > gigabytes might wrap around the 32-bit unsigned long limit. Would it be
> > safer to use u64 or track the counts in pages instead?
>
> Makes sense.  Ravi, could we update these to use phys_addr_t?

Agreed. I'll update total_eligible, node_eligible, and the function
return type to use phys_addr_t throughout damos_calc_eligible_bytes().

>
> >
> > > +
> > > +   damon_for_each_target(t, c) {
> > > +           damon_for_each_region(r, t) {
> > > +                   phys_addr_t addr, end_addr;
> > > +
> > > +                   if (!__damos_valid_target(r, s))
> > > +                           continue;
> >
> > Should this loop also check damos_core_filter_out()? By only checking
> > __damos_valid_target(), it seems memory explicitly excluded by user-defined
> > region filters (like target ID or memcg limits) is still counted as
> > eligible, which might inflate the total and affect the quota tuning loop.
>
> No, we intentionally defined "eligible" based on only the access pattern.  We
> may be able to extend this to apply filters in future, if we find a use case.
> But, for now, I'd suggest keeping this as is for now.
>

Got it. Will keep it as is.

> >
> > > +
> > > +                   /* Convert from core address units to physical bytes */
> > > +                   addr = r->ar.start * c->addr_unit;
> > > +                   end_addr = r->ar.end * c->addr_unit;
> > > +                   while (addr < end_addr) {
> >
> > Could this unbounded loop cause CPU soft lockups? Since DAMON regions can
> > span hundreds of gigabytes, executing millions of folio lookups sequentially
> > without a cond_resched() could monopolize the CPU.
>
> Good point.  Most DAMOS action implementations are also doing cond_resched()
> per region.  Ravi, could we add cond_resched() per region to this function,
> too?

Will add cond_resched() at the end of each region's processing.

>
> >
> > > +                           struct folio *folio;
> > > +                           unsigned long folio_sz, counted;
> > > +
> > > +                           folio = damon_get_folio(PHYS_PFN(addr));
> >
> > What happens if this metric is assigned to a DAMON context configured for
> > virtual address space monitoring? If the context uses DAMON_OPS_VADDR,
> > passing a user-space virtual address to PHYS_PFN() might cause invalid
> > memory accesses or out-of-bounds page struct reads. Should this code
> > explicitly verify the operations type first?
>
> Good finding.  We intend to support only paddr ops.  But there is no guard for
> using this on vaddr ops configuration.  Ravi, could we add underlying ops
> check?  I think damon_commit_ctx() is a good place to add that.  The check
> could be something like below?
>

I plan to add the ops type check directly in the metric functions
(damos_get_node_eligible_mem_bp and its counterpart) rather than in
damon_commit_ctx(). The functions will return 0 early
if c->ops.id != DAMON_OPS_PADDR.

That said, if you prefer the damon_commit_ctx() validation approach to
reject the configuration outright, I can implement it that way instead.
Please let me know your preference.

> '''
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
> @@ -1515,10 +1515,23 @@ static int damon_commit_sample_control(
>  int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src)
>  {
>         int err;
> +       struct damos *scheme;
> +       struct damos_quota_goal *goal;
>
>         dst->maybe_corrupted = true;
>         if (!is_power_of_2(src->min_region_sz))
>                 return -EINVAL;
> +       if (src->ops.id != DAMON_OPS_PADDR) {
> +               damon_for_each_scheme(scheme, src) {
> +                       damos_for_each_quota_goal(goal, &scheme->quota) {
> +                               switch (goal->metric) {
> +                               case DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP:
> +                               case DAMOS_QUOTA_NODE_INELIGIBLE_MEMPBP:
> +                                       return -EINVAL;
> +                               }
> +                       }
> +               }
> +       }
>
>         err = damon_commit_schemes(dst, src);
>         if (err)
> '''
>
> >
> > > +                           if (!folio) {
> > > +                                   addr += PAGE_SIZE;
> > > +                                   continue;
> > > +                           }
> > > +
> > > +                           folio_sz = folio_size(folio);
> > > +                           /*
> > > +                            * Clip to region boundaries to avoid counting
> > > +                            * bytes outside the region when folio spans
> > > +                            * region boundaries.
> > > +                            */
> > > +                           counted = min(folio_sz, (unsigned long)(end_addr - addr));
> >
> > Does this accurately calculate the remaining bytes if the region starts at an
> > unaligned offset within a large folio? folio_size() returns the full size
> > of the folio, so this calculation might overestimate the eligible bytes by
> > not accounting for the initial offset within the folio.
> >
> > > +                           total_eligible += counted;
> > > +                           if (folio_nid(folio) == nid)
> > > +                                   node_eligible += counted;
> > > +
> > > +                           addr += folio_sz;
> >
> > Can this pointer advancement overshoot the end of the current folio? If
> > addr was unaligned, adding the full folio_sz will skip past the beginning
> > of the next contiguous physical memory folio.
>
> Above two points make sense to me.  Ravi, could we fix these?
>

 Will fix it. Will calculate the exact overlap between the
 region [addr, end_addr) and the folio's actual range.

> >
> > > +                           folio_put(folio);
> > > +                   }
> > > +           }
> > > +   }
> > > +
> > > +   *total = total_eligible;
> > > +   return node_eligible;
> > > +}
> >
> > [ ... ]
> >
> > > +static unsigned long damos_get_node_ineligible_mem_bp(struct damon_ctx *c,
> > > +           struct damos *s, int nid)
> > > +{
> > > +   unsigned long total_eligible = 0;
> > > +   unsigned long node_eligible;
> > > +
> > > +   if (nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
> > > +           return 0;
> > > +
> > > +   node_eligible = damos_calc_eligible_bytes(c, s, nid, &total_eligible);
> > > +
> > > +   /* No eligible memory anywhere - ratio is undefined, return 0 */
> > > +   if (!total_eligible)
> > > +           return 0;
> > > +
> > > +   /* Compute ineligible ratio directly: 10000 - eligible_bp */
> > > +   return 10000 - mult_frac(node_eligible, 10000, total_eligible);
> > > +}
> >
> > Does this return value match the documented metric? The formula computes the
> > percentage of the system's eligible memory located on other NUMA nodes,
> > rather than the amount of actual ineligible (filtered out) memory residing
> > on the target node. Could this semantic mismatch cause confusion when
> > configuring quota policies?
>
> Nice catch.  The name and the documentation are confusing.  We actually
> confused a few times in previous revisions, and I'm again confused now.  IIUC,
> the current implementation is the intended and right one for the given use
> case, though.  If my understanding is correct, how about renaming
> DAMOS_QUOTA_NODE_INELIGIBLE_MEM_BP to
> DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP_COMPLEMENT, and updating the documentation
> together?  Ravi, what do you think?
>

Agreed, the current name is confusing. How about
DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP_OFFNODE?

The rationale is that this metric measures "eligible memory that is off
this node" (i.e., on other nodes).

 I think "offnode" conveys the physical meaning more directly than "complement".
That said, I'm happy to go with "complement" if you prefer.
both are clearer than "ineligible".

> >
> >
> > # end of sashiko.dev inline review
> > # review url: https://sashiko.dev/#/patchset/20260405184247.2690-2-ravis.opensrc@gmail.com
>
>
> Thanks,
> SJ
>

Best Regards,
Ravi.
> # hkml [1] generated a draft of this mail.  You can regenerate
> # this using below command:
> #
> #     hkml patch sashiko_dev --for_forwarding \
> #             20260405184247.2690-2-ravis.opensrc@gmail.com
> #

^ permalink raw reply

* Re: [PATCH v10 03/21] gpu: nova-core: gsp: Expose total physical VRAM end from FB region info
From: Joel Fernandes @ 2026-04-06 19:42 UTC (permalink / raw)
  To: Eliot Courtney
  Cc: linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Dave Airlie, Daniel Almeida, Koen Koning,
	dri-devel, rust-for-linux, Nikola Djukic, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Jonathan Corbet, Alex Deucher, Christian Koenig, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Matthew Auld, Matthew Brost, Lucas De Marchi, Thomas Hellstrom,
	Helge Deller, Alex Gaynor, Boqun Feng, John Hubbard,
	Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
	Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
	Philipp Stanner, Elle Rhumsaa, alexeyi, joel, linux-doc, amd-gfx,
	intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <DHIFD6N7QSU1.1RGEN0APPDHD8@nvidia.com>

On Thu, 02 Apr 2026 14:37:52 +0900, Eliot Courtney wrote:

> +    /// Compute the end of physical VRAM from all FB regions.
> +    pub(crate) fn total_fb_end(&self) -> Option<u64> {
> +        self.fb_regions()
> +            .map(|reg| reg.limit.saturating_add(1))
>
> I think it would be better to used checked_add here.

Fixed, thanks.

-- 
Joel Fernandes

^ permalink raw reply

* [PATCH net-next] docs: netdev: document AI-assisted review tooling
From: Nicolai Buchwitz @ 2026-04-06 19:40 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan
  Cc: netdev, workflows, linux-doc, linux-kernel, Nicolai Buchwitz

Add a section about Sashiko, the Linux Foundation's open-source
AI review system for kernel patches. Contributors can check review
feedback on the Sashiko website and address findings proactively,
reducing the need for maintainers to relay the same questions.

Also point to the local review tooling at netdev-ai.bots.linux.dev
for contributors who want to run AI reviews before submitting.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
---
Sashiko [1] reviews are already being used on the list (e.g. [2])
but there's no mention of them in the netdev docs. Add a section
so contributors know they can check and respond to AI review
feedback directly.

Based on Jakub's reviewer guidance patch [3].

[1] https://sashiko.dev/
[2] https://lore.kernel.org/all/20260324024235.929875-1-kuba@kernel.org/
[3] https://lore.kernel.org/all/20260406175334.3153451-1-kuba@kernel.org/
---
 Documentation/process/maintainer-netdev.rst | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/Documentation/process/maintainer-netdev.rst b/Documentation/process/maintainer-netdev.rst
index bda93b459a0533fa1adfd11b756a4f47d1dbaa22..27296afb05d3828a350b4ed5c16907672db9785d 100644
--- a/Documentation/process/maintainer-netdev.rst
+++ b/Documentation/process/maintainer-netdev.rst
@@ -559,6 +559,19 @@ Reviewers are highly encouraged to do more in-depth review of submissions
 and not focus exclusively on process issues, trivial or subjective
 matters like code formatting, tags etc.
 
+AI-assisted review
+~~~~~~~~~~~~~~~~~~
+
+Patches posted to netdev are automatically reviewed by the Sashiko
+AI review system (https://sashiko.dev/). Results are posted publicly
+on the website. Check for findings on your submissions and address
+valid ones before a maintainer has to relay the same questions.
+
+You can also run AI reviews locally before submitting. Instructions
+and tooling are available at:
+
+  https://netdev-ai.bots.linux.dev/ai-local.html
+
 Testimonials / feedback
 -----------------------
 

---
base-commit: d00749db443cf420a882c020ce0e6bb5c43009de
change-id: 20260406-nb-docs-ai-review-28b4ff21cf5e

Best regards,
-- 
Nicolai Buchwitz <nb@tipi-net.de>


^ permalink raw reply related

* Re: [PATCH] checkpatch: add --json output mode
From: Konstantin Ryabitsev @ 2026-04-06 19:22 UTC (permalink / raw)
  To: Sasha Levin
  Cc: dwaipayanray1, lukas.bulwahn, joe, corbet, skhan, apw, workflows,
	linux-doc, linux-kernel
In-Reply-To: <adQF8LoUf4YH7F98@laps>

On Mon, Apr 06, 2026 at 03:13:52PM -0400, Sasha Levin wrote:
> > I see that it's writing json out manually, implementing its own escaping.
> > While there are upsides to not requiring a perl json library, I think it's
> > fair to expect that people who would want to get json output can probably make
> > sure that JSON::XS is installed.
> > 
> > Not a strong object, but seems cleaner that way.
> 
> No objection here, but from what I saw the checkpatch code only uses core perl
> packages so I wanted to keep it that way.

I saw that, too, but I think that stems from the expectation that we need to
make it easy to run checkpatch by any random person submitting patches, which
is why, by default, we'll output human-readable results.

JSON output, on the other hand, is mostly useful for specific setups that have
a lot more control over their environment and we don't have to stick to the
"pure perl only" guideline here.

Generating correct json is an exercise in corner cases, which is why I'd
rather this is done with a library that has addressed most of them already.

Regards,
-- 
KR

^ permalink raw reply

* Re: [PATCH] checkpatch: add --json output mode
From: Sasha Levin @ 2026-04-06 19:13 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: dwaipayanray1, lukas.bulwahn, joe, corbet, skhan, apw, workflows,
	linux-doc, linux-kernel
In-Reply-To: <20260406-true-whippet-of-luck-d3c2ba@lemur>

On Mon, Apr 06, 2026 at 03:00:25PM -0400, Konstantin Ryabitsev wrote:
>On Mon, Apr 06, 2026 at 01:00:39PM -0400, Sasha Levin wrote:
>> Add a --json flag to checkpatch.pl that emits structured JSON output,
>> making results machine-parseable for CI systems, IDE integrations, and
>> AI-assisted code review tools.
>>
>> The JSON output includes per-file totals (errors, warnings, checks,
>> lines) and an array of individual issues with structured fields for
>> level, type, message, file path, and line number.
>>
>> The --json flag is mutually exclusive with --terse and --emacs.
>> Normal text output behavior is completely unchanged when --json is
>> not specified.
>
>I see that it's writing json out manually, implementing its own escaping.
>While there are upsides to not requiring a perl json library, I think it's
>fair to expect that people who would want to get json output can probably make
>sure that JSON::XS is installed.
>
>Not a strong object, but seems cleaner that way.

No objection here, but from what I saw the checkpatch code only uses core perl
packages so I wanted to keep it that way.

-- 
Thanks,
Sasha

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox