* Re: [PATCH net-next v2 2/2] net: ti: icssg: Add HSR and LRE PA statistics
From: Jakub Kicinski @ 2026-05-20 22:33 UTC
To: MD Danish Anwar
Cc: Luka Gejak, Felix Maurer, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
Roger Quadros, Andrew Lunn, Meghana Malladi, Jacob Keller,
David Carlier, Vadim Fedorenko, Kevin Hao, netdev, linux-doc,
linux-kernel, linux-arm-kernel, Vladimir Oltean
In-Reply-To: <1d8ab51a-6943-4978-88cf-adda8cc57f7e@ti.com>
On Wed, 20 May 2026 15:30:24 +0530 MD Danish Anwar wrote:
> What should be the next steps here? Is there any existing defined set of
> stats where I could populate stats from ICSSG firmware for HSR (similar
> to ndo_get_stats64 callback)? Or do we need to implement a new callback
> that will do this for HSR?
I'd try to plumb this thru ndo_get_offload_stats
Close enough for my taste, let's see if anyone objects.
> I agree with Luka on the categorization,
Felix responded with the MIB counters which are even better.
We should probably define a struct with all of those and then
just fill in the ones you have.
Please do the same thing ethtool Netlink does, break the counters up,
each member to its own Netlink attr, in the kernel init them to ~0
and only report values the driver actually set to something.
We don't want to print 0 for stats driver doesn't support.
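A minimal userspace sketch of the "init to ~0, report only what was set" pattern described above. The struct, field names, and helpers are hypothetical placeholders, not the proposed kernel or Netlink API; in the kernel each reported member would become its own Netlink attribute rather than a text line.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* ~0 marks "driver did not provide this counter". */
#define STAT_NOT_SET UINT64_MAX

/* Hypothetical counter set; the names are illustrative only. */
struct hsr_stats_sketch {
	uint64_t cnt_a;
	uint64_t cnt_b;
	uint64_t cnt_c;
};

static const char * const hsr_stat_names[] = { "cnt_a", "cnt_b", "cnt_c" };

/* Core side: mark every counter as unset before calling the driver. */
static void hsr_stats_init(struct hsr_stats_sketch *s)
{
	memset(s, 0xff, sizeof(*s));
}

/*
 * Emit "name=value" lines only for counters that were actually set.
 * Returns the number of counters emitted.
 */
static int hsr_stats_report(const struct hsr_stats_sketch *s,
			    char *buf, size_t len)
{
	const uint64_t vals[] = { s->cnt_a, s->cnt_b, s->cnt_c };
	size_t off = 0;
	int emitted = 0;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(vals) / sizeof(vals[0]); i++) {
		int n;

		if (vals[i] == STAT_NOT_SET)
			continue;	/* unsupported: skip, do not print 0 */

		n = snprintf(buf + off, len - off, "%s=%llu\n",
			     hsr_stat_names[i], (unsigned long long)vals[i]);
		if (n < 0 || (size_t)n >= len - off) {
			buf[off] = '\0';	/* drop the truncated entry */
			break;
		}
		off += (size_t)n;
		emitted++;
	}
	return emitted;
}
```

With this scheme a counter that is genuinely 0 is still reported, while a counter the driver never touched is left out entirely.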
* Re: [PATCH v3 04/12] x86,fs/resctrl: Program PLZA through kmode arch hooks
From: Luck, Tony @ 2026-05-20 22:16 UTC
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, bp,
dave.hansen, skhan, x86, mingo, hpa, akpm, rdunlap,
pawan.kumar.gupta, feng.tang, dapeng1.mi, kees, elver, lirongqing,
paulmck, bhelgaas, seanjc, alexandre.chartre, yazen.ghannam,
peterz, chang.seok.bae, kim.phillips, xin, naveen,
thomas.lendacky, linux-doc, linux-kernel, eranian, peternewman,
sos-linux-ext-patches
In-Reply-To: <1a410ca9-f4a2-4956-8477-033d61a733be@amd.com>
On Wed, May 20, 2026 at 12:49:25PM -0500, Babu Moger wrote:
> Hi Tony,
>
>
> On 5/19/26 15:59, Luck, Tony wrote:
> > On Thu, Apr 30, 2026 at 06:24:49PM -0500, Babu Moger wrote:
> > > +void resctrl_arch_configure_kmode(cpumask_var_t cpu_mask, u32 closid, u32 rmid, bool enable)
> > > +{
> > > + union msr_pqr_plza_assoc plza = { 0 };
> > > +
> > > + plza.split.rmid = rmid;
> > > + plza.split.rmid_en = 1;
> >
> > Shouldn't there be a parameter for the value of rmid_en?
>
>
> I realized that behavior is not required—it was actually due to a mistake in
> my v2 series implementation.
>
> Below are the relevant definitions:
>
>
> GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU:
> The CLOSID is applied to kernel work, while the RMID used for monitoring is
> inherited from the currently running user task.
> No separate monitoring group is assigned for kernel work, so kernel
> execution naturally inherits the user-space RMID.
>
>
> GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU:
> Both CLOSID and RMID are explicitly assigned to kernel work.
> This allows assigning a dedicated monitoring group for kernel execution and
> therefore requires a separate RMID.
>
> Example: For GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU:
>
> # mount -t resctrl resctrl /sys/fs/resctrl
>
> # cat /sys/fs/resctrl/info/kernel_mode
> [inherit_ctrl_and_mon:group=//]
> global_assign_ctrl_inherit_mon_per_cpu:group=none
> global_assign_ctrl_assign_mon_per_cpu:group=none
>
> # mkdir /sys/fs/resctrl/ctrl1 (PQR_ASSOC closid=1 rmid=1)
>
> This configures all the CPU threads to use closid=1 and rmid=1 for both
> allocation and monitoring across user and kernel modes.
>
>
> # echo "global_assign_ctrl_inherit_mon_per_cpu:group=ctrl1//" \
> > /sys/fs/resctrl/info/kernel_mode
>
> # cat /sys/fs/resctrl/info/kernel_mode
> inherit_ctrl_and_mon:group=none
> [global_assign_ctrl_inherit_mon_per_cpu:group=ctrl1//]
> global_assign_ctrl_assign_mon_per_cpu:group=none
>
> This overrides the previous configuration, and PQR_PLZA_ASSOC is written.
>
> Possible options:
>
> 1. (closid=1, rmid_en=0, rmid=1)
> Here, hardware uses closid=1 for kernel work, but RMID tracking is disabled
> for kernel mode.
>
> As a result, reading RMID 1 reports only user-mode activity
> This contradicts the definition of this mode, since kernel work is expected
> to inherit the user RMID for monitoring.
>
> 2. (closid=1, rmid_en=1, rmid=1)
> In this case, RMID tracking is enabled for both user and kernel modes.
>
> Reading RMID 1 reports combined user + kernel activity
> This aligns with the expected inherit_monitoring behavior
>
>
> The preferred approach is to separate kernel monitoring by assigning it a
> dedicated monitoring group and updating PQR_PLZA_ASSOC to use a different
> RMID (e.g., closid=1, rmid_en=1, rmid=2). This is exactly the behavior
> implemented by GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU.
So maybe I'm just confused by the name "global_assign_ctrl_inherit_mon_per_cpu".
That sounds like "Use the CLOSID from PLZA, but keep the RMID from
legacy PQR_ASSOC".
So:
# mkdir ctrl1 # maybe gets CLOSID=1, RMID=1
# echo "global_assign_ctrl_inherit_mon_per_cpu:group=ctrl1//" > info/kernel_mode
# mkdir ctrl2 # maybe gets CLOSID=2, RMID=2
# echo $$ > ctrl2/tasks
My shell, and all children run with CLOSID=2 and RMID=2 from ctrl2. But
when they do system calls, take page faults or there is an interrupt I'd
expect the code in the kernel to run with the CLOSID=1, while inheriting
RMID=2.
To make that happen, I think the PLZA MSR should have rmid_en = 0. But
the only code I see that sets this always sets rmid_en=1.
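A rough sketch of the behavior expected under this reading of the mode names, with rmid_en chosen per mode instead of hard-coded. The struct, enum, and function below are made up for illustration and do not reflect the real PQR_PLZA_ASSOC layout or the series' actual interfaces; the mapping itself is the interpretation given in this reply, not confirmed hardware semantics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software view of a PLZA setting; NOT the real MSR layout. */
struct plza_cfg_sketch {
	uint32_t closid;
	uint32_t rmid;
	bool closid_en;	/* kernel work uses .closid */
	bool rmid_en;	/* kernel work uses .rmid; if false, keep PQR_ASSOC RMID */
};

/* Hypothetical names mirroring the three kernel_mode strings. */
enum kmode_sketch {
	KMODE_INHERIT_CTRL_AND_MON,
	KMODE_ASSIGN_CTRL_INHERIT_MON,
	KMODE_ASSIGN_CTRL_ASSIGN_MON,
};

static struct plza_cfg_sketch plza_cfg_for_mode(enum kmode_sketch mode,
						uint32_t closid, uint32_t rmid)
{
	struct plza_cfg_sketch cfg = { 0 };

	switch (mode) {
	case KMODE_INHERIT_CTRL_AND_MON:
		/* Nothing overridden: kernel runs with the task's CLOSID/RMID. */
		break;
	case KMODE_ASSIGN_CTRL_INHERIT_MON:
		/* Override control only; monitoring follows the user task. */
		cfg.closid = closid;
		cfg.closid_en = true;
		cfg.rmid_en = false;
		break;
	case KMODE_ASSIGN_CTRL_ASSIGN_MON:
		/* Override both control and monitoring for kernel work. */
		cfg.closid = closid;
		cfg.closid_en = true;
		cfg.rmid = rmid;
		cfg.rmid_en = true;
		break;
	}
	return cfg;
}
```

In the ctrl1/ctrl2 example above, the inherit_mon case would yield closid=1 with rmid_en clear, so kernel work triggered by the shell keeps being counted against RMID=2.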
>
> Thanks
> Babu
-Tony
* Re: [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
From: Randy Dunlap @ 2026-05-20 22:06 UTC
To: Leonardo Bras, Jonathan Corbet, Shuah Khan, Peter Zijlstra,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Feng Tang, Dapeng Mi, Kees Cook, Marco Elver, Jakub Kicinski,
Li RongQing, Eric Biggers, Paul E. McKenney, Nathan Chancellor,
Nicolas Schier, Miguel Ojeda, Thomas Weißschuh,
Thomas Gleixner, Douglas Anderson, Gary Guo, Christian Brauner,
Pasha Tatashin, Coiby Xu, Masahiro Yamada, Frederic Weisbecker
Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel,
Marcelo Tosatti
In-Reply-To: <20260519012754.240804-2-leobras.c@gmail.com>
On 5/18/26 6:27 PM, Leonardo Bras wrote:
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c2c6d79275c6..7102031207c9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -21775,20 +21775,27 @@ QORIQ DPAA2 FSL-MC BUS DRIVER
> M: Ioana Ciornei <ioana.ciornei@nxp.com>
> L: linuxppc-dev@lists.ozlabs.org
> L: linux-kernel@vger.kernel.org
> S: Maintained
> F: Documentation/ABI/stable/sysfs-bus-fsl-mc
> F: Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
> F: Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
> F: drivers/bus/fsl-mc/
> F: include/uapi/linux/fsl_mc.h
>
> +PW Locks
> +M: Leonardo Bras <leobras.c@gmail.com>
> +S: Supported
> +F: Documentation/locking/pwlocks.rst
> +F: include/linux/pwlocks.h
> +F: kernel/pwlocks.c
MAINTAINERS entries should be in alphabetical order: PW is not in the
middle of the Q entries.
> +
> QT1010 MEDIA DRIVER
> L: linux-media@vger.kernel.org
> S: Orphan
> W: https://linuxtv.org
> Q: http://patchwork.linuxtv.org/project/linux-media/list/
> F: drivers/media/tuners/qt1010*
>
> QUALCOMM ATH12K WIRELESS DRIVER
> M: Jeff Johnson <jjohnson@kernel.org>
> L: linux-wireless@vger.kernel.org
> diff --git a/Documentation/locking/pwlocks.rst b/Documentation/locking/pwlocks.rst
> new file mode 100644
> index 000000000000..09f4a5417bc1
> --- /dev/null
> +++ b/Documentation/locking/pwlocks.rst
> @@ -0,0 +1,76 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========
> +PW (Per-CPU Work) locks
> +=========
Overline and underline should be at least as long as the heading text.
> +
> +Some places in the kernel implement a parallel programming strategy
> +consisting on local_locks() for most of the work, and some rare remote
> +operations are scheduled on target cpu. This keeps cache bouncing low since
on a target CPU.
> +cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> +kernels, even though the very few remote operations will be expensive due
> +to scheduling overhead.
> +
> +On the other hand, for RT workloads this can represent a problem:
> +scheduling work on remote cpu that are executing low latency tasks
CPUs
> +is undesired and can introduce unexpected deadline misses.
undesirable
?
> +
> +PW locks help to convert sites that use local_locks (for cpu local operations)
> +and queue_work_on (for queueing work remotely, to be executed
> +locally on the owner cpu of the lock) to a spinlocks.
to spinlocks.
> +
> +The lock is declared pw_lock_t type.
> +The lock is initialized with pw_lock_init.
> +The lock is locked with pw_lock (takes a lock and cpu as a parameter).
> +The lock is unlocked with pw_unlock (takes a lock and cpu as a parameter).
> +
> +The pw_lock_irqsave function disables interrupts and saves current interrupt state,
> +cpu as a parameter.
> +
> +For trylock variant, there is the pw_trylock_t type, initialized with
> +pw_trylock_init. Then the corresponding pw_trylock and pw_trylock_irqsave.
> +
> +work_struct should be replaced by pw_struct, which contains a cpu parameter
> +(owner cpu of the lock), initialized by INIT_PW.
> +
> +The queue work related functions (analogous to queue_work_on and flush_work) are:
> +pw_queue_on and pw_flush.
> +
> +The behaviour of the PW lock functions is as follows:
> +
> +* !CONFIG_PWLOCKS (or CONFIG_PWLOCKS and pwlocks=off kernel boot parameter):
> + - pw_lock: local_lock
> + - pw_lock_irqsave: local_lock_irqsave
> + - pw_trylock: local_trylock
> + - pw_trylock_irqsave: local_trylock_irqsave
> + - pw_unlock: local_unlock
> + - pw_lock_local: local_lock
> + - pw_trylock_local: local_trylock
> + - pw_unlock_local: local_unlock
> + - pw_queue_on: queue_work_on
> + - pw_flush: flush_work
> +
> +* CONFIG_PWLOCKS (and CONFIG_PWLOCKS_DEFAULT=y or pwlocks=on kernel boot parameter),
> + - pw_lock: spin_lock
> + - pw_lock_irqsave: spin_lock_irqsave
> + - pw_trylock: spin_trylock
> + - pw_trylock_irqsave: spin_trylock_irqsave
> + - pw_unlock: spin_unlock
> + - pw_lock_local: preempt_disable OR migrate_disable + spin_lock
> + - pw_trylock_local: preempt_disable OR migrate_disable + spin_trylock
> + - pw_unlock_local: preempt_enable OR migrate_enable + spin_unlock
> + - pw_queue_on: executes work function on caller cpu
> + - pw_flush: empty
> +
> +pw_get_cpu(work_struct), to be called from within per-cpu work function,
> +returns the target cpu.
CPU.
> +
> +On the locking functions above, there are the local locking functions
> +(pw_lock_local, pw_trylock_local and pw_unlock_local) that must only
> +be used to access per-CPU data from the CPU that owns that data,
> +and never remotely. They disable preemption/migration and don't require
> +a cpu parameter, making them a replacement for local_lock functions that
> +does not introduce overhead.
> +
> +These should only be used when accessing per-CPU data of the local CPU.
> +
Running "make htmldocs" with this patch says:
Documentation/locking/pwlocks.rst: WARNING: document isn't included in any toctree [toc.not_included]
> diff --git a/init/Kconfig b/init/Kconfig
> index 2937c4d308ae..3fb751dc4530 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -764,20 +764,55 @@ config CPU_ISOLATION
> depends on SMP
> default y
> help
> Make sure that CPUs running critical tasks are not disturbed by
> any source of "noise" such as unbound workqueues, timers, kthreads...
> Unbound jobs get offloaded to housekeeping CPUs. This is driven by
> the "isolcpus=" boot parameter.
>
> Say Y if unsure.
>
> +config PWLOCKS
> + bool "Per-CPU Work locks"
> + depends on SMP || COMPILE_TEST
> + default n
> + help
> + Allow changing the behavior on per-CPU resource sharing with cache,
> + from the regular local_locks() + queue_work_on(remote_cpu) to using
> + per-CPU spinlocks on both local and remote operations.
> +
> + This is useful to give user the option on reducing IPIs to CPUs, and
to give the user
> + thus reduce interruptions and context switches. On the other hand, it
> + increases generated code and will use atomic operations if spinlocks
> + are selected.
> +
> + If set, will use the default behavior set in PWLOCKS_DEFAULT unless boot
unless the boot
> + parameter pwlocks is passed with a different behavior.
> +
> + If unset, will use the local_lock() + queue_work_on() strategy,
> + regardless of the boot parameter or PWLOCKS_DEFAULT.
> +
> + Say N if unsure.
> +
> +config PWLOCKS_DEFAULT
> + bool "Use per-CPU spinlocks by default on PWLOCKS"
> + depends on PWLOCKS
> + default n
> + help
> + If set, will use per-CPU spinlocks as default behavior for per-CPU
> + remote operations.
> +
> + If unset, will use local_lock() + queue_work_on(cpu) as default
> + behavior for remote operations.
> +
> + Say N if unsure
unsure.
> +
> source "kernel/rcu/Kconfig"
>
> config IKCONFIG
> tristate "Kernel .config support"
> help
> This option enables the complete Linux kernel ".config" file
> contents to be saved in the kernel. It provides documentation
> of which kernel options are used in a running kernel or in an
> on-disk kernel. This information can be extracted from the kernel
> image file with the script scripts/extract-ikconfig and used as
--
~Randy
* [PATCH bpf-next v2] bpf, docs: add LOAD_ACQUIRE and STORE_RELEASE instructions
From: Alexis Lothoré (eBPF Foundation) @ 2026-05-20 22:09 UTC
To: David Vernet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Jonathan Corbet, Shuah Khan
Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, bpf, bpf, linux-doc,
linux-kernel, Alexis Lothoré (eBPF Foundation)
Commit 880442305a39 ("bpf: Introduce load-acquire and store-release
instructions") introduced the LOAD_ACQUIRE and STORE_RELEASE atomic
instruction modifiers. Those are currently not described in the
documentation, despite being used in the verifier and the various JIT
compilers supporting them.
Add the missing entries in the instruction set documentation.
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
Changes in v2:
- fix commit message typo
- clarify zero-extension and registers meaning for LOAD_ACQ
- clarify insn encoding for LOAD_ACQ and STORE_REL
- Link to v1: https://patch.msgid.link/20260520-bpf-insn-doc-v1-1-74d7dada9bfc@bootlin.com
To: David Vernet <void@manifault.com>
To: Alexei Starovoitov <ast@kernel.org>
To: Daniel Borkmann <daniel@iogearbox.net>
To: Andrii Nakryiko <andrii@kernel.org>
To: Martin KaFai Lau <martin.lau@linux.dev>
To: Eduard Zingerman <eddyz87@gmail.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: Song Liu <song@kernel.org>
To: Yonghong Song <yonghong.song@linux.dev>
To: Jiri Olsa <jolsa@kernel.org>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
Cc: ebpf@linuxfoundation.org
Cc: Bastien Curutchet <bastien.curutchet@bootlin.com>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: bpf@vger.kernel.org
Cc: bpf@ietf.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
.../bpf/standardization/instruction-set.rst | 27 +++++++++++++++-------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst
index 39c74611752b..e8b33374bd09 100644
--- a/Documentation/bpf/standardization/instruction-set.rst
+++ b/Documentation/bpf/standardization/instruction-set.rst
@@ -668,7 +668,8 @@ that use the ``ATOMIC`` mode modifier as follows:
part of the "atomic32" conformance group.
* ``{ATOMIC, DW, STX}`` for 64-bit operations, which are
part of the "atomic64" conformance group.
-* 8-bit and 16-bit wide atomic operations are not supported.
+* ``{ATOMIC, H, STX}`` (only for LOAD_ACQ/STORE_REL)
+* ``{ATOMIC, B, STX}`` (only for LOAD_ACQ/STORE_REL)
The 'imm' field is used to encode the actual atomic operation.
Simple atomic operation use a subset of the values defined to encode
@@ -695,22 +696,24 @@ arithmetic operations in the 'imm' field to encode the atomic operation:
*(u64 *)(dst + offset) += src
In addition to the simple atomic operations, there also is a modifier and
-two complex atomic operations:
+four complex atomic operations:
.. table:: Complex atomic operations
=========== ================ ===========================
imm value description
=========== ================ ===========================
- FETCH 0x01 modifier: return old value
- XCHG 0xe0 | FETCH atomic exchange
- CMPXCHG 0xf0 | FETCH atomic compare and exchange
+ FETCH 0x0001 modifier: return old value
+ XCHG 0x00e0 | FETCH atomic exchange
+ CMPXCHG 0x00f0 | FETCH atomic compare and exchange
+ LOAD_ACQ 0x0100 atomic load with barrier
+ STORE_REL 0x0110 atomic store with barrier
=========== ================ ===========================
The ``FETCH`` modifier is optional for simple atomic operations, and
-always set for the complex atomic operations. If the ``FETCH`` flag
-is set, then the operation also overwrites ``src`` with the value that
-was in memory before it was modified.
+always set for the ``XCHG`` and ``CMPXCHG`` complex atomic operations. If
+the ``FETCH`` flag is set, then the operation also overwrites ``src`` with
+the value that was in memory before it was modified.
The ``XCHG`` operation atomically exchanges ``src`` with the value
addressed by ``dst + offset``.
@@ -721,6 +724,14 @@ The ``CMPXCHG`` operation atomically compares the value addressed by
value that was at ``dst + offset`` before the operation is zero-extended
and loaded back to ``R0``.
+The ``LOAD_ACQ`` and ``STORE_REL`` operations allow using lighter load and
+store memory barriers rather than full barriers. The corresponding accesses
+must be aligned, but are allowed for any access size (8-bit up to 64-bit
+operations), with 8-bit and 16-bit ``LOAD_ACQ`` loaded values being
+zero-extended. As atomics are encoded as stores, the meaning of dst and src
+are different for ``LOAD_ACQ``, effectively using src as memory based
+pointer and dst as destination register for the fetched value.
+
64-bit immediate instructions
-----------------------------
---
base-commit: ceeb3aa37bff895116944acf4347fcded0b7692d
change-id: 20260520-bpf-insn-doc-756b369ca328
Best regards,
--
Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
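For illustration only, a small sketch of how the encoding described in the patch fits together: the opcode is ``BPF_STX | BPF_ATOMIC | size`` and 'imm' selects LOAD_ACQ or STORE_REL. The numeric constants match the values in the uapi BPF headers and the table above; the struct and helper names are hypothetical and are not part of the kernel or any library.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode pieces, same values as the uapi BPF headers. */
#define BPF_STX		0x03
#define BPF_W		0x00
#define BPF_H		0x08
#define BPF_B		0x10
#define BPF_DW		0x18
#define BPF_ATOMIC	0xc0

/* 'imm' values from the "Complex atomic operations" table. */
#define BPF_LOAD_ACQ	0x100
#define BPF_STORE_REL	0x110

/* Hypothetical unpacked view of an instruction (not struct bpf_insn). */
struct insn_sketch {
	uint8_t opcode;
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t off;
	int32_t imm;
};

static int size_is_valid(uint8_t size)
{
	return size == BPF_B || size == BPF_H || size == BPF_W || size == BPF_DW;
}

/* dst = load_acquire(*(size *)(src + off)): src is the pointer here. */
static struct insn_sketch load_acq(uint8_t size, uint8_t dst, uint8_t src,
				   int16_t off)
{
	assert(size_is_valid(size));
	return (struct insn_sketch){
		.opcode = BPF_STX | BPF_ATOMIC | size,
		.dst_reg = dst, .src_reg = src, .off = off,
		.imm = BPF_LOAD_ACQ,
	};
}

/* store_release(*(size *)(dst + off), src): usual store operand roles. */
static struct insn_sketch store_rel(uint8_t size, uint8_t dst, uint8_t src,
				    int16_t off)
{
	assert(size_is_valid(size));
	return (struct insn_sketch){
		.opcode = BPF_STX | BPF_ATOMIC | size,
		.dst_reg = dst, .src_reg = src, .off = off,
		.imm = BPF_STORE_REL,
	};
}
```

Both helpers produce the same opcode for a given size; only 'imm' and the roles of the two registers differ, which is the point the added documentation paragraph makes about ``LOAD_ACQ``.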
* Re: [PATCH v6 20/43] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
From: Ackerley Tng @ 2026-05-20 22:04 UTC
To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-20-91ab5a8b19a4@google.com>
Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:
> From: Sean Christopherson <seanjc@google.com>
>
> Now that guest_memfd supports tracking private vs. shared within gmem
> itself, allow userspace to specify INIT_SHARED on a guest_memfd instance
> for x86 Confidential Computing (CoCo) VMs, so long as per-VM attributes
> are disabled, i.e. when it's actually possible for a guest_memfd instance
> to contain shared memory.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> arch/x86/kvm/x86.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1560de1e95be0..6609957ecfea3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14172,14 +14172,13 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
> }
>
> #ifdef CONFIG_KVM_GUEST_MEMFD
> -/*
> - * KVM doesn't yet support initializing guest_memfd memory as shared for VMs
> - * with private memory (the private vs. shared tracking needs to be moved into
> - * guest_memfd).
> - */
> bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
> {
> - return !kvm_arch_has_private_mem(kvm);
> + /*
> + * INIT_SHARED isn't supported if the memory attributes are per-VM,
> + * in which case guest_memfd can _only_ be used for private memory.
> + */
> + return !vm_memory_attributes || !kvm_arch_has_private_mem(kvm);
Adding a note here from PUCK on 2026-05-20:
Michael pointed out that it's odd that when vm_memory_attributes is
available, guest_memfd still can only be used for private memory.
It is a little odd, but we don't want to investigate the complexities of
supporting it, and Sean says this is working as intended, in line with
deprecating vm_memory_attributes=true.
> }
>
> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
>
> --
> 2.54.0.563.g4f69b47b94-goog
* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Ackerley Tng @ 2026-05-20 21:44 UTC
To: Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTw-cUM=FrJevtSDtR7K6MwUfGfOx21LMFDn7DAy5bFzYw@mail.gmail.com>
Fuad Tabba <tabba@google.com> writes:
>
> [...snip...]
>
>> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> +{
>> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>> + struct inode *inode;
>> +
>> + /*
>> + * If this gfn has no associated memslot, there's no chance of the gfn
>> + * being backed by private memory, since guest_memfd must be used for
>> + * private memory, and guest_memfd must be associated with some memslot.
>> + */
>> + if (!slot)
>> + return 0;
>> +
>> + CLASS(gmem_get_file, file)(slot);
>> + if (!file)
>> + return 0;
>> +
>> + inode = file_inode(file);
>> +
>> + /*
>> + * Rely on the maple tree's internal RCU lock to ensure a
>> + * stable result. This result can become stale as soon as the
>> + * lock is dropped, so the caller _must_ still protect
>> + * consumption of private vs. shared by checking
>> + * mmu_invalidate_retry_gfn() under mmu_lock to serialize
>> + * against ongoing attribute updates.
>> + */
>> + return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
>> +}
>
> Doesn't this imply that all consumers of kvm_mem_is_private() should
> validate the result using mmu_lock and the invalidation sequence?
Let me know how I can improve the comment.
I think the "consumption" of private vs shared here actually means
something like "don't commit a page being faulted into page tables based
on the result of kvm_gmem_get_memory_attributes() without checking
kvm->mmu_invalidate_in_progress.", since a racing conversion may
complete before you commit.
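A minimal single-threaded sketch of that "snapshot, look up, re-check before commit" pattern. All names here are hypothetical stand-ins, not the real KVM structures or helpers; the stale check only mirrors the idea behind mmu_invalidate_retry_gfn(), and in KVM the check and commit happen under mmu_lock, which this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant per-VM state. */
struct vm_sketch {
	unsigned long invalidate_seq;	/* bumped when an invalidation ends */
	int invalidate_in_progress;	/* nonzero between begin and end */
	bool gfn_private;		/* attribute a conversion may flip */
	bool mapped;			/* a mapping was committed */
	bool mapped_private;		/* attribute the mapping was built on */
};

static void convert_begin(struct vm_sketch *vm)
{
	vm->invalidate_in_progress++;
}

static void convert_end(struct vm_sketch *vm, bool to_private)
{
	vm->gfn_private = to_private;
	vm->invalidate_seq++;
	vm->invalidate_in_progress--;
}

/* The lookup result is stale if a conversion is running or has completed. */
static bool fault_is_stale(const struct vm_sketch *vm, unsigned long seq)
{
	return vm->invalidate_in_progress || vm->invalidate_seq != seq;
}

/*
 * Snapshot the sequence, look up the attribute, then commit only if no
 * conversion raced. 'race' simulates a conversion landing in between.
 * Returns false when the caller has to retry the fault.
 */
static bool handle_fault(struct vm_sketch *vm,
			 void (*race)(struct vm_sketch *))
{
	unsigned long seq = vm->invalidate_seq;
	bool is_private = vm->gfn_private;	/* may go stale from here on */

	if (race)
		race(vm);

	if (fault_is_stale(vm, seq))
		return false;

	vm->mapped = true;
	vm->mapped_private = is_private;
	return true;
}

/* A complete private<->shared conversion, used as the racing event. */
static void racing_conversion(struct vm_sketch *vm)
{
	convert_begin(vm);
	convert_end(vm, !vm->gfn_private);
}
```

The lookup on its own is allowed to return a stale answer; what matters is that nothing gets committed on the basis of that answer once a conversion has started or finished in the meantime.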
kvm_mem_is_private() is used from these places:
1. Fault handling in KVM, like page_fault_can_be_fast(),
kvm_mmu_faultin_pfn(), kvm_mmu_page_fault(): this already handles the
entire mmu_lock and invalidation dance. No fault will be committed if
a racing conversion happened after kvm_mem_is_private() but before
the commit.
2. kvm_mmu_max_mapping_level(), when recovering huge pages after
disabling dirty logging: this path can't be reached with guest_memfd
today, since dirty logging can't be used with guest_memfd and
guest_memfd memslots are not updatable. Even if it could, it holds
mmu_lock until the huge page recovery is done. invalidate_begin
also zaps the pages in the range, so if the order of events is
| Thread A | Thread B |
|------------------------------|-------------------|
| invalidate_begin + zap | |
| update attributes maple_tree | recover huge page |
| invalidate_end | |
then recovery never sees the zapped pages: there is nothing to
recover, and so no kvm_mem_is_private() lookup.
3. kvm_arch_vcpu_pre_fault_memory()
This eventually calls kvm_tdp_mmu_page_fault(), which checks
is_page_fault_stale(), so it does check before committing.
Were there any other calls I missed?
> sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> mmu_lock and without any retry mechanism. Is that a problem?
>
Sean already replied on your actual question separately :)
> Cheers,
> /fuad
>
>
>>
>> [...snip...]
>>
* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Sean Christopherson @ 2026-05-20 20:39 UTC
To: Ackerley Tng
Cc: Fuad Tabba, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgGHHkvfJ-mn9rfDvS+_1ht08YatFWo-Swt+5wFSPnC9Nw@mail.gmail.com>
On Wed, May 20, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> >> > } else {
> >> > + /*
> >> > + * Memory attributes cannot be obtained from guest_memfd while
> >> > + * the MMU lock is held.
> >> > + */
> >> > + if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
> >> > + kvm_gmem_get_memory_attributes, kvm)) {
> >> > + return 0;
> >> > + }
> >> > +
> >>
> >> This directly takes the address of kvm_gmem_get_memory_attributes,
> >> which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
> >> ARCH=i386.
> >
> > And this bleeds guest_memfd implementation details into places they don't belong.
> > The right way to deal with this is to use lockdep_assert_not_held() in whatever
> > code mustn't run with mmu_lock held. E.g.
> >
> > diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> > index c9f155c2dc5c..3bea9c1137ef 100644
> > --- virt/kvm/guest_memfd.c
> > +++ virt/kvm/guest_memfd.c
> > @@ -547,6 +547,9 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> > struct inode *inode;
> >
> > + /* Comment goes here. */
> > + lockdep_assert_not_held(&kvm->mmu_lock);
> > +
> > /*
> > * If this gfn has no associated memslot, there's no chance of the gfn
> > * being backed by private memory, since guest_memfd must be used for
> >
> > But I'm confused, because kvm_gmem_get_memory_attributes() doesn't actually take
> > filemap_invalidate_lock(), so what exactly is the problem?
> >
>
> Ahh I can drop this patch now. kvm_gmem_get_memory_attributes() used to
> take the filemap_invalidate_lock(), but after Liam pointed out that
> the attributes maple tree should be using MT_FLAGS_USE_RCU, I stopped
> taking filemap_invalidate_lock() and forgot to undo this.
>
> I'll wait a bit for more reviews and then put out another revision
> without this patch.
If this is the only issue with v6, don't send a new version, I'll just drop it
when applying.
* Re: [PATCH v6 12/43] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Ackerley Tng @ 2026-05-20 20:35 UTC
To: Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTwOfJ=nCRoX3m5uDt=CcF_zrqGsZVBBmo4MscmPqrBxOA@mail.gmail.com>
Fuad Tabba <tabba@google.com> writes:
> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
>>
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> When memory in guest_memfd is converted from private to shared, the
>> platform-specific state associated with the guest-private pages must be
>> invalidated or cleaned up.
>>
>> Iterate over the folios in the affected range and call the
>> kvm_arch_gmem_invalidate() hook for each PFN range. This allows
>> architectures to perform necessary teardown, such as updating hardware
>> metadata or encryption states, before the pages are transitioned to the
>> shared state.
>>
>> Invoke this helper after indicating to KVM's mmu code that an invalidation
>> is in progress to stop in-flight page faults from succeeding.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> Minor nit below, but lgtm.
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
>
>> ---
>> virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 41 insertions(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 9d82642a025e9..baf4b88dead1f 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -603,6 +603,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>> return safe;
>> }
>>
>> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
>> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>> +{
>> + struct folio_batch fbatch;
>> + pgoff_t next = start;
>> + int i;
>> +
>> + folio_batch_init(&fbatch);
>> + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
>> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> + struct folio *folio = fbatch.folios[i];
>> + pgoff_t start_index, end_index;
>> + kvm_pfn_t start_pfn, end_pfn;
>> +
>> + start_index = max(start, folio->index);
>> + end_index = min(end, folio_next_index(folio));
>> + /*
>> + * end_index is either in folio or points to
>> + * the first page of the next folio. Hence,
>> + * all pages in range [start_index, end_index)
>> + * are contiguous.
>> + */
>> + start_pfn = folio_file_pfn(folio, start_index);
>> + end_pfn = start_pfn + end_index - start_index;
>> +
>> + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
>> + }
>> +
>> + folio_batch_release(&fbatch);
>> + cond_resched();
>> + }
>> +}
>> +#else
>> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
>> +#endif
>> +
>> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> size_t nr_pages, uint64_t attrs,
>> pgoff_t *err_index)
>> @@ -643,7 +679,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> */
>>
>> kvm_gmem_invalidate_begin(inode, start, end);
>> +
>> + if (!to_private)
>> + kvm_gmem_invalidate(inode, start, end);
>> +
>> mas_store_prealloc(&mas, xa_mk_value(attrs));
>> +
>
> Why the unrelated extra space?
>
Hmm this space provides vertical space between invalidate_{begin,end}
and the STUFF it wraps, like
invalidate_begin
STUFF
invalidate_end
More STUFF is going to go here in future patch series, such as splitting
private pages in TDX.
>> kvm_gmem_invalidate_end(inode, start, end);
>> out:
>> filemap_invalidate_unlock(mapping);
>>
>> --
>> 2.54.0.563.g4f69b47b94-goog
>>
>>
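For readers following the range arithmetic in the quoted helper, here is a minimal user-space sketch of the clamp-and-translate step. It is not the kernel code: struct demo_folio, struct pfn_range and demo_folio_pfn_range() are hypothetical stand-ins, and the folio is modelled simply as a run of physically contiguous pages.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a folio: a run of physically contiguous pages. */
struct demo_folio {
	uint64_t index;    /* file offset (in pages) of the folio's first page */
	uint64_t nr_pages; /* number of pages in the folio */
	uint64_t base_pfn; /* PFN of the folio's first page */
};

struct pfn_range {
	uint64_t start_pfn; /* inclusive */
	uint64_t end_pfn;   /* exclusive */
};

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/*
 * Clamp the file range [start, end) to the folio, then translate the
 * clamped range to PFNs. The caller must pass a range that overlaps
 * the folio, as a folio lookup over [start, end) would guarantee.
 */
static struct pfn_range demo_folio_pfn_range(const struct demo_folio *folio,
					     uint64_t start, uint64_t end)
{
	uint64_t next_index = folio->index + folio->nr_pages;
	uint64_t start_index = max_u64(start, folio->index);
	uint64_t end_index = min_u64(end, next_index);
	struct pfn_range r;

	assert(start_index < end_index);

	/* Pages inside one folio are contiguous, so offsets map 1:1 to PFNs. */
	r.start_pfn = folio->base_pfn + (start_index - folio->index);
	r.end_pfn = r.start_pfn + (end_index - start_index);
	return r;
}
```

With a four-page folio at index 8 and a request for [10, 20), the clamped range covers only the last two pages of the folio, which is the "end_index is either in folio or points to the first page of the next folio" case from the quoted comment.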
^ permalink raw reply
* Re: [PATCH v8 0/8] KVM: x86: nSVM: Improve PAT virtualization
From: Yosry Ahmed @ 2026-05-20 20:30 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm, linux-doc, linux-kernel, Jim Mattson
In-Reply-To: <177915062620.2226127.1264745848157211491.b4-ty@google.com>
On Mon, May 18, 2026 at 05:41:06PM -0700, Sean Christopherson wrote:
> On Tue, 07 Apr 2026 12:03:23 -0700, Jim Mattson wrote:
> > Currently, KVM's implementation of nested SVM treats the PAT MSR the same
> > way whether or not nested NPT is enabled: L1 and L2 share a single
> > PAT. However, the AMD APM specifies that when nested NPT is enabled, the host
> > (L1) and the guest (L2) should have independent PATs: hPAT for L1 and gPAT
> > for L2.
> >
> > This patch series implements independent PATs for L1 and L2 when nested NPT
> > is enabled, but only when a new quirk, KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT,
> > is disabled. By default, the quirk is enabled, preserving KVM's legacy
> > behavior. When the quirk is disabled, KVM correctly virtualizes a separate
> > PAT register for L2, using the g_pat field in the VMCB.
> >
> > [...]
>
> Applied to kvm-x86 svm. Yosry and/or Jim, please double check the result, the
> goof with patch 5 was slightly more annoying than I was expecting.
The result looks good to me. I also ran the selftest from v7 and it
passes. I couldn't stop myself from reworking it and cleaning it up; I
will send a patch your way soon.
>
> Thanks!
>
> [1/8] KVM: x86: Define KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT
> https://github.com/kvm-x86/linux/commit/822790ab0149
> [2/8] KVM: x86: nSVM: Clear VMCB_NPT clean bit when updating hPAT from guest mode
> https://github.com/kvm-x86/linux/commit/0a8aeb15848e
> [3/8] KVM: x86: nSVM: Cache and validate vmcb12 g_pat
> https://github.com/kvm-x86/linux/commit/4b83e4ba836e
> [4/8] KVM: x86: nSVM: Set vmcb02.g_pat correctly for nested NPT
> https://github.com/kvm-x86/linux/commit/02233c73f8ae
> [6/8] KVM: x86: nSVM: Save gPAT to vmcb12.g_pat on VMEXIT
> https://github.com/kvm-x86/linux/commit/d65cf222b899
> [7/8] KVM: Documentation: document KVM_{GET,SET}_NESTED_STATE for SVM
> https://github.com/kvm-x86/linux/commit/32ebdbce3b23
> [8/8] KVM: x86: nSVM: Save/restore gPAT with KVM_{GET,SET}_NESTED_STATE
> https://github.com/kvm-x86/linux/commit/4f256d5770fe
>
> --
> https://github.com/kvm-x86/linux/tree/next
^ permalink raw reply
* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Ackerley Tng @ 2026-05-20 20:25 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ag3DawWCcrpCkD0e@google.com>
Sean Christopherson <seanjc@google.com> writes:
>
> [...snip...]
>
>> > @@ -3357,6 +3357,15 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> > max_level = fault->max_level;
>> > is_private = fault->is_private;
>> > } else {
>> > + /*
>> > + * Memory attributes cannot be obtained from guest_memfd while
>> > + * the MMU lock is held.
>> > + */
>> > + if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
>> > + kvm_gmem_get_memory_attributes, kvm)) {
>> > + return 0;
>> > + }
>> > +
>>
>> This directly takes the address of kvm_gmem_get_memory_attributes,
>> which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
>> ARCH=i386.
>
> And this bleeds guest_memfd implementation details into places they don't belong.
> The right way to deal with this is to use lockdep_assert_not_held() in whatever
> code mustn't run with mmu_lock held. E.g.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index c9f155c2dc5c..3bea9c1137ef 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -547,6 +547,9 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> struct inode *inode;
>
> + /* Comment goes here. */
> + lockdep_assert_not_held(&kvm->mmu_lock);
> +
> /*
> * If this gfn has no associated memslot, there's no chance of the gfn
> * being backed by private memory, since guest_memfd must be used for
>
> But I'm confused, because kvm_gmem_get_memory_attributes() doesn't actually take
> filemap_invalidate_lock(), so what exactly is the problem?
>
Ahh I can drop this patch now. kvm_gmem_get_memory_attributes() used to
take the filemap_invalidate_lock(), but after Liam pointed out that
the attributes maple tree should be using MT_FLAGS_USE_RCU, I stopped
taking filemap_invalidate_lock() and forgot to undo this.
I'll wait a bit for more reviews and then put out another revision
without this patch.
>> > max_level = PG_LEVEL_NUM;
>> > is_private = kvm_mem_is_private(kvm, gfn);
>> > }
>> >
>> > --
>> > 2.54.0.563.g4f69b47b94-goog
>> >
>> >
^ permalink raw reply
* Re: [PATCH v16 01/38] tpm: Initial step to reorganize TPM public headers
From: Daniel P. Smith @ 2026-05-20 20:12 UTC (permalink / raw)
To: Dave Hansen, Jason Gunthorpe, Jarkko Sakkinen
Cc: Ross Philipson, linux-kernel, x86, linux-integrity, linux-doc,
linux-crypto, kexec, linux-efi, iommu, tglx, mingo, bp, hpa,
dave.hansen, ardb, mjg59, James.Bottomley, peterhuewe, luto,
nivedita, herbert, davem, corbet, ebiederm, dwmw2, baolu.lu,
kanth.ghatraju, daniel.kiper, andrew.cooper3, trenchboot-devel
In-Reply-To: <8f8328d3-a7db-4a64-82f8-1e0e3eff93cf@intel.com>
On 5/15/26 7:10 PM, Dave Hansen wrote:
> On 5/15/26 16:05, Jason Gunthorpe wrote:
>> Can we please split out and progress the TPM reorg mini-series at the
>> front?
>
> Yes, please.
>
Yes, we will split this out, address the Sashiko comments, and work on
getting it out next week.
> Any way to break this down and merge in more bite-size pieces would be
> better for everyone involved.
I went through the TPM-only part of the series, and only one commit was
of substantial length (not counting those that only move headers
around). That's the new buffer allocation patch from Jarkko. We can see
if there is a way to split it up without resulting in either unused code
or breaking the TPM driver.
V/r,
Daniel P. Smith
^ permalink raw reply
* Re: [PATCH v10 00/25] Runtime TDX module update support
From: Dave Hansen @ 2026-05-20 19:42 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, x86, linux-kernel, linux-rt-devel,
linux-doc
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, Jonathan Corbet, Shuah Khan
In-Reply-To: <20260520133909.409394-1-chao.gao@intel.com>
So a little cat|sort|uniq says:
11 Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
17 Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
We're on v10 here. There are 6 patches in here with no review tags:
M: Kiryl Shutsemau <kas@kernel.org>
R: Dave Hansen <dave.hansen@linux.intel.com>
R: Rick Edgecombe <rick.p.edgecombe@intel.com>
Are all of those 6 new patches? Or do the reviewers still have some work
to do?
^ permalink raw reply
* Re: [PATCH v4 1/3] drm/fdinfo: Add "evicted" memory accounting
From: Nicolas Frattaroli @ 2026-05-20 19:38 UTC (permalink / raw)
To: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Boris Brezillon, Steven Price, Liviu Dudau,
Jonathan Corbet, Shuah Khan, Tvrtko Ursulin
Cc: dri-devel, linux-kernel, kernel, linux-doc
In-Reply-To: <7c7242b8-eb22-41b1-8f04-f7abda62bb28@ursulin.net>
Hello Tvrtko,
On Wednesday, 20 May 2026 16:19:12 Central European Summer Time Tvrtko Ursulin wrote:
>
> On 20/05/2026 14:04, Nicolas Frattaroli wrote:
> > Currently, there's no way to know for certain how much GPU memory was
> > swapped out. The difference between total and resident memory would
> > include newly allocated pages, which are not resident, but also aren't
> > swapped out.
> >
> > Add a new drm_gem_object_status so drivers can signal when an object has
> > been evicted to swap, and add a new "evicted" counter to
> > drm_memory_stats.
> >
> > Due to how the supported_flags bitmask is determined, the "evicted"
> > count won't be printed to fdinfo if there's no swapped out pages.
> >
> > Reviewed-by: Steven Price <steven.price@arm.com>
> > Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
> > ---
> > Documentation/gpu/drm-usage-stats.rst | 6 ++++++
> > drivers/gpu/drm/drm_file.c | 8 ++++++++
> > include/drm/drm_file.h | 2 ++
> > include/drm/drm_gem.h | 2 ++
> > 4 files changed, 18 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
> > index 70b7cfcc194f..ac1dbf52d96d 100644
> > --- a/Documentation/gpu/drm-usage-stats.rst
> > +++ b/Documentation/gpu/drm-usage-stats.rst
> > @@ -202,6 +202,12 @@ One practical example of this could be the presence of unsignaled fences in a
> > GEM buffer reservation object. Therefore, the active category is a subset of the
> > resident category.
> >
> > +- drm-evicted-<region>: <uint> [KiB|MiB]
> > +
> > +The total size of buffers that have been evicted and are no longer pinned by the
> > +device. Only present if there are buffers that are currently evicted, and if the
> > +driver implements reporting of this type of memory.
>
> The semantics are tricky to make work in an obvious way.
>
> On one hand the text above is almost exactly the semantics of 'total' -
> 'resident'. Almost meaning it was resident at some point, but isn't any
> more. Whereas raw 'total' - 'resident' can also mean it never has been
> instantiated.
Yes, that is the difference. You cannot tell them apart otherwise.
> You could even have a "workaround" where you report a 'swap' memory
> region and then don't need to add anything new to the spec.
I get the idea that technically, swap is its own memory region, but
evicted is counting memory that panthor knows is currently evicted,
not necessarily memory that is in swap. Counting pages that would
*actually* be in swap would probably involve breaking several
abstractions that shouldn't be broken.
>
> Next problem - on paper evicted could be useful to replace driver legacy
> keys such as 'amd-evicted-ram'. But that "evicted" is defined as "not in
> the preferred placement", while your evicted is more like "no current
> placement" (as in, no GPU accessible backing storage).
>
> Is it possible to find a definition of this new category which makes
> sense for different GPUs/drivers, be it integrated or discrete?
Sure, we can make this definition as loose as you need it to be to use
it in a different driver. I think the difference between "not in a
preferred placement" and "no current placement (but had a placement in
the past)" is not a big one for the users of this information; the goal
is to see how much of the GPU memory of a process has been made non-
resident by a shrinker.
> Or would simply going for 'drm-total-swap:' (or resident?) work for
> panthor? Advantage being it would also work unambiguously for discrete
> drivers.
Panthor itself doesn't really know whether something is in swap or
has just been made non-resident by the drm shrinker. It could be
somewhere between swap and resident, as Steven Price pointed out.
>
> Like the ones which support multiple TTM placements, for example VRAM +
> SYSTEM and then next step is swapping out so an extreme example on a
> 16GiB GPU + 16GiB RAM machine with a 32GiB gfx workload could be like:
>
> drm-total-vram: 32GiB
> drm-resident-vram: 16GiB
> drm-resident-system: 15GiB
> drm-total-swap: 1GiB
>
> Does this look clear enough? Whereas with the "evicted" category it
> would be:
>
> drm-total-vram: 32GiB
> drm-resident-vram: 16GiB
> drm-evicted-vram: 16GiB # portion which got demoted to system RAM
> drm-resident-system: 15GiB
> drm-evicted-system: 1GiB # portion which got demoted to swap
>
> Where drm-evicted-vram is redundant to "total - resident". And its
> semantics are overloaded, since where evicted memory goes depends on
> the GPU/driver/region.
"drm-evicted-vram" is only redundant to "total - resident" if objects
that have never been backed by any pages aren't counted in total. This
is not the case, so I'm adding "evicted" to account for pages that were
backed at some stage but aren't backed anymore.
I think "evicted" solves this problem generally in your second example,
without me having to worry about whether a page is in swap or AMD's
memory model.
So, to summarise:
- Panthor does not know how much of the memory that was reclaimed by the
shrinker is actually in swap space, so "drm-total-swap" wouldn't work
here.
- "total - resident" measures the wrong thing. Objects that have never
been backed are not evicted.
- I am completely fine with AMD not using this fdinfo memory type due
to having more complex eviction handling, but I don't see why it
could not be used in this form.
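To make the second point concrete, here is a small user-space sketch of the accounting. It is a simplified model, not the DRM core code: the demo_* names are hypothetical, "total" is a single sum here, and only the RESIDENT and EVICTED bits are modelled. The supported mask is the OR of every status bit seen, which is my reading of how supported_status ends up gating what gets printed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified status bits; the bit positions follow the quoted patch. */
enum demo_status {
	DEMO_RESIDENT = 1u << 0,
	DEMO_EVICTED  = 1u << 3,
};

/* Hypothetical stand-in for a GEM object. */
struct demo_obj {
	uint64_t size;
	unsigned int status;
};

struct demo_stats {
	uint64_t total;
	uint64_t resident;
	uint64_t evicted;
	unsigned int supported; /* OR of every status bit seen so far */
};

/* Walk all objects once and accumulate per-category sizes. */
static struct demo_stats demo_account(const struct demo_obj *objs, size_t n)
{
	struct demo_stats st = {0};

	for (size_t i = 0; i < n; i++) {
		st.supported |= objs[i].status;
		st.total += objs[i].size;
		if (objs[i].status & DEMO_RESIDENT)
			st.resident += objs[i].size;
		if (objs[i].status & DEMO_EVICTED)
			st.evicted += objs[i].size;
	}
	return st;
}

/* A category is only worth printing if some object reported its bit. */
static int demo_should_print_evicted(const struct demo_stats *st)
{
	return (st->supported & DEMO_EVICTED) != 0;
}
```

With one resident object, one evicted object and one that was never backed, "total - resident" also counts the never-backed object, so it overstates what the shrinker reclaimed, while the evicted counter does not.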
>
> Thoughts, opinions?
>
> Regards,
>
> Tvrtko
>
> > +
> > Implementation Details
> > ======================
> >
> > diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> > index ec820686b302..5078172976c0 100644
> > --- a/drivers/gpu/drm/drm_file.c
> > +++ b/drivers/gpu/drm/drm_file.c
> > @@ -868,6 +868,7 @@ int drm_memory_stats_is_zero(const struct drm_memory_stats *stats)
> > stats->private == 0 &&
> > stats->resident == 0 &&
> > stats->purgeable == 0 &&
> > + stats->evicted == 0 &&
> > stats->active == 0);
> > }
> > EXPORT_SYMBOL(drm_memory_stats_is_zero);
> > @@ -901,6 +902,10 @@ void drm_print_memory_stats(struct drm_printer *p,
> > if (supported_status & DRM_GEM_OBJECT_PURGEABLE)
> > drm_fdinfo_print_size(p, prefix, "purgeable", region,
> > stats->purgeable);
> > +
> > + if (supported_status & DRM_GEM_OBJECT_EVICTED)
> > + drm_fdinfo_print_size(p, prefix, "evicted", region,
> > + stats->evicted);
> > }
> > EXPORT_SYMBOL(drm_print_memory_stats);
> >
> > @@ -954,6 +959,9 @@ void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file)
> >
> > if (s & DRM_GEM_OBJECT_PURGEABLE)
> > status.purgeable += add_size;
> > +
> > + if (s & DRM_GEM_OBJECT_EVICTED)
> > + status.evicted += add_size;
> > }
> > spin_unlock(&file->table_lock);
> >
> > diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> > index 6ee70ad65e1f..7e4cb45a52c3 100644
> > --- a/include/drm/drm_file.h
> > +++ b/include/drm/drm_file.h
> > @@ -500,6 +500,7 @@ void drm_send_event_timestamp_locked(struct drm_device *dev,
> > * @resident: Total size of GEM objects backing pages
> > * @purgeable: Total size of GEM objects that can be purged (resident and not active)
> > * @active: Total size of GEM objects active on one or more engines
> > + * @evicted: Total size of GEM objects that have been evicted
> > *
> > * Used by drm_print_memory_stats()
> > */
> > @@ -509,6 +510,7 @@ struct drm_memory_stats {
> > u64 resident;
> > u64 purgeable;
> > u64 active;
> > + u64 evicted;
> > };
> >
> > enum drm_gem_object_status;
> > diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> > index 86f5846154f7..799588a2762a 100644
> > --- a/include/drm/drm_gem.h
> > +++ b/include/drm/drm_gem.h
> > @@ -53,6 +53,7 @@ struct drm_gem_object;
> > * @DRM_GEM_OBJECT_RESIDENT: object is resident in memory (ie. not unpinned)
> > * @DRM_GEM_OBJECT_PURGEABLE: object marked as purgeable by userspace
> > * @DRM_GEM_OBJECT_ACTIVE: object is currently used by an active submission
> > + * @DRM_GEM_OBJECT_EVICTED: object is evicted and no longer pinned by driver
> > *
> > * Bitmask of status used for fdinfo memory stats, see &drm_gem_object_funcs.status
> > * and drm_show_fdinfo(). Note that an object can report DRM_GEM_OBJECT_PURGEABLE
> > @@ -67,6 +68,7 @@ enum drm_gem_object_status {
> > DRM_GEM_OBJECT_RESIDENT = BIT(0),
> > DRM_GEM_OBJECT_PURGEABLE = BIT(1),
> > DRM_GEM_OBJECT_ACTIVE = BIT(2),
> > + DRM_GEM_OBJECT_EVICTED = BIT(3),
> > };
> >
> > /**
> >
>
>
^ permalink raw reply
* Re: [RFC PATCH 0/7] mm/damon: hardware-sampled access reports + AMD IBS Op example
From: Ravi Jonnalagadda @ 2026-05-20 19:01 UTC (permalink / raw)
To: SeongJae Park
Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
ajayjoshi, honggyu.kim, yunjeong.mun, bharata, Akinobu Mita
In-Reply-To: <20260519061905.89681-1-sj@kernel.org>
On Mon, May 18, 2026 at 11:19 PM SeongJae Park <sj@kernel.org> wrote:
>
> + Akinobu
>
> Hello Ravi,
>
> On Sat, 16 May 2026 15:34:25 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > Hi all,
> >
> > This is an RFC, not for merge. The series exercises and validates
> > damon_report_access() -- the consumer API SeongJae introduced in [1]
> > -- as a substrate for ingesting access reports from hardware-sampling
> > sources. The series includes one worked-example backend, an AMD IBS
> > Op module (damon_ibs.ko), that runs on Zen 3+ silicon via the
> > existing perf event subsystem.
>
> Thank you for sharing this great RFC series!
>
> [...]
> > Why a hardware-source primitive complements existing primitives
> > ===============================================================
> [...]
> > Both primitives produce a view of hotness that converges to the
> > true distribution over the aggregation interval. For systems where
> > the address space is small relative to the aggregation rate, this is
> > the right tool. On large heterogeneous-memory systems with goal-
> > driven schemes asking the closed-loop tuner to converge on a target
> > distribution, a complementary lower-latency view of accesses can
> > tighten the loop -- reducing the time DAMON's nr_accesses takes to
> > reflect the workload's actual access distribution, which in turn
> > reduces ramp duration and oscillation amplitude during convergence
> > of goal-driven schemes.
> >
> > A hardware-sampling primitive provides this complementary view:
> > hardware retirement records each access at its natural event rate,
> > with a physical address per sample, independent of TLB state and
> > independent of the unmap/fault path.
>
> Yes, I fully agree. Different multiple access check primitives have different
> characteristics.
>
> [...]
>
> > Demonstration
> > =============
> [...]
> > In both regimes, convergence to target is quick, and the workload's
> > measured DRAM share then holds within 1.3 percentage points of
> > target with standard deviation under 1.3 percentage points, sustained
> > over runs of 15-30 minutes per target.
>
> I understand this demonstration shows your AMD IBS-based version of DAMON is
> functioning as expected. Thank you for sharing this!
>
> [...]
> > What's in this series
> > =====================
> >
> > Patch 1. mm/damon/core: refcount ops owner module to prevent
> > rmmod UAF
> > Patch 2. mm/damon/paddr: export damon_pa_* ops for IBS module
> > Patch 3. mm/damon/core: replace mutex-protected report buffer
> > with per-CPU lockless ring
> > Patch 4. mm/damon/core: flat-array snapshot + bsearch in ring-
> > drain loop
> > Patch 5. mm/damon: add sysfs binding and dispatch hookup for
> > paddr_ibs operations
> > Patch 6. mm/damon/core: accept paddr_ibs in node_eligible_mem_bp
> > ops check
> > Patch 7. mm/damon/damon_ibs: add AMD IBS-based access sampling
> > backend
> >
> > Patches 1, 3, and 4 are general infrastructure that benefits any
> > consumer of damon_report_access(). Patches 2, 5, 6, and 7 are the
> > worked-example backend (paddr_ibs ops, sysfs binding, IBS module).
>
> I didn't read the detailed code of each patch. But my high level understanding
> is as below.
>
> Patches 1 and 2 are needed for supporting loadable module-based DAMON operation
> sets (access sampling backend).
>
> Patch 3 is needed for supporting access check primitives that can provide the
> access information only in NMI context. It can also speed up access
> reporting in general, though.
>
> Patch 4 makes DAMON's internal reported access information retrieval faster, so
> will help any reporting-based DAMON operation set use case.
>
> Patches 5-7 are required for only the IBS-based DAMON operations set
> (paddr_ibs).
>
> So I agree patch 4 is a general infrastructure improvement that benefits
> multiple use cases.
>
> Patch 3 is also arguably general infrastructure improvement, as it will make
> the reporting faster in general.
>
> Patch 1 is not technically coupled with paddr_ibs, and will be needed for
> general loadable module based access check primitives. But, should we support
> loadable modules? If so, why?
>
> Patch 2 is also not technically coupled with paddr_ibs, to my understanding, so
> should be categorized together with patch 1? In other words, if we agree we
> should support loadable-module-based DAMON operation sets, this should be
> useful for not only paddr_ibs but more general cases.
>
> Correct me if I'm wrong.
>
> >
> >
> > Patches worth folding into damon/next
> > =====================================
> >
> > Patches 1, 3, and 4 are not specific to IBS or to this RFC's
> > backend. Each is preparatory infrastructure that any consumer of
> > damon_report_access() will need:
> >
> > - Patch 1 (refcount ops owner) -- any modular ops set, including
> > out-of-tree backends, needs clean module unload to avoid UAF
> > on damon_unregister_ops.
> > - Patch 3 (per-CPU lockless ring) -- damon_report_access() cannot
> > be called from NMI context with the current mutex-protected
> > buffer. Hardware samplers all need NMI-safe submission.
> > - Patch 4 (flat-array snapshot + bsearch drain) -- the linear-
> > scan drain is O(reports x regions) and exceeds the sample
> > interval at high-CPU x large-region products. Bsearch brings
> > it to O(reports x log regions).
> >
> > If these belong directly on damon/next as preparatory patches for
> > damon_report_access() rather than living inside an IBS-specific
> > track, we are happy to rebase and resend them that way.
>
> So I'm a bit unsure about patch 1. If we don't have a plan to support
> loadable-module-based DAMON operation sets, we might not need it for now.
>
> For patches 3 and 4, I agree those will be useful in general. Nonetheless, I'd
> slightly prefer to do those optimizations in a later part of the long-term
> project.
>
> >
> >
> > Relation to prior and ongoing work
> > ==================================
> >
> > The IBS sampling pattern in patch 7 -- attr.config=0 to use IBS Op
> > default config, dc_phy_addr_valid filter, NMI-safe sample submission
> > -- is derived from concepts in Bharata B Rao's pghot RFC v5 [3].
> > The attribution header is in mm/damon/damon_ibs.c and the patch
> > carries a Suggested-by: trailer.
> >
> > Bharata's pghot v7 [4] introduces a different IBS driver targeting
> > the new IBS Memory Profiler (IBS-MProf) facility, which Bharata
> > describes as a facility "that will be present in future AMD
> > processors" -- a separate IBS instance from the one this RFC's
> > backend uses. This version of the driver, based on v5 [3], is an
> > example of how DAMON can benefit from the AMD IBS hardware source,
> > and it independently validates the importance of IBS information.
> > It is not meant to be merged in the current form.
> > @Bharata if you see a path where IBS samples can be consumed
> > by DAMON at some point, will be happy to collaborate.
> >
> > Akinobu Mita's perf-event-based access-check RFC [5] explores a
> > configurable perf-event-driven access source for DAMON. IBS has
> > vendor-specific MSR setup beyond what perf_event_attr alone
> > expresses (e.g. dc_phy_addr_valid filtering on the produced sample,
> > not on the perf attr), so the IBS path here appears complementary
> > to [5] -- operators choose based on whether their hardware sampler
> > fits stock perf or needs additional kernel-side setup.
>
> So apparently there are multiple approaches to develop and use h/w-based access
> monitoring. Akinobu and you are trying to do that using DAMON as the frontend,
> and have already made working prototypes. Other people have also shown
> interest in and willingness to contribute to this project. I 100%
> agree h/w-based access monitoring can be useful, and I of course think using
> DAMON as the frontend is the right approach. I'm all for getting this
> upstreamed.
>
> I was therefore spending time on thinking about in what long-term maintainable
> shape this capability can successfully be upstreamed. I suggested
> damon_report_access() as the internal interface between DAMON and the h/w-based
> access check primitives, and apparently we all (I, Ravi and Akinobu in this
> context) agreed. Akinobu thankfully revised his implementation based on
> damon_report_access() interface. Ravi also implemented this RFC based on the
> interface.
>
> After making the consensus with Akinobu, I was taking time on the user space
> interface. When I was discussing with Akinobu, my idea was extending the user
> interface for the page faults based monitoring v3 [1]. But, recently I decided
> to make this more general, so proposed data attributes monitoring extension [2]
> at LSFMMBPF. The patch series for the initial change [3] is merged into mm-new
> for more testing, today. The cover letter of the patch series is also sharing
> how it will be extended for h/w based access monitoring in long term.
>
> I of course want us to go in this direction. I believe you have already had a
> chance to look at the long-term plan and did not object because you don't
> strongly disagree with it. If that's not the case, please speak up.
>
Hi SJ,
One layering question I'd like to flag before the plan is written,
since it affects how this RFC's substrate slots in:
In [3], .apply_probes is a periodic per-region classifier driven
from kdamond_fn after .check_accesses, in process context, that
applies a (folio -> bool) predicate to each region's sampling_addr
and accounts the results in r->probe_hits[]. damon_report_access()
on the other hand is a per-event delivery callback into a per-CPU
buffer, called from the access source (NMI for IBS / PEBS / SPE,
process context for page-fault-based sources). These appear to
me to sit at different layers - delivery vs. classification.
The reason I want to confirm this: NMI context for HW samplers
precludes the operations .apply_probes can do today (no mutex, no
kmalloc, no sleep, no folio lookup that touches pte_lock). And
the data shape is inverted - .apply_probes asks "does region R's
sampling_addr have attribute A?", evaluated on the kdamond-chosen
address; an HW sample announces "PA Y was accessed at retirement
time T", arriving asynchronously and needing to find the region
it falls into. If access events end up routed through
.apply_probes in the long-term plan, the IBS / PEBS / SPE
backends would each need a deferral path under it (per-CPU ring
for NMI-safe submission, region mapping at drain time).
Happy to be wrong here if you see a unified shape that handles
both - just want to surface the constraint before the plan is
written.
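To illustrate the deferral path I mean, here is a minimal user-space sketch using C11 atomics. It is not the code from patches 3 and 4: the demo_* names, the single-producer/single-consumer ring and the fixed ring size are assumptions for illustration, and a real NMI-safe kernel version has more to handle (per-CPU placement, nesting, region changes between snapshot and drain).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u /* must be a power of two */

/* Hypothetical per-CPU ring: one producer (sample handler), one consumer. */
struct demo_ring {
	uint64_t slots[DEMO_RING_SIZE];
	atomic_uint head; /* written only by the producer */
	atomic_uint tail; /* written only by the consumer */
};

/* Hypothetical monitoring region covering [start, end). */
struct demo_region {
	uint64_t start;
	uint64_t end;
	unsigned int nr_accesses;
};

/* Delivery: no lock, no allocation, no sleeping. Drops the sample if full. */
static int demo_ring_push(struct demo_ring *r, uint64_t paddr)
{
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == DEMO_RING_SIZE)
		return -1;
	r->slots[head % DEMO_RING_SIZE] = paddr;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 0;
}

/* Binary search over regions sorted by start address and non-overlapping. */
static struct demo_region *demo_find_region(struct demo_region *regions,
					    size_t n, uint64_t paddr)
{
	size_t lo = 0, hi = n;

	assert(n == 0 || regions != NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (paddr < regions[mid].start)
			hi = mid;
		else if (paddr >= regions[mid].end)
			lo = mid + 1;
		else
			return &regions[mid];
	}
	return NULL;
}

/* Classification: runs later in process context; returns matched samples. */
static unsigned int demo_ring_drain(struct demo_ring *r,
				    struct demo_region *regions, size_t n)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
	unsigned int matched = 0;

	for (; tail != head; tail++) {
		struct demo_region *reg =
			demo_find_region(regions, n, r->slots[tail % DEMO_RING_SIZE]);

		if (reg) {
			reg->nr_accesses++;
			matched++;
		}
	}
	atomic_store_explicit(&r->tail, tail, memory_order_release);
	return matched;
}
```

The push side is all a sample handler needs to touch; the address-to-region lookup happens only at drain time, which is where the O(reports x log regions) cost from patch 4 comes from.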
On the loadable-module question for patches 1 and 2: agreed it's a
genuinely open architectural call, not just a paddr_ibs convenience.
- paddr_ibs (this RFC) targets the existing IBS Op facility on
Zen 3+ silicon via the perf event subsystem and uses a
vendor-specific overflow-handler filter that perf_event_attr
cannot express (dc_phy_addr_valid in IBS_OP_DATA3). Bharata's
pghot v7 [pghot-v7] introduces a separate IBS driver targeting
the new IBS-MProf facility on future AMD silicon via direct MSR
programming - not perf at all. These are two AMD-specific HW
samplers with non-overlapping silicon coverage and
non-overlapping kernel paths. A distro shipping a single kernel
image to a fleet with mixed silicon needs runtime-selectable
backends, which obj=y can't do across exclusive `depends on`
chains.
- Akinobu's perf-event RFC v3 [akinobu-v3] is a useful contrast:
it stays builtin because it's a generic configurable
perf_event_attr passthrough, no vendor-specific code in the
overflow handler. The tristate case is specifically for the
backends that need vendor logic outside perf_event_attr
(IBS dc_phy_addr_valid, future ARM SPE record-format
handling, future Intel PEBS DLA quirks if they need
kernel-side filtering beyond what perf delivers).
Bharata, I would value your perspective on two related questions. In
your long-term plan for pghot, do you see the legacy IBS Op path
(this RFC) staying as a DAMON-side backend, while the new IBS-MProf
path lands under pghot? Or do you envision both IBS facilities
eventually feeding through a common HW-sampler primitive (pghot or
DAMON), with the frontend selectable by user config? And on existing
Zen 3+ silicon: is the legacy IBS Op driver in this RFC the right
home for those processors going forward?
Thanks,
Ravi
> Assuming you don't have concern on the long term plan yet, I will take time to
> write down more formal and detailed plan. It will explain the overall roadmap,
> timeline and how we could collaborate. On top of that, we could further
> discuss.
>
> >
> >
> > Specific asks
> > =============
> >
> > To SeongJae:
> >
> > 1. Patches 1, 3, and 4 are infrastructure that benefits any consumer
> > of damon_report_access(), not just the IBS backend in this RFC.
> > Would these belong directly on damon/next as preparatory patches
> > for damon_report_access(), rather than living inside an
> > IBS-specific track? Happy to rebase and resend them that way if
> > you'd prefer that shape. Tested-by: tags can come along.
>
> I'm still thinking about how we can collaborate well. The answer for the above
> question would be a part of that. In other words, I have no good answer right
> now, sorry. Could you please give me more time to think more and share the
> plan? I will share the plan as another mail. On the thread, we could further
> discuss. Of course, we could have DAMON beer/coffee/tea chats [4] like
> additional discussions before/after/during the plan discussion.
>
> So, long story short, we agreed this project (h/w-based data access monitoring)
> should be upstreamed. But give me a little more time to think about how we
> will do it and collaborate. It will take some time. Please bear that in mind.
> Sorry for making you wait, but I am pretty sure and promise that we will
> eventually make it.
>
> [1] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org
> [2] https://lwn.net/Articles/1071256/
> [3] https://lore.kernel.org/20260518234119.97569-1-sj@kernel.org
> [4] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing
>
>
> Thanks,
> SJ
>
> [...]
^ permalink raw reply
* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Sean Christopherson @ 2026-05-20 18:59 UTC (permalink / raw)
To: Fuad Tabba
Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTw-cUM=FrJevtSDtR7K6MwUfGfOx21LMFDn7DAy5bFzYw@mail.gmail.com>
On Wed, May 20, 2026, Fuad Tabba wrote:
> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
> >
> > From: Sean Christopherson <seanjc@google.com>
> >
> > Implement kvm_gmem_get_memory_attributes() for guest_memfd to allow the KVM
> > core and architecture code to query per-GFN memory attributes.
> >
> > kvm_gmem_get_memory_attributes() finds the memory slot for a given GFN and
> > queries the guest_memfd file's attributes to determine if the page is
> > marked as private.
> >
> > If vm_memory_attributes is not enabled, there is no shared/private tracking
> > at the VM level. Install the guest_memfd implementation as long as
> > guest_memfd is enabled to give guest_memfd a chance to respond on
> > attributes.
> >
> > guest_memfd should look up attributes regardless of whether this memslot is
> > gmem-only since attributes are now tracked by gmem regardless of whether
> > mmap() is enabled.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> > include/linux/kvm_host.h | 2 ++
> > virt/kvm/guest_memfd.c | 31 +++++++++++++++++++++++++++++++
> > virt/kvm/kvm_main.c | 3 +++
> > 3 files changed, 36 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index c5ba2cb34e45c..28a54298d27db 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2557,6 +2557,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > struct kvm_gfn_range *range);
> > #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> >
> > +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
> > +
> > #ifdef CONFIG_KVM_GUEST_MEMFD
> > int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 5011d38820d0d..f055e058a3f28 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -509,6 +509,37 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> > return 0;
> > }
> >
> > +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > +{
> > + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> > + struct inode *inode;
> > +
> > + /*
> > + * If this gfn has no associated memslot, there's no chance of the gfn
> > + * being backed by private memory, since guest_memfd must be used for
> > + * private memory, and guest_memfd must be associated with some memslot.
> > + */
> > + if (!slot)
> > + return 0;
> > +
> > + CLASS(gmem_get_file, file)(slot);
> > + if (!file)
> > + return 0;
> > +
> > + inode = file_inode(file);
> > +
> > + /*
> > + * Rely on the maple tree's internal RCU lock to ensure a
> > + * stable result. This result can become stale as soon as the
> > + * lock is dropped, so the caller _must_ still protect
> > + * consumption of private vs. shared by checking
> > + * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> > + * against ongoing attribute updates.
> > + */
> > + return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
> > +}
>
> Doesn't this imply that all consumers of kvm_mem_is_private() should
> validate the result using mmu_lock and the invalidation sequence?
> sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> mmu_lock and without any retry mechanism. Is that a problem?
Yes, but my understanding is that sev_handle_rmp_fault() can tolerate false
positives and false negatives. It's not optimal, but it's "fine", and already
KVM's existing behavior, e.g. KVM gets the PFN and then smashes the RMP, without
ensuring the PFN is fresh.
Mike, is that all correct?
^ permalink raw reply
* Re: [PATCH v5 12/13] Documentation: ABI: testing: add docs for ad9910 sysfs entries
From: Rodrigo Alencar @ 2026-05-20 18:47 UTC (permalink / raw)
To: rodrigo.alencar, linux-iio, devicetree, linux-kernel, linux-doc,
linux-hardening
Cc: Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
David Lechner, Andy Shevchenko, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Philipp Zabel, Jonathan Corbet, Shuah Khan,
Kees Cook, Gustavo A. R. Silva
In-Reply-To: <20260517-ad9910-iio-driver-v5-12-31599c88314a@analog.com>
On 26/05/17 07:37PM, Rodrigo Alencar via B4 Relay wrote:
> From: Rodrigo Alencar <rodrigo.alencar@analog.com>
>
> Add custom ABI documentation file for the DDS AD9910 with sysfs entries to
> control Parallel Port, Digital Ramp Generator and OSK parameters.
...
> +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_frequency_offset
> +KernelVersion:
> +Contact: linux-iio@vger.kernel.org
> +Description:
> + For a channel that allows frequency control through buffers, this
> + represents the base frequency value in Hz. The actual output frequency
> + is derived from this offset combined with the processed buffer sample
> + value.
> +
> +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_frequency_scale
> +KernelVersion:
> +Contact: linux-iio@vger.kernel.org
> +Description:
> + For a channel that allows frequency control through buffers, this
> + represents the frequency modulation gain. This value multiplies the
> + buffer input sample value before it is added to a frequency offset.
> +
> +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_phase_offset
> +KernelVersion:
> +Contact: linux-iio@vger.kernel.org
> +Description:
> + For a channel that allows phase control through buffers, this
> + represents the base phase value in radians. The actual output phase is
> + derived from this offset combined with the processed buffer sample
> + value.
> +
> +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_scale_offset
> +KernelVersion:
> +Contact: linux-iio@vger.kernel.org
> +Description:
> + For a channel that allows amplitude control through buffers, this
> + represents the value for a base amplitude scale. The actual output
> + amplitude scale is derived from this offset combined with the processed
> + buffer sample value.
> +
This will become just offset with altcurrent channels. I noticed we have an IIO_PHASE
iio_chan_type; could we have an IIO_FREQUENCY too? The parallel port needs actual raw
frequency values in that case to be written to the DMA buffer.
Then we may have buffer capable channels for the parallel port:
  out_altcurrent120
    offset
  out_phase120
    offset
  out_frequency120
    scale
    offset
Problem is that the math for the actual frequency output is:
f_OUT = f_FTW + (f_RAW * FM)
where f_FTW is a base frequency (already scaled), FM is a
modulation gain and f_RAW is the contribution from the parallel
port, which is also already scaled:
f_RAW = RAW * f_SYSCLK / 2^32
f_FTW = FTW * f_SYSCLK / 2^32
so the above becomes:
f_OUT = (FTW * f_SYSCLK / 2^32) + (RAW * f_SYSCLK / 2^32) * FM
f_OUT = (FTW/FM + RAW) * f_SYSCLK * FM / 2^32
if I make:
SCALE = f_SYSCLK * FM / 2^32
OFFSET = FTW/FM
f_OUT = (OFFSET + RAW) * SCALE
That would work for an IIO_FREQUENCY channel type; the problem is that both
scale and offset would depend on the modulation gain (FM)... I suppose
scale should be setting that, and offset assumes it is constant so as to act
only on FTW.
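As a quick sanity check of the algebra above, here is a minimal Python sketch
that compares the direct formula with the (OFFSET + RAW) * SCALE form using
exact fractions. The function names and the sample values (a 1 GHz f_SYSCLK,
FTW, RAW, FM) are assumptions chosen for illustration only, not values taken
from the driver or the datasheet.

```python
from fractions import Fraction

TWO_32 = 2**32

def f_out_direct(ftw, raw, fm, f_sysclk):
    """f_OUT = f_FTW + (f_RAW * FM), each term scaled by f_SYSCLK / 2^32."""
    f_ftw = Fraction(ftw * f_sysclk, TWO_32)
    f_raw = Fraction(raw * f_sysclk, TWO_32)
    return f_ftw + f_raw * fm

def f_out_scale_offset(ftw, raw, fm, f_sysclk):
    """f_OUT = (OFFSET + RAW) * SCALE, with SCALE and OFFSET as defined above."""
    scale = Fraction(f_sysclk * fm, TWO_32)
    offset = Fraction(ftw, fm)
    return (offset + raw) * scale

# Hypothetical sample values, for illustration only.
F_SYSCLK = 1_000_000_000
FTW, RAW = 0x10000000, 1234

for fm in (1, 2, 4, 8):
    direct = f_out_direct(FTW, RAW, fm, F_SYSCLK)
    rewritten = f_out_scale_offset(FTW, RAW, fm, F_SYSCLK)
    assert direct == rewritten

# OFFSET = FTW/FM moves whenever FM changes, even though FTW did not.
assert Fraction(FTW, 4) != Fraction(FTW, 8)
```

Both forms agree for every FM tried, but the last assertion shows the coupling
discussed above: changing FM alone changes OFFSET.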
I suppose we can keep altcurrent for other modes as phase and frequency
can be attributes (knobs) for them. However, in parallel mode we are effectively
pushing frequency, phase or amplitude values into the buffer.
The polar destination is a corner case, but can be solved when both
phase and altcurrent channels are enabled. When that happens we can
change the scan_type with has_ext_scan_type = 1, so the 16-bit data
bus is split between the two.
With the above, all of those *_offset and *_scale custom ABI can be dropped.
--
Kind regards,
Rodrigo Alencar
^ permalink raw reply
* Re: [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size
From: Ravi Jonnalagadda @ 2026-05-20 18:37 UTC (permalink / raw)
To: SeongJae Park
Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <20260517184705.4652-1-sj@kernel.org>
On Sun, May 17, 2026 at 11:47 AM SeongJae Park <sj@kernel.org> wrote:
>
> On Sat, 16 May 2026 14:03:55 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > The CONSIST quota goal tuner initializes esz_bp to 0, producing an
> > effective quota size (esz) of 1 byte on the first tick.
> > damos_quota_is_full() rejects all regions when esz < min_region_sz
> > (default PAGE_SIZE = 4096), so no regions can be tried and no
> > feedback reaches the tuner — a bootstrapping deadlock.
>
> That depends on whether the goal is already [over]-achieved. If the goal is
> achieved, the tuner will think no change is needed, so keep the
> effectively-zero quota. If the goal is over-achieved, the tuner will think the
> DAMOS scheme should be less aggressive, but it is already effectively-zero
> quota, so keep having effectively-zero quota.
>
> If the goal is under-achieved, the logic will iteratively increase the internal
> esz (esz_bp), until it exceeds the min_region_sz, and finally start having some
> effect.
>
> So, unless the goal is already [over]-achieved, there is no deadlock. If the
> goal is already [over]-achieved, why would we want to make DAMOS do something?
>
> Am I missing something?
>
Hello SJ,
You're not missing anything; you're right. Stock DAMON's
feed-loop tuner ramps esz_bp out of the seed quickly under an
under-achieved goal -- on the order of ten-some ticks at the
1000ms reset_interval the in-tree DAMON modules use, so the
floor isn't gating anything that wouldn't bootstrap on its own.
No deadlock.
I owe a clearer accounting of where this patch and patch 1 came
from, since the same origin story applies to both. Both came
from a parallel debug effort and should not have been carried
into this set.
The work that produced this series came out of an effort to
enable hardware-sampled hotness as a DAMON access source -- the
companion AMD IBS RFC
https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@gmail.com/
-- and to characterise its closed-loop convergence with the
existing CONSIST tuner on a heterogeneous DRAM+CXL setup. Early
in that effort I was hitting random NMI-context hangs on the per-CPU
report path that prevented runs from completing, and while
debugging those hangs I wasn't sure which direction the
convergence anomalies were coming from -- the sampling backend,
the report-ring path, the tuner shape, or the quota controller.
I made two controller-side experiments as scaffolding while I
narrowed the problem down:
- A per-tick growth cap on the goal-feedback tuner
("max_delta_bp" at 5%/tick) to slow how fast esz could grow
on a transient. That cap stretches the bootstrap above
from ~13s to several minutes, so I added a floor at
min_region_sz to short-cut the bootstrap. That landed here
as patch 3.
- A separate access-rate seeding helper (clear-on-migration)
for the goal-feedback loop. Some early versions of that
helper left the access-rate fields in inconsistent states
and damon_moving_sum() landed in an underflow path I hadn't
seen before. I added a saturating-subtract guard to that
function. That landed here as patch 1.
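The bootstrap timings in the first item can be roughly reproduced with a small
Python sketch. It assumes, purely for illustration, that under a fully
under-achieved goal the stock tuner doubles the effective quota on each tick,
starting from the 1-byte seed, and that the experimental cap limited growth to
5% per tick. `ticks_to_reach` is a hypothetical helper, not a DAMON function.

```python
def ticks_to_reach(floor_bytes, growth_per_tick, start_bytes=1.0):
    """Count growth steps until the effective quota reaches floor_bytes."""
    esz = start_bytes
    ticks = 0
    while esz < floor_bytes:
        esz *= growth_per_tick
        ticks += 1
    return ticks

MIN_REGION_SZ = 4096  # default PAGE_SIZE, per the discussion above

# Assumed stock behaviour: esz doubles each tick while the goal is under-achieved.
stock_ticks = ticks_to_reach(MIN_REGION_SZ, growth_per_tick=2.0)

# Assumed capped behaviour: growth limited to 5% per tick (max_delta_bp).
capped_ticks = ticks_to_reach(MIN_REGION_SZ, growth_per_tick=1.05)

print(stock_ticks, capped_ticks)
```

Under these assumptions the doubling case needs 12 growth steps after the first
tick, in line with the ~13s figure at a 1000ms reset interval, while the 5% cap
needs on the order of 170 steps, i.e. minutes rather than seconds.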
Once cpuhp-related fixes on the per-CPU sampling path landed
and the NMI stability problem was actually resolved, the
convergence anomalies were tracked down to the sampling/ring
side, not the controller side. I dropped the max_delta_bp knob
and fixed the seeding helper to maintain its invariants.
Patches 1 and 3 were carried into this set even though their
justifications had gone away:
- Patch 1: with the seeding helper fixed, stock callers don't
reach the underflow path -- the invariant holds at every
aggregation boundary in stock DAMON, as you noted. Belongs
with the seeding helper if and when that work goes upstream.
- Patch 3 (this one): with max_delta_bp dropped, the slow
bootstrap doesn't happen -- the ~13s ramp is fast enough
that there's no problem to solve. Once stability was
sorted I also moved the closed-loop runs in the companion
RFC to the temporal tuner, where the bootstrap concern
this patch addresses doesn't even arise (esz_bp saturates
to ULONG_MAX immediately when score=0).
Apologies; both should have come out when max_delta_bp and the
seeding helper did.
Dropping patches 1 and 3.
Patches 2, 4, and 5 are independent of this scaffolding; I'll
reply on each thread separately with the relevant context.
Thanks again for the careful review.
Best,
Ravi
> I'd like to discuss this high level thing first, before digging deep into the
> details.
>
>
> Thanks,
> SJ
>
> [...]
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-20 18:29 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <ag3zr7-11FO3k-Wv@kernel.org>
On Wed, 2026-05-20 at 10:47 -0700, Oliver Upton wrote:
> On Wed, May 20, 2026 at 12:33:52AM +0100, David Woodhouse wrote:
> > On Tue, 2026-05-19 at 15:57 -0700, Oliver Upton wrote:
> > > What ifs and maybes do not meet the bar, in my opinion, for preserving
> > > bug emulation in KVM. Of course there could be a little flexibility with
> > > that but we need to have some way of discriminating between bug fixes
> > > and genuine guest expectations around the behavior of virtual hardware.
> >
> > I believe you have this completely backwards.
>
> No, I really don't.
>
> Leaving every bugfix that could _possibly_ have a guest-visible impact
> subject to drive-by scrutiny many years after the dust has settled is
> not an acceptable working dynamic. Especially since it would appear
> that the rest of the ecosystem has long since moved on from this
> particular issue.
That's reductio ad absurdum.
I can continue to work around this one internally, sure.
But I'm also concerned about the general case because not only did you
refuse it, but you *also* said that this change in guest-visible
behaviour "should've happened without a change to the revision number".
Which seems to indicate that not only are you being randomly
obstructive about a one-line fix, you *also* don't actually understand
the general concept of what is expected of KVM, which this
Documentation patch is intending to clarify.
It was *right* to bump the IIDR from 1 to 2 when this guest visible
behaviour was changed. The only problem was not letting userspace
select the old revision. I'm really concerned that we now appear to
have a regression of understanding of even the part we previously *did*
get right.
> If this matters to you so deeply then please, be part of the solution
> instead. You may find that reviewing patches leads to better outcomes
> than getting belligerent with the arm64 folks every time you guys
> decide to rebase your kernel. Hell, hypotheticals actually have a lot
> more weight in the context of a review. And if your testing is extensive
> enough to catch these sorts of subtleties, don't you think it's better
> done against mainline?
Yes. Definitely. That's why my series with the fixes is more *test*
than actual fix, giving a nice simple framework for any such changes in
future. It checks that GICR_CTLR_IR|GICR_CTLR_CES are visible only with
IIDR.rev=3 for example.
And we're making progress on the amount of downstream crap, but it
doesn't help when we seem to have an impedance mismatch on the very
question of what it means to support customers on KVM at scale. This
thread is not exactly encouraging my engineers to poke their heads
above the parapet.
> Maybe it's just me but I am left feeling disappointed that we all
> haven't found a productive way of working together. I've tried to bridge
> the gap here; we obviously need to do something that at least fixes the
> UAPI breakage. Although apparently we don't even care to meet that low
> of a bar.
>
> > A stable and mature platform doesn't get to play in its ivory tower and
> > randomly inflict breakage on guests because they "deserve it".
>
> Really? Aren't you asking for us to emulate something completely broken
> for you?
No. I'm asking for a path to be able to *fix* it.
As things stand, if I just drop these patches and launch guests on a
new kernel, those guests will see writable IGROUPR registers and may
try to use them. And then if I have to roll *back* a kernel deployment,
those guests may lose interrupts.
The *only* time a guest-visible feature (or bugfix, nobody cares about
the difference outside the ivory tower) can be enabled is when the
kernel deployment is finished and stable and *won't* be rolled back.
And *then* new launches (and reboots) can get it.
And one day, when the last guest which was launched *without* it is
finally rebooted and sees the new model, *then* maybe we no longer need
that one line if() statement to support IIDR version 1.
2018 was basically *yesterday*. And I'm kind of scared that I even have
to explain it.
^ permalink raw reply
* Re: [PATCH v3 04/12] x86,fs/resctrl: Program PLZA through kmode arch hooks
From: Babu Moger @ 2026-05-20 17:49 UTC (permalink / raw)
To: Luck, Tony
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, bp,
dave.hansen, skhan, x86, mingo, hpa, akpm, rdunlap,
pawan.kumar.gupta, feng.tang, dapeng1.mi, kees, elver, lirongqing,
paulmck, bhelgaas, seanjc, alexandre.chartre, yazen.ghannam,
peterz, chang.seok.bae, kim.phillips, xin, naveen,
thomas.lendacky, linux-doc, linux-kernel, eranian, peternewman,
sos-linux-ext-patches
In-Reply-To: <agzPTMvJ_LdEmKXe@agluck-desk3>
Hi Tony,
On 5/19/26 15:59, Luck, Tony wrote:
> On Thu, Apr 30, 2026 at 06:24:49PM -0500, Babu Moger wrote:
>> +void resctrl_arch_configure_kmode(cpumask_var_t cpu_mask, u32 closid, u32 rmid, bool enable)
>> +{
>> + union msr_pqr_plza_assoc plza = { 0 };
>> +
>> + plza.split.rmid = rmid;
>> + plza.split.rmid_en = 1;
>
> Shouldn't there be a parameter for the value of rmid_en?
I realized that behavior is not required; it was actually due to a
mistake in my v2 series implementation.
Below are the relevant definitions:
GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU:
The CLOSID is applied to kernel work, while the RMID used for monitoring
is inherited from the currently running user task.
No separate monitoring group is assigned for kernel work, so kernel
execution naturally inherits the user-space RMID.
GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU:
Both CLOSID and RMID are explicitly assigned to kernel work.
This allows assigning a dedicated monitoring group for kernel execution
and therefore requires a separate RMID.
Example: For GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU:
# mount -t resctrl resctrl /sys/fs/resctrl
# cat /sys/fs/resctrl/info/kernel_mode
[inherit_ctrl_and_mon:group=//]
global_assign_ctrl_inherit_mon_per_cpu:group=none
global_assign_ctrl_assign_mon_per_cpu:group=none
# mkdir /sys/fs/resctrl/ctrl1 (PQR_ASSOC closid=1 rmid=1)
This configures all the CPU threads to use closid=1 and rmid=1 for both
allocation and monitoring across user and kernel modes.
# echo "global_assign_ctrl_inherit_mon_per_cpu:group=ctrl1//" \
> /sys/fs/resctrl/info/kernel_mode
# cat /sys/fs/resctrl/info/kernel_mode
inherit_ctrl_and_mon:group=none
[global_assign_ctrl_inherit_mon_per_cpu:group=ctrl1//]
global_assign_ctrl_assign_mon_per_cpu:group=none
This overrides the previous configuration, and PQR_PLZA_ASSOC is written.
Possible options:
1. (closid=1, rmid_en=0, rmid=1)
   Here, hardware uses closid=1 for kernel work, but RMID tracking is
   disabled for kernel mode. As a result, reading RMID 1 reports only
   user-mode activity. This contradicts the definition of this mode,
   since kernel work is expected to inherit the user RMID for monitoring.
2. (closid=1, rmid_en=1, rmid=1)
   In this case, RMID tracking is enabled for both user and kernel modes.
   Reading RMID 1 reports combined user + kernel activity. This aligns
   with the expected inherit_monitoring behavior.
The preferred approach is to separate kernel monitoring by assigning it
a dedicated monitoring group and updating PQR_PLZA_ASSOC to use a
different RMID (e.g., closid=1, rmid_en=1, rmid=2). This is exactly the
behavior implemented by GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU.
Thanks
Babu
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Oliver Upton @ 2026-05-20 17:47 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <add71b6f61edc6357e1fddad83273b2cba697d10.camel@infradead.org>
On Wed, May 20, 2026 at 12:33:52AM +0100, David Woodhouse wrote:
> On Tue, 2026-05-19 at 15:57 -0700, Oliver Upton wrote:
> > What ifs and maybes do not meet the bar, in my opinion, for preserving
> > bug emulation in KVM. Of course there could be a little flexibility with
> > that but we need to have some way of discriminating between bug fixes
> > and genuine guest expectations around the behavior of virtual hardware.
>
> I believe you have this completely backwards.
No, I really don't.
Leaving every bugfix that could _possibly_ have a guest-visible impact
subject to drive-by scrutiny many years after the dust has settled is
not an acceptable working dynamic. Especially since it would appear
that the rest of the ecosystem has long since moved on from this
particular issue.
If this matters to you so deeply then please, be part of the solution
instead. You may find that reviewing patches leads to better outcomes
than getting belligerent with the arm64 folks every time you guys
decide to rebase your kernel. Hell, hypotheticals actually have a lot
more weight in the context of a review. And if your testing is extensive
enough to catch these sorts of subtleties, don't you think it's better
done against mainline?
Maybe it's just me but I am left feeling disappointed that we all
haven't found a productive way of working together. I've tried to bridge
the gap here; we obviously need to do something that at least fixes the
UAPI breakage. Although apparently we don't even care to meet that low
of a bar.
> A stable and mature platform doesn't get to play in its ivory tower and
> randomly inflict breakage on guests because they "deserve it".
Really? Aren't you asking for us to emulate something completely broken
for you?
Thanks,
Oliver
^ permalink raw reply
* Re: [PATCH RESEND bpf-next v10 2/8] bpf: clear list node owner and unlink before drop
From: Eduard Zingerman @ 2026-05-20 16:28 UTC (permalink / raw)
To: Kaitao Cheng
Cc: bpf, Alexei Starovoitov, linux-kernel, linux-doc, ast, memxor,
corbet, martin.lau, daniel, andrii, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, chengkaitao,
skhan, vmalik, linux-kselftest, martin.lau, clm, ihor.solodrai,
bot+bpf-ci
In-Reply-To: <47b928ac-25d9-481c-8764-8f840c2dcafa@linux.dev>
On Wed, 2026-05-20 at 17:55 +0800, Kaitao Cheng wrote:
> On 2026/5/20 06:56, Eduard Zingerman wrote:
> > On Mon, 2026-05-18 at 11:02 +0800, Kaitao Cheng wrote:
> >
> > [...]
> >
> > > > > > The patch does have a bug, however. To fix the issues we are seeing now,
> > > > > > I propose the additional changes below and would appreciate feedback.
> > > > > >
> > > > > > --- a/kernel/bpf/helpers.c
> > > > > > +++ b/kernel/bpf/helpers.c
> > > > > > @@ -2263,8 +2263,10 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> > > > > > if (!head->next || list_empty(head))
> > > > > > goto unlock;
> > > > > > list_for_each_safe(pos, n, head) {
> > > > > > - WRITE_ONCE(container_of(pos,
> > > > > > - struct bpf_list_node_kern, list_head)->owner, NULL);
> > > > > > + struct bpf_list_node_kern *node;
> > > > > > +
> > > > > > + node = container_of(pos, struct bpf_list_node_kern, list_head);
> > > > > > + WRITE_ONCE(node->owner, BPF_PTR_POISON);
> > > > > > list_move_tail(pos, &drain);
> > > > > > }
> > > > > > unlock:
> > > > > > @@ -2272,8 +2274,12 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> > > > > > __bpf_spin_unlock_irqrestore(spin_lock);
> > > > > >
> > > > > > while (!list_empty(&drain)) {
> > > > > > + struct bpf_list_node_kern *node;
> > > > > > +
> > > > > > pos = drain.next;
> > > > > > + node = container_of(pos, struct bpf_list_node_kern, list_head);
> > > > > > list_del_init(pos);
> > > > > > + WRITE_ONCE(node->owner, NULL);
> >
> > Is CPU allowed to reorder the stores in list_del_init() and WRITE_ONCE()?
> > If it is, I think there is a race here.
>
> Thanks for taking a close look at this. You are right that there is an
> ordering issue here, but I don't think the specific sequence illustrated
> by the example below is problematic.
>
> > Thread #1:
> > enter bpf_list_head_free()
> > acquire H1 lock
> > list_move_tail(pos, &drain); // reordered
> > <-- ip here -->
> > WRITE_ONCE(node->owner, BPF_PTR_POISON); // reordered
> >
> > Thread #2:
> >
> > acquire H1 lock
> > n = bpf_refcount_acquire()
> > release H1 lock
> > acquire H2 lock
> > enter __bpf_list_add()
> > <-- ip here -->
> > cmpxchg(&node->owner, NULL, BPF_PTR_POISON)
>
> Even if the stores from list_move_tail(pos, &drain) become visible before
> WRITE_ONCE(node->owner, BPF_PTR_POISON), node->owner is not NULL in that
> window. Before the WRITE_ONCE(), it still points to H1. After the WRITE_ONCE(),
> it is BPF_PTR_POISON. In both cases, __bpf_list_add() will fail:
>
> cmpxchg(&node->owner, NULL, BPF_PTR_POISON)
>
> because the old value is neither NULL nor expected to become NULL from this
> part of bpf_list_head_free().
>
>
> However, I agree that your original concern about the ordering between
> list_del_init() and WRITE_ONCE(node->owner, NULL) is valid for the later
> drain loop:
>
> list_del_init(pos);
> WRITE_ONCE(node->owner, NULL);
>
> Here owner == NULL is the signal that the node can be inserted into another
> list. Since WRITE_ONCE() does not provide release ordering, another CPU may
> observe owner == NULL and successfully acquire the node in __bpf_list_add()
> before the list_del_init() stores are visible. In that case __bpf_list_add()
> can link the node into H2, and the delayed stores from list_del_init() may
> then overwrite the node's list pointers and corrupt the H2 list.
>
> So the fix should be to publish owner == NULL with release ordering after the
> node has been fully unlinked, for example:
>
> ```
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -2279,7 +2279,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> pos = drain.next;
> node = container_of(pos, struct bpf_list_node_kern, list_head);
> list_del_init(pos);
> - WRITE_ONCE(node->owner, NULL);
> + /* Ensure __bpf_list_add() sees the node as unlinked. */
> + smp_store_release(&node->owner, NULL);
> /* The contained type can also have resources, including a
> * bpf_list_head which needs to be freed.
> */
> @@ -2607,7 +2608,8 @@ static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head,
> return NULL;
>
> list_del_init(n);
> - WRITE_ONCE(node->owner, NULL);
> + /* Ensure __bpf_list_add() sees the node as unlinked. */
> + smp_store_release(&node->owner, NULL);
> return (struct bpf_list_node *)n;
> }
> ```
>
> The existing cmpxchg() in __bpf_list_add() is a successful RMW with return
> value, so it is fully ordered and is sufficient on the acquire side.
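The owner-state argument quoted above can be sketched as a single-threaded
Python model. It is only an illustration under stated assumptions: `Node`,
`cmpxchg`, and `try_list_add` are hypothetical stand-ins for the kernel
structures and for __bpf_list_add(), and a sequential model cannot show the
memory-ordering problem itself, only which owner values let an insert proceed.

```python
POISON = object()  # stand-in for BPF_PTR_POISON

class Node:
    def __init__(self, owner=None):
        self.owner = owner  # None models owner == NULL

def cmpxchg(node, expected, new):
    """Model of cmpxchg(&node->owner, expected, new): returns the old value."""
    old = node.owner
    if old is expected:
        node.owner = new
    return old

def try_list_add(node, head):
    """Claim the node only if it is currently unowned, as the quoted cmpxchg does."""
    if cmpxchg(node, None, POISON) is not None:
        return False  # owned by another list, or poisoned mid-teardown
    node.owner = head  # assumed follow-up step: publish the new owner
    return True

H1, H2 = object(), object()

node = Node(owner=H1)
assert not try_list_add(node, H2)   # still owned by H1

node.owner = POISON                 # first loop of bpf_list_head_free()
assert not try_list_add(node, H2)   # poisoned: insert still refused

node.owner = None                   # drain loop: published after list_del_init()
assert try_list_add(node, H2)       # only now can H2 take the node
assert node.owner is H2
```

Because owner == NULL is the only state that admits an insert, it is the store
that has to be ordered after list_del_init(), which is what the
smp_store_release() change addresses.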
Hi Kaitao,
Thank you for the analysis. I agree with the smp_store_release()
approach. Could you please respin the series?
^ permalink raw reply
* Re: [PATCH bpf-next] bpf, docs: add LOAD_AQCUIRE and STORE_RELEASE instructions
From: Alexei Starovoitov @ 2026-05-20 16:07 UTC (permalink / raw)
To: Alexis Lothoré
Cc: bot+bpf-ci, David Vernet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Jonathan Corbet, Shuah Khan, ebpf,
Bastien Curutchet (eBPF Foundation), Thomas Petazzoni, bpf, bpf,
open list:DOCUMENTATION, LKML, Martin KaFai Lau, Chris Mason,
Ihor Solodrai
In-Reply-To: <DINMD7Y3ZG8Q.3GZGX7SX9CN57@bootlin.com>
On Wed, May 20, 2026 at 5:46 PM Alexis Lothoré
<alexis.lothore@bootlin.com> wrote:
>
> On Wed May 20, 2026 at 5:18 PM CEST, bot+bpf-ci wrote:
> >> diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst
> >> --- a/Documentation/bpf/standardization/instruction-set.rst
> >> +++ b/Documentation/bpf/standardization/instruction-set.rst
> >> @@ -695,22 +695,24 @@
> >> *(u64 *)(dst + offset) += src
> >>
> >> In addition to the simple atomic operations, there also is a modifier and
> >> -two complex atomic operations:
> >> +four complex atomic operations:
> >>
> >> .. table:: Complex atomic operations
> >>
> >> =========== ================ ===========================
> >> imm value description
> >> =========== ================ ===========================
> >> - FETCH 0x01 modifier: return old value
> >> - XCHG 0xe0 | FETCH atomic exchange
> >> - CMPXCHG 0xf0 | FETCH atomic compare and exchange
> >> + FETCH 0x0001 modifier: return old value
> >> + XCHG 0x00e0 | FETCH atomic exchange
> >> + CMPXCHG 0x00f0 | FETCH atomic compare and exchange
> >> + LOAD_ACQ 0x0100 atomic load with barrier
> >> + STORE_REL 0x0110 atomic store with barrier
> >> =========== ================ ===========================
> >>
> >> The ``FETCH`` modifier is optional for simple atomic operations, and
> >> -always set for the complex atomic operations. If the ``FETCH`` flag
> >> -is set, then the operation also overwrites ``src`` with the value that
> >> -was in memory before it was modified.
> >> +always set for the ``XCHG`` and ``CMPXCHG`` complex atomic operations. If
> >> +the ``FETCH`` flag is set, then the operation also overwrites ``src`` with
> >> +the value that was in memory before it was modified.
> >>
> >> The ``XCHG`` operation atomically exchanges ``src`` with the value
> >> addressed by ``dst + offset``.
> >
> > [ ... ]
> >
> >> @@ -721,6 +723,11 @@
> >> value that was at ``dst + offset`` before the operation is zero-extended
> >> and loaded back to ``R0``.
> >>
> >> +The ``LOAD_ACQ`` and ``STORE_REL`` operations implement lighter LOAD and
> >> +STORE memory barriers than full barriers. The corresponding accesses must
> >> +be aligned, but are allowed for any access size (8-bit up to 64-bit
> >> +operations).
> >
> > Does the LOAD_ACQ documentation need to clarify its non-standard register
> > semantics? BPF atomic operations are encoded as STX instructions, which
> > conventionally use dst as the memory base pointer and src as the value
> > register. However, LOAD_ACQ reverses this pattern by using src as the
> > memory base pointer and dst as the destination register for the fetched
> > value, acting like an LDX instruction.
>
> True, I can specify that both are anyway supported by a STX instruction.
>
> > Additionally, should the documentation clarify that the loaded value is
> > zero-extended into the 64-bit destination register for 8, 16, and 32-bit
> > accesses? Omitting these details might lead to incorrect implementations
> > based on standard STX rules.
>
> I guess I can add this detail, yes.
>
> > This contradiction creates specification ambiguity: the documentation
> > states that LOAD_ACQ and STORE_REL "are allowed for any access size
> > (8-bit up to 64-bit operations)", but earlier in the document (line 671)
> > explicitly states "8-bit and 16-bit wide atomic operations are not
> > supported."
> >
> > The instruction encoding section (lines 667-671) only defines two size
> > encodings for ATOMIC mode:
> > - {ATOMIC, W, STX} for 32-bit operations
> > - {ATOMIC, DW, STX} for 64-bit operations
> >
> > No encoding is shown for B (byte/8-bit) or H (half-word/16-bit) with
> > ATOMIC mode.
> >
> > If LOAD_ACQ/STORE_REL truly support 8-bit and 16-bit sizes, shouldn't the
> > encoding section document how B and H size modifiers work with ATOMIC
> > mode, and line 671 clarify the exception?
>
> This point, and the corresponding mentions of the "atomic32 conformance
> group" and "atomic64 conformance group", made me realize that the kernel
> doc seems to be in sync with the eBPF ISA RFC
> (https://www.rfc-editor.org/rfc/rfc9669.html). It makes me wonder if
> it's really ok to add those LOAD_ACQUIRE/STORE_RELEASE mentions in the
> kernel doc only?
It's ok. It already diverged a bit. Eventually we will do an RFC update.
^ permalink raw reply
* [PATCH v2] docs: submitting-patches: Clarify that "reviewer" is a person
From: Krzysztof Kozlowski @ 2026-05-20 15:48 UTC (permalink / raw)
To: Jonathan Corbet, Shuah Khan, workflows, linux-doc, linux-kernel
Cc: Krzysztof Kozlowski, Greg Kroah-Hartman, Vlastimil Babka,
Andrew Morton, David Hildenbrand, Linus Torvalds, Randy Dunlap,
Mark Brown
The common understanding of the word "Reviewer" is: a person performing
review work [1]. Tools are not persons and thus cannot be reviewers in
this sense. Tools also cannot make statements and cannot take
responsibility for the review.
Our docs already clearly mark that "Reviewed-by" must come from a
person:
- "By offering my Reviewed-by: tag, I state that:"
Usage of first person "I" and word "state"
- "A Reviewed-by tag is *a statement of opinion* that the patch is an
appropriate modification of the kernel without any remaining serious"
Only a person can make a statement of opinion.
- "Any interested reviewer (who has done the work) can offer a
Reviewed-by"
A person can offer a tag, so the above does not grant a tool
permission to offer one.
However, this might not be enough, so let's clarify that only a person
with a known identity can state the "Reviewer's statement of oversight".
Link: https://en.wiktionary.org/wiki/reviewer [1]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
---
Changes in v2:
1. Add tags
2. Rephrase/simplify the commit msg a bit. Rephrase title - drop "in
English".
3. Add "with known identity", suggested by David Hildenbrand. I retained
previous tags, assuming this change is within spirit of previous
version and there were no objections on the list.
---
Documentation/process/submitting-patches.rst | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index d7290e208e72..cc6a1f73d7f2 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -581,12 +581,12 @@ By offering my Reviewed-by: tag, I state that:
A Reviewed-by tag is a statement of opinion that the patch is an
appropriate modification of the kernel without any remaining serious
-technical issues. Any interested reviewer (who has done the work) can
-offer a Reviewed-by tag for a patch. This tag serves to give credit to
-reviewers and to inform maintainers of the degree of review which has been
-done on the patch. Reviewed-by: tags, when supplied by reviewers known to
-understand the subject area and to perform thorough reviews, will normally
-increase the likelihood of your patch getting into the kernel.
+technical issues. Any interested reviewer (who has done the work and is a
+person with known identity) can offer a Reviewed-by tag for a patch. This tag
+serves to give credit to reviewers and to inform maintainers of the degree of
+review which has been done on the patch. Reviewed-by: tags, when supplied by
+reviewers known to understand the subject area and to perform thorough reviews,
+will normally increase the likelihood of your patch getting into the kernel.
Both Tested-by and Reviewed-by tags, once received on mailing list from tester
or reviewer, should be added by author to the applicable patches when sending
--
2.53.0
^ permalink raw reply related
* Re: [PATCH 02/15] accel/qda: Add QDA driver documentation
From: Tomeu Vizoso @ 2026-05-20 15:47 UTC (permalink / raw)
To: Dmitry Baryshkov
Cc: ekansh.gupta, Oded Gabbay, Jonathan Corbet, Shuah Khan,
Joerg Roedel, Will Deacon, Robin Murphy, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Sumit Semwal, Christian König, Bharath Kumar,
Chenna Kesava Raju, srini, andersson, konradybcio, robin.clark,
linux-kernel, dri-devel, linux-doc, linux-arm-msm, iommu,
linux-media, linaro-mm-sig
In-Reply-To: <paiohsil5pmvm7cf6jxrhaj2225bgvlt3scrag4x6gbkyosow5@l4tbakbnxcvo>
On Wed, May 20, 2026 at 4:12 PM Dmitry Baryshkov
<dmitry.baryshkov@oss.qualcomm.com> wrote:
>
> On Tue, May 19, 2026 at 11:45:52AM +0530, Ekansh Gupta via B4 Relay wrote:
> > From: Ekansh Gupta <ekansh.gupta@oss.qualcomm.com>
> >
> > Add documentation for the Qualcomm DSP Accelerator (QDA) driver under
> > Documentation/accel/qda/. The documentation covers the driver
> > architecture, GEM-based buffer management, IOMMU context bank
> > isolation, and the RPMsg transport layer.
> >
> > The user-space API section describes the DRM IOCTLs for session
> > management, GEM buffer allocation, and remote procedure invocation via
> > the FastRPC protocol, along with a typical application lifecycle
> > example. Sections for dynamic debug and basic testing are also
> > included.
> >
> > Wire the new documentation into the Compute Accelerators index at
> > Documentation/accel/index.rst.
> >
> > Assisted-by: Claude:claude-4-6-sonnet
> > Signed-off-by: Ekansh Gupta <ekansh.gupta@oss.qualcomm.com>
> > ---
> > Documentation/accel/index.rst | 1 +
> > Documentation/accel/qda/index.rst | 13 ++++
> > Documentation/accel/qda/qda.rst | 146 ++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 160 insertions(+)
> >
> > diff --git a/Documentation/accel/index.rst b/Documentation/accel/index.rst
> > index cbc7d4c3876a..5901ea7f784c 100644
> > --- a/Documentation/accel/index.rst
> > +++ b/Documentation/accel/index.rst
> > @@ -10,4 +10,5 @@ Compute Accelerators
> > introduction
> > amdxdna/index
> > qaic/index
> > + qda/index
> > rocket/index
> > diff --git a/Documentation/accel/qda/index.rst b/Documentation/accel/qda/index.rst
> > new file mode 100644
> > index 000000000000..013400cf9c25
> > --- /dev/null
> > +++ b/Documentation/accel/qda/index.rst
> > @@ -0,0 +1,13 @@
> > +.. SPDX-License-Identifier: GPL-2.0-only
> > +
> > +==================================
> > +accel/qda Qualcomm DSP Accelerator
> > +==================================
> > +
> > +The QDA driver provides a DRM accel based interface for Qualcomm DSP offload.
> > +It uses the FastRPC protocol and integrates with DRM and GEM infrastructure
> > +for device and buffer management.
> > +
> > +.. toctree::
> > +
> > + qda
> > diff --git a/Documentation/accel/qda/qda.rst b/Documentation/accel/qda/qda.rst
> > new file mode 100644
> > index 000000000000..9f49af6e6acc
> > --- /dev/null
> > +++ b/Documentation/accel/qda/qda.rst
> > @@ -0,0 +1,146 @@
> > +.. SPDX-License-Identifier: GPL-2.0-only
> > +
> > +=====================================
> > +Qualcomm DSP Accelerator (QDA) Driver
> > +=====================================
> > +
> > +Introduction
> > +============
> > +
> > +The QDA driver is a DRM accel driver for Qualcomm's DSPs. It provides a
> > +DRM accel based interface for Qualcomm DSP offload, supporting workloads
> > +such as AI inference, computer vision, audio processing, and sensor offload
> > +on Qualcomm SoCs. It uses the FastRPC protocol and integrates with DRM and
> > +GEM infrastructure for device and buffer management.
> > +
> > +Key Features
> > +============
> > +
> > +* **DRM accel Interface**: Exposes a standard character device node
> > + (e.g., ``/dev/accel/accel0``) via the DRM accel subsystem.
> > +* **FastRPC Protocol**: Implements the FastRPC protocol for communication
> > + between the application processor and the DSP.
> > +* **GEM Buffer Management**: Uses the DRM GEM interface for buffer
> > + allocation, lifecycle management, and DMA-BUF import/export.
> > +* **IOMMU Isolation**: Uses IOMMU context banks to enforce memory isolation
> > + between different DSP user sessions.
> > +* **Modular Design**: Clean separation between the core DRM logic, the
> > + memory manager, and the RPMsg-based transport layer.
> > +
> > +Architecture
> > +============
> > +
> > +The QDA driver consists of several functional blocks:
> > +
> > +1. **Core Driver (``qda_drv``)**: Manages device registration, file operations,
> > + and DRM accel integration.
> > +2. **Memory Manager (``qda_memory_manager``)**: A flexible memory management
> > + layer that handles IOMMU context banks. It supports pluggable backends
> > + (such as DMA-coherent) to adapt to different SoC memory architectures.
> > +3. **GEM Subsystem**: Implements the DRM GEM interface for buffer management:
> > +
> > + * **``qda_gem``**: Core GEM object management, including allocation, mmap
> > + operations, and buffer lifecycle management.
> > + * **``qda_prime``**: PRIME import functionality for DMA-BUF interoperability
> > + with other kernel subsystems.
> > +
> > +4. **Transport Layer (``qda_rpmsg``)**: Abstraction over the RPMsg framework
> > + to handle low-level message passing with the DSP firmware.
> > +5. **Compute Bus (``qda_compute_bus``)**: A custom virtual bus used to
> > + enumerate and manage the specific compute context banks defined in the
> > + device tree. The bus was introduced because IOMMU context banks (CBs) are
> > + synthetic constructs — not real platform devices — making a platform driver
> > + an incorrect abstraction for them. The earlier platform-driver approach also
> > + had a race condition: device nodes were created before the RPMsg channel
> > + resources were fully initialized, and because ``probe`` runs asynchronously,
> > + applications could open a CB device and attempt to start a session before
> > + the underlying transport was ready. The compute bus makes CB lifetime
> > + explicitly subordinate to the parent QDA device, closing that window.
> > +6. **FastRPC Core (``qda_fastrpc``)**: Implements the protocol logic for
> > + marshalling arguments and handling remote invocations.
> > +
> > +User-Space API
> > +==============
> > +
> > +The driver exposes a set of DRM-compliant IOCTLs:
> > +
> > +* ``DRM_IOCTL_QDA_QUERY``: Query DSP type (e.g., "cdsp", "adsp")
> > + and capabilities.
> > +* ``DRM_IOCTL_QDA_REMOTE_SESSION_CREATE``: Initialize a new process context
> > + on the DSP.
> > +* ``DRM_IOCTL_QDA_REMOTE_INVOKE``: Submit a remote method invocation (the
> > + primary execution unit).
> > +* ``DRM_IOCTL_QDA_GEM_CREATE``: Allocate a GEM buffer object for DSP usage.
> > +* ``DRM_IOCTL_QDA_GEM_MMAP_OFFSET``: Retrieve mmap offsets for memory mapping.
> > +* ``DRM_IOCTL_QDA_REMOTE_MAP`` / ``DRM_IOCTL_QDA_REMOTE_MUNMAP``: Map or unmap
> > + buffers into the DSP's virtual address space. Each accepts a ``request``
> > + field selecting between a legacy operation (``QDA_MAP_REQUEST_LEGACY`` /
> > + ``QDA_MUNMAP_REQUEST_LEGACY``) and an attribute-based operation
> > + (``QDA_MAP_REQUEST_ATTR`` / ``QDA_MUNMAP_REQUEST_ATTR``).
>
> Explain what happens if the users don't map the buffers into the DSP
> space. Will DRM_IOCTL_QDA_REMOTE_INVOKE handle the mapping or not? What
> is the difference between those two modes?
>
> Would the driver benefit from using GPUVM?
>
> > +
> > +Usage Example
> > +=============
> > +
> > +A typical lifecycle for a user-space application:
> > +
> > +1. **Discovery**: Open ``/dev/accel/accel*`` and use
> > + ``DRM_IOCTL_QDA_QUERY`` to identify the DSP domain served by that
> > + device node.
> > +2. **Initialization**: Call ``DRM_IOCTL_QDA_REMOTE_SESSION_CREATE`` to
> > + establish a session and create a process context on the DSP.
> > +3. **Memory**: Allocate buffers via ``DRM_IOCTL_QDA_GEM_CREATE`` or import
> > + DMA-BUFs (PRIME fd) from other drivers using ``DRM_IOCTL_PRIME_FD_TO_HANDLE``.
> > +4. **Execution**: Use ``DRM_IOCTL_QDA_REMOTE_INVOKE`` to pass arguments and
> > + execute functions on the DSP.
> > +5. **Cleanup**: Close file descriptors to automatically release resources and
> > + detach the session.
>
> I'd have expected the description of the actual example. I.e. clone the
> app from https://the.addr, prepare clang >= NN.MM, QAIC (https://foo),
> run make, run the app, check the results. I'd remind that DRM Accel has
> a very specific requirement of having the working toolhain in the
> open-source.
We have been getting submissions lately that don't fulfill that
requirement so I will point to the precise part of the documentation
that explains it:
https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#open-source-userspace-requirements
For an example of a submission that complies, see:
https://lore.kernel.org/dri-devel/20260114-thames-v2-0-e94a6636e050@tomeuvizoso.net/
Most importantly, notice how the proposed Thames Mesa driver generates
machine code for all the hardware units, and doesn't use any blob for
that.
Regards,
Tomeu
> > +
> > +Internal Implementation
> > +=======================
> > +
> > +Memory Management
> > +-----------------
> > +The driver's memory manager creates virtual "IOMMU devices" that map to
> > +hardware context banks. This allows the driver to manage multiple isolated
> > +address spaces. The implementation uses a DMA-coherent backend to ensure data consistency
> > +between the CPU and DSP without manual cache maintenance in most cases.
>
> GEM usage?
>
> > +
> > +Debugging
> > +=========
> > +The driver includes extensive dynamic debug support. Enable it via the
> > +kernel's dynamic debug control:
> > +
> > +.. code-block:: bash
> > +
> > + echo "file drivers/accel/qda/* +p" > /sys/kernel/debug/dynamic_debug/control
> > +
> > +Testing
> > +=======
> > +The QDA driver can be exercised using the ``fastrpc_test`` utility from the
> > +FastRPC userspace library. Run the test application:
>
> pointer
>
> > +
> > +.. code-block:: bash
> > +
> > + fastrpc_test -d 3 -U 1 -t linux -a v68
> > +
> > +**Options**
> > +
> > +``-d domain``
> > + Select the DSP domain to run on:
> > +
> > + * ``0`` — ADSP
> > + * ``1`` — MDSP
> > + * ``2`` — SDSP
> > + * ``3`` — CDSP *(default on targets with CDSP)*
> > +
> > +``-U unsigned_PD``
> > + Select signed or unsigned protection domain:
> > +
> > + * ``0`` — signed PD
> > + * ``1`` — unsigned PD *(default)*
> > +
> > +``-t target``
> > + Target platform: ``android`` or ``linux`` *(default: linux)*
> > +
> > +``-a arch_version``
> > + DSP architecture version, e.g. ``v68``, ``v75`` *(default: v68)*
> >
> > --
> > 2.34.1
> >
> >
>
> --
> With best wishes
> Dmitry
^ permalink raw reply
* Re: [PATCH bpf-next] bpf, docs: add LOAD_AQCUIRE and STORE_RELEASE instructions
From: Alexis Lothoré @ 2026-05-20 15:46 UTC (permalink / raw)
To: bot+bpf-ci, alexis.lothore, void, ast, daniel, andrii, martin.lau,
eddyz87, memxor, song, yonghong.song, jolsa, corbet, skhan
Cc: ebpf, bastien.curutchet, thomas.petazzoni, bpf, bpf, linux-doc,
linux-kernel, martin.lau, clm, ihor.solodrai
In-Reply-To: <ffe4e22a8e5d0fceefa7f9ecb378e038b4686aa417addd57d2af243b67607c0a@mail.kernel.org>
On Wed May 20, 2026 at 5:18 PM CEST, bot+bpf-ci wrote:
>> diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst
>> --- a/Documentation/bpf/standardization/instruction-set.rst
>> +++ b/Documentation/bpf/standardization/instruction-set.rst
>> @@ -695,22 +695,24 @@
>> *(u64 *)(dst + offset) += src
>>
>> In addition to the simple atomic operations, there also is a modifier and
>> -two complex atomic operations:
>> +four complex atomic operations:
>>
>> .. table:: Complex atomic operations
>>
>> =========== ================ ===========================
>> imm value description
>> =========== ================ ===========================
>> - FETCH 0x01 modifier: return old value
>> - XCHG 0xe0 | FETCH atomic exchange
>> - CMPXCHG 0xf0 | FETCH atomic compare and exchange
>> + FETCH 0x0001 modifier: return old value
>> + XCHG 0x00e0 | FETCH atomic exchange
>> + CMPXCHG 0x00f0 | FETCH atomic compare and exchange
>> + LOAD_ACQ 0x0100 atomic load with barrier
>> + STORE_REL 0x0110 atomic store with barrier
>> =========== ================ ===========================
>>
>> The ``FETCH`` modifier is optional for simple atomic operations, and
>> -always set for the complex atomic operations. If the ``FETCH`` flag
>> -is set, then the operation also overwrites ``src`` with the value that
>> -was in memory before it was modified.
>> +always set for the ``XCHG`` and ``CMPXCHG`` complex atomic operations. If
>> +the ``FETCH`` flag is set, then the operation also overwrites ``src`` with
>> +the value that was in memory before it was modified.
>>
>> The ``XCHG`` operation atomically exchanges ``src`` with the value
>> addressed by ``dst + offset``.
>
> [ ... ]
>
>> @@ -721,6 +723,11 @@
>> value that was at ``dst + offset`` before the operation is zero-extended
>> and loaded back to ``R0``.
>>
>> +The ``LOAD_ACQ`` and ``STORE_REL`` operations implement lighter LOAD and
>> +STORE memory barriers than full barriers. The corresponding accesses must
>> +be aligned, but are allowed for any access size (8-bit up to 64-bit
>> +operations).
>
> Does the LOAD_ACQ documentation need to clarify its non-standard register
> semantics? BPF atomic operations are encoded as STX instructions, which
> conventionally use dst as the memory base pointer and src as the value
> register. However, LOAD_ACQ reverses this pattern by using src as the
> memory base pointer and dst as the destination register for the fetched
> value, acting like an LDX instruction.
True, I can specify that both are anyway supported by a STX instruction.
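To make the register-role difference concrete, here is a minimal sketch in
C11, not taken from the patch or from the kernel interpreter. The imm values
come from the table quoted above; struct toy_insn, toy_exec_atomic64() and
the 11-entry register file are hypothetical names used only for
illustration, and only the 64-bit case is modelled:

```c
#include <stdatomic.h>
#include <stdint.h>

/* imm values as listed in the "Complex atomic operations" table above. */
#define BPF_LOAD_ACQ  0x0100
#define BPF_STORE_REL 0x0110

/* Hypothetical toy instruction: only the fields needed for this sketch. */
struct toy_insn {
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t off;
	int32_t imm;
};

/*
 * Execute one 64-bit LOAD_ACQ or STORE_REL on a toy register file.
 * Returns 0 on success, -1 for any other imm value.
 */
static int toy_exec_atomic64(uint64_t regs[11], const struct toy_insn *insn)
{
	_Atomic uint64_t *p;

	switch (insn->imm) {
	case BPF_LOAD_ACQ:
		/* LDX-like roles: src is the base pointer, dst gets the value. */
		p = (_Atomic uint64_t *)(uintptr_t)(regs[insn->src_reg] + insn->off);
		regs[insn->dst_reg] = atomic_load_explicit(p, memory_order_acquire);
		return 0;
	case BPF_STORE_REL:
		/* Usual STX roles: dst is the base pointer, src holds the value. */
		p = (_Atomic uint64_t *)(uintptr_t)(regs[insn->dst_reg] + insn->off);
		atomic_store_explicit(p, regs[insn->src_reg], memory_order_release);
		return 0;
	default:
		return -1;
	}
}
```

The two cases differ only in which register supplies the address, which is
the detail the review comment asks the documentation to spell out.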
> Additionally, should the documentation clarify that the loaded value is
> zero-extended into the 64-bit destination register for 8, 16, and 32-bit
> accesses? Omitting these details might lead to incorrect implementations
> based on standard STX rules.
I guess I can add this detail, yes.
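For reference, a small C11 sketch of what "zero-extended into the 64-bit
destination register" would mean for the narrower sizes. The helper names
are made up for this illustration and this is not how any JIT or the
verifier implements it; it only shows the expected result of the widening:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Load-acquire helpers for sub-64-bit sizes. The loaded value has an
 * unsigned type, so converting it to uint64_t zero-extends it: the
 * upper bits of the result are always 0, even when the top bit of the
 * loaded value is set.
 */
static uint64_t load_acq_u8(_Atomic uint8_t *p)
{
	return (uint64_t)atomic_load_explicit(p, memory_order_acquire);
}

static uint64_t load_acq_u16(_Atomic uint16_t *p)
{
	return (uint64_t)atomic_load_explicit(p, memory_order_acquire);
}

static uint64_t load_acq_u32(_Atomic uint32_t *p)
{
	return (uint64_t)atomic_load_explicit(p, memory_order_acquire);
}
```

With sign extension, an 8-bit 0xff would instead become
0xffffffffffffffff in the destination register, which is the mismatch an
implementer could run into if the documentation stays silent on this.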
> This contradiction creates specification ambiguity: the documentation
> states that LOAD_ACQ and STORE_REL "are allowed for any access size
> (8-bit up to 64-bit operations)", but earlier in the document (line 671)
> explicitly states "8-bit and 16-bit wide atomic operations are not
> supported."
>
> The instruction encoding section (lines 667-671) only defines two size
> encodings for ATOMIC mode:
> - {ATOMIC, W, STX} for 32-bit operations
> - {ATOMIC, DW, STX} for 64-bit operations
>
> No encoding is shown for B (byte/8-bit) or H (half-word/16-bit) with
> ATOMIC mode.
>
> If LOAD_ACQ/STORE_REL truly support 8-bit and 16-bit sizes, shouldn't the
> encoding section document how B and H size modifiers work with ATOMIC
> mode, and line 671 clarify the exception?
This point, and the corresponding mentions of the "atomic32 conformance
group" and "atomic64 conformance group", made me realize that the kernel
doc seems to be in sync with the eBPF ISA RFC
(https://www.rfc-editor.org/rfc/rfc9669.html). It makes me wonder if
it's really ok to add those LOAD_ACQUIRE/STORE_RELEASE mentions in the
kernel doc only?
> Alternatively, if LOAD_ACQ/STORE_REL only support 32-bit and 64-bit sizes
> like other atomic operations, should line 728 be corrected to state
> "32-bit and 64-bit operations" instead of "8-bit up to 64-bit operations"?
>
> For a standardized instruction set specification used by multiple
> implementations (verifier, JIT compilers, potentially other BPF runtimes),
> internal contradictions can lead to implementation divergence and
> interoperability issues.
>
>> +
>> 64-bit immediate instructions
>> -----------------------------
>>
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26170628668
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
^ permalink raw reply