Linux Documentation


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Mateusz Guzik @ 2026-06-10 23:40 UTC (permalink / raw)
  To: Li Chen
  Cc: John Ericson, Andy Lutomirski, Christian Brauner, Kees Cook,
	Al Viro, linux-fsdevel, linux-api, LKML, linux-mm, linux-arch,
	linux-doc, linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <19eb181fdd4.6d028f442844776.3737831021032223216@linux.beauty>

On Wed, Jun 10, 2026 at 08:29:06PM +0800, Li Chen wrote:
>  ---- On Wed, 10 Jun 2026 01:27:47 +0800  John Ericson <mail@johnericson.me> wrote --- 
>  > Hope the above answers your question? I suppose my ideas lean more on the
>  > "future" than "empty" side --- there is indeed a thread in the thread group,
>  > with real VM/namespace/file descriptor etc. state. Moreover, state gets
>  > initialized before the process is started, so the actual start is a pretty
>  > lightweight step of just letting the scheduler know the now-ready process can
>  > be scheduled. The only thing that distinguishes the embryonic process from a
>  > real one is simply that it isn't running --- i.e. isn't (yet) available to be
>  > scheduled --- so the pidfds holders are free to poke at its state.
>  > 
> 
> Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
> note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
> closer with P_LINTRANSIT, described as "process in exec or in creation".
> Linux does not seem to have a single equivalent today: current->in_execve
> is only an LSM hint, while the real synchronization is spread across
> exec_update_lock, cred_guard_mutex, and the exec path.
> 
> I am switching my local WIP from the two-fd builder model to one fd,
> closer to Christian's sketch:
> 
> fd = pidfd_open(0, PIDFD_EMPTY);
> pidfd_config(fd, ...);
> pidfd_spawn_run(fd, ...);
> 
> In my current local version, I still use copy_process(), so the fd points
> at a real task_struct/pid that is not woken until run. Following
> Christian's point that existing APIs can handle this not-yet-running case
> with ESRCH, I currently make ordinary pidfd operations that need a real
> started process return -ESRCH before start.
> 
> I am not sure yet whether Linux should grow a general exec/creation
> transition state like that, or whether a narrower future-process
> lifecycle is enough for this API. I will think more about that when
> working on the pristine process version.
> 

As I tried to explain in my previous e-mail, this approach does not cut
it because of NUMA.

Suppose you have a machine with 2 nodes. The parent-to-be is running
on node 0 and the child is intended to exec something on node 1.

When the parent-to-be allocates and populates stuff, it takes place with
memory backed by node 0. If you allocate task_struct, the file table and
other frequently used (and modified!) objs in this way, you are
guaranteeing performance loss due to interconnect traffic to access it.

Trying to add plumbing so that all allocations respect numa placement is
probably too cumbersome.

The primary example for that is looking up the binary to exec in the
first place.

Userspace likes to pass paths which don't exist, so checking for
the binary before any hard work is a useful optimization. Suppose the
binary to be executed is in a container bound with a taskset using
node 1 and the content of the fs part of the container is currently
fully uncached.

When you perform the lookup on node 0, you are populating a bunch of
metadata (inode, dentry) using memory from that domain. But the intended
user will only execute on node 1, again resulting in a performance loss.

In order to not do it you would need to convince VFS to allocate memory
elsewhere.

So I stand by my previous claim that ultimately a pristine child has to
be created (like in this patch), but which also has to do the work on
its own.

Suppose there is no explicit placement requested anywhere. Even in that
case there are legitimate workloads which will eventually be forced to
exec stuff on another node. Even these have a better chance of retaining
full locality if the child process does all the work.

Per my previous message, I don't see a clean interface to do it.
Something quasi-posix_spawn is probably the least bad way out. It would
also allow userspace to easily wrap the new thing with posix_spawn
itself.

Also note there is another issue with the fd-based approach: the fd will
get inherited on fork and will hang out in the child afterwards unless
explicitly closed. Suppose you have a multithreaded program which likes
to both fork(+no exec) and fork+exec. With the fd-based approach you
have no means of stopping another thread from grabbing your state thanks
to unix defaulting to copying everything. There was an attempt to fix
this aspect with O_CLOFORK, but this got rejected.
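
To make the inheritance concern concrete, here is a minimal userspace
sketch. It is an illustration only: an ordinary pipe fd stands in for
the proposed spawn fd, and the helper name fd_survives_fork() is
hypothetical. It shows that a plain fork() leaves a live copy of the fd
in the child, and that O_CLOEXEC would not help because the child never
execs.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if `fd` is still open in a forked
 * child, 0 if it is not, and -1 on error.
 */
int fd_survives_fork(int fd)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: F_GETFD fails with EBADF if the fd was not copied. */
		_exit(fcntl(fd, F_GETFD) != -1 ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 0;
}
```

Any thread that forks while another thread holds such an fd gets its own
reference to it, and that reference lives until the child closes it or
exits.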

Whatever exactly happens, NUMA is a sad fact of computing and needs to
be accounted for. The approach as proposed not only does not do it, but
it actively hinders such deployments.


* Re: [PATCH v5 4/4] Documentation: PCI: Add documentation for DOE endpoint support
From: Randy Dunlap @ 2026-06-10 23:21 UTC (permalink / raw)
  To: Aksh Garg, linux-pci, linux-doc, mani, kwilczynski, bhelgaas,
	corbet, kishon, skhan, lukas, cassel, alistair
  Cc: linux-arm-kernel, linux-kernel, s-vadapalli, danishanwar, srk
In-Reply-To: <20260610100256.1889111-5-a-garg7@ti.com>



On 6/10/26 3:02 AM, Aksh Garg wrote:
> Document the architecture and implementation details for the Data Object
> Exchange (DOE) framework for PCIe Endpoint devices.
> 
> Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Aksh Garg <a-garg7@ti.com>

Tested-by: Randy Dunlap <rdunlap@infradead.org>
Thanks.

> ---
> 
> Changes from v4 to v5:
> - Updated the DOE Abort handling section.
> 
> Changes from v3 to v4:
> - Updated the maximum size of the DOE object from 256KB to 1MB,
>   as per PCIe spec.
> - Updated the DOE setup and cleanup sections.
> 
> Changes from v2 to v3:
> - Rebased on 7.1-rc1.
> 
> Changes since v1:
> - Squashed the patches [1] and [2], and moved the documentation file
>   to Documentation/PCI/endpoint/pci-endpoint-doe.rst to match the existing
>   naming scheme, as suggested by Niklas Cassel
> - Updated the documentation as per the design and implementation changes
>   made to previous patches in this series:
>   * Updated for static protocol array instead of dynamic registration
>   * Documented asynchronous callback model
>   * Updated request/response flow with new callback signature
>   * Updated memory ownership: DOE core frees request, driver frees response
>   * Updated initialization and cleanup sections for new APIs
> 
> v4: https://lore.kernel.org/all/20260522052434.802034-5-a-garg7@ti.com/
> v3: https://lore.kernel.org/all/20260427051725.223704-5-a-garg7@ti.com/
> v2: https://lore.kernel.org/all/20260401073022.215805-5-a-garg7@ti.com/
> v1: [1] https://lore.kernel.org/all/20260213123603.420941-2-a-garg7@ti.com/
>     [2] https://lore.kernel.org/all/20260213123603.420941-5-a-garg7@ti.com/
> 
>  Documentation/PCI/endpoint/index.rst          |   1 +
>  .../PCI/endpoint/pci-endpoint-doe.rst         | 333 ++++++++++++++++++
>  2 files changed, 334 insertions(+)
>  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-doe.rst


-- 
~Randy


* [PATCH] docs/zh_CN: fix CONFIG_CONPAT typo for CONFIG_COMPAT
From: Ethan Nelson-Moore @ 2026-06-10 23:18 UTC (permalink / raw)
  To: Dongliang Mu, Shuah Khan, Kees Cook, Ethan Nelson-Moore,
	linux-doc
  Cc: Alex Shi, Yanteng Si, Jonathan Corbet

The Simplified Chinese translation of security/self-protection.rst
contains a typo CONFIG_CONPAT for CONFIG_COMPAT. Fix it.

I don't speak Chinese, but I verified that CONFIG_COMPAT was what was
intended via Google Translate.

Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
---
 Documentation/translations/zh_CN/security/self-protection.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/translations/zh_CN/security/self-protection.rst b/Documentation/translations/zh_CN/security/self-protection.rst
index 93de9cee5c1a..ad96bb4a4995 100644
--- a/Documentation/translations/zh_CN/security/self-protection.rst
+++ b/Documentation/translations/zh_CN/security/self-protection.rst
@@ -97,7 +97,7 @@ ARCH_OPTIONAL_KERNEL_RWX时的默认设置。
 --------------------
 
 对于64位系统，一种消除许多系统调用最简单的方法是构建时不启用
-CONFIG_CONPAT。然而，这种情况通常不可行。
+CONFIG_COMPAT。然而，这种情况通常不可行。
 
 “seccomp”系统为用户空间提供了一种可选功能，提供了一种减少可供
 运行中进程使用内核入口点数量的方法。这限制了可以访问内核代码
-- 
2.43.0



* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Mateusz Guzik @ 2026-06-10 22:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Li Chen, Kees Cook, Alexander Viro,
	linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
	linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <CAG48ez38OEE8ZPLyU6nr9=cYx-hMsdoh5WRrv-GMZGMDKyyOTA@mail.gmail.com>

On Mon, Jun 8, 2026 at 5:02 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, May 28, 2026 at 2:55 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > This problem is dear to my heart and I have been pondering it on and off
> > for some time now. The entire fork + exec idiom is terrible and needs to
> > be retired.
>
> It seems to me like vfork+exec is a decent UAPI building block, on
> which you can build nice-looking userspace APIs, though I agree that
> this is not an ideal direct interface for application code.
>
> > Additionally there is a known problem where transiently copied file
> > descriptors on fork + exec cause a headache in multithreaded programs
> > doing something like this in parallel. I only did cursory reading, it
> > seems your patchset keeps the same problem in place.
>
> I think we almost have UAPI that would let you avoid this issue?
> You can use clone() with CLONE_FILES, then unshare the FD table with
> close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE). That is not currently
> implemented to be atomic with stuff that happens on other threads, but
> if we changed that, and it doesn't provide a good way to carry some
> FDs across, but it feels to me like this could be fixed with a variant
> of close_range() that removes O_CLOEXEC FDs except ones listed in an
> array.

Suppose you want to exec a binary with the following fd set:
0 is /dev/null
1 is fd 1023 in your process
2 is fd 1023 in your process

You have tons of other fds and you don't want any of them anywhere near this.

A clean interface, from my standpoint, would avoid any unnecessary
overhead and would allow you to clearly specify what you want.

In this case, whatever the interface, it should provide the ability to
map 1023 to 1 and 2 in the child. With the current syscall set you get
refs taken on these on clone, then you have to manually dup2 them,
which means separate syscalls with extra atomics on top. A fast & elegant
solution would allow you to tell the kernel directly where to install
the 2 files.
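
For reference, the closest thing userspace has today is posix_spawn()
with file actions. The sketch below is an illustration only and the
helper name spawn_with_output_fd() is hypothetical. It requests the fd
layout from the example above: 0 is /dev/null, 1 and 2 both refer to
one fd of the parent. As far as I know, glibc still performs each
action as a separate syscall in the child, after the whole fd table has
been copied, so this expresses the intent without removing the overhead.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper: run argv[0] with fd 0 = /dev/null and fds 1 and 2
 * both referring to `out_fd` (fd 1023 in the example above).
 * Returns the child's exit status, or -1 on failure.
 */
int spawn_with_output_fd(int out_fd, char *const argv[])
{
	posix_spawn_file_actions_t fa;
	pid_t pid;
	int status, err;

	if (posix_spawn_file_actions_init(&fa) != 0)
		return -1;

	/* Actions are performed in the child, in the order added. */
	err = posix_spawn_file_actions_addopen(&fa, 0, "/dev/null",
					       O_RDONLY, 0);
	if (!err)
		err = posix_spawn_file_actions_adddup2(&fa, out_fd, 1);
	if (!err)
		err = posix_spawn_file_actions_adddup2(&fa, out_fd, 2);
	if (!err)
		err = posix_spawn(&pid, argv[0], &fa, NULL, argv, environ);
	posix_spawn_file_actions_destroy(&fa);
	if (err)
		return -1;

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Every other fd of the parent without the cloexec bit still leaks into
the new program here.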

Also note that in practical terms userspace likes to closefrom/close_range
anyway to get rid of unwanted fds which happen to not have the cloexec
bit, which is yet another syscall to invoke on the way to exec. A
better interface would avoid the problem outright by not copying
unwanted fds unless asked. To be viable as a foundation for building
posix_spawn on top of it, such copying would of course have to be
supported.
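
The cleanup step mentioned above typically looks something like the
following sketch. It is an illustration only and both helper names are
hypothetical. It tries close_range(2) via syscall() where the headers
define SYS_close_range, falling back to a plain loop otherwise, and it
runs in a forked child so the caller's own fds are untouched.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for syscall() */
#endif
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: close every fd >= first in the calling process. */
void close_fds_from(unsigned int first)
{
	long fd, max;

#ifdef SYS_close_range
	if (syscall(SYS_close_range, first, ~0U, 0U) == 0)
		return;
#endif
	/* Fallback for old kernels; slow if the fd limit is large. */
	max = sysconf(_SC_OPEN_MAX);
	if (max < 0)
		max = 1024;
	for (fd = first; fd < max; fd++)
		close((int)fd);
}

/*
 * Hypothetical helper: fork, drop all fds >= 3 in the child as a program
 * would right before exec, and report whether `fd` is still open there.
 * Returns 1 if still open, 0 if closed, -1 on error.
 */
int fd_open_after_cleanup(int fd)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		close_fds_from(3);
		_exit(fcntl(fd, F_GETFD) != -1 ? 1 : 0);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

The fds are copied on fork only to be thrown away one syscall later.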

>
> > There are numerous impactful ways to speed up execs both in terms of
> > single-threaded cost and their multicore scalability, most of which
> > would be immediately usable by all programs without an opt-in. imo these
> > needs to be exhausted before something like a "template" can be
> > considered.
>
> (I think probably a large part of this would be stuff that happens in
> userspace, like dynamic linking.)

I have not investigated userspace. Even putting specific APIs aside,
the kernel has *a lot* of avoidable overhead.

>
> > Per the above, the primary win would stem from *NOT* messing with mm.
>
> As you write below, I think we have that with CLONE_VM? The C function
> vfork() is kind of a terrible API because of its returns-twice
> behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> wrapped by libc in a way similar to clone() (with the child executing
> a separate handler function), or if it was used in the implementation
> of some higher-level process-spawning API, it would be a perfectly
> fine API?
>
> Or am I misunderstanding what you mean by "messing with mm"?
>

I was not aware of this functionality; let's assume it indeed works.
You still have the file issue described above.


* Re: [PATCH net-next v6 08/12] of: property: fw_devlink: Add support for "pcs-handle"
From: Rob Herring @ 2026-06-10 22:43 UTC (permalink / raw)
  To: Christian Marangi
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Krzysztof Kozlowski, Conor Dooley, Simon Horman,
	Jonathan Corbet, Shuah Khan, Lorenzo Bianconi, Heiner Kallweit,
	Russell King, Saravana Kannan, Philipp Zabel, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt, netdev, devicetree,
	linux-kernel, linux-doc, linux-arm-kernel, linux-mediatek, llvm
In-Reply-To: <20260609151212.29469-9-ansuelsmth@gmail.com>

On Tue, Jun 9, 2026 at 10:13 AM Christian Marangi <ansuelsmth@gmail.com> wrote:
>
> Add support for parsing PCS binding so that fw_devlink can
> enforce the dependency with Ethernet port.
>
> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
> ---
>  drivers/of/property.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/of/property.c b/drivers/of/property.c
> index 136946f8b746..e6584a2f705d 100644
> --- a/drivers/of/property.c
> +++ b/drivers/of/property.c
> @@ -1392,6 +1392,7 @@ DEFINE_SIMPLE_PROP(access_controllers, "access-controllers", "#access-controller
>  DEFINE_SIMPLE_PROP(pses, "pses", "#pse-cells")
>  DEFINE_SIMPLE_PROP(power_supplies, "power-supplies", NULL)
>  DEFINE_SIMPLE_PROP(mmc_pwrseq, "mmc-pwrseq", NULL)
> +DEFINE_SIMPLE_PROP(pcs_handle, "pcs-handle", "#pcs-cells")

There is no such common property "#pcs-cells".

>  DEFINE_SUFFIX_PROP(regulators, "-supply", NULL)
>  DEFINE_SUFFIX_PROP(gpio, "-gpio", "#gpio-cells")
>
> @@ -1548,6 +1549,7 @@ static const struct supplier_bindings of_supplier_bindings[] = {
>         { .parse_prop = parse_interrupts, },
>         { .parse_prop = parse_interrupt_map, },
>         { .parse_prop = parse_access_controllers, },
> +       { .parse_prop = parse_pcs_handle, },
>         { .parse_prop = parse_regulators, },
>         { .parse_prop = parse_gpio, },
>         { .parse_prop = parse_gpios, },
> --
> 2.53.0
>


* Re: [PATCH v7 06/42] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
From: Sean Christopherson @ 2026-06-10 22:23 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-6-2f0fae496530@google.com>

On Fri, May 22, 2026, Ackerley Tng wrote:
> Update the guest_memfd populate() flow to pull memory attributes from the
> gmem instance instead of the VM when KVM is not configured to track
> shared/private status in the VM.
> 
> Rename the per-VM API to make it clear that it retrieves per-VM
> attributes, i.e. is not suitable for use outside of flows that are
> specific to generic per-VM attributes.
> 
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

We should squash this in with the previous patch, i.e. wire up PRIVATE to gmem
in a single patch (sans the ioctl support).  I had a hell of a time figuring out how
the range-based lookup was supposed to work when revisiting the "wire up" patch,
until I realized populate() was handled in the next patch.


* Re: [PATCH v7 04/42] KVM: Stub in ability to disable per-VM memory attribute tracking
From: Sean Christopherson @ 2026-06-10 22:19 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-4-2f0fae496530@google.com>

On Fri, May 22, 2026, Ackerley Tng wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Introduce the basic infrastructure to allow per-VM memory attribute
> tracking to be disabled. This will be built upon in a later patch, where a
> module param can disable per-VM memory attribute tracking.
> 
> Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
> existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
> plumbing, while the latter enables the full per-VM tracking via an xarray
> and the associated ioctls.
> 
> kvm_get_memory_attributes() now performs a static call that either looks up
> kvm->mem_attr_array when CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
> just returns 0 otherwise. The static call can be patched depending on
> whether per-VM tracking is enabled by the CONFIG.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---

...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index abb9cfa3eb04d..ee26f1d9b5fda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -101,6 +101,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
>  static bool __ro_after_init allow_unsafe_mappings;
>  module_param(allow_unsafe_mappings, bool, 0444);
>  
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static bool vm_memory_attributes = true;
> +#else
> +#define vm_memory_attributes false
> +#endif
> +DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> +#endif

Fudge.  This morning's PUCK discussion about VBS made me realize that we really
don't want to kill off _all_ per-VM attributes like this, we really just want to
kill off PRIVATE.  And even if RWX protections never arrive, conceptually shoving
all attributes into guest_memfd doesn't make any sense, because it really is only
the private vs. shared state that is tied to the physical memory; things like RWX
protections aren't so tightly coupled to the data.

It'll require a bit of minor surgery to these patches, but the silver lining is
that I think the end code will be slightly easier to follow.

I'll sync with you off-list to splice in the changes to your current series (I
have them sketched out).


* Re: [PATCH net-next 2/3] docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and rekey
From: Sabrina Dubroca @ 2026-06-10 21:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, corbet,
	linux-doc, bpf, john.fastabend, skhan
In-Reply-To: <20260609201224.1191391-3-kuba@kernel.org>

2026-06-09, 13:12:23 -0700, Jakub Kicinski wrote:
> Fill in some gaps in the TLS offload doc:
> 
> - describe the tls_dev_del and tls_dev_resync callbacks
> - add a mention of rekeying being out of scope for now
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: john.fastabend@gmail.com
> CC: sd@queasysnail.net
> CC: corbet@lwn.net
> CC: skhan@linuxfoundation.org
> CC: linux-doc@vger.kernel.org
> ---
>  Documentation/networking/tls-offload.rst | 29 ++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
> index c173f537bf4d..a41f46885e8c 100644
> --- a/Documentation/networking/tls-offload.rst
> +++ b/Documentation/networking/tls-offload.rst
> @@ -104,6 +104,29 @@ at the end of kernel structures (see :c:member:`driver_state` members
>  in ``include/net/tls.h``) to avoid additional allocations and pointer
>  dereferences.
>  
> +When the offloaded connection is destroyed the core calls
> +the :c:member:`tls_dev_del` callback so the driver can release per-direction
> +state:
> +
> +.. code-block:: c
> +
> +	void (*tls_dev_del)(struct net_device *netdev,
> +			    struct tls_context *ctx,
> +			    enum tls_offload_ctx_dir direction);
> +
> +``tls_dev_del`` is mandatory whenever ``tls_dev_add`` is provided.
> +
> +The third TLS device callback is :c:member:`tls_dev_resync`, called by the core
> +to synchronize the TCP stream with the record boundaries:
> +
> +.. code-block:: c
> +
> +	int (*tls_dev_resync)(struct net_device *netdev,
> +			      struct sock *sk, u32 seq, u8 *rcd_sn,
> +			      enum tls_offload_ctx_dir direction);
> +
> +See the `Resync handling`_ section for details.

Hmm, this callback is not mentioned at all in the "Resync handling"
section. I think it'd be good to add at least a quick note there about
how/when it's invoked, and what the arguments mean (at least the two
types of sequence numbers, since the rest is identical to the other
driver CBs).

-- 
Sabrina


* Re: [PATCH net-next v09 4/5] hinic3: Add ethtool rss ops
From: Dimitri Daskalakis @ 2026-06-10 20:41 UTC (permalink / raw)
  To: Fan Gong, Wu Di, Teng Peisen, netdev, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Andrew Lunn, Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <7d1a4375fdf7c3e7a5a6162382cee4f48991d5da.1781062575.git.wudi234@huawei.com>



On 6/9/26 11:59 PM, Fan Gong wrote:
>   Implement the following ethtool callback functions:
> .get_rxnfc
> .set_rxnfc
> .get_channels
> .set_channels
> .get_rxfh_indir_size
> .get_rxfh_key_size
> .get_rxfh
> .set_rxfh
> 
>   These callbacks allow users to utilize ethtool for detailed
> RSS parameters configuration and monitoring.
> 
> Co-developed-by: Wu Di <wudi234@huawei.com>
> Signed-off-by: Wu Di <wudi234@huawei.com>
> Co-developed-by: Teng Peisen <tengpeisen@huawei.com>
> Signed-off-by: Teng Peisen <tengpeisen@huawei.com>
> Signed-off-by: Fan Gong <gongfan1@huawei.com>
> ---
>  .../ethernet/huawei/hinic3/hinic3_ethtool.c   |   9 +
>  .../huawei/hinic3/hinic3_mgmt_interface.h     |   2 +
>  .../net/ethernet/huawei/hinic3/hinic3_rss.c   | 539 +++++++++++++++++-
>  .../net/ethernet/huawei/hinic3/hinic3_rss.h   |  19 +
>  4 files changed, 567 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> index 11c8eb0f5d2a..78818de9a946 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> @@ -16,6 +16,7 @@
>  #include "hinic3_hw_comm.h"
>  #include "hinic3_nic_dev.h"
>  #include "hinic3_nic_cfg.h"
> +#include "hinic3_rss.h"
>  
>  #define HINIC3_MGMT_VERSION_MAX_LEN     32
>  /* Coalesce time properties in microseconds */
> @@ -1238,6 +1239,14 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
>  	.get_pause_stats                = hinic3_get_pause_stats,
>  	.get_coalesce                   = hinic3_get_coalesce,
>  	.set_coalesce                   = hinic3_set_coalesce,
> +	.get_rxnfc                      = hinic3_get_rxnfc,
> +	.set_rxnfc                      = hinic3_set_rxnfc,
> +	.get_channels                   = hinic3_get_channels,
> +	.set_channels                   = hinic3_set_channels,
> +	.get_rxfh_indir_size            = hinic3_get_rxfh_indir_size,
> +	.get_rxfh_key_size              = hinic3_get_rxfh_key_size,
> +	.get_rxfh                       = hinic3_get_rxfh,
> +	.set_rxfh                       = hinic3_set_rxfh,
>  };
>  
>  void hinic3_set_ethtool_ops(struct net_device *netdev)
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h b/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
> index 76c691f82703..3c1263ff99ff 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
> @@ -282,6 +282,7 @@ enum l2nic_cmd {
>  	L2NIC_CMD_SET_VLAN_FILTER_EN  = 26,
>  	L2NIC_CMD_SET_RX_VLAN_OFFLOAD = 27,
>  	L2NIC_CMD_CFG_RSS             = 60,
> +	L2NIC_CMD_GET_RSS_CTX_TBL     = 62,
>  	L2NIC_CMD_CFG_RSS_HASH_KEY    = 63,
>  	L2NIC_CMD_CFG_RSS_HASH_ENGINE = 64,
>  	L2NIC_CMD_SET_RSS_CTX_TBL     = 65,
> @@ -301,6 +302,7 @@ enum l2nic_ucode_cmd {
>  	L2NIC_UCODE_CMD_MODIFY_QUEUE_CTX  = 0,
>  	L2NIC_UCODE_CMD_CLEAN_QUEUE_CTX   = 1,
>  	L2NIC_UCODE_CMD_SET_RSS_INDIR_TBL = 4,
> +	L2NIC_UCODE_CMD_GET_RSS_INDIR_TBL = 6,
>  };
>  
>  /* hilink mac group command */
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c b/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
> index 25db74d8c7dd..811a6b491e74 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
> @@ -155,7 +155,7 @@ static int hinic3_set_rss_type(struct hinic3_hwdev *hwdev,
>  				       L2NIC_CMD_SET_RSS_CTX_TBL, &msg_params);
>  
>  	if (ctx_tbl.msg_head.status == MGMT_STATUS_CMD_UNSUPPORTED) {
> -		return MGMT_STATUS_CMD_UNSUPPORTED;
> +		return -EOPNOTSUPP;
>  	} else if (err || ctx_tbl.msg_head.status) {
>  		dev_err(hwdev->dev, "mgmt Failed to set rss context offload, err: %d, status: 0x%x\n",
>  			err, ctx_tbl.msg_head.status);
> @@ -165,6 +165,41 @@ static int hinic3_set_rss_type(struct hinic3_hwdev *hwdev,
>  	return 0;
>  }
>  
> +static int hinic3_get_rss_type(struct hinic3_hwdev *hwdev,
> +			       struct hinic3_rss_type *rss_type)
> +{
> +	struct l2nic_cmd_rss_ctx_tbl ctx_tbl = {};
> +	struct mgmt_msg_params msg_params = {};
> +	int err;
> +
> +	ctx_tbl.func_id = hinic3_global_func_id(hwdev);
> +
> +	mgmt_msg_params_init_default(&msg_params, &ctx_tbl, sizeof(ctx_tbl));
> +
> +	err = hinic3_send_mbox_to_mgmt(hwdev, MGMT_MOD_L2NIC,
> +				       L2NIC_CMD_GET_RSS_CTX_TBL,
> +				       &msg_params);
> +	if (ctx_tbl.msg_head.status == MGMT_STATUS_CMD_UNSUPPORTED) {
> +		return -EOPNOTSUPP;
> +	} else if (err || ctx_tbl.msg_head.status) {
> +		dev_err(hwdev->dev, "Failed to get hash type, err: %d, status: 0x%x\n",
> +			err, ctx_tbl.msg_head.status);
> +		return -EINVAL;
> +	}
> +
> +	rss_type->ipv4         = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV4);
> +	rss_type->ipv6         = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV6);
> +	rss_type->ipv6_ext     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV6_EXT);
> +	rss_type->tcp_ipv4     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, TCP_IPV4);
> +	rss_type->tcp_ipv6     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, TCP_IPV6);
> +	rss_type->tcp_ipv6_ext = L2NIC_RSS_TYPE_GET(ctx_tbl.context,
> +						    TCP_IPV6_EXT);
> +	rss_type->udp_ipv4     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, UDP_IPV4);
> +	rss_type->udp_ipv6     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, UDP_IPV6);
> +
> +	return 0;
> +}
> +
>  static int hinic3_rss_cfg_hash_type(struct hinic3_hwdev *hwdev, u8 opcode,
>  				    enum hinic3_rss_hash_type *type)
>  {
> @@ -264,7 +299,8 @@ static int hinic3_set_hw_rss_parameters(struct net_device *netdev, u8 rss_en)
>  	if (err)
>  		return err;
>  
> -	hinic3_fillout_indir_tbl(netdev, nic_dev->rss_indir);
> +	if (!netif_is_rxfh_configured(netdev))
> +		hinic3_fillout_indir_tbl(netdev, nic_dev->rss_indir);
>  
>  	err = hinic3_config_rss_hw_resource(netdev, nic_dev->rss_indir);
>  	if (err)
> @@ -334,3 +370,502 @@ void hinic3_try_to_enable_rss(struct net_device *netdev)
>  	clear_bit(HINIC3_RSS_ENABLE, &nic_dev->flags);
>  	nic_dev->q_params.num_qps = nic_dev->max_qps;
>  }
> +
> +static int hinic3_set_l4_rss_hash_ops(const struct ethtool_rxnfc *cmd,
> +				      struct hinic3_rss_type *rss_type)
> +{
> +	u8 rss_l4_en;
> +
> +	switch (cmd->data & (RXH_L4_B_0_1 | RXH_L4_B_2_3)) {
> +	case 0:
> +		rss_l4_en = 0;
> +		break;
> +	case (RXH_L4_B_0_1 | RXH_L4_B_2_3):
> +		rss_l4_en = 1;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	switch (cmd->flow_type) {
> +	case TCP_V4_FLOW:
> +		rss_type->tcp_ipv4 = rss_l4_en;
> +		break;
> +	case TCP_V6_FLOW:
> +		rss_type->tcp_ipv6 = rss_l4_en;
> +		break;
> +	case UDP_V4_FLOW:
> +		rss_type->udp_ipv4 = rss_l4_en;
> +		break;
> +	case UDP_V6_FLOW:
> +		rss_type->udp_ipv6 = rss_l4_en;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int hinic3_update_rss_hash_opts(struct net_device *netdev,
> +				       struct ethtool_rxnfc *cmd,
> +				       struct hinic3_rss_type *rss_type)
> +{
> +	int err;
> +
> +	switch (cmd->flow_type) {
> +	case TCP_V4_FLOW:
> +	case TCP_V6_FLOW:
> +	case UDP_V4_FLOW:
> +	case UDP_V6_FLOW:
> +		err = hinic3_set_l4_rss_hash_ops(cmd, rss_type);
> +		if (err)
> +			return err;
> +
> +		break;
> +	case IPV4_FLOW:
> +		rss_type->ipv4 = 1;
> +		break;
> +	case IPV6_FLOW:
> +		rss_type->ipv6 = 1;
> +		break;
> +	default:
> +		netdev_err(netdev, "Unsupported flow type\n");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int hinic3_set_rss_hash_opts(struct net_device *netdev,
> +				    struct ethtool_rxnfc *cmd)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	struct hinic3_rss_type rss_type;
> +	int err;
> +
> +	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags)) {
> +		cmd->data = 0;
> +		netdev_err(netdev, "RSS is disable, not support to set flow-hash\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/* RSS only supports hashing of IP addresses and L4 ports */
> +	if (cmd->data & ~(RXH_IP_SRC | RXH_IP_DST |
> +			  RXH_L4_B_0_1 | RXH_L4_B_2_3))
> +		return -EINVAL;
> +
> +	/* Both IP addresses must be part of the hash tuple */
> +	if (!(cmd->data & RXH_IP_SRC) || !(cmd->data & RXH_IP_DST))
> +		return -EINVAL;
> +
> +	/* L4 hash bits are not valid for pure L3 flow types */
> +	if ((cmd->flow_type == IPV4_FLOW || cmd->flow_type == IPV6_FLOW) &&
> +	    (cmd->data & (RXH_L4_B_0_1 | RXH_L4_B_2_3)))
> +		return -EINVAL;
> +
> +	err = hinic3_get_rss_type(nic_dev->hwdev, &rss_type);
> +	if (err) {
> +		netdev_err(netdev, "Failed to get rss type\n");
> +		return err;
> +	}
> +
> +	err = hinic3_update_rss_hash_opts(netdev, cmd, &rss_type);
> +	if (err)
> +		return err;
> +
> +	err = hinic3_set_rss_type(nic_dev->hwdev, rss_type);
> +	if (err) {
> +		netdev_err(netdev, "Failed to set rss type\n");
> +		return err;
> +	}
> +
> +	nic_dev->rss_type = rss_type;
> +
> +	return 0;
> +}
> +
> +static void convert_rss_l3_type(u8 rss_opt, struct ethtool_rxnfc *cmd)
> +{
> +	if (!rss_opt)
> +		cmd->data &= ~(RXH_IP_SRC | RXH_IP_DST);
> +}
> +
> +static void convert_rss_l4_type(u8 rss_opt, struct ethtool_rxnfc *cmd)
> +{
> +	if (rss_opt)
> +		cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
> +}
> +
> +static int hinic3_convert_rss_type(struct net_device *netdev,
> +				   struct hinic3_rss_type *rss_type,
> +				   struct ethtool_rxnfc *cmd)
> +{
> +	cmd->data = RXH_IP_SRC | RXH_IP_DST;
> +	switch (cmd->flow_type) {
> +	case TCP_V4_FLOW:
> +		convert_rss_l4_type(rss_type->tcp_ipv4, cmd);
> +		break;
> +	case TCP_V6_FLOW:
> +		convert_rss_l4_type(rss_type->tcp_ipv6, cmd);
> +		break;
> +	case UDP_V4_FLOW:
> +		convert_rss_l4_type(rss_type->udp_ipv4, cmd);
> +		break;
> +	case UDP_V6_FLOW:
> +		convert_rss_l4_type(rss_type->udp_ipv6, cmd);
> +		break;
> +	case IPV4_FLOW:
> +		convert_rss_l3_type(rss_type->ipv4, cmd);
> +		break;
> +	case IPV6_FLOW:
> +		convert_rss_l3_type(rss_type->ipv6, cmd);
> +		break;
> +	default:
> +		netdev_err(netdev, "Unsupported flow type\n");
> +		cmd->data = 0;
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int hinic3_get_rss_hash_opts(struct net_device *netdev,
> +				    struct ethtool_rxnfc *cmd)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	struct hinic3_rss_type rss_type;
> +	int err;
> +
> +	cmd->data = 0;
> +
> +	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags))
> +		return 0;
> +
> +	err = hinic3_get_rss_type(nic_dev->hwdev, &rss_type);
> +	if (err) {
> +		netdev_err(netdev, "Failed to get rss type\n");
> +		return err;
> +	}
> +
> +	return hinic3_convert_rss_type(netdev, &rss_type, cmd);
> +}
> +
> +int hinic3_get_rxnfc(struct net_device *netdev,
> +		     struct ethtool_rxnfc *cmd, u32 *rule_locs)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	int err = 0;
> +
> +	switch (cmd->cmd) {
> +	case ETHTOOL_GRXRINGS:
> +		cmd->data = nic_dev->q_params.num_qps;
> +		break;

You should probably implement the get_rx_ring_count ethtool op instead.
See
https://lore.kernel.org/netdev/20260122-grxring_big_v4-v2-0-94dbe4dcaa10@debian.org/
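
For illustration, a rough sketch of the shape this could take. It assumes the
callback takes only the netdev and returns the ring count as a u32; the structs
below are user-space stand-ins so the fragment compiles on its own, not the
kernel definitions, and `ethtool_ops_sketch` is a hypothetical name.

```c
#include <stdint.h>

/* User-space stand-ins, NOT the kernel or hinic3 definitions. */
struct net_device { void *priv; };
struct hinic3_dyna_txrxq_params { uint16_t num_qps; };
struct hinic3_nic_dev { struct hinic3_dyna_txrxq_params q_params; };

static void *netdev_priv(struct net_device *netdev)
{
	return netdev->priv;
}

/* Assumed callback shape: report the RX ring count directly. */
static uint32_t hinic3_get_rx_ring_count(struct net_device *netdev)
{
	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);

	return nic_dev->q_params.num_qps;
}

/* Hypothetical ops table holding only the one callback of interest. */
struct ethtool_ops_sketch {
	uint32_t (*get_rx_ring_count)(struct net_device *dev);
};

static const struct ethtool_ops_sketch hinic3_ethtool_ops_sketch = {
	.get_rx_ring_count = hinic3_get_rx_ring_count,
};
```

With something like this in place, the ETHTOOL_GRXRINGS case in
hinic3_get_rxnfc() would presumably no longer be needed.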



* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: John Ericson @ 2026-06-10 20:38 UTC (permalink / raw)
  To: Li Chen
  Cc: Andy Lutomirski, Christian Brauner, Kees Cook, Al Viro,
	linux-fsdevel, linux-api, LKML, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <19eb181fdd4.6d028f442844776.3737831021032223216@linux.beauty>

On Wed, Jun 10, 2026, at 8:29 AM, Li Chen wrote:
> Hi John,
>
> [...]
>
> Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
> note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
> closer with P_LINTRANSIT, described as "process in exec or in creation".
> Linux does not seem to have a single equivalent today: current->in_execve
> is only an LSM hint, while the real synchronization is spread across
> exec_update_lock, cred_guard_mutex, and the exec path.

Great! Glad to hear my suggestion (and, I hope, the patch I linked in
the other email too) was useful.

> I am switching my local WIP from the two-fd builder model to one fd,
> closer to Christian's sketch:
>
> fd = pidfd_open(0, PIDFD_EMPTY);
> pidfd_config(fd, ...);
> pidfd_spawn_run(fd, ...);

Glad to hear it is also one-fd now.

> In my current local version, I still use copy_process(), so the fd points
> at a real task_struct/pid that is not woken until run.

So this is an interesting thing to think about. My hunch is that
`copy_process` is, at least in the longer term, still doing too much! In
particular, `struct kernel_clone_args` has many degrees of freedom, and
might also make assumptions about preserving more of the parent process
than is needed in this case.

This is a bit tangential, but one thing I have thought about is having
"null namespaces". I think the current (i.e. existing clone API) default
of "share with parent process" is a poor security practice (more
privileges, i.e. sharing, should always be opt-in). But the opposite
default of "unshare everything" is expensive since creating new
namespaces is non-free. The goal of the null namespaces would be a cheap
way of creating a more isolated and unprivileged process --- and "cheap"
here is literal: a null pointer in `nsproxy`, no allocation, no
namespace object, no ID. This null state would be what
`pidfd_open(0, PIDFD_EMPTY)` (using your example above, or really
whatever the first step is) hands back.

Then, from that maximally cheap and unprivileged initial state, the
`pidfd_config(fd, ...);` calls (plural important, I think!) would opt
into either sharing or unsharing namespaces between the child and parent
as the parent sees fit.
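
To make the idea concrete, here is a toy user-space model of that flow. Every
name in it (`toy_proc`, `toy_config_share`, `toy_config_unshare`, `toy_run`) is
made up for illustration and none of it is a kernel API; it only models the
"start from null, opt in per namespace, then run" ordering.

```c
#include <stddef.h>

/* Toy model only: none of these types or functions exist in the kernel. */
enum ns_kind { NS_NET, NS_MNT, NS_PID, NS_KIND_COUNT };

struct toy_ns { int id; };

struct toy_proc {
	struct toy_ns *ns[NS_KIND_COUNT];	/* NULL models a "null namespace" */
	int started;
};

/* Step 1: blank process, every namespace slot is NULL, nothing allocated. */
static void toy_proc_init_empty(struct toy_proc *p)
{
	for (size_t i = 0; i < NS_KIND_COUNT; i++)
		p->ns[i] = NULL;
	p->started = 0;
}

/* Step 2a: opt in to sharing one namespace with the parent. */
static int toy_config_share(struct toy_proc *child,
			    const struct toy_proc *parent, enum ns_kind kind)
{
	if (child->started)
		return -1;	/* configuration is only allowed before run */
	child->ns[kind] = parent->ns[kind];
	return 0;
}

/*
 * Step 2b: opt in to a fresh namespace. The caller supplies the object,
 * standing in for the allocation cost that only this path would pay.
 */
static int toy_config_unshare(struct toy_proc *child, enum ns_kind kind,
			      struct toy_ns *fresh)
{
	if (child->started)
		return -1;
	child->ns[kind] = fresh;
	return 0;
}

/* Step 3: mark the process runnable; further configuration is refused. */
static int toy_run(struct toy_proc *child)
{
	if (child->started)
		return -1;
	child->started = 1;
	return 0;
}
```

Anything the parent never configures stays NULL, which is the point: isolation
is the default and costs nothing, while sharing and unsharing are explicit.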

The larger point here is that insofar as there are no good defaults for
things, there is pressure, whether in step 1 or step 2, to make a larger,
everything-at-once configuration. But when we think a bit outside the
box to create good defaults where they didn't previously exist, we
can end up in a situation where a minimal, initially blank, unstarted
process, and the builder pattern to initialize it, are more "natural".

> Following
> Christian's point that existing APIs can handle this not-yet-running case
> with ESRCH, I currently make ordinary pidfd operations that need a real
> started process return -ESRCH before start.

Also glad to hear.

> I am not sure yet whether Linux should grow a general exec/creation
> transition state like that, or whether a narrower future-process
> lifecycle is enough for this API. I will think more about that when
> working on the pristine process version.

Sounds good. As I think you can guess, my preference is for "yes", but I
agree we can see what you end up with in the next patchset and make more
informed decisions based on that.

Cheers,

John


* Re: [PATCH 0/2] module: restrict module auto-loading to privileged users
From: Kees Cook @ 2026-06-10 20:23 UTC (permalink / raw)
  To: Sami Tolvanen
  Cc: Michal Gorlas, Jonathan Corbet, Shuah Khan, Luis Chamberlain,
	Petr Pavlu, Daniel Gomez, Aaron Tomlin, linux-doc, linux-kernel,
	linux-modules
In-Reply-To: <20260605183646.GC2939956@google.com>

On Fri, Jun 05, 2026 at 06:36:46PM +0000, Sami Tolvanen wrote:
> On Fri, May 15, 2026 at 07:20:18PM +0200, Michal Gorlas wrote:
> > Add option to restrict the module auto-loading to CAP_SYS_ADMIN.
> > This is heavily inspired by CONFIG_GRKERNSEC_MODHARDEN of the latest
> > available Grsecurity patches [1]. Instead of checking whether the
> > callers' UID is 0, check whether the calling process has CAP_SYS_ADMIN.
> > The reasoning here is that many modules are autoloaded by systemd
> > services which are running as privileged users, but do not have UID 0.
> > While systemd-udevd runs as root, systemd-network (which often
> > auto-loads a module) for example runs as system user (UID range 6 to
> > 999).
> > 
> > When enabled, reduces attack surface where unprivileged users can trigger
> > vulnerable module to be auto-loaded, to then exploit it. Recent LPEs
> > (CopyFail [3], DirtyFrag [4]) for example, would have been mitigated
> > with this option enabled as long as the vulnerable modules are not built-in
> > (or already loaded at the point of running the exploit). 
> 
> This sounds potentially useful as an optional feature. Kees, you've
> looked at grsec features in the past, do you have any thoughts about
> this?

This doesn't really look like GRKERNSEC_MODHARDEN to me? In that
feature, the credentials of the usermode helper are passed down so that
udev or whatever can examine them and make choices (instead of seeing
the uid-0 usermode helper credentials).

This looks like it is just doing a request-time policy check, but that's
already covered by the security_kernel_module_request() call immediately
before the proposed module_autoload_restrict check.

Also note that module loading is _already_ controlled by CAP_SYS_MODULE,
not uid 0 nor CAP_SYS_ADMIN.

Sashiko has similar feedback, and some other notes too:
https://sashiko.dev/#/patchset/20260515-autoload_restrict-v1-0-40b7c03ddd04%409elements.com

I'm not clear what problem this patch is trying to solve?

-Kees

-- 
Kees Cook


* Re: [PATCH net-next v09 1/5] hinic3: Add ethtool queue ops
From: Dimitri Daskalakis @ 2026-06-10 20:21 UTC (permalink / raw)
  To: Fan Gong, Wu Di, Teng Peisen, netdev, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Andrew Lunn, Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <02e87952a65aa268526ade2f03de6c76fbc1fe9d.1781062575.git.wudi234@huawei.com>



On 6/9/26 11:59 PM, Fan Gong wrote:
>   Implement following ethtool callback function:
> .get_ringparam
> .set_ringparam
> 
>   These callbacks allow users to utilize ethtool for detailed
> queue depth configuration and monitoring.
> 
> Co-developed-by: Wu Di <wudi234@huawei.com>
> Signed-off-by: Wu Di <wudi234@huawei.com>
> Co-developed-by: Teng Peisen <tengpeisen@huawei.com>
> Signed-off-by: Teng Peisen <tengpeisen@huawei.com>
> Signed-off-by: Fan Gong <gongfan1@huawei.com>
> ---
>  .../ethernet/huawei/hinic3/hinic3_ethtool.c   |  93 ++++++++++++++++
>  .../net/ethernet/huawei/hinic3/hinic3_irq.c   |   5 +-
>  .../net/ethernet/huawei/hinic3/hinic3_main.c  |   6 +
>  .../huawei/hinic3/hinic3_netdev_ops.c         | 104 ++++++++++++++++--
>  .../ethernet/huawei/hinic3/hinic3_nic_dev.h   |   9 ++
>  .../ethernet/huawei/hinic3/hinic3_nic_io.c    |   4 +-
>  .../ethernet/huawei/hinic3/hinic3_nic_io.h    |   8 +-
>  .../net/ethernet/huawei/hinic3/hinic3_rx.c    |   2 +-
>  8 files changed, 217 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> index 90fc16288de9..be9992a235f7 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> @@ -9,6 +9,7 @@
>  #include <linux/errno.h>
>  #include <linux/etherdevice.h>
>  #include <linux/netdevice.h>
> +#include <linux/netlink.h>
>  #include <linux/ethtool.h>
>  
>  #include "hinic3_lld.h"
> @@ -409,6 +410,96 @@ hinic3_get_link_ksettings(struct net_device *netdev,
>  	return 0;
>  }
>  
> +static void hinic3_get_ringparam(struct net_device *netdev,
> +				 struct ethtool_ringparam *ring,
> +				 struct kernel_ethtool_ringparam *kernel_ring,
> +				 struct netlink_ext_ack *extack)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +
> +	ring->rx_max_pending = HINIC3_MAX_RX_QUEUE_DEPTH;
> +	ring->tx_max_pending = HINIC3_MAX_TX_QUEUE_DEPTH;
> +	ring->rx_pending = nic_dev->q_params.rq_depth;
> +	ring->tx_pending = nic_dev->q_params.sq_depth;
> +}
> +
> +static void hinic3_update_qp_depth(struct net_device *netdev,
> +				   u32 sq_depth, u32 rq_depth)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	u16 i;
> +
> +	nic_dev->q_params.sq_depth = sq_depth;
> +	nic_dev->q_params.rq_depth = rq_depth;
> +	for (i = 0; i < nic_dev->max_qps; i++) {
> +		nic_dev->txqs[i].q_depth = sq_depth;
> +		nic_dev->txqs[i].q_mask = sq_depth - 1;
> +		nic_dev->rxqs[i].q_depth = rq_depth;
> +		nic_dev->rxqs[i].q_mask = rq_depth - 1;
> +	}
> +}
> +
> +static int hinic3_check_ringparam_valid(struct net_device *netdev,
> +					const struct ethtool_ringparam *ring,
> +					struct netlink_ext_ack *extack)
> +{
> +	if (ring->tx_pending < HINIC3_MIN_QUEUE_DEPTH ||
> +	    ring->rx_pending < HINIC3_MIN_QUEUE_DEPTH) {
> +		NL_SET_ERR_MSG_FMT_MOD(extack,
> +				       "Queue depth out of range tx[%d-%d] rx[%d-%d]",
> +				       HINIC3_MIN_QUEUE_DEPTH,
> +				       HINIC3_MAX_TX_QUEUE_DEPTH,
> +				       HINIC3_MIN_QUEUE_DEPTH,
> +				       HINIC3_MAX_RX_QUEUE_DEPTH);
> +
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int hinic3_set_ringparam(struct net_device *netdev,
> +				struct ethtool_ringparam *ring,
> +				struct kernel_ethtool_ringparam *kernel_ring,
> +				struct netlink_ext_ack *extack)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	struct hinic3_dyna_txrxq_params q_params = {};
> +	u32 new_sq_depth, new_rq_depth;
> +	int err;
> +
> +	err = hinic3_check_ringparam_valid(netdev, ring, extack);
> +	if (err)
> +		return err;
> +
> +	new_sq_depth = 1U << ilog2(ring->tx_pending);
> +	new_rq_depth = 1U << ilog2(ring->rx_pending);
> +	if (new_sq_depth == nic_dev->q_params.sq_depth &&
> +	    new_rq_depth == nic_dev->q_params.rq_depth)
> +		return 0;
> +
> +	if (new_sq_depth != ring->tx_pending ||
> +	    new_rq_depth != ring->rx_pending)
> +		NL_SET_ERR_MSG_FMT_MOD(extack,
> +				       "Requested Tx/Rx ring depth %u/%u trimmed to %u/%u",
> +				       ring->tx_pending, ring->rx_pending,
> +				       new_sq_depth, new_rq_depth);
> +
> +	if (!netif_running(netdev)) {
> +		hinic3_update_qp_depth(netdev, new_sq_depth, new_rq_depth);
> +	} else {
> +		q_params = nic_dev->q_params;
> +		q_params.sq_depth = new_sq_depth;
> +		q_params.rq_depth = new_rq_depth;
> +
> +		err = hinic3_change_channel_settings(netdev, &q_params);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
>  static const struct ethtool_ops hinic3_ethtool_ops = {
>  	.supported_coalesce_params      = ETHTOOL_COALESCE_USECS |
>  					  ETHTOOL_COALESCE_PKT_RATE_RX_USECS,
> @@ -417,6 +508,8 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
>  	.get_msglevel                   = hinic3_get_msglevel,
>  	.set_msglevel                   = hinic3_set_msglevel,
>  	.get_link                       = ethtool_op_get_link,
> +	.get_ringparam                  = hinic3_get_ringparam,
> +	.set_ringparam                  = hinic3_set_ringparam,
>  };
>  
>  void hinic3_set_ethtool_ops(struct net_device *netdev)
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c b/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
> index e7d6c2033b45..bc4d879f9be4 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
> @@ -137,7 +137,8 @@ static int hinic3_set_interrupt_moder(struct net_device *netdev, u16 q_id,
>  	struct hinic3_interrupt_info info = {};
>  	int err;
>  
> -	if (q_id >= nic_dev->q_params.num_qps)
> +	if (q_id >= nic_dev->q_params.num_qps ||
> +	    !mutex_trylock(&nic_dev->change_res_mutex))
>  		return 0;
>  
>  	info.interrupt_coalesc_set = 1;
> @@ -156,6 +157,8 @@ static int hinic3_set_interrupt_moder(struct net_device *netdev, u16 q_id,
>  		nic_dev->rxqs[q_id].last_pending_limit = pending_limit;
>  	}
>  
> +	mutex_unlock(&nic_dev->change_res_mutex);
> +
>  	return err;
>  }
>  
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> index 0a888fe4c975..c87624a5e5dc 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> @@ -179,6 +179,7 @@ static int hinic3_sw_init(struct net_device *netdev)
>  	int err;
>  
>  	mutex_init(&nic_dev->port_state_mutex);
> +	mutex_init(&nic_dev->change_res_mutex);
>  
>  	nic_dev->q_params.sq_depth = HINIC3_SQ_DEPTH;
>  	nic_dev->q_params.rq_depth = HINIC3_RQ_DEPTH;
> @@ -315,6 +316,9 @@ static void hinic3_link_status_change(struct net_device *netdev,
>  {
>  	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
>  
> +	if (!mutex_trylock(&nic_dev->change_res_mutex))
> +		return;
> +
>  	if (link_status_up) {
>  		if (netif_carrier_ok(netdev))
>  			return;

There are a couple of returns in this function that will cause the lock
to never be released. Probably needs a goto unlock.
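
Something along these lines, sketched in user space with a pthread mutex so it
can be compiled standalone. `fake_dev` and `fake_link_status_change` are
hypothetical stand-ins for the driver state and handler, and the return value
exists only to make the behavior observable; the real function returns void.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the hinic3 structs. */
struct fake_dev {
	pthread_mutex_t change_res_mutex;
	bool carrier_ok;
};

/* Returns true if the carrier state actually changed. */
static bool fake_link_status_change(struct fake_dev *dev, bool link_up)
{
	bool changed = false;

	/* As in the patch: skip the event if a reconfiguration holds the lock. */
	if (pthread_mutex_trylock(&dev->change_res_mutex) != 0)
		return false;

	if (link_up) {
		if (dev->carrier_ok)
			goto unlock;	/* early exit still releases the mutex */
		dev->carrier_ok = true;
	} else {
		if (!dev->carrier_ok)
			goto unlock;
		dev->carrier_ok = false;
	}
	changed = true;

unlock:
	pthread_mutex_unlock(&dev->change_res_mutex);
	return changed;
}
```

The only return that skips the unlock is the one taken when the trylock itself
failed, so every path that acquired the mutex also releases it.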

> @@ -330,6 +334,8 @@ static void hinic3_link_status_change(struct net_device *netdev,
>  		netif_carrier_off(netdev);
>  		netdev_dbg(netdev, "Link is down\n");
>  	}
> +
> +	mutex_unlock(&nic_dev->change_res_mutex);
>  }
>  
>  static void hinic3_port_module_event_handler(struct net_device *netdev,
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c b/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> index da73811641a9..047214cfc753 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> @@ -288,7 +288,8 @@ static void hinic3_free_channel_resources(struct net_device *netdev,
>  	hinic3_free_qps(nic_dev, qp_params);
>  }
>  
> -static int hinic3_open_channel(struct net_device *netdev)
> +static int hinic3_prepare_channel(struct net_device *netdev,
> +				  struct hinic3_dyna_txrxq_params *qp_params)
>  {
>  	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
>  	int err;
> @@ -299,16 +300,28 @@ static int hinic3_open_channel(struct net_device *netdev)
>  		return err;
>  	}
>  
> -	err = hinic3_configure_txrxqs(netdev, &nic_dev->q_params);
> +	err = hinic3_configure_txrxqs(netdev, qp_params);
>  	if (err) {
>  		netdev_err(netdev, "Failed to configure txrxqs\n");
>  		goto err_free_qp_ctxts;
>  	}
>  
> +	return 0;
> +
> +err_free_qp_ctxts:
> +	hinic3_free_qp_ctxts(nic_dev);
> +
> +	return err;
> +}
> +
> +static int hinic3_open_channel(struct net_device *netdev)
> +{
> +	int err;
> +
>  	err = hinic3_qps_irq_init(netdev);
>  	if (err) {
>  		netdev_err(netdev, "Failed to init txrxq irq\n");
> -		goto err_free_qp_ctxts;
> +		return err;
>  	}
>  
>  	err = hinic3_configure(netdev);
> @@ -321,8 +334,6 @@ static int hinic3_open_channel(struct net_device *netdev)
>  
>  err_uninit_qps_irq:
>  	hinic3_qps_irq_uninit(netdev);
> -err_free_qp_ctxts:
> -	hinic3_free_qp_ctxts(nic_dev);
>  
>  	return err;
>  }
> @@ -428,6 +439,74 @@ static void hinic3_vport_down(struct net_device *netdev)
>  	}
>  }
>  
> +int
> +hinic3_change_channel_settings(struct net_device *netdev,
> +			       struct hinic3_dyna_txrxq_params *trxq_params)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	struct hinic3_dyna_txrxq_params cur_trxq_params = {};
> +	struct hinic3_dyna_qp_params new_qp_params = {};
> +	struct hinic3_dyna_qp_params cur_qp_params = {};
> +	int err;
> +
> +	cur_trxq_params = nic_dev->q_params;
> +
> +	hinic3_config_num_qps(netdev, trxq_params);
> +
> +	err = hinic3_alloc_channel_resources(netdev, &new_qp_params,
> +					     trxq_params);
> +	if (err) {
> +		netdev_err(netdev, "Failed to alloc channel resources\n");
> +		return err;
> +	}
> +
> +	mutex_lock(&nic_dev->change_res_mutex);
> +	hinic3_vport_down(netdev);
> +	hinic3_close_channel(netdev);
> +	hinic3_get_cur_qps(nic_dev, &cur_qp_params);
> +
> +	hinic3_init_qps(nic_dev, &new_qp_params);
> +
> +	err = hinic3_prepare_channel(netdev, trxq_params);
> +	if (err)
> +		goto err_uninit_qps;
> +
> +	if (nic_dev->num_qp_irq > trxq_params->num_qps)
> +		hinic3_qp_irq_change(netdev, trxq_params->num_qps);
> +
> +	nic_dev->q_params = *trxq_params;
> +
> +	err = hinic3_open_channel(netdev);
> +	if (err)
> +		goto err_qp_irq_reset;
> +
> +	err = hinic3_vport_up(netdev);
> +	if (err)
> +		goto err_close_channel;
> +
> +	hinic3_free_channel_resources(netdev, &cur_qp_params, &cur_trxq_params);
> +
> +	mutex_unlock(&nic_dev->change_res_mutex);
> +
> +	return 0;
> +
> +err_close_channel:
> +	hinic3_close_channel(netdev);
> +err_qp_irq_reset:
> +	nic_dev->q_params = cur_trxq_params;
> +
> +	if (trxq_params->num_qps > cur_trxq_params.num_qps)
> +		hinic3_qp_irq_change(netdev, cur_trxq_params.num_qps);
> +	hinic3_free_qp_ctxts(nic_dev);
> +err_uninit_qps:
> +	hinic3_get_cur_qps(nic_dev, &new_qp_params);
> +	hinic3_free_channel_resources(netdev, &new_qp_params, trxq_params);
> +	hinic3_free_channel_resources(netdev, &cur_qp_params, &cur_trxq_params);
> +	mutex_unlock(&nic_dev->change_res_mutex);
> +
> +	return err;
> +}
> +
>  static int hinic3_open(struct net_device *netdev)
>  {
>  	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> @@ -458,6 +537,10 @@ static int hinic3_open(struct net_device *netdev)
>  
>  	hinic3_init_qps(nic_dev, &qp_params);
>  
> +	err = hinic3_prepare_channel(netdev, &nic_dev->q_params);
> +	if (err)
> +		goto err_uninit_qps;
> +
>  	err = hinic3_open_channel(netdev);
>  	if (err)
>  		goto err_uninit_qps;
> @@ -473,7 +556,7 @@ static int hinic3_open(struct net_device *netdev)
>  err_close_channel:
>  	hinic3_close_channel(netdev);
>  err_uninit_qps:
> -	hinic3_uninit_qps(nic_dev, &qp_params);
> +	hinic3_get_cur_qps(nic_dev, &qp_params);
>  	hinic3_free_channel_resources(netdev, &qp_params, &nic_dev->q_params);
>  err_destroy_num_qps:
>  	hinic3_destroy_num_qps(netdev);
> @@ -493,10 +576,15 @@ static int hinic3_close(struct net_device *netdev)
>  		return 0;
>  	}
>  
> +	mutex_lock(&nic_dev->change_res_mutex);
>  	hinic3_vport_down(netdev);
>  	hinic3_close_channel(netdev);
> -	hinic3_uninit_qps(nic_dev, &qp_params);
> -	hinic3_free_channel_resources(netdev, &qp_params, &nic_dev->q_params);
> +	hinic3_get_cur_qps(nic_dev, &qp_params);
> +	hinic3_free_channel_resources(netdev, &qp_params,
> +				      &nic_dev->q_params);
> +	hinic3_free_nicio_res(nic_dev);
> +	hinic3_destroy_num_qps(netdev);
> +	mutex_unlock(&nic_dev->change_res_mutex);
>  
>  	return 0;
>  }
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> index 9502293ff710..005b2c01a988 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> @@ -10,6 +10,9 @@
>  #include "hinic3_hw_cfg.h"
>  #include "hinic3_hwdev.h"
>  #include "hinic3_mgmt_interface.h"
> +#include "hinic3_nic_io.h"
> +#include "hinic3_tx.h"
> +#include "hinic3_rx.h"
>  
>  #define HINIC3_VLAN_BITMAP_BYTE_SIZE(nic_dev)  (sizeof(*(nic_dev)->vlan_bitmap))
>  #define HINIC3_VLAN_BITMAP_SIZE(nic_dev)  \
> @@ -129,6 +132,8 @@ struct hinic3_nic_dev {
>  	struct work_struct              rx_mode_work;
>  	/* lock for enable/disable port */
>  	struct mutex                    port_state_mutex;
> +	/* mutex to serialize channel/resource changes */
> +	struct mutex                    change_res_mutex;
>  
>  	struct list_head                uc_filter_list;
>  	struct list_head                mc_filter_list;
> @@ -143,6 +148,10 @@ struct hinic3_nic_dev {
>  
>  void hinic3_set_netdev_ops(struct net_device *netdev);
>  int hinic3_set_hw_features(struct net_device *netdev);
> +int
> +hinic3_change_channel_settings(struct net_device *netdev,
> +			       struct hinic3_dyna_txrxq_params *trxq_params);
> +
>  int hinic3_qps_irq_init(struct net_device *netdev);
>  void hinic3_qps_irq_uninit(struct net_device *netdev);
>  
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.c b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.c
> index 87e736adba02..0e7a0ccfba98 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.c
> @@ -484,8 +484,8 @@ void hinic3_init_qps(struct hinic3_nic_dev *nic_dev,
>  	}
>  }
>  
> -void hinic3_uninit_qps(struct hinic3_nic_dev *nic_dev,
> -		       struct hinic3_dyna_qp_params *qp_params)
> +void hinic3_get_cur_qps(struct hinic3_nic_dev *nic_dev,
> +			struct hinic3_dyna_qp_params *qp_params)
>  {
>  	struct hinic3_nic_io *nic_io = nic_dev->nic_io;
>  
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> index 12eefabcf1db..571b34d63950 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> @@ -14,6 +14,10 @@ struct hinic3_nic_dev;
>  #define HINIC3_RQ_WQEBB_SHIFT      3
>  #define HINIC3_SQ_WQEBB_SIZE       BIT(HINIC3_SQ_WQEBB_SHIFT)
>  
> +#define HINIC3_MAX_TX_QUEUE_DEPTH  65536
> +#define HINIC3_MAX_RX_QUEUE_DEPTH  16384
> +#define HINIC3_MIN_QUEUE_DEPTH     128
> +
>  /* ******************** RQ_CTRL ******************** */
>  enum hinic3_rq_wqe_type {
>  	HINIC3_NORMAL_RQ_WQE = 1,
> @@ -136,8 +140,8 @@ void hinic3_free_qps(struct hinic3_nic_dev *nic_dev,
>  		     struct hinic3_dyna_qp_params *qp_params);
>  void hinic3_init_qps(struct hinic3_nic_dev *nic_dev,
>  		     struct hinic3_dyna_qp_params *qp_params);
> -void hinic3_uninit_qps(struct hinic3_nic_dev *nic_dev,
> -		       struct hinic3_dyna_qp_params *qp_params);
> +void hinic3_get_cur_qps(struct hinic3_nic_dev *nic_dev,
> +			struct hinic3_dyna_qp_params *qp_params);
>  
>  int hinic3_init_qp_ctxts(struct hinic3_nic_dev *nic_dev);
>  void hinic3_free_qp_ctxts(struct hinic3_nic_dev *nic_dev);
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
> index 309ab5901379..b5b601469517 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
> @@ -541,7 +541,7 @@ int hinic3_configure_rxqs(struct net_device *netdev, u16 num_rq,
>  		rq_associate_cqes(rxq);
>  
>  		pkts = hinic3_rx_fill_buffers(rxq);
> -		if (!pkts) {
> +		if (pkts < rxq->q_depth - 1) {

nit: just use rxq->q_mask?
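
This works because hinic3_update_qp_depth() in this patch sets q_mask to
q_depth - 1 for each queue. A small standalone sketch of that relationship,
with `fake_rxq` and the helper names as hypothetical stand-ins rather than the
driver's own code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the RX queue fields used by the check. */
struct fake_rxq {
	uint32_t q_depth;	/* always a power of two */
	uint32_t q_mask;	/* always q_depth - 1 */
};

/*
 * Round a requested depth (>= 1) down to a power of two, mirroring what
 * the patch does with 1U << ilog2().
 */
static uint32_t round_down_pow2(uint32_t v)
{
	uint32_t p = 1;

	while (p <= v / 2)
		p *= 2;
	return p;
}

static void fake_rxq_set_depth(struct fake_rxq *rxq, uint32_t requested)
{
	rxq->q_depth = round_down_pow2(requested);
	rxq->q_mask = rxq->q_depth - 1;
}

/* The fill check written against q_mask instead of q_depth - 1. */
static bool fake_rxq_fill_failed(const struct fake_rxq *rxq, uint32_t pkts)
{
	return pkts < rxq->q_mask;
}
```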

>  			netdev_err(netdev, "Failed to fill Rx buffer\n");
>  			return -ENOMEM;
>  		}



* Re: [PATCH v5 00/19] perf cs-etm: Queue context packets for frontend
From: Arnaldo Carvalho de Melo @ 2026-06-10 20:14 UTC (permalink / raw)
  To: James Clark
  Cc: Suzuki K Poulose, Mike Leach, Leo Yan, Namhyung Kim, Jiri Olsa,
	Ian Rogers, Amir Ayupov, Jonathan Corbet, Shuah Khan,
	Paschalis Mpeis, coresight, linux-perf-users, linux-kernel,
	Arnaldo Carvalho de Melo, linux-doc
In-Reply-To: <20260609-james-cs-context-tracking-fix-v5-0-d53a7d096a19@linaro.org>

On Tue, Jun 09, 2026 at 03:40:05PM +0100, James Clark wrote:
> Fix thread tracking when decoding Coresight trace and add a new test for
> it.

The issues found by sashiko seem mild, and I think you can address them
in follow-up patches.

Since the window is closing soon and it helps to have perf-tools-next
available for linux-next testing, I've merged this, ok?

- Arnaldo
 
> The new test is added as a Perf test workload instead of a custom binary
> with its own build system, but this requires a new feature in Perf test
> to pass in control pipes which can enable and disable events. This
> scopes the recording to just the workload and helps to reduce the amount
> of data recorded in tracing tests.
> 
> With this new feature we can re-write all of the Coresight tests to make
> use of it and remove the remaining binaries which fixes the following
> issues:
> 
>  * They didn't work in out of source builds
>  * A lot of the tests unnecessarily required root and didn't skip
>    without it
>  * They were mainly qualitative tests which didn't look for specific
>    behavior
> 
> Most importantly, the long build and runtime has been reduced. On a
> Radxa Orion O6, unroll_loop_thread.c took 37s to compile which is longer
> than the entire Perf build. Now the build time is negligible and the
> before and after test runtimes for all the Coresight tests are:
> 
>           |   N1SDP   |   Orion O6
>   -----------------------------------
>   Before  |   4m  0s  |    14m 49s
>   After   |      26s  |        56s
>   -----------------------------------
> 
> Signed-off-by: James Clark <james.clark@linaro.org>
> ---
> Changes in v5:
> - Forgot to include this change:
>   - Test for actual length of expected raw dump (Leo)
> - Link to v4: https://lore.kernel.org/r/20260609-james-cs-context-tracking-fix-v4-0-44f9fb9e5c42@linaro.org
> 
> Changes in v4:
> - Rename workload-ctl to record-ctl and improve docs (Leo)
> - Use new packet argument everywhere in
>   cs_etm__synth_instruction_sample() (Sashiko)
> - Test for actual length of expected raw dump (Leo)
> - Use -fno-inline instead of keyword (Leo)
> - Don't test any brace or call lines in deterministic test
> - Make sure context switch loop test does cleanup on failure (Sashiko)
> - Remove undef int overflows in workloads (Sashiko)
> - Link to v3: https://lore.kernel.org/r/20260603-james-cs-context-tracking-fix-v3-0-c392945d9ed5@linaro.org
> 
> Changes in v3:
> - Minor sashiko comments
>   - Close some more pipes
>   - Fix warning messages
>   - Error handling improvements
> - Pass packet into cs_etm__synth_instruction_sample()
> - Fixup stale comment (Leo)
> - Link to v2: https://lore.kernel.org/r/20260602-james-cs-context-tracking-fix-v2-0-85b5ce6f55c6@linaro.org
> 
> Changes in v2:
> - Add --workload-ctl option to Perf test
> - Re-write all the Coresight tests and speed them up
> - Pass packet to memory access function so frontend can use either the
>   previous or current packet's EL
> - Link to v1: https://lore.kernel.org/r/20260526-james-cs-context-tracking-fix-v1-0-ebd602e18287@linaro.org
> 
> ---
> James Clark (19):
>       perf cs-etm: Queue context packets for frontend
>       perf test: Add workload-ctl option
>       perf test: Add a workload that forces context switches
>       perf test cs-etm: Test process attribution
>       perf test: Add deterministic workload
>       perf test cs-etm: Replace unroll loop thread with deterministic decode test
>       perf test cs-etm: Remove asm_pure_loop test
>       perf test cs-etm: Replace memcpy test with raw dump stress test
>       perf test: Add named_threads workload
>       perf test cs-etm: Test decoding for concurrent threads test
>       perf test cs-etm: Remove duplicate branch tests
>       perf test cs-etm: Skip if not root
>       perf test cs-etm: Reduce snapshot size
>       perf test cs-etm: Speed up basic test
>       perf test cs-etm: Remove unused Coresight workloads
>       perf test cs-etm: Make disassembly test use kcore
>       perf test cs-etm: Add all branch instructions to test
>       perf test cs-etm: Speed up disassembly test
>       perf test cs-etm: Move existing tests to coresight folder
> 
>  Documentation/trace/coresight/coresight-perf.rst   |  78 +------
>  MAINTAINERS                                        |   2 -
>  tools/perf/Documentation/perf-test.txt             |  24 ++-
>  tools/perf/Makefile.perf                           |  14 +-
>  tools/perf/scripts/python/arm-cs-trace-disasm.py   |  20 +-
>  tools/perf/tests/builtin-test.c                    | 187 +++++++++++++++-
>  tools/perf/tests/shell/coresight/Makefile          |  29 ---
>  .../perf/tests/shell/coresight/Makefile.miniconfig |  14 --
>  tools/perf/tests/shell/coresight/asm_pure_loop.sh  |  22 --
>  .../tests/shell/coresight/asm_pure_loop/.gitignore |   1 -
>  .../tests/shell/coresight/asm_pure_loop/Makefile   |  34 ---
>  .../shell/coresight/asm_pure_loop/asm_pure_loop.S  |  30 ---
>  .../tests/shell/coresight/concurrent_threads.sh    |  45 ++++
>  .../tests/shell/coresight/context_switch_thread.sh |  69 ++++++
>  tools/perf/tests/shell/coresight/deterministic.sh  |  72 +++++++
>  .../tests/shell/coresight/memcpy_thread/.gitignore |   1 -
>  .../tests/shell/coresight/memcpy_thread/Makefile   |  33 ---
>  .../shell/coresight/memcpy_thread/memcpy_thread.c  |  80 -------
>  .../tests/shell/coresight/memcpy_thread_16k_10.sh  |  22 --
>  .../perf/tests/shell/coresight/raw_dump_stress.sh  |  65 ++++++
>  .../shell/{ => coresight}/test_arm_coresight.sh    |  43 ++--
>  .../{ => coresight}/test_arm_coresight_disasm.sh   |  23 +-
>  .../tests/shell/coresight/thread_loop/.gitignore   |   1 -
>  .../tests/shell/coresight/thread_loop/Makefile     |  33 ---
>  .../shell/coresight/thread_loop/thread_loop.c      |  85 --------
>  .../shell/coresight/thread_loop_check_tid_10.sh    |  23 --
>  .../shell/coresight/thread_loop_check_tid_2.sh     |  23 --
>  .../shell/coresight/unroll_loop_thread/.gitignore  |   1 -
>  .../shell/coresight/unroll_loop_thread/Makefile    |  33 ---
>  .../unroll_loop_thread/unroll_loop_thread.c        |  75 -------
>  .../tests/shell/coresight/unroll_loop_thread_10.sh |  22 --
>  tools/perf/tests/shell/lib/coresight.sh            | 134 ------------
>  tools/perf/tests/tests.h                           |   3 +
>  tools/perf/tests/workloads/Build                   |   4 +
>  tools/perf/tests/workloads/context_switch_loop.c   | 110 ++++++++++
>  tools/perf/tests/workloads/deterministic.c         |  39 ++++
>  tools/perf/tests/workloads/named_threads.c         | 109 ++++++++++
>  tools/perf/util/cs-etm-decoder/cs-etm-decoder.c    |  21 +-
>  tools/perf/util/cs-etm.c                           | 236 ++++++++++++---------
>  tools/perf/util/cs-etm.h                           |   8 +-
>  40 files changed, 926 insertions(+), 942 deletions(-)
> ---
> base-commit: 351a37f2fda4db668cff8ba12f2992d73dccdaea
> change-id: 20260515-james-cs-context-tracking-fix-754998bae7ed
> 
> Best regards,
> -- 
> James Clark <james.clark@linaro.org>

* htmldocs: Warning: drivers/tty/serial/serial_cortina-access.c references a file that doesn't exist: Documentation/serial/driver
From: kernel test robot @ 2026-06-10 19:50 UTC
  To: Jason Li; +Cc: oe-kbuild-all, 0day robot, linux-doc

tree:   https://github.com/intel-lab-lkp/linux/commits/Jason-Li/dt-bindings-serial-Add-binding-for-Cortina-Access-UART/20260610-193842
head:   e97c7dd14b20885c9b9f27daf2c6e0cd9e99d82a
commit: 2b08fdba152665eca1c8194820608a3f284143b6 tty: serial: Add UART driver for Cortina-Access platform
date:   8 hours ago
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project f43d6834093b19baf79beda8c0337ab020ac5f17)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260610/202606102102.JsRIO7Np-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606102102.JsRIO7Np-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Warning: Documentation/translations/zh_CN/scsi/scsi_mid_low_api.rst references a file that doesn't exist: Documentation/Configure.help
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/ABI/testing/sysfs-platform-ayaneo
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/display/bridge/megachips-stdpxxxx-ge-b850v3-fw.txt
   Warning: arch/powerpc/sysdev/mpic.c references a file that doesn't exist: Documentation/devicetree/bindings/powerpc/fsl/mpic.txt
   Warning: drivers/net/ethernet/smsc/Kconfig references a file that doesn't exist: file:Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
>> Warning: drivers/tty/serial/serial_cortina-access.c references a file that doesn't exist: Documentation/serial/driver
   Warning: rust/kernel/sync/atomic/ordering.rs references a file that doesn't exist: srctree/tools/memory-model/Documentation/explanation.txt
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: Documentation/virtual/lguest/lguest.c
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: m,\b(\S*)(Documentation/[A-Za-z0-9
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: Documentation/devicetree/dt-object-internal.txt
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: m,^Documentation/scheduler/sched-pelt

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH v11 2/2] hwmon: temperature: add support for EMC1812
From: Guenter Roeck @ 2026-06-10 19:50 UTC
  To: Marius Cristea
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	linux-hwmon, devicetree, linux-kernel, linux-doc
In-Reply-To: <20260610-hw_mon-emc1812-v11-2-cef809af5c19@microchip.com>

On Wed, Jun 10, 2026 at 06:19:47PM +0300, Marius Cristea wrote:
> This is the hwmon driver for Microchip EMC1812/13/14/15/33
> Multichannel Low-Voltage Remote Diode Sensor Family.
> 
> EMC1812 has one external remote temperature monitoring channel.
> EMC1813 has two external remote temperature monitoring channels.
> EMC1814 has three external remote temperature monitoring channels,
> channels 2 and 3 support anti parallel diode.
> EMC1815 has four external remote temperature monitoring channels and
> channels 1/2  and 3/4 support anti parallel diode.
> EMC1833 has two external remote temperature monitoring channels and
> channels 1 and 2 support anti parallel diode.
> Resistance Error Correction is supported on channels 1/2 and 3/4.
> 
> Signed-off-by: Marius Cristea <marius.cristea@microchip.com>

Applied.

Thanks,
Guenter

* Re: [PATCH v11 1/2] dt-bindings: hwmon: temperature: add support for EMC1812
From: Guenter Roeck @ 2026-06-10 19:46 UTC
  To: Marius Cristea
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	linux-hwmon, devicetree, linux-kernel, linux-doc
In-Reply-To: <20260610-hw_mon-emc1812-v11-1-cef809af5c19@microchip.com>

On Wed, Jun 10, 2026 at 06:19:46PM +0300, Marius Cristea wrote:
> This is the devicetree schema for Microchip EMC1812/13/14/15/33
> Multichannel Low-Voltage Remote Diode Sensor Family. It also
> updates the MAINTAINERS file to include the new driver.
> 
> EMC1812 has one external remote temperature monitoring channel.
> EMC1813 has two external remote temperature monitoring channels.
> EMC1814 has three external remote temperature monitoring channels and
> channels 2 and 3 support anti parallel diode.
> EMC1815 has four external remote temperature monitoring channels and
> channels 1/2  and 3/4 support anti parallel diode.
> EMC1833 has two external remote temperature monitoring channels and
> channels 1 and 2 support anti parallel diode.
> Resistance Error Correction is supported on channels 1/2 and 3/4.
> 
> Signed-off-by: Marius Cristea <marius.cristea@microchip.com>
> Reviewed-by: Rob Herring (Arm) <robh@kernel.org>

Applied.

Thanks,
Guenter

* Re: [PATCH v4 6/6] kselftest: alloc_tag: extend the allocinfo ioctl kselftest
From: kernel test robot @ 2026-06-10 19:42 UTC
  To: Abhishek Bapat, Suren Baghdasaryan, Andrew Morton,
	Kent Overstreet, Hao Ge
  Cc: oe-kbuild-all, Linux Memory Management List, Shuah Khan,
	Jonathan Corbet, linux-doc, linux-kernel, Sourav Panda,
	Abhishek Bapat
In-Reply-To: <d0a8308b4d0799876d24461a8ed9b5a71d3e1e89.1781042698.git.abhishekbapat@google.com>

Hi Suren,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20260609]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v7.1-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Abhishek-Bapat/alloc_tag-add-ioctl-to-proc-allocinfo/20260610-081508
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/d0a8308b4d0799876d24461a8ed9b5a71d3e1e89.1781042698.git.abhishekbapat%40google.com
patch subject: [PATCH v4 6/6] kselftest: alloc_tag: extend the allocinfo ioctl kselftest
config: sparc64-randconfig-r061-20260610 (https://download.01.org/0day-ci/archive/20260611/202606110300.R4LPBVBO-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260611/202606110300.R4LPBVBO-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606110300.R4LPBVBO-lkp@intel.com/

All errors (new ones prefixed by >>):

   lib/alloc_tag.c: In function 'allocinfo_compat_ioctl':
>> lib/alloc_tag.c:346:58: error: implicit declaration of function 'compat_ptr' [-Wimplicit-function-declaration]
     346 |         return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
         |                                                          ^~~~~~~~~~


vim +/compat_ptr +346 lib/alloc_tag.c

   341	
   342	#ifdef CONFIG_COMPAT
   343	static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
   344					   unsigned long arg)
   345	{
 > 346		return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
   347	}
   348	#endif
   349	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH net-next v3 0/3] Add standard stats for HSR/PRP
From: Simon Horman @ 2026-06-10 18:47 UTC
  To: MD Danish Anwar
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Shuah Khan, Roger Quadros, Andrew Lunn,
	Jacob Keller, Meghana Malladi, David Carlier, Vadim Fedorenko,
	Kevin Hao, Himanshu Mittal, Hangbin Liu, Markus Elfring,
	Fernando Fernandez Mancera, Jan Vaclav, netdev, linux-doc,
	linux-kernel, linux-arm-kernel, Felix Maurer, Luka Gejak
In-Reply-To: <20260608100930.210149-1-danishanwar@ti.com>

On Mon, Jun 08, 2026 at 03:39:27PM +0530, MD Danish Anwar wrote:
> Add standard stats for HSR / PRP. This series was initially adding HSR/PRP
> related stats for ICSSG driver. Based on maintainers' comments on v2 I am
> now adding support to dump standard stats for HSR/PRP.
> 
> The drivers which support offload can populate these standard stats.
> 
> This series only implements offloaded stats. For software-only interfaces
> Felix Maurer had said he will do it later [1]
> 
> v2 https://lore.kernel.org/all/20260514075605.850674-1-danishanwar@ti.com/
> [1] https://lore.kernel.org/all/ag87pBZfOyccPZTc@thinkpad/
> 
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Felix Maurer <fmaurer@redhat.com>
> Cc: Luka Gejak <luka.gejak@linux.dev>

Hi MD,

There is an AI-generated review of this patch-set available on both
https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/
I would appreciate it if you could look over that with a view
to addressing any issues that directly affect this patch-set.

* Re: [PATCH v5 10/21] nfsd: add notification handlers for dir events
From: Jeff Layton @ 2026-06-10 18:38 UTC
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker, Jonathan Corbet,
	Shuah Khan
  Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
	Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
	linux-doc, linux-nfs
In-Reply-To: <efdade0b-38f2-4e5e-b6dc-567d9eea97a9@app.fastmail.com>

On Mon, 2026-06-08 at 16:52 -0400, Chuck Lever wrote:
> 
> On Fri, May 22, 2026, at 3:42 PM, Jeff Layton wrote:
> > Add the necessary parts to accept a fsnotify callback for directory
> > change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> > is created set a handle_event callback to handle the notification.
> 
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index e17488a911f7..31df04675713 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -4172,6 +4172,127 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
> >  	goto out;
> >  }
> > 
> > +static bool
> > +nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
> > +			  struct dentry *dentry, struct nfs4_delegation *dp,
> > +			  struct nfsd_file *nf, char *name, u32 namelen)
> > +{
> > +	uint32_t *attrmask;
> > +
> > +	/* Reserve space for attrmask */
> > +	attrmask = xdr_reserve_space(xdr, 3 * sizeof(uint32_t));
> > +	if (!attrmask)
> > +		return false;
> > +
> > +	ne->ne_file.data = name;
> > +	ne->ne_file.len = namelen;
> > +	ne->ne_attrs.attrmask.element = attrmask;
> > +
> > +	attrmask[0] = 0;
> > +	attrmask[1] = 0;
> > +	attrmask[2] = 0;
> > +	ne->ne_attrs.attr_vals.data = NULL;
> > +	ne->ne_attrs.attr_vals.len = 0;
> > +	ne->ne_attrs.attrmask.count = 1;
> > +	return true;
> > +}
> > +
> > +/**
> > + * nfsd4_encode_notify_event - encode a notify
> > + * @xdr: stream to which to encode the fattr4
> > + * @nne: nfsd_notify_event to encode
> > + * @dp: delegation where the event occurred
> > + * @nf: nfsd_file on which event occurred
> > + * @notify_mask: pointer to word where notification mask should be set
> > + *
> > + * Encode @nne into @xdr. Returns a pointer to the start of the event, or NULL if
> > + * the event couldn't be encoded. The appropriate bit in the notify_mask will also
> > + * be set on success.
> > + */
> > +u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct nfsd_notify_event *nne,
> > +			      struct nfs4_delegation *dp, struct nfsd_file *nf,
> > +			      u32 *notify_mask)
> > +{
> > +	u8 *p = NULL;
> > +
> > +	*notify_mask = 0;
> > +
> > +	if (nne->ne_mask & FS_DELETE) {
> > +		struct notify_remove4 nr = { };
> > +
> > +		if (!nfsd4_setup_notify_entry4(&nr.nrm_old_entry, xdr, nne->ne_dentry, dp,
> > +					       nf, nne->ne_name, nne->ne_namelen))
> > +			goto out_err;
> > +		p = (u8 *)xdr->p;
> > +		if (!xdrgen_encode_notify_remove4(xdr, &nr))
> > +			goto out_err;
> > +		*notify_mask |= BIT(NOTIFY4_REMOVE_ENTRY);
> > +	} else if (nne->ne_mask & FS_CREATE) {
> > +		struct notify_add4 na = { };
> > +		struct notify_remove4 old = { };
> > +
> > +		if (!nfsd4_setup_notify_entry4(&na.nad_new_entry, xdr, nne->ne_dentry, dp,
> > +					       nf, nne->ne_name, nne->ne_namelen))
> > +			goto out_err;
> > +
> > +		/* If a file was overwritten, report it in nad_old_entry */
> > +		if (nne->ne_target) {
> > +			if (!nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
> > +						       NULL, dp, nf,
> > +						       nne->ne_name, nne->ne_namelen))
> > +				goto out_err;
> > +			na.nad_old_entry.count = 1;
> > +			na.nad_old_entry.element = &old;
> > +		}
> > +
> > +		p = (u8 *)xdr->p;
> > +		if (!xdrgen_encode_notify_add4(xdr, &na))
> > +			goto out_err;
> > +
> > +		*notify_mask |= BIT(NOTIFY4_ADD_ENTRY);
> > +	} else if (nne->ne_mask & FS_RENAME) {
> > +		struct notify_rename4 nr = { };
> > +		struct notify_remove4 old = { };
> > +		struct name_snapshot n;
> > +		bool ret;
> > +
> > +		/* Don't send any attributes in the old_entry since they're the same in new */
> > +		if (!nfsd4_setup_notify_entry4(&nr.nrn_old_entry.nrm_old_entry, xdr,
> > +					       NULL, dp, nf, nne->ne_name,
> > +					       nne->ne_namelen))
> > +			goto out_err;
> > +
> > +		take_dentry_name_snapshot(&n, nne->ne_dentry);
> > +		ret = nfsd4_setup_notify_entry4(&nr.nrn_new_entry.nad_new_entry, xdr,
> > +					       nne->ne_dentry, dp, nf, (char *)n.name.name,
> > +					       n.name.len);
> 
> Now once I got all of the previous edits in place, all three LLM
> reviewers identified an issue here that might require a significant
> rewrite. This is why I stopped the minor editing here and decided
> it was time for you to consider restructuring (or not). I haven't
> looked at patches 11-21.
> 
>   I think the new name here has a time-of-use problem.
>   
>   nrn_old_entry uses nne->ne_name, which alloc_nfsd_notify_event() copied
>   when fsnotify delivered the rename.  nrn_new_entry instead reads the
>   live dentry via take_dentry_name_snapshot() at callback-prepare time,
>   which can run long after the event was queued.
> 
>   CB_NOTIFY is asynchronous: nfsd_handle_dir_event() queues the event on
>   ncn_evt[] and nothing holds ne_dentry stable until the work runs.
>   d_move() reuses the same dentry and rewrites d_name in place, so a
>   second rename of the entry before the queued callback encodes leaves
>   the dget'd ne_dentry carrying the later name.  An A->B event then
>   encodes as A->C, and a client holding the directory delegation applies
>   the wrong old->new mapping to its cache.  The old name is immune
>   because it was snapshotted up front; only the new name is read late.
> 
>   The new name is available at notification time -- fsnotify_move() passes
>   &moved->d_name as new_name, and ne_dentry is that moved dentry -- so
>   alloc_nfsd_notify_event() can snapshot it alongside the old name.
> 
> What I haven't assessed is whether the suggested restructuring is
> now vulnerable to misbehavior during memory exhaustion.
> 

That sounds legit. We probably need to snapshot the name sooner, when
we create the event. I'll spin something up. As far as memory
exhaustion goes: if that happens we'll just recall the delegation.
That's always the remedy when there are problems here.
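
For readers following along, the snapshot-at-creation idea can be sketched
in userspace C. This is a minimal model and not the nfsd code: struct
fake_dentry, struct rename_event and the helper names are hypothetical
stand-ins, and malloc() replaces the kernel allocator. Both names are
copied when the event is created, so a later in-place rename of the dentry
cannot change what the queued event reports.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a dentry whose name can be rewritten in place. */
struct fake_dentry {
	char name[64];
};

/* Hypothetical event record: both names are private copies. */
struct rename_event {
	char *old_name;
	char *new_name;
};

/* Copy a NUL-terminated name; returns NULL if allocation fails. */
static char *dup_name(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/*
 * Snapshot the old and new names at event creation time.
 * Returns NULL on allocation failure; a caller would then fall back
 * to whatever its error remedy is (a recall, in the thread's terms).
 */
static struct rename_event *rename_event_alloc(const char *old_name,
					       const struct fake_dentry *moved)
{
	struct rename_event *ev = malloc(sizeof(*ev));

	if (!ev)
		return NULL;
	ev->old_name = dup_name(old_name);
	ev->new_name = dup_name(moved->name);
	if (!ev->old_name || !ev->new_name) {
		free(ev->old_name);
		free(ev->new_name);
		free(ev);
		return NULL;
	}
	return ev;
}

/* Release an event and both name copies. */
static void rename_event_free(struct rename_event *ev)
{
	if (!ev)
		return;
	free(ev->old_name);
	free(ev->new_name);
	free(ev);
}
```

With this shape, an A->B event keeps reporting A->B even if the same
fake_dentry is renamed to C before the event is consumed.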

> 
> > +
> > +		/* If a file was overwritten, report it in nad_old_entry */
> > +		if (ret && nne->ne_target) {
> > +			ret = nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
> > +							NULL, dp, nf,
> > +							(char *)n.name.name, n.name.len);
> > +			if (ret) {
> > +				nr.nrn_new_entry.nad_old_entry.count = 1;
> > +				nr.nrn_new_entry.nad_old_entry.element = &old;
> > +			}
> > +		}
> > +
> > +		if (ret) {
> > +			p = (u8 *)xdr->p;
> > +			ret = xdrgen_encode_notify_rename4(xdr, &nr);
> > +		}
> > +		release_dentry_name_snapshot(&n);
> > +		if (!ret)
> > +			goto out_err;
> > +		*notify_mask |= BIT(NOTIFY4_RENAME_ENTRY);
> > +	}
> > +	return p;
> > +out_err:
> > +	pr_warn("nfsd: unable to marshal notify_rename4 to xdr stream\n");
> > +	return NULL;
> > +}
> > +
> >  static void svcxdr_init_encode_from_buffer(struct xdr_stream *xdr,
> >  				struct xdr_buf *buf, __be32 *p, int bytes)
> >  {
> 

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH v5 10/21] nfsd: add notification handlers for dir events
From: Jeff Layton @ 2026-06-10 18:33 UTC
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker, Jonathan Corbet,
	Shuah Khan
  Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
	Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
	linux-doc, linux-nfs
In-Reply-To: <344ed039-86ce-4125-8476-2e5d22e40fdc@app.fastmail.com>

On Mon, 2026-06-08 at 16:40 -0400, Chuck Lever wrote:
> 
> On Fri, May 22, 2026, at 3:42 PM, Jeff Layton wrote:
> > Add the necessary parts to accept a fsnotify callback for directory
> > change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> > is created set a handle_event callback to handle the notification.
> > 
> > Use that to allocate a nfsd_notify_event object and then hand off a
> > reference to each delegation's CB_NOTIFY. If anything fails along the
> > way, recall any affected delegations.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> There are some significant-looking sashiko review findings which I did
> not follow up on.
> 

I plan to go over Sashiko's findings after I go through your responses.

> 
> > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> > index ea3e7deb06fa..1964a213f80e 100644
> > --- a/fs/nfsd/nfs4callback.c
> > +++ b/fs/nfsd/nfs4callback.c
> > @@ -870,21 +870,30 @@ static void nfs4_xdr_enc_cb_notify(struct rpc_rqst *req,
> >  				   const void *data)
> >  {
> >  	const struct nfsd4_callback *cb = data;
> > +	struct nfsd4_cb_notify *ncn = container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> > +	struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> >  	struct nfs4_cb_compound_hdr hdr = {
> >  		.ident = 0,
> >  		.minorversion = cb->cb_clp->cl_minorversion,
> >  	};
> > -	struct CB_NOTIFY4args args = { };
> > +	struct CB_NOTIFY4args args;
> > +	__be32 *p;
> > 
> >  	WARN_ON_ONCE(hdr.minorversion == 0);
> > 
> >  	encode_cb_compound4args(xdr, &hdr);
> >  	encode_cb_sequence4args(xdr, cb, &hdr);
> > 
> > -	/*
> > -	 * FIXME: get stateid and fh from delegation. Inline the cna_changes
> > -	 * buffer, and zero it.
> > -	 */
> > +	p = xdr_reserve_space(xdr, 4);
> > +	*p = cpu_to_be32(OP_CB_NOTIFY);
> > +
> > +	args.cna_stateid.seqid = dp->dl_stid.sc_stateid.si_generation;
> > +	memcpy(&args.cna_stateid.other, &dp->dl_stid.sc_stateid.si_opaque,
> > +	       ARRAY_SIZE(args.cna_stateid.other));
> > +	args.cna_fh.len = dp->dl_stid.sc_file->fi_fhandle.fh_size;
> > +	args.cna_fh.data = dp->dl_stid.sc_file->fi_fhandle.fh_raw;
> > +	args.cna_changes.count = ncn->ncn_nf_cnt;
> > +	args.cna_changes.element = ncn->ncn_nf;
> >  	WARN_ON_ONCE(!xdrgen_encode_CB_NOTIFY4args(xdr, &args));
> > 
> >  	hdr.nops++;
> 
> I want to avoid the need to use xdrgen to encode the CB_NOTIFY arguments.
> How about this:
> 
> +       struct nfsd4_cb_notify *ncn = container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> +       struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> 
>    ...
> 
> +       encode_stateid4(xdr, &dp->dl_stid.sc_stateid);
> +       encode_nfs_fh4(xdr, &dp->dl_stid.sc_file->fi_fhandle);
> +       xdr_stream_encode_u32(xdr, ncn->ncn_nf_cnt);
> +       for (u32 i = 0; i < ncn->ncn_nf_cnt; i++)
> +               (void)xdrgen_encode_notify4(xdr, &ncn->ncn_nf[i]);
> 
> And then add a "pragma public notify4;" in nfs4_1.x .
> 

For those following along, Chuck and I had a private discussion and I
think we're going to keep this calling xdrgen_encode_CB_NOTIFY4args()
for now. I am dropping the WARN_ON_ONCE though.
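
As a rough illustration of what hand-encoding a counted array involves,
the sketch below writes a big-endian count followed by length-prefixed,
zero-padded opaque items, which is the general XDR framing. It is a
userspace model only: struct xdr_out and the xdr_put_*() helpers are
hypothetical names and are not the kernel's xdr_stream API, and each
string item merely stands in for an already-encoded element.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical output cursor; not the kernel's struct xdr_stream. */
struct xdr_out {
	uint8_t *buf;
	size_t len;	/* capacity of buf in bytes */
	size_t pos;	/* bytes written so far */
};

/* Append one 32-bit value in big-endian order. Returns 0, or -1 if full. */
static int xdr_put_u32(struct xdr_out *x, uint32_t v)
{
	if (x->len - x->pos < 4)
		return -1;
	x->buf[x->pos++] = (uint8_t)(v >> 24);
	x->buf[x->pos++] = (uint8_t)(v >> 16);
	x->buf[x->pos++] = (uint8_t)(v >> 8);
	x->buf[x->pos++] = (uint8_t)v;
	return 0;
}

/* Append a variable-length opaque: length word, data, zero pad to 4 bytes. */
static int xdr_put_opaque(struct xdr_out *x, const void *data, uint32_t n)
{
	size_t padded = ((size_t)n + 3) & ~(size_t)3;

	/* Check room for the length word and the padded body up front. */
	if (x->len - x->pos < 4 + padded)
		return -1;
	xdr_put_u32(x, n);
	if (n)
		memcpy(x->buf + x->pos, data, n);
	memset(x->buf + x->pos + n, 0, padded - n);
	x->pos += padded;
	return 0;
}

/* Append an element count, then each element in order. */
static int xdr_put_opaque_array(struct xdr_out *x, const char *const *items,
				uint32_t count)
{
	uint32_t i;

	if (xdr_put_u32(x, count))
		return -1;
	for (i = 0; i < count; i++) {
		if (xdr_put_opaque(x, items[i], (uint32_t)strlen(items[i])))
			return -1;
	}
	return 0;
}
```

Note that a failure partway through leaves the earlier items in the
buffer, so a caller of this sketch has to discard the whole buffer on
error.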

> 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index b0652c755b3b..20477144475b 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> 
> > @@ -3461,19 +3462,131 @@ nfsd4_cb_getattr_release(struct nfsd4_callback *cb)
> >  	nfs4_put_stid(&dp->dl_stid);
> >  }
> > 
> > +static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
> > +{
> > +	bool queued;
> > +
> > +	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &dp->dl_recall.cb_flags))
> > +		return;
> > +
> > +	/*
> > +	 * We're assuming the state code never drops its reference
> > +	 * without first removing the lease.  Since we're in this lease
> > +	 * callback (and since the lease code is serialized by the
> > +	 * flc_lock) we know the server hasn't removed the lease yet, and
> > +	 * we know it's safe to take a reference.
> > +	 */
> > +	refcount_inc(&dp->dl_stid.sc_count);
> > +	queued = nfsd4_run_cb(&dp->dl_recall);
> > +	WARN_ON_ONCE(!queued);
> > +	if (!queued)
> > +		refcount_dec(&dp->dl_stid.sc_count);
> > +}
> > +
> > +static bool
> > +nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
> > +{
> > +	struct nfsd4_cb_notify *ncn = container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> > +	struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> > +	struct nfsd_notify_event *events[NOTIFY4_EVENT_QUEUE_SIZE];
> > +	struct xdr_buf xdr = { .buflen = PAGE_SIZE * NOTIFY4_PAGE_ARRAY_SIZE,
> > +			       .pages  = ncn->ncn_pages };
> > +	struct xdr_stream stream;
> > +	struct nfsd_file *nf;
> > +	int count, i;
> > +	bool error = false;
> > +
> > +	xdr_init_encode_pages(&stream, &xdr);
> > +
> > +	spin_lock(&ncn->ncn_lock);
> > +	count = ncn->ncn_evt_cnt;
> > +
> > +	/* spurious queueing? */
> > +	if (count == 0) {
> > +		spin_unlock(&ncn->ncn_lock);
> > +		return false;
> > +	}
> > +
> > +	/* we can't keep up! */
> > +	if (count > NOTIFY4_EVENT_QUEUE_SIZE) {
> > +		spin_unlock(&ncn->ncn_lock);
> > +		goto out_recall;
> > +	}
> > +
> > +	memcpy(events, ncn->ncn_evt, sizeof(*events) * count);
> > +	ncn->ncn_evt_cnt = 0;
> > +	spin_unlock(&ncn->ncn_lock);
> > +
> > +	rcu_read_lock();
> > +	nf = nfsd_file_get(rcu_dereference(dp->dl_stid.sc_file->fi_deleg_file));
> > +	rcu_read_unlock();
> > +	if (!nf) {
> > +		for (i = 0; i < count; ++i)
> > +			nfsd_notify_event_put(events[i]);
> > +		goto out_recall;
> > +	}
> > +
> > +	for (i = 0; i < count; ++i) {
> > +		struct nfsd_notify_event *nne = events[i];
> > +
> > +		if (!error) {
> > +			u32 *maskp = (u32 *)xdr_reserve_space(&stream, sizeof(*maskp));
> > +			u8 *p;
> > +
> > +			if (!maskp) {
> > +				error = true;
> > +				goto put_event;
> > +			}
> > +
> > +			p = nfsd4_encode_notify_event(&stream, nne, dp, nf, maskp);
> > +			if (!p) {
> > +				pr_notice("Could not generate CB_NOTIFY from fsnotify mask 0x%x\n",
> > +					  nne->ne_mask);
> > +				error = true;
> > +				goto put_event;
> > +			}
> > +
> > +			ncn->ncn_nf[i].notify_mask.count = 1;
> > +			ncn->ncn_nf[i].notify_mask.element = maskp;
> > +			ncn->ncn_nf[i].notify_vals.data = p;
> > +			ncn->ncn_nf[i].notify_vals.len = (u8 *)stream.p - p;
> > +		}
> > +put_event:
> > +		nfsd_notify_event_put(nne);
> > +	}
> > +	if (!error) {
> > +		ncn->ncn_nf_cnt = count;
> > +		nfsd_file_put(nf);
> > +		return true;
> > +	}
> > +	nfsd_file_put(nf);
> > +out_recall:
> > +	nfsd_break_one_deleg(dp);
> > +	return false;
> > +}
> > +
> >  static int
> >  nfsd4_cb_notify_done(struct nfsd4_callback *cb,
> >  				struct rpc_task *task)
> >  {
> > +	struct nfsd4_cb_notify *ncn = container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> > +	struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> > +
> >  	switch (task->tk_status) {
> >  	case -NFS4ERR_DELAY:
> >  		rpc_delay(task, 2 * HZ);
> >  		return 0;
> >  	default:
> > +		/* For any other hard error, recall the deleg */
> > +		nfsd_break_one_deleg(dp);
> > +		fallthrough;
> > +	case 0:
> >  		return 1;
> >  	}
> >  }
> > 
> > +static void nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn);
> > +
> >  static void
> >  nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> >  {
> > @@ -3482,6 +3595,9 @@ nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> >  	struct nfs4_delegation *dp =
> >  			container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> > 
> > +	/* Drain events that arrived while this callback was in flight */
> > +	if (ncn->ncn_evt_cnt > 0)
> > +		nfsd4_run_cb_notify(ncn);
> 
> The above check needs to be serialized with modification of
> ncn_evt_cnt:
>
> +       bool pending;
>  
> +       /* Drain events that arrived while this callback was in flight */
> +       spin_lock(&ncn->ncn_lock);
> +       pending = ncn->ncn_evt_cnt > 0;
> +       spin_unlock(&ncn->ncn_lock);
> +       if (pending)
> +               nfsd4_run_cb_notify(ncn);
> 

I need to ponder this. Does this matter?

NFSD4_CALLBACK_RUNNING is now clear, which should be observed by
another task queueing a new event. READ_ONCE() seems like it should be
sufficient here. I'll run it by Claude.
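
To make the question concrete, here is a small userspace model of the
locked variant from the suggestion above. It assumes a pthread mutex in
place of the spinlock, and struct notify_queue, QUEUE_LIMIT and the
queue_*() helpers are hypothetical names. It only shows what "sample the
counter under the same lock the producer takes" looks like; whether a
lockless read is enough in nfsd is not something this sketch settles.

```c
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_LIMIT 4	/* hypothetical; stands in for the real queue size */

/* Hypothetical model of a per-delegation event queue counter. */
struct notify_queue {
	pthread_mutex_t lock;
	int evt_cnt;
};

static void queue_init(struct notify_queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->evt_cnt = 0;
}

/* Producer side: count one more queued event, or report overflow. */
static bool queue_add(struct notify_queue *q)
{
	bool added = false;

	pthread_mutex_lock(&q->lock);
	if (q->evt_cnt < QUEUE_LIMIT) {
		q->evt_cnt++;
		added = true;
	}
	pthread_mutex_unlock(&q->lock);
	return added;
}

/* Consumer side: take every queued event; returns how many were taken. */
static int queue_drain(struct notify_queue *q)
{
	int taken;

	pthread_mutex_lock(&q->lock);
	taken = q->evt_cnt;
	q->evt_cnt = 0;
	pthread_mutex_unlock(&q->lock);
	return taken;
}

/* Release side: sample the counter under the lock the producer uses. */
static bool queue_has_pending(struct notify_queue *q)
{
	bool pending;

	pthread_mutex_lock(&q->lock);
	pending = q->evt_cnt > 0;
	pthread_mutex_unlock(&q->lock);
	return pending;
}
```

The release path would call queue_has_pending() and requeue the work only
when it returns true.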


> 
> >  	nfs4_put_stid(&dp->dl_stid);
> >  }
> > 
> 
> > @@ -9858,3 +9954,133 @@ void nfsd_update_cmtime_attr(struct file *f, unsigned int flags)
> >  				      MINOR(inode->i_sb->s_dev),
> >  				      inode->i_ino, ret);
> >  }
> > +
> > +static void
> > +nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn)
> > +{
> > +	struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> > +
> > +	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags))
> > +		return;
> > +
> > +	if (!refcount_inc_not_zero(&dp->dl_stid.sc_count))
> > +		clear_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags);
> > +	else
> > +		nfsd4_run_cb(&ncn->ncn_cb);
> > +}
> > +
> > +static struct nfsd_notify_event *
> > +alloc_nfsd_notify_event(u32 mask, const struct qstr *q, struct dentry *dentry,
> > +			struct inode *target)
> > +{
> > +	struct nfsd_notify_event *ne;
> > +
> > +	ne = kmalloc(sizeof(*ne) + q->len + 1, GFP_NOFS);
> > +	if (!ne)
> > +		return NULL;
> > +
> > +	memcpy(&ne->ne_name, q->name, q->len);
> > +	refcount_set(&ne->ne_ref, 1);
> > +	ne->ne_mask = mask;
> > +	ne->ne_name[q->len] = '\0';
> > +	ne->ne_namelen = q->len;
> > +	ne->ne_dentry = dget(dentry);
> > +	ne->ne_target = target;
> > +	if (ne->ne_target)
> > +		ihold(ne->ne_target);
> > +	return ne;
> > +}
> > +
> > +static bool
> > +should_notify_deleg(u32 mask, struct file_lease *fl)
> > +{
> > +	/* Don't notify the client generating the event */
> > +	if (nfsd_breaker_owns_lease(fl))
> > +		return false;
> > +
> > +	/* Skip if this event wasn't ignored by the lease */
> > +	if ((mask & FS_DELETE) && !(fl->c.flc_flags & FL_IGN_DIR_DELETE))
> > +		return false;
> > +	if ((mask & FS_CREATE) && !(fl->c.flc_flags & FL_IGN_DIR_CREATE))
> > +		return false;
> > +	if ((mask & FS_RENAME) && !(fl->c.flc_flags & FL_IGN_DIR_RENAME))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static void
> > +nfsd_recall_all_dir_delegs(const struct inode *dir)
> > +{
> > +	struct file_lock_context *ctx = locks_inode_context(dir);
> > +	struct file_lock_core *flc;
> > +
> > +	spin_lock(&ctx->flc_lock);
> > +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> > +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> > +
> > +		if (fl->fl_lmops == &nfsd_lease_mng_ops)
> > +			nfsd_break_deleg_cb(fl);
> > +	}
> > +	spin_unlock(&ctx->flc_lock);
> > +}
> > +
> > +int
> > +nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void *data,
> > +		      int data_type, const struct qstr *name)
> > +{
> > +	struct dentry *dentry = fsnotify_data_dentry(data, data_type);
> > +	struct inode *target = fsnotify_data_rename_target(data, data_type);
> > +	struct file_lock_context *ctx;
> > +	struct file_lock_core *flc;
> > +	struct nfsd_notify_event *evt;
> > +
> > +	/* Normalize cross-dir rename events to create/delete */
> > +	if (mask & FS_MOVED_FROM) {
> > +		mask &= ~FS_MOVED_FROM;
> > +		mask |= FS_DELETE;
> > +	}
> > +	if (mask & FS_MOVED_TO) {
> > +		mask &= ~FS_MOVED_TO;
> > +		mask |= FS_CREATE;
> > +	}
> > +
> 
> I inserted an extra check here for rename notifications:
> 
> +       /*
> +        * FS_RENAME fires on the source directory even for a cross-dir
> +        * rename, where the moved entry now lives under a different
> +        * parent. NOTIFY4_RENAME_ENTRY describes an in-place rename, so
> +        * reporting it here would advertise a name absent from this
> +        * directory.
> +        */
> +       if ((mask & FS_RENAME) && dentry && d_inode(dentry->d_parent) != dir)
> +               mask &= ~FS_RENAME;
> 

Thanks. I'll add that in.
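
The combined normalization plus the extra cross-directory check can be
modelled in a few lines of userspace C. This is a sketch under stated
assumptions: the EV_* flag values are made up for illustration and are not
the kernel's FS_* values, and the boolean entry_still_in_dir replaces the
d_inode(dentry->d_parent) != dir comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event bits; the values are illustrative only. */
#define EV_MOVED_FROM	UINT32_C(0x01)
#define EV_MOVED_TO	UINT32_C(0x02)
#define EV_CREATE	UINT32_C(0x04)
#define EV_DELETE	UINT32_C(0x08)
#define EV_RENAME	UINT32_C(0x10)

/*
 * Reduce a raw event mask to the create/delete/rename bits a directory
 * watcher should report. Returns 0 when nothing is left to report.
 */
static uint32_t normalize_dir_event(uint32_t mask, bool entry_still_in_dir)
{
	/* A move out of this directory looks like a delete from here. */
	if (mask & EV_MOVED_FROM) {
		mask &= ~EV_MOVED_FROM;
		mask |= EV_DELETE;
	}
	/* A move into this directory looks like a create from here. */
	if (mask & EV_MOVED_TO) {
		mask &= ~EV_MOVED_TO;
		mask |= EV_CREATE;
	}
	/* A rename is only an in-place rename if the entry stayed put. */
	if ((mask & EV_RENAME) && !entry_still_in_dir)
		mask &= ~EV_RENAME;

	return mask & (EV_CREATE | EV_DELETE | EV_RENAME);
}
```

In this model a cross-directory rename seen from the source directory
reduces to a plain delete, and the rename bit survives only when the entry
is still under the watched directory.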

> 
> > +	/* Don't do anything if this is not an expected event */
> > +	if (!(mask & (FS_CREATE|FS_DELETE|FS_RENAME)))
> > +		return 0;
> > +
> > +	ctx = locks_inode_context(dir);
> > +	if (!ctx || list_empty(&ctx->flc_lease))
> > +		return 0;
> > +
> > +	evt = alloc_nfsd_notify_event(mask, name, dentry, target);
> > +	if (!evt) {
> > +		nfsd_recall_all_dir_delegs(dir);
> > +		return 0;
> > +	}
> > +
> > +	spin_lock(&ctx->flc_lock);
> > +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> > +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> > +		struct nfs4_delegation *dp = flc->flc_owner;
> > +		struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
> > +
> 
> I added:
> 
> +               if (fl->fl_lmops != &nfsd_lease_mng_ops)
> +                       continue;
> 
> Otherwise the loop treats every lease on the inode as an nfsd delegation
> unconditionally.
> 

This is not necessary. should_notify_deleg() calls
nfsd_breaker_owns_lease(), which already checks this before doing
anything else.

> 
> > +		if (!should_notify_deleg(mask, fl))
> > +			continue;
> > +
> > +		spin_lock(&ncn->ncn_lock);
> > +		if (ncn->ncn_evt_cnt >= NOTIFY4_EVENT_QUEUE_SIZE) {
> > +			/* We're generating notifications too fast. Recall. */
> > +			spin_unlock(&ncn->ncn_lock);
> > +			nfsd_break_deleg_cb(fl);
> > +			continue;
> > +		}
> > +		ncn->ncn_evt[ncn->ncn_evt_cnt++] = nfsd_notify_event_get(evt);
> > +		spin_unlock(&ncn->ncn_lock);
> > +
> > +		nfsd4_run_cb_notify(ncn);
> > +	}
> > +	spin_unlock(&ctx->flc_lock);
> > +	nfsd_notify_event_put(evt);
> > +	return 0;
> > +}
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index e17488a911f7..31df04675713 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -4172,6 +4172,127 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, 
> > struct xdr_stream *xdr,
> >  	goto out;
> >  }
> > 
> > +static bool
> > +nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream 
> > *xdr,
> > +			  struct dentry *dentry, struct nfs4_delegation *dp,
> > +			  struct nfsd_file *nf, char *name, u32 namelen)
> > +{
> > +	uint32_t *attrmask;
> > +
> > +	/* Reserve space for attrmask */
> > +	attrmask = xdr_reserve_space(xdr, 3 * sizeof(uint32_t));
> > +	if (!attrmask)
> > +		return false;
> > +
> > +	ne->ne_file.data = name;
> > +	ne->ne_file.len = namelen;
> > +	ne->ne_attrs.attrmask.element = attrmask;
> > +
> > +	attrmask[0] = 0;
> > +	attrmask[1] = 0;
> > +	attrmask[2] = 0;
> > +	ne->ne_attrs.attr_vals.data = NULL;
> > +	ne->ne_attrs.attr_vals.len = 0;
> > +	ne->ne_attrs.attrmask.count = 1;
> > +	return true;
> > +}
> > +
> > +/**
> > + * nfsd4_encode_notify_event - encode a notify
> > + * @xdr: stream to which to encode the fattr4
> > + * @nne: nfsd_notify_event to encode
> > + * @dp: delegation where the event occurred
> > + * @nf: nfsd_file on which event occurred
> > + * @notify_mask: pointer to word where notification mask should be set
> > + *
> > + * Encode @nne into @xdr. Returns a pointer to the start of the event, 
> > or NULL if
> > + * the event couldn't be encoded. The appropriate bit in the 
> > notify_mask will also
> > + * be set on success.
> > + */
> 
> Nit: Let's use the usual kdoc style to describe the return value.
> 

Ok, will fix.

> + * Encode @nne into @xdr. The matching bit in @notify_mask is set on
> + * success.
> + *
> + * Return: pointer to the start of the encoded event, or NULL if the
> + * event could not be encoded.
> + */
> 
> 
> > +u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct 
> > nfsd_notify_event *nne,
> > +			      struct nfs4_delegation *dp, struct nfsd_file *nf,
> > +			      u32 *notify_mask)
> > +{
> > +	u8 *p = NULL;
> > +
> > +	*notify_mask = 0;
> > +
> > +	if (nne->ne_mask & FS_DELETE) {
> > +		struct notify_remove4 nr = { };
> > +
> > +		if (!nfsd4_setup_notify_entry4(&nr.nrm_old_entry, xdr, 
> > nne->ne_dentry, dp,
> > +					       nf, nne->ne_name, nne->ne_namelen))
> > +			goto out_err;
> > +		p = (u8 *)xdr->p;
> > +		if (!xdrgen_encode_notify_remove4(xdr, &nr))
> > +			goto out_err;
> > +		*notify_mask |= BIT(NOTIFY4_REMOVE_ENTRY);
> > +	} else if (nne->ne_mask & FS_CREATE) {
> > +		struct notify_add4 na = { };
> > +		struct notify_remove4 old = { };
> > +
> > +		if (!nfsd4_setup_notify_entry4(&na.nad_new_entry, xdr, 
> > nne->ne_dentry, dp,
> > +					       nf, nne->ne_name, nne->ne_namelen))
> > +			goto out_err;
> > +
> > +		/* If a file was overwritten, report it in nad_old_entry */
> > +		if (nne->ne_target) {
> > +			if (!nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
> > +						       NULL, dp, nf,
> > +						       nne->ne_name, nne->ne_namelen))
> > +				goto out_err;
> > +			na.nad_old_entry.count = 1;
> > +			na.nad_old_entry.element = &old;
> > +		}
> > +
> > +		p = (u8 *)xdr->p;
> > +		if (!xdrgen_encode_notify_add4(xdr, &na))
> > +			goto out_err;
> > +
> > +		*notify_mask |= BIT(NOTIFY4_ADD_ENTRY);
> > +	} else if (nne->ne_mask & FS_RENAME) {
> > +		struct notify_rename4 nr = { };
> > +		struct notify_remove4 old = { };
> > +		struct name_snapshot n;
> > +		bool ret;
> > +
> > +		/* Don't send any attributes in the old_entry since they're the same 
> > in new */
> > +		if (!nfsd4_setup_notify_entry4(&nr.nrn_old_entry.nrm_old_entry, xdr,
> > +					       NULL, dp, nf, nne->ne_name,
> > +					       nne->ne_namelen))
> > +			goto out_err;
> > +
> > +		take_dentry_name_snapshot(&n, nne->ne_dentry);
> > +		ret = nfsd4_setup_notify_entry4(&nr.nrn_new_entry.nad_new_entry, xdr,
> > +					       nne->ne_dentry, dp, nf, (char *)n.name.name,
> > +					       n.name.len);
> > +
> > +		/* If a file was overwritten, report it in nad_old_entry */
> > +		if (ret && nne->ne_target) {
> > +			ret = nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
> > +							NULL, dp, nf,
> > +							(char *)n.name.name, n.name.len);
> > +			if (ret) {
> > +				nr.nrn_new_entry.nad_old_entry.count = 1;
> > +				nr.nrn_new_entry.nad_old_entry.element = &old;
> > +			}
> > +		}
> > +
> > +		if (ret) {
> > +			p = (u8 *)xdr->p;
> > +			ret = xdrgen_encode_notify_rename4(xdr, &nr);
> > +		}
> > +		release_dentry_name_snapshot(&n);
> > +		if (!ret)
> > +			goto out_err;
> > +		*notify_mask |= BIT(NOTIFY4_RENAME_ENTRY);
> > +	}
> > +	return p;
> > +out_err:
> > +	pr_warn("nfsd: unable to marshal notify_rename4 to xdr stream\n");
> 
> Nit: The warning needs to match the semantics of nfsd4_encode_notify_event().
> How about:
> 
> +       pr_warn("nfsd: unable to marshal notify event to xdr stream\n");
> 

Sounds good.

> 
> > +	return NULL;
> > +}
> > +
> 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-10 17:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aiMVLtblIKu1DQWJ@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Thu, Jun 04, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> >> + KVM: selftests: Test conversion with elevated page refcount
>> >>     + Askar pointed out that soon vmsplice may not pin pages. Should I
>> >>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>> >>       take a dependency on CONFIG_GUP_TEST.
>> >
>> > I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
>> > it probably is the least awful choice.  E.g. KVM also pins pages in certain flows,
>> > but we're _also_ actively working to remove the need to pin.
>> >
>> > Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
>> > memory" syscall.
>> >
>>
>> Hmm that takes a dependency on io_uring, which isn't always compiled
>> in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather depend on
>> CONFIG_GUP_TEST.
>
> Or try both?  If it's not a ridiculous amount of work.

CONFIG_GUP_TEST was tried in [1]

[1] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/

It looks like this:

  static void pin_pages(void *vaddr, uint64_t size)
  {
  	const struct pin_longterm_test args = {
  		.addr = (uint64_t)vaddr,
  		.size = size,
  		.flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
  	};

  	gup_test_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
  	TEST_REQUIRE(gup_test_fd > 0);

  	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
  }

  static void unpin_pages(void)
  {
  	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
  }

So in the test I'll call pin_pages(), then try to convert, see that it
fails with EAGAIN and reports the expected error_offset, then I call
unpin_pages(), then I convert again and expect success.

Are you uncomfortable with the CONFIG_GUP_TEST interface? What would you
like me to try with CONFIG_IO_URING? I'm thinking that the main
difference between the two is just down to which non-default CONFIG
option we want to take for guest_memfd tests.
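
For reference, the sequence described above (pin, expect EAGAIN with the right error_offset, unpin, expect success) can be modeled with a small self-contained sketch. Everything here is a hypothetical stand-in: struct fake_range, fake_convert() and check_conversion_flow() are not part of the selftests or of guest_memfd, and the "pin" is just a flag rather than a real page reference.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a guest_memfd range; not a kernel structure. */
struct fake_range {
	uint64_t size;
	bool pinned;
	uint64_t pinned_offset;
};

/*
 * Hypothetical conversion helper: fails with -EAGAIN and reports the
 * offending offset while a page in the range holds an extra reference.
 */
static int fake_convert(const struct fake_range *r, uint64_t *error_offset)
{
	if (r->pinned) {
		*error_offset = r->pinned_offset;
		return -EAGAIN;
	}
	return 0;
}

/* Returns 0 when the pin -> fail -> unpin -> succeed sequence holds. */
static int check_conversion_flow(struct fake_range *r, uint64_t pin_offset)
{
	uint64_t error_offset = UINT64_MAX;

	r->pinned = true;		/* stands in for pin_pages() */
	r->pinned_offset = pin_offset;
	if (fake_convert(r, &error_offset) != -EAGAIN)
		return -1;
	if (error_offset != pin_offset)
		return -1;

	r->pinned = false;		/* stands in for unpin_pages() */
	return fake_convert(r, &error_offset);
}
```

The shape of the test is the same whichever mechanism ends up holding the extra reference; only the two pin/unpin steps would change.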

^ permalink raw reply

* [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
From: Shanker Donthineni @ 2026-06-10 16:48 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.

The erratum can occur only when all of the following apply:

  - A PE executes a Device-nGnR* store followed by a younger
    Device-nGnR* load.
  - The store is not a store-release.
  - The accesses target the same peripheral and do not overlap in bytes.
  - There is at most one intervening Device-nGnR* store in program
    order, and there are no intervening Device-nGnR* loads.
  - There is no DSB, and no DMB that orders loads, between the store and
    the load.
  - Specific micro-architectural and timing conditions occur.

Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release), which removes the "store is not a
store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.

Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
the plain str* sequence.

Note: stlr* only supports base-register addressing, so affected CPUs use
a base-register stlr* path. Unaffected CPUs keep the original
offset-addressed str* sequence introduced by commit d044d6ba6f02
("arm64: io: permit offset addressing").

The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
helpers are left unchanged. These helpers are intended for
write-combining mappings, which are Normal-NC on arm64. Replacing their
contiguous str* groups would defeat the write-combining behavior used to
improve store performance.

Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes since v2:
  - Reworked the raw MMIO write helpers so unaffected CPUs keep the
    existing offset-addressed STR sequence, while affected CPUs use the
    base-register STLR path.
  - Updated the commit message to match the code changes.
  - Rebased on top of the arm64 for-next/errata branch:
    https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata

Changes since v1:
  - Updated the commit message based on feedback from Vladimir Murzin.

 Documentation/arch/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                          | 23 ++++++++++++++++
 arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
 arch/arm64/kernel/cpu_errata.c              |  8 ++++++
 arch/arm64/tools/cpucaps                    |  1 +
 5 files changed, 64 insertions(+)

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index ad09bbb10da80..fc45125dc2f80 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -298,6 +298,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
 +----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c65cef81be86a..d633eb70de1ac 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
 
 	  If unsure, say Y.
 
+config NVIDIA_OLYMPUS_1027_ERRATUM
+	bool "NVIDIA Olympus: device store/load ordering erratum"
+	default y
+	help
+	  This option adds an alternative code sequence to work around an
+	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
+	  observed by a peripheral after a younger Device-nGnR* load to the
+	  same peripheral. This breaks the program order that drivers rely
+	  on for MMIO and can leave a device in an incorrect state.
+
+	  The workaround promotes the raw MMIO store helpers
+	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+	  required ordering. Because writel() and writel_relaxed() are built
+	  on __raw_writel(), both are covered without changes to the higher
+	  layers.
+
+	  The fix is applied through the alternatives framework, so enabling
+	  this option does not by itself activate the workaround: it is
+	  patched in only when an affected CPU is detected, and is a no-op on
+	  unaffected CPUs.
+
+	  If unsure, say Y.
+
 config ARM64_ERRATUM_834220
 	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
 	depends on KVM
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50b..801223e754c90 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,10 +22,22 @@
 /*
  * Generic IO read/write.  These perform native-endian accesses.
  */
+static __always_inline bool arm64_needs_device_store_release(void)
+{
+	return alternative_has_cap_unlikely(
+				ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
+}
+
 #define __raw_writeb __raw_writeb
 static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
 	volatile u8 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -33,6 +45,12 @@ static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
 	volatile u16 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlrh %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -40,6 +58,12 @@ static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 {
 	volatile u32 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlr %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -47,6 +71,12 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
 	volatile u64 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlr %x0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index d597896b0f7f3..b096d9acca578 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -838,6 +838,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
 	},
 #endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+	{
+		/* NVIDIA Olympus core */
+		.desc = "NVIDIA Olympus device load/store ordering erratum",
+		.capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+	},
+#endif
 #ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 	{
 		/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d6..d367257bf7703 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
 WORKAROUND_CAVIUM_TX2_219_TVM
 WORKAROUND_CLEAN_CACHE
 WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
 WORKAROUND_NVIDIA_CARMEL_CNP
 WORKAROUND_PMUV3_IMPDEF_TRAPS
 WORKAROUND_QCOM_FALKOR_E1003
-- 
2.43.0
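
As a rough user-space model of the store-release promotion described in the commit message, the sketch below selects between a relaxed and a release store with C11 atomics. It is an illustration under stated assumptions, not the kernel code: needs_store_release is a hypothetical stand-in for the ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability check, model_raw_writel() is a made-up name, and the target is ordinary memory rather than a Device-nGnR* mapping. On AArch64, GCC and Clang generally lower a release store to STLR and a relaxed store to STR, which is the same instruction choice the patch makes by hand.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the ARM64_WORKAROUND_DEVICE_STORE_RELEASE check. */
static bool needs_store_release;

/*
 * Model of a raw 32-bit write helper: a release store on affected parts,
 * a relaxed store otherwise. This targets ordinary memory, not MMIO.
 */
static inline void model_raw_writel(uint32_t val, _Atomic uint32_t *addr)
{
	if (needs_store_release)
		atomic_store_explicit(addr, val, memory_order_release);
	else
		atomic_store_explicit(addr, val, memory_order_relaxed);
}
```

Unlike this run-time branch, the real helper relies on the alternatives framework, so the selection is patched in at boot instead of being evaluated on every write.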


^ permalink raw reply related

* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
From: Jason Gunthorpe @ 2026-06-10 16:11 UTC (permalink / raw)
  To: Shanker Donthineni
  Cc: Will Deacon, Catalin Marinas, linux-arm-kernel, Vladimir Murzin,
	Mark Rutland, linux-kernel, linux-doc, Vikram Sethi,
	Jason Sequeira
In-Reply-To: <223c49ee-528c-4750-9885-fd8e0247151e@nvidia.com>

On Wed, Jun 10, 2026 at 08:20:28AM -0500, Shanker Donthineni wrote:

> Based on the existing code comments and after reviewing this path again,
> __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
> appear to be intended for WC regions. Since the erratum is scoped to
> Device-nGnR* accesses, and WC mappings are Normal-NC on arm64, I don’t
> think the STLR workaround should apply to these helpers by default.

Hmm, unfortunately I think the APIs mix together IO and WC both as
__iomem things. However I recall when I was looking at this everyone
was using it for WC.

Jason

^ permalink raw reply

* Re: [PATCH v3 4/5] KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM on PowerNV
From: Amit Machhiwal @ 2026-06-10 15:53 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
	Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
	linux-kernel, linux-doc, lkp
In-Reply-To: <87jysgz292.fsf@vajain21.in.ibm.com>

On 2026/06/03 09:47 AM, Vaibhav Jain wrote:
> Hi Amit,
> 
> Thanks for the patch. My review comments inline:
> 
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> 
> > Currently, when booting a compatibility-mode KVM guest (L1) on a PowerNV
> > hypervisor (L0), the guest runs with the expected processor
> > compatibility level. However, when booting a nested KVM guest (L2)
> > inside the L1, QEMU derives the CPU model from the raw host PVR and
> > attempts to run the nested guest at that level, instead of honoring the
> > compatibility mode of the L1.
> >
> > Extend host CPU compatibility capability reporting to support nested
> > virtualization on PowerNV systems (PAPR nested API v1).
> >
> > For nested API v2 (PowerVM), compatibility capabilities are obtained
> > from the hypervisor via the H_GUEST_GET_CAPABILITIES hcall. This
> > information is not available on PowerNV systems.
> >
> > For nested API v1, derive the compatibility capabilities from the L1
> > guest by reading the "cpu-version" property from the device tree, which
> > reflects the effective (logical) processor compatibility level. Map this
> > value to the corresponding compatibility capability bitmap.
> >
> > Introduce a helper to translate CPU version values into compatibility
> > capability bits and integrate it into kvmppc_get_compat_cpu_caps().
> >
> > This allows userspace to query host CPU compatibility modes on both
> > PowerVM and PowerNV platforms via the KVM_PPC_GET_COMPAT_CAPS ioctl.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Tested-by: Anushree Mathur <anushree.mathur@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> >  arch/powerpc/kvm/book3s_hv.c | 37 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 38de7040e2b7..18774c49af85 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -6522,15 +6522,50 @@ static bool kvmppc_hash_v3_possible(void)
> >  	return true;
> >  }
> >  
> > +static int kvmppc_map_compat_capabilities(const __be32 cpu_version,
> > +				      unsigned long *capabilities)
> > +{
> > +	switch (cpu_version) {
> > +	case PVR_ARCH_31_P11:
> > +		*capabilities |= H_GUEST_CAP_POWER11;
> > +		break;
> > +	case PVR_ARCH_31:
> > +		*capabilities |= H_GUEST_CAP_POWER10;
> > +		break;
> > +	case PVR_ARCH_300:
> > +		*capabilities |= H_GUEST_CAP_POWER9;
> > +		break;
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	return 0;
> > +}
> >  
> >  static int kvmppc_get_compat_cpu_caps(struct kvm_ppc_compat_caps *host_caps)
> >  {
> > +	struct device_node *np;
> >  	unsigned long capabilities = 0;
> > +	const __be32 *prop = NULL;
> >  	long rc = -EINVAL;
> > +	u32 cpu_version;
> >  
> >  	if (kvmhv_on_pseries()) {
> > -		if (kvmhv_is_nestedv2())
> > +		if (kvmhv_is_nestedv2()) {
> >  			rc = plpar_guest_get_capabilities(0,
> >  	&capabilities);
> Need to mask capabilities as mentioned in the review comments for
> previous patch. I would suggest creating a helper that performs the
> hcall and applies the mask which can then be used at
> plpar_guest_get_capabilities() call sites.

Sure, will do.

Thanks,
Amit

> 
> > +		} else {
> > +			for_each_node_by_type(np, "cpu") {
> > +				prop = of_get_property(np, "cpu-version", NULL);
> > +				if (prop) {
> > +					cpu_version = be32_to_cpup(prop);
> > +					break;
> > +				}
> > +			}
> > +			if (!prop)
> > +				return -EINVAL;
> > +			rc = kvmppc_map_compat_capabilities(cpu_version,
> > +								&capabilities);
> > +		}
> >  		host_caps->compat_capabilities = capabilities;
> >  	}
> >  
> > -- 
> > 2.50.1 (Apple Git-155)
> >
> 
> -- 
> Cheers
> ~ Vaibhav

^ permalink raw reply

* Re: [PATCH v3 3/5] KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM on PowerVM
From: Amit Machhiwal @ 2026-06-10 15:51 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
	Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
	linux-kernel, linux-doc, lkp
In-Reply-To: <87mrxcz300.fsf@vajain21.in.ibm.com>

Hi Vaibhav,

Thanks for taking a look at this patch. My response is inline.

On 2026/06/03 09:31 AM, Vaibhav Jain wrote:
> Hi Amit,
> 
> Thanks for the patch. My review comments inline below:
> 
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> 
> > On POWER systems, the host CPU may run in a compatibility mode (e.g., a
> > Power11 processor operating in Power10 compatibility mode). In such
> > cases, the effective CPU level exposed to guests differs from the
> > physical processor generation.
> >
> > When running nested KVM guests, QEMU derives the host CPU type using
> > mfpvr(), which reflects the physical processor version. This can result
> > in a mismatch between the CPU model selected by QEMU and the
> > compatibility mode enforced by the host, leading to guest boot failures.
> >
> > For example, booting a nested guest on a Power11 LPAR configured in
> > Power10 compatibility mode fails with:
> >
> >   KVM-NESTEDv2: couldn't set guest wide elements
> >   [..KVM reg dump..]
> >
> > This occurs because QEMU selects a CPU model corresponding to the
> > physical processor (via mfpvr()), while the host operates in a lower
> > compatibility mode. As a result, KVM rejects the requested compatibility
> > level during guest initialization.
> >
> > Add support for retrieving host CPU compatibility capabilities for
> > nested guests on PowerVM (PAPR nested API v2). The hypervisor provides
> > the effective compatibility levels via the H_GUEST_GET_CAPABILITIES
> > hcall, which reflects the processor modes negotiated between the Power
> > hypervisor (L0) and the host partition (L1).
> >
> > On pseries systems, obtain the capability bitmap using
> > plpar_guest_get_capabilities() and return it via struct
> > kvm_ppc_compat_caps. This information is then exposed to userspace
> > through the KVM_PPC_GET_COMPAT_CAPS ioctl.
> >
> > Hook the implementation into the Book3S HV kvmppc_ops so that it can be
> > invoked by the generic KVM ioctl handling code.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Tested-by: Anushree Mathur <anushree.mathur@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> >  arch/powerpc/kvm/book3s_hv.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 249d1f2e4e2c..38de7040e2b7 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -6522,6 +6522,21 @@ static bool kvmppc_hash_v3_possible(void)
> >  	return true;
> >  }
> >  
> > +
> > +static int kvmppc_get_compat_cpu_caps(struct kvm_ppc_compat_caps *host_caps)
> > +{
> > +	unsigned long capabilities = 0;
> > +	long rc = -EINVAL;
> > +
> > +	if (kvmhv_on_pseries()) {
> > +		if (kvmhv_is_nestedv2())
> > +			rc = plpar_guest_get_capabilities(0,
> > &capabilities);
> 
> Since this value will trickle back to userspace, please apply a mask on
> the hcall return value so that any reserved and non-PVR-related bits
> don't leak back to userspace.

Though currently we only supply the bits corresponding to supported
processor versions, it makes sense to mask out unrelated bits so that
they aren't unnecessarily passed on to userspace. I'll make the
changes in v4.

Thanks,
Amit
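
A minimal sketch of the masking helper, under stated assumptions: the CAP_* bit positions below are illustrative placeholders rather than the real H_GUEST_CAP_* definitions, fake_hcall() is a hypothetical stand-in for plpar_guest_get_capabilities(), and get_masked_compat_caps() is a made-up name. Only the "call, then mask before anything reaches userspace" shape is the point.

```c
#include <stdint.h>

/* Illustrative bit positions only; not the real H_GUEST_CAP_* values. */
#define CAP_POWER9	(1ULL << 62)
#define CAP_POWER10	(1ULL << 61)
#define CAP_POWER11	(1ULL << 60)
#define CAP_COMPAT_MASK	(CAP_POWER9 | CAP_POWER10 | CAP_POWER11)

/* Signature of the (hypothetical) capability query; returns 0 on success. */
typedef long (*get_caps_fn)(uint64_t flags, uint64_t *caps);

/* Stand-in for the hcall: reports two processor bits plus an unrelated bit. */
static long fake_hcall(uint64_t flags, uint64_t *caps)
{
	(void)flags;
	*caps = CAP_POWER10 | CAP_POWER11 | 0x1ULL;
	return 0;
}

/* Query the capabilities and keep only the processor-compat bits. */
static long get_masked_compat_caps(get_caps_fn hcall, uint64_t *caps)
{
	uint64_t raw = 0;
	long rc = hcall(0, &raw);

	if (rc)
		return rc;
	*caps = raw & CAP_COMPAT_MASK;
	return 0;
}
```

Keeping the mask inside one helper means every call site gets the same filtering, which also covers the nested API v1 path added in the next patch.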

> 
> > +		host_caps->compat_capabilities = capabilities;
> > +	}
> > +
> > +	return rc;
> > +}
> > +
> >  static struct kvmppc_ops kvm_ops_hv = {
> >  	.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
> >  	.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
> > @@ -6564,6 +6579,7 @@ static struct kvmppc_ops kvm_ops_hv = {
> >  	.hash_v3_possible = kvmppc_hash_v3_possible,
> >  	.create_vcpu_debugfs = kvmppc_arch_create_vcpu_debugfs_hv,
> >  	.create_vm_debugfs = kvmppc_arch_create_vm_debugfs_hv,
> > +	.get_compat_cpu_ver = kvmppc_get_compat_cpu_caps,
> >  };
> >  
> >  static int kvm_init_subcore_bitmap(void)
> > -- 
> > 2.50.1 (Apple Git-155)
> >
> >
> 
> -- 
> Cheers
> ~ Vaibhav

^ permalink raw reply
