Linux Documentation
* Re: [PATCH net-next v4 6/8] selftests: drv-net: refactor devmem command builders into lib module
From: Jakub Kicinski @ 2026-05-14  2:38 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi, Yanteng Si,
	Dongliang Mu, Michael Chan, Pavan Chebbi, Joshua Washington,
	Harshitha Ramamurthy, Saeed Mahameed, Tariq Toukan, Mark Bloch,
	Leon Romanovsky, Alexander Duyck, kernel-team, Daniel Borkmann,
	Nikolay Aleksandrov, Shuah Khan, dw, sdf.kernel, mohsin.bashr,
	willemb, jiang.kun2, xu.xin16, wang.yaxin, netdev, linux-doc,
	linux-kernel, linux-rdma, bpf, linux-kselftest,
	Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <agUzR3O35Rx4RHnu@devvm29614.prn0.facebook.com>

On Wed, 13 May 2026 19:28:23 -0700 Bobby Eshleman wrote:
> > Also I think you missed adding the new file to Makefiles ?
> > It needs to be under TEST_FILES for building tarballs  
> 
> Ah okay, I wasn't sure if the already existing `TEST_INCLUDES :=
> $(wildcard lib/py/*.py ../lib/py/*.py)` was sufficient or not. Will use
> TEST_FILES with the devmem_lib approach above next rev.

Ah, you're right! Forgot we have the wildcard there.

We should probably update
https://github.com/linux-netdev/nipa/blob/main/tests/patch/check_selftest/test.py


* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Wei Yang @ 2026-05-14  3:10 UTC (permalink / raw)
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260512074202.10253-1-lance.yang@linux.dev>

On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
>
>On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
>>generalize the order of the __collapse_huge_page_* and collapse_max_*
>>functions to support future mTHP collapse.
>>
>>The current mechanism for determining collapse with the
>>khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>>raises a key design issue: if we support user defined max_pte_none values
>>(even those scaled by order), a collapse of a lower order can introduce
>>a feedback loop, or "creep", when max_ptes_none is set to a value greater
>>than HPAGE_PMD_NR / 2. [1]
>>
>>With this configuration, a successful collapse to order N will populate
>>enough pages to satisfy the collapse condition on order N+1 on the next
>>scan. This leads to unnecessary work and memory churn.
>>
>>To fix this issue introduce a helper function that will limit mTHP
>>collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>>This effectively supports two modes: [2]
>>
>>- max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>>  that maps the shared zeropage. Consequently, no memory bloat.
>>- max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>>  available mTHP order.
>>
>>This removes the possibility of "creep", while not modifying any uAPI
>>expectations. A warning will be emitted if any non-supported
>>max_ptes_none value is configured with mTHP enabled.
>>
>>mTHP collapse will not honor the khugepaged_max_ptes_shared or
>>khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>>shared or swapped entry.
>>
>>No functional changes in this patch; however it defines future behavior
>>for mTHP collapse.
>>
>>[1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
>>[2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>>
>>Co-developed-by: Dev Jain <dev.jain@arm.com>
>>Signed-off-by: Dev Jain <dev.jain@arm.com>
>>Signed-off-by: Nico Pache <npache@redhat.com>
>>---
>> include/trace/events/huge_memory.h |   3 +-
>> mm/khugepaged.c                    | 117 ++++++++++++++++++++---------
>> 2 files changed, 85 insertions(+), 35 deletions(-)
>>
>>diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>index bcdc57eea270..443e0bd13fdb 100644
>>--- a/include/trace/events/huge_memory.h
>>+++ b/include/trace/events/huge_memory.h
>>@@ -39,7 +39,8 @@
>> 	EM( SCAN_STORE_FAILED,		"store_failed")			\
>> 	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
>> 	EM( SCAN_PAGE_FILLED,		"page_filled")			\
>>-	EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
>>+	EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")	\
>>+	EMe(SCAN_INVALID_PTES_NONE,	"invalid_ptes_none")
>> 
>> #undef EM
>> #undef EMe
>>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>index f68853b3caa7..27465161fa6d 100644
>>--- a/mm/khugepaged.c
>>+++ b/mm/khugepaged.c
>>@@ -61,6 +61,7 @@ enum scan_result {
>> 	SCAN_COPY_MC,
>> 	SCAN_PAGE_FILLED,
>> 	SCAN_PAGE_DIRTY_OR_WRITEBACK,
>>+	SCAN_INVALID_PTES_NONE,
>> };
>> 
>> #define CREATE_TRACE_POINTS
>>@@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
>>  * PTEs for the given collapse operation.
>>  * @cc: The collapse control struct
>>  * @vma: The vma to check for userfaultfd
>>+ * @order: The folio order being collapsed to
>>  *
>>  * Return: Maximum number of none-page or zero-page PTEs allowed for the
>>  * collapse operation.
>>  */
>>-static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>>-		struct vm_area_struct *vma)
>>+static int collapse_max_ptes_none(struct collapse_control *cc,
>>+		struct vm_area_struct *vma, unsigned int order)
>> {
>>+	unsigned int max_ptes_none = khugepaged_max_ptes_none;
>> 	// If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>
>One thing I still want to call out: kernel code usually uses C-style
>comments :)
>
>> 	if (vma && userfaultfd_armed(vma))
>> 		return 0;
>> 	// for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
>> 	if (!cc->is_khugepaged)
>> 		return HPAGE_PMD_NR;
>>-	// For all other cases repect the user defined maximum.
>>-	return khugepaged_max_ptes_none;
>>+	// for PMD collapse, respect the user defined maximum.
>>+	if (is_pmd_order(order))
>>+		return max_ptes_none;
>>+	/* Zero/non-present collapse disabled. */
>>+	if (!max_ptes_none)
>>+		return 0;
>>+	// for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
>>+	// scale the maximum number of PTEs to the order of the collapse.
>>+	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>>+		return (1 << order) - 1;
>>+
>>+	// We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
>>+	// Emit a warning and return -EINVAL.
>>+	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>>+		      KHUGEPAGED_MAX_PTES_LIMIT);
>
>Maybe fall back to 0 instead, as David suggested earlier?
>

It looks reasonable to fall back to 0.

But as the updated documentation in patch 14 says:

  For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
  value will emit a warning and no mTHP collapse will be attempted.

This is why it behaves this way now.

    mthp_collapse()
        max_ptes_none = collapse_max_ptes_none();
        if (max_ptes_none < 0)
            return collapsed;

>max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
>intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
>disable it :(
>

So it depends on what we want to do here :-)

For me, I would vote for falling back to 0.
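
To make the two options concrete, here is a minimal userspace sketch of
the khugepaged branch of collapse_max_ptes_none() as quoted above. It is
not the kernel code: the userfaultfd and MADV_COLLAPSE early returns are
left out, model_max_ptes_none() and its "strict" flag are hypothetical
names, and the constants assume 4K pages with a PMD order of 9 (so that
KHUGEPAGED_MAX_PTES_LIMIT is HPAGE_PMD_NR - 1 = 511).

```c
#include <errno.h>

/* Assumed values for a 4K-page configuration; illustrative only. */
#define HPAGE_PMD_ORDER			9
#define HPAGE_PMD_NR			(1u << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MAX_PTES_LIMIT	(HPAGE_PMD_NR - 1)

/*
 * Userspace model of the limit computed for a khugepaged collapse.
 * sysctl_val stands in for khugepaged_max_ptes_none.
 *
 * strict != 0: unsupported values yield -EINVAL, as in the posted patch.
 * strict == 0: unsupported values are treated as 0, as proposed in review.
 */
static int model_max_ptes_none(unsigned int sysctl_val, unsigned int order,
			       int strict)
{
	/* PMD collapse: respect the user-defined maximum. */
	if (order == HPAGE_PMD_ORDER)
		return (int)sysctl_val;
	/* mTHP collapse with none/zero PTEs disallowed. */
	if (sysctl_val == 0)
		return 0;
	/* mTHP collapse at the limit: scale to the order being collapsed. */
	if (sysctl_val == KHUGEPAGED_MAX_PTES_LIMIT)
		return (int)((1u << order) - 1);
	/* Any intermediate value is unsupported for mTHP. */
	return strict ? -EINVAL : 0;
}
```

With sysctl_val = 256 and order = 4, the strict variant returns -EINVAL
(so the caller skips mTHP collapse entirely), while the fallback variant
returns 0 and mTHP collapse still runs for fully populated ranges.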

>Treating those values as 0 feels like the least surprising behavior,
>IMHO. It also gives mTHP a cleaner starting point, rather than carrying over
>all the old PMD knob semantics :)
>
>Otherwise, LGTM!
>Reviewed-by: Lance Yang <lance.yang@linux.dev>
>
>>+	return -EINVAL;

-- 
Wei Yang
Help you, Help me


* Re: [PATCH v10 8/9] platform/chrome: Protect cros_ec_device lifecycle with revocable
From: Tzung-Bi Shih @ 2026-05-14  3:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Arnd Bergmann, Greg Kroah-Hartman, Bartosz Golaszewski,
	Linus Walleij, Benson Leung, linux-kernel, chrome-platform,
	driver-core, linux-doc, linux-gpio, Rafael J. Wysocki,
	Danilo Krummrich, Jonathan Corbet, Shuah Khan, Laurent Pinchart,
	Wolfram Sang, Johan Hovold, Paul E . McKenney
In-Reply-To: <20260508115309.GA9254@nvidia.com>

On Fri, May 08, 2026 at 08:53:09AM -0300, Jason Gunthorpe wrote:
> On Fri, May 08, 2026 at 06:54:47PM +0800, Tzung-Bi Shih wrote:
> >  struct cros_ec_device *cros_ec_device_alloc(struct device *dev)
> > @@ -47,6 +49,15 @@ struct cros_ec_device *cros_ec_device_alloc(struct device *dev)
> >  	if (!ec_dev)
> >  		return NULL;
> >  
> > +	ec_dev->its_rev = revocable_alloc(ec_dev);
> > +	if (!ec_dev->its_rev)
> > +		return NULL;
> > +	/*
> > +	 * Drop the extra reference for the caller as the caller is the
> > +	 * resource provider.
> > +	 */
> > +	revocable_put(ec_dev->its_rev);
> > +
> >  	ec_dev->din_size = sizeof(struct ec_host_response) +
> >  			   sizeof(struct ec_response_get_protocol_info) +
> >  			   EC_MAX_RESPONSE_OVERHEAD;
> 
> FWIW I am still very much against seeing any revokable concept used
> *between two drivers*. That will turn the kernel's lifetime model into
> spaghetti code.
> 
> Your other series where you only have to change
> drivers/platform/chrome/cros_ec_chardev.c just confirms how wrong this
> approach is.
> 
> Given you say this is such a bug I think you really should be sending
> a series that is patches 5 through 7 from the other series and a
> simple rwsem instead of misc_deregister_sync() to deal with this bug
> ASAP. No need to complicate a simple bug fix in a driver with all
> these core changes.

Apologies for missing this suggestion.

For "patches 5 through 7 from the other series" I guess you're referring to:
- https://lore.kernel.org/all/20260427134659.95181-6-tzungbi@kernel.org
- https://lore.kernel.org/all/20260427134659.95181-7-tzungbi@kernel.org
- https://lore.kernel.org/all/20260427134659.95181-8-tzungbi@kernel.org

Could you provide a bit more detail on the rwsem approach?  I'm not
entirely clear on what data or operations the rwsem would be protecting.


* Re: [PATCH v11 4/5] platform/chrome: Protect cros_ec_device lifecycle with revocable
From: Tzung-Bi Shih @ 2026-05-14  3:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Arnd Bergmann, Greg Kroah-Hartman, Bartosz Golaszewski,
	Linus Walleij, Benson Leung, linux-kernel, chrome-platform,
	driver-core, linux-doc, linux-gpio, Rafael J. Wysocki,
	Danilo Krummrich, Jonathan Corbet, Shuah Khan, Laurent Pinchart,
	Wolfram Sang, Johan Hovold, Paul E . McKenney
In-Reply-To: <20260513115102.GF7655@nvidia.com>

On Wed, May 13, 2026 at 08:51:02AM -0300, Jason Gunthorpe wrote:
> On Wed, May 13, 2026 at 05:10:42PM +0800, Tzung-Bi Shih wrote:
> > The cros_ec_device can be unregistered when the underlying device is
> > removed.  Other kernel drivers that interact with the EC may hold a
> > pointer to the cros_ec_device, creating a risk of a use-after-free
> > error if the EC device is removed while still being referenced.
> > 
> > To prevent this, leverage the revocable and convert the underlying
> > device drivers to resource providers of cros_ec_device.
> > 
> > ---
> > v11:
> > - No changes.
> 
> Two people are opposing this and yet no changes? Why haven't you
> followed my advice to fix the bug in this driver in the obvious way?

I understand there's opposition to this approach for this specific driver.
The main goal of the series is to introduce the revocable APIs and show
potential use cases.  I used this patch to illustrate how revocable could
solve this class of problem, not necessarily as the definitive fix in this
instance.

To help me understand, could you elaborate on why the revocable mechanism
isn't suitable here?  I'm wondering because if this piece of code were to
transition to Rust in the future, would the concerns you have also apply
to using Revocable[1] in the Rust context for this driver?

[1] https://rust.docs.kernel.org/kernel/revocable/struct.Revocable.html


* Re: [PATCH v2 0/5] KVM: PPC: Handle CPU compatibility mode for nested guests
From: Ritesh Harjani @ 2026-05-14  3:19 UTC (permalink / raw)
  To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Vaibhav Jain, Paolo Bonzini, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP), Jonathan Corbet,
	Shuah Khan, kvm, linux-kernel, linux-doc
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>


Hi Amit,

Amit Machhiwal <amachhiw@linux.ibm.com> writes:

> On POWER systems, newer processor generations can operate in compatibility
> modes corresponding to earlier generations (e.g., a Power11 system running
> in Power10 compatibility mode). In such cases, the effective CPU level
> exposed to guests differs from the physical processor generation.
>
> This creates a problem for nested virtualization. When booting a nested KVM
> guest (L2) inside a host KVM guest (L1) running in a compatibility mode,
> userspace (e.g., QEMU) may derive the CPU model from the raw hardware PVR
> and attempt to configure the nested guest accordingly. However, the L1
> partition is constrained by the compatibility level negotiated with the
> hypervisor (L0), and requests exceeding that level are rejected, leading to
> guest boot failures such as:
>
>   KVM-NESTEDv2: couldn't set guest wide elements
>
> This series addresses the issue in two steps:
>
> 1. Detect and reject invalid compatibility requests early in KVM to avoid
>    late failures.
>
> 2. Provide a mechanism for userspace to query the effective CPU
>    compatibility modes supported by the host, so it can select an
>    appropriate CPU model for nested guests.
>

Do we really need to add a uapi change for this? Tools like QEMU can
read the device tree info of the host, can't they?

> To achieve this, the series introduces a new KVM capability and ioctl
> (KVM_CAP_PPC_COMPAT_CAPS / KVM_PPC_GET_COMPAT_CAPS) that expose the
> compatibility modes supported by the host.
>
> The implementation supports both:
>
>   - PowerVM (nested API v2), where compatibility information is obtained
>     via the H_GUEST_GET_CAPABILITIES hypercall.
>   - PowerNV (nested API v1), where compatibility is derived from the device
>     tree ("cpu-version") representing the effective processor compatibility
>     level.

See, there you go: for PowerNV, if this info is provided in the device
tree, then QEMU could just as well read that info, no?

... yup, kvmppc_read_int_dt() can do that I guess.
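
As a rough illustration of what "just read that info" could look like, a
minimal sketch that reads one 32-bit big-endian cell from a device-tree
property file. This is not QEMU's kvmppc_read_int_dt(); read_dt_be32() is
a hypothetical helper, and the caller has to supply the full path to the
property (the thread only names the property, "cpu-version").

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Read a single 32-bit big-endian cell from a device-tree property file.
 * Returns 0 on success and stores the host-endian value in *out;
 * returns -1 if the file cannot be opened or holds fewer than 4 bytes.
 */
static int read_dt_be32(const char *path, uint32_t *out)
{
	unsigned char buf[4];
	size_t n;
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf), f);
	fclose(f);
	if (n != sizeof(buf))
		return -1;

	/* Device-tree cells are stored big-endian regardless of the host. */
	*out = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	       ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	return 0;
}
```

Whether the value found there is sufficient for the PowerVM (nested API
v2) case is a separate question that this sketch does not address.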

So, my request is: can we look into whether there is a possible
alternative to this? Maybe we already have a mechanism that QEMU could
use to get this info?

btw - I haven't given the patch series a full read, but from reading the
cover letter, I felt we should at least explain in the cover letter why
a uapi change is really needed here and why the existing alternatives
can't work for us.

-ritesh

>
> This allows userspace (e.g., QEMU) to select a CPU model consistent with
> the host compatibility mode, avoiding mismatches and enabling successful
> nested guest boot.
>


* Re: [PATCH] riscv: Docs: fix unmatched quote warning
From: Paul Walmsley @ 2026-05-14  4:16 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, Deepak Gupta, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Ghiti, linux-riscv, Jonathan Corbet,
	Shuah Khan, linux-doc
In-Reply-To: <4827939a-2e8e-4ac7-981c-deb3b7296a66@infradead.org>

Hi Randy,

On Wed, 13 May 2026, Randy Dunlap wrote:

> This docs build warning is now in mainline.
> Should I ask Jon to merge the patch, given no activity on it?

Sorry about the delay; I'll pick it up as a fix.


thanks,

- Paul


* Re: [PATCH v3 2/3] Documentation: security-bugs: explain what is and is not a security bug
From: Willy Tarreau @ 2026-05-14  4:32 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Greg KH, Leon Romanovsky, skhan, security, workflows, linux-doc,
	linux-kernel
In-Reply-To: <87fr3v6my2.fsf@trenco.lwn.net>

Hi Jon,

On Wed, May 13, 2026 at 03:04:21PM -0600, Jonathan Corbet wrote:
> Willy Tarreau <w@1wt.eu> writes:
> 
> > On Wed, May 13, 2026 at 06:52:00AM -0600, Jonathan Corbet wrote:
> 
> >> I definitely wouldn't argue for making it longer, and enumerating all of
> >> the make-me-root capabilities would be silly.  I would consider just
> >> replacing CAP_SYS_ADMIN with "elevated capabilities" or some such.  That
> >> might rule out legitimate reports where some capability provides an
> >> access it shouldn't, but I suspect you could live with that :)
> >
> > I think it could indeed work like this, without denaturating the rest
> > of the paragraph and having broader coverage. Do you think you could
> > amend/update it ? I'm not trying to add you any burden, it's just that
> > it will take me more time before I provide an update :-/
> 
> How's the following?

Looks good, thank you! In case this is needed:

  Acked-by: Willy Tarreau <w@1wt.eu>

> (While I was there, I noticed that threat-model.rst has no SPDX line;
> what's your preference there?)

I didn't notice any was needed, I tried to get inspiration from other
files for the format (I'm still not familiar with the rst format
though this time I could successfully install the tools). Same for
the label at the top BTW, I just did what I found somewhere else,
probably security-bugs.rst which is similar (no SPDX line and has a
label). So regarding SPDX, I do not have any preference. If one is
needed, let's pick what's used by default, I do not care, as long
as it allows the doc to be published.

Thanks,
Willy

> Thanks,
> 
> jon
> 
> >From 1e15a25142583e312dcc504b0279d47508cbfdab Mon Sep 17 00:00:00 2001
> From: Jonathan Corbet <corbet@lwn.net>
> Date: Wed, 13 May 2026 14:58:53 -0600
> Subject: [PATCH 2/2] docs: threat-model: don't limit root capabilities to
>  CAP_SYS_ADMIN
> 
> The threat-model document says that only users with CAP_SYS_ADMIN can carry
> out a number of admin-level tasks, but there are numerous capabilities that
> can confer that sort of power.  Generalize the text slightly to make it
> clear that CAP_SYS_ADMIN is not the only all-powerful capability.
> 
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
>  Documentation/process/threat-model.rst | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/process/threat-model.rst b/Documentation/process/threat-model.rst
> index 91da52f7114fd..f177b8d3c1caf 100644
> --- a/Documentation/process/threat-model.rst
> +++ b/Documentation/process/threat-model.rst
> @@ -62,7 +62,8 @@ on common processors featuring privilege levels and memory management units:
>  
>  * **Capability-based protection**:
>  
> -  * users not having the ``CAP_SYS_ADMIN`` capability may not alter the
> +  * users not having elevated capabilities (including but not limited to
> +    CAP_SYS_ADMIN) may not alter the
>      kernel's configuration, memory nor state, change other users' view of the
>      file system layout, grant any user capabilities they do not have, nor
>      affect the system's availability (shutdown, reboot, panic, hang, or making
> -- 
> 2.53.0


* Re: [PATCH v3 3/3] Documentation: security-bugs: clarify requirements for AI-assisted reports
From: Willy Tarreau @ 2026-05-14  4:34 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Greg KH, Leon Romanovsky, skhan, security, workflows, linux-doc,
	linux-kernel
In-Reply-To: <87ik8r6n1r.fsf@trenco.lwn.net>

On Wed, May 13, 2026 at 03:02:08PM -0600, Jonathan Corbet wrote:
> Jonathan Corbet <corbet@lwn.net> writes:
> 
> > Willy Tarreau <w@1wt.eu> writes:
> >
> >> On Wed, May 13, 2026 at 12:30:10PM +0200, Greg KH wrote:
> >>> > One nit:
> >>> > 
> >>> > > +  * **Impact Evaluation**: Many AI-generated reports lack an understanding of
> >>> > > +    the kernel's threat model and go to great lengths inventing theoretical
> >>> > > +    consequences.
> >>> > 
> >>> > If only we had a shiny new document describing that threat model that we
> >>> > could reference here... :)
> >>> 
> >>> Ah yes, a link to that would make things better, but don't we have that
> >>> elsewhere in this series?
> >>
> >> It's in the same patch, I think Jon was sarcastic here. I thought I had
> >> addressed that one but apparently I was wrong :-/
> >
> > I'm just saying that this particular text should link to that document,
> > don't make readers go searching for it.  I can certainly add a patch
> > doing that if you like.
> 
> I was thinking something like this.
> jon

Indeed, looks good like this as it won't hide the file name from the
link. In case you'd want it:

  Acked-by: Willy Tarreau <w@1wt.eu>

Thank you! 
Willy

> >From 3f02a3c190bab6b54e2a250ead0c7408af1a3c51 Mon Sep 17 00:00:00 2001
> From: Jonathan Corbet <corbet@lwn.net>
> Date: Wed, 13 May 2026 14:51:29 -0600
> Subject: [PATCH 1/2] docs: security-bugs: add a link to the threat-model
>  documentation
> 
> Rather than make readers search for this document, just add a link to it
> where it is referenced.
> 
> (While I was at it, I removed the unused and unneeded _threatmodel label
> from the top of threat-model.rst).
> 
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
>  Documentation/process/security-bugs.rst | 13 +++++++------
>  Documentation/process/threat-model.rst  |  2 --
>  2 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/process/security-bugs.rst b/Documentation/process/security-bugs.rst
> index f85c65f31f12f..3c51ddde31dd9 100644
> --- a/Documentation/process/security-bugs.rst
> +++ b/Documentation/process/security-bugs.rst
> @@ -191,12 +191,13 @@ handle:
>      Please **always convert your report to plain text** without any formatting
>      decorations before sending it.
>  
> -  * **Impact Evaluation**: Many AI-generated reports lack an understanding of
> -    the kernel's threat model and go to great lengths inventing theoretical
> -    consequences. This adds noise and complicates triage. Please stick to
> -    verifiable facts (e.g., "this bug permits any user to gain CAP_NET_ADMIN")
> -    without enumerating speculative implications. Have your tool read this
> -    documentation as part of the evaluation process.
> +  * **Impact Evaluation**: Many AI-generated reports lack an understanding
> +    of the kernel's threat model (see Documentation/process/threat-model.rst)
> +    and go to great lengths inventing theoretical consequences. This adds
> +    noise and complicates triage. Please stick to verifiable facts (e.g.,
> +    "this bug permits any user to gain CAP_NET_ADMIN") without enumerating
> +    speculative implications. Have your tool read this documentation as
> +    part of the evaluation process.
>  
>    * **Reproducer**: AI-based tools are often capable of generating reproducers.
>      Please always ensure your tool provides one and **test it thoroughly**. If
> diff --git a/Documentation/process/threat-model.rst b/Documentation/process/threat-model.rst
> index ecb432390e792..91da52f7114fd 100644
> --- a/Documentation/process/threat-model.rst
> +++ b/Documentation/process/threat-model.rst
> @@ -1,5 +1,3 @@
> -.. _threatmodel:
> -
>  The Linux Kernel threat model
>  =============================
>  
> -- 
> 2.53.0
> 


* Re: [PATCH 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Hao Ge @ 2026-05-14  4:37 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Abhishek Bapat, Kent Overstreet, Andrew Morton
In-Reply-To: <d064a1a8de127c0e321f675a9966e533a0917e7e.1777936301.git.abhishekbapat@google.com>

Hi Suren and Abhishek,

Thanks for the patch! A couple of minor comments below.

On 2026/5/5 07:36, Abhishek Bapat wrote:
> From: Suren Baghdasaryan <surenb@google.com>
>
> Add the following ioctl commands for /proc/allocinfo file:
>
> ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
> to check whether the file content has changed specifically due to module
> load/unload. Every time a module is loaded / unloaded, the returned
> value will be different. By comparing the identifier value at the
> beginning and at the end of the content retrieval operation, users can
> validate retrieved information for consistency.
>
> ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
> is the position of a record in /proc/allocinfo.
>
> ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
> one. If no records were previously retrieved, returns the first
> record.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> ---
>   .../userspace-api/ioctl/ioctl-number.rst      |   2 +
>   include/linux/codetag.h                       |   1 +
>   include/uapi/linux/alloc_tag.h                |  54 ++++++
>   lib/alloc_tag.c                               | 178 +++++++++++++++++-
>   lib/codetag.c                                 |  11 ++
>   5 files changed, 244 insertions(+), 2 deletions(-)
>   create mode 100644 include/uapi/linux/alloc_tag.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 331223761fff..84f6808a8578 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -349,6 +349,8 @@ Code  Seq#    Include File                                             Comments
>                                                                          <mailto:luzmaximilian@gmail.com>
>   0xA5  20-2F  linux/surface_aggregator/dtx.h                            Microsoft Surface DTX driver
>                                                                          <mailto:luzmaximilian@gmail.com>
> +0xA6  00-0F  uapi/linux/alloc_tag.h                                    Memory allocation profiling
> +                                                                       <mailto:surenb@google.com>
>   0xAA  00-3F  linux/uapi/linux/userfaultfd.h
>   0xAB  00-1F  linux/nbd.h
>   0xAC  00-1F  linux/raw.h
> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> index 8ea2a5f7c98a..2bcd4e7c809e 100644
> --- a/include/linux/codetag.h
> +++ b/include/linux/codetag.h
> @@ -76,6 +76,7 @@ struct codetag_iterator {
>   
>   void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
>   bool codetag_trylock_module_list(struct codetag_type *cttype);
> +unsigned long codetag_get_content_id(struct codetag_type *cttype);
>   struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
>   struct codetag *codetag_next_ct(struct codetag_iterator *iter);
>   
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> new file mode 100644
> index 000000000000..e9a5b55fcc7a
> --- /dev/null
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -0,0 +1,54 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + *  include/linux/alloc_tag.h
> + */
> +
> +#ifndef _UAPI_ALLOC_TAG_H
> +#define _UAPI_ALLOC_TAG_H
> +
> +#include <linux/types.h>
> +
> +#define ALLOCINFO_STR_SIZE	64
> +
> +struct allocinfo_content_id {
> +	__u64 id;
> +};
> +
> +struct allocinfo_tag {
> +	/* Longer names are trimmed */
> +	char modname[ALLOCINFO_STR_SIZE];
> +	char function[ALLOCINFO_STR_SIZE];
> +	char filename[ALLOCINFO_STR_SIZE];
> +	__u64 lineno;
> +};
> +
> +struct allocinfo_counter {
> +	__u64 bytes;
> +	__u64 calls;
> +	__u8 accurate;
> +	__u8 pad[7]; /* Add alignment to not break the 32-bit compatible interface */
> +};
> +
> +struct allocinfo_tag_data {
> +	struct allocinfo_tag tag;
> +	struct allocinfo_counter counter;
> +};
> +
> +struct allocinfo_get_at {
> +	__u64 pos;	/* input */
> +	struct allocinfo_tag_data data;
> +};
> +
> +#define _ALLOCINFO_IOC_CONTENT_ID	0
> +#define _ALLOCINFO_IOC_GET_AT		1
> +#define _ALLOCINFO_IOC_GET_NEXT		2
> +
> +#define ALLOCINFO_IOC_BASE		0xA6
> +#define ALLOCINFO_IOC_CONTENT_ID	_IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID,	\
> +					     struct allocinfo_content_id)
> +#define ALLOCINFO_IOC_GET_AT		_IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT,	\
> +					      struct allocinfo_get_at)
> +#define ALLOCINFO_IOC_GET_NEXT		_IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT,	\
> +					     struct allocinfo_tag_data)
> +
> +#endif /* _UAPI_ALLOC_TAG_H */
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index ed1bdcf1f8ab..5c24d2f954d4 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -14,6 +14,7 @@
>   #include <linux/string_choices.h>
>   #include <linux/vmalloc.h>
>   #include <linux/kmemleak.h>
> +#include <uapi/linux/alloc_tag.h>
>   
>   #define ALLOCINFO_FILE_NAME		"allocinfo"
>   #define MODULE_ALLOC_TAG_VMAP_SIZE	(100000UL * sizeof(struct alloc_tag))
> @@ -46,6 +47,9 @@ int alloc_tag_ref_offs;
>   struct allocinfo_private {
>   	struct codetag_iterator iter;
>   	bool print_header;
> +	/* ioctl uses a separate iterator not to interfere with reads */
> +	struct codetag_iterator ioctl_iter;
> +	bool positioned; /* seq_open_private() sets to 0 */
>   };
>   
>   static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> @@ -125,6 +129,177 @@ static const struct seq_operations allocinfo_seq_op = {
>   	.show	= allocinfo_show,
>   };
>   
> +static int allocinfo_open(struct inode *inode, struct file *file)
> +{
> +	return seq_open_private(file, &allocinfo_seq_op,
> +				sizeof(struct allocinfo_private));
> +}
> +
> +static int allocinfo_release(struct inode *inode, struct file *file)
> +{
> +	return seq_release_private(inode, file);
> +}
> +
> +static const char *allocinfo_str(const char *str)
> +{
> +	size_t len = strlen(str);
> +
> +	/* Keep an extra space for the trailing NULL. */
> +	if (len >= ALLOCINFO_STR_SIZE)
> +		str += (len - ALLOCINFO_STR_SIZE) + 1;
> +	return str;
> +}
> +
> +/* Copy a string and trim from the beginning if it's too long */
> +static void allocinfo_copy_str(char *dest, const char *src)
> +{
> +	strscpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> +}
> +
> +static void allocinfo_to_params(struct codetag *ct,
> +				struct allocinfo_tag_data *data)
> +{
> +	struct alloc_tag *tag = ct_to_alloc_tag(ct);
> +	struct alloc_tag_counters counter = alloc_tag_read(tag);
> +
> +	if (ct->modname)
> +		allocinfo_copy_str(data->tag.modname, ct->modname);
> +	else
> +		data->tag.modname[0] = '\0';

Minor nit about allocinfo_to_params():

When modname is NULL (built-in kernel code), the current code sets it
to an empty string:

     if (ct->modname)
         allocinfo_copy_str(data->tag.modname, ct->modname);
     else
         data->tag.modname[0] = '\0';

This is of course workable in userspace by checking for an empty
string, but I was wondering if it would be cleaner to use "vmlinux"
as a default:

     else
         allocinfo_copy_str(data->tag.modname, "vmlinux");

For some context, in our memory analysis workflow we often group
allocations by module to get a quick overview of where memory goes,
for example:

     vmlinux:    2.1 GB    (kernel core)
     nvidia:     1.2 GB    (GPU driver)
     iwlwifi:    800 MB    (WiFi driver)
     ext4:       500 MB    (filesystem)

Having a consistent identifier for kernel built-in allocations would
avoid each userspace tool needing to handle the empty string as a
special case. Totally fine if this is intentional though.
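
For reference, a minimal userspace sketch of the two behaviours discussed
here, assuming ALLOCINFO_STR_SIZE is 64 as in the quoted header.
model_copy_str() mirrors the tail-preserving trim of the quoted
allocinfo_str()/allocinfo_copy_str() pair (using memcpy() in place of the
kernel's strscpy()), and module_label() is a hypothetical helper showing
the special case each tool needs today when modname is empty.

```c
#include <stddef.h>
#include <string.h>

#define ALLOCINFO_STR_SIZE	64

/*
 * Copy src into dest (ALLOCINFO_STR_SIZE bytes), dropping characters from
 * the beginning when src is too long so that the tail of the name survives.
 */
static void model_copy_str(char *dest, const char *src)
{
	size_t len = strlen(src);

	/* Keep room for the trailing NUL. */
	if (len >= ALLOCINFO_STR_SIZE)
		src += (len - ALLOCINFO_STR_SIZE) + 1;
	/* At most ALLOCINFO_STR_SIZE - 1 characters remain, plus the NUL. */
	memcpy(dest, src, strlen(src) + 1);
}

/*
 * Hypothetical userspace helper: pick the label used when grouping
 * allocations by module. An empty modname means built-in kernel code.
 */
static const char *module_label(const char *modname)
{
	return modname[0] ? modname : "vmlinux";
}
```

If the kernel reported "vmlinux" itself, module_label() would reduce to
returning modname unchanged.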

> +	allocinfo_copy_str(data->tag.function, ct->function);
> +	allocinfo_copy_str(data->tag.filename, ct->filename);
> +	data->tag.lineno = ct->lineno;
> +	data->counter.bytes = counter.bytes;
> +	data->counter.calls = counter.calls;
> +	data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> +}
> +
> +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> +{
> +	struct allocinfo_content_id params;
> +
> +	codetag_lock_module_list(alloc_tag_cttype, true);
> +	params.id = codetag_get_content_id(alloc_tag_cttype);
> +	codetag_lock_module_list(alloc_tag_cttype, false);
> +	if (copy_to_user(arg, &params, sizeof(params)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> +{
> +	struct allocinfo_private *priv;
> +	struct codetag *ct;
> +	__u64 pos;
> +	struct allocinfo_get_at params = {0};
> +
> +	if (copy_from_user(&params, arg, sizeof(params)))
> +		return -EFAULT;
> +
> +	priv = (struct allocinfo_private *)m->private;
> +	pos = params.pos;
> +
> +	codetag_lock_module_list(alloc_tag_cttype, true);
> +
> +	/* Find the codetag */
> +	priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> +	ct = codetag_next_ct(&priv->ioctl_iter);
> +	while (ct && pos--)
> +		ct = codetag_next_ct(&priv->ioctl_iter);

I noticed that priv->ioctl_iter (advanced through codetag_next_ct())
and priv->positioned are accessed without serialization in the ioctl
path. Concurrent ioctl calls on the same fd could race on these
fields. Just something I spotted while reading the code.

Thanks

Best Regards
Hao

> +	if (ct) {
> +		allocinfo_to_params(ct, &params.data);
> +		priv->positioned = true;
> +	}
> +
> +	codetag_lock_module_list(alloc_tag_cttype, false);
> +
> +	if (!ct)
> +		return -ENOENT;
> +
> +	if (copy_to_user(arg, &params, sizeof(params)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> +{
> +	struct allocinfo_private *priv;
> +	struct codetag *ct;
> +	struct allocinfo_tag_data params = {0};
> +	int ret = 0;
> +
> +	priv = (struct allocinfo_private *)m->private;
> +
> +	codetag_lock_module_list(alloc_tag_cttype, true);
> +
> +	if (!priv->positioned) {
> +		priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> +		priv->positioned = true;
> +	}
> +
> +	ct = codetag_next_ct(&priv->ioctl_iter);
> +	if (ct)
> +		allocinfo_to_params(ct, &params);
> +
> +	if (!ct) {
> +		priv->positioned = false;
> +		ret = -ENOENT;
> +	}
> +	codetag_lock_module_list(alloc_tag_cttype, false);
> +
> +	if (ret == 0) {
> +		if (copy_to_user(arg, &params, sizeof(params)))
> +			return -EFAULT;
> +	}
> +	return ret;
> +}
> +
> +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
> +			    unsigned long __arg)
> +{
> +	void __user *arg = (void __user *)__arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case ALLOCINFO_IOC_CONTENT_ID:
> +		ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
> +		break;
> +	case ALLOCINFO_IOC_GET_AT:
> +		ret = allocinfo_ioctl_get_at(file->private_data, arg);
> +		break;
> +	case ALLOCINFO_IOC_GET_NEXT:
> +		ret = allocinfo_ioctl_get_next(file->private_data, arg);
> +		break;
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
> +				   unsigned long arg)
> +{
> +	return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct proc_ops allocinfo_proc_ops = {
> +	.proc_open		= allocinfo_open,
> +	.proc_read_iter		= seq_read_iter,
> +	.proc_lseek		= seq_lseek,
> +	.proc_release		= allocinfo_release,
> +	.proc_ioctl		= allocinfo_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.proc_compat_ioctl	= allocinfo_compat_ioctl,
> +#endif
> +
> +};
> +
>   size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
>   {
>   	struct codetag_iterator iter;
> @@ -946,8 +1121,7 @@ static int __init alloc_tag_init(void)
>   		return 0;
>   	}
>   
> -	if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> -				     sizeof(struct allocinfo_private), NULL)) {
> +	if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
>   		pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
>   		shutdown_mem_profiling(false);
>   		return -ENOMEM;
> diff --git a/lib/codetag.c b/lib/codetag.c
> index 304667897ad4..93aa30991563 100644
> --- a/lib/codetag.c
> +++ b/lib/codetag.c
> @@ -48,6 +48,17 @@ bool codetag_trylock_module_list(struct codetag_type *cttype)
>   	return down_read_trylock(&cttype->mod_lock) != 0;
>   }
>   
> +unsigned long codetag_get_content_id(struct codetag_type *cttype)
> +{
> +	lockdep_assert_held(&cttype->mod_lock);
> +
> +	/*
> +	 * next_mod_seq is updated on every load, so can be used to identify
> +	 * content changes.
> +	 */
> +	return cttype->next_mod_seq;
> +}
> +
>   struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
>   {
>   	struct codetag_iterator iter = {

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: ti: icssg: Derive stats array lengths from ARRAY_SIZE
From: MD Danish Anwar @ 2026-05-14  4:56 UTC (permalink / raw)
  To: Jacob Keller, David CARLIER
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, Roger Quadros,
	Andrew Lunn, Meghana Malladi, Kevin Hao, Vadim Fedorenko, netdev,
	linux-doc, linux-kernel, linux-arm-kernel, Vignesh Raghavendra
In-Reply-To: <9fbf6682-b521-4b7e-b5b6-af91694ed051@intel.com>

Hi Jacob,

On 14/05/26 1:30 am, Jacob Keller wrote:
> On 5/12/2026 2:40 AM, MD Danish Anwar wrote:
>> Hi David,
>>
>> On 12/05/26 1:28 pm, David CARLIER wrote:
>>> Hi MD,
>>>
>>> On Tue, 12 May 2026 at 07:06, MD Danish Anwar <danishanwar@ti.com> wrote:
>>>>
>>>> Replace the manually maintained ICSSG_NUM_MIIG_STATS and
>>>> ICSSG_NUM_PA_STATS constants with ARRAY_SIZE() expressions derived
>>>> directly from the corresponding stat descriptor arrays, so that adding
>>>> new entries to icssg_all_miig_stats[] or icssg_all_pa_stats[] no longer
>>>> requires a separate update to a numeric constant.
>>>>
>>>> To make this self-contained, break the circular include dependency
>>>> between icssg_stats.h and icssg_prueth.h:
>>>>
>>>>   - icssg_stats.h previously included icssg_prueth.h (transitively
>>>>     pulling in icssg_switch_map.h and ETH_GSTRING_LEN).  Replace that
>>>>     with direct includes of <linux/ethtool.h>, <linux/kernel.h> and
>>>>     "icssg_switch_map.h".
>>>>
>>>>   - icssg_prueth.h now includes icssg_stats.h, giving it access to
>>>>     the ARRAY_SIZE-based ICSSG_NUM_MIIG_STATS and ICSSG_NUM_PA_STATS
>>>>     before they are used in the prueth_emac struct and ICSSG_NUM_STATS.
>>>>
>>>> Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
>>>> ---
>>>>  drivers/net/ethernet/ti/icssg/icssg_prueth.h | 3 +--
>>>>  drivers/net/ethernet/ti/icssg/icssg_stats.h  | 7 ++++++-
>>>>  2 files changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>>>> index df93d15c5b78..e2ccecb0a0dd 100644
>>>> --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>>>> +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>>>> @@ -43,6 +43,7 @@
>>>>
>>>>  #include "icssg_config.h"
>>>>  #include "icss_iep.h"
>>>> +#include "icssg_stats.h"
>>>>  #include "icssg_switch_map.h"
>>>>
>>>>  #define PRUETH_MAX_MTU          (2000 - ETH_HLEN - ETH_FCS_LEN)
>>>> @@ -57,8 +58,6 @@
>>>>
>>>>  #define ICSSG_MAX_RFLOWS       8       /* per slice */
>>>>
>>>> -#define ICSSG_NUM_PA_STATS     32
>>>> -#define ICSSG_NUM_MIIG_STATS   60
>>>>  /* Number of ICSSG related stats */
>>>>  #define ICSSG_NUM_STATS (ICSSG_NUM_MIIG_STATS + ICSSG_NUM_PA_STATS)
>>>>  #define ICSSG_NUM_STANDARD_STATS 31
>>>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_stats.h b/drivers/net/ethernet/ti/icssg/icssg_stats.h
>>>> index 5ec0b38e0c67..b854eb587c1e 100644
>>>> --- a/drivers/net/ethernet/ti/icssg/icssg_stats.h
>>>> +++ b/drivers/net/ethernet/ti/icssg/icssg_stats.h
>>>> @@ -8,10 +8,15 @@
>>>>  #ifndef __NET_TI_ICSSG_STATS_H
>>>>  #define __NET_TI_ICSSG_STATS_H
>>>>
>>>> -#include "icssg_prueth.h"
>>>> +#include <linux/ethtool.h>
>>>> +#include <linux/kernel.h>
>>>> +#include "icssg_switch_map.h"
>>>>
>>>>  #define STATS_TIME_LIMIT_1G_MS    25000    /* 25 seconds @ 1G */
>>>>
>>>> +#define ICSSG_NUM_MIIG_STATS   ARRAY_SIZE(icssg_all_miig_stats)
>>>> +#define ICSSG_NUM_PA_STATS     ARRAY_SIZE(icssg_all_pa_stats)
>>>> +
>>>>  struct miig_stats_regs {
>>>>         /* Rx */
>>>>         u32 rx_packets;
>>>> --
>>>> 2.34.1
>>>>
>>>
>>> One thing that caught my eye: icssg_all_miig_stats[] and
>>>   icssg_all_pa_stats[] are 'static const' arrays in icssg_stats.h with
>>>   ETH_GSTRING_LEN name buffers per entry. Right now only icssg_stats.c
>>>   and icssg_ethtool.c pull them in. After this patch icssg_prueth.h
>>>   includes icssg_stats.h, so every .c in the driver (classifier,
>>>   common, config, mii_cfg, queues, switchdev, ...) ends up with its own
>>>   static-const copy of both tables.
>>>
>>>   Would a static_assert() work for what you're after? Something like:
>>>
>>
>> While I was adding more stats manually, the ARRAY_SIZE() approach was
>> explicitly requested by the maintainer [1].
>>
>> This patch is a direct response to that feedback. static_assert() would
>> still require updating the numeric constant on every array change. The
>> goal here is to eliminate the need to manually increment the stats count
>> whenever new stats are added.
>>
>> Your concern about multiple copies of table is noted and valid. Could
>> you advise on the preferred way to reconcile these two requirements? I
>> am happy to restructure if there is an approach that satisfies both.
>>
> The way we solved this in the Intel drivers is to use a single array
> which contains both the stat name as well as the offset from the
> structure where the stat resides.
> 
> The stat string code just iterates over the stat list for the strings,
> while the stat value code iterates the array and computes the stat
> address from the offset and size and base structure pointer. Each object
> that has stats has its own stat array structure.
> 
> This is probably overkill, but the advantage is that the strings and
> their values are stored together and adding a new stat is as simple as
> adding a new entry to that list.
> 
> I.e.
> 
> struct ice_stats {
>         char stat_string[ETH_GSTRING_LEN];
>         int sizeof_stat;
>         int stat_offset;
> };
> 
> #define ICE_STAT(_type, _name, _stat) { \
>         .stat_string = _name, \
>         .sizeof_stat = sizeof_field(_type, _stat), \
>         .stat_offset = offsetof(_type, _stat) \
> }
> 
> #define ICE_VSI_STAT(_name, _stat) \
>                 ICE_STAT(struct ice_vsi, _name, _stat)
> #define ICE_PF_STAT(_name, _stat) \
>                 ICE_STAT(struct ice_pf, _name, _stat)
> 
> 
> Then the stats for the individual arrays are defined like this:
> 
> static const struct ice_stats ice_gstrings_vsi_stats[] = {
>         ICE_VSI_STAT(ICE_RX_UNICAST, eth_stats.rx_unicast),
>         ICE_VSI_STAT(ICE_TX_UNICAST, eth_stats.tx_unicast),
>         ICE_VSI_STAT(ICE_RX_MULTICAST, eth_stats.rx_multicast),
>         ICE_VSI_STAT(ICE_TX_MULTICAST, eth_stats.tx_multicast),
>         ICE_VSI_STAT(ICE_RX_BROADCAST, eth_stats.rx_broadcast),
>         ICE_VSI_STAT(ICE_TX_BROADCAST, eth_stats.tx_broadcast),
> 	...
> };
> 
> (Note, ICE_RX_UNICAST is a macro that defines the string value.. I don't
> recall who changed this to macros or why vs just having the strings be
> directly in the definition...)
> 

Thanks for sharing the ice driver pattern — that's a clean design.
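
For reference, the descriptor-table idea can be reduced to a small
standalone sketch. Everything named demo_* below is hypothetical and
only illustrates the technique (name, size and offset kept in one
entry, count derived with ARRAY_SIZE()); it is not the ice or ICSSG
code, and STAT_NAME_LEN merely stands in for ETH_GSTRING_LEN.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STAT_NAME_LEN 32	/* stand-in for ETH_GSTRING_LEN */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stats block; real driver structures differ. */
struct demo_stats {
	uint64_t rx_packets;
	uint32_t rx_errors;
};

/* One entry carries the string and where its value lives. */
struct stat_desc {
	char name[STAT_NAME_LEN];
	size_t size;
	size_t offset;
};

#define DEMO_STAT(_name, _field) { \
	.name = _name, \
	.size = sizeof(((struct demo_stats *)0)->_field), \
	.offset = offsetof(struct demo_stats, _field), \
}

static const struct stat_desc demo_stat_descs[] = {
	DEMO_STAT("rx_packets", rx_packets),
	DEMO_STAT("rx_errors", rx_errors),
};

/* The count follows the table; no separate constant to bump. */
#define DEMO_NUM_STATS ARRAY_SIZE(demo_stat_descs)

/* Read one stat as a 64-bit value using the recorded offset and size. */
static uint64_t demo_read_stat(const struct demo_stats *s,
			       const struct stat_desc *d)
{
	const char *p = (const char *)s + d->offset;
	uint64_t v64;
	uint32_t v32;

	if (d->size == sizeof(v64)) {
		memcpy(&v64, p, sizeof(v64));
		return v64;
	}
	memcpy(&v32, p, sizeof(v32));
	return v32;
}
```

In a real driver the table would presumably live in a single .c file
rather than a header, which would also avoid the per-translation-unit
copies raised earlier in the thread.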

> This is probably a lot bigger refactor to make work, and may not be
> exactly suitable for your driver. I've considered "upgrading" these data

Yes, I need to see if refactoring is applicable to ICSSG or not. I will
look into this and send a separate patch / series in future if
applicable. For this series I will stick with what David Carlier suggested.

> structures and logic as helpers to the core ethtool code (or perhaps
> now, to libeth) but never got around to it.


-- 
Thanks and Regards,
Danish


^ permalink raw reply

* htmldocs: Documentation/virt/kvm/api.rst:6589: WARNING: Literal block expected; none found. [docutils]
From: kernel test robot @ 2026-05-14  5:11 UTC (permalink / raw)
  To: Amit Machhiwal; +Cc: oe-kbuild-all, 0day robot, linux-doc

tree:   https://github.com/intel-lab-lkp/linux/commits/Amit-Machhiwal/KVM-PPC-Book3S-HV-Validate-arch_compat-against-host-compatibility-mode/20260514-003250
head:   14b4e064019c3de50b10ce42416ad214e65ab27d
commit: 14b4e064019c3de50b10ce42416ad214e65ab27d KVM: PPC: Document KVM_PPC_GET_COMPAT_CAPS ioctl
date:   12 hours ago
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260514/202605140717.W1StD3Ke-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605140717.W1StD3Ke-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Documentation/userspace-api/landlock:550: ./include/uapi/linux/landlock.h:45: ERROR: Unknown target name: "network flags". [docutils]
   Documentation/userspace-api/landlock:550: ./include/uapi/linux/landlock.h:50: ERROR: Unknown target name: "scope flags". [docutils]
   Documentation/userspace-api/landlock:550: ./include/uapi/linux/landlock.h:24: ERROR: Unknown target name: "filesystem flags". [docutils]
   Documentation/userspace-api/landlock:559: ./include/uapi/linux/landlock.h:168: ERROR: Unknown target name: "filesystem flags". [docutils]
   Documentation/userspace-api/landlock:559: ./include/uapi/linux/landlock.h:191: ERROR: Unknown target name: "network flags". [docutils]
>> Documentation/virt/kvm/api.rst:6589: WARNING: Literal block expected; none found. [docutils]
   Documentation/networking/skbuff:36: ./include/linux/skbuff.h:181: WARNING: Failed to create a cross reference. A title or caption not found: 'crc' [ref.ref]


vim +6589 Documentation/virt/kvm/api.rst

  6588	
> 6589	H_GUEST_CAP_POWER9  (bit 1): KVM guests can run in Power9 processor mode
  6590	H_GUEST_CAP_POWER10 (bit 2): KVM guests can run in Power10 processor mode
  6591	H_GUEST_CAP_POWER11 (bit 3): KVM guests can run in Power11 processor mode
  6592	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH 2/6] alloc_tag: add ioctl filters to /proc/allocinfo
From: Hao Ge @ 2026-05-14  6:15 UTC (permalink / raw)
  To: Abhishek Bapat
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Suren Baghdasaryan, Andrew Morton, Kent Overstreet
In-Reply-To: <2d1cbd93b987198d9569ff54b7fee4ae6aad5ff6.1777936301.git.abhishekbapat@google.com>

Hi Abhishek

On 2026/5/5 07:36, Abhishek Bapat wrote:
> Extend the capability of the IOCTL mechanism to filter allocations based
> on tag's module name, function name, file name and line number.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> ---
>   include/uapi/linux/alloc_tag.h | 26 +++++++++++++++-
>   lib/alloc_tag.c                | 55 ++++++++++++++++++++++++++++++++--
>   2 files changed, 77 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> index e9a5b55fcc7a..0cc9db5298c6 100644
> --- a/include/uapi/linux/alloc_tag.h
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -34,8 +34,32 @@ struct allocinfo_tag_data {
>   	struct allocinfo_counter counter;
>   };
>   
> +enum {
> +	ALLOCINFO_FILTER_MODNAME,
> +	ALLOCINFO_FILTER_FUNCTION,
> +	ALLOCINFO_FILTER_FILENAME,
> +	ALLOCINFO_FILTER_LINENO,
> +	__ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_LINENO
> +};
> +
> +#define ALLOCINFO_FILTER_MASK_MODNAME		(1 << ALLOCINFO_FILTER_MODNAME)
> +#define ALLOCINFO_FILTER_MASK_FUNCTION		(1 << ALLOCINFO_FILTER_FUNCTION)
> +#define ALLOCINFO_FILTER_MASK_FILENAME		(1 << ALLOCINFO_FILTER_FILENAME)
> +#define ALLOCINFO_FILTER_MASK_LINENO		(1 << ALLOCINFO_FILTER_LINENO)
> +
> +#define ALLOCINFO_FILTER_MASKS \
> +	((1 << (__ALLOCINFO_FILTER_LAST + 1)) - 1)
> +
> +struct allocinfo_filter {
> +	__u64 mask; /* bitmask of the filter fields used */
> +	struct allocinfo_tag fields;
> +};
> +
>   struct allocinfo_get_at {
> -	__u64 pos;	/* input */
> +	/* inputs */
> +	__u64 pos;
> +	struct allocinfo_filter filter;
> +	/* output */
>   	struct allocinfo_tag_data data;
>   };
>   
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 5c24d2f954d4..7ff936e15e97 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -47,6 +47,7 @@ int alloc_tag_ref_offs;
>   struct allocinfo_private {
>   	struct codetag_iterator iter;
>   	bool print_header;
> +	struct allocinfo_filter filter;
>   	/* ioctl uses a separate iterator not to interfere with reads */
>   	struct codetag_iterator ioctl_iter;
>   	bool positioned; /* seq_open_private() sets to 0 */
> @@ -156,6 +157,11 @@ static void allocinfo_copy_str(char *dest, const char *src)
>   	strscpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
>   }
>   
> +static int allocinfo_cmp_str(const char *str, const char *template)
> +{
> +	return strncmp(allocinfo_str(str), template, ALLOCINFO_STR_SIZE);
> +}
> +
>   static void allocinfo_to_params(struct codetag *ct,
>   				struct allocinfo_tag_data *data)
>   {
> @@ -187,26 +193,67 @@ static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
>   	return 0;
>   }
>   
> +static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
> +{
> +	if (!ct || !filter || !filter->mask)
> +		return true;
> +

Minor: in matches_filter(), returning true when ct is NULL seems
semantically odd since both callers already check for ct != NULL
before calling this function. Not a real issue though.

> +	if ((filter->mask & ALLOCINFO_FILTER_MASK_MODNAME) &&
> +	    ct->modname && (allocinfo_cmp_str(ct->modname, filter->fields.modname)))
> +		return false;
> +

In matches_filter(), when ct->modname is NULL (built-in kernel code),
the modname filter is skipped due to

     ct->modname && (allocinfo_cmp_str(...))

This means built-in allocations always pass the modname filter. Since
built-in code doesn't belong to any module, maybe it should not match
when a modname filter is set:

     if (filter->mask & ALLOCINFO_FILTER_MASK_MODNAME) {
         if (!ct->modname)
             return false;
         if (allocinfo_cmp_str(ct->modname, filter->fields.modname))
             return false;
     }
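
The difference between the two behaviours can be modelled in userspace.
This is only a simplified sketch: modname_matches() and its strict flag
are made up for illustration, and plain strcmp() replaces
allocinfo_cmp_str(), so the tail-trimming of long names is ignored.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Simplified stand-in for the modname part of matches_filter().
 * strict == false models the patch as posted, strict == true models
 * the behaviour suggested above.
 */
static bool modname_matches(const char *ct_modname, const char *wanted,
			    bool strict)
{
	if (!ct_modname)	/* built-in code has no module name */
		return !strict;
	return strcmp(ct_modname, wanted) == 0;
}
```

With strict set, a filter on "ext4" returns only ext4 tags; without it,
every built-in tag is returned as well.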

Thanks

Best Regards
Hao

> +	if ((filter->mask & ALLOCINFO_FILTER_MASK_FUNCTION) &&
> +	    ct->function && (allocinfo_cmp_str(ct->function, filter->fields.function)))
> +		return false;
> +
> +	if ((filter->mask & ALLOCINFO_FILTER_MASK_FILENAME) &&
> +	    ct->filename && (allocinfo_cmp_str(ct->filename, filter->fields.filename)))
> +		return false;
> +
> +	if ((filter->mask & ALLOCINFO_FILTER_MASK_LINENO) &&
> +	    ct->lineno != filter->fields.lineno)
> +		return false;
> +
> +	return true;
> +}
> +
>   static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
>   {
>   	struct allocinfo_private *priv;
>   	struct codetag *ct;
> -	__u64 pos;
>   	struct allocinfo_get_at params = {0};
> +	__u64 skip_count;
>   
>   	if (copy_from_user(&params, arg, sizeof(params)))
>   		return -EFAULT;
>   
> +	if (params.filter.mask & ~ALLOCINFO_FILTER_MASKS)
> +		return -EINVAL;
> +
>   	priv = (struct allocinfo_private *)m->private;
> -	pos = params.pos;
> +
> +	skip_count = params.pos;
>   
>   	codetag_lock_module_list(alloc_tag_cttype, true);
>   
> +	if (params.filter.mask)
> +		priv->filter = params.filter;
> +	else
> +		priv->filter.mask = 0;
> +
>   	/* Find the codetag */
>   	priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
>   	ct = codetag_next_ct(&priv->ioctl_iter);
> -	while (ct && pos--)
> +
> +	while (ct) {
> +		if (matches_filter(ct, &priv->filter)) {
> +			if (skip_count == 0)
> +				break;
> +			skip_count--;
> +		}
>   		ct = codetag_next_ct(&priv->ioctl_iter);
> +	}
> +
>   	if (ct) {
>   		allocinfo_to_params(ct, &params.data);
>   		priv->positioned = true;
> @@ -240,6 +287,8 @@ static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
>   	}
>   
>   	ct = codetag_next_ct(&priv->ioctl_iter);
> +	while (ct && !matches_filter(ct, &priv->filter))
> +		ct = codetag_next_ct(&priv->ioctl_iter);
>   	if (ct)
>   		allocinfo_to_params(ct, &params);
>   

^ permalink raw reply

* Re: [PATCH 3/6] alloc_tag: add size-based filtering to ioctl
From: Hao Ge @ 2026-05-14  6:53 UTC (permalink / raw)
  To: Abhishek Bapat, Suren Baghdasaryan, Andrew Morton,
	Kent Overstreet
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda
In-Reply-To: <06b4fc2457fb4b75eb1ef18320a8722ddb5a850f.1777936301.git.abhishekbapat@google.com>

Hi Abhishek


On 2026/5/5 07:36, Abhishek Bapat wrote:
> Extend the allocinfo filtering mechanism to allow users to filter tags
> based on the total number of bytes allocated [min_size, max_size]. The
> size range is inclusive.
>
> Filtering by size involves retrieving allocinfo per-CPU counters, which
> is an expensive operation. Hence, the performance of size-based
> filtering will be worse than other filters.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> ---
>   include/uapi/linux/alloc_tag.h |  8 +++++++-
>   lib/alloc_tag.c                | 15 +++++++++++++++
>   2 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> index 0cc9db5298c6..229068efd24c 100644
> --- a/include/uapi/linux/alloc_tag.h
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -20,6 +20,8 @@ struct allocinfo_tag {
>   	char function[ALLOCINFO_STR_SIZE];
>   	char filename[ALLOCINFO_STR_SIZE];
>   	__u64 lineno;
> +	__u64 min_size;
> +	__u64 max_size;
>   };

allocinfo_tag is used both as a tag identifier in the output data
(allocinfo_tag_data.tag) and as filter criteria
(allocinfo_filter.fields). min_size and max_size are filter
parameters, not tag identity. Also, allocinfo_to_params() does not
fill these fields, so userspace gets zeros in the output, which is
a bit confusing. Might be cleaner to separate filter parameters
from tag identity.
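
One possible shape for that split, purely as a sketch: the demo_* names
and the value of DEMO_STR_SIZE are assumptions for illustration, and the
actual UAPI layout is of course up to the patch author.

```c
#include <stdint.h>

#define DEMO_STR_SIZE 64	/* assumed; stands in for ALLOCINFO_STR_SIZE */

/* Tag identity only: what is reported back in the output data. */
struct demo_tag {
	char modname[DEMO_STR_SIZE];
	char function[DEMO_STR_SIZE];
	char filename[DEMO_STR_SIZE];
	uint64_t lineno;
};

/* Filter input: identity fields to match plus the size bounds. */
struct demo_filter {
	uint64_t mask;		/* bitmask of the filter fields used */
	struct demo_tag fields;
	uint64_t min_size;	/* inclusive lower bound on bytes */
	uint64_t max_size;	/* inclusive upper bound on bytes */
};
```

With this layout the output tag never carries size bounds, so there is
nothing left unfilled for userspace to misread.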

>   struct allocinfo_counter {
> @@ -39,13 +41,17 @@ enum {
>   	ALLOCINFO_FILTER_FUNCTION,
>   	ALLOCINFO_FILTER_FILENAME,
>   	ALLOCINFO_FILTER_LINENO,
> -	__ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_LINENO
> +	ALLOCINFO_FILTER_MIN_SIZE,
> +	ALLOCINFO_FILTER_MAX_SIZE,
> +	__ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_MAX_SIZE
>   };
>   
>   #define ALLOCINFO_FILTER_MASK_MODNAME		(1 << ALLOCINFO_FILTER_MODNAME)
>   #define ALLOCINFO_FILTER_MASK_FUNCTION		(1 << ALLOCINFO_FILTER_FUNCTION)
>   #define ALLOCINFO_FILTER_MASK_FILENAME		(1 << ALLOCINFO_FILTER_FILENAME)
>   #define ALLOCINFO_FILTER_MASK_LINENO		(1 << ALLOCINFO_FILTER_LINENO)
> +#define ALLOCINFO_FILTER_MASK_MIN_SIZE		(1 << ALLOCINFO_FILTER_MIN_SIZE)
> +#define ALLOCINFO_FILTER_MASK_MAX_SIZE		(1 << ALLOCINFO_FILTER_MAX_SIZE)
>   
>   #define ALLOCINFO_FILTER_MASKS \
>   	((1 << (__ALLOCINFO_FILTER_LAST + 1)) - 1)
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 7ff936e15e97..98a27c302928 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -195,6 +195,9 @@ static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
>   
>   static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
>   {
> +	struct alloc_tag *tag;
> +	struct alloc_tag_counters counters;
> +
>   	if (!ct || !filter || !filter->mask)
>   		return true;
>   
> @@ -214,6 +217,18 @@ static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
>   	    ct->lineno != filter->fields.lineno)
>   		return false;
>   
> +	if ((filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) ||
> +	    (filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE)) {
> +		tag = ct_to_alloc_tag(ct);
> +		counters = alloc_tag_read(tag);

alloc_tag_read() is called twice for matching tags.

When size filtering is enabled, matches_filter() calls alloc_tag_read()
to check the size, and then allocinfo_to_params() calls it again to
fill the output data:

     matches_filter():
         counters = alloc_tag_read(tag);        // 1st read
         if (counters.bytes < min_size)
             return false;

     allocinfo_to_params():
         counter = alloc_tag_read(tag);         // 2nd read (same tag)
         data->counter.bytes = counter.bytes;

For matching tags, the same per-CPU counter aggregation is done twice.
On large machines this is not trivial. Would it make sense to cache
the counters from matches_filter() and reuse them in allocinfo_to_params()?
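
A userspace model of the caching idea, as a rough sketch only. All
demo_* names are hypothetical; demo_read() stands in for the expensive
alloc_tag_read() and simply counts how often it is called.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_counters {
	uint64_t bytes;
	uint64_t calls;
};

static unsigned int demo_reads;	/* number of expensive reads performed */

/* Stand-in for the per-CPU aggregation done by alloc_tag_read(). */
static struct demo_counters demo_read(const struct demo_counters *src)
{
	demo_reads++;
	return *src;
}

/* Filter step: read once and hand the snapshot back through *out. */
static bool demo_size_matches(const struct demo_counters *src,
			      uint64_t min_size, uint64_t max_size,
			      struct demo_counters *out)
{
	*out = demo_read(src);
	return out->bytes >= min_size && out->bytes <= max_size;
}

/* Output step: consume the snapshot instead of reading again. */
static uint64_t demo_report_bytes(const struct demo_counters *snap)
{
	return snap->bytes;
}
```

A side effect of reusing the snapshot is that the reported bytes are
exactly the value the filter decision was based on.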


> +		if ((filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) &&
> +		    counters.bytes < filter->fields.min_size)
> +			return false;
> +		if ((filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) &&
> +		    counters.bytes > filter->fields.max_size)
> +			return false;
> +	}
> +

No validation for min_size > max_size.

If both MIN_SIZE and MAX_SIZE are set but min_size > max_size,
no records will match and the user gets no indication of the
invalid input. This could be checked alongside the existing
mask validation in allocinfo_ioctl_get_at():

     if (params.filter.mask & ~ALLOCINFO_FILTER_MASKS)
         return -EINVAL;

 +   if ((params.filter.mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) &&
 +       (params.filter.mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) &&
 +       params.filter.fields.min_size > params.filter.fields.max_size)
 +       return -EINVAL;

Thanks

Best Regards
Hao

>   	return true;
>   }
>   

^ permalink raw reply

* Re: [PATCH v3 3/3] Documentation: security-bugs: clarify requirements for AI-assisted reports
From: Greg KH @ 2026-05-14  7:23 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Willy Tarreau, Leon Romanovsky, skhan, security, workflows,
	linux-doc, linux-kernel
In-Reply-To: <87ik8r6n1r.fsf@trenco.lwn.net>

On Wed, May 13, 2026 at 03:02:08PM -0600, Jonathan Corbet wrote:
> Jonathan Corbet <corbet@lwn.net> writes:
> 
> > Willy Tarreau <w@1wt.eu> writes:
> >
> >> On Wed, May 13, 2026 at 12:30:10PM +0200, Greg KH wrote:
> >>> > One nit:
> >>> > 
> >>> > > +  * **Impact Evaluation**: Many AI-generated reports lack an understanding of
> >>> > > +    the kernel's threat model and go to great lengths inventing theoretical
> >>> > > +    consequences.
> >>> > 
> >>> > If only we had a shiny new document describing that threat model that we
> >>> > could reference here... :)
> >>> 
> >>> Ah yes, a link to that would make things better, but don't we have that
> >>> elsewhere in this series?
> >>
> >> It's in the same patch, I think Jon was sarcastic here. I thought I had
> >> addressed that one but apparently I was wrong :-/
> >
> > I'm just saying that this particular text should link to that document,
> > don't make readers go searching for it.  I can certainly add a patch
> > doing that if you like.
> 
> I was thinking something like this.
> 
> jon
> 
> From 3f02a3c190bab6b54e2a250ead0c7408af1a3c51 Mon Sep 17 00:00:00 2001
> From: Jonathan Corbet <corbet@lwn.net>
> Date: Wed, 13 May 2026 14:51:29 -0600
> Subject: [PATCH 1/2] docs: security-bugs: add a link to the threat-model
>  documentation
> 
> Rather than make readers search for this document, just add a link to it
> where it is referenced.
> 
> (While I was at it, I removed the unused and unneeded _threatmodel label
> from the top of threat-model.rst).
> 
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
>  Documentation/process/security-bugs.rst | 13 +++++++------
>  Documentation/process/threat-model.rst  |  2 --
>  2 files changed, 7 insertions(+), 8 deletions(-)

Looks good, thanks!

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply

* htmldocs: Documentation/driver-api/media/v4l2-subdev:644: ./include/media/v4l2-subdev.h:1815: WARNING: Inline emphasis start-string without end-string. [docutils]
From: kernel test robot @ 2026-05-14  7:24 UTC (permalink / raw)
  To: Sakari Ailus; +Cc: oe-kbuild-all, 0day robot, linux-doc

tree:   https://github.com/intel-lab-lkp/linux/commits/Sakari-Ailus/media-v4l2-common-Add-mipi_csi2_dt_for_mbus/20260514-071037
head:   34918de37a97a5dc1db4f52558076912af6adcea
commit: a3067ab49d683ccd82af48d0fad7349b7724537d media: v4l2-subdev: Provide a cleanup-friendly get_frame_desc
date:   8 hours ago
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260514/202605140920.76HXOESS-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605140920.76HXOESS-lkp@intel.com/

All warnings (new ones prefixed by >>):

   --------------------------------------------------------------------------------------------^
   Documentation/driver-api/basics:42: ./kernel/time/time.c:370: WARNING: Duplicate C declaration, also defined at driver-api/basics:436.
   Declaration is '.. c:function:: unsigned int jiffies_to_msecs (const unsigned long j)'. [duplicate_declaration.c]
   Documentation/driver-api/basics:42: ./kernel/time/time.c:393: WARNING: Duplicate C declaration, also defined at driver-api/basics:453.
   Declaration is '.. c:function:: unsigned int jiffies_to_usecs (const unsigned long j)'. [duplicate_declaration.c]
>> Documentation/driver-api/media/v4l2-subdev:644: ./include/media/v4l2-subdev.h:1815: WARNING: Inline emphasis start-string without end-string. [docutils]
   Documentation/driver-api/target:25: ./drivers/target/target_core_user.c:35: ERROR: Unexpected section title.

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [PATCH net-next v2 2/2] net: ti: icssg: Add HSR and LRE PA statistics
From: MD Danish Anwar @ 2026-05-14  7:56 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, MD Danish Anwar,
	Roger Quadros, Andrew Lunn, Meghana Malladi, Jacob Keller,
	David Carlier, Vadim Fedorenko, Kevin Hao
  Cc: netdev, linux-doc, linux-kernel, linux-arm-kernel,
	Vladimir Oltean
In-Reply-To: <20260514075605.850674-1-danishanwar@ti.com>

Add new firmware PA statistics counters for HSR and LRE to the ethtool
statistics exposed by the ICSSG driver.

New statistics added:
 - FW_HSR_FWD_CHECK_FAIL_DROP: Packets dropped on the HSR forwarding path
 - FW_HSR_HE_CHECK_FAIL_DROP: Packets dropped on the HSR host egress path
 - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES: Frames with duplicate discard
   skipped
 - FW_LRE_CNT_UNIQUE/DUPLICATE/MULTIPLE_RX: LRE duplicate detection
   counters
 - FW_LRE_CNT_RX/TX: LRE per-port frame counters
 - FW_LRE_CNT_OWN_RX: Own HSR tagged frames received
 - FW_LRE_CNT_ERRWRONGLAN: Frames with wrong LAN identifier (PRP)

Document the new HSR/LRE statistics in icssg_prueth.rst.

Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
---
 .../device_drivers/ethernet/ti/icssg_prueth.rst        | 10 ++++++++++
 drivers/net/ethernet/ti/icssg/icssg_common.c           |  7 +++++--
 drivers/net/ethernet/ti/icssg/icssg_prueth.h           |  2 +-
 drivers/net/ethernet/ti/icssg/icssg_stats.h            | 10 ++++++++++
 drivers/net/ethernet/ti/icssg/icssg_switch_map.h       | 10 ++++++++++
 5 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst b/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst
index da21ddf431bb..b0bda7327b2a 100644
--- a/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst
+++ b/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst
@@ -54,3 +54,13 @@ These statistics are as follows,
  - ``FW_HOST_TX_PKT_CNT``: Number of valid packets copied by RTU0 to Tx queues
  - ``FW_HOST_EGRESS_Q_PRE_OVERFLOW``: Host Egress Q (Pre-emptible) Overflow Counter
  - ``FW_HOST_EGRESS_Q_EXP_OVERFLOW``: Host Egress Q (Pre-emptible) Overflow Counter
+ - ``FW_HSR_FWD_CHECK_FAIL_DROP``: Packets dropped on the HSR forwarding path due to failed checks
+ - ``FW_HSR_HE_CHECK_FAIL_DROP``: Packets dropped on the host egress path due to failed checks
+ - ``FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES``: Frames for which the host duplicate discard check was skipped
+ - ``FW_LRE_CNT_UNIQUE_RX``: Number of frames received with no duplicate detected
+ - ``FW_LRE_CNT_DUPLICATE_RX``: Number of frames received for which exactly one duplicate was detected
+ - ``FW_LRE_CNT_MULTIPLE_RX``: Number of frames received for which more than one duplicate was detected
+ - ``FW_LRE_CNT_RX``: Number of HSR/PRP tagged frames received
+ - ``FW_LRE_CNT_TX``: Number of HSR/PRP tagged frames sent
+ - ``FW_LRE_CNT_OWN_RX``: Number of HSR/PRP tagged frames received whose source MAC matches the node's own address
+ - ``FW_LRE_CNT_ERRWRONGLAN``: Number of frames received with a wrong LAN identifier, PRP only
diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c
index a28a608f9bf4..e7a51a9eee24 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_common.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_common.c
@@ -1633,7 +1633,8 @@ void icssg_ndo_get_stats64(struct net_device *ndev,
 			    emac_get_stat_by_name(emac, "FW_RX_EOF_SHORT_FRMERR") +
 			    emac_get_stat_by_name(emac, "FW_RX_B0_DROP_EARLY_EOF") +
 			    emac_get_stat_by_name(emac, "FW_RX_EXP_FRAG_Q_DROP") +
-			    emac_get_stat_by_name(emac, "FW_RX_FIFO_OVERRUN");
+			    emac_get_stat_by_name(emac, "FW_RX_FIFO_OVERRUN") +
+			    emac_get_stat_by_name(emac, "FW_LRE_CNT_ERRWRONGLAN");
 	stats->rx_dropped = ndev->stats.rx_dropped +
 			    emac_get_stat_by_name(emac, "FW_DROPPED_PKT") +
 			    emac_get_stat_by_name(emac, "FW_INF_PORT_DISABLED") +
@@ -1643,7 +1644,9 @@ void icssg_ndo_get_stats64(struct net_device *ndev,
 			    emac_get_stat_by_name(emac, "FW_INF_DROP_TAGGED") +
 			    emac_get_stat_by_name(emac, "FW_INF_DROP_PRIOTAGGED") +
 			    emac_get_stat_by_name(emac, "FW_INF_DROP_NOTAG") +
-			    emac_get_stat_by_name(emac, "FW_INF_DROP_NOTMEMBER");
+			    emac_get_stat_by_name(emac, "FW_INF_DROP_NOTMEMBER") +
+			    emac_get_stat_by_name(emac, "FW_HSR_FWD_CHECK_FAIL_DROP") +
+			    emac_get_stat_by_name(emac, "FW_HSR_HE_CHECK_FAIL_DROP");
 	stats->tx_errors  = ndev->stats.tx_errors;
 	stats->tx_dropped = ndev->stats.tx_dropped +
 			    emac_get_stat_by_name(emac, "FW_RTU_PKT_DROP") +
diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
index df93d15c5b78..60a8aedd334b 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
@@ -57,7 +57,7 @@
 
 #define ICSSG_MAX_RFLOWS	8	/* per slice */
 
-#define ICSSG_NUM_PA_STATS	32
+#define ICSSG_NUM_PA_STATS	42
 #define ICSSG_NUM_MIIG_STATS	60
 /* Number of ICSSG related stats */
 #define ICSSG_NUM_STATS (ICSSG_NUM_MIIG_STATS + ICSSG_NUM_PA_STATS)
diff --git a/drivers/net/ethernet/ti/icssg/icssg_stats.h b/drivers/net/ethernet/ti/icssg/icssg_stats.h
index 6f4400d8a0f6..08b5ab6f93da 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_stats.h
+++ b/drivers/net/ethernet/ti/icssg/icssg_stats.h
@@ -201,6 +201,16 @@ static const struct icssg_pa_stats icssg_all_pa_stats[] = {
 	ICSSG_PA_STATS(FW_HOST_TX_PKT_CNT),
 	ICSSG_PA_STATS(FW_HOST_EGRESS_Q_PRE_OVERFLOW),
 	ICSSG_PA_STATS(FW_HOST_EGRESS_Q_EXP_OVERFLOW),
+	ICSSG_PA_STATS(FW_HSR_FWD_CHECK_FAIL_DROP),
+	ICSSG_PA_STATS(FW_HSR_HE_CHECK_FAIL_DROP),
+	ICSSG_PA_STATS(FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES),
+	ICSSG_PA_STATS(FW_LRE_CNT_UNIQUE_RX),
+	ICSSG_PA_STATS(FW_LRE_CNT_DUPLICATE_RX),
+	ICSSG_PA_STATS(FW_LRE_CNT_MULTIPLE_RX),
+	ICSSG_PA_STATS(FW_LRE_CNT_RX),
+	ICSSG_PA_STATS(FW_LRE_CNT_TX),
+	ICSSG_PA_STATS(FW_LRE_CNT_OWN_RX),
+	ICSSG_PA_STATS(FW_LRE_CNT_ERRWRONGLAN),
 };
 
 static_assert(ARRAY_SIZE(icssg_all_pa_stats) == ICSSG_NUM_PA_STATS);
diff --git a/drivers/net/ethernet/ti/icssg/icssg_switch_map.h b/drivers/net/ethernet/ti/icssg/icssg_switch_map.h
index 7e053b8af3ec..bd2d54dd7f45 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_switch_map.h
+++ b/drivers/net/ethernet/ti/icssg/icssg_switch_map.h
@@ -266,5 +266,15 @@
 #define FW_HOST_TX_PKT_CNT		0x0250
 #define FW_HOST_EGRESS_Q_PRE_OVERFLOW	0x0258
 #define FW_HOST_EGRESS_Q_EXP_OVERFLOW	0x0260
+#define FW_HSR_FWD_CHECK_FAIL_DROP		0x0500
+#define FW_HSR_HE_CHECK_FAIL_DROP		0x0508
+#define FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES	0x0510
+#define FW_LRE_CNT_UNIQUE_RX			0x0518
+#define FW_LRE_CNT_DUPLICATE_RX			0x0520
+#define FW_LRE_CNT_MULTIPLE_RX			0x0528
+#define FW_LRE_CNT_RX				0x0530
+#define FW_LRE_CNT_TX				0x0538
+#define FW_LRE_CNT_OWN_RX			0x0540
+#define FW_LRE_CNT_ERRWRONGLAN			0x0548
 
 #endif /* __NET_TI_ICSSG_SWITCH_MAP_H  */
-- 
2.34.1


^ permalink raw reply related

* [PATCH net-next v2 0/2] Add ICSSG firmware stats related to HSR
From: MD Danish Anwar @ 2026-05-14  7:56 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, MD Danish Anwar,
	Roger Quadros, Andrew Lunn, Meghana Malladi, Jacob Keller,
	David Carlier, Vadim Fedorenko, Kevin Hao
  Cc: netdev, linux-doc, linux-kernel, linux-arm-kernel,
	Vladimir Oltean

This series adds HSR and LRE firmware PA statistics to the TI ICSSG
ethtool stats interface, and places static_assert() guards next to the
stat descriptor arrays to catch count mismatches at build time.

Patch 1 adds static_assert() immediately after each of
icssg_all_miig_stats[] and icssg_all_pa_stats[] in icssg_stats.h,
verifying that ICSSG_NUM_MIIG_STATS and ICSSG_NUM_PA_STATS stay in
sync with the actual array sizes.
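
For readers unfamiliar with the pattern, here is a minimal user-space
sketch of the guard, assuming C11 static_assert from <assert.h> and a
locally defined ARRAY_SIZE(). The demo_* names and DEMO_NUM_STATS are
hypothetical stand-ins, not the driver's identifiers:

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's ARRAY_SIZE() helper. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical count that other code relies on (e.g. for buffer sizing). */
#define DEMO_NUM_STATS 3

struct demo_stat {
	const char *name;
	unsigned int offset;
};

static const struct demo_stat demo_all_stats[] = {
	{ "DEMO_RX", 0x00 },
	{ "DEMO_TX", 0x08 },
	{ "DEMO_DROP", 0x10 },
};

/* The build fails here if an entry is added without bumping DEMO_NUM_STATS. */
static_assert(ARRAY_SIZE(demo_all_stats) == DEMO_NUM_STATS,
	      "DEMO_NUM_STATS out of sync with demo_all_stats[]");

/* Number of descriptors, for callers that cannot see the array itself. */
static size_t demo_stat_count(void)
{
	return ARRAY_SIZE(demo_all_stats);
}
```

Adding a fourth entry to demo_all_stats[] while leaving DEMO_NUM_STATS
at 3 should stop the compile at the static_assert line rather than
produce a silently short count at run time.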

Patch 2 adds ten new firmware counters for HSR forwarding-path drops,
host-egress-path drops, and LRE duplicate-detection, updates
icssg_ndo_get_stats64() to fold the relevant counters into rx_errors
and rx_dropped, bumps ICSSG_NUM_PA_STATS to 42 (caught immediately by
the static_assert from patch 1 if the constant is ever left behind),
and documents all new entries in icssg_prueth.rst.

Changes in v2:
 - Drop the ARRAY_SIZE()-based macro approach from v1 (which caused
   binary bloat by pulling the static const arrays into every TU via
   icssg_prueth.h) as suggested by David Carlier <devnexen@gmail.com>
 - Add static_assert() next to each array in icssg_stats.h instead,
   keeping the numeric #defines and the original include graph. As
   suggested by David Carlier <devnexen@gmail.com>

v1 https://lore.kernel.org/all/20260512060627.3781329-1-danishanwar@ti.com/

MD Danish Anwar (2):
  net: ti: icssg: Add static_assert to guard stat array counts
  net: ti: icssg: Add HSR and LRE PA statistics

 .../device_drivers/ethernet/ti/icssg_prueth.rst    | 10 ++++++++++
 drivers/net/ethernet/ti/icssg/icssg_common.c       |  7 +++++--
 drivers/net/ethernet/ti/icssg/icssg_prueth.h       |  2 +-
 drivers/net/ethernet/ti/icssg/icssg_stats.h        | 14 ++++++++++++++
 drivers/net/ethernet/ti/icssg/icssg_switch_map.h   | 10 ++++++++++
 5 files changed, 40 insertions(+), 3 deletions(-)


base-commit: 18dc8e6d15d7a30888beec46a1e01ca0f98508fa
-- 
2.34.1


^ permalink raw reply

* [PATCH net-next v2 1/2] net: ti: icssg: Add static_assert to guard stat array counts
From: MD Danish Anwar @ 2026-05-14  7:56 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, MD Danish Anwar,
	Roger Quadros, Andrew Lunn, Meghana Malladi, Jacob Keller,
	David Carlier, Vadim Fedorenko, Kevin Hao
  Cc: netdev, linux-doc, linux-kernel, linux-arm-kernel,
	Vladimir Oltean
In-Reply-To: <20260514075605.850674-1-danishanwar@ti.com>

Place static_assert() immediately after each of icssg_all_miig_stats[]
and icssg_all_pa_stats[] in icssg_stats.h to verify at build time that
ICSSG_NUM_MIIG_STATS and ICSSG_NUM_PA_STATS stay in sync with the
actual array sizes. This turns a silent miscount into a build error
should either the constant or the array be updated independently.

Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
---
 drivers/net/ethernet/ti/icssg/icssg_stats.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_stats.h b/drivers/net/ethernet/ti/icssg/icssg_stats.h
index 5ec0b38e0c67..6f4400d8a0f6 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_stats.h
+++ b/drivers/net/ethernet/ti/icssg/icssg_stats.h
@@ -155,6 +155,8 @@ static const struct icssg_miig_stats icssg_all_miig_stats[] = {
 	ICSSG_MIIG_STATS(tx_bytes, true),
 };
 
+static_assert(ARRAY_SIZE(icssg_all_miig_stats) == ICSSG_NUM_MIIG_STATS);
+
 #define ICSSG_PA_STATS(field)	\
 {				\
 	#field,			\
@@ -201,4 +203,6 @@ static const struct icssg_pa_stats icssg_all_pa_stats[] = {
 	ICSSG_PA_STATS(FW_HOST_EGRESS_Q_EXP_OVERFLOW),
 };
 
+static_assert(ARRAY_SIZE(icssg_all_pa_stats) == ICSSG_NUM_PA_STATS);
+
 #endif /* __NET_TI_ICSSG_STATS_H */
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH v3 2/4] PCI: endpoint: Add DOE mailbox support for endpoint functions
From: Manivannan Sadhasivam @ 2026-05-14  8:03 UTC (permalink / raw)
  To: Aksh Garg
  Cc: linux-pci, linux-doc, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair, linux-arm-kernel, linux-kernel,
	s-vadapalli, danishanwar, srk
In-Reply-To: <20260427051725.223704-3-a-garg7@ti.com>

On Mon, Apr 27, 2026 at 10:47:23AM +0530, Aksh Garg wrote:
> DOE (Data Object Exchange) is a standard PCIe extended capability
> feature introduced in the Data Object Exchange (DOE) ECN for
> PCIe r5.0. It provides a communication mechanism primarily used for
> implementing PCIe security features such as device authentication, and
> secure link establishment. Think of DOE as a sophisticated mailbox
> system built into PCIe. The root complex can send structured requests
> to the endpoint device through DOE mailboxes, and the endpoint device
> responds with appropriate data.
> 
> Add the DOE support for PCIe endpoint devices, enabling endpoint
> functions to process the DOE requests from the host. The implementation
> provides framework APIs for EPC core driver and controller drivers to
> register mailboxes, and request processing with workqueues ensuring
> sequential handling per mailbox, and parallel handling across mailboxes.
> The Discovery protocol is handled internally by the DOE core.
> 
> This implementation complements the existing DOE implementation for
> root complex in drivers/pci/doe.c.
> 
> Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Aksh Garg <a-garg7@ti.com>
> ---
> 
> Changes from v2 to v3:
> - Rebased on 7.1-rc1.
> 
> Changes since v1:
> - Moved the DOE-EP core file to drivers/pci/endpoint/pci-ep-doe.c, and
>   corresponding Kconfig and Makefile to match the existing naming scheme,
>   as suggested by Niklas Cassel.
> - Renamed the config from PCI_DOE_EP to PCI_ENDPOINT_DOE
> - Moved the function declarations that need not be visible outside the
>   PCI core to drivers/pci/pci.h instead to include/linux/pci-doe.h as
>   suggested by Lukas Wunner
> - Converted from synchronous to asynchronous request processing:
>   * Removed wait_for_completion() from pci_ep_doe_process_request()
>   * Function returns immediately after queuing to workqueue, hence
>     removed private data for completion in the task structure
>   * Added completion callback as an additional argument to
>     pci_ep_doe_process_request(), which takes the response and status
>     parameters as arguments (along with other required arguments), hence
>     removed task_status in the task structure
>   * Created a typedef pci_ep_doe_complete_t for completion callback
>   * Removed the pci_ep_doe_task_complete() function, as it would not be
>     required anymore with these changes
>   * Moved from INIT_WORK_ONSTACK() to INIT_WORK(), to initialize the work
>     on heap instead of stack
>   * signal_task_complete() now invokes the completion callback, once the
>     protocol handler completes its task
> - Changed from dynamic xarray-based protocol registration to static array:
>   * Removed the register/unregister protocol APIs
>   * Replaced the dynamic xarray with static array of struct pci_doe_protocol
>   * Added discovery protocol to static array, instead of treating it specially,
>     hence removed the special handling for Discovery protocol in
>     doe_ep_task_work()
>   * Updated pci_ep_doe_handle_discovery() and pci_ep_doe_find_protocol()
>     accordingly.
> - Memory Management:
>   * DOE core frees request buffer in signal_task_complete()
>     or during error handling
>   * pci_ep_doe_process_request() defines response_pl and response_pl_sz
>     as NULL and 0 respectively, whose pointer is passed to the protocol
>     handler, hence removed the arguments void **response, size_t *response_sz
>     to this function.
> - Task structure refactoring:
>   * Response buffer: void **response_pl to void *response_pl
>   * Response size: size_t *response_pl_sz to size_t response_pl_sz
>   * Changed the completion callback to type pci_ep_doe_complete_t
>   * Removed void *private and int task_status
> - Updated documentation comments of the functions according to the changes 
> 
> v2: https://lore.kernel.org/all/20260401073022.215805-3-a-garg7@ti.com/
> v1: https://lore.kernel.org/all/20260213123603.420941-4-a-garg7@ti.com/
> 
>  drivers/pci/endpoint/Kconfig      |  14 +
>  drivers/pci/endpoint/Makefile     |   1 +
>  drivers/pci/endpoint/pci-ep-doe.c | 552 ++++++++++++++++++++++++++++++
>  drivers/pci/pci.h                 |  38 ++
>  include/linux/pci-doe.h           |   5 +
>  include/linux/pci-epc.h           |   3 +
>  6 files changed, 613 insertions(+)
>  create mode 100644 drivers/pci/endpoint/pci-ep-doe.c
> 
> diff --git a/drivers/pci/endpoint/Kconfig b/drivers/pci/endpoint/Kconfig
> index 8dad291be8b8..15ae16aaa58f 100644
> --- a/drivers/pci/endpoint/Kconfig
> +++ b/drivers/pci/endpoint/Kconfig
> @@ -36,6 +36,20 @@ config PCI_ENDPOINT_MSI_DOORBELL
>  	  doorbell. The RC can trigger doorbell in EP by writing data to a
>  	  dedicated BAR, which the EP maps to the controller's message address.
>  
> +config PCI_ENDPOINT_DOE
> +	bool "PCI Endpoint Data Object Exchange (DOE) support"
> +	depends on PCI_ENDPOINT
> +	help
> +	  This enables support for Data Object Exchange (DOE) protocol
> +	  on PCI Endpoint controllers. It provides a communication
> +	  mechanism through mailboxes, primarily used for PCIe security
> +	  features.
> +
> +	  Say Y here if you want to be able to communicate using PCIe DOE
> +	  mailboxes.
> +
> +	  If unsure, say N.
> +
>  source "drivers/pci/endpoint/functions/Kconfig"
>  
>  endmenu
> diff --git a/drivers/pci/endpoint/Makefile b/drivers/pci/endpoint/Makefile
> index b4869d52053a..1fa176b6792b 100644
> --- a/drivers/pci/endpoint/Makefile
> +++ b/drivers/pci/endpoint/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_PCI_ENDPOINT_CONFIGFS)	+= pci-ep-cfs.o
>  obj-$(CONFIG_PCI_ENDPOINT)		+= pci-epc-core.o pci-epf-core.o\
>  					   pci-epc-mem.o functions/
>  obj-$(CONFIG_PCI_ENDPOINT_MSI_DOORBELL)	+= pci-ep-msi.o
> +obj-$(CONFIG_PCI_ENDPOINT_DOE)		+= pci-ep-doe.o
> diff --git a/drivers/pci/endpoint/pci-ep-doe.c b/drivers/pci/endpoint/pci-ep-doe.c
> new file mode 100644
> index 000000000000..ded0290b15ed
> --- /dev/null
> +++ b/drivers/pci/endpoint/pci-ep-doe.c
> @@ -0,0 +1,552 @@
> +// SPDX-License-Identifier: GPL-2.0-only OR MIT
> +/*
> + * Data Object Exchange for PCIe Endpoint
> + *	PCIe r7.0, sec 6.30 DOE
> + *
> + * Copyright (C) 2026 Texas Instruments Incorporated - https://www.ti.com
> + *	Aksh Garg <a-garg7@ti.com>
> + *	Siddharth Vadapalli <s-vadapalli@ti.com>
> + */
> +
> +#define dev_fmt(fmt) "DOE EP: " fmt
> +
> +#include <linux/bitfield.h>
> +#include <linux/device.h>
> +#include <linux/pci.h>
> +#include <linux/pci-epc.h>
> +#include <linux/pci-doe.h>
> +#include <linux/slab.h>
> +#include <linux/workqueue.h>
> +#include <linux/xarray.h>
> +
> +#include "../pci.h"
> +
> +/* Forward declaration of discovery protocol handler */
> +static int pci_ep_doe_handle_discovery(const void *request, size_t request_sz,
> +				       void **response, size_t *response_sz);
> +
> +/**
> + * struct pci_doe_protocol - DOE protocol handler entry
> + * @vid: Vendor ID
> + * @type: Protocol type
> + * @handler: Handler function pointer
> + */
> +struct pci_doe_protocol {
> +	u16 vid;
> +	u8 type;
> +	pci_doe_protocol_handler_t handler;
> +};
> +
> +/**
> + * struct pci_ep_doe_mb - State for a single DOE mailbox on EP
> + *
> + * This state is used to manage a single DOE mailbox capability on the
> + * endpoint side.
> + *
> + * @epc: PCI endpoint controller this mailbox belongs to
> + * @func_no: Physical function number of the function this mailbox belongs to
> + * @cap_offset: Capability offset
> + * @work_queue: Queue of work items
> + * @flags: Bit array of PCI_DOE_FLAG_* flags
> + */
> +struct pci_ep_doe_mb {
> +	struct pci_epc *epc;
> +	u8 func_no;
> +	u16 cap_offset;
> +	struct workqueue_struct *work_queue;
> +	unsigned long flags;
> +};
> +
> +/**
> + * struct pci_ep_doe_task - Represents a single DOE request/response task
> + *
> + * @feat: DOE feature (vendor ID and type)
> + * @request_pl: Request payload
> + * @request_pl_sz: Size of request payload in bytes
> + * @response_pl: Response buffer
> + * @response_pl_sz: Size of response buffer in bytes
> + * @complete: Completion callback
> + * @work: Work structure for workqueue
> + * @doe_mb: DOE mailbox handling this task
> + */
> +struct pci_ep_doe_task {
> +	struct pci_doe_feature feat;
> +	const void *request_pl;
> +	size_t request_pl_sz;
> +	void *response_pl;
> +	size_t response_pl_sz;
> +	pci_ep_doe_complete_t complete;
> +
> +	/* Initialized by pci_ep_doe_submit_task() */
> +	struct work_struct work;
> +	struct pci_ep_doe_mb *doe_mb;
> +};
> +
> +/*
> + * Global registry of protocol handlers.
> + * When a new DOE protocol, library is added, add an entry to this array.
> + */
> +static const struct pci_doe_protocol pci_doe_protocols[] = {
> +	{
> +		.vid = PCI_VENDOR_ID_PCI_SIG,
> +		.type = PCI_DOE_FEATURE_DISCOVERY,
> +		.handler = pci_ep_doe_handle_discovery,
> +	},
> +};
> +
> +/*
> + * Combines function number and capability offset into a unique lookup key
> + * for storing/retrieving DOE mailboxes in an xarray.
> + */
> +#define PCI_DOE_MB_KEY(func, offset) \
> +	(((unsigned long)(func) << 16) | (offset))
> +#define PCI_DOE_PROTOCOL_COUNT        ARRAY_SIZE(pci_doe_protocols)
> +
> +/**
> + * pci_ep_doe_init() - Initialize the DOE framework for a controller in EP mode
> + * @epc: PCI endpoint controller
> + *
> + * Initialize the DOE framework data structures. This only initializes
> + * the xarray that will hold the mailboxes.
> + *
> + * RETURNS: 0 on success, -errno on failure

The kernel-doc format for describing the return value is 'Return:' or 'Returns:'.
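
For example, a kernel-doc comment with the expected section name could look
like the sketch below. demo_epc and demo_doe_init() are hypothetical stand-ins
so that the snippet compiles in user space; only the comment layout matters:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the controller structure. */
struct demo_epc {
	int initialized;
};

/**
 * demo_doe_init() - Initialize DOE state for a controller
 * @epc: controller to initialize
 *
 * Return: 0 on success, -EINVAL if @epc is NULL.
 */
static int demo_doe_init(struct demo_epc *epc)
{
	if (!epc)
		return -EINVAL;

	epc->initialized = 1;
	return 0;
}
```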

> + */
> +int pci_ep_doe_init(struct pci_epc *epc)
> +{
> +	if (!epc)
> +		return -EINVAL;
> +
> +	xa_init(&epc->doe_mbs);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_ep_doe_init);
> +
> +/**
> + * pci_ep_doe_add_mailbox() - Add a DOE mailbox for a physical function
> + * @epc: PCI endpoint controller
> + * @func_no: Physical function number
> + * @cap_offset: Offset of the DOE capability
> + *
> + * Create and register a DOE mailbox for the specified physical function
> + * and capability offset.
> + *
> + * EPC core driver calls this for each DOE capability discovered in the config
> + * space of each endpoint function through an API. The API is invoked by the
> + * controller driver during initialization if DOE support is available.
> + *
> + * RETURNS: 0 on success, -errno on failure
> + */
> +int pci_ep_doe_add_mailbox(struct pci_epc *epc, u8 func_no, u16 cap_offset)
> +{
> +	struct pci_ep_doe_mb *doe_mb;
> +	unsigned long key;
> +	int ret;
> +
> +	if (!epc)
> +		return -EINVAL;
> +
> +	doe_mb = kzalloc_obj(*doe_mb, GFP_KERNEL);
> +	if (!doe_mb)
> +		return -ENOMEM;
> +
> +	doe_mb->epc = epc;
> +	doe_mb->func_no = func_no;
> +	doe_mb->cap_offset = cap_offset;
> +
> +	doe_mb->work_queue = alloc_ordered_workqueue("pci_ep_doe[%s:pf%d:offset%x]", 0,
> +						     dev_name(&epc->dev),
> +						     func_no, cap_offset);
> +	if (!doe_mb->work_queue) {
> +		dev_err(epc->dev.parent,
> +			"[pf%d:offset%x] failed to allocate work queue\n",
> +			func_no, cap_offset);
> +		ret = -ENOMEM;
> +		goto err_free;
> +	}
> +
> +	/* Add to xarray with composite key */
> +	key = PCI_DOE_MB_KEY(func_no, cap_offset);
> +	ret = xa_insert(&epc->doe_mbs, key, doe_mb, GFP_KERNEL);
> +	if (ret) {
> +		dev_err(epc->dev.parent,
> +			"[pf%d:offset%x] failed to insert mailbox: %d\n",
> +			func_no, cap_offset, ret);
> +		goto err_destroy;
> +	}
> +
> +	dev_dbg(epc->dev.parent,
> +		"DOE mailbox added: pf%d offset 0x%x\n",
> +		func_no, cap_offset);
> +
> +	return 0;
> +
> +err_destroy:
> +	destroy_workqueue(doe_mb->work_queue);
> +err_free:
> +	kfree(doe_mb);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_ep_doe_add_mailbox);
> +
> +/**
> + * pci_ep_doe_cancel_tasks() - Cancel all pending tasks
> + * @doe_mb: DOE mailbox
> + *
> + * Cancel all pending tasks in the mailbox. Mark the mailbox as dead
> + * so no new tasks can be submitted.
> + */
> +static void pci_ep_doe_cancel_tasks(struct pci_ep_doe_mb *doe_mb)
> +{
> +	if (!doe_mb)
> +		return;
> +
> +	/* Mark the mailbox as dead */
> +	set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> +
> +	/* Stop all pending work items from starting */
> +	set_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags);
> +}
> +
> +/**
> + * pci_ep_doe_get_mailbox() - Get DOE mailbox by function and offset
> + * @epc: PCI endpoint controller
> + * @func_no: Physical function number
> + * @cap_offset: Offset of the DOE capability
> + *
> + * Internal helper to look up a DOE mailbox by its function number and
> + * capability offset.
> + *
> + * RETURNS: Pointer to the mailbox or NULL if not found
> + */
> +static struct pci_ep_doe_mb *pci_ep_doe_get_mailbox(struct pci_epc *epc,
> +						    u8 func_no, u16 cap_offset)
> +{
> +	unsigned long key;
> +
> +	if (!epc)
> +		return NULL;
> +
> +	key = PCI_DOE_MB_KEY(func_no, cap_offset);
> +	return xa_load(&epc->doe_mbs, key);
> +}
> +
> +/**
> + * pci_ep_doe_find_protocol() - Find protocol handler in static array
> + * @vendor: Vendor ID
> + * @type: Protocol type
> + *
> + * Look up a protocol handler in the static protocol array by matching vendor ID
> + * and protocol type.
> + *
> + * RETURNS: Handler function pointer or NULL if not found
> + */
> +static pci_doe_protocol_handler_t pci_ep_doe_find_protocol(u16 vendor, u8 type)
> +{
> +	int i;
> +
> +	/* Search static protocol array */
> +	for (i = 0; i < PCI_DOE_PROTOCOL_COUNT; i++) {
> +		if (pci_doe_protocols[i].vid == vendor &&
> +		    pci_doe_protocols[i].type == type)
> +			return pci_doe_protocols[i].handler;
> +	}
> +
> +	return NULL;
> +}
> +
> +/**
> + * pci_ep_doe_handle_discovery() - Handle Discovery protocol request
> + * @request: Request payload
> + * @request_sz: Request size
> + * @response: Output pointer for response buffer
> + * @response_sz: Output pointer for response size
> + *
> + * Handle the DOE Discovery protocol. The request contains an index specifying
> + * which protocol to query. This function creates a response containing the
> + * vendor ID and protocol type for the requested index, along with the next
> + * index value for further discovery:
> + *
> + * - next_index = 0: Signals this is the last protocol supported
> + * - next_index = n (non-zero): Signals more protocols available,
> + *   query index n next
> + *
> + * RETURNS: 0 on success, -errno on failure
> + */
> +static int pci_ep_doe_handle_discovery(const void *request, size_t request_sz,
> +				       void **response, size_t *response_sz)
> +{
> +	struct pci_doe_protocol protocol;
> +	u8 requested_index, next_index;
> +	u32 *response_pl;
> +	u32 request_pl;
> +	u16 vendor;
> +	u8 type;
> +
> +	if (request_sz != sizeof(u32))
> +		return -EINVAL;
> +
> +	request_pl = *(u32 *)request;
> +	requested_index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, request_pl);
> +
> +	if (requested_index >= PCI_DOE_PROTOCOL_COUNT)
> +		return -EINVAL;
> +
> +	/* Get protocol from array at requested_index */
> +	protocol = pci_doe_protocols[requested_index];
> +	vendor = protocol.vid;
> +	type = protocol.type;
> +
> +	/* Calculate next index */
> +	next_index = (requested_index + 1 < PCI_DOE_PROTOCOL_COUNT) ? requested_index + 1 : 0;
> +
> +	response_pl = kzalloc_obj(*response_pl, GFP_KERNEL);
> +	if (!response_pl)
> +		return -ENOMEM;
> +
> +	/* Build response */
> +	*response_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, vendor) |
> +		       FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE, type) |
> +		       FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, next_index);
> +
> +	*response = response_pl;
> +	*response_sz = sizeof(*response_pl);
> +
> +	return 0;
> +}
> +
> +static void signal_task_complete(struct pci_ep_doe_task *task, int status)
> +{
> +	kfree(task->request_pl);
> +	task->complete(task->doe_mb->func_no, task->doe_mb->cap_offset, status,
> +		       task->feat.vid, task->feat.type,
> +		       task->response_pl, task->response_pl_sz);
> +	kfree(task);
> +}
> +
> +/**
> + * doe_ep_task_work() - Work function for processing DOE EP tasks
> + * @work: Work structure
> + *
> + * Process a DOE request by calling the appropriate protocol handler.
> + */
> +static void doe_ep_task_work(struct work_struct *work)
> +{
> +	struct pci_ep_doe_task *task = container_of(work, struct pci_ep_doe_task,
> +						    work);
> +	struct pci_ep_doe_mb *doe_mb = task->doe_mb;
> +	pci_doe_protocol_handler_t handler;
> +	int rc;
> +
> +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> +		signal_task_complete(task, -EIO);
> +		return;
> +	}
> +
> +	/* Check if request was aborted */
> +	if (test_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags)) {
> +		signal_task_complete(task, -ECANCELED);
> +		return;
> +	}
> +
> +	/* Find protocol handler in the array */
> +	handler = pci_ep_doe_find_protocol(task->feat.vid, task->feat.type);
> +	if (!handler) {
> +		dev_warn(doe_mb->epc->dev.parent,
> +			 "[%d:%x] Unsupported protocol VID=%04x TYPE=%02x\n",
> +			 doe_mb->func_no, doe_mb->cap_offset,
> +			 task->feat.vid, task->feat.type);
> +		signal_task_complete(task, -EOPNOTSUPP);
> +		return;
> +	}
> +
> +	/* Call protocol handler */
> +	rc = handler(task->request_pl, task->request_pl_sz,
> +		     &task->response_pl, &task->response_pl_sz);
> +
> +	signal_task_complete(task, rc);
> +}
> +
> +/**
> + * pci_ep_doe_submit_task() - Submit a task to be processed
> + * @doe_mb: DOE mailbox
> + * @task: Task to submit
> + *
> + * Submit a DOE task to the workqueue for asynchronous processing.
> + *
> + * RETURNS: 0 on success, -errno on failure
> + */
> +static int pci_ep_doe_submit_task(struct pci_ep_doe_mb *doe_mb,
> +				  struct pci_ep_doe_task *task)
> +{
> +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
> +		return -EIO;
> +
> +	task->doe_mb = doe_mb;
> +	INIT_WORK(&task->work, doe_ep_task_work);
> +	queue_work(doe_mb->work_queue, &task->work);
> +	return 0;
> +}
> +
> +/**
> + * pci_ep_doe_process_request() - Process DOE request on endpoint
> + * @epc: PCI endpoint controller
> + * @func_no: Physical function number
> + * @cap_offset: DOE capability offset
> + * @vendor: Vendor ID from request header
> + * @type: Protocol type from request header
> + * @request: Request payload in CPU-native format
> + * @request_sz: Size of request payload (bytes)
> + * @complete: Callback to invoke upon completion
> + *
> + * Asynchronously process a DOE request received on the endpoint. The request
> + * payload should not include the DOE header (vendor/type/length). The protocol
> + * handler will allocate the response buffer, which the caller (controller driver)
> + * must free after use.
> + *
> + * This function returns immediately after queuing the request. The completion
> + * callback will be invoked asynchronously from workqueue context once the
> + * request is processed. The callback receives the function number and capability
> + * offset to identify the mailbox, along with a status code (0 on success, -errno
> + * on failure), and other required arguments.
> + *
> + * As per DOE specification, a mailbox processes one request at a time.
> + * Therefore, this function will never be called concurrently for the same
> + * mailbox by different callers.
> + *
> + * The caller is responsible for the conversion of the received DOE request
> + * with le32_to_cpu() before calling this function.
> + * Similarly, it is responsible for converting the response payload with
> + * cpu_to_le32() before sending it back over the DOE mailbox.
> + *
> + * The caller is also responsible for ensuring that the request size
> + * is within the limits defined by PCI_DOE_MAX_LENGTH.
> + *
> + * RETURNS: 0 if the request was successfully queued, -errno on failure
> + */
> +int pci_ep_doe_process_request(struct pci_epc *epc, u8 func_no, u16 cap_offset,
> +			       u16 vendor, u8 type, const void *request, size_t request_sz,
> +			       pci_ep_doe_complete_t complete)
> +{
> +	struct pci_ep_doe_mb *doe_mb;
> +	struct pci_ep_doe_task *task;
> +	int rc;
> +
> +	doe_mb = pci_ep_doe_get_mailbox(epc, func_no, cap_offset);
> +	if (!doe_mb) {
> +		kfree(request);
> +		return -ENODEV;
> +	}
> +
> +	task = kzalloc_obj(*task, GFP_KERNEL);
> +	if (!task) {
> +		kfree(request);
> +		return -ENOMEM;
> +	}
> +
> +	task->feat.vid = vendor;
> +	task->feat.type = type;
> +	task->request_pl = request;
> +	task->request_pl_sz = request_sz;
> +	task->response_pl = NULL;
> +	task->response_pl_sz = 0;
> +	task->complete = complete;
> +
> +	rc = pci_ep_doe_submit_task(doe_mb, task);
> +	if (rc) {
> +		kfree(request);
> +		kfree(task);
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_ep_doe_process_request);

So who is supposed to call this API? The EPC driver that receives the DOE interrupt?
But I don't see any callers of this and the other exported APIs below in this series.
Either you should add the callers or limit this series just to adding the DOE
skeleton implementation with a clear follow-up.

But since you've limited the scope of this series to support only DOE Discovery
Data Object Protocol, it'd be good to add the EPC implementation to get the full
picture.
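
To make that scope concrete, the Discovery index walk from the quoted handler
can be modelled in user space roughly as below. This is only a sketch: the
demo_* names are hypothetical, the second table entry is invented to show a
non-zero next index, and the bit packing of the real response is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's ARRAY_SIZE() helper. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct demo_protocol {
	uint16_t vid;
	uint8_t type;
};

/* Entry 0 models Discovery (PCI-SIG VID 0x0001, type 0); entry 1 is made up. */
static const struct demo_protocol demo_protocols[] = {
	{ 0x0001, 0x00 },
	{ 0x1234, 0x56 },
};

/*
 * Look up @index and report which index to query next.
 * *next_index == 0 means @index was the last supported protocol.
 */
static int demo_discover(uint8_t index, struct demo_protocol *out,
			 uint8_t *next_index)
{
	if (index >= ARRAY_SIZE(demo_protocols))
		return -EINVAL;

	*out = demo_protocols[index];
	*next_index = (index + 1u < ARRAY_SIZE(demo_protocols)) ?
		      (uint8_t)(index + 1) : 0;
	return 0;
}
```

A host would start at index 0 and keep querying the returned next index until
it reads back 0.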

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply

* Re: [PATCH v3 3/4] PCI: endpoint: Add API for DOE initialization and setup in EPC core
From: Manivannan Sadhasivam @ 2026-05-14  8:08 UTC (permalink / raw)
  To: Aksh Garg
  Cc: linux-pci, linux-doc, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair, linux-arm-kernel, linux-kernel,
	s-vadapalli, danishanwar, srk
In-Reply-To: <20260427051725.223704-4-a-garg7@ti.com>

On Mon, Apr 27, 2026 at 10:47:24AM +0530, Aksh Garg wrote:
> Add pci_epc_setup_doe() API in EPC core driver to initialize and setup
> the DOE framework for an endpoint controller. The API discovers the DOE
> capabilities (extended capability ID 0x2E), and registers each discovered
> DOE mailbox for all the functions in the endpoint controller. This API
> should be invoked by the controller driver during probe based on the
> doe_capable feature.
> 
> Add pci_epc_destroy_doe() API in EPC core driver for cleanup of DOE
> resources, which should be invoked by the controller driver during
> controller cleanup based on the doe_capable feature.
> 
> Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Aksh Garg <a-garg7@ti.com>
> ---
> 
> Changes from v2 to v3:
> - Rebased on 7.1-rc1.
> 
> Changes since v1:
> - New patch added to v2 (not present in v1)
> 
> v2: https://lore.kernel.org/all/20260401073022.215805-4-a-garg7@ti.com/
> 
> This patch is introduced based on the feedback provided by Manivannan
> Sadhasivam at [1].
> 

Sweet! But I was expecting you to add at least one EPC driver implementation to
make use of these APIs.

Also, why can't you call these APIs from the EPC core directly? Maybe during
pci_epc_init_notify() once the register accesses become valid.
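
To make the discovery step concrete: it is a walk over the extended capability
list of each function. Below is a minimal userspace sketch of that walk,
assuming the standard extended capability header layout (capability ID in bits
15:0, next pointer in bits 31:20, list starting at offset 0x100). The names
cfg_read32(), find_ext_cap() and count_doe_mailboxes() are hypothetical and
only for illustration; they are not functions from this series.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_SPACE_SIZE	4096u	/* PCIe extended config space, in bytes */
#define EXT_CAP_START	0x100u	/* offset of the first extended capability */
#define EXT_CAP_ID_DOE	0x2Eu	/* DOE extended capability ID */

/* Hypothetical helper: read one dword from an in-memory config-space image. */
static uint32_t cfg_read32(const uint32_t *cfg, uint16_t off)
{
	assert(off % 4 == 0 && off < CFG_SPACE_SIZE);
	return cfg[off / 4];
}

/*
 * Return the offset of the next capability with ID @cap located after @start
 * (@start == 0 means "search from the beginning"), or 0 if there is none.
 */
static uint16_t find_ext_cap(const uint32_t *cfg, uint16_t start, uint16_t cap)
{
	uint16_t pos = start ? start : EXT_CAP_START;
	int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8;	/* guards against loops */

	while (ttl-- > 0) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if ((hdr & 0xffffu) == cap && pos != start)
			return pos;

		pos = (hdr >> 20) & 0xffcu;	/* next pointer, dword aligned */
		if (pos < EXT_CAP_START)
			break;			/* end of the list */
	}
	return 0;
}

/* Count DOE capabilities the same way the setup loop iterates over them. */
static unsigned int count_doe_mailboxes(const uint32_t *cfg)
{
	unsigned int n = 0;
	uint16_t off = 0;

	while ((off = find_ext_cap(cfg, off, EXT_CAP_ID_DOE)))
		n++;
	return n;
}
```

For a config-space image that chains one unrelated capability and two DOE
capabilities, count_doe_mailboxes() returns 2, which matches the number of
pci_ep_doe_add_mailbox() calls the quoted loop would make for that function.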

- Mani

> [1]: https://lore.kernel.org/all/p57x6jleaim5w7t2k3v7tioujnaxuovfpj5euop5ogefvw23se@y5fw3che5p5d/
> 
>  drivers/pci/endpoint/pci-epc-core.c | 71 +++++++++++++++++++++++++++++
>  include/linux/pci-epc.h             | 21 +++++++++
>  2 files changed, 92 insertions(+)
> 
> diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
> index 6c3c58185fc5..5a95a07b7d3a 100644
> --- a/drivers/pci/endpoint/pci-epc-core.c
> +++ b/drivers/pci/endpoint/pci-epc-core.c
> @@ -14,6 +14,8 @@
>  #include <linux/pci-epf.h>
>  #include <linux/pci-ep-cfs.h>
>  
> +#include "../pci.h"
> +
>  static const struct class pci_epc_class = {
>  	.name = "pci_epc",
>  };
> @@ -548,6 +550,75 @@ void pci_epc_mem_unmap(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
>  }
>  EXPORT_SYMBOL_GPL(pci_epc_mem_unmap);
>  
> +/**
> + * pci_epc_doe_setup() - Setup and discover DOE mailboxes for all functions
> + * @epc: the EPC device on which DOE mailboxes has to be setup
> + *
> + * Discover DOE (Data Object Exchange) capabilities for all physical functions
> + * in the endpoint controller and register DOE mailboxes.
> + *
> + * This API should be called by the controller driver during initialization
> + * if DOE support is available (indicated by doe_capable in pci_epc_features).
> + *
> + * RETURNS: 0 on success, -errno on failure
> + */
> +int pci_epc_doe_setup(struct pci_epc *epc)
> +{
> +	u16 cap_offset = 0;
> +	u8 func_no;
> +	int ret;
> +
> +	if (!epc || !epc->ops || !epc->ops->find_ext_capability)
> +		return -EINVAL;
> +
> +	/* Initialize DOE framework for this controller */
> +	ret = pci_ep_doe_init(epc);
> +	if (ret)
> +		return ret;
> +
> +	/* Discover DOE capabilities for all functions */
> +	for (func_no = 0; func_no < epc->max_functions; func_no++) {
> +		while ((cap_offset = epc->ops->find_ext_capability(epc, func_no, 0,
> +								   cap_offset,
> +								   PCI_EXT_CAP_ID_DOE))) {
> +			/* Register this DOE mailbox */
> +			ret = pci_ep_doe_add_mailbox(epc, func_no, cap_offset);
> +			if (ret) {
> +				dev_err(&epc->dev,
> +					"[pf%d:offset %x] failed to add DOE mailbox\n",
> +					func_no, cap_offset);
> +			}
> +		}
> +	}
> +
> +	dev_dbg(&epc->dev, "DOE mailboxes setup complete\n");
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_epc_doe_setup);
> +
> +/**
> + * pci_epc_doe_destroy() - Destroy and cleanup DOE mailboxes
> + * @epc: the EPC device on which DOE mailboxes has to be destroyed
> + *
> + * Destroy all DOE mailboxes registered on this endpoint controller and
> + * free associated resources.
> + *
> + * This API should be called by the controller driver during controller cleanup
> + * if DOE support is available (indicated by doe_capable in pci_epc_features).
> + *
> + * RETURNS: 0 on success, -errno on failure
> + */
> +int pci_epc_doe_destroy(struct pci_epc *epc)
> +{
> +	if (!epc)
> +		return -EINVAL;
> +
> +	pci_ep_doe_destroy(epc);
> +	dev_dbg(&epc->dev, "DOE mailboxes destroyed\n");
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_epc_doe_destroy);
> +
>  /**
>   * pci_epc_clear_bar() - reset the BAR
>   * @epc: the EPC device for which the BAR has to be cleared
> diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
> index dd26294c8175..7b0f258ef330 100644
> --- a/include/linux/pci-epc.h
> +++ b/include/linux/pci-epc.h
> @@ -84,6 +84,8 @@ struct pci_epc_map {
>   * @start: ops to start the PCI link
>   * @stop: ops to stop the PCI link
>   * @get_features: ops to get the features supported by the EPC
> + * @find_ext_capability: ops to find extended capability offset for a function
> + *			 in endpoint controller
>   * @owner: the module owner containing the ops
>   */
>  struct pci_epc_ops {
> @@ -115,6 +117,8 @@ struct pci_epc_ops {
>  	void	(*stop)(struct pci_epc *epc);
>  	const struct pci_epc_features* (*get_features)(struct pci_epc *epc,
>  						       u8 func_no, u8 vfunc_no);
> +	u16	(*find_ext_capability)(struct pci_epc *epc, u8 func_no,
> +				       u8 vfunc_no, u16 start, u8 cap);
>  	struct module *owner;
>  };
>  
> @@ -270,6 +274,7 @@ struct pci_epc_bar_desc {
>   * @msi_capable: indicate if the endpoint function has MSI capability
>   * @msix_capable: indicate if the endpoint function has MSI-X capability
>   * @intx_capable: indicate if the endpoint can raise INTx interrupts
> + * @doe_capable: indicate if the endpoint function has DOE capability
>   * @bar: array specifying the hardware description for each BAR
>   * @align: alignment size required for BAR buffer allocation
>   */
> @@ -280,6 +285,7 @@ struct pci_epc_features {
>  	unsigned int	msi_capable : 1;
>  	unsigned int	msix_capable : 1;
>  	unsigned int	intx_capable : 1;
> +	unsigned int	doe_capable : 1;
>  	struct	pci_epc_bar_desc bar[PCI_STD_NUM_BARS];
>  	size_t	align;
>  };
> @@ -368,6 +374,21 @@ int pci_epc_mem_map(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
>  void pci_epc_mem_unmap(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
>  		       struct pci_epc_map *map);
>  
> +#ifdef CONFIG_PCI_ENDPOINT_DOE
> +int pci_epc_doe_setup(struct pci_epc *epc);
> +int pci_epc_doe_destroy(struct pci_epc *epc);
> +#else
> +static inline int pci_epc_doe_setup(struct pci_epc *epc)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline int pci_epc_doe_destroy(struct pci_epc *epc)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif
> +
>  #else
>  static inline void pci_epc_init_notify(struct pci_epc *epc)
>  {
> -- 
> 2.34.1
> 

-- 
மணிவண்ணன் சதாசிவம்

* Re: [PATCH v3 4/4] Documentation: PCI: Add documentation for DOE endpoint support
From: Manivannan Sadhasivam @ 2026-05-14  8:11 UTC (permalink / raw)
  To: Aksh Garg
  Cc: linux-pci, linux-doc, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair, linux-arm-kernel, linux-kernel,
	s-vadapalli, danishanwar, srk
In-Reply-To: <20260427051725.223704-5-a-garg7@ti.com>

On Mon, Apr 27, 2026 at 10:47:25AM +0530, Aksh Garg wrote:
> Document the architecture and implementation details for the Data Object
> Exchange (DOE) framework for PCIe Endpoint devices.
> 
> Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Aksh Garg <a-garg7@ti.com>
> ---
> 
> Changes from v2 to v3:
> - Rebased on 7.1-rc1.
> 
> Changes since v1:
> - Squashed the patches [1] and [2], and moved the documentation file
>   to Documentation/PCI/endpoint/pci-endpoint-doe.rst to match the existing
>   naming scheme, as suggested by Niklas Cassel
> - Updated the documentation as per the design and implementation changes
>   made to previous patches in this series:
>   * Updated for static protocol array instead of dynamic registration
>   * Documented asynchronous callback model
>   * Updated request/response flow with new callback signature
>   * Updated memory ownership: DOE core frees request, driver frees response
>   * Updated initialization and cleanup sections for new APIs
> 
> v2: https://lore.kernel.org/all/20260401073022.215805-5-a-garg7@ti.com/
> v1: [1] https://lore.kernel.org/all/20260213123603.420941-2-a-garg7@ti.com/
>     [2] https://lore.kernel.org/all/20260213123603.420941-5-a-garg7@ti.com/
> 
>  Documentation/PCI/endpoint/index.rst          |   1 +
>  .../PCI/endpoint/pci-endpoint-doe.rst         | 318 ++++++++++++++++++
>  2 files changed, 319 insertions(+)
>  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-doe.rst
> 
> diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst
> index dd1f62e731c9..7c03d5abd2ef 100644
> --- a/Documentation/PCI/endpoint/index.rst
> +++ b/Documentation/PCI/endpoint/index.rst
> @@ -9,6 +9,7 @@ PCI Endpoint Framework
>  
>     pci-endpoint
>     pci-endpoint-cfs
> +   pci-endpoint-doe
>     pci-test-function
>     pci-test-howto
>     pci-ntb-function
> diff --git a/Documentation/PCI/endpoint/pci-endpoint-doe.rst b/Documentation/PCI/endpoint/pci-endpoint-doe.rst
> new file mode 100644
> index 000000000000..03b7a69516f3
> --- /dev/null
> +++ b/Documentation/PCI/endpoint/pci-endpoint-doe.rst
> @@ -0,0 +1,318 @@
> +.. SPDX-License-Identifier: GPL-2.0-only or MIT
> +
> +.. include:: <isonum.txt>
> +
> +=============================================
> +Data Object Exchange (DOE) for PCIe Endpoint
> +=============================================
> +
> +:Copyright: |copy| 2026 Texas Instruments Incorporated
> +:Author: Aksh Garg <a-garg7@ti.com>
> +:Co-Author: Siddharth Vadapalli <s-vadapalli@ti.com>
> +
> +Overview
> +========
> +
> +DOE (Data Object Exchange) is a standard PCIe extended capability feature
> +introduced in the Data Object Exchange (DOE) ECN for PCIe r5.0. It is an optional
> +mechanism for system firmware/software running on root complex (host) to perform
> +:ref:`data object <data-object-term>` exchanges with an endpoint function. Each
> +data object is uniquely identified by the Vendor ID of the vendor publishing the
> +data object definition and a Data Object Type value assigned by that vendor.
> +
> +Think of DOE as a sophisticated mailbox system built into PCIe. The root complex
> +can send structured requests to the endpoint device through DOE mailboxes, and
> +the endpoint device responds with appropriate data. DOE mailboxes are implemented
> +as PCIe Extended Capabilities in endpoint devices, allowing multiple mailboxes
> +per function, each potentially supporting different data object protocols.
> +
> +The DOE support for root complex devices has already been implemented in
> +``drivers/pci/doe.c``.
> +
> +How DOE Works
> +=============
> +
> +The DOE mailbox operates through a simple request-response model:
> +
> +1. **Host sends request**: The root complex writes a data object (vendor ID, type,
> +   and payload) to the DOE write mailbox register (one DWORD at a time) of the
> +   endpoint function's config space and sets the GO bit in the DOE Control register
> +   to indicate that a request is ready for processing.
> +2. **Endpoint processes**: The endpoint function reads the request from DOE write
> +   mailbox register, sets the BUSY bit in the DOE Status register, identifies the
> +   protocol of the data object, and executes the appropriate handler.
> +3. **Endpoint responds**: The endpoint function writes the response data object to the
> +   DOE read mailbox register (one DWORD at a time), and sets the READY bit in the DOE
> +   Status register to indicate that the response is ready. If an error occurs during
> +   request processing (such as unsupported protocol or handler failure), the endpoint
> +   sets the ERROR bit in the DOE Status register instead of the READY bit.
> +4. **Host reads response**: The root complex retrieves the response data from the DOE read
> +   mailbox register once the READY bit is set in the DOE Status register, and then writes
> +   any value to this register to indicate a successful read. If the ERROR bit was set,
> +   the root complex discards the response and performs error handling as needed.
> +
> +Each mailbox operates independently and can handle one transaction at a time. The
> +DOE specification supports data objects of size up to 1 MiB (2\ :sup:`18` dwords).
> +
> +For complete DOE capability details, refer to `PCI Express Base Specification Revision 7.0,
> +Section 6.30 - Data Object Exchange (DOE)`.
> +
> +Key Terminologies
> +=================
> +
> +.. _data-object-term:
> +
> +**Data Object**
> +  A structured, vendor-defined, or standard-defined message exchanged between
> +  root complex and endpoint function via DOE capability registers in configuration
> +  space of the function.
> +
> +**Mailbox**
> +  A DOE capability on the endpoint device, where each physical function can have
> +  multiple mailboxes.
> +
> +**Protocol**
> +  A specific type of DOE communication data object identified by a Vendor ID and Type.
> +
> +**Handler**
> +  A function that processes DOE requests of a specific protocol and generates responses.
> +
> +Architecture of DOE Implementation for Endpoint
> +===============================================
> +
> +.. code-block:: text
> +
> +       +------------------+
> +       |                  |
> +       |   Root Complex   |
> +       |                  |
> +       +--------^---------+
> +                |
> +                | Config space access
> +                |   over PCIe link
> +                |
> +     +----------v-----------+
> +     |                      |
> +     |    PCIe Controller   |
> +     |      as Endpoint     |
> +     |                      |
> +     |  +-----------------+ |
> +     |  |   DOE Mailbox   | |
> +     |  +-------^---------+ |
> +     +----------|-----------+
> +    +-----------|---------------------------------------------------------------+
> +    |           |                                       +--------------------+  |
> +    | +---------v--------+           Allocate           |  +--------------+  |  |
> +    | |                  |-------------------------------->|   Request    |  |  |
> +    | |   EP Controller  |                            +--->|    Buffer    |  |  |
> +    | |      Driver      |             Free           | |  +--------------+  |  |
> +    | |                  |--------------------------+ | |                    |  |
> +    | +--------^---------+                          | | |                    |  |
> +    |          |                                    | | |                    |  |
> +    |          |                                    | | |                    |  |
> +    |          | pci_ep_doe_process_request()       | | |                    |  |
> +    |          |                                    | | |                    |  |
> +    | +--------v---------+             Free         | | |                    |  |
> +    | |                  |----------------------------+ |         DDR        |  |
> +    | |    DOE EP Core   |<----+                    |   |                    |  |
> +    | |    (doe-ep.c)    |     |     Discovery      |   |                    |  |
> +    | |                  |-----+  Protocol Handler  |   |                    |  |
> +    | +--------^---------+                          |   |                    |  |
> +    |          |                                    |   |                    |  |
> +    |          | protocol_handler()                 |   |                    |  |
> +    |          |                                    |   |                    |  |
> +    | +--------v---------+                          |   |                    |  |
> +    | |                  |                          |   |  +--------------+  |  |
> +    | | Protocol Handler |                          +----->|   Response   |  |  |
> +    | |      Module      |-------------------------------->|    Buffer    |  |  |
> +    | | (CMA/SPDM/Other) |           Allocate           |  +--------------+  |  |
> +    | |                  |                              |                    |  |
> +    | +------------------+                              |                    |  |
> +    |                                                   +--------------------+  |
> +    +---------------------------------------------------------------------------+
> +
> +Initialization and Cleanup
> +--------------------------
> +
> +**Framework Initialization and DOE Setup**
> +
> +The EPC core provides the ``pci_epc_doe_setup(epc)`` API for centralized DOE
> +mailbox discovery and registration. The controller driver calls this API during
> +its probe sequence if DOE is supported.
> +
> +This API performs the following steps:
> +
> +1. Calls ``pci_ep_doe_init(epc)``, which initializes the xarray data structure
> +   (a resizable array data structure defined in Linux) named ``doe_mbs`` that
> +   stores metadata of DOE mailboxes for the controller in ``struct pci_epc``.
> +2. Discovers all DOE capabilities in the endpoint function's configuration space
> +   for each function. For each discovered DOE capability, calls
> +   ``pci_ep_doe_add_mailbox(epc, func_no, cap_offset)`` to register the mailbox.
> +
> +Each DOE mailbox structure created by ``pci_ep_doe_add_mailbox()`` gets an
> +ordered workqueue allocated for processing DOE requests sequentially for that
> +mailbox, enabling concurrent request handling across different mailboxes. Each
> +mailbox is uniquely identified by the combination of physical function number
> +and capability offset for that controller.
> +
> +**Cleanup**
> +
> +The EPC core provides the ``pci_epc_doe_destroy(epc)`` API for centralized DOE
> +cleanup. The controller driver calls this API during its remove sequence
> +if DOE is supported.
> +
> +This API calls ``pci_ep_doe_destroy(epc)``, which destroys all registered
> +mailboxes, cancels any pending tasks, flushes and destroys the workqueues,
> +and frees all memory allocated to the mailboxes.
> +

As I mentioned in patch 3, we should call these APIs within the EPC core and not
sprinkle them throughout the EPC drivers.
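
As a side note on the data object framing that the quoted "How DOE Works"
section relies on, here is a minimal userspace sketch of the two-dword DOE
header. It assumes the usual layout (Vendor ID in DW0 bits 15:0, Data Object
Type in DW0 bits 23:16, Length in DW1 bits 17:0, counted in dwords with 0
encoding 2^18). struct doe_hdr and the doe_hdr_*() helpers are hypothetical
names, not part of this series.

```c
#include <assert.h>
#include <stdint.h>

#define DOE_HDR_VID_MASK	0x0000ffffu	/* DW0 bits 15:0: Vendor ID */
#define DOE_HDR_TYPE_SHIFT	16		/* DW0 bits 23:16: Data Object Type */
#define DOE_HDR_LEN_MASK	0x0003ffffu	/* DW1 bits 17:0: length in dwords */
#define DOE_MAX_LEN_DW		(1u << 18)	/* encoded as 0 in the Length field */

/* Hypothetical decoded view of a DOE data object header. */
struct doe_hdr {
	uint16_t vid;
	uint8_t type;
	uint32_t length_dw;	/* whole object, including the two header dwords */
};

/* Pack @h into two header dwords; returns 0 on success, -1 on a bad length. */
static int doe_hdr_encode(const struct doe_hdr *h, uint32_t out[2])
{
	if (h->length_dw < 2 || h->length_dw > DOE_MAX_LEN_DW)
		return -1;

	out[0] = (uint32_t)h->vid | ((uint32_t)h->type << DOE_HDR_TYPE_SHIFT);
	out[1] = h->length_dw & DOE_HDR_LEN_MASK;	/* 2^18 wraps to 0 */
	return 0;
}

/* Unpack two header dwords into @h; returns 0 on success, -1 on a bad length. */
static int doe_hdr_decode(const uint32_t in[2], struct doe_hdr *h)
{
	uint32_t len = in[1] & DOE_HDR_LEN_MASK;

	h->vid = (uint16_t)(in[0] & DOE_HDR_VID_MASK);
	h->type = (uint8_t)(in[0] >> DOE_HDR_TYPE_SHIFT);
	h->length_dw = len ? len : DOE_MAX_LEN_DW;
	return h->length_dw >= 2 ? 0 : -1;
}

/* Payload size in bytes, i.e. everything after the two header dwords. */
static uint32_t doe_payload_bytes(const struct doe_hdr *h)
{
	assert(h->length_dw >= 2);
	return (h->length_dw - 2) * 4;
}
```

A DOE Discovery request (Vendor ID 0x0001, type 0x00) is three dwords long, so
it encodes to a Length of 3 and carries a single payload dword.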

- Mani

-- 
மணிவண்ணன் சதாசிவம்

* Re: [PATCH v3 0/4] PCI: Add DOE support for endpoint
From: Manivannan Sadhasivam @ 2026-05-14  8:12 UTC (permalink / raw)
  To: Aksh Garg
  Cc: linux-pci, linux-doc, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair, linux-arm-kernel, linux-kernel,
	s-vadapalli, danishanwar, srk
In-Reply-To: <20260427051725.223704-1-a-garg7@ti.com>

On Mon, Apr 27, 2026 at 10:47:21AM +0530, Aksh Garg wrote:
> This patch series introduces the framework for supporting the Data
> Object Exchange (DOE) feature for PCIe endpoint devices. Please refer
> to the documentation added in patch 4 for details on the feature and
> implementation architecture.
> 
> The implementation provides a common framework for all PCIe endpoint
> controllers, not specific to any particular SoC vendor.
> 
> This patch series is the non-RFC version of the RFC series at 
> https://lore.kernel.org/all/20260213123603.420941-1-a-garg7@ti.com/
> 
> The changes since v1 are documented in the respective patch description.
> 

Thanks for the work! I left some comments, but the series looks good at first
glance. Once you add the callers as I suggested, I'll do a more thorough
review.

- Mani

> Changes from v2 to v3:
> - Rebased on 7.1-rc1.
> 
> v2: https://lore.kernel.org/all/20260401073022.215805-1-a-garg7@ti.com/
> 
> Aksh Garg (4):
>   PCI/DOE: Move common definitions to the header file
>   PCI: endpoint: Add DOE mailbox support for endpoint functions
>   PCI: endpoint: Add API for DOE initialization and setup in EPC core
>   Documentation: PCI: Add documentation for DOE endpoint support
> 
>  Documentation/PCI/endpoint/index.rst          |   1 +
>  .../PCI/endpoint/pci-endpoint-doe.rst         | 318 ++++++++++
>  drivers/pci/doe.c                             |  11 -
>  drivers/pci/endpoint/Kconfig                  |  14 +
>  drivers/pci/endpoint/Makefile                 |   1 +
>  drivers/pci/endpoint/pci-ep-doe.c             | 552 ++++++++++++++++++
>  drivers/pci/endpoint/pci-epc-core.c           |  71 +++
>  drivers/pci/pci.h                             |  47 ++
>  include/linux/pci-doe.h                       |   8 +
>  include/linux/pci-epc.h                       |  24 +
>  10 files changed, 1036 insertions(+), 11 deletions(-)
>  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-doe.rst
>  create mode 100644 drivers/pci/endpoint/pci-ep-doe.c
> 
> -- 
> 2.34.1
> 

-- 
மணிவண்ணன் சதாசிவம்

* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
From: Hao Jia @ 2026-05-14  8:13 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed, hannes, mhocko, tj
  Cc: akpm, shakeel.butt, mkoutny, chengming.zhou, muchun.song,
	roman.gushchin, cgroups, linux-mm, linux-kernel, linux-doc,
	Hao Jia, Alexandre Ghiti
In-Reply-To: <CAKEwX=OY_nws-vf3VgnD54G205TK2YjkoAwRCyB9jvW=Oz3PpQ@mail.gmail.com>



On 2026/5/14 04:53, Nhat Pham wrote:
> On Wed, May 13, 2026 at 11:55 AM Yosry Ahmed <yosry@kernel.org> wrote:
>>
>>>> Zswap objects are organized into LRU and exposed to the shrinker
>>>> interface. Echo-ing to memory.reclaim should also offload some zswap
>>>> entries, correct? Are there still cold zswap entries that escape this,
>>>> somehow?
>>>>
>>>
>>> Yes, the memory.reclaim path does drive some zswap writeback, but
>>> it is not enough for our case.
>>>
>>> 1. For a memcg that has reached steady state (a common case being
>>> when memory.current is below the policy target), the userspace
>>> reclaimer may not invoke memory.reclaim on it for a long time,
>>> and so no second-level offloading happens through
>>> memory.reclaim. In this state we want
>>> memory.zswap.proactive_writeback to write back entries that
>>> have sat in zswap past an age threshold, to further reclaim
>>> the DRAM still held by the compressed data.
>>>
>>> 2. Even when memory.reclaim is running, the fraction of zswap
>>> residency that ends up reaching the backing swap device is
>>> still very small for many of our workloads, and the userspace
>>> reclaimer has no way to participate in or control the
>>> granularity of zswap writeback. So in our deployment we prefer
>>> to leave the zswap shrinker disabled, decouple LRU -> zswap
>>> from zswap -> swap, and use a dedicated proactive-writeback
>>> interface that lifts the writeback policy into userspace where
>>> it can evolve independently of the kernel.
>>
>> To be honest I see the point of proactively reclaiming compressed
>> memory in zswap. If you use memory.reclaim, you are also reclaiming
>> hotter memory in the process, and you are not necessarily getting as
>> much writeback as you want. The memory in zswap is a more conservative
>> choice for proactive reclaim because it's memory that's guaranteed to
>> be cold(ish) and not being accessed.
>>
>> That being said, the interface is not great any way you cut it :/
>>
>> I don't like the 'memory.zswap.proactive_writeback' name, maybe we can
>> stay consistent by doing 'memory.zswap.reclaim', but that just as
>> easily reads as "reclaim using zswap". Maybe
>> 'memory.zswap.do_writeback' or something, idk.
>>
>> I also don't like having two proactive reclaim interfaces, so a voice
>> in my head wants to tie this into 'memory.reclaim' somehow, but that
>> includes adding a pretty specific argument (e.g. 'memory.reclaim
>> zswap_writeback_only=1').
>>
>> I don't like any of these options, and we also need to consider what
>> the memcg maintainers think. I see the use case of proactive writeback
>> but I am struggling to come up with a clean interface.
>>
>> I also think we should take the 'age' aspect out of the conversation
>> for now, it can be a separate discussion. Well, unless we decide to
>> tie it to memory.reclaim. If memory.reclaim broadly supports age-based
>> reclaim then zswap writeback can be a natural part of that without
>> requiring a specific interface.
> 
> Yeah perhaps extending memory.reclaim is best... Sort of analogous to
> the way we have swappiness to balance file vs. anon....


Thanks for the suggestions, Yosry and Nhat.

My only concern is this: if we later need to add more zswap writeback
parameters (such as age), would that make the parameter parsing and the
functionality of memory.reclaim overly complex?

As you mentioned, if the memcg maintainers have no objections, I will 
attempt to implement it in v2.

How about something like this?
echo "100M zswap_writeback_only" > memory.reclaim
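
To get a feel for the parsing cost, here is a rough userspace sketch of how
such a request string could be tokenized. It is only an illustration under
stated assumptions: the zswap_writeback_only keyword is the proposal above, not
an existing memory.reclaim option, and struct reclaim_req, parse_size() and
parse_reclaim_request() are hypothetical names. The kernel side would use its
own helpers (for example memparse()) rather than this code, and overflow
handling is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of one write to memory.reclaim. */
struct reclaim_req {
	uint64_t bytes;
	bool zswap_writeback_only;
};

static const char *skip_ws(const char *p)
{
	while (*p == ' ' || *p == '\t' || *p == '\n')
		p++;
	return p;
}

/* Parse a decimal size with an optional K/M/G suffix (powers of 1024). */
static int parse_size(const char *s, char **end, uint64_t *out)
{
	unsigned long long v;

	if (*s < '0' || *s > '9')
		return -EINVAL;		/* rejects signs and missing sizes */

	errno = 0;
	v = strtoull(s, end, 10);
	if (errno)
		return -EINVAL;

	switch (**end) {
	case 'G': case 'g': v <<= 10; /* fall through */
	case 'M': case 'm': v <<= 10; /* fall through */
	case 'K': case 'k': v <<= 10; (*end)++; break;
	default: break;
	}
	*out = v;
	return 0;
}

/* Parse "<size> [zswap_writeback_only]"; any other token is an error. */
static int parse_reclaim_request(const char *buf, struct reclaim_req *req)
{
	static const char key[] = "zswap_writeback_only";
	const char *p = skip_ws(buf);
	char *end;

	req->bytes = 0;
	req->zswap_writeback_only = false;

	if (parse_size(p, &end, &req->bytes))
		return -EINVAL;
	p = end;
	if (*p && *p != ' ' && *p != '\t' && *p != '\n')
		return -EINVAL;		/* junk glued to the size */

	for (p = skip_ws(p); *p; p = skip_ws(p)) {
		size_t len = strcspn(p, " \t\n");

		if (len != sizeof(key) - 1 || strncmp(p, key, len) != 0)
			return -EINVAL;
		req->zswap_writeback_only = true;
		p += len;
	}
	return 0;
}
```

In this sketch, each extra parameter (age or anything else) becomes one more
branch in the token loop, so the parsing itself stays small; the open question
is how much the reclaim path behind it grows.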

Thanks,
Hao

* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
From: Hao Jia @ 2026-05-14  8:15 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Yosry Ahmed, akpm, tj, hannes, shakeel.butt, mhocko, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia, Alexandre Ghiti
In-Reply-To: <CAKEwX=M=6AQVYA7ROM0YOP7irpxbdMrEOAHKGKYo0Qgr+-uhSw@mail.gmail.com>



On 2026/5/14 05:09, Nhat Pham wrote:
> On Wed, May 13, 2026 at 1:04 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>
>>
>>
>> On 2026/5/12 23:47, Nhat Pham wrote:
>>> On Tue, May 12, 2026 at 2:32 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2026/5/12 03:57, Yosry Ahmed wrote:
>>>>> On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>>>>>
>>>>>> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>>>>>>
>>>>>>> From: Hao Jia <jiahao1@lixiang.com>
>>>>>>>
>>>>>>> Zswap currently writes back pages to backing swap devices reactively,
>>>>>>> triggered either by memory pressure via the shrinker or by the pool
>>>>>>> reaching its size limit. This reactive approach offers no precise
>>>>>>> control over when writeback happens, which can disturb latency-sensitive
>>>>>>> workloads, and it cannot direct writeback at a specific memory cgroup.
>>>>>>> However, there are scenarios where users might want to proactively
>>>>>>> write back cold pages from zswap to the backing swap device, for
>>>>>>> example, to free up memory for other applications or to prepare for
>>>>>>> upcoming memory-intensive workloads.
>>>>>>>
>>>>>>> Therefore, implement a proactive writeback mechanism for zswap by
>>>>>>> adding a new cgroup interface file memory.zswap.proactive_writeback
>>>>>>> within the memory controller.
>>>>>>
>>>>
>>>> Thanks Nhat, Yosry — let me address both comments together.
>>>>
>>>>>>
>>>>>> We already have memory.reclaim, no? Would that not work to create
>>>>>> headroom generally for your use case? Is there a reason why we are
>>>>>> treating zswap memory as special here?
>>>>>
>>>>
>>>> Apologies for the lack of detailed explanation in the patch description,
>>>> which led to the confusion.
>>>>
>>>> While we are already utilizing memory.reclaim, it does not fully address
>>>> our requirements.
>>>>
>>>> Our deployment runs a userspace proactive reclaimer that drives
>>>> memory.reclaim based on the system's runtime state (memory/CPU/IO
>>>> pressure, refault rate, ...) and workload-specific
>>>> policy. That first stage compresses cold anon pages into zswap. Entries
>>>> that then remain in zswap past a policy-defined age threshold are
>>>> considered "twice cold", and the reclaimer wants
>>>> to write them back to the backing swap device at a moment of its own
>>>> choosing, to further reclaim the DRAM still held by the compressed data.
>>>>
>>>> This is the "second-level offloading" pattern described in Meta's TMO
>>>> paper [1]. zswap proactive writeback is what this series introduces to
>>>> address that second-level offloading stage.
>>>>
>>>> [1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf
>>>
>>> Yeah that's what we've been trying to work on as well :) We are
>>> working on a couple of improvements to the mechanism side of this path
>>> (cc Alex) - hopefully it will help your use case too!
>>>
>>> Anyway, back to my original inquiry: I understand your use case. It's
>>> pretty similar to our goal. What I'm not getting is why is
>>> memory.reclaim (which you already use) not sufficient for zswap ->
>>> disk swap offloading too?
>>>
>>> Zswap objects are organized into LRU and exposed to the shrinker
>>> interface. Echo-ing to memory.reclaim should also offload some zswap
>>> entries, correct? Are there still cold zswap entries that escape this,
>>> somehow?
>>>
>>
>> Yes, the memory.reclaim path does drive some zswap writeback, but
>> it is not enough for our case.
>>
>> 1. For a memcg that has reached steady state (a common case being
>> when memory.current is below the policy target), the userspace
>> reclaimer may not invoke memory.reclaim on it for a long time,
>> and so no second-level offloading happens through
>> memory.reclaim. In this state we want
>> memory.zswap.proactive_writeback to write back entries that
>> have sat in zswap past an age threshold, to further reclaim
>> the DRAM still held by the compressed data.
>>
>> 2. Even when memory.reclaim is running, the fraction of zswap
>> residency that ends up reaching the backing swap device is
>> still very small for many of our workloads, and the userspace
>> reclaimer has no way to participate in or control the
>> granularity of zswap writeback. So in our deployment we prefer
>> to leave the zswap shrinker disabled, decouple LRU -> zswap
>> from zswap -> swap, and use a dedicated proactive-writeback
>> interface that lifts the writeback policy into userspace where
>> it can evolve independently of the kernel.
> 
> I see. It's interesting - we've been dealing with the opposite
> problems (reclaiming too much from zswap) that it's refreshing to see
> the other end of the spectrum :) We should invest more into this to
> see why we are not reclaiming enough, but I see the value of adding a
> knob to hit zswap exclusively.
> 
> Regarding age-based reclaim, I agree with Yosry here. Let us try to
> land an interface to do targeted reclaim on compressed memory first. I
> do see the value of age information: with it, you can track zswap
> entries ages and the distribution of refault ages, and only reclaim
> the tail. However, I wonder if you can just build a system that adapts
> the reclaim request size based on PSI, refault rate, etc., similar to
> how you're adjusting memory.reclaim on uncompressed memory with a
> senpai-like system. Something along the lines of: if we are swapping
> in too much from disk (or if IO pressure is high), back off, and if
> not, steal a bit more from the zswap pool (perhaps with a bigger step
> size), etc. Is there a reason why zswap cannot adopt a similar
> strategy?

I'm not sure, as we haven't tested tuning proactive zswap writeback
without age. As you pointed out, age provides a deterministic target
that lets the userspace reclaimer converge faster in a closed loop,
which helps avoid performance jitter.

That said, using age as a zswap writeback parameter indeed warrants 
further independent discussion. So I'll remove the age-related parts in v2.
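
For reference, the age-free feedback loop suggested above could be sketched
roughly as follows. This is a userspace illustration under assumed inputs, not
a tested policy: struct wb_tuner, wb_next_step(), the thresholds, and the
"halve on pressure, grow by a quarter otherwise" rule are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tuner state for the size of each zswap writeback request. */
struct wb_tuner {
	uint64_t step_bytes;	/* size of the next writeback request */
	uint64_t min_step;	/* lower clamp */
	uint64_t max_step;	/* upper clamp */
	double io_psi_limit;	/* IO pressure (percent) treated as too high */
	double swapin_limit;	/* swap-ins per second treated as too high */
};

/*
 * Pick the next request size: back off when IO pressure or the swap-in rate
 * is above its limit, otherwise probe for a bit more writeback.
 */
static uint64_t wb_next_step(struct wb_tuner *t, double io_psi,
			     double swapin_rate)
{
	assert(t->min_step > 0 && t->min_step <= t->max_step);

	if (io_psi > t->io_psi_limit || swapin_rate > t->swapin_limit)
		t->step_bytes /= 2;			/* back off */
	else
		t->step_bytes += t->step_bytes / 4;	/* steal a bit more */

	if (t->step_bytes < t->min_step)
		t->step_bytes = t->min_step;
	if (t->step_bytes > t->max_step)
		t->step_bytes = t->max_step;

	return t->step_bytes;
}
```

A reclaimer would call wb_next_step() once per control interval and write the
returned size to whichever writeback interface is agreed on. Whether this
converges as quickly as an age-based target is exactly what remains untested.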

Thanks,
Hao

* Re: [PATCH 3/3] mm/zswap: Add per-memcg stat for proactive writeback
From: Hao Jia @ 2026-05-14  8:21 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <CAKEwX=OigngmcNo1OU-apCFG2hebt5yZwXQxZQHqgC7SwH_HAQ@mail.gmail.com>



On 2026/5/14 05:21, Nhat Pham wrote:
> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>
>> From: Hao Jia <jiahao1@lixiang.com>
>>
>> Currently, zswap writeback can be triggered by either the pool limit
>> being hit or by the proactive writeback mechanism. However, the
>> existing 'zswpwb' metric in memory.stat and /proc/vmstat counts all
>> written back pages, making it difficult to distinguish between pages
>> written back due to the pool limit and those written back proactively.
>>
>> Add a new statistic 'zswpwb_proactive' to memory.stat and /proc/vmstat.
>> This counter tracks the number of pages written back due to proactive
>> writeback. This allows users to better monitor and tune the proactive
>> writeback mechanism.
>>
>> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
>> ---
>>   Documentation/admin-guide/cgroup-v2.rst |  4 ++++
>>   include/linux/vm_event_item.h           |  1 +
>>   mm/memcontrol.c                         |  1 +
>>   mm/vmstat.c                             |  1 +
>>   mm/zswap.c                              | 11 +++++++++--
>>   5 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 05b664b3b3e8..29a189b18efc 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -1734,6 +1734,10 @@ The following nested keys are defined.
>>            zswpwb
>>                  Number of pages written from zswap to swap.
>>
>> +         zswpwb_proactive
>> +               Number of pages written from zswap to swap by proactive
>> +               writeback. This is a subset of zswpwb.
>> +
>>            zswap_incomp
>>                  Number of incompressible pages currently stored in zswap
>>                  without compression. These pages could not be compressed to
> 
> nit: once we have reached consensus on an interface, can you add
> documentation for the new knob in cgroup v2 doc and zswap doc too, and
> how it interacts with the other interface (memory.zswap.writeback,
> shrinker_enabled sysfs knob).
> 
> A kselftest would be very much appreciated too :)

Thanks, will do in v2

Thanks,
Hao
