Linux Documentation

* Re: [PATCH 3/3] mm/zswap: Add per-memcg stat for proactive writeback
From: Nhat Pham @ 2026-05-13 21:21 UTC
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <20260511105149.75584-4-jiahao.kernel@gmail.com>

On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
> From: Hao Jia <jiahao1@lixiang.com>
>
> Currently, zswap writeback can be triggered by either the pool limit
> being hit or by the proactive writeback mechanism. However, the
> existing 'zswpwb' metric in memory.stat and /proc/vmstat counts all
> written back pages, making it difficult to distinguish between pages
> written back due to the pool limit and those written back proactively.
>
> Add a new statistic 'zswpwb_proactive' to memory.stat and /proc/vmstat.
> This counter tracks the number of pages written back due to proactive
> writeback. This allows users to better monitor and tune the proactive
> writeback mechanism.
>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  4 ++++
>  include/linux/vm_event_item.h           |  1 +
>  mm/memcontrol.c                         |  1 +
>  mm/vmstat.c                             |  1 +
>  mm/zswap.c                              | 11 +++++++++--
>  5 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 05b664b3b3e8..29a189b18efc 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1734,6 +1734,10 @@ The following nested keys are defined.
>           zswpwb
>                 Number of pages written from zswap to swap.
>
> +         zswpwb_proactive
> +               Number of pages written from zswap to swap by proactive
> +               writeback. This is a subset of zswpwb.
> +
>           zswap_incomp
>                 Number of incompressible pages currently stored in zswap
>                 without compression. These pages could not be compressed to

nit: once we have reached consensus on an interface, can you add
documentation for the new knob to the cgroup v2 doc and the zswap doc
too, including how it interacts with the other interfaces
(memory.zswap.writeback, the shrinker_enabled sysfs knob)?

A kselftest would be very much appreciated too :)
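
As a rough illustration of what such a test (or a monitoring script)
might check: the quoted documentation describes zswpwb_proactive as a
subset of zswpwb, so the pages written back by the other paths can be
derived by subtraction. A minimal userspace sketch in C, assuming
memory.stat keeps its usual "key value" line format; the helper names
stat_lookup() and zswpwb_reactive() are made up for this example and
are not part of the patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "key value" line in a memory.stat-style buffer.
 * Returns 0 and stores the value on success, -1 if the key is absent.
 */
static int stat_lookup(const char *buf, const char *key,
		       unsigned long long *val)
{
	size_t klen = strlen(key);
	const char *line = buf;

	assert(buf && key && val);

	while (line && *line) {
		/* Match the whole key, so "zswpwb" does not hit "zswpwb_proactive". */
		if (!strncmp(line, key, klen) && line[klen] == ' ')
			return sscanf(line + klen, "%llu", val) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Pages written back by the non-proactive paths.
 * Returns -1 if a key is missing or the subset invariant is violated.
 */
static long long zswpwb_reactive(const char *buf)
{
	unsigned long long total, proactive;

	if (stat_lookup(buf, "zswpwb", &total) ||
	    stat_lookup(buf, "zswpwb_proactive", &proactive))
		return -1;
	if (proactive > total)
		return -1;
	return (long long)(total - proactive);
}
```

Given a buffer containing "zswpwb 25" and "zswpwb_proactive 7" lines,
zswpwb_reactive() returns 18; a buffer where the proactive count
exceeds the total is reported as an error.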


* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
From: Nhat Pham @ 2026-05-13 21:09 UTC
  To: Hao Jia
  Cc: Yosry Ahmed, akpm, tj, hannes, shakeel.butt, mhocko, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia, Alexandre Ghiti
In-Reply-To: <6fc7fdf0-368c-5129-038e-623f9db2aa88@gmail.com>

On Wed, May 13, 2026 at 1:04 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
>
>
> On 2026/5/12 23:47, Nhat Pham wrote:
> > On Tue, May 12, 2026 at 2:32 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
> >>
> >>
> >>
> >> On 2026/5/12 03:57, Yosry Ahmed wrote:
> >>> On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
> >>>>>
> >>>>> From: Hao Jia <jiahao1@lixiang.com>
> >>>>>
> >>>>> Zswap currently writes back pages to backing swap devices reactively,
> >>>>> triggered either by memory pressure via the shrinker or by the pool
> >>>>> reaching its size limit. This reactive approach offers no precise
> >>>>> control over when writeback happens, which can disturb latency-sensitive
> >>>>> workloads, and it cannot direct writeback at a specific memory cgroup.
> >>>>> However, there are scenarios where users might want to proactively
> >>>>> write back cold pages from zswap to the backing swap device, for
> >>>>> example, to free up memory for other applications or to prepare for
> >>>>> upcoming memory-intensive workloads.
> >>>>>
> >>>>> Therefore, implement a proactive writeback mechanism for zswap by
> >>>>> adding a new cgroup interface file memory.zswap.proactive_writeback
> >>>>> within the memory controller.
> >>>>
> >>
> >> Thanks Nhat, Yosry — let me address both comments together.
> >>
> >>>>
> >>>> We already have memory.reclaim, no? Would that not work to create
> >>>> headroom generally for your use case? Is there a reason why we are
> >>>> treating zswap memory as special here?
> >>>
> >>
> >> Apologies for the lack of detailed explanation in the patch description,
> >> which led to the confusion.
> >>
> >> While we are already utilizing memory.reclaim, it does not fully address
> >> our requirements.
> >>
> >> Our deployment runs a userspace proactive reclaimer that drives
> >> memory.reclaim based on the system's runtime state (memory/CPU/IO
> >> pressure, refault rate, ...) and workload-specific
> >> policy. That first stage compresses cold anon pages into zswap. Entries
> >> that then remain in zswap past a policy-defined age threshold are
> >> considered "twice cold", and the reclaimer wants
> >> to write them back to the backing swap device at a moment of its own
> >> choosing, to further reclaim the DRAM still held by the compressed data.
> >>
> >> This is the "second-level offloading" pattern described in Meta's TMO
> >> paper [1]. zswap proactive writeback is what this series introduces to
> >> address that second-level offloading stage.
> >>
> >> [1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf
> >
> > Yeah that's what we've been trying to work on as well :) We are
> > working on a couple of improvements to the mechanism side of this path
> > (cc Alex) - hopefully it will help your use case too!
> >
> > Anyway, back to my original inquiry: I understand your use case. It's
> > pretty similar to our goal. What I'm not getting is why is
> > memory.reclaim (which you already use) not sufficient for zswap ->
> > disk swap offloading too?
> >
> > Zswap objects are organized into LRU and exposed to the shrinker
> > interface. Echo-ing to memory.reclaim should also offload some zswap
> > entries, correct? Are there still cold zswap entries that escape this,
> > somehow?
> >
>
> Yes, the memory.reclaim path does drive some zswap writeback, but
> it is not enough for our case.
>
> 1. For a memcg that has reached steady state (a common case being
> when memory.current is below the policy target), the userspace
> reclaimer may not invoke memory.reclaim on it for a long time,
> and so no second-level offloading happens through
> memory.reclaim. In this state we want
> memory.zswap.proactive_writeback to write back entries that
> have sat in zswap past an age threshold, to further reclaim
> the DRAM still held by the compressed data.
>
> 2. Even when memory.reclaim is running, the fraction of zswap
> residency that ends up reaching the backing swap device is
> still very small for many of our workloads, and the userspace
> reclaimer has no way to participate in or control the
> granularity of zswap writeback. So in our deployment we prefer
> to leave the zswap shrinker disabled, decouple LRU -> zswap
> from zswap -> swap, and use a dedicated proactive-writeback
> interface that lifts the writeback policy into userspace where
> it can evolve independently of the kernel.

I see. It's interesting - we've been dealing with the opposite
problem (reclaiming too much from zswap), so it's refreshing to see
the other end of the spectrum :) We should invest more into this to
see why we are not reclaiming enough, but I see the value of adding a
knob to hit zswap exclusively.

Regarding age-based reclaim, I agree with Yosry here. Let us try to
land an interface to do targeted reclaim on compressed memory first. I
do see the value of age information: with it, you can track zswap
entry ages and the distribution of refault ages, and only reclaim
the tail. However, I wonder if you can just build a system that adapts
the reclaim request size based on PSI, refault rate, etc., similar to
how you're adjusting memory.reclaim on uncompressed memory with a
senpai-like system. Something along the lines of: if we are swapping
in too much from disk (or if IO pressure is high), back off, and if
not, steal a bit more from the zswap pool (perhaps with a bigger step
size), etc. Is there a reason why zswap cannot adopt a similar
strategy?
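
The back-off / step-up loop described above could look roughly like
the following. This is only a sketch of the idea, not an existing
tool: struct wb_policy, next_wb_step(), the thresholds, and the
"halve on pressure, grow by 25% otherwise" factors are all assumptions
picked for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* All thresholds and limits here are illustrative assumptions. */
struct wb_policy {
	double max_io_pressure;	/* e.g. a PSI io average, in percent */
	double max_swapin_rate;	/* swap-ins from disk per second */
	size_t min_step;	/* smallest writeback request, in bytes */
	size_t max_step;	/* largest writeback request, in bytes */
};

/*
 * Return the size of the next writeback request, given the previous
 * request size and the signals observed since it was issued.
 */
static size_t next_wb_step(const struct wb_policy *p, size_t prev_step,
			   double io_pressure, double swapin_rate)
{
	size_t step;

	assert(p && p->min_step <= p->max_step);

	if (io_pressure > p->max_io_pressure ||
	    swapin_rate > p->max_swapin_rate)
		step = prev_step / 2;			/* back off */
	else
		step = prev_step + prev_step / 4;	/* steal a bit more */

	/* Clamp so the request never collapses to zero or runs away. */
	if (step < p->min_step)
		step = p->min_step;
	if (step > p->max_step)
		step = p->max_step;
	return step;
}
```

A userspace agent would call this once per sampling interval and
write the resulting size to whichever writeback interface ends up
being merged.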


* Re: [PATCH v3 2/3] Documentation: security-bugs: explain what is and is not a security bug
From: Jonathan Corbet @ 2026-05-13 21:04 UTC
  To: Willy Tarreau
  Cc: Greg KH, Leon Romanovsky, skhan, security, workflows, linux-doc,
	linux-kernel
In-Reply-To: <agR1-2Sj1KO9oM2k@1wt.eu>

Willy Tarreau <w@1wt.eu> writes:

> On Wed, May 13, 2026 at 06:52:00AM -0600, Jonathan Corbet wrote:

>> I definitely wouldn't argue for making it longer, and enumerating all of
>> the make-me-root capabilities would be silly.  I would consider just
>> replacing CAP_SYS_ADMIN with "elevated capabilities" or some such.  That
>> might rule out legitimate reports where some capability provides an
>> access it shouldn't, but I suspect you could live with that :)
>
> I think it could indeed work like this, without denaturing the rest
> of the paragraph, while giving broader coverage. Do you think you could
> amend/update it? I'm not trying to add any burden on you, it's just that
> it will take me more time before I provide an update :-/

How's the following?

(While I was there, I noticed that threat-model.rst has no SPDX line;
what's your preference there?)

Thanks,

jon

From 1e15a25142583e312dcc504b0279d47508cbfdab Mon Sep 17 00:00:00 2001
From: Jonathan Corbet <corbet@lwn.net>
Date: Wed, 13 May 2026 14:58:53 -0600
Subject: [PATCH 2/2] docs: threat-model: don't limit root capabilities to
 CAP_SYS_ADMIN

The threat-model document says that only users with CAP_SYS_ADMIN can carry
out a number of admin-level tasks, but there are numerous capabilities that
can confer that sort of power.  Generalize the text slightly to make it
clear that CAP_SYS_ADMIN is not the only all-powerful capability.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
 Documentation/process/threat-model.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/process/threat-model.rst b/Documentation/process/threat-model.rst
index 91da52f7114fd..f177b8d3c1caf 100644
--- a/Documentation/process/threat-model.rst
+++ b/Documentation/process/threat-model.rst
@@ -62,7 +62,8 @@ on common processors featuring privilege levels and memory management units:
 
 * **Capability-based protection**:
 
-  * users not having the ``CAP_SYS_ADMIN`` capability may not alter the
+  * users not having elevated capabilities (including but not limited to
+    CAP_SYS_ADMIN) may not alter the
     kernel's configuration, memory nor state, change other users' view of the
     file system layout, grant any user capabilities they do not have, nor
     affect the system's availability (shutdown, reboot, panic, hang, or making
-- 
2.53.0



* Re: [PATCH v3 3/3] Documentation: security-bugs: clarify requirements for AI-assisted reports
From: Jonathan Corbet @ 2026-05-13 21:02 UTC
  To: Willy Tarreau, Greg KH
  Cc: Leon Romanovsky, skhan, security, workflows, linux-doc,
	linux-kernel
In-Reply-To: <87a4u3mpxk.fsf@trenco.lwn.net>

Jonathan Corbet <corbet@lwn.net> writes:

> Willy Tarreau <w@1wt.eu> writes:
>
>> On Wed, May 13, 2026 at 12:30:10PM +0200, Greg KH wrote:
>>> > One nit:
>>> > 
>>> > > +  * **Impact Evaluation**: Many AI-generated reports lack an understanding of
>>> > > +    the kernel's threat model and go to great lengths inventing theoretical
>>> > > +    consequences.
>>> > 
>>> > If only we had a shiny new document describing that threat model that we
>>> > could reference here... :)
>>> 
>>> Ah yes, a link to that would make things better, but don't we have that
>>> elsewhere in this series?
>>
>> It's in the same patch, I think Jon was sarcastic here. I thought I had
>> addressed that one but apparently I was wrong :-/
>
> I'm just saying that this particular text should link to that document,
> don't make readers go searching for it.  I can certainly add a patch
> doing that if you like.

I was thinking something like this.

jon

From 3f02a3c190bab6b54e2a250ead0c7408af1a3c51 Mon Sep 17 00:00:00 2001
From: Jonathan Corbet <corbet@lwn.net>
Date: Wed, 13 May 2026 14:51:29 -0600
Subject: [PATCH 1/2] docs: security-bugs: add a link to the threat-model
 documentation

Rather than make readers search for this document, just add a link to it
where it is referenced.

(While I was at it, I removed the unused and unneeded _threatmodel label
from the top of threat-model.rst).

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
 Documentation/process/security-bugs.rst | 13 +++++++------
 Documentation/process/threat-model.rst  |  2 --
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/Documentation/process/security-bugs.rst b/Documentation/process/security-bugs.rst
index f85c65f31f12f..3c51ddde31dd9 100644
--- a/Documentation/process/security-bugs.rst
+++ b/Documentation/process/security-bugs.rst
@@ -191,12 +191,13 @@ handle:
     Please **always convert your report to plain text** without any formatting
     decorations before sending it.
 
-  * **Impact Evaluation**: Many AI-generated reports lack an understanding of
-    the kernel's threat model and go to great lengths inventing theoretical
-    consequences. This adds noise and complicates triage. Please stick to
-    verifiable facts (e.g., "this bug permits any user to gain CAP_NET_ADMIN")
-    without enumerating speculative implications. Have your tool read this
-    documentation as part of the evaluation process.
+  * **Impact Evaluation**: Many AI-generated reports lack an understanding
+    of the kernel's threat model (see Documentation/process/threat-model.rst)
+    and go to great lengths inventing theoretical consequences. This adds
+    noise and complicates triage. Please stick to verifiable facts (e.g.,
+    "this bug permits any user to gain CAP_NET_ADMIN") without enumerating
+    speculative implications. Have your tool read this documentation as
+    part of the evaluation process.
 
   * **Reproducer**: AI-based tools are often capable of generating reproducers.
     Please always ensure your tool provides one and **test it thoroughly**. If
diff --git a/Documentation/process/threat-model.rst b/Documentation/process/threat-model.rst
index ecb432390e792..91da52f7114fd 100644
--- a/Documentation/process/threat-model.rst
+++ b/Documentation/process/threat-model.rst
@@ -1,5 +1,3 @@
-.. _threatmodel:
-
 The Linux Kernel threat model
 =============================
 
-- 
2.53.0



* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
From: Nhat Pham @ 2026-05-13 20:53 UTC
  To: Yosry Ahmed
  Cc: Hao Jia, akpm, tj, hannes, shakeel.butt, mhocko, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia, Alexandre Ghiti
In-Reply-To: <CAO9r8zPvgB-MG2ufmdn4HoS+QEPBAehU9u7fQmYs+47NF-C9aw@mail.gmail.com>

On Wed, May 13, 2026 at 11:55 AM Yosry Ahmed <yosry@kernel.org> wrote:
>
> > > Zswap objects are organized into LRU and exposed to the shrinker
> > > interface. Echo-ing to memory.reclaim should also offload some zswap
> > > entries, correct? Are there still cold zswap entries that escape this,
> > > somehow?
> > >
> >
> > Yes, the memory.reclaim path does drive some zswap writeback, but
> > it is not enough for our case.
> >
> > 1. For a memcg that has reached steady state (a common case being
> > when memory.current is below the policy target), the userspace
> > reclaimer may not invoke memory.reclaim on it for a long time,
> > and so no second-level offloading happens through
> > memory.reclaim. In this state we want
> > memory.zswap.proactive_writeback to write back entries that
> > have sat in zswap past an age threshold, to further reclaim
> > the DRAM still held by the compressed data.
> >
> > 2. Even when memory.reclaim is running, the fraction of zswap
> > residency that ends up reaching the backing swap device is
> > still very small for many of our workloads, and the userspace
> > reclaimer has no way to participate in or control the
> > granularity of zswap writeback. So in our deployment we prefer
> > to leave the zswap shrinker disabled, decouple LRU -> zswap
> > from zswap -> swap, and use a dedicated proactive-writeback
> > interface that lifts the writeback policy into userspace where
> > it can evolve independently of the kernel.
>
> To be honest I see the point of proactively reclaiming compressed
> memory in zswap. If you use memory.reclaim, you are also reclaiming
> hotter memory in the process, and you are not necessarily getting as
> much writeback as you want. The memory in zswap is a more conservative
> choice for proactive reclaim because it's memory that's guaranteed to
> be cold(ish) and not being accessed.
>
> That being said, the interface is not great any way you cut it :/
>
> I don't like the 'memory.zswap.proactive_writeback' name, maybe we can
> stay consistent by doing 'memory.zswap.reclaim', but that just as
> easily reads as "reclaim using zswap". Maybe
> 'memory.zswap.do_writeback' or something, idk.
>
> I also don't like having two proactive reclaim interfaces, so a voice
> in my head wants to tie this into 'memory.reclaim' somehow, but that
> includes adding a pretty specific argument (e.g. 'memory.reclaim
> zswap_writeback_only=1').
>
> I don't like any of these options, and we also need to consider what
> the memcg maintainers think. I see the use case of proactive writeback
> but I am struggling to come up with a clean interface.
>
> I also think we should take the 'age' aspect out of the conversation
> for now, it can be a separate discussion. Well, unless we decide to
> tie it to memory.reclaim. If memory.reclaim broadly supports age-based
> reclaim then zswap writeback can be a natural part of that without
> requiring a specific interface.

Yeah perhaps extending memory.reclaim is best... Sort of analogous to
the way we have swappiness to balance file vs. anon...


* Re: [PATCH v2] cgroup/dmem: introduce a peak file
From: Tejun Heo @ 2026-05-13 20:44 UTC
  To: Thadeu Lima de Souza Cascardo
  Cc: Johannes Weiner, Michal Koutný, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Maarten Lankhorst, Maxime Ripard, Natalie Vock,
	Tvrtko Ursulin, cgroups, linux-kernel, linux-mm, linux-doc,
	dri-devel, kernel-dev
In-Reply-To: <20260513-dmem_peak-v2-1-dac06999db9e@igalia.com>

Hello,

The patch looks fine to me, but please flesh out the motivation in the
commit description - what's the use case, why do we want this?

Thanks.

--
tejun


* Re: [PATCH v4 1/1] Documentation: real-time: Add kernel configuration guide
From: Ahmed S. Darwish @ 2026-05-13 20:42 UTC
  To: Jonathan Corbet, Clark Williams, Steven Rostedt, linux-rt-devel
  Cc: Matthew Wilcox, Sebastian Andrzej Siewior, John Ogness,
	Derek Barbosa, linux-doc, linux-kernel
In-Reply-To: <ad6DcliisiRxw5RN@lx-t490>

On Tue, 14 Apr 2026, Ahmed S. Darwish wrote:
>
> Add a configuration guide for real-time kernels.
>

Kind reminder.


* Re: improve the swap_activate interface
From: Steve French @ 2026-05-13 20:34 UTC
  To: Christoph Hellwig
  Cc: Andrew Morton, Chris Li, Kairui Song, Christian Brauner,
	Darrick J . Wong, Jens Axboe, David Sterba, Theodore Ts'o,
	Jaegeuk Kim, Chao Yu, Trond Myklebust, Anna Schumaker,
	Namjae Jeon, Hyunchul Lee, Steve French, Paulo Alcantara,
	Carlos Maiolino, Damien Le Moal, Naohiro Aota, linux-xfs,
	linux-fsdevel, linux-doc, linux-mm, linux-block, linux-btrfs,
	linux-ext4, linux-f2fs-devel, linux-nfs, linux-cifs
In-Reply-To: <20260512053625.2950900-1-hch@lst.de>

I just tried this on 7.1-rc3 with the swap patches (full kernel build,
on Ubuntu 25.10) and boot failed with an out-of-memory error, which I had
never seen before.  Any idea how to work around this with the swap patch
series, or is there a fix for this in the swap series already?

On Tue, May 12, 2026 at 12:41 AM Christoph Hellwig <hch@lst.de> wrote:
>
> Hi all,
>
> Darrick recently posted iomap support for fuse-iomap, which was trivial
> but a bit ugly, which triggered me into looking at how this could be done
> in a cleaner way.  The result of that is this fairly big series that
> reworks how the MM code calls into the file system to activate swap
> files to make it much cleaner and easier to use.
>
> I've tested this with swap devices manually, and using the swap tests
> in xfstests on btrfs, ext3, ext4, f2fs and xfs to exercise the different
> implementations.  Out of those all passed, but f2fs actually notruns all
> tests even in the baseline as it requires special preparation for
> swapfiles which never got wired up in xfstests.
>
> Diffstat:
>  Documentation/filesystems/iomap/operations.rst |    3
>  Documentation/filesystems/locking.rst          |   35 +--
>  Documentation/filesystems/vfs.rst              |   40 ++--
>  block/fops.c                                   |   15 +
>  fs/btrfs/btrfs_inode.h                         |    3
>  fs/btrfs/file.c                                |    4
>  fs/btrfs/inode.c                               |   72 -------
>  fs/ext4/file.c                                 |    6
>  fs/ext4/inode.c                                |   11 -
>  fs/f2fs/data.c                                 |   50 -----
>  fs/f2fs/f2fs.h                                 |    2
>  fs/f2fs/file.c                                 |    4
>  fs/iomap/swapfile.c                            |  165 +++---------------
>  fs/nfs/direct.c                                |    1
>  fs/nfs/file.c                                  |   21 --
>  fs/nfs/nfs4file.c                              |    3
>  fs/ntfs/aops.c                                 |    8
>  fs/ntfs/file.c                                 |    6
>  fs/smb/client/cifsfs.c                         |   18 +
>  fs/smb/client/cifsfs.h                         |    3
>  fs/smb/client/file.c                           |   16 -
>  fs/xfs/xfs_aops.c                              |   48 -----
>  fs/xfs/xfs_file.c                              |   39 ++++
>  fs/zonefs/file.c                               |   30 +--
>  include/linux/fs.h                             |   11 -
>  include/linux/iomap.h                          |    5
>  include/linux/nfs_fs.h                         |    3
>  include/linux/swap.h                           |  129 +-------------
>  mm/page_io.c                                   |   45 ----
>  mm/swap.h                                      |   92 ++++++++++
>  mm/swapfile.c                                  |  227 ++++++++++++++-----------
>  31 files changed, 471 insertions(+), 644 deletions(-)
>


-- 
Thanks,

Steve


* Re: [PATCH v7 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13 20:10 UTC
  To: Breno Leitao, Miaohe Lin, Andrew Morton, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-1-be2e578e61da@debian.org>

On 5/13/26 17:39, Breno Leitao wrote:
> The first entry of error_states[],
> 
> 	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
> 
> is unreachable.  identify_page_state() has two callers, and neither
> one can dispatch a PG_reserved page to me_kernel():
> 
>   * memory_failure() reaches identify_page_state() only after
>     get_hwpoison_page() returned 1.  get_any_page() reaches that
>     return only via __get_hwpoison_page(), which gates the refcount
>     on HWPoisonHandlable().  HWPoisonHandlable() rejects PG_reserved
>     pages, so they fail with -EBUSY/-EIO long before
>     identify_page_state() runs.

You should clarify why they are rejected. There is no explicit check for
PG_reserved in there!

> 
>   * try_memory_failure_hugetlb() reaches identify_page_state() on
>     the MF_HUGETLB_IN_USED branch, but the page is necessarily a
>     hugetlb folio there.  The first table entry that matches a
>     hugetlb folio is { head, head, MF_MSG_HUGE, me_huge_page }, so
>     they dispatch to me_huge_page() before the (now-removed)
>     reserved entry would have matched, regardless of whether
>     PG_reserved happens to be set on the head page.

See hugetlb_folio_init_vmemmap(): we always clear PG_reserved for hugetlb folios
allocated from memblock.
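
To illustrate why the flag state matters here: assuming the lookup is
a first-match scan over (mask, res) pairs, as the quoted table layout
suggests, the order of the entries decides which handler runs. With
the reserved entry listed first, as in the pre-patch table in the diff
below, a flags word with both bits set would hit that first entry, so
the argument depends on PG_reserved being clear on hugetlb folios. A
minimal userspace sketch; the flag bits, enum values and identify()
are hypothetical stand-ins, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's PG_* values differ. */
#define F_HEAD		(1UL << 0)
#define F_RESERVED	(1UL << 1)
#define F_LRU		(1UL << 2)

enum handler { H_RESERVED, H_HUGE, H_LRU, H_UNKNOWN };

struct state {
	unsigned long mask;	/* which bits to look at */
	unsigned long res;	/* required value of those bits */
	enum handler action;	/* stands in for the handler function */
};

/* Order matters: the first matching entry wins. */
static const struct state states[] = {
	{ F_RESERVED,	F_RESERVED,	H_RESERVED },
	{ F_HEAD,	F_HEAD,		H_HUGE },
	{ F_LRU,	F_LRU,		H_LRU },
	{ 0,		0,		H_UNKNOWN },	/* catch-all */
};

static enum handler identify(unsigned long flags)
{
	size_t i;

	for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
		if ((flags & states[i].mask) == states[i].res)
			return states[i].action;

	assert(!"unreachable: the catch-all entry matches everything");
	return H_UNKNOWN;
}
```

Here identify(F_HEAD) selects H_HUGE, while identify(F_HEAD |
F_RESERVED) selects H_RESERVED because that entry comes first.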

> 
> me_kernel() never executes and the entry exists only to be matched
> against by code that cannot see it.
> 
> Drop the entry, the me_kernel() helper, and the now-unused
> "reserved" macro.  Leave the MF_MSG_KERNEL enum value in place: it
> remains part of the tracepoint and pr_err() string tables, and
> follow-on work to classify unrecoverable kernel pages can reuse it
> without churning the user-visible enum.
> 
> No functional change.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/memory-failure.c | 14 --------------
>  1 file changed, 14 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 866c4428ac7ef..49bcfbd04d213 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -992,17 +992,6 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p,
>  	return false;
>  }
>  
> -/*
> - * Error hit kernel page.
> - * Do nothing, try to be lucky and not touch this instead. For a few cases we
> - * could be more sophisticated.
> - */
> -static int me_kernel(struct page_state *ps, struct page *p)
> -{
> -	unlock_page(p);
> -	return MF_IGNORED;
> -}
> -
>  /*
>   * Page in unknown state. Do nothing.
>   * This is a catch-all in case we fail to make sense of the page state.
> @@ -1211,10 +1200,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>  #define mlock		(1UL << PG_mlocked)
>  #define lru		(1UL << PG_lru)
>  #define head		(1UL << PG_head)
> -#define reserved	(1UL << PG_reserved)
>  
>  static struct page_state error_states[] = {
> -	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
>  	/*
>  	 * free pages are specially detected outside this table:
>  	 * PG_buddy pages only make a small fraction of all free pages.
> @@ -1246,7 +1233,6 @@ static struct page_state error_states[] = {
>  #undef mlock
>  #undef lru
>  #undef head
> -#undef reserved
>  
>  static void update_per_node_mf_stats(unsigned long pfn,
>  				     enum mf_result result)
> 

Yes, I think this should work.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


* [PATCH] docs: sphinx-static: fix typo "wich" -> "which"
From: Clinton Phillips @ 2026-05-13 19:59 UTC
  To: corbet; +Cc: Clinton Phillips, linux-doc, linux-kernel

Trivial typo fix in a CSS comment for the documentation theme.

Signed-off-by: Clinton Phillips <clintdotphillips@gmail.com>
---
 Documentation/sphinx-static/custom.css | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/sphinx-static/custom.css b/Documentation/sphinx-static/custom.css
index f91393426..5aa0a1ed9 100644
--- a/Documentation/sphinx-static/custom.css
+++ b/Documentation/sphinx-static/custom.css
@@ -30,7 +30,7 @@ img.logo {
     margin-bottom: 20px;
 }
 
-/* The default is to use -1em, wich makes it override text */
+/* The default is to use -1em, which makes it override text */
 li { text-indent: 0em; }
 
 /*
-- 
2.49.0



* Re: [PATCH net-next 1/2] net: ti: icssg: Derive stats array lengths from ARRAY_SIZE
From: Jacob Keller @ 2026-05-13 20:00 UTC
  To: MD Danish Anwar, David CARLIER
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, Roger Quadros,
	Andrew Lunn, Meghana Malladi, Kevin Hao, Vadim Fedorenko, netdev,
	linux-doc, linux-kernel, linux-arm-kernel, Vignesh Raghavendra
In-Reply-To: <6a1f411c-d7ed-463b-abf1-277d8cc0c184@ti.com>

On 5/12/2026 2:40 AM, MD Danish Anwar wrote:
> Hi David,
> 
> On 12/05/26 1:28 pm, David CARLIER wrote:
>> Hi MD,
>>
>> On Tue, 12 May 2026 at 07:06, MD Danish Anwar <danishanwar@ti.com> wrote:
>>>
>>> Replace the manually maintained ICSSG_NUM_MIIG_STATS and
>>> ICSSG_NUM_PA_STATS constants with ARRAY_SIZE() expressions derived
>>> directly from the corresponding stat descriptor arrays, so that adding
>>> new entries to icssg_all_miig_stats[] or icssg_all_pa_stats[] no longer
>>> requires a separate update to a numeric constant.
>>>
>>> To make this self-contained, break the circular include dependency
>>> between icssg_stats.h and icssg_prueth.h:
>>>
>>>   - icssg_stats.h previously included icssg_prueth.h (transitively
>>>     pulling in icssg_switch_map.h and ETH_GSTRING_LEN).  Replace that
>>>     with direct includes of <linux/ethtool.h>, <linux/kernel.h> and
>>>     "icssg_switch_map.h".
>>>
>>>   - icssg_prueth.h now includes icssg_stats.h, giving it access to
>>>     the ARRAY_SIZE-based ICSSG_NUM_MIIG_STATS and ICSSG_NUM_PA_STATS
>>>     before they are used in the prueth_emac struct and ICSSG_NUM_STATS.
>>>
>>> Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
>>> ---
>>>  drivers/net/ethernet/ti/icssg/icssg_prueth.h | 3 +--
>>>  drivers/net/ethernet/ti/icssg/icssg_stats.h  | 7 ++++++-
>>>  2 files changed, 7 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>>> index df93d15c5b78..e2ccecb0a0dd 100644
>>> --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>>> +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>>> @@ -43,6 +43,7 @@
>>>
>>>  #include "icssg_config.h"
>>>  #include "icss_iep.h"
>>> +#include "icssg_stats.h"
>>>  #include "icssg_switch_map.h"
>>>
>>>  #define PRUETH_MAX_MTU          (2000 - ETH_HLEN - ETH_FCS_LEN)
>>> @@ -57,8 +58,6 @@
>>>
>>>  #define ICSSG_MAX_RFLOWS       8       /* per slice */
>>>
>>> -#define ICSSG_NUM_PA_STATS     32
>>> -#define ICSSG_NUM_MIIG_STATS   60
>>>  /* Number of ICSSG related stats */
>>>  #define ICSSG_NUM_STATS (ICSSG_NUM_MIIG_STATS + ICSSG_NUM_PA_STATS)
>>>  #define ICSSG_NUM_STANDARD_STATS 31
>>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_stats.h b/drivers/net/ethernet/ti/icssg/icssg_stats.h
>>> index 5ec0b38e0c67..b854eb587c1e 100644
>>> --- a/drivers/net/ethernet/ti/icssg/icssg_stats.h
>>> +++ b/drivers/net/ethernet/ti/icssg/icssg_stats.h
>>> @@ -8,10 +8,15 @@
>>>  #ifndef __NET_TI_ICSSG_STATS_H
>>>  #define __NET_TI_ICSSG_STATS_H
>>>
>>> -#include "icssg_prueth.h"
>>> +#include <linux/ethtool.h>
>>> +#include <linux/kernel.h>
>>> +#include "icssg_switch_map.h"
>>>
>>>  #define STATS_TIME_LIMIT_1G_MS    25000    /* 25 seconds @ 1G */
>>>
>>> +#define ICSSG_NUM_MIIG_STATS   ARRAY_SIZE(icssg_all_miig_stats)
>>> +#define ICSSG_NUM_PA_STATS     ARRAY_SIZE(icssg_all_pa_stats)
>>> +
>>>  struct miig_stats_regs {
>>>         /* Rx */
>>>         u32 rx_packets;
>>> --
>>> 2.34.1
>>>
>>
>> One thing that caught my eye: icssg_all_miig_stats[] and
>>   icssg_all_pa_stats[] are 'static const' arrays in icssg_stats.h with
>>   ETH_GSTRING_LEN name buffers per entry. Right now only icssg_stats.c
>>   and icssg_ethtool.c pull them in. After this patch icssg_prueth.h
>>   includes icssg_stats.h, so every .c in the driver (classifier,
>>   common, config, mii_cfg, queues, switchdev, ...) ends up with its own
>>   static-const copy of both tables.
>>
>>   Would a static_assert() work for what you're after? Something like:
>>
> 
> While adding more stats manually, the ARRAY_SIZE() approach was
> explicitly requested by the maintainer [1]:
> 
> This patch is a direct response to that feedback. static_assert() would
> still require updating the numeric constant on every array change. The
> goal here is to eliminate the need to manually increment the stats count
> whenever new stats are added.
> 
> Your concern about multiple copies of the table is noted and valid. Could
> you advise on the preferred way to reconcile these two requirements? I
> am happy to restructure if there is an approach that satisfies both.
> 
The way we solved this in the Intel drivers is to use a single array
which contains both the stat name as well as the offset from the
structure where the stat resides.

The stat string code just iterates over the stat list for the strings,
while the stat value code iterates the array and computes the stat
address from the offset and size and base structure pointer. Each object
that has stats has its own stat array structure.

This is probably overkill, but the advantage is that the strings and
their values are stored together and adding a new stat is as simple as
adding a new entry to that list.

I.e.

struct ice_stats {
        char stat_string[ETH_GSTRING_LEN];
        int sizeof_stat;
        int stat_offset;
};

#define ICE_STAT(_type, _name, _stat) { \
        .stat_string = _name, \
        .sizeof_stat = sizeof_field(_type, _stat), \
        .stat_offset = offsetof(_type, _stat) \
}

#define ICE_VSI_STAT(_name, _stat) \
                ICE_STAT(struct ice_vsi, _name, _stat)
#define ICE_PF_STAT(_name, _stat) \
                ICE_STAT(struct ice_pf, _name, _stat)


Then the stats for the individual arrays are defined like this:

static const struct ice_stats ice_gstrings_vsi_stats[] = {
        ICE_VSI_STAT(ICE_RX_UNICAST, eth_stats.rx_unicast),
        ICE_VSI_STAT(ICE_TX_UNICAST, eth_stats.tx_unicast),
        ICE_VSI_STAT(ICE_RX_MULTICAST, eth_stats.rx_multicast),
        ICE_VSI_STAT(ICE_TX_MULTICAST, eth_stats.tx_multicast),
        ICE_VSI_STAT(ICE_RX_BROADCAST, eth_stats.rx_broadcast),
        ICE_VSI_STAT(ICE_TX_BROADCAST, eth_stats.tx_broadcast),
	...
};

(Note, ICE_RX_UNICAST is a macro that defines the string value. I don't
recall who changed this to macros, or why, versus just having the strings
directly in the definition.)

This is probably a lot bigger refactor to make work, and may not be
exactly suitable for your driver. I've considered "upgrading" these data
structures and logic as helpers to the core ethtool code (or perhaps
now, to libeth) but never got around to it.
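
A minimal userspace sketch of the value side described above, in case it
helps picture it. Everything prefixed demo_ is hypothetical and not taken
from ice or icssg, GSTRING_LEN merely stands in for ETH_GSTRING_LEN, and
the real drivers differ in detail. It only shows the idea: one table holds
name, size and offset, the string code walks the names, and the value code
computes each address from the base pointer plus the offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GSTRING_LEN 32 /* assumption: stands in for ETH_GSTRING_LEN */

/* Hypothetical per-object stats container. */
struct demo_port_stats {
	uint64_t rx_unicast;
	uint64_t tx_unicast;
	uint32_t rx_errors;
};

/* One table entry: stat name plus where and how wide the value is. */
struct demo_stat {
	char stat_string[GSTRING_LEN];
	int sizeof_stat;
	int stat_offset;
};

#define DEMO_STAT(_name, _field) { \
	.stat_string = _name, \
	.sizeof_stat = sizeof(((struct demo_port_stats *)0)->_field), \
	.stat_offset = offsetof(struct demo_port_stats, _field) \
}

/* Adding a stat means adding exactly one line here. */
static const struct demo_stat demo_stats[] = {
	DEMO_STAT("rx_unicast", rx_unicast),
	DEMO_STAT("tx_unicast", tx_unicast),
	DEMO_STAT("rx_errors", rx_errors),
};

/* The count follows the table, so nothing is incremented by hand. */
#define DEMO_NUM_STATS (sizeof(demo_stats) / sizeof(demo_stats[0]))

/* String side: copy each name into a DEMO_NUM_STATS * GSTRING_LEN buffer. */
void demo_get_strings(char *buf)
{
	for (size_t i = 0; i < DEMO_NUM_STATS; i++)
		memcpy(buf + i * GSTRING_LEN, demo_stats[i].stat_string,
		       GSTRING_LEN);
}

/* Value side: base pointer + offset, widened to 64 bits by stored size. */
void demo_get_values(const struct demo_port_stats *base, uint64_t *data)
{
	for (size_t i = 0; i < DEMO_NUM_STATS; i++) {
		const char *p = (const char *)base + demo_stats[i].stat_offset;

		if (demo_stats[i].sizeof_stat == sizeof(uint64_t)) {
			uint64_t v;

			memcpy(&v, p, sizeof(v));
			data[i] = v;
		} else {
			uint32_t v;

			/* This sketch only knows 32-bit and 64-bit counters. */
			assert(demo_stats[i].sizeof_stat == sizeof(uint32_t));
			memcpy(&v, p, sizeof(v));
			data[i] = v;
		}
	}
}
```

A caller would size both output buffers from DEMO_NUM_STATS, then call
demo_get_strings() for the names and demo_get_values() for the counters.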

^ permalink raw reply

* Re: [PATCH v7 4/6] mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()
From: David Hildenbrand (Arm) @ 2026-05-13 19:49 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin, Andrew Morton, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260513-ecc_panic-v7-4-be2e578e61da@debian.org>

On 5/13/26 17:39, Breno Leitao wrote:
> The previous patch already classifies PG_reserved pages as
> MF_MSG_KERNEL through the long path: get_hwpoison_page() calls
> __get_hwpoison_page() which fails HWPoisonHandlable(), get_any_page()
> exhausts its shake_page() retry budget, and the resulting
> -ENOTRECOVERABLE is mapped to MF_MSG_KERNEL by the switch.  The
> outcome is correct but the work in between is wasted: shake_page()
> cannot turn a reserved page into a handlable one.

If really required, can we just move the check right there, into get_any_page() etc?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v3] docs: reporting-issues: replace "these advices" with "all of this advice"
From: Jonathan Corbet @ 2026-05-13 19:43 UTC (permalink / raw)
  To: Chen-Shi-Hong, linux; +Cc: skhan, linux-doc, linux-kernel, Chen-Shi-Hong
In-Reply-To: <20260513174009.1260-1-eric039eric@gmail.com>

Chen-Shi-Hong <eric039eric@gmail.com> writes:

You are getting closer, a couple of other details...

> "Advice" is an uncountable noun, so "these advices" is grammatically
> incorrect.
>
> Replace it with "all of this advice" instead, which keeps the sentence
> grammatical while also making it clear that it refers to the full set of
> recommendations in the paragraph.
>
> Signed-off-by: Chen-Shi-Hong <eric039eric@gmail.com>
>
> v3:
> - resend against the original base as requested
> - replace "these advices" directly with "all of this advice"
>
> v2:
> - use "all of this advice" based on review feedback

It is good to include the changes with each version, but it should go
below the "---" line so that the maintainer doesn't have to edit it out
at apply time.

> ---
>  Documentation/admin-guide/reporting-issues.rst | 4 ++--
>
>  1 file changed, 2 insertions(+), 2 deletions(-)

Also, please send new versions as a separate thread rather than as a
response to a previous posting.

Thanks,

jon

^ permalink raw reply

* Re: [PATCH] Documentation: intel_pstate: Fix description of asymmetric packing with SMT
From: Rafael J. Wysocki @ 2026-05-13 19:41 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Rafael J. Wysocki, Viresh Kumar, Jonathan Corbet, Shuah Khan,
	Rafael J. Wysocki, Ricardo Neri, linux-pm, linux-doc,
	linux-kernel
In-Reply-To: <20260424-rneri-fix-intel-pstate-doc-smt-asym-packing-v1-1-317bf7d5c362@linux.intel.com>

On Fri, Apr 24, 2026 at 11:42 PM Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> The patchset [1], of which commits 046a5a95c3b0 ("x86/sched/itmt: Give all
> SMT siblings of a core the same priority") and 995998ebdebd ("x86/sched:
> Remove SD_ASYM_PACKING from the SMT domain flags") are part, overhauled how
> the scheduler handles asym_packing on x86 hybrid processors with SMT. It
> removed SD_ASYM_PACKING from the x86 SMT scheduling domain and made all SMT
> siblings of a core share the same priority. As a result, asym_packing
> operates only across physical cores, spreading tasks among them and only
> using idle SMT siblings once all physical cores are busy.
>
> Fix the documentation to reflect this behavior.
>
> Fixes: f20af84c29b2 ("cpufreq: intel_pstate: Document hybrid processor support")
> Link: https://lore.kernel.org/r/20230406203148.19182-1-ricardo.neri-calderon@linux.intel.com [1]
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
>  Documentation/admin-guide/pm/intel_pstate.rst | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
> index fde967b0c2e0..25fe5d88fea6 100644
> --- a/Documentation/admin-guide/pm/intel_pstate.rst
> +++ b/Documentation/admin-guide/pm/intel_pstate.rst
> @@ -355,11 +355,12 @@ HyperThreading (HT) in the context of Intel processors, is enabled on at least
>  one core, ``intel_pstate`` assigns performance-based priorities to CPUs.  Namely,
>  the priority of a given CPU reflects its highest HWP performance level which
>  causes the CPU scheduler to generally prefer more performant CPUs, so the less
> -performant CPUs are used when the other ones are fully loaded.  However, SMT
> -siblings (that is, logical CPUs sharing one physical core) are treated in a
> -special way such that if one of them is in use, the effective priority of the
> -other ones is lowered below the priorities of the CPUs located in the other
> -physical cores.
> +performant CPUs are used when the other ones are fully loaded.  SMT siblings
> +(that is, logical CPUs sharing one physical core) are given the same priority.
> +The scheduler can pull tasks from lower-priority cores and place them on any
> +sibling.  Since the scheduler spreads tasks among physical cores, tasks will be
> +placed on the SMT siblings of physical cores only after all physical cores are
> +busy.
>
>  This approach maximizes performance in the majority of cases, but unfortunately
>  it also leads to excessive energy usage in some important scenarios, like video
>
> ---

Applied (with some edits in the changelog) as 7.1-rc material, thanks!

^ permalink raw reply

* Re: [PATCH 4/8] drm/panthor: Add support for protected memory allocation in panthor
From: Chia-I Wu @ 2026-05-13 19:31 UTC (permalink / raw)
  To: Liviu Dudau
  Cc: Boris Brezillon, Marcin Ślusarz, Ketil Johnsen, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian König, Steven Price, Daniel Almeida, Alice Ryhl,
	Matthias Brugger, AngeloGioacchino Del Regno, dri-devel,
	linux-doc, linux-kernel, linux-media, linaro-mm-sig,
	linux-arm-kernel, linux-mediatek, Florent Tomasin, nd
In-Reply-To: <agNJasayW8VCHTiU@e142607>

On Tue, May 12, 2026 at 8:39 AM Liviu Dudau <liviu.dudau@arm.com> wrote:
>
> On Tue, May 12, 2026 at 04:11:11PM +0200, Boris Brezillon wrote:
> > On Tue, 12 May 2026 14:47:27 +0100
> > Liviu Dudau <liviu.dudau@arm.com> wrote:
> >
> > > On Thu, May 07, 2026 at 01:53:56PM +0200, Boris Brezillon wrote:
> > > > On Thu, 7 May 2026 11:02:26 +0200
> > > > Marcin Ślusarz <marcin.slusarz@arm.com> wrote:
> > > >
> > > > > On Tue, May 05, 2026 at 06:15:23PM +0200, Boris Brezillon wrote:
> > > > > > > @@ -277,9 +286,21 @@ int panthor_device_init(struct panthor_device *ptdev)
> > > > > > >                     return ret;
> > > > > > >     }
> > > > > > >
> > > > > > > +   /* If a protected heap name is specified but not found, defer the probe until created */
> > > > > > > +   if (protected_heap_name && strlen(protected_heap_name)) {
> > > > > >
> > > > > > > Do we really need this strlen() > 0? Won't dma_heap_find() fail if the
> > > > > > name is "" already?
> > > > >
> > > > > > If dma_heap_find() fails, then the whole probe will fail too.
> > > > > This check prevents that.
> > > >
> > > > Yeah, that's also a questionable design choice. I mean, we can
> > > > currently probe and boot the FW even though we never setup the
> > > > protected FW sections, so why should we defer the probe here? Can't we
> > > > just retry the next time a group with the protected bit is created and
> > > > > fail if we can't find a protected heap?
> > >
> > > The problem we have with the current firmware is that it does a number of setup steps at "boot"
> > > time only. One of the steps is preparing its internal structures for when it enters protected
> > > mode and it stores them in the buffer passed in at firmware loading. We cannot later run the
> > > process when we have a group with protected mode set.
> >
> > No, but we can force a full/slow reset and have that thing
> > re-initialized, can't we? I mean, that's basically what we do when a
> > fast reset fails: we re-initialize all the sections and reset again, at
> > which point the FW should start from a fresh state, and be able to
> > properly initialize the protected-related stuff if protected sections
> > are populated. Am I missing something?
>
> Right, we can do that. For some reason I keep associating the reset with the
> error handling and not with "normal" operations.

I kind of hope we end up with either

 - panthor knows the exact heap to use and fails with EPROBE_DEFER if
   the heap is missing, or
 - panthor gets a dma-buf from userspace and does the full reset
   - userspace also needs to provide a dma-buf for each protected
     group for the suspend buffer

rather than something in-between. The latter is more ad-hoc and basically
kicks the issue to userspace.

For the former, expressing the relation in DT seems to be the best,
but only if possible :-). Otherwise, a kconfig option (instead of
module param) should be easier to work with.

Looking at the userspace implementation, can we also have a panthor
ioctl to return the heap to userspace? A dma-heap ioctl to query the
heap size is also lacking.


>
> Best regards,
> Liviu
>
>
> --
> ====================
> | I would like to |
> | fix the world,  |
> | but they're not |
> | giving me the   |
>  \ source code!  /
>   ---------------
>     ¯\_(ツ)_/¯
>

^ permalink raw reply

* Re: [PATCH v2 1/2] dt-bindings: trivial-devices: Add Murata D1U74T PSU
From: Conor Dooley @ 2026-05-13 19:17 UTC (permalink / raw)
  To: Abdurrahman Hussain
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, linux-hwmon, devicetree,
	linux-kernel, linux-doc
In-Reply-To: <20260512-d1u74t-v2-1-431d00fbb1c4@nexthop.ai>

[-- Attachment #1: Type: text/plain, Size: 75 bytes --]

Acked-by: Conor Dooley <conor.dooley@microchip.com>
pw-bot: not-applicable

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v13 3/4] gpio: rpmsg: add generic rpmsg GPIO driver
From: Shah, Tanmay @ 2026-05-13 19:05 UTC (permalink / raw)
  To: Mathieu Poirier, tanmay.shah
  Cc: Arnaud POULIQUEN, Beleswar Prasad Padhi, Shenwei Wang,
	Andrew Lunn, Linus Walleij, Bartosz Golaszewski, Jonathan Corbet,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Frank Li, Sascha Hauer, Shuah Khan, linux-gpio@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Pengutronix Kernel Team, Fabio Estevam, Peng Fan,
	devicetree@vger.kernel.org, linux-remoteproc@vger.kernel.org,
	imx@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
	dl-linux-imx, Bartosz Golaszewski
In-Reply-To: <CANLsYkz9-+1o4ek0f5jS=-G1nPpp7BkCmNE5cin1zRY_e6Me-A@mail.gmail.com>



On 5/13/2026 11:34 AM, Mathieu Poirier wrote:
> On Tue, 12 May 2026 at 11:20, Shah, Tanmay <tanmays@amd.com> wrote:
>>
>>
>>
>> On 5/12/2026 10:41 AM, Mathieu Poirier wrote:
>>> On Mon, May 11, 2026 at 04:35:46PM -0500, Shah, Tanmay wrote:
>>>>
>>>>
>>>> On 5/11/2026 12:58 PM, Mathieu Poirier wrote:
>>>>> On Mon, 11 May 2026 at 10:47, Shah, Tanmay <tanmays@amd.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 5/5/2026 10:52 AM, Shah, Tanmay wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 5/5/2026 4:28 AM, Arnaud POULIQUEN wrote:
>>>>>>>> Hi Tanmay,
>>>>>>>>
>>>>>>>> On 5/4/26 21:19, Shah, Tanmay wrote:
>>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I have started reviewing this work as well.
>>>>>>>>> Thanks Shenwei for this work.
>>>>>>>>>
>>>>>>>>> I have gone through only the current revision, and would like to provide
>>>>>>>>> idea on how to achieve GPIO number multiplexing with the RPMsg protocol.
>>>>>>>>> Also, have some bindings related question.
>>>>>>>>>
>>>>>>>>> Please see below:
>>>>>>>>>
>>>>>>>>> On 4/30/2026 11:40 AM, Arnaud POULIQUEN wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 4/30/26 14:56, Beleswar Prasad Padhi wrote:
>>>>>>>>>>> Hello Arnaud,
>>>>>>>>>>>
>>>>>>>>>>> On 30/04/26 13:05, Arnaud POULIQUEN wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> On 4/29/26 21:20, Mathieu Poirier wrote:
>>>>>>>>>>>>> On Wed, 29 Apr 2026 at 12:07, Padhi, Beleswar <b-padhi@ti.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Mathieu,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 4/29/2026 11:03 PM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>> On Wed, 29 Apr 2026 at 10:53, Shenwei Wang <shenwei.wang@nxp.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Mathieu Poirier <mathieu.poirier@linaro.org>
>>>>>>>>>>>>>>>>> Sent: Wednesday, April 29, 2026 10:42 AM
>>>>>>>>>>>>>>>>> To: Shenwei Wang <shenwei.wang@nxp.com>
>>>>>>>>>>>>>>>>> Cc: Andrew Lunn <andrew@lunn.ch>; Padhi, Beleswar <b-
>>>>>>>>>>>>>>>>> padhi@ti.com>; Linus
>>>>>>>>>>>>>>>>> Walleij <linusw@kernel.org>; Bartosz Golaszewski
>>>>>>>>>>>>>>>>> <brgl@kernel.org>; Jonathan
>>>>>>>>>>>>>>>>> Corbet <corbet@lwn.net>; Rob Herring <robh@kernel.org>;
>>>>>>>>>>>>>>>>> Krzysztof Kozlowski
>>>>>>>>>>>>>>>>> <krzk+dt@kernel.org>; Conor Dooley <conor+dt@kernel.org>; Bjorn
>>>>>>>>>>>>>>>>> Andersson
>>>>>>>>>>>>>>>>> <andersson@kernel.org>; Frank Li <frank.li@nxp.com>; Sascha Hauer
>>>>>>>>>>>>>>>>> <s.hauer@pengutronix.de>; Shuah Khan
>>>>>>>>>>>>>>>>> <skhan@linuxfoundation.org>; linux-
>>>>>>>>>>>>>>>>> gpio@vger.kernel.org; linux-doc@vger.kernel.org; linux-
>>>>>>>>>>>>>>>>> kernel@vger.kernel.org;
>>>>>>>>>>>>>>>>> Pengutronix Kernel Team <kernel@pengutronix.de>; Fabio Estevam
>>>>>>>>>>>>>>>>> <festevam@gmail.com>; Peng Fan <peng.fan@nxp.com>;
>>>>>>>>>>>>>>>>> devicetree@vger.kernel.org; linux-remoteproc@vger.kernel.org;
>>>>>>>>>>>>>>>>> imx@lists.linux.dev; linux-arm-kernel@lists.infradead.org; dl-
>>>>>>>>>>>>>>>>> linux-imx <linux-
>>>>>>>>>>>>>>>>> imx@nxp.com>; Bartosz Golaszewski <brgl@bgdev.pl>
>>>>>>>>>>>>>>>>> Subject: [EXT] Re: [PATCH v13 3/4] gpio: rpmsg: add generic
>>>>>>>>>>>>>>>>> rpmsg GPIO driver
>>>>>>>>>>>>>>>>> On Tue, Apr 28, 2026 at 03:24:59PM +0000, Shenwei Wang wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>> From: Andrew Lunn <andrew@lunn.ch>
>>>>>>>>>>>>>>>>>>> Sent: Monday, April 27, 2026 3:49 PM
>>>>>>>>>>>>>>>>>>> To: Shenwei Wang <shenwei.wang@nxp.com>
>>>>>>>>>>>>>>>>>>> Cc: Padhi, Beleswar <b-padhi@ti.com>; Linus Walleij
>>>>>>>>>>>>>>>>>>> <linusw@kernel.org>; Bartosz Golaszewski <brgl@kernel.org>;
>>>>>>>>>>>>>>>>>>> Jonathan
>>>>>>>>>>>>>>>>>>> Corbet <corbet@lwn.net>; Rob Herring <robh@kernel.org>;
>>>>>>>>>>>>>>>>>>> Krzysztof
>>>>>>>>>>>>>>>>>>> Kozlowski <krzk+dt@kernel.org>; Conor Dooley
>>>>>>>>>>>>>>>>>>> <conor+dt@kernel.org>;
>>>>>>>>>>>>>>>>>>> Bjorn Andersson <andersson@kernel.org>; Mathieu Poirier
>>>>>>>>>>>>>>>>>>> <mathieu.poirier@linaro.org>; Frank Li <frank.li@nxp.com>;
>>>>>>>>>>>>>>>>>>> Sascha
>>>>>>>>>>>>>>>>>>> Hauer <s.hauer@pengutronix.de>; Shuah Khan
>>>>>>>>>>>>>>>>>>> <skhan@linuxfoundation.org>; linux-gpio@vger.kernel.org; linux-
>>>>>>>>>>>>>>>>>>> doc@vger.kernel.org; linux-kernel@vger.kernel.org; Pengutronix
>>>>>>>>>>>>>>>>>>> Kernel Team <kernel@pengutronix.de>; Fabio Estevam
>>>>>>>>>>>>>>>>>>> <festevam@gmail.com>; Peng Fan <peng.fan@nxp.com>;
>>>>>>>>>>>>>>>>>>> devicetree@vger.kernel.org; linux- remoteproc@vger.kernel.org;
>>>>>>>>>>>>>>>>>>> imx@lists.linux.dev; linux-arm- kernel@lists.infradead.org;
>>>>>>>>>>>>>>>>>>> dl-linux-imx <linux-imx@nxp.com>; Bartosz Golaszewski
>>>>>>>>>>>>>>>>>>> <brgl@bgdev.pl>
>>>>>>>>>>>>>>>>>>> Subject: [EXT] Re: [PATCH v13 3/4] gpio: rpmsg: add generic
>>>>>>>>>>>>>>>>>>> rpmsg
>>>>>>>>>>>>>>>>>>> GPIO driver
>>>>>>>>>>>>>>>>>>>>> struct virtio_gpio_response {
>>>>>>>>>>>>>>>>>>>>>             __u8 status;
>>>>>>>>>>>>>>>>>>>>>             __u8 value;
>>>>>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>>>> It is the same message format. Please see the message
>>>>>>>>>>>>>>>>>>>> definition
>>>>>>>>>>>>>>>>>>> (GET_DIRECTION) below:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +   +-----+-----+-----+-----+-----+----+
>>>>>>>>>>>>>>>>>>>> +   |0x00 |0x01 |0x02 |0x03 |0x04 |0x05|
>>>>>>>>>>>>>>>>>>>> +   | 1   | 2   |port |line | err | dir|
>>>>>>>>>>>>>>>>>>>> +   +-----+-----+-----+-----+-----+----+
>>>>>>>>>>>>>>>>>>> Sorry, but i don't see how two u8 vs six u8 are the same
>>>>>>>>>>>>>>>>>>> message format.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Some changes to the message format are necessary.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Virtio uses two communication channels (virtqueues): one for
>>>>>>>>>>>>>>>>>> requests and
>>>>>>>>>>>>>>>>> replies, and a second one for events.
>>>>>>>>>>>>>>>>>> In contrast, rpmsg provides only a single communication
>>>>>>>>>>>>>>>>>> channel, so a
>>>>>>>>>>>>>>>>>> type field is required to distinguish between different kinds
>>>>>>>>>>>>>>>>>> of messages.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Since rpmsg replies and events share the same message format,
>>>>>>>>>>>>>>>>>> an additional
>>>>>>>>>>>>>>>>> line is introduced to handle both cases.
>>>>>>>>>>>>>>>>>> Finally, rpmsg supports multiple GPIO controllers, so a port
>>>>>>>>>>>>>>>>>> field is added to
>>>>>>>>>>>>>>>>> uniquely identify the target controller.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have commented on this before - RPMSG is already providing
>>>>>>>>>>>>>>>>> multiplexing
>>>>>>>>>>>>>>>>> capability by way of endpoints.  There is no need for a port
>>>>>>>>>>>>>>>>> field.  One endpoint,
>>>>>>>>>>>>>>>>> one GPIO controller.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You still need a way to let the remote side know which port the
>>>>>>>>>>>>>>>> endpoint maps to, either
>>>>>>>>>>>>>>>> by embedding the port information in the message (the current
>>>>>>>>>>>>>>>> way), or by sending it
>>>>>>>>>>>>>>>> separately.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> An endpoint is created with every namespace request.  There
>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>> one namespace request for every GPIO controller, which yields a
>>>>>>>>>>>>>>> unique
>>>>>>>>>>>>>>> endpoint for each controller and eliminates the need for an extra
>>>>>>>>>>>>>>> field to identify them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right, but this can still be done by just having one namespace
>>>>>>>>>>>>>> request.
>>>>>>>>>>>>>> We can create new endpoints bound to an existing namespace/
>>>>>>>>>>>>>> channel by
>>>>>>>>>>>>>> invoking rpmsg_create_ept(). This is what I suggested here too:
>>>>>>>>>>>>>> https://lore.kernel.org/all/29485742-6e49-482e-
>>>>>>>>>>>>>> b73d-228295daaeec@ti.com/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I will look at your suggestion (i.e link above) later this week or
>>>>>>>>>>>>> next week.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> My mental model looks like this for the complete picture:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. namespace/channel#1 = rpmsg-io
>>>>>>>>>>>>>>        a. ept1 -> gpio-controller@1
>>>>>>>>>>>>>>        b. ept2 -> gpio-controller@2
>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If I understand gpio-controller correctly, then this won't
>>>>>>>>> work. We need one rpmsg channel per gpio-controller, and in most cases
>>>>>>>>> there will be only one gpio-controller on the remote side. If there are
>>>>>>>>> multiple controllers, or multiple instances of the same controller, then
>>>>>>>>> we need a separate channel name for each controller, just as we would
>>>>>>>>> have a separate device on the Linux side.
>>>>>>>>
>>>>>>>> As done in the rpmsg_tty driver, it could be instantiated several times with
>>>>>>>> the same channel/service name. This would imply a specific rpmsg to
>>>>>>>> retrieve the gpio controller index from the remote side.
>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've asked for one endpoint per GPIO controller since the very
>>>>>>>>>>>>> beginning.  I don't yet have a strong opinion on whether to use one
>>>>>>>>>>>>> namespace request per GPIO controller or a single request that spins
>>>>>>>>>>>>> off multiple endpoints.  I'll have to look at your link and
>>>>>>>>>>>>> reflect on
>>>>>>>>>>>>> that.  Regardless of how we proceed on that front, multiplexing needs
>>>>>>>>>>>>> to happen at the endpoint level rather than the packet level.
>>>>>>>>>>>>> This is
>>>>>>>>>>>>> the only way this work can move forward.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I would be more in favor of Mathieu’s proposal: “An endpoint is
>>>>>>>>>>>> created with every namespace request.”
>>>>>>>>>>>>
>>>>>>>>>>>> If the endpoint is created only on the Linux side, how do we match
>>>>>>>>>>>> the Linux endpoint address with the local port field on the remote
>>>>>>>>>>>> side?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Simply by sending a message to the remote containing the newly created
>>>>>>>>>>> endpoint and the port idx. Note that is this done just one time, after
>>>>>>>>>>> this
>>>>>>>>>>> Linux need not have the port field in the message everytime its sending
>>>>>>>>>>> a message.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> With a multi-namespace approach, the namespace could be rpmsg-io-
>>>>>>>>>>>> [addr], where [addr] corresponds to the GPIO controller address in
>>>>>>>>>>>> the DT. This would:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> You will face the same problem in this case also that you asked above:
>>>>>>>>>>> "how do we match the Linux endpoint address with the local port field
>>>>>>>>>>> on the remote side?"
>>>>>>>>>>
>>>>>>>>>> Sorry I probably introduced confusion here
>>>>>>>>>> my sentence should be;
>>>>>>>>>>   With a multi-namespace approach, the namespace could be rpmsg-io-
>>>>>>>>>> [port],
>>>>>>>>>>   where [port] corresponds to the GPIO controller port in the DT.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For instance:
>>>>>>>>>>
>>>>>>>>>>        rpmsg {
>>>>>>>>>>          rpmsg-io {
>>>>>>>>>>            #address-cells = <1>;
>>>>>>>>>>            #size-cells = <0>;
>>>>>>>>>>
>>>>>>>>>>            gpio@25 {
>>>>>>>>>>              compatible = "rpmsg-gpio";
>>>>>>>>>>              reg = <25>;
>>>>>>>>>>              gpio-controller;
>>>>>>>>>>              #gpio-cells = <2>;
>>>>>>>>>>              #interrupt-cells = <2>;
>>>>>>>>>>              interrupt-controller;
>>>>>>>>>>            };
>>>>>>>>>>
>>>>>>>>>>            gpio@32 {
>>>>>>>>>>              compatible = "rpmsg-gpio";
>>>>>>>>>>              reg = <32>;
>>>>>>>>>>              gpio-controller;
>>>>>>>>>>              #gpio-cells = <2>;
>>>>>>>>>>              #interrupt-cells = <2>;
>>>>>>>>>>              interrupt-controller;
>>>>>>>>>>            };
>>>>>>>>>>          };
>>>>>>>>>>        };
>>>>>>>>>>
>>>>>>>>>>   rpmsg-io-25  would match with gpio@25
>>>>>>>>>>   rpmsg-io-32  would match with gpio@32
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The problem with this approach is that we will end up creating way too many
>>>>>>>>> RPMsg devices/channels, i.e. one channel per GPIO. That limits how
>>>>>>>>> many GPIOs the remote can handle from a memory perspective. At
>>>>>>>>> some point we might just run out of epts and channels that the remote
>>>>>>>>> can create. As of now, the open-amp library supports 128 epts, I think.
>>>>>>>>
>>>>>>>> Right, I proposed a solution in my previous answer to Beleswar who has
>>>>>>>> the same concern.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Because the endpoint that is created on a namespace request is also
>>>>>>>>>>> dynamic in nature. How will the remote know which endpoint addr
>>>>>>>>>>> Linux allocated for a namespace that it announced?
>>>>>>>>>>>
>>>>>>>>>>> As an example/PoC, I created a firmware example which announces
>>>>>>>>>>> 2 name services to Linux, one is the standard "rpmsg_chrdev" and
>>>>>>>>>>> the other is a TI specific name service "ti.ipc4.ping-pong". You can
>>>>>>>>>>> see it created 2 different addresses (0x400 and 0x401) for each of
>>>>>>>>>>> the name service request from the same firmware:
>>>>>>>>>>>
>>>>>>>>>>> root@j784s4-evm:~# dmesg | grep virtio0 | grep -i channel
>>>>>>>>>>> [    9.290275] virtio_rpmsg_bus virtio0: creating channel
>>>>>>>>>>> ti.ipc4.ping-pong addr 0xd
>>>>>>>>>>> [    9.311230] virtio_rpmsg_bus virtio0: creating channel rpmsg_chrdev
>>>>>>>>>>> addr 0xe
>>>>>>>>>>> [    9.496645] rpmsg_chrdev virtio0.rpmsg_chrdev.-1.14: DEBUG: Channel
>>>>>>>>>>> formed from src = 0x400 to dst = 0xe
>>>>>>>>>>> [    9.707255] rpmsg_client_sample virtio0.ti.ipc4.ping-pong.-1.13:
>>>>>>>>>>> new channel: 0x401 -> 0xd!
>>>>>>>>>>>
>>>>>>>>>>> So in this case, rpmsg-io-1 can have different ept addr than rpmsg-io-2
>>>>>>>>>>> Back to same problem. Simple solution is to reply to remote with the
>>>>>>>>>>> created ept addr and the index.
>>>>>>>>>>
>>>>>>>>>> That why I would like to suggest to use the name service field to
>>>>>>>>>> identify the port/controller, instead of the endpoint address.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> - match the RPMsg probe with the DT,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We can probe from all controllers with a single name service
>>>>>>>>>>> announcement too.
>>>>>>>>>>>
>>>>>>>>>>>> - provide a simple mapping between the port and the endpoint on both
>>>>>>>>>>>> sides,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We are trying to get rid of this mapping from Linux side to adapt
>>>>>>>>>>> the gpio-virtio design.
>>>>>>>>>>>
>>>>>>>>>>>> - allow multiple endpoints on the remote side,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We can support this as well with single nameservice model.
>>>>>>>>>>> There is no limitation. Remote has to send a message with
>>>>>>>>>>> its newly created ept that's all.
>>>>>>>>>>>
>>>>>>>>>>>> - provide a simple discovery mechanism for remote capabilities.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> A single announcement: "rpmsg-io" is also discovery mechanism.
>>>>>>>>>>>
>>>>>>>>>>> Feel free to let me know if you have concerns with any of the
>>>>>>>>>>> suggestions!
>>>>>>>>>>
>>>>>>>>>> My only concern, whatever the solution, is that we find a smart
>>>>>>>>>> solution to associate the correct endpoint with the correct GPIO
>>>>>>>>>> port/controller defined in the DT.
>>>>>>>>>>
>>>>>>>>>> I may have misunderstood your solution. Could you please help me
>>>>>>>>>> understand your proposal by explaining how you would handle three
>>>>>>>>>> GPIO ports defined in the DT, considering that the endpoint
>>>>>>>>>> addresses on the Linux side can be random?
>>>>>>>>>> If I assume there is a unique endpoint on the remote side,
>>>>>>>>>> I do not understand how you can match, on the firmware side,
>>>>>>>>>> the Linux endpoint address to the GPIO port.
>>>>>>>>>>
>>>>>>>>>> Thanks and Regards,Arnaud
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Beleswar
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Arnaud
>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. namespace/channel#2 = rpmsg-i2c
>>>>>>>>>>>>>>        a. ept1 -> i2c@1
>>>>>>>>>>>>>>        b. ept2 -> i2c@2
>>>>>>>>>>>>>>        c. ept3 -> i2c@3
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> etc...
>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just want to clear-up few terms before I jump to the solution:
>>>>>>>>>
>>>>>>>>> **RPMsg channel/device**:
>>>>>>>>>    - These are devices announced by the remote processor, and created by
>>>>>>>>> linux. They are created at: /sys/bus/rpmsg/devices
>>>>>>>>>    - The channel format: <name>.<src ept>.<dst ept>
>>>>>>>>>
>>>>>>>>> **RPMsg endpoint**:
>>>>>>>>>    - An endpoint is different from a channel. A single channel can have
>>>>>>>>> multiple endpoints, represented in Linux as /dev/rpmsg? devices.
>>>>>>>>>
>>>>>>>>> To create endpoint device, we have rpmsg_create_ept API, which takes
>>>>>>>>> channel information as input, which has src-ept, dst-ept.
>>>>>>>>>
>>>>>>>>> Following is proposed solution:
>>>>>>>>>
>>>>>>>>> 1) Assign RPMsg channel/device per rpmsg-gpio controller (Not per GPIO
>>>>>>>>> pin/port).
>>>>>>>>>    - In our case that would be, single rpmsg-io node. (That makes me
>>>>>>>>> question if bindings are correct or not).
>>>>>>>>>
>>>>>>>>> 2) Assign GPIO number as src ept.
>>>>>>>>>
>>>>>>>>> i.e. *rpmsg-io.<GPIO number>.<dst ept>*. Do not randomly assign src
>>>>>>>>> endpoint.
>>>>>>>>>
>>>>>>>>> Now, by spec an RPMsg channel reserves the first 1024 endpoints [1], so we
>>>>>>>>> can add an offset of 1024 to the GPIO number:
>>>>>>>>>
>>>>>>>>> So, when calling the rpmsg_create_ept() API, we assign src_endpoint as:
>>>>>>>>> (GPIO_NUMBER + RPMSG_RESERVED_ADDRESSES)
>>>>>>>>>
>>>>>>>>> Now on the remote side, there is single channel and only single-endpoint
>>>>>>>>> is needed that is mapped to the rpmsg-io channel callback.
>>>>>>>>>
>>>>>>>>> That callback will receive all the payloads from the Linux, which will
>>>>>>>>> have src-ept i.e. (RPMSG_RESERVED_ADDRESSES + GPIO_NUMBER).
>>>>>>>>
>>>>>>>>
>>>>>>>> Interesting approach. I also tried to find a similar solution.
>>>>>>>>
>>>>>>>> The question here is: how can we guarantee continuous addresses? Given
>>>>>>>> the static and dynamic allocation of endpoint addresses that are
>>>>>>>> implemented, my conclusion was that it is not reliable enough.
>>>>>>>>
>>>>>>>> but perhaps I missed something...
>>>>>>>>
>>>>>>>>>
>>>>>>>>> It can retrieve GPIO_NUMBER easily, and convert to appropriate pin based
>>>>>>>>> on platform specific logic.
>>>>>>>>>
>>>>>>>>> This doesn't need PORT information at all. Also it makes sure that
>>>>>>>>> remote is using only single-endpoint so not much memory is used.
>>>>>>>>>
>>>>>>>>> *Example*:
>>>>>>>>> If only the rpmsg-gpio channel is created by the remote side, then the
>>>>>>>>> following is the representation of the devices when GPIOs 25, 26 and 27
>>>>>>>>> are assigned to the rpmsg-io controller:
>>>>>>>>>
>>>>>>>>> Linux                                                      Remote
>>>>>>>>>
>>>>>>>>> rpmsg-channel: rpmsg-gpio.0x400.0x400
>>>>>>>>>
>>>>>>>>> /dev/rpmsg0 - GPIO25 ept (rpmsg-gpio.0x419.0x400)-|
>>>>>>>>>                                                    |
>>>>>>>>> /dev/rpmsg1 - GPIO26 ept (rpmsg-gpio.0x41a.0x400)-|-> rpmsg-gpio.*.0x400
>>>>>>>>>                                                    |
>>>>>>>>> /dev/rpmsg2 - GPIO27 ept (rpmsg-gpio.0x41b.0x400)-|  0x400 ept callback.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *On remote side*:
>>>>>>>>>
>>>>>>>>> ept_0x400_callback(..., int src_ept, ...,)
>>>>>>>>> {
>>>>>>>>>     int gpio_num = src_ept - RPMSG_RESERVED_ADDRESSES;
>>>>>>>>>     // platform specific logic to convert gpio num to proper pin,
>>>>>>>>>     // just like you would convert gpio num to pin on a linux gpio
>>>>>>>>> controller.
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> My question on the binding:
>>>>>>>>>
>>>>>>>>> Why is each GPIO represented with a separate node? I think rpmsg-gpio
>>>>>>>>> can be represented just like any other GPIO controller. Please let me know if
>>>>>>>>> I am missing something. So rpmsg channel/rpmsg device is not created per
>>>>>>>>> GPIO, but per controller. GPIO number multiplexing should be done with
>>>>>>>>> rpmsg src ept, that removes the need of having each GPIO as a separate
>>>>>>>>> node.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> rpmsg_gpio: rpmsg-gpio@0 {
>>>>>>>>>         compatible = "rpmsg-gpio";
>>>>>>>>>         reg = <0>;
>>>>>>>>>         gpio-controller;
>>>>>>>>>         #gpio-cells = <2>;
>>>>>>>>>         #interrupt-cells = <2>;
>>>>>>>>>         interrupt-controller;
>>>>>>>>>     };
>>>>>>>>>
>>>>>>>>> Then in DT, use like regular GPIO, but with the rpmsg-gpio controller:
>>>>>>>>>
>>>>>>>>> rpmsg-gpios = <&rpmsg_gpio (GPIO NUM) (flags)>;
>>>>>>>>>
>>>>>>>>> If the intent to create separate gpio nodes was only for the channel
>>>>>>>>> creation, then it's not really needed.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/torvalds/linux/
>>>>>>>>> blob/6d35786de28116ecf78797a62b84e6bf3c45aa5a/drivers/rpmsg/
>>>>>>>>> virtio_rpmsg_bus.c#L136
>>>>>>>>>
>>>>>>>>
>>>>>>>> It is already the case. bindings declare GPIO controllers, not directly
>>>>>>>> GPIOs in:
>>>>>>>>
>>>>>>>> [PATCH v13 2/4] dt-bindings: remoteproc: imx_rproc: Add "rpmsg" subnode
>>>>>>>> support
>>>>>>>>
>>>>>>>> The discussion is around having a unique RPMsg endpoint for all
>>>>>>>> GPIO controllers or one RPMsg endpoint per GPIO controller.
>>>>>>>>
>>>>>>>
>>>>>>> Endpoint where remote side or linux side?
>>>>>>>
>>>>>>> If unique endpoint on remote side per gpio controller then it makes sense.
>>>>>>>
>>>>>>> Unique endpoint on linux side doesn't make sense. Instead, unique
>>>>>>> channel per gpio controller makes sense, and each channel will have
>>>>>>> multiple endpoints on linux side. As I replied to Beleswar on the other
>>>>>>> email, I will copy-paste my answer here too:
>>>>>>>
>>>>>>>
>>>>>>> To be more specific:
>>>>>>>
>>>>>>> Linux:                               remote:
>>>>>>>
>>>>>>> ch1: rpmsg-gpio.-1.1024 ->     gpio-controller@1024
>>>>>>>     - gpio-line ept1
>>>>>>>     - gpio-line ept2    ->     They all map to same callback_ept_1024.
>>>>>>>     - gpio-line ept3
>>>>>>>
>>>>>>> ch2: rpmsg-gpio.-1.1025 ->     gpio-controller@1025
>>>>>>>     - gpio-line ept1
>>>>>>>     - gpio-line ept2    ->     They all map to same callback_ept_1025.
>>>>>>>     - gpio-line ept3
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Mathieu,
>>>>>>
>>>>>> So, after more brainstorming on this approach, I found a limitation:
>>>>>>
>>>>>> This approach won't work if the host OS is anything other than Linux. For
>>>>>> example, if the remote OS is Zephyr/baremetal using OpenAMP, then only the
>>>>>> Linux <-> Zephyr combination will work, and we won't be able to reuse
>>>>>> this approach for the Zephyr <-> Zephyr use case. The concept of an rpmsg
>>>>>> channel/device exists only in the Linux kernel implementation. This
>>>>>> raises another question: should the protocol we decide on work for other
>>>>>> use cases as well, or must Linux be the host OS for this protocol?
>>>>>>
>>>>>
>>>>> Linux and Zephyr are very distinct OS, each with their own subsystems
>>>>> and characteristics.  The design we choose here involves RPMSG and,
>>>>> inherently, Linux.  We can't make decisions based on what may
>>>>> potentially happen in Zephyr.
>>>>>
>>>>>>
>>>>>> I think your & Arnaud's proposed approach of single endpoint per
>>>>>> gpio-controller on both side makes more sense, as it will work
>>>>>> regardless of any OS on host or remote side.
>>>>>>
>>>>>
>>>>> Arnaud, Beleswar, Andrew and I are all advocating for one endpoint per
>>>>> GPIO controller.  The remaining issue is about the best way to work
>>>>> out source and destination addresses between Linux and the remote
>>>>> processor.  I'm running out of time for today but I'll return to this
>>>>> thread with a final analysis by the end of the week.
>>>>>
>>>>
>>>> Okay. Then that means multiple endpoints on Linux side can be considered.
>>>
>>> If there are multiple GPIO controllers then yes, there will be more than one
>>> endpoint.  At this time I do not want to consider other bus architectures (i2c,
>>> spi, ...) to avoid muddying an already difficult conversation.
>>>
>>>>
>>>> If we decide to go single-endpoint per device on both side, then for
>>>> that here is the proposal to represent src ept and dst ept:
>>>
>>> I do not understand what you mean by "per device" - please be more specific.
>>>
>>
>> "per device" I mean, per rpmsg device/channel. In our case that would be
>> per gpio-controller.
>>
>>>>
>>>> When we represent any device under rpmsg bus node, I think it should be
>>>> considered the remote's view of the address space. So ideally we can
>>>> convert it to Linux view of the address space, via 'ranges' property.
>>>
>>> There is no address space to consider since there is no GPIO controller memory
>>> space to access.  All that is done by the driver (remote processor) and
>>> completely hidden from Linux by rpmsg-virtio-gpio.
>>>
>>
>> So IMHO the dt-binding is the representation of the device hardware and
>> is independent of how the driver will access it. With any gpio-controller
>> device node, we are just representing what the gpio-controller hardware on
>> the remote side looks like, and what the corresponding Linux view is.
>>
>> The rpmsg-gpio driver is different than the platform gpio controller
>> driver mainly in two ways:
>>
>> 1) How the driver is probed: the rpmsg-gpio driver will be probed when the
>> corresponding rpmsg channel/device name-service announcement arrives
>> from the remote side.
>>
> 
> I agree.
> 
>> 2) The GPIO Ops are not performed on the hardware directly, but it's
>> done via rpmsg commands on the remote side.
>>
> 
> I agree.
> 
>> However, the GPIO controller hardware remains the same. So bindings
>> shouldn't change.
>>
> 
> That is where I have a different point of view.  There is no need to
> have information in the bindings the kernel won't use.  We are
> advertising virtio-gpio devices and as such should use virtio-gpio
> bindings.  The only thing that changes is the transport method, i.e.,
> encapsulated in RPMSG rather than directly over virtqueues.
> 

I do not have deep knowledge of virtio-gpio devices, but in the bindings
example "gpio-virtio.yaml", there is a 'reg' property available for the
virtio-mmio transport:
https://github.com/torvalds/linux/blob/1f63dd8ca0dc05a8272bb8155f643c691d29bb11/Documentation/devicetree/bindings/gpio/gpio-virtio.yaml#L47


Also, I am actually asking to use the 'reg' property to retrieve endpoint
information in the rpmsg-gpio driver. Please see below:


>> IMHO That means, if I want to move any existing GPIO-controller to the
>> remote side, and want the rpmsg-gpio driver to handle it then, all I
>> need to change is the compatible string of the current gpio-controller
>> device node. The rest of the address space should remain the same, and
>> leave ranges property empty. If the remote core has different view of
>> the address space, then the device should contain remote's view and
>> parent bus (rpmsg-io bus) should provide linux view via 'ranges' property.
>>
>> That is just the device hw representation in the device-tree as rpmsg
>> device. Same for any other type of the controller: i2c, spi etc.
>>
>> Thanks,
>> Tanmay
>>
>>
>>>>
>>>> So bindings should include 'ranges' property in the parent node. Then
>>>> linux view of the start address becomes src ept, and remote view of the
>>>> start address becomes dest ept. The remote view of the start address is
>>>> expected to be the static src endpoint on the remote side.
>>>>
>>>> Following representation of the rpmsg devices (gpio, i2c, spi or any other):
>>>>
>>>> rpmsg {
>>>>   #address-cells = <1>;
>>>>   #size-cells = <1>;
>>>>
>>>>   rpmsg-io {
>>>>     compatible = "rpmsg-io-bus";
>>>>     ranges = <remote_view_addr(dst ept) linux_view_addr(src ept) size>;
>>>>     #address-cells = <1>;
>>>>     #size-cells = <1>;
>>>>
>>>>     gpio@remote_view_addr(or dst ept) {
>>>>       compatible = "rpmsg-io";
>>>>       reg = <remote_view_addr addr_space_size>;
>>>>       gpio-controller;
>>>>       #gpio-cells = <2>;
>>>>       interrupt-controller;
>>>>       #interrupt-cells = <2>;
>>>>     };
>>>>

If we have the 'reg' property as explained in the above example, then we can
use the remote view of the start address of the device as the dest endpoint,
and the Linux view of the start address of the device (gpio-controller in our
case) as the src ept.

We use the 'reg' property like this:

u64 dest_ept, size;
struct resource res;
u32 src_ept;

/* Get remote view of the start addr of gpio-controller */
of_property_read_reg(dev_node, 0, &dest_ept, &size);

/* Get linux view of the start addr of gpio-controller */
of_address_to_resource(dev_node, 0, &res);
src_ept = res.start;

When sending an rpmsg command, we can use the above endpoint information
with the rpmsg_sendto() API.

Note that remote has already done name service announcement using
dst-ept (i.e. remote's view of the start address of the gpio-controller)
by this time.
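
To illustrate the mapping proposed above, here is a minimal userspace
sketch (not kernel code) of how a 'ranges' entry would turn the remote view
of a controller's start address into the Linux view. The names range_entry,
translate_ept and example_ranges are hypothetical, and the values are taken
from the example device tree quoted in this thread. In the driver, the
expectation is that of_address_to_resource() performs the equivalent
translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One <child-addr parent-addr size> entry of a 'ranges' property. */
struct range_entry {
	uint32_t child_addr;	/* remote view, proposed as dst ept */
	uint32_t parent_addr;	/* Linux view, proposed as src ept */
	uint32_t size;
};

/*
 * Translate a child-bus (remote view) address through 'ranges'.
 * Returns 0 and stores the parent (Linux view) address on success,
 * or -1 if no entry covers the address.
 */
int translate_ept(const struct range_entry *ranges, size_t n,
		  uint32_t child, uint32_t *parent)
{
	size_t i;

	assert(ranges && parent);

	for (i = 0; i < n; i++) {
		const struct range_entry *r = &ranges[i];

		if (child >= r->child_addr && child - r->child_addr < r->size) {
			*parent = r->parent_addr + (child - r->child_addr);
			return 0;
		}
	}
	return -1;
}

/* Values from the example device tree in this thread. */
const struct range_entry example_ranges[] = {
	{ 0x10000, 0x50000, 0x1000 },
	{ 0x20000, 0x60000, 0x1000 },
};
```

With these values, gpio@10000 would end up with dst ept 0x10000 and src ept
0x50000, and an address outside both windows would fail to translate.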

Thanks,
Tanmay

>>>>     ...
>>>>
>>>>   };
>>>>
>>>> };
>>>>
>>>> Example device-tree:
>>>>
>>>> rpmsg {
>>>>   #address-cells = <1>;
>>>>   #size-cells = <1>;
>>>>
>>>>   rpmsg-io {
>>>>     compatible = "rpmsg-io-bus";
>>>>     ranges = <0x10000 0x50000 0x1000>,
>>>>              <0x20000 0x60000 0x1000>;
>>>>     #address-cells = <1>;
>>>>     #size-cells = <1>;
>>>>
>>>>     gpio@10000 {
>>>>       compatible = "rpmsg-io";
>>>>       reg = <0x10000 0x1000>;
>>>>       gpio-controller;
>>>>       #gpio-cells = <2>;
>>>>       interrupt-controller;
>>>>       #interrupt-cells = <2>;
>>>>     };
>>>>
>>>>     gpio@20000 {
>>>>       compatible = "rpmsg-io";
>>>>       reg = <0x20000 0x1000>;
>>>>       gpio-controller;
>>>>       #gpio-cells = <2>;
>>>>       interrupt-controller;
>>>>       #interrupt-cells = <2>;
>>>>     };
>>>>
>>>>   };
>>>>
>>>> };
>>>>
>>>>
>>>> Thanks,
>>>> Tanmay
>>>>
>>>>
>>>>>> To be more specific this will look like following:
>>>>>>
>>>>>> Host (Linux)                       Remote (baremetal/RTOS)
>>>>>>
>>>>>> rpmsg ch/device 1:
>>>>>>     - rpmsg ept 1   <------>     rpmsg ept 1 gpio-controller 0
>>>>>>
>>>>>> rpmsg ch/device 2:
>>>>>>      - rpmsg ept 2   <------>     rpmsg ept 2 gpio-controller 1
>>>>>>
>>>>>>
>>>>>> The question is, how to decide src ept, and dest ept on both sides?
>>>>>> I still think it should be static endpoints.
>>>>>>
>>>>>> I will get back with more reasoning on that.
>>>>>>
>>>>>>> On the remote side, we have to hardcode Which rpmsg controller is mapped
>>>>>>> to which endpoint.
>>>>>>>
>>>>>>>> Or did I misunderstand your questions?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Arnaud
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I gave this patch more time yesterday, and I think the 'reg' property
>>>>>>> should represent remote endpoint, instead of the gpio-controller index.
>>>>>>>
>>>>>>> So in this approach remote implementation is expected to provide
>>>>>>> hard-coded (static) endpoints for each gpio-controller instance, and
>>>>>>> that same number should be represented with the 'reg' property.
>>>>>>>
>>>>>>> On remote side:
>>>>>>>
>>>>>>> #define RPMSG_GPIO_0_CONTROLLER_EPT (RPMSG_RESERVED_ADDRESSES) // 1024
>>>>>>>
>>>>>>> ept_1024_callback() {
>>>>>>>
>>>>>>>       // handle appropriate gpio port ()
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> On linux side:
>>>>>>>
>>>>>>> So new representation of controller:
>>>>>>>
>>>>>>>  rpmsg_gpio_0:   gpio@1024 {
>>>>>>>              compatible = "rpmsg-gpio";
>>>>>>>              reg = <1024>;
>>>>>>>              gpio-controller;
>>>>>>>              #gpio-cells = <2>;
>>>>>>>              #interrupt-cells = <2>;
>>>>>>>              interrupt-controller;
>>>>>>>           };
>>>>>>>
>>>>>>>  rpmsg_gpio_1:   gpio@1025 {
>>>>>>>              compatible = "rpmsg-gpio";
>>>>>>>              reg = <1025>;
>>>>>>>              gpio-controller;
>>>>>>>              #gpio-cells = <2>;
>>>>>>>              #interrupt-cells = <2>;
>>>>>>>              interrupt-controller;
>>>>>>>           };
>>>>>>>
>>>>>>> gpios = <&rpmsg_gpio_0 (GPIO NUM or PIN) flags>,
>>>>>>>       <&rpmsg_gpio_1 (GPIO NUM or PIN) flags>;
>>>>>>>
>>>>>>> Now in the linux driver:
>>>>>>>
>>>>>>> You can easily retrieve destination endpoint when we want to send the
>>>>>>> command to the gpio controller via device's "reg" property.
>>>>>>>
>>>>>>> This approach also provides built-in security. Because now
>>>>>>> gpio-controller instance is hardcoded with the endpoint callback, it
>>>>>>> can't be modified/addressed without changing the 'reg' property.
>>>>>>>
>>>>>>> Just like you wouldn't change device address for the instance of the
>>>>>>> gpio-controller right?
>>>>>>>
>>>>>>> This approach can be easily adapted to all the other rpmsg controllers
>>>>>>> as well.
>>>>>>>
>>>>>>> So, dynamic endpoint allocation doesn't make sense in this case. Dynamic
>>>>>>> endpoint allocation makes more sense for user-space apps which don't
>>>>>>> really care about endpoints and only payloads.
>>>>>>>
>>>>>>> But, here we are multiplexing device-addresses with endpoints, and so it
>>>>>>> has to be fixed, and presented via 'reg' property. So, firmware can't
>>>>>>> change device-address without Linux knowing it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Tanmay
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>>>>> This way device groups are isolated with each channel/namespace, and
>>>>>>>>>>>>>> instances within each device groups are also respected with specific
>>>>>>>>>>>>>> endpoints.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Beleswar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>


^ permalink raw reply

* [PATCH v2] cgroup/dmem: introduce a peak file
From: Thadeu Lima de Souza Cascardo @ 2026-05-13 18:58 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Jonathan Corbet, Shuah Khan, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tvrtko Ursulin
  Cc: cgroups, linux-kernel, linux-mm, linux-doc, dri-devel, kernel-dev,
	Thadeu Lima de Souza Cascardo

Just like we have memory.peak, introduce a dmem.peak, which uses the
page_counter support for that.

For now, make it read-only.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
Changes in v2:
- Make it read-only for now and adjust documentation accordingly.
- Link to v1: https://patch.msgid.link/20260506-dmem_peak-v1-0-8d803eb3449c@igalia.com
---
 Documentation/admin-guide/cgroup-v2.rst |  6 ++++++
 kernel/cgroup/dmem.c                    | 15 +++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..d103623b2be4 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2808,6 +2808,12 @@ DMEM Interface Files
 	The semantics are the same as for the memory cgroup controller, and are
 	calculated in the same way.
 
+  dmem.peak
+	A read-only nested-keyed file that exists on non-root cgroups.
+
+	The max device memory usage recorded for the cgroup and its
+	descendants since the creation of the cgroup for each region.
+
   dmem.capacity
 	A read-only file that describes maximum region capacity.
 	It only exists on the root cgroup. Not all memory can be
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f..6430c7ce1e03 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -182,6 +182,11 @@ static u64 get_resource_current(struct dmem_cgroup_pool_state *pool)
 	return pool ? page_counter_read(&pool->cnt) : 0;
 }
 
+static u64 get_resource_peak(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? READ_ONCE(pool->cnt.watermark) : 0;
+}
+
 static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
@@ -808,6 +813,11 @@ static int dmemcg_limit_show(struct seq_file *sf, void *v,
 	return 0;
 }
 
+static int dmem_cgroup_region_peak_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_peak);
+}
+
 static int dmem_cgroup_region_current_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_current);
@@ -856,6 +866,11 @@ static struct cftype files[] = {
 		.name = "current",
 		.seq_show = dmem_cgroup_region_current_show,
 	},
+	{
+		.name = "peak",
+		.seq_show = dmem_cgroup_region_peak_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{
 		.name = "min",
 		.write = dmem_cgroup_region_min_write,

---
base-commit: d3b0a7f21119f5a66cb76aa28fb8cc13206aaf7d
change-id: 20260409-dmem_peak-3abc1be95072

Best regards,
--  
Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
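
For readers unfamiliar with how a peak value is tracked, the following is a
simplified userspace sketch of a hierarchical counter with a high watermark.
It is not the kernel's page_counter implementation: the names peak_counter,
peak_counter_charge and peak_counter_uncharge are made up for illustration,
and locking and atomics are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a hierarchical usage counter (hypothetical). */
struct peak_counter {
	unsigned long usage;		/* current usage */
	unsigned long watermark;	/* highest usage seen so far */
	struct peak_counter *parent;	/* NULL at the top of the hierarchy */
};

/* Charge n units to c and every ancestor, raising watermarks as needed. */
void peak_counter_charge(struct peak_counter *c, unsigned long n)
{
	for (; c; c = c->parent) {
		c->usage += n;
		if (c->usage > c->watermark)
			c->watermark = c->usage;
	}
}

/* Uncharge n units from c and every ancestor; watermarks never decrease. */
void peak_counter_uncharge(struct peak_counter *c, unsigned long n)
{
	for (; c; c = c->parent) {
		assert(n <= c->usage);
		c->usage -= n;
	}
}
```

Because charges propagate upwards, a parent's watermark reflects the combined
usage of its descendants, which matches the "cgroup and its descendants"
wording in the documentation hunk.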


^ permalink raw reply related

* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-05-13 18:54 UTC (permalink / raw)
  To: Hao Jia
  Cc: Nhat Pham, akpm, tj, hannes, shakeel.butt, mhocko, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia, Alexandre Ghiti
In-Reply-To: <6fc7fdf0-368c-5129-038e-623f9db2aa88@gmail.com>

> > Zswap objects are organized into LRU and exposed to the shrinker
> > interface. Echo-ing to memory.reclaim should also offload some zswap
> > entries, correct? Are there still cold zswap entries that escape this,
> > somehow?
> >
>
> Yes, the memory.reclaim path does drive some zswap writeback, but
> it is not enough for our case.
>
> 1. For a memcg that has reached steady state (a common case being
> when memory.current is below the policy target), the userspace
> reclaimer may not invoke memory.reclaim on it for a long time,
> and so no second-level offloading happens through
> memory.reclaim. In this state we want
> memory.zswap.proactive_writeback to write back entries that
> have sat in zswap past an age threshold, to further reclaim
> the DRAM still held by the compressed data.
>
> 2. Even when memory.reclaim is running, the fraction of zswap
> residency that ends up reaching the backing swap device is
> still very small for many of our workloads, and the userspace
> reclaimer has no way to participate in or control the
> granularity of zswap writeback. So in our deployment we prefer
> to leave the zswap shrinker disabled, decouple LRU -> zswap
> from zswap -> swap, and use a dedicated proactive-writeback
> interface that lifts the writeback policy into userspace where
> it can evolve independently of the kernel.

To be honest I see the point of proactively reclaiming compressed
memory in zswap. If you use memory.reclaim, you are also reclaiming
hotter memory in the process, and you are not necessarily getting as
much writeback as you want. The memory in zswap is a more conservative
choice for proactive reclaim because it's memory that's guaranteed to
be cold(ish) and not being accessed.

That being said, the interface is not great any way you cut it :/

I don't like the 'memory.zswap.proactive_writeback' name, maybe we can
stay consistent by doing 'memory.zswap.reclaim', but that just as
easily reads as "reclaim using zswap". Maybe
'memory.zswap.do_writeback' or something, idk.

I also don't like having two proactive reclaim interfaces, so a voice
in my head wants to tie this into 'memory.reclaim' somehow, but that
includes adding a pretty specific argument (e.g. 'memory.reclaim
zswap_writeback_only=1').

I don't like any of these options, and we also need to consider what
the memcg maintainers think. I see the use case of proactive writeback
but I am struggling to come up with a clean interface.

I also think we should take the 'age' aspect out of the conversation
for now, it can be a separate discussion. Well, unless we decide to
tie it to memory.reclaim. If memory.reclaim broadly supports age-based
reclaim then zswap writeback can be a natural part of that without
requiring a specific interface.

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-13 18:39 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Christian König, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CABdmKX3R5faNgFva-HHVhtTcxJ0_BK9Rei3iTQcA+SRwdKv1Aw@mail.gmail.com>

On Wed, May 13, 2026 at 6:39 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Wed, May 13, 2026 at 5:41 AM Albert Esteve <aesteve@redhat.com> wrote:
> >
> > On Tue, May 12, 2026 at 12:14 PM Christian König
> > <christian.koenig@amd.com> wrote:
> > >
> > > On 5/12/26 11:10, Albert Esteve wrote:
> > > > On embedded platforms a central process often allocates dma-buf
> > > > memory on behalf of client applications. Without a way to
> > > > attribute the charge to the requesting client's cgroup, the
> > > > cost lands on the allocator, making per-cgroup memory limits
> > > > ineffective for the actual consumers.
> > > >
> > > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > > the mem_accounting module parameter enabled, the buffer is charged
> > > > to the allocator's own cgroup.
> > > >
> > > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > > all accounting through a single MEMCG_DMABUF path.
> > > >
> > > > Usage examples:
> > > >
> > > >   1. Central allocator charging to a client at allocation time.
> > > >      The allocator knows the client's PID (e.g., from binder's
> > > >      sender_pid) and uses pidfd to attribute the charge:
> > > >
> > > >        pid_t client_pid = txn->sender_pid;
> > > >        int pidfd = pidfd_open(client_pid, 0);
> > > >
> > > >        struct dma_heap_allocation_data alloc = {
> > > >            .len             = buffer_size,
> > > >            .fd_flags        = O_RDWR | O_CLOEXEC,
> > > >            .charge_pid_fd   = pidfd,
> > > >        };
> > > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > > >        close(pidfd);
> > > >        /* alloc.fd is now charged to client's cgroup */
> > > >
> > > >   2. Default allocation (no pidfd, mem_accounting=1).
> > > >      When charge_pid_fd is not set and the mem_accounting module
> > > >      parameter is enabled, the buffer is charged to the allocator's
> > > >      own cgroup:
> > > >
> > > >        struct dma_heap_allocation_data alloc = {
> > > >            .len      = buffer_size,
> > > >            .fd_flags = O_RDWR | O_CLOEXEC,
> > > >        };
> > > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > > >        /* charged to current process's cgroup */
> > > >
> > > > Current limitations:
> > > >
> > > > >  - Single-owner model: a dma-buf carries one memcg charge regardless of
> > > > >    how many processes share it. This means only the first owner (and
> > > > >    exporter) of the shared buffer bears the charge.
> > > >  - Only memcg accounting supported. While this makes sense for system
> > > >    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> > > >    charging also for the dmem controller.
> > >
> > > Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
> > >
> > > I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive etc...
> > >
> > > Essentially the problem boils down to two limitations:
> > > 1) a piece of memory can only be charged to one cgroup, the framework doesn't have a concept of charging shared memory to multiple groups
> > > 2) when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
> > >
> > > The passing of the memory reference already has a well defined uAPI and if we could solve those two limitations we not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases which use file descriptors as well, e.g. memfd, accel and GPU drivers etc...
> >
> > Honestly, adding a hook to fd-passing uAPI to manage charge transfers
> > sounds like a promising solution requiring no uAPI changes. However,
> > it still does not cover all paths, e.g., dup() or fork(). And shared
> > memory sounds like a hard one to tackle, where deciding the best
> > policy is more a per-usecase thing and would probably require
> > userspace configuration.
>
> I'm curious if anyone knows of a use case where FDs aren't involved at
> all? It's possible to fork() or clone() with only a dmabuf mapping and
> no FD. That sounds strange, and I'm not sure there's a real usecase
> for transferring ownership with that approach, but figured I'd at
> least pose the question.

Yeah, that's a good point. I do not really have a use case myself for
fork(); I just thought of it as a possible gap/uncovered path.
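
For reference, the "mapping without an FD" case can be reproduced in
userspace with a short sketch. This is only an illustration under stated
assumptions: an unlinked temporary file stands in for a dma-buf (a real one
would come from DMA_HEAP_IOCTL_ALLOC), and mapping_survives_fork() is a
made-up helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a forked child can still write through a shared mapping
 * after the backing fd was closed, 0 if not, -1 on setup failure.
 * ASSUMPTION: a temporary file stands in for a dma-buf here.
 */
int mapping_survives_fork(void)
{
	const size_t len = 4096;
	FILE *tmp = tmpfile();
	char *map;
	pid_t pid;
	int status, ok;

	if (!tmp)
		return -1;
	if (ftruncate(fileno(tmp), (off_t)len) < 0) {
		fclose(tmp);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fileno(tmp), 0);
	fclose(tmp);	/* drop the only fd; the mapping keeps the file alive */
	if (map == MAP_FAILED)
		return -1;
	assert(map[0] == '\0');	/* freshly extended file reads back as zeros */

	pid = fork();
	if (pid < 0) {
		munmap(map, len);
		return -1;
	}
	if (pid == 0) {
		/* child: no fd for the buffer, only the inherited mapping */
		strcpy(map, "child");
		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0) {
		munmap(map, len);
		return -1;
	}
	ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
	     strcmp(map, "child") == 0;
	munmap(map, len);
	return ok;
}
```

The child never holds a file descriptor for the buffer, so an accounting
hook placed only on fd-passing paths would not see it.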

>
> > All in all, charge_pid_fd covers a
> > well-defined and immediately practical subset. The UAPI cost is small
> > and the mechanism is explicit about what it does and doesn't solve. A
> > general solution, if it ever converges, would likely supersede
> > charge_pid_fd for most cases, which is a fine outcome if it solves the
> > problem more completely.
> >
> > Either way, if you have a specific approach in mind for solving any of
> > the above limitations, I'd be happy to look into it further.
> >
> > BR,
> > Albert.
> >
> > >
> > > On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps. On the GPU side, I saw yet another attempt by a driver to do some kind of special driver-specific accounting to solve this just a few weeks ago. And to be honest, such single-driver island approaches have a tendency to break more often than they work correctly.
> > >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > > > ---
> > > >  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
> > > >  drivers/dma-buf/dma-buf.c               | 16 ++++---------
> > > >  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
> > > >  drivers/dma-buf/heaps/system_heap.c     |  2 --
> > > >  include/uapi/linux/dma-heap.h           |  6 +++++
> > > >  5 files changed, 53 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > > index 8bdbc2e866430..824d269531eb1 100644
> > > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > > @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> > > >               structures.
> > > >
> > > >         dmabuf (npn)
> > > > -             Amount of memory used for exported DMA buffers allocated by the cgroup.
> > > > -             Stays with the allocating cgroup regardless of how the buffer is shared.
> > > > +             Amount of memory used for exported DMA buffers allocated by or on
> > > > +             behalf of the cgroup. Stays with the allocating cgroup regardless
> > > > +             of how the buffer is shared.
> > > >
> > > >         workingset_refault_anon
> > > >               Number of refaults of previously evicted anonymous pages.
> > > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > > > index ce02377f48908..23fb758b78297 100644
> > > > --- a/drivers/dma-buf/dma-buf.c
> > > > +++ b/drivers/dma-buf/dma-buf.c
> > > > @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> > > >        */
> > > >       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> > > >
> > > > -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > > > -     mem_cgroup_put(dmabuf->memcg);
> > > > +     if (dmabuf->memcg) {
> > > > +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> > > > +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > > > +             mem_cgroup_put(dmabuf->memcg);
> > > > +     }
> > > >
> > > >       dmabuf->ops->release(dmabuf);
> > > >
> > > > @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > > >               dmabuf->resv = resv;
> > > >       }
> > > >
> > > > -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> > > > -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> > > > -                                   GFP_KERNEL)) {
> > > > -             ret = -ENOMEM;
> > > > -             goto err_memcg;
> > > > -     }
> > > > -
> > > >       file->private_data = dmabuf;
> > > >       file->f_path.dentry->d_fsdata = dmabuf;
> > > >       dmabuf->file = file;
> > > > @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > > >
> > > >       return dmabuf;
> > > >
> > > > -err_memcg:
> > > > -     mem_cgroup_put(dmabuf->memcg);
> > > >  err_file:
> > > >       fput(file);
> > > >  err_module:
> > > > diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> > > > index ac5f8685a6494..ff6e259afcdc0 100644
> > > > --- a/drivers/dma-buf/dma-heap.c
> > > > +++ b/drivers/dma-buf/dma-heap.c
> > > > @@ -7,13 +7,17 @@
> > > >   */
> > > >
> > > >  #include <linux/cdev.h>
> > > > +#include <linux/cgroup.h>
> > > >  #include <linux/device.h>
> > > >  #include <linux/dma-buf.h>
> > > >  #include <linux/dma-heap.h>
> > > > +#include <linux/memcontrol.h>
> > > > +#include <linux/sched/mm.h>
> > > >  #include <linux/err.h>
> > > >  #include <linux/export.h>
> > > >  #include <linux/list.h>
> > > >  #include <linux/nospec.h>
> > > > +#include <linux/pidfd.h>
> > > >  #include <linux/syscalls.h>
> > > >  #include <linux/uaccess.h>
> > > >  #include <linux/xarray.h>
> > > > @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> > > >                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> > > >
> > > >  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > > > -                              u32 fd_flags,
> > > > -                              u64 heap_flags)
> > > > +                              u32 fd_flags, u64 heap_flags,
> > > > +                              struct mem_cgroup *charge_to)
> > > >  {
> > > >       struct dma_buf *dmabuf;
> > > > +     unsigned int nr_pages;
> > > > +     struct mem_cgroup *memcg = charge_to;
> > > >       int fd;
> > > >
> > > >       /*
> > > > @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > > >       if (IS_ERR(dmabuf))
> > > >               return PTR_ERR(dmabuf);
> > > >
> > > > +     nr_pages = len / PAGE_SIZE;
> > > > +
> > > > +     if (memcg)
> > > > +             css_get(&memcg->css);
> > > > +     else if (mem_accounting)
> > > > +             memcg = get_mem_cgroup_from_mm(current->mm);
> > > > +
> > > > +     if (memcg) {
> > > > +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> > > > +                     mem_cgroup_put(memcg);
> > > > +                     dma_buf_put(dmabuf);
> > > > +                     return -ENOMEM;
> > > > +             }
> > > > +             dmabuf->memcg = memcg;
> > > > +     }
> > > > +
> > > >       fd = dma_buf_fd(dmabuf, fd_flags);
> > > >       if (fd < 0) {
> > > >               dma_buf_put(dmabuf);
> > > > @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > > >  {
> > > >       struct dma_heap_allocation_data *heap_allocation = data;
> > > >       struct dma_heap *heap = file->private_data;
> > > > +     struct mem_cgroup *memcg = NULL;
> > > > +     struct task_struct *task;
> > > > +     unsigned int pidfd_flags;
> > > >       int fd;
> > > >
> > > >       if (heap_allocation->fd)
> > > > @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > > >       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> > > >               return -EINVAL;
> > > >
> > > > +     if (heap_allocation->charge_pid_fd) {
> > > > +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> > > > +             if (IS_ERR(task))
> > > > +                     return PTR_ERR(task);
> > > > +
> > > > +             memcg = get_mem_cgroup_from_mm(task->mm);
> > > > +             put_task_struct(task);
> > > > +     }
> > > > +
> > > >       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> > > >                                  heap_allocation->fd_flags,
> > > > -                                heap_allocation->heap_flags);
> > > > +                                heap_allocation->heap_flags,
> > > > +                                memcg);
> > > > +     mem_cgroup_put(memcg);
> > > >       if (fd < 0)
> > > >               return fd;
> > > >
> > > > diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> > > > index 03c2b87cb1112..95d7688167b93 100644
> > > > --- a/drivers/dma-buf/heaps/system_heap.c
> > > > +++ b/drivers/dma-buf/heaps/system_heap.c
> > > > @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> > > >               if (max_order < orders[i])
> > > >                       continue;
> > > >               flags = order_flags[i];
> > > > -             if (mem_accounting)
> > > > -                     flags |= __GFP_ACCOUNT;
> > > >               page = alloc_pages(flags, orders[i]);
> > > >               if (!page)
> > > >                       continue;
> > > > diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> > > > index a4cf716a49fa6..e02b0f8cbc6a1 100644
> > > > --- a/include/uapi/linux/dma-heap.h
> > > > +++ b/include/uapi/linux/dma-heap.h
> > > > @@ -29,6 +29,10 @@
> > > >   *                   handle to the allocated dma-buf
> > > >   * @fd_flags:                file descriptor flags used when allocating
> > > >   * @heap_flags:              flags passed to heap
> > > > + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
> > > > + *                   charged for this allocation; 0 means charge the calling
> > > > + *                   process's cgroup
> > > > + * @__padding:               reserved, must be zero
> > > >   *
> > > >   * Provided by userspace as an argument to the ioctl
> > > >   */
> > > > @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> > > >       __u32 fd;
> > > >       __u32 fd_flags;
> > > >       __u64 heap_flags;
> > > > +     __u32 charge_pid_fd;
> > > > +     __u32 __padding;
> > > >  };
> > > >
> > > >  #define DMA_HEAP_IOC_MAGIC           'H'
> > > >
> > >
> >
>
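
For readers following the uAPI change quoted above, here is a minimal userspace sketch of how a caller might fill in the extended allocation request. The struct name `dma_heap_alloc_req_sketch` and the helper `fill_alloc_request` are hypothetical and exist only for illustration. The field layout is copied from the quoted diff and is an assumption about a patch under review, not a released header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical mirror of struct dma_heap_allocation_data as extended by
 * the quoted patch. The layout is an assumption taken from the diff.
 */
struct dma_heap_alloc_req_sketch {
	uint64_t len;           /* requested size in bytes */
	uint32_t fd;            /* output: must be zero on input */
	uint32_t fd_flags;      /* file descriptor flags for the new dma-buf */
	uint64_t heap_flags;    /* flags passed to the heap */
	uint32_t charge_pid_fd; /* pidfd to charge; 0 charges the caller */
	uint32_t __padding;     /* reserved, must be zero */
};

/* Two 32-bit fields were appended, so the struct grows from 24 to 32 bytes. */
static_assert(sizeof(struct dma_heap_alloc_req_sketch) == 32,
	      "unexpected layout for the extended allocation request");

/*
 * Hypothetical helper: zero the whole request first so that fd,
 * heap_flags and __padding are guaranteed to be zero, then set the
 * caller-controlled fields.
 */
static void fill_alloc_request(struct dma_heap_alloc_req_sketch *req,
			       uint64_t len, uint32_t fd_flags,
			       uint32_t charge_pid_fd)
{
	memset(req, 0, sizeof(*req));
	req->len = len;
	req->fd_flags = fd_flags;
	req->charge_pid_fd = charge_pid_fd;
}
```

Zeroing the request before setting fields keeps the reserved padding at zero, which the quoted kernel-doc comment requires. Because the ioctl number encodes the struct size, a request built this way would presumably only be accepted by a kernel carrying the patch.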



* Re: [PATCH v7 10/10] ipe: Add BPF program load policy enforcement via Hornet integration
From: Paul Moore @ 2026-05-13 18:36 UTC (permalink / raw)
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-11-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> Add support for the bpf_prog_load_post_integrity LSM hook, enabling IPE
> to make policy decisions about BPF program loading based on integrity
> verdicts provided by the Hornet LSM.
> 
> New policy operation:
>   op=BPF_PROG_LOAD - Matches BPF program load events
> 
> New policy properties:
>   bpf_signature=NONE       - No verdict
>   bpf_signature=OK         - Program signature and map hashes verified
>   bpf_signature=UNSIGNED   - No signature provided
>   bpf_signature=PARTIALSIG - Signature OK but no map hash data
>   bpf_signature=UNKNOWNKEY - The keyring requested by the user is invalid
>   bpf_signature=UNEXPECTED - An unexpected hash value was encountered
>   bpf_signature=FAULT      - System error during verification
>   bpf_signature=BADSIG     - Signature or map hash verification failed
>   bpf_keyring=BUILTIN      - Program was signed using a builtin keyring
>   bpf_keyring=SECONDARY    - Program was signed using the secondary keyring
>   bpf_keyring=PLATFORM     - Program was signed using the platform keyring
>   bpf_kernel=TRUE          - Program originated from kernelspace
>   bpf_kernel=FALSE         - Program originated from userspace
> 
> These properties map directly to the lsm_integrity_verdict enum values
> provided by the Hornet LSM through security_bpf_prog_load_post_integrity.
> 
> The feature is gated on CONFIG_IPE_PROP_BPF_SIGNATURE which depends on
> CONFIG_SECURITY_HORNET.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> Acked-by: Fan Wu <wufan@kernel.org>
> ---
>  Documentation/admin-guide/LSM/ipe.rst | 162 +++++++++++++++++++++++++-
>  Documentation/security/ipe.rst        |  68 +++++++++++
>  security/ipe/Kconfig                  |  15 +++
>  security/ipe/audit.c                  |  15 +++
>  security/ipe/eval.c                   |  93 ++++++++++++++-
>  security/ipe/eval.h                   |  11 ++
>  security/ipe/hooks.c                  |  63 ++++++++++
>  security/ipe/hooks.h                  |  15 +++
>  security/ipe/ipe.c                    |  14 +++
>  security/ipe/ipe.h                    |   3 +
>  security/ipe/policy.h                 |  14 +++
>  security/ipe/policy_parser.c          |  27 +++++
>  12 files changed, 498 insertions(+), 2 deletions(-)

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH v7 9/10] selftests/hornet: Add a selftest for the Hornet LSM
From: Paul Moore @ 2026-05-13 18:36 UTC (permalink / raw)
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-10-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> This selftest contains a testcase that utilizes light skeleton eBPF
> loaders and exercises hornet's map validation.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> ---
>  tools/testing/selftests/Makefile             |  1 +
>  tools/testing/selftests/hornet/Makefile      | 63 ++++++++++++++++++++
>  tools/testing/selftests/hornet/loader.c      | 21 +++++++
>  tools/testing/selftests/hornet/trivial.bpf.c | 33 ++++++++++
>  4 files changed, 118 insertions(+)
>  create mode 100644 tools/testing/selftests/hornet/Makefile
>  create mode 100644 tools/testing/selftests/hornet/loader.c
>  create mode 100644 tools/testing/selftests/hornet/trivial.bpf.c

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH v7 8/10] hornet: Add a light skeleton data extractor scripts
From: Paul Moore @ 2026-05-13 18:36 UTC (permalink / raw)
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-9-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> These scripts ease light skeleton development against Hornet by
> generating data payloads which can be used for signing a light
> skeleton binary using gen_sig.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> ---
>  scripts/hornet/extract-insn.sh | 27 +++++++++++++++++++++++++++
>  scripts/hornet/extract-map.sh  | 27 +++++++++++++++++++++++++++
>  scripts/hornet/extract-skel.sh | 27 +++++++++++++++++++++++++++
>  3 files changed, 81 insertions(+)
>  create mode 100755 scripts/hornet/extract-insn.sh
>  create mode 100755 scripts/hornet/extract-map.sh
>  create mode 100755 scripts/hornet/extract-skel.sh

Merged into lsm/dev, thanks.

--
paul-moore.com


* Re: [PATCH v7 7/10] hornet: Introduce gen_sig
From: Paul Moore @ 2026-05-13 18:36 UTC (permalink / raw)
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-8-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> This introduces the gen_sig tool. It creates a PKCS#7 signature of a
> data payload. Additionally it appends a signed attribute containing a
> set of hashes.
> 
> Typical usage is to provide a payload containing the light skeleton
> eBPF syscall program binary and its associated maps, which can be
> extracted from the auto-generated skeleton header.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> ---
>  scripts/Makefile            |   1 +
>  scripts/hornet/Makefile     |   5 +
>  scripts/hornet/gen_sig.c    | 401 ++++++++++++++++++++++++++++++++++++
>  scripts/hornet/write-sig.sh |  27 +++
>  4 files changed, 434 insertions(+)
>  create mode 100644 scripts/hornet/Makefile
>  create mode 100644 scripts/hornet/gen_sig.c
>  create mode 100755 scripts/hornet/write-sig.sh

Merged into lsm/dev, but I did add a .gitignore for scripts/hornet/ and
I fixed up the SPDX tag (it wants C++ style comments).

--
paul-moore.com


* Re: [PATCH v7 6/10] security: Hornet LSM
From: Paul Moore @ 2026-05-13 18:36 UTC (permalink / raw)
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, Randy Dunlap, linux-security-module,
	linux-doc, linux-kernel, bpf, Song Liu
In-Reply-To: <20260507191416.2984054-7-bboscaccy@linux.microsoft.com>

On May  7, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> This adds the Hornet Linux Security Module which provides enhanced
> signature verification and data validation for eBPF programs. This
> allows users to continue to maintain an invariant that all code
> running inside of the kernel has actually been signed and verified, by
> the kernel.
> 
> This effort builds upon the currently accepted upstream solution. It
> further hardens it by providing deterministic, in-kernel checking of
> map hashes to solidify auditing along with preventing TOCTOU attacks
> against lskel map hashes.
> 
> Target map hashes are passed in via PKCS#7 signed attributes. Hornet
> determines the extent to which the eBPF program is signed and defers to
> other LSMs for policy decisions.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> Nacked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> ---
>  Documentation/admin-guide/LSM/Hornet.rst | 323 +++++++++++++++++++++
>  Documentation/admin-guide/LSM/index.rst  |   1 +
>  MAINTAINERS                              |   9 +
>  include/linux/oid_registry.h             |   3 +
>  include/uapi/linux/lsm.h                 |   1 +
>  security/Kconfig                         |   3 +-
>  security/Makefile                        |   1 +
>  security/hornet/Kconfig                  |  13 +
>  security/hornet/Makefile                 |   7 +
>  security/hornet/hornet.asn1              |  12 +
>  security/hornet/hornet_lsm.c             | 352 +++++++++++++++++++++++
>  11 files changed, 724 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/admin-guide/LSM/Hornet.rst
>  create mode 100644 security/hornet/Kconfig
>  create mode 100644 security/hornet/Makefile
>  create mode 100644 security/hornet/hornet.asn1
>  create mode 100644 security/hornet/hornet_lsm.c

Merged into lsm/dev, thanks.

--
paul-moore.com

