Linux Documentation
* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
From: Hao Jia @ 2026-05-12  9:37 UTC
  To: Nhat Pham
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <CAKEwX=PW2+EN41ANutv4cv+iM+JpwV5V+NSp5ukAt0M6fbHFLg@mail.gmail.com>



On 2026/5/12 03:54, Nhat Pham wrote:
> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 19538d6f169a..1173ac6836fa 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -36,6 +36,7 @@
>>   #include <linux/workqueue.h>
>>   #include <linux/list_lru.h>
>>   #include <linux/zsmalloc.h>
>> +#include <linux/timekeeping.h>
>>
>>   #include "swap.h"
>>   #include "internal.h"
>> @@ -160,6 +161,12 @@ struct zswap_pool {
>>          char tfm_name[CRYPTO_MAX_ALG_NAME];
>>   };
>>
>> +struct zswap_shrink_walk_arg {
>> +       ktime_t cutoff_time;
>> +       bool proactive;
>> +       bool encountered_page_in_swapcache;
>> +};
>> +
>>   /* Global LRU lists shared by all zswap pools. */
>>   static struct list_lru zswap_list_lru;
>>
>> @@ -183,6 +190,7 @@ static struct shrinker *zswap_shrinker;
>>    * handle - zsmalloc allocation handle that stores the compressed page data
>>    * objcg - the obj_cgroup that the compressed memory is charged to
>>    * lru - handle to the pool's lru used to evict pages.
>> + * store_time - Time when the entry was stored, for proactive writeback.
>>    */
>>   struct zswap_entry {
>>          swp_entry_t swpentry;
>> @@ -192,6 +200,7 @@ struct zswap_entry {
>>          unsigned long handle;
>>          struct obj_cgroup *objcg;
>>          struct list_head lru;
>> +       ktime_t store_time;
> 
> On the implementation side - will this blow up struct zswap_entry
> memory footprint? If so, can you guard this behind a CONFIG option, if
> we are to go this route?

Thanks for the review. I'll address this in v2.

Thanks,
Hao


* Re: [PATCH RESEND bpf-next v10 3/8] bpf: Introduce the bpf_list_del kfunc.
From: Kaitao Cheng @ 2026-05-12  9:36 UTC
  To: bot+bpf-ci
  Cc: bpf, ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	chengkaitao, skhan, memxor, linux-kernel, linux-doc, vmalik,
	linux-kselftest, martin.lau, clm, ihor.solodrai
In-Reply-To: <297a19ac8fb5fc8ab279a932137bf39604f2f3637521ff6d7d4b39c364a52e07@mail.kernel.org>

I have clarified this issue in the [PATCH bpf-next v10 3/8] patch,
with the link provided below:

https://lore.kernel.org/bpf/0637c1ce-0e73-4505-8a0c-e77339f4aefb@linux.dev/

On 2026/5/12 14:41, bot+bpf-ci@kernel.org wrote:
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 1e8754877dd1..51b6ea4bb8cb 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -2593,6 +2593,15 @@ __bpf_kfunc struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
>>  	return __bpf_list_del(head, h->prev);
>>  }
>>
>> +__bpf_kfunc struct bpf_list_node *bpf_list_del(struct bpf_list_head *head,
>> +					       struct bpf_list_node *node)
>> +{
>> +	struct bpf_list_node_kern *kn = (void *)node;
>> +
>> +	/* verifier guarantees node is a list node rather than list head */
>> +	return __bpf_list_del(head, &kn->list_head);
>> +}
>> +
> 
> Looking at bpf_list_pop_front() and bpf_list_pop_back() in the same
> file, both functions read from the list_head (h->next or h->prev)
> before passing it to __bpf_list_del(). If the list is uninitialized,
> they end up passing potentially invalid pointer values to __bpf_list_del(),
> which then has to handle this by initializing the list and returning NULL.
> 
> While __bpf_list_del() handles this correctly, would it be clearer to
> move the initialization check into the top-level kfuncs (like
> bpf_list_pop_front/back) so that __bpf_list_del() always receives a
> valid list_head pointer rather than potentially garbage values that
> get ignored later?
> 
> Reference: https://lore.kernel.org/bpf/af50d188f6bc2de81fe3c321b26a645a9425ba9f2644cb6f8525d5bb637385f9@mail.kernel.org/
> 
>> @@ -4725,6 +4734,7 @@ BTF_ID_FLAGS(func, bpf_list_push_back, KF_IMPLICIT_ARGS)
>>  BTF_ID_FLAGS(func, bpf_list_push_back_impl)
>>  BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
>>  BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
>> +BTF_ID_FLAGS(func, bpf_list_del, KF_ACQUIRE | KF_RET_NULL)
>>  BTF_ID_FLAGS(func, bpf_list_front, KF_RET_NULL)
>>  BTF_ID_FLAGS(func, bpf_list_back, KF_RET_NULL)
>>  BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> 
> [ ... ]
> 
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 11054ad89c14..a6bcbe68cd72 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -10744,6 +10744,7 @@ enum special_kfunc_type {
>>  	KF_bpf_list_push_back,
>>  	KF_bpf_list_pop_front,
>>  	KF_bpf_list_pop_back,
>> +	KF_bpf_list_del,
>>  	KF_bpf_list_front,
>>  	KF_bpf_list_back,
>>  	KF_bpf_cast_to_kern_ctx,
> 
> [ ... ]
> 
>> @@ -11456,7 +11459,8 @@ static bool check_kfunc_is_graph_node_api(struct bpf_verifier_env *env,
>>
>>  	switch (node_field_type) {
>>  	case BPF_LIST_NODE:
>> -		ret = is_bpf_list_push_kfunc(kfunc_btf_id);
>> +		ret = is_bpf_list_push_kfunc(kfunc_btf_id) ||
>> +		      kfunc_btf_id == special_kfunc_list[KF_bpf_list_del];
>>  		break;
>>  	case BPF_RB_NODE:
>>  		ret = (is_bpf_rbtree_add_kfunc(kfunc_btf_id) ||
> 
> Can bpf_list_del actually be called from BPF programs? Looking at the
> verifier's handling of KF_ARG_PTR_TO_LIST_NODE arguments in
> check_kfunc_args():
> 
> 	case KF_ARG_PTR_TO_LIST_NODE:
> 		...
> 		if (!reg->ref_obj_id) {
> 			verbose(env, "allocated object must be referenced\n");
> 			return -EINVAL;
> 		}
> 
> This code path unconditionally requires an owning reference (ref_obj_id
> must be non-zero). However, once a node is added to the list via
> bpf_list_push_front/back, the owning reference is converted to a
> non-owning reference (ref_obj_id becomes 0). This means bpf_list_del
> cannot be called on nodes that are actually in the list.
> 
> The sister kfunc bpf_rbtree_remove handles this by having a special
> branch in check_kfunc_args() that accepts non-owning references for
> that specific kfunc. Does bpf_list_del need similar handling to allow
> removing nodes that are actually in the list?
> 
> Reference: https://lore.kernel.org/bpf/294242698d576e4fedf243ab055c6f9a984e55033933a9bd09808f62a7018e9d@mail.kernel.org/
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25716874656

-- 
Thanks
Kaitao Cheng



* Re: [pdx86-platform-drivers-x86:review-ilpo-next 21/30] htmldocs: Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/embedded-controller/qcom,hamoa-crd-ec.yaml
From: Ilpo Järvinen @ 2026-05-12  9:34 UTC
  To: kernel test robot, Anvesh Jain P
  Cc: Sibi Sankar, oe-kbuild-all, Andy Shevchenko, Hans de Goede,
	Maya Matuszczyk, Dmitry Baryshkov, linux-doc
In-Reply-To: <202605112207.TL7dR71j-lkp@intel.com>

On Mon, 11 May 2026, kernel test robot wrote:

> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git review-ilpo-next
> head:   165e81354eefd5551358112773f24027aac59d5a
> commit: 5c44f48e91deefdd42e567a2779d331937c97cd0 [21/30] platform: arm64: Add driver for EC found on Qualcomm reference devices
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
> reproduce: (https://download.01.org/0day-ci/archive/20260511/202605112207.TL7dR71j-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202605112207.TL7dR71j-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
>    Warning: Documentation/translations/zh_CN/how-to.rst references a file that doesn't exist: Documentation/xxx/xxx.rst
>    Warning: Documentation/translations/zh_CN/networking/xfrm_proc.rst references a file that doesn't exist: Documentation/networking/xfrm_proc.rst
>    Warning: Documentation/translations/zh_CN/scsi/scsi_mid_low_api.rst references a file that doesn't exist: Documentation/Configure.help
>    Warning: MAINTAINERS references a file that doesn't exist: Documentation/ABI/testing/sysfs-platform-ayaneo
>    Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/display/bridge/megachips-stdpxxxx-ge-b850v3-fw.txt
> >> Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/embedded-controller/qcom,hamoa-crd-ec.yaml

This is an artifact of the platform tree only taking the platform drivers
patch.

The rest of the patches (bindings + dts changes) should go in together
through another tree more appropriate for them.

Once everything is put together in linux-next / Linus' tree, the warning
should no longer appear.

Thus, it looks like there's no need to act on this warning.

-- 
 i.



* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
From: Hao Jia @ 2026-05-12  9:32 UTC
  To: Yosry Ahmed, Nhat Pham
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, chengming.zhou,
	muchun.song, roman.gushchin, cgroups, linux-mm, linux-kernel,
	linux-doc, Hao Jia
In-Reply-To: <CAO9r8zNOPdpJuTmccvQ6ZAVS+tXxp-_ofA765DbnfaUZOPPO-g@mail.gmail.com>



On 2026/5/12 03:57, Yosry Ahmed wrote:
> On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>
>> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>>
>>> From: Hao Jia <jiahao1@lixiang.com>
>>>
>>> Zswap currently writes back pages to backing swap devices reactively,
>>> triggered either by memory pressure via the shrinker or by the pool
>>> reaching its size limit. This reactive approach offers no precise
>>> control over when writeback happens, which can disturb latency-sensitive
>>> workloads, and it cannot direct writeback at a specific memory cgroup.
>>> However, there are scenarios where users might want to proactively
>>> write back cold pages from zswap to the backing swap device, for
>>> example, to free up memory for other applications or to prepare for
>>> upcoming memory-intensive workloads.
>>>
>>> Therefore, implement a proactive writeback mechanism for zswap by
>>> adding a new cgroup interface file memory.zswap.proactive_writeback
>>> within the memory controller.
>>

Thanks Nhat, Yosry — let me address both comments together.

>>
>> We already have memory.reclaim, no? Would that not work to create
>> headroom generally for your use case? Is there a reason why we are
>> treating zswap memory as special here?
> 

Apologies for the lack of detailed explanation in the patch description, 
which led to the confusion.

While we are already utilizing memory.reclaim, it does not fully address 
our requirements.

Our deployment runs a userspace proactive reclaimer that drives
memory.reclaim based on the system's runtime state (memory/CPU/IO
pressure, refault rate, ...) and workload-specific policy. That first
stage compresses cold anon pages into zswap. Entries that then remain in
zswap past a policy-defined age threshold are considered "twice cold",
and the reclaimer wants to write them back to the backing swap device at
a moment of its own choosing, to further reclaim the DRAM still held by
the compressed data.

This is the "second-level offloading" pattern described in Meta's TMO 
paper [1]. zswap proactive writeback is what this series introduces to 
address that second-level offloading stage.

[1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf
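
For illustration only, the first stage described above could be driven
from userspace roughly as follows. This is a minimal sketch, not the
deployment's actual reclaimer: the helper name request_reclaim and the
example cgroup path are hypothetical, and it relies only on the cgroup v2
memory.reclaim file accepting a size such as "64M".

```shell
#!/bin/sh
# Hypothetical helper (not from the patch series): request proactive reclaim
# from one cgroup by writing a size to its cgroup v2 memory.reclaim file.
#   $1: cgroup directory, e.g. /sys/fs/cgroup/workload (assumed path)
#   $2: amount to reclaim, e.g. "64M"
request_reclaim() {
	knob="$1/memory.reclaim"

	if [ ! -w "$knob" ]; then
		echo "cannot write $knob" >&2
		return 1
	fi

	# The kernel may fail the write (EAGAIN) if it reclaims less than
	# requested, so the caller should check the return status.
	echo "$2" > "$knob"
}
```

A policy loop would call something like
"request_reclaim /sys/fs/cgroup/workload 64M" whenever its pressure and
refault signals allow. Nothing comparable exists today for the second
stage, which is the gap this series targets.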


> +1, why do we need to specifically proactively reclaim the compressed memory?
> 
> Also, if we do need to minimize the compressed memory and force higher
> writeback rates, we can do so with memory.zswap.max, right?

Here are a few reasons why memory.zswap.max is not enough:

1. Writing memory.zswap.max itself does not trigger any writeback
immediately. For a memcg that has reached steady state (on which the
userspace reclaimer is no longer invoking memory.reclaim), after enough
time has passed, the reclaimer has no good way to trigger proactive
writeback for second-level offloading by lowering memory.zswap.max,
because in steady state nothing drives the zswap_store() ->
shrink_memcg() path. The userspace reclaimer still has no control over
when proactive writeback happens.

2. memory.zswap.max currently triggers zswap writeback via zswap_store()
-> shrink_memcg(), and each over-limit event can write back at most
NR_NODES entries. If zswap residency is far above memory.zswap.max,
converging to the target size requires at least
O(over-limit pages / NR_NODES) zswap_store() events, with no batching,
so proactive writeback has significant latency.

3. memory.zswap.max is a stateful interface. If the userspace reclaimer
crashes for any reason mid-operation, it may leave memory.zswap.max at
some set value, putting the application in a persistently throttled bad
state.

4. Once the userspace reclaimer has lowered memory.zswap.max, if the
workload is rapidly expanding and triggers memory reclaim via
memory.high / kswapd / etc., the actual amount written back can exceed
what was intended.
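
To put rough numbers on point 2, the lower bound can be worked out with
a few lines of shell. This is a sketch under stated assumptions: the
function name min_store_events is hypothetical, the figures (1 GiB over
the limit, 4 KiB pages, 2 NUMA nodes) are illustrative rather than
measured, and it assumes one zswap entry per over-limit page.

```shell
#!/bin/sh
# Lower bound on the number of over-limit zswap_store() events needed to
# converge, assuming (as in point 2) at most nr_nodes entries are written
# back per event and one entry per over-limit page.
#   $1: entries over the limit   $2: number of NUMA nodes
min_store_events() {
	# ceiling division using integer arithmetic
	echo $(( ($1 + $2 - 1) / $2 ))
}

# Illustrative numbers only: 1 GiB over the limit, 4 KiB pages, 2 nodes.
pages=$(( 1024 * 1024 * 1024 / 4096 ))
min_store_events "$pages" 2    # prints 131072
```

Under those assumptions, at least 131072 separate store events would
have to occur before the target is reached, and each of them depends on
the workload actually swapping out at that moment.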

Thanks,
Hao


* htmldocs: Documentation/networking/xfrm/xfrm_migrate_state.rst:16: WARNING: Inline emphasis start-string without end-string. [docutils]
From: kernel test robot @ 2026-05-12  9:11 UTC
  To: Antony Antony; +Cc: oe-kbuild-all, 0day robot, linux-doc

tree:   https://github.com/intel-lab-lkp/linux/commits/Antony-Antony/xfrm-remove-redundant-assignments/20260512-083513
head:   f4157abfb15003887443e17542963d7b2c96cab6
commit: f4157abfb15003887443e17542963d7b2c96cab6 xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
date:   8 hours ago
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260512/202605121137.SVAB7gcL-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605121137.SVAB7gcL-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Checksumming on output with GSO
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [docutils]
>> Documentation/networking/xfrm/xfrm_migrate_state.rst:16: WARNING: Inline emphasis start-string without end-string. [docutils]
   MAINTAINERS:40: WARNING: Inline strong start-string without end-string. [docutils]
   Documentation/userspace-api/landlock:504: ./security/landlock/errata/abi-4.h:5: ERROR: Unexpected section title.


vim +16 Documentation/networking/xfrm/xfrm_migrate_state.rst

    15	
  > 16	Because IKE daemons such as *wan manage policies independently of
    17	the kernel, this interface allows precise per-SA migration without
    18	requiring policy involvement. Optional netlink attributes follow an
    19	omit-to-inherit model: omitting an attribute preserves the value from
    20	the old SA. The ``flags`` field controls two exceptions: hardware offload
    21	is inherited by default and can be suppressed with
    22	``XFRM_MIGRATE_STATE_NO_OFFLOAD`` or overridden with ``XFRMA_OFFLOAD_DEV``;
    23	the new selector is taken from ``new_sel`` by default and can instead be
    24	derived from the new addresses with ``XFRM_MIGRATE_STATE_UPDATE_SEL``.
    25	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* [PATCH RFC 5/5] selftests/dmabuf-heaps: Add dma-buf memcg accounting tests
From: Albert Esteve @ 2026-05-12  9:10 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

Add tests for the new charge_pid_fd field in struct
dma_heap_allocation_data.

When the charge_pid_fd feature is absent (unpatched kernel),
the probe in pidfd_alloc_supported() detects this and the
tests are skipped gracefully.

Add vmtest.sh similar to other subsystem suites, to orchestrate
building the selftests (optionally with a freshly compiled kernel)
inside a virtme-ng VM, so the tests can be run without modifying
the host system. Add a config fragment with required Kconfig symbols.

Also add test_memcg_dmabuf() to the existing test_memcontrol suite
to verify end-to-end cross-cgroup accounting: a parent process opens
a pidfd for a child in a separate cgroup, allocates a dma-buf via
DMA_HEAP_IOCTL_ALLOC with that pidfd, and asserts that memory.stat
dmabuf in the child's cgroup reflects the allocation. If the dmabuf
key is missing (unpatched kernel) or /dev/dma_heap/system is absent,
the test is skipped.

Assisted-by: Claude:claude-sonnet-4-6 Cursor
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 tools/testing/selftests/cgroup/Makefile            |   2 +-
 tools/testing/selftests/cgroup/test_memcontrol.c   | 143 +++++++++++++-
 tools/testing/selftests/dmabuf-heaps/config        |   1 +
 tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 126 ++++++++++++-
 tools/testing/selftests/dmabuf-heaps/vmtest.sh     | 205 +++++++++++++++++++++
 5 files changed, 473 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index e01584c2189ac..9edfc9f1de5c4 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-CFLAGS += -Wall -pthread
+CFLAGS += -Wall -pthread $(KHDR_INCLUDES)
 
 all: ${HELPER_PROGS}
 
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index b43da9bc20c49..b6a228407530f 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -19,9 +19,17 @@
 #include <errno.h>
 #include <sys/mman.h>
 
+#include <linux/dma-heap.h>
+#include <signal.h>
+#include <sys/ioctl.h>
+
+#include "../pidfd/pidfd.h"
 #include "kselftest.h"
 #include "cgroup_util.h"
 
+#define DMA_HEAP_SYSTEM		"/dev/dma_heap/system"
+#define ONE_MEG			(1024 * 1024)
+
 #define MEMCG_SOCKSTAT_WAIT_RETRIES        30
 
 static bool has_localevents;
@@ -1762,6 +1770,125 @@ static int test_memcg_inotify_delete_dir(const char *root)
 	return ret;
 }
 
+static int memcg_dmabuf_child(const char *cgroup, void *arg)
+{
+	pause();
+	return 0;
+}
+
+/*
+ * This test allocates a dma-buf via DMA_HEAP_IOCTL_ALLOC with a pidfd
+ * pointing to a child process in a separate cgroup, then checks that
+ * memory.stat[dmabuf] in the child's cgroup rises by the allocation size
+ * and returns to zero after the buffer fd is closed.
+ */
+static int test_memcg_dmabuf(const char *root)
+{
+	char *parent = NULL, *child_cg = NULL;
+	int ret = KSFT_FAIL;
+	int heap_fd = -1, dmabuf_fd = -1, pidfd = -1;
+	pid_t child_pid;
+	int child_status;
+	long dmabuf_stat;
+	struct dma_heap_allocation_data alloc = {
+		.len      = ONE_MEG,
+		.fd_flags = O_RDWR | O_CLOEXEC,
+	};
+
+	if (access(DMA_HEAP_SYSTEM, R_OK | W_OK)) {
+		ret = KSFT_SKIP;
+		goto cleanup;
+	}
+
+	parent = cg_name(root, "dmabuf_memcg_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup_parent;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup_parent;
+
+	child_cg = cg_name(parent, "child");
+	if (!child_cg)
+		goto cleanup_parent;
+
+	if (cg_create(child_cg))
+		goto cleanup_parent;
+
+	child_pid = cg_run_nowait(child_cg, memcg_dmabuf_child, NULL);
+	if (child_pid < 0)
+		goto cleanup_child;
+
+	if (cg_wait_for_proc_count(child_cg, 1))
+		goto cleanup_kill;
+
+	pidfd = sys_pidfd_open(child_pid, 0);
+	if (pidfd < 0) {
+		ret = KSFT_SKIP;
+		goto cleanup_kill;
+	}
+
+	heap_fd = open(DMA_HEAP_SYSTEM, O_RDWR);
+	if (heap_fd < 0) {
+		ret = KSFT_SKIP;
+		goto cleanup_pidfd;
+	}
+
+	alloc.charge_pid_fd = (__u32)pidfd;
+	if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0)
+		goto cleanup_heap;
+	dmabuf_fd = (int)alloc.fd;
+
+	dmabuf_stat = cg_read_key_long(child_cg, "memory.stat", "dmabuf ");
+	if (dmabuf_stat == -1) {
+		ret = KSFT_SKIP;
+		goto cleanup_dmabuf;
+	}
+	if (dmabuf_stat != ONE_MEG)
+		dmabuf_stat = cg_read_key_long_poll(child_cg, "memory.stat",
+						    "dmabuf ", ONE_MEG,
+						    15, 200000);
+	if (dmabuf_stat != ONE_MEG) {
+		fprintf(stderr, "Expected dmabuf stat %d, got %ld\n",
+			ONE_MEG, dmabuf_stat);
+		goto cleanup_dmabuf;
+	}
+
+	close(dmabuf_fd);
+	dmabuf_fd = -1;
+
+	dmabuf_stat = cg_read_key_long_poll(child_cg, "memory.stat",
+					    "dmabuf ", 0, 15, 200000);
+	if (dmabuf_stat != 0) {
+		fprintf(stderr, "Expected dmabuf stat 0 after close, got %ld\n",
+			dmabuf_stat);
+		goto cleanup_heap;
+	}
+
+	ret = KSFT_PASS;
+
+cleanup_dmabuf:
+	if (dmabuf_fd >= 0)
+		close(dmabuf_fd);
+cleanup_heap:
+	close(heap_fd);
+cleanup_pidfd:
+	close(pidfd);
+cleanup_kill:
+	kill(child_pid, SIGTERM);
+	waitpid(child_pid, &child_status, 0);
+cleanup_child:
+	cg_destroy(child_cg);
+	free(child_cg);
+cleanup_parent:
+	cg_destroy(parent);
+	free(parent);
+cleanup:
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct memcg_test {
 	int (*fn)(const char *root);
@@ -1783,16 +1910,26 @@ struct memcg_test {
 	T(test_memcg_oom_group_score_events),
 	T(test_memcg_inotify_delete_file),
 	T(test_memcg_inotify_delete_dir),
+	T(test_memcg_dmabuf),
 };
 #undef T
 
 int main(int argc, char **argv)
 {
 	char root[PATH_MAX];
-	int i, proc_status;
+	int i, proc_status, plan;
+	const char *filter = NULL;
+
+	if (argc > 1)
+		filter = argv[1];
+
+	plan = 0;
+	for (i = 0; i < ARRAY_SIZE(tests); i++)
+		if (!filter || !strcmp(tests[i].name, filter))
+			plan++;
 
 	ksft_print_header();
-	ksft_set_plan(ARRAY_SIZE(tests));
+	ksft_set_plan(plan);
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");
 
@@ -1818,6 +1955,8 @@ int main(int argc, char **argv)
 	has_localevents = proc_status;
 
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (filter && strcmp(tests[i].name, filter))
+			continue;
 		switch (tests[i].fn(root)) {
 		case KSFT_PASS:
 			ksft_test_result_pass("%s\n", tests[i].name);
diff --git a/tools/testing/selftests/dmabuf-heaps/config b/tools/testing/selftests/dmabuf-heaps/config
index be091f1cdfa04..94c8f33b71a28 100644
--- a/tools/testing/selftests/dmabuf-heaps/config
+++ b/tools/testing/selftests/dmabuf-heaps/config
@@ -1,3 +1,4 @@
+CONFIG_MEMCG=y
 CONFIG_DMABUF_HEAPS=y
 CONFIG_DMABUF_HEAPS_SYSTEM=y
 CONFIG_DRM_VGEM=y
diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
index fc9694fc4e89e..904332b17698a 100644
--- a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
+++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
@@ -3,6 +3,7 @@
 #include <dirent.h>
 #include <errno.h>
 #include <fcntl.h>
+#include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
@@ -10,11 +11,14 @@
 #include <unistd.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
+#include <sys/syscall.h>
 #include <sys/types.h>
+#include <sys/wait.h>
 
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
 #include <drm/drm.h>
+#include "../pidfd/pidfd.h"
 #include "kselftest.h"
 
 #define DEVPATH "/dev/dma_heap"
@@ -320,6 +324,8 @@ static int dmabuf_heap_alloc_newer(int fd, size_t len, unsigned int flags,
 		__u32 fd;
 		__u32 fd_flags;
 		__u64 heap_flags;
+		__u32 charge_pid_fd;
+		__u32 __padding;
 		__u64 garbage1;
 		__u64 garbage2;
 		__u64 garbage3;
@@ -328,6 +334,8 @@ static int dmabuf_heap_alloc_newer(int fd, size_t len, unsigned int flags,
 		.fd = 0,
 		.fd_flags = O_RDWR | O_CLOEXEC,
 		.heap_flags = flags,
+		.charge_pid_fd = 0,
+		.__padding = 0,
 		.garbage1 = 0xffffffff,
 		.garbage2 = 0x88888888,
 		.garbage3 = 0x11111111,
@@ -390,6 +398,120 @@ static void test_alloc_errors(char *heap_name)
 	close(heap_fd);
 }
 
+static int dmabuf_heap_alloc_pidfd(int fd, size_t len, unsigned int heap_flags,
+				   unsigned int charge_pid_fd, int *dmabuf_fd)
+{
+	struct dma_heap_allocation_data data = {
+		.len = len,
+		.fd = 0,
+		.fd_flags = O_RDWR | O_CLOEXEC,
+		.heap_flags = heap_flags,
+		.charge_pid_fd = charge_pid_fd,
+	};
+	int ret;
+
+	if (!dmabuf_fd)
+		return -EINVAL;
+
+	ret = ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data);
+	if (ret < 0)
+		return ret;
+	*dmabuf_fd = (int)data.fd;
+	return ret;
+}
+
+/*
+ * Probe whether the kernel honours charge_pid_fd in DMA_HEAP_IOCTL_ALLOC.
+ */
+static bool pidfd_alloc_supported(int heap_fd)
+{
+	int devnull_fd, dmabuf_fd = -1, ret;
+
+	devnull_fd = open("/dev/null", O_RDONLY);
+	if (devnull_fd < 0)
+		return false;
+
+	ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, devnull_fd, &dmabuf_fd);
+	if (dmabuf_fd >= 0) {
+		close(dmabuf_fd);
+		dmabuf_fd = -1;
+	}
+	close(devnull_fd);
+	return ret < 0;
+}
+
+/*
+ * Test: allocate charging the calling process's own cgroup via a self pidfd.
+ */
+static void test_alloc_pidfd_self(char *heap_name)
+{
+	int heap_fd = -1, pidfd = -1, dmabuf_fd = -1, ret;
+
+	heap_fd = dmabuf_heap_open(heap_name);
+
+	if (!pidfd_alloc_supported(heap_fd)) {
+		ksft_test_result_skip("charge_pid_fd not supported by this kernel\n");
+		goto out;
+	}
+
+	pidfd = sys_pidfd_open(getpid(), 0);
+	if (pidfd < 0) {
+		ksft_test_result_skip("pidfd_open not available\n");
+		goto out;
+	}
+
+	ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, pidfd, &dmabuf_fd);
+	ksft_test_result(!ret, "Allocation with self pidfd %d\n", ret);
+	if (dmabuf_fd >= 0)
+		close(dmabuf_fd);
+	close(pidfd);
+out:
+	close(heap_fd);
+}
+
+/*
+ * Test: allocate charging a child process's cgroup via a child pidfd.
+ */
+static void test_alloc_pidfd_child(char *heap_name)
+{
+	int heap_fd = -1, pidfd = -1, dmabuf_fd = -1;
+	pid_t child_pid;
+	int status, ret;
+
+	heap_fd = dmabuf_heap_open(heap_name);
+
+	if (!pidfd_alloc_supported(heap_fd)) {
+		ksft_test_result_skip("charge_pid_fd not supported by this kernel\n");
+		goto out;
+	}
+
+	child_pid = fork();
+	if (child_pid == 0) {
+		pause();
+		_exit(0);
+	}
+	if (child_pid < 0)
+		ksft_exit_fail_msg("fork failed: %s\n", strerror(errno));
+
+	pidfd = sys_pidfd_open(child_pid, 0);
+	if (pidfd < 0) {
+		kill(child_pid, SIGTERM);
+		waitpid(child_pid, &status, 0);
+		ksft_test_result_skip("pidfd_open for child failed\n");
+		goto out;
+	}
+
+	ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, pidfd, &dmabuf_fd);
+	ksft_test_result(!ret, "Allocation with child pidfd %d\n", ret);
+	if (dmabuf_fd >= 0)
+		close(dmabuf_fd);
+	close(pidfd);
+	kill(child_pid, SIGTERM);
+	waitpid(child_pid, &status, 0);
+out:
+	close(heap_fd);
+}
+
 static int numer_of_heaps(void)
 {
 	DIR *d = opendir(DEVPATH);
@@ -420,7 +542,7 @@ int main(void)
 		return KSFT_SKIP;
 	}
 
-	ksft_set_plan(11 * numer_of_heaps());
+	ksft_set_plan(13 * numer_of_heaps());
 
 	while ((dir = readdir(d))) {
 		if (!strncmp(dir->d_name, ".", 2))
@@ -435,6 +557,8 @@ int main(void)
 		test_alloc_zeroed(dir->d_name, ONE_MEG);
 		test_alloc_compat(dir->d_name);
 		test_alloc_errors(dir->d_name);
+		test_alloc_pidfd_self(dir->d_name);
+		test_alloc_pidfd_child(dir->d_name);
 	}
 	closedir(d);
 
diff --git a/tools/testing/selftests/dmabuf-heaps/vmtest.sh b/tools/testing/selftests/dmabuf-heaps/vmtest.sh
new file mode 100755
index 0000000000000..6f1a878384127
--- /dev/null
+++ b/tools/testing/selftests/dmabuf-heaps/vmtest.sh
@@ -0,0 +1,205 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2026 Red Hat
+#
+# Dependencies:
+#		* virtme-ng
+#		* qemu	(used by virtme-ng)
+
+readonly SCRIPT_DIR="$(cd -P -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)"
+readonly KERNEL_CHECKOUT=$(realpath "${SCRIPT_DIR}"/../../../../)
+readonly CGROUP_DIR="${KERNEL_CHECKOUT}/tools/testing/selftests/cgroup"
+
+source "${SCRIPT_DIR}"/../kselftest/ktap_helpers.sh
+
+readonly DMABUF_HEAP_TEST="${SCRIPT_DIR}"/dmabuf-heap
+readonly MEMCONTROL_TEST="${CGROUP_DIR}"/test_memcontrol
+readonly TMP_DIR=$(mktemp -d /tmp/dmabuf-vmtest.XXXXXXXX)
+
+VERBOSE=false
+BUILD=false
+BUILD_HOST=""
+BUILD_HOST_PODMAN_CONTAINER_NAME=""
+
+usage() {
+	echo
+	echo "$0 [OPTIONS]"
+	echo
+	echo "Options"
+	echo "  -b: build the kernel from the current source tree and use it for the VM"
+	echo "  -H: hostname for remote build host (used with -b)"
+	echo "  -p: podman container name for remote build host (used with -b)"
+	echo "      Example: -H beefyserver -p vng"
+
+	echo "  -v: enable verbose vng/qemu output"
+	echo
+
+	exit 1
+}
+
+die() {
+	echo "$*" >&2
+	exit "${KSFT_FAIL}"
+}
+
+cleanup() {
+	rm -rf "${TMP_DIR}"
+}
+
+check_deps() {
+	for dep in vng make; do
+		if [[ ! -x $(command -v "${dep}") ]]; then
+			echo -e "skip:    dependency ${dep} not found!\n"
+			exit "${KSFT_SKIP}"
+		fi
+	done
+
+	if [[ ! -x "${DMABUF_HEAP_TEST}" ]]; then
+		printf "skip:    %s not found!" "${DMABUF_HEAP_TEST}"
+		printf " Please build the kselftest dmabuf-heaps target (or use -b).\n"
+		exit "${KSFT_SKIP}"
+	fi
+
+	if [[ ! -x "${MEMCONTROL_TEST}" ]]; then
+		printf "skip:    %s not found!" "${MEMCONTROL_TEST}"
+		printf " Please build the kselftest cgroup target (or use -b).\n"
+		exit "${KSFT_SKIP}"
+	fi
+}
+
+check_vng() {
+	local tested_versions=("1.36" "1.37")
+	local version
+	local ok=0
+
+	version="$(vng --version)"
+	for tv in "${tested_versions[@]}"; do
+		if [[ "${version}" == *"${tv}"* ]]; then
+			ok=1
+			break
+		fi
+	done
+
+	if [[ "${ok}" -eq 0 ]]; then
+		printf "warning: vng version '%s' has not been tested and may " "${version}" >&2
+		printf "not function properly.\n\tThe following versions have been tested: " >&2
+		echo "${tested_versions[@]}" >&2
+	fi
+}
+
+build_selftests() {
+	make -C "${KERNEL_CHECKOUT}" headers_install \
+		INSTALL_HDR_PATH="${TMP_DIR}/usr" -j"$(nproc)"
+
+	local khdr="-isystem ${TMP_DIR}/usr/include"
+
+	if ! make -C "${SCRIPT_DIR}" KHDR_INCLUDES="${khdr}" -j"$(nproc)"; then
+		die "failed to build dmabuf-heaps selftests"
+	fi
+
+	if ! make -C "${CGROUP_DIR}" KHDR_INCLUDES="${khdr}" \
+		"${MEMCONTROL_TEST}" -j"$(nproc)"; then
+		die "failed to build cgroup/test_memcontrol selftest"
+	fi
+}
+
+handle_build() {
+	if ! ${BUILD}; then
+		return
+	fi
+
+	if [[ ! -d "${KERNEL_CHECKOUT}" ]]; then
+		echo "-b requires vmtest.sh called from the kernel source tree" >&2
+		exit 1
+	fi
+
+	pushd "${KERNEL_CHECKOUT}" &>/dev/null
+
+	if ! vng --kconfig --config "${SCRIPT_DIR}/config"; then
+		die "failed to generate .config for kernel source tree (${KERNEL_CHECKOUT})"
+	fi
+
+	local vng_args=("-v" "--config" "${SCRIPT_DIR}/config" "--build")
+
+	if [[ -n "${BUILD_HOST}" ]]; then
+		vng_args+=("--build-host" "${BUILD_HOST}")
+	fi
+
+	if [[ -n "${BUILD_HOST_PODMAN_CONTAINER_NAME}" ]]; then
+		vng_args+=("--build-host-exec-prefix" \
+			   "podman exec -ti ${BUILD_HOST_PODMAN_CONTAINER_NAME}")
+	fi
+
+	if ! vng "${vng_args[@]}"; then
+		die "failed to build kernel from source tree (${KERNEL_CHECKOUT})"
+	fi
+
+	build_selftests
+
+	popd &>/dev/null
+}
+
+make_runner() {
+	# virtme-ng shares the host filesystem, so TMP_DIR is accessible
+	# inside the VM at the same absolute path.
+	cat > "${TMP_DIR}/run_tests.sh" <<-EOF
+	#!/bin/sh
+	set -u
+	PASS=0; FAIL=0; SKIP=0; N=0
+
+	run() {
+		name="\$1"; shift
+		N=\$((N+1))
+		"\$@"; rc=\$?
+		if   [ \$rc -eq 0 ]; then echo "ok \$N \$name";        PASS=\$((PASS+1))
+		elif [ \$rc -eq 4 ]; then echo "ok \$N \$name # SKIP"; SKIP=\$((SKIP+1))
+		else                      echo "not ok \$N \$name";    FAIL=\$((FAIL+1))
+		fi
+	}
+
+	run "dmabuf-heap charge_pid_fd ioctl"	${DMABUF_HEAP_TEST}
+	run "memcontrol dma-buf memcg"  ${MEMCONTROL_TEST} test_memcg_dmabuf
+	echo "# PASS=\$PASS SKIP=\$SKIP FAIL=\$FAIL"
+	[ \$FAIL -eq 0 ]
+	EOF
+	chmod +x "${TMP_DIR}/run_tests.sh"
+}
+
+run_vm() {
+	local verbose_opt=""
+	local kernel_opt=""
+
+	${VERBOSE} && verbose_opt="--verbose"
+
+	# If we are running from within the kernel source tree, use the kernel
+	# source tree as the kernel to boot, otherwise use the running kernel.
+	if [[ "$(realpath "$(pwd)")" == "${KERNEL_CHECKOUT}"* ]]; then
+		kernel_opt="${KERNEL_CHECKOUT}"
+	fi
+
+	vng --run ${kernel_opt} ${verbose_opt} --user root --memory 512M \
+		--exec "${TMP_DIR}/run_tests.sh"
+}
+
+while getopts :hvbH:p: o
+do
+	case $o in
+	v) VERBOSE=true;;
+	b) BUILD=true;;
+	H) BUILD_HOST=$OPTARG;;
+	p) BUILD_HOST_PODMAN_CONTAINER_NAME=$OPTARG;;
+	h|*) usage;;
+	esac
+done
+shift $((OPTIND-1))
+
+trap cleanup EXIT
+
+check_vng
+handle_build
+check_deps
+make_runner
+
+echo "Booting VM and running tests..."
+run_vm

-- 
2.53.0



* [PATCH RFC 4/5] selinux: Restrict cross-cgroup dma-heap charging
From: Albert Esteve @ 2026-05-12  9:10 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

The security_dma_heap_alloc() hook allows security modules
to control which processes may charge dma-buf allocations
to another process's cgroup via the charge_pid_fd field of
DMA_HEAP_IOCTL_ALLOC. Without a policy implementation, the
hook is a no-op and the restriction is not enforced.

On SELinux-managed systems any domain with access to a
dma-heap device node can therefore exhaust another cgroup's
memory budget without restriction.

Implement selinux_dma_heap_alloc() using avc_has_perm() with
a new dma_heap object class and a charge_to permission. Policy
authors can then grant cross-cgroup charging selectively,
for example:

  allow allocator_app_t client_app_t:dma_heap charge_to;

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 security/selinux/hooks.c            | 7 +++++++
 security/selinux/include/classmap.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 0f704380a8c81..ea1f410b9f619 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2189,6 +2189,12 @@ static int selinux_capable(const struct cred *cred, struct user_namespace *ns,
 	return cred_has_capability(cred, cap, opts, ns == &init_user_ns);
 }
 
+static int selinux_dma_heap_alloc(const struct cred *from, const struct cred *to)
+{
+	return avc_has_perm(cred_sid(from), cred_sid(to),
+			    SECCLASS_DMA_HEAP, DMA_HEAP__CHARGE_TO, NULL);
+}
+
 static int selinux_quotactl(int cmds, int type, int id, const struct super_block *sb)
 {
 	const struct cred *cred = current_cred();
@@ -7541,6 +7547,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(capget, selinux_capget),
 	LSM_HOOK_INIT(capset, selinux_capset),
 	LSM_HOOK_INIT(capable, selinux_capable),
+	LSM_HOOK_INIT(dma_heap_alloc, selinux_dma_heap_alloc),
 	LSM_HOOK_INIT(quotactl, selinux_quotactl),
 	LSM_HOOK_INIT(quota_on, selinux_quota_on),
 	LSM_HOOK_INIT(syslog, selinux_syslog),
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 90cb61b164256..d232f7808f6b8 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -181,6 +181,7 @@ const struct security_class_mapping secclass_map[] = {
 	{ "user_namespace", { "create", NULL } },
 	{ "memfd_file",
 	  { COMMON_FILE_PERMS, "execute_no_trans", "entrypoint", NULL } },
+	{ "dma_heap", { "charge_to", NULL } },
 	/* last one */ { NULL, {} }
 };
 

-- 
2.53.0



* [PATCH RFC 3/5] security: dma-heap: Add dma_heap_alloc LSM hook
From: Albert Esteve @ 2026-05-12  9:10 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

DMA_HEAP_IOCTL_ALLOC accepts a charge_pid_fd field that,
when set, causes the allocation to be charged to an arbitrary
process's cgroup rather than the caller's.

Without an access-control point, any process that holds a handle
to a dma-heap device node can charge unlimited memory to any other
process's cgroup, potentially exhausting that cgroup's limit and
triggering OOM kills independent of the victim's own activity or
privileges.

Add security_dma_heap_alloc(), called in dma_heap_ioctl_allocate()
when charge_pid_fd refers to another process. The hook receives
the credentials of the allocating process (from) and the credentials
of the process whose cgroup will be charged (to), giving security
modules a controlled enforcement point for cross-cgroup dma-buf
attribution policy.

When CONFIG_SECURITY is not set the hook compiles to an inline
returning 0, adding no overhead to the fast path.
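
As a rough illustration of the dispatch semantics described above, here is
a small userspace model. It is only a sketch: struct model_cred,
model_policy() and model_security_dma_heap_alloc() are hypothetical names,
not kernel symbols, and the policy shown is an arbitrary example. It
assumes the hook returns 0 by default and that the first non-zero result
from a registered module is returned to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cred: only a security label. */
struct model_cred {
	int sid;
};

typedef int (*dma_heap_alloc_hook)(const struct model_cred *from,
				   const struct model_cred *to);

/* Arbitrary example policy: only label 1 may charge to someone else. */
static int model_policy(const struct model_cred *from,
			const struct model_cred *to)
{
	(void)to;
	return from->sid == 1 ? 0 : -EACCES;
}

/*
 * Model of the dispatch: walk the registered hooks and stop at the first
 * non-zero result. With no hooks registered, the default of 0 is returned,
 * which mirrors the "no policy, no restriction" case.
 */
static int model_security_dma_heap_alloc(const dma_heap_alloc_hook *hooks,
					 size_t nr_hooks,
					 const struct model_cred *from,
					 const struct model_cred *to)
{
	size_t i;

	for (i = 0; i < nr_hooks; i++) {
		int rc = hooks[i](from, to);

		if (rc)
			return rc;
	}
	return 0;
}
```

In this model an empty hook list always allows the charge, while a
registered policy can deny it with a negative errno that the ioctl would
then return to userspace.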

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 drivers/dma-buf/dma-heap.c    | 12 +++++++++++-
 include/linux/lsm_hook_defs.h |  1 +
 include/linux/security.h      |  7 +++++++
 security/security.c           | 16 ++++++++++++++++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index ff6e259afcdc0..e8ffb1031955e 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -18,6 +18,7 @@
 #include <linux/list.h>
 #include <linux/nospec.h>
 #include <linux/pidfd.h>
+#include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -122,12 +123,13 @@ static int dma_heap_open(struct inode *inode, struct file *file)
 
 static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
+	const struct cred *tcred;
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
 	struct mem_cgroup *memcg = NULL;
 	struct task_struct *task;
 	unsigned int pidfd_flags;
-	int fd;
+	int fd, ret;
 
 	if (heap_allocation->fd)
 		return -EINVAL;
@@ -143,6 +145,14 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 		if (IS_ERR(task))
 			return PTR_ERR(task);
 
+		tcred = get_task_cred(task);
+		ret = security_dma_heap_alloc(current_cred(), tcred);
+		put_cred(tcred);
+		if (ret) {
+			put_task_struct(task);
+			return ret;
+		}
+
 		memcg = get_mem_cgroup_from_mm(task->mm);
 		put_task_struct(task);
 	}
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 2b8dfb35caed3..6a91656f97e1e 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -43,6 +43,7 @@ LSM_HOOK(int, 0, capset, struct cred *new, const struct cred *old,
 	 const kernel_cap_t *permitted)
 LSM_HOOK(int, 0, capable, const struct cred *cred, struct user_namespace *ns,
 	 int cap, unsigned int opts)
+LSM_HOOK(int, 0, dma_heap_alloc, const struct cred *from, const struct cred *to)
 LSM_HOOK(int, 0, quotactl, int cmds, int type, int id, const struct super_block *sb)
 LSM_HOOK(int, 0, quota_on, struct dentry *dentry)
 LSM_HOOK(int, 0, syslog, int type)
diff --git a/include/linux/security.h b/include/linux/security.h
index 41d7367cf4036..f1dad1eabe754 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -350,6 +350,7 @@ int security_capable(const struct cred *cred,
 		       struct user_namespace *ns,
 		       int cap,
 		       unsigned int opts);
+int security_dma_heap_alloc(const struct cred *from, const struct cred *to);
 int security_quotactl(int cmds, int type, int id, const struct super_block *sb);
 int security_quota_on(struct dentry *dentry);
 int security_syslog(int type);
@@ -701,6 +702,12 @@ static inline int security_capable(const struct cred *cred,
 	return cap_capable(cred, ns, cap, opts);
 }
 
+static inline int security_dma_heap_alloc(const struct cred *from,
+					  const struct cred *to)
+{
+	return 0;
+}
+
 static inline int security_quotactl(int cmds, int type, int id,
 				     const struct super_block *sb)
 {
diff --git a/security/security.c b/security/security.c
index 4e999f0236516..4adacef73c507 100644
--- a/security/security.c
+++ b/security/security.c
@@ -660,6 +660,22 @@ int security_capable(const struct cred *cred,
 	return call_int_hook(capable, cred, ns, cap, opts);
 }
 
+/**
+ * security_dma_heap_alloc() - Check if cross-cgroup dma-heap charging is allowed
+ * @from: credentials of the allocating process
+ * @to: credentials of the process to charge
+ *
+ * Check whether the process with credentials @from is allowed to allocate
+ * dma-heap memory and charge it to the cgroup of the process with credentials
+ * @to.
+ *
+ * Return: Returns 0 if permission is granted.
+ */
+int security_dma_heap_alloc(const struct cred *from, const struct cred *to)
+{
+	return call_int_hook(dma_heap_alloc, from, to);
+}
+
 /**
  * security_quotactl() - Check if a quotactl() syscall is allowed for this fs
  * @cmds: commands

-- 
2.53.0



* [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-12  9:10 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

On embedded platforms a central process often allocates dma-buf
memory on behalf of client applications. Without a way to
attribute the charge to the requesting client's cgroup, the
cost lands on the allocator, making per-cgroup memory limits
ineffective for the actual consumers.

Add charge_pid_fd to struct dma_heap_allocation_data. When set to
a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
the mem_accounting module parameter enabled, the buffer is charged
to the allocator's own cgroup.
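
The selection of the cgroup to charge, as described above, can be
sketched in userspace as follows. This is a simplified model, not the
kernel code: cgroups are represented by plain integer ids (0 meaning
"no charge"), MODEL_PAGE_SIZE assumes 4 KiB pages, and the helper names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Number of pages charged for a buffer of len bytes, rounded up. */
static unsigned long model_nr_pages(unsigned long len)
{
	return (len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/*
 * Pick the cgroup id to charge:
 *  - an explicit target (resolved from charge_pid_fd) wins;
 *  - otherwise the allocator's own cgroup, if mem_accounting is enabled;
 *  - otherwise nothing is charged (0).
 */
static int model_pick_memcg(int charge_to, bool mem_accounting,
			    int current_memcg)
{
	if (charge_to)
		return charge_to;
	if (mem_accounting)
		return current_memcg;
	return 0;
}
```

The point of the ordering is that an explicit target takes effect even
when mem_accounting is disabled, while the implicit charge to the
allocator stays opt-in.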

Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
page allocations. Keeping __GFP_ACCOUNT would charge the same pages
twice (once to kmem, once to MEMCG_DMABUF), so remove it and route
all accounting through a single MEMCG_DMABUF path.

Usage examples:

  1. Central allocator charging to a client at allocation time.
     The allocator knows the client's PID (e.g., from binder's
     sender_pid) and uses pidfd to attribute the charge:

       pid_t client_pid = txn->sender_pid;
       int pidfd = pidfd_open(client_pid, 0);

       struct dma_heap_allocation_data alloc = {
           .len             = buffer_size,
           .fd_flags        = O_RDWR | O_CLOEXEC,
           .charge_pid_fd   = pidfd,
       };
       ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
       close(pidfd);
       /* alloc.fd is now charged to client's cgroup */

  2. Default allocation (no pidfd, mem_accounting=1).
     When charge_pid_fd is not set and the mem_accounting module
     parameter is enabled, the buffer is charged to the allocator's
     own cgroup:

       struct dma_heap_allocation_data alloc = {
           .len      = buffer_size,
           .fd_flags = O_RDWR | O_CLOEXEC,
       };
       ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
       /* charged to current process's cgroup */
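
For reference, the request layout used in both examples can be modelled
with a local mirror of the extended structure. This is a sketch only:
struct model_dma_heap_allocation_data and model_alloc_request() are
hypothetical names, and the mirror is not the real UAPI header. It
assumes the existing leading 64-bit len field followed by the fields
touched in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the extended request layout (illustration only). */
struct model_dma_heap_allocation_data {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd;	/* 0 means: charge the caller's cgroup */
	uint32_t padding;	/* reserved, must be zero */
};

/*
 * Build a request. The designated initializer zeroes every field that is
 * not named, so fd and padding start out as 0.
 */
static struct model_dma_heap_allocation_data
model_alloc_request(uint64_t len, uint32_t fd_flags, uint32_t pidfd)
{
	struct model_dma_heap_allocation_data req = {
		.len = len,
		.fd_flags = fd_flags,
		.charge_pid_fd = pidfd,
	};

	return req;
}
```

With the explicit 32-bit padding member the structure has no implicit
holes, so its size stays a multiple of 8 bytes on both 32-bit and 64-bit
userspace.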

Current limitations:

 - Single-owner model: a dma-buf carries one memcg charge regardless of
   how many processes share it. This means only the first owner (and
   exporter) of the shared buffer bears the charge.
 - Only memcg accounting is supported. While this makes sense for system
   heap buffers, other heaps (e.g., CMA heaps) will also require
   selective charging to the dmem controller.

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  5 ++--
 drivers/dma-buf/dma-buf.c               | 16 ++++---------
 drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
 drivers/dma-buf/heaps/system_heap.c     |  2 --
 include/uapi/linux/dma-heap.h           |  6 +++++
 5 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8bdbc2e866430..824d269531eb1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1636,8 +1636,9 @@ The following nested keys are defined.
 		structures.
 
 	  dmabuf (npn)
-		Amount of memory used for exported DMA buffers allocated by the cgroup.
-		Stays with the allocating cgroup regardless of how the buffer is shared.
+		Amount of memory used for exported DMA buffers allocated by or on
+		behalf of the cgroup. Stays with the allocating cgroup regardless
+		of how the buffer is shared.
 
 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ce02377f48908..23fb758b78297 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
 
-	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
-	mem_cgroup_put(dmabuf->memcg);
+	if (dmabuf->memcg) {
+		mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
+					  PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+		mem_cgroup_put(dmabuf->memcg);
+	}
 
 	dmabuf->ops->release(dmabuf);
 
@@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}
 
-	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
-				      GFP_KERNEL)) {
-		ret = -ENOMEM;
-		goto err_memcg;
-	}
-
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 
 	return dmabuf;
 
-err_memcg:
-	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index ac5f8685a6494..ff6e259afcdc0 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,13 +7,17 @@
  */
 
 #include <linux/cdev.h>
+#include <linux/cgroup.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include <linux/err.h>
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/nospec.h>
+#include <linux/pidfd.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
 		 "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
 
 static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
-				 u32 fd_flags,
-				 u64 heap_flags)
+				 u32 fd_flags, u64 heap_flags,
+				 struct mem_cgroup *charge_to)
 {
 	struct dma_buf *dmabuf;
+	unsigned int nr_pages;
+	struct mem_cgroup *memcg = charge_to;
 	int fd;
 
 	/*
@@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
 	if (IS_ERR(dmabuf))
 		return PTR_ERR(dmabuf);
 
+	nr_pages = len / PAGE_SIZE;
+
+	if (memcg)
+		css_get(&memcg->css);
+	else if (mem_accounting)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
+	if (memcg) {
+		if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			dma_buf_put(dmabuf);
+			return -ENOMEM;
+		}
+		dmabuf->memcg = memcg;
+	}
+
 	fd = dma_buf_fd(dmabuf, fd_flags);
 	if (fd < 0) {
 		dma_buf_put(dmabuf);
@@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
+	struct mem_cgroup *memcg = NULL;
+	struct task_struct *task;
+	unsigned int pidfd_flags;
 	int fd;
 
 	if (heap_allocation->fd)
@@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
 		return -EINVAL;
 
+	if (heap_allocation->charge_pid_fd) {
+		task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
+		if (IS_ERR(task))
+			return PTR_ERR(task);
+
+		memcg = get_mem_cgroup_from_mm(task->mm);
+		put_task_struct(task);
+	}
+
 	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
 				   heap_allocation->fd_flags,
-				   heap_allocation->heap_flags);
+				   heap_allocation->heap_flags,
+				   memcg);
+	mem_cgroup_put(memcg);
 	if (fd < 0)
 		return fd;
 
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 03c2b87cb1112..95d7688167b93 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
 		if (max_order < orders[i])
 			continue;
 		flags = order_flags[i];
-		if (mem_accounting)
-			flags |= __GFP_ACCOUNT;
 		page = alloc_pages(flags, orders[i]);
 		if (!page)
 			continue;
diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
index a4cf716a49fa6..e02b0f8cbc6a1 100644
--- a/include/uapi/linux/dma-heap.h
+++ b/include/uapi/linux/dma-heap.h
@@ -29,6 +29,10 @@
  *			handle to the allocated dma-buf
  * @fd_flags:		file descriptor flags used when allocating
  * @heap_flags:		flags passed to heap
+ * @charge_pid_fd:	optional pidfd of the process whose cgroup should be
+ *			charged for this allocation; 0 means charge the calling
+ *			process's cgroup
+ * @__padding:		reserved, must be zero
  *
  * Provided by userspace as an argument to the ioctl
  */
@@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
 	__u32 fd;
 	__u32 fd_flags;
 	__u64 heap_flags;
+	__u32 charge_pid_fd;
+	__u32 __padding;
 };
 
 #define DMA_HEAP_IOC_MAGIC		'H'

-- 
2.53.0



* [PATCH RFC 1/5] memcg: Track exported dma-buffers
From: Albert Esteve @ 2026-05-12  9:10 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

From: "T.J. Mercier" <tjmercier@google.com>

When a buffer is exported to userspace, use memcg to attribute the
buffer to the allocating cgroup until all buffer references are
released.

Unlike the dmabuf sysfs stats implementation, this memcg accounting
avoids contention over the kernfs_rwsem incurred when creating or
removing nodes.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 drivers/dma-buf/dma-buf.c               | 13 ++++++++++++
 include/linux/dma-buf.h                 |  4 ++++
 include/linux/memcontrol.h              | 37 +++++++++++++++++++++++++++++++++
 mm/memcontrol.c                         | 19 +++++++++++++++++
 5 files changed, 77 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed995..8bdbc2e866430 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1635,6 +1635,10 @@ The following nested keys are defined.
 		Amount of memory used for storing in-kernel data
 		structures.
 
+	  dmabuf (npn)
+		Amount of memory used for exported DMA buffers allocated by the cgroup.
+		Stays with the allocating cgroup regardless of how the buffer is shared.
+
 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
 
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 71f37544a5c61..ce02377f48908 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -14,6 +14,7 @@
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/dma-buf.h>
+#include <linux/memcontrol.h>
 #include <linux/dma-fence.h>
 #include <linux/dma-fence-unwrap.h>
 #include <linux/anon_inodes.h>
@@ -180,6 +181,9 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
 
+	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+	mem_cgroup_put(dmabuf->memcg);
+
 	dmabuf->ops->release(dmabuf);
 
 	if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
@@ -760,6 +764,13 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}
 
+	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
+	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
+				      GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err_memcg;
+	}
+
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -770,6 +781,8 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 
 	return dmabuf;
 
+err_memcg:
+	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d1203da56fc5f..d9f1ccb51c60e 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -27,6 +27,7 @@
 struct device;
 struct dma_buf;
 struct dma_buf_attachment;
+struct mem_cgroup;
 
 /**
  * struct dma_buf_ops - operations possible on struct dma_buf
@@ -429,6 +430,9 @@ struct dma_buf {
 
 		__poll_t active;
 	} cb_in, cb_out;
+
+	/** @memcg: the cgroup to which this buffer is currently attributed */
+	struct mem_cgroup *memcg;
 };
 
 /**
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b4..10068a833ad9e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -39,6 +39,7 @@ enum memcg_stat_item {
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
+	MEMCG_DMABUF,
 	MEMCG_NR_STAT,
 };
 
@@ -649,6 +650,24 @@ int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp);
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				  gfp_t gfp, swp_entry_t entry);
 
+/**
+ * mem_cgroup_charge_dmabuf - Charge dma-buf memory to a cgroup and update stat counter
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ * @gfp_mask: reclaim mode
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if it doesn't.
+ */
+bool __mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask);
+static inline bool mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages,
+					    gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return true;
+	return __mem_cgroup_charge_dmabuf(memcg, nr_pages, gfp_mask);
+}
+
 void __mem_cgroup_uncharge(struct folio *folio);
 
 /**
@@ -664,6 +683,14 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
 	__mem_cgroup_uncharge(folio);
 }
 
+void __mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages);
+static inline void mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	if (mem_cgroup_disabled())
+		return;
+	__mem_cgroup_uncharge_dmabuf(memcg, nr_pages);
+}
+
 void __mem_cgroup_uncharge_folios(struct folio_batch *folios);
 static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
@@ -1142,10 +1169,20 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 	return 0;
 }
 
+static inline bool mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages,
+					    gfp_t gfp_mask)
+{
+	return true;
+}
+
 static inline void mem_cgroup_uncharge(struct folio *folio)
 {
 }
 
+static inline void mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+}
+
 static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d4668..15cee13d3ccd6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -433,6 +433,7 @@ static const unsigned int memcg_stat_items[] = {
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
+	MEMCG_DMABUF,
 };
 
 #define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
@@ -1580,6 +1581,7 @@ static const struct memory_stat memory_stats[] = {
 #ifdef CONFIG_HUGETLB_PAGE
 	{ "hugetlb",			NR_HUGETLB			},
 #endif
+	{ "dmabuf",			MEMCG_DMABUF			},
 
 	/* The memory events */
 	{ "workingset_refault_anon",	WORKINGSET_REFAULT_ANON		},
@@ -5399,6 +5401,23 @@ void mem_cgroup_flush_workqueue(void)
 	flush_workqueue(memcg_wq);
 }
 
+bool __mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask)
+{
+	if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
+		mod_memcg_state(memcg, MEMCG_DMABUF, nr_pages);
+		return true;
+	}
+
+	return false;
+}
+
+void __mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	mod_memcg_state(memcg, MEMCG_DMABUF, -nr_pages);
+	if (!mem_cgroup_is_root(memcg))
+		refill_stock(memcg, nr_pages);
+}
+
 static int __init cgroup_memory(char *s)
 {
 	char *token;

-- 
2.53.0



* [PATCH RFC 0/5] memcg: dma-buf per-cgroup accounting via pid_fd
From: Albert Esteve @ 2026-05-12  9:10 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude

This RFC builds on T.J. Mercier's earlier series [1] which added
a memory.stat counter for exported dma-bufs and a binder-backed
mechanism to transfer charges between cgroups.

The first commit is taken almost verbatim from TJ's series:
it introduces MEMCG_DMABUF as a dedicated per-cgroup stat, so that
the total exported dma-buf footprint is visible both system-wide
(via the root cgroup) and per-application (via per-process cgroups).
This avoids the overhead of DMABUF_SYSFS_STATS and integrates
naturally into the existing cgroup memory hierarchy.

The rest of the series departs from TJ's approach. While the first
commit introduces the memcg stat infrastructure for dmabufs, the
export-time charging it introduces in dma_buf_export() is then
superseded: we charge at dma_heap_ioctl_allocate() time, using a
new charge_pid_fd field in struct dma_heap_allocation_data. The
allocator opens a pidfd for its client (e.g., from binder's
sender_pid), passes it to the ioctl, and the kernel charges the
buffer directly to the client's cgroup at allocation time, so no
transfer step is needed.

This decouples the accounting path from binder entirely:
any allocator that knows its client's PID can use the pid_fd
mechanism regardless of the IPC transport in use.

The cross-cgroup charging capability requires access control.
Patches #3 and #4 add a generic LSM hook (security_dma_heap_alloc)
and an SELinux implementation based on a new dma_heap object class
with a charge_to permission, so policy authors can express which
domains are allowed to charge memory to another domain's cgroup.

The last patch adds tests that verify the new charge_pid_fd field.

We are sending it as an RFC to spark broader discussion. It may or
may not be the right path forward, and we welcome feedback on the
trade-offs.

Collision note: Eric Chanudet's series [2] adds __GFP_ACCOUNT to
system_heap page allocations as an opt-in module parameter. That
approach charges pages to the allocator's own kmem, which overlaps with
MEMCG_DMABUF. This series explicitly removes __GFP_ACCOUNT from system
heap allocations and routes all accounting through the MEMCG_DMABUF
path to avoid double-counting.

[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com/
[2] https://lore.kernel.org/r/20260113-dmabuf-heap-system-memcg-v2-0-e85722cc2f24@redhat.com

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
Albert Esteve (4):
      dma-heap: charge dma-buf memory via explicit memcg
      security: dma-heap: Add dma_heap_alloc LSM hook
      selinux: Restrict cross-cgroup dma-heap charging
      selftests/dmabuf-heaps: Add dma-buf memcg accounting tests

T.J. Mercier (1):
      memcg: Track exported dma-buffers

 Documentation/admin-guide/cgroup-v2.rst            |   5 +
 drivers/dma-buf/dma-buf.c                          |   7 +
 drivers/dma-buf/dma-heap.c                         |  54 +++++-
 drivers/dma-buf/heaps/system_heap.c                |   2 -
 include/linux/dma-buf.h                            |   4 +
 include/linux/lsm_hook_defs.h                      |   1 +
 include/linux/memcontrol.h                         |  37 ++++
 include/linux/security.h                           |   7 +
 include/uapi/linux/dma-heap.h                      |   6 +
 mm/memcontrol.c                                    |  19 ++
 security/security.c                                |  16 ++
 security/selinux/hooks.c                           |   7 +
 security/selinux/include/classmap.h                |   1 +
 tools/testing/selftests/cgroup/Makefile            |   2 +-
 tools/testing/selftests/cgroup/test_memcontrol.c   | 143 +++++++++++++-
 tools/testing/selftests/dmabuf-heaps/config        |   1 +
 tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 126 ++++++++++++-
 tools/testing/selftests/dmabuf-heaps/vmtest.sh     | 205 +++++++++++++++++++++
 18 files changed, 633 insertions(+), 10 deletions(-)
---
base-commit: 74fe02ce122a6103f207d29fafc8b3a53de6abaf
change-id: 20260508-v2_20230123_tjmercier_google_com-f44fcfb16530

Best regards,
-- 
Albert Esteve <aesteve@redhat.com>


^ permalink raw reply

* Re: [PATCH v19 0/2] ACPI: Add support for ACPI RAS2 feature table
From: Borislav Petkov @ 2026-05-12  8:58 UTC (permalink / raw)
  To: shiju.jose
  Cc: rafael, akpm, rppt, dferguson, linux-edac, linux-acpi, linux-mm,
	linux-doc, tony.luck, lenb, leo.duran, Yazen.Ghannam, mchehab,
	jonathan.cameron, linuxarm, rientjes, jiaqiyan, Jon.Grimm,
	dave.hansen, naoya.horiguchi, james.morse, jthoughton,
	somasundaram.a, erdemaktas, pgonda, duenwen, gthelen, wschwartz,
	wbs, nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, shijujose2008
In-Reply-To: <20260408172850.183-1-shiju.jose@huawei.com>

On Wed, Apr 08, 2026 at 06:28:47PM +0100, shiju.jose@huawei.com wrote:
> Changes
> =======
> v18 -> v19:
> 1. Fixed gemini tool reported issues sent by Borislav. Thanks.
> https://sashiko.dev/#/patchset/20260325165714.294-1-shiju.jose%40huawei.com

I can still see outstanding issues there:

https://sashiko.dev/#/patchset/20260408172850.183-1-shiju.jose%40huawei.com

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH RESEND bpf-next v10 1/8] bpf: refactor __bpf_list_del to take list node pointer
From: Kaitao Cheng @ 2026-05-12  8:55 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: bpf, ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	chengkaitao, skhan, memxor, linux-kernel, linux-doc, vmalik,
	linux-kselftest, martin.lau, clm, ihor.solodrai
In-Reply-To: <6eb07d559a41965a3a9b186b4c28c0942cf1bfb982d539b92c76c27b08db0b04@mail.kernel.org>

I have clarified this issue in the [PATCH bpf-next v10 1/8] patch,
with the link provided below:

https://lore.kernel.org/bpf/e908cd1a-a0ee-49b1-8528-c455bbd4c432@linux.dev/

On 2026/5/12 14:41, bot+bpf-ci@kernel.org wrote:
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index baa12b24bb64..9cd7b028592c 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -2550,22 +2550,24 @@ __bpf_kfunc int bpf_list_push_back_impl(struct bpf_list_head *head,
>>  	return bpf_list_push_back(head, node, meta__ign, off);
>>  }
>>
>> -static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head, bool tail)
>> +static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head,
>> +					    struct list_head *n)
>>  {
>> -	struct list_head *n, *h = (void *)head;
>> +	struct list_head *h = (void *)head;
>>  	struct bpf_list_node_kern *node;
>>
>>  	/* If list_head was 0-initialized by map, bpf_obj_init_field wasn't
>>  	 * called on its fields, so init here
>>  	 */
>> -	if (unlikely(!h->next))
>> +	if (unlikely(!h->next)) {
>>  		INIT_LIST_HEAD(h);
>> +		return NULL;
>> +	}
>>  	if (list_empty(h))
>>  		return NULL;
>>
>> -	n = tail ? h->prev : h->next;
>>  	node = container_of(n, struct bpf_list_node_kern, list_head);
>> -	if (WARN_ON_ONCE(READ_ONCE(node->owner) != head))
>> +	if (unlikely(READ_ONCE(node->owner) != head))
>>  		return NULL;
>>
>>  	list_del_init(n);
>> @@ -2575,12 +2577,16 @@ static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head, bool tai
>>
>>  __bpf_kfunc struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head)
>>  {
>> -	return __bpf_list_del(head, false);
>> +	struct list_head *h = (void *)head;
>> +
>> +	return __bpf_list_del(head, h->next);
>>  }
> 
> The new code reads h->next before __bpf_list_del checks if the list is
> initialized. If the list is uninitialized (h->next is NULL), a NULL pointer
> is passed to __bpf_list_del.
> 
> While __bpf_list_del eventually initializes the list and returns NULL without
> dereferencing the NULL pointer, this is counterintuitive - passing a
> potentially garbage value that is later ignored.
> 
> Can you move the initialization check into the top-level kfuncs
> (bpf_list_pop_front and bpf_list_pop_back) to ensure the list_head passed to
> __bpf_list_del is always valid? You could initialize the list and return NULL
> in that case - we know it's empty.
> 
> This concern was raised by Emil Tsalapatis in v8:
> https://lore.kernel.org/bpf/DH6W7K2SLPYB.PCO7FQ6CR1YZ@etsalapatis.com/
> 
>>
>>  __bpf_kfunc struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
>>  {
>> -	return __bpf_list_del(head, true);
>> +	struct list_head *h = (void *)head;
>> +
>> +	return __bpf_list_del(head, h->prev);
>>  }
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25716874656
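
For illustration only, the suggestion quoted above (doing the
initialization check in the top-level pop functions) could be sketched
in user-space C roughly as follows. This is not the kernel code: struct
list_head here is a minimal stand-in, the helpers init_list_head(),
list_del_node(), pop_front() and pop_back() are hypothetical names, and
the owner check done by __bpf_list_del() is left out.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Unlink node n from list h; the caller guarantees h is initialized. */
static struct list_head *list_del_node(struct list_head *h,
				       struct list_head *n)
{
	if (h->next == h)	/* empty list */
		return NULL;

	n->prev->next = n->next;
	n->next->prev = n->prev;
	init_list_head(n);
	return n;
}

static struct list_head *pop_front(struct list_head *h)
{
	/* A zero-initialized head is set up here and is known to be empty. */
	if (!h->next) {
		init_list_head(h);
		return NULL;
	}
	return list_del_node(h, h->next);
}

static struct list_head *pop_back(struct list_head *h)
{
	if (!h->next) {
		init_list_head(h);
		return NULL;
	}
	return list_del_node(h, h->prev);
}
```

With this shape the helper never receives a NULL node pointer, at the
cost of repeating the check in both callers.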

-- 
Thanks
Kaitao Cheng



* Re: [PATCH v4 1/4] kernel: param: initialize module_kset before do_initcalls()
From: Jon Hunter @ 2026-05-12  8:55 UTC (permalink / raw)
  To: Shashank Balaji, Thierry Reding, Sumit Gupta
  Cc: Gary Guo, Suzuki K Poulose, James Clark, Alexander Shishkin,
	Maxime Coquelin, Alexandre Torgue, Greg Kroah-Hartman,
	Rafael J. Wysocki, Danilo Krummrich, Miguel Ojeda, Boqun Feng,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Richard Cochran, Jonathan Corbet, Shuah Khan,
	Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Mike Leach, Leo Yan, Rahul Bukte, linux-kernel,
	coresight, linux-arm-kernel, driver-core, rust-for-linux,
	linux-doc, Daniel Palmer, Tim Bird, linux-modules, linux-tegra
In-Reply-To: <agKMcA7a_UqMua5V@JPC00244420>

Hi Shashank,

On 12/05/2026 03:12, Shashank Balaji wrote:

...

>> Hi Thierry and Jonathan,
>>
>> You can find the context for this email in this patch:
>> https://lore.kernel.org/all/20260427-acpi_mod_name-v4-1-22b42240c9bf@sony.com/
>>
>> TL;DR: tegra194_cbb_driver and tegra234_cbb_driver are the only drivers
>> registering themselves as early as in a pure_initcall. This is a problem
>> on two fronts:
>> 1. Philosophical: As Gary pointed out, pure_initcalls are intended to purely
>> initialize variables that couldn't be statically initialized. But these
>> are doing driver registrations.
>> 2. module_kset not initialized at pure_initcall stage: This is needed to
>> set the module sysfs symlink. Since module_kset is not alive yet during
>> pure_initcalls, registering these drivers panics the kernel.

Where exactly is this panic seen? I.e., why are we not seeing this?

>> We would like to do the tegra cbb driver registration in a core_initcall
>> (or some later initcall works too), and move module_kset initialization
>> to a pure_initcall. Like this:
>>
>> diff --git a/drivers/soc/tegra/cbb/tegra194-cbb.c b/drivers/soc/tegra/cbb/tegra194-cbb.c
>> index ab75d50cc85c..2f69e104c838 100644
>> --- a/drivers/soc/tegra/cbb/tegra194-cbb.c
>> +++ b/drivers/soc/tegra/cbb/tegra194-cbb.c
>> @@ -2342,7 +2342,7 @@ static int __init tegra194_cbb_init(void)
>>   {
>>          return platform_driver_register(&tegra194_cbb_driver);
>>   }
>> -pure_initcall(tegra194_cbb_init);
>> +core_initcall(tegra194_cbb_init);
>>
>>   static void __exit tegra194_cbb_exit(void)
>>   {
>> diff --git a/drivers/soc/tegra/cbb/tegra234-cbb.c b/drivers/soc/tegra/cbb/tegra234-cbb.c
>> index fb26f085f691..785072fa4e85 100644
>> --- a/drivers/soc/tegra/cbb/tegra234-cbb.c
>> +++ b/drivers/soc/tegra/cbb/tegra234-cbb.c
>> @@ -1774,7 +1774,7 @@ static int __init tegra234_cbb_init(void)
>>   {
>>          return platform_driver_register(&tegra234_cbb_driver);
>>   }
>> -pure_initcall(tegra234_cbb_init);
>> +core_initcall(tegra234_cbb_init);
>>
>>   static void __exit tegra234_cbb_exit(void)
>>   {
>>
>> Would this work?


I am adding Sumit who has been doing a lot of the Tegra CBB driver work.

Sumit, any concerns here? We could run this change through our internal 
testing to confirm.

Jon

-- 
nvpublic



* Re: [PATCH 1/2] Doc: deprecated.rst: add strlcat()
From: Jani Nikula @ 2026-05-12  8:52 UTC (permalink / raw)
  To: Manuel Ebner, manuelebner
  Cc: andy.shevchenko, apw, corbet, dwaipayanray1, joe, kees, linux-doc,
	linux-kernel, lukas.bulwahn, skhan, workflows
In-Reply-To: <20260510165451.57674-2-manuelebner@mailbox.org>

On Sun, 10 May 2026, Manuel Ebner <manuelebner@mailbox.org> wrote:
> add strlcat and alternatives

You'd think it's the strlcat() definition that needs a comment above it
saying it's deprecated. I don't think folks really look at
deprecated.rst.

BR,
Jani.

>
> Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
> ---
>  Documentation/process/deprecated.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst
> index fed56864d036..b8a65c19796c 100644
> --- a/Documentation/process/deprecated.rst
> +++ b/Documentation/process/deprecated.rst
> @@ -162,6 +162,12 @@ if a source string is not NUL-terminated. The safe replacement is strscpy(),
>  though care must be given to any cases where the return value of strlcpy()
>  is used, since strscpy() will return negative errno values when it truncates.
>  
> +strlcat()
> +---------
> +strlcat() must re-scan the destination string from the beginning on each
> +call (O(n^2) behavior). Alternatives are seq_buf_puts(), seq_buf_printf(),
> +snprintf() and scnprintf()
> +
>  %p format specifier
>  -------------------
>  Traditionally, using "%p" in format strings would lead to regular address
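
To make the O(n^2) point in the quoted hunk concrete, here is a small
user-space sketch of the alternative: keep track of the current length
and append at that offset, so the destination is never re-scanned. The
helper name buf_append() is hypothetical and only snprintf() from
standard C is used; in the kernel, scnprintf() or the seq_buf helpers
would play this role.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: append src to buf at offset len and return the
 * new length. The result is always NUL-terminated when size > 0 and
 * the returned length never exceeds size - 1.
 */
static size_t buf_append(char *buf, size_t size, size_t len, const char *src)
{
	int n;

	if (len >= size)
		return len;

	n = snprintf(buf + len, size - len, "%s", src);
	if (n < 0)
		return len;

	/* snprintf() reports the untruncated length; clamp on truncation. */
	if ((size_t)n >= size - len)
		return size - 1;

	return len + (size_t)n;
}
```

Each call costs time proportional to the appended string only, whereas
a loop of strlcat() calls walks the whole destination every time.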

-- 
Jani Nikula, Intel


* Re: [RFC net-next 0/4] devlink: Add boot-time defaults
From: Jiri Pirko @ 2026-05-12  8:45 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Mark Bloch, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Andrew Lunn, David S. Miller, Jonathan Corbet, Shuah Khan,
	Simon Horman, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Andrew Morton, Borislav Petkov (AMD), Randy Dunlap, Dave Hansen,
	Christian Brauner, Petr Mladek, Peter Zijlstra (Intel),
	Thomas Gleixner, Pawan Gupta, Dapeng Mi, Kees Cook, Marco Elver,
	Eric Biggers, NBU-Contact-Li Rongqing (EXTERNAL),
	Paul E. McKenney, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org
In-Reply-To: <SJ0PR12MB68068C50EE9776A3D9060635DC382@SJ0PR12MB6806.namprd12.prod.outlook.com>

Mon, May 11, 2026 at 08:21:37PM +0200, parav@nvidia.com wrote:
>
>> From: Mark Bloch <mbloch@nvidia.com>
>> Sent: 10 May 2026 06:02 PM
>> 
>
>[..]
>
>> > I look at it from the perspective that from some CX generation,
>> > switchdev mode should be default. So that is a device-based decision.
>> > I believe as such it can optionally be permanently configured (nv config)
>> > on older devices. Why not?
>>
>Because sometimes switchdev_inactive is needed and sometimes not.
>Such a knob is not a device decision.

That is what I would call a corner case. In that case, the user can use
userspace configuration to change the mode at runtime.


>If it is placed in the device, orchestration still needs to use an additional vendor tool to configure it in the device.
>And that theoretical tool cannot even run yet because the driver is not yet loaded.
>
>That sort of defeats the purpose.
> 
>> This is a deployment policy decision, not a permanent property of the card.
>+1
>
>> The same adapter can be used in a regular host/RDMA setup or in a
>> switchdev/offload setup. If we store this in NVM, that Linux switchdev policy
>> follows the device across hosts, kernels and use cases, and can surprise the
>> next deployment that just expects a normal NIC.
>> 
>> I'll send another RFC v2 with support limited to:
>> devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy }
>> and let's see where we land with that.
>> 
>This looks elegant to me as well, covering all eswitch modes while sw is still in control.


* Re: [RFC net-next 0/4] devlink: Add boot-time defaults
From: Jiri Pirko @ 2026-05-12  8:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Mark Bloch, Eric Dumazet, Paolo Abeni, Andrew Lunn,
	David S. Miller, Jonathan Corbet, Shuah Khan, Simon Horman,
	Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Morton,
	Borislav Petkov (AMD), Randy Dunlap, Dave Hansen,
	Christian Brauner, Petr Mladek, Peter Zijlstra (Intel),
	Thomas Gleixner, Pawan Gupta, Dapeng Mi, Kees Cook, Marco Elver,
	Eric Biggers, Li RongQing, Paul E. McKenney, linux-doc,
	linux-kernel, netdev, linux-rdma
In-Reply-To: <20260511164132.2df9c5a1@kernel.org>

Tue, May 12, 2026 at 01:41:32AM +0200, kuba@kernel.org wrote:
>On Mon, 11 May 2026 10:42:56 +0200 Jiri Pirko wrote:
>> Sun, May 10, 2026 at 06:37:32PM +0200, kuba@kernel.org wrote:
>> >On Sat, 9 May 2026 09:01:23 +0200 Jiri Pirko wrote:  
>> >> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:  
>> >> As "a non-SR-IOV user", what extra representors are you talking about? When
>> >> you have PFs only, you don't have anything extra. Just 1 netdev per PF, one
>> >> devlink port per PF. What's extra about it when you don't have VFs/SFs?
>> >> Everything is the same:
>> >
>> >Some devices have separate uplink ports and PF representors.
>> >As I said, what you're proposing isn't going to work for all drivers.  
>> 
>> Well, the point is, mlx5 appears to be the one needing this, not other
>> drivers. What I'm trying to point out is that mlx5 should not need this.
>> It makes things complicated, adding an ugly knob for no good reason.
>> In both legacy and switchdev mode, the non-SR-IOV/eswitch user should not
>> see different behaviour. The mode is an eswitch attribute.
>> 
>>    devlink dev eswitch set - sets devlink device eswitch attributes
>>        mode { legacy | switchdev }
>>               Set eswitch mode
>> 
>>               legacy - Legacy SRIOV
>> 
>>               switchdev - SRIOV switchdev offloads
>> 
>> 
>> Briefly looking over other drivers, looks like ice, bnxt, octeon, sfc,
>> there is no new entity created in case of switching to switchdev mode.
>> The only driver that creates separate pf entities seems to be nfp,
>> but the mode seems to be determined by the app being run (loaded
>> firmware).
>> 
>> Am I missing something?
>
>Hm. Okay, I wasn't aware that mlx5 was the only driver that did
>heavy-duty reinit for switching modes.
>
>> >> I look at it from the perspective that from some CX generation,
>> >> switchdev mode should be default. So that is a device-based decision.
>> >> I believe as such it can optionally be permanently configured (nv config)
>> >> on older devices. Why not?
>> >
>> >Feels a bit arbitrary and won't cover all cases. The question should be  
>> 
>> What cases does it not cover? I don't follow.
>
>Other FW and HW versions. People are still using EOL devices (CX4/CX5),
>IIUC the nvmem config path would require FW upgrade.

If a user wants to have a new feature (a bit odd to call this a feature,
but ok), they are obliged to upgrade the FW. What's wrong with that?

But, even without an nvconfig knob, what's stopping us from fixing the
behaviour (/bugs) and just making "switchdev" mode the default in net-next
for all devices in the mlx5 driver? Again, perhaps I'm missing something.



>
>> >why you are nacking a more reasonable solution. Keeping Linux config in
>> >Linux params.  
>> 
>> What's reasonable about adding basically a module option (kernel cmdline
>> is pretty much the same) for no reason?
>
>The initial patch as posted added this to a mlx5-specific module param.
>If we need a module param IMO generic one is much better.
>Doesn't matter if other drivers take no time to reinit into switchdev
>mode, having to switch mlx5 with a module param and all the rest in
>runtime is not the best user experience?

I still believe we don't need either: neither a module param nor a cmdline
devlink option. We just need to fix bugs and have proper defaults. The
rest is a shortcut.




* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
From: David Hildenbrand (Arm) @ 2026-05-12  8:42 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, Liam.Howlett, akpm, decui, haiyangz,
	jgg, corbet, leon, longli, ljs, mhocko, rppt, shuah, skhan,
	surenb, vbabka, wei.liu
  Cc: linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <177759840859.221039.13065406062747296947.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>


> +	for (; addr < end; addr += PAGE_SIZE) {
> +		vm_fault_t ret;
> +
> +		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> +
> +		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
> +			/*
> +			 * The mmap lock has been dropped by the fault handler.
> +			 * Record the failing address and signal lock-drop to
> +			 * the caller.
> +			 */
> +			*hmm_vma_walk->locked = 0;
> +			hmm_vma_walk->last = addr;
> +			return -EAGAIN;


Okay, so we'll return straight from hmm_vma_fault() to
hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.

Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
the hmm_vma_fault() could be called by the caller of walk_page_range(), but
that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
requires the vma in there.


Note: am I wrong, or is hmm_vma_fault() really always called with
required_fault=true?

> +		}
> +
> +		if (ret & VM_FAULT_ERROR)
>  			return -EFAULT;
> +	}
>  	return -EBUSY;
>  }
>  
> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>  	if (required_fault) {
>  		int ret;
>  
> +		/*
> +		 * Faulting hugetlb pages on the unlockable path is not
> +		 * supported. The walk framework holds hugetlb_vma_lock_read
> +		 * which must be dropped before handle_mm_fault, but if the
> +		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
> +		 * be freed and the walk framework's unconditional unlock
> +		 * becomes a use-after-free.
> +		 */
> +		if (hmm_vma_walk->locked)
> +			return -EFAULT;

Just because it's unlockable doesn't mean that you must unlock. Can't this be
kept working as is, just simulating here as if it would not be unlockable?


-- 
Cheers,

David


* Re: [PATCH v10 8/9] platform/chrome: Protect cros_ec_device lifecycle with revocable
From: Laurent Pinchart @ 2026-05-12  8:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tzung-Bi Shih, Arnd Bergmann, Greg Kroah-Hartman,
	Bartosz Golaszewski, Linus Walleij, Benson Leung, linux-kernel,
	chrome-platform, driver-core, linux-doc, linux-gpio,
	Rafael J. Wysocki, Danilo Krummrich, Jonathan Corbet, Shuah Khan,
	Wolfram Sang, Johan Hovold, Paul E . McKenney
In-Reply-To: <20260508115309.GA9254@nvidia.com>

On Fri, May 08, 2026 at 08:53:09AM -0300, Jason Gunthorpe wrote:
> On Fri, May 08, 2026 at 06:54:47PM +0800, Tzung-Bi Shih wrote:
> >  struct cros_ec_device *cros_ec_device_alloc(struct device *dev)
> > @@ -47,6 +49,15 @@ struct cros_ec_device *cros_ec_device_alloc(struct device *dev)
> >  	if (!ec_dev)
> >  		return NULL;
> >  
> > +	ec_dev->its_rev = revocable_alloc(ec_dev);
> > +	if (!ec_dev->its_rev)
> > +		return NULL;
> > +	/*
> > +	 * Drop the extra reference for the caller as the caller is the
> > +	 * resource provider.
> > +	 */
> > +	revocable_put(ec_dev->its_rev);
> > +
> >  	ec_dev->din_size = sizeof(struct ec_host_response) +
> >  			   sizeof(struct ec_response_get_protocol_info) +
> >  			   EC_MAX_RESPONSE_OVERHEAD;
> 
> FWIW I am still very much against seeing any revocable concept used
> *between two drivers*. That will turn the kernel's lifetime model into
> spaghetti code.

I agree, I really think it will become a huge mess that we will
massively regret.

/me feels like Cassandra

> Your other series where you only have to change
> drivers/platform/chrome/cros_ec_chardev.c just confirms how wrong this
> approach is.
> 
> Given you say this is such a bug I think you really should be sending
> a series that is patches 5 through 7 from the other series and a
> simple rwsem instead of misc_deregister_sync() to deal with this bug
> ASAP. No need to complicate a simple bug fix in a driver with all
> these core changes.
> 
> Once the bug is fixed you can continue to try to propose more general
> solutions.

-- 
Regards,

Laurent Pinchart


* Re: [PATCH v6 3/4] mm/memory-failure: add panic option for unrecoverable pages
From: David Hildenbrand (Arm) @ 2026-05-12  8:22 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260511-ecc_panic-v6-3-183012ba7d4b@debian.org>


> @@ -1281,6 +1292,18 @@ static void update_per_node_mf_stats(unsigned long pfn,
>  	++mf_stats->total;
>  }
>  
> +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> +				      enum mf_result result)
> +{
> +	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> +		return false;
> +
> +	if (type == MF_MSG_KERNEL)
> +		return true;
> +
> +	return false;

return type == MF_MSG_KERNEL;

might be simpler.
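
For reference, a sketch of how the helper would read with that change,
written as stand-alone C so it can be compiled outside the kernel. The
two enums and the sysctl variable below are stand-ins listing only the
values needed here; they are assumptions for illustration and do not
mirror the full kernel definitions.

```c
#include <stdbool.h>

/* Stand-ins for the kernel enums; only the values used here are listed. */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };
enum mf_action_page_type { MF_MSG_KERNEL, MF_MSG_GET_HWPOISON };

/* Stand-in for the sysctl added by this series. */
static int sysctl_panic_on_unrecoverable_mf;

static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
				      enum mf_result result)
{
	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
		return false;

	/* Only kernel-owned pages count as unrecoverable here. */
	return type == MF_MSG_KERNEL;
}
```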

-- 
Cheers,

David


* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: David Hildenbrand (Arm) @ 2026-05-12  8:21 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260511-ecc_panic-v6-2-183012ba7d4b@debian.org>


>  		}
>  		goto unlock_mutex;
>  	} else if (res < 0) {
> -		if (is_reserved)
> +		/*
> +		 * Promote a stable unhandlable kernel page diagnosed by
> +		 * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
> +		 * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
> +		 */
> +		if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
>  			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);


It's all a bit of a mess. get_hwpoison_page() should just indicate that a page
is unhandlable if it is PG_reserved?

Why can't we just return a special error code from get_hwpoison_page()? We have
plenty of errno values to choose from.

-- 
Cheers,

David


* Re: [PATCH v5 00/13] ima: Introduce staging mechanism
From: Roberto Sassu @ 2026-05-12  8:17 UTC (permalink / raw)
  To: Lakshmi Ramasubramanian, steven chen, corbet, skhan, zohar,
	dmitry.kasatkin, eric.snowberg, paul, jmorris, serge
  Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
	gregorylumen, Roberto Sassu
In-Reply-To: <aaed52cf-26e1-4c40-812d-3788024ce5b5@linux.microsoft.com>

On Mon, 2026-05-11 at 10:29 -0700, Lakshmi Ramasubramanian wrote:
> On 5/7/2026 9:47 AM, steven chen wrote:
> > > 
> > > Usage
> > > =====
> > > 
> > > The IMA staging mechanism can be enabled from the kernel configuration
> > > with the CONFIG_IMA_STAGING option.
> > > 
> > > If it is enabled, IMA duplicates the current measurements interfaces
> > > (both binary and ASCII), by adding the _staged file suffix. Both the
> > > original and the staging interfaces gain the write permission for the
> > > root user and group, but require the process to have CAP_SYS_ADMIN set.
> > > 
> > > The staging mechanism supports two flavors.
> > > 
> > > Staging with prompt
> > > ~~~~~~~~~~~~~~~~~~~
> > > 
> > > The current measurements list is moved to a temporary staging area, and
> > > staged measurements are deleted upon confirmation.
> > > 
> > > This staging process is achieved with the following steps.
> > > 
> > >    1.  echo A > <original interface>: the user requests IMA to stage the
> > >        entire measurements list;
> > >    2.  cat <_staged interface>: the user reads the staged measurements;
> > >    3.  echo D > <_staged interface>: the user requests IMA to delete
> > >        staged measurements.
> > > 
> > > Staging and deleting
> > > ~~~~~~~~~~~~~~~~~~~~
> > > 
> > > N measurements are staged to a temporary staging area, and immediately
> > > deleted without further confirmation.
> > > 
> > > This staging process is achieved with the following steps.
> > > 
> > >    1.  cat <original interface>: the user reads the current measurements
> > >        list and determines what the value N for staging should be;
> > >    2.  echo N > <original interface>: the user requests IMA to delete N
> > >        measurements from the current measurements list.
> > 
> > This submission proposes two ways for log trimming:
> > 
> > *Flavor 1:* Staging with prompt
> > *Flavor 2:* stage and delete N
> > 
> > Functionally, both approaches address the same problem, but *Flavor 2*
> > is the stronger design and should be preferred. There is no good reason
> > to keep *Flavor 1*.
> > 
> > From a kernel implementation perspective, *Flavor 2* is more efficient
> > because it minimizes the time spent holding the list lock (it can't be
> > shorter). It also substantially reduces the amount of kernel-side logic,
> > removing nearly half of the code required by the alternative approach.
> > 
> > From a user-space perspective, *Flavor 2* results in a much cleaner
> > model. It avoids the need to track and reconcile both old and staged
> > lists in user space, as well as two lists (cur and staged) in kernel
> > space, which simplifies log trimming logic and reduces maintenance
> > overhead. In addition, it preserves the existing external behavior by
> > not exposing any staged list to user space.
> > 
> > Overall, *Flavor 2* provides the same functional result with lower
> > kernel complexity, shorter kernel list lock hold time, and a simpler
> > user-space interface. For those reasons, it is the preferable approach,
> > and *Flavor 1* does not appear to offer sufficient justification to keep
> > both implementations.
> > 
> > Steven
> 
> Roberto, Mimi:
> 
> I want to add on to the point Steven has brought up.
> 
> With the "Stage and Delete N" approach, we have the following sequence of
> tasks for trimming the IMA log:
> 
> 	1. User mode locks the IMA measurement list through the "write
> 	   interface".
> 		a. While this prevents any other user mode process from
> 		   updating the IMA log, the kernel can still add new IMA
> 		   events to the measurement log.
> 	2. User mode reads the TPM Quote and the IMA measurement events and
> 	   sends them to the remote attestation service.
> 	3. Once the remote service has successfully processed the IMA events,
> 	   user mode determines the number of IMA events "N" to be removed
> 	   from the measurement list maintained in the kernel.
> 	4. User mode provides the value "N" to the kernel.
> 	5. The kernel now determines the point at which to snap the IMA
> 	   measurement list using "N", without holding a lock.
> 	6. Then the kernel lock is taken and the list is snapped at the point
> 	   determined in the previous step, thus keeping the kernel lock hold
> 	   time to a minimum.
> 	7. Now user mode removes the "write" lock on the IMA measurement list.
> 
> With the above, we believe "Stage and Delete N" alone is sufficient to
> trim the IMA log.

Hi Lakshmi,

I'm happy to support your trimming method. It just does not fit my
use case. I would like to keep both.

Thanks

Roberto

>   -lakshmi
> 
> > >   .../admin-guide/kernel-parameters.txt         |   4 +
> > >   Documentation/security/IMA-staging.rst        | 163 +++++++++
> > >   Documentation/security/index.rst              |   1 +
> > >   MAINTAINERS                                   |   2 +
> > >   security/integrity/ima/Kconfig                |  16 +
> > >   security/integrity/ima/ima.h                  |  32 +-
> > >   security/integrity/ima/ima_api.c              |   2 +-
> > >   security/integrity/ima/ima_fs.c               | 315 ++++++++++++++++--
> > >   security/integrity/ima/ima_init.c             |   5 +
> > >   security/integrity/ima/ima_kexec.c            |  53 ++-
> > >   security/integrity/ima/ima_queue.c            | 283 ++++++++++++++--
> > >   11 files changed, 803 insertions(+), 73 deletions(-)
> > >   create mode 100644 Documentation/security/IMA-staging.rst
> > > 



* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-12  8:17 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260511-ecc_panic-v6-1-183012ba7d4b@debian.org>

On 5/11/26 17:38, Breno Leitao wrote:
> When get_hwpoison_page() returns a negative value, distinguish
> reserved pages from other failure cases by reporting MF_MSG_KERNEL
> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
> and should be classified accordingly for proper handling.
> 
> Sample PG_reserved before the get_hwpoison_page() call. In the
> MF_COUNT_INCREASED path get_any_page() can drop the caller's
> reference before returning -EIO, after which the underlying page may
> have been freed and reallocated with page->flags reset; reading
> PageReserved(p) at that point would observe stale or unrelated state.
> The pre-call snapshot reflects what the page actually was at the
> time of the failure event.
> 
> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/memory-failure.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 866c4428ac7ef..f112fb27a8ff6 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	unsigned long page_flags;
>  	bool retry = true;
>  	int hugetlb = 0;
> +	bool is_reserved;
>  
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>  	 * In fact it's dangerous to directly bump up page count from 0,
>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>  	 */
> +	/*
> +	 * Pages with PG_reserved set are not currently managed by the
> +	 * page allocator (memblock-reserved memory, driver reservations,
> +	 * etc.), so classify them as kernel-owned for reporting.
> +	 *
> +	 * Sample the flag before get_hwpoison_page(): in the
> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
> +	 * reference before returning -EIO, after which page->flags may
> +	 * have been reset by the allocator.
> +	 */
> +	is_reserved = PageReserved(p);
> +
>  	res = get_hwpoison_page(p, flags);
>  	if (!res) {
>  		if (is_free_buddy_page(p)) {
> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>  		}
>  		goto unlock_mutex;
>  	} else if (res < 0) {
> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> +		if (is_reserved)
> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> +		else
> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
> +					    MF_IGNORED);
>  		goto unlock_mutex;
>  	}
>  
> 

It's a bit odd that we need this handling when we already have handling for
reserved pages in error_states[].

HWPoisonHandlable() would essentially always reject PG_reserved pages. So
__get_hwpoison_page() ... would always fail? Making
get_hwpoison_page()->get_any_page() always fail?

But then, we never call identify_page_state()? And never call me_kernel()?

This all looks very odd.

Why would you even want to call get_hwpoison_page() in the first place if you
find PageReserved?
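
To make the question concrete, here is a minimal userspace sketch, not
kernel code. Every toy_* name is hypothetical, and the model simply assumes
what is argued above: that the handlable check rejects any page with
PG_reserved set, so grabbing such a page always fails. It also ignores the
free-page and hugetlb cases entirely. Under those assumptions, testing the
flag up front reports the same result as snapshotting it and classifying
after the failed grab.

```c
#include <stdbool.h>

/* Toy stand-ins for the reporting categories; not the kernel's enums. */
enum toy_result { TOY_MSG_KERNEL, TOY_MSG_GET_HWPOISON, TOY_MSG_OTHER };

struct toy_page {
	bool reserved;   /* models PG_reserved */
	bool handlable;  /* models everything else the handlable check accepts */
};

/* Assumption from the discussion: reserved pages are never handlable. */
static bool toy_handlable(const struct toy_page *p)
{
	return !p->reserved && p->handlable;
}

/* Models get_hwpoison_page(): negative when the page cannot be grabbed. */
static int toy_get_page(const struct toy_page *p)
{
	return toy_handlable(p) ? 1 : -1;
}

/* Flow as in the patch: snapshot the flag, try the grab, classify on failure. */
static enum toy_result toy_flow_patch(const struct toy_page *p)
{
	bool is_reserved = p->reserved;

	if (toy_get_page(p) < 0)
		return is_reserved ? TOY_MSG_KERNEL : TOY_MSG_GET_HWPOISON;
	return TOY_MSG_OTHER;
}

/* Alternative flow: test the flag first and skip the grab entirely. */
static enum toy_result toy_flow_early(const struct toy_page *p)
{
	if (p->reserved)
		return TOY_MSG_KERNEL;
	if (toy_get_page(p) < 0)
		return TOY_MSG_GET_HWPOISON;
	return TOY_MSG_OTHER;
}
```

In this model the two flows agree for every combination of the two flags,
and the early check needs no pre-call snapshot. Whether that carries over to
memory_failure() depends on the assumption about the handlable check holding
in the real code.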

-- 
Cheers,

David


* Re: [PATCH v10 7/9] gpio: Remove unused `chip` and `srcu` in struct gpio_device
From: Tzung-Bi Shih @ 2026-05-12  8:13 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: Benson Leung, linux-kernel, chrome-platform, driver-core,
	linux-doc, linux-gpio, Rafael J. Wysocki, Danilo Krummrich,
	Jonathan Corbet, Shuah Khan, Laurent Pinchart, Wolfram Sang,
	Jason Gunthorpe, Johan Hovold, Paul E . McKenney, Arnd Bergmann,
	Greg Kroah-Hartman, Linus Walleij
In-Reply-To: <CAMRc=MfZL2ZEmNEdVd6NeZJDjTzh_MbDy2kU+AYi-CmgRnWZmw@mail.gmail.com>

On Mon, May 11, 2026 at 06:18:21AM -0700, Bartosz Golaszewski wrote:
> On Fri, 8 May 2026 12:54:46 +0200, Tzung-Bi Shih <tzungbi@kernel.org> said:
> > `chip` and `srcu` in struct gpio_device are unused now that their users
> > have been converted to revocable.  Remove them.
> >
> > Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>
> > ---
> 
> I'm thinking that all the GPIO patches could actually be squashed into one. Is
> there any technical reason for the split or is it just for easier review?

Correct, they are separated only for easier review.  Would you prefer I
squash the 5 patches into a single patch?


* Re: [PATCH v10 2/9] revocable: Add KUnit test cases
From: Tzung-Bi Shih @ 2026-05-12  8:12 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: Benson Leung, linux-kernel, chrome-platform, driver-core,
	linux-doc, linux-gpio, Rafael J. Wysocki, Danilo Krummrich,
	Jonathan Corbet, Shuah Khan, Laurent Pinchart, Wolfram Sang,
	Jason Gunthorpe, Johan Hovold, Paul E . McKenney, Arnd Bergmann,
	Greg Kroah-Hartman, Linus Walleij
In-Reply-To: <CAMRc=McPez6Ver5NgrDPnM9YDb7cPonWE7BBsS_5AnY9tGf3xQ@mail.gmail.com>

On Mon, May 11, 2026 at 06:10:32AM -0700, Bartosz Golaszewski wrote:
> On Fri, 8 May 2026 12:54:41 +0200, Tzung-Bi Shih <tzungbi@kernel.org> said:
> > diff --git a/drivers/base/revocable_test.c b/drivers/base/revocable_test.c
> 
> Please move this under drivers/base/tests/ where the rest of kunit modules
> live and name it revocable-test.c for consistency with the existing ones.

Ack, I overlooked the folder.  Will move the test to drivers/base/test/ and
rename it in the next version.

> > +#include <kunit/test.h>
> 
> Add a newline here as do other kunit modules.

Ack, will fix it in the next version.


