Linux Security Modules development

* [PATCH RFC 0/5] memcg: dma-buf per-cgroup accounting via pid_fd
From: Albert Esteve @ 2026-05-12  9:10 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude

This RFC builds on T.J. Mercier's earlier series [1] which added
a memory.stat counter for exported dma-bufs and a binder-backed
mechanism to transfer charges between cgroups.

The first commit is taken almost verbatim from TJ's series:
it introduces MEMCG_DMABUF as a dedicated per-cgroup stat, so that
the total exported dma-buf footprint is visible both system-wide
(via the root cgroup) and per-application (via per-process cgroups).
This avoids the overhead of DMABUF_SYSFS_STATS and integrates
naturally into the existing cgroup memory hierarchy.

The rest of the series departs from TJ's approach. The memcg stat
infrastructure from the first commit is kept, but its export-time
charging in dma_buf_export() is superseded: the buffer is charged on
the DMA_HEAP_IOCTL_ALLOC path instead, using a new charge_pid_fd
field in struct dma_heap_allocation_data. The allocator opens a pidfd
for its client (e.g., from binder's sender_pid) and passes it to the
ioctl; the kernel then charges the buffer directly to the client's
cgroup at allocation time, so no transfer step is needed.

This decouples the accounting path from binder entirely:
any allocator that knows its client's PID can use the pid_fd
mechanism regardless of the IPC transport in use.

The cross-cgroup charging capability requires access control.
Patches #3 and #4 add a generic LSM hook (security_dma_heap_alloc)
and an SELinux implementation based on a new dma_heap object class
with a charge_to permission, so policy authors can express which
domains are allowed to charge memory to another domain's cgroup.

The last patch adds selftests that exercise the new charge_pid_fd field.

We are sending it as an RFC to spark broader discussion. It may or
may not be the right path forward, and we welcome feedback on the
trade-offs.

Collision note: Eric Chanudet's series [2] adds __GFP_ACCOUNT to
system_heap page allocations as an opt-in module parameter. That
approach charges pages to the allocator's own kmem, which overlaps with
MEMCG_DMABUF. This series explicitly removes __GFP_ACCOUNT from system
heap allocations and routes all accounting through the MEMCG_DMABUF
path to avoid double-counting.

[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com/
[2] https://lore.kernel.org/r/20260113-dmabuf-heap-system-memcg-v2-0-e85722cc2f24@redhat.com

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
Albert Esteve (4):
      dma-heap: charge dma-buf memory via explicit memcg
      security: dma-heap: Add dma_heap_alloc LSM hook
      selinux: Restrict cross-cgroup dma-heap charging
      selftests/dmabuf-heaps: Add dma-buf memcg accounting tests

T.J. Mercier (1):
      memcg: Track exported dma-buffers

 Documentation/admin-guide/cgroup-v2.rst            |   5 +
 drivers/dma-buf/dma-buf.c                          |   7 +
 drivers/dma-buf/dma-heap.c                         |  54 +++++-
 drivers/dma-buf/heaps/system_heap.c                |   2 -
 include/linux/dma-buf.h                            |   4 +
 include/linux/lsm_hook_defs.h                      |   1 +
 include/linux/memcontrol.h                         |  37 ++++
 include/linux/security.h                           |   7 +
 include/uapi/linux/dma-heap.h                      |   6 +
 mm/memcontrol.c                                    |  19 ++
 security/security.c                                |  16 ++
 security/selinux/hooks.c                           |   7 +
 security/selinux/include/classmap.h                |   1 +
 tools/testing/selftests/cgroup/Makefile            |   2 +-
 tools/testing/selftests/cgroup/test_memcontrol.c   | 143 +++++++++++++-
 tools/testing/selftests/dmabuf-heaps/config        |   1 +
 tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 126 ++++++++++++-
 tools/testing/selftests/dmabuf-heaps/vmtest.sh     | 205 +++++++++++++++++++++
 18 files changed, 633 insertions(+), 10 deletions(-)
---
base-commit: 74fe02ce122a6103f207d29fafc8b3a53de6abaf
change-id: 20260508-v2_20230123_tjmercier_google_com-f44fcfb16530

Best regards,
-- 
Albert Esteve <aesteve@redhat.com>

* [PATCH RFC 1/5] memcg: Track exported dma-buffers
From: Albert Esteve @ 2026-05-12  9:10 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

From: "T.J. Mercier" <tjmercier@google.com>

When a buffer is exported to userspace, use memcg to attribute the
buffer to the allocating cgroup until all buffer references are
released.

Unlike the dmabuf sysfs stats implementation, this memcg accounting
avoids contention over the kernfs_rwsem incurred when creating or
removing nodes.
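
The charge lifecycle described here can be modelled in userspace as a
rough sketch. This is not kernel code: sketch_memcg, sketch_nr_pages(),
sketch_charge() and sketch_uncharge() are hypothetical names, a 4 KiB
page size is assumed, and the real try_charge() path (reclaim,
hierarchy, per-CPU stock) is deliberately left out. It only shows the
page rounding used by the patch (PAGE_ALIGN(size) / PAGE_SIZE) and the
pairing of a charge at export with an uncharge at release.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; the kernel uses the architecture's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical stand-in for the per-cgroup counters touched by the patch. */
struct sketch_memcg {
	unsigned long limit_pages;	/* configured limit, in pages */
	unsigned long usage_pages;	/* total charged pages */
	unsigned long dmabuf_pages;	/* models the MEMCG_DMABUF stat */
};

/* Same rounding as PAGE_ALIGN(size) / PAGE_SIZE in the patch. */
unsigned long sketch_nr_pages(size_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Charge at export time: fails if the limit would be exceeded. */
bool sketch_charge(struct sketch_memcg *cg, unsigned long nr_pages)
{
	if (nr_pages > cg->limit_pages - cg->usage_pages)
		return false;
	cg->usage_pages += nr_pages;
	cg->dmabuf_pages += nr_pages;
	return true;
}

/* Uncharge when the last reference to the buffer is dropped. */
void sketch_uncharge(struct sketch_memcg *cg, unsigned long nr_pages)
{
	cg->usage_pages -= nr_pages;
	cg->dmabuf_pages -= nr_pages;
}
```

In this model a 4097-byte buffer costs two pages, and a cgroup limited
to four pages that already holds three cannot take it.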

Signed-off-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 drivers/dma-buf/dma-buf.c               | 13 ++++++++++++
 include/linux/dma-buf.h                 |  4 ++++
 include/linux/memcontrol.h              | 37 +++++++++++++++++++++++++++++++++
 mm/memcontrol.c                         | 19 +++++++++++++++++
 5 files changed, 77 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed995..8bdbc2e866430 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1635,6 +1635,10 @@ The following nested keys are defined.
 		Amount of memory used for storing in-kernel data
 		structures.
 
+	  dmabuf (npn)
+		Amount of memory used for exported DMA buffers allocated by the cgroup.
+		Stays with the allocating cgroup regardless of how the buffer is shared.
+
 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
 
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 71f37544a5c61..ce02377f48908 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -14,6 +14,7 @@
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/dma-buf.h>
+#include <linux/memcontrol.h>
 #include <linux/dma-fence.h>
 #include <linux/dma-fence-unwrap.h>
 #include <linux/anon_inodes.h>
@@ -180,6 +181,9 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
 
+	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+	mem_cgroup_put(dmabuf->memcg);
+
 	dmabuf->ops->release(dmabuf);
 
 	if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
@@ -760,6 +764,13 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}
 
+	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
+	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
+				      GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err_memcg;
+	}
+
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -770,6 +781,8 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 
 	return dmabuf;
 
+err_memcg:
+	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d1203da56fc5f..d9f1ccb51c60e 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -27,6 +27,7 @@
 struct device;
 struct dma_buf;
 struct dma_buf_attachment;
+struct mem_cgroup;
 
 /**
  * struct dma_buf_ops - operations possible on struct dma_buf
@@ -429,6 +430,9 @@ struct dma_buf {
 
 		__poll_t active;
 	} cb_in, cb_out;
+
+	/** @memcg: the cgroup to which this buffer is currently attributed */
+	struct mem_cgroup *memcg;
 };
 
 /**
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b4..10068a833ad9e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -39,6 +39,7 @@ enum memcg_stat_item {
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
+	MEMCG_DMABUF,
 	MEMCG_NR_STAT,
 };
 
@@ -649,6 +650,24 @@ int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp);
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				  gfp_t gfp, swp_entry_t entry);
 
+/**
+ * mem_cgroup_charge_dmabuf - Charge dma-buf memory to a cgroup and update stat counter
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ * @gfp_mask: reclaim mode
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if it doesn't.
+ */
+bool __mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask);
+static inline bool mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages,
+					    gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return true;
+	return __mem_cgroup_charge_dmabuf(memcg, nr_pages, gfp_mask);
+}
+
 void __mem_cgroup_uncharge(struct folio *folio);
 
 /**
@@ -664,6 +683,14 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
 	__mem_cgroup_uncharge(folio);
 }
 
+void __mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages);
+static inline void mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	if (mem_cgroup_disabled())
+		return;
+	__mem_cgroup_uncharge_dmabuf(memcg, nr_pages);
+}
+
 void __mem_cgroup_uncharge_folios(struct folio_batch *folios);
 static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
@@ -1142,10 +1169,20 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 	return 0;
 }
 
+static inline bool mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages,
+					    gfp_t gfp_mask)
+{
+	return true;
+}
+
 static inline void mem_cgroup_uncharge(struct folio *folio)
 {
 }
 
+static inline void mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+}
+
 static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d4668..15cee13d3ccd6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -433,6 +433,7 @@ static const unsigned int memcg_stat_items[] = {
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
+	MEMCG_DMABUF,
 };
 
 #define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
@@ -1580,6 +1581,7 @@ static const struct memory_stat memory_stats[] = {
 #ifdef CONFIG_HUGETLB_PAGE
 	{ "hugetlb",			NR_HUGETLB			},
 #endif
+	{ "dmabuf",			MEMCG_DMABUF			},
 
 	/* The memory events */
 	{ "workingset_refault_anon",	WORKINGSET_REFAULT_ANON		},
@@ -5399,6 +5401,23 @@ void mem_cgroup_flush_workqueue(void)
 	flush_workqueue(memcg_wq);
 }
 
+bool __mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask)
+{
+	if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
+		mod_memcg_state(memcg, MEMCG_DMABUF, nr_pages);
+		return true;
+	}
+
+	return false;
+}
+
+void __mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	mod_memcg_state(memcg, MEMCG_DMABUF, -nr_pages);
+	if (!mem_cgroup_is_root(memcg))
+		refill_stock(memcg, nr_pages);
+}
+
 static int __init cgroup_memory(char *s)
 {
 	char *token;

-- 
2.53.0


* [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-12  9:10 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

On embedded platforms a central process often allocates dma-buf
memory on behalf of client applications. Without a way to
attribute the charge to the requesting client's cgroup, the
cost lands on the allocator, making per-cgroup memory limits
ineffective for the actual consumers.

Add charge_pid_fd to struct dma_heap_allocation_data. When set to
a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
the mem_accounting module parameter enabled, the buffer is charged
to the allocator's own cgroup.

Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
system heap allocation in memcg") added __GFP_ACCOUNT to system-heap
page allocations. Keeping __GFP_ACCOUNT would charge the same pages
twice (once to kmem, once to MEMCG_DMABUF), so remove it and route
all accounting through the single MEMCG_DMABUF path.

Usage examples:

  1. Central allocator charging to a client at allocation time.
     The allocator knows the client's PID (e.g., from binder's
     sender_pid) and uses pidfd to attribute the charge:

       pid_t client_pid = txn->sender_pid;
       int pidfd = pidfd_open(client_pid, 0);

       struct dma_heap_allocation_data alloc = {
           .len             = buffer_size,
           .fd_flags        = O_RDWR | O_CLOEXEC,
           .charge_pid_fd   = pidfd,
       };
       ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
       close(pidfd);
       /* alloc.fd is now charged to client's cgroup */

  2. Default allocation (no pidfd, mem_accounting=1).
     When charge_pid_fd is not set and the mem_accounting module
     parameter is enabled, the buffer is charged to the allocator's
     own cgroup:

       struct dma_heap_allocation_data alloc = {
           .len      = buffer_size,
           .fd_flags = O_RDWR | O_CLOEXEC,
       };
       ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
       /* charged to current process's cgroup */

Current limitations:

 - Single-owner model: a dma-buf carries one memcg charge regardless of
   how many processes share it. This means that only the first owner (and
   exporter) of the shared buffer bears the charge.
 - Only memcg accounting is supported. While this makes sense for system
   heap buffers, other heaps (e.g., CMA heaps) will also require
   selectively charging the dmem controller.

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  5 ++--
 drivers/dma-buf/dma-buf.c               | 16 ++++---------
 drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
 drivers/dma-buf/heaps/system_heap.c     |  2 --
 include/uapi/linux/dma-heap.h           |  6 +++++
 5 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8bdbc2e866430..824d269531eb1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1636,8 +1636,9 @@ The following nested keys are defined.
 		structures.
 
 	  dmabuf (npn)
-		Amount of memory used for exported DMA buffers allocated by the cgroup.
-		Stays with the allocating cgroup regardless of how the buffer is shared.
+		Amount of memory used for exported DMA buffers allocated by or on
+		behalf of the cgroup. Stays with the allocating cgroup regardless
+		of how the buffer is shared.
 
 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ce02377f48908..23fb758b78297 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
 
-	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
-	mem_cgroup_put(dmabuf->memcg);
+	if (dmabuf->memcg) {
+		mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
+					  PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+		mem_cgroup_put(dmabuf->memcg);
+	}
 
 	dmabuf->ops->release(dmabuf);
 
@@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}
 
-	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
-				      GFP_KERNEL)) {
-		ret = -ENOMEM;
-		goto err_memcg;
-	}
-
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 
 	return dmabuf;
 
-err_memcg:
-	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index ac5f8685a6494..ff6e259afcdc0 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,13 +7,17 @@
  */
 
 #include <linux/cdev.h>
+#include <linux/cgroup.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include <linux/err.h>
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/nospec.h>
+#include <linux/pidfd.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
 		 "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
 
 static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
-				 u32 fd_flags,
-				 u64 heap_flags)
+				 u32 fd_flags, u64 heap_flags,
+				 struct mem_cgroup *charge_to)
 {
 	struct dma_buf *dmabuf;
+	unsigned int nr_pages;
+	struct mem_cgroup *memcg = charge_to;
 	int fd;
 
 	/*
@@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
 	if (IS_ERR(dmabuf))
 		return PTR_ERR(dmabuf);
 
+	nr_pages = len / PAGE_SIZE;
+
+	if (memcg)
+		css_get(&memcg->css);
+	else if (mem_accounting)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
+	if (memcg) {
+		if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			dma_buf_put(dmabuf);
+			return -ENOMEM;
+		}
+		dmabuf->memcg = memcg;
+	}
+
 	fd = dma_buf_fd(dmabuf, fd_flags);
 	if (fd < 0) {
 		dma_buf_put(dmabuf);
@@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
+	struct mem_cgroup *memcg = NULL;
+	struct task_struct *task;
+	unsigned int pidfd_flags;
 	int fd;
 
 	if (heap_allocation->fd)
@@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
 		return -EINVAL;
 
+	if (heap_allocation->charge_pid_fd) {
+		task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
+		if (IS_ERR(task))
+			return PTR_ERR(task);
+
+		memcg = get_mem_cgroup_from_mm(task->mm);
+		put_task_struct(task);
+	}
+
 	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
 				   heap_allocation->fd_flags,
-				   heap_allocation->heap_flags);
+				   heap_allocation->heap_flags,
+				   memcg);
+	mem_cgroup_put(memcg);
 	if (fd < 0)
 		return fd;
 
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 03c2b87cb1112..95d7688167b93 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
 		if (max_order < orders[i])
 			continue;
 		flags = order_flags[i];
-		if (mem_accounting)
-			flags |= __GFP_ACCOUNT;
 		page = alloc_pages(flags, orders[i]);
 		if (!page)
 			continue;
diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
index a4cf716a49fa6..e02b0f8cbc6a1 100644
--- a/include/uapi/linux/dma-heap.h
+++ b/include/uapi/linux/dma-heap.h
@@ -29,6 +29,10 @@
  *			handle to the allocated dma-buf
  * @fd_flags:		file descriptor flags used when allocating
  * @heap_flags:		flags passed to heap
+ * @charge_pid_fd:	optional pidfd of the process whose cgroup should be
+ *			charged for this allocation; 0 means charge the calling
+ *			process's cgroup
+ * @__padding:		reserved, must be zero
  *
  * Provided by userspace as an argument to the ioctl
  */
@@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
 	__u32 fd;
 	__u32 fd_flags;
 	__u64 heap_flags;
+	__u32 charge_pid_fd;
+	__u32 __padding;
 };
 
 #define DMA_HEAP_IOC_MAGIC		'H'

-- 
2.53.0


* [PATCH RFC 3/5] security: dma-heap: Add dma_heap_alloc LSM hook
From: Albert Esteve @ 2026-05-12  9:10 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

DMA_HEAP_IOCTL_ALLOC accepts a charge_pid_fd field that,
when set, causes the allocation to be charged to an arbitrary
process's cgroup rather than the caller's.

Without an access-control point, any process that holds a handle
to a dma-heap device node can charge unlimited memory to any other
process's cgroup, potentially exhausting that cgroup's limit and
triggering OOM kills independent of the victim's own activity or
privileges.

Add security_dma_heap_alloc(), called in dma_heap_ioctl_allocate()
whenever charge_pid_fd is set. The hook receives the credentials of
the allocating process (from) and the credentials of the process
whose cgroup will be charged (to), giving security modules an
enforcement point for cross-cgroup dma-buf attribution policy.

When CONFIG_SECURITY is not set the hook compiles to an inline
returning 0, adding no overhead to the fast path.
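
As a rough userspace model of the hook's semantics (not the kernel's
call_int_hook() machinery), the dispatch for an int hook with default 0
can be pictured as: consult each registered implementation in order and
return the first non-zero result; with nothing registered the answer is
0, i.e. allowed. All names below (sketch_cred, sketch_hook_fn,
sketch_security_dma_heap_alloc(), sketch_allow_all(),
sketch_deny_cross()) are hypothetical, and the single sid field is a
simplification of real credentials.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct cred. */
struct sketch_cred {
	unsigned int sid;	/* assumption: one security id per task */
};

typedef int (*sketch_hook_fn)(const struct sketch_cred *from,
			      const struct sketch_cred *to);

/* Model of the dispatcher: default 0, first non-zero result wins. */
int sketch_security_dma_heap_alloc(const sketch_hook_fn *hooks, size_t n,
				   const struct sketch_cred *from,
				   const struct sketch_cred *to)
{
	for (size_t i = 0; i < n; i++) {
		int ret = hooks[i](from, to);

		if (ret)
			return ret;
	}
	return 0;
}

/* Example policy: permit everything. */
int sketch_allow_all(const struct sketch_cred *from,
		     const struct sketch_cred *to)
{
	(void)from;
	(void)to;
	return 0;
}

/* Example policy: only allow charging a task with the same sid. */
int sketch_deny_cross(const struct sketch_cred *from,
		      const struct sketch_cred *to)
{
	return from->sid == to->sid ? 0 : -EACCES;
}
```

With no hooks registered the model always returns 0, which matches the
"no policy, no restriction" behaviour this patch leaves in place until
a security module implements the hook.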

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 drivers/dma-buf/dma-heap.c    | 12 +++++++++++-
 include/linux/lsm_hook_defs.h |  1 +
 include/linux/security.h      |  7 +++++++
 security/security.c           | 16 ++++++++++++++++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index ff6e259afcdc0..e8ffb1031955e 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -18,6 +18,7 @@
 #include <linux/list.h>
 #include <linux/nospec.h>
 #include <linux/pidfd.h>
+#include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -122,12 +123,13 @@ static int dma_heap_open(struct inode *inode, struct file *file)
 
 static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
+	const struct cred *tcred;
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
 	struct mem_cgroup *memcg = NULL;
 	struct task_struct *task;
 	unsigned int pidfd_flags;
-	int fd;
+	int fd, ret;
 
 	if (heap_allocation->fd)
 		return -EINVAL;
@@ -143,6 +145,14 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 		if (IS_ERR(task))
 			return PTR_ERR(task);
 
+		tcred = get_task_cred(task);
+		ret = security_dma_heap_alloc(current_cred(), tcred);
+		put_cred(tcred);
+		if (ret) {
+			put_task_struct(task);
+			return ret;
+		}
+
 		memcg = get_mem_cgroup_from_mm(task->mm);
 		put_task_struct(task);
 	}
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 2b8dfb35caed3..6a91656f97e1e 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -43,6 +43,7 @@ LSM_HOOK(int, 0, capset, struct cred *new, const struct cred *old,
 	 const kernel_cap_t *permitted)
 LSM_HOOK(int, 0, capable, const struct cred *cred, struct user_namespace *ns,
 	 int cap, unsigned int opts)
+LSM_HOOK(int, 0, dma_heap_alloc, const struct cred *from, const struct cred *to)
 LSM_HOOK(int, 0, quotactl, int cmds, int type, int id, const struct super_block *sb)
 LSM_HOOK(int, 0, quota_on, struct dentry *dentry)
 LSM_HOOK(int, 0, syslog, int type)
diff --git a/include/linux/security.h b/include/linux/security.h
index 41d7367cf4036..f1dad1eabe754 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -350,6 +350,7 @@ int security_capable(const struct cred *cred,
 		       struct user_namespace *ns,
 		       int cap,
 		       unsigned int opts);
+int security_dma_heap_alloc(const struct cred *from, const struct cred *to);
 int security_quotactl(int cmds, int type, int id, const struct super_block *sb);
 int security_quota_on(struct dentry *dentry);
 int security_syslog(int type);
@@ -701,6 +702,12 @@ static inline int security_capable(const struct cred *cred,
 	return cap_capable(cred, ns, cap, opts);
 }
 
+static inline int security_dma_heap_alloc(const struct cred *from,
+					  const struct cred *to)
+{
+	return 0;
+}
+
 static inline int security_quotactl(int cmds, int type, int id,
 				     const struct super_block *sb)
 {
diff --git a/security/security.c b/security/security.c
index 4e999f0236516..4adacef73c507 100644
--- a/security/security.c
+++ b/security/security.c
@@ -660,6 +660,22 @@ int security_capable(const struct cred *cred,
 	return call_int_hook(capable, cred, ns, cap, opts);
 }
 
+/**
+ * security_dma_heap_alloc() - Check if cross-cgroup dma-heap charging is allowed
+ * @from: credentials of the allocating process
+ * @to: credentials of the process to charge
+ *
+ * Check whether the process with credentials @from is allowed to allocate
+ * dma-heap memory and charge it to the cgroup of the process with credentials
+ * @to.
+ *
+ * Return: Returns 0 if permission is granted.
+ */
+int security_dma_heap_alloc(const struct cred *from, const struct cred *to)
+{
+	return call_int_hook(dma_heap_alloc, from, to);
+}
+
 /**
  * security_quotactl() - Check if a quotactl() syscall is allowed for this fs
  * @cmds: commands

-- 
2.53.0


* [PATCH RFC 4/5] selinux: Restrict cross-cgroup dma-heap charging
From: Albert Esteve @ 2026-05-12  9:10 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

The security_dma_heap_alloc() hook allows security modules
to control which processes may charge dma-buf allocations
to another process's cgroup via the charge_pid_fd field of
DMA_HEAP_IOCTL_ALLOC. Without a policy implementation, the
hook is a no-op and the restriction is not enforced.

On SELinux-managed systems any domain with access to a
dma-heap device node can therefore exhaust another cgroup's
memory budget without restriction.

Implement selinux_dma_heap_alloc() using avc_has_perm() with
a new dma_heap object class and a charge_to permission. Policy
authors can then grant cross-cgroup charging selectively,
for example:

  allow allocator_app_t client_app_t:dma_heap charge_to;

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 security/selinux/hooks.c            | 7 +++++++
 security/selinux/include/classmap.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 0f704380a8c81..ea1f410b9f619 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2189,6 +2189,12 @@ static int selinux_capable(const struct cred *cred, struct user_namespace *ns,
 	return cred_has_capability(cred, cap, opts, ns == &init_user_ns);
 }
 
+static int selinux_dma_heap_alloc(const struct cred *from, const struct cred *to)
+{
+	return avc_has_perm(cred_sid(from), cred_sid(to),
+			    SECCLASS_DMA_HEAP, DMA_HEAP__CHARGE_TO, NULL);
+}
+
 static int selinux_quotactl(int cmds, int type, int id, const struct super_block *sb)
 {
 	const struct cred *cred = current_cred();
@@ -7541,6 +7547,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(capget, selinux_capget),
 	LSM_HOOK_INIT(capset, selinux_capset),
 	LSM_HOOK_INIT(capable, selinux_capable),
+	LSM_HOOK_INIT(dma_heap_alloc, selinux_dma_heap_alloc),
 	LSM_HOOK_INIT(quotactl, selinux_quotactl),
 	LSM_HOOK_INIT(quota_on, selinux_quota_on),
 	LSM_HOOK_INIT(syslog, selinux_syslog),
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 90cb61b164256..d232f7808f6b8 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -181,6 +181,7 @@ const struct security_class_mapping secclass_map[] = {
 	{ "user_namespace", { "create", NULL } },
 	{ "memfd_file",
 	  { COMMON_FILE_PERMS, "execute_no_trans", "entrypoint", NULL } },
+	{ "dma_heap", { "charge_to", NULL } },
 	/* last one */ { NULL, {} }
 };
 

-- 
2.53.0


* [PATCH RFC 5/5] selftests/dmabuf-heaps: Add dma-buf memcg accounting tests
From: Albert Esteve @ 2026-05-12  9:10 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-0-6326701c3691@redhat.com>

Add tests for the new charge_pid_fd field in struct
dma_heap_allocation_data.

When the charge_pid_fd feature is absent (unpatched kernel),
the probe in pidfd_alloc_supported() detects this and the
tests are skipped gracefully.

Add vmtest.sh similar to other subsystem suites, to orchestrate
building the selftests (optionally with a freshly compiled kernel)
inside a virtme-ng VM, so the tests can be run without modifying
the host system. Add a config fragment with required Kconfig symbols.

Also add test_memcg_dmabuf() to the existing test_memcontrol suite
to verify end-to-end cross-cgroup accounting: a parent process opens
a pidfd for a child in a separate cgroup, allocates a dma-buf via
DMA_HEAP_IOCTL_ALLOC with that pidfd, and asserts that memory.stat
dmabuf in the child's cgroup reflects the allocation. If the dmabuf
key is missing (unpatched kernel) or /dev/dma_heap/system is absent,
the test is skipped.
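
As a rough illustration of the memory.stat check described above (not
part of this patch): the lookup boils down to finding a "<key> <value>"
line. The sketch below is a standalone approximation; stat_read_key()
is a hypothetical name and is not the cgroup_util helper the test
actually calls. It matches the key only at the start of a line and
returns -1 when the key is absent, which is the condition the test
treats as "unpatched kernel, skip".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "key" in a memory.stat-style buffer ("<key> <value>\n" per line).
 * Returns the value, or -1 when the key is absent.
 * Hypothetical helper, for illustration only.
 */
static long stat_read_key(const char *buf, const char *key)
{
	const char *line = buf;
	size_t klen;

	assert(buf && key);
	klen = strlen(key);

	while (*line) {
		/* Match the key only at line start and followed by a space. */
		if (!strncmp(line, key, klen) && line[klen] == ' ')
			return strtol(line + klen + 1, NULL, 10);

		/* Advance to the next line, if there is one. */
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return -1;
}
```

A caller would read the cgroup's memory.stat into a buffer and compare
stat_read_key(buf, "dmabuf") against the allocation size.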

Assisted-by: Claude:claude-sonnet-4-6 Cursor
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 tools/testing/selftests/cgroup/Makefile            |   2 +-
 tools/testing/selftests/cgroup/test_memcontrol.c   | 143 +++++++++++++-
 tools/testing/selftests/dmabuf-heaps/config        |   1 +
 tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 126 ++++++++++++-
 tools/testing/selftests/dmabuf-heaps/vmtest.sh     | 205 +++++++++++++++++++++
 5 files changed, 473 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index e01584c2189ac..9edfc9f1de5c4 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-CFLAGS += -Wall -pthread
+CFLAGS += -Wall -pthread $(KHDR_INCLUDES)
 
 all: ${HELPER_PROGS}
 
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index b43da9bc20c49..b6a228407530f 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -19,9 +19,17 @@
 #include <errno.h>
 #include <sys/mman.h>
 
+#include <linux/dma-heap.h>
+#include <signal.h>
+#include <sys/ioctl.h>
+
+#include "../pidfd/pidfd.h"
 #include "kselftest.h"
 #include "cgroup_util.h"
 
+#define DMA_HEAP_SYSTEM		"/dev/dma_heap/system"
+#define ONE_MEG			(1024 * 1024)
+
 #define MEMCG_SOCKSTAT_WAIT_RETRIES        30
 
 static bool has_localevents;
@@ -1762,6 +1770,125 @@ static int test_memcg_inotify_delete_dir(const char *root)
 	return ret;
 }
 
+static int memcg_dmabuf_child(const char *cgroup, void *arg)
+{
+	pause();
+	return 0;
+}
+
+/*
+ * This test allocates a dma-buf via DMA_HEAP_IOCTL_ALLOC with a pidfd
+ * pointing to a child process in a separate cgroup, then checks that
+ * memory.stat[dmabuf] in the child's cgroup rises by the allocation size
+ * and returns to zero after the buffer fd is closed.
+ */
+static int test_memcg_dmabuf(const char *root)
+{
+	char *parent = NULL, *child_cg = NULL;
+	int ret = KSFT_FAIL;
+	int heap_fd = -1, dmabuf_fd = -1, pidfd = -1;
+	pid_t child_pid;
+	int child_status;
+	long dmabuf_stat;
+	struct dma_heap_allocation_data alloc = {
+		.len      = ONE_MEG,
+		.fd_flags = O_RDWR | O_CLOEXEC,
+	};
+
+	if (access(DMA_HEAP_SYSTEM, R_OK | W_OK)) {
+		ret = KSFT_SKIP;
+		goto cleanup;
+	}
+
+	parent = cg_name(root, "dmabuf_memcg_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup_parent;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup_parent;
+
+	child_cg = cg_name(parent, "child");
+	if (!child_cg)
+		goto cleanup_parent;
+
+	if (cg_create(child_cg))
+		goto cleanup_child;
+
+	child_pid = cg_run_nowait(child_cg, memcg_dmabuf_child, NULL);
+	if (child_pid < 0)
+		goto cleanup_child;
+
+	if (cg_wait_for_proc_count(child_cg, 1))
+		goto cleanup_kill;
+
+	pidfd = sys_pidfd_open(child_pid, 0);
+	if (pidfd < 0) {
+		ret = KSFT_SKIP;
+		goto cleanup_kill;
+	}
+
+	heap_fd = open(DMA_HEAP_SYSTEM, O_RDWR);
+	if (heap_fd < 0) {
+		ret = KSFT_SKIP;
+		goto cleanup_pidfd;
+	}
+
+	alloc.charge_pid_fd = (__u32)pidfd;
+	if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0)
+		goto cleanup_heap;
+	dmabuf_fd = (int)alloc.fd;
+
+	dmabuf_stat = cg_read_key_long(child_cg, "memory.stat", "dmabuf ");
+	if (dmabuf_stat == -1) {
+		ret = KSFT_SKIP;
+		goto cleanup_dmabuf;
+	}
+	if (dmabuf_stat != ONE_MEG)
+		dmabuf_stat = cg_read_key_long_poll(child_cg, "memory.stat",
+						    "dmabuf ", ONE_MEG,
+						    15, 200000);
+	if (dmabuf_stat != ONE_MEG) {
+		fprintf(stderr, "Expected dmabuf stat %d, got %ld\n",
+			ONE_MEG, dmabuf_stat);
+		goto cleanup_dmabuf;
+	}
+
+	close(dmabuf_fd);
+	dmabuf_fd = -1;
+
+	dmabuf_stat = cg_read_key_long_poll(child_cg, "memory.stat",
+					    "dmabuf ", 0, 15, 200000);
+	if (dmabuf_stat != 0) {
+		fprintf(stderr, "Expected dmabuf stat 0 after close, got %ld\n",
+			dmabuf_stat);
+		goto cleanup_heap;
+	}
+
+	ret = KSFT_PASS;
+
+cleanup_dmabuf:
+	if (dmabuf_fd >= 0)
+		close(dmabuf_fd);
+cleanup_heap:
+	close(heap_fd);
+cleanup_pidfd:
+	close(pidfd);
+cleanup_kill:
+	kill(child_pid, SIGTERM);
+	waitpid(child_pid, &child_status, 0);
+cleanup_child:
+	cg_destroy(child_cg);
+	free(child_cg);
+cleanup_parent:
+	cg_destroy(parent);
+	free(parent);
+cleanup:
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct memcg_test {
 	int (*fn)(const char *root);
@@ -1783,16 +1910,26 @@ struct memcg_test {
 	T(test_memcg_oom_group_score_events),
 	T(test_memcg_inotify_delete_file),
 	T(test_memcg_inotify_delete_dir),
+	T(test_memcg_dmabuf),
 };
 #undef T
 
 int main(int argc, char **argv)
 {
 	char root[PATH_MAX];
-	int i, proc_status;
+	int i, proc_status, plan;
+	const char *filter = NULL;
+
+	if (argc > 1)
+		filter = argv[1];
+
+	plan = 0;
+	for (i = 0; i < ARRAY_SIZE(tests); i++)
+		if (!filter || !strcmp(tests[i].name, filter))
+			plan++;
 
 	ksft_print_header();
-	ksft_set_plan(ARRAY_SIZE(tests));
+	ksft_set_plan(plan);
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");
 
@@ -1818,6 +1955,8 @@ int main(int argc, char **argv)
 	has_localevents = proc_status;
 
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (filter && strcmp(tests[i].name, filter))
+			continue;
 		switch (tests[i].fn(root)) {
 		case KSFT_PASS:
 			ksft_test_result_pass("%s\n", tests[i].name);
diff --git a/tools/testing/selftests/dmabuf-heaps/config b/tools/testing/selftests/dmabuf-heaps/config
index be091f1cdfa04..94c8f33b71a28 100644
--- a/tools/testing/selftests/dmabuf-heaps/config
+++ b/tools/testing/selftests/dmabuf-heaps/config
@@ -1,3 +1,4 @@
+CONFIG_MEMCG=y
 CONFIG_DMABUF_HEAPS=y
 CONFIG_DMABUF_HEAPS_SYSTEM=y
 CONFIG_DRM_VGEM=y
diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
index fc9694fc4e89e..904332b17698a 100644
--- a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
+++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
@@ -3,6 +3,7 @@
 #include <dirent.h>
 #include <errno.h>
 #include <fcntl.h>
+#include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
@@ -10,11 +11,14 @@
 #include <unistd.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
+#include <sys/syscall.h>
 #include <sys/types.h>
+#include <sys/wait.h>
 
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
 #include <drm/drm.h>
+#include "../pidfd/pidfd.h"
 #include "kselftest.h"
 
 #define DEVPATH "/dev/dma_heap"
@@ -320,6 +324,8 @@ static int dmabuf_heap_alloc_newer(int fd, size_t len, unsigned int flags,
 		__u32 fd;
 		__u32 fd_flags;
 		__u64 heap_flags;
+		__u32 charge_pid_fd;
+		__u32 __padding;
 		__u64 garbage1;
 		__u64 garbage2;
 		__u64 garbage3;
@@ -328,6 +334,8 @@ static int dmabuf_heap_alloc_newer(int fd, size_t len, unsigned int flags,
 		.fd = 0,
 		.fd_flags = O_RDWR | O_CLOEXEC,
 		.heap_flags = flags,
+		.charge_pid_fd = 0,
+		.__padding = 0,
 		.garbage1 = 0xffffffff,
 		.garbage2 = 0x88888888,
 		.garbage3 = 0x11111111,
@@ -390,6 +398,120 @@ static void test_alloc_errors(char *heap_name)
 	close(heap_fd);
 }
 
+static int dmabuf_heap_alloc_pidfd(int fd, size_t len, unsigned int heap_flags,
+				   unsigned int charge_pid_fd, int *dmabuf_fd)
+{
+	struct dma_heap_allocation_data data = {
+		.len = len,
+		.fd = 0,
+		.fd_flags = O_RDWR | O_CLOEXEC,
+		.heap_flags = heap_flags,
+		.charge_pid_fd = charge_pid_fd,
+	};
+	int ret;
+
+	if (!dmabuf_fd)
+		return -EINVAL;
+
+	ret = ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data);
+	if (ret < 0)
+		return ret;
+	*dmabuf_fd = (int)data.fd;
+	return ret;
+}
+
+/*
+ * Probe whether the kernel honours charge_pid_fd in DMA_HEAP_IOCTL_ALLOC.
+ */
+static bool pidfd_alloc_supported(int heap_fd)
+{
+	int devnull_fd, dmabuf_fd = -1, ret;
+
+	devnull_fd = open("/dev/null", O_RDONLY);
+	if (devnull_fd < 0)
+		return false;
+
+	ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, devnull_fd, &dmabuf_fd);
+	if (dmabuf_fd >= 0) {
+		close(dmabuf_fd);
+		dmabuf_fd = -1;
+	}
+	close(devnull_fd);
+	return ret < 0;
+}
+
+/*
+ * Test: allocate charging the calling process's own cgroup via a self pidfd.
+ */
+static void test_alloc_pidfd_self(char *heap_name)
+{
+	int heap_fd = -1, pidfd = -1, dmabuf_fd = -1, ret;
+
+	heap_fd = dmabuf_heap_open(heap_name);
+
+	if (!pidfd_alloc_supported(heap_fd)) {
+		ksft_test_result_skip("charge_pid_fd not supported by this kernel\n");
+		goto out;
+	}
+
+	pidfd = sys_pidfd_open(getpid(), 0);
+	if (pidfd < 0) {
+		ksft_test_result_skip("pidfd_open not available\n");
+		goto out;
+	}
+
+	ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, pidfd, &dmabuf_fd);
+	ksft_test_result(!ret, "Allocation with self pidfd %d\n", ret);
+	if (dmabuf_fd >= 0)
+		close(dmabuf_fd);
+	close(pidfd);
+out:
+	close(heap_fd);
+}
+
+/*
+ * Test: allocate charging a child process's cgroup via a child pidfd.
+ */
+static void test_alloc_pidfd_child(char *heap_name)
+{
+	int heap_fd = -1, pidfd = -1, dmabuf_fd = -1;
+	pid_t child_pid;
+	int status, ret;
+
+	heap_fd = dmabuf_heap_open(heap_name);
+
+	if (!pidfd_alloc_supported(heap_fd)) {
+		ksft_test_result_skip("charge_pid_fd not supported by this kernel\n");
+		goto out;
+	}
+
+	child_pid = fork();
+	if (child_pid == 0) {
+		pause();
+		_exit(0);
+	}
+	if (child_pid < 0)
+		ksft_exit_fail_msg("fork failed: %s\n", strerror(errno));
+
+	pidfd = sys_pidfd_open(child_pid, 0);
+	if (pidfd < 0) {
+		kill(child_pid, SIGTERM);
+		waitpid(child_pid, &status, 0);
+		ksft_test_result_skip("pidfd_open for child failed\n");
+		goto out;
+	}
+
+	ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, pidfd, &dmabuf_fd);
+	ksft_test_result(!ret, "Allocation with child pidfd %d\n", ret);
+	if (dmabuf_fd >= 0)
+		close(dmabuf_fd);
+	close(pidfd);
+	kill(child_pid, SIGTERM);
+	waitpid(child_pid, &status, 0);
+out:
+	close(heap_fd);
+}
+
 static int numer_of_heaps(void)
 {
 	DIR *d = opendir(DEVPATH);
@@ -420,7 +542,7 @@ int main(void)
 		return KSFT_SKIP;
 	}
 
-	ksft_set_plan(11 * numer_of_heaps());
+	ksft_set_plan(13 * numer_of_heaps());
 
 	while ((dir = readdir(d))) {
 		if (!strncmp(dir->d_name, ".", 2))
@@ -435,6 +557,8 @@ int main(void)
 		test_alloc_zeroed(dir->d_name, ONE_MEG);
 		test_alloc_compat(dir->d_name);
 		test_alloc_errors(dir->d_name);
+		test_alloc_pidfd_self(dir->d_name);
+		test_alloc_pidfd_child(dir->d_name);
 	}
 	closedir(d);
 
diff --git a/tools/testing/selftests/dmabuf-heaps/vmtest.sh b/tools/testing/selftests/dmabuf-heaps/vmtest.sh
new file mode 100755
index 0000000000000..6f1a878384127
--- /dev/null
+++ b/tools/testing/selftests/dmabuf-heaps/vmtest.sh
@@ -0,0 +1,205 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2026 Red Hat
+#
+# Dependencies:
+#		* virtme-ng
+#		* qemu	(used by virtme-ng)
+
+readonly SCRIPT_DIR="$(cd -P -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)"
+readonly KERNEL_CHECKOUT=$(realpath "${SCRIPT_DIR}"/../../../../)
+readonly CGROUP_DIR="${KERNEL_CHECKOUT}/tools/testing/selftests/cgroup"
+
+source "${SCRIPT_DIR}"/../kselftest/ktap_helpers.sh
+
+readonly DMABUF_HEAP_TEST="${SCRIPT_DIR}"/dmabuf-heap
+readonly MEMCONTROL_TEST="${CGROUP_DIR}"/test_memcontrol
+readonly TMP_DIR=$(mktemp -d /tmp/dmabuf-vmtest.XXXXXXXX)
+
+VERBOSE=false
+BUILD=false
+BUILD_HOST=""
+BUILD_HOST_PODMAN_CONTAINER_NAME=""
+
+usage() {
+	echo
+	echo "$0 [OPTIONS]"
+	echo
+	echo "Options"
+	echo "  -b: build the kernel from the current source tree and use it for the VM"
+	echo "  -H: hostname for remote build host (used with -b)"
+	echo "  -p: podman container name for remote build host (used with -b)"
+	echo "      Example: -H beefyserver -p vng"
+
+	echo "  -v: enable verbose vng/qemu output"
+	echo
+
+	exit 1
+}
+
+die() {
+	echo "$*" >&2
+	exit "${KSFT_FAIL}"
+}
+
+cleanup() {
+	rm -rf "${TMP_DIR}"
+}
+
+check_deps() {
+	for dep in vng make; do
+		if [[ ! -x $(command -v "${dep}") ]]; then
+			echo -e "skip:    dependency ${dep} not found!\n"
+			exit "${KSFT_SKIP}"
+		fi
+	done
+
+	if [[ ! -x "${DMABUF_HEAP_TEST}" ]]; then
+		printf "skip:    %s not found!" "${DMABUF_HEAP_TEST}"
+		printf " Please build the kselftest dmabuf-heaps target (or use -b).\n"
+		exit "${KSFT_SKIP}"
+	fi
+
+	if [[ ! -x "${MEMCONTROL_TEST}" ]]; then
+		printf "skip:    %s not found!" "${MEMCONTROL_TEST}"
+		printf " Please build the kselftest cgroup target (or use -b).\n"
+		exit "${KSFT_SKIP}"
+	fi
+}
+
+check_vng() {
+	local tested_versions=("1.36" "1.37")
+	local version
+	local ok=0
+
+	version="$(vng --version)"
+	for tv in "${tested_versions[@]}"; do
+		if [[ "${version}" == *"${tv}"* ]]; then
+			ok=1
+			break
+		fi
+	done
+
+	if [[ "${ok}" -eq 0 ]]; then
+		printf "warning: vng version '%s' has not been tested and may " "${version}" >&2
+		printf "not function properly.\n\tThe following versions have been tested: " >&2
+		echo "${tested_versions[@]}" >&2
+	fi
+}
+
+build_selftests() {
+	make -C "${KERNEL_CHECKOUT}" headers_install \
+		INSTALL_HDR_PATH="${TMP_DIR}/usr" -j"$(nproc)"
+
+	local khdr="-isystem ${TMP_DIR}/usr/include"
+
+	if ! make -C "${SCRIPT_DIR}" KHDR_INCLUDES="${khdr}" -j"$(nproc)"; then
+		die "failed to build dmabuf-heaps selftests"
+	fi
+
+	if ! make -C "${CGROUP_DIR}" KHDR_INCLUDES="${khdr}" \
+		"${MEMCONTROL_TEST}" -j"$(nproc)"; then
+		die "failed to build cgroup/test_memcontrol selftest"
+	fi
+}
+
+handle_build() {
+	if ! ${BUILD}; then
+		return
+	fi
+
+	if [[ ! -d "${KERNEL_CHECKOUT}" ]]; then
+		echo "-b requires vmtest.sh called from the kernel source tree" >&2
+		exit 1
+	fi
+
+	pushd "${KERNEL_CHECKOUT}" &>/dev/null
+
+	if ! vng --kconfig --config "${SCRIPT_DIR}/config"; then
+		die "failed to generate .config for kernel source tree (${KERNEL_CHECKOUT})"
+	fi
+
+	local vng_args=("-v" "--config" "${SCRIPT_DIR}/config" "--build")
+
+	if [[ -n "${BUILD_HOST}" ]]; then
+		vng_args+=("--build-host" "${BUILD_HOST}")
+	fi
+
+	if [[ -n "${BUILD_HOST_PODMAN_CONTAINER_NAME}" ]]; then
+		vng_args+=("--build-host-exec-prefix" \
+			   "podman exec -ti ${BUILD_HOST_PODMAN_CONTAINER_NAME}")
+	fi
+
+	if ! vng "${vng_args[@]}"; then
+		die "failed to build kernel from source tree (${KERNEL_CHECKOUT})"
+	fi
+
+	build_selftests
+
+	popd &>/dev/null
+}
+
+make_runner() {
+	# virtme-ng shares the host filesystem, so TMP_DIR is accessible
+	# inside the VM at the same absolute path.
+	cat > "${TMP_DIR}/run_tests.sh" <<-EOF
+	#!/bin/sh
+	set -u
+	PASS=0; FAIL=0; SKIP=0; N=0
+
+	run() {
+		name="\$1"; shift
+		N=\$((N+1))
+		"\$@"; rc=\$?
+		if   [ \$rc -eq 0 ]; then echo "ok \$N \$name";        PASS=\$((PASS+1))
+		elif [ \$rc -eq 4 ]; then echo "ok \$N \$name # SKIP"; SKIP=\$((SKIP+1))
+		else                      echo "not ok \$N \$name";    FAIL=\$((FAIL+1))
+		fi
+	}
+
+	run "dmabuf-heap charge_pid_fd ioctl"	${DMABUF_HEAP_TEST}
+	run "memcontrol dma-buf memcg"  ${MEMCONTROL_TEST} test_memcg_dmabuf
+	echo "# PASS=\$PASS SKIP=\$SKIP FAIL=\$FAIL"
+	[ \$FAIL -eq 0 ]
+	EOF
+	chmod +x "${TMP_DIR}/run_tests.sh"
+}
+
+run_vm() {
+	local verbose_opt=""
+	local kernel_opt=""
+
+	${VERBOSE} && verbose_opt="--verbose"
+
+	# If we are running from within the kernel source tree, use the kernel
+	# source tree as the kernel to boot, otherwise use the running kernel.
+	if [[ "$(realpath "$(pwd)")" == "${KERNEL_CHECKOUT}"* ]]; then
+		kernel_opt="${KERNEL_CHECKOUT}"
+	fi
+
+	vng --run ${kernel_opt} ${verbose_opt} --user root --memory 512M \
+		--exec "${TMP_DIR}/run_tests.sh"
+}
+
+while getopts :hvbH:p: o
+do
+	case $o in
+	v) VERBOSE=true;;
+	b) BUILD=true;;
+	H) BUILD_HOST=$OPTARG;;
+	p) BUILD_HOST_PODMAN_CONTAINER_NAME=$OPTARG;;
+	h|*) usage;;
+	esac
+done
+shift $((OPTIND-1))
+
+trap cleanup EXIT
+
+check_vng
+handle_build
+check_deps
+make_runner
+
+echo "Booting VM and running tests..."
+run_vm

-- 
2.53.0



* Re: [linus:master] [selftests]  465b05bae5: kernel-selftests.landlock.audit_test.audit.tsync_override_log_subdomains_off.fail
From: Thomas Weißschuh @ 2026-05-12  9:27 UTC (permalink / raw)
  To: Mickaël Salaün, Günther Noack
  Cc: kernel test robot, linux-security-module, oe-lkp, lkp,
	linux-kernel, Shuah Khan, Kees Cook, linux-kselftest
In-Reply-To: <202605111649.a8b30a62-lkp@intel.com>

Hi Mickaël and Günther,

I received the following report about a failing landlock selftest from
the 0day bot.

On Mon, May 11, 2026 at 10:15:00PM +0800, kernel test robot wrote:
> kernel test robot noticed "kernel-selftests.landlock.audit_test.audit.tsync_override_log_subdomains_off.fail" on:
> 
> commit: 465b05bae5ac553c13315681c1490dc565337771 ("selftests: harness: Restore order of test functions")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> 
> in testcase: kernel-selftests
> version: kernel-selftests-x86_64-9f2693489ef8-1_20260201
> with following parameters:
> 
> 	group: landlock
> 
> config: x86_64-rhel-9.4-kselftests
> compiler: gcc-14
> test machine: 16 threads Intel(R) Core(TM) i7-13620H (Raptor Lake) with 32G memory
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202605111649.a8b30a62-lkp@intel.com

I was unable to run the landlock selftests myself; on my machines they
fail at runtime with all kinds of colorful errors. Are the requirements
explained somewhere?

> # #  RUN           audit.tsync_override_log_subdomains_off ...
> # # audit_test.c:591:tsync_override_log_subdomains_off:Expected 0 (0) == matches_log_signal(_metadata, self->audit_fd, child_data.parent_pid, NULL) (-11)

This error number means "EAGAIN 11 Resource temporarily unavailable",
so it could be a temporary error.
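
For reference, a minimal sketch of how such a return value can be
decoded, assuming matches_log_signal() follows the usual negative-errno
convention. ret_is_eagain() and describe_ret() are made-up names for
illustration and are not part of the landlock selftests:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True when a negative-errno return value is -EAGAIN ("try again"). */
static bool ret_is_eagain(int ret)
{
	return ret == -EAGAIN;
}

/* Format "<ret> (<strerror text>)" into buf; returns buf for convenience. */
static const char *describe_ret(int ret, char *buf, size_t len)
{
	assert(buf && len > 0);
	if (ret < 0)
		snprintf(buf, len, "%d (%s)", ret, strerror(-ret));
	else
		snprintf(buf, len, "%d", ret);
	return buf;
}
```

A check like ret_is_eagain() is what would separate a record that has not
arrived yet from a hard failure, if the test wanted to retry.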

Can you reproduce this issue? Is it really dependent on my patch as
blamed above? If so, does the selftest rely on the previous, incorrect order?

> # # tsync_override_log_subdomains_off: Test failed
> # #          FAIL  audit.tsync_override_log_subdomains_off
> # not ok 5 audit.tsync_override_log_subdomains_off


Thomas


* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Christian König @ 2026-05-12 10:14 UTC (permalink / raw)
  To: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Sumit Semwal, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-2-6326701c3691@redhat.com>

On 5/12/26 11:10, Albert Esteve wrote:
> On embedded platforms a central process often allocates dma-buf
> memory on behalf of client applications. Without a way to
> attribute the charge to the requesting client's cgroup, the
> cost lands on the allocator, making per-cgroup memory limits
> ineffective for the actual consumers.
> 
> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> the mem_accounting module parameter enabled, the buffer is charged
> to the allocator's own cgroup.
> 
> Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> all accounting through a single MEMCG_DMABUF path.
> 
> Usage examples:
> 
>   1. Central allocator charging to a client at allocation time.
>      The allocator knows the client's PID (e.g., from binder's
>      sender_pid) and uses pidfd to attribute the charge:
> 
>        pid_t client_pid = txn->sender_pid;
>        int pidfd = pidfd_open(client_pid, 0);
> 
>        struct dma_heap_allocation_data alloc = {
>            .len             = buffer_size,
>            .fd_flags        = O_RDWR | O_CLOEXEC,
>            .charge_pid_fd   = pidfd,
>        };
>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
>        close(pidfd);
>        /* alloc.fd is now charged to client's cgroup */
> 
>   2. Default allocation (no pidfd, mem_accounting=1).
>      When charge_pid_fd is not set and the mem_accounting module
>      parameter is enabled, the buffer is charged to the allocator's
>      own cgroup:
> 
>        struct dma_heap_allocation_data alloc = {
>            .len      = buffer_size,
>            .fd_flags = O_RDWR | O_CLOEXEC,
>        };
>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
>        /* charged to current process's cgroup */
> 
> Current limitations:
> 
>  - Single-owner model: a dma-buf carries one memcg charge regardless of
>    how many processes share it. Means only the first owner (and exporter)
>    of the shared buffer bears the charge.
>  - Only memcg accounting supported. While this makes sense for system
>    heap buffers, other heaps (e.g., CMA heaps) will require selectively
>    charging also for the dmem controller.

Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.

I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive, etc.

Essentially the problem boils down to two limitations:
1) A piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups.
2) When memory references in the form of file descriptors are passed between applications, we have no way of changing the accounting to a different cgroup.
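
To make the two limitations concrete, here is a toy userspace model of
single-owner charging. It is only a sketch: struct toy_cgroup, struct
toy_buf and the toy_*() helpers are invented for illustration and do not
correspond to kernel structures or to the code in this series.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT kernel structures. */
struct toy_cgroup {
	long charged;			/* bytes currently charged to this group */
};

struct toy_buf {
	struct toy_cgroup *owner;	/* the single group carrying the charge */
	long size;
	int refs;			/* open file descriptors, across all groups */
};

/* Allocate: the whole buffer is charged to exactly one group. */
static void toy_alloc(struct toy_buf *b, struct toy_cgroup *cg, long size)
{
	assert(b && cg && size > 0);
	b->owner = cg;
	b->size = size;
	b->refs = 1;
	cg->charged += size;
}

/* Pass the fd to a process in another group: new reference, same owner. */
static void toy_share(struct toy_buf *b, struct toy_cgroup *receiver)
{
	assert(b && b->refs > 0);
	(void)receiver;			/* the receiver is never charged */
	b->refs++;
}

/* Close one fd: the charge is dropped only with the last reference. */
static void toy_close(struct toy_buf *b)
{
	assert(b && b->refs > 0);
	if (--b->refs == 0) {
		b->owner->charged -= b->size;
		b->owner = NULL;
	}
}
```

In this model, after toy_alloc() in group A and toy_share() to group B,
A still carries the full charge even once A closes its own descriptor;
the charge only goes away when B closes the last reference.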

The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases which use file descriptors as well, e.g. memfd, accel and GPU drivers, etc.

On the other hand, it is really nice to finally see this tackled for at least DMA-buf heaps. On the GPU side I saw yet another attempt at a driver doing some kind of special driver-specific accounting to solve this just a few weeks ago. And to be honest, such single-driver island approaches tend to break more often than they work correctly.

Regards,
Christian.

> 
> Signed-off-by: Albert Esteve <aesteve@redhat.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
>  drivers/dma-buf/dma-buf.c               | 16 ++++---------
>  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
>  drivers/dma-buf/heaps/system_heap.c     |  2 --
>  include/uapi/linux/dma-heap.h           |  6 +++++
>  5 files changed, 53 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 8bdbc2e866430..824d269531eb1 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1636,8 +1636,9 @@ The following nested keys are defined.
>  		structures.
>  
>  	  dmabuf (npn)
> -		Amount of memory used for exported DMA buffers allocated by the cgroup.
> -		Stays with the allocating cgroup regardless of how the buffer is shared.
> +		Amount of memory used for exported DMA buffers allocated by or on
> +		behalf of the cgroup. Stays with the allocating cgroup regardless
> +		of how the buffer is shared.
>  
>  	  workingset_refault_anon
>  		Number of refaults of previously evicted anonymous pages.
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index ce02377f48908..23fb758b78297 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
>  	 */
>  	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
>  
> -	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> -	mem_cgroup_put(dmabuf->memcg);
> +	if (dmabuf->memcg) {
> +		mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> +					  PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> +		mem_cgroup_put(dmabuf->memcg);
> +	}
>  
>  	dmabuf->ops->release(dmabuf);
>  
> @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>  		dmabuf->resv = resv;
>  	}
>  
> -	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> -	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> -				      GFP_KERNEL)) {
> -		ret = -ENOMEM;
> -		goto err_memcg;
> -	}
> -
>  	file->private_data = dmabuf;
>  	file->f_path.dentry->d_fsdata = dmabuf;
>  	dmabuf->file = file;
> @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>  
>  	return dmabuf;
>  
> -err_memcg:
> -	mem_cgroup_put(dmabuf->memcg);
>  err_file:
>  	fput(file);
>  err_module:
> diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> index ac5f8685a6494..ff6e259afcdc0 100644
> --- a/drivers/dma-buf/dma-heap.c
> +++ b/drivers/dma-buf/dma-heap.c
> @@ -7,13 +7,17 @@
>   */
>  
>  #include <linux/cdev.h>
> +#include <linux/cgroup.h>
>  #include <linux/device.h>
>  #include <linux/dma-buf.h>
>  #include <linux/dma-heap.h>
> +#include <linux/memcontrol.h>
> +#include <linux/sched/mm.h>
>  #include <linux/err.h>
>  #include <linux/export.h>
>  #include <linux/list.h>
>  #include <linux/nospec.h>
> +#include <linux/pidfd.h>
>  #include <linux/syscalls.h>
>  #include <linux/uaccess.h>
>  #include <linux/xarray.h>
> @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
>  		 "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
>  
>  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> -				 u32 fd_flags,
> -				 u64 heap_flags)
> +				 u32 fd_flags, u64 heap_flags,
> +				 struct mem_cgroup *charge_to)
>  {
>  	struct dma_buf *dmabuf;
> +	unsigned int nr_pages;
> +	struct mem_cgroup *memcg = charge_to;
>  	int fd;
>  
>  	/*
> @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
>  	if (IS_ERR(dmabuf))
>  		return PTR_ERR(dmabuf);
>  
> +	nr_pages = len / PAGE_SIZE;
> +
> +	if (memcg)
> +		css_get(&memcg->css);
> +	else if (mem_accounting)
> +		memcg = get_mem_cgroup_from_mm(current->mm);
> +
> +	if (memcg) {
> +		if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> +			mem_cgroup_put(memcg);
> +			dma_buf_put(dmabuf);
> +			return -ENOMEM;
> +		}
> +		dmabuf->memcg = memcg;
> +	}
> +
>  	fd = dma_buf_fd(dmabuf, fd_flags);
>  	if (fd < 0) {
>  		dma_buf_put(dmabuf);
> @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
>  {
>  	struct dma_heap_allocation_data *heap_allocation = data;
>  	struct dma_heap *heap = file->private_data;
> +	struct mem_cgroup *memcg = NULL;
> +	struct task_struct *task;
> +	unsigned int pidfd_flags;
>  	int fd;
>  
>  	if (heap_allocation->fd)
> @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
>  	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
>  		return -EINVAL;
>  
> +	if (heap_allocation->charge_pid_fd) {
> +		task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> +		if (IS_ERR(task))
> +			return PTR_ERR(task);
> +
> +		memcg = get_mem_cgroup_from_mm(task->mm);
> +		put_task_struct(task);
> +	}
> +
>  	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
>  				   heap_allocation->fd_flags,
> -				   heap_allocation->heap_flags);
> +				   heap_allocation->heap_flags,
> +				   memcg);
> +	mem_cgroup_put(memcg);
>  	if (fd < 0)
>  		return fd;
>  
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index 03c2b87cb1112..95d7688167b93 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
>  		if (max_order < orders[i])
>  			continue;
>  		flags = order_flags[i];
> -		if (mem_accounting)
> -			flags |= __GFP_ACCOUNT;
>  		page = alloc_pages(flags, orders[i]);
>  		if (!page)
>  			continue;
> diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> index a4cf716a49fa6..e02b0f8cbc6a1 100644
> --- a/include/uapi/linux/dma-heap.h
> +++ b/include/uapi/linux/dma-heap.h
> @@ -29,6 +29,10 @@
>   *			handle to the allocated dma-buf
>   * @fd_flags:		file descriptor flags used when allocating
>   * @heap_flags:		flags passed to heap
> + * @charge_pid_fd:	optional pidfd of the process whose cgroup should be
> + *			charged for this allocation; 0 means charge the calling
> + *			process's cgroup
> + * @__padding:		reserved, must be zero
>   *
>   * Provided by userspace as an argument to the ioctl
>   */
> @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
>  	__u32 fd;
>  	__u32 fd_flags;
>  	__u64 heap_flags;
> +	__u32 charge_pid_fd;
> +	__u32 __padding;
>  };
>  
>  #define DMA_HEAP_IOC_MAGIC		'H'
> 



* Re: [PATCH v3 6/7] tomoyo: Convert from sb_mount to granular mount hooks
From: Tetsuo Handa @ 2026-05-12 11:01 UTC (permalink / raw)
  To: Song Liu, linux-security-module, linux-fsdevel
  Cc: paul, jmorris, serge, viro, brauner, jack, john.johansen,
	stephen.smalley.work, omosnace, mic, gnoack, takedakn, herton,
	kernel-team, selinux, apparmor
In-Reply-To: <20260509015208.3853132-7-song@kernel.org>

On 2026/05/09 10:52, Song Liu wrote:
> Replace tomoyo_sb_mount() with granular mount hooks. Each hook
> reconstructs the MS_* flags expected by tomoyo_mount_permission()
> using the original flags parameter where available.

Please fold below diff into this patch. Then,

Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 security/tomoyo/tomoyo.c | 60 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index ac84e1f03d5e..c93d000acc95 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -400,6 +400,15 @@ static int tomoyo_path_chroot(const struct path *path)
 	return tomoyo_path_perm(TOMOYO_TYPE_CHROOT, path, NULL);
 }
 
+/**
+ * tomoyo_mount_bind - Target for security_mount_bind().
+ *
+ * @from:    Pointer to "struct path".
+ * @to:      Pointer to "struct path".
+ * @recurse: Whether recursive bind mount or not.
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
 static int tomoyo_mount_bind(const struct path *from, const struct path *to,
 			     bool recurse)
 {
@@ -408,6 +417,17 @@ static int tomoyo_mount_bind(const struct path *from, const struct path *to,
 	return tomoyo_mount_permission(NULL, to, NULL, flags, from);
 }
 
+/**
+ * tomoyo_mount_new - Target for security_mount_new().
+ *
+ * @fc:        Pointer to "struct fs_context".
+ * @mp:        Pointer to "struct path".
+ * @mnt_flags: Mount options.
+ * @flags:     Original mount options.
+ * @data:      Optional data. May be NULL.
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
 static int tomoyo_mount_new(struct fs_context *fc, const struct path *mp,
 			    int mnt_flags, unsigned long flags, void *data)
 {
@@ -416,6 +436,17 @@ static int tomoyo_mount_new(struct fs_context *fc, const struct path *mp,
 				       flags, NULL);
 }
 
+/**
+ * tomoyo_mount_remount - Target for security_mount_remount().
+ *
+ * @fc:        Pointer to "struct fs_context".
+ * @mp:        Pointer to "struct path".
+ * @mnt_flags: Mount options.
+ * @flags:     Original mount options.
+ * @data:      Optional data. Maybe NULL.
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
 static int tomoyo_mount_remount(struct fs_context *fc, const struct path *mp,
 				int mnt_flags, unsigned long flags, void *data)
 {
@@ -423,6 +454,15 @@ static int tomoyo_mount_remount(struct fs_context *fc, const struct path *mp,
 	return tomoyo_mount_permission(NULL, mp, NULL, flags, NULL);
 }
 
+/**
+ * tomoyo_mount_reconfigure - Target for security_mount_reconfigure().
+ *
+ * @mp:        Pointer to "struct path".
+ * @mnt_flags: Mount options.
+ * @flags:     Original mount options.
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
 static int tomoyo_mount_reconfigure(const struct path *mp,
 				    unsigned int mnt_flags,
 				    unsigned long flags)
@@ -431,12 +471,28 @@ static int tomoyo_mount_reconfigure(const struct path *mp,
 	return tomoyo_mount_permission(NULL, mp, NULL, flags, NULL);
 }
 
+/**
+ * tomoyo_mount_change_type - Target for security_mount_change_type().
+ *
+ * @mp:       Pointer to "struct path".
+ * @ms_flags: Mount options.
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
 static int tomoyo_mount_change_type(const struct path *mp, int ms_flags)
 {
 	return tomoyo_mount_permission(NULL, mp, NULL, ms_flags, NULL);
 }
 
-static int tomoyo_move_mount(const struct path *from_path,
+/**
+ * tomoyo_mount_move - Target for security_mount_move().
+ *
+ * @from_path: Pointer to "struct path".
+ * @to_path:   Pointer to "struct path".
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+static int tomoyo_mount_move(const struct path *from_path,
 			     const struct path *to_path)
 {
 	return tomoyo_mount_permission(NULL, to_path, NULL, MS_MOVE,
@@ -609,7 +665,7 @@ static struct security_hook_list tomoyo_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(mount_remount, tomoyo_mount_remount),
 	LSM_HOOK_INIT(mount_reconfigure, tomoyo_mount_reconfigure),
 	LSM_HOOK_INIT(mount_change_type, tomoyo_mount_change_type),
-	LSM_HOOK_INIT(mount_move, tomoyo_move_mount),
+	LSM_HOOK_INIT(mount_move, tomoyo_mount_move),
 	LSM_HOOK_INIT(sb_umount, tomoyo_sb_umount),
 	LSM_HOOK_INIT(sb_pivotroot, tomoyo_sb_pivotroot),
 	LSM_HOOK_INIT(socket_bind, tomoyo_socket_bind),
-- 
2.47.3




* Re: [PATCH v3 6/7] tomoyo: Convert from sb_mount to granular mount hooks
From: Paul Moore @ 2026-05-12 13:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Song Liu, linux-security-module, linux-fsdevel, jmorris, serge,
	viro, brauner, jack, john.johansen, stephen.smalley.work,
	omosnace, mic, gnoack, takedakn, herton, kernel-team, selinux,
	apparmor
In-Reply-To: <42a9075e-a4b4-4eb7-b96e-48e5c0cd2f3a@I-love.SAKURA.ne.jp>

On Tue, May 12, 2026 at 7:03 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> On 2026/05/09 10:52, Song Liu wrote:
> > Replace tomoyo_sb_mount() with granular mount hooks. Each hook
> > reconstructs the MS_* flags expected by tomoyo_mount_permission()
> > using the original flags parameter where available.
>
> Please fold the diff below into this patch. Then,
>
> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  security/tomoyo/tomoyo.c | 60 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 58 insertions(+), 2 deletions(-)

Thanks Tetsuo.

Song, assuming you have no objections to the comment blocks, please
fold in Tetsuo's additions in your next version and add his ACK to
this patch.

-- 
paul-moore.com


* Re: [PATCH v3 6/7] tomoyo: Convert from sb_mount to granular mount hooks
From: Song Liu @ 2026-05-12 18:07 UTC (permalink / raw)
  To: Paul Moore
  Cc: Tetsuo Handa, linux-security-module, linux-fsdevel, jmorris,
	serge, viro, brauner, jack, john.johansen, stephen.smalley.work,
	omosnace, mic, gnoack, takedakn, herton, kernel-team, selinux,
	apparmor
In-Reply-To: <CAHC9VhT9vvaoYpRX4fPZ_H13+PaqG72CpRbS+d=9xgMBaKHo8w@mail.gmail.com>

On Tue, May 12, 2026 at 6:32 AM Paul Moore <paul@paul-moore.com> wrote:
>
> On Tue, May 12, 2026 at 7:03 AM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > On 2026/05/09 10:52, Song Liu wrote:
> > > Replace tomoyo_sb_mount() with granular mount hooks. Each hook
> > > reconstructs the MS_* flags expected by tomoyo_mount_permission()
> > > using the original flags parameter where available.
> >
> > Please fold the diff below into this patch. Then,
> >
> > Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  security/tomoyo/tomoyo.c | 60 ++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 58 insertions(+), 2 deletions(-)
>
> Thanks Tetsuo.
>
> Song, assuming you have no objections to the comment blocks, please
> fold in Tetsuo's additions in your next version and add his ACK to
> this patch.

Updated 6/7 with these changes. Thanks to both of you!

Song


* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-12 18:53 UTC (permalink / raw)
  To: Christian König
  Cc: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Sumit Semwal, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, Christian Brauner,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <8ef38815-6ae9-4359-86d4-042554357639@amd.com>

On Tue, May 12, 2026 at 3:14 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/12/26 11:10, Albert Esteve wrote:
> > On embedded platforms a central process often allocates dma-buf
> > memory on behalf of client applications. Without a way to
> > attribute the charge to the requesting client's cgroup, the
> > cost lands on the allocator, making per-cgroup memory limits
> > ineffective for the actual consumers.
> >
> > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > the mem_accounting module parameter enabled, the buffer is charged
> > to the allocator's own cgroup.
> >
> > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > all accounting through a single MEMCG_DMABUF path.
> >
> > Usage examples:
> >
> >   1. Central allocator charging to a client at allocation time.
> >      The allocator knows the client's PID (e.g., from binder's
> >      sender_pid) and uses pidfd to attribute the charge:
> >
> >        pid_t client_pid = txn->sender_pid;
> >        int pidfd = pidfd_open(client_pid, 0);
> >
> >        struct dma_heap_allocation_data alloc = {
> >            .len             = buffer_size,
> >            .fd_flags        = O_RDWR | O_CLOEXEC,
> >            .charge_pid_fd   = pidfd,
> >        };
> >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >        close(pidfd);
> >        /* alloc.fd is now charged to client's cgroup */
> >
> >   2. Default allocation (no pidfd, mem_accounting=1).
> >      When charge_pid_fd is not set and the mem_accounting module
> >      parameter is enabled, the buffer is charged to the allocator's
> >      own cgroup:
> >
> >        struct dma_heap_allocation_data alloc = {
> >            .len      = buffer_size,
> >            .fd_flags = O_RDWR | O_CLOEXEC,
> >        };
> >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >        /* charged to current process's cgroup */
> >
> > Current limitations:
> >
> >  - Single-owner model: a dma-buf carries one memcg charge regardless of
> >    how many processes share it. This means only the first owner (and
> >    exporter) of the shared buffer bears the charge.
> >  - Only memcg accounting supported. While this makes sense for system
> >    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> >    charging also for the dmem controller.
>
> Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.

Yeah I think this might work. I know of 3 cases, and it trivially
solves the first two. The third requires some work on our end to
extend our userspace interfaces to include the pidfd but it seems
doable. I'm checking with our graphics folks.

1) Direct allocation from user (e.g. app -> allocation ioctl on
/dev/dma_heap/foo)
No changes required to userspace. mem_accounting=1 charges the app.

2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
-> gralloc)
gralloc has the caller's pid as described in the commit message. Open
a pidfd and pass it in the dma_heap_allocation_data.

3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
SurfaceFlinger -> gralloc)
In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
we need to add the app's pidfd to the SurfaceFlinger -> gralloc
interface, or transfer the memcg charge from SurfaceFlinger to the app
after the allocation.
It'd be nice to avoid the charge transfer option entirely, but if we
need it that doesn't seem so bad in this case because it's a bulk
charge for the entire dmabuf rather than per-page. So the exporter
doesn't need to get involved (we wouldn't need a new dma_buf_op) and
we wouldn't have to worry about looping and locking for each page.
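
For case 2, the gralloc side could look roughly like the sketch below.
It is only an illustration under stated assumptions: the struct is a
local mirror of the layout proposed in this RFC (it is not in any
released uapi header), all sketch_* names are made up, and it assumes
the extended struct keeps the 'H' magic and command number 0 of the
existing allocation ioctl. pidfd_open() is called through syscall(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/ioctl.h>

/* Hypothetical local mirror of the struct layout proposed in the RFC. */
struct sketch_heap_alloc {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd;	/* 0 means "charge the caller" in the RFC */
	uint32_t padding;	/* reserved, must be zero */
};

/* Two u32 fields appended to the existing 24-byte struct. */
static_assert(sizeof(struct sketch_heap_alloc) == 32, "unexpected layout");

/* Assumption: same magic ('H') and command number as the existing ioctl. */
#define SKETCH_HEAP_IOCTL_ALLOC _IOWR('H', 0x0, struct sketch_heap_alloc)

/*
 * Allocate len bytes from heap_fd and ask the kernel to charge the
 * client's cgroup. Returns the dma-buf fd, or a negative errno.
 */
static int sketch_alloc_for_client(int heap_fd, pid_t client, uint64_t len)
{
	struct sketch_heap_alloc req = {
		.len = len,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	int pidfd, ret;

	pidfd = (int)syscall(SYS_pidfd_open, client, 0);
	if (pidfd < 0)
		return -errno;

	req.charge_pid_fd = (uint32_t)pidfd;
	ret = ioctl(heap_fd, SKETCH_HEAP_IOCTL_ALLOC, &req) ? -errno : (int)req.fd;
	close(pidfd);	/* the pidfd is only needed for the duration of the ioctl */
	return ret;
}
```

For case 3, the same helper would be called with the app's pid instead
of SurfaceFlinger's, which is the part that needs the interface change
described above.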

> I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive, etc.
>
> Essentially the problem boils down to two limitations:
> 1) a piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups

Yup, memcg already has this problem with pagecache and shmem.

> 2) when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
>
> The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases that use file descriptors as well, e.g. memfd, accel and GPU drivers, etc.
>
> On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps.

I have a question about this part. Albert, I guess you are interested
only in accounting dmabuf-heap allocations, or do you expect to add
__GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
non-dmabuf-heap exporters?

> On the GPU side, I saw yet another attempt at driver-specific accounting to solve this just a few weeks ago. To be honest, such single-driver island approaches tend to break more often than they work correctly.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
> >  drivers/dma-buf/dma-buf.c               | 16 ++++---------
> >  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
> >  drivers/dma-buf/heaps/system_heap.c     |  2 --
> >  include/uapi/linux/dma-heap.h           |  6 +++++
> >  5 files changed, 53 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 8bdbc2e866430..824d269531eb1 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> >               structures.
> >
> >         dmabuf (npn)
> > -             Amount of memory used for exported DMA buffers allocated by the cgroup.
> > -             Stays with the allocating cgroup regardless of how the buffer is shared.
> > +             Amount of memory used for exported DMA buffers allocated by or on
> > +             behalf of the cgroup. Stays with the allocating cgroup regardless
> > +             of how the buffer is shared.
> >
> >         workingset_refault_anon
> >               Number of refaults of previously evicted anonymous pages.
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index ce02377f48908..23fb758b78297 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> >        */
> >       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> >
> > -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > -     mem_cgroup_put(dmabuf->memcg);
> > +     if (dmabuf->memcg) {
> > +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> > +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > +             mem_cgroup_put(dmabuf->memcg);
> > +     }
> >
> >       dmabuf->ops->release(dmabuf);
> >
> > @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >               dmabuf->resv = resv;
> >       }
> >
> > -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> > -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> > -                                   GFP_KERNEL)) {
> > -             ret = -ENOMEM;
> > -             goto err_memcg;
> > -     }
> > -
> >       file->private_data = dmabuf;
> >       file->f_path.dentry->d_fsdata = dmabuf;
> >       dmabuf->file = file;
> > @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >
> >       return dmabuf;
> >
> > -err_memcg:
> > -     mem_cgroup_put(dmabuf->memcg);
> >  err_file:
> >       fput(file);
> >  err_module:
> > diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> > index ac5f8685a6494..ff6e259afcdc0 100644
> > --- a/drivers/dma-buf/dma-heap.c
> > +++ b/drivers/dma-buf/dma-heap.c
> > @@ -7,13 +7,17 @@
> >   */
> >
> >  #include <linux/cdev.h>
> > +#include <linux/cgroup.h>
> >  #include <linux/device.h>
> >  #include <linux/dma-buf.h>
> >  #include <linux/dma-heap.h>
> > +#include <linux/memcontrol.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/err.h>
> >  #include <linux/export.h>
> >  #include <linux/list.h>
> >  #include <linux/nospec.h>
> > +#include <linux/pidfd.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/xarray.h>
> > @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> >                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> >
> >  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > -                              u32 fd_flags,
> > -                              u64 heap_flags)
> > +                              u32 fd_flags, u64 heap_flags,
> > +                              struct mem_cgroup *charge_to)
> >  {
> >       struct dma_buf *dmabuf;
> > +     unsigned int nr_pages;
> > +     struct mem_cgroup *memcg = charge_to;
> >       int fd;
> >
> >       /*
> > @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> >       if (IS_ERR(dmabuf))
> >               return PTR_ERR(dmabuf);
> >
> > +     nr_pages = len / PAGE_SIZE;
> > +
> > +     if (memcg)
> > +             css_get(&memcg->css);
> > +     else if (mem_accounting)
> > +             memcg = get_mem_cgroup_from_mm(current->mm);
> > +
> > +     if (memcg) {
> > +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> > +                     mem_cgroup_put(memcg);
> > +                     dma_buf_put(dmabuf);
> > +                     return -ENOMEM;
> > +             }
> > +             dmabuf->memcg = memcg;
> > +     }
> > +
> >       fd = dma_buf_fd(dmabuf, fd_flags);
> >       if (fd < 0) {
> >               dma_buf_put(dmabuf);
> > @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >  {
> >       struct dma_heap_allocation_data *heap_allocation = data;
> >       struct dma_heap *heap = file->private_data;
> > +     struct mem_cgroup *memcg = NULL;
> > +     struct task_struct *task;
> > +     unsigned int pidfd_flags;
> >       int fd;
> >
> >       if (heap_allocation->fd)
> > @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> >               return -EINVAL;
> >
> > +     if (heap_allocation->charge_pid_fd) {
> > +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> > +             if (IS_ERR(task))
> > +                     return PTR_ERR(task);
> > +
> > +             memcg = get_mem_cgroup_from_mm(task->mm);
> > +             put_task_struct(task);
> > +     }
> > +
> >       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> >                                  heap_allocation->fd_flags,
> > -                                heap_allocation->heap_flags);
> > +                                heap_allocation->heap_flags,
> > +                                memcg);
> > +     mem_cgroup_put(memcg);
> >       if (fd < 0)
> >               return fd;
> >
> > diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> > index 03c2b87cb1112..95d7688167b93 100644
> > --- a/drivers/dma-buf/heaps/system_heap.c
> > +++ b/drivers/dma-buf/heaps/system_heap.c
> > @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> >               if (max_order < orders[i])
> >                       continue;
> >               flags = order_flags[i];
> > -             if (mem_accounting)
> > -                     flags |= __GFP_ACCOUNT;
> >               page = alloc_pages(flags, orders[i]);
> >               if (!page)
> >                       continue;
> > diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> > index a4cf716a49fa6..e02b0f8cbc6a1 100644
> > --- a/include/uapi/linux/dma-heap.h
> > +++ b/include/uapi/linux/dma-heap.h
> > @@ -29,6 +29,10 @@
> >   *                   handle to the allocated dma-buf
> >   * @fd_flags:                file descriptor flags used when allocating
> >   * @heap_flags:              flags passed to heap
> > + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
> > + *                   charged for this allocation; 0 means charge the calling
> > + *                   process's cgroup
> > + * @__padding:               reserved, must be zero
> >   *
> >   * Provided by userspace as an argument to the ioctl
> >   */
> > @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> >       __u32 fd;
> >       __u32 fd_flags;
> >       __u64 heap_flags;
> > +     __u32 charge_pid_fd;
> > +     __u32 __padding;
> >  };
> >
> >  #define DMA_HEAP_IOC_MAGIC           'H'
> >
>


* Re: [BUG] lsm= with bpf before selinux breaks fscreate with EINVAL
From: Paul Moore @ 2026-05-12 19:17 UTC (permalink / raw)
  To: Vitaly Chikunov
  Cc: linux-security-module, bpf, selinux, KP Singh, Matt Bobrowski,
	Stephen Smalley, Ondrej Mosnacek, linux-kernel
In-Reply-To: <agJajS11YK1XGB-y@altlinux.org>

On Mon, May 11, 2026 at 6:43 PM Vitaly Chikunov <vt@altlinux.org> wrote:
> On Tue, May 12, 2026 at 12:54:21AM +0300, Vitaly Chikunov wrote:
> > On Mon, May 11, 2026 at 05:49:39PM -0400, Paul Moore wrote:
> > > On Mon, May 11, 2026 at 5:03 PM Vitaly Chikunov <vt@altlinux.org> wrote:
> > > > On Mon, May 11, 2026 at 04:19:34PM -0400, Paul Moore wrote:
> > > > > On Sun, May 10, 2026 at 5:17 PM Vitaly Chikunov <vt@altlinux.org> wrote:

...

> > > The patch below is what I had in mind (although be warned that was
> > > just a cut-n-paste into this email so it is likely whitespace
> > > damaged).  If you are able to give that a test it would be great, if
> > > not, I can throw it on the todo pile.
> > >
> > > diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> > > index 2b8dfb35caed..12724e259900 100644
> > > --- a/include/linux/lsm_hook_defs.h
> > > +++ b/include/linux/lsm_hook_defs.h
> > > @@ -298,9 +298,9 @@ LSM_HOOK(int, -EOPNOTSUPP, getselfattr, unsigned int attr,
> > >         struct lsm_ctx __user *ctx, u32 *size, u32 flags)
> > > LSM_HOOK(int, -EOPNOTSUPP, setselfattr, unsigned int attr,
> > >         struct lsm_ctx *ctx, u32 size, u32 flags)
> > > -LSM_HOOK(int, -EINVAL, getprocattr, struct task_struct *p, const char *name,
> > > +LSM_HOOK(int, 0, getprocattr, struct task_struct *p, const char *name,
> > >         char **value)
> > > -LSM_HOOK(int, -EINVAL, setprocattr, const char *name, void *value, size_t size)
> > > +LSM_HOOK(int, 0, setprocattr, const char *name, void *value, size_t size)
> > > LSM_HOOK(int, 0, ismaclabel, const char *name)
> > > LSM_HOOK(int, -EOPNOTSUPP, secid_to_secctx, u32 secid, struct lsm_context *cp)
> > > LSM_HOOK(int, -EOPNOTSUPP, lsmprop_to_secctx, struct lsm_prop *prop,
> >
> > We will test it and report, but this may take some time.
>
> Before trying the full system boot test, I tried the reproducer I posted
> before, with this patch applied (included here just to make sure it's
> correct) over v6.12.87:
>
> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> index 9eca013aa5e1..b38f6194699b 100644
> --- a/include/linux/lsm_hook_defs.h
> +++ b/include/linux/lsm_hook_defs.h
> @@ -288,9 +288,9 @@ LSM_HOOK(int, -EOPNOTSUPP, getselfattr, unsigned int attr,
>           struct lsm_ctx __user *ctx, u32 *size, u32 flags)
>  LSM_HOOK(int, -EOPNOTSUPP, setselfattr, unsigned int attr,
>           struct lsm_ctx *ctx, u32 size, u32 flags)
> -LSM_HOOK(int, -EINVAL, getprocattr, struct task_struct *p, const char *name,
> +LSM_HOOK(int, 0, getprocattr, struct task_struct *p, const char *name,
>           char **value)
> -LSM_HOOK(int, -EINVAL, setprocattr, const char *name, void *value, size_t size)
> +LSM_HOOK(int, 0, setprocattr, const char *name, void *value, size_t size)
>  LSM_HOOK(int, 0, ismaclabel, const char *name)
>  LSM_HOOK(int, -EOPNOTSUPP, secid_to_secctx, u32 secid, char **secdata,
>           u32 *seclen)
>
> 1. `cat /proc/thread-self/attr/current` does not report `kernel` as before.
> 2. The `echo > /proc/thread-self/attr/fscreate` process hangs in R state, with
> strace showing an infinite loop of
>   write(1, "\n", 1)                       = 0
>   write(1, "\n", 1)                       = 0
>   write(1, "\n", 1)                       = 0

Bummer, I was worried userspace would be expecting something, but wasn't sure.

Thanks for giving that a test, it looks like we'll need some special
handling for these hooks (which is okay, you'll see they already have
special handling if you look at the code).  In the meantime the
workaround would be to place the BPF LSM after SELinux in your LSM
ordering.
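
To make the failure mode concrete, here is a toy userspace model of the
dispatch; it is not the kernel code. It assumes that the first LSM in
the configured order that registers the hook answers the call, and that
an LSM with no real handler behind it (here standing in for the BPF LSM
with no program attached) simply returns the hook's declared default.
All toy_* names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_lsm {
	const char *name;
	/* NULL models a registered stub that returns the hook default. */
	int (*setprocattr)(const char *attr, const void *value, size_t size);
};

/* Stand-in for an LSM that really handles the write and consumes it. */
static int toy_selinux_setprocattr(const char *attr, const void *value,
				   size_t size)
{
	(void)attr;
	(void)value;
	return (int)size;
}

/* Assumed dispatch rule: the first LSM in the order answers the call. */
static int toy_setprocattr(const struct toy_lsm *order, size_t n,
			   int hook_default, const char *attr,
			   const void *value, size_t size)
{
	size_t i;

	assert(attr != NULL);
	for (i = 0; i < n; i++) {
		if (order[i].setprocattr)
			return order[i].setprocattr(attr, value, size);
		return hook_default;
	}
	return hook_default;
}
```

In this model, ordering the stub first yields -EINVAL with the current
default. With the default changed to 0 it yields 0, which a write(2)
caller sees as "zero bytes written" and may retry forever, which would
match the strace output above. Ordering the real handler first returns
the byte count in both cases.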

-- 
paul-moore.com


* Re: [PATCH 1/3] cgroup/cpuset: Fix deadline bandwidth leak in cpuset_can_attach()
From: Aaron Tomlin @ 2026-05-12 19:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: tsbogend, paul, jmorris, serge, mingo, peterz, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, tj, hannes, mkoutny,
	chenridong, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, omosnace, kees, neelx, sean, chjohnst, steve,
	mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <354af9fc-1c70-4ee4-a0ff-8821bebec7b8@redhat.com>

On Mon, May 11, 2026 at 01:54:37PM -0400, Waiman Long wrote:
> Yes, it does look like the AI feedback is valid. I will take a further look
> into this.

Hi,

As promised [1], please find my suggested patch here [2].

[1]: https://lore.kernel.org/lkml/dif4xi73znyz3diguyxihzztgosvyj3bjeh3y3oidg4gnt2qpv@5nygeq3rk333/
[2]: https://lore.kernel.org/lkml/20260512010341.101419-1-atomlin@atomlin.com/

Kind regards,
-- 
Aaron Tomlin


* Re: [PATCH v2 2/3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Aaron Tomlin @ 2026-05-12 19:48 UTC (permalink / raw)
  To: Paul Moore
  Cc: tsbogend, jmorris, serge, mingo, peterz, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, longman, tj, hannes,
	mkoutny, chenridong, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, omosnace, kees, neelx, sean, chjohnst,
	steve, mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <CAHC9VhQthq7y2akbQSdJwBEex1MQYWG49wcJK3b8gSQuQ_d1cQ@mail.gmail.com>

On Mon, May 11, 2026 at 04:28:09PM -0400, Paul Moore wrote:
[ ... ]
> > Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> > ---
> >  arch/mips/kernel/mips-mt-fpaff.c | 30 +++++++++++++++++-------------
> >  fs/proc/base.c                   |  2 +-
> >  include/linux/lsm_hook_defs.h    |  3 ++-
> >  include/linux/security.h         | 11 +++++++----
> >  kernel/cgroup/cpuset.c           |  4 ++--
> >  kernel/sched/syscalls.c          |  4 ++--
> >  security/commoncap.c             |  7 +++++--
> >  security/security.c              | 11 ++++++-----
> >  security/selinux/hooks.c         |  3 ++-
> >  security/smack/smack_lsm.c       | 11 +++++++++--
> >  10 files changed, 53 insertions(+), 33 deletions(-)
> 
> I haven't looked too closely at this patch yet, but based on a quick
> glance, can you help me understand why it is included with the other
> two patches in one patchset?  The other two patches look like stable
> level kernel bug fixes, while this patch introduces functionality to
> an existing LSM hook; one of these is not like the others :)
> 
> Unless there is something critical that I'm missing here, I would
> suggest splitting this patch out from the other two bugfixes for
> separate handling.  If there is a patch dependency issue you can
> always mention that in the cover letter.

Hi Paul,

Thank you for taking the time to have a look.

You raise a perfectly valid point.

Please note, the cgroup-related BUG fix will be dropped from the next
iteration of this series. As per Waiman Long (on Cc), a solution for the
BUG was already proposed here [1].

However, I suspect the MIPS-related patch will need to remain coupled with
this feature patch. Because the first patch fundamentally alters the
signature of the security_task_setscheduler() hook, the MIPS FPU affinity
code must be updated concurrently to accommodate the new parameter.
Separating them into entirely different series would likely invite
bisection breakages or awkward merge conflicts, depending on the order in
which they are applied, no?

If this approach sounds sensible to you, I shall prepare a v2 series
reflecting this restructured grouping.

Please let me know your thoughts.

[1]: https://lore.kernel.org/lkml/20260509102031.97608-2-zhangguopeng@kylinos.cn/

Kind regards,
-- 
Aaron Tomlin


* [PATCH v1 2/2] selftests/landlock: Increase default audit socket timeout
From: Mickaël Salaün @ 2026-05-13 10:51 UTC (permalink / raw)
  Cc: Mickaël Salaün, Günther Noack, Kees Cook,
	Shuah Khan, Thomas Weißschuh, kernel test robot,
	linux-kernel, linux-kselftest, linux-security-module, lkp, oe-lkp,
	stable, Günther Noack
In-Reply-To: <20260513105112.140137-1-mic@digikod.net>

matches_log_fs() and other audit_match_record() callers intermittently
return -EAGAIN under heavy debug configs (KASAN, lockdep).  The audit
record delivery pipeline is asynchronous: landlock_log_denial() queues
the record to audit_queue, and kauditd_thread dequeues and delivers via
netlink.  Under debug configs, kauditd scheduling between
audit_log_end() and netlink_unicast() can exceed a syscall round trip
(more than 1 usec), which was the value of the socket timeout used for
the recvfrom() calls.

The observed failure [1] is an EAGAIN error code (-11) which means that
the access record had not arrived within the 1 usec timeout of
recvfrom().  The expected record does arrive, but only after
matches_log_fs() has already returned.  It is then consumed by a later
audit_count_records() call, making records.access == 1 instead of 0.

Switch the default socket timeout to the slow value (1 second) so all
audit_match_record() callers wait long enough for kauditd delivery, and
lower it to the fast value (1 usec) only on the two paths that expect no
record: audit_count_records() and the expected_domain_id == 0 probe in
matches_log_domain_deallocated().  audit_init() drains stale records
with the fast timeout (terminating on -EAGAIN once the backlog is empty)
and switches to the patient default before returning.  1 second gives
~10x margin over the observed maximum (~100 ms, while the happy path is
~23 us).

Rename the timeval constants to reflect their new roles:
- audit_tv_dom_drop (1 second) -> audit_tv_default: default socket
  timeout, patient enough for asynchronous kauditd delivery.
- audit_tv_default (1 usec) -> audit_tv_fast: fast timeout for paths
  that expect no record (drain, audit_count_records(), probes).

Invert the conditional in matches_log_domain_deallocated().  Check
setsockopt returns on both the lower and restore paths; preserve the
first error via !err when the restore fails after a prior error so the
actionable return code is not masked by a bookkeeping failure.
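
The lower-then-restore pattern can be sketched in isolation as follows.
This is a simplified illustration and not the code in audit.h: the
sketch_* names are hypothetical, and it works on any socket that
honours SO_RCVTIMEO rather than on an audit netlink socket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Patient default and fast probe timeouts, mirroring the renamed constants. */
static const struct timeval sketch_tv_default = { .tv_sec = 1 };
static const struct timeval sketch_tv_fast = { .tv_usec = 1 };

static int sketch_set_rcv_timeout(int fd, const struct timeval *tv)
{
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, tv, sizeof(*tv)))
		return -errno;
	return 0;
}

/*
 * Probe for a message that may or may not be pending. Returns the number
 * of bytes read, -EAGAIN when nothing arrived within the fast timeout, or
 * another negative errno.
 */
static int sketch_probe(int fd, char *buf, size_t len)
{
	int err, restore_err;
	ssize_t n;

	assert(len > 0);

	err = sketch_set_rcv_timeout(fd, &sketch_tv_fast);
	if (err)
		return err;

	n = recv(fd, buf, len, 0);
	err = n < 0 ? -errno : (int)n;

	/* Always restore the default; only report the restore failure when
	 * there is no earlier error to preserve. */
	restore_err = sketch_set_rcv_timeout(fd, &sketch_tv_default);
	if (restore_err && err >= 0)
		err = restore_err;

	return err;
}
```

Keeping the first error means a caller that sees -EAGAIN can still tell
"no record was pending" apart from "the socket could not be
reconfigured".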

Cc: Günther Noack <gnoack@google.com>
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Cc: stable@vger.kernel.org
Depends-on: 07c2572a8757 ("selftests/landlock: Skip stale records in audit_match_record()")
Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs")
Reported-by: Günther Noack <gnoack3000@gmail.com>
Closes: https://lore.kernel.org/r/20260402.eb5c4e85f472@gnoack.org [1]
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605111649.a8b30a62-lkp@intel.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
 tools/testing/selftests/landlock/audit.h | 80 +++++++++++++++++++-----
 1 file changed, 63 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/landlock/audit.h b/tools/testing/selftests/landlock/audit.h
index 699aed5ffab4..936fe20f020e 100644
--- a/tools/testing/selftests/landlock/audit.h
+++ b/tools/testing/selftests/landlock/audit.h
@@ -45,17 +45,25 @@ struct audit_message {
 	};
 };
 
-static const struct timeval audit_tv_dom_drop = {
+static const struct timeval audit_tv_default = {
 	/*
-	 * Because domain deallocation is tied to asynchronous credential
-	 * freeing, receiving such event may take some time.  In practice,
-	 * on a small VM, it should not exceed 100k usec, but let's wait up
-	 * to 1 second to be safe.
+	 * Default socket timeout for audit_match_record() callers that expect a
+	 * record to arrive.  Asynchronous kauditd delivery can exceed 1 usec
+	 * under heavy debug configs (KASAN, lockdep), where kauditd_thread
+	 * scheduling between audit_log_end() and netlink_unicast() takes longer
+	 * than the previous 1 usec timeout. 1 second is a generous ceiling: on
+	 * the happy path, kauditd delivers within dozens of usec.
 	 */
 	.tv_sec = 1,
 };
 
-static const struct timeval audit_tv_default = {
+static const struct timeval audit_tv_fast = {
+	/*
+	 * Fast timeout for paths that expect no record (audit_init() drain,
+	 * audit_count_records(), probes).  Causes audit_recv() to return
+	 * -EAGAIN once the socket buffer is empty, naturally terminating the
+	 * read loop.
+	 */
 	.tv_usec = 1,
 };
 
@@ -334,8 +342,13 @@ static int __maybe_unused matches_log_domain_allocated(int audit_fd, pid_t pid,
  * Matches a domain deallocation record.  When expected_domain_id is non-zero,
  * the pattern includes the specific domain ID so that stale deallocation
  * records from a previous test (with a different domain ID) are skipped by
- * audit_match_record(), and the socket timeout is temporarily increased to
- * audit_tv_dom_drop to wait for the asynchronous kworker deallocation.
+ * audit_match_record(), waiting for the asynchronous kworker deallocation with
+ * the default patient timeout.
+ *
+ * When expected_domain_id is zero, the caller is probing for any dealloc record
+ * that may or may not arrive.  Temporarily lowers the socket timeout to
+ * audit_tv_fast for this probe so it returns promptly when no record is
+ * pending; restores audit_tv_default after.
  */
 static int __maybe_unused
 matches_log_domain_deallocated(int audit_fd, unsigned int num_denials,
@@ -361,16 +374,21 @@ matches_log_domain_deallocated(int audit_fd, unsigned int num_denials,
 	if (log_match_len >= sizeof(log_match))
 		return -E2BIG;
 
-	if (expected_domain_id)
-		setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO,
-			   &audit_tv_dom_drop, sizeof(audit_tv_dom_drop));
+	if (!expected_domain_id) {
+		if (setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO,
+			       &audit_tv_fast, sizeof(audit_tv_fast)))
+			return -errno;
+	}
 
 	err = audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
 				 domain_id);
 
-	if (expected_domain_id)
-		setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_default,
-			   sizeof(audit_tv_default));
+	if (!expected_domain_id) {
+		if (setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO,
+			       &audit_tv_default, sizeof(audit_tv_default)) &&
+		    !err)
+			err = -errno;
+	}
 
 	return err;
 }
@@ -387,6 +405,11 @@ struct audit_records {
  * audit_init() and after the preceding audit_match_record() call.  Allocation
  * records are emitted synchronously during landlock_log_denial() in the current
  * test's syscall context, so only those are counted in records->domain.
+ *
+ * Temporarily lowers SO_RCVTIMEO to audit_tv_fast for the read loop: this is a
+ * "no record expected" path that should terminate on the first -EAGAIN.  The
+ * default patient timeout is restored on exit for subsequent
+ * audit_match_record() callers.
  */
 static int audit_count_records(int audit_fd, struct audit_records *records)
 {
@@ -403,6 +426,12 @@ static int audit_count_records(int audit_fd, struct audit_records *records)
 	records->access = 0;
 	records->domain = 0;
 
+	if (setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_fast,
+		       sizeof(audit_tv_fast))) {
+		err = -errno;
+		goto out;
+	}
+
 	do {
 		memset(&msg, 0, sizeof(msg));
 		err = audit_recv(audit_fd, &msg);
@@ -429,6 +458,10 @@ static int audit_count_records(int audit_fd, struct audit_records *records)
 	} while (true);
 
 out:
+	if (setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_default,
+		       sizeof(audit_tv_default)) &&
+	    !err)
+		err = -errno;
 	regfree(&dealloc_re);
 	return err;
 }
@@ -449,9 +482,9 @@ static int audit_init(void)
 	if (err)
 		goto err_close;
 
-	/* Sets a timeout for negative tests. */
-	err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_default,
-			 sizeof(audit_tv_default));
+	/* Uses the fast timeout to drain stale records below. */
+	err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_fast,
+			 sizeof(audit_tv_fast));
 	if (err) {
 		err = -errno;
 		goto err_close;
@@ -467,6 +500,19 @@ static int audit_init(void)
 	while (audit_recv(fd, NULL) == 0)
 		;
 
+	/*
+	 * Restores the default timeout for audit_match_record() callers that
+	 * expect a record to arrive.  Paths that expect no record set the
+	 * fast timeout locally (audit_count_records(), the expected_domain_id
+	 * == 0 probe in matches_log_domain_deallocated()).
+	 */
+	err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_default,
+			 sizeof(audit_tv_default));
+	if (err) {
+		err = -errno;
+		goto err_close;
+	}
+
 	return fd;
 
 err_close:
-- 
2.54.0


^ permalink raw reply related

* [PATCH v1 1/2] selftests/landlock: Filter dealloc records in audit_count_records()
From: Mickaël Salaün @ 2026-05-13 10:51 UTC (permalink / raw)
  Cc: Mickaël Salaün, Günther Noack, Kees Cook,
	Shuah Khan, Thomas Weißschuh, kernel test robot,
	linux-kernel, linux-kselftest, linux-security-module, lkp, oe-lkp,
	stable

audit_count_records() counts both AUDIT_LANDLOCK_DOMAIN allocation and
deallocation records in records.domain.  Domain deallocation is tied to
asynchronous credential freeing via kworker threads
(landlock_put_ruleset_deferred), so the dealloc record can arrive after
the drain in audit_init() and after the preceding audit_match_record()
call.  This causes flaky failures in tests that assert an exact
records.domain count: a stale dealloc record from a previous test's
domain inflates the count by one.

Observed on x86_64 under build configurations that delay the kworker
firing the dealloc callback (e.g. coverage instrumentation): the
audit_layout1 tests in fs_test.c intermittently saw records.domain == 2
where 1 was expected.  The fix is in the shared helper, so those
existing checks become robust without needing a fs_test.c edit.

Filter audit_count_records() with a regex to skip records containing
deallocation status.  The remaining domain records (allocation, emitted
synchronously during landlock_log_denial()) are deterministic.
Deallocation records are already tested explicitly via
matches_log_domain_deallocated() in audit_test.c, which uses its own
domain-ID-based filtering and longer timeout.

With this filter in place, re-add the records.domain == 0 checks that
were removed in commit 3647a4977fb7 ("selftests/landlock: Drain stale
audit records on init") as a workaround for this race.

Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Depends-on: 07c2572a8757 ("selftests/landlock: Skip stale records in audit_match_record()")
Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
 tools/testing/selftests/landlock/audit.h      | 39 ++++++++++++-------
 tools/testing/selftests/landlock/audit_test.c |  2 +
 .../testing/selftests/landlock/ptrace_test.c  |  1 +
 .../landlock/scoped_abstract_unix_test.c      |  1 +
 4 files changed, 30 insertions(+), 13 deletions(-)
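
Not part of the patch: a standalone sketch of the filtering idea from
the commit message, using POSIX regcomp()/regexec() on plain strings
instead of netlink audit messages.  The pattern is simplified (the
patch prepends REGEX_LANDLOCK_PREFIX), and count_non_dealloc() and the
sample records are hypothetical, written only to illustrate the
matching logic.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Illustrative records only; real audit records carry more fields. */
static const char *const sample_records[] = {
	"audit(0.000:1): domain=1a2b status=allocated mode=enforcing pid=42",
	"audit(0.000:2): domain=1a2b status=deallocated denials=1",
	"audit(0.000:3): domain=3c4d status=allocated mode=enforcing pid=43",
};

/*
 * Counts the records that are not domain deallocation records.
 * Returns the count, or -1 if the pattern cannot be compiled or
 * regexec() reports an error other than REG_NOMATCH.
 */
static int count_non_dealloc(const char *const records[], size_t n)
{
	/* Simplified: no REGEX_LANDLOCK_PREFIX in this sketch. */
	static const char dealloc_pattern[] = " status=deallocated ";
	regex_t dealloc_re;
	int count = 0;
	size_t i;

	if (regcomp(&dealloc_re, dealloc_pattern, REG_NOSUB))
		return -1;

	for (i = 0; i < n; i++) {
		int ret = regexec(&dealloc_re, records[i], 0, NULL, 0);

		if (ret == REG_NOMATCH) {
			/* Not a deallocation record: count it. */
			count++;
		} else if (ret != 0) {
			count = -1;
			break;
		}
	}

	regfree(&dealloc_re);
	return count;
}
```

With the three sample records, only the two allocation records are
counted; the deallocation record in the middle is skipped, so a stale
one left over from a previous test would no longer inflate the count.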

diff --git a/tools/testing/selftests/landlock/audit.h b/tools/testing/selftests/landlock/audit.h
index 834005b2b0f0..699aed5ffab4 100644
--- a/tools/testing/selftests/landlock/audit.h
+++ b/tools/testing/selftests/landlock/audit.h
@@ -381,18 +381,24 @@ struct audit_records {
 };
 
 /*
- * WARNING: Do not assert records.domain == 0 without a preceding
- * audit_match_record() call.  Domain deallocation records are emitted
- * asynchronously from kworker threads and can arrive after the drain in
- * audit_init(), corrupting the domain count.  A preceding audit_match_record()
- * call consumes stale records while scanning, making the assertion safe in
- * practice because stale deallocation records arrive before the expected access
- * records.
+ * Counts remaining audit records by type, skipping domain deallocation records.
+ * Deallocation records are emitted asynchronously from kworker threads after a
+ * previous test's child has exited, so they can arrive after the drain in
+ * audit_init() and after the preceding audit_match_record() call.  Allocation
+ * records are emitted synchronously during landlock_log_denial() in the current
+ * test's syscall context, so only those are counted in records->domain.
  */
 static int audit_count_records(int audit_fd, struct audit_records *records)
 {
+	static const char dealloc_pattern[] = REGEX_LANDLOCK_PREFIX
+		" status=deallocated ";
 	struct audit_message msg;
-	int err;
+	regex_t dealloc_re;
+	int ret, err = 0;
+
+	ret = regcomp(&dealloc_re, dealloc_pattern, 0);
+	if (ret)
+		return -ENOMEM;
 
 	records->access = 0;
 	records->domain = 0;
@@ -402,9 +408,8 @@ static int audit_count_records(int audit_fd, struct audit_records *records)
 		err = audit_recv(audit_fd, &msg);
 		if (err) {
 			if (err == -EAGAIN)
-				return 0;
-			else
-				return err;
+				err = 0;
+			break;
 		}
 
 		switch (msg.header.nlmsg_type) {
@@ -412,12 +417,20 @@ static int audit_count_records(int audit_fd, struct audit_records *records)
 			records->access++;
 			break;
 		case AUDIT_LANDLOCK_DOMAIN:
-			records->domain++;
+			ret = regexec(&dealloc_re, msg.data, 0, NULL, 0);
+			if (ret == REG_NOMATCH) {
+				records->domain++;
+			} else if (ret != 0) {
+				err = -EIO;
+				goto out;
+			}
 			break;
 		}
 	} while (true);
 
-	return 0;
+out:
+	regfree(&dealloc_re);
+	return err;
 }
 
 static int audit_init(void)
diff --git a/tools/testing/selftests/landlock/audit_test.c b/tools/testing/selftests/landlock/audit_test.c
index 93ae5bd0dcce..758cf2368281 100644
--- a/tools/testing/selftests/landlock/audit_test.c
+++ b/tools/testing/selftests/landlock/audit_test.c
@@ -730,6 +730,7 @@ TEST_F(audit_flags, signal)
 		} else {
 			EXPECT_EQ(1, records.access);
 		}
+		EXPECT_EQ(0, records.domain);
 
 		/* Updates filter rules to match the drop record. */
 		set_cap(_metadata, CAP_AUDIT_CONTROL);
@@ -917,6 +918,7 @@ TEST_F(audit_exec, signal_and_open)
 	/* Tests that there was no denial until now. */
 	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
 	EXPECT_EQ(0, records.access);
+	EXPECT_EQ(0, records.domain);
 
 	/*
 	 * Wait for the child to do a first denied action by layer1 and
diff --git a/tools/testing/selftests/landlock/ptrace_test.c b/tools/testing/selftests/landlock/ptrace_test.c
index 1b6c8b53bf33..4f64c90583cd 100644
--- a/tools/testing/selftests/landlock/ptrace_test.c
+++ b/tools/testing/selftests/landlock/ptrace_test.c
@@ -342,6 +342,7 @@ TEST_F(audit, trace)
 	/* Makes sure there is no superfluous logged records. */
 	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
 	EXPECT_EQ(0, records.access);
+	EXPECT_EQ(0, records.domain);
 
 	yama_ptrace_scope = get_yama_ptrace_scope();
 	ASSERT_LE(0, yama_ptrace_scope);
diff --git a/tools/testing/selftests/landlock/scoped_abstract_unix_test.c b/tools/testing/selftests/landlock/scoped_abstract_unix_test.c
index c47491d2d1c1..72f97648d4a7 100644
--- a/tools/testing/selftests/landlock/scoped_abstract_unix_test.c
+++ b/tools/testing/selftests/landlock/scoped_abstract_unix_test.c
@@ -312,6 +312,7 @@ TEST_F(scoped_audit, connect_to_child)
 	/* Makes sure there is no superfluous logged records. */
 	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
 	EXPECT_EQ(0, records.access);
+	EXPECT_EQ(0, records.domain);
 
 	ASSERT_EQ(0, pipe2(pipe_child, O_CLOEXEC));
 	ASSERT_EQ(0, pipe2(pipe_parent, O_CLOEXEC));
-- 
2.54.0


^ permalink raw reply related

* Re: [linus:master] [selftests]  465b05bae5: kernel-selftests.landlock.audit_test.audit.tsync_override_log_subdomains_off.fail
From: Mickaël Salaün @ 2026-05-13 10:52 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Günther Noack, kernel test robot, linux-security-module,
	oe-lkp, lkp, linux-kernel, Shuah Khan, Kees Cook, linux-kselftest
In-Reply-To: <20260512112054-0a5c7cad-c8c6-4745-a765-0780a3bd2dab@linutronix.de>

Hi Thomas,

Thanks for the report.

On Tue, May 12, 2026 at 11:27:42AM +0200, Thomas Weißschuh wrote:
> Hi Mickaël and Günther,
> 
> I received the following report about a failing landlock selftest from
> the 0day bot.
> 
> On Mon, May 11, 2026 at 10:15:00PM +0800, kernel test robot wrote:
> > kernel test robot noticed "kernel-selftests.landlock.audit_test.audit.tsync_override_log_subdomains_off.fail" on:
> > 
> > commit: 465b05bae5ac553c13315681c1490dc565337771 ("selftests: harness: Restore order of test functions")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > 
> > 
> > in testcase: kernel-selftests
> > version: kernel-selftests-x86_64-9f2693489ef8-1_20260201
> > with following parameters:
> > 
> > 	group: landlock

This group is correct but I'm wondering why the maintainer (me) and the
reviewer (Günther) weren't in Cc.

What do we need to do to subscribe to such reports?  Why isn't it
automatically inferred from the MAINTAINERS file?

> > 
> > config: x86_64-rhel-9.4-kselftests
> > compiler: gcc-14
> > test machine: 16 threads Intel(R) Core(TM) i7-13620H (Raptor Lake) with 32G memory
> > 
> > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > 
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202605111649.a8b30a62-lkp@intel.com
> 
> I was unable to run the landlock selftests myself, on my machines they are
> failing at runtime with all kinds of colorful errors. Are the requirements
> explained somewhere?

I'm curious about the errors you get.  They are standard kselftests that
should work following this workflow:

  make TARGETS=landlock O=build kselftest-gen_tar

and then running ./build/kselftests/kselftest_install/run_kselftest.sh
as root in a VM.  The required kernel configuration is listed in
tools/testing/selftests/landlock/config

To make it easier, we wrote a wrapper to test everything with UML:
https://github.com/landlock-lsm/landlock-test-tools (see check-linux.sh)

> 
> > # #  RUN           audit.tsync_override_log_subdomains_off ...
> > # # audit_test.c:591:tsync_override_log_subdomains_off:Expected 0 (0) == matches_log_signal(_metadata, self->audit_fd, child_data.parent_pid, NULL) (-11)
> 
> This error number means "EAGAIN 11 Resource temporarily unavailable",
> so it could be a temporary error.

Yes, the test is flaky under pressure.

> 
> Can you reproduce this issue? Is it really dependent on my patch as
> blamed above? If so, does the selftest rely on the previous, incorrect order?

I don't think it directly depends on your patch but it might be a side
effect.  Anyway, I've been working on fixing this kind of issue and just
sent a fix:
https://lore.kernel.org/r/20260513105112.140137-2-mic@digikod.net

> 
> > # # tsync_override_log_subdomains_off: Test failed
> > # #          FAIL  audit.tsync_override_log_subdomains_off
> > # not ok 5 audit.tsync_override_log_subdomains_off
> 
> 
> Thomas

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-13 11:39 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Christian König, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CABdmKX2uwZ12kYJYPJGfWxuMBOJS=64b1GRj72tfB5D=NKM22w@mail.gmail.com>

On Tue, May 12, 2026 at 8:53 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Tue, May 12, 2026 at 3:14 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > On 5/12/26 11:10, Albert Esteve wrote:
> > > On embedded platforms a central process often allocates dma-buf
> > > memory on behalf of client applications. Without a way to
> > > attribute the charge to the requesting client's cgroup, the
> > > cost lands on the allocator, making per-cgroup memory limits
> > > ineffective for the actual consumers.
> > >
> > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > the mem_accounting module parameter enabled, the buffer is charged
> > > to the allocator's own cgroup.
> > >
> > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > all accounting through a single MEMCG_DMABUF path.
> > >
> > > Usage examples:
> > >
> > >   1. Central allocator charging to a client at allocation time.
> > >      The allocator knows the client's PID (e.g., from binder's
> > >      sender_pid) and uses pidfd to attribute the charge:
> > >
> > >        pid_t client_pid = txn->sender_pid;
> > >        int pidfd = pidfd_open(client_pid, 0);
> > >
> > >        struct dma_heap_allocation_data alloc = {
> > >            .len             = buffer_size,
> > >            .fd_flags        = O_RDWR | O_CLOEXEC,
> > >            .charge_pid_fd   = pidfd,
> > >        };
> > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > >        close(pidfd);
> > >        /* alloc.fd is now charged to client's cgroup */
> > >
> > >   2. Default allocation (no pidfd, mem_accounting=1).
> > >      When charge_pid_fd is not set and the mem_accounting module
> > >      parameter is enabled, the buffer is charged to the allocator's
> > >      own cgroup:
> > >
> > >        struct dma_heap_allocation_data alloc = {
> > >            .len      = buffer_size,
> > >            .fd_flags = O_RDWR | O_CLOEXEC,
> > >        };
> > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > >        /* charged to current process's cgroup */
> > >
> > > Current limitations:
> > >
> > >  - Single-owner model: a dma-buf carries one memcg charge regardless of
> > >    how many processes share it. Means only the first owner (and exporter)
> > >    of the shared buffer bears the charge.
> > >  - Only memcg accounting supported. While this makes sense for system
> > >    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> > >    charging also for the dmem controller.
> >
> > Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
>
> Yeah I think this might work. I know of 3 cases, and it trivially
> solves the first two. The third requires some work on our end to
> extend our userspace interfaces to include the pidfd but it seems
> doable. I'm checking with our graphics folks.
>
> 1) Direct allocation from user (e.g. app -> allocation ioctl on
> /dev/dma_heap/foo)
> No changes required to userspace. mem_accounting=1 charges the app.
>
> 2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
> -> gralloc)
> gralloc has the caller's pid as described in the commit message. Open
> a pidfd and pass it in the dma_heap_allocation_data.
>
> 3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
> SurfaceFlinger -> gralloc)
> In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
> we need to add the app's pidfd to the SurfaceFlinger -> gralloc
> interface, or transfer the memcg charge from SurfaceFlinger to the app
> after the allocation.
> It'd be nice to avoid the charge transfer option entirely, but if we
> need it that doesn't seem so bad in this case because it's a bulk
> charge for the entire dmabuf rather than per-page. So the exporter
> doesn't need to get involved (we wouldn't need a new dma_buf_op) and
> we wouldn't have to worry about looping and locking for each page.
>
> > I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive, etc.
> >
> > Essentially the problem boils down to two limitations:
> > 1) a piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups
>
> Yup, memcg already has this problem with pagecache and shmem.
>
> > 2) when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
> >
> > The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases that use file descriptors as well, e.g. memfd, accel and GPU drivers.
> >
> > On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps.
>
> I have a question about this part. Albert I guess you are interested
> only in accounting dmabuf-heap allocations, or do you expect to add
> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> non-dmabuf-heap exporters?

We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
controller are on the radar for follow-up/parallel work (there will be
dragons and will surely need discussion). For DRM and V4L2 the
long-term intent is migration to heaps, which would make direct
accounting on those paths unnecessary. udmabufs are already
memcg-charged, so adding a separate MEMCG_DMABUF charge would
double-count them. Are there any other exporters you had in mind that
would benefit from this approach?

BR,
Albert.

>
> > On the GPU side, just a few weeks ago I saw yet another attempt by a driver to do some kind of special driver-specific accounting to solve this. And to be honest, such single-driver island approaches tend to break more often than they work correctly.
> >
> > Regards,
> > Christian.
> >
> > >
> > > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > > ---
> > >  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
> > >  drivers/dma-buf/dma-buf.c               | 16 ++++---------
> > >  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
> > >  drivers/dma-buf/heaps/system_heap.c     |  2 --
> > >  include/uapi/linux/dma-heap.h           |  6 +++++
> > >  5 files changed, 53 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 8bdbc2e866430..824d269531eb1 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> > >               structures.
> > >
> > >         dmabuf (npn)
> > > -             Amount of memory used for exported DMA buffers allocated by the cgroup.
> > > -             Stays with the allocating cgroup regardless of how the buffer is shared.
> > > +             Amount of memory used for exported DMA buffers allocated by or on
> > > +             behalf of the cgroup. Stays with the allocating cgroup regardless
> > > +             of how the buffer is shared.
> > >
> > >         workingset_refault_anon
> > >               Number of refaults of previously evicted anonymous pages.
> > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > > index ce02377f48908..23fb758b78297 100644
> > > --- a/drivers/dma-buf/dma-buf.c
> > > +++ b/drivers/dma-buf/dma-buf.c
> > > @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> > >        */
> > >       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> > >
> > > -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > > -     mem_cgroup_put(dmabuf->memcg);
> > > +     if (dmabuf->memcg) {
> > > +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> > > +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > > +             mem_cgroup_put(dmabuf->memcg);
> > > +     }
> > >
> > >       dmabuf->ops->release(dmabuf);
> > >
> > > @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > >               dmabuf->resv = resv;
> > >       }
> > >
> > > -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> > > -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> > > -                                   GFP_KERNEL)) {
> > > -             ret = -ENOMEM;
> > > -             goto err_memcg;
> > > -     }
> > > -
> > >       file->private_data = dmabuf;
> > >       file->f_path.dentry->d_fsdata = dmabuf;
> > >       dmabuf->file = file;
> > > @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > >
> > >       return dmabuf;
> > >
> > > -err_memcg:
> > > -     mem_cgroup_put(dmabuf->memcg);
> > >  err_file:
> > >       fput(file);
> > >  err_module:
> > > diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> > > index ac5f8685a6494..ff6e259afcdc0 100644
> > > --- a/drivers/dma-buf/dma-heap.c
> > > +++ b/drivers/dma-buf/dma-heap.c
> > > @@ -7,13 +7,17 @@
> > >   */
> > >
> > >  #include <linux/cdev.h>
> > > +#include <linux/cgroup.h>
> > >  #include <linux/device.h>
> > >  #include <linux/dma-buf.h>
> > >  #include <linux/dma-heap.h>
> > > +#include <linux/memcontrol.h>
> > > +#include <linux/sched/mm.h>
> > >  #include <linux/err.h>
> > >  #include <linux/export.h>
> > >  #include <linux/list.h>
> > >  #include <linux/nospec.h>
> > > +#include <linux/pidfd.h>
> > >  #include <linux/syscalls.h>
> > >  #include <linux/uaccess.h>
> > >  #include <linux/xarray.h>
> > > @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> > >                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> > >
> > >  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > > -                              u32 fd_flags,
> > > -                              u64 heap_flags)
> > > +                              u32 fd_flags, u64 heap_flags,
> > > +                              struct mem_cgroup *charge_to)
> > >  {
> > >       struct dma_buf *dmabuf;
> > > +     unsigned int nr_pages;
> > > +     struct mem_cgroup *memcg = charge_to;
> > >       int fd;
> > >
> > >       /*
> > > @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > >       if (IS_ERR(dmabuf))
> > >               return PTR_ERR(dmabuf);
> > >
> > > +     nr_pages = len / PAGE_SIZE;
> > > +
> > > +     if (memcg)
> > > +             css_get(&memcg->css);
> > > +     else if (mem_accounting)
> > > +             memcg = get_mem_cgroup_from_mm(current->mm);
> > > +
> > > +     if (memcg) {
> > > +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> > > +                     mem_cgroup_put(memcg);
> > > +                     dma_buf_put(dmabuf);
> > > +                     return -ENOMEM;
> > > +             }
> > > +             dmabuf->memcg = memcg;
> > > +     }
> > > +
> > >       fd = dma_buf_fd(dmabuf, fd_flags);
> > >       if (fd < 0) {
> > >               dma_buf_put(dmabuf);
> > > @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > >  {
> > >       struct dma_heap_allocation_data *heap_allocation = data;
> > >       struct dma_heap *heap = file->private_data;
> > > +     struct mem_cgroup *memcg = NULL;
> > > +     struct task_struct *task;
> > > +     unsigned int pidfd_flags;
> > >       int fd;
> > >
> > >       if (heap_allocation->fd)
> > > @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > >       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> > >               return -EINVAL;
> > >
> > > +     if (heap_allocation->charge_pid_fd) {
> > > +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> > > +             if (IS_ERR(task))
> > > +                     return PTR_ERR(task);
> > > +
> > > +             memcg = get_mem_cgroup_from_mm(task->mm);
> > > +             put_task_struct(task);
> > > +     }
> > > +
> > >       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> > >                                  heap_allocation->fd_flags,
> > > -                                heap_allocation->heap_flags);
> > > +                                heap_allocation->heap_flags,
> > > +                                memcg);
> > > +     mem_cgroup_put(memcg);
> > >       if (fd < 0)
> > >               return fd;
> > >
> > > diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> > > index 03c2b87cb1112..95d7688167b93 100644
> > > --- a/drivers/dma-buf/heaps/system_heap.c
> > > +++ b/drivers/dma-buf/heaps/system_heap.c
> > > @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> > >               if (max_order < orders[i])
> > >                       continue;
> > >               flags = order_flags[i];
> > > -             if (mem_accounting)
> > > -                     flags |= __GFP_ACCOUNT;
> > >               page = alloc_pages(flags, orders[i]);
> > >               if (!page)
> > >                       continue;
> > > diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> > > index a4cf716a49fa6..e02b0f8cbc6a1 100644
> > > --- a/include/uapi/linux/dma-heap.h
> > > +++ b/include/uapi/linux/dma-heap.h
> > > @@ -29,6 +29,10 @@
> > >   *                   handle to the allocated dma-buf
> > >   * @fd_flags:                file descriptor flags used when allocating
> > >   * @heap_flags:              flags passed to heap
> > > + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
> > > + *                   charged for this allocation; 0 means charge the calling
> > > + *                   process's cgroup
> > > + * @__padding:               reserved, must be zero
> > >   *
> > >   * Provided by userspace as an argument to the ioctl
> > >   */
> > > @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> > >       __u32 fd;
> > >       __u32 fd_flags;
> > >       __u64 heap_flags;
> > > +     __u32 charge_pid_fd;
> > > +     __u32 __padding;
> > >  };
> > >
> > >  #define DMA_HEAP_IOC_MAGIC           'H'
> > >
> >
>


^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-13 12:41 UTC (permalink / raw)
  To: Christian König
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Benjamin Gaignard,
	Brian Starkey, John Stultz, T.J. Mercier, Christian Brauner,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <8ef38815-6ae9-4359-86d4-042554357639@amd.com>

On Tue, May 12, 2026 at 12:14 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/12/26 11:10, Albert Esteve wrote:
> > On embedded platforms a central process often allocates dma-buf
> > memory on behalf of client applications. Without a way to
> > attribute the charge to the requesting client's cgroup, the
> > cost lands on the allocator, making per-cgroup memory limits
> > ineffective for the actual consumers.
> >
> > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > the mem_accounting module parameter enabled, the buffer is charged
> > to the allocator's own cgroup.
> >
> > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > all accounting through a single MEMCG_DMABUF path.
> >
> > Usage examples:
> >
> >   1. Central allocator charging to a client at allocation time.
> >      The allocator knows the client's PID (e.g., from binder's
> >      sender_pid) and uses pidfd to attribute the charge:
> >
> >        pid_t client_pid = txn->sender_pid;
> >        int pidfd = pidfd_open(client_pid, 0);
> >
> >        struct dma_heap_allocation_data alloc = {
> >            .len             = buffer_size,
> >            .fd_flags        = O_RDWR | O_CLOEXEC,
> >            .charge_pid_fd   = pidfd,
> >        };
> >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >        close(pidfd);
> >        /* alloc.fd is now charged to client's cgroup */
> >
> >   2. Default allocation (no pidfd, mem_accounting=1).
> >      When charge_pid_fd is not set and the mem_accounting module
> >      parameter is enabled, the buffer is charged to the allocator's
> >      own cgroup:
> >
> >        struct dma_heap_allocation_data alloc = {
> >            .len      = buffer_size,
> >            .fd_flags = O_RDWR | O_CLOEXEC,
> >        };
> >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >        /* charged to current process's cgroup */
> >
> > Current limitations:
> >
> >  - Single-owner model: a dma-buf carries one memcg charge regardless of
> >    how many processes share it. Means only the first owner (and exporter)
> >    of the shared buffer bears the charge.
> >  - Only memcg accounting supported. While this makes sense for system
> >    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> >    charging also for the dmem controller.
>
> Well that doesn't looks soo bad, it at least seems to tackle the problem at hand for Android and some of other embedded use cases.
>
> I'm just not sure if this is future prove and will work for all use cases, e.g. cloud gaming, native context for automotive etc...
>
> Essentially the problem boils down to two limitations:
> 1) a piece of memory can only be charged to one cgroup, the framework doesn't has a concept of charging shared memory to multiple groups
> 2) when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
>
> The passing of the memory reference already has a well defined uAPI and if we could solve those two limitations we not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases which uses file descriptors as well as. E.g. memfd, accel and GPU drivers etc...

Honestly, adding a hook to fd-passing uAPI to manage charge transfers
sounds like a promising solution requiring no uAPI changes. However,
it still does not cover all paths, e.g., dup() or fork(). And shared
memory sounds like a hard one to tackle, where deciding the best
policy is more a per-usecase thing and would probably require
userspace configuration. All in all, charge_pid_fd covers a
well-defined and immediately practical subset. The UAPI cost is small
and the mechanism is explicit about what it does and doesn't solve. A
general solution, if it ever converges, would likely supersede
charge_pid_fd for most cases, which is a fine outcome if it solves the
problem more completely.

Either way, if you have a specific approach in mind for solving any of
the above limitations, I'd be happy to look into it further.

BR,
Albert.

>
> On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps. On the GPU side I have seen just another try of a driver doing some kind of special driver specific accounting to solve this just a few weeks ago. And to be honest such single driver island approach have the tendency to break more often that they are working correctly.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
> >  drivers/dma-buf/dma-buf.c               | 16 ++++---------
> >  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
> >  drivers/dma-buf/heaps/system_heap.c     |  2 --
> >  include/uapi/linux/dma-heap.h           |  6 +++++
> >  5 files changed, 53 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 8bdbc2e866430..824d269531eb1 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> >               structures.
> >
> >         dmabuf (npn)
> > -             Amount of memory used for exported DMA buffers allocated by the cgroup.
> > -             Stays with the allocating cgroup regardless of how the buffer is shared.
> > +             Amount of memory used for exported DMA buffers allocated by or on
> > +             behalf of the cgroup. Stays with the allocating cgroup regardless
> > +             of how the buffer is shared.
> >
> >         workingset_refault_anon
> >               Number of refaults of previously evicted anonymous pages.
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index ce02377f48908..23fb758b78297 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> >        */
> >       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> >
> > -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > -     mem_cgroup_put(dmabuf->memcg);
> > +     if (dmabuf->memcg) {
> > +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> > +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > +             mem_cgroup_put(dmabuf->memcg);
> > +     }
> >
> >       dmabuf->ops->release(dmabuf);
> >
> > @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >               dmabuf->resv = resv;
> >       }
> >
> > -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> > -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> > -                                   GFP_KERNEL)) {
> > -             ret = -ENOMEM;
> > -             goto err_memcg;
> > -     }
> > -
> >       file->private_data = dmabuf;
> >       file->f_path.dentry->d_fsdata = dmabuf;
> >       dmabuf->file = file;
> > @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >
> >       return dmabuf;
> >
> > -err_memcg:
> > -     mem_cgroup_put(dmabuf->memcg);
> >  err_file:
> >       fput(file);
> >  err_module:
> > diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> > index ac5f8685a6494..ff6e259afcdc0 100644
> > --- a/drivers/dma-buf/dma-heap.c
> > +++ b/drivers/dma-buf/dma-heap.c
> > @@ -7,13 +7,17 @@
> >   */
> >
> >  #include <linux/cdev.h>
> > +#include <linux/cgroup.h>
> >  #include <linux/device.h>
> >  #include <linux/dma-buf.h>
> >  #include <linux/dma-heap.h>
> > +#include <linux/memcontrol.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/err.h>
> >  #include <linux/export.h>
> >  #include <linux/list.h>
> >  #include <linux/nospec.h>
> > +#include <linux/pidfd.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/xarray.h>
> > @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> >                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> >
> >  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > -                              u32 fd_flags,
> > -                              u64 heap_flags)
> > +                              u32 fd_flags, u64 heap_flags,
> > +                              struct mem_cgroup *charge_to)
> >  {
> >       struct dma_buf *dmabuf;
> > +     unsigned int nr_pages;
> > +     struct mem_cgroup *memcg = charge_to;
> >       int fd;
> >
> >       /*
> > @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> >       if (IS_ERR(dmabuf))
> >               return PTR_ERR(dmabuf);
> >
> > +     nr_pages = len / PAGE_SIZE;
> > +
> > +     if (memcg)
> > +             css_get(&memcg->css);
> > +     else if (mem_accounting)
> > +             memcg = get_mem_cgroup_from_mm(current->mm);
> > +
> > +     if (memcg) {
> > +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> > +                     mem_cgroup_put(memcg);
> > +                     dma_buf_put(dmabuf);
> > +                     return -ENOMEM;
> > +             }
> > +             dmabuf->memcg = memcg;
> > +     }
> > +
> >       fd = dma_buf_fd(dmabuf, fd_flags);
> >       if (fd < 0) {
> >               dma_buf_put(dmabuf);
> > @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >  {
> >       struct dma_heap_allocation_data *heap_allocation = data;
> >       struct dma_heap *heap = file->private_data;
> > +     struct mem_cgroup *memcg = NULL;
> > +     struct task_struct *task;
> > +     unsigned int pidfd_flags;
> >       int fd;
> >
> >       if (heap_allocation->fd)
> > @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> >               return -EINVAL;
> >
> > +     if (heap_allocation->charge_pid_fd) {
> > +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> > +             if (IS_ERR(task))
> > +                     return PTR_ERR(task);
> > +
> > +             memcg = get_mem_cgroup_from_mm(task->mm);
> > +             put_task_struct(task);
> > +     }
> > +
> >       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> >                                  heap_allocation->fd_flags,
> > -                                heap_allocation->heap_flags);
> > +                                heap_allocation->heap_flags,
> > +                                memcg);
> > +     mem_cgroup_put(memcg);
> >       if (fd < 0)
> >               return fd;
> >
> > diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> > index 03c2b87cb1112..95d7688167b93 100644
> > --- a/drivers/dma-buf/heaps/system_heap.c
> > +++ b/drivers/dma-buf/heaps/system_heap.c
> > @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> >               if (max_order < orders[i])
> >                       continue;
> >               flags = order_flags[i];
> > -             if (mem_accounting)
> > -                     flags |= __GFP_ACCOUNT;
> >               page = alloc_pages(flags, orders[i]);
> >               if (!page)
> >                       continue;
> > diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> > index a4cf716a49fa6..e02b0f8cbc6a1 100644
> > --- a/include/uapi/linux/dma-heap.h
> > +++ b/include/uapi/linux/dma-heap.h
> > @@ -29,6 +29,10 @@
> >   *                   handle to the allocated dma-buf
> >   * @fd_flags:                file descriptor flags used when allocating
> >   * @heap_flags:              flags passed to heap
> > + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
> > + *                   charged for this allocation; 0 means charge the calling
> > + *                   process's cgroup
> > + * @__padding:               reserved, must be zero
> >   *
> >   * Provided by userspace as an argument to the ioctl
> >   */
> > @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> >       __u32 fd;
> >       __u32 fd_flags;
> >       __u64 heap_flags;
> > +     __u32 charge_pid_fd;
> > +     __u32 __padding;
> >  };
> >
> >  #define DMA_HEAP_IOC_MAGIC           'H'
> >
>


^ permalink raw reply

* Re: [PATCH] rust: cred: add safe abstractions for capable() and ns_capable()
From: kernel test robot @ 2026-05-13 12:53 UTC (permalink / raw)
  To: Arnav Sharma, ojeda, paul
  Cc: llvm, oe-kbuild-all, Arnav Sharma, Serge Hallyn, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, linux-security-module,
	rust-for-linux, linux-kernel
In-Reply-To: <20260506204913.26022-1-arnav4324@gmail.com>

Hi Arnav,

kernel test robot noticed the following build errors:

[auto build test ERROR on rust/rust-next]
[also build test ERROR on linus/master v7.1-rc3 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Arnav-Sharma/rust-cred-add-safe-abstractions-for-capable-and-ns_capable/20260513-154340
base:   https://github.com/Rust-for-Linux/linux rust-next
patch link:    https://lore.kernel.org/r/20260506204913.26022-1-arnav4324%40gmail.com
patch subject: [PATCH] rust: cred: add safe abstractions for capable() and ns_capable()
config: loongarch-randconfig-001 (https://download.01.org/0day-ci/archive/20260513/202605132018.249x1thF-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605132018.249x1thF-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605132018.249x1thF-lkp@intel.com/

All errors (new ones prefixed by >>):

>> error[E0425]: cannot find function `capable` in crate `bindings`
   --> rust/kernel/cred.rs:133:24
   |
   133 |     unsafe { bindings::capable(cap as i32) }
   |                        ^^^^^^^ not found in `bindings`
--
>> error[E0425]: cannot find function `ns_capable` in crate `bindings`
   --> rust/kernel/cred.rs:169:24
   |
   169    |       unsafe { bindings::ns_capable(ns, cap as i32) }
   |                          ^^^^^^^^^^ help: a function with a similar name exists: `cap_capable`
   |
   ::: rust/bindings/bindings_generated.rs:107518:5
   |
   107518 | /     pub fn cap_capable(
   107519 | |         cred: *const cred,
   107520 | |         ns: *mut user_namespace,
   107521 | |         cap: ffi::c_int,
   107522 | |         opts: ffi::c_uint,
   107523 | |     ) -> ffi::c_int;
   | |____________________- similarly named function `cap_capable` defined here

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [PATCH v1] landlock: Demonstrate best-effort allowed_access filtering
From: Mickaël Salaün @ 2026-05-13 15:18 UTC (permalink / raw)
  To: Günther Noack
  Cc: Mickaël Salaün, linux-security-module, Justin Suess,
	Tingmao Wang

Landlock provides best-effort sandboxing across ABI versions:
applications request the rights they need, and on older kernels the
unsupported rights are silently dropped from handled_access_* by the
documented compatibility switch.  The recommended pattern for
landlock_add_rule(2) calls is to mirror this filtering at the rule
level, which wasn't explicitly described in the example.

Show the pattern explicitly in the filesystem and network rule examples
by masking each rule's allowed_access against the ruleset's
handled_access_* and adding the rule only when at least one bit remains
set.  This makes the recommended best-effort pattern self-documenting.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
 Documentation/userspace-api/landlock.rst | 48 +++++++++++++-----------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst
index fd8b78c31f2f..45861fa75685 100644
--- a/Documentation/userspace-api/landlock.rst
+++ b/Documentation/userspace-api/landlock.rst
@@ -8,7 +8,7 @@ Landlock: unprivileged access control
 =====================================
 
 :Author: Mickaël Salaün
-:Date: March 2026
+:Date: May 2026
 
 The goal of Landlock is to enable restriction of ambient rights (e.g. global
 filesystem or network access) for a set of processes.  Because Landlock
@@ -155,7 +155,7 @@ this file descriptor.
 
 .. code-block:: c
 
-    int err;
+    int err = 0;
     struct landlock_path_beneath_attr path_beneath = {
         .allowed_access =
             LANDLOCK_ACCESS_FS_EXECUTE |
@@ -163,25 +163,29 @@ this file descriptor.
             LANDLOCK_ACCESS_FS_READ_DIR,
     };
 
-    path_beneath.parent_fd = open("/usr", O_PATH | O_CLOEXEC);
-    if (path_beneath.parent_fd < 0) {
-        perror("Failed to open file");
-        close(ruleset_fd);
-        return 1;
-    }
-    err = landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
-                            &path_beneath, 0);
-    close(path_beneath.parent_fd);
-    if (err) {
-        perror("Failed to update ruleset");
-        close(ruleset_fd);
-        return 1;
+    path_beneath.allowed_access &= ruleset_attr.handled_access_fs;
+    if (path_beneath.allowed_access) {
+        path_beneath.parent_fd = open("/usr", O_PATH | O_CLOEXEC);
+        if (path_beneath.parent_fd < 0) {
+            perror("Failed to open file");
+            close(ruleset_fd);
+            return 1;
+        }
+        err = landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
+                                &path_beneath, 0);
+        close(path_beneath.parent_fd);
+        if (err) {
+            perror("Failed to update ruleset");
+            close(ruleset_fd);
+            return 1;
+        }
     }
 
-It may also be required to create rules following the same logic as explained
-for the ruleset creation, by filtering access rights according to the Landlock
-ABI version.  In this example, this is not required because all of the requested
-``allowed_access`` rights are already available in ABI 1.
+As shown above, masking the rule's ``allowed_access`` against the ruleset's
+``handled_access_*`` is the recommended best-effort pattern: rights the running
+kernel does not support are dropped (the compatibility switch above already
+cleared them in ``handled_access_*``), and the rule is skipped if no supported
+right remains.
 
 For network access-control, we can add a set of rules that allow to use a port
 number for a specific action: HTTPS connections.
@@ -193,8 +197,10 @@ number for a specific action: HTTPS connections.
         .port = 443,
     };
 
-    err = landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
-                            &net_port, 0);
+    net_port.allowed_access &= ruleset_attr.handled_access_net;
+    if (net_port.allowed_access)
+        err = landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
+                                &net_port, 0);
 
 When passing a non-zero ``flags`` argument to ``landlock_restrict_self()``, a
 similar backwards compatibility check is needed for the restrict flags
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] rust: cred: add safe abstractions for capable() and ns_capable()
From: kernel test robot @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Arnav Sharma, ojeda, paul
  Cc: oe-kbuild-all, Arnav Sharma, Serge Hallyn, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, linux-security-module,
	rust-for-linux, linux-kernel
In-Reply-To: <20260506204913.26022-1-arnav4324@gmail.com>

Hi Arnav,

kernel test robot noticed the following build errors:

[auto build test ERROR on rust/rust-next]
[also build test ERROR on linus/master v7.1-rc3 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Arnav-Sharma/rust-cred-add-safe-abstractions-for-capable-and-ns_capable/20260513-154340
base:   https://github.com/Rust-for-Linux/linux rust-next
patch link:    https://lore.kernel.org/r/20260506204913.26022-1-arnav4324%40gmail.com
patch subject: [PATCH] rust: cred: add safe abstractions for capable() and ns_capable()
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20260513/202605131729.NXF18q0f-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605131729.NXF18q0f-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605131729.NXF18q0f-lkp@intel.com/

All errors (new ones prefixed by >>):

   PATH=/opt/cross/clang-20/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
   INFO PATH=/opt/cross/rustc-1.88.0-bindgen-0.72.1/cargo/bin:/opt/cross/clang-20/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
   /usr/bin/timeout -k 100 12h /usr/bin/make KCFLAGS=\ -fno-crash-diagnostics\ -Wno-error=return-type\ -Wreturn-type\ -funsigned-char\ -Wundef\ -falign-functions=64 W=1 --keep-going LLVM=1 -j384 -C source O=/kbuild/obj/consumer/x86_64-rhel-9.4-rust ARCH=x86_64 SHELL=/bin/bash rustfmtcheck 
   make: Entering directory '/kbuild/src'
   make[1]: Entering directory '/kbuild/obj/consumer/x86_64-rhel-9.4-rust'
>> Diff in rust/kernel/cred.rs:168:
        // the specified capability is granted.
        unsafe { bindings::ns_capable(ns, cap as i32) }
    }
   -

>> Diff in rust/kernel/cred.rs:168:
        // the specified capability is granted.
        unsafe { bindings::ns_capable(ns, cap as i32) }
    }
   -

   make[2]: *** [Makefile:1954: rustfmt] Error 123
   make[2]: Target 'rustfmtcheck' not remade because of errors.
   make[1]: Leaving directory '/kbuild/obj/consumer/x86_64-rhel-9.4-rust'
   make[1]: *** [Makefile:248: __sub-make] Error 2
   make[1]: Target 'rustfmtcheck' not remade because of errors.
   make: *** [Makefile:248: __sub-make] Error 2
   make: Target 'rustfmtcheck' not remade because of errors.
   make: Leaving directory '/kbuild/src'

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [PATCH v2 0/3] landlock: Restrict renameat2 with RENAME_WHITEOUT
From: Günther Noack @ 2026-05-13 16:05 UTC (permalink / raw)
  To: Mickaël Salaün, Christian Brauner
  Cc: linux-security-module, Paul Moore, Amir Goldstein, Miklos Szeredi,
	Serge Hallyn, Stephen Smalley, Günther Noack

Hello!

As discussed in [1], the renameat2() syscall's RENAME_WHITEOUT flag allows
the creation of chardev directory entries with major=minor=0 as "whiteout
objects" in the location of the rename source file [2].

This functionality is available even without having any OverlayFS mounted
and can be invoked with the regular renameat2(2) syscall [3].

In V1 [5], it was discussed that whiteout objects are not the same as
character devices, and should therefore be guarded with a separate access
right.  We are therefore guarding the operation with the new access right
LANDLOCK_ACCESS_FS_MAKE_WHITEOUT now.

Because this introduces a new access right, the change is exposed by
incrementing the ABI level and does not require a Landlock erratum.

Motivation
==========

The RENAME_WHITEOUT flag side-steps all of the existing Landlock access
rights, which are designed to restrict the creation of directory entries.
It is desirable to restrict that.

This patch set fixes that by adding a check in Landlock's path_rename hook.

Tradeoffs considered in the implementation
==========================================

* Should the access right check be merged into the longer
  current_check_refer_path() function?

  I am leaning towards keeping it as a special case earlier.  This means
  that we traverse the source path twice, but as we have seen in Debian
  Code Search [4], there are apparently no legitimate callers of renameat2()
  with RENAME_WHITEOUT who are calling this from within a Landlock domain.
  (fuse-overlayfs is legitimate, but is not landlocked)

  It doesn't seem worth complicating our common rename code for a corner
  case that doesn't happen in practice.

[1] https://lore.kernel.org/all/adUBCQXrt7kmgqJT@google.com/
[2] https://docs.kernel.org/filesystems/overlayfs.html#whiteouts-and-opaque-directories
[3] https://man7.org/linux/man-pages/man2/renameat2.2.html#DESCRIPTION
[4] https://codesearch.debian.net/search?q=rename.*RENAME_WHITEOUT&literal=0
[5] https://lore.kernel.org/all/20260411090944.3131168-2-gnoack@google.com/

Changelog
=========

v2:
 - Introduce LANDLOCK_ACCESS_FS_MAKE_WHITEOUT access right
   and guard it with that.
 - Bump ABI version

v1:
 - initial version
   https://lore.kernel.org/all/20260411090944.3131168-2-gnoack@google.com/

Günther Noack (3):
  landlock: Require LANDLOCK_ACCESS_FS_MAKE_WHITEOUT for RENAME_WHITEOUT
  selftests/landlock: Add test for RENAME_WHITEOUT denial
  selftests/landlock: Test OverlayFS renames w/o
    LANDLOCK_ACCESS_FS_MAKE_WHITEOUT

 include/uapi/linux/landlock.h                |  3 ++
 security/landlock/audit.c                    |  1 +
 security/landlock/fs.c                       | 15 ++++++
 security/landlock/limits.h                   |  2 +-
 security/landlock/syscalls.c                 |  2 +-
 tools/testing/selftests/landlock/base_test.c |  4 +-
 tools/testing/selftests/landlock/fs_test.c   | 50 +++++++++++++++++++-
 7 files changed, 71 insertions(+), 6 deletions(-)

-- 
2.54.0.563.g4f69b47b94-goog

^ permalink raw reply

* [PATCH v2 1/3] landlock: Require LANDLOCK_ACCESS_FS_MAKE_WHITEOUT for RENAME_WHITEOUT
From: Günther Noack @ 2026-05-13 16:05 UTC (permalink / raw)
  To: Mickaël Salaün, Christian Brauner
  Cc: linux-security-module, Paul Moore, Amir Goldstein, Miklos Szeredi,
	Serge Hallyn, Stephen Smalley, Günther Noack
In-Reply-To: <20260513160552.4022649-1-gnoack@google.com>

renameat2(2) with the RENAME_WHITEOUT flag places a whiteout character
device file in the source file location in place of the moved file.
This creates a directory entry even in cases where all
LANDLOCK_ACCESS_FS_MAKE_* rights are denied.

Introduce the LANDLOCK_ACCESS_FS_MAKE_WHITEOUT right, which is checked
for the origin directory if RENAME_WHITEOUT is passed.

This does not affect normal renames within layered OverlayFS mounts:
When OverlayFS invokes rename with RENAME_WHITEOUT as part of a
"normal" rename operation, it does so in ovl_rename() using the
credentials that were set at the time of mounting the OverlayFS.

Bump the Landlock ABI version to 10.

Suggested-by: Christian Brauner <brauner@kernel.org>
Suggested-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Günther Noack <gnoack@google.com>
---
 include/uapi/linux/landlock.h                |  3 +++
 security/landlock/audit.c                    |  1 +
 security/landlock/fs.c                       | 15 +++++++++++++++
 security/landlock/limits.h                   |  2 +-
 security/landlock/syscalls.c                 |  2 +-
 tools/testing/selftests/landlock/base_test.c |  4 ++--
 tools/testing/selftests/landlock/fs_test.c   |  5 +++--
 7 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
index 10a346e55e95..1f8a1d6d25f1 100644
--- a/include/uapi/linux/landlock.h
+++ b/include/uapi/linux/landlock.h
@@ -328,6 +328,8 @@ struct landlock_net_port_attr {
  *
  *   If multiple requirements are not met, the ``EACCES`` error code takes
  *   precedence over ``EXDEV``.
+ * - %LANDLOCK_ACCESS_FS_MAKE_WHITEOUT: Create a whiteout object through
+ *   :manpage:`rename(2)` with ``RENAME_WHITEOUT``.
  *
  * .. warning::
  *
@@ -356,6 +358,7 @@ struct landlock_net_port_attr {
 #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
 #define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
 #define LANDLOCK_ACCESS_FS_RESOLVE_UNIX			(1ULL << 16)
+#define LANDLOCK_ACCESS_FS_MAKE_WHITEOUT		(1ULL << 17)
 /* clang-format on */
 
 /**
diff --git a/security/landlock/audit.c b/security/landlock/audit.c
index 8d0edf94037d..09c97083f599 100644
--- a/security/landlock/audit.c
+++ b/security/landlock/audit.c
@@ -38,6 +38,7 @@ static const char *const fs_access_strings[] = {
 	[BIT_INDEX(LANDLOCK_ACCESS_FS_TRUNCATE)] = "fs.truncate",
 	[BIT_INDEX(LANDLOCK_ACCESS_FS_IOCTL_DEV)] = "fs.ioctl_dev",
 	[BIT_INDEX(LANDLOCK_ACCESS_FS_RESOLVE_UNIX)] = "fs.resolve_unix",
+	[BIT_INDEX(LANDLOCK_ACCESS_FS_MAKE_WHITEOUT)] = "fs.make_whiteout",
 };
 
 static_assert(ARRAY_SIZE(fs_access_strings) == LANDLOCK_NUM_ACCESS_FS);
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index c1ecfe239032..09de6ba5c3a3 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1519,6 +1519,21 @@ static int hook_path_rename(const struct path *const old_dir,
 			    const unsigned int flags)
 {
 	/* old_dir refers to old_dentry->d_parent and new_dir->mnt */
+	if (flags & RENAME_WHITEOUT) {
+		int err;
+
+		/*
+		 * Rename with RENAME_WHITEOUT creates a whiteout object in the
+		 * old location, so we check the access right for creating that.
+		 *
+		 * See Documentation/filesystems/overlayfs.rst and renameat2(2).
+		 */
+		err = current_check_access_path(
+			old_dir, LANDLOCK_ACCESS_FS_MAKE_WHITEOUT);
+		if (err)
+			return err;
+	}
+
 	return current_check_refer_path(old_dentry, new_dir, new_dentry, true,
 					!!(flags & RENAME_EXCHANGE));
 }
diff --git a/security/landlock/limits.h b/security/landlock/limits.h
index b454ad73b15e..e59378e8e897 100644
--- a/security/landlock/limits.h
+++ b/security/landlock/limits.h
@@ -19,7 +19,7 @@
 #define LANDLOCK_MAX_NUM_LAYERS		16
 #define LANDLOCK_MAX_NUM_RULES		U32_MAX
 
-#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_RESOLVE_UNIX
+#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_MAKE_WHITEOUT
 #define LANDLOCK_MASK_ACCESS_FS		((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
 #define LANDLOCK_NUM_ACCESS_FS		__const_hweight64(LANDLOCK_MASK_ACCESS_FS)
 
diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
index accfd2e5a0cd..d45469d5d464 100644
--- a/security/landlock/syscalls.c
+++ b/security/landlock/syscalls.c
@@ -166,7 +166,7 @@ static const struct file_operations ruleset_fops = {
  * If the change involves a fix that requires userspace awareness, also update
  * the errata documentation in Documentation/userspace-api/landlock.rst .
  */
-const int landlock_abi_version = 9;
+const int landlock_abi_version = 10;
 
 /**
  * sys_landlock_create_ruleset - Create a new ruleset
diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
index 30d37234086c..6c8113c2ded1 100644
--- a/tools/testing/selftests/landlock/base_test.c
+++ b/tools/testing/selftests/landlock/base_test.c
@@ -76,8 +76,8 @@ TEST(abi_version)
 	const struct landlock_ruleset_attr ruleset_attr = {
 		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
 	};
-	ASSERT_EQ(9, landlock_create_ruleset(NULL, 0,
-					     LANDLOCK_CREATE_RULESET_VERSION));
+	ASSERT_EQ(10, landlock_create_ruleset(NULL, 0,
+					      LANDLOCK_CREATE_RULESET_VERSION));
 
 	ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
 					      LANDLOCK_CREATE_RULESET_VERSION));
diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
index cdb47fc1fc0a..53d1b659849f 100644
--- a/tools/testing/selftests/landlock/fs_test.c
+++ b/tools/testing/selftests/landlock/fs_test.c
@@ -579,7 +579,7 @@ TEST_F_FORK(layout1, inval)
 	LANDLOCK_ACCESS_FS_IOCTL_DEV | \
 	LANDLOCK_ACCESS_FS_RESOLVE_UNIX)
 
-#define ACCESS_LAST LANDLOCK_ACCESS_FS_RESOLVE_UNIX
+#define ACCESS_LAST LANDLOCK_ACCESS_FS_MAKE_WHITEOUT
 
 #define ACCESS_ALL ( \
 	ACCESS_FILE | \
@@ -593,7 +593,8 @@ TEST_F_FORK(layout1, inval)
 	LANDLOCK_ACCESS_FS_MAKE_FIFO | \
 	LANDLOCK_ACCESS_FS_MAKE_BLOCK | \
 	LANDLOCK_ACCESS_FS_MAKE_SYM | \
-	LANDLOCK_ACCESS_FS_REFER)
+	LANDLOCK_ACCESS_FS_REFER | \
+	LANDLOCK_ACCESS_FS_MAKE_WHITEOUT)
 
 /* clang-format on */
 
-- 
2.54.0.563.g4f69b47b94-goog


^ permalink raw reply related
