* [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
@ 2025-03-10 12:06 Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 01/12] cma: Register dmem region for each cma region Maxime Ripard
` (14 more replies)
0 siblings, 15 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Hi,
Here's preliminary work to enable dmem tracking for heavy users of DMA
allocations on behalf of userspace: v4l2, DRM, and dma-buf heaps.
It's not really meant for inclusion at the moment, because I really
don't like it that much, and would like to discuss solutions on how to
make it nicer.
In particular, the dma dmem region accessors don't feel that great to
me. They duplicate the allocator selection logic of dma_alloc_attrs(),
and they look fragile and potentially buggy to me.
One solution I tried is to do the accounting in dma_alloc_attrs()
directly, depending on a flag being set, similar to what __GFP_ACCOUNT
is doing.
It didn't work because dmem initialises a state pointer when charging an
allocation to a region, and expects that state pointer to be passed back
when uncharging. Since dma_alloc_attrs() returns a void pointer to the
allocated buffer, we need to put that state into a higher-level
structure, such as drm_gem_object, or dma_buf.
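For reference, here's the shape consumers end up with in this series (a
minimal sketch using the dma_get_dmem_cgroup_region() accessor
introduced in patch 08; error paths trimmed, and my_buffer is made up
for the example):

    struct my_buffer {
            void *vaddr;
            dma_addr_t dma_addr;
            size_t size;
            struct dmem_cgroup_pool_state *pool;
    };

    static int my_buffer_alloc(struct device *dev, struct my_buffer *buf)
    {
            int ret;

            /* Charge first so we never hand out memory we can't account. */
            ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(dev),
                                         buf->size, &buf->pool, NULL);
            if (ret)
                    return ret;

            buf->vaddr = dma_alloc_attrs(dev, buf->size, &buf->dma_addr,
                                         GFP_KERNEL, 0);
            if (!buf->vaddr) {
                    dmem_cgroup_uncharge(buf->pool, buf->size);
                    return -ENOMEM;
            }

            return 0;
    }

    static void my_buffer_free(struct device *dev, struct my_buffer *buf)
    {
            dma_free_attrs(dev, buf->size, buf->vaddr, buf->dma_addr, 0);

            /* The pool state returned by try_charge has to come back here. */
            dmem_cgroup_uncharge(buf->pool, buf->size);
    }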
Since we can't share the region selection logic, we need to get the
region through some other means. Another thing I considered was to return
the region as part of the allocated buffer (through struct page or
folio), but those are lost across the calls and dma_alloc_attrs() will
only get a void pointer. So that's not doable without some heavy
rework, if it's a good idea at all.
So yeah, I went for the dumbest possible solution with the accessors,
hoping you could suggest a much smarter idea :)
Thanks,
Maxime
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
Maxime Ripard (12):
cma: Register dmem region for each cma region
cma: Provide accessor to cma dmem region
dma: coherent: Register dmem region for each coherent region
dma: coherent: Provide accessor to dmem region
dma: contiguous: Provide accessor to dmem region
dma: direct: Provide accessor to dmem region
dma: Create default dmem region for DMA allocations
dma: Provide accessor to dmem region
dma-buf: Clear cgroup accounting on release
dma-buf: cma: Account for allocations in dmem cgroup
drm/gem: Add cgroup memory accounting
media: videobuf2: Track buffer allocations through the dmem cgroup
drivers/dma-buf/dma-buf.c | 7 ++++
drivers/dma-buf/heaps/cma_heap.c | 18 ++++++++--
drivers/gpu/drm/drm_gem.c | 5 +++
drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++
.../media/common/videobuf2/videobuf2-dma-contig.c | 19 +++++++++++
include/drm/drm_device.h | 1 +
include/drm/drm_gem.h | 2 ++
include/linux/cma.h | 9 +++++
include/linux/dma-buf.h | 5 +++
include/linux/dma-direct.h | 2 ++
include/linux/dma-map-ops.h | 32 ++++++++++++++++++
include/linux/dma-mapping.h | 11 ++++++
kernel/dma/coherent.c | 26 +++++++++++++++
kernel/dma/direct.c | 8 +++++
kernel/dma/mapping.c | 39 ++++++++++++++++++++++
mm/cma.c | 21 +++++++++++-
mm/cma.h | 3 ++
17 files changed, 211 insertions(+), 3 deletions(-)
---
base-commit: 55a2aa61ba59c138bd956afe0376ec412a7004cf
change-id: 20250307-dmem-cgroups-73febced0989
Best regards,
--
Maxime Ripard <mripard@kernel.org>
* [PATCH RFC 01/12] cma: Register dmem region for each cma region
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 02/12] cma: Provide accessor to cma dmem region Maxime Ripard
` (13 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Now that the dmem cgroup has been merged, we need to create memory
regions for each allocator that devices might allocate DMA memory from.
Since CMA is one of these allocators, we need to create such a region.
CMA can deal with multiple regions though, so we'll need to create a
dmem region per CMA region.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
mm/cma.c | 14 +++++++++++++-
mm/cma.h | 3 +++
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/cma.c b/mm/cma.c
index de5bc0c81fc232bf82cd7ef22f6097059ab605e2..41a9ae907dcf69a73e963830d2c5f589dfc44f22 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -21,10 +21,11 @@
#include <linux/mm.h>
#include <linux/sizes.h>
#include <linux/slab.h>
#include <linux/log2.h>
#include <linux/cma.h>
+#include <linux/cgroup_dmem.h>
#include <linux/highmem.h>
#include <linux/io.h>
#include <linux/kmemleak.h>
#include <trace/events/cma.h>
@@ -89,16 +90,25 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
spin_unlock_irqrestore(&cma->lock, flags);
}
static void __init cma_activate_area(struct cma *cma)
{
+ struct dmem_cgroup_region *region;
unsigned long base_pfn = cma->base_pfn, pfn;
struct zone *zone;
+ region = dmem_cgroup_register_region(cma_get_size(cma), "cma/%s", cma->name);
+ if (IS_ERR(region))
+ goto out_error;
+
+#ifdef CONFIG_CGROUP_DMEM
+ cma->dmem_cgrp_region = region;
+#endif
+
cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL);
if (!cma->bitmap)
- goto out_error;
+ goto unreg_dmem;
/*
* alloc_contig_range() requires the pfn range specified to be in the
* same zone. Simplify by forcing the entire CMA resv range to be in the
* same zone.
@@ -124,10 +134,12 @@ static void __init cma_activate_area(struct cma *cma)
return;
not_in_zone:
bitmap_free(cma->bitmap);
+unreg_dmem:
+ dmem_cgroup_unregister_region(region);
out_error:
/* Expose all pages to the buddy, they are useless for CMA. */
if (!cma->reserve_pages_on_error) {
for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++)
free_reserved_page(pfn_to_page(pfn));
diff --git a/mm/cma.h b/mm/cma.h
index 8485ef893e99d8da5ee41eb03194b5b00ff088ba..e05d3eb7c173f3fe75ad7808968925c77d190c80 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -29,10 +29,13 @@ struct cma {
atomic64_t nr_pages_failed;
/* the number of CMA page released */
atomic64_t nr_pages_released;
/* kobject requires dynamic object */
struct cma_kobject *cma_kobj;
+#endif
+#ifdef CONFIG_CGROUP_DMEM
+ struct dmem_cgroup_region *dmem_cgrp_region;
#endif
bool reserve_pages_on_error;
};
extern struct cma cma_areas[MAX_CMA_AREAS];
--
2.48.1
* [PATCH RFC 02/12] cma: Provide accessor to cma dmem region
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 01/12] cma: Register dmem region for each cma region Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 03/12] dma: coherent: Register dmem region for each coherent region Maxime Ripard
` (12 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Consumers of the CMA API will have to know which CMA region their device
allocates from in order to charge the memory allocation to the right
one.
Let's provide an accessor for that region.
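A consumer is then expected to pair the accessor with a charge, along
these lines (a sketch of the pattern used later in this series, error
handling omitted):

    struct dmem_cgroup_pool_state *pool;
    int ret;

    ret = dmem_cgroup_try_charge(cma_get_dmem_cgroup_region(cma),
                                 size, &pool, NULL);
    ...
    dmem_cgroup_uncharge(pool, size);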
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
include/linux/cma.h | 9 +++++++++
mm/cma.c | 7 +++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/cma.h b/include/linux/cma.h
index d15b64f51336df18d17a4097e27961fd1ac8d79f..d7b2f13918e536aeb8bccebc1934d36f2f0b4cf4 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -66,6 +66,15 @@ static inline bool cma_free_folio(struct cma *cma, const struct folio *folio)
{
return false;
}
#endif
+#if IS_ENABLED(CONFIG_CGROUP_DMEM)
+struct dmem_cgroup_region *cma_get_dmem_cgroup_region(struct cma *cma);
+#else /* CONFIG_CGROUP_DMEM */
+static inline struct dmem_cgroup_region *cma_get_dmem_cgroup_region(struct cma *cma)
+{
+ return NULL;
+}
+#endif /* CONFIG_CGROUP_DMEM */
+
#endif
diff --git a/mm/cma.c b/mm/cma.c
index 41a9ae907dcf69a73e963830d2c5f589dfc44f22..4973a8c6bacb9d4924f4969be07757cf631304b8 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -49,10 +49,17 @@ unsigned long cma_get_size(const struct cma *cma)
const char *cma_get_name(const struct cma *cma)
{
return cma->name;
}
+#if IS_ENABLED(CONFIG_CGROUP_DMEM)
+struct dmem_cgroup_region *cma_get_dmem_cgroup_region(struct cma *cma)
+{
+ return cma->dmem_cgrp_region;
+}
+#endif /* CONFIG_CGROUP_DMEM */
+
static unsigned long cma_bitmap_aligned_mask(const struct cma *cma,
unsigned int align_order)
{
if (align_order <= cma->order_per_bit)
return 0;
--
2.48.1
* [PATCH RFC 03/12] dma: coherent: Register dmem region for each coherent region
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 01/12] cma: Register dmem region for each cma region Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 02/12] cma: Provide accessor to cma dmem region Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 04/12] dma: coherent: Provide accessor to dmem region Maxime Ripard
` (11 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
There can be several coherent memory regions in the system, and all of
them might end up being used to allocate a DMA buffer.
Let's register a dmem region for each of them to make sure we can track
those allocations.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
kernel/dma/coherent.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
index 3b2bdca9f1d4b0274bf4874892b94730cd05c5df..2a2d515e43acbdef19c14d8840ed90e48e7ebb43 100644
--- a/kernel/dma/coherent.c
+++ b/kernel/dma/coherent.c
@@ -5,10 +5,11 @@
*/
#include <linux/io.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/module.h>
+#include <linux/cgroup_dmem.h>
#include <linux/dma-direct.h>
#include <linux/dma-map-ops.h>
struct dma_coherent_mem {
void *virt_base;
@@ -16,10 +17,12 @@ struct dma_coherent_mem {
unsigned long pfn_base;
int size;
unsigned long *bitmap;
spinlock_t spinlock;
bool use_dev_dma_pfn_offset;
+
+ struct dmem_cgroup_region *dmem_cgroup_region;
};
static inline struct dma_coherent_mem *dev_get_coherent_memory(struct device *dev)
{
if (dev && dev->dma_mem)
@@ -335,16 +338,25 @@ static phys_addr_t dma_reserved_default_memory_size __initdata;
#endif
static int rmem_dma_device_init(struct reserved_mem *rmem, struct device *dev)
{
if (!rmem->priv) {
+ struct dmem_cgroup_region *region;
struct dma_coherent_mem *mem;
mem = dma_init_coherent_memory(rmem->base, rmem->base,
rmem->size, true);
if (IS_ERR(mem))
return PTR_ERR(mem);
+
+ region = dmem_cgroup_register_region(rmem->size,
+ "dma/coherent/%s",
+ rmem->name);
+ if (IS_ERR(region))
+ return PTR_ERR(region);
+
+ mem->dmem_cgroup_region = region;
rmem->priv = mem;
}
dma_assign_coherent_memory(dev, rmem->priv);
return 0;
}
--
2.48.1
* [PATCH RFC 04/12] dma: coherent: Provide accessor to dmem region
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (2 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 03/12] dma: coherent: Register dmem region for each coherent region Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 05/12] dma: contiguous: " Maxime Ripard
` (10 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Consumers of the coherent DMA API will have to know which coherent
region their device allocates from in order to charge the memory
allocation to the right one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
include/linux/dma-map-ops.h | 11 +++++++++++
kernel/dma/coherent.c | 14 ++++++++++++++
2 files changed, 25 insertions(+)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index e172522cd93651594607e16461fac56e4d67cbce..a2c10ed186efb6e08f64df0954b4d389589b6e35 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -199,10 +199,21 @@ static inline int dma_mmap_from_global_coherent(struct vm_area_struct *vma,
{
return 0;
}
#endif /* CONFIG_DMA_GLOBAL_POOL */
+#if IS_ENABLED(CONFIG_CGROUP_DMEM) && IS_ENABLED(CONFIG_DMA_DECLARE_COHERENT)
+struct dmem_cgroup_region *
+dma_coherent_get_dmem_cgroup_region(struct device *dev);
+#else /* CONFIG_CGROUP_DMEM && CONFIG_DMA_DECLARE_COHERENT */
+static inline struct dmem_cgroup_region *
+dma_coherent_get_dmem_cgroup_region(struct device *dev)
+{
+ return NULL;
+}
+#endif /* CONFIG_CGROUP_DMEM && CONFIG_DMA_DECLARE_COHERENT */
+
int dma_common_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
unsigned long attrs);
int dma_common_mmap(struct device *dev, struct vm_area_struct *vma,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
index 2a2d515e43acbdef19c14d8840ed90e48e7ebb43..74c5ff5105110487770c1b73812eefe8b3d7eb3c 100644
--- a/kernel/dma/coherent.c
+++ b/kernel/dma/coherent.c
@@ -28,10 +28,24 @@ static inline struct dma_coherent_mem *dev_get_coherent_memory(struct device *de
if (dev && dev->dma_mem)
return dev->dma_mem;
return NULL;
}
+#if IS_ENABLED(CONFIG_CGROUP_DMEM)
+struct dmem_cgroup_region *
+dma_coherent_get_dmem_cgroup_region(struct device *dev)
+{
+ struct dma_coherent_mem *mem;
+
+ mem = dev_get_coherent_memory(dev);
+ if (!mem)
+ return NULL;
+
+ return mem->dmem_cgroup_region;
+}
+#endif
+
static inline dma_addr_t dma_get_device_base(struct device *dev,
struct dma_coherent_mem * mem)
{
if (mem->use_dev_dma_pfn_offset)
return phys_to_dma(dev, PFN_PHYS(mem->pfn_base));
--
2.48.1
* [PATCH RFC 05/12] dma: contiguous: Provide accessor to dmem region
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (3 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 04/12] dma: coherent: Provide accessor to dmem region Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 06/12] dma: direct: " Maxime Ripard
` (9 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Consumers of the DMA contiguous API will have to know which region their
device allocates from in order to charge the memory allocation to the
right one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
include/linux/dma-map-ops.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index a2c10ed186efb6e08f64df0954b4d389589b6e35..bfc928d3bac37f3eece93d152abd57da513a1cc8 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -4,10 +4,11 @@
* It should not be included in drivers just using the DMA API.
*/
#ifndef _LINUX_DMA_MAP_OPS_H
#define _LINUX_DMA_MAP_OPS_H
+#include <linux/cma.h>
#include <linux/dma-mapping.h>
#include <linux/pgtable.h>
#include <linux/slab.h>
struct cma;
@@ -153,10 +154,30 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page,
{
__free_pages(page, get_order(size));
}
#endif /* CONFIG_DMA_CMA*/
+#if IS_ENABLED(CONFIG_DMA_CMA) && IS_ENABLED(CONFIG_CGROUP_DMEM)
+
+static inline struct dmem_cgroup_region *
+dma_contiguous_get_dmem_cgroup_region(struct device *dev)
+{
+ struct cma *cma = dev_get_cma_area(dev);
+
+ return cma_get_dmem_cgroup_region(cma);
+}
+
+#else /* IS_ENABLED(CONFIG_DMA_CMA) && IS_ENABLED(CONFIG_CGROUP_DMEM) */
+
+static inline struct dmem_cgroup_region *
+dma_contiguous_get_dmem_cgroup_region(struct device *dev)
+{
+ return NULL;
+}
+
+#endif /* IS_ENABLED(CONFIG_DMA_CMA) && IS_ENABLED(CONFIG_CGROUP_DMEM) */
+
#ifdef CONFIG_DMA_DECLARE_COHERENT
int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr,
dma_addr_t device_addr, size_t size);
void dma_release_coherent_memory(struct device *dev);
int dma_alloc_from_dev_coherent(struct device *dev, ssize_t size,
--
2.48.1
* [PATCH RFC 06/12] dma: direct: Provide accessor to dmem region
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (4 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 05/12] dma: contiguous: " Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 14:56 ` Robin Murphy
2025-03-10 12:06 ` [PATCH RFC 07/12] dma: Create default dmem region for DMA allocations Maxime Ripard
` (8 subsequent siblings)
14 siblings, 1 reply; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Consumers of the direct DMA API will have to know which region their
device allocates from in order to charge the memory allocation to the
right one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
include/linux/dma-direct.h | 2 ++
kernel/dma/direct.c | 8 ++++++++
2 files changed, 10 insertions(+)
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index d7e30d4f7503a898a456df8eedf6a2cd284c35ff..2dd7cbccfaeed81c18c67aae877417fe89f2f2f5 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -145,6 +145,8 @@ void dma_direct_free_pages(struct device *dev, size_t size,
enum dma_data_direction dir);
int dma_direct_supported(struct device *dev, u64 mask);
dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
size_t size, enum dma_data_direction dir, unsigned long attrs);
+struct dmem_cgroup_region *dma_direct_get_dmem_cgroup_region(struct device *dev);
+
#endif /* _LINUX_DMA_DIRECT_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5b4e6d3bf7bcca8930877ba078aed4ce26828f06..ece1361077b6efeec5b202d838750afd967d473f 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -42,10 +42,18 @@ u64 dma_direct_get_required_mask(struct device *dev)
u64 max_dma = phys_to_dma_direct(dev, phys);
return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
}
+#if IS_ENABLED(CONFIG_CGROUP_DMEM)
+struct dmem_cgroup_region *
+dma_direct_get_dmem_cgroup_region(struct device *dev)
+{
+ return dma_contiguous_get_dmem_cgroup_region(dev);
+}
+#endif
+
static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
{
u64 dma_limit = min_not_zero(
dev->coherent_dma_mask,
dev->bus_dma_limit);
--
2.48.1
* [PATCH RFC 07/12] dma: Create default dmem region for DMA allocations
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (5 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 06/12] dma: direct: " Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 08/12] dma: Provide accessor to dmem region Maxime Ripard
` (7 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Some DMA allocations are not going to be performed through dedicated
sub-allocators but through the default path. We need to create a default
region to track those as well.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
kernel/dma/mapping.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index cda127027e48a757f2d9fb04a49249d2b0238871..7bc3957512fd84e0bf3a89c210338be72457b5c9 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -5,10 +5,11 @@
* Copyright (c) 2006 SUSE Linux Products GmbH
* Copyright (c) 2006 Tejun Heo <teheo@suse.de>
*/
#include <linux/memblock.h> /* for max_pfn */
#include <linux/acpi.h>
+#include <linux/cgroup_dmem.h>
#include <linux/dma-map-ops.h>
#include <linux/export.h>
#include <linux/gfp.h>
#include <linux/iommu-dma.h>
#include <linux/kmsan.h>
@@ -25,10 +26,14 @@
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
bool dma_default_coherent = IS_ENABLED(CONFIG_ARCH_DMA_DEFAULT_COHERENT);
#endif
+#if IS_ENABLED(CONFIG_CGROUP_DMEM)
+static struct dmem_cgroup_region *default_dmem_cgroup_region;
+#endif
+
/*
* Managed DMA API
*/
struct dma_devres {
size_t size;
@@ -587,10 +592,28 @@ u64 dma_get_required_mask(struct device *dev)
*/
return DMA_BIT_MASK(32);
}
EXPORT_SYMBOL_GPL(dma_get_required_mask);
+#if IS_ENABLED(CONFIG_CGROUP_DMEM)
+static int __init dma_init_dmem_cgroup(void)
+{
+ struct dmem_cgroup_region *region;
+
+ if (default_dmem_cgroup_region)
+ return -EBUSY;
+
+ region = dmem_cgroup_register_region(U64_MAX, "dma/global");
+ if (IS_ERR(region))
+ return PTR_ERR(region);
+
+ default_dmem_cgroup_region = region;
+ return 0;
+}
+core_initcall(dma_init_dmem_cgroup);
+#endif
+
void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
gfp_t flag, unsigned long attrs)
{
const struct dma_map_ops *ops = get_dma_ops(dev);
void *cpu_addr;
--
2.48.1
* [PATCH RFC 08/12] dma: Provide accessor to dmem region
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (6 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 07/12] dma: Create default dmem region for DMA allocations Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 09/12] dma-buf: Clear cgroup accounting on release Maxime Ripard
` (6 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Consumers of the DMA API will have to know which DMA region their device
allocates from in order to charge the memory allocation to the right
one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
include/linux/dma-mapping.h | 11 +++++++++++
kernel/dma/mapping.c | 16 ++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index b79925b1c4333ce25e66c57d8ac1dae5c7b7fe37..75f5ca1d11a6297720742cea1359c7f28c23d741 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -351,10 +351,21 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
{
return false;
}
#endif /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
+#if IS_ENABLED(CONFIG_HAS_DMA) && IS_ENABLED(CONFIG_CGROUP_DMEM)
+struct dmem_cgroup_region *
+dma_get_dmem_cgroup_region(struct device *dev);
+#else
+static inline struct dmem_cgroup_region *
+dma_get_dmem_cgroup_region(struct device *dev)
+{
+ return NULL;
+}
+#endif
+
struct page *dma_alloc_pages(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp);
void dma_free_pages(struct device *dev, size_t size, struct page *page,
dma_addr_t dma_handle, enum dma_data_direction dir);
int dma_mmap_pages(struct device *dev, struct vm_area_struct *vma,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 7bc3957512fd84e0bf3a89c210338be72457b5c9..e45d63700183acb03c779f969ae33803dcf5cf1b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -608,10 +608,26 @@ static int __init dma_init_dmem_cgroup(void)
default_dmem_cgroup_region = region;
return 0;
}
core_initcall(dma_init_dmem_cgroup);
+
+struct dmem_cgroup_region *
+dma_get_dmem_cgroup_region(struct device *dev)
+{
+ struct dmem_cgroup_region *region;
+
+ region = dma_coherent_get_dmem_cgroup_region(dev);
+ if (region)
+ return region;
+
+ if (dma_alloc_direct(dev, get_dma_ops(dev)))
+ return dma_direct_get_dmem_cgroup_region(dev);
+
+ return default_dmem_cgroup_region;
+}
+EXPORT_SYMBOL(dma_get_dmem_cgroup_region);
#endif
void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
gfp_t flag, unsigned long attrs)
{
--
2.48.1
* [PATCH RFC 09/12] dma-buf: Clear cgroup accounting on release
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (7 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 08/12] dma: Provide accessor to dmem region Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 10/12] dma-buf: cma: Account for allocations in dmem cgroup Maxime Ripard
` (5 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
In order to clean things up when dma-heaps allocate and register
buffers in the dmem cgroup, let's uncharge a released buffer from any
(optional) cgroup controller.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
drivers/dma-buf/dma-buf.c | 7 +++++++
include/linux/dma-buf.h | 5 +++++
2 files changed, 12 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 5baa83b855156516a0a766bee0789b122473efb3..a95eef17f193454b018dc8177ddfd434d7b64473 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -11,10 +11,11 @@
* refining of this idea.
*/
#include <linux/fs.h>
#include <linux/slab.h>
+#include <linux/cgroup_dmem.h>
#include <linux/dma-buf.h>
#include <linux/dma-fence.h>
#include <linux/dma-fence-unwrap.h>
#include <linux/anon_inodes.h>
#include <linux/export.h>
@@ -97,10 +98,16 @@ static void dma_buf_release(struct dentry *dentry)
* * dmabuf->cb_in/out.active are non-0 despite no pending fence callback
*/
BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
dma_buf_stats_teardown(dmabuf);
+
+#ifdef CONFIG_CGROUP_DMEM
+ if (dmabuf->cgroup_pool)
+ dmem_cgroup_uncharge(dmabuf->cgroup_pool, dmabuf->size);
+#endif
+
dmabuf->ops->release(dmabuf);
if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
dma_resv_fini(dmabuf->resv);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 36216d28d8bdc01a9c9c47e27c392413f7f6c5fb..111ca5a738ae0a816ba1551313dfb0a958720b6c 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -437,10 +437,15 @@ struct dma_buf {
struct dma_fence_cb cb;
wait_queue_head_t *poll;
__poll_t active;
} cb_in, cb_out;
+
+#ifdef CONFIG_CGROUP_DMEM
+ struct dmem_cgroup_pool_state *cgroup_pool;
+#endif
+
#ifdef CONFIG_DMABUF_SYSFS_STATS
/**
* @sysfs_entry:
*
* For exposing information about this buffer in sysfs. See also
--
2.48.1
* [PATCH RFC 10/12] dma-buf: cma: Account for allocations in dmem cgroup
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (8 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 09/12] dma-buf: Clear cgroup accounting on release Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 11/12] drm/gem: Add cgroup memory accounting Maxime Ripard
` (4 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
Now that we have a dmem region per CMA region, we can track the
allocations of the CMA heap through dmem.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
drivers/dma-buf/heaps/cma_heap.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
index 9512d050563a9ad0a735230c4870c3d3b3b01b25..4951c41db3ba0cbd903b6d62787f51b83f4a1e7e 100644
--- a/drivers/dma-buf/heaps/cma_heap.c
+++ b/drivers/dma-buf/heaps/cma_heap.c
@@ -7,10 +7,11 @@
*
* Also utilizing parts of Andrew Davis' SRAM heap:
* Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com/
* Andrew F. Davis <afd@ti.com>
*/
+#include <linux/cgroup_dmem.h>
#include <linux/cma.h>
#include <linux/dma-buf.h>
#include <linux/dma-heap.h>
#include <linux/dma-map-ops.h>
#include <linux/err.h>
@@ -276,23 +277,31 @@ static struct dma_buf *cma_heap_allocate(struct dma_heap *heap,
unsigned long len,
u32 fd_flags,
u64 heap_flags)
{
struct cma_heap *cma_heap = dma_heap_get_drvdata(heap);
+ struct dmem_cgroup_pool_state *pool;
struct cma_heap_buffer *buffer;
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
size_t size = PAGE_ALIGN(len);
pgoff_t pagecount = size >> PAGE_SHIFT;
unsigned long align = get_order(size);
struct page *cma_pages;
struct dma_buf *dmabuf;
int ret = -ENOMEM;
pgoff_t pg;
+ ret = dmem_cgroup_try_charge(cma_get_dmem_cgroup_region(cma_heap->cma),
+ size, &pool, NULL);
+ if (ret)
+ return ERR_PTR(ret);
+
buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
- if (!buffer)
- return ERR_PTR(-ENOMEM);
+ if (!buffer) {
+ ret = -ENOMEM;
+ goto uncharge_cgroup;
+ }
INIT_LIST_HEAD(&buffer->attachments);
mutex_init(&buffer->lock);
buffer->len = size;
@@ -348,18 +357,23 @@ static struct dma_buf *cma_heap_allocate(struct dma_heap *heap,
dmabuf = dma_buf_export(&exp_info);
if (IS_ERR(dmabuf)) {
ret = PTR_ERR(dmabuf);
goto free_pages;
}
+
+ dmabuf->cgroup_pool = pool;
+
return dmabuf;
free_pages:
kfree(buffer->pages);
free_cma:
cma_release(cma_heap->cma, cma_pages, pagecount);
free_buffer:
kfree(buffer);
+uncharge_cgroup:
+ dmem_cgroup_uncharge(pool, size);
return ERR_PTR(ret);
}
static const struct dma_heap_ops cma_heap_ops = {
--
2.48.1
* [PATCH RFC 11/12] drm/gem: Add cgroup memory accounting
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (9 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 10/12] dma-buf: cma: Account for allocations in dmem cgroup Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 15:06 ` Robin Murphy
2025-03-10 12:06 ` [PATCH RFC 12/12] media: videobuf2: Track buffer allocations through the dmem cgroup Maxime Ripard
` (3 subsequent siblings)
14 siblings, 1 reply; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
In order to support any device using the GEM helpers, let's charge any
GEM DMA allocation to the dmem cgroup.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
drivers/gpu/drm/drm_gem.c | 5 +++++
drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++++
include/drm/drm_device.h | 1 +
include/drm/drm_gem.h | 2 ++
4 files changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index ee811764c3df4b4e9c377a66afd4967512ba2001..e04733cb49353cf3ff9672d883b106a083f80d86 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -108,10 +108,11 @@ drm_gem_init(struct drm_device *dev)
dev->vma_offset_manager = vma_offset_manager;
drm_vma_offset_manager_init(vma_offset_manager,
DRM_FILE_PAGE_OFFSET_START,
DRM_FILE_PAGE_OFFSET_SIZE);
+
return drmm_add_action(dev, drm_gem_init_release, NULL);
}
/**
* drm_gem_object_init_with_mnt - initialize an allocated shmem-backed GEM
@@ -973,10 +974,14 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
* drm_gem_object_init().
*/
void
drm_gem_object_release(struct drm_gem_object *obj)
{
+
+ if (obj->cgroup_pool_state)
+ dmem_cgroup_uncharge(obj->cgroup_pool_state, obj->size);
+
if (obj->filp)
fput(obj->filp);
drm_gem_private_object_fini(obj);
diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c
index 16988d316a6dc702310fa44c15c92dc67b82802b..6236feb67ddd6338f0f597a0606377e0352ca6ed 100644
--- a/drivers/gpu/drm/drm_gem_dma_helper.c
+++ b/drivers/gpu/drm/drm_gem_dma_helper.c
@@ -104,10 +104,16 @@ __drm_gem_dma_create(struct drm_device *drm, size_t size, bool private)
if (ret) {
drm_gem_object_release(gem_obj);
goto error;
}
+ ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(drm->dev),
+ size,
+ &dma_obj->base.cgroup_pool_state, NULL);
+ if (ret)
+ goto error;
+
return dma_obj;
error:
kfree(dma_obj);
return ERR_PTR(ret);
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d7a499917eb4aeb6ca8350f856eb3..58987f39ba8718eb768f6261fb0a1fbf16b38549 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -1,8 +1,9 @@
#ifndef _DRM_DEVICE_H_
#define _DRM_DEVICE_H_
+#include <linux/cgroup_dmem.h>
#include <linux/list.h>
#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/idr.h>
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index fdae947682cd0b7b06db5e35e120f049a0f30179..95fe8ed48a26204020bb47d6074689829c410465 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -430,10 +430,12 @@ struct drm_gem_object {
* @lru:
*
* The current LRU list that the GEM object is on.
*/
struct drm_gem_lru *lru;
+
+ struct dmem_cgroup_pool_state *cgroup_pool_state;
};
/**
* DRM_GEM_FOPS - Default drm GEM file operations
*
--
2.48.1
* [PATCH RFC 12/12] media: videobuf2: Track buffer allocations through the dmem cgroup
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (10 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 11/12] drm/gem: Add cgroup memory accounting Maxime Ripard
@ 2025-03-10 12:06 ` Maxime Ripard
2025-03-10 12:15 ` [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (2 subsequent siblings)
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:06 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig, Maxime Ripard
The dmem cgroup allows tracking of any DMA memory allocation made on
behalf of userspace. Let's charge our allocations in videobuf2 to enable
proper memory tracking.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
drivers/media/common/videobuf2/videobuf2-dma-contig.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c
index a13ec569c82f6da2d977222b94af32e74c6c6c82..48384e18030812f4f89f1c225c38def2ac6aa3ca 100644
--- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c
+++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c
@@ -8,10 +8,11 @@
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation.
*/
+#include <linux/cgroup_dmem.h>
#include <linux/dma-buf.h>
#include <linux/module.h>
#include <linux/refcount.h>
#include <linux/scatterlist.h>
#include <linux/sched.h>
@@ -40,10 +41,14 @@ struct vb2_dc_buf {
struct sg_table *sgt_base;
/* DMABUF related */
struct dma_buf_attachment *db_attach;
+#ifdef CONFIG_CGROUP_DMEM
+ struct dmem_cgroup_pool_state *cgroup_pool_state;
+#endif
+
struct vb2_buffer *vb;
bool non_coherent_mem;
};
/*********************************************/
@@ -169,10 +174,14 @@ static void vb2_dc_put(void *buf_priv)
struct vb2_dc_buf *buf = buf_priv;
if (!refcount_dec_and_test(&buf->refcount))
return;
+#ifdef CONFIG_CGROUP_DMEM
+ dmem_cgroup_uncharge(buf->cgroup_pool_state, buf->size);
+#endif
+
if (buf->non_coherent_mem) {
if (buf->vaddr)
dma_vunmap_noncontiguous(buf->dev, buf->vaddr);
dma_free_noncontiguous(buf->dev, buf->size,
buf->dma_sgt, buf->dma_dir);
@@ -230,10 +239,11 @@ static int vb2_dc_alloc_non_coherent(struct vb2_dc_buf *buf)
static void *vb2_dc_alloc(struct vb2_buffer *vb,
struct device *dev,
unsigned long size)
{
+ struct dmem_cgroup_pool_state *pool;
struct vb2_dc_buf *buf;
int ret;
if (WARN_ON(!dev))
return ERR_PTR(-EINVAL);
@@ -249,25 +259,34 @@ static void *vb2_dc_alloc(struct vb2_buffer *vb,
buf->size = size;
/* Prevent the device from being released while the buffer is used */
buf->dev = get_device(dev);
+ ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(dev), size, &pool, NULL);
+ if (ret)
+ return ERR_PTR(ret);
+
if (buf->non_coherent_mem)
ret = vb2_dc_alloc_non_coherent(buf);
else
ret = vb2_dc_alloc_coherent(buf);
if (ret) {
dev_err(dev, "dma alloc of size %lu failed\n", size);
+ dmem_cgroup_uncharge(pool, size);
kfree(buf);
return ERR_PTR(-ENOMEM);
}
buf->handler.refcount = &buf->refcount;
buf->handler.put = vb2_dc_put;
buf->handler.arg = buf;
+#ifdef CONFIG_CGROUP_DMEM
+ buf->cgroup_pool_state = pool;
+#endif
+
refcount_set(&buf->refcount, 1);
return buf;
}
--
2.48.1
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (11 preceding siblings ...)
2025-03-10 12:06 ` [PATCH RFC 12/12] media: videobuf2: Track buffer allocations through the dmem cgroup Maxime Ripard
@ 2025-03-10 12:15 ` Maxime Ripard
2025-03-10 14:16 ` Christian König
2025-04-03 8:27 ` Simona Vetter
14 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 12:15 UTC (permalink / raw)
To: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
On Mon, Mar 10, 2025 at 01:06:06PM +0100, Maxime Ripard wrote:
> Here's preliminary work to enable dmem tracking for heavy users of DMA
> allocations on behalf of userspace: v4l2, DRM, and dma-buf heaps.
>
> It's not really meant for inclusion at the moment, because I really
> don't like it that much, and would like to discuss solutions on how to
> make it nicer.
>
> In particular, the dma dmem region accessors don't feel that great to
> me. They duplicate the allocator selection logic of dma_alloc_attrs(),
> and they look fragile and potentially buggy to me.
>
> One solution I tried is to do the accounting in dma_alloc_attrs()
> directly, depending on a flag being set, similar to what __GFP_ACCOUNT
> is doing.
>
> It didn't work because dmem initialises a state pointer when charging an
> allocation to a region, and expects that state pointer to be passed back
> when uncharging. Since dma_alloc_attrs() returns a void pointer to the
> allocated buffer, we need to put that state into a higher-level
> structure, such as drm_gem_object, or dma_buf.
>
> Since we can't share the region selection logic, we need to get the
> region through some other means. Another thing I considered was to return
> the region as part of the allocated buffer (through struct page or
> folio), but those are lost across the calls and dma_alloc_attrs() will
> only get a void pointer. So that's not doable without some heavy
> rework, if it's a good idea at all.
One thing I forgot to mention is that it makes it harder than it could
be for subsystems that can allocate from multiple allocators (like... all
the ones included in this series at least).
I only added proper tracking in the backends using dma_alloc_attrs(),
but they also support vmalloc. In what region vmalloc allocations should
be tracked (if any) is an open question to me. Similarly, some use
dma_alloc_noncontiguous().
Also, I've set the size of the "default" DMA allocation region to
U64_MAX, but that's obviously wrong and will break any relative metric.
I'm not sure what would be the correct size though.
Maxime
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (12 preceding siblings ...)
2025-03-10 12:15 ` [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
@ 2025-03-10 14:16 ` Christian König
2025-03-10 14:26 ` Maxime Ripard
2025-04-03 8:27 ` Simona Vetter
14 siblings, 1 reply; 31+ messages in thread
From: Christian König @ 2025-03-10 14:16 UTC (permalink / raw)
To: Maxime Ripard, Andrew Morton, Marek Szyprowski, Robin Murphy,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann, David Airlie,
Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
[Adding Ben since we are currently in the middle of a discussion regarding exactly that problem]
Just for my understanding before I deep dive into the code: This uses a separate dmem cgroup and does not account against memcg, right?
Thanks,
Christian.
On 10.03.25 at 13:06, Maxime Ripard wrote:
> Hi,
>
> Here's preliminary work to enable dmem tracking for heavy users of DMA
> allocations on behalf of userspace: v4l2, DRM, and dma-buf heaps.
>
> It's not really meant for inclusion at the moment, because I really
> don't like it that much, and would like to discuss solutions on how to
> make it nicer.
>
> In particular, the dma dmem region accessors don't feel that great to
> me. They duplicate the allocator selection logic of dma_alloc_attrs(),
> and they look fragile and potentially buggy to me.
>
> One solution I tried is to do the accounting in dma_alloc_attrs()
> directly, depending on a flag being set, similar to what __GFP_ACCOUNT
> is doing.
>
> It didn't work because dmem initialises a state pointer when charging an
> allocation to a region, and expects that state pointer to be passed back
> when uncharging. Since dma_alloc_attrs() returns a void pointer to the
> allocated buffer, we need to put that state into a higher-level
> structure, such as drm_gem_object, or dma_buf.
>
> Since we can't share the region selection logic, we need to get the
> region through some other means. Another thing I considered was to return
> the region as part of the allocated buffer (through struct page or
> folio), but those are lost across the calls and dma_alloc_attrs() will
> only get a void pointer. So that's not doable without some heavy
> rework, if it's a good idea at all.
>
> So yeah, I went for the dumbest possible solution with the accessors,
> hoping you could suggest a much smarter idea :)
>
> Thanks,
> Maxime
>
> Signed-off-by: Maxime Ripard <mripard@kernel.org>
> ---
> Maxime Ripard (12):
> cma: Register dmem region for each cma region
> cma: Provide accessor to cma dmem region
> dma: coherent: Register dmem region for each coherent region
> dma: coherent: Provide accessor to dmem region
> dma: contiguous: Provide accessor to dmem region
> dma: direct: Provide accessor to dmem region
> dma: Create default dmem region for DMA allocations
> dma: Provide accessor to dmem region
> dma-buf: Clear cgroup accounting on release
> dma-buf: cma: Account for allocations in dmem cgroup
> drm/gem: Add cgroup memory accounting
> media: videobuf2: Track buffer allocations through the dmem cgroup
>
> drivers/dma-buf/dma-buf.c | 7 ++++
> drivers/dma-buf/heaps/cma_heap.c | 18 ++++++++--
> drivers/gpu/drm/drm_gem.c | 5 +++
> drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++
> .../media/common/videobuf2/videobuf2-dma-contig.c | 19 +++++++++++
> include/drm/drm_device.h | 1 +
> include/drm/drm_gem.h | 2 ++
> include/linux/cma.h | 9 +++++
> include/linux/dma-buf.h | 5 +++
> include/linux/dma-direct.h | 2 ++
> include/linux/dma-map-ops.h | 32 ++++++++++++++++++
> include/linux/dma-mapping.h | 11 ++++++
> kernel/dma/coherent.c | 26 +++++++++++++++
> kernel/dma/direct.c | 8 +++++
> kernel/dma/mapping.c | 39 ++++++++++++++++++++++
> mm/cma.c | 21 +++++++++++-
> mm/cma.h | 3 ++
> 17 files changed, 211 insertions(+), 3 deletions(-)
> ---
> base-commit: 55a2aa61ba59c138bd956afe0376ec412a7004cf
> change-id: 20250307-dmem-cgroups-73febced0989
>
> Best regards,
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-03-10 14:16 ` Christian König
@ 2025-03-10 14:26 ` Maxime Ripard
2025-03-31 20:43 ` Dave Airlie
0 siblings, 1 reply; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 14:26 UTC (permalink / raw)
To: Christian König
Cc: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
Maarten Lankhorst, Thomas Zimmermann, David Airlie, Simona Vetter,
Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
Hi,
On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
> [Adding Ben since we are currently in the middle of a discussion
> regarding exactly that problem]
>
> Just for my understanding before I deep dive into the code: This uses
> a separate dmem cgroup and does not account against memcg, don't it?
Yes. The main rationale being that it doesn't always make sense to
register against memcg: a lot of devices are going to allocate from
dedicated chunks of memory that are either carved out from the main
memory allocator, or not under Linux supervision at all.
And if there's no way to make it consistent across drivers, it's not the
right tool.
Maxime
* Re: [PATCH RFC 06/12] dma: direct: Provide accessor to dmem region
2025-03-10 12:06 ` [PATCH RFC 06/12] dma: direct: " Maxime Ripard
@ 2025-03-10 14:56 ` Robin Murphy
2025-03-10 16:28 ` Maxime Ripard
0 siblings, 1 reply; 31+ messages in thread
From: Robin Murphy @ 2025-03-10 14:56 UTC (permalink / raw)
To: Maxime Ripard, Andrew Morton, Marek Szyprowski, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
On 2025-03-10 12:06 pm, Maxime Ripard wrote:
> Consumers of the direct DMA API will have to know which region their
> device allocates from in order to charge the memory allocation to the
> right one.
This doesn't seem to make much sense - dma-direct is not an allocator
itself, it just provides the high-level
dma_alloc_attrs/dma_alloc_pages/etc. interfaces wherein the underlying
allocations _could_ come from CMA, but also a per-device
coherent/restricted pool, or a global coherent/atomic pool, or the
regular page allocator, or in one weird corner case the SWIOTLB buffer,
or...
Thanks,
Robin.
> Let's provide an accessor for that region.
>
> Signed-off-by: Maxime Ripard <mripard@kernel.org>
> ---
> include/linux/dma-direct.h | 2 ++
> kernel/dma/direct.c | 8 ++++++++
> 2 files changed, 10 insertions(+)
>
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index d7e30d4f7503a898a456df8eedf6a2cd284c35ff..2dd7cbccfaeed81c18c67aae877417fe89f2f2f5 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -145,6 +145,8 @@ void dma_direct_free_pages(struct device *dev, size_t size,
> enum dma_data_direction dir);
> int dma_direct_supported(struct device *dev, u64 mask);
> dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
> size_t size, enum dma_data_direction dir, unsigned long attrs);
>
> +struct dmem_cgroup_region *dma_direct_get_dmem_cgroup_region(struct device *dev);
> +
> #endif /* _LINUX_DMA_DIRECT_H */
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 5b4e6d3bf7bcca8930877ba078aed4ce26828f06..ece1361077b6efeec5b202d838750afd967d473f 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -42,10 +42,18 @@ u64 dma_direct_get_required_mask(struct device *dev)
> u64 max_dma = phys_to_dma_direct(dev, phys);
>
> return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
> }
>
> +#if IS_ENABLED(CONFIG_CGROUP_DMEM)
> +struct dmem_cgroup_region *
> +dma_direct_get_dmem_cgroup_region(struct device *dev)
> +{
> + return dma_contiguous_get_dmem_cgroup_region(dev);
> +}
> +#endif
> +
> static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
> {
> u64 dma_limit = min_not_zero(
> dev->coherent_dma_mask,
> dev->bus_dma_limit);
>
* Re: [PATCH RFC 11/12] drm/gem: Add cgroup memory accounting
2025-03-10 12:06 ` [PATCH RFC 11/12] drm/gem: Add cgroup memory accounting Maxime Ripard
@ 2025-03-10 15:06 ` Robin Murphy
0 siblings, 0 replies; 31+ messages in thread
From: Robin Murphy @ 2025-03-10 15:06 UTC (permalink / raw)
To: Maxime Ripard, Andrew Morton, Marek Szyprowski, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab
Cc: Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
On 2025-03-10 12:06 pm, Maxime Ripard wrote:
> In order to support any device using the GEM helpers, let's charge any
> GEM DMA allocation to the dmem cgroup.
>
> Signed-off-by: Maxime Ripard <mripard@kernel.org>
> ---
> drivers/gpu/drm/drm_gem.c | 5 +++++
> drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++++
> include/drm/drm_device.h | 1 +
> include/drm/drm_gem.h | 2 ++
> 4 files changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index ee811764c3df4b4e9c377a66afd4967512ba2001..e04733cb49353cf3ff9672d883b106a083f80d86 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -108,10 +108,11 @@ drm_gem_init(struct drm_device *dev)
> dev->vma_offset_manager = vma_offset_manager;
> drm_vma_offset_manager_init(vma_offset_manager,
> DRM_FILE_PAGE_OFFSET_START,
> DRM_FILE_PAGE_OFFSET_SIZE);
>
> +
> return drmm_add_action(dev, drm_gem_init_release, NULL);
> }
>
> /**
> * drm_gem_object_init_with_mnt - initialize an allocated shmem-backed GEM
> @@ -973,10 +974,14 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
> * drm_gem_object_init().
> */
> void
> drm_gem_object_release(struct drm_gem_object *obj)
> {
> +
> + if (obj->cgroup_pool_state)
> + dmem_cgroup_uncharge(obj->cgroup_pool_state, obj->size);
> +
> if (obj->filp)
> fput(obj->filp);
>
> drm_gem_private_object_fini(obj);
>
> diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c
> index 16988d316a6dc702310fa44c15c92dc67b82802b..6236feb67ddd6338f0f597a0606377e0352ca6ed 100644
> --- a/drivers/gpu/drm/drm_gem_dma_helper.c
> +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
> @@ -104,10 +104,16 @@ __drm_gem_dma_create(struct drm_device *drm, size_t size, bool private)
> if (ret) {
> drm_gem_object_release(gem_obj);
> goto error;
> }
>
> + ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(drm->dev),
> + size,
> + &dma_obj->base.cgroup_pool_state, NULL);
> + if (ret)
> + goto error;
Doesn't that miss cleaning up gem_obj? However, surely you want the
accounting before the allocation anyway, like in the other cases.
Otherwise userspace is still able to allocate massive amounts of memory
and incur some of the associated side-effects of that, it just doesn't
get to keep said memory for very long :)
Thanks,
Robin.
> +
> return dma_obj;
>
> error:
> kfree(dma_obj);
> return ERR_PTR(ret);
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d7a499917eb4aeb6ca8350f856eb3..58987f39ba8718eb768f6261fb0a1fbf16b38549 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -1,8 +1,9 @@
> #ifndef _DRM_DEVICE_H_
> #define _DRM_DEVICE_H_
>
> +#include <linux/cgroup_dmem.h>
> #include <linux/list.h>
> #include <linux/kref.h>
> #include <linux/mutex.h>
> #include <linux/idr.h>
>
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index fdae947682cd0b7b06db5e35e120f049a0f30179..95fe8ed48a26204020bb47d6074689829c410465 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -430,10 +430,12 @@ struct drm_gem_object {
> * @lru:
> *
> * The current LRU list that the GEM object is on.
> */
> struct drm_gem_lru *lru;
> +
> + struct dmem_cgroup_pool_state *cgroup_pool_state;
> };
>
> /**
> * DRM_GEM_FOPS - Default drm GEM file operations
> *
>
* Re: [PATCH RFC 06/12] dma: direct: Provide accessor to dmem region
2025-03-10 14:56 ` Robin Murphy
@ 2025-03-10 16:28 ` Maxime Ripard
2025-03-10 18:44 ` Robin Murphy
0 siblings, 1 reply; 31+ messages in thread
From: Maxime Ripard @ 2025-03-10 16:28 UTC (permalink / raw)
To: Robin Murphy
Cc: Andrew Morton, Marek Szyprowski, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab,
Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
On Mon, Mar 10, 2025 at 02:56:37PM +0000, Robin Murphy wrote:
> On 2025-03-10 12:06 pm, Maxime Ripard wrote:
> > Consumers of the direct DMA API will have to know which region their
> > device allocates from in order for them to charge the memory allocation
> > in the right one.
>
> This doesn't seem to make much sense - dma-direct is not an allocator
> itself, it just provides the high-level dma_alloc_attrs/dma_alloc_pages/etc.
> interfaces wherein the underlying allocations _could_ come from CMA, but
> also a per-device coherent/restricted pool, or a global coherent/atomic
> pool, or the regular page allocator, or in one weird corner case the SWIOTLB
> buffer, or...
I guess it wasn't super clear, but what I meant is that it's an
allocator to the consumer: it gets called, and returns a buffer. How it
does so is transparent to the device, and on the other side of the
abstraction.
I do agree that the logic is complicated to follow, and that's what I
was getting at in the cover letter.
Maxime
* Re: [PATCH RFC 06/12] dma: direct: Provide accessor to dmem region
2025-03-10 16:28 ` Maxime Ripard
@ 2025-03-10 18:44 ` Robin Murphy
2025-03-13 18:16 ` Maxime Ripard
0 siblings, 1 reply; 31+ messages in thread
From: Robin Murphy @ 2025-03-10 18:44 UTC (permalink / raw)
To: Maxime Ripard
Cc: Andrew Morton, Marek Szyprowski, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab,
Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
On 2025-03-10 4:28 pm, Maxime Ripard wrote:
> On Mon, Mar 10, 2025 at 02:56:37PM +0000, Robin Murphy wrote:
>> On 2025-03-10 12:06 pm, Maxime Ripard wrote:
>>> Consumers of the direct DMA API will have to know which region their
>>> device allocates from in order for them to charge the memory allocation
>>> in the right one.
>>
>> This doesn't seem to make much sense - dma-direct is not an allocator
>> itself, it just provides the high-level dma_alloc_attrs/dma_alloc_pages/etc.
>> interfaces wherein the underlying allocations _could_ come from CMA, but
>> also a per-device coherent/restricted pool, or a global coherent/atomic
>> pool, or the regular page allocator, or in one weird corner case the SWIOTLB
>> buffer, or...
>
> I guess it wasn't super clear, but what I meant is that it's an
> allocator to the consumer: it gets called, and returns a buffer. How it
> does so is transparent to the device, and on the other side of the
> abstraction.
>
> I do agree that the logic is complicated to follow, and that's what I
> was getting at in the cover letter.
Right, but ultimately my point is that when we later end up with:
struct dmem_cgroup_region *
dma_get_dmem_cgroup_region(struct device *dev)
{
	if (dma_alloc_direct(dev, get_dma_ops(dev)))
		return dma_direct_get_dmem_cgroup_region(dev);

	return dma_contiguous_get_dmem_cgroup_region(dev);
it's objectively wrong given what dma_alloc_direct() means in context:
void *dma_alloc_attrs(...)
{
	if (dma_alloc_direct(dev, ops))
		cpu_addr = dma_direct_alloc(...);
where dma_direct_alloc() may then use at least 5 different allocation
methods, only one of which is CMA. Accounting things which are not CMA
to CMA seems to thoroughly defeat the purpose of having such
fine-grained accounting at all.
This is why the very notion of "consumers of dma-direct" should
fundamentally not be a thing IMO. Drivers consume the DMA API
interfaces, and the DMA API ultimately consumes various memory
allocators, but what happens in between is nobody else's business;
dma-direct happens to represent *some* paths between the two, but there
are plenty more paths to the same (and different) allocators through
other DMA API implementations as well. Which route a particular call
takes to end up at a particular allocator is not meaningful unless you
are the DMA ops dispatch code.
Or to put it another way, to even go for the "dumbest possible correct
solution", the plumbing of dma_get_dmem_cgroup_region() would need to be
about as complex and widespread as the plumbing of dma_alloc_attrs()
itself ;)
I think I see why a simple DMA attribute couldn't be made to work, as
dmem_cgroup_uncharge() can't simply look up the pool the same way
dmem_cgroup_try_charge() found it, since we still need a cg for that and
get_current_dmemcs() can't be assumed to be stable over time, right?
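
(For reference, the asymmetry I mean, using the calls from this
series - the charge side resolves the pool from the calling task's
current cgroup and hands back a state pointer, and that pointer is the
only thing the uncharge side can consume later, potentially from a
completely different task:)

	ret = dmem_cgroup_try_charge(region, size, &pool_state, NULL);
	...
	dmem_cgroup_uncharge(pool_state, size);	/* no region/cg lookup */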
At this point I'm probably starting to lean towards a whole new DMA op
with a properly encapsulated return type (and maybe a long-term goal of
consolidating the 3 or 4 different allocation types we already have), or
just have a single dmem region for "DMA API memory" and don't care where
it came from (although I do see the issues with that too - you probably
wouldn't want to ration a device-private pool the same way as global
system memory, for example).
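
For the former, something vaguely like this is what I have in mind
(all names made up on the spot, purely illustrative):

	struct dma_buffer {		/* hypothetical return type */
		void *cpu_addr;
		dma_addr_t dma_addr;
		struct dmem_cgroup_pool_state *cg_state;
	};

	struct dma_buffer *(*alloc_buffer)(struct device *dev, size_t size,
					   gfp_t gfp, unsigned long attrs);
	void (*free_buffer)(struct device *dev, struct dma_buffer *buf);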
Thanks,
Robin.
* Re: [PATCH RFC 06/12] dma: direct: Provide accessor to dmem region
2025-03-10 18:44 ` Robin Murphy
@ 2025-03-13 18:16 ` Maxime Ripard
0 siblings, 0 replies; 31+ messages in thread
From: Maxime Ripard @ 2025-03-13 18:16 UTC (permalink / raw)
To: Robin Murphy
Cc: Andrew Morton, Marek Szyprowski, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab,
Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
On Mon, Mar 10, 2025 at 06:44:51PM +0000, Robin Murphy wrote:
> On 2025-03-10 4:28 pm, Maxime Ripard wrote:
> > On Mon, Mar 10, 2025 at 02:56:37PM +0000, Robin Murphy wrote:
> > > On 2025-03-10 12:06 pm, Maxime Ripard wrote:
> > > > Consumers of the direct DMA API will have to know which region their
> > > > device allocates from in order for them to charge the memory allocation
> > > > in the right one.
> > >
> > > This doesn't seem to make much sense - dma-direct is not an allocator
> > > itself, it just provides the high-level dma_alloc_attrs/dma_alloc_pages/etc.
> > > interfaces wherein the underlying allocations _could_ come from CMA, but
> > > also a per-device coherent/restricted pool, or a global coherent/atomic
> > > pool, or the regular page allocator, or in one weird corner case the SWIOTLB
> > > buffer, or...
> >
> > I guess it wasn't super clear, but what I meant is that it's an
> > allocator to the consumer: it gets called, and returns a buffer. How it
> > does so is transparent to the device, and on the other side of the
> > abstraction.
> >
> > I do agree that the logic is complicated to follow, and that's what I
> > was getting at in the cover letter.
>
> Right, but ultimately my point is that when we later end up with:
>
> struct dmem_cgroup_region *
> dma_get_dmem_cgroup_region(struct device *dev)
> {
> 	if (dma_alloc_direct(dev, get_dma_ops(dev)))
> 		return dma_direct_get_dmem_cgroup_region(dev);
>
> 	return dma_contiguous_get_dmem_cgroup_region(dev);
>
> it's objectively wrong given what dma_alloc_direct() means in context:
>
> void *dma_alloc_attrs(...)
> {
> 	if (dma_alloc_direct(dev, ops))
> 		cpu_addr = dma_direct_alloc(...);
>
> where dma_direct_alloc() may then use at least 5 different allocation
> methods, only one of which is CMA. Accounting things which are not CMA to
> CMA seems to thoroughly defeat the purpose of having such fine-grained
> accounting at all.
>
> This is why the very notion of "consumers of dma-direct" should
> fundamentally not be a thing IMO. Drivers consume the DMA API interfaces,
> and the DMA API ultimately consumes various memory allocators, but what
> happens in between is nobody else's business; dma-direct happens to
> represent *some* paths between the two, but there are plenty more paths to
> the same (and different) allocators through other DMA API implementations as
> well. Which route a particular call takes to end up at a particular
> allocator is not meaningful unless you are the DMA ops dispatch code.
>
> Or to put it another way, to even go for the "dumbest possible correct
> solution", the plumbing of dma_get_dmem_cgroup_region() would need to be
> about as complex and widespread as the plumbing of dma_alloc_attrs() itself
> ;)
I largely agree with the sentiment, and I think the very idea of
dma_get_dmem_cgroup_region() is a bad one for that reason. But since I
wasn't too sure what a good one might look like, I figured it would
still be a good way to start the discussion :)
> I think I see why a simple DMA attribute couldn't be made to work, as
> dmem_cgroup_uncharge() can't simply look up the pool the same way
> dmem_cgroup_try_charge() found it, since we still need a cg for that and
> get_current_dmemcs() can't be assumed to be stable over time, right?
> At this point I'm probably starting to lean towards a whole new DMA op with a
> properly encapsulated return type (and maybe a long-term goal of
> consolidating the 3 or 4 different allocation types we already have)
It felt like a good solution to me too, and it's what I alluded to with
struct page or folio. My feeling was that the best way to do it would be
to encapsulate the state into the structure returned by the dma_alloc_*
API. That's a pretty large rework though, so I wanted to make sure I was
on the right path before doing so.
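
Roughly this shape, to make it concrete (hypothetical names, only a
sketch of what the encapsulation could look like):

	/* what dma_alloc_*() could hand back instead of a bare void * */
	struct dma_allocation {
		void *vaddr;
		dma_addr_t dma_addr;
		size_t size;
		struct dmem_cgroup_pool_state *cg_state;
	};

The matching free helper could then call
dmem_cgroup_uncharge(alloc->cg_state, alloc->size) itself, without
callers having to stash the state in drm_gem_object or dma_buf.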
> or just have a single dmem region for "DMA API memory" and don't care
> where it came from (although I do see the issues with that too - you
> probably wouldn't want to ration a device-private pool the same way as
> global system memory, for example)
Yeah, the CMA pool is probably something you want to limit differently
as well.
Maxime
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-03-10 14:26 ` Maxime Ripard
@ 2025-03-31 20:43 ` Dave Airlie
2025-04-01 11:03 ` Christian König
0 siblings, 1 reply; 31+ messages in thread
From: Dave Airlie @ 2025-03-31 20:43 UTC (permalink / raw)
To: Maxime Ripard
Cc: Christian König, Andrew Morton, Marek Szyprowski,
Robin Murphy, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard,
Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
On Tue, 11 Mar 2025 at 00:26, Maxime Ripard <mripard@kernel.org> wrote:
>
> Hi,
>
> On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
> > [Adding Ben since we are currently in the middle of a discussion
> > regarding exactly that problem]
> >
> > Just for my understanding before I deep dive into the code: This uses
> > a separate dmem cgroup and does not account against memcg, doesn't it?
>
> Yes. The main rationale being that it doesn't always make sense to
> register against memcg: a lot of devices are going to allocate from
> dedicated chunks of memory that are either carved out from the main
> memory allocator, or not under Linux supervision at all.
>
> And if there's no way to make it consistent across drivers, it's not the
> right tool.
>
While I agree on that, if a user can cause a device driver to allocate
memory that is also memory that memcg accounts, then we have to
interface with memcg to account that memory.
The pathological case would be a single application wanting to use 90%
of RAM for device allocations, freeing it all, then using 90% of RAM
for normal usage. How to create a policy that would allow that with
dmem and memcg is difficult, since if you say you can do 90% on both
then the user can easily OOM the system.
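To put numbers on it (mine, purely illustrative): on a 16GiB machine
with both limits set to 90%, the kernel would permit 14.4GiB of dmem
charges plus 14.4GiB of memcg charges, i.e. up to 28.8GiB of
simultaneous claims against 16GiB of physical memory.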
Dave.
> Maxime
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-03-31 20:43 ` Dave Airlie
@ 2025-04-01 11:03 ` Christian König
2025-04-03 6:07 ` Dave Airlie
0 siblings, 1 reply; 31+ messages in thread
From: Christian König @ 2025-04-01 11:03 UTC (permalink / raw)
To: Dave Airlie, Maxime Ripard
Cc: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
Maarten Lankhorst, Thomas Zimmermann, Simona Vetter, Tomasz Figa,
Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
On 31.03.25 at 22:43, Dave Airlie wrote:
> On Tue, 11 Mar 2025 at 00:26, Maxime Ripard <mripard@kernel.org> wrote:
>> Hi,
>>
>> On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
>>> [Adding Ben since we are currently in the middle of a discussion
>>> regarding exactly that problem]
>>>
>>> Just for my understanding before I deep dive into the code: This uses
>>> a separate dmem cgroup and does not account against memcg, doesn't it?
>> Yes. The main rationale being that it doesn't always make sense to
>> register against memcg: a lot of devices are going to allocate from
>> dedicated chunks of memory that are either carved out from the main
>> memory allocator, or not under Linux supervision at all.
>>
>> And if there's no way to make it consistent across drivers, it's not the
>> right tool.
>>
> While I agree on that, if a user can cause a device driver to allocate
> memory that is also memory that memcg accounts, then we have to
> interface with memcg to account that memory.
This assumes that memcg should be in control of device-driver-allocated memory, which in some cases is intentionally not done.
E.g. a server application which allocates buffers on behalf of clients gets a nice denial-of-service problem if we suddenly start to account those buffers.
That was one of the reasons why my OOM killer improvement patches never landed (e.g. you could trivially kill X/Wayland or systemd with that).
> The pathological case would be a single application wanting to use 90%
> of RAM for device allocations, freeing it all, then using 90% of RAM
> for normal usage. How to create a policy that would allow that with
> dmem and memcg is difficult, since if you say you can do 90% on both
> then the user can easily OOM the system.
Yeah, completely agree.
That's why the GTT size limit we already have per device and the global 50% TTM limit don't work as expected. People also didn't like those limits, and because of that we even have flags to circumvent them; see AMDGPU_GEM_CREATE_PREEMPTIBLE and TTM_TT_FLAG_EXTERNAL.
Another problem is when, and to which process, we account things when eviction happens. For example, process A wants to use VRAM that process B currently occupies. In this case we would give both processes a mix of VRAM and system memory, but how do we account that?
If we account to process B, then it can be that process A fails because of process B's memcg limit. This creates a situation which is absolutely not traceable for a system administrator.
But process A never asked for system memory in the first place, so we can't account the memory to it either, or otherwise we make the process responsible for things it didn't do.
There are good arguments for all solutions, and there are a couple of blockers which rule out one solution or another for a certain use case. To summarize, I think the whole situation is a complete mess.
Maybe there is no single solution and we need to make it configurable somehow?
Regards,
Christian.
>
> Dave.
>> Maxime
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-04-01 11:03 ` Christian König
@ 2025-04-03 6:07 ` Dave Airlie
2025-04-03 7:39 ` Christian König
0 siblings, 1 reply; 31+ messages in thread
From: Dave Airlie @ 2025-04-03 6:07 UTC (permalink / raw)
To: Christian König
Cc: Maxime Ripard, Andrew Morton, Marek Szyprowski, Robin Murphy,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann, Simona Vetter,
Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
On Tue, 1 Apr 2025 at 21:03, Christian König <christian.koenig@amd.com> wrote:
>
> On 31.03.25 at 22:43, Dave Airlie wrote:
> > On Tue, 11 Mar 2025 at 00:26, Maxime Ripard <mripard@kernel.org> wrote:
> >> Hi,
> >>
> >> On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
> >>> [Adding Ben since we are currently in the middle of a discussion
> >>> regarding exactly that problem]
> >>>
> >>> Just for my understanding before I deep dive into the code: This uses
> >>> a separate dmem cgroup and does not account against memcg, doesn't it?
> >> Yes. The main rationale being that it doesn't always make sense to
> >> register against memcg: a lot of devices are going to allocate from
> >> dedicated chunks of memory that are either carved out from the main
> >> memory allocator, or not under Linux supervision at all.
> >>
> >> And if there's no way to make it consistent across drivers, it's not the
> >> right tool.
> >>
> > While I agree on that, if a user can cause a device driver to allocate
> > memory that is also memory that memcg accounts, then we have to
> > interface with memcg to account that memory.
>
> This assumes that memcg should be in control of device-driver-allocated memory, which in some cases is intentionally not done.
>
> E.g. a server application which allocates buffers on behalf of clients gets a nice denial-of-service problem if we suddenly start to account those buffers.
Yes we definitely need the ability to transfer an allocation between
cgroups for this case.
>
> That was one of the reasons why my OOM killer improvement patches never landed (e.g. you could trivially kill X/Wayland or systemd with that).
>
> > The pathological case would be a single application wanting to use 90%
> > of RAM for device allocations, freeing it all, then using 90% of RAM
> > for normal usage. How to create a policy that would allow that with
> > dmem and memcg is difficult, since if you say you can do 90% on both
> > then the user can easily OOM the system.
>
> Yeah, completely agree.
>
> That's why the GTT size limit we already have per device and the global 50% TTM limit don't work as expected. People also didn't like those limits, and because of that we even have flags to circumvent them; see AMDGPU_GEM_CREATE_PREEMPTIBLE and TTM_TT_FLAG_EXTERNAL.
>
> Another problem is when, and to which process, we account things when eviction happens. For example, process A wants to use VRAM that process B currently occupies. In this case we would give both processes a mix of VRAM and system memory, but how do we account that?
>
> If we account to process B, then it can be that process A fails because of process B's memcg limit. This creates a situation which is absolutely not traceable for a system administrator.
>
> But process A never asked for system memory in the first place, so we can't account the memory to it either, or otherwise we make the process responsible for things it didn't do.
>
> There are good arguments for all solutions, and there are a couple of blockers which rule out one solution or another for a certain use case. To summarize, I think the whole situation is a complete mess.
>
> Maybe there is no single solution and we need to make it configurable somehow?
My feeling is that we can't solve the VRAM eviction problem super
effectively, but it's also probably not going to be a major common
case. I don't think we should double-account memcg/dmem just in case
we have to evict all of a user's dmem at some point. Maybe if there was
some kind of soft memcg limit we could add as an accounted-but-not-enforced
overhead, it might be useful to track evictions. But yes, we can't have
A allocating memory causing B to fall over because we evict memory into
its memcg space and it fails to allocate the next time it tries, or
having A fail in that case.
For the UMA GPU case where there is no device memory or eviction
problem, perhaps a configurable option to just say account memory in
memcg for all allocations done by this process, and state yes you can
work around it with allocation servers or whatever but the behaviour
for well behaved things is at least somewhat defined.
Dave.
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-04-03 6:07 ` Dave Airlie
@ 2025-04-03 7:39 ` Christian König
2025-04-03 15:47 ` Maxime Ripard
0 siblings, 1 reply; 31+ messages in thread
From: Christian König @ 2025-04-03 7:39 UTC (permalink / raw)
To: Dave Airlie
Cc: Maxime Ripard, Andrew Morton, Marek Szyprowski, Robin Murphy,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann, Simona Vetter,
Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
On 03.04.25 at 08:07, Dave Airlie wrote:
> On Tue, 1 Apr 2025 at 21:03, Christian König <christian.koenig@amd.com> wrote:
>> On 31.03.25 at 22:43, Dave Airlie wrote:
>>> On Tue, 11 Mar 2025 at 00:26, Maxime Ripard <mripard@kernel.org> wrote:
>>>> Hi,
>>>>
>>>> On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
>>>>> [Adding Ben since we are currently in the middle of a discussion
>>>>> regarding exactly that problem]
>>>>>
>>>>> Just for my understanding before I deep dive into the code: This uses
>>>>> a separate dmem cgroup and does not account against memcg, doesn't it?
>>>> Yes. The main rationale being that it doesn't always make sense to
>>>> register against memcg: a lot of devices are going to allocate from
>>>> dedicated chunks of memory that are either carved out from the main
>>>> memory allocator, or not under Linux supervision at all.
>>>>
>>>> And if there's no way to make it consistent across drivers, it's not the
>>>> right tool.
>>>>
>>> While I agree on that, if a user can cause a device driver to allocate
>>> memory that is also memory that memcg accounts, then we have to
>>> interface with memcg to account that memory.
>> This assumes that memcg should be in control of device-driver-allocated memory, which in some cases is intentionally not done.
>>
>> E.g. a server application which allocates buffers on behalf of clients gets a nice denial-of-service problem if we suddenly start to account those buffers.
> Yes we definitely need the ability to transfer an allocation between
> cgroups for this case.
The bigger issue is that you break UAPI for things which are "older" (X server, older Android approaches, etc.); fixing all of this is certainly possible, but most likely simply not worth it.
There are simpler approaches to handle all this, I think, but see below for further thoughts on that topic.
>> That was one of the reasons why my OOM killer improvement patches never landed (e.g. you could trivially kill X/Wayland or systemd with that).
>>
>>> The pathological case would be a single application wanting to use 90%
>>> of RAM for device allocations, freeing it all, then using 90% of RAM
>>> for normal usage. How to create a policy that would allow that with
>>> dmem and memcg is difficult, since if you say you can do 90% on both
>>> then the user can easily OOM the system.
>> Yeah, completely agree.
>>
>> That's why the GTT size limit we already have per device and the global 50% TTM limit don't work as expected. People also didn't like those limits, and because of that we even have flags to circumvent them; see AMDGPU_GEM_CREATE_PREEMPTIBLE and TTM_TT_FLAG_EXTERNAL.
>>
>> Another problem is when, and to which process, we account things when eviction happens. For example, process A wants to use VRAM that process B currently occupies. In this case we would give both processes a mix of VRAM and system memory, but how do we account that?
>>
>> If we account to process B, then it can be that process A fails because of process B's memcg limit. This creates a situation which is absolutely not traceable for a system administrator.
>>
>> But process A never asked for system memory in the first place, so we can't account the memory to it either, or otherwise we make the process responsible for things it didn't do.
>>
>> There are good arguments for all solutions, and there are a couple of blockers which rule out one solution or another for a certain use case. To summarize, I think the whole situation is a complete mess.
>>
>> Maybe there is no single solution and we need to make it configurable somehow?
> My feeling is that we can't solve the VRAM eviction problem super
> effectively, but it's also probably not going to be a major common
> case. I don't think we should double-account memcg/dmem just in case
> we have to evict all of a user's dmem at some point. Maybe if there was
> some kind of soft memcg limit we could add as an accounted-but-not-enforced
> overhead, it might be useful to track evictions. But yes, we can't have
> A allocating memory causing B to fall over because we evict memory into
> its memcg space and it fails to allocate the next time it tries, or
> having A fail in that case.
+1 yeah, exactly my thinking as well.
> For the UMA GPU case where there is no device memory or eviction
> problem, perhaps a configurable option to just say account memory in
> memcg for all allocations done by this process, and state yes you can
> work around it with allocation servers or whatever but the behaviour
> for well behaved things is at least somewhat defined.
We can have that as a workaround, but I think we should approach that differently.
With upcoming CXL even coherent device memory is exposed to the core OS as NUMA memory with just a high latency.
So both in the CXL and UMA case it actually doesn't make sense to allocate the memory through the driver interfaces any more. With AMDGPU for example we are just replicating mbind()/madvise() within the driver.
Instead what the DRM subsystem should aim for is to allocate memory using the normal core OS functionality and then import it into the driver.
AMD, NVidia and Intel have HMM working for quite a while now but it has some limitations, especially on the performance side.
So for AMDGPU we are currently evaluating udmabuf as alternative. That seems to be working fine with different NUMA nodes, is perfectly memcg accounted and gives you a DMA-buf which can be imported everywhere.
The only show stopper might be the allocation performance, but even if that's the case I think the ongoing folio work will properly resolve that.
With that in mind, I think for the CXL/UMA use case we should use dmem to limit the driver-allocated memory to just a few megabytes for legacy things, and let the vast majority of memory allocations go through the normal core OS channels instead.
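
For reference, the userspace flow we are evaluating is essentially
just this (minimal sketch, error handling elided):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/udmabuf.h>

	/* memcg-accounted pages from a memfd, wrapped into a dma-buf */
	static int alloc_dmabuf(size_t size)	/* size: page aligned */
	{
		int memfd = memfd_create("buffer", MFD_ALLOW_SEALING);
		struct udmabuf_create create = { 0 };
		int dev, dmabuf;

		ftruncate(memfd, size);
		/* udmabuf requires the memfd to be sealed against shrinking */
		fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

		dev = open("/dev/udmabuf", O_RDWR);
		create.memfd = memfd;
		create.flags = UDMABUF_FLAGS_CLOEXEC;
		create.size = size;
		dmabuf = ioctl(dev, UDMABUF_CREATE, &create);

		close(dev);
		close(memfd);
		return dmabuf;	/* importable like any other dma-buf */
	}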
Regards,
Christian.
>
> Dave.
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
` (13 preceding siblings ...)
2025-03-10 14:16 ` Christian König
@ 2025-04-03 8:27 ` Simona Vetter
14 siblings, 0 replies; 31+ messages in thread
From: Simona Vetter @ 2025-04-03 8:27 UTC (permalink / raw)
To: Maxime Ripard
Cc: Andrew Morton, Marek Szyprowski, Robin Murphy, Sumit Semwal,
Christian König, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann,
David Airlie, Simona Vetter, Tomasz Figa, Mauro Carvalho Chehab,
Hans Verkuil, Laurent Pinchart, linux-mm, linux-kernel, iommu,
linux-media, dri-devel, linaro-mm-sig
On Mon, Mar 10, 2025 at 01:06:06PM +0100, Maxime Ripard wrote:
> Hi,
>
> Here's preliminary work to enable dmem tracking for heavy users of DMA
> allocations on behalf of userspace: v4l2, DRM, and dma-buf heaps.
>
> It's not really meant for inclusion at the moment, because I really
> don't like it that much, and would like to discuss solutions on how to
> make it nicer.
>
> In particular, the dma dmem region accessors don't feel that great to
> me. It duplicates the logic to select the proper accessor in
> dma_alloc_attrs(), and it looks fragile and potentially buggy to me.
>
> One solution I tried is to do the accounting in dma_alloc_attrs()
> directly, depending on a flag being set, similar to what __GFP_ACCOUNT
> is doing.
>
> It didn't work because dmem initialises a state pointer when charging an
> allocation to a region, and expects that state pointer to be passed back
> when uncharging. Since dma_alloc_attrs() returns a void pointer to the
> allocated buffer, we need to put that state into a higher-level
> structure, such as drm_gem_object, or dma_buf.
>
> Since we can't share the region selection logic, we need to get the
> region through some other mean. Another thing I consider was to return
> the region as part of the allocated buffer (through struct page or
> folio), but those are lost across the calls and dma_alloc_attrs() will
> only get a void pointer. So that's not doable without some heavy
> rework, if it's a good idea at all.
>
> So yeah, I went for the dumbest possible solution with the accessors,
> hoping you could suggest a much smarter idea :)
I've had a private chat with Maxime to get him up to speed on hopefully a
lot of the past discussions, but probably best I put my notes here too.
Somewhat unstructured list of challenges with trying to account all the
memory for gpu/isp/camera/whatever:
- At LPC in Dublin I think we've pretty much reached the conclusion that
normal struct page memory should be just accounted in memcg. Otherwise
you just get really nasty double-accounting chaos or issues where you
can exhaust reserves.
- We did not figure out what to do with mixed stuff like CMA, where we
probably want to account it both into memcg (because it's struct page)
but also separately into dmem (because the CMA region is a limited
resource and only using memcg will not help us manage it).
- There's the entire chaos of carve-out vs CMA and how userspace can
figure out how to set reasonable limits automatically. Maxime brought
the issue that limits need to be adjusted if carve-out/CMA/shmem aren't
accounted the same, which I think is a valid concern. But due to the
above conclusion around memcg accounting I think that's unavoidable, so
we need some means for userspace to autoconfigure reasonable limits.
Then that autoconfig can be done on each boot, and kernel (or dt or
whatever) changes between these three allocators don't matter anymore.
- Autoconfiguration challenges also exist for split display/render SoCs. It
gets even more fun if you also throw in camera and media codecs, and
even more fun if you have multiple CMA regions.
- Discrete gpu also has a very fun autoconfiguration issue because you
have dmem limits for vram, and memcg limits for system memory. Vram
might be swapped out to system memory, so naively you might want to
assume that you need higher memcg limits than dmem limits. But there are
systems with more dmem than smem (because the cpu with its memory is
essentially just the co-processor that orchestrates the real compute
machine, which is all gpus).
- We need charge transfer, at least for Android, since there all memory is
allocated through binder. TJ Mercier did some patches:
https://lore.kernel.org/dri-devel/20230123191728.2928839-3-tjmercier@google.com/
Ofc with dmem this would need to work for both dmem and memcg charges,
since with CMA and discrete gpu we'll have BOs that are tracked in both
(see the sketch after this list).
- Hard limits for shmem/ttm drivers need a memcg-aware shrinker. TTM
doesn't even have a shrinker yet, but with xe we now have a
helper-library approach to enabling shrinking for TTM drivers.
memcg-aware shrinking will be a large step up in complexity on top (and
probably a good reason to switch over to the common shrinker lru instead
of hand-rolling).
See the various attempts at ttm shrinkers by Christian König and Thomas
Hellstrom over the past years on dri-devel.
This also means that most likely cgroup limit enforcement for ttm based
drivers will be per-driver or at least very uneven.
- Hard limits for dmem vram mean ttm eviction needs to be able to account
the evicted BO against the right memcg. Because this can happen in
random other threads (cs ioctl of another process, kernel threads),
accounting this correctly is going to be "fun". Plus I haven't thought
through interactions with memcg-aware shrinkers, which might cause some
really fundamental issues.
- We also ideally need pin accounting, but I don't think we have any
consensus on how to do that for memcg memory. Thus far it's all
functionality-specific limits (e.g. mlock; rdma has its own for
long-term pinned memory), so I'm not sure it makes sense to push for
unified tracking in memcg here?
For dmem I think it's pretty easy, but there the question is how to
differentiate between dmem that's always pinned (cma, I don't think
anyone bothered with a shrinker for cma memory, vc4 maybe?) and dmem
that generally has a shrinker and really wants a separate pin limit
(vram/ttm drivers).
- Unfortunately on top of the sometimes very high individual complexity
these issues also all interact. Which means that we won't be able to
roll this out in one go, and we need to cope with very uneven
enforcement. I think trying to allow userspace to cope with changing
cgroup support through autoconfiguration is the most feasible way out of
this challenge.
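For the charge transfer point above, the primitive we'd need is
roughly the following (names invented here purely for illustration;
TJ's actual patches at the link are shaped differently):

	/*
	 * Hypothetical: move an existing charge of @size bytes from the
	 * pool recorded in @state over to @target's pool in the same
	 * region, so that e.g. a buffer handed over through binder gets
	 * billed to its new owner instead of the allocating service.
	 */
	int dmem_cgroup_transfer_charge(struct dmem_cgroup_pool_state **state,
					struct dmem_cgroup *target, u64 size);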
tldr; cgroup for device memory is a really complex mess
Cheers, Sima
> Thanks,
> Maxime
>
> Signed-off-by: Maxime Ripard <mripard@kernel.org>
> ---
> Maxime Ripard (12):
> cma: Register dmem region for each cma region
> cma: Provide accessor to cma dmem region
> dma: coherent: Register dmem region for each coherent region
> dma: coherent: Provide accessor to dmem region
> dma: contiguous: Provide accessor to dmem region
> dma: direct: Provide accessor to dmem region
> dma: Create default dmem region for DMA allocations
> dma: Provide accessor to dmem region
> dma-buf: Clear cgroup accounting on release
> dma-buf: cma: Account for allocations in dmem cgroup
> drm/gem: Add cgroup memory accounting
> media: videobuf2: Track buffer allocations through the dmem cgroup
>
> drivers/dma-buf/dma-buf.c | 7 ++++
> drivers/dma-buf/heaps/cma_heap.c | 18 ++++++++--
> drivers/gpu/drm/drm_gem.c | 5 +++
> drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++
> .../media/common/videobuf2/videobuf2-dma-contig.c | 19 +++++++++++
> include/drm/drm_device.h | 1 +
> include/drm/drm_gem.h | 2 ++
> include/linux/cma.h | 9 +++++
> include/linux/dma-buf.h | 5 +++
> include/linux/dma-direct.h | 2 ++
> include/linux/dma-map-ops.h | 32 ++++++++++++++++++
> include/linux/dma-mapping.h | 11 ++++++
> kernel/dma/coherent.c | 26 +++++++++++++++
> kernel/dma/direct.c | 8 +++++
> kernel/dma/mapping.c | 39 ++++++++++++++++++++++
> mm/cma.c | 21 +++++++++++-
> mm/cma.h | 3 ++
> 17 files changed, 211 insertions(+), 3 deletions(-)
> ---
> base-commit: 55a2aa61ba59c138bd956afe0376ec412a7004cf
> change-id: 20250307-dmem-cgroups-73febced0989
>
> Best regards,
> --
> Maxime Ripard <mripard@kernel.org>
>
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-04-03 7:39 ` Christian König
@ 2025-04-03 15:47 ` Maxime Ripard
2025-04-04 8:47 ` Christian König
0 siblings, 1 reply; 31+ messages in thread
From: Maxime Ripard @ 2025-04-03 15:47 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, Andrew Morton, Marek Szyprowski, Robin Murphy,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann, Simona Vetter,
Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
> > For the UMA GPU case where there is no device memory or eviction
> > problem, perhaps a configurable option to just say account memory in
> > memcg for all allocations done by this process, and state yes you can
> > work around it with allocation servers or whatever but the behaviour
> > for well behaved things is at least somewhat defined.
>
> We can have that as a workaround, but I think we should approach that
> differently.
>
> With upcoming CXL even coherent device memory is exposed to the core
> OS as NUMA memory with just a high latency.
>
> So both in the CXL and UMA case it actually doesn't make sense to
> allocate the memory through the driver interfaces any more. With
> AMDGPU for example we are just replicating mbind()/madvise() within
> the driver.
>
> Instead what the DRM subsystem should aim for is to allocate memory
> using the normal core OS functionality and then import it into the
> driver.
>
> AMD, NVidia and Intel have HMM working for quite a while now but it
> has some limitations, especially on the performance side.
>
> So for AMDGPU we are currently evaluating udmabuf as alternative. That
> seems to be working fine with different NUMA nodes, is perfectly memcg
> accounted and gives you a DMA-buf which can be imported everywhere.
>
> The only show stopper might be the allocation performance, but even if
> that's the case I think the ongoing folio work will properly resolve
> that.
I mean, no, the showstopper to that is that using udmabuf has the
assumption that you have an IOMMU for every device doing DMA, which is
absolutely not true on !x86 platforms.
It might be true for all GPUs, but it certainly isn't for display
controllers, and it's not either for codecs, ISPs, and cameras.
And then there's the other assumption that all memory is under the
memory allocator control, which isn't the case on most recent platforms
either.
We *need* to take CMA into account there, all the carved-out, device
specific memory regions, and the memory regions that aren't even under
Linux supervision like protected memory that is typically handled by the
firmware and all you get is a dma-buf.
Saying that it's how you want to workaround it on AMD is absolutely
fine, but DRM as a whole should certainly not aim for that, because it
can't.
Maxime
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-04-03 15:47 ` Maxime Ripard
@ 2025-04-04 8:47 ` Christian König
2025-04-05 1:57 ` T.J. Mercier
0 siblings, 1 reply; 31+ messages in thread
From: Christian König @ 2025-04-04 8:47 UTC (permalink / raw)
To: Maxime Ripard
Cc: Dave Airlie, Andrew Morton, Marek Szyprowski, Robin Murphy,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Maarten Lankhorst, Thomas Zimmermann, Simona Vetter,
Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
Hi Maxime,
On 03.04.25 at 17:47, Maxime Ripard wrote:
> On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
>>> For the UMA GPU case where there is no device memory or eviction
>>> problem, perhaps a configurable option to just say account memory in
>>> memcg for all allocations done by this process, and state yes you can
>>> work around it with allocation servers or whatever but the behaviour
>>> for well behaved things is at least somewhat defined.
>> We can have that as a workaround, but I think we should approach that
>> differently.
>>
>> With upcoming CXL even coherent device memory is exposed to the core
>> OS as NUMA memory with just a high latency.
>>
>> So both in the CXL and UMA case it actually doesn't make sense to
>> allocate the memory through the driver interfaces any more. With
>> AMDGPU for example we are just replicating mbind()/madvise() within
>> the driver.
>>
>> Instead what the DRM subsystem should aim for is to allocate memory
>> using the normal core OS functionality and then import it into the
>> driver.
>>
>> AMD, NVidia and Intel have HMM working for quite a while now but it
>> has some limitations, especially on the performance side.
>>
>> So for AMDGPU we are currently evaluating udmabuf as alternative. That
>> seems to be working fine with different NUMA nodes, is perfectly memcg
>> accounted and gives you a DMA-buf which can be imported everywhere.
>>
>> The only show stopper might be the allocation performance, but even if
>> that's the case I think the ongoing folio work will properly resolve
>> that.
> I mean, no, the showstopper to that is that using udmabuf has the
> assumption that you have an IOMMU for every device doing DMA, which is
> absolutely not true on !x86 platforms.
>
> It might be true for all GPUs, but it certainly isn't for display
> controllers, and it's not either for codecs, ISPs, and cameras.
>
> And then there's the other assumption that all memory is under the
> memory allocator control, which isn't the case on most recent platforms
> either.
>
> We *need* to take CMA into account there, all the carved-out, device
> specific memory regions, and the memory regions that aren't even under
> Linux supervision like protected memory that is typically handled by the
> firmware and all you get is a dma-buf.
>
> Saying that it's how you want to workaround it on AMD is absolutely
> fine, but DRM as a whole should certainly not aim for that, because it
> can't.
A bunch of good points you bring up here but it sounds like you misunderstood me a bit.
I'm certainly *not* saying that we should push for udmabuf for everything, that is clearly use case specific.
For use cases like CMA or protected carve-out the question what to do doesn't even arise in the first place.
When you have CMA which dynamically steals memory from the core OS then of course it should be accounted to memcg.
When you have carve-out which the core OS memory management doesn't even know about then it should certainly be handled by dmem.
The problematic use cases are the ones where a buffer can sometimes be backed by system memory and sometimes by something special. For these we don't have a good approach yet, since every approach seems to have a drawback for some use case.
Regards,
Christian.
>
> Maxime
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-04-04 8:47 ` Christian König
@ 2025-04-05 1:57 ` T.J. Mercier
2025-04-07 11:46 ` Christian König
0 siblings, 1 reply; 31+ messages in thread
From: T.J. Mercier @ 2025-04-05 1:57 UTC (permalink / raw)
To: Christian König
Cc: Maxime Ripard, Dave Airlie, Andrew Morton, Marek Szyprowski,
Robin Murphy, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, Maarten Lankhorst, Thomas Zimmermann, Simona Vetter,
Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
On Fri, Apr 4, 2025 at 1:47 AM Christian König <christian.koenig@amd.com> wrote:
>
> Hi Maxime,
>
> On 03.04.25 at 17:47, Maxime Ripard wrote:
> > On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
> >>> For the UMA GPU case where there is no device memory or eviction
> >>> problem, perhaps a configurable option to just say account memory in
> >>> memcg for all allocations done by this process, and state yes you can
> >>> work around it with allocation servers or whatever but the behaviour
> >>> for well behaved things is at least somewhat defined.
> >> We can have that as a workaround, but I think we should approach that
> >> differently.
> >>
> >> With upcoming CXL even coherent device memory is exposed to the core
> >> OS as NUMA memory with just a high latency.
> >>
> >> So both in the CXL and UMA case it actually doesn't make sense to
> >> allocate the memory through the driver interfaces any more. With
> >> AMDGPU for example we are just replicating mbind()/madvise() within
> >> the driver.
> >>
> >> Instead what the DRM subsystem should aim for is to allocate memory
> >> using the normal core OS functionality and then import it into the
> >> driver.
> >>
> >> AMD, NVidia and Intel have HMM working for quite a while now but it
> >> has some limitations, especially on the performance side.
> >>
> >> So for AMDGPU we are currently evaluating udmabuf as alternative. That
> >> seems to be working fine with different NUMA nodes, is perfectly memcg
> >> accounted and gives you a DMA-buf which can be imported everywhere.
> >>
> >> The only show stopper might be the allocation performance, but even if
> >> that's the case I think the ongoing folio work will properly resolve
> >> that.
> > I mean, no, the showstopper to that is that using udmabuf has the
> > assumption that you have an IOMMU for every device doing DMA, which is
> > absolutely not true on !x86 platforms.
> >
> > It might be true for all GPUs, but it certainly isn't for display
> > controllers, and it's not either for codecs, ISPs, and cameras.
> >
> > And then there's the other assumption that all memory is under the
> > memory allocator control, which isn't the case on most recent platforms
> > either.
> >
> > We *need* to take CMA into account there, all the carved-out, device
> > specific memory regions, and the memory regions that aren't even under
> > Linux supervision like protected memory that is typically handled by the
> > firmware and all you get is a dma-buf.
> >
> > Saying that it's how you want to workaround it on AMD is absolutely
> > fine, but DRM as a whole should certainly not aim for that, because it
> > can't.
>
> A bunch of good points you bring up here but it sounds like you misunderstood me a bit.
>
> I'm certainly *not* saying that we should push for udmabuf for everything, that is clearly use case specific.
>
> For use cases like CMA or protected carve-out the question what to do doesn't even arise in the first place.
>
> When you have CMA which dynamically steals memory from the core OS then of course it should be accounted to memcg.
>
> When you have carve-out which the core OS memory management doesn't even know about then it should certainly be handled by dmem.
>
> The problematic use cases are the ones where a buffer can sometimes be backed by system memory and sometimes by something special. For these we don't have a good approach yet, since every approach seems to have a drawback for some use case.
This reminds me of memory.memsw in cgroup v1, where both resident and
swapped memory show up under the same memcg counter. In this dmem
scenario it's similar but across two different cgroup controllers
instead of two different types of system memory under the same
controller.
memsw doesn't exist in v2, and users are asking for it back. [1] I
tend to agree that a combined counter is useful as I don't see a great
way to apply meaningful limits to individual counters (or individual
controller limits in the dmem+memcg case) when multiple cgroups are
involved and eviction can cause memory to be transferred from one
place to another. Sorry I'm not really offering a solution to this,
but I feel like only transferring the charge between cgroups is a
partial solution since the enforcement by the kernel is independent
for each controller. So yeah as Dave and Sima said for accounting I
guess it works, and maybe that's good enough if you have userspace
enforcement that's smart enough to look in all the different places.
But then there are the folks asking for kernel enforcement. Maybe just
accounting as best we can is a good place to start?
[1] https://lore.kernel.org/all/20250319064148.774406-5-jingxiangzeng.cas@gmail.com/
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-04-05 1:57 ` T.J. Mercier
@ 2025-04-07 11:46 ` Christian König
2025-04-08 1:03 ` T.J. Mercier
0 siblings, 1 reply; 31+ messages in thread
From: Christian König @ 2025-04-07 11:46 UTC (permalink / raw)
To: T.J. Mercier
Cc: Maxime Ripard, Dave Airlie, Andrew Morton, Marek Szyprowski,
Robin Murphy, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, Maarten Lankhorst, Thomas Zimmermann, Simona Vetter,
Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
On 05.04.25 at 03:57, T.J. Mercier wrote:
> On Fri, Apr 4, 2025 at 1:47 AM Christian König <christian.koenig@amd.com> wrote:
>> Hi Maxime,
>>
>> On 03.04.25 at 17:47, Maxime Ripard wrote:
>>> On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
>>>>> For the UMA GPU case where there is no device memory or eviction
>>>>> problem, perhaps a configurable option to just say account memory in
>>>>> memcg for all allocations done by this process, and state yes you can
>>>>> work around it with allocation servers or whatever but the behaviour
>>>>> for well behaved things is at least somewhat defined.
>>>> We can have that as a workaround, but I think we should approach that
>>>> differently.
>>>>
>>>> With upcoming CXL even coherent device memory is exposed to the core
>>>> OS as NUMA memory with just a high latency.
>>>>
>>>> So both in the CXL and UMA case it actually doesn't make sense to
>>>> allocate the memory through the driver interfaces any more. With
>>>> AMDGPU for example we are just replicating mbind()/madvise() within
>>>> the driver.
>>>>
>>>> Instead what the DRM subsystem should aim for is to allocate memory
>>>> using the normal core OS functionality and then import it into the
>>>> driver.
>>>>
>>>> AMD, NVidia and Intel have HMM working for quite a while now but it
>>>> has some limitations, especially on the performance side.
>>>>
>>>> So for AMDGPU we are currently evaluating udmabuf as alternative. That
>>>> seems to be working fine with different NUMA nodes, is perfectly memcg
>>>> accounted and gives you a DMA-buf which can be imported everywhere.
>>>>
>>>> The only show stopper might be the allocation performance, but even if
>>>> that's the case I think the ongoing folio work will properly resolve
>>>> that.
>>> I mean, no, the showstopper to that is that using udmabuf has the
>>> assumption that you have an IOMMU for every device doing DMA, which is
>>> absolutely not true on !x86 platforms.
>>>
>>> It might be true for all GPUs, but it certainly isn't for display
>>> controllers, and it's not either for codecs, ISPs, and cameras.
>>>
>>> And then there's the other assumption that all memory is under the
>>> memory allocator control, which isn't the case on most recent platforms
>>> either.
>>>
>>> We *need* to take CMA into account there, all the carved-out, device
>>> specific memory regions, and the memory regions that aren't even under
>>> Linux supervision like protected memory that is typically handled by the
>>> firmware and all you get is a dma-buf.
>>>
>>> Saying that it's how you want to workaround it on AMD is absolutely
>>> fine, but DRM as a whole should certainly not aim for that, because it
>>> can't.
>> A bunch of good points you bring up here but it sounds like you misunderstood me a bit.
>>
>> I'm certainly *not* saying that we should push for udmabuf for everything, that is clearly use case specific.
>>
>> For use cases like CMA or protected carve-out the question what to do doesn't even arise in the first place.
>>
>> When you have CMA which dynamically steals memory from the core OS then of course it should be accounted to memcg.
>>
>> When you have carve-out which the core OS memory management doesn't even know about then it should certainly be handled by dmem.
>>
>> The problematic use cases are the ones where a buffer can sometimes be backed by system memory and sometimes by something special. For these we don't have a good approach yet, since every approach seems to have a drawback for some use case.
> This reminds me of memory.memsw in cgroup v1, where both resident and
> swapped memory show up under the same memcg counter. In this dmem
> scenario it's similar but across two different cgroup controllers
> instead of two different types of system memory under the same
> controller.
Yeah, nailed it. Exactly that was one of the potential solutions I had in mind as well.
It's just that I abandoned that idea when I realized that it actually wouldn't help.
For example, imagine you have 8GiB of system memory and 8GiB of local memory, so you set your cgroup limit to 12GiB. But when an application tries to use the full 12GiB as system memory instead of a combination of the two, you still run into the OOM killer.
> memsw doesn't exist in v2, and users are asking for it back. [1] I
> tend to agree that a combined counter is useful as I don't see a great
> way to apply meaningful limits to individual counters (or individual
> controller limits in the dmem+memcg case) when multiple cgroups are
> involved and eviction can cause memory to be transferred from one
> place to another. Sorry I'm not really offering a solution to this,
> but I feel like only transferring the charge between cgroups is a
> partial solution since the enforcement by the kernel is independent
> for each controller. So yeah as Dave and Sima said for accounting I
> guess it works, and maybe that's good enough if you have userspace
> enforcement that's smart enough to look in all the different places.
> But then there are the folks asking for kernel enforcement. Maybe just
> accounting as best we can is a good place to start?
So we would account to memcg, but not apply its limits?
Mhm, that's kind of an interesting idea. It at least partially solves the problem.
Regards,
Christian.
>
> [1] https://lore.kernel.org/all/20250319064148.774406-5-jingxiangzeng.cas@gmail.com/
* Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking
2025-04-07 11:46 ` Christian König
@ 2025-04-08 1:03 ` T.J. Mercier
0 siblings, 0 replies; 31+ messages in thread
From: T.J. Mercier @ 2025-04-08 1:03 UTC (permalink / raw)
To: Christian König
Cc: Maxime Ripard, Dave Airlie, Andrew Morton, Marek Szyprowski,
Robin Murphy, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, Maarten Lankhorst, Thomas Zimmermann, Simona Vetter,
Tomasz Figa, Mauro Carvalho Chehab, Ben Woodard, Hans Verkuil,
Laurent Pinchart, linux-mm, linux-kernel, iommu, linux-media,
dri-devel, linaro-mm-sig
On Mon, Apr 7, 2025 at 4:46 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 05.04.25 at 03:57, T.J. Mercier wrote:
> > On Fri, Apr 4, 2025 at 1:47 AM Christian König <christian.koenig@amd.com> wrote:
> >> Hi Maxime,
> >>
> >> On 03.04.25 at 17:47, Maxime Ripard wrote:
> >>> On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
> >>>>> For the UMA GPU case where there is no device memory or eviction
> >>>>> problem, perhaps a configurable option to just say account memory in
> >>>>> memcg for all allocations done by this process, and state yes you can
> >>>>> work around it with allocation servers or whatever but the behaviour
> >>>>> for well behaved things is at least somewhat defined.
> >>>> We can have that as a workaround, but I think we should approach that
> >>>> differently.
> >>>>
> >>>> With upcoming CXL even coherent device memory is exposed to the core
> >>>> OS as NUMA memory with just a high latency.
> >>>>
> >>>> So both in the CXL and UMA case it actually doesn't make sense to
> >>>> allocate the memory through the driver interfaces any more. With
> >>>> AMDGPU for example we are just replicating mbind()/madvise() within
> >>>> the driver.
> >>>>
> >>>> Instead what the DRM subsystem should aim for is to allocate memory
> >>>> using the normal core OS functionality and then import it into the
> >>>> driver.
> >>>>
> >>>> AMD, NVidia and Intel have HMM working for quite a while now but it
> >>>> has some limitations, especially on the performance side.
> >>>>
> >>>> So for AMDGPU we are currently evaluating udmabuf as alternative. That
> >>>> seems to be working fine with different NUMA nodes, is perfectly memcg
> >>>> accounted and gives you a DMA-buf which can be imported everywhere.
> >>>>
> >>>> The only show stopper might be the allocation performance, but even if
> >>>> that's the case I think the ongoing folio work will properly resolve
> >>>> that.
> >>> I mean, no, the showstopper to that is that using udmabuf has the
> >>> assumption that you have an IOMMU for every device doing DMA, which is
> >>> absolutely not true on !x86 platforms.
> >>>
> >>> It might be true for all GPUs, but it certainly isn't for display
> >>> controllers, and it's not either for codecs, ISPs, and cameras.
> >>>
> >>> And then there's the other assumption that all memory is under the
> >>> memory allocator control, which isn't the case on most recent platforms
> >>> either.
> >>>
> >>> We *need* to take CMA into account there, all the carved-out, device
> >>> specific memory regions, and the memory regions that aren't even under
> >>> Linux supervision like protected memory that is typically handled by the
> >>> firmware and all you get is a dma-buf.
> >>>
> >>> Saying that it's how you want to workaround it on AMD is absolutely
> >>> fine, but DRM as a whole should certainly not aim for that, because it
> >>> can't.
> >> A bunch of good points you bring up here but it sounds like you misunderstood me a bit.
> >>
> >> I'm certainly *not* saying that we should push for udmabuf for everything, that is clearly use case specific.
> >>
> >> For use cases like CMA or protected carve-out the question what to do doesn't even arise in the first place.
> >>
> >> When you have CMA which dynamically steals memory from the core OS then of course it should be accounted to memcg.
> >>
> >> When you have carve-out which the core OS memory management doesn't even know about then it should certainly be handled by dmem.
> >>
> >> The problematic use cases are the one where a buffer can sometimes be backed by system memory and sometime by something special. For this we don't have a good approach what to do since every approach seems to have a draw back for some use case.
> > This reminds me of memory.memsw in cgroup v1, where both resident and
> > swapped memory show up under the same memcg counter. In this dmem
> > scenario it's similar but across two different cgroup controllers
> > instead of two different types of system memory under the same
> > controller.
>
> Yeah, nailed it. Exactly that was one of the potential solutions I had in mind as well.
>
> It's just that I abandoned that idea when I realized that it actually wouldn't help.
>
> For example, imagine you have 8GiB of system memory and 8GiB of local memory, so you set your combined cgroup limit to 12GiB. But when an application tries to use the full 12GiB as system memory instead of a combination of the two, you still run into the OOM killer.
Yup, to solve this with kernel enforcement we would need a counter
that includes both types. Then what if that system memory can be
swapped and exceeds a swap-only counter? Yet another counter? (dmem,
dmem+system, dmem+system+swap) :\
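Just to make "look in all the different places" concrete, here's a
rough userspace sketch of the kind of thing I mean (not anything in
this series; it assumes dmem.current reads as one "<region> <bytes>"
pair per line and treats a missing file as zero):

#include <stdio.h>

static long long read_scalar(const char *path)
{
	FILE *f = fopen(path, "r");
	long long v = 0;

	if (!f)
		return 0;	/* counter absent: treat as zero */
	if (fscanf(f, "%lld", &v) != 1)
		v = 0;
	fclose(f);
	return v;
}

static long long read_keyed_total(const char *path)
{
	FILE *f = fopen(path, "r");
	char key[256];
	long long v, total = 0;

	if (!f)
		return 0;
	while (fscanf(f, "%255s %lld", key, &v) == 2)
		total += v;	/* sum across all dmem regions */
	fclose(f);
	return total;
}

int main(int argc, char **argv)
{
	const char *cg = argc > 1 ? argv[1] : "/sys/fs/cgroup";
	char path[512];
	long long total = 0;

	snprintf(path, sizeof(path), "%s/memory.current", cg);
	total += read_scalar(path);
	snprintf(path, sizeof(path), "%s/memory.swap.current", cg);
	total += read_scalar(path);
	snprintf(path, sizeof(path), "%s/dmem.current", cg);
	total += read_keyed_total(path);

	printf("dmem+system+swap: %lld bytes\n", total);
	return 0;
}

Whatever enforcement you want then happens on top of that number in
userspace.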
> > memsw doesn't exist in v2, and users are asking for it back. [1] I
> > tend to agree that a combined counter is useful as I don't see a great
> > way to apply meaningful limits to individual counters (or individual
> > controller limits in the dmem+memcg case) when multiple cgroups are
> > involved and eviction can cause memory to be transferred from one
> > place to another. Sorry I'm not really offering a solution to this,
> > but I feel like only transferring the charge between cgroups is a
> > partial solution since the enforcement by the kernel is independent
> > for each controller. So yeah as Dave and Sima said for accounting I
> > guess it works, and maybe that's good enough if you have userspace
> > enforcement that's smart enough to look in all the different places.
> > But then there are the folks asking for kernel enforcement. Maybe just
> > accounting as best we can is a good place to start?
>
> So we would account to memcg, but not apply its limits?
I was thinking just do the accounting independently for each resource
and not rely on the kernel for enforcement of combinations of
resources across controllers. (The status quo.) That shouldn't affect
how memcg enforces limits.
If we could compose a "synthetic" counter file in cgroupfs at runtime
that combines multiple arbitrary existing counters, I think that'd help
address the enforcement side. It'd also conveniently solve the memsw
problem in v2, since you could combine memory.current and
memory.swap.current into something like memsw.current and set a
memsw.max, and only users who care about that combination would pay
the overhead for it.
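Just to sketch the shape of it (entirely hypothetical, no such
interface exists; memsw.compose is a made-up name):

    memsw.compose   write "memory.current memory.swap.current" here to
                    instantiate the composed counter for this cgroup
    memsw.current   read-only sum of the source counters
    memsw.max       limit the kernel enforces against memsw.current

Cgroups that never touch memsw.compose wouldn't pay anything for it.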
> Mhm, that's kind of an interesting idea. It at least partially solves the problem.
>
> Regards,
> Christian.
>
> >
> > [1] https://lore.kernel.org/all/20250319064148.774406-5-jingxiangzeng.cas@gmail.com/
>
Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
2025-03-10 12:06 [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 01/12] cma: Register dmem region for each cma region Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 02/12] cma: Provide accessor to cma dmem region Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 03/12] dma: coherent: Register dmem region for each coherent region Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 04/12] dma: coherent: Provide accessor to dmem region Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 05/12] dma: contiguous: " Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 06/12] dma: direct: " Maxime Ripard
2025-03-10 14:56 ` Robin Murphy
2025-03-10 16:28 ` Maxime Ripard
2025-03-10 18:44 ` Robin Murphy
2025-03-13 18:16 ` Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 07/12] dma: Create default dmem region for DMA allocations Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 08/12] dma: Provide accessor to dmem region Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 09/12] dma-buf: Clear cgroup accounting on release Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 10/12] dma-buf: cma: Account for allocations in dmem cgroup Maxime Ripard
2025-03-10 12:06 ` [PATCH RFC 11/12] drm/gem: Add cgroup memory accounting Maxime Ripard
2025-03-10 15:06 ` Robin Murphy
2025-03-10 12:06 ` [PATCH RFC 12/12] media: videobuf2: Track buffer allocations through the dmem cgroup Maxime Ripard
2025-03-10 12:15 ` [PATCH RFC 00/12] dma: Enable dmem cgroup tracking Maxime Ripard
2025-03-10 14:16 ` Christian König
2025-03-10 14:26 ` Maxime Ripard
2025-03-31 20:43 ` Dave Airlie
2025-04-01 11:03 ` Christian König
2025-04-03 6:07 ` Dave Airlie
2025-04-03 7:39 ` Christian König
2025-04-03 15:47 ` Maxime Ripard
2025-04-04 8:47 ` Christian König
2025-04-05 1:57 ` T.J. Mercier
2025-04-07 11:46 ` Christian König
2025-04-08 1:03 ` T.J. Mercier
2025-04-03 8:27 ` Simona Vetter