* [PATCH v4 0/7] Enable THP support in drm_pagemap
@ 2026-01-11 20:55 Francois Dugast
  2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
  2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
  0 siblings, 2 replies; 39+ messages in thread

From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Francois Dugast, Zi Yan, Madhavan Srinivasan,
    Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
    Suren Baghdasaryan, Michal Hocko, Mike Rapoport, Vlastimil Babka,
    Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
    Felix Kuehling, Alex Deucher, Christian König, David Airlie,
    Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
    Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Jason Gunthorpe,
    Leon Romanovsky, Balbir Singh, Dan Williams, Matthew Wilcox,
    Jan Kara, Alexander Viro, Christian Brauner, linuxppc-dev, kvm,
    linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl,
    nvdimm, linux-fsdevel

Use Balbir Singh's series for device-private THP support [1] and previous
preparation work in drm_pagemap [2] to add 2MB/THP support in xe. This
leads to significant performance improvements when using SVM with 2MB
pages.

[1] https://lore.kernel.org/linux-mm/20251001065707.920170-1-balbirs@nvidia.com/
[2] https://patchwork.freedesktop.org/series/151754/

v2:
- rebase on top of multi-device SVM
- add drm_pagemap_cpages() with temporary patch
- address other feedback from Matt Brost on v1

v3:
The major change is to remove the dependency on the mm/huge_memory helper
migrate_device_split_page(), which was called explicitly when a 2M buddy
allocation backed by a large folio would later be reused for a smaller
allocation (4K or 64K). Instead, the first 3 patches provided by Matthew
Brost ensure large folios are split at the time of freeing.
v4:
- add order argument to folio_free callback
- send complete series to linux-mm and MM folks as requested (Zi Yan and
  Andrew Morton) and cover letter to anyone receiving at least one of the
  patches (Liam R. Howlett)

Cc: Zi Yan <ziy@nvidia.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Cc: nvdimm@lists.linux.dev
Cc:
linux-fsdevel@vger.kernel.org

Francois Dugast (3):
  drm/pagemap: Unlock and put folios when possible
  drm/pagemap: Add helper to access zone_device_data
  drm/pagemap: Enable THP support for GPU memory migration

Matthew Brost (4):
  mm/zone_device: Add order argument to folio_free callback
  mm/zone_device: Add free_zone_device_folio_prepare() helper
  fs/dax: Use free_zone_device_folio_prepare() helper
  drm/pagemap: Correct cpages calculation for migrate_vma_setup

 arch/powerpc/kvm/book3s_hv_uvmem.c       |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   2 +-
 drivers/gpu/drm/drm_gpusvm.c             |   7 +-
 drivers/gpu/drm/drm_pagemap.c            | 165 ++++++++++++++++++-----
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |   4 +-
 drivers/pci/p2pdma.c                     |   2 +-
 fs/dax.c                                 |  24 +---
 include/drm/drm_pagemap.h                |  15 +++
 include/linux/memremap.h                 |   8 +-
 lib/test_hmm.c                           |   4 +-
 mm/memremap.c                            |  60 ++++++++-
 11 files changed, 227 insertions(+), 66 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 39+ messages in thread
* [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
  2026-01-11 22:35   ` Matthew Wilcox
  2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
  1 sibling, 1 reply; 39+ messages in thread

From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
    Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
    Felix Kuehling, Alex Deucher, Christian König, David Airlie,
    Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
    Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Jason Gunthorpe,
    Leon Romanovsky, Balbir Singh, Lorenzo Stoakes, Liam R. Howlett,
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau,
    linux-pci, linux-mm, linux-cxl, Francois Dugast

From: Matthew Brost <matthew.brost@intel.com>

The core MM splits the folio before calling folio_free, restoring the
zone pages associated with the folio to an initialized state (e.g.,
non-compound, pgmap valid, etc...). The order argument represents the
folio’s order prior to the split which can be used driver side to know
how many pages are being freed.
Fixes: 3a5a06554566 ("mm/zone_device: rename page_free callback to folio_free") Cc: Zi Yan <ziy@nvidia.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Lyude Paul <lyude@redhat.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Cc: nouveau@lists.freedesktop.org Cc: linux-pci@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-cxl@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Francois Dugast <francois.dugast@intel.com> --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +- drivers/gpu/drm/drm_pagemap.c | 3 ++- drivers/gpu/drm/nouveau/nouveau_dmem.c | 4 ++-- drivers/pci/p2pdma.c | 2 +- include/linux/memremap.h | 7 ++++++- lib/test_hmm.c | 4 +--- mm/memremap.c | 5 +++-- 8 files changed, 17 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index e5000bef90f2..b58f34eec6e5 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -1014,7 +1014,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf) * to a normal PFN during H_SVM_PAGE_OUT. * Gets called with kvm->arch.uvmem_lock held. */ -static void kvmppc_uvmem_folio_free(struct folio *folio) +static void kvmppc_uvmem_folio_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; unsigned long pfn = page_to_pfn(page) - diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index af53e796ea1b..a26e3c448e47 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -567,7 +567,7 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, return r < 0 ? 
r : 0; } -static void svm_migrate_folio_free(struct folio *folio) +static void svm_migrate_folio_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; struct svm_range_bo *svm_bo = page->zone_device_data; diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c index 03ee39a761a4..df253b13cf85 100644 --- a/drivers/gpu/drm/drm_pagemap.c +++ b/drivers/gpu/drm/drm_pagemap.c @@ -1144,11 +1144,12 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas, /** * drm_pagemap_folio_free() - Put GPU SVM zone device data associated with a folio * @folio: Pointer to the folio + * @order: Order of the folio prior to being split by core MM * * This function is a callback used to put the GPU SVM zone device data * associated with a page when it is being released. */ -static void drm_pagemap_folio_free(struct folio *folio) +static void drm_pagemap_folio_free(struct folio *folio, unsigned int order) { drm_pagemap_zdd_put(folio->page.zone_device_data); } diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 58071652679d..545f316fca14 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -115,14 +115,14 @@ unsigned long nouveau_dmem_page_addr(struct page *page) return chunk->bo->offset + off; } -static void nouveau_dmem_folio_free(struct folio *folio) +static void nouveau_dmem_folio_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page); struct nouveau_dmem *dmem = chunk->drm->dmem; spin_lock(&dmem->lock); - if (folio_order(folio)) { + if (order) { page->zone_device_data = dmem->free_folios; dmem->free_folios = folio; } else { diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 4a2fc7ab42c3..a6fa7610f8a8 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -200,7 +200,7 @@ static const struct attribute_group p2pmem_group = { 
.name = "p2pmem", }; -static void p2pdma_folio_free(struct folio *folio) +static void p2pdma_folio_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page)); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 713ec0435b48..97fcffeb1c1e 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -79,8 +79,13 @@ struct dev_pagemap_ops { * Called once the folio refcount reaches 0. The reference count will be * reset to one by the core code after the method is called to prepare * for handing out the folio again. + * + * The core MM splits the folio before calling folio_free, restoring the + * zone pages associated with the folio to an initialized state (e.g., + * non-compound, pgmap valid, etc...). The order argument represents the + * folio’s order prior to the split. */ - void (*folio_free)(struct folio *folio); + void (*folio_free)(struct folio *folio, unsigned int order); /* * Used for private (un-addressable) device memory only. 
Must migrate diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 8af169d3873a..e17c71d02a3a 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -1580,13 +1580,11 @@ static const struct file_operations dmirror_fops = { .owner = THIS_MODULE, }; -static void dmirror_devmem_free(struct folio *folio) +static void dmirror_devmem_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; struct page *rpage = BACKING_PAGE(page); struct dmirror_device *mdevice; - struct folio *rfolio = page_folio(rpage); - unsigned int order = folio_order(rfolio); if (rpage != page) { if (order) diff --git a/mm/memremap.c b/mm/memremap.c index 63c6ab4fdf08..39dc4bd190d0 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -417,6 +417,7 @@ void free_zone_device_folio(struct folio *folio) { struct dev_pagemap *pgmap = folio->pgmap; unsigned long nr = folio_nr_pages(folio); + unsigned int order = folio_order(folio); int i; if (WARN_ON_ONCE(!pgmap)) @@ -453,7 +454,7 @@ void free_zone_device_folio(struct folio *folio) case MEMORY_DEVICE_COHERENT: if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) break; - pgmap->ops->folio_free(folio); + pgmap->ops->folio_free(folio, order); percpu_ref_put_many(&folio->pgmap->ref, nr); break; @@ -472,7 +473,7 @@ void free_zone_device_folio(struct folio *folio) case MEMORY_DEVICE_PCI_P2PDMA: if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) break; - pgmap->ops->folio_free(folio); + pgmap->ops->folio_free(folio, order); break; } } -- 2.43.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
@ 2026-01-11 22:35   ` Matthew Wilcox
  2026-01-12  0:19     ` Balbir Singh
  0 siblings, 1 reply; 39+ messages in thread

From: Matthew Wilcox @ 2026-01-11 22:35 UTC (permalink / raw)
To: Francois Dugast
Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
    Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
    Felix Kuehling, Alex Deucher, Christian König, David Airlie,
    Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
    Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Jason Gunthorpe,
    Leon Romanovsky, Balbir Singh, Lorenzo Stoakes, Liam R. Howlett,
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau,
    linux-pci, linux-mm, linux-cxl

On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> The core MM splits the folio before calling folio_free, restoring the
> zone pages associated with the folio to an initialized state (e.g.,
> non-compound, pgmap valid, etc...). The order argument represents the
> folio’s order prior to the split which can be used driver side to know
> how many pages are being freed.

This really feels like the wrong way to fix this problem.

I think someone from the graphics side really needs to take the lead on
understanding what the MM is doing (both currently and in the future).
I'm happy to work with you, but it feels like there's a lot of churn right
now because there's a lot of people working on this without understanding
the MM side of things (and conversely, I don't think (m)any people on the
MM side really understand what graphics cards are trying to accomplish).

Who is that going to be?  I'm happy to get on the phone with someone.
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-11 22:35 ` Matthew Wilcox
@ 2026-01-12  0:19   ` Balbir Singh
  2026-01-12  0:51     ` Zi Yan
  0 siblings, 1 reply; 39+ messages in thread

From: Balbir Singh @ 2026-01-12  0:19 UTC (permalink / raw)
To: Matthew Wilcox, Francois Dugast
Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
    Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
    Felix Kuehling, Alex Deucher, Christian König, David Airlie,
    Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
    Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Jason Gunthorpe,
    Leon Romanovsky, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
    Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple,
    linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci,
    linux-mm, linux-cxl

On 1/12/26 08:35, Matthew Wilcox wrote:
> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>> The core MM splits the folio before calling folio_free, restoring the
>> zone pages associated with the folio to an initialized state (e.g.,
>> non-compound, pgmap valid, etc...). The order argument represents the
>> folio’s order prior to the split which can be used driver side to know
>> how many pages are being freed.
>
> This really feels like the wrong way to fix this problem.
>

This stems from a special requirement, freeing is done in two phases

1. Free the folio -> inform the driver (which implies freeing the
   backing device memory)
2. Return the folio back, split it back to single order folios

The current code does not do 2. 1 followed by 2 does not work for
Francois since the backing memory can get reused before we reach step 2.
The proposed patch does 2 followed 1, but doing 2 means we've lost the
folio order and thus the old order is passed in. Although, I wonder if the
backing folio's zone_device_data can be used to encode any order information
about the device side allocation.

@Francois, I hope I did not miss anything in the explanation above.

> I think someone from the graphics side really needs to take the lead on
> understanding what the MM is doing (both currently and in the future).
> I'm happy to work with you, but it feels like there's a lot of churn right
> now because there's a lot of people working on this without understanding
> the MM side of things (and conversely, I don't think (m)any people on the
> MM side really understand what graphics cards are trying to accomplish).
>

I suspect you are referring to folio specialization and/or downsizing?

> Who is that going to be? I'm happy to get on the phone with someone.

Happy to work with you, but I am not the authority on graphics, I can speak
to zone device folios. I suspect we'd need to speak to more than one person.

Balbir
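Balbir's zone_device_data idea above can be sketched in userspace as pointer tagging. This is only a hypothetical encoding, not an existing kernel helper: it assumes the driver structure behind the zone_device_data pointer is aligned to at least 16 bytes, leaving 4 low bits free for orders 0..15 (enough for order-9, i.e. 2MB on 4K pages).

```c
#include <stdint.h>

/* Hypothetical sketch: stash the allocation order in the unused low
 * bits of a zone_device_data-style pointer. Assumes the pointed-to
 * structure is at least 16-byte aligned, so 4 bits are available. */

#define ZDD_ORDER_MASK 0xFUL

static inline void *zdd_encode(void *data, unsigned int order)
{
	/* caller guarantees data is 16-byte aligned and order <= 15 */
	return (void *)((uintptr_t)data | (order & ZDD_ORDER_MASK));
}

static inline void *zdd_data(void *zdd)
{
	/* strip the tag bits to recover the original pointer */
	return (void *)((uintptr_t)zdd & ~ZDD_ORDER_MASK);
}

static inline unsigned int zdd_order(void *zdd)
{
	/* recover the order that was folded into the low bits */
	return (unsigned int)((uintptr_t)zdd & ZDD_ORDER_MASK);
}
```

A driver-side free path could then recover the device-side allocation order from the tagged pointer alone, without needing the folio's (already reset) compound state.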
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-12  0:19 ` Balbir Singh
@ 2026-01-12  0:51   ` Zi Yan
  2026-01-12  1:37     ` Matthew Brost
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread

From: Zi Yan @ 2026-01-12  0:51 UTC (permalink / raw)
To: Matthew Wilcox, Balbir Singh
Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost,
    Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
    Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher,
    Christian König, David Airlie, Simona Vetter, Maarten Lankhorst,
    Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
    Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
    Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel,
    amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl

On 11 Jan 2026, at 19:19, Balbir Singh wrote:

> On 1/12/26 08:35, Matthew Wilcox wrote:
>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>> The core MM splits the folio before calling folio_free, restoring the
>>> zone pages associated with the folio to an initialized state (e.g.,
>>> non-compound, pgmap valid, etc...). The order argument represents the
>>> folio’s order prior to the split which can be used driver side to know
>>> how many pages are being freed.
>>
>> This really feels like the wrong way to fix this problem.
>>

Hi Matthew,

I think the wording is confusing, since the actual issue is that:

1. zone_device_page_init() calls prep_compound_page() to form a large folio,
2. but free_zone_device_folio() never reverse the course,
3. the undo of prep_compound_page() in free_zone_device_folio() needs to
   be done before driver callback ->folio_free(), since once ->folio_free()
   is called, the folio can be reallocated immediately,
4. after the undo of prep_compound_page(), folio_order() can no longer provide
   the original order information, thus, folio_free() needs that for proper
   device side ref manipulation.

So this is not used for "split" but undo of prep_compound_page(). It might
look like a split to non core MM people, since it changes a large folio
to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
open codes it in free_pages_prepare() by resetting page flags, page->mapping,
and so on. So it might be why the undo prep_compound_page() is missed
by non core MM people.

>
> This stems from a special requirement, freeing is done in two phases
>
> 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
> 2. Return the folio back, split it back to single order folios

Hi Balbir,

Please refrain from using "split" here, since it confuses MM people. A folio
is split when it is still in use, but in this case, the folio has been freed
and needs to be restored to "free page" state.

>
> The current code does not do 2. 1 followed by 2 does not work for
> Francois since the backing memory can get reused before we reach step 2.
> The proposed patch does 2 followed 1, but doing 2 means we've lost the
> folio order and thus the old order is passed in. Although, I wonder if the
> backing folio's zone_device_data can be used to encode any order information
> about the device side allocation.
>
> @Francois, I hope I did not miss anything in the explanation above.
>
>> I think someone from the graphics side really needs to take the lead on
>> understanding what the MM is doing (both currently and in the future).
>> I'm happy to work with you, but it feels like there's a lot of churn right
>> now because there's a lot of people working on this without understanding
>> the MM side of things (and conversely, I don't think (m)any people on the
>> MM side really understand what graphics cards are trying to accomplish).
>>
>
> I suspect you are referring to folio specialization and/or downsizing?
>
>> Who is that going to be? I'm happy to get on the phone with someone.
>
> Happy to work with you, but I am not the authority on graphics, I can speak
> to zone device folios. I suspect we'd need to speak to more than one person.
>

-- 
Best Regards,
Yan, Zi
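Zi Yan's point 3 above — that the compound-state undo must happen before ->folio_free() because the pages can be reallocated the instant the driver free runs — can be illustrated with a toy userspace model. Everything below is a mock (none of these names exist in the kernel); it only demonstrates why one ordering clobbers a reused page while the other does not.

```c
#include <stdbool.h>

/* Toy model of the two freeing orders discussed in this thread. Once
 * the driver-free step runs, the backing memory may immediately be
 * handed to a new owner; undoing the compound state after that point
 * would overwrite the new owner's page state. Names are illustrative. */

enum page_state { COMPOUND_TAIL, REINITIALIZED, REUSED };

static enum page_state tail_state;

static void mock_driver_free(void)
{
	/* simulate the page being reallocated right after the free */
	tail_state = REUSED;
}

static void mock_undo_compound(void)
{
	tail_state = REINITIALIZED;
}

/* Unsafe order ("1 followed by 2"): driver free first, undo afterwards. */
static bool free_then_undo(void)
{
	tail_state = COMPOUND_TAIL;
	mock_driver_free();
	mock_undo_compound();		/* clobbers the already-reused page */
	return tail_state == REUSED;	/* false: new owner's state lost */
}

/* Order used by this series ("2 followed by 1"): undo, then driver free. */
static bool undo_then_free(void)
{
	tail_state = COMPOUND_TAIL;
	mock_undo_compound();
	mock_driver_free();
	return tail_state == REUSED;	/* true: reuse sees a clean page */
}
```

In the unsafe order the final state check fails, modeling the corruption; in the order this series implements, the reused page's state survives.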
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-12  0:51 ` Zi Yan
@ 2026-01-12  1:37   ` Matthew Brost
  2026-01-12  4:50   ` Balbir Singh
  2026-01-12 13:45   ` Jason Gunthorpe
  2 siblings, 0 replies; 39+ messages in thread

From: Matthew Brost @ 2026-01-12  1:37 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
    dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
    Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher,
    Christian König, David Airlie, Simona Vetter, Maarten Lankhorst,
    Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
    Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
    Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel,
    amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl

On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>
> > On 1/12/26 08:35, Matthew Wilcox wrote:
> >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> >>> The core MM splits the folio before calling folio_free, restoring the
> >>> zone pages associated with the folio to an initialized state (e.g.,
> >>> non-compound, pgmap valid, etc...). The order argument represents the
> >>> folio’s order prior to the split which can be used driver side to know
> >>> how many pages are being freed.
> >>
> >> This really feels like the wrong way to fix this problem.
> >>
>
> Hi Matthew,
>
> I think the wording is confusing, since the actual issue is that:
>
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> be done before driver callback ->folio_free(), since once ->folio_free()
> is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> the original order information, thus, folio_free() needs that for proper
> device side ref manipulation.
>
> So this is not used for "split" but undo of prep_compound_page(). It might
> look like a split to non core MM people, since it changes a large folio
> to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
> open codes it in free_pages_prepare() by resetting page flags, page->mapping,
> and so on. So it might be why the undo prep_compound_page() is missed
> by non core MM people.
>

Let me try to reword this while avoiding the term “split” and properly
explaining the problem.

> >
> > This stems from a special requirement, freeing is done in two phases
> >
> > 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
> > 2. Return the folio back, split it back to single order folios
>
> Hi Balbir,
>
> Please refrain from using "split" here, since it confuses MM people. A folio
> is split when it is still in use, but in this case, the folio has been freed
> and needs to be restored to "free page" state.
>

Yeah, “split” is a bad term. We are reinitializing all zone pages in a
folio upon free.

> >
> > The current code does not do 2. 1 followed by 2 does not work for
> > Francois since the backing memory can get reused before we reach step 2.
> > The proposed patch does 2 followed 1, but doing 2 means we've lost the
> > folio order and thus the old order is passed in. Although, I wonder if the
> > backing folio's zone_device_data can be used to encode any order information
> > about the device side allocation.
> >
> > @Francois, I hope I did not miss anything in the explanation above.

Yes, correct. The pages in the folio must be reinitialized before calling
into the driver to free them, because once that happens, the pages can be
immediately reallocated.

> >
> >> I think someone from the graphics side really needs to take the lead on
> >> understanding what the MM is doing (both currently and in the future).
> >> I'm happy to work with you, but it feels like there's a lot of churn right
> >> now because there's a lot of people working on this without understanding
> >> the MM side of things (and conversely, I don't think (m)any people on the
> >> MM side really understand what graphics cards are trying to accomplish).

I can’t disagree with anything you’re saying. The core MM is about as
complex as it gets, and my understanding of what’s going on isn’t
great—it’s basically just reverse engineering until I reach a point where
I can fix a problem, think it’s correct, and hope I don’t get shredded.
Graphics/DRM is also quite complex, but that’s where I work...

> >>
> >
> > I suspect you are referring to folio specialization and/or downsizing?
> >
> >> Who is that going to be? I'm happy to get on the phone with someone.
> >
> > Happy to work with you, but I am not the authority on graphics, I can speak
> > to zone device folios. I suspect we'd need to speak to more than one person.
> >

Also happy to work with you, but I agree with Zi—graphics isn’t something
one company can speak as an authority on, much less one person.

Matt

> 
> -- 
> Best Regards,
> Yan, Zi
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-12  0:51 ` Zi Yan
  2026-01-12  1:37   ` Matthew Brost
@ 2026-01-12  4:50   ` Balbir Singh
  2026-01-12 13:45   ` Jason Gunthorpe
  2 siblings, 0 replies; 39+ messages in thread

From: Balbir Singh @ 2026-01-12  4:50 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox
Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost,
    Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
    Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher,
    Christian König, David Airlie, Simona Vetter, Maarten Lankhorst,
    Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
    Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
    Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel,
    amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl

On 1/12/26 10:51, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>
>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>> The core MM splits the folio before calling folio_free, restoring the
>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>> folio’s order prior to the split which can be used driver side to know
>>>> how many pages are being freed.
>>>
>>> This really feels like the wrong way to fix this problem.
>>>
>
> Hi Matthew,
>
> I think the wording is confusing, since the actual issue is that:
>
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> be done before driver callback ->folio_free(), since once ->folio_free()
> is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> the original order information, thus, folio_free() needs that for proper
> device side ref manipulation.
>
> So this is not used for "split" but undo of prep_compound_page(). It might
> look like a split to non core MM people, since it changes a large folio
> to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
> open codes it in free_pages_prepare() by resetting page flags, page->mapping,
> and so on. So it might be why the undo prep_compound_page() is missed
> by non core MM people.
>
>>
>> This stems from a special requirement, freeing is done in two phases
>>
>> 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
>> 2. Return the folio back, split it back to single order folios
>
> Hi Balbir,
>
> Please refrain from using "split" here, since it confuses MM people. A folio
> is split when it is still in use, but in this case, the folio has been freed
> and needs to be restored to "free page" state.
>

Yeah, the word split came from the initial version that called it
folio_split_unref() and I was also thinking of the split callback for
zone device folios, but I agree (re)initialization is a better term.

>>
>> The current code does not do 2. 1 followed by 2 does not work for
>> Francois since the backing memory can get reused before we reach step 2.
>> The proposed patch does 2 followed 1, but doing 2 means we've lost the
>> folio order and thus the old order is passed in. Although, I wonder if the
>> backing folio's zone_device_data can be used to encode any order information
>> about the device side allocation.
>>
>> @Francois, I hope I did not miss anything in the explanation above.
>>
>>> I think someone from the graphics side really needs to take the lead on
>>> understanding what the MM is doing (both currently and in the future).
>>> I'm happy to work with you, but it feels like there's a lot of churn right
>>> now because there's a lot of people working on this without understanding
>>> the MM side of things (and conversely, I don't think (m)any people on the
>>> MM side really understand what graphics cards are trying to accomplish).
>>>
>>
>> I suspect you are referring to folio specialization and/or downsizing?
>>
>>> Who is that going to be? I'm happy to get on the phone with someone.
>>
>> Happy to work with you, but I am not the authority on graphics, I can speak
>> to zone device folios. I suspect we'd need to speak to more than one person.
>>
>
> -- 
> Best Regards,
> Yan, Zi
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 0:51 ` Zi Yan 2026-01-12 1:37 ` Matthew Brost 2026-01-12 4:50 ` Balbir Singh @ 2026-01-12 13:45 ` Jason Gunthorpe 2026-01-12 16:31 ` Zi Yan 2026-01-12 21:49 ` Matthew Brost 2 siblings, 2 replies; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 13:45 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: > On 11 Jan 2026, at 19:19, Balbir Singh wrote: > > > On 1/12/26 08:35, Matthew Wilcox wrote: > >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: > >>> The core MM splits the folio before calling folio_free, restoring the > >>> zone pages associated with the folio to an initialized state (e.g., > >>> non-compound, pgmap valid, etc...). The order argument represents the > >>> folio’s order prior to the split which can be used driver side to know > >>> how many pages are being freed. > >> > >> This really feels like the wrong way to fix this problem. > >> > > Hi Matthew, > > I think the wording is confusing, since the actual issue is that: > > 1. zone_device_page_init() calls prep_compound_page() to form a large folio, > 2. but free_zone_device_folio() never reverse the course, > 3. 
the undo of prep_compound_page() in free_zone_device_folio() needs to > be done before driver callback ->folio_free(), since once ->folio_free() > is called, the folio can be reallocated immediately, > 4. after the undo of prep_compound_page(), folio_order() can no longer provide > the original order information, thus, folio_free() needs that for proper > device side ref manipulation. There is something wrong with the driver if the "folio can be reallocated immediately". The flow generally expects there to be a driver allocator linked to folio_free(): 1) Allocator finds free memory 2) zone_device_page_init() allocates the memory and makes refcount=1 3) __folio_put() knows the refcount is 0. 4) free_zone_device_folio() calls folio_free(), but it doesn't actually need to undo prep_compound_page() because *NOTHING* can use the page pointer at this point. 5) Driver puts the memory back into the allocator and now #1 can happen. It knows how much memory to put back because folio->order is valid from #2. 6) #1 happens again, then #2 happens again and the folio is in the right state for use. The successor #2 fully undoes the work of the predecessor #2. If you have races where #1 can happen immediately after #3 then the driver design is fundamentally broken and passing around order isn't going to help anything. If the allocator is using the struct page memory then step #5 should also clean up the struct page with the allocator data before returning it to the allocator. I vaguely remember talking about this before in the context of the Xe driver.. You can't just take an existing VRAM allocator and layer it on top of the folios and have it broadly ignore the folio_free callback. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
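The six-step flow above can be modeled in a few lines of userspace C (toy names, not the real zone-device API; a sketch under those assumptions): the order written at init time in step 2 is still readable when the free callback runs in step 4, so the driver knows how many pages to return to its allocator in step 5.

```c
/* Userspace model of the alloc/free flow described above. All toy_*
 * names are hypothetical stand-ins, not the real kernel API. */
#include <assert.h>

struct toy_folio {
    int refcount;
    int order;          /* stands in for the compound-page order */
};

static int freed_pages; /* pages returned to the toy allocator */

/* step 2: driver takes memory off its free list and initializes it */
static void toy_zone_device_folio_init(struct toy_folio *f, int order)
{
    f->order = order;   /* successor init overwrites the old order */
    f->refcount = 1;
}

/* steps 4/5: refcount hit zero; order is read before anything reuses it */
static void toy_folio_free(struct toy_folio *f)
{
    freed_pages += 1 << f->order;  /* driver knows how much to put back */
}

/* step 3: last reference dropped */
static void toy_folio_put(struct toy_folio *f)
{
    if (--f->refcount == 0)
        toy_folio_free(f);
}
```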
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 13:45 ` Jason Gunthorpe @ 2026-01-12 16:31 ` Zi Yan 2026-01-12 16:50 ` Jason Gunthorpe 2026-01-12 21:49 ` Matthew Brost 1 sibling, 1 reply; 39+ messages in thread From: Zi Yan @ 2026-01-12 16:31 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 8:45, Jason Gunthorpe wrote: > On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: >> On 11 Jan 2026, at 19:19, Balbir Singh wrote: >> >>> On 1/12/26 08:35, Matthew Wilcox wrote: >>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: >>>>> The core MM splits the folio before calling folio_free, restoring the >>>>> zone pages associated with the folio to an initialized state (e.g., >>>>> non-compound, pgmap valid, etc...). The order argument represents the >>>>> folio’s order prior to the split which can be used driver side to know >>>>> how many pages are being freed. >>>> >>>> This really feels like the wrong way to fix this problem. >>>> >> >> Hi Matthew, >> >> I think the wording is confusing, since the actual issue is that: >> >> 1. zone_device_page_init() calls prep_compound_page() to form a large folio, >> 2. but free_zone_device_folio() never reverse the course, >> 3. 
the undo of prep_compound_page() in free_zone_device_folio() needs to >> be done before driver callback ->folio_free(), since once ->folio_free() >> is called, the folio can be reallocated immediately, >> 4. after the undo of prep_compound_page(), folio_order() can no longer provide >> the original order information, thus, folio_free() needs that for proper >> device side ref manipulation. > > There is something wrong with the driver if the "folio can be > reallocated immediately". > > The flow generally expects there to be a driver allocator linked to > folio_free() > > 1) Allocator finds free memory > 2) zone_device_page_init() allocates the memory and makes refcount=1 > 3) __folio_put() knows the refcount is 0. > 4) free_zone_device_folio() calls folio_free(), but it doesn't > actually need to undo prep_compound_page() because *NOTHING* can > use the page pointer at this point. > 5) Driver puts the memory back into the allocator and now #1 can > happen. It knows how much memory to put back because folio->order > is valid from #2 > 6) #1 happens again, then #2 happens again and the folio is in the > right state for use. The successor #2 fully undoes the work of the > predecessor #2. But how can a successor #2 undo the work if the second #1 only allocates half of the original folio? For example, an order-9 at PFN 0 is allocated and freed, then an order-8 at PFN 0 is allocated and another order-8 at PFN 256 is allocated. How can two #2s undo the same order-9 without corrupting each other’s data? > > If you have races where #1 can happen immediately after #3 then the > driver design is fundamentally broken and passing around order isn't > going to help anything. > > If the allocator is using the struct page memory then step #5 should > also clean up the struct page with the allocator data before returning > it to the allocator. Do you mean ->folio_free() callback should undo prep_compound_page() instead?
> > I vaguely remember talking about this before in the context of the Xe > driver.. You can't just take an existing VRAM allocator and layer it > on top of the folios and have it broadly ignore the folio_free > callback. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 16:31 ` Zi Yan @ 2026-01-12 16:50 ` Jason Gunthorpe 2026-01-12 17:46 ` Zi Yan 2026-01-12 23:07 ` Matthew Brost 0 siblings, 2 replies; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 16:50 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: > > folio_free() > > > > 1) Allocator finds free memory > > 2) zone_device_page_init() allocates the memory and makes refcount=1 > > 3) __folio_put() knows the recount 0. > > 4) free_zone_device_folio() calls folio_free(), but it doesn't > > actually need to undo prep_compound_page() because *NOTHING* can > > use the page pointer at this point. > > 5) Driver puts the memory back into the allocator and now #1 can > > happen. It knows how much memory to put back because folio->order > > is valid from #2 > > 6) #1 happens again, then #2 happens again and the folio is in the > > right state for use. The successor #2 fully undoes the work of the > > predecessor #2. > > But how can a successor #2 undo the work if the second #1 only allocates > half of the original folio? For example, an order-9 at PFN 0 is > allocated and freed, then an order-8 at PFN 0 is allocated and another > order-8 at PFN 256 is allocated. 
How can two #2s undo the same order-9 > without corrupting each other’s data? What do you mean? The fundamental rule is you can't read the folio or the order outside folio_free once its refcount reaches 0. So the successor #2 will write updated heads and order to the order-8 pages at PFN 0 and the ones starting at PFN 256 will remain with garbage. This is OK because nothing is allowed to read them as their refcount is 0. If later PFN 256 is allocated then it will get updated head and order at the same time its refcount becomes 1. There is no corruption; they don't corrupt each other's data. > > If the allocator is using the struct page memory then step #5 should > also clean up the struct page with the allocator data before returning > it to the allocator. > > Do you mean ->folio_free() callback should undo prep_compound_page() > instead? I wouldn't say undo; I was very careful to say it needs to get the struct page memory into a state that the allocator algorithm expects, whatever that means. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
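The point about successor inits can be shown with a small model (toy names, a plain array standing in for the real struct page machinery): re-initializing an order-8 folio at PFN 0 rewrites metadata only for pages 0..255, while pages 256..511 keep stale values, which is harmless while their refcount is 0 and is corrected the moment they are initialized for a new allocation.

```c
/* Toy model of re-initialization after free. Field names are
 * illustrative, not the kernel's actual struct page layout. */
#include <assert.h>
#include <stddef.h>

#define NPAGES 512

struct toy_page {
    int order;            /* valid on the head page only */
    size_t compound_head; /* index of the head page this page belongs to */
    int refcount;
};

static struct toy_page pages[NPAGES];

/* stand-in for zone_device_page_init() + prep_compound_page():
 * rewrites heads and order only for the range being allocated */
static void toy_init(size_t head, int order)
{
    for (size_t i = head; i < head + (1u << order); i++)
        pages[i].compound_head = head;
    pages[head].order = order;
    pages[head].refcount = 1;
}
```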
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 16:50 ` Jason Gunthorpe @ 2026-01-12 17:46 ` Zi Yan 2026-01-12 18:25 ` Jason Gunthorpe 2026-01-12 23:07 ` Matthew Brost 1 sibling, 1 reply; 39+ messages in thread From: Zi Yan @ 2026-01-12 17:46 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: >>> folio_free() >>> >>> 1) Allocator finds free memory >>> 2) zone_device_page_init() allocates the memory and makes refcount=1 >>> 3) __folio_put() knows the recount 0. >>> 4) free_zone_device_folio() calls folio_free(), but it doesn't >>> actually need to undo prep_compound_page() because *NOTHING* can >>> use the page pointer at this point. >>> 5) Driver puts the memory back into the allocator and now #1 can >>> happen. It knows how much memory to put back because folio->order >>> is valid from #2 >>> 6) #1 happens again, then #2 happens again and the folio is in the >>> right state for use. The successor #2 fully undoes the work of the >>> predecessor #2. >> >> But how can a successor #2 undo the work if the second #1 only allocates >> half of the original folio? For example, an order-9 at PFN 0 is >> allocated and freed, then an order-8 at PFN 0 is allocated and another >> order-8 at PFN 256 is allocated. 
How can two #2s undo the same order-9 >> without corrupting each other’s data? > > What do you mean? The fundamental rule is you can't read the folio or > the order outside folio_free once its refcount reaches 0. There is no such rule. In core MM, folio_split(), which splits a high order folio to low order ones, freezes the folio (turning refcount to 0) and manipulates the folio order and all tail pages' compound_head to restructure the folio. Your fundamental rule breaks this. Allowing compound information to stay after a folio is freed means you cannot tell whether a folio is under split or freed. > > So the successor #2 will write updated heads and order to the order-8 > pages at PFN 0 and the ones starting at PFN 256 will remain with > garbage. > > This is OK because nothing is allowed to read them as their refcount > is 0. > > If later PFN 256 is allocated then it will get updated head and order > at the same time its refcount becomes 1. > > There is no corruption; they don't corrupt each other's data. > >>> If the allocator is using the struct page memory then step #5 should >>> also clean up the struct page with the allocator data before returning >>> it to the allocator. >> >> Do you mean ->folio_free() callback should undo prep_compound_page() >> instead? > > I wouldn't say undo; I was very careful to say it needs to get the > struct page memory into a state that the allocator algorithm expects, > whatever that means. > > Jason Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 17:46 ` Zi Yan @ 2026-01-12 18:25 ` Jason Gunthorpe 2026-01-12 18:55 ` Zi Yan 0 siblings, 1 reply; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 18:25 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 12:46:57PM -0500, Zi Yan wrote: > On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote: > > > On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: > >>> folio_free() > >>> > >>> 1) Allocator finds free memory > >>> 2) zone_device_page_init() allocates the memory and makes refcount=1 > >>> 3) __folio_put() knows the recount 0. > >>> 4) free_zone_device_folio() calls folio_free(), but it doesn't > >>> actually need to undo prep_compound_page() because *NOTHING* can > >>> use the page pointer at this point. > >>> 5) Driver puts the memory back into the allocator and now #1 can > >>> happen. It knows how much memory to put back because folio->order > >>> is valid from #2 > >>> 6) #1 happens again, then #2 happens again and the folio is in the > >>> right state for use. The successor #2 fully undoes the work of the > >>> predecessor #2. > >> > >> But how can a successor #2 undo the work if the second #1 only allocates > >> half of the original folio? 
For example, an order-9 at PFN 0 is > >> allocated and freed, then an order-8 at PFN 0 is allocated and another > >> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9 > >> without corrupting each other’s data? > > > > What do you mean? The fundamental rule is you can't read the folio or > > the order outside folio_free once its refcount reaches 0. > > There is no such rule. In core MM, folio_split(), which splits a high > order folio to low order ones, freezes the folio (turning refcount to 0) > and manipulates the folio order and all tail pages' compound_head to > restructure the folio. That's different; I am talking about reaching 0 because it has been freed, meaning there are no external pointers to it. Further, when a page is frozen, page_ref_freeze() takes in the number of references the caller has ownership over and it doesn't succeed if there are stray references elsewhere. This is very important because the entire operating model of split only works if it has exclusive locks over all the valid pointers into that page. Spurious refcount failures concurrent with split cannot be allowed. I don't see how pointing at __folio_freeze_and_split_unmapped() can justify this series. > Your fundamental rule breaks this. Allowing compound information > to stay after a folio is freed means you cannot tell whether a folio > is under split or freed. You can't refcount a folio out of nothing. It has to come from a memory location that is already holding a refcount, and then you can incr it. For example, lockless GUP fast will read the PTE, adjust to the head page, attempt to incr it, then recheck the PTE. If there are races then sure maybe the PTE will point to a stray tail page that refers to an already allocated head page, but the re-check of the PTE will exclude this. The refcount system already has to tolerate spurious refcount incrs because of GUP fast.
Nothing should be looking at order and refcount to try to guess if concurrent split is happening!! Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
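The GUP-fast pattern referenced above can be sketched as a single-threaded userspace model (plain variables stand in for the PTE and the atomic refcount; the real code uses READ_ONCE and atomic inc-unless-zero, so this is only an illustration of the read, try-get, re-check sequence):

```c
/* Toy model of the lockless GUP-fast sequence: read the PTE, try to
 * take a reference, then re-check the PTE to exclude races. All names
 * are hypothetical stand-ins for the real kernel primitives. */
#include <assert.h>

static unsigned long pte;       /* stand-in for the page table entry */
static int page_refcount[16];   /* refcount per toy page */

/* try_get-style helper: never resurrects a freed (refcount 0) page */
static int toy_try_get(unsigned long pfn)
{
    if (page_refcount[pfn] == 0)
        return 0;
    page_refcount[pfn]++;
    return 1;
}

static long toy_gup_fast(void)
{
    unsigned long seen = pte;     /* 1. read the PTE */
    if (!toy_try_get(seen))       /* 2. attempt to take a reference */
        return -1;
    if (pte != seen) {            /* 3. re-check excludes races */
        page_refcount[seen]--;    /* drop the spurious reference */
        return -1;
    }
    return (long)seen;
}
```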
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 18:25 ` Jason Gunthorpe @ 2026-01-12 18:55 ` Zi Yan 2026-01-12 19:28 ` Jason Gunthorpe 0 siblings, 1 reply; 39+ messages in thread From: Zi Yan @ 2026-01-12 18:55 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 13:25, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 12:46:57PM -0500, Zi Yan wrote: >> On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote: >> >>> On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: >>>>> folio_free() >>>>> >>>>> 1) Allocator finds free memory >>>>> 2) zone_device_page_init() allocates the memory and makes refcount=1 >>>>> 3) __folio_put() knows the recount 0. >>>>> 4) free_zone_device_folio() calls folio_free(), but it doesn't >>>>> actually need to undo prep_compound_page() because *NOTHING* can >>>>> use the page pointer at this point. >>>>> 5) Driver puts the memory back into the allocator and now #1 can >>>>> happen. It knows how much memory to put back because folio->order >>>>> is valid from #2 >>>>> 6) #1 happens again, then #2 happens again and the folio is in the >>>>> right state for use. The successor #2 fully undoes the work of the >>>>> predecessor #2. >>>> >>>> But how can a successor #2 undo the work if the second #1 only allocates >>>> half of the original folio? 
For example, an order-9 at PFN 0 is >>>> allocated and freed, then an order-8 at PFN 0 is allocated and another >>>> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9 >>>> without corrupting each other’s data? >>> >>> What do you mean? The fundamental rule is you can't read the folio or >>> the order outside folio_free once its refcount reaches 0. >> >> There is no such rule. In core MM, folio_split(), which splits a high >> order folio to low order ones, freezes the folio (turning refcount to 0) >> and manipulates the folio order and all tail pages' compound_head to >> restructure the folio. > > That's different, I am talking about reaching 0 because it has been > freed, meaning there are no external pointers to it. > > Further, when a page is frozen page_ref_freeze() takes in the number > of references the caller has ownership over and it doesn't succeed if > there are stray references elsewhere. > > This is very important because the entire operating model of split > only works if it has exclusive locks over all the valid pointers into > that page. > > Spurious refcount failures concurrent with split cannot be allowed. > > I don't see how pointing at __folio_freeze_and_split_unmapped() can > justify this series. > But anyone looking at the folio state, with refcount == 0 and compound_head set, cannot tell the difference. If what you said is true, why is free_pages_prepare() needed? No one should touch these free pages. Why bother resetting these states? >> Your fundamental rule breaks this. Allowing compound information >> to stay after a folio is freed means you cannot tell whether a folio >> is under split or freed. > > You can't refcount a folio out of nothing. It has to come from a > memory location that is already holding a refcount, and then you can > incr it. Right. There is also no guarantee that all code is correct and follows this.
My point here is that calling prep_compound_page() on a compound page does not follow core MM’s conventions. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 18:55 ` Zi Yan @ 2026-01-12 19:28 ` Jason Gunthorpe 2026-01-12 23:34 ` Zi Yan 0 siblings, 1 reply; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 19:28 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 01:55:18PM -0500, Zi Yan wrote: > > That's different, I am talking about reaching 0 because it has been > > freed, meaning there are no external pointers to it. > > > > Further, when a page is frozen page_ref_freeze() takes in the number > > of references the caller has ownership over and it doesn't succeed if > > there are stray references elsewhere. > > > > This is very important because the entire operating model of split > > only works if it has exclusive locks over all the valid pointers into > > that page. > > > > Spurious refcount failures concurrent with split cannot be allowed. > > > > I don't see how pointing at __folio_freeze_and_split_unmapped() can > > justify this series. > > > > But from anyone looking at the folio state, refcount == 0, compound_head > is set, they cannot tell the difference. This isn't reliable, nothing correct can be doing it :\ > If what you said is true, why is free_pages_prepare() needed? No one > should touch these free pages. Why bother resetting these states. ? 
that function does a lot of stuff, things like uncharging the cgroup should obviously happen at free time. > > What part of it are you looking at? > >>> You can't refcount a folio out of nothing. It has to come from a >>> memory location that already is holding a refcount, and then you can >>> incr it. >> >> Right. There is also no guarantee that all code is correct and follows >> this. > > Let's concretely point at things that have a problem, please. > >> My point here is that calling prep_compound_page() on a compound page >> does not follow core MM’s conventions. > > Maybe, but that doesn't mean it isn't the right solution.. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 19:28 ` Jason Gunthorpe @ 2026-01-12 23:34 ` Zi Yan 2026-01-12 23:53 ` Jason Gunthorpe 0 siblings, 1 reply; 39+ messages in thread From: Zi Yan @ 2026-01-12 23:34 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 14:28, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 01:55:18PM -0500, Zi Yan wrote: >>> That's different, I am talking about reaching 0 because it has been >>> freed, meaning there are no external pointers to it. >>> >>> Further, when a page is frozen page_ref_freeze() takes in the number >>> of references the caller has ownership over and it doesn't succeed if >>> there are stray references elsewhere. >>> >>> This is very important because the entire operating model of split >>> only works if it has exclusive locks over all the valid pointers into >>> that page. >>> >>> Spurious refcount failures concurrent with split cannot be allowed. >>> >>> I don't see how pointing at __folio_freeze_and_split_unmapped() can >>> justify this series. >>> >> >> But from anyone looking at the folio state, refcount == 0, compound_head >> is set, they cannot tell the difference. > > This isn't reliable, nothing correct can be doing it :\ > >> If what you said is true, why is free_pages_prepare() needed? No one >> should touch these free pages. 
Why bother resetting these states. > > ? that function does a lot of stuff, things like uncharging the cgroup > should obviously happen at free time. > > What part of it are you looking at? page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order. free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and compound_head at the end. page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound pages. These three parts undo prep_compound_page(). > >>> You can't refcount a folio out of nothing. It has to come from a >>> memory location that already is holding a refcount, and then you can >>> incr it. >> >> Right. There is also no guarantee that all code is correct and follows >> this. > > Let's concretely point at things that have a problem please. > >> My point here is that calling prep_compound_page() on a compound page >> does not follow core MM’s conventions. > > Maybe, but that doesn't mean it isn't the right solution.. In current nouveau code, ->free_folios is used to hold the freed folios. In nouveau_dmem_page_alloc_locked(), the freed folio is passed to zone_device_folio_init(). If the allocated folio order is different from the freed folio order, I do not know how you are going to keep track of the rest of the freed folio. Of course you can implement a buddy allocator there. If this still does not convince you that overwriting an existing compound page with a different order configuration is a bad idea, feel free to do whatever you think is right. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
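The three resets listed above can be sketched as a pair of toy helpers (illustrative field names, not the kernel's actual page flag layout): one mirrors what prep_compound_page() sets up, the other the undo that free_pages_prepare() open-codes for regular pages.

```c
/* Toy model of prep_compound_page() and its open-coded undo. The field
 * names and flag bits are hypothetical, chosen only to mirror the three
 * resets discussed above: head flag, order, and tail metadata. */
#include <assert.h>
#include <stddef.h>

#define TOY_PG_HEAD (1u << 0)

struct toy_page {
    unsigned int flags;
    int order;             /* stands in for the PAGE_FLAGS_SECOND data */
    size_t compound_head;  /* 0 = not a tail page, else 1 (points at p[0]) */
    const void *mapping;   /* TAIL_MAPPING-style poison on tail pages */
};

static const char toy_tail_mapping; /* stand-in for the TAIL_MAPPING poison */

static void toy_prep_compound(struct toy_page *p, size_t n, int order)
{
    p[0].flags |= TOY_PG_HEAD;
    p[0].order = order;
    for (size_t i = 1; i < n; i++) {
        p[i].compound_head = 1;
        p[i].mapping = &toy_tail_mapping;
    }
}

/* the undo free_pages_prepare() performs before pages go back to free */
static void toy_undo_prep_compound(struct toy_page *p, size_t n)
{
    p[0].flags &= ~TOY_PG_HEAD;   /* clear PG_head */
    p[0].order = 0;               /* clear folio->order */
    for (size_t i = 1; i < n; i++) {
        p[i].compound_head = 0;   /* clear tail linkage */
        p[i].mapping = NULL;      /* clear TAIL_MAPPING poison */
    }
}
```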
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:34 ` Zi Yan @ 2026-01-12 23:53 ` Jason Gunthorpe 2026-01-13 0:35 ` Zi Yan 0 siblings, 1 reply; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 23:53 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 06:34:06PM -0500, Zi Yan wrote: > page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order. > > free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and > compound_head at the end. > > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound > pages. > > These three parts undo prep_compound_page(). Well, mm doesn't clear all things on alloc.. > In current nouveau code, ->free_folios is used holding the freed folio. > In nouveau_dmem_page_alloc_locked(), the freed folio is passed to > zone_device_folio_init(). If the allocated folio order is different > from the freed folio order, I do not know how you are going to keep > track of the rest of the freed folio. Of course you can implement a > buddy allocator there. nouveau doesn't support high order folios. A simple linked list is not really a suitable data structure to ever support high order folios with.. 
If it were to use such a thing, and did want to take a high order folio off the list, and reduce its order, then it would have to put the remainder back on the list with a revised order value. That's all, nothing hard. Again if the driver needs to store information in the struct page to manage its free list mechanism (ie linked pointers, order, whatever) then it should be doing that directly. When it takes the memory range off the free list it should call zone_device_page_init() to make it ready to be used again. I think it is a poor argument to say that zone_device_page_init() should rely on values already in the struct page to work properly :\ The usable space within the struct page, and what values must be fixed for correct system function, should exactly mirror what frozen pages require. After free it is effectively now a frozen page owned by the device driver. I haven't seen any documentation on that, but I suspect Matthew and David have some ideas.. If there is a reason for order, flags and mapping to be something particular then it should flow from the definition of frozen pages, and be documented, IMHO. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
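The order-aware free list described above can be sketched like this (a hypothetical structure; a real driver would keep this state in the struct page itself rather than in heap nodes): popping a smaller allocation out of a larger free range pushes the remainder back with revised order values, buddy-style.

```c
/* Toy order-aware free list: each free range carries its own order,
 * and taking a smaller allocation out of a larger range puts the
 * remainder back with revised orders. Names are illustrative. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct free_range {
    unsigned long pfn;
    int order;
    struct free_range *next;
};

static struct free_range *free_list;

static void free_list_push(unsigned long pfn, int order)
{
    struct free_range *r = malloc(sizeof(*r));
    r->pfn = pfn;
    r->order = order;
    r->next = free_list;
    free_list = r;
}

/* Pop a range of exactly `order` pages (2^order); split a bigger one
 * if needed, returning the remainder to the list as smaller buddies. */
static long free_list_pop(int order)
{
    for (struct free_range **p = &free_list; *p; p = &(*p)->next) {
        if ((*p)->order >= order) {
            struct free_range *r = *p;
            unsigned long pfn = r->pfn;
            *p = r->next;
            /* push back the upper halves, halving each time */
            for (int o = r->order; o > order; o--)
                free_list_push(pfn + (1ul << (o - 1)), o - 1);
            free(r);
            return (long)pfn;
        }
    }
    return -1; /* nothing big enough */
}
```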
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:53 ` Jason Gunthorpe @ 2026-01-13 0:35 ` Zi Yan 0 siblings, 0 replies; 39+ messages in thread From: Zi Yan @ 2026-01-13 0:35 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 18:53, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 06:34:06PM -0500, Zi Yan wrote: >> page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order. >> >> free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and >> compound_head at the end. >> >> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound >> pages. >> >> These three parts undo prep_compound_page(). > > Well, mm doesn't clear all things on alloc.. > >> In current nouveau code, ->free_folios is used holding the freed folio. >> In nouveau_dmem_page_alloc_locked(), the freed folio is passed to >> zone_device_folio_init(). If the allocated folio order is different >> from the freed folio order, I do not know how you are going to keep >> track of the rest of the freed folio. Of course you can implement a >> buddy allocator there. > > nouveau doesn't support high order folios. > > A simple linked list is not really a suitable data structure to ever > support high order folios with.. 
If it were to use such a thing, and > did want to take a high order folio off the list, and reduce its > order, then it would have to put the remainder back on the list with a > revised order value. That's all, nothing hard. > > Again if the driver needs to store information in the struct page to > manage its free list mechanism (ie linked pointers, order, whatever) > then it should be doing that directly. > > When it takes the memory range off the free list it should call > zone_device_page_init() to make it ready to be used again. I think it > is a poor argument to say that zone_device_page_init() should rely on > values already in the struct page to work properly :\ > > The usable space within the struct page, and what values must be fixed > for correct system function, should exactly mirror what frozen pages > require. After free it is effectively now a frozen page owned by the > device driver. > > I haven't seen any documentation on that, but I suspect Matthew and > David have some ideas.. > > If there is a reason for order, flags and mapping to be something > particular then it should flow from the definition of frozen pages, > and be documented, IMHO. Thank you for the explanation. It seems that I do not have enough knowledge to comment on device private pages. I will refrain from doing so from now on. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 16:50 ` Jason Gunthorpe 2026-01-12 17:46 ` Zi Yan @ 2026-01-12 23:07 ` Matthew Brost 1 sibling, 0 replies; 39+ messages in thread From: Matthew Brost @ 2026-01-12 23:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Zi Yan, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 12:50:01PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: > > > folio_free() > > > > > > 1) Allocator finds free memory > > > 2) zone_device_page_init() allocates the memory and makes refcount=1 > > > 3) __folio_put() knows the recount 0. > > > 4) free_zone_device_folio() calls folio_free(), but it doesn't > > > actually need to undo prep_compound_page() because *NOTHING* can > > > use the page pointer at this point. > > > 5) Driver puts the memory back into the allocator and now #1 can > > > happen. It knows how much memory to put back because folio->order > > > is valid from #2 > > > 6) #1 happens again, then #2 happens again and the folio is in the > > > right state for use. The successor #2 fully undoes the work of the > > > predecessor #2. > > > > But how can a successor #2 undo the work if the second #1 only allocates > > half of the original folio? 
For example, an order-9 at PFN 0 is > > allocated and freed, then an order-8 at PFN 0 is allocated and another > > order-8 at PFN 256 is allocated. How can two #2s undo the same order-9 > > without corrupting each other’s data? > > What do you mean? The fundamental rule is you can't read the folio or > the order outside folio_free once its refcount reaches 0. > > So the successor #2 will write updated heads and order to the order 8 > pages at PFN 0 and the ones starting at PFN 256 will remain with > garbage. > > This is OK because nothing is allowed to read them as their refcount > is 0. > > If later PFN 256 is allocated then it will get updated head and order > at the same time its refcount becomes 1. > > There is no corruption and they don't corrupt each other's data. > > > > If the allocator is using the struct page memory then step #5 should > > > also clean up the struct page with the allocator data before returning > > > it to the allocator. > > > > Do you mean ->folio_free() callback should undo prep_compound_page() > > instead? > > I wouldn't say undo, I was very careful to say it needs to get the > struct page memory into a state that the allocator algorithm expects, > whatever that means. > Hi Jason, A lot of back and forth with Zi — if I’m understanding correctly, your suggestion is to just call free_zone_device_folio_prepare() [1] in ->folio_free() if required by the driver. This is the function that puts struct page into a state my allocator expects. That works just fine for me. Matt [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 > Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 13:45 ` Jason Gunthorpe 2026-01-12 16:31 ` Zi Yan @ 2026-01-12 21:49 ` Matthew Brost 2026-01-12 23:15 ` Zi Yan 1 sibling, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-12 21:49 UTC (permalink / raw) To: Jason Gunthorpe Cc: Zi Yan, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote: Hi, catching up here. > On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: > > On 11 Jan 2026, at 19:19, Balbir Singh wrote: > > > > > On 1/12/26 08:35, Matthew Wilcox wrote: > > >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: > > >>> The core MM splits the folio before calling folio_free, restoring the > > >>> zone pages associated with the folio to an initialized state (e.g., > > >>> non-compound, pgmap valid, etc...). The order argument represents the > > >>> folio’s order prior to the split which can be used driver side to know > > >>> how many pages are being freed. > > >> > > >> This really feels like the wrong way to fix this problem. > > >> > > > > Hi Matthew, > > > > I think the wording is confusing, since the actual issue is that: > > > > 1. zone_device_page_init() calls prep_compound_page() to form a large folio, > > 2. but free_zone_device_folio() never reverse the course, > > 3. 
the undo of prep_compound_page() in free_zone_device_folio() needs to > > be done before driver callback ->folio_free(), since once ->folio_free() > > is called, the folio can be reallocated immediately, > > 4. after the undo of prep_compound_page(), folio_order() can no longer provide > > the original order information, thus, folio_free() needs that for proper > > device side ref manipulation. > > There is something wrong with the driver if the "folio can be > reallocated immediately". > > The flow generally expects there to be a driver allocator linked to > folio_free() > > 1) Allocator finds free memory > 2) zone_device_page_init() allocates the memory and makes refcount=1 > 3) __folio_put() knows the recount 0. > 4) free_zone_device_folio() calls folio_free(), but it doesn't > actually need to undo prep_compound_page() because *NOTHING* can > use the page pointer at this point. Correct—nothing can use the folio prior to calling folio_free(). Once folio_free() returns, the driver side is free to immediately reallocate the folio (or a subset of its pages). > 5) Driver puts the memory back into the allocator and now #1 can > happen. It knows how much memory to put back because folio->order > is valid from #2 > 6) #1 happens again, then #2 happens again and the folio is in the > right state for use. The successor #2 fully undoes the work of the > predecessor #2. > > If you have races where #1 can happen immediately after #3 then the > driver design is fundamentally broken and passing around order isn't > going to help anything. > The above race does not exist; if it did, I agree we’d be solving nothing here. > If the allocator is using the struct page memory then step #5 should > also clean up the struct page with the allocator data before returning > it to the allocator. > We could move the call to free_zone_device_folio_prepare() [1] into the driver-side implementation of ->folio_free() and drop the order argument here. 
Zi didn’t particularly like that; he preferred calling free_zone_device_folio_prepare() [2] before invoking ->folio_free(), which is why this patch exists. FWIW, I do not have a strong opinion here—either way works. Xe doesn’t actually need the order regardless of where free_zone_device_folio_prepare() is called, but Nouveau does need the order if free_zone_device_folio_prepare() is called before ->folio_free(). [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405 > I vaguely remember talking about this before in the context of the Xe > driver.. You can't just take an existing VRAM allocator and layer it > on top of the folios and have it broadly ignore the folio_free > callback. > We are definitely not ignoring the ->folio_free callback—that is the point at which we tell our VRAM allocator (DRM buddy) it is okay to release the allocation and make it available for reuse. Matt > Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 21:49 ` Matthew Brost @ 2026-01-12 23:15 ` Zi Yan 2026-01-12 23:22 ` Matthew Brost 2026-01-12 23:31 ` Jason Gunthorpe 0 siblings, 2 replies; 39+ messages in thread From: Zi Yan @ 2026-01-12 23:15 UTC (permalink / raw) To: Matthew Brost Cc: Jason Gunthorpe, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 16:49, Matthew Brost wrote: > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote: > > Hi, catching up here. > >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote: >>> >>>> On 1/12/26 08:35, Matthew Wilcox wrote: >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: >>>>>> The core MM splits the folio before calling folio_free, restoring the >>>>>> zone pages associated with the folio to an initialized state (e.g., >>>>>> non-compound, pgmap valid, etc...). The order argument represents the >>>>>> folio’s order prior to the split which can be used driver side to know >>>>>> how many pages are being freed. >>>>> >>>>> This really feels like the wrong way to fix this problem. >>>>> >>> >>> Hi Matthew, >>> >>> I think the wording is confusing, since the actual issue is that: >>> >>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio, >>> 2. but free_zone_device_folio() never reverse the course, >>> 3. 
the undo of prep_compound_page() in free_zone_device_folio() needs to >>> be done before driver callback ->folio_free(), since once ->folio_free() >>> is called, the folio can be reallocated immediately, >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide >>> the original order information, thus, folio_free() needs that for proper >>> device side ref manipulation. >> >> There is something wrong with the driver if the "folio can be >> reallocated immediately". >> >> The flow generally expects there to be a driver allocator linked to >> folio_free() >> >> 1) Allocator finds free memory >> 2) zone_device_page_init() allocates the memory and makes refcount=1 >> 3) __folio_put() knows the recount 0. >> 4) free_zone_device_folio() calls folio_free(), but it doesn't >> actually need to undo prep_compound_page() because *NOTHING* can >> use the page pointer at this point. > > Correct—nothing can use the folio prior to calling folio_free(). Once > folio_free() returns, the driver side is free to immediately reallocate > the folio (or a subset of its pages). > >> 5) Driver puts the memory back into the allocator and now #1 can >> happen. It knows how much memory to put back because folio->order >> is valid from #2 >> 6) #1 happens again, then #2 happens again and the folio is in the >> right state for use. The successor #2 fully undoes the work of the >> predecessor #2. >> >> If you have races where #1 can happen immediately after #3 then the >> driver design is fundamentally broken and passing around order isn't >> going to help anything. >> > > The above race does not exist; if it did, I agree we’d be solving > nothing here. > >> If the allocator is using the struct page memory then step #5 should >> also clean up the struct page with the allocator data before returning >> it to the allocator. 
>> > > We could move the call to free_zone_device_folio_prepare() [1] into the > driver-side implementation of ->folio_free() and drop the order argument > here. Zi didn’t particularly like that; he preferred calling > free_zone_device_folio_prepare() [2] before invoking ->folio_free(), > which is why this patch exists. On a second thought, if calling free_zone_device_folio_prepare() in ->folio_free() works, feel free to do so. > > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t > actually need the order regardless of where > free_zone_device_folio_prepare() is called, but Nouveau does need the > order if free_zone_device_folio_prepare() is called before > ->folio_free(). > > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405 > >> I vaugely remember talking about this before in the context of the Xe >> driver.. You can't just take an existing VRAM allocator and layer it >> on top of the folios and have it broadly ignore the folio_free >> callback. >> > > We are definitely not ignoring the ->folio_free callback—that is the > point at which we tell our VRAM allocator (DRM buddy) it is okay to > release the allocation and make it available for reuse. > > Matt > >> Jsaon Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:15 ` Zi Yan @ 2026-01-12 23:22 ` Matthew Brost 2026-01-12 23:44 ` Alistair Popple 2026-01-12 23:31 ` Jason Gunthorpe 1 sibling, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-12 23:22 UTC (permalink / raw) To: Zi Yan Cc: Jason Gunthorpe, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote: > On 12 Jan 2026, at 16:49, Matthew Brost wrote: > > > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote: > > > > Hi, catching up here. > > > >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: > >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote: > >>> > >>>> On 1/12/26 08:35, Matthew Wilcox wrote: > >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: > >>>>>> The core MM splits the folio before calling folio_free, restoring the > >>>>>> zone pages associated with the folio to an initialized state (e.g., > >>>>>> non-compound, pgmap valid, etc...). The order argument represents the > >>>>>> folio’s order prior to the split which can be used driver side to know > >>>>>> how many pages are being freed. > >>>>> > >>>>> This really feels like the wrong way to fix this problem. > >>>>> > >>> > >>> Hi Matthew, > >>> > >>> I think the wording is confusing, since the actual issue is that: > >>> > >>> 1. 
zone_device_page_init() calls prep_compound_page() to form a large folio, > >>> 2. but free_zone_device_folio() never reverse the course, > >>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to > >>> be done before driver callback ->folio_free(), since once ->folio_free() > >>> is called, the folio can be reallocated immediately, > >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide > >>> the original order information, thus, folio_free() needs that for proper > >>> device side ref manipulation. > >> > >> There is something wrong with the driver if the "folio can be > >> reallocated immediately". > >> > >> The flow generally expects there to be a driver allocator linked to > >> folio_free() > >> > >> 1) Allocator finds free memory > >> 2) zone_device_page_init() allocates the memory and makes refcount=1 > >> 3) __folio_put() knows the recount 0. > >> 4) free_zone_device_folio() calls folio_free(), but it doesn't > >> actually need to undo prep_compound_page() because *NOTHING* can > >> use the page pointer at this point. > > > > Correct—nothing can use the folio prior to calling folio_free(). Once > > folio_free() returns, the driver side is free to immediately reallocate > > the folio (or a subset of its pages). > > > >> 5) Driver puts the memory back into the allocator and now #1 can > >> happen. It knows how much memory to put back because folio->order > >> is valid from #2 > >> 6) #1 happens again, then #2 happens again and the folio is in the > >> right state for use. The successor #2 fully undoes the work of the > >> predecessor #2. > >> > >> If you have races where #1 can happen immediately after #3 then the > >> driver design is fundamentally broken and passing around order isn't > >> going to help anything. > >> > > > > The above race does not exist; if it did, I agree we’d be solving > > nothing here. 
> > > >> If the allocator is using the struct page memory then step #5 should > >> also clean up the struct page with the allocator data before returning > >> it to the allocator. > >> > > > > We could move the call to free_zone_device_folio_prepare() [1] into the > > driver-side implementation of ->folio_free() and drop the order argument > > here. Zi didn’t particularly like that; he preferred calling > > free_zone_device_folio_prepare() [2] before invoking ->folio_free(), > > which is why this patch exists. > > On a second thought, if calling free_zone_device_folio_prepare() in > ->folio_free() works, feel free to do so. > +1, testing this change right now and it does indeed work. Matt > > > > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t > > actually need the order regardless of where > > free_zone_device_folio_prepare() is called, but Nouveau does need the > > order if free_zone_device_folio_prepare() is called before > > ->folio_free(). > > > > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 > > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405 > > > >> I vaugely remember talking about this before in the context of the Xe > >> driver.. You can't just take an existing VRAM allocator and layer it > >> on top of the folios and have it broadly ignore the folio_free > >> callback. > >> > > > > We are definitely not ignoring the ->folio_free callback—that is the > > point at which we tell our VRAM allocator (DRM buddy) it is okay to > > release the allocation and make it available for reuse. > > > > Matt > > > >> Jsaon > > > Best Regards, > Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:22 ` Matthew Brost @ 2026-01-12 23:44 ` Alistair Popple 2026-01-12 23:54 ` Jason Gunthorpe 0 siblings, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-12 23:44 UTC (permalink / raw) To: Matthew Brost Cc: Zi Yan, Jason Gunthorpe, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 2026-01-13 at 10:22 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote: > > On 12 Jan 2026, at 16:49, Matthew Brost wrote: > > > > > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote: > > > > > > Hi, catching up here. > > > > > >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: > > >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote: > > >>> > > >>>> On 1/12/26 08:35, Matthew Wilcox wrote: > > >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: > > >>>>>> The core MM splits the folio before calling folio_free, restoring the > > >>>>>> zone pages associated with the folio to an initialized state (e.g., > > >>>>>> non-compound, pgmap valid, etc...). The order argument represents the > > >>>>>> folio’s order prior to the split which can be used driver side to know > > >>>>>> how many pages are being freed. > > >>>>> > > >>>>> This really feels like the wrong way to fix this problem. 
> > >>>>> > > >>> > > >>> Hi Matthew, > > >>> > > >>> I think the wording is confusing, since the actual issue is that: > > >>> > > >>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio, > > >>> 2. but free_zone_device_folio() never reverse the course, > > >>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to > > >>> be done before driver callback ->folio_free(), since once ->folio_free() > > >>> is called, the folio can be reallocated immediately, > > >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide > > >>> the original order information, thus, folio_free() needs that for proper > > >>> device side ref manipulation. > > >> > > >> There is something wrong with the driver if the "folio can be > > >> reallocated immediately". > > >> > > >> The flow generally expects there to be a driver allocator linked to > > >> folio_free() > > >> > > >> 1) Allocator finds free memory > > >> 2) zone_device_page_init() allocates the memory and makes refcount=1 > > >> 3) __folio_put() knows the recount 0. > > >> 4) free_zone_device_folio() calls folio_free(), but it doesn't > > >> actually need to undo prep_compound_page() because *NOTHING* can > > >> use the page pointer at this point. > > > > > > Correct—nothing can use the folio prior to calling folio_free(). Once > > > folio_free() returns, the driver side is free to immediately reallocate > > > the folio (or a subset of its pages). > > > > > >> 5) Driver puts the memory back into the allocator and now #1 can > > >> happen. It knows how much memory to put back because folio->order > > >> is valid from #2 > > >> 6) #1 happens again, then #2 happens again and the folio is in the > > >> right state for use. The successor #2 fully undoes the work of the > > >> predecessor #2. 
> > >> > > >> If you have races where #1 can happen immediately after #3 then the > > >> driver design is fundamentally broken and passing around order isn't > > >> going to help anything. > > >> > > > > > > The above race does not exist; if it did, I agree we’d be solving > > > nothing here. > > > > > >> If the allocator is using the struct page memory then step #5 should > > >> also clean up the struct page with the allocator data before returning > > >> it to the allocator. > > >> > > > > > > We could move the call to free_zone_device_folio_prepare() [1] into the > > > driver-side implementation of ->folio_free() and drop the order argument > > > here. Zi didn’t particularly like that; he preferred calling > > > free_zone_device_folio_prepare() [2] before invoking ->folio_free(), > > > which is why this patch exists. > > > > On a second thought, if calling free_zone_device_folio_prepare() in > > ->folio_free() works, feel free to do so. I think making drivers do this is the correct approach and is consistent with what P2PDMA and DAX does. All the interfaces for mapping a ZONE_DEVICE folio currently rely on the driver correctly initialising the folio, so this special case for ZONE_DEVICE_PRIVATE/COHERENT seemed weird to me - they shouldn't rely on the core-mm to do some of the re-initialisation in the free paths. Also drivers may have different strategies than just resetting everything back to small pages. For example the may choose to only ever allocate large folios making the whole clearing/resetting of folio fields superfluous. - Alistair > +1, testing this change right now and it does indeed work. > > Matt > > > > > > > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t > > > actually need the order regardless of where > > > free_zone_device_folio_prepare() is called, but Nouveau does need the > > > order if free_zone_device_folio_prepare() is called before > > > ->folio_free(). 
> > > > > > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 > > > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405 > > > > > >> I vaugely remember talking about this before in the context of the Xe > > >> driver.. You can't just take an existing VRAM allocator and layer it > > >> on top of the folios and have it broadly ignore the folio_free > > >> callback. > > >> > > > > > > We are definitely not ignoring the ->folio_free callback—that is the > > > point at which we tell our VRAM allocator (DRM buddy) it is okay to > > > release the allocation and make it available for reuse. > > > > > > Matt > > > > > >> Jsaon > > > > > > Best Regards, > > Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:44 ` Alistair Popple @ 2026-01-12 23:54 ` Jason Gunthorpe 0 siblings, 0 replies; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 23:54 UTC (permalink / raw) To: Alistair Popple Cc: Matthew Brost, Zi Yan, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Tue, Jan 13, 2026 at 10:44:27AM +1100, Alistair Popple wrote: > Also drivers may have different strategies than just resetting everything back > to small pages. For example the may choose to only ever allocate large folios > making the whole clearing/resetting of folio fields superfluous. +1 Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:15 ` Zi Yan 2026-01-12 23:22 ` Matthew Brost @ 2026-01-12 23:31 ` Jason Gunthorpe 1 sibling, 0 replies; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 23:31 UTC (permalink / raw) To: Zi Yan Cc: Matthew Brost, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote: > > We could move the call to free_zone_device_folio_prepare() [1] into the > > driver-side implementation of ->folio_free() and drop the order argument > > here. Zi didn’t particularly like that; he preferred calling > > free_zone_device_folio_prepare() [2] before invoking ->folio_free(), > > which is why this patch exists. > > On a second thought, if calling free_zone_device_folio_prepare() in > ->folio_free() works, feel free to do so. 
I don't think there is anything "prepare" about free_zone_device_folio_prepare(): it effectively zeros the struct page memory - ie it undoes some amount of zone_device_page_init() - and AFAIK there are only two reasons to do this: 1) It helps catch bugs where things are UAF'ing the folio, now they read back zeros (it also creates bugs where zero might be OK, so you might be better to poison it under a debug flag) 2) It avoids the allocate side having to zero the page memory - and perhaps the allocate side is not doing a good job of this right now but I think you should state a position why it makes more sense for the free side to do this instead of the allocate side. IOW why should it be mandatory to call free_zone_device_folio_prepare() prior to zone_device_page_init()? Certainly if the only reason you are passing the order is because the core code zero'd the order too early, that doesn't make a lot of sense. I think calling the deinit function paired with zone_device_page_init() within the driver does make a lot of sense and I see no issue with that. But please name it more sensibly and describe concretely why it should be split up like this. Because what I see is you write to all the folios on free and then write to them all again on allocation - which is 2x the cost of what is probably really needed... Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast 2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast @ 2026-01-11 20:55 ` Francois Dugast 2026-01-12 0:44 ` Balbir Singh 1 sibling, 1 reply; 39+ messages in thread From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw) To: intel-xe Cc: dri-devel, Matthew Brost, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Balbir Singh, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel, Francois Dugast From: Matthew Brost <matthew.brost@intel.com> Add free_zone_device_folio_prepare(), a helper that restores large ZONE_DEVICE folios to a sane, initial state before freeing them. Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and compound metadata). Before returning such pages to the device pgmap allocator, each constituent page must be reset to a standalone ZONE_DEVICE folio with a valid pgmap and no compound state. Use this helper prior to folio_free() for device-private and device-coherent folios to ensure consistent device page state for subsequent allocations. Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Cc: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: linux-mm@kvack.org Cc: linux-cxl@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Francois Dugast <francois.dugast@intel.com> --- include/linux/memremap.h | 1 + mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 56 insertions(+) diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 97fcffeb1c1e..88e1d4707296 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) #ifdef CONFIG_ZONE_DEVICE void zone_device_page_init(struct page *page, unsigned int order); +void free_zone_device_folio_prepare(struct folio *folio); void *memremap_pages(struct dev_pagemap *pgmap, int nid); void memunmap_pages(struct dev_pagemap *pgmap); void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); diff --git a/mm/memremap.c b/mm/memremap.c index 39dc4bd190d0..375a61e18858 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) } EXPORT_SYMBOL_GPL(get_dev_pagemap); +/** + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. + * @folio: ZONE_DEVICE folio to prepare for release. + * + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages + * must be restored to a sane ZONE_DEVICE state before they are released. + * + * This helper: + * - Clears @folio->mapping and, for compound folios, clears each page's + * compound-head state (ClearPageHead()/clear_compound_head()). 
+ * - Resets the compound order metadata (folio_reset_order()) and then + * initializes each constituent page as a standalone ZONE_DEVICE folio: + * * clears ->mapping + * * restores ->pgmap (prep_compound_page() overwrites it) + * * clears ->share (only relevant for fsdax; unused for device-private) + * + * If @folio is order-0, only the mapping is cleared and no further work is + * required. + */ +void free_zone_device_folio_prepare(struct folio *folio) +{ + struct dev_pagemap *pgmap = page_pgmap(&folio->page); + int order, i; + + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); + + folio->mapping = NULL; + order = folio_order(folio); + if (!order) + return; + + folio_reset_order(folio); + + for (i = 0; i < (1UL << order); i++) { + struct page *page = folio_page(folio, i); + struct folio *new_folio = (struct folio *)page; + + ClearPageHead(page); + clear_compound_head(page); + + new_folio->mapping = NULL; + /* + * Reset pgmap which was over-written by + * prep_compound_page(). + */ + new_folio->pgmap = pgmap; + new_folio->share = 0; /* fsdax only, unused for device private */ + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); + } +} +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); + void free_zone_device_folio(struct folio *folio) { struct dev_pagemap *pgmap = folio->pgmap; @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) case MEMORY_DEVICE_COHERENT: if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) break; + free_zone_device_folio_prepare(folio); pgmap->ops->folio_free(folio, order); percpu_ref_put_many(&folio->pgmap->ref, nr); break; -- 2.43.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast @ 2026-01-12 0:44 ` Balbir Singh 2026-01-12 1:16 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Balbir Singh @ 2026-01-12 0:44 UTC (permalink / raw) To: Francois Dugast, intel-xe Cc: dri-devel, Matthew Brost, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On 1/12/26 06:55, Francois Dugast wrote: > From: Matthew Brost <matthew.brost@intel.com> > > Add free_zone_device_folio_prepare(), a helper that restores large > ZONE_DEVICE folios to a sane, initial state before freeing them. > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > compound metadata). Before returning such pages to the device pgmap > allocator, each constituent page must be reset to a standalone > ZONE_DEVICE folio with a valid pgmap and no compound state. > > Use this helper prior to folio_free() for device-private and > device-coherent folios to ensure consistent device page state for > subsequent allocations. > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > Cc: Zi Yan <ziy@nvidia.com> > Cc: David Hildenbrand <david@kernel.org> > Cc: Oscar Salvador <osalvador@suse.de> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Balbir Singh <balbirs@nvidia.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Mike Rapoport <rppt@kernel.org> > Cc: Suren Baghdasaryan <surenb@google.com> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Alistair Popple <apopple@nvidia.com> > Cc: linux-mm@kvack.org > Cc: linux-cxl@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Suggested-by: Alistair Popple <apopple@nvidia.com> > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > --- > include/linux/memremap.h | 1 + > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 56 insertions(+) > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 97fcffeb1c1e..88e1d4707296 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > #ifdef CONFIG_ZONE_DEVICE > void zone_device_page_init(struct page *page, unsigned int order); > +void free_zone_device_folio_prepare(struct folio *folio); > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > void memunmap_pages(struct dev_pagemap *pgmap); > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > diff --git a/mm/memremap.c b/mm/memremap.c > index 39dc4bd190d0..375a61e18858 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > } > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > +/** > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > + * @folio: ZONE_DEVICE folio to prepare for release. > + * > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > + * must be restored to a sane ZONE_DEVICE state before they are released. 
> + * > + * This helper: > + * - Clears @folio->mapping and, for compound folios, clears each page's > + * compound-head state (ClearPageHead()/clear_compound_head()). > + * - Resets the compound order metadata (folio_reset_order()) and then > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > + * * clears ->mapping > + * * restores ->pgmap (prep_compound_page() overwrites it) > + * * clears ->share (only relevant for fsdax; unused for device-private) > + * > + * If @folio is order-0, only the mapping is cleared and no further work is > + * required. > + */ > +void free_zone_device_folio_prepare(struct folio *folio) > +{ > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > + int order, i; > + > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > + > + folio->mapping = NULL; > + order = folio_order(folio); > + if (!order) > + return; > + > + folio_reset_order(folio); > + > + for (i = 0; i < (1UL << order); i++) { > + struct page *page = folio_page(folio, i); > + struct folio *new_folio = (struct folio *)page; > + > + ClearPageHead(page); > + clear_compound_head(page); > + > + new_folio->mapping = NULL; > + /* > + * Reset pgmap which was over-written by > + * prep_compound_page(). > + */ > + new_folio->pgmap = pgmap; > + new_folio->share = 0; /* fsdax only, unused for device private */ > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); Does calling the free_folio() callback on new_folio solve the issue you are facing, or is that PMD_ORDER more frees than we'd like? 
> + } > +} > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > + > void free_zone_device_folio(struct folio *folio) > { > struct dev_pagemap *pgmap = folio->pgmap; > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > case MEMORY_DEVICE_COHERENT: > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > break; > + free_zone_device_folio_prepare(folio); > pgmap->ops->folio_free(folio, order); > percpu_ref_put_many(&folio->pgmap->ref, nr); > break; Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 0:44 ` Balbir Singh @ 2026-01-12 1:16 ` Matthew Brost 2026-01-12 2:15 ` Balbir Singh 2026-01-12 23:58 ` Alistair Popple 0 siblings, 2 replies; 39+ messages in thread From: Matthew Brost @ 2026-01-12 1:16 UTC (permalink / raw) To: Balbir Singh Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > On 1/12/26 06:55, Francois Dugast wrote: > > From: Matthew Brost <matthew.brost@intel.com> > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > compound metadata). Before returning such pages to the device pgmap > > allocator, each constituent page must be reset to a standalone > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > Use this helper prior to folio_free() for device-private and > > device-coherent folios to ensure consistent device page state for > > subsequent allocations. > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > Cc: Zi Yan <ziy@nvidia.com> > > Cc: David Hildenbrand <david@kernel.org> > > Cc: Oscar Salvador <osalvador@suse.de> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Balbir Singh <balbirs@nvidia.com> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > > Cc: Vlastimil Babka <vbabka@suse.cz> > > Cc: Mike Rapoport <rppt@kernel.org> > > Cc: Suren Baghdasaryan <surenb@google.com> > > Cc: Michal Hocko <mhocko@suse.com> > > Cc: Alistair Popple <apopple@nvidia.com> > > Cc: linux-mm@kvack.org > > Cc: linux-cxl@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > --- > > include/linux/memremap.h | 1 + > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 56 insertions(+) > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > index 97fcffeb1c1e..88e1d4707296 100644 > > --- a/include/linux/memremap.h > > +++ b/include/linux/memremap.h > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > #ifdef CONFIG_ZONE_DEVICE > > void zone_device_page_init(struct page *page, unsigned int order); > > +void free_zone_device_folio_prepare(struct folio *folio); > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > void memunmap_pages(struct dev_pagemap *pgmap); > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > diff --git a/mm/memremap.c b/mm/memremap.c > > index 39dc4bd190d0..375a61e18858 100644 > > --- a/mm/memremap.c > > +++ b/mm/memremap.c > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > } > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > +/** > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > + * @folio: ZONE_DEVICE folio to prepare for release. > > + * > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > + * must be restored to a sane ZONE_DEVICE state before they are released. 
> > + * > > + * This helper: > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > + * - Resets the compound order metadata (folio_reset_order()) and then > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > + * * clears ->mapping > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > + * > > + * If @folio is order-0, only the mapping is cleared and no further work is > > + * required. > > + */ > > +void free_zone_device_folio_prepare(struct folio *folio) > > +{ > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > + int order, i; > > + > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > + > > + folio->mapping = NULL; > > + order = folio_order(folio); > > + if (!order) > > + return; > > + > > + folio_reset_order(folio); > > + > > + for (i = 0; i < (1UL << order); i++) { > > + struct page *page = folio_page(folio, i); > > + struct folio *new_folio = (struct folio *)page; > > + > > + ClearPageHead(page); > > + clear_compound_head(page); > > + > > + new_folio->mapping = NULL; > > + /* > > + * Reset pgmap which was over-written by > > + * prep_compound_page(). > > + */ > > + new_folio->pgmap = pgmap; > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > that PMD_ORDER more frees than we'd like? > No, calling free_folio() more often doesn’t solve anything—in fact, that would make my implementation explode. I explained this in detail here [1] to Zi. To recap [1], my memory allocator has no visibility into individual pages or folios; it is DRM Buddy layered on top of TTM BO. 
This design allows VRAM to be allocated or evicted for both traditional GPU allocations (GEMs) and SVM allocations. Now, to recap the actual issue: if device folios are not split upon free and are later reallocated with a different order in zone_device_page_init, the implementation breaks. This problem is not specific to Xe—Nouveau happens to always allocate at the same order, so it works by coincidence. Reallocating at a different order is valid behavior and must be supported. Matt [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > + } > > +} > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > + > > void free_zone_device_folio(struct folio *folio) > > { > > struct dev_pagemap *pgmap = folio->pgmap; > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > case MEMORY_DEVICE_COHERENT: > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > break; > > + free_zone_device_folio_prepare(folio); > > pgmap->ops->folio_free(folio, order); > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > break; > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 1:16 ` Matthew Brost @ 2026-01-12 2:15 ` Balbir Singh 2026-01-12 2:37 ` Matthew Brost 2026-01-12 23:58 ` Alistair Popple 1 sibling, 1 reply; 39+ messages in thread From: Balbir Singh @ 2026-01-12 2:15 UTC (permalink / raw) To: Matthew Brost Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On 1/12/26 11:16, Matthew Brost wrote: > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: >> On 1/12/26 06:55, Francois Dugast wrote: >>> From: Matthew Brost <matthew.brost@intel.com> >>> >>> Add free_zone_device_folio_prepare(), a helper that restores large >>> ZONE_DEVICE folios to a sane, initial state before freeing them. >>> >>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and >>> compound metadata). Before returning such pages to the device pgmap >>> allocator, each constituent page must be reset to a standalone >>> ZONE_DEVICE folio with a valid pgmap and no compound state. >>> >>> Use this helper prior to folio_free() for device-private and >>> device-coherent folios to ensure consistent device page state for >>> subsequent allocations. >>> >>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") >>> Cc: Zi Yan <ziy@nvidia.com> >>> Cc: David Hildenbrand <david@kernel.org> >>> Cc: Oscar Salvador <osalvador@suse.de> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: Balbir Singh <balbirs@nvidia.com> >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >>> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> >>> Cc: Vlastimil Babka <vbabka@suse.cz> >>> Cc: Mike Rapoport <rppt@kernel.org> >>> Cc: Suren Baghdasaryan <surenb@google.com> >>> Cc: Michal Hocko <mhocko@suse.com> >>> Cc: Alistair Popple <apopple@nvidia.com> >>> Cc: linux-mm@kvack.org >>> Cc: linux-cxl@vger.kernel.org >>> Cc: linux-kernel@vger.kernel.org >>> Suggested-by: Alistair Popple <apopple@nvidia.com> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> >>> --- >>> include/linux/memremap.h | 1 + >>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ >>> 2 files changed, 56 insertions(+) >>> >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>> index 97fcffeb1c1e..88e1d4707296 100644 >>> --- a/include/linux/memremap.h >>> +++ b/include/linux/memremap.h >>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) >>> >>> #ifdef CONFIG_ZONE_DEVICE >>> void zone_device_page_init(struct page *page, unsigned int order); >>> +void free_zone_device_folio_prepare(struct folio *folio); >>> void *memremap_pages(struct dev_pagemap *pgmap, int nid); >>> void memunmap_pages(struct dev_pagemap *pgmap); >>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); >>> diff --git a/mm/memremap.c b/mm/memremap.c >>> index 39dc4bd190d0..375a61e18858 100644 >>> --- a/mm/memremap.c >>> +++ b/mm/memremap.c >>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) >>> } >>> EXPORT_SYMBOL_GPL(get_dev_pagemap); >>> >>> +/** >>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. >>> + * @folio: ZONE_DEVICE folio to prepare for release. >>> + * >>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) >>> + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages >>> + * must be restored to a sane ZONE_DEVICE state before they are released. 
>>> + * >>> + * This helper: >>> + * - Clears @folio->mapping and, for compound folios, clears each page's >>> + * compound-head state (ClearPageHead()/clear_compound_head()). >>> + * - Resets the compound order metadata (folio_reset_order()) and then >>> + * initializes each constituent page as a standalone ZONE_DEVICE folio: >>> + * * clears ->mapping >>> + * * restores ->pgmap (prep_compound_page() overwrites it) >>> + * * clears ->share (only relevant for fsdax; unused for device-private) >>> + * >>> + * If @folio is order-0, only the mapping is cleared and no further work is >>> + * required. >>> + */ >>> +void free_zone_device_folio_prepare(struct folio *folio) >>> +{ >>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page); >>> + int order, i; >>> + >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); >>> + >>> + folio->mapping = NULL; >>> + order = folio_order(folio); >>> + if (!order) >>> + return; >>> + >>> + folio_reset_order(folio); >>> + >>> + for (i = 0; i < (1UL << order); i++) { >>> + struct page *page = folio_page(folio, i); >>> + struct folio *new_folio = (struct folio *)page; >>> + >>> + ClearPageHead(page); >>> + clear_compound_head(page); >>> + >>> + new_folio->mapping = NULL; >>> + /* >>> + * Reset pgmap which was over-written by >>> + * prep_compound_page(). >>> + */ >>> + new_folio->pgmap = pgmap; >>> + new_folio->share = 0; /* fsdax only, unused for device private */ >>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); >> >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is >> that PMD_ORDER more frees than we'd like? >> > > No, calling free_folio() more often doesn’t solve anything—in fact, that > would make my implementation explode. I explained this in detail here [1] > to Zi. > > To recap [1], my memory allocator has no visibility into individual > pages or folios; it is DRM Buddy layered on top of TTM BO. 
This design > allows VRAM to be allocated or evicted for both traditional GPU > allocations (GEMs) and SVM allocations. > I assume it is still backed by pages that are ref counted? I suspect you'd need to convert one reference count to PMD_ORDER reference counts to make this change work, or are the references not at page granularity? I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed like a per folio refcount > Now, to recap the actual issue: if device folios are not split upon free > and are later reallocated with a different order in > zone_device_page_init, the implementation breaks. This problem is not > specific to Xe—Nouveau happens to always allocate at the same order, so > it works by coincidence. Reallocating at a different order is valid > behavior and must be supported. > Agreed > Matt > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > >>> + } >>> +} >>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); >>> + >>> void free_zone_device_folio(struct folio *folio) >>> { >>> struct dev_pagemap *pgmap = folio->pgmap; >>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) >>> case MEMORY_DEVICE_COHERENT: >>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) >>> break; >>> + free_zone_device_folio_prepare(folio); >>> pgmap->ops->folio_free(folio, order); >>> percpu_ref_put_many(&folio->pgmap->ref, nr); >>> break; >> >> Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 2:15 ` Balbir Singh @ 2026-01-12 2:37 ` Matthew Brost 2026-01-12 2:50 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-12 2:37 UTC (permalink / raw) To: Balbir Singh Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On Mon, Jan 12, 2026 at 01:15:12PM +1100, Balbir Singh wrote: > On 1/12/26 11:16, Matthew Brost wrote: > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > >> On 1/12/26 06:55, Francois Dugast wrote: > >>> From: Matthew Brost <matthew.brost@intel.com> > >>> > >>> Add free_zone_device_folio_prepare(), a helper that restores large > >>> ZONE_DEVICE folios to a sane, initial state before freeing them. > >>> > >>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > >>> compound metadata). Before returning such pages to the device pgmap > >>> allocator, each constituent page must be reset to a standalone > >>> ZONE_DEVICE folio with a valid pgmap and no compound state. > >>> > >>> Use this helper prior to folio_free() for device-private and > >>> device-coherent folios to ensure consistent device page state for > >>> subsequent allocations. > >>> > >>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > >>> Cc: Zi Yan <ziy@nvidia.com> > >>> Cc: David Hildenbrand <david@kernel.org> > >>> Cc: Oscar Salvador <osalvador@suse.de> > >>> Cc: Andrew Morton <akpm@linux-foundation.org> > >>> Cc: Balbir Singh <balbirs@nvidia.com> > >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > >>> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > >>> Cc: Vlastimil Babka <vbabka@suse.cz> > >>> Cc: Mike Rapoport <rppt@kernel.org> > >>> Cc: Suren Baghdasaryan <surenb@google.com> > >>> Cc: Michal Hocko <mhocko@suse.com> > >>> Cc: Alistair Popple <apopple@nvidia.com> > >>> Cc: linux-mm@kvack.org > >>> Cc: linux-cxl@vger.kernel.org > >>> Cc: linux-kernel@vger.kernel.org > >>> Suggested-by: Alistair Popple <apopple@nvidia.com> > >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> > >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> > >>> --- > >>> include/linux/memremap.h | 1 + > >>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > >>> 2 files changed, 56 insertions(+) > >>> > >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h > >>> index 97fcffeb1c1e..88e1d4707296 100644 > >>> --- a/include/linux/memremap.h > >>> +++ b/include/linux/memremap.h > >>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > >>> > >>> #ifdef CONFIG_ZONE_DEVICE > >>> void zone_device_page_init(struct page *page, unsigned int order); > >>> +void free_zone_device_folio_prepare(struct folio *folio); > >>> void *memremap_pages(struct dev_pagemap *pgmap, int nid); > >>> void memunmap_pages(struct dev_pagemap *pgmap); > >>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > >>> diff --git a/mm/memremap.c b/mm/memremap.c > >>> index 39dc4bd190d0..375a61e18858 100644 > >>> --- a/mm/memremap.c > >>> +++ b/mm/memremap.c > >>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > >>> } > >>> EXPORT_SYMBOL_GPL(get_dev_pagemap); > >>> > >>> +/** > >>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > >>> + * @folio: ZONE_DEVICE folio to prepare for release. > >>> + * > >>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > >>> + * can be compound. 
When freeing a compound ZONE_DEVICE folio, the tail pages > >>> + * must be restored to a sane ZONE_DEVICE state before they are released. > >>> + * > >>> + * This helper: > >>> + * - Clears @folio->mapping and, for compound folios, clears each page's > >>> + * compound-head state (ClearPageHead()/clear_compound_head()). > >>> + * - Resets the compound order metadata (folio_reset_order()) and then > >>> + * initializes each constituent page as a standalone ZONE_DEVICE folio: > >>> + * * clears ->mapping > >>> + * * restores ->pgmap (prep_compound_page() overwrites it) > >>> + * * clears ->share (only relevant for fsdax; unused for device-private) > >>> + * > >>> + * If @folio is order-0, only the mapping is cleared and no further work is > >>> + * required. > >>> + */ > >>> +void free_zone_device_folio_prepare(struct folio *folio) > >>> +{ > >>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > >>> + int order, i; > >>> + > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > >>> + > >>> + folio->mapping = NULL; > >>> + order = folio_order(folio); > >>> + if (!order) > >>> + return; > >>> + > >>> + folio_reset_order(folio); > >>> + > >>> + for (i = 0; i < (1UL << order); i++) { > >>> + struct page *page = folio_page(folio, i); > >>> + struct folio *new_folio = (struct folio *)page; > >>> + > >>> + ClearPageHead(page); > >>> + clear_compound_head(page); > >>> + > >>> + new_folio->mapping = NULL; > >>> + /* > >>> + * Reset pgmap which was over-written by > >>> + * prep_compound_page(). > >>> + */ > >>> + new_folio->pgmap = pgmap; > >>> + new_folio->share = 0; /* fsdax only, unused for device private */ > >>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > >> > >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > >> that PMD_ORDER more frees than we'd like? 
> >> > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > would make my implementation explode. I explained this in detail here [1] > > to Zi. > > > > To recap [1], my memory allocator has no visibility into individual > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > allows VRAM to be allocated or evicted for both traditional GPU > > allocations (GEMs) and SVM allocations. > > > > I assume it is still backed by pages that are ref counted? I suspect you'd Yes. > need to convert one reference count to PMD_ORDER reference counts to make > this change work, or are the references not at page granularity? > > I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed > like a per folio refcount > The refcount is incremented by 1 for each call to folio_set_zone_device_data. If we have a 2MB device folio backing a 2MB allocation, the refcount is 1. If we have 512 4KB device pages backing a 2MB allocation, the refcount is 512. The refcount matches the number of folio_free calls we expect to receive for the size of the backing allocation. Right now, in Xe, we allocate either 4k, 64k or 2M but this is all configurable via a driver-side table (Xe) in GPU SVM (DRM common layer). Matt > > Now, to recap the actual issue: if device folios are not split upon free > > and are later reallocated with a different order in > > zone_device_page_init, the implementation breaks. This problem is not > > specific to Xe—Nouveau happens to always allocate at the same order, so > > it works by coincidence. Reallocating at a different order is valid > > behavior and must be supported.
> > > > Agreed > > > Matt > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > >>> + } > >>> +} > >>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > >>> + > >>> void free_zone_device_folio(struct folio *folio) > >>> { > >>> struct dev_pagemap *pgmap = folio->pgmap; > >>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > >>> case MEMORY_DEVICE_COHERENT: > >>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > >>> break; > >>> + free_zone_device_folio_prepare(folio); > >>> pgmap->ops->folio_free(folio, order); > >>> percpu_ref_put_many(&folio->pgmap->ref, nr); > >>> break; > >> > >> Balbir > ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 2:37 ` Matthew Brost @ 2026-01-12 2:50 ` Matthew Brost 0 siblings, 0 replies; 39+ messages in thread From: Matthew Brost @ 2026-01-12 2:50 UTC (permalink / raw) To: Balbir Singh Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On Sun, Jan 11, 2026 at 06:37:06PM -0800, Matthew Brost wrote: > On Mon, Jan 12, 2026 at 01:15:12PM +1100, Balbir Singh wrote: > > On 1/12/26 11:16, Matthew Brost wrote: > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > >> On 1/12/26 06:55, Francois Dugast wrote: > > >>> From: Matthew Brost <matthew.brost@intel.com> > > >>> > > >>> Add free_zone_device_folio_prepare(), a helper that restores large > > >>> ZONE_DEVICE folios to a sane, initial state before freeing them. > > >>> > > >>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > >>> compound metadata). Before returning such pages to the device pgmap > > >>> allocator, each constituent page must be reset to a standalone > > >>> ZONE_DEVICE folio with a valid pgmap and no compound state. > > >>> > > >>> Use this helper prior to folio_free() for device-private and > > >>> device-coherent folios to ensure consistent device page state for > > >>> subsequent allocations. > > >>> > > >>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > >>> Cc: Zi Yan <ziy@nvidia.com> > > >>> Cc: David Hildenbrand <david@kernel.org> > > >>> Cc: Oscar Salvador <osalvador@suse.de> > > >>> Cc: Andrew Morton <akpm@linux-foundation.org> > > >>> Cc: Balbir Singh <balbirs@nvidia.com> > > >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > >>> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > > >>> Cc: Vlastimil Babka <vbabka@suse.cz> > > >>> Cc: Mike Rapoport <rppt@kernel.org> > > >>> Cc: Suren Baghdasaryan <surenb@google.com> > > >>> Cc: Michal Hocko <mhocko@suse.com> > > >>> Cc: Alistair Popple <apopple@nvidia.com> > > >>> Cc: linux-mm@kvack.org > > >>> Cc: linux-cxl@vger.kernel.org > > >>> Cc: linux-kernel@vger.kernel.org > > >>> Suggested-by: Alistair Popple <apopple@nvidia.com> > > >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > >>> --- > > >>> include/linux/memremap.h | 1 + > > >>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > >>> 2 files changed, 56 insertions(+) > > >>> > > >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > >>> index 97fcffeb1c1e..88e1d4707296 100644 > > >>> --- a/include/linux/memremap.h > > >>> +++ b/include/linux/memremap.h > > >>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > >>> > > >>> #ifdef CONFIG_ZONE_DEVICE > > >>> void zone_device_page_init(struct page *page, unsigned int order); > > >>> +void free_zone_device_folio_prepare(struct folio *folio); > > >>> void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > >>> void memunmap_pages(struct dev_pagemap *pgmap); > > >>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > >>> diff --git a/mm/memremap.c b/mm/memremap.c > > >>> index 39dc4bd190d0..375a61e18858 100644 > > >>> --- a/mm/memremap.c > > >>> +++ b/mm/memremap.c > > >>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > >>> } > > >>> EXPORT_SYMBOL_GPL(get_dev_pagemap); > > >>> > > >>> +/** > > >>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > >>> + * @folio: ZONE_DEVICE folio to prepare for release. 
> > >>> + * > > >>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > >>> + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > >>> + * must be restored to a sane ZONE_DEVICE state before they are released. > > >>> + * > > >>> + * This helper: > > >>> + * - Clears @folio->mapping and, for compound folios, clears each page's > > >>> + * compound-head state (ClearPageHead()/clear_compound_head()). > > >>> + * - Resets the compound order metadata (folio_reset_order()) and then > > >>> + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > >>> + * * clears ->mapping > > >>> + * * restores ->pgmap (prep_compound_page() overwrites it) > > >>> + * * clears ->share (only relevant for fsdax; unused for device-private) > > >>> + * > > >>> + * If @folio is order-0, only the mapping is cleared and no further work is > > >>> + * required. > > >>> + */ > > >>> +void free_zone_device_folio_prepare(struct folio *folio) > > >>> +{ > > >>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > >>> + int order, i; > > >>> + > > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > >>> + > > >>> + folio->mapping = NULL; > > >>> + order = folio_order(folio); > > >>> + if (!order) > > >>> + return; > > >>> + > > >>> + folio_reset_order(folio); > > >>> + > > >>> + for (i = 0; i < (1UL << order); i++) { > > >>> + struct page *page = folio_page(folio, i); > > >>> + struct folio *new_folio = (struct folio *)page; > > >>> + > > >>> + ClearPageHead(page); > > >>> + clear_compound_head(page); > > >>> + > > >>> + new_folio->mapping = NULL; > > >>> + /* > > >>> + * Reset pgmap which was over-written by > > >>> + * prep_compound_page(). 
> > >>> + */ > > >>> + new_folio->pgmap = pgmap; > > >>> + new_folio->share = 0; /* fsdax only, unused for device private */ > > >>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > >> > > >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > >> that PMD_ORDER more frees than we'd like? > > >> > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > would make my implementation explode. I explained this in detail here [1] > > > to Zi. > > > > > > To recap [1], my memory allocator has no visibility into individual > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > allows VRAM to be allocated or evicted for both traditional GPU > > > allocations (GEMs) and SVM allocations. > > > > > > > I assume it is still backed by pages that are ref counted? I suspect you'd > > Yes. > Let me clarify this a bit. We don’t track individual pages in our refcounting; instead, we maintain a reference count for the original allocation (i.e., there are no partial frees of the original allocation). This refcounting is handled in GPU SVM (DRM common), and when the allocation’s refcount reaches zero, GPU SVM calls into the driver to indicate that the memory can be released. In Xe, the backing memory is a TTM BO (think of this as an eviction hook), which is layered on top of DRM Buddy (which actually controls VRAM allocation and can determine device pages from this layer). I suspect AMD, when using GPU SVM (they have indicated this is the plan), will also use TTM BO here. Nova, assuming they eventually adopt SVM and use GPU SVM, will likely implement something very similar to TTM in Rust, but with DRM Buddy also controlling the actual allocation (they have already written bindings for DRM buddy). 
Matt > > need to convert one reference count to PMD_ORDER reference counts to make > > this change work, or are the references not at page granularity? > > > > I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed > > like a per folio refcount > > > > The refcount is incremented by 1 for each call to > folio_set_zone_device_data. If we have a 2MB device folio backing a > 2MB allocation, the refcount is 1. If we have 512 4KB device pages > backing a 2MB allocation, the refcount is 512. The refcount matches the > number of folio_free calls we expect to receive for the size of the > backing allocation. Right now, in Xe, we allocate either 4k, 64k or 2M > but this is all configurable via a driver-side table (Xe) in GPU SVM (drm > common layer). > > Matt > > > > Now, to recap the actual issue: if device folios are not split upon free > > > and are later reallocated with a different order in > > > zone_device_page_init, the implementation breaks. This problem is not > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > it works by coincidence. Reallocating at a different order is valid > > > behavior and must be supported. > > > > > > > Agreed > > > > > Matt > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > >>> + } > > >>> +} > > >>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > >>> + > > >>> void free_zone_device_folio(struct folio *folio) > > >>> { > > >>> struct dev_pagemap *pgmap = folio->pgmap; > > >>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > >>> case MEMORY_DEVICE_COHERENT: > > >>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > >>> break; > > >>> + free_zone_device_folio_prepare(folio); > > >>> pgmap->ops->folio_free(folio, order); > > >>> percpu_ref_put_many(&folio->pgmap->ref, nr); > > >>> break; > > >> > > >> Balbir > > ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 1:16 ` Matthew Brost 2026-01-12 2:15 ` Balbir Singh @ 2026-01-12 23:58 ` Alistair Popple 2026-01-13 0:23 ` Matthew Brost 1 sibling, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-12 23:58 UTC (permalink / raw) To: Matthew Brost Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > On 1/12/26 06:55, Francois Dugast wrote: > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > compound metadata). Before returning such pages to the device pgmap > > > allocator, each constituent page must be reset to a standalone > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > Use this helper prior to folio_free() for device-private and > > > device-coherent folios to ensure consistent device page state for > > > subsequent allocations. > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > Cc: Zi Yan <ziy@nvidia.com> > > > Cc: David Hildenbrand <david@kernel.org> > > > Cc: Oscar Salvador <osalvador@suse.de> > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > Cc: Mike Rapoport <rppt@kernel.org> > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > Cc: Michal Hocko <mhocko@suse.com> > > > Cc: Alistair Popple <apopple@nvidia.com> > > > Cc: linux-mm@kvack.org > > > Cc: linux-cxl@vger.kernel.org > > > Cc: linux-kernel@vger.kernel.org > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > --- > > > include/linux/memremap.h | 1 + > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 56 insertions(+) > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > --- a/include/linux/memremap.h > > > +++ b/include/linux/memremap.h > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > void zone_device_page_init(struct page *page, unsigned int order); > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > index 39dc4bd190d0..375a61e18858 100644 > > > --- a/mm/memremap.c > > > +++ b/mm/memremap.c > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > } > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > +/** > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > + * > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > + * can be compound. 
When freeing a compound ZONE_DEVICE folio, the tail pages > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > + * > > > + * This helper: > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > + * * clears ->mapping > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > + * > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > + * required. > > > + */ > > > +void free_zone_device_folio_prepare(struct folio *folio) I don't really like the naming here - we're not preparing a folio to be freed, from the core-mm perspective the folio is already free. This is just reinitialising the folio metadata ready for the driver to reuse it, which may actually involve just recreating a compound folio. So maybe zone_device_folio_reinitialise()? Or would it be possible to roll this into a zone_device_folio_init() type function (similar to zone_device_page_init()) that just deals with everything at allocation time? 
> > > +{ > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > + int order, i; > > > + > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > + > > > + folio->mapping = NULL; > > > + order = folio_order(folio); > > > + if (!order) > > > + return; > > > + > > > + folio_reset_order(folio); > > > + > > > + for (i = 0; i < (1UL << order); i++) { > > > + struct page *page = folio_page(folio, i); > > > + struct folio *new_folio = (struct folio *)page; > > > + > > > + ClearPageHead(page); > > > + clear_compound_head(page); > > > + > > > + new_folio->mapping = NULL; > > > + /* > > > + * Reset pgmap which was over-written by > > > + * prep_compound_page(). > > > + */ > > > + new_folio->pgmap = pgmap; > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > that PMD_ORDER more frees than we'd like? > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > would make my implementation explode. I explained this in detail here [1] > to Zi. > > To recap [1], my memory allocator has no visibility into individual > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > allows VRAM to be allocated or evicted for both traditional GPU > allocations (GEMs) and SVM allocations. > > Now, to recap the actual issue: if device folios are not split upon free > and are later reallocated with a different order in > zone_device_page_init, the implementation breaks. This problem is not > specific to Xe—Nouveau happens to always allocate at the same order, so > it works by coincidence. Reallocating at a different order is valid > behavior and must be supported. 
I agree it's probably by coincidence but it is a perfectly valid design to always just (re)allocate at the same order and not worry about having to reinitialise things to different orders. - Alistair > Matt > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > + } > > > +} > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > + > > > void free_zone_device_folio(struct folio *folio) > > > { > > > struct dev_pagemap *pgmap = folio->pgmap; > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > case MEMORY_DEVICE_COHERENT: > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > break; > > > + free_zone_device_folio_prepare(folio); > > > pgmap->ops->folio_free(folio, order); > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > break; > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 23:58 ` Alistair Popple @ 2026-01-13 0:23 ` Matthew Brost 2026-01-13 0:43 ` Alistair Popple 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-13 0:23 UTC (permalink / raw) To: Alistair Popple Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > compound metadata). Before returning such pages to the device pgmap > > > > allocator, each constituent page must be reset to a standalone > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > device-coherent folios to ensure consistent device page state for > > > > subsequent allocations. > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > Cc: David Hildenbrand <david@kernel.org> > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > Cc: linux-mm@kvack.org > > > > Cc: linux-cxl@vger.kernel.org > > > > Cc: linux-kernel@vger.kernel.org > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > --- > > > > include/linux/memremap.h | 1 + > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > 2 files changed, 56 insertions(+) > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > --- a/include/linux/memremap.h > > > > +++ b/include/linux/memremap.h > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > --- a/mm/memremap.c > > > > +++ b/mm/memremap.c > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > } > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > +/** > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > + * @folio: ZONE_DEVICE folio to prepare for release. 
> > > > + * > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > + * > > > > + * This helper: > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > + * * clears ->mapping > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > + * > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > + * required. > > > > + */ > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > I don't really like the naming here - we're not preparing a folio to be > freed, from the core-mm perspective the folio is already free. This is just > reinitialising the folio metadata ready for the driver to reuse it, which may > actually involve just recreating a compound folio. > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > zone_device_folio_reinitialise - that works for me... but it seems like everyone has an opinion. > roll this into a zone_device_folio_init() type function (similar to > zone_device_page_init()) that just deals with everything at allocation time? > > I don’t think doing this at allocation actually works without a big lock > per pgmap. Consider the case where a VRAM allocator allocates two > distinct subsets of a large folio and you have a multi-threaded GPU page > fault handler (Xe does). It’s possible two threads could call > zone_device_folio_reinitialise at the same time, racing and causing all > sorts of issues.
My plan is to just call this function in the driver’s ->folio_free() prior to returning the VRAM allocation to my driver pool. > > > > +{ > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > + int order, i; > > > > + > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > + > > > > + folio->mapping = NULL; > > > > + order = folio_order(folio); > > > > + if (!order) > > > > + return; > > > > + > > > > + folio_reset_order(folio); > > > > + > > > > + for (i = 0; i < (1UL << order); i++) { > > > > + struct page *page = folio_page(folio, i); > > > > + struct folio *new_folio = (struct folio *)page; > > > > + > > > > + ClearPageHead(page); > > > > + clear_compound_head(page); > > > > + > > > > + new_folio->mapping = NULL; > > > > + /* > > > > + * Reset pgmap which was over-written by > > > > + * prep_compound_page(). > > > > + */ > > > > + new_folio->pgmap = pgmap; > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > that PMD_ORDER more frees than we'd like? > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > would make my implementation explode. I explained this in detail here [1] > > to Zi. > > > > To recap [1], my memory allocator has no visibility into individual > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > allows VRAM to be allocated or evicted for both traditional GPU > > allocations (GEMs) and SVM allocations. > > > > Now, to recap the actual issue: if device folios are not split upon free > > and are later reallocated with a different order in > > zone_device_page_init, the implementation breaks. 
This problem is not > > specific to Xe—Nouveau happens to always allocate at the same order, so > > it works by coincidence. Reallocating at a different order is valid > > behavior and must be supported. > > I agree it's probably by coincidence but it is a perfectly valid design to > always just (re)allocate at the same order and not worry about having to > reinitialise things to different orders. > I would agree with this statement too — it’s perfectly valid if a driver always wants to (re)allocate at the same order. Matt > - Alistair > > > Matt > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > + } > > > > +} > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > + > > > > void free_zone_device_folio(struct folio *folio) > > > > { > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > case MEMORY_DEVICE_COHERENT: > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > break; > > > > + free_zone_device_folio_prepare(folio); > > > > pgmap->ops->folio_free(folio, order); > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > break; > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 0:23 ` Matthew Brost @ 2026-01-13 0:43 ` Alistair Popple 2026-01-13 1:07 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-13 0:43 UTC (permalink / raw) To: Matthew Brost Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > allocator, each constituent page must be reset to a standalone > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > device-coherent folios to ensure consistent device page state for > > > > > subsequent allocations. 
> > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > Cc: linux-mm@kvack.org > > > > > Cc: linux-cxl@vger.kernel.org > > > > > Cc: linux-kernel@vger.kernel.org > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > --- > > > > > include/linux/memremap.h | 1 + > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > --- a/include/linux/memremap.h > > > > > +++ b/include/linux/memremap.h > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > 
--- a/mm/memremap.c > > > > > +++ b/mm/memremap.c > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > } > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > +/** > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > + * > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > + * > > > > > + * This helper: > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > + * * clears ->mapping > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > + * > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > + * required. > > > > > + */ > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > I don't really like the naming here - we're not preparing a folio to be > > freed, from the core-mm perspective the folio is already free. This is just > > reinitialising the folio metadata ready for the driver to reuse it, which may > > actually involve just recreating a compound folio. > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > zone_device_folio_reinitialise - that works for me... but it seems like > everyone has an opinion. Well of course :) There are only two hard problems in programming and I forget the other one.
But I didn't want to just say I don't like free_zone_device_folio_prepare() without offering an alternative, I'd be open to others. > > > roll this into a zone_device_folio_init() type function (similar to > > zone_device_page_init()) that just deals with everything at allocation time? > > > > I don’t think doing this at allocation actually works without a big lock > per pgmap. Consider the case where a VRAM allocator allocates two > distinct subsets of a large folio and you have a multi-threaded GPU page > fault handler (Xe does). It’s possible two threads could call > zone_device_folio_reinitialise at the same time, racing and causing all > sorts of issues. My plan is to just call this function in the driver’s > ->folio_free() prior to returning the VRAM allocation to my driver pool. This doesn't make sense to me (at least as someone who doesn't know DRM SVM intimately) - the folio metadata initialisation should only happen after the VRAM allocation has occurred. IOW the VRAM allocator needs to deal with the locking, once you have the VRAM physical range you just initialise the folio/pages associated with that range with zone_device_folio_(re)initialise() and you're done. Is the concern that reinitialisation would touch pages outside of the allocated VRAM range if it was previously a large folio? 
> > > > > +{ > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > + int order, i; > > > > > + > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > + > > > > > + folio->mapping = NULL; > > > > > + order = folio_order(folio); > > > > > + if (!order) > > > > > + return; > > > > > + > > > > > + folio_reset_order(folio); > > > > > + > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > + struct page *page = folio_page(folio, i); > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > + > > > > > + ClearPageHead(page); > > > > > + clear_compound_head(page); > > > > > + > > > > > + new_folio->mapping = NULL; > > > > > + /* > > > > > + * Reset pgmap which was over-written by > > > > > + * prep_compound_page(). > > > > > + */ > > > > > + new_folio->pgmap = pgmap; > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > would make my implementation explode. I explained this in detail here [1] > > > to Zi. > > > > > > To recap [1], my memory allocator has no visibility into individual > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > allows VRAM to be allocated or evicted for both traditional GPU > > > allocations (GEMs) and SVM allocations. > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > and are later reallocated with a different order in > > > zone_device_page_init, the implementation breaks. This problem is not > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > it works by coincidence. 
Reallocating at a different order is valid > > > behavior and must be supported. > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > always just (re)allocate at the same order and not worry about having to > > reinitialise things to different orders. > > > > I would agree with this statement too — it’s perfectly valid if a driver > always wants to (re)allocate at the same order. > > Matt > > > - Alistair > > > > > Matt > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > + } > > > > > +} > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > + > > > > > void free_zone_device_folio(struct folio *folio) > > > > > { > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > case MEMORY_DEVICE_COHERENT: > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > break; > > > > > + free_zone_device_folio_prepare(folio); > > > > > pgmap->ops->folio_free(folio, order); > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > break; > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
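The per-page reset that free_zone_device_folio_prepare() performs on a compound folio can be illustrated with a small userspace sketch. The struct and function names below are hypothetical stand-ins, not the kernel API; the point is the shape of the loop: clear the compound-head state on every constituent page, reset the order, and restore the pgmap pointer that prep_compound_page() clobbered on the tails.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the relevant struct page/folio fields. */
struct mock_page {
	void *mapping;	/* folio->mapping */
	void *pgmap;	/* overwritten on tail pages by prep_compound_page() */
	int is_head;	/* PageHead */
	int order;	/* compound order, meaningful on the head page only */
};

/*
 * Mirror of the helper's logic: restore every page of a compound
 * allocation to a standalone, order-0 state with a valid pgmap.
 */
static void mock_folio_prepare(struct mock_page *pages, void *pgmap)
{
	int order = pages[0].order;
	int i;

	pages[0].mapping = NULL;
	if (!order)
		return;			/* order-0: nothing else to do */

	pages[0].order = 0;		/* folio_reset_order() */
	for (i = 0; i < (1 << order); i++) {
		pages[i].is_head = 0;	/* ClearPageHead()/clear_compound_head() */
		pages[i].mapping = NULL;
		pages[i].pgmap = pgmap;	/* restore the clobbered pgmap */
	}
}
```

With an order-2 "folio" (4 pages), calling mock_folio_prepare() leaves each of the 4 pages as an independent order-0 page pointing at the right pgmap, which is exactly the state a later allocation of any order expects to start from.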
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 0:43 ` Alistair Popple @ 2026-01-13 1:07 ` Matthew Brost 2026-01-13 1:35 ` Alistair Popple 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-13 1:07 UTC (permalink / raw) To: Alistair Popple Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > subsequent allocations. 
> > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > Cc: linux-mm@kvack.org > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > --- > > > > > > include/linux/memremap.h | 1 + > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > --- a/include/linux/memremap.h > > > > > > +++ b/include/linux/memremap.h > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > > diff --git a/mm/memremap.c 
b/mm/memremap.c > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > --- a/mm/memremap.c > > > > > > +++ b/mm/memremap.c > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > } > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > +/** > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > + * > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > + * > > > > > > + * This helper: > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > + * * clears ->mapping > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > + * > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > + * required. > > > > > > + */ > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > freed, from the core-mm perspective the folio is already free. This is just > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > actually involve just recreating a compound folio. > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > zone_device_folio_reinitialise - that works for me... 
but seem like > > everyone has a opinion. > > Well of course :) There are only two hard problems in programming and > I forget the other one. But I didn't want to just say I don't like > free_zone_device_folio_prepare() without offering an alternative, I'd be open > to others. > zone_device_folio_reinitialise is good with me. > > > > > roll this into a zone_device_folio_init() type function (similar to > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > I don’t think doing this at allocation actually works without a big lock > > per pgmap. Consider the case where a VRAM allocator allocates two > > distinct subsets of a large folio and you have a multi-threaded GPU page > > fault handler (Xe does). It’s possible two threads could call > > zone_device_folio_reinitialise at the same time, racing and causing all > > sorts of issues. My plan is to just call this function in the driver’s > > ->folio_free() prior to returning the VRAM allocation to my driver pool. > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM > intimately) - the folio metadata initialisation should only happen after the > VRAM allocation has occured. > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM > physical range you just initialise the folio/pages associated with that range > with zone_device_folio_(re)initialise() and you're done. > Our VRAM allocator does have locking (via DRM buddy), but that layer doesn’t have visibility into the folio or its pages. By the time we handle the folio/pages in the GPU fault handler, there are no global locks preventing two GPU faults from each having, say, 16 pages from the same order-9 folio. I believe if both threads call zone_device_folio_reinitialise/init at the same time, bad things could happen. > Is the concern that reinitialisation would touch pages outside of the allocated > VRAM range if it was previously a large folio? 
No, just two threads calling zone_device_folio_reinitialise/init at the same time, on the same folio. If we call zone_device_folio_reinitialise in ->folio_free this problem goes away. We could solve this with split_lock or something but I'd prefer not to add a lock for this (although some of the prior revs did do this; maybe we will revisit this later). Anyways - this falls in driver detail / choice IMO. Matt > > > > > > > +{ > > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > > + int order, i; > > > > > > + > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > > + > > > > > > + folio->mapping = NULL; > > > > > > + order = folio_order(folio); > > > > > > + if (!order) > > > > > > + return; > > > > > > + > > > > > > + folio_reset_order(folio); > > > > > > + > > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > > + struct page *page = folio_page(folio, i); > > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > > + > > > > > > + ClearPageHead(page); > > > > > > + clear_compound_head(page); > > > > > > + > > > > > > + new_folio->mapping = NULL; > > > > > > + /* > > > > > > + * Reset pgmap which was over-written by > > > > > > + * prep_compound_page(). > > > > > > + */ > > > > > > + new_folio->pgmap = pgmap; > > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > > would make my implementation explode. I explained this in detail here [1] > > > > to Zi. 
> > > > > > > > To recap [1], my memory allocator has no visibility into individual > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > > allows VRAM to be allocated or evicted for both traditional GPU > > > > allocations (GEMs) and SVM allocations. > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > > and are later reallocated with a different order in > > > > zone_device_page_init, the implementation breaks. This problem is not > > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > > it works by coincidence. Reallocating at a different order is valid > > > > behavior and must be supported. > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > > always just (re)allocate at the same order and not worry about having to > > > reinitialise things to different orders. > > > > > > > I would agree with this statement too — it’s perfectly valid if a driver > > always wants to (re)allocate at the same order. > > > > Matt > > > > > - Alistair > > > > > > > Matt > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > > > + } > > > > > > +} > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > > + > > > > > > void free_zone_device_folio(struct folio *folio) > > > > > > { > > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > > case MEMORY_DEVICE_COHERENT: > > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > > break; > > > > > > + free_zone_device_folio_prepare(folio); > > > > > > pgmap->ops->folio_free(folio, order); > > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > > break; > > > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
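The actual bug Matt recaps above — stale compound state from a large allocation breaking a later, smaller allocation — can be simulated in userspace. All names below are illustrative, not the kernel or Xe API: sim_page_init() plays the role of zone_device_page_init() and asserts the clean-state precondition it relies on, and sim_free_prepare() plays the role of the free-time split/reset.

```c
#include <assert.h>
#include <stddef.h>

struct sim_page {
	int is_head;
	int order;	/* meaningful on the head page only */
	void *pgmap;
};

/* Like zone_device_page_init(): set up a compound allocation of 2^order pages.
 * A clean init assumes no stale compound-head state from a prior, larger
 * allocation is left behind. */
static void sim_page_init(struct sim_page *p, int order, void *pgmap)
{
	int i;

	for (i = 0; i < (1 << order); i++)
		assert(!p[i].is_head);	/* breaks if the old folio wasn't reset */
	p[0].is_head = 1;
	p[0].order = order;
	for (i = 0; i < (1 << order); i++)
		p[i].pgmap = pgmap;
}

/* The free-time reset (the role free_zone_device_folio_prepare() fills):
 * split the compound allocation back to independent order-0 pages. */
static void sim_free_prepare(struct sim_page *p)
{
	int order = p[0].order;
	int i;

	p[0].order = 0;
	for (i = 0; i < (1 << order); i++)
		p[i].is_head = 0;
}
```

Allocating order-2, freeing with the reset step, then re-allocating the same range as two order-1 chunks works; skip sim_free_prepare() and the second init trips over the stale head state, which is the "reallocate at a different order" failure described in the thread.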
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 1:07 ` Matthew Brost @ 2026-01-13 1:35 ` Alistair Popple 2026-01-13 1:40 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-13 1:35 UTC (permalink / raw) To: Matthew Brost Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > > subsequent allocations. 
> > > > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > > Cc: linux-mm@kvack.org > > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > > --- > > > > > > > include/linux/memremap.h | 1 + > > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > > --- a/include/linux/memremap.h > > > > > > > +++ b/include/linux/memremap.h > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > > > void *devm_memremap_pages(struct 
device *dev, struct dev_pagemap *pgmap); > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > > --- a/mm/memremap.c > > > > > > > +++ b/mm/memremap.c > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > > } > > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > > > +/** > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > > + * > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > > + * > > > > > > > + * This helper: > > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > > + * * clears ->mapping > > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > > + * > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > > + * required. > > > > > > > + */ > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > > freed, from the core-mm perspective the folio is already free. This is just > > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > > actually involve just recreating a compound folio. 
> > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > > > zone_device_folio_reinitialise - that works for me... but seem like > > > everyone has a opinion. > > > > Well of course :) There are only two hard problems in programming and > > I forget the other one. But I didn't want to just say I don't like > > free_zone_device_folio_prepare() without offering an alternative, I'd be open > > to others. > > > > zone_device_folio_reinitialise is good with me. > > > > > > > > roll this into a zone_device_folio_init() type function (similar to > > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > > > > I don’t think doing this at allocation actually works without a big lock > > > per pgmap. Consider the case where a VRAM allocator allocates two > > > distinct subsets of a large folio and you have a multi-threaded GPU page > > > fault handler (Xe does). It’s possible two threads could call > > > zone_device_folio_reinitialise at the same time, racing and causing all > > > sorts of issues. My plan is to just call this function in the driver’s > > > ->folio_free() prior to returning the VRAM allocation to my driver pool. > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM > > intimately) - the folio metadata initialisation should only happen after the > > VRAM allocation has occured. > > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM > > physical range you just initialise the folio/pages associated with that range > > with zone_device_folio_(re)initialise() and you're done. > > > > Our VRAM allocator does have locking (via DRM buddy), but that layer I mean I assumed it did :-) > doesn’t have visibility into the folio or its pages. By the time we > handle the folio/pages in the GPU fault handler, there are no global > locks preventing two GPU faults from each having, say, 16 pages from the > same order-9 folio. 
I believe if both threads call > zone_device_folio_reinitialise/init at the same time, bad things could > happen. This is confusing to me. If you are getting a GPU fault it implies no page is mapped at a particular virtual address. The normal process (or at least the process I'm familiar with) for handling this is to allocate and map a page at the faulting virtual address. So in the scenario of two GPUs faulting on the same VA each thread would allocate VRAM using DRM buddy, presumably getting different physical pages, and so the zone_device_folio_init() call would be to different folios/pages. Then eventually one thread would succeed in creating the mapping from VA->VRAM and the losing thread would free the VRAM allocation back to DRM buddy. So I'm a bit confused by the above statement that two GPUs faults could each have the same pages or be calling zone_device_folio_init() on the same pages. How would that happen? > > Is the concern that reinitialisation would touch pages outside of the allocated > > VRAM range if it was previously a large folio? > > No just two threads call zone_device_folio_reinitialise/init at the same > time, on the same folio. > > If we call zone_device_folio_reinitialise in ->folio_free this problem > goes away. We could solve this with split_lock or something but I'd > prefer not to add lock for this (although some of prior revs did do > this, maybe we will revist this later). > > Anyways - this falls in driver detail / choice IMO. Agreed. 
- Alistair > Matt > > > > > > > > > > +{ > > > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > > > + int order, i; > > > > > > > + > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > > > + > > > > > > > + folio->mapping = NULL; > > > > > > > + order = folio_order(folio); > > > > > > > + if (!order) > > > > > > > + return; > > > > > > > + > > > > > > > + folio_reset_order(folio); > > > > > > > + > > > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > > > + struct page *page = folio_page(folio, i); > > > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > > > + > > > > > > > + ClearPageHead(page); > > > > > > > + clear_compound_head(page); > > > > > > > + > > > > > > > + new_folio->mapping = NULL; > > > > > > > + /* > > > > > > > + * Reset pgmap which was over-written by > > > > > > > + * prep_compound_page(). > > > > > > > + */ > > > > > > > + new_folio->pgmap = pgmap; > > > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > > > would make my implementation explode. I explained this in detail here [1] > > > > > to Zi. > > > > > > > > > > To recap [1], my memory allocator has no visibility into individual > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > > > allows VRAM to be allocated or evicted for both traditional GPU > > > > > allocations (GEMs) and SVM allocations. 
> > > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > > > and are later reallocated with a different order in > > > > > zone_device_page_init, the implementation breaks. This problem is not > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > > > it works by coincidence. Reallocating at a different order is valid > > > > > behavior and must be supported. > > > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > > > always just (re)allocate at the same order and not worry about having to > > > > reinitialise things to different orders. > > > > > > > > > > I would agree with this statement too — it’s perfectly valid if a driver > > > always wants to (re)allocate at the same order. > > > > > > Matt > > > > > > > - Alistair > > > > > > > > > Matt > > > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > > > > > + } > > > > > > > +} > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > > > + > > > > > > > void free_zone_device_folio(struct folio *folio) > > > > > > > { > > > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > > > case MEMORY_DEVICE_COHERENT: > > > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > > > break; > > > > > > > + free_zone_device_folio_prepare(folio); > > > > > > > pgmap->ops->folio_free(folio, order); > > > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > > > break; > > > > > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
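The fault-handling flow Alistair describes in this exchange — each faulting thread gets its own VRAM allocation, one wins installing the VA->VRAM mapping, and the loser frees its allocation back to the pool — comes down to a compare-and-swap on the page-table slot. A minimal sketch (hypothetical names, simulated with two sequential calls rather than real threads):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One page-table slot for the faulting VA; NULL means unmapped. */
static _Atomic(void *) pte_slot;

static int freed_allocations;

/* Stand-in for returning a VRAM allocation to the buddy allocator. */
static void vram_free(void *alloc)
{
	(void)alloc;
	freed_allocations++;
}

/* Returns true if this thread's allocation won and now backs the VA. */
static bool handle_fault(void *my_vram_alloc)
{
	void *expected = NULL;

	/* Only succeeds if the slot is still unmapped. */
	if (atomic_compare_exchange_strong(&pte_slot, &expected, my_vram_alloc))
		return true;
	/* Lost the race: another thread mapped the VA first, so give
	 * this thread's VRAM back. */
	vram_free(my_vram_alloc);
	return false;
}
```

With two racing faults the winner's allocation ends up in the slot and exactly one allocation is freed; crucially, each thread only ever initialised folio metadata for its *own* allocation, which is the point Alistair is making.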
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 1:35 ` Alistair Popple @ 2026-01-13 1:40 ` Matthew Brost 2026-01-13 2:06 ` Alistair Popple 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-13 1:40 UTC (permalink / raw) To: Alistair Popple Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote: > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > > > subsequent allocations. 
> > > > > > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > > > Cc: linux-mm@kvack.org > > > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > > > --- > > > > > > > > include/linux/memremap.h | 1 + > > > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > > > --- a/include/linux/memremap.h > > > > > > > > +++ b/include/linux/memremap.h > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > > > void memunmap_pages(struct 
dev_pagemap *pgmap); > > > > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > > > --- a/mm/memremap.c > > > > > > > > +++ b/mm/memremap.c > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > > > } > > > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > > > > > +/** > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > > > + * > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > > > + * > > > > > > > > + * This helper: > > > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > > > + * * clears ->mapping > > > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > > > + * > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > > > + * required. > > > > > > > > + */ > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > > > freed, from the core-mm perspective the folio is already free. 
This is just > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > > > actually involve just recreating a compound folio. > > > > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > > > > > zone_device_folio_reinitialise - that works for me... but it seems like > > > > everyone has an opinion. > > > > > > Well of course :) There are only two hard problems in programming and > > > I forget the other one. But I didn't want to just say I don't like > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open > > > to others. > > > > > > > zone_device_folio_reinitialise is good with me. > > > > > > > > > > > roll this into a zone_device_folio_init() type function (similar to > > > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > > > > > > > I don’t think doing this at allocation actually works without a big lock > > > > per pgmap. Consider the case where a VRAM allocator allocates two > > > > distinct subsets of a large folio and you have a multi-threaded GPU page > > > > fault handler (Xe does). It’s possible two threads could call > > > > zone_device_folio_reinitialise at the same time, racing and causing all > > > > sorts of issues. My plan is to just call this function in the driver’s > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool. > > > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM > > > intimately) - the folio metadata initialisation should only happen after the > > > VRAM allocation has occurred. > > > > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM > > > physical range you just initialise the folio/pages associated with that range > > > with zone_device_folio_(re)initialise() and you're done.
> > > > > > > Our VRAM allocator does have locking (via DRM buddy), but that layer > > I mean I assumed it did :-) > > > doesn’t have visibility into the folio or its pages. By the time we > > handle the folio/pages in the GPU fault handler, there are no global > > locks preventing two GPU faults from each having, say, 16 pages from the > > same order-9 folio. I believe if both threads call > > zone_device_folio_reinitialise/init at the same time, bad things could > > happen. > > This is confusing to me. If you are getting a GPU fault it implies no page is > mapped at a particular virtual address. The normal process (or at least the > process I'm familiar with) for handling this is to allocate and map a page at > the faulting virtual address. So in the scenario of two GPUs faulting on the > same VA each thread would allocate VRAM using DRM buddy, presumably getting Different VAs. > different physical pages, and so the zone_device_folio_init() call would be to Yes, different physical pages but same folio which is possible if it hasn't been split yet (i.e., both threads have a different subset of pages in the same folio, try to split it at the same time, and boom, something bad happens). > different folios/pages. > > Then eventually one thread would succeed in creating the mapping from VA->VRAM > and the losing thread would free the VRAM allocation back to DRM buddy. > > So I'm a bit confused by the above statement that two GPU faults could each > have the same pages or be calling zone_device_folio_init() on the same pages. > How would that happen? > See above. I hope my above statements make this clear. Matt > > > Is the concern that reinitialisation would touch pages outside of the allocated > > > VRAM range if it was previously a large folio? > > > > No, just two threads calling zone_device_folio_reinitialise/init at the same > > time, on the same folio. > > > > If we call zone_device_folio_reinitialise in ->folio_free this problem > > goes away.
We could solve this with split_lock or something but I'd > > prefer not to add a lock for this (although some of the prior revs did do > > this, maybe we will revisit this later). > > > > Anyways - this falls in driver detail / choice IMO. > > Agreed. > > - Alistair > > > Matt > > > > > > > > > > > > > +{ > > > > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > > > > + int order, i; > > > > > > > > + > > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > > > > + > > > > > > > > + folio->mapping = NULL; > > > > > > > > + order = folio_order(folio); > > > > > > > > + if (!order) > > > > > > > > + return; > > > > > > > > + > > > > > > > > + folio_reset_order(folio); > > > > > > > > + > > > > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > > > > + struct page *page = folio_page(folio, i); > > > > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > > > > + > > > > > > > > + ClearPageHead(page); > > > > > > > > + clear_compound_head(page); > > > > > > > > + > > > > > > > > + new_folio->mapping = NULL; > > > > > > > > + /* > > > > > > > > + * Reset pgmap which was over-written by > > > > > > > > + * prep_compound_page(). > > > > > > > > + */ > > > > > > > > + new_folio->pgmap = pgmap; > > > > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > > > > would make my implementation explode. I explained this in detail here [1] > > > > > > to Zi.
> > > > > > > > > > > > To recap [1], my memory allocator has no visibility into individual > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > > > > allows VRAM to be allocated or evicted for both traditional GPU > > > > > > allocations (GEMs) and SVM allocations. > > > > > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > > > > and are later reallocated with a different order in > > > > > > zone_device_page_init, the implementation breaks. This problem is not > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > > > > it works by coincidence. Reallocating at a different order is valid > > > > > > behavior and must be supported. > > > > > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > > > > always just (re)allocate at the same order and not worry about having to > > > > > reinitialise things to different orders. > > > > > > > > > > > > > I would agree with this statement too — it’s perfectly valid if a driver > > > > always wants to (re)allocate at the same order. 
> > > > > > > > Matt > > > > > > > > > - Alistair > > > > > > > > > > > Matt > > > > > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > > > > > > > + } > > > > > > > > +} > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > > > > + > > > > > > > > void free_zone_device_folio(struct folio *folio) > > > > > > > > { > > > > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > > > > case MEMORY_DEVICE_COHERENT: > > > > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > > > > break; > > > > > > > > + free_zone_device_folio_prepare(folio); > > > > > > > > pgmap->ops->folio_free(folio, order); > > > > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > > > > break; > > > > > > > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
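[Editorial note: the reset sequence the helper performs can be sketched as a small userspace model. The toy_page and toy_pgmap types below are stand-ins for struct page and struct dev_pagemap, modeling only the fields the helper touches; this is an illustration of the idea, not the kernel code.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for struct dev_pagemap and struct page (toy model). */
struct toy_pgmap {
	int id;
};

struct toy_page {
	struct toy_pgmap *pgmap;        /* overwritten by compound setup */
	void *mapping;
	int is_head;                    /* models PageHead */
	struct toy_page *compound_head; /* NULL when standalone */
	unsigned long share;            /* fsdax only */
};

/*
 * Model of the reset loop: restore every constituent page of an
 * order-N folio to a standalone page with a valid pgmap and no
 * compound state.
 */
static void toy_folio_prepare(struct toy_page *head, unsigned int order,
			      struct toy_pgmap *pgmap)
{
	unsigned long i;

	head->mapping = NULL;
	if (!order)
		return; /* order-0: clearing the mapping is enough */

	for (i = 0; i < (1UL << order); i++) {
		struct toy_page *p = &head[i];

		p->is_head = 0;
		p->compound_head = NULL;
		p->mapping = NULL;
		p->pgmap = pgmap; /* re-point at the device pagemap */
		p->share = 0;
	}
}
```

[The real helper additionally goes through folio_reset_order() and the ClearPageHead()/clear_compound_head() primitives shown in the patch.]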
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 1:40 ` Matthew Brost @ 2026-01-13 2:06 ` Alistair Popple 2026-01-13 2:16 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-13 2:06 UTC (permalink / raw) To: Matthew Brost Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On 2026-01-13 at 12:40 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote: > > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. 
> > > > > > > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > > > > subsequent allocations. > > > > > > > > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > > > > Cc: linux-mm@kvack.org > > > > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > > > > --- > > > > > > > > > include/linux/memremap.h | 1 + > > > > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > > > > --- a/include/linux/memremap.h > > > > > > > > > +++ b/include/linux/memremap.h > > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > > > > > > > > > #ifdef 
CONFIG_ZONE_DEVICE > > > > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > > > > --- a/mm/memremap.c > > > > > > > > > +++ b/mm/memremap.c > > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > > > > + * > > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > > > > + * > > > > > > > > > + * This helper: > > > > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). 
> > > > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > > > > + * * clears ->mapping > > > > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > > > > + * > > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > > > > + * required. > > > > > > > > > + */ > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > > > > freed, from the core-mm perspective the folio is already free. This is just > > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > > > > actually involve just recreating a compound folio. > > > > > > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > > > > > > > zone_device_folio_reinitialise - that works for me... but seem like > > > > > everyone has a opinion. > > > > > > > > Well of course :) There are only two hard problems in programming and > > > > I forget the other one. But I didn't want to just say I don't like > > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open > > > > to others. > > > > > > > > > > zone_device_folio_reinitialise is good with me. > > > > > > > > > > > > > > roll this into a zone_device_folio_init() type function (similar to > > > > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > > > > > > > > > > I don’t think doing this at allocation actually works without a big lock > > > > > per pgmap. 
Consider the case where a VRAM allocator allocates two > > > > > distinct subsets of a large folio and you have a multi-threaded GPU page > > > > > fault handler (Xe does). It’s possible two threads could call > > > > > zone_device_folio_reinitialise at the same time, racing and causing all > > > > > sorts of issues. My plan is to just call this function in the driver’s > > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool. > > > > > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM > > > > intimately) - the folio metadata initialisation should only happen after the > > > > VRAM allocation has occured. > > > > > > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM > > > > physical range you just initialise the folio/pages associated with that range > > > > with zone_device_folio_(re)initialise() and you're done. > > > > > > > > > > Our VRAM allocator does have locking (via DRM buddy), but that layer > > > > I mean I assumed it did :-) > > > > > doesn’t have visibility into the folio or its pages. By the time we > > > handle the folio/pages in the GPU fault handler, there are no global > > > locks preventing two GPU faults from each having, say, 16 pages from the > > > same order-9 folio. I believe if both threads call > > > zone_device_folio_reinitialise/init at the same time, bad things could > > > happen. > > > > This is confusing to me. If you are getting a GPU fault it implies no page is > > mapped at a particular virtual address. The normal process (or at least the > > process I'm familiar with) for handling this is to allocate and map a page at > > the faulting virtual address. So in the scenario of two GPUs faulting on the > > same VA each thread would allocate VRAM using DRM buddy, presumably getting > > Different VAs. 
> > > > > different physical pages, and so the zone_device_folio_init() call would be to > > Yes, different physical pages but same folio which is possible if it > hasn't been split yet (i.e., both threads have a different subset of pages in > the same folio, try to split it at the same time, and boom, something bad > happens). So is your concern something like this: 1) There is a free folio A of order 9, starting at physical address 0. 2) You have two GPU faults, both call into DRM Buddy to get a 4K page. 3) GPU 1 gets allocated physical address 0 (ie. folio_page(folio_A, 0)) 4) GPU 2 gets allocated physical address 0x1000 (ie. folio_page(folio_A, 1)) 5) Both call zone_device_folio_init() which splits the folio, meaning the previous step would touch folio_page(folio_A, 0) even though that caller was not allocated physical address 0. If that's the concern then what I'm saying (and what I think Jason was getting at) is that (5) above is wrong - the driver doesn't (and shouldn't) update the compound head (ie. folio_page(folio_a, 0)) - zone_device_folio_init() should just overwrite all the metadata in the struct pages it has been allocated. We're not really splitting folios, because it makes no sense to talk of splitting a free folio, which I think is why some core-mm people took notice. Also, it doesn't matter that you are leaving the previous compound head struct pages in some weird state: the core-mm doesn't care about them anymore, and the struct page/folio is only used by core-mm, not drivers. They will get properly (re)initialised when needed for the core-mm in zone_device_folio_init(), which in this case would happen in step 3. - Alistair > > different folios/pages. > > > > Then eventually one thread would succeed in creating the mapping from VA->VRAM > > and the losing thread would free the VRAM allocation back to DRM buddy.
> > > > So I'm a bit confused by the above statement that two GPUs faults could each > > have the same pages or be calling zone_device_folio_init() on the same pages. > > How would that happen? > > > > See above. I hope my above statements make this clear. > > Matt > > > > > Is the concern that reinitialisation would touch pages outside of the allocated > > > > VRAM range if it was previously a large folio? > > > > > > No just two threads call zone_device_folio_reinitialise/init at the same > > > time, on the same folio. > > > > > > If we call zone_device_folio_reinitialise in ->folio_free this problem > > > goes away. We could solve this with split_lock or something but I'd > > > prefer not to add lock for this (although some of prior revs did do > > > this, maybe we will revist this later). > > > > > > Anyways - this falls in driver detail / choice IMO. > > > > Agreed. > > > > - Alistair > > > > > Matt > > > > > > > > > > > > > > > > +{ > > > > > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > > > > > + int order, i; > > > > > > > > > + > > > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > > > > > + > > > > > > > > > + folio->mapping = NULL; > > > > > > > > > + order = folio_order(folio); > > > > > > > > > + if (!order) > > > > > > > > > + return; > > > > > > > > > + > > > > > > > > > + folio_reset_order(folio); > > > > > > > > > + > > > > > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > > > > > + struct page *page = folio_page(folio, i); > > > > > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > > > > > + > > > > > > > > > + ClearPageHead(page); > > > > > > > > > + clear_compound_head(page); > > > > > > > > > + > > > > > > > > > + new_folio->mapping = NULL; > > > > > > > > > + /* > > > > > > > > > + * Reset pgmap which was over-written by > > > > > > > > > + * prep_compound_page(). 
> > > > > > > > > + */ > > > > > > > > > + new_folio->pgmap = pgmap; > > > > > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > > > > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > > > > > would make my implementation explode. I explained this in detail here [1] > > > > > > > to Zi. > > > > > > > > > > > > > > To recap [1], my memory allocator has no visibility into individual > > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > > > > > allows VRAM to be allocated or evicted for both traditional GPU > > > > > > > allocations (GEMs) and SVM allocations. > > > > > > > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > > > > > and are later reallocated with a different order in > > > > > > > zone_device_page_init, the implementation breaks. This problem is not > > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > > > > > it works by coincidence. Reallocating at a different order is valid > > > > > > > behavior and must be supported. > > > > > > > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > > > > > always just (re)allocate at the same order and not worry about having to > > > > > > reinitialise things to different orders. > > > > > > > > > > > > > > > > I would agree with this statement too — it’s perfectly valid if a driver > > > > > always wants to (re)allocate at the same order. 
> > > > > > > > > > Matt > > > > > > > > > > > - Alistair > > > > > > > > > > > > > Matt > > > > > > > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > > > > > > > > > + } > > > > > > > > > +} > > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > > > > > + > > > > > > > > > void free_zone_device_folio(struct folio *folio) > > > > > > > > > { > > > > > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > > > > > case MEMORY_DEVICE_COHERENT: > > > > > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > > > > > break; > > > > > > > > > + free_zone_device_folio_prepare(folio); > > > > > > > > > pgmap->ops->folio_free(folio, order); > > > > > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > > > > > break; > > > > > > > > > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
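[Editorial note: the alternative Alistair sketches in steps 1-5 can be modeled in userspace too: if the allocation-time init only ever writes the struct page metadata of the range it was handed, two faults carving disjoint 4K allocations out of the same order-9 folio never write the same page, and the stale compound head is simply left alone. A single-threaded toy sketch with assumed types, not kernel code:]

```c
#include <assert.h>

/* An order-9 "folio" is just a 512-entry array of toy pages here. */
#define TOY_FOLIO_ORDER 9
#define TOY_FOLIO_PAGES (1UL << TOY_FOLIO_ORDER)

struct toy_page {
	int owner;       /* which fault handler wrote this page's metadata */
	int initialized;
};

static struct toy_page toy_folio_a[TOY_FOLIO_PAGES];

/*
 * Allocation-side init: unconditionally overwrite the metadata of the
 * pages in [first, first + (1 << order)) and nothing else, so the old
 * compound head (toy_folio_a[0]) is never touched unless it is part
 * of this allocation.
 */
static void toy_folio_init(unsigned long first, unsigned int order, int owner)
{
	unsigned long i;

	for (i = first; i < first + (1UL << order); i++) {
		toy_folio_a[i].owner = owner;
		toy_folio_a[i].initialized = 1;
	}
}
```

[Under this model the disjointness of the DRM Buddy allocations is what guarantees race freedom: no extra lock, and no "split" of the free folio.]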
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 2:06 ` Alistair Popple @ 2026-01-13 2:16 ` Matthew Brost 2026-01-13 2:31 ` Alistair Popple 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-13 2:16 UTC (permalink / raw) To: Alistair Popple Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On Tue, Jan 13, 2026 at 01:06:02PM +1100, Alistair Popple wrote: > On 2026-01-13 at 12:40 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote: > > > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > > > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. 
> > > > > > > > > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > > > > > subsequent allocations. > > > > > > > > > > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > > > > > Cc: linux-mm@kvack.org > > > > > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > > > > > --- > > > > > > > > > > include/linux/memremap.h | 1 + > > > > > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > > > > > --- a/include/linux/memremap.h > > > > > > > > > > +++ b/include/linux/memremap.h > > > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const 
struct page *page) > > > > > > > > > > > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > > > > > --- a/mm/memremap.c > > > > > > > > > > +++ b/mm/memremap.c > > > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > > > > > } > > > > > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > > > > > + * > > > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > > > > > + * > > > > > > > > > > + * This helper: > > > > > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). 
> > > > > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > > > > > + * * clears ->mapping > > > > > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > > > > > + * > > > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > > > > > + * required. > > > > > > > > > > + */ > > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > > > > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > > > > > freed, from the core-mm perspective the folio is already free. This is just > > > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > > > > > actually involve just recreating a compound folio. > > > > > > > > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > > > > > > > > > zone_device_folio_reinitialise - that works for me... but seem like > > > > > > everyone has a opinion. > > > > > > > > > > Well of course :) There are only two hard problems in programming and > > > > > I forget the other one. But I didn't want to just say I don't like > > > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open > > > > > to others. > > > > > > > > > > > > > zone_device_folio_reinitialise is good with me. > > > > > > > > > > > > > > > > > roll this into a zone_device_folio_init() type function (similar to > > > > > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > > > > > > > > > > > > > I don’t think doing this at allocation actually works without a big lock > > > > > > per pgmap. 
Consider the case where a VRAM allocator allocates two
> > > > > > distinct subsets of a large folio and you have a multi-threaded GPU page
> > > > > > fault handler (Xe does). It's possible two threads could call
> > > > > > zone_device_folio_reinitialise at the same time, racing and causing all
> > > > > > sorts of issues. My plan is to just call this function in the driver's
> > > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> > > > >
> > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> > > > > intimately) - the folio metadata initialisation should only happen after the
> > > > > VRAM allocation has occurred.
> > > > >
> > > > > IOW the VRAM allocator needs to deal with the locking; once you have the VRAM
> > > > > physical range you just initialise the folio/pages associated with that range
> > > > > with zone_device_folio_(re)initialise() and you're done.
> > > >
> > > > Our VRAM allocator does have locking (via DRM buddy), but that layer
> > >
> > > I mean I assumed it did :-)
> > >
> > > > doesn't have visibility into the folio or its pages. By the time we
> > > > handle the folio/pages in the GPU fault handler, there are no global
> > > > locks preventing two GPU faults from each having, say, 16 pages from the
> > > > same order-9 folio. I believe if both threads call
> > > > zone_device_folio_reinitialise/init at the same time, bad things could
> > > > happen.
> > >
> > > This is confusing to me. If you are getting a GPU fault it implies no page is
> > > mapped at a particular virtual address. The normal process (or at least the
> > > process I'm familiar with) for handling this is to allocate and map a page at
> > > the faulting virtual address. So in the scenario of two GPUs faulting on the
> > > same VA each thread would allocate VRAM using DRM buddy, presumably getting
> >
> > Different VAs.
> > > different physical pages, and so the zone_device_folio_init() call would be to
> >
> > Yes, different physical pages but same folio, which is possible if it
> > hasn't been split yet (i.e., both threads own a different subset of pages in
> > the same folio, try to split at the same time, and boom, something bad
> > happens).
>
> So is your concern something like this:
>
> 1) There is a free folio A of order 9, starting at physical address 0.
> 2) You have two GPU faults, both call into DRM Buddy to get a 4K page.
> 3) GPU 1 gets allocated physical address 0 (ie. folio_page(folio_A, 0))
> 4) GPU 2 gets allocated physical address 0x1000 (ie. folio_page(folio_A, 1))
> 5) Both call zone_device_folio_init() which splits the folio, meaning the
>    previous step would touch folio_page(folio_A, 0) even though it has not been
>    allocated physical address 0.

Yes.

> If that's the concern then what I'm saying (and what I think Jason was getting
> at) is that (5) above is wrong - the driver doesn't (and shouldn't) update the
> compound head (ie. folio_page(folio_A, 0)) - zone_device_folio_init() should
> just overwrite all the metadata in the struct pages it has been allocated. We're
> not really splitting folios, because it makes no sense to talk of splitting a
> free folio, which I think is why some core-mm people took notice.
>
> Also, it doesn't matter that you are leaving the previous compound head struct
> pages in some weird state; the core-mm doesn't care about them anymore and the
> struct page/folio is only used by core-mm, not drivers. They will get properly
> (re)initialised when needed for the core-mm in zone_device_folio_init(), which in
> this case would happen in step 3.

Something like this should work too. I started implementing it on my
side earlier today, but of course, I was hitting hangs. From an API
point of view, zone_device_folio_init would need to be updated to accept
a pgmap argument.
In this example, folio_page(folio_A, 1) wouldn't have
a valid pgmap to retrieve. It could look at the folio's pgmap, but that
also seems like it could race under the right conditions.

Let me see what this looks like and whether I can get it working.

Matt

> - Alistair
>
> > > different folios/pages.
> > >
> > > Then eventually one thread would succeed in creating the mapping from VA->VRAM
> > > and the losing thread would free the VRAM allocation back to DRM buddy.
> > >
> > > So I'm a bit confused by the above statement that two GPU faults could each
> > > have the same pages or be calling zone_device_folio_init() on the same pages.
> > > How would that happen?
> >
> > See above. I hope my above statements make this clear.
> >
> > Matt
> >
> > > > > Is the concern that reinitialisation would touch pages outside of the allocated
> > > > > VRAM range if it was previously a large folio?
> > > >
> > > > No, just two threads call zone_device_folio_reinitialise/init at the same
> > > > time, on the same folio.
> > > >
> > > > If we call zone_device_folio_reinitialise in ->folio_free this problem
> > > > goes away. We could solve this with split_lock or something but I'd
> > > > prefer not to add a lock for this (although some prior revs did do
> > > > this; maybe we will revisit this later).
> > > >
> > > > Anyways - this falls in driver detail / choice IMO.
> > >
> > > Agreed.
> > >
> > > - Alistair
> > >
> > > > Matt
> > > >
> > > > > > > > > > +{
> > > > > > > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > > > > > > +	int order, i;
> > > > > > > > > > +
> > > > > > > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > > > > > > +
> > > > > > > > > > +	folio->mapping = NULL;
> > > > > > > > > > +	order = folio_order(folio);
> > > > > > > > > > +	if (!order)
> > > > > > > > > > +		return;
> > > > > > > > > > +
> > > > > > > > > > +	folio_reset_order(folio);
> > > > > > > > > > +
> > > > > > > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > > > > > > +		struct page *page = folio_page(folio, i);
> > > > > > > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > > > > > > +
> > > > > > > > > > +		ClearPageHead(page);
> > > > > > > > > > +		clear_compound_head(page);
> > > > > > > > > > +
> > > > > > > > > > +		new_folio->mapping = NULL;
> > > > > > > > > > +		/*
> > > > > > > > > > +		 * Reset pgmap which was over-written by
> > > > > > > > > > +		 * prep_compound_page().
> > > > > > > > > > +		 */
> > > > > > > > > > +		new_folio->pgmap = pgmap;
> > > > > > > > > > +		new_folio->share = 0; /* fsdax only, unused for device private */
> > > > > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > > > >
> > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > > > > > > that PMD_ORDER more frees than we'd like?
> > > > > > > >
> > > > > > > > No, calling free_folio() more often doesn't solve anything—in fact, that
> > > > > > > > would make my implementation explode. I explained this in detail here [1]
> > > > > > > > to Zi.
> > > > > > > > To recap [1], my memory allocator has no visibility into individual
> > > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > > > > > > allows VRAM to be allocated or evicted for both traditional GPU
> > > > > > > > allocations (GEMs) and SVM allocations.
> > > > > > > >
> > > > > > > > Now, to recap the actual issue: if device folios are not split upon free
> > > > > > > > and are later reallocated with a different order in
> > > > > > > > zone_device_page_init, the implementation breaks. This problem is not
> > > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > > > > > > it works by coincidence. Reallocating at a different order is valid
> > > > > > > > behavior and must be supported.
> > > > > > >
> > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to
> > > > > > > always just (re)allocate at the same order and not worry about having to
> > > > > > > reinitialise things to different orders.
> > > > > >
> > > > > > I would agree with this statement too — it's perfectly valid if a driver
> > > > > > always wants to (re)allocate at the same order.
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > - Alistair
> > > > > > >
> > > > > > > > Matt
> > > > > > > >
> > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > > > > > >
> > > > > > > > > > +	}
> > > > > > > > > > +}
> > > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > > > > > > > > > +
> > > > > > > > > >  void free_zone_device_folio(struct folio *folio)
> > > > > > > > > >  {
> > > > > > > > > >  	struct dev_pagemap *pgmap = folio->pgmap;
> > > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > > > > > > > > >  	case MEMORY_DEVICE_COHERENT:
> > > > > > > > > >  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > > > > > > > > >  			break;
> > > > > > > > > > +		free_zone_device_folio_prepare(folio);
> > > > > > > > > >  		pgmap->ops->folio_free(folio, order);
> > > > > > > > > >  		percpu_ref_put_many(&folio->pgmap->ref, nr);
> > > > > > > > > >  		break;
> > > > > > > > >
> > > > > > > > > Balbir
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
  2026-01-13  2:16 ` Matthew Brost
@ 2026-01-13  2:31 ` Alistair Popple
  0 siblings, 0 replies; 39+ messages in thread
From: Alistair Popple @ 2026-01-13 2:31 UTC (permalink / raw)
To: Matthew Brost
Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes,
    Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, linux-mm, linux-cxl, linux-kernel

On 2026-01-13 at 13:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...

[snip]

> Something like this should work too. I started implementing it on my
> side earlier today, but of course, I was hitting hangs. From an API
> point of view, zone_device_folio_init would need to be updated to accept
> a pgmap argument. In this example, folio_page(folio_A, 1) wouldn't have
> a valid pgmap to retrieve. It could look at the folio's pgmap, but that
> also seems like it could race under the right conditions.

I think passing a pgmap argument in would be fine - it allows us to maintain
the concept that zone_device_folio_init() does exactly what it says on the
tin. That is, it initialises a ZONE_DEVICE folio ready for use by the core-mm
without placing any assumptions or restrictions on the current state of the
folio/page structs.

> Let me see what this looks like and whether I can get it working.
>
> Matt

 - Alistair
Thread overview: 39+ messages
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
2026-01-11 22:35   ` Matthew Wilcox
2026-01-12  0:19     ` Balbir Singh
2026-01-12  0:51       ` Zi Yan
2026-01-12  1:37         ` Matthew Brost
2026-01-12  4:50           ` Balbir Singh
2026-01-12 13:45             ` Jason Gunthorpe
2026-01-12 16:31               ` Zi Yan
2026-01-12 16:50                 ` Jason Gunthorpe
2026-01-12 17:46                   ` Zi Yan
2026-01-12 18:25                     ` Jason Gunthorpe
2026-01-12 18:55                       ` Zi Yan
2026-01-12 19:28                         ` Jason Gunthorpe
2026-01-12 23:34                           ` Zi Yan
2026-01-12 23:53                             ` Jason Gunthorpe
2026-01-13  0:35                               ` Zi Yan
2026-01-12 23:07                           ` Matthew Brost
2026-01-12 21:49                 ` Matthew Brost
2026-01-12 23:15                   ` Zi Yan
2026-01-12 23:22                     ` Matthew Brost
2026-01-12 23:44                       ` Alistair Popple
2026-01-12 23:54                         ` Jason Gunthorpe
2026-01-12 23:31                     ` Jason Gunthorpe
2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
2026-01-12  0:44   ` Balbir Singh
2026-01-12  1:16     ` Matthew Brost
2026-01-12  2:15       ` Balbir Singh
2026-01-12  2:37         ` Matthew Brost
2026-01-12  2:50           ` Matthew Brost
2026-01-12 23:58             ` Alistair Popple
2026-01-13  0:23               ` Matthew Brost
2026-01-13  0:43                 ` Alistair Popple
2026-01-13  1:07                   ` Matthew Brost
2026-01-13  1:35                     ` Alistair Popple
2026-01-13  1:40                       ` Matthew Brost
2026-01-13  2:06                         ` Alistair Popple
2026-01-13  2:16                           ` Matthew Brost
2026-01-13  2:31                             ` Alistair Popple