* [PATCH v4 0/7] Enable THP support in drm_pagemap
@ 2026-01-11 20:55 Francois Dugast
  2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
  2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
  0 siblings, 2 replies; 39+ messages in thread

From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Francois Dugast, Zi Yan, Madhavan Srinivasan,
    Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
    Suren Baghdasaryan, Michal Hocko, Mike Rapoport, Vlastimil Babka,
    Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
    Felix Kuehling, Alex Deucher, Christian König, David Airlie,
    Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
    Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Jason Gunthorpe,
    Leon Romanovsky, Balbir Singh, Dan Williams, Matthew Wilcox,
    Jan Kara, Alexander Viro, Christian Brauner, linuxppc-dev, kvm,
    linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl,
    nvdimm, linux-fsdevel

Use Balbir Singh's series for device-private THP support [1] and previous
preparation work in drm_pagemap [2] to add 2MB/THP support in xe. This
leads to significant performance improvements when using SVM with 2MB
pages.

[1] https://lore.kernel.org/linux-mm/20251001065707.920170-1-balbirs@nvidia.com/
[2] https://patchwork.freedesktop.org/series/151754/

v2:
- rebase on top of multi-device SVM
- add drm_pagemap_cpages() with temporary patch
- address other feedback from Matt Brost on v1

v3:
The major change is to remove the dependency on the mm/huge_memory helper
migrate_device_split_page(), which was called explicitly when a 2M buddy
allocation backed by a large folio would later be reused for a smaller
allocation (4K or 64K). Instead, the first 3 patches provided by Matthew
Brost ensure large folios are split at the time of freeing.
v4:
- add order argument to folio_free callback
- send complete series to linux-mm and MM folks as requested (Zi Yan and
  Andrew Morton) and cover letter to anyone receiving at least one of the
  patches (Liam R. Howlett)

Cc: Zi Yan <ziy@nvidia.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Cc: nvdimm@lists.linux.dev
Cc:
linux-fsdevel@vger.kernel.org

Francois Dugast (3):
  drm/pagemap: Unlock and put folios when possible
  drm/pagemap: Add helper to access zone_device_data
  drm/pagemap: Enable THP support for GPU memory migration

Matthew Brost (4):
  mm/zone_device: Add order argument to folio_free callback
  mm/zone_device: Add free_zone_device_folio_prepare() helper
  fs/dax: Use free_zone_device_folio_prepare() helper
  drm/pagemap: Correct cpages calculation for migrate_vma_setup

 arch/powerpc/kvm/book3s_hv_uvmem.c       |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   2 +-
 drivers/gpu/drm/drm_gpusvm.c             |   7 +-
 drivers/gpu/drm/drm_pagemap.c            | 165 ++++++++++++++++++-----
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |   4 +-
 drivers/pci/p2pdma.c                     |   2 +-
 fs/dax.c                                 |  24 +---
 include/drm/drm_pagemap.h                |  15 +++
 include/linux/memremap.h                 |   8 +-
 lib/test_hmm.c                           |   4 +-
 mm/memremap.c                            |  60 ++++++++-
 11 files changed, 227 insertions(+), 66 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 39+ messages in thread
* [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
  2026-01-11 22:35   ` Matthew Wilcox
  2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
  1 sibling, 1 reply; 39+ messages in thread

From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
    Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
    Felix Kuehling, Alex Deucher, Christian König, David Airlie,
    Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
    Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Jason Gunthorpe,
    Leon Romanovsky, Balbir Singh, Lorenzo Stoakes, Liam R. Howlett,
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau,
    linux-pci, linux-mm, linux-cxl, Francois Dugast

From: Matthew Brost <matthew.brost@intel.com>

The core MM splits the folio before calling folio_free, restoring the
zone pages associated with the folio to an initialized state (e.g.,
non-compound, pgmap valid, etc...). The order argument represents the
folio’s order prior to the split which can be used driver side to know
how many pages are being freed.
Fixes: 3a5a06554566 ("mm/zone_device: rename page_free callback to folio_free") Cc: Zi Yan <ziy@nvidia.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Lyude Paul <lyude@redhat.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Cc: nouveau@lists.freedesktop.org Cc: linux-pci@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-cxl@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Francois Dugast <francois.dugast@intel.com> --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +- drivers/gpu/drm/drm_pagemap.c | 3 ++- drivers/gpu/drm/nouveau/nouveau_dmem.c | 4 ++-- drivers/pci/p2pdma.c | 2 +- include/linux/memremap.h | 7 ++++++- lib/test_hmm.c | 4 +--- mm/memremap.c | 5 +++-- 8 files changed, 17 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index e5000bef90f2..b58f34eec6e5 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -1014,7 +1014,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf) * to a normal PFN during H_SVM_PAGE_OUT. * Gets called with kvm->arch.uvmem_lock held. */ -static void kvmppc_uvmem_folio_free(struct folio *folio) +static void kvmppc_uvmem_folio_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; unsigned long pfn = page_to_pfn(page) - diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index af53e796ea1b..a26e3c448e47 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -567,7 +567,7 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, return r < 0 ? 
r : 0; } -static void svm_migrate_folio_free(struct folio *folio) +static void svm_migrate_folio_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; struct svm_range_bo *svm_bo = page->zone_device_data; diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c index 03ee39a761a4..df253b13cf85 100644 --- a/drivers/gpu/drm/drm_pagemap.c +++ b/drivers/gpu/drm/drm_pagemap.c @@ -1144,11 +1144,12 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas, /** * drm_pagemap_folio_free() - Put GPU SVM zone device data associated with a folio * @folio: Pointer to the folio + * @order: Order of the folio prior to being split by core MM * * This function is a callback used to put the GPU SVM zone device data * associated with a page when it is being released. */ -static void drm_pagemap_folio_free(struct folio *folio) +static void drm_pagemap_folio_free(struct folio *folio, unsigned int order) { drm_pagemap_zdd_put(folio->page.zone_device_data); } diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 58071652679d..545f316fca14 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -115,14 +115,14 @@ unsigned long nouveau_dmem_page_addr(struct page *page) return chunk->bo->offset + off; } -static void nouveau_dmem_folio_free(struct folio *folio) +static void nouveau_dmem_folio_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page); struct nouveau_dmem *dmem = chunk->drm->dmem; spin_lock(&dmem->lock); - if (folio_order(folio)) { + if (order) { page->zone_device_data = dmem->free_folios; dmem->free_folios = folio; } else { diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 4a2fc7ab42c3..a6fa7610f8a8 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -200,7 +200,7 @@ static const struct attribute_group p2pmem_group = { 
.name = "p2pmem", }; -static void p2pdma_folio_free(struct folio *folio) +static void p2pdma_folio_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page)); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 713ec0435b48..97fcffeb1c1e 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -79,8 +79,13 @@ struct dev_pagemap_ops { * Called once the folio refcount reaches 0. The reference count will be * reset to one by the core code after the method is called to prepare * for handing out the folio again. + * + * The core MM splits the folio before calling folio_free, restoring the + * zone pages associated with the folio to an initialized state (e.g., + * non-compound, pgmap valid, etc...). The order argument represents the + * folio’s order prior to the split. */ - void (*folio_free)(struct folio *folio); + void (*folio_free)(struct folio *folio, unsigned int order); /* * Used for private (un-addressable) device memory only. 
Must migrate diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 8af169d3873a..e17c71d02a3a 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -1580,13 +1580,11 @@ static const struct file_operations dmirror_fops = { .owner = THIS_MODULE, }; -static void dmirror_devmem_free(struct folio *folio) +static void dmirror_devmem_free(struct folio *folio, unsigned int order) { struct page *page = &folio->page; struct page *rpage = BACKING_PAGE(page); struct dmirror_device *mdevice; - struct folio *rfolio = page_folio(rpage); - unsigned int order = folio_order(rfolio); if (rpage != page) { if (order) diff --git a/mm/memremap.c b/mm/memremap.c index 63c6ab4fdf08..39dc4bd190d0 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -417,6 +417,7 @@ void free_zone_device_folio(struct folio *folio) { struct dev_pagemap *pgmap = folio->pgmap; unsigned long nr = folio_nr_pages(folio); + unsigned int order = folio_order(folio); int i; if (WARN_ON_ONCE(!pgmap)) @@ -453,7 +454,7 @@ void free_zone_device_folio(struct folio *folio) case MEMORY_DEVICE_COHERENT: if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) break; - pgmap->ops->folio_free(folio); + pgmap->ops->folio_free(folio, order); percpu_ref_put_many(&folio->pgmap->ref, nr); break; @@ -472,7 +473,7 @@ void free_zone_device_folio(struct folio *folio) case MEMORY_DEVICE_PCI_P2PDMA: if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) break; - pgmap->ops->folio_free(folio); + pgmap->ops->folio_free(folio, order); break; } } -- 2.43.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
@ 2026-01-11 22:35   ` Matthew Wilcox
  2026-01-12  0:19     ` Balbir Singh
  0 siblings, 1 reply; 39+ messages in thread

From: Matthew Wilcox @ 2026-01-11 22:35 UTC (permalink / raw)
To: Francois Dugast
Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
    Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
    Felix Kuehling, Alex Deucher, Christian König, David Airlie,
    Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
    Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Jason Gunthorpe,
    Leon Romanovsky, Balbir Singh, Lorenzo Stoakes, Liam R. Howlett,
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau,
    linux-pci, linux-mm, linux-cxl

On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> The core MM splits the folio before calling folio_free, restoring the
> zone pages associated with the folio to an initialized state (e.g.,
> non-compound, pgmap valid, etc...). The order argument represents the
> folio’s order prior to the split which can be used driver side to know
> how many pages are being freed.

This really feels like the wrong way to fix this problem.

I think someone from the graphics side really needs to take the lead on
understanding what the MM is doing (both currently and in the future).
I'm happy to work with you, but it feels like there's a lot of churn right
now because there's a lot of people working on this without understanding
the MM side of things (and conversely, I don't think (m)any people on the
MM side really understand what graphics cards are trying to accomplish).

Who is that going to be?  I'm happy to get on the phone with someone.
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-11 22:35 ` Matthew Wilcox
@ 2026-01-12  0:19   ` Balbir Singh
  2026-01-12  0:51     ` Zi Yan
  0 siblings, 1 reply; 39+ messages in thread

From: Balbir Singh @ 2026-01-12  0:19 UTC (permalink / raw)
To: Matthew Wilcox, Francois Dugast
Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
    Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
    Felix Kuehling, Alex Deucher, Christian König, David Airlie,
    Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
    Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Jason Gunthorpe,
    Leon Romanovsky, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
    Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple,
    linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci,
    linux-mm, linux-cxl

On 1/12/26 08:35, Matthew Wilcox wrote:
> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>> The core MM splits the folio before calling folio_free, restoring the
>> zone pages associated with the folio to an initialized state (e.g.,
>> non-compound, pgmap valid, etc...). The order argument represents the
>> folio’s order prior to the split which can be used driver side to know
>> how many pages are being freed.
>
> This really feels like the wrong way to fix this problem.
>

This stems from a special requirement, freeing is done in two phases

1. Free the folio -> inform the driver (which implies freeing the
   backing device memory)
2. Return the folio back, split it back to single order folios

The current code does not do 2. 1 followed by 2 does not work for
Francois since the backing memory can get reused before we reach step 2.
The proposed patch does 2 followed 1, but doing 2 means we've lost the
folio order and thus the old order is passed in. Although, I wonder if the
backing folio's zone_device_data can be used to encode any order information
about the device side allocation.

@Francois, I hope I did not miss anything in the explanation above.

> I think someone from the graphics side really needs to take the lead on
> understanding what the MM is doing (both currently and in the future).
> I'm happy to work with you, but it feels like there's a lot of churn right
> now because there's a lot of people working on this without understanding
> the MM side of things (and conversely, I don't think (m)any people on the
> MM side really understand what graphics cards are trying to accomplish).
>

I suspect you are referring to folio specialization and/or downsizing?

> Who is that going to be? I'm happy to get on the phone with someone.

Happy to work with you, but I am not the authority on graphics, I can speak
to zone device folios. I suspect we'd need to speak to more than one person.

Balbir
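Balbir's zone_device_data idea above can be sketched in userspace as pointer tagging. This is only a hypothetical encoding, not an existing kernel helper: it assumes the driver structure behind the zone_device_data pointer is aligned to at least 16 bytes, leaving 4 low bits free for orders 0..15 (enough for order-9, i.e. 2MB on 4K pages).

```c
#include <stdint.h>

/* Hypothetical sketch: stash the allocation order in the unused low
 * bits of a zone_device_data-style pointer. Assumes the pointed-to
 * structure is at least 16-byte aligned, so 4 bits are available. */

#define ZDD_ORDER_MASK 0xFUL

static inline void *zdd_encode(void *data, unsigned int order)
{
	/* caller guarantees data is 16-byte aligned and order <= 15 */
	return (void *)((uintptr_t)data | (order & ZDD_ORDER_MASK));
}

static inline void *zdd_data(void *zdd)
{
	/* strip the tag bits to recover the original pointer */
	return (void *)((uintptr_t)zdd & ~ZDD_ORDER_MASK);
}

static inline unsigned int zdd_order(void *zdd)
{
	/* recover the order that was folded into the low bits */
	return (unsigned int)((uintptr_t)zdd & ZDD_ORDER_MASK);
}
```

A driver-side free path could then recover the device-side allocation order from the tagged pointer alone, without needing the folio's (already reset) compound state.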
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-12  0:19 ` Balbir Singh
@ 2026-01-12  0:51   ` Zi Yan
  2026-01-12  1:37     ` Matthew Brost
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread

From: Zi Yan @ 2026-01-12  0:51 UTC (permalink / raw)
To: Matthew Wilcox, Balbir Singh
Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost,
    Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
    Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher,
    Christian König, David Airlie, Simona Vetter, Maarten Lankhorst,
    Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
    Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
    Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel,
    amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl

On 11 Jan 2026, at 19:19, Balbir Singh wrote:

> On 1/12/26 08:35, Matthew Wilcox wrote:
>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>> The core MM splits the folio before calling folio_free, restoring the
>>> zone pages associated with the folio to an initialized state (e.g.,
>>> non-compound, pgmap valid, etc...). The order argument represents the
>>> folio’s order prior to the split which can be used driver side to know
>>> how many pages are being freed.
>>
>> This really feels like the wrong way to fix this problem.
>>

Hi Matthew,

I think the wording is confusing, since the actual issue is that:

1. zone_device_page_init() calls prep_compound_page() to form a large folio,
2. but free_zone_device_folio() never reverse the course,
3. the undo of prep_compound_page() in free_zone_device_folio() needs to
   be done before driver callback ->folio_free(), since once ->folio_free()
   is called, the folio can be reallocated immediately,
4. after the undo of prep_compound_page(), folio_order() can no longer provide
   the original order information, thus, folio_free() needs that for proper
   device side ref manipulation.

So this is not used for "split" but undo of prep_compound_page(). It might
look like a split to non core MM people, since it changes a large folio
to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
open codes it in free_pages_prepare() by resetting page flags, page->mapping,
and so on. So it might be why the undo prep_compound_page() is missed
by non core MM people.

>
> This stems from a special requirement, freeing is done in two phases
>
> 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
> 2. Return the folio back, split it back to single order folios

Hi Balbir,

Please refrain from using "split" here, since it confuses MM people. A folio
is split when it is still in use, but in this case, the folio has been freed
and needs to be restored to "free page" state.

>
> The current code does not do 2. 1 followed by 2 does not work for
> Francois since the backing memory can get reused before we reach step 2.
> The proposed patch does 2 followed 1, but doing 2 means we've lost the
> folio order and thus the old order is passed in. Although, I wonder if the
> backing folio's zone_device_data can be used to encode any order information
> about the device side allocation.
>
> @Francois, I hope I did not miss anything in the explanation above.
>
>> I think someone from the graphics side really needs to take the lead on
>> understanding what the MM is doing (both currently and in the future).
>> I'm happy to work with you, but it feels like there's a lot of churn right
>> now because there's a lot of people working on this without understanding
>> the MM side of things (and conversely, I don't think (m)any people on the
>> MM side really understand what graphics cards are trying to accomplish).
>>
>
> I suspect you are referring to folio specialization and/or downsizing?
>
>> Who is that going to be? I'm happy to get on the phone with someone.
>
> Happy to work with you, but I am not the authority on graphics, I can speak
> to zone device folios. I suspect we'd need to speak to more than one person.
>

-- 
Best Regards,
Yan, Zi
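Zi Yan's point 3 above — that the compound-state undo must happen before ->folio_free() because the pages can be reallocated the instant the driver free runs — can be illustrated with a toy userspace model. Everything below is a mock (none of these names exist in the kernel); it only demonstrates why one ordering clobbers a reused page while the other does not.

```c
#include <stdbool.h>

/* Toy model of the two freeing orders discussed in this thread. Once
 * the driver-free step runs, the backing memory may immediately be
 * handed to a new owner; undoing the compound state after that point
 * would overwrite the new owner's page state. Names are illustrative. */

enum page_state { COMPOUND_TAIL, REINITIALIZED, REUSED };

static enum page_state tail_state;

static void mock_driver_free(void)
{
	/* simulate the page being reallocated right after the free */
	tail_state = REUSED;
}

static void mock_undo_compound(void)
{
	tail_state = REINITIALIZED;
}

/* Unsafe order ("1 followed by 2"): driver free first, undo afterwards. */
static bool free_then_undo(void)
{
	tail_state = COMPOUND_TAIL;
	mock_driver_free();
	mock_undo_compound();		/* clobbers the already-reused page */
	return tail_state == REUSED;	/* false: new owner's state lost */
}

/* Order used by this series ("2 followed by 1"): undo, then driver free. */
static bool undo_then_free(void)
{
	tail_state = COMPOUND_TAIL;
	mock_undo_compound();
	mock_driver_free();
	return tail_state == REUSED;	/* true: reuse sees a clean page */
}
```

In the unsafe order the final state check fails, modeling the corruption; in the order this series implements, the reused page's state survives.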
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-12  0:51 ` Zi Yan
@ 2026-01-12  1:37   ` Matthew Brost
  2026-01-12  4:50   ` Balbir Singh
  2026-01-12 13:45   ` Jason Gunthorpe
  2 siblings, 0 replies; 39+ messages in thread

From: Matthew Brost @ 2026-01-12  1:37 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
    dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
    Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher,
    Christian König, David Airlie, Simona Vetter, Maarten Lankhorst,
    Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
    Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
    Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel,
    amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl

On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>
> > On 1/12/26 08:35, Matthew Wilcox wrote:
> >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> >>> The core MM splits the folio before calling folio_free, restoring the
> >>> zone pages associated with the folio to an initialized state (e.g.,
> >>> non-compound, pgmap valid, etc...). The order argument represents the
> >>> folio’s order prior to the split which can be used driver side to know
> >>> how many pages are being freed.
> >>
> >> This really feels like the wrong way to fix this problem.
> >>
>
> Hi Matthew,
>
> I think the wording is confusing, since the actual issue is that:
>
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> be done before driver callback ->folio_free(), since once ->folio_free()
> is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> the original order information, thus, folio_free() needs that for proper
> device side ref manipulation.
>
> So this is not used for "split" but undo of prep_compound_page(). It might
> look like a split to non core MM people, since it changes a large folio
> to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
> open codes it in free_pages_prepare() by resetting page flags, page->mapping,
> and so on. So it might be why the undo prep_compound_page() is missed
> by non core MM people.
>

Let me try to reword this while avoiding the term “split” and properly
explaining the problem.

> >
> > This stems from a special requirement, freeing is done in two phases
> >
> > 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
> > 2. Return the folio back, split it back to single order folios
>
> Hi Balbir,
>
> Please refrain from using "split" here, since it confuses MM people. A folio
> is split when it is still in use, but in this case, the folio has been freed
> and needs to be restored to "free page" state.
>

Yeah, “split” is a bad term. We are reinitializing all zone pages in a
folio upon free.

> >
> > The current code does not do 2. 1 followed by 2 does not work for
> > Francois since the backing memory can get reused before we reach step 2.
> > The proposed patch does 2 followed 1, but doing 2 means we've lost the
> > folio order and thus the old order is passed in. Although, I wonder if the
> > backing folio's zone_device_data can be used to encode any order information
> > about the device side allocation.
> >
> > @Francois, I hope I did not miss anything in the explanation above.

Yes, correct. The pages in the folio must be reinitialized before calling
into the driver to free them, because once that happens, the pages can be
immediately reallocated.

> >
> >> I think someone from the graphics side really needs to take the lead on
> >> understanding what the MM is doing (both currently and in the future).
> >> I'm happy to work with you, but it feels like there's a lot of churn right
> >> now because there's a lot of people working on this without understanding
> >> the MM side of things (and conversely, I don't think (m)any people on the
> >> MM side really understand what graphics cards are trying to accomplish).

I can’t disagree with anything you’re saying. The core MM is about as
complex as it gets, and my understanding of what’s going on isn’t
great—it’s basically just reverse engineering until I reach a point where
I can fix a problem, think it’s correct, and hope I don’t get shredded.
Graphics/DRM is also quite complex, but that’s where I work...

> >>
> >
> > I suspect you are referring to folio specialization and/or downsizing?
> >
> >> Who is that going to be? I'm happy to get on the phone with someone.
> >
> > Happy to work with you, but I am not the authority on graphics, I can speak
> > to zone device folios. I suspect we'd need to speak to more than one person.
> >

Also happy to work with you, but I agree with Zi—graphics isn’t something
one company can speak as an authority on, much less one person.

Matt

> 
> -- 
> Best Regards,
> Yan, Zi
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
  2026-01-12  0:51 ` Zi Yan
  2026-01-12  1:37   ` Matthew Brost
@ 2026-01-12  4:50   ` Balbir Singh
  2026-01-12 13:45   ` Jason Gunthorpe
  2 siblings, 0 replies; 39+ messages in thread

From: Balbir Singh @ 2026-01-12  4:50 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox
Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost,
    Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
    Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher,
    Christian König, David Airlie, Simona Vetter, Maarten Lankhorst,
    Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
    Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
    Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel,
    amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl

On 1/12/26 10:51, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>
>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>> The core MM splits the folio before calling folio_free, restoring the
>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>> folio’s order prior to the split which can be used driver side to know
>>>> how many pages are being freed.
>>>
>>> This really feels like the wrong way to fix this problem.
>>>
>
> Hi Matthew,
>
> I think the wording is confusing, since the actual issue is that:
>
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> be done before driver callback ->folio_free(), since once ->folio_free()
> is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> the original order information, thus, folio_free() needs that for proper
> device side ref manipulation.
>
> So this is not used for "split" but undo of prep_compound_page(). It might
> look like a split to non core MM people, since it changes a large folio
> to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
> open codes it in free_pages_prepare() by resetting page flags, page->mapping,
> and so on. So it might be why the undo prep_compound_page() is missed
> by non core MM people.
>
>>
>> This stems from a special requirement, freeing is done in two phases
>>
>> 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
>> 2. Return the folio back, split it back to single order folios
>
> Hi Balbir,
>
> Please refrain from using "split" here, since it confuses MM people. A folio
> is split when it is still in use, but in this case, the folio has been freed
> and needs to be restored to "free page" state.
>

Yeah, the word split came from the initial version that called it
folio_split_unref() and I was also thinking of the split callback for
zone device folios, but I agree (re)initialization is a better term.

>>
>> The current code does not do 2. 1 followed by 2 does not work for
>> Francois since the backing memory can get reused before we reach step 2.
>> The proposed patch does 2 followed 1, but doing 2 means we've lost the
>> folio order and thus the old order is passed in. Although, I wonder if the
>> backing folio's zone_device_data can be used to encode any order information
>> about the device side allocation.
>>
>> @Francois, I hope I did not miss anything in the explanation above.
>>
>>> I think someone from the graphics side really needs to take the lead on
>>> understanding what the MM is doing (both currently and in the future).
>>> I'm happy to work with you, but it feels like there's a lot of churn right
>>> now because there's a lot of people working on this without understanding
>>> the MM side of things (and conversely, I don't think (m)any people on the
>>> MM side really understand what graphics cards are trying to accomplish).
>>>
>>
>> I suspect you are referring to folio specialization and/or downsizing?
>>
>>> Who is that going to be? I'm happy to get on the phone with someone.
>>
>> Happy to work with you, but I am not the authority on graphics, I can speak
>> to zone device folios. I suspect we'd need to speak to more than one person.
>>
>
> -- 
> Best Regards,
> Yan, Zi
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 0:51 ` Zi Yan 2026-01-12 1:37 ` Matthew Brost 2026-01-12 4:50 ` Balbir Singh @ 2026-01-12 13:45 ` Jason Gunthorpe 2026-01-12 16:31 ` Zi Yan 2026-01-12 21:49 ` Matthew Brost 2 siblings, 2 replies; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 13:45 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: > On 11 Jan 2026, at 19:19, Balbir Singh wrote: > > > On 1/12/26 08:35, Matthew Wilcox wrote: > >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: > >>> The core MM splits the folio before calling folio_free, restoring the > >>> zone pages associated with the folio to an initialized state (e.g., > >>> non-compound, pgmap valid, etc...). The order argument represents the > >>> folio’s order prior to the split which can be used driver side to know > >>> how many pages are being freed. > >> > >> This really feels like the wrong way to fix this problem. > >> > > Hi Matthew, > > I think the wording is confusing, since the actual issue is that: > > 1. zone_device_page_init() calls prep_compound_page() to form a large folio, > 2. but free_zone_device_folio() never reverse the course, > 3. 
the undo of prep_compound_page() in free_zone_device_folio() needs to > be done before driver callback ->folio_free(), since once ->folio_free() > is called, the folio can be reallocated immediately, > 4. after the undo of prep_compound_page(), folio_order() can no longer provide > the original order information, thus, folio_free() needs that for proper > device side ref manipulation. There is something wrong with the driver if the "folio can be reallocated immediately". The flow generally expects there to be a driver allocator linked to folio_free(): 1) Allocator finds free memory 2) zone_device_page_init() allocates the memory and makes refcount=1 3) __folio_put() knows the refcount is 0. 4) free_zone_device_folio() calls folio_free(), but it doesn't actually need to undo prep_compound_page() because *NOTHING* can use the page pointer at this point. 5) Driver puts the memory back into the allocator and now #1 can happen. It knows how much memory to put back because folio->order is valid from #2. 6) #1 happens again, then #2 happens again and the folio is in the right state for use. The successor #2 fully undoes the work of the predecessor #2. If you have races where #1 can happen immediately after #3 then the driver design is fundamentally broken and passing around order isn't going to help anything. If the allocator is using the struct page memory then step #5 should also clean up the struct page with the allocator data before returning it to the allocator. I vaguely remember talking about this before in the context of the Xe driver.. You can't just take an existing VRAM allocator and layer it on top of the folios and have it broadly ignore the folio_free callback. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
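The six-step flow above can be modeled in a few lines of userspace C (toy names, not the real zone-device API; a sketch under those assumptions): the order written at init time in step 2 is still readable when the free callback runs in step 4, so the driver knows how many pages to return to its allocator in step 5.

```c
/* Userspace model of the alloc/free flow described above. All toy_*
 * names are hypothetical stand-ins, not the real kernel API. */
#include <assert.h>

struct toy_folio {
    int refcount;
    int order;          /* stands in for the compound-page order */
};

static int freed_pages; /* pages returned to the toy allocator */

/* step 2: driver takes memory off its free list and initializes it */
static void toy_zone_device_folio_init(struct toy_folio *f, int order)
{
    f->order = order;   /* successor init overwrites the old order */
    f->refcount = 1;
}

/* steps 4/5: refcount hit zero; order is read before anything reuses it */
static void toy_folio_free(struct toy_folio *f)
{
    freed_pages += 1 << f->order;  /* driver knows how much to put back */
}

/* step 3: last reference dropped */
static void toy_folio_put(struct toy_folio *f)
{
    if (--f->refcount == 0)
        toy_folio_free(f);
}
```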
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 13:45 ` Jason Gunthorpe @ 2026-01-12 16:31 ` Zi Yan 2026-01-12 16:50 ` Jason Gunthorpe 2026-01-12 21:49 ` Matthew Brost 1 sibling, 1 reply; 39+ messages in thread From: Zi Yan @ 2026-01-12 16:31 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 8:45, Jason Gunthorpe wrote: > On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: >> On 11 Jan 2026, at 19:19, Balbir Singh wrote: >> >>> On 1/12/26 08:35, Matthew Wilcox wrote: >>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: >>>>> The core MM splits the folio before calling folio_free, restoring the >>>>> zone pages associated with the folio to an initialized state (e.g., >>>>> non-compound, pgmap valid, etc...). The order argument represents the >>>>> folio’s order prior to the split which can be used driver side to know >>>>> how many pages are being freed. >>>> >>>> This really feels like the wrong way to fix this problem. >>>> >> >> Hi Matthew, >> >> I think the wording is confusing, since the actual issue is that: >> >> 1. zone_device_page_init() calls prep_compound_page() to form a large folio, >> 2. but free_zone_device_folio() never reverse the course, >> 3. 
the undo of prep_compound_page() in free_zone_device_folio() needs to >> be done before driver callback ->folio_free(), since once ->folio_free() >> is called, the folio can be reallocated immediately, >> 4. after the undo of prep_compound_page(), folio_order() can no longer provide >> the original order information, thus, folio_free() needs that for proper >> device side ref manipulation. > > There is something wrong with the driver if the "folio can be > reallocated immediately". > > The flow generally expects there to be a driver allocator linked to > folio_free() > > 1) Allocator finds free memory > 2) zone_device_page_init() allocates the memory and makes refcount=1 > 3) __folio_put() knows the refcount is 0. > 4) free_zone_device_folio() calls folio_free(), but it doesn't > actually need to undo prep_compound_page() because *NOTHING* can > use the page pointer at this point. > 5) Driver puts the memory back into the allocator and now #1 can > happen. It knows how much memory to put back because folio->order > is valid from #2 > 6) #1 happens again, then #2 happens again and the folio is in the > right state for use. The successor #2 fully undoes the work of the > predecessor #2. But how can a successor #2 undo the work if the second #1 only allocates half of the original folio? For example, an order-9 at PFN 0 is allocated and freed, then an order-8 at PFN 0 is allocated and another order-8 at PFN 256 is allocated. How can two #2s undo the same order-9 without corrupting each other’s data? > > If you have races where #1 can happen immediately after #3 then the > driver design is fundamentally broken and passing around order isn't > going to help anything. > > If the allocator is using the struct page memory then step #5 should > also clean up the struct page with the allocator data before returning > it to the allocator. Do you mean ->folio_free() callback should undo prep_compound_page() instead?
> > I vaguely remember talking about this before in the context of the Xe > driver.. You can't just take an existing VRAM allocator and layer it > on top of the folios and have it broadly ignore the folio_free > callback. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 16:31 ` Zi Yan @ 2026-01-12 16:50 ` Jason Gunthorpe 2026-01-12 17:46 ` Zi Yan 2026-01-12 23:07 ` Matthew Brost 0 siblings, 2 replies; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 16:50 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: > > folio_free() > > > > 1) Allocator finds free memory > > 2) zone_device_page_init() allocates the memory and makes refcount=1 > > 3) __folio_put() knows the recount 0. > > 4) free_zone_device_folio() calls folio_free(), but it doesn't > > actually need to undo prep_compound_page() because *NOTHING* can > > use the page pointer at this point. > > 5) Driver puts the memory back into the allocator and now #1 can > > happen. It knows how much memory to put back because folio->order > > is valid from #2 > > 6) #1 happens again, then #2 happens again and the folio is in the > > right state for use. The successor #2 fully undoes the work of the > > predecessor #2. > > But how can a successor #2 undo the work if the second #1 only allocates > half of the original folio? For example, an order-9 at PFN 0 is > allocated and freed, then an order-8 at PFN 0 is allocated and another > order-8 at PFN 256 is allocated. 
How can two #2s undo the same order-9 > without corrupting each other’s data? What do you mean? The fundamental rule is you can't read the folio or the order outside folio_free once its refcount reaches 0. So the successor #2 will write updated heads and order to the order-8 pages at PFN 0 and the ones starting at PFN 256 will remain with garbage. This is OK because nothing is allowed to read them as their refcount is 0. If later PFN 256 is allocated then it will get updated head and order at the same time its refcount becomes 1. There is no corruption; they don't corrupt each other's data. > > If the allocator is using the struct page memory then step #5 should > also clean up the struct page with the allocator data before returning > it to the allocator. > > Do you mean ->folio_free() callback should undo prep_compound_page() > instead? I wouldn't say undo; I was very careful to say it needs to get the struct page memory into a state that the allocator algorithm expects, whatever that means. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
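The point about successor inits can be shown with a small model (toy names, a plain array standing in for the real struct page machinery): re-initializing an order-8 folio at PFN 0 rewrites metadata only for pages 0..255, while pages 256..511 keep stale values, which is harmless while their refcount is 0 and is corrected the moment they are initialized for a new allocation.

```c
/* Toy model of re-initialization after free. Field names are
 * illustrative, not the kernel's actual struct page layout. */
#include <assert.h>
#include <stddef.h>

#define NPAGES 512

struct toy_page {
    int order;            /* valid on the head page only */
    size_t compound_head; /* index of the head page this page belongs to */
    int refcount;
};

static struct toy_page pages[NPAGES];

/* stand-in for zone_device_page_init() + prep_compound_page():
 * rewrites heads and order only for the range being allocated */
static void toy_init(size_t head, int order)
{
    for (size_t i = head; i < head + (1u << order); i++)
        pages[i].compound_head = head;
    pages[head].order = order;
    pages[head].refcount = 1;
}
```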
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 16:50 ` Jason Gunthorpe @ 2026-01-12 17:46 ` Zi Yan 2026-01-12 18:25 ` Jason Gunthorpe 2026-01-12 23:07 ` Matthew Brost 1 sibling, 1 reply; 39+ messages in thread From: Zi Yan @ 2026-01-12 17:46 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: >>> folio_free() >>> >>> 1) Allocator finds free memory >>> 2) zone_device_page_init() allocates the memory and makes refcount=1 >>> 3) __folio_put() knows the recount 0. >>> 4) free_zone_device_folio() calls folio_free(), but it doesn't >>> actually need to undo prep_compound_page() because *NOTHING* can >>> use the page pointer at this point. >>> 5) Driver puts the memory back into the allocator and now #1 can >>> happen. It knows how much memory to put back because folio->order >>> is valid from #2 >>> 6) #1 happens again, then #2 happens again and the folio is in the >>> right state for use. The successor #2 fully undoes the work of the >>> predecessor #2. >> >> But how can a successor #2 undo the work if the second #1 only allocates >> half of the original folio? For example, an order-9 at PFN 0 is >> allocated and freed, then an order-8 at PFN 0 is allocated and another >> order-8 at PFN 256 is allocated. 
How can two #2s undo the same order-9 >> without corrupting each other’s data? > > What do you mean? The fundamental rule is you can't read the folio or > the order outside folio_free once its refcount reaches 0. There is no such rule. In core MM, folio_split(), which splits a high order folio to low order ones, freezes the folio (turning refcount to 0) and manipulates the folio order and all tail pages' compound_head to restructure the folio. Your fundamental rule breaks this. Allowing compound information to stay after a folio is freed means you cannot tell whether a folio is under split or freed. > > So the successor #2 will write updated heads and order to the order-8 > pages at PFN 0 and the ones starting at PFN 256 will remain with > garbage. > > This is OK because nothing is allowed to read them as their refcount > is 0. > > If later PFN 256 is allocated then it will get updated head and order > at the same time its refcount becomes 1. > > There is no corruption; they don't corrupt each other's data. > >>> If the allocator is using the struct page memory then step #5 should >>> also clean up the struct page with the allocator data before returning >>> it to the allocator. >> >> Do you mean ->folio_free() callback should undo prep_compound_page() >> instead? > > I wouldn't say undo; I was very careful to say it needs to get the > struct page memory into a state that the allocator algorithm expects, > whatever that means. > > Jason Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 17:46 ` Zi Yan @ 2026-01-12 18:25 ` Jason Gunthorpe 2026-01-12 18:55 ` Zi Yan 0 siblings, 1 reply; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 18:25 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 12:46:57PM -0500, Zi Yan wrote: > On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote: > > > On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: > >>> folio_free() > >>> > >>> 1) Allocator finds free memory > >>> 2) zone_device_page_init() allocates the memory and makes refcount=1 > >>> 3) __folio_put() knows the recount 0. > >>> 4) free_zone_device_folio() calls folio_free(), but it doesn't > >>> actually need to undo prep_compound_page() because *NOTHING* can > >>> use the page pointer at this point. > >>> 5) Driver puts the memory back into the allocator and now #1 can > >>> happen. It knows how much memory to put back because folio->order > >>> is valid from #2 > >>> 6) #1 happens again, then #2 happens again and the folio is in the > >>> right state for use. The successor #2 fully undoes the work of the > >>> predecessor #2. > >> > >> But how can a successor #2 undo the work if the second #1 only allocates > >> half of the original folio? 
For example, an order-9 at PFN 0 is > >> allocated and freed, then an order-8 at PFN 0 is allocated and another > >> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9 > >> without corrupting each other’s data? > > > > What do you mean? The fundamental rule is you can't read the folio or > > the order outside folio_free once its refcount reaches 0. > > There is no such rule. In core MM, folio_split(), which splits a high > order folio to low order ones, freezes the folio (turning refcount to 0) > and manipulates the folio order and all tail pages' compound_head to > restructure the folio. That's different; I am talking about reaching 0 because it has been freed, meaning there are no external pointers to it. Further, when a page is frozen, page_ref_freeze() takes in the number of references the caller has ownership over and it doesn't succeed if there are stray references elsewhere. This is very important because the entire operating model of split only works if it has exclusive locks over all the valid pointers into that page. Spurious refcount failures concurrent with split cannot be allowed. I don't see how pointing at __folio_freeze_and_split_unmapped() can justify this series. > Your fundamental rule breaks this. Allowing compound information > to stay after a folio is freed means you cannot tell whether a folio > is under split or freed. You can't refcount a folio out of nothing. It has to come from a memory location that is already holding a refcount, and then you can incr it. For example, lockless GUP fast will read the PTE, adjust to the head page, attempt to incr it, then recheck the PTE. If there are races then sure maybe the PTE will point to a stray tail page that refers to an already allocated head page, but the re-check of the PTE will exclude this. The refcount system already has to tolerate spurious refcount incrs because of GUP fast.
Nothing should be looking at order and refcount to try to guess if concurrent split is happening!! Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
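The GUP-fast pattern referenced above can be sketched as a single-threaded userspace model (plain variables stand in for the PTE and the atomic refcount; the real code uses READ_ONCE and atomic inc-unless-zero, so this is only an illustration of the read, try-get, re-check sequence):

```c
/* Toy model of the lockless GUP-fast sequence: read the PTE, try to
 * take a reference, then re-check the PTE to exclude races. All names
 * are hypothetical stand-ins for the real kernel primitives. */
#include <assert.h>

static unsigned long pte;       /* stand-in for the page table entry */
static int page_refcount[16];   /* refcount per toy page */

/* try_get-style helper: never resurrects a freed (refcount 0) page */
static int toy_try_get(unsigned long pfn)
{
    if (page_refcount[pfn] == 0)
        return 0;
    page_refcount[pfn]++;
    return 1;
}

static long toy_gup_fast(void)
{
    unsigned long seen = pte;     /* 1. read the PTE */
    if (!toy_try_get(seen))       /* 2. attempt to take a reference */
        return -1;
    if (pte != seen) {            /* 3. re-check excludes races */
        page_refcount[seen]--;    /* drop the spurious reference */
        return -1;
    }
    return (long)seen;
}
```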
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 18:25 ` Jason Gunthorpe @ 2026-01-12 18:55 ` Zi Yan 2026-01-12 19:28 ` Jason Gunthorpe 0 siblings, 1 reply; 39+ messages in thread From: Zi Yan @ 2026-01-12 18:55 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 13:25, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 12:46:57PM -0500, Zi Yan wrote: >> On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote: >> >>> On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: >>>>> folio_free() >>>>> >>>>> 1) Allocator finds free memory >>>>> 2) zone_device_page_init() allocates the memory and makes refcount=1 >>>>> 3) __folio_put() knows the recount 0. >>>>> 4) free_zone_device_folio() calls folio_free(), but it doesn't >>>>> actually need to undo prep_compound_page() because *NOTHING* can >>>>> use the page pointer at this point. >>>>> 5) Driver puts the memory back into the allocator and now #1 can >>>>> happen. It knows how much memory to put back because folio->order >>>>> is valid from #2 >>>>> 6) #1 happens again, then #2 happens again and the folio is in the >>>>> right state for use. The successor #2 fully undoes the work of the >>>>> predecessor #2. >>>> >>>> But how can a successor #2 undo the work if the second #1 only allocates >>>> half of the original folio? 
For example, an order-9 at PFN 0 is >>>> allocated and freed, then an order-8 at PFN 0 is allocated and another >>>> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9 >>>> without corrupting each other’s data? >>> >>> What do you mean? The fundamental rule is you can't read the folio or >>> the order outside folio_free once its refcount reaches 0. >> >> There is no such rule. In core MM, folio_split(), which splits a high >> order folio to low order ones, freezes the folio (turning refcount to 0) >> and manipulates the folio order and all tail pages' compound_head to >> restructure the folio. > > That's different, I am talking about reaching 0 because it has been > freed, meaning there are no external pointers to it. > > Further, when a page is frozen page_ref_freeze() takes in the number > of references the caller has ownership over and it doesn't succeed if > there are stray references elsewhere. > > This is very important because the entire operating model of split > only works if it has exclusive locks over all the valid pointers into > that page. > > Spurious refcount failures concurrent with split cannot be allowed. > > I don't see how pointing at __folio_freeze_and_split_unmapped() can > justify this series. > But anyone looking at the folio state, with refcount == 0 and compound_head set, cannot tell the difference. If what you said is true, why is free_pages_prepare() needed? No one should touch these free pages. Why bother resetting these states? >> Your fundamental rule breaks this. Allowing compound information >> to stay after a folio is freed means you cannot tell whether a folio >> is under split or freed. > > You can't refcount a folio out of nothing. It has to come from a > memory location that is already holding a refcount, and then you can > incr it. Right. There is also no guarantee that all code is correct and follows this.
My point here is that calling prep_compound_page() on a compound page does not follow core MM’s conventions. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 18:55 ` Zi Yan @ 2026-01-12 19:28 ` Jason Gunthorpe 2026-01-12 23:34 ` Zi Yan 0 siblings, 1 reply; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 19:28 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 01:55:18PM -0500, Zi Yan wrote: > > That's different, I am talking about reaching 0 because it has been > > freed, meaning there are no external pointers to it. > > > > Further, when a page is frozen page_ref_freeze() takes in the number > > of references the caller has ownership over and it doesn't succeed if > > there are stray references elsewhere. > > > > This is very important because the entire operating model of split > > only works if it has exclusive locks over all the valid pointers into > > that page. > > > > Spurious refcount failures concurrent with split cannot be allowed. > > > > I don't see how pointing at __folio_freeze_and_split_unmapped() can > > justify this series. > > > > But from anyone looking at the folio state, refcount == 0, compound_head > is set, they cannot tell the difference. This isn't reliable, nothing correct can be doing it :\ > If what you said is true, why is free_pages_prepare() needed? No one > should touch these free pages. Why bother resetting these states. ? 
that function does a lot of stuff, things like uncharging the cgroup should obviously happen at free time. > > What part of it are you looking at? > >>> You can't refcount a folio out of nothing. It has to come from a >>> memory location that already is holding a refcount, and then you can >>> incr it. >> >> Right. There is also no guarantee that all code is correct and follows >> this. > > Let's concretely point at things that have a problem, please. > >> My point here is that calling prep_compound_page() on a compound page >> does not follow core MM’s conventions. > > Maybe, but that doesn't mean it isn't the right solution.. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 19:28 ` Jason Gunthorpe @ 2026-01-12 23:34 ` Zi Yan 2026-01-12 23:53 ` Jason Gunthorpe 0 siblings, 1 reply; 39+ messages in thread From: Zi Yan @ 2026-01-12 23:34 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 14:28, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 01:55:18PM -0500, Zi Yan wrote: >>> That's different, I am talking about reaching 0 because it has been >>> freed, meaning there are no external pointers to it. >>> >>> Further, when a page is frozen page_ref_freeze() takes in the number >>> of references the caller has ownership over and it doesn't succeed if >>> there are stray references elsewhere. >>> >>> This is very important because the entire operating model of split >>> only works if it has exclusive locks over all the valid pointers into >>> that page. >>> >>> Spurious refcount failures concurrent with split cannot be allowed. >>> >>> I don't see how pointing at __folio_freeze_and_split_unmapped() can >>> justify this series. >>> >> >> But from anyone looking at the folio state, refcount == 0, compound_head >> is set, they cannot tell the difference. > > This isn't reliable, nothing correct can be doing it :\ > >> If what you said is true, why is free_pages_prepare() needed? No one >> should touch these free pages. 
Why bother resetting these states. > > ? that function does a lot of stuff, things like uncharging the cgroup > should obviously happen at free time. > > What part of it are you looking at? page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order. free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and compound_head at the end. page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound pages. These three parts undo prep_compound_page(). > >>> You can't refcount a folio out of nothing. It has to come from a >>> memory location that already is holding a refcount, and then you can >>> incr it. >> >> Right. There is also no guarantee that all code is correct and follows >> this. > > Let's concretely point at things that have a problem please. > >> My point here is that calling prep_compound_page() on a compound page >> does not follow core MM’s conventions. > > Maybe, but that doesn't mean it isn't the right solution.. In current nouveau code, ->free_folios is used to hold the freed folios. In nouveau_dmem_page_alloc_locked(), the freed folio is passed to zone_device_folio_init(). If the allocated folio order is different from the freed folio order, I do not know how you are going to keep track of the rest of the freed folio. Of course you can implement a buddy allocator there. If this still does not convince you that overwriting an existing compound page with a different order configuration is a bad idea, feel free to do whatever you think is right. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
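The three resets listed above can be sketched as a pair of toy helpers (illustrative field names, not the kernel's actual page flag layout): one mirrors what prep_compound_page() sets up, the other the undo that free_pages_prepare() open-codes for regular pages.

```c
/* Toy model of prep_compound_page() and its open-coded undo. The field
 * names and flag bits are hypothetical, chosen only to mirror the three
 * resets discussed above: head flag, order, and tail metadata. */
#include <assert.h>
#include <stddef.h>

#define TOY_PG_HEAD (1u << 0)

struct toy_page {
    unsigned int flags;
    int order;             /* stands in for the PAGE_FLAGS_SECOND data */
    size_t compound_head;  /* 0 = not a tail page, else 1 (points at p[0]) */
    const void *mapping;   /* TAIL_MAPPING-style poison on tail pages */
};

static const char toy_tail_mapping; /* stand-in for the TAIL_MAPPING poison */

static void toy_prep_compound(struct toy_page *p, size_t n, int order)
{
    p[0].flags |= TOY_PG_HEAD;
    p[0].order = order;
    for (size_t i = 1; i < n; i++) {
        p[i].compound_head = 1;
        p[i].mapping = &toy_tail_mapping;
    }
}

/* the undo free_pages_prepare() performs before pages go back to free */
static void toy_undo_prep_compound(struct toy_page *p, size_t n)
{
    p[0].flags &= ~TOY_PG_HEAD;   /* clear PG_head */
    p[0].order = 0;               /* clear folio->order */
    for (size_t i = 1; i < n; i++) {
        p[i].compound_head = 0;   /* clear tail linkage */
        p[i].mapping = NULL;      /* clear TAIL_MAPPING poison */
    }
}
```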
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:34 ` Zi Yan @ 2026-01-12 23:53 ` Jason Gunthorpe 2026-01-13 0:35 ` Zi Yan 0 siblings, 1 reply; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 23:53 UTC (permalink / raw) To: Zi Yan Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 06:34:06PM -0500, Zi Yan wrote: > page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order. > > free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and > compound_head at the end. > > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound > pages. > > These three parts undo prep_compound_page(). Well, mm doesn't clear all things on alloc.. > In current nouveau code, ->free_folios is used holding the freed folio. > In nouveau_dmem_page_alloc_locked(), the freed folio is passed to > zone_device_folio_init(). If the allocated folio order is different > from the freed folio order, I do not know how you are going to keep > track of the rest of the freed folio. Of course you can implement a > buddy allocator there. nouveau doesn't support high order folios. A simple linked list is not really a suitable data structure to ever support high order folios with.. 
If it were to use such a thing, and did want to take a high order folio off the list, and reduce its order, then it would have to put the remainder back on the list with a revised order value. That's all, nothing hard. Again if the driver needs to store information in the struct page to manage its free list mechanism (ie linked pointers, order, whatever) then it should be doing that directly. When it takes the memory range off the free list it should call zone_device_page_init() to make it ready to be used again. I think it is a poor argument to say that zone_device_page_init() should rely on values already in the struct page to work properly :\ The usable space within the struct page, and what values must be fixed for correct system function, should exactly mirror what frozen pages require. After free it is effectively now a frozen page owned by the device driver. I haven't seen any documentation on that, but I suspect Matthew and David have some ideas.. If there is a reason for order, flags and mapping to be something particular then it should flow from the definition of frozen pages, and be documented, IMHO. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
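The order-aware free list described above can be sketched like this (a hypothetical structure; a real driver would keep this state in the struct page itself rather than in heap nodes): popping a smaller allocation out of a larger free range pushes the remainder back with revised order values, buddy-style.

```c
/* Toy order-aware free list: each free range carries its own order,
 * and taking a smaller allocation out of a larger range puts the
 * remainder back with revised orders. Names are illustrative. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct free_range {
    unsigned long pfn;
    int order;
    struct free_range *next;
};

static struct free_range *free_list;

static void free_list_push(unsigned long pfn, int order)
{
    struct free_range *r = malloc(sizeof(*r));
    r->pfn = pfn;
    r->order = order;
    r->next = free_list;
    free_list = r;
}

/* Pop a range of exactly `order` pages (2^order); split a bigger one
 * if needed, returning the remainder to the list as smaller buddies. */
static long free_list_pop(int order)
{
    for (struct free_range **p = &free_list; *p; p = &(*p)->next) {
        if ((*p)->order >= order) {
            struct free_range *r = *p;
            unsigned long pfn = r->pfn;
            *p = r->next;
            /* push back the upper halves, halving each time */
            for (int o = r->order; o > order; o--)
                free_list_push(pfn + (1ul << (o - 1)), o - 1);
            free(r);
            return (long)pfn;
        }
    }
    return -1; /* nothing big enough */
}
```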
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:53 ` Jason Gunthorpe @ 2026-01-13 0:35 ` Zi Yan 0 siblings, 0 replies; 39+ messages in thread From: Zi Yan @ 2026-01-13 0:35 UTC (permalink / raw) To: Jason Gunthorpe Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 18:53, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 06:34:06PM -0500, Zi Yan wrote: >> page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order. >> >> free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and >> compound_head at the end. >> >> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound >> pages. >> >> These three parts undo prep_compound_page(). > > Well, mm doesn't clear all things on alloc.. > >> In current nouveau code, ->free_folios is used holding the freed folio. >> In nouveau_dmem_page_alloc_locked(), the freed folio is passed to >> zone_device_folio_init(). If the allocated folio order is different >> from the freed folio order, I do not know how you are going to keep >> track of the rest of the freed folio. Of course you can implement a >> buddy allocator there. > > nouveau doesn't support high order folios. > > A simple linked list is not really a suitable data structure to ever > support high order folios with.. 
If it were to use such a thing, and > did want to take a high order folio off the list, and reduce its > order, then it would have to put the remainder back on the list with a > revised order value. That's all, nothing hard. > > Again if the driver needs to store information in the struct page to > manage its free list mechanism (ie linked pointers, order, whatever) > then it should be doing that directly. > > When it takes the memory range off the free list it should call > zone_device_page_init() to make it ready to be used again. I think it > is a poor argument to say that zone_device_page_init() should rely on > values already in the struct page to work properly :\ > > The usable space within the struct page, and what values must be fixed > for correct system function, should exactly mirror what frozen pages > require. After free it is effectively now a frozen page owned by the > device driver. > > I haven't seen any documentation on that, but I suspect Matthew and > David have some ideas.. > > If there is a reason for order, flags and mapping to be something > particular then it should flow from the definition of frozen pages, > and be documented, IMHO. Thank you for the explanation. It seems that I do not have enough knowledge to comment on device private pages. I will refrain from doing so from now on. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 16:50 ` Jason Gunthorpe 2026-01-12 17:46 ` Zi Yan @ 2026-01-12 23:07 ` Matthew Brost 1 sibling, 0 replies; 39+ messages in thread From: Matthew Brost @ 2026-01-12 23:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Zi Yan, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 12:50:01PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote: > > > folio_free() > > > > > > 1) Allocator finds free memory > > > 2) zone_device_page_init() allocates the memory and makes refcount=1 > > > 3) __folio_put() knows the recount 0. > > > 4) free_zone_device_folio() calls folio_free(), but it doesn't > > > actually need to undo prep_compound_page() because *NOTHING* can > > > use the page pointer at this point. > > > 5) Driver puts the memory back into the allocator and now #1 can > > > happen. It knows how much memory to put back because folio->order > > > is valid from #2 > > > 6) #1 happens again, then #2 happens again and the folio is in the > > > right state for use. The successor #2 fully undoes the work of the > > > predecessor #2. > > > > But how can a successor #2 undo the work if the second #1 only allocates > > half of the original folio? 
For example, an order-9 at PFN 0 is > > allocated and freed, then an order-8 at PFN 0 is allocated and another > > order-8 at PFN 256 is allocated. How can two #2s undo the same order-9 > > without corrupting each other’s data? > > What do you mean? The fundamental rule is you can't read the folio or > the order outside folio_free once its refcount reaches 0. > > So the successor #2 will write updated heads and order to the order 8 > pages at PFN 0 and the ones starting at PFN 256 will remain with > garbage. > > This is OK because nothing is allowed to read them as their refcount > is 0. > > If later PFN 256 is allocated then it will get updated head and order > at the same time its refcount becomes 1. > > There is no corruption and they don't corrupt each other's data. > > > > If the allocator is using the struct page memory then step #5 should > > > also clean up the struct page with the allocator data before returning > > > it to the allocator. > > > > Do you mean ->folio_free() callback should undo prep_compound_page() > > instead? > > I wouldn't say undo, I was very careful to say it needs to get the > struct page memory into a state that the allocator algorithm expects, > whatever that means. > Hi Jason, A lot of back and forth with Zi — if I’m understanding correctly, your suggestion is to just call free_zone_device_folio_prepare() [1] in ->folio_free() if required by the driver. This is the function that puts struct page into a state my allocator expects. That works just fine for me. Matt [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 > Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 13:45 ` Jason Gunthorpe 2026-01-12 16:31 ` Zi Yan @ 2026-01-12 21:49 ` Matthew Brost 2026-01-12 23:15 ` Zi Yan 1 sibling, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-12 21:49 UTC (permalink / raw) To: Jason Gunthorpe Cc: Zi Yan, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote: Hi, catching up here. > On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: > > On 11 Jan 2026, at 19:19, Balbir Singh wrote: > > > > > On 1/12/26 08:35, Matthew Wilcox wrote: > > >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: > > >>> The core MM splits the folio before calling folio_free, restoring the > > >>> zone pages associated with the folio to an initialized state (e.g., > > >>> non-compound, pgmap valid, etc...). The order argument represents the > > >>> folio’s order prior to the split which can be used driver side to know > > >>> how many pages are being freed. > > >> > > >> This really feels like the wrong way to fix this problem. > > >> > > > > Hi Matthew, > > > > I think the wording is confusing, since the actual issue is that: > > > > 1. zone_device_page_init() calls prep_compound_page() to form a large folio, > > 2. but free_zone_device_folio() never reverse the course, > > 3. 
the undo of prep_compound_page() in free_zone_device_folio() needs to > > be done before driver callback ->folio_free(), since once ->folio_free() > > is called, the folio can be reallocated immediately, > > 4. after the undo of prep_compound_page(), folio_order() can no longer provide > > the original order information, thus, folio_free() needs that for proper > > device side ref manipulation. > > There is something wrong with the driver if the "folio can be > reallocated immediately". > > The flow generally expects there to be a driver allocator linked to > folio_free() > > 1) Allocator finds free memory > 2) zone_device_page_init() allocates the memory and makes refcount=1 > 3) __folio_put() knows the recount 0. > 4) free_zone_device_folio() calls folio_free(), but it doesn't > actually need to undo prep_compound_page() because *NOTHING* can > use the page pointer at this point. Correct—nothing can use the folio prior to calling folio_free(). Once folio_free() returns, the driver side is free to immediately reallocate the folio (or a subset of its pages). > 5) Driver puts the memory back into the allocator and now #1 can > happen. It knows how much memory to put back because folio->order > is valid from #2 > 6) #1 happens again, then #2 happens again and the folio is in the > right state for use. The successor #2 fully undoes the work of the > predecessor #2. > > If you have races where #1 can happen immediately after #3 then the > driver design is fundamentally broken and passing around order isn't > going to help anything. > The above race does not exist; if it did, I agree we’d be solving nothing here. > If the allocator is using the struct page memory then step #5 should > also clean up the struct page with the allocator data before returning > it to the allocator. > We could move the call to free_zone_device_folio_prepare() [1] into the driver-side implementation of ->folio_free() and drop the order argument here. 
Zi didn’t particularly like that; he preferred calling free_zone_device_folio_prepare() [2] before invoking ->folio_free(), which is why this patch exists. FWIW, I do not have a strong opinion here—either way works. Xe doesn’t actually need the order regardless of where free_zone_device_folio_prepare() is called, but Nouveau does need the order if free_zone_device_folio_prepare() is called before ->folio_free(). [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405 > I vaguely remember talking about this before in the context of the Xe > driver.. You can't just take an existing VRAM allocator and layer it > on top of the folios and have it broadly ignore the folio_free > callback. > We are definitely not ignoring the ->folio_free callback—that is the point at which we tell our VRAM allocator (DRM buddy) it is okay to release the allocation and make it available for reuse. Matt > Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 21:49 ` Matthew Brost @ 2026-01-12 23:15 ` Zi Yan 2026-01-12 23:22 ` Matthew Brost 2026-01-12 23:31 ` Jason Gunthorpe 0 siblings, 2 replies; 39+ messages in thread From: Zi Yan @ 2026-01-12 23:15 UTC (permalink / raw) To: Matthew Brost Cc: Jason Gunthorpe, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 12 Jan 2026, at 16:49, Matthew Brost wrote: > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote: > > Hi, catching up here. > >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote: >>> >>>> On 1/12/26 08:35, Matthew Wilcox wrote: >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: >>>>>> The core MM splits the folio before calling folio_free, restoring the >>>>>> zone pages associated with the folio to an initialized state (e.g., >>>>>> non-compound, pgmap valid, etc...). The order argument represents the >>>>>> folio’s order prior to the split which can be used driver side to know >>>>>> how many pages are being freed. >>>>> >>>>> This really feels like the wrong way to fix this problem. >>>>> >>> >>> Hi Matthew, >>> >>> I think the wording is confusing, since the actual issue is that: >>> >>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio, >>> 2. but free_zone_device_folio() never reverse the course, >>> 3. 
the undo of prep_compound_page() in free_zone_device_folio() needs to >>> be done before driver callback ->folio_free(), since once ->folio_free() >>> is called, the folio can be reallocated immediately, >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide >>> the original order information, thus, folio_free() needs that for proper >>> device side ref manipulation. >> >> There is something wrong with the driver if the "folio can be >> reallocated immediately". >> >> The flow generally expects there to be a driver allocator linked to >> folio_free() >> >> 1) Allocator finds free memory >> 2) zone_device_page_init() allocates the memory and makes refcount=1 >> 3) __folio_put() knows the recount 0. >> 4) free_zone_device_folio() calls folio_free(), but it doesn't >> actually need to undo prep_compound_page() because *NOTHING* can >> use the page pointer at this point. > > Correct—nothing can use the folio prior to calling folio_free(). Once > folio_free() returns, the driver side is free to immediately reallocate > the folio (or a subset of its pages). > >> 5) Driver puts the memory back into the allocator and now #1 can >> happen. It knows how much memory to put back because folio->order >> is valid from #2 >> 6) #1 happens again, then #2 happens again and the folio is in the >> right state for use. The successor #2 fully undoes the work of the >> predecessor #2. >> >> If you have races where #1 can happen immediately after #3 then the >> driver design is fundamentally broken and passing around order isn't >> going to help anything. >> > > The above race does not exist; if it did, I agree we’d be solving > nothing here. > >> If the allocator is using the struct page memory then step #5 should >> also clean up the struct page with the allocator data before returning >> it to the allocator. 
>> > > We could move the call to free_zone_device_folio_prepare() [1] into the > driver-side implementation of ->folio_free() and drop the order argument > here. Zi didn’t particularly like that; he preferred calling > free_zone_device_folio_prepare() [2] before invoking ->folio_free(), > which is why this patch exists. On a second thought, if calling free_zone_device_folio_prepare() in ->folio_free() works, feel free to do so. > > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t > actually need the order regardless of where > free_zone_device_folio_prepare() is called, but Nouveau does need the > order if free_zone_device_folio_prepare() is called before > ->folio_free(). > > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405 > >> I vaugely remember talking about this before in the context of the Xe >> driver.. You can't just take an existing VRAM allocator and layer it >> on top of the folios and have it broadly ignore the folio_free >> callback. >> > > We are definitely not ignoring the ->folio_free callback—that is the > point at which we tell our VRAM allocator (DRM buddy) it is okay to > release the allocation and make it available for reuse. > > Matt > >> Jsaon Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:15 ` Zi Yan @ 2026-01-12 23:22 ` Matthew Brost 2026-01-12 23:44 ` Alistair Popple 2026-01-12 23:31 ` Jason Gunthorpe 1 sibling, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-12 23:22 UTC (permalink / raw) To: Zi Yan Cc: Jason Gunthorpe, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote: > On 12 Jan 2026, at 16:49, Matthew Brost wrote: > > > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote: > > > > Hi, catching up here. > > > >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: > >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote: > >>> > >>>> On 1/12/26 08:35, Matthew Wilcox wrote: > >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: > >>>>>> The core MM splits the folio before calling folio_free, restoring the > >>>>>> zone pages associated with the folio to an initialized state (e.g., > >>>>>> non-compound, pgmap valid, etc...). The order argument represents the > >>>>>> folio’s order prior to the split which can be used driver side to know > >>>>>> how many pages are being freed. > >>>>> > >>>>> This really feels like the wrong way to fix this problem. > >>>>> > >>> > >>> Hi Matthew, > >>> > >>> I think the wording is confusing, since the actual issue is that: > >>> > >>> 1. 
zone_device_page_init() calls prep_compound_page() to form a large folio, > >>> 2. but free_zone_device_folio() never reverse the course, > >>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to > >>> be done before driver callback ->folio_free(), since once ->folio_free() > >>> is called, the folio can be reallocated immediately, > >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide > >>> the original order information, thus, folio_free() needs that for proper > >>> device side ref manipulation. > >> > >> There is something wrong with the driver if the "folio can be > >> reallocated immediately". > >> > >> The flow generally expects there to be a driver allocator linked to > >> folio_free() > >> > >> 1) Allocator finds free memory > >> 2) zone_device_page_init() allocates the memory and makes refcount=1 > >> 3) __folio_put() knows the recount 0. > >> 4) free_zone_device_folio() calls folio_free(), but it doesn't > >> actually need to undo prep_compound_page() because *NOTHING* can > >> use the page pointer at this point. > > > > Correct—nothing can use the folio prior to calling folio_free(). Once > > folio_free() returns, the driver side is free to immediately reallocate > > the folio (or a subset of its pages). > > > >> 5) Driver puts the memory back into the allocator and now #1 can > >> happen. It knows how much memory to put back because folio->order > >> is valid from #2 > >> 6) #1 happens again, then #2 happens again and the folio is in the > >> right state for use. The successor #2 fully undoes the work of the > >> predecessor #2. > >> > >> If you have races where #1 can happen immediately after #3 then the > >> driver design is fundamentally broken and passing around order isn't > >> going to help anything. > >> > > > > The above race does not exist; if it did, I agree we’d be solving > > nothing here. 
> > > >> If the allocator is using the struct page memory then step #5 should > >> also clean up the struct page with the allocator data before returning > >> it to the allocator. > >> > > > > We could move the call to free_zone_device_folio_prepare() [1] into the > > driver-side implementation of ->folio_free() and drop the order argument > > here. Zi didn’t particularly like that; he preferred calling > > free_zone_device_folio_prepare() [2] before invoking ->folio_free(), > > which is why this patch exists. > > On a second thought, if calling free_zone_device_folio_prepare() in > ->folio_free() works, feel free to do so. > +1, testing this change right now and it does indeed work. Matt > > > > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t > > actually need the order regardless of where > > free_zone_device_folio_prepare() is called, but Nouveau does need the > > order if free_zone_device_folio_prepare() is called before > > ->folio_free(). > > > > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 > > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405 > > > >> I vaugely remember talking about this before in the context of the Xe > >> driver.. You can't just take an existing VRAM allocator and layer it > >> on top of the folios and have it broadly ignore the folio_free > >> callback. > >> > > > > We are definitely not ignoring the ->folio_free callback—that is the > > point at which we tell our VRAM allocator (DRM buddy) it is okay to > > release the allocation and make it available for reuse. > > > > Matt > > > >> Jsaon > > > Best Regards, > Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:22 ` Matthew Brost @ 2026-01-12 23:44 ` Alistair Popple 2026-01-12 23:54 ` Jason Gunthorpe 0 siblings, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-12 23:44 UTC (permalink / raw) To: Matthew Brost Cc: Zi Yan, Jason Gunthorpe, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On 2026-01-13 at 10:22 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote: > > On 12 Jan 2026, at 16:49, Matthew Brost wrote: > > > > > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote: > > > > > > Hi, catching up here. > > > > > >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote: > > >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote: > > >>> > > >>>> On 1/12/26 08:35, Matthew Wilcox wrote: > > >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote: > > >>>>>> The core MM splits the folio before calling folio_free, restoring the > > >>>>>> zone pages associated with the folio to an initialized state (e.g., > > >>>>>> non-compound, pgmap valid, etc...). The order argument represents the > > >>>>>> folio’s order prior to the split which can be used driver side to know > > >>>>>> how many pages are being freed. > > >>>>> > > >>>>> This really feels like the wrong way to fix this problem. 
> > >>>>> > > >>> > > >>> Hi Matthew, > > >>> > > >>> I think the wording is confusing, since the actual issue is that: > > >>> > > >>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio, > > >>> 2. but free_zone_device_folio() never reverse the course, > > >>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to > > >>> be done before driver callback ->folio_free(), since once ->folio_free() > > >>> is called, the folio can be reallocated immediately, > > >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide > > >>> the original order information, thus, folio_free() needs that for proper > > >>> device side ref manipulation. > > >> > > >> There is something wrong with the driver if the "folio can be > > >> reallocated immediately". > > >> > > >> The flow generally expects there to be a driver allocator linked to > > >> folio_free() > > >> > > >> 1) Allocator finds free memory > > >> 2) zone_device_page_init() allocates the memory and makes refcount=1 > > >> 3) __folio_put() knows the recount 0. > > >> 4) free_zone_device_folio() calls folio_free(), but it doesn't > > >> actually need to undo prep_compound_page() because *NOTHING* can > > >> use the page pointer at this point. > > > > > > Correct—nothing can use the folio prior to calling folio_free(). Once > > > folio_free() returns, the driver side is free to immediately reallocate > > > the folio (or a subset of its pages). > > > > > >> 5) Driver puts the memory back into the allocator and now #1 can > > >> happen. It knows how much memory to put back because folio->order > > >> is valid from #2 > > >> 6) #1 happens again, then #2 happens again and the folio is in the > > >> right state for use. The successor #2 fully undoes the work of the > > >> predecessor #2. 
> > >> > > >> If you have races where #1 can happen immediately after #3 then the > > >> driver design is fundamentally broken and passing around order isn't > > >> going to help anything. > > >> > > > > > > The above race does not exist; if it did, I agree we’d be solving > > > nothing here. > > > > > >> If the allocator is using the struct page memory then step #5 should > > >> also clean up the struct page with the allocator data before returning > > >> it to the allocator. > > >> > > > > > > We could move the call to free_zone_device_folio_prepare() [1] into the > > > driver-side implementation of ->folio_free() and drop the order argument > > > here. Zi didn’t particularly like that; he preferred calling > > > free_zone_device_folio_prepare() [2] before invoking ->folio_free(), > > > which is why this patch exists. > > > > On a second thought, if calling free_zone_device_folio_prepare() in > > ->folio_free() works, feel free to do so. I think making drivers do this is the correct approach and is consistent with what P2PDMA and DAX does. All the interfaces for mapping a ZONE_DEVICE folio currently rely on the driver correctly initialising the folio, so this special case for ZONE_DEVICE_PRIVATE/COHERENT seemed weird to me - they shouldn't rely on the core-mm to do some of the re-initialisation in the free paths. Also drivers may have different strategies than just resetting everything back to small pages. For example the may choose to only ever allocate large folios making the whole clearing/resetting of folio fields superfluous. - Alistair > +1, testing this change right now and it does indeed work. > > Matt > > > > > > > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t > > > actually need the order regardless of where > > > free_zone_device_folio_prepare() is called, but Nouveau does need the > > > order if free_zone_device_folio_prepare() is called before > > > ->folio_free(). 
> > > > > > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4 > > > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405 > > > > > >> I vaugely remember talking about this before in the context of the Xe > > >> driver.. You can't just take an existing VRAM allocator and layer it > > >> on top of the folios and have it broadly ignore the folio_free > > >> callback. > > >> > > > > > > We are definitely not ignoring the ->folio_free callback—that is the > > > point at which we tell our VRAM allocator (DRM buddy) it is okay to > > > release the allocation and make it available for reuse. > > > > > > Matt > > > > > >> Jsaon > > > > > > Best Regards, > > Yan, Zi ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:44 ` Alistair Popple @ 2026-01-12 23:54 ` Jason Gunthorpe 0 siblings, 0 replies; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 23:54 UTC (permalink / raw) To: Alistair Popple Cc: Matthew Brost, Zi Yan, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Tue, Jan 13, 2026 at 10:44:27AM +1100, Alistair Popple wrote: > Also drivers may have different strategies than just resetting everything back > to small pages. For example the may choose to only ever allocate large folios > making the whole clearing/resetting of folio fields superfluous. +1 Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback 2026-01-12 23:15 ` Zi Yan 2026-01-12 23:22 ` Matthew Brost @ 2026-01-12 23:31 ` Jason Gunthorpe 1 sibling, 0 replies; 39+ messages in thread From: Jason Gunthorpe @ 2026-01-12 23:31 UTC (permalink / raw) To: Zi Yan Cc: Matthew Brost, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP), Felix Kuehling, Alex Deucher, Christian König, David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas, Logan Gunthorpe, David Hildenbrand, Oscar Salvador, Andrew Morton, Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm, linux-cxl On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote: > > We could move the call to free_zone_device_folio_prepare() [1] into the > > driver-side implementation of ->folio_free() and drop the order argument > > here. Zi didn’t particularly like that; he preferred calling > > free_zone_device_folio_prepare() [2] before invoking ->folio_free(), > > which is why this patch exists. > > On a second thought, if calling free_zone_device_folio_prepare() in > ->folio_free() works, feel free to do so. 
I don't think there is anything "prepare" about free_zone_device_folio_prepare(): it effectively zeros the struct page memory - ie it undoes some amount of zone_device_page_init() - and AFAIK there are only two reasons to do this: 1) It helps catch bugs where things are UAF'ing the folio, now they read back zeros (it also creates bugs where zero might be OK, so you might be better to poison it under a debug flag) 2) It avoids the allocate side having to zero the page memory - and perhaps the allocate side is not doing a good job of this right now but I think you should state a position why it makes more sense for the free side to do this instead of the allocate side. IOW why should it be mandatory to call free_zone_device_folio_prepare() prior to zone_device_page_init()? Certainly if the only reason you are passing the order is because the core code zero'd the order too early, that doesn't make a lot of sense. I think calling the deinit function paired with zone_device_page_init() within the driver does make a lot of sense and I see no issue with that. But please name it more sensibly and describe concretely why it should be split up like this. Because what I see is you write to all the folios on free and then write to them all again on allocation - which is 2x the cost of what is probably really needed... Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast 2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast @ 2026-01-11 20:55 ` Francois Dugast 2026-01-12 0:44 ` Balbir Singh 1 sibling, 1 reply; 39+ messages in thread From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw) To: intel-xe Cc: dri-devel, Matthew Brost, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Balbir Singh, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel, Francois Dugast From: Matthew Brost <matthew.brost@intel.com> Add free_zone_device_folio_prepare(), a helper that restores large ZONE_DEVICE folios to a sane, initial state before freeing them. Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and compound metadata). Before returning such pages to the device pgmap allocator, each constituent page must be reset to a standalone ZONE_DEVICE folio with a valid pgmap and no compound state. Use this helper prior to folio_free() for device-private and device-coherent folios to ensure consistent device page state for subsequent allocations. Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Cc: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: linux-mm@kvack.org Cc: linux-cxl@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Francois Dugast <francois.dugast@intel.com> --- include/linux/memremap.h | 1 + mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 56 insertions(+) diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 97fcffeb1c1e..88e1d4707296 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) #ifdef CONFIG_ZONE_DEVICE void zone_device_page_init(struct page *page, unsigned int order); +void free_zone_device_folio_prepare(struct folio *folio); void *memremap_pages(struct dev_pagemap *pgmap, int nid); void memunmap_pages(struct dev_pagemap *pgmap); void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); diff --git a/mm/memremap.c b/mm/memremap.c index 39dc4bd190d0..375a61e18858 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) } EXPORT_SYMBOL_GPL(get_dev_pagemap); +/** + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. + * @folio: ZONE_DEVICE folio to prepare for release. + * + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages + * must be restored to a sane ZONE_DEVICE state before they are released. + * + * This helper: + * - Clears @folio->mapping and, for compound folios, clears each page's + * compound-head state (ClearPageHead()/clear_compound_head()). 
+ * - Resets the compound order metadata (folio_reset_order()) and then + * initializes each constituent page as a standalone ZONE_DEVICE folio: + * * clears ->mapping + * * restores ->pgmap (prep_compound_page() overwrites it) + * * clears ->share (only relevant for fsdax; unused for device-private) + * + * If @folio is order-0, only the mapping is cleared and no further work is + * required. + */ +void free_zone_device_folio_prepare(struct folio *folio) +{ + struct dev_pagemap *pgmap = page_pgmap(&folio->page); + int order, i; + + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); + + folio->mapping = NULL; + order = folio_order(folio); + if (!order) + return; + + folio_reset_order(folio); + + for (i = 0; i < (1UL << order); i++) { + struct page *page = folio_page(folio, i); + struct folio *new_folio = (struct folio *)page; + + ClearPageHead(page); + clear_compound_head(page); + + new_folio->mapping = NULL; + /* + * Reset pgmap which was over-written by + * prep_compound_page(). + */ + new_folio->pgmap = pgmap; + new_folio->share = 0; /* fsdax only, unused for device private */ + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); + } +} +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); + void free_zone_device_folio(struct folio *folio) { struct dev_pagemap *pgmap = folio->pgmap; @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) case MEMORY_DEVICE_COHERENT: if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) break; + free_zone_device_folio_prepare(folio); pgmap->ops->folio_free(folio, order); percpu_ref_put_many(&folio->pgmap->ref, nr); break; -- 2.43.0 ^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast @ 2026-01-12 0:44 ` Balbir Singh 2026-01-12 1:16 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Balbir Singh @ 2026-01-12 0:44 UTC (permalink / raw) To: Francois Dugast, intel-xe Cc: dri-devel, Matthew Brost, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On 1/12/26 06:55, Francois Dugast wrote: > From: Matthew Brost <matthew.brost@intel.com> > > Add free_zone_device_folio_prepare(), a helper that restores large > ZONE_DEVICE folios to a sane, initial state before freeing them. > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > compound metadata). Before returning such pages to the device pgmap > allocator, each constituent page must be reset to a standalone > ZONE_DEVICE folio with a valid pgmap and no compound state. > > Use this helper prior to folio_free() for device-private and > device-coherent folios to ensure consistent device page state for > subsequent allocations. > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > Cc: Zi Yan <ziy@nvidia.com> > Cc: David Hildenbrand <david@kernel.org> > Cc: Oscar Salvador <osalvador@suse.de> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Balbir Singh <balbirs@nvidia.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Mike Rapoport <rppt@kernel.org> > Cc: Suren Baghdasaryan <surenb@google.com> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Alistair Popple <apopple@nvidia.com> > Cc: linux-mm@kvack.org > Cc: linux-cxl@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Suggested-by: Alistair Popple <apopple@nvidia.com> > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > --- > include/linux/memremap.h | 1 + > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 56 insertions(+) > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 97fcffeb1c1e..88e1d4707296 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > #ifdef CONFIG_ZONE_DEVICE > void zone_device_page_init(struct page *page, unsigned int order); > +void free_zone_device_folio_prepare(struct folio *folio); > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > void memunmap_pages(struct dev_pagemap *pgmap); > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > diff --git a/mm/memremap.c b/mm/memremap.c > index 39dc4bd190d0..375a61e18858 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > } > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > +/** > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > + * @folio: ZONE_DEVICE folio to prepare for release. > + * > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > + * must be restored to a sane ZONE_DEVICE state before they are released. 
> + * > + * This helper: > + * - Clears @folio->mapping and, for compound folios, clears each page's > + * compound-head state (ClearPageHead()/clear_compound_head()). > + * - Resets the compound order metadata (folio_reset_order()) and then > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > + * * clears ->mapping > + * * restores ->pgmap (prep_compound_page() overwrites it) > + * * clears ->share (only relevant for fsdax; unused for device-private) > + * > + * If @folio is order-0, only the mapping is cleared and no further work is > + * required. > + */ > +void free_zone_device_folio_prepare(struct folio *folio) > +{ > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > + int order, i; > + > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > + > + folio->mapping = NULL; > + order = folio_order(folio); > + if (!order) > + return; > + > + folio_reset_order(folio); > + > + for (i = 0; i < (1UL << order); i++) { > + struct page *page = folio_page(folio, i); > + struct folio *new_folio = (struct folio *)page; > + > + ClearPageHead(page); > + clear_compound_head(page); > + > + new_folio->mapping = NULL; > + /* > + * Reset pgmap which was over-written by > + * prep_compound_page(). > + */ > + new_folio->pgmap = pgmap; > + new_folio->share = 0; /* fsdax only, unused for device private */ > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); Does calling the free_folio() callback on new_folio solve the issue you are facing, or is that PMD_ORDER more frees than we'd like? 
> + } > +} > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > + > void free_zone_device_folio(struct folio *folio) > { > struct dev_pagemap *pgmap = folio->pgmap; > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > case MEMORY_DEVICE_COHERENT: > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > break; > + free_zone_device_folio_prepare(folio); > pgmap->ops->folio_free(folio, order); > percpu_ref_put_many(&folio->pgmap->ref, nr); > break; Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 0:44 ` Balbir Singh @ 2026-01-12 1:16 ` Matthew Brost 2026-01-12 2:15 ` Balbir Singh 2026-01-12 23:58 ` Alistair Popple 0 siblings, 2 replies; 39+ messages in thread From: Matthew Brost @ 2026-01-12 1:16 UTC (permalink / raw) To: Balbir Singh Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > On 1/12/26 06:55, Francois Dugast wrote: > > From: Matthew Brost <matthew.brost@intel.com> > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > compound metadata). Before returning such pages to the device pgmap > > allocator, each constituent page must be reset to a standalone > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > Use this helper prior to folio_free() for device-private and > > device-coherent folios to ensure consistent device page state for > > subsequent allocations. > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > Cc: Zi Yan <ziy@nvidia.com> > > Cc: David Hildenbrand <david@kernel.org> > > Cc: Oscar Salvador <osalvador@suse.de> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Balbir Singh <balbirs@nvidia.com> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > > Cc: Vlastimil Babka <vbabka@suse.cz> > > Cc: Mike Rapoport <rppt@kernel.org> > > Cc: Suren Baghdasaryan <surenb@google.com> > > Cc: Michal Hocko <mhocko@suse.com> > > Cc: Alistair Popple <apopple@nvidia.com> > > Cc: linux-mm@kvack.org > > Cc: linux-cxl@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > --- > > include/linux/memremap.h | 1 + > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 56 insertions(+) > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > index 97fcffeb1c1e..88e1d4707296 100644 > > --- a/include/linux/memremap.h > > +++ b/include/linux/memremap.h > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > #ifdef CONFIG_ZONE_DEVICE > > void zone_device_page_init(struct page *page, unsigned int order); > > +void free_zone_device_folio_prepare(struct folio *folio); > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > void memunmap_pages(struct dev_pagemap *pgmap); > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > diff --git a/mm/memremap.c b/mm/memremap.c > > index 39dc4bd190d0..375a61e18858 100644 > > --- a/mm/memremap.c > > +++ b/mm/memremap.c > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > } > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > +/** > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > + * @folio: ZONE_DEVICE folio to prepare for release. > > + * > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > + * must be restored to a sane ZONE_DEVICE state before they are released. 
> > + * > > + * This helper: > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > + * - Resets the compound order metadata (folio_reset_order()) and then > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > + * * clears ->mapping > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > + * > > + * If @folio is order-0, only the mapping is cleared and no further work is > > + * required. > > + */ > > +void free_zone_device_folio_prepare(struct folio *folio) > > +{ > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > + int order, i; > > + > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > + > > + folio->mapping = NULL; > > + order = folio_order(folio); > > + if (!order) > > + return; > > + > > + folio_reset_order(folio); > > + > > + for (i = 0; i < (1UL << order); i++) { > > + struct page *page = folio_page(folio, i); > > + struct folio *new_folio = (struct folio *)page; > > + > > + ClearPageHead(page); > > + clear_compound_head(page); > > + > > + new_folio->mapping = NULL; > > + /* > > + * Reset pgmap which was over-written by > > + * prep_compound_page(). > > + */ > > + new_folio->pgmap = pgmap; > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > that PMD_ORDER more frees than we'd like? > No, calling free_folio() more often doesn’t solve anything—in fact, that would make my implementation explode. I explained this in detail here [1] to Zi. To recap [1], my memory allocator has no visibility into individual pages or folios; it is DRM Buddy layered on top of TTM BO. 
This design allows VRAM to be allocated or evicted for both traditional GPU allocations (GEMs) and SVM allocations. Now, to recap the actual issue: if device folios are not split upon free and are later reallocated with a different order in zone_device_page_init, the implementation breaks. This problem is not specific to Xe—Nouveau happens to always allocate at the same order, so it works by coincidence. Reallocating at a different order is valid behavior and must be supported. Matt [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > + } > > +} > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > + > > void free_zone_device_folio(struct folio *folio) > > { > > struct dev_pagemap *pgmap = folio->pgmap; > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > case MEMORY_DEVICE_COHERENT: > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > break; > > + free_zone_device_folio_prepare(folio); > > pgmap->ops->folio_free(folio, order); > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > break; > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 1:16 ` Matthew Brost @ 2026-01-12 2:15 ` Balbir Singh 2026-01-12 2:37 ` Matthew Brost 2026-01-12 23:58 ` Alistair Popple 1 sibling, 1 reply; 39+ messages in thread From: Balbir Singh @ 2026-01-12 2:15 UTC (permalink / raw) To: Matthew Brost Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On 1/12/26 11:16, Matthew Brost wrote: > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: >> On 1/12/26 06:55, Francois Dugast wrote: >>> From: Matthew Brost <matthew.brost@intel.com> >>> >>> Add free_zone_device_folio_prepare(), a helper that restores large >>> ZONE_DEVICE folios to a sane, initial state before freeing them. >>> >>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and >>> compound metadata). Before returning such pages to the device pgmap >>> allocator, each constituent page must be reset to a standalone >>> ZONE_DEVICE folio with a valid pgmap and no compound state. >>> >>> Use this helper prior to folio_free() for device-private and >>> device-coherent folios to ensure consistent device page state for >>> subsequent allocations. >>> >>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") >>> Cc: Zi Yan <ziy@nvidia.com> >>> Cc: David Hildenbrand <david@kernel.org> >>> Cc: Oscar Salvador <osalvador@suse.de> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: Balbir Singh <balbirs@nvidia.com> >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >>> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> >>> Cc: Vlastimil Babka <vbabka@suse.cz> >>> Cc: Mike Rapoport <rppt@kernel.org> >>> Cc: Suren Baghdasaryan <surenb@google.com> >>> Cc: Michal Hocko <mhocko@suse.com> >>> Cc: Alistair Popple <apopple@nvidia.com> >>> Cc: linux-mm@kvack.org >>> Cc: linux-cxl@vger.kernel.org >>> Cc: linux-kernel@vger.kernel.org >>> Suggested-by: Alistair Popple <apopple@nvidia.com> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> >>> --- >>> include/linux/memremap.h | 1 + >>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ >>> 2 files changed, 56 insertions(+) >>> >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>> index 97fcffeb1c1e..88e1d4707296 100644 >>> --- a/include/linux/memremap.h >>> +++ b/include/linux/memremap.h >>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) >>> >>> #ifdef CONFIG_ZONE_DEVICE >>> void zone_device_page_init(struct page *page, unsigned int order); >>> +void free_zone_device_folio_prepare(struct folio *folio); >>> void *memremap_pages(struct dev_pagemap *pgmap, int nid); >>> void memunmap_pages(struct dev_pagemap *pgmap); >>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); >>> diff --git a/mm/memremap.c b/mm/memremap.c >>> index 39dc4bd190d0..375a61e18858 100644 >>> --- a/mm/memremap.c >>> +++ b/mm/memremap.c >>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) >>> } >>> EXPORT_SYMBOL_GPL(get_dev_pagemap); >>> >>> +/** >>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. >>> + * @folio: ZONE_DEVICE folio to prepare for release. >>> + * >>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) >>> + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages >>> + * must be restored to a sane ZONE_DEVICE state before they are released. 
>>> + * >>> + * This helper: >>> + * - Clears @folio->mapping and, for compound folios, clears each page's >>> + * compound-head state (ClearPageHead()/clear_compound_head()). >>> + * - Resets the compound order metadata (folio_reset_order()) and then >>> + * initializes each constituent page as a standalone ZONE_DEVICE folio: >>> + * * clears ->mapping >>> + * * restores ->pgmap (prep_compound_page() overwrites it) >>> + * * clears ->share (only relevant for fsdax; unused for device-private) >>> + * >>> + * If @folio is order-0, only the mapping is cleared and no further work is >>> + * required. >>> + */ >>> +void free_zone_device_folio_prepare(struct folio *folio) >>> +{ >>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page); >>> + int order, i; >>> + >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); >>> + >>> + folio->mapping = NULL; >>> + order = folio_order(folio); >>> + if (!order) >>> + return; >>> + >>> + folio_reset_order(folio); >>> + >>> + for (i = 0; i < (1UL << order); i++) { >>> + struct page *page = folio_page(folio, i); >>> + struct folio *new_folio = (struct folio *)page; >>> + >>> + ClearPageHead(page); >>> + clear_compound_head(page); >>> + >>> + new_folio->mapping = NULL; >>> + /* >>> + * Reset pgmap which was over-written by >>> + * prep_compound_page(). >>> + */ >>> + new_folio->pgmap = pgmap; >>> + new_folio->share = 0; /* fsdax only, unused for device private */ >>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); >> >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is >> that PMD_ORDER more frees than we'd like? >> > > No, calling free_folio() more often doesn’t solve anything—in fact, that > would make my implementation explode. I explained this in detail here [1] > to Zi. > > To recap [1], my memory allocator has no visibility into individual > pages or folios; it is DRM Buddy layered on top of TTM BO. 
This design > allows VRAM to be allocated or evicted for both traditional GPU > allocations (GEMs) and SVM allocations. > I assume it is still backed by pages that are ref counted? I suspect you'd need to convert one reference count to PMD_ORDER reference counts to make this change work, or are the references not at page granularity? I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed like a per folio refcount > Now, to recap the actual issue: if device folios are not split upon free > and are later reallocated with a different order in > zone_device_page_init, the implementation breaks. This problem is not > specific to Xe—Nouveau happens to always allocate at the same order, so > it works by coincidence. Reallocating at a different order is valid > behavior and must be supported. > Agreed > Matt > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > >>> + } >>> +} >>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); >>> + >>> void free_zone_device_folio(struct folio *folio) >>> { >>> struct dev_pagemap *pgmap = folio->pgmap; >>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) >>> case MEMORY_DEVICE_COHERENT: >>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) >>> break; >>> + free_zone_device_folio_prepare(folio); >>> pgmap->ops->folio_free(folio, order); >>> percpu_ref_put_many(&folio->pgmap->ref, nr); >>> break; >> >> Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 2:15 ` Balbir Singh @ 2026-01-12 2:37 ` Matthew Brost 2026-01-12 2:50 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-12 2:37 UTC (permalink / raw) To: Balbir Singh Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On Mon, Jan 12, 2026 at 01:15:12PM +1100, Balbir Singh wrote: > On 1/12/26 11:16, Matthew Brost wrote: > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > >> On 1/12/26 06:55, Francois Dugast wrote: > >>> From: Matthew Brost <matthew.brost@intel.com> > >>> > >>> Add free_zone_device_folio_prepare(), a helper that restores large > >>> ZONE_DEVICE folios to a sane, initial state before freeing them. > >>> > >>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > >>> compound metadata). Before returning such pages to the device pgmap > >>> allocator, each constituent page must be reset to a standalone > >>> ZONE_DEVICE folio with a valid pgmap and no compound state. > >>> > >>> Use this helper prior to folio_free() for device-private and > >>> device-coherent folios to ensure consistent device page state for > >>> subsequent allocations. > >>> > >>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > >>> Cc: Zi Yan <ziy@nvidia.com> > >>> Cc: David Hildenbrand <david@kernel.org> > >>> Cc: Oscar Salvador <osalvador@suse.de> > >>> Cc: Andrew Morton <akpm@linux-foundation.org> > >>> Cc: Balbir Singh <balbirs@nvidia.com> > >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > >>> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > >>> Cc: Vlastimil Babka <vbabka@suse.cz> > >>> Cc: Mike Rapoport <rppt@kernel.org> > >>> Cc: Suren Baghdasaryan <surenb@google.com> > >>> Cc: Michal Hocko <mhocko@suse.com> > >>> Cc: Alistair Popple <apopple@nvidia.com> > >>> Cc: linux-mm@kvack.org > >>> Cc: linux-cxl@vger.kernel.org > >>> Cc: linux-kernel@vger.kernel.org > >>> Suggested-by: Alistair Popple <apopple@nvidia.com> > >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> > >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> > >>> --- > >>> include/linux/memremap.h | 1 + > >>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > >>> 2 files changed, 56 insertions(+) > >>> > >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h > >>> index 97fcffeb1c1e..88e1d4707296 100644 > >>> --- a/include/linux/memremap.h > >>> +++ b/include/linux/memremap.h > >>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > >>> > >>> #ifdef CONFIG_ZONE_DEVICE > >>> void zone_device_page_init(struct page *page, unsigned int order); > >>> +void free_zone_device_folio_prepare(struct folio *folio); > >>> void *memremap_pages(struct dev_pagemap *pgmap, int nid); > >>> void memunmap_pages(struct dev_pagemap *pgmap); > >>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > >>> diff --git a/mm/memremap.c b/mm/memremap.c > >>> index 39dc4bd190d0..375a61e18858 100644 > >>> --- a/mm/memremap.c > >>> +++ b/mm/memremap.c > >>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > >>> } > >>> EXPORT_SYMBOL_GPL(get_dev_pagemap); > >>> > >>> +/** > >>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > >>> + * @folio: ZONE_DEVICE folio to prepare for release. > >>> + * > >>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > >>> + * can be compound. 
When freeing a compound ZONE_DEVICE folio, the tail pages > >>> + * must be restored to a sane ZONE_DEVICE state before they are released. > >>> + * > >>> + * This helper: > >>> + * - Clears @folio->mapping and, for compound folios, clears each page's > >>> + * compound-head state (ClearPageHead()/clear_compound_head()). > >>> + * - Resets the compound order metadata (folio_reset_order()) and then > >>> + * initializes each constituent page as a standalone ZONE_DEVICE folio: > >>> + * * clears ->mapping > >>> + * * restores ->pgmap (prep_compound_page() overwrites it) > >>> + * * clears ->share (only relevant for fsdax; unused for device-private) > >>> + * > >>> + * If @folio is order-0, only the mapping is cleared and no further work is > >>> + * required. > >>> + */ > >>> +void free_zone_device_folio_prepare(struct folio *folio) > >>> +{ > >>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > >>> + int order, i; > >>> + > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > >>> + > >>> + folio->mapping = NULL; > >>> + order = folio_order(folio); > >>> + if (!order) > >>> + return; > >>> + > >>> + folio_reset_order(folio); > >>> + > >>> + for (i = 0; i < (1UL << order); i++) { > >>> + struct page *page = folio_page(folio, i); > >>> + struct folio *new_folio = (struct folio *)page; > >>> + > >>> + ClearPageHead(page); > >>> + clear_compound_head(page); > >>> + > >>> + new_folio->mapping = NULL; > >>> + /* > >>> + * Reset pgmap which was over-written by > >>> + * prep_compound_page(). > >>> + */ > >>> + new_folio->pgmap = pgmap; > >>> + new_folio->share = 0; /* fsdax only, unused for device private */ > >>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > >> > >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > >> that PMD_ORDER more frees than we'd like? 
> >> > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > would make my implementation explode. I explained this in detail here [1] > > to Zi. > > > > To recap [1], my memory allocator has no visibility into individual > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > allows VRAM to be allocated or evicted for both traditional GPU > > allocations (GEMs) and SVM allocations. > > > > I assume it is still backed by pages that are ref counted? I suspect you'd Yes. > need to convert one reference count to PMD_ORDER reference counts to make > this change work, or are the references not at page granularity? > > I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed > like a per folio refcount > The refcount is incremented by 1 for each call to folio_set_zone_device_data. If we have a 2MB device folio backing a 2MB allocation, the refcount is 1. If we have 512 4KB device pages backing a 2MB allocation, the refcount is 512. The refcount matches the number of folio_free calls we expect to receive for the size of the backing allocation. Right now, in Xe, we allocate either 4k, 64k or 2M but this is all configurable via a driver-side table (Xe) in GPU SVM (DRM common layer). Matt > > Now, to recap the actual issue: if device folios are not split upon free > > and are later reallocated with a different order in > > zone_device_page_init, the implementation breaks. This problem is not > > specific to Xe—Nouveau happens to always allocate at the same order, so > > it works by coincidence. Reallocating at a different order is valid > > behavior and must be supported.
> > > > Agreed > > > Matt > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > >>> + } > >>> +} > >>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > >>> + > >>> void free_zone_device_folio(struct folio *folio) > >>> { > >>> struct dev_pagemap *pgmap = folio->pgmap; > >>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > >>> case MEMORY_DEVICE_COHERENT: > >>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > >>> break; > >>> + free_zone_device_folio_prepare(folio); > >>> pgmap->ops->folio_free(folio, order); > >>> percpu_ref_put_many(&folio->pgmap->ref, nr); > >>> break; > >> > >> Balbir > ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 2:37 ` Matthew Brost @ 2026-01-12 2:50 ` Matthew Brost 0 siblings, 0 replies; 39+ messages in thread From: Matthew Brost @ 2026-01-12 2:50 UTC (permalink / raw) To: Balbir Singh Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm, linux-cxl, linux-kernel On Sun, Jan 11, 2026 at 06:37:06PM -0800, Matthew Brost wrote: > On Mon, Jan 12, 2026 at 01:15:12PM +1100, Balbir Singh wrote: > > On 1/12/26 11:16, Matthew Brost wrote: > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > >> On 1/12/26 06:55, Francois Dugast wrote: > > >>> From: Matthew Brost <matthew.brost@intel.com> > > >>> > > >>> Add free_zone_device_folio_prepare(), a helper that restores large > > >>> ZONE_DEVICE folios to a sane, initial state before freeing them. > > >>> > > >>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > >>> compound metadata). Before returning such pages to the device pgmap > > >>> allocator, each constituent page must be reset to a standalone > > >>> ZONE_DEVICE folio with a valid pgmap and no compound state. > > >>> > > >>> Use this helper prior to folio_free() for device-private and > > >>> device-coherent folios to ensure consistent device page state for > > >>> subsequent allocations. > > >>> > > >>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > >>> Cc: Zi Yan <ziy@nvidia.com> > > >>> Cc: David Hildenbrand <david@kernel.org> > > >>> Cc: Oscar Salvador <osalvador@suse.de> > > >>> Cc: Andrew Morton <akpm@linux-foundation.org> > > >>> Cc: Balbir Singh <balbirs@nvidia.com> > > >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > >>> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > > >>> Cc: Vlastimil Babka <vbabka@suse.cz> > > >>> Cc: Mike Rapoport <rppt@kernel.org> > > >>> Cc: Suren Baghdasaryan <surenb@google.com> > > >>> Cc: Michal Hocko <mhocko@suse.com> > > >>> Cc: Alistair Popple <apopple@nvidia.com> > > >>> Cc: linux-mm@kvack.org > > >>> Cc: linux-cxl@vger.kernel.org > > >>> Cc: linux-kernel@vger.kernel.org > > >>> Suggested-by: Alistair Popple <apopple@nvidia.com> > > >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > >>> --- > > >>> include/linux/memremap.h | 1 + > > >>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > >>> 2 files changed, 56 insertions(+) > > >>> > > >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > >>> index 97fcffeb1c1e..88e1d4707296 100644 > > >>> --- a/include/linux/memremap.h > > >>> +++ b/include/linux/memremap.h > > >>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > >>> > > >>> #ifdef CONFIG_ZONE_DEVICE > > >>> void zone_device_page_init(struct page *page, unsigned int order); > > >>> +void free_zone_device_folio_prepare(struct folio *folio); > > >>> void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > >>> void memunmap_pages(struct dev_pagemap *pgmap); > > >>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > >>> diff --git a/mm/memremap.c b/mm/memremap.c > > >>> index 39dc4bd190d0..375a61e18858 100644 > > >>> --- a/mm/memremap.c > > >>> +++ b/mm/memremap.c > > >>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > >>> } > > >>> EXPORT_SYMBOL_GPL(get_dev_pagemap); > > >>> > > >>> +/** > > >>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > >>> + * @folio: ZONE_DEVICE folio to prepare for release. 
> > >>> + * > > >>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > >>> + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > >>> + * must be restored to a sane ZONE_DEVICE state before they are released. > > >>> + * > > >>> + * This helper: > > >>> + * - Clears @folio->mapping and, for compound folios, clears each page's > > >>> + * compound-head state (ClearPageHead()/clear_compound_head()). > > >>> + * - Resets the compound order metadata (folio_reset_order()) and then > > >>> + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > >>> + * * clears ->mapping > > >>> + * * restores ->pgmap (prep_compound_page() overwrites it) > > >>> + * * clears ->share (only relevant for fsdax; unused for device-private) > > >>> + * > > >>> + * If @folio is order-0, only the mapping is cleared and no further work is > > >>> + * required. > > >>> + */ > > >>> +void free_zone_device_folio_prepare(struct folio *folio) > > >>> +{ > > >>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > >>> + int order, i; > > >>> + > > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > >>> + > > >>> + folio->mapping = NULL; > > >>> + order = folio_order(folio); > > >>> + if (!order) > > >>> + return; > > >>> + > > >>> + folio_reset_order(folio); > > >>> + > > >>> + for (i = 0; i < (1UL << order); i++) { > > >>> + struct page *page = folio_page(folio, i); > > >>> + struct folio *new_folio = (struct folio *)page; > > >>> + > > >>> + ClearPageHead(page); > > >>> + clear_compound_head(page); > > >>> + > > >>> + new_folio->mapping = NULL; > > >>> + /* > > >>> + * Reset pgmap which was over-written by > > >>> + * prep_compound_page(). 
> > >>> + */ > > >>> + new_folio->pgmap = pgmap; > > >>> + new_folio->share = 0; /* fsdax only, unused for device private */ > > >>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > >> > > >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > >> that PMD_ORDER more frees than we'd like? > > >> > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > would make my implementation explode. I explained this in detail here [1] > > > to Zi. > > > > > > To recap [1], my memory allocator has no visibility into individual > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > allows VRAM to be allocated or evicted for both traditional GPU > > > allocations (GEMs) and SVM allocations. > > > > > > > I assume it is still backed by pages that are ref counted? I suspect you'd > > Yes. > Let me clarify this a bit. We don’t track individual pages in our refcounting; instead, we maintain a reference count for the original allocation (i.e., there are no partial frees of the original allocation). This refcounting is handled in GPU SVM (DRM common), and when the allocation’s refcount reaches zero, GPU SVM calls into the driver to indicate that the memory can be released. In Xe, the backing memory is a TTM BO (think of this as an eviction hook), which is layered on top of DRM Buddy (which actually controls VRAM allocation and can determine device pages from this layer). I suspect AMD, when using GPU SVM (they have indicated this is the plan), will also use TTM BO here. Nova, assuming they eventually adopt SVM and use GPU SVM, will likely implement something very similar to TTM in Rust, but with DRM Buddy also controlling the actual allocation (they have already written bindings for DRM buddy). 
Matt > > need to convert one reference count to PMD_ORDER reference counts to make > > this change work, or are the references not at page granularity? > > > > I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed > > like a per folio refcount > > > > The refcount is incremented by 1 for each call to > folio_set_zone_device_data. If we have a 2MB device folio backing a > 2MB allocation, the refcount is 1. If we have 512 4KB device pages > backing a 2MB allocation, the refcount is 512. The refcount matches the > number of folio_free calls we expect to receive for the size of the > backing allocation. Right now, in Xe, we allocate either 4k, 64k or 2M > but this is all configurable via a driver-side table (Xe) in GPU SVM (drm > common layer). > > Matt > > > > Now, to recap the actual issue: if device folios are not split upon free > > > and are later reallocated with a different order in > > > zone_device_page_init, the implementation breaks. This problem is not > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > it works by coincidence. Reallocating at a different order is valid > > > behavior and must be supported. > > > > > > > Agreed > > > > > Matt > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > >>> + } > > >>> +} > > >>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > >>> + > > >>> void free_zone_device_folio(struct folio *folio) > > >>> { > > >>> struct dev_pagemap *pgmap = folio->pgmap; > > >>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > >>> case MEMORY_DEVICE_COHERENT: > > >>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > >>> break; > > >>> + free_zone_device_folio_prepare(folio); > > >>> pgmap->ops->folio_free(folio, order); > > >>> percpu_ref_put_many(&folio->pgmap->ref, nr); > > >>> break; > > >> > > >> Balbir > > ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 1:16 ` Matthew Brost 2026-01-12 2:15 ` Balbir Singh @ 2026-01-12 23:58 ` Alistair Popple 2026-01-13 0:23 ` Matthew Brost 1 sibling, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-12 23:58 UTC (permalink / raw) To: Matthew Brost Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > On 1/12/26 06:55, Francois Dugast wrote: > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > compound metadata). Before returning such pages to the device pgmap > > > allocator, each constituent page must be reset to a standalone > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > Use this helper prior to folio_free() for device-private and > > > device-coherent folios to ensure consistent device page state for > > > subsequent allocations. > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > Cc: Zi Yan <ziy@nvidia.com> > > > Cc: David Hildenbrand <david@kernel.org> > > > Cc: Oscar Salvador <osalvador@suse.de> > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > Cc: Mike Rapoport <rppt@kernel.org> > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > Cc: Michal Hocko <mhocko@suse.com> > > > Cc: Alistair Popple <apopple@nvidia.com> > > > Cc: linux-mm@kvack.org > > > Cc: linux-cxl@vger.kernel.org > > > Cc: linux-kernel@vger.kernel.org > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > --- > > > include/linux/memremap.h | 1 + > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 56 insertions(+) > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > --- a/include/linux/memremap.h > > > +++ b/include/linux/memremap.h > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > void zone_device_page_init(struct page *page, unsigned int order); > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > index 39dc4bd190d0..375a61e18858 100644 > > > --- a/mm/memremap.c > > > +++ b/mm/memremap.c > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > } > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > +/** > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > + * > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > + * can be compound. 
When freeing a compound ZONE_DEVICE folio, the tail pages > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > + * > > > + * This helper: > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > + * * clears ->mapping > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > + * > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > + * required. > > > + */ > > > +void free_zone_device_folio_prepare(struct folio *folio) I don't really like the naming here - we're not preparing a folio to be freed, from the core-mm perspective the folio is already free. This is just reinitialising the folio metadata ready for the driver to reuse it, which may actually involve just recreating a compound folio. So maybe zone_device_folio_reinitialise()? Or would it be possible to roll this into a zone_device_folio_init() type function (similar to zone_device_page_init()) that just deals with everything at allocation time? 
> > > +{ > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > + int order, i; > > > + > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > + > > > + folio->mapping = NULL; > > > + order = folio_order(folio); > > > + if (!order) > > > + return; > > > + > > > + folio_reset_order(folio); > > > + > > > + for (i = 0; i < (1UL << order); i++) { > > > + struct page *page = folio_page(folio, i); > > > + struct folio *new_folio = (struct folio *)page; > > > + > > > + ClearPageHead(page); > > > + clear_compound_head(page); > > > + > > > + new_folio->mapping = NULL; > > > + /* > > > + * Reset pgmap which was over-written by > > > + * prep_compound_page(). > > > + */ > > > + new_folio->pgmap = pgmap; > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > that PMD_ORDER more frees than we'd like? > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > would make my implementation explode. I explained this in detail here [1] > to Zi. > > To recap [1], my memory allocator has no visibility into individual > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > allows VRAM to be allocated or evicted for both traditional GPU > allocations (GEMs) and SVM allocations. > > Now, to recap the actual issue: if device folios are not split upon free > and are later reallocated with a different order in > zone_device_page_init, the implementation breaks. This problem is not > specific to Xe—Nouveau happens to always allocate at the same order, so > it works by coincidence. Reallocating at a different order is valid > behavior and must be supported. 
I agree it's probably by coincidence but it is a perfectly valid design to always just (re)allocate at the same order and not worry about having to reinitialise things to different orders. - Alistair > Matt > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > + } > > > +} > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > + > > > void free_zone_device_folio(struct folio *folio) > > > { > > > struct dev_pagemap *pgmap = folio->pgmap; > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > case MEMORY_DEVICE_COHERENT: > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > break; > > > + free_zone_device_folio_prepare(folio); > > > pgmap->ops->folio_free(folio, order); > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > break; > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-12 23:58 ` Alistair Popple @ 2026-01-13 0:23 ` Matthew Brost 2026-01-13 0:43 ` Alistair Popple 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-13 0:23 UTC (permalink / raw) To: Alistair Popple Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > compound metadata). Before returning such pages to the device pgmap > > > > allocator, each constituent page must be reset to a standalone > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > device-coherent folios to ensure consistent device page state for > > > > subsequent allocations. > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > Cc: David Hildenbrand <david@kernel.org> > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > Cc: linux-mm@kvack.org > > > > Cc: linux-cxl@vger.kernel.org > > > > Cc: linux-kernel@vger.kernel.org > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > --- > > > > include/linux/memremap.h | 1 + > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > 2 files changed, 56 insertions(+) > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > --- a/include/linux/memremap.h > > > > +++ b/include/linux/memremap.h > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > --- a/mm/memremap.c > > > > +++ b/mm/memremap.c > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > } > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > +/** > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > + * @folio: ZONE_DEVICE folio to prepare for release. 
> > > > + * > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > + * > > > > + * This helper: > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > + * * clears ->mapping > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > + * > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > + * required. > > > > + */ > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > I don't really like the naming here - we're not preparing a folio to be > freed, from the core-mm perspective the folio is already free. This is just > reinitialising the folio metadata ready for the driver to reuse it, which may > actually involve just recreating a compound folio. > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > zone_device_folio_reinitialise - that works for me... but it seems like everyone has an opinion. > roll this into a zone_device_folio_init() type function (similar to > zone_device_page_init()) that just deals with everything at allocation time? > > I don’t think doing this at allocation actually works without a big lock > per pgmap. Consider the case where a VRAM allocator allocates two > distinct subsets of a large folio and you have a multi-threaded GPU page > fault handler (Xe does). It’s possible two threads could call > zone_device_folio_reinitialise at the same time, racing and causing all > sorts of issues.
My plan is to just call this function in the driver’s ->folio_free() prior to returning the VRAM allocation to my driver pool. > > > > +{ > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > + int order, i; > > > > + > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > + > > > > + folio->mapping = NULL; > > > > + order = folio_order(folio); > > > > + if (!order) > > > > + return; > > > > + > > > > + folio_reset_order(folio); > > > > + > > > > + for (i = 0; i < (1UL << order); i++) { > > > > + struct page *page = folio_page(folio, i); > > > > + struct folio *new_folio = (struct folio *)page; > > > > + > > > > + ClearPageHead(page); > > > > + clear_compound_head(page); > > > > + > > > > + new_folio->mapping = NULL; > > > > + /* > > > > + * Reset pgmap which was over-written by > > > > + * prep_compound_page(). > > > > + */ > > > > + new_folio->pgmap = pgmap; > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > that PMD_ORDER more frees than we'd like? > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > would make my implementation explode. I explained this in detail here [1] > > to Zi. > > > > To recap [1], my memory allocator has no visibility into individual > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > allows VRAM to be allocated or evicted for both traditional GPU > > allocations (GEMs) and SVM allocations. > > > > Now, to recap the actual issue: if device folios are not split upon free > > and are later reallocated with a different order in > > zone_device_page_init, the implementation breaks. 
This problem is not > > specific to Xe—Nouveau happens to always allocate at the same order, so > > it works by coincidence. Reallocating at a different order is valid > > behavior and must be supported. > > I agree it's probably by coincidence but it is a perfectly valid design to > always just (re)allocate at the same order and not worry about having to > reinitialise things to different orders. > I would agree with this statement too — it’s perfectly valid if a driver always wants to (re)allocate at the same order. Matt > - Alistair > > > Matt > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > + } > > > > +} > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > + > > > > void free_zone_device_folio(struct folio *folio) > > > > { > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > case MEMORY_DEVICE_COHERENT: > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > break; > > > > + free_zone_device_folio_prepare(folio); > > > > pgmap->ops->folio_free(folio, order); > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > break; > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 0:23 ` Matthew Brost @ 2026-01-13 0:43 ` Alistair Popple 2026-01-13 1:07 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-13 0:43 UTC (permalink / raw) To: Matthew Brost Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > allocator, each constituent page must be reset to a standalone > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > device-coherent folios to ensure consistent device page state for > > > > > subsequent allocations. 
> > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > Cc: linux-mm@kvack.org > > > > > Cc: linux-cxl@vger.kernel.org > > > > > Cc: linux-kernel@vger.kernel.org > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > --- > > > > > include/linux/memremap.h | 1 + > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > --- a/include/linux/memremap.h > > > > > +++ b/include/linux/memremap.h > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > 
--- a/mm/memremap.c > > > > > +++ b/mm/memremap.c > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > } > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > +/** > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > + * > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > + * > > > > > + * This helper: > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > + * * clears ->mapping > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > + * > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > + * required. > > > > > + */ > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > I don't really like the naming here - we're not preparing a folio to be > > freed, from the core-mm perspective the folio is already free. This is just > > reinitialising the folio metadata ready for the driver to reuse it, which may > > actually involve just recreating a compound folio. > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > zone_device_folio_reinitialise - that works for me... but it seems like > everyone has an opinion. Well of course :) There are only two hard problems in programming and I forget the other one.
But I didn't want to just say I don't like free_zone_device_folio_prepare() without offering an alternative, I'd be open to others. > > > roll this into a zone_device_folio_init() type function (similar to > > zone_device_page_init()) that just deals with everything at allocation time? > > > > I don’t think doing this at allocation actually works without a big lock > per pgmap. Consider the case where a VRAM allocator allocates two > distinct subsets of a large folio and you have a multi-threaded GPU page > fault handler (Xe does). It’s possible two threads could call > zone_device_folio_reinitialise at the same time, racing and causing all > sorts of issues. My plan is to just call this function in the driver’s > ->folio_free() prior to returning the VRAM allocation to my driver pool. This doesn't make sense to me (at least as someone who doesn't know DRM SVM intimately) - the folio metadata initialisation should only happen after the VRAM allocation has occurred. IOW the VRAM allocator needs to deal with the locking, once you have the VRAM physical range you just initialise the folio/pages associated with that range with zone_device_folio_(re)initialise() and you're done. Is the concern that reinitialisation would touch pages outside of the allocated VRAM range if it was previously a large folio? 
> > > > > +{ > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > + int order, i; > > > > > + > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > + > > > > > + folio->mapping = NULL; > > > > > + order = folio_order(folio); > > > > > + if (!order) > > > > > + return; > > > > > + > > > > > + folio_reset_order(folio); > > > > > + > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > + struct page *page = folio_page(folio, i); > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > + > > > > > + ClearPageHead(page); > > > > > + clear_compound_head(page); > > > > > + > > > > > + new_folio->mapping = NULL; > > > > > + /* > > > > > + * Reset pgmap which was over-written by > > > > > + * prep_compound_page(). > > > > > + */ > > > > > + new_folio->pgmap = pgmap; > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > would make my implementation explode. I explained this in detail here [1] > > > to Zi. > > > > > > To recap [1], my memory allocator has no visibility into individual > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > allows VRAM to be allocated or evicted for both traditional GPU > > > allocations (GEMs) and SVM allocations. > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > and are later reallocated with a different order in > > > zone_device_page_init, the implementation breaks. This problem is not > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > it works by coincidence. 
Reallocating at a different order is valid > > > behavior and must be supported. > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > always just (re)allocate at the same order and not worry about having to > > reinitialise things to different orders. > > > > I would agree with this statement too — it’s perfectly valid if a driver > always wants to (re)allocate at the same order. > > Matt > > > - Alistair > > > > > Matt > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > + } > > > > > +} > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > + > > > > > void free_zone_device_folio(struct folio *folio) > > > > > { > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > case MEMORY_DEVICE_COHERENT: > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > break; > > > > > + free_zone_device_folio_prepare(folio); > > > > > pgmap->ops->folio_free(folio, order); > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > break; > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
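The per-page reset that free_zone_device_folio_prepare() performs on a compound folio can be illustrated with a small userspace sketch. The struct and function names below are hypothetical stand-ins, not the kernel API; the point is the shape of the loop: clear the compound-head state on every constituent page, reset the order, and restore the pgmap pointer that prep_compound_page() clobbered on the tails.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the relevant struct page/folio fields. */
struct mock_page {
	void *mapping;	/* folio->mapping */
	void *pgmap;	/* overwritten on tail pages by prep_compound_page() */
	int is_head;	/* PageHead */
	int order;	/* compound order, meaningful on the head page only */
};

/*
 * Mirror of the helper's logic: restore every page of a compound
 * allocation to a standalone, order-0 state with a valid pgmap.
 */
static void mock_folio_prepare(struct mock_page *pages, void *pgmap)
{
	int order = pages[0].order;
	int i;

	pages[0].mapping = NULL;
	if (!order)
		return;			/* order-0: nothing else to do */

	pages[0].order = 0;		/* folio_reset_order() */
	for (i = 0; i < (1 << order); i++) {
		pages[i].is_head = 0;	/* ClearPageHead()/clear_compound_head() */
		pages[i].mapping = NULL;
		pages[i].pgmap = pgmap;	/* restore the clobbered pgmap */
	}
}
```

With an order-2 "folio" (4 pages), calling mock_folio_prepare() leaves each of the 4 pages as an independent order-0 page pointing at the right pgmap, which is exactly the state a later allocation of any order expects to start from.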
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 0:43 ` Alistair Popple @ 2026-01-13 1:07 ` Matthew Brost 2026-01-13 1:35 ` Alistair Popple 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-13 1:07 UTC (permalink / raw) To: Alistair Popple Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > subsequent allocations. 
> > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > Cc: linux-mm@kvack.org > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > --- > > > > > > include/linux/memremap.h | 1 + > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > --- a/include/linux/memremap.h > > > > > > +++ b/include/linux/memremap.h > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > > diff --git a/mm/memremap.c 
b/mm/memremap.c > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > --- a/mm/memremap.c > > > > > > +++ b/mm/memremap.c > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > } > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > +/** > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > + * > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > + * > > > > > > + * This helper: > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > + * * clears ->mapping > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > + * > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > + * required. > > > > > > + */ > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > freed, from the core-mm perspective the folio is already free. This is just > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > actually involve just recreating a compound folio. > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > zone_device_folio_reinitialise - that works for me... 
but seem like > > everyone has a opinion. > > Well of course :) There are only two hard problems in programming and > I forget the other one. But I didn't want to just say I don't like > free_zone_device_folio_prepare() without offering an alternative, I'd be open > to others. > zone_device_folio_reinitialise is good with me. > > > > > roll this into a zone_device_folio_init() type function (similar to > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > I don’t think doing this at allocation actually works without a big lock > > per pgmap. Consider the case where a VRAM allocator allocates two > > distinct subsets of a large folio and you have a multi-threaded GPU page > > fault handler (Xe does). It’s possible two threads could call > > zone_device_folio_reinitialise at the same time, racing and causing all > > sorts of issues. My plan is to just call this function in the driver’s > > ->folio_free() prior to returning the VRAM allocation to my driver pool. > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM > intimately) - the folio metadata initialisation should only happen after the > VRAM allocation has occured. > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM > physical range you just initialise the folio/pages associated with that range > with zone_device_folio_(re)initialise() and you're done. > Our VRAM allocator does have locking (via DRM buddy), but that layer doesn’t have visibility into the folio or its pages. By the time we handle the folio/pages in the GPU fault handler, there are no global locks preventing two GPU faults from each having, say, 16 pages from the same order-9 folio. I believe if both threads call zone_device_folio_reinitialise/init at the same time, bad things could happen. > Is the concern that reinitialisation would touch pages outside of the allocated > VRAM range if it was previously a large folio? 
No, just two threads calling zone_device_folio_reinitialise/init at the same time, on the same folio. If we call zone_device_folio_reinitialise in ->folio_free this problem goes away. We could solve this with split_lock or something but I'd prefer not to add a lock for this (although some of the prior revs did do this; maybe we will revisit this later). Anyways - this falls in driver detail / choice IMO. Matt > > > > > > > +{ > > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > > + int order, i; > > > > > > + > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > > + > > > > > > + folio->mapping = NULL; > > > > > > + order = folio_order(folio); > > > > > > + if (!order) > > > > > > + return; > > > > > > + > > > > > > + folio_reset_order(folio); > > > > > > + > > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > > + struct page *page = folio_page(folio, i); > > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > > + > > > > > > + ClearPageHead(page); > > > > > > + clear_compound_head(page); > > > > > > + > > > > > > + new_folio->mapping = NULL; > > > > > > + /* > > > > > > + * Reset pgmap which was over-written by > > > > > > + * prep_compound_page(). > > > > > > + */ > > > > > > + new_folio->pgmap = pgmap; > > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > > would make my implementation explode. I explained this in detail here [1] > > > > to Zi. 
> > > > > > > > To recap [1], my memory allocator has no visibility into individual > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > > allows VRAM to be allocated or evicted for both traditional GPU > > > > allocations (GEMs) and SVM allocations. > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > > and are later reallocated with a different order in > > > > zone_device_page_init, the implementation breaks. This problem is not > > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > > it works by coincidence. Reallocating at a different order is valid > > > > behavior and must be supported. > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > > always just (re)allocate at the same order and not worry about having to > > > reinitialise things to different orders. > > > > > > > I would agree with this statement too — it’s perfectly valid if a driver > > always wants to (re)allocate at the same order. > > > > Matt > > > > > - Alistair > > > > > > > Matt > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > > > + } > > > > > > +} > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > > + > > > > > > void free_zone_device_folio(struct folio *folio) > > > > > > { > > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > > case MEMORY_DEVICE_COHERENT: > > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > > break; > > > > > > + free_zone_device_folio_prepare(folio); > > > > > > pgmap->ops->folio_free(folio, order); > > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > > break; > > > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
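The actual bug Matt recaps above — stale compound state from a large allocation breaking a later, smaller allocation — can be simulated in userspace. All names below are illustrative, not the kernel or Xe API: sim_page_init() plays the role of zone_device_page_init() and asserts the clean-state precondition it relies on, and sim_free_prepare() plays the role of the free-time split/reset.

```c
#include <assert.h>
#include <stddef.h>

struct sim_page {
	int is_head;
	int order;	/* meaningful on the head page only */
	void *pgmap;
};

/* Like zone_device_page_init(): set up a compound allocation of 2^order pages.
 * A clean init assumes no stale compound-head state from a prior, larger
 * allocation is left behind. */
static void sim_page_init(struct sim_page *p, int order, void *pgmap)
{
	int i;

	for (i = 0; i < (1 << order); i++)
		assert(!p[i].is_head);	/* breaks if the old folio wasn't reset */
	p[0].is_head = 1;
	p[0].order = order;
	for (i = 0; i < (1 << order); i++)
		p[i].pgmap = pgmap;
}

/* The free-time reset (the role free_zone_device_folio_prepare() fills):
 * split the compound allocation back to independent order-0 pages. */
static void sim_free_prepare(struct sim_page *p)
{
	int order = p[0].order;
	int i;

	p[0].order = 0;
	for (i = 0; i < (1 << order); i++)
		p[i].is_head = 0;
}
```

Allocating order-2, freeing with the reset step, then re-allocating the same range as two order-1 chunks works; skip sim_free_prepare() and the second init trips over the stale head state, which is the "reallocate at a different order" failure described in the thread.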
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 1:07 ` Matthew Brost @ 2026-01-13 1:35 ` Alistair Popple 2026-01-13 1:40 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-13 1:35 UTC (permalink / raw) To: Matthew Brost Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > > subsequent allocations. 
> > > > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > > Cc: linux-mm@kvack.org > > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > > --- > > > > > > > include/linux/memremap.h | 1 + > > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > > --- a/include/linux/memremap.h > > > > > > > +++ b/include/linux/memremap.h > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > > > void *devm_memremap_pages(struct 
device *dev, struct dev_pagemap *pgmap); > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > > --- a/mm/memremap.c > > > > > > > +++ b/mm/memremap.c > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > > } > > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > > > +/** > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > > + * > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > > + * > > > > > > > + * This helper: > > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > > + * * clears ->mapping > > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > > + * > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > > + * required. > > > > > > > + */ > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > > freed, from the core-mm perspective the folio is already free. This is just > > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > > actually involve just recreating a compound folio. 
> > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > > > zone_device_folio_reinitialise - that works for me... but seem like > > > everyone has a opinion. > > > > Well of course :) There are only two hard problems in programming and > > I forget the other one. But I didn't want to just say I don't like > > free_zone_device_folio_prepare() without offering an alternative, I'd be open > > to others. > > > > zone_device_folio_reinitialise is good with me. > > > > > > > > roll this into a zone_device_folio_init() type function (similar to > > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > > > > I don’t think doing this at allocation actually works without a big lock > > > per pgmap. Consider the case where a VRAM allocator allocates two > > > distinct subsets of a large folio and you have a multi-threaded GPU page > > > fault handler (Xe does). It’s possible two threads could call > > > zone_device_folio_reinitialise at the same time, racing and causing all > > > sorts of issues. My plan is to just call this function in the driver’s > > > ->folio_free() prior to returning the VRAM allocation to my driver pool. > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM > > intimately) - the folio metadata initialisation should only happen after the > > VRAM allocation has occured. > > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM > > physical range you just initialise the folio/pages associated with that range > > with zone_device_folio_(re)initialise() and you're done. > > > > Our VRAM allocator does have locking (via DRM buddy), but that layer I mean I assumed it did :-) > doesn’t have visibility into the folio or its pages. By the time we > handle the folio/pages in the GPU fault handler, there are no global > locks preventing two GPU faults from each having, say, 16 pages from the > same order-9 folio. 
I believe if both threads call > zone_device_folio_reinitialise/init at the same time, bad things could > happen. This is confusing to me. If you are getting a GPU fault it implies no page is mapped at a particular virtual address. The normal process (or at least the process I'm familiar with) for handling this is to allocate and map a page at the faulting virtual address. So in the scenario of two GPUs faulting on the same VA each thread would allocate VRAM using DRM buddy, presumably getting different physical pages, and so the zone_device_folio_init() call would be to different folios/pages. Then eventually one thread would succeed in creating the mapping from VA->VRAM and the losing thread would free the VRAM allocation back to DRM buddy. So I'm a bit confused by the above statement that two GPUs faults could each have the same pages or be calling zone_device_folio_init() on the same pages. How would that happen? > > Is the concern that reinitialisation would touch pages outside of the allocated > > VRAM range if it was previously a large folio? > > No just two threads call zone_device_folio_reinitialise/init at the same > time, on the same folio. > > If we call zone_device_folio_reinitialise in ->folio_free this problem > goes away. We could solve this with split_lock or something but I'd > prefer not to add lock for this (although some of prior revs did do > this, maybe we will revist this later). > > Anyways - this falls in driver detail / choice IMO. Agreed. 
- Alistair > Matt > > > > > > > > > > +{ > > > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > > > + int order, i; > > > > > > > + > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > > > + > > > > > > > + folio->mapping = NULL; > > > > > > > + order = folio_order(folio); > > > > > > > + if (!order) > > > > > > > + return; > > > > > > > + > > > > > > > + folio_reset_order(folio); > > > > > > > + > > > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > > > + struct page *page = folio_page(folio, i); > > > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > > > + > > > > > > > + ClearPageHead(page); > > > > > > > + clear_compound_head(page); > > > > > > > + > > > > > > > + new_folio->mapping = NULL; > > > > > > > + /* > > > > > > > + * Reset pgmap which was over-written by > > > > > > > + * prep_compound_page(). > > > > > > > + */ > > > > > > > + new_folio->pgmap = pgmap; > > > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > > > would make my implementation explode. I explained this in detail here [1] > > > > > to Zi. > > > > > > > > > > To recap [1], my memory allocator has no visibility into individual > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > > > allows VRAM to be allocated or evicted for both traditional GPU > > > > > allocations (GEMs) and SVM allocations. 
> > > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > > > and are later reallocated with a different order in > > > > > zone_device_page_init, the implementation breaks. This problem is not > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > > > it works by coincidence. Reallocating at a different order is valid > > > > > behavior and must be supported. > > > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > > > always just (re)allocate at the same order and not worry about having to > > > > reinitialise things to different orders. > > > > > > > > > > I would agree with this statement too — it’s perfectly valid if a driver > > > always wants to (re)allocate at the same order. > > > > > > Matt > > > > > > > - Alistair > > > > > > > > > Matt > > > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > > > > > + } > > > > > > > +} > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > > > + > > > > > > > void free_zone_device_folio(struct folio *folio) > > > > > > > { > > > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > > > case MEMORY_DEVICE_COHERENT: > > > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > > > break; > > > > > > > + free_zone_device_folio_prepare(folio); > > > > > > > pgmap->ops->folio_free(folio, order); > > > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > > > break; > > > > > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
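The fault-handling flow Alistair describes in this exchange — each faulting thread gets its own VRAM allocation, one wins installing the VA->VRAM mapping, and the loser frees its allocation back to the pool — comes down to a compare-and-swap on the page-table slot. A minimal sketch (hypothetical names, simulated with two sequential calls rather than real threads):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One page-table slot for the faulting VA; NULL means unmapped. */
static _Atomic(void *) pte_slot;

static int freed_allocations;

/* Stand-in for returning a VRAM allocation to the buddy allocator. */
static void vram_free(void *alloc)
{
	(void)alloc;
	freed_allocations++;
}

/* Returns true if this thread's allocation won and now backs the VA. */
static bool handle_fault(void *my_vram_alloc)
{
	void *expected = NULL;

	/* Only succeeds if the slot is still unmapped. */
	if (atomic_compare_exchange_strong(&pte_slot, &expected, my_vram_alloc))
		return true;
	/* Lost the race: another thread mapped the VA first, so give
	 * this thread's VRAM back. */
	vram_free(my_vram_alloc);
	return false;
}
```

With two racing faults the winner's allocation ends up in the slot and exactly one allocation is freed; crucially, each thread only ever initialised folio metadata for its *own* allocation, which is the point Alistair is making.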
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 1:35 ` Alistair Popple @ 2026-01-13 1:40 ` Matthew Brost 2026-01-13 2:06 ` Alistair Popple 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-13 1:40 UTC (permalink / raw) To: Alistair Popple Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote: > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. > > > > > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > > > subsequent allocations. 
> > > > > > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > > > Cc: linux-mm@kvack.org > > > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > > > --- > > > > > > > > include/linux/memremap.h | 1 + > > > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > > > --- a/include/linux/memremap.h > > > > > > > > +++ b/include/linux/memremap.h > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > > > void memunmap_pages(struct 
dev_pagemap *pgmap); > > > > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > > > --- a/mm/memremap.c > > > > > > > > +++ b/mm/memremap.c > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > > > } > > > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > > > > > +/** > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > > > + * > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > > > + * > > > > > > > > + * This helper: > > > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). > > > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > > > + * * clears ->mapping > > > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > > > + * > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > > > + * required. > > > > > > > > + */ > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > > > freed, from the core-mm perspective the folio is already free. 
This is just > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > > > actually involve just recreating a compound folio. > > > > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > > > > > zone_device_folio_reinitialise - that works for me... but it seems like > > > > everyone has an opinion. > > > > > > Well of course :) There are only two hard problems in programming and > > > I forget the other one. But I didn't want to just say I don't like > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open > > > to others. > > > > > > > zone_device_folio_reinitialise is good with me. > > > > > > > > > > > roll this into a zone_device_folio_init() type function (similar to > > > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > > > > > > > I don’t think doing this at allocation actually works without a big lock > > > > per pgmap. Consider the case where a VRAM allocator allocates two > > > > distinct subsets of a large folio and you have a multi-threaded GPU page > > > > fault handler (Xe does). It’s possible two threads could call > > > > zone_device_folio_reinitialise at the same time, racing and causing all > > > > sorts of issues. My plan is to just call this function in the driver’s > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool. > > > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM > > > intimately) - the folio metadata initialisation should only happen after the > > > VRAM allocation has occurred. > > > > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM > > > physical range you just initialise the folio/pages associated with that range > > > with zone_device_folio_(re)initialise() and you're done.
> > > > > > > Our VRAM allocator does have locking (via DRM buddy), but that layer > > I mean I assumed it did :-) > > > doesn’t have visibility into the folio or its pages. By the time we > > handle the folio/pages in the GPU fault handler, there are no global > > locks preventing two GPU faults from each having, say, 16 pages from the > > same order-9 folio. I believe if both threads call > > zone_device_folio_reinitialise/init at the same time, bad things could > > happen. > > This is confusing to me. If you are getting a GPU fault it implies no page is > mapped at a particular virtual address. The normal process (or at least the > process I'm familiar with) for handling this is to allocate and map a page at > the faulting virtual address. So in the scenario of two GPUs faulting on the > same VA each thread would allocate VRAM using DRM buddy, presumably getting Different VAs. > different physical pages, and so the zone_device_folio_init() call would be to Yes, different physical pages but same folio which is possible if it hasn't been split yet (i.e., both threads have a different subset of pages in the same folio, try to split it at the same time, and boom, something bad happens). > different folios/pages. > > Then eventually one thread would succeed in creating the mapping from VA->VRAM > and the losing thread would free the VRAM allocation back to DRM buddy. > > So I'm a bit confused by the above statement that two GPU faults could each > have the same pages or be calling zone_device_folio_init() on the same pages. > How would that happen? > See above. I hope my above statements make this clear. Matt > > > Is the concern that reinitialisation would touch pages outside of the allocated > > > VRAM range if it was previously a large folio? > > > > No, just two threads calling zone_device_folio_reinitialise/init at the same > > time, on the same folio. > > > > If we call zone_device_folio_reinitialise in ->folio_free this problem > > goes away.
We could solve this with split_lock or something but I'd > > prefer not to add a lock for this (although some of the prior revs did do > > this, maybe we will revisit this later). > > > > Anyways - this falls in driver detail / choice IMO. > > Agreed. > > - Alistair > > > Matt > > > > > > > > > > > > > +{ > > > > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > > > > + int order, i; > > > > > > > > + > > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > > > > + > > > > > > > > + folio->mapping = NULL; > > > > > > > > + order = folio_order(folio); > > > > > > > > + if (!order) > > > > > > > > + return; > > > > > > > > + > > > > > > > > + folio_reset_order(folio); > > > > > > > > + > > > > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > > > > + struct page *page = folio_page(folio, i); > > > > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > > > > + > > > > > > > > + ClearPageHead(page); > > > > > > > > + clear_compound_head(page); > > > > > > > > + > > > > > > > > + new_folio->mapping = NULL; > > > > > > > > + /* > > > > > > > > + * Reset pgmap which was over-written by > > > > > > > > + * prep_compound_page(). > > > > > > > > + */ > > > > > > > > + new_folio->pgmap = pgmap; > > > > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > > > > would make my implementation explode. I explained this in detail here [1] > > > > > > to Zi.
> > > > > > > > > > > > To recap [1], my memory allocator has no visibility into individual > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > > > > allows VRAM to be allocated or evicted for both traditional GPU > > > > > > allocations (GEMs) and SVM allocations. > > > > > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > > > > and are later reallocated with a different order in > > > > > > zone_device_page_init, the implementation breaks. This problem is not > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > > > > it works by coincidence. Reallocating at a different order is valid > > > > > > behavior and must be supported. > > > > > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > > > > always just (re)allocate at the same order and not worry about having to > > > > > reinitialise things to different orders. > > > > > > > > > > > > > I would agree with this statement too — it’s perfectly valid if a driver > > > > always wants to (re)allocate at the same order. 
> > > > > > > > Matt > > > > > > > > > - Alistair > > > > > > > > > > > Matt > > > > > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > > > > > > > + } > > > > > > > > +} > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > > > > + > > > > > > > > void free_zone_device_folio(struct folio *folio) > > > > > > > > { > > > > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > > > > case MEMORY_DEVICE_COHERENT: > > > > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > > > > break; > > > > > > > > + free_zone_device_folio_prepare(folio); > > > > > > > > pgmap->ops->folio_free(folio, order); > > > > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > > > > break; > > > > > > > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
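[Editorial note: the reset sequence the helper performs can be sketched as a small userspace model. The toy_page and toy_pgmap types below are stand-ins for struct page and struct dev_pagemap, modeling only the fields the helper touches; this is an illustration of the idea, not the kernel code.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for struct dev_pagemap and struct page (toy model). */
struct toy_pgmap {
	int id;
};

struct toy_page {
	struct toy_pgmap *pgmap;        /* overwritten by compound setup */
	void *mapping;
	int is_head;                    /* models PageHead */
	struct toy_page *compound_head; /* NULL when standalone */
	unsigned long share;            /* fsdax only */
};

/*
 * Model of the reset loop: restore every constituent page of an
 * order-N folio to a standalone page with a valid pgmap and no
 * compound state.
 */
static void toy_folio_prepare(struct toy_page *head, unsigned int order,
			      struct toy_pgmap *pgmap)
{
	unsigned long i;

	head->mapping = NULL;
	if (!order)
		return; /* order-0: clearing the mapping is enough */

	for (i = 0; i < (1UL << order); i++) {
		struct toy_page *p = &head[i];

		p->is_head = 0;
		p->compound_head = NULL;
		p->mapping = NULL;
		p->pgmap = pgmap; /* re-point at the device pagemap */
		p->share = 0;
	}
}
```

[The real helper additionally goes through folio_reset_order() and the ClearPageHead()/clear_compound_head() primitives shown in the patch.]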
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 1:40 ` Matthew Brost @ 2026-01-13 2:06 ` Alistair Popple 2026-01-13 2:16 ` Matthew Brost 0 siblings, 1 reply; 39+ messages in thread From: Alistair Popple @ 2026-01-13 2:06 UTC (permalink / raw) To: Matthew Brost Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On 2026-01-13 at 12:40 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote: > > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. 
> > > > > > > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > > > > subsequent allocations. > > > > > > > > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > > > > Cc: linux-mm@kvack.org > > > > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > > > > --- > > > > > > > > > include/linux/memremap.h | 1 + > > > > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > > > > --- a/include/linux/memremap.h > > > > > > > > > +++ b/include/linux/memremap.h > > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page) > > > > > > > > > > > > > > > > > > #ifdef 
CONFIG_ZONE_DEVICE > > > > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > > > > --- a/mm/memremap.c > > > > > > > > > +++ b/mm/memremap.c > > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > > > > + * > > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > > > > + * > > > > > > > > > + * This helper: > > > > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). 
> > > > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > > > > + * * clears ->mapping > > > > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > > > > + * > > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > > > > + * required. > > > > > > > > > + */ > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > > > > freed, from the core-mm perspective the folio is already free. This is just > > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > > > > actually involve just recreating a compound folio. > > > > > > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > > > > > > > zone_device_folio_reinitialise - that works for me... but seem like > > > > > everyone has a opinion. > > > > > > > > Well of course :) There are only two hard problems in programming and > > > > I forget the other one. But I didn't want to just say I don't like > > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open > > > > to others. > > > > > > > > > > zone_device_folio_reinitialise is good with me. > > > > > > > > > > > > > > roll this into a zone_device_folio_init() type function (similar to > > > > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > > > > > > > > > > I don’t think doing this at allocation actually works without a big lock > > > > > per pgmap. 
Consider the case where a VRAM allocator allocates two > > > > > distinct subsets of a large folio and you have a multi-threaded GPU page > > > > > fault handler (Xe does). It’s possible two threads could call > > > > > zone_device_folio_reinitialise at the same time, racing and causing all > > > > > sorts of issues. My plan is to just call this function in the driver’s > > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool. > > > > > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM > > > > intimately) - the folio metadata initialisation should only happen after the > > > > VRAM allocation has occured. > > > > > > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM > > > > physical range you just initialise the folio/pages associated with that range > > > > with zone_device_folio_(re)initialise() and you're done. > > > > > > > > > > Our VRAM allocator does have locking (via DRM buddy), but that layer > > > > I mean I assumed it did :-) > > > > > doesn’t have visibility into the folio or its pages. By the time we > > > handle the folio/pages in the GPU fault handler, there are no global > > > locks preventing two GPU faults from each having, say, 16 pages from the > > > same order-9 folio. I believe if both threads call > > > zone_device_folio_reinitialise/init at the same time, bad things could > > > happen. > > > > This is confusing to me. If you are getting a GPU fault it implies no page is > > mapped at a particular virtual address. The normal process (or at least the > > process I'm familiar with) for handling this is to allocate and map a page at > > the faulting virtual address. So in the scenario of two GPUs faulting on the > > same VA each thread would allocate VRAM using DRM buddy, presumably getting > > Different VAs. 
> > > > > different physical pages, and so the zone_device_folio_init() call would be to > > Yes, different physical pages but same folio which is possible if it > hasn't been split yet (i.e., both threads have a different subset of pages in > the same folio, try to split it at the same time, and boom, something bad > happens). So is your concern something like this: 1) There is a free folio A of order 9, starting at physical address 0. 2) You have two GPU faults, both call into DRM Buddy to get a 4K page. 3) GPU 1 gets allocated physical address 0 (ie. folio_page(folio_A, 0)) 4) GPU 2 gets allocated physical address 0x1000 (ie. folio_page(folio_A, 1)) 5) Both call zone_device_folio_init() which splits the folio, meaning the previous step would touch folio_page(folio_A, 0) even though that caller was not allocated physical address 0. If that's the concern then what I'm saying (and what I think Jason was getting at) is that (5) above is wrong - the driver doesn't (and shouldn't) update the compound head (ie. folio_page(folio_a, 0)) - zone_device_folio_init() should just overwrite all the metadata in the struct pages it has been allocated. We're not really splitting folios, because it makes no sense to talk of splitting a free folio, which I think is why some core-mm people took notice. Also, it doesn't matter that you are leaving the previous compound head struct pages in some weird state: the core-mm doesn't care about them anymore, and the struct page/folio is only used by core-mm, not drivers. They will get properly (re)initialised when needed for the core-mm in zone_device_folio_init(), which in this case would happen in step 3. - Alistair > > different folios/pages. > > > > Then eventually one thread would succeed in creating the mapping from VA->VRAM > > and the losing thread would free the VRAM allocation back to DRM buddy.
> > > > So I'm a bit confused by the above statement that two GPUs faults could each > > have the same pages or be calling zone_device_folio_init() on the same pages. > > How would that happen? > > > > See above. I hope my above statements make this clear. > > Matt > > > > > Is the concern that reinitialisation would touch pages outside of the allocated > > > > VRAM range if it was previously a large folio? > > > > > > No just two threads call zone_device_folio_reinitialise/init at the same > > > time, on the same folio. > > > > > > If we call zone_device_folio_reinitialise in ->folio_free this problem > > > goes away. We could solve this with split_lock or something but I'd > > > prefer not to add lock for this (although some of prior revs did do > > > this, maybe we will revist this later). > > > > > > Anyways - this falls in driver detail / choice IMO. > > > > Agreed. > > > > - Alistair > > > > > Matt > > > > > > > > > > > > > > > > +{ > > > > > > > > > + struct dev_pagemap *pgmap = page_pgmap(&folio->page); > > > > > > > > > + int order, i; > > > > > > > > > + > > > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio); > > > > > > > > > + > > > > > > > > > + folio->mapping = NULL; > > > > > > > > > + order = folio_order(folio); > > > > > > > > > + if (!order) > > > > > > > > > + return; > > > > > > > > > + > > > > > > > > > + folio_reset_order(folio); > > > > > > > > > + > > > > > > > > > + for (i = 0; i < (1UL << order); i++) { > > > > > > > > > + struct page *page = folio_page(folio, i); > > > > > > > > > + struct folio *new_folio = (struct folio *)page; > > > > > > > > > + > > > > > > > > > + ClearPageHead(page); > > > > > > > > > + clear_compound_head(page); > > > > > > > > > + > > > > > > > > > + new_folio->mapping = NULL; > > > > > > > > > + /* > > > > > > > > > + * Reset pgmap which was over-written by > > > > > > > > > + * prep_compound_page(). 
> > > > > > > > > + */ > > > > > > > > > + new_folio->pgmap = pgmap; > > > > > > > > > + new_folio->share = 0; /* fsdax only, unused for device private */ > > > > > > > > > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio); > > > > > > > > > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio); > > > > > > > > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is > > > > > > > > that PMD_ORDER more frees than we'd like? > > > > > > > > > > > > > > > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that > > > > > > > would make my implementation explode. I explained this in detail here [1] > > > > > > > to Zi. > > > > > > > > > > > > > > To recap [1], my memory allocator has no visibility into individual > > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design > > > > > > > allows VRAM to be allocated or evicted for both traditional GPU > > > > > > > allocations (GEMs) and SVM allocations. > > > > > > > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free > > > > > > > and are later reallocated with a different order in > > > > > > > zone_device_page_init, the implementation breaks. This problem is not > > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so > > > > > > > it works by coincidence. Reallocating at a different order is valid > > > > > > > behavior and must be supported. > > > > > > > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to > > > > > > always just (re)allocate at the same order and not worry about having to > > > > > > reinitialise things to different orders. > > > > > > > > > > > > > > > > I would agree with this statement too — it’s perfectly valid if a driver > > > > > always wants to (re)allocate at the same order. 
> > > > > > > > > > Matt > > > > > > > > > > > - Alistair > > > > > > > > > > > > > Matt > > > > > > > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413 > > > > > > > > > > > > > > > > + } > > > > > > > > > +} > > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare); > > > > > > > > > + > > > > > > > > > void free_zone_device_folio(struct folio *folio) > > > > > > > > > { > > > > > > > > > struct dev_pagemap *pgmap = folio->pgmap; > > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio) > > > > > > > > > case MEMORY_DEVICE_COHERENT: > > > > > > > > > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free)) > > > > > > > > > break; > > > > > > > > > + free_zone_device_folio_prepare(folio); > > > > > > > > > pgmap->ops->folio_free(folio, order); > > > > > > > > > percpu_ref_put_many(&folio->pgmap->ref, nr); > > > > > > > > > break; > > > > > > > > > > > > > > > > Balbir ^ permalink raw reply [flat|nested] 39+ messages in thread
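[Editorial note: the alternative Alistair sketches in steps 1-5 can be modeled in userspace too: if the allocation-time init only ever writes the struct page metadata of the range it was handed, two faults carving disjoint 4K allocations out of the same order-9 folio never write the same page, and the stale compound head is simply left alone. A single-threaded toy sketch with assumed types, not kernel code:]

```c
#include <assert.h>

/* An order-9 "folio" is just a 512-entry array of toy pages here. */
#define TOY_FOLIO_ORDER 9
#define TOY_FOLIO_PAGES (1UL << TOY_FOLIO_ORDER)

struct toy_page {
	int owner;       /* which fault handler wrote this page's metadata */
	int initialized;
};

static struct toy_page toy_folio_a[TOY_FOLIO_PAGES];

/*
 * Allocation-side init: unconditionally overwrite the metadata of the
 * pages in [first, first + (1 << order)) and nothing else, so the old
 * compound head (toy_folio_a[0]) is never touched unless it is part
 * of this allocation.
 */
static void toy_folio_init(unsigned long first, unsigned int order, int owner)
{
	unsigned long i;

	for (i = first; i < first + (1UL << order); i++) {
		toy_folio_a[i].owner = owner;
		toy_folio_a[i].initialized = 1;
	}
}
```

[Under this model the disjointness of the DRM Buddy allocations is what guarantees race freedom: no extra lock, and no "split" of the free folio.]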
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper 2026-01-13 2:06 ` Alistair Popple @ 2026-01-13 2:16 ` Matthew Brost 2026-01-13 2:31 ` Alistair Popple 0 siblings, 1 reply; 39+ messages in thread From: Matthew Brost @ 2026-01-13 2:16 UTC (permalink / raw) To: Alistair Popple Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-cxl, linux-kernel On Tue, Jan 13, 2026 at 01:06:02PM +1100, Alistair Popple wrote: > On 2026-01-13 at 12:40 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote: > > > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote: > > > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote: > > > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote... > > > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote: > > > > > > > > > On 1/12/26 06:55, Francois Dugast wrote: > > > > > > > > > > From: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large > > > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them. > > > > > > > > > > > > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and > > > > > > > > > > compound metadata). Before returning such pages to the device pgmap > > > > > > > > > > allocator, each constituent page must be reset to a standalone > > > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state. 
> > > > > > > > > > > > > > > > > > > > Use this helper prior to folio_free() for device-private and > > > > > > > > > > device-coherent folios to ensure consistent device page state for > > > > > > > > > > subsequent allocations. > > > > > > > > > > > > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") > > > > > > > > > > Cc: Zi Yan <ziy@nvidia.com> > > > > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com> > > > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org> > > > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > > > > > > Cc: Michal Hocko <mhocko@suse.com> > > > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com> > > > > > > > > > > Cc: linux-mm@kvack.org > > > > > > > > > > Cc: linux-cxl@vger.kernel.org > > > > > > > > > > Cc: linux-kernel@vger.kernel.org > > > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com> > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com> > > > > > > > > > > --- > > > > > > > > > > include/linux/memremap.h | 1 + > > > > > > > > > > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++ > > > > > > > > > > 2 files changed, 56 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644 > > > > > > > > > > --- a/include/linux/memremap.h > > > > > > > > > > +++ b/include/linux/memremap.h > > > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const 
struct page *page) > > > > > > > > > > > > > > > > > > > > #ifdef CONFIG_ZONE_DEVICE > > > > > > > > > > void zone_device_page_init(struct page *page, unsigned int order); > > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio); > > > > > > > > > > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > > > > > > > > > > void memunmap_pages(struct dev_pagemap *pgmap); > > > > > > > > > > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > > > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c > > > > > > > > > > index 39dc4bd190d0..375a61e18858 100644 > > > > > > > > > > --- a/mm/memremap.c > > > > > > > > > > +++ b/mm/memremap.c > > > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn) > > > > > > > > > > } > > > > > > > > > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing. > > > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release. > > > > > > > > > > + * > > > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages) > > > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages > > > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released. > > > > > > > > > > + * > > > > > > > > > > + * This helper: > > > > > > > > > > + * - Clears @folio->mapping and, for compound folios, clears each page's > > > > > > > > > > + * compound-head state (ClearPageHead()/clear_compound_head()). 
> > > > > > > > > > + * - Resets the compound order metadata (folio_reset_order()) and then > > > > > > > > > > + * initializes each constituent page as a standalone ZONE_DEVICE folio: > > > > > > > > > > + * * clears ->mapping > > > > > > > > > > + * * restores ->pgmap (prep_compound_page() overwrites it) > > > > > > > > > > + * * clears ->share (only relevant for fsdax; unused for device-private) > > > > > > > > > > + * > > > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is > > > > > > > > > > + * required. > > > > > > > > > > + */ > > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio) > > > > > > > > > > > > > > I don't really like the naming here - we're not preparing a folio to be > > > > > > > freed, from the core-mm perspective the folio is already free. This is just > > > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may > > > > > > > actually involve just recreating a compound folio. > > > > > > > > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to > > > > > > > > > > > > zone_device_folio_reinitialise - that works for me... but seem like > > > > > > everyone has a opinion. > > > > > > > > > > Well of course :) There are only two hard problems in programming and > > > > > I forget the other one. But I didn't want to just say I don't like > > > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open > > > > > to others. > > > > > > > > > > > > > zone_device_folio_reinitialise is good with me. > > > > > > > > > > > > > > > > > roll this into a zone_device_folio_init() type function (similar to > > > > > > > zone_device_page_init()) that just deals with everything at allocation time? > > > > > > > > > > > > > > > > > > > I don’t think doing this at allocation actually works without a big lock > > > > > > per pgmap. 
Consider the case where a VRAM allocator allocates two
> > > > > > distinct subsets of a large folio and you have a multi-threaded GPU page
> > > > > > fault handler (Xe does). It's possible two threads could call
> > > > > > zone_device_folio_reinitialise at the same time, racing and causing all
> > > > > > sorts of issues. My plan is to just call this function in the driver's
> > > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> > > > >
> > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> > > > > intimately) - the folio metadata initialisation should only happen after the
> > > > > VRAM allocation has occurred.
> > > > >
> > > > > IOW the VRAM allocator needs to deal with the locking; once you have the VRAM
> > > > > physical range you just initialise the folio/pages associated with that range
> > > > > with zone_device_folio_(re)initialise() and you're done.
> > > >
> > > > Our VRAM allocator does have locking (via DRM buddy), but that layer
> > >
> > > I mean I assumed it did :-)
> > >
> > > > doesn't have visibility into the folio or its pages. By the time we
> > > > handle the folio/pages in the GPU fault handler, there are no global
> > > > locks preventing two GPU faults from each having, say, 16 pages from the
> > > > same order-9 folio. I believe if both threads call
> > > > zone_device_folio_reinitialise/init at the same time, bad things could
> > > > happen.
> > >
> > > This is confusing to me. If you are getting a GPU fault it implies no page is
> > > mapped at a particular virtual address. The normal process (or at least the
> > > process I'm familiar with) for handling this is to allocate and map a page at
> > > the faulting virtual address. So in the scenario of two GPUs faulting on the
> > > same VA each thread would allocate VRAM using DRM buddy, presumably getting
> >
> > Different VAs.
> > > different physical pages, and so the zone_device_folio_init() call would be to
> >
> > Yes, different physical pages but same folio, which is possible if it
> > hasn't been split yet (i.e., both threads own a different subset of pages in
> > the same folio, try to split at the same time, and boom, something bad
> > happens).
>
> So is your concern something like this:
>
> 1) There is a free folio A of order 9, starting at physical address 0.
> 2) You have two GPU faults, both call into DRM Buddy to get a 4K page.
> 3) GPU 1 gets allocated physical address 0 (ie. folio_page(folio_A, 0))
> 4) GPU 2 gets allocated physical address 0x1000 (ie. folio_page(folio_A, 1))
> 5) Both call zone_device_folio_init() which splits the folio, meaning the
>    previous step would touch folio_page(folio_A, 0) even though it has not been
>    allocated physical address 0.

Yes.

> If that's the concern then what I'm saying (and what I think Jason was getting
> at) is that (5) above is wrong - the driver doesn't (and shouldn't) update the
> compound head (ie. folio_page(folio_A, 0)) - zone_device_folio_init() should
> just overwrite all the metadata in the struct pages it has been allocated. We're
> not really splitting folios, because it makes no sense to talk of splitting a
> free folio, which I think is why some core-mm people took notice.
>
> Also, it doesn't matter that you are leaving the previous compound head struct
> pages in some weird state; the core-mm doesn't care about them anymore and the
> struct page/folio is only used by core-mm, not drivers. They will get properly
> (re)initialised when needed for the core-mm in zone_device_folio_init(), which in
> this case would happen in step 3.

Something like this should work too. I started implementing it on my
side earlier today, but of course, I was hitting hangs. From an API
point of view, zone_device_folio_init would need to be updated to accept
a pgmap argument.
In this example, folio_page(folio_A, 1) wouldn't have
a valid pgmap to retrieve. It could look at the folio's pgmap, but that
also seems like it could race under the right conditions.

Let me see what this looks like and whether I can get it working.

Matt

> - Alistair
>
> > > different folios/pages.
> > >
> > > Then eventually one thread would succeed in creating the mapping from VA->VRAM
> > > and the losing thread would free the VRAM allocation back to DRM buddy.
> > >
> > > So I'm a bit confused by the above statement that two GPU faults could each
> > > have the same pages or be calling zone_device_folio_init() on the same pages.
> > > How would that happen?
> >
> > See above. I hope my above statements make this clear.
> >
> > Matt
> >
> > > > > Is the concern that reinitialisation would touch pages outside of the allocated
> > > > > VRAM range if it was previously a large folio?
> > > >
> > > > No, just two threads call zone_device_folio_reinitialise/init at the same
> > > > time, on the same folio.
> > > >
> > > > If we call zone_device_folio_reinitialise in ->folio_free this problem
> > > > goes away. We could solve this with split_lock or something but I'd
> > > > prefer not to add a lock for this (although some prior revs did do
> > > > this; maybe we will revisit this later).
> > > >
> > > > Anyways - this falls in driver detail / choice IMO.
> > >
> > > Agreed.
> > >
> > > - Alistair
> > >
> > > > Matt
> > > >
> > > > > > > > > > +{
> > > > > > > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > > > > > > +	int order, i;
> > > > > > > > > > +
> > > > > > > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > > > > > > +
> > > > > > > > > > +	folio->mapping = NULL;
> > > > > > > > > > +	order = folio_order(folio);
> > > > > > > > > > +	if (!order)
> > > > > > > > > > +		return;
> > > > > > > > > > +
> > > > > > > > > > +	folio_reset_order(folio);
> > > > > > > > > > +
> > > > > > > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > > > > > > +		struct page *page = folio_page(folio, i);
> > > > > > > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > > > > > > +
> > > > > > > > > > +		ClearPageHead(page);
> > > > > > > > > > +		clear_compound_head(page);
> > > > > > > > > > +
> > > > > > > > > > +		new_folio->mapping = NULL;
> > > > > > > > > > +		/*
> > > > > > > > > > +		 * Reset pgmap which was over-written by
> > > > > > > > > > +		 * prep_compound_page().
> > > > > > > > > > +		 */
> > > > > > > > > > +		new_folio->pgmap = pgmap;
> > > > > > > > > > +		new_folio->share = 0; /* fsdax only, unused for device private */
> > > > > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > > > >
> > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > > > > > > that PMD_ORDER more frees than we'd like?
> > > > > > > >
> > > > > > > > No, calling free_folio() more often doesn't solve anything—in fact, that
> > > > > > > > would make my implementation explode. I explained this in detail here [1]
> > > > > > > > to Zi.
> > > > > > > > To recap [1], my memory allocator has no visibility into individual
> > > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > > > > > > allows VRAM to be allocated or evicted for both traditional GPU
> > > > > > > > allocations (GEMs) and SVM allocations.
> > > > > > > >
> > > > > > > > Now, to recap the actual issue: if device folios are not split upon free
> > > > > > > > and are later reallocated with a different order in
> > > > > > > > zone_device_page_init, the implementation breaks. This problem is not
> > > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > > > > > > it works by coincidence. Reallocating at a different order is valid
> > > > > > > > behavior and must be supported.
> > > > > > >
> > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to
> > > > > > > always just (re)allocate at the same order and not worry about having to
> > > > > > > reinitialise things to different orders.
> > > > > >
> > > > > > I would agree with this statement too — it's perfectly valid if a driver
> > > > > > always wants to (re)allocate at the same order.
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > - Alistair
> > > > > > >
> > > > > > > > Matt
> > > > > > > >
> > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > > > > > >
> > > > > > > > > > +	}
> > > > > > > > > > +}
> > > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > > > > > > > > > +
> > > > > > > > > >  void free_zone_device_folio(struct folio *folio)
> > > > > > > > > >  {
> > > > > > > > > >  	struct dev_pagemap *pgmap = folio->pgmap;
> > > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > > > > > > > > >  	case MEMORY_DEVICE_COHERENT:
> > > > > > > > > >  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > > > > > > > > >  			break;
> > > > > > > > > > +		free_zone_device_folio_prepare(folio);
> > > > > > > > > >  		pgmap->ops->folio_free(folio, order);
> > > > > > > > > >  		percpu_ref_put_many(&folio->pgmap->ref, nr);
> > > > > > > > > >  		break;
> > > > > > > > >
> > > > > > > > > Balbir
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
  2026-01-13  2:16 ` Matthew Brost
@ 2026-01-13  2:31 ` Alistair Popple
  0 siblings, 0 replies; 39+ messages in thread
From: Alistair Popple @ 2026-01-13 2:31 UTC (permalink / raw)
To: Matthew Brost
Cc: Balbir Singh, Francois Dugast, intel-xe, dri-devel, Zi Yan,
    David Hildenbrand, Oscar Salvador, Andrew Morton, Lorenzo Stoakes,
    Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, linux-mm, linux-cxl, linux-kernel

On 2026-01-13 at 13:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...

[snip]

> Something like this should work too. I started implementing it on my
> side earlier today, but of course, I was hitting hangs. From an API
> point of view, zone_device_folio_init would need to be updated to accept
> a pgmap argument. In this example, folio_page(folio_A, 1) wouldn't have
> a valid pgmap to retrieve. It could look at the folio's pgmap, but that
> also seems like it could race under the right conditions.

I think passing a pgmap argument in would be fine - it allows us to maintain
the concept that zone_device_folio_init() does exactly what it says on the
tin. That is, it initialises a ZONE_DEVICE folio ready for use by the core-mm
without placing any assumptions or restrictions on the current state of the
folio/page structs.

> Let me see what this looks like and whether I can get it working.
>
> Matt

 - Alistair
Thread overview: 39+ messages
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
2026-01-11 22:35   ` Matthew Wilcox
2026-01-12  0:19     ` Balbir Singh
2026-01-12  0:51       ` Zi Yan
2026-01-12  1:37         ` Matthew Brost
2026-01-12  4:50           ` Balbir Singh
2026-01-12 13:45             ` Jason Gunthorpe
2026-01-12 16:31               ` Zi Yan
2026-01-12 16:50                 ` Jason Gunthorpe
2026-01-12 17:46                   ` Zi Yan
2026-01-12 18:25                     ` Jason Gunthorpe
2026-01-12 18:55                       ` Zi Yan
2026-01-12 19:28                         ` Jason Gunthorpe
2026-01-12 23:34                           ` Zi Yan
2026-01-12 23:53                             ` Jason Gunthorpe
2026-01-13  0:35                               ` Zi Yan
2026-01-12 23:07                           ` Matthew Brost
2026-01-12 21:49                 ` Matthew Brost
2026-01-12 23:15                   ` Zi Yan
2026-01-12 23:22                     ` Matthew Brost
2026-01-12 23:44                       ` Alistair Popple
2026-01-12 23:54                         ` Jason Gunthorpe
2026-01-12 23:31                     ` Jason Gunthorpe
2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
2026-01-12  0:44   ` Balbir Singh
2026-01-12  1:16     ` Matthew Brost
2026-01-12  2:15       ` Balbir Singh
2026-01-12  2:37         ` Matthew Brost
2026-01-12  2:50           ` Matthew Brost
2026-01-12 23:58             ` Alistair Popple
2026-01-13  0:23               ` Matthew Brost
2026-01-13  0:43                 ` Alistair Popple
2026-01-13  1:07                   ` Matthew Brost
2026-01-13  1:35                     ` Alistair Popple
2026-01-13  1:40                       ` Matthew Brost
2026-01-13  2:06                         ` Alistair Popple
2026-01-13  2:16                           ` Matthew Brost
2026-01-13  2:31                             ` Alistair Popple